Overall Star Ratings – Challenges to Credibility: New Insights

Omar Lateef Bala Hota Thomas Webb
Tracey Hoke Stephen Weber Russell Howerton

Executive Summary

This post will address multiple concerns about the Feb 2019 release of CMS’ Overall Rating based on the following points:

1. Patients with frequent readmissions, though rare, disproportionately affect the readmission score and hospital star rating.
2. Readmission scores are adjusted for hospital volume.  This adversely impacts the scores for some large hospitals.
3. Socioeconomic status is not adjusted for in the Star rating, but is adjusted for in the HRRP.  This adversely affects urban hospitals.
4. The use of a Latent Variable Model in the Star ratings introduces variability and inconsistency, making changes in rating hard to interpret.

We believe the Overall Rating should be held until these concerns can be addressed.


We have crowdsourced data from multiple hospitals and worked with health care quality leaders from around the country.  We have openly presented and shared these findings directly with stakeholders on request.  We are posting this in the hope of starting a respectful discussion around creating a fair, transparent and easy to understand ranking of hospitals that makes sense to consumers and providers.  We believe the current system, as you will read below, is exceptionally complex.  With complexity often come unintended consequences.  We are hopeful that a conversation can be had to foster continued improvement of our ranking systems.  This is extremely important, as physicians are being judged and society is drawing conclusions from those judgments that we do not believe are accurate. While we primarily used data from Rush University at the heart of this analysis, we worked with colleagues from the University of Chicago, University of Virginia, and Wake Forest University to better understand the impact of this data.

As health care workers, we view quality care as a promise – to patients, to family of patients, and to the community.  In this, we share a common goal with all participants in our healthcare system.  At the federal level, many talented researchers and policy development leaders have designed systems to incentivize high quality care which contributes to a shared goal of a high-value healthcare system.  At Rush University, we have sought to understand the connection of policy to the care we provide to our patients.  We have found in our analyses that some unintended consequences may be resulting from the current national policies to measure healthcare quality.  These findings align with some of the recent public debate over increased mortality being linked to readmission reduction programs.  In our view, we are at a critical juncture in how we view hospital quality rating, and have a terrific opportunity to improve the way we measure hospital quality.

In this post, we will describe issues with the current CMS approach to measurement of hospital quality of care, as described by the CMS Stars rating and the Hospital Readmissions Reduction Program (HRRP).  These issues arise from:

  1. Outlier patients, with frequent readmissions
  2. Adjustment of readmission scores based on hospital volume, and star rating effect
  3. Socioeconomic status adjustment
  4. Variability in ratings due to the Latent variable model.

1. Patients with frequent readmissions, though rare, disproportionately affect the readmission score and hospital star rating.

The Readmission Domain in CMS’ Overall Rating accounts for 22% of the total score. Although nine measures were evaluated by the Latent Variable Model, only one was chosen by the model to calculate this portion of the Overall Rating: the Hospital-Wide All-Cause Unplanned Readmission (HWR) measure.  Table 1, from CMS’ Hospital Specific Report, confirms that the Loading Coefficient determined by the Latent Variable Model for HWR is 1.0, meaning HWR is perfectly correlated with the Readmission Domain score. This is further supported by Chart 1.

Table 1. Loading Coefficients for Readmission Domain – Feb 2019 Release

Table obtained from the Feb 2019 Overall Rating Hospital Specific Report.

Chart 1. Correlation between Readmission Domain Score and HWR Measure – Feb 2019 Release

Data from 20 Hospital Specific Reports confirm the perfectly linear relationship identified from the loading coefficients between the Readmission Domain score and the HWR measure.

Rush University Medical Center (RUMC), a tertiary care program, accepts complex, critically ill patients. Many times, the patients are referred to our hospital for a higher level of care. Accepting and treating these acuity outliers puts RUMC, and hospitals like RUMC, at risk of lower performance in the HWR measure and the Overall Rating.

Chart 2. Histogram of Patients by Number of Readmissions

This histogram shows the distribution of patients by number of readmissions during the period of July 2016 through June 2017. Four (4) patients accounted for 36 total 30-day readmissions.

Without these four patients, RUMC’s raw (unadjusted) HWR would drop from 17.3% to 16.9%, enough to change RUMC from a 4-star to a 5-star hospital in the Feb 2019 release, assuming the Dec 2017 cutoffs still apply.
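The arithmetic behind this kind of shift can be sketched as follows. This is only an illustration: the total counts are hypothetical (they are not RUMC's actual data and were chosen only so that the rates land near the 17.3% and 16.9% quoted above), and the number of index admissions attributed to the four patients is an assumption. Only the figure of 36 readmissions comes from the text.

```python
def readmission_rate(readmissions: int, index_admissions: int) -> float:
    """Raw (unadjusted) 30-day readmission rate."""
    return readmissions / index_admissions


# HYPOTHETICAL totals, not RUMC's actual data; chosen only so the rates
# land near the 17.3% and 16.9% quoted in the text.
total_index_admissions = 7000
total_readmissions = 1211

# From the text: four patients accounted for 36 readmissions.
outlier_readmissions = 36
# ASSUMPTION: index admissions attributed to those four patients.
outlier_index_admissions = 40

rate_before = readmission_rate(total_readmissions, total_index_admissions)
rate_after = readmission_rate(
    total_readmissions - outlier_readmissions,
    total_index_admissions - outlier_index_admissions,
)

print(f"with outliers:    {rate_before:.1%}")
print(f"without outliers: {rate_after:.1%}")
```

The point of the sketch is that a handful of patients out of several thousand admissions can move the raw rate by a few tenths of a percentage point, which is the scale at which star cutoffs operate.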

Patient Profiles

Patient 1: Decompensated liver transplant candidate who did not make it to transplant. Managed complications of recurrent bleeding that could only be treated with transplant. Clinically reviewed readmissions as unavoidable.

Patient 2: Routinely misses dialysis and comes to ED when confused. Readmitted for HD and management of renal encephalopathy that resolves after HD. Clinically reviewed readmissions as unavoidable.

Patient 3: Patient with suprapubic catheter, recurrent UTIs, ulcers non-healing. Clinically reviewed readmissions as unavoidable.

Patient 4: Patient with end stage renal disease and NO access obtainable at outside hospitals, transferred and managed with a Hero catheter requiring multiple hospitalizations to maintain graft. Clinically reviewed readmissions as unavoidable.

  • The Readmission Domain is linked to the Hospital Wide Readmission (HWR) measure exclusively. For tertiary care centers, the treatment of high acuity outliers, which are not excluded from HWR, can negatively impact performance relative to centers with lower acuity.

2. Readmission scores are adjusted for hospital volume.  This adversely impacts the scores for some large hospitals.

The use of Hierarchical Logistic Regression Models for mortality, readmissions, and complications and PSI-90 reliability adjustment adversely impacts rankings of large vs small hospitals.

It has been previously shown that volume adjustment leads to lower thresholds for reporting poor performance for larger hospitals (1,2).

  1. Sosunov EA, Egorova NN, Lin H-M, McCardle K, Sharma V, Gelijns AC, et al. The Impact of Hospital Size on CMS Hospital Profiling. Med Care. 2016 Apr 1;54(4):373–9.
  2. Joynt KE, Jha AK. Characteristics of Hospitals Receiving Penalties Under the Hospital Readmissions Reduction Program. JAMA. 2013 Jan 23;309(4):342.

Volume adjustment is employed by HRRP as a strategy to minimize the effect of variability seen in low volume centers.  This approach, also called “shrinkage”, is a well-accepted way to reduce the chance that identified outliers are simply the result of variability due to low case volumes.  There is a difference, however, between adjusting for volume to detect true poor performers, which is the objective of the HRRP, and ranking based on the resulting scores, which is the goal of the stars program.

Charts 3a-3e (Appendix) show a varying linear relationship between CMS corrected readmission rates and raw readmission rates depending on hospital size.

In an attempt to adjust results for statistical variability in small volumes, the corrections made by the Hierarchical Logistic Regression Models have unintended and confusing consequences.  By adjusting for low volume in the measures, low volume hospitals, as a group, are adjusted toward the mean, displacing high volume hospitals to the high and low extremes. What is counterintuitive is that low volumes are typically associated with poorer outcomes in the medical literature.  As shown below, when comparing low and high volume centers, the lower volume center with a worse raw 30-day readmission rate is ultimately rated higher than a high volume center with a better raw 30-day readmission rate.
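A minimal sketch of how shrinkage can reverse a ranking is shown here. It uses a simple weighted average toward the overall mean, which is a simplified stand-in and not CMS' Hierarchical Logistic Regression Model. The overall mean, the `PRIOR_STRENGTH` constant, the case counts, and the large hospital's raw rate are all hypothetical; only the 43.2% raw rate echoes the small-hospital figure discussed in this post.

```python
def shrink(raw_rate: float, cases: int, overall_mean: float,
           prior_strength: float) -> float:
    """Pull a raw rate toward the overall mean; fewer cases means more pull.

    Simplified illustration only, NOT the CMS hierarchical model.
    """
    weight = cases / (cases + prior_strength)
    return weight * raw_rate + (1 - weight) * overall_mean


# HYPOTHETICAL parameters chosen for illustration.
OVERALL_MEAN = 0.22      # assumed average readmission rate
PRIOR_STRENGTH = 100     # assumed case count at which weight reaches 0.5

# Small hospital: raw rate from the text, case count is an assumption.
small_raw, small_cases = 0.432, 25
# Large hospital: both values are hypothetical.
large_raw, large_cases = 0.28, 900

small_adjusted = shrink(small_raw, small_cases, OVERALL_MEAN, PRIOR_STRENGTH)
large_adjusted = shrink(large_raw, large_cases, OVERALL_MEAN, PRIOR_STRENGTH)

print(f"small: raw {small_raw:.1%} -> adjusted {small_adjusted:.1%}")
print(f"large: raw {large_raw:.1%} -> adjusted {large_adjusted:.1%}")
```

Under these assumed values, the small hospital keeps only 20% of the weight on its own raw rate while the large hospital keeps 90%, so the small hospital ends up with the lower adjusted rate despite the much worse raw rate.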

Table 2. Heart Failure Readmission Rates (July 2013 – June 2016)

Data obtained from Hospital Compare files at data.medicare.gov

Despite a 43.2% raw readmission rate, the small hospital in Texas is ranked ahead of large hospitals in Chicago and Detroit for Heart Failure.

Graphic 1. Volume and Acuity Correction of HF 30-Day Readmissions for small hospital in Texas

Table 3. Heart Failure Readmission Rates (July 2013 – June 2016) – Estimated

Data obtained from Hospital Compare files at data.medicare.gov

Excluding the volume correction, the readmission rate of the small hospital in Texas improves while the integrity of the ranking is maintained.  The large hospitals in Chicago and Detroit retain a higher ranking.

While we were unable to test HWR directly due to suppression of actual readmissions, the same model principles are employed in HWR as in Heart Failure. In the Dec 2017 Release, the small hospital in Texas was corrected more than the large hospital in Detroit based on CMS’ adjusted measures, despite the larger hospital having better raw 30-day readmission rates. This results in the large hospital in Detroit receiving a worse Readmission Domain score, as shown in Table 4.

Table 4. Results from Readmission Domain from Dec 2017 Release

Data obtained from Hospital Compare files at data.medicare.gov
* Small Hospital in Texas ranks in the Bottom 1% for HF, Bottom 1% for AMI, and Bottom 35% of PN based on raw readmissions

On a larger scale, the Hierarchical Logistic Regression Model’s impact on ranking can be seen in the following two charts. Smaller hospitals are compressed to the middle and larger hospitals are displaced to the extremes.

Charts 4a-4b. Ranking Adjustments for COPD Readmissions by Hospital Size

Data obtained from Hospital Compare files at data.medicare.gov

Volume adjustment of outcome scores propagates through the entire star system, as these models influence three domains and 66% of the total score.

Table 5 shows that no small hospitals (based on HWR volume) have 1 star and 8% have 2 stars, whereas 37% of large hospitals have 1 or 2 stars.

Table 5. Distribution of Stars by Hospital Size

Star   Large   Medium   Small
1      11%     7%       0%
2      26%     20%      8%
3      25%     33%      44%
4      25%     32%      42%
5      13%     8%       6%
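The grouped shares quoted from Table 5 can be reproduced with a few lines. The sketch below simply transcribes the table's percentages; the variable names are illustrative.

```python
# Percent of hospitals at each star level, transcribed from Table 5.
star_distribution = {
    "Large":  {1: 11, 2: 26, 3: 25, 4: 25, 5: 13},
    "Medium": {1: 7,  2: 20, 3: 33, 4: 32, 5: 8},
    "Small":  {1: 0,  2: 8,  3: 44, 4: 42, 5: 6},
}

# Share of hospitals with 1 or 2 stars, by size group.
low_star_share = {
    size: stars[1] + stars[2] for size, stars in star_distribution.items()
}

# Share of hospitals with 4 or 5 stars, by size group.
high_star_share = {
    size: stars[4] + stars[5] for size, stars in star_distribution.items()
}

for size in star_distribution:
    print(f"{size}: {low_star_share[size]}% low, {high_star_share[size]}% high")
```

Each column of the table sums to 100%, so the low-star and high-star shares are directly comparable across size groups.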

This difference isn’t due to many more large hospitals providing poor quality; it is due to a measurement system that, when used for ranking, creates winners and losers based on size alone.

  • The Overall Rating is heavily based on Hierarchical Logistic Regression Models. These models create bias in results based on hospital size.

3. Socioeconomic status is not adjusted for in the Star rating, but is adjusted for in the HRRP.  This adversely affects urban hospitals.

The association of low socioeconomic status and readmission outcomes has been well established, and many have advocated for adjustment of readmission rates for socioeconomic status(ref 3–6).

The 21st Century Cures Act legislated the requirement of inclusion of socioeconomic status (SES) into the calculation of financial penalties within HRRP.

Bernheim et al (ref 7) showed a statistically significant relationship of socioeconomic factors, such as median income, to readmission rates for AMI, HF, and PN. SES factors had a higher impact than more than one-third of the medical comorbidities included in the readmission models.

3. Boozary AS, Manchin J, Wicker RF. The Medicare Hospital Readmissions Reduction Program: Time for Reform. JAMA. 2015 Jul 28;314(4):347–8.

4. Carey K, Lin M-Y. Hospital Readmissions Reduction Program: Safety-Net Hospitals Show Improvement, Modifications To Penalty Formula Still Needed. Health Affairs. 2016 Oct;35(10):1918–23.

5. Figueroa JF, Joynt KE, Zhou X, Orav EJ, Jha AK. Safety-net Hospitals Face More Barriers Yet Use Fewer Strategies to Reduce Readmissions. Medical Care. 2017 Mar;55(3):229.

6. Refining the hospital readmissions reduction program [Internet]. [cited 2019 Jan 16]. Available from: http://www.medpac.gov/docs/default-source/reports/jun13_ch04.pdf

7. Bernheim SM, Parzynski CS, Horwitz L, Lin Z, Araas MJ, Ross JS, et al. Accounting For Patients’ Socioeconomic Status Does Not Change Hospital Readmission Rates. Health Aff (Millwood). 2016 Aug 1;35(8):1461–70.

The Overall Rating program’s exclusion of SES from the Readmission domain is inconsistent with CMS’ HRRP.

Our own research found that the Summary score of the Dec 2017 Overall Rating had a statistically significant correlation with the proportion of dual eligible patients, data supplied by the HRRP program.

The following are a few examples of Illinois hospitals whose star ratings would change under a socioeconomic status correction based on the proportion of dual eligible patients.

Tables 6a-6b. Changes to Overall Rating from SES Inclusion

* SES Correction would change RUMC’s Feb 2019 preview 4 star to a 5 star
Data obtained from FY2019 IPPS Final Rule Data Tables and Overall Rating SAS code from qualitynet.org
  • Socioeconomic status was legislated to be included when calculating readmission penalties because SES matters. SES impacts outcomes and should be addressed in the Overall Rating model.

4. The use of a Latent Variable Model in the Star ratings introduces variability and inconsistency, making changes in rating hard to interpret.

The Latent Variable Model has created confusion and contradictions in interpretation of a safe hospital. CMS runs three separate programs which evaluate hospital safety: Value Based Purchasing (VBP), Hospital Acquired Condition Reduction Program (HACRP), and Overall Rating.

These three programs largely use the exact same measures, yet there are inconsistent results on which hospitals are safe or not.

Table 7. Safety Measures for CMS Programs

For Overall Ratings, the latent variable model continues to peg PSI-90 as the overwhelming favorite for measuring safety.

Table 8. Loading Factors for Safety Domain by Release

Loading Factors obtained from Hospital Specific Reports

Chart 5. Feb 2019 Safety Domain score vs PSI-90 score

Data from 20 Hospital Specific Reports confirm the perfectly linear relationship identified from the loading coefficients between the Safety Domain score and the PSI-90 score. Hospital Acquired Infection measures have an insignificant effect on the Safety Domain score.

This trend was identified in the Dec 2017 release; however, the LVM switched to THA/TKA Complications during the unreleased Jun 2018 version, but back to PSI-90 for Feb 2019.

Charts 6a and 6b show very little to no correlation between HACRP and the VBP Safety domain from the Dec 2017 release. 284 hospitals received a 1% HACRP payment penalty, yet had above average safety scores in Overall Star Rating.

Charts 6a-6b. Correlation of Overall Rating Safety with HACRP and VBP Safety

Data obtained from data.medicare.gov
  • Inconsistency of safety measurement creates confusion between results of various CMS programs. Patients and hospitals don’t know what to believe as safe.


We believe the overall star rating, at this time, does not achieve the aim of a transparent measure of quality and safety that is easy for consumers and hospital quality leaders to understand.  We also believe that those pushing for a refresh of these measures would rather wait for an accurate measure than receive one so dramatically affected by the mathematical issues described above.  Because of the cumulative effect of biases due to inadequate or inappropriate adjustment for socioeconomic status, hospital size, and outlier patients given heroic care, the star ratings inadvertently penalize large hospitals and academic medical centers.  In academic arguments, these individual effects may be perceived as small.  However, as we and other authors, including Bernheim et al, have described, the effect of socioeconomic status on hospital measures is stronger than that of many chronic disease measures, and may account for more than a quarter of all hospitals changing rating.  Heroic care, as we’ve shown, may adversely impact rating.  Finally, simply being a large hospital may adversely affect rating and may carry a financial penalty.

These issues could be mitigated with four changes to the current star ratings and HRRP programs.  First, aligning the adjustment for socioeconomic status in the Stars program with that of the HRRP would be a logical and consistent method for measuring quality.  Second, capping the impact of volume on adjustment and incorporating confidence intervals would address issues with volume impacting rates.  Third, removal of the impact of outlier readmissions on the readmission measure would eliminate the undue influence of individual patients on rates and, we speculate, reduce the risk of adverse outcomes due to unintended consequences of policy. Finally, abandoning the latent variable model in the composite rating for the Overall Rating would address its lack of consistency.
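As one illustration of what incorporating confidence intervals could look like, the sketch below computes a Wilson score interval for a readmission rate. The choice of the Wilson interval is our assumption for illustration; the recommendation above does not prescribe a specific method. The event and case counts are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(events: int, cases: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if cases <= 0:
        raise ValueError("cases must be positive")
    p = events / cases
    denom = 1 + z**2 / cases
    center = (p + z**2 / (2 * cases)) / denom
    half_width = z * math.sqrt(p * (1 - p) / cases + z**2 / (4 * cases**2)) / denom
    return center - half_width, center + half_width


# HYPOTHETICAL hospitals with the same 20% raw readmission rate.
small_low, small_high = wilson_interval(events=10, cases=50)
large_low, large_high = wilson_interval(events=200, cases=1000)

print(f"small (n=50):   {small_low:.1%} to {small_high:.1%}")
print(f"large (n=1000): {large_low:.1%} to {large_high:.1%}")
```

Reporting the interval alongside the raw rate conveys the uncertainty of a low volume hospital directly, rather than moving its point estimate toward the mean.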

We also believe the time has arrived for 21st century methods to measure quality care.  Tremendous progress in the use of electronic data has enabled high quality information to be captured by our electronic record systems.  Patient access to data has similarly been transformed through the use of standards, like FHIR, and inclusion of these data in our mobile devices like the iPhone.  Patients deserve high quality methods that are not one-size-fits-all, and are personalized and precise.  The next evolution of measurement should be accurate and personalized, guiding patients to the best care possible.  The science behind ranking hospitals and providers against one another is complicated.  We are hopeful that those producing these rankings will listen to the medical community when information is provided, so that misleading findings can be held.  Without correcting for the factors described above, releasing Stars could very well have a detrimental effect on both providers and consumers.

We encourage you to comment below so that we can continue to refine our understanding and insights.


Charts 3a-3e. CMS Readmission vs Raw Readmissions – By Hospital Size – Heart Failure

Data obtained from data.medicare.gov


4 thoughts on “Overall Star Ratings – Challenges to Credibility: New Insights”

  1. What particular methodologies have you considered for the recommendation below? Have you tried volume-specific (or other facility-level factors) shrinkage targets instead of the overall mean? This was proposed in the report linked below.

    “capping the impact of volume on adjustment and incorporating confidence intervals would address issues with volume impacting rates”


    1. Robert, thank you for your comment. The resource that you provided is an excellent read and addresses a number of these issues. We had a few thoughts on ways to do this. First, we were thinking of doing something similar to the new HRRP by creating cohorts/groups based on size. CMS even mentions this in their comment period document. As for confidence intervals, we also thought about what US News tried for Safety scores. Everyone receives 1, 2 or 3 points per measure. Everyone within +/- 2 sigma receives 2 points, and those outside +/- 2 sigma receive 1 or 3 points. We didn’t love the implementation in US News because it basically made the measure meaningless, but it was the start of an idea. This is all about iterating on ideas to make it better.

  2. Reply to CMS request for public input on changing the methodology used to assign Star-Ratings to hospitals.

    • The Star Ratings system is inadequate to the complexity of patient care.
    • The five categories are too few, not specific, probably inflated, and therefore of questionable value.
    • Why would anyone choose to get care at a smaller hospital if it has three stars because it isn’t a big hospital, even though it provides excellent-superior care for what that medical staff routinely manage?
    • Why would anyone trust the care of a tertiary facility that has no choice but to bear the burden of greater numbers of high-risk problems, less-fortunate populations, greater age and infirmity, and therefore statistically poorer outcomes, despite best efforts by all concerned?
    • Worse, this discussion begins “Frequent Readmissions though rare, disproportionately impact profiles”
    • Let’s reward the effectiveness and competency of hospitals based on how well they handle the patient care in their demographic reality.
    • Since the problem is “Rare Readmissions” why not create a separate category for managing those circumstances?
    • Our Armed Forces have learned the uselessness of trying to acknowledge diversity with universal metrics. It is inappropriate, unfair and does great harm.
    • An Aircraft Carrier and an Operational Division are large organizations, they might be equated to the Tertiary Facilities in Healthcare. They are not measured against the capabilities of smaller vessels in the first case and special-forces/small-units-and-tactics in the second case.
    • Diverse, but standardized symbology: Organizational symbols, rank/rating, skill devices, ribbons/medals awarded for deployment and achievement displayed on individual uniforms and/or buildings, naval vessels, and aircraft provide nearly instant insight for any interested observer.
    • If we are going to spend tax-payer dollars for some activity, can we please ask the basic questions about the goals of any program we fund?
    • That includes the basic question about where similar experience might exist in order to save time and money.
    • What is obvious here is that a valid statistical observation is being wrongfully applied to the detriment of hospitals in the eyes of their communities!
    • Hospitals do not practice medicine, even though federal regulations try to force them to do so.
    • They have NO INFLUENCE, NOR SHOULD THEY, over what physicians do for individual patients, unless there is evidence of misbehavior, incompetency, and/or fraud, etc.
    • Shame on all of us that numerous contracts are authorized ($83,000,000 to Yale University) and the unknown cost of Rush University’s efforts. Are we really going to accept the waste of 100s of Millions of dollars that inevitably decrease funding for those in need right NOW?!
    • We have no right to do this, nor permit this kind of abuse of scarce financial resources!

    1. There are numerous confounding variables that could factor in the star model. In order to eliminate them we need to consider: hospital size, location (urban vs rural), services (major tertiary care/ referral center or teaching vs not). You need to run 3 analyses with these variables in mind and the analysis needs to consider hospital wide measures and measures specific to the bulk of what they do or treat. Different weights should be used to reflect the relative importance of these factors. I believe by now CMS has amassed an arsenal of data that they can use to produce and validate a better predictive model for safety and outcomes for all hospital types above. Generalizing irreproducible findings using an untested one-size-fits-all model is not only statistically but also morally erroneous, as it will mislead the public and decision makers.
