
Why Accuracy Anxiety Drives So Much Search Interest in Measurement Content

You stare at a test result list and don’t know which cutoff to trust for sending patients to further care. The exact question is: which threshold will catch true cases without overwhelming my clinic with false positives?

Most people fixate on a single accuracy number or AUC and ignore how thresholds and 2×2 counts change real referrals and workload. This article shows you how to read sensitivity, specificity, and 2×2 tables so you can pick cutoffs that match your clinic’s capacity, predict how many patients will be referred, and avoid hidden follow‑up burdens.

You’ll get clear steps to extract usable counts and choose practical thresholds. It’s easier than it looks.

Key Takeaways

If you’ve ever worried about missing a diagnosis, this is why.

Why it matters: missing cases can cause harm, legal exposure, and wasted resources. Example: a primary care clinic that missed three depressed patients last month had two crisis visits later, creating extra emergency costs and staff overtime.

1) You prioritize avoiding missed cases, so you search for cutoffs that raise sensitivity.

  • Step 1: decide the maximum missed-case rate you’ll tolerate (for example, 5%).
  • Step 2: look for a cutoff with sensitivity ≥95% in published studies or local audits.
  • If you need a number now, use the higher-sensitivity threshold reported in at least two studies.

If you’ve ever wondered how measurement errors change care, this is why.

Why it matters: measurement inaccuracies change who gets treatment and who doesn’t. Example: a screening questionnaire scored differently when staff read items aloud versus patients self-completed, shifting positive rates from 8% to 15%.

2) You look for thresholds that balance sensitivity and specificity for your setting.

  • Step 1: decide whether missing cases (sensitivity) or false alarms (specificity) costs more to you—say, missing a case costs $2,000 in downstream care.
  • Step 2: pick the cutoff where the cost-weighted error is lowest, using published ROC curves or a small local pilot of 100 patients (see the cost sketch after this list).
  • If you need baseline numbers, compare cutoffs that change positive rates by 5–10 percentage points.
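
If you want to see how that cost weighting plays out, here is a minimal sketch in Python. The sensitivity/specificity pairs per cutoff, the prevalence, and the dollar costs are placeholders to illustrate the comparison, not values from any study.

```python
# Sketch: pick the cutoff with the lowest cost-weighted error.
# Sensitivity/specificity pairs and dollar costs are hypothetical;
# substitute values from published ROC data or your own pilot.

def expected_cost_per_patient(sens, spec, prevalence, cost_fn, cost_fp):
    """Expected cost per screened patient from missed cases (FN) and false alarms (FP)."""
    fn_rate = prevalence * (1 - sens)          # missed cases per patient screened
    fp_rate = (1 - prevalence) * (1 - spec)    # false alarms per patient screened
    return fn_rate * cost_fn + fp_rate * cost_fp

candidate_cutoffs = {          # cutoff -> (sensitivity, specificity), illustrative only
    8:  (0.92, 0.70),
    10: (0.85, 0.80),
    12: (0.75, 0.88),
}

prevalence = 0.10      # assumed local prevalence
cost_fn = 2000         # downstream cost of a missed case (from the example above)
cost_fp = 150          # cost of an unnecessary follow-up (assumption)

for cutoff, (sens, spec) in candidate_cutoffs.items():
    cost = expected_cost_per_patient(sens, spec, prevalence, cost_fn, cost_fp)
    print(f"cutoff {cutoff}: expected cost ${cost:.2f} per patient screened")
```

Whichever cutoff prints the lowest cost per patient is the one step 2 points you toward, given your own numbers.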

Think of operational predictability like scheduling a clinic day.

Why it matters: test characteristics change workflows and follow-up capacity. Example: a clinic with two follow-up slots per day doubled those slots after switching to a more sensitive cutoff, and waitlists fell from 12 to 4 people.

3) You focus searches on how test properties affect workflow and capacity.

  • Step 1: model how many positives per week each cutoff produces (use your patient volume; e.g., 500 screens × 10% positive = 50 referrals).
  • Step 2: match referrals to available slots and adjust the cutoff until weekly demand fits capacity.
  • Use a simple spreadsheet to try 3 cutoff options and project monthly referrals.
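
Here is a minimal scripted version of that projection, assuming you know (or can estimate) the positive rate at each candidate cutoff; the volumes, rates, and capacity below are made up.

```python
# Sketch: project weekly and monthly referrals for a few candidate cutoffs.
# Positive rates per cutoff are placeholders; use rates from your own data or the literature.

weekly_screens = 500                                        # your patient volume per week
positive_rate_by_cutoff = {8: 0.15, 10: 0.10, 12: 0.06}     # hypothetical
follow_up_slots_per_week = 40                               # your current capacity

for cutoff, rate in positive_rate_by_cutoff.items():
    weekly_referrals = weekly_screens * rate
    monthly_referrals = weekly_referrals * 4.33             # average weeks per month
    fits = "fits capacity" if weekly_referrals <= follow_up_slots_per_week else "exceeds capacity"
    print(f"cutoff {cutoff}: {weekly_referrals:.0f}/week, {monthly_referrals:.0f}/month ({fits})")
```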

If you’ve ever had a patient give misleading answers, this is why.

Why it matters: better administration and communication improve honesty and data quality. Example: when staff switched from a quick hallway script to a private 3-minute standard intro, completion accuracy improved and positive rates stabilized.

4) You search for clear scripts and administration steps to improve responses.

  • Step 1: use a short standard intro (30–45 seconds) that explains confidentiality and purpose.
  • Step 2: train staff with two role-play sessions on neutral wording.
  • Try this script: “This questionnaire helps us understand how you’re feeling; your answers stay private and guide care.”

Before you set organizational cutoffs, you need to validate them locally.

Why it matters: local evidence reduces clinical and legal risk. Example: a hospital validated a national cutoff with 200 local charts and adjusted down one point, reducing false positives by 40% without raising missed cases.

5) You look for evidence and steps to set, monitor, and adjust cutoffs.

  • Step 1: run a local validation using at least 200 records comparing the test to clinician diagnosis.
  • Step 2: set an initial cutoff, monitor the monthly positive rate and outcomes for 6 months, and adjust if positive rates shift by >20% (see the monitoring sketch after this list).
  • Keep a one-page protocol that lists who reviews the data, when, and what triggers a cutoff review.
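
A minimal sketch of the step-2 monitoring rule, assuming you log the monthly positive rate and compare it with the rate observed during your local validation; the monthly figures are placeholders.

```python
# Sketch: flag a cutoff review when the monthly positive rate drifts >20%
# (relative change) from the baseline rate set when the cutoff was adopted.
# Monthly rates below are placeholders; feed in your own audit numbers.

baseline_positive_rate = 0.12                               # rate observed during local validation
monthly_rates = {"Jan": 0.11, "Feb": 0.13, "Mar": 0.16}     # hypothetical monitoring data

for month, rate in monthly_rates.items():
    relative_shift = (rate - baseline_positive_rate) / baseline_positive_rate
    if abs(relative_shift) > 0.20:
        print(f"{month}: positive rate {rate:.0%} shifted {relative_shift:+.0%}, trigger cutoff review")
    else:
        print(f"{month}: positive rate {rate:.0%} within expected range")
```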

What Measurement Accuracy Means for Anxiety Screening

If you’ve ever filled out a questionnaire at a clinic, this is why those numbers matter to you. Why it matters: the score you give decides whether you get more care or not. Imagine a nurse glancing at a score sheet and deciding if you need a referral to therapy — that score is the gateway.

Cutoff selection: what score separates likely cases from non-cases?

Why it matters: the cutoff changes who gets follow-up care. Example: on a common 0–21 anxiety scale, a cutoff of 8 flags more people than a cutoff of 10. If the cutoff is 8, about 20% more patients might be contacted for follow-up.

Steps:

  1. Decide acceptable trade-off: do you prefer catching more true cases (higher sensitivity) or avoiding unnecessary follow-ups (higher specificity)?
  2. Pick a target: for average primary care, choose a cutoff that gives ~85% sensitivity and ~75% specificity — for some scales that’s about 8–10.
  3. Recheck annually with local data to adjust the cutoff if your clinic population differs.

Example: a community clinic that switched from cutoff 10 to 8 saw referrals rise from 12 to 15 patients per week and diagnostic confirmations rise from 7 to 11.

Clinical thresholds: how scores map to care decisions

Why it matters: thresholds determine action — watchful waiting, brief therapy, or urgent referral. Example: use three tiers on a 0–21 scale: 0–7 = low risk (no action), 8–14 = moderate (brief intervention or scheduled follow-up), 15–21 = high (immediate referral).

Steps:

  1. Define three action levels tied to specific services available in your setting.
  2. Train staff on what each level triggers, including timelines (e.g., contact within 72 hours for high).
  3. Track outcomes by tier to see if thresholds match real needs.

Example: a clinic that specified 72-hour contact for high scores reduced emergency visits by 10% over six months.

Patient engagement: how clear explanations improve honesty

Why it matters: patients who understand thresholds answer more accurately. Example: telling patients “scores 8–14 mean we’ll offer a short follow-up” makes them more likely to answer truthfully than vague wording.

Steps:

  1. Give a one-sentence explanation before the questions about what scores mean.
  2. Reassure about confidentiality and what follow-up looks like.
  3. Offer a brief example: “If you score 9, we’ll schedule a 15-minute check-in.”

Example: a clinic that added a one-line explanation increased completed questionnaires by 12% and reduced inconsistent answers.

Screening workflows: make sure measurement leads to real care

Why it matters: scoring without a plan wastes time and can overload staff. Example: integrate the scoring into your EMR so a score ≥15 auto-generates a referral note.

Steps:

  1. Define scoring rules and the referral path for each threshold.
  2. Build automatic alerts in your record system for high-risk scores.
  3. Set feedback loops: review a sample of flagged cases monthly to refine procedures.

Example: after automating alerts, a practice cut referral processing time from 5 days to 1.5 days.

Final practical note: keep it simple and measurable. Use clear cutoffs (for many tools, 8–10 and 15 are useful anchors), tell patients exactly what scores trigger, and automate the steps so nothing falls through the cracks.

Who Looks for Measurement‑Accuracy Info, and Why?


If you’ve ever picked a questionnaire and wondered whether to trust its scores, this explains why accuracy matters and what to check.

Why this matters: your decisions — who gets treatment, who gets more testing, what policy to follow — change based on accuracy.

Clinicians need accuracy so they can decide who gets follow‑up care. For example, a primary‑care doctor using a 10‑item anxiety screener with 85% sensitivity and 80% specificity will know they’ll catch 85 out of 100 true cases but get 20 false positives per 100 negatives; that changes whether they refer every positive or add a confirmatory interview. Look for these concrete things in a study: the reported sensitivity and specificity, the clinical setting (primary care vs specialty clinic), and the sample size (ideally 200+ for stable estimates).

Researchers need accuracy so they can compare tools reliably. For example, a team testing two instruments in college students should look for replicated sensitivity/specificity across at least two samples of 100+ each; if one tool has 90% sensitivity in Sample A but 60% in Sample B, that’s a red flag. Check whether studies report confidence intervals around accuracy numbers and whether they used an independent gold standard, like a clinician diagnostic interview.

Policymakers need accuracy to set screening recommendations because small differences scale up across populations. For example, a screening program with a 5% false‑positive rate in a city of 1 million adults would generate 50,000 false positives and associated costs. Look for population‑level estimates, number‑needed‑to‑screen calculations, and harm‑benefit statements in policy papers.

How I check studies (step‑by‑step):

  1. Find the sensitivity and specificity and their 95% confidence intervals (if only raw counts are reported, you can compute the intervals yourself; see the sketch after this list).
  2. Note the study setting (primary care, specialty clinic, community).
  3. Check sample size — prefer studies with at least 200 participants for general estimates.
  4. Confirm the reference standard (did they use a clinician interview or just another questionnaire?).
  5. Look for subgroup analyses (age, gender, severity) to see if accuracy varies.
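
When a paper gives raw 2×2 counts but no confidence intervals (step 1), a minimal sketch using the Wilson score interval; the TP/FN/FP/TN counts below are hypothetical.

```python
# Sketch: 95% Wilson score intervals for sensitivity and specificity from raw 2x2 counts.
# The counts below are hypothetical; replace with the study's TP/FN/FP/TN.
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half_width, centre + half_width

tp, fn, fp, tn = 46, 4, 30, 120          # hypothetical counts

sens = tp / (tp + fn)
spec = tn / (tn + fp)
lo, hi = wilson_ci(tp, tp + fn)
print(f"sensitivity {sens:.2f} (95% CI {lo:.2f}-{hi:.2f})")
lo, hi = wilson_ci(tn, tn + fp)
print(f"specificity {spec:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```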

Real example: I once reviewed a study of an anxiety screener done in an outpatient psychiatry clinic (n=150) that reported 92% sensitivity but no confidence intervals and used another questionnaire as the reference; I downgraded my trust because the setting concentrates severe cases and the reference wasn’t independent.

Practical tip: if you only have one study, halve your confidence unless it’s large (n≥500) and uses a clinician diagnostic interview. Small, single‑setting studies often overestimate accuracy.

Key Test‑Accuracy Metrics: Sensitivity, Specificity, AUC Explained


Think of test accuracy like a metal detector at an airport: you want it to catch knives without shouting at belt buckles.

Why this matters: you’ll use these numbers to pick a threshold that balances missed cases versus false alarms in real patients. Sensitivity is the proportion of true positives detected — it tells you how often sick people are identified; for example, a flu test with 90% sensitivity will find 90 out of 100 infected travelers. Specificity is the proportion of true negatives — it tells you how often healthy people are correctly labeled; for example, a specificity of 85% means 15 out of 100 healthy people get a false alarm and may face extra testing. AUC (area under the curve) summarizes overall discrimination across thresholds, with 0.5 meaning random guessing and 1.0 meaning perfect separation; an AUC of 0.85 usually indicates the test separates sick and healthy people fairly well.
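
To make those definitions concrete, here is a minimal sketch that computes sensitivity and specificity at one cutoff and AUC across all cutoffs, assuming you have each person's true status and test score. It uses scikit-learn's roc_auc_score (an assumption about your toolkit; any ROC routine works), and the data are made up.

```python
# Sketch: sensitivity and specificity at one cutoff, plus AUC across all cutoffs.
# The labels and scores below are made-up illustration data.
from sklearn.metrics import roc_auc_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]            # 1 = has the condition
scores = [14, 11, 9, 6, 10, 7, 5, 4, 3, 2]         # questionnaire scores
cutoff = 8                                          # score >= cutoff counts as positive

tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= cutoff)
fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < cutoff)
tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < cutoff)
fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= cutoff)

print(f"sensitivity = {tp / (tp + fn):.2f}")         # true positives detected
print(f"specificity = {tn / (tn + fp):.2f}")         # true negatives correctly labeled
print(f"AUC = {roc_auc_score(y_true, scores):.2f}")  # discrimination across all cutoffs
```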

Why this matters: changing the cutoff changes clinical consequences in concrete ways. If you lower the threshold, you’ll detect more sick people but create more false positives; for example, dropping a glucose cutoff from 140 to 126 mg/dL might increase detected diabetes cases by 20% while doubling follow-up visits. If you raise the threshold, you’ll reduce false alarms but miss more cases.

Why this matters: the study population affects the numbers you see, so you shouldn’t assume published metrics apply to your patients. A test evaluated in a hospital with very sick patients often shows higher sensitivity than the same test in primary care where disease is rarer; imagine a cancer marker that hits 95% sensitivity in oncology clinics but only 75% in routine screening.

How to pick a threshold (concrete steps):

  1. Decide the clinical harm you most want to avoid (missed cases or false alarms).
  2. Look at sensitivity/specificity pairs at candidate cutoffs from study data.
  3. Estimate outcomes for your population size (e.g., per 1,000 people) using the disease prevalence you expect.
  4. Choose the cutoff that gives a tolerable trade-off (for example, ≤5 missed cases per 1,000 versus ≤50 false positives).
  5. Reassess after 3–6 months with real-world data and adjust if needed.

Example: in a clinic screening 1,000 patients where disease prevalence is 2%, a test with 90% sensitivity and 90% specificity will find 18 true positives, miss 2 cases, and produce about 98 false positives — plan workflows for those extra 98 follow-ups.
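
That example is just arithmetic you can reuse. A minimal sketch of the per-1,000 calculation from step 3, with the same prevalence, sensitivity, and specificity plugged in; swap in your own numbers.

```python
# Sketch: expected 2x2 counts per 1,000 people screened, given prevalence,
# sensitivity, and specificity (the same numbers as the example above).

def expected_counts(n_screened, prevalence, sensitivity, specificity):
    diseased = n_screened * prevalence
    healthy = n_screened - diseased
    tp = diseased * sensitivity            # true positives found
    fn = diseased - tp                     # missed cases
    tn = healthy * specificity             # correctly reassured
    fp = healthy - tn                      # false alarms needing follow-up
    return tp, fn, tn, fp

tp, fn, tn, fp = expected_counts(n_screened=1000, prevalence=0.02,
                                 sensitivity=0.90, specificity=0.90)
print(f"true positives: {tp:.0f}, missed cases: {fn:.0f}, false positives: {fp:.0f}")
# -> true positives: 18, missed cases: 2, false positives: 98
```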

Final practical tip: when you read test accuracy numbers, always check the study population, note the chosen cutoff, and run the simple per-1,000 calculation above so your decision reflects real patient impact.

Why Inconsistent Accuracy Reporting Drives Repeat Searches


If you’ve ever had to redo searches because studies report accuracy differently, this explains why and what to do. Why it matters: inconsistent reporting wastes your time and leads to wrong clinical links.

When accuracy numbers aren’t reported the same way across studies, you’ll have to dig for the exact figures you need because inconsistent reporting severs the link between a test’s performance and the clinical decision you’re making. Example: a paper gives only a positive predictive value at a 10% prevalence, while you need sensitivity at a diagnostic cutoff of 0.7; you then have to hunt for the raw counts or another paper that reports the cutoff. The concrete action: always note whether a paper reports sensitivity, specificity, raw 2×2 counts, and the exact cutoff.

Incomplete reporting forces you to triangulate sensitivity, specificity, and cut-offs from multiple sources, which creates search redundancy as you chase fragmented results. Example: you read three abstracts, each mentioning different thresholds—0.5, 0.7, and “high”—and you end up opening every full text to extract numbers. Steps to avoid this:

  1. Prioritize studies that include raw 2×2 tables (TP, FP, FN, TN).
  2. If raw counts aren’t given, record reported prevalence and predictive values and estimate counts only as a last resort (a sketch of that reconstruction follows this list).
  3. Add “cutoff” OR “threshold” OR “2×2” to your search string.
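
When you do fall back on step 2, a minimal sketch of that reconstruction, assuming the paper reports sensitivity, specificity, prevalence, and total N; the inputs below are hypothetical and the rounded counts are estimates, not the study's actual cells.

```python
# Sketch (last resort): approximate a study's 2x2 cells from reported
# sensitivity, specificity, prevalence, and total N. Inputs are hypothetical,
# and rounding means these are estimates, not the true counts.

def approx_2x2(n_total, prevalence, sensitivity, specificity):
    diseased = round(n_total * prevalence)
    healthy = n_total - diseased
    tp = round(diseased * sensitivity)
    fn = diseased - tp
    tn = round(healthy * specificity)
    fp = healthy - tn
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}

print(approx_2x2(n_total=300, prevalence=0.10, sensitivity=0.85, specificity=0.80))
# e.g. {'TP': 26, 'FN': 4, 'FP': 54, 'TN': 216}
```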

Indexing inconsistencies in databases hide relevant papers, so you’ll rerun queries with different terms and miss aggregated metrics. Example: a relevant study is indexed under “diagnostic accuracy” but not under “sensitivity,” so your first keyword set misses it and you only find it weeks later. Steps to reduce misses:

  1. Search synonyms: diagnostic accuracy, sensitivity, specificity, ROC, AUC.
  2. Use database-specific filters (e.g., PubMed’s “Diagnostic Test Accuracy”).
  3. Save and reuse your exact search strings.

Evidence fragmentation means pooled estimates are unreliable without standardized reporting. Example: a meta-analysis excludes half the studies because they lack a common cutoff, skewing the pooled sensitivity. The how: extract cutoffs and raw counts before pooling, and exclude studies only after attempting to standardize metrics.

To reduce repeat searches, look for studies using clear thresholds, explicit methods, and machine-readable tables. Example: a paper with a downloadable CSV of results lets you recompute sensitivity at any cutoff in minutes. Steps to make your work reproducible:

  1. Download machine-readable tables when available.
  2. Record the exact cutoff, measurement method, and population for each study.
  3. Save the search strategy, date, and database.

Finally, document your search strategies so others can reproduce results and avoid wasted effort. Example: include a saved PubMed query, the date you ran it, and the number of hits—this lets a colleague immediately replicate or update your search. Steps to document:

  1. Save the full query string and filters.
  2. Note database name, date, and total hits.
  3. Store a short README explaining which items you prioritized (cutoffs, 2×2 counts, machine-readable files).

If you follow those steps, you’ll spend less time repeating searches and more time applying the evidence.

When Tool Choice Changes Prevalence and Real‑World Outcomes


If you’ve ever tried to pick a screening tool, this is why.

Why this matters: picking the wrong measure changes who is labeled and how care flows, so your program’s numbers and patient outcomes will shift.

Here’s what to watch for and what to do.

What changes when a tool shifts prevalence?

Why it matters: prevalence numbers drive referrals, budgets, and waitlists.

  • A lower specificity means more false positives and higher apparent prevalence. For example, switching from a tool with 95% specificity to one with 85% specificity on a population of 10,000 with 10% true prevalence will produce roughly 900 extra false positives — hundreds more referrals and longer waitlists.
  • A lower sensitivity means missed cases and unmet needs. If sensitivity drops from 90% to 70% in that same 10,000-person group, you’ll miss about 200 additional people who need care.
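
A minimal sketch of the arithmetic behind both bullets (10,000 people, 10% true prevalence), showing the extra false positives, the extra missed cases, and the shift in apparent prevalence that your dashboards will report.

```python
# Sketch: how a change in tool accuracy shifts counts in a 10,000-person program
# with 10% true prevalence (the scenario in the bullets above).

def program_counts(n, prevalence, sensitivity, specificity):
    cases = n * prevalence
    non_cases = n - cases
    tp = cases * sensitivity
    fp = non_cases * (1 - specificity)
    missed = cases - tp
    apparent_prevalence = (tp + fp) / n        # what your dashboards will show
    return tp, fp, missed, apparent_prevalence

old = program_counts(10_000, 0.10, 0.90, 0.95)
new = program_counts(10_000, 0.10, 0.70, 0.85)

print(f"extra false positives: {new[1] - old[1]:.0f}")     # 900
print(f"extra missed cases:    {new[2] - old[2]:.0f}")     # 200
print(f"apparent prevalence:   {old[3]:.1%} -> {new[3]:.1%}")
```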

Real-world example: a community clinic that changed tools saw referrals double overnight and wait times jump from two weeks to eight.

What you should report before adopting a tool

Why this matters: you need numbers to predict downstream effects.

  1. Report sensitivity, specificity, and the chosen threshold.
  2. Describe the testing setting (primary care, school, telehealth).
  3. Give the expected positive predictive value (PPV) at your estimated prevalence.

Real-world example: a hospital required vendors to state PPV at 5%, 10%, and 20% prevalence, and used those to budget staff.

How to test a tool in your setting

Why this matters: performance often changes by setting and population.

  1. Run a pilot with at least 300 people from your target group.
  2. Compare the tool against a diagnostic standard or clinician assessment.
  3. Calculate sensitivity, specificity, PPV, and negative predictive value (NPV).
  4. Model referral volumes using those numbers.

Real-world example: a school district piloted a screen on 400 students, found sensitivity dropped 15%, and adjusted staffing before full rollout.

How to model downstream effects

Why this matters: numbers predict resource needs and policy impacts.

  1. Estimate true prevalence in your population.
  2. Apply sensitivity and specificity to that population to get expected positives and negatives.
  3. Convert expected positives into referrals and expected treatment hours.
  4. Stress-test with worst-case prevalence (+5–10 percentage points) and lower test accuracy (−10%).
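
A minimal sketch of steps 1–4, converting expected positives into referrals and treatment hours and then stress-testing; the screening volume, accuracy figures, and hours per referral are placeholders for your own numbers.

```python
# Sketch: model expected positives, referrals, and treatment hours,
# then stress-test with higher prevalence and lower accuracy.
# Volumes and hours-per-referral are placeholders for your own figures.

def demand(n_screened, prevalence, sensitivity, specificity, hours_per_referral=2.0):
    cases = n_screened * prevalence
    positives = cases * sensitivity + (n_screened - cases) * (1 - specificity)
    return positives, positives * hours_per_referral

scenarios = {
    "base case":   dict(prevalence=0.10, sensitivity=0.85, specificity=0.90),
    "stress test": dict(prevalence=0.15, sensitivity=0.75, specificity=0.80),  # +5 pts prevalence, lower accuracy
}

for name, params in scenarios.items():
    positives, hours = demand(n_screened=2_000, **params)
    print(f"{name}: ~{positives:.0f} referrals, ~{hours:.0f} treatment hours")
```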

Real-world example: an insurer required clinics to simulate three scenarios; one simulation showed a 30% budget overrun if specificity fell by 8%.

Quick checklist before you change tools

Why this matters: a short checklist prevents surprises.

  1. Know sensitivity, specificity, threshold.
  2. Pilot with ≥300 real patients.
  3. Model referrals and waitlists.
  4. Adjust staffing or thresholds based on results.

Real-world example: a primary care practice used this checklist and avoided a three-month surge by keeping the old tool until staffing was ready.

Final practical tip: always pair metric changes with operational plans — adjust thresholds, hire or reassign staff, or set stepped referral criteria — so your screening matches real-world capacity.

How Reliability (Alpha, Test–Retest) Builds Trust in a Tool

Think of reliability like the tool’s consistency score: will it behave the same way when nothing meaningful changed?

Why this matters: if a tool isn’t consistent, your decisions based on it can swing for no good reason. For example, imagine you screen patients with a 10-item mood questionnaire on Monday and again on Thursday; if the scores jump wildly just because of sloppy items, you’ll misclassify people.

Internal consistency (Cronbach’s alpha) tells you how well items hang together. Here’s why it matters: higher alpha (commonly .70–.90) suggests items measure the same thing, so a total score is meaningful. Example: a 12-item anxiety scale with alpha = .85 means most items respond similarly when anxiety changes; if alpha = .98, some items may repeat the same question in different words and you can trim items.

How to check and act on alpha:

  1. Compute alpha for your sample (many software packages do this).
  2. If alpha < .70, review items for weak wording or unrelated content.
  3. If alpha > .90, look for redundancy and remove at least one similar item to shorten the scale.
  4. Recompute alpha after changes.
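
If you don't have a stats package handy, alpha is easy to compute directly. A minimal NumPy sketch using a made-up item-response matrix (rows are respondents, columns are items).

```python
# Sketch: Cronbach's alpha from a respondents-by-items matrix using NumPy.
# The 6-respondent, 5-item response matrix below is made-up illustration data.
import numpy as np

def cronbach_alpha(responses):
    """responses: 2-D array, rows = respondents, columns = items."""
    responses = np.asarray(responses, dtype=float)
    k = responses.shape[1]                              # number of items
    item_variances = responses.var(axis=0, ddof=1)      # variance of each item
    total_variance = responses.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

data = [
    [3, 2, 3, 3, 2],
    [1, 1, 2, 1, 1],
    [4, 3, 4, 4, 3],
    [2, 2, 2, 3, 2],
    [0, 1, 0, 1, 0],
    [3, 3, 3, 2, 3],
]
print(f"Cronbach's alpha = {cronbach_alpha(data):.2f}")
```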

Test–retest reliability shows whether scores stay similar over time when nothing meaningful changed. Why this matters: you want stability so changes reflect real change, not noise. Example: you give the same burnout inventory to nurses two weeks apart during a steady staffing period; an intraclass correlation (ICC) of .75 means decent stability, while .40 means scores are mostly random.

How to check and act on test–retest:

  1. Pick an appropriate interval (commonly 1–4 weeks depending on the construct).
  2. Administer the same tool twice to the same people under similar conditions.
  3. Calculate ICC or Pearson r; target ICC ≥ .70 for group decisions, ≥ .80 for individual clinical decisions (a quick calculation is sketched after these steps).
  4. If reliability is low, shorten the interval, standardize testing conditions, or revise items to reduce ambiguity.
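
A minimal sketch of steps 2–3 as a quick first look: it computes the Pearson r between two administrations, plus the mean score shift that r alone won't show. For the ICC targets above, use a dedicated routine in your stats package; the scores below are made up.

```python
# Sketch: quick test-retest check with Pearson r between two administrations.
# ICC is the preferred statistic for the .70/.80 targets above; this is a first look.
# Scores are made-up illustration data (same 8 people, two weeks apart).
import numpy as np

time_1 = np.array([12, 5, 9, 14, 7, 11, 6, 10])
time_2 = np.array([11, 6, 9, 15, 8, 10, 5, 11])

r = np.corrcoef(time_1, time_2)[0, 1]
mean_shift = (time_2 - time_1).mean()        # systematic drift that Pearson r can't see

print(f"test-retest Pearson r = {r:.2f}")
print(f"mean change between sessions = {mean_shift:+.2f} points")
```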

Together, alpha and test–retest build trust because they show the tool measures a single thing consistently now and over time. If either is weak, treat scores cautiously: avoid one-off clinical decisions and report uncertainty in prevalence estimates. For practical use, keep records: note alpha and ICC values in reports and aim for alpha .70–.90 and ICC ≥ .70 depending on your purpose.

Quick Checks to Judge a Measurement Tool’s Credibility

Before you judge a measurement tool, know why it matters: a bad tool wastes time and gives misleading results.

Here’s what to do, step by step.

1) Check the methods for a reporting checklist.

  • Why: it shows whether developers reported key evidence like validity, reliability, sample details, and statistical procedures.
  • Example: a methods section listing “content validity, Cronbach’s alpha = 0.88, sample N=450, cross-validation” is a clear checklist.
  • Action: find those four items—validity, reliability, sample size and description, and analysis methods—and mark whether each is present.

2) Look for specific accuracy numbers.

  • Why: numbers like sensitivity, specificity, and AUC tell you how well the tool discriminates conditions.
  • Example: a screening test that reports sensitivity 92%, specificity 85%, AUC 0.94 across 500 patients gives you a concrete performance picture.
  • Action: note the sensitivity, specificity, and AUC; prefer tools with AUC ≥ 0.80 for clinical-like decisions.

3) Assess sample size and setting.

  • Why: small or narrowly sampled studies limit how much you can trust results in your setting.
  • Example: a validation done only on 60 university students in one city won’t generalize to older adults in clinics.
  • Action: prefer N ≥ 200 for initial validation and replication across at least two different settings.

4) Treat testimonials as anecdote, not proof.

  • Why: user stories show experience but don’t replace controlled evaluation.
  • Example: a glowing client quote on a vendor page doesn’t reveal false positives or selection bias.
  • Action: value testimonials for usability notes, but weight peer-reviewed studies more heavily.

5) Check for replication and transparent cut-offs.

  • Why: replication across populations shows stability, and clear cut-offs let you apply the tool consistently.
  • Example: a tool whose cut-off score of 14 is reproduced in three independent cohorts of different ages and countries is more reliable.
  • Action: require at least one independent replication and that authors report how cut-offs were chosen and cross-validated.

6) Watch for statistical red flags.

  • Why: improper statistics can inflate perceived accuracy.
  • Example: reporting accuracy from the same sample used to train a model without cross-validation often overstates performance.
  • Action: make sure there was cross-validation or an external test set, and look for confidence intervals around key metrics.

If most of these elements are present, the tool is more credible; if several are missing, stay skeptical and look for alternatives.

Practical Next Steps: Choose, Use, and Report Accurate Measures

If you’ve ever picked a measure and later wished you hadn’t, this will help you choose one that actually works.

Why this matters: using a weak measure gives you wrong conclusions and wasted time. Example: a clinic chose a depression screener without checking cut-offs, then missed 30% of true cases.

1) How do you pick a good measurement tool?

Why this matters: the right tool reduces false positives and negatives. Example: choose a blood-glucose device that matches laboratory glucose values within ±5 mg/dL on repeat tests.

Steps:

  1. Look for published validity (AUC ≥ 0.80 is strong; 0.70–0.79 is acceptable) and reliability (Cronbach’s alpha or omega ≥ 0.70; test–retest ICC ≥ 0.75).
  2. Check sensitivity and specificity at proposed cut-offs — aim for both ≥ 0.80 when possible, or prioritize sensitivity if you must catch cases.
  3. Prefer measures replicated in at least two different settings (e.g., primary care and community samples).
  4. Avoid tools with only a single small study (n < 100) or unpublished psychometrics.

2) How do you plan to use the tool in your setting?

Why this matters: a tool alone won’t help unless you define how people will use it. Example: a school screening program where nurses will administer a 5-minute anxiety scale to all 6th graders during registration.

Steps:

  1. Define who will administer the measure and how much training they get — schedule a 1-hour training and one observed practice per staff member.
  2. Set explicit cut-offs and decision rules (e.g., score ≥ 15 → refer to counselor within 7 days).
  3. Create a one-page protocol that lists materials, timing, and FAQs for staff.
  4. Run a pilot with 30–50 participants to check flow and initial performance.

3) How should you collect and monitor data?

Why this matters: inconsistent data adds noise and bias to your results. Example: a study found that changing from paper to tablet changed average scores by 2 points.

Steps:

  1. Standardize administration: same instructions, same environment, same device.
  2. Track missing data and reasons; flag if >10% of items are missing for a subgroup.
  3. Monitor test–retest stability in a subset (n ≥ 30) over 1–2 weeks; expect ICC ≥ 0.75.
  4. Watch for bias by comparing scores across subgroups (age, language); if mean differences exceed the minimal clinically important difference, investigate.

4) How do you report your measurement work?

Why this matters: transparent reporting lets others interpret and replicate your findings. Example: a paper that included sample age range, cut-offs used, and sensitivity/specificity enabled two clinics to adopt the same protocol.

Steps:

  1. Report who you tested (sample size, age, sex, setting) and any exclusions.
  2. Give the exact version of the tool and administration method (paper/tablet; interviewer/self).
  3. State cut-offs and the operating metrics: AUC, sensitivity, specificity, PPV, NPV, and reliability coefficients with 95% CIs.
  4. List limitations clearly (sample differences, small n for subgroups) and provide the pilot data you collected.

Follow these steps and you’ll move from guessing to using measurements you can trust.

Frequently Asked Questions

How Do Cultural Differences Affect Test Accuracy Across Regions?

I find language bias and varying response styles shift sensitivity and specificity, alter prevalence estimates, and require localized validation before clinicians and researchers can trust cross-regional comparisons.

Can AI Scoring Algorithms Alter Sensitivity and Specificity?

Yes — I think AI scoring can change sensitivity and specificity: algorithmic bias and calibration drift can shift thresholds, systematically misclassify groups, and alter overall accuracy unless we regularly recalibrate, audit, and adjust models.

What Legal Risks Come With Imperfect Screening Tools?

You face liability exposure if imperfect screening misses or misclassifies patients: inadequate informed consent, poor validation, or failure to follow recognized standards can trigger malpractice claims, regulatory fines, and reputational harm.

How Do Cutoff Changes Impact Longitudinal Research Comparisons?

I treat shifting cutoffs like moving tide lines that erode cohort comparability: cutoff drift changes apparent prevalence, alters trajectories, biases trend estimates, and forces recalibration, so I document the thresholds used at each wave, adjust models, and report sensitivity analyses.

Are Patient-Reported Outcomes Influenced by Survey Mode (Phone vs. Online)?

Yes — I’ve found mode effects do influence patient-reported outcomes: phone versus online can introduce response bias, altering endorsements and severity ratings, so I’d adjust analyses and report mode to mitigate systematic measurement differences.