Five Ways to Lie With Statistics, or at Least Tell a Better Story
May 2021 Issue
Using the Statistical Fog to Tell a Compelling Story
Certainly, statistics promise to extend our understanding beyond our individual knowledge and experience. But have you ever gotten the feeling that you do not know how to judge if a presentation or paper you are seeing or reading is accurate? You may be reluctant to question the conclusions because you are not familiar with what a p-value is exactly, let alone the difference between ANOVA and MANOVA, correlation versus regression, or the statistical instrument that was used.
When we see numbers and quantitative analysis, by nature we surrender to the power of quantitative data, referred to as an overconfidence/overprecision bias or, informally, number bias.1 We suspend our observational clinical judgement or healthy skepticism of the discussion and assume the researcher's quantitative data is unbiased and untainted. However, people involved with research, science, and engineering know that is not necessarily the case. I remember early in my career an older researcher made an auspicious comment to me: "I find it quite interesting, and even a bit humorous, that you naively think that researchers are inherently immune to bias."
We know that researchers may, knowingly or unknowingly, use statistics to enhance, fudge, optimize, misrepresent, and yes, maybe cheat a little, as much as the stereotypical used car salesman. And just like buying a used car, the axiom "buyer beware" holds true. The consumer must always consider if the research being sold is to be believed or not.
In 2011, Diederik Stapel, a Dutch former professor of social psychology at Tilburg University, had his doctoral degree retracted because he was found to have altered or fabricated data for over 30 peer-reviewed publications going back to 2004. The discovery that he had been directly altering data was regarded as a black eye for his institution and the field of social psychology. At the expense of the scientific process, his research told a well-constructed and elegant story about carnivores being more selfish than vegetarians.2
While this thankfully may be an isolated case, the implication is that peer reviewers and clinicians would not know how data has been altered, manipulated, or quietly omitted. The resourceful researcher, frustrated by the lack of any statistical result, may continue to use various instruments until something is conjured up to validate the study. Many reputable and well-meaning researchers (including me) can make unintentional errors and apply comparative measures incorrectly. However, some study authors overstate their conclusions, embellish the content, and creatively enhance their results so they can receive more research money, awards, and accolades. It is somewhat forgivable when you consider the extremely competitive nature of research. The paradox is that many well-constructed research projects yield no significant results, which are just as valuable to the scientific process. Basic research in the more mundane but critical areas of reliability and validity measures is largely ignored for funding. Under pressure, researchers may succumb to enhancing and amplifying their results to achieve the maximum impact at the lowest cost.
Choose a Small and Convenient Sample
One way to ensure your study turns out the way you want is by cherry-picking a small group of people who fit the characteristics you desire. The reader is oblivious to who was excluded from the study. Small sample sizes are often a limitation in O&P because groups of patients with similar clinical presentations and device use may be difficult to find and recruit for a study. Small sample size is not inherently a problem unless the researcher generalizes or derives the results with statistics developed for larger groups. Most statistics use distribution models of average, frequency, and correlation to discover characteristics of the group, but you must have a minimum sample size to provide sufficient probability of detecting a real result. In frequentist models, this probability is called the statistical power of the study. It is difficult to derive reliable results using group statistics like average, standard deviation, and distribution with small groups that do not have sufficient data points.
Ideally, the sample should be large enough to describe the population with at least 90 percent confidence, but 95 percent is even better. Slovin's formula can help you know if the study has a large enough sample. For example, if there is a population of 30,000 patients, you would need 395 patients to have 95 percent confidence. For 90 percent you would need 100. Since it may take more time and money to find that many subjects, delaying research, a smaller sample size may be chosen. The research presentation should acknowledge the power and confidence as well as discuss any group that is over- or underrepresented.3 If the group is very small, the study author must be clear that any conclusion found with the sample may not be representative of the total population because he or she lacks sufficient power to describe the entire population with confidence or significance. Sometimes at the end of the paper or presentation, researchers will make earth-shaking conclusions but not temper their statements with the power of the sample size.
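The figures quoted for Slovin's formula can be checked with a few lines of Python. This is only a sketch: the function name is made up for illustration, and it assumes the usual form of the formula, n = N / (1 + N * e^2), with "95 percent confidence" read as a 5 percent margin of error (e = 0.05) and "90 percent" as e = 0.10.

```python
import math


def slovin_sample_size(population: int, margin_of_error: float) -> int:
    """Minimum sample size from Slovin's formula: n = N / (1 + N * e^2)."""
    if population <= 0:
        raise ValueError("population must be positive")
    if not 0 < margin_of_error < 1:
        raise ValueError("margin_of_error must be between 0 and 1")
    n = population / (1 + population * margin_of_error ** 2)
    # Round up, since a fraction of a patient cannot be recruited.
    return math.ceil(n)


print(slovin_sample_size(30_000, 0.05))  # 395
print(slovin_sample_size(30_000, 0.10))  # 100
```

Slovin's formula is a rough planning tool. It does not replace a formal power analysis tied to the specific test and effect size a study is designed around.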
Another element to consider about the sample is its heterogeneity or homogeneity—how alike or how different the group is. It can be good or bad depending on what is being studied. If the question is about the acceptance of upper-limb prostheses, then it may be good to narrow the focus to adult acceptance rather than including subjects of dissimilar ages from seven to 85. If it is to look at a broader question like career satisfaction of orthotists, you may want to have a range of ages represented. As a consumer of research, the important thing to know is whether the researcher provides this self-critical information, so you can consider if the sample was flawed or at least know the limitation of the study.3 Also, the type of sample chosen is related to the nature of the research question. For example, if you are asking about the cosmetic needs of upper-limb prostheses, are prosthetists or prosthetic users going to answer more critically?
Small numbers are not inherently bad, but they should be bolstered by additional qualitative descriptions and comparisons. These are often left out by those who only consider numeric data and do not consider comments relevant. However, if they restrict the data to averages, standard deviations, correlations, and regressions drawn from only 14 people, the number "look" is very similar to that of 40, 400, or 4,000 people. We need the additional context of comments to tell us more of their qualitative opinions, concerns, expectations, or needs to move our understanding further. Ignoring these is simply bad science.4
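One way to see why 14 people are not equivalent to 4,000, even when the averages look alike, is the standard error of the mean, which is the standard deviation divided by the square root of the sample size. The sketch below assumes a hypothetical standard deviation of 10 units for some outcome measure; that value and the function name are illustrative only.

```python
import math


def standard_error(std_dev: float, n: int) -> float:
    """Standard error of the mean: std_dev / sqrt(n)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return std_dev / math.sqrt(n)


ASSUMED_STD_DEV = 10.0  # hypothetical spread of the outcome measure

for n in (14, 40, 400, 4000):
    se = standard_error(ASSUMED_STD_DEV, n)
    # Mean +/- 1.96 standard errors is a large-sample approximation of a
    # 95 percent interval; small samples need a wider, t-based interval.
    print(f"n={n:>4}: standard error={se:.2f}, approx. 95% margin=+/-{1.96 * se:.2f}")
```

The same reported average carries a much wider margin of uncertainty at n = 14 than at n = 4,000, which is why the sample size belongs next to every average.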
Give the Average Values (but Nothing Else)
Many researchers give only the average values but do not provide the standard deviations (+/-). The standard deviation tells how widely the values are spread around the average. It could be that out of ten people only two were near the average, and the others were scattered about, so the average may not be a great model of the group. In a sense, the average is a representation of the group, and if it is used for comparisons the distribution should be given. For example, if two students got a 15 percent, but eight students got an 88 percent, the class average would be 73.4, but in reality, most of the class did much better.3
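The classroom example can be reproduced with Python's standard statistics module. This is a minimal sketch: the variable names are illustrative, and it uses the sample standard deviation, which is one common choice rather than the only one.

```python
import statistics

# Two students scored 15 percent and eight students scored 88 percent.
scores = [15, 15] + [88] * 8

mean = statistics.mean(scores)
median = statistics.median(scores)
spread = statistics.stdev(scores)  # sample standard deviation

print(f"mean={mean:.1f}, median={median:.1f}, std dev={spread:.1f}")
```

The mean of 73.4 sits where no student actually scored. The median of 88 and a standard deviation of roughly 31 points show how poorly the average alone describes this group.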
Any graphs should show the "whiskers," a range +/- one standard deviation at the top of the bar to tell you what the distribution range was. This shows whether there is a real difference between groups and how much overlap there was with the data. If the data is very similar, the difference between the groups may not be significant even though the averages may be different. An ethical researcher points this out and indicates when the averages may have differences but are not significant. Variables should be considered within the conditions and context. In an exceptional year like 2020, for example, a researcher may not want to derive "normal behavior" of patient visits because of COVID but instead look at a longer period. Some researchers may not note the context of the data or are oblivious to the conditions because the result is more dramatic. The ethical researcher may choose to exclude aberrant data and tell you the reason for the exclusion.3
Use p-Mining to Find False Correlations and Causation
One of the most common mistakes is to confuse correlation and causation. Simply put, correlation is the strength of the relationship between two or more variables. Causation goes deeper and says one variable produced a measurable effect on the other, or in other words, causes it. The critical thinker must always consider if correlation is being interpreted as a false causation. This is done frequently, often to tell a more compelling story. A classic example would be: As ice cream sales increase, so does the number of sunburns. A correlation as a causation error would say ice cream causes sunburns. The key is that there are many factors like sunny days, humidity, and higher temperatures that are the real factors of causation.
Correlations can be strong or weak, and positive or negative. For example, patient height may have a positive correlation with weight, meaning it is typical that the greater the height of the patient, the greater the weight. A negative correlation is the opposite: the older a man is, the less hair on his head he may have. These correlations are measured with a correlation coefficient, r, that ranges from -1 to 1 and shows the direction and strength of the relationship, with 0 indicating no linear relationship. An r of -.10 would be a very weak negative correlation, while .91 shows a strong positive correlation.3
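For readers who want to see where the correlation coefficient comes from, the sketch below computes Pearson's r by hand. The height, weight, age, and hair-density numbers are invented purely for illustration and are not taken from any study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data, for illustration only.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 61, 68, 80, 88]
ages_years = [30, 40, 50, 60, 70]
hair_density = [90, 85, 70, 60, 40]

print(f"height vs. weight: r = {pearson_r(heights_cm, weights_kg):.2f}")  # strongly positive
print(f"age vs. hair:      r = {pearson_r(ages_years, hair_density):.2f}")  # strongly negative
```

A value of r near 1 or -1 describes only how tightly the points follow a straight line. It says nothing about which variable, if either, causes the other.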
Researchers should define the question prior to evaluating the data for a correlation. If you just plop several factors together and run a correlation test with Excel or a statistics program and look for unanticipated relationships, there are bound to be a few. These are said to be significant if the p-value is less than 0.05, which means that there would be less than a 5 percent probability of seeing a relationship at least that strong by chance alone if no real relationship existed.3
One technique, called p-mining (also known as p-hacking or data dredging), is to put a bunch of factors together and rummage about looking for significant relationships that were not anticipated in the experimental design. Ethical researchers will note the unanticipated relationships found but indicate that they would require a specific experimental and methodological design to further study and understand.3
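A little arithmetic shows why rummaging is bound to find something. The sketch below counts the pairwise comparisons among a set of factors and estimates the chance that at least one comes out "significant" at p < 0.05 when no real relationships exist. The function names are illustrative, and treating the tests as independent is a simplifying assumption that pairwise correlations on shared data do not strictly meet.

```python
from math import comb


def pairwise_tests(num_variables: int) -> int:
    """Number of distinct variable pairs that could be tested for correlation."""
    return comb(num_variables, 2)


def chance_of_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one p < alpha if every null hypothesis is true.

    Assumes the tests are independent, which is a simplification.
    """
    return 1 - (1 - alpha) ** num_tests


for k in (5, 10, 20):
    tests = pairwise_tests(k)
    print(
        f"{k} factors -> {tests} pairwise tests, "
        f"about {0.05 * tests:.1f} false positives expected, "
        f"{chance_of_false_positive(tests):.0%} chance of at least one"
    )
```

Under these assumptions, 20 unrelated factors yield 190 comparisons and roughly nine or ten spurious "significant" findings, which is why relationships found this way need their own purpose-built study before they are believed.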
So, there must be a specific hypothesis that makes a prediction, paired with a null hypothesis that there is no effect. For example, the research question may be, "Is there an increased knee extension moment in an orthosis with an extended foot plate?" The hypothesis would be an emphatic "There is a relationship between the knee extension moment and the length of a foot plate with a KAFO." The null would be "There is no significant relationship between the knee extension moment and the length of a foot plate with a KAFO." This would be tested with a two-tailed test since it is just checking for a relationship in either direction. A one-tailed test would be used if the hypothesis predicts a specific direction, either an increase or a decrease of the extension moment.3
Consumers of research can be guilty of a snatch-and-grab practice and assign false correlations. For example, in the vitamin supplement industry, an herb that has been shown to benefit a small group of people leads to a variety of products, but it may not have the same effect for a majority of people. An older prosthetics research paper showed patients with Pelite liners wore their prostheses longer than those with cushion gel liners. The researcher was careful not to overstate her results, but many misinterpreted this to mean Pelite was more comfortable because they did not consider questions such as: Was it because Pelite was custom shaped for unique limb profiles? Or was it because laborers, who could not afford to expose expensive cushion gel liners, used foam liners as a cheaper alternative? Those questions are worth investigating further but could not be answered from the results.4
The experimental design needs to have an independent predictive variable and a dependent outcome variable. In the previous KAFO example, the independent predictive variable would be the length of the foot plate, which the researcher alters. The dependent outcome variable would be the extension moment measured with a moment sensor. However, the research question needs to lay the clear groundwork for the experimental design. A significantly low p-value is meaningless if the question is vague or not clinically relevant.3 Researchers will sometimes intentionally ask a vague research question to expand the statistical net to find a result. Worse yet, they find the relationships first, then form a question that fits the data before publishing. Doing it backwards may get instant results but is misleading and unethical.
Throw the (Qualitative) Baby Out With the (Methodological) Bathwater
Good research should be guided by clinical experience. Conclusions from numerical data should be validated with practical experience. Positive previous results are also forms of data, and practices should not be abandoned unless there is evidence to the contrary. It is human nature to want research to make a dramatic change to clinical practice, so there is an ever-present bias to overstate research results.5
However, when someone indicates there is no evidence to support a clinical observation, there may be no evidence that it is wrong either. A biomechanist once told me the equinovarus deformity did not exist for partial foot Chopart and Lisfranc amputations. This didn't align with my clinical observation that almost all patients presented with equinovarus unless they had Achilles lengthening. He explained that this deformity may exist, but he did not see evidence of it with the instruments he was using. He also admitted he wasn't looking for that either.6
A common mistake is to mix types of variables that are not compatible, such as nominal and scalar. However, in many areas of healthcare, qualitative variables are transformed into numerical scalar or interval numbers. Just like a five-point Likert scale that asks a subject to choose between definitely disagree, somewhat disagree, neutral, somewhat agree, and definitely agree, orthopedic surgeons judge spondylolisthesis as grade I, II, III, IV, or V based on visual inspection of the vertebral body. Nominal variables as numbers look like hard data but are really just a qualitative visual inspection. Sometimes healthcare workers will try to give a plus or minus to refine their view, but it is still a qualitative value.7
Even the Wong-Baker Faces ten-point pain index and the often-used Manual Muscle Test are considered qualitative assessments with nominal variables and errors.7 In and of itself this is not an issue, but when compared with frequentist methods of distribution and causation represented as scalar or interval data, it becomes problematic.3
Different variables can measure the same thing. Transradial amputations can be classified as long, medium, short, and very short limbs as a nominal variable. A scalar variable would be measuring the actual length since this has a 0-point. The scalar variable would allow comparison with correlation, but the nominal would not.
Qualitative experience is important, and one should not be data-blind to it. Qualitative analysis is often confined to the comment section, but many researchers find the greatest value there because it describes the conditions of the research, and those comments should be shared.8 The National Transportation Safety Board's "Go Teams" have for 35 years employed an evaluation process that combines quantitative and qualitative assessments when examining airplane accidents. Since they do not have an aggregated and distributed number of accidents that fit the same circumstances, they need to take in all data. Obviously, they cannot use frequentist methods of central tendency, deviation, and inference due to the few instances of accidents, so they use reports and narratives from professionals to piece together the probable cause.
Surgeons will employ techniques for a special set of circumstances that are not based on statistical outcome measures to achieve positive results when no other option is available. Orthopedic surgeon Andrew Cappuccino, MD, used moderate hypothermia cold therapy on the spinal cord to decrease swelling and regain arm and leg function for the Buffalo Bills lineman Kevin Everett. A true quantitative double-blind outcome study was not available for this therapy at the C3-C4 level. However, based on previous clinical experience, the surgeon made the decision that would render the most positive outcome.
Even when designed and conducted properly, statistical methods still have errors that should be considered.
A type I error is when a statistically significant result was found, but there really was no effect (a false positive). This is often due to poor sampling, such as where a survey finds that prosthetists have low experience with scanning limb shapes, but it turns out that 38 percent of the respondents had over 25 years of experience. A type II error is when no statistically significant result was found, but there really was an effect (a false negative). For example, no difference in motion or velocity of an energy-response foot may be measured within the gait lab, but the patient can still consistently detect the difference in movement and prefers it. A type III error is where the wrong question was asked, which produced an incorrect result. Type III is an error when "you don't know what you don't know." An example may be measuring energy expenditure of an orthosis when the main benefit to the patient was upright walking.3,5
Confuse and Conquer
The last way to prevaricate with statistics is perhaps the most difficult to detect. It requires knowing how the researcher should have designed the experiment, recognizing the inherent weaknesses or issues. Every experimental and methodological design has inherent weaknesses, and these should be explored and discussed. The type III error mentioned previously is very difficult to detect because we do not know what the author does not report. The ethical researcher is very transparent with the inherent issues and data sets or what has been excluded.
Also, you need to have some knowledge of what tests should be used to evaluate the sample, compare the results, and derive relationships. For example, if you are just looking for a relationship, this would be a correlation with a two-tailed test; if you predicted in advance that it would be positive or negative, then it would be a directional, one-tailed test. Regression is used if you want to know whether one factor predicts another, and analysis of variance (ANOVA) if you want to compare the averages of several groups. A multivariate analysis of variance (MANOVA) is used when several outcome variables are compared at once.
There are many more types of analyses, but as a consumer you need to be familiar with the comparisons and tests. Just like food labels, you need to know what is inside and look up the statistical test or simply ask the researcher to explain it. No reputable researcher can resist the opportunity to describe his or her statistical practices.
Gerald Stark, PhD, MSEM, CPO/L, FAAOP(D), is a senior clinical specialist at Ottobock Healthcare, Austin, Texas.
1. Moore, D. A., and D. Schatz. 2017. The three faces of overconfidence. Social and Personality Psychology Compass 11:8. doi:10.1111/spc3.12331. ISSN 1751-9004.
2. Bhattacherjee, Y. 2013. The mind of a con man. New York Times Magazine. https://www.nytimes.com/2013/04/28/magazine/diederik-stapels-audacious-academic-fraud.html
3. Field, A. 2009. Discovering statistics using SPSS (3rd ed.). London: Sage Publications.
4. Coleman, K. 2004. Quantification of prosthetic outcomes with locking pin suspension versus with neoprene sleeve suspension. JRRD 41:4 (591-602).
5. Patton, M. 2015. Qualitative research and evaluation methods (4th ed.). Thousand Oaks, California: Sage.
6. Dillon, M. 2007. Biomechanics of ambulation after partial foot amputation: A systematic literature review. American Academy of Orthotists & Prosthetists 19:8 (2-61).
7. Orthotic Lower Limb and Spinal Manuals. 2016. Northwestern University Prosthetic Orthotic Center, Chicago, Illinois.
8. Higgins, M. 2007. Doctors on the scene acted quickly to treat Everett with cold therapy. The New York Times. https://www.nytimes.com/2007/09/16/sports/football/16everett.html
9. Wong-Baker Faces Foundation. 2016. The Wong-Baker FACES® Pain Rating Scale: Instructions for use. https://wongbakerfaces.org/instructions-use.
10. NTSB. 2012. The investigative process. https://www.ntsb.gov/investigations/process/Pages/default.aspx