

RESEARCH METHODOLOGY: STATISTICS IN MEDICAL RESEARCH 

Year : 2014  Volume : 1  Issue : 1  Page : 62-66

Understanding the basic statistical questions that disturb a medical researcher
Amir Maroof Khan^{1}, Rajeev Kumar^{2}, Pranab Chatterjee^{1}
^{1} Department of Community Medicine, University College of Medical Sciences, University of Delhi, Delhi, India
^{2} Department of Biostatistics and Medical Informatics, University College of Medical Sciences, University of Delhi, Delhi, India
Date of Web Publication: 3-May-2014
Correspondence Address: Dr. Amir Maroof Khan Room No. 409A, Department of Community Medicine, University College of Medical Sciences, University of Delhi, Delhi India
Source of Support: None, Conflict of Interest: None
DOI: 10.4103/2349-0977.131867
Medical research does not draw on the medical sciences alone; it also depends on other disciplines, and statistics is integral to its conduct. It is challenging for a medical researcher to grasp the importance of statistics and to identify the statistical issues that arise in the various phases of his/her research. There are inherent variations within and between the human/animal subjects used in medical research, and these uncertainties can only be grasped using statistical tools. Taking the statistical aspects into account right at the planning stage is one of the best ways to conduct better evidence-based research. The validity of the results of a medical research study depends not only on the methodology of conducting the study but also on the analysis of the data collected. Contrary to the general perception, statistics deals not only with the analysis of data but is also intricately interwoven with the methodology section of the research, where sample sizes, inclusion/exclusion criteria, and other details are specified. Although statistical software simplifies the computational aspect of statistics, confusion about the underlying concepts makes interpretation of the outputs difficult and can lead to incorrect conclusions. Seemingly simple terms such as population, sample, parameter, and variable are explained here keeping the medical researcher's perspective in mind. This first article in the series "Statistics in Medical Research" attempts to help the medical researcher overcome the initial questions that challenge him/her with regard to statistics.
Keywords: Estimation, hypothesis testing, population, parameter, sample, statistic, statistical test, variable
How to cite this article: Khan AM, Kumar R, Chatterjee P. Understanding the basic statistical questions that disturb a medical researcher. Astrocyte 2014;1:62-6
How to cite this URL: Khan AM, Kumar R, Chatterjee P. Understanding the basic statistical questions that disturb a medical researcher. Astrocyte [serial online] 2014 [cited 2019 Aug 17];1:62-6. Available from: http://www.astrocyte.in/text.asp?2014/1/1/62/131867
Introduction   
Initiating a medical research study brings with it statistical considerations that lie beyond the scope of pure medical science. It can seem daunting for the medical researcher to grasp the terminology, concepts, and utility of this domain, which has become essential not only for conducting reliable research but also for understanding published medical research. The application of statistics in published medical literature has increased in both quantity and complexity. ^{[1]} Our teaching experience has shown that basic statistical terms, such as population, sample, parameter, statistic, and data type, are difficult for medical researchers to comprehend. An incorrect understanding of these basic concepts leads to conflict with the statistician, further damaging the researcher's overall understanding of his/her research. Although the availability of user-friendly statistical software makes statistical computations easy for clinicians and medical researchers, insufficient knowledge of the underlying statistical terms and data types may lead to wrong inferences. In this first article of the series on "Statistics in Medical Research," an attempt is made to help the medical researcher appreciate the value of statistics as a facilitating tool in medical research and to explain some basic terms for conceptualizing valid medical research.
Why to have a Statistical Consideration in Medical Research?   
The practical relevance of medical research lies in generalizing the findings from the sample studied, rather than in the information gained about those particular individuals. Researchers are also interested in predicting the results if other groups of patients receive the same treatment that was used in the research. ^{[2]} For the clinician treating a patient, the question can be expressed as: "Is my patient so different from those in the trial that its results cannot help me make my treatment decision?" ^{[3]} What makes these questions difficult to answer is the inherent variation that exists among human beings. Measurements of various characteristics, for example, weight, blood pressure, and serum creatinine, vary from one person to another. Even the same drug or surgical procedure has different outcomes in different patients. For example, a cholesterol-lowering drug may have different side effects in different patients.
Statistics is a tool that enables researchers to gain meaningful insights into these types of data where inherent variations are present both within and between the groups. It helps in estimating the degree of uncertainty associated with the results from the data. Unless statistical thinking is embedded at the planning stage, trying to push it at the final stages of research becomes not only a daunting task but may even lead to wrong conclusions. The importance of statistics in medical research is widely published. ^{[4],[5]}
What issues do Statistics Resolve in Medical Research?   
Broadly, statistics resolves the following two issues associated with medical research.
Estimation
Estimating a population parameter on the basis of a sample statistic is called statistical estimation. The value of the sample statistic is known as the point estimate and is described later in the article. The point estimate may be an incidence, prevalence, mean, correlation, standard deviation, and so on, and it is the best approximation of a certain unknown population parameter. A range of plausible values for the population parameter, computed around the point estimate, is called an interval estimate, popularly known as a confidence interval.
The sample mean is an estimate of the population mean. It will not be exactly equal to the population mean, and there is a degree of uncertainty associated with it. This uncertainty is expressed by the confidence interval, which provides a lower and an upper limit for the plausible values of the population mean. These interval estimates are computed from the sample by using normal distribution theory or bootstrapping. ^{[6]}
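As an illustration of the two approaches just mentioned, the following minimal Python sketch computes a point estimate and two 95% interval estimates for a mean. The blood pressure readings, the function names, and the use of the multiplier 1.96 are assumptions made for illustration only; for a sample this small, a t-distribution multiplier would normally be preferred.

```python
import math
import random
import statistics

# Hypothetical systolic blood pressure readings (mmHg) from 12 subjects.
sample = [118, 126, 131, 122, 140, 135, 128, 119, 133, 127, 124, 138]

def normal_theory_ci(data, z=1.96):
    """Approximate 95% interval: mean +/- z * SD / sqrt(n)."""
    mean = statistics.mean(data)
    standard_error = statistics.stdev(data) / math.sqrt(len(data))
    return mean - z * standard_error, mean + z * standard_error

def bootstrap_ci(data, n_resamples=5000, seed=1):
    """Percentile bootstrap 95% interval for the mean."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.mean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return lower, upper

point_estimate = statistics.mean(sample)
normal_ci = normal_theory_ci(sample)
boot_ci = bootstrap_ci(sample)
print("Point estimate:", round(point_estimate, 1))
print("Normal-theory interval:", normal_ci)
print("Bootstrap interval:", boot_ci)
```

Both intervals are centered near the sample mean; the bootstrap version avoids the normality assumption at the cost of repeated resampling.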
Hypothesis Testing
A hypothesis is a supposition made by the researcher on the basis of previous experience, pilot studies, or the scientific literature; it has limited evidence behind it and requires further investigation to provide more convincing evidence. A hypothesis is framed as two statements: a null hypothesis and an alternative hypothesis.
For example, a study was planned to compare the mean change in mean arterial pressure (MAP), from baseline to 1 month after treatment, with a new therapy as against conventional therapy in newly diagnosed hypertensive subjects. The null hypothesis is "there is no difference in the mean change in MAP between the two therapies" and the alternative hypothesis is "there is a difference in the mean change in MAP between the two therapies."
Because the decision between these two statements rests on a sample, it carries uncertainty. The P value attempts to quantify this uncertainty. A critical analysis of the P value will be presented in a future article in this series.
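To make the MAP example concrete, the sketch below runs a two-sided permutation test on made-up data. The permutation test is a technique chosen here for illustration and is not the method prescribed by the article; the data values, group sizes, and function name are all assumptions.

```python
import random
import statistics

# Hypothetical change in MAP (mmHg) from baseline to 1 month; negative = reduction.
new_therapy = [-12, -9, -15, -11, -8, -14, -10, -13]
conventional = [-7, -5, -9, -6, -10, -4, -8, -6]

def permutation_p_value(group_a, group_b, n_permutations=10000, seed=7):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimated P value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)

observed_difference = statistics.mean(new_therapy) - statistics.mean(conventional)
p_value = permutation_p_value(new_therapy, conventional)
print("Observed difference in mean change:", observed_difference)
print("Permutation P value:", p_value)
```

A small P value indicates that a difference as large as the observed one would rarely arise if the null hypothesis were true.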
What is a Population in Medical Research?   
To a nonstatistician, a population refers to all the persons or animals living in a particular area; to a statistician, a population can extend well beyond human beings or animals. It can be the blood samples collected, the blood banks in a city, the tuberculosis patients attending an outpatient department, the surgeries performed using a particular technique, and so on. The challenge for a medical researcher is to define the correct population for his/her study. It is important that the population be clearly defined, although one may not be able to enumerate it exactly. ^{[7]} For example, in a study to estimate the incidence of preeclampsia among pregnant women attending the antenatal clinic of a government hospital, the population for the study will be all the pregnant women attending the antenatal clinic of that hospital. It is also necessary to specify as many dimensions of this population as may be relevant for interpreting the findings of the study. In this case, one such dimension may be the age group of the pregnant women under consideration, as this may affect the incidence of preeclampsia. If it is not specified, the reader will be at a loss with regard to this important aspect. The reader needs precise information on such matters to draw valid inferences from the sample that was studied to the population being considered. ^{[7]} Furthermore, any exclusion criteria decided by the researcher should also be considered while defining the population. For example, if the researcher excludes pregnant women with a history of preeclampsia in a previous pregnancy, then the incidence rate can be generalized only to pregnant women without such a history. The population can change depending on the objectives of the study.
What is a Sample?   
The set of units included in the study, selected from the population under consideration, is called the sample. A measurement derived from the sample is called a "statistic" or a "point estimate." It is the best guess for the unknown population parameter. For example, the "mean" of a sample is a "statistic." As this "statistic" will be used to project the population estimates, the sample should be chosen in such a manner that it is representative of the population. A sample, even if chosen in a random manner, will not represent the exact situation of the population. Samples can vary in terms of size, the methods by which they are drawn from the population, and so on. There will also be a deviation between the estimates from samples of the same size drawn from the same population, which in statistical jargon is referred to as sampling variation. The error in an estimate is a combination of two errors: random and systematic. Random error is unavoidable because we do not study the whole population but draw a sample from it; dealing with it forms the core of biostatistics. Random error has an inverse relationship with sample size; systematic errors are system-generated errors and can be controlled with proper care.
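The inverse relationship between random error and sample size can be seen in a small simulation. The sketch below is only illustrative: the simulated population of hemoglobin values, the sample sizes of 25 and 400, and the function name are assumptions.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: 10,000 hemoglobin values (g/dL), mean about 13, SD about 1.5.
population = [rng.gauss(13.0, 1.5) for _ in range(10_000)]

def spread_of_sample_means(sample_size, n_samples=500):
    """Draw repeated random samples and return the SD of their means."""
    means = [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]
    return statistics.stdev(means)

spread_small = spread_of_sample_means(25)
spread_large = spread_of_sample_means(400)
print("SD of sample means, n = 25: ", round(spread_small, 3))
print("SD of sample means, n = 400:", round(spread_large, 3))
```

The means of the larger samples cluster much more tightly around the population mean, which is the sampling variation described above shrinking as the sample size grows.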
A sample that does not represent the population may produce a biased estimate. For example, if a researcher is interested in finding the smoking status among the teenagers in a community and collects the data only from the schools located in that area, the sample is not representative because it excludes the dropouts and those not attending school. The estimated prevalence may be biased because smoking may be more prevalent among the dropouts and the teenagers who are not attending school. A representative sample of adequate size, randomly drawn from a well-defined population, forms the most important basis on which the concepts of biostatistics in quantitative studies rest.
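The smoking example can be sketched as a simulation of systematic error. All numbers below are assumptions chosen for illustration: a community of 5,000 teenagers, 80% school attendance, and smoking prevalences of 10% among attendees and 30% among non-attendees.

```python
import random

rng = random.Random(3)

# Hypothetical community: each teenager is a pair (attends_school, smokes).
community = []
for _ in range(5000):
    in_school = rng.random() < 0.80
    smokes = rng.random() < (0.10 if in_school else 0.30)
    community.append((in_school, smokes))

def prevalence(people):
    """Proportion of smokers in a list of (attends_school, smokes) pairs."""
    return sum(smokes for _, smokes in people) / len(people)

true_prevalence = prevalence(community)

# Sampling frame restricted to schools versus the whole community.
school_only = [person for person in community if person[0]]
school_sample = rng.sample(school_only, 500)
community_sample = rng.sample(community, 500)

print("True prevalence in community:", round(true_prevalence, 3))
print("Estimate from school-only sample:", round(prevalence(school_sample), 3))
print("Estimate from community-wide sample:", round(prevalence(community_sample), 3))
```

Increasing the size of the school-only sample would not remove this bias, because the error comes from the sampling frame rather than from chance.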
What are the Parameters?   
A great deal of misconception exists around this term. Many confuse the variables in a particular study with parameters, whereas in statistical terminology the term has an altogether different meaning. Parameters are the characteristics of the population, which are usually unknown and are estimated from the sample statistics. For example, the population mean (μ), which is estimated by the sample mean (x̄), is a parameter. A population parameter is denoted by a Greek letter, whereas a statistic is denoted by a Latin letter. A sample may not give an exact value for a parameter; that is, a sample mean is just an estimate of the population mean and will not be exactly equal to it, but an interval estimate can provide a range of plausible values for the population mean (parameter). However, as the sample size increases, the sample estimate is likely to come closer to the population parameter. The parameters that are usually of interest to a medical researcher are the proportion, mean, correlation, standard deviation, and regression coefficient. The objective of sampling is to provide a value of the statistic that is an unbiased estimate of the parameter. ^{[8]} [Table 1] shows the symbols of commonly used population parameters and sample statistics.
Table 1: Symbols of Population Parameter and their Corresponding Sample Statistic
What are the Variables Involved in a Study?   
Variables are those characteristics that have the potential to vary between subjects and, in some studies, within subjects. The subject may be a human, an animal, or a thing (ie, area, culture media, schools, and so on). Suppose a study is being planned to find out the risk factors of diabetes among adults (>20 years) in a particular area, and the risk factors to be analyzed using statistical methods are age, socioeconomic status, waist circumference, and family history of diabetes. In this example, diabetic status is a variable because it can vary from subject to subject, that is, each subject is either diabetic or nondiabetic. Similarly, age, socioeconomic status, and the others will also vary between subjects. All the characteristics that can vary from subject to subject are the variables involved in this particular study; here, diabetic status, age, socioeconomic status, and the others are between-subject variables. Listing the variables, along with the type of each variable, is another important yet often missed step in initiating a study. Within-subject variables usually arise in longitudinal studies, where the same subjects are measured repeatedly, or in cluster design studies, where subjects are correlated within a cluster.
Variables by Type of Measurement Scales Used   
Variables can be classified in different ways; one is by the type of measurement, where they can be of three main types: (1) categorical (also known as nominal), (2) ordinal (also called rank data), and (3) metric (continuous).
Categorical Scale (Nominal)
These can have either two categories (dichotomous variables) or more than two categories (polytomous variables). Nominal variables do not have any order; all distinct levels have equal status, and each level is referred to by its name, for example, male/female, healthy/diseased, blood group (A, B, AB, O), marital status (married, single, widowed). The diabetic status of a subject is a dichotomous variable because it has two possible outcomes, diabetic or nondiabetic, whereas the type of treatment taken by diabetics can have more than two categories, namely, hypoglycemic drugs, insulin, and nonpharmacological treatment. The proportion is the summary statistic that is computed for a categorical variable.
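As a minimal sketch of this summary statistic, the following Python snippet computes the proportion of subjects in each blood group. The ten observations and the function name are made up for illustration.

```python
from collections import Counter

# Hypothetical blood groups recorded for 10 subjects (a nominal variable).
blood_groups = ["A", "B", "O", "O", "AB", "A", "O", "B", "O", "A"]

def proportions(values):
    """Return the proportion of observations falling in each category."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}

blood_group_props = proportions(blood_groups)
for category, proportion in sorted(blood_group_props.items()):
    print(category, proportion)
```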
Ordinal Scale
If there is some inherent order in the response categories, the variable is known as an ordinal variable. For example, pain can be categorized as mild, moderate, and severe, or socioeconomic status can be categorized as lower, middle, and upper. Here we can observe an inherent order, but we cannot quantify the difference; for example, the difference in pain intensity between mild and moderate is not necessarily equal to the difference between moderate and severe. The categories defined for a particular variable will depend on the objective of the study as decided by the researcher.
Another type of ordinal scale can emerge from the categorization of a continuous variable for the sake of understanding, comparison, and/or analysis, for example, converting the body mass index (BMI) into the categories <25, 25-30, and >30.
Sometimes a variable has responses in the form of a scale, such as the Visual Analog Scale pain score or the Appearance, Pulse, Grimace, Activity, Respiration (APGAR) score. A five-point Likert scale item, such as "What do you feel is the doctor's attitude toward his/her patient?" may have response categories such as very bad, bad, indifferent, good, and very good. Such response categories may be made to range from 0 to 10, 0 to 100, −2 to +2, and so on. Commonly, data collected using Likert scale items are summarized on the basis of a Likert-scale score, which is the sum or average of a set of questions having high consistency. Consistency means that the questions must point in a similar direction and relate to a desired outcome domain. An example is the SF-36 health survey, which measures the quality of life of a patient across eight domains; the sum or average of the scores on the questions in each domain is treated as a continuous variable. Results of single Likert-scale items, however, are seldom reported in medical research, and such reporting should be avoided.
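The computation of a Likert-scale score as the sum or average of several items can be sketched as follows. The 1 to 5 coding, the four items, and the function name are assumptions for illustration; real instruments such as the SF-36 have their own scoring rules, which are not reproduced here.

```python
import statistics

# Assumed numeric coding of a five-point Likert item.
LIKERT_CODES = {"very bad": 1, "bad": 2, "indifferent": 3, "good": 4, "very good": 5}

# One respondent's answers to four hypothetical items from the same domain.
responses = ["good", "very good", "indifferent", "good"]

def likert_scale_score(answers, codes=LIKERT_CODES):
    """Return the (sum, average) of the coded responses for one respondent."""
    scores = [codes[answer] for answer in answers]
    return sum(scores), statistics.mean(scores)

total, average = likert_scale_score(responses)
print("Sum score:", total)
print("Average score:", average)
```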
Continuous Scale
These are variables that can take any numeric value, and the relative magnitude of the values is also important, for example, age, systolic blood pressure, serum creatinine, and so on. A discrete variable is a special case that usually represents counts, for example, the number of children a woman has had or the number of diarrheal episodes. However, when the number of discrete values is large, as with age in years, high-density lipoprotein, low-density lipoprotein, and so on, the variable is considered continuous. Continuous variables can be converted into categorical variables if the objective of the study demands it. For example, the hemoglobin level in blood is a continuous variable, which can be categorized into anemic and nonanemic status if the objective of the study is to find out the prevalence of anemia or to test the association of some other variable with anemic status. Medical researchers often tend to record continuous variables in preformed categories rather than as the raw values obtained. Converting categorical variables into numerical variables later on is impossible, whereas a numerical/continuous variable can be categorized in various ways later on. Hence, this tendency should be avoided: data should not be categorized at the data collection stage, and any conversion from continuous to categorical type should be done at the analysis stage. The categorization of continuous variables leads to loss of information, ^{[9],[10]} and the researcher should know the reason for categorization and the rationale behind the choice of cutoff points, considering its advantages and disadvantages.
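The advantage of recording raw values and categorizing only at the analysis stage can be sketched in a few lines. The hemoglobin values and both cutoffs below are assumptions chosen purely for illustration and are not clinical recommendations.

```python
# Hypothetical raw hemoglobin values (g/dL) recorded at data collection.
hemoglobin = [10.8, 13.2, 11.9, 12.4, 9.7, 14.1]

ANEMIA_CUTOFF = 12.0  # assumed cutoff, for illustration only

def anemia_status(hb, cutoff=ANEMIA_CUTOFF):
    """Categorize one continuous hemoglobin value at the analysis stage."""
    return "anemic" if hb < cutoff else "nonanemic"

status = [anemia_status(hb) for hb in hemoglobin]
anemia_prevalence = status.count("anemic") / len(status)

# Because the raw values were kept, a different cutoff can be applied later.
status_alternative = [anemia_status(hb, cutoff=11.0) for hb in hemoglobin]
alternative_prevalence = status_alternative.count("anemic") / len(status_alternative)

print("Prevalence with cutoff 12.0:", anemia_prevalence)
print("Prevalence with cutoff 11.0:", alternative_prevalence)
```

Had only the anemic/nonanemic labels been recorded, the second analysis could not have been carried out.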
Dependent and Independent Variables   
Another way in which variables can be classified is as dependent and independent variables. These terms are used in regression analysis where the dependent variable is a function of one or more than one independent variable. A dependent variable represents outcome or response or effect of a study, whereas the independent variables are considered to observe the influence on the dependent variable. In the example mentioned earlier, risk factors of diabetes among adults in a given area, we see that the diabetic status of a subject is dependent on various factors such as age, socioeconomic status, waist circumference, and so on. Thus diabetic status is the dependent variable or outcome variable and the other variables are called the independent variables. In a study to find out the association of smoking and lung cancer, it can be said that lung cancer is the dependent variable and smoking the independent variable.
Being aware of the measurement type of variables at the start of the study helps the medical researcher to decide on the summary statistics that will be generated in the study and the statistical tests that will be applicable to it.
Are there Statistics without the Statistical Tests?   
Some medical researchers are of the opinion that biostatistics does not come into the picture unless statistical tests are used in the study. In fact, the principles of biostatistics are relevant and start playing a role from the planning stage of the study itself, that is, in the design of the study, determination of sample size, sampling technique, and data collection, and later in summary statistics, estimation, and so on.
Any point estimate derived from a sample is known as a statistic, whereas statistical tests, such as t-tests, chi-square tests, the z-test, the F-test, and others, come into the picture when hypothesis testing has to be done. Biostatistics can take the form of summary statistics (descriptive), inferential statistics (estimation), and/or statistical tests (hypothesis testing). Thus it is important to understand that statistical tests are just one part of the many concepts of statistics that are applied in research.
Conclusions   
Variations or uncertainties associated with various characteristics form the basis of statistics. Defining a population and then drawing a random sample from it helps in making inferences about the population parameters from the sample statistics using the principles of statistics. Being aware of the types of variables present in the research study aids in understanding the summary statistics that will be generated and also in deciding the statistical tests that can be applied.
Statistical errors comprise errors in study design, wrong selection of statistical methods, incorrect interpretation of results, and errors in reporting statistical summaries, all of which lead to loss of credibility of the inferences and recommendations arising from the study. ^{[11],[12]} To reduce such errors, it is preferable that a statistician be involved from the beginning of the study: data that were analyzed incorrectly can be reanalyzed, but a study design flaw, such as selecting an incorrect sampling technique or misidentifying the types of variables, cannot be corrected later. Adequate knowledge of statistical terms not only helps medical researchers communicate with a statistician but also enhances their ability to reduce errors, report better, and interpret the study results correctly.
References   
1.  Altman DG. Statistics in medical journals: Some recent trends. Stat Med 2000;19:3275-89.
2.  Altman DG, Bland JM. Generalisation and extrapolation. BMJ 1998;317:409-10.
3.  Jacobson LD, Edwards A, Granier SK, Butler CC. Evidence-based medicine and general practice. Br J Gen Pract 1997;47:449-52.
4.  Sprent P. Statistics in medical research. Swiss Med Wkly 2003;133:522-9.
5.  Mandrekar JN, Mandrekar SJ. Biostatistics: A toolkit for exploration, validation and interpretation of clinical data. J Thorac Oncol 2009;4:1447-9.
6.  Davison AC. Bootstrap methods and their application. Cambridge: Cambridge University Press; 1997.
7.  Swinscow TD. Populations and samples. Br Med J 1976;1:1513.
8.  Indrayan A. Medical biostatistics. 3rd ed. Boca Raton: Chapman and Hall/CRC Press; 2012.
9.  Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: A bad idea. Stat Med 2006;25:127-41.
10.  Altman DG, Royston P. The cost of dichotomising continuous variables. Br Med J 2006;332:1080.
11.  Lang T. Twenty statistical errors even you can find in biomedical research articles. Croat Med J 2004;45:361-70.
12.  Strasak AM, Zaman Q, Pfeiffer KP, Gobel G, Ulmer H. Statistical errors in medical research - a review of common pitfalls. Swiss Med Wkly 2007;137:44-9.
