Evaluation and statistics: Growing confidence while being comfortable with uncertainty

Much money and many resources are invested in social development in South Africa. Ultimately, we want to know the impact of all this effort and investment. We need to know whether the programmes and initiatives we implement and champion make the difference in people’s lives they are intended to make. Evaluation and statistics are tools that help us answer questions about programme impact and effectiveness, which in turn guide decisions about programme implementation, management, funding and so forth. Not everyone working in civil society has a background in research and statistics, though, so in this article we explain some of the key evaluation and statistics concepts that those working in civil society will find helpful to know and understand.

How and why we evaluate

Besides gauging the impact of social development initiatives/programmes, evaluation studies help us to better understand the challenges we are addressing; to decide whether we are on track or whether we need to change direction; and to identify whether there are unintended consequences – either good or bad – that need to be taken into account as we move forward.

The best form of evaluation depends on what question you want to answer.   For example, if you want to understand user perceptions of a specific programme, you may choose to use open-ended focus groups (qualitative[1] research) or close-ended Likert-scale questionnaires (quantitative[2]  research) or a combination of both.  If your aim is to use the evaluation to drive quality improvement within the existing programme, you may choose a more participatory form of evaluation, involving staff members or service users in the research.  On the other hand, you may opt for entirely external review if you want to assess whether the impact of the programme is generalizable (has ‘external validity’).

Process evaluation (formative research) can take a variety of shapes, depending on the question that needs to be addressed. Process evaluation helps to expand one’s understanding of the problem and why an intervention may work or not work. It is often a critical strategy for incremental improvement and change. The main limitations of formative research are that: i) its findings are specific to the intervention being evaluated i.e. you cannot generalise, and ii) you can’t be sure whether any observed effect can be attributed to the intervention itself. A common mistake is to regard process evaluations as impact evaluations. Bringing in an external consultant after the fact to review the programme and comment on the monitoring data does not constitute an impact evaluation.

On the other hand, impact evaluation asks whether the intervention works or not, but may contribute little to our understanding of why it works.  It is inductive in that it tries to generalise from specific findings to a general understanding. However, a common error among programme implementers is over-confidence in the generalizability (external validity) of the findings of the impact study.  People assume that the intervention caused the observed finding, without excluding the possibility that other factors may have been responsible for the change. Or people assume that the programme participants are just like anybody else, when in fact they may not be.  They may have come to the programme because they were particularly motivated, or had greater access to information, or were more educated.

These two factors – the possibility of confounding factors and systematic biases – are the most important factors to eliminate in impact evaluation.  If you can say with a high degree of confidence that i) your findings show a real change associated with your intervention; ii) that these changes are independent of other possible causal factors and; iii) are not systematically biased, you can attribute the changes to your intervention.   In order to have this confidence, you need to be able to understand and interpret basic statistics.

Basically, how statistics work

The aim of statistics is to come up with the best estimate of what is really happening within a group of people. In technical-speak, statistics try to describe the real distribution of ‘events’ within a defined population. An event (occurrence) might be the mathematics test score obtained by a Grade 3 student. The ‘real distribution of events’ would be the pattern of scores obtained by all the Grade 3 learners in South Africa (see Figure 1). In South Africa, we would expect the distribution pattern to be skewed towards the end with lower scores because most South African children don’t perform well in maths, but for the sake of illustration, we are using a normal distribution[3] (‘bell-curve’). To compare one distribution pattern to another, it needs to be described in standard statistical terms, most commonly the mean (average), the median and the mode[4].

Figure 1. A schematic distribution of maths scores for the universe of Grade 3 learners in South Africa
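To make these three summary measures concrete, here is a minimal sketch using Python’s built-in statistics module and a small set of invented maths scores (the numbers are purely illustrative):

```python
import statistics

# Invented maths scores for a small group of Grade 3 learners
scores = [35, 40, 45, 45, 50, 55, 60, 45, 30, 50]

mean = statistics.mean(scores)      # the average of all the scores
median = statistics.median(scores)  # the middle score when sorted
mode = statistics.mode(scores)      # the most common score

print(mean, median, mode)
```

For this sample the mean is 45.5 and both the median and mode are 45 – close together, as we would expect for roughly symmetrical data.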

If the results of every student were obtained and plotted, there would be no uncertainty as to the distribution of the scores. If a baseline were done in 2011, and then repeated in 2013, and all scores plotted, you could see whether the distribution of scores had changed over the two-year period (Figure 2). It so happens that the Department of Basic Education conducts such a study, and so there is no need to worry about the external validity of the findings i.e. do these findings apply to all Grade 3 children in South Africa? We can say with absolute confidence that the answer is YES (assuming the reliability of data collection). But for most events, it is not possible to measure each and every occurrence, and so we need to estimate the distribution of events from a sample of the population. The distribution is then described in statistical terms from the results of the survey, which generates both point estimates for each measure as well as estimates of the range of measures observed. Each observed event is then called an observation.

Figure 2. Hypothetical changes in the maths scores for the universe of Grade 3 learners in South Africa 2011 – 2013

Let us assume that our aim is still the same, namely to determine the distribution of Grade 3 maths scores in the country.  But this time, there is no national survey and we have to base our estimates on a sample of Grade 3 children.

Gaining confidence in study results

If we decide to draw a sample of 100 children across the country, from a variety of different school settings, we will get a wide spread of scores. But because we have so few children in the sample, we will only be able to choose a maximum of nine children per province if we want to include children from all provinces. In effect, a sample of 100 means that we expect nine children per province (plus one ‘wild card’) to represent the scores of all their classmates, and 100 children in South Africa to represent all Grade 3 children. The result is likely to be a distribution with a lot of variability, and a lot of uncertainty as to whether the observed distribution really represents the score pattern for all Grade 3s (Figure 3).

Figure 3. A sample of 100 provides little confidence that the observed pattern represents the real distribution of events

In order to increase our confidence in the results, we need to increase the sample size. If we increase the sample size to about 1 000 children, we will roughly approximate the real distribution of Grade 3 maths scores – give or take 3%. This is called the ‘margin of error’ (Figure 4).

Figure 4. Increasing the sample size to 1 000 is more likely to produce a pattern that reflects the true distribution of Grade 3 scores
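The relationship between sample size and margin of error can be illustrated with the standard worst-case formula for a proportion estimated from a simple random sample. This is a sketch only – real surveys with clustered or weighted designs need more careful calculations:

```python
import math

def margin_of_error(n):
    """Worst-case 95% margin of error for a proportion estimated from a
    simple random sample of size n (uses p = 0.5, the most cautious case)."""
    return 1.96 * math.sqrt(0.5 * 0.5 / n)

# Note how doubling the sample from 1 000 to 2 000 only shaves about
# one percentage point off the margin of error
for n in (100, 1000, 2000):
    print(n, round(margin_of_error(n) * 100, 1), "%")
```

This reproduces the pattern described in the text: roughly ±10% at n = 100, ±3% at n = 1 000 and ±2% at n = 2 000.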

As stated above, you need to decide how much confidence you require to believe that the findings are a true reflection of the real distribution of scores. If you decide that a margin of error of ± 3% is still too much, you could increase the sample size to 2 000. But, as Figure 5 demonstrates, the difference in the results between the sample of 1 000 and the sample of 2 000 is likely to be relatively small (because you had already accounted for 94% of the variability with the sample of 1 000). Your margin of error would be reduced to ± 2% at double the cost.

Figure 5. Increasing the sample size to 2 000 would only add a little extra confidence that the observed results reflect the true distribution of Grade 3 scores

However, if you wanted to establish, with the same degree of confidence, whether boys and girls have the same distribution of scores, you would need to increase the sample size to 2 000.  If you wanted to disaggregate the findings by province – yet retain the same confidence in the results – you would need to increase the sample size substantially.[5]

The graphs above show that your level of confidence depends on the degree of variability between each event that you measure. In Figure 6, let’s consider each percentage-point score as ‘the event’ and the number of children as the variable factor. The graph shows the scatter of learner scores for each percentage. If, for example, the number of children scoring 45% is similar to the number scoring 46% and the number scoring 47%, the scatter plot will be tight. But if there is great variation among the numbers of children receiving 45%, 46% and 47% respectively, the plot will be more scattered and the graph will have less form. Consequently, we will be less certain that the scatter plot represents the true population.

Figure 6. Hypothetical scatter plot of the frequency of Grade 3 test scores

There are a variety of statistical tests that measure the variation between each and every observation.  For our purposes, it is not necessary to detail these any further, except to note that there are different tests for distributions that are normal (bell-curved) and those that are skewed.  Those that describe normal distributions are called parametric tests, while those that describe other asymmetrical distributions are called non-parametric tests.  We need to be alert to this fact when assessing the impact of interventions, as sometimes statistical tests are used inappropriately.  A common error is to assume that the underlying distribution is bell-curved, when in fact it is skewed.
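To see why this choice matters, the short sketch below simulates a skewed distribution (with invented numbers) and shows how its mean and median pull apart; for a true bell-curve they would coincide, which is exactly what parametric tests assume:

```python
import random
import statistics

random.seed(42)

# Simulate a skewed distribution (values bunched at the low end with a long
# tail of high values) using an exponential distribution with mean 40
skewed = [random.expovariate(1 / 40) for _ in range(10_000)]

mean_val = statistics.mean(skewed)
median_val = statistics.median(skewed)

# For a skewed distribution the mean is dragged towards the long tail,
# so it sits well above the median
print(round(mean_val, 1), round(median_val, 1))
```

Reporting only the mean of such a distribution (as a parametric test implicitly does) would overstate the ‘typical’ value; the median is the safer summary here.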

Used correctly, these tests generate a measure of confidence in the results, known as ‘p’. Essentially, ‘p’ is the probability of observing a difference at least as large as the one found if, in reality, there were no true difference – that is, if the result were produced by the sampling process alone. If p = 0.05, there is only a 5 in 100 chance that a difference this large would arise by chance; if p < 0.001, less than 1 in 1 000. So, if you find a difference in maths scores between two groups of learners, and the statistical test used generates a ‘p = 0.02’, you can say with a high degree of confidence that the outcomes between the two groups reflect a true difference – and not simply an artefact produced by the sampling process.

Sometimes, the degree of confidence is presented as a range within which observations are likely to occur. A finding will often be expressed as a point estimate plus its 95% confidence interval (e.g. average score of 35.4%; 32.1-39.2). In effect, the full width of the 95% confidence interval is twice the margin of error mentioned above. This is a very helpful way of expressing research findings because it gives a good idea of how much ‘play’ there is before the findings lose their validity (based on internationally accepted standards). If the 95% confidence intervals for an intervention group and a control group overlap, the difference between them is unlikely to be statistically significant.
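As a rough sketch of how such an interval is computed (invented scores, and the simple ‘point estimate ± 1.96 standard errors’ approximation, which assumes a reasonably large sample):

```python
import math
import statistics

scores = [35, 40, 45, 45, 50, 55, 60, 45, 30, 50]  # invented sample of maths scores

mean = statistics.mean(scores)
# standard error = sample standard deviation / square root of the sample size
se = statistics.stdev(scores) / math.sqrt(len(scores))

# Approximate 95% confidence interval: point estimate +/- 1.96 standard errors
lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(f"{mean:.1f} (95% CI {lower:.1f}-{upper:.1f})")
```

A larger sample shrinks the standard error and therefore narrows the interval, which is the ‘confidence grows with sample size’ point made earlier.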

Faking confidence – the abuse of statistics to make a bad case

‘There are three kinds of lies: lies, damned lies, and statistics.’ This famous statement was attributed by Mark Twain to the 19th Century British Prime Minister Disraeli.[6]

It is true that statistics are often abused to bolster a weak argument.  But this does not mean that statistics are ‘dodgy’; rather that people often take advantage of a lack of knowledge of sound methodology to make statements that cannot be supported by fact.  For that reason, it is important that we understand the common abuses of statistics.  Some of these have been mentioned above, but are repeated here for emphasis.

When observed changes are attributed to an intervention post hoc

This is sometimes called data dredging. A range of statistical tests is employed to try to find meaning in results after the intervention has been conducted. While data dredging may reveal some interesting and useful findings, it can never be used to say that the intervention caused the change.

When observed changes are attributed to an intervention without excluding possible biases

Let us return to the example of the Grade 3 cohort of learners. Suppose we conduct a survey of schools but decide to use an online questionnaire for school principals in order to reduce the survey’s travel and fieldworker costs. Most probably, we would find that the average maths scores were much higher than expected – and the entire distribution of results would be shifted to the right. The reason for this would be that schools with access to the Internet tend to be better resourced and produce better results. If we wanted to conduct a survey that represented all Grade 3 learners in South Africa, we would need to take a random sample of schools across the country to eliminate the systematic bias caused by resource differentials in South Africa. We would also need a survey methodology that ensured that every person selected as part of the sample had an equal chance to participate. If we don’t, we may end up with systematic biases such as ‘non-response bias’ (where some people are less likely to participate) or ‘self-selection bias’ (where some people are more likely to participate).
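The random sampling step itself is straightforward to sketch. Here a simple random sample of schools is drawn from a hypothetical national list (the names are invented), giving every school the same chance of selection:

```python
import random

random.seed(11)

# Hypothetical national list of schools (the 'sampling frame')
all_schools = [f"school_{i:04d}" for i in range(1, 5001)]

# A simple random sample gives every school an equal chance of selection,
# avoiding the systematic bias of surveying only schools with Internet access
sampled = random.sample(all_schools, 200)
print(len(sampled))
```

Of course, drawing the sample fairly is only half the job; the methodology must still ensure that every sampled school actually responds, or non-response bias creeps back in.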

Another common form of systematic bias is ‘missing variable bias.’  This is particularly important in regression analysis, where researchers try to establish what factors are independently associated with changes in a particular variable.  For example, you may want to establish what factors are associated with the Grade 3 maths scores observed above.  Based on international evidence, we can hypothesise that variability in these scores is a function of: antenatal and perinatal well-being; early childhood development, including good nutrition; family support and child protection; access to early maths and language literacy strategies; well-trained and motivated teachers; and properly resourced and managed schools.

We could categorise these factors into those that affect the supply of teaching (the later items in that list) and those that affect the ability to learn (the earlier ones). Supply-side factors are more easily measured – and it is therefore no surprise that they emerge consistently as critical success factors for Grade 3 maths scores. But these factors are also far more prevalent in wealthier schools. In the absence of measures of learning ability, we attribute the variability in scores to the supply-side factors, and could be guilty of ‘missing variable bias.’ The possibility even exists that, if these missing variables were properly accounted for (through regression analysis, for example), the effects of supply-side strategies may be shown to be insignificant!

It is not always possible to exclude all systematic biases.  In South Africa, a common systematic bias is lower rates of survey participation by wealthier people, who live behind high walls and are less accessible to fieldworkers.  Furthermore, some factors are not easy to measure – such as social and economic marginalisation.   A lot of the variability in the prevalence of HIV among white and black South Africans can simply not be explained – even when self-reported sexual behaviour and proxies of socio-economic status are taken into account.  Clearly, there are missing variables (possibly related to cultural practice or the psychological effects of socioeconomic marginalisation) that account for most of the variability. The point is to understand these biases and to consider their possibility in evaluation design and when interpreting findings.

Some biases are more variable and can be mitigated or eliminated altogether. For example, a survey of school attendance may be affected by weather patterns and conducting the survey over a longer period of time, through both sun and rain, may eliminate this variation. (You could also include weather variation as a factor in your regression analysis!).  But if the variability is random and not systematic, it will not affect the validity of your results.

When observed changes are attributed to an intervention without excluding possible confounding factors

The preliminary findings of a home- and community-based intervention show that children score significantly higher on psycho-social and cognitive tests after participating in early childhood development programmes.  At first glance, that’s cause for celebration.  But, similar tests conducted on the control groups of children (i.e. those who received no intervention) found the same improvements.  In other words, there was no statistically significant difference between the scores of children who participated in the programme and those who did not.  The observed improvement in scores was a function of child maturation.  In other words and unsurprisingly, as the children in both the intervention and control groups grew older, they scored higher.

These preliminary findings highlight the risk of ignoring factors that may confound the interpretation of results. If the evaluation strategy did not include control groups, champagne corks would have popped prematurely.[7] When the purpose of the evaluation is to establish the impact of an intervention, it is always necessary to compare the findings for participants in the programme with those for a similar group of people who had no exposure. The best way to pick a similar group of people is to randomly assign people either to participate in the programme or to be part of the control group. This is called a randomised controlled trial, and is often regarded as the gold standard of impact evaluation. In real life, it is often impossible to assign people randomly, because programme participants either self-select or the programme’s effects cannot be restricted to the benefit of a single individual and may spill over into the broader community[8]. Often the best that can be achieved is to identify a ‘matching community’ that has similar characteristics but is well away from the intervention site, and to randomly select the control group from that community. This is known as a quasi-experimental design (as opposed to a true experimental design).
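Random assignment is simple to sketch in code. Here hypothetical participant IDs are shuffled and split into intervention and control groups, so that at baseline the two groups differ only by chance:

```python
import random

random.seed(7)

# Hypothetical participant IDs
participants = [f"learner_{i:03d}" for i in range(1, 101)]

# Shuffle, then split: each person has the same chance of landing in either
# group, so any pre-existing differences are spread evenly between the groups
random.shuffle(participants)
intervention = participants[:50]
control = participants[50:]
print(len(intervention), len(control))
```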

False confidence – when no impact does not mean programme failure 

The influential philosopher of science, Karl Popper, argued you can never prove that an intervention works; you can only show that it doesn’t work.  However, failure to demonstrate impact does not necessarily mean that the programme is a failure.  There are a number of reasons why an evaluation may fail to show an effect when in reality there is impact:

When the evaluation is insufficiently powered to show differences between an intervention and control group

Statistical power is the ability of a statistical test to detect true differences between a group that has participated in an intervention/programme and one that has not – ‘true difference’ meaning that the difference is caused by the intervention and is not due to chance. Generally, statistical power increases with sample size (the number of people included in the study). For example, say the intervention is providing extra maths classes to struggling Grade 3 learners. If one sees a slight improvement in the scores of a few participating learners (versus those who did not participate in the classes), it might be a result of the extra classes or it might be due to other factors. If, however, one sees a slight improvement in the scores of a large number of participating learners, it becomes more likely that the extra classes are making the difference.
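The point can be illustrated by simulation. This sketch (with invented score distributions) estimates how often a simple test detects the same true effect at two different sample sizes:

```python
import random
import statistics

random.seed(3)

def estimated_power(n, effect=3.0, sd=10.0, trials=500):
    """Estimate power by simulation: draw intervention and control samples
    of size n from hypothetical score distributions, and count how often a
    simple z-style test detects the true effect at roughly the 5% level."""
    detected = 0
    for _ in range(trials):
        control = [random.gauss(50, sd) for _ in range(n)]
        treated = [random.gauss(50 + effect, sd) for _ in range(n)]
        diff = statistics.mean(treated) - statistics.mean(control)
        se = (sd ** 2 / n + sd ** 2 / n) ** 0.5  # standard error of the difference
        if abs(diff) / se > 1.96:  # roughly a 5% two-sided test
            detected += 1
    return detected / trials

power_small = estimated_power(50)    # small sample: the real effect is often missed
power_large = estimated_power(400)   # large sample: the same effect is usually found
print(power_small, power_large)
```

With 50 learners per group the simulated study misses the real 3-point effect most of the time; with 400 per group it almost always finds it – ‘no significant difference’ in a small study is therefore weak evidence of no effect.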

The size of the effect that you expect a successful intervention to have also matters for statistical power. In the extra maths classes example, the effect size describes whether the classes have a small, medium or large effect on maths scores. If you expect the effect to be large – for example, if you expect participating learners to improve from barely passing to passing with distinction – you will need a smaller sample size to achieve the same statistical power than if the effect size is smaller (i.e. you are only hoping for a slight improvement). Because it is so unlikely that a number of struggling learners would be able to pass with distinction from a very low baseline as a result of anything but the intervention, one need only see this happen in a relatively small number of cases to be quite confident that it is due to the extra classes (provided you control for things like cheating).
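A common back-of-the-envelope formula makes the trade-off explicit. This sketch uses the standard approximation for a two-group comparison of means at 5% significance and 80% power; it is illustrative only, and a statistician should confirm sample sizes for a real study:

```python
import math

def sample_size_per_group(effect, sd):
    """Approximate sample size per group to detect a difference in means of
    size `effect` (with standard deviation `sd`) at 5% significance and
    80% power. 1.96 and 0.84 are the usual z-values for those thresholds."""
    return math.ceil(2 * ((1.96 + 0.84) * sd / effect) ** 2)

# A large expected effect needs far fewer learners per group than a small one
print(sample_size_per_group(effect=10, sd=10))  # large effect: a handful of learners
print(sample_size_per_group(effect=2, sd=10))   # small effect: hundreds of learners
```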

In his book, The Tipping Point, Malcolm Gladwell points out that a small effect achieved across an entire population can have a large cumulative impact. For example, if many of us start to do something relatively small like being extra kind to pregnant women, it will change how new mothers experience their pregnancy, which might have positive effects on developing babies. This means that it may be quite warranted to expect to see a small effect only.  A small effect size doesn’t necessarily imply that the intervention is insubstantial, but it does require a large sample size to show the effect with confidence.

When the real impact is difficult to measure

It is easy to measure changes in Grade 3 maths scores or the percentage of children who have access to child support grants. But many aspects of human potential are far harder to measure – such as a heightened sense of personal motivation or community cohesion. Yet there is growing recognition that these factors are intrinsic to both social and economic development. Knowledge of how to measure these less tangible factors is increasing, and a number of internationally validated scales are emerging. But not all impact assessment can be reduced to a set of measurable indicators, and the limitations of quantitative[9] survey designs should be made as explicit as the limitations of qualitative[10] research.

Often, too, there are unexpected consequences of interventions – which may be good, bad or indifferent for the study population. For example, if students in a psychosocial support programme had to meet their mentor/coach at the campus library, they might become more familiar and at ease in the library and therefore be more likely to take out books and read after the intervention. Sometimes these unexpected consequences are detectable through the survey measurements, but they tend to go undetected. Supplementary qualitative research is often very useful in identifying possible unintended consequences and enriching our understanding of the impact of interventions.

When a good programme is implemented poorly

It is not unusual for a programme to be written off on the basis of insufficient evidence of success when it could have had real impact had it been properly implemented. Programme fidelity means that the programme is implemented with as much rigour as its design requires. Programme fidelity starts with a clear conceptual framework (‘intervention construct’ or ‘theory of change’) of how the anticipated changes are to be brought about. Where relevant, it also spells out the ‘intervention dose’ that would be required before any changes could be expected. If, for example, an intervention aims to achieve teacher development through sustained and intensive in-service support and mentorship, the intervention construct should quantify what is meant by ‘sustained and intensive’ so as to enable monitoring of programme implementation. This monitoring allows for the assessment of ‘dose-response’ i.e. whether the impact changed with different levels of intensity of intervention.

When a well-designed evaluation study is badly administered

We often assume that researchers do a good job.  But that is not necessarily the case.  Poorly trained field workers, lax supervisors and shoddy data capturers can all reduce the reliability of the findings. The fidelity of the research is as crucial as the fidelity of implementation, and the study should be designed with sufficient systems of quality control.

Confidence is good, but uncertainty is not bad

Uncertainty must be embraced not as an excuse for failure to meet our goals, but as the condition through which change can happen.  We should be thrilled to struggle through the sticky goo of public innovation rather than feel cemented into ossified systems of state or society that will not change. 

Both commercial and civil society innovators face uncertainty as to whether their products (new technologies or programmes) will work and whether there will be a demand for them. Both types of innovators can reduce the uncertainty of product viability by predicting their future markets, by rigorous design and testing, and by thorough development and enterprise management. Innovators that disrupt markets set aside space and resources for exploration and discovery – without trying to tie every product to existing demand – and so create markets where none existed. There must always be room for exploration and discovery that is not linked explicitly to an established market, issue, demand or need.

The social value of a commercial innovation is signalled through its price, set by an efficient market of demand and supply. On the other hand, public benefit organisations operate in a space of market failure, where, from the start, either the price signals the interests of only part of a tiered market or there is little individual incentive to pay for products that lead to communal benefit. In economic theory, the government therefore intervenes to pool financial capital and redistribute national resources to pay for public goods. The interposition of government as the mediator of national scale and coverage increases both the potential impact and the uncertainty of civil society work. In other words, the net expected benefit is bigger, but civil society has far less control over the outcomes.

There are other forms of capital that cannot be expressed in classical economic terms – social and cultural capital, for example – that are increasingly recognised as crucial factors for development. These ‘relational inputs’ can be generated by government, the corporate sector or civil society; they cannot always be quantified, yet they are essential for national well-being, growth and productivity. In this regard, civil society typically plays a critical role in ensuring that the poorest people in society have access to, and can themselves produce, these forms of capital. The relational aspects of civil society work – building social and cultural capital and trying to promote justice by changing the ‘terms of recognition’ of people – cannot be reduced to numbers. As economists increasingly recognise, ‘trust’ is a crucial factor in national development.

We cannot measure everything, and in some cases we need to be comfortable with that; at the same time, even in such circumstances, we must find ways to ensure proper and meaningful accountability. There is a tendency to chew over what is already known and ignore what we don’t know. Yet new strides in human development often rest as much on what we don’t know as on what we do.

For more information on how to do effective Measurement and Evaluation please visit our toolbox which can be found here. Once you have effective M&E running for your programme, you will eventually need to report on the findings. Read our article on how to communicate your work and achievements to funders here.


References:

[1] Referring to descriptive data that answers questions like why something happened or not, how it worked when it did etc.

[2] Referring to measurable data, for example, did something happen or not, to what extent and how often.

[3] A normal distribution is an arrangement of a data set in which most values cluster in the middle of the range and the rest taper off symmetrically toward either extreme. Many things closely follow a normal distribution, for example, the heights of people. This means most people are of average height with smaller numbers of people being much taller or much shorter.

[4] For a normal bell-curve distribution, the mean, median and mode fall on top of one another, but for skewed distributions, they may be quite far apart.

[5] Your sample size would not necessarily increase nine times, because it would be weighted by the population size in each province. There are calculations that determine the exact sample size related to a specific margin of error.

[6] In fact, according to Wikipedia, there is no record of Disraeli having made this statement, and current evidence suggests that it was first stated by Sir Charles Wentworth Dilke.

[7] These findings do not exclude the possibility that the intervention made a difference, especially because the sample sizes were small.  This will be discussed in the next section.

[8] In the disinterested language of science, this is called programme contamination!

[9] Referring to measurable data, for example, did something happen or not, to what extent and how often.

[10] Referring to descriptive data that answers questions like why something happened or not, how it worked when it did etc.