Improving Administrative Data at Scale: Experimental Evidence on Digital Testing in Indian Schools (2024)


Abhijeet Singh, Stockholm School of Economics, Sweden

Corresponding author: Abhijeet Singh, Department of Economics, Stockholm School of Economics, Sveavägen 65, Stockholm 113 83, Sweden. Email: abhijeet.singh@hhs.se


The Economic Journal, Volume 134, Issue 661, July 2024, Pages 2207–2223, https://doi.org/10.1093/ej/ueae017

Article history

Received: 07 July 2022
Accepted: 01 March 2024
Published: 09 March 2024
Corrected and typeset: 20 May 2024


Abstract

Large-scale student assessments are a cornerstone of proposed educational reforms to improve student achievement from very low levels in low- and middle-income countries. Yet, this promise relies on their presumed reliability. I use direct audit evidence from a large Indian state (Andhra Pradesh) to show that, as currently administered, official learning assessments substantially overstate proficiency and understate the ‘learning crisis’ of low student achievement. In an experiment covering over 2,400 schools, I evaluate whether digital tablet-based testing could reduce distortion. Although paper-based assessments proctored by teachers severely exaggerate achievement, tablet-based assessments closely match independent test data and are much less likely to be flagged for cheating. Furthermore, I use the direct audit-based retest to directly validate existing (indirect) statistical procedures for detecting cheating at scale and establish that it would be feasible to monitor data integrity cheaply and at scale with such methods. Overall, these results suggest that well-designed technology-aided interventions may improve data integrity at scale, without which these learning assessments are unlikely to serve as a catalyst for policy action.

There is a widespread ‘learning crisis’ in developing countries where, despite substantial increases in school enrolment and average years of schooling, student learning remains very low (Pritchett, 2013; Glewwe and Muralidharan, 2016). Pivoting education systems to prioritise learning, and not just access, has therefore emerged as one of the most important development challenges that faces low- and middle-income countries today (World Bank, 2018).

One important element in this challenge is to collect reliable official measures of student achievement at scale. These are necessary to underpin a common policy understanding of the level and trends in achievement, for prioritising support to those schools and students most in need, and for instituting reforms that target information or accountability. Unsurprisingly, instituting official data systems focused on student testing has featured prominently in the policy agendas of national governments and international aid agencies.1 Yet, especially in settings of weak governance where such reforms might be most necessary, large-scale learning assessments are often undermined by widespread student copying and misreporting by teachers. This calls the usefulness of such assessments into question and illustrates how the unreliability of official data on student learning acts directly as a constraint on effective policy making.2

In this paper, I evaluate the potential of digital testing to address the challenge of data integrity at scale. Digital testing on computers or tablets may improve data integrity in several ways: student copying can be made harder by assigning different questions to students and designing the user interface appropriately; automatic grading removes the discretion of individual teachers to inflate grades; and detailed item-level information, which is not typically available with manual grading, allows for closer scrutiny of cheating. However, digital tests also carry risks: the unfamiliarity of computerised tests in this setting, combined with logistical constraints and the lack of appropriate infrastructure, may render the assessments uninformative or infeasible to implement. Despite increasing policy interest, there is little direct evidence on the reliability of digital testing for measuring student learning in low- and middle-income countries (LMICs).

I study this question using a large-scale field experiment covering grade 4 students in over 2,400 schools in one district in Andhra Pradesh state. The experiment randomly allocated schools to two treatment arms. In the benchmark case, covering 768 schools, school staff administered paper-based tests to students in three subjects; these assessments used multiple test booklets and were graded centrally to replicate the ‘best-case’ scenario for official tests in India. In the intervention arm, covering 1,694 schools, students took the same tests, but on tablets brought to the school by a cluster resource person. These tests were designed to be diagnostic and carried no formal stakes for students or teachers. Both paper and tablet assessments were conducted by government officials, as they would be in any census-based assessment, and not external staff. Thus, this study represents an evaluation at scale likely to identify a policy-relevant treatment effect (see Muralidharan and Niehaus, 2017; Vivalt, 2020).

I measure distortion in two ways. First, an independent research team conducted an externally proctored paper-based retest, to serve as an external benchmark, in 117 schools. Absolute differences between the two assessments in the percentage of correct responses, to the same test items by the same students, provide a direct measure of cheating. Second, I use the item-level data on students’ responses in the official assessments to implement the procedure used by Angrist et al. (2017) (ABV hereafter) to flag classrooms with potential manipulation in Italy.3 This allows me not only to benchmark the magnitude of cheating against the international literature on detecting cheating, but also, importantly, to provide direct validation of such procedures for use in a very different context. Since statistical procedures for detecting cheating are much more scalable than retest-based audits, establishing their validity is an important step for their use in such settings.

There are three main results. First, there is evidence of substantial distortion in business-as-usual tests in this setting. Using the ABV procedure, 38%–43% of classrooms in the paper-based testing arm are flagged for cheating, a prevalence rate three times higher than documented by Battistin et al. (2017) in southern Italy. In the retest sample, students score 16–20 percentage points higher, on average, in the teacher-administered paper-based tests than in the independently proctored retest. Second, distortion is substantially reduced in tablet-based testing. In contrast to paper-based tests, only 2%–5% of tablet-based test classrooms are flagged as potentially cheating by the ABV procedure. The difference between the official assessment and the retest is much smaller in the tablet-test arm, and not statistically distinguishable from zero in most cases. Third, I find substantial correspondence between direct audit-based measures of distortion and the ABV procedure, which indicates that procedures to detect cheating could be scaled up effectively with item-level data, even where independent audits are not possible or too expensive.

This paper contributes to three areas. First, it complements a large literature that evaluates interventions to improve student learning in developing countries (see Glewwe and Muralidharan, 2016). Many of the most promising reforms in this literature take reliable data on test scores as a pre-requisite. A prime example is performance-based pay in education: while this has been shown to have positive effects in large-scale experimental studies in India, East Africa and China (Muralidharan and Sundararaman, 2011; Loyalka et al., 2019; Mbiti et al., 2019; Leaver et al., 2021), in each of these experiments, the test data triggering payments were collected either directly by research teams or their NGO partners. Similar concerns arise with, for example, report-card interventions using high-quality researcher-administered assessments where experimental studies report large gains (Andrabi et al., 2017; Afridi et al., 2020). Ensuring data integrity can often be a binding constraint to scaling these reforms; my results suggest that digital testing could reduce this risk sharply, even in settings of weak governance.

Second, it adds to a body of work that evaluates technology-led solutions for system-wide public sector reform in developing countries. Recent papers have, for instance, looked at technology-aided solutions in procurement (Lewis-Faupel et al., 2016), social security payments (Muralidharan et al., 2016; Banerjee et al., 2020) and voting (Fujiwara, 2015). In these applications, technology plays three crucial roles: (a) it improves the timeliness, standardisation and detail with which information becomes available, allowing for timely verification, (b) it reduces the risk of manipulation by circumventing corrupt actors and (c) it allows for scale-up with fidelity. Tablet-based assessments play the same role here: they circumvent potential grade manipulation and make copying harder by randomising questions, they make detailed item-level data available for analysing suspicious response patterns and they may have the potential for universal scale-up.

Third, this paper also extends, in both methods and scope, an extensive literature that has examined manipulation of test scores by teachers or cheating by students.4 Specifically, I provide the first direct audit measures of distortion, in a new setting, and present a large experimental evaluation of a potential reform. I also directly validate, against audits, the efficacy of indirect procedures that flag suspicious response patterns to detect cheating. Results in this paper on the use of technology are similar to Borcan et al. (2017), who documented a sharp reduction in cheating with the use of CCTV cameras in Romania (in a much higher-stake setting). Substantively, my results highlight that the banality of cheating in some settings forces a shift in focus from trying to detect (relatively rare) cheaters, as in Jacob and Levitt (2003), to reducing manipulability in the system as a whole. In Section 4, I further discuss the implications of these results for policy and for the potential use of such data in research.

1. Study Design

1.1. Background and Setting

This study is based in the state of Andhra Pradesh, which had a population of 52.8 million in 2022. Education outcomes in this state are better than the all-India average, but still poor in absolute terms: in 2018, 59.7% of grade V students, across private and government schools, could read a grade II level text and 39.3% could divide (ASER, 2019).

The Government of Andhra Pradesh wanted to test and implement a report card intervention informing parents about learning levels of their children, and average achievement in local schools.5 At scale, this requires the underlying information collected by government systems to be reliable. Thus, the government authorised a large-scale pilot evaluation to assess if tablet-based assessments could, at scale, measure learning levels more accurately than traditional paper-based tests. I helped design the evaluation in collaboration with the Central Square Foundation. The experiment was conducted in February 2019 in the Prakasam district (see Online Appendix Figure A1). This district is close to state-level averages across several educational indicators (ASER, 2019).

1.2. Experiment Design

1.2.1. Intervention

The evaluation focused on children in primary schools, the principal focus of policies targeting foundational learning. All schools with at least five students in grade 4 were assigned to administer either paper-based or tablet-based tests. All grade 4 students present on the day were tested in mathematics, Telugu (language) and English.6 Both treatment arms used centrally designed test papers that were identical in test questions and ordering across treatment arms.7 All questions in the test booklets were multiple choice items intended to capture a wide range of variation. Three test booklets (‘sets’) were created in each subject. Each booklet contained thirty questions, of which twenty-four were common across all sets (and underlie all analyses here).

In schools assigned to paper-based testing, booklets were sent with clear instructions for schools to administer the tests and for the answer scripts to be returned to the Mandal Education Office (the administrative unit above schools) for grading. This reference mode of testing, with multiple sets and external grading, resembles the best-case scenario for large-scale paper-based assessments in India.

In the digital testing arm, the tests were staggered across schools. Tablets were taken to the school by a cluster resource person (CRP). The CRP is a frontline official who acts as a link in program delivery between schools and the (higher) education bureaucracy. Each student was given an individual tablet to work on. The test booklet for each subject was decided by the software directly. In both treatment arms, students could use their own scrap paper for calculations if they wanted. All schools were notified of testing at the same time.8

In comparison to paper-based testing, tablet-based tests as administered here may reduce manipulation through several channels. First, they make student copying harder since students only see one question at a time and it is much harder to copy from the booklets of nearby students. Second, for the same reason, it is harder for teachers to help all students, or to see whether they have answered correctly (and provide the right answers). Teachers also cannot retrospectively erase and correct answers after the test has ended. Third, the mode of administration, since it required the CRP to take the tablets to the school and collect them after the test, ensured that external observers were actually present at the time of testing to invigilate.9 My goal here is not to distinguish between these potential channels, but rather to evaluate whether tablet-based tests, as delivered at scale by government officials, reduce distortion. This composite policy effect includes both the direct effects of the technology in inhibiting cheating and the indirect effect from complementing existing monitoring capacity in the system. Program implementation by government staff reduces the concerns of external validity that accompany high-fidelity implementation by motivated NGOs or research teams (Bold et al., 2018; Vivalt, 2020).

1.2.2. Data

We have data from three sources. First, we have administrative data on enrolment, staffing, infrastructure and monitoring of all recognised schools in the state, obtained from the Unified District Information System on Education (U-DISE). These data were used to construct the sampling frame for the randomised experiment and will be used for heterogeneity analyses.

Second, central to the analysis in this paper, we have item-level data from the official tests administered to 38,857 students in 2,443 schools. Of these, 12,741 students were tested in the paper-based arm and 26,116 in the tablet-based arm.

The third source of data is a retest-based audit that serves as an external benchmark to detect cheating. We randomly sampled 120 schools, spread equally across the paper- and tablet-based testing arms, to retest students using a traditional paper-based assessment, but with external proctoring and supervision by the research team. Schools were informed only 1–2 days in advance about the independent test. The anonymity of results was guaranteed. This retest was conducted within 2–3 weeks of the original assessment.10

1.2.3. Randomisation and Experiment Validity

Randomisation was carried out at the level of an academic cluster, which typically covers multiple villages/urban wards, to keep the mode of testing unchanged within a single educational market. Out of 284 clusters, 196 were assigned to tablet-based testing and the remaining 88 to paper-based testing.11 Thus, the experiment is an ‘evaluation at scale’ in all three respects highlighted by Muralidharan and Niehaus (2017), i.e., representative of large populations, studying implementation across a large number of treated units, and studying implementation at a large unit (to estimate effects that are net of spillovers). The randomisation was stratified by sub-districts (mandals) and mandal fixed effects are included in all regressions.

Our design had one pre-determined deviation from the initially assigned status. In rare cases, academic clusters do not nest villages/urban wards. If schools in the same village/ward were assigned to different treatment arms, all schools in the village were reassigned to be in the same arm; 191 schools (out of 2,462) were reassigned. In the final sample, 768 schools were assigned to paper-based testing and 1,694 to tablet-based testing. Observed characteristics are balanced between the testing arms (Table 1) by initial assignment (columns (1)–(4)), final assignment (columns (5)–(8)) and in the retesting sample (columns (9)–(12)).

Table 1.

Balance of Observables across Paper and Tablet Testing Arms.

Columns (1)–(4): initial assignment; columns (5)–(8): final assignment; columns (9)–(12): retest sample.

| Variable | (1) Tablet | (2) Paper | (3) Diff | (4) Diff SE | (5) Tablet | (6) Paper | (7) Diff | (8) Diff SE | (9) Tablet | (10) Paper | (11) Diff | (12) Diff SE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Enrolment | | | | | | | | | | | | |
| Class 4 | 18.08 | 17.11 | 0.97 | (1.34) | 17.94 | 17.40 | 0.54 | (1.31) | 20.22 | 18.22 | 2.00 | (3.71) |
| Primary (class I–V) | 88.51 | 84.82 | 3.70 | (6.41) | 88.03 | 85.83 | 2.20 | (6.32) | 102.17 | 96.18 | 5.98 | (19.72) |
| School characteristics | | | | | | | | | | | | |
| Government or aided | 0.80 | 0.80 | −0.01 | (0.05) | 0.80 | 0.79 | 0.01 | (0.05) | 0.58 | 0.58 | 0.00 | (0.13) |
| Private unaided | 0.20 | 0.20 | 0.01 | (0.05) | 0.20 | 0.21 | −0.01 | (0.05) | 0.42 | 0.42 | −0.00 | (0.13) |
| Rural | 0.86 | 0.85 | 0.02 | (0.07) | 0.86 | 0.85 | 0.01 | (0.07) | 0.77 | 0.87 | −0.10 | (0.13) |
| English medium | 0.17 | 0.18 | −0.02 | (0.05) | 0.16 | 0.18 | −0.02 | (0.05) | 0.27 | 0.40 | −0.13 | (0.13) |
| Telugu medium | 0.83 | 0.82 | 0.02 | (0.05) | 0.84 | 0.82 | 0.02 | (0.05) | 0.73 | 0.60 | 0.13 | (0.13) |
| Infrastructure | | | | | | | | | | | | |
| No. of classrooms | 3.60 | 3.49 | 0.11 | (0.16) | 3.62 | 3.45 | 0.17 | (0.16) | 3.67 | 3.87 | −0.20 | (0.49) |
| No. of toilets | 2.93 | 2.73 | 0.20 | (0.18) | 2.90 | 2.81 | 0.09 | (0.18) | 3.33 | 2.83 | 0.50 | (0.53) |
| Electricity | 0.98 | 0.98 | 0.00 | (0.01) | 0.98 | 0.98 | 0.01 | (0.01) | 0.97 | 0.98 | −0.02 | (0.03) |
| Headmaster room | 0.24 | 0.24 | −0.00 | (0.04) | 0.23 | 0.25 | −0.02 | (0.04) | 0.38 | 0.35 | 0.03 | (0.11) |
| Playground | 0.57 | 0.57 | −0.00 | (0.03) | 0.56 | 0.59 | −0.02 | (0.03) | 0.62 | 0.70 | −0.08 | (0.09) |
| No boundary wall | 0.45 | 0.47 | −0.02 | (0.03) | 0.46 | 0.46 | −0.00 | (0.04) | 0.42 | 0.47 | −0.05 | (0.10) |
| Inspections | | | | | | | | | | | | |
| Visits by BRC | 1.92 | 1.91 | 0.02 | (0.20) | 1.93 | 1.89 | 0.04 | (0.20) | 1.92 | 1.52 | 0.40 | (0.53) |
| Visits by CRC | 3.16 | 3.06 | 0.11 | (0.37) | 3.20 | 2.98 | 0.22 | (0.36) | 2.98 | 2.27 | 0.72 | (0.85) |
| Observations | 1,685 | 777 | 2,462 | | 1,694 | 768 | 2,462 | | 60 | 60 | 120 | |

Note: All characteristics presented in the table are taken from the administrative data on school characteristics (U-DISE). Columns (1), (5) and (9) present the mean of characteristics in schools assigned to tablet-based testing; columns (2), (6) and (10) present the mean of characteristics for schools assigned to paper-based testing; columns (3), (7) and (11) present the difference between these two groups; columns (4), (8) and (12) present SEs of the difference. SEs are clustered at the academic cluster level.


2. Results

2.1. Comparisons between Paper- and Tablet-Based Tests

Students who were tested on paper score much higher—about 28 percentage points higher in mathematics and English, and 21 percentage points higher in Telugu—than students who were tested on tablets (Figure 1(a), panel A of Table 2). Relative to the distribution of test scores in tablet-based tests, these differences equal about 1 SD in each subject.12 The resulting distribution of achievement in paper-based tests is very negatively skewed, with substantial ceiling effects, whereas tablet-based tests provide a much better-distributed measure of achievement. Figure 1(b) shows, for each test item, that the proportion of students answering it correctly is substantially higher in paper-based tests. Thus, conclusions about student achievement may differ drastically depending on which mode is used to administer the assessment.
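
The comparisons in panel A of Table 2 correspond to a simple regression of student scores on the assigned testing mode. A minimal sketch of the specification in columns (4)–(6), in my own notation (reconstructed from the table notes rather than taken from the paper's text), is

$$ y_{isc} = \alpha + \beta \,\text{Paper}_{c} + \mu_{m(c)} + \varepsilon_{isc}, $$

where $y_{isc}$ is the percentage of common items answered correctly by student $i$ in school $s$ of academic cluster $c$, $\text{Paper}_{c}$ indicates assignment to paper-based testing, $\mu_{m(c)}$ are mandal (stratum) fixed effects and standard errors are clustered at the academic cluster level; columns (1)–(3) omit the fixed effects.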


Fig. 1.

Test Scores in Paper and Tablet Tests.

Note: Panel (a) shows the distribution of student-wise aggregate test scores (percentage correct) for students who were tested on the same question papers using tablet-based or paper-based assessments. Panel (b) shows the percentage of students correctly answering the same question (denoted by triangles) across the paper and tablet tests. Schools were randomly assigned to be tested via paper or tablet assessments.


Table 2.

Causal Effect of Tablet-Based Tests on Cheating.

| | (1) Math | (2) Telugu | (3) English | (4) Math | (5) Telugu | (6) English |
| --- | --- | --- | --- | --- | --- | --- |
| Panel A. Comparing tablet- and paper-based tests in the full population (dependent variable: percentage correct) | | | | | | |
| Paper test | 28.48*** | 21.49*** | 28.55*** | 27.55*** | 20.94*** | 27.28*** |
| | (1.068) | (1.026) | (1.233) | (0.985) | (0.876) | (1.073) |
| Constant | 49.36*** | 56.88*** | 48.24*** | 49.67*** | 57.06*** | 48.66*** |
| | (0.881) | (0.831) | (0.741) | (0.443) | (0.404) | (0.402) |
| Strata FEs | | | | Y | Y | Y |
| Observations | 37,660 | 37,481 | 37,849 | 37,660 | 37,481 | 37,849 |
| R² | 0.256 | 0.172 | 0.276 | 0.307 | 0.227 | 0.329 |
| Panel B. Comparing original scores with the retest (dependent variable: deviations from the retest) | | | | | | |
| Paper test | 19.73*** | 15.99*** | 17.70*** | 20.36*** | 13.96*** | 18.22*** |
| | (4.412) | (3.664) | (2.911) | (3.345) | (2.771) | (2.437) |
| Constant | −2.666 | −4.152 | −2.392 | −2.960 | −3.215* | −2.647* |
| | (4.292) | (3.449) | (2.315) | (2.145) | (1.744) | (1.503) |
| Strata FEs | | | | Y | Y | Y |
| Observations | 1,650 | 1,640 | 1,662 | 1,648 | 1,638 | 1,660 |
| R² | 0.131 | 0.093 | 0.112 | 0.203 | 0.153 | 0.163 |
| Panel C. Internal validation of test data, school level (dependent variable: school flagged as suspicious) | | | | | | |
| Paper test | 0.386*** | 0.354*** | 0.358*** | 0.382*** | 0.350*** | 0.351*** |
| | (0.0195) | (0.0203) | (0.0201) | (0.0193) | (0.0201) | (0.0193) |
| Constant | 0.0454*** | 0.0519*** | 0.0227*** | 0.0467*** | 0.0531*** | 0.0246*** |
| | (0.00735) | (0.00748) | (0.00508) | (0.00629) | (0.00604) | (0.00520) |
| Strata FEs | | | | Y | Y | Y |
| Observations | 2,424 | 2,424 | 2,425 | 2,424 | 2,424 | 2,425 |
| R² | 0.231 | 0.198 | 0.237 | 0.268 | 0.237 | 0.267 |

Notes: Panel A uses student-level data from the original assessments. The dependent variable is the percentage correct on items that were common across all sets. In panel B the dependent variable is the difference in test scores, at the individual student level, between the original test and the retest. Panel C uses school-level data. The dependent variable is an indicator for being flagged, at the school level, using the procedure described in Angrist et al. (2017). In all panels, columns (1)–(3) present unconditional differences, while columns (4)–(6) condition on randomisation strata. SEs are clustered at the academic cluster level. *** p < 0.01.


2.2. Direct Audit Measures of Cheating

The primary outcome of the experiment is the magnitude of cheating, which is not directly observed through the official tests alone.

I first use the audit as an external benchmark and measure the discrepancy between the two assessments in the percentage of correct responses. Figure 2 compares results in the two treatment arms with the retest. For all items, paper-based official tests significantly over-report student achievement, whereas tablet tests correspond much more closely with the retest (Figure 2(a)). Aggregating scores at the student level, differences between the audit- and tablet-based tests are centred close to zero (indicating little deviation on average), but, for paper-based tests, scores are clearly higher in the official tests than the retest (Figure 2(b)). Paper tests exaggerate performance by 16–20 percentage points in each subject (panel B of Table 2). Point estimates suggest a negative effect of 2–4 percentage points in the tablet-based assessment, plausibly reflecting lower student familiarity with digital tests (only statistically significant in Telugu).
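
To make the construction of this audit-based measure concrete, the sketch below computes, for each retested student, the gap between the official score and the independently proctored retest score on the common items, and averages it by testing arm. This is only an illustration: the file and column names (student_id, arm, item_id, correct) are hypothetical and need not match the replication package.

```python
import pandas as pd

# Hypothetical item-level files; column names are illustrative, not those of the replication data.
official = pd.read_csv("official_items.csv")  # student_id, arm ('paper'/'tablet'), item_id, correct (0/1)
retest = pd.read_csv("retest_items.csv")      # student_id, item_id, correct (0/1)

# Restrict both assessments to the items common to all booklets, as in the paper's analyses.
common_items = set(official["item_id"]) & set(retest["item_id"])
official = official[official["item_id"].isin(common_items)]
retest = retest[retest["item_id"].isin(common_items)]

# Percentage correct per student in each assessment.
off_score = (official.groupby(["student_id", "arm"])["correct"]
             .mean().mul(100).rename("official").reset_index())
ret_score = (retest.groupby("student_id")["correct"]
             .mean().mul(100).rename("retest").reset_index())

# Deviation = official score minus retest score; large positive values
# indicate over-reporting in the official assessment.
scores = off_score.merge(ret_score, on="student_id")
scores["deviation"] = scores["official"] - scores["retest"]

# Average deviation by testing arm (cf. panel B of Table 2).
print(scores.groupby("arm")["deviation"].mean())
```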


Fig. 2.

Correspondence of the Retest with the Official Paper and Tablet Tests.

Note: Panel (a) shows the difference in percentage correct in the official test and the retest audit on common items at the student level. Panel (b) shows the difference in the proportion of students answering the same question correctly in the official test (vertical axis) and the retest (horizontal axis). Whereas the average deviation in responses between the tablet tests and the retest is very small, student achievement is overstated in paper-based assessments that were proctored internally by school teachers.


This is a conservative measure of distortion because answering the same question may be easier the second time around, thus leading average student performance to be higher in the retest (‘learning effects’).13 We did not conduct the external tests before the official assessment for logistical reasons and to ensure the confidentiality of the questions being used. We were also concerned that, having recently answered these questions in an external test, students may be deterred from cheating in the official tests knowing that external validation measures exist.

2.3. Detecting Cheating Based on Internal Validation

While the audit method provides a direct measure of distortion, it is costly in time and effort, cannot be implemented for all schools and is infeasible to administer in all settings. Thus, I investigate a second measure that only requires access to item-level data from the official tests. I use the method of Angrist et al. (2017), which is adopted from official practice to flag classrooms with suspected cheating in the INVALSI exams in Italy. The algorithm proceeds in three steps. First, item-level data are used to generate four summary statistics at the classroom level: (a) the mean percentage correct, (b) the variance of the percentage correct, (c) the proportion of non-missing answers and (d) an index of homogeneity of answer options in the classroom. Second, these data are reduced to two principal components. Finally, schools are classified into clusters using a hard k-means clustering approach, flagging the cluster with a high mean, low variance, few missing responses and homogeneous responses to individual test questions. I pool the item-level data for both modes of assessment and run this algorithm separately by subject. Details of the procedure and characteristics of the resulting clusters are presented in Online Appendix B.
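
The sketch below illustrates the three steps of this flagging procedure. It is a stylised illustration rather than the exact INVALSI/ABV implementation: the data layout (columns classroom_id, student_id, item_id, answer, correct), the specific homogeneity index, the number of clusters and the rule for picking the suspicious cluster are all simplifying assumptions on my part.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def flag_suspicious_classrooms(df, n_clusters=4):
    """Stylised sketch of an ABV-type flagging procedure.

    `df` has one row per student x item with hypothetical columns:
    classroom_id, student_id, item_id, answer (chosen option, NaN if blank)
    and correct (0/1).
    """
    def answer_homogeneity(g):
        # Average, across items, of the share of answering students choosing the modal option.
        modal_share = g.groupby("item_id")["answer"].apply(
            lambda a: a.value_counts(normalize=True).max() if a.notna().any() else np.nan)
        return modal_share.mean()

    # Step 1: four classroom-level summary statistics.
    stats = df.groupby("classroom_id").apply(lambda g: pd.Series({
        "mean_correct": g["correct"].mean(),
        "var_correct": g.groupby("student_id")["correct"].mean().var(),
        "share_answered": g["answer"].notna().mean(),
        "homogeneity": answer_homogeneity(g),
    })).fillna(0)

    # Step 2: reduce the four statistics to two principal components.
    components = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(stats))

    # Step 3: hard k-means clustering on the components; flag the cluster whose profile
    # combines a high mean, low variance, few blanks and homogeneous answers.
    stats["cluster"] = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(components)
    profile = stats.groupby("cluster")[["mean_correct", "var_correct",
                                        "share_answered", "homogeneity"]].mean()
    score = (profile["mean_correct"].rank() - profile["var_correct"].rank()
             + profile["share_answered"].rank() + profile["homogeneity"].rank())
    return stats["cluster"].eq(score.idxmax())  # Boolean flag per classroom_id
```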

The ABV procedure is particularly well suited to our application. First, the INVALSI assessments are similar in scope and implementation to regular paper-based assessments here: adopting the same procedure to flag cheating allows me to benchmark levels of distortion in this setting against those documented elsewhere. The ABV procedure also does not suffer from concerns about potential learning between assessments. Second, and as importantly, this procedure does not posit a common item response model across students in the paper- and tablet-based assessments, in contrast to commonly used indices such as the ω-index, K-index and GBT index (see Romero et al., 2015; Martinelli et al., 2018). That assumption is likely to be inappropriate in our setting: students with the same underlying ability may score differently on paper and tablet tests due to a difference in the underlying item response function, and it is also likely that some cheating would be rationalised through a higher estimated level of achievement.14 Finally, the ABV procedure is computationally simpler and more transparent for non-specialist audiences than alternative IRT-based indices; in addition to already being used by policy agencies elsewhere, these features make it a promising candidate for adoption in this setting.

Panel C of Table 2 compares suspected cheating in digital and paper-based tests, as administered at scale. Between 38% and 43% of classrooms in the paper assessment are flagged as suspicious using this approach. This is much higher than the figures reported by Battistin et al. (2017) in southern Italy, where the proportion of suspected manipulators was assessed to be 11%–16%. In contrast, only 2%–5% of the classrooms are similarly flagged in the tablet assessment.

I use the audit data to validate the ABV procedure against the direct retest-based measure of cheating. Figure 3 shows substantial agreement, on average, between the two methods in flagging cheating: in classrooms not flagged by the ABV procedure, there is little discrepancy between the audit and the official test; in flagged classrooms, students score 17–21 percentage points lower in the audit than in their original test. The agreement between alternative procedures using different data suggests that the ABV method is reliable for the analyses presented here and potentially for scaled-up implementation. This direct validation is important because a major concern with statistical procedures to detect suspicious patterns is how to assess the risk of type-I and type-II errors in the absence of further rounds of testing. Results are often sensitive to the choice of index.15 Furthermore, without an external benchmark, especially in high-prevalence settings, such indices may only differentiate between excessive and moderate cheating (rather than measure absolute levels).


Fig. 3.

Audit Correspondence with the Official Test for Flagged and Non-Flagged Schools.

Note: This figure compares the difference between the official test score and the audit, which serves as our direct measure of cheating, across schools that are and are not flagged for potential manipulation using the indirect procedure of Angrist et al. (2017). There is considerable agreement across the two metrics: the discrepancy between the official test and the audit is very small on average in schools that are not flagged by the indirect analysis, whereas the difference is pronounced (17–21 percentage points) in schools flagged as cheating.


2.4. Correlates of Cheating

I study heterogeneity in the difference in percentage correct between the tablet- and paper-based assessments by three observed covariates, each of which may affect cheating. The first is whether the school is private: since private schools are subject to market competition, they may face stronger pressure to show good test results. The second is the number of students enrolled in grade 4, which may affect cheating through multiple channels: teachers’ effort costs of proctoring may be higher for larger classrooms; alternatively, if cheating is costless for students, the probability that at least one student in a class knows the right answer (and can share it) rises in larger classes. The third is the reported number of visits by officials, which could be taken as an indicator of official accountability (Muralidharan et al., 2017).
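
A minimal sketch of the interacted specification that this heterogeneity analysis implies, again in my own notation (the exact specification is reported in Online Appendix Table A2), is

$$ y_{isc} = \alpha + \beta \,\text{Paper}_{c} + \delta \,(\text{Paper}_{c} \times X_{s}) + \gamma \, X_{s} + \mu_{m(c)} + \varepsilon_{isc}, $$

where $X_{s}$ is, in turn, a private-school indicator, grade 4 enrolment or the reported number of inspection visits, and $\delta$ captures how the gap between paper and tablet scores varies with $X_{s}$.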

The results are presented in Online Appendix Table A2. In all three subjects, the difference in performance between the paper and tablet tests is greater in private schools than in government schools. The magnitude of this difference is meaningful: whereas government school students perform 18–24 percentage points worse on tablets than on paper, this difference is larger by 3–8 percentage points in private schools. It also appears that cheating is greater in larger classrooms, although the magnitude of this relationship is much more modest: an increase of ten students only increases cheating by about 0.8 percentage points on average. Finally, there is little evidence that the number of inspections is correlated with lower cheating.

2.5. Intervention Costs and Further Considerations

The results presented above suggest that tablet-based assessments nearly eliminated cheating. Decisions on broader policy adoption, however, face at least two important further considerations.

The first relates to intervention costs. The main additional costs of tablet-based testing come from the hardware: personnel costs are unchanged, since teachers and cluster resource staff would also be required to dedicate time to testing under paper-based testing. This experiment used about 3,500 tablets to test grade 4 students in the district over ten working days. The marginal cost of testing additional students in the same school is modest. Extending this testing period, with staggered testing across schools, I estimate that 5,000 tablets would suffice for testing around 200,000 students over a ten-week period. Low-specification tablets currently cost about USD 60–100. With a three-year depreciation period, this translates into a per-test cost of about 50–80 cents at scale. This expenditure is feasible even within current education budgets in LMICs. Importantly, these are not incremental costs, but rather displace substantial effort and expenditure related to manual grading, data entry, record-keeping and printing of paper-based tests. Thus, tablet-based tests may be feasible and cost-effective at scale. More granular data also make it easier to detect suspicious responses. Adaptive testing may also make assessments more informative.16
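
As a back-of-the-envelope check on these cost figures, the sketch below reproduces the stated per-test range from the stated inputs (5,000 tablets at USD 60–100, a three-year depreciation period and roughly 200,000 students tested per round); treating one district-wide round per year as the testing frequency is my assumption, not the paper's.

```python
# Back-of-the-envelope cost of tablet-based testing, using the figures in the text.
# Assumption (mine): one district-wide testing round per year.
tablets = 5_000
tablet_price_usd = (60, 100)     # low- and high-end price of a low-specification tablet
depreciation_years = 3
students_per_round = 200_000     # students testable over a ten-week window

for price in tablet_price_usd:
    hardware_cost_per_year = tablets * price / depreciation_years
    cost_per_student = hardware_cost_per_year / students_per_round
    print(f"USD {price} tablet: about {cost_per_student * 100:.0f} cents per tested student per round")
# Prints roughly 50 and 83 cents, consistent with the 50-80 cent range cited above.
```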

The second, and important, caveat relates to long-run effects: as a one-off trial, the experiment here does not reflect any long-term adaptation by agents. Such responses have been shown to undermine previous reforms. For instance, Banerjee et al. (2008) and Dhaliwal and Hanna (2017) documented how technology-aided public sector interventions to reduce health worker absenteeism were undermined shortly after introduction, with the implicit agreement of higher levels of the bureaucracy. Such complete reversal is not universal though: Muralidharan et al. (2016) presented an example of a reform that persisted and, in a different state, Banerjee et al. (2020) documented a temporary reversal of the corruption-reducing intervention they studied, followed by a nationwide scale-up. The long-term efficacy of tablet-based tests, as with any intervention, can only be assessed once it has been in operation for a sustained period. This concern is, of course, not specific to large-scale policy experiments alone.17 Thus, following any such reforms, data integrity will need to be evaluated repeatedly in representative samples.

3. Discussion

3.1. Implications for Policy

The pervasiveness of cheating I find suggests that policies that raise the costs of cheating across the board may be more promising here than those focused on detecting and punishing individual cheaters (which are appropriate when prevalence is lower, e.g., Jacob and Levitt, 2003). Digital testing might be an especially promising candidate to scale, and to continue monitoring for effectiveness in sub-samples over time.

These results are of immediate policy relevance: in India, the new National Education Policy mandates a census of student achievement in selected grades alongside a proficiency-based goal of ‘universal foundational literacy and numeracy’ (Government of India, 2020). Similar large-scale assessments and proficiency targets are also recommended as the cornerstone of education reforms to raise student achievement in LMICs by the World Bank and other multilateral aid agencies. However, if assessments do not reflect true achievement, they are unlikely to fulfil the expectation of governments and multilateral agencies to catalyse action on the ‘learning crisis’.18 Overall, these results suggest that simultaneous enabling reforms to ensure data integrity will need to underpin any attempts to use test score data as a basis for large interventions in similar settings.

3.2. Implications for Research

The widespread adoption of large-scale assessments in many developing countries holds the prospect of transforming education research in these settings, as it has done in Europe, the United States and Latin America (see Figlio et al., 2016). It could substantially cut the costs of education research, enable many more retrospective analyses of policy changes using quasi-experimental methods and improve sample coverage and precision.

Unfortunately, at least in India and similar settings, my results are also sobering for the use of such data in research. Cheating-induced distortion is more problematic than classical measurement error: not only is the level of achievement distorted, but this distortion also differs systematically by school type and, plausibly, may vary over time and with the characteristics of individual students.19 It is unlikely that all correlates of manipulation can be adequately captured in large-scale data, and some of these correlates are likely to interact with whatever policy or attribute the researcher is trying to study. Thus, in settings with weak governance, using these assessments as the mainstay of education research may be inadvisable even with low stakes.20 Improving the integrity of these data systems is thus also important on purely scientific grounds.

3.3. Limitations

This study, demonstrating that digital testing is implementable by LMIC government systems at low cost and can substantially improve data integrity, is best considered as a ‘proof of concept’. There are, however, a few important caveats.

The first of these relates to external validity. My results are likely to generalise to schools in other Indian states where the magnitude of cheating appears to be similarly high (Kingdon and Muzammil, 2009; Singh, 2020) and the organisation of the education system is similar. Although India is an important setting by itself—with 264.4 million children enrolled in K-12 education, the world’s largest education system—this study cannot speak to generalisability elsewhere. Estimates of the prevalence of cheating are not, to my knowledge, available for other low- and lower-middle-income settings in South Asia or sub-Saharan Africa that have been the principal focus of policy concern about a global learning crisis.21 If the prevalence of cheating is much lower, then digital testing may not be an appropriate investment for scarce educational funds.

Second, although I demonstrate that cheating is widespread even in low-stake exams, I cannot answer why this is the case. It is possible that teachers perceive the tests to be high stakes even without explicit incentives (as suggested by Bertoni et al., 2013 for the INVALSI exams in Italy and Singh, 2020 for official tests in a different Indian state). It is also possible that, given the salience of high-stake exams in the Indian education system, many students perceive all tests to be high stakes (and cheat for higher scores, when given the opportunity). Unlike, e.g., Alan et al. (2020), who showed substantial cheating even with no performance incentives in Turkish schools, we did not collect extensive student-level information to study the correlates of cheating.

Third, this experiment was designed only to measure (and address) aggregate levels of cheating. It cannot, however, decompose the extent to which cheating is initiated by teachers or by students. This would be of independent interest to examine, but, unlike tablet-based tests here, would require interventions that only affect one of these sources (e.g., random seating of students, as in Lin and Levitt, 2020) or a precise way of measuring the two sources of cheating separately.

4. Conclusions

I have documented two results in this paper: first, cheating inflates official achievement data and, second, substantial reductions in distortion may be possible at scale within LMIC state capacity and budgets. Given the willingness of governments to adopt such assessments, interventions to improve their integrity may have high rewards both for policy and research.

While measuring student achievement is an important application, the challenge presented by distortion in administrative data on public services is much broader than education alone. Similarly widespread misreporting is possible in many sectors, including health, welfare, disaster relief, agriculture or social security.22 In each of these, the lack of reliable information can constrain the scope of both policy action and public accountability. In this respect, administrative data, in both their scope and reliability, form a key part of the basic infrastructure of state capacity for the implementation of core functions. The core principles of improving administrative data presented in this paper—sample-based audits to provide an external benchmark and technology-aided methods that limit the opportunity to manipulate and enable precise (but cheap) detection—may also generalise across sectors.

The overarching message of this paper, therefore, is to suggest caution for strategies that emphasise ‘data-driven’ approaches to policy reform without considering the provenance of the data. The creation of robust data systems with standardised measurements and individual-level data has historically been a major (and often contentious) challenge in developing administrative capacity across the world (Scott, 1998). Remedying such distortion at scale remains a substantial task for administrative systems with weak governance.

Additional Supporting Information may be found in the online version of this article:

Online Appendix

Replication Package

1

For example, the first policy response to the problem of abysmally poor learning outcomes in LMICs (‘the learning crisis’) suggested in the World Development Report in 2018 was to assess learning ‘using well-designed student assessments to gauge the health of education systems [...] and using the resulting learning measures to spotlight hidden exclusions, make choices and evaluate progress’ (World Bank, 2018). Proficiency-based measures are now central to national and global policy goals, including the UN Sustainable Development Goals (Goal 4).

2

Bansal and Roy (2019) provided direct examples of discrepancies impeding decision-making: ‘As a consequence of conflicting measurements and lack of quality data, the objective of index-based measurement systems—to prioritise and identify weaknesses for improvement—is largely a lost cause. Today, if any bandwidth is spent in states, it is on wondering “what really is the truth”. Data-based learning assessments and rankings that should have been a clarion call to action for states have degraded into a source of frustration and cynicism, as well as the target of ridicule.’ In a complementary paper, I show that levels of achievement in official tests in Madhya Pradesh, a large Indian state, are severely exaggerated, especially for low-performing students (Singh, 2020).

3

This procedure, described in Section 2.3, flags classrooms with suspicious response patterns: a high mean, low variance, high within-class similarity of answers and a low proportion of missing responses.

4

See, for instance, Jacob and Levitt (2003), Angrist et al. (2017) and Dee et al. (2019), who documented teacher-induced distortion, and Martinelli et al. (2018), who documented copying by students. Of these, only Martinelli et al. (2018) used data from a developing country (Mexico), in the atypical setting of a high-stake incentive program in eighty-eight Mexican schools.

5

This intervention was inspired by the encouraging results from similar interventions in other South Asian contexts reported in Andrabi et al. (2017) and Afridi et al. (2020).

6

Logistically, it was only feasible to test one grade per school. By grade 4, students can be expected to answer written tests (at younger ages, tests would have needed to include oral stimuli to ensure comprehension).

7

In principle, the tablet-based tests could have allowed for more within-class variation in items administered by drawing on a broader item bank and, also, combining items more flexibly than the three sets administered in the paper-based arm. We chose not to do this to keep the test content in the two arms as comparable as possible.

8

The official letter also made clear that the testing was district wide and that it was being pushed by the state government with the intention of generating community report cards. As such, we expect that the tests had high salience for all schools, even though the test carried no formal incentives. The test was not introduced as a pilot.

9

The presence of external observers is frequently mandated even for regular paper-based tests in India. However, these observers commonly do not show up or do so perfunctorily (Singh, 2020).

10

The audit could only be completed in 117 schools. Given the tight retesting window and ambiguity in the identifiers for some schools, we were unable to trace three schools.

11

The unequal sample split across testing regimes reflected initial plans of the government to follow this study with a subsequent experiment on the effect of community-based report cards (as in Andrabi et al., 2017). Our prior, based on data from a different state (Singh, 2020), was that paper-based assessments would have substantial manipulation (whereas the efficacy of tablet-based tests was unknown). Since we did not want to disseminate report cards in settings where misreporting was severe, it was prudent to increase the proportion of tablet-based tests to two-thirds of the clusters. We randomised at the cluster level to keep testing modes constant within communities (since report cards would have compared schools within villages/wards). Unfortunately, for logistical and administrative reasons, this subsequent report card experiment was never implemented.

12

Percent correct scores in the tablet tests have a standard deviation of 24.11 in math, 22.09 in English and 23.44 in Telugu.

13

This concern may be particularly important in the paper-based assessment arm, where students answer in the same format across the original test and the retest. Furthermore, teachers may have used the initial assessment for revision in the paper-based assessment schools. This is not a concern in the tablet testing arm since no physical question papers were left in schools. Evidence suggests that this is, in fact, the case since students who were initially tested on paper performed somewhat better in the retest (Online Appendix Figure A2 and Table A1). Since the retest was administered on paper to all students, I cannot separate the effect of test-mode familiarity from revision.

14

Direct evidence of this is provided in Online Appendix Figure A3, which presents the distributions of ability, as estimated in pooled data by a (common) 3-PL Item Response Theory model, in the two treatment arms. The paper-based test arm has substantially higher estimated achievement even though, by virtue of randomisation, one would expect the estimated distribution of ability to be similar across the two groups. The assumption of a common item response model is also unnecessary here since, unlike, e.g., Martinelli et al. (2018), our focus is not on identifying individual students who cheat, but rather on assessing whether, in aggregate, one testing regime is subject to less distortion.

15

See, for instance, the discussion of the trade-offs between different types of cheating indices in Martinelli et al. (2018), who found that cheating was higher with student incentives, but that the measured extent varied depending on whether they used the $\omega$-index (their preferred measure) or the K-index (which showed very little sign of cheating).
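To give a concrete, if rough, sense of how such pairwise answer-copying statistics are constructed, the sketch below computes a simplified, simulation-based match score for every pair of students in a class. It is an illustration of the general logic only, not the $\omega$- or K-index used by Martinelli et al. (2018), and the function and variable names are hypothetical.

```python
import numpy as np

def pairwise_match_scores(responses, n_sims=2000, seed=0):
    """Simplified, illustrative pairwise answer-matching statistic.

    `responses` is an (n_students, n_items) integer array of option
    choices for one classroom. For each pair of students, the number
    of items on which their answers coincide is compared with a null
    distribution obtained by permuting one student's answers across
    items, which preserves that student's mix of answers but breaks
    any item-by-item coordination. Large positive z-scores flag pairs
    whose agreement exceeds what this simple null would produce.
    """
    rng = np.random.default_rng(seed)
    n_students, _ = responses.shape
    scores = []
    for a in range(n_students):
        for b in range(a + 1, n_students):
            observed = int(np.sum(responses[a] == responses[b]))
            sims = np.array([
                int(np.sum(rng.permutation(responses[a]) == responses[b]))
                for _ in range(n_sims)
            ])
            z = (observed - sims.mean()) / (sims.std() + 1e-9)
            scores.append((a, b, z))
    return scores

# Example: 4 students, 10 multiple-choice items (options coded 0-3).
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    answers = rng.integers(0, 4, size=(4, 10))
    answers[1] = answers[0]  # student 1 copies student 0 exactly
    for a, b, z in pairwise_match_scores(answers):
        print(f"pair ({a},{b}): z = {z:.2f}")
```

In this sketch the copying pair (0, 1) stands out with a large z-score, while unrelated pairs stay near zero; indices used in practice refine the null by modelling item difficulty and option popularity rather than relying on a simple permutation.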

16

This is particularly important since students are often very far behind grade-appropriate levels, so tests tied closely to official syllabi alone may miss even substantial learning gains (Muralidharan et al., 2019).

17

See, for instance, Jayaraman et al. (2016) for an example of how adaptation by agents can reverse short-term conclusions about the productivity effects of payment schedules for workers.

18

Note that the ‘learning crisis’ is fundamentally about absolute learning levels, which have been the primary focus of this paper. However, substantial manipulation of levels may still preserve the ordinal ranks of students within or between schools (as documented in a different state by Singh, 2020).

19

See, e.g., Alan et al. (2020), who used rich survey data to show that cheating in Turkish primary schools varies by socioeconomic status and IQ, and that the responsiveness of cheating to incentives varies by altruism.

20

They may, however, still be useful for specific purposes, such as evaluating pre-treatment differences between schools or students in a policy evaluation; see Muralidharan and Singh (2020) for an example.

21

See, however, Berkhout et al. (2020), who presented suggestive evidence of reduced cheating from computer-based tests in Indonesia, and Alan et al. (2020), who reported a cheating prevalence rate of 34% in a sample of elementary school children in Istanbul.

22

For instance, similar concerns have been raised in India about official data on basic sanitation: administrative data declaring villages ‘open-defecation free’ are reported to be substantially overstated, despite there being, in principle, a detailed process of verification by multiple administrative authorities (Agarwal, 2019).

Notes

The data and codes for this paper are available on the Journal repository. They were checked for their ability to reproduce the results presented in the paper. The replication package for this paper is available at the following address: https://doi.org/10.5281/zenodo.10727363.

I am grateful to Erich Battistin, Martina Bjorkman Nyqvist, Konrad Burchardi, Luis Crouch, Lee Crawfurd, Jonathan de Quidt, Jishnu Das, Tore Ellingsen, Clement Imbert, Karthik Muralidharan, Geeta Kingdon, Gaurav Khanna, Derek Neal, Lant Pritchett, Mauricio Romero and several seminar participants for insightful comments. I am also grateful to officials at the Government of Andhra Pradesh (in particular, Ms. Sandhya Rani and Mr. Santhosh Singh), as well as staff at the Central Square Foundation, especially Rahul Ahluwalia, Saloni Gupta, Neil Maheshwari and Devika Kapadia, for their collaboration. Ramamurthy Sripada, Nawar Al Ebadi and Edoardo Bollati provided outstanding field management and research assistance. This project was supported by the Research on Improving Systems of Education (RISE) program funded by UK aid. Ethical approval for fieldwork was obtained from the Institutional Review Board of JPAL South Asia at the Institute for Financial Management and Research.

References

Afridi, F., Barooah, B. and Somanathan, R. (2020). ‘Improving learning outcomes through information provision: Experimental evidence from Indian villages’, Journal of Development Economics, vol. 146, 102276.

Agarwal, K. (2019). ‘Government data proves we shouldn’t believe India is “open defecation free”’, The Wire, India, 2 October.

Alan, S., Ertac, S. and Gumren, M. (2020). ‘Cheating and incentives in a performance context: Evidence from a field experiment on children’, Journal of Economic Behavior & Organization, vol. 179, pp. 681–701.

Andrabi, T., Das, J., Khwaja, A.I. and Zajonc, T. (2011). ‘Do value-added estimates add value? Accounting for learning dynamics’, American Economic Journal: Applied Economics, vol. 3, pp. 29–54.

Andrabi, T., Das, J. and Khwaja, A.I. (2017). ‘Report cards: The impact of providing school and child test scores on educational markets’, American Economic Review, vol. 107(6), pp. 1535–63.

Angrist, J.D., Battistin, E. and Vuri, D. (2017). ‘In a small moment: Class size and moral hazard in the Italian mezzogiorno’, American Economic Journal: Applied Economics, vol. 9(4), pp. 216–49.

ASER (2019). Annual Status of Education Report 2018, New Delhi: ASER Centre.

Banerjee, A.V., Duflo, E. and Glennerster, R. (2008). ‘Putting a band-aid on a corpse: Incentives for nurses in the Indian public health care system’, Journal of the European Economic Association, vol. 6(2–3), pp. 487–500.

Banerjee, A., Duflo, E., Imbert, C., Mathew, S. and Pande, R. (2020). ‘E-governance, accountability, and leakage in public programs: Experimental evidence from a financial management reform in India’, American Economic Journal: Applied Economics, vol. 12(4), pp. 39–72.

Bansal, S. and Roy, S. (2019). ‘The tyranny of metrics: Why we learning nothing from the learning outcome data’, Financial Express, 28 February.

Battistin, E., De Nadai, M. and Vuri, D. (2017). ‘Counting rotten apples: Student achievement and score manipulation in Italian elementary schools’, Journal of Econometrics, vol. 200(2), pp. 344–62.

Berkhout, E., Pradhan, M., Rahmawati, D.S. and Swarnata, A. (2020). ‘From cheating to learning: An evaluation of fraud prevention on national exams in Indonesia’, Working Paper 20/046, Research on Improving Systems of Education.

Bertoni, M., Brunello, G. and Rocco, L. (2013). ‘When the cat is near, the mice won’t play: The effect of external examiners in Italian schools’, Journal of Public Economics, vol. 104, pp. 65–77.

Bold, T., Kimenyi, M., Mwabu, G., Ng’ang’a, A. and Sandefur, J. (2018). ‘Experimental evidence on scaling up education reforms in Kenya’, Journal of Public Economics, vol. 168, pp. 1–20.

Borcan, O., Lindahl, M. and Mitrut, A. (2017). ‘Fighting corruption in education: What works and who benefits?’, American Economic Journal: Economic Policy, vol. 9(1), pp. 180–209.

Das, J. and Zajonc, T. (2010). ‘India shining and Bharat drowning: Comparing two Indian states to the worldwide distribution in mathematics achievement’, Journal of Development Economics, vol. 2, pp. 175–85.

Dee, T., Dobbie, W., Jacob, B.A. and Rockoff, J.E. (2019). ‘The causes and consequences of test score manipulation: Evidence from the New York Regents examinations’, American Economic Journal: Applied Economics, vol. 11(3), pp. 382–423.

Dhaliwal, I. and Hanna, R. (2017). ‘The devil is in the details: The successes and limitations of bureaucratic reform in India’, Journal of Development Economics, vol. 124, pp. 1–21.

Figlio, D., Karbownik, K. and Salvanes, K.G. (2016). ‘Education research and administrative data’, in (E.A. Hanushek, S. Machin and L. Woessmann, eds.), Handbook of the Economics of Education, pp. 75–138, Amsterdam: Elsevier.

Fujiwara, T. (2015). ‘Voting technology, political responsiveness, and infant health: Evidence from Brazil’, Econometrica, vol. 83(2), pp. 423–64.

Glewwe, P. and Muralidharan, K. (2016). ‘Improving education outcomes in developing countries: Evidence, knowledge gaps, and policy implications’, in (E.A. Hanushek, S. Machin and L. Woessmann, eds.), Handbook of the Economics of Education, pp. 653–743, Amsterdam: Elsevier.

Government of India (2020). National Education Policy 2020, New Delhi: Ministry of Human Resource Development, Government of India.

Jacob, B.A. and Levitt, S.D. (2003). ‘Rotten apples: An investigation of the prevalence and predictors of teacher cheating’, The Quarterly Journal of Economics, vol. 118(3), pp. 843–77.

Jayaraman, R., Ray, D. and de Véricourt, F. (2016). ‘Anatomy of a contract change’, American Economic Review, vol. 106(2), pp. 316–58.

Kingdon, G. and Muzammil, M. (2009). ‘A political economy of education in India: The case of Uttar Pradesh’, Oxford Development Studies, vol. 37(2), pp. 123–44.

Leaver, C., Ozier, O., Serneels, P. and Zeitlin, A. (2021). ‘Recruitment, effort, and retention effects of performance contracts for civil servants: Experimental evidence from Rwandan primary schools’, American Economic Review, vol. 111(7), pp. 2213–46.

Lewis-Faupel, S., Neggers, Y., Olken, B.A. and Pande, R. (2016). ‘Can electronic procurement improve infrastructure provision? Evidence from public works in India and Indonesia’, American Economic Journal: Economic Policy, vol. 8(3), pp. 258–83.

Lin, M.J. and Levitt, S.D. (2020). ‘Catching cheating students’, Economica, vol. 87(348), pp. 885–900.

Loyalka, P., Sylvia, S., Liu, C., Chu, J. and Shi, Y. (2019). ‘Pay by design: Teacher performance pay design and the distribution of student achievement’, Journal of Labor Economics, vol. 37(3), pp. 621–62.

Martinelli, C., Parker, S.W., Pérez-Gea, A.C. and Rodrigo, R. (2018). ‘Cheating and incentives: Learning from a policy experiment’, American Economic Journal: Economic Policy, vol. 10(1), pp. 298–325.

Mbiti, I., Muralidharan, K., Romero, M., Schipper, Y., Manda, C. and Rajani, R. (2019). ‘Inputs, incentives, and complementarities in education: Experimental evidence from Tanzania’, The Quarterly Journal of Economics, vol. 134(3), pp. 1627–73.

Muralidharan, K., Das, J., Holla, A. and Mohpal, A. (2017). ‘The fiscal cost of weak governance: Evidence from teacher absence in India’, Journal of Public Economics, vol. 145, pp. 116–35.

Muralidharan, K., Niehaus, P. and Sukhtankar, S. (2016). ‘Building state capacity: Evidence from biometric smartcards in India’, American Economic Review, vol. 106(10), pp. 2895–929.

Muralidharan, K. and Niehaus, P. (2017). ‘Experimentation at scale’, Journal of Economic Perspectives, vol. 31(4), pp. 103–24.

Muralidharan, K., Singh, A. and Ganimian, A. (2019). ‘Disrupting education? Experimental evidence on technology-aided instruction in India’, American Economic Review, vol. 109(4), pp. 1426–60.

Muralidharan, K. and Singh, A. (2020). ‘Improving public sector management at scale: Experimental evidence on school governance in India’, Working Paper, National Bureau of Economic Research.

Muralidharan, K. and Sundararaman, V. (2011). ‘Teacher performance pay: Experimental evidence from India’, Journal of Political Economy, vol. 119(1), pp. 39–77.

Pritchett, L. (2013). The Rebirth of Education: Schooling Ain’t Learning, Washington, DC: Center for Global Development.

Romero, M., Riascos, Á. and Jara, D. (2015). ‘On the optimality of answer-copying indices: Theory and practice’, Journal of Educational and Behavioral Statistics, vol. 40(5), pp. 435–53.

Scott, J.C. (1998). Seeing Like a State: How Certain Schemes to Improve the Human Condition Have Failed, New Haven, CT: Yale University Press.

Singh, A. (2020). ‘Myths of official measurement: Auditing and improving administrative data in developing countries’, Working Paper 20/042, Research on Improving Systems of Education.

Vivalt, E. (2020). ‘How much can we generalize from impact evaluations?’, Journal of the European Economic Association, vol. 18(6), pp. 3045–89.

World Bank (2018). World Development Report 2018: Learning to Realize Education’s Promise, Washington, DC: The World Bank.

© The Author(s) 2024. Published by Oxford University Press on behalf of Royal Economic Society.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

JEL classification: I25 (Education and Economic Development), I28 (Government Policy), O15 (Human Resources; Human Development; Income Distribution; Migration).

Issue Section: Short paper
