Main

Rapid diagnostic tests (RDTs) save lives by informing case management, treatment, screening, and disease control and elimination programs1. Lateral flow tests are among the most common RDTs, and hundreds of millions of these tests are performed worldwide each year. They have the potential to support near-person testing and decentralized management of a range of clinically important diseases (including malaria, HIV, syphilis, tuberculosis, influenza and noncommunicable diseases2), making testing convenient for the end user and more affordable for health systems3. However, RDTs also present some issues, namely errors in performing the test and interpreting the result4,5, gaps in quality control and the lack of electronic capture of tests and results within health systems and surveillance. Many of these would be overcome with the real-time connectivity embodied in REASSURED, the new criteria for an ideal test coined by Peeling and coworkers1 to reflect the importance of digital connectivity. Real-time connectivity involves the use of mobile-phone-connected RDTs. To date, there have been few peer-reviewed studies or evaluations of the effectiveness of connected lateral flow tests at scale in populations in low- and middle-income countries.

Recent studies comparing the human interpretation of an HIV RDT to various gold standards, such as immunoblot6,7,8,9, enzyme immunoassay7,9,10,11, standardized test panels12 or different HIV RDTs13,14,15, have highlighted the common issue of subjective interpretation of the test result, which can lead to incorrect diagnosis. User error (especially in the case of weak reactive lines) and inadequate supervision of testers were identified as prime factors for misinterpretation16. In a study in which users with varying levels of experience interpreted HIV RDT results from pictures of tests17, the accuracy of interpretation varied between 80 and 97%. This highlights the importance of experience in reading the test, as well as the subjectivity involved in reading a weak test line. Evidence also suggests that some fieldworkers struggle to interpret RDTs because of color blindness or short-sightedness18. Another study used photographs of HIV RDTs to quantify the subtle difference between tests with faint lines declared as true positive (TP) or false positive (FP) by a panel of human users19. While these were small-scale studies (n = 148 and 8, respectively), both highlighted the potential for photographs to improve quality control and decision making.

Deep learning algorithms, harnessing advances in large datasets and processing power, have recently shown the ability to exceed human performance in a plethora of visual tasks, including cell-based diagnostics20, interpretation of dermatologic21, ophthalmologic22 and radiographic images23, playing strategic games24 and in clinical medicine when used alongside appropriate guidelines25,26. While some emerging studies are looking at the application of deep learning to the interpretation of RDTs27,28, little is known about the ability of machine learning models to analyze field-acquired diagnostic test data, given concerns about the lack of uniformity of images (for example, focus and tilt), harsh environmental factors such as lighting (for example, brightness and shadowing) and the variety of test types. In addition, there is a general lack of large real-world datasets available to successfully train deep learning classifiers, particularly from low- and middle-income countries. Recent advances in consumer electronic devices and deep learning have the potential to improve RDT quality assurance, staff training and connectivity, eventually supporting self-testing such as for HIV, which has been shown to be cost effective29, to appeal to young people30 and to help reduce anxiety31.

Mobile health (mHealth) approaches, which marry RDTs with widely available mobile phones, take advantage of the phones' inbuilt sensors (for example, cameras), battery life, processing power, screens to display results and connectivity to send results to health databases. A recent field study has shown high levels of acceptability for a device sending HIV RDT results to online databases in real time32. An array of approaches has been piloted at small scale (n ≤ 283) and has shown good performance. However, most require a physical attachment such as a dongle (92–100% sensitivity, 97–100% specificity)33, a cradle34 or a portable reader (97–98% sensitivity)35, which increases cost and complexity, and these are typically reliant on simple image analysis software.

We explore the potential of deep learning algorithms to classify field-based RDT images as either positive or negative, focusing on HIV as an exemplar and piloting at scale in population ‘test beds’ in KwaZulu-Natal, typical of semi-rural settings in sub-Saharan Africa. Figure 1 shows the concept of our deep learning–enabled REASSURED diagnostic system to capture and interpret RDT results. Our approach first involved building a large library of field-acquired test images as a training dataset, then optimizing algorithms for high sensitivity and specificity, and finally deploying our classifier in a pilot study to assess its performance compared to traditional visual interpretation by a range of end users with varying levels of training.

Fig. 1: Infographic illustrating the benefits of data capture in supporting field decisions.

Current workflow used by fieldworkers (blue); our proposed mHealth system of automated RDT classifier plus data capture and transmission to a secure mHealth database (orange); and the benefits arising from deploying the proposed system (green). Black rectangles represent tablets or smartphones.

Our standard image collection protocol (Fig. 2a) and library are described in Methods. In brief, 11,374 photographs of HIV RDTs were captured by >60 fieldworkers using Samsung tablets (SM-P585, 8-megapixel camera, f/1.9 with autofocus capability). Embedding of routine image collection into staff workflows was acceptable and feasible, and the participant consent rate was 96%. We optimized our mHealth system for the two different HIV RDTs used in the study as part of routine household population surveillance. At first glance these RDTs appear similar, but they have different features and numbers of test lines. To reduce the number of variables, we cropped the images around the region of interest (ROI) (Fig. 2b). Figure 2c shows a snapshot of the very diverse real-world field conditions under which the images were captured (indoors, outdoors, in the shade and in direct sunlight).

Fig. 2: Standardization of image capture, image preprocessing and training library.

a, Fieldworker capturing a photograph of two HIV RDTs at the time of interpretation, in the field in rural South Africa (image credit: Africa Health Research Institute). The two HIV RDTs are fitted in a plastic tray designed to standardize image capture and facilitate image preprocessing. b, Interpretation process, starting from the original picture of HIV RDTs used during the study, preprocessing to select the ROI then interpretation of the test result. If two lines (control + test) are present on the paper strip at the time of interpretation, the test result is positive. Note: for the ABON HIV RDT, one or two different test lines can appear (T1 and T2) depending on the type of HIV infection (HIV-1 and HIV-2, respectively). The test result is positive regardless of which test line is present, or if both test lines are present on the paper strip at the time of interpretation. If only the top line (control) is present, the test is negative; if no control line can be seen, the test is deemed invalid. c, Snapshot of the image library of HIV RDTs collected in the field in rural South Africa (162 randomly selected images out of 11,374), illustrating the diversity of color, background and brightness.

Each image was labeled (Methods) according to the test result. Figure 3a details the number of images used to train classifiers to automatically read the result of HIV RDT images. The training process is described in Methods. To test the reproducibility of the process, we performed a tenfold cross-validation. As can be seen in Fig. 3b, the average sensitivity (95.9 ± 5.1% for type A, 98.7 ± 1.7% for type B) and specificity (99.0 ± 0.6% for type A, 99.8 ± 0.2% for type B) achieved across the ten folds was high and consistent for both types of HIV RDT. We therefore used all available data to train a final classifier for each type of test; these classifiers were then used in our field study. We investigated common classification methods in use for clinical diagnostics (support vector machine36 (SVM) and convolutional neural networks (CNNs)), including three different CNN architectures (ResNet50 (ref. 37), MobileNetV2 (refs. 38,39) and MobileNetV3 (ref. 40)), and found MobileNetV2 the most appropriate for our task, as can be seen in Fig. 3c.
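To make the training setup concrete, the sketch below shows a minimal MobileNetV2 transfer-learning classifier of the kind described above, assuming TensorFlow/Keras. The input size, classifier head and hyperparameters are illustrative assumptions, not the exact configuration used in our study.

```python
# A minimal sketch of a MobileNetV2 transfer-learning classifier for RDT
# images, assuming TensorFlow/Keras. Input size, head layers and learning
# rate are illustrative assumptions.
import tensorflow as tf

def build_rdt_classifier(input_shape=(224, 224, 3)):
    # MobileNetV2 backbone pretrained on ImageNet, without its top layer.
    base = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights="imagenet")
    base.trainable = False  # train only the new classification head first

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dropout(0.2),
        # Single sigmoid unit: probability that the RDT image is positive.
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.Recall(name="sensitivity")])
    return model
```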

Fig. 3: Algorithm training and performance.

a, Table showing the number of images in the training library, divided into two label categories (positive and negative), as well as two subcategories corresponding to the test type. b, Table summarizing the training process using cross-validation, with a training set of n = 3,998 (type A) and n = 6,221 (type B). Sensitivity and specificity were obtained using a hold-out testing dataset of n = 445 (type A) and n = 693 (type B). c, Barplots showing the average performance (sensitivity and specificity) of four classification methods trained on our dataset, using cross-validation (error bars represent s.d. from the mean). The three CNNs pretrained on the ImageNet dataset (ResNet50, MobileNetV2 and MobileNetV3) were retrained and tested using our dataset. The SVM was trained using features extracted by the histogram of oriented gradients. All four classifiers were trained using the training set described in b. Sensitivity and specificity were obtained using the hold-out testing dataset described in b.

We then conducted a field pilot study in rural South Africa to assess the performance of our mHealth system compared to visual interpretation, with a range of end users having varying levels of training (Methods). Five participants (two nurses and three newly trained community health workers) were each asked to give their interpretation of 40 HIV RDTs and to acquire a photograph of each RDT via the application. The plastic trays used to collect the image library were not used in this pilot study. All five participants (100%) were able to use our mHealth system after only a brief demonstration (Methods), demonstrating its feasibility and acceptability. The photographs were then evaluated by an expert RDT interpreter, followed by our deep learning algorithms on a secure server. The results were not fed back to the study participants, to avoid confirmation bias. The performance results can be seen in Fig. 4.

Fig. 4: Performance evaluation of our mHealth system compared to traditional visual interpretation: field pilot study.

a, Graphics showing the agreement (%) between pairs of study participants when asked to interpret HIV RDT results using traditional visual interpretation. Participants were two experienced nurses (N1, N2) and three community health workers (C1, C2, C3). For each pair of participants there were n = 38 HIV RDTs. Observations are separated according to the two types of HIV RDT used in the study. Purple-bordered area on both graphics highlights agreement between the two experienced nurses, while the orange-bordered area highlights agreement between the three pairs of community health workers. b, Confusion matrices showing the number of TN, FP, FN and TP results when comparing the interpretation of our mHealth system (top row) and traditional visual interpretation (bottom row) to the ground truth. Red matrices on the left include the results for all study participants, which are broken down into experienced nurses (orange matrices) and community health workers (purple matrices). c, Barplots showing the performance index for individual participants. Participants are divided between experienced nurses and community health workers. The performance index is the ratio of the performance of our mHealth system to that of traditional visual interpretation: performance index ≥1 indicates that our mHealth system performed better than (or as well as) traditional visual interpretation. The observations are separated according to the two types of HIV RDT used in the study.

When comparing the traditional visual interpretation of RDTs, we observed varied levels of agreement between participants (61–100%), as can be seen in Fig. 4a. As expected, agreement between nurses (N1 and N2: 100 and 94.4% agreement for test types A and B, respectively) was greater than that between newly trained community health workers (C1, C2 and C3: 80–90 and 61.1–94.4% for test types A and B, respectively). Test type B showed the lower level of agreement. The low level of agreement between participants, and its variability with the type of HIV RDT, were of concern and highlighted the need for a more objective and consistent method to interpret HIV RDTs in the field. The confusion matrices in Fig. 4b demonstrate that our mHealth system reduced the number of errors in reading RDTs. The number of FP results from our mHealth system was lower than that for traditional visual interpretation (0 compared to 11, with 10 of the 11 attributable to community health workers), which translates into an improvement in specificity from 89 to 100% and an improvement in positive predictive value from 88.7 to 100%. Similarly, the number of false-negative (FN) results was just two for our mHealth system compared to four for traditional visual interpretation, which translates into an improvement in sensitivity from 95.6 to 97.8% and an improvement in negative predictive value from 95.7 to 98%. We plotted the ratio of our mHealth system's performance to participant performance, for both sensitivity and specificity (Fig. 4c). All participants had a sensitivity index ≥1 for test type A; four out of five participants (N1, N2, C1 and C2) also had the same index for test type B, demonstrating that our mHealth system was more effective than those participants at reading positive test results. Our system was also more reliable at reading negative tests, because all participants had a specificity index ≥1 for both types of HIV RDT.
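The sensitivity and specificity indices in Fig. 4c reduce to simple ratios; a minimal sketch follows (the function names are illustrative and not taken from our analysis code).

```python
# Performance index as plotted in Fig. 4c: the ratio of the mHealth
# system's metric to a participant's metric. Function names are
# illustrative, not from the study's analysis code.
def sensitivity(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    return tn / (tn + fp)

def performance_index(system_metric: float, participant_metric: float) -> float:
    # An index >= 1 means the mHealth system read the tests at least as
    # accurately as the participant did on that metric.
    return system_metric / participant_metric
```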

We acknowledge the following limitations of our study. First, our pilot study involved a relatively small number of participants (five) although we note this is comparable to other similar pilot studies reported in the field. In future, larger evaluation studies and clinical trials will be needed to assess the performance of the system, involving participants with a broader range of demographics including age, gender and different levels of digital literacy, as well as more expert readers. In addition, future studies would benefit from the inclusion of an invalid test classifier and different mobile phone types with varying camera specifications. Although images were analyzed on a secure server, future analysis could be on-device and thus overcome the need to upload images. We are also currently investigating an image segmentation approach using deep learning for the next iteration of the smartphone application.

To conclude, we have demonstrated the potential of deep learning for accurate classification of RDT images, with an overall accuracy of 98.9%, notably higher than the traditional visual interpretation of study participants (92.1%) and comparable to reported accuracies of 80–97%17. Given that >100 million HIV tests are performed annually, even a small improvement in quality assurance could impact the lives of millions of people by reducing the risk of FP and FN results. We believe our real-world image library is the first of its kind at this scale, and we demonstrate that deep learning models can be deployed with mobile devices in the field, without the need for cradles, dongles or other attachments. This work lays the foundation for deep learning–enabled REASSURED diagnostics, demonstrating that RDTs linked to a mobile device could standardize the capture and interpretation of test results for decision makers, reducing interpretation and transcription errors and workforce training needs. Our findings are based on HIV testing decision support for fieldworkers, nurses and community health workers, but in future could be applicable to decision support for self-testing. We focused on HIV as an exemplar, but the capacity of the classifier to adapt to two different test types suggests that it is amenable to a large range of RDTs spanning both communicable and noncommunicable diseases. This platform could be utilized for workforce training, quality assurance, decision support and mobile connectivity to inform disease control strategies, strengthen healthcare system efficiency and improve patient outcomes and outbreak management. The ideal connected system would link connected RDTs to laboratory systems, whereby remote monitoring of RDT functionality and utilization could also allow health programs to optimize testing deployment and supply management to deliver sustainable development goals and ensure that no one is left behind. The real-time alerting capability of connected RDTs could also support public health outbreak management by mapping ‘hotspots’ for epidemics, including COVID-19, to protect populations.

Methods

Ethics

Ethical approval for the demographic surveillance study was granted by the Biomedical Research Ethics Committee of the University of KwaZulu-Natal, South Africa (no. BE435/17). Separate informed consent was required for the main household survey, the HIV sero-survey, the HIV point of care test and photographs of the HIV test.

Ethical approval for the collection of human blood samples used in the pilot study was granted by the Biomedical Research Ethics Committee of the University of KwaZulu-Natal, South Africa (no. BFCJ 11/18).

Recruitment of participants to the Africa Health Research Institute Population Implementation Platform for the image library

Eligible participants were all individuals aged 15 years and older and resident within the geographic boundaries of the Africa Health Research Institute (AHRI) population intervention program surveillance area (see ref. 41 for the cohort profile). Individuals who had died or outmigrated before the surveillance visit were no longer eligible. There were three contact attempts by the fieldworker team and a further three contact attempts by a tracking team before an individual was considered uncontactable. All individuals in the study gave informed consent. Specifically, all contacted eligible individuals who gave informed consent for this study were offered a rapid HIV test if they were not currently being administered antiretroviral therapy. For children under the age of 18 years, written consent for rapid HIV testing was obtained from the parent or guardian and assent from the participant.

HIV RDT image library collection

The original RDT image library was collected in rural South Africa by a team of 60 fieldworkers between 2017 and 2019. AHRI fieldworkers survey a population of 170,000 people in rural KwaZulu-Natal. Participants were visited at their homes; those giving informed consent were tested for HIV using a combination of two HIV RDTs and, following further consent, a photograph of their two HIV RDTs was captured by the fieldworker on a tablet at the time of interpretation. Both HIV RDTs were used as part of routine demographic surveillance at AHRI. The test type continued to change during this study following recommendations by the South African government, exemplifying the need for robust systems for reading multiple test formats.

While the two HIV RDTs used in this study have their own instructions for use (see manufacturer's instructions), both generally follow the same principle: a drop of blood is collected from the participant's fingertip, delivered to the sample pad, and a drop of chase buffer is used to facilitate sample flow along the length of the paper strip. The result (a combination of one or two lines appearing on the paper strip) is then read after a period of 10–40 min, depending on the type of HIV RDT used.

For minimal disturbance of workflow, a plastic tray designed to hold both HIV RDTs was given to each fieldworker (Fig. 2a). This ensured that fieldworkers were required to capture only one image per participant. The tasks of separating the two HIV RDTs and isolating the ROI used to train the classifier were conducted later, as part of data preprocessing.

A standard operating procedure (SOP) on how to capture the image was cocreated and optimized with the team of fieldworkers; a copy of the SOP can be found in Extended Data Fig. 1. The SOP was designed to minimize the impact of environmental factors, as well as to ensure a standard means of capturing images. All fieldworkers attended a 2-day initial training program during which the objectives of data collection and the design of the plastic tray were clearly explained, and each fieldworker was personally trained and given feedback on how to capture valid photographs. A training protocol was also established to ensure that newly enrolled fieldworkers who did not attend the initial training session could also be trained to capture images for the project. Finally, picture quality assessment sessions were conducted to give the fieldworker team feedback and to ensure that most images were of sufficient quality for use in training the classifier.

All images were captured using Samsung tablets (SM-P585, 8-megapixel camera, f/1.9 with autofocus capability) using the native Android camera application and stored on the device until the end of the day, when they were transferred to a secure database at AHRI. Our mHealth system allows the saving of only one picture per test and per participant to the tablet and uploading to the AHRI database. After anonymization (including stripping of geocoordinates from the image EXIF data), batches of 2,000–3,000 images were securely transferred to University College London team members on a quarterly basis and stored securely in a ‘data-safe haven’ managed by the university.
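The anonymization step can be sketched as below, assuming the Pillow imaging library; rebuilding the image from its pixel data alone ensures that no EXIF metadata, including geocoordinates, survives the copy. Paths and the function name are illustrative.

```python
# A sketch of the EXIF-stripping (anonymization) step, assuming Pillow.
# Rebuilding the image from pixel data alone drops all metadata,
# including GPS geocoordinates.
from PIL import Image

def strip_exif(src_path: str, dst_path: str) -> None:
    with Image.open(src_path) as img:
        pixels = list(img.getdata())
        clean = Image.new(img.mode, img.size)  # fresh image, no metadata
        clean.putdata(pixels)
        clean.save(dst_path)
```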

Levels of both feasibility (93%) and acceptability (98%) of the system used to capture HIV RDT images were high, according to a survey taken by fieldworkers involved in the study.

For the purposes of this study, an initial batch of 11,374 images was used. Because very few invalid results were obtained from the field, we decided, for the purposes of this proof-of-concept study, to focus on training the classifier to distinguish between positive and negative results. To optimize this task, the ROI around each HIV RDT was isolated and used to train the classifier.

Image labeling

All preprocessed images were labeled by a group of three RDT experts (99.2% agreement with fieldworkers' labeling). Labeling is the process of sorting images into categories, which are then used to train the classifier. The categories chosen here correspond to the possible HIV RDT results, that is, positive and negative. We recognize that a third outcome, ‘invalid’, is also possible and needs to be considered when using the system to provide a confident diagnosis. However, the scarcity of invalid test results in our library of images collected by fieldworkers did not allow us to train the classifier on this third category in the present study. We therefore focused training on the two main categories (positive and negative), and are exploring other ways to incorporate the invalid outcome in our mHealth system. This could mean either using data augmentation techniques on the small number of invalid test result images, or adding a preprocessing step to detect the presence of a control line on the image before deciding whether to feed it to the classifier (rejecting images where the control line is absent).
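As an illustration of the control-line pre-check mentioned above, a row-intensity profile over a grayscale ROI can flag whether a dark line is present in the expected region. The window position and intensity-drop threshold below are assumptions for a strip with the control line near the top, not a validated part of our pipeline.

```python
# A sketch of a control-line pre-check on a grayscale ROI, assuming NumPy.
# Window position and threshold are illustrative assumptions.
import numpy as np

def has_control_line(roi_gray: np.ndarray,
                     window=(0.05, 0.25), drop=0.15) -> bool:
    # Average intensity of each row across the strip width.
    profile = roi_gray.mean(axis=1)
    top, bottom = (int(f * len(profile)) for f in window)
    # A reactive line binds dye, so its rows are darker than the background.
    background = np.median(profile)
    return profile[top:bottom].min() < background * (1.0 - drop)
```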

Training library

The labeled images were divided into two subcategories corresponding to the HIV RDT type. The two types of test in our library are:

  • Type A: ABON HIV 1/2/O Tri-Line Human Immunodeficiency Virus Rapid Test Device (whole blood/serum/plasma) (ABON Biopharm (Hangzhou) Co., Ltd)

  • Type B: ADVANCED QUALITY ONE STEP Anti-HIV (1&2) Test (InTec PRODUCTS, INC.).

While two tests were administered per patient, in this study we treat each test individually since the tests are from different manufacturers and therefore could respond differently to the same blood sample. The collection system design also guaranteed that there was never more than one image of a given test per participant.

Image normalization

Before being used for training, each image was resized to the dimensions of the input layer and then standardized. Standardization of the data was performed using equation (1) below, where xs is the standardized pixel value, xo the original pixel value and μ and σ are the mean and s.d. of all pixels in the image, respectively.

$$x_{\mathrm{s}} = \frac{x_{\mathrm{o}} - \mu}{\sigma} \qquad (1)$$
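Equation (1) translates directly into code; a minimal sketch assuming a NumPy array of pixel values:

```python
# Equation (1) applied per image, assuming a NumPy array of pixel values.
import numpy as np

def standardize(image: np.ndarray) -> np.ndarray:
    # Subtract the image mean and divide by its s.d., so every image
    # enters the network on a comparable intensity scale.
    return (image - image.mean()) / image.std()
```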

Cross-validation

Each dataset (one for each type of HIV RDT) was randomly divided into ten equal folds. Using a leave-one-fold-out approach, ten classifiers were trained, each using nine folds as the training set (further randomly divided into 80% training and 20% validation). To account for imbalanced datasets (roughly 13:1 negative:positive ratio), we forced every batch during training to contain 50% positive and 50% negative images using random sampling. Each model was then optimized by creating a receiver operating characteristic curve using the validation set. This yielded an optimal threshold, which was used to evaluate model performance on the testing set (the remaining tenth fold). The deployment models were obtained by retraining on all available data for each type of HIV RDT. All training and evaluation were conducted using the scikit-learn and TensorFlow libraries in Python.
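The balanced-batch sampling and threshold-selection steps can be sketched as below, assuming NumPy and scikit-learn. Note that the use of Youden's J as the optimality criterion on the receiver operating characteristic curve is an assumption; the criterion used to select the optimal threshold is not specified above.

```python
# Sketches of balanced batch sampling (countering the ~13:1 imbalance) and
# ROC-based threshold selection on the validation fold, assuming NumPy and
# scikit-learn. Youden's J as the optimality criterion is an assumption.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)

def balanced_batch(pos_idx, neg_idx, batch_size):
    # Each training batch holds 50% positive and 50% negative images.
    half = batch_size // 2
    return np.concatenate([rng.choice(pos_idx, half),
                           rng.choice(neg_idx, half)])

def optimal_threshold(y_val, scores_val):
    # Sweep the ROC curve and return the threshold maximizing
    # Youden's J (sensitivity + specificity - 1).
    fpr, tpr, thresholds = roc_curve(y_val, scores_val)
    return float(thresholds[np.argmax(tpr - fpr)])
```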

Comparison with established classification methods

The SVM was trained using preprocessed features extracted with the histogram of oriented gradients, with principal component analysis (PCA) used to filter out less important features. The three CNNs (ResNet50, MobileNetV2 and MobileNetV3) were pretrained on the ImageNet dataset and then retrained using our dataset. For all four methods, training and evaluation were conducted using the scikit-learn and TensorFlow libraries in Python.
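A minimal sketch of this SVM baseline follows, assuming scikit-image for the histogram of oriented gradients (HOG) and scikit-learn for PCA and the SVM; the hyperparameters are illustrative, not those used in our comparison.

```python
# A sketch of the SVM baseline: HOG features -> PCA -> SVM, assuming
# scikit-image and scikit-learn. Hyperparameters are illustrative.
import numpy as np
from skimage.feature import hog
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def hog_features(images: np.ndarray) -> np.ndarray:
    # images: array of shape (n, H, W), grayscale ROIs of equal size.
    return np.stack([hog(img, pixels_per_cell=(8, 8)) for img in images])

def build_svm_baseline(n_components: int = 50):
    # PCA filters out the less informative HOG dimensions before the SVM.
    return make_pipeline(PCA(n_components=n_components), SVC(kernel="rbf"))

# Usage: clf = build_svm_baseline(); clf.fit(hog_features(X_train), y_train)
```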

Android application

We developed a smartphone/tablet Android application designed for end users to capture a picture of their HIV RDT at the time of reading the test result. Together with end users, we optimized the design to maximize the simplicity of the process, making our mHealth system accessible to end users with a broad range of digital literacy. All that is required from the end user is to roughly align a semitransparent template of the HIV RDT with their HIV RDT and press a button to capture an image. Cropping around the ROI is then performed automatically in the background (using the pixel coordinates of the template overlay), as is the process of sending the ROI to our classifier and receiving our mHealth system result. For the purpose of this pilot study, participants were not shown our mHealth system's interpretation of the test results, to avoid biasing their own interpretation. Screenshots of the application can be found in Extended Data Fig. 2.
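Because the end user aligns the RDT with a fixed on-screen template, the ROI can be cut out at constant pixel coordinates; a minimal sketch assuming Pillow is given below, with hypothetical coordinates.

```python
# A sketch of the template-based ROI crop, assuming Pillow. The box
# coordinates are hypothetical; in practice they come from the position
# of the semitransparent template overlay on screen.
from PIL import Image

ROI_BOX = (420, 310, 620, 1150)  # (left, upper, right, lower), hypothetical

def crop_roi(photo_path: str) -> Image.Image:
    with Image.open(photo_path) as img:
        img.load()  # read pixel data before the file handle closes
        return img.crop(ROI_BOX)
```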

Field pilot study protocol

The Android application was deployed in a field pilot study in KwaZulu-Natal, South Africa. Five participants were randomly selected from the staff at AHRI: two experienced nurses and three community health workers. Forty HIV RDTs (20 type A, 20 type B) were performed following the manufacturer's guidelines using discarded, anonymized human blood samples (ten positive, ten negative according to enzyme-linked immunosorbent assay). For each of the 40 HIV RDTs, every participant was asked to record their visual interpretation of the test result and then to use our mHealth system on a tablet to capture a photograph of the HIV RDT. The system consisted of our Android application (described above) installed on a single Samsung SM-P585 tablet, identical to those used by fieldworkers for data collection. Participants were not shown the automated interpretation of the test result provided by our mHealth system, to avoid confirmation bias. The field pilot study took place at the AHRI rural site in the heart of the community (Mtubatuba, KwaZulu-Natal) under lighting conditions identical to those under which the mHealth system is intended to be used. A short (10-min) demonstration on how to use the smartphone application was given to all participants, who were then left on their own to proceed with the task of reading the HIV RDTs and capturing images.

Field pilot study data analysis

The data analysis consisted of the comparison of three datasets:

  1. Traditional visual interpretation by study participants

  2. Independent expert interpretation of the images captured by study participants

  3. Automated machine learning interpretation by our classifier.

Traditional visual interpretation was recorded on the tablet by each study participant immediately after being shown the HIV RDTs. Only two of the 40 HIV RDTs (corresponding to ten images out of 200) had to be discarded from the analysis, because one participant took a photograph of the wrong HIV RDTs and it was therefore not possible to compare interpretation results across all five participants.

An independent RDT expert subsequently visually interpreted all 190 HIV RDT images; this expert had substantial experience conducting performance evaluations of lateral flow rapid tests for ocular and genital Chlamydia trachomatis in the Philippines, the Gambia and Senegal. Visual interpretation was performed 1–5 h after sample addition. The independent expert certified that none of the HIV RDT results had changed during this time frame.

The automated machine learning interpretation by our classifiers was processed on our secure server. The results were compared to those of traditional visual interpretation (shown in the confusion matrices in Fig. 4), and the independent expert then analyzed the results using the performance indicators described below.

Performance indicators

The four indicators of performance investigated were sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV). For each image, the classifier produces an outcome that belongs to one of four categories: TP, true negative (TN), FP or FN. Whether the outcome is true or false depends on comparison with the chosen gold standard.

Sensitivity is the ability of the classifier to correctly detect a positive result, measured by the ratio \(\frac{\mathrm{TP}}{\mathrm{TP+FN}}\), while specificity is the ratio \(\frac{\mathrm{TN}}{\mathrm{TN+FP}}\) and reflects the ability of the classifier to correctly detect a negative result. PPV is the ratio \(\frac{\mathrm{TP}}{\mathrm{TP+FP}}\) and NPV is the ratio \(\frac{\mathrm{TN}}{\mathrm{TN+FN}}\). These indicate the proportions of positive and negative results, as determined by a diagnostic test, that are true positives and true negatives, respectively.
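All four indicators can be computed from a single confusion matrix; a minimal sketch assuming scikit-learn, with labels ordered (negative, positive):

```python
# The four performance indicators from a confusion matrix, assuming
# scikit-learn, with label order (0 = negative, 1 = positive).
from sklearn.metrics import confusion_matrix

def performance_indicators(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }
```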

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.