Use of databases for clinical research
- Correspondence to Dr Yoon K Loke, Norwich Medical School, University of East Anglia, Norwich Research Park, Norwich NR4 7TJ, UK
- Received 15 September 2013
- Revised 14 January 2014
- Accepted 16 January 2014
- Published Online First 31 January 2014
Databases are electronic filing systems that have been set up to capture patient data in a variety of clinical and administrative settings. While randomised controlled trials are the gold standard for the evaluation of healthcare interventions, electronic databases are valuable research options for studies of aetiology and prognosis, or where trials are too expensive/not logistically feasible. However, databases exist in many different settings and formats (often developed for administrative or financial reimbursement purposes rather than clinical research), and researchers need to put careful thought into identifying and acquiring relevant data sets. Accuracy of records and validation of diagnoses are key issues when planning a database study. High-quality databases can readily capture outcome data (as part of routine clinical care) without the costs and burden of additional trial-related follow-up, and there are promising hybrid models which combine the benefits of randomisation with the efficiency of outcome ascertainment using existing databases.
Most, if not all, readers of the journal will be familiar with the gold standard of healthcare interventional research: high-quality randomised controlled trials (RCTs) that provide unbiased evidence to guide treatment decisions. However, there are many situations where the clinical question requires something other than an RCT. These include, for instance, research into aetiological factors (‘what are the genetic determinants of childhood leukaemia in Africa?’) or prognostic indicators (‘what are the clinical features that indicate high risk of mortality in children with pneumonia?’). Equally, there may be logistical, financial or ethical reasons why a clinical trial is not feasible in a particular clinical setting. Let us take, for example, recent suspicions regarding an increased risk of Clostridium difficile diarrhoea in patients receiving proton pump inhibitors (PPIs).1 It is not entirely clear if the problem is restricted to hospitalised older adults, or whether it might also involve children who are prescribed PPIs. While it would be theoretically possible to conduct a placebo-controlled RCT by randomising some children to PPIs (accompanied by regular stool collection to detect C difficile infection), there are potentially insurmountable hurdles in satisfying ethical constraints and in persuading participants to join the trial. Rather than leaving this important question unanswered, would an alternative option be to interrogate the microbiology database for positive/negative C difficile stool cultures in children and then check against the inpatient electronic prescribing database to determine whether PPIs had been used or not?
From the above example, readers should note the need to carefully select a database that would be the best fit for addressing a particular research question. The key limitation here is that in most instances, researchers will have to make do (or tailor their methods to align) with existing information that has already been collected in a particular database (which may or may not have been originally set up with clinical research in mind). Relatively few clinicians have the time or money to construct and customise a large-scale database de novo. Hence, a good understanding of the coverage and validity of existing data sets is essential for the development of a research proposal.
What are the types of database that might be used in clinical research?
For the purposes of this article, we consider a clinical database to be an electronic filing system, where pieces of information (‘records’) regarding healthcare users are kept in a structured format. Each patient record has a number of fields (eg, medication, diagnoses, laboratory results, healthcare episodes) so that data about that person can be recorded in a (hopefully) systematic and accurate manner. Databases can have multiple roles in the healthcare setting and can range from a small focused registry (eg, children with congenital heart disease undergoing cardiac surgery at a single specialist hospital) to extensive nationwide networks covering general practice data on millions of people in England, such as the Clinical Practice Research Datalink (CPRD, http://www.cprd.com) or The Health Improvement Network Database (THIN, http://csdmruk.cegedim.com). There are of course many databases that lie between the small specialist ones and large population registries, and a number of examples are presented in table 1 to illustrate the substantial diversity among databases.
It is clear that the nature and depth of the information captured in the specialist care setting will differ from the items recorded by general practitioners. Focused databases are particularly useful when considering rare diseases or when there are specific concerns about uncommon, novel adverse effects. These databases can be constructed to capture data on patients who share a particular condition or are undergoing a specific treatment, including their clinical features, outcomes and results of laboratory or radiological tests. Hence, researchers may be able to address more specific research questions or finer points through the use of such focused databases. In contrast, the strength of population databases lies in the availability of data on millions of people, which facilitates the identification of important signals for future hypothesis-testing studies. However, population databases may be less able to capture finer details or results of complicated investigations in patients who are overseen by specialist teams. A recent survey conducted by Neubert et al2 provides a comprehensive report on the key strengths and limitations of 17 large-scale databases that can be used for paediatric medicines research (drug utilisation or drug safety) in Europe.
Moreover, while there are many databases focused on acquisition of clinical data, it is also worth being aware of other types of healthcare databases that can yield valuable information. Financial or administrative databases are typically run by healthcare providers or insurers (eg, the Veterans Administration in the USA) to keep track of patients and their use of health services. Some countries (for instance, Denmark and Taiwan) have wide coverage through national prescribing registers and hospital episode registers that capture healthcare utilisation throughout almost their entire population. While these databases are not specifically designed for research, the available data may encompass items such as length of stay, re-admissions, attendance for procedures or operations, prescriptions redeemed at pharmacies, and discharge diagnoses. As an example, researchers who are evaluating the relationship between anticholinergic drugs and acute urinary retention may run a search of a regional healthcare administrative database for patients recorded as having undergone urinary catheterisation in emergency departments, and check which drugs those patients had recently redeemed prescriptions for.3
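The look-back step in the urinary retention example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the record layout and the 30-day window are hypothetical choices made here, and are not taken from the cited study.

```python
from datetime import date, timedelta

def recent_drugs(event_date, redemptions, window_days=30):
    """Return drugs redeemed within `window_days` before (or on) the event date.

    `redemptions` is a list of (redemption_date, drug_name) pairs for one patient.
    The 30-day default is an arbitrary illustrative window, not a recommendation.
    """
    start = event_date - timedelta(days=window_days)
    return sorted({drug for d, drug in redemptions if start <= d <= event_date})

# Invented pharmacy redemption history for a single patient.
history = [
    (date(2013, 1, 5), "oxybutynin"),
    (date(2012, 10, 1), "amitriptyline"),
    (date(2013, 1, 20), "paracetamol"),
]

# Date of the emergency department catheterisation (also invented).
exposed_to = recent_drugs(date(2013, 1, 25), history)
```

In practice the choice of exposure window is a protocol decision that should be justified clinically and fixed before the search is run.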
Selection of the appropriate type of database is a critical first step in addressing a research question. For instance, development of a pneumonia prognostic score based on respiratory rate and serum pH would be best done by accessing an Intensive Care or Emergency Department database, and there would be little value in searching general practice records. Conversely, primary care databases would be well suited to determining whether or not non-steroidal anti-inflammatory drugs are associated with an increased rate of asthma in children. Specific examples of research questions and choice of database are listed in table 2.
Nevertheless, some of these barriers (or information missing from the specific database) can be overcome by electronic record linkage, for instance where the patient's hospital number is used to link the inpatient prescription database to the microbiology stool results database so that any association between PPI and diarrhoea can be evaluated. Record linkage can be broadened further to encompass mortality data that are routinely captured by the government, so that it is possible to ascertain the vital status of patients who did not attend follow-up because they had passed away. Equally, patient symptoms (eg, epigastric pain or gastrointestinal discomfort) may be difficult to capture in healthcare databases, and creative researchers can resort to proxy indicators such as ‘new prescription of PPI’ in an attempt to capture such events.
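The record linkage described above can be illustrated with a minimal Python sketch. All field names, drug lists and records below are invented for the purpose of the example; a real linkage would also need to handle duplicate tests, timing of exposure relative to the stool sample, and mismatched identifiers.

```python
from collections import Counter

# Hypothetical extracts from two separate hospital databases.
prescriptions = [
    {"hospital_no": "H001", "drug": "omeprazole"},
    {"hospital_no": "H002", "drug": "paracetamol"},
    {"hospital_no": "H003", "drug": "lansoprazole"},
    {"hospital_no": "H004", "drug": "amoxicillin"},
]
stool_results = [
    {"hospital_no": "H001", "c_difficile": True},
    {"hospital_no": "H002", "c_difficile": False},
    {"hospital_no": "H003", "c_difficile": False},
    {"hospital_no": "H004", "c_difficile": False},
    {"hospital_no": "H005", "c_difficile": True},  # no matching prescription record
]

# Illustrative list of PPI names; a real study would use a validated drug code list.
PPI_DRUGS = {"omeprazole", "lansoprazole", "esomeprazole", "pantoprazole", "rabeprazole"}

def link_records(prescriptions, stool_results):
    """Link the two extracts on hospital number and tabulate exposure against outcome.

    Returns a Counter keyed by (ppi_exposed, c_difficile_positive) and a list of
    hospital numbers that appear in the stool results but could not be linked.
    """
    known = {p["hospital_no"] for p in prescriptions}
    exposed = {p["hospital_no"] for p in prescriptions if p["drug"] in PPI_DRUGS}
    table, unlinked = Counter(), []
    for result in stool_results:
        pid = result["hospital_no"]
        if pid not in known:
            unlinked.append(pid)  # report these rather than silently dropping them
            continue
        table[(pid in exposed, result["c_difficile"])] += 1
    return table, unlinked

table, unlinked = link_records(prescriptions, stool_results)
```

Keeping a separate list of unlinked records matters because incomplete linkage is itself a potential source of bias and should be reported.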
What administrative and governance issues should be considered?
Availability of, and access to, data sets is another key consideration, and both depend on the nature of the database. Researchers cannot expect simply to be handed a spreadsheet and key in a few terms to fish out interesting records. Aside from ensuring compliance with ethical, confidentiality and data integrity requirements (which vary considerably with each database and are beyond the scope of this paper), careful planning is required when requesting permission to access and search the database for relevant records. Most of the large-scale databases have dedicated staff for developing and running searches. However, given the administrative fees involved for each search, it is best to have a finalised protocol before embarking on this work. Familiarity with the format of the fields in the particular database is an essential prerequisite, whether doing your own search or working with the database experts. Some researchers have reported a considerable administrative burden, encompassing many months of work and substantial financial expenditure, in obtaining access to population databases in the UK.4 They highlighted a cost of up to £7000 per data set (for what seemed to be a relatively simple search), and an interval of 1–6 months from application to receipt of the requested data. In addition, a total of 59 days of a senior data analyst's time was consumed in successfully sourcing data sets from seven different providers.
What are the problems with format, accuracy and validity of data sets?
Although database programmers often set up automated rules to ensure accuracy of the data set, it would be unwise to assume that there are no rogue values or howlers in the records. The simplest things can stymie the best efforts of researchers when faced with a disparate set of medical records from hundreds of different clinicians and general practices. For instance, where the oxygen level in a particular record is shown with a value of ‘90’, was this value obtained from an arterial blood gas (partial pressure in mm Hg) or via a pulse oximeter (saturation in percentage)? Equally, instead of recording continuous values of glomerular filtration rate (mL/min), some databases may have chosen to classify patients by stage of chronic kidney disease. There are also simple keying-in errors that are very difficult to detect, for instance if a diastolic blood pressure of 90 mm Hg is entered into the systolic blood pressure field by mistake.
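A simple plausibility check of the kind described above might look like the following Python sketch. The field names and rules are hypothetical, and the check is deliberately limited: it can flag a record where systolic and diastolic values look swapped, or where an oxygen value has no unit, but it cannot detect a misplaced value when the other field is empty.

```python
def flag_record(rec):
    """Return a list of data-quality flags for one record (an empty list means none raised).

    `rec` is a dict with optional keys: systolic, diastolic, oxygen, oxygen_unit.
    These field names are invented for illustration.
    """
    flags = []
    sbp, dbp = rec.get("systolic"), rec.get("diastolic")
    # Systolic pressure should exceed diastolic; otherwise the fields may be swapped.
    if sbp is not None and dbp is not None and sbp <= dbp:
        flags.append("systolic not above diastolic")
    # A bare oxygen value is ambiguous: partial pressure (mmHg) or saturation (%)?
    if rec.get("oxygen") is not None and rec.get("oxygen_unit") not in {"mmHg", "%"}:
        flags.append("oxygen value has no recognised unit")
    return flags

swapped = flag_record({"systolic": 90, "diastolic": 120})
ambiguous = flag_record({"systolic": 120, "diastolic": 80, "oxygen": 90, "oxygen_unit": None})
clean = flag_record({"systolic": 120, "diastolic": 80, "oxygen": 90, "oxygen_unit": "%"})
```

Flagged records are best reviewed rather than automatically corrected, since the true value usually cannot be recovered from the database alone.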
Validating the specific diagnosis is a major problem in most databases. This stems from heavy reliance on the accuracy of discharge letters (or even cause of death on certificates) and the subsequent electronic coding that takes place. While larger databases have carried out some degree of diagnostic validation,5 most clinicians will be familiar with instances where the correct diagnosis has not been reached, or appropriately coded. Moreover, certain events such as coronary ischaemic events may be coded under several different headings such as ‘acute coronary syndrome’ or ‘unstable angina’ or ‘myocardial infarction’, and great care is therefore needed in identifying relevant outcomes from database searches.
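One way to guard against missing events that are coded under different headings is to define the outcome as an explicit set of labels before running the search. The Python sketch below uses only the three headings mentioned above; a real study would need a validated list of diagnostic codes, and the function and variable names here are hypothetical.

```python
# Headings taken from the text; a real code list would be longer and code-based.
CORONARY_ISCHAEMIC_LABELS = {
    "acute coronary syndrome",
    "unstable angina",
    "myocardial infarction",
}

def is_coronary_event(diagnosis_label):
    """True if the label matches one of the headings grouped as a coronary ischaemic event."""
    return diagnosis_label.strip().lower() in CORONARY_ISCHAEMIC_LABELS

def count_coronary_events(diagnosis_labels):
    """Count the records whose label falls under the composite outcome."""
    return sum(1 for label in diagnosis_labels if is_coronary_event(label))

# Invented discharge diagnoses with inconsistent capitalisation and spacing.
discharges = ["Myocardial infarction", " unstable angina", "pneumonia", "Acute Coronary Syndrome"]
n_events = count_coronary_events(discharges)
```

Exact matching after normalising case and whitespace is a conservative choice: it avoids false positives from partial matches, at the cost of missing misspelt or free-text entries.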
Databases and clinical research—what the future brings?
There are two main barriers to the wider use of databases in clinical research. In the preceding discussion, we have highlighted issues with the availability, relevance and validity of clinical data within databases. Many of these issues can be overcome with increasing sophistication of database design and technology, including greater linkage and automated capture/transfer between administrative and clinical databases, at primary and secondary care level. Equally, greater awareness and enhanced training can also improve engagement of healthcare personnel and quality of data input into the system.
The second barrier is the risk of bias from lack of randomisation, which until now has been perceived as a major obstacle to using databases when comparing the effects of healthcare interventions. This is a shame because high-quality clinical databases can allow rapid identification of potentially suitable participants, and easily capture outcome data (as part of routine clinical care) without the costs and burden of additional trial-related procedures and follow-up visits. A recent coronary intervention trial (using hybrid methodology) nevertheless illustrates that it may well be possible to have the best of both worlds: patients can be randomised to different interventions, while their outcomes are efficiently captured through existing clinical registries.6,7 Given the rising complexity and bureaucracy (as well as costs) of conducting RCTs, the randomised registry trial offers a tantalising glimpse of a future of rapid, inexpensive large-scale comparative research studies.7
Contributors YKL contributed fully to the drafting of this manuscript.
Competing interests None.
Provenance and peer review Commissioned; externally peer reviewed.