Data mining of cancer vaccine trials: a bird's-eye view
- Research
- Open Access
Data mining of cancer vaccine trials: a bird's-eye view
- Xiaohong Cao1_27Email author,
- Karen B Maloney2_27 and
- Vladimir Brusic2_27
- Received: 19 April 2008
- Accepted: 12 December 2008
- Published: 12 December 2008
Abstract
Background
A wealth of information on clinical trials has been provided by publicly accessible online registries. Information technology and data exchange standards enable rapid extraction, summarization, and visualization of information and derived knowledge from these data sets. Clinical trials data was extracted in the XML format from the National Library of Medicine ClinicalTrials.gov site. This data includes categories such as 'Summary of Purpose', 'Trial Sponsor', 'Phase of the Trial', 'Recruiting Status', and 'Location'. We focused on 645 clinical trials related to cancer vaccines. Additional facts on cancer types, including incidence and survival rates, were retrieved from the National Cancer Institute Surveillance data.
Results
This application enables rapid extraction of information about institutions, diseases, clinical approaches, clinical trials dates, predominant cancer types in the trials, clinical opportunities and pharmaceutical market coverage. Presentation of results is facilitated by visualization tools that summarize the landscape of ongoing and completed cancer vaccine trials. Our summaries show the number of clinical vaccine trials per cancer type, over time, by phase, by lead sponsors, as well as trial activity relative to cancer type and survival data. We also have identified cancers that are neglected in the cancer vaccine field: bladder, liver, pancreatic, stomach, esophageal, and all of the low-incidence cancers.
Conclusion
We have developed a data mining approach that enables rapid extraction of complex data from the major clinical trial repository. Summarization and visualization of these data represents a cost-effective means of making informed decisions about future cancer vaccine clinical trials.
Keywords
- Cancer Vaccine
- Clinical Trial Data
- Vaccine Trial
- Data Mining Approach
- Data Mining Application
Background
The World Health Organization's Global Burden of Disease statistics identified cancer as the second largest global cause of death, after cardiovascular disease [1]. Cancer is the fastest growing segment of the disease burden; global cancer deaths are projected to increase from 7.1 million in 2002 to 11.5 million in 2030 [2]. Advances in prevention, diagnostics and treatment of cancer have contributed to the improved prognosis for cancer patients: one third of cancers are preventable and another third are curable through early detection and effective therapy [3]. New cancer therapies are subject of vigorous research including the application of new high-throughput biomedical technologies that generate large amounts of data. The related information explosion mandates the use of biomedical bioinformatics.
Researchers and clinicians need rapid access to multiple types of information, including molecular, clinical, and literature databases, and clinical trials registries, as well as suitable data analysis tools. The National Center for Biotechnology Information (NCBI) hosts resources for retrieval and analysis of bio-molecular data. These are accessible through the NCBI's web site [4]. Clinical trials data are available through clinical trials registries. Since 1971, regulatory efforts on clinical trials registration have resulted in a significant level of compliance by both sponsors and conductors of clinical trials [5]. ClinicalTrials.gov [6] has emerged as the largest registry in the world [7]. Currently it contains information on 50,000 clinical trials including 16,000 cancer-related entries. The US National Cancer Institute (NCI) provides access to a clinical trials registry within the Physician Data Query (PDQ) database [8]. The PDQ contains (Dec 2007) 21,000 abstracts of cancer-related clinical trials and regularly exchanges data with the ClinicalTrials.gov registry. Data sharing and direct access to resources (researchers, computers, software, data, research participants, and other) are considered critical for the advancement of cancer research and the improvement of health care. The initiatives such as Cancer Research Network [9] and Cancer Biomedical Informatics Grid [10] provide the framework for integration of various data types and tools for cancer research. Standardized data formats (e.g. demographics, health plan eligibility, tumor registry, inpatient and ambulatory utilization, medication dispensing, laboratory tests, imaging procedures, others) [9] facilitate access to and sharing of data and automated analysis.
Knowledge discovery from databases, also known as data mining, is an emerging field that applies techniques from databases, statistics and artificial intelligence to extract high-level information (knowledge) from a large volume of low-level data. Examples of high-level information derived from low-level data include forms that are more compact (e.g., short reports), more abstract (e.g., descriptive models of the process that generated data), or more useful (e.g., predictive models for estimating values of the future cases) than existing low-level data [11]. Mining clinical trials data usually refers to using statistical and modeling tools for analysis and design of clinical trials. If appropriate clinical trials data (e.g. aims, goals, regimens and conditions, end points, sample sizes, and others) are stored in the registry, data mining can help design better, more efficient trials that require smaller patient cohorts. Standardized data formats, such as XML markup language [12], bring text files into machine-readable form, thus enhancing automated analysis. Both NCI PDQ and ClinicalTrials.gov provide clinical trial registry data in the XML format.
We have developed a data mining approach for rapid summarization and visualization of information from clinical trial registries. This method has been applied to the analysis of cancer vaccine trials and provides convenient means of extraction and presentation of key data about cancer vaccine trials. The significant progress in cancer biology and cancer immunology has not yet been fully translated into successful clinical vaccine applications [13]. Though advances in cancer vaccine development have been reported [e.g. [14, 15]], a wide variety of factors are involved in tumor immune escape, making design and production of effective therapeutic vaccines difficult. Increasing knowledge of possible limiting factors include, among others, dysfunction of the immune system, immunosuppressive effects of tumor microenvironment, production of suppressor T cells, defective antigen processing and presentation, and immunotherapy resistance of established tumors. To utilize the accumulated data and knowledge and translate these into improved clinical trials, integration of basic and clinical immunology and improved data processing capabilities is required. Our data mining approach provides better understanding of the cancer vaccine clinical trials landscape, and enables rapid analysis of the hotspots of cancer vaccine activity, as well as the identification of neglected cancers. This report describes the utility of basic data mining techniques of summarization, tabulation, and visualization applied to the clinical trials repository data.
Results
Cancer vaccine trials data mining questions
A key characteristic of our application is its versatility; questions can be formulated and re-formulated rapidly. A complex, composite question also can be defined directly from the interface using several mouse clicks. This application has been designed to dynamically generate graphs, summarizing data across relevant clinical trials.
Sample questions that can be answered by this system.
Examples of cancer vaccine trials landscape questions: |
---|
1. How many vaccine trials have started each year since 1995? |
2. How many of these trials are currently open? |
3. Who are the lead sponsors for these trials? |
4. What are the main vaccine platforms used by these sponsors in the trials? |
5. Do they focus on certain cancer types or on specific vaccine platform? |
6. What are the cancer types that have been interrogated using vaccine strategy? |
7. What are the cancer types that are currently interrogated? |
8. What phase are these trials? |
9. Is there any linkage between cancer prevalence and the numbers of vaccine trials for that particular cancer? |
10. Are there any cancers neglected within the vaccine field? |
The versatility of this system enables the analysis of various dimensions of the clinical trials landscape, including clinical trials by timeline, type of cancer, lead institution, patient population, and/or specific vaccine technology.
Trials over time
Types of cancers most often targeted by vaccine strategy
Vaccine trial counts by cancer types and trial phase.
Cancer Type |
Phase 1 |
Phase 2 |
Phase 3 |
Phase 4 |
Unspecified |
Total |
---|---|---|---|---|---|---|
Melanoma |
43 |
107 |
9 |
7 |
166 |
|
Cervix |
6 |
23 |
35 |
1 |
5 |
70 |
Breast |
37 |
27 |
1 |
4 |
69 |
|
Prostate |
20 |
37 |
4 |
3 |
64 |
|
Lung |
25 |
28 |
5 |
2 |
60 |
|
Ovary |
24 |
11 |
1 |
3 |
39 |
|
Non-Hodgkin lymphoma |
7 |
20 |
7 |
1 |
2 |
37 |
Colon |
11 |
22 |
3 |
36 |
||
Kidney |
7 |
23 |
2 |
32 |
||
Rectum |
9 |
18 |
3 |
30 |
||
Pancreas |
8 |
17 |
3 |
1 |
29 |
|
CML |
9 |
14 |
1 |
24 |
||
AML |
9 |
10 |
1 |
3 |
23 |
|
Myeloma |
5 |
12 |
2 |
2 |
21 |
|
Brain |
11 |
8 |
1 |
20 |
||
Eye |
4 |
14 |
2 |
20 |
||
Other oral cavity |
10 |
5 |
15 |
|||
Pharynx |
10 |
4 |
1 |
15 |
||
Mouth |
10 |
4 |
14 |
|||
All sites |
10 |
3 |
1 |
14 |
||
Liver |
5 |
7 |
2 |
14 |
||
CLL |
8 |
5 |
13 |
|||
Soft tissue |
7 |
5 |
1 |
13 |
||
Stomach |
8 |
3 |
11 |
|||
Esophagus |
6 |
2 |
8 |
|||
Hodgkin lymphoma |
2 |
5 |
1 |
8 |
||
ALL |
3 |
2 |
2 |
7 |
||
Bladder |
5 |
1 |
1 |
7 |
||
Bone |
3 |
2 |
1 |
6 |
||
Vulva |
1 |
4 |
5 |
|||
Testis |
3 |
1 |
4 |
|||
Vagina |
4 |
4 |
||||
Cranial nerves & other nervous system |
1 |
1 |
1 |
3 |
||
Gallbladder & other billiary |
2 |
1 |
3 |
|||
Small intestine |
1 |
1 |
1 |
3 |
||
Anus |
1 |
1 |
||||
Corpus |
1 |
1 |
||||
Other non-epithelial skin |
1 |
0 |
1 |
|||
Penis |
1 |
1 |
||||
Tongue |
1 |
1 |
||||
Total |
189 |
348 |
72 |
2 |
34 |
645 |
Institutions and corporations most active in developing cancer vaccines
Trials by disease prevalence
Technology platforms used by vaccine strategies
Conclusion
A large number of records are available in clinical trial repositories but the analysis is limited to simple queries. This data mining application for summarization and visualization of clinical trials data was developed and applied to the analysis of cancer vaccine clinical trials landscape. It enables complex queries of cancer vaccine trials data through data mining. Data mining revealed hidden patterns, trends and biases in the data. The application enables formulating queries and asking specific questions. It revealed the cancer vaccine trials development over time. The most common cancers targeted by vaccine therapy in clinical trials are: melanoma, cervical, prostate, breast, and leukemia. Neglected cancers include bladder, liver, pancreatic, stomach, esophageal, and all of the low-incidence cancers. Our approach provides a bird's-eye view of the cancer vaccine landscape, and enables rapid analysis of the hotspots of cancer vaccine activity, as well as the identification of neglected cancers.
Discussion
The landscape of clinical vaccine trials has multiple dimensions. These include an historical timeline of cancer vaccine trials, organizations sponsoring vaccine trials, principal technologies, and types of cancers, all of which we analyzed in this study. Many of the fields in the XML files were not included in this study, for example, "minimum age", "maximum age", "inclusion criteria", "exclusion criteria", and others. The selection of fields was defined by the scope of this study: rapid extraction of clinical trials statistics and identification of major activities in the field.
For studies that focus on other questions, such as patient groups, different sets of questions would be asked and different dimensions would be included for selections of relevant fields from the XML files.
The decision by the US Food and Drug Administration (FDA) to publicize information on clinical trials was made to enable public access to data that address questions about the safety of certain drugs [5]. In addition, public access to this data helps counter publication bias, in which positive results are published more easily and faster than negative results [5]. Though a significant quantity of data and information that describe cancer vaccine clinical trials is available, the existing means of converting these data into knowledge are less advanced.
Our data mining process consisted of data preparation and the application of summarization [see [11]] and visualization [see [16]] tools upon extracted specific knowledge from cancer vaccine clinical trials data. By accessing comprehensive clinical trials information using appropriate software tools, several mouse clicks provide access to knowledge that would otherwise require hiring of specialists or consultants. This knowledge is useful for various professionals, including research directors and planners in companies, hospital administrators, researchers and scientists, physicians conducting clinical trials, policy makers, grant officers at funding bodies, and government regulators. By combining public databases of clinical trials, data formatting by XML, and computational analysis and visualization, we have shown that specific knowledge can be extracted, summarized, and presented to the user.
We have implemented a highly flexible system to visualize and mine clinical trial information in the cancer vaccine field. We obtained all data about cancer vaccine trials from ClinicalTrials.gov database. In accordance with the 1997 FDA Modernization Act [17], information from clinical trials on serious or life-threatening diseases must be registered at ClinicalTrials.gov within 21 days of the trial starting date. The ClinicalTrials.gov has become the world's largest clinical trial database and it contains information from clinical trials performed world-wide. Although some cancer vaccine trials may not necessarily be included, ClinicalTrials.gov repository is comprehensive and representative of the world-wide landscape of clinical trials in the cancer vaccine field. Some fields of interest in this study were not included in the original XML files; for example, fields related to specific vaccine technology. We inspected the records, identified vaccine technology, and manually annotated this field using external data sources. In addition, clinical trials records in the back-end database were enriched with fields containing cancer statistics data. Manual annotation of these data has limitations, principally because subsequent studies also need manual annotation. This shortcoming can be addressed either by automation of annotation, or by inclusion of these fields in the clinical trials records. We envision that the power of this system can only increase with the improvement of the content of public databases.
Methods
Data
Of 16,000 cancer trials in ClinicalTrials.gov, some 900 are cancer vaccine or cancer immunotherapy trials. The information provided in the trial descriptions includes "summary of purpose", "trial sponsor", "phase of the trial", "recruiting status", and "location", and other, more specific information. Clinical trials related to cancer vaccines are defined by keywords "cancer" and "vaccine" present in the title or abstract of the trial record. Prior to January 2008, some 642 trials fit this definition. Input primary data consisted of downloaded XML files containing information on cancer vaccine trials.
Additional information on cancer types was retrieved from the Surveillance Epidemiology and End Results (SEER) Cancer Statistics Review (CSR) provided by the National Cancer Institute (NCI). Incidence and survival rates associated with various cancer types were obtained from the SEER/CSR report 2003 [18].
System description
Major cancer vaccine technologies.
Vaccine antigen status |
Vaccine technology |
---|---|
Defined antigen |
Dendritic cell (peptide) |
Dendritic cell (protein) |
|
Dendritic cell (specific RNA) |
|
Dendritic cell (virus) |
|
Viral vector (antigen) |
|
Viral vector (other) |
|
Naked DNA |
|
Protein |
|
Peptides |
|
Anti-idiotype |
|
Antigen-specific adjuvant |
|
Other |
|
Undefined antigen |
Dendritic cells (tumor cell pulsed) |
Dendritic cells (lysate pulsed) |
|
Dendritic cells (allo-lysate pulsed) |
|
Allogeneic Dendritic cells (lysate pulsed) |
|
Dendritic cells (antigen) |
|
Dendritic cells (non-specific DNA) |
|
Dendritic cells (non-specific RNA) |
|
Dendritic cells (idiotype) |
|
Oncolytic virus |
|
Autologous (tumor lysate) |
|
Allogeneic (tumor lysate) |
|
Autologous (tumor cell) |
|
Allogeneic (tumor cell) |
|
Autologous (T cell) |
|
Autologous (lymphocytes) |
|
Autologous (PBMC) |
|
Immunomodulators |
|
Immune activators |
|
Other non-cancer vaccines |
|
Other |
Spotfire DXP Platform Description
We have chosen Spotfire-DXP (version 2.0, spotfire.tibco.com) software to construct the environment for our data mining application. Spotfire-DXP is proprietary software which can be licensed through TIBCO Software Inc. http://www.tibco.com. The Spotfire-DXP is a dynamic tool suitable for summarization and visualization of tabular data. Although other similar tools, such as SAS, Insightful, and others offer more sophisticated statistical power, they are less suitable for summarization, visualization, and tabulation tasks. Spotfire is easy to learn and use because of graphical user interface that allows graphing and tabulation by using drag-and-drop actions.
The software environment consists of four main areas, 'Menu Bar', 'Main Graphing Area', 'Query Filter Panel', and 'Details on Demand'. The top 'Menu Bar' consists of file and data manipulation tools, in addition to icons for generating new graphs. The 'Main Graphing Area' is screen space where user displays results in either graphs or tables. The right 'Query Filter Panel' is automatically populated with column heads when the data table is brought in. The 'Details on Demand' window will only show up when a user wants to investigate a subset of records in more details.
Data mining
Tabular data are constructed from the backend database. Our application uses standard Spotfire-DXP toolbox for extraction and manipulation of data. Fields that we used in this study included 'Cancer Type', 'Incidence Rate', '5-Year Survival (%)', 'Trial ID', 'Brief Summary', 'Lead Sponsor', 'Collaborator', 'Phase', 'Trial Status', 'Vaccine Platform', and 'Trial Start' data.
To illustrate the data mining process, we have used an example in which we answered Question 1 (Table 1) by generating a bar graph shown in Figure 1. The process started by selection of the 'Bar Graph' on the 'Menu Bar'. This generated the default bar graph in the 'Main Graphing Area'. Then, we selected the 'Start Date' field from the 'Query Filter Panel' and dragged it to the x-axis. Similarly, we dragged the 'TrialID' from the 'Query Filter Panel' to the y-axis. We clicked on the y-axis to apply the 'UniqueCount' function to the y-axis. Once this was done, we changed the coloring scheme by clicking the 'Property' icon in the 'Menu Bar' and specified the colors for different phases (green for phase 1, yellow for phase 2 and red for phase 3.). These actions produced the graph on the 'Trials Started per Year' page. Similarly, we constructed other pages that query specific information about lead sponsors, diseases, vaccine technology platform, and geographic locations, as shown in Figure 6. The data sets and application programs are available upon request from the authors.
Declarations
Acknowledgements
We thank Dr. Richard Fitzpatrick for insightful discussion. We are grateful to Genzyme Corporation for the use of programming resources.
Authors’ Affiliations
References
- Mathers CD, Loncar D: Projections of global mortality and burden of disease from 2002 to 2030. PLoS Med 2006,3(11):e442.View ArticlePubMedGoogle Scholar
- Department of Measurement and Health Information Systems: World Health Statistics 2007 World Health Organization, Geneva, Switzerland 2007.Google Scholar
- Department of Management of Noncommunicable Diseases: National Cancer Control Programmes World Health Organization, Geneva, Switzerland 2002.Google Scholar
- Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Edgar R, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Khovayko O, Landsman D, Lipman DJ, Madden TL, Maglott DR, Miller V, Ostell J, Pruitt KD, Schuler GD, Shumway M, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Tatusov RL, Tatusova TA, Wagner L, Yaschenko E: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2007, 35:D5–12.View ArticlePubMedGoogle Scholar
- Zielinski SL: Clinical trials registration efforts gain some ground. J Natl Cancer Inst 2005,97(6):410–411.View ArticlePubMedGoogle Scholar
- ClinicalTrials.gov. National Institutes of Health[http://clinicaltrials.gov/]
- Zarin DA, Ide NC, Tse T, Harlan WR, West JC, Lindberg DA: Issues in the registration of clinical trials. JAMA 2007,297(19):2112–2120.View ArticlePubMedGoogle Scholar
- Thiemann KM, Frost MH, Thompson RA: A multifaceted educational approach to increasing awareness and use of physician data query (PDQ). J Cancer Educ 1999,14(2):78–82.PubMedGoogle Scholar
- Hornbrook MC, Hart G, Ellis JL, Bachman DJ, Ansell G, Greene SM, Wagner EH, Pardee R, Schmidt MM, Geiger A, Butani AL, Field T, Fouayzi H, Miroshnik I, Liu L, Diseker R, Wells K, Krajenta R, Lamerato L, Neslund Dudas C: Building a virtual cancer research organization. J Natl Cancer Inst Monogr 2005, (35):12–25.Google Scholar
- The Cabig Strategic Planning Workspace: The Cancer Biomedical Informatics Grid (caBIG™): Infrastructure and Applications for a Worldwide Research Community. Stud Health Technol Inform 2007, 129:330–334.Google Scholar
- Brusic V, Zeleznikow J: Knowledge discovery and data mining in biological databases. Knowledge Eng Rev 1999, 14:257–277.View ArticleGoogle Scholar
- Saadawi GM, Harrison JH Jr: Definition of an XML markup language for clinical laboratory procedures and comparison with generic XML markup. Clin Chem 2006,52(10):1943–1951.View ArticlePubMedGoogle Scholar
- Aptsiauri N, Cabrera T, Pawelec G, Gouttefangeas C, Derhovanessian E, Garrido F, Garcia-Lora A: International conference: progress in vaccination against cancer-2006 (PIVAC 6), Granada, Spain. Cancer Immunol Immunother 2007,56(8):1311–1322.View ArticlePubMedGoogle Scholar
- O'Rourke MG, Johnson MK, Lanagan CM, See JL, O'Connor LE, Slater GJ, Thomas D, Lopez JA, Martinez NR, Ellem KA, Schmidt CW: Dendritic cell immunotherapy for stage IV melanoma. Melanoma Res 2007,17(5):316–322.View ArticlePubMedGoogle Scholar
- Inogès S, Rodrìguez-Calvillo M, Zabalegui N, Lòpez-Dìaz de Cerio A, Villanueva H, Soria E, Suárez L, Rodríguez-Caballero A, Pastor F, García-Muñóz R, Panizo C, Pèrez-Calvo J, Melero I, Rocha E, Orfao A, Bendandi M, Grupo Español de Linfomas/Trasplante Autologo de Medula Oseo study group; Programa para el Estudio y Tratamiento de Hemopatias Malignas study group: Clinical benefit associated with idiotypic vaccination in patients with follicular lymphoma. J Natl Cancer Inst 2006,98(18):1292–1301.View ArticlePubMedGoogle Scholar
- Thearling K, Becker B, Dennis DeCoste D, Bill Mawby D, Pilote M, Sommerfield D: Visualizing Data Mining Models. Information Visualization in Data Mining and Knowledge Discovery (Edited by: Fayyad U, Grinstein G, Andreas Wierse A). Morgan Kaufman 2001, 205–222.Google Scholar
- FDA Modernization Act[http://hippo.findlaw.com/FDA.html]
- National Cancer Institute: [http://seer.cancer.gov/csr/1975_2004/] Surveillance Epidemiology and End Results 2004.Google Scholar
- NCI Drug Dictionary. National Cancer Institute[http://www.cancer.gov/Templates/drugdictionary.aspx]
- Ward S, Copier J, Dalgleish A: Technical challenges facing therapeutic cancer vaccines. Curr Opin Drug Discov Devel 2008,11(2):168–177.PubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.