The World Health Organization's Global Burden of Disease statistics identified cancer as the second largest global cause of death, after cardiovascular disease [1]. Cancer is the fastest growing segment of the disease burden; global cancer deaths are projected to increase from 7.1 million in 2002 to 11.5 million in 2030 [2]. Advances in prevention, diagnostics and treatment of cancer have contributed to the improved prognosis for cancer patients: one third of cancers are preventable and another third are curable through early detection and effective therapy [3]. New cancer therapies are the subject of vigorous research, including the application of new high-throughput biomedical technologies that generate large amounts of data. The resulting information explosion mandates the use of biomedical informatics.
Researchers and clinicians need rapid access to multiple types of information, including molecular, clinical, and literature databases and clinical trials registries, as well as suitable data analysis tools. The National Center for Biotechnology Information (NCBI) hosts resources for retrieval and analysis of bio-molecular data, which are accessible through the NCBI web site [4]. Clinical trials data are available through clinical trials registries. Since 1971, regulatory efforts on clinical trials registration have resulted in a significant level of compliance by both sponsors and conductors of clinical trials [5]. ClinicalTrials.gov [6] has emerged as the largest registry in the world [7]. Currently it contains information on 50,000 clinical trials, including 16,000 cancer-related entries. The US National Cancer Institute (NCI) provides access to a clinical trials registry within the Physician Data Query (PDQ) database [8]. As of December 2007, the PDQ contains 21,000 abstracts of cancer-related clinical trials, and it regularly exchanges data with the ClinicalTrials.gov registry. Data sharing and direct access to resources (researchers, computers, software, data, research participants, and others) are considered critical for the advancement of cancer research and the improvement of health care. Initiatives such as the Cancer Research Network [9] and the Cancer Biomedical Informatics Grid [10] provide a framework for integrating various data types and tools for cancer research. Standardized formats for data such as demographics, health plan eligibility, tumor registry, inpatient and ambulatory utilization, medication dispensing, laboratory tests, and imaging procedures [9] facilitate data access, data sharing, and automated analysis.
Knowledge discovery from databases, also known as data mining, is an emerging field that applies techniques from databases, statistics and artificial intelligence to extract high-level information (knowledge) from a large volume of low-level data. Examples of high-level information derived from low-level data include forms that are more compact (e.g., short reports), more abstract (e.g., descriptive models of the process that generated the data), or more useful (e.g., predictive models for estimating values for future cases) than the existing low-level data [11]. Mining clinical trials data usually refers to using statistical and modeling tools for the analysis and design of clinical trials. If appropriate clinical trials data (e.g. aims, goals, regimens and conditions, end points, sample sizes, and others) are stored in the registry, data mining can help design better, more efficient trials that require smaller patient cohorts. Standardized data formats, such as the Extensible Markup Language (XML) [12], bring text files into machine-readable form, thus facilitating automated analysis. Both NCI PDQ and ClinicalTrials.gov provide clinical trial registry data in XML format.
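To illustrate how XML-formatted registry data lend themselves to automated processing, the following minimal sketch extracts a few fields from trial records using the Python standard library. The element names (`trial`, `title`, `condition`, `phase`, `enrollment`), the sample records, and the function `parse_trials` are hypothetical and chosen for illustration only; they do not reproduce the actual ClinicalTrials.gov or PDQ schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout: the element names and the two sample trials
# are illustrative assumptions, not the ClinicalTrials.gov or PDQ schemas.
SAMPLE_XML = """
<trials>
  <trial id="T001">
    <title>Peptide vaccine in melanoma</title>
    <condition>melanoma</condition>
    <phase>Phase II</phase>
    <enrollment>40</enrollment>
  </trial>
  <trial id="T002">
    <title>Dendritic cell vaccine in prostate cancer</title>
    <condition>prostate cancer</condition>
    <phase>Phase I</phase>
    <enrollment>25</enrollment>
  </trial>
</trials>
"""


def parse_trials(xml_text):
    """Return one dictionary per <trial> element found in xml_text."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for trial in root.findall("trial"):
        records.append({
            "id": trial.get("id"),
            # findtext returns the default when the child element is absent
            "title": trial.findtext("title", default="").strip(),
            "condition": trial.findtext("condition", default="").strip(),
            "phase": trial.findtext("phase", default="").strip(),
            "enrollment": int(trial.findtext("enrollment", default="0")),
        })
    return records


if __name__ == "__main__":
    for record in parse_trials(SAMPLE_XML):
        print(record["id"], record["condition"], record["phase"])
```

Once records are held in a uniform structure of this kind, fields such as condition, phase, or sample size can be filtered and counted without manual reading of each entry.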
We have developed a data mining approach for rapid summarization and visualization of information from clinical trial registries. This method has been applied to the analysis of cancer vaccine trials and provides a convenient means of extracting and presenting key data about them. The significant progress in cancer biology and cancer immunology has not yet been fully translated into successful clinical vaccine applications [13]. Though advances in cancer vaccine development have been reported (e.g. [14, 15]), a wide variety of factors are involved in tumor immune escape, making the design and production of effective therapeutic vaccines difficult. Known limiting factors include, among others, dysfunction of the immune system, immunosuppressive effects of the tumor microenvironment, production of suppressor T cells, defective antigen processing and presentation, and immunotherapy resistance of established tumors. To utilize the accumulated data and knowledge and translate these into improved clinical trials, integration of basic and clinical immunology and improved data processing capabilities are required. Our data mining approach provides a better understanding of the cancer vaccine clinical trials landscape and enables rapid analysis of the hotspots of cancer vaccine activity, as well as the identification of neglected cancers. This report describes the utility of the basic data mining techniques of summarization, tabulation, and visualization applied to clinical trials repository data.
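As a rough illustration of what summarization, tabulation, and visualization can mean for registry records, the sketch below counts trials per cancer type, cross-tabulates cancer type against trial phase, and renders the counts as a simple text bar chart. It is not the implementation used in this study: the records are made up, and the names `TRIALS`, `count_by`, `cross_tabulate`, and `text_bar_chart` are hypothetical.

```python
from collections import Counter

# Made-up records for illustration only; not data from any registry.
TRIALS = [
    {"condition": "melanoma", "phase": "Phase I"},
    {"condition": "melanoma", "phase": "Phase II"},
    {"condition": "melanoma", "phase": "Phase II"},
    {"condition": "prostate cancer", "phase": "Phase I"},
    {"condition": "breast cancer", "phase": "Phase III"},
]


def count_by(records, field):
    """Summarization: number of records for each value of one field."""
    return Counter(record[field] for record in records)


def cross_tabulate(records, row_field, col_field):
    """Tabulation: nested counts indexed by row value, then column value."""
    table = {}
    for record in records:
        row = table.setdefault(record[row_field], Counter())
        row[record[col_field]] += 1
    return table


def text_bar_chart(counts):
    """Visualization: one line per label, most frequent label first."""
    width = max(len(label) for label in counts)
    lines = []
    for label, n in counts.most_common():
        lines.append(f"{label.ljust(width)} | {'#' * n} {n}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(text_bar_chart(count_by(TRIALS, "condition")))
    print(cross_tabulate(TRIALS, "condition", "phase"))
```

In a summary of this kind, cancer types with many entries correspond to hotspots of activity, while types with few or no entries point to potentially neglected cancers.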