Detection of Biomarkers in Humans using Classification Procedures

[Image: four peaks from the mass spectra of human sera]

One of the goals in cancer research is to find biomarkers that can be used to detect cancers early, design individual therapies, and identify the underlying processes involved in the disease. Because so many processes are involved in disease states, this goal is akin to trying to “find a needle in a haystack”. One way to distinguish healthy from disease states is to analyze samples from patients and cluster the samples based on distinguishing features. Classification models are one way to find the features that could be relevant. In principle, the goal is to use a small subset of the large amount of data generated by proteomic, metabolomic, or metabonomic investigations to build a procedure that separates the classes of the samples from each other. To this end, ClassCK has been developed and is freely available to assist in the construction of good, fuzzy classifiers. The program combines a feature selection method with several distance metrics and classification methods, and with up to two different scoring functions.

Six different classification methods are currently available within ClassCK: a Distance-Dependent K-Nearest Neighbors method, a method based on K-Means Clustering, and four methods based on Agglomerative Hierarchical Clustering.

  1. Distance-Dependent K-Nearest Neighbors (DD-KNN)
  2. K-Means Clustering
  3. Single Linkage Clustering
  4. Average Linkage Clustering
  5. Complete Linkage Clustering
  6. Distance-Dependent Jarvis-Patrick Clustering
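
The details of ClassCK's DD-KNN method are not given here. As a generic illustration of how distance can enter a nearest-neighbor vote, the following minimal sketch weights each of the k nearest training samples by the inverse of its distance and returns a fuzzy class membership. The function names, the Euclidean metric, and the inverse-distance weighting are assumptions chosen for illustration, not ClassCK's implementation.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def weighted_knn_memberships(train_x, train_y, query, k=3, eps=1e-9):
    """Return fuzzy class memberships for `query` (hypothetical helper).

    Each of the k nearest training samples votes for its own class with
    weight 1 / (distance + eps); the votes are normalized to sum to 1.
    """
    scored = [(euclidean(x, query), label) for x, label in zip(train_x, train_y)]
    nearest = sorted(scored, key=lambda pair: pair[0])[:k]
    votes = defaultdict(float)
    for dist, label in nearest:
        votes[label] += 1.0 / (dist + eps)
    total = sum(votes.values())
    return {label: weight / total for label, weight in votes.items()}

# Toy data: two selected features (e.g. two peak intensities) per sample.
train_x = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
train_y = ["healthy", "healthy", "cancer", "cancer"]
memberships = weighted_knn_memberships(train_x, train_y, (0.2, 0.1), k=3)
predicted = max(memberships, key=memberships.get)
```

Returning normalized memberships rather than a single label reflects the fuzzy character of the classifiers described above: a sample near a class boundary receives a less decisive score.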

Recent studies have used ClassCK to build classification models. Some illustrative examples are described below:

Disease Classification

The Classifier Construction Kit (ClassCK), developed at the Advanced Biomedical Computing Center (ABCC), has been successfully used with mass spectra of human serum samples to identify subjects with colon cancer. A set of 48 classifiers was constructed using a set of training data and then applied to a blind set of testing data. Approximately 88% of the testing samples received the same classification from all 48 classifiers, and of these, 97% were correctly classified. Work is continuing to understand the source of ambiguous and incorrect classifications.
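
The two percentages quoted above are different quantities: the fraction of samples on which every classifier agrees, and the accuracy within that unanimous subset. A minimal sketch of how they could be computed from a table of per-classifier predictions follows. The function name and the toy data are hypothetical and do not reproduce the study's results.

```python
def consensus_summary(predictions, truth):
    """Summarize agreement among an ensemble of classifiers.

    predictions[i][j] is the label that classifier j assigned to sample i;
    truth[i] is the known label of sample i.
    Returns (fraction of samples with unanimous votes,
             accuracy among those unanimous samples).
    """
    unanimous = [i for i, votes in enumerate(predictions) if len(set(votes)) == 1]
    correct = [i for i in unanimous if predictions[i][0] == truth[i]]
    frac_unanimous = len(unanimous) / len(predictions)
    accuracy = len(correct) / len(unanimous) if unanimous else float("nan")
    return frac_unanimous, accuracy

# Toy example: 4 samples, 3 classifiers (the study used 48 classifiers).
predictions = [
    ["cancer", "cancer", "cancer"],
    ["healthy", "healthy", "healthy"],
    ["cancer", "healthy", "cancer"],   # ambiguous: classifiers disagree
    ["healthy", "healthy", "healthy"],
]
truth = ["cancer", "healthy", "cancer", "cancer"]
frac_unanimous, accuracy = consensus_summary(predictions, truth)
```

Samples without a unanimous vote correspond to the ambiguous classifications mentioned above and are excluded from the accuracy figure.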

This program has also been used with NMR spectra from urine to identify subjects with interstitial and bacterial cystitis, and with microarray results to uniquely separate one type of cancer from samples spanning 13 cancer types.

Predicting Drug Response

During the clinical trial of celecoxib for patients with familial adenomatous polyposis (FAP), it was found that a small fraction of the subjects in the 400 mg bid treatment arm did not respond to the drug, in that their polyp burden did not decrease. Mass spectra of serum samples from all members of this treatment arm were used by ClassCK to identify a small number of possible biomarkers that accurately distinguish between responders and non-responders. One of the biomarkers was previously identified using a Random Forest classifier, though ClassCK was able to identify others. In addition, this biomarker was also identified by ClassCK from spectra taken at a later time (new SELDI chip surface) on a different mass spectrometer. The reproducibility of this result suggests that SELDI spectra of serum samples may have clinical diagnostic potential.

Disease Association Studies

Genotypic and environmental parameters have been examined for disease associations in human studies using three programs developed at the ABCC. The Polymorphism Interaction Analysis (PIA) program scans datasets containing a large number of Single Nucleotide Polymorphism (SNP) sites to find small numbers of SNPs that show the largest separation between cases and controls. The Hypothesis Tester (HypTest) uses specific genotypic and environmental factors to determine the odds ratio from case-control studies, standardizing for confounding factors. The Haplotype Tester (HaploTest) uses sets of SNPs to construct haplotypes using an Expectation Maximization algorithm, and then examines the distribution of cases and controls among the predominant haplotypes.

These programs have been used with data from 1530 subjects in the PLCO (prostate, lung, colon, and ovary) screening trial to identify genetic variations in the inducible isoform of the prostaglandin endoperoxide G/H synthase gene (PTGS2, also known as Cox-2) that reduce or increase the likelihood of colon cancer. This analysis included smoking history and the use of non-steroidal anti-inflammatory drugs (aspirin and ibuprofen). Single SNP and haplotype patterns were examined, and the odds ratio of disease as well as statistical parameters (χ² and 95% confidence intervals) were determined by standardizing the results relative to age and, if necessary, smoking history.
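
For readers unfamiliar with these statistics, the sketch below computes an unadjusted odds ratio, a 95% confidence interval (using the standard log-odds, or Woolf, approximation), and the Pearson χ² statistic from a single 2×2 case-control table. It does not perform the standardization for age or smoking history described above, and the function name and counts are hypothetical rather than taken from HypTest or the PLCO data.

```python
import math

def odds_ratio_summary(a, b, c, d, z=1.96):
    """Unadjusted statistics for a 2x2 case-control table.

    a: cases with the factor       b: controls with the factor
    c: cases without the factor    d: controls without the factor
    All four counts must be non-zero.
    Returns (odds ratio, (CI lower, CI upper), Pearson chi-square).
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(odds ratio), Woolf's approximation
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    # Pearson chi-square (1 degree of freedom, no continuity correction)
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return odds_ratio, (lower, upper), chi2

# Hypothetical counts, for illustration only.
odds_ratio, (ci_low, ci_high), chi2 = odds_ratio_summary(20, 10, 10, 20)
```

An odds ratio above 1 with a confidence interval that excludes 1 suggests the factor is associated with increased likelihood of disease; a ratio below 1 suggests a protective association.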

In addition, 94 SNPs in 63 genes have been examined from a set of 571 males: 216 colon cancer cases and 255 controls. PIA examined all combinations of 2, 3, and 4 SNPs to find specific combinations that maximized the separation of cases and controls in the possible haplotypes. The top combinations were examined to determine if specific SNPs or metabolic pathways were important. This project is continuing with the examination of subjects with lung cancer.
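
To give a sense of the size of this search, 94 SNPs yield 4,371 pairs, 134,044 triples, and 3,049,501 quadruples, or 3,187,916 combinations in total. The sketch below counts these and runs an exhaustive scan on toy data. The scoring functions PIA actually uses are not described here, so the majority-class score, the function names, and the data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import comb

def n_combinations(n_snps, sizes=(2, 3, 4)):
    """Number of SNP subsets of each size."""
    return {k: comb(n_snps, k) for k in sizes}

def separation_score(genotypes, is_case, snp_idx):
    """Toy score (an assumption, not PIA's scoring function).

    Each combined genotype pattern over the chosen SNPs is assigned to its
    majority class; the score is the fraction of subjects labeled correctly.
    """
    cases, controls = Counter(), Counter()
    for row, case in zip(genotypes, is_case):
        pattern = tuple(row[i] for i in snp_idx)
        (cases if case else controls)[pattern] += 1
    patterns = set(cases) | set(controls)
    correct = sum(max(cases[p], controls[p]) for p in patterns)
    return correct / len(genotypes)

def best_combinations(genotypes, is_case, size, top=3):
    """Exhaustively score every SNP subset of the given size."""
    n_snps = len(genotypes[0])
    scored = [
        (separation_score(genotypes, is_case, idx), idx)
        for idx in combinations(range(n_snps), size)
    ]
    return sorted(scored, reverse=True)[:top]

counts = n_combinations(94)

# Toy data: 3 SNPs coded 0/1; case status depends on SNP 0 and SNP 1 jointly.
genotypes = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1),
             (0, 0, 1), (0, 1, 0), (1, 0, 1), (1, 1, 0)]
is_case = [g[0] != g[1] for g in genotypes]
ranking = best_combinations(genotypes, is_case, size=2)
```

In the toy data neither SNP 0 nor SNP 1 separates cases from controls on its own, but the pair does, which is the kind of interaction an exhaustive scan over combinations is meant to expose.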
