Technology Background and Data

Technology

DNAFSMiner is built on statistical and data mining technologies. The methodology consists of three steps: (1) generating candidate features from the sequences; (2) selecting relevant features from the candidates; and (3) integrating the selected features with support vector machines (SVM) to build a classification and prediction system.

In the first step, candidate features are generated using k-gram nucleotide or amino acid patterns, which are simply patterns of k consecutive nucleotide or amino acid symbols. The occurrence of a pattern within a certain number of bps upstream and downstream of a candidate functional site is used as the value of the feature. The original nucleotide sequences are then transformed into this new feature space.
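
The first step can be illustrated with a minimal Python sketch. It assumes that the feature value is the occurrence count of each k-gram, that the candidate site is three letters long (such as ATG), and that the window size is a free parameter; the function names, the "UP_"/"DOWN_" feature names and the default values are hypothetical and are not taken from DNAFSMiner itself.

```python
from collections import Counter
from itertools import product


def kgram_counts(segment, k):
    """Count every k-gram (k consecutive letters) in a sequence segment."""
    return Counter(segment[i:i + k] for i in range(len(segment) - k + 1))


def site_features(seq, pos, k=2, window=99, alphabet="ACGT"):
    """Build a feature dict for the candidate site starting at 0-based index pos.

    'UP_<kgram>' and 'DOWN_<kgram>' hold the number of occurrences of the
    k-gram in the upstream and downstream windows (window size is assumed).
    """
    site_len = 3  # assumption: the candidate site is a triplet such as ATG
    upstream = seq[max(0, pos - window):pos]
    downstream = seq[pos + site_len:pos + site_len + window]
    up = kgram_counts(upstream, k)
    down = kgram_counts(downstream, k)
    features = {}
    for gram in map("".join, product(alphabet, repeat=k)):
        features["UP_" + gram] = up[gram]      # Counter returns 0 if absent
        features["DOWN_" + gram] = down[gram]
    return features


seq = "GCCACCATGGCGTAA"
feats = site_features(seq, seq.find("ATG"), k=2, window=6)
print(feats["UP_CC"], feats["DOWN_GC"])
```

With k = 2 and the nucleotide alphabet, each candidate is described by 16 upstream and 16 downstream features, so every sequence becomes a fixed-length vector regardless of its original length.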

In the second step, an entropy-based feature selection algorithm is applied to the training data to select the features that are most useful for distinguishing true functional sites from false ones.
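
The text does not specify which entropy-based algorithm is used, so the following is only a simplified stand-in: it ranks each feature by the best information gain obtainable from a single cut point and keeps those above a threshold. The function names and the `min_gain` parameter are hypothetical.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        result -= p * math.log2(p)
    return result


def best_split_gain(values, labels):
    """Largest information gain over all binary cut points of one feature."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best = 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point lies between two equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(pairs)
        best = max(best, base - remainder)
    return best


def select_features(rows, labels, min_gain=0.1):
    """Return the feature names whose best gain reaches min_gain, best first."""
    gains = {name: best_split_gain([row[name] for row in rows], labels)
             for name in rows[0]}
    kept = [name for name, gain in gains.items() if gain >= min_gain]
    return sorted(kept, key=lambda name: -gains[name])


rows = [{"UP_CC": 2, "DOWN_AA": 1}, {"UP_CC": 3, "DOWN_AA": 0},
        {"UP_CC": 0, "DOWN_AA": 1}, {"UP_CC": 0, "DOWN_AA": 0}]
labels = [1, 1, -1, -1]  # 1 = true site, -1 = false site
print(select_features(rows, labels))
```

In this toy data the feature "UP_CC" separates the two classes perfectly, while "DOWN_AA" carries no information and is dropped.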

In the third step, support vector machines (SVM), a state-of-the-art classification algorithm, are used to build the classification and prediction model. An SVM selects a small number of critical boundary samples from each class of training data and builds a discriminant function that separates them as widely as possible. The decision function for a test sample T is constructed as:

f(T) = \sum_{i} a_i y_i K(T, x_i) + b

where x_i are the training data points, y_i are the class labels of these data points (a functional site is mapped to 1 and a non-functional site to -1), and b and a_i are parameters to be determined from the training samples. K(.) is the kernel function, which defines an inner product. The kernel implicitly maps the training data into a higher dimensional space, which is needed when linear separation is impossible in the original one. Then, f(T)>0 if the sample T is more likely to be a functional site, and f(T)<0 if T is more likely to be a non-functional site. To normalize f(T), a transformation function s(T) is defined as:

s(T) = \frac{1}{1 + e^{-f(T)}}

Thus, f(T) is normalized by s(T) into the range (0,1). For each candidate functional site, the score s(T) is used to make the prediction. Note that if f(T)>0 then s(T)>0.5, and if f(T)<0 then s(T)<0.5.
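
The scoring described above can be sketched in a few lines of Python. The sketch assumes the standard SVM form f(T) = sum of a_i * y_i * K(T, x_i) plus b, a linear kernel, and a logistic form for s(T); the logistic form is an assumption that matches the stated properties (range (0,1), s(T) = 0.5 at f(T) = 0). All names and the toy parameter values are hypothetical.

```python
import math


def linear_kernel(u, v):
    """Inner product of two feature vectors (one possible choice of K)."""
    return sum(a * b for a, b in zip(u, v))


def decision_value(t, support_vectors, labels, alphas, b, kernel=linear_kernel):
    """f(T) = sum_i a_i * y_i * K(T, x_i) + b."""
    total = sum(a * y * kernel(t, x)
                for x, y, a in zip(support_vectors, labels, alphas))
    return total + b


def score(f_value):
    """Assumed logistic normalization of f(T) into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-f_value))


# Toy parameters, not taken from any trained model.
support_vectors = [(1.0, 0.0), (0.0, 1.0)]
labels = [1, -1]
alphas = [1.0, 1.0]
b = 0.0

f = decision_value((2.0, 1.0), support_vectors, labels, alphas, b)
print(f, score(f))
```

A positive decision value yields a score above 0.5 and a negative one a score below 0.5, which is the relationship the text relies on when a threshold is applied to the score.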

To measure the performance of a model, we adopt standard metrics defined as follows. Sensitivity measures the proportion of true functional sites that are correctly recognized as true functional sites. Specificity measures the proportion of false functional sites that are correctly recognized as false functional sites. Precision measures the proportion of the claimed true functional sites that are indeed true functional sites. Accuracy measures the proportion of predictions, both for true functional sites and false functional sites, that are correct. Let TP be the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives. Then the above measures are defined as:
sensitivity = TP/(TP + FN),
specificity = TN/(TN + FP),
precision = TP/(TP + FP), and
accuracy = (TP + TN)/(TP + FN + TN + FP).
In addition, the ROC (Receiver Operating Characteristic) curve is used to illustrate the tradeoff between sensitivity and specificity.
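
As an illustration of these definitions, the sketch below computes the four measures from scores and labels and sweeps the threshold to obtain ROC points. It assumes that a score strictly above the threshold is predicted as a true site and that a ratio with a zero denominator is reported as 0.0; the function names and the toy data are hypothetical.

```python
def confusion_counts(scores, labels, threshold=0.5):
    """Return (TP, TN, FP, FN); a score above the threshold means 'true site'."""
    tp = tn = fp = fn = 0
    for s, y in zip(scores, labels):
        predicted_true = s > threshold
        if y == 1:
            if predicted_true:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_true:
                fp += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def ratio(numerator, denominator):
    """numerator / denominator, or 0.0 when the denominator is zero (assumed)."""
    return numerator / denominator if denominator else 0.0


def metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, precision and accuracy as defined above."""
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, tp + fn + tn + fp),
    }


def roc_points(scores, labels, thresholds):
    """Return one (1 - specificity, sensitivity) pair per threshold."""
    points = []
    for t in thresholds:
        m = metrics(*confusion_counts(scores, labels, t))
        points.append((1.0 - m["specificity"], m["sensitivity"]))
    return points


scores = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]
labels = [1, 1, 1, -1, -1, -1]  # 1 = true site, -1 = false site
print(metrics(*confusion_counts(scores, labels, 0.5)))
print(roc_points(scores, labels, [0.0, 0.25, 0.5, 0.75, 1.0]))
```

Raising the threshold moves along the ROC curve toward higher specificity and lower sensitivity.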


Data

The TIS Miner was trained on 3312 vertebrate mRNA sequences extracted from GenBank (release 95). The data set was first analysed by Pedersen et al. and contains 3312 true TIS ATGs and 10063 non-TIS ATGs. The training accuracy of the classification model is 92.45% at 80.19% sensitivity and 96.48% specificity. The resulting model has been tested on two sets of data.
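
As a consistency check on these figures, accuracy can be recovered from sensitivity, specificity and the class sizes, since TP = sensitivity x (number of true sites) and TN = specificity x (number of false sites). The short sketch below uses only the numbers quoted above; the variable names are ours.

```python
positives = 3312    # true TIS ATGs in the training data
negatives = 10063   # non-TIS ATGs in the training data
sensitivity = 0.8019
specificity = 0.9648

# accuracy = (TP + TN) / (TP + FN + TN + FP)
true_positives = sensitivity * positives
true_negatives = specificity * negatives
accuracy = (true_positives + true_negatives) / (positives + negatives)
print(f"{accuracy:.4f}")
```

The result is approximately 0.9245, in agreement with the reported training accuracy.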

The first validation set consists of 480 human cDNA sequences that were previously analysed by Hatzigeorgiou. This data set was collected from the protein database Swissprot. All the human proteins whose N-terminal sites are sequenced at the amino acid level were selected and manually checked. Then the full-length mRNAs for these proteins, whose TIS had been indirectly experimentally verified, were retrieved. The testing accuracy on this data (after removing the sequences similar to those in the training set) is 89.42% at 96.28% sensitivity and 89.15% specificity.

We constructed the second validation set by extracting a number of annotated human genes on Chromosome X and Chromosome 21 from Human Genome Build30. On this data set, our model achieves 71.01% sensitivity with 84.02% specificity. Figure 1 below shows the ROC curve of our model for the prediction of TISs in these genomic sequences.



Figure 1: An ROC curve of TIS prediction on some genomic sequences given by the TIS Miner.


The Poly(A) Signal Miner was trained on 2327 terminal sequences including 1632 "unique" and 695 "strong" poly(A) sites. This data set was originally collected and used to train the Erpin system. Our training accuracy is 78.16% at 84.10% sensitivity and 71.54% specificity.

It was then evaluated on a set of 982 positive sequences containing annotated poly(A) signals from EMBL and four negative sets of the same size: 982 CDS sequences, 982 intronic sequences of the first intron, 982 randomized UTR sequences with the same 1st order Markov model as human 3' UTRs, and 982 randomized UTR sequences with the same mononucleotide composition as human 3' UTRs. These data sets were first analysed by Gautheret et al. using Erpin. Figure 2 below shows ROC curves of the Poly(A) Signal Miner on the validation sets.



Figure 2: ROC curves of poly(A) signal prediction on some validation sequences given by the Poly(A) Signal Miner.


Input

"TIS Miner" and "Poly(A) Signal Miner" are invoked from the left pane of the main page. Prediction requires a nucleic acid sequence, which can be submitted in either raw or FASTA format. Each submitted sequence is limited to 50,000 bps to avoid long waiting times for users. The "Number of predictions" is the number of top-scoring candidate functional sites displayed on the result page (the default setting is 5). When predicting poly(A) signals, users can also select a hexamer poly(A) signal consensus other than the default "AATAAA". The options include "ATTAAA" and any variant of the "NNTANA" type.

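The input handling described above can be sketched as follows. This is not the server's code: it assumes a single-record FASTA or raw DNA input, treats N in the consensus as matching any of A, C, G or T, reports 1-based positions, and uses hypothetical function names.

```python
import re

MAX_BPS = 50_000  # per-sequence limit stated in the text


def read_sequence(text):
    """Accept raw or single-record FASTA text; return an upper-case sequence."""
    lines = [line.strip() for line in text.strip().splitlines()]
    body = [line for line in lines if line and not line.startswith(">")]
    seq = re.sub(r"\s+", "", "".join(body)).upper()
    if len(seq) > MAX_BPS:
        raise ValueError(f"sequence has {len(seq)} bps; the limit is {MAX_BPS}")
    return seq


def consensus_positions(seq, consensus="AATAAA"):
    """1-based start positions of all matches, overlapping ones included."""
    core = consensus.upper().replace("N", "[ACGT]")
    pattern = re.compile("(?=" + core + ")")  # lookahead keeps overlaps
    return [match.start() + 1 for match in pattern.finditer(seq)]


def top_candidates(scored, n=5):
    """Keep the n highest-scoring (position, score) pairs."""
    return sorted(scored, key=lambda item: item[1], reverse=True)[:n]


seq = read_sequence(">demo\nggAATAAAcc\nATTAAAtt\n")
print(consensus_positions(seq))            # default AATAAA
print(consensus_positions(seq, "NNTANA"))  # wildcard consensus
```

A wildcard consensus such as "NNTANA" matches both the AATAAA and the ATTAAA hexamers in the demo sequence, which is why it yields more candidates than the default.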

Output

The output page of the TIS Miner is a table with 6 columns.
(1) No. of ATG(s) from the 5' end. The number i in this column of the table indicates that the corresponding candidate is the ith candidate ATG from the 5' end. Generally, a sequence may contain multiple candidates of the functional site.
(2) Score. This column shows the score (in the range (0,1)) of the prediction that "the corresponding candidate is a true TIS". It is given by the prediction model built by SVM (support vector machines) on the training sequences. The higher the score, the more likely the corresponding candidate is a true TIS. Table 1 gives the overall accuracy, sensitivity, specificity and precision under different score thresholds, based on the validation results on the Human Chromosome data. For example, if the threshold is set to 0.6 (i.e. a candidate is predicted as a true TIS if its prediction score is greater than 0.6, and as a non-TIS otherwise), the accuracy, sensitivity, specificity and precision are 72.2%, 54.6%, 89.7% and 84.1%, respectively.



Table 1: TIS Miner --- overall accuracy, sensitivity, specificity and precision under different thresholds of the score based on the validation results on Human Chromosome data.

(3) Position(bp). This column is the position of the corresponding candidate in the submitted nucleic acid sequence.
(4) Identity to Kozak consensus [AG]XXATGG. According to Kozak's weight matrix developed for TIS prediction, a G residue tends to follow a true TIS while an A or G residue tends to be found 3 nucleotides upstream of a true TIS. This column shows how well the candidate ATG fits this consensus.
(5) Is any ATG in 100bp upstream? This column indicates whether an ATG exists within the 100 bps upstream of the candidate.
(6) Is any in-frame stop codon in 100bp downstream? This column indicates whether an in-frame stop codon is found within the 100 bps downstream of the candidate.
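
Columns (4) to (6) describe simple checks around a candidate ATG, which can be sketched as below. The sketch assumes 0-based indices, the standard stop codons TAA, TAG and TGA, and that only ATGs lying entirely upstream of the candidate count for column (5); the function names are hypothetical and the server's exact rules may differ.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def kozak_identity(seq, atg):
    """Return (A or G found 3 nt upstream?, G directly after the ATG?)."""
    minus3 = atg >= 3 and seq[atg - 3] in "AG"
    plus4 = atg + 3 < len(seq) and seq[atg + 3] == "G"
    return minus3, plus4


def has_upstream_atg(seq, atg, window=100):
    """Is there any ATG within the window of bps upstream of the candidate?"""
    return "ATG" in seq[max(0, atg - window):atg]


def has_inframe_stop_downstream(seq, atg, window=100):
    """Is there a stop codon, in frame with the ATG, within the window?"""
    start = atg + 3                      # first codon after the ATG
    end = min(len(seq), start + window)
    return any(seq[i:i + 3] in STOP_CODONS
               for i in range(start, end - 2, 3))


seq = "ATGCCACCATGGCGTAGCC"
candidate = 8  # 0-based index of the second ATG
print(kozak_identity(seq, candidate))
print(has_upstream_atg(seq, candidate))
print(has_inframe_stop_downstream(seq, candidate))
```

For the second ATG in this toy sequence all three checks are positive: an A sits 3 nucleotides upstream, a G follows the ATG, another ATG lies upstream, and the in-frame codon TAG follows downstream.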

The output page of the Poly(A) Signal Miner is a table with 3 columns.
(1) No. of AATAAA(s) from the 5' end. The number i in this column of the table indicates that the corresponding candidate is the ith candidate poly(A) signal from the 5' end. Generally, a sequence may contain multiple candidates of the poly(A) signal (e.g. AATAAA).
(2) Score. This column shows the score (in the range (0,1)) of the prediction that "the corresponding candidate is a true poly(A) signal". It is given by the prediction model built by SVM (support vector machines) on the training sequences. The higher the score, the more likely the corresponding candidate is a true poly(A) signal. Table 2 gives the sensitivity on the 982 annotated true poly(A) signals, and the respective specificity (SP) and precision (PR) on the equally sized sets of false poly(A) signals from CDS sequences (CDS), intronic sequences of the first intron (Intron), randomized UTR sequences with the same 1st order Markov model as human 3' UTRs (1st Markov) and randomized UTR sequences with the same mononucleotide composition as human 3' UTRs (Simple), under different score thresholds. For example, if the threshold is set to 0.6 (i.e. a candidate is predicted as a true poly(A) signal if its prediction score is greater than 0.6, and as a false one otherwise), the sensitivity is 59.8%, and the specificity and precision with respect to the false poly(A) signals in CDS sequences are 88.6% and 83.9%, respectively.



Table 2: Poly(A) Signal Miner --- sensitivity, specificity and precision under different thresholds of the score based on our validation data.

(3) Position(bp). This column is the position of the corresponding candidate in the submitted nucleic acid sequence.


Range of computations

Currently, DNAFSMiner can be used to predict (1) translation initiation sites, which are in the form of ATG in most cases, in vertebrate DNA/mRNA/cDNA sequences; and (2) poly(A) signals in human DNA sequences.