- Benchmark Eukaryotic Dataset: Details;
- Benchmark Human Dataset: Details;
- Benchmark Plant Dataset: Details;
- Benchmark Gram-positive Bacteria Dataset: Details;
- Benchmark Gram-negative Bacteria Dataset: Details;
- Benchmark Virus Dataset: Details;
These benchmark datasets were used for leave-one-out cross-validation tests.
The procedure for constructing these datasets is summarized below, using human proteins as the example:
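As a reminder of what a leave-one-out test involves, the following minimal sketch enumerates the splits: each protein is held out once while all the others form the training set. The function name `leave_one_out_splits` is hypothetical and is not taken from the authors' code.

```python
from typing import Iterator, List, Tuple


def leave_one_out_splits(n_samples: int) -> Iterator[Tuple[List[int], int]]:
    """Yield (training_indices, held_out_index) for each sample in turn."""
    for held_out in range(n_samples):
        # Every sample except the held-out one is used for training.
        train = [i for i in range(n_samples) if i != held_out]
        yield train, held_out


if __name__ == "__main__":
    # Hypothetical example: a dataset of 4 proteins gives 4 splits.
    for train, held_out in leave_one_out_splits(4):
        print(train, held_out)
```

A dataset of N proteins therefore yields N train/test rounds, each testing on exactly one protein.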
- Go to the UniProt/SwissProt official webpage (http://www.uniprot.org/);
- Go to the 'Search' section and select 'Protein Knowledgebase (UniProtKB)' (default) in the 'Search in' option;
- In the 'Query' option, select or type 'reviewed: yes';
- Select 'AND' in the 'Advanced Search' option, and then select 'Taxonomy [OC]' and type in 'Human' (or 'Eukaryota' for eukaryotic proteins, 'Viridiplantae' for plant proteins, 'Firmicutes' and 'Actinobacteria' for Gram-positive bacteria proteins, 'Proteobacteria' for Gram-negative bacteria proteins and 'Virus' for virus proteins);
- Select 'AND' in the 'Advanced Search' option, and then select 'Fragment: no';
- Select 'AND' in the 'Advanced Search' option, and then select 'Sequence length' and type in '50 - ' (i.e., at least 50 residues);
- Select 'AND' in the 'Advanced Search' option, and then select 'Date entry integrated' and type in a time period, such as '-20080429' for entries integrated before Apr-29-2008 or '20110308-20120418' for entries integrated between Mar-08-2011 and Apr-18-2012;
- Select 'AND' in the 'Advanced Search' option, and then select "Subcellular location: XXX Confidence: Experimental", where XXX is a specific subcellular location. For human proteins there are 14 locations: centrosome; cytoplasm; cytoskeleton; endoplasmic reticulum; endosome; extracellular; Golgi apparatus; lysosome; microsome; mitochondrion; nucleus; peroxisome; plasma membrane; and synapse. For eukaryotic proteins there are 22 locations, and similarly for the other datasets;
- Finally, recheck the retrieved proteins and exclude any that are not experimentally annotated, to guarantee that every remaining protein has an experimental annotation.
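The search criteria above can be sketched as a single combined query string. This is only an illustration under stated assumptions: the field names (`reviewed`, `taxonomy`, `fragment`, `length`, `created`, `location`) and the `[A TO B]` range notation are assumed here and have not been checked against the current UniProt query syntax. The helper names `date_range` and `build_query` are hypothetical, and the 'Confidence: Experimental' restriction is not encoded.

```python
def date_range(period: str) -> str:
    """Convert 'START-END' (either side may be empty) into '[START TO END]'.

    '-20080429' means "up to 29 Apr 2008"; an empty side becomes '*'.
    """
    start, _, end = period.partition("-")
    return f"[{start or '*'} TO {end or '*'}]"


def build_query(taxonomy: str, location: str, period: str,
                min_length: int = 50) -> str:
    """Join the selection criteria with AND, mirroring the steps listed above.

    All field names are assumptions made for illustration only.
    """
    clauses = [
        "reviewed:yes",                   # reviewed (Swiss-Prot) entries only
        f'taxonomy:"{taxonomy}"',         # e.g. Human, Viridiplantae
        "fragment:no",                    # exclude fragments
        f"length:[{min_length} TO *]",    # at least min_length residues
        f"created:{date_range(period)}",  # date entry integrated
        f'location:"{location}"',         # one subcellular location
    ]
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(build_query("Human", "nucleus", "-20080429"))
```

One query would be built per subcellular location (14 for human proteins). The final recheck for experimental annotation is a separate step and is not covered by this sketch.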
Note: After selecting the proteins, BLASTClust (or CD-HIT) was applied to reduce redundancy in the dataset so that no pair of sequences shares more than 25% sequence identity.
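The idea behind this redundancy reduction can be illustrated with a small greedy filter: a sequence is kept only if its identity to every sequence already kept does not exceed the cutoff. This is not the algorithm used by BLASTClust or CD-HIT. The identity measure below (`ungapped_identity`) is a toy, alignment-free stand-in, and both function names are hypothetical.

```python
from typing import Callable, List


def ungapped_identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions over the shorter sequence.

    Real tools compute identity from alignments; this stand-in does not.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_filter(seqs: List[str], cutoff: float = 0.25,
                  identity: Callable[[str, str], float] = ungapped_identity
                  ) -> List[str]:
    """Keep a sequence only if it is at most `cutoff` identical to all kept ones."""
    kept: List[str] = []
    # Process longer sequences first so they act as representatives.
    for s in sorted(seqs, key=len, reverse=True):
        if all(identity(s, k) <= cutoff for k in kept):
            kept.append(s)
    return kept


if __name__ == "__main__":
    # Hypothetical toy sequences, not real proteins.
    print(greedy_filter(["AAAAAAAA", "AAAAAAAC", "CCCCCCCC"]))
```

After filtering, every pair of retained sequences has identity at or below the cutoff under the chosen measure, which is the property the note describes.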
You can also download the procedures here.
Dept. of EIE, The Hong Kong Polytechnic University
Last update: 22 Oct. 2016