Several stringent benchmark datasets, one per species group, were used to evaluate the performance of FUEL-mLoc:
  • Benchmark Eukaryotic Dataset: Details;
  • Benchmark Human Dataset: Details;
  • Benchmark Plant Dataset: Details;
  • Benchmark Gram-positive Bacteria Dataset: Details;
  • Benchmark Gram-negative Bacteria Dataset: Details;
  • Benchmark Virus Dataset: Details.

These benchmark datasets were used for leave-one-out cross-validation tests.
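
As a rough illustration of what a leave-one-out test involves, the sketch below holds out each protein in turn, trains on the rest, and predicts the held-out one. The names `leave_one_out` and `majority_label` are hypothetical, the baseline predictor is a deliberately trivial stand-in rather than FUEL-mLoc itself, and a single label per protein is assumed for simplicity.

```python
from collections import Counter


def leave_one_out(samples, labels, fit_predict):
    """Predict every sample using a model trained on all the other samples."""
    predictions = []
    for i in range(len(samples)):
        # Training set: everything except the i-th sample.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(fit_predict(train_x, train_y, samples[i]))
    return predictions


def majority_label(train_x, train_y, query):
    """Trivial stand-in predictor: return the most frequent training label."""
    return Counter(train_y).most_common(1)[0][0]


if __name__ == "__main__":
    # Toy sequences and labels, invented for illustration only.
    samples = ["MKV", "MLS", "MAA", "MTT"]
    labels = ["nucleus", "nucleus", "nucleus", "cytoplasm"]
    predicted = leave_one_out(samples, labels, majority_label)
    correct = sum(p == t for p, t in zip(predicted, labels))
    print(f"accuracy: {correct}/{len(labels)}")
```

Any real predictor can be plugged in through the `fit_predict` argument, provided it takes the training samples, the training labels and one query sample.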

The procedure for constructing these datasets is summarized below, using the human species as an example:

  1. Go to the UniProt/SwissProt official webpage (http://www.uniprot.org/);
  2. Go to the 'Search' section and select 'Protein Knowledgebase (UniProtKB)' (default) in the 'Search in' option;
  3. In the 'Query' option, select or type 'reviewed: yes';
  4. Select 'AND' in the 'Advanced Search' option, and then select 'Taxonomy [OC]' and type in 'Human' (or 'Eukaryota' for eukaryotic proteins, 'Viridiplantae' for plant proteins, 'Firmicutes' and 'Actinobacteria' for Gram-positive bacterial proteins, 'Proteobacteria' for Gram-negative bacterial proteins and 'Virus' for viral proteins);
  5. Select 'AND' in the 'Advanced Search' option, and then select 'Fragment: no';
  6. Select 'AND' in the 'Advanced Search' option, and then select 'Sequence length' and type in '50 - ' (no less than 50);
  7. Select 'AND' in the 'Advanced Search' option, and then select 'Date entry integrated' and type in a time period, such as '-20080429' for entries integrated before Apr-29-2008 or '20110308-20120418' for entries integrated between Mar-08-2011 and Apr-18-2012;
  8. Select 'AND' in the 'Advanced Search' option, and then select "Subcellular location: XXX Confidence: Experimental", where XXX stands for a specific subcellular location. For human proteins, there are 14 locations: centrosome; cytoplasm; cytoskeleton; endoplasmic reticulum; endosome; extracellular; Golgi apparatus; lysosome; microsome; mitochondrion; nucleus; peroxisome; plasma membrane; and synapse. For eukaryotic proteins, there are 22 subcellular locations; and so on;
  9. Exclude any remaining proteins that are not experimentally annotated (a recheck to guarantee that every retained protein is experimentally annotated).
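
The selection criteria in steps 3 to 9 can also be expressed as a filter over records that have already been downloaded. The following is only a minimal sketch: the `Entry` class, its field names and the `keep` function are hypothetical and do not correspond to any UniProt API, and the date bounds are assumed to be inclusive.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class Entry:
    """Hypothetical, simplified view of one UniProtKB record."""
    accession: str
    reviewed: bool        # step 3: reviewed (Swiss-Prot) entry
    is_fragment: bool     # step 5: fragment flag
    length: int           # step 6: sequence length
    integrated: date      # step 7: date the entry was integrated
    lineage: tuple        # step 4: taxonomic lineage names
    locations: dict       # steps 8-9: location -> True if experimental


def parse_period(text):
    """Parse '-20080429' or '20110308-20120418'; a missing side is unbounded."""
    start, _, end = text.partition("-")

    def to_date(part):
        part = part.strip()
        return datetime.strptime(part, "%Y%m%d").date() if part else None

    return to_date(start), to_date(end)


def keep(entry, taxon, period, allowed_locations, min_length=50):
    """Return True if the entry satisfies every selection criterion."""
    start, end = parse_period(period)
    if not entry.reviewed or entry.is_fragment:
        return False
    if entry.length < min_length or taxon not in entry.lineage:
        return False
    # Assumption: both ends of the time period are inclusive.
    if start is not None and entry.integrated < start:
        return False
    if end is not None and entry.integrated > end:
        return False
    # Only experimentally supported locations count (steps 8 and 9).
    experimental = {loc for loc, is_exp in entry.locations.items() if is_exp}
    return bool(experimental & set(allowed_locations))


if __name__ == "__main__":
    # Invented record, for illustration only.
    example = Entry("X00001", True, False, 120, date(2007, 5, 1),
                    ("Eukaryota", "Homo sapiens"), {"nucleus": True})
    print(keep(example, "Homo sapiens", "-20080429", {"nucleus", "cytoplasm"}))
```

Keeping each criterion as a separate check makes it easy to see which step rejected a given record when the resulting dataset is audited.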

Note: After the proteins were selected, BLASTClust (or CD-HIT) was applied to reduce redundancy so that no pair of sequences in a dataset has a sequence identity higher than 25%.
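
To make the 25% constraint concrete, the sketch below shows one simple greedy way to cull a set of sequences so that no retained pair exceeds an identity threshold. It is not how BLASTClust or CD-HIT work internally: `greedy_cull` and `position_identity` are hypothetical names, and the identity measure is a toy, alignment-free stand-in that a real pipeline would replace with an alignment-based score.

```python
def position_identity(a, b):
    """Toy identity: matching positions divided by the longer length.

    Assumption: no alignment is performed, so this only illustrates
    the interface and is not a biologically meaningful measure.
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cull(sequences, identity=position_identity, threshold=0.25):
    """Keep a sequence only if it is dissimilar to everything kept so far."""
    kept = []
    # Visit longer sequences first so they become the representatives.
    for seq in sorted(sequences, key=len, reverse=True):
        if all(identity(seq, rep) <= threshold for rep in kept):
            kept.append(seq)
    return kept


if __name__ == "__main__":
    # Invented sequences, for illustration only.
    toy = ["AAAAAAAA", "AAAAAAAC", "CDEFGHIK"]
    print(greedy_cull(toy))
```

By construction, every pair in the returned list has an identity at or below the threshold under whichever `identity` function is supplied.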

You can also download the procedures here.

 

Contact @ Shibiao Wan, Man-Wai Mak

URL: http://www.eie.polyu.edu.hk/~mwmak/

Dept. of EIE, The Hong Kong Polytechnic University

Last update: 22 Oct. 2016