1. How is Eumicrobedb different from the VMD database?
The current version of Eumicrobedb is based on a completely different database and analytical engine called Genome Annotator Lite (GAL), developed at the Indian Institute of Chemical Biology (IICB). GAL inherits its basic traits from the Genome Unified Schema (GUS: www.gusdb.org), with many additional modules built around it. The GUS schema has been trimmed substantially to retain the modules essential for genomics programs, and the obsolete ones have been discarded. Another important modification is the migration of the Oracle database to a MySQL platform. All the Perl object layers that served as the database API, with dependencies on BioPerl, have been replaced with lightweight standalone parsers. The database engine and the front end have been rebuilt to incorporate new additions.
2. When I click on the organism name on the main menu, where do I go?
On the main page [ http://www.eumicrobedb.org ], clicking an organism name takes the user to the genome browser page. By default, the first scaffold is loaded with its entire feature set displayed as tracks. The first green track is the non-coding track; clicking it fetches the sequence that lies between two coding regions. The blue track is the coding track; on mouseover, the product information for that gene appears in the empty text area on the top panel. The details of the browser tracks are described in the following sections.
3. Can I see my GFF files on the browser?
Yes, you can. This feature is available from version 10 onwards. You can upload your GFF file (in version 11 the input file size limit has been raised from 2 MB to 10 MB) and see its features on the browser as new tracks. Your GFF file may contain features for many scaffolds, but only the tracks for the scaffold currently on display are shown. For example, if the browser shows scaffold_1, then the scaffold_1 features of your GFF file will be shown in that view.
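The scaffold filtering described above can be sketched in a few lines of Python. This is only an illustration, not the code the browser runs: the function name features_for_scaffold and the example records are hypothetical, and the sketch assumes a standard 9-column, tab-separated GFF file.

```python
def features_for_scaffold(gff_lines, scaffold):
    """Return the GFF features whose sequence ID matches the displayed scaffold.

    Assumes standard 9-column, tab-separated GFF records (an assumption,
    not a statement about how the browser parses uploads).
    """
    features = []
    for line in gff_lines:
        line = line.rstrip("\n")
        # Skip blank lines, comments and directives such as "##gff-version 3".
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # malformed record
        if cols[0] == scaffold:
            features.append({
                "seqid": cols[0],
                "type": cols[2],
                "start": int(cols[3]),
                "end": int(cols[4]),
                "strand": cols[6],
            })
    return features


# Hypothetical upload with features on two scaffolds.
example_gff = [
    "##gff-version 3",
    "scaffold_1\tuser\tgene\t100\t900\t.\t+\t.\tID=g1",
    "scaffold_2\tuser\tgene\t50\t400\t.\t-\t.\tID=g2",
    "scaffold_1\tuser\texon\t100\t300\t.\t+\t.\tParent=g1",
]
shown = features_for_scaffold(example_gff, "scaffold_1")
print(len(shown), "features would appear as tracks for scaffold_1")
```

Running the same filter on your own file before uploading is a quick way to check which features will be visible in a given scaffold view.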
4. What is new on the browser?
Several new tracks have been introduced on the browser. The most significant are the comparative genomics tracks. We have performed an all-vs-all comparative genomics analysis on the current list of 26 organisms using lastz. The output files, generated in SAM format, are displayed on the browser. On mouseover, each comparative genomics track shows the region of alignment.
5. What do the different color codes mean?
Currently, each of the tracks on the browser is color coded differently. The coding tracks are blue and the non-coding tracks are green. The EST tracks have 4 distinct colors: red indicates a good alignment; green indicates an alignment of lesser quality, where the percentage of alignment is less than 90%; the third type is blue; and the worst alignments, where there are query gaps, are black.
6. Why do the genome fasta files have scaffold names rather than the names given by the genome centers?
We collected genome data from various genome centers that used different nomenclature. For uniformity, we renamed the sequences to the Scaffold_XX format (where XX stands for a number). For genomes whose fasta files were named differently from the scaffold format (e.g. Albugo laibachii, Aphanomyces astaci, Aphanomyces invadans, Pythium ultimum), we sorted the genome scaffolds in descending order of length and renamed them, beginning with scaffold_1 as the largest scaffold. The following link lists the original and new names for each of the genome fasta files: http://www.eumicrobedb.org/ForEMBOSS/
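The renaming scheme described above (sort by descending length, then number from scaffold_1) can be sketched as follows. This is an illustrative sketch only: the function name rename_by_length and the contig names are hypothetical, and how ties between equal-length sequences were handled is not stated in this FAQ, so the sketch simply keeps their input order.

```python
def rename_by_length(sequences):
    """Rename sequences to scaffold_1, scaffold_2, ... in descending length order.

    sequences: dict mapping original name -> sequence string.
    Returns (renamed, name_map), where name_map maps original -> new name.
    Ties keep their input order (an assumption; the FAQ does not specify this).
    """
    ordered = sorted(sequences.items(), key=lambda item: len(item[1]), reverse=True)
    renamed = {}
    name_map = {}
    for rank, (old_name, seq) in enumerate(ordered, start=1):
        new_name = f"scaffold_{rank}"
        renamed[new_name] = seq
        name_map[old_name] = new_name
    return renamed, name_map


# Hypothetical contigs with made-up names and sequences.
contigs = {"ctgA": "ACGT", "ctgB": "ACGTACGTAC", "ctgC": "ACGTAC"}
renamed, name_map = rename_by_length(contigs)
for old_name, new_name in name_map.items():
    print(old_name, "->", new_name)
```

The name_map returned here plays the same role as the list of original and new names published at the link above.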
7. Can I get non-coding sequences upstream and downstream of a coding sequence?
We have created clickable non-coding tracks for each of the genome fasta files on the browser page. One can click on them and get the DNA sequence present between two coding sequences.
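For readers who want to reproduce this offline, the sequence between two coding regions can be extracted with a short Python sketch. The function name noncoding_regions, the 1-based inclusive coordinates, and the toy scaffold are assumptions made for illustration; they are not taken from the Eumicrobedb code.

```python
def noncoding_regions(genes, scaffold_seq):
    """Return the gaps between consecutive coding regions on one scaffold.

    genes: list of (start, end) tuples, 1-based and inclusive (assumed convention).
    Returns a list of (start, end, sequence) tuples, also 1-based inclusive.
    """
    regions = []
    prev_end = None
    for start, end in sorted(genes):
        if prev_end is not None and start > prev_end + 1:
            gap_start, gap_end = prev_end + 1, start - 1
            # Convert 1-based inclusive coordinates to a Python slice.
            regions.append((gap_start, gap_end, scaffold_seq[gap_start - 1:gap_end]))
        # Track the furthest coding position seen so overlapping genes are handled.
        prev_end = end if prev_end is None else max(prev_end, end)
    return regions


# Toy 20 bp scaffold with three hypothetical coding regions.
toy_scaffold = "AAAACCCCGGGGTTTTAAAA"
gaps = noncoding_regions([(1, 4), (9, 12), (17, 20)], toy_scaffold)
for gap in gaps:
    print(gap)
```

Each tuple corresponds to one clickable non-coding segment lying between two neighbouring coding sequences.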
8. How do I interpret the comparative genomics modules on the browser?
The comparative genomics modules display all-vs-all whole-genome comparisons as color-coded tracks. For the comparative genomics analysis we used lastz, the successor of blastz [http://pipmaker.bx.psu.edu/dist/blastz.pdf]. Alignment results are stored in the database in SAM format. When the user views a scaffold on the browser, the lastz alignments of the remaining organisms appear on the tracks below. The orientation and the region of each alignment appear on mouseover in the text box. Each of these tracks is clickable. On click, the synteny detail page shows details of the syntenic region in organism 1 (the reference organism whose scaffold is loaded on the browser) and organism 2, together with the existing features. The features can be protein-coding genes, promoters, tRNAs, etc.
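As a rough illustration of how an alignment's region and orientation can be read from a SAM record, here is a minimal Python sketch. It relies only on the standard SAM fields (FLAG, RNAME, POS, CIGAR); the function name alignment_region and the example record are hypothetical, and the sketch does not claim to reproduce how the browser itself parses the lastz output.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operations that consume positions on the reference sequence.
REF_CONSUMING = set("MDN=X")


def alignment_region(sam_line):
    """Extract the reference region and strand from one SAM alignment line.

    Assumes a mapped record with a valid CIGAR string (not "*").
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    start = int(fields[3])  # 1-based leftmost position on the reference
    ref_len = sum(int(n) for n, op in CIGAR_RE.findall(fields[5])
                  if op in REF_CONSUMING)
    return {
        "query": fields[0],
        "reference": fields[2],
        "start": start,
        "end": start + ref_len - 1,
        # Bit 0x10 of FLAG marks a reverse-complemented query.
        "strand": "-" if flag & 0x10 else "+",
    }


# Hypothetical record: scaffold_7 of another organism aligned to scaffold_1.
record = "scaffold_7\t16\tscaffold_1\t1001\t255\t50M2D30M5I20M\t*\t0\t0\t*\t*"
region = alignment_region(record)
print(region)
```

In this example the CIGAR string covers 50 + 2 + 30 + 20 = 102 reference bases, so the alignment spans positions 1001 to 1102 on the minus strand.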
9. What are the contents of the gene detail page?
Our current annotations comprise Blastx, Interproscan, secretome, TMHMM, SignalP, Psort, Prop, pathway, and conserved proteome output. On the browser detail page we plot the gene features, followed by the log-likelihood values (Staden, 1982) and Fickett values for each gene. The two images following the gene feature image indicate the protein-coding potential of the predicted gene. In the log-likelihood plot, the frame plotted above the 0 mark is most likely the protein-coding frame. It often correlates with the intron-exon boundaries of the gene.
10. What information do I get when I click on a gene?
Clicking the gene plot takes you to another page where the sequence is plotted with its exon-intron boundaries.
Many different types of query options are now available, enabling the user to extract most of the information in the database. For instance, the user can query synteny, secretome, cluster, pathway, primary and secondary annotation, and GO term data.
11. What is new on the query page?
Several new query features have been added in this version: synteny query, secretome query, pathway query, EST ID query, GO term query, cluster ID and cluster description query, etc.
12. How is the query function structured?
The query options are classified according to query type. For example, there are query options for text string searches of the primary annotation, domain searches, etc. Users can now select multiple organisms to query against. Most of the query options and outputs are self-explanatory, but in case of problems we will be happy to sort things out and answer any questions users may have.
Analyzing Sequences with Toolkit and EMBOSS packages.
We have created special analysis packages, namely the toolkit and the EMBOSS GUI, for analyzing the sequence data. The toolkit package comprises Blast, bl2seq (for pairwise sequence alignment), detection of the coding potential of a sequence, construction of weight matrices from protein multiple sequence alignments, etc. There is also a get subsequence module, where the user can choose the organism name and the positions of the sequence to retrieve the corresponding substring.
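Conceptually, the get subsequence module performs a coordinate-based slice of a scaffold. A minimal Python sketch is shown below; the function name get_subsequence and the 1-based inclusive coordinate convention are assumptions made for illustration, and the actual module may use a different convention.

```python
def get_subsequence(sequence, start, end):
    """Return the substring from start to end, 1-based and inclusive.

    The coordinate convention is an assumption for this sketch; check the
    toolkit page for the convention the real module uses.
    """
    if start < 1 or end > len(sequence) or start > end:
        raise ValueError("positions must satisfy 1 <= start <= end <= sequence length")
    return sequence[start - 1:end]


# Toy sequence; positions 3 to 6 are requested.
fragment = get_subsequence("ACGTACGTAC", 3, 6)
print(fragment)
```

Validating the positions first avoids silently returning a truncated or empty string when the requested range falls outside the scaffold.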
13. What kind of sequence analysis can I perform through EMBOSS?
EMBOSS is an open-source software package with more than 150 sequence analysis programs. We have installed the binary packages as well as the web-based GUI on our server for ease of data analysis. The genome data is embedded into the GUI of the EMBOSS package. In other words, the user does not have to copy and paste sequences or fragments of sequences. The user can directly enter the name and the region of the sequence and get the analysis done (details are explained on the tutorial page).
Fig 5: The newly added query features include the genome synteny query. Precomputed syntenic information is already loaded in the database. Users can select organisms from the pull-down menu and choose to check syntenic information; a pair of syntenic data is then displayed on the page (Figure 5). The CG query takes a genomic region of a particular organism, returns all the homologous regions in the other organism, and displays them in tabular form.
Fig 6: Query By KOG ID: Pathway annotation for the protein sequences is present in parsed form. For each KOG ID query, the entire enzyme list is displayed.
Fig 7: Query By Conserved Region:
A conserved region query was introduced in version 10. Here the user can select an organism and a scaffold location for the conserved region query. A detailed list of the available conserved regions is displayed on the output page.
Fig 8: EMBOSS PACKAGE:
The user has a number of sequence analysis programs available through EMBOSS. On the EMBOSS page there is a small text box for entering the organism name and the fasta file name, so that the sequence is passed directly into the analysis. The naming conventions for each of the organism fasta files are at http://www.eumicrobedb.org/genome/ and their scaffold names are at http://www.eumicrobedb.org/ForEMBOSS/