NRRD Components

COMPONENT - A Development of database for rice genetic resources.

Development of database for rice genetic resources: The database comprises passport information, characterization data for 30 descriptors, and evaluation data on selected accessions.
 
Scope of application indicating anticipated product and processes
Developing a bioinformatics database on rice genetic resources that represents the total spectrum of genetic diversity and provides complete information on the rice genome requires collation and critical analysis of information on the collections conserved in the national system. NBPGR has been coordinating the national network on ex-situ conservation of rice genetic resources, with the mandate to maintain a base collection that represents the total genetic diversity conserved and used in the Indian Agricultural Research System.
At present, the National Genebank at NBPGR has over 80,000 accessions of rice in its base collections and maintains a database on these collections. The passport information in that database has several limitations for detailed analysis. Additionally, the earlier collection missions concentrated on capturing genetic diversity without applying indicators for distinctiveness. Further, many accessions were assembled in various collaborative missions and, in the absence of a unique identification number, were deposited in the National Genebank under different donor identities. Consequently, a large number of duplicate collections have accumulated in the National Genebank. There is, therefore, a pressing need to rationalise these collections by identifying the unique accessions, which represent the total available genomic variability without redundancy. To achieve this, it would be pertinent to establish a functional database that can store the relevant information in a structured manner and provide analysis to enable identification of unique accessions based on geographic and other indicators. Once this database is established, the information on unique accessions would be extended to cover their morphological features (characterisation data), evaluation data (value-added traits) and finally their genomic constitution, including profiles of varieties and important genetic stocks, to provide total information on the rice genome, its evolution and distribution. The associated characterisation and evaluation data are scattered and archived in many institutions, and it is important that this information is retrieved, digitised and integrated into information retrieval systems built on the passport information of unique accessions.
In addition to the analysis of the associated information, many accessions may need detailed agronomic and molecular characterisation before they are categorised as duplicates.
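The first screening step described above, finding accessions that may have been deposited more than once under different donor identities, can be sketched roughly as follows. This is only an illustrative sketch: the field names, identifier formats and the choice of cultivar name plus collection site as the matching key are assumptions made for the example, not part of the proposal. The groups it returns are candidates only, to be confirmed by agronomic and molecular characterisation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Passport:
    """Minimal passport record; all field names here are hypothetical."""
    accession_id: str      # genebank identifier (made-up format)
    cultivar_name: str
    donor_id: str          # identity assigned by the donor institution
    collection_site: str


def normalise(text: str) -> str:
    """Lower-case and drop non-alphanumerics so spelling variants compare equal."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def candidate_duplicates(records):
    """Group accessions that share a normalised name and collection site.

    Returns a list of accession-id groups with more than one member.
    These are candidates for review, not confirmed duplicates.
    """
    groups = defaultdict(list)
    for rec in records:
        key = (normalise(rec.cultivar_name), normalise(rec.collection_site))
        groups[key].append(rec.accession_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Made-up records: the first two differ only in spelling and donor identity.
records = [
    Passport("ACC-0001", "Example Rice 1", "DONOR-A/17", "Site X"),
    Passport("ACC-0002", "example-rice 1", "DONOR-B/203", "site x"),
    Passport("ACC-0003", "Example Rice 2", "DONOR-A/18", "Site Y"),
]
print(candidate_duplicates(records))  # [['ACC-0001', 'ACC-0002']]
```

The donor identity is deliberately left out of the matching key, since differing donor identities are the stated cause of the duplication.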
 
Project Summary
At present, the National Genebank at NBPGR has over 80,000 accessions of rice in its base collections and maintains a database on these collections. The passport information in that database has several limitations for detailed analysis. Additionally, the earlier collection missions concentrated on capturing genetic diversity without applying indicators for distinctiveness. Further, many accessions were assembled in various collaborative missions and, in the absence of a unique identification number, were deposited in the National Genebank under different donor identities. Consequently, a large number of duplicate collections have accumulated in the National Genebank. There is, therefore, a pressing need to rationalise these collections by identifying the unique accessions, which represent the total available genomic variability without redundancy. To achieve this, it would be pertinent to establish a functional database that can store the relevant information in a structured manner and provide analysis to enable identification of unique accessions based on geographic and other indicators. Once this database is established, the information on unique accessions would be extended to cover their morphological features (characterisation data), evaluation data (value-added traits) and finally their genomic constitution, to provide total information on the rice genome, its evolution and distribution.
The associated characterisation and evaluation data are scattered and archived in many institutions, and it is important that this information is retrieved and digitised and that an integrated information retrieval system is developed.
 
COMPONENT - B Development of knowledge-based database of rice genome.

Development of knowledge-based rice genome database: The knowledge base comprises data on the rice whole genome sequence and a huge collection of experimental data generated over several decades.
 
Project Summary
With the availability of the whole genome sequence of rice and a huge collection of experimental data generated over several decades, the time is appropriate to take stock of this knowledge and move to a higher level of understanding, where the organism is studied as a system. It is important that all the information is gathered on a common platform and that, instead of a mere compilation, logical, intricate and precise associations are formulated. Such a compilation would give a precise overview of the knowledge gathered on rice and would help in devising a future line of action. Because of the complexity of the experimental data, this compilation cannot be completely automated. It is therefore proposed to initiate a manual curation exercise, in which curators interpret the information in the published literature and associate it with the relevant genomic elements. The database may be further extended to include non-protein-coding elements. A preliminary search at the time of writing the proposal revealed about 10,000 research articles related to rice. Besides the daunting task of analyzing all the articles on rice, developing a framework that can seamlessly accommodate the extremely heterogeneous information in the publications would be both essential and challenging. The project is therefore conceived to have a defined development phase, aimed at formulating and improving the numerous logical associations and simultaneously developing software tools to implement them. Another important aspect would be the training of curators to perform the manual curation of the scientific literature, which is essential to maintain accuracy and uniformity in its interpretation.
It is expected that it may not initially be possible to map all of the ~10,000 articles to the known rice proteins; however, the exercise would bring out several facts that would help to further develop and modify the database and the curation procedure to accommodate the data. The outcome of this exercise would be compiled in a database, and the information would be displayed graphically.
 
 
The database would act as a central hub of information on rice and greatly facilitate numerous knowledge-based discoveries. One immediate use of such a database would be as a platform for seamless integration of advanced molecular data with the knowledge on Indian rice germplasm. To initiate such an interface, data would be generated on genome-wide analysis of DNA polymorphism within indica rice cultivars. Molecular mapping would be performed on Indian cultivated varieties of rice. Highly polymorphic markers would be selected from different chromosomes of rice, and this marker set would be used to generate a unique profile for each of about 500 important cultivated varieties of rice.
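As a rough illustration of the marker selection idea, the sketch below ranks markers by gene diversity (expected heterozygosity, one minus the sum of squared allele frequencies) and checks whether a chosen marker set gives every variety a distinct profile. The proposal does not say how polymorphism would be measured, so the use of gene diversity is an assumption, and the variety names, marker names and allele labels are made up for the example.

```python
from collections import Counter


def gene_diversity(alleles):
    """Expected heterozygosity: 1 minus the sum of squared allele frequencies."""
    counts = Counter(alleles)
    total = len(alleles)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def profiles_are_unique(genotypes, markers):
    """True if no two varieties carry the same alleles across `markers`."""
    profiles = [tuple(g[m] for m in markers) for g in genotypes.values()]
    return len(set(profiles)) == len(profiles)


# Hypothetical allele calls: variety -> marker -> allele label.
genotypes = {
    "Variety1": {"M1": "a", "M2": "x", "M3": "p"},
    "Variety2": {"M1": "a", "M2": "y", "M3": "p"},
    "Variety3": {"M1": "b", "M2": "z", "M3": "p"},
    "Variety4": {"M1": "b", "M2": "x", "M3": "p"},
}

markers = ["M1", "M2", "M3"]
scores = {m: gene_diversity([g[m] for g in genotypes.values()]) for m in markers}

# Most polymorphic markers first; M3 is monomorphic and scores 0.
ranked = sorted(markers, key=scores.get, reverse=True)
print(ranked)                                         # ['M2', 'M1', 'M3']
print(profiles_are_unique(genotypes, ranked[:1]))     # False
print(profiles_are_unique(genotypes, ranked[:2]))     # True
```

In this toy data the single most polymorphic marker cannot separate all four varieties, while the top two together can, which is why a set of markers is needed and not just one.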
 
 
Origin of the proposal
Following the generation of the whole genome sequence of rice, precise annotation of the whole genome remains a daunting task. Over time, several global efforts have produced more than one version of the rice whole genome annotation. All the annotation approaches so far have relied primarily on an automated pipeline for identifying the various genomic elements. While this approach is rapid, it substantially compromises the precision of the information. One major aspect that needs to be addressed is the detailed association of the annotations with the published literature. Because the information in a scientific publication is complex, this process cannot be automated and requires manual curation. In view of these facts, it is proposed to undertake a study that would manually curate the scientific literature on rice and map it to the whole genome data. About 10,000 publications (at the time of writing of this proposal) addressing various aspects of rice biology are available in the public databases. These articles would be manually evaluated by curators, who would then organize the information in a specified format. Subsequently, the information would be associated with all the genomic elements. It would be stored in a database capable of displaying the integrated information graphically.
 
Definition of the problem
Manual curation is considered the most precise approach for genome annotation. However, the process is extremely tedious and requires a great deal of organization. The very strength of the process, i.e. manual analysis, could lead to the accumulation of serious errors that may not be apparent to the end-user. To overcome these bottlenecks, the following aspects need to be addressed:
Organization of data in literature: Careful investigation of any scientific publication would reveal that it contains very heterogeneous data. Some of it is supported by experimental data, while much of it may be assumptions or theories proposed by the researchers. It is important to extract this information properly and classify it appropriately. An organization schema therefore needs to be developed to allow proper classification of the information. The format should capture every aspect of the research study described in the publication and should not oversimplify the issues. It should clearly distinguish points supported by experimental work from those that are logical implications or theories.
Development of annotation guidelines: Biological systems are extremely complicated, and it takes years of experience to grasp the intricacies of the subject. It is not practical to organize the manual curation effort by involving experts in each of the various areas of rice research. It is therefore essential to formulate clear guidelines that would help to extract, organize and map the information. Curators will have to be trained to work according to these guidelines.
Development of database and data entry portals: A well-defined database needs to be developed to store the curated information at various levels of analysis. The structure of the database would evolve as the project proceeds. Similarly, interactive data entry portals need to be developed for the curators to enter the information gathered from mining the literature. Apart from being data entry points, these portals would also include several logic verification checks to help the curators ensure correct data entry.
Development of web-enabled graphical interface: In order to share the information publicly, a graphical interface would be developed that can display the annotated information in a user-defined, dynamic manner. Appropriate search engines would be implemented to perform exhaustive searches on the annotations.
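The organization schema and the verification checks in the data entry portal could take a shape along the following lines. This is a minimal sketch under stated assumptions: the three evidence categories follow the distinction drawn above between experimentally supported points, logical implications and theories, but the record fields, identifier formats and the specific checks are hypothetical and would be defined by the project's own annotation guidelines.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Evidence(Enum):
    EXPERIMENTAL = "experimental"   # backed by data reported in the paper
    INFERRED = "inferred"           # logical implication drawn by the authors
    HYPOTHESIS = "hypothesis"       # theory proposed without direct support


@dataclass
class CuratedStatement:
    """One curated claim from one publication; all fields are hypothetical."""
    publication_id: str         # made-up reference key
    genomic_element: str        # made-up element identifier
    statement: str
    evidence: Evidence
    experiment_type: str = ""   # required only for experimental evidence


def validate(entry: CuratedStatement) -> List[str]:
    """Return a list of problems; an empty list means the entry passes."""
    problems = []
    if not entry.publication_id.strip():
        problems.append("missing publication reference")
    if not entry.genomic_element.strip():
        problems.append("missing genomic element")
    if not entry.statement.strip():
        problems.append("empty statement")
    if entry.evidence is Evidence.EXPERIMENTAL and not entry.experiment_type.strip():
        problems.append("experimental evidence needs an experiment type")
    return problems


# A placeholder entry marked experimental but lacking the experiment type.
entry = CuratedStatement(
    publication_id="PUB-0001",
    genomic_element="ELEMENT-0001",
    statement="placeholder statement text",
    evidence=Evidence.EXPERIMENTAL,
)
print(validate(entry))  # ['experimental evidence needs an experiment type']
```

Recording the evidence category on every statement keeps the distinction between supported findings and proposed theories visible to the end-user, and returning a list of problems lets a portal show a curator all issues with an entry at once.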