S01 - Session O3 - A strategy for integrating multi-source historical phenotypic and genotypic datasets containing homonyms for global genomic prediction
Information
Authors: Daniel Edge-Garza *, Kate Evans, Elizabeth Ross, Dorrie Main, Craig Hardner
Genomic prediction may be used to combine historical phenotypic and genotypic datasets from multiple sources to examine the environmental stability of genetic performance. Implementation requires accurate matching of the identities of genetic treatments (i.e. accessions) and SNP marker loci. However, collection methods and data formats may differ among data sources, so using thesauri to translate source-specific identifiers into identifiers with a standard meaning across sources is vital for accurate predictions. For apple, we developed scripts to produce thesauri that standardize accession names and SNP locus identifiers across the RosBREED, FruitBreedomics and Australian Grove genomic datasets, which were generated from three SNP genotyping platforms. One challenge of integration is the presence of errors in the data that lead to homonyms (non-uniqueness in a name used to refer to a specific accession or its clone). To label the homonyms in these datasets correctly, the thesauri were primed with historical data ("training" datasets) from public databases and published tables labeled with Malus UNiQue identifiers (MUNQ IDs) and SNP marker information. The scripts revealed that these training datasets also contained errors, resulting in additional homonyms. Further review showed that these homonyms were caused by: 1) MUNQ IDs mis-assigned to accession names; 2) accession identifiers mis-assigned to MUNQ IDs or accession names; 3) SNP identifiers mis-assigned to 48 markers; and 4) potentially incorrect records in an international database, leading to questionable inferences about the accession name. To resolve these homonyms, the scripts were extended to identify potential errors in published historical datasets, correct resolvable data processing errors, and append the accession ID or source to the accession name where the origin of the error behind the homonym could not be determined.
Correcting these homonyms has facilitated the integration of approximately 2500 unique accessions across 289,317 SNP loci from the three genotyping platforms.
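The homonym check and the fallback labeling described above can be illustrated with a minimal Python sketch. It is not the authors' scripts: the record fields (`name`, `munq`, `source`), the function names, and every value in the toy data are hypothetical, and the sketch assumes that a homonym can be detected as one normalized accession name linked to more than one MUNQ ID.

```python
from collections import defaultdict


def normalize(name):
    """Normalize an accession name for matching (assumed rule: trim and lowercase)."""
    return name.strip().lower()


def find_homonyms(records):
    """Return {normalized name: sorted MUNQ IDs} for names linked to more than one MUNQ ID."""
    ids_by_name = defaultdict(set)
    for rec in records:
        ids_by_name[normalize(rec["name"])].add(rec["munq"])
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}


def label_records(records, homonyms):
    """Append the data source to the accession name of every record whose name is a homonym."""
    labeled = []
    for rec in records:
        name = rec["name"].strip()
        if normalize(name) in homonyms:
            name = f"{name} ({rec['source']})"
        labeled.append({**rec, "label": name})
    return labeled


# Toy records; names, MUNQ IDs and sources are invented for illustration only.
records = [
    {"name": "Cultivar A", "munq": "MUNQ-0001", "source": "SourceX"},
    {"name": "cultivar a ", "munq": "MUNQ-0002", "source": "SourceY"},
    {"name": "Cultivar B", "munq": "MUNQ-0003", "source": "SourceX"},
]

homonyms = find_homonyms(records)
labels = [rec["label"] for rec in label_records(records, homonyms)]
```

Here "Cultivar A" is flagged because it maps to two MUNQ IDs, so both of its records receive a source suffix, while "Cultivar B" keeps its name unchanged. Appending the source rather than choosing one MUNQ ID mirrors the case where the origin of the error cannot be determined.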