S01 - Session O3 - A strategy for integrating multi-source historical phenotypic and genotypic datasets containing homonyms for global genomic prediction

Tuesday, August 16, 2022 10:45 AM to 11:00 AM · 15 min. (Europe/Paris)
Angers Congress Centre
S01 Breeding and effective use of biotechnology and molecular tools in horticultural crops

Information

Authors: Daniel Edge-Garza *, Kate Evans, Elizabeth Ross, Dorrie Main, Craig Hardner

Genomic prediction may be used to combine historical phenotypic and genotypic datasets from multiple sources to examine the environmental stability of genetic performance. Implementation requires accurate matching of the identities of genetic treatments (i.e. accessions) and SNP marker loci. However, collection methods and data formats may differ among data sources. Using thesauri to translate data identifiers from a specific source into identifiers with a standard meaning across sources is vital for accurate predictions. For apple, we developed scripts to produce thesauri that standardize accession names and SNP locus identifiers across the RosBREED, FruitBreedomics and Australian Grove genomic datasets, which were generated from three SNP genotyping platforms. One of the challenges of integration is the presence of errors in the data that lead to homonyms (non-uniqueness in a name used to refer to a specific accession or its clone). To correctly label the homonyms in the above datasets, the thesauri were primed with historical data ("training" datasets) from public databases and published tables labeled with Malus UNiQue identifiers (MUNQ IDs) and SNP marker information. However, the scripts revealed that these training datasets also contained errors, resulting in additional homonyms. Further review showed that these homonyms were caused by: 1) MUNQ IDs mis-assigned to accession names; 2) accession identifiers mis-assigned to MUNQ IDs or accession names; 3) SNP identifiers mis-assigned to 48 markers; and 4) potentially incorrect records in an international database leading to ostensible inferences about the accession name. To resolve these homonyms, the scripts were extended to identify potential errors in published historical datasets, correct resolvable data processing errors, and append the accession ID or source to the accession name where the source error of the homonym could not be determined.
Correcting these homonyms has facilitated the integration of approximately 2500 unique accessions across 289,317 SNP loci from the three genotyping platforms.
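
The thesaurus and homonym handling described above can be illustrated with a minimal Python sketch. It is not the authors' scripts: the record layout, the function names (`normalize`, `find_homonyms`, `build_thesaurus`), the source labels and the numeric identifiers are all hypothetical, and the name normalization rule is an assumption made only for illustration. The sketch flags a name as a homonym when it is linked to more than one identifier, and appends the source to the name when that happens.

```python
from collections import defaultdict

# Hypothetical records: each links a source-specific accession name to a
# unique identifier (standing in for a MUNQ ID). All values are invented.
records = [
    {"source": "panel_A", "name": "Cultivar X", "munq": 101},
    {"source": "panel_B", "name": "cultivar-x", "munq": 101},
    {"source": "panel_A", "name": "Cultivar Y", "munq": 202},
    {"source": "panel_C", "name": "Cultivar Y", "munq": 303},
]


def normalize(name):
    """Assumed rule: ignore case, spaces and punctuation when comparing names."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def find_homonyms(records):
    """Return {normalized name: set of IDs} for names linked to more than one ID."""
    ids_by_name = defaultdict(set)
    for rec in records:
        ids_by_name[normalize(rec["name"])].add(rec["munq"])
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


def build_thesaurus(records):
    """Map (source, raw name) to a standard label.

    Unambiguous names take the first spelling seen for that name.
    Homonyms get the source appended so that records stay distinguishable.
    """
    homonyms = find_homonyms(records)
    canonical = {}  # normalized name -> first spelling encountered
    thesaurus = {}
    for rec in records:
        key = normalize(rec["name"])
        label = canonical.setdefault(key, rec["name"])
        if key in homonyms:
            label = f"{label} [{rec['source']}]"
        thesaurus[(rec["source"], rec["name"])] = label
    return thesaurus


thesaurus = build_thesaurus(records)
```

With these invented records, both spellings of "Cultivar X" translate to the same standard label, while the two "Cultivar Y" records, which carry different identifiers, are kept apart by their source suffix. A real pipeline would also need rules for deciding which errors are resolvable, which this sketch does not attempt.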

Type of sessions
Oral Presentations
Type of broadcast
In Replay (after IHC), In person, In remote
Keywords
apple SNP array, bigdata, data management, Malus domestica
Room
Amphitheatre Jardin - Screen 1

Oral session including this Oral presentation

S01 - Session O3 - Breeding methodology
