Courtesy of the Sork Lab in Ecology & Evolutionary Biology at UCLA. Written by Doug Scofield.
This web page can help you join single-column genotypes into a more compact multi-column format with one genotype per sample. The hard work is accomplished via a Perl script, downloadable directly from here. Run the Perl script with the --help option to see command-line options.
Specifically, the reformatting process starts with lines having the following format (the header line is optional, depending on a user-selected option):
ID | Locus | Allele1 | Allele2 |
---|---|---|---|
0022S1 | 1C11 | 138 | 140 |
0022S1 | 0M07 | 213 | 213 |
0022S2 | 1C11 | 140 | 140 |
0022S2 | 0M07 | 213 | 217 |
... | ... | ... | ... |

and creates an output table having the following header and data format:

ID | 1C11.1 | 1C11.2 | 0M07.1 | 0M07.2 |
---|---|---|---|---|
0022S1 | 138 | 140 | 213 | 213 |
0022S2 | 140 | 140 | 213 | 217 |
... | ... | ... | ... | ... |

Output lines are assembled according to the values for the first two columns of the input file, which are assumed to represent a sample identifier and locus identifier, respectively. The specific text format of these columns, as well as the specific format of the third and fourth columns, which represent the diploid genotype of that sample at that locus, are not important; they may be letters, numbers, symbols, etc., so long as they do not contain the character(s) chosen as the column separator.
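The joining step described above can be sketched in Python as follows. This is an illustration only, not the code of the Perl script: the function name `join_loci`, its parameters, and the `"NA"` placeholder for a sample with no genotype at a locus are all assumptions made for this example.

```python
def join_loci(lines, sep=None, header=True, missing="NA"):
    """Pivot 'ID Locus Allele1 Allele2' rows into one row per sample.

    sep=None splits on any run of whitespace; pass e.g. "\t" for tabs.
    `missing` is a hypothetical placeholder, not the script's own convention.
    """
    rows = [line.split(sep) for line in lines if line.strip()]
    if header:
        rows = rows[1:]  # drop the 'ID Locus Allele1 Allele2' line

    loci = []     # locus names, in order of first appearance
    samples = {}  # sample ID -> {locus: (allele1, allele2)}
    for sample_id, locus, allele1, allele2 in rows:
        if locus not in loci:
            loci.append(locus)
        samples.setdefault(sample_id, {})[locus] = (allele1, allele2)

    # Header: ID, then <locus>.1 and <locus>.2 for every locus
    table = [["ID"] + [f"{locus}.{i}" for locus in loci for i in (1, 2)]]
    for sample_id, genotypes in samples.items():
        row = [sample_id]
        for locus in loci:
            row.extend(genotypes.get(locus, (missing, missing)))
        table.append(row)
    return table
```

Because the alleles are carried through as plain strings, the sketch is indifferent to whether they are letters, numbers, or symbols, matching the description above.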
A number of options can be used to specify the input and output formats.
As noted above, the specific Gonda format of the input will be auto-detected by interpreting each header line encountered. If a header line indicates a different Gonda format than the one currently in use, interpretation will automatically switch to the new format as indicated by the new header line. Input files may switch back and forth between Gonda formats, but each section of unique format must be delimited by an appropriate header line. If you concatenate Gonda results together and make no further modifications to the input file, this should all be handled correctly. Header formats currently recognized include:
Gonda format | Header field 1 | Field 2 | Field 3 | Field 4 | Field 5 | Field 6 | Field 7 | Field 8 | ... |
---|---|---|---|---|---|---|---|---|---|
1 | Sample File | Sample Name | Panel | Marker | Dye | Allele 1 | Allele 2 | Allele 3 | ... |
2 | Sample File | Sample Name | Panel | Marker | Dye | Allele 1 | Allele 2 | Size 1 | ... |
3 | Sample File | Sample Name | Run Name | Panel | Marker | Dye | SNP | Allele 1 | ... |
4 | external_dna_code | Plate | gel_date | locus | allele_a_bin | allele_b_bin | ... | ... | ... |
5 | File Name | Lane | Sample Info | Category | Peak 1 | Peak 2 | Peak 1 | Peak 2 | ... |
6 | File Name | External_ID | Lane | Sample Info | Category | Peak 1 | Peak 2 | Peak 1 | ... |
7 | SampleName | Plate | Well | Project | Tree | Sample | ... | ... | ... |
Format 5 is an old Gonda format produced when they were using Genotyper software. It has three pairs of "Peak 1<tab>Peak 2" columns: the first are the called alleles, the second are the measured allele sizes, and the third are the peak heights.
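As a rough illustration of the Format 5 layout, the sketch below splits one tab-separated data row into its three "Peak 1/Peak 2" pairs. The function name and dictionary keys are invented for this example, and it assumes the three pairs occupy columns 5 through 10 in the order just described; the header table only shows the first two pairs explicitly.

```python
def split_format5_row(line):
    """Split a tab-separated Format 5 row into its three peak pairs.

    Assumes columns 1-4 are File Name, Lane, Sample Info, Category and
    columns 5-10 hold the three 'Peak 1<tab>Peak 2' pairs (an assumption
    for the third pair, which the header table elides).
    """
    fields = line.rstrip("\n").split("\t")
    file_name, lane, sample_info, category = fields[:4]
    peaks = fields[4:10]
    peaks += [""] * (6 - len(peaks))  # pad rows with fewer than six peak fields
    return {
        "file_name": file_name,
        "lane": lane,
        "sample_info": sample_info,
        "category": category,
        "called_alleles": tuple(peaks[0:2]),  # first pair
        "allele_sizes": tuple(peaks[2:4]),    # second pair
        "peak_heights": tuple(peaks[4:6]),    # third pair
    }
```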
Format 6 is a modified Format 5, created by hand using plate map files to insert a column 2 "External_ID" that is our lab-internal ID for the sample in the corresponding well.
Format 7 is not produced by Gonda; rather, it is an internal format used by D. Scofield for existing Q. agrifolia genotype summary files. It was most convenient to handle it like a Gonda format.
NOTE: If the output file reports a Gonda format 0, then there is an error, probably in the joinloci script. Please tell D. Scofield.
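The header-driven format switching described above can be sketched in Python along these lines. This is a minimal illustration, not the joinloci implementation: the names `HEADER_PREFIXES`, `detect_format`, and `tag_rows` are hypothetical, tab-separated input is assumed, and a header is matched simply by comparing its leading fields with those listed in the table of recognized formats.

```python
# Leading header fields for each format, copied from the table of
# recognized headers; fields hidden behind "..." are not checked.
HEADER_PREFIXES = {
    1: ("Sample File", "Sample Name", "Panel", "Marker", "Dye",
        "Allele 1", "Allele 2", "Allele 3"),
    2: ("Sample File", "Sample Name", "Panel", "Marker", "Dye",
        "Allele 1", "Allele 2", "Size 1"),
    3: ("Sample File", "Sample Name", "Run Name", "Panel", "Marker",
        "Dye", "SNP", "Allele 1"),
    4: ("external_dna_code", "Plate", "gel_date", "locus",
        "allele_a_bin", "allele_b_bin"),
    5: ("File Name", "Lane", "Sample Info", "Category",
        "Peak 1", "Peak 2", "Peak 1", "Peak 2"),
    6: ("File Name", "External_ID", "Lane", "Sample Info", "Category",
        "Peak 1", "Peak 2", "Peak 1"),
    7: ("SampleName", "Plate", "Well", "Project", "Tree", "Sample"),
}


def detect_format(line, sep="\t"):
    """Return the format number if `line` is a recognized header, else None."""
    fields = tuple(f.strip() for f in line.rstrip("\n").split(sep))
    for fmt, prefix in HEADER_PREFIXES.items():
        if fields[:len(prefix)] == prefix:
            return fmt
    return None


def tag_rows(lines, sep="\t"):
    """Yield (format, fields) for each data row, switching at every header."""
    current = None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fmt = detect_format(line, sep)
        if fmt is not None:
            current = fmt  # a new header line starts a new section
            continue
        if current is None:
            raise ValueError("data row found before any recognized header")
        yield current, line.rstrip("\n").split(sep)
```

With this approach, concatenated result files need no extra markup: each header line encountered simply resets the format applied to the rows that follow it.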