joinloci: join single-column genotypes into a multi-column table

Courtesy of the Sork Lab in Ecology & Evolutionary Biology at UCLA. Written by Doug Scofield.


Note: Total input file size must be less than 50 MB

Timestamped copies of input and output files can be archived on the server. The archive name is provided by you; it may be an existing archive name that you've provided before (say a project name) or a new archive name. If you take advantage of this, do not provide more than one input file with the same name at the same time.

Name of archive:
Skip the archive step


Input file:

Up to twenty total files may be uploaded. An empty filename ends the input list.

Input file 2:
Input file 3:
Input file 4:
Input file 5:
Input file 6:
Input file 7:
Input file 8:
Input file 9:
Input file 10:
Input file 11:
Input file 12:
Input file 13:
Input file 14:
Input file 15:
Input file 16:
Input file 17:
Input file 18:
Input file 19:
Input file 20:

First line of input file contains: column headings / data.
Ignore case in IDs, marker names, and allele values: Yes No.
Input file columns separated by: TAB characters / commas.
Note that blanks at beginning or end of line or on either side of the separator will be ignored.
Value used for missing data on input:
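
The input options above can be pictured with a short Python sketch. This is an illustration only, not the joinloci Perl code: the function name `parse_line` and the parameters `sep`, `ignore_case`, and `missing` are hypothetical, and upper-casing as the way to ignore case is an assumption.

```python
def parse_line(line, sep="\t", ignore_case=False, missing=None):
    """Split one input line into fields, ignoring blanks around each field.

    All names here are hypothetical; they only mirror the form options.
    """
    # Drop the line ending, split on the separator, then trim each field so
    # blanks at either end of the line or around the separator are ignored.
    fields = [field.strip() for field in line.rstrip("\r\n").split(sep)]
    if ignore_case:
        # Assumption: case is ignored by folding everything to upper case.
        fields = [field.upper() for field in fields]
        if missing is not None:
            missing = missing.upper()
    # Fields equal to the missing-data value are reported as None.
    return [None if missing is not None and field == missing else field
            for field in fields]
```

For example, `parse_line("  0022s1 ,1C11 , 138,140  ", sep=",", ignore_case=True)` gives the four trimmed, upper-cased fields of that line.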

Gonda-related options:
Input file is Gonda format: Yes No.
Append actual allele size to allele values: Yes No.
Append peak height to allele values: Yes No.
Append plate of origin to allele values: Yes No.
Append plate of origin to null alleles too: Yes No.
Fix IDs to allow for mix of old and new sample naming schemes: Yes No.

The specific Gonda format of the input will be auto-detected by interpreting each header line encountered; multiple Gonda-format files of differing Gonda formats can be concatenated together for input so long as header lines are kept for each file for joinloci to interpret. Likewise, separate input files provided above may have different Gonda formats, and may themselves be composed of concatenated Gonda-format files. We have tried to make this as easy as possible for you, the user, so if you find any mistakes please let Doug Scofield know.

Output file name:
If the output filename is blank, it will be "joinloci_input-filename". If you provided more than one input file, a "+" and the number of additional files will be appended to the name of the first file, for example "joinloci_first-filename+2.txt".

Output file columns should be separated by: TAB characters / commas.
Value to use for null alleles:
Value to use for missing data:
Distinguish duplicate loci with unique column header: Yes No.
Cut off duplicate reporting at duplicates (useful when joining many runs having an identical standard).


Output file will be returned directly to your browser. You might need to look for it in your "Downloaded Files" folder.


Description

This web page can help you join single-column genotypes into a more compact multi-column format with one genotype per sample. The hard work is accomplished via a Perl script, downloadable directly from here. Run the Perl script with the --help option to see command-line options.

Specifically, the reformatting process starts with lines having the following format (the header line may be absent; an input option states whether the first line contains column headings):


    ID     Locus Allele1 Allele2
    0022S1 1C11  138     140
    0022S1 0M07  213     213
    0022S2 1C11  140     140
    0022S2 0M07  213     217
    ...

and creates an output table having the following header and data format:

    ID     1C11.1 1C11.2 0M07.1 0M07.2
    0022S1 138    140    213    213
    0022S2 140    140    213    217
    ...

Output lines are assembled according to the values in the first two columns of the input file, which are assumed to represent a sample identifier and a locus identifier, respectively. The specific text format of these columns, and of the third and fourth columns, which represent the diploid genotype of that sample at that locus, is not important; values may be letters, numbers, symbols, etc., so long as they do not contain the character(s) chosen as the column separator.
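
The join described above can be sketched in a few lines of Python. This is a minimal illustration, not the joinloci Perl script: the function name `join_loci` and the `missing` placeholder value are hypothetical, loci are assumed to appear in the output in order of first appearance, and duplicate sample/locus pairs are simply overwritten rather than reported.

```python
def join_loci(rows, missing="?"):
    """Join (sample_id, locus, allele1, allele2) rows into one row per sample.

    Returns a list of rows; the first row is the header.
    """
    loci = []      # loci in order of first appearance
    samples = {}   # sample_id -> {locus: (allele1, allele2)}
    for sample_id, locus, allele1, allele2 in rows:
        if locus not in loci:
            loci.append(locus)
        # Assumption: a later duplicate silently replaces an earlier one.
        samples.setdefault(sample_id, {})[locus] = (allele1, allele2)

    # Two output columns per locus, named locus.1 and locus.2.
    header = ["ID"] + [f"{locus}.{i}" for locus in loci for i in (1, 2)]
    table = [header]
    for sample_id, genotypes in samples.items():
        line = [sample_id]
        for locus in loci:
            # Samples lacking a locus get the missing-data placeholder.
            line.extend(genotypes.get(locus, (missing, missing)))
        table.append(line)
    return table


rows = [
    ("0022S1", "1C11", "138", "140"),
    ("0022S1", "0M07", "213", "213"),
    ("0022S2", "1C11", "140", "140"),
    ("0022S2", "0M07", "213", "217"),
]
for line in join_loci(rows):
    print("\t".join(line))
```

Run on the four example lines, the sketch prints the same header and two sample rows as the output table shown above.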

A number of options can be used to specify the input and output formats.

Gonda-specific details

As noted above, the specific Gonda format of the input will be auto-detected by interpreting each header line encountered. If a header line indicates a different Gonda format than the one currently in use, interpretation will automatically switch to the new format as indicated by the new header line. Input files may switch back and forth between Gonda formats, but each section of unique format must be delimited by an appropriate header line. If you concatenate Gonda results together and make no further modifications to the input file, this should all be handled correctly. Header formats currently recognized include:

Gonda format | Header column 1   | 2           | 3           | 4           | 5            | 6            | 7        | 8        | ...
1            | Sample File       | Sample Name | Panel       | Marker      | Dye          | Allele 1     | Allele 2 | Allele 3 | ...
2            | Sample File       | Sample Name | Panel       | Marker      | Dye          | Allele 1     | Allele 2 | Size 1   | ...
3            | Sample File       | Sample Name | Run Name    | Panel       | Marker       | Dye          | SNP      | Allele 1 | ...
4            | external_dna_code | Plate       | gel_date    | locus       | allele_a_bin | allele_b_bin | ...      | ...      | ...
5            | File Name         | Lane        | Sample Info | Category    | Peak 1       | Peak 2       | Peak 1   | Peak 2   | ...
6            | File Name         | External_ID | Lane        | Sample Info | Category     | Peak 1       | Peak 2   | Peak 1   | ...
7            | SampleName        | Plate       | Well        | Project     | Tree         | Sample       | ...      | ...      | ...

Format 5 is an old Gonda format produced when they were using Genotyper software. It has three pairs of "Peak 1<tab>Peak 2" columns: the first are the called alleles, the second are the measured allele sizes, and the third are the peak heights.
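
As an illustration of the three peak pairs, here is a small Python sketch that splits one Format 5 data line into its parts. It is not taken from the joinloci script: the function name `split_format5` is hypothetical, and it assumes the three pairs occupy columns 5 through 10 in the order described above, directly after the four leading columns.

```python
def split_format5(fields):
    """Split the tab-separated fields of one Format 5 line into named parts.

    Assumed layout: 4 leading columns, then called alleles, measured
    sizes, and peak heights, each as a "Peak 1", "Peak 2" pair.
    """
    if len(fields) < 10:
        raise ValueError("expected at least 10 columns in a Format 5 line")
    leading, peaks = fields[:4], fields[4:10]
    return {
        "leading": leading,              # File Name, Lane, Sample Info, Category
        "called": tuple(peaks[0:2]),     # first pair: called alleles
        "sizes": tuple(peaks[2:4]),      # second pair: measured allele sizes
        "heights": tuple(peaks[4:6]),    # third pair: peak heights
    }
```

A caller would first split the raw line on TAB characters and then pass the resulting list to `split_format5`.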

Format 6 is a modified Format 5, created by hand using plate map files to insert a column 2 "External_ID" that is our lab-internal ID for the sample in the corresponding well.

Format 7 is not produced by Gonda; it is an internal format used by D. Scofield for existing Q. agrifolia genotype summary files, and it was most convenient to handle it like a Gonda format.

NOTE: If the output file reports a Gonda format 0, then there is an error, probably in the joinloci script. Please tell D. Scofield.
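
The header-driven switching between formats can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the logic of the joinloci script: the names `KNOWN_HEADERS`, `detect_format`, and `tag_lines` are hypothetical, the column boundaries are read from the header table above, and a header is assumed to be recognized by its leading column names alone.

```python
# Leading header columns for each format, as read from the table above.
KNOWN_HEADERS = {
    1: ["Sample File", "Sample Name", "Panel", "Marker", "Dye",
        "Allele 1", "Allele 2", "Allele 3"],
    2: ["Sample File", "Sample Name", "Panel", "Marker", "Dye",
        "Allele 1", "Allele 2", "Size 1"],
    3: ["Sample File", "Sample Name", "Run Name", "Panel", "Marker",
        "Dye", "SNP", "Allele 1"],
    4: ["external_dna_code", "Plate", "gel_date", "locus",
        "allele_a_bin", "allele_b_bin"],
    5: ["File Name", "Lane", "Sample Info", "Category",
        "Peak 1", "Peak 2", "Peak 1", "Peak 2"],
    6: ["File Name", "External_ID", "Lane", "Sample Info", "Category",
        "Peak 1", "Peak 2", "Peak 1"],
    7: ["SampleName", "Plate", "Well", "Project", "Tree", "Sample"],
}


def detect_format(line, sep="\t"):
    """Return the format number if the line is a known header, else None."""
    fields = [field.strip() for field in line.rstrip("\r\n").split(sep)]
    for number, prefix in KNOWN_HEADERS.items():
        if fields[:len(prefix)] == prefix:
            return number
    return None


def tag_lines(lines, sep="\t"):
    """Yield (format_number, line) for each data line in concatenated input.

    Header lines switch the current format and are not yielded themselves.
    """
    current = None
    for line in lines:
        found = detect_format(line, sep)
        if found is not None:
            current = found
            continue
        yield current, line
```

With this arrangement, concatenated files need no extra markup: each retained header line is enough to switch interpretation for the data lines that follow it.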