ALFALFA

A Long Fragment Aligner/A Long Fragment Aligner

Introduction

The long read mapper ALFALFA achieves high performance in accurately mapping long (>500bp) single-end and paired-end reads to gigabase-scale reference genomes, while remaining competitive for mapping shorter (>100bp) reads. Its seed-and-extend workflow is underpinned by fast retrieval of super-maximal exact matches from an enhanced sparse suffix array, with flexible parameter tuning to balance performance, memory footprint and accuracy.

Downloads

Latest version (beta release) (zip)

Latest version (beta release) (tar)

ALFALFA (v0.8.1) (zip)

ALFALFA (v0.8.1) (tar)

ALFALFA (v0.8) (zip)

ALFALFA (v0.8) (tar)

Installation

The following commands can be used to install the software if git is installed.

$ git clone git://github.com/readmapping/alfalfa.git
$ cd alfalfa
$ make

The following commands can be used to unpack and install the software when downloading a zip file. Replace VERSION with the name of the downloaded file.

$ unzip VERSION.zip
$ cd VERSION
$ make

The following commands can be used to unpack and install the software when downloading a tarball. Replace VERSION with the name of the downloaded file.

$ tar xvzf VERSION.tar.gz
$ cd VERSION
$ make

Usage

The alfalfa command has the following general anatomy

$ alfalfa <command> [<subcommand>] [options]

where command is either index, align or evaluate. Only the evaluate command requires an additional subcommand.

Options can have a single-letter (preceded by a single hyphen) or multi-letter (preceded by a double hyphen) name, or both. In the latter case, both names of the option can be used interchangeably. The description of an option starts with its name or names (separated by a forward slash), followed by a tuple (between round brackets) indicating the data type and default value of the argument that has to be passed to the option. If no default value is given, it is mandatory to pass an argument to the option. Options for which no tuple is given are used as toggles to enable/disable certain features and need no extra argument.

The following graphical overview of the anatomy of ALFALFA provides details on the command line options that can be used to tweak its operation. The software package offers separate commands for index construction, read mapping and evaluating mapping accuracy. Index construction can also be combined with read mapping during a single run of the package. Click on green regions to get a detailed description and general usage tips of each command, subcommand and option supported by ALFALFA. Click on white background to get a full screen display of the command line anatomy of ALFALFA.

ALFALFA command line anatomy
Command line anatomy of ALFALFA

This graphical overview of the anatomy of ALFALFA provides details on the command line options that can be used to tweak its operation. The software package offers separate commands for index construction, read mapping and evaluating mapping accuracy. Index construction can also be combined with read mapping during a single run of the package. Click on green regions to get a detailed description and general usage tips of each command, subcommand and option supported by ALFALFA. Click on white background to get a full screen display of the command line anatomy of ALFALFA.

The alfalfa command has the following general anatomy

$ alfalfa <command> [<subcommand>] [options]

where command is either index, align or evaluate. Only the evaluate command requires an additional subcommand.

Options can have a single-letter (preceded by a single hyphen) or multi-letter (preceded by a double hyphen) name, or both. In the latter case, both names of the option can be used interchangeably. The description of an option starts with its name or names (separated by a forward slash), followed by a tuple (between round brackets) indicating the data type and default value of the argument that has to be passed to the option. If no default value is given, it is mandatory to pass an argument to the option. Options for which no tuple is given are used as toggles to enable/disable certain features and need no extra argument.

Indexing a reference genome

The index command is used to construct the data structures for indexing a given reference genome.

The constructed index is stored to disk over multiple files that are contained in the same directory and that have names sharing the same prefix. Index files contain bookkeeping information (extension .aux) and individual arrays of an enhanced sparse suffix array: reference genome (extension .ref), suffix array (extension .sa), longest common prefix array (extension .lcp), inverse suffix array (extension .isa; optional), child array (extension .child; optional) and 10-mer lookup array (extension .kmer; optional). The inverse suffix array is the only array that is not constructed by default. The child array and 10-mer lookup array are not strictly necessary and may be omitted in order to save memory, but at the cost of a drop in performance.

The index command can be skipped as the align command also provides the option to generate an index and store it to disk. All options for customizing the index command can therefore also be used in combination with the align command.

List of options for this command:

  • -r/--reference (file) Read more
  • -s/--sparseness (int, 12) Read more
  • -p/--prefix (string, filename passed to the -r option) Read more
  • --no-child Read more
  • --suflink Read more
  • --no-kmer Read more
  • -h/--help Read more
<code>-r</code>/<code>--reference</code> (<code>file</code>)
Specifies the location of a file that contains the reference genome in multi-fasta format.
<code>-s</code>/<code>--sparseness</code> (<code>int</code>, <code>12</code>)
Specifies the sparseness of the index structure as a way to control part of the speed-memory trade-off.
<code>-p</code>/<code>--prefix</code> (<code>string</code>, filename passed to the <code>-r</code> option)
Specifies the prefix that will be used to name all generated index files. The same prefix has to be passed to the -i option of the align command to load the index structure when mapping reads.
<code>--no-child</code>
By default, a sparse child array is constructed and stored in an index file with extension .child. The construction of this sparse child array is skipped when the --no-child option is set. This data structure speeds up seed-finding at the cost of ($\frac{4}{s}$) bytes per base in the reference genome. As the data structure provides a major speed-up, it is advised to have it constructed.
<code>--suflink</code>
Suffix link support is disabled by default. Suffix link support is enabled when the --suflink option is set, resulting in an index file with extension .isa to be generated. This data structure speeds up seed-finding at the cost of ($\frac{4}{s}$) bytes per base. It is only useful when sparseness is less than four and minimum seed length is very low (less than 10), because it conflicts with skipping suffixes in matching the read. In practice, this is rarely the case.
<code>--no-kmer</code>
By default, a 10-mer lookup table is constructed that contains the suffix array interval positions to depth 10 in the virtual suffix tree. It is stored in an index file with extension .kmer and requires only 8MB of memory. The construction of this lookup table is skipped when the --no-kmer option is set. The lookup table stores intervals for sequences of length 10 that only contain {A,C,G,T}. This data structure speeds up seed-finding if the minimum seed length is greater than 10.
<code>-h</code>/<code>--help</code>
Prints to standard error the version number, usage description and an overview of the options that can be used to customize the software package.
Mapping and aligning a read set
The align command is used for mapping and aligning a read set onto a reference genome. As this process can be customized through a long list of options, we have grouped them into several categories.
I/O options
List of options in this category:
  • -r/--reference (file, part of the index) Read more
  • -i/--index (string) Read more
  • --save Read more
  • -0/--single (file) Read more
  • -1/--mates1 (file) Read more
  • -2/--mates2 (file) Read more
  • -o/--output (file, filename passed to the -r option with additional .sam extension) Read more
<code>-r</code>/<code>--reference</code> (<code>file</code>, <code>part of the index</code>)
Specifies the location of a file that contains the reference genome in multi-fastA format.
<code>-i</code>/<code>--index</code> (<code>string</code>)
Specifies the prefix used to name all generated index files. If this option is not set explicitly, an index will be computed from the reference genome according to the settings of the options that also apply to the index command.
<code>--save</code>
Specifies that if an index is constructed by the align command itself, it will be stored to disk. This option is ignored if the index is loaded from disk (option -i).
<code>-0</code>/<code>--single</code> (<code>file</code>)
Specifies the location of a file that contains single-end reads. Both fastA and fastQ formats are accepted. If both single-end and paired-end reads are specified, single-end reads are processed first.
<code>-1</code>/<code>--mates1</code> (<code>file</code>)
Specifies the location of a file that contains the first mates of paired-end reads. Both fastA and fastQ formats are accepted.
<code>-2</code>/<code>--mates2</code> (<code>file</code>)
Specifies the location of a file that contains the second mates of paired-end reads. Both fastA and fastQ formats are accepted.
<code>-o</code>/<code>--output</code> (<code>file</code>, filename passed to the <code>-r</code> option with additional <code>.sam</code> extension)
Specifies the location of the generated SAM output file containing the results of read mapping and alignment.
Alignment options
List of options in this category:
  • -a/--alignments (int, 1) Read more
  • --no-forward Read more
  • --no-reverse Read more
  • -e/--edit-distance (float, 0.08) Read more
  • --no-rescue Read more
  • -t/--threads (int, 1) Read more
<code>-a</code>/<code>--alignments</code> (<code>int</code>, <code>1</code>)
Specifies the maximum number of alignments reported per read.
<code>--no-forward</code>
Do not compute alignments on the forward strand.
<code>--no-reverse</code>
Do not compute alignments on the reverse complement strand.
<code>-e</code>/<code>--edit-distance</code> (<code>float</code>, <code>0.08</code>)
Represents the maximum percentage of differences allowed in accepting alignments and used in combination with the dynamic programming score function to calculate the minimum alignment score.
<code>--no-rescue</code>
Disables rescue procedures that are normally initiated when ALFALFA does not find seeds and/or alignments with the current parameters.
<code>-t</code>/<code>--threads</code> (<code>int</code>, <code>1</code>)
Number of threads used during read mapping. Using more than one thread results in reporting read alignments in a different order compared to the order in which they are read from the input file(s).
Seed options
List of options in this category:
  • --seed (MEM | SMEM | PSMEM, SMEM) Read more
  • -l/--min-length (int, auto) Read more
  • -m/--max-seeds (int, 10000) Read more
  • --max-smems (int, 10) Read more
  • --max-mems (int, 20) Read more
  • --min-mem-length (int, 50) Read more
  • --no-sparseness Read more
<code>--seed</code> (<code>MEM</code> | <code>SMEM</code> | <code>PSMEM</code>, <code>SMEM</code>)
Specifies the type of seeds used for read mapping. Possible values are MEM for maximal exact matches, SMEM for super-maximal exact matches, and PSMEM for SMEMs with additional rare MEMs. The use of SMEMs generally boosts performance without having a negative impact on accuracy compared to the use of MEMs. On the other hand, there are usually many more MEMs than SMEMs, in general resulting in a higher number of candidate genomic regions. Reporting all MEMs might be useful if reporting more candidate mapping locations is preferred.
<code>-l</code>/<code>--min-length</code> (<code>int</code>, <code>auto</code>)
Specifies the minimum seed length. This value must be greater than the sparseness value used to build the index (option -s). By default, the value of this option is computed automatically using the following procedure. A value of 40 is used for reads shorter than 1kbp. The value is incremented by 20 for every 500bp above 1kbp, with the total increment being divided by the maximum percentage of errors allowed in accepting alignments (option -e).
<code>-m</code>/<code>--max-seeds</code> (<code>int</code>, <code>10000</code>)
Specifies the maximum number of same-length seeds that will be selected per offset in the read sequence. The value passed to this option is multiplied by the automatically computed skip factor that determines sparse matching of sampled suffixes from the read sequence. As a result, the actual number of seeds per starting position in the read might still vary. Higher values of this option result in higher numbers of seeds, increasing in turn the number of candidate genomic regions.
<code>--max-smems</code> (<code>int</code>, <code>10</code>)
Specifies the maximum number of SMEMs per offset in the read sequence to allow MEM-finding. This only applies for PSMEM seeds.
<code>--max-mems</code> (<code>int</code>, <code>20</code>)
Specifies the maximum number of MEMs per offset in the read that can be used for candidate region identification.
<code>--min-mem-length</code> (<code>int</code>, <code>50</code>)
Specifies the minimum length MEMs need to have to be used for candidate region identification.
<code>--no-sparseness</code>
Disables the use of sparseness in the read sequence during seed-finding.
Extend options
List of options in this category:
  • --max-alignments (int, 5000) Read more
  • -f/--max-failures (int, 10) Read more
  • --reset-failures Read more
  • --skip-unique Read more
  • -c/--min-coverage (float, 0.25) Read more
  • --local Read more
  • -b/--bandwidth (int, 100) Read more
  • -M/--match (int, 1) Read more
  • -U/--mismatch (int, -4) Read more
  • -O/--gap-open (int, -6) Read more
  • -E/--gap-extend (int, -1) Read more
  • --full-dp Read more
<code>--max-alignments</code> (<code>int</code>, <code>5000</code>)
Specifies the maximum number of alignments calculated per read. This value should be higher than the number of reported alignments (option -a). Decreasing this value can increase performance of the algorithm, at the cost of a lower accuracy and worse mapping quality estimation.
<code>-f</code>/<code>--max-failures</code> (<code>int</code>, <code>10</code>)
Specifies the maximum number of successive candidate regions that are investigated without success before ALFALFA stops extending the candidate regions of a read. Extension can be restarted only if the remaining candidate regions contain unique seeds.
<code>--reset-failures</code>
If set, the counter of successive candidate regions that are investigated without success is reset if a feasible alignment is found. By default the counter is only reset if a new best alignment is found.
<code>--skip-unique</code>
By default, ALFALFA extends all candidate regions containing unique seeds. If this flag is set, the uniqueness criterium is not taken into account when deciding upon the extension of a candidate region.
<code>-c</code>/<code>--min-coverage</code> (<code>float</code>, <code>0.25</code>)
Specifies the minimum percentage of the read length that candidate regions containing a single seed need to cover before extension of the candidate region is taken into consideration.
<code>--local</code>
By default, ALFALFA uses global alignment during the last phase of the mapping process. Global alignment in essence is end-to-end alignment, as it entirely covers the read but only covers the reference genome in part. Local alignment is used during the last phase of the mapping process if the --local option is set, which may result in soft clipping of the read.
<code>-b</code>/<code>--bandwidth</code> (<code>int</code>, <code>100</code>)
Specifies the maximum bandwidth that is used by the banded alignment algorithm. The bandwidth used is automatically inferred from the specification of the maximum percentage of errors allowed in accepting alignments (option -e), but is bound by this parameter.
<code>-M</code>/<code>--match</code> (<code>int</code>, <code>1</code>)
Specifies the positive score assigned to matches in the dynamic programming extension phase.
<code>-U</code>/<code>--mismatch</code> (<code>int</code>, <code>-4</code>)
Specifies the penalty assigned to mismatches in the dynamic programming extension phase.
<code>-O</code>/<code>--gap-open</code> (<code>int</code>, <code>-6</code>)
Specifies the penalty $O$ for opening a gap (insertion or deletion) in the dynamic programming extension phase. The total penalty for a gap of length $L$ equals $O + L\cdot E$. The use of affine gap penalties can be disabled by setting this value to zero.
<code>-E</code>/<code>--gap-extend</code> (<code>int</code>, <code>-1</code>)
Specifies the penalty $E$ for extending a gap (insertion or deletion) in the dynamic programming extension phase. The total penalty for a gap of length $L$ equals $O + L\cdot E$.
<code>--full-dp</code>
By default, ALFALFA uses chain-guided alignment to retrieve the CIGAR alignment. If this parameter is set, banded dynamic programming is performed instead. Activating this setting can greatly increase runtime, but can sometimes lead to more optimal alignments.
Paired-end mapping options
List of options in this category:
  • -I/--min-insert (int, 0) Read more
  • -X/--max-insert (int, 1000) Read more
  • --orientation (fr | rf | ff, fr) Read more
  • --no-mixed Read more
  • --no-discordant Read more
  • --dovetail Read more
  • --no-contain Read more
  • --no-overlap Read more
  • --paired-mode (1 | 2 | 3 | 4 | 5 | 6, 1) Read more
  • --paired-rescue Read more
<code>-I</code>/<code>--min-insert</code> (<code>int</code>, <code>0</code>)
Specifies the minimum insert size.
<code>-X</code>/<code>--max-insert</code> (<code>int</code>, <code>1000</code>)
Specifies the maximum insert size.
<code>--orientation</code> (<code>fr</code> | <code>rf</code> | <code>ff</code>, <code>fr</code>)
Specifies the orientation of mates. fr means a forward upstream first mate and reverse complemented downstream second mate or vice versa. rf means a reverse complemented upstream first mate and forward downstream second mate or vice versa. ff means a forward upstream first mate and forward downstream second mate or vice versa. Note that these definitions are literally taken over from Bowtie 2.
<code>--no-mixed</code>
Disables searching for unpaired alignments.
<code>--no-discordant</code>
Disables searching for discordant alignments.
<code>--dovetail</code>
Allows switching between upstream and downstream mates in the definition of their orientation (option --orientation).
<code>--no-contain</code>
Disallows concordant mates to be fully contained within each other.
<code>--no-overlap</code>
Disallows concordant mates to overlap each other.
<code>--paired-mode</code> (<code>1</code> | <code>2</code> | <code>3</code> | <code>4</code> | <code>5</code> | <code>6</code>, <code>1</code>)
Specifies the algorithm used to align paired-end reads. The possible algorithms are discussed in detail in the methods section. Algorithms 1 and 2 do not use information from candidate regions. Algorithms 3 and 4 prioritize extension of candidate regions over both reads. Algorithms 5 and 6 filter the list of candidate regions using the paired-end restraints. Algorithms with an odd number pair mapped reads after alignment. Algorithms with an even number perform dynamic programming across a window defined by the insert size restrictions to search for a bridging alignment reaching the other mate.
<code>--paired-rescue</code>
Enables a rescue procedue if no concordant alignment is found using the current parameter settings.
Miscellaneous options
List of options in this category:
  • -v/--verbose (int, 0) Read more
  • -h/--help Read more
<code>-v</code>/<code>--verbose</code> (<code>int</code>, <code>0</code>)
Turns on lots of progress reporting about the alignment process. Higher numbers give more verbose output. Information is printed to standard error and is useful for debugging purposes. The default value 0 disables progress reporting. The maximum verbosity level currently supported is 7.
<code>-h</code>/<code>--help</code>
Prints to standard error the version number, usage description and an overview of the options that can be used to customize the software package.
Evaluating mapping accuracy

The evaluate command is used for evaluating the accuracy of simulated reads and summarizing statistics from the SAM formatted alignments reported by a read mapper. It requires an additional subcommand that influences both the functionality and the input of the evaluate command. Currently supported subcommands are summary, sam and wgsim.

The evaluate command requires all input files to be sorted by read name. This can be done easily using SAMtools. Furthermore, read names of both mates should be identical for paired-end reads.

Options common to all subcommands:

  • -i/--input-sam (file) Read more
  • -o/--output (file, standard output) Read more
  • -q/--quality (comma-separated list of ints between 0 and 255, 0) Read more
  • -p/--print Read more
  • -h/--help Read more
<code>-i</code>/<code>--input-sam</code> (<code>file</code>)
Specifies the location of a SAM file that contains the read mapping alignments that need to be evaluated.
<code>-o</code>/<code>--output</code> (<code>file</code>, standard output)
Specifies the location of the file that will contain the generated output.
<code>-q</code>/<code>--quality</code> (comma-separated list of <code>ints</code> between <code>0</code> and <code>255</code>, <code>0</code>)
The values in the list represent quality thresholds. For each specified quality threshold, output is produced that reports only on the subset of alignments with quality value greater than or equal to the threshold.
<code>-p</code>/<code>--print</code>
Triggers the generated output to contain a list of all reads from the input SAM file followed by a binary value. Zero in dicates that the read is either unmapped or incorrectly mapped and one indicates that the read was mapped (summary subcommand) or mapped correctly (other subcommands).
<code>-h</code>/<code>--help</code>
Prints to standard error the version number, usage description and an overview of the options that can be used to customize the software package.
Summarizing read mapping statistics

The evaluate summary subcommand reports statistics about the number of mapped reads for which the actual mapping locations are unknown.

Options specific to this subcommand:

  • --reads (int) Read more
  • --paired Read more
<code>--reads</code> (<code>int</code>)
Specifies the number of reads given as input to the read mapper that produced the input SAM file (option --input-sam). This number can be different from the number of reads contained in the input SAM file (option --input-sam) if for example unmapped reads are not reported.
<code>--paired</code>
By default, the input SAM file (option --input-sam) is supposed to contain single-end reads. If the --paired option is set, it is supposed to contain paired-end reads. The summary for paired-end reads contains information on the number of reads mapped as paired and unpaired, as indicated by the flag field of the SAM format.
Evaluating mapping accuracy for reference SAM

The evaluate sam subcommand is used to evaluate the accuracy for sequencing reads generated by the Mason simulator and other read simulators that produce a reference SAM file containing alignments for the simulated reads.

Options specific to this subcommand:

  • -r/--reference (file) Read more
  • -w/--window (comma-separated list of ints, 50) Read more
  • --input-edit (string, NM) Read more
  • --reference-sam (file) Read more
  • --reference-edit (string, XE) Read more
<code>-r</code>/<code>--reference</code> (<code>file</code>)
Specifies the location of a file that contains the reference genome in multi-fasta format.
<code>-w</code>/<code>--window</code> (comma-separated list of <code>ints</code>, <code>50</code>)
The values in the list represent window sizes around the position in the reference genome from which the simulated read was extracted. An alignment is considered to be mapped correctly if it is mapped within a given window around the simulated position. Output is generated for each individual value.
<code>--input-edit</code> (<code>string</code>, <code>NM</code>)
Specifies the field of the SAM format that contains the edit distance of the alignments in the input SAM file (option --i). If no such field exists, the edit distance is computed from the CIGAR string, the read sequence and the reference genome. An alignment that is not mapped within a certain window around the simulated position (option -w) is considered plausibly mapped if its edit distance is less than the edit distance of the alignment taken from the reference SAM file (option --reference-sam).
<code>--reference-sam</code> (<code>file</code>)
Specifies the location of a reference SAM file containing alignments of the simulated reads as generated by the Mason simulator. Alignments contained in this file should be sorted by read name, which is easily done using SAMtools.
<code>--reference-edit</code> (<code>string</code>, <code>XE</code>)
Specifies the field of the SAM format that contains the edit distance of the alignments in the reference SAM file (option --reference-sam). The default value is set to XE because this is the field used by the Mason simulator.
Evaluating mapping accuracy for wgsim

The evaluate wgsim subcommand is used to evaluate the accuracy for reads simulated by wgsim.

Options specific to this subcommand:

  • -r/--reference (file) Read more
  • -w/--window (comma-separated list of ints, 50) Read more
  • --input-edit (string, NM) Read more
<code>-r</code>/<code>--reference</code> (<code>file</code>)
Specifies the location of a file that contains the reference genome in multi-fasta format.
<code>-w</code>/<code>--window</code> (comma-separated list of <code>ints</code>, <code>50</code>)
The values in the list represent window sizes around the position in the reference genome from which the simulated read was extracted. An alignment is considered to be mapped correctly if it is mapped within a given window around the simulated position. Output is generated for each individual value.
<code>--input-edit</code> (<code>string</code>, <code>NM</code>)
Specifies the field of the SAM format that contains the edit distance of the alignments in the input SAM file (option --i). If no such field exists, the edit distance is computed from the CIGAR string, the read sequence and the reference genome. An alignment that is not mapped within a certain window around the simulated position (option -w) is considered plausibly mapped if its edit distance is less than the edit distance of the alignment taken from the reference SAM file (option --reference-sam).

Example

To quickly get started, you can use the small example files contained within the repository and the commands given in the example page.

Change Log

Links

essaMEM project

Contact

Contact lead developer Michaƫl Vyverman (michael[dot]vyverman[at]ugent[dot]be) if you have any further questions or suggestions.