Introduction

The long read mapper ALFALFA achieves high performance in accurately mapping long (>500bp) single-end and paired-end reads to gigabase-scale reference genomes, while remaining competitive for mapping shorter (>100bp) reads. Its seed-and-extend workflow is underpinned by fast retrieval of super-maximal exact matches from an enhanced sparse suffix array, with flexible parameter tuning to balance performance, memory footprint and accuracy.

Downloads

Latest version (beta release) (zip)

Latest version (beta release) (tar)

ALFALFA (v0.8.1) (zip)

ALFALFA (v0.8.1) (tar)

ALFALFA (v0.8) (zip)

ALFALFA (v0.8) (tar)

Installation

The following commands can be used to install the software if git is installed.

$ git clone git://github.com/readmapping/alfalfa.git
$ cd alfalfa
$ make

The following commands can be used to unpack and install the software when downloading a zip file. Replace VERSION with the name of the downloaded file.

$ unzip VERSION.zip
$ cd VERSION
$ make

The following commands can be used to unpack and install the software when downloading a tarball. Replace VERSION with the name of the downloaded file.

$ tar xvzf VERSION.tar.gz
$ cd VERSION
$ make

Usage

The alfalfa command has the following general anatomy:

$ alfalfa <command> [<subcommand>] [options]

where command is either index, align or evaluate. Only the evaluate command requires an additional subcommand.

Options can have a single-letter (preceded by a single hyphen) or multi-letter (preceded by a double hyphen) name, or both. In the latter case, both names of the option can be used interchangeably. The description of an option starts with its name or names (separated by a forward slash), followed by a tuple (between round brackets) indicating the data type and default value of the argument that has to be passed to the option. If no default value is given, it is mandatory to pass an argument to the option. Options for which no tuple is given are used as toggles to enable/disable certain features and need no extra argument.

The software package offers separate commands for index construction, read mapping and evaluating mapping accuracy. Index construction can also be combined with read mapping during a single run of the package. The sections below describe each command, subcommand and option supported by ALFALFA, together with general usage tips.

The index command is used to construct the data structures for indexing a given reference genome.

The constructed index is stored to disk over multiple files that are contained in the same directory and that have names sharing the same prefix. Index files contain bookkeeping information (extension .aux) and individual arrays of an enhanced sparse suffix array: reference genome (extension .ref), suffix array (extension .sa), longest common prefix array (extension .lcp), inverse suffix array (extension .isa; optional), child array (extension .child; optional) and 10-mer lookup array (extension .kmer; optional). The inverse suffix array is the only array that is not constructed by default. The child array and 10-mer lookup array are not strictly necessary and may be omitted in order to save memory, but at the cost of a drop in performance.
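
As a rough illustration of the naming scheme described above, the following Python sketch lists the files one would expect to find for a given prefix. The helper name index_files and the prefix hg19 are hypothetical, and the sketch assumes that each file name is simply the prefix followed by the extension.

```python
def index_files(prefix, child=True, kmer=True, suflink=False):
    """Return the index file names expected for a prefix (assumed: prefix + extension)."""
    # Always present according to the text: bookkeeping, reference,
    # suffix array and longest common prefix array.
    extensions = [".aux", ".ref", ".sa", ".lcp"]
    if suflink:   # the inverse suffix array is not constructed by default
        extensions.append(".isa")
    if child:     # optional child array
        extensions.append(".child")
    if kmer:      # optional 10-mer lookup array
        extensions.append(".kmer")
    return [prefix + ext for ext in extensions]


for name in index_files("hg19"):
    print(name)
```

The keyword arguments mirror the optional arrays named in the paragraph above; they are not part of ALFALFA itself.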

The index command can be skipped as the align command also provides the option to generate an index and store it to disk. All options for customizing the index command can therefore also be used in combination with the align command.

List of options for this command:

-r/--reference (file)
Specifies the location of a file that contains the reference genome in multi-FASTA format.
-s/--sparseness (int, 12)
Specifies the sparseness of the index structure as a way to control part of the speed-memory trade-off.
-p/--prefix (string, filename passed to the -r option)
Specifies the prefix used to name all generated index files. The same prefix has to be passed to the -i option of the align command to load the index structure when mapping reads.
--no-child
By default, a sparse child array is constructed and stored in an index file with extension .child. Its construction is skipped when the --no-child option is set. This data structure speeds up seed-finding at the cost of $\frac{4}{s}$ bytes per base in the reference genome, where $s$ is the sparseness (option -s). As the data structure provides a major speed-up, it is advised to have it constructed.
--suflink
Suffix link support is disabled by default. It is enabled when the --suflink option is set, which generates an index file with extension .isa. This data structure speeds up seed-finding at the cost of $\frac{4}{s}$ bytes per base in the reference genome. It is only useful when the sparseness is less than four and the minimum seed length is very low (less than 10), because it conflicts with skipping suffixes in matching the read. In practice, this is rarely the case.
--no-kmer
By default, a 10-mer lookup table is constructed that contains the suffix array interval positions to depth 10 in the virtual suffix tree. It is stored in an index file with extension .kmer and requires only 8MB of memory. The construction of this lookup table is skipped when the --no-kmer option is set. The lookup table stores intervals for sequences of length 10 that only contain {A,C,G,T}. This data structure speeds up seed-finding if the minimum seed length is greater than 10.
-h/--help
Prints to standard error the version number, usage description and an overview of the options that can be used to customize the software package.
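
To make the memory figures quoted above concrete, the following Python sketch adds up the cost of the optional structures only. The function name optional_index_bytes is hypothetical, the 8MB lookup table is assumed to mean 8 MiB, and the size of the mandatory arrays is not covered because the text does not state it.

```python
def optional_index_bytes(genome_bases, sparseness=12,
                         child=True, suflink=False, kmer=True):
    """Estimate the memory (in bytes) taken by the optional index structures."""
    total = 0.0
    if child:
        # Child array: 4/s bytes per base of the reference genome.
        total += 4 * genome_bases / sparseness
    if suflink:
        # Inverse suffix array: another 4/s bytes per base.
        total += 4 * genome_bases / sparseness
    if kmer:
        # 10-mer lookup table: "8MB" in the text, assumed here to be 8 MiB.
        total += 8 * 1024 ** 2
    return total


# Example: a 3 Gbp genome with the default sparseness of 12.
print(optional_index_bytes(3_000_000_000) / 1e9, "GB")
```

With the default sparseness the child array dominates the total, which illustrates why a higher sparseness value lowers the memory footprint of these structures.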

The align command is used for mapping and aligning a read set onto a reference genome. As this process can be customized through a long list of options, we have grouped them into several categories.

List of options in this category:

-r/--reference (file, part of the index)
Specifies the location of a file that contains the reference genome in multi-FASTA format.
-i/--index (string)
Specifies the prefix used to name all generated index files. If this option is not set explicitly, an index will be computed from the reference genome according to the settings of the options that also apply to the index command.
--save
Specifies that if an index is constructed by the align command itself, it will be stored to disk. This option is ignored if the index is loaded from disk (option -i).
-0/--single (file)
Specifies the location of a file that contains single-end reads. Both FASTA and FASTQ formats are accepted. If both single-end and paired-end reads are specified, single-end reads are processed first.
-1/--mates1 (file)
Specifies the location of a file that contains the first mates of paired-end reads. Both FASTA and FASTQ formats are accepted.
-2/--mates2 (file)
Specifies the location of a file that contains the second mates of paired-end reads. Both FASTA and FASTQ formats are accepted.
-o/--output (file, filename passed to the -r option with additional .sam extension)
Specifies the location of the generated SAM output file containing the results of read mapping and alignment.
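
The options above can be combined as in the following Python sketch, which only assembles an argument list and does not run ALFALFA. The helper name align_command and the file names genome.fa, reads.fq and reads.sam are hypothetical. The check that either a reference or an index prefix is given is an assumption derived from the option descriptions.

```python
import shlex


def align_command(output, reference=None, index=None, save=False,
                  single=None, mates1=None, mates2=None):
    """Build an argument list for 'alfalfa align' from the options listed above."""
    if reference is None and index is None:
        # Assumption: without -i an index is built from the file given to -r.
        raise ValueError("pass a reference (-r) or an index prefix (-i)")
    argv = ["alfalfa", "align"]
    if reference is not None:
        argv += ["-r", reference]
    if index is not None:
        argv += ["-i", index]
    elif save:
        # --save is ignored when the index is loaded from disk, so it is
        # only added when the index is built during this run.
        argv.append("--save")
    if single is not None:
        argv += ["-0", single]
    if mates1 is not None and mates2 is not None:
        argv += ["-1", mates1, "-2", mates2]
    argv += ["-o", output]
    return argv


# Hypothetical file names; the list could be handed to subprocess.run().
print(shlex.join(align_command("reads.sam", reference="genome.fa",
                               save=True, single="reads.fq")))
```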

List of options in this category:

-a/--alignments (int, 1)
Specifies the maximum number of alignments reported per read.
--no-forward
Do not compute alignments on the forward strand.
--no-reverse
Do not compute alignments on the reverse complement strand.
-e/--edit-distance (float, 0.08)
Specifies the maximum percentage of differences allowed in accepting alignments, expressed as a fraction (the default of 0.08 corresponds to 8%). It is used in combination with the dynamic programming score function to calculate the minimum alignment score.
--no-rescue
Disables rescue procedures that are normally initiated when ALFALFA does not find seeds and/or alignments with the current parameters.
-t/--threads (int, 1)
Specifies the number of threads used during read mapping. When more than one thread is used, read alignments are reported in a different order than the order in which the reads appear in the input file(s).

List of options in this category:

--seed (MEM | SMEM | PSMEM, SMEM)
Specifies the type of seeds used for read mapping. Possible values are MEM for maximal exact matches, SMEM for super-maximal exact matches, and PSMEM for SMEMs with additional rare MEMs. The use of SMEMs generally boosts performance without having a negative impact on accuracy compared to the use of MEMs. On the other hand, there are usually many more MEMs than SMEMs, which in general results in a higher number of candidate genomic regions. Using all MEMs might be useful if reporting more candidate mapping locations is preferred.
-l/--min-length (int, auto)
Specifies the minimum seed length. This value must be greater than the sparseness value used to build the index (option -s). By default, the value of this option is computed automatically using the following procedure. A value of 40 is used for reads shorter than 1kbp. The value is incremented by 20 for every 500bp above 1kbp, with the total increment being divided by the maximum percentage of errors allowed in accepting alignments (option -e).
-m/--max-seeds (int, 10000)
Specifies the maximum number of same-length seeds that will be selected per offset in the read sequence. The value passed to this option is multiplied by the automatically computed skip factor that determines sparse matching of sampled suffixes from the read sequence. As a result, the actual number of seeds per starting position in the read might still vary. Higher values of this option result in higher numbers of seeds, which in turn increases the number of candidate genomic regions.
--max-smems (int, 10)
Specifies the maximum number of SMEMs per offset in the read sequence to allow MEM-finding. This only applies to PSMEM seeds.
--max-mems (int, 20)
Specifies the maximum number of MEMs per offset in the read that can be used for candidate region identification.
--min-mem-length (int, 50)
Specifies the minimum length MEMs need to have to be used for candidate region identification.
--no-sparseness
Disables the use of sparseness in the read sequence during seed-finding.
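
The automatic procedure for -l leaves several details open, such as rounding and whether the error percentage is taken as 8 or 0.08. The following Python sketch shows one possible reading only and is not taken from the ALFALFA source code. It assumes that the default of -e corresponds to 8%, that partial 500bp steps are not counted, and that the result is rounded down. Both function names are hypothetical.

```python
def auto_min_seed_length(read_length, max_error_percent=8.0):
    """One possible reading of the automatic minimum seed length (an assumption)."""
    base = 40
    if read_length <= 1000:
        return base
    # Assumed: only complete 500bp steps above 1kbp add to the increment.
    steps = (read_length - 1000) // 500
    increment = 20 * steps
    # Assumed: the total increment is divided by the error percentage (8 for -e 0.08)
    # and the result is rounded down.
    return base + int(increment / max_error_percent)


def check_min_seed_length(min_length, sparseness=12):
    """The minimum seed length must be greater than the index sparseness (option -s)."""
    return min_length > sparseness


for length in (800, 3000, 10000):
    print(length, auto_min_seed_length(length))
```

Only check_min_seed_length follows directly from the option description; the other function should be compared against the behaviour of the installed version before relying on it.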

List of options in this category:

--max-alignments (int, 5000)
Specifies the maximum number of alignments calculated per read. This value should be higher than the number of reported alignments (option -a). Decreasing this value can increase performance of the algorithm, at the cost of lower accuracy and worse mapping quality estimation.
-f/--max-failures (int, 10)
Specifies the maximum number of successive candidate regions that are investigated without success before ALFALFA stops extending the candidate regions of a read. Extension can be restarted only if the remaining candidate regions contain unique seeds.
--reset-failures
If set, the counter of successive candidate regions that are investigated without success is reset if a feasible alignment is found. By default, the counter is only reset if a new best alignment is found.
--skip-unique
By default, ALFALFA extends all candidate regions containing unique seeds. If this flag is set, the uniqueness criterion is not taken into account when deciding upon the extension of a candidate region.
-c/--min-coverage (float, 0.25)
Specifies the minimum percentage of the read length that candidate regions containing a single seed need to cover before extension of the candidate region is taken into consideration.
--local
By default, ALFALFA uses global alignment during the last phase of the mapping process. Global alignment in essence is end-to-end alignment, as it entirely covers the read but only covers the reference genome in part. Local alignment is used during the last phase of the mapping process if the --local option is set, which may result in soft clipping of the read.
-b/--bandwidth (int, 100)
Specifies the maximum bandwidth that is used by the banded alignment algorithm. The bandwidth used is automatically inferred from the specification of the maximum percentage of errors allowed in accepting alignments (option -e), but is bounded by this parameter.
-M/--match (int, 1)
Specifies the positive score assigned to matches in the dynamic programming extension phase.
-U/--mismatch (int, -4)
Specifies the penalty assigned to mismatches in the dynamic programming extension phase.
-O/--gap-open (int, -6)
Specifies the penalty $O$ for opening a gap (insertion or deletion) in the dynamic programming extension phase. The total penalty for a gap of length $L$ equals $O + L\cdot E$. The use of affine gap penalties can be disabled by setting this value to zero.
-E/--gap-extend (int, -1)
Specifies the penalty $E$ for extending a gap (insertion or deletion) in the dynamic programming extension phase. The total penalty for a gap of length $L$ equals $O + L\cdot E$.
--full-dp
By default, ALFALFA uses chain-guided alignment to retrieve the CIGAR alignment. If this parameter is set, banded dynamic programming is performed instead. Activating this setting can greatly increase runtime, but can sometimes lead to better alignments.
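
The scoring parameters above can be illustrated with a small Python sketch. It evaluates the gap penalty $O + L\cdot E$ with the default values and sums match, mismatch and gap contributions in the usual way for affine-gap scoring. The function names are hypothetical, and the sketch scores an alignment that is already known; it is not ALFALFA's dynamic programming code.

```python
def gap_penalty(length, gap_open=-6, gap_extend=-1):
    """Total penalty O + L*E for a single gap of the given length."""
    return gap_open + length * gap_extend


def alignment_score(matches, mismatches, gap_lengths,
                    match=1, mismatch=-4, gap_open=-6, gap_extend=-1):
    """Score an alignment from its match, mismatch and gap counts (defaults as listed above)."""
    score = matches * match + mismatches * mismatch
    for length in gap_lengths:
        score += gap_penalty(length, gap_open, gap_extend)
    return score


# A gap of length 3 costs -6 + 3 * -1 = -9 with the default penalties.
print(gap_penalty(3))
# 95 matches, 2 mismatches and one gap of length 3.
print(alignment_score(95, 2, [3]))
```

Setting gap_open to zero in this sketch gives a purely linear gap cost, which corresponds to disabling affine gap penalties as described for the -O option.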

List of options in this category:

-I/--min-insert (int, 0)
Specifies the minimum insert size.
-X/--max-insert (int, 1000)
Specifies the maximum insert size.
--orientation (fr | rf | ff, fr)
Specifies the orientation of mates. fr means a forward upstream first mate and reverse complemented downstream second mate or vice versa. rf means a reverse complemented upstream first mate and forward downstream second mate or vice versa. ff means a forward upstream first mate and forward downstream second mate or vice versa. Note that these definitions are taken verbatim from Bowtie 2.
--no-mixed
Disables searching for unpaired alignments.
--no-discordant
Disables searching for discordant alignments.
--dovetail
Allows switching between upstream and downstream mates in the definition of their orientation (option --orientation).
--no-contain
Disallows concordant mates to be fully contained within each other.
--no-overlap
Disallows concordant mates to overlap each other.
--paired-mode (1 | 2 | 3 | 4 | 5 | 6, 1)
Specifies the algorithm used to align paired-end reads. The possible algorithms are discussed in detail in the methods section. Algorithms 1 and 2 do not use information from candidate regions. Algorithms 3 and 4 prioritize extension of candidate regions over both reads. Algorithms 5 and 6 filter the list of candidate regions using the paired-end constraints. Algorithms with an odd number pair mapped reads after alignment. Algorithms with an even number perform dynamic programming across a window defined by the insert size restrictions to search for a bridging alignment reaching the other mate.
--paired-rescue
Enables a rescue procedure if no concordant alignment is found using the current parameter settings.
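
The six values of --paired-mode combine two independent choices, which the following Python sketch spells out. The function names are hypothetical and the returned strings merely paraphrase the description above. The insert size check assumes inclusive bounds, which the text does not state.

```python
def describe_paired_mode(mode):
    """Split a --paired-mode value into its two components, as described above."""
    if mode not in (1, 2, 3, 4, 5, 6):
        raise ValueError("paired mode must be between 1 and 6")
    if mode in (1, 2):
        regions = "candidate region information is not used"
    elif mode in (3, 4):
        regions = "extension of candidate regions is prioritized over both reads"
    else:
        regions = "candidate regions are filtered using the paired-end constraints"
    if mode % 2 == 1:
        pairing = "mapped reads are paired after alignment"
    else:
        pairing = "a bridging alignment is searched within the insert size window"
    return regions, pairing


def insert_size_ok(insert_size, min_insert=0, max_insert=1000):
    """Check an insert size against -I and -X (inclusive bounds are an assumption)."""
    return min_insert <= insert_size <= max_insert


for mode in range(1, 7):
    print(mode, *describe_paired_mode(mode), sep=" | ")
```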

List of options in this category:

-v/--verbose (int, 0)
Turns on progress reporting about the alignment process. Higher numbers give more verbose output. Information is printed to standard error and is useful for debugging purposes. The default value 0 disables progress reporting. The maximum verbosity level currently supported is 7.
-h/--help
Prints to standard error the version number, usage description and an overview of the options that can be used to customize the software package.

The evaluate command is used for evaluating the accuracy of simulated reads and summarizing statistics from the SAM formatted alignments reported by a read mapper. It requires an additional subcommand that influences both the functionality and the input of the evaluate command. Currently supported subcommands are summary, sam and wgsim.

The evaluate command requires all input files to be sorted by read name. This can be done easily using SAMtools. Furthermore, read names of both mates should be identical for paired-end reads.

Options common to all subcommands:

-i/--input-sam (file)
Specifies the location of a SAM file that contains the read mapping alignments that need to be evaluated.
-o/--output (file, standard output)
Specifies the location of the file that will contain the generated output.
-q/--quality (comma-separated list of ints between 0 and 255, 0)
The values in the list represent quality thresholds. For each specified quality threshold, output is produced that reports only on the subset of alignments with quality value greater than or equal to the threshold.
-p/--print
Triggers the generated output to contain a list of all reads from the input SAM file followed by a binary value. Zero indicates that the read is either unmapped or incorrectly mapped and one indicates that the read was mapped (summary subcommand) or mapped correctly (other subcommands).
-h/--help
Prints to standard error the version number, usage description and an overview of the options that can be used to customize the software package.
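
The effect of the -q thresholds can be sketched in a few lines of Python. The function names and the example quality values are hypothetical; the sketch only counts alignments per threshold and does not reproduce the report format of the evaluate command.

```python
def parse_thresholds(value):
    """Parse a comma-separated list such as '0,10,30' into integers between 0 and 255."""
    thresholds = [int(item) for item in value.split(",")]
    for threshold in thresholds:
        if not 0 <= threshold <= 255:
            raise ValueError("quality thresholds must lie between 0 and 255")
    return thresholds


def counts_per_threshold(qualities, thresholds):
    """For each threshold, count the alignments with quality >= that threshold."""
    return {t: sum(1 for q in qualities if q >= t) for t in thresholds}


# Hypothetical mapping qualities of five alignments.
qualities = [0, 3, 10, 30, 60]
print(counts_per_threshold(qualities, parse_thresholds("0,10,30")))
```

Each threshold defines its own subset of alignments, so an alignment with a high quality value is counted under every threshold at or below that value.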

The evaluate summary subcommand reports statistics about the number of mapped reads for which the actual mapping locations are unknown.

Options specific to this subcommand:

--reads (int)
Specifies the number of reads given as input to the read mapper that produced the input SAM file (option --input-sam). This number can differ from the number of reads contained in the input SAM file if, for example, unmapped reads are not reported.
--paired
By default, the input SAM file (option --input-sam) is assumed to contain single-end reads. If the --paired option is set, it is assumed to contain paired-end reads. The summary for paired-end reads contains information on the number of reads mapped as paired and unpaired, as indicated by the flag field of the SAM format.

The evaluate sam subcommand is used to evaluate the accuracy for sequencing reads generated by the Mason simulator and other read simulators that produce a reference SAM file containing alignments for the simulated reads.

Options specific to this subcommand:

-r/--reference (file)
Specifies the location of a file that contains the reference genome in multi-fasta format.
-w/--window (comma-separated list of ints, 50)
The values in the list represent window sizes around the position in the reference genome from which the simulated read was extracted. An alignment is considered to be mapped correctly if it is mapped within a given window around the simulated position. Output is generated for each individual value.
--input-edit (string, NM)
Specifies the field of the SAM format that contains the edit distance of the alignments in the input SAM file (option --input-sam). If no such field exists, the edit distance is computed from the CIGAR string, the read sequence and the reference genome. An alignment that is not mapped within a certain window around the simulated position (option -w) is considered plausibly mapped if its edit distance is less than the edit distance of the alignment taken from the reference SAM file (option --reference-sam).
--reference-sam (file)
Specifies the location of a reference SAM file containing alignments of the simulated reads as generated by the Mason simulator. Alignments contained in this file should be sorted by read name, which is easily done using SAMtools.
--reference-edit (string, XE)
Specifies the field of the SAM format that contains the edit distance of the alignments in the reference SAM file (option --reference-sam). The default value is set to XE because this is the field used by the Mason simulator.
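
The interplay between the -w and --input-edit options can be sketched in Python as follows. This is an illustration under stated assumptions, not ALFALFA's code: the names classify and evaluate and the dictionary keys are hypothetical, and "within a window" is assumed to mean that the absolute distance between the mapped and simulated positions is at most the window size.

```python
def classify(aligned_pos, simulated_pos, window, input_edit, reference_edit):
    """Return 'correct', 'plausible' or 'incorrect' for one alignment.

    Assumption: both positions lie on the same reference sequence and the
    window bound is inclusive.
    """
    if abs(aligned_pos - simulated_pos) <= window:
        return "correct"
    if input_edit < reference_edit:   # fewer edits than the simulated alignment
        return "plausible"
    return "incorrect"


def evaluate(alignments, windows):
    """Tally the classifications separately for each window size."""
    results = {}
    for window in windows:
        counts = {"correct": 0, "plausible": 0, "incorrect": 0}
        for aln in alignments:
            counts[classify(window=window, **aln)] += 1
        results[window] = counts
    return results


# Window sizes are passed as a comma-separated list, e.g. "-w 50,100".
windows = [int(w) for w in "50,100".split(",")]

# Hypothetical alignments, all simulated at position 1000.
alignments = [
    {"aligned_pos": 1020, "simulated_pos": 1000, "input_edit": 3, "reference_edit": 3},
    {"aligned_pos": 1080, "simulated_pos": 1000, "input_edit": 2, "reference_edit": 3},
    {"aligned_pos": 5000, "simulated_pos": 1000, "input_edit": 6, "reference_edit": 3},
]
for window, counts in evaluate(alignments, windows).items():
    print(window, counts)
```

The second alignment shows why output is produced per window size: it is only plausibly mapped for a window of 50, but correctly mapped for a window of 100.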

The evaluate wgsim subcommand is used to evaluate the accuracy for reads simulated by wgsim.

Options specific to this subcommand:

-r/--reference (file)
Specifies the location of a file that contains the reference genome in multi-fasta format.
-w/--window (comma-separated list of ints, 50)
The values in the list represent window sizes around the position in the reference genome from which the simulated read was extracted. An alignment is considered to be mapped correctly if it is mapped within a given window around the simulated position. Output is generated for each individual value.
--input-edit (string, NM)
Specifies the field of the SAM format that contains the edit distance of the alignments in the input SAM file (option --input-sam). If no such field exists, the edit distance is computed from the CIGAR string, the read sequence and the reference genome. An alignment that is not mapped within a certain window around the simulated position (option -w) is considered plausibly mapped if its edit distance is less than the edit distance of the alignment taken from the reference SAM file (option --reference-sam).
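
When the edit distance field is missing, the distance has to be derived from the CIGAR string, the read sequence and the reference genome. The Python sketch below shows one way this can be done; it is not necessarily how ALFALFA computes the value. The function name edit_distance and the example sequences are hypothetical, and the sketch assumes that clipped bases and skipped regions do not contribute to the distance.

```python
import re

# One CIGAR operation: a length followed by an operation character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def edit_distance(cigar, read, reference, ref_start):
    """Count mismatches, inserted bases and deleted bases of an alignment.

    ref_start is the 0-based reference position of the first aligned base.
    """
    distance = 0
    read_pos, ref_pos = 0, ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=X":      # aligned columns: compare base by base
            for k in range(length):
                if read[read_pos + k].upper() != reference[ref_pos + k].upper():
                    distance += 1
            read_pos += length
            ref_pos += length
        elif op == "I":      # bases present in the read only
            distance += length
            read_pos += length
        elif op == "D":      # bases present in the reference only
            distance += length
            ref_pos += length
        elif op == "S":      # soft clip: consumes read bases, not counted
            read_pos += length
        elif op == "N":      # skipped region: consumes reference, not counted
            ref_pos += length
        # H and P consume neither sequence
    return distance


reference = "ACGTACGTAC"
# One mismatch (A instead of T), one inserted T and one deleted T.
print(edit_distance("4M1I3M1D2M", "ACGATACGAC", reference, 0))
```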

Example

To get started quickly, you can use the small example files contained in the repository and the commands given on the example page.

Change Log

v0.8.1 (November 19, 2014)
1. Bugfixes.
2. Improved multithreading performance.
v0.8 (July 22, 2014)
1. Added and changed command line options, changed default values of some command line parameters.
2. Bugfixes.
3. Removed some warnings during compilation using Intel compilers.
4. Added option to use full dynamic programming.
5. Changed global/local behavior.
6. Added 3 more paired-end alignment methods and improved existing ones.
7. Added more rescue procedures.
8. Improved chaining again.
9. Changed debug information printing.
v0.7.4 (April 23, 2014)
1. Major changes in the algorithm.
2. Index now includes reference.
3. Incorporation of klib library for I/O and dynamic programming.
4. Added new methods for seed-finding.
5. Improved seed-finding.
6. Improved candidate region identification and selection.
7. Changed candidate region extension criterion.
8. Greatly improved chaining.
9. Allow multiple chains per candidate region.
10. Major refactoring of code.
v0.7.3 (August 13, 2013)
1. Improved command line printing method.
2. A few bug fixes.
3. Fixed a few typos in the manual.
v0.7.2 (June 13, 2013)
1. Improved online help.
2. Reordered source code in the repository.
v0.7.1 (May 23, 2013)
1. Major refactoring of the command line anatomy of ALFALFA.
2. Changed default values of command line options based on benchmark analysis.
3. Removal of deprecated code.
4. Improved implementation of --paired-end mode 4.
v0.7.0 (Apr 19, 2013)
1. Major refactoring of source code.
2. Fixed some minor issues.
3. Added more verbose output for debugging purposes.
4. Enabled calculation of the edit distance of alignments using reference genome and CIGAR string.
5. Added new option --local to support local alignment.
6. Improved calculation of SMEMs.
v0.6.4 (Mar 04, 2013)
1. Integrated essaMEM features to speed up seed-finding.
2. Introduced a 10-mer array to speed up seed-finding.
3. Fixed some minor issues.
4. Major refactoring of source code and removal of deprecated code.
v0.6.3 (Feb 21, 2013)
1. Major refactoring of source code.
2. Added new option -v/--verbose to turn on lots of progress reporting about the alignment process for debugging purposes.
Older versions
- v0.6.2 (Jan 23, 2013)
  1. Fixed issue introduced in v0.6.1.
  2. Added wgsim subcommand to evaluate command.
- v0.6.1 (Jan 14, 2013)
  1. Automatically remove artifacts that indicate mate origin as suffixes of the name of paired-end reads (/0/1/2).
  2. Fixed issue with evaluate command.
- v0.6.0 (Nov 16, 2012)
  1. Added framework for storing the index of the reference genome to disk.
  2. Fixed issue with evaluate command.
- v0.5.4 (Oct 09, 2012)
  1. Added new command evaluate to check mapping accuracy of simulated reads.
- v0.5.3 (Sep 27, 2012)
  1. Fixed issue with paired-end read mapping.
  2. Fixed a memory leak introduced in v0.5.2.
  3. Reduced memory footprint of enhanced sparse suffix arrays.
- v0.5.2 (Sep 24, 2012)
  1. Fixed issue with paired-end read mapping.
  2. Reduced memory footprint of dynamic programming matrix.
- v0.5.1 (Sep 13, 2012)
  1. Fixed issue with paired-end read mapping.
  2. Fixed issues with reference genomes larger than 2Gbp.
- v0.5.0 (Sep 03, 2012)
  1. Major refactoring of source code.
  2. Fixed many issues.
  3. Added new behavior for single-end read alignment that finds seeds for both strands first, followed by prioritized extension of candidate regions, instead of aligning both strands separately.
  4. Refactored command line options.
  5. Changed default values of command line options based on benchmark analysis.
  6. Added support for paired-end read mapping.
  7. Added command line options for paired-end read mapping.
  8. Added multiple candidate algorithms for paired-end read mapping.
  9. Updated banded dynamic programming with non-symmetrical band.
- v0.3.5 (Jul 03, 2012)
  1. Optimized performance by semi-static allocation of dynamic programming matrices.
  2. Implemented tracing the alignment from a dynamic programming matrix without requiring additional storage.
- v0.3.4 (Jun 29, 2012)
  1. Changed from regular dynamic programming to banded dynamic programming.
  2. Added score matrix for dynamic programming (internal).
- v0.3.3 (Jun 26, 2012)
  1. Introduced enhanced sparse suffix arrays with sparse child array as the index structure underpinning the seed-finding process.
- v0.3.2 (Jun 08, 2012)
  1. Added multithreading support.
  2. Minimal refactoring of source code.
- v0.3.1 (May 31, 2012)
  1. Optimized dynamic programming in case no affine gap penalties are used.
  2. Changed default values of command line options based on benchmark analysis.
- v0.3.0 (May 31, 2012)
  1. Initial release of the read mapper ALFALFA on GitHub.
- v0.2.0 (March 19, 2012)
  1. Allowed multiple alignments to be returned per read. Added new option -a/--alignments to specify the maximum number of alignments reported per read.
  2. Added algorithm to select multiple candidate regions.
  3. Added some heuristics to speed up mapping time.
- v0.1.0 (March 15, 2012)
  1. Started using MEMs and SMEMs as seeds, instead of MAMs.
  2. Added new option -m/--max-seeds to limit number of seeds that will be selected per starting position in the read sequence.
  3. Fixed issue with local alignment.
  4. Fixed issue with mapping quality calculation.
  5. Fixed issue with CIGAR string calculation.
  6. Dynamic programming at boundaries now takes into account current edit distance of alignment in the gaps between seeds.
- v0.0.0 (March 01, 2012)
  1. Initial release of the read mapper ALFALFA.

Links

essaMEM project

Contact

Contact lead developer Michaël Vyverman (michael[dot]vyverman[at]ugent[dot]be) if you have any further questions or suggestions.