### Overview BWA-PSSM is a probabilistic short read aligner based on the use of position specific scoring matrices (PSSM). Like many of the existing aligners it is fast and sensitive. Unlike most other aligners, however, it also adaptible in the sense that one can direct the alignment based on known biases within the data set. It is coded as a modification of the original [BWA alignment program](http://bio-bwa.sourceforge.net/) and shares the genome index structure as well as many of the command line options. Shown below are common and uncommon usage scenarios. ### Download Get the latest version from the [github repository](https://github.com/pkerpedjiev/bwa-pssm). Note that [version 1.6 of the GDSL library](http://home.gna.org/gdsl) needs to be installed in order for BWA-PSSM to compile. ### Usage PSSM-BWA can be used just like the original BWA aligner. In addition to aligning fastq formatted reads, it has the option of reading raw position specific scoring matrices and performing an alignment by scoring with them. ##### Extra options: **-z [float:f]** This options indicates that the pssm threshold will be the maximum possible score - f * maximum-mismatch-penalty, where maximum-mismatch-penalty is equal to the difference between the match and mismatch score of the highest quality base found in the read. In other words, we allow the equivalent of *f* high quality mismatches in the alignment. **-y [float:f]** Similar to -z, except it uses bwa's default mismatch criteria for creating the mismatch allowance. Suppose bwa's method determines that a read of length l should be allowed n mismatches. Then using -y for such a read will be equivalent to using the option -z (n - f). This can be useful in files containing reads of variable length. ### Examples #### Building an index Exactly like the original bwa aligner. bwa index -a div -p indeces/bwa/short19 /fa/short19.fa The '-a div' option is actually only necessary when trying to create an index for very large genomes (> 4GB). Omitting it on smaller genomes will have no adverse effect on the creation of the index. #### Aligning a fastq file using pssm-based quality aware alignment Much like the original BWA, except instead of the 'aln' command, the 'pssm' command is used: bwa pssm indexes/bwa/short19 test.fastq > out.sai #### More thorough alignment In cases when time is not a limitation, one may be inclined to perform a more thorough alignment. This simply means that more suboptimal partial alignments are tried before the search gives up. The degree to which less likely alignments are tried can be adjusted using the '-m' option. In actuality this sets the size of the interval heap which keeps track of the highest scoring partial alignments currently being considered for further scoring. To do a slower, more thorough search, one would increase the size of this heap. The default value is 400. bwa pssm -m 800 indexes/bwa/short19 test.fastq > out.sai #### Converting the alignment coordinates to a SAM file </h5> Again, just like the original: bwa samse indexes/bwa/short19 test.sai test.fastq > out.sam #### Alignment and conversion to SAM format in one step: This is useful if trying to avoid the use of an intermediate file in the alignment process. bwa pssm indexes/bwa/short19 test.fastq | ./bwa samse \ indexes/bwa/short19 - test.fastq > out.sam #### Converting a FASTQ file to a PSSM manually An example for converting a fastq file to a PSSM is provided in the fastq2wm.pl file. The execution of this file also introduces an error model (as described in the paper below) for 5' C to T damage and 3' G to A damage. To try this conversion script, run the following command: cat damaged.fastq | ./fastq2wm33.pl > damaged.pssm Once this is done, the resulting pssm file can be used to perform an alignment in the regular manner: bwa pssm indexes/bwa/short19 damaged.pssm > out.sai #### Error Models ##### Using an Error Model An error model is a way of specifying a conversion from a fastq file to a PSSM. For each quality value and nucleotide type, a set of PSSM values can be specified. The example below illustrates the concept with an error model for reads that originated from a PAR-CLIP experiment. 39 A 2.00 -12.77 -12.54 -12.34 39 C -12.54 1.77 -12.54 -0.74 39 G -12.54 -12.77 2.00 -12.34 39 T -12.54 -12.77 -12.54 2.00 40 A 2.00 -13.10 -12.87 -12.67 40 C -12.87 1.77 -12.87 -0.74 40 G -12.87 -13.10 2.00 -12.67 40 T -12.87 -13.10 -12.87 2.00 The leftmost two columns indicate the quality score and called base for which this PSSM column applies. The bottommost line, '40 T -12.87', therefore, will be used when a base 'T' with the highest possible quality score is found. So if we take an imaginary read like this: @READ_NAME_XXX ACT +chr3_+_60998148 IIH Then the first base will be replaced by the line '40 A ...', the second with '40 C ...' and the final with the line '39 T ...', yielding the following PSSM: 1 2 3 A 2.00 -12.87 -12.54 C -13.10 1.77 -12.77 G -12.87 -12.87 -12.54 T -12.67 -0.74 2.00 What is evident in this table is that the score for finding a T where the read contained a C is much greater than typically allowed for such a mismatch. This is due to the prevalence of T to C mutations in PAR-CLIP experiments. Thus, when the read contains a C, the bias in the experimental procedure indicates that it is likely that the original genome contained a T. This is not the case when the called base is, for example, 'A' and no experimentally derived bias is observed. To run bwa-pssm with an error model, simply specify the file containing it with the '-G' option: bwa pssm -G error_model.txt indexes/bwa/short19 test.fastq > out.sai ##### Creating an Error Model Creating an error model naturally depends on the types errors that one might expect. Here we will document how the error model for the PAR-CLIP data was created and hope that it serves as an example for other types of data. Reads from PAR-CLIP sequences reflect the tendency to observe the T-to-C transitions caused by the cross-linking experiment. We can denote this probability with the value \\(\gamma\\) and calculate the probability of seeing a particular base in the genome, given the identity of the base in the sample (\\(P(g|a)\\)). The genome bases are the the columns and the sample bases are the rows. | | \\(A_a\\) | \\(C_a\\) | \\(G_a\\) | \\(T_a\\) | |---|---|---|---|---| | \\(A_g\\) | 1 | 0 | 0 | 0 | | \\(C_g\\)| 0 | 1 - \\(\gamma\\) | 0 | 0 | | \\(G_g\\) | 0 | 0 | 1 | 0 | | \\(T_g\\) | 0 | \\(\gamma\\) | 0 | 1 | At the same time, we can create a table describing the probabilities of a seeing a base in the sample given a base in the read (\\(P(a|x)\\)). These probabilities are encoded in the quality scores and describe the probability that a base is correctly called \\(p_c\\) and the probability that a base is incorrectly called \\(p_i=(1 - p_c) / 3\\). This leads to the following probability matrix: | | \\(A_x\\) | \\(C_x\\) | \\(G_x\\) | \\(T_x\\) | |---|---|---|---|---| | \\(A_a\\) | \\(p_c\\) | \\(p_i\\) | \\(p_i\\) | \\(p_i\\) | | \\(C_a\\) | \\(p_i\\) | \\(p_c\\) | \\(p_i\\) | \\(p_i\\) | | \\(G_a\\) | \\(p_i\\) | \\(p_i\\) | \\(p_c\\) | \\(p_i\\) | | \\(T_a\\) | \\(p_i\\) | \\(p_i\\) | \\(p_i\\) | \\(p_c\\) | To get the probability for a base in the genome, given a base in the read (\\(P(g|x)\\)) we can use Equation 3 from the [paper](http://www.biomedcentral.com/1471-2105/15/100). $$P(g|x)=\sum_a{P(g|a)P(a|x)}$$ Multiplying the two probability matrices gives precisely this result, the transpose of which is encoded as the rows of the error model. | | \\(A_x\\) | \\(C_x\\) | \\(G_x\\) | \\(T_x\\) | |---|---|---|---|---| | \\(A_g\\) | \\(p_c\\) | \\(p_i\\) | \\(p_i\\) | \\(p_i\\) | | \\(C_g\\) | \\((1 - \gamma) * p_i\\) | \\((1 - \gamma) * p_c\\) | \\((1 - \gamma) * p_i\\) | \\((1- \gamma) * p_i\\) | | \\(G_g\\) | \\(p_i\\) | \\(p_i\\) | \\(p_c\\) | \\(p_i\\) | | \\(T_g\\) | \\((1 + \gamma) * p_i\\) | \\(\gamma * p_c + p_i\\) | \\((1 + \gamma) * p_i\\) | \\(p_i * \gamma + p_c\\) | A simple script for calculating an error model given a matrix of mutation probabilities is provided in the `scripts` directory of the [github repository](https://github.com/pkerpedjiev/bwa-pssm). The script takes as input an ascii formatted 4x4 matrix of mutation probabilities (PAR-CLIP example below): 1 0 0 0 0 0.8 0 0 0 0 1 0 0 0.2 0 1 And calculates the error model using the following command: python scripts/error_model.py scripts/example_inputs.csv #### PSSM as input One can also pass in the a raw PSSM as input to BWA-PSSM. The expected format is similar to the fastq format except in place of the sequence, the pssm is entered. Taking the PSSM generated in the previous example, the input would simply be: @READ_NAME_XXX ACT +chr3_+_60998148 2.00 -12.87 -12.54 -13.10 1.77 -12.77 -12.87 -12.87 -12.54 -12.67 -0.74 2.00 The sequence present in the second line of the prior example is completely irrelevant since all of the alignment information is derived from the PSSM itself. A sequence which is used for determining the number of mismatches (i.e. for limiting the seed search) can be inferred from the maximum scores of the PSSM. Alignment is done exactly as with a normal fastq file: bwa pssm indexes/bwa/short19 test.pssm > out.sai #### Read Matching BWA-PSSM provides some extra options on top of the original BWA to provide for finer control over how the PSSM is scored over the genome. By default, BWA-PSSM takes the number of mismatches, m, that would be allowed by BWA and calculates the greatest total match and the least total mismatch score for a sequence of length m. The difference between the two is subtracted from the greatest possible PSSM score for the query sequence to yield a threshold for that sequence. Any regions of the genome that score greater than this threshold are considered hits and any below that threshold are discarded. As an example, assume we have a query of length 20 for which the greatest match score is 100. If we allow three mismatches, then the greatest match score for a sequence of length 3 could be 15 and the lowest mismatch score could be -30. This implies that any regions of the genome which score greater than 55 (100 - (15 - 30)) would be considered hits. As the criteria for a match is now dependent on the PSSM score, the number of actual mismatches is fixed at 30. If one wants to maintain some fixed limit on the number of mismatches and have a tighter limit on the PSSM score, then the `-z [float]` option can be used. This option tells BWA-PSSM to use a threshold equivalent to [float] high quality mismatches, while retaining the number of mismatches that BWA allows (which can be changed using the `-n` option). ### How to Cite If you find this software useful in your research, please cite the following article: Kerpedjiev, P., Frellsen, J., Lindgreen, S., & Krogh, A. (2014). [Adaptable probabilistic mapping of short reads using position specific scoring matrices](http://dx.doi.org/10.1186/1471-2105-15-100). *BMC bioinformatics, 15*(1), 100. ### Contact If you have any questions or issues, please don't hesitate to contact us at <bwa-pssm@binf.ku.dk>.