The first Pacific Biosciences (PacBio) sequence dataset was published a couple of weeks ago in the New England Journal of Medicine. PacBio, together with a Harvard group, sequenced 5 strains of Vibrio cholerae, including two isolates of the strain responsible for the recent cholera outbreak in Haiti. The NEJM main text mentions their short sequencing run times and long read lengths, but provides very few technical details about the raw and mapped read data. The supplementary paper contains more information and mainly reveals that single-pass sequence accuracy is rather low (~81-83%), and that the sequencing process generates spurious G and C insertions and deletions within reads. So I decided to take a closer look at the data. There’s a lot of hype surrounding their sequencing technology, and I was curious to see what the data is really like.
I am only presenting my re-analysis of their N5 sequence data. N5 is the V. cholerae strain N16961 that was first sequenced in 2000. I downloaded the PacBio raw N5 read data from SRA, at ftp://ftp.ncbi.nlm.nih.gov/sra/Submissions/SRA026/SRA026766/provisional/SRX032454/.
First, some statistics on the raw PacBio N5 data. The N5 dataset consists of 50 runs. Except for one run with an unusually low number of reads (13,673), all runs generated between 47,163 and 48,053 reads. Curiously, about half of the runs had exactly 48,053 reads; several others had exactly 47,927 reads.
I am not sure how to explain this. One possibility is that the PacBio investigators reused the same SMRT cell multiple times, and that each SMRT cell had a given, fixed number of valid ZMW guides. Interestingly, file names for runs that contained similar numbers of reads shared the same names or identifiers: e.g., all ‘00114’ files (m101117_004455_00114_c000025512530000000115022402181120_s1_p0.fastq) had 48,053 reads, all ‘adrian’ files (m101119_003042_adrian_c000002772559900000100000112311180_s0_p0.fastq) had 47,927 reads, etc. This further supports the possibility that there are several run batches among the 50 runs.
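If you want to check this yourself, the run label can be pulled straight out of the file names. A minimal sketch, assuming (as the examples above suggest) that fields are underscore-separated and the third field is the run label:

```python
from collections import defaultdict

def run_label(filename):
    """Extract the run label (e.g. '00114', 'adrian') from a PacBio fastq
    file name, assuming it is the third underscore-separated field."""
    return filename.split("_")[2]

# Group file names into putative run batches by shared label.
filenames = [
    "m101117_004455_00114_c000025512530000000115022402181120_s1_p0.fastq",
    "m101119_003042_adrian_c000002772559900000100000112311180_s0_p0.fastq",
]
batches = defaultdict(list)
for f in filenames:
    batches[run_label(f)].append(f)
```

Counting reads per batch then makes the "several fixed read counts" pattern obvious.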
The next thing I looked at was the average and maximum read length per run. The mean of the per-run average read lengths across all 50 runs was ~850bp, but this doesn’t tell the whole story: some runs had a much higher average read length. In particular, 6 runs had an average read length > 2,300bp (impressive), and they all shared the same label (‘richard’, i.e. m101114_000451_richard_c000022402550000000115009002041172_s1_p0.fastq). There was no correlation between the number of reads and the average read length per run.
The maximum read length (bp) per run follows the same trend: most runs top out around 5kb, while the 6 high-average ‘richard’ runs above contain reads up to 15-25kb.
Altogether, the total amount of sequence generated per run ranged from 6.6Mb to 136.3Mb. The combined amount for the 50 runs was 2.0Gb. Of course, keep in mind that this is the raw, unaligned, unfiltered dataset.
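The per-run statistics above are straightforward to recompute. A minimal sketch, assuming plain 4-line-per-record fastq files (no wrapped sequence lines), which is how the PacBio files are laid out:

```python
def fastq_stats(lines):
    """Read count, average/max read length, and total bases for one run.

    Assumes a standard 4-line-per-record fastq: header, sequence,
    '+' separator, quality string.
    """
    seqs = [lines[i + 1] for i in range(0, len(lines), 4)]
    lengths = [len(s) for s in seqs]
    return {
        "reads": len(seqs),
        "avg_len": sum(lengths) / len(lengths),
        "max_len": max(lengths),
        "total_bp": sum(lengths),
    }
```

Run over each of the 50 files, this yields the read counts, average/maximum lengths, and per-run yields discussed above.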
I then combined all runs, and turned to analyzing individual sequences. The PacBio fastq files contain a quality character (c) for each nucleotide in each read. I converted each character to a Phred+33 error probability, p = 10^( -(ord(c) - 33) / 10 ), transformed it to a per-base accuracy p’ = 1 - p, and calculated the average p’ for each read. This way, the higher the average p’, the better the overall read quality. I first looked at the distribution of average quality scores:
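In code, the conversion from a read's quality string to its average per-base accuracy looks like this (standard Phred+33 encoding):

```python
def mean_read_accuracy(qual_string):
    """Average per-base accuracy p' = 1 - p over one read, where
    p = 10**(-(ord(c) - 33) / 10) is the Phred+33 error probability
    encoded by quality character c."""
    accs = [1.0 - 10.0 ** (-(ord(c) - 33) / 10.0) for c in qual_string]
    return sum(accs) / len(accs)
```

For example, a read whose bases are all Q40 (character 'I') has an average p' of 0.9999, while a read of all Q0 (character '!') has an average p' of 0.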
The distribution is clearly bimodal, possibly trimodal. What’s interesting is that the bulk of the reads (the first large mode) have relatively low quality scores. Filtering out reads with p'<0.25 reduces the total number of reads from 2.3M to 329,575. This is in the same ballpark as, but somewhat higher than, the number of post-filter reads they report in Supplementary Table 1 (252,726), so they must have used a slightly different way to calculate read accuracy. In any case, the p'<0.25 filter reduces the total amount of sequence generated by all runs from 2.0Gb to 500Mb. You may wonder why reducing the number of reads 7-fold only reduces the total number of bases 4-fold. Here's why: there is a very strong positive correlation between read length and average quality score (p'). Yes, the correlation is positive (Spearman rho=0.73): long PacBio reads tend to have higher quality than short reads. The trend is readily apparent when I transform the values to ranks and plot ranks vs. ranks (using a random 10% sample of the reads):
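For reference, Spearman's rho is just the Pearson correlation of the ranks, so it can be computed without any statistics library. A self-contained sketch (`rankdata` and `spearman` are illustrative helper names, not the code I actually ran):

```python
def rankdata(values):
    """1-based ranks, with ties assigned their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho = Pearson correlation of the rank vectors."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Applied to (read length, average p') pairs, a perfectly monotonic increasing relationship gives rho = 1; here the observed 0.73 indicates a strong but imperfect one.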
So it looks like PacBio generates many potentially spurious short reads, but their long reads might be more reliable. Obviously, using the average quality score is not perfect; it does not capture read sub-sequences that may have higher quality scores.
So far, the analysis I presented did not use the reference genome. The most important question remains how many of the reads can be mapped to the reference genome. To answer that question, I downloaded the reference N16961 genome sequence from NCBI, accession numbers AE003853.1 and AE003852.1 (V. cholerae has two chromosomes, whose combined length is 4,033,464bp).
Because of the long read lengths and low sequence accuracy, short-read aligners like Bowtie and BWA cannot be used to align PacBio reads to a reference genome. Tools often used for analysis of 454 reads, such as BLAT, are also not applicable. As far as I know, the only appropriate tool is BLAST (and the NEJM paper mentions using an alignment methodology that seems very similar to the BLAST algorithm).
The default BLAST parameters (I used v2.2.15) are not appropriate either for sequences showing only ~80% similarity to the reference genome. Here I used match reward = 1, mismatch penalty = -1 (default is -3), gap opening penalty = -1, and gap extension penalty = -2; these are the lowest penalty/reward values the program would let me set. The reference genome is small (4Mb), so BLAST run times were usually reasonably short (this will become problematic when aligning PacBio reads to the human genome).
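For the record, with the legacy NCBI BLAST suite (the blastall interface shipped with v2.2.x) an invocation with this scoring would look roughly like the sketch below. Note the sign convention: -q takes the (negative) mismatch penalty, while -G/-E take positive gap open/extend costs; the database and file names are placeholders, not the ones I used.

```shell
# Build a nucleotide BLAST database from the N16961 reference (placeholder names).
formatdb -i n16961.fasta -p F -n vcholerae

# blastn with permissive scoring for ~80%-identity reads:
#   -r 1  match reward, -q -1 mismatch penalty,
#   -G 1 / -E 2 gap open/extend costs, -e 0.01 E-value cutoff,
#   -m 8  tabular output.
blastall -p blastn -d vcholerae -i pacbio_reads.fasta \
    -r 1 -q -1 -G 1 -E 2 -e 0.01 -m 8 -o n5_hits.tsv
```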
If you are not familiar with BLAST: it tries to find regions of reasonably high sequence similarity between reads and the reference genome. It does not try to find a match for the entire read (though it will find such matches if they occur), and it can match different regions of the same read to different places in the reference genome. For each match, it calculates an E-value, which is the expected number of matches with the same or higher score given the read and genome sizes, etc. I used an E-value threshold of 0.01 for this analysis; that is, I want my matches to be relatively unlikely to occur by chance.
For each read, I kept the best-scoring match (if it had E<0.01), unless it could not be reliably positioned in the V. cholerae genome (multi-mapper). I threw out lower-scoring matches that were not compatible with the best-scoring match (because they mapped to very different locations in the genome). Compatible matches were those separated from the best match by a SMRTbell sequence, and those whose distance from the best match on the read matched the corresponding distance in the genome (+/-25%).
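The distance-compatibility test can be sketched as follows. This is a simplified illustration (it ignores strand and the SMRTbell case); `compatible` is a hypothetical helper, not code from the paper or my actual pipeline:

```python
def compatible(best, other, tol=0.25):
    """Is `other` consistent with the best-scoring match for the same read?

    Matches are (read_start, genome_start) pairs. The distance separating
    the two matches on the read should equal the distance separating them
    in the genome, within +/- tol (25% here).
    """
    read_dist = abs(other[0] - best[0])
    genome_dist = abs(other[1] - best[1])
    if genome_dist == 0:
        return read_dist == 0
    return abs(read_dist - genome_dist) <= tol * genome_dist
```

So a secondary match 500bp away on the read and 450bp away in the genome is kept, while one 4kb away in the genome is discarded.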
Here is what the results look like:
– out of the 2.3M reads, 574,572 (24%) had a significant match (E<0.01) to the V. cholerae genome.
– effective read lengths, i.e. lengths of read regions that matched the genome (after removing gaps) were reduced but still impressive: 403bp on average, median = 356bp, 95th percentile = 973bp, and maximum = 3,016bp. Altogether, the total amount of mappable sequence was ~267Mb or 5.3Mb/run on average.
– excluding insertions, only 79.4% of the aligned nucleotides matched the reference genome. In other words, calculated this way, the PacBio error rate at the nucleotide level is 20.6%. This accuracy is similar to that presented in Supplementary Table 1 of the NEJM paper (82.9%); the small difference is probably mostly due to their analysis having been done post-filtering, on a smaller number of higher-quality reads.
– as reported (but with few details), the PacBio sequencing process introduces a lot of indels. Curiously, I found roughly twice as many insertions (50/kb on average) as deletions (21/kb) within reads.
What about the relationship between the existence of a match to the reference genome and average quality scores? The plot below shows that the majority of reads with low average quality scores do not contain a match to the reference genome.
Using p’>0.24-0.25 as a filter would maximize the chance that a retained read matches the genome, while minimizing the number of retained non-matching reads. However, it would also discard many reads that do match the genome. Based on these results, and if BLAST run times are not an issue, I would probably not recommend filtering out reads using quality scores at this stage.
Several questions remain regarding the nature of the artificially introduced indels, erroneous base calls, nucleotide bias, and whether errors are systematic or random, but I hope to have addressed some of the questions that many people (myself included) have been asking about PacBio sequencing.
Given these results, how close is PacBio to sequencing a human genome? Not that close. Assuming 5.3Mb/run of useful sequence, it would take >6,400 runs to obtain 10X coverage of the genome. Not even counting prep time, and assuming a 30-min run time and 10h-long days, it would take 320 days to get the data (prior to the analysis, which might take longer than the sequencing). While 10X coverage should be enough to detect structural variants (duplications, deletions, translocations, etc.), it would probably yield few reliable single-nucleotide variants and indels, due to the very high sequencing error rate. No doubt PacBio is working hard to improve error rates and produce more reads per run (though, given these results, I personally doubt PacBio can achieve its stated goal of sequencing a human genome in 15 minutes by 2013). In the meantime, PacBio will probably mostly be used to sequence bacterial and viral genomes, or to quickly sequence DNA fragments obtained by PCR or targeted DNA capture.
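The back-of-envelope arithmetic goes like this. Assuming a ~3.2Gb human genome gives roughly 6,000 runs; the exact count depends on the genome size and per-run yield assumed, which is presumably why the figure above comes out slightly higher at >6,400:

```python
import math

def runs_needed(genome_bp, coverage, yield_per_run_bp):
    """Number of runs needed to reach a given coverage at a given
    usable (mappable) yield per run."""
    return math.ceil(genome_bp * coverage / yield_per_run_bp)

def days_needed(runs, run_hours=0.5, hours_per_day=10):
    """Sequencing days, ignoring prep time (30-min runs, 10h days)."""
    return runs * run_hours / hours_per_day

runs = runs_needed(3.2e9, 10, 5.3e6)  # ~6,000 runs with a 3.2Gb genome
days = days_needed(6400)              # 320 days at >6,400 runs
```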