Olivier Elemento’s weblog


A closer look at the first PacBio sequence dataset January 3, 2011

Filed under: Deep Sequencing,pacbio — oelemento @ 6:17 pm

The first Pacific Biosciences (PacBio) sequence dataset was published a couple of weeks ago in the New England Journal of Medicine. PacBio, together with a Harvard group, sequenced 5 strains of Vibrio cholerae including two isolates of the strain responsible for the recent cholera outbreak in Haiti. The NEJM main text mentions their short sequencing run times and long read lengths, but offers very few technical details regarding the raw and mapped read data. The supplementary material contains more information and mainly reveals that single-pass sequence accuracy is rather low (~81-83%), and that the sequencing process generates spurious G and C insertions and deletions within reads. So I decided to take a closer look at the data. There’s a lot of hype surrounding their sequencing technology, and I was curious to see what the data is really like.

I am only presenting my re-analysis of their N5 sequence data. N5 is the V. cholerae strain N16961 that was first sequenced in 2000. I downloaded the PacBio raw N5 read data from SRA, at ftp://ftp.ncbi.nlm.nih.gov/sra/Submissions/SRA026/SRA026766/provisional/SRX032454/.

First, some statistics on the raw PacBio N5 data. The N5 dataset consists of 50 runs. Except for one run with an unusually low number of reads (13,673), all runs generated between 47,163 and 48,053 reads. Curiously, about half of the runs had exactly 48,053 reads; several others had exactly 47,927 reads.

I am not sure how to explain this. One possibility is that the PacBio investigators reused the same SMRT cell multiple times, and that each SMRT cell had a given, fixed number of valid ZMWs (zero-mode waveguides). Interestingly, file names for runs with similar numbers of reads shared the same names or identifiers: e.g. all ‘00114’ files (m101117_004455_00114_c000025512530000000115022402181120_s1_p0.fastq) had 48,053 reads, all ‘adrian’ files (m101119_003042_adrian_c000002772559900000100000112311180_s0_p0.fastq) had 47,927 reads, etc. This further supports the possibility that there are several run batches among the 50 runs.

The next thing I looked at was average and maximum read length per run. The average “average read length” across all 50 runs was ~850bp, but this number doesn’t tell the whole story: some runs had much higher average read lengths. In particular, 6 runs had an average read length > 2,300bp (impressive), and they shared the same label (‘richard’, i.e. m101114_000451_richard_c000022402550000000115009002041172_s1_p0.fastq). There was no correlation between the number of reads and the average read length per run.

The maximum read length (bp) per run follows the same trend, with most runs reaching around 5kb, and the 6 runs with high average above showing reads up to 15-25kb.

Altogether, the total amount of sequence generated per run ranged from 6.6Mb to 136.3Mb. The combined amount for the 50 runs was 2.0Gb. Of course, keep in mind that this is the raw, unaligned, unfiltered dataset.

I then combined all runs, and turned to analyzing individual sequences. The PacBio fastq files contain a quality character (c) for each nucleotide in each read. I converted these to error probabilities using the standard Phred+33 encoding, p = 10^[ -(ord(c) - 33) / 10 ], then transformed to p’ = 1 - p, and calculated the average p’ for each read. In this way, the higher the average p’, the better the overall read quality. I first looked at the distribution of average quality scores:
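In code, the per-read average accuracy can be computed from a FASTQ quality string as follows (a minimal sketch assuming standard Phred+33 encoding; `mean_accuracy` is just my helper name):

```python
def mean_accuracy(qual_string):
    """Average per-base accuracy p' = 1 - p for one read, where
    p = 10^(-Q/10) and Q = ord(c) - 33 (standard Phred+33 encoding)."""
    probs_correct = [1.0 - 10 ** (-(ord(c) - 33) / 10.0) for c in qual_string]
    return sum(probs_correct) / len(probs_correct)

# Example: 'I' encodes Q = 40, i.e. p = 1e-4, so p' is ~0.9999
print(round(mean_accuracy("IIII"), 4))
```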

The distribution is clearly bimodal, possibly trimodal. What’s interesting is that the bulk of the reads (the first large mode) have relatively low quality scores. Filtering out reads with p'<0.25 reduces the total number of reads from 2.3M to 329,575. This is not incompatible with, but somewhat higher than, the number of post-filter reads they report in Supplementary Table 1 (252,726), so they must have used a slightly different way to calculate read accuracy. In any case, applying the p'<0.25 filter reduces the total amount of sequence generated by all runs from 2.0Gb to 500Mb. You may wonder why reducing the number of reads 7-fold only reduces the total number of bases 4-fold. Here's why: there is a very strong positive correlation between read length and average quality score (p'). Yes, the correlation is positive (Spearman rho=0.73); that is, long PacBio reads tend to have higher quality than short reads. The trend is readily apparent when I transform the values to ranks and plot ranks vs ranks (using a random sample of 10% of total reads):
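For readers who want to reproduce the correlation, Spearman's rho is just the Pearson correlation of the ranks; here is a self-contained sketch (the read lengths and accuracies below are toy values, not the actual data):

```python
def ranks(values):
    """Rank of each value (1 = smallest); ties get an arbitrary but consistent order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(x, y):
    """Spearman correlation via the no-ties formula:
    rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), d = rank difference per pair."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))

# Toy data: longer reads tend to have higher mean accuracy
lengths = [120, 850, 2300, 400, 5000, 60]
accuracies = [0.20, 0.55, 0.80, 0.35, 0.90, 0.30]
print(round(spearman_rho(lengths, accuracies), 2))
```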

So it looks like PacBio generates many, potentially spurious short reads but their long reads might be more reliable. Obviously, using the average quality score is not perfect; it does not capture potential read sub-sequences with higher quality scores.

So far, the analysis I presented did not use the reference genome. The most important question remains how many of the reads can be mapped to the reference genome. To answer that question, I downloaded the reference N16961 genome sequence from NCBI, accession numbers AE003853.1 and AE003852.1 (V. cholerae has two chromosomes, whose combined length is 4,033,464bp).

Because of the long read lengths and low sequence accuracy, short-read aligners like Bowtie and BWA cannot be used to align PacBio reads to a reference genome. Tools often used for the analysis of 454 reads, such as BLAT, are also not applicable. As far as I know, the only appropriate tool is BLAST (and the NEJM paper mentions using an alignment methodology that seems very similar to the BLAST algorithm).

The default BLAST parameters (I used v2.2.15) are not appropriate either for sequences showing only ~80% similarity to the reference genome. Here I used match reward = 1, mismatch penalty = -1 (default is -3), gap opening penalty = -1, gap extension penalty = -2. These are the lowest penalty/reward values the program would let me set. The reference genome is small (4Mb), so BLAST run times were usually reasonably short (this will become problematic when aligning PacBio reads to the human genome).
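For concreteness, a run might be launched with something like the following, sketched against the legacy `blastall` interface that ships with v2.2.15 (the database and file names are placeholders, and I am not claiming these are the exact flags used for the published analysis):

```python
# Legacy NCBI BLAST (blastall) flag names: -r match reward, -q mismatch
# penalty, -G/-E gap open/extend costs, -e E-value cutoff, -m 8 tabular
# output. Database and file names below are hypothetical placeholders.
params = {
    "-p": "blastn",
    "-d": "n16961_genome",       # formatdb'd reference (both chromosomes)
    "-i": "pacbio_reads.fasta",  # FASTQ converted to FASTA first
    "-r": "1",     # match reward
    "-q": "-1",    # mismatch penalty (default -3)
    "-G": "1",     # gap opening cost (i.e. penalty of -1)
    "-E": "2",     # gap extension cost (i.e. penalty of -2)
    "-e": "0.01",  # E-value threshold
    "-m": "8",     # tabular output, easier to parse downstream
    "-o": "hits.tab",
}
cmd = "blastall " + " ".join(f"{k} {v}" for k, v in params.items())
print(cmd)
```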

If you are not familiar with BLAST: it tries to find regions of reasonably high sequence similarity between reads and the reference genome. It does not try to find a match for the entire read (though it will find such matches if they occur), and it can match different regions of the same read to different places in the reference genome. For each match, it calculates an E-value, which is the expected number of matches with the same or higher score given the read and genome sizes, etc. I used an E-value threshold of 0.01 for this analysis; that is, I want my matches to be relatively unlikely to occur by chance.

For each read, I kept the best-scoring match (if it had E<0.01), unless it couldn't be reliably positioned in the V. cholerae genome (multi-mapper). I threw out lower-scoring read matches that were not compatible with the best-scoring match (because they matched to very different locations in the genome). Compatible read matches were those separated from the best match by a SMRTbell sequence, and read matches whose distance on the read matched the distance in the genome (+/-25%).
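In code, the filtering step looks roughly like this (a sketch with made-up field names; the SMRTbell-separation case and multi-mapper detection are omitted for brevity):

```python
def filter_hits(hits, evalue_max=0.01, tol=0.25):
    """Sketch of the per-read filtering described above. Each hit is a dict
    with keys: score, evalue, read_start, genome_start (hypothetical field
    names). Keeps the best-scoring significant hit, plus secondary hits
    whose spacing on the read matches their spacing in the genome to
    within +/- tol (25%)."""
    significant = [h for h in hits if h["evalue"] < evalue_max]
    if not significant:
        return []
    best = max(significant, key=lambda h: h["score"])
    kept = [best]
    for h in significant:
        if h is best:
            continue
        read_dist = abs(h["read_start"] - best["read_start"])
        genome_dist = abs(h["genome_start"] - best["genome_start"])
        if read_dist and abs(genome_dist - read_dist) <= tol * read_dist:
            kept.append(h)  # consistent spacing -> compatible match
    return kept

# Toy example: best hit, one compatible secondary hit (+4% spacing),
# one incompatible hit landing megabases away.
hits = [
    {"score": 200, "evalue": 1e-30, "read_start": 0, "genome_start": 10000},
    {"score": 150, "evalue": 1e-20, "read_start": 500, "genome_start": 10520},
    {"score": 100, "evalue": 1e-10, "read_start": 800, "genome_start": 3000000},
]
print(len(filter_hits(hits)))
```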

Here is what the results look like:
– out of the 2.3M reads, 574,572 (24%) had a significant match (E<0.01) to the V. cholerae genome.
– effective read lengths, i.e. lengths of read regions that matched the genome (after removing gaps) were reduced but still impressive: 403bp on average, median = 356bp, 95th percentile = 973bp, and maximum = 3,016bp. Altogether, the total amount of mappable sequence was ~267Mb or 5.3Mb/run on average.
– excluding insertions, only 79.4% of the aligned nucleotides matched the reference genome. In other words, calculated as I did, the PacBio error rate at the nucleotide level is 20.6%. This accuracy is similar to the one presented in Supplementary Table 1 of the NEJM paper (82.9%). The small difference is probably mostly due to their analysis having been done post-filtering, on a smaller number of high-quality reads.
– as reported (but with few details), the PacBio sequencing process introduces a lot of indels. Curiously, I found more than twice as many insertions (50/kb on average) as deletions (21/kb) within reads.
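To make the indel accounting concrete, insertion and deletion rates per kb can be computed from the pair of gapped alignment strings that BLAST reports for each hit (a simplified illustration; parsing of the actual BLAST output is omitted):

```python
def indel_rates(query_aln, subject_aln):
    """Insertions/deletions per kb of aligned read, from a pairwise alignment.
    A '-' in the subject (genome) row = an extra base in the read (insertion);
    a '-' in the query (read) row = a missing base in the read (deletion)."""
    insertions = subject_aln.count("-")
    deletions = query_aln.count("-")
    read_bases = len(query_aln) - deletions  # actual read bases in the alignment
    kb = read_bases / 1000.0
    return insertions / kb, deletions / kb

# Toy alignment: one inserted and one deleted base over 8 read bases
ins_rate, del_rate = indel_rates("ACGTA-CGT", "ACG-TACGT")
print(round(ins_rate), round(del_rate))
```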

What about the relationship between the existence of a match to the reference genome and average quality scores? The plot below shows that the majority of reads with low average quality scores do not contain a match to the reference genome.

Using p’>0.24-0.25 as a filter would also maximize the chance that a retained read matches the genome, while minimizing the number of non-matching reads. However, this would leave out many reads that still match the genome. Based on these results, and if BLAST run times are not an issue, I would probably not recommend filtering out reads using quality scores at this stage.

Several questions remain regarding the nature of the artificially introduced indels, erroneous base calls, nucleotide bias, whether errors are systematic or random, etc., but I hope to have addressed some of the questions that many people (myself included) have been asking about PacBio sequencing.

Given these results, how close is PacBio to sequencing a human genome? Not so close. Assuming 5.3Mb/run of useful sequence, it would take >6,400 runs to obtain 10X coverage of the genome. Not even counting prep time, assuming a 30-min run time and 10h-long days, it would take 320 days to get the data (prior to the analysis, which might take longer than the sequencing). While 10X coverage should be enough to detect structural variants (duplications, deletions, translocations, etc.), it would probably yield few reliable single nucleotide variants and indels, due to the very high sequencing error rates. No doubt PacBio is working hard to improve error rates and produce more reads per run (but given these results, I personally doubt that PacBio can achieve its stated goal of sequencing a human genome in 15 mins by 2013). In the meantime, PacBio will probably mostly be used to sequence bacterial and viral genomes, or to quickly sequence DNA fragments obtained using PCR or targeted DNA capture.
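The back-of-envelope arithmetic behind these numbers (the ~3.4Gb figure for genome size is my assumption, chosen to match the >6,400-run estimate above):

```python
# Throughput estimate: runs and days needed for 10X human genome coverage,
# given the mappable yield per run measured in the N5 analysis above.
genome_size = 3.4e9        # bp (assumed human genome size)
coverage = 10              # desired depth
mappable_per_run = 5.3e6   # bp of mappable sequence per run
runs_needed = genome_size * coverage / mappable_per_run

runs_per_day = 10 * 2      # 10 h of sequencing per day at 30 min per run
days = runs_needed / runs_per_day
print(round(runs_needed), round(days))
```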


17 Responses to “A closer look at the first PacBio sequence dataset”

  1. Thanks for posting all this analysis!!!!

    Did you try BWA-SW on these? That would be a possible fast alternative to BLAST. What went wrong when you tried BLAT? It was used extensively for aligning long cDNAs to human genomes. I don’t know if it has a mode to handle circular reference genomes.

    I believe if you join PacBio’s developer network you can get access to their aligner.

    • oelemento Says:

      Hi Keith
      I haven’t tried either BWA-SW or BLAT on these reads; I have used these two programs before to align 454 reads (and settled with BWA-SW for these), but my understanding is that they are geared toward sequences that have >95% identity with the reference genome. This being said, it might just be a matter of choosing the alignment parameters appropriately. I might try them out at some point.

  2. One other thought — you could try taking some of the long, high-quality reads that fail to align to their target genome & BLAST them against the non-redundant nucleotide database. One possibility is that they are a contaminant. Another, of course, is that they are actually the result of an over-confident base caller.

    Looking to see if the high-quality non-matching sequences have long homopolymer runs or overrepresented motifs might be interesting.

    • oelemento Says:

      very good ideas and suggestions, I’ll take a look. I saw a few long and unmappable reads that contained many copies of their 45bp SMRTbell sequences (possibly concatemers), but I still need to quantify how frequent these reads are in their dataset.

  3. Lionel Guy Says:

    Thanks for posting this first (exciting) analysis!
    I was wondering about your p’ quality values. I assume that the formula you used for p returns the probability of errors per base, is that correct? Then the average p’ would be the average probability of not having an error, am I still right?
    Another thought… since there is a correlation between the read length and the quality on one hand, and between the run and and the read length on the other side, could it be that the true correlation is between the run and the quality? In other words, that some runs just generate better and longer reads? And would there be any correlation between that and the SMRT cell used? Just wondering…

  4. aguy Says:

    Great analysis, really good to see public analysis of this new technology. I am puzzled by the per nucleotide error rate of ~20% because of previous PacBio claims that circularization of the template via SMRTbell ligation enables consensus reads from multiple passes on a single molecule (Travers et al NAR 2010 38(15) e159). If the nucleotide error is random then 2-3 reads of the same molecule should lead to a dramatic reduction in the error rate. I found no mention of the use of consensus reads in the NEJM article; Table 1 implies “single-pass accuracy”. Comments?

  5. Look in the supplemental methods at their library prep approach — they used large fragment sizes. The circular consensus method requires much shorter fragments. It is a bit curious they didn’t use it, but perhaps they wanted to show off their long reads.

    Nor did they choose to show off the “strobe sequencing” approach to get long-range sequence information. Again, a bit curious — though perhaps it was seen as nice-to-have and would have required another library prep (and perhaps the informatics aren’t as well worked out)

  6. Guidobot Says:

    Thank you for such an excellent posting.

    I find it interesting that the 45bp SMRTbell sequence is (repeatedly) inserted into the long reads. Are the mutation/INDEL read error rates in these sequences comparable with those for sequence mapping to the genome? Could you say if any single (long) reads appeared to be from concatenated DNA fragments?

  7. If the SMRTbell sequence is followed by the reverse complement of what preceded it, this is the “circular consensus” mode – they are not concatemers but reading the same insert multiple times.

  8. Anuj Tyagi Says:

    Very nice and informative article…. Thanks

  9. skk Says:

    Following your work to get more understanding of this sequencing biz. – I come from the math angle not the bio angle to this. I’m trying to replicate your work – somewhat stuck at the point where you mentioned that you used BLAST to do the alignment of PacBio data to the reference genome. I guess that’s blastn, right? I’ve got hold of the reference cholera strain data, then used formatdb to make a db out of the reference genome ( 2 chromosomes totaling 3.9Mb ). So far so good ( I think ), but reading the specs for blast, it takes fastA files, and the PacBio data is 50 zipped fastQ files, right?

    So I guess you decompressed, concatenated all 50 files, converted fastQ to fastA and then fed it to blastn ?

    What am I missing ?

    If you could post the blastn command line parameters it would help a lot in understanding and replicating this.


  10. Chris Bradburne Says:

    Thanks for posting this analysis! Very informative, particularly given PacBio’s promises. It sounds like BLAST parameters would need to be adjusted on the NCBI side, to allow lower quality matching, or maybe increased Gapped-BLAST parameters. Seems like it would still be hard with a 20% error rate to accurately find non-random sequence homologies.

  11. Vijai Joseph Says:

    Very informative post.
    Thank you

  12. […] is going on here? Are these quality scores really that low for these reads? An earlier report on the Haiti cholera PacBio dataset (with much shorter reads) showed how the shortest reads had […]

  13. […] v) A closer look at Pac Bio sequence dataset. […]

  14. […] v) A closer look at Pac Bio sequence dataset. […]

  15. redaktroll Says:

    Could you send me a pipeline or some info about counting the number of indels in PB data, or maybe even how to get a summary of the indel distribution? I tried to use samtools, but it crashed on long reads, and galaxy accepts only 1 indel per line.
