Olivier Elemento’s weblog

Olivier’s science weblog

ICSB2005 talk videos are available online January 18, 2006

Filed under: Uncategorized — oelemento @ 6:57 am

http://csbi.mit.edu/icsb-2005/program/program.htm

Note to other conference organizers: that’s the way it should be; please put your talks online after the conference too. Think of it as Open Access for conference materials.

 

NY Times article on synthetic biology January 18, 2006

Filed under: Uncategorized — oelemento @ 6:51 am

The article is rather low on details, but it is always interesting to see the NY Times picking up this kind of topic. Here are some ideas in the article that I like:

- Drew Endy and colleagues have created BioBricks, the “Registry of Standard Biological Parts” (http://parts.mit.edu/). The article also mentions ideas for a programming language/environment for creating synthetic biology circuits, where a programmer would be able to design and simulate the behaviour of an artificial circuit before sending it to a “printer” to actually synthesize it. The whole idea is obviously pretty cool: regular biologists, not only synthetic biologists, would benefit from such software. Note that software for programming and designing (and, to some extent, simulating) biological networks already exists (e.g. Little b, http://www.littleb.org/, a LISP-based language, and the graphical environment CellDesigner, http://celldesigner.org/).

- Jay Keasling is using genetically modified E. coli to produce artemisinin, a pretty effective malaria drug. To do so, he is trying to have E. coli express 12 genes from the wormwood tree and yeast (what the yeast genes are doing here is unclear; maybe the wormwood equivalents have introns or are toxic to E. coli). That’s pretty exciting, but it seems to me that a significant hurdle in synthetic biology will be the possibility of unforeseen crosstalk between the foreign genes and the cell’s core machinery.

- Codon Devices works on technologies to synthesize long stretches of DNA (George Church actually talked about that technology last year at Princeton). Anybody who has tried to make a relatively complicated construct will understand that dramatically improved DNA synthesis will lead to a true revolution in molecular biology.

Finally, the article also mentions that “Christina D. Smolke [..] is trying to develop circuits of biological parts to sit in the body’s cells and guards against cancer”. This research may sound great to non-scientists; however, it is tempting to suspect that this kind of ‘nanobots roaming the human body, in a search-and-destroy mission against cancer cells’ will not go anywhere anytime soon. Nonetheless, I like some of her previous, less ambitious research on differential expression of bacterial operon members using RNase E sites and 3′UTR/5′UTR stabilizing hairpins.

The article is available at:

http://www.nytimes.com/2006/01/17/science/17synt.html?pagewanted=all

 

How to efficiently download a dataset from GEO ? January 17, 2006

Filed under: Uncategorized — oelemento @ 3:19 am

Find the GSE number corresponding to the dataset you wish to download (from the paper, or by browsing the site), and go to the corresponding GSE page:
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE3069

Select Scope: samples, Format: SOFT, Amount: Full, click “GO” (download starts, saving a default GSE3069.txt file)

Run the following Perl script on the downloaded file (perl GEO_to_txt.pl < GSE3069.txt). It writes one data file per sample, named after the sample title:

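use strict;
use warnings;

# Split a GEO SOFT file (read from STDIN) into one data table per sample.
my $out;
while (my $l = <STDIN>) {
    # Each sample starts with a title line; use it as the output file name.
    if ($l =~ /^!Sample_title = (.+?)\s*$/) {
        my $file = $1;
        $file =~ s/[,'"\/]//g;   # drop characters that are awkward in file names
        $file =~ s/ /_/g;
        close $out if $out;
        open $out, '>', $file or die "Cannot write $file: $!";
    }
    # Lines starting with !, ^ or # are SOFT metadata; everything else is data.
    if ($out && $l !~ /^[!^#]/) {
        print $out $l;
    }
}
close $out if $out;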
 

New paper in PLoS Computational Biology January 14, 2006

Filed under: Uncategorized — oelemento @ 7:42 am

Our latest paper, Revealing Posttranscriptional Regulatory Elements Through Network-Level Conservation, is available online.

The paper is an interesting follow-up to another paper we published earlier this year in Genome Biology: Fast and systematic genome-wide discovery of conserved regulatory elements using a non-alignment based approach.

 

BLAST tutorial January 7, 2006

Filed under: Uncategorized — oelemento @ 11:40 pm

BLAST is certainly one of the most useful tools in modern biology. Since I am often asked how to use BLAST, I figured I would write a short tutorial. Running a BLAST search takes 3 steps: (1) formatting the database, (2) running the blast, (3) analyzing the results.

Step (1):

formatdb -i database.seq -p F -o T

-i specifies the database file, with all your sequences in FASTA format. -p specifies the type of sequences: nucleotide (-p F) or protein (-p T). -o specifies whether to create an index on the sequence names/identifiers, which is useful if you later want to retrieve sequences from the database. What does this program do exactly? It creates an index from your sequences, so that they can be accessed and compared very efficiently at the next step.

Step (2):

Assuming that both query and database sequences are nucleotide sequences, the minimum command line for step (2) is the following:

blastall -p blastn -i query.seq -d database.seq

-p defines the program used to do the sequence comparison: -p blastn does a nt/nt comparison; -p tblastn does a protein/nucleotide comparison, translating the nt database sequences in all six reading frames.
blastall has a large number of other options. The ones I use most often are:
-e : defines an e-value threshold, e.g. -e 1e-10
-W : word size. BLAST uses seeds, i.e. short regions of similarity that it extends into HSPs (High-scoring Segment Pairs); -W sets the length of these seeds. The default is -W 11 for blastn, -W 3 for protein BLAST.
-G : gap opening penalty
-E : gap extension penalty
-q : penalty for nucleotide mismatch
-r : reward for nt match
-M : allows you to modify match/mismatch rewards/penalties for protein BLAST by changing the matrix (default is BLOSUM62; do a `locate BLOSUM62` to find the file that contains it on Linux).
-a : specifies the number of processors to use. This option is quite useful on modern multi-processor computers.

It is also possible to get XML output using:
-m 7 : this is useful if you want to parse the results, from a Perl script for example (the XML::DOM Perl module is great for that).
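
For those who would rather not write Perl, here is a minimal Python sketch of the same idea: read the -m 7 XML and keep the HSPs under an e-value threshold. The function name, the embedded sample and the threshold are mine, made up for illustration; the element names (Hit_id, Hsp_evalue, Hsp_bit-score, ...) are the ones I expect in the NCBI BlastOutput format, so check them against your own output before relying on this.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample in the shape of blastall -m 7 output (illustrative only).
SAMPLE = """<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_query-def>query1</BlastOutput_query-def>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_id>seqA</Hit_id>
          <Hit_hsps>
            <Hsp><Hsp_bit-score>180.5</Hsp_bit-score><Hsp_evalue>1e-45</Hsp_evalue></Hsp>
            <Hsp><Hsp_bit-score>30.1</Hsp_bit-score><Hsp_evalue>0.5</Hsp_evalue></Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def parse_blast_xml(xml_text, max_evalue=1e-10):
    """Return (query, hit_id, evalue, bit_score) for every HSP with evalue <= max_evalue."""
    root = ET.fromstring(xml_text)
    rows = []
    for iteration in root.iter("Iteration"):
        # Some BLAST versions store the query name per iteration, others only at the top level.
        query = (iteration.findtext("Iteration_query-def")
                 or root.findtext("BlastOutput_query-def", default=""))
        for hit in iteration.iter("Hit"):
            hit_id = hit.findtext("Hit_id")
            for hsp in hit.iter("Hsp"):
                evalue = float(hsp.findtext("Hsp_evalue"))
                if evalue <= max_evalue:
                    bit_score = float(hsp.findtext("Hsp_bit-score"))
                    rows.append((query, hit_id, evalue, bit_score))
    return rows


if __name__ == "__main__":
    for row in parse_blast_xml(SAMPLE):
        print(row)
```

To use it on real output, read the file produced by blastall -m 7 into a string and pass it to parse_blast_xml.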

Step (3):

The output of blastall consists of one or several hits, and each hit consists of one or several HSPs (High-scoring Segment Pairs). An HSP is an alignment between a fragment of the query sequence and a fragment of a sequence (the hit) in the database. Each HSP has a score and an e-value. The score is the sum of penalties/rewards for each position in the HSP. The e-value is the number of HSPs with the same or a higher score that you would expect to find by chance in a database of that size. What e-value threshold should you use? It depends on your particular problem. If you are doing thousands of BLAST searches and trying to reach a statistical conclusion, use a stringent threshold (e less than 1e-10). If you are looking for something more specific, possibly a short region of similarity, you can go up to e less than 0.1.
BLAST is often used to find the orthologs of a given gene. Here is the way most people do it (the reciprocal best hit approach). Use protein sequences if possible. Assume your reference genome contains X genes and the other genome contains Y genes. Use the aa sequence of your gene as query, and BLAST it against all Y candidate ortholog proteins of the other genome, with e less than 1e-10. Retain only the best hit. Optionally, you can make sure that the length of the matching region is at least 2/3 of the length of the longer of the two sequences (query and hit). Then take the entire hit sequence and BLAST it back against the X proteins of the reference genome. If the best hit is the original query sequence, you have an ortholog!
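
The ortholog procedure above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the BLAST results have already been parsed into dictionaries mapping each query to a list of (subject, e-value) pairs, the gene names are invented, and the optional 2/3 length filter is left out to keep it short.

```python
def best_hit(hits, max_evalue=1e-10):
    """Return the subject with the lowest e-value among hits under the threshold, or None."""
    kept = [h for h in hits if h[1] <= max_evalue]
    if not kept:
        return None
    return min(kept, key=lambda h: h[1])[0]


def reciprocal_best_hits(forward, backward, max_evalue=1e-10):
    """forward: {reference gene: hits in the other genome}
    backward: {other-genome gene: hits in the reference genome}
    Returns {reference gene: ortholog} for pairs that are each other's best hit."""
    orthologs = {}
    for gene, hits in forward.items():
        candidate = best_hit(hits, max_evalue)
        if candidate is None:
            continue
        # BLAST the candidate back: keep it only if it points to the original query.
        if best_hit(backward.get(candidate, []), max_evalue) == gene:
            orthologs[gene] = candidate
    return orthologs


# Hypothetical, hand-made BLAST results.
forward = {
    "refA": [("othA", 1e-80), ("othB", 1e-20)],
    "refB": [("othA", 1e-30)],
    "refC": [("othC", 1e-5)],   # too weak: above the e-value threshold
}
backward = {
    "othA": [("refA", 1e-78), ("refB", 1e-28)],
    "othB": [("refA", 1e-18)],
}

if __name__ == "__main__":
    print(reciprocal_best_hits(forward, backward))
```

In this toy example only refA/othA survive: refB's best hit is othA, but othA's best hit back in the reference genome is refA, so refB is not called an ortholog.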

That’s all for now. Possible follow-ups for this article: genome annotation using BLAST, details of the algorithm, whys and hows of repeat masking, PSI-BLAST, BLAT and many others.

 

Setting up a RAMDISK for Linux January 5, 2006

Filed under: Uncategorized — oelemento @ 11:39 pm

To improve the speed of one of my programs, I tried using a RAMDISK to store a relatively big file. It turns out to be pretty simple.

- log in as root.
- mkdir /tmp/ramdisk0
- /sbin/mke2fs /dev/ram0

Here is the output of this command:

mke2fs 1.32 (09-Nov-2002)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
2048 inodes, 8192 blocks
409 blocks (4.99%) reserved for the super user
First data block=1
1 block group
8192 blocks per group, 8192 fragments per group
2048 inodes per group

Writing inode tables: done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 24 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.

then:
- mount /dev/ram0 /tmp/ramdisk0

That’s it. You can now copy files to /tmp/ramdisk0 (bear in mind that they will be erased when you reboot). The only major problem is that, as far as I know, you cannot change the size of the RAMDISK without modifying some configuration files and rebooting. The default size is 8192 blocks, with 1 block = 1 KB, i.e. 8 MB.

To change it, you have to modify the following line in /etc/grub.conf

kernel /vmlinuz-2.4.21-4.ELsmp ro root=LABEL=/ hdc=ide-scsi apic
to
kernel /vmlinuz-2.4.21-4.ELsmp ro root=LABEL=/ hdc=ide-scsi apic ramdisk_size=16000

then reboot. The ramdisk_size value is given in KB, so 16000 is roughly 16 MB.
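
Since the RAMDISK is small and fixed in size, it can be worth checking that a file fits before copying it over. Here is a minimal Python sketch of that check; the function name is mine, and /tmp/ramdisk0 in the usage note is simply the mount point used above.

```python
import os
import shutil


def copy_if_fits(src, ramdisk_dir):
    """Copy src into ramdisk_dir only if that filesystem has enough free space.

    Returns the path of the copy, or raises OSError if the file is too big.
    """
    needed = os.path.getsize(src)
    free = shutil.disk_usage(ramdisk_dir).free
    if needed > free:
        raise OSError(
            f"{src} needs {needed} bytes but only {free} are free in {ramdisk_dir}"
        )
    return shutil.copy(src, ramdisk_dir)
```

For example, copy_if_fits("bigfile.dat", "/tmp/ramdisk0") would either return the path of the copy or fail with a clear message instead of a half-written file (bigfile.dat is a placeholder name).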

 

Article Review: Disease gene discovery through integrative genomics (VK Mootha and colleagues, Annu Rev Genomics Hum Genet, 2005) January 5, 2006

Filed under: Uncategorized — oelemento @ 8:16 pm
This review article describes several studies in which the gene underlying a genetic disease was discovered by combining classical genetics experiments with mining of publicly available genomics datasets. The discovery process is simple. Using a family in which many individuals carry the genetic determinants of the disease, geneticists can sometimes map the disease locus to a restricted but often large chromosomal region (e.g. 2 Mbp), which generally contains dozens of distinct genes. However, scientists can often guess the function of the disease gene from clinical symptoms and/or clinical tests. To infer which gene(s) among the dozens within the mapped region are likely to have that particular function, they use one or several of the following strategies.

In some cases, a large-scale genomics experiment has already identified a large number of genes involved in the function of interest (e.g. proteomics analysis of the mitochondrion). In other cases, they evaluate how related each gene of interest is to other genes of the same function in the same genome, using publicly available genomics datasets. For example, they search their restricted subset for genes whose microarray expression profile (across carefully selected experiments) is similar to that of genes known to have the function. In other cases, and I have to say quite amazingly, they search their subset for genes whose phylogenetic profiles (across the dozen fully sequenced eukaryotic genomes) are similar to those of genes with that function.

Once they have identified a small number of strong candidate genes, they sequence them in individuals from the family, in search of mutations in the coding sequence (frameshift mutations, premature stop codons, mutated splice sites, etc). Note that in some cases, not presented in the paper, the mutation lies in the regulatory region (see recent paper by Arnie Levine in Cell).

In conclusion, the mining of genomics datasets appears to have been quite useful, in at least a few cases, in helping geneticists home in on the gene underlying a genetic disease. Obviously, the diseases presented here are essentially monogenic. Mapping disease loci for polygenic diseases appears to be a difficult task indeed.
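
To make the expression-profile strategy concrete, here is a toy Python sketch that ranks the candidate genes of a mapped region by how well their expression profiles correlate with those of genes known to have the function of interest. Everything in it is an assumption for illustration: the gene names and numbers are invented, and Pearson correlation averaged over the known genes is just one plausible similarity measure, not necessarily what the studies in the review used.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def rank_candidates(candidates, known):
    """Rank candidate genes by their mean correlation with the known genes."""
    scores = {
        gene: mean(pearson(profile, ref) for ref in known.values())
        for gene, profile in candidates.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical expression values across four experiments.
known = {"mitoA": [1, 2, 3, 4], "mitoB": [2, 4, 6, 9]}
candidates = {
    "geneX": [1, 2, 3, 5],   # tracks the known genes closely
    "geneY": [4, 3, 2, 1],   # anti-correlated
    "geneZ": [1, 3, 2, 3],   # loosely correlated
}

if __name__ == "__main__":
    for gene, score in rank_candidates(candidates, known):
        print(gene, round(score, 2))
```

The same ranking scheme would work for phylogenetic profiles by replacing the expression values with presence/absence vectors across genomes, although a different similarity measure might suit binary data better.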