http://csbi.mit.edu/icsb-2005/program/program.htm
Note to other conference organizers: that’s the way it should be; please put your talks online after the conference too. Think of it as Open Access for conference materials.
The article is rather low on details, but it is always interesting to see the NY Times picking up this kind of topic. Here are some ideas in the article that I like:
- Drew Endy and colleagues have created BioBricks, the “Registry of Standard Biological Parts” (http://parts.mit.edu/). The article also mentions ideas for a programming language/environment for creating synthetic biology circuits, in which a programmer would design and simulate the behaviour of an artificial circuit before sending it to a “printer” that actually synthesizes it. The whole idea is obviously pretty cool: regular biologists, not only synthetic biologists, would benefit from such software. Note that software for programming and designing (and, to some extent, simulating) biological networks already exists (e.g. Little b, http://www.littleb.org/, a LISP-based language, and the graphical environment CellDesigner, http://celldesigner.org/).
- Jay Keasling is using genetically modified E. coli to produce artemisinin, a pretty effective malaria drug. To do so, he is trying to have E. coli express 12 genes from the wormwood tree and yeast (what the yeast genes are doing here is unclear; maybe the wormwood equivalents have introns or are toxic to E. coli). That’s pretty exciting, but it seems to me that a significant hurdle in synthetic biology will be unforeseen crosstalk between the foreign genes and the cell’s core machinery.
- Codon Devices works on technologies to synthesize long stretches of DNA (George Church actually talked about that technology last year at Princeton). Anybody who has tried to make a relatively complicated construct will understand that dramatically improved DNA synthesis will lead to a true revolution in molecular biology.
Finally, the article also mentions that “Christina D. Smolke [..] is trying to develop circuits of biological parts to sit in the body’s cells and guards against cancer”. This research may sound great to non-scientists; however, it is tempting to suspect that this kind of ‘nanobots roaming the human body, in a search-and-destroy mission against cancer cells’ scenario will not go anywhere anytime soon. Nonetheless, I like some of her previous, less ambitious research on differential expression of bacterial operon members using RNase E sites and 3′UTR/5′UTR stabilizing hairpins.
The article is available at:
http://www.nytimes.com/2006/01/17/science/17synt.html?pagewanted=all
Find the GSE number corresponding to the dataset you wish to download (from the paper, or by browsing the site), and go to the corresponding GSE page:
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE3069
Select Scope: samples, Format: SOFT, Amount: Full, click “GO” (download starts, saving a default GSE3069.txt file)
Run the following Perl script on the downloaded file (perl GEO_to_txt.pl < GSE3069.txt):
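use strict;
use warnings;

my $out;    # handle for the current sample's output file
while (my $l = <STDIN>) {
    # A "!Sample_title" line starts a new sample: open a file named after it
    if ($l =~ /^!Sample_title = (.+?)\s*$/) {
        my $file = $1;
        $file =~ s/[,'"]//g;    # drop commas and quotes
        $file =~ s/ /_/g;       # replace spaces with underscores
        close $out if $out;
        open $out, '>', $file or die "Cannot open $file: $!";
    }
    # Data rows are the lines that do not start with !, ^ or #
    if ($out && $l !~ /^[!^#]/) {
        print $out $l;
    }
}
close $out if $out;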
Our latest paper, Revealing Posttranscriptional Regulatory Elements Through Network-Level Conservation, is available online.
The paper is an interesting follow-up to another paper we published earlier this year in Genome Biology: Fast and systematic genome-wide discovery of conserved regulatory elements using a non-alignment based approach.
BLAST is certainly one of the most useful tools in modern biology. Since I am often asked how to use BLAST, I figured I would write a short tutorial. Running a BLAST search takes 3 steps: (1) formatting the database, (2) running the blast, (3) analyzing the results.
Step (1):
formatdb -i database.seq -p F -o T
-i specifies the database file, with all your sequences in FASTA format. -p specifies the type of sequences: nucleotide (-p F) or protein (-p T). -o specifies whether you want to create an index on the sequence names/identifiers, which is useful if you later want to retrieve some sequences from the database. What does this program do exactly? It creates an index from your sequences, so that they can be accessed and compared very efficiently at the next step.
Step (2):
Assuming that both query and database sequences are nucleotide sequences, the minimum command line for step (2) is the following:
blastall -p blastn -i query.seq -d database.seq
-p defines the program used to do the sequence comparison: -p blastn does a nucleotide/nucleotide comparison; -p tblastn compares a protein query against a nucleotide database, translating the database sequences in all six reading frames.
blastall has a large number of other options. The ones I use most often are:
-e : defines an e-value threshold, e.g. -e 1e-10
-W : word size. BLAST uses seeds, i.e. short regions of similarity that it extends into HSPs (High-scoring Segment Pairs). The default is -W 11 for blastn, -W 3 for protein BLAST.
-G : gap opening penalty
-E : gap extension penalty
-q : penalty for a nucleotide mismatch
-r : reward for a nucleotide match
-M : allows you to modify match/mismatch rewards/penalties for protein BLAST by changing the matrix (the default is BLOSUM62; on Linux, do a `locate BLOSUM62` to find the file that contains it).
-a : specifies the number of processors to use. This option is quite useful on modern multi-processor computers.
It is also possible to get XML output using:
-m 7 : this is useful if you want to parse the results, from a Perl script for example (the XML::DOM Perl module is great for that).
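If you prefer Python to Perl for the parsing, here is a minimal sketch using the standard-library xml.etree.ElementTree module. It assumes the element names used in NCBI BLAST XML reports (Hit, Hit_def, Hsp, Hsp_evalue, Hsp_bit-score); the function name parse_blast_xml and the toy report are made up for illustration, so check them against your own output before relying on this.

```python
import xml.etree.ElementTree as ET

def parse_blast_xml(xml_text):
    """Return one dict per HSP found in a BLAST XML report (-m 7)."""
    root = ET.fromstring(xml_text)
    hsps = []
    for hit in root.iter("Hit"):
        hit_def = hit.findtext("Hit_def")
        for hsp in hit.iter("Hsp"):
            hsps.append({
                "hit": hit_def,
                "evalue": float(hsp.findtext("Hsp_evalue")),
                "bit_score": float(hsp.findtext("Hsp_bit-score")),
            })
    return hsps

# Hand-written toy report (hypothetical data), trimmed to the tags used above.
SAMPLE = """<BlastOutput><BlastOutput_iterations><Iteration><Iteration_hits>
<Hit><Hit_def>geneA</Hit_def><Hit_hsps>
<Hsp><Hsp_bit-score>52.8</Hsp_bit-score><Hsp_evalue>1e-12</Hsp_evalue></Hsp>
</Hit_hsps></Hit>
</Iteration_hits></Iteration></BlastOutput_iterations></BlastOutput>"""

for hsp in parse_blast_xml(SAMPLE):
    print(hsp["hit"], hsp["evalue"], hsp["bit_score"])
```

For a real report saved to a file, ET.parse(filename).getroot() can replace ET.fromstring.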
Step (3):
The output of blastall consists of one or several hits, and each hit consists of one or several HSPs (High-scoring Segment Pairs). An HSP is an alignment between a fragment of the query sequence and a fragment of a sequence in the database (the hit). Each HSP has a score and an e-value. The score is the sum of the rewards and penalties over all positions in the HSP. The e-value is the number of HSPs with the same or a higher score that you would expect to find by chance in a database of the same size. What e-value threshold should you use? It depends on your particular problem. If you are running thousands of BLAST searches and trying to reach a statistical conclusion, use a stringent threshold (e less than 1e-10). If you are looking for something more specific, possibly a short region of similarity, you can go up to e less than 0.1.
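To get a feel for how the e-value depends on the score and on the size of the search space, here is a small sketch based on the usual relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total database length. The function name expected_hits is made up, and the sketch ignores the length corrections that BLAST applies, so the numbers will not match a real report exactly.

```python
def expected_hits(bit_score, query_len, db_len):
    """Approximate e-value: E = m * n * 2**(-S'), S' being the bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)

# Same alignment (bit score 40, 500-nt query) against a small and a large database.
small = expected_hits(40, 500, 1_000_000)
large = expected_hits(40, 500, 1_000_000_000)
print(small, large)
```

The same HSP is a thousand times less significant in a database a thousand times larger, which is one reason a fixed threshold does not suit every problem.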
BLAST is often used to find the orthologs of a given gene. Here is the way most people do it (the reciprocal best hit approach). Use protein sequences if possible. Assume your reference genome contains X genes and the other genome contains Y genes. Use the protein sequence of your gene as the query, and BLAST it against all Y candidate proteins of the other genome, with e less than 1e-10. Retain only the best hit. Optionally, you can make sure that the length of the matching region is at least 2/3 of the length of the longer of the two sequences (query and hit). Then take the entire hit sequence and BLAST it back against the X proteins of the reference genome. If the best hit is the original query sequence, you have an ortholog!
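The procedure above can be sketched in a few lines of Python once the BLAST results are reduced to (query, hit, e-value) triples. The function names and the toy data are hypothetical, and the optional 2/3 length filter is left out to keep the sketch short.

```python
def best_hits(rows, max_evalue=1e-10):
    """Map each query to its lowest e-value hit, keeping only e < max_evalue.

    rows: iterable of (query_id, hit_id, evalue) tuples.
    """
    best = {}
    for query, hit, evalue in rows:
        if evalue >= max_evalue:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (hit, evalue)
    return {query: hit for query, (hit, _) in best.items()}

def reciprocal_best_hits(forward_rows, backward_rows, max_evalue=1e-10):
    """Return (gene, ortholog) pairs that are each other's best hit."""
    forward = best_hits(forward_rows, max_evalue)
    backward = best_hits(backward_rows, max_evalue)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)

# Hypothetical results: reference genome vs other genome, and back.
forward = [("x1", "y1", 1e-50), ("x1", "y2", 1e-20), ("x2", "y2", 1e-30)]
backward = [("y1", "x1", 1e-48), ("y2", "x1", 1e-25), ("y2", "x2", 1e-22)]
print(reciprocal_best_hits(forward, backward))  # [('x1', 'y1')]
```

Here x2 finds y2 as its best hit, but y2 prefers x1 on the way back, so only x1/y1 is reported as an ortholog pair.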
That’s all for now. Possible follow-ups for this article: genome annotation using BLAST, details of the algorithm, whys and hows of repeat masking, PSI-BLAST, BLAT and many others.
To improve the speed of one of my programs, I tried to use a RAMDISK to store a relatively big file. It turns out to be pretty simple.
- log in as root.
- mkdir /tmp/ramdisk0
- /sbin/mke2fs /dev/ram0
Here is the output of this command:
mke2fs 1.32 (09-Nov-2002)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
2048 inodes, 8192 blocks
409 blocks (4.99%) reserved for the super user
First data block=1
1 block group
8192 blocks per group, 8192 fragments per group
2048 inodes per group
Writing inode tables: done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 24 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
then:
- mount /dev/ram0 /tmp/ramdisk0
That’s it. You can now copy files to /tmp/ramdisk0 (bear in mind that they will be erased when you reboot). The only major problem is that, as far as I know, you cannot change the size of the RAMDISK without modifying a configuration file and rebooting. The default size is 8192 blocks, with 1 block = 1 KB (so 8 MB in total).
To change it, you have to modify the following line in /etc/grub.conf:
kernel /vmlinuz-2.4.21-4.ELsmp ro root=LABEL=/ hdc=ide-scsi apic
to
kernel /vmlinuz-2.4.21-4.ELsmp ro root=LABEL=/ hdc=ide-scsi apic ramdisk_size=16000
Then reboot.
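If you need to make this change on several machines, the edit to the kernel line can be sketched as a small string manipulation. The helper name with_ramdisk_size is made up, the size is given in KB as in the example above, and the sketch only rewrites a single line passed to it (collapsing its whitespace); it does not touch /etc/grub.conf itself.

```python
def with_ramdisk_size(kernel_line, size_kb):
    """Return the kernel line with ramdisk_size=<size_kb> set (size in KB)."""
    # Drop any existing ramdisk_size option so it is not specified twice.
    words = [w for w in kernel_line.split() if not w.startswith("ramdisk_size=")]
    words.append(f"ramdisk_size={size_kb}")
    return " ".join(words)

line = "kernel /vmlinuz-2.4.21-4.ELsmp ro root=LABEL=/ hdc=ide-scsi apic"
print(with_ramdisk_size(line, 16000))
```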