This track depicts RNA-Seq data from human RNA 20-200 nt in length. As part of the ENCODE Consortium, Cold Spring Harbor Laboratory isolated tissues or subcellular compartments from ENCODE cell lines. The overall goal of the ENCODE project is to identify and characterize all functional elements in the sequence of the human genome.
This cloning protocol generates directional libraries that largely correspond to the 5′ ends of mature RNAs. The libraries were sequenced on a Solexa platform for a total of 36, 50 or 76 cycles producing variable length reads. Furthermore, the reads undergo post-processing resulting in trimming of their 3′ ends (Mortazavi et al., 2008).
This track is a multi-view composite track that contains multiple data types (views). For each view, there are multiple subtracks that display individually on the browser. Instructions for configuring multi-view tracks are here. To show only selected subtracks, uncheck the boxes next to the tracks that you wish to hide. Color differences among the views provide a visual cue for distinguishing between the different cell types and compartments.
Metadata for a particular subtrack can be found by clicking the down arrow in the list of subtracks.
This is release 2 (Summer 2011). This release contains both data remapped from hg18 (referred to as Generation 0 data), which was produced by the Hannon lab, and new experiments produced by the Gingeras lab. When both Generation 0 and new data are available, only the new data is displayed; the older data is available for download only. Of the original 11 experiments displayed in release 1, only 2 (prostate and K562 polysome) are still displayed. In addition to improved data, there are several additional cell lines. The Contigs view is available for display and the Gencode Predicted Exons files are available for download.
Small RNA between 20-200 nt was RiboMinus™-treated according to the manufacturer's protocol (Invitrogen). The RNA was treated with Tobacco Alkaline Pyrophosphatase to eliminate any 5′ structures that would inhibit cloning. The RNA was then "A-tailed" using Poly-A Polymerase. A 5′ linker was ligated using T4 RNA ligase, and anchored oligo-dT was used to prime a reverse transcriptase reaction. The libraries were gel purified and clustered for an Illumina GAIIx run. Sequence was obtained from the 5′ end of the inserts for a total of 36 cycles. For a detailed protocol see: Small RNA Cloning Protocol.pdf.
The Illumina reads were initially trimmed to discard any bases following a quality score less than or equal to 20 and converted into FASTA format, thereby discarding quality information for the rest of the pipeline. As a result, the sequence quality scores in the BAM output are all displayed as "40" to indicate no quality information. Because the read lengths may exceed the insert sizes, 3′ adapter sequence can be introduced into the 3′ end of the reads. The 3′ sequencing adapter was removed from the reads using a custom clipper program (available at http://hannonlab.cshl.edu/fastx_toolkit/), which aligned the adapter sequence to the short reads allowing up to 2 mismatches and no indels. Regions that aligned were "clipped" off from the read. Terminal C nucleotides introduced at the 3′ end of the RNA via the cloning procedure were also trimmed. Identical trimmed reads were then collapsed, their counts noted, and the collapsed sequences aligned to the human genome (version hg19, using the female or male build appropriate to the sample in question) using Bowtie (Langmead et al., 2009). Alignment was performed iteratively, allowing 0, 1, or 2 mismatches. We report reads that mapped 20 or fewer times.
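The read processing steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the clipper program or the pipeline actually used: the adapter string, the `min_overlap` parameter, the function names, and the choice to drop the low-quality base itself are all hypothetical.

```python
from collections import Counter

# Placeholder adapter used only for this illustration (assumption).
ADAPTER = "TCGTATGCCGTC"


def quality_trim(seq, quals, min_q=20):
    """Cut the read at the first base whose quality is <= min_q.

    Assumption: the low-quality base is dropped along with all bases after it.
    """
    for i, q in enumerate(quals):
        if q <= min_q:
            return seq[:i]
    return seq


def clip_adapter(read, adapter=ADAPTER, max_mismatches=2, min_overlap=8):
    """Remove a 3' adapter found by ungapped comparison (no indels).

    The adapter is slid along the read from left to right; at the first
    offset where the overlapping bases differ in at most max_mismatches
    positions, everything from that offset onward is clipped.
    min_overlap is a hypothetical guard against spurious short matches.
    """
    for start in range(len(read) - min_overlap + 1):
        overlap = min(len(adapter), len(read) - start)
        window = read[start:start + overlap]
        mismatches = sum(1 for a, b in zip(window, adapter) if a != b)
        if mismatches <= max_mismatches:
            return read[:start]
    return read


def trim_terminal_c(read):
    """Strip C nucleotides from the 3' end of the read."""
    return read.rstrip("C")


def preprocess(records, adapter=ADAPTER):
    """Trim each (sequence, qualities) record, then collapse identical reads."""
    trimmed = []
    for seq, quals in records:
        seq = quality_trim(seq, quals)
        seq = clip_adapter(seq, adapter)
        seq = trim_terminal_c(seq)
        if seq:  # discard reads that were trimmed away entirely
            trimmed.append(seq)
    return Counter(trimmed)


# Toy input: two reads carrying the same insert, one read with a low-quality base.
insert = "AAGGAAGGAAGGAAGA"
records = [
    (insert + ADAPTER, [30] * 28),
    (insert + "CC" + ADAPTER, [30] * 30),
    ("GGAAGGAAGGAAGGAAGGAA", [30] * 10 + [15] + [30] * 9),
]
counts = preprocess(records)
for seq, n in counts.most_common():
    print(seq, n)
```

Collapsing before alignment means each distinct sequence is aligned once, with its count carried alongside, which is the motivation for the collapse step in the text.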
Discrepancies between hg18 and hg19 versions of CSHL small RNA data: The alignment pipeline for the CSHL small RNA data was updated upon the release of the human genome version hg19, resulting in a few noteworthy discrepancies with the hg18 dataset. First, mapping was conducted with the open-source Bowtie algorithm (http://bowtie-bio.sourceforge.net/index.shtml) rather than the custom NexAlign software. As each algorithm uses different strategies to perform alignments, the mapping results may vary even in genomic regions that do not differ between builds. The read processing pipeline also varies slightly, in that we no longer retain information regarding whether adapter sequence was "clipped" off a read.
Data from the Gingeras and Guigo labs were mapped with STAR. For a description of STAR, the source code, and the mapping parameters used, see: http://gingeraslab.cshl.edu/STAR/. Reads mapping 10 or fewer times are reported in the Raw Signal and Alignment files.
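The multi-mapping cutoff can be illustrated as a post-hoc filter over alignment records. The sketch below is an assumption-laden example, not how STAR or Bowtie apply the cutoff internally: the tuple layout `(read_id, chrom, pos, strand)` and the function name `filter_multimappers` are hypothetical.

```python
from collections import defaultdict


def filter_multimappers(alignments, max_hits=10):
    """Keep alignments only for reads that map max_hits times or fewer.

    alignments: iterable of (read_id, chrom, pos, strand) tuples
    (a hypothetical layout chosen for this example).
    """
    by_read = defaultdict(list)
    for aln in alignments:
        by_read[aln[0]].append(aln)
    return [
        aln
        for hits in by_read.values()
        if len(hits) <= max_hits
        for aln in hits
    ]


# Toy input: read "r1" maps twice, read "r2" maps 11 times.
alignments = [("r1", "chr1", 100, "+"), ("r1", "chr5", 2500, "-")]
alignments += [("r2", "chr2", 1000 + 50 * i, "+") for i in range(11)]

kept_10 = filter_multimappers(alignments, max_hits=10)
kept_20 = filter_multimappers(alignments, max_hits=20)
print(len(kept_10), len(kept_20))
```

With a cutoff of 10, only the alignments of `r1` are retained; with a cutoff of 20, as used for the Bowtie-mapped data, both reads are kept.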
Data were verified by comparison of referential data generated from 8 individual sequencing lanes (Illumina technology). The mapped data were also visually inspected to verify that the majority of the reads were mapping to the 5′ ends of annotated small RNA classes.
Gingeras and Guigo laboratories: Carrie A. Davis, Lei-Hoon See, Wei Lin.
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009 Mar; 10:R25.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold BJ. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods. 2008 Jul; 5(7):621-628.
Data users may freely use ENCODE data, but may not, without prior consent, submit publications that use an unpublished ENCODE dataset until nine months following the release of the dataset. This date is listed in the Restricted Until column, above. The full data release policy for ENCODE is available here.