GENCODE: The reference human genome annotation for The ENCODE Project


The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation.

Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased.

The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq.

It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available with the predominant transcript form consisting of two exons.

We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites.

Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas.

New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two exon models indicating that they are possibly unannotated long noncoding loci.

GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.

Launched in September 2003, the Encyclopedia of DNA Elements (The ENCODE Project Consortium 2011) project brought together an international group of scientists tasked with identifying all functional elements in the human genome sequence.

Initially focusing on 1% of the genome (The ENCODE Project Consortium 2007), the pilot project was expanded to the whole genome in 2007.

As part of the initiative, the GENCODE collaboration was established whose aim was to annotate all evidence-based gene features on the human genome at high accuracy, again initially focusing on the 1% (Harrow et al. 2006).

The process of creating this gene annotation involves manual curation, different computational analyses, and targeted experimental approaches.

Eight groups in Europe and the United States directly contribute data to this project, with numerous additional sources of evidence also used for the annotation. Figure 1 shows how the different elements of the GENCODE Consortium interact together.

The ability to sequence genomes has far exceeded the techniques for deciphering the information they encode.

Selecting the correct reference gene annotation for a particular project is extremely important for any downstream analysis such as conservation, variation, and assessing functionality of a sequence.

The type of gene annotation applied to a particular genome is dependent on its quality; therefore, next-generation sequencing assemblies (Metzker 2010) have had automatic gene annotation applied to them, whereas high-quality finished genomes such as the human (International Human Genome Sequencing Consortium 2004), mouse (Church et al. 2009), and zebrafish (Becker and Rinkwitz 2011) have manual annotation projects associated with them.

Publicly available gene sets such as RefSeq (Pruitt et al. 2012), AceView (Thierry-Mieg and Thierry-Mieg 2006), and GENCODE are generated by a combination of manual and automatic annotation and have developed different methods to optimize their annotation criteria.

For example, RefSeq annotates cDNAs rather than genomic sequence to optimize full-length gene annotation and is thus able to ignore sequencing errors in the genome.

This publication will describe the generation of the GENCODE gene set, its strengths over other publicly available human reference annotation, and the reasons it has been adopted by the ENCODE Consortium (The ENCODE Project Consortium 2011), The 1000 Genomes Project Consortium (2010), and The International Cancer Genome Consortium (2010) as their reference gene annotation.

Figure 1. The GENCODE pipeline. This schematic diagram shows the flow of data between the groups of the GENCODE Consortium. Manual annotation is central to the process but relies on specialized prediction pipelines to provide hints to first-pass annotation and quality control (QC) for completed annotation. Automated annotation supplements manual annotation, the two being merged to produce the GENCODE data set and also to apply QC to the completed annotation. A subset of annotated gene models is subject to experimental validation. The Annotrack tracking system contains data from all groups and is used to highlight differences, coordinate QC, and track outcomes.

Production of the GENCODE gene set: A merge of manual and automated annotation
The GENCODE reference gene set is a combination of manual gene annotation from the Human and Vertebrate Analysis and Annotation (HAVANA) group (http://www.sanger.ac.uk/research/projects/vertebrategenome/havana/) and automatic gene annotation from Ensembl (Flicek et al. 2011).

It is updated with every Ensembl release (approximately every 3 months). Since manual annotation of the whole human genome is estimated to take until the end of 2012, the GENCODE releases are a combination of manual annotation from HAVANA and automatic annotation from Ensembl to ensure whole-genome coverage.

Manual annotation process
The group’s approach to manual gene annotation is to annotate transcripts aligned to the genome and take the genomic sequences as the reference rather than the cDNAs.

Currently only three vertebrate genomes—human, mouse, and zebrafish—are being fully finished and sequenced to a quality that merits manual annotation.

The finished genomic sequence is analyzed using a modified Ensembl pipeline (Searle et al. 2004), and BLAST results of cDNAs/ESTs and proteins, along with various ab initio predictions, can be analyzed manually in the annotation browser tool Otterlace (http://www.sanger.ac.uk/resources/software/otterlace/).

The advantage of genomic annotation compared with cDNA annotation is that more alternative spliced variants can be predicted, as partial EST evidence and protein evidence can be used, whereas cDNA annotation is limited to availability of full-length transcripts.

Moreover, genomic annotation produces a more comprehensive analysis of pseudogenes. One disadvantage, however, is that if a polymorphism occurs in the reference sequence, a coding transcript cannot be annotated, whereas cDNA annotation, for example, as performed by RefSeq (Pruitt et al. 2012), can select the major haplotypic form because it is not limited by a reference sequence.
Automatic annotation process
Protein-coding genes were annotated automatically using the Ensembl gene annotation pipeline (Flicek et al. 2012). Protein sequences from UniProt (Apweiler et al. 2012) (only “protein existence” levels 1 and 2) were included as input, along with RefSeq sequences.

Untranslated regions (UTRs) were added using cDNA sequences from the EMBL Nucleotide Archive (ENA) (Cochrane et al. 2011). Long intergenic noncoding RNA (lincRNA) genes were annotated using a combination of cDNA sequences and regulatory data from the Ensembl project.

Short noncoding RNAs were annotated using the Ensembl ncRNA pipelines, using data from miRBase (Griffiths-Jones 2010) and Rfam (Gardner et al. 2011) as input.
GENCODE gene merge process
This process of combining the HAVANA and Ensembl annotation is complex. During the merge process, all HAVANA and Ensembl transcript models are compared, first by clustering together transcripts on the same strand which have any overlapping coding exons, and then by pairwise comparisons of each exon in a cluster of transcripts.
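The first step of the comparison, grouping same-strand transcripts that share any overlapping coding exon, can be sketched as follows. This is only an illustration and not the HavanaAdder code: the `Transcript` class, the function names, and the choice of single-linkage grouping (a transcript joins a cluster if it overlaps any member) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive coordinates


@dataclass
class Transcript:
    # Hypothetical minimal model; real HAVANA/Ensembl models carry far more.
    name: str
    strand: str
    coding_exons: List[Exon]


def exons_overlap(a: Exon, b: Exon) -> bool:
    """True if two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def share_coding_exon(t1: Transcript, t2: Transcript) -> bool:
    """Same strand and at least one pair of overlapping coding exons."""
    if t1.strand != t2.strand:
        return False
    return any(exons_overlap(a, b) for a in t1.coding_exons for b in t2.coding_exons)


def cluster_transcripts(transcripts: List[Transcript]) -> List[List[str]]:
    """Group transcripts with union-find (single linkage, an assumption)."""
    parent = list(range(len(transcripts)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(transcripts)):
        for j in range(i + 1, len(transcripts)):
            if share_coding_exon(transcripts[i], transcripts[j]):
                parent[find(i)] = find(j)

    clusters = {}
    for i, t in enumerate(transcripts):
        clusters.setdefault(find(i), []).append(t.name)
    return sorted(sorted(names) for names in clusters.values())


# Toy input: one HAVANA-like and three Ensembl-like models (invented data).
models = [
    Transcript("H1", "+", [(100, 200), (300, 400)]),
    Transcript("E1", "+", [(150, 250)]),
    Transcript("E2", "-", [(150, 250)]),
    Transcript("E3", "+", [(500, 600)]),
]
clusters = cluster_transcripts(models)
```

In this toy input, `H1` and `E1` end up in one cluster, while `E2` (opposite strand) and `E3` (no overlap) stay on their own. The pairwise exon-by-exon comparison within each cluster, and the merge rules applied to it, are described only in the Supplemental material and are not modeled here.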

The merge process is summarized in the Supplemental Figures and Tables, including the rules involved in each step. Ensembl have developed a new module, HavanaAdder, to produce this GENCODE merged gene set. Prior to running the HavanaAdder code, the HAVANA gene models are passed through the Ensembl health-checking system, which aims to identify any inconsistencies within the manually annotated gene set.

Annotation highlighted by this system is passed back to HAVANA for further inspection. In addition, the HAVANA transcript models are queried against external data sets such as the consensus coding sequence (CCDS) (Pruitt et al. 2009) gene set and Ensembl’s cDNA alignments of all human cDNAs.

If annotation described in these external data sets is missing from the manual set, then this is stored in the AnnoTrack system (see below) (Kokocinski et al. 2010) so that a record is kept for the annotators to inspect these loci. The genes in the GENCODE reference gene set are classified into three levels according to their type of annotation.

Level 1 highlights transcripts that have been manually annotated and experimentally validated by RT-PCR-seq (Howald et al. 2012), as well as pseudogenes that have been validated by three-way consensus, namely, that have been independently validated by three different strategies. Level 2 indicates transcripts that have been manually annotated.

Some Level 2 transcripts have been merged with models produced by the Ensembl automatic pipeline, while other Level 2 transcripts are annotated by HAVANA only. Level 3 indicates transcripts and pseudogene predictions arising from Ensembl’s automated annotation pipeline.

GENCODE 7 consists of 9019 transcripts at Level 1, 118,657 transcripts at Level 2, and 33,699 transcripts at Level 3. Many of the protein-coding genes in Level 3 are contributed by Ensembl’s genome-wide annotation in regions where HAVANA has not yet provided manual annotation.
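A per-level tally like the one above could be recomputed from an annotation file with a few lines of code. The sketch below assumes a GTF-style file with nine tab-separated columns in which each transcript row carries a `level N` entry in its attribute column; that layout, the function name, and the example rows are assumptions for illustration, not details given in this text.

```python
import re
from collections import Counter

# Assumed attribute syntax: `level 2;` inside the ninth GTF column.
LEVEL_RE = re.compile(r"\blevel (\d)\b")


def count_transcript_levels(gtf_lines):
    """Count transcript rows per annotation level, skipping comments."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only transcript rows are tallied
        match = LEVEL_RE.search(fields[8])
        if match:
            counts[int(match.group(1))] += 1
    return counts


def make_line(feature, attributes):
    """Build one invented GTF row; all identifiers are hypothetical."""
    return "\t".join(["chr1", "HAVANA", feature, "100", "900", ".", "+", ".", attributes])


example = [
    "# hypothetical header line",
    make_line("gene", 'gene_id "G1"; level 2;'),
    make_line("transcript", 'gene_id "G1"; transcript_id "T1"; level 2;'),
    make_line("transcript", 'gene_id "G1"; transcript_id "T2"; level 1;'),
    make_line("exon", 'gene_id "G1"; transcript_id "T2"; level 1;'),
    make_line("transcript", 'gene_id "G2"; transcript_id "T3"; level 3;'),
]
level_counts = count_transcript_levels(example)
```

With the GENCODE 7 figures quoted above, the three levels sum to 161,375 transcripts, of which roughly 74% are Level 2.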
Locus level classification
Manually annotated GENCODE gene features are subdivided into categories on the basis of their functional potential and the source of the evidence supporting their annotation. Annotated gene models are predominantly supported by transcriptional and/or protein evidence.

Once the structure of a model has been established, it is classified into one of three broad locus level biotypes: protein-coding gene, long noncoding RNA (lncRNA) gene, or pseudogene. In addition, more detailed biotypes are associated with transcripts to attempt to assign a functionality, for example, protein-coding or subject to nonsense-mediated decay (NMD) (see landscape Supplemental Tables).

To provide a more complete description of the gene model, a “status” is assigned at both the locus and transcript level. Loci can be assigned the status “known,” “novel,” or “putative” depending on their presence in other major databases and the evidence used to build their component transcripts.

In brief, loci have the status “known” if they are represented in the HUGO Gene Nomenclature Committee (HGNC) database (Seal et al. 2011) and RefSeq (Pruitt et al. 2012); loci with the status “novel” are not currently represented in those databases but are well supported by either locus-specific transcript evidence or evidence from a paralogous or orthologous locus.

Finally, loci with status “putative” are supported by shorter, more sparse transcript evidence. A similar status categorization is employed at the transcript level (see Supplemental Figures and Tables).

In addition to the information captured by biotype and status, controlled vocabulary attributes are attached to both transcripts and loci. They are used to describe other features relevant to the structure or functional annotation of a transcript.

Attributes may be subdivided into three main categories: those that explain features related to splicing, those related to the translation of the transcript, and those related to the transcriptional evidence used to build the transcript model.

For a comprehensive list of all attributes along with the definitions used in the GENCODE annotation, see the landscape Supplemental Tables. Where further explanation of annotation is required, free text remarks are added. New controlled vocabulary is developed wherever possible so that annotation text strings can be searched computationally.
Analyzing long noncoding transcript annotation
Over the last decade, evidence from numerous high-throughput array experiments has indicated that evolution of the developmental processes regulating complex organisms can be attributed to the noncoding regions and not only to the protein-coding regions of the genome (Bertone et al. 2004; Mattick 2004; Kapranov et al. 2007; Clark et al. 2011).

The GENCODE gene set has always attempted to catalog this noncoding transcription utilizing a combination of computational analysis, human and mammalian cDNAs/ESTs alignments, and extensive manual curation to validate their noncoding potential.

GENCODE 7 contains 9640 lncRNA loci, representing 15,512 transcripts, which is the largest manually curated catalog of human lncRNAs currently publicly available.

All the lncRNA loci in the catalog originate from the manual annotation pipeline and are initially classified as noncoding due to the lack of homology with any protein, no reasonable-sized open reading frame (ORF; not subject to NMD), and no high conservation, confirmed by PhyloCSF (see later section), through the majority of exons.

The transcripts are not required to be polyadenylated but 16.8% are, and chromatin marks have been identified for 13.9% (Derrien et al. 2012). These lncRNAs can be further reclassified into the following locus biotypes based on their location with respect to protein-coding genes:

1. Antisense RNAs: Locus that has at least one transcript that intersects any exon of a protein-coding locus on the opposite strand, or published evidence of antisense regulation of a coding gene.

2. LincRNA: Locus is intergenic noncoding RNA.

3. Sense overlapping: Locus contains a coding gene within an intron on the same strand.

4. Sense intronic: Locus resides within intron of a coding gene but does not intersect any exons on the same strand.

5. Processed transcript: Locus where none of its transcripts contain an ORF and cannot be placed in any of the other categories because of complexity in their structure.
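The positional rules in the list above can be read as a small decision procedure. The sketch below is a simplified interpretation, not the annotators' actual procedure: it ignores the "published evidence" route to the antisense class, treats "intergenic" as "no overlap with any coding gene span", applies the rules in list order, and uses invented names (`Locus`, `classify_lncrna`) and coordinates.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive


@dataclass
class Locus:
    # Hypothetical minimal locus: strand plus sorted, non-overlapping exons.
    strand: str
    exons: List[Interval]

    @property
    def span(self) -> Interval:
        return (self.exons[0][0], self.exons[-1][1])

    @property
    def introns(self) -> List[Interval]:
        return [(a[1] + 1, b[0] - 1) for a, b in zip(self.exons, self.exons[1:])]


def overlaps(a: Interval, b: Interval) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]


def contains(outer: Interval, inner: Interval) -> bool:
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def classify_lncrna(lnc: Locus, coding: List[Locus]) -> str:
    """Assign a positional biotype relative to protein-coding loci."""
    # 1. Antisense: exon-exon intersection on the opposite strand.
    for gene in coding:
        if gene.strand != lnc.strand and any(
            overlaps(e, f) for e in lnc.exons for f in gene.exons
        ):
            return "antisense"
    same_strand = [g for g in coding if g.strand == lnc.strand]
    # 3. Sense overlapping: a coding gene lies inside an intron of the lncRNA.
    for gene in same_strand:
        if any(contains(intron, gene.span) for intron in lnc.introns):
            return "sense_overlapping"
    # 4. Sense intronic: the lncRNA lies inside an intron of a coding gene.
    for gene in same_strand:
        if any(contains(intron, lnc.span) for intron in gene.introns):
            return "sense_intronic"
    # 2. lincRNA: no overlap with any coding gene span (simplifying assumption).
    if not any(overlaps(lnc.span, g.span) for g in coding):
        return "lincRNA"
    # 5. Everything else falls back to processed transcript.
    return "processed_transcript"


coding_genes = [Locus("+", [(100, 200), (1000, 1100)])]
biotypes = [
    classify_lncrna(Locus("-", [(150, 250)]), coding_genes),
    classify_lncrna(Locus("+", [(400, 500), (600, 700)]), coding_genes),
    classify_lncrna(Locus("+", [(10, 50), (1500, 1600)]), coding_genes),
    classify_lncrna(Locus("+", [(5000, 5100), (5300, 5400)]), coding_genes),
    classify_lncrna(Locus("+", [(150, 300)]), coding_genes),
]
```

The five toy loci come out as antisense, sense intronic, sense overlapping, lincRNA, and processed transcript, in that order. The precedence between rules is a design choice of this sketch; the text does not state how ties between categories are resolved.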

The GENCODE lncRNA data set is larger than other available lncRNA data sets, and it shows limited intersection with them. Forty-two percent (44 out of 96) of the lncRNA database lncRNAdb (Amaral et al. 2011) are represented in GENCODE lncRNAs.

We checked the same strand overlap against recent lncRNA catalogs: GENCODE v7 lncRNAs contain 30% of Jia et al. (2010) lncRNAs, 39% of Cabili et al. lincRNAs (Cabili et al. 2011), and 12% of vlincs (Kapranov et al. 2007) (for more details, see Derrien et al. 2012). While this level of overlap between data sets shows how lncRNA annotation is improving, it also shows that substantial additional work is still required.

There are likely to be a number of reasons for the limited overlap between the published lincRNAs and GENCODE, not least that a substantial fraction of transcript annotations are currently incomplete (see below).

Another reason is that some of the published transcripts are single exons, which up to now have not been annotated in GENCODE unless there is additional support, for example, polyA features, conservation, submitted sequence, or publications.

We are addressing this weakness and re-examining single exons lincRNAs based on annotation from Jia et al. (2010) in collaboration with the Lipovich group, and the data will be incorporated into GENCODE 10.
Although the current definition of lncRNAs requires the transcript to be >200 bp (Wang and Chang 2011), the GENCODE ncRNA set contains 136 spliced transcripts <200 bp (all of them single transcript loci) to highlight that there is evidence of expression at that position in the genome.

We currently group the transcripts into loci, which is different compared with other lncRNA analysis groups, for example, the Fantom Consortium (Katayama et al. 2005). Multiple lncRNA transcripts appear to start from the same transcription start site (TSS), for example, the DLX6-AS1 locus shown in Supplemental Figure 2.

To estimate the completeness of the lncRNA transcripts, we took advantage of CAGE tags from 12 different cell lines and manually annotated polyA features to assess the TSS and 3′ end of transcripts (Djebali et al. 2012). The beginning and end of 15% and 16.8% of lncRNAs are supported, respectively, indicating that the majority of transcripts are incomplete.

Interestingly, lncRNA transcripts have an unusual exon structure compared with protein-coding transcripts, with their distributions peaking at two and five exons, respectively (see Fig. 2).

This lower number does not appear to be an artifact or the product of incomplete annotation but most probably is a bona fide characteristic of the lncRNAs, as it is also observed in potential lncRNA models identified using the Illumina Human Body Map 2.0 (HBM) RNA-seq data generated in 2010 on HiSeq 2000 instruments (described below), which are not built from partial ESTs.
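The "peak" of an exon-number distribution is simply its most common value, which can be computed as in the following minimal sketch. The function name and the exon counts are invented for illustration and are not GENCODE data.

```python
from collections import Counter


def modal_exon_count(exon_counts):
    """Return the most common exon count; ties go to the smaller count."""
    tally = Counter(exon_counts)
    best = max(tally.values())
    return min(n for n, c in tally.items() if c == best)


# Invented toy samples of exons-per-transcript.
lncrna_exon_counts = [2, 2, 3, 2, 1, 4]
coding_exon_counts = [5, 5, 4, 5, 8, 12, 5]

lncrna_peak = modal_exon_count(lncrna_exon_counts)
coding_peak = modal_exon_count(coding_exon_counts)
```

Applied to the exon counts of every annotated transcript in each class, the same tally would give the curves summarized in Figure 2.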
Figure 2. Analysis of exon number of protein-coding and noncoding RNA transcripts. The numbers of exons for each individual transcript annotated at protein-coding and lncRNA loci are plotted for GENCODE 3c (red lines) and GENCODE 7 (blue lines). For each release, darker lines indicate protein-coding transcripts, and lighter lines indicate lncRNA transcripts. The 5′ and 3′ UTR exons of protein-coding transcripts are included.

Small noncoding RNAs are automatically annotated from the Ensembl pipeline and included within the GENCODE gene set. The number has remained relatively stable at 8801 since release 4.

Protein-coding and noncoding transcripts that contain a small ncRNA within at least one intron or exon will be annotated with the attribute ncrna_host. Thirty-three percent of small ncRNAs map within the boundaries of a GENCODE gene, the majority of which reside in introns.

The GENCODE 7 release contains 1679 protein-coding and 301 lncRNA genes with ncrna_host attributes, and there is a sixfold enrichment of small nucleolar RNAs (snoRNAs) within exons of lncRNAs (Derrien et al. 2012).

In summary, the lncRNAs data set in GENCODE 7 consists of 5058 lincRNA loci, 3214 antisense loci, 378 sense intronic loci, and 930 processed transcripts loci. Manually evaluating the RNA-seq models generated from HBM data and ENCODE data could potentially double this number in later releases of GENCODE and produce a uniform data set.

Integration of pseudogenes into GENCODE
Within most gene catalogs, pseudogenes have been annotated as a byproduct of protein-coding gene annotation, since a transcript has been identified with a frameshift or deletion, rather than an important entity in its own right.

However, recent analyses of retrotransposed pseudogenes such as PTENP1 (Poliseno et al. 2010) and DHFRL1 (McEntee et al. 2011) have found some retrotransposed pseudogenes to be expressed and functional and to have major impacts on human biology. The GENCODE catalog is unique in its annotation of the comprehensive pseudogene landscape of the human genome using a combination of automated, manual, and, more recently, experimental methods.

The assignment process for pseudogenes is described in detail by Pei et al. (2012). Briefly, in silico identification of pseudogenes is obtained from routine implementation of Yale’s Pseudopipe (Zhang et al. 2006) with every new release of Ensembl. Pseudopipe identified 18,046 pseudogenes based on the human genome release in Ensembl 61.

These pseudogenes were compared to a recent run of UCSC’s RetroFinder, which included 13,644 pseudogenes, and HAVANA’s latest annotations of 11,224 pseudogenes based on GENCODE 7, level 2. A three-way Yale, UCSC, and HAVANA pseudogene consensus set was obtained by using an overlap criterion of 50 bp, an approach that was developed for the annotation of the 1% ENCODE Regions (Zheng et al. 2007).
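One plausible reading of the three-way consensus is that a manually annotated pseudogene is kept when it overlaps, by at least 50 bp, a call from each of the two automated sets. The sketch below implements that reading only. Whether the real procedure also requires strand agreement or reciprocal overlap is not stated here, and the function names and coordinates are hypothetical.

```python
from typing import List, Tuple

Call = Tuple[str, int, int]  # (chromosome, start, end), 1-based inclusive


def overlap_bp(a: Call, b: Call) -> int:
    """Number of shared bases between two calls (0 if on different chromosomes)."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]) + 1)


def three_way_consensus(
    havana: List[Call],
    pseudopipe: List[Call],
    retrofinder: List[Call],
    min_overlap: int = 50,
) -> List[Call]:
    """Keep HAVANA calls supported by both automated sets (assumed rule)."""

    def supported(call: Call, others: List[Call]) -> bool:
        return any(overlap_bp(call, other) >= min_overlap for other in others)

    return [
        call
        for call in havana
        if supported(call, pseudopipe) and supported(call, retrofinder)
    ]


# Invented coordinates for illustration only.
havana_calls = [("chr1", 100, 400), ("chr1", 1000, 1300), ("chr2", 100, 400)]
pseudopipe_calls = [("chr1", 300, 600), ("chr1", 1000, 1300)]
retrofinder_calls = [("chr1", 50, 160), ("chr1", 1280, 1500), ("chr2", 100, 400)]

consensus = three_way_consensus(havana_calls, pseudopipe_calls, retrofinder_calls)
```

Only the first HAVANA call survives: the second overlaps its RetroFinder counterpart by just 21 bp, and the third has no PseudoPipe support.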

This resulted in a consensus set of 7183 pseudogenes, which are tagged level 1. The functional paralog of a pseudogene is often referred to as the “parent” gene. Currently, we have successfully identified parents for 9369 of the manually annotated pseudogenes, whereas the parents for the remaining 1847 pseudogenes are still ambiguous and may require further investigation.

It is important to note, however, that it is not always possible to identify the true parent of a pseudogene with certainty, for example, when a pseudogene is highly degraded and is derived from a parent gene with highly similar paralogs or when the parent contains a common functional domain.

We have added this information to the pseudogene annotation if known (based on protein alignments), and it is also available from the pseudogene decorated resource (psiDR, described in Pei et al. 2012) at http://www.pseudogene.org/psidr/psiDR.v0.txt.
A pseudogene ontology was created to associate a variety of biological properties—such as sequence features, evolution, and potential biological functions—to pseudogenes and is incorporated into the GENCODE annotation file. The hierarchy of these properties is shown in Supplemental Figure 3.

The ontology allows not only comprehensive annotation of pseudogenes but also automatic queries against the pseudogene knowledge database (Holford et al. 2010). The breakdown of the different biotypes within the GENCODE data set can be seen in Supplemental Table 4.

A schematic to describe the different manually annotated pseudogene biotypes is presented in Figure 3. For example, unitary pseudogenes (i.e., genes that are active in mouse but pseudogenic in the human lineage) were all manually checked for false positives due to genomic sequencing errors or incorrect automated gene predictions in the mouse (Zhang et al. 2010).

Computational approaches followed by experimental validation were implemented to examine how many GENCODE pseudogenes appeared to be transcribed (Pei et al. 2012). Briefly, transcribed pseudogenes were identified manually and tagged by the HAVANA team examining locus-specific transcription evidence (by aligning of mRNAs or ESTs).
This identified 171 transcribed processed and 309 transcribed unprocessed pseudogenes. The locus-specific transcriptional evidence must indicate a best-in-genome alignment and clear differences compared with the parent locus.

Interestingly, there was over one-third more unprocessed pseudogenes annotated as transcribed compared with processed pseudogenes, even though there are approximately four times as many processed pseudogenes present in the genome than unprocessed pseudogenes (see Supplemental Table 4).

In addition, automated pipeline analysis of RNA-seq data from the total RNA of ENCODE cell lines GM12878 and K562 plus the HBM RNA-seq resource (Pei et al. 2012) generated an additional 110 and 344 transcribed processed and unprocessed pseudogenes, respectively.

Specific primers could be designed for 162 potentially transcribed pseudogenes, which were then subjected to experimental validation of transcription by the RT-PCR-seq pipeline within the GENCODE Consortium (Howald et al. 2012). After the validation experiments, 63 pseudogenes were found to be transcribed within at least one of eight tissues.

Figure 3. A schematic showing the structural annotation of different pseudogene biotypes. The schematic diagram illustrates the categorization of GENCODE pseudogenes on the basis of their origin. Processed pseudogenes are derived by a retrotransposition event and unprocessed pseudogenes by a gene duplication event, in both cases followed by the gain of a disabling mutation.

Both processed and unprocessed pseudogenes can retain or gain transcriptional activity, which is reflected in the transcribed_processed and transcribed_unprocessed_pseudogene classification. Polymorphic pseudogenes contain a disabling mutation in the reference genome but are known to be coding in other individuals, while unitary pseudogenes have functional protein-coding orthologs in other species (we have used mouse as a reference) but contain a fixed disabling mutation in human.
