NBT: OPERA-MS, a hybrid assembler for second- and third-generation metagenomic sequencing data

OPERA-MS: a hybrid assembler for second- and third-generation metagenomic sequencing

Hybrid metagenomic assembly enables high-resolution analysis of resistance determinants and mobile elements in human microbiomes

Nature Biotechnology [IF:31.864]

2019-07-29 Articles

DOI: https://doi.org/10.1038/s41587-019-0191-2

First author: Denis Bertrand¹

Corresponding author: Niranjan Nagarajan¹,⁷*

Other authors: Jim Shaw, Manesh Kalathiyappan, Amanda Hui Qi Ng, M. Senthil Kumar, Chenhao Li (李陈浩), Mirta Dvornicic, Janja Paliska Soldo, Jia Yu Koh, Chengxuan Tong, Oon Tek Ng, Timothy Barkham, Barnaby Young, Kalisvar Marimuthu, Kern Rei Chng, Mile Sikic

Affiliations:

¹ Computational & Systems Biology, Genome Institute of Singapore, Singapore, Singapore

⁷ National University of Singapore, Singapore, Singapore

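Rexinchang Daily (热心肠日报)

Nature sub-journal: OPERA-MS, a new hybrid assembler for second- and third-generation metagenomic data

Written by: Liu Yongxin (刘永鑫). Reviewed by: Liu Yongxin. August 2

Original title: Hybrid metagenomic assembly enables high-resolution analysis of resistance determinants and mobile elements in human microbiomes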

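OPERA-MS combines assembly-based metagenome clustering with repeat-aware, exact scaffolding to perform hybrid metagenomic assembly of second- and third-generation reads;
In evaluations on mock and real metagenomic samples, it achieved higher base accuracy than long-read assembly, higher contiguity than short-read assembly and fewer misassemblies than a non-metagenomic hybrid assembler, and it recovered high-quality genomes for low-abundance species;
It also provides strain-resolved assembly within a species and high-quality reference genomes for rare species;
With nanopore reads, it assembled more than 80 closed plasmid or phage sequences, which makes fine-grained study of the gut antibiotic resistome possible.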

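Second-generation sequencing offers high throughput and high accuracy but short reads; third-generation sequencing offers long reads but has a high error rate and a high cost. Combining the strengths of the two is not yet common practice in metagenomics, and many technical problems remain unsolved. Recently, Niranjan Nagarajan's group at the Genome Institute of Singapore released OPERA-MS, a hybrid assembler for second- and third-generation reads. Its assemblies combine high base accuracy with contiguity roughly an order of magnitude higher than short-read assemblies.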

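OPERA-MS integrates assembly-based metagenome clustering with an exact scaffolding algorithm. In evaluations on a virtual gut microbiome and defined mock communities, the authors obtained high-quality genomes for low-abundance (<1%) species with only about 9× long-read coverage, and near-complete genomes at higher coverage. Notably, OPERA-MS can also resolve genomes at the strain level. When nanopore sequencing was applied to gut metagenomes from antibiotic-treated patients, the hybrid assemblies were 200× more contiguous than short-read assemblies. This landmark work was published on July 29 in Nature Biotechnology.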

Abstract

Characterization of microbiomes has been enabled by high-throughput metagenomic sequencing. However, existing methods are not designed to combine reads from short- and long-read technologies. We present a hybrid metagenomic assembler named OPERA-MS that integrates assembly-based metagenome clustering with repeat-aware, exact scaffolding to accurately assemble complex communities. Evaluation using defined in vitro and virtual gut microbiomes revealed that OPERA-MS assembles metagenomes with greater base pair accuracy than long-read (>5×; Canu), higher contiguity than short-read (~10× NGA50; MEGAHIT, IDBA-UD, metaSPAdes) and fewer assembly errors than non-metagenomic hybrid assemblers (2×; hybridSPAdes). OPERA-MS provides strain-resolved assembly in the presence of multiple genomes of the same species, high-quality reference genomes for rare species (<1%) with ~9× long-read coverage and near-complete genomes with higher coverage. We used OPERA-MS to assemble 28 gut metagenomes of antibiotic-treated patients, and showed that the inclusion of long nanopore reads produces more contiguous assemblies (200× improvement over short-read assemblies), including more than 80 closed plasmid or phage sequences and a new 263 kbp jumbo phage. High-quality hybrid assemblies enable an exquisitely detailed view of the gut resistome in human patients.

Main results

Fig. 1: OPERA-MS workflow.

Short reads are first assembled by a metagenomic assembler into contigs, and short and long reads are mapped to them to obtain coverage information and spanning reads (Step 1). Spanning reads are then bundled to get edges between contigs for an assembly graph that represents the contiguity information of the whole metagenome (Step 2). Contigs are organized into a hierarchical clustering where the distance between contigs increases with genomic distance and their difference in coverage (Step 3). The tree is then cut into optimal clusters based on the BIC (Bayesian information criterion; Step 4). Optionally, to improve the clustering for species where a reference genome is available, the Mash genomic distance between each cluster and a database of complete bacterial genomes is computed (Step 5). Clusters are then merged if there is supporting information in the assembly graph to form species-specific super-clusters (Step 6). These super-clusters are further analyzed to deconvolute contigs that come from distinguishable subspecies genomes (Step 7). Finally, each cluster is independently scaffolded and gap-filled using a program meant for isolate genomes (OPERA-LG; Step 8).
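To make Step 2 concrete, here is a minimal Python sketch of bundling spanning reads into assembly-graph edges. It is an illustration only, not the OPERA-MS implementation: the function name `bundle_spanning_reads`, the `min_support` threshold, the use of the median gap, and the decision to ignore contig orientation are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import median


def bundle_spanning_reads(links, min_support=2):
    """Bundle per-read links into contig-to-contig edges.

    links: iterable of (contig_a, contig_b, estimated_gap) tuples, one per
    long read that spans two contigs. Orientation is ignored here for
    brevity (an assumption; a real scaffolder tracks it).
    """
    gaps_by_pair = defaultdict(list)
    for contig_a, contig_b, gap in links:
        pair = tuple(sorted((contig_a, contig_b)))  # undirected contig pair
        gaps_by_pair[pair].append(gap)

    edges = {}
    for pair, gaps in gaps_by_pair.items():
        # Keep only pairs supported by enough reads (hypothetical threshold).
        if len(gaps) >= min_support:
            edges[pair] = {"support": len(gaps), "gap": median(gaps)}
    return edges


# Hypothetical links: three reads connect c1 and c2, one read connects c2 and c3.
links = [("c1", "c2", 120), ("c2", "c1", 100), ("c1", "c2", 110), ("c2", "c3", 500)]
edges = bundle_spanning_reads(links)
print(edges)
```

With these toy links, only the c1/c2 pair survives, because the single read linking c2 and c3 falls below the assumed support threshold.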

Fig. 2: Benchmarking hybrid assembly of genomes from metagenomes.

a–c, Increase in assembly contiguity as a function of read coverage for a representative short-read assembler (metaSPAdes; a), long-read assembler (Canu; b) and hybrid assembler (OPERA-MS; c). Note that hybrid assembly improves over short- and long-read assembly in terms of scaling across coverage ranges and producing near-complete genomes (NGA50 >1 Mbp) with as little as 9× long-read coverage. Unassembled genomes are shown as circles with black borders. d, Improvements in assembly contiguity (NGA50) provided by OPERA-MS in comparison with other assemblers as a function of long-read coverage. The number of assembled genomes, in ascending order of coverage, is 3, 12, 20 and 19 for MEGAHIT and IDBA-UD, 3, 13, 21 and 19 for metaSPAdes and hybridSPAdes and 4 and 16 for Canu. Note that Canu does not assemble low-coverage genomes and hence metrics are not provided in those ranges. Data are presented as box plots (center line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range; points, outliers). e, Misassembly rates for different assemblers, with solid lines indicating median values. Most assemblers produce ~1 large misassembly per Mbp (dashed line), except for hybridSPAdes. In each part, each data point represents one genome from the mock communities.
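Since NGA50 is the contiguity metric used throughout these benchmarks, a short Python sketch of how such a value can be computed may help. NGA50 is commonly defined as NG50 computed over aligned blocks, that is, contigs split at misassemblies after alignment to the reference. The function below assumes the aligned block lengths are already available; the function name `ng50` and the toy numbers are illustrative assumptions, not values from the paper.

```python
def ng50(block_lengths, genome_size):
    """Return the largest length L such that blocks of length >= L
    together cover at least half of the reference genome.

    Passing aligned block lengths (contigs broken at misassemblies)
    gives an NGA50-style value. Returns 0 if the blocks never reach
    half of genome_size.
    """
    covered = 0
    for length in sorted(block_lengths, reverse=True):
        covered += length
        if 2 * covered >= genome_size:
            return length
    return 0


# Toy 1,000 bp reference (hypothetical numbers for illustration).
short_read_blocks = [100] * 8   # fragmented assembly
hybrid_blocks = [600, 300]      # more contiguous assembly

print(ng50(short_read_blocks, 1000))
print(ng50(hybrid_blocks, 1000))
```

Because the denominator is the reference genome size rather than the assembly size, a genome that is mostly unassembled gets a value of 0 instead of a misleadingly high score.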

Fig. 3: Assembly of a virtual gut microbiome.

a, Construction of a virtual gut microbiome that represents a complex metagenomic data set while retaining the ability to evaluate assemblies against gold-standard references.

b, Improvement in assembly contiguity (NGA50) obtained using OPERA-MS compared with other assemblers over different coverage ranges. Dots represent species that have at least two strains in the metagenome (species present in GIS20 and S2 with an abundance >0.1% as reported by MetaPhlAn2 (ref. 49) (v.2.6.0)). The number of assembled genomes, in ascending order of coverage, was 1 for Canu and 2, 6, 4 and 5 for the other methods. Data are presented as box plots (center line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range; points, outliers).

c, Comparison of misassembly rates (one dot per genome) for different assemblers, with solid lines indicating median values.

d, Evaluation of Illumina-only (M, MEGAHIT) and hybrid (H, hybridSPAdes; O, OPERA-MS) metagenomic assemblies after binning for their utility in downstream analysis. Bins that contained the largest fraction of a reference genome (GIS20 references; species with bold names have at least two strains in the metagenome) were evaluated for (1) genome completeness, the fraction of the genome represented in the bin, (2) genome purity, percentage of bases in the bin that correspond to the correct reference, (3) gene completeness, fraction of genes that were fully assembled in the bin and (4) pathway completeness, fraction of pathways with over 90% of their constituent genes being assembled and binned together.
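The genome completeness and genome purity measures in panel d follow directly from their definitions. The Python sketch below assumes that each bin has already been summarized as the number of bases aligning to each reference genome; the function name, the input layout and the example numbers are hypothetical and only illustrate the two ratios.

```python
def bin_completeness_and_purity(bin_bases_by_reference, target, target_genome_size):
    """Compute (completeness, purity) of one bin for a target reference.

    bin_bases_by_reference: dict mapping reference name to the number of
    bases in the bin that align to that reference (assumed input format).
    completeness = target bases in the bin / size of the target genome
    purity       = target bases in the bin / all bases in the bin
    """
    target_bases = bin_bases_by_reference.get(target, 0)
    total_bases = sum(bin_bases_by_reference.values())
    completeness = target_bases / target_genome_size
    purity = target_bases / total_bases if total_bases else 0.0
    return completeness, purity


# Hypothetical bin: 4.5 Mbp from species A plus 0.5 Mbp of contamination from B.
example_bin = {"species_A": 4_500_000, "species_B": 500_000}
completeness, purity = bin_completeness_and_purity(example_bin, "species_A", 5_000_000)
print(completeness, purity)
```

A bin can score high on one measure and low on the other: a small, clean bin is pure but incomplete, while a large bin that merges two strains can be complete but impure.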

Fig. 4: Mobile elements and association with host species in the human gut microbiome.

a, Distribution of genome sizes for fully assembled circular sequences from OPERA-MS in 28 human gut metagenome data sets, illustrating the ability to assemble circular genomes of varying sizes and complexity (plasmids, phages and bacterial genomes).

b, Fraction of sequence covered versus average sequence identity of the assembled circular sequences in comparison to sequences in the NCBI nucleotide (nt) database (based on BLAST searches). Many of the assembled sequences showed good alignment and homology to known sequences from end to end (top right corner), but some only had local similarities (top left corner), and a few appear to be new (bottom left corner; 18 sequences).

c, Annotation of the largest (263 kbp) observed new circular sequence (no matches in nt database) revealed proteins associated with a phage life cycle, including replication, assembly and host lysis, indicating that the assembled sequence is a putative jumbo phage.

d, A new multiple resistance region assembled by OPERA-MS from the gut microbiome of a patient colonized by carbapenem-resistant Enterobacteriaceae. Apart from the clinically relevant carbapenemase gene cassette, the region also harbors genes that confer resistance to aminoglycosides, trimethoprim and sulfonamides, limiting treatment options.

e, Strain-level assembly with OPERA-MS enabled association of plasmid to genome based on correlation in read coverage across timepoints (n = 12). Left panel: Variation in coverage of two Escherichia coli strain genomes seen in the hybrid metagenomic assembly of data from day 76 (black arrow). Right panel: Correlation between the coverage of the plasmid and the two E. coli strains reveals that it is strain L that likely harbors the IMP gene-containing plasmid. The P value was computed using Student's t-test in R (two-sided).
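The idea behind panel e, linking a plasmid to the strain whose coverage it tracks across timepoints, can be sketched in a few lines of Python. The coverage values below are invented purely for illustration, the strain label `strain_S` is a hypothetical name for the second strain, and the sketch computes only the Pearson correlation; it does not reproduce the significance test that the authors ran in R.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented read coverage per timepoint (not data from the paper).
plasmid_coverage = [2, 4, 6, 8]
strain_coverage = {
    "strain_L": [10, 20, 30, 40],  # rises together with the plasmid
    "strain_S": [5, 5, 6, 4],      # does not track the plasmid
}

correlations = {
    name: pearson(plasmid_coverage, coverage)
    for name, coverage in strain_coverage.items()
}
likely_host = max(correlations, key=correlations.get)
print(correlations, likely_host)
```

With only a handful of timepoints a high correlation can arise by chance, which is why a P value such as the one reported in the caption matters when interpreting this kind of association.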

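Summary

This paper presents OPERA-MS, a hybrid metagenomic assembler, and benchmarks it against several short-read, long-read and hybrid assemblers. It markedly improves assembly contiguity and can resolve strain-level genomes, compensating for the high raw error rate and coverage requirements of long reads and for the limited length of short reads, and it performs well even at low long-read coverage. To test the software in practice, the authors evaluated it on a virtual gut microbiome and then assembled gut metagenomes from 28 antibiotic-treated patients, showing that it is useful for clinical metagenomics and for studying antibiotic resistance genes.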

Reference

Denis Bertrand, Jim Shaw, Manesh Kalathiyappan, Amanda Hui Qi Ng, M. Senthil Kumar, Chenhao Li, Mirta Dvornicic, Janja Paliska Soldo, Jia Yu Koh, Chengxuan Tong, Oon Tek Ng, Timothy Barkham, Barnaby Young, Kalisvar Marimuthu, Kern Rei Chng, Mile Sikic, and Niranjan Nagarajan. (2019). Hybrid metagenomic assembly enables high-resolution analysis of resistance determinants and mobile elements in human microbiomes. Nature Biotechnology. https://doi.org/10.1038/s41587-019-0191-2

Article link: https://blog.csdn.net/woodcorpse/article/details/98475794
