CLUSTOM-CLOUD: In-Memory Data Grid-Based Software for Clustering 16S rRNA Sequence Data in the Cloud Environment

Oh, Jeongsu; Kim, Kyung Mo; Cho, Wan-Sup; Arshan Nasir; Hong, Soon Gyu; Lee, Sang Heon; Hwang, Kyuin; Kim, Byung Kwon; Park, Minkyu; Choi, Chi-Hwan

KOPRI Repository

Full metadata record

DC Field	Value	Language
dc.contributor.author	Oh, Jeongsu	-
dc.contributor.author	Kim, Kyung Mo	-
dc.contributor.author	Cho, Wan-Sup	-
dc.contributor.author	Arshan Nasir	-
dc.contributor.author	Hong, Soon Gyu	-
dc.contributor.author	Lee, Sang Heon	-
dc.contributor.author	Hwang, Kyuin	-
dc.contributor.author	Kim, Byung Kwon	-
dc.contributor.author	Park, Minkyu	-
dc.contributor.author	Choi, Chi-Hwan	-
dc.date.accessioned	2018-03-29T06:10:56Z	-
dc.date.available	2018-03-29T06:10:56Z	-
dc.date.issued	2016	-
dc.identifier.uri	https://repository.kopri.re.kr/handle/201206/7445	-
dc.description.abstract	High-throughput sequencing can produce hundreds of thousands of 16S rRNA sequence reads corresponding to the different organisms present in environmental samples. Typically, bioinformatic analysis of microbial diversity starts with pre-processing, followed by clustering of the 16S rRNA reads into relatively few operational taxonomic units (OTUs). OTUs are reliable indicators of microbial diversity and greatly reduce downstream analysis time. However, existing hierarchical clustering algorithms, which are generally more accurate than greedy heuristic algorithms, struggle with large sequence datasets. To keep pace with the rapid rise in sequencing data, we present CLUSTOM-CLOUD, the first distributed sequence clustering program based on In-Memory Data Grid (IMDG) technology, a distributed data structure that stores all data in the main memory of multiple computing nodes. IMDG technology enables CLUSTOM-CLOUD to handle larger datasets and to scale better than its predecessor, CLUSTOM, while maintaining high accuracy. The clustering speed of CLUSTOM-CLOUD was evaluated on published 16S rRNA human microbiome sequence datasets, using a small laboratory cluster (10 nodes) and the Amazon EC2 cloud-computing environment. In the laboratory environment, it required only ~3 hours to process a dataset of 200 K reads, regardless of the complexity of the human microbiome data. On Amazon EC2, one million reads were processed in approximately 20, 14, and 11 hours using 20, 30, and 40 nodes, respectively. The running time evaluation indicates that CLUSTOM-CLOUD can handle much larger sequence datasets than CLUSTOM and is a scalable distributed processing system. A comparative accuracy test using 16S rRNA pyrosequences of a mock community shows that CLUSTOM-CLOUD achieves higher accuracy than DOTUR, mothur, ESPRIT-Tree, UCLUST, and Swarm. CLUSTOM-CLOUD is written in Java and is freely available at http://clustomcloud.kopri.re.kr.	-
dc.language	English	-
dc.title	CLUSTOM-CLOUD: In-Memory Data Grid-Based Software for Clustering 16S rRNA Sequence Data in the Cloud Environment	-
dc.title.alternative	CLUSTOM-CLOUD: 클라우드 환경에서 16S rRNA 염기서열을 클러스터링하기 위한 인메모리 데이터그리드 소프트웨어	-
dc.type	Article	-
dc.identifier.bibliographicCitation	Oh, Jeongsu, et al. 2016. "CLUSTOM-CLOUD: In-Memory Data Grid-Based Software for Clustering 16S rRNA Sequence Data in the Cloud Environment". PLOS ONE, 11(3(e0151064)): 1-20.	-
dc.citation.title	PLOS ONE	-
dc.citation.volume	11	-
dc.citation.number	3(e0151064)	-
dc.identifier.doi	10.1371/journal.pone.0151064	-
dc.citation.startPage	1	-
dc.citation.endPage	20	-
dc.description.articleClassification	SCIE	-
dc.description.jcrRate	JCR 2014:15.789	-
dc.subject.keyword	clustering	-
dc.subject.keyword	in-memory data grid	-
dc.identifier.localId	2016-0035	-
dc.identifier.scopusid	2-s2.0-84961154581	-
dc.identifier.wosid	000371991300079	-
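
The running times quoted in the abstract for one million reads on Amazon EC2 (approximately 20, 14, and 11 hours on 20, 30, and 40 nodes) can be turned into a rough scaling estimate. The sketch below is not part of CLUSTOM-CLOUD: the function name `scaling_table` and the choice of the 20-node run as the baseline are assumptions made for illustration, and the hour figures are the approximate values from the abstract, not exact measurements.

```python
# Approximate running times from the abstract:
# number of EC2 nodes -> hours to cluster one million reads.
REPORTED_HOURS = {20: 20.0, 30: 14.0, 40: 11.0}


def scaling_table(hours_by_nodes, baseline_nodes=20):
    """Return (nodes, speedup, efficiency) tuples relative to the baseline run.

    speedup    = baseline_hours / hours
    efficiency = speedup / (nodes / baseline_nodes)
                 (1.0 would mean ideal linear scaling from the baseline)
    """
    base_hours = hours_by_nodes[baseline_nodes]
    rows = []
    for nodes in sorted(hours_by_nodes):
        speedup = base_hours / hours_by_nodes[nodes]
        efficiency = speedup / (nodes / baseline_nodes)
        rows.append((nodes, speedup, efficiency))
    return rows


if __name__ == "__main__":
    for nodes, speedup, efficiency in scaling_table(REPORTED_HOURS):
        print(f"{nodes} nodes: speedup {speedup:.2f}x, efficiency {efficiency:.2f}")
```

With the abstract's rounded figures, the relative efficiency comes out at roughly 0.95 for 30 nodes and 0.91 for 40 nodes, which fits the abstract's description of the system as scalable. Because the inputs are approximate, these ratios are indicative only.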