CLUSTOM-CLOUD: In-Memory Data Grid-Based Software for Clustering 16S rRNA Sequence Data in the Cloud Environment
-
Title
-
CLUSTOM-CLOUD: In-Memory Data Grid-Based Software for Clustering 16S rRNA Sequence Data in the Cloud Environment
-
Authors
-
Oh, Jeongsu
Choi, Chi-Hwan
Park, Minkyu
Kim, Byung Kwon
Hwang, Kyuin
Lee, Sang Heon
Hong, Soon Gyu
Nasir, Arshan
Cho, Wan-Sup
Kim, Kyung Mo
-
Subject
-
Science & Technology - Other Topics
-
Keywords
-
Clustering; In-memory data grid
-
Issue Date
-
2015
-
Citation
-
Oh, Jeongsu, et al. 2015. "CLUSTOM-CLOUD: In-Memory Data Grid-Based Software for Clustering 16S rRNA Sequence Data in the Cloud Environment". PLoS ONE, 11(3): e0151064.
-
Abstract
-
High-throughput sequencing can produce hundreds of thousands of 16S rRNA sequence
reads corresponding to different organisms present in environmental samples. Typically,
analysis of microbial diversity in bioinformatics starts with pre-processing, followed by
clustering of 16S rRNA reads into a relatively small number of operational taxonomic units (OTUs). The
OTUs are reliable indicators of microbial diversity and greatly reduce downstream
analysis time. However, existing hierarchical clustering algorithms, which are generally more
accurate than greedy heuristic algorithms, struggle with large sequence datasets. To keep
pace with the rapid rise in sequencing data, we present CLUSTOM-CLOUD, which is the
first distributed sequence clustering program based on In-Memory Data Grid (IMDG) technology, a distributed data structure that stores all data in the main memory of multiple computing
nodes. IMDG technology enables CLUSTOM-CLOUD to handle larger datasets and to scale
computationally better than its ancestor, CLUSTOM,
while maintaining high accuracy. The clustering speed of CLUSTOM-CLOUD was evaluated
on published 16S rRNA human microbiome sequence datasets using a small
laboratory cluster (10 nodes) and the Amazon EC2 cloud-computing environment.
In the laboratory environment, it required only ~3 hours to process a dataset of 200 K
reads, regardless of the complexity of the human microbiome data. In turn, one million reads
were processed in approximately 20, 14, and 11 hours when utilizing 20, 30, and 40 nodes, respectively,
in the Amazon EC2 cloud-computing environment. The running time evaluation indicates
that CLUSTOM-CLOUD can handle much larger sequence datasets than CLUSTOM and is
also a scalable distributed processing system. The comparative accuracy test using 16S
rRNA pyrosequences of a mock community shows that CLUSTOM-CLOUD achieves higher accuracy than DOTUR, mothur, ESPRIT-Tree, UCLUST and Swarm.
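The scalability claim can be checked roughly against the EC2 timings quoted in the abstract. The sketch below computes speedup and parallel efficiency relative to the 20-node run. It is only an illustration: the function name `relative_scaling` and the choice of the 20-node run as baseline are assumptions made here, not part of the paper, and the input hours are the rounded figures from the abstract.

```python
# Approximate wall-clock hours for one million reads on Amazon EC2,
# copied from the abstract (rounded values, so results are rough).
reported_hours = {20: 20.0, 30: 14.0, 40: 11.0}


def relative_scaling(timings, baseline_nodes):
    """Return {nodes: (speedup, efficiency)} relative to the baseline run.

    speedup    = baseline time / time on `nodes`
    efficiency = speedup / (nodes / baseline_nodes), i.e. the fraction of
                 the ideal linear speedup that was actually achieved.
    """
    base_time = timings[baseline_nodes]
    result = {}
    for nodes, hours in sorted(timings.items()):
        speedup = base_time / hours
        ideal = nodes / baseline_nodes
        result[nodes] = (speedup, speedup / ideal)
    return result


# Hypothetical baseline choice: the smallest EC2 configuration reported.
scaling = relative_scaling(reported_hours, baseline_nodes=20)
for nodes, (speedup, efficiency) in scaling.items():
    print(f"{nodes} nodes: speedup {speedup:.2f}x, efficiency {efficiency:.2f}")
```

With these rounded inputs, doubling the node count from 20 to 40 gives roughly a 1.8x speedup, which is close to linear and consistent with the abstract's description of the system as scalable.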
-
DOI
-
http://dx.doi.org/10.1371/journal.pone.0151064
-
Type
-
Article
-
Appears in Collections
-
2014-2016, Long-Term Ecological Researches on King George Island to Predict Ecosystem Responses to Climate Change (14-16) / Hong; Soon Gyu (PE14020; PE15020; PE16020)