The Bioinformatics Shared Resource, a core function of Duke Cancer Institute (DCI), supports the bioinformatics research needs of DCI investigators, including their needs for complex genomic data management, data integration, computing and statistical analysis. Its mission is to provide a high-quality, service-oriented, coordinated and cost-efficient bioinformatics infrastructure for DCI researchers, one that increases collaboration across DCI programs and among the DCI, other Duke programs and external investigators. This mission is accomplished within the framework of adherence to sound data provenance and statistical principles, literate programming, and reproducible analysis.
The Bioinformatics Shared Resource covers every facet of analysis of high-dimensional genomic data, from the design stage through pre-processing (background correction, normalization and summarization of RNA microarrays; genotype and copy number calling from GWAS platforms; and alignment, normalization (RNA-seq) and SNV calling (DNA-seq) of next-generation sequencing [NGS] platforms), high-level association analyses with complex phenotypes, and genomic annotation of results.
This core provides resources to DCI programs and individual laboratories, coordinates institutional efforts in bioinformatics, helps drive the development of the biotechnology and pharmaceutical sectors within Duke, and creates synergy between scientific and clinical groups. The importance of the Bioinformatics Shared Resource within DCI and at Duke has grown with the explosion of genomic, proteomic, metabolomic, and other high-throughput data types; these data carry vast potential utility for clinical and translational research, especially when combined with clinical, imaging, and other scientific data.
Investigators spanning scientific disciplines that use high-dimensional bioinformatics data (e.g., genomics, metabolomics, proteomics) can leverage our expertise to increase the quality and efficiency of complex, integrative, collaborative cancer research.
- Pre-processing, analysis and annotation of molecular assays including microarray, GWAS, RNA-seq, DNA-seq, ChIP-seq and flow cytometry platforms
- Manuscript writing and review
- Grant writing support
- Provision of rigorous genomic study designs (including power and sample size calculations using simulation techniques)
- Provision of turnkey computing solutions for routine analyses
- Development of theoretical and applied statistical methods for rigorous and efficient analysis of complex genomic data
- Management and provision of high performance computing (HPC) resources including CPU, GPU and MapReduce computing, large data storage and cloud computing
- Data programming including merging across heterogeneous high throughput platforms for integrative genomic analysis
- Web-interface and database programming
- User training in bioinformatics software and hardware
- Enabling the use of computing resources managed throughout Duke University and available through commercial vendors (e.g., Amazon Web Services cloud computing)
- Assistance with depositing genomic data into research databases (e.g., GEO and dbGaP)
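As an illustration of the simulation-based power and sample size calculations listed above, the following is a minimal Python sketch, not the Resource's actual procedure. It assumes a single normally distributed feature compared between two equal-sized groups, a normal approximation to the critical value, and a Bonferroni adjustment for the number of features tested. The function names, defaults, and candidate sample sizes are hypothetical.

```python
import random
from statistics import NormalDist, mean, variance


def simulated_power(n_per_group, effect_size, alpha=0.05, n_tests=1,
                    n_sim=2000, seed=0):
    """Estimate the power of a two-sided two-sample test by simulation.

    Assumption: controls are N(0, 1) and cases are N(effect_size, 1).
    alpha is Bonferroni-adjusted for n_tests simultaneous tests.
    """
    rng = random.Random(seed)
    # Normal approximation to the critical value (reasonable for moderate n).
    critical = NormalDist().inv_cdf(1 - (alpha / n_tests) / 2)
    rejections = 0
    for _ in range(n_sim):
        controls = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        cases = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        std_err = ((variance(controls) + variance(cases)) / n_per_group) ** 0.5
        statistic = (mean(cases) - mean(controls)) / std_err
        if abs(statistic) > critical:
            rejections += 1
    # The rejection rate over simulated studies is the power estimate.
    return rejections / n_sim


def smallest_sample_size(effect_size, target_power=0.8,
                         candidates=range(10, 201, 10), **kwargs):
    """Return the first candidate group size reaching target_power, else None."""
    for n in candidates:
        if simulated_power(n, effect_size, **kwargs) >= target_power:
            return n
    return None
```

A call such as `smallest_sample_size(0.5, n_tests=20000)` would scan the candidate group sizes for a genome-wide setting. A realistic design would replace the normal data model with one suited to the assay.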
Services include consultation and bioinformatics programming to assist with study design and analysis, high-performance computing (HPC) leveraging CPUs and GPUs, data storage, and a strong commitment to training and education of clinical, translational, and basic science investigators. Our overall goal is to apply expertise in bioinformatic technologies, statistics and information systems to the creation of systems for conducting reproducible research of high-dimensional data types. These systems will allow for all raw data, analytical processes, and results to be stored and made publicly available under common standards such that they can be independently verified. Services include state-of-the-art hardware and software to support a full range of research involving "-omics" with a particular emphasis on open development and open source solutions.
Integration with other Duke Resources. Under the leadership of Dr. Owzar, the DCI Bioinformatics Shared Resource formally and actively collaborates with the Department of Biostatistics and Bioinformatics in the School of Medicine, the Duke Translational Medicine Institute (DTMI), the Duke Office of Information Technology (OIT), and the Duke Office of Clinical Research (DOCR) to provide expertise and resources specific to data storage, management and analysis of high-dimensional data types in an efficient manner. Dr. Owzar's overarching goal for these collaborations is to ensure that the DCI Bioinformatics Shared Resource takes full advantage of resources within Duke and avoids duplication of effort and expenditures as it meets the needs of DCI researchers.
The DCI Bioinformatics Shared Resource provides data management and analysis support for traditional microarray platforms (mRNA microarrays and genome-wide DNA arrays) and for next-generation high-throughput assays (DNA-seq and RNA-seq). It also provides support for candidate biomarker studies and cell-based assays including flow cytometry data. The Resource covers every facet of analysis and management of project data, from the design stage through pre-processing and downstream analyses to annotation.
Software. The Bioinformatics Shared Resource adheres to an open-source software model to the fullest extent. To this end, the resource uses GNU/Linux as the operating system for its servers and several of its individual workstations. The R statistical environment, along with the Python and C/C++ programming languages, constitutes the main programming toolkit. R extension packages from the Comprehensive R Archive Network (CRAN) and the Bioconductor project are actively maintained on its servers. The resource also maintains a number of other R extension packages developed by its faculty members, along with developmental packages from R-Forge. Commercial software packages, available through Duke University site-license agreements, include SAS, Matlab, Maple and Mathematica. Linux ports of these software products are available and currently installed. For the production of reproducible reports, the shared resource uses the Sweave, knitr, Python Sphinx and IPython notebook systems.
The Bioinformatics Shared Resource has begun use of the Mercurial source code management (SCM) software for its projects. Staff members work on local repositories and push their changes to a common server.
Assays and Platforms Supported by the Resource. A representative listing of cellular assays and platforms that are supported by the Bioinformatics Shared Resource is provided here.
- Affymetrix microarray and GWAS platforms: The resource maintains an installation of the Affymetrix Power Tools (APT) for pre-processing of Affymetrix arrays. The shared resource also uses extension packages from the Bioconductor project for pre-processing, analysis and annotation of these platforms.
- Illumina microarray and GWAS platforms: The resource maintains a license for the microarray and GWAS modules of the Illumina GenomeStudio software. The resource also uses extension packages from the Bioconductor project for pre-processing, analysis and annotation of these platforms.
- Next Generation Sequencing (NGS) platforms: The following pipelines for Next Generation Sequencing data analysis have been deployed or are undergoing testing.
  - The GATK pipeline for pre-processing and variant calling of DNA-Seq data is in production
  - The GATK-Queue framework (using Scala to design GATK pipelines) is undergoing testing
  - The MuTect pipeline for calling somatic mutations is in production
  - The bowtie2/HTSeq pipeline for pre-processing and analyzing RNA-Seq data is in production
  - The rMATS pipeline for detection of differential alternative splicing of RNA-Seq data is undergoing testing
  - The ANNOVAR toolkit for annotating variants is in production
- Genome/Exome Variation data analysis and ChIP-seq data analysis: GATK has been installed on a single-node production server and is currently under testing.
- Flow Cytometry: To manage flow cytometry data from single and multi-laboratory studies, with the goal of automated analysis, the Resource uses the ReFlow system. Data preprocessing is necessary to coerce heterogeneous flow cytometry data into a consistent structure for machine-based analysis. When dealing with data from multiple laboratories, data preprocessing often becomes a bottleneck for automated analysis because metadata may be encoded differently by each laboratory. ReFlow consists of a relational database backend, a web interface, a user-friendly client and a REST (Representational State Transfer) API for data and metadata transfer. Additionally, a number of packages for management and analysis of flow cytometry data from the Bioconductor project are available.
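The metadata bottleneck described for multi-laboratory flow cytometry data can be illustrated with a short Python sketch. It is not ReFlow code and does not use the ReFlow API. The synonym table, the marker names, and both function names are hypothetical, chosen only to show how differently encoded channel labels might be coerced to one canonical vocabulary before automated analysis.

```python
import re

# Hypothetical lookup: normalized lab-specific label -> canonical marker name.
CANONICAL_MARKERS = {
    "cd3": "CD3",
    "cd4": "CD4",
    "cd8": "CD8",
    "cd8a": "CD8",
    "ifng": "IFN-gamma",
    "ifngamma": "IFN-gamma",
}


def normalize_label(label):
    """Lowercase a channel label and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", label.lower())


def harmonize_channels(channel_labels):
    """Map lab-specific labels to canonical names.

    Returns (mapped, unmapped): a dict from original label to canonical name,
    and a list of labels that need manual curation.
    """
    mapped, unmapped = {}, []
    for label in channel_labels:
        canonical = CANONICAL_MARKERS.get(normalize_label(label))
        if canonical is None:
            unmapped.append(label)
        else:
            mapped[label] = canonical
    return mapped, unmapped
```

Labels that cannot be mapped are returned separately rather than guessed, so that a curator can extend the table before the data enter an automated pipeline.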
- Hardware owned and managed by the Bioinformatics Shared Resource:
  - 64-core Opteron server with 512GB of RAM
  - 32-core Opteron server with 128GB of RAM
  - 48TB of local storage (twenty-four 4TB drives in RAID 10)
  - Quadro K6000 GPU (2880 streaming cores and 12GB of memory)
  - Quadro 2000 GPU (192 streaming cores and 1GB of memory)
- HPC clusters managed by Duke University:
  - Duke Shared Cluster Resource (DSCR) CPU cluster providing 5220 cores distributed over 458 nodes and eleven M2070 Tesla GPU cards
  - Blue Devil GPU cluster consisting of GT2000, GTX275 and Tesla C1060 cards
  - Forty nodes, each with 16 CPU cores and either 128GB or 256GB of RAM, currently providing 530TB of NetApp E-series storage using the IBM General Parallel File System (GPFS)
- Storage nodes:
  - Duke University NetApp SAS, NetApp SATA and Dell SATA drive storage
  - Amazon Glacier storage (AWS)
Hardware. The computing hardware infrastructure of the Bioinformatics Shared Resource consists of dedicated hardware owned and managed by the resource and is further extended by hardware resources maintained by Duke OIT and by the DTMI. Additionally, the Bioinformatics Shared Resource takes advantage of commercial cloud computing resources including Amazon Web Services (AWS).
The personal and server computing resources owned by the Bioinformatics Shared Resource are managed by the DCI Information Systems (DCI IS) Shared Resource. The servers are housed on the 7th floor of Hock Plaza in a secure, temperature and humidity controlled computer room with FM200 fire suppression, UPS and emergency generator power protection. DCI IS provides both protected and DMZ network connections.
The HPC clusters managed by the Bioinformatics Shared Resource operate within Duke Medicine's protected networks, and security, access, and authorization measures have been taken that allow for the analysis of protected health information (PHI). Despite this level of protection, phenotypic data will be anonymized or de-identified whenever possible. Duke computing resources that are not authorized for storing PHI are used exclusively for simulation studies.
Data Storage Options. A crucial and expensive aspect of genomic analysis is access to storage. We provide access to various data storage options to meet project requirements related to storage speed and space. The Bioinformatics Shared Resource offers storage to DCI investigators on its local storage servers for small to moderately sized projects. For large projects, the Bioinformatics Shared Resource uses storage solutions managed by Duke University and Amazon Web Services.
Duke Health Technology Solutions (DHTS) offers a wide variety of storage options that can be chosen based on the specific needs of a project. These include the following four options (pricing as of 01/07/2016):
- EMC VMax – Tier 1
  - Moderate acquisition cost of $4.19 per GB ($4,290 / TB)
  - Characteristics: Integrated large-scale disk array, centralized controller and cache system, ability to replicate between one or more devices, 10+K IOPS, primarily structured data
  - Use case: Database, transaction processing, mission-critical application
  - Connectivity: SAN
- EMC VNX – Tier 2
  - Moderate acquisition cost of $3.01 per GB ($3,082 / TB)
  - Characteristics: Higher capacity (100s of terabytes), high-speed drives (10K to 15K RPM), sequential performance, scale-out design, decreased disk-to-controller ratio or increased sub-system to gain performance, both structured and unstructured data
  - Use case: Application data, transformation and transitional data, Tier 4 thru 5 cache storage
  - Connectivity: SAN, iSCSI, NAS
- EMC Isilon – Tier 3
  - Lower acquisition cost of $0.48 per GB ($492 / TB)
  - Characteristics: Highest capacity drives (1 TB or greater), lower speed drives (less than 10K RPM), higher disk-to-controller ratio, scale-up, hundreds of drives per controller, primarily unstructured data
  - Use case: Access storage, objects, file shares
  - Connectivity: iSCSI, NAS, SAS
- EMC DataDomain – Tier 4 – Used for backups
  - Moderate acquisition cost of $1.58 per GB ($1,618 / TB)
  - Characteristics: Mixture of disk, tape and software, back-up storage product, administrator assistance required for data recall
  - Connectivity: NAS, SAN, server agents
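To compare the tiers for a given project size, the per-GB acquisition prices listed above can be scaled to the required capacity. The Python sketch below is illustrative only. It assumes 1 TB = 1024 GB, which reproduces the quoted per-TB figures to within a dollar. The dictionary keys and the function name are hypothetical.

```python
GB_PER_TB = 1024  # assumption: the quoted per-TB prices correspond to 1024 GB

# Per-GB acquisition prices as listed above (pricing as of 01/07/2016).
COST_PER_GB = {
    "EMC VMax (Tier 1)": 4.19,
    "EMC VNX (Tier 2)": 3.01,
    "EMC Isilon (Tier 3)": 0.48,
    "EMC DataDomain (Tier 4)": 1.58,
}


def acquisition_cost(tier, terabytes):
    """Return the acquisition cost in dollars for `terabytes` of storage on `tier`."""
    return COST_PER_GB[tier] * GB_PER_TB * terabytes


if __name__ == "__main__":
    # Print the cost of a hypothetical 20 TB project on each tier.
    for tier in COST_PER_GB:
        print(f"{tier}: ${acquisition_cost(tier, 20):,.2f}")
```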
Duke University Research Computing offers large data storage resources through the Duke Data Commons (https://rc.duke.edu/data-storage-2/).
For long-term storage of large data files that are not sensitive and do not need to be accessed frequently, the Amazon Glacier system from Amazon Web Services is used. Currently the monthly charge for storage is $0.01 per GB. For 1TB of space this amounts to an annual charge of approximately $120. The Bioinformatics Shared Resource will facilitate the transfer of DCI investigators' data to this resource when feasible.
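The Glacier estimate follows from multiplying the monthly per-GB rate by the stored volume and by twelve months. A minimal Python sketch of that arithmetic is given below. The function name and parameters are hypothetical, and the $120 figure assumes 1 TB = 1000 GB (with 1024 GB per TB the result is slightly higher).

```python
def glacier_annual_cost(terabytes, monthly_rate_per_gb=0.01, gb_per_tb=1000):
    """Annual storage charge in dollars at a flat monthly per-GB rate.

    Assumption: only the storage charge is modeled; retrieval and
    transfer fees are not included.
    """
    return terabytes * gb_per_tb * monthly_rate_per_gb * 12
```

Calling `glacier_annual_cost(1)` gives the annual figure for 1TB under the decimal convention, and `glacier_annual_cost(1, gb_per_tb=1024)` gives it under the binary convention.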
Accessing the Bioinformatics Shared Resource
The services and resources provided by the Bioinformatics Shared Resource are available to all Duke faculty who have been designated as DCI members. The shared resource is, at present, exclusively focused on the research needs of DCI members (other bioinformatics resources at Duke are available for non-DCI members). Currently, all requests for resource services are communicated directly to Dr. Owzar by phone or email. Dr. Owzar evaluates each request and delegates tasks to an appropriate staff or faculty member. Upon receipt of the initial request, Dr. Owzar schedules an in-person or phone meeting with the requesting DCI research team and appropriate shared resource personnel. Presently, the initial response time is less than one business day. Dr. Owzar is responsible for prioritizing the resources, staff and hardware.
Currently, to lay the foundation for long-term scientific collaborations between DCI investigators and shared resource personnel, DCI members are not charged for Bioinformatics Shared Resource services. These long-term collaborations are expected to lead to grant applications and federal and industry contract applications in which shared resource staff and faculty are included as co-investigators. Nor does the Bioinformatics Shared Resource charge DCI members for using its in-house computational hardware, including CPU cycles, GPUs, and local storage; these resources are available to DCI investigators on a 24-7-365 basis through secure access mechanisms (VPN and ssh). As described under Equipment, the resource also heavily leverages other computing resources, including storage, available at Duke and through Amazon Web Services. Any cost incurred for use of the latter resources is passed to the investigator; the Bioinformatics Shared Resource staff assists and trains DCI members interested in those resources at no cost.