Bioinformatics

Overview

The Bioinformatics Shared Resource supports the bioinformatics research needs of DCI members, including their needs for complex genomic and imaging data management, data integration, computing, statistical analysis, and machine learning. The support provided by the Bioinformatics Shared Resource is critical to DCI as it enables its members to analyze and interpret results from experimental and study data to their full potential with rigor and efficiency.

The group supports every facet of analysis of high-dimensional genomic and imaging data, from the design stage to pre-processing, to high-level association analyses with complex phenotypes, to annotation of results. The faculty and staff of the Bioinformatics Shared Resource not only answer DCI members’ data analysis needs but also contribute to writing abstracts and manuscripts as well as grant and contract applications.

Mission

The mission of the Bioinformatics Shared Resource is to provide research support for DCI members that:

Enhances DCI research rigor and reproducibility
Increases collaborations across and among DCI programs, other Duke programs, and external investigators

This is accomplished by adhering to principles such as sound data provenance and statistical inference, literate programming, and reproducible analysis. In addition, whenever existing quantitative methods or computational tools fail to meet the researcher’s needs, the faculty and staff of the Bioinformatics Shared Resource leverage their deep and broad theoretical background and computing expertise to develop methodology or tools customized to solve the problem at hand.

This shared resource provides support to DCI programs and individual laboratories, coordinates institutional efforts in bioinformatics, helps drive the development of biotechnology and pharmaceutical sectors within Duke, and creates synergy between scientific and clinical groups. The importance of the Bioinformatics Shared Resource within DCI and at Duke has grown with the explosion of genomic, proteomic, metabolomic, and other high-throughput data types; these data carry vast potential utility for breakthroughs in clinical and translational research, especially when combined with clinical, imaging, and other scientific data. Investigators spanning scientific disciplines that use high-dimensional bioinformatics data (e.g., genomics, metabolomics, proteomics) can leverage our expertise to increase the quality and efficiency of complex, integrative, collaborative cancer research. Given that the bioinformatics and biostatistics needs of DCI members often intersect, this shared resource is tightly integrated with the Biostatistics Shared Resource and serves as a liaison between DCI members and other DCI cores, notably the Integrated Cancer Genomics (ICG), the Functional Genomics (FG), and the BioRepository and Precision Pathology Center (BRPC) shared resources.

Services

The Bioinformatics Shared Resource serves as a centralized resource for expertise in applied and theoretical cancer bioinformatics, supporting DCI members across the continuum of research, and throughout all stages of an investigation.

Early-stage studies

Grant writing support, including rigorous genomic study designs (e.g., power and sample size calculations using simulation techniques)
Optimal selection of computational and data storage resources at Duke
Data query and analysis of public research data (e.g., from TCGA, dbGaP)
Design of primers for Sanger sequencing validation of breakpoints and fusion transcripts
Support for validation and meta-analyses
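
As an illustration of the simulation-based power and sample size calculations mentioned in the list above, the following is a minimal sketch, not the shared resource's actual procedure. It assumes a simple two-group comparison of a normally distributed measurement with unit variance and uses a large-sample z-test; the function names, the effect size, and the candidate sample sizes are all hypothetical choices made for this example.

```python
import random
from statistics import NormalDist


def mean_and_variance(values):
    """Return the sample mean and the unbiased sample variance."""
    n = len(values)
    m = sum(values) / n
    var = sum((v - m) ** 2 for v in values) / (n - 1)
    return m, var


def simulated_power(n_per_group, effect_size, alpha=0.05, n_sim=2000, seed=1):
    """Estimate the power of a two-sided two-sample z-test by Monte Carlo.

    Assumption: both groups are normal with unit variance, and the treated
    group mean is shifted by `effect_size` standard deviations.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sim):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        m0, v0 = mean_and_variance(control)
        m1, v1 = mean_and_variance(treated)
        z = (m1 - m0) / ((v0 + v1) / n_per_group) ** 0.5
        if abs(z) > z_crit:
            rejections += 1
    # The rejection rate across simulated studies is the power estimate.
    return rejections / n_sim


def smallest_n(effect_size, target_power=0.8, candidates=range(10, 201, 10)):
    """Return the first candidate sample size whose simulated power meets the target."""
    for n in candidates:
        if simulated_power(n, effect_size) >= target_power:
            return n
    return None
```

In practice the data-generating model would be replaced by one that reflects the planned assay (for example, count data with overdispersion and multiple-testing correction), which is where simulation becomes more useful than a closed-form formula.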

Pre-processing, analysis, and annotation of high-throughput sequencing and other assays, including:

DNA-Seq -- Germline, tumor, and cell-free assays based on candidate markers, whole-exome, or whole-genome sequencing
RNA-Seq -- Bulk and single-cell
ChIP-Seq
ATAC-Seq -- Bulk and single-cell
T and B Cell Receptor (TCR/BCR) sequencing -- Bulk and single-cell
Metagenome -- Shotgun and 16S bacterial sequencing
NanoString GeoMx Spatial Transcriptomics and Proteomics Assays
10x Visium Spatial Gene Expression
CRISPR screens and single-guide RNA detection
Flow cytometry
mRNA and genotyping arrays

Identification of:

Methylation
Alternative splicing
Copy number variation
Neopeptide prediction
Feature and variant annotation (e.g., VEP, ANNOVAR)
Associations of genetic and genomic variation with clinical outcome
Novel translocation breakpoints from DNA-Seq data
Novel gene fusion transcripts from RNA-Seq data

Analyses, including integrative analyses

Support for data programming including merging across heterogeneous data sources
Statistical genetics (e.g., candidate SNP, genome-wide association studies, analysis of rare variants, local and global ancestry inference, admixture mapping, haplotype regression)
Development of theoretical and applied methods for rigorous and efficient analysis of complex genomic data
Development of novel statistical methodologies for emerging sequencing technologies or the integration of multiple data types

Late-stage studies

Manuscript writing and review, with emphasis on methodology and results reporting
Reproducible, manuscript-quality figures
Deposition of genomic data into online research databases (e.g., GEO, SRA, or dbGaP)
Follow-up analyses for revisions and reviewer responses

Support for:

Data transfer (e.g., Globus) both within Duke’s IT infrastructure and among external collaborators
Data archiving, including optimal selection of data storage resources at Duke
Web interface and database programming assistance
User training in bioinformatics software and hardware, facilitating the use of computing resources offered by Duke University and available through commercial vendors (e.g., cloud computing)

The shared resource continues to refine existing and develop new workflows for analysis such as:

CRISPR targeted library screens, featuring our bcSeq R package
Single-cell and spatial transcriptomic sequencing
Detection of enrichment or depletion of sgRNAs
Microbiome data, including measures of diversity and dominance of selected microbes
Estimation of global and local genetic ancestry
Haplotype regression
Imaging technologies (e.g., 10x Visium, CODEX, and MIBI)
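
The "measures of diversity and dominance" listed above for microbiome data can be illustrated with a few standard indices. The sketch below is an assumption about which measures might be meant (Shannon diversity, the Gini-Simpson index, and Berger-Parker dominance); the source does not name specific indices, and the taxon counts are made up for the example.

```python
import math


def _proportions(counts):
    """Convert per-taxon read counts to relative abundances, dropping zeros."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    return [c / total for c in counts if c > 0]


def shannon_diversity(counts):
    """Shannon index H = -sum(p * ln p); higher means more diverse."""
    return -sum(p * math.log(p) for p in _proportions(counts))


def gini_simpson(counts):
    """Gini-Simpson index 1 - sum(p^2): chance two random reads differ in taxon."""
    return 1.0 - sum(p * p for p in _proportions(counts))


def berger_parker_dominance(counts):
    """Relative abundance of the single most abundant taxon."""
    return max(_proportions(counts))


# Hypothetical samples: one even community, one dominated by a single taxon.
even_sample = [25, 25, 25, 25]
dominated_sample = [97, 1, 1, 1]
```

For the even sample the Shannon index equals ln(4) and the dominance is 0.25, while the dominated sample has a dominance of 0.97 and a much lower diversity on both indices.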

The shared resource pipelines utilize container technologies, allowing them to be run on local servers, university compute clusters, or cloud service providers (e.g., Amazon Web Services, Microsoft Azure, or Google Cloud).

High-Performance Computing and Storage

Compute
The Bioinformatics Shared Resource leverages both local and cloud computing environments to meet the needs of large-scale omics research.

We have exclusive access to three local compute servers:

40 cores (80 threads) Intel Xeon E5-2698V4 server with 1TB of RAM and 146TB RAID 10 storage array
64-core Opteron 6386 SE server with 512GB of RAM and 44TB RAID 10 storage array
48-core Opteron 6180 SE server with 256GB of RAM and 34TB RAID 10 storage array

We regularly utilize Duke’s cluster computing resources:

The Duke Compute Cluster (DCC) consists of over 30,000 vCPU-cores and 730 GPUs, with underlying hardware from Cisco Systems UCS blades in Cisco chassis. GPU-accelerated nodes are Silicon Mechanics machines with a range of Nvidia GPUs, including high-end “computational” GPUs (V100, P100) and “graphics” GPUs (TitanXP, RTX2080TI). Interconnects are 10 Gbps. General partitions are on Isilon network-attached storage arrays connected at 40 Gbps or 10 Gbps. The cluster provides 1TB of group storage and 10GB for each personal home directory. The cluster also provides 450TB of scratch storage, and archival storage at a cost of $0.08/GB/year. This system may not be used for storage or analysis of sensitive data. See dcc.duke.edu for additional information.
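
For budgeting purposes, the archival rate of 0.08 per GB per year quoted above translates into project costs as in the sketch below. The conversion of 1 TB to 1000 GB and the example data volume are assumptions made for illustration; actual billing terms should be confirmed with the cluster administrators.

```python
ARCHIVE_RATE_PER_GB_YEAR = 0.08  # rate quoted in the text above
GB_PER_TB = 1000  # assumption: decimal terabytes


def archive_cost(terabytes, years, rate=ARCHIVE_RATE_PER_GB_YEAR):
    """Return the archival cost for a data volume held for a number of years."""
    return terabytes * GB_PER_TB * rate * years


# Hypothetical project: 10 TB of sequencing data archived for 3 years.
example_cost = archive_cost(10, 3)
```

Under these assumptions, archiving 10 TB for three years costs 10 x 1000 x 0.08 x 3 = 2400 in the quoted currency unit.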

The HARDAC cluster consists of 1512 physical CPU cores and 15TB of RAM distributed over 60 compute nodes. For computing with high-volume genomics data, HARDAC is equipped with high-performance network interconnects (InfiniBand) and an attached high-performance parallel file system, providing roughly 1.2 petabytes of mass storage. All nodes are interconnected with 56Gbps FDR InfiniBand, and the data transfer node of the cluster is linked to the Duke Health Technology Solutions (DHTS) network through pair-bonded 10Gb Ethernet switches. The attached mass storage runs IBM’s General Parallel File System (GPFS), which is managed through two redundant GPFS NSD server nodes and designed to sustain ~5GB per second average input/output read rate. See this page for additional details.

Finally, the DHTS Azure School of Medicine HPC (DASH) cloud-based cluster can scale up to 13 nodes, including up to 10 “Execute” partition nodes with 32 vCPUs, 256 GB of RAM, 1200 GB of attached SSD temporary storage, and 16,000 Mbps network bandwidth, and up to three “highmem” partition nodes with 96 vCPUs, 672 GB of RAM, 3600 GB of attached SSD temporary storage, and 35,000 Mbps network bandwidth. The available scratch space includes a 2 TiB Lustre Marketplace Filesystem with backend to Azure Blob container storage, and up to 5 PB of storage.
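
The per-node figures above imply an aggregate capacity for DASH when both partitions are scaled to their maximum. The following sketch simply tabulates the numbers quoted in the paragraph; the class and field names are hypothetical and do not correspond to any DASH configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    name: str
    max_nodes: int
    vcpus_per_node: int
    ram_gb_per_node: int
    ssd_gb_per_node: int


# Values copied from the description of the DASH cluster above.
DASH_PARTITIONS = [
    Partition("Execute", max_nodes=10, vcpus_per_node=32,
              ram_gb_per_node=256, ssd_gb_per_node=1200),
    Partition("highmem", max_nodes=3, vcpus_per_node=96,
              ram_gb_per_node=672, ssd_gb_per_node=3600),
]


def aggregate_capacity(partitions):
    """Sum nodes, vCPUs, RAM, and temporary SSD across fully scaled partitions."""
    return {
        "nodes": sum(p.max_nodes for p in partitions),
        "vcpus": sum(p.max_nodes * p.vcpus_per_node for p in partitions),
        "ram_gb": sum(p.max_nodes * p.ram_gb_per_node for p in partitions),
        "ssd_gb": sum(p.max_nodes * p.ssd_gb_per_node for p in partitions),
    }


dash_totals = aggregate_capacity(DASH_PARTITIONS)
```

At full scale this gives 13 nodes, 608 vCPUs, 4576 GB of RAM, and 22,800 GB of attached temporary SSD storage.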

Storage
To meet the high-capacity storage demands of high-throughput sequencing data, the Bioinformatics Shared Resource integrates its workflows with and promotes the adoption of DHTS-supported cloud storage, including Azure storage containers and Amazon Web Services (AWS) Simple Storage Service (S3). Both services offer secure data storage that automatically expands to match our needs and the needs of our collaborators, both can be accessed directly from all active compute environments, and both utilize intelligent storage tiering, allowing them to also serve as study data archives.

Software
The Bioinformatics Shared Resource adheres to the principles of sound data provenance, literate programming, and reproducible analysis. To this end, we utilize the open-source software model to the fullest extent possible. The resource uses GNU/Linux as the operating system for its servers and individual workstations.

The R statistical environment, along with the Python and C/C++ programming languages, constitutes the main programming toolkit. R extension packages from the Comprehensive R Archive Network (CRAN) and Bioconductor project are actively maintained on our servers. The resource also maintains several other R extension packages developed by its faculty members. Duke University site-license agreements provide access to commercial software packages, including SAS, MATLAB, Maple, and Mathematica. Linux ports of these software products are available and currently installed.

For the production of reproducible reports, the shared resource uses the knitr, RMarkdown, and Jupyter notebook systems. Analysis code and software pipelines are maintained under strict source code management using Duke’s internal GitLab repository, managed by OIT. The GitLab infrastructure includes functionality for automated container building, which further supports the development of the shared resource’s preprocessing and analysis pipelines. The associated repositories can then be made public to accompany the manuscript publication.

We are actively working to translate our suite of pipelines to the Nextflow DSL, utilizing Singularity software containers. The container software environment allows for the standardization and portability of workflows while still allowing them to be optimized for the available computational resources, as we work to both improve workflow efficiency and reduce computing costs. This transition will ensure total portability of our pipelines across computing environments. Utilization of these tools, in addition to the deposition of source data into public repositories (e.g., GEO, SRA or dbGaP), ensures end-to-end reproducibility of all of our statistical analyses.
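
One small building block of the data provenance practice described in this section is recording checksums of input files so that a later analysis can confirm it ran on the same data. The sketch below is an illustrative assumption about how this could be done with the Python standard library; it is not taken from the shared resource's pipelines, and the file name and contents in `demo` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Map each input file path to its checksum at the time of analysis."""
    return {str(p): file_sha256(p) for p in sorted(map(Path, paths))}


def changed_files(manifest):
    """List files whose current checksum no longer matches the manifest."""
    return [name for name, digest in manifest.items()
            if file_sha256(name) != digest]


def demo():
    """Build a manifest for a toy FASTQ file, then detect a modification."""
    with tempfile.TemporaryDirectory() as tmp:
        fastq = Path(tmp) / "sample_A.fastq"  # hypothetical input file
        fastq.write_text("@read1\nACGT\n+\nIIII\n")
        manifest = build_manifest([fastq])
        unchanged = changed_files(manifest)
        fastq.write_text("@read1\nACGA\n+\nIIII\n")  # simulate altered data
        modified = changed_files(manifest)
        return manifest, unchanged, modified
```

A manifest like this could be committed alongside the analysis code so that the repository records exactly which inputs produced a given set of results.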

Accessing the Bioinformatics Shared Resource

The services and resources provided by the Bioinformatics Shared Resource are available to all Duke faculty who have been designated as DCI members. The shared resource is, at present, exclusively focused on the research needs of DCI members; other bioinformatics resources at Duke are available for non-DCI members. Requests for resource services can be sent to dcibioinformatics@duke.edu.

Upon receipt of the initial request, Dr. Owzar schedules an in-person or phone meeting with the requesting DCI research team and appropriate shared resource personnel. Presently, the initial response time is less than one business day. Dr. Owzar is responsible for prioritizing the resources, staff, and hardware.

Services and Fees

The Bioinformatics Shared Resource is not a fee-for-service data analysis core. Instead, the shared resource seeks to establish externally funded scientific collaborations with DCI members. The Bioinformatics Shared Resource is able to provide grant writing support, including rigorous genomic study designs (e.g., power and sample size calculations using simulation techniques) and budget estimations (including computing, data storage, and bioinformatic staff and faculty effort) in pursuit of these collaborations.

The first step in establishing a new collaboration with the Bioinformatics Shared Resource is for the principal investigator to request a set of initial meetings with the shared resource leadership and staff to discuss the scientific objectives and scope of the proposed project. These meetings provide the opportunity for the shared resource faculty and staff to learn the background and objectives of the project, and for the investigators to learn more about the resources and expertise of the shared resource. These meetings will also help determine whether resources or expertise from other DCI shared resources (for example, the DCI Biostatistics Shared Resource) need to be included. There are no charges for these initial discussions. The Bioinformatics Shared Resource faculty can also advise on sequencing technologies, statistical methodology, and computing tools and environments. This support is generally provided at no charge to DCI members.

There are limited faculty, staff, and computing resources available for conducting initial analyses during the pre-award stage (e.g., to develop preliminary data for a grant application). If these initial resources are deemed to be insufficient considering the scope of the requisite analyses, the Bioinformatics Shared Resource will assist in developing a budget to conduct the requisite analyses.

The costs of provisioning and using cluster resources and cloud storage from Duke Health Technology Solutions (DHTS) are charged directly to DCI members.

Contact

Kouros Owzar, Director
Alex Sibley, Manager

Location

Hock Plaza
2424 Erwin Road, Suite 8113
Durham, NC 27705

This page was reviewed on 09/27/2023