The Bioinformatics Shared Resource, a core function of Duke Cancer Institute (DCI), supports the bioinformatics research needs of DCI members, including complex genomic and imaging data management, data integration, computing, statistical analysis, and machine learning. This support is critical to DCI, as it enables members to analyze and interpret experimental and study data to their full potential with rigor and efficiency. The group supports every facet of high-dimensional genomic and imaging data analysis, from the design stage, to pre-processing, to high-level association analyses with complex phenotypes, to annotation of results. The faculty and staff of the Bioinformatics Shared Resource not only meet DCI members’ data analysis needs, but also contribute to writing abstracts and manuscripts as well as grant and contract applications.
It is the mission of the Bioinformatics Shared Resource to provide research support for DCI members that: 1) enhances DCI research rigor and reproducibility and 2) increases collaborations across and among DCI programs, other Duke programs, and external investigators. This is accomplished by adhering to principles such as sound data provenance and statistical inference, literate programming, and reproducible analysis. In addition, whenever existing quantitative methods or computational tools fail to meet researchers’ needs, the faculty and staff of the Bioinformatics Shared Resource leverage their deep and broad theoretical background and computing expertise to develop methodology or tools customized to solve the problem at hand.
This shared resource provides support to DCI programs and individual laboratories, coordinates institutional efforts in bioinformatics, helps drive the development of biotechnology and pharmaceutical sectors within Duke, and creates synergy between scientific and clinical groups. The importance of the Bioinformatics Shared Resource within DCI and at Duke has been amplified by the explosion of genomic, proteomic, metabolomic, and other high-throughput data types; these data carry vast potential utility for breakthroughs in clinical and translational research, especially when combined with clinical, imaging, and other scientific data. Investigators spanning scientific disciplines that use high-dimensional bioinformatics data (e.g., genomics, metabolomics, proteomics) can leverage our expertise to increase the quality and efficiency of complex, integrative, collaborative cancer research.
Given that the bioinformatics and biostatistics needs of DCI members often intersect, this shared resource is tightly integrated with the Biostatistics Shared Resource and serves as a liaison between DCI members and other DCI cores, notably the Integrated Cancer Genomics (ICG), the Functional Genomics (FG), and the BioRepository and Precision Pathology Center (BRPC) shared resources.
- Pre-processing, analysis, and annotation of high-throughput sequencing assays:
  - Germline, tumor, and cell-free DNA-Seq (candidate marker, whole-exome, or whole-genome)
  - Bulk and single-cell RNA-seq
  - Bulk and single-cell ATAC-seq
  - Bulk and single-cell TCR/BCR sequencing
  - 16S and metagenomic microbiome sequencing
  - CRISPR and sgRNA screens
- Flow cytometry
- mRNA and genotyping arrays
- Neopeptide prediction
- Alternative splicing
- Copy number variation
- Support for feature and variant annotation (e.g., VEP, ANNOVAR)
- Support for statistical downstream analyses (e.g., modeling associations of genetic and genomic variation with clinical outcome)
- Support for integrative genomic analyses
- Support for statistical genetics (e.g., candidate SNP, genome-wide association studies, analysis of rare variants, local and global ancestry inference, admixture mapping, haplotype regression)
- Data query and analysis of public research data (e.g., from TCGA, dbGaP), and assistance with depositing genomic data into research databases (e.g., GEO, SRA or dbGaP)
- Manuscript writing and review
- Grant writing support, including rigorous genomic study designs (e.g., power and sample size calculations using simulation techniques)
- Development of theoretical and applied methods for rigorous and efficient analysis of complex genomic data
- Management and provision of high-performance computing (HPC) resources including CPU and GPU computing, large data storage, and cloud computing
- Data programming including merging across heterogeneous data sources for integrative genomic analysis
- Web-interface and database programming
- User training in bioinformatics software and hardware and facilitating the use of computing resources offered by Duke University and available through commercial vendors (e.g., cloud computing)
- Development of novel statistical methodologies for emerging sequencing technologies or the integration of multiple data types
- Support for validation studies
- Design of primers for Sanger sequencing validation of breakpoints and fusion transcripts
- Identification of novel translocation breakpoints from DNA-Seq data
- Identification of novel gene fusion transcripts from RNA-Seq data
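The simulation-based power and sample-size calculations noted above can be sketched in a few lines; the two-arm design, unit-variance normal outcomes, and normal-approximation z-test below are illustrative assumptions rather than a prescribed shared-resource workflow:

```python
import random
import statistics

def simulated_power(n_per_arm, effect_size, n_sims=2000, z_crit=1.96, seed=1):
    """Estimate the power of a two-arm comparison of means by simulation.

    Draws `n_sims` trials with unit-variance normal outcomes and counts how
    often a normal-approximation z-test rejects at the two-sided 0.05 level.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible estimate
    rejections = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        treated = [rng.gauss(effect_size, 1.0) for _ in range(n_per_arm)]
        se = ((statistics.variance(control)
               + statistics.variance(treated)) / n_per_arm) ** 0.5
        z = (statistics.mean(treated) - statistics.mean(control)) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sims
```

For example, `simulated_power(25, 0.8)` recovers the textbook result that roughly 25 subjects per arm give about 80% power to detect a standardized effect of 0.8; sweeping `n_per_arm` upward until the estimate crosses a target power yields a sample-size estimate.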
The shared resource provides multiple pipelines for processing and analysis of high-throughput sequencing (HTS) technologies. The pipelines utilize container technologies, allowing them to be run on local servers or university compute clusters, with a view towards compatibility with commercial cloud service providers (e.g., Amazon Web Services, Microsoft Azure, or Google Cloud). When available, the pipelines leverage container management software to automate the allocation of computing resources for the workflow. Additionally, the shared resource is now collaborating with Duke IT to optimize these pipelines for use on DHTS’s Virtual Private Cloud (Duke VPC) hosted on Amazon Web Services (AWS). Integration with the Duke Center for Genomic and Computational Biology (GCB) HARDAC cluster and Duke VPC has allowed the shared resource to adopt the Globus data transfer service as a secure and efficient means for moving large data sets, both within Duke’s IT infrastructure and with external collaborators. The Bioinformatics Shared Resource also leverages Duke’s internal GitLab repository, managed by OIT, to house its extensive source code management repositories, which are used to ensure reproducibility of analyses. The GitLab infrastructure includes functionality for automated container building, which further supports the development of the shared resource’s preprocessing and analysis pipelines.
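As an illustration of how a containerized pipeline step stays portable across local servers, cluster, and cloud, the sketch below assembles a container invocation; the `docker` front end, the mount point, and any image names used with it are assumptions of this example, and production pipelines would typically be driven by a workflow manager:

```python
import shlex

def containerized_step(image, command, data_dir, workdir="/data"):
    """Assemble the argument list for running one pipeline step in a container.

    The same image and command run unchanged wherever the container
    runtime is available, which is what makes the pipelines portable.
    """
    return [
        "docker", "run", "--rm",        # remove the container when the step ends
        "-v", f"{data_dir}:{workdir}",  # bind-mount the run's data directory
        "-w", workdir,                  # execute the step inside the mount
        image,
    ] + shlex.split(command)
```

For example, `containerized_step("some/qc-image:1.0", "qc-tool sample_R1.fastq.gz", "/scratch/run42")` (both names hypothetical) yields a command list that can be handed to `subprocess.run`.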
The shared resource continues to refine existing and develop new workflows for analysis such as:
- CRISPR targeted library screens, featuring our bcSeq R package
- Single-cell and spatial transcriptomic sequencing
- Detection of enrichment or depletion of sgRNAs
- Microbiome data, including measures of diversity and dominance of selected microbes
- Estimation of global and local genetic ancestry
- Haplotype regression
- Imaging technologies (e.g., CODEX and MIBI)
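As one concrete example from the microbiome workflow above, a diversity measure such as the Shannon index reduces to a few lines; the function below is a minimal sketch operating on a vector of per-taxon read counts, not the shared resource's production implementation:

```python
import math

def shannon_diversity(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over observed taxa.

    `counts` holds per-taxon read counts for one sample; zero counts
    contribute nothing to the sum by convention.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("sample contains no observations")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)
```

A perfectly even sample of four taxa gives the maximum value ln(4), while a sample dominated by a single taxon gives 0, which is how such indices expose the dominance of selected microbes.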
Hardware: In support of DCI research, the Bioinformatics Shared Resource has exclusive access to three local compute servers:
- 40-core (80-thread) Intel Xeon E5-2698 v4 server with 1TB of RAM and a 146TB RAID 10 storage array
- 64-core Opteron 6386 SE server with 512GB of RAM and a 44TB RAID 10 storage array
- 48-core Opteron 6180 SE server with 256GB of RAM and a 34TB RAID 10 storage array
The DCC cluster consists of over 30,000 vCPU cores and 730 GPUs, with underlying hardware from Cisco Systems UCS blades in Cisco chassis. GPU-accelerated nodes are Silicon Mechanics systems with a range of NVIDIA GPUs, including high-end “computational” GPUs (V100, P100) and “graphics” GPUs (Titan XP, RTX 2080 Ti). Node interconnects are 10 Gbps, and general storage partitions reside on Isilon network-attached storage arrays connected at 40 Gbps or 10 Gbps. The cluster provides 1TB of group storage and 10GB for each personal home directory, along with 450TB of scratch storage and archival storage at a cost of $0.08/GB/year. This system may not be used for storage or analysis of sensitive data. See https://rc.duke.edu/dcc/ and https://rc.duke.edu/dcc/cluster-storage/ for additional information.
The HARDAC cluster consists of 1512 physical CPU cores and 15TB of RAM distributed over 60 compute nodes. For computing with high-volume genomics data, HARDAC is equipped with high-performance network interconnects (InfiniBand) and an attached high-performance parallel file system providing roughly 1.2 petabytes of mass storage. All nodes are interconnected with 56Gbps FDR InfiniBand, and the data transfer node of the cluster is linked to the Duke Health Technology Solutions (DHTS) network through pair-bonded 10 Gb Ethernet switches. The attached mass storage runs IBM’s General Parallel File System (GPFS), which is managed through two redundant GPFS NSD server nodes and designed to sustain a ~5GB per second average input/output read rate. See https://genome.duke.edu/cores-and-services/computational-solutions/compute-environments-genomics for additional details.
Software: The Bioinformatics Shared Resource adheres to an open-source software model to the fullest extent possible. To this end, the resource uses GNU/Linux as the operating system for its servers and several of its individual workstations. The R statistical environment, along with the Python and C/C++ programming languages, constitutes the main programming toolkit. R extension packages from the Comprehensive R Archive Network (CRAN) and the Bioconductor project are actively maintained on its servers. The resource also maintains a number of other R extension packages developed by its faculty members, along with developmental packages from R-Forge. Duke University site-license agreements provide access to commercial software packages, including SAS, MATLAB, Maple, and Mathematica; Linux ports for these products are available and currently installed. To create reproducible reports, the shared resource uses the knitr, Python Sphinx, and Jupyter notebook systems.
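To make the reproducible-report idea concrete, a report can end with a record of the software environment it was built under; the sketch below is an illustrative Python analogue of R's `sessionInfo()`, with the helper name and output format being assumptions of this example rather than the shared resource's actual tooling:

```python
import platform
import sys

def environment_footer(packages=()):
    """Return report footer lines recording the runtime environment.

    Each entry in `packages` is a module name whose version (when the
    module exposes `__version__`) is appended below the interpreter line.
    """
    lines = [f"Python {platform.python_version()} "
             f"on {platform.system()} {platform.machine()}"]
    for name in packages:
        module = sys.modules.get(name) or __import__(name)
        lines.append(f"{name} {getattr(module, '__version__', 'unknown')}")
    return lines
```

Appending these lines to a knitr or Jupyter report records exactly which interpreter and package versions produced the results, a small but essential piece of end-to-end reproducibility.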
Accessing the Bioinformatics Shared Resource
The services and resources provided by the Bioinformatics Shared Resource are available to all Duke faculty who have been designated as DCI members, as the shared resource is, at present, exclusively focused on the research needs of DCI members (other bioinformatics resources at Duke are available for non-DCI members). Currently, all requests for resource services are communicated directly to Dr. Owzar by phone or email. Dr. Owzar evaluates each request and delegates tasks to an appropriate staff or faculty member. Upon receipt of the initial request, Dr. Owzar schedules an in-person or phone meeting with the requesting DCI research team and appropriate shared resource personnel. Presently, the initial response time is less than one business day. Dr. Owzar is responsible for prioritizing the shared resource’s staff and hardware.
Service and Fees
The Bioinformatics Shared Resource is not a fee-for-service data analysis core. Instead, the shared resource seeks to establish externally funded scientific collaborations with DCI members. In pursuit of these collaborations, the Bioinformatics Shared Resource can provide grant writing support, including rigorous genomic study designs (e.g., power and sample size calculations using simulation techniques) and budget estimations (including computing, data storage, and bioinformatics staff and faculty effort).
The first step in establishing a new collaboration with the Bioinformatics Shared Resource is for the principal investigator to request a set of initial meetings with the shared resource leadership and staff to discuss the scientific objectives and scope of the proposed project. These meetings provide the opportunity for the shared resource faculty and staff to learn the background and objectives of the project, and for the investigators to learn more about the resources and expertise of the shared resource. These meetings will also help with determining if resources or expertise from other DCI shared resources, for example the DCI Biostatistics Shared Resource, need to be included. There are no charges for these initial discussions. The Bioinformatics Shared Resource faculty are also highly knowledgeable in providing advice on sequencing technologies, statistical methodology, and computing tools and environments. This support is generally provided at no charge to DCI members.
There are limited faculty, staff, and computing resources available for conducting initial analyses during the pre-award stage (e.g., to develop preliminary data for a grant application). If these initial resources are deemed to be insufficient considering the scope of the requisite analyses, the Bioinformatics Shared Resource will assist in developing a budget to conduct the requisite analyses.
The costs of provisioning and using cluster resources from the Duke Center for Genomic and Computational Biology (GCB) and cloud resources from the Duke Health Technology Solutions (DHTS) Virtual Private Cloud (VPC) are charged directly to DCI members by these service providers.
It should be noted that the support of the Bioinformatics Shared Resource is limited to DCI members for their cancer-focused research.
Data and Code Availability
The Bioinformatics Shared Resource adheres to the principles of sound data provenance, literate programming, and reproducible analysis. To this end, we utilize the open-source software model to the fullest extent possible. The resource uses GNU/Linux as the operating system for its servers and individual workstations. The R Statistical environment, along with the Python and C/C++ programming languages, constitute the main programming toolkit. R extension packages from the Comprehensive R Archive Network (CRAN) and Bioconductor project are actively maintained on our servers. The resource also maintains several other R extension packages developed by its faculty members. For the production of reproducible reports, the shared resource uses the knitr and Jupyter notebook systems. Analysis code and software pipelines are maintained under strict source code management, and the associated repositories are made public to accompany manuscript publication. This, in addition to deposition of source data into public repositories (e.g., GEO, SRA or dbGaP), ensures end-to-end reproducibility of all statistical analyses.