Journal Article A comparison of clustering models for inference of T cell receptor antigen specificity

The vast potential sequence diversity of TCRs and their ligands has presented an historic barrier to computational prediction of TCR epitope specificity, a holy grail of quantitative immunology. One common approach is to cluster sequences together, on the assumption that similar receptors bind similar epitopes. Here, we provide the first independent evaluation of widely used clustering algorithms for TCR specificity inference, observing some variability in predictive performance between models, and marked differences in scalability. Despite these differences, we find that different algorithms produce clusters with high degrees of similarity for receptors recognising the same epitope. Our analysis strengthens the case for use of clustering models to identify signals of common specificity from large repertoires, whilst highlighting scope for improvement of complex models over simple comparators.
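The "simple comparators" mentioned above can be pictured with a minimal sketch. It assumes each receptor is represented only by its CDR3 amino-acid string, and that two receptors are linked when their equal-length sequences differ at no more than one position. The function names, the threshold and the example sequences are illustrative assumptions, not the models evaluated in the article.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def cluster_cdr3(seqs, max_dist=1):
    """Single-linkage clusters of equal-length sequences within max_dist."""
    seqs = list(dict.fromkeys(seqs))  # de-duplicate while keeping order
    parent = {s: s for s in seqs}     # union-find forest

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    # Only sequences of the same length are compared (assumption of this sketch).
    by_length = defaultdict(list)
    for s in seqs:
        by_length[len(s)].append(s)
    for group in by_length.values():
        for a, b in combinations(group, 2):
            if hamming(a, b) <= max_dist:
                parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for s in seqs:
        clusters[find(s)].append(s)
    return sorted(clusters.values(), key=len, reverse=True)


if __name__ == "__main__":
    # Made-up CDR3-like strings, for illustration only.
    print(cluster_cdr3(["CASSLGQYF", "CASSLGQFF", "CASSPGQFF", "CAWSVGNEQF"]))
```

The pairwise comparison is quadratic in the number of sequences of each length, which is one reason scalability differs so much between approaches on large repertoires.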

Journal Article Processes in DNA-damage response from a whole-cell multi-omics perspective

Technological advances have made it feasible to collect multi-condition multi-omic time-courses of cellular response to perturbation, but the complexity of these datasets impedes discovery due to challenges in data management, analysis, visualization, and interpretation. Here, we report a whole-cell mechanistic analysis of HL-60 cellular response to bendamustine. We integrate both enrichment and network analysis to show the progression of DNA-damage and programmed cell death over time in molecular, pathway, and process-level detail using an interactive analysis framework for multi-omics data. Our framework, Mechanism of Action Generator Involving NEtwork analysis (MAGINE), automates network construction and enrichment analysis across multiple samples and platforms, which can be integrated into our annotated gene-set network (AGN) to combine the strengths of networks and ontology-driven analysis. Taken together, our work demonstrates how multi-omics integration can be used to explore signaling processes at various resolutions and demonstrates multi-pathway involvement beyond the canonical bendamustine mechanism.

Journal Article Microbench: Automated metadata management for systems biology benchmarking and reproducibility in Python

Motivation: Computational systems biology analyses typically make use of multiple software packages and their dependencies, which are often run across heterogeneous compute environments. This can introduce differences in performance and reproducibility. Capturing metadata (e.g. package versions, GPU model) currently requires repetitious code and is difficult to store centrally for analysis. Even where virtual environments and containers are used, updates over time mean that versioning metadata should still be captured within analysis pipelines to guarantee reproducibility. Results: Microbench is a simple and extensible Python package to automate metadata capture to a file or Redis database. Captured metadata can include execution time, software package versions, environment variables, hardware information, Python version and more, with plugins. We present three case studies demonstrating Microbench usage to benchmark code execution and examine environment metadata for reproducibility purposes. Availability and implementation: Install from the Python Package Index using pip install microbench. Source code is available from
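To make the idea of automated metadata capture concrete, here is a minimal standard-library sketch of a decorator that appends one JSON record per function call. It is not the Microbench API: the names `capture_metadata` and `_package_version`, the record fields and the output path are all assumptions made for illustration.

```python
import functools
import json
import platform
import time
from importlib import metadata


def _package_version(name):
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def capture_metadata(outfile, packages=()):
    """Decorator (hypothetical helper) that logs run metadata as JSON lines."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Written even if func raises, so failed runs are recorded too.
                record = {
                    "function": func.__name__,
                    "run_seconds": time.perf_counter() - start,
                    "python_version": platform.python_version(),
                    "hostname": platform.node(),
                    "package_versions": {
                        p: _package_version(p) for p in packages
                    },
                }
                with open(outfile, "a") as fh:
                    fh.write(json.dumps(record) + "\n")
        return wrapper
    return decorator


if __name__ == "__main__":
    @capture_metadata("benchmarks.jsonl", packages=("pip",))
    def work(n):
        return sum(i * i for i in range(n))

    work(100_000)
```

Appending one JSON object per line keeps the log easy to load later for analysis across machines and environments.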

Journal Article MuSyC is a consensus framework that unifies multi-drug synergy metrics for combinatorial drug discovery

Drug combination discovery depends on reliable synergy metrics but no consensus exists on the correct synergy criterion to characterize combined interactions. The fragmented state of the field confounds analysis, impedes reproducibility, and delays clinical translation of potential combination treatments. Here we present a mass-action based formalism to quantify synergy. With this formalism, we clarify the relationship between the dominant drug synergy principles, and present a mapping of commonly used frameworks onto a unified synergy landscape. From this, we show how biases emerge due to intrinsic assumptions which hinder their broad applicability and impact the interpretation of synergy in discovery efforts. Specifically, we describe how traditional metrics mask consequential synergistic interactions, and contain biases dependent on the Hill-slope and maximal effect of single-drugs. We show how these biases systematically impact synergy classification in large combination screens, potentially misleading discovery efforts. Thus the proposed formalism can provide a consistent, unbiased interpretation of drug synergy, and accelerate the translatability of synergy studies.
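The Hill slope and maximal effect referred to above are parameters of the single-drug dose-response curve. The sketch below shows a four-parameter Hill curve and, as one example of a traditional null model, the Bliss-independence expectation for a combination. The choice of Bliss, the parameter names and the numbers are assumptions of this sketch, not the formalism proposed in the article.

```python
def hill(dose, e0, emax, c, h):
    """Four-parameter Hill curve.

    e0: effect with no drug; emax: effect at saturating dose;
    c: dose giving the half-maximal effect; h: Hill slope (h > 0).
    """
    return emax + (e0 - emax) / (1.0 + (dose / c) ** h)


def bliss_expected(e1, e2):
    """Bliss expectation when effects are fractions of cells unaffected (0 to 1)."""
    return e1 * e2


if __name__ == "__main__":
    # Illustrative parameters only: drug 1 has an incomplete maximal effect.
    e1 = hill(dose=1.0, e0=1.0, emax=0.2, c=1.0, h=2.0)
    e2 = hill(dose=0.5, e0=1.0, emax=0.0, c=0.5, h=1.0)
    print(e1, e2, bliss_expected(e1, e2))
```

Comparing an observed combination effect with such an expectation gives a single excess value, which illustrates how changes in potency and changes in maximal effect can be conflated by a one-number metric.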

Journal Article Selective Inhibition of JAK1 Primes STAT5-Driven Human Leukemia Cells for ATRA-Induced Differentiation

Background: All-trans retinoic acid (ATRA), a derivative of vitamin A, has been successfully used as a therapy to induce differentiation in M3 acute promyelocytic leukemia (APML), and has led to marked improvement in outcomes. However, previous attempts to use ATRA clinically in non-APML have been underwhelming, likely due to persistent signaling through other oncogenic drivers. Dysregulated JAK/STAT signaling is known to drive several hematologic malignancies, and targeting JAK1 and JAK2 with the JAK1/JAK2 inhibitor ruxolitinib has led to improvement in survival in primary myelofibrosis and alleviation of vasomotor symptoms and splenomegaly in polycythemia vera and myelofibrosis. Objective: While dose-dependent anemia and thrombocytopenia limit the use of JAK2 inhibition, selectively targeting JAK1 has been explored as a means to suppress inflammation and STAT-associated pathologies related to neoplastogenesis. The objective of this study is to employ JAK1 inhibition (JAK1i) in the presence of ATRA as a potential therapy in non-M3 acute myeloid leukemia (AML). Methods: Efficacy of JAK1i using INCB52793 was assessed by changes in cell cycle and apoptosis in treated AML cell lines. Transcriptomic and proteomic analysis evaluated effects of JAK1i. Synergy between JAK1i and ATRA was assessed in cell lines in vitro while efficacy in vivo was assessed by tumor reduction in MV-4-11 cell line-derived xenografts. Results: Here we describe novel synergistic activity between JAK1 inhibition and ATRA in non-M3 leukemia. Transcriptomic and proteomic analysis confirmed structural and functional changes related to maturation while in vivo combinatory studies revealed significant decreases in leukemic expansion. Conclusions: JAK1i + ATRA leads to decreases in cell cycle followed by myeloid differentiation and cell death in human leukemias. These findings highlight potential uses of ATRA-based differentiation therapy of non-M3 human leukemia.

Journal Article Thunor: Visualization and Analysis of High-Throughput Dose-response Datasets

High-throughput cell proliferation assays to quantify drug-response are becoming increasingly common and powerful with the emergence of improved automation and multi-time point analysis methods. However, pipelines for analysis of these datasets that provide reproducible, efficient, and interactive visualization and interpretation are sorely lacking. To address this need, we introduce Thunor, an open-source software platform to manage, analyze, and visualize large, dose-dependent cell proliferation datasets. Thunor supports both end-point and time-based proliferation assays as input. It provides a simple, user-friendly interface with interactive plots and publication-quality images of cell proliferation time courses, dose–response curves, and derived dose–response metrics, e.g. IC50, including across datasets or grouped by tags. Tags are categorical labels for cell lines and drugs, used for aggregation, visualization, and statistical analysis, e.g. cell line mutation or drug class/target pathway. A graphical plate map tool is included to facilitate plate annotation with cell lines, drugs, and concentrations upon data upload. Datasets can be shared with other users via point-and-click access control. We demonstrate the utility of Thunor to examine and gain insight from two large drug response datasets: a large, publicly available cell viability database and an in-house, high-throughput proliferation rate dataset. Thunor is available from

Journal Article SynLeGG: analysis and visualization of multiomics data for discovery of cancer 'Achilles Heels' and gene function relationships

Achilles’ heel relationships arise when the status of one gene exposes a cell's vulnerability to perturbation of a second gene, such as chemical inhibition, providing therapeutic opportunities for precision oncology. SynLeGG identifies and visualizes mutually exclusive loss signatures in ‘omics data to enable discovery of genetic dependency relationships (GDRs) across 783 cancer cell lines and 30 tissues. While there is significant focus on genetic approaches, transcriptome data has advantages for investigation of GDRs and remains relatively underexplored. SynLeGG depends upon the MultiSEp algorithm for unsupervised assignment of cell lines into gene expression clusters, which provide the basis for analysis of CRISPR scores and mutational status in order to propose candidate GDRs. Benchmarking against SynLethDB demonstrates favourable performance for MultiSEp against competing approaches, finding significantly higher area under the Receiver Operator Characteristic curve and between 2.8-fold and 8.5-fold greater coverage. In addition to pan-cancer analysis, SynLeGG offers investigation of tissue-specific GDRs and recovers established relationships, including synthetic lethality for SMARCA2 with SMARCA4. Proteomics, Gene Ontology, protein-protein interactions and paralogue information are provided to assist interpretation and candidate drug target prioritization. SynLeGG predictions are significantly enriched in dependencies validated by a recently published CRISPR screen.

Journal Article Programmatic modeling for biological systems

Computational modeling has become an established technique to encode mathematical representations of cellular processes and gain mechanistic insights that drive testable predictions. These models are often constructed using graphical user interfaces or domain-specific languages, with community standards used for interchange. Models undergo steady state or dynamic analysis, which can include simulation and calibration within a single application, or transfer across various tools. Here, we describe a novel programmatic modeling paradigm, whereby modeling is augmented with software engineering best practices. We focus on Python – a popular programming language with a large scientific package ecosystem. Models can be encoded as programs, adding benefits such as modularity, testing, and automated documentation generators, while still being extensible and exportable to standardized formats for use with external tools if desired. Programmatic modeling is a key technology to enable collaborative model development and enhance dissemination, transparency, and reproducibility.

Journal Article Functional Transcription Factor Target Networks Illuminate Control of Epithelial Remodelling

Cell identity is governed by gene expression, regulated by transcription factor (TF) binding at cis-regulatory modules. Decoding the relationship between TF binding patterns and gene regulation is nontrivial, remaining a fundamental limitation in understanding cell decision-making. We developed the NetNC software to predict functionally active regulation of TF targets; demonstrated on nine datasets for the TFs Snail, Twist, and modENCODE Highly Occupied Target (HOT) regions. Snail and Twist are canonical drivers of epithelial to mesenchymal transition (EMT), a cell programme important in development, tumour progression and fibrosis. Predicted “neutral” (non-functional) TF binding always accounted for the majority (50% to 95%) of candidate target genes from statistically significant peaks and HOT regions had higher functional binding than most of the Snail and Twist datasets examined. Our results illuminated conserved gene networks that control epithelial plasticity in development and disease. We identified new gene functions and network modules including crosstalk with notch signalling and regulation of chromatin organisation, evidencing networks that reshape Waddington’s epigenetic landscape during epithelial remodelling. Expression of orthologous functional TF targets discriminated breast cancer molecular subtypes and predicted novel tumour biology, with implications for precision medicine. Predicted invasion roles were validated using a tractable cell model, supporting our approach.

Journal Article ACDC: Automated Cell Detection and Counting for Time-Lapse Fluorescence Microscopy

Advances in microscopy imaging technologies have enabled the visualization of live-cell dynamic processes using time-lapse microscopy imaging. However, modern methods exhibit several limitations related to the training phases and to time constraints, hindering their application in laboratory practice. In this work, we present a novel method, named Automated Cell Detection and Counting (ACDC), designed for activity detection of fluorescently labeled cell nuclei in time-lapse microscopy. ACDC overcomes the limitations of the literature methods by first applying bilateral filtering on the original image to smooth the input cell images while preserving edge sharpness, and then by exploiting the watershed transform and morphological filtering. Moreover, ACDC represents a feasible solution for laboratory practice, as it can leverage multi-core architectures in computer clusters to efficiently handle large-scale imaging datasets. Indeed, our Parent-Workers implementation of ACDC achieves up to a 3.7× speed-up compared to the sequential counterpart. ACDC was tested on two distinct cell imaging datasets to assess its accuracy and effectiveness on images with different characteristics. We achieved an accurate cell-count and nuclei segmentation without relying on large-scale annotated datasets, a result confirmed by the average Dice Similarity Coefficients of 76.84 and 88.64 and the Pearson coefficients of 0.99 and 0.96, calculated against the manual cell counting, on the two tested datasets.
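For readers unfamiliar with the Dice Similarity Coefficient used above to score segmentation, a minimal sketch follows. It assumes binary masks given as nested lists of 0/1 values; the function name and the toy masks are illustrative. The coefficient is returned on a 0 to 1 scale, whereas the values quoted in the abstract appear to be percentages.

```python
def dice_coefficient(pred, truth):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks of equal shape."""
    a = [bool(v) for row in pred for v in row]
    b = [bool(v) for row in truth for v in row]
    if len(a) != len(b):
        raise ValueError("masks must contain the same number of pixels")
    intersection = sum(x and y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [[1, 1, 0],
                 [0, 1, 0]]
    manual = [[1, 0, 0],
              [0, 1, 1]]
    print(dice_coefficient(predicted, manual))
```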

Pre-print Accelerated Simulations of Chemical Reaction Systems using the Stochastic Simulation Algorithm on GPUs

Stochasticity due to fluctuations in chemical reactions can play important roles in cellular network-driven processes. Although the Stochastic Simulation Algorithm (SSA, aka Gillespie Algorithm) has long been accepted as a suitable method to solve the time-dependent chemical master equation, its computational cost is prohibitive for large scale complex networks such as those found in cellular processes. Here we present GPU-SSA, an implementation of the SSA formalism utilizing Graphics Processing Units for use in Python using the PySB modeling framework. We show that the GPU implementation of SSA can achieve significant speedup compared to parallel CPU or single-core CPU implementations. We further include supplementary didactic material to demonstrate how to incorporate GPU-SSA workflows for interested readers.
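As a reference point for what is being accelerated, the direct-method SSA can be sketched in a few lines of plain Python. This is a single-trajectory CPU illustration, not the GPU-SSA or PySB implementation; the function signature and the birth-death example model are assumptions made for this sketch.

```python
import math
import random


def gillespie(x0, stoich, propensity, t_end, seed=None):
    """Direct-method SSA for one trajectory.

    x0: initial species counts; stoich[j]: state change caused by reaction j;
    propensity(x): list of reaction rates for state x.
    Returns (times, states) recorded after every reaction event.
    """
    rng = random.Random(seed)
    t, x = 0.0, list(x0)
    times, states = [t], [tuple(x)]
    while True:
        rates = propensity(x)
        total = sum(rates)
        if total <= 0.0:
            break  # no reaction can fire
        # Exponentially distributed waiting time until the next reaction.
        t += -math.log(1.0 - rng.random()) / total
        if t > t_end:
            break
        # Choose reaction j with probability rates[j] / total.
        threshold, cumulative = rng.random() * total, 0.0
        for j, rate in enumerate(rates):
            cumulative += rate
            if threshold < cumulative:
                break
        x = [xi + d for xi, d in zip(x, stoich[j])]
        times.append(t)
        states.append(tuple(x))
    return times, states


if __name__ == "__main__":
    # Birth-death process: 0 -> X at rate 10, X -> 0 at rate 0.1 * X.
    times, states = gillespie(
        x0=[0],
        stoich=[(1,), (-1,)],
        propensity=lambda x: [10.0, 0.1 * x[0]],
        t_end=5.0,
        seed=1,
    )
    print(len(times), states[-1])
```

Each trajectory is inherently sequential, but many independent trajectories are usually needed to estimate distributions, and that outer loop is where parallel hardware helps.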

Letter to Editor Accredit scientific software for sustainability

Journal Article Ursprung: Provenance for Large-Scale Analytics Environments

Modern analytics has produced wonders, but reproducing and verifying these wonders is difficult. Data provenance helps to solve this problem by collecting information on how data is created and accessed. Although provenance collection techniques have been used successfully on a smaller scale, tracking provenance in large-scale analytics environments is challenging due to the scale of provenance generated and the heterogeneous domains. Without provenance, analysts struggle to keep track of and reproduce their analyses. We demonstrate Ursprung, a provenance collection system specifically targeted at such environments. Ursprung transparently collects the minimal set of system-level provenance required to track the relationships between data and processes. To collect domain-specific provenance, Ursprung enables users to specify capture rules to curate application-specific logs, intermediate results, etc. To reduce storage overhead and accelerate queries, it uses event hierarchies to synthesize raw provenance into compact summaries.

Journal Article Overcoming intratumoural heterogeneity for reproducible molecular risk stratification: a case study in advanced kidney cancer

Metastatic clear cell renal cell cancer (mccRCC) portends a poor prognosis and urgently requires better clinical tools for prognostication as well as for prediction of response to treatment. Considerable investment in molecular risk stratification has sought to overcome the performance ceiling encountered by methods restricted to traditional clinical parameters. However, replication of results has proven challenging, and intratumoural heterogeneity (ITH) may confound attempts at tissue-based stratification.

We investigated the influence of confounding ITH on the performance of a novel molecular prognostic model, enabled by pathologist-guided multiregion sampling (n = 183) of geographically separated mccRCC cohorts from the SuMR trial (development, n = 22) and the SCOTRRCC study (validation, n = 22). Tumour protein levels quantified by reverse phase protein array (RPPA) were investigated alongside clinical variables. Regularised wrapper selection identified features for Cox multivariate analysis with overall survival as the primary endpoint.

The optimal subset of variables in the final stratification model consisted of N-cadherin, EPCAM, Age, mTOR (NEAT). Risk groups from NEAT had a markedly different prognosis in the validation cohort (log-rank p = 7.62 × 10⁻⁷; hazard ratio (HR) 37.9, 95% confidence interval 4.1–353.8) and 2-year survival rates (accuracy = 82%, Matthews correlation coefficient = 0.62). Comparisons with established clinico-pathological scores suggest favourable performance for NEAT (Net reclassification improvement 7.1% vs International Metastatic Database Consortium score, 25.4% vs Memorial Sloan Kettering Cancer Center score). Limitations include the relatively small cohorts and associated wide confidence intervals on predictive performance. Our multiregion sampling approach enabled investigation of NEAT validation when limiting the number of samples analysed per tumour, which significantly degraded performance. Indeed, sample selection could change risk group assignment for 64% of patients, and prognostication with one sample per patient performed only slightly better than random expectation (median logHR = 0.109). Low grade tissue was associated with 3.5-fold greater variation in predicted risk than high grade (p = 0.044).

This case study in mccRCC quantitatively demonstrates the critical importance of tumour sampling for the success of molecular biomarker research where ITH is a factor. The NEAT model shows promise for mccRCC prognostication and warrants follow-up in larger cohorts. Our work evidences actionable parameters to guide sample collection (tumour coverage, size, grade) to inform the development of reproducible molecular risk stratification methods.
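The abstract above reports both accuracy and the Matthews correlation coefficient for two-year survival prediction. The sketch below shows how the two are computed from a binary confusion matrix. The function names are illustrative and the counts in the usage example are made up; they are not the study's data.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is 0 when any marginal total is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical confusion matrix for a small cohort.
    counts = dict(tp=9, tn=9, fp=2, fn=2)
    print(accuracy(**counts), matthews_corrcoef(**counts))
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, so it stays informative when the two outcome classes are unbalanced.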

Journal Article GPU-powered model analysis with PySB/cupSODA

A major barrier to the practical utilization of large, complex models of biochemical systems is the lack of open-source computational tools to evaluate model behaviors over high-dimensional parameter spaces. This is due to the high computational expense of performing thousands to millions of model simulations required for statistical analysis. To address this need, we have implemented a user-friendly interface between cupSODA, a GPU-powered kinetic simulator, and PySB, a Python-based modeling and simulation framework. For three example models of varying size, we show that for large numbers of simulations PySB/cupSODA achieves order-of-magnitude speedups relative to a CPU-based ordinary differential equation integrator.

Availability and implementation
The PySB/cupSODA interface has been integrated into the PySB modeling framework (version 1.4.0), which can be installed from the Python Package Index (PyPI) using a Python package manager such as pip. cupSODA source code and precompiled binaries (Linux, Mac OS/X, Windows) are available at (requires an Nvidia GPU; Additional information about PySB is available at

Supplementary information Supplementary data are available at Bioinformatics online.

Journal Article Integrated, High-Throughput, Multiomics Platform Enables Data-Driven Construction of Cellular Responses and Reveals Global Drug Mechanisms of Action

An understanding of how cells respond to perturbation is essential for biological applications; however, most approaches for profiling cellular response are limited in scope to pre-established targets. Global analysis of molecular mechanism will advance our understanding of the complex networks constituting cellular perturbation and lead to advancements in areas, such as infectious disease pathogenesis, developmental biology, pathophysiology, pharmacology, and toxicology. We have developed a high-throughput multiomics platform for comprehensive, de novo characterization of cellular mechanisms of action. Platform validation using cisplatin as a test compound demonstrates quantification of over 10 000 unique, significant molecular changes in less than 30 days. These data provide excellent coverage of known cisplatin-induced molecular changes and previously unrecognized insights into cisplatin resistance. This proof-of-principle study demonstrates the value of this platform as a resource to understand complex cellular responses in a high-throughput manner.

Journal Article Sunitinib Treatment Exacerbates Intratumoral Heterogeneity in Metastatic Renal Cancer

Purpose: The aim of this study was to investigate the effect of VEGF-targeted therapy (sunitinib) on molecular intratumoral heterogeneity (ITH) in metastatic clear cell renal cancer (mccRCC).

Experimental Design: Multiple tumor samples (n = 187 samples) were taken from the primary renal tumors of patients with mccRCC who were sunitinib treated (n = 23, SuMR clinical trial) or untreated (n = 23, SCOTRRCC study). ITH of pathologic grade, DNA (aCGH), mRNA (Illumina Beadarray) and candidate proteins (reverse phase protein array) were evaluated using unsupervised and supervised analyses (driver mutations, hypoxia, and stromal-related genes). ITH was analyzed using intratumoral protein variance distributions and distribution of individual patient aCGH and gene-expression clustering.

Results: Tumor grade heterogeneity was greater in treated compared with untreated tumors (P = 0.002). In unsupervised analysis, sunitinib therapy was not associated with increased ITH in DNA or mRNA. However, there was an increase in ITH for the driver mutation gene signature (DNA and mRNA) as well as increasing variability of protein expression with treatment (P < 0.05). Despite this variability, significant chromosomal and transcript changes to key targets of sunitinib, such as VHL, PBRM1, and CAIX, occurred in the treated samples.

Conclusions: These findings suggest that sunitinib treatment has significant effects on the expression and ITH of key tumor and treatment specific genes/proteins in mccRCC. The results, based on primary tumor analysis, do not support the hypothesis that resistant clones are selected and predominate following targeted therapy. Clin Cancer Res; 21(18); 4212–23. ©2015 AACR.

Journal Article Carbonic Anhydrase 9 Expression Increases with Vascular Endothelial Growth Factor–Targeted Therapy and Is Predictive of Outcome in Metastatic Clear Cell Renal Cancer

There is a lack of biomarkers to predict outcome with targeted therapy in metastatic clear cell renal cancer (mccRCC). This may be because dynamic molecular changes occur with therapy.

To explore if dynamic, targeted-therapy-driven molecular changes correlate with mccRCC outcome.

Design, setting, and participants
Multiple frozen samples from primary tumours were taken from sunitinib-naïve (n = 22) and sunitinib-treated mccRCC patients (n = 23) for protein analysis. A cohort (n = 86) of paired, untreated and sunitinib/pazopanib-treated mccRCC samples was used for validation. Array comparative genomic hybridisation (CGH) analysis and RNA interference (RNAi) were used to support the findings.

Three cycles of sunitinib 50 mg (4 wk on, 2 wk off).

Outcome measurements and statistical analysis
Reverse phase protein arrays (training set) and immunofluorescence automated quantitative analysis (validation set) assessed protein expression.

Results and limitations
Differential expression between sunitinib-naïve and treated samples was seen in 30 of 55 proteins (p < 0.05 for each). The proteins B-cell CLL/lymphoma 2 (BCL2), mutL homolog 1 (MLH1), carbonic anhydrase 9 (CA9), and mechanistic target of rapamycin (mTOR) (serine/threonine kinase) had both increased intratumoural variance and significant differential expression with therapy. The validation cohort confirmed increased CA9 expression with therapy. Multivariate analysis showed high CA9 expression after treatment was associated with longer survival (hazard ratio: 0.48; 95% confidence interval, 0.26–0.87; p = 0.02). Array CGH profiles revealed sunitinib was associated with significant CA9 region loss. RNAi CA9 silencing in two cell lines inhibited the antiproliferative effects of sunitinib. Shortcomings of the study include selection of a specific protein for analysis, and the specific time points at which the treated tissue was analysed.

CA9 levels increase with targeted therapy in mccRCC. Lower CA9 levels are associated with a poor prognosis and possible resistance, as indicated by the validation cohort.

Patient summary
Drug treatment of advanced kidney cancer alters molecular markers of treatment resistance. Measuring carbonic anhydrase 9 levels may be helpful in determining which patients benefit from therapy.

Journal Article TMA Navigator: network inference, patient stratification and survival analysis with tissue microarray data

Tissue microarrays (TMAs) allow multiplexed analysis of tissue samples and are frequently used to estimate biomarker protein expression in tumour biopsies. TMA Navigator is an open access web application for analysis of TMA data and related information, accommodating categorical, semi-continuous and continuous expression scores. Non-biological variation, or batch effects, can hinder data analysis and may be mitigated using the ComBat algorithm, which is incorporated with enhancements for automated application to TMA data. Unsupervised grouping of samples (patients) is provided according to Gaussian mixture modelling of marker scores, with cardinality selected by Bayesian information criterion regularization. Kaplan–Meier survival analysis is available, including comparison of groups identified by mixture modelling using the Mantel-Cox log-rank test. TMA Navigator also supports network inference approaches useful for TMA datasets, which often constitute comparatively few markers. Tissue and cell-type specific networks derived from TMA expression data offer insights into the molecular logic underlying pathophenotypes, towards more effective and personalized medicine. Output is interactive, and results may be exported for use with external programs. Private anonymous access is available, and user accounts may be generated for easier data management.
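The Kaplan–Meier analysis mentioned above can be illustrated with a short product-limit estimator. This is a generic sketch, not TMA Navigator's code; the function name, input format and example follow-up times are assumptions.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times: follow-up time for each patient;
    events: True if the event was observed, False if the patient was censored.
    Returns [(t, S(t))] at each distinct time with at least one event.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Group all patients sharing this follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += bool(data[i][1])
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # events and censored patients both leave the risk set
    return curve


if __name__ == "__main__":
    # Made-up follow-up data: five patients, two censored.
    print(kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False]))
```

Running the estimator separately on each patient group gives the curves that a log-rank test then compares.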