Over the past two decades, the field of biology has been transitioning from small data to big data, and my PhD journey followed a similar trend. While working with whole-genome sequencing data from cancers was hardly ‘small’ data (after all, this involved sequence data from thousands of tumour cells across the three billion bases of DNA contained in each), the number of patient samples I had access to was small. At the time (this would have been around 2012), our lab group took the approach of deeply characterising a small number of patients with aggressive cancer using RNA sequencing, methylation arrays, SNP arrays, and even whole-genome sequencing, a rarity in those days. Having access to multi-omic data was a privilege, and I was lucky enough to be part of one of the first research groups to observe cross-metastatic seeding in prostate cancer. Groups had been using point mutation frequencies[1] from exomes and targeted sequencing to investigate tumour evolution for some time, but now that we had whole genomes, what else could we find lurking in the data?

As part of a collaboration between National ICT Australia (now data61), the University of Melbourne and the Royal Melbourne Hospital, we developed strategies for minimising false-positive structural variation calls, and performed some ‘back of the envelope’ estimations of the variant allele frequency (VAF), a measure that forms the basis of evolutionary analysis in cancer. The VAF is straightforward to calculate for point mutations but much more difficult for structural variants (SVs). Throughout my PhD, I worked to refine a methodology for this calculation and used machine-learning techniques to identify potential clones, which are groups of cancer cells with a similar genetic makeup. While our group made effective use of the ‘small data’ we had by verifying our approach with in silico mixtures of samples, we lacked a large cohort of samples to apply our method to. As each cancer is unique, it was difficult to establish whether the patterns of SV clonality we were observing in our prostate cancer samples were representative of the aggressive, metastatic subtype in our small cohort. We had a method, but demonstrating novel biological insight was another matter.
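To make the point-mutation case concrete, here is a minimal sketch of the calculation. It is an illustration only, not the method developed during the PhD: the function name `point_mutation_vaf` and the read counts are hypothetical, and it assumes the VAF is taken as the fraction of reads covering a site that support the variant allele.

```python
def point_mutation_vaf(variant_reads: int, reference_reads: int) -> float:
    """Return the variant allele frequency (VAF) at a single site.

    Assumption for this sketch: VAF is the number of reads supporting
    the variant divided by all reads covering the site.
    """
    if variant_reads < 0 or reference_reads < 0:
        raise ValueError("read counts must be non-negative")
    depth = variant_reads + reference_reads
    if depth == 0:
        raise ValueError("no reads cover this site")
    return variant_reads / depth


# Hypothetical counts: 12 reads carry the mutation, 28 match the reference.
example_vaf = point_mutation_vaf(variant_reads=12, reference_reads=28)
print(f"VAF = {example_vaf:.2f}")  # prints "VAF = 0.30"
```

Obtaining an equivalent ratio for structural variants is the harder problem that the methodology described above set out to address.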

The real turning point came in 2015, when we joined the Pan Cancer Analysis of Whole Genomes (PCAWG, pronounced ‘pea-cog’) tumour heterogeneity working group. Thanks to the efforts of the thousands of scientists involved with the Pan Cancer project, we now had access to over 2,600 whole genomes across 38 cancer types. Our algorithm required data on the mutations found within each cancer sample: genome rearrangements, point mutations and copy-number estimates[2]. The working groups that curated and refined each of these variant types gave us a high-confidence data set to feed into our method. It’s impossible to overstate how much collaborative energy went into the international undertaking of the PCAWG, a project that involved over a thousand contributing authors.

Just as the PCAWG project displayed collaboration on a large scale, the SVclone paper involved its own inter-institute and international collaboration. I was awarded an Endeavour Award, which allowed me to spend four months at the Markowetz lab at Cancer Research UK Cambridge Institute, where Geoff Macintyre, my primary supervisor for the SVclone project, was then based. We also collaborated with Ke Yuan, another Cancer Research UK alumnus, who had developed a variational inference approach that allowed us to perform clonal analyses in a fraction of the time. Together, this meant we could quickly run experiments across large numbers of whole-genome sequenced tumours. We were then able to contribute our allele frequency calculations back to the Pan Cancer project to help characterise the evolutionary history of structural variation across cancers. We didn’t know whether patterns of clonality in point mutations differed from patterns of structural variant clonality. Nobody had looked at this before because the necessary combination of data and methods wasn’t readily available. Our analysis revealed distinct patterns of clonality in different cancer types in both point mutations and SVs, likely reflecting numerous different mechanisms at play. This was (and still is) just scratching the surface.

Access to histological and clinical characteristics further allowed us to link some of the patterns we were observing with survival characteristics. This led us to identify a subset of patients with reduced survival, possessing what we termed the subclonal neutral rearrangement (SCNR) genotype. These cancers contained more genomic rearrangements that don’t change the total copy-number of DNA (such as inversions and translocations), and these rearrangements were primarily found at lower frequencies of cells within the tumour. These findings are preliminary, and it will take many more years of research to keep uncovering interesting biology by mining these big data. Without the immense efforts of the thousands of scientists involved in the Pan Cancer project, we would not have this vast resource at our fingertips. It was with great excitement that I watched the deluge of papers and resources emerge from the Pan Cancer project into the scientific community, solidifying the move of cancer genomics into the world of big data. It also marked the real end of my PhD journey, which had formally ended two years prior but needed further collaboration and refinement for my central project to reach its full potential. Researchers around the world now have access to the immense Pan Cancer resource, which makes this an ideal time to be part of the cancer research community, and I eagerly anticipate the new findings that this will enable over the coming years.

The paper | The code | PhD thesis | PCAWG collection

  1. Point mutations here refer to single base-pair changes that commonly occur in cancer cells. As bulk DNA sequencing involving many cancer cells is pervasive, most algorithms for inferring tumour heterogeneity in such samples use proportions of point mutations as a proxy to estimate groups of genetically similar tumour cell populations.

  2. Sometimes referred to as somatic copy-number alterations (SCNAs), which are large-scale duplications and deletions in the tumour genome. Technically, they are a subclass of genome rearrangements, which also include copy-neutral rearrangements such as inversions and translocations.