
What's a segmented copy number profile?


I am studying copy-number variation. I am reading

C. H. Mermel, S. E. Schumacher, B. Hill, M. L. Meyerson, R. Beroukhim, and G. Getz, “GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers,” Genome Biol., vol. 12, no. 4, p. R41, 2011.

Here, it is written that

Segmented copy number profiles represent the summed outcome of all the SCNAs [somatic copy number alterations] that occurred during cancer development. Accurate modeling of the background rate of copy-number alteration requires analysis of the individual SCNAs. However, because SCNAs may overlap, it is impossible to directly infer the underlying events from the final segmented copy-number profile alone.

It is not clear to me how a segmented copy number profile represents the summed outcome of all the SCNAs. Is it because different alterations can be present in the same sample, or can alter the copy-number in different moments, or both?

And, do they overlap spatially, temporally, or both?


Yes, one sample can contain different alterations. For each patient there is typically one tumour specimen that is removed. That specimen may be divided up into several samples (e.g. one for DNA sequencing, one for RNA sequencing, one for methylation microarray, and one for copy-number variation microarray); however, each sample contains thousands of individual cells, and two adjacent cells may have different CNVs (depending on their clonal ancestry, etc.). In other words, a tumour is a heterogeneous collection of cells. For some tumour types there can even be healthy cells mixed in.

The term in the literature is clonal evolution; there is a nice image in this article: Tumor Heterogeneity


To directly respond to the quote presented,

Segmented copy number profiles represent the summed outcome of all the SCNAs [somatic copy number alterations] that occurred during cancer development.

As a tumor progresses, genomic instability can often increase. That is, more and more SCNAs occur. Because of this, one SCNA can overlap another.

For example, consider how a single chromosome changes as the tumor progresses. Let's say you have a loss on this chromosome (Event 1, on the MATERNAL chromosome). This loss can occur by many mechanisms, which I won't go into. The tumor proliferates, its cells dividing many times, and mutations accumulate. These mutations can cause more events to occur, and perhaps on this same chromosome arm, the distal part of the remaining copy is duplicated (Event 2, on the PATERNAL chromosome).

In the copy number profile, it will look as though you had a loss on just an interstitial part of the chromosome. But looking closely at the data, you can see that you've lost heterozygosity on the distal portion of the chromosome as well (we now have two copies of the PATERNAL chromosome, and 0 copies of the MATERNAL). This is a simplified example, and many events can occur along the same chromosome arm. If the paternal chromosome holds a mutation, this can mean a selective advantage for the tumor or resistance to drug therapies.

Therefore, the CN profile represents a snapshot in time of what the copy number state was at that moment, with no explicit information on how that copy number state was obtained.
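To make the two-event example concrete, here is a minimal sketch in Python. Everything in it is a toy assumption: the chromosome arm is cut into 10 equal bins, the bin coordinates of the two events are made up, and the helper names `apply_event` and `segment` are hypothetical. It only shows how two overlapping events on different parental copies sum into a profile that looks like a single interstitial loss.

```python
# Toy model (assumption): one chromosome arm divided into 10 equal bins,
# bin 0 nearest the centromere and bin 9 at the distal end.

def apply_event(profile, start, end, delta):
    """Return a new per-bin copy-number list with `delta` added on bins [start, end)."""
    return [max(0, cn + delta) if start <= i < end else cn
            for i, cn in enumerate(profile)]

def segment(total):
    """Collapse a per-bin profile into (start, end, copy_number) segments."""
    segments = []
    seg_start = 0
    for i in range(1, len(total) + 1):
        # Close the current segment at the end of the arm or when the value changes.
        if i == len(total) or total[i] != total[seg_start]:
            segments.append((seg_start, i, total[seg_start]))
            seg_start = i
    return segments

maternal = [1] * 10
paternal = [1] * 10

# Event 1: loss of bins 4-9 on the MATERNAL copy.
maternal = apply_event(maternal, 4, 10, -1)
# Event 2: duplication of the distal bins 7-9 on the PATERNAL copy.
paternal = apply_event(paternal, 7, 10, +1)

# The array only sees the sum of both parental copies.
total = [m + p for m, p in zip(maternal, paternal)]
profile = segment(total)
print(profile)  # [(0, 4, 2), (4, 7, 1), (7, 10, 2)]
```

The segmented profile shows two copies at the distal end, even though both of those copies are paternal. The total copy number alone cannot tell you that two events happened there.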


Estimating Copy Number From Log Ratios

In practice, however, the number of copies is difficult to estimate from the log2 ratio values of a segment, for various reasons: sample-based issues such as polyploidy, mosaicism, contamination with normal samples, tumor necrosis, etc., and array-based issues such as probe dynamic range, hybridization quality, batching, etc.

Agilent arrays have good dynamic range while SNP arrays like Illumina have a smaller range. You can see this in our default settings for one-copy gain/loss for these platforms (see File->Settings and drop down to Illumina). The new Affymetrix OSCHP platform can directly estimate copy number.

Given these caveats, you can estimate copy number using the probe median value for each call. In Nexus, you can get the probe median values from the data table in the Sample Drill Down window (individual sample window). So, if your data is in log2 space, the call has a probe median of about 0.5-0.58 (log2(3/2) ≈ 0.585), the platform was Agilent, and this is not a tumor sample, you can say this is a 1 copy gain (3 copies/2). If the value was approx. 1.0 (=log2(4/2)), you can say this is a 2 copy gain (4 copies/2), and so forth: a 3 copy gain corresponds to log2(5/2) ≈ 1.32. For SNP arrays, which have a smaller dynamic range, you would make the calls at slightly lower log ratios.
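The arithmetic above can be sketched in a few lines of Python. This is an illustration only, not how Nexus computes its calls: the function names are hypothetical, a diploid (2-copy) reference is assumed, and the `purity` parameter is a simple assumed mixing model added to show why contamination with normal cells pulls log ratios toward zero.

```python
import math

def expected_log2_ratio(copies, purity=1.0):
    """Expected log2 ratio against a diploid reference.

    copies: copy number in the cells carrying the alteration.
    purity: assumed fraction of cells carrying it; the rest are taken to have 2 copies.
    """
    mixed = purity * copies + (1.0 - purity) * 2
    if mixed == 0:
        return float("-inf")  # homozygous loss in a pure sample
    return math.log2(mixed / 2)

def copies_from_log2_ratio(log2_ratio):
    """Invert the relation for a pure, diploid-reference sample."""
    return 2 * 2 ** log2_ratio

for n in (1, 2, 3, 4, 5):
    print(n, round(expected_log2_ratio(n), 3))

# The same 1 copy gain seen in a sample where only half the cells carry it
print(round(expected_log2_ratio(3, purity=0.5), 3))
```

In a pure sample, 3, 4 and 5 copies give about 0.585, 1.0 and 1.322. With only half the cells carrying the gain, the 3-copy value drops to about 0.322, which is one reason fixed thresholds are unreliable for tumor samples.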


Abstract

Single-cell DNA sequencing technologies are enabling the study of mutations and their evolutionary trajectories in cancer. Somatic copy number aberrations (CNAs) have been implicated in the development and progression of various types of cancer. A wide array of methods for CNA detection has been either developed specifically for or adapted to single-cell DNA sequencing data. Understanding the strengths and limitations that are unique to each of these methods is very important for obtaining accurate copy number profiles from single-cell DNA sequencing data. We benchmarked three widely used methods–Ginkgo, HMMcopy, and CopyNumber–on simulated as well as real datasets. To facilitate this, we developed a novel simulator of single-cell genome evolution in the presence of CNAs. Furthermore, to assess performance on empirical data where the ground truth is unknown, we introduce a phylogeny-based measure for identifying potentially erroneous inferences. While single-cell DNA sequencing is very promising for elucidating and understanding CNAs, our findings show that even the best existing method does not exceed 80% accuracy. New methods that significantly improve upon the accuracy of these three methods are needed. Furthermore, with the large datasets being generated, the methods must be computationally efficient.


The Who, What, Where, When, Why, and How of Using DNA Segment Data

Are you up for an adventurous foray into DNA segment data and ready to use some programs to understand it, analyze it, and use it in your Family History research? Let’s go!

You may have already read some Family Locket blog posts about using DNA segment data. The Chromosome Browser: A Tool for Visualizing Segment Data by Nicole, explains how to use chromosome browsers to visualize the DNA that you share with your DNA matches and triangulate shared autosomal DNA segments. Segment Triangulation: Proving an Ancestral Line, by Diana, explains how to use MyHeritage segments to confirm a set of 2nd great grandparents. Chromosome Mapping – Visualize Your DNA and Identify the Ancestors Who Passed It On To You, written previously by me, gives an overview of using DNA segment data to create chromosome maps and do visual phasing.

To start, let’s look at the Who, What, Where, When, Why, and How of Using DNA Segment Data.

When you explore your DNA results and use them in family history research, who is involved?

– DNA matches – relatives with known and unknown relationships

– Your common ancestor(s) – the direct ancestor(s) to whom both you and a DNA match are related

The images below illustrate the relationship pathways between you and your DNA matches. If you are 3rd cousins you share great-great-grandparents.

If your great grandparent is your DNA match’s great-great-grandparent, your relationship is 2nd cousin once removed which is often shortened to 2C1R.
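If you like to see the pattern spelled out, the cousin labels follow a simple rule that can be sketched in Python. The function name `cousin_relationship` is made up for this illustration, and it only handles cousin relationships (both people at least two generations below the common ancestor).

```python
def cousin_relationship(gen_a, gen_b):
    """Return a label such as '3C' or '2C1R'.

    gen_a, gen_b: generations from each person up to the common ancestor
    (grandparent = 2, great-grandparent = 3, great-great-grandparent = 4, ...).
    """
    if gen_a < 2 or gen_b < 2:
        raise ValueError("cousins need a common ancestor at grandparent level or above")
    degree = min(gen_a, gen_b) - 1   # 1st, 2nd, 3rd cousin ...
    removed = abs(gen_a - gen_b)     # generations of difference
    label = f"{degree}C"
    if removed:
        label += f"{removed}R"
    return label

# You and your match both descend from the same great-great-grandparents
print(cousin_relationship(4, 4))  # 3C
# Your great-grandparent is your match's great-great-grandparent
print(cousin_relationship(3, 4))  # 2C1R
```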

The goal is to identify segments of DNA that you inherited from specific ancestors. DNA segment information helps you identify/find/locate new ancestral connections or verify known relationships.

Remember that you have 2 copies of each chromosome. You inherited one copy of each chromosome from your father and one copy from your mother. The overlapping segments need to be on the same chromosome to be informative about ancestors.

If you examine the DNA segments that you share with your DNA matches and know how you are related and which common ancestor(s) you share, you can assign the segments to that ancestor. After the segment has been “assigned” or identified as belonging to a specific ancestor(s), you can compare the segment data from additional DNA matches. If the new DNA match(s) also share the same segment on the same chromosome, it indicates which ancestral line you have in common and helps you learn how you and the new DNA match(s) are related.

As time goes on and more people test their DNA, there is a potential for you to identify more and more ancestors. So, if you don’t have many DNA matches today, don’t feel left out. Over time, the probability of having more DNA matches will grow. It’s a good idea to check back periodically and look for new matches.

The place you’ll actually be able to compare DNA data is the “Where.” But first, download or copy segment data from 23andMe, FTDNA, MyHeritage, and GEDmatch. Some 3rd-party websites and tools help make the segment data meaningful and useful for your family history research. Some of the websites and tools are the Genetic Affairs AutoSegment tool, DNA Painter chromosome maps, DNA Gedcom, spreadsheets, and more.

Segments shared with DNA matches were inherited from common ancestors.

– Some cousins may have more family history information about your common ancestors than you do. They may even know about ancestors that are beyond a “brick wall” for you.

– Identifying segments can help discern between or confirm hypothesized ancestors, or confirm known ancestors.

Download CSV files of segment data. Once you have the DNA data, you can analyze it. You can use the data to create chromosome maps, triangulate, and verify relationships. You can import the CSV files into the Genetic Affairs AutoSegment tool and the DNA Painter chromosome map tool. Other options are to use the DNA Gedcom Client to extract data from the DNA testing companies, then use the data in JWorks and KWorks, or look at the data in a spreadsheet.

I tried the new AutoSegment tool at Genetic Affairs, and I’m excited about the possibilities! First, I selected the AutoSegment tool at https://members.geneticaffairs.com/autosegment. In the image shown below, there is a link to instructions (https://members.geneticaffairs.com/img/AutoSegmentTutorial.pdf) about how to gather the segment data needed for the AutoSegment analysis. The companies from which the segment data may be obtained are MyHeritage, Family Tree DNA, 23andMe, and GEDmatch.

I chose MyHeritage since I recently uploaded raw DNA data, and I wanted to see the results.

After selecting MyHeritage, this page opened which gave me instructions on exporting the DNA matches file and the shared DNA segment file. It took about 15 minutes to receive both files from MyHeritage via email. I extracted the zipped files, then used the “Choose File” box to upload the match file, then the segment file.

After selecting “PERFORM AUTOSEGMENT ANALYSIS,” this screen opened, and I selected “Yes, Perform AutoSegment analysis.”

After just a few minutes, I received an email from Genetic Affairs with a zipped file attached. After downloading and unzipping the AutoSegment file, I opened the HTML file and chose to order the DNA matches by the cluster. This image opened; it is a segment cluster analysis that looks similar to the AutoCluster analysis that we have come to know and love.

“Explanation of AutoSegment analysis
AutoSegment organizes your matches into clusters that likely represent branches of your family. Each of the colored cells represents an intersection between two of your matches, meaning they both match you and each other based on an overlapping segment. These cells, in turn, are grouped together both physically and by color to create a powerful visual chart of your clusters.

Each color represents one cluster. Members of a cluster match you and most or all of the other cluster members. Everyone in a cluster will likely be on the same ancestral line, although the MRCA between any of the matches and between you and any match may vary. The generational level of the clusters may vary as well. One may be your paternal grandmother’s branch another maybe your paternal grandfather’s father’s branch.

Please be aware that an AutoSegment analysis is different as compared to regular AutoCluster analysis. The AutoCluster analysis employs shared matches to form clusters of matches. Some of these shared matches will also share the same DNA segment with you. AutoSegment is based on overlapping segments of DNA matches. Based on an overlapping segment of your DNA matches, a link between those DNA matches is created. However, please keep in mind the following:

An overlapping segment, as calculated, is not proof of a triangulating segment!”
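To see what "overlapping segment" means in terms of the start and end points in your segment files, here is a small Python sketch. It is not the AutoSegment algorithm, just the basic coordinate comparison. The `Segment` type and `overlap` function are names I made up, and the second match's coordinates are invented for the example.

```python
from typing import NamedTuple, Optional

class Segment(NamedTuple):
    chromosome: str
    start: int
    end: int

def overlap(a: Segment, b: Segment) -> Optional[Segment]:
    """Return the shared stretch of two segments, or None if they do not overlap."""
    if a.chromosome != b.chromosome:
        return None
    start = max(a.start, b.start)
    end = min(a.end, b.end)
    if start >= end:
        return None
    return Segment(a.chromosome, start, end)

# Segment you share with match A, and a made-up segment you share with match B
match_a = Segment("22", 32_950_104, 45_651_045)
match_b = Segment("22", 40_000_000, 50_000_000)

shared = overlap(match_a, match_b)
print(shared)  # Segment(chromosome='22', start=40000000, end=45651045)
```

Notice that the coordinates cannot tell you whether match A is on your maternal copy of chromosome 22 and match B on your paternal copy. That is why an overlap on its own is not proof of triangulation.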

AutoCluster Chromosome Browser

There is a chromosome browser under the clusters that shows with whom you share a certain DNA segment on a specific chromosome. If you hover over a colored bar on a chromosome, a box with details about that segment opens. The boxes give information such as: “segment from cluster 52 sharing 21.4 cM with [DNA match] chr 22: 32,950,104 – 45,651,045,” meaning that the chromosome 22 segment is shared with a certain DNA match which also shares that same 21.4 cM segment with the people in cluster 52, and the start point is 32,950,104 with the endpoint 45,651,045.

If you click on a chromosome, it swings into a vertical position. Then you can hover over the larger chromosome image and move the mouse down to see the other people who share various segments of the chromosome with you.

AutoSegment Cluster Information

Next, there is an interactive table listing the cluster number, segments, chromosomes, start and stop points, number of SNPs, match name, company, amount of shared DNA in that segment, and the total amount of shared DNA in cM. The table allows for sorting and provides a link to message the DNA match through the DNA company's messaging system; in this case, the company was MyHeritage.

Segment clusters chart using a visualization of the individual segments

Also included in the file are interactive images of each separate colored cluster shown in the image above. These files allow you to see the information in even more focused detail. The image below shows that the first 7 clusters from the overall cluster image are shown in greater detail. The accompanying chromosome browser highlights the shared segments in the smaller group of clusters.

The information in the AutoSegment report shows which DNA segments you share with other DNA matches, and you can examine and work to discern the common ancestors who passed that DNA segment on to you and your DNA match. If you can identify 2 or more DNA matches that share the same segment from the same ancestor (triangulation), you can use that along with traditional genealogy records to confirm that you are genetically related to the ancestor who passed that DNA segment on to you.

Pile-up Regions

Another fascinating part of the AutoSegment report is the section about pile-up regions. In my graph of chromosome 1, 14 people share the highest peak in the graph. This means that 14 people share the same small segment of DNA. Genetic Affairs AutoSegment allows you to filter out the segments of DNA that are located in known pile-up regions.

Pile-up regions are segments of DNA that are common in a population. The segments have been passed down through many generations and are not indicative of recent shared ancestry. The Timber algorithm at AncestryDNA removes the commonly known pile-up regions from its calculation of shared DNA for matches that share under 90 cM of DNA.

Genetic Affairs launched an additional segment analysis tool last week called Hybrid AutoSegment. This tool combines data from 23andMe, FTDNA, MyHeritage, and GEDmatch to compile an AutoSegment report. It is very exciting to now have the ability to compare DNA segments from 3 DNA companies and GEDmatch in one report. This saves the time and energy of going back and forth between reports and spreadsheets to compare DNA segments.

Both the AutoSegment and Hybrid AutoSegment tools are set up to allow “easy integration into the DNA Painter website.” Wow, I can’t wait to explore this some more and share the results with you!

Give AutoSegment a try – it may be the breakthrough tool you are looking for!


Results

Rheological Measures

In the microviscoamylograph, the impact of saliva on starch viscosity varied between individuals from almost no impact to a rapid decline in starch viscosity within seconds. Figure 1 portrays the data from four subjects with the highest overall decreases in viscosity from 120 to 425 seconds and four subjects with the lowest changes. For all subjects, 100 µl of fresh saliva were added to 100 g of gelled 6% starch at time “0”. The viscosity decay curves overlapped for approximately 10 seconds for all individuals, indicating that amylase requires active mixing to become effective. After mixing, however, salivary amylolytic activity was highly individualized among subjects (See Table S1 in File S1). The inset in Figure 1 depicts the curves of all subjects to illustrate the full range of salivary activity.

This graph represents the four subjects with the least overall change in viscosity (upper curves) and the four with the greatest overall change (lower curves). The inset graph shows the data from all saliva samples analyzed in the MVAG (n = 42). In both graphs, the data from each subject is represented by a different colored line. 100 µl of each subject's saliva were added to 100 g of starch at 37.5°C. Saliva was added to the starch at time “0” and constituted ∼0.1% of the starch solution.

Salivary Amylase Measures

Immunoblotting and an enzymatic assay were used to independently quantify amylase amount/ml and activity/ml, respectively, in each saliva sample. We observed significant variation among individuals in terms of the amount and activity of amylase produced per unit saliva (Table S2 in File S1). The average amount (±SD) of amylase was 2.64 mg/ml (±1.8), with a range of 0 to 7.5 mg/ml, while the average concentration per minute was 5.7 mg/min (±7.1) (range 0–42.8 mg/min). The average activity per unit saliva was 93 U/ml (±62), ranging from 1 to 371 U/ml. The average activity per minute was 177 U/min (±166), with a range of 2 to 900 U/min. Males and females did not differ significantly in either their amylase amounts or activity.

All three salivary measures (1. amylase amount per ml of saliva, 2. enzymatic activity per ml of saliva, 3. reduction of starch viscosity by 100 µl saliva injection into the MVAG) were significantly correlated with one another. The relationship between amylase amount (mg/ml) and overall viscosity change in the MVAG (Figure 2A) had an r value of 0.58 (P<0.0001) and the correlation between amylase activity (U/ml) and change in MVAG (Figure 2B) had an r value of 0.67 (P<0.001). As seen in Figure 2C, amylase amount and activity were significantly correlated with one another (r = 0.61, P<0.001), as well.

Both salivary amylase amount/ml (A) and salivary activity/ml (B) were significantly related to the overall change in viscosity measured by the MVAG. Salivary amylase amount/ml and salivary activity/ml were also significantly correlated with one another (C). Note that the saliva samples analyzed in the MVAG (n = 41) are a subset of those samples analyzed by Western blot and enzymatic assay (n = 73).

AMY1 Gene Copy Number and Salivary Amylase

DNA samples were collected from 62 subjects and analyzed by qPCR to determine gene copy number. Values were standardized to a human DNA sample with a known AMY1 gene copy number verified by Fiber FISH. The median number of AMY1 gene copies was four (mean = 4.4±2), with a range of 1 to 11 (Table S2 in File S1). Salivary amylase amount/ml and gene copy number were significantly correlated (r = 0.50, P<0.0001; Figure 3). Salivary amylase activity/ml also increased as gene copy number increased (r = 0.52, P<0.0001) (not shown), consistent with the correlation between salivary amylase concentration and salivary enzyme activity (Figure 2C).

There was a significant positive relationship between AMY1 diploid copy number and amylase amount/ml (n = 62).

Oral Perception of Viscosity

The mean perceived time-intensity viscosity functions of the three stimuli (starch, gum, and water) are presented in Figure 4A (See Table S3 in File S1 for data). As expected, subjects rated water as having a perceived viscosity very close to zero, which did not fluctuate during the 60 second measurement. After reaching a peak, ratings for the xanthan gum stimulus slightly decreased over the trial period, most likely due to volumetric thinning from salivary mixing, but otherwise remained stable over time. The shape of the starch viscosity rating curve suggested a two-stage process: an initial “mixing” phase, in which the subject manipulated the bolus in their mouth and mixed it with saliva (in Figure 4A, approximately 0–10 seconds) and a second “amylolytic activity” stage characterized by a negatively accelerating decrease in starch viscosity ratings over the remaining 50 seconds.

Average time-intensity ratings for the three stimuli (A). As demonstrated by LMS ratings from six individuals (each portrayed by a different colored line/shape), subjects were highly variable in their use of the LMS scale when rating starch viscosity during the trial (B). LMS ratings were normalized to 100 at 5 seconds in order to remove subjective noise and enable observation of the effects of amylase on viscosity ratings (C). Note that Panels B and C contain LMS rating data from the same six subjects; each individual is represented by the same color line in each panel.

There were large individual differences in the viscosity ratings of starch (Figure 4B). To diminish the impact of subjective ratings, LMS ratings were normalized to 100, beginning at 5 seconds into the function (Figure 4C). The data were analyzed over the remaining 55 seconds by calculating 1) the overall change in ratings from peak to nadir and 2) the time at which the curve reached ½ viscosity rating following peak for each curve.
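The normalization and the two summary measures can be illustrated with a short Python sketch. This is not the authors' analysis code. The function names and the example ratings are hypothetical, and two points are assumptions: that normalization rescales each curve so its 5-second rating equals 100, and that the "½ viscosity rating" is half of the peak rating.

```python
def normalize_to_100(times, ratings, anchor_time=5):
    """Rescale a rating curve so the rating at anchor_time is 100; drop earlier samples."""
    anchor = ratings[times.index(anchor_time)]
    return [(t, 100.0 * r / anchor) for t, r in zip(times, ratings) if t >= anchor_time]

def peak_to_nadir_change(curve):
    """Overall change: peak rating minus the lowest rating reached after the peak."""
    values = [v for _, v in curve]
    peak_index = values.index(max(values))
    return values[peak_index] - min(values[peak_index:])

def time_to_half_peak(curve):
    """First sampled time after the peak at which the rating is at or below half the peak."""
    values = [v for _, v in curve]
    peak_index = values.index(max(values))
    half = values[peak_index] / 2
    for t, v in curve[peak_index:]:
        if v <= half:
            return t
    return None  # the curve never fell to half of its peak

# Hypothetical raw LMS ratings for one subject (seconds, rating)
times = [0, 5, 10, 20, 30, 40, 50, 60]
ratings = [10, 40, 36, 30, 24, 20, 18, 16]

curve = normalize_to_100(times, ratings)
change = peak_to_nadir_change(curve)
half_time = time_to_half_peak(curve)
print(curve)
print(change, half_time)
```

For this made-up subject, the normalized curve falls from 100 to 40, giving an overall change of 60, and reaches half of its peak at 40 seconds.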

In order to assess the relationship between the amount/activity of salivary amylase during the 60 second testing session and viscosity ratings, the enzyme concentration/minute of saliva flow and activity/minute of flow were divided into quartiles. Subjects with higher salivary amylase concentrations (Figure 5A) (F(3,69) = 2.28, P<0.05) and salivary activity (Figure 5B) (F = 3.1, P<0.05) had greater overall changes in perceived starch viscosity than subjects with lower enzyme levels. Furthermore, these subjects also reported significantly faster decreases in viscosity over the course of the following 60 seconds (Figures 5C and D) (F = 3.12, P<0.05 and 3.2, P<0.05, respectively). Importantly, there was no significant relationship between overall change in the control stimulus (xanthan gum) viscosity and amylase levels (mg/min, P = 0.64) or activity (U/min, P = 0.51), which demonstrates the specificity of the enzyme for starch.

Subjects with higher salivary amylase concentrations/ml (A) and salivary activity/ml (B) had greater overall changes in perceived viscosity. These subjects also reached ½ perceived viscosity levels significantly faster (C and D). The dashed line within each box represents the mean value, while the upper and lower boundaries of the box represent the 75th and 25th percentiles, respectively. The error bars above and below the box indicate the 90th and 10th percentiles. Points with different letters are significantly different from one another. Mg/min quartiles: 1 = 0–1.5, 2 = 1.51–2.99, 3 = 3–10, and 4 = >10 mg/min. U/min quartiles: 1 = 0–60, 2 = 61–120, 3 = 121–220, and 4 = >220 U/min.

It is also worth noting that the in vivo LMS ratings of starch viscosity at 60 seconds were significantly related to the in vitro viscosity measurements from the MVAG at 7 minutes (r = 0.27, P<0.05). This highlights that the perception of starch viscosity as it breaks down in the mouth is directly related to the activity of salivary amylase on starch, since this is the only variable measured by the microviscoamylograph.

The relationship between AMY1 gene copy number and the perception of starch viscosity was also examined. Overall change in perceived viscosity over time and the time to reach ½ perceived viscosity were not significantly related to the number of gene copies in this data set (P = 0.19 and P = 0.54, respectively) (not shown).


INTRODUCTION

Massive sequencing efforts, such as those of The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), have generated a comprehensive collection of sequenced genomes of cancer patients, opening a new era for genomics. Advanced analyses of genomic sequencing data require accurate estimation of DNA cellularity (purity, 1 – DNA admixture) and tumor ploidy to allow appropriate comparative computation. DNA admixture refers to the amount of non-cancer cells in a tumor sample, whereas ploidy represents the average number of chromosome sets in a cell. Healthy human cells are diploid, whereas tumor cells often demonstrate a dramatically variable ploidy number, depending on the tumor type (Chunduri & Storchova, 2019; Danielsen, Pradhan, & Novelli, 2016). The impact of ploidy changes on tumor evolution and prognosis is as yet unclear, but recent pan-cancer studies have shed some light on this issue. In a primary tumor pan-cancer cohort from the TCGA project, cell proliferation and immune evasion, two hallmarks of cancer, were deregulated in high-aneuploidy samples (Davoli, Uno, Wooten, & Elledge, 2017; Taylor et al., 2018). In a pan-cancer cohort of 9,692 patients with advanced disease, aneuploidy was associated with poor survival (Bielski et al., 2018).

A recent review (Aran, Sirota, & Butte, 2015) highlighted the importance of purity estimation in analyzing sequencing data. For instance, phylogenetic reconstruction of tumor evolution from multisample DNA sequencing data from a single patient stringently relies on the quantification of the variant allelic fraction (VAF) of single-nucleotide variants (SNV) (Gundem et al., 2015), which is affected by both the DNA admixture (normal cells dilute SNV VAFs) and the ploidy (polyploidy increases the total number of alleles) of each tumor sample. The same issues also affect the determination of the absolute number of copies of a genomic segment in a tumor sample (Carter et al., 2012). Many computational methods identify somatic copy-number aberrations from the relative amounts of DNA in a tumor and its matched normal sample, but accurate estimation of the integer number of copies of each allele requires purity and ploidy adjustments (Bao, Pu, & Messer, 2014).

These considerations call for the development of computational tools to quantify tumor purity and ploidy. In the pre-sequencing era, several tools were developed for high-density single-nucleotide polymorphism (SNP) array data (e.g., Carter et al., 2012; Van Loo et al., 2010); with these, typically the tumor-to-control-signal log ratio (hereafter logR) and the abundance of allele-specific signal (B allele frequency, BAF) distributions are jointly analyzed to infer DNA admixture and ploidy. However, array-based tools are limited by the number of the genomic bases assayed (mainly in the range of 0.5 million to 2 million sites) and by the signal dynamic range. Next-generation sequencing platforms overcome these limitations while preserving the same data features to exploit (Aran et al., 2015): allelic fraction (AF) of inherited heterozygous SNP loci (hereafter called informative SNPs) and sequencing coverage resemble the BAF and logR data of SNP arrays, respectively. The statistically richer data offered by sequencing makes it possible to perform more complex analyses such as allele-specific copy-number and clonality estimates.

In general, available methods to estimate ploidy and DNA admixture adopt a global approach, and the distributions of AFs and logR values are conjointly used to infer DNA admixture and ploidy. Intuitively, it is evident that the AF of informative SNPs is distributed around 0.5 in a 100% admixed tumor sample (up to the reference mapping bias; Degner et al., 2009), and lower AFs imply lower DNA admixture. LogR data are used as a covariate, as AF also depends on the number of available alleles. If no tumor cell subpopulations are present (that is, if the copy-number profile of a tumor sample is homogeneous, i.e., the ratio of subclonal deletions/amplifications is low), global inference approaches capture the DNA admixture content well. However, in the presence of complex genomic events, such as chromothripsis (Stephens et al., 2011) or chromoplexy (Baca et al., 2013), or after multiple treatments that diversify the tumor cell population, global approaches are suboptimal.

CLONET (CLONality Estimate in Tumor; Prandi et al., 2014) is a stand-alone tool specifically designed with a local approach to clonality estimation to handle heterogeneous tumor samples. Briefly, consider a tumor sample T with a hemizygous deletion HeD and the set of informative SNPs S lying within HeD. The AF value of SNPs in S is the convolution of the AF of the different cell populations composing T. If HeD is subclonal (that is, not all tumor cells harbor this deletion), the tumor sample comprises three main cell populations: (i) non-tumor cells contributing to DNA admixture, with expected AFs of SNPs in S around 0.5; (ii) tumor cells not harboring HeD, such that the AFs of SNPs in S cannot be distinguished from those of non-tumor cells; and (iii) tumor cells harboring HeD, in which the AF could either be equal to 1 (if the deleted allele harbors the alternative base) or to 0 (if the deleted allele harbors the reference allele). Based on the observation that apparent DNA admixture is higher in subclonal deletions than in clonal deletions, CLONET estimates DNA admixture at each hemizygous deletion and then identifies the most clonal deletions to finally designate the sample DNA admixture. This results in a more accurate estimation of DNA admixture, which would otherwise be overestimated, in tumors with a significant fraction of subclonal deletions.

Here, we present CLONET version 2 (CLONETv2), an R package (R Core Team, 2017) available at The Comprehensive R Archive Network (https://cran.r-project.org/) that includes significant improvements over the original CLONET implementation. This is the result of its application to several clinical cohorts, including tissue and plasma samples, and to a variety of sequencing platforms, such as whole-genome, whole-exome, and targeted sequencing panels. In Carreira et al. (2014), CLONET was used to estimate DNA admixture from a custom sequencing panel of ∼40 kb designed to analyze circulating tumor DNA of plasma samples from metastatic patients, and the algorithm was modified to improve sensitivity in samples with <10% tumor cells. In Beltran et al. (2016), CLONET was extended to provide allele-specific copy-number data from whole-exome sequencing experiments; for each genomic segment in each study cohort tumor, the study reports the number of copies of each allele using ploidy, DNA admixture, logR, and the AF of informative SNPs. In Faltas et al. (2016), the clonality analysis capability of CLONET was improved to account for complex allele-specific combinations and SNVs. Since its initial conception and application to whole-genome sequencing data (Baca et al., 2013; Prandi et al., 2014), CLONET improvements have been used in several studies (including Beltran et al., 2015; Boysen et al., 2015; Cancer Genome Atlas Research Network, 2015; and Mu et al., 2017). Here, we present a documented version of CLONETv2 to uniformly highlight the approach features and propose it as an R package to make the tool available to a broader audience.

All reads of a human DNA next-generation sequencing experiment that map within a genomic segment derive from either one of the parental chromosomes of origin. Reads can be split into two sets: a copy-number-neutral set that contains equal numbers of reads from the maternal and paternal chromosomes, and an active reads set that includes sequences from only one parent. Generally speaking, given two random reads, it is impossible to determine whether or not they represent the same allele; however, if the two reads span an informative SNP, the allele of origin can be identified. For reads over informative SNPs, the number of reads (local coverage) supporting the reference or the alternative SNP represents the number of copies and the origin of the alleles present in the tumor sample. Each informative SNP can be characterized by its allelic fraction (AF), which depends on the genomic context. For instance, let us consider the two informative SNPs within a monoallelic deletion of the genomic segment denoted A in Figure 1A. At position p1, only the alternative allele is present and AF = 1, whereas at position pn, the alternative allele is deleted and AF = 0. In contrast, in the wild-type genomic segment B, the AF values of informative SNPs at positions pn+1 and pm are distributed around 0.5, as both alleles contribute equally to the local coverage. Now, the percentage of neutral reads (known as beta, β) at p1 and pn is equal to 0, regardless of which allele is deleted, whereas at the wild-type genomic positions pn+1 and pm, beta approximates 1, as no active reads are present. Overall, SNPs within somatically aberrant segments are easier to characterize using the beta values than the AFs, as the former are independent of which allele is deleted. In a heterogeneous tumor sample, the distributions of AFs and betas result from the convolution of the distributions observed in basic wild-type and monoallelic deleted segments. 
As an example, Figure 1B depicts the distribution of the AF and the associated beta of the informative SNPs in genomic segments A and B in the case of a normal cell, whereas Figures 1C and 1D show how the distributions change in tumor cells with monoallelic deletion of only genomic segment A, or of both A and B, respectively. Figure 1E represents the case of a tumor sample with one normal cell (Fig. 1B) and nine “tumor cells 1” (Fig. 1C). The DNA admixture is 1/10, and the AF could assume values around 1/11 or 10/11, whereas beta is 2/11. Genomic segment B is not deleted, and therefore the AF and the beta are as in the normal cell. Figure 1F represents a more complex situation involving one normal cell (Fig. 1B), three “tumor cells 1” (Fig. 1C), and six “tumor cells 2” (Fig. 1D). The AF and beta of informative SNPs in genomic segment A are as in Figure 1E, but only the six “tumor cells 2” carry the monoallelic deletion of genomic segment B. In this case, the AF distribution modes are centered on 4/14 and 10/14, depending on the depleted base, whereas beta is 8/14. The full characterization of beta is described by Prandi et al. (2014), and in Beltran et al. (2016) we defined CLONET master equations that describe allele-specific copy number of maternal and paternal alleles, cnM and cnP, as a function of the percentage of neutral reads beta, the log2 ratio values adjusted by ploidy logRp, and the DNA admixture G, as:

(1)

where maternal and paternal allele are arbitrarily assigned. Figure 2 sketches the transformation of the log2 ratio space implied by Equation 1. Figure 2A reports the histogram of the log2 ratio signal in a tumor sample: peaks in the distribution correspond to different copy-number states, whereas deviations from the position of the expected peaks (below) depend on ploidy and DNA admixture values. It is difficult to identify the peak that corresponds to wild-type segments using only log2 ratio signal. When we expand the monodimensional logR space with beta (Fig. 2B), segments that contribute to the same peak along the logR dimension form different clusters in the beta-vs.-logR space. Of note, the beta-vs.-logR plot still reflects ploidy and DNA admixture, whereas the cnM and cnP space (see Equation 1) allows straightforward interpretation of the copy number and clonality status of each genomic segment.
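The fractions quoted for the Figure 1E and 1F examples can be checked by counting alleles per cell population. The sketch below is only an illustration of that counting argument, not CLONETv2 code: the function name and the tuple layout are hypothetical, and it assumes every cell carrying a deletion has lost the same parental allele, so that the neutral reads are twice the minor-allele copies in each cell.

```python
from fractions import Fraction

def segment_signal(populations):
    """Return (beta, lower AF mode, upper AF mode) for one genomic segment.

    populations: list of (n_cells, copies_allele_1, copies_allele_2).
    Assumption: all deleted cells lost the same parental allele.
    """
    allele_1 = sum(n * c1 for n, c1, c2 in populations)
    allele_2 = sum(n * c2 for n, c1, c2 in populations)
    total = allele_1 + allele_2
    # Neutral reads: equal maternal and paternal contributions within each cell.
    neutral = 2 * sum(n * min(c1, c2) for n, c1, c2 in populations)
    beta = Fraction(neutral, total)
    af_low, af_high = sorted((Fraction(allele_1, total), Fraction(allele_2, total)))
    return beta, af_low, af_high

# Figure 1E, segment A: one normal cell plus nine cells with a monoallelic deletion.
fig_1e_segment_a = segment_signal([(1, 1, 1), (9, 1, 0)])

# Figure 1F, segment B: one normal cell, three undeleted tumor cells, six deleted.
fig_1f_segment_b = segment_signal([(1, 1, 1), (3, 1, 1), (6, 1, 0)])
```

With these inputs the first call gives beta 2/11 with AF modes 1/11 and 10/11, and the second gives beta 8/14 with AF modes 4/14 and 10/14, matching the values in the text.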

Inputs:

  1. seg_tb: a table resulting from DNA segmentation; for each genomic segment, the table reports chromosome, start/end position, and the log2 ratio of the tumor over the normal coverage, as defined in the Circular Binary Segmentation algorithm (Olshen, Venkatraman, Lucito, & Wigler, 2004)
  2. pileup_normal, pileup_tumor: two tables reporting allelic fraction and coverage of SNPs in normal and matched tumor samples, respectively; for each SNP, each table reports genomic coordinates (chromosome and position), allelic fraction, and coverage
  3. min_af_het_snps, max_af_het_snps: for each SNP in the pileup_normal table, the minimum and maximum allelic fraction for the SNP to be considered informative
  4. min_required_snps: the minimum number of informative SNPs in a genomic segment from seg_tb to retain the segment
  5. min_coverage: the minimum mean coverage of informative SNPs to retain a segment.

Values reported for each input segment:

  1. beta: estimated value for the input segment
  2. nsnps: number of informative SNPs in the input segment
  3. cov: mean coverage of informative SNPs in the input segment
  4. n_beta: estimated value for the input segment considering the matched normal sample. This value is expected to be 1, except in the case of germline copy-number variation or sequencing-related errors.

Summary statistics:

  1. number of processed segments: the number of segments in the input seg_tb table
  2. number of segments with a valid beta estimate: the number of input segments for which a beta value is computed; this value is affected by the number of informative SNPs and their mean coverage
  3. quantiles of input segment lengths: the quantiles of the distribution of the length of the input segments; the expected distribution depends on the segmentation algorithm used to produce the seg_tb table, but in general small values result in a low number of informative SNPs, whereas large segments may indicate undersegmentation that in turn affects beta estimates
  4. quantiles of informative SNPs input segment coverage: the quantiles of the distribution of the mean coverage of the input segments; expected coverage depends on the sequencing experiment, but a low value may indicate problems with the input sample
  5. quantiles of number of informative SNPs per input segment: the quantiles of the distribution of the number of informative SNPs in the input segments; the expected number of informative SNPs per kb is ∼0.33 (based on common SNPs), and therefore this value combined with input segment lengths gives information about the quality of the pileup data.
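How the thresholds min_af_het_snps, max_af_het_snps, min_required_snps, and min_coverage interact can be sketched as below. This is not the CLONETv2 implementation: the function names, the dictionary layout, the default values, and the choice to take mean coverage from the tumor pileup are all assumptions made for the example.

```python
from statistics import mean

def informative_tumor_coverages(segment, pileup_normal, pileup_tumor,
                                min_af_het_snps, max_af_het_snps):
    """Tumor coverage of SNPs that look heterozygous in the normal sample.

    segment: (chromosome, start, end); pileups: {(chromosome, position): (af, coverage)}.
    """
    chrom, start, end = segment
    coverages = []
    for (snp_chrom, pos), (normal_af, _normal_cov) in pileup_normal.items():
        in_segment = snp_chrom == chrom and start <= pos <= end
        heterozygous = min_af_het_snps <= normal_af <= max_af_het_snps
        if in_segment and heterozygous and (snp_chrom, pos) in pileup_tumor:
            coverages.append(pileup_tumor[(snp_chrom, pos)][1])
    return coverages

def retain_segment(segment, pileup_normal, pileup_tumor,
                   min_af_het_snps=0.2, max_af_het_snps=0.8,  # illustrative values
                   min_required_snps=10, min_coverage=20):
    """True if the segment has enough informative SNPs at sufficient mean coverage."""
    coverages = informative_tumor_coverages(
        segment, pileup_normal, pileup_tumor, min_af_het_snps, max_af_het_snps)
    # max(1, ...) avoids taking the mean of an empty list.
    return len(coverages) >= max(1, min_required_snps) and mean(coverages) >= min_coverage

# Toy data: the SNP at position 200 is homozygous in the normal sample, so it is skipped.
normal = {("chr1", 100): (0.50, 30), ("chr1", 200): (0.98, 40),
          ("chr1", 300): (0.45, 50), ("chr2", 100): (0.50, 30)}
tumor = {("chr1", 100): (0.90, 40), ("chr1", 200): (1.00, 40), ("chr1", 300): (0.10, 60)}
kept = retain_segment(("chr1", 1, 1000), normal, tumor, min_required_snps=2)
```

Segments that fail either test would have no valid beta estimate, which is what the "number of segments with a valid beta estimate" summary counts.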

Necessary Resources

Hardware

64-bit computer running Linux with ≥8 GB RAM

Software

The library has been tested with R version 3.5.2 and the R libraries parallel 3.5.2, ggplot2 3.1.0, sets 1.0-18, arules 1.6-3, and ggrepel 0.8.0

1. Prepare tumor and normal pileups as described in Support Protocol 1 or with other computational tools. The output of this step comprises two files, tumor.pileup and normal.pileup .

2. Prepare tumor segmented data in the file tumor_segments.txt with columns compatible with the parameter seg_tb described above.

  • > seg_tb <- read.table(system.file("sample.seg", package = "CLONETv2"), header = T, as.is = T)
  • > pileup_tumor <- read.table(system.file("sample_tumor_pileup.tsv", package = "CLONETv2"), header = T, as.is = T)
  • > pileup_normal <- read.table(system.file("sample_normal_pileup.tsv", package = "CLONETv2"), header = T, as.is = T)
  • Computed beta table of sample "sample1"
  • Number of processed segments: 65
  • Number of segments with valid beta: 49 (75%)
  • Quantiles of input segment lengths:
    • 0%: 2860
    • 25%: 17504185
    • 50%: 38004799
    • 75%: 59311449
    • 100%: 147311449
  • Quantiles of input segment coverage:
    • 0%: 47.0000
    • 25%: 137.7893
    • 50%: 168.3820
    • 75%: 186.6769
    • 100%: 695.6145
  • Quantiles of number of informative SNPs per input segment:
    • 0%: 0
    • 25%: 12
    • 50%: 99
    • 75%: 213
    • 100%: 404

    This protocol describes the steps used to prepare pileup data from a set of SNPs and matched tumor and normal .bam (BAM) files (Li et al., 2009 ). The tables pileup_normal and pileup_tumor report allelic fraction and coverage for a set of SNP positions. Candidate SNP positions can be downloaded directly from the dbSNP FTP server (ftp://ftp.ncbi.nlm.nih.gov/snp/). We suggest starting from the largest possible set of SNPs, as the larger the number of informative SNPs, the more reliable the CLONETv2 estimates. Pileups from BAM files can be obtained using any of several tools. Here we describe how to prepare pileups using ASEQ (Romanel, Lago, Prandi, Sboner, & Demichelis, 2015 ), a tool freely available at http://demichelislab.eu/tools/ASEQ.

    Necessary Resources

    Hardware

    64-bit computer running Linux with ≥8 GB RAM

    Software

    Input files

    • BAM files tumor.bam and normal.bam containing aligned reads from genomic sequencing experiments of matched tumor and normal DNA samples, respectively
    • VCF (Degner et al., 2009) file known_snp_positions.vcf reporting known SNP positions. ASEQ requires that the input VCF only lists SNPs, i.e., columns ALT and REF must contain one of the values A, C, G, or T.

    ASEQ parameters include:
    1. mrq: minimum read quality (ASEQ does not consider as part of the pileup reads with read quality < mrq)
    2. mbq: minimum base quality (ASEQ does not consider as part of the pileup bases with quality < mbq)
    3. mdc: minimum depth of coverage (ASEQ output only reports positions with coverage ≥ mdc)
    4. threads: number of threads available for ASEQ computation.

    ASEQ code will be available in the subfolder binaries/linux64/ .

    ASEQ examples will be available in the subfolder examples/VCF_samples/ .

    • $ ./binaries/linux64/ASEQ mode=PILEUP vcf=examples/VCF_samples/sample1.vcf bam=examples/BAM_samples/sample1.bam mbq=20 mrq=20 mdc=1 threads=1 out=.

    ASEQ produces the file sample1.PILEUP.ASEQ , reporting allelic fraction and read coverage from the BAM file sample1.bam , for each position in the VCF file sample1.vcf . The parameters mbq = 20 and mrq = 20 tell ASEQ to ignore, respectively, bases and reads with quality <20. The parameter mdc = 1 instructs ASEQ to ignore positions in the BAM file with no reads. The parameters and the format of the output file .PILEUP.ASEQ are compatible with pileup data required in Basic Protocol 1.

    Segmentation algorithms partition input genomic space into segments with homogeneous coverage. Given a pair of matched tumor and normal samples, the logR value of a genomic segment is the log2 of the ratio between the tumor coverage and the normal sample coverage within the segment. To account for different mean coverage in different sequencing experiments, logR is normalized over the ratio between the mean tumor and the mean normal coverage; this applies both to whole-genome and whole-exome data. In the case of higher coverage in the tumor sample, if without normalization the ratio between the mean tumor and the mean normal coverage is X, a wild-type segment would have logR = log2(X), whereas the expected value is 0 (i.e., same number of alleles between tumor and normal samples). The normalization would, however, introduce a bias whenever the difference in mean coverage between the tumor and the normal sample was due to the presence of an abnormal number of alleles in the tumor (aneuploid) genome. In this case, the normalization leads to a shift in the logR signal. Figure 3A shows an example of a diploid genome sample with 127× and 69× mean tumor and mean normal coverage, respectively. The logR signal is centered on 0, as expected (green line). Figure 3B highlights a more complex case: tumor and normal mean coverage are comparable (125× and 117×, respectively), but the position of the wild-type segments (orange line) is shifted with respect to the expected value (green line). The shift is representative of the total number of alleles in the genome, and ploidy can be estimated as:

    (2)

    The proof (Equation 2) is reported in the paper originally describing CLONET (Prandi et al., 2014 ). The example in Figure 3A has a logR shift of 0 and ploidy of 2, whereas the example in Figure 3B has a logR shift of –0.34 and a ploidy of 2.53.
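The coverage normalization and the two quoted (shift, ploidy) pairs can be reproduced with a few lines of arithmetic. In the sketch below, normalized_logr follows the normalization described in the text, whereas ploidy_from_shift uses the relation ploidy = 2 × 2^(−shift). That relation is an assumption inferred from the two worked values (shift 0 gives ploidy 2; shift −0.34 gives ploidy 2.53), not a transcription of Equation 2. Both function names are hypothetical.

```python
import math

def normalized_logr(seg_tumor_cov, seg_normal_cov, mean_tumor_cov, mean_normal_cov):
    """log2 tumor/normal coverage ratio of a segment, normalized by the
    ratio of the sample-wide mean coverages."""
    segment_ratio = seg_tumor_cov / seg_normal_cov
    global_ratio = mean_tumor_cov / mean_normal_cov
    return math.log2(segment_ratio / global_ratio)

def ploidy_from_shift(logr_shift):
    """Assumed relation (inferred from the quoted examples): 2 * 2^(-shift)."""
    return 2 * 2 ** (-logr_shift)

# A wild-type segment in a diploid sample sequenced at 127x (tumor) and 69x (normal):
# the normalization brings its logR back to 0 despite the coverage difference.
wild_type_logr = normalized_logr(127, 69, 127, 69)

ploidy_fig_3a = ploidy_from_shift(0.0)
ploidy_fig_3b = ploidy_from_shift(-0.34)
```

A negative shift of the wild-type peak therefore corresponds to a genome with more than two copies on average, which is the situation described for Figure 3B.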

    1. beta_table : a table created using the function described in Basic Protocol 1
    2. max_homo_dels_fraction (default 0.05): homozygous deletions can be a confounding factor in the determination of sample ploidy; the parameter sets the percentage of genomic segments that will be excluded from ploidy computation as putative homozygous deletions, and overestimating this value does not affect ploidy computation
    3. beta_limit_for_neutral_reads (default 0.90): in theory, neutral reads correspond to beta = 1, but experimental noise lowers this value; therefore, only segments with beta above the limit are used to compute ploidy
    4. min_coverage (default 20): only genomic segments with average coverage at least min_coverage are used to compute DNA admixture
    5. min_required_snps (default 10): only genomic segments covering at least min_required_snps informative SNPs are considered for DNA admixture computation.

    The function returns the ploidy for the input sample.

    Necessary Resources

    Hardware

    64-bit computer running Linux with ≥4 GB RAM

    Software

    The library has been tested with R version 3.5.2 and R libraries parallel 3.5.2, ggplot2 3.1.0, sets 1.0-18, arules 1.6-3, ggrepel 0.8.0.

    2. Compute beta table as described in Basic Protocol 1.

    BASIC PROTOCOL 3: COMPUTING DNA ADMIXTURE

    1. beta_table : a table created using the function described in Basic Protocol 1
    2. ploidy_table : a table created using the function described in Basic Protocol 2
    3. min_coverage (default 20): only genomic segments with average coverage at least min_coverage are used to compute DNA admixture
    4. min_required_snps (default 10): only genomic segments covering at least min_required_snps informative SNPs are considered for DNA admixture computation
    5. error_tb: the number of informative SNPs and the coverage of the considered segment affect the accuracy of the beta estimate for a genomic segment. The table error_tb reports, for each combination of number of informative SNPs and coverage, the expected error around the beta estimate. CLONETv2 embeds a pre-computed error_tb (details in Prandi et al., 2014) previously tested in several studies (Beltran et al., 2015; Beltran et al., 2016; Faltas et al., 2016). However, specific experimental settings, such as ultra-deep targeted sequencing or low-pass whole-genome sequencing, may require an ad hoc error_tb table.

    The function returns the estimated DNA admixture for the input sample as well as minimum and maximum DNA admixture values accounting for errors around beta estimates.

    Necessary Resources

    Hardware

    64-bit computer running Linux with ≥4 GB RAM

    Software

    The library has been tested with R version 3.5.2 and the R libraries parallel 3.5.2, ggplot2 3.1.0, sets 1.0-18, arules 1.6-3, and ggrepel 0.8.0

    2. Compute beta table as described in Basic Protocol 1.

    3. Compute ploidy table as described in Basic Protocol 2.

    SUPPORT PROTOCOL 2: VISUALIZING AND INTERPRETING BETA TABLE, PLOIDY, AND DNA ADMIXTURE

    Basic Protocol 1 describes how to derive the value of beta for a genomic segment. A tumor sample is then described as a set of (beta, logR) values, extending the usual logR space and enabling the computation of ploidy and DNA admixture in Basic Protocols 2 and 3, respectively. To help interpret the results of Basic Protocols 1 to 3, CLONETv2 provides the function check_ploidy_and_admixture, which plots the beta-vs.-logR space for a given sample. Figures 4A and 4B show the values of beta against the logR of the same samples presented in Figures 3A and 3B, respectively. For each genomic segment, the plot reports the logR as well as the beta computed by the function compute_beta_table. To help the user, the function predicts the expected (beta, logR) values given the input ploidy and DNA admixture level according to the equations described in the CLONET paper (Prandi et al., 2014). Predicted values are computed for different combinations of allele-specific copy numbers (see Basic Protocol 4) and represented as red circles. Comparing the expected (red circles) and the observed (gray dots) values helps the interpretation of the estimates. For instance, segments with logR near 0 in Figure 3B cannot be wild type, as their betas are ∼0.8, a value compatible with the presence of three DNA copies.

    Necessary Resources

    Hardware

    64-bit computer running Linux with ≥4 GB RAM

    Software

    The library has been tested with R version 3.5.2 and the R libraries parallel 3.5.2, ggplot2 3.1.0, sets 1.0-18, arules 1.6-3, and ggrepel 0.8.0

    2. Follow Basic Protocols 1, 2, and 3 to compute beta table bt , ploidy table pl , and DNA admixture table adm , respectively.

    check_plot is a ggplot object (Wickham, 2009 ) that can be customized by the user (e.g., for font size, color, line width).

    BASIC PROTOCOL 4: COMPUTING ALLELE-SPECIFIC COPY NUMBER

    1. beta_table: a table (bt) created using the function described in Basic Protocol 1
    2. ploidy_table: a table (pl) created using the function described in Basic Protocol 2
    3. admixture_table: a table (adm) created using the function described in Basic Protocol 3
    4. error_tb: the same error_tb used in the function compute_dna_admixture of Basic Protocol 3, step 4
    5. allelic_imbalance_th (default 0.5): the function compute_allele_specific_scna_table also returns integer values cnA.int and cnB.int for cnA and cnB, respectively. The value cnA.int is the rounded-off value of cnA if | cnA.int - cnA | < allelic_imbalance_th; otherwise, cnA.int is not defined. cnB.int is defined similarly with respect to cnB.

    1. log2.corr : logR value adjusted by ploidy and purity: i.e., the logR value the segment would have in a diploid 100% pure tumor sample
    2. cnA, cnB: number of copies of the major (cnA) and minor (cnB) allele; the values do not contain information about ploidy and purity; indeed, cnA + cnB equals 2 × 2^log2.corr
    3. cnA.int , cnB.int : integer number of copies of major and minor alleles, respectively.
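The two numeric statements about these outputs (cnA + cnB equals 2 × 2^log2.corr, and the integer values are defined only when the estimate is close enough to a whole number) can be illustrated as follows. This is a sketch, not package code: the function names are hypothetical, and the rounding mode (half rounded up) is an assumption, since the text only says "rounded-off".

```python
import math

def log2_corr_from_alleles(cn_a, cn_b):
    """Invert cnA + cnB = 2 * 2^log2.corr to recover log2.corr."""
    return math.log2((cn_a + cn_b) / 2)

def integer_copy_number(cn, allelic_imbalance_th=0.5):
    """Rounded copy number, or None when the estimate is too far from an integer."""
    cn_int = math.floor(cn + 0.5)  # assumed rounding mode: half rounds up
    return cn_int if abs(cn_int - cn) < allelic_imbalance_th else None

wild_type = log2_corr_from_alleles(1, 1)        # two copies in total
hemizygous_loss = log2_corr_from_alleles(1, 0)  # one copy in total
near_two = integer_copy_number(1.9)
ambiguous = integer_copy_number(1.3, allelic_imbalance_th=0.2)
```

Lowering allelic_imbalance_th makes the integer call stricter: an estimate of 1.3 is reported as 1 at the default threshold of 0.5 but is left undefined at a threshold of 0.2.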

    Necessary Resources

    Hardware

    64-bit computer running Linux with ≥4 GB RAM

    Software

    The library has been tested with R version 3.5.2 and the R libraries parallel 3.5.2, ggplot2 3.1.0, sets 1.0-18, arules 1.6-3, and ggrepel 0.8.0

    2. Follow Basic Protocols 1, 2, and 3 to compute beta table bt , ploidy table pl , and DNA admixture table adm , respectively.

    BASIC PROTOCOL 5: COMPUTING SOMATIC COPY-NUMBER CLONALITY

    1. beta_table : a table created using the function described in Basic Protocol 1
    2. ploidy_table : a table created using the function described in Basic Protocol 2
    3. admixture_table : a table created using the function described in Basic Protocol 3
    4. error_tb: the same error_tb used in the function compute_dna_admixture of Basic Protocol 3; the error around beta is propagated to the clonality estimate and used in its discretization
    5. clonality_threshold (default = 0.85): the function compute_scna_clonality_table returns minimum and maximum clonality for input genomic segments; clonality_threshold is used to discretize clonality as described by Prandi et al. (2014)
    6. beta_threshold (default = 0.9): input beta values below beta_threshold are marked as potentially aberrant and used for clonality estimates.

    1. clonality : real value representing the estimated percentage of tumor cells with uniform copy number for a given genomic segment
    2. clonality.min , clonality.max : real values representing minimum and maximum estimated clonality given the distribution of beta and logR values
    3. clonality.status : discretized clonality.

    Necessary Resources

    Hardware

    64-bit computer running Linux with ≥4 GB RAM

    Software

    The library has been tested with R version 3.5.2 and the R libraries parallel 3.5.2, ggplot2 3.1.0, sets 1.0-18, arules 1.6-3, and ggrepel 0.8.0

    2. Follow Basic Protocols 1, 2, and 3 to compute beta table bt , ploidy table pl , and DNA admixture table adm , respectively.

    BASIC PROTOCOL 6: COMPUTING SINGLE-NUCLEOTIDE VARIANT CLONALITY

    1. snv_read_count : a table reporting in each row the genomic coordinates of an SNV together with the numbers of reference and alternative reads covering the mutated position
    2. beta_table : a table created using the function described in Basic Protocol 1
    3. ploidy_table : a table created using the function described in Basic Protocol 2
    4. admixture_table : a table created using the function described in Basic Protocol 3
    5. error_tb: the same error_tb used in the function compute_dna_admixture of Basic Protocol 3; the error around beta is propagated to assess the clonality estimate boundary and in turn is used for its discretization
    6. error_rate (default = 0.05): fraction of SNVs to exclude based on adjusted VAF distribution.

    1. cnA, cnB : allele-specific copy numbers of the genomic segment containing the SNV
    2. t_af_corr : tumor VAF adjusted for ploidy, admixture, and allele-specific copy number
    3. SNV.clonality : percentage of tumor cells harboring the SNV
    4. SNV.clonality.status : discretized SNV.clonality .

    Necessary Resources

    Hardware

    64-bit computer running Linux with ≥4 GB RAM

    Software

    The library has been tested with R version 3.5.2 and the R libraries parallel 3.5.2, ggplot2 3.1.0, sets 1.0-18, arules 1.6-3, and ggrepel 0.8.0

    2. Follow Basic Protocols 1, 2, and 3 to compute beta table bt , ploidy table pl , and DNA admixture table adm , respectively.

    • > read.table(system.file("sample_snv_read_count.tsv", package = "CLONETv2"), header = T, as.is = T, comment.char = "", check.names = F, na.strings = "-")

    Fetal Loss

    34.3.2 Pattern of Chromosome Abnormalities Seen in Aborted Pregnancies

    The vast majority of chromosomal abnormalities observed in aborted fetuses evaluated by G-banding are numerical, including autosomal trisomies, polyploidy, sex chromosome monosomy and double trisomies (26). A 2009 study, combining G-banding with MLPA and aCGH (27) on 115 first-trimester miscarriages, found 69 (60%) to be chromosomally abnormal. Of these, 69% had autosomal trisomy (including 2% with double trisomies), 12% were polyploid (primarily triploidy), and 10% had sex chromosome monosomy (45,X), with only 1% showing structural abnormalities and the rest showing errors not involving entire chromosomes, such as duplications or deletions. Similar results were reported by combining karyotype analysis with reflex FISH (28), which observed 61% trisomy, 15% polyploidy (primarily triploidy), 14% sex chromosome monosomy and 7% structural abnormalities. It is not surprising that the most common abnormalities seen are autosomal trisomies, as it was recognized as early as 1984 by Hassold and Chiu (5) that the risk of pregnancy loss and the incidence of trisomy resulting from maternal nondisjunction both increase with maternal age, and thus are likely to occur concurrently. Our own data (unpublished) show the most frequent chromosome abnormalities in presumably sporadic fetal losses to be triploidy, sex chromosome monosomy, and trisomies (21, 22, 15, 18, 13 and 16 in descending order; Figure 34-1). A slightly different pattern was observed among losses from women with a history of pregnancy loss, with the most prevalent abnormalities being triploidy, and trisomies 22, 16, 15 and 21. Interestingly, the pattern associated with sporadic loss is similar to that due to meiotic errors (9), while the pattern seen in the women with recurrent loss has been associated with mitotic errors seen in mosaic IVF embryos. 
The relative paucity of sex chromosome monosomy among the recurrent losses might be related to the slightly advanced age (37.3 vs 36.2 years) in this group, as sex chromosome monosomy is most often due to nondisjunction in males, and thus would not necessarily be related to maternal age. Double trisomies, which, except in very rare instances involving the presence of an extra sex chromosome, are not viable, are not uncommon in abortus samples, representing about 1–2% of these cases (29) . They are almost always a result of maternal nondisjunction (30) and are also associated with older maternal age. It should be noted that while studies on preimplantation embryos (see earlier) often report autosomal monosomy, peaking at about the eight-cell stage, such karyotypes are inviable, and autosomal monosomy has not been reported in abortus specimens.

    FIGURE 34-1 . Relative frequency of chromosome abnormalities observed in cytogenetically abnormal POCs from women with a reported history of recurrent pregnancy loss compared to those with reported sporadic pregnancy loss. Mean maternal age was 37.3 years in the recurrent group and 36.2 years in the sporadic group. Number of chromosomes involved presented across the X axis with 23 = double trisomy, 24 = monosomy X, 25 = triploidy, 26 = tetraploidy. The abnormalities that are considered viable (trisomy 13, 18 and 21 and monosomy X) are all more frequent in the group with sporadic losses, with trisomies 15, 16 and 22 being more prevalent among those with recurrent loss.


    Subscriber data

    For each segmentation condition we show here, you'll find a short description of what it controls, and a table that displays all of the options in the drop-down menus. In most cases, there are only three choices to make, but for some condition types, a fourth drop-down menu will appear.

    In almost all cases, you won't see all of the options that appear on this page in the drop-down menu in your account. The drop-down menus that appear in your account are limited to what data is available in the audience you're working with.

    Automation activity

    Automation report data is available in segmenting options, so you can pull segments of subscribers based on whether they've started or completed a certain email automation.

    Campaign activity

    Create segments based on how subscribers have interacted with your email campaigns. For example, use a combination of segmenting criteria to target subscribers who were sent recent campaigns but didn't open them.

    • Any/All of the Last 5 Campaigns

    Here are some examples of how Campaign Activity segments work.

    • Campaign Activity | was sent | All of the Last 5 Campaigns
      Subscribers who received all of the last five email campaigns
    • Campaign Activity | was not sent | All of the Last 5 Campaigns
      Subscribers who received none of the last five email campaigns
    • Campaign Activity | was not sent | Any of the Last 5 Campaigns
      Subscribers who were not sent one or more of the last five email campaigns
    • Campaign Activity | was sent | Any of the Last 5 Campaigns
      Subscribers who have received one or more of the last five email campaigns
    • Campaign Activity | did not open | Any of the Last 5 Campaigns
      Subscribers who did not open one or more of the last five email campaigns
    • Campaign Activity | did not open | All of the Last 5 Campaigns
      Subscribers who opened none of the last five email campaigns
    • Campaign Activity | opened | Any of the Last 5 Campaigns
      Subscribers who opened one or more of the last five email campaigns
    • Campaign Activity | opened | All of the Last 5 Campaigns
      Subscribers who opened all of the last five email campaigns
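The eight combinations above reduce to two choices: whether the verb is negated and whether the scope is "Any" or "All". The sketch below encodes the descriptions given in the examples. It is a hypothetical model of the logic, not Mailchimp code, and the function and argument names are made up for illustration.

```python
def matches(history, negated, scope):
    """Evaluate a Campaign Activity condition for one subscriber.

    history: one boolean per recent campaign (True = was sent, or opened).
    negated: False for "was sent"/"opened", True for "was not sent"/"did not open".
    scope: "any" or "all" of the last campaigns.
    """
    if not negated:
        return all(history) if scope == "all" else any(history)
    # Negated + All: the event happened for none of the campaigns.
    # Negated + Any: the event is missing for at least one campaign.
    return not any(history) if scope == "all" else not all(history)

# A subscriber who opened four of the last five campaigns.
opened = [True, True, False, True, True]
opened_all = matches(opened, negated=False, scope="all")
opened_any = matches(opened, negated=False, scope="any")
did_not_open_any = matches(opened, negated=True, scope="any")
did_not_open_all = matches(opened, negated=True, scope="all")
```

This subscriber falls into "opened | Any" and "did not open | Any", but into neither of the "All" segments.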

    Note: Be careful when sending to activity segments based on scheduled, drafted, or paused campaigns. The total recipient count of your segment will not be finalized until your campaign sends.

    Postcard activity

    Use the Postcard Activity condition to segment your contacts based on whether they’ve been sent a postcard campaign. Postcard Activity segments don't include anyone who received a postcard with our lookalike audience finder, since this tool doesn't add contacts to your audience.

    • was not sent specific postcard

    Contact rating

    Use the Contact Rating condition to create a segment of your most or least engaged subscribers.

    Conversations activity

    Mailchimp's Conversations feature tracks email replies from your subscribers. Use this condition to segment for subscribers who have responded to campaigns via email. Sent campaigns and draft campaigns are available. All Recent Campaigns pulls data from the 500 most recent campaigns sent to your audience.

    • Any of the Last 5 Campaigns

    Date added

    Use the Date Added condition to create a segment based on the date a subscriber signs up or is imported to your audience. The Date Added operator automatically converts each contact's signup time to Coordinated Universal Time (UTC), so Date Added segments may sometimes appear to return results outside the chosen timeframe.

    • a specific campaign was sent

    * For within operators, input a whole number value for the "last number of days" parameter. Note that each day is 24 hours, counted back from when you create the segment. For example, if you choose is within, and input 3 days, we'll find subscribers who joined your audience in the last 72 hours.
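The "within" rule above counts back in 24-hour blocks from the moment the segment is created, with signup times compared in UTC. A small sketch of that window check follows. The function name is hypothetical, and whether the boundary instant itself is included is an assumption.

```python
from datetime import datetime, timedelta, timezone

def joined_within(date_added_utc, days, segment_created_utc):
    """True if the signup time falls in the last `days` * 24 hours."""
    window_start = segment_created_utc - timedelta(hours=24 * days)
    # Assumption: both ends of the window are inclusive.
    return window_start <= date_added_utc <= segment_created_utc

created = datetime(2024, 1, 4, 12, 0, tzinfo=timezone.utc)
recent = joined_within(created - timedelta(hours=71), 3, created)
too_old = joined_within(created - timedelta(hours=73), 3, created)
```

Because the comparison is made in UTC, a signup that looks like "three days ago" in local time can fall just outside the 72-hour window.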

    Email client

    If you have different campaign designs for people who use different mailbox applications, you can segment based on Email Client. Only one client can be selected per condition, but up to five conditions can be selected for any segment.


    Results

    VCF2CNA can be run through a simple web interface (Fig. 1A) or as a command-line tool. For the web interface, the sole input is a VCF file (or a file in one of the other supported variant file formats) from a paired tumor–normal WGS or WXS analysis, which is uploaded via the interface to a web server where the application runs. The results are returned to a user-provided email address. For the command-line tool, the pipeline is run by invoking a single run command. VCF2CNA consists of two main modules: (1) SNP information retrieval and processing from the input data and (2) recursive partitioning–based segmentation using SNP allele counts (Fig. 1B). Actual running time for a typical WGS sample is approximately 30 to 60 minutes, depending on the complexity of the genome.

    Overview of the VCF2CNA process. (A) User interface with parameters. (B) Server side pipeline. A parallelogram depicts input or output files, a rectangle depicts an analytical process, and a diamond depicts the condition for a follow-up process.

    To evaluate the utility of VCF2CNA, we ran it on 192 tumor–normal WGS data sets and 15 tumor–normal WXS data sets. These sequences comprised 46 WGS adult glioblastomas (GBMs) from The Cancer Genome Atlas (TCGA-GBM) dataset 10, sequenced by Illumina technology, and 146 WGS pediatric neuroblastomas (NBLs) from the Therapeutically Applicable Research to Generate Effective Treatments (TARGET-NBL) dataset 11, sequenced by Complete Genomics, Inc. (CGI) technology. On average, VCF2CNA used approximately 2.8 million high-quality SNPs per sample (median, 2,811,245; range, 2,029,467–3,519,454 in TARGET-NBL data) to derive CNA profiles. We further evaluated the consistency between WGS and WXS using 15 rhabdomyosarcoma samples that were sequenced on both platforms 12 and estimated the tumor purity in these samples.

    CNA analysis of TCGA-GBM data

    The adult TCGA-GBM data downloaded from dbGaP (accession number: phs000178.v8.p7) included 46 samples. We first evaluated VCF2CNA’s resistance to library construction artifacts by using 24 samples from this set, which were previously identified as having a fractured genome pattern by CONSERTING and other CNA algorithms 7. Indeed, VCF2CNA produced CNA profiles that are globally consistent with the SNP array–derived CNA profiles (downloaded from TCGA, Supplementary File s1) and more robust to noise than those produced by CONSERTING. Specifically, VCF2CNA yielded a mean 59.4-fold reduction in the number of predicted segments relative to CONSERTING (median, 46.2; range, 16.2–285.7; p = 3.0 × 10^−6 by Wilcoxon signed-rank test; Fig. 2A and Supplementary File s1).

    A Circos plot that displays CNAs found by CONSERTING (outer ring), VCF2CNA (middle ring), and SNP array (inner ring) for (A) TCGA-GBM fractured sample 41-5651-01A and (B) TCGA-GBM unfractured sample 06-0125-01A. Alternating gray and black chromosomes are used for contrast. Yellow regions depict sequencing gaps, whereas red regions depict centromere location. Blue segments depict copy-number loss, and red segments indicate copy-number gain. Legend depicts CNA range for each track.

    We used an F1 scoring metric 13 to measure the consistency between the CNA profiles derived from VCF2CNA and CONSERTING in the remaining 22 high-quality sample pairs (Fig. 2B and Supplementary File s2). These programs identified approximately 700 Mb of CNA regions in each sample (range, 92–2299 Mb) with high consistency (mean F1 score, 0.9941; range, 0.9699–0.9995) (Table 1).
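
The text cites an F1 metric without spelling out how it is computed from two CNA profiles. The sketch below shows one plausible base-level reading, offered as an assumption rather than the authors' procedure: a base counts as a true positive when both profiles give it the same call type. The tuple layout `(chrom, start, end, call)`, the half-open coordinates, and the function names are all hypothetical.

```python
from collections import defaultdict


def called_bases(segments):
    """Total number of bases covered by CNA calls in one profile."""
    return sum(end - start for _, start, end, _ in segments)


def agreeing_bases(a, b):
    """Bases covered by both profiles with the same call type.

    Assumes segments within a profile do not overlap each other.
    """
    by_key = defaultdict(list)
    for chrom, start, end, call in b:
        by_key[(chrom, call)].append((start, end))
    total = 0
    for chrom, start, end, call in a:
        for s, e in by_key[(chrom, call)]:
            total += max(0, min(end, e) - max(start, s))
    return total


def f1_score(test, reference):
    """Harmonic mean of base-level precision and recall."""
    tp = agreeing_bases(test, reference)
    if tp == 0:
        return 0.0
    precision = tp / called_bases(test)
    recall = tp / called_bases(reference)
    return 2 * precision * recall / (precision + recall)


# Toy profiles (made-up coordinates, not data from the study).
profile_a = [("chr7", 0, 1000, "gain"), ("chr10", 0, 500, "loss")]
profile_b = [("chr7", 0, 800, "gain"), ("chr10", 0, 500, "gain")]
score = f1_score(profile_a, profile_b)
```

Because F1 is symmetric in precision and recall, swapping which profile is treated as the reference does not change the score under this definition.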

    We evaluated the segmental overlap between the CONSERTING outputs and the VCF2CNA outputs for each sample. A CNA segment detected by CONSERTING was classified as corroborated if 90% of the bases in the segment received the same type of CNA call from VCF2CNA (Table 2). The comparison shows that VCF2CNA faithfully recapitulated medium to large CNA segments (≥100 kb), whereas CONSERTING had greater power for identifying focal (<100 kb) low-amplitude (absolute log2 ratio change <1.0) CNAs (p = 1.306 × 10⁻⁵ by Wilcoxon signed-rank test). Furthermore, the segment-based analysis revealed that the detection power was less affected in focal CNAs with large amplitudes (log2 ratio ≥ 3.0) (Fig. 3).
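
The corroboration rule above can be sketched as a simple overlap test. This is an illustration under stated assumptions, not the authors' code: the 90% criterion is read as "at least 90%", segments are hypothetical `(chrom, start, end, call)` tuples with half-open coordinates, and the calls in the second profile are assumed not to overlap each other.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def is_corroborated(segment, other_segments, min_fraction=0.9):
    """True if enough of `segment` gets the same call type in the other profile."""
    chrom, start, end, call = segment
    matched = sum(
        overlap(start, end, s, e)
        for c, s, e, k in other_segments
        if c == chrom and k == call
    )
    return matched / (end - start) >= min_fraction


# Toy example: a 100-base loss segment checked against another caller's output.
query = ("chr9", 100, 200, "loss")
print(is_corroborated(query, [("chr9", 90, 195, "loss")]))   # 95% matched
print(is_corroborated(query, [("chr9", 90, 185, "loss")]))   # 85% matched
```

Stratifying the resulting true/false values by segment length and log2-ratio amplitude gives the kind of breakdown summarized in Fig. 3.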

    Violin plot stratified by segment size and CNA intensity for all 22 TCGA-GBM unfractured samples. Gold diamond represents the mean fraction of matching segments between VCF2CNA and CONSERTING.

    To further test whether VCF2CNA accurately captures the CNA patterns in samples with library artifacts, we applied the cghMCR algorithm 14 . This package in R Bioconductor provides functions to identify genomic regions of interest based on segmented copy number data from multiple samples. We used this functionality to depict these common gains and losses across all 46 samples from either VCF2CNA profiles or SNP array–derived CNA profiles (downloaded from TCGA). The results are quantified by a segment gain or loss (SGOL) score. Although the signal from VCF2CNA contained less noise than did the signal from the SNP array in most samples (Supplementary File s1), both profiles reveal common recurrently amplified and/or lost regions (Fig. 4). These changes included chromosome-level changes (i.e., chr7 amplifications and loss of chr10) and segmental CNAs (i.e., focal deletion of the CDKN2A/B locus on chr9p) 15 . Moreover, VCF2CNA identified recurrent losses in ERBB4 on chr2q and GRIK2 on chr6q that were absent in the SNP array profiles. ERBB4 encodes a transmembrane receptor kinase that is essential for neuronal development 16 . It is frequently mutated in patients with non-small cell lung cancer 17 , and silencing of ERBB4 through DNA hypermethylation is associated with poor prognosis in primary breast tumors 18 . Similarly, GRIK2 is a candidate tumor suppressor gene that is frequently deleted in acute lymphocytic leukemia 19 and silenced by DNA hypermethylation in gastric cancer 20 .
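
To make the SGOL score concrete, here is a minimal sketch that assumes the score at a locus is formed by summing, across samples, the log2 ratios that exceed a gain threshold and, separately, those that fall below a loss threshold. The thresholds of ±0.2 and the function name are illustrative assumptions; the settings used in the analysis are not stated in the text.

```python
def sgol_scores(log2_ratios, lower=-0.2, upper=0.2):
    """Return (gain_score, loss_score) for one locus across samples.

    Values between the two thresholds are treated as copy-neutral noise
    and contribute to neither score.
    """
    gain_score = sum(v for v in log2_ratios if v > upper)
    loss_score = sum(v for v in log2_ratios if v < lower)
    return gain_score, loss_score


# Segment means at one locus in five hypothetical samples.
gain, loss = sgol_scores([0.5, 0.1, -0.4, 1.0, -0.1])
```

A locus that is altered in many samples, or strongly altered in a few, accumulates a large score, which is why recurrent events such as chr7 gain stand out in a plot of these values along the genome.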

    A cghMCR plot of 46 TCGA-GBM samples. (A) SNP array data and (B) VCF2CNA data are shown.

    Amplifications such as double minute chromosomes and homogeneously staining regions represent a common mechanism of oncogene overexpression in tumors 21 . Among the 46 TCGA-GBM samples analyzed, VCF2CNA identified double minute chromosomes in 34 samples affecting the EGFR 22 , MDM2 23 , MDM4 24 , PDGFRA 25 , HGF 26 , GLI1 27 , CDK4 28 , and CDK6 29 genes (Fig. 5 and Supplementary File s3). These events consisted of high-level amplifications in 21 samples with potential fractured genome patterns (Supplementary File s3a) and 13 previously reported samples (Supplementary File s3b) 7,30 .

    A Circos plot of VCF2CNA (outer ring) and CONSERTING (inner ring), depicting high-amplitude focal CNA segments in TCGA-GBM sample 06-0152-01A. Included in these segments are the known cancer genes EGFR, CDK4, and MDM2. CNA range is specified for each sample.

    CNA analysis of TARGET-NBL data

    We applied VCF2CNA to the TARGET-NBL dataset 11 downloaded from dbGaP (accession number: phs000467). This dataset consists of 146 tumors with matched normal WGS samples, sequenced with CGI technology. Because the ligation-based CGI technology has notable differences in the detection of single nucleotide variants (SNVs) and insertions/deletions (indels) compared to Illumina systems 31 , this dataset provided an opportunity to evaluate VCF2CNA’s robustness using different sequencing platforms.

    We used VCF2CNA to perform cghMCR analysis with CNA profiles and observed a genome pattern similar to that reported for SNP array platforms (Fig. 6A) 32 . In addition to loss of large regions on chr1p, 3p, and 11q and a broad gain of chr17q, VCF2CNA found frequent focal amplifications of MYCN in NBL tumors and several potential cancer-related CNAs, including high-level amplifications of CDK4 (1 tumor), and ALK (2 tumors) (Fig. 6B).

    Analysis of the TARGET-NBL dataset, consisting of 146 tumors. (A) A cghMCR plot in which green depicts regions of copy-number gain and red depicts regions of copy-number loss. (B) A Circos plot showing a focal gain on chromosome 2 for MYCN and ALK5 for sample PARETE-01A-01D. CNA range is specified.

    High-level amplification of MYCN is a known oncogenic driver found in 25% of pediatric patients with NBL, and is associated with aggressive tumors and poor prognosis 33 . A subset of 32 tumors in the TARGET-NBL cohort contains clinically validated amplifications of MYCN. Although the CGI’s hidden Markov CNA model (unpublished) reported MYCN amplifications in 15 of these 32 tumors, VCF2CNA successfully identified high-level amplifications in 31 tumors. In the clinically validated MYCN-amplified sample that went undetected by VCF2CNA, a follow-up review revealed that tumor heterogeneity and sampling bias most likely contributed to the discrepancy. Moreover, VCF2CNA predicted two additional MYCN amplification events among the remaining tumor samples, indicating that VCF2CNA can identify clinically relevant CNAs that were undetected by traditional methods of CNA detection. The high-level concordance with clinically validated data provides a strong indication that VCF2CNA is applicable to multiple tumor types collected from different sequencing platforms.

    CNA analysis of rhabdomyosarcoma data to compare WXS and WGS

    Although WGS provides unbiased coverage measurements across the genome, whole exome sequencing (WXS) characterizes the coding regions of the genome (2% of genome) at much higher depth. It provides a convenient and inexpensive alternative to WGS and has been widely adopted in large-scale genome profiling projects and clinical settings. Because of major design differences between the two platforms, we evaluated the consistency of copy number alteration detection between whole exome and whole genome sequencing, using a set of rhabdomyosarcoma samples that were sequenced on both platforms 12 . We observed highly consistent CNA profiles between WGS and WXS platforms (mean F1 score 0.97 on a set of 15 rhabdomyosarcoma xenograft samples). While focal changes are more likely to be missed in the WXS platform compared to the WGS platform, VCF2CNA reliably detects large CNAs from both WGS and WXS platforms (Fig. 7, Supplementary File s5).

    Somatic CNAs computed using VCF2CNA for paired whole-exome and whole-genome rhabdomyosarcoma xenograft sample SJRHB000026_X1_G1.

    CNA-based purity estimation

    Using the absolute copy number result for each segment identified through VCF2CNA, and B-allele frequencies (BAFs) computed from the paired tumor-normal VCF file, we developed an algorithm to estimate tumor purity using segments with a single copy number gain or loss in VCF2CNA. Briefly, for germline heterozygous single nucleotide polymorphisms (SNPs, base BAF of 0.5), the extent of loss of heterozygosity (LOH) can be measured by the absolute difference between the B-allele fraction in the tumor and that in the germline sample. LOH is the result of copy number alteration and/or copy-neutral LOH in tumor cells. We used LOH signals in copy-neutral or single-copy gain/loss regions (between single-copy chromosome loss and single-copy chromosome gain) to estimate tumor purity.
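
The text does not give the formula that links the LOH signal to purity, so the sketch below uses the standard two-population mixture model as an assumption: a fraction p of cells are tumor cells carrying the altered genotype, the rest are diploid heterozygous normal cells, and d = |tumor BAF − 0.5| is the LOH signal. Under that model d = p/(2(2 − p)) for a single-copy loss, d = p/(2(2 + p)) for a single-copy gain, and d = p/2 for copy-neutral LOH, and each relation can be inverted for p. The function and state names are hypothetical.

```python
def expected_deviation(purity, state):
    """Forward model: |BAF - 0.5| at a germline-heterozygous SNP."""
    p = purity
    if state == "loss":      # tumor cells keep 1 of 2 alleles
        return p / (2 * (2 - p))
    if state == "gain":      # tumor cells carry 3 copies (2:1 allele ratio)
        return p / (2 * (2 + p))
    if state == "cn_loh":    # 2 copies, both of the same allele
        return p / 2
    raise ValueError(f"unknown state: {state}")


def purity_from_loh(deviation, state):
    """Invert the forward model to recover purity from the LOH signal."""
    d = deviation
    if state == "loss":
        return 4 * d / (1 + 2 * d)
    if state == "gain":
        return 4 * d / (1 - 2 * d)
    if state == "cn_loh":
        return 2 * d
    raise ValueError(f"unknown state: {state}")


# A single-copy-loss region with tumor BAF of about 0.667 (d = 1/6)
# corresponds to a purity of 0.5 under this model.
estimate = purity_from_loh(1 / 6, "loss")
```

Each region yields its own estimate, which is why a second step is needed to combine the per-region values into one purity for the sample.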

    Using purity estimates from various regions within the genome, we performed an unsupervised clustering analysis using the mclust package (version 5.4) in R (version 3.4.0). The tumor purity of the sample was defined as the highest cluster center value among all clusters. We estimated tumor purity for 15 matched tumor-normal xenograft rhabdomyosarcoma WGS samples (Table 3). All but one case had a tumor purity prediction near 100%, consistent with the notion that most mouse-derived reads will not map to the human genome assembly 34,35 . The sample SJRHB010468_X1_G1 showed extensive subclonal CNAs across multiple chromosomes (Supplementary File s5). While subclonal CNAs are not indicative of low purity, the extensive subclonal copy number segments result in an incorrect tumor purity estimation (0.533), which is a limitation of the algorithm. The mutant allele fraction (MAF) density plot for somatic single nucleotide variations (SNVs) detected in diploid regions revealed a subclone in 50% of the tumor cells, which harbors more than 75% of the detected SNVs (Supplementary File s6).
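
The "highest cluster center" rule can be illustrated with a small stand-in. The analysis itself used Gaussian mixture modelling from the R package mclust; the sketch below swaps in a plain one-dimensional k-means so that it runs with the Python standard library alone, and the number of clusters, the initialization, and all names are assumptions made for illustration.

```python
def kmeans_1d(values, k, iterations=50):
    """Very small 1-D k-means; returns the list of cluster centers."""
    values = sorted(values)
    # Deterministic start: centers spread evenly over the sorted values.
    centers = [values[(2 * i + 1) * len(values) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def sample_purity(region_estimates, k=2):
    """Sample-level purity: the highest cluster center among all clusters."""
    return max(kmeans_1d(region_estimates, k))


# Hypothetical per-region estimates: a subclonal group near 0.5 and a
# clonal group near 1.0.
regions = [0.50, 0.52, 0.54, 0.96, 0.98, 1.00]
purity = sample_purity(regions)
```

Taking the maximum center relies on at least some regions being clonal. When most segments are subclonal, as described for SJRHB010468_X1_G1, the highest center can itself sit well below the true purity.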


    View and edit the raw query

    The segment designer provides a graphical interface for creating the logic for a dynamic segment. As you work with the settings, you are actually creating a text-based query in the background. This is the query that the system will actually run against your database. Usually you don't need to use the query for anything, but sometimes it can help in troubleshooting. You can also copy/paste queries into the designer, which you might use to create a copy of an existing segment or to share a query design through email.

    To find, view, and edit the query, scroll to the bottom of the page and open the Query view tab.