COMPOSITIONAL DATA ANALYSIS THEORY AND APPLICATIONS PDF


In book: Compositional Data Analysis: Theory and Applications. Publisher: John Wiley & Sons, London. Editors: V. Pawlowsky-Glahn.



Author:TEDDY TECHAU
Language:English, Japanese, Dutch
Country:Poland
Genre:Health & Fitness
Pages:542
Published (Last):08.09.2016
ISBN:752-2-18086-432-9
ePub File Size:16.65 MB
PDF File Size:19.42 MB

It is difficult to imagine that the statistical analysis of compositional data has been a major issue of concern for more than years. It is even more difficult to.

Compositional Data Analysis and Its Applications. Huiwen Wang, School of Economic Management, Beijing Univ. of Aeronautics and Astronautics.

CoDaWork: International Workshop on Compositional Data Analysis. Trace Elements, and Isotopes in Compositional Data Analysis: Applications for Deep Formation. A Compositional Approach to Allele Sharing Analysis. theory, statistical methods and techniques to its broad range of applications in.

We encourage researchers to use CoDA in similar studies, to adequately account for the compositional nature of data on physical activity and sedentary behavior. Both insufficient physical activity and excessive sedentary behavior appear to be associated with an increased risk of coronary heart disease, type 2 diabetes mellitus, and cancer [3-5].

Among various factors, age and sex are two potentially important determinants of sedentary behavior and physical activity [6-9]. For instance, men tend to be more physically active than women [6], and physical activity tends to decrease with age [6]. A majority of studies, including those investigating differences between sexes and age groups in physical activity and sedentary behavior, have used a standard analysis approach in which the time spent in each behavior, e.

If the time spent in one behavior is changed, it will inevitably influence the time spent in other behaviors within that day. Data with this inherent dependency, in the sense that the parts add up to a constant sum, are called constrained or compositional [10, 11].
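The constant-sum constraint described above can be sketched in a few lines. This is only an illustration: the minute values, the 8-hour total, and the helper names `close` and `reallocate` are hypothetical and do not come from the study.

```python
# Hypothetical minutes for one worker's 8-hour working day (illustrative values only).
day = {"sedentary": 240.0, "standing": 150.0, "physical_activity": 90.0}


def close(parts, total=1.0):
    """Rescale the parts so that they sum to `total` (the closure operation)."""
    s = sum(parts.values())
    return {name: total * value / s for name, value in parts.items()}


def reallocate(parts, to, from_, minutes):
    """Move `minutes` from one behavior to another; the overall total is unchanged."""
    out = dict(parts)
    out[to] += minutes
    out[from_] -= minutes
    return out


# Shares of the working day; they always sum to 1.
shares = close(day)

# Adding 30 minutes of physical activity forces 30 minutes to leave another part.
changed = reallocate(day, to="physical_activity", from_="sedentary", minutes=30.0)

print(shares)
print(changed)
```

Because the total is fixed, no part can change in isolation, which is the dependency that a standard analysis of each behavior separately ignores.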

A standard multivariate statistical approach for analyzing time spent in different behaviors within a day fails to account for this constrained property of the data [12-14]. A set of procedures has been developed to handle compositional data, i.e., Compositional Data Analysis (CoDA) [10], which has only recently received attention in studies of sedentary behavior and physical activity [14-20]. One of these studies compared results obtained using standard and CoDA approaches in an investigation of associations between time spent in different behaviors within a day and various health indicators [16].

The study found that associations were different when standard analyses were used, compared to CoDA. No previous study has explicitly investigated the extent to which the results of comparisons between sexes and age groups in time spent in various behaviors during a day depend on whether the analysis was performed using CoDA or a standard approach.

Thus, the present study compared sedentary behavior and physical activity during working days between sexes and age groups, with specific emphasis on differences in results obtained with standard and CoDA approaches. Data were collected between spring and spring at 15 Danish workplaces in three different occupational sectors, i. In total, eligible workers, recruited in collaboration with a large labor union, were invited to participate in the study.

Workers were excluded if they had a white-collar job, were pregnant, had a fever, or had an allergy to adhesives. On the measurement days, the workers were asked to complete a paper-based diary, noting their working hours, time in bed i.

They also noted the time of a reference measurement ie. Instructions to the workers are detailed in previous publications [23, 25, 26].

Accelerometer-based measurements of movement behaviors within a day

The amounts of time spent in various behaviors (sedentary, standing, and physical activity (PA)) were identified from the accelerometer recordings using the MATLAB program Acti4 [22, 24, 26]. Periods spent walking, stair-climbing, running, and cycling were merged into a total PA time category.

All non-working days, non-wear periods and bedtime periods were excluded according to previously reported criteria [25, 26].

As Pearson warned in , correlation gives spurious results when applied to relative data: i. If we consider that Z may represent, for example, the library size, we see how two uncorrelated features, X and Y, may appear correlated even when they are not. Spurious correlation is not merely a statistical concern: when applied to real biological data, correlation can lead to wrong conclusions [5, 7]. For example, one might incorrectly conclude that there exists a coordinated regulation among a module of transcriptionally independent genes.
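The effect of dividing two unrelated features by a shared quantity Z can be illustrated with a small simulation. This is a minimal sketch under assumed settings: the uniform distributions, their ranges, the sample size, and the seed are arbitrary choices made for illustration, not values taken from the text.

```python
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 5000

# X and Y are drawn independently, so they are uncorrelated by construction.
x = [rng.uniform(90, 110) for _ in range(n)]
y = [rng.uniform(90, 110) for _ in range(n)]

# Z plays the role of a shared divisor such as the library size (assumed here
# to vary much more strongly than X and Y).
z = [rng.uniform(50, 150) for _ in range(n)]

r_abs = pearson(x, y)
r_rel = pearson([a / c for a, c in zip(x, z)], [b / c for b, c in zip(y, z)])

print("correlation of X and Y:", round(r_abs, 3))
print("correlation of X/Z and Y/Z:", round(r_rel, 3))
```

The second correlation is driven almost entirely by the common divisor, even though X and Y themselves carry no shared signal.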

As an alternative to correlation, proportionality is a measure of association that is valid for compositional data [5, 8]. Interestingly, the VLR (the variance of the log-ratio of two features) is the same for relative values and their absolute equivalents.
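The invariance of the VLR holds because the per-sample total cancels inside the ratio: (x/t)/(y/t) = x/y. A minimal sketch, assuming simulated abundances and a hypothetical helper named `vlr`:

```python
import math
import random


def variance(values):
    """Sample variance (n - 1 denominator)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def vlr(a, b):
    """Variance of the log-ratio of two strictly positive features."""
    return variance([math.log(p / q) for p, q in zip(a, b)])


rng = random.Random(1)  # arbitrary seed
n = 50

# Assumed absolute abundances: y roughly tracks 2 * x, plus a third part
# ("other") that makes the per-sample total vary.
x = [rng.uniform(10, 100) for _ in range(n)]
y = [2.0 * v * rng.uniform(0.9, 1.1) for v in x]
other = [rng.uniform(1000, 5000) for _ in range(n)]

# Relative values: each sample is divided by its own total.
totals = [a + b + c for a, b, c in zip(x, y, other)]
x_rel = [a / t for a, t in zip(x, totals)]
y_rel = [b / t for b, t in zip(y, totals)]

vlr_abs = vlr(x, y)
vlr_rel = vlr(x_rel, y_rel)

print(vlr_abs, vlr_rel)  # equal up to floating-point rounding
```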


However, the VLR lacks a scale that would otherwise make it possible to compare dependency across multiple feature pairs. In essence, what we call proportionality is a modification to the VLR that establishes scale.

Methods

Consider a matrix of D features (as columns) measured across N samples (as rows) exposed to a binary or continuous event. This event might involve case-control status, treatment status, treatment dose, or time.

Proportionality, analogous to but distinct from correlation, measures the association between two log-ratio transformed feature vectors. By default, this package uses the centered log-ratio transformation (clr); this transformation scales each subject vector by its geometric mean, indicated as g(x).
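As a minimal sketch of the centered log-ratio transformation, each element of a sample vector is divided by the geometric mean g(x) of that vector and then logged, which is the same as subtracting the mean of the logs. The function below is an illustrative Python version, not the propr implementation, and the example values are made up.

```python
import math


def clr(sample):
    """Centered log-ratio: log(x_i / g(x)), where g(x) is the geometric mean.

    All elements must be strictly positive.
    """
    logs = [math.log(v) for v in sample]
    mean_log = sum(logs) / len(logs)  # equals log(g(x))
    return [value - mean_log for value in logs]


sample = [10.0, 20.0, 40.0]           # hypothetical counts for one subject
scaled = [v * 7.5 for v in sample]    # same composition at a different total

print(clr(sample))
print(clr(scaled))
```

The transformed values sum to zero, and rescaling the whole vector leaves them unchanged, which is why clr suits data whose totals carry no information.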

However, we also include an implementation of the additive log-ratio transformation (alr). The propr package implements three measures of proportionality in the R language, as defined for A. Note that although we calculated each row of A using Eq. In considering the two log-ratio transformations, alr uses a specified feature to transform the original subject vectors. When used in conjunction with an a priori known unchanged reference, alr effectively back-calculates the absolute counts from the relative components.
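The equations for the three measures did not survive in this text, so the sketch below relies on the definitions as they are commonly stated in the proportionality literature; treat the formulas as an assumption. The Python names `alr`, `phi`, `rho`, and `phi_s` are hypothetical and are not the propr API. Here A denotes the log-ratio transformed matrix, and the toy counts use the third feature as an assumed unchanged reference.

```python
import math


def variance(values):
    """Sample variance (n - 1 denominator)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def alr(sample, ref):
    """Additive log-ratio: log of every other element over the reference element."""
    return [math.log(v / sample[ref]) for i, v in enumerate(sample) if i != ref]


def phi(ai, aj):
    """Assumed definition: var(Ai - Aj) / var(Ai). Zero means proportional."""
    return variance([p - q for p, q in zip(ai, aj)]) / variance(ai)


def rho(ai, aj):
    """Assumed definition: 1 - var(Ai - Aj) / (var(Ai) + var(Aj)). One means proportional."""
    diff = variance([p - q for p, q in zip(ai, aj)])
    return 1.0 - diff / (variance(ai) + variance(aj))


def phi_s(ai, aj):
    """Assumed definition: var(Ai - Aj) / var(Ai + Aj). Zero means proportional."""
    diff = variance([p - q for p, q in zip(ai, aj)])
    return diff / variance([p + q for p, q in zip(ai, aj)])


# Toy counts (samples as rows): feature 1 is always 3x feature 0,
# and feature 2 is the assumed unchanged reference.
counts = [
    [10.0, 30.0, 5.0],
    [20.0, 60.0, 5.0],
    [40.0, 120.0, 5.0],
    [80.0, 240.0, 5.0],
]

A = [alr(row, ref=2) for row in counts]
a0 = [row[0] for row in A]
a1 = [row[1] for row in A]

print(phi(a0, a1), rho(a0, a1), phi_s(a0, a1))
```

For a perfectly proportional pair, the difference of the two transformed vectors is constant, so its variance vanishes and the measures reach their limiting values up to rounding.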

By specifying, for example, a house-keeping gene or an experimentally fixed variable, the investigator can achieve a more accurate measure of dependence than through clr [8].

The user can toggle alr transformation in lieu of the default clr transformation by supplying the name of the unchanged reference or references to the ivar argument of the phit, perb, or phis function. In either case, we wish to alert the reader that log-ratio transformation, by its definition, requires non-zero elements in the data matrix. As such, any log-ratio analysis must first address zeros.

Yet, how best to do this remains an open question and a topic of active research. For simplicity, propr automatically replaces all zero values with 1 prior to log-ratio transformation, corresponding to the multiplicative replacement strategy. The R language, despite its widespread popularity, suffers from poor performance when scaling to big data. This package also offers a number of wrapper functions to visualize proportionality when working with high-dimensional data.
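The zero-handling step described above, replacing every zero with 1 before the log-ratio transformation, can be sketched as follows. The helper names and the toy count matrix are hypothetical, and this Python version only mirrors the behavior described in the text rather than reproducing the package code.

```python
import math


def replace_zeros(matrix, value=1):
    """Return a copy of the count matrix with every zero replaced by `value`."""
    return [[value if v == 0 else v for v in row] for row in matrix]


def clr(sample):
    """Centered log-ratio of one strictly positive sample vector."""
    logs = [math.log(v) for v in sample]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


counts = [
    [0, 5, 3],
    [2, 0, 7],
]

# math.log(0) raises ValueError, so zeros must be handled before transforming.
replaced = replace_zeros(counts)
transformed = [clr(row) for row in replaced]

print(replaced)
print(transformed)
```

Other replacement strategies exist, and the choice can influence results for features with many zeros, so this default is best read as a convenience rather than a recommendation.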

Among these tools are those used to generate the figures included in the next section.

Results

Application of proportionality

As a use case, we re-analyze the raw RNA-Seq counts from an already published study on cane toad (Rhinella marina) evolution and adaptation. Sugar cane farmers introduced cane toads to Australia in as a cane beetle pest control measure, but these toads quickly became invasive.

This event now serves as a notable example of failed biological control. This dataset contains muscle tissue RNA transcript counts for 20 toads sampled from two regions (10 per region) in the wild.

The two regions sampled, which we will treat as the experimental groups, include the long colonized site of introduction in QLD and the front of the range expansion in WA. In this analysis, we want to understand the differences in gene expression between the established and expanding populations. By demonstrating propr on public data, we provide a reproducible example of how proportionality analysis can converge on an established biological narrative.

The reader can find these data bundled with the release of the package on CRAN. We begin by constructing the proportionality matrix using all 57, transcript counts, yielding an N^2 matrix. To minimize the number of lowly expressed transcripts included in the final result, we subset the matrix to include only those transcripts with at least 10 counts in at least 10 samples. By removing the features at this stage, we can exploit a computational trick to calculate proportionality and filter simultaneously, reducing the required RAM to only 5 Gb without altering the resultant matrix.
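The abundance cutoff described above (at least 10 counts in at least 10 samples) can be sketched as a simple column filter. This illustration applies the rule directly to a toy count matrix; it does not reproduce the simultaneous calculate-and-filter trick mentioned in the text, and the function name and data are hypothetical.

```python
def keep_features(counts, min_count=10, min_samples=10):
    """Return indices of features with >= min_count counts in >= min_samples samples.

    `counts` is a list of rows (samples), each a list of per-feature counts.
    """
    n_features = len(counts[0])
    kept = []
    for j in range(n_features):
        n_passing = sum(1 for row in counts if row[j] >= min_count)
        if n_passing >= min_samples:
            kept.append(j)
    return kept


# Toy data: 12 samples, 3 features.
# Feature 0 is abundant everywhere, feature 1 passes in exactly 10 samples,
# and feature 2 passes in only 9 samples.
counts = []
for i in range(12):
    f0 = 50
    f1 = 10 if i < 10 else 0
    f2 = 10 if i < 9 else 9
    counts.append([f0, f1, f2])

kept = keep_features(counts)
filtered = [[row[j] for j in kept] for row in counts]

print(kept)
```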

We refer the reader to the supplementary vignette for a justification of this cutoff (S1 Appendix).

Figure 1 Smear plot. A smear of straight diagonal lines confirms that the feature pairs indexed as proportional actually show proportional abundance.

Figure produced using the smear function in propr.


The procedure used to parse through the proportionality matrix now depends on the experimental question. Here, we wish to identify a highly proportional transcript module that happens to show differential abundance across the experimental groups.

When clustering, we call two features co-clustered if they belong to the same cluster after cutting the dendrogram. Meanwhile, the VLS, the sum of the individual variances of two features in that pair, adjusts the rate of this limit. Since we would expect a differentially expressed module to have a low VLR and a high VLS, we prioritize pairs in co-cluster 1 for subsequent analysis Fig.

Figure 2 Prism plot. If both features in a pair belong to the same cluster, they receive a non-zero color code. Figure produced using the prism function in propr.

Co-clusters containing feature pairs with a low VLR and a high VLS have the potential to explain differences between the experimental groups. However, the high VLS may not necessarily have anything to do with the experimental condition.

For example, this co-cluster might instead include highly proportional features that show wide individual feature variance due to batch effects. For the cane toad data, however, the experimental condition does indeed seem to drive the high individual feature variances in the module, as evidenced by the near perfect group separation when visualizing the first two components of a principal components analysis (PCA) Fig.


Note that this plot calculates PCA using the log-ratio transformed data, making it a statistically valid choice for compositional data. The separation between groups achieved here compares to that reported in the original publication, which used features selected by the edgeR package. In addition, gene set enrichment analysis of the gene ontology terms for co-cluster 1 (S1 Table) shows an enrichment for similar molecular functions as those enriched among the transcripts selected by edgeR (S2 Table), as well as those highlighted in the original publication. This is particularly impressive considering that we have no reason to expect most differentially expressed transcripts would appear in differentially expressed modules too.

Figure 3 PCA plot. This figure shows all samples projected across the first two components of a principal components analysis (PCA), calculated using the log-ratio transformed data. This plot colors samples based on the experimental group. Figure produced using the pca function in propr.

Yet, while agreeing here, proportionality analysis offers an additional benefit in that it provides a layer of information on pairwise associations, all without requiring any kind of normalization.

Nevertheless, we believe there exists added value in integrating the results of propr and conventional differential expression analysis, best visualized as a network graph Fig.


Figure 4 Network plot. Red nodes indicate transcripts with increased expression according to edgeR. Blue nodes indicate transcripts with decreased expression according to edgeR. White nodes indicate transcripts included in co-cluster 1, but not selected by edgeR.


Importantly, we see here several highly proportional up-regulated and down-regulated modules. Figure produced using the cytescape function in propr.

Evaluation of proportionality

Above, we show how we can use proportionality to understand RNA-Seq data. For this, we need a dataset for which we already know the absolute abundances exactly. Since this is unknown in the cane toad data, we use two other datasets: (a) a simulated dataset with 1, random features i.


Knowing the absolute abundances, we can make a corresponding relative dataset by dividing the counts as rows by the per-sample sum i. Since the library size changes non-randomly across samples, this operation constrains the data in a way that introduces spurious correlations.
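The conversion from absolute to relative data described above amounts to dividing each row by its own sum. In the minimal sketch below, the toy counts and the function name `to_relative` are hypothetical; only the third feature changes in absolute terms, yet all three change in relative terms.

```python
def to_relative(counts):
    """Divide each sample (row) by its per-sample sum so that every row sums to 1."""
    relative = []
    for row in counts:
        total = sum(row)
        relative.append([value / total for value in row])
    return relative


# Toy absolute abundances (samples as rows): features 0 and 1 are constant,
# while feature 2 increases fourfold in the second sample.
absolute = [
    [100.0, 100.0, 100.0],
    [100.0, 100.0, 400.0],
]

relative = to_relative(absolute)

for row in relative:
    print(row)
```

After this operation, features 0 and 1 appear to decrease together in the second sample, which is the kind of induced dependency that produces spurious correlations.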

Yet, these spurious correlations disappear when measuring proportionality Fig. Notably, the spurious correlations from these data also disappear when measuring the correlation of clr-transformed data Fig.

Figure 5 Evaluation of proportionality using simulated data. From here, we can visually assess how well any two measures of dependence agree with one another.

