Commands/Options on the Analyze Tab

Data, transforms, output

Print mean, s.d.

For HCA (hierarchical cluster analysis) only.   Selection of this checkbox will create icons for and save to disk the averages and standard deviations calculated during standardization, along with the standardized values, as well as distance matrices used in cluster agglomeration (node building). 

Affymetrix data

Takes the log2 transform of expression data prior to hypothesis testing.   When selected, averages of log data are used in t-tests.   If a fold criterion is specified, then log2 of the fold criterion is taken and t-tests (not Mann-Whitney, F-test, or Kruskal-Wallis) are modified such that the numerator becomes $(\bar{y}_1 - \bar{y}_2) - \delta$, where $\delta$ is log2 of the fold criterion specified, and $\bar{y}_1$ and $\bar{y}_2$ are the averages of log2(expression) data within the two classes being tested.   Note that values of the t-statistic on output can be drastically altered when a fold criterion other than zero is used, especially when working on the log scale.   For example, if 2 is used as the fold criterion, then log2(2) = 1 will be subtracted from the difference of the averages of the log2 expression data.

Print flags in output data

When specified, an “F” is placed next to the expression value in the .csv file opened with Excel. 

Sort spreadsheets on folds(2-sample)

When specified, output data for 2-sample tests are sorted on folds and not on q-values or p-values.    When Affymetrix data are specified and the log2 of expression values is taken, the fold is based on the ratio of the geometric means of the two groups.

Two-step testing(T, then M-W)

When specified, both the t-test and Mann-Whitney tests are applied.   In addition to generation of output tables of statistical testing results for each set of tests, an asterisk “*” is written to an output comparison spreadsheet in order to determine if genes were significant for both the t-test and Mann-Whitney tests.  Example 9 provides an output comparison listing for multiple tests.  

Map genes

When specified, chromosome-specific maps are generated with a color gradient based on the scale of log(p) for all genes for a given 2-sample test.  

Compare p-correction

Not currently supported.  

Present p-value<

If P and A are input with expression data, P’s are converted to 0.01 and A’s are converted to 0.06 when in memory during run-time.   Using the default value of 0.05 for the criterion of an informative gene, spots (probes) with a p-value of 0.01 will be considered informative whereas spots having p-values of 0.06 will be non-informative. 
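
The conversion described above can be sketched as follows. This is only an illustration: the names `CALL_TO_P` and `is_informative` are hypothetical and are not part of ChipST2C; the codes 0.01 and 0.06 and the default criterion of 0.05 are taken from the text.

```python
# Hypothetical helper names; the p-value codes and the default criterion come from the text.
CALL_TO_P = {"P": 0.01, "A": 0.06}

def is_informative(call, criterion=0.05):
    """Return True when the detection call maps to a p-value below the criterion."""
    return CALL_TO_P[call] < criterion

calls = ["P", "A", "P", "P"]
flags = [is_informative(c) for c in calls]
```

With the default criterion, every present call is informative and every absent call is not; raising the criterion above 0.06 would make both informative.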

Permutations

The number of permutations used in randomization tests.   Between 1,000 and 10,000 permutations are recommended for the within-gene test, and 100 are recommended for the between-gene test.

Significance

The p-value criterion (Type I error rate, false positive rate) used in hypothesis testing.

Output Permutations

Writes values of test statistics to output spreadsheet during each permutation of randomization tests.

Default image format

The image format specified for all heat maps and charts.  

Missing code

The code used to denote missing data.   Used in standardization and distance calculations.  However, missing codes are deprecated in hypothesis testing, since present/absent calls, which commonly accompany expression data, are used instead. 

Gene selection

When data are read in from disk, the raw input matrix is stored in memory.   Using these data, ChipST2C makes an analysis data set that is used for analysis.   If there are present/absent calls or p-values in the data, then if the user has not already clicked one of the buttons shown above to manually screen the genes, a prompt will show asking whether or not all genes should be used or only genes having more than X% of present calls within the classes specified should be used.

Use all genes

This will select all input genes.   If present/absent calls are in the input data, then use of this option may bias the results by possibly including “non-informative” genes.   Some labs are now using all genes and not paying attention to calls, and the RT-PCR results in some cases have been remarkable.   After all, expression values for absent calls are typically very low.  

When there are no calls in the data, this option is the one of choice for most users.

Use genes with %P calls

This option will screen for genes having more than the given percentage of present calls within each class of arrays specified.   Thus, if there are 10 arrays in a treatment class, and 80% is the specified criterion, then no genes will be used which have 7 or fewer present calls across the arrays in that class. 

 

Any time a user has not already screened the data by selecting one of the two buttons in this group, and there are present/absent calls or p-values in the input data, the following prompt will be displayed upon trying to run any analysis:

[Screenshot: gene-screening prompt]

This prompt serves only to remind the user to screen for genes having a given percentage of present calls within each class of arrays, in order to minimize bias in the results (from using “non-informative” genes).  

 

Cluster options

Cluster arrays

This option specifies an unsupervised hierarchical cluster analysis of the arrays as objects.  In this case, genes will be the attributes.  

Cluster genes

This option specifies an unsupervised hierarchical cluster analysis of the genes as objects.  In this case, arrays will be the attributes.

Nearest neighbor

Specifies node joining (agglomeration) by single-linkage. 1,2 

UPGMA(default)

Specifies node joining using the unweighted pair group method with arithmetic mean (UPGMA).1,2 

Furthest neighbor

Specifies node joining using complete linkage. 1,2

Euclidean

Specifies Euclidean distance as the distance function. 1,2

1-correlation

Specifies 1-r for the distance function. 1,2

Array pixel width

Specifies the width of array names and width of cells in heat maps (cluster images).

Gene pixel width

Specifies the height of gene names and cells in heat maps (cluster images).

Color

Specifies the color gradient used for heat maps. Scales are blue-green-yellow-orange-red, green-black-red, and blue-white-red.

Color gradient based on percentiles

Under this option (default), all expression values are sorted and the decile values of expression are used to determine the colors for the gradient of expression on heat maps. 

Color gradient based on equal intervals

When this option is specified, the global range of expression (max-min) is divided into equal parts and the corresponding bin walls are used for determining colors of cells in heat maps.   This is the method of choice for biallelic SNP genotypes (n=3) or CGH data where there is a loss/normal/gain value for each locus. 
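
The difference between the two gradient options can be sketched as below. This is a minimal illustration, not ChipST2C's implementation: the function names are hypothetical, ten bins are assumed for the equal-interval case, and the way values on a bin wall are assigned is an arbitrary choice.

```python
import statistics

def decile_edges(values):
    """Nine cut points that split the sorted values into ten equally populated bins."""
    return statistics.quantiles(values, n=10, method="inclusive")

def equal_interval_edges(values, n_bins=10):
    """Interior bin walls that split the global range (max - min) into equal parts."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]

def bin_index(x, edges):
    """Color bin for x: the number of bin walls that x reaches or exceeds."""
    return sum(x >= e for e in edges)

# Biallelic SNP genotypes coded 0, 1, 2: three equal intervals give one color each.
genotype_edges = equal_interval_edges([0, 1, 2], n_bins=3)
genotype_bins = [bin_index(g, genotype_edges) for g in [0, 1, 2]]
```

For skewed expression data the decile walls crowd together where most values lie, whereas equal intervals keep a fixed step in expression, which suits discrete data such as genotypes or loss/normal/gain codes.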

HCA-(Hierarchical cluster analysis run button)

This is the actual run button for performing an unsupervised HCA.  This button is intended for HCA when there is only one class of arrays, or all of the arrays in a data set were entered as a single class.   Thus, it will be invisible if more than one class of arrays is specified. 

<1000 

This value specifies the upper limit of genes allowable for cluster analysis.   During hypothesis testing there can be thousands of genes identified which are significant, and the user may not want a cluster image with so many genes.       

K-means (run button)

This is the run button for k-means cluster analysis.  This button is intended for k-means cluster analysis when there is only one class of arrays, or all of the arrays in a data set were entered as a single class.   Thus, it will be invisible if more than one class of arrays is specified. 

K:

Let K be the total number of clusters and k (k=1,2,…,K) represent the kth cluster of a clustering.  When K=0 is specified by the user (default), the optimal value of K (best number of clusters) is determined by cycling through values of K from 2 up to an upper limit determined from N, the total number of genes.  This is performed as follows.  For K clusters, the total within-cluster sum-of-squares based on Euclidean distance is

$W_K = \sum_{k=1}^{K} \sum_{i=1}^{n_k} d^2(\mathbf{x}_{ik}, \bar{\mathbf{x}}_k)$,

where $\mathbf{x}_{ik}$ is the row vector containing expression values for gene i in cluster k over the p arrays and $\bar{\mathbf{x}}_k$ is the row vector of array-specific mean expression values for the $n_k$ genes in cluster k.   The Euclidean distance described above is calculated in the form 

$d(\mathbf{x}_{ik}, \bar{\mathbf{x}}_k) = \sqrt{\sum_{j=1}^{p} (x_{ijk} - \bar{x}_{jk})^2}$,

where $x_{ijk}$ is the expression of gene i on array j in cluster k and $\bar{x}_{jk}$ is the average expression of the $n_k$ genes on array j assigned to cluster k (i=1,2,…,$n_k$ genes;  j=1,2,…,p arrays; and k=1,2,…,K clusters).   For the same K clusters, the between-cluster distance is determined as the Euclidean distance between each pair of cluster mean vectors, given by the relationship

$d(\bar{\mathbf{x}}_k, \bar{\mathbf{x}}_l) = \sqrt{\sum_{j=1}^{p} (\bar{x}_{jk} - \bar{x}_{jl})^2}$.

The smallest between-cluster distance is

$d_{\min} = \min_{k \neq l} d(\bar{\mathbf{x}}_k, \bar{\mathbf{x}}_l)$,

and the score function for a set of K clusters is 

.

After evaluating the score function SK for values of K ranging from 2 to  (N is the total number of genes), the optimal value of K is 

.

Once Kopt is determined, the k-means algorithm is rerun using Kopt clusters and results are presented to the user.  When K>1 is specified by the user, K clusters are used.  A value of K=1 cannot be used. 
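
The two quantities that feed the score function can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, squared Euclidean distances are assumed for the within-cluster sum-of-squares, and the score function itself is not reproduced here.

```python
import math

def centroid(rows):
    """Array-specific mean expression over the genes (rows) in one cluster."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def euclidean(a, b):
    """Euclidean distance between two expression vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def within_ss(clusters):
    """Total within-cluster sum of squared distances from each gene to its centroid."""
    total = 0.0
    for rows in clusters:
        c = centroid(rows)
        total += sum(euclidean(r, c) ** 2 for r in rows)
    return total

def min_between(clusters):
    """Smallest Euclidean distance between any pair of cluster centroids."""
    cents = [centroid(rows) for rows in clusters]
    return min(euclidean(cents[k], cents[l])
               for k in range(len(cents)) for l in range(k + 1, len(cents)))

clusters = [
    [[1.0, 2.0], [3.0, 2.0]],      # cluster 1: two genes measured on two arrays
    [[10.0, 10.0], [10.0, 12.0]],  # cluster 2: two genes measured on two arrays
]
w = within_ss(clusters)
d_min = min_between(clusters)
```

Tight, well-separated clusters give a small within-cluster total together with a large smallest between-cluster distance, which is the behavior a score built from these two quantities is meant to capture.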

 

 

Cluster output

All significant

Specifies that all significant genes identified in hypothesis testing will appear in heat maps.

Top ___ genes

Specifies the number of significant genes to appear in heat maps.  

 

 

Array summary statistics

Matrix plots

Specifies construction of a matrix plot (X-Y scatter plot) of all possible pairs of arrays within a single class.  

Average, s.d., boxplots

Specifies calculation of the array-specific average, standard deviation, minimum, maximum, quartiles, and median.   A box plot is constructed for all arrays in which the top of each box represents the 75th percentile, the bottom the 25th percentile, and the line in the middle the median; the whiskers projecting from the bottom and top of each box end at the 5th and 95th percentiles.   

CV plot

Specifies construction of a single plot showing average expression of all genes across the arrays vs. standard deviations (over the arrays).

Histogram

Specifies generation of a frequency histogram for each array.  By default 100 bins are used.    

Correlation

Specifies calculation of a correlation matrix showing significant correlation between arrays using colors that represent levels of the p-value.

k-Sample tests

F-test (ANOVA)

Specifies a parametric F-test for all genes considered.   Will be disabled if fewer than three classes of arrays are specified. 

Kruskal-Wallis

Specifies a non-parametric Kruskal-Wallis test for genes considered.   Will be disabled if fewer than three classes of arrays are specified. 

 

2-Sample tests

T-test (Welch)

Specifies a parametric t-test for all genes considered.   Assumes unequal variance.  Requires two or more classes of arrays to be specified. 

Mann-Whitney

Specifies a non-parametric Mann-Whitney test for genes considered.   Requires two or more classes of arrays to be specified. 

All comparisons

Ensures that gene-specific expression is compared across all possible pairs of classes.   

Adjust t-df

When specified, the Dixon-Massey adjustment to the degrees of freedom is made to account for unequal variances between the two classes.

Fold criterion

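Only applies to the t-test.  When the Affymetrix option is not specified, the fold criterion, fcrit, is set equal to the value specified in the text box (fcrit = 0 is the default).   The numerator of the t statistic is then determined as

$(\bar{x}_1 - \bar{x}_2) - f_{crit}$,

where $\bar{x}_1$ and $\bar{x}_2$ are the averages of expression within the two classes, and the following values are reported in the Excel spreadsheet on output:

avg_1 = $\bar{x}_1$

avg_2 = $\bar{x}_2$

avg_1 - avg_2 = $\bar{x}_1 - \bar{x}_2$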

 

 

 

 

 

However, when the Affymetrix option is specified, the log2 transform is taken of the individual expression values used in calculating the t-test statistic.  In addition, the log2 transform is taken of the fold criterion, fcrit, specified by the user (text box).  The numerator of the t statistic is then based on the difference in the log2 of expression, in the form

$(\bar{y}_1 - \bar{y}_2) - \log_2(f_{crit})$, with $\bar{y}_c = \frac{1}{p_c}\sum_{j=1}^{p_c} \log_2(X_{jc})$,

where $X_{jc}$ is the gene’s expression on array j of the $p_c$ arrays in class c (j=1,2,…,$p_c$; c=1,2).  The geometric means are then determined as

$GM_c = 2^{\bar{y}_c}$,

and the following values are reported in the Excel spreadsheet on output:

GM_1

GM_2

Fold = GM_1 / GM_2
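
A small sketch of the log2-scale quantities described above. Several details are assumptions rather than documented behavior: the function name is hypothetical, the numerator is taken as the difference of the log2 averages minus log2 of the fold criterion, the fold is taken as GM_1 divided by GM_2, and a fold criterion of 0 is treated as "no adjustment" because log2(0) is undefined.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def log2_summary(class1, class2, fcrit=2.0):
    """Geometric means, fold, and fold-adjusted numerator on the log2 scale."""
    y1 = mean([math.log2(x) for x in class1])
    y2 = mean([math.log2(x) for x in class2])
    gm1, gm2 = 2 ** y1, 2 ** y2                      # geometric means per class
    # Assumption: a criterion of 0 means no adjustment is applied.
    delta = math.log2(fcrit) if fcrit > 0 else 0.0
    numerator = (y1 - y2) - delta                    # assumed form of the numerator
    return gm1, gm2, gm1 / gm2, numerator

gm1, gm2, fold, num = log2_summary([400.0, 1600.0], [100.0, 400.0], fcrit=2.0)
```

Here the geometric means are 800 and 200, so the fold is 4; the log2 difference of 2 is reduced by log2(2) = 1, which shows how strongly a fold criterion can shrink the t statistic on the log scale.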

 

 

 

 

 

 

                   

Test calls (QAMA-Qualitative Analysis of Microarrays)

This test performs a one-way ANOVA on the present and absent calls for qualitative analysis of microarray data (QAMA) between several treatment groups.   For each gene, present calls are assigned a value of 3 and absent calls a value of -3, on the assumption that a 3-sigma effect will ease the identification of genes having all present calls in one class and all absent calls in another.   The chi-square test and extensions of the binomial test are not applied here, in order to avoid cases where all the observed proportions of present (or absent) calls in cells are equal to the expected proportions, resulting in a significant finding.   What is desired here is a simple way to detect large differences between the present and absent calls across classes, not a comparison with expected chi-square proportions.  This test has worked reasonably well, and results have often sparked the curiosity of lab directors.  Some lab directors request that this test be performed before other tests that are based on differential expression.     
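
The coding step and the F statistic can be sketched as below. This is an illustrative outline, not the ChipST2C code: the function names are hypothetical, and the handling of zero within-class variance (returning infinity) is an assumption made so the example runs on perfectly separated calls.

```python
def code_calls(calls):
    """Present -> 3 and absent -> -3, as described in the text."""
    return [3 if c == "P" else -3 for c in calls]

def one_way_f(groups):
    """One-way ANOVA F statistic for a list of groups of numeric values."""
    all_vals = [v for g in groups for v in g]
    n, k = len(all_vals), len(groups)
    grand = sum(all_vals) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    if ss_within == 0:
        # Assumption: calls are identical within every class.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# One gene, three classes of four arrays each.
calls_by_class = (["P", "P", "P", "A"], ["A", "A", "A", "P"], ["A", "A", "A", "A"])
f_stat = one_way_f([code_calls(c) for c in calls_by_class])
```

The F statistic grows as the proportion of present calls diverges between classes, which is the pattern the test is meant to flag.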

 

Data transformations

Common log_10

Specifies the common logarithm (base 10) transform on all expression values.

Natural log_e

Specifies the natural logarithm (base e) transform on all expression values.

Log base ___

Specifies the logarithm (base b) transform on all expression values.

Log

Performs the actual logarithmic transform (run button).

AntiLog

Performs the actual anti-logarithmic transform (run button).

Transpose

Transposes the input data set.   Is disabled when (a) present/absent calls or p-values are present in the input data, (b) not all of the arrays in the input data file are assigned to a single class, or (c) multiple classes of arrays were assigned.   

Standardize during cluster analysis

Specifies that standardization is to be used during hierarchical cluster analysis.  When arrays are clustered, the standardization is over the genes (gene-specific averages and s.d. are used), whereas when genes are clustered, the standardization is based on array-specific averages and s.d.   If both arrays and genes are clustered, the standardization applied for clustering the arrays (using gene-specific averages and s.d.) is removed before the genes are clustered.  
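
The two directions of standardization can be sketched as follows, with genes as rows and arrays as columns. The function names are hypothetical, and use of the sample standard deviation (n-1 denominator) is an assumption; the text does not say which form ChipST2C uses.

```python
import statistics

def zscore(values):
    """Center on the mean and scale by the sample standard deviation."""
    m, s = statistics.mean(values), statistics.stdev(values)
    return [(v - m) / s for v in values]

def standardize_genes(matrix):
    """Standardize each row (gene) with its gene-specific average and s.d."""
    return [zscore(row) for row in matrix]

def standardize_arrays(matrix):
    """Standardize each column (array) with its array-specific average and s.d."""
    cols = [zscore(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*cols)]

data = [[1.0, 2.0, 3.0],   # gene 1 across three arrays
        [2.0, 4.0, 9.0],   # gene 2
        [3.0, 9.0, 6.0]]   # gene 3
by_gene = standardize_genes(data)
by_array = standardize_arrays(data)
```

A gene or array with constant values has zero standard deviation and would need special handling, which this sketch omits.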

Standardize arrays (permanent effect)

Actual run button to standardize arrays based on array-specific average and s.d.  Changes made to the data are permanent, but data can be restored to original values by clicking on either of the two screening buttons (i.e., buttons for “Use all genes” or “Use genes with more than % present calls”) described in screening.  

Standardize genes (permanent effect)

Actual run button to standardize genes based on gene-specific average and s.d.  Changes made to the data are permanent, but data can be restored to original values by clicking on either of the two screening buttons (i.e., buttons for “Use all genes” or “Use genes with more than % present calls”) described in screening.  

 

Permutation tests

Skip permutations

No randomization tests are performed.  

Within-genes

A within-gene randomization test is performed.   This test was introduced by Ge et al. for calculating permutation-based raw p-values in microarray studies.3,4   For 2-sample tests expression values are permuted, whereas in k-sample tests the class labels are reshuffled.   The empirical p-value after B iterations for gene i is

$p_i = \frac{\#\{b : |t_i^{(b)}| \geq |t_i|\}}{B}$,

where # is the number of occurrences for which {} is true and b: means for all b.  This equates to the number of times the test statistic $|t_i^{(b)}|$ based on permuted labels exceeds or equals the test statistic $|t_i|$ for the observed data configuration, divided by the number of permutations.   For 2-sample tests, the optimal number of permutations is based on the binomial coefficient

$\binom{n_1+n_2}{n_1} = \frac{(n_1+n_2)!}{n_1!\,n_2!}$,

whereas for k-sample tests (not k-means clustering), the optimal number of permutations is

$\frac{n!}{n_1!\,n_2!\cdots n_C!}$,

where $n_c$ is the number of arrays in class c and $n = \sum_{c=1}^{C} n_c$ is the total number of arrays for the C classes.
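
A minimal sketch of the within-gene test for a single gene and two classes, together with the permutation counts. The function names are hypothetical, a Welch-type t statistic is assumed as the test statistic, and random reshuffling is used rather than complete enumeration.

```python
import math
import random

def t_welch(a, b):
    """Welch t statistic for two samples (unequal variances)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def within_gene_p(a, b, n_perm=1000, seed=0):
    """Fraction of permutations whose |t| reaches or exceeds the observed |t|."""
    rng = random.Random(seed)
    t_obs = abs(t_welch(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                      # reshuffle values across classes
        if abs(t_welch(pooled[:len(a)], pooled[len(a):])) >= t_obs:
            hits += 1
    return hits / n_perm

def n_distinct_two_sample(n1, n2):
    """Number of distinct ways to split n1 + n2 arrays into the two classes."""
    return math.comb(n1 + n2, n1)

def n_distinct_k_sample(sizes):
    """Multinomial count of distinct label arrangements over all classes."""
    out = math.factorial(sum(sizes))
    for n in sizes:
        out //= math.factorial(n)
    return out

a = [5.1, 5.3, 5.0, 5.4]   # expression of one gene in class 1
b = [7.9, 8.2, 8.0, 8.3]   # expression of the same gene in class 2
p = within_gene_p(a, b)
```

With four arrays per class there are only 70 distinct splits, so the smallest attainable p-value is limited no matter how many random permutations are requested.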

         

Between genes

The between-gene randomization test is performed.   For this test, the reshuffling is still performed only within the gene; however, unlike the within-gene test, the significance is relative to other genes.   After, say, B iterations, the empirical p-value for the ith gene is

$p_i = \frac{\#\{j,b : |t_j^{(b)}| \geq |t_i|\}}{NB}$,

where # is the number of occurrences for which {} is true and j,b: means for all j and for all b.  Thus, during each iteration, test statistics of all other genes (j=1,2,…,N) based on their own permuted labels, $t_j^{(b)}$, are compared against the observed test statistic for gene i, $t_i$.   Since each iteration compares N values of $t_j^{(b)}$ for other genes against $t_i$, there are actually NB permutations involved.   The between-gene test was introduced by Storey and Tibshirani (2003) as the “genome-wide” test.5
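
The pooling step of the between-gene test can be sketched as below, assuming the permuted statistics have already been computed. The function name and the input layout (B lists of N statistics) are hypothetical.

```python
def between_gene_p(t_obs, t_perm):
    """Empirical p-values relative to the permuted statistics of all genes.

    t_obs:  list of N observed statistics, one per gene.
    t_perm: B lists, each holding N permuted statistics (one per gene).
    """
    pooled = [abs(t) for row in t_perm for t in row]   # N * B values in total
    total = len(pooled)
    return [sum(tp >= abs(t) for tp in pooled) / total for t in t_obs]

t_obs = [4.0, 0.5]                      # N = 2 genes
t_perm = [[1.0, -2.0], [0.2, 5.0]]      # B = 2 permutations
p = between_gene_p(t_obs, t_perm)
```

Because every gene contributes to the same pooled reference distribution, far fewer iterations are needed than for the within-gene test to reach a given resolution in the p-values.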

 

Multiple testing and FDR

No adjustment

No adjustments or corrections are made to p-values.

Bonferroni

The Bonferroni-adjusted significance level is $\alpha^* = \alpha/m$, where m is the number of genes tested.   Instead of calculating α* and then determining which genes have p-values less than α*, ChipST2C determines the q-values as

$q_i = \min(m\,p_i, 1)$,

and compares q-values with the original α to determine significant genes.  Only genes with qi<α are reported.      
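
A short sketch of the Bonferroni step. The function name is hypothetical, and capping the adjusted value at 1 is a common convention assumed here.

```python
def bonferroni_q(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(m * p, 1.0) for p in pvalues]

pvals = [0.001, 0.02, 0.04, 0.30]
alpha = 0.05
q = bonferroni_q(pvals)
significant = [i for i, qi in enumerate(q) if qi < alpha]
```

Comparing the adjusted values with the original α selects the same genes as comparing raw p-values with α divided by the number of tests; only the first gene passes in this example.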

Benjamini and Hochberg

This option specifies that the Benjamini and Hochberg (“BH”) false discovery rate (FDR) method is used.6  Under the BH FDR method, the quantity controlled is the expected proportion of type I errors among the rejected hypotheses.  Let p1, p2,…,pm represent p-values for m genes, and p(1) ≤ p(2) ≤ … ≤ p(m) represent their ranked counterparts sorted in ascending order.  Define k as the greatest value of i for which p(i) ≤ α i/m.  Reject the null hypotheses for all p(i) with i ≤ k, and treat all other p-values, for which i>k, as null.  q-values are determined as follows:

,

and are compared with the α level of FDR.  Only genes with q(i)<α  are reported.  A disadvantage of this approach is that it results in a large step function increase in the p vs. q plots for null genes.
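
For orientation, the standard BH step-up procedure can be sketched as follows. This is the textbook form of the adjustment and may differ from the q-value formula used by ChipST2C; the function names are hypothetical.

```python
def bh_cutoff_rank(pvalues, alpha=0.05):
    """Largest rank k with p(k) <= alpha * k / m, or 0 if no p-value qualifies."""
    m = len(pvalues)
    ranked = sorted(pvalues)
    ks = [i for i, p in enumerate(ranked, start=1) if p <= alpha * i / m]
    return max(ks) if ks else 0

def bh_adjust(pvalues):
    """Standard BH adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.005]
k = bh_cutoff_rank(pvals, alpha=0.05)
adj = bh_adjust(pvals)
```

In this example all four ranked p-values fall under their thresholds, so k equals 4 and every hypothesis is rejected at α = 0.05.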

Storey q-values

This method specifies calculation of “q-values” for the positive false discovery rate (pFDR) introduced by Storey. 5,7,8  The pFDR is defined as the conditional expectation of the proportion of type I errors among the rejected hypotheses, given that at least one hypothesis is rejected. 

Let $m_0$ represent the number of truly null tests among $m$ tests. Since it is known that null tests are distributed U(0,1), the expected number of null tests in the interval ($\lambda$,1) is approximated simply as $m_0(\lambda)\,(1-\lambda)$. For well-chosen values of $\lambda$, we can safely make the assumption that p-values greater than $\lambda$ in this interval are also null, and thus $m_0(\lambda)$ = #{