
Print mean, s.d.
For HCA (hierarchical cluster analysis) only. Selection of this checkbox will create icons for and save to disk the averages and standard deviations calculated during standardization, along with the standardized values, as well as distance matrices used in cluster agglomeration (node building).
Affymetrix data
Takes the log2 transform of expression data prior to hypothesis testing. When selected, averages of the log2 data are used in t-tests. If a fold criterion is specified, then log2 of the fold criterion is taken and t-tests (not Mann-Whitney, F-test, or Kruskal-Wallis) are modified such that the numerator becomes
$(\bar{y}_1 - \bar{y}_2) - \delta$,
where $\delta$ is log2 of the fold criterion specified, and $\bar{y}_1$ and $\bar{y}_2$ are the averages of log2(expression) data within the two classes being tested. Note that values of the t-statistic on output can be drastically altered when a fold criterion other than zero is used, especially when working on the log scale. For example, if 2 is used as the fold criterion, then log2(2) = 1 will be subtracted from the difference of the averages of the log2 expression data.
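The effect of the fold criterion on the t-statistic can be sketched in a few lines. This is a minimal illustration, not ChipST2C's code: the function name `shifted_welch_t` is hypothetical, a Welch-type standard error is assumed, the difference is taken as class 1 minus class 2, and a fold criterion of 0 is treated as "no shift".

```python
import math
from statistics import mean, variance

def shifted_welch_t(class1, class2, fold_criterion=0):
    """Welch-type t statistic on log2 expression with an optional fold shift.

    Assumption: the shift delta = log2(fold_criterion) is subtracted from
    (mean of class 1) - (mean of class 2); 0 means no fold criterion.
    """
    y1 = [math.log2(x) for x in class1]
    y2 = [math.log2(x) for x in class2]
    delta = math.log2(fold_criterion) if fold_criterion else 0.0
    numerator = (mean(y1) - mean(y2)) - delta
    # unequal-variance standard error of the difference in means
    se = math.sqrt(variance(y1) / len(y1) + variance(y2) / len(y2))
    return numerator / se

t_plain = shifted_welch_t([8, 16, 32], [2, 4, 8])
t_fold2 = shifted_welch_t([8, 16, 32], [2, 4, 8], fold_criterion=2)
```

In this toy example the log2 means differ by 2, so a fold criterion of 2 (a shift of 1 on the log2 scale) halves the t-statistic.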
Print flags in output data
When specified, an “F” is placed next to the expression value in the .csv file opened with Excel.
Sort spreadsheets on folds(2-sample)
When specified, output data for 2-sample tests are sorted on folds and not on q-values or p-values. When Affymetrix data are specified and the log2 of expression values is taken, the fold is based on the ratio of the geometric means of the two groups.
Two-step testing(T, then M-W)
When specified, both the t-test and Mann-Whitney tests are applied. In addition to generation of output tables of statistical testing results for each set of tests, an asterisk “*” is written to an output comparison spreadsheet in order to determine if genes were significant for both the t-test and Mann-Whitney tests. Example 9 provides an output comparison listing for multiple tests.
Map genes
When specified, chromosome-specific maps are generated with a color gradient based on the scale of log(p) for all genes for a given 2-sample test.
Compare p-correction
Not currently supported.
Present p-value<
If P and A are input with expression data, P’s are converted to 0.01 and A’s are converted to 0.06 when in memory during run-time. Using the default value of 0.05 for the criterion of an informative gene, spots (probes) with a p-value of 0.01 will be considered informative whereas spots having p-values of 0.06 will be non-informative.
Permutations
The number of permutations used in randomization tests. 1000-10,000 are recommended for the within-gene test and 100 recommended for the between-gene test.
Significance
The p-value criterion (Type I error rate, false positive rate) used in hypothesis testing.
Output Permutations
Writes values of test statistics to output spreadsheet during each permutation of randomization tests.
Default image format
The image format specified for all heat maps and charts.
Missing code
The code used to denote missing data. Used in standardization and distance calculations. Missing codes are deprecated in hypothesis testing, however, since the present/absent calls that commonly accompany expression data are used instead.
Gene selection
When data are read in from disk, the raw input matrix is stored in memory. From these data, ChipST2C builds the data set that is used for analysis. If there are present/absent calls or p-values in the data and the user has not already clicked one of the buttons shown above to manually screen the genes, a prompt will ask whether all genes should be used or only genes having more than X% of present calls within the classes specified.
Use all genes
This will select all input genes. If present/absent calls are in the input data, use of this option may bias the results by including “non-informative” genes. Some labs now use all genes without regard to calls, and the RT-PCR results in some cases have been remarkable. After all, expression values for absent calls are typically very low.
When there are no calls in the data, this option is the one of choice for most users.
Use genes with %P calls
This option will screen for genes having more than the given percentage of present calls within each class of arrays specified. Thus, if there are 10 arrays in a treatment class and 80% is the specified criterion, no genes will be used that have 7 or fewer present calls across the arrays in that class.
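The screening rule can be illustrated with a short sketch. The function name `screen_genes` and the data layout are hypothetical, and the threshold is treated as inclusive (8 of 10 present calls passes an 80% criterion), which follows the worked example above; ChipST2C's exact boundary handling is not confirmed here.

```python
def screen_genes(calls, classes, min_percent):
    """Return indices of genes that pass the present-call criterion in every class.

    calls[g][a] is "P" or "A" for gene g on array a;
    classes[a] is the class label of array a.
    Assumption: a gene passes a class when its percentage of "P" calls
    is at least min_percent.
    """
    labels = sorted(set(classes))
    kept = []
    for g, row in enumerate(calls):
        passes = True
        for label in labels:
            members = [row[a] for a in range(len(row)) if classes[a] == label]
            percent = 100.0 * members.count("P") / len(members)
            if percent < min_percent:
                passes = False
                break
        if passes:
            kept.append(g)
    return kept

# Gene 0 has 8 present calls out of 10, gene 1 has only 7.
calls = [["P"] * 8 + ["A"] * 2, ["P"] * 7 + ["A"] * 3]
kept = screen_genes(calls, ["T"] * 10, 80)
```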
Any time a user has not already screened the data by selecting one of the two buttons in this group, and there are present/absent calls or p-values in the input data, then the following prompt will be displayed upon trying to run any analysis:

This prompt serves only to remind the user to screen genes by a given percentage of present calls, applied to each class of arrays, in order to minimize bias in the results (from using “non-informative” genes).
Cluster options
Cluster arrays
This option specifies an unsupervised hierarchical cluster analysis of the arrays as objects. In this case, genes will be the attributes.
Cluster genes
This option specifies an unsupervised hierarchical cluster analysis of the genes as objects. In this case, arrays will be the attributes.
Nearest neighbor
Specifies node joining (agglomeration) by single-linkage. 1,2
UPGMA(default)
Specifies node joining using the unweighted pair group method with arithmetic mean (UPGMA).1,2
Furthest neighbor
Specifies node joining using complete linkage. 1,2
Euclidean
Specifies Euclidean distance as the distance function. 1,2
1-correlation
Specifies 1-r for the distance function. 1,2
Array pixel width
Specifies the width of array names and width of cells in heat maps (cluster images).
Gene pixel width
Specifies the height of gene names and cells in heat maps (cluster images).
Color
Specifies the color gradient used for heat maps. Scales are blue-green-yellow-orange-red, green-black-red, and blue-white-red.
Color gradient based on percentiles
Under this option (default), all expression values are sorted and the decile values of expression are used to determine the colors for the gradient of expression on heat maps.
Color gradient based on equal intervals
When this option is specified, the global range of expression (max-min) is divided into equal parts and the corresponding bin walls are used for determining colors of cells in heat maps. This is the method of choice for biallelic SNP genotypes (n=3) or CGH data where there is a loss/normal/gain value for each locus.
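The difference between the two gradient options comes down to how the bin walls are chosen. The sketch below is illustrative only: the function names are hypothetical, ten bins are assumed for the percentile option, and a simple nearest-rank rule stands in for whatever percentile convention ChipST2C uses.

```python
def percentile_edges(values, n_bins=10):
    """Bin walls at the 1/n_bins, 2/n_bins, ... quantiles (nearest-rank style)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[min(n - 1, (n * b) // n_bins)] for b in range(1, n_bins)]

def equal_interval_edges(values, n_bins=10):
    """Bin walls that split the global range (max - min) into equal parts."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * b for b in range(1, n_bins)]

def color_bin(value, edges):
    """Index of the color bin: the number of walls at or below the value."""
    return sum(1 for e in edges if value >= e)

decile_walls = percentile_edges(list(range(100)))
# Three genotype codes (0, 1, 2) map to three distinct colors with equal intervals.
snp_walls = equal_interval_edges([0, 1, 2], n_bins=3)
snp_bins = [color_bin(v, snp_walls) for v in (0, 1, 2)]
```

With heavily skewed expression data, percentile walls spread the colors evenly across genes, whereas equal intervals keep a fixed value-to-color mapping, which suits discrete data such as genotypes.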
HCA-(Hierarchical cluster analysis run button)
This is the actual run button for performing an unsupervised HCA. This button is intended for HCA when there is only one class of arrays, or all of the arrays in a data set were entered as a single class. Thus, it will be invisible if more than one class of arrays is specified.
<1000
This value specifies the upper limit of genes allowable for cluster analysis. During hypothesis testing there can be thousands of genes identified which are significant, and the user may not want a cluster image with so many genes.
K-means (run button)
This is the run button for k-means cluster analysis. This button is intended for k-means cluster analysis when there is only one class of arrays, or all of the arrays in a data set were entered as a single class. Thus, it will be invisible if more than one class of arrays is specified.
K:
Let K be the total number of clusters and k (k=1,2,…,K) represent the kth cluster of a clustering. When K=0 is specified by the user (default), the optimal value of K (best number of clusters) is determined by cycling through values of K from 2 up to an upper limit that depends on N, the total number of genes. This is performed as follows. For K clusters, the total within-cluster sum-of-squares based on Euclidean distance is
$W_K = \sum_{k=1}^{K} \sum_{i=1}^{n_k} d^2(\mathbf{x}_{ik}, \bar{\mathbf{x}}_k)$,
where $\mathbf{x}_{ik}$ is the row vector containing expression values for gene i in cluster k over the p arrays and $\bar{\mathbf{x}}_k$ is the row vector of array-specific mean expression values for the $n_k$ genes in cluster k. The Euclidean distance described above is calculated in the form
$d(\mathbf{x}_{ik}, \bar{\mathbf{x}}_k) = \sqrt{\sum_{j=1}^{p} (x_{ijk} - \bar{x}_{jk})^2}$,
where $x_{ijk}$ is the expression of gene i on array j in cluster k and $\bar{x}_{jk}$ is the average expression of the $n_k$ genes on array j assigned to cluster k (i=1,2,…,$n_k$ genes; j=1,2,…,p arrays; and k=1,2,…,K clusters). For the same K clusters, the between-cluster distance is determined as the Euclidean distance between each pair of cluster mean vectors, given by the relationship
$d(\bar{\mathbf{x}}_k, \bar{\mathbf{x}}_l) = \sqrt{\sum_{j=1}^{p} (\bar{x}_{jk} - \bar{x}_{jl})^2}$.
The smallest between-cluster distance is
$d_{\min} = \min_{k \neq l} d(\bar{\mathbf{x}}_k, \bar{\mathbf{x}}_l)$,
and the score function for a set of K clusters, $S_K$, is computed from the within-cluster sum-of-squares and the smallest between-cluster distance. After evaluating the score function $S_K$ over the full range of K, the optimal value, $K_{opt}$, is the value of K with the best score.
Once $K_{opt}$ is determined, the k-means algorithm is rerun using $K_{opt}$ clusters and results are presented to the user. When K>1 is specified by the user, K clusters are used. A value of K=1 cannot be used.
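The two ingredients of the score, the total within-cluster sum-of-squares and the smallest between-cluster distance, can be computed for any given clustering as sketched below. The function names and the list-of-lists layout are hypothetical, and the sketch stops short of the score function itself, whose exact form is not reproduced here.

```python
import math

def cluster_means(clusters):
    """clusters: list of clusters, each a list of gene rows (one value per array)."""
    return [[sum(col) / len(genes) for col in zip(*genes)] for genes in clusters]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def within_ss(clusters):
    """Sum of squared Euclidean distances from each gene to its cluster mean."""
    means = cluster_means(clusters)
    return sum(euclidean(g, m) ** 2
               for genes, m in zip(clusters, means) for g in genes)

def min_between_distance(clusters):
    """Smallest Euclidean distance between any pair of cluster mean vectors."""
    means = cluster_means(clusters)
    return min(euclidean(means[k], means[l])
               for k in range(len(means)) for l in range(k + 1, len(means)))

# Two clusters of two genes each, measured on two arrays.
toy = [[[0, 0], [2, 0]], [[10, 0], [10, 4]]]
w = within_ss(toy)             # 2 + 8 = 10
d = min_between_distance(toy)  # distance between (1, 0) and (10, 2)
```

A compact clustering has a small within-cluster sum-of-squares and well-separated cluster means, so any score built from these two quantities rewards tight, distinct clusters.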

All significant
Specifies that all significant genes identified in hypothesis testing will appear in heat maps.
Top ___ genes
Specifies the number of significant genes to appear in heat maps.
Array summary statistics
Matrix plots
Specifies construction of a matrix plot (X-Y scatter plot) of all possible pairs of arrays within a single class.
Average, s.d., boxplots
Specifies calculation of the array-specific average, standard deviation, minimum, maximum, quartiles, and median. A box plot is constructed for all arrays, showing for each array a box whose top line represents the 75th percentile, whose bottom line represents the 25th percentile, and whose middle line represents the median. The lines projecting out of the bottom and top of the box end at the 5th and 95th percentiles.
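As a rough illustration, the per-array summary could be computed as follows. The function name `array_summary` is hypothetical, the whiskers are assumed to sit at the 5th and 95th percentiles, and the "inclusive" interpolation rule of Python's `statistics.quantiles` is only one of several percentile conventions.

```python
from statistics import mean, stdev, quantiles

def array_summary(values):
    """Summary statistics and box plot landmarks for one array."""
    # 99 cut points: cuts[i - 1] is the i-th percentile
    cuts = quantiles(values, n=100, method="inclusive")
    return {
        "mean": mean(values),
        "sd": stdev(values),
        "min": min(values),
        "max": max(values),
        "p5": cuts[4],       # lower whisker (assumed)
        "q1": cuts[24],      # bottom of box
        "median": cuts[49],  # line inside box
        "q3": cuts[74],      # top of box
        "p95": cuts[94],     # upper whisker (assumed)
    }

summary = array_summary(list(range(101)))
```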
CV plot
Specifies construction of a single plot showing average expression of all genes across the arrays vs. standard deviations (over the arrays).
Histogram
Specifies generation of a frequency histogram for each array. By default 100 bins are used.
Correlation
Specifies calculation of a correlation matrix showing significant correlation between arrays using colors that represent levels of the p-value.
F-test (ANOVA)
Specifies a parametric F-test for all genes considered. Will be disabled if fewer than three classes of arrays are specified.
Kruskal-Wallis
Specifies a non-parametric Kruskal-Wallis test for genes considered. Will be disabled if fewer than three classes of arrays are specified.
T-test (Welch)
Specifies a parametric t-test for all genes considered. Assumes unequal variance. Requires two or more classes of arrays to be specified.
Mann-Whitney
Specifies a non-parametric Mann-Whitney test for genes considered. Requires two or more classes of arrays to be specified.
All comparisons
Will ensure that gene-specific expression is compared between all possible class comparisons.
Adjust t-df
When specified, the Dixon-Massey adjustment to the degrees of freedom is made to account for unequal variances between the two classes.
Fold criterion
Only applies to the t-test. When the Affymetrix option is not specified, the fold criterion, $f_{crit}$, is set equal to the value specified in the text box ($f_{crit}$ = 0 is the default). The numerator of the t statistic is then determined as
$(\bar{X}_1 - \bar{X}_2) - f_{crit}$,
where $\bar{X}_c$ is the average expression of the gene in class c, and the corresponding values are reported in the Excel spreadsheet on output.
However, when the Affymetrix option is specified, the log2 transform is taken of the individual expression values used in calculating the t-test statistic. In addition, the log2 transform is taken of the fold criterion, $f_{crit}$, specified by the user (text box). The numerator of the t statistic is then based on the difference in the log2 of expression, in the form
$\frac{1}{p_1}\sum_{j=1}^{p_1}\log_2 X_{j1} - \frac{1}{p_2}\sum_{j=1}^{p_2}\log_2 X_{j2} - \log_2 f_{crit}$,
where $X_{jc}$ is the gene’s expression in the $p_c$ arrays in class c (j=1,2,…,$p_c$; c=1,2). The geometric means are then determined as
$G_c = 2^{\frac{1}{p_c}\sum_{j=1}^{p_c}\log_2 X_{jc}}$
and are reported in the Excel spreadsheet on output.
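When the log2 transform is used, back-transforming the average of the log2 values gives the geometric mean, and the fold is the ratio of the two geometric means. A minimal sketch, with hypothetical function names:

```python
import math

def geometric_mean(values):
    """2 raised to the average of the log2 values."""
    return 2 ** (sum(math.log2(x) for x in values) / len(values))

def fold(class1, class2):
    """Fold as the ratio of the class geometric means."""
    return geometric_mean(class1) / geometric_mean(class2)

g1 = geometric_mean([2, 8])   # 2 ** ((1 + 3) / 2) = 4
f = fold([2, 8], [1, 4])      # 4 / 2 = 2
```

Equivalently, log2 of the fold equals the difference between the class averages of log2 expression, which is why a fold criterion enters the t-test numerator as a subtraction on the log scale.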

This test performs a one-way ANOVA on the present and absent calls for qualitative analysis of microarray data (QAMA) between several treatment groups. For each gene, present calls are assigned a value of 3 and absent calls a value of -3, on the assumption that a 3-sigma effect will ease the identification of genes having all present calls in one class and all absent calls in another. The chi-square test and extensions of the binomial test are not applied here, in order to avoid cases where all the observed proportions of present (or absent) calls in cells are equal to the expected proportions, resulting in a significant finding. What is desired here is a simple way to detect large differences between the present and absent calls across classes, not a comparison with expected chi-square proportions. This test has worked reasonably well, and results have often sparked the curiosity of lab directors. Some lab directors request that this test be performed before other tests that are based on differential expression.
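The coding step and the F statistic for one gene can be sketched as below. The function name `qama_f_statistic` is hypothetical, and the handling of zero within-class variation (returning infinity when the classes separate perfectly) is an assumption made for the illustration, not documented ChipST2C behavior.

```python
def qama_f_statistic(calls_by_class):
    """One-way ANOVA F statistic on calls coded P = 3, A = -3 for one gene.

    calls_by_class: list of classes, each a list of "P"/"A" calls.
    """
    coded = [[3.0 if c == "P" else -3.0 for c in cls] for cls in calls_by_class]
    n_total = sum(len(cls) for cls in coded)
    grand_mean = sum(sum(cls) for cls in coded) / n_total
    class_means = [sum(cls) / len(cls) for cls in coded]
    ss_between = sum(len(cls) * (m - grand_mean) ** 2
                     for cls, m in zip(coded, class_means))
    ss_within = sum((v - m) ** 2
                    for cls, m in zip(coded, class_means) for v in cls)
    df_between = len(coded) - 1
    df_within = n_total - len(coded)
    if ss_within == 0:
        # assumption: perfect separation (or no variation at all)
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / df_between) / (ss_within / df_within)

f_mixed = qama_f_statistic([["P", "P", "P", "A"], ["A", "A", "A", "P"]])
f_split = qama_f_statistic([["P"] * 4, ["A"] * 4])
```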
Data transformations
Common log_10
Specifies the common logarithm (base 10) transform on all expression values.
Natural log_e
Specifies the natural logarithm (base e) transform on all expression values.
Log base ___
Specifies the logarithm (base b) transform on all expression values.
Log
Performs the actual logarithmic transform (run button).
AntiLog
Performs the actual anti-logarithmic transform (run button).
Transpose
Transposes the input data set. Is disabled when (a) present/absent calls or p-values are present in the input data, (b) not all of the arrays in the input data file are assigned to a single class, or (c) multiple classes of arrays were assigned.
Standardize during cluster analysis
Specifies that standardization is to be used during hierarchical cluster analysis. When arrays are clustered, the standardization is over the genes (gene-specific averages and s.d. are used), whereas when genes are clustered the standardization is based on array-specific averages and s.d. If both arrays and genes are clustered, the standardization done for the arrays (using gene-specific averages and s.d.) is removed before the genes are clustered.
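Standardization in either direction amounts to subtracting an average and dividing by a standard deviation. The sketch below assumes a matrix with genes as rows and arrays as columns and uses the sample standard deviation (n-1 denominator); both function names are hypothetical, and rows or columns with zero s.d. are not handled.

```python
from statistics import mean, stdev

def standardize_rows(matrix):
    """Z-score each row using that row's own average and s.d. (gene-specific)."""
    out = []
    for row in matrix:
        m, s = mean(row), stdev(row)
        out.append([(x - m) / s for x in row])
    return out

def standardize_columns(matrix):
    """Z-score each column using that column's average and s.d. (array-specific)."""
    transposed = [list(col) for col in zip(*matrix)]
    return [list(row) for row in zip(*standardize_rows(transposed))]

by_gene = standardize_rows([[1, 2, 3]])
by_array = standardize_columns([[1, 2], [3, 4], [5, 6]])
```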
Standardize arrays (permanent effect)
Actual run button to standardize arrays based on array-specific average and s.d. Changes made to the data are permanent, but data can be restored to original values by clicking on either of the two screening buttons (i.e., buttons for “Use all genes” or “Use genes with more than % present calls”) described in screening.
Standardize genes (permanent effect)
Actual run button to standardize genes based on gene-specific average and s.d. Changes made to the data are permanent, but data can be restored to original values by clicking on either of the two screening buttons (i.e., buttons for “Use all genes” or “Use genes with more than % present calls”) described in screening.

Skip permutations
No randomization tests are performed.
Within-genes
A within-gene randomization test is performed. This test was introduced by Ge et al. for calculating permutation-based raw p-values in microarray studies.3,4 For 2-sample tests expression values are permuted, whereas in k-sample tests the class labels are reshuffled. The empirical p-value after B iterations for gene i is
$p_i = \frac{\#\{b : |t_i^{(b)}| \geq |t_i|\}}{B}$,
where # is the number of occurrences for which {} is true and b: means for all b. This equates to the number of times the test statistic based on permuted labels, $|t_i^{(b)}|$, exceeds or equals the test statistic $|t_i|$ for the observed data configuration, divided by the number of permutations. For 2-sample tests, the optimal number of permutations is based on the binomial coefficient
$\binom{p_1 + p_2}{p_1}$,
where $p_c$ is the number of arrays in class c, whereas for k-sample tests (not k-means clustering), the optimal number of permutations is
$\frac{p!}{p_1!\,p_2! \cdots p_C!}$,
where $p = \sum_{c=1}^{C} p_c$ is the total number of arrays for the C classes.
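A within-gene randomization test for the 2-sample case might look like the sketch below. It is an illustration under stated assumptions rather than ChipST2C's implementation: a Welch-type statistic is used, permutations are drawn at random instead of being enumerated, all function names are hypothetical, and groups with zero variance are not handled.

```python
import math
import random
from statistics import mean, variance

def welch_t(a, b):
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def within_gene_p(a, b, n_perm=1000, seed=0):
    """Fraction of permutations whose |t| equals or exceeds the observed |t|."""
    rng = random.Random(seed)
    observed = abs(welch_t(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reshuffle this gene's values across the two classes
        if abs(welch_t(pooled[:len(a)], pooled[len(a):])) >= observed:
            hits += 1
    return hits / n_perm

def n_distinct_two_sample(n1, n2):
    """Number of distinct ways to split n1 + n2 arrays into the two classes."""
    return math.comb(n1 + n2, n1)

def n_distinct_k_sample(sizes):
    """Multinomial count of distinct label assignments for several classes."""
    total = math.factorial(sum(sizes))
    for s in sizes:
        total //= math.factorial(s)
    return total

p_value = within_gene_p([10, 11, 12, 13, 14], [1, 2, 3, 4, 5], n_perm=500, seed=1)
```

With 5 arrays per class there are only 252 distinct splits, so requesting many more permutations than that adds no new information for such a small design.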
Between genes
The between-gene randomization test is performed. For this test, the reshuffling is still performed only within the gene; however, unlike the within-gene test, the significance is relative to other genes. After, say, B iterations, the empirical p-value for the ith gene is
$p_i = \frac{\#\{j,b : |t_j^{(b)}| \geq |t_i|\}}{NB}$,
where # is the number of occurrences for which {} is true and j,b: means for all j and for all b. Thus, during each iteration, test statistics of all other genes (j=1,2,…,N) based on their own permuted labels are compared against the observed test statistic for gene i, $|t_i|$. Since each iteration compares N values of $|t_j^{(b)}|$ for other genes against $|t_i|$, there are actually NB permutations involved. The between-gene test was introduced by Storey and Tibshirani (2003) as the “genome-wide” test.5
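Given the observed statistics and a table of permuted statistics, the between-gene p-values can be obtained by pooling all N×B permuted values into a single null distribution. The sketch below assumes the permuted statistics have already been computed; the function name and argument layout are hypothetical.

```python
import bisect

def between_gene_p(observed, permuted):
    """Empirical p-values against the pooled null of all genes and permutations.

    observed[i]    : test statistic of gene i for the observed data
    permuted[b][j] : test statistic of gene j in permutation b
    """
    pooled = sorted(abs(t) for row in permuted for t in row)
    total = len(pooled)  # N genes times B permutations
    p_values = []
    for t_obs in observed:
        # bisect_left finds the first pooled value >= |t_obs|
        first = bisect.bisect_left(pooled, abs(t_obs))
        p_values.append((total - first) / total)
    return p_values

p_between = between_gene_p([3.0, 0.5], [[1.0, 2.0], [0.2, 3.5]])
```

Sorting the pooled statistics once and using bisection keeps the cost manageable even when N×B is large.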

No adjustment
No adjustments or corrections are made to p-values.
Bonferroni
The Bonferroni adjusted p-value criterion is $\alpha^* = \alpha/m$, where m is the number of genes tested. Instead of calculating α* and then determining which genes have p-values less than α*, ChipST2C determines the q-values as
$q_i = m\,p_i$,
and compares q-values with the original α to determine significant genes. Only genes with qi<α are reported.
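Scaling the p-values up by the number of genes is equivalent to scaling α down, as this short sketch shows. The function name is hypothetical, and the q-values are left uncapped (they may exceed 1), which is an assumption made for the illustration.

```python
def bonferroni_q(p_values, alpha=0.05):
    """Return Bonferroni q-values (m * p) and the indices with q < alpha."""
    m = len(p_values)
    q = [m * p for p in p_values]
    significant = [i for i, qi in enumerate(q) if qi < alpha]
    return q, significant

q_vals, sig = bonferroni_q([0.001, 0.02, 0.5])
```

Here the second gene has p = 0.02, below the unadjusted α of 0.05, but its q-value of 0.06 is not, so only the first gene is reported.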
Benjamini and Hochberg
This option specifies that the Benjamini and Hochberg (“BH”) false discovery rate (FDR) method is used.6 Under the BH FDR method, significant genes reflect the expected proportion of type I errors among the rejected hypotheses. Let p1, p2,…,pm represent p-values for m genes, and p(1) ≤ p(2) ≤ … ≤ p(m) represent their ranked counterparts sorted in ascending order. Define k as the greatest value of i for which p(i) ≤ α i /m is true. Reject all hypotheses with p(i) for which i ≤ k, and define all other p-values, for which i>k, as null. q-values are determined as follows:
$q_{(i)} = \frac{m\,p_{(i)}}{i}$,
and are compared with the α level of FDR. Only genes with q(i)<α are reported. A disadvantage of this approach is that it results in a large step function increase in the p vs. q plots for null genes.
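The step-up rule can be sketched as follows. The function name is hypothetical, and the q-value form m·p(i)/i is assumed here as the usual BH scaling of ranked p-values; how ChipST2C assigns q-values to the genes declared null is not reproduced.

```python
def bh_step_up(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up: returns (k, rejected indices, q-values)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ranks p in ascending order
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            k = rank  # keep the largest rank that satisfies the criterion
    rejected = sorted(order[:k])
    q_by_index = {i: m * p_values[i] / rank
                  for rank, i in enumerate(order, start=1)}
    return k, rejected, [q_by_index[i] for i in range(m)]

k, rejected, q = bh_step_up([0.001, 0.03, 0.035, 0.5])
```

In this example the second-ranked p-value (0.03) misses its own threshold of 0.025, but it is still rejected because the third-ranked p-value (0.035) meets its threshold of 0.0375; that is the "step-up" behavior.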
Storey q-values
This method specifies calculation of “q-values” for the positive false discovery rate (pFDR) introduced by Storey. 5,7,8 The pFDR is defined as the conditional expectation of the proportion of type I errors among the rejected hypotheses, given that at least one hypothesis is rejected.
Let m0 represent the number of truly null tests among m tests. Since it is known that p-values of null tests are distributed U(0,1), the expected number of null tests in the interval (λ,1) is approximated simply as m0(λ)(1-λ). For well chosen values of λ, we can safely make the assumption that p-values greater than λ in the interval (λ,1) are also null, and thus m0(λ) = #{