ChipST2C - Version 1.18
(Chip Significance Testing to(2) Cluster)
User’s Guide
2004-2006 Peterson Lab,
Revision: 3/12/06
Expression with calls in the form of p-values
Expression with calls in the form of “P” and “A,” representing present and absent, respectively
Extended Annotation Format (Accession, Gene, Location/LocusLink, Description)
Abbreviated Annotation Format (Description)
Recognition of Annotation Formats by ChipST2C
Opening tab-delimited files with expression values and calls
Opening tab-delimited files with only expression values
SCREENING GENES PRIOR TO ANALYSIS
Using Input Data with Present/Absent Call (or P-values)
Using Input Data without Present/Absent Call (or P-values)
Non-Default Output File Locations
Lifetime of Output File Directory
SPECIFYING COLORS PRIOR TO ANALYSIS
Setting Specific “Custom” Colors
Default Colors for Publications
Default Colors for Presentations
Changing the Default Heat Map (Gradient) Colors for Expression Values in Cluster Images
EXAMPLE 1. Input a tab-delimited text file with only expression values, and cluster only the genes.
Copy and paste an image into Powerpoint or Word
EXAMPLE 2. K-means cluster analysis of genes (setting k=0 to search for best value of k)
EXAMPLE 3. Summary statistics for classes (replicate groups) of arrays
EXAMPLE 4. F-test (ANOVA) to identify genes differentially expressed in 4 classes
EXAMPLE 8. Mann-Whitney U test to identify genes differentially expressed between two classes
EXAMPLE 9. All possible 2-sample tests between multiple classes
EXAMPLE 10. Transpose a data set prior to hierarchical cluster analysis or k-means clustering
EXAMPLE 11. Manually standardizing expression values prior to analysis
EXAMPLE 13. Calculating false discovery rates (pFDR) using q-values
EXAMPLE 16. Exporting (and importing) all data from a run to a ChipST2C “collaboration” file
EXAMPLE 17. Exporting expression of all screened genes (present and absent calls in input)
EXAMPLE 19. Search for gene accession or abbreviated names in an input file, and export expression
The following is a list of system requirements for installation and operation of ChipST2C:

For cluster analysis of large data sets, ChipST2C’s RAM (random access memory) requirements are a function of the number of genes to be clustered. In the table below, the gigabytes (GB) of RAM needed by ChipST2C are given in the “GB RAM needed” columns. As an example, if you have only 1 GB of RAM, you can cluster only about 16,000 genes. Keep in mind that the Windows operating system itself may need 200 MB, or 0.2 GB. Because the default memory on most desktop or notebook PCs is 512 megabytes (0.512 GB), you will only be able to cluster 10,000-12,000 genes, depending on the size of the swap file used for virtual (disk-based) memory, which augments the amount of RAM.
| Genes (A) | Matrix elements, off-diagonals + diagonal (B) = (A)(A-1)/2+A | Single-precision (4-byte reals): bytes needed, (B)*4 | Single-precision: GB RAM needed | Double-precision (8-byte reals): bytes needed, (B)*8 | Double-precision: GB RAM needed |
|---|---|---|---|---|---|
| 2000 | 2001000 | 8004000 | 0.008 | 16008000 | 0.016 |
| 4000 | 8002000 | 32008000 | 0.032 | 64016000 | 0.064 |
| 6000 | 18003000 | 72012000 | 0.072 | 144024000 | 0.144 |
| 8000 | 32004000 | 128016000 | 0.128 | 256032000 | 0.256 |
| 10000 | 50005000 | 200020000 | 0.200 | 400040000 | 0.400* |
| 12000 | 72006000 | 288024000 | 0.288 | 576048000 | 0.576* |
| 14000 | 98007000 | 392028000 | 0.392 | 784056000 | 0.784 |
| 16000 | 128008000 | 512032000 | 0.512 | 1024064000 | 1.024 |
| 18000 | 162009000 | 648036000 | 0.648 | 1296072000 | 1.296 |
| 20000 | 200010000 | 800040000 | 0.800 | 1600080000 | 1.600 |
| 22000 | 242011000 | 968044000 | 0.968 | 1936088000 | 1.936 |
| 24000 | 288012000 | 1152048000 | 1.152 | 2304096000 | 2.304 |
| 26000 | 338013000 | 1352052000 | 1.352 | 2704104000 | 2.704 |
| 28000 | 392014000 | 1568056000 | 1.568 | 3136112000 | 3.136 |
| 30000 | 450015000 | 1800060000 | 1.800 | 3600120000 | 3.600 |
| 32000 | 512016000 | 2048064000 | 2.048 | 4096128000 | 4.096 |
| 34000 | 578017000 | 2312068000 | 2.312 | 4624136000 | 4.624 |
| 36000 | 648018000 | 2592072000 | 2.592 | 5184144000 | 5.184 |
| 38000 | 722019000 | 2888076000 | 2.888 | 5776152000 | 5.776 |
| 40000 | 800020000 | 3200080000 | 3.200 | 6400160000 | 6.400 |
| 42000 | 882021000 | 3528084000 | 3.528 | 7056168000 | 7.056 |
| 44000 | 968022000 | 3872088000 | 3.872 | 7744176000 | 7.744 |
| 46000 | 1058023000 | 4232092000 | 4.232 | 8464184000 | 8.464 |
| 48000 | 1152024000 | 4608096000 | 4.608 | 9216192000 | 9.216 |
| 50000 | 1250025000 | 5000100000 | 5.000 | 10000200000 | 10.000 |
| 52000 | 1352026000 | 5408104000 | 5.408 | 10816208000 | 10.816 |
| 54000 | 1458027000 | 5832108000 | 5.832 | 11664216000 | 11.664 |
| 56000 | 1568028000 | 6272112000 | 6.272 | 12544224000 | 12.544 |
| 58000 | 1682029000 | 6728116000 | 6.728 | 13456232000 | 13.456 |
| 60000 | 1800030000 | 7200120000 | 7.200 | 14400240000 | 14.400 |
*Values marked with an asterisk indicate the limits for systems with 512 MB of RAM.
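The arithmetic behind the table can be reproduced with a short script. The following is a minimal sketch in Python, not part of ChipST2C; the function names are hypothetical, and it assumes, as the table does, that 1 GB equals 1,000,000,000 bytes.

```python
def matrix_elements(genes: int) -> int:
    """Off-diagonal plus diagonal elements of the gene-by-gene matrix: A(A-1)/2 + A."""
    return genes * (genes - 1) // 2 + genes


def ram_needed_gb(genes: int, bytes_per_real: int = 8) -> float:
    """RAM in GB (10**9 bytes, as in the table) for 4-byte or 8-byte reals."""
    return matrix_elements(genes) * bytes_per_real / 1e9


if __name__ == "__main__":
    # Print the single- and double-precision requirements for a few gene counts.
    for genes in (2000, 10000, 16000, 60000):
        single = ram_needed_gb(genes, 4)
        double = ram_needed_gb(genes, 8)
        print(f"{genes:>6} genes: {single:.3f} GB single, {double:.3f} GB double")
```

To judge whether a data set will fit, compare the result with the RAM that remains after the operating system has taken its share.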
After starting ChipST2C, the splash screen below will appear. ChipST2C was designed for large flat-screen monitors that easily accommodate a screen resolution of 1024 x 768 pixels. Therefore, the screen resolution of older monitors such as VGA or SVGA will need to be switched from 800 x 600 (or lower) to 1024 x 768.

There are three tab-delimited file formats that can be input into ChipST2C.
The tab-delimited text file named Benchmark_data_40_arrays_1000_genes_w_sig_calls.txt that is distributed with ChipST2C contains both expression and p-values for calls. Below is the appearance of this format when opened in Excel:

(image above shows the tab-delimited file format when opened with Excel)
The tab-delimited text file named Benchmark_data_40_arrays_1000_genes_w_PA_calls.txt that is distributed with ChipST2C contains both expression and present (“P”) and absent (“A”) calls. Below is the appearance of this format when opened in Excel:

(image above shows the tab-delimited file format when opened with Excel)
The tab-delimited text file named Benchmark_data_40_arrays_1000_genes_wo_calls.txt that is distributed with ChipST2C contains only expression values. Below is the appearance of this format when opened in Excel:

(image above shows the tab-delimited file format when opened with Excel)
There are two gene annotation formats that can be input into ChipST2C: Extended Annotation and Abbreviated Annotation. Gene names and other annotation are always to the right of the expression data.
The most detailed annotation format is called the Extended Annotation Format and includes each gene’s accession id, gene abbreviation, and chromosome location or LocusLink, followed by the description. In order to use this format, you must use the following column names in Row 1 of the tab-delimited file: “Accession”, followed by “Gene”, followed by either “Location” or “Locuslink”, followed by “Description”. The tab-delimited file named DCHIP_Export_With_Calls_and_Extended_Annotation.txt, which is bundled with ChipST2C, was opened with Excel and is shown below:

Notice that the call column (“NORMAL_9550_CALL”) for the last array is visible, and is followed by the field names “Accession”, “Gene”, “Location”, and “Description”.
When you input data in this format, ChipST2C will pick up on the field usage and allow you to format the gene names in cluster images by presenting the following drop-down box on the “Analyze” tab.
To experiment with the annotation styles in cluster images open the DCHIP export file as follows:

On the Analyze tab, ensure that the “Description” (Default) annotation is selected as shown below:

Click on the button to run a hierarchical cluster analysis of all of the data. After a cluster run with the default gene “Description” used, the following annotation will appear in the cluster image:

If “Accession” is specified as follows:
the following cluster image will be obtained:

As another example, if “Abbreviation” is selected for the gene name format in cluster images, the following results are obtained after cluster analysis:

If “Location” is selected as follows:

The following results will be obtained:

The Abbreviated Annotation Format is commonly used for clustering data sets with just the gene name on the right hand side of the spreadsheet, shown as follows.

There is no need to specify the type of annotation used when inputting data, since ChipST2C will pick up on the field names and make the necessary adjustments to options during run-time.

There can be no commas “,” used in accession numbers, gene abbreviations, or gene descriptions. If commas are present anywhere in the annotation, then the data will not be parsed correctly.
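Before opening a file in ChipST2C, it can be useful to check which annotation format its header row implies and whether any annotation field contains a comma. The following is a minimal sketch in Python, not ChipST2C’s own logic. The function names are hypothetical, and it assumes that column names are matched without regard to case and that the Abbreviated Annotation Format uses a single trailing “Description” column. The expression column name “NORMAL_9550” in the usage lines is also illustrative.

```python
def annotation_format(header: list[str]) -> str:
    """Guess the annotation format from the column names in Row 1."""
    names = [h.strip().lower() for h in header]
    tail = names[-4:]
    # Extended: Accession, Gene, Location or Locuslink, Description (rightmost columns).
    if (len(tail) == 4
            and tail[0] == "accession"
            and tail[1] == "gene"
            and tail[2] in ("location", "locuslink")
            and tail[3] == "description"):
        return "extended"
    # Abbreviated: only a Description column on the right (assumed column name).
    if names and names[-1] == "description":
        return "abbreviated"
    return "unknown"


def fields_with_commas(row: list[str], n_annotation: int) -> list[str]:
    """Return the annotation fields (rightmost n_annotation columns) that contain a comma."""
    return [field for field in row[-n_annotation:] if "," in field]


if __name__ == "__main__":
    header = ["NORMAL_9550", "NORMAL_9550_CALL",
              "Accession", "Gene", "Location", "Description"]
    print(annotation_format(header))
    print(fields_with_commas(["1.0", "P", "X1", "ABC", "1q2", "kinase, putative"], 4))
```

Any field reported by `fields_with_commas` would need its commas removed or replaced before the file is opened.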
When calls are present (either p-values or “P” and “A”), use the first option, File-->Open-->Expression with calls (E,0.01,E,0.06,…) or (E,P,E,A,E,A…), shown below:

When calls are not present, use the second option, File-->Open-->Expression only (E,E,E,E,E,…), shown below:
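The two File-->Open options correspond to two ways of reading the values for one gene. The following is a minimal sketch in Python of that distinction, not ChipST2C’s own parser. The function names are hypothetical, and it assumes the annotation columns have already been removed so that only expression values and calls remain.

```python
def parse_call(token: str):
    """Return 'P' or 'A' for present/absent calls, otherwise the call as a p-value."""
    token = token.strip()
    if token.upper() in ("P", "A"):
        return token.upper()
    return float(token)


def split_row(values: list[str], has_calls: bool):
    """Split one gene's values into (expression, calls).

    has_calls=True  -> layout E,call,E,call,... (p-values or P/A)
    has_calls=False -> layout E,E,E,...
    """
    if not has_calls:
        return [float(v) for v in values], []
    if len(values) % 2 != 0:
        raise ValueError("expected an expression/call pair for every array")
    expression = [float(v) for v in values[0::2]]
    calls = [parse_call(v) for v in values[1::2]]
    return expression, calls


if __name__ == "__main__":
    print(split_row(["120.5", "0.01", "98.2", "0.06"], has_calls=True))
    print(split_row(["120.5", "P", "98.2", "A"], has_calls=True))
    print(split_row(["1.5", "2.5", "3.5"], has_calls=False))
```

The layout is passed in explicitly rather than guessed, because p-values and expression values are both numeric and cannot be told apart reliably from the values alone.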

1. Start ChipST2C.
2. To open a file with only expression (no calls), select the second file open option as shown below:

3. Select the tab-delimited text file with 60 arrays, 120 genes, and no calls (as shown below), then click on “Open”:

After the file is read, the “Array” tab will be shown, listing the arrays that were input when the file was read. This is shown as follows:

4. Now that the data are read in, select the “Log” tab, and confirm that the log states there were 60 arrays and 120 genes read in from the file.

5. Next, select all of the arrays by clicking on the first array and dragging the mouse pointer down the entire list of arrays.




To perform the cluster analysis, click on the HCA button of the cluster option group (HCA stands for Hierarchical Cluster Analysis). Before the cluster run occurs, ChipST2C automatically selects all genes to be used in the analysis, although this can also be done manually. A “screen” icon in the left treeview indicates that 120 genes were specified for the analysis.
After the cluster run is complete, two icons are added to the treeview on the left, which are shown below:

Click on the first icon and the cluster image of the arrays and genes will be displayed in the image tab on the right (shown below):

By clicking on the second icon you will be able to view a plot of array-specific average expression ± the standard deviation (error bars), for example:

To copy an image into the Windows Clipboard for pasting into Powerpoint or Word, single-click on either one of the two icons, and then click on the button on the upper right side of the image tab. The selected image is now copied to the Windows Clipboard and can be pasted directly into Powerpoint or Word. In Powerpoint, here’s what we get:

Whereas, in Word we get:

When viewing the image tab, you can resize any image by clicking on the “-” and “+” buttons. Please note that resizing the image only changes the display and not the actual image file or what gets pasted into the clipboard. If we resize (reduce) the cluster image originally obtained, the following image is displayed:

Let’s change the colors prior to another cluster run, since the data are still in memory.
First, select the Edit Colors pull-down menu, then the “Publication” command, as shown below:

This will specify a white background with black fonts and lines, which is necessary for publications. Next, let’s change the color of the expression gradient in the cluster image by clicking on the icon in the Cluster group. You will then notice the default cluster color shown in the treeview (below):

Rerun the cluster analysis by simply clicking on the button. After run-time, you will notice the new icons for this run, and that the color has changed.

Now click on the icon, and you will notice the new colors in the cluster image, as illustrated below:

If the input data contain present and absent calls or p-values along with expression values, then you may want to screen out genes with absent calls prior to an analysis. The default is to use all genes that are input from the disk file, so one must be cautious about the genes used for each analysis. If you have present/absent calls in the input file, it is recommended that you screen for genes having a given percentage of present calls within each class prior to an analysis. The need to screen genes for present calls is independent of the number of classes to which arrays were assigned.
When screening genes for present calls, ChipST2C always applies the specified percentage criterion to each class separately, which prevents a gene from being selected when an entire class of arrays has all absent calls. Consider the case in which there are five classes of arrays and you screen for genes having 80% or more present calls. If the percentage criterion were applied globally across all arrays, then by chance alone a gene could be selected even though one or more classes contain all absent calls, especially classes with a small number of arrays. To prevent this from occurring, ChipST2C applies the percentage criterion for present calls within each class.
As another example, assume that you have completed an experiment with triplicate arrays at time points 0h, 24h, and 48h. If the 80% criterion is applied for present calls within each treatment (time), then at least 2 arrays (0.8 x 3 = 2.4 arrays, rounded down to 2) must have present calls in each class (0h, 24h, and 48h). Analogously, if only two arrays were used per class, then 2 arrays (0.8 x 2 = 1.6, rounded up to 2) must have present calls in each class (0h, 24h, and 48h).
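The per-class criterion can be illustrated with a short script. The following is a minimal sketch in Python, not ChipST2C’s actual implementation. The function names are hypothetical, and it assumes that the required count is the percentage of the class size rounded to the nearest whole array, which is consistent with both worked examples above (2.4 becomes 2 and 1.6 becomes 2). Only “P”/“A” calls are handled; p-value calls would first need a significance cutoff, which is not specified here.

```python
import math


def required_present(n_arrays: int, percent: float) -> int:
    """Arrays in one class that must have present calls (nearest-integer rounding, an assumption)."""
    return math.floor(n_arrays * percent / 100 + 0.5)


def passes_screen(calls_by_class: dict[str, list[str]], percent: float) -> bool:
    """True if the gene meets the present-call criterion within every class."""
    for calls in calls_by_class.values():
        present = sum(1 for call in calls if call == "P")
        if present < required_present(len(calls), percent):
            return False
    return True


if __name__ == "__main__":
    # One gene, triplicate arrays at three time points.
    gene = {
        "0h": ["P", "P", "A"],
        "24h": ["P", "A", "P"],
        "48h": ["A", "A", "A"],  # an entire class with absent calls
    }
    print(passes_screen(gene, 80))  # rejected because of the 48h class
    print(passes_screen(gene, 0))   # a 0% criterion keeps every gene
```

Applied globally, the same gene has 4 present calls out of 9 arrays; the per-class check makes explicit that the 48h class alone is enough to reject it.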
In summary, if you do have present/absent calls or p-values in the input data, then be sure to screen for genes with a given percentage of present calls per class. If not, then the following bias may result:
For hypothesis testing with 2-sample or k-sample testing, if you don’t want to determine summary statistics or standardize when there are present/absent calls in the data, then you can directly proceed to 2-sample or k-sample tests, since ChipST2C will automatically screen for genes having more than the given criterion for present calls (within each class) prior to the analysis. Be sure to set the criterion to 0% if you want to use all genes.
If you don’t have present/absent calls in the input data, then any of the analyses may be used without bias occurring.
The schematic below shows that there is no need to screen genes when only the expression values are in the input file.
All output files are written to the directory c:\Program Files\ChipST2C\ChipST2C\Output\, assuming ChipST2C was installed in the default directory. The second “ChipST2C” exists in the directory tree because the first represents the company name and the second represents the product name (as is usually done by most companies).
All output files are written to disk in a newly generated \Output subdirectory of the directory in which the input file resided. Therefore, any directory where an input file is opened will have a new Output subdirectory appended to it.
The contents of the \Output subdirectory in the directory in which the input files reside will remain intact until they are deleted.
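If you post-process results with your own scripts, the location of the \Output subdirectory can be derived from the path of the input file. The following is a minimal sketch in Python; the function name and the example path are hypothetical.

```python
from pathlib import PureWindowsPath


def output_dir_for(input_file: str) -> PureWindowsPath:
    """Return the Output subdirectory of the directory holding the input file."""
    return PureWindowsPath(input_file).parent / "Output"


if __name__ == "__main__":
    # Hypothetical input file location.
    print(output_dir_for(r"C:\data\run1\arrays.txt"))
```

`PureWindowsPath` is used so that the backslash-separated path is handled the same way on any operating system.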
The format of all images can be specified prior to run time. To change the default image format prior to an analysis, select the format from the drop down list provided in the “Data, transforms, output” group on the upper right side of the analysis tab.

The table below lists the possible image format options which can be specified prior to an analysis. File sizes are listed for the default cluster image for 40 arrays and 120 genes.
|
Image file format |
Size |
Remarks |
|
Windows Metafiles (.wmf) (default in ChipST2C) |