ChipST2C - Version 1.18

 

 

 

(Chip Significance Testing to (2) Cluster)

 

 

User’s Guide

 

2004-2006 Peterson Lab, Baylor College of Medicine

 

Revision:  3/12/06

 

 

TABLE OF CONTENTS

 

SYSTEM REQUIREMENTS

            Hardware

            Software

            Memory

 

TAB-DELIMITED FILE FORMATS

            Expression with calls in the form of p-values

            Expression with calls in the form of “P” and “A,” representing present and absent, respectively

            Expression without calls

 

ANNOTATION FORMATS

            Extended Annotation Format (Accession, Gene, Location/LocusLink, Description)

            Abbreviated Annotation Format (Description)

            Recognition of Annotation Formats by ChipST2C

            Limitations of Annotation

 

OPENING INPUT FILES

            Opening tab-delimited files with expression values and calls

            Opening tab-delimited files with only expression values

 

SCREENING GENES PRIOR TO ANALYSIS

            Using Input Data with Present/Absent Call (or P-values)

            Using Input Data without Present/Absent Call (or P-values)

 

FILE OUTPUT

            Default Output File Locations

            Non-Default Output File Locations

            Lifetime of Output File Directory

 

IMAGE FORMAT

 

SPECIFYING COLORS PRIOR TO ANALYSIS

            Setting Specific “Custom” Colors

            Default Colors for Publications

            Default Colors for Presentations

            Changing the Default Heat Map (Gradient) Colors for Expression Values in Cluster Images

 

EXAMPLE 1.   Input a tab-delimited text file with only expression values, and cluster only the genes.

            Copy and paste an image into Powerpoint or Word

            Resizing an image

            Changing Colors

EXAMPLE 2.   K-means cluster analysis of genes (setting k=0 to search for best value of k)

EXAMPLE 3.   Summary statistics for classes (replicate groups) of arrays

EXAMPLE 4.   F-test (ANOVA) to identify genes differentially expressed in 4 classes

EXAMPLE 5.   F-test (ANOVA) to identify genes differentially expressed in 4 (diagnostic) classes of arrays, followed by k-means clustering to partition the identified genes

EXAMPLE 6.    Kruskal-Wallis k-sample test to determine significantly differentially expressed genes among four (diagnostic) classes of arrays

EXAMPLE 7.   T-test to identify genes differentially expressed between two classes, followed by (a) k-means cluster analysis on significant genes, and (b) bar graphs of most significant genes

EXAMPLE 8.   Mann-Whitney U test to identify genes differentially expressed between two classes

EXAMPLE 9.   All possible 2-sample tests between multiple classes

EXAMPLE 10.   Transpose a data set prior to hierarchical cluster analysis or k-means clustering

EXAMPLE 11.   Manually standardizing expression values prior to analysis

EXAMPLE 12.   Automatically standardizing expression values during hierarchical cluster analysis (HCA), and saving raw, standardized data, and distance matrices to disk during run-time

EXAMPLE 13.   Calculating false discovery rates (pFDR) using q-values

EXAMPLE 14.   Comparing the number of genes significant for 2-sample and k-sample tests as a function of p-value correction and FDR method

EXAMPLE 15.   Benchmarking randomization tests: confirming test statistics during each permutation (iteration)

EXAMPLE 16.   Exporting (and importing) all data from a run to a ChipST2C “collaboration” file

EXAMPLE 17.   Exporting expression of all screened genes (present and absent calls in input)

EXAMPLE 18.   Exporting expression data for cluster analysis or classification analysis (after using multiple 2-sample tests or one k-sample test)

EXAMPLE 19.   Search for gene accession or abbreviated names in an input file, and export expression

 

 

 

SYSTEM REQUIREMENTS

 

The following is a list of system requirements for installation and operation of ChipST2C:

Hardware

  1. 32-bit PC system, preferably with a Pentium IV or AMD Athlon CPU.
  2. 1 Gigabyte (GB) of Random Access Memory (RAM).   0.5 GB (512 MB) is the default on most PCs.
  3. Hard disk with 20 or more GB of disk space.
  4. Flat-screen LCD monitor set to 1024 x 768 pixels.   VGA and SVGA monitors are not recommended.

 

Software

  1. Microsoft .NET Framework 1.1 Redistributable Package.   ChipST2C is based solely on the “next-generation technology” contained in Microsoft’s .NET Framework.   The .NET Framework is still under development and is therefore not installed on all Windows XP systems.   Eventually, the .NET Framework will be bundled with all future versions of the Windows operating system.

 

 

 

  2. Windows XP (32-bit).   Windows 2000 will suffice as long as the .NET Framework is installed.
  3. Windows XP Home (32-bit).   Users may be required to download and install the .NET Framework.

 

Memory

For cluster analysis of large data sets, ChipST2C’s RAM (random access memory) requirements are a function of the number of genes to be clustered.  The table below lists the gigabytes (GB) of RAM needed by ChipST2C for a given number of genes.  As an example, if you have only 1 GB of RAM, you can cluster only about 16,000 genes.  Keep in mind that the Windows operating system itself may need 200 MB, or 0.2 GB.   Because the default memory on most desktop or notebook PCs is 512 megabytes (0.512 GB), you will only be able to cluster 10,000-12,000 genes, depending on the size of the swap file used for virtual (disk-based) memory, which augments the amount of RAM.

 

 

 

 

 

 

 

Genes   Matrix elements           Single-precision (4-byte reals)   Double-precision (8-byte reals)
(A)     Off-diagonals + diagonal  Bytes needed    GB RAM needed     Bytes needed    GB RAM needed
        (B) = (A)(A-1)/2+A        (B)*4                             (B)*8

2000    2001000                   8004000         0.008             16008000        0.016
4000    8002000                   32008000        0.032             64016000        0.064
6000    18003000                  72012000        0.072             144024000       0.144
8000    32004000                  128016000       0.128             256032000       0.256
10000   50005000                  200020000       0.200             400040000       0.400*
12000   72006000                  288024000       0.288             576048000       0.576*
14000   98007000                  392028000       0.392             784056000       0.784
16000   128008000                 512032000       0.512             1024064000      1.024
18000   162009000                 648036000       0.648             1296072000      1.296
20000   200010000                 800040000       0.800             1600080000      1.600
22000   242011000                 968044000       0.968             1936088000      1.936
24000   288012000                 1152048000      1.152             2304096000      2.304
26000   338013000                 1352052000      1.352             2704104000      2.704
28000   392014000                 1568056000      1.568             3136112000      3.136
30000   450015000                 1800060000      1.800             3600120000      3.600
32000   512016000                 2048064000      2.048             4096128000      4.096
34000   578017000                 2312068000      2.312             4624136000      4.624
36000   648018000                 2592072000      2.592             5184144000      5.184
38000   722019000                 2888076000      2.888             5776152000      5.776
40000   800020000                 3200080000      3.200             6400160000      6.400
42000   882021000                 3528084000      3.528             7056168000      7.056
44000   968022000                 3872088000      3.872             7744176000      7.744
46000   1058023000                4232092000      4.232             8464184000      8.464
48000   1152024000                4608096000      4.608             9216192000      9.216
50000   1250025000                5000100000      5.000             10000200000     10.000
52000   1352026000                5408104000      5.408             10816208000     10.816
54000   1458027000                5832108000      5.832             11664216000     11.664
56000   1568028000                6272112000      6.272             12544224000     12.544
58000   1682029000                6728116000      6.728             13456232000     13.456
60000   1800030000                7200120000      7.200             14400240000     14.400

 

*Limits of systems with 512 MB of RAM are marked with an asterisk and shown in red.
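The arithmetic behind the memory table can be reproduced with a short script.  The sketch below is only an illustration of the formulas in the column headings, (B) = (A)(A-1)/2+A elements at 4 or 8 bytes each, with 1 GB taken as 1,000,000,000 bytes as in the table.  The function names are hypothetical and are not part of ChipST2C.

```python
import math


def matrix_elements(genes: int) -> int:
    """Off-diagonal plus diagonal elements of a symmetric genes x genes matrix."""
    return genes * (genes - 1) // 2 + genes


def ram_needed_gb(genes: int, bytes_per_value: int = 4) -> float:
    """RAM in decimal gigabytes (1 GB = 1e9 bytes, as used in the table)."""
    return matrix_elements(genes) * bytes_per_value / 1e9


def max_genes(budget_bytes: int, bytes_per_value: int = 4) -> int:
    """Largest gene count whose matrix fits into budget_bytes."""
    # (A)(A-1)/2 + A equals A(A+1)/2, so we need A*(A+1) <= k.
    k = 2 * budget_bytes // bytes_per_value
    return (math.isqrt(1 + 4 * k) - 1) // 2


print(matrix_elements(2000))              # 2001000
print(round(ram_needed_gb(16000, 8), 3))  # 1.024
print(max_genes(512_000_000, 4))          # 15999
```

These figures cover the matrix alone; memory used by Windows and by the rest of the program is not included.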

 

 

Program Startup

After starting ChipST2C, the splash screen below will appear.  ChipST2C was designed for large flat-screen monitors that easily accommodate a screen resolution of 1024 x 768 pixels.   Therefore, the screen resolution of older monitors such as VGA or SVGA will need to be switched from 800 x 600 (or lower) to 1024 x 768.

 

 

 

 

Tab-Delimited File Formats

 

There are three tab-delimited file formats that can be input into ChipST2C.  

 

  1. Expression with calls in the form of p-values (e.g., 2.51<tab>0.0023<tab>-0.2<tab>0.78<tab>…, etc.).
  2. Expression with calls in the form of “P” and “A,” representing present and absent, respectively (e.g., 2.51<tab>P<tab>-0.2<tab>A<tab>…, etc.).
  3. Expression without calls (e.g., 2.51<tab>-0.2<tab>-1.21<tab>1.33<tab>…, etc.).
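To make the three layouts concrete, the sketch below splits the value fields of one row into expression values and calls.  It is an illustration only, not ChipST2C's reader: the function name is hypothetical, and it assumes the fields passed in contain only the expression/call columns, with any identifier or annotation columns already removed.  Because a row of p-value calls cannot be told apart from a row of expression values by inspection alone, the caller states whether calls are present.

```python
def parse_values(fields, has_calls):
    """Split the value fields of one row into (expression, calls).

    fields    -- list of strings from one tab-delimited row
    has_calls -- True if the columns alternate expression, call
    """
    if not has_calls:
        # Format 3: every field is an expression value.
        return [float(f) for f in fields], []
    if len(fields) % 2 != 0:
        raise ValueError("expected expression/call pairs")
    # Formats 1 and 2: even positions hold expression, odd positions hold calls.
    expression = [float(f) for f in fields[0::2]]
    calls = [c if c in ("P", "A") else float(c) for c in fields[1::2]]
    return expression, calls


row = "2.51\t0.0023\t-0.2\t0.78".split("\t")
print(parse_values(row, has_calls=True))  # ([2.51, -0.2], [0.0023, 0.78])
```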

 

 

Expression with calls in the form of p-values

 

The tab-delimited text file named Benchmark_data_40_arrays_1000_genes_w_sig_calls.txt that is distributed with ChipST2C contains both expression and p-values for calls.  Below is the appearance of this format when opened in Excel:

 

 

(image above shows the tab-delimited file format when opened with Excel)

 

 

Expression with calls in the form of “P” and “A,” representing present and absent, respectively

 

The tab-delimited text file named Benchmark_data_40_arrays_1000_genes_w_PA_calls.txt that is distributed with ChipST2C contains both expression and present (“P”) and absent (“A”) calls.  Below is the appearance of this format when opened in Excel:

 

 

(image above shows the tab-delimited file format when opened with Excel)

 

 

Expression without calls

 

The tab-delimited text file named Benchmark_data_40_arrays_1000_genes_wo_calls.txt that is distributed with ChipST2C contains only expression values.  Below is the appearance of this format when opened in Excel:

 

(image above shows the tab-delimited file format when opened with Excel)

 

 

Annotation Formats

 

There are two gene annotation formats that can be input into ChipST2C.   These are Extended Annotation and Abbreviated Annotation.   Gene names and other annotation are always to the right of the expression data.

Extended Annotation Format (Accession, Gene, Location/LocusLink, Description)

 

The most detailed annotation format is called the Extended Annotation Format and includes each gene’s accession ID, gene abbreviation, chromosome location or LocusLink, followed by the description.   In order to use this format, you must use the following column names in Row 1 of the tab-delimited file: “Accession,” followed by “Gene,” followed by either “Location” or “Locuslink,” followed by “Description.”   The tab-delimited file named DCHIP_Export_With_Calls_and_Extended_Annotation.txt, which is bundled with ChipST2C, was opened with Excel and is shown below:

 

 

 

Notice that the call column (“NORMAL_9550_CALL”) for the last array is visible, and is followed by the field names “Accession”, “Gene”, “Location”, and “Description”.

 

When you input data in this format, ChipST2C will pick up on the field usage, and allow you to format the gene names in cluster images by presenting the following drop down box on the “Analyze” tab.     

 

To experiment with the annotation styles in cluster images open the DCHIP export file as follows:

 

 

 

 

 

On the Analyze tab, ensure that the “Description” (Default) annotation is selected as shown below:

 

 

Click on the HCA button to run a hierarchical cluster analysis of all of the data.   After a cluster run with the default gene “Description” annotation, the following annotation will appear in the cluster image:

 

 

 

 

 

If “Accession” is specified as follows:

 

 

 

the following cluster image will be obtained:

 

 

 

 

As another example, “Abbreviation” was selected for the gene name format in cluster images, with the following results obtained after cluster analysis:

 

 

 

 

 

If “Location” is selected as follows:

 

 

 

The following results will be obtained:

 

 

 

 

Abbreviated Annotation Format (Description)

The Abbreviated Annotation Format is commonly used for clustering data sets with just the gene name on the right hand side of the spreadsheet, shown as follows.

 

 

Recognition of Annotation Formats by ChipST2C

There is no need to specify the type of annotation used when inputting data, since ChipST2C will pick up on the field names and make the necessary adjustments to options during run-time.
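One way such recognition could work is to inspect the rightmost column names of Row 1.  The sketch below is a guess at the idea, not ChipST2C's actual logic: the function name and return values are hypothetical, matching is assumed to be case-insensitive, and the Abbreviated format is assumed to use a single trailing “Description” column.

```python
def detect_annotation_format(header):
    """Classify the annotation columns at the right-hand end of a header row."""
    names = [h.strip().lower() for h in header]
    if len(names) >= 4:
        accession, gene, location, description = names[-4:]
        if (accession == "accession" and gene == "gene"
                and location in ("location", "locuslink")
                and description == "description"):
            return "extended"
    if names and names[-1] == "description":
        return "abbreviated"
    return "none"


header = ["NORMAL_9550", "NORMAL_9550_CALL",
          "Accession", "Gene", "Location", "Description"]
print(detect_annotation_format(header))  # extended
```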

Limitations of Annotation

There can be no commas “,” used in accession numbers, gene abbreviations, or gene descriptions.   If commas are present anywhere in the annotation, then the data will not be parsed correctly. 
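Before opening a file, it may help to check the annotation columns for commas.  The sketch below is a hypothetical helper, not a ChipST2C feature; it assumes the caller knows how many annotation columns sit at the right-hand end of each row (four for the Extended format, one for the Abbreviated format).  The sample rows are made up for illustration.

```python
def rows_with_commas(lines, annotation_columns):
    """Return 1-based line numbers whose annotation fields contain a comma."""
    if annotation_columns < 1:
        raise ValueError("annotation_columns must be at least 1")
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        # Only the rightmost fields hold annotation.
        if any("," in field for field in fields[-annotation_columns:]):
            bad.append(number)
    return bad


# Made-up rows: two expression values followed by four annotation fields.
sample = [
    "2.51\t-0.2\tACC1\tGENE1\t1q21\tfirst example gene",
    "1.10\t0.30\tACC2\tGENE2\t2p13\tsecond example gene, putative",
]
print(rows_with_commas(sample, annotation_columns=4))  # [2]
```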

 

Opening Input Files

 

Opening tab-delimited files with expression values and calls

When calls are present (either p-values or “P” and “A”), use the first option, File-->Open-->Expression with calls (E,0.01,E,0.06,…) or (E,P,E,A,E,A…), shown below:

 

 

 

Opening tab-delimited files with only expression values

When calls are not present (neither p-values nor “P” and “A”), use the second option, File-->Open-->Expression only (E,E,E,E,E,…), shown below:

 

 

 

 

 

 

EXAMPLE 1.   Input a tab-delimited text file with only expression values, and cluster only the genes.

 

1.   Start ChipST2C.

2.   To open a file with only expression (no calls), select the second file open option as shown below:

 

 

 

 

3.  Select the tab-delimited text file with 60 arrays, 120 genes, and no calls (as shown below), then click on “Open”:

 

 

 

 

After the file is read, the “Array” tab will be shown, listing the arrays that were input when the file was read.   This is shown as follows:

 

 

 

 

4.  Now that the data are read in, select the “Log” tab, and confirm that the log states there were 60 arrays and 120 genes read in from the file.  

 

 

 

 

 

 

5.   Next, select all of the arrays by clicking on the first array and dragging the mouse pointer down the entire list of arrays.

 

 

6.   Now that all of the arrays are selected, click on the right-arrow button to send all of the arrays as a single group to the “Arrays in this class” list.   You should obtain the same results as shown below:

 

7.   Skip naming the class, and click on the second right-arrow button to send the arrays to the “Arrays in all classes” list.  (Arrays do not need to be selected for this step.)

 

8.   All of the arrays are now available for analysis.

 

9.   Select the Analyze tab, and you will see the main control panel for analysis in ChipST2C, as illustrated below:

 

 

To perform the cluster analysis, click on the HCA button of the cluster option group (HCA stands for Hierarchical Cluster Analysis).

 

  

 

 

Before the cluster run occurs, ChipST2C automatically selects all genes to be used in the analysis.   This can also be done manually, however.   A “screen” icon in the left treeview indicates that 120 genes were specified for the analysis.

 

After the cluster run is complete, two icons are added to the treeview on the left, which are shown below:

 

 

 

 

 

Click on the first icon and the cluster image of the arrays and genes will be displayed in the image tab on the right (shown below):

 

 

 

By clicking on the second icon you will be able to view a plot of array-specific average expression ± the standard deviation (error bars), for example:
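The quantities summarized in such a plot can be sketched as follows.  This is an illustration only: the function name is hypothetical, the input is assumed to be a list of rows (genes) with one column per array, and the sample standard deviation (n-1 denominator) is an assumption, since this guide does not state which form ChipST2C uses.

```python
from statistics import mean, stdev


def array_summaries(matrix):
    """Return a (mean, standard deviation) pair for each array (column)."""
    columns = zip(*matrix)  # transpose: one tuple of values per array
    return [(mean(col), stdev(col)) for col in columns]


# Three genes (rows) measured on two arrays (columns).
expression = [
    [1.0, 2.0],
    [3.0, 2.0],
    [5.0, 2.0],
]
for array_mean, array_sd in array_summaries(expression):
    print(f"{array_mean:.2f} +/- {array_sd:.2f}")
```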

 

 

 

 

Copy and paste an image into Powerpoint or Word

 

To copy an image to the Windows Clipboard for pasting into PowerPoint or Word, single-click on either one of the icons, and then click on the button on the upper right side of the image tab.   The selected image is now copied to the Windows Clipboard, and can be pasted directly into PowerPoint or Word.  In PowerPoint, here’s what we get:

 

 

Whereas, in Word we get:

 

 

 

 

Resizing an image

When viewing the image tab, you can resize any image by clicking on the “-” and “+” buttons.    Please note that resizing the image only changes the display and not the actual image file or what gets copied to the clipboard.   If we resize (reduce) the cluster image originally obtained, the following image is displayed:

 

 

 

 

 

Changing Colors

Let’s change the colors prior to another cluster run, since the data are still in memory.    

 

First, select the Edit Colors pull-down menu, then the “Publication” command, as shown below:

 

 

 

This will specify a white background with black fonts and lines, which is necessary for publications.   Next, let’s change the color of the expression gradient in the cluster image by clicking on the icon  in the Cluster group.  You will then notice the default cluster color shown in the treeview (below):

 

Rerun the cluster analysis by simply clicking on the  button.  After run-time, you will notice the new icons for this run, and that the color has changed.  

 

 

 

 

Now click on the icon, and you will notice the new colors in the cluster image, as illustrated below:

 

 

 

 

 

 

Screening Genes Prior to Analysis

Using Input Data with Present/Absent Call (or P-values)

If the input data contain present and absent calls or p-values along with expression values, then you may want to screen out genes with absent calls prior to an analysis.   The default is to use all genes that are input from the disk file, so be cautious about which genes are used for each analysis.  If you have present/absent calls in the input file, it is recommended that you screen for genes having a given percentage of present calls within each class prior to an analysis.   The need to screen genes for present calls is independent of the number of classes to which arrays were assigned.

 

When screening genes for present calls, ChipST2C always applies the specified percentage criterion to each class separately, in order to prevent an entire class of arrays from containing only absent calls.   Consider the case in which there are five classes of arrays and you screen for genes having 80% or more present calls.  If the percentage criterion were applied globally across all arrays, then by chance alone a gene could be selected even though one or more classes contained all absent calls, especially classes with a small number of arrays.   To prevent this from occurring, ChipST2C applies the percentage criterion for present calls within each class.  As another example, assume that you have completed an experiment with triplicate arrays at time points 0h, 24h, and 48h.   If the 80% criterion is applied for present calls within each treatment (time), then at least 2 arrays (0.8 x 3 = 2.4 arrays, rounded down to 2) must have present calls in each class (0h, 24h, and 48h).   Analogously, if only two arrays were used per class, then both arrays (0.8 x 2 = 1.6, rounded up to 2) must have present calls in each class (0h, 24h, and 48h).
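The per-class rule can be sketched in a few lines.  This is a reading of the two worked examples above (2.4 becomes 2 and 1.6 becomes 2, i.e. rounding to the nearest whole array), not ChipST2C's source code.  The function names are hypothetical, only “P”/“A” calls are handled, and the treatment of exact halves (rounded up here) is an assumption.

```python
import math


def required_present(n_arrays, min_fraction):
    """Present calls required in a class of n_arrays (nearest whole array)."""
    return math.floor(min_fraction * n_arrays + 0.5)


def passes_screen(calls_by_class, min_fraction):
    """True if every class meets the present-call criterion for one gene."""
    for calls in calls_by_class.values():
        present = sum(1 for call in calls if call == "P")
        if present < required_present(len(calls), min_fraction):
            return False
    return True


print(required_present(3, 0.8))  # 2
print(required_present(2, 0.8))  # 2

gene = {"0h": ["P", "P", "A"], "24h": ["P", "P", "P"], "48h": ["P", "A", "P"]}
print(passes_screen(gene, 0.8))  # True
```

With a criterion of 0, no present calls are required in any class, so every gene passes.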

 

In summary, if you do have present/absent calls or p-values in the input data, then be sure to screen for genes with a given percentage of present calls per class.   If not, then the following bias may result:

 

 

 

For hypothesis testing with 2-sample or k-sample testing, if you don’t want to determine summary statistics or standardize when there are present/absent calls in the data, then you can directly proceed to 2-sample or k-sample tests, since ChipST2C will automatically screen for genes having more than the given criterion for present calls (within each class) prior to the analysis.   Be sure to set the criterion to 0% if you want to use all genes.

 

Using Input Data without Present/Absent Call (or P-values)

If you don’t have present/absent calls in the input data, then any of the analyses may be used without bias occurring.  

 

The schematic below shows that there is no need to screen genes when only the expression values are in the input file.      

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


File Output

Default Output File Locations

All output files are written to the directory c:\Program Files\ChipST2C\ChipST2C\Output\ assuming ChipST2C was installed in the default directory.   The second “ChipST2C” exists in the directory tree because the first represents the company name, and the second represents the product name (as is usually done by most companies).

Non-Default Output File Locations

All output files are written to disk in a newly generated \Output subdirectory of the directory in which the input file resided.   Therefore, any directory where an input file is opened will have a new Output subdirectory appended to it.   
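As a small illustration of this rule, the sketch below derives the Output subdirectory from the location of an input file.  The function name and the example path are hypothetical; the code only computes the path and does not create the directory.

```python
from pathlib import PureWindowsPath


def output_dir_for(input_file):
    """Return the Output subdirectory next to the given input file."""
    return PureWindowsPath(input_file).parent / "Output"


# Hypothetical input file location.
print(output_dir_for(r"C:\data\run1\arrays.txt"))  # C:\data\run1\Output
```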

Lifetime of Output File Directory

The contents of the \Output subdirectory in the directory in which the input files reside will remain intact until they are deleted.  

Image Format

The format of all images can be specified prior to run time.   To change the default image format prior to an analysis, select the format from the drop down list provided in the “Data, transforms, output” group on the upper right side of the analysis tab.  

 

 The table below lists the possible image format options which can be specified prior to an analysis.   File sizes are listed for the default cluster image for 40 arrays and 120 genes.

 Image file format

 Size

 Remarks

Windows Metafiles (.wmf)

(default in ChipST2C)