Use the tar xvfz command to extract the contents of the archives in the links below, i.e., tar xvfz upstream.tar.gz.
The goal here is to produce a FASTA-formatted sequence file with the upstream regions of a set of genes for the purposes of motif-finding. The major complication that has to be considered in S. cerevisiae is the presence of repeated upstream regions. If two or more substantially repeated sequences are given as input to AlignACE, the motifs returned will nearly all be variants of whatever sequence peculiarities may be found in these repeats, which is usually not what is desired. The following descriptions of the files included in the download provide an outline of the method used to produce such clean upstream sequence files. This process is a replacement for the Smith-Waterman based purging algorithm that was integrated into an older version of AlignACE.
This program finds the best matching sites for a motif in a target
sequence. For consistency, it uses the same scoring mechanism that
AlignACE uses in its sampling phase. The number of sites returned may
be controlled by either the -s or -c options. With the -s option, you
may specify the number of top sites to be returned. With the -c
option, sites are returned if they score above a cutoff set at the
specified number of standard deviations below the mean score of the
aligned sites used to define the motif. It is assumed that the -s
option will be generally more useful. There are two primary sequence
targets that may be searched with this program. First, an entire
genome may be searched, and the best hits returned along with their
genomic context. This requires appropriately formatted support files
such as those provided for S. cerevisiae in the above link,
which may be specified with the -y option. For example,
ScanACE -i input.ace -n 3 -s 1000 -g 0.38 -y
creates a file named input_n3.scn which contains the positions and information on the nearest neighboring genomic features for the top 1000 sites in the genome.
Alternatively, any FASTA-formatted sequence file may be searched using
the -x option. In this case, the source sequence name, position,
strand, score, and sequence are returned for each site. The first word
in the title of each sequence in the FASTA file must be unique, since it
is used as the sequence name in the output. For example,
ScanACE -i input.ace -n 3 -s 1000 -g 0.38 -x -z all_orfs.seq
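Because the first word of each title is used as the sequence name, it can help to check a FASTA file before searching it. The following Python sketch is not part of ScanACE; the function name check_fasta and the sample records are hypothetical. It reports names that occur more than once and records that contain no sequence.

```python
from collections import Counter

def check_fasta(lines):
    """Return (duplicate_names, empty_names) for FASTA text given as lines.

    The name of a record is taken to be the first word of its title line.
    """
    records = []  # one [name, sequence_length] entry per record, in file order
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            words = line[1:].split()
            records.append([words[0] if words else "", 0])
        elif line and records:
            records[-1][1] += len(line)
    counts = Counter(name for name, _ in records)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    empty = [name for name, length in records if length == 0]
    return duplicates, empty

# Hypothetical input: one repeated name and one record without sequence.
example = ">YAL001C upstream\nACGT\n>YAL002W\n\n>YAL001C copy\nGG\n"
duplicates, empty = check_fasta(example.splitlines())
print("duplicate names:", duplicates)
print("empty records:", empty)
```

To check a real file, pass an open file handle instead of the example lines, e.g. check_fasta(open("all_orfs.seq")).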
NOTE 06Feb21: The hotlink above will now download a new version of ScanACE source code that avoids compiler errors sometimes encountered with the old version. The old version, last modified in 2003, may be downloaded here: ScanACE_2003. Our thanks go to Bertrand Huber and Martha Bulyk for reporting the error and supplying the corrected code.
NOTE 06May10: After the 06Feb21 version of ScanACE was put up on this site, incidents were reported in which ScanACE output was incorrect -- especially cases in which ScanACE's reports of sites that it found were corrupted or where ScanACE attributed sites to the wrong sequences in the input file. The problem was tracked to code in ScanACE that incorrectly processed FASTA records that were empty (i.e., contained no sequence). This problem was fixed, and further code changes were made that removed compiler errors on systems beyond those covered on 06Feb21. The ScanACE link above now downloads this most recent version. The 06Feb21 version may be downloaded here: ScanACE_06Feb21.
A note posted on this site on 06May09 alerted users to reports about the problem above and suggested that compiler warnings that remained in the 06Feb21 version may have been related to it. We have found that with the fixes on 06Feb21 and 06May10, there should not be any compiler errors or warnings for gcc version 2.96 20000731 (Red Hat Linux 7.2 2.96-118.7.2) and gcc version 3.3.5 (Debian 1:3.3.5-13). For Windows Visual C++ 6.0 there are a large number of compiler warnings but the program will compile. The new version was tested on all three systems and ran without error on a test file that contained a blank sequence line. Gary Gao, Zhou Zhu, and John Aach collaborated on resolving the problem. Please feel free to send reports of any additional problems here.
This program performs a pairwise comparison between two motifs and returns a value between -1.0 and 1.0 for the best possible alignment, with a perfect match scoring 1.0. This value corresponds to the Pearson correlation coefficient between the base frequencies of the positions in the aligned portion of the motifs. To prevent spurious matches, it is required that the aligned portion include at least the six most informative positions in each motif.
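The scoring idea can be sketched in Python as follows. This is an illustration only, not the CompareACE source: motifs are assumed to be lists of per-position base frequencies in A, C, G, T order, the requirement about the six most informative positions is simplified to a minimum overlap of six positions (the hypothetical min_overlap parameter), and reverse-complement alignments are not considered.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length number lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined for a constant list
    return cov / math.sqrt(vx * vy)

def best_alignment_score(m1, m2, min_overlap=6):
    """Best Pearson correlation over all ungapped alignments of m1 and m2.

    Returns None if no alignment overlaps by at least min_overlap positions.
    """
    best = None
    for offset in range(-(len(m2) - min_overlap), len(m1) - min_overlap + 1):
        # Position i of m1 is paired with position i - offset of m2.
        pairs = [(m1[i], m2[i - offset]) for i in range(len(m1))
                 if 0 <= i - offset < len(m2)]
        if len(pairs) < min_overlap:
            continue
        xs = [f for col, _ in pairs for f in col]
        ys = [f for _, col in pairs for f in col]
        score = pearson(xs, ys)
        if best is None or score > best:
            best = score
    return best

# Hypothetical motif: six positions of base frequencies (A, C, G, T).
motif = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0],
         [0, 0, 0, 1], [0.5, 0.5, 0, 0], [0.25, 0.25, 0.25, 0.25]]
shifted = [[0.25, 0.25, 0.25, 0.25]] + motif
print(best_alignment_score(motif, shifted))
```

Flattening the aligned columns into two long frequency vectors is one reading of "the base frequencies of the positions in the aligned portion"; the program itself may combine positions differently.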
This download includes three programs for clustering a large set of motifs. This will only be useful to someone who has done extensive AlignACE runs and needs to simplify the output for further analysis. Accordingly, any such user will be considered an expert, and no support is offered for the programs provided here.
The clustering program itself is called Tree. It takes as input a file generated by CompareACE containing all pairwise similarity scores for a large set of motifs. It performs hierarchical clustering in which the cluster-cluster score is taken to be the average of all pairwise scores. Its output is interpreted with the Perl script list_clusters.pl, which allows the user to choose the score cutoff for the clusters. Also included is cluster_sort.pl, which might be useful in sorting the resulting clusters.
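As a rough illustration of average-linkage clustering with a score cutoff, the Python sketch below merges the two most similar clusters until the best cluster-cluster score falls below the cutoff. It is not the Tree or list_clusters.pl code: the function name, the dictionary of pairwise scores, and the choice to stop at the cutoff instead of building the full tree first are all assumptions made for this example.

```python
from itertools import combinations

def average_linkage_clusters(names, scores, cutoff):
    """Cluster names by average pairwise similarity, stopping at cutoff.

    scores maps (name_a, name_b) to a similarity; either key order works.
    """
    def sim(a, b):
        return scores[(a, b)] if (a, b) in scores else scores[(b, a)]

    def cluster_score(c1, c2):
        # Average of all pairwise scores between members of the two clusters.
        return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [[name] for name in names]
    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: cluster_score(clusters[p[0]], clusters[p[1]]))
        if cluster_score(clusters[i], clusters[j]) < cutoff:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Hypothetical pairwise scores for four motifs.
names = ["m1", "m2", "m3", "m4"]
scores = {("m1", "m2"): 0.9, ("m3", "m4"): 0.8,
          ("m1", "m3"): 0.1, ("m1", "m4"): 0.1,
          ("m2", "m3"): 0.1, ("m2", "m4"): 0.1}
print(average_linkage_clusters(names, scores, cutoff=0.7))
```

This naive version recomputes every cluster-cluster average at each step, which is adequate for a small example but slow for a large motif set.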
This program returns group specificity and positional bias scores for all motifs resulting from a given AlignACE run. It assumes that ScanACE has been run and that all associated files are locally available. The method of computation of these statistics depends on whether ScanACE was run with genomic features turned on or off (they are off when option -x is used). If genomic features are being returned, then group specificity is calculated between the ORFs listed in the AlignACE file header and the ORFs listed in each corresponding ScanACE file with a site between 100 and 500 bp upstream of the start site (positions -100 to -500). This corresponds to the computations performed in the March 2000 JMB paper for S. cerevisiae. More flexible options may be offered in the future. Positional bias calculations are made using all sites listed in each ScanACE file, considering a range of 600 bp and a bin size of 50 bp, as in the JMB paper.
If genomic features are being ignored (option -x), then group specificity involves a comparison of the list of sequence names in the AlignACE file to the list of names in each ScanACE file, considering a sample space of possible names as listed in the file used to generate the ScanACE file (-z option). Positional bias is calculated assuming that the sequences are anchored at the translational start and extend upstream. Only the initial 600 bp upstream is considered (more flexibility possible in the future). This method of positional bias is not as useful for S. cerevisiae using the files provided here since so many of the sequences are divergent promoters and can't be anchored uniquely to a single translational start site. Group specificity as computed with option -x might also be less appropriate for S. cerevisiae since a motif site in a divergent promoter is considered as a single sequence hit as opposed to a putative control site for two different ORFs. For organisms with better separation between ORF/promoter units (human perhaps), the calculations as done here may be more appropriate.
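The name-list comparison described above can be illustrated with a hypergeometric tail probability: given a sample space of N names, how likely is it that two lists of the observed sizes share at least as many names as they do by chance? The Python sketch below assumes that this is the statistic intended; it is not taken from the MotifStats source, and the function and argument names are hypothetical.

```python
from math import comb

def group_specificity(input_names, hit_names, all_names):
    """Probability of an overlap at least as large as observed, by chance.

    input_names: sequence names from the AlignACE file
    hit_names:   sequence names with a site in the ScanACE file
    all_names:   the sample space of possible names
    """
    universe = set(all_names)
    group = set(input_names) & universe
    hits = set(hit_names) & universe
    n_total, n_group, n_hits = len(universe), len(group), len(hits)
    overlap = len(group & hits)
    # Sum the hypergeometric probabilities of overlaps >= the observed one.
    tail = sum(comb(n_group, i) * comb(n_total - n_group, n_hits - i)
               for i in range(overlap, min(n_group, n_hits) + 1))
    return tail / comb(n_total, n_hits)

# Hypothetical example: 10 possible names, both lists are the same 3 names.
all_names = ["seq%d" % i for i in range(10)]
print(group_specificity(all_names[:3], all_names[:3], all_names))
```

Smaller values indicate that the motif's sites are more specific to the input group. Names outside the sample space are ignored in this sketch.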
NOTE 06Feb21: The hotlink above will now download a new version of MotifStats source code that avoids compiler errors sometimes encountered with the old version. The old version may be downloaded here. Our thanks go to Bertrand Huber and Martha Bulyk for reporting the error and supplying the corrected code.