Count composition of n-mer words in a sequence with p-value estimation

A simple online service that counts the number of distinct n-mers in a sequence. It works much like the compseq program from EMBOSS, but also estimates the significance of the difference between observed and expected frequencies.

 

Input 

Set word size (default 3).

Use "beg.index" and "end index" to count words within a window. Indexes are 1-based.
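As a rough illustration of windowed counting, here is a minimal Python sketch. The function name count_words and its parameters are hypothetical and not part of the service. It also assumes that words are counted in overlapping fashion and that the end index is inclusive; neither detail is stated above.

```python
from collections import Counter

def count_words(seq, word_size=3, beg=None, end=None):
    """Count overlapping words of length word_size in seq.

    beg and end are 1-based; end is treated as inclusive (an assumption).
    When omitted, the whole sequence is used.
    """
    seq = seq.upper()
    start = 0 if beg is None else beg - 1   # convert 1-based to 0-based
    stop = len(seq) if end is None else end # inclusive end -> exclusive slice bound
    window = seq[start:stop]
    return Counter(
        window[i:i + word_size]
        for i in range(len(window) - word_size + 1)
    )

# Window 2..7 of ACGTACGT is CGTACG
counts = count_words("ACGTACGT", word_size=3, beg=2, end=7)
for word, n in sorted(counts.items()):
    print(word, n)
```

With these assumptions, the window 2..7 of ACGTACGT is CGTACG, which contains the trimers CGT, GTA, TAC and ACG once each.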

The script can handle several sequences in FASTA format. Word frequencies are summed over all sequences. When counting in a window, the estimation is done for each sequence separately.
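The handling of several FASTA records could look roughly like the sketch below. It is only an illustration: read_fasta and word_counts are hypothetical helper names, and pooling the counts by simple addition is an assumption about how the frequencies are combined.

```python
from collections import Counter

def read_fasta(text):
    """Parse FASTA text into a dict {header: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}

def word_counts(seq, word_size):
    """Count overlapping words of the given size (overlap is an assumption)."""
    return Counter(seq[i:i + word_size] for i in range(len(seq) - word_size + 1))

fasta = ">s1\nACGT\nAC\n>s2\nACAC\n"
records = read_fasta(fasta)

# Pool the counts of all sequences into one table
pooled = Counter()
for seq in records.values():
    pooled.update(word_counts(seq, 2))

# Keep one table per sequence, as would be needed for per-sequence estimation
per_sequence = {name: word_counts(seq, 2) for name, seq in records.items()}
```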

Frequency of each nucleotide (optional). The default frequency of each nucleotide is 0.25, so the 'Expected' frequency of any dimer is 1/16, of any trimer 1/64, and so on. However, you can estimate the frequency of each nucleotide by setting word size = 1 and then replacing the expected frequencies with the observed ones.
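The arithmetic behind the expected frequencies can be sketched as follows. This assumes that the expected frequency of a word is the product of the frequencies of its nucleotides (positions treated as independent), which is consistent with the 1/16 and 1/64 values above. The function names are hypothetical.

```python
from math import prod

def expected_frequency(word, base_freqs=None):
    """Expected frequency of a word as the product of its nucleotide frequencies.

    base_freqs maps each nucleotide to its frequency; the default is 0.25 each.
    """
    if base_freqs is None:
        base_freqs = {b: 0.25 for b in "ACGT"}
    return prod(base_freqs[b] for b in word.upper())

def observed_base_freqs(seq):
    """Observed single-nucleotide frequencies (the word size = 1 case)."""
    seq = seq.upper()
    total = sum(seq.count(b) for b in "ACGT")
    return {b: seq.count(b) / total for b in "ACGT"}

print(expected_frequency("AC"))   # dimer with default frequencies
print(expected_frequency("ACG"))  # trimer with default frequencies

# Replace the defaults with frequencies observed in a sequence
freqs = observed_base_freqs("AACG")
print(expected_frequency("AA", freqs))
```

Multiplying the expected frequency by the number of words counted gives the expected count under the same assumption.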

 

Output

For each n-mer, the observed and expected counts and frequencies are reported.

A p-value for the difference between observed and expected is provided. If the p-value is < 0.05, the program additionally marks the n-mer as significantly overrepresented ('+') or significantly underrepresented ('-'). For p-values < 0.01 two signs are added, and for p-values < 0.001 three signs are added.
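The marking rule can be written down as a small Python sketch. The function name significance_mark is hypothetical, and the p-value is taken as given, since the statistical test used by the service is not specified here. Choosing the sign by comparing observed with expected is an assumption.

```python
def significance_mark(p_value, observed, expected):
    """Return '+', '++', '+++', '-', '--', '---' or '' for one n-mer.

    The number of signs follows the thresholds 0.05, 0.01 and 0.001.
    The sign is '+' when observed exceeds expected, otherwise '-'.
    """
    if p_value < 0.001:
        n_signs = 3
    elif p_value < 0.01:
        n_signs = 2
    elif p_value < 0.05:
        n_signs = 1
    else:
        return ""  # not significant, no mark
    sign = "+" if observed > expected else "-"
    return sign * n_signs

print(significance_mark(0.03, observed=10, expected=5))
print(significance_mark(0.005, observed=2, expected=8))
print(significance_mark(0.2, observed=6, expected=5))
```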

 http://services.bioinformatics.ru/nmercount.htm

 


Extremely long sequences are not supported due to server limitations. To process such data, please contact the administrator directly.