Distribution at Contingency of Alignment of Two Literal Sequences under Constrains


  • Lorentz Jäntschi Technical University of Cluj-Napoca
  • Sorana D. Bolboaca* Department of Medical Informatics and Biostatistics "Iuliu Hatieganu" University of Medicine and Pharmacy Cluj-Napoca




Alignment of DNA (deoxyribonucleic acid), RNA (ribonucleic acid), or protein (amino-acid) sequences to identify similar regions that could reflect functional, structural, or evolutionary relationships between sequences [1] is frequently used nowadays due to the huge number of already identified DNA, RNA, or protein sequences [2]. Several algorithms have been developed and implemented for global or local alignment, each with advantages and disadvantages [3] and [4]. Our research started from the hypothesis that the distribution of alignments could provide useful information about the probability that a certain alignment occurs by chance. We present here a statistical approach based on distribution analysis that is able to identify the thresholds for rejecting an alignment as occurring by chance, under the assumption that each literal has at least one alignment in any case. For two literal sequences, we define the alignment through the frequency of matches (with 0 meaning no alignment and n meaning perfect alignment, where n is the number of nucleotides or amino acids in the two equal-length sequences). A closed form of the probability distribution function of the alignment was obtained. We proved that the cumulative distribution function has (unfortunately) no general closed form. Nevertheless, a series of statistics (including the mode and the central moments up to order 4) were obtained in closed form. By using the formula for the cumulative probability of an alignment, for the particular case of four-literal alignment, thresholds to reject the alignment by chance were obtained as follows: 70% for n > 8; 60% for n > 13; 55% for n > 21; 50% for n > 39; 45% for n > 282; 44% for n → ∞.






Conference Contributions