On Sun, 4 Dec 2005, Srinivas Iyyer wrote:

> Contr1        SPR-10  SPR-101 SPR-125 SPR-137 SPR-139 SPR-143
> contr2        SPR-1   SPR-15  SPR-126 SPR-128 SPR-141 SPR-148
> contr3        SPR-106 SPR-130 SPR-135 SPR-138 SPR-139 SPR-145
> contr4        SPR-124 SPR-125 SPR-130 SPR-139 SPR-144 SPR-148

Hi Srinivas,

I'd strongly recommend changing the data representation from a
line-oriented to a more structured view.  Each line in your data above
appears to describe a conceptual set of tuples:

    (control_number, spr_number)

For example, we can think of the line:

    Contr1      SPR-10  SPR-101 SPR-125 SPR-137 SPR-139 SPR-143

as an encoding for the set of tuples written below.  (The notation is
mathematical, and is not meant to be read as Python.)

    { (Contr1, SPR-10),
      (Contr1, SPR-101),
      (Contr1, SPR-125),
      (Contr1, SPR-137),
      (Contr1, SPR-139),
      (Contr1, SPR-143) }
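
If it helps to see that decoding step in Python, here is a minimal sketch.
The function name line_to_tuples is made up for illustration, and it
assumes the fields on each line are separated by whitespace, with the
control name first:

```python
def line_to_tuples(line):
    """Turn one whitespace-separated line into a set of
    (control_name, spr_name) tuples."""
    fields = line.split()
    if not fields:
        # A blank line encodes no tuples at all.
        return set()
    control, sprs = fields[0], fields[1:]
    return {(control, spr) for spr in sprs}


pairs = line_to_tuples(
    "Contr1  SPR-10  SPR-101 SPR-125 SPR-137 SPR-139 SPR-143")
```

A set is used so that a repeated SPR name on one line would collapse into
a single tuple, matching the mathematical notation above.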

I'm not sure if I'm seeing everything, but from what I can tell so far,
your data cries out to be held in a relational database.  I agree with
Kent: you do not need to "align" anything.  If each element can appear at
most once within a sequence, then your "alignment" problem turns into a
simpler table lookup problem.


That is, if all your data looks like:

    1: A B D E
    2: A C F
    3: A B C D

where no line can have repeated characters, then that data can be
transformed into a simple tabular representation, conceptually as:


        A   B   C   D   E   F
    1 | x | x |   | x | x |   |
    2 | x |   | x |   |   | x |
    3 | x | x | x | x |   |   |


So unless there's something here that you're not telling us, there's no
need for any complicated alignment algorithms: we just start off with an
empty table, and then for each tuple, mark the corresponding entry in
the table.

Then when we need to look for common elements, we just scan across a row
or column of the table.  BLAST is cool, but, like regular expressions,
it's not the answer to every string problem.
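
As a rough sketch of the table idea in plain Python, assuming the small
A-F example above: the "table" here is just two dictionaries of sets, one
indexed by row and one by column.  The names build_table, by_row, and
by_col are invented for this illustration.

```python
from collections import defaultdict


def build_table(tuples):
    """Mark an entry for every (row, column) tuple.

    Returns two views of the same table: row -> set of columns,
    and column -> set of rows.
    """
    by_row = defaultdict(set)
    by_col = defaultdict(set)
    for row, col in tuples:
        by_row[row].add(col)
        by_col[col].add(row)
    return by_row, by_col


# The example data: line number -> elements on that line.
data = {1: "ABDE", 2: "ACF", 3: "ABCD"}
tuples = [(row, ch) for row, chars in data.items() for ch in chars]
by_row, by_col = build_table(tuples)

# Scan down a column: which lines contain "B"?
lines_with_b = sorted(by_col["B"])

# Compare two rows: which elements do lines 1 and 3 share?
common_1_3 = sorted(by_row[1] & by_row[3])

# Elements present on every line.
in_all = sorted(set.intersection(*by_row.values()))
```

Finding common elements is then a set intersection, with no alignment
step involved.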


If you want to implement code to do the above, it's not difficult, but you
really should use an SQL database to do this.  As a bioinformatician, it
would be in your best interest to know SQL, because otherwise, you'll end
up trying to reinvent tools that have already been written for you.
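
To give a flavor of the SQL approach, here is a small sketch using the
sqlite3 module and an in-memory database.  The table name control_spr and
its column names are assumptions made up for this example, not anything
from your data; any SQL database would do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per (control, spr) tuple; the UNIQUE constraint encodes the
# assumption that an SPR appears at most once per control.
conn.execute(
    "CREATE TABLE control_spr ("
    "  control TEXT, spr TEXT, UNIQUE (control, spr))")

lines = [
    "Contr1 SPR-10  SPR-101 SPR-125 SPR-137 SPR-139 SPR-143",
    "contr2 SPR-1   SPR-15  SPR-126 SPR-128 SPR-141 SPR-148",
    "contr3 SPR-106 SPR-130 SPR-135 SPR-138 SPR-139 SPR-145",
    "contr4 SPR-124 SPR-125 SPR-130 SPR-139 SPR-144 SPR-148",
]
for line in lines:
    fields = line.split()
    conn.executemany(
        "INSERT INTO control_spr VALUES (?, ?)",
        [(fields[0], spr) for spr in fields[1:]])

# Which SPRs show up under more than one control, and how often?
shared = conn.execute(
    "SELECT spr, COUNT(*) FROM control_spr "
    "GROUP BY spr HAVING COUNT(*) > 1 ORDER BY spr").fetchall()

# Which controls contain SPR-139?
with_139 = [row[0] for row in conn.execute(
    "SELECT control FROM control_spr WHERE spr = ? ORDER BY control",
    ("SPR-139",))]
```

The GROUP BY query does the "scan down a column" work for every SPR at
once, which is the kind of thing you would otherwise end up writing by
hand.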

A good book on introductory relational database usage is "The Practical
SQL Handbook: Using Structured Query Language" by Judith Bowman, Sandra
Emerson, and Marcy Darnovsky.


Good luck to you.

_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor