Thanks to all. I need to start from the very beginning of the theory; you are all speaking Arabic to me :). In other words, I need a less theoretical approach: some real code to get my hands on. Excuse the blunt request, but I need an algorithm that is quick to implement and understand, and that I can use in a real-world scenario, possibly right now ;). Alternatively, I need a basic textbook to start reading so that I can get to the point of understanding what you are saying.
Thanks again

On Tue, Jul 12, 2011 at 12:33 AM, Ted Dunning <[email protected]> wrote:

> Easier to simply index all, say, three word phrases and use a TF-IDF score.
> This will give you a good proxy for sequence similarity. Documents should
> either be chopped on paragraph boundaries to have a roughly constant length
> or the score should not be normalized by document length.
>
> Log likelihood ratio (LLR) test can be useful to extract good query features
> from the subject document. TF-IDF score is a reasonable proxy for this
> although it does lead to some problems. The reason TF-IDF works as a query
> term selection method and why it fails can be seen from the fact that TF-IDF
> is very close to one of the most important terms in the LLR score.
>
> On Mon, Jul 11, 2011 at 2:52 PM, Andrew Clegg <[email protected]> wrote:
>
> > On 11 July 2011 08:19, Dawid Weiss <[email protected]> wrote:
> > > - bioinformatics, in particular gene sequencing to detect long
> > > near-matching sequences (a variation of the above, I'm not familiar
> > > with any particular algorithms, but I imagine this is a well explored
> > > space
> >
> > The classic is Smith & Waterman:
> >
> > http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm
> >
> > This approach been used in general text processing tasks too, e.g.:
> >
> > http://compbio.ucdenver.edu/Hunter_lab/Cohen/usingBLASTforIdentifyingGeneAndProteinNames.pdf
> >
> > > given the funds they receive ;),
> >
> > Hah! Less so these days I'm afraid :-)
> >
> > Andrew.
> >
> > --
> >
> > http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
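
For anyone who wants something concrete to try, the three-word-phrase suggestion quoted above can be sketched in a few lines of plain Python. This is only a minimal illustration under my own assumptions, not the implementation anyone in the thread had in mind: the function names (`shingles`, `build_index`, `score`, `rank`), the smoothed IDF formula, and the "sum of IDF-weighted shared phrases" score are all choices made for this example. The score is deliberately not normalized by document length, so the documents are assumed to be roughly paragraph-sized, as the quoted advice recommends.

```python
import math
import re
from collections import Counter


def shingles(text, n=3):
    """Return the list of n-word phrases (word n-grams) found in text."""
    words = re.findall(r"\w+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_index(docs, n=3):
    """Count phrases per document and compute a smoothed IDF per phrase."""
    counts = [Counter(shingles(d, n)) for d in docs]
    df = Counter()
    for c in counts:
        df.update(c.keys())  # document frequency: one vote per document
    total = len(docs)
    # Smoothed IDF (an assumption of this sketch): rare phrases weigh more.
    idf = {p: math.log((1 + total) / (1 + k)) + 1.0 for p, k in df.items()}
    return counts, idf


def score(query_counts, doc_counts, idf):
    """Sum TF * IDF over phrases shared by query and document.

    Not normalized by document length, so documents should be
    chopped to a roughly constant size (e.g. paragraphs).
    """
    return sum(doc_counts[p] * idf[p] for p in query_counts if p in doc_counts)


def rank(query, docs, n=3):
    """Return (score, document index) pairs, best match first."""
    counts, idf = build_index(docs, n)
    q = Counter(shingles(query, n))
    scored = [(score(q, c, idf), i) for i, c in enumerate(counts)]
    return sorted(scored, key=lambda pair: -pair[0])


if __name__ == "__main__":
    paragraphs = [
        "the quick brown fox jumps over the lazy dog",
        "a completely different sentence about cooking pasta tonight",
        "yesterday the quick brown fox jumps over a fence",
    ]
    for s, i in rank("quick brown fox jumps", paragraphs):
        print(round(s, 3), paragraphs[i])
```

Paragraphs that share three-word phrases with the query get a positive score, and paragraphs that share none get zero. For a real collection, an inverted index (for example a search engine configured to index word n-grams) would replace the in-memory loop used here.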
