Hi Luca,

again, I have to emphasize: read what I gave you. The algorithm in my link is explained for non-scientists, and if you download Solr you will find the class there, so you can look at how they implemented that algorithm.
Anything easier would mean that someone else is writing the code for you ;).

Regards,
Em

On 12.07.2011 09:58, Luca Natti wrote:
> Thanks to all ,
>
> i need to start from the beginning theory ,
> you are speaking arab :) to me, or in other words i need
> a less theoretical approach, or in other words some real code to put my
> hands on.
> Excuse this raw approach but i need a real fast to implement and understand
> algorithm
> to use in real world scenario possibly now ;) .
> Alternatively i need a basic text(book) to start reading and arrive to
> understand what you are saying.
>
> thanks again
>
> On Tue, Jul 12, 2011 at 12:33 AM, Ted Dunning <[email protected]> wrote:
>
>> Easier to simply index all, say, three word phrases and use a TF-IDF score.
>> This will give you a good proxy for sequence similarity. Documents should
>> either be chopped on paragraph boundaries to have a roughly constant length
>> or the score should not be normalized by document length.
>>
>> Log likelihood ratio (LLR) test can be useful to extract good query
>> features from the subject document. TF-IDF score is a reasonable proxy
>> for this although it does lead to some problems. The reason TF-IDF works
>> as a query term selection method and why it fails can be seen from the
>> fact that TF-IDF is very close to one of the most important terms in the
>> LLR score.
>>
>> On Mon, Jul 11, 2011 at 2:52 PM, Andrew Clegg <[email protected]> wrote:
>>
>>> On 11 July 2011 08:19, Dawid Weiss <[email protected]> wrote:
>>>> - bioinformatics, in particular gene sequencing to detect long
>>>> near-matching sequences (a variation of the above, I'm not familiar
>>>> with any particular algorithms, but I imagine this is a well explored
>>>> space
>>>
>>> The classic is Smith & Waterman:
>>>
>>> http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm
>>>
>>> This approach has been used in general text processing tasks too, e.g.:
>>>
>>> http://compbio.ucdenver.edu/Hunter_lab/Cohen/usingBLASTforIdentifyingGeneAndProteinNames.pdf
>>>
>>>> given the funds they receive ;),
>>>
>>> Hah! Less so these days I'm afraid :-)
>>>
>>> Andrew.
>>>
>>> --
>>>
>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
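
As something to put your hands on, here is a minimal sketch of what Ted describes in the quoted thread: index every three-word phrase and score documents with TF-IDF, without normalizing by document length. It is plain Python, not the Solr/Lucene implementation. The function names (shingles, build_index, score), the sample documents and the smoothed IDF formula are assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def shingles(text, n=3):
    """Return the list of n-word phrases (word n-grams) found in text."""
    words = re.findall(r"\w+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_index(docs, n=3):
    """Count phrase frequencies per document and document frequencies overall."""
    tfs = [Counter(shingles(d, n)) for d in docs]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())  # each document counts once per distinct phrase
    return tfs, df


def score(query, tfs, df, n=3):
    """Sum tf * idf over the query's phrases; no length normalization."""
    num_docs = len(tfs)
    query_phrases = set(shingles(query, n))
    scores = []
    for tf in tfs:
        s = 0.0
        for phrase in query_phrases:
            if phrase in tf:
                # Smoothed IDF: an illustrative choice, not Lucene's formula.
                idf = math.log((1 + num_docs) / (1 + df[phrase])) + 1.0
                s += tf[phrase] * idf
        scores.append(s)
    return scores


# Hypothetical sample data, roughly equal in length as Ted recommends.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "a quick brown fox leaps over a sleepy cat",
    "completely unrelated text about search engines",
]
tfs, df = build_index(docs)
scores = score("quick brown fox jumps over the fence", tfs, df)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

The first document shares four three-word phrases with the query, the second shares one and the third none, so the first ranks highest. For real documents you would chop them on paragraph boundaries first, as Ted suggests, so that long documents do not win simply by containing more phrases.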
