It is easier to simply index all, say, three-word phrases and use a TF-IDF score. This will give you a good proxy for sequence similarity. Documents should either be chopped on paragraph boundaries so that they have roughly constant length, or the score should not be normalized by document length.
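A minimal sketch of that idea in plain Python, assuming whitespace/word-character tokenization and a small in-memory corpus. The function names (`word_trigrams`, `build_index`, `score`) are made up for illustration; a real setup would more likely use a search engine's shingle support. The score is intentionally not divided by document length, which assumes the documents were already chopped to similar sizes.

```python
import math
import re
from collections import Counter


def word_trigrams(text):
    """Return the list of three-word phrases (shingles) in text."""
    words = re.findall(r"\w+", text.lower())
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]


def build_index(docs):
    """Count trigrams per document and document frequency per trigram."""
    counts = [Counter(word_trigrams(d)) for d in docs]
    df = Counter()
    for c in counts:
        df.update(c.keys())  # each document contributes at most 1 per trigram
    return counts, df


def score(query, counts, df):
    """Sum tf * idf over the trigrams each document shares with the query.

    No length normalization is applied (an assumption that documents are
    of roughly constant length).
    """
    n = len(counts)
    query_grams = set(word_trigrams(query))
    scores = []
    for c in counts:
        s = sum(c[t] * math.log(n / df[t]) for t in query_grams if t in c)
        scores.append(s)
    return scores
```

Usage would look something like `counts, df = build_index(paragraphs)` followed by `score(subject_text, counts, df)`, then sorting documents by the returned scores.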
A log-likelihood ratio (LLR) test can be useful for extracting good query features from the subject document. A TF-IDF score is a reasonable proxy for this, although it does lead to some problems. Both the reason TF-IDF works as a query term selection method and the reason it fails can be seen from the fact that TF-IDF is very close to one of the most important terms in the LLR score.

On Mon, Jul 11, 2011 at 2:52 PM, Andrew Clegg <[email protected]> wrote:

> On 11 July 2011 08:19, Dawid Weiss <[email protected]> wrote:
>
> > - bioinformatics, in particular gene sequencing to detect long
> > near-matching sequences (a variation of the above, I'm not familiar
> > with any particular algorithms, but I imagine this is a well explored
> > space
>
> The classic is Smith & Waterman:
>
> http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm
>
> This approach has been used in general text processing tasks too, e.g.:
>
> http://compbio.ucdenver.edu/Hunter_lab/Cohen/usingBLASTforIdentifyingGeneAndProteinNames.pdf
>
> > given the funds they receive ;),
>
> Hah! Less so these days I'm afraid :-)
>
> Andrew.
>
> --
>
> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
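To make the LLR remark at the top of this message more concrete, here is a rough sketch of the usual 2x2 contingency-table form of the test (the G² statistic), used to pick query phrases from a subject document. The table layout, the function names (`xlogx`, `llr`, `query_terms`), and the filter that keeps only over-represented phrases are assumptions for illustration, not something spelled out above. The cell term `k11 * log(k11 * N / (row * col))` is the piece that looks like a term count multiplied by a log inverse frequency, which is one way to read the TF-IDF comparison.

```python
import math


def xlogx(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 table of counts.

    k11: occurrences of the phrase in the subject document
    k12: occurrences of the phrase in the rest of the corpus
    k21: occurrences of all other phrases in the subject document
    k22: occurrences of all other phrases in the rest of the corpus
    """
    n = k11 + k12 + k21 + k22
    rows = xlogx(k11 + k12) + xlogx(k21 + k22)
    cols = xlogx(k11 + k21) + xlogx(k12 + k22)
    cells = xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
    # Clamp at zero to hide tiny negative values from rounding.
    return max(0.0, 2.0 * (cells + xlogx(n) - rows - cols))


def query_terms(doc_counts, rest_counts, top=10):
    """Return the phrases most over-represented in the subject document."""
    doc_total = sum(doc_counts.values())
    rest_total = sum(rest_counts.values())
    scored = []
    for term, k11 in doc_counts.items():
        k12 = rest_counts.get(term, 0)
        # LLR is two-sided, so keep only phrases that are relatively more
        # frequent in the subject document than in the rest of the corpus.
        if k11 * rest_total > k12 * doc_total:
            scored.append((llr(k11, k12, doc_total - k11, rest_total - k12), term))
    scored.sort(reverse=True)
    return [term for _, term in scored[:top]]
```

Here `doc_counts` and `rest_counts` are assumed to be plain dicts mapping a phrase to its count in the subject document and in everything else, respectively.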
