It is easier to simply index all, say, three-word phrases and use a TF-IDF
score.  This gives you a good proxy for sequence similarity.  Documents should
either be chopped on paragraph boundaries so they have roughly constant length,
or the score should not be normalized by document length.
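
As a rough illustration of the phrase-indexing idea, here is a minimal sketch
in Python.  The function names, the whitespace tokenizer and the smoothed IDF
formula are all assumptions made for the example, not anything a particular
search library prescribes.  The score is deliberately left unnormalized by
document length, so it assumes documents have already been cut to similar
sizes.

```python
import math
from collections import Counter

def shingles(text, n=3):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def build_index(docs, n=3):
    # docs maps a document id to its text.
    # tf: per-document phrase counts; df: number of documents per phrase.
    tf = {doc_id: Counter(shingles(text, n)) for doc_id, text in docs.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())
    return tf, df

def score(query_text, tf, df, n=3):
    # Sum of tf * idf over the phrases shared with the query.
    # No division by document length (see the paragraph above).
    num_docs = len(tf)
    query = set(shingles(query_text, n))
    scores = {}
    for doc_id, counts in tf.items():
        total = 0.0
        for phrase in query:
            if phrase in counts:
                # Smoothed IDF, chosen for the sketch so it is never negative.
                idf = math.log((num_docs + 1) / (df[phrase] + 1)) + 1.0
                total += counts[phrase] * idf
        scores[doc_id] = total
    return scores
```

Calling build_index on a dict of paragraph-sized chunks and then score with
the subject text gives one number per chunk; higher means more shared
three-word phrases, weighted toward the rarer ones.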

A log-likelihood ratio (LLR) test can be useful for extracting good query
features from the subject document.  A TF-IDF score is a reasonable proxy for
this, although it does lead to some problems.  Both the reason TF-IDF works as
a query term selection method and the reason it fails can be seen from the
fact that TF-IDF is very close to one of the most important terms in the LLR
score.
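
To make the LLR idea concrete, here is a small sketch in Python of the G^2
statistic on a 2x2 contingency table, plus a hypothetical helper that ranks
the terms of a subject document against the rest of the corpus.  The table
layout (k11 = count of the term in the subject document, k12 = count of all
other terms there, k21 and k22 = the same two counts for the rest of the
corpus) and the names llr and query_terms are assumptions of this example.
Under that layout the k11 contribution is k11 * log((k11/row1) / (col1/N)),
a term count times the log of a frequency ratio, which is one plausible way
to read the resemblance to TF-IDF.

```python
import math
from collections import Counter

def llr(k11, k12, k21, k22):
    # G^2 = 2 * sum over cells of k * ln(k * N / (row_total * col_total)).
    # Empty cells contribute zero.
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    g2 = 0.0
    for k, i, j in cells:
        if k > 0:
            g2 += k * math.log(k * total / (rows[i] * cols[j]))
    return 2.0 * g2

def query_terms(doc_tokens, rest_counts, rest_total, top=10):
    # rest_counts / rest_total describe the corpus WITHOUT the subject document.
    doc_counts = Counter(doc_tokens)
    doc_total = len(doc_tokens)
    scored = []
    for term, k11 in doc_counts.items():
        k21 = rest_counts.get(term, 0)
        # LLR is two-sided, so keep only terms over-represented in the document.
        if k11 * rest_total > k21 * doc_total:
            score = llr(k11, doc_total - k11, k21, rest_total - k21)
            scored.append((term, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top]
```

A term that occurs at the same rate inside and outside the subject document
scores zero however frequent it is.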

On Mon, Jul 11, 2011 at 2:52 PM, Andrew Clegg <[email protected]> wrote:

> On 11 July 2011 08:19, Dawid Weiss <[email protected]> wrote:
> > - bioinformatics, in particular gene sequencing to detect long
> > near-matching sequences (a variation of the above, I'm not familiar
> > with any particular algorithms, but I imagine this is a well explored
> > space
>
> The classic is Smith & Waterman:
>
> http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm
>
> This approach has been used in general text processing tasks too, e.g.:
>
>
> http://compbio.ucdenver.edu/Hunter_lab/Cohen/usingBLASTforIdentifyingGeneAndProteinNames.pdf
>
> > given the funds they receive ;),
>
> Hah! Less so these days I'm afraid :-)
>
> Andrew.
>
> --
>
> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>
