Re: Plagiarism - document similarity

Luca Natti Tue, 12 Jul 2011 00:59:01 -0700

Thanks to all,

I need to start from the basic theory;
you are speaking Arabic :) to me. In other words, I need
a less theoretical approach: some real code to get my
hands on.
Excuse the blunt approach, but I need an algorithm that is really fast to
implement and understand,
to use in a real-world scenario, possibly now ;).
Alternatively, I need a basic text(book) to start reading so that I can
understand what you are saying.


Thanks again

On Tue, Jul 12, 2011 at 12:33 AM, Ted Dunning <[email protected]> wrote:

> Easier to simply index all, say, three word phrases and use a TF-IDF score.
>  This will give you a good proxy for sequence similarity.  Documents should
> either be chopped on paragraph boundaries to have a roughly constant length
> or the score should not be normalized by document length.
>
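
A minimal sketch of the three-word-phrase suggestion quoted above, in plain Python. The function names (`shingles`, `build_index`, `score`), the tokenizer regex and the exact IDF formula `log(1 + N/df)` are assumptions made for illustration; they are not from Ted's message or from any particular library. As suggested, the score is not normalized by document length.

```python
import math
import re
from collections import Counter

def shingles(text, n=3):
    """Return the list of n-word phrases (shingles) found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def build_index(docs, n=3):
    """Return per-document phrase counts (tf) and per-phrase document counts (df)."""
    tf = [Counter(shingles(d, n)) for d in docs]
    df = Counter()
    for counts in tf:
        df.update(counts.keys())  # each document counts once per phrase
    return tf, df

def score(query, tf, df, n=3):
    """Score each indexed document against query: sum of tf * idf over shared phrases.

    Assumption: idf = log(1 + N / df). No length normalization is applied.
    """
    num_docs = len(tf)
    query_phrases = set(shingles(query, n))
    scores = []
    for counts in tf:
        total = 0.0
        for phrase in query_phrases:
            if phrase in counts:
                total += counts[phrase] * math.log(1 + num_docs / df[phrase])
        scores.append(total)
    return scores

docs = [
    "the quick brown fox jumps over the lazy dog",
    "a completely different sentence about cooking pasta tonight",
    "yesterday the quick brown fox jumps high",
]
tf, df = build_index(docs)
scores = score("the quick brown fox jumps over a fence", tf, df)
```

Documents that share many rare three-word phrases with the query get the highest scores; a document with no shared phrase scores zero. For a real corpus, chopping documents on paragraph boundaries first (as suggested above) keeps the unnormalized scores comparable.
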
> Log likelihood ratio (LLR) test can be useful to extract good query features
> from the subject document.  TF-IDF score is a reasonable proxy for this
> although it does lead to some problems.  The reason TF-IDF works as a query
> term selection method and why it fails can be seen from the fact that TF-IDF
> is very close to one of the most important terms in the LLR score.
>
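
A minimal sketch of the LLR score mentioned above, computed on a 2x2 table of counts with the usual entropy-based formulation. The interpretation of the four counts (phrase vs. other phrases, subject document vs. rest of the corpus) and the function names are assumptions made for illustration.

```python
import math

def x_log_x(x):
    """Return x * ln(x), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    Assumed meaning of the cells:
      k11: occurrences of the phrase in the subject document
      k12: occurrences of all other phrases in the subject document
      k21: occurrences of the phrase in the rest of the corpus
      k22: occurrences of all other phrases in the rest of the corpus
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp at zero to absorb tiny negative values from rounding.
    return max(0.0, 2.0 * (row + col - mat))
```

A phrase that is as frequent in the subject document as elsewhere scores near zero; a phrase that is common in the subject document but rare in the rest of the corpus scores high, which is what makes it a good query feature.
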
> On Mon, Jul 11, 2011 at 2:52 PM, Andrew Clegg <[email protected]> wrote:
>
> > On 11 July 2011 08:19, Dawid Weiss <[email protected]> wrote:
> > > - bioinformatics, in particular gene sequencing to detect long
> > > near-matching sequences (a variation of the above, I'm not familiar
> > > with any particular algorithms, but I imagine this is a well explored space
> >
> > The classic is Smith & Waterman:
> >
> > http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm
> >
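
A minimal sketch of Smith-Waterman local alignment applied to word tokens rather than gene symbols. The scoring values (`match=2`, `mismatch=-1`, `gap=-1`), the linear gap penalty and the use of whole words as symbols are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    a and b are sequences (here: lists of words). A gap is shown as None.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # scoring matrix, first row/column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append(None)
            i -= 1
        else:
            out_a.append(None); out_b.append(b[j - 1])
            j -= 1
    return best, out_a[::-1], out_b[::-1]

a = "he said the quick brown fox jumps today".split()
b = "the quick red brown fox jumps".split()
best, aligned_a, aligned_b = smith_waterman(a, b)
```

The alignment picks out the shared passage even though one text has an extra word inserted. The cost is O(len(a) * len(b)) time and memory, so this fits a pairwise comparison of short passages better than a search over a whole corpus.
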
> > This approach has been used in general text processing tasks too, e.g.:
> >
> > http://compbio.ucdenver.edu/Hunter_lab/Cohen/usingBLASTforIdentifyingGeneAndProteinNames.pdf
> >
> > > given the funds they receive ;),
> >
> > Hah! Less so these days I'm afraid :-)
> >
> > Andrew.
> >
> > --
> >
> > http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
> >
>
