Re: Plagiarism - document similarity

Em Tue, 12 Jul 2011 01:11:12 -0700

Hi Luca,

again, I have to emphasize: please read what I gave you.
The algorithm in my link is explained for non-scientists, and if you
download Solr you will find the class and can look at how they
implemented that algorithm.


Anything easier would mean that someone else writes the code for you ;).

Regards,
Em

On 12.07.2011 09:58, Luca Natti wrote:
> Thanks to all,
> 
> I need to start from the basic theory;
> you are speaking Arabic :) to me. In other words, I need
> a less theoretical approach: some real code to put my
> hands on.
> Excuse this raw approach, but I need an algorithm that is really fast to
> implement and understand,
> to use in a real-world scenario, possibly now ;).
> Alternatively, I need a basic text(book) to start reading so that I can
> understand what you are saying.
> 
> thanks again
> 
> On Tue, Jul 12, 2011 at 12:33 AM, Ted Dunning <[email protected]> wrote:
> 
>> It is easier to simply index all, say, three-word phrases and use a TF-IDF
>> score. This will give you a good proxy for sequence similarity. Documents
>> should either be chopped on paragraph boundaries to have a roughly constant
>> length, or the score should not be normalized by document length.
>>
>> A log-likelihood ratio (LLR) test can be useful to extract good query
>> features from the subject document. A TF-IDF score is a reasonable proxy
>> for this, although it does lead to some problems. The reason TF-IDF works
>> as a query term selection method, and why it fails, can be seen from the
>> fact that TF-IDF is very close to one of the most important terms in the
>> LLR score.
>>
>> On Mon, Jul 11, 2011 at 2:52 PM, Andrew Clegg <[email protected]> wrote:
>>
>>> On 11 July 2011 08:19, Dawid Weiss <[email protected]> wrote:
>>>> - bioinformatics, in particular gene sequencing to detect long
>>>> near-matching sequences (a variation of the above, I'm not familiar
>>>> with any particular algorithms, but I imagine this is a well explored
>>>> space
>>>
>>> The classic is Smith & Waterman:
>>>
>>> http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm
>>>
>>> This approach has been used in general text processing tasks too, e.g.:
>>>
>>> http://compbio.ucdenver.edu/Hunter_lab/Cohen/usingBLASTforIdentifyingGeneAndProteinNames.pdf
>>>
>>>> given the funds they receive ;),
>>>
>>> Hah! Less so these days I'm afraid :-)
>>>
>>> Andrew.
>>>
>>> --
>>>
>>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>>>
>>
>
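
Since the thread asks for real code: below is a minimal sketch of the
three-word-phrase TF-IDF idea quoted above from Ted Dunning, in plain Python.
It is not the Solr class Em refers to. The function names (shingles,
build_index, score), the smoothed IDF formula, and the sample paragraphs are
all assumptions made for illustration. Following the quoted advice, the score
is a raw sum that is not normalized by document length, so the inputs should
be paragraphs of roughly similar size.

```python
import math
import re
from collections import Counter


def shingles(text, n=3):
    """Return the list of n-word phrases (shingles) found in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_index(docs, n=3):
    """Count shingles per document and compute a smoothed IDF per shingle."""
    counts = [Counter(shingles(d, n)) for d in docs]
    # Document frequency: in how many documents each shingle occurs.
    df = Counter(s for c in counts for s in c)
    # Smoothed IDF (an assumed variant; other IDF formulas work as well).
    idf = {s: math.log((1 + len(docs)) / (1 + k)) + 1.0 for s, k in df.items()}
    return counts, idf


def score(query, counts, idf, n=3):
    """Score every indexed document against the query text.

    The score is the sum of tf * idf over the shingles shared with the
    query, deliberately not normalized by document length.
    """
    q = set(shingles(query, n))
    return [sum(tf * idf[s] for s, tf in c.items() if s in q) for c in counts]


# Hypothetical sample data: each entry stands for one paragraph.
paragraphs = [
    "the quick brown fox jumps over the lazy dog",
    "a quick brown fox jumps over a sleeping cat",
    "completely unrelated text about indexing documents with solr",
]
counts, idf = build_index(paragraphs)
suspect = "yesterday the quick brown fox jumps over the fence"
scores = score(suspect, counts, idf)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores)
```

The paragraph with the highest score shares the most (and the rarest)
three-word phrases with the suspect text, which makes it the first candidate
to inspect by hand.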

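The Smith & Waterman algorithm that Andrew Clegg links to above can also be
sketched in a few lines. This is an illustrative version that aligns word
lists instead of gene sequences. The scoring values (match=2, mismatch=-1,
gap=-1), the function name smith_waterman, and the sample sentences are
assumptions, not values taken from the thread or from the linked paper.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment of two sequences with a linear gap penalty.

    Returns (best_score, aligned_a, aligned_b); "-" marks a gap.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] is the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, out_a[::-1], out_b[::-1]


# Hypothetical sample sentences, compared word by word.
original = "he said the quick brown fox jumps high".split()
suspect = "we saw the quick red fox jumps away".split()
best_score, aligned_a, aligned_b = smith_waterman(original, suspect)
print(best_score)
print(" ".join(aligned_a))
print(" ".join(aligned_b))
```

The dynamic-programming table costs time and memory proportional to the
product of the two lengths, so this sketch is better suited to comparing a
few candidate paragraphs than to scanning a whole collection.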