Re: dedup dont delete duplicates !

2009-11-25 Thread Mischa Tuffield
give you news when done.

>> From: mischa.tuffi...@garlik.com
>> Subject: Re: dedup dont delete duplicates !
>> Date: Wed, 25 Nov 2009 11:45:21 +
>> To: nutch-user@lucene.apache.org
>>
>> Hello All,
>>
>> I am getting the followi

RE: dedup dont delete duplicates !

2009-11-25 Thread BELLINI ADAM
Please, Mischa, if your problem is not about deleting duplicates, just open another thread! Thanks.

Andrzej, thanks for everything. I will try to run a diff command on the content of the 2 pages and will give you news when done.

> From: mischa.tuffi...@garlik.com
> Subject: Re: dedup dont delete dupl

Re: dedup dont delete duplicates !

2009-11-25 Thread Mischa Tuffield
Hello All,

I am getting the following error in my hadoop.log (see below). It seems to happen every time I run any of the Nutch command-line tools :(

Does anyone know what problem I am having?

Cheers,
Mischa

On 25 Nov 2009, at 09:15, Andrzej Bialecki wrote:
> BELLINI ADAM wrote:
>> hi,

Re: dedup dont delete duplicates !

2009-11-25 Thread reinhard schwab
Andrzej Bialecki wrote:
> BELLINI ADAM wrote:
>> hi,
>>
>> my two URLs point to the same page !
>
> Please, no need to shout ...
>
> If the MD5 signatures are different, then the binary content of these
> pages is different, period.
>
> Use readseg -dump utility to retrieve the page content from

Re: dedup dont delete duplicates !

2009-11-25 Thread Andrzej Bialecki
BELLINI ADAM wrote:
> hi,
>
> my two URLs point to the same page !

Please, no need to shout ...

If the MD5 signatures are different, then the binary content of these pages is different, period.

Use the readseg -dump utility to retrieve the page content from the segment, extract just the two pages
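
The comparison suggested here can be sketched roughly as follows. This is a minimal illustration in Python, not part of Nutch: it assumes the two page bodies have already been extracted by hand from the readseg -dump output, and the sample contents and the names `page_a`, `page_b`, `md5_hex` and `unified_diff` are made up for the example. It computes the MD5 digest of each body and produces a line-by-line diff so that the differing bytes become visible.

```python
import difflib
import hashlib

# Made-up page bodies standing in for the two pages cut out of the dump.
# They differ only in a generated timestamp.
page_a = b"<html>\n<body>\n<h1>Welcome</h1>\n<p>Generated 10:01:07</p>\n</body>\n</html>\n"
page_b = b"<html>\n<body>\n<h1>Welcome</h1>\n<p>Generated 10:01:09</p>\n</body>\n</html>\n"


def md5_hex(data):
    """Hex MD5 digest of the raw bytes; any single-byte change alters it."""
    return hashlib.md5(data).hexdigest()


def unified_diff(a, b):
    """Line-by-line unified diff of two byte strings, as a list of lines."""
    a_lines = a.decode("utf-8", errors="replace").splitlines()
    b_lines = b.decode("utf-8", errors="replace").splitlines()
    return list(difflib.unified_diff(a_lines, b_lines, "page_a", "page_b", lineterm=""))


print(md5_hex(page_a) == md5_hex(page_b))  # False: the bytes differ
for line in unified_diff(page_a, page_b):
    print(line)
```

If the digests differ, the diff output shows which lines are responsible, for example a timestamp or a session identifier embedded in the page.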

Re: dedup dont delete duplicates !

2009-11-24 Thread Subhojit Roy
Hi, Does TextProfileSignature exclude the HTML header (meta tags etc.) while creating the signature for a page? I have noticed that minor differences like time-stamp etc. cause the same page to "look" different to Nutch causing multiple copies of the same page to be added to the index. Is it also
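
To illustrate the effect described above: an exact hash over the raw page changes whenever anything in the page changes, whereas a signature computed over a coarse profile of the visible text can tolerate small differences such as a timestamp. The sketch below is a simplified illustration only. It is not Nutch's TextProfileSignature implementation and does not answer what that class actually excludes; the tag stripping, the letters-only tokenization, the `min_count` threshold and the sample pages are all assumptions chosen for the example.

```python
import hashlib
import re
from collections import Counter


def exact_signature(html):
    """MD5 over the full page text: sensitive to every character."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()


def profile_signature(html, min_count=2):
    """MD5 over a coarse word-frequency profile of the visible text."""
    text = re.sub(r"<[^>]*>", " ", html)          # crude tag removal (assumption)
    tokens = re.findall(r"[a-z]+", text.lower())  # letters only, so digits drop out
    counts = Counter(tokens)
    # Keep only words that occur at least min_count times, in a stable order.
    frequent = sorted((t, c) for t, c in counts.items() if c >= min_count)
    profile = "\n".join("%s %d" % (t, c) for t, c in frequent)
    return hashlib.md5(profile.encode("utf-8")).hexdigest()


# Made-up pages that differ only in a timestamp (meta tag and body).
page_a = ("<html><head><meta name='date' content='2009-11-24T10:01:07'></head>"
          "<body><p>Nutch crawls pages. Nutch indexes pages.</p>"
          "<p>Generated at 10:01:07</p></body></html>")
page_b = page_a.replace("10:01:07", "10:01:09")

print(exact_signature(page_a) == exact_signature(page_b))      # False
print(profile_signature(page_a) == profile_signature(page_b))  # True
```

The trade-off in a scheme like this is that pages with genuinely different but rare content can collapse to the same signature, so the threshold has to be chosen with care.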

RE: dedup dont delete duplicates !

2009-11-24 Thread BELLINI ADAM
They have different signatures! Can you please tell me more about TextProfileSignature? How should I use it?

Best regards

> Date: Tue, 24 Nov 2009 22:35:52 +0100
> From: a...@getopt.org
> To: nutch-user@lucene.apache.org
> Subject: Re: dedup dont delete duplicates !
>
> BELLINI

Re: dedup dont delete duplicates !

2009-11-24 Thread Andrzej Bialecki
BELLINI ADAM wrote:
> Yes, I checked the signatures and they are not the same!! It's really weird:
> the URL www.domaine/folder/index.html?lang=fr is just this one: www.domaine/folder/index.html

Apparently it isn't a bit-exact replica of the page, so its MD5 hash is different. You need to use a more re

RE: dedup dont delete duplicates !

2009-11-24 Thread BELLINI ADAM
Yes, I checked the signatures and they are not the same!! It's really weird: the URL www.domaine/folder/index.html?lang=fr is just this one: www.domaine/folder/index.html

> Date: Tue, 24 Nov 2009 22:21:19 +0100
> From: a...@getopt.org
> To: nutch-user@lucene.apache.org
> S

RE: dedup dont delete duplicates !

2009-11-24 Thread BELLINI ADAM
I also don't understand why they have 3 different signatures, since it's really the same page!

> From: mbel...@msn.com
> To: nutch-user@lucene.apache.org
> Subject: dedup dont delete duplicates !
> Date: Tue, 24 Nov 2009 20:56:39 +
>
> hi,
>
> dedup doesn't work for me.
> I have

Re: dedup dont delete duplicates !

2009-11-24 Thread Andrzej Bialecki
BELLINI ADAM wrote:
> hi,
>
> dedup doesn't work for me.
> I have read that duplicates have either the same contents (via MD5 hash) or the same URL.
> In my case I don't have the same URLs, but I still have the same contents for those URLs.
> I'll give you an example: I have three URLs that have the same conte
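
The rule quoted above (duplicates share either the same URL or the same content hash) can be sketched as follows. This is a minimal illustration in Python, not Nutch's dedup implementation: the `dedup` function, the example.com URLs and the page bodies are hypothetical, and keeping the first document seen is an arbitrary choice made for the example.

```python
import hashlib


def dedup(docs):
    """docs: iterable of (url, content_bytes). Returns the documents kept.

    A document is dropped if its URL or the MD5 of its content was
    already seen; the first occurrence wins (assumption for this sketch).
    """
    seen_urls = set()
    seen_hashes = set()
    kept = []
    for url, content in docs:
        digest = hashlib.md5(content).hexdigest()
        if url in seen_urls or digest in seen_hashes:
            continue
        seen_urls.add(url)
        seen_hashes.add(digest)
        kept.append((url, content))
    return kept


# Hypothetical documents: the second is byte-identical to the first,
# the third looks the same in a browser but contains a differing comment.
docs = [
    ("http://example.com/folder/index.html", b"<p>same page</p>"),
    ("http://example.com/folder/index.html?lang=fr", b"<p>same page</p>"),
    ("http://example.com/folder/", b"<p>same page</p> <!-- 10:01:09 -->"),
]

for url, _ in dedup(docs):
    print(url)
```

Only the byte-identical copy is removed here. The third URL survives because its bytes, and therefore its MD5 hash, differ, which matches the behaviour reported in this thread.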