Re: dedup dont delete duplicates !

2009-11-25 Thread Andrzej Bialecki
BELLINI ADAM wrote: hi, my two URLs point to the same page! Please, no need to shout ... If the MD5 signatures are different, then the binary content of these pages is different, period. Use the readseg -dump utility to retrieve the page content from the segment, extract just the two pages
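Once the two pages have been pulled out of the segment dump, the comparison suggested above can be sketched in a few lines of Python. This is only an illustration: the helper names (md5_hex, first_difference, show_diff) are hypothetical and are not part of Nutch, and the page contents are assumed to be already loaded as bytes or text.

```python
import difflib
import hashlib


def md5_hex(data: bytes) -> str:
    """MD5 digest of the exact bytes, as a hex string."""
    return hashlib.md5(data).hexdigest()


def first_difference(a: bytes, b: bytes) -> int:
    """Index of the first differing byte, or -1 if the inputs are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # Common prefix is identical; they differ only if one input is longer.
    return -1 if len(a) == len(b) else min(len(a), len(b))


def show_diff(a: str, b: str) -> list:
    """Unified diff of two page texts, one list entry per output line."""
    return list(
        difflib.unified_diff(
            a.splitlines(), b.splitlines(),
            fromfile="page_a", tofile="page_b", lineterm="",
        )
    )
```

Even a single extra byte (a trailing newline, a session id, a timestamp) is enough to change the MD5 digest, and first_difference or show_diff points at where the two copies diverge.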

Re: dedup dont delete duplicates !

2009-11-25 Thread reinhard schwab
Andrzej Bialecki wrote: BELLINI ADAM wrote: hi, my two URLs point to the same page! Please, no need to shout ... If the MD5 signatures are different, then the binary content of these pages is different, period. Use the readseg -dump utility to retrieve the page content from the

Re: dedup dont delete duplicates !

2009-11-25 Thread Mischa Tuffield
Hello All, I am getting the following error in my hadoop.log (see below). It seems to happen every time I run any of the nutch command line tools :(
2009-11-25 11:42:49,299 INFO crawl.Injector - Injector: done
2009-11-25 11:42:49,302 DEBUG hdfs.DFSClient -

RE: dedup dont delete duplicates !

2009-11-25 Thread BELLINI ADAM
Please, Mischa, if your problem is not about deleting duplicates, just open another thread! Thanks Andrzej, thanks for everything. I will try to run a diff command on the content of the two pages and will report back when done. From: mischa.tuffi...@garlik.com Subject: Re: dedup dont delete duplicates

Re: dedup dont delete duplicates !

2009-11-25 Thread Mischa Tuffield
: mischa.tuffi...@garlik.com Subject: Re: dedup dont delete duplicates ! Date: Wed, 25 Nov 2009 11:45:21 + To: nutch-user@lucene.apache.org Hello All, I am getting the following error in my hadoop.log (see below). It seems to happen every time I run any of the nutch command line tools

dedup dont delete duplicates !

2009-11-24 Thread BELLINI ADAM
Hi, dedup doesn't work for me. I have read that duplicates have either the same content (via MD5 hash) or the same URL. In my case I don't have the same URLs, but the content behind those URLs is the same. Here is an example: I have three URLs that have the same content 1-

Re: dedup dont delete duplicates !

2009-11-24 Thread Andrzej Bialecki
BELLINI ADAM wrote: Hi, dedup doesn't work for me. I have read that duplicates have either the same content (via MD5 hash) or the same URL. In my case I don't have the same URLs, but the content behind those URLs is the same. Here is an example: I have three URLs that have the same

RE: dedup dont delete duplicates !

2009-11-24 Thread BELLINI ADAM
I also don't understand why they have 3 different signatures, since it's really the same page! From: mbel...@msn.com To: nutch-user@lucene.apache.org Subject: dedup dont delete duplicates ! Date: Tue, 24 Nov 2009 20:56:39 + Hi, dedup doesn't work for me. I have read

RE: dedup dont delete duplicates !

2009-11-24 Thread BELLINI ADAM
Yes, I checked the signatures and they are not the same! It's really weird: the URL www.domaine/folder/index.html?lang=fr is just this one: www.domaine/folder/index.html Date: Tue, 24 Nov 2009 22:21:19 +0100 From: a...@getopt.org To: nutch-user@lucene.apache.org Subject: Re: dedup dont delete

Re: dedup dont delete duplicates !

2009-11-24 Thread Andrzej Bialecki
BELLINI ADAM wrote: Yes, I checked the signatures and they are not the same! It's really weird: the URL www.domaine/folder/index.html?lang=fr is just this one: www.domaine/folder/index.html Apparently it isn't a bit-exact replica of the page, so its MD5 hash is different. You need to use a more
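To illustrate the difference between a bit-exact hash and a more tolerant signature, here is a minimal Python sketch. It is not Nutch's TextProfileSignature algorithm; the function names, the tag-stripping regex, and the min_token_len and min_count thresholds are all assumptions chosen for the example. The idea is simply to hash a coarse profile of the visible text instead of the raw bytes.

```python
import hashlib
import re
from collections import Counter


def exact_signature(raw: bytes) -> str:
    """MD5 over the raw bytes: any single-byte change yields a new signature."""
    return hashlib.md5(raw).hexdigest()


def fuzzy_signature(html: str, min_token_len: int = 3, min_count: int = 2) -> str:
    """Hypothetical tolerant signature: MD5 over a profile of frequent words.

    Thresholds are illustrative assumptions, not values taken from Nutch.
    """
    # Crude tag removal; a real parser would be more robust.
    text = re.sub(r"<[^>]+>", " ", html).lower()
    # Keep alphanumeric tokens that are long enough to be meaningful.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text) if len(t) >= min_token_len]
    counts = Counter(tokens)
    # Drop rare tokens so one-off noise (timestamps, ids) does not matter.
    profile = sorted((tok, n) for tok, n in counts.items() if n >= min_count)
    encoded = "\n".join(f"{tok} {n}" for tok, n in profile)
    return hashlib.md5(encoded.encode("utf-8")).hexdigest()


page_a = "<p>nutch dedup removes duplicate pages, dedup compares signatures</p><p>generated 2009-11-24 20:56</p>"
page_b = "<p>nutch dedup removes duplicate pages, dedup compares signatures</p><p>generated 2009-11-24 22:21</p>"
```

With these two sample pages, exact_signature differs because of the timestamp, while fuzzy_signature is the same for both, since the timestamp tokens are either too short or too rare to enter the profile. The trade-off is that a tolerant signature can also merge pages that differ only in small but meaningful ways.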

RE: dedup dont delete duplicates !

2009-11-24 Thread BELLINI ADAM
different signature! Can you please tell me more about TextProfileSignature? How should I use it? Best regards Date: Tue, 24 Nov 2009 22:35:52 +0100 From: a...@getopt.org To: nutch-user@lucene.apache.org Subject: Re: dedup dont delete duplicates ! BELLINI ADAM wrote: Yes, I checked

Re: dedup dont delete duplicates !

2009-11-24 Thread Subhojit Roy
Hi, Does TextProfileSignature exclude the HTML header (meta tags etc.) while creating the signature for a page? I have noticed that minor differences, such as a timestamp, cause the same page to look different to Nutch, so multiple copies of the same page are added to the index. Is it also