BELLINI ADAM wrote:
hi,
my two URLs point to the same page!
Please, no need to shout ...
If the MD5 signatures are different, then the binary content of these
pages is different, period.
Use the readseg -dump utility to retrieve the page content from the segment,
extract just the two pages
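
The point about MD5 can be checked outside Nutch once the two pages have been dumped. The following is a minimal sketch, not part of the original thread: the two page bodies are made-up stand-ins for the dumped content, and `md5_hex` and `show_diff` are hypothetical helper names. It only illustrates that a single differing byte changes the MD5 digest and that a line diff locates the difference.

```python
import difflib
import hashlib
from typing import List


def md5_hex(content: bytes) -> str:
    """Return the MD5 digest of the raw bytes as a hex string."""
    return hashlib.md5(content).hexdigest()


def show_diff(a: bytes, b: bytes) -> List[str]:
    """Return a unified line diff of two page bodies."""
    a_lines = a.decode("utf-8", errors="replace").splitlines()
    b_lines = b.decode("utf-8", errors="replace").splitlines()
    return list(
        difflib.unified_diff(a_lines, b_lines, "page_a", "page_b", lineterm="")
    )


# Hypothetical page bodies: identical apart from a generated timestamp.
page_a = b"<html><body>\n<p>Welcome</p>\n<!-- generated 11:42:49 -->\n</body></html>\n"
page_b = b"<html><body>\n<p>Welcome</p>\n<!-- generated 11:42:50 -->\n</body></html>\n"

print("same MD5:", md5_hex(page_a) == md5_hex(page_b))
for line in show_diff(page_a, page_b):
    print(line)
```

If the diff is empty but the digests still differ, the difference is probably in bytes that do not survive decoding or line splitting (line endings, for example), so comparing the raw bytes directly would be the next step.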
Andrzej Bialecki wrote:
BELLINI ADAM wrote:
hi,
my two URLs point to the same page!
Please, no need to shout ...
If the MD5 signatures are different, then the binary content of these
pages is different, period.
Use the readseg -dump utility to retrieve the page content from the
Hello All,
I am getting the following error in my hadoop.log (see below). It seems to
happen every time I run any of the nutch command line tools :(
!--
2009-11-25 11:42:49,299 INFO crawl.Injector - Injector: done
2009-11-25 11:42:49,302 DEBUG hdfs.DFSClient -
Please, Mischa, if your problem is not about deleting duplicates, just open
another thread! Thanks.
Andrzej, thanks for everything. I will try to run a diff command on the
content of the 2 pages.
I will report back when done.
From: mischa.tuffi...@garlik.com
Subject: Re: dedup dont delete duplicates
: mischa.tuffi...@garlik.com
Subject: Re: dedup dont delete duplicates !
Date: Wed, 25 Nov 2009 11:45:21 +
To: nutch-user@lucene.apache.org
Hello All,
I am getting the following error in my hadoop.log (see below). It seems to
happen every time I run any of the nutch command line tools
hi,
dedup doesn't work for me.
I have read that duplicates have either the same content (via MD5 hash) or
the same URL.
In my case I don't have the same URLs, but the content is still the same for
those URLs.
Here is an example: I have three URLs that have the same content
1-
BELLINI ADAM wrote:
hi,
dedup doesn't work for me.
I have read that duplicates have either the same content (via MD5 hash) or
the same URL.
In my case I don't have the same URLs, but the content is still the same for
those URLs.
Here is an example: I have three URLs that have the same
I also don't understand why they have 3 different signatures, since it's
really the same page!
From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject: dedup dont delete duplicates !
Date: Tue, 24 Nov 2009 20:56:39 +
hi,
dedup doesn't work for me.
I have read
Yes, I checked the signatures and they are not the same!! It's really weird:
the URL www.domaine/folder/index.html?lang=fr is just this one:
www.domaine/folder/index.html
Date: Tue, 24 Nov 2009 22:21:19 +0100
From: a...@getopt.org
To: nutch-user@lucene.apache.org
Subject: Re: dedup dont delete
BELLINI ADAM wrote:
Yes, I checked the signatures and they are not the same!! It's really weird:
the URL www.domaine/folder/index.html?lang=fr is just this one:
www.domaine/folder/index.html
Apparently it isn't a bit-exact replica of the page, so its MD5 hash is
different. You need to use a more
different signature!
Can you please tell me more about TextProfileSignature? How should I use it?
Best regards
Date: Tue, 24 Nov 2009 22:35:52 +0100
From: a...@getopt.org
To: nutch-user@lucene.apache.org
Subject: Re: dedup dont delete duplicates !
BELLINI ADAM wrote:
Yes, I checked
Hi,
Does TextProfileSignature exclude the HTML header (meta tags etc.) while
creating the signature for a page? I have noticed that minor differences
such as timestamps cause the same page to look different to Nutch, so
multiple copies of the same page are added to the index.
Is it also
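
To illustrate why a profile-based signature can tolerate a changing timestamp where a raw MD5 cannot, here is a simplified sketch. It is not Nutch's TextProfileSignature implementation and says nothing about how Nutch treats the HTML header. The function `fuzzy_signature`, its parameters `min_token_len` and `quant_rate`, and the sample texts are all assumptions made for this example. The idea shown is to hash a quantized word-frequency profile of the text, so that tokens occurring only once (such as the digits of a timestamp) drop out before hashing.

```python
import hashlib
import re
from collections import Counter


def fuzzy_signature(text, min_token_len=2, quant_rate=0.01):
    """Hash a quantized word-frequency profile instead of the raw bytes."""
    tokens = [
        t for t in re.findall(r"[a-z0-9]+", text.lower())
        if len(t) >= min_token_len
    ]
    counts = Counter(tokens)
    max_freq = max(counts.values(), default=0)

    # Quantization step: frequencies are rounded down to a multiple of
    # `quant`, and tokens that fall below the step are dropped.
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1

    profile = {tok: (n // quant) * quant for tok, n in counts.items()}
    kept = sorted((tok, n) for tok, n in profile.items() if n >= quant)
    kept.sort(key=lambda item: -item[1])  # most frequent first, ties by token

    profile_text = "\n".join(f"{tok} {n}" for tok, n in kept)
    return hashlib.md5(profile_text.encode("utf-8")).hexdigest()


# Hypothetical extracted page texts that differ only in a timestamp.
text_a = "nutch crawl nutch index nutch crawl generated at 11 42 49"
text_b = "nutch crawl nutch index nutch crawl generated at 11 42 50"

raw_a = hashlib.md5(text_a.encode("utf-8")).hexdigest()
raw_b = hashlib.md5(text_b.encode("utf-8")).hexdigest()
print("raw MD5 equal:  ", raw_a == raw_b)
print("fuzzy sig equal:", fuzzy_signature(text_a) == fuzzy_signature(text_b))
```

The trade-off is that such a signature also ignores small differences that do matter, so two pages with the same dominant vocabulary but different details can collapse into one.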