give you news when done.
>
>
>
>
>> From: mischa.tuffi...@garlik.com
>> Subject: Re: dedup dont delete duplicates !
>> Date: Wed, 25 Nov 2009 11:45:21 +
>> To: nutch-user@lucene.apache.org
>>
>> Hello All,
>>
>> I am getting the followi
Please, Mischa, if your problem is not about deleting duplicates, just open
another thread! Thanks.
Andrzej, thanks for everything; I will try to run a diff command on the content
of the 2 pages.
I will give you news when done.
> From: mischa.tuffi...@garlik.com
> Subject: Re: dedup dont delete dupl
Hello All,
I am getting the following error in my hadoop.log (see below). It seems to
happen every time I run any of the nutch command line tools :(
Does anyone know what problem I am having?
Cheers,
Mischa
On 25 Nov 2009, at 09:15, Andrzej Bialecki wrote:
> BELLINI ADAM wrote:
>> hi,
Andrzej Bialecki wrote:
> BELLINI ADAM wrote:
>> hi,
>>
>> my two URLs point to the same page!
>
> Please, no need to shout ...
>
> If the MD5 signatures are different, then the binary content of these
> pages is different, period.
>
> Use readseg -dump utility to retrieve the page content from
BELLINI ADAM wrote:
> hi,
> my two URLs point to the same page!
Please, no need to shout ...
If the MD5 signatures are different, then the binary content of these
pages is different, period.
Use the readseg -dump utility to retrieve the page content from the segment,
extract just the two pages
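To illustrate the point about MD5 signatures, here is a minimal Python sketch (not Nutch code) showing that two page bodies differing by a single character get unrelated digests. The page contents are made-up examples, chosen only to mimic a page that embeds a timestamp.

```python
import hashlib

# Two hypothetical page bodies that differ only in an embedded timestamp.
page_a = b"<html><body>Bonjour</body></html><!-- generated 11:45:21 -->"
page_b = b"<html><body>Bonjour</body></html><!-- generated 11:45:22 -->"

sig_a = hashlib.md5(page_a).hexdigest()
sig_b = hashlib.md5(page_b).hexdigest()

# An exact-content signature matches only when every byte is identical.
print(sig_a)
print(sig_b)
print("same signature:", sig_a == sig_b)
```

This is why running a diff on the two dumped pages is a useful next step: any byte-level difference, however small, is enough to change the signature.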
Hi,
Does TextProfileSignature exclude the HTML header (meta tags etc.) while
creating the signature for a page? I have noticed that minor differences
like time-stamp etc. cause the same page to "look" different to Nutch
causing multiple copies of the same page to be added to the index.
Is it also
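As a rough illustration of how a profile-style signature can tolerate small differences such as a timestamp, here is a simplified Python sketch. It is only loosely in the spirit of a text-profile signature and is not Nutch's TextProfileSignature implementation; the function name `profile_signature` and the parameters `min_token_len` and `quant_rate` are assumptions made for this example. It works on plain text supplied by the caller and makes no claim about how Nutch treats HTML headers or meta tags.

```python
import hashlib
import re
from collections import Counter

def profile_signature(text, min_token_len=2, quant_rate=0.01):
    """Hypothetical fuzzy signature: hash a quantized word-frequency profile."""
    # Lowercase and keep only alphanumeric tokens of a minimum length.
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()
    # Quantize frequencies; with a step of at least 2, tokens seen once are dropped.
    quant = max(2, round(max(counts.values()) * quant_rate))
    profile = {}
    for tok, count in counts.items():
        bucket = (count // quant) * quant
        if bucket >= quant:
            profile[tok] = bucket
    # Order by descending frequency, then token, so the profile is deterministic.
    ordered = sorted(profile.items(), key=lambda kv: (-kv[1], kv[0]))
    blob = "\n".join(f"{tok} {freq}" for tok, freq in ordered)
    return hashlib.md5(blob.encode("utf-8")).hexdigest()

# Two made-up texts that differ only in a timestamp.
text_a = "nutch crawl nutch crawl index index generated 11:45:21"
text_b = "nutch crawl nutch crawl index index generated 11:45:22"
print(profile_signature(text_a) == profile_signature(text_b))
```

The design choice here is to hash a coarse summary of the text instead of the raw bytes, so rare tokens (like a changing timestamp) do not affect the result.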
they have different signatures!
can you please tell me more about TextProfileSignature? how should I use it?
best regards
> Date: Tue, 24 Nov 2009 22:35:52 +0100
> From: a...@getopt.org
> To: nutch-user@lucene.apache.org
> Subject: Re: dedup dont delete duplicates !
>
> BELLINI
BELLINI ADAM wrote:
> yes, I checked the signatures and they're not the same!! it's really weird
> the URL www.domaine/folder/index.html?lang=fr is just this one
> www.domaine/folder/index.html
Apparently it isn't a bit-exact replica of the page, so its MD5 hash is
different. You need to use a more re
yes, I checked the signatures and they're not the same!! it's really weird
the URL www.domaine/folder/index.html?lang=fr is just this one
www.domaine/folder/index.html
> Date: Tue, 24 Nov 2009 22:21:19 +0100
> From: a...@getopt.org
> To: nutch-user@lucene.apache.org
> S
I also don't understand why they have 3 different signatures, since it's
really the same page!
> From: mbel...@msn.com
> To: nutch-user@lucene.apache.org
> Subject: dedup dont delete duplicates !
> Date: Tue, 24 Nov 2009 20:56:39 +
>
>
>
> hi,
>
> dedup doesn't work for me.
> I have
BELLINI ADAM wrote:
hi,
dedup doesn't work for me.
I have read that duplicates have either the same contents (via MD5 hash) or
the same URL.
In my case I don't have the same URLs, but those URLs still have the same
contents.
Here is an example: I have three URLs that have the same conte
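The idea of content-based duplicates described above can be sketched as follows. This is a minimal Python illustration, not Nutch's dedup tool; the function `find_duplicates` and the example.com URLs and page bodies are hypothetical. It groups URLs whose raw content has the same MD5 digest.

```python
import hashlib

def find_duplicates(pages):
    """pages maps URL -> raw content bytes; returns groups of URLs with equal MD5."""
    by_digest = {}
    for url, content in pages.items():
        digest = hashlib.md5(content).hexdigest()
        by_digest.setdefault(digest, []).append(url)
    # Only digests shared by more than one URL count as duplicate groups.
    return [sorted(urls) for urls in by_digest.values() if len(urls) > 1]

# Made-up pages: two URLs serve identical bytes, one serves something else.
pages = {
    "http://example.com/folder/index.html": b"<html>same</html>",
    "http://example.com/folder/index.html?lang=fr": b"<html>same</html>",
    "http://example.com/other.html": b"<html>different</html>",
}
duplicate_groups = find_duplicates(pages)
print(duplicate_groups)
```

Under this scheme, URLs are grouped only when the bytes are identical, so pages that look the same in a browser but differ by even one byte would not be grouped.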