e.org
> Subject: Re: deletions from index
>
> So, I had these numbers in my index:
> Num Docs: 189550
> Max Docs: 285531
> Deleted Docs: 95981
>
> Then I did a crawl and index, which told me: indexed (add/update): 13,423
> And now I have these numbers in my index:
>
>
utch.apache.org>; User
<user@nutch.apache.org>
Sent: Monday, October 2, 2017 1:19 PM
Subject: RE: deletions from index
You can check the Hadoop job's counters to see how many are being deleted. If
some are, then -deleteGone is on in your case. Only with that setting documents
are g
October 2017 21:51
> To: User <user@nutch.apache.org>
> Subject: deletions from index
>
> With my new news crawl, I would like to keep web pages in the index, even
> after they have disappeared from the web, so I can continue using them in
> machine-learning
With my new news crawl, I would like to keep web pages in the index, even after
they have disappeared from the web, so I can continue using them in
machine-learning processes. I thought I could achieve this by avoiding running
cleaning jobs. However, I still notice increasing numbers of