RE: deletions from index

2017-10-03 Thread Markus Jelsma
> Subject: Re: deletions from index > > So, I had these numbers in my index: > Num Docs: 189550 Max Docs: 285531 > Deleted Docs: 95981 > > Then I did a crawl and index, which told me indexed (add/update): 13,423 > And now I have these numbers in my index: > >
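
The three figures quoted in this message are consistent with each other: Max Docs equals Num Docs plus Deleted Docs. A minimal sketch of that check follows, using only the numbers from the snippet; the variable names and the `deleted_fraction` helper are illustrative assumptions, not part of Nutch or Solr.

```python
# Figures quoted in the message above (before the new crawl).
num_docs = 189_550      # live documents
max_docs = 285_531      # live documents plus deleted ones not yet merged away
deleted_docs = 95_981   # documents marked as deleted


def deleted_fraction(num_docs: int, deleted_docs: int) -> float:
    """Hypothetical helper: share of the index occupied by deleted documents."""
    return deleted_docs / (num_docs + deleted_docs)


# Max Docs should equal Num Docs + Deleted Docs.
assert num_docs + deleted_docs == max_docs

print(f"deleted share: {deleted_fraction(num_docs, deleted_docs):.1%}")
```

Running the same check on the figures reported after the crawl (truncated in this snippet) would show whether the deleted count actually grew.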

Re: deletions from index

2017-10-02 Thread Michael Coffey
User <user@nutch.apache.org> Sent: Monday, October 2, 2017 1:19 PM Subject: RE: deletions from index You can check the Hadoop job's counters to see how many are being deleted. If some are, then -deleteGone is on in your case. Only with that setting documents are g

RE: deletions from index

2017-10-02 Thread Markus Jelsma
October 2017 21:51 > To: User <user@nutch.apache.org> > Subject: deletions from index > > With my new news crawl, I would like to keep web pages in the index, even > after they have disappeared from the web, so I can continue using them in > machine-learning

deletions from index

2017-10-02 Thread Michael Coffey
With my new news crawl, I would like to keep web pages in the index, even after they have disappeared from the web, so I can continue using them in machine-learning processes. I thought I could achieve this by avoiding running cleaning jobs. However, I still notice increasing numbers of