Best and economical way of setting up a Hadoop cluster for distributed crawling

2019-10-22 Thread Sachin Mittal
Hi, I have been running Nutch in local mode and so far I have gained a good understanding of how it all works. I want to move on to distributed crawling using some public cloud provider. I just wanted to know if fellow users have any experience in setting up Nutch for distributed crawling.

Re: what happens to older segments

2019-10-22 Thread Sebastian Nagel
Hi Sachin,

> does mergesegs by default updates the crawldb once it merges all the segments?

No, it does not. That's already evident from the command-line help (no CrawlDb is passed as a parameter):

$> bin/nutch mergesegs
SegmentMerger output_dir (-dir segments | seg1 seg2 ...) [-filter] ...

> Or d…
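Since mergesegs does not touch the CrawlDb, a merge is typically followed by an explicit updatedb against the merged segment. A minimal sketch, assuming the conventional crawl/ directory layout (all paths here are illustrative, and the merged segment's name is a generated timestamp):

```shell
# Merge all existing segments into one new segment under crawl/MERGEDsegments.
bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments -filter

# The merge does NOT update the CrawlDb; run updatedb explicitly on the
# merged segment. Its exact name is timestamp-generated, so glob it:
bin/nutch updatedb crawl/crawldb crawl/MERGEDsegments/*
```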

Re: what happens to older segments

2019-10-22 Thread Sachin Mittal
Ok, understood. I had one question though: does mergesegs by default update the crawldb once it merges all the segments? Or do we have to call the updatedb command on the merged segment to update the crawldb, so that it has all the information for the next cycle? Thanks, Sachin. On Tue, Oct 22…

Re: what happens to older segments

2019-10-22 Thread Sebastian Nagel
Hi Sachin,

> I want to know once a new segment is generated is there any use of previous segments and can they be deleted?

As soon as a segment is indexed and the CrawlDb is updated from this segment, you may delete it. But keeping older segments allows - reindexing in case something went wrong…
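The segment lifecycle described above can be sketched as a cleanup step in a crawl script. This is a sketch only, assuming the common crawl/ layout; the segment name and linkdb path are illustrative:

```shell
# Illustrative segment name (real names are generation timestamps):
SEG=crawl/segments/20191022123456

# A segment is safe to delete only after both of these have run:
bin/nutch updatedb crawl/crawldb "$SEG"
bin/nutch index crawl/crawldb -linkdb crawl/linkdb "$SEG" -deleteGone

# Optionally archive instead of deleting outright, so reindexing
# remains possible if something goes wrong downstream:
# tar czf "$SEG.tar.gz" "$SEG"
rm -r "$SEG"
```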

Re: Unable to index on Hadoop 3.2.0 with 1.16

2019-10-22 Thread Sebastian Nagel
Hi Markus, any updates on this? Just to make sure the issue gets resolved. Thanks, Sebastian

On 14.10.19 17:08, Markus Jelsma wrote:
> Hello, we're upgrading our stuff to 1.16 and got a peculiar problem when we started indexing:
> 2019-10-14 13:50:30,586 WARN [main] org.apache.hadoop.mapred.Yar…