Hi,
I have been running nutch in local mode and so far I have a good
understanding of how it all works.
I wanted to start with distributed crawling using some public cloud
provider.
I just wanted to know if fellow users have any experience in setting up
nutch for distributed crawling.
Hi Sachin,
> does mergesegs by default update the
> crawldb once it merges all the segments?
No it does not. That's already evident from the command-line help
(no CrawlDb passed as parameter):
$> bin/nutch mergesegs
SegmentMerger output_dir (-dir segments | seg1 seg2 ...) [-filter]
...
> Or do we have to call the updatedb command on the merged segment to update
> the crawldb so that it has all the information for the next cycle?
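Yes, since mergesegs leaves the CrawlDb untouched, a separate updatedb run
over the merged segment is needed before the next generate/fetch cycle. As a
minimal sketch (the directory names crawl/crawldb, crawl/segments and
crawl/MERGEDsegments are just examples, not fixed by Nutch):

# merge all existing segments into one new segment under crawl/MERGEDsegments
$> bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments -filter
# mergesegs does not update the CrawlDb, so do it explicitly
# from the merged segment before generating the next fetch list
$> bin/nutch updatedb crawl/crawldb -dir crawl/MERGEDsegments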
Ok.
Understood.
I had one question though: does mergesegs by default update the
crawldb once it merges all the segments?
Or do we have to call the updatedb command on the merged segment to update
the crawldb so that it has all the information for the next cycle?
Thanks
Sachin
On Tue, Oct 22
Hi Sachin,
> I want to know once a new segment is generated is there any use of
> previous segments and can they be deleted?
As soon as a segment is indexed and the CrawlDb is updated from this
segment, you may delete it (see the sketch below). But keeping older segments allows
- reindexing in case something went wrong
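For illustration, one crawl cycle might look like the following sketch
(assuming the default local-mode layout; the segment name is hypothetical,
it is generated from a timestamp):

$> bin/nutch generate crawl/crawldb crawl/segments
# the new segment gets a timestamp-based name, e.g. crawl/segments/20191022123456
$> bin/nutch fetch crawl/segments/20191022123456
$> bin/nutch parse crawl/segments/20191022123456
$> bin/nutch updatedb crawl/crawldb crawl/segments/20191022123456
$> bin/nutch index crawl/crawldb crawl/segments/20191022123456
# only after the segment has been indexed and the CrawlDb updated from it
# is it safe to remove, unless you keep it around for reindexing
$> rm -r crawl/segments/20191022123456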
Hi Markus,
any updates on this? Just to make sure the issue gets resolved.
Thanks,
Sebastian
On 14.10.19 17:08, Markus Jelsma wrote:
Hello,
We're upgrading our stuff to 1.16 and got a peculiar problem when we started
indexing:
2019-10-14 13:50:30,586 WARN [main] org.apache.hadoop.mapred.Yar