Hi Sachin,

> does mergesegs by default update the
> crawldb once it merges all the segments?

No, it does not. That's already evident from the command-line help
(no CrawlDb is passed as a parameter):

$> bin/nutch mergesegs
SegmentMerger output_dir (-dir segments | seg1 seg2 ...) [-filter]
...

> Or do we have to call the updatedb command on the merged segment to
> update the crawldb so that it has all the information for the next
> cycle?

One segment usually holds the fetch list and content from one cycle. The
updatedb command should be called every cycle (for the latest segment);
the script bin/crawl does this. There is no need to call updatedb again
with the merged segment.
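
For illustration, a single per-cycle update might look like this (the
crawldb path and the segment name are only placeholders, adjust them to
your crawl directory layout):

$> bin/nutch updatedb crawl/crawldb crawl/segments/20191022103000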

Best,
Sebastian

On 10/22/19 11:43 AM, Sachin Mittal wrote:
Ok.
Understood.

I had one question though: does mergesegs by default update the
crawldb once it merges all the segments?
Or do we have to call the updatedb command on the merged segment to update
the crawldb so that it has all the information for the next cycle?

Thanks
Sachin


On Tue, Oct 22, 2019 at 1:32 PM Sebastian Nagel
<wastl.na...@googlemail.com.invalid> wrote:

Hi Sachin,

  > I want to know once a new segment is generated is there any use of
  > previous segments and can they be deleted?

As soon as a segment is indexed and the CrawlDb is updated from this
segment, you may delete the segment. But keeping older segments allows:
- reindexing in case something went wrong with the index
- debugging: checking the HTML of a page
When segments are merged, only the most recent record per URL is kept.
This saves storage space but requires running the mergesegs tool.
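
As a sketch (the directory names are only illustrative), merging and then
replacing the old segments could be done like this:

$> bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments
$> rm -rf crawl/segments
$> mv crawl/MERGEDsegments crawl/segments

The output directory then contains a single merged segment.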

  > Also when we then start the fresh crawl cycle how do we instruct
  > nutch to use this new merged segment, or it automatically picks up
  > the newest segment as starting point?

The CrawlDb contains all necessary information for the next cycle.
It's mandatory to update the CrawlDb (command "updatedb") for each
segment; this transfers the fetch status information (fetch time, HTTP
status, signature, etc.) from the segment to the CrawlDb.
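
For reference, one cycle run step by step looks roughly like this (paths
and the segment name are placeholders; Nutch names segments by timestamp,
and the bin/crawl script wraps these steps for you):

$> bin/nutch generate crawl/crawldb crawl/segments -topN 1000
$> bin/nutch fetch crawl/segments/20191022103000
$> bin/nutch parse crawl/segments/20191022103000
$> bin/nutch updatedb crawl/crawldb crawl/segments/20191022103000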

Best,
Sebastian

On 10/22/19 6:59 AM, Sachin Mittal wrote:
Hi,
I have been crawling using nutch.
What I have understood is that for each crawl cycle it creates a segment,
and for the next crawl cycle it uses the outlinks from the previous segment
to generate and fetch the next set of URLs to crawl. Then it creates a new
segment with those URLs.

I want to know once a new segment is generated is there any use of
previous segments and can they be deleted?

I also see a command line tool mergesegs
<https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=122916832>.
Does it make sense to use this to merge old segments into a new segment
before deleting the old segments?

Also when we then start the fresh crawl cycle how do we instruct nutch to
use this new merged segment, or it automatically picks up the newest
segment as starting point?

Thanks
Sachin




