No, thank you for taking the time to look at it! I'm still on the case but am hoping you'll find the problem.

Dean.

On 09/01/2012 14:24, Lewis John Mcgibbney wrote:
Hi Dean,

I'll have a look into this later today if I get a chance. Anyone else
experiencing problems using the mergesegs command or code?

Thanks for persisting with this, Dean; hopefully we will get to the
bottom of it soon.

On Mon, Jan 9, 2012 at 1:31 PM, Dean Pullen <[email protected]> wrote:
Looking through the code, I'm seeing
org.apache.nutch.segment.SegmentMerger.reduce(..) being called only for
crawl_fetch and crawl_generate.

Prior to this, org.apache.nutch.segment.SegmentMerger.getRecordWriter(...)
gets called for all components, i.e. crawl_generate, crawl_fetch,
crawl_parse, parse_data and parse_text.

I'm not quite sure what's going on in between these two calls...
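
For anyone wanting to check, listing the merged segment should show which
parts actually made it through the merge (the path below is the segment
from my earlier log output):

# A complete segment contains crawl_generate, crawl_fetch, crawl_parse,
# parse_data and parse_text; the LinkDb error suggests parse_data is missing.
ls /opt/nutch_1_4/data/crawl/segments/20120106171547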

Dean.



On 08/01/2012 22:51, Dean Pullen wrote:
Where do we go from here? I can start looking/stepping through the
mergesegs code, but I'm reluctant due to its probable complexity.

Dean.


On 08/01/2012 14:26, Dean Pullen wrote:
No Lewis, -linkdb was already being used for the solrindex command, so we
still have the same problem.

Many thanks,

Dean

On 08/01/2012 14:08, Lewis John Mcgibbney wrote:
Hi Dean, is this sorted?

On Saturday, January 7, 2012, Dean Pullen <[email protected]> wrote:
Sorry, you did mean on solrindex - which I already do...

On 07/01/2012 13:15, Dean Pullen wrote:

The -linkdb param isn't in the invertlinks docs
http://wiki.apache.org/nutch/bin/nutch_invertlinks
(However it is in the solrindex docs)

Adding it makes no difference to invertlinks.

I think the problem is definitely with mergesegs, as opposed to
invertlinks etc.
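
For reference, the merge step I'm running is of this general form (the
output directory name here is hypothetical):

bin/nutch mergesegs /opt/nutch_1_4/data/crawl/merged_segments -dir /opt/nutch_1_4/data/crawl/segments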
Thanks again,

Dean.

On 06/01/2012 17:53, Lewis John Mcgibbney wrote:

OK, so now I think we're at the bottom of it. If you wish to create a
linkdb in Nutch >= 1.4, you need to pass the linkdb parameter
explicitly. This was implemented because not everyone wishes to create a
linkdb.

Your invertlinks command should be passed as follows:

bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir /path/to/segment/dirs
then
bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb path/to/linkdb -dir path/to/segment/dirs

If you do not pass -linkdb path/to/linkdb explicitly, an exception will
be thrown, as the linkdb is now treated as a segment directory.
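
With the paths from your log below, that would look roughly like this (a
sketch; the crawldb, linkdb and segment locations are taken from the log
output):

bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb -dir /opt/nutch_1_4/data/crawl/segments
then
bin/nutch solrindex http://solrUrl /opt/nutch_1_4/data/crawl/crawldb -linkdb /opt/nutch_1_4/data/crawl/linkdb -dir /opt/nutch_1_4/data/crawl/segments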

On Fri, Jan 6, 2012 at 5:17 PM, Dean Pullen <[email protected]> wrote:
Only this:

2012-01-06 17:15:47,972 WARN  mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2012-01-06 17:15:48,692 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2012-01-06 17:15:51,566 INFO  crawl.LinkDb - LinkDb: starting at 2012-01-06 17:15:51
2012-01-06 17:15:51,567 INFO  crawl.LinkDb - LinkDb: linkdb: /opt/nutch_1_4/data/crawl/linkdb
2012-01-06 17:15:51,567 INFO  crawl.LinkDb - LinkDb: URL normalize: true
2012-01-06 17:15:51,567 INFO  crawl.LinkDb - LinkDb: URL filter: true
2012-01-06 17:15:51,576 INFO  crawl.LinkDb - LinkDb: adding segment: file:/opt/nutch_1_4/data/crawl/segments/20120106171547
2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
    at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)

2012-01-06 17:15:52,714 INFO  solr.SolrIndexer - SolrIndexer: starting at 2012-01-06 17:15:52
2012-01-06 17:15:52,782 INFO  indexer.IndexerMapReduce - IndexerMapReduce: crawldb: /opt/nutch_1_4/data/crawl/crawldb
2012-01-06 17:15:52,782 INFO  indexer.IndexerMapReduce - IndexerMapReduce: linkdb: /opt/nutch_1_4/data/crawl/linkdb



