Where do we go from here? I can start looking/stepping through the
mergesegs code, but I'm reluctant due to its probable complexity.
Dean.
On 08/01/2012 14:26, Dean Pullen wrote:
No Lewis, -linkdb was already being used for the solrindex command, so
we still have the same problem.
Many thanks,
Dean
On 08/01/2012 14:08, Lewis John Mcgibbney wrote:
Hi Dean, is this sorted?
On Saturday, January 7, 2012, Dean Pullen <[email protected]> wrote:
Sorry, you did mean on solrindex - which I already do...
On 07/01/2012 13:15, Dean Pullen wrote:
The -linkdb param isn't in the invertlinks docs
http://wiki.apache.org/nutch/bin/nutch_invertlinks
(However it is in the solrindex docs)
Adding it makes no difference to invertlinks.
I think the problem is definitely with mergesegs, as opposed to
invertlinks etc.
Thanks again,
Dean.
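A quick way to sanity-check the merge step (paths here are hypothetical, not taken from the thread) is to re-run mergesegs into a fresh directory and confirm the merged segment still carries its parse data:

# merge every segment under crawl/segments into one new segment
bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments
# a fully parsed segment should show six subdirectories, parse_data among them
ls crawl/MERGEDsegments/*/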
On 06/01/2012 17:53, Lewis John Mcgibbney wrote:
OK, so now I think we're at the bottom of it. If you wish to create a
linkdb in >= Nutch 1.4 you need to pass the linkdb parameter
explicitly. This was implemented because not everyone wishes to create
a linkdb.
Your invertlinks command should be passed as follows:
bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir /path/to/segment/dirs
then
bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb path/to/linkdb -dir path/to/segment/dirs
If you do not pass -linkdb path/to/linkdb explicitly, an exception
will be thrown, because the linkdb path is now treated as a segment
directory.
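To make the failure mode concrete, here is a minimal sketch with a placeholder Solr URL and placeholder paths. In the first form the linkdb path is parsed as if it were a segment, which is exactly what triggers the exception:

# wrong under Nutch >= 1.4: crawl/linkdb is treated as a segment directory
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb crawl/linkdb -dir crawl/segments
# right: the linkdb is named explicitly with -linkdb
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments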
On Fri, Jan 6, 2012 at 5:17 PM, Dean Pullen <[email protected]> wrote:
Only this:
2012-01-06 17:15:47,972 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at 2012-01-06 17:15:51
2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb: /opt/nutch_1_4/data/crawl/linkdb
2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize: true
2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true
2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment: file:/opt/nutch_1_4/data/crawl/segments/20120106171547
2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
    at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer: starting at 2012-01-06 17:15:52
2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: /opt/nutch_1_4/data/crawl/crawldb
2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: /opt/nutch_1_4/data/crawl/linkdb
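The ERROR line above is the telling one: LinkDb only accepts segments that have been parsed. A quick check, using the same path as in the log, would be to list the segment and compare it against the six subdirectories a complete segment contains:

ls /opt/nutch_1_4/data/crawl/segments/20120106171547
# a complete segment holds: content crawl_fetch crawl_generate crawl_parse parse_data parse_text
# here parse_data is absent, which is precisely what LinkDb reports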