Two iterations give the same result - the parse_data directory is still missing.

Interestingly, running mergesegs on just ONE crawl also removes the parse_data dir (and the rest of the parse output)!

Dean.


On 06/01/2012 16:28, Lewis John Mcgibbney wrote:
How about merging segs after every subsequent iteration of the crawl
cycle... surely this is a problem with producing the specific
parse_data directory. If it doesn't work after two iterations then we
know that it is happening early on in the crawl cycle. Have you
manually checked that the directories exist after fetching and
parsing?
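
For example, something like this (paths taken from your commands
below; just a quick sanity check) would show what each segment
contains:

for seg in /opt/nutch_1_4/data/crawl/segments/*; do
  echo "$seg:"
  ls "$seg"
done

After a fetch and parse you would expect to see crawl_fetch,
crawl_parse, parse_data and parse_text alongside crawl_generate in
each segment.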

On Fri, Jan 6, 2012 at 4:24 PM, Dean Pullen <[email protected]> wrote:
Good spot - all of that was meant to be removed! No, it wasn't
intentional; I'm afraid that's just a copy/paste problem.

Dean

On 06/01/2012 16:17, Lewis John Mcgibbney wrote:
Ok then,

How about your generate command:

2) GENERATE:
/opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
/opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26

Your <segments_dir> seems to point to /opt/semantico/slot/etc/etc/etc,
when everything else being utilised within the crawl cycle points to
an entirely different <segments_dir> path, which is
/opt/nutch_1_4/data/crawl/segments/segment_date

Was this intentional?

On Fri, Jan 6, 2012 at 4:08 PM, Dean Pullen <[email protected]> wrote:
Lewis,

Changing the merge -dir to use the /* wildcard returns a similar response:

LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern
file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files

And yes, your assumption was correct - it's a different segment directory
each loop.

Many thanks,

Dean.

On 06/01/2012 15:43, Lewis John Mcgibbney wrote:
Hi Dean,

Without discussing any of your configuration properties, can you
please try

6) MERGE SEGMENTS:
/opt/nutch_1_4/bin/nutch mergesegs
/opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
/opt/nutch_1_4/data/crawl/segments/* -filter -normalize

paying attention to the wildcard /* in -dir
/opt/nutch_1_4/data/crawl/segments/*

Also, presumably when you mention you repeat steps 2-5 another 4
times, you are not repeatedly generating, fetching, parsing and
updating the WebDB with the same segment,
/opt/nutch_1_4/data/crawl/segments/20120106152527? The segment
directory should change with every iteration of the
generate/fetch/parse/updatedb cycle.
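
If it helps, one way to script that (just a sketch, assuming each
generate creates exactly one new segment) is to pick up the newest
segment directory after every generate, e.g.:

SEGMENT=`ls -d /opt/nutch_1_4/data/crawl/segments/* | sort | tail -1`
/opt/nutch_1_4/bin/nutch fetch $SEGMENT -threads 15
/opt/nutch_1_4/bin/nutch parse $SEGMENT -threads 15
/opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/ $SEGMENT -normalize -filter

rather than hard-coding the 20120106152527 timestamp.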

Thanks

On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen <[email protected]> wrote:
No problem Lewis, I appreciate you looking into it.


Firstly, I have a seed URL XML document here:
http://www.ukcigarforums.com/injectlist.xml
This basically has 'http://www.ukcigarforums.com/content.php' as a URL
within it.

Nutch's regex-urlfilter.txt contains this:

# allow urls in ukcigarforums.com domain
+http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
# deny anything else
-.


Here's the procedure:


1) INJECT:
/opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
/opt/nutch_1_4/data/seed/

2) GENERATE:
/opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
/opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26

3) FETCH:
/opt/nutch_1_4/bin/nutch fetch
/opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15

4) PARSE:
/opt/nutch_1_4/bin/nutch parse
/opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15

5) UPDATE DB:
/opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
/opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter


Repeat steps 2 to 5 another 4 times, then:

6) MERGE SEGMENTS:
/opt/nutch_1_4/bin/nutch mergesegs
/opt/nutch_1_4/data/crawl/MERGEDsegments/
-dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize


Interestingly, this prints out:
"SegmentMerger: using segment data from: crawl_generate crawl_fetch
crawl_parse parse_data parse_text"

The MERGEDsegments segment directory then contains just two of the
directories listed in that output: crawl_generate and crawl_fetch.

(We then delete everything from the segments directory and copy the
MERGEDsegments results into it.)
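
For clarity, that swap is roughly (plain filesystem operations,
sketch only):

rm -rf /opt/nutch_1_4/data/crawl/segments/*
cp -r /opt/nutch_1_4/data/crawl/MERGEDsegments/* /opt/nutch_1_4/data/crawl/segments/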


Lastly, we run invertlinks after the segment merge:

7) INVERT LINKS:
/opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/
-dir /opt/nutch_1_4/data/crawl/segments/

Which produces:

"LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path
does
not
exist:
file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
