RE: Sequence File Question

2007-03-29 Thread Steve Severance
 -Original Message-
 From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, March 28, 2007 4:34 PM
 To: nutch-dev@lucene.apache.org
 Subject: Re: Sequence File Question
 
 Steve Severance wrote:
  Let me refine that question: why do some directories like the linkdb
  have a current, and why do others like parse_data not? Is there a
  convention on this?
 
 First, to answer your original question: you should use
 MapFileOutputFormat class for reading such output. It handles these
 part- subdirectories automatically.
 
 Second, the current subdirectory is there in order to properly handle
 DB updates - or actually replacements - see e.g. CrawlDb.install()
 method for details. This is not needed in case of segments, which are
 created once and never updated.

How does the reader know which layout to expect? For instance, I can make a
reader for a linkDb just by instantiating it on the directory crawl/linkdb,
and it knows to go inside the current directory. But when opening parse_data
there is no current. So how does it know which to expect?

Steve

 
 Thirdly, although you didn't ask about it ;) the latest version of
 Hadoop contains a handy facility called Counters - if you use the PR
 PowerMethod you need to collect PR from dangling nodes in order to
 redistribute it later. You can use Counters for this, and save on a
 separate aggregation step.
 
 
 --
 Best regards,
 Andrzej Bialecki 
   ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com



Re: Sequence File Question

2007-03-29 Thread Andrzej Bialecki

Steve Severance wrote:

DB updates - or actually replacements - see e.g. CrawlDb.install()
 method for details. This is not needed in case of segments, which 
are created once and never updated.


How does the reader know which layout to expect? For instance, I can 
make a reader for a linkDb just by instantiating it on the directory 
crawl/linkdb, and it knows to go inside the current directory. But when 
opening parse_data there is no current. So how does it know which to 
expect?


Use the Source, Luke ;) The readers follow this (arbitrary) naming
convention: we always use a current subdirectory when working with LinkDb
and CrawlDb, and a different convention when we use SegmentReader.

One comment: CrawlDbReader, LinkDbReader and SegmentReader are Nutch
classes. However, the real data is stored using Hadoop classes,
specifically MapFileOutputFormat. CrawlDbReader knows about the Nutch naming
convention and always appends current to the db name. But if you were
to use MapFileOutputFormat.getReaders() directly, this Hadoop class of
course doesn't know about the convention, so you need to provide a full
path that includes current.
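
To make the two conventions concrete, here is a minimal sketch against the old (0.x) org.apache.hadoop.mapred API that this thread predates 1.0 Hadoop; the paths and the segment name are made-up examples, not anything from the original messages:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.mapred.MapFileOutputFormat;

public class ReadersSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // LinkDb/CrawlDb: the Nutch convention adds a "current" layer, so a
    // raw getReaders() call must spell it out in the path.
    MapFile.Reader[] dbReaders =
        MapFileOutputFormat.getReaders(fs, new Path("crawl/linkdb/current"), conf);

    // Segment data has no "current" layer; point at the data directory
    // itself (the segment name here is hypothetical).
    MapFile.Reader[] segReaders =
        MapFileOutputFormat.getReaders(fs,
            new Path("crawl/segments/20070328160000/parse_data"), conf);

    // getReaders() opens one MapFile.Reader per part-* subdirectory,
    // which is what makes the part-* layout transparent to the caller.
    System.out.println(dbReaders.length + " / " + segReaders.length);
  }
}
```

Running this requires a Hadoop filesystem with those directories in place; it is meant as a shape, not a ready-to-run tool.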


--
Best regards,
Andrzej Bialecki 



RE: Sequence File Question

2007-03-28 Thread Steve Severance
Let me refine that question: why do some directories like the linkdb
have a current, and why do others like parse_data not? Is there a convention
on this?

Steve

 -Original Message-
 From: Steve Severance [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, March 28, 2007 4:11 PM
 To: nutch-dev@lucene.apache.org
 Subject: Sequence File Question
 
 Hey guys,
 I have a mapreduce job that sets up a directory for pagerank. It iterates
 over all the segments and then outputs a MapFile containing the data. When I
 go to open the output directory with another MapReduce job, it fails,
 saying that it cannot find the path. The path that it thinks it is trying to
 open does not include the part-0 directory. Both my directory (and all
 other directories, for that matter) have the same structure, which is
 /path/part-0/whatever. I feel like this is a really stupid error and I
 have forgotten something that is easily fixed. Any ideas?
 
 Steve



Re: Sequence File Question

2007-03-28 Thread Andrzej Bialecki

Steve Severance wrote:

Let me refine that question: why do some directories like the linkdb
have a current, and why do others like parse_data not? Is there a convention
on this?


First, to answer your original question: you should use 
MapFileOutputFormat class for reading such output. It handles these 
part- subdirectories automatically.


Second, the current subdirectory is there in order to properly handle 
DB updates - or actually replacements - see e.g. CrawlDb.install() 
method for details. This is not needed in case of segments, which are 
created once and never updated.
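
The replace-on-update idea behind CrawlDb.install() can be illustrated in plain java.nio, independent of Hadoop: the freshly written DB is promoted to current, and the previous current is kept as old. The class and method names here are hypothetical stand-ins; the real Nutch code does this with Hadoop's FileSystem API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Plain-Java sketch of the "current"/"old" rotation used for DB
// replacement. Not Nutch code; same idea, simpler API.
public class DbInstall {
  public static void install(Path newDb, Path crawlDb) throws IOException {
    Path current = crawlDb.resolve("current");
    Path old = crawlDb.resolve("old");
    if (Files.exists(old)) {
      deleteRecursively(old);      // discard the previous backup
    }
    if (Files.exists(current)) {
      Files.move(current, old);    // keep the last version as "old"
    }
    Files.createDirectories(crawlDb);
    Files.move(newDb, current);    // promote the new output to "current"
  }

  static void deleteRecursively(Path p) throws IOException {
    if (Files.isDirectory(p)) {
      try (var entries = Files.list(p)) {
        for (Path child : entries.toList()) {
          deleteRecursively(child);
        }
      }
    }
    Files.delete(p);
  }
}
```

Because readers only ever look under current, a half-finished job writing to a temporary directory never disturbs them; the swap happens in the two rename calls.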


Thirdly, although you didn't ask about it ;) the latest version of 
Hadoop contains a handy facility called Counters - if you use the PR 
PowerMethod you need to collect PR from dangling nodes in order to 
redistribute it later. You can use Counters for this, and save on a 
separate aggregation step.
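
The Counters idea might look roughly like the mapper below, written against Hadoop's old (0.x) org.apache.hadoop.mapred API. The enum name, the 1e6 scaling (counters are long-valued, so a float rank has to be scaled), and the lookupOutlinks() helper are all assumptions for illustration, not Nutch or Hadoop code:

```java
import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hedged sketch: banking PageRank mass from dangling nodes in a
// counter instead of running a separate aggregation job.
public class DanglingMassMapper extends MapReduceBase
    implements Mapper<Text, FloatWritable, Text, FloatWritable> {

  // Counters hold longs, so store rank * 1e6 and divide after the job.
  public enum PrCounter { DANGLING_MASS_E6 }

  public void map(Text url, FloatWritable rank,
                  OutputCollector<Text, FloatWritable> out,
                  Reporter reporter) throws IOException {
    if (!lookupOutlinks(url)) {
      // Dangling node: accumulate its rank in the counter so the
      // driver can redistribute it in the next iteration.
      reporter.incrCounter(PrCounter.DANGLING_MASS_E6,
                           (long) (rank.get() * 1_000_000L));
    } else {
      // A real PR job would emit rank/outdegree per outlink here;
      // this sketch just passes the rank through.
      out.collect(url, rank);
    }
  }

  // Placeholder for a real outlink lookup against the linkdb.
  private boolean lookupOutlinks(Text url) { return true; }
}
```

After the job completes, the driver would read the counter back from the finished job's counters and fold the dangling mass into the next iteration, saving the extra MapReduce pass.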



--
Best regards,
Andrzej Bialecki 