You can use the segment reader to read downloaded content.
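For example (a sketch, assuming a Nutch 1.x install with `bin/nutch` on the path and the `crawl/segments` layout from the tutorial quoted below; `readseg` is the command-line front end for SegmentReader, and the `segdump` output directory name is made up here):

```shell
# Pick the most recent segment, same pattern as the tutorial's fetch step.
s1=`ls -d crawl/segments/2* | tail -1`

# Dump that segment to local text; the -no* flags suppress the parts you
# don't need (here: keep the raw content and parsed text, skip the rest).
bin/nutch readseg -dump $s1 segdump -nofetch -nogenerate -noparse -noparsedata

# The dump lands in segdump/dump as plain text, one record per url.
less segdump/dump
```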
> this is helpful -- can someone also explain whether there is a mechanism
> to extract the full text of pages from where they are stored in mapreduce?
>
> On Tue, Sep 27, 2011 at 11:24, Bai Shen <[email protected]> wrote:
> > I'm trying to understand exactly what the Nutch workflow is, and I have
> > a few questions. From the tutorial:
> >
> > bin/nutch inject crawldb urls
> >
> > This takes a list of urls and creates a database of urls for nutch to
> > fetch.
> >
> > bin/nutch generate crawldb segments
> >
> > This generates a segment which contains all of the urls that need to be
> > fetched. From what I understand, you can generate multiple segments,
> > but I'm not sure I see the value in that, as fetching is primarily
> > limited by your connection, not any one machine.
> >
> > s1=`ls -d crawl/segments/2* | tail -1`
> > echo $s1
> > bin/nutch fetch $s1
> >
> > This takes the segment, fetches all of the content, and stores it in
> > hadoop for the mapreduce job. I'm not quite sure how that works: when I
> > ran the fetch, my connection showed 12GB of data downloaded, yet the
> > hadoop directory was using over 40GB of space. Is this normal?
> >
> > bin/nutch parse $s1
> >
> > This parses the fetched data using hadoop in order to extract more urls
> > to fetch. It doesn't do any actual indexing, however. Is this correct?
> >
> > bin/nutch updatedb crawldb $s1
> >
> > Now the parsed urls are added back to the initial database of urls.
> >
> > bin/nutch invertlinks linkdb -dir segments
> >
> > I'm not exactly sure what this does. The tutorial says "Before indexing
> > we first invert all of the links, so that we may index incoming anchor
> > text with the pages." Does that mean that if there's a link such as
> > <A HREF="url">click here for more info</A>, it adds the "click here for
> > more info" text to the database for indexing in addition to the actual
> > link content?
> >
> > bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb \
> >   crawl/linkdb crawl/segments/*
> >
> > This is where the actual indexing takes place, correct? Or is Nutch
> > just posting the various documents to Solr and leaving Solr to do the
> > actual indexing? Is this the only step that uses the schema.xml file?
> >
> > Thanks.
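On the invertlinks question: yes, that is the idea. Inversion re-keys each (source, target, anchor) outlink record by its target, so that at indexing time each page's document can carry the anchor text of links pointing *at* it. A toy, Nutch-free sketch of the inversion step itself (the file names and urls here are invented for illustration; Nutch does this as a mapreduce job over the segments' parse data, not with awk):

```shell
# outlinks.tsv: one outlink per line, as  source <TAB> target <TAB> anchor
printf 'http://a.example/\thttp://b.example/info\tclick here for more info\n' >  outlinks.tsv
printf 'http://c.example/\thttp://b.example/info\tdetails\n'                  >> outlinks.tsv

# Inverting = re-keying each record by target instead of source, then
# sorting so all of a page's incoming anchors end up grouped together.
awk -F'\t' '{ print $2 "\t" $1 "\t" $3 }' outlinks.tsv | sort > linkdb.tsv
cat linkdb.tsv
```

After this, every line keyed by `http://b.example/info` lists a page that links to it plus the anchor text used, which is exactly the extra text the solrindex step can add to that page's document.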

