This is helpful -- can someone also explain whether there is a mechanism to extract the full text of pages from where they are stored by the MapReduce jobs?
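For what it's worth, the stock SegmentReader tool (`bin/nutch readseg`) can dump what a fetch/parse round stored in a segment, including the extracted plain text. A minimal sketch, assuming the same `crawl/segments` layout as the script quoted below (the output directory names are just examples):

```shell
# Pick the most recent segment, same convention as in the quoted mail.
s1=`ls -d crawl/segments/2* | tail -1`

# Dump everything Nutch stored for that segment (raw content, fetch status,
# parse data, parse text) as plain text under dump_all/.
bin/nutch readseg -dump $s1 dump_all

# To get only the extracted full text of each page, suppress the other
# parts so that just parse_text is written out.
bin/nutch readseg -dump $s1 dump_text \
  -nocontent -nofetch -nogenerate -noparse -noparsedata
```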
On Tue, Sep 27, 2011 at 11:24, Bai Shen <[email protected]> wrote:
> I'm trying to understand exactly what the Nutch workflow is and I have a
> few questions. From the tutorial:
>
> bin/nutch inject crawldb urls
>
> This takes a list of urls and creates a database of urls for Nutch to
> fetch.
>
> bin/nutch generate crawldb segments
>
> This generates a segment which contains all of the urls that need to be
> fetched. From what I understand, you can generate multiple segments, but
> I'm not sure I see the value in that, as the fetching is primarily limited
> by your connection, not any one machine.
>
> s1=`ls -d crawl/segments/2* | tail -1`
> echo $s1
> bin/nutch fetch $s1
>
> This takes the segment and fetches all of the content and stores it in
> Hadoop for the MapReduce job. I'm not quite sure how that works, as when
> I ran the fetch, my connection showed 12GB of data downloaded, yet the
> Hadoop directory was using over 40GB of space. Is this normal?
>
> bin/nutch parse $s1
>
> This parses the fetched data using Hadoop in order to extract more urls
> to fetch. It doesn't do any actual indexing, however. Is this correct?
>
> bin/nutch updatedb crawldb $s1
>
> Now the parsed urls are added back to the initial database of urls.
>
> bin/nutch invertlinks linkdb -dir segments
>
> I'm not exactly sure what this does. The tutorial says "Before indexing we
> first invert all of the links, so that we may index incoming anchor text
> with the pages." Does that mean that if there's a link such as
> <A HREF="url">click here for more info</A>, it adds the "click here for
> more info" text to the database for indexing in addition to the actual
> linked content?
>
> bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb \
>   crawl/linkdb crawl/segments/*
>
> This is where the actual indexing takes place, correct? Or is Nutch just
> posting the various documents to Solr and leaving Solr to do the actual
> indexing? Is this the only step that uses the schema.xml file?
>
> Thanks.
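The quoted steps can be strung together into a single crawl round. A minimal sketch of one round as a script; note I've normalized everything under a `crawl/` directory (the quoted mail mixes `crawldb` and `crawl/crawldb`), so adjust the paths to your own layout:

```shell
#!/bin/sh
# One Nutch crawl round, following the tutorial steps quoted above.

bin/nutch inject crawl/crawldb urls              # seed the crawldb with the url list
bin/nutch generate crawl/crawldb crawl/segments  # create a new fetch list (segment)

s1=`ls -d crawl/segments/2* | tail -1`           # newest segment directory
bin/nutch fetch $s1                              # download the pages
bin/nutch parse $s1                              # extract text and outlinks
bin/nutch updatedb crawl/crawldb $s1             # feed discovered urls back into the crawldb

bin/nutch invertlinks crawl/linkdb -dir crawl/segments  # build incoming-anchor-text map
bin/nutch solrindex http://127.0.0.1:8983/solr/ \
  crawl/crawldb crawl/linkdb crawl/segments/*           # push documents to Solr
```

Repeating the generate/fetch/parse/updatedb steps deepens the crawl by one link hop per round.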

