Not sure why Gmail keeps sending my replies to individual people instead of back to the list. I'll have to keep a better eye out for it.
---------- Forwarded message ----------
From: Bai Shen <[email protected]>
Date: Tue, Sep 27, 2011 at 1:38 PM
Subject: Re: Understanding Nutch workflow
To: [email protected]

> > I didn't mean that the segment would contain every unfetched url that
> > was in the db, if that's what you mean.
> >
> > I don't think I've hit more than 5000 urls in my current segments. At
> > least that's the highest I've seen the queue. Is there a way to
> > determine how many urls are in a segment?
>
> Sure, segment X contains the same number of URLs as there are reduce
> output records in the partitioner job for X. You can see that statistic
> in the output of every mapred job.

How do I see the output of the mapred job? I don't recall seeing anything
like that in the log file.

> > What kind of connection do you use to fetch 500k urls? What are your
> > fetcher threads set to?
>
> We usually don't exceed 30mbit/second in short bursts per node with 128
> threads. This only happens for many small fetch queues, e.g. a few URLs
> (e.g. 2) for 250,000 domains. Then it's fast.

I see. I've been using 100 threads with 10 per host, and that seems to
saturate the current connection pretty well, and that's just from one
machine. Which is why I was wondering about your splitting of segments.
What machine limitations have you run into?

> > So the downloaded data gets stored in the segment directories, not the
> > mapreduce temp files? Why does mapreduce get so large then?
>
> It is stored in the tmp during the job and written to the segment in
> the reducer.

Do the mapred jobs require a factor of four in overhead? The fetch
downloaded 12GB of data, but the mapred dir was around 50GB (I think).
I'm just trying to understand what it's doing to use all that space.

> > And any parse filter plugins are only used to search for urls, right?
> > So if I'm worried about additional indexing, this is not the place to
> > be looking, correct?
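On the question above about how many URLs are in a segment, a sketch of how to inspect that from the command line, assuming a Nutch 1.x install at $NUTCH_HOME and a local crawl directory named crawl/ (the segment timestamp and all paths below are placeholders, not taken from this thread):

```shell
# List per-segment statistics; readseg -list prints one summary row per
# segment, including generated/fetched/parsed URL counts.
$NUTCH_HOME/bin/nutch readseg -list crawl/segments/20110927123456

# Or summarize every segment under the segments directory at once.
$NUTCH_HOME/bin/nutch readseg -list -dir crawl/segments

# In local mode the Hadoop job counters (including the reduce output
# record counts mentioned above) end up in the log file, so they can be
# grepped after a crawl step rather than watched on the console:
grep -i "reduce output records" $NUTCH_HOME/logs/hadoop.log
```

These commands need a real Nutch checkout to run; the readseg option names are from the 1.x SegmentReader tool, so double-check them against `bin/nutch readseg` with no arguments on your version.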
> No no, a parse filter can, for instance, extract information from the
> parsed DOM such as headings, meta elements, or whatever, and output it
> as a field.

I see. I don't think I'll need that, but we'll see once I get the rest
working.

> > What do you mean? What is the current schema if not schema.xml? My
> > understanding is that the schema.xml file in the Nutch conf dir
> > should be the same as the schema.xml file in Solr.
>
> The provided schema file is only an example; Nutch does not use it, but
> Solr does. You must copy the schema from Nutch to Solr, that's all. We
> ship it for completeness. Later we might ship other Solr files for
> better integration on the Solr side, such as Velocity template files.

Ah, okay. So I just need to add the Nutch fields to my current Solr
schema.

> > If I want to modify and add additional indexing, how would I set that
> > up? I swapped out the schema.xml file, but wasn't able to get the
> > solrindex command to work. It kicked back the error that I was
> > missing the site field.
>
> If you want to add new fields you must create or modify indexing
> plugins such as index-basic, index-more, index-anchor.

Okay. I'll take a look at the plugin documentation and see if I can
figure that out. BTW, do you know what the timeline is to have the
documentation updated for 1.3?
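For the schema discussion above, the copy step is literally just a file copy followed by an index push. A sketch, assuming default Nutch 1.3 and Solr example layouts at $NUTCH_HOME and $SOLR_HOME (these paths and the Solr URL are illustrative, not from the thread):

```shell
# Copy the example schema shipped with Nutch into Solr's conf directory.
# As noted above, Solr reads this file; Nutch itself never uses it.
cp $NUTCH_HOME/conf/schema.xml $SOLR_HOME/example/solr/conf/schema.xml

# Restart Solr so it picks up the new schema, then push the crawl data.
# Nutch 1.3 takes the crawldb, linkdb, and segments as positional args:
$NUTCH_HOME/bin/nutch solrindex http://localhost:8983/solr/ \
    crawl/crawldb crawl/linkdb crawl/segments/*
```

If you're merging Nutch's fields into an existing Solr schema instead of replacing it, the "missing the site field" error above is what you get when a field Nutch emits (like site) isn't declared on the Solr side, so carry over all the field definitions, not just the ones you plan to search on.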

