Not sure why Gmail keeps sending my replies to individual people instead of back to the list. I'll have to keep a better eye out for it.
---------- Forwarded message ----------
From: Bai Shen <[email protected]>
Date: Tue, Sep 27, 2011 at 1:38 PM
Subject: Re: Understanding Nutch workflow
To: [email protected]

> > I didn't mean that the segment would contain every unfetched url that
> > was in the db, if that's what you mean.
> >
> > I don't think I've hit more than 5000 urls in my current segments. At
> > least that's the highest I've seen the queue. Is there a way to
> > determine how many urls are in a segment?
>
> Sure, segment X contains the same number of URLs as there are reduce
> output records in the partitioner job for X. You can see that statistic
> in the output of every mapred job.

How do I see the output of the mapred job? I don't recall seeing anything
like that in the log file.

> > What kind of connection do you use to fetch 500k urls? What are your
> > fetcher threads set to?
>
> We usually don't exceed 30mbit/second in short bursts per node with 128
> threads. This only happens for many small fetch queues, e.g. a few URLs
> (e.g. 2) for 250,000 domains. Then it's fast.

I see. I've been using 100 threads with 10 per host, and that seems to
saturate the current connection pretty well, and that's just from one
machine. Which is why I was wondering about your splitting of segments.
What machine limitations have you run into?

> > So the downloaded data gets stored in the segment directories, not the
> > mapreduce temp files? Why does mapreduce get so large then?
>
> It is stored in the tmp during the job and written to the segment in
> the reducer.

Do the mapred jobs require a factor of four in overhead? The fetch
downloaded 12GB of data, but the mapred dir was around 50GB (I think).
I'm just trying to understand what it's doing to use all that space.

> > And any parse filter plugins are only used to search for urls, right?
> > So if I'm worried about additional indexing, this is not the place to
> > be looking, correct?
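On the question above about how many URLs are in a segment, a sketch of how to inspect that from the command line, assuming a Nutch 1.x install at $NUTCH_HOME and a local crawl directory named crawl/ (the segment timestamp and all paths below are placeholders, not taken from this thread):

```shell
# List per-segment statistics; readseg -list prints one summary row per
# segment, including generated/fetched/parsed URL counts.
$NUTCH_HOME/bin/nutch readseg -list crawl/segments/20110927123456

# Or summarize every segment under the segments directory at once.
$NUTCH_HOME/bin/nutch readseg -list -dir crawl/segments

# In local mode the Hadoop job counters (including the reduce output
# record counts mentioned above) end up in the log file, so they can be
# grepped after a crawl step rather than watched on the console:
grep -i "reduce output records" $NUTCH_HOME/logs/hadoop.log
```

These commands need a real Nutch checkout to run; the readseg option names are from the 1.x SegmentReader tool, so double-check them against `bin/nutch readseg` with no arguments on your version.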
> No no, a parse filter can, for instance, extract information from the
> parsed DOM such as headings, meta elements, or whatever, and output it
> as a field.

I see. I don't think I'll need that, but we'll see once I get the rest
working.

> > What do you mean? What is the current schema if not schema.xml? My
> > understanding is that the schema.xml file in the Nutch conf dir
> > should be the same as the schema.xml file in Solr.
>
> The provided schema file is only an example; Nutch does not use it, but
> Solr does. You must copy the schema from Nutch to Solr, that's all. We
> ship it for completeness. Later we might ship other Solr files for
> better integration on the Solr side, such as Velocity template files.

Ah, okay. So I just need to add the Nutch fields to my current Solr
schema.

> > If I want to modify and add additional indexing, how would I set that
> > up? I swapped out the schema.xml file, but wasn't able to get the
> > solrindex command to work. It kicked back the error that I was
> > missing the site field.
>
> If you want to add new fields you must create or modify indexing
> plugins such as index-basic, index-more, index-anchor.

Okay. I'll take a look at the plugin documentation and see if I can
figure that out. BTW, do you know what the timeline is to have the
documentation updated for 1.3?
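For the schema discussion above, the copy step is literally just a file copy followed by an index push. A sketch, assuming default Nutch 1.3 and Solr example layouts at $NUTCH_HOME and $SOLR_HOME (these paths and the Solr URL are illustrative, not from the thread):

```shell
# Copy the example schema shipped with Nutch into Solr's conf directory.
# As noted above, Solr reads this file; Nutch itself never uses it.
cp $NUTCH_HOME/conf/schema.xml $SOLR_HOME/example/solr/conf/schema.xml

# Restart Solr so it picks up the new schema, then push the crawl data.
# Nutch 1.3 takes the crawldb, linkdb, and segments as positional args:
$NUTCH_HOME/bin/nutch solrindex http://localhost:8983/solr/ \
    crawl/crawldb crawl/linkdb crawl/segments/*
```

If you're merging Nutch's fields into an existing Solr schema instead of replacing it, the "missing the site field" error above is what you get when a field Nutch emits (like site) isn't declared on the Solr side, so carry over all the field definitions, not just the ones you plan to search on.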

