> Not sure why gmail keeps sending my replies to people instead of back to
> the list.  Have to keep a better eye out for it.
>
> ---------- Forwarded message ----------
> From: Bai Shen <[email protected]>
> Date: Tue, Sep 27, 2011 at 1:38 PM
> Subject: Re: Understanding Nutch workflow
> To: [email protected]
>
> > > I didn't mean that the segment would contain every unfetched url that
> > > was in the db, if that's what you mean.
> > >
> > > I don't think I've hit more than 5000 urls in my current segments.  At
> > > least that's the highest I've seen the queue.  Is there a way to
> > > determine how many urls are in a segment?
> >
> > Sure, segment X contains the same number of URLs as there are reduce
> > output records in the partitioner job for X. You can see that statistic
> > in the output of every mapred job.
>
> How do I see the output of the mapred job?  I don't recall seeing anything
> like that in the log file.
This output goes to stdout and can also be viewed in realtime using the web
GUI:

11/09/27 16:54:35 INFO mapred.JobClient: Job complete: job_201109261414_0039
11/09/27 16:54:37 INFO mapred.JobClient: Counters: 27
11/09/27 16:54:37 INFO mapred.JobClient:   Job Counters
11/09/27 16:54:37 INFO mapred.JobClient:     Launched reduce tasks=9
11/09/27 16:54:37 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=4561078
11/09/27 16:54:37 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
11/09/27 16:54:37 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
11/09/27 16:54:37 INFO mapred.JobClient:     Rack-local map tasks=2
11/09/27 16:54:37 INFO mapred.JobClient:     Launched map tasks=417
11/09/27 16:54:37 INFO mapred.JobClient:     Data-local map tasks=415
11/09/27 16:54:37 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=6166304
11/09/27 16:54:37 INFO mapred.JobClient:   File Input Format Counters
11/09/27 16:54:37 INFO mapred.JobClient:     Bytes Read=10396521777
11/09/27 16:54:37 INFO mapred.JobClient:   File Output Format Counters
11/09/27 16:54:37 INFO mapred.JobClient:     Bytes Written=917655979
11/09/27 16:54:37 INFO mapred.JobClient:   FileSystemCounters
11/09/27 16:54:37 INFO mapred.JobClient:     FILE_BYTES_READ=3278262704
11/09/27 16:54:37 INFO mapred.JobClient:     HDFS_BYTES_READ=10396613577
11/09/27 16:54:37 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=6539342397
11/09/27 16:54:37 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=917655979
11/09/27 16:54:37 INFO mapred.JobClient:   Map-Reduce Framework
11/09/27 16:54:37 INFO mapred.JobClient:     Map output materialized bytes=3250364133
11/09/27 16:54:37 INFO mapred.JobClient:     Map input records=7494536
11/09/27 16:54:37 INFO mapred.JobClient:     Reduce shuffle bytes=3250360919
11/09/27 16:54:37 INFO mapred.JobClient:     Spilled Records=18455792
11/09/27 16:54:37 INFO mapred.JobClient:     Map output bytes=4421256434
11/09/27 16:54:37 INFO mapred.JobClient:     Map input bytes=10396451841
11/09/27 16:54:37 INFO mapred.JobClient:     Combine input records=42643906
11/09/27 16:54:37 INFO mapred.JobClient:     SPLIT_RAW_BYTES=64218
11/09/27 16:54:37 INFO mapred.JobClient:     Reduce input records=6966070
11/09/27 16:54:37 INFO mapred.JobClient:     Reduce input groups=3065036
11/09/27 16:54:37 INFO mapred.JobClient:     Combine output records=13178184
11/09/27 16:54:37 INFO mapred.JobClient:     Reduce output records=3065036
11/09/27 16:54:37 INFO mapred.JobClient:     Map output records=36431792

> > > What kind of connection do you use to fetch 500k urls?  What are your
> > > fetcher threads set to?
> >
> > We usually don't exceed 30mbit/second in short bursts per node with 128
> > threads. This only happens for many small fetch queues, e.g. a few URLs
> > (e.g. 2) for 250.000 domains. Then it's fast.
>
> I see.  I've been using 100 threads with 10 per host and it seems to
> saturate the current connection pretty well, and that's just from one
> machine.  Which is why I was wondering about your splitting of segments.
> What machine limitations have you run into?

10 threads per host? That's a lot; it doesn't seem polite to me. Segment
size doesn't affect bandwidth.

> > > So the downloaded data gets stored in the segment directories, not the
> > > mapreduce temp files?  Why does mapreduce get so large then?
> >
> > It is stored in the tmp directory during the job and written to the
> > segment in the reducer.
>
> The mapred jobs require a factor of four for overhead?  The fetch
> downloaded 12GB of data, but the mapred dir was around 50GB (I think).
> Just trying to understand what it's doing to use all that space.
>
> > > And any parse filter plugins are only used to search for urls, right?
> > > So if I'm worried about additional indexing, this is not the place to
> > > be looking, correct?
> >
> > No no, a parse filter can, for instance, extract information from the
> > parsed DOM such as headings, meta elements or whatever and output it as
> > a field.
>
> I see.
> I don't think I'll need that, but we'll see once I get the rest
> working.
>
> > > > What do you mean?  What is the current schema if not schema.xml?  My
> > > > understanding is that the schema.xml file in the Nutch conf dir
> > > > should be the same as the schema.xml file in Solr.
> > >
> > > The provided schema file is only an example; Nutch does not use it but
> > > Solr does. You must copy the schema from Nutch to Solr, that's all. We
> > > ship it for completeness. Later we might ship other Solr files for
> > > better integration on the Solr side, such as Velocity template files.
>
> Ah, okay.  So I just need to add the Nutch fields to my current Solr
> schema.

Yes.

> > > If I want to modify and add additional indexing, how would I set that
> > > up?  I swapped out the schema.xml file, but wasn't able to get the
> > > solrindex command to work.  It kicked back the error that I was
> > > missing the site field.
> >
> > If you want to add new fields you must create or modify indexing
> > plugins such as index-basic, index-more or index-anchor.
>
> Okay.  I'll take a look at the plugin documentation and see if I can
> figure that out.
>
> BTW, do you know what the timeline is to have the documentation updated
> for 1.3?

It is being updated as we speak. Lewis did quite a good job on the wiki docs
for 1.3.
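Since the thread points at indexing plugins as the way to get new fields
into Solr, here is a rough, self-contained sketch of what such a filter
does. Note this is not the real org.apache.nutch.indexer.IndexingFilter
interface (the real filter() also receives the Parse, CrawlDatum and
Inlinks for the page); a plain Map stands in for NutchDocument, and the
"heading" field name is made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for a Nutch indexing filter: copy a value that a
// parse filter extracted into the document as an extra index field.
public class HeadingFilterSketch {

    // Adds the heading as a "heading" field, skipping blank values,
    // much like index-basic copies title and content into the doc.
    static Map<String, String> filter(Map<String, String> doc, String heading) {
        if (heading != null && !heading.trim().isEmpty()) {
            doc.put("heading", heading.trim());
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<String, String>();
        doc.put("url", "http://example.com/");
        filter(doc, "Understanding Nutch workflow");
        System.out.println(doc.get("heading")); // prints the heading
    }
}
```

In a real plugin this logic would live in a class registered through the
plugin's plugin.xml, and the new field must of course also be declared in
the Solr schema.xml, or solrindex will reject the documents.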

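As an aside: rather than eyeballing the JobClient output above for the
"Reduce output records" counter (which equals the number of URLs in the
segment), the line can be picked out programmatically. A minimal sketch;
the class and method names are made up, and it only assumes the log format
shown above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReduceOutputRecords {
    // Matches the counter line printed by JobClient, e.g.
    // "11/09/27 16:54:37 INFO mapred.JobClient:     Reduce output records=3065036"
    private static final Pattern COUNTER =
        Pattern.compile("Reduce output records=(\\d+)");

    // Returns the counter value found in one log line, or -1 if absent.
    static long parse(String line) {
        Matcher m = COUNTER.matcher(line);
        return m.find() ? Long.parseLong(m.group(1)) : -1L;
    }

    public static void main(String[] args) {
        String line = "11/09/27 16:54:37 INFO mapred.JobClient:"
            + "     Reduce output records=3065036";
        System.out.println(parse(line)); // prints 3065036
    }
}
```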
