Does anyone have an EC2 image that runs smoothly for >3000 domains? If a sample of the complete Nutch & Hadoop configuration for distributed crawling on EC2 were available to the community, it would help anyone learn Nutch best practices quickly. -aj
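
For anyone putting such a sample together, a minimal sketch of the nutch-site.xml politeness and fetcher settings it would need to cover is below. The property names are standard Nutch 1.x settings; the values are placeholders to adjust per crawl, not recommendations taken from this thread.

  <!-- nutch-site.xml (sketch): fetcher politeness knobs; values are placeholders -->
  <configuration>
    <property>
      <name>fetcher.threads.fetch</name>
      <value>40</value>  <!-- fetcher threads per fetch (map) task -->
    </property>
    <property>
      <name>fetcher.threads.per.host</name>
      <value>1</value>   <!-- max concurrent requests to any one host -->
    </property>
    <property>
      <name>fetcher.server.delay</name>
      <value>2.0</value> <!-- seconds between successive requests to the same host -->
    </property>
  </configuration>
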
On Mon, Aug 2, 2010 at 1:59 PM, Scott Gonyea <[email protected]> wrote:

> By the way, can anyone tell me if there is a way to explicitly limit how many pages should be fetched per fetcher task?
>
> I know that one limit to this is that each site/domain/whatever could exceed that limit (assuming the limit were lower than the number of those sites). For politeness, that limit would have to be soft. But that's more than suitable, in my opinion.
>
> I think part of the problem is that Nutch seems to be generating some really unbalanced fetcher tasks.
>
> The task (task_201008021617_0026_m_000000) had 6859 pages to fetch. Each higher-numbered task had fewer pages to fetch. Task 000180 only had 44 pages to fetch.
>
> This *huge* imbalance, I think, creates tasks that are seemingly unpredictable. All of my other resources just sit around, wasted, until one task grabs some crazy number of sites.
>
> Thanks again,
> sg
>
> On Mon, Aug 2, 2010 at 11:57 AM, Scott Gonyea <[email protected]> wrote:
>
>> Thank you very much, Andrzej. I'm really hoping some people can share some non-sensitive details of their setup. I'm really curious about the following:
>>
>> The ratio of maps to reduces for their Nutch jobs?
>> The amount of memory that they allocate to each job task?
>> The number of simultaneous maps/reduces on any given host?
>> The number of fetcher threads they execute?
>>
>> Any config setup people can share would be great, so I can have a different perspective on how people set up their nutch-site and mapred-site files.
>>
>> At the moment, I'm experimenting with the following configs:
>>
>> http://gist.github.com/505065
>>
>> I'm giving each task 2048m of memory. Up to 5 maps and 2 reduces run at any given time. I have Nutch firing off 181 maps and 41 reduces. Those are both prime numbers, but I don't know if that really matters. I've seen Hadoop documentation say that the number of reducers should be around the number of nodes you have (the nearest prime). I've seen, somewhere, suggestions that the Nutch map:reduce ratio be anywhere from 1:0.93 to 1:1.25. Does anyone have insight to share on that?
>>
>> Thank you, Andrzej, for the SIGQUIT suggestion. I forgot about that. I'm waiting for it to return to the 4th fetch step, so I can see why Nutch hates me so much.
>>
>> sg
>>
>> On Mon, Aug 2, 2010 at 3:47 AM, Andrzej Bialecki <[email protected]> wrote:
>>
>>> On 2010-08-02 10:17, Scott Gonyea wrote:
>>>
>>>> The big problem that I am facing, thus far, occurs on the 4th fetch. All but 1 or 2 maps complete. All of the running reduces stall (0.00 MB/s), presumably because they are waiting on that map to finish? I really don't know, and it's frustrating.
>>>
>>> Yes, all map tasks need to finish before reduce tasks are able to proceed. The reason is that each reduce task receives a portion of the keyspace (and values) according to the Partitioner, and in order to prepare a nice <key, list(value)> in your reducer it needs to, well, get all the values under this key first, whichever map task produced the tuples, and then sort them.
>>>
>>> The failing tasks probably fail due to some other factor, and very likely (based on my experience) the failure is related to some particular URLs, e.g. regex URL filtering can choke on pathological URLs, like URLs 20 kB long or containing '\0', etc. In my experience, it's best to keep regex filtering to a minimum if you can, and use other urlfilters (prefix, domain, suffix, custom) to limit your crawling frontier. There are simply too many ways a regex engine can lock up.
>>>
>>> Please check the logs of the failing tasks. If you see that a task is stalled, you can also log in to the node and generate a thread dump a few times in a row (kill -SIGQUIT <pid>) - if each thread dump shows the regex processing, then it's likely this is your problem.
>>>
>>>> My scenario:
>>>> # Sites: 10,000-30,000 per crawl
>>>> Depth: ~5
>>>> Content: Text is all that I care for. (HTML/RSS/XML)
>>>> Nodes: Amazon EC2 (ugh)
>>>> Storage: I've performed crawls with HDFS and with Amazon S3. I thought S3 would be more performant, yet it doesn't appear to affect matters.
>>>> Cost vs Speed: I don't mind throwing EC2 instances at this to get it done quickly... but I can't imagine I need much more than 10-20 mid-size instances for this.
>>>
>>> That's correct - with this number of unique sites, the maximum throughput of your crawl will ultimately be limited by the politeness limits (# of requests/site/sec).
>>>
>>>> Can anyone share their own experiences in the performance they've seen?
>>>
>>> There is a very simple benchmark in trunk/ that you could use to measure the raw performance (data processing throughput) of your EC2 cluster. The real-life performance, though, will depend on many other factors, such as the number of unique sites, their individual speed, and (rarely) the total bandwidth at your end.
>>>
>>> --
>>> Best regards,
>>> Andrzej Bialecki <><
>>> http://www.sigram.com  Contact: info at sigram dot com
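
On Scott's question above about capping the work a single fetch task gets: Nutch has no direct per-task page limit, but the generator can cap how many URLs it takes from any one host or domain per fetchlist, which indirectly evens out the fetch map tasks (some 1.x releases also offer a time-based cap, fetcher.timelimit.mins). A sketch, assuming Nutch 1.1 or later; older releases expose the same idea as generate.max.per.host, and the value 100 is only an example:

  <configuration>
    <!-- Cap URLs taken from any one host/domain per generated fetchlist
         (assumes Nutch 1.1+; older releases use generate.max.per.host instead). -->
    <property>
      <name>generate.max.count</name>
      <value>100</value>
    </property>
    <property>
      <name>generate.count.mode</name>
      <value>host</value>  <!-- count per host; "domain" is the other option -->
    </property>
  </configuration>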

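For reference, the setup Scott describes above (2048m per task, up to 5 maps and 2 reduces at a time, read here as per-node TaskTracker slots, and 181 maps with 41 reduces per job) maps roughly onto these Hadoop 0.20-era properties. This is a sketch of that description, not the contents of the linked gist:

  <configuration>
    <!-- mapred-site.xml (sketch of the setup described above, not the linked gist) -->
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx2048m</value>  <!-- heap for each task JVM -->
    </property>
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>5</value>  <!-- concurrent map slots per node -->
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>  <!-- concurrent reduce slots per node -->
    </property>
    <property>
      <name>mapred.map.tasks</name>
      <value>181</value>  <!-- a hint only; actual map count follows input splits -->
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>41</value>
    </property>
  </configuration>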

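Andrzej's advice to lean on the prefix/domain/suffix URL filters rather than urlfilter-regex comes down to the plugin.includes list in nutch-site.xml. A sketch of what that swap might look like; the full plugin list here is illustrative, and the parsers/indexers you keep depend on your Nutch version and needs:

  <configuration>
    <!-- Prefer cheap, non-regex URL filters to bound the crawl frontier. -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-(prefix|domain|suffix)|parse-(text|html)|index-basic|scoring-opic|urlnormalizer-(pass|basic)</value>
    </property>
  </configuration>

The corresponding filter lists typically live in conf/prefix-urlfilter.txt, conf/domain-urlfilter.txt, and conf/suffix-urlfilter.txt (one prefix, domain, or suffix per line); these are plain string matches and cannot lock up the way a pathological regex can.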