How to speed up a Map/Reduce job?

2011-02-01 Thread Igor Bubkin
Hello everybody, I have a problem. I installed Hadoop on a 2-node cluster and ran the WordCount example. It takes about 20 seconds to process a 1.5MB text file. We want to use Map/Reduce in real time (interactively, by user request). A user can't wait 20 seconds for his request. This is too long. Is it

RE: How to speed up a Map/Reduce job?

2011-02-01 Thread praveen.peddi
Hi Igor, I am not sure that Hadoop is designed for realtime requests. I have a feeling that you are trying to use Hadoop in a way it is not designed for. From my experience, a Hadoop cluster will be much slower than local Hadoop mode when processing a smaller dataset, because there is always

Re: How to speed up a Map/Reduce job?

2011-02-01 Thread Steve Loughran
On 01/02/11 08:19, Igor Bubkin wrote: Hello everybody, I have a problem. I installed Hadoop on a 2-node cluster and ran the WordCount example. It takes about 20 seconds to process a 1.5MB text file. We want to use Map/Reduce in real time (interactively, by user request). A user can't wait for his

RE: How to speed up a Map/Reduce job?

2011-02-01 Thread Black, Michael (IS)
Try this rather small C++ program... it will more than likely be a LOT faster than anything you could do in Hadoop. Hadoop is not the hammer for every nail. Too many people think that any cluster solution will automagically scale their problem... 'tain't true. I'd appreciate hearing your
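Michael's point can be illustrated with a minimal local word count. His program was C++; the sketch below is a hypothetical Python equivalent, just to show that a 1.5MB file is trivial for a single process (the filename in the comment is made up):

```python
from collections import Counter

def word_count(text):
    """Count whitespace-delimited words in a string."""
    return Counter(text.split())

# A file of this size can simply be read whole, e.g.:
# counts = word_count(open("input.txt").read())
```

On a 1.5MB input this completes in a few milliseconds, with none of Hadoop's per-job startup overhead.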

TestDFSIO fails to run

2011-02-01 Thread Da Zheng
Hello, I want to measure the performance of HDFS on a cluster. There are 16 nodes in the cluster and each node has 12GB of memory. To reduce the impact of file-system caching, each file written by TestDFSIO is 10GB. The replication factor is 1. Then I got errors such as

Reduce progress goes backward?

2011-02-01 Thread Shi Yu
Hi, I observe that sometimes the map/reduce progress goes backward. What does this mean?
11/02/01 12:57:51 INFO mapred.JobClient: map 100% reduce 99%
11/02/01 12:59:14 INFO mapred.JobClient: map 100% reduce 98%
11/02/01 12:59:45 INFO mapred.JobClient: map 100% reduce 99%
11/02/01

Re: Reduce progress goes backward?

2011-02-01 Thread James Seigel
It means that the scheduler is killing off some of your reduce tasks, or some of them are dying. Maybe they are taking too long. You should check your JobTracker, look at some of the details, and then drill down to see if you are getting any errors in some of your reducers. Cheers

Streaming reporter counter update

2011-02-01 Thread Keith Wiley
I have found the following incantation for a streaming counter update in many places on the web: reporter:counter:group,counter,amount, to be sent to stderr, but I haven't found a description of the three parameters. Presumably group is just a string of my own choosing, but I don't know what

Re: Streaming reporter counter update

2011-02-01 Thread Harsh J
A 'group' is a set of counters (like 'Map-Reduce Framework' in the default counters; that's a group name). A 'counter' needs a name to signify what it is about. The 'amount' is the increment: generally 1, but it can be any amount you need to increment the current count by. All counters
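Putting Harsh's description together, a streaming task updates a counter by printing a line in that format to stderr. A minimal sketch in Python (the group and counter names here are hypothetical, chosen only for illustration):

```python
import sys

def counter_update(group, counter, amount=1):
    """Format a streaming counter update line.

    Hadoop streaming scans the task's stderr for lines of this form
    and adds `amount` to the named counter in the named group.
    """
    return "reporter:counter:%s,%s,%d" % (group, counter, amount)

def update_counter(group, counter, amount=1):
    # stderr, not stdout: stdout is reserved for the task's records.
    sys.stderr.write(counter_update(group, counter, amount) + "\n")

# e.g. inside a mapper loop, bump a counter once per bad record:
# update_counter("MyApp", "MalformedRecords")
```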

Re: Streaming reporter counter update

2011-02-01 Thread Keith Wiley
Ah, so 'amount' doesn't set the count, it increments it, meaning I shouldn't pass a running total each time. That would result in some sort of strange exponential counter. I should just pass 1 every time. I get it. I'm still unclear on the distinction between the group and the individual counter, but

Managing stdout in streaming

2011-02-01 Thread Keith Wiley
So streaming uses stdout to organize the mapper/reducer output, one record per line, with each key/value pair split at the first TAB. (Presumably multiple TABs are permitted and become embedded in the value string; I haven't experimented with this yet.) Obviously, one must be very careful not to write
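The convention Keith describes can be sketched as a pair of helpers: records go to stdout in key-TAB-value form, and everything else must go to stderr. This is a minimal illustration, not Hadoop API code:

```python
import sys

def format_record(key, value):
    """One output record: key and value joined by a TAB, one per line.

    The framework splits each line at the FIRST tab, so any further
    tabs remain embedded in the value string.
    """
    return "%s\t%s\n" % (key, value)

def emit(key, value):
    sys.stdout.write(format_record(key, value))

def log(message):
    # Diagnostics must go to stderr; anything on stdout becomes a record.
    sys.stderr.write(message + "\n")
```

Splitting `format_record("k", "a\tb")` at the first tab recovers key `k` and value `a\tb`, which matches Keith's presumption about embedded TABs.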

Hadoop Framework questions.

2011-02-01 Thread Raj V
Hi, I have been running some benchmarks with Hadoop jobs on different node and disk configurations to see which configuration gives the optimum performance. Here are some results that I have. Using the Hadoop job log, I added up the timings for each of the map tasks and

Re: Streaming reporter counter update

2011-02-01 Thread Harsh J
Group is more of a meta-data thing. For instance, one may want all FileSystem related counters under a single namespace (which would be a Group, here). On Wed, Feb 2, 2011 at 2:17 AM, Keith Wiley kwi...@keithwiley.com wrote: Ah, so count doesn't set the count, it increments it, meaning I

Re: Hadoop Framework questions.

2011-02-01 Thread Patrick Angeles
Hi, Raj. Interesting analysis... These numbers appear to be off. For example, 405s for mappers + 751s for reducers = 1156s for all tasks. If you have 2000 map and reduce tasks, this means each task is spending roughly 500ms to do actual work. That is a very low number and seems impossible. - P

Re: Benchmarking performance in Amazon EC2/EMR environment

2011-02-01 Thread Aaron Eng
Hi Steve, -are you asking for XL or bigger VMs to get the full physical host and less network throttling? I've used m1.large, m1.xlarge and cc1.4xlarge instance types and seen this issue on all of them. Speaking specifically about cc1.4xlarge instances, I see disk read speeds for ephemeral

Re: Hadoop Framework questions.

2011-02-01 Thread Raj V
Patrick, all apologies. My brain must have just frozen. There are two problems here. The first problem is that my perl script already divides the time by 1000, so the time is in seconds, not milliseconds. The second problem is more fundamental. You can add up all the task timings, but that ignores the

Re: Hadoop Framework questions.

2011-02-01 Thread Patrick Angeles
Inline. On Wed, Feb 2, 2011 at 5:35 AM, Raj V rajv...@yahoo.com wrote: Patrick, all apologies. My brain must have just frozen. There are two problems here. The first problem is that my perl script already divides the time by 1000, so the time is in seconds, not milliseconds. The second problem

Multiple various streaming questions

2011-02-01 Thread Keith Wiley
I would really appreciate any help people can offer on the following matters. When running a streaming job, -D, -files, -libjars, and -archives don't seem to work, but -jobconf, -file, -cacheFile, and -cacheArchive do. With the first four parameters anywhere in the command I always get a Streaming