Re: Pipelining Mappers and Reducers

2010-07-27 Thread Amogh Vasekar
Hi, >>What would really be great for me is if I could have the Reducer start >>processing the map outputs as they are ready, and not after all Mappers finish Check the property mapred.reduce.slowstart.completed.maps >>I've read about chaining mappers, but to the best of my understanding the >>se
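The property mentioned above can be set in the job configuration; a minimal fragment, assuming the 0.20-era property name and its usual default of 0.05 (the fraction of maps that must finish before reducers are scheduled — lower it to start the shuffle earlier):

```xml
<!-- Fraction of map tasks that must complete before reduce tasks
     are launched. Lowering this toward 0 lets reducers begin
     fetching map output earlier; it does not start the reduce()
     calls themselves before all maps finish. -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.05</value>
</property>
```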

Re: Need Suggestion: Tuning MR performance by changing parameters in Hadoop project and JVM

2010-06-02 Thread Amogh Vasekar
Hi, You might want to check https://issues.apache.org/jira/browse/HADOOP-4179 and http://hadoop.apache.org/common/docs/current/vaidya.html Amogh On 6/2/10 1:24 PM, "WANG Shicai" wrote: Hi, This message is a little long. I beg your patience. Our team would like to tune MR performance by chang

Re: Aborting a MapReduce job

2010-05-04 Thread Amogh Vasekar
Hi, A hack that immediately comes to my mind is having the mapper touch a predetermined filepath and use that to clean up. Or alternatively, check the RunningJob interface available via JobClient, you can monitor and kill tasks from there too. Amogh On 5/4/10 9:46 AM, "Ersoy Bayramoglu" wrot
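The second suggestion (monitoring via `RunningJob`) might look like the following sketch against the old `o.a.h.mapred` client API; the job id string and the `shouldAbort()` check are hypothetical placeholders, not part of the original mail:

```java
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class JobAborter {
  // Poll a running job and kill it when an external abort condition holds.
  public static void abortIfNeeded(JobConf conf, String jobId) throws Exception {
    JobClient client = new JobClient(conf);
    RunningJob running = client.getJob(JobID.forName(jobId));
    while (!running.isComplete()) {
      if (shouldAbort()) {   // e.g. the "touched filepath" signal from the mail
        running.killJob();   // tears down all tasks of the job
        break;
      }
      Thread.sleep(5000);
    }
  }

  private static boolean shouldAbort() { return false; } // placeholder
}
```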

Re: combiner with GroupComparator

2010-03-07 Thread Amogh Vasekar
Hi, Not sure if this can be done. Here's a relevant snippet of code: { super(inputCounter, conf, reporter); combinerClass = cls; keyClass = (Class) job.getMapOutputKeyClass(); valueClass = (Class) job.getMapOutputValueClass(); comparator = (RawComparator) job.getOutpu

Re: Barrier between reduce and map of the next round

2010-02-08 Thread Amogh Vasekar
K, the chaining has to be defined before the job is started, right? But because I don't know the value of K beforehand, I want the chain to continue forever until some counter in reduce task is zero. Felix Halim On Thu, Feb 4, 2010 at 3:53 PM, Amogh Vasekar wrote: > >>>However, from ri
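A driver loop that re-submits the job until a reducer-maintained counter reaches zero is one way to get the open-ended chain described above; a sketch against the old `o.a.h.mapred` API, where the group/counter names (`"ITER"`, `"REMAINING"`) are made up for illustration and the reducer would bump the counter via `reporter.incrCounter("ITER", "REMAINING", 1)`:

```java
// Re-run the job until the reducers report nothing left to process.
long remaining;
do {
  RunningJob rj = JobClient.runJob(jobConf);  // blocks until the job finishes
  remaining = rj.getCounters()
                .findCounter("ITER", "REMAINING")
                .getCounter();
} while (remaining > 0);
```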

Re: avoiding data redistribution in iterative mapreduce

2010-02-08 Thread Amogh Vasekar
n this case? If that is the case, can a custom scheduler be written -- will it be any easy task? Regards, Raghava. On Thu, Feb 4, 2010 at 2:52 AM, Amogh Vasekar wrote: Hi, >>Will there be a re-assignment of Map & Reduce nodes by the Master? In general using available schedulers, I

Re: Strange behaviour from a custom Writable

2010-02-08 Thread Amogh Vasekar
Hi, Yes, the same location is populated with different values ( returned by iter.next() ) for optimization reasons. There is a new patch which will allow you to mark() and reset() the iterator so that you can buffer required values ( equivalently you can do that yourself; it's anyway in-mem for the patch

Re: Barrier between reduce and map of the next round

2010-02-03 Thread Amogh Vasekar
>>However, from ri to m(i+1) there is an unnecessary barrier. m(i+1) should not >>need to wait for all reducers ri to finish, right? Yes, but r(i+1) can't be in the same job, since that requires another sort and shuffle phase ( barrier ). So you would end up doing job(i) : m(i) r(i) m(i+1) . Job
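The job(i) : m(i) r(i) m(i+1) shape can be expressed inside a single job with the `ChainMapper`/`ChainReducer` classes in `o.a.h.mapred.lib`; a simplified sketch where `MapA`, `ReduceA`, `MapB` and the key/value classes are placeholders:

```java
// One job running map -> reduce -> map. The trailing mapper consumes
// the reducer's local output directly, so no extra sort/shuffle occurs.
JobConf job = new JobConf(conf, Driver.class);

ChainMapper.addMapper(job, MapA.class,
    LongWritable.class, Text.class, Text.class, Text.class,
    true, new JobConf(false));

ChainReducer.setReducer(job, ReduceA.class,
    Text.class, Text.class, Text.class, Text.class,
    true, new JobConf(false));

ChainReducer.addMapper(job, MapB.class,
    Text.class, Text.class, Text.class, Text.class,
    true, new JobConf(false));
```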

Re: avoiding data redistribution in iterative mapreduce

2010-02-03 Thread Amogh Vasekar
get the same nodes always to run your map >>> reduce job on a >>> shared cluster? while (!done) { JobClient.runJob(jobConf); <>} If I write something like that in the code, would not the Map node run on the same data chunk it has each time? Will there be a re-assignment o

Re: avoiding data redistribution in iterative mapreduce

2010-02-03 Thread Amogh Vasekar
Hi, If each of your sequential iterations is map+reduce, then no. The lifetime of a split is confined to a single map reduce job. The split is actually a reference to the data, which is used to schedule the job as close as possible to the data. The record reader then uses the same object to read the records in the split. W

Re: chained mappers & reducers

2010-01-21 Thread Amogh Vasekar
.clements=disney....@hadoop.apache.org] On Behalf Of Amogh Vasekar Sent: Tuesday, January 19, 2010 10:53 PM To: mapreduce-user@hadoop.apache.org Subject: Re: chained mappers & reducers Hi, Can you elaborate on your case a little? If you need sort and shuffle ( ie outputs of different reducer ta

Re: chained mappers & reducers

2010-01-19 Thread Amogh Vasekar
Hi, Can you elaborate on your case a little? If you need sort and shuffle ( ie outputs of different reducer tasks of R1 to be aggregated in some way ) , you have to write another map-red job. If you need to process only local reducer data ( ie your reducer output key is same as input key ), you

Re: Question about setting the number of mappers.

2010-01-18 Thread Amogh Vasekar
Hi, >>so I wanted to try and lower the number to 10 and see how the performance is The number of mappers is provided only as a hint to the framework; it is not guaranteed to be that number. >>I have been digging around in the hadoop source code and it looks like the >>JobClient actually sets the
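Since the map count is only a hint, lowering it directly rarely works; raising the minimum split size is the usual lever. A sketch against the old `JobConf` API — the 512 MB figure is illustrative, not from the original mail:

```java
// The requested map count is only a hint; the real count comes from
// the InputFormat's splits.
jobConf.setNumMapTasks(10);                                   // hint only
// To actually get fewer maps over the same input, enlarge the splits:
jobConf.setLong("mapred.min.split.size", 512L * 1024 * 1024); // 512 MB
```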

Re: How do I sum by Key in the Reduce Phase AND keep the initial value

2010-01-12 Thread Amogh Vasekar
e second job and appending the sum value to each record. Kind regards Steve Watt From: Amogh Vasekar To: "mapreduce-user@hadoop.apache.org" Date: 01/12/2010 02:01 PM Subject: Re: How do I sum by Key in the Reduce Phase AND keep the initial value

Re: How do I sum by Key in the Reduce Phase AND keep the initial value

2010-01-12 Thread Amogh Vasekar
Hi, I ran into a very similar situation quite some time back and encountered this: http://issues.apache.org/jira/browse/HADOOP-475 After speaking to a few Hadoop folks, they said complete cloning was not a straightforward option for some optimization reasons. There were a few things

Re: Does Using MultipleTextOutputFormat Require the Deprecated API?

2009-12-14 Thread Amogh Vasekar
rther assume, I need only apply the latest patch, which is 5. Am I correct. On Wed, Dec 9, 2009 at 7:30 AM, Amogh Vasekar wrote: http://issues.apache.org/jira/browse/MAPREDUCE-370 You'll have to work around for now / try to apply patch. Amogh On 12/9/09 8:54 PM, "Geoffry Rob

Re: Does Using MultipleTextOutputFormat Require the Deprecated API?

2009-12-09 Thread Amogh Vasekar
http://issues.apache.org/jira/browse/MAPREDUCE-370 You'll have to work around for now / try to apply patch. Amogh On 12/9/09 8:54 PM, "Geoffry Roberts" wrote: Aaron, I am using 0.20.1 and I'm not finding org.apache.hadoop.mapreduce. lib.output.MultipleOutputs. I'm using the download page w

Re: How to write a custom input format and record reader to read multiple lines of text from files

2009-12-01 Thread Amogh Vasekar
Hi, The NLineInputFormat (o.a.h.mapreduce.lib.input) achieves more or less the same, and should help guide you in writing a custom input format :) Amogh On 12/1/09 11:47 AM, "Kunal Gupta" wrote: Can someone explain how to override the "FileInputFormat" and "RecordReader" in order to be able to rea
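Hooking up `NLineInputFormat` from the new API might look like this sketch; the property name is the one used by the `mapreduce`-lib version (older `mapred` releases used `mapred.line.input.format.linespermap`), and the value of 5 is arbitrary:

```java
// Feed each map task N input lines at a time.
conf.setInt("mapreduce.input.lineinputformat.linespermap", 5);
job.setInputFormatClass(
    org.apache.hadoop.mapreduce.lib.input.NLineInputFormat.class);
```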

Re: Hadoop streaming job issue

2009-11-09 Thread Amogh Vasekar
Hi, I'm pretty sure you need to specify the unicode equivalent, or at least that is what I used in my java map-red program. Amogh On 11/10/09 9:24 AM, "wd" wrote: hi, I'm trying to write a hadoop streaming job in perl. But I'm completely confused by the key/value separator. I found lots of separat
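A streaming invocation that sets the separators explicitly might look like the following sketch; the jar path, input/output names, and perl scripts are placeholders, and whether the escape for a non-printable separator (e.g. `\u0001` for ^A) is interpreted literally depends on the Hadoop version:

```shell
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
  -D stream.map.output.field.separator='\u0001' \
  -D stream.num.map.output.key.fields=1 \
  -input in -output out \
  -mapper mapper.pl -reducer reducer.pl
```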

Re: How to use MultipleTextOutputFormat ?

2009-10-27 Thread Amogh Vasekar
Hi, The file name generated depends on the output pair and hence, if you are modifying the key from mapper to reducer output, collisions are possible. You may split and append 'name' (* from part-*) to get unique reducer files, which can be merged later. Or see if multipleoutputs fits your bill
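If `MultipleOutputs` fits the bill, wiring it up in the old `o.a.h.mapred.lib` API looks roughly like this sketch; the named output `"text"` and the key/value classes are placeholders:

```java
// In the driver: register a named output.
MultipleOutputs.addNamedOutput(jobConf, "text",
    TextOutputFormat.class, Text.class, Text.class);

// In the reducer:
//   configure():  mos = new MultipleOutputs(jobConf);
//   reduce():     mos.getCollector("text", reporter).collect(key, value);
//   close():      mos.close();
```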

Re: Map output compression leads to JVM crash (0.20.0)

2009-10-25 Thread Amogh Vasekar
Hi, Can you let us know if the count of attempt_ s is 32k - 1? I remember reading about a similar error some time back. Amogh On 10/26/09 9:06 AM, "Ed Mazur" wrote: I'm having problems on 0.20.0 when map output compression is enabled. Map tasks complete (TaskRunner: Task 'attempt_*' done), but

JVM reuse

2009-09-15 Thread Amogh Vasekar
Hi All, Regarding the JVM reuse feature, the docs say reuse is generally recommended for streaming and pipes jobs. I'm a little unclear on this and any pointers would be appreciated. Also, in what scenarios will this feature be helpful for java mapred jobs? Thanks, Amogh
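For reference, the knob in question is set per job; a minimal fragment, assuming the 0.20-era property name:

```xml
<!-- Number of tasks a spawned JVM may run in sequence for a job.
     1 = no reuse (default); -1 = unlimited reuse. Reuse saves JVM
     startup cost, which matters most for many short tasks. -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
```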

RE: Location reduce task running.

2009-08-24 Thread Amogh Vasekar
er-specific processing) ---> Store mails to designated boxes. Do you have any suggestion? I am thinking about JVM re-use feature of Hadoop or I can set up a chain of two map-reduce pairs. Best regards. Fang. On Mon, Aug 24, 2009 at 1:25 PM, Amogh Vasekar mailto:am...@yahoo-inc.com>> wrote

RE: Location reduce task running.

2009-08-23 Thread Amogh Vasekar
No, but if you want a "reducer like" functionality on the same node, have a look at combiners. To get the exact functionality you might need to tweak a little w.r.t. buffers, flushes, etc. Cheers! Amogh From: fan wei fang [mailto:eagleeye8...@gmail.com] Sent: Monda
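Registering a combiner is a one-liner; the caveat (reflected in the comment below) is that the framework may run it zero, one, or many times, so it must be safe to apply repeatedly. `MyReducer` here is a placeholder:

```java
// A combiner is a "reducer-like" pass over map-local output. It may run
// zero or more times, so it must be idempotent in effect, and its output
// key/value types must match the map output types.
jobConf.setCombinerClass(MyReducer.class);
```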

RE: merging multiple mapper's outputs

2009-08-17 Thread Amogh Vasekar
Same amount of data will have to be read and transferred over the network, whether it is one file or multiple files. If you do merge to a single file, the S&S phase actually can't start till all mappers have finished, as opposed to fetching outputs from individual mapper tasks, which can happen as soon as each has finis