Re: Number of Mappers Running Simultaneously

2010-09-16 Thread Amogh Vasekar
Hi Rahul, Can you please be more specific? Do you want to control mappers running simultaneously for your job ( I guess ) or the cluster as a whole? If for your job, and you want to control it on a per-node basis, one way is to allocate more memory to each of your mappers so it occupies more than

Re: getJobID and job handling

2010-07-26 Thread Amogh Vasekar
Hi, I see you are using the new APIs, so this should be relevant for you: https://issues.apache.org/jira/browse/MAPREDUCE-118 As you have noticed, in the old APIs the JobClient could be queried using the JobID, which was returned when the job was submitted. There was a thread in hadoop-dev to

ODBC isql error

2010-06-25 Thread Amogh Vasekar
Hi, I tried testing my odbc build with isql, but I get the following error: [ISQL]ERROR: Could not SQLAllocEnv I tried dltest /usr/local/lib/libodbchive.so SQLAllocEnv, which succeeds, so I guess the entry point should be found. Any suggestions, anyone? Amogh

Re: Hive and ODBC driver- single threaded?

2010-06-25 Thread Amogh Vasekar
Hi, Incidentally, I was looking into a similar thing. The Hive server is not thread-safe; see https://issues.apache.org/jira/browse/HIVE-187?focusedCommentId=12738494&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12738494 for more. Amogh On 6/25/10 7:16 PM, Omer,

Missing fb303.a in 64bit libthrift

2010-06-23 Thread Amogh Vasekar
Hi, I'm referring to https://issues.apache.org/jira/browse/HIVE-187 , which has Linux 32 and 64 bit thrift libs. I noticed that the 64 bit lib doesn't contain the fb303 module, unlike the 32 bit compilation. I'm trying to build one for my use, but if you have it handy it will be of great help to

Re: Can we modify existing file in HDFS?

2010-06-22 Thread Amogh Vasekar
Do I need to remove and re-create the whole file? Simply put, as of now, yes. Append functionality is being made available to users to add to the end of a file though :) Amogh On 6/22/10 1:56 PM, elton sky eltonsky9...@gmail.com wrote: hello everyone, I noticed there are 6 operations in HDFS:

Re: Performance tuning of sort

2010-06-17 Thread Amogh Vasekar
Since the scale of input data and operations of each reduce task is the same, what may cause the execution time of reduce tasks different? You should consider looking at the copy, shuffle and reduce times separately from JT UI to get better info. Many (dynamic) considerations like network

Re: hadoop streaming on Amazon EC2

2010-06-02 Thread Amogh Vasekar
Hi, Depending on what hadoop version ( 0.18.3??? ) EC2 uses, you can try one of the following 1. Compile the streaming jar files with your own custom classes and run on ec2 using this custom jar ( should work for 18.3 . Make sure you pick compatible streaming classes ) 2. Jar up your classes

Re: error in communication with hdfs

2010-06-02 Thread Amogh Vasekar
Hi, Quick couple of questions, Is the namenode formatted and the daemon started? Can you ssh w/o password? Amogh On 6/2/10 5:03 PM, Khaled BEN BAHRI khaled.ben_ba...@it-sudparis.eu wrote: Hi :) I installed hadoop and i tried to store data in hdfs but any command i want to execute like fs

Re: hadoop streaming on Amazon EC2

2010-06-02 Thread Amogh Vasekar
warning. WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String). So the results were incorrect. Thanks, Mo On Wed, Jun 2, 2010 at 4:56 AM, Amogh Vasekar am...@yahoo-inc.com wrote: Hi, Depending on what hadoop version ( 0.18.3

Re: hadoop streaming on Amazon EC2

2010-06-02 Thread Amogh Vasekar
) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) Thanks, Mo On Wed, Jun 2, 2010 at 8:40 AM, Amogh Vasekar am...@yahoo-inc.com wrote: Hi, You might need to add -Dstream.shipped.hadoopstreaming=path_to_your_custom_streaming_jar Amogh On 6/2/10 5:10

Re: Getting zero length files on the reduce output.

2010-06-02 Thread Amogh Vasekar
Hi, The default partitioner is hashcode(key) MODULO number_of_reducers, so it's pretty much possible. Can I change this hash function in any way? Sure, any custom partitioner can be plugged in. Check o.a.h.mapreduce.partition or the secondary sort example on the mapred tutorial for more. On a side
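For illustration, a minimal custom partitioner might look like the sketch below (new-API classes; the prefix-based scheme and class name are hypothetical):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical sketch: replace the default hashCode()-modulo scheme with
// one that routes keys sharing a first character to the same reducer.
public class PrefixPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (key.getLength() == 0) return 0;
    // Mask to keep the value non-negative before taking the modulo.
    return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
  }
}
```

Plug it in with job.setPartitionerClass(PrefixPartitioner.class).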

Doc on MSTR to Hive

2010-05-26 Thread Amogh Vasekar
Hi All, Is there any documentation I can refer to while attempting to connect MSTR v8.* / v9 to Hive, probably some FAQs or a cookbook or even a blog ;) ? Any inputs appreciated. Thanks, Amogh

Re: which node processed my job

2010-05-06 Thread Amogh Vasekar
Hi, InetAddress.getLocalHost() should give you the hostname for each mapper/reducer. Amogh On 5/6/10 8:39 PM, Alan Miller alan.mil...@synopsys.com wrote: Not sure if this is the right list for this question, but: Is it possible to determine which host actually processed my MR job? Regards,
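A minimal sketch of that idea (new API; the class name and output scheme are illustrative):

```java
import java.io.IOException;
import java.net.InetAddress;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: tag every output record with the host the map task ran on.
public class HostReportingMapper extends Mapper<LongWritable, Text, Text, Text> {
  private Text host;

  @Override
  protected void setup(Context context) throws IOException {
    host = new Text(InetAddress.getLocalHost().getHostName());
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(host, value);
  }
}
```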

Re: Per-file block size

2010-04-13 Thread Amogh Vasekar
Hi, Pass the -D property on the command line, e.g. hadoop fs -D dfs.block.size=<multiple of checksum> . You can check if it's actually set the way you needed by hadoop fs -stat %o <file> HTH, Amogh On 4/14/10 9:01 AM, Andrew Nguyen andrew-lists-had...@ucsfcti.org wrote: I thought I saw a way to
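The same per-file block size can also be set programmatically; a sketch, where the path and sizes are example values:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Create a file with a 64MB block size (must be a multiple of the checksum size).
    FSDataOutputStream out = fs.create(
        new Path("/user/example/data.txt"), // hypothetical path
        true,                // overwrite
        4096,                // io buffer size
        (short) 3,           // replication
        64L * 1024 * 1024);  // per-file block size in bytes
    out.close();
  }
}
```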

Re: How do I use MapFile Reader and Writer

2010-04-13 Thread Amogh Vasekar
Hi, The file system object will contain the scheme, authority etc for the given uri or path. The conf object acts as a reference ( unable to get better terminology ) to this info. Looking at the MapFileOutputFormat should help provide a better understanding as to how writers and readers are

Re: swapping on hadoop

2010-03-31 Thread Amogh Vasekar
Hi, (#maxMapTasksPerTaskTracker + #maxReduceTasksPerTaskTracker) * JVMHeapSize < PhysicalMemoryOnNode. The tasktracker and datanode daemons also take up memory, 1GB each by default I think. Is that accounted for? Could there be an issue with HDFS data or metadata taking up memory? Is the

Re: execute mapreduce job on multiple hdfs files

2010-03-23 Thread Amogh Vasekar
Hi, Piggybacking on Gang's reply, to add files / dirs recursively you can use FileStatus / listStatus to determine if it's a file or dir and add as needed ( check the FileStatus API for this ). There is a patch which does this for FileInputFormat

Re: split number

2010-03-21 Thread Amogh Vasekar
Hi, AFAIK, it is a hint. Depending on the block size, minimum split size and this hint the exact number of splits is computed. So if you have total_size/hint < block size but greater than min split size, you should see the exact number. This is how I understand it, please let me know if I'm

Re: when to sent distributed cache file

2010-03-18 Thread Amogh Vasekar
Hi Gang, Yes, the time to distribute files is counted as part of the job's running time ( more specifically the set-up time ). The time is essentially for the TT to copy the files specified in distributed cache to its local FS, generally from HDFS unless you have a separate FS for the JT. So in general

Re: Is there an easy way to clear old jobs from the jobtracker webpage?

2010-03-18 Thread Amogh Vasekar
Hi, The property mapred.jobtracker.completeuserjobs.maximum specifies the number of jobs to be kept on the JT page at any time. After this they are available under the history page. Probably setting this to 0 will do the trick? Amogh On 3/17/10 10:09 PM, Raymond Jennings III

Re: java.lang.NullPointerException at org.apache.hadoop.mapred.IFile$Writer.<init>(IFile.java:102)

2010-03-18 Thread Amogh Vasekar
Hi, http://hadoop.apache.org/common/docs/current/native_libraries.html Should answer your questions. Amogh On 3/18/10 10:48 PM, jiang licht licht_ji...@yahoo.com wrote: I got the following error when I tried to do gzip compression on map output, using hadoop-0.20.1. settings in

Re: combiner with GroupComparator

2010-03-07 Thread Amogh Vasekar
Hi, Not sure if this can be done. Here's a relevant snippet of code: { super(inputCounter, conf, reporter); combinerClass = cls; keyClass = (Class<K>) job.getMapOutputKeyClass(); valueClass = (Class<V>) job.getMapOutputValueClass(); comparator = (RawComparator<K>)

Re: cluster involvement trigger

2010-03-01 Thread Amogh Vasekar
map-reduce completes). Does the name node need to store the metadata of each individual file during the unpacking for this case? -Michael On Feb 25, 2010, at 10:31 PM, Amogh Vasekar wrote: Hi, The number of mappers initialized depends largely on your input format ( the getSplits of your

Re: cluster involvement trigger

2010-02-25 Thread Amogh Vasekar
Hi, The number of mappers initialized depends largely on your input format ( the getSplits of your input format) , (almost all) input formats available in hadoop derive from fileinputformat, hence the 1 mapper per file block notion ( this actually is 1 mapper per split ). You say that you have

Re: How are intermediate key/value pairs materialized between map and reduce?

2010-02-24 Thread Amogh Vasekar
file writing besides the context.write() for the intermediate records. Thanks, Tim Am 24.02.2010 05:28, schrieb Amogh Vasekar: Hi, Can you let us know what is the value for : Map input records Map spilled records Map output bytes Is there any side effect file written? Thanks, Amogh

Re: How are intermediate key/value pairs materialized between map and reduce?

2010-02-23 Thread Amogh Vasekar
Hi, Can you let us know what is the value for : Map input records Map spilled records Map output bytes Is there any side effect file written? Thanks, Amogh On 2/23/10 8:57 PM, Tim Kiefer tim-kie...@gmx.de wrote: No... 900GB is in the map column. Reduce adds another ~70GB of FILE_BYTES_WRITTEN

Re: java.io.IOException: Spill failed when using w/ GzipCodec for Map output

2010-02-22 Thread Amogh Vasekar
Hi, Can you please let us know what platform you are running on your hadoop machines? For gzip and lzo to work, you need supported hadoop native libraries ( I remember reading on this somewhere in hadoop wiki :) ) Amogh On 2/23/10 8:16 AM, jiang licht licht_ji...@yahoo.com wrote: I have a

Re: java.io.IOException: Spill failed when using w/ GzipCodec for Map output

2010-02-22 Thread Amogh Vasekar
--- On Mon, 2/22/10, Amogh Vasekar am...@yahoo-inc.com wrote: From: Amogh Vasekar am...@yahoo-inc.com Subject: Re: java.io.IOException: Spill failed when using w/ GzipCodec for Map output To: common-user@hadoop.apache.org common-user@hadoop.apache.org Date: Monday, February 22, 2010, 11:27 PM Hi, Can

Re: Unexpected empty result problem (zero-sized part-### files)?

2010-02-21 Thread Amogh Vasekar
So, considering this situation of loading mixed good and corrupted .gz files, how do I still get expected results? Try manipulating the value mapred.max.map.failures.percent to a % of files you expect to be corrupted / an acceptable data-skip percent. Amogh On 2/21/10 7:17 AM, jiang licht
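A driver-side sketch of that knob (old API; 20 is an arbitrary example tolerance):

```java
import org.apache.hadoop.mapred.JobConf;

// Hypothetical sketch: let the job succeed even if up to 20% of map
// tasks fail, e.g. because some input .gz files are corrupted.
JobConf conf = new JobConf();
conf.setMaxMapTaskFailuresPercent(20);
// Equivalent: conf.setInt("mapred.max.map.failures.percent", 20);
```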

Re: basic hadoop job help

2010-02-18 Thread Amogh Vasekar
Hi, The hadoop meet last year has some very interesting business solutions discussed: http://www.cloudera.com/company/press-center/hadoop-world-nyc/ Most of the companies in there have shared their methodology on their blogs / on slideshare. One I have handy is:

Re: Pass the TaskId from map to Reduce

2010-02-18 Thread Amogh Vasekar
Hi Ankit, however the issue that I am facing is that I was expecting all the maps to finish before any reduce starts. This is exactly how it happens: reducers poll map tasks for data and begin user code only after all maps complete. when is the close() function called, after every map or after all

Re: Strange behaviour from a custom Writable

2010-02-08 Thread Amogh Vasekar
Hi, Yes, the same location is populated with different values ( returned by iter.next() ) for optimization reasons. There is a new patch which will allow you to mark() and reset() the iterator so that you can buffer required values ( equivalently you can do that yourself, it's anyway in-mem for the
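Until then, the usual workaround is to copy each value before buffering it, since the framework reuses one object across the iteration; a sketch:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;

// Hypothetical sketch: the reduce iterator hands back the *same* Text
// instance each time, so buffer deep copies, never the references.
List<Text> bufferCopies(Iterable<Text> values) {
  List<Text> buffered = new ArrayList<Text>();
  for (Text v : values) {
    buffered.add(new Text(v)); // copy before storing
  }
  return buffered;
}
```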

Re: avoiding data redistribution in iterative mapreduce

2010-02-08 Thread Amogh Vasekar
redistribution in this case? If that is the case, can a custom scheduler be written -- will it be any easy task? Regards, Raghava. On Thu, Feb 4, 2010 at 2:52 AM, Amogh Vasekar am...@yahoo-inc.com wrote: Hi, Will there be a re-assignment of Map Reduce nodes by the Master? In general using available

Re: Barrier between reduce and map of the next round

2010-02-08 Thread Amogh Vasekar
has to be defined before the job is started, right? But because I don't know the value of K beforehand, I want the chain to continue forever until some counter in reduce task is zero. Felix Halim On Thu, Feb 4, 2010 at 3:53 PM, Amogh Vasekar am...@yahoo-inc.com wrote: However, from ri to m(i+1

Re: configuration file

2010-02-04 Thread Amogh Vasekar
Hi, A shot in the dark, is the conf file in your classpath? If yes, are the parameters you are trying to override marked final? Amogh On 2/4/10 3:18 AM, Gang Luo lgpub...@yahoo.com.cn wrote: Hi, I am writing script to run whole bunch of jobs automatically. But the configuration file doesn't

Re: avoiding data redistribution in iterative mapreduce

2010-02-03 Thread Amogh Vasekar
(jobConf); Do something to check termination condition} If I write something like that in the code, would not the Map node run on the same data chunk it has each time? Will there be a re-assignment of Map Reduce nodes by the Master? Regards, Raghava. On Wed, Feb 3, 2010 at 9:59 AM, Amogh

Re: Barrier between reduce and map of the next round

2010-02-03 Thread Amogh Vasekar
However, from ri to m(i+1) there is an unnecessary barrier. m(i+1) should not need to wait for all reducers ri to finish, right? Yes, but r(i+1) can't be in the same job, since that requires another sort and shuffle phase ( barrier ). So you would end up doing, job(i) : m(i) -> r(i) -> m(i+1) .

Re: Input file format doubt

2010-01-28 Thread Amogh Vasekar
Hi, For global line numbers, you would need to know the ordering within each split generated from the input file. The standard input formats provide offsets in splits, so if the records are of equal length you can compute some kind of numbering. I remember someone had implemented sequential

Re: Input file format doubt

2010-01-28 Thread Amogh Vasekar
-program.html. Your particular solution won't work, because I need to do additional processing between the two passes. --gordon On Wed, Nov 25, 2009 at 1:50 AM, Amogh Vasekar am...@yahoo-inc.com wrote: Amogh On 1/28/10 4:03 PM, Ravi ravindra.babu.rav...@gmail.com wrote: Thank you Amogh. On Thu, Jan 28

Re: fine granularity operation on HDFS

2010-01-28 Thread Amogh Vasekar
Hi Gang, Yes, PathFilters work only on file paths. I meant you can include such logic at the split level. The input format's getSplits() method is responsible for computing and adding splits to a list container, for which the JT initializes mapper tasks. You can override the getSplits() method
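A sketch of overriding getSplits() in the old API (the size filter is just a placeholder for whatever split-level logic is needed):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical sketch: drop or rewrite splits before the JT creates map tasks.
public class FilteringInputFormat extends TextInputFormat {
  @Override
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    InputSplit[] all = super.getSplits(job, numSplits);
    List<InputSplit> kept = new ArrayList<InputSplit>();
    for (InputSplit split : all) {
      if (split.getLength() > 0) { // replace with your own split-level test
        kept.add(split);
      }
    }
    return kept.toArray(new InputSplit[kept.size()]);
  }
}
```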

Re: File split query

2010-01-28 Thread Amogh Vasekar
Hi, In general, the file split may break the records; it's the responsibility of the record reader to present the record as a whole. If you use the standard available InputFormats, the framework will make sure complete records are presented in the key,value pair. Amogh On 1/29/10 9:04 AM, Udaya Lakshmi

Re: distributing hdfs put

2010-01-27 Thread Amogh Vasekar
Yes, the parameter is mapred.task.timeout, in ms. You can also update status / output to stdout after some time chunks to avoid this :) Amogh On 1/28/10 10:52 AM, prasenjit mukherjee pmukher...@quattrowireless.com wrote: Now I see. The tasks are failing with the following error message : *Task
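A sketch of the second option, keeping a slow task alive by reporting progress (old API; the chunked loop stands in for the real work):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical sketch: report progress between chunks of slow work so the
// tasktracker's mapred.task.timeout clock keeps getting reset.
public class SlowCopyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    for (int chunk = 0; chunk < 100; chunk++) {
      // ...do one chunk of the slow copy here...
      reporter.progress(); // tells the tracker the task is still alive
    }
    out.collect(value, new Text("done"));
  }
}
```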

Re: distributing hdfs put

2010-01-27 Thread Amogh Vasekar
this property only on master's hadoop-site.xml will do or I need to do it on all the slaves as well ? Any way I can do this from PIG ( or I guess I am asking too much here :) ) On Thu, Jan 28, 2010 at 10:57 AM, Amogh Vasekar am...@yahoo-inc.com wrote: Yes, parameter is mapred.task.timeout in mS

Re: When exactly is combiner invoked?

2010-01-27 Thread Amogh Vasekar
Hi, To elaborate a little on Gang's point, the buffer threshold is limited by io.sort.spill.percent, during which spills are created. If the number of spills is more than min.num.spills.for.combine, the combiner gets invoked on the spills created before writing to disk. I'm not sure what exactly

Re: Debugging Partitioner problems

2010-01-20 Thread Amogh Vasekar
Can I tell hadoop to save the map outputs per reducer to be able to inspect what's in them? Setting keep.tasks.files.pattern will save mapper output; set this regex to match your job/task as need be. But this will eat up a lot of local disk space. The problem most likely is your data ( or

Re: chained mappers reducers

2010-01-19 Thread Amogh Vasekar
Hi, Can you elaborate on your case a little? If you need sort and shuffle ( ie outputs of different reducer tasks of R1 to be aggregated in some way ), you have to write another map-red job. If you need to process only local reducer data ( ie your reducer output key is the same as the input key ),

Re: rmr: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /op. Name node is in safe mode.

2010-01-19 Thread Amogh Vasekar
HDFS. -Thanks for the pointer. Prasen On Tue, Jan 19, 2010 at 10:47 AM, Amogh Vasekar am...@yahoo-inc.com wrote: Hi, When NN is in safe mode, you get a read-only view of the hadoop file system. ( since NN is reconstructing its image of FS ) Use hadoop dfsadmin -safemode get to check

Re: Is it always called part-00000?

2010-01-18 Thread Amogh Vasekar
Hi, Do your steps qualify as separate MR jobs? Then using the JobClient APIs should be more than sufficient for such dependencies. You can add the whole output directory as input to another job to read all files, and provide a PathFilter to ignore any files you don't want to be processed, like side
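A sketch of such a filter (old API; the underscore/dot test mirrors the usual side-file naming):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Hypothetical sketch: accept only real data files, skipping side files
// such as _logs or hidden .crc entries in a previous job's output dir.
public class DataFilesOnly implements PathFilter {
  public boolean accept(Path path) {
    String name = path.getName();
    return !name.startsWith("_") && !name.startsWith(".");
  }
}
// Driver side: FileInputFormat.setInputPathFilter(conf, DataFilesOnly.class);
```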

Re: rmr: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /op. Name node is in safe mode.

2010-01-18 Thread Amogh Vasekar
Hi, When NN is in safe mode, you get a read-only view of the hadoop file system. ( since NN is reconstructing its image of FS ) Use hadoop dfsadmin -safemode get to check if in safe mode. hadoop dfsadmin -safemode leave to leave safe mode forcefully. Or use hadoop dfsadmin -safemode wait to

Re: Is it possible to share a key across maps?

2010-01-13 Thread Amogh Vasekar
and the new APIs. I was digging for that answer for awhile. Thanks. --- On Tue, 1/12/10, Amogh Vasekar am...@yahoo-inc.com wrote: From: Amogh Vasekar am...@yahoo-inc.com Subject: Re: Is it possible to share a key across maps? To: common-user@hadoop.apache.org common-user@hadoop.apache.org, raymondj

Re: How do I sum by Key in the Reduce Phase AND keep the initial value

2010-01-12 Thread Amogh Vasekar
Hi, I ran into a very similar situation quite some time back and had then encountered this : http://issues.apache.org/jira/browse/HADOOP-475 After speaking to a few Hadoop folks, they had said complete cloning was not a straightforward option for some optimization reasons. There were a few

Re: What can cause: Map output copy failure

2010-01-08 Thread Amogh Vasekar
Hi, Can you please let us know your system configuration running hadoop? The error you see is when the reducer is copying its respective map output into memory. The parameter mapred.job.shuffle.input.buffer.percent can be manipulated for this ( a bunch of others will also help you optimize sort

Re: File _partition.lst does not exist.

2009-12-15 Thread Amogh Vasekar
Hi, I believe you need to add the partition file to the distributed cache so that all tasks have it. The terasort code uses this sampler, you can refer to that if needed. Amogh On 12/15/09 5:06 PM, afarsek adji...@gmail.com wrote: Hi, I'm using the InputSampler.RandomSampler to perform a
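A driver-side sketch along the lines of the terasort setup (old API; paths and sampler parameters are example values):

```java
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.InputSampler;
import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

// Hypothetical sketch: sample the input, write the partition list, and
// ship it to every task via the distributed cache.
JobConf job = new JobConf();
job.setPartitionerClass(TotalOrderPartitioner.class);
Path partitionFile = new Path("/user/example/_partition.lst");
TotalOrderPartitioner.setPartitionFile(job, partitionFile);
InputSampler.Sampler<Text, Text> sampler =
    new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10);
InputSampler.writePartitionFile(job, sampler);
DistributedCache.addCacheFile(
    new URI(partitionFile.toString() + "#_partition.lst"), job);
DistributedCache.createSymlink(job);
```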

Re: Does Using MultipleTextOutputFormat Require the Deprecated API?

2009-12-14 Thread Amogh Vasekar
. I further assume I need only apply the latest patch, which is 5. Am I correct? On Wed, Dec 9, 2009 at 7:30 AM, Amogh Vasekar am...@yahoo-inc.com wrote: http://issues.apache.org/jira/browse/MAPREDUCE-370 You'll have to work around for now / try to apply patch. Amogh On 12/9/09 8:54 PM

Re: Re: Re: Re: Re: map output not equal to reduce input

2009-12-14 Thread Amogh Vasekar
#. I didn't use SkipBadRecords class. I think by default the feature is disabled. So, it should have nothing to do with this. I do my test using tables of TPC-DS. If I run my job on some 'toy tables' I make, the statistics are correct. -Gang ----- Original Message ----- From: Amogh Vasekar am...@yahoo

Re: Re: Re: Re: map output not equal to reduce input

2009-12-10 Thread Amogh Vasekar
Hi, The counters are updated as the records are *consumed*, for both mapper and reducer. Can you confirm if all the values returned by your iterators are consumed on the reduce side? Also, do you have the feature of skipping bad records switched on? Amogh On 12/11/09 4:32 AM, Gang Luo

Re: Does Using MultipleTextOutputFormat Require the Deprecated API?

2009-12-09 Thread Amogh Vasekar
http://issues.apache.org/jira/browse/MAPREDUCE-370 You'll have to work around for now / try to apply patch. Amogh On 12/9/09 8:54 PM, Geoffry Roberts geoffry.robe...@gmail.com wrote: Aaron, I am using 0.20.1 and I'm not finding org.apache.hadoop.mapreduce.lib.output.MultipleOutputs. I'm

Re: Re: return in map

2009-12-06 Thread Amogh Vasekar
Hi, If the file doesn't exist, Java will error out. For partial skips, the o.a.h.mapreduce.Mapper class provides a method run(), which determines if the end of the split is reached and if not, calls map() on your k,v pair. You may override this method to include flag checks too and if that fails, the
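A sketch of that override (new API; the flag is a stand-in for whatever skip condition applies):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: stop consuming the split early once 'done' trips,
// mirroring the default run() loop otherwise.
public class EarlyExitMapper extends Mapper<LongWritable, Text, Text, Text> {
  private boolean done = false; // set true from map() to stop early

  @Override
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (!done && context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
  }
}
```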

Re: Hadoop with Multiple Inpus and Outputs

2009-12-03 Thread Amogh Vasekar
Hi, Please try removing the combiner and running. I know that if you use multiple outputs from within a mapper, those k,v pairs are not part of the sort and shuffle phase. Your combiner is the same as your reducer which uses mos, and that might be an issue on the map side. If I'm to take a guess, mos writes to a

Re: How can I change the mapreduce output coder?

2009-12-01 Thread Amogh Vasekar
Hi, What are your intermediate output K,V class formats? “Text” format is inherently UTF-8 encoded. If you want end-to-end processing to be via GBK encoding, you may have to write a custom writable type. Amogh On 11/30/09 7:09 PM, 郭鹏 gpcus...@gmail.com wrote: I know the default output coder

Re: Problem with mapred.job.reuse.jvm.num.tasks

2009-11-30 Thread Amogh Vasekar
Hi, Task slots reuse the JVM over the course of the entire job, right? Specifically, I would like to point to: http://issues.apache.org/jira/browse/MAPREDUCE-453?focusedCommentId=12619492&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12619492 Thanks, Amogh On 11/30/09 5:44

Re: The name of the current input file during a map

2009-11-26 Thread Amogh Vasekar
(); System.out.println("mapred.input.file=" + cfg.get("mapred.input.file")); displays null, so maybe this fell out by mistake in the api change? Regards Saptarshi On Thu, Nov 26, 2009 at 2:13 AM, Saptarshi Guha saptarshi.g...@gmail.com wrote: Thank you. Regards Saptarshi On Thu, Nov 26, 2009 at 2:10 AM, Amogh

Re: Saving Intermediate Results from the Mapper

2009-11-24 Thread Amogh Vasekar
Hi, I'm not sure if this will apply to your case since I'm not aware of the common part of job2:mapper and job3:mapper, but would like to give it a shot. The whole process can be combined into a single mapred job. The mapper will read a record and process till the saved-data part, then for each

Re: Hadoop Performance

2009-11-24 Thread Amogh Vasekar
Hi, For near real time performance you may try Hbase. I had read about Streamy doing this, and their hadoop-world-nyc ppt is available on their blog: http://devblog.streamy.com/2009/07/24/streamy-hadoop-summit-hbase-goes-realtime/ Amogh On 11/25/09 1:31 AM, onur ascigil

Re: Saving Intermediate Results from the Mapper

2009-11-22 Thread Amogh Vasekar
Hi, keep.tasks.files.pattern is what you need; as the name suggests it's a pattern match on intermediate outputs generated. Wrt copying map data to HDFS, your mapper's close() method should help you achieve this, but might slow up your tasks. Amogh On 11/23/09 8:08 AM, Jeff Zhang

Re: How to handle imbalanced data in hadoop ?

2009-11-18 Thread Amogh Vasekar
Hi, This is the time for all three phases of the reducer, right? I think it's due to the constant spilling for a single key to disk since the map partitions couldn't be held in-mem due to the buffer limit. Did the other reducer have numerous keys with a low number of values ( ie smaller partitions? )

Re: new MR API: MultipleOutputFormat

2009-11-18 Thread Amogh Vasekar
MultipleOutputFormat and MOS are to be merged : http://issues.apache.org/jira/browse/MAPREDUCE-370 Amogh On 11/18/09 12:03 PM, Y G gymi...@gmail.com wrote: in the old MR API, there is the MultipleOutputFormat class which I can use to customize the reduce output file name. It's very useful for me. but I

Re: architecture help

2009-11-15 Thread Amogh Vasekar
I would like the connection management to live separately from the mapper instances per node. The JVM reuse option in Hadoop might be helpful for you in this case. Amogh On 11/16/09 6:22 AM, yz5od2 woods5242-outdo...@yahoo.com wrote: Hi, a) I have a Mapper ONLY job, the job reads in records,

Re: Multiple Input Paths

2009-11-03 Thread Amogh Vasekar
Hi Mark, A future release of Hadoop will have a MultipleInputs class, akin to MultipleOutputs. This would allow you to have a different input format and mapper depending on the path you are getting the split from. It uses special Delegating[mapper/input] classes to resolve this. I understand
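Usage would look roughly like this (old-API classes; MapperA/MapperB and the paths are hypothetical):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;

// Hypothetical sketch: a different input format and mapper per path.
JobConf conf = new JobConf();
MultipleInputs.addInputPath(conf, new Path("/data/a"),
    TextInputFormat.class, MapperA.class);
MultipleInputs.addInputPath(conf, new Path("/data/b"),
    KeyValueTextInputFormat.class, MapperB.class);
```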

Re: too many 100% mapper does not complete / finish / commit

2009-11-02 Thread Amogh Vasekar
Hi, Quick questions... Are you creating too many small files? Are there any task side files being created? Does the NN heap have enough space for the metadata? Any details on its general health will probably be helpful to people on the list. Amogh On 11/2/09 2:02 PM, Zhang Bingjun (Eddy)

Re: Multiple Input Paths

2009-11-02 Thread Amogh Vasekar
Mark, Set-up for a mapred job consumes a considerable amount of time and resources, so a single job is preferred if possible. You can add multiple paths to your job, and if you need different processing logic depending upon the input being consumed, you can use the parameter map.input.file in
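A sketch of reading map.input.file in the old API (the /logs/ test is an arbitrary example):

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

// Hypothetical sketch: branch the processing logic on the file backing
// the current split.
public class PathAwareMapper extends MapReduceBase {
  private boolean isLogData;

  @Override
  public void configure(JobConf job) {
    String inputFile = job.get("map.input.file");
    isLogData = inputFile != null && inputFile.contains("/logs/");
  }
  // map() can then switch on isLogData.
}
```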

Re: Distribution of data in nodes with different storage capacity

2009-10-28 Thread Amogh Vasekar
Hi, Rebalancer should help you : http://issues.apache.org/jira/browse/HADOOP-1652 Amogh On 10/28/09 2:54 PM, Vibhooti Verma verma.vibho...@gmail.com wrote: Hi All, We are facing the issue with distribution of data in a cluster where nodes have differnt storage capacity. We have 4 nodes with

Re: Problem to create sequence file for

2009-10-27 Thread Amogh Vasekar
Hi Bhushan, If splitting input files is an option, why don't you let Hadoop do this for you? If need be you may use a custom input format and SequenceFile*OutputFormat. Amogh On 10/27/09 7:55 PM, bhushan_mahale bhushan_mah...@persistent.co.in wrote: Hi Jason, Thanks for the reply. The string

Re: How To Pass Parameters To Mapper Through Main Method

2009-10-25 Thread Amogh Vasekar
Hi, Many options available here. You can use jobconf (0.18) / context's conf (0.20) to pass these lines across all tasks ( assuming the size isn't relatively large ) and use configure / setup to retrieve these.. Or use distributed cache to read a file containing these lines ( possibly with jvm
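A sketch of the first option (new API; the key name is hypothetical):

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Driver side (hypothetical key): conf.set("my.app.lines", linesAsString);
// Task side: retrieve it once in setup().
public class ParamMapper extends Mapper<LongWritable, Text, Text, Text> {
  private String lines;

  @Override
  protected void setup(Context context) {
    lines = context.getConfiguration().get("my.app.lines");
  }
}
```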

Re: How to skip fail map to done the job

2009-10-20 Thread Amogh Vasekar
For skipping failed tasks try : mapred.max.map.failures.percent Amogh On 10/21/09 8:58 AM, 梁景明 futur...@gmail.com wrote: hi, I use hadoop 0.20 and 8 nodes; there is a job that has 130 maps to run, and it completed 128 maps, but only 2 maps fail, and their failure in my case is acceptable, but the job fail

Re: Hive vs. Vertica

2009-10-19 Thread Amogh Vasekar
Yahoo! had an Everest MPP framework based on columnar storage; I don't know how popular it was, but it required pretty high-end machines. Zebra I guess partially aims at getting that into Hadoop using the TFile implementation, and its source is available in contrib. Amogh On 10/19/09 10:18 AM,

Re: proper way to configure classes required by mapper job

2009-10-19 Thread Amogh Vasekar
Hi, Check the distributed cache APIs, it provides various functionalities to distribute and add jars to classpath on compute machines. Amogh On 10/19/09 3:38 AM, yz5od2 woods5242-outdo...@yahoo.com wrote: Hi, What is the preferred method to distribute the classes (in various Jars) to my
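For example, a driver-side sketch (the jar path on HDFS is hypothetical):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: put a jar that already sits on HDFS onto every
// task's classpath.
Configuration conf = new Configuration();
DistributedCache.addFileToClassPath(new Path("/libs/mylib.jar"), conf);
// addArchiveToClassPath(...) works similarly for zip/tar archives.
```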

Re: Hadoop dfs can't allocate memory with enough hard disk space when data gets huge

2009-10-19 Thread Amogh Vasekar
Hi, It would be more helpful if you provide the exact error here. Also, hadoop uses the local FS to store intermediate data, along with HDFS for final output. If your job is memory intensive, try limiting the number of tasks you are running in parallel on a machine. Amogh On 10/19/09 8:27 AM,

Re: How to get IP address of the machine where map task runs

2009-10-15 Thread Amogh Vasekar
Nguyen Dinh munt...@gmail.com wrote: Thanks Amogh. For my application, I want each map task to report to me where it's running. However, I have no idea how to use the Java InetAddress APIs to get that info. Could you explain more? Van On Wed, Oct 14, 2009 at 2:16 PM, Amogh Vasekar am...@yahoo-inc.com

RE: Easiest way to pass dynamic variable to Map Class

2009-10-05 Thread Amogh Vasekar
Hi, I guess configure() is now setup(), and using ToolRunner you can create a configuration / context to mimic the required behavior. Thanks, Amogh -Original Message- From: Amandeep Khurana [mailto:ama...@gmail.com] Sent: Tuesday, October 06, 2009 5:43 AM To: common-user@hadoop.apache.org

RE: How can I assign the same mapper class with different data?

2009-10-05 Thread Amogh Vasekar
Hi Huang, I haven't worked with HBase but in general, if you want to have control over what data split goes as a whole to a mapper, the easiest way is to compress that split into a single file, making as many split files as needed. If you need to know what file is currently being processed, you can use

RE: Best Idea to deal with following situation

2009-09-29 Thread Amogh Vasekar
Along with partitioner, try to plug in a combiner. It would provide significant performance gains. Not sure about the algo you use, but might have to tweak that a little to facilitate a combiner. Thanks, Amogh -Original Message- From: Chandraprakash Bhagtani

RE: Distributed cache - are files unique per job?

2009-09-29 Thread Amogh Vasekar
I believe the framework checks timestamps on HDFS for marking an already available copy of the file valid or invalid, since the archived files are not cleaned up till a certain du limit is reached, and no APIs for cleanup are available. There was a thread on this some time back on the list. Amogh

RE: Program crashed when volume of data getting large

2009-09-23 Thread Amogh Vasekar
Hi, Please check the namenode heap usage. Your cluster may have too many files to handle / too little free space. It is generally available in the UI. This is one of the causes I have seen for the Timeout. Amogh -Original Message- From: Kunsheng Chen [mailto:ke...@yahoo.com] Sent:

JVM reuse

2009-09-15 Thread Amogh Vasekar
Hi All, Regarding the JVM reuse feature incorporated, the docs say reuse is generally recommended for streaming and pipes jobs. I'm a little unclear on this and any pointers will be appreciated. Also, in what scenarios will this feature be helpful for java mapred jobs? Thanks, Amogh

RE: about hadoop jvm allocation in job execution

2009-09-15 Thread Amogh Vasekar
Hi, Funnily enough, I was looking at it just yesterday. http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Task+JVM+Reuse Thanks, Amogh -Original Message- From: Zhimin [mailto:wan...@cs.umb.edu] Sent: Tuesday, September 15, 2009 10:53 PM To: core-u...@hadoop.apache.org
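The knob itself, as a driver-side sketch (-1 means unlimited reuse within the job):

```java
import org.apache.hadoop.mapred.JobConf;

// Hypothetical sketch: let each task JVM run any number of this job's tasks.
JobConf conf = new JobConf();
conf.setNumTasksToExecutePerJvm(-1);
// Equivalent: conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
```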

RE: DistributedCache purgeCache()

2009-09-07 Thread Amogh Vasekar
: DistributedCache purgeCache() Thanks for your swift response. But where can I find deleteCache()? Thanks. -Original Message- From: Amogh Vasekar [mailto:am...@yahoo-inc.com] Sent: Thu 9/3/2009 2:44 PM To: common-user@hadoop.apache.org Subject: RE: DistributedCache purgeCache() AFAIK

RE: multi core nodes

2009-09-04 Thread Amogh Vasekar
Before setting the task limits, do take into account the memory considerations ( many archive posts on this can be found ). Also, your tasktracker and datanode daemons will run on that machine as well, so you might want to set aside some processing power for that. Cheers! Amogh -Original

RE: Some issues!

2009-09-04 Thread Amogh Vasekar
Have a look at JobClient; it should suffice. Cheers! Amogh -Original Message- From: bharath vissapragada [mailto:bharathvissapragada1...@gmail.com] Sent: Friday, September 04, 2009 9:15 PM To: common-user@hadoop.apache.org Subject: Re: Some issues! Hey , I have one more doubt ,

RE: difference between mapper and map runnable

2009-08-28 Thread Amogh Vasekar
Hi, Mapper is used to process the K,V pair passed to it; MapRunnable is an interface which, when implemented, is responsible for generating a conforming K,V pair and passing it to the Mapper. Cheers! Amogh -Original Message- From: Rakhi Khatwani [mailto:rkhatw...@gmail.com] Sent: Thursday, August

RE: Location reduce task running.

2009-08-24 Thread Amogh Vasekar
boxes. Do you have any suggestion? I am thinking about JVM re-use feature of Hadoop or I can set up a chain of two map-reduce pairs. Best regards. Fang. On Mon, Aug 24, 2009 at 1:25 PM, Amogh Vasekar am...@yahoo-inc.commailto:am...@yahoo-inc.com wrote: No, but if you want a reducer like

RE: Hadoop streaming: How is data distributed from mappers to reducers?

2009-08-24 Thread Amogh Vasekar
Hadoop will make sure that every k,v pair with the same key will land in the same reducer and be consumed in a single reduce instance. -Original Message- From: Nipun Saggar [mailto:nipun.sag...@gmail.com] Sent: Tuesday, August 25, 2009 10:41 AM To: common-user@hadoop.apache.org Subject: Re:

RE: MR job scheduler

2009-08-21 Thread Amogh Vasekar
I'm not sure that is the case with Hadoop. I think it assigns a reduce task to an available tasktracker at any instant, since a reducer polls the JT for completed maps. And if it were the case as you said, a reducer wouldn't be initialized until all maps have completed, after which the copy phase would

RE: MR job scheduler

2009-08-21 Thread Amogh Vasekar
PM To: common-user@hadoop.apache.org Subject: Re: MR job scheduler Amogh i think Reduce phase starts only when all the map phases are completed . Because it needs all the values corresponding to a particular key! 2009/8/21 Amogh Vasekar am...@yahoo-inc.com I'm not sure that is the case

RE: MR job scheduler

2009-08-21 Thread Amogh Vasekar
across the network(because already many values to that key are on that machine where the map phase completed).. 2009/8/21 Amogh Vasekar am...@yahoo-inc.com Yes, but the copy phase starts with the initialization for a reducer, after which it would keep polling for completed map tasks to fetch

RE: passing job arguments as an xml file

2009-08-20 Thread Amogh Vasekar
Hi, GenericOptionsParser is customized only for Hadoop-specific params: "GenericOptionsParser recognizes several standard command line arguments, enabling applications to easily specify a namenode, a jobtracker, additional configuration resources etc." Ideally, all params

RE: utilizing all cores on single-node hadoop

2009-08-17 Thread Amogh Vasekar
While setting mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum, please consider the memory usage your application might have, since all tasks will be competing for the same memory and that might reduce overall performance. Thanks, Amogh -Original Message- From:

RE: Some tasks fail to report status between the end of the map and the beginning of the merge

2009-08-05 Thread Amogh Vasekar
10 mins reminds me of the parameter mapred.task.timeout . This is configurable. Or alternatively you might just do a sysout to let the tracker know of the task's existence ( not an ideal solution though ) Thanks, Amogh -Original Message- From: Mathias De Maré [mailto:mathias.dem...@gmail.com] Sent:

RE: :!

2009-08-03 Thread Amogh Vasekar
Maybe I'm missing the point, but in terms of execution performance benefit, what does copying to dfs and then compressing to be fed to a map/reduce job provide? Isn't it better to compress offline / outside the latency window and make it available on dfs? Also, your mapreduce program will launch one
