Re: Create and write files on mounted HDFS via java api

2013-04-20 Thread Hemanth Yamijala
Are you using Fuse for mounting HDFS ? On Fri, Apr 19, 2013 at 4:30 PM, lijinlong wakingdrea...@163.com wrote: I mounted HDFS to a local directory for storage, that is /mnt/hdfs. I can do basic file operations such as create, remove, copy etc. just using Linux commands and the GUI. But when I tried

Re: Mapreduce

2013-04-20 Thread Hemanth Yamijala
As this is an HBase-specific question, it will be better to ask it on the HBase user mailing list. Thanks Hemanth On Fri, Apr 19, 2013 at 10:46 PM, Adrian Acosta Mitjans amitj...@estudiantes.uci.cu wrote: Hello: I'm working on a project, and I'm using HBase to store the data,

Re: Errors about MRunit

2013-04-20 Thread Hemanth Yamijala
Hi, If your goal is to use the new API, I am able to get it to work with the following Maven configuration: <dependency> <groupId>org.apache.mrunit</groupId> <artifactId>mrunit</artifactId> <version>0.9.0-incubating</version> <classifier>hadoop1</classifier> </dependency> If I
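For reference, a minimal MRUnit test against the new API could look like the sketch below. It drives the base Mapper class (an identity map in the new API) purely to show the driver plumbing; a real test would substitute the mapper under test.

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mrunit.mapreduce.MapDriver;
    import org.junit.Test;

    public class IdentityMapperTest {
      @Test
      public void passesRecordsThrough() throws Exception {
        // The base Mapper writes each (key, value) pair through unchanged.
        MapDriver<LongWritable, Text, LongWritable, Text> driver =
            MapDriver.newMapDriver(new Mapper<LongWritable, Text, LongWritable, Text>());
        driver.withInput(new LongWritable(1), new Text("hadoop"))
              .withOutput(new LongWritable(1), new Text("hadoop"))
              .runTest();
      }
    }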

Re: Create and write files on mounted HDFS via java api

2013-04-20 Thread Hemanth Yamijala
Sorry - no. I just wanted to know if you were using FUSE, because I knew of no other way of mounting HDFS. Basically I was wondering if some libraries needed to be on the system path for the Java programs to work. From your response it looks like you aren't using FUSE. So what are you using to mount ?

Re: Errors about MRunit

2013-04-20 Thread Hemanth Yamijala
+ user@ Please do continue the conversation on the mailing list, in case others like you can benefit from / contribute to the discussion Thanks Hemanth On Sat, Apr 20, 2013 at 5:32 PM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, My code is working with having mrunit-0.9.0

Re: Which version of Hadoop

2013-04-20 Thread Hemanth Yamijala
2.x.x provides NN high availability. http://hadoop.apache.org/docs/r2.0.3-alpha/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithQJM.html However, it is in alpha stage right now. Thanks hemanth On Sat, Apr 20, 2013 at 5:30 PM, Ascot Moss ascot.m...@gmail.com wrote: Hi, I am new to

Re: How to configure mapreduce archive size?

2013-04-18 Thread Hemanth Yamijala
From: Hemanth Yamijala [mailto:yhema...@thoughtworks.com] Sent: Wednesday, April 17, 2013 9:11 PM To: user@hadoop.apache.org Subject: Re: How to configure mapreduce archive size? The check for cache file cleanup is controlled by the property

Re: Run multiple HDFS instances

2013-04-18 Thread Hemanth Yamijala
Are you trying to implement something like namespace federation, that's a part of Hadoop 2.0 - http://hadoop.apache.org/docs/r2.0.3-alpha/hadoop-project-dist/hadoop-hdfs/Federation.html On Thu, Apr 18, 2013 at 10:02 PM, Lixiang Ao aolixi...@gmail.com wrote: Actually I'm trying to do something

Re: How to configure mapreduce archive size?

2013-04-17 Thread Hemanth Yamijala
, Jane From: Hemanth Yamijala [mailto:yhema...@thoughtworks.com] Sent: Tuesday, April 16, 2013 9:35 PM To: user@hadoop.apache.org Subject: Re: How to configure mapreduce archive size? You can limit the size by setting local.cache.size in the mapred-site.xml

Re: Hadoop fs -getmerge

2013-04-17 Thread Hemanth Yamijala
I don't think that is possible. When we use -getmerge, the destination filesystem happens to be a LocalFileSystem which extends from ChecksumFileSystem. I believe that's why the CRC files are getting in. Would it not be possible for you to ignore them, since they have a fixed extension ? Thanks

Re: How to configure mapreduce archive size?

2013-04-16 Thread Hemanth Yamijala
From: Hemanth Yamijala [mailto:yhema...@thoughtworks.com] Sent: Thursday, April 11, 2013 9:09 PM To: user@hadoop.apache.org Subject: Re: How to configure mapreduce archive size? TableMapReduceUtil has APIs like addDependencyJars which will use DistributedCache. I

Re: How to configure mapreduce archive size?

2013-04-11 Thread Hemanth Yamijala
From: Hemanth Yamijala [mailto:yhema...@thoughtworks.com] Sent: Monday, April 08, 2013 9:09 PM To: user@hadoop.apache.org Subject: Re: How to configure mapreduce archive size? Hi, This directory is used as part of the 'DistributedCache' feature

Re: Copy Vs DistCP

2013-04-11 Thread Hemanth Yamijala
AFAIK, the cp command works fully from the DFS client. It reads bytes from the InputStream created when the file is opened and writes the same to the OutputStream of the file. It does not work at the level of data blocks. A configuration io.file.buffer.size is used as the size of the buffer used
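As a rough illustration (not the actual FsShell code), a client-side copy along those lines would look something like the sketch below, with the source and destination paths made up:

    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ClientSideCopy {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        int bufferSize = conf.getInt("io.file.buffer.size", 4096); // buffer used for the copy
        InputStream in = fs.open(new Path("/data/src.txt"));       // hypothetical paths
        OutputStream out = fs.create(new Path("/data/dst.txt"));
        IOUtils.copyBytes(in, out, bufferSize, true);              // bytes stream through the client
      }
    }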

Re: How to configure mapreduce archive size?

2013-04-11 Thread Hemanth Yamijala
); job.setOutputFormatClass(TableOutputFormat.class); job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, tableName); job.setNumReduceTasks(0); boolean b = job.waitForCompletion(true); From: Hemanth Yamijala [mailto:yhema

Re: How to configure mapreduce archive size?

2013-04-08 Thread Hemanth Yamijala
Hi, This directory is used as part of the 'DistributedCache' feature. ( http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#DistributedCache). There is a configuration key local.cache.size which controls the amount of data stored under DistributedCache. The default limit is 10GB. However,
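For context, a job typically adds files to the DistributedCache from the client side along these lines (the HDFS path is a made-up example); it is the localized copies of such files that accumulate under the directory in question and are bounded by local.cache.size:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapreduce.Job;

    public class CacheSetup {
      public static Job createJob() throws Exception {
        Job job = new Job(new Configuration(), "cache-example");
        // The file is localized under each TaskTracker's mapred.local.dir and
        // counted against the local.cache.size limit until cleanup runs.
        DistributedCache.addCacheFile(new URI("/user/me/lookup.dat"),
            job.getConfiguration());
        return job;
      }
    }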

Re: Find reducer for a key

2013-03-28 Thread Hemanth Yamijala
Hi, Not sure if I am answering your question, but this is the background. Every MapReduce job has a partitioner associated to it. The default partitioner is a HashPartitioner. You can as a user write your own partitioner as well and plug it into the job. The partitioner is responsible for
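A minimal custom partitioner, for illustration only (the routing rule here is invented):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Same contract as HashPartitioner: return a reducer index in [0, numPartitions).
        return (key.toString().charAt(0) & Integer.MAX_VALUE) % numPartitions;
      }
    }
    // Plugged into the job with job.setPartitionerClass(FirstCharPartitioner.class).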

Re: Find reducer for a key

2013-03-28 Thread Hemanth Yamijala
only the needed lines. Thanks, Alberto On 28 March 2013 11:01, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, Not sure if I am answering your question, but this is the background. Every MapReduce job has a partitioner associated to it. The default partitioner

Re: Find reducer for a key

2013-03-28 Thread Hemanth Yamijala
, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hmm. That feels like a join. Can't you read the input file on the map side and output those keys along with the original map output keys.. That way the reducer would automatically get both together ? On Thu, Mar 28, 2013 at 5:20 PM

Re: Child JVM memory allocation / Usage

2013-03-27 Thread Hemanth Yamijala
=./myheapdump.hprof -XX:OnOutOfMemoryError=./dump.sh' This should create the heap dump on hdfs at /tmp/myheapdump_knoguchi. Koji On Mar 26, 2013, at 11:53 AM, Hemanth Yamijala wrote: Hi, I tried to use the -XX:+HeapDumpOnOutOfMemoryError. Unfortunately, like I suspected, the dump goes
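Pieced together from this thread, the client-side configuration being discussed looks roughly like the sketch below; dump.sh is the user-supplied script (shipped with the job) that copies the heap dump to HDFS, and the heap size is illustrative:

    import org.apache.hadoop.conf.Configuration;

    public class HeapDumpOpts {
      public static Configuration configure(Configuration conf) {
        conf.set("mapred.child.java.opts",
            "-Xmx1600m -XX:+HeapDumpOnOutOfMemoryError"
            + " -XX:HeapDumpPath=./myheapdump.hprof"
            + " -XX:OnOutOfMemoryError=./dump.sh");
        return conf;
      }
    }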

Re: Child JVM memory allocation / Usage

2013-03-27 Thread Hemanth Yamijala
... attempt_201302211510_81218_m_00_0: put: File myheapdump.hprof does not exist. attempt_201302211510_81218_m_00_0: log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient). On Wed, Mar 27, 2013 at 2:29 PM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Couple of things to check

Re: Child JVM memory allocation / Usage

2013-03-26 Thread Hemanth Yamijala
. I only have an edge node through which I can submit the jobs. Is there any other way of getting the dump instead of physically going to that machine and checking it out? On Tue, Mar 26, 2013 at 10:12 AM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, One option to find what

Re: Child JVM memory allocation / Usage

2013-03-26 Thread Hemanth Yamijala
matching a pattern. However, these are NOT retaining the current working directory. Hence, there is no option to get this from a cluster AFAIK. You are effectively left with the jmap option on pseudo distributed cluster I think. Thanks Hemanth On Tue, Mar 26, 2013 at 11:37 AM, Hemanth Yamijala

Re: Child JVM memory allocation / Usage

2013-03-26 Thread Hemanth Yamijala
=./dump.sh' This should create the heap dump on hdfs at /tmp/myheapdump_knoguchi. Koji On Mar 26, 2013, at 11:53 AM, Hemanth Yamijala wrote: Hi, I tried to use the -XX:+HeapDumpOnOutOfMemoryError. Unfortunately, like I suspected, the dump goes to the current work directory of the task

Re: How to tell my Hadoop cluster to read data from an external server

2013-03-26 Thread Hemanth Yamijala
The stack trace indicates the job client is trying to submit a job to the MR cluster and it is failing. Are you certain that at the time of submitting the job, the JobTracker is running ? (On localhost:54312) ? Regarding using a different file system - it depends a lot on what file system you are

Re: Child JVM memory allocation / Usage

2013-03-25 Thread Hemanth Yamijala
Hi, The free memory might be low, just because GC hasn't reclaimed what it can. Can you just try reading in the data you want to read and see if that works ? Thanks Hemanth On Mon, Mar 25, 2013 at 10:32 AM, nagarjuna kanamarlapudi nagarjuna.kanamarlap...@gmail.com wrote: io.sort.mb = 256 MB

Re: Child JVM memory allocation / Usage

2013-03-25 Thread Hemanth Yamijala
suggestion loading 420 MB file into memory. It threw java heap space error. I am not sure where this 1.6 GB of configured heap went to ? On Mon, Mar 25, 2013 at 12:01 PM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, The free memory might be low, just because GC hasn't reclaimed what

Re: Child JVM memory allocation / Usage

2013-03-25 Thread Hemanth Yamijala
in the mapper. So I am trying to read the whole file and load it into a list in the mapper. For each and every record I look in this file which I got from the distributed cache. — Sent from iPhone On Mon, Mar 25, 2013 at 6:39 PM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hmm. How are you loading

Re: MapReduce Failed and Killed

2013-03-24 Thread Hemanth Yamijala
Any MapReduce task needs to communicate with the tasktracker that launched it periodically in order to let the tasktracker know it is still alive and active. The time for which silence is tolerated is controlled by a configuration property mapred.task.timeout. It looks like in your case, this has
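To make that concrete, a long-running mapper can keep itself alive by reporting progress, which is what mapred.task.timeout (600000 ms by default) measures silence against; the per-record work below is a placeholder:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SlowMapper extends Mapper<LongWritable, Text, Text, Text> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        for (int i = 0; i < 1000; i++) {
          expensiveStep(value, i);   // stand-in for the real, slow per-record work
          context.progress();        // tells the framework the task is still alive
        }
        context.write(new Text("done"), value);
      }

      private void expensiveStep(Text value, int i) {
        // placeholder for the slow processing described in the thread
      }
    }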

Re: Too many open files error with YARN

2013-03-21 Thread Hemanth Yamijala
the fix is really in 2.0.0-alpha, request you to please clarify me. Thanks, Kishore On Thu, Mar 21, 2013 at 9:57 AM, Hemanth Yamijala yhema...@thoughtworks.com wrote: There was an issue related to hung connections (HDFS-3357). But the JIRA indicates the fix is available in Hadoop-2.0.0

Re: Too many open files error with YARN

2013-03-20 Thread Hemanth Yamijala
There was an issue related to hung connections (HDFS-3357). But the JIRA indicates the fix is available in Hadoop-2.0.0-alpha. Still, would be worth checking on Sandy's suggestion On Wed, Mar 20, 2013 at 11:09 PM, Sandy Ryza sandy.r...@cloudera.comwrote: Hi Kishore, 50010 is the datanode

Re: map reduce and sync

2013-02-24 Thread Hemanth Yamijala
23, 2013 at 11:54 AM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi Lucas, I tried something like this but got different results. I wrote code that opened a file on HDFS, wrote a line and called sync. Without closing the file, I ran a wordcount with that file as input. It did work

Re: Trouble in running MapReduce application

2013-02-23 Thread Hemanth Yamijala
Can you try this ? Pick a class like WordCount from your package and execute this command: javap -classpath path to your jar -verbose org.myorg.Wordcount | grep version. For e.g. here's what I get for my class: $ javap -verbose WCMapper | grep version minor version: 0 major version: 50

Re: Reg job tracker page

2013-02-23 Thread Hemanth Yamijala
Yes. It corresponds to the JT start time. Thanks hemanth On Sat, Feb 23, 2013 at 5:37 PM, Manoj Babu manoj...@gmail.com wrote: Bharath, I understand that it is a timestamp, but what does the identifier mean? Does it hold the job tracker instance start time? Cheers! Manoj. On Sat, Feb

Re: map reduce and sync

2013-02-23 Thread Hemanth Yamijala
, and reading the file using org.apache.hadoop.fs.FSDataInputStream also works ok. Last thing, the web interface doesn't see the contents, and command hadoop -fs -ls says the file is empty. What am I doing wrong? Thanks! Lucas On Sat, Feb 23, 2013 at 4:37 AM, Hemanth Yamijala yhema

Re: map reduce and sync

2013-02-22 Thread Hemanth Yamijala
Could you please clarify, are you opening the file in your mapper code and reading from there ? Thanks Hemanth On Friday, February 22, 2013, Lucas Bernardi wrote: Hello there, I'm trying to use hadoop map reduce to process an open file. The writing process, writes a line to the file and syncs

Re: Database insertion by HAdoop

2013-02-19 Thread Hemanth Yamijala
which you are planning to do on your data. Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Tue, Feb 19, 2013 at 6:44 AM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, You could consider using sqoop. http://sqoop.apache.org/ there seemed to be a SQL

Re: ClassNotFoundException in Main

2013-02-19 Thread Hemanth Yamijala
, Hemanth Yamijala yhema...@thoughtworks.com wrote: Sorry. I did not read the mail correctly. I think the error is in how the jar has been created. The classes start with root as wordcount_classes, instead of org. Thanks Hemanth On Tuesday, February 19, 2013, Hemanth Yamijala wrote: Have

Re: JUint test failing in HDFS when building Hadoop from source.

2013-02-19 Thread Hemanth Yamijala
Hi, In the past, some tests have been flaky. It would be good if you can search jira and see whether this is a known issue. Else, please file it, and if possible, provide a patch. :) Regarding whether this will be a reliable build, it depends a little bit on what you are going to use it for. For

Re: Database insertion by HAdoop

2013-02-18 Thread Hemanth Yamijala
What database is this ? Was hbase mentioned ? On Monday, February 18, 2013, Mohammad Tariq wrote: Hello Masoud, You can use the Bulk Load feature. You might find it more efficient than normal client APIs or using the TableOutputFormat. The bulk load feature uses a MapReduce job

Re: How to understand DataNode usages ?

2013-02-14 Thread Hemanth Yamijala
This seems to be related to the % used capacity at a datanode. The values are computed for all the live datanodes, and the range / central limits / deviations are computed based on a sorted list of the values. Thanks hemanth On Thu, Feb 14, 2013 at 2:42 PM, Dhanasekaran Anbalagan

Re: Java submit job to remote server

2013-02-12 Thread Hemanth Yamijala
Can you please include the complete stack trace and not just the root. Also, have you set fs.default.name to a hdfs location like hdfs://localhost:9000 ? Thanks Hemanth On Wednesday, February 13, 2013, Alex Thieme wrote: Thanks for the prompt reply and I'm sorry I forgot to include the
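As a sketch of what the thread is checking for, a client submitting to a remote cluster typically carries configuration along these lines (host names and ports are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class RemoteSubmit {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode-host:9000");   // HDFS URI, not a local path
        conf.set("mapred.job.tracker", "jobtracker-host:54311");    // must match a running JobTracker
        Job job = new Job(conf, "remote-submit-example");
        job.setJarByClass(RemoteSubmit.class);
        // ... set mapper/reducer classes and input/output paths before submitting ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }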

Re: Cannot use env variables in hodrc

2013-02-08 Thread Hemanth Yamijala
Hi, Hadoop On Demand is no longer supported with recent releases of Hadoop. There is no separate user list for HOD related questions. Which version of Hadoop are you using right now ? Thanks hemanth On Wed, Feb 6, 2013 at 8:59 PM, Mehmet Belgin mehmet.bel...@oit.gatech.eduwrote: Hello

Re: How to find Blacklisted Nodes via cli.

2013-01-30 Thread Hemanth Yamijala
Hi, Part answer: you can get the blacklisted tasktrackers using the command line: mapred job -list-blacklisted-trackers. Also, I think that a blacklisted tasktracker becomes 'unblacklisted' if it works fine after some time. Though I am not very sure about this. Thanks hemanth On Wed, Jan 30,

Re: Filesystem closed exception

2013-01-30 Thread Hemanth Yamijala
not to close the FS. It will go away when the task ends anyway. Thx On Thu, Jan 24, 2013 at 5:26 PM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, We are noticing a problem where we get a filesystem closed exception when a map task is done and is finishing execution. By map task

Re: Filesystem closed exception

2013-01-25 Thread Hemanth Yamijala
. On Fri, Jan 25, 2013 at 6:56 AM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, We are noticing a problem where we get a filesystem closed exception when a map task is done and is finishing execution. By map task, I literally mean the MapTask class of the map reduce code

Re: mappers-node relationship

2013-01-25 Thread Hemanth Yamijala
This may be of some use, about how maps are decided: http://wiki.apache.org/hadoop/HowManyMapsAndReduces Thanks Hemanth On Friday, January 25, 2013, jamal sasha wrote: Hi. A very very lame question. Does the number of mappers depend on the number of nodes I have? How I imagine map-reduce is

Re: TT nodes distributed cache failure

2013-01-25 Thread Hemanth Yamijala
Could you post the stack trace from the job logs. Also looking at the task tracker logs on the failed nodes may help. Thanks Hemanth On Friday, January 25, 2013, Terry Healy wrote: Running hadoop-0.20.2 on a 20 node cluster. When running a Map/Reduce job that uses several .jars loaded into

Filesystem closed exception

2013-01-24 Thread Hemanth Yamijala
Hi, We are noticing a problem where we get a filesystem closed exception when a map task is done and is finishing execution. By map task, I literally mean the MapTask class of the map reduce code. Debugging this we found that the mapper is getting a handle to the filesystem object and itself

Re: Where do/should .jar files live?

2013-01-22 Thread Hemanth Yamijala
On top of what Bejoy said, just wanted to add that when you submit a job to Hadoop using the hadoop jar command, the jars which you reference in the command on the edge/client node will be picked up by Hadoop and made available to the cluster nodes where the mappers and reducers run. Thanks

Re: passing arguments to hadoop job

2013-01-21 Thread Hemanth Yamijala
Hi, Please note that you are referring to a very old version of Hadoop. The current stable release is Hadoop 1.x. The API has changed in 1.x. Take a look at the wordcount example here: http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#Example%3A+WordCount+v2.0 But, in principle your

Re: passing arguments to hadoop job

2013-01-21 Thread Hemanth Yamijala
(); } output.collect(key, new IntWritable(sum)); } } On Mon, Jan 21, 2013 at 8:29 PM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, Please note that you are referring to a very old version of Hadoop. the current stable release is Hadoop 1.x. The API has changed

Re: How to unit test mappers reading data from DistributedCache?

2013-01-17 Thread Hemanth Yamijala
Hi, Not sure how to do it using MRUnit, but should be possible to do this using a mocking framework like Mockito or EasyMock. In a mapper (or reducer), you'd use the Context classes to get the DistributedCache files. By mocking these to return what you want, you could potentially run a true unit
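One way to set this up (a sketch, assuming the mapper loads its side data in setup()) is to keep the DistributedCache lookup behind an overridable method, so a unit test, whether plain JUnit or one using a mocked Context, can substitute a local fixture file:

    import java.io.IOException;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
      @Override
      protected void setup(Context context) throws IOException, InterruptedException {
        Path[] cached = getCacheFiles(context);
        // ... load cached[0] into an in-memory lookup structure ...
      }

      // Test seam: production code resolves files via the DistributedCache API,
      // while a test subclass can override this to return a local fixture path.
      protected Path[] getCacheFiles(Context context) throws IOException {
        return DistributedCache.getLocalCacheFiles(context.getConfiguration());
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // ... use the lookup structure loaded in setup() ...
      }
    }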

Re: tcp error

2013-01-16 Thread Hemanth Yamijala
failed when I tried to open it. Restarting the daemons helped. I don't think this problem will come in a normal up-and-running production cluster. Thanks hemanth On Thu, Jan 17, 2013 at 9:48 AM, Hemanth Yamijala yhema...@thoughtworks.com wrote: At the place where you get the error, can you

Re: Biggest cluster running YARN in the world?

2013-01-15 Thread Hemanth Yamijala
You may get more updated information from folks at Yahoo!, but here is a mail on hadoop-general mailing list that has some statistics: http://www.mail-archive.com/general@hadoop.apache.org/msg05592.html Please note it is a little dated, so things should be better now :-) Thank hemanth On Tue,

Re: Compile error using contrib.utils.join package with new mapreduce API

2013-01-15 Thread Hemanth Yamijala
in 2.x and trunk. Could you check if this provides functionality you require - so we at least know there is new API support in later versions ? Thanks Hemanth On Mon, Jan 14, 2013 at 7:45 PM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, No. I didn't find any reference to a working

Re: FileSystem.workingDir vs mapred.local.dir

2013-01-15 Thread Hemanth Yamijala
Hi, AFAIK, the mapred.local.dir property refers to a set of directories under which different types of data related to mapreduce jobs are stored - for e.g. intermediate data, localized files for a job etc. The working directory for a mapreduce job is configured under a sub directory within one of

Re: config file loactions in Hadoop 2.0.2

2013-01-15 Thread Hemanth Yamijala
Hi, One place where I could find the capacity-scheduler.xml was from source - hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/resources. AFAIK, the masters file is only used for starting the secondary namenode - which has in 2.x been replaced by a

Re: Compile error using contrib.utils.join package with new mapreduce API

2013-01-14 Thread Hemanth Yamijala
: Thanks Hemanth I appreciate your response Did you find any working example of it in use? It looks to me like I’d still be tied to the old API Thanks Mike From: Hemanth Yamijala [mailto:yhema...@thoughtworks.com] Sent: 14 January 2013 05:08

Re: log server for hadoop MR jobs??

2013-01-13 Thread Hemanth Yamijala
To add to that, log aggregation is a feature available with Hadoop 2.0 (where mapreduce is re-written to YARN). The functionality is available via the History Server: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/HistoryServerRest.html Thanks hemanth On Sat, Jan 12, 2013

Re: JobCache directory cleanup

2013-01-11 Thread Hemanth Yamijala
11, 2013 at 3:28 PM, Ivan Tretyakov itretya...@griddynamics.com wrote: Thanks for replies! keep.failed.task.files set to false. Config of one of the jobs attached. On Fri, Jan 11, 2013 at 5:44 AM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Good point. Forgot that one

Re: queues in haddop

2013-01-11 Thread Hemanth Yamijala
Queues in the capacity scheduler are logical data structures into which MapReduce jobs are placed to be picked up by the JobTracker / Scheduler framework, according to some capacity constraints that can be defined for a queue. So, given your use case, I don't think Capacity Scheduler is going to

Re: JobCache directory cleanup

2013-01-10 Thread Hemanth Yamijala
Hemanth On Thu, Jan 10, 2013 at 8:18 AM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, The directory name you have provided is /data?/mapred/local/taskTracker/persona/jobcache/. This directory is used by the TaskTracker (slave) daemons to localize job files when the tasks

Re: Not committing output in map reduce

2013-01-10 Thread Hemanth Yamijala
Is this the same as: http://stackoverflow.com/questions/6137139/how-to-save-only-non-empty-reducers-output-in-hdfs? i.e. LazyOutputFormat, etc. ? On Thu, Jan 10, 2013 at 4:51 PM, Pratyush Chandra chandra.praty...@gmail.com wrote: Hi, I am using s3n as file system. I do not wish to create
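The LazyOutputFormat approach from that answer boils down to one line of job setup; output files are only created when the first record is actually written, so empty tasks leave nothing behind. This sketch assumes a Hadoop version that ships org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat for the new API:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class LazyOutputSetup {
      public static void configure(Job job) {
        // Wraps TextOutputFormat so part files are created lazily, on first write.
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
      }
    }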

Re: JobCache directory cleanup

2013-01-10 Thread Hemanth Yamijala
Good point. Forgot that one :-) On Thu, Jan 10, 2013 at 10:53 PM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote: Can you check the job configuration for these ~100 jobs? Do they have keep.failed.task.files set to true? If so, these files won't be deleted. If it doesn't, it could

Re: JobCache directory cleanup

2013-01-09 Thread Hemanth Yamijala
Hi, The directory name you have provided is /data?/mapred/local/taskTracker/persona/jobcache/. This directory is used by the TaskTracker (slave) daemons to localize job files when the tasks are run on the slaves. Hence, I don't think this is related to the parameter

Re: Why the official Hadoop Documents are so messy?

2013-01-08 Thread Hemanth Yamijala
Hi, I am not sure if your complaint is as much about the changing interfaces as it is about documentation. Please note that versions prior to 1.0 did not have stable interfaces as a major requirement. Not by choice, but because the focus was on seemingly more important functionality, stability,

Re: Reg: Fetching TaskAttempt Details from a RunningJob

2013-01-07 Thread Hemanth Yamijala
Hi, In Hadoop 1.0, I don't think this information is exposed. The TaskInProgress is an internal class and hence cannot / should not be used from client applications. The only way out seems to be to screen scrape the information from the Jobtracker web UI. If you can live with completed events,

Re: Differences between 'mapped' and 'mapreduce' packages

2013-01-07 Thread Hemanth Yamijala
From a user perspective, at a high level, the mapreduce package can be thought of as having user facing client code that can be invoked, extended etc as applicable from client programs. The mapred package is to be treated as internal to the mapreduce system, and shouldn't directly be used unless

Re: Skipping entire task

2013-01-06 Thread Hemanth Yamijala
Hi, Are tasks being executed multiple times due to failures? Sorry, it was not very clear from your question. Thanks hemanth On Sat, Jan 5, 2013 at 7:44 PM, David Parks davidpark...@yahoo.com wrote: Thinking here... if you submitted the task programmatically you should be able to capture

Re: What is the preferred way to pass a small number of configuration parameters to a mapper or reducer

2012-12-30 Thread Hemanth Yamijala
If it is a small number, A seems the best way to me. On Friday, December 28, 2012, Kshiva Kps wrote: Which one is current .. What is the preferred way to pass a small number of configuration parameters to a mapper or reducer? A. As key-value pairs in the jobconf object.
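A minimal sketch of option A; the property name is invented for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ThresholdSetup {
      public static Job createJob() throws Exception {
        Configuration conf = new Configuration();
        conf.set("myapp.threshold", "42");   // set on the client, before job submission
        return new Job(conf, "threshold-example");
      }

      public static class ThresholdMapper extends Mapper<LongWritable, Text, Text, Text> {
        private int threshold;

        @Override
        protected void setup(Context context) {
          // read the parameter back inside the task
          threshold = context.getConfiguration().getInt("myapp.threshold", 0);
        }
      }
    }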

Re: Selecting a task for the tasktracker

2012-12-27 Thread Hemanth Yamijala
Hi, Firstly, I am talking about Hadoop 1.0. Please note that in Hadoop 2.x and trunk, the Mapreduce framework is completely revamped to Yarn ( http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html) and you may need to look at different interfaces for building your own

Re: Sane max storage size for DN

2012-12-13 Thread Hemanth Yamijala
This is a dated blog post, so it would help if someone with current HDFS knowledge can validate it: http://developer.yahoo.com/blogs/hadoop/posts/2010/05/scalability_of_the_hadoop_dist/ . There is a bit about the RAM required for the Namenode and how to compute it: You can look at the 'Namespace

Re: attempt* directories in user logs

2012-12-10 Thread Hemanth Yamijala
However, in the case Oleg is talking about the attempts are: attempt_201212051224_0021_m_00_0 attempt_201212051224_0021_m_02_0 attempt_201212051224_0021_m_03_0 These aren't multiple attempts of a single task, are they ? They are actually different tasks. If they were multiple

Re: Map tasks processing some files multiple times

2012-12-06 Thread Hemanth Yamijala
David, You are using FileNameTextInputFormat. This is not in Hadoop source, as far as I can see. Can you please confirm where this is being used from ? It seems like the isSplittable method of this input format may need checking. Another thing, given you are adding the same input format for all
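For comparison, a custom text input format that deliberately hands each file to exactly one mapper overrides isSplitable; this is a generic sketch, not the FileNameTextInputFormat in question:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class WholeFileTextInputFormat extends TextInputFormat {
      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        return false;   // never split: one map task per input file
      }
    }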

Re: Map tasks processing some files multiple times

2012-12-06 Thread Hemanth Yamijala
out what I had done. Dave From: Hemanth Yamijala [mailto:yhema...@thoughtworks.com] Sent: Thursday, December 06, 2012 3:25 PM To: user@hadoop.apache.org Subject: Re: Map tasks processing some files multiple times David, You

Re: Changing hadoop configuration without restarting service

2012-12-04 Thread Hemanth Yamijala
Generally true for the framework config files, but some of the supplementary features can be refreshed without restart. For e.g. scheduler configuration, host files (for included / excluded nodes) ... On Tue, Dec 4, 2012 at 5:33 AM, Cristian Cira cmc0...@tigermail.auburn.eduwrote: No. You will

Re: Failed to call hadoop API

2012-11-29 Thread Hemanth Yamijala
Hi, Little confused about where JNI comes in here (you mentioned this in your original email). Also, where do you want to get the information for the hadoop job ? Is it in a program that is submitting a job, or some sort of monitoring application that is monitoring jobs submitted to a cluster by

Re: problem using s3 instead of hdfs

2012-10-16 Thread Hemanth Yamijala
Hi, I've not tried this on S3. However, the directory mentioned in the exception is based on the value of this particular configuration key: mapreduce.jobtracker.staging.root.dir. This defaults to ${hadoop.tmp.dir}/mapred/staging. Can you please set this to an S3 location and try ? Thanks

Re: problem using s3 instead of hdfs

2012-10-16 Thread Hemanth Yamijala
, 2012 at 3:11 AM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, I've not tried this on S3. However, the directory mentioned in the exception is based on the value of this particular configuration key: mapreduce.jobtracker.staging.root.dir. This defaults to ${hadoop.tmp.dir}/mapred

Re: Question about how to find which file takes the longest time to process and how to assign more mappers to process that particular file

2012-10-04 Thread Hemanth Yamijala
Hi, Roughly, this information will be available under the 'Hadoop map task list' page in the Mapreduce web ui (in Hadoop-1.0, which I am assuming is what you are using). You can reach this page by selecting the running tasks link from the job information page. The page has a table that lists all

Re: A small portion of map tasks slows down the job

2012-10-03 Thread Hemanth Yamijala
Hi, Would reducing the output from the map tasks solve the problem ? i.e. are reducers slowing down because a lot of data is being shuffled ? If that's the case, you could see if the map output size will reduce by using the framework's combiner or an in-mapper combining technique. Thanks
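An in-mapper combining sketch for a word-count-like job, to show the idea of shrinking map output before the shuffle (the tokenization is illustrative):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class InMapperCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private final Map<String, Integer> counts = new HashMap<String, Integer>();

      @Override
      protected void map(LongWritable key, Text value, Context context) {
        // aggregate in memory instead of emitting one record per token
        for (String word : value.toString().split("\\s+")) {
          if (word.isEmpty()) {
            continue;
          }
          Integer c = counts.get(word);
          counts.put(word, c == null ? 1 : c + 1);
        }
      }

      @Override
      protected void cleanup(Context context) throws IOException, InterruptedException {
        // emit once per distinct key, so far less data is shuffled to the reducers
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
          context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
      }
    }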

Re: Can we write output directly to HDFS from Mapper

2012-09-27 Thread Hemanth Yamijala
Can certainly do that. Indeed, if you set the number of reducers to 0, the map output will be directly written to HDFS by the framework itself. You may also want to look at http://hadoop.apache.org/docs/stable/mapred_tutorial.html#Task+Side-Effect+Files to see some things that need to be taken
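A minimal map-only job setup along those lines (output path and value classes are illustrative):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlySetup {
      public static void configure(Job job) throws Exception {
        job.setNumReduceTasks(0);                  // map output is written straight to HDFS
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/user/me/map-only-out"));
      }
    }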

Re: Passing Command-line Parameters to the Job Submit Command

2012-09-25 Thread Hemanth Yamijala
assumption correct? Thanks, Varad On Mon, Sep 24, 2012 at 9:48 AM, Hemanth Yamijala yhema...@gmail.comwrote: Varad, Looking at the code for the PiEstimator class which implements the 'pi' example, the two arguments are mandatory and are used *before* the job is submitted for execution - i.e

Re: Passing Command-line Parameters to the Job Submit Command

2012-09-23 Thread Hemanth Yamijala
Varad, Looking at the code for the PiEstimator class which implements the 'pi' example, the two arguments are mandatory and are used *before* the job is submitted for execution - i.e on the client side. In particular, one of them (nSamples) is used not by the MapReduce job, but by the client code

Re: Will all the intermediate output with the same key go to the same reducer?

2012-09-20 Thread Hemanth Yamijala
Hi, Yes. By contract, all intermediate output with the same key goes to the same reducer. In your example, suppose that of the two keys generated by the mapper, one goes to reducer 1 and the second goes to reducer 2; reducer 3 will not have any records to process and will end without producing any

Re: About ant Hadoop

2012-09-19 Thread Hemanth Yamijala
Can you please look at the jobtracker and tasktracker logs on nodes where the task has been launched ? Also see if the job logs are picking up anything. They'll probably give you clues on what is happening. Also, is HDFS ok ? i.e. are you able to read files already loaded etc. Thanks hemanth On

Re: What's the basic idea of pseudo-distributed Hadoop ?

2012-09-14 Thread Hemanth Yamijala
One thing to be careful about is paths of dependent libraries or executables like streaming binaries. In pseudo distributed mode, since all processes are looking on the same machine, it is likely that they will find paths that are really local to only the machine where the job is being launched

Re: Ignore keys while scheduling reduce jobs

2012-09-14 Thread Hemanth Yamijala
Hi, When do you know the keys to ignore ? You mentioned after the map stage .. is this at the end of each map task, or at the end of all map tasks ? Thanks hemanth On Fri, Sep 14, 2012 at 4:36 PM, Aseem Anand aseem.ii...@gmail.com wrote: Hi, Is there anyway I can ignore all keys except a

Re: Question about the task assignment strategy

2012-09-11 Thread Hemanth Yamijala
Hi, Task assignment takes data locality into account first and not block sequence. In hadoop, tasktrackers ask the jobtracker to be assigned tasks. When such a request comes to the jobtracker, it will try to look for an unassigned task which needs data that is close to the tasktracker and will

Re: Error in : hadoop fsck /

2012-09-11 Thread Hemanth Yamijala
Could you please review your configuration to see if you are pointing to the right namenode address ? (This will be in core-site.xml) Please paste it here so we can look for clues. Thanks hemanth On Tue, Sep 11, 2012 at 9:25 PM, yogesh dhari yogeshdh...@live.com wrote: Hi all, I am running

Re: Question about the task assignment strategy

2012-09-11 Thread Hemanth Yamijala
) But, it didn't work like that. Why is this happening ? Are there any documents about this ? What part of the source code is doing that ? Regards, Hiroyuki On Tue, Sep 11, 2012 at 11:27 PM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, Task assignment takes data locality

Re: Restricting the number of slave nodes used for a given job (regardless of the # of map/reduce tasks involved)

2012-09-10 Thread Hemanth Yamijala
Hi, I am not sure if there's any way to restrict the tasks to specific machines. However, I think there are some ways of restricting to number of 'slots' that can be used by the job. Also, not sure which version of Hadoop you are on. The capacityscheduler

Re: Reading from HDFS from inside the mapper

2012-09-10 Thread Hemanth Yamijala
Hi, You could check DistributedCache ( http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache). It would allow you to distribute data to the nodes where your tasks are run. Thanks Hemanth On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann
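The task-side half of that, sketched; it assumes the client added the file with DistributedCache.addCacheFile() and that the first localized file is the one wanted:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SideDataMapper extends Mapper<LongWritable, Text, Text, Text> {
      @Override
      protected void setup(Context context) throws IOException, InterruptedException {
        Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (localFiles != null && localFiles.length > 0) {
          BufferedReader reader = new BufferedReader(new FileReader(localFiles[0].toString()));
          try {
            // ... read the localized copy into memory for use in map() ...
          } finally {
            reader.close();
          }
        }
      }
    }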

Re: Understanding of the hadoop distribution system (tuning)

2012-09-10 Thread Hemanth Yamijala
Hi, Responses inline to some points. On Tue, Sep 11, 2012 at 7:26 AM, Elaine Gan elaine-...@gmo.jp wrote: Hi, I'm new to hadoop and i've just played around with map reduce. I would like to check if my understanding to hadoop is correct and i would appreciate if anyone could correct me if

Re: [Cosmos-dev] Out of memory in identity mapper?

2012-09-06 Thread Hemanth Yamijala
Harsh, Could IsolationRunner be used here ? I'd put up a patch for HADOOP-8765, after applying which IsolationRunner works for me. Maybe we could use it to re-run the map task that's failing and debug. Thanks hemanth On Thu, Sep 6, 2012 at 9:42 PM, Harsh J ha...@cloudera.com wrote: Protobuf

Re: Error using hadoop in non-distributed mode

2012-09-04 Thread Hemanth Yamijala
Hi, The path /tmp/hadoop-pat/mapred/local/archive/-4686065962599733460_1587570556_150738331/snip is a location used by the tasktracker process for the 'DistributedCache' - a mechanism to distribute files to all tasks running in a map reduce job. (

Re: Exception while running a Hadoop example on a standalone install on Windows 7

2012-09-04 Thread Hemanth Yamijala
Though I agree with others that it would probably be easier to get Hadoop up and running on Unix based systems, couldn't help notice that this path: \tmp \hadoop-upendyal\mapred\staging\upendyal-1075683580\.staging seems to have a space in the first component i.e '\tmp ' and not '\tmp'. Is that

Re: Integrating hadoop with java UI application deployed on tomcat

2012-09-03 Thread Hemanth Yamijala
Hi, If you are getting the LocalFileSystem, you could try by putting core-site.xml in a directory that's there in the classpath for the Tomcat App (or include such a path in the classpath, if that's possible) Thanks hemanth On Mon, Sep 3, 2012 at 4:01 PM, Visioner Sadak visioner.sa...@gmail.com
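A hedged sketch of the fallback when the XML cannot go on the webapp classpath: load it explicitly (the path below is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientFactory {
      public static FileSystem get() throws Exception {
        Configuration conf = new Configuration();
        // Explicitly point at the cluster config instead of relying on the classpath.
        conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
        // With fs.default.name set to an hdfs:// URI, this returns DistributedFileSystem
        // rather than the LocalFileSystem seen in the original problem.
        return FileSystem.get(conf);
      }
    }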

Yarn defaults for local directories

2012-09-03 Thread Hemanth Yamijala
Hi, Is there a reason why Yarn's directory paths are not defaulting to be relative to hadoop.tmp.dir. For e.g. yarn.nodemanager.local-dirs defaults to /tmp/nm-local-dir. Could it be ${hadoop.tmp.dir}/nm-local-dir instead ? Similarly for the log directories, I guess... Thanks hemanth

Re: knowing the nodes on which reduce tasks will run

2012-09-03 Thread Hemanth Yamijala
Hi, You are right that a change to mapred.tasktracker.reduce.tasks.maximum will require a restart of the tasktrackers. AFAIK, there is no way of modifying this property without restarting. On a different note, could you see if the amount of intermediate data can be reduced using a combiner, or
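The combiner suggestion reduces to one line of job setup; because a sum-style reduce is associative and commutative, the same class can safely run on the map side as the combiner:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CombinerSetup {
      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void configure(Job job) {
        job.setCombinerClass(SumReducer.class);   // runs on map output before the shuffle
        job.setReducerClass(SumReducer.class);
      }
    }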
