You could also try creating a lib directory with the dependent jar and
package that along with the job's jar file. Please refer to this blog post
for information:
http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
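Not from the blog post itself, but one hedged alternative is the DistributedCache classpath API in the 0.20 mapred libraries; the HDFS path below is only a placeholder and the jar must already be uploaded to HDFS:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.fs.Path;

  public class AddJarToTaskClasspath {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // The jar must already exist in HDFS; this path is only an example.
      DistributedCache.addFileToClassPath(new Path("/libs/third-party.jar"), conf);
      // ... continue with normal job setup using this Configuration ...
    }
  }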
On Wed, Sep 26, 2012 at 4:57 PM, sudh
> Is my above assumption correct?
>
> Thanks,
> Varad
>
> On Mon, Sep 24, 2012 at 9:48 AM, Hemanth Yamijala wrote:
>
>> Varad,
>>
>> Looking at the code for the PiEstimator class which implements the
>> 'pi' example, the two arguments are mandatory
Varad,
Looking at the code for the PiEstimator class which implements the
'pi' example, the two arguments are mandatory and are used *before*
the job is submitted for execution - i.e. on the client side. In
particular, one of them (nSamples) is used not by the MapReduce job,
but by the client code.
Hi,
I am not sure if there's any way to restrict the tasks to specific
machines. However, I think there are some ways of restricting to
number of 'slots' that can be used by the job.
Also, I'm not sure which version of Hadoop you are on. The capacity
scheduler (http://hadoop.apache.org/common/docs/r2.
Hi,
Do both input files contain data that needs to be processed by the
mapper in the same fashion ? In which case, you could just put the
input files under a directory in HDFS and provide that as input. The
-input option does accept a directory as argument.
Otherwise, can you please explain a lit
Hi,
On Wed, Dec 29, 2010 at 5:51 AM, Jane Chen wrote:
> Is setting dfs.replication to 1 sufficient to stop replication? How do I
> verify that? I have a pseudo cluster running 0.21.0. It seems that the hdfs
> disk consumption triples the amount of data stored.
Setting dfs.replication to 1 is sufficient to stop replication for newly written files.
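To verify the setting took effect, a hedged Java sketch that prints the replication factor recorded for a given HDFS file (the argument is whatever path you want to check):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class CheckReplication {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // args[0] is the HDFS file to inspect.
      FileStatus status = fs.getFileStatus(new Path(args[0]));
      System.out.println(status.getPath() + " replication=" + status.getReplication());
    }
  }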
Hi,
On Tue, Dec 28, 2010 at 6:03 PM, Rajgopal Vaithiyanathan
wrote:
> I wrote a script to map the IP's to a rack. The script is as follows. :
>
> for i in $* ; do
> topo=`echo $i | cut -d"." -f1,2,3 | sed 's/\./-/g'`
> topo=/rack-$topo" "
> final=$final$topo
> done
> echo $final
Not exactly what you may want - but could you try using a HTTP client
in Java ? Some of them have the ability to automatically follow
redirects, manage cookies etc.
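For example, a minimal sketch with the JDK's own HttpURLConnection, which can follow redirects; the URL is a placeholder (libraries such as Apache HttpClient add cookie handling on top of this):

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class FetchWithRedirects {
    public static void main(String[] args) throws Exception {
      // Placeholder URL; redirects within the same protocol are followed.
      URL url = new URL("http://example.com/");
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      conn.setInstanceFollowRedirects(true);
      BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
      in.close();
    }
  }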
Thanks
hemanth
On Thu, Dec 9, 2010 at 4:35 PM, edward choi wrote:
> Excuse me for asking a general Java question here.
> I tried to
Hi,
On Sat, Dec 4, 2010 at 4:50 AM, yogeshv wrote:
>
> Dear all,
>
> Which file in the hadoop svn processes/receives the hadoop command line
> arguments.?
>
> While execution for ex: hadoop jar
> .
>
'hadoop' in the above line is a shell script that's present in the
hadoop-common/bin location
Hi,
> Changing the parameter for a specific job works better for me.
>
> But I was asking in general in which configuration file(s) should I change
> the value of the parameters.
> For parameters in hdfs-site.xml, I should change the configuration file in
> each machine. But for parameters in mapred-site.xml
Amandeep,
On Fri, Nov 5, 2010 at 11:54 PM, Amandeep Khurana wrote:
> On Fri, Nov 5, 2010 at 2:00 AM, Hemanth Yamijala wrote:
>
>> Hi,
>>
>> On Fri, Nov 5, 2010 at 2:23 PM, Amandeep Khurana wrote:
>> > Right. I meant I'm not using fair or capacity scheduler
the settings as 'final' on the job tracker and
the task trackers. Then any submission by the job would not override
the settings.
Thanks
Hemanth
>
> -Amandeep
>
> On Nov 5, 2010, at 1:43 AM, Hemanth Yamijala wrote:
>
> Hi,
>
>
> I'm not using any scheduler
0.21, and the names of the parameters are different, though you
can see the correspondence with similar variables in Hadoop 0.20.
Thanks
Hemanth
>
> -Amandeep
>
> On Fri, Nov 5, 2010 at 12:21 AM, Hemanth Yamijala wrote:
>
>> Amadeep,
>>
>> Which scheduler are you
Amadeep,
Which scheduler are you using ?
Thanks
hemanth
On Tue, Nov 2, 2010 at 2:44 AM, Amandeep Khurana wrote:
> How are the following configs supposed to be used?
>
> mapred.cluster.map.memory.mb
> mapred.cluster.reduce.memory.mb
> mapred.cluster.max.map.memory.mb
> mapred.cluster.max.reduce.memory.mb
Hi,
On Thu, Oct 28, 2010 at 5:11 PM, Adarsh Sharma wrote:
> Dear all,
> I am listing all the HDFS delails through -fs shell. I know the superuser
> owns the privileges to list files. But know I want to grant all read and
> write privileges to two new users (for e. g Tom and White ) .
Only these
Hi,
On Wed, Oct 27, 2010 at 2:19 AM, Bibek Paudel wrote:
> [Apologies for cross-posting]
>
> HI all,
> I am rewriting a hadoop java code for the new (0.20.2) API- the code
> was originally written for versions <= 0.19.
>
> 1. What is the equivalent of the getCounter() method ? For example,
> the
Hi,
On Tue, Oct 26, 2010 at 8:14 PM, siddharth raghuvanshi
wrote:
> Hi,
>
> While running Terrior on Hadoop, I am getting the following error again &
> again, can someone please point out where the problem is?
>
> attempt_201010252225_0001_m_09_2: WARN - Error running child
> attempt_20101025
Hi,
On Sat, Oct 23, 2010 at 1:44 AM, Burhan Uddin wrote:
> Hello,
> I am a beginner with hadoop framework. I am trying create a distributed
> crawling application. I have googled a lot. but the resources are too low.
> Can anyone please help me on the following topics.
>
I suppose you know already
Hi,
You mentioned you'd like to configure different memory settings for
the process depending on which nodes the tasks run on. Which process
are you referring to here - the Hadoop daemons, or your map/reduce
program ?
An alternative approach could be to see if you can get only those
nodes in Torque
Hi,
On Mon, Sep 6, 2010 at 1:47 AM, Neil Ghosh wrote:
> Hi,
>
> I am trying to sort a list of numbers (one per line) using hadoop
> mapreduce.
> Kindly suggest any reference and code.
>
> How do I implement custom input format and recordreader so that both key and
> value are the number?
>
> I a
Hi,
>
> The optimization of one Hadoop job I'm running would benefit from knowing
> the
> maximum number of map slots in the Hadoop cluster.
>
> This number can be obtained (if my understanding is correct) by:
>
> * parsing the mapred-site.xml file to get
> the mapred.tasktracker.map.tasks.maximum
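Not necessarily what the original thread settled on, but one programmatic alternative to parsing mapred-site.xml is the JobClient API, which reports cluster-wide slot counts; a hedged sketch against the 0.20 mapred API:

  import org.apache.hadoop.mapred.ClusterStatus;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class MaxMapSlots {
    public static void main(String[] args) throws Exception {
      // Connects to the JobTracker named in the client's own configuration.
      JobClient client = new JobClient(new JobConf());
      ClusterStatus status = client.getClusterStatus();
      // Total map slots across all tasktrackers currently in the cluster.
      System.out.println("Max map slots: " + status.getMaxMapTasks());
    }
  }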
Hi,
Can you please confirm if you've set JAVA_HOME in
/hadoop-env.sh on all the nodes ?
Thanks
Hemanth
On Tue, Aug 31, 2010 at 6:21 AM, Mohit Anchlia wrote:
> Hi,
>
> I am running some basic setup and test to know about hadoop. When I
> try to start nodes I get this error. I am already using ja
Hi,
On Mon, Aug 30, 2010 at 8:19 AM, Gang Luo wrote:
> Hi all,
> I am trying to configure and start a hadoop cluster on EC2. I got some
> problems
> here.
>
>
> 1. Can I share hadoop code and its configuration across nodes? Say I have a
> distributed file system running in the cluster and all th
Hi,
On Sun, Aug 29, 2010 at 10:14 PM, Gang Luo wrote:
> HI all,
> I am setting a hadoop cluster where I have to specify the local directory for
> temp files/logs, etc. Should I allow everybody have the write permission to
> these directories? Who actually does the write operation?
The temp and l
Hmm. Without the / in the property tag, isn't the file malformed XML ?
I am pretty sure Hadoop complains in such cases ?
On Wed, Aug 25, 2010 at 4:44 AM, cliff palmer wrote:
> Thanks Allen - that has resolved the problem. Good catch!
> Cliff
>
> On Tue, Aug 24, 2010 at 3:05 PM, Allen Wittenauer
Mark,
On Wed, Aug 18, 2010 at 10:59 PM, Mark wrote:
> What is the preferred way of managing multiple configurations.. ie
> development, production etc.
>
> Is there someway I can tell hadoop to use a separate conf directory other
> than ${hadoop_home}/conf? I think I've read somewhere that one
Hi,
> Hi, Hemanth. Thinks for your reply!
>
> I tried your recommendation, absolute path, it worked, I was able to run the
> jobs successfully. Thank you!
> I was wondering why hadoop.tmp.dir ( or mapred.local.dir ? ) with relative
> path didn't work.
I am not entirely sure, but when the daemon
Hi,
> 1. I login through SSH without password from master and slaves, it's all
> right :-)
>
> 2.
>
> <property>
>   <name>hadoop.tmp.dir</name>
>   <value>tmp</value>
> </property>
>
>
> In fact, 'tmp' is what I want :-)
>
> $HADOOP_HOME
> + tmp
> + dfs
>
Hi,
On Thu, Aug 12, 2010 at 10:31 AM, Hemanth Yamijala wrote:
> Hi,
>
> On Thu, Aug 12, 2010 at 3:35 AM, Bobby Dennett
> wrote:
>> From what I've read/seen, it appears that, if not the "default"
>> scheduler, most installations are using Hadoop's
Hi,
On Thu, Aug 12, 2010 at 3:35 AM, Bobby Dennett
wrote:
> From what I've read/seen, it appears that, if not the "default"
> scheduler, most installations are using Hadoop's Fair Scheduler. Based
> on features and our requirements, we're leaning towards using the
> Capacity Scheduler; however, t
Hi,
On Tue, Aug 3, 2010 at 9:42 AM, saurabhsuman8989
wrote:
>
> By 'tasks' i mean different tasks under one job. When a Job is distributed
> in different tasks , can i add prioroty to those tasks.
It would be interesting to know why you want to do this. Can you please
explain your use case ?
Thanks
Hi,
It would also be worthwhile to look at the Tool interface
(http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Tool),
which is used by example programs in the MapReduce examples as well.
This would allow any arguments to be passed using the
-Dvar.name=var.value convention on the command line.
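A minimal sketch of the Tool pattern being described; the class name and the my.param parameter are placeholders, not from the original thread:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.conf.Configured;
  import org.apache.hadoop.util.Tool;
  import org.apache.hadoop.util.ToolRunner;

  public class MyTool extends Configured implements Tool {
    public int run(String[] args) throws Exception {
      // Generic options such as -Dmy.param=value have already been parsed
      // into the Configuration by ToolRunner before run() is called.
      Configuration conf = getConf();
      System.out.println("my.param = " + conf.get("my.param", "not set"));
      return 0;
    }

    public static void main(String[] args) throws Exception {
      // e.g. hadoop jar mytool.jar MyTool -Dmy.param=value
      System.exit(ToolRunner.run(new MyTool(), args));
    }
  }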
. This is *not* to be used by client code, and is not
guaranteed to work. In the latter versions of Hadoop (0.21 and trunk),
these methods have been deprecated in the public API and will be
removed altogether.
Thanks
hemanth
>
> Thanks,
> -Gang
>
>
>
> ----- Original Message -----
> From:
Hi,
> Actually I enabled all level logs. But I didn't realize to check logs in .out
> files and only looked at .log file and didn't see any error msgs. now I
> opened the .out file and saw the following logged exception:
>
> Exception in thread "IPC Server handler 5 on 50002"
> java.lang.OutOfMemoryError
Hi,
> Thanks Hemanth. Is there any way to invalidate the reuse and ask Hadoop to
> resent exactly the same files to cache for every job?
I may be able to answer this better if I understand the use case. If
you need the same files for every job, why would you need to send them
afresh each time ? I
Hi,
> Is there a list of configuration parameters that can be set per job.
I'm almost certain there's no list that documents per-job settable
parameters that well. From 0.21 onwards, I think a convention adopted
is to name all job-related or task-related parameters to include 'job'
or 'map' or 'reduce'
Hi,
> if I use distributed cache to send some files to all the nodes in one MR job,
> can I reuse these cached files locally in my next job, or will hadoop re-sent
> these files again?
Cache files are reused across Jobs. From trunk onwards, they will be
restricted to be reused across jobs of the
Hi,
> I'd like to run a Hadoop (0.20.2) job
> from within another application, using ToolRunner.
>
> One class of this other application implements the Tool interface.
> The implemented run() method:
> * constructs a Job()
> * sets the input/output/mapper/reducer
> * sets the jar file by calling j
Hi,
> I am trying to use the hadoop's datajoin for joining two relation. According
> to
> the Readme file of datajoin, it gives the following syntax:
>
> $HADOOP_HOME/bin/hadoop jar hadoop-datajoin-examples.jar
> org.apache.hadoop.contrib.utils.join.DataJoinJob datajoin/input
> datajoin/output
Hi,
> Thanks for the information. I got your point. What I specifically want to ask
> is
> that if I use the following method to read my file now in each mapper:
>
> FileSystem hdfs=FileSystem.get(conf);
> URI[] uris=DistributedCache.getCacheFiles(conf);
>
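Not from the original thread: a hedged sketch of reading the cached files from the tasktracker's local disk instead of going back through HDFS; the class and method names are placeholders:

  import java.io.BufferedReader;
  import java.io.FileReader;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.fs.Path;

  public class CacheFileReader {
    // Intended to be called from a mapper's setup/configure method.
    public static void readFirstCacheFile(Configuration conf) throws Exception {
      // Local copies materialized on the tasktracker's disk for this job.
      Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
      if (localFiles != null && localFiles.length > 0) {
        BufferedReader reader = new BufferedReader(new FileReader(localFiles[0].toString()));
        String line;
        while ((line = reader.readLine()) != null) {
          // process each line of the cached file here
        }
        reader.close();
      }
    }
  }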
Edward,
Overall, I think the consideration should be about how much load do
you expect to support on your cluster. For HDFS, there's a good amount
of information about how much RAM is required to support a certain
amount of data stored in DFS; something similar can be found for
Map/Reduce as well.
John,
Can you please redirect this to pig-u...@hadoop.apache.org ? You're
more likely to get good responses there.
Thanks
hemanth
On Thu, Jul 8, 2010 at 7:01 AM, John Seer wrote:
>
> Hello, Is there any way to share shema file in pig for the same table between
> projects?
>
>
Alex,
> I don't think this is what I am looking for. Essentially, I wish to run both
> mapper as well as reducer. But at the same time, i wish to make sure that
> the temp files that are used between mappers and reducers are of my choice.
> Here, the choice means that I can specify the files in HDFS
Hi,
> I am running a mapreduce job on my hadoop cluster.
>
> I am running a 10 gigabytes data and one tiny failed task crashes the whole
> operation.
> I am up to 98% complete and throwing away all the finished data seems just
> like an awful waste.
> I'd like to save the finished data and run again
Michael,
Configuration is not reloaded for daemons. There is currently no way
to refresh configuration once the cluster is started. Some specific
aspects - like queue configuration and blacklisted nodes - can be reloaded
based on commands like hadoop admin refreshQueues or some such.
Thanks
Hemanth
On
Michael,
> In addition to default FIFO scheduler, there are fair scheduler and capacity
> scheduler. In some sense, fair scheduler can be considered a user-based
> scheduling while capacity scheduler does a queue-based scheduling. Is there
> or will there be a hybrid scheduler that combines the
Shashank,
> Hi,
>
> Setup Info:
> I have 2 node hadoop (20.2) cluster on Linux boxes.
> HW info: 16 CPU (Hyperthreaded)
> RAM: 32 GB
>
> I am trying to configure capacity scheduling. I want to use memory
> management provided by capacity scheduler. But I am facing few issues.
> I have added hadoop
The values set for some specific configuration
variables. Unfortunately, the names of those variables have changed
from 20 to 21 and trunk. Hence, I need to know the version to specify
which ones to look up for.
Thanks
Hemanth
> Vidhya
>
> On 6/23/10 3:16 AM, "Hemanth Yamijala" wrote:
> You can use --config to your bin/hadoop commands. I
> think it would also work if you set the HADOOP_CONF_DIR environment
> variable to point to this path.
>
>>
>>
>> On Wed, Jun 23, 2010 at 10:52 AM, Hemanth Yamijala wrote:
>>
>>> Pierre,
>
Vidhya,
> Hi
> This looks like a trivial problem but would be glad if someone can help..
>
> I have been trying to run a m-r job on my cluster. I had modified my configs
> (primarily reduced the heap sizes for the task tracker and the data nodes)
> and restarted my hadoop cluster and the job w
It would also work if you set the HADOOP_CONF_DIR environment
variable to point to this path.
>
>
> On Wed, Jun 23, 2010 at 10:52 AM, Hemanth Yamijala wrote:
>
>> Pierre,
>>
>> > I have a program that generates the data that's supposed to be treated by
>> >
Pierre,
> I have a program that generates the data that's supposed to be treated by
> hadoop.
> It's a java program that should write right on hdfs.
> So as a test, I do this:
>
>
>
> Configuration config = new Configuration();
> FileSystem dfs = FileSystem.get(config);
>
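For context, a hedged sketch of how such a client-side write can be completed; the output path is a placeholder, and the Configuration is assumed to pick up a core-site.xml whose fs.default.name points at the namenode (otherwise FileSystem.get falls back to the local file system):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
      // Assumes the Hadoop config files are on the classpath; otherwise
      // this writes to the local file system rather than HDFS.
      Configuration config = new Configuration();
      FileSystem dfs = FileSystem.get(config);
      FSDataOutputStream out = dfs.create(new Path("/user/test/sample.txt"));
      out.writeUTF("hello hdfs");
      out.close();
    }
  }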
There was also https://issues.apache.org/jira/browse/MAPREDUCE-1316
whose cause hit clusters at Yahoo! very badly last year. The situation
was particularly noticeable in the face of lots of jobs with failed
tasks and a specific fix that enabled OutOfBand heartbeats. The latter
(i.e. the OOB heartbeats
Felix,
> I'm using the new Job class:
>
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Job.html
>
> There is a way to set the number of reduce tasks:
>
> setNumReduceTasks(int tasks)
>
> However, I don't see how to set the number of MAP tasks?
>
> I tried to set it
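Not the original reply, but for reference: in the new API the number of map tasks is not set directly; it is derived from the input splits the InputFormat produces. A minimal sketch (paths and counts are placeholders):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

  public class JobSetupSketch {
    public static void main(String[] args) throws Exception {
      Job job = new Job(new Configuration(), "sketch");
      // The reduce count can be set explicitly ...
      job.setNumReduceTasks(4);
      // ... but the map count cannot: it follows from the number of input
      // splits the InputFormat produces for the configured input paths.
      FileInputFormat.addInputPath(job, new Path("/user/test/input"));
    }
  }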
Ted,
> When the user calling FileSystem.copyFromLocalFile() doesn't have permission
> to write to certain hdfs path:
> Thread [main] (Suspended (exception AccessControlException))
> DFSClient.mkdirs(String, FsPermission) line: 905
> DistributedFileSystem.mkdirs(Path, FsPermission) line: 262
Edward,
If it's an option to copy the libraries to a fixed location on all the
cluster nodes, you could do that and configure them in the library
path via mapred.child.java.opts. Please look at http://bit.ly/ab93Z8
(MapReduce tutorial on Hadoop site) to see how to use this config
option for setting it.
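A hedged sketch of setting that option programmatically; the library path is only an example and must exist on every node:

  import org.apache.hadoop.conf.Configuration;

  public class ChildOptsSketch {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // The library path is only an example; it must exist on every node.
      conf.set("mapred.child.java.opts",
               "-Xmx512m -Djava.library.path=/usr/local/hadoop-native-libs");
      System.out.println(conf.get("mapred.child.java.opts"));
    }
  }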
Peter,
> I'm getting the following errors:
>
> WARN org.apache.hadoop.mapred.JobTracker: Serious problem, cannot find record
> of 'previous' heartbeat for
> 'tracker_m351.ra.wink.com:localhost/127.0.0.1:41885';
> reinitializing the tasktracker
>
> INFO org.apache.hadoop.mapred.JobTracker: Adding
Erik,
>
> I've been unable to resolve this problem on my own so I've decided to ask
> for help. I've pasted the logs I have for the DataNode on of the slave
> nodes. The logs for TaskTracker are essentially the same (i.e. same
> exception causing a shutdown).
>
> Any suggestions or hints as to what
Keith,
On Sat, May 22, 2010 at 5:01 AM, Keith Wiley wrote:
> On May 21, 2010, at 16:07 , Mikhail Yakshin wrote:
>
>> On Fri, May 21, 2010 at 11:09 PM, Keith Wiley wrote:
>>> My Java mapper hands its processing off to C++ through JNI. On the C++
>>> side I need to access a file. I have already
Andrew,
> Just to be clear, I'm only sharing the Hadoop binaries and config files via
> NFS. I don't see how this would cause a conflict - do you have any
> additional information?
FWIW, we had an experience where we were storing config files on NFS
on a large cluster. Randomly, (and we guess
Jim,
> I have two machines, one is Windows XP and another one is Widows Vista. I
> did the same thing on two machines. Hadoop Eclipse Plugin works fine in
> Windows XP. But I got an error when I run it in Windows Vista.
>
> I copied hadoop-0.20.2-eclipse-plugin into Eclipse/plugins folder and
> re
Vasilis,
> I 'd like to pass different JVM options for map tasks and different
> ones for reduce tasks. I think it should be straightforward to add
> mapred.mapchild.java.opts, mapred.reducechild.java.opts to my
> conf/mapred-site.xml and process the new options accordingly in
> src/mapred/org/apa
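For reference, later releases (0.21 and Hadoop 1.x, as far as I know) added separate properties for the two phases, so no source change should be needed there; a hedged sketch assuming those property names:

  import org.apache.hadoop.conf.Configuration;

  public class PerPhaseJvmOpts {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // Assumed property names from 0.21/1.x; 0.20 only has the single
      // combined mapred.child.java.opts.
      conf.set("mapred.map.child.java.opts", "-Xmx512m");
      conf.set("mapred.reduce.child.java.opts", "-Xmx1024m");
      System.out.println(conf.get("mapred.map.child.java.opts"));
    }
  }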
Song,
> I guess you are very close to my point. I mean whether we can find a way
> to set the qsub parameter "ppn"?
From what I could see in the HOD code, it appears you cannot override
the ppn value with HOD. You could look at
src/contrib/hod/hodlib/NodePools/torque.py, and specifically the
m
Song,
> I know it is the way to set the capacity of each node, however, I want to
> know, how can we make Torque manager that we will run more than 1 mapred
> tasks on each machine. Because if we dont do this, torque will assign other
> cores on this machine to other tasks, which may cause a com
Song,
> HOD is good, and can manage a large virtual cluster on a huge physical
> cluster. but the problem is, it doesnt apply more than one core for each
> machine, and I have already received a complaint from our admin!
>
I assume what you want is the Map/Reduce cluster that is started by
HOD