Hi Amit
Are you seeing any errors or warnings in the JT logs?
Regards
Bejoy KS
Hi
I found that the task tracker still appears on the web interface after I
kill the task tracker process and restart it; the old task tracker entry
remains no matter how many times I repeat the kill-restart cycle.
Only restarting the job tracker solved my problem.
Hi,
I have a question related to VM reuse in Hadoop. I now understand the
purpose of VM reuse, but I am wondering how it is useful.
For example, for VM reuse to be effective, or to kick in, we need more than
one mapper task to be submitted to a single node (for the same job). Hadoop
would consider
Hi Rahul
If you look at larger clusters and jobs that involve larger input data sets,
the data would be spread across the whole cluster, and a single node might
have various blocks of that entire data set. Imagine you have a cluster
with 100 map slots and your job has 500 map tasks; now in that
We are thinking of deploying something like a 50-node cluster and trying to
figure out what would be a good hardware infrastructure (disks/IOPS, RAM,
CPUs, network). I cannot actually find any examples that people ran and
found to work well and cost-effectively.
If anybody could share their best
This is the regular behavior. You should see it disappear after the ~10 min
timeout period. The reason is that every TT starts on an ephemeral port and
therefore appears as a new TT to the JT (TTs aren't persistent members of a
cluster).
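(For illustration, a sketch of the knob behind that timeout window, assuming
Hadoop 1.x; it normally lives in the JobTracker's mapred-site.xml rather
than being set in client code as shown here:)

import org.apache.hadoop.conf.Configuration;

// mapred.tasktracker.expiry.interval is the period in milliseconds after
// which a TaskTracker that has stopped heartbeating is declared lost and
// dropped from the JT's live list; 600000 ms (10 min) is the default.
Configuration conf = new Configuration();
conf.setLong("mapred.tasktracker.expiry.interval", 600000L);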
On Tue, Apr 16, 2013 at 2:01 PM, dylan dwld0...@gmail.com
Tadas,
Hadoop Operations has pretty useful, up-to-date information. The chapter on
hardware selection is available here:
http://my.safaribooksonline.com/book/databases/hadoop/9781449327279/4dot-planning-a-hadoop-cluster/id2760689
Regards,
Marcos
On 16-04-2013 07:13, Tadas Makčinskas wrote:
+1 for Hadoop Operations
On Tue, Apr 16, 2013 at 3:57 PM, MARCOS MEDRADO RUBINELLI
marc...@buscapecompany.com wrote:
Ok, Thanks Bejoy.
It's only possible in some typical scenarios, like the one that you have
mentioned: many more mappers than mapper slots.
Regards,
Rahul
On Tue, Apr 16, 2013 at 2:40 PM, Bejoy Ks bejoy.had...@gmail.com wrote:
Hi Rahul
If you look at larger cluster
When you process larger data volumes, this is mostly the case. :)
Say you have a job with a smaller input size and two blocks on a single
node; the JT may then schedule two tasks on the same TT if there are free
slots available, so those tasks can take advantage of JVM reuse.
Which
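(For illustration, a minimal sketch of turning JVM reuse on, assuming the
Hadoop 1.x mapred API:)

import org.apache.hadoop.mapred.JobConf;

// Default is 1, i.e. a fresh JVM per task; -1 lets one JVM run an
// unlimited number of tasks of the same job in sequence.
JobConf conf = new JobConf();
conf.setNumTasksToExecutePerJvm(-1); // backs mapred.job.reuse.jvm.num.tasks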
Agreed.
Not sure about the behaviour of the JT. Consider this situation:
N1 has split 1 and split 2 of a file and there are two map slots. N2 has
split 2 and it also has one mapper slot. I think the JT would probably
schedule a single map on N1 and another map on N2, for better parallel IO.
Rather than
Mighty users@hadoop,
anyone on this?
On Tue, Apr 16, 2013 at 2:19 PM, Rahul Bhattacharjee
rahul.rec@gmail.com wrote:
Hi,
I have a question related to Hadoop's InputSampler, which is used for
investigating the data set beforehand using random selection, sampling,
etc. Mainly used for
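(For reference, a minimal sketch of the usual InputSampler pattern, assuming
the Hadoop 2.x mapreduce API and text input; the class name, paths and
sampling parameters are illustrative:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class SampledSort {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "sampled-sort");
    FileInputFormat.addInputPath(job, new Path(args[0]));
    job.setPartitionerClass(TotalOrderPartitioner.class);
    // Pick each record with probability 0.1, keeping at most 10,000
    // samples drawn from at most 10 input splits.
    InputSampler.Sampler<LongWritable, Text> sampler =
        new InputSampler.RandomSampler<LongWritable, Text>(0.1, 10000, 10);
    // Writes the cut points TotalOrderPartitioner uses to assign key
    // ranges to reducers, which yields globally sorted output.
    InputSampler.writePartitionFile(job, sampler);
  }
}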
Hi Rahul
AFAIK there is no guarantee that one task would be on N1 and another on N2;
both can be on N1 as well.
The JT has no notion of JVM reuse; it doesn't consider that for task
scheduling.
Regards
Bejoy KS
Sent from remote device, Please excuse typos
-Original Message-
From: Rahul
+1 for Hadoop operations.
There is a document from Hortonworks that explains Hadoop cluster
infrastructure. That doc is brand-specific, but after reading Hadoop
Operations, we can refer to it to get a clear overview.
On Tue, Apr 16, 2013 at 4:50 PM, Bejoy Ks bejoy.had...@gmail.com
There are also reference architectures available from a variety of hardware
vendors - the likes of Dell, HP, IBM, Cisco, and others. They often outline
a reasonable framework for disk/cpu/memory mix, and usually include some
description of network as well. If you have a preferred hardware vendor,
Hadoop by default limits balancing to 5 concurrent threads per node. That
causes your problem.
On Mon, Apr 15, 2013 at 10:24 PM, rauljin liujin666...@sina.com wrote:
Hi:
The Hadoop cluster is running the balancer,
and one datanode, 172.16.80.72, is:
Datanode: Not
Nothing in the JT log, but as I mentioned I see this in the client log:
[WARN ] org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser
for parsing the arguments. Applications should implement Tool for the same.
[INFO ] org.apache.hadoop.mapred.JobClient - Cleaning up the staging
Hi Hemanth,
I did not explicitly use DistributedCache in my code, nor did I use any
command-line arguments like -libjars.
Where can I find job.xml? I am using the HBase MapReduce API and am not
setting any job.xml.
The key point is I want to limit the size of
We've recently run into JobTracker memory issues on our new Hadoop cluster.
A heap dump shows that there are thousands of copies of DistributedFileSystem
kept in FileSystem$Cache, a bit over one for each job run on the cluster, and
their JobConf objects support this view. I believe these are
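(For context, a sketch of the caching behaviour involved;
fs.hdfs.impl.disable.cache is a real client-side setting, but whether it is
the right fix depends on what is creating those instances:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

Configuration conf = new Configuration();
// FileSystem.get() caches instances keyed by scheme, authority and the
// calling UserGroupInformation, so one distinct user (or UGI) per job can
// leave one DistributedFileSystem per job behind in FileSystem$Cache.
FileSystem fs = FileSystem.get(conf); // cached; fs.close() also evicts it
// One way to sidestep the cache entirely, at the cost of a fresh
// instance per call:
conf.setBoolean("fs.hdfs.impl.disable.cache", true);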
I understand that when the NameNode starts up, it reads the fsimage to get
the state of HDFS and applies the edits file to complete it.
But what about the cluster topology? Does the NameNode read the config
files like core-site.xml/slaves/... etc. to determine its cluster topology,
or use an API to build
On Tue, Apr 16, 2013 at 11:34 PM, Diwakar Sharma
diwakar.had...@gmail.com wrote:
...cluster topology or uses an API to build it.
If you stop and start the cluster, Hadoop reads these configuration files
for sure.
∞
Shashwat Shriparv
From http://archive.cloudera.com/cdh/3/hadoop/hdfs_user_guide.html
(Assuming you are using Cloudera Hadoop Distribution 3)
$ hadoop dfsadmin -refreshNodes # would help do the same.
-refreshNodes : Updates the set of hosts allowed to connect to namenode.
Re-reads the config file to update values
Hello!
I'm working on a research project, and I also happen to be relatively new
to Hadoop/MapReduce. So apologies ahead of time for any glaring errors.
On my local machine, my project runs within a JVM and uses a Java API to
communicate with a Prolog server to do information lookups. I was
Assuming that the server can handle high volume and multiple queries, there
is no reason not to run it on a large and powerful machine outside the
cluster. Nothing prevents your mappers from accessing a server or even,
depending on the design, a custom InputFormat from pulling data from the
server.
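For example, a sketch of that pattern, assuming a hypothetical PrologClient
wrapper around your Java-to-Prolog API, with one connection per map task
rather than one per record:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
  private PrologClient client; // hypothetical helper, not a Hadoop class

  @Override
  protected void setup(Context context) throws IOException {
    // Open one connection per task; "prolog.server.host" is an assumed
    // job-conf key pointing at the machine outside the cluster.
    String host = context.getConfiguration().get("prolog.server.host");
    client = new PrologClient(host); // hypothetical constructor
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String answer = client.lookup(value.toString()); // hypothetical call
    context.write(value, new Text(answer));
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    client.close();
  }
}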
Hi All,
I have just set up a CDH cluster on EC2 using Cloudera Manager 4.5. I have
been trying to run a couple of MapReduce jobs as part of an Oozie workflow,
but have been blocked by the following exception (my reducer always hangs
because of this):
2013-04-17 00:32:02,268 WARN
Thank you very much.
This morning I found it and tested it again.
The results are as you said.
From: Harsh J [mailto:ha...@cloudera.com]
Sent: April 16, 2013 18:21
To: user@hadoop.apache.org
Subject: Re: Task Trackers accumulation
This is the regular behavior. You should see it disappear
Try using job.waitForCompletion(true) instead of job.submit();
it should show more details.
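(For reference, a minimal sketch of the suggested change, assuming the new
mapreduce API:)

// Blocks until the job finishes and, with verbose=true, prints progress
// and task diagnostics to the client log; job.submit() returns immediately
// and reports nothing.
boolean success = job.waitForCompletion(true);
System.exit(success ? 0 : 1);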
On Mon, Apr 15, 2013 at 6:06 PM, Amit Sela am...@infolinks.com wrote:
Hi all,
I'm trying to submit a MapReduce job remotely using job.submit(), and
I get the following:
[WARN ]
For a set of jobs to run, I need to download about 100GB of data from the
internet (~1000 files of varying sizes from ~10 different domains).
Currently I do this in a simple Linux script, as it's easy to script FTP,
curl, and the like. But it's a mess to maintain a separate server for that
You can limit the size by setting local.cache.size in mapred-site.xml
(or core-site.xml if that works for you). I mistakenly mentioned
mapred-default.xml in my last mail; apologies for that. However, please
note that this does not prevent whatever is writing into the distributed
cache from
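(For illustration, the same cap expressed in code, assuming a 5 GB limit;
note the TaskTracker is what reads this property, so in practice it belongs
in the cluster-side mapred-site.xml as described above:)

import org.apache.hadoop.conf.Configuration;

// local.cache.size is a byte count; this sets roughly 5 GB.
Configuration conf = new Configuration();
conf.setLong("local.cache.size", 5L * 1024 * 1024 * 1024);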
Hi,
I am new to Hadoop. I started reading the standard WordCount program, and I
have a basic doubt:
after map-reduce is done, where is the output generated? Does the
reducer output sit on individual DataNodes? Please advise.
Thanks,
Raj
There are 8 datanodes in my Hadoop cluster; when running a reduce job, only
2 datanodes run the job.
I want to use all 8 datanodes to run the reduce job, so I can balance the
I/O pressure.
Any ideas?
Thanks.
rauljin
The data is in HDFS in the case of the WordCount MR sample.
In HDFS, you have the metadata in the NameNode and the actual data as blocks
replicated across DataNodes.
In the case of the reducer, if a reducer is running on a particular node,
then you have one replica of its output blocks on the same node (if there
is no space
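(For illustration, a sketch assuming the new mapreduce API and a hypothetical
output path: each reducer writes one part file under the job's output
directory, and those part files are ordinary HDFS files whose blocks are
replicated across DataNodes like any others.)

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Reducer i writes part-r-0000i under this HDFS directory.
FileOutputFormat.setOutputPath(job, new Path("/user/raj/wordcount/out"));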
Hi Rauljin
A few things to check here.
What is the number of reduce slots in each TaskTracker? What is the number
of reduce tasks for your job?
Based on the availability of slots, the reduce tasks are scheduled on TTs.
You can do the following:
Set the number of reduce tasks to 8 or more.
Play
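(For example, a one-line sketch assuming the new mapreduce API; actual
placement still depends on free reduce slots on each TaskTracker:)

// Request 8 reduce tasks so reduce-side I/O spreads across more nodes.
job.setNumReduceTasks(8);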
Just to add to Bejoy's comments, it also depends on the data distribution.
Is your data properly distributed across the HDFS?
Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com
On Wed, Apr 17, 2013 at 10:39 AM, bejoy.had...@gmail.com wrote:
Hi Rauljin
Few things to
Uniform data distribution across HDFS is one of the factors that ensures
map tasks are uniformly distributed across nodes. But reduce tasks don't
depend on data distribution; their scheduling is purely based on slot
availability.
Regards
Bejoy KS
Sent from remote device, Please excuse typos
-Original