Re: Submitting mapreduce and nothing happens

2013-04-16 Thread Bejoy Ks
Hi Amit, Are you seeing any errors or warnings in the JT logs? Regards, Bejoy KS

Task Trackers accumulation

2013-04-16 Thread dylan
Hi, I found that the task tracker still appears on the web interface after I killed the task tracker process. I then tried to restart it, but the old task tracker entry remains, no matter how many times I repeated the kill-restart cycle. Only restarting the job tracker solved my problem.

VM reuse!

2013-04-16 Thread Rahul Bhattacharjee
Hi, I have a question related to VM reuse in Hadoop. I now understand the purpose of VM reuse, but I am wondering how it is useful. For example, for VM reuse to be effective or to kick in, we need more than one mapper task to be submitted to a single node (for the same job). Hadoop would consider

Re: VM reuse!

2013-04-16 Thread Bejoy Ks
Hi Rahul, If you look at larger clusters and jobs that involve larger input data sets, the data would be spread across the whole cluster, and a single node might have various blocks of that entire data set. Imagine you have a cluster with 100 map slots and your job has 500 map tasks; now in that

HW infrastructure for Hadoop

2013-04-16 Thread Tadas Makčinskas
We are thinking of deploying something like a 50-node cluster, and trying to figure out what would be a good HW infrastructure (disks/I/Os, RAM, CPUs, network). I cannot actually come across any examples that people ran and found to work well and cost-effectively. If anybody could share their best

Re: Task Trackers accumulation

2013-04-16 Thread Harsh J
This is the regular behavior. You should see it disappear after the ~10 min timeout period. The reason is that every TT starts on an ephemeral port and therefore appears as a new TT to the JT (TTs aren't persistent members of a cluster). On Tue, Apr 16, 2013 at 2:01 PM, dylan dwld0...@gmail.com

Re: HW infrastructure for Hadoop

2013-04-16 Thread MARCOS MEDRADO RUBINELLI
Tadas, Hadoop Operations has pretty useful, up-to-date information. The chapter on hardware selection is available here: http://my.safaribooksonline.com/book/databases/hadoop/9781449327279/4dot-planning-a-hadoop-cluster/id2760689 Regards, Marcos On 16-04-2013 07:13, Tadas Makčinskas wrote:

Re: HW infrastructure for Hadoop

2013-04-16 Thread Bejoy Ks
+1 for Hadoop Operations On Tue, Apr 16, 2013 at 3:57 PM, MARCOS MEDRADO RUBINELLI marc...@buscapecompany.com wrote: Tadas, Hadoop Operations has pretty useful, up-to-date information. The chapter on hardware selection is available here:

Re: VM reuse!

2013-04-16 Thread Rahul Bhattacharjee
Ok, thanks Bejoy. It's only possible in some typical scenarios, like the one that you have mentioned: many more mappers than mapper slots. Regards, Rahul On Tue, Apr 16, 2013 at 2:40 PM, Bejoy Ks bejoy.had...@gmail.com wrote: Hi Rahul If you look at larger cluster

Re: VM reuse!

2013-04-16 Thread Bejoy Ks
When you process larger data volumes, this is mostly the case. :) Say you have a job with a smaller input size and two blocks on a single node; the JT may then schedule two tasks on the same TT if there are free slots available, so those tasks can take advantage of JVM reuse. Which
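The JVM reuse being discussed here is controlled by a single MRv1 property. A minimal sketch of the relevant mapred-site.xml (or per-job conf) entry, assuming classic MapReduce v1:

```xml
<!-- mapred-site.xml (MRv1). Number of tasks of the same job a single JVM
     may run in sequence on a TaskTracker:
     1 = no reuse (the default), -1 = unlimited reuse within the job. -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
```

As the thread notes, reuse only pays off when the JT happens to schedule multiple tasks of the same job on the same TT; the setting does not influence scheduling itself.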

Re: VM reuse!

2013-04-16 Thread Rahul Bhattacharjee
Agreed. Not sure about the behaviour of the JT. Consider this situation: N1 has split 1 and split 2 of a file and has two map slots. N2 has split 2 and also has one mapper slot. I think the JT would probably schedule a single map on N1 and another map on N2, for better parallel IO. Rather than

Re: Hadoop sampler related query!

2013-04-16 Thread Rahul Bhattacharjee
Mighty users@hadoop, anyone on this? On Tue, Apr 16, 2013 at 2:19 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Hi, I have a question related to Hadoop's input sampler, which is used for investigating the data set beforehand using random selection, sampling, etc. Mainly used for

Re: VM reuse!

2013-04-16 Thread bejoy . hadoop
Hi Rahul, AFAIK there is no guarantee that one task would be on N1 and another on N2; both can be on N1 as well. The JT has no notion of JVM reuse and doesn't consider it for task scheduling. Regards Bejoy KS Sent from remote device, Please excuse typos -Original Message- From: Rahul

Re: HW infrastructure for Hadoop

2013-04-16 Thread Amal G Jose
+1 for Hadoop Operations. There is also a document from Hortonworks that explains Hadoop cluster infrastructure. That doc is brand specific, but after reading Hadoop Operations, we can refer to it to get a clear overview. On Tue, Apr 16, 2013 at 4:50 PM, Bejoy Ks bejoy.had...@gmail.com

Re: HW infrastructure for Hadoop

2013-04-16 Thread Adam Smieszny
There are also reference architectures available from a variety of hardware vendors - the likes of Dell, HP, IBM, Cisco, and others. They often outline a reasonable framework for disk/cpu/memory mix, and usually include some description of network as well. If you have a preferred hardware vendor,

Re: threads quota is exceeded question

2013-04-16 Thread Thanh Do
Hadoop by default limits balancing to 5 concurrent threads per node. That causes your problem. On Mon, Apr 15, 2013 at 10:24 PM, rauljin liujin666...@sina.com wrote: ** HI: The hadoop cluster is running the balancer, and one datanode 172.16.80.72 is: Datanode: Not

Re: Submitting mapreduce and nothing happens

2013-04-16 Thread Amit Sela
Nothing in the JT log, but as I mentioned I see this in the client log: [WARN ] org.apache.hadoop.mapred.JobClient » Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. [INFO ] org.apache.hadoop.mapred.JobClient » Cleaning up the staging

RE: How to configure mapreduce archive size?

2013-04-16 Thread Xia_Yang
Hi Hemanth, I did not explicitly use DistributedCache in my code, nor did I use any command line arguments like -libjars. Where can I find job.xml? I am using the HBase MapReduce API and not setting any job.xml. The key point is I want to limit the size of

Jobtracker memory issues due to FileSystem$Cache

2013-04-16 Thread Marcin Mejran
We've recently run into jobtracker memory issues on our new Hadoop cluster. A heap dump shows that there are thousands of copies of DistributedFileSystem kept in FileSystem$Cache, a bit over one for each job run on the cluster, and their jobconf objects support this view. I believe these are
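For context: FileSystem.get() caches instances keyed partly on the Configuration's UGI, so per-job conf objects can each pin a new DistributedFileSystem. A commonly cited workaround, sketched below, is disabling the client-side cache for hdfs:// URIs; this is an assumption about the cause here, not a confirmed fix for this cluster, and it trades memory for re-created connections:

```xml
<!-- core-site.xml: disable the FileSystem cache for the hdfs scheme, so
     each FileSystem.get() returns a fresh instance the caller must close.
     The generic key is fs.<scheme>.impl.disable.cache. -->
<property>
  <name>fs.hdfs.impl.disable.cache</name>
  <value>true</value>
</property>
```

With the cache disabled, any code path that previously leaked cached instances must close the FileSystem it obtains, or the leak simply moves elsewhere.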

Get Hadoop cluster topology

2013-04-16 Thread Diwakar Sharma
I understand that when the Namenode starts up it reads fsimage to get the state of HDFS and applies the edits file to complete it. But what about the cluster topology? Does the namenode read config files like core-site.xml/slaves/... etc. to determine its cluster topology, or use an API to build

Re: Get Hadoop cluster topology

2013-04-16 Thread shashwat shriparv
On Tue, Apr 16, 2013 at 11:34 PM, Diwakar Sharma diwakar.had...@gmail.com wrote: ...cluster topology or uses an API to build it. If you stop and start the cluster, Hadoop reads these configuration files for sure. ∞ Shashwat Shriparv

Re: Get Hadoop cluster topology

2013-04-16 Thread Nikhil
From http://archive.cloudera.com/cdh/3/hadoop/hdfs_user_guide.html (Assuming you are using Cloudera Hadoop Distribution 3) $ hadoop dfsadmin -refreshNodes # would help do the same. -refreshNodes : Updates the set of hosts allowed to connect to namenode. Re-reads the config file to update values

Querying a Prolog Server from a JVM during a MapReduce Job

2013-04-16 Thread Robert Spurrier
Hello! I'm working on a research project, and I also happen to be relatively new to Hadoop/MapReduce. So apologies ahead of time for any glaring errors. On my local machine, my project runs within a JVM and uses a Java API to communicate with a Prolog server to do information lookups. I was

Re: Querying a Prolog Server from a JVM during a MapReduce Job

2013-04-16 Thread Steve Lewis
Assuming that the server can handle high volume and multiple queries there is no reason not to run it on a large and powerful machine outside the cluster. Nothing prevents your mappers from accessing a server or even, depending on the design, a custom InputFormat from pulling data from the server.

Problem: org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: connect timed out

2013-04-16 Thread Som Satpathy
Hi All, I have just set up a CDH cluster on EC2 using cloudera manager 4.5. I have been trying to run a couple of mapreduce jobs as part of an oozie workflow but have been blocked by the following exception: (my reducer always hangs because of this) - 2013-04-17 00:32:02,268 WARN

Re: Task Trackers accumulation

2013-04-16 Thread dylan
Thank you very much. This morning I found it and tested it again; the results are as you said. From: Harsh J [mailto:ha...@cloudera.com] Sent: April 16, 2013 18:21 To: user@hadoop.apache.org Subject: Re: Task Trackers accumulation This is the regular behavior. You should see it disappear

Re: Submitting mapreduce and nothing happens

2013-04-16 Thread Zizon Qiu
Try using job.waitForCompletion(true) instead of job.submit(); it should show more details. On Mon, Apr 15, 2013 at 6:06 PM, Amit Sela am...@infolinks.com wrote: Hi all, I'm trying to submit a mapreduce job remotely using job.submit(). I get the following: [WARN ]
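The difference matters for debugging: submit() returns immediately, while waitForCompletion(true) blocks and streams progress and task diagnostics to the client log. A minimal sketch of the new-API submission path (assumes a Hadoop client classpath; the job name and conf setup are placeholders):

```java
// Sketch: block until the job finishes instead of returning right after submit.
// Job is org.apache.hadoop.mapreduce.Job; conf is an org.apache.hadoop.conf.Configuration
// already pointing at the remote cluster.
Job job = Job.getInstance(conf, "my-job"); // "my-job" is a hypothetical name
// ... set mapper, reducer, input/output paths as usual ...

// 'true' = print progress and task failure diagnostics to the client as the job runs,
// which is exactly what a silently-failing remote submission needs.
boolean ok = job.waitForCompletion(true);
System.exit(ok ? 0 : 1);
```

If the job never leaves the staging phase, the diagnostics printed by waitForCompletion usually point at the staging-area or connectivity problem directly.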

Mapreduce jobs to download job input from across the internet

2013-04-16 Thread David Parks
For a set of jobs to run, I need to download about 100GB of data from the internet (~1000 files of varying sizes from ~10 different domains). Currently I do this in a simple Linux script, as it's easy to script FTP, curl, and the like. But it's a mess to maintain a separate server for that

Re: How to configure mapreduce archive size?

2013-04-16 Thread Hemanth Yamijala
You can limit the size by setting local.cache.size in the mapred-site.xml (or core-site.xml if that works for you). I mistakenly mentioned mapred-default.xml in my last mail - apologies for that. However, please note that this does not prevent whatever is writing into the distributed cache from
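The property Hemanth mentions is a plain size cap in bytes. A sketch of the corresponding entry, with a hypothetical 1 GB limit (the MRv1 default is 10 GB):

```xml
<!-- mapred-site.xml (or core-site.xml): upper bound, in bytes, on the space
     a TaskTracker's local distributed-cache directory may use before old
     entries are deleted. 1073741824 bytes = 1 GB (illustrative value). -->
<property>
  <name>local.cache.size</name>
  <value>1073741824</value>
</property>
```

As the reply notes, this caps cleanup behavior only; it does not stop a job from writing large archives into the cache in the first place.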

Basic Doubt in Hadoop

2013-04-16 Thread Raj Hadoop
Hi, I am new to Hadoop. I started reading the standard WordCount program, and I got this basic doubt in Hadoop: after the Map-Reduce is done, where is the output generated? Does the reducer output sit on individual DataNodes? Please advise. Thanks, Raj

How to balance reduce job

2013-04-16 Thread rauljin
There are 8 datanodes in my hadoop cluster; when running a reduce job, only 2 datanodes are running the job. I want to use all 8 datanodes to run the reduce job, so I can balance the I/O pressure. Any ideas? Thanks. rauljin

Re: Basic Doubt in Hadoop

2013-04-16 Thread bejoy . hadoop
The data is in HDFS in the case of the WordCount MR sample. In HDFS, you have the metadata in the NameNode and the actual data as blocks replicated across DataNodes. In the case of the reducer, if a reducer is running on a particular node, then you have one replica of the blocks on the same node (if there is no space

Re: How to balance reduce job

2013-04-16 Thread bejoy . hadoop
Hi Rauljin, A few things to check here. What is the number of reduce slots in each Task Tracker? What is the number of reduce tasks for your job? Based on slot availability, the reduce tasks are scheduled on TTs. You can do the following: set the number of reduce tasks to 8 or more. Play
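The two knobs Bejoy refers to can be sketched as MRv1 config entries; the values below are hypothetical, sized for an 8-datanode cluster:

```xml
<!-- mapred-site.xml on each TaskTracker: how many reduce tasks one TT may
     run concurrently (a per-node slot count, not a cluster total). -->
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>

<!-- Per-job setting (job conf or -D on the command line): ask for at least
     as many reducers as there are nodes, so work can spread across all 8. -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>8</value>
</property>
```

With only 2 reduce tasks requested (the observed behavior), the JT has nothing to schedule on the other 6 nodes regardless of slot counts.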

Re: How to balance reduce job

2013-04-16 Thread Mohammad Tariq
Just to add to Bejoy's comments, it also depends on the data distribution. Is your data properly distributed across the HDFS? Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Wed, Apr 17, 2013 at 10:39 AM, bejoy.had...@gmail.com wrote: ** Hi Rauljin Few things to

Re: How to balance reduce job

2013-04-16 Thread bejoy . hadoop
Uniform data distribution across HDFS is one of the factors that ensures map tasks are uniformly distributed across nodes. But reduce tasks don't depend on data distribution; their placement is purely based on slot availability. Regards Bejoy KS Sent from remote device, Please excuse typos -Original