adding new datanode into cluster needs restarting the namenode?

2012-11-21 Thread Maoke
Hi all, is there anyone with experience adding a new datanode into a rack-aware cluster without restarting the namenode, on the CDH4 distribution? Adding a new datanode is said to be a hot operation that can be done while the cluster is online. I tried that, but it did not seem to work...
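
For what it's worth, the usual hot-add procedure is: map the new node's address in the topology script (or topology data file), add it to the dfs.hosts include file if one is used, start the datanode, and then refresh the namenode rather than restarting it. Below is a minimal sketch of the refresh step in Java, assuming a CDH4-style (Hadoop 2) classpath; the CLI equivalent is "hdfs dfsadmin -refreshNodes":

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hdfs.tools.DFSAdmin;
    import org.apache.hadoop.util.ToolRunner;

    public class RefreshNodes {
        public static void main(String[] args) throws Exception {
            // Programmatic equivalent of "hdfs dfsadmin -refreshNodes": asks the
            // NameNode to re-read dfs.hosts / dfs.hosts.exclude without a restart.
            int rc = ToolRunner.run(new Configuration(), new DFSAdmin(),
                    new String[] { "-refreshNodes" });
            System.exit(rc);
        }
    }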

Re: reducer not starting

2012-11-21 Thread Jean-Marc Spaggiari
Just FYI, you don't need to stop the job, update the host, and retry. Just update the host while the job is running and it should retry and restart. I had a similar issue with one of my nodes where the hosts file was not updated. After the update it automatically resumed the work... JM

Not able to change the priority of job using fair scheduler

2012-11-21 Thread Chunky Gupta
Hi, I have enabled the fair scheduler and everything is set to default with only a few configuration changes. It is working fine and multiple users can run queries simultaneously. But I am not able to change the priority from http://JobTracker URL/scheduler. The Priority column is coming up as a...
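
If the scheduler page will not take the change, the priority can also be set outside the web UI. A minimal sketch using the old (mapred) API; the job id here is hypothetical, and the CLI equivalent is "hadoop job -set-priority <job-id> HIGH":

    import java.io.IOException;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.JobPriority;
    import org.apache.hadoop.mapred.RunningJob;

    public class SetPriority {
        public static void main(String[] args) throws IOException {
            JobClient client = new JobClient(new JobConf());
            // Look up a running job by id (hypothetical id) and bump its priority.
            // At submission time the same thing can be done with
            // JobConf.setJobPriority(JobPriority.HIGH).
            RunningJob job = client.getJob(JobID.forName("job_201211210000_0001"));
            if (job != null) {
                job.setJobPriority(JobPriority.HIGH.toString()); // VERY_HIGH..VERY_LOW
            }
        }
    }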

Re: When speculative execution is true, there is a data loss issue with MultipleOutputs

2012-11-21 Thread Radim Kolar
It's not data loss; the problem is that MultipleOutputs does not work with the standard committer if you do not write into a subdirectory of the main job output.
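
A sketch of what "write into a subdirectory of the main job output" looks like with the new-API MultipleOutputs; the key/value types and the "side/part" path are illustrative, not from the thread. The point is that a relative base path keeps the side files inside the task attempt's work directory, so FileOutputCommitter promotes only the winning attempt; an absolute path bypasses the committer and concurrent speculative attempts can clobber each other:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class SafeSideOutputReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        private MultipleOutputs<Text, LongWritable> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<Text, LongWritable>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) sum += v.get();
            // RELATIVE base path: lands inside the task attempt's work directory,
            // so the committer promotes exactly one attempt's files.
            // An ABSOLUTE path like "/data/side/part" would bypass the committer.
            mos.write(key, new LongWritable(sum), "side/part");
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();
        }
    }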

Re: reducer not starting

2012-11-21 Thread jamal sasha
Hi, Thanks for the insights. I noticed that these restarts of mappers were because in the shebang I had Usr/env/bin instead of usr/env/bin python. Any clue about what was going on with reducers not starting while mappers were being executed again and again? Probably a very naive question, but I am a newbie, you...

Re: reducer not starting

2012-11-21 Thread bharath vissapragada
As Harsh suggested, you might want to check the task logs on the slaves (you can do it through the web UI by clicking on the map/reduce task links) and see if there are any exceptions. On Wed, Nov 21, 2012 at 8:06 PM, jamal sasha jamalsha...@gmail.com wrote: Hi, Thanks for the insights. I noticed that...

Re: When speculative execution is true, there is a data loss issue with MultipleOutputs

2012-11-21 Thread AnilKumar B
Thanks Radim. Yes, as you said, we are not writing into a subdirectory of the main job output. I will try making them subdirectories of the output dir. But one question: when I turn off speculative execution, it works fine with the same multiple-output directory structure. May I know how exactly it...

Re: When speculative execution is true, there is a data loss issue with MultipleOutputs

2012-11-21 Thread Radim Kolar
On 21.11.2012 16:07, AnilKumar B wrote: Thanks Radim. Yes, as you said, we are not writing into a subdirectory of the main job output. I will try making them subdirectories of the output dir. But one question: when I turn off speculative execution, it works fine with the same multiple-output...

Re: When speculative execution is true, there is a data loss issue with MultipleOutputs

2012-11-21 Thread Radim Kolar
This is another problem with the FileOutputFormat committer; it's related to yours: https://issues.apache.org/jira/browse/MAPREDUCE-3772 It works like this: if the MultipleOutputs path is relative to the job output, then there is a workaround to make it work with the committer, and outputs from multiple tasks do not...

io.file.buffer.size

2012-11-21 Thread Kartashov, Andy
Guys, I've read that increasing the above (default 4 KB) number to, say, 128 KB might speed things up. My input is 40 million serialized records coming from an RDBMS, and I noticed that with the increased IO buffer my job actually runs a tiny bit slower. Is that possible? P.S. I've got two questions: 1. During Sqoop import...
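
For reference, the property can be set per job on the Configuration as well as cluster-wide in core-site.xml. A minimal sketch; 131072 is just the 128 KB figure from the post:

    import org.apache.hadoop.conf.Configuration;

    public class BufferSize {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Default is 4096 (4 KB). This buffers Hadoop's IO streams; it is not a
            // guaranteed speedup, and past a point the extra copying can cost a little.
            conf.setInt("io.file.buffer.size", 131072); // 128 KB
        }
    }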

guessing number of reducers.

2012-11-21 Thread jamal sasha
By default the number of reducers is set to 1. Is there a good way to guess the optimal number of reducers? Or let's say I have TBs worth of data... mappers are of the order of 5000 or so... But ultimately I am calculating, let's say, some average of the whole data... say the average transaction occurring...

RE: guessing number of reducers.

2012-11-21 Thread Kartashov, Andy
Jamal, This is what I am using... After you start your job, visit the jobtracker's web UI at ip-address:50030 and look for the Cluster Summary. The Reduce Task Capacity should hint at what to set your number to. I could be wrong, but it works for me. :) Cluster Summary (Heap Size is *** MB/966.69 MB)...
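
The same capacity number can be read programmatically. A minimal sketch against the old (mapred) client API, assuming an MR1 jobtracker:

    import java.io.IOException;
    import org.apache.hadoop.mapred.ClusterStatus;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ReduceCapacity {
        public static void main(String[] args) throws IOException {
            JobClient client = new JobClient(new JobConf());
            ClusterStatus status = client.getClusterStatus();
            // Same number the jobtracker UI (:50030) shows as "Reduce Task Capacity".
            System.out.println("Reduce task capacity: " + status.getMaxReduceTasks());
        }
    }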

Re: guessing number of reducers.

2012-11-21 Thread Bejoy KS
Hi Sasha, In general the number of reduce tasks is chosen mainly based on the data volume reaching the reduce phase. In tools like hive and pig, by default there is one reducer for every 1 GB of map output. So if you have 100 gigs of map output, then 100 reducers. If your tasks are more CPU intensive...

RE: guessing number of reducers.

2012-11-21 Thread Kartashov, Andy
Bejoy, I've read somewhere about keeping the number of mapred.reduce.tasks below the reduce task capacity. Here is what I just tested, with 25 GB of output on an 8-DN cluster with 16 Map and Reduce Task Capacity:
1 Reducer - 22 mins
4 Reducers - 11.5 mins
8 Reducers - 5 mins
10 Reducers - 7 mins
12 Reducers -...

Re: guessing number of reducers.

2012-11-21 Thread Mohammad Tariq
Hello Jamal, I use a different approach, based on the number of cores. If you have, say, a 4-core machine, then you can have (0.75 * number of cores) MR slots. For example, if you have 4 physical cores or 8 virtual cores, then you can have 0.75 * 8 = 6 MR slots. You can then set 3M+3R or 4M+2R and so on as...

Re: Facebook corona compatibility

2012-11-21 Thread Robert Molina
Hi Amit, There is a mention here of starting in the hadoop-20 parent path: https://github.com/facebook/hadoop-20/wiki/Corona-Single-Node-Setup Regards, Rob On Mon, Nov 12, 2012 at 8:01 AM, Amit Sela am...@infolinks.com wrote: Hi everyone, Anyone knows if the new Corona tools (Facebook just...

Re: guessing number of reducers.

2012-11-21 Thread Bejoy KS
Hi Manoj, If you intend to calculate the number of reducers based on the input size, then in your driver class you should get the size of the input dir in HDFS, and if you intend to give n bytes to each reducer, the number of reducers can be computed as total input size / bytes per reducer.
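
A minimal driver sketch of that computation, assuming the input path comes in as the first argument and using input size as a stand-in for map output size (the 1 GB figure mirrors the hive/pig default mentioned earlier); the job name is hypothetical:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerCount {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            Path input = new Path(args[0]);
            long bytesPerReducer = 1L << 30; // 1 GB per reducer

            // Total size of the input directory in HDFS.
            long totalBytes = FileSystem.get(conf).getContentSummary(input).getLength();
            int reducers = (int) Math.max(1,
                    (totalBytes + bytesPerReducer - 1) / bytesPerReducer);

            Job job = new Job(conf, "avg-transactions"); // hypothetical job name
            job.setNumReduceTasks(reducers);
        }
    }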

MapReduce logs

2012-11-21 Thread Jean-Marc Spaggiari
Hi, When we run a MapReduce job, the logs are stored on all the tasktracker nodes. Is there an easy way to aggregate all those logs together and see them in a single place, instead of going to the tasks one by one and opening the files? Thanks, JM

Re: MapReduce logs

2012-11-21 Thread Dino Kečo
Hi, We had a similar requirement and we built a small Java application which gets information about the task nodes from the Job Tracker and downloads the logs into one file using the URLs of each task tracker. For huge logs this becomes slow and time consuming. Hope this helps. Regards, Dino Kečo msn:...
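
A minimal sketch of the per-task fetch step such an application could do, assuming the MR1 tasktracker log servlet on port 50060; the host and attempt id are hypothetical:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class FetchTaskLog {
        public static void main(String[] args) throws Exception {
            // MR1 tasktrackers serve task logs over HTTP on port 50060.
            URL logUrl = new URL("http://tt-host:50060/tasklog"
                    + "?attemptid=attempt_201211211408_0001_m_000000_0&all=true");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(logUrl.openStream()));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // or append to one aggregate file
                }
            } finally {
                in.close();
            }
        }
    }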

Re: MapReduce logs

2012-11-21 Thread Jean-Marc Spaggiari
Thanks for the info. I have quickly drafted this bash script in case it can help someone... You just need to make sure the IP inside is replaced. To call it, you need to give it the job task page: ./showLogs.sh http://192.168.23.7:50030/jobtasks.jsp?jobid=job_201211211408_0001&type=map&pagenum=1

Re: fundamental doubt

2012-11-21 Thread Mohammad Tariq
Hello Jamal, For efficient processing, the key/value pairs are sorted by key and all the values associated with the same key go to the same reducer. As a result the reducer gets a key and a list of values as its input. To me your assumption seems correct. Regards, Mohammad Tariq On Thu, Nov 22, 2012 at 1:20 AM,...

Re: fundamental doubt

2012-11-21 Thread Bejoy KS
Hi Jamal, It is performed at the framework level: map emits key/value pairs, and the framework collects and groups all the values corresponding to a key from all the map tasks. The reducer then takes as input a key and a collection of values. The reduce method signature defines it.
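
A minimal reducer illustrating that signature, computing the kind of average Jamal described; the key/value types here are illustrative:

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class AvgReducer extends Reducer<Text, LongWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0, count = 0;
            for (LongWritable v : values) {
                sum += v.get();
                count++;
            }
            // One call per key: the framework has already grouped the values.
            context.write(key, new DoubleWritable((double) sum / count));
        }
    }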

Re: fundamental doubt

2012-11-21 Thread jamal sasha
Got it, thanks for the clarification. On Wed, Nov 21, 2012 at 3:03 PM, Bejoy KS bejoy.had...@gmail.com wrote: Hi Jamal, It is performed at the framework level: map emits key/value pairs, and the framework collects and groups all the values corresponding to a key from all the map tasks...

Re: Pentaho

2012-11-21 Thread Harsh J
A better place to ask this is Pentaho's own community: http://wiki.pentaho.com/display/BAD/Pentaho+Big+Data+Community+Home At a glance, they have forums and IRC you could use to ask your questions about their product. On Wed, Nov 21, 2012 at 11:40 PM, suneel hadoop...

Re: Facebook corona compatibility

2012-11-21 Thread Harsh J
IIRC, Facebook's own hadoop branch (Github: facebook/hadoop, I guess) does not support or carry any of the security features which Apache Hadoop 0.20.203 - 1.1.x now carries. So out of the box, I expect it to be incompatible with any of the recent Apache releases. On Mon, Nov 12, 2012 at 9:31 PM, Amit...

Re: guessing number of reducers.

2012-11-21 Thread Manoj Babu
Thank you for the info Bejoy. Cheers! Manoj. On Thu, Nov 22, 2012 at 12:04 AM, Bejoy KS bejoy.had...@gmail.com wrote: Hi Manoj, If you intend to calculate the number of reducers based on the input size, then in your driver class you should get the size of the input dir in HDFS, and if...

Re: Hadoop Web Interface Security

2012-11-21 Thread Visioner Sadak
Thanks Harsh. Any hints on how to give user.name in the configuration files for simple authentication? Is that given as a property? On Wed, Nov 21, 2012 at 5:52 PM, Harsh J ha...@cloudera.com wrote: Yes, see http://hadoop.apache.org/docs/current/hadoop-auth/Configuration.html and also see...

RE: HADOOP UPGRADE ISSUE

2012-11-21 Thread Uma Maheswara Rao G
start-all.sh will not carry any arguments to pass to the nodes. Start with start-dfs.sh, or start the namenode directly with the upgrade option: ./hadoop namenode -upgrade Regards, Uma From: yogesh dhari [yogeshdh...@live.com] Sent: Thursday, November 22, 2012 12:23 PM