separate JVM flags for map and reduce tasks

2010-04-22 Thread Vasilis Liaskovitis
Hi, I'd like to pass one set of JVM options to map tasks and a different set to reduce tasks. I think it should be straightforward to add mapred.mapchild.java.opts, mapred.reducechild.java.opts to my conf/mapred-site.xml and process the new options accordingly in src/mapred/org/apache/mapreduce/T
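For reference, later Hadoop releases added exactly this split, under slightly different property names than the ones proposed above; a minimal mapred-site.xml sketch, assuming a version that ships the per-task-type properties (values illustrative):

    <!-- mapred-site.xml: separate JVM options per task type; on 0.20 only
         the combined mapred.child.java.opts exists -->
    <property>
      <name>mapred.map.child.java.opts</name>
      <value>-Xmx512m</value>
    </property>
    <property>
      <name>mapred.reduce.child.java.opts</name>
      <value>-Xmx1024m</value>
    </property>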

Re: swapping on hadoop

2010-04-01 Thread Vasilis Liaskovitis
Hi, On Thu, Apr 1, 2010 at 2:02 PM, Scott Carey wrote: >> In this example, what hadoop config parameters do the above 2 buffers >> refer to? io.sort.mb=250, but which parameter does the "map side join" >> 100MB refer to? Are you referring to the split size of the input data >> handled by a single
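For context, io.sort.mb sizes the per-map-task sort buffer, which is allocated inside the child task's JVM heap; a minimal sketch (values illustrative):

    <!-- mapred-site.xml: the io.sort.mb buffer lives inside the child heap,
         so it must fit under the -Xmx given in mapred.child.java.opts -->
    <property>
      <name>io.sort.mb</name>
      <value>250</value>
    </property>
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx512m</value>
    </property>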

Re: swapping on hadoop

2010-04-01 Thread Vasilis Liaskovitis
All, thanks for your suggestions, everyone; these are valuable. Some comments: On Wed, Mar 31, 2010 at 6:06 PM, Scott Carey wrote: > On Linux, check out the 'swappiness' OS tunable -- you can turn this down > from the default to reduce swapping at the expense of some system file cache. > However
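The 'swappiness' tunable mentioned above is set through sysctl; a minimal sketch (the value 10 is illustrative; the Linux default is typically 60):

    # Lower the kernel's tendency to swap out anonymous pages.
    sysctl -w vm.swappiness=10
    # Persist across reboots by adding the same setting to /etc/sysctl.conf:
    #   vm.swappiness = 10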

swapping on hadoop

2010-03-30 Thread Vasilis Liaskovitis
Hi all, I've noticed swapping for a single terasort job on a small 8-node cluster using hadoop-0.20.1. The swapping doesn't happen repeatably; I can have back-to-back runs of the same job from the same hdfs input data and get swapping on only 1 out of 4 identical runs. I've noticed this swapping

hadoop.log.dir

2010-03-29 Thread Vasilis Liaskovitis
Hi all, is there a config option that controls the placement of all hadoop logs? I'd like to put all hadoop logs under a specific directory, e.g. /tmp, on the namenode and all datanodes. Is hadoop.log.dir the right config? Can I change this in the log4j.properties file, or pass it e.g. in the JVM op
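For reference, hadoop.log.dir is normally controlled through the HADOOP_LOG_DIR environment variable rather than by editing log4j.properties; a minimal conf/hadoop-env.sh sketch (the path is illustrative):

    # conf/hadoop-env.sh: the start-up scripts export this and pass it to
    # each daemon JVM as -Dhadoop.log.dir, which log4j.properties references.
    export HADOOP_LOG_DIR=/tmp/hadoop-logs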

reuse JVMs across multiple jobs

2010-02-19 Thread Vasilis Liaskovitis
Hi, Is it possible (and does it make sense) to reuse JVMs across jobs? The job.reuse.jvm.num.tasks config option is a job-specific parameter, as its name implies. When running multiple independent jobs simultaneously with job.reuse.jvm=-1 (this means always reuse), I see a lot of different Java P
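For reference, JVM reuse in this era of Hadoop is scoped to a single job, so JVMs are not shared across jobs even with unlimited reuse; a minimal sketch of the setting being discussed:

    <!-- mapred-site.xml or per-job config: -1 means a JVM may run an
         unlimited number of tasks, but only tasks of the same job -->
    <property>
      <name>mapred.job.reuse.jvm.num.tasks</name>
      <value>-1</value>
    </property>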

JVM heap and sort buffer size guidelines

2010-02-19 Thread Vasilis Liaskovitis
Hi, For a node with M gigabytes of memory and N total child tasks (both map + reduce) running on the node, what do people typically use for the following parameters: - Xmx (heap size per child task JVM)? I.e. my question here is what percentage of the node's total memory do you use for the heaps of
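One common budgeting rule (an illustration, not an answer from the thread): keep N times -Xmx comfortably below M, leaving headroom for the DataNode/TaskTracker daemons and the OS page cache. A sketch for an 8 GB node running 6 child tasks:

    <!-- mapred-site.xml: 6 child tasks x 1 GB heap = 6 GB, leaving ~2 GB
         for daemons and page cache on an 8 GB node (numbers illustrative) -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1024m</value>
    </property>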

using multiple disks for HDFS

2010-02-09 Thread Vasilis Liaskovitis
Hi, I am trying to use 4 SATA disks per node in my hadoop cluster. This is a JBOD configuration; no RAID is involved. There is one single xfs partition per disk, each one mounted as /local/, /local2/, /local3, /local4 - with sufficient privileges for running hadoop jobs. HDFS is set up across the 4
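For a JBOD layout like this, both HDFS and MapReduce accept comma-separated directory lists and spread data across them; a sketch using the mount points above (the hdfs/mapred subdirectory names are assumptions):

    <!-- hdfs-site.xml: the DataNode round-robins new blocks across dirs -->
    <property>
      <name>dfs.data.dir</name>
      <value>/local/hdfs,/local2/hdfs,/local3/hdfs,/local4/hdfs</value>
    </property>
    <!-- mapred-site.xml: spread map spill and shuffle space the same way -->
    <property>
      <name>mapred.local.dir</name>
      <value>/local/mapred,/local2/mapred,/local3/mapred,/local4/mapred</value>
    </property>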

maximum number of jobs

2010-02-08 Thread Vasilis Liaskovitis
Hi, I am trying to submit many independent jobs in parallel (same user). This works for up to 16 jobs, but after that I only get 16 jobs in parallel no matter how many I try to submit. I am using the fair scheduler with the following config: 12 12 100 4 100 Judging by this config
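The inline numbers above lost their XML tags in the archive, so the exact allocation file cannot be reconstructed; for reference, a hedged sketch of the elements the 0.20 fair scheduler uses to cap concurrent jobs (values illustrative):

    <?xml version="1.0"?>
    <!-- allocations file named by mapred.fairscheduler.allocation.file -->
    <allocations>
      <pool name="default">
        <maxRunningJobs>100</maxRunningJobs>
      </pool>
      <userMaxJobsDefault>100</userMaxJobsDefault>
    </allocations>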

Re: ClassCastException in lzo indexer

2010-02-02 Thread Vasilis Liaskovitis
ec io.compression.codec.lzo.class com.hadoop.compression.lzo.LzoCodec my error sounds lzop-specific, maybe my io.compression.codec.lzo.class should include something about lzop? thanks, - Vasilis > Thanks > -Todd > > On Tue, Feb 2, 2010 at 9:09 AM, Vasilis Liaskovitis wrote:
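For what it's worth, the hadoop-lzo project ships two codec classes, and files produced by the lzop tool need the lzop-format one; a hedged sketch of the distinction:

    <!-- core-site.xml: LzoCodec handles raw LZO streams, while LzopCodec
         understands the lzop file format (headers and checksums) that
         .lzo files written by the lzop tool actually use -->
    <property>
      <name>io.compression.codecs</name>
      <value>org.apache.hadoop.io.compress.GzipCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
    </property>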

ClassCastException in lzo indexer

2010-02-02 Thread Vasilis Liaskovitis
Hi, I am trying to use hadoop-0.20.1 and hadoop-lzo (http://github.com/kevinweil/hadoop-lzo) to index an lzo file. I've followed the instructions and copied both the jar and the native libs into my classpath. I am getting this error in both local and distributed indexer mode: bin/hadoop jar lib/hadoop-lzo
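For reference, the indexer entry points in kevinweil/hadoop-lzo are LzoIndexer (local) and DistributedLzoIndexer (MapReduce-based); a sketch with assumed jar and HDFS paths:

    # Local, single-process indexing of an .lzo file already in HDFS:
    bin/hadoop jar /path/to/hadoop-lzo.jar \
      com.hadoop.compression.lzo.LzoIndexer /hdfs/path/file.lzo
    # MapReduce-based variant for large files:
    bin/hadoop jar /path/to/hadoop-lzo.jar \
      com.hadoop.compression.lzo.DistributedLzoIndexer /hdfs/path/file.lzo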

verifying that lzo compression is being used

2010-01-27 Thread Vasilis Liaskovitis
I am trying to use lzo for intermediate map compression and gzip for output compression in my hadoop-0.20.1 jobs. For lzo usage, I've compiled the .jar and JNI/native library from http://code.google.com/p/hadoop-gpl-compression/ (version 0.1.0). Also using native lzo library v2.03. Is there an easy w
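The combination described here maps onto four job properties; a minimal sketch (property names from 0.20; codec classes from the libraries named above):

    <!-- mapred-site.xml: LZO for intermediate map output, gzip for output -->
    <property>
      <name>mapred.compress.map.output</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.map.output.compression.codec</name>
      <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
    <property>
      <name>mapred.output.compress</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.output.compression.codec</name>
      <value>org.apache.hadoop.io.compress.GzipCodec</value>
    </property>

One quick sanity check is grepping the task logs for the message the codec prints when it loads the native-lzo library at initialization.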

Re: hadoop idle time on terasort

2009-12-08 Thread Vasilis Liaskovitis
Hi Scott, thanks for the extra tips, these are very helpful. On Mon, Dec 7, 2009 at 3:57 PM, Scott Carey wrote: > >> >> I am using hadoop-0.20.1 to run terasort and randsort benchmarking >> tests on a small 8-node linux cluster. Most runs consist of usually >> low (<50%) core utilizations in the

Re: hadoop idle time on terasort

2009-12-02 Thread Vasilis Liaskovitis
oop-common, hadoop-mapred and hadoop-hdfs? thanks again, - Vasilis > Thanks, > -Todd > > On Wed, Dec 2, 2009 at 12:22 PM, Vasilis Liaskovitis > wrote: > >> Hi, >> >> I am using hadoop-0.20.1 to run terasort and randsort benchmarking >> tests on a small 8-

hadoop idle time on terasort

2009-12-02 Thread Vasilis Liaskovitis
Hi, I am using hadoop-0.20.1 to run terasort and randsort benchmarking tests on a small 8-node linux cluster. Most runs consist of usually low (<50%) core utilizations in the map and reduce phase, as well as heavy I/O phases. There is usually a large fraction of runtime for which cores are idling
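Two knobs commonly examined when cores idle between the map and reduce phases (an illustration, not a diagnosis from the thread): how early reducers start fetching map output, and how many parallel shuffle copiers each reducer runs:

    <!-- mapred-site.xml: slowstart is the fraction of maps that must finish
         before reducers launch; parallel.copies is the number of shuffle
         fetch threads per reducer (0.20 defaults are 0.05 and 5; the
         values below are illustrative) -->
    <property>
      <name>mapred.reduce.slowstart.completed.maps</name>
      <value>0.05</value>
    </property>
    <property>
      <name>mapred.reduce.parallel.copies</name>
      <value>20</value>
    </property>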

build and use hadoop-git

2009-11-29 Thread Vasilis Liaskovitis
Hi, how can I build and use hadoop-git? The project has recently been split into 3 repositories: hadoop-common, hadoop-hdfs and hadoop-mapred. It's not clear to me how to build/compile and use the git tip for the whole framework. E.g. would building all jars from the 3 subprojects (and copying them
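A sketch of the Ant-based flow the post-split tree used, assuming the three repos are cloned side by side (the repo URL and the jar target are assumptions from the 0.20-era build):

    # Each subproject builds independently with Ant:
    git clone git://git.apache.org/hadoop-common.git
    (cd hadoop-common && ant jar)
    # Repeat for hadoop-hdfs and hadoop-mapreduce, then collect the built
    # jars into one installation's lib/ directory.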

default job scheduler behaviour

2009-09-26 Thread Vasilis Liaskovitis
Hi, given a single cluster running with the default job scheduler: is only one job executing on the cluster, regardless of how many map/reduce task slots it can keep busy? In other words, if a job does not use all task slots, would the default scheduler consider scheduling map/reduce tasks from other jo

filesystem counters HDFS_BYTES vs FILE_BYTES

2009-09-16 Thread Vasilis Liaskovitis
Hi, in the filesystem counters for each job, what is the difference between HDFS_BYTES_WRITTEN and FILE_BYTES_WRITTEN? - Do they refer to disjoint data, perhaps HDFS metadata and map/reduce application data, respectively? - Another interpretation is that HDFS_BYTES refers to bytes "virtually" writ

duplicate START_TIME, FINISH_TIME timestamps in history log

2009-09-10 Thread Vasilis Liaskovitis
Hi, I am getting different values for START_TIME, FINISH_TIME regarding the exact same task when looking at the history log of a sorter job. E.g. grepping for a particular reduce task in the history log: masternode:/home/hadoop-git # grep -r "task_200909031613_0002_r_02" /home/vliaskov/hadoop-

performance counters & vaidya diagnostics help

2009-08-28 Thread Vasilis Liaskovitis
Hi, a) Is there a wiki page or other documentation explaining the exact meaning of the job / filesystem / mapreduce counters reported after every job run? 09/08/27 15:04:10 INFO mapred.JobClient: Job complete: job_200908271428_0002 09/08/27 15:04:10 INFO mapred.JobClient: Counters: 19 09/08/27 15:

Re: utilizing all cores on single-node hadoop

2009-08-23 Thread Vasilis Liaskovitis
lize all the 8 cores to the maximum >> (there's >> a little bit of over-subscription to account for tasks idling while doing >> I/O). >> >> In the web admin console, how many map-tasks and reduce-tasks are reported >> to have been launched for your job? >

utilizing all cores on single-node hadoop

2009-08-17 Thread Vasilis Liaskovitis
Hi, I am a beginner trying to set up a few simple hadoop tests on a single node before moving on to a cluster. I am just using the simple wordcount example for now. My question is: what's the best way to guarantee utilization of all cores on a single node? So assuming a single node with 16-cores wha
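A common starting point (an illustration under the thread's assumptions): set the per-TaskTracker slot counts to roughly match the core count, with mild over-subscription to cover tasks blocked on I/O. Note that for wordcount the number of input splits also caps how many maps can run at once:

    <!-- mapred-site.xml: slots per TaskTracker on a 16-core node
         (numbers illustrative) -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>16</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>8</value>
    </property>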