Hi,
I'd like to pass different JVM options for map tasks and different
ones for reduce tasks. I think it should be straightforward to add
mapred.mapchild.java.opts, mapred.reducechild.java.opts to my
conf/mapred-site.xml and process the new options accordingly in
src/mapred/org/apache/mapreduce/T
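For reference, a minimal sketch of what such per-task-type entries could look
like in conf/mapred-site.xml; the two property names below are just the ones
proposed above (stock hadoop-0.20.1 only honors the shared
mapred.child.java.opts), and the -Xmx values are placeholders:

  <!-- proposed names, not present in stock 0.20.1 -->
  <property>
    <name>mapred.mapchild.java.opts</name>
    <value>-Xmx512m</value>
  </property>
  <property>
    <name>mapred.reducechild.java.opts</name>
    <value>-Xmx1024m</value>
  </property>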
Hi,
On Thu, Apr 1, 2010 at 2:02 PM, Scott Carey wrote:
>> In this example, what hadoop config parameters do the above 2 buffers
>> refer to? io.sort.mb=250, but which parameter does the "map side join"
>> 100MB refer to? Are you referring to the split size of the input data
>> handled by a single
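For reference, io.sort.mb is the config knob (settable in mapred-site.xml)
that sizes the in-memory buffer used to sort map output, e.g.:

  <property>
    <name>io.sort.mb</name>
    <value>250</value>
  </property>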
All,
Thanks for your suggestions, everyone; these are valuable.
Some comments:
On Wed, Mar 31, 2010 at 6:06 PM, Scott Carey wrote:
> On Linux, check out the 'swappiness' OS tunable -- you can turn this down
> from the default to reduce swapping at the expense of some system file cache.
> However
Hi all,
I've noticed swapping for a single terasort job on a small 8-node
cluster using hadoop-0.20.1. The swapping doesn't happen reproducibly; I
can have back-to-back runs of the same job from the same HDFS input
data and get swapping on only 1 out of 4 identical runs. I've noticed
this swapping
Hi all,
is there a config option that controls placement of all hadoop logs?
I'd like to put all hadoop logs under a specific directory, e.g. /tmp,
on the namenode and all datanodes.
Is hadoop.log.dir the right config? Can I change this in the
log4j.properties file, or pass it e.g. in the JVM op
Hi,
Is it possible (and does it make sense) to reuse JVMs across jobs?
The mapred.job.reuse.jvm.num.tasks config option is a job-specific
parameter, as its name implies. When running multiple independent jobs
simultaneously with reuse set to -1 (i.e. always reuse), I see
a lot of different Java P
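For reference, a minimal mapred-site.xml sketch of the reuse setting; as far
as I can tell it is scoped to a single job, so JVMs are not shared across
independent jobs even with -1:

  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <!-- -1 means reuse a JVM for an unlimited number of tasks of the same job -->
    <value>-1</value>
  </property>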
Hi,
For a node with M gigabytes of memory and N total child tasks (both
map + reduce) running on the node, what do people typically use for
the following parameters:
- Xmx (heap size per child task JVM)?
I.e. my question here is what percentage of the node's total memory do
you use for the heaps of
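As a strawman only (the numbers are assumptions, not recommendations): for,
say, a 16 GB node running 8 child tasks, one could leave a few GB for the
DataNode/TaskTracker daemons and the OS file cache, e.g. in mapred-site.xml:

  <!-- assumed example: 8 concurrent children x ~1.5 GB heap = ~12 GB, leaving ~4 GB headroom -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1536m</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>6</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>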
Hi,
I am trying to use 4 SATA disks per node in my hadoop cluster. This is
a JBOD configuration, no RAID is involved. There is one single xfs
partition per disk, each one mounted as /local, /local2, /local3, or
/local4, with sufficient privileges for running hadoop jobs. HDFS is
setup across the 4
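For what it's worth, a sketch of how the four mount points could be listed as
comma-separated directories so that HDFS blocks and map/reduce spill space are
spread across all disks; the hdfs/ and mapred/ subdirectory names are just an
assumption:

  <!-- hdfs-site.xml -->
  <property>
    <name>dfs.data.dir</name>
    <value>/local/hdfs,/local2/hdfs,/local3/hdfs,/local4/hdfs</value>
  </property>

  <!-- mapred-site.xml -->
  <property>
    <name>mapred.local.dir</name>
    <value>/local/mapred,/local2/mapred,/local3/mapred,/local4/mapred</value>
  </property>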
Hi,
I am trying to submit many independent jobs in parallel (same user).
This works for up to 16 jobs, but beyond that only 16 jobs run in
parallel no matter how many I try to submit. I am using the fair scheduler
with the following config:
12
12
100
4
100
Judging by this config
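In case the 16-job ceiling comes from a per-pool or per-user running-job limit
rather than from slot counts, a sketch of the allocations-file elements I
would check (the value 32 is just an example):

  <?xml version="1.0"?>
  <allocations>
    <pool name="default">
      <maxRunningJobs>32</maxRunningJobs>
    </pool>
    <userMaxJobsDefault>32</userMaxJobsDefault>
  </allocations>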
ec
io.compression.codec.lzo.class
com.hadoop.compression.lzo.LzoCodec
my error sounds lzop-specific; maybe my io.compression.codec.lzo.class
should include something about lzop?
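For anyone hitting the same error: my understanding is that lzop-framed .lzo
files are handled by LzopCodec rather than LzoCodec, so the codec list in
core-site.xml would need to register it too; a sketch along the lines of the
hadoop-lzo README:

  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
  </property>
  <property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>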
thanks,
- Vasilis
> Thanks
> -Todd
>
> On Tue, Feb 2, 2010 at 9:09 AM, Vasilis Liaskovitis wrote:
Hi,
I am trying to use hadoop-0.20.1 and hadoop-lzo
(http://github.com/kevinweil/hadoop-lzo) to index an lzo file. I've
followed the instructions and copied both jar and native libs in my
classpaths. I am getting this error in both local and distributed
indexer mode
bin/hadoop jar lib/hadoop-lzo
I am trying to use lzo for intermediate map compression and gzip for
output compression in my hadoop-0.20.1 jobs. For lzo usage, I've
compiled .jar and jni/native library from
http://code.google.com/p/hadoop-gpl-compression/ (version 0.1.0). Also
using native lzo library v2.03.
Is there an easy w
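For reference, a sketch of the 0.20-era job properties that, as far as I know,
control this combination (lzo for intermediate map output, gzip for the final
output); worth double-checking the names against mapred-default.xml:

  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>
  <property>
    <name>mapred.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
  </property>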
Hi Scott,
thanks for the extra tips, these are very helpful.
On Mon, Dec 7, 2009 at 3:57 PM, Scott Carey wrote:
>
>>
>> I am using hadoop-0.20.1 to run terasort and randsort benchmarking
>> tests on a small 8-node linux cluster. Most runs consist of usually
>> low (<50%) core utilizations in the
hadoop-common, hadoop-mapred and hadoop-hdfs?
thanks again,
- Vasilis
> Thanks,
> -Todd
>
> On Wed, Dec 2, 2009 at 12:22 PM, Vasilis Liaskovitis
> wrote:
>
>> Hi,
>>
>> I am using hadoop-0.20.1 to run terasort and randsort benchmarking
>> tests on a small 8-
Hi,
I am using hadoop-0.20.1 to run terasort and randsort benchmarking
tests on a small 8-node Linux cluster. Most runs show low (<50%) core
utilization in the map and reduce phases, as well as heavy I/O phases.
There is usually a large fraction of runtime during which cores are
idling
Hi,
how can I build and use hadoop-git?
The project has recently been split into 3 repositories hadoop-common,
hadoop-hdfs and hadoop-mapred. It's not clear to me how to
build/compile and use the git tip of the whole framework. E.g. would
building all jars from the 3 subprojects (and copying them
Hi,
given a single cluster running with the default job scheduler: is only
one job executed on the cluster at a time, regardless of how many
map/reduce task slots it can keep busy?
In other words, if a job does not use all task slots, would the
default scheduler consider scheduling map/reduce tasks from other jo
Hi,
in the filesystem counters for each job, what is the difference
between HDFS_BYTES_WRITTEN and FILE_BYTES_WRITTEN?
- Do they refer to disjoint data, perhaps HDFS metadata and map/reduce
application data, respectively?
- Another interpretation is that HDFS_BYTES refers to bytes
"virtually" writ
Hi,
I am getting different values for START_TIME and FINISH_TIME for the
exact same task when looking at the history log of a sorter job.
E.g. grepping for a particular reduce task in the history log:
masternode:/home/hadoop-git # grep -r
"task_200909031613_0002_r_02"
/home/vliaskov/hadoop-
Hi,
a) Is there a wiki page or other documentation explaining the exact
meaning of the job / filesystem / mapreduce counters reported after
every job run?
09/08/27 15:04:10 INFO mapred.JobClient: Job complete: job_200908271428_0002
09/08/27 15:04:10 INFO mapred.JobClient: Counters: 19
09/08/27 15:
utilize all the 8 cores to the maximum
>> (there's
>> a little bit of over-subscription to account for tasks idling while doing
>> I/O).
>>
>> In the web admin console, how many map-tasks and reduce-tasks are reported
>> to have been launched for your job?
>
Hi,
I am a beginner trying to set up a few simple hadoop tests on a single
node before moving on to a cluster. I am just using the simple
wordcount example for now. My question is: what's the best way to
guarantee utilization of all cores on a single node? So assuming a
single node with 16 cores, wha
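As a starting point, a mapred-site.xml sketch that allows roughly one child
task per core on a 16-core box (the split count of the input still has to be
large enough to keep the map slots busy; the exact numbers are an assumption):

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>12</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>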