Installing Hadoop on Debian Squeeze

2011-03-15 Thread Dieter Plaetinck
Hi, I see there are various posts claiming Hadoop is available through the official Debian mirrors (for Debian squeeze, i.e. stable): * http://www.debian-news.net/2010/07/17/apache-hadoop-in-debian-squeeze/ * http://blog.isabel-drost.de/index.php/archives/213/apache-hadoop-in-debian-squeeze

Re: Installing Hadoop on Debian Squeeze

2011-03-21 Thread Dieter Plaetinck
On Thu, 17 Mar 2011 19:33:02 +0100 Thomas Koch tho...@koch.ro wrote: Currently my advice is to use the Debian packages from Cloudera. That's the problem, it appears there are none. Like I said in my earlier mail, Debian is not in Cloudera's list of supported distros, and they do not have a…

# of keys per reducer invocation (streaming api)

2011-03-29 Thread Dieter Plaetinck
Hi, I'm using the streaming API and I notice my reducer gets - in the same invocation - a bunch of different keys, and I wonder why. I would expect to get one key per reducer run, as with normal (non-streaming) Hadoop. Is this to limit the number of spawned processes, assuming creating and destroying…
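With streaming, the framework starts one reducer process per reduce task and pipes it the whole sorted key<TAB>value stream on stdin, so a single invocation sees many keys and the script has to detect the key boundaries itself. A minimal sketch of such a reducer (the tab-separated layout is streaming's default; the per-key count is a hypothetical aggregation):

    #!/usr/bin/env python
    # Streaming reducer sketch: one process receives the whole sorted
    # key<TAB>value stream, so group by key manually.
    import sys
    from itertools import groupby

    def parse(stdin):
        for line in stdin:
            key, _, value = line.rstrip('\n').partition('\t')
            yield key, value

    for key, group in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
        # hypothetical aggregation: count the values seen for this key
        count = sum(1 for _ in group)
        sys.stdout.write('%s\t%d\n' % (key, count))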

Re: # of keys per reducer invocation (streaming api)

2011-03-31 Thread Dieter Plaetinck
On Tue, 29 Mar 2011 23:17:13 +0530 Harsh J qwertyman...@gmail.com wrote: Hello, On Tue, Mar 29, 2011 at 8:25 PM, Dieter Plaetinck dieter.plaeti...@intec.ugent.be wrote: Hi, I'm using the streaming API and I notice my reducer gets - in the same invocation - a bunch of different keys…

hadoop streaming shebang line for python and mappers jumping to 100% completion right away

2011-03-31 Thread Dieter Plaetinck
Hi, I use 0.20.2 on Debian 6.0 (squeeze) nodes. I have 2 problems with my streaming jobs: 1) I start the job like so: hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \ -file /proj/Search/wall/experiment/ \ -mapper './nolog.sh mapper' \ -reducer…
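One common way to sidestep shebang trouble is to not rely on the shebang at all: ship the scripts with -file and name the interpreter explicitly in -mapper/-reducer. A sketch under that assumption (mapper.py and reducer.py are hypothetical stand-ins for the scripts in the shipped directory):

    hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
        -file mapper.py \
        -file reducer.py \
        -mapper 'python mapper.py' \
        -reducer 'python reducer.py' \
        -input my-input \
        -output my-output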

sorting reducer input numerically in hadoop streaming

2011-03-31 Thread Dieter Plaetinck
Hi, I use Hadoop 0.20.2, more specifically hadoop-streaming, on Debian 6.0 (squeeze) nodes. My question is: how do I make sure the input keys being fed to the reducer are sorted numerically rather than alphabetically? Example of the standard (lexicographic) behavior: #1 some-value1 #10 some-value10 #100 some-value100…

INFO org.apache.hadoop.ipc.Server: Error register getProtocolVersion and other errors

2011-04-04 Thread Dieter Plaetinck
Hi, I have a cluster of 4 Debian squeeze machines, and on all of them I installed the same version (hadoop-0.20.2.tar.gz). I have: n-0 as namenode, n-1 as jobtracker, and n-{0,1,2,3} as slaves; you can see all my configs in more detail at http://pastie.org/1754875 The machines have 3GiB RAM. I don't…
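Whatever the root cause of the getProtocolVersion message turns out to be, a first sanity check with a split namenode/jobtracker layout like this one is that every node agrees on where both daemons live. A sketch of the two relevant 0.20.2 properties (the port numbers are hypothetical):

    <!-- core-site.xml, identical on all four nodes -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://n-0:9000</value>
    </property>

    <!-- mapred-site.xml, identical on all four nodes -->
    <property>
      <name>mapred.job.tracker</name>
      <value>n-1:9001</value>
    </property>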

Re: INFO org.apache.hadoop.ipc.Server: Error register getProtocolVersion and other errors

2011-04-11 Thread Dieter Plaetinck
…of the datanode or tasktracker logs. And the NameNode web interface even tells me all nodes are live, none are dead. This is effectively holding me back from using the cluster; I'm completely in the dark, and I find this very frustrating. :( Thank you, Dieter On Mon, 4 Apr 2011 18:45:49 +0200 Dieter Plaetinck…

Re: sorting reducer input numerically in hadoop streaming

2011-04-13 Thread Dieter Plaetinck
…parameter for it, as noted in: http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#A+Useful+Comparator+Class [the -D mapred.output.key.comparator.class=xyz part] On Thu, Mar 31, 2011 at 6:26 PM, Dieter Plaetinck dieter.plaeti...@intec.ugent.be wrote: couldn't find how I should do that.
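Spelled out, the comparator setup from that page of the 0.20.2 streaming docs looks like this: KeyFieldBasedComparator with the -n option sorts the keys numerically (the -mapper/-reducer/-input/-output placeholders are as in the original job):

    hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
        -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
        -D mapred.text.key.comparator.options=-n \
        -mapper ... -reducer ... -input ... -output ...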

can a `hadoop -jar streaming.jar` command return when a job is packaged and submitted?

2011-05-06 Thread Dieter Plaetinck
Hi, I have a script something like this (simplified): for i in $(seq 1 200); do regenerate-files $dir $i hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \ -D mapred.job.name=$i \ -file $dir \ -mapper ... -reducer ... -input $i-input -output…
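The streaming jar itself blocks until the job finishes, so one workaround is to let the shell do the overlapping. A sketch of the same loop with backgrounded submissions (caveat: -file packages $dir at submission time, so regenerating it for iteration i+1 can race with a job still being packaged; taking a per-iteration copy of $dir would sidestep that):

    for i in $(seq 1 200); do
        regenerate-files $dir $i
        hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
            -D mapred.job.name=$i \
            -file $dir \
            -mapper ... -reducer ... -input $i-input -output $i-output &
    done
    wait  # returns once every backgrounded submission (and job) has finished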

Re: can a `hadoop -jar streaming.jar` command return when a job is packaged and submitted?

2011-05-06 Thread Dieter Plaetinck
exec_stream_job.sh regenerate-files $dir $i hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \ -D mapred.job.name=$i \ -file $dir \ -mapper ... -reducer ... -input $i-input -output $i-output… From: Dieter…

What exactly are the output_dir/part-00000 semantics (of a streaming job)?

2011-05-12 Thread Dieter Plaetinck
Hi, I'm running some experiments using Hadoop streaming. I always get an output_dir/part-00000 file at the end, but I wonder: when exactly will this filename show up? When it's completely written, or will it already show up while the mapreduce framework is still writing to it? Is the write atomic?

Re: What exactly are the output_dir/part-00000 semantics (of a streaming job)?

2011-05-13 Thread Dieter Plaetinck
On Thu, 12 May 2011 09:49:23 -0700 (PDT) Aman aman_d...@hotmail.com wrote: The creation of the part-NNNNN files is atomic. When you run an MR job, these files are created in the directory output_dir/_temporary and moved to output_dir after the file is closed for writing. This move is atomic, hence as…
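Given that rename-out-of-_temporary behavior, the mere existence of the part file implies it is complete, so a consumer can simply poll for it. A minimal sketch (output_dir and the 10-second interval are hypothetical):

    while ! hadoop dfs -test -e output_dir/part-00000; do
        sleep 10  # file not visible yet, so the job is still writing
    done
    # the atomic move has happened; part-00000 is complete here
    hadoop dfs -cat output_dir/part-00000 | head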

Re: Are hadoop fs commands serial or parallel

2011-05-20 Thread Dieter Plaetinck
What do you mean, clunky? IMHO this is quite an elegant, simple, working solution. Sure, it spawns multiple processes, but it beats any API overcomplication, IMHO. Dieter On Wed, 18 May 2011 11:39:36 -0500 Patrick Angeles patr...@cloudera.com wrote: kinda clunky but you could do this via…
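A sketch of that multi-process approach, fanning copies out over eight hadoop fs invocations at a time (GNU xargs assumed; the paths are hypothetical):

    ls /local/srcdir | xargs -P 8 -I{} \
        hadoop dfs -copyFromLocal /local/srcdir/{} /user/dieter/dest/{}

Each invocation pays a JVM start-up cost, which is the scaling concern raised later in the thread for lists of thousands of files.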

Re: Are hadoop fs commands serial or parallel

2011-05-23 Thread Dieter Plaetinck
On Fri, 20 May 2011 10:11:13 -0500 Brian Bockelman bbock...@cse.unl.edu wrote: On May 20, 2011, at 6:10 AM, Dieter Plaetinck wrote: What do you mean clunky? IMHO this is a quite elegant, simple, working solution. Try giving it to a user; watch them feed it a list of 10,000 files…

executing dfs copyFromLocal, rm, ... synchronously? i.e. wait until done?

2011-06-20 Thread Dieter Plaetinck
Hi, if I simplify my code, I basically do this: hadoop dfs -rm -skipTrash $file; hadoop dfs -copyFromLocal $local $file (the removal is needed because I run a job but previous input/output may exist, so I need to delete it first, as -copyFromLocal does not support overwriting). During the 2nd…
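As far as the shell is concerned, both commands are synchronous client calls: they return only after the operation has completed on HDFS, so chaining on exit status is usually enough to serialize them. A sketch (the -test probe is just a hypothetical double-check):

    hadoop dfs -rm -skipTrash $file  # may fail harmlessly if $file does not exist
    hadoop dfs -copyFromLocal $local $file || exit 1
    hadoop dfs -test -e $file && echo "$file is visible in HDFS"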

Re: next gen map reduce

2011-08-01 Thread Dieter Plaetinck
On Thu, 28 Jul 2011 06:13:01 -0700 Thomas Graves tgra...@yahoo-inc.com wrote: It's currently still on the MR279 branch - http://svn.apache.org/viewvc/hadoop/common/branches/MR-279/. It is planned to be merged to trunk soon. Tom On 7/28/11 7:31 AM, real great..…

Re: Namenode Scalability

2011-08-17 Thread Dieter Plaetinck
Hi, On Wed, 10 Aug 2011 13:26:18 -0500 Michel Segel michael_se...@hotmail.com wrote: This sounds more like a homework assignment than a real-world problem. Why? just wondering. I guess people don't race cars against trains or have two trains traveling in different directions anymore... :-) huh?

Exception in thread main java.io.IOException: No FileSystem for scheme: file

2011-08-26 Thread Dieter Plaetinck
Hi, I know this question has been asked before, but I could not find the right solution, maybe because I use hadoop 0.20.2 and some posts assumed older versions. My code (relevant chunk): import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.FileSystem; Configuration conf =…
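One classic trigger of this error in 0.20.2 is launching the class with a plain java invocation whose classpath misses hadoop-0.20.2-core.jar, which carries the core-default.xml that maps the file scheme to LocalFileSystem. A sketch of the launcher-based workaround (MyTool is a hypothetical class name):

    javac -classpath /usr/local/hadoop/hadoop-0.20.2-core.jar MyTool.java
    jar cf mytool.jar MyTool*.class
    hadoop jar mytool.jar MyTool  # bin/hadoop puts core-default.xml on the classpath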

Re: Binary content

2011-09-01 Thread Dieter Plaetinck
On Wed, 31 Aug 2011 08:44:42 -0700 Mohit Anchlia mohitanch...@gmail.com wrote: Does map-reduce work well with binary contents in the file? This binary content is basically some CAD files, and the map-reduce program needs to read these files using some proprietary tool, extract values and do some…

Re: risks of using Hadoop

2011-09-21 Thread Dieter Plaetinck
On Wed, 21 Sep 2011 11:21:01 +0100 Steve Loughran ste...@apache.org wrote: On 20/09/11 22:52, Michael Segel wrote: PS... There's this junction box in your machine room that has this very large on/off switch. If pulled down, it will cut power to your cluster and you will lose everything.

Re: HDFS and Openstack - avoiding excessive redundancy

2011-11-14 Thread Dieter Plaetinck
Or, more generally: isn't virtualized I/O counterproductive when dealing with Hadoop M/R? I would think that for running Hadoop M/R you'd want predictable and consistent I/O on each node, not to mention that your bottlenecks are usually disk I/O (and maybe CPU), so using virtualisation makes…

DIR 2012 (CFP)

2011-11-28 Thread Dieter Plaetinck
Hello friends of Hadoop, I just want to inform you about the 12th edition of the Dutch Information Retrieval conference (DIR 2012), which will be organized in the lovely city of Ghent, Belgium on 23-24 February 2012. There's the usual CFP; see the website at http://dir2012.intec.ugent.be/ There's definitely…

Re: HDFS Explained as Comics

2011-12-01 Thread Dieter Plaetinck
Very clear. The comic format indeed works quite well. I never considered comics a serious (professional) way to get something explained efficiently, but this shows people should think twice before they start writing their next documentation. One question though: if a DN has a corrupted…

Re: Optimized Hadoop

2012-02-22 Thread Dieter Plaetinck
Great work, folks! Very interesting. PS: did you notice that if you google for hanborq or HDH, it's very hard to find your website, hanborq.com? Dieter On Tue, 21 Feb 2012 02:17:31 +0800 Schubert Zhang zson...@gmail.com wrote: We just updated the slides of these improvements:…