Re: Optimized Hadoop

2012-02-22 Thread Dieter Plaetinck
Great work folks! Very interesting.

PS: did you notice that if you google for hanborq or HDH it's very hard to find
your website, hanborq.com?

Dieter

On Tue, 21 Feb 2012 02:17:31 +0800
Schubert Zhang zson...@gmail.com wrote:

 We have just updated the slides for these improvements:
 http://www.slideshare.net/hanborq/hanborq-optimizations-on-hadoop-mapreduce-20120216a
 
 Updates:
 (1) Revised some descriptions to make things clearer and more accurate.
 (2) Added some benchmarks to back up the claims.
 
 On Sat, Feb 18, 2012 at 11:12 PM, Anty anty@gmail.com wrote:
 
 
 
  On Fri, Feb 17, 2012 at 3:27 AM, Todd Lipcon t...@cloudera.com wrote:
 
  Hey Schubert,
 
  Looking at the code on github, it looks like your rewritten shuffle is
  in fact just a backport of the shuffle from MR2. I didn't look closely
 
 
  Additionally, the rewritten shuffle in MR2 has some bugs which harm the
  overall performance. I have already filed a JIRA to report this,
  with a patch available:
  MAPREDUCE-3685 https://issues.apache.org/jira/browse/MAPREDUCE-3685
 
 
 
  - are there any distinguishing factors?
  Also, the OOB heartbeat and adaptive heartbeat code seems to be the
  same as what's in 1.0?
 
  -Todd
 
  On Thu, Feb 16, 2012 at 9:44 AM, Schubert Zhang zson...@gmail.com
  wrote:
   Here is the presentation describing our work,
  
  http://www.slideshare.net/hanborq/hanborq-optimizations-on-hadoop-mapreduce-20120216a
   You are welcome to give your advice.
   It's just a small step, and we will continue to make more improvements;
   thanks for your help.
  
  
  
  
   On Thu, Feb 16, 2012 at 11:01 PM, Anty anty@gmail.com wrote:
  
   Hi Guys,
   We have just delivered an optimized Hadoop; if you are interested, please
   refer to https://github.com/hanborq/hadoop
  
   --
   Best Regards
   Anty Rao
  
  
 
 
 
  --
  Todd Lipcon
  Software Engineer, Cloudera
 
 
 
 
  --
  Best Regards
  Anty Rao
 



Re: HDFS Explained as Comics

2011-12-01 Thread Dieter Plaetinck
Very clear.  The comic format indeed works quite well.
I never considered comics a serious (professional) way to get something
explained efficiently, but this shows people should think twice before they
start writing their next piece of documentation.

one question though: if a DN has a corrupted block, why does the NN only remove 
the bad DN from the block's list, and not the block from the DN list?
(also, does it really store the data in 2 separate tables?  This looks to me 
like 2 different views of the same data?)

Dieter

On Thu, 1 Dec 2011 08:53:31 +0100
Alexander C.H. Lorenz wget.n...@googlemail.com wrote:

 Hi all,
 
 very cool comic!
 
 Thanks,
  Alex
 
 On Wed, Nov 30, 2011 at 11:58 PM, Abhishek Pratap Singh
 manu.i...@gmail.com
  wrote:
 
  Hi,
 
   This is indeed a good way to explain; most of the improvements have
   already been discussed. Waiting for the sequel of this comic.
 
  Regards,
  Abhishek
 
  On Wed, Nov 30, 2011 at 1:55 PM, maneesh varshney
  mvarsh...@gmail.com
  wrote:
 
   Hi Matthew
  
   I agree with both you and Prashant. The strip needs to be
   modified to explain that these can be default values that can be
   optionally
  overridden
   (which I will fix in the next iteration).
  
   However, from the 'understanding concepts of HDFS' point of view,
   I still think that block size and replication factors are the
   real strengths of HDFS, and the learners must be exposed to them
   so that they get to see
  how
   hdfs is significantly different from conventional file systems.
  
   On personal note: thanks for the first part of your message :)
  
   -Maneesh
  
  
   On Wed, Nov 30, 2011 at 1:36 PM, GOEKE, MATTHEW (AG/1000) 
   matthew.go...@monsanto.com wrote:
  
Maneesh,
   
Firstly, I love the comic :)
   
 Secondly, I am inclined to agree with Prashant on this latest point. While
 one code path could take us through the user defining command line overrides
 (e.g. hadoop fs -D blah -put foo bar), I think it might confuse a person new
 to Hadoop. The most common flow would be using admin-determined values from
 hdfs-site, and the only thing that would need to change is that the
 conversation happens between client / server and not user / client.

 Matt
   
-Original Message-
From: Prashant Kommireddi [mailto:prash1...@gmail.com]
Sent: Wednesday, November 30, 2011 3:28 PM
To: common-user@hadoop.apache.org
Subject: Re: HDFS Explained as Comics
   
 Sure, it's just a case of how readers interpret it:

   1. Client is required to specify block size and replication factor each
      time
   2. Client does not need to worry about it since an admin has set the
      properties in default configuration files

 A client would not be allowed to override the default configs if they are
 set final (well, there are ways to go around it, as well as what you suggest
 by using create() :)

 The information is great and helpful. Just want to make sure a beginner who
 wants to write a WordCount in MapReduce does not worry about specifying
 block size and replication factor in his code.
   
Thanks,
Prashant
   
On Wed, Nov 30, 2011 at 1:18 PM, maneesh varshney
mvarsh...@gmail.com
wrote:
   
 Hi Prashant

 Others may correct me if I am wrong here..

  The client (org.apache.hadoop.hdfs.DFSClient) has knowledge of the block
  size and replication factor. In the source code, I see the following in the
  DFSClient constructor:

    defaultBlockSize = conf.getLong("dfs.block.size", DEFAULT_BLOCK_SIZE);

    defaultReplication = (short) conf.getInt("dfs.replication", 3);

  My understanding is that the client considers the following chain for the
  values:
  1. Manual values (the long-form constructor; when a user provides these
     values)
  2. Configuration file values (these are cluster-level defaults:
     dfs.block.size and dfs.replication)
  3. Finally, the hardcoded values (DEFAULT_BLOCK_SIZE and 3)

  Moreover, in org.apache.hadoop.hdfs.protocol.ClientProtocol the API to
  create a file is

    void create(..., short replication, long blocksize);

  I presume it means that the client already has knowledge of these values
  and passes them to the NameNode when creating a new file.
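
  For concreteness, a minimal sketch of passing those values explicitly
  through the FileSystem API (0.20-era signatures; the path and numbers below
  are purely illustrative, not taken from this thread):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class CreateWithExplicitValues {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();  // picks up dfs.block.size / dfs.replication defaults
      FileSystem fs = FileSystem.get(conf);
      short replication = 2;                     // per-file override of dfs.replication
      long blockSize = 32 * 1024 * 1024;         // per-file override of dfs.block.size (32 MB)
      FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"),
          true /* overwrite */, 4096 /* buffer */, replication, blockSize);
      out.writeUTF("hello");
      out.close();
    }
  }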

 Hope that helps.

 thanks
 -Maneesh

 On Wed, Nov 30, 2011 at 1:04 PM, Prashant Kommireddi 
prash1...@gmail.com
 wrote:

  Thanks Maneesh.
 
   Quick question: does a client really need to know block size and
   replication factor? A lot of times the client has no control over these
   (set at cluster level).
 
  -Prashant Kommireddi
 
  On Wed, Nov 30, 2011 at 12:51 PM, Dejan Menges 
   dejan.men...@gmail.com
  wrote:
 
   Hi 

DIR 2012 (CFP)

2011-11-28 Thread Dieter Plaetinck
Hello friends of hadoop,
I just want to inform you about the 12th edition of the Dutch Information
Retrieval conference, which will be organized in the lovely city of Ghent,
Belgium on 23/24 February 2012.

There's the usual CFP, see the website at http://dir2012.intec.ugent.be/
There's definitely a place for hadoop/map-reduce related items.
The name says Dutch/Belgian because that's where most research institutions
and companies that show up are located, but obviously everybody is welcome.
:-)

I hope to see you there!

Dieter 


Re: HDFS and Openstack - avoiding excessive redundancy

2011-11-14 Thread Dieter Plaetinck
Or more generally:
isn't using virtualized I/O counterproductive when dealing with hadoop M/R?
I would think that for running hadoop M/R you'd want predictable and consistent
I/O on each node, not to mention your bottlenecks are usually disk I/O (and
maybe CPU), so using virtualisation makes things less performant and less
predictable, and therefore inferior.  Or am I missing something?

Dieter

On Sat, 12 Nov 2011 07:54:05 +
Graeme Seaton li...@graemes.com wrote:

 One advantage to using Hadoop replication though, is that it provides
 a greater pool of potential servers for M/R jobs to execute on.  If
 you simply use Openstack replication it will appear to the JobTracker
 that a particular block only exists on a single server and should
  only be executed on that node.  This may have an impact
 depending on your workload profile.
 
 Regards,
 Graeme
 
 On 12/11/11 07:24, Dejan Menges wrote:
   The replication factor for HDFS can easily be changed to 1 in
   hdfs-site.xml if you don't need its redundancy.
 
  Regards,
  Dejo
 
  Sent from my iPhone
 
  On 12. 11. 2011., at 03:58, Edmon Begoliebeg...@gmail.com  wrote:
 
  A question related to standing up cloud infrastructure for running
  Hadoop/HDFS.
 
  We are building up an infrastructure using Openstack which has its
  own storage management redundancy.
 
  We are planning to use Openstack to instantiate Hadoop nodes (HDFS,
  M/R tasks, Hive, HBase)
  on demand.
 
  The problem is that HDFS by design creates three copies of the
   data, so there is 4x redundancy overall,
  which we would prefer to avoid.
 
  I am asking here if anyone has had a similar case and if anyone has
  had any helpful solution to recommend.
 
  Thank you in advance,
  Edmon
 



Re: risks of using Hadoop

2011-09-21 Thread Dieter Plaetinck
On Wed, 21 Sep 2011 11:21:01 +0100
Steve Loughran ste...@apache.org wrote:

 On 20/09/11 22:52, Michael Segel wrote:
 
  PS... There's this junction box in your machine room that has this
  very large on/off switch. If pulled down, it will cut power to your
  cluster and you will lose everything. Now would you consider this a
  risk? Sure. But is it something you should really lose sleep over?
  Do you understand that there are risks and there are improbable
  risks?
 
 We follow the @devops_borat Ops book and have a post-it note on the
 switch saying "not a light switch".

:D


Re: Binary content

2011-09-01 Thread Dieter Plaetinck
On Wed, 31 Aug 2011 08:44:42 -0700
Mohit Anchlia mohitanch...@gmail.com wrote:

 Does map-reduce work well with binary content in the files? This
 binary content is basically some CAD files, and the map-reduce program needs
 to read these files using some proprietary tool, extract values and do
 some processing. Wondering if there are others doing a similar type of
 processing. Best practices etc.?

Yes, it works.  You just need to select the right input format.
Personally I store all my binary files in a SequenceFile (because my binary
files are small).
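
A minimal sketch of that packing step (against the 0.20 SequenceFile API; the
paths are illustrative, key = original file name, value = the raw bytes):

import java.io.File;
import java.io.FileInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallBinaries {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/data/cad-files.seq"), Text.class, BytesWritable.class);
    for (File f : new File(args[0]).listFiles()) {  // local directory of small binary files
      byte[] buf = new byte[(int) f.length()];
      FileInputStream in = new FileInputStream(f);
      in.read(buf);                                 // small files, so a single read is assumed to fill the buffer
      in.close();
      writer.append(new Text(f.getName()), new BytesWritable(buf));
    }
    writer.close();
  }
}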

Dieter


Exception in thread main java.io.IOException: No FileSystem for scheme: file

2011-08-26 Thread Dieter Plaetinck
Hi,
I know this question has been asked before, but I could not find
the right solution.  Maybe because I use hadoop 0.20.2, while some posts assumed
older versions.


My code (relevant chunk):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

the last line gives:
Exception in thread main java.io.IOException: No FileSystem for
scheme: file


I launched this like so:
java -cp /usr/local/hadoop/src/core/:/usr/local/hadoop/conf/ -jar
myjar.jar
AFAICT, this should make sure all the configuration can be found and it should 
be able to connect to the filesystem:

dplaetin@n-0:~$ ls -alh /usr/local/hadoop/src/core/ /usr/local/hadoop/conf/
/usr/local/hadoop/conf/:
total 64K
drwxr-xr-x  2 dplaetin Search 4.0K Aug 26 17:21 .
drwxr-xr-x 12 root root   4.0K Feb 19  2010 ..
-rw-rw-r--  1 root root   3.9K Feb 19  2010 capacity-scheduler.xml
-rw-rw-r--  1 root root    535 Feb 19  2010 configuration.xsl
-rw-r--r--  1 dplaetin Search  459 Apr 29 15:06 core-site.xml
-rw-r--r--  1 dplaetin Search 2.3K Apr 11 14:23 hadoop-env.sh
-rw-rw-r--  1 root root   1.3K Feb 19  2010 hadoop-metrics.properties
-rw-rw-r--  1 root root   4.1K Feb 19  2010 hadoop-policy.xml
-rw-r--r--  1 dplaetin Search  490 Apr 11 10:18 hdfs-site.xml
-rw-r--r--  1 dplaetin Search 2.8K Apr 11 14:23 log4j.properties
-rw-r--r--  1 dplaetin Search 1.1K Aug  3 09:49 mapred-site.xml
-rw-rw-r--  1 root root 10 Feb 19  2010 masters
-rw-r--r--  1 dplaetin Search   95 Apr  4 17:17 slaves
-rw-rw-r--  1 root root   1.3K Feb 19  2010 ssl-client.xml.example
-rw-rw-r--  1 root root   1.2K Feb 19  2010 ssl-server.xml.example

/usr/local/hadoop/src/core/:
total 36K
drwxr-xr-x  3 root root 4.0K Aug 24 10:40 .
drwxr-xr-x 15 root root 4.0K Aug 24 10:40 ..
-rw-rw-r--  1 root root  14K Feb 19  2010 core-default.xml
drwxr-xr-x  3 root root 4.0K Feb 19  2010 org
-rw-rw-r--  1 root root 7.9K Feb 19  2010 overview.html

Specifically, you can see I have a core-default.xml and a core-site.xml, which 
should be all that's needed,
according to the org.apache.hadoop.conf.Configuration documentation
(http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/conf/Configuration.html)
I read somewhere there should be a hadoop-default.xml; I thought this was 
deprecated but to be sure I
created the empty file in /usr/local/hadoop/src/core/ , but the error remained 
the same.

The cluster works fine, I've done tens of jobs on it, but as you can see, 
something fails when I try to interface to it directly.

Thanks in advance for any help,
Dieter




Re: Namenode Scalability

2011-08-17 Thread Dieter Plaetinck
Hi,

On Wed, 10 Aug 2011 13:26:18 -0500
Michel Segel michael_se...@hotmail.com wrote:

 This sounds more like a homework assignment than a real-world problem.

Why? just wondering.

 I guess people don't race cars against trains or have two trains
 traveling in different directions anymore... :-)

huh?


Re: next gen map reduce

2011-08-01 Thread Dieter Plaetinck
On Thu, 28 Jul 2011 06:13:01 -0700
Thomas Graves tgra...@yahoo-inc.com wrote:

 Its currently still on the MR279 branch -
 http://svn.apache.org/viewvc/hadoop/common/branches/MR-279/.  It is
 planned to be merged to trunk soon.
 
 Tom
 
 On 7/28/11 7:31 AM, real great.. greatness.hardn...@gmail.com
 wrote:
 
  In which Hadoop version is next gen introduced?
 

Hi,
what exactly is contained within this mysterious-sounding next-generation
MRv2? What's it about?

Dieter


executing dfs copyFromLocal, rm, ... synchronously? i.e. wait until done?

2011-06-20 Thread Dieter Plaetinck
Hi,
if I simplify my code, I basically do this:
hadoop dfs -rm -skipTrash $file
hadoop dfs -copyFromLocal - $local $file

(the removal is needed because I run a job but previous input/output may exist, 
so I need to delete it first, as -copyFromLocal does not support overwrite)

During the 2nd command I often get NotReplicatedYetException errors (see below
for the complete stacktrace).
Is there a way to make commands not return until the file removal (and by
extension: addition) has completely replicated?
I couldn't find it in the help for `hadoop dfs -copyFromLocal, -rm, -rmr, ...` 
etc.
I found https://issues.apache.org/jira/browse/HADOOP-1595 which introduces a 
flag for synchronous operation, but it only affects setrep.

Alternatively (a bit uglier but still acceptable): is there a
command/script I can execute to check whether the latest operation on
this file has been replicated yet?
I've been looking at `hadoop fsck` which seems to do what I want, at
least for files that should exist, but for removed files I'm not so sure.
And it's hard to manually test all possible edge cases and race
conditions.

Currently I'm running my script without -skipTrash, on the assumption that the
operation is much faster and the race condition will be less likely.
However I'm not too fond of this approach, as it could still break
(copyFromLocal could break when the first file hasn't been fully moved to
trash yet), and I don't have that much disk space; I'd rather free disk space
as soon as I can,
since I'm always sure I won't need the files again anyway.
I guess yet another trick could be for me to move the file first to a temporary 
junkfile and start a `hadoop dfs -rm -skipTrash` on the junkfile, but that 
could cause 2 race conditions (both the copyFromLocal of the original file and 
the rm of the junkfile could break if the move is not yet fully replicated)
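
(For what it's worth, one way to do both steps from a single process is the
FileSystem API; a minimal sketch against the 0.20-era calls, with no guarantee
it avoids the exception above:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplaceFile {
  public static void main(String[] args) throws Exception {
    Path local = new Path(args[0]);   // local source
    Path remote = new Path(args[1]);  // HDFS destination
    FileSystem fs = FileSystem.get(new Configuration());
    if (fs.exists(remote)) {
      fs.delete(remote, false);       // returns once the namenode has removed the file from its namespace
    }
    fs.copyFromLocalFile(local, remote);  // blocks until the copy has been written and closed
  }
}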

thanks for any input,
Dieter


11/06/20 14:15:40 INFO hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: 
org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not 
replicated 
yet:/user/dplaetin/wip-dirty-CF0-CHNK100-EB1-FW1-FW_K0-FW_NA1-FW_NB0-M2-MP0-NFmerged_ner_mofis-NUMBEST10-PATRN_INCL-PR_A0-PR_Fpunctuation_v1-S_L0-S_P0-SR_Fstopwords_ranks.nl_v1-SQ_K10-SQ_R1-TFIDF1-input
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1257)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

at org.apache.hadoop.ipc.Client.call(Client.java:740)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy0.addBlock(Unknown Source)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy0.addBlock(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2937)
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2819)
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)


Re: Are hadoop fs commands serial or parallel

2011-05-23 Thread Dieter Plaetinck
On Fri, 20 May 2011 10:11:13 -0500
Brian Bockelman bbock...@cse.unl.edu wrote:

 
 On May 20, 2011, at 6:10 AM, Dieter Plaetinck wrote:
 
  What do you mean clunky?
  IMHO this is a quite elegant, simple, working solution.
 
 Try giving it to a user; watch them feed it a list of 10,000 files;
 watch the machine swap to death and the disks uselessly thrash.
 
  Sure this spawns multiple processes, but it beats any
  api-overcomplications, imho.
  
 
 Simple doesn't imply scalable, unfortunately.
 
 Brian

True, I assumed if anyone wants this, he knows what he's doing (i.e.
the files could be small and already in the Linux block cache).
Because why would anyone read files in parallel if that causes disk
seeks all over the place? Ideally, you should tune for 1 sequential read
per disk at a time. In that respect, I definitely agree that some
clever logic in userspace to optimize disk reads (across a bunch of
different possible hardware setups) would be beneficial.

Dieter


Re: Are hadoop fs commands serial or parallel

2011-05-20 Thread Dieter Plaetinck
What do you mean clunky?
IMHO this is a quite elegant, simple, working solution.
Sure this spawns multiple processes, but it beats any
api-overcomplications, imho.

Dieter


On Wed, 18 May 2011 11:39:36 -0500
Patrick Angeles patr...@cloudera.com wrote:

 kinda clunky but you could do this via shell:
 
  for FILE in $LIST_OF_FILES ; do
    hadoop fs -copyFromLocal $FILE $DEST_PATH &
  done
 
 If doing this via the Java API, then, yes you will have to use
 multiple threads.
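
  A rough sketch of that multi-threaded variant (thread pool size and paths
  are illustrative; FileSystem API as of 0.20):

  import java.util.Arrays;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ParallelCopy {
    public static void main(String[] args) throws Exception {
      final Configuration conf = new Configuration();
      final Path dest = new Path(args[0]);             // HDFS destination directory
      ExecutorService pool = Executors.newFixedThreadPool(4);
      for (final String file : Arrays.copyOfRange(args, 1, args.length)) {
        pool.submit(new Runnable() {
          public void run() {
            try {
              // each individual file is still written serially; parallelism is across files
              FileSystem.get(conf).copyFromLocalFile(new Path(file), dest);
            } catch (Exception e) {
              e.printStackTrace();
            }
          }
        });
      }
      pool.shutdown();
      pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);  // wait for all uploads to finish
    }
  }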
 
 On Wed, May 18, 2011 at 1:04 AM, Mapred Learn
  mapred.le...@gmail.com wrote:
 
  Thanks harsh !
  That means basically both APIs as well as hadoop client commands
  allow only serial writes.
  I was wondering what could be other ways to write data in parallel
  to HDFS other than using multiple parallel threads.
 
  Thanks,
  JJ
 
  Sent from my iPhone
 
  On May 17, 2011, at 10:59 PM, Harsh J ha...@cloudera.com wrote:
 
   Hello,
  
   Adding to Joey's response, copyFromLocal's current implementation
   is
  serial
   given a list of files.
  
   On Wed, May 18, 2011 at 9:57 AM, Mapred Learn
   mapred.le...@gmail.com wrote:
   Thanks Joey !
    I will try to find out about copyFromLocal. Looks like Hadoop APIs
    write serially, as you pointed out.
  
   Thanks,
   -JJ
  
   On May 17, 2011, at 8:32 PM, Joey Echeverria j...@cloudera.com
   wrote:
  
   The sequence file writer definitely does it serially as you can
   only ever write to the end of a file in Hadoop.
  
   Doing copyFromLocal could write multiple files in parallel (I'm
   not sure if it does or not), but a single file would be written
   serially.
  
   -Joey
  
   On Tue, May 17, 2011 at 5:44 PM, Mapred Learn
   mapred.le...@gmail.com
   wrote:
    Hi,
    My question is: when I run a command from an hdfs client, e.g.
    hadoop fs -copyFromLocal, or create a sequence file writer in java code
    and append key/values to it through Hadoop APIs, does it internally
    transfer/write data to HDFS serially or in parallel?
  
   Thanks in advance,
   -JJ
  
  
  
  
   --
   Joseph Echeverria
   Cloudera, Inc.
   443.305.9434
  
  
   --
   Harsh J
 



Re: What exactly are the output_dir/part-00000 semantics (of a streaming job) ?

2011-05-13 Thread Dieter Plaetinck
On Thu, 12 May 2011 09:49:23 -0700 (PDT)
Aman aman_d...@hotmail.com wrote:

 The creation of the part-nnnnn files is atomic. When you run an MR job,
 these files are created in the directory output_dir/_temporary and
 moved to output_dir after each file is closed for writing. This
 move is atomic, hence as long as you don't try to read these files
 from the temporary directory (which I see you are not) you will be fine.

Perfect!
thanks.

Dieter


What exactly are the output_dir/part-00000 semantics (of a streaming job) ?

2011-05-12 Thread Dieter Plaetinck
Hi,
I'm running some experiments using hadoop streaming.
I always get an output_dir/part-00000 file at the end, but I wonder:
when exactly will this filename show up? When it's completely written,
or will it already show up while the mapreduce software is still
writing to it? Is the write atomic?


The reason I'm asking this: I have a script which submits +- 200
jobs to mapreduce, and I have another script collecting the
part-00000 files of all jobs (not just once when all experiments are
done; I frequently collect the results of all jobs finished thus far).

For this, I just do (simplified code):

for i in $(seq 1 200); do
  if ssh $master bin/hadoop dfs -ls $i/output/part-00000; then
    ssh $master bin/hadoop dfs -cat $i/output/part-00000 > output_$i
  fi
done

and I wonder if this is prone to race conditions: is there any chance I
will run this while $i/output/part-00000 is in the process of being
written to, and hence I end up with incomplete output_$i files?

If so, what's the proper way to check if the file is really stable?
fetching the jobtracker webpage and checking if job $i is finished?
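
(The job status can also be queried programmatically instead of scraping the
jobtracker page; a minimal sketch against the 0.20 mapred API, with the job id
passed in as an argument:)

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class IsJobDone {
  public static void main(String[] args) throws Exception {
    // JobConf picks up mapred-site.xml, so this talks to the configured JobTracker
    JobClient client = new JobClient(new JobConf());
    RunningJob job = client.getJob(JobID.forName(args[0]));  // e.g. job_201105121200_0042 (illustrative id)
    if (job != null && job.isComplete()) {
      System.out.println(args[0] + (job.isSuccessful() ? " finished OK" : " finished, but failed"));
    } else {
      System.out.println(args[0] + " still running (or unknown)");
    }
  }
}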

Dieter


can a `hadoop -jar streaming.jar` command return when a job is packaged and submitted?

2011-05-06 Thread Dieter Plaetinck
Hi,
I have a script something like this (simplified):

for i in $(seq 1 200); do
   regenerate-files $dir $i
   hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
-D mapred.job.name=$i \
-file $dir \
-mapper ... -reducer ... -input $i-input -output $i-output
done

So I want to launch 200 hadoop jobs, each of which needs files from $dir; more
specifically, some files in $dir are generated to be used with
job $i (and only that job).
The problem is, generating those files takes some time.
Currently the hadoop command packs and submits the job and then waits for the
job to complete.
This causes the regenerate-files step to introduce needless delays.
Is there a way to make the hadoop jar command return once the job is packaged
and submitted?
I obviously cannot just background the hadoop calls, because that would start
regenerating the files while previous jobs are still being packaged.
I thought about using different directories per job to generate those files in,
but that would needlessly consume disk space, which is not good.

I've been looking at 
http://hadoop.apache.org/mapreduce/docs/current/streaming.html and googling for 
answers, but couldn't find a solution.
I found 
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobClient.html
 but that doesn't seem to work with streaming.
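
(For reference, that JobClient API separates blocking from fire-and-forget
submission roughly as below; a sketch against the 0.20 mapred API. As said,
building the equivalent JobConf for a streaming job is the part that isn't
obvious.)

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SubmitOnly {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf();   // input/output paths, mapper, reducer etc. set up as usual
    conf.setJobName(args[0]);

    // Blocking: packages, submits, then polls until the job finishes.
    // JobClient.runJob(conf);

    // Non-blocking: returns as soon as the job has been handed to the JobTracker.
    JobClient client = new JobClient(conf);
    RunningJob job = client.submitJob(conf);
    System.out.println("submitted " + job.getID());
  }
}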

thanks,
Dieter


Re: can a `hadoop -jar streaming.jar` command return when a job is packaged and submitted?

2011-05-06 Thread Dieter Plaetinck
That will cause 200 regenerate-files processes to run on the same
files at the same time.  Not good...

Dieter

On Fri, 6 May 2011 07:49:45 -0700 (PDT)
Bharath Mundlapudi bharathw...@yahoo.com wrote:

 how about this?
 
 for i in $(seq 1 200); do
     exec_stream_job.sh $dir $i &
 done
 
 
 exec_stream_job.sh
 
 
  regenerate-files $dir $i
  hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
          -D mapred.job.name=$i \
          -file $dir \
          -mapper ... -reducer ... -input $i-input -output $i-output
 
 
 
 
 
 From: Dieter Plaetinck dieter.plaeti...@intec.ugent.be
 To: common-user@hadoop.apache.org
 Sent: Friday, May 6, 2011 7:09 AM
 Subject: Re: can a `hadoop -jar streaming.jar` command return when a
 job is packaged and submitted?
 
 Hi,
 I have a script something like this (simplified):
 
  for i in $(seq 1 200); do
     regenerate-files $dir $i
     hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
          -D mapred.job.name=$i \
          -file $dir \
          -mapper ... -reducer ... -input $i-input -output $i-output
  done
 
 so I want to launch 200 hadoop jobs, each which needs files from
 $dir, more specifically some files in $dir are being generated to be
 used with job $i (and only that job) the problem is, generating those
 files takes some time. currently the hadoop command packs and submits
 the job and then waits for the job to complete. This causes the
 regenerate-files program to cause needless delays. Is there a way to
 make the hadoop jar command return when the job is packaged and
 submitted? I obviously cannot just background the hadoop calls,
 because that would start regenerating the files while previous jobs
 are still being packaged. I thought about using different directories
 per job to generate those files in, but that would needlessly consume
 disk space, which is not good.
 
 I've been looking at
 http://hadoop.apache.org/mapreduce/docs/current/streaming.html and
 googling for answers, but couldn't find a solution. I found
 http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobClient.html
 but that doesn't seem to work with streaming.
 
 thanks,
 Dieter



Re: sorting reducer input numerically in hadoop streaming

2011-04-13 Thread Dieter Plaetinck
Thank you Harsh,
that works fine!
(looks like the page I was looking at was the same, but for an older
version of hadoop)

Dieter

On Fri, 1 Apr 2011 13:07:38 +0530
Harsh J qwertyman...@gmail.com wrote:

 You will need to supply your own Key-comparator Java class by setting
 an appropriate parameter for it, as noted in:
 http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#A+Useful+Comparator+Class
 [The -D mapred.output.key.comparator.class=xyz part]
 
 On Thu, Mar 31, 2011 at 6:26 PM, Dieter Plaetinck
 dieter.plaeti...@intec.ugent.be wrote:
  couldn't find how I should do that.
 



Re: INFO org.apache.hadoop.ipc.Server: Error register getProtocolVersion and other errors

2011-04-11 Thread Dieter Plaetinck
Anyone? Anyone at all?
I figured out the issue with the jobtracker, but I still have the
errors:

* Error register getProtocolVersion
* File (..) could only be replicated to 0 nodes, instead of 1

as explained in my first mail.

The 2nd error can appear without _any_ errors in _any_ of the datanode
or tasktracker logs.  And the NameNode webinterface even tells me all
nodes are live, none are dead.

This is effectively holding me back from using the cluster, 
I'm completely in the dark, I find this very frustrating. :(

Thank you,

Dieter

On Mon, 4 Apr 2011 18:45:49 +0200
Dieter Plaetinck dieter.plaeti...@intec.ugent.be wrote:

 Hi,
 I have a cluster of 4 debian squeeze machines, on all of them I
 installed the same version ( hadoop-0.20.2.tar.gz )
 
 I have : n-0 namenode, n-1: jobtracker and n-{0,1,2,3} slaves
 but you can see all my configs in more detail @
 http://pastie.org/1754875
 
 the machines have 3GiB RAM.
 I don't think disk space is an issue anywhere, but FWIW:
 /var/bomvl (which would house hdfs stuff) is a filesystem of 140GB on
 all nodes. they all have 1.8GB free on /
 
 
 so, I invoke the cluster on n-0:
 
 $ bin/hadoop namenode -format
 11/04/04 18:27:33 INFO namenode.NameNode: STARTUP_MSG: 
 /
 STARTUP_MSG: Starting NameNode
 STARTUP_MSG:   host = n-0.map-reduce.search.wall2.test/10.2.1.90
 STARTUP_MSG:   args = [-format]
 STARTUP_MSG:   version = 0.20.2
 STARTUP_MSG:   build =
 https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20
 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
 /
 11/04/04 18:27:33 INFO namenode.FSNamesystem:
 fsOwner=dplaetin,Search,root 11/04/04 18:27:33 INFO
 namenode.FSNamesystem: supergroup=supergroup 11/04/04 18:27:33 INFO
 namenode.FSNamesystem: isPermissionEnabled=true 11/04/04 18:27:33
 INFO common.Storage: Image file of size 98 saved in 0 seconds.
 11/04/04 18:27:34 INFO common.Storage: Storage
 directory /var/bomvl/name has been successfully formatted. 11/04/04
 18:27:34 INFO namenode.NameNode:
 SHUTDOWN_MSG: /
 SHUTDOWN_MSG: Shutting down NameNode at
 n-0.map-reduce.search.wall2.test/10.2.1.90
 /
 
 $ bin/start-all.sh
 starting namenode, logging
 to 
 /var/log/hadoop/hadoop-dplaetin-namenode-n-0.map-reduce.search.wall2.test.out
 n-0: starting datanode, logging
 to 
 /var/log/hadoop/hadoop-dplaetin-datanode-n-0.map-reduce.search.wall2.test.out
 n-2: starting datanode, logging
 to 
 /var/log/hadoop/hadoop-dplaetin-datanode-n-2.map-reduce.search.wall2.test.out
 n-3: starting datanode, logging
 to 
 /var/log/hadoop/hadoop-dplaetin-datanode-n-3.map-reduce.search.wall2.test.out
 n-1: starting datanode, logging
 to 
 /var/log/hadoop/hadoop-dplaetin-datanode-n-1.map-reduce.search.wall2.test.out
 localhost: starting secondarynamenode, logging
 to 
 /var/log/hadoop/hadoop-dplaetin-secondarynamenode-n-0.map-reduce.search.wall2.test.out
 starting jobtracker, logging
 to 
 /var/log/hadoop/hadoop-dplaetin-jobtracker-n-0.map-reduce.search.wall2.test.out
 n-2: starting tasktracker, logging
 to 
 /var/log/hadoop/hadoop-dplaetin-tasktracker-n-2.map-reduce.search.wall2.test.out
 n-3: starting tasktracker, logging
 to 
 /var/log/hadoop/hadoop-dplaetin-tasktracker-n-3.map-reduce.search.wall2.test.out
 n-0: starting tasktracker, logging
 to 
 /var/log/hadoop/hadoop-dplaetin-tasktracker-n-0.map-reduce.search.wall2.test.out
 n-1: starting tasktracker, logging
 to 
 /var/log/hadoop/hadoop-dplaetin-tasktracker-n-1.map-reduce.search.wall2.test.out
 
 
 However, in
 n-0:/var/log/hadoop/hadoop-dplaetin-namenode-n-0.map-reduce.search.wall2.test.log,
 I see this: 2011-04-04 18:27:34,760 INFO
 org.apache.hadoop.hdfs.server.namenode.NameNode:
 STARTUP_MSG: /
 STARTUP_MSG: Starting NameNode STARTUP_MSG:   host =
 n-0.map-reduce.search.wall2.test/10.2.1.90 STARTUP_MSG:   args = []
 STARTUP_MSG:   version = 0.20.2
 STARTUP_MSG:   build =
 https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20
 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
 /
 2011-04-04 18:27:34,846 INFO
 org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics
 with hostName=NameNode, port=9000 2011-04-04 18:27:34,851 INFO
 org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at:
 n-0-lan0/10.1.1.2:9000 2011-04-04 18:27:34,853 INFO
 org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics
 with processName=NameNode, sessionId=null 2011-04-04 18:27:34,854
 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics:
 Initializing NameNodeMeterics using context
 object:org.apache.hadoop.metrics.spi.NullContext 2011-04-04
 18:27:34,893 INFO

INFO org.apache.hadoop.ipc.Server: Error register getProtocolVersion and other errors

2011-04-04 Thread Dieter Plaetinck
Hi,
I have a cluster of 4 debian squeeze machines, on all of them I
installed the same version ( hadoop-0.20.2.tar.gz )

I have : n-0 namenode, n-1: jobtracker and n-{0,1,2,3} slaves
but you can see all my configs in more detail @
http://pastie.org/1754875

the machines have 3GiB RAM.
I don't think disk space is an issue anywhere, but FWIW:
/var/bomvl (which would house hdfs stuff) is a filesystem of 140GB on
all nodes. they all have 1.8GB free on /


so, I invoke the cluster on n-0:

$ bin/hadoop namenode -format
11/04/04 18:27:33 INFO namenode.NameNode: STARTUP_MSG: 
/
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = n-0.map-reduce.search.wall2.test/10.2.1.90
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = 
https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; 
compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
/
11/04/04 18:27:33 INFO namenode.FSNamesystem: fsOwner=dplaetin,Search,root
11/04/04 18:27:33 INFO namenode.FSNamesystem: supergroup=supergroup
11/04/04 18:27:33 INFO namenode.FSNamesystem: isPermissionEnabled=true
11/04/04 18:27:33 INFO common.Storage: Image file of size 98 saved in 0 seconds.
11/04/04 18:27:34 INFO common.Storage: Storage directory /var/bomvl/name has 
been successfully formatted.
11/04/04 18:27:34 INFO namenode.NameNode: SHUTDOWN_MSG: 
/
SHUTDOWN_MSG: Shutting down NameNode at 
n-0.map-reduce.search.wall2.test/10.2.1.90
/

$ bin/start-all.sh
starting namenode, logging to 
/var/log/hadoop/hadoop-dplaetin-namenode-n-0.map-reduce.search.wall2.test.out
n-0: starting datanode, logging to 
/var/log/hadoop/hadoop-dplaetin-datanode-n-0.map-reduce.search.wall2.test.out
n-2: starting datanode, logging to 
/var/log/hadoop/hadoop-dplaetin-datanode-n-2.map-reduce.search.wall2.test.out
n-3: starting datanode, logging to 
/var/log/hadoop/hadoop-dplaetin-datanode-n-3.map-reduce.search.wall2.test.out
n-1: starting datanode, logging to 
/var/log/hadoop/hadoop-dplaetin-datanode-n-1.map-reduce.search.wall2.test.out
localhost: starting secondarynamenode, logging to 
/var/log/hadoop/hadoop-dplaetin-secondarynamenode-n-0.map-reduce.search.wall2.test.out
starting jobtracker, logging to 
/var/log/hadoop/hadoop-dplaetin-jobtracker-n-0.map-reduce.search.wall2.test.out
n-2: starting tasktracker, logging to 
/var/log/hadoop/hadoop-dplaetin-tasktracker-n-2.map-reduce.search.wall2.test.out
n-3: starting tasktracker, logging to 
/var/log/hadoop/hadoop-dplaetin-tasktracker-n-3.map-reduce.search.wall2.test.out
n-0: starting tasktracker, logging to 
/var/log/hadoop/hadoop-dplaetin-tasktracker-n-0.map-reduce.search.wall2.test.out
n-1: starting tasktracker, logging to 
/var/log/hadoop/hadoop-dplaetin-tasktracker-n-1.map-reduce.search.wall2.test.out


However, in 
n-0:/var/log/hadoop/hadoop-dplaetin-namenode-n-0.map-reduce.search.wall2.test.log,
I see this:
2011-04-04 18:27:34,760 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
STARTUP_MSG: 
/
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = n-0.map-reduce.search.wall2.test/10.2.1.90
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = 
https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; 
compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
/
2011-04-04 18:27:34,846 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: 
Initializing RPC Metrics with hostName=NameNode, port=9000
2011-04-04 18:27:34,851 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
Namenode up at: n-0-lan0/10.1.1.2:9000
2011-04-04 18:27:34,853 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
Initializing JVM Metrics with processName=NameNode, sessionId=null
2011-04-04 18:27:34,854 INFO 
org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing 
NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
2011-04-04 18:27:34,893 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 
fsOwner=dplaetin,Search,root
2011-04-04 18:27:34,893 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2011-04-04 18:27:34,893 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
2011-04-04 18:27:34,899 INFO 
org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: 
Initializing FSNamesystemMetrics using context 
object:org.apache.hadoop.metrics.spi.NullContext
2011-04-04 18:27:34,900 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered 
FSNamesystemStatusMBean
2011-04-04 18:27:34,927 INFO org.apache.hadoop.hdfs.server.common.Storage: 
Number of files = 1
2011-04-04 18:27:34,931 INFO 

Re: # of keys per reducer invocation (streaming api)

2011-03-31 Thread Dieter Plaetinck
On Tue, 29 Mar 2011 23:17:13 +0530
Harsh J qwertyman...@gmail.com wrote:

 Hello,
 
 On Tue, Mar 29, 2011 at 8:25 PM, Dieter Plaetinck
 dieter.plaeti...@intec.ugent.be wrote:
  Hi, I'm using the streaming API and I notice my reducer gets - in
  the same invocation - a bunch of different keys, and I wonder why.
  I would expect to get one key per reducer run, as with the normal
  hadoop.
 
  Is this to limit the amount of spawned processes, assuming creating
  and destroying processes is usually expensive compared to the
  amount of work they'll need to do (not much, if you have many keys
  with each a handful of values)?
 
  OTOH if you have a high number of values over a small number of
  keys, I would rather stick to one-key-per-reducer-invocation, then
  I don't need to worry about supporting (and allocating memory for)
  multiple input keys.  Is there a config setting to enable such
  behavior?
 
  Maybe I'm missing something, but this seems like a big difference in
  comparison to the default way of working, and should maybe be added
  to the FAQ at
  http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Frequently+Asked+Questions
 
  thanks,
  Dieter
 
 
 I think it would make more sense to think of streaming programs as
 complete map/reduce 'tasks', instead of trying to apply the Map/Reduce
 functional concept. Both of the programs need to be written from the
  reading level onwards, where in map's case each line is a record input
 and in reduce's case it is one uniquely grouped key and all values
 associated to it. One would need to handle the reading-loop
 themselves.
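
  To make that reading-loop concrete: a streaming reducer is just an
  executable reading sorted key<TAB>value lines from stdin and detecting key
  boundaries itself. A minimal sketch, written in Java purely for illustration
  (any language works; counting values is just an example aggregation):

  import java.io.BufferedReader;
  import java.io.InputStreamReader;

  public class GroupingReducer {
    public static void main(String[] args) throws Exception {
      BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
      String currentKey = null;
      long count = 0;
      String line;
      while ((line = in.readLine()) != null) {
        String[] kv = line.split("\t", 2);   // streaming's default key/value separator is TAB
        if (currentKey != null && !currentKey.equals(kv[0])) {
          System.out.println(currentKey + "\t" + count);  // key changed: emit the previous group
          count = 0;
        }
        currentKey = kv[0];
        count++;
      }
      if (currentKey != null) {
        System.out.println(currentKey + "\t" + count);    // flush the last group
      }
    }
  }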
 
 Some non-Java libraries that provide abstractions atop the
 streaming/etc. layer allow for more fluent representations of the
 map() and reduce() functions, hiding away the other fine details (like
 the Java API). Dumbo[1] is such a library for Python Hadoop Map/Reduce
 programs, for example.
 
 A FAQ entry on this should do good too! You can file a ticket for an
 addition of this observation to the streaming docs' FAQ.
 
 [1] - https://github.com/klbostee/dumbo/wiki/Short-tutorial
 

Thanks,
this makes it a little clearer.
I made a ticket @ https://issues.apache.org/jira/browse/MAPREDUCE-2410

Dieter


hadoop streaming shebang line for python and mappers jumping to 100% completion right away

2011-03-31 Thread Dieter Plaetinck
Hi,
I use 0.20.2 on Debian 6.0 (squeeze) nodes.
I have 2 problems with my streaming jobs:
1) I start the job like so:
hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
-file /proj/Search/wall/experiment/ \
-mapper './nolog.sh mapper' \
-reducer './nolog.sh reducer' \
-input sim-input -output sim-output

nolog.sh is just a simple wrapper for my python program,
it calls build-models.py with --mapper or --reducer, depending on which 
argument it got,
and it removes any bogus logging output using grep.
it looks like this:

#!/bin/sh
python $(dirname $0)/build-models.py --$1 | egrep -v 'INFO|DEBUG|WARN'

build-models.py is a python 2 program containing all mapper/reducer/etc logic, 
it has the executable flag set for owner/group/other.
(I even added `chmod +x` on it in nolog.sh to be really sure)

The problems:
When I use this shebang for build-models.py: #!/usr/bin/python or 
#!/usr/bin/env python (I would expect the last to work for sure?),
and 
$(dirname $0)/build-models.py in nolog.sh
I get this error: 
/tmp/hadoop-dplaetin/mapred/local/taskTracker/jobcache/job_201103311017_0008/attempt_201103311017_0008_m_00_0/work/././nolog.sh:
9: 
/tmp/hadoop-dplaetin/mapred/local/taskTracker/jobcache/job_201103311017_0008/attempt_201103311017_0008_m_00_0/work/././build-models.py:
Permission denied
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess
failed with code 1


So, despite not understanding why it's needed (python is installed correctly, 
executable flags set, etc), I can solve this by using the invocation in 
nolog.sh as shown above (`python scriptname`).
Since, if you invoke a python program like that, you can just as well remove 
the shebang because it's not needed (I verified this manually).
However, when running it in hadoop it tries to execute the python file as a bash
file, and yields a bunch of "command not found" errors.
What is going on? Why can't I just execute the file and rely on the shebang? 
And if I invoke the file as argument to the python program, why is the shebang 
still needed?


2) the second problem is somewhat related: I notice my mappers jump to 100% 
completion right away - but they take about an hour to complete, so I see them 
running for an hour in 'RUNNING' with 100% completion, then they really finish.
this is probably an issue with the reading of stdin, as python uses
buffering by default (see
http://stackoverflow.com/questions/3670323/setting-smaller-buffer-size-for-sys-stdin
 )
In my code I iterate over stdin like this: `for line in sys.stdin:`, so I 
process line by line, but apparently python reads the entire stdin right away, 
my hdfs blocksize is 20KiB (which according to the thread above happens to be 
pretty much the size of the python buffer)

Now, why is this related? - Because I can invoke python in a different way to 
keep it from doing the buffering.
apparently using the -u flag should do the trick, or setting the environment 
variable PYTHONUNBUFFERED to a nonempty string.
However:
- putting `python -u` in nolog.sh doesn't do it, why?
- neither does putting `export PYTHONUNBUFFERED=true` in nolog.sh before the 
invocation, why?
- in build-models.py shebang:
  putting `/usr/bin/env python -u` or '/usr/bin/env 'python -u'` gives:
  /usr/bin/env: python -u: No such file or directory, why?
I did find a working variant, that is, I can use this shebang:
`#!/usr/bin/env PYTHONUNBUFFERED=true python2`, however since I use the same 
file for multiple things, this made i/o for a bunch of other things way too 
slow, so I tried solving this in the python code (as per the tip in the above 
link), but to no avail. (I know, my final question is a bit less related)

So I tried remapping sys.stdin (before iterating it) with these two attemptst:
( see http://docs.python.org/library/os.html#os.fdopen )
newin = os.fdopen(sys.stdin.fileno(), 'r', 100) # should make buffersize +- 
100bytes
newin = os.fdopen(sys.stdin.fileno(), 'r', 1) # should make python buffer line 
by line

however, neither of those worked..

Any help/input is welcome.
I'm usually pretty good at figuring out these kinds of invocation issues,
but this one blows my mind :/

Dieter


sorting reducer input numerically in hadoop streaming

2011-03-31 Thread Dieter Plaetinck
hi,
I use hadoop 0.20.2, more specifically hadoop-streaming, on Debian 6.0
(squeeze) nodes.

My question is: how do I make sure input keys being fed to the reducer
are sorted numerically rather than alphabetically?

example:
- standard behavior:
#1 some-value1
#10 some-value10
#100 some-value100
#2 some-value2
#3 some-value3

- what I want:
#1 some-value1
#2 some-value2
#3 some-value3
#10 some-value10
#100 some-value100

I found
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/KeyFieldBasedComparator.html,
which supposedly supports GNU sed-like numeric sorting,
there are also some examples of jobconf parameters at
http://hadoop.apache.org/common/docs/r0.15.2/streaming.html,
however that seems to be meant for key-value configuration flags,
whereas I somehow need to instruct streamer I want to use that specific
java class with that specific option for numeric sorting, and I
couldn't find how I should do that.

Thanks,
Dieter


# of keys per reducer invocation (streaming api)

2011-03-29 Thread Dieter Plaetinck
Hi, I'm using the streaming API and I notice my reducer gets - in the same
invocation - a bunch of different keys, and I wonder why.
I would expect to get one key per reducer run, as with the normal
hadoop.

Is this to limit the amount of spawned processes, assuming creating and
destroying processes is usually expensive compared to the amount of
work they'll need to do (not much, if you have many keys with each a
handful of values)?

OTOH if you have a high number of values over a small number of keys, I
would rather stick to one-key-per-reducer-invocation, then I don't need
to worry about supporting (and allocating memory for) multiple input
keys.  Is there a config setting to enable such behavior?

Maybe I'm missing something, but this seems like a big difference in
comparison to the default way of working, and should maybe be added to
the FAQ at
http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Frequently+Asked+Questions

thanks,
Dieter


Re: Installing Hadoop on Debian Squeeze

2011-03-21 Thread Dieter Plaetinck
On Thu, 17 Mar 2011 19:33:02 +0100
Thomas Koch tho...@koch.ro wrote:

 Currently my advise is to use the Debian packages from cloudera.

That's the problem, it appears there are none.
Like I said in my earlier mail, Debian is not in Cloudera's list of
supported distros, and they do not have a repository for Debian
packages. (I tried the ubuntu repository but that didn't work)

I now have installed it by just downloading and extracting the
tarball, it seems that's basically all that is needed.


Dieter


Installing Hadoop on Debian Squeeze

2011-03-15 Thread Dieter Plaetinck
Hi,
I see there are various posts claiming hadoop is available through official 
debian mirrors (for debian squeeze, i.e. stable):
* http://www.debian-news.net/2010/07/17/apache-hadoop-in-debian-squeeze/
* 
http://blog.isabel-drost.de/index.php/archives/213/apache-hadoop-in-debian-squeeze

However, it seems this is not (no longer?) the case:
http://packages.debian.org/search?keywords=hadoop - the packages are only in 
unstable?
(I can easily verify this, my debian squeeze box does not find the packages, 
even after updating)

What happened? I couldn't find any info on whether the packages were removed 
from squeeze.

I also tried using the Cloudera repository.
Note that the official installation document does not list instructions for 
Debian, only Ubuntu, Suse and RH.
( see https://docs.cloudera.com/display/DOC/CDH3+Installation)
In fact it seems Debian is not supported at all:
https://docs.cloudera.com/display/DOC/Before+You+Install+CDH3+on+a+Cluster

Just for the heck of it, I tried following the Ubuntu instructions, which 
failed as well.
(the cloudera repository does not have squeeze packages).
FWIW:

# lsb_release -c 
Codename:   squeeze
# echo 'deb http://archive.cloudera.com/debian squeeze-cdh3 contrib' >> /etc/apt/sources.list.d/cloudera.list
# echo 'deb-src http://archive.cloudera.com/debian squeeze-cdh3 contrib' >> /etc/apt/sources.list.d/cloudera.list
# curl -s http://archive.cloudera.com/debian/archive.key | apt-key add -
OK
# aptitude update
(...)
Hit http://ftp.belnet.be squeeze Release
Err http://archive.cloudera.com squeeze-cdh3/contrib Sources
  404  Not Found
Err http://archive.cloudera.com squeeze-cdh3/contrib i386 Packages
  404  Not Found
Get:6 http://ftp.belnet.be squeeze-updates Release [41.8 kB]
(...)
Fetched 150 kB in 5s (28.7 kB/s)

# aptitude search hadoop
#


So, what's the best way to install Hadoop on Debian Squeeze?

Thanks,
Dieter