Re: Optimized Hadoop
Great work, folks! Very interesting. PS: did you notice that if you google for hanborq or HDH it's very hard to find your website, hanborq.com? Dieter On Tue, 21 Feb 2012 02:17:31 +0800 Schubert Zhang zson...@gmail.com wrote: We just updated the slides for these improvements: http://www.slideshare.net/hanborq/hanborq-optimizations-on-hadoop-mapreduce-20120216a Updates: (1) modified some descriptions to make things clearer and more accurate. (2) added some benchmarks. On Sat, Feb 18, 2012 at 11:12 PM, Anty anty@gmail.com wrote: On Fri, Feb 17, 2012 at 3:27 AM, Todd Lipcon t...@cloudera.com wrote: Hey Schubert, Looking at the code on github, it looks like your rewritten shuffle is in fact just a backport of the shuffle from MR2. I didn't look closely Additionally, the rewritten shuffle in MR2 has some bugs which harm the overall performance, for which I have already filed a JIRA, with a patch available: MAPREDUCE-3685 https://issues.apache.org/jira/browse/MAPREDUCE-3685 - are there any distinguishing factors? Also, the OOB heartbeat and adaptive heartbeat code seems to be the same as what's in 1.0? -Todd On Thu, Feb 16, 2012 at 9:44 AM, Schubert Zhang zson...@gmail.com wrote: Here is the presentation describing our work: http://www.slideshare.net/hanborq/hanborq-optimizations-on-hadoop-mapreduce-20120216a You are welcome to give your advice. It's just a small step, and we will continue to make more improvements; thanks for your help. On Thu, Feb 16, 2012 at 11:01 PM, Anty anty@gmail.com wrote: Hi guys, We just delivered an optimized Hadoop; if you are interested, please refer to https://github.com/hanborq/hadoop -- Best Regards Anty Rao -- Todd Lipcon Software Engineer, Cloudera -- Best Regards Anty Rao
Re: HDFS Explained as Comics
Very clear. The comic format indeed works quite well. I never considered comics a serious (professional) way to get something explained efficiently, but this shows people should think twice before they start writing their next documentation. One question though: if a DN has a corrupted block, why does the NN only remove the bad DN from the block's list, and not the block from the DN list? (Also, does it really store the data in 2 separate tables? This looks to me like 2 different views of the same data?) Dieter On Thu, 1 Dec 2011 08:53:31 +0100 Alexander C.H. Lorenz wget.n...@googlemail.com wrote: Hi all, very cool comic! Thanks, Alex On Wed, Nov 30, 2011 at 11:58 PM, Abhishek Pratap Singh manu.i...@gmail.com wrote: Hi, This is indeed a good way to explain; most of the improvements have already been discussed. Waiting for the sequel of this comic. Regards, Abhishek On Wed, Nov 30, 2011 at 1:55 PM, maneesh varshney mvarsh...@gmail.com wrote: Hi Matthew, I agree with both you and Prashant. The strip needs to be modified to explain that these can be default values that can be optionally overridden (which I will fix in the next iteration). However, from the 'understanding concepts of HDFS' point of view, I still think that block size and replication factors are the real strengths of HDFS, and the learners must be exposed to them so that they get to see how HDFS is significantly different from conventional file systems. On a personal note: thanks for the first part of your message :) -Maneesh On Wed, Nov 30, 2011 at 1:36 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com wrote: Maneesh, Firstly, I love the comic :) Secondly, I am inclined to agree with Prashant on this latest point. While one code path could take us through the user defining command line overrides (e.g. hadoop fs -D blah -put foo bar) I think it might confuse a person new to Hadoop. The most common flow would be using admin-determined values from hdfs-site, and the only thing that would need to change is that the conversation happens between client / server and not user / client. Matt -Original Message- From: Prashant Kommireddi [mailto:prash1...@gmail.com] Sent: Wednesday, November 30, 2011 3:28 PM To: common-user@hadoop.apache.org Subject: Re: HDFS Explained as Comics Sure, it's just a case of how readers interpret it. 1. Client is required to specify block size and replication factor each time 2. Client does not need to worry about it since an admin has set the properties in default configuration files A client would not be allowed to override the default configs if they are set final (well, there are ways to go around that as well, as you suggest, by using create() :)) The information is great and helpful. Just want to make sure a beginner who wants to write a WordCount in MapReduce does not worry about specifying block size and replication factor in his code. Thanks, Prashant On Wed, Nov 30, 2011 at 1:18 PM, maneesh varshney mvarsh...@gmail.com wrote: Hi Prashant, Others may correct me if I am wrong here... The client (org.apache.hadoop.hdfs.DFSClient) has knowledge of the block size and replication factor. In the source code, I see the following in the DFSClient constructor: defaultBlockSize = conf.getLong("dfs.block.size", DEFAULT_BLOCK_SIZE); defaultReplication = (short) conf.getInt("dfs.replication", 3); My understanding is that the client considers the following chain for the values: 1. Manual values (the long-form constructor; when a user provides these values) 2.
Configuration file values (these are cluster-level defaults: dfs.block.size and dfs.replication) 3. Finally, the hardcoded values (DEFAULT_BLOCK_SIZE and 3) Moreover, in org.apache.hadoop.hdfs.protocol.ClientProtocol the API to create a file is void create(..., short replication, long blocksize); I presume this means that the client already has knowledge of these values and passes them to the NameNode when creating a new file. Hope that helps. thanks -Maneesh On Wed, Nov 30, 2011 at 1:04 PM, Prashant Kommireddi prash1...@gmail.com wrote: Thanks Maneesh. Quick question: does a client really need to know block size and replication factor - a lot of times the client has no control over these (set at cluster level) -Prashant Kommireddi On Wed, Nov 30, 2011 at 12:51 PM, Dejan Menges dejan.men...@gmail.com wrote: Hi
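[Editor's note] A minimal sketch (against the old 0.20 client API, with a placeholder path and values not taken from the thread) of the chain Maneesh describes: the client resolves block size and replication from its Configuration and can hand explicit overrides to the long-form create():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithExplicitBlockSize {
  public static void main(String[] args) throws Exception {
    // Loads core-site.xml / hdfs-site.xml found on the classpath.
    Configuration conf = new Configuration();

    // Cluster-level defaults, falling back to hardcoded values as the last resort.
    long blockSize = conf.getLong("dfs.block.size", 64 * 1024 * 1024);
    short replication = (short) conf.getInt("dfs.replication", 3);

    FileSystem fs = FileSystem.get(conf);

    // The long-form create() lets a caller override both values explicitly;
    // most clients never do this and simply inherit the configured defaults.
    FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"),
        true,                                      // overwrite
        conf.getInt("io.file.buffer.size", 4096),  // buffer size
        replication,
        blockSize);
    out.writeUTF("hello hdfs");
    out.close();
  }
}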
DIR 2012 (CFP)
Hello friends of Hadoop, I just want to inform you about the 12th edition of the Dutch-Belgian Information Retrieval conference, which will be organized in the lovely city of Ghent, Belgium on 23/24 February 2012. There's the usual CFP; see the website at http://dir2012.intec.ugent.be/ There's definitely a place for hadoop/map-reduce related items. The name says Dutch/Belgian because that's where most research institutions and companies that show up are located, but obviously everybody is welcome. :-) I hope to see you there! Dieter
Re: HDFS and Openstack - avoiding excessive redundancy
Or more generally: isn't using virtualized I/O counterproductive when dealing with Hadoop M/R? I would think that for running Hadoop M/R you'd want predictable and consistent I/O on each node, not to mention your bottlenecks are usually disk I/O (and maybe CPU), so using virtualisation makes things less performant and less predictable, so, inferior. Or am I missing something? Dieter On Sat, 12 Nov 2011 07:54:05 + Graeme Seaton li...@graemes.com wrote: One advantage to using Hadoop replication, though, is that it provides a greater pool of potential servers for M/R jobs to execute on. If you simply use Openstack replication it will appear to the JobTracker that a particular block only exists on a single server and should only be executed on that node. This may have an impact depending on your workload profile. Regards, Graeme On 12/11/11 07:24, Dejan Menges wrote: Replication factor for HDFS can easily be changed to 1 in hdfs-site.xml if you don't need its redundancy. Regards, Dejo Sent from my iPhone On 12. 11. 2011., at 03:58, Edmon Begoli ebeg...@gmail.com wrote: A question related to standing up cloud infrastructure for running Hadoop/HDFS. We are building up an infrastructure using Openstack, which has its own storage management redundancy. We are planning to use Openstack to instantiate Hadoop nodes (HDFS, M/R tasks, Hive, HBase) on demand. The problem is that HDFS by design creates three copies of the data, so there is 4x redundancy, which we would prefer to avoid. I am asking here if anyone has had a similar case and if anyone has had any helpful solution to recommend. Thank you in advance, Edmon
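[Editor's note] For reference, a minimal sketch of how Dejan's suggestion looks from client code (the hdfs-site.xml route simply sets dfs.replication=1 cluster-wide once); the path below is a placeholder, not something from this thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleReplica {
  public static void main(String[] args) throws Exception {
    // Per-client override of the cluster default replication factor.
    Configuration conf = new Configuration();
    conf.setInt("dfs.replication", 1);
    FileSystem fs = FileSystem.get(conf);

    // Files written earlier keep their old factor; it can be lowered per path.
    fs.setReplication(new Path("/data/already-loaded-file"), (short) 1);
  }
}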
Re: risks of using Hadoop
On Wed, 21 Sep 2011 11:21:01 +0100 Steve Loughran ste...@apache.org wrote: On 20/09/11 22:52, Michael Segel wrote: PS... There's this junction box in your machine room that has this very large on/off switch. If pulled down, it will cut power to your cluster and you will lose everything. Now would you consider this a risk? Sure. But is it something you should really lose sleep over? Do you understand that there are risks and there are improbable risks? We follow the @devops_borat Ops book and have a post-it-note on the switch saying not a light switch :D
Re: Binary content
On Wed, 31 Aug 2011 08:44:42 -0700 Mohit Anchlia mohitanch...@gmail.com wrote: Does map-reduce work well with binary content in the files? This binary content is basically some CAD files, and the map-reduce program needs to read these files using some proprietary tool, extract values, and do some processing. Wondering if there are others doing similar types of processing. Best practices etc. Yes, it works. You just need to select the right input format. Personally I store all my binary files in a SequenceFile (because my binary files are small). Dieter
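[Editor's note] A minimal sketch of the SequenceFile approach Dieter mentions, using the 0.20 API: each small binary file becomes one record, with the filename as the key and the raw bytes as the value (the output path is a placeholder). A mapper fed by SequenceFileInputFormat would then receive (Text, BytesWritable) pairs and could pass the bytes to the proprietary tool.

import java.io.File;
import java.io.FileInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackBinaryFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // One container file on HDFS; key = original filename, value = raw bytes.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/data/cad-files.seq"),
        Text.class, BytesWritable.class);
    try {
      for (String name : args) {          // local CAD files passed on the command line
        File f = new File(name);
        byte[] buf = new byte[(int) f.length()];
        FileInputStream in = new FileInputStream(f);
        try {
          IOUtils.readFully(in, buf, 0, buf.length);
        } finally {
          in.close();
        }
        writer.append(new Text(f.getName()), new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}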
Exception in thread main java.io.IOException: No FileSystem for scheme: file
Hi, I know this question has been asked before, but I could not find the right solution. Maybe because I use hadoop 0.20.2, some posts assumed older versions. My code (relevant chunk): import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.FileSystem; Configuration conf = new Configuration(); FileSystem fs = FileSystem.get(conf); the last line gives: Exception in thread main java.io.IOException: No FileSystem for scheme: file I launched this like so: java -cp /usr/local/hadoop/src/core/:/usr/local/hadoop/conf/ -jar myjar.jar AFAICT, this should make sure all the configuration can be found and it should be able to connect to the filesystem: dplaetin@n-0:~$ ls -alh /usr/local/hadoop/src/core/ /usr/local/hadoop/conf/ /usr/local/hadoop/conf/: total 64K drwxr-xr-x 2 dplaetin Search 4.0K Aug 26 17:21 . drwxr-xr-x 12 root root 4.0K Feb 19 2010 .. -rw-rw-r-- 1 root root 3.9K Feb 19 2010 capacity-scheduler.xml -rw-rw-r-- 1 root root535 Feb 19 2010 configuration.xsl -rw-r--r-- 1 dplaetin Search 459 Apr 29 15:06 core-site.xml -rw-r--r-- 1 dplaetin Search 2.3K Apr 11 14:23 hadoop-env.sh -rw-rw-r-- 1 root root 1.3K Feb 19 2010 hadoop-metrics.properties -rw-rw-r-- 1 root root 4.1K Feb 19 2010 hadoop-policy.xml -rw-r--r-- 1 dplaetin Search 490 Apr 11 10:18 hdfs-site.xml -rw-r--r-- 1 dplaetin Search 2.8K Apr 11 14:23 log4j.properties -rw-r--r-- 1 dplaetin Search 1.1K Aug 3 09:49 mapred-site.xml -rw-rw-r-- 1 root root 10 Feb 19 2010 masters -rw-r--r-- 1 dplaetin Search 95 Apr 4 17:17 slaves -rw-rw-r-- 1 root root 1.3K Feb 19 2010 ssl-client.xml.example -rw-rw-r-- 1 root root 1.2K Feb 19 2010 ssl-server.xml.example /usr/local/hadoop/src/core/: total 36K drwxr-xr-x 3 root root 4.0K Aug 24 10:40 . drwxr-xr-x 15 root root 4.0K Aug 24 10:40 .. -rw-rw-r-- 1 root root 14K Feb 19 2010 core-default.xml drwxr-xr-x 3 root root 4.0K Feb 19 2010 org -rw-rw-r-- 1 root root 7.9K Feb 19 2010 overview.html Specifically, you can see I have a core-default.xml and a core-site.xml, which should be all that's needed, according to the org.apache.hadoop.conf.Configuration documentation (http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/conf/Configuration.html) I read somewhere there should be a hadoop-default.xml; I thought this was deprecated but to be sure I created the empty file in /usr/local/hadoop/src/core/ , but the error remained the same. The cluster works fine, I've done tens of jobs on it, but as you can see, something fails when I try to interface to it directly. Thanks in advance for any help, Dieter
Re: Namenode Scalability
Hi, On Wed, 10 Aug 2011 13:26:18 -0500 Michel Segel michael_se...@hotmail.com wrote: This sounds more like a homework assignment than a real-world problem. Why? Just wondering. I guess people don't race cars against trains or have two trains traveling in different directions anymore... :-) huh?
Re: next gen map reduce
On Thu, 28 Jul 2011 06:13:01 -0700 Thomas Graves tgra...@yahoo-inc.com wrote: It's currently still on the MR-279 branch - http://svn.apache.org/viewvc/hadoop/common/branches/MR-279/. It is planned to be merged to trunk soon. Tom On 7/28/11 7:31 AM, real great.. greatness.hardn...@gmail.com wrote: In which Hadoop version is next gen introduced? Hi, what exactly is contained within this mysterious-sounding next-generation MRv2? What's it about? Dieter
executing dfs copyFromLocal, rm, ... synchronously? i.e. wait until done?
Hi, if I simplify my code, I basically do this:

hadoop dfs -rm -skipTrash $file
hadoop dfs -copyFromLocal - $local $file

(The removal is needed because I run a job but previous input/output may exist, so I need to delete it first, as -copyFromLocal does not support overwrite.) During the 2nd command I often get NotReplicatedYetException errors (see below for the complete stack trace). Is there a way to make commands not return until the file removal (and, by extension, addition) has completely replicated? I couldn't find it in the help for `hadoop dfs -copyFromLocal, -rm, -rmr, ...` etc. I found https://issues.apache.org/jira/browse/HADOOP-1595 which introduces a flag for synchronous operation, but it only affects setrep. Alternatively (a bit uglier but still acceptable): is there a command/script I can execute to check whether the latest operation on this file has been replicated yet? I've been looking at `hadoop fsck`, which seems to do what I want, at least for files that should exist, but for removed files I'm not so sure. And it's hard to manually test all possible edge cases and race conditions. Currently I'm running my script without the -skipTrash on the assumption that the operation is much faster and the race condition will be less likely. However, I'm not too fond of this approach as it could still break (copyFromLocal could break when the first file hasn't been fully moved to trash yet), and I don't have that much disk space; I'd rather free disk space as soon as I can, since I'm always sure I won't need the files again anyway. I guess yet another trick could be for me to move the file first to a temporary junkfile and start a `hadoop dfs -rm -skipTrash` on the junkfile, but that could cause 2 race conditions (both the copyFromLocal of the original file and the rm of the junkfile could break if the move is not yet fully replicated). thanks for any input, Dieter 11/06/20 14:15:40 INFO hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:/user/dplaetin/wip-dirty-CF0-CHNK100-EB1-FW1-FW_K0-FW_NA1-FW_NB0-M2-MP0-NFmerged_ner_mofis-NUMBEST10-PATRN_INCL-PR_A0-PR_Fpunctuation_v1-S_L0-S_P0-SR_Fstopwords_ranks.nl_v1-SQ_K10-SQ_R1-TFIDF1-input at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1257) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422) at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953) at org.apache.hadoop.ipc.Client.call(Client.java:740) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy0.addBlock(Unknown Source) at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at
$Proxy0.addBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2937) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2819) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)
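[Editor's note] For what it's worth, a minimal sketch of the same delete-then-upload sequence done through the Java FileSystem API instead of two separate shell invocations; the paths are placeholders, and whether this changes the race behaviour is not something the thread settles:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplaceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path local = new Path("/local/path/to/input");   // placeholder
    Path remote = new Path("/user/dplaetin/input");  // placeholder

    // Equivalent of `hadoop dfs -rm -skipTrash` followed by `-copyFromLocal`.
    if (fs.exists(remote)) {
      fs.delete(remote, false);  // false = not recursive; the API bypasses trash
    }
    fs.copyFromLocalFile(local, remote);
  }
}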
Re: Are hadoop fs commands serial or parallel
On Fri, 20 May 2011 10:11:13 -0500 Brian Bockelman bbock...@cse.unl.edu wrote: On May 20, 2011, at 6:10 AM, Dieter Plaetinck wrote: What do you mean clunky? IMHO this is a quite elegant, simple, working solution. Try giving it to a user; watch them feed it a list of 10,000 files; watch the machine swap to death and the disks uselessly thrash. Sure this spawns multiple processes, but it beats any api-overcomplications, imho. Simple doesn't imply scalable, unfortunately. Brian True, I assumed that if anyone wants this, he knows what he's doing (i.e. the files could be small and already in the Linux block cache). Because why would anyone read files in parallel if that causes disk seeks all over the place? Ideally, you should tune for 1 sequential read per disk at a time. In that respect, I definitely agree that some clever logic in userspace to optimize disk reads (across a bunch of different possible hardware setups) would be beneficial. Dieter
Re: Are hadoop fs commands serial or parallel
What do you mean clunky? IMHO this is a quite elegant, simple, working solution. Sure this spawns multiple processes, but it beats any api-overcomplications, imho. Dieter On Wed, 18 May 2011 11:39:36 -0500 Patrick Angeles patr...@cloudera.com wrote: kinda clunky but you could do this via shell: for $FILE in $LIST_OF_FILES ; do hadoop fs -copyFromLocal $FILE $DEST_PATH done If doing this via the Java API, then, yes you will have to use multiple threads. On Wed, May 18, 2011 at 1:04 AM, Mapred Learn mapred.le...@gmail.comwrote: Thanks harsh ! That means basically both APIs as well as hadoop client commands allow only serial writes. I was wondering what could be other ways to write data in parallel to HDFS other than using multiple parallel threads. Thanks, JJ Sent from my iPhone On May 17, 2011, at 10:59 PM, Harsh J ha...@cloudera.com wrote: Hello, Adding to Joey's response, copyFromLocal's current implementation is serial given a list of files. On Wed, May 18, 2011 at 9:57 AM, Mapred Learn mapred.le...@gmail.com wrote: Thanks Joey ! I will try to find out abt copyFromLocal. Looks like Hadoop Apis write serially as you pointed out. Thanks, -JJ On May 17, 2011, at 8:32 PM, Joey Echeverria j...@cloudera.com wrote: The sequence file writer definitely does it serially as you can only ever write to the end of a file in Hadoop. Doing copyFromLocal could write multiple files in parallel (I'm not sure if it does or not), but a single file would be written serially. -Joey On Tue, May 17, 2011 at 5:44 PM, Mapred Learn mapred.le...@gmail.com wrote: Hi, My question is when I run a command from hdfs client, for eg. hadoop fs -copyFromLocal or create a sequence file writer in java code and append key/values to it through Hadoop APIs, does it internally transfer/write data to HDFS serially or in parallel ? Thanks in advance, -JJ -- Joseph Echeverria Cloudera, Inc. 443.305.9434 -- Harsh J
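[Editor's note] A minimal sketch of the "multiple threads" route mentioned above for the Java API, with a bounded pool so it doesn't degenerate into the 10,000-at-once scenario Brian warns about in the other branch of this thread; the destination path and pool size are arbitrary placeholders:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelUpload {
  public static void main(String[] args) throws Exception {
    final Configuration conf = new Configuration();
    final Path dest = new Path("/user/jj/incoming");  // placeholder destination dir

    // A small, fixed pool keeps the parallelism bounded instead of
    // one process (or thread) per file.
    ExecutorService pool = Executors.newFixedThreadPool(4);
    for (final String file : args) {                  // local files to upload
      pool.submit(new Runnable() {
        public void run() {
          try {
            FileSystem fs = FileSystem.get(conf);     // cached instance, shared across threads
            fs.copyFromLocalFile(new Path(file), dest);
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }
}

Each individual file is still written serially, as Joey points out; the pool only overlaps uploads of different files.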
Re: What exactly are the output_dir/part-00000 semantics (of a streaming job) ?
On Thu, 12 May 2011 09:49:23 -0700 (PDT) Aman aman_d...@hotmail.com wrote: The creation of the part-nnnnn files is atomic. When you run an MR job, these files are created in the directory output_dir/_temporary and moved to output_dir after each file is closed for writing. This move is atomic, hence as long as you don't try to read these files from the temporary directory (which I see you are not) you will be fine. Perfect! thanks. Dieter
What exactly are the output_dir/part-00000 semantics (of a streaming job) ?
Hi, I'm running some experiments using hadoop streaming. I always get an output_dir/part-00000 file at the end, but I wonder: when exactly will this filename show up? When it's completely written, or will it already show up while the mapreduce software is still writing to it? Is the write atomic? The reason I'm asking this: I have a script which submits +- 200 jobs to mapreduce, and I have another script collecting the part-00000 files of all jobs (not just once when all experiments are done; I frequently collect the results of all jobs finished thus far). For this, I just do (simplified code):

for i in $(seq 1 200); do
  if $(ssh $master bin/hadoop dfs -ls $i/output/part-00000); then
    ssh $master bin/hadoop dfs -cat $i/output/part-00000 > output_$i
  fi
done

and I wonder if this is prone to race conditions: is there any chance I will run this while $i/output/part-00000 is in the process of being written to, and hence I end up with incomplete output_$i files? If so, what's the proper way to check if the file is really stable? Fetching the jobtracker webpage and checking if job $i is finished? Dieter
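[Editor's note] Given Aman's answer above (part files are renamed into output_dir only once they are closed), a minimal sketch of the same collection loop through the FileSystem API instead of ssh + hadoop dfs; the job count and paths mirror the simplified script and are otherwise placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CollectFinishedOutputs {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    for (int i = 1; i <= 200; i++) {
      Path part = new Path(i + "/output/part-00000");
      // Because the output committer only renames part files into the output
      // directory once they are closed, "exists" here implies "complete".
      if (fs.exists(part)) {
        fs.copyToLocalFile(part, new Path("output_" + i));
      }
    }
  }
}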
can a `hadoop -jar streaming.jar` command return when a job is packaged and submitted?
Hi, I have a script something like this (simplified):

for i in $(seq 1 200); do
  regenerate-files $dir $i
  hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
    -D mapred.job.name=$i \
    -file $dir \
    -mapper ... -reducer ... -input $i-input -output $i-output
done

So I want to launch 200 hadoop jobs, each of which needs files from $dir; more specifically, some files in $dir are being generated to be used with job $i (and only that job). The problem is, generating those files takes some time. Currently the hadoop command packs and submits the job and then waits for the job to complete. This causes the regenerate-files program to suffer needless delays. Is there a way to make the hadoop jar command return when the job is packaged and submitted? I obviously cannot just background the hadoop calls, because that would start regenerating the files while previous jobs are still being packaged. I thought about using different directories per job to generate those files in, but that would needlessly consume disk space, which is not good. I've been looking at http://hadoop.apache.org/mapreduce/docs/current/streaming.html and googling for answers, but couldn't find a solution. I found http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobClient.html but that doesn't seem to work with streaming. thanks, Dieter
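[Editor's note] For a plain Java (non-streaming) job, the JobClient route mentioned in the question does exactly this: submitJob() returns as soon as the job is queued, unlike runJob(), which blocks until completion. A minimal sketch against the old 0.20 API; mapper/reducer setup is elided and the argument handling is only a placeholder:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SubmitWithoutWaiting {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitWithoutWaiting.class);
    conf.setJobName(args[0]);                        // e.g. the loop index $i
    FileInputFormat.setInputPaths(conf, new Path(args[0] + "-input"));
    FileOutputFormat.setOutputPath(conf, new Path(args[0] + "-output"));
    // mapper/reducer classes would be set here for a normal Java job

    JobClient client = new JobClient(conf);
    RunningJob job = client.submitJob(conf);         // returns once the job is submitted
    System.out.println("submitted " + job.getID());  // unlike JobClient.runJob(conf), which blocks
  }
}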
Re: can a `hadoop -jar streaming.jar` command return when a job is packaged and submitted?
that will cause 200 regenerate-files processes running on the same files, at the same time. not good.. Dieter On Fri, 6 May 2011 07:49:45 -0700 (PDT) Bharath Mundlapudi bharathw...@yahoo.com wrote: how about this? for i in $(seq 1 200); do exec_stream_job.sh $dir $i exec_stream_job.sh regenerate-files $dir $i hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \ -D mapred.job.name=$i \ -file $dir \ -mapper ... -reducer ... -input $i-input -output $i-output From: Dieter Plaetinck dieter.plaeti...@intec.ugent.be To: common-user@hadoop.apache.org Sent: Friday, May 6, 2011 7:09 AM Subject: Re: can a `hadoop -jar streaming.jar` command return when a job is packaged and submitted? Hi, I have a script something like this (simplified): for i in $(seq 1 200); do regenerate-files $dir $i hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \ -D mapred.job.name=$i \ -file $dir \ -mapper ... -reducer ... -input $i-input -output $i-output so I want to launch 200 hadoop jobs, each which needs files from $dir, more specifically some files in $dir are being generated to be used with job $i (and only that job) the problem is, generating those files takes some time. currently the hadoop command packs and submits the job and then waits for the job to complete. This causes the regenerate-files program to cause needless delays. Is there a way to make the hadoop jar command return when the job is packaged and submitted? I obviously cannot just background the hadoop calls, because that would start regenerating the files while previous jobs are still being packaged. I thought about using different directories per job to generate those files in, but that would needlessly consume disk space, which is not good. I've been looking at http://hadoop.apache.org/mapreduce/docs/current/streaming.html and googling for answers, but couldn't find a solution. I found http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/JobClient.html but that doesn't seem to work with streaming. thanks, Dieter
Re: sorting reducer input numerically in hadoop streaming
Thank you Harsh, that works fine! (looks like the page I was looking at was the same, but for an older version of hadoop) Dieter On Fri, 1 Apr 2011 13:07:38 +0530 Harsh J qwertyman...@gmail.com wrote: You will need to supply your own Key-comparator Java class by setting an appropriate parameter for it, as noted in: http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#A+Useful+Comparator+Class [The -D mapred.output.key.comparator.class=xyz part] On Thu, Mar 31, 2011 at 6:26 PM, Dieter Plaetinck dieter.plaeti...@intec.ugent.be wrote: couldn't find how I should do that.
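[Editor's note] For reference, the same comparator wired up from a Java job rather than the streaming command line; a sketch of the two JobConf calls that the -D flags correspond to (assuming the key is the whole first field):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.KeyFieldBasedComparator;

public class NumericKeySort {
  // Equivalent of the streaming flags:
  //   -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator
  //   -D mapred.text.key.comparator.options=-n
  public static void configure(JobConf conf) {
    conf.setOutputKeyComparatorClass(KeyFieldBasedComparator.class);
    conf.setKeyFieldComparatorOptions("-n");  // numeric sort on the key, like GNU sort -n
  }
}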
Re: INFO org.apache.hadoop.ipc.Server: Error register getProtocolVersion and other errors
Anyone? Anyone at all? I figured out the issue with the jobtracker, but I still have the errors: * Error register getProtocolVersion * File (..) could only be replicated to 0 nodes, instead of 1 as explained in my first mail. The 2nd error can appear without _any_ errors in _any_ of the datanode or tasktracker logs. And the NameNode webinterface even tells me all nodes are live, none are dead. This is effectively holding me back from using the cluster, I'm completely in the dark, I find this very frustrating. :( Thank you, Dieter On Mon, 4 Apr 2011 18:45:49 +0200 Dieter Plaetinck dieter.plaeti...@intec.ugent.be wrote: Hi, I have a cluster of 4 debian squeeze machines, on all of them I installed the same version ( hadoop-0.20.2.tar.gz ) I have : n-0 namenode, n-1: jobtracker and n-{0,1,2,3} slaves but you can see all my configs in more detail @ http://pastie.org/1754875 the machines have 3GiB RAM. I don't think disk space is an issue anywhere, but FWIW: /var/bomvl (which would house hdfs stuff) is a filesystem of 140GB on all nodes. they all have 1.8GB free on / so, I invoke the cluster on n-0: $ bin/hadoop namenode -format 11/04/04 18:27:33 INFO namenode.NameNode: STARTUP_MSG: / STARTUP_MSG: Starting NameNode STARTUP_MSG: host = n-0.map-reduce.search.wall2.test/10.2.1.90 STARTUP_MSG: args = [-format] STARTUP_MSG: version = 0.20.2 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010 / 11/04/04 18:27:33 INFO namenode.FSNamesystem: fsOwner=dplaetin,Search,root 11/04/04 18:27:33 INFO namenode.FSNamesystem: supergroup=supergroup 11/04/04 18:27:33 INFO namenode.FSNamesystem: isPermissionEnabled=true 11/04/04 18:27:33 INFO common.Storage: Image file of size 98 saved in 0 seconds. 11/04/04 18:27:34 INFO common.Storage: Storage directory /var/bomvl/name has been successfully formatted. 
11/04/04 18:27:34 INFO namenode.NameNode: SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down NameNode at n-0.map-reduce.search.wall2.test/10.2.1.90 / $ bin/start-all.sh starting namenode, logging to /var/log/hadoop/hadoop-dplaetin-namenode-n-0.map-reduce.search.wall2.test.out n-0: starting datanode, logging to /var/log/hadoop/hadoop-dplaetin-datanode-n-0.map-reduce.search.wall2.test.out n-2: starting datanode, logging to /var/log/hadoop/hadoop-dplaetin-datanode-n-2.map-reduce.search.wall2.test.out n-3: starting datanode, logging to /var/log/hadoop/hadoop-dplaetin-datanode-n-3.map-reduce.search.wall2.test.out n-1: starting datanode, logging to /var/log/hadoop/hadoop-dplaetin-datanode-n-1.map-reduce.search.wall2.test.out localhost: starting secondarynamenode, logging to /var/log/hadoop/hadoop-dplaetin-secondarynamenode-n-0.map-reduce.search.wall2.test.out starting jobtracker, logging to /var/log/hadoop/hadoop-dplaetin-jobtracker-n-0.map-reduce.search.wall2.test.out n-2: starting tasktracker, logging to /var/log/hadoop/hadoop-dplaetin-tasktracker-n-2.map-reduce.search.wall2.test.out n-3: starting tasktracker, logging to /var/log/hadoop/hadoop-dplaetin-tasktracker-n-3.map-reduce.search.wall2.test.out n-0: starting tasktracker, logging to /var/log/hadoop/hadoop-dplaetin-tasktracker-n-0.map-reduce.search.wall2.test.out n-1: starting tasktracker, logging to /var/log/hadoop/hadoop-dplaetin-tasktracker-n-1.map-reduce.search.wall2.test.out However, in n-0:/var/log/hadoop/hadoop-dplaetin-namenode-n-0.map-reduce.search.wall2.test.log, I see this: 2011-04-04 18:27:34,760 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG: / STARTUP_MSG: Starting NameNode STARTUP_MSG: host = n-0.map-reduce.search.wall2.test/10.2.1.90 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.2 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010 / 2011-04-04 18:27:34,846 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000 2011-04-04 18:27:34,851 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: n-0-lan0/10.1.1.2:9000 2011-04-04 18:27:34,853 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null 2011-04-04 18:27:34,854 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext 2011-04-04 18:27:34,893 INFO
INFO org.apache.hadoop.ipc.Server: Error register getProtocolVersion and other errors
Hi, I have a cluster of 4 debian squeeze machines, on all of them I installed the same version ( hadoop-0.20.2.tar.gz ) I have : n-0 namenode, n-1: jobtracker and n-{0,1,2,3} slaves but you can see all my configs in more detail @ http://pastie.org/1754875 the machines have 3GiB RAM. I don't think disk space is an issue anywhere, but FWIW: /var/bomvl (which would house hdfs stuff) is a filesystem of 140GB on all nodes. they all have 1.8GB free on / so, I invoke the cluster on n-0: $ bin/hadoop namenode -format 11/04/04 18:27:33 INFO namenode.NameNode: STARTUP_MSG: / STARTUP_MSG: Starting NameNode STARTUP_MSG: host = n-0.map-reduce.search.wall2.test/10.2.1.90 STARTUP_MSG: args = [-format] STARTUP_MSG: version = 0.20.2 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010 / 11/04/04 18:27:33 INFO namenode.FSNamesystem: fsOwner=dplaetin,Search,root 11/04/04 18:27:33 INFO namenode.FSNamesystem: supergroup=supergroup 11/04/04 18:27:33 INFO namenode.FSNamesystem: isPermissionEnabled=true 11/04/04 18:27:33 INFO common.Storage: Image file of size 98 saved in 0 seconds. 11/04/04 18:27:34 INFO common.Storage: Storage directory /var/bomvl/name has been successfully formatted. 11/04/04 18:27:34 INFO namenode.NameNode: SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down NameNode at n-0.map-reduce.search.wall2.test/10.2.1.90 / $ bin/start-all.sh starting namenode, logging to /var/log/hadoop/hadoop-dplaetin-namenode-n-0.map-reduce.search.wall2.test.out n-0: starting datanode, logging to /var/log/hadoop/hadoop-dplaetin-datanode-n-0.map-reduce.search.wall2.test.out n-2: starting datanode, logging to /var/log/hadoop/hadoop-dplaetin-datanode-n-2.map-reduce.search.wall2.test.out n-3: starting datanode, logging to /var/log/hadoop/hadoop-dplaetin-datanode-n-3.map-reduce.search.wall2.test.out n-1: starting datanode, logging to /var/log/hadoop/hadoop-dplaetin-datanode-n-1.map-reduce.search.wall2.test.out localhost: starting secondarynamenode, logging to /var/log/hadoop/hadoop-dplaetin-secondarynamenode-n-0.map-reduce.search.wall2.test.out starting jobtracker, logging to /var/log/hadoop/hadoop-dplaetin-jobtracker-n-0.map-reduce.search.wall2.test.out n-2: starting tasktracker, logging to /var/log/hadoop/hadoop-dplaetin-tasktracker-n-2.map-reduce.search.wall2.test.out n-3: starting tasktracker, logging to /var/log/hadoop/hadoop-dplaetin-tasktracker-n-3.map-reduce.search.wall2.test.out n-0: starting tasktracker, logging to /var/log/hadoop/hadoop-dplaetin-tasktracker-n-0.map-reduce.search.wall2.test.out n-1: starting tasktracker, logging to /var/log/hadoop/hadoop-dplaetin-tasktracker-n-1.map-reduce.search.wall2.test.out However, in n-0:/var/log/hadoop/hadoop-dplaetin-namenode-n-0.map-reduce.search.wall2.test.log, I see this: 2011-04-04 18:27:34,760 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG: / STARTUP_MSG: Starting NameNode STARTUP_MSG: host = n-0.map-reduce.search.wall2.test/10.2.1.90 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.2 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010 / 2011-04-04 18:27:34,846 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000 2011-04-04 18:27:34,851 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: n-0-lan0/10.1.1.2:9000 2011-04-04 18:27:34,853 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
Initializing JVM Metrics with processName=NameNode, sessionId=null 2011-04-04 18:27:34,854 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext 2011-04-04 18:27:34,893 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=dplaetin,Search,root 2011-04-04 18:27:34,893 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup 2011-04-04 18:27:34,893 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true 2011-04-04 18:27:34,899 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext 2011-04-04 18:27:34,900 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean 2011-04-04 18:27:34,927 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 1 2011-04-04 18:27:34,931 INFO
Re: # of keys per reducer invocation (streaming api)
On Tue, 29 Mar 2011 23:17:13 +0530 Harsh J qwertyman...@gmail.com wrote: Hello, On Tue, Mar 29, 2011 at 8:25 PM, Dieter Plaetinck dieter.plaeti...@intec.ugent.be wrote: Hi, I'm using the streaming API and I notice my reducer gets - in the same invocation - a bunch of different keys, and I wonder why. I would expect to get one key per reducer run, as with the normal hadoop. Is this to limit the amount of spawned processes, assuming creating and destroying processes is usually expensive compared to the amount of work they'll need to do (not much, if you have many keys with each a handful of values)? OTOH if you have a high number of values over a small number of keys, I would rather stick to one-key-per-reducer-invocation, then I don't need to worry about supporting (and allocating memory for) multiple input keys. Is there a config setting to enable such behavior? Maybe I'm missing something, but this seems like a big difference in comparison to the default way of working, and should maybe be added to the FAQ at http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Frequently+Asked+Questions thanks, Dieter I think it would make more sense to think of streaming programs as complete map/reduce 'tasks', instead of trying to apply the Map/Reduce functional concept. Both of the programs need to be written from the reading level onwards, which in map's case each line is record input and in reduce's case it is one uniquely grouped key and all values associated to it. One would need to handle the reading-loop themselves. Some non-Java libraries that provide abstractions atop the streaming/etc. layer allow for more fluent representations of the map() and reduce() functions, hiding away the other fine details (like the Java API). Dumbo[1] is such a library for Python Hadoop Map/Reduce programs, for example. A FAQ entry on this should do good too! You can file a ticket for an addition of this observation to the streaming docs' FAQ. [1] - https://github.com/klbostee/dumbo/wiki/Short-tutorial Thanks, this makes it a little clearer. I made a ticket @ https://issues.apache.org/jira/browse/MAPREDUCE-2410 Dieter
hadoop streaming shebang line for python and mappers jumping to 100% completion right away
Hi, I use 0.20.2 on Debian 6.0 (squeeze) nodes. I have 2 problems with my streaming jobs: 1) I start the job like so: hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \ -file /proj/Search/wall/experiment/ \ -mapper './nolog.sh mapper' \ -reducer './nolog.sh reducer' \ -input sim-input -output sim-output nolog.sh is just a simple wrapper for my python program, it calls build-models.py with --mapper or --reducer, depending on which argument it got, and it removes any bogus logging output using grep. it looks like this: #!/bin/sh python $(dirname $0)/build-models.py --$1 | egrep -v 'INFO|DEBUG|WARN' build-models.py is a python 2 program containing all mapper/reducer/etc logic, it has the executable flag set for owner/group/other. (I even added `chmod +x` on it in nolog.sh to be really sure) The problems: When I use this shebang for build-models.py: #!/usr/bin/python or #!/usr/bin/env python (I would expect the last to work for sure?), and $(dirname $0)/build-models.py in nolog.sh I get this error: /tmp/hadoop-dplaetin/mapred/local/taskTracker/jobcache/job_201103311017_0008/attempt_201103311017_0008_m_00_0/work/././nolog.sh: 9: /tmp/hadoop-dplaetin/mapred/local/taskTracker/jobcache/job_201103311017_0008/attempt_201103311017_0008_m_00_0/work/././build-models.py: Permission denied java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1 So, despite not understanding why it's needed (python is installed correctly, executable flags set, etc), I can solve this by using the invocation in nolog.sh as shown above (`python scriptname`). Since, if you invoke a python program like that, you can just as well remove the shebang because it's not needed (I verified this manually). However when running it in hadoop it tries to execute the python file as a bash file, and yields a bunch of command not found errors. What is going on? Why can't I just execute the file and rely on the shebang? And if I invoke the file as argument to the python program, why is the shebang still needed? 2) the second problem is somewhat related: I notice my mappers jump to 100% completion right away - but they take about an hour to complete, so I see them running for an hour in 'RUNNING' with 100% completion, then they really finish. this is probably an issue with the reading of stdin, as python uses buffering by default (see http://stackoverflow.com/questions/3670323/setting-smaller-buffer-size-for-sys-stdin ) In my code I iterate over stdin like this: `for line in sys.stdin:`, so I process line by line, but apparently python reads the entire stdin right away, my hdfs blocksize is 20KiB (which according to the thread above happens to be pretty much the size of the python buffer) Now, why is this related? - Because I can invoke python in a different way to keep it from doing the buffering. apparently using the -u flag should do the trick, or setting the environment variable PYTHONUNBUFFERED to a nonempty string. However: - putting `python -u` in nolog.sh doesn't do it, why? - neither does putting `export PYTHONUNBUFFERED=true` in nolog.sh before the invocation, why? - in build-models.py shebang: putting `/usr/bin/env python -u` or '/usr/bin/env 'python -u'` gives: /usr/bin/env: python -u: No such file or directory, why? 
I did find a working variant, that is, I can use this shebang: `#!/usr/bin/env PYTHONUNBUFFERED=true python2`; however, since I use the same file for multiple things, this made I/O for a bunch of other things way too slow, so I tried solving this in the Python code (as per the tip in the above link), but to no avail. (I know, my final question is a bit less related.) So I tried remapping sys.stdin (before iterating over it) with these two attempts (see http://docs.python.org/library/os.html#os.fdopen ): newin = os.fdopen(sys.stdin.fileno(), 'r', 100) # should make the buffer size +- 100 bytes newin = os.fdopen(sys.stdin.fileno(), 'r', 1) # should make python buffer line by line However, neither of those worked. Any help/input is welcome. I'm usually pretty good at figuring out these kinds of invocation issues, but this one blows my mind :/ Dieter
sorting reducer input numerically in hadoop streaming
hi, I use hadoop 0.20.2, more specifically hadoop-streaming, on Debian 6.0 (squeeze) nodes. My question is: how do I make sure input keys being fed to the reducer are sorted numerically rather than alphabetically? Example:

- standard behavior:
#1 some-value1
#10 some-value10
#100 some-value100
#2 some-value2
#3 some-value3

- what I want:
#1 some-value1
#2 some-value2
#3 some-value3
#10 some-value10
#100 some-value100

I found http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/KeyFieldBasedComparator.html, which supposedly supports GNU sort-like numeric sorting. There are also some examples of jobconf parameters at http://hadoop.apache.org/common/docs/r0.15.2/streaming.html, however those seem to be meant for key-value configuration flags, whereas I somehow need to instruct the streaming jar that I want to use that specific Java class with that specific option for numeric sorting, and I couldn't find how I should do that. Thanks, Dieter
# of keys per reducer invocation (streaming api)
Hi, I'm using the streaming API and I notice my reducer gets - in the same invocation - a bunch of different keys, and I wonder why. I would expect to get one key per reducer run, as with the normal hadoop. Is this to limit the amount of spawned processes, assuming creating and destroying processes is usually expensive compared to the amount of work they'll need to do (not much, if you have many keys with each a handful of values)? OTOH if you have a high number of values over a small number of keys, I would rather stick to one-key-per-reducer-invocation, then I don't need to worry about supporting (and allocating memory for) multiple input keys. Is there a config setting to enable such behavior? Maybe I'm missing something, but this seems like a big difference in comparison to the default way of working, and should maybe be added to the FAQ at http://hadoop.apache.org/common/docs/r0.20.2/streaming.html#Frequently+Asked+Questions thanks, Dieter
Re: Installing Hadoop on Debian Squeeze
On Thu, 17 Mar 2011 19:33:02 +0100 Thomas Koch tho...@koch.ro wrote: Currently my advise is to use the Debian packages from cloudera. That's the problem, it appears there are none. Like I said in my earlier mail, Debian is not in Cloudera's list of supported distros, and they do not have a repository for Debian packages. (I tried the ubuntu repository but that didn't work) I now have installed it by just downloading and extracting the tarball, it seems that's basically all that is needed. Dieter
Installing Hadoop on Debian Squeeze
Hi, I see there are various posts claiming hadoop is available through official debian mirrors (for debian squeeze, i.e. stable): * http://www.debian-news.net/2010/07/17/apache-hadoop-in-debian-squeeze/ * http://blog.isabel-drost.de/index.php/archives/213/apache-hadoop-in-debian-squeeze However, it seems this is not (no longer?) the case: http://packages.debian.org/search?keywords=hadoop - the packages are only in unstable? (I can easily verify this; my debian squeeze box does not find the packages, even after updating.) What happened? I couldn't find any info on whether the packages were removed from squeeze. I also tried using the Cloudera repository. Note that the official installation document does not list instructions for Debian, only Ubuntu, Suse and RH. (see https://docs.cloudera.com/display/DOC/CDH3+Installation) In fact it seems Debian is not supported at all: https://docs.cloudera.com/display/DOC/Before+You+Install+CDH3+on+a+Cluster Just for the heck of it, I tried following the Ubuntu instructions, which failed as well (the Cloudera repository does not have squeeze packages). FWIW:

# lsb_release -c
Codename: squeeze
# echo 'deb http://archive.cloudera.com/debian squeeze-cdh3 contrib' >> /etc/apt/sources.list.d/cloudera.list
# echo 'deb-src http://archive.cloudera.com/debian squeeze-cdh3 contrib' >> /etc/apt/sources.list.d/cloudera.list
# curl -s http://archive.cloudera.com/debian/archive.key | apt-key add -
OK
# aptitude update
(...)
Hit http://ftp.belnet.be squeeze Release
Err http://archive.cloudera.com squeeze-cdh3/contrib Sources 404 Not Found
Err http://archive.cloudera.com squeeze-cdh3/contrib i386 Packages 404 Not Found
Get:6 http://ftp.belnet.be squeeze-updates Release [41.8 kB]
(...)
Fetched 150 kB in 5s (28.7 kB/s)
# aptitude search hadoop
#

So, what's the best way to install Hadoop on Debian Squeeze? Thanks, Dieter