Re: What happens in HDFS DataNode recovery?

2009-01-24 Thread jason hadoop
The blocks will be invalidated on the datanode returned to service. If you want to save your namenode and network a lot of work, wipe the HDFS block storage directory before returning the datanode to service. dfs.data.dir will be the directory; most likely the value is ${hadoop.tmp.dir}/dfs/data J

Re: Lingering TaskTracker$Child

2009-01-25 Thread jason hadoop
We had trouble like that with some jobs, when the child ran additional threads that were not set as daemons. These keep the Child JVM from exiting. JMX was the cause in our case, but we have seen our JNI jobs do it also. In the end we made a local mod that forced a System.exit in the finall
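A minimal sketch of the fix: any helper thread must be a daemon, or the task JVM hangs at exit. The Runnable body is a placeholder:

    public class DaemonThreadExample {
        public static void main(String[] args) {
            Thread worker = new Thread(new Runnable() {
                public void run() {
                    // stand-in for background work, e.g. a JMX or JNI helper
                    try {
                        while (true) {
                            Thread.sleep(1000);
                        }
                    } catch (InterruptedException e) {
                        // exit on interrupt
                    }
                }
            });
            // Without this the JVM cannot exit while the thread is alive;
            // it must be called before start().
            worker.setDaemon(true);
            worker.start();
        }
    }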

Re: HDFS - millions of files in one directory?

2009-01-25 Thread jason hadoop
With large numbers of files you run the risk of the Datanodes timing out when they are performing their block report and/or DU reports. Basically, if a *find* in the dfs.data.dir takes more than 10 minutes you will have catastrophic problems with your HDFS. At Attributor with 2 million blocks on a da

Re: HDFS - millions of files in one directory?

2009-01-26 Thread jason hadoop
We like compression if the data is readily compressible and large as it saves on IO time. On Mon, Jan 26, 2009 at 9:35 AM, Mark Kerzner wrote: > Doug, > SequenceFile looks like a perfect candidate to use in my project, but are > you saying that I better use uncompressed data if I am not interes

Re: HDFS - millions of files in one directory?

2009-01-26 Thread jason hadoop
Sequence files rock, and you can use the *bin/hadoop dfs -text FILENAME* command line tool to get a toString-level unpacking of the sequence file key/value pairs. If you provide your own key or value classes, you will need to implement a toString method to get some use out of this. Also, your cla
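A sketch of a custom value class with the toString that -text needs; the fields are illustrative:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    public class HitCount implements Writable {
        private String url = ""; // illustrative fields
        private long hits;

        public void write(DataOutput out) throws IOException {
            out.writeUTF(url);
            out.writeLong(hits);
        }

        public void readFields(DataInput in) throws IOException {
            url = in.readUTF();
            hits = in.readLong();
        }

        // bin/hadoop dfs -text prints toString() for each key and value,
        // so without this override you only get the default Object text.
        public String toString() {
            return url + "\t" + hits;
        }
    }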

Re: Mapred job parallelism

2009-01-26 Thread jason hadoop
I believe that the scheduler code in 0.19.0 has a framework for this, but I haven't dug into it in detail yet. http://hadoop.apache.org/core/docs/r0.19.0/capacity_scheduler.html From what I gather, you would set up 2 queues, each with guaranteed access to 1/2 of the cluster. Then you submit your jo

Re: How does Hadoop choose machines for Reducers?

2009-01-30 Thread jason hadoop
Hadoop just distributes to the available reduce execution slots. I don't believe it pays attention to what machine they are on. I believe the plan is to take data locality into account in the future (i.e., distribute tasks to machines that are considered more topologically close to their input split first, bu

Re: HDFS formatting

2009-02-01 Thread jason hadoop
Are you changing the definition of hadoop.tmp.dir in the hadoop-site.xml file? 1) The default location is in /tmp, and your tmp-watch cron job may be deleting the files. 2) If you change the location, or the location is removed, you will need to reformat. On Sun, Feb 1, 2009 at 3:39 PM, Mark Kerzn

Re: How can HDFS spread the data across the data nodes ?

2009-02-01 Thread jason hadoop
If the write is taking place on a datanode, by design, one replica will be written to that datanode. The other replicas will be written to different nodes. When you write on the namenode, it generally is not a datanode, and hadoop will pseudo-randomly allocate the replica blocks across all of your

Re: HDFS Namenode Heap Size woes

2009-02-01 Thread jason hadoop
If your datanodes are pausing and falling out of the cluster, you will get a large workload for the namenode of blocks to replicate, and when the paused datanode comes back, a large workload of blocks to delete. These lists are stored in memory on the namenode. The startup messages lead me to wonder

Re: How to add nodes to existing cluster?

2009-02-01 Thread jason hadoop
If you want them to also start automatically, and for the slaves.sh command to work as expected, add the names to the conf/slaves file as well. On Fri, Jan 30, 2009 at 7:15 PM, Amandeep Khurana wrote: > Thanks Lohit > > > On Fri, Jan 30, 2009 at 7:13 PM, lohit wrote: > > > Just starting DataNode a

Re: Setting up cluster

2009-02-01 Thread jason hadoop
It is possible that your slaves are unable to contact the master due to a network routing, firewall or hostname resolution error. The alternative is that your namenode is either failing to start, or running from a different configuration file and binding to a different port. On Fri, Jan 30, 2009

Re: How does Hadoop choose machines for Reducers?

2009-02-01 Thread jason hadoop
ce Hadoop to distribute reduce tasks evenly > across all the machines? > > > > On Jan 30, 2009, at 7:32 AM, jason hadoop wrote: > > Hadoop just distributes to the available reduce execution slots. I don't >> believe it pays attention to what machine they are on. >>

Re: How does Hadoop choose machines for Reducers?

2009-02-01 Thread jason hadoop
may have some task trackers handling more reduces. If mapred.tasktracker.reduce.tasks.maximum * Number_Of_Slaves == number of reduces configured, and mapred.tasktracker.reduce.tasks.maximum == 1, you will get 1 reduce per task tracker (almost always). On Sun, Feb 1, 2009 at 5:51 PM, jason hadoop wrote

Re: problem with completion notification from block movement

2009-02-01 Thread jason hadoop
The Datanodes use multiple threads with locking, and one of the assumptions is that the block report (once per hour by default) takes little time. The datanode will pause while the block report is running, and if it happens to take a while, weird things start to happen. On Fri, Jan 30, 2009 at 8:59

Re: HDFS Namenode Heap Size woes

2009-02-01 Thread jason hadoop
nks, > Sean > > On Sun, Feb 1, 2009 at 4:00 PM, jason hadoop > wrote: > > > If your datanodes are pausing and falling out of the cluster you will get > a > > large workload for the namenode of blocks to replicate and when the > paused > > datanode comes back,

Re: Setting up cluster

2009-02-01 Thread jason hadoop
t have a firewall so that shouldnt be a problem. I'll look into > the > other things once. How can I point the system to use a particular config > file? Arent those fixed to hadoop-default.xml and hadoop-site.xml? > > > > On Sun, Feb 1, 2009 at 5:49 PM, jason hadoop >

Re: HDFS Namenode Heap Size woes

2009-02-01 Thread jason hadoop
;> Brian >>> >>> >>> On Feb 1, 2009, at 6:11 PM, Sean Knapp wrote: >>> >>> Jason, >>> >>>> Thanks for the response. By falling out, do you mean a longer time since >>>> last contact (100s+), or fully timed out

Re: Hadoop's reduce tasks are freezes at 0%.

2009-02-02 Thread jason hadoop
A reduce stall at 0% implies that the map tasks are not outputting any records via the output collector. You need to go look at the task tracker and the task logs on all of your slave machines, to see if anything odd appears in the logs. On the tasktracker web interface detail screen for

Re: SequenceFiles, checkpoints, block size (Was: How to flush SequenceFile.Writer?)

2009-02-02 Thread jason hadoop
If you have to do a time-based solution, for now, simply close the file and stage it, then open a new file. Your reads will have to deal with the fact that the file is in multiple parts. Warning: Datanodes get pokey if they have large numbers of blocks, and the quickest way to do this is to create a lot
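A minimal sketch of the close-and-roll approach, assuming LongWritable/Text records and a hypothetical one-minute roll interval:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class RollingWriter {
        private static final long ROLL_MILLIS = 60000L; // hypothetical interval
        private final FileSystem fs;
        private final Configuration conf;
        private final Path dir;
        private SequenceFile.Writer writer;
        private long openedAt;
        private int part = 0;

        public RollingWriter(Configuration conf, Path dir) throws IOException {
            this.conf = conf;
            this.fs = FileSystem.get(conf);
            this.dir = dir;
            roll();
        }

        // Close the current part (readers can then see its full contents)
        // and open the next one.
        private void roll() throws IOException {
            if (writer != null) {
                writer.close();
            }
            writer = SequenceFile.createWriter(fs, conf,
                    new Path(dir, "part-" + part++),
                    LongWritable.class, Text.class);
            openedAt = System.currentTimeMillis();
        }

        public void append(LongWritable key, Text value) throws IOException {
            if (System.currentTimeMillis() - openedAt > ROLL_MILLIS) {
                roll();
            }
            writer.append(key, value);
        }

        public void close() throws IOException {
            writer.close();
        }
    }

Readers then glob the directory (part-0, part-1, ...) instead of opening a single file.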

Re: problem with completion notification from block movement

2009-02-02 Thread jason hadoop
There was a thought of running a continuous find on the dfs.data.dir to try to force the kernel to keep the inodes in memory, but I think they abandoned that strategy. On Mon, Feb 2, 2009 at 10:23 AM, Karl Kleinpaste wrote: > On Sun, 2009-02-01 at 17:58 -0800, jason hadoop wrote: >

Re: hadoop to ftp files into hdfs

2009-02-02 Thread jason hadoop
If you have a large number of ftp urls spread across many sites, simply make the file listing them your hadoop job input, and force the input split to be a size that gives you good distribution across your cluster. On Mon, Feb 2, 2009 at 3:23 PM, Steve Morin wrote: > Does any one have a good suggestio
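One way to get that distribution (an assumption on my part, not necessarily what was meant) is NLineInputFormat, which caps the number of urls per split; the path and line count are illustrative:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.NLineInputFormat;

    public class FtpFetchSetup {
        public static void configure(JobConf conf) {
            // Each split holds at most 10 lines (urls), so even a small list
            // fans out across many map tasks instead of just one.
            conf.setInputFormat(NLineInputFormat.class);
            conf.setInt("mapred.line.input.format.linespermap", 10);
            FileInputFormat.setInputPaths(conf, new Path("/urls/ftp-list.txt"));
        }
    }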

Re: My tasktrackers keep getting lost...

2009-02-02 Thread jason hadoop
When I was at Attributor we experienced periodic odd XFS hangs that would freeze up the Hadoop server processes, resulting in them going away. Sometimes XFS would deadlock all writes to the log file and the server would freeze trying to log a message. Can't even jstack the JVM. We never had any trac

Re: reading data from multiple output files into a single Map method.

2009-02-02 Thread jason hadoop
Do you really want to have a single task process all of the reduce outputs? If you want all of your output processed by a set of map tasks, you can set the output directory of your previous job to be the input directory of your next job, ensuring that the framework knows how to read the key value
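A sketch of the two-job pattern, staging the intermediate data as sequence files so the second job gets the key/value types back intact; class and path names are illustrative:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    public class TwoStageDriver {
        public static void main(String[] args) throws Exception {
            Path stage = new Path("/tmp/stage1"); // hypothetical intermediate dir

            JobConf first = new JobConf(TwoStageDriver.class);
            // ... mapper, reducer, and key/value classes for stage one ...
            first.setOutputFormat(SequenceFileOutputFormat.class);
            FileInputFormat.setInputPaths(first, new Path(args[0]));
            FileOutputFormat.setOutputPath(first, stage);
            JobClient.runJob(first); // blocks until stage one completes

            JobConf second = new JobConf(TwoStageDriver.class);
            // The sequence files carry the types, so the next set of map
            // tasks reads the previous output with no extra parsing.
            second.setInputFormat(SequenceFileInputFormat.class);
            FileInputFormat.setInputPaths(second, stage);
            FileOutputFormat.setOutputPath(second, new Path(args[1]));
            JobClient.runJob(second);
        }
    }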

Re: Control over max map/reduce tasks per job

2009-02-03 Thread jason hadoop
An alternative is to have 2 Tasktracker clusters, where the nodes are on the same machines. One cluster is for IO intensive jobs and has a low number of map/reduces per tracker, the other cluster is for cpu intensive jobs and has a high number of map/reduces per tracker. The alternative, simpler m

Re: Value-Only Reduce Output

2009-02-03 Thread jason hadoop
If you are using the standard TextOutputFormat, and the output collector is passed a null for the value, there will not be a trailing tab character added to the output line. output.collect(key, null); will give you the behavior you are looking for, if your configuration is as I expect. On Tue, F
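A sketch of a reducer using that trick, assuming Text keys and values:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class KeyOnlyReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // TextOutputFormat omits the separator when the value is null,
            // so each output line is the bare key.
            output.collect(key, null);
        }
    }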

Re: Value-Only Reduce Output

2009-02-03 Thread jason hadoop
Ooops, you are using streaming, and I am not familiar with it. As a terrible hack, you could set mapred.textoutputformat.separator to the empty string in your configuration. On Tue, Feb 3, 2009 at 9:26 PM, jason hadoop wrote: > If you are using the standard TextOutputFormat, and the output collec

Re: Value-Only Reduce Output

2009-02-04 Thread jason hadoop
t; map.output.key.field.separator parameters for this purpose, they > don't work either. When hadoop sees empty string, it takes default tab > character instead. > > Rasit > > 2009/2/4 jason hadoop > > > > Ooops, you are using streaming., and I am not familar. > &g

Re: Heap size error

2009-02-07 Thread jason hadoop
The default task memory allocation size is set in the hadoop-default.xml file for your configuration. The parameter is mapred.child.java.opts, and the value is generally -Xmx200m. You may alter this value in your JobConf object before you submit the job, and the individual tasks will
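A sketch of the per-job override; the driver class and heap value are illustrative:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class HeapDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(HeapDriver.class);
            // Overrides the hadoop-default.xml value for this job only;
            // each task JVM launched for the job gets the larger heap.
            conf.set("mapred.child.java.opts", "-Xmx512m");
            // ... mapper, reducer, and input/output paths go here ...
            JobClient.runJob(conf);
        }
    }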

Re: Cannot copy from local file system to DFS

2009-02-07 Thread jason hadoop
Please examine the web console for the namenode. The url for this should be http://namenodehost:50070/ This will tell you what datanodes are successfully connected to the namenode. If the number is 0, then either no datanodes are running, or they were unable to connect to the namenode at start, or wer

Re: Re: Re: Re: Regarding "Hadoop multi cluster" set-up

2009-02-07 Thread jason hadoop
On your master machine, use the netstat command to determine what ports and addresses the namenode process is listening on. On the datanode machines, examine the log files to verify that the datanode has attempted to connect to the namenode ip address on one of those ports, and was successful.

Re: java.io.IOException: Could not get block locations. Aborting...

2009-02-09 Thread jason hadoop
You will have to increase the per-user file descriptor limit. For most Linux machines the file /etc/security/limits.conf controls this on a per-user basis. You will need to log in to a fresh shell session after making the changes to see them. Any login shells started before the change and process sta

Re: java.io.IOException: Could not get block locations. Aborting...

2009-02-09 Thread jason hadoop
The other issue you may run into with many files in your HDFS is that you may end up with more than a few hundred thousand blocks on each of your datanodes. At present this can lead to instability, due to the way the periodic block reports to the namenode are handled. The more blocks per datanode, the

Re: Overriding mapred.tasktracker.map.tasks.maximum with -jobconf

2009-02-18 Thread jason hadoop
The .maximum values are only loaded by the Tasktrackers at server start time at present, and any changes you make will be ignored. 2009/2/18 S D > Thanks for your response Rasit. You may have missed a portion of my post. > > > On a different note, when I attempt to pass params via -D I get a us

Re: Overriding mapred.tasktracker.map.tasks.maximum with -jobconf

2009-02-18 Thread jason hadoop
I certainly hope it changes but I am unaware that it is in the todo queue at present. 2009/2/18 S D > Thanks Jason. That's useful information. Are you aware of plans to change > this so that the maximum values can be changed without restarting the > server? > > John > &

Re: Disabling Reporter Output?

2009-02-18 Thread jason hadoop
There is a moderate amount of setup and teardown in any hadoop job. It may be that your 10 seconds are primarily that. On Wed, Feb 18, 2009 at 11:29 AM, Philipp Dobrigkeit wrote: > I am currently trying Map/Reduce in Eclipse. The input comes from an hbase > table. The performance of my jobs is

Re: How to use JobConf.setKeyFieldPartitionerOptions() method

2009-02-22 Thread jason hadoop
For reasons that are not clear, in 0.19, the partitioner steps one character past the end of the field unless you are very explicit in your key specification. One would assume that -k2 would pick up the second token, even if it was the last field in the key, but -k2,2 is required. As near as I can te
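A sketch of the explicit, field-bounded specification, assuming the KeyFieldBasedPartitioner with tab-separated key fields:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner;

    public class PartitionerSetup {
        public static void configure(JobConf conf) {
            conf.setPartitionerClass(KeyFieldBasedPartitioner.class);
            // "-k2,2" begins and ends the spec on field 2; a bare "-k2"
            // runs to the end of the key and, in 0.19, steps one character
            // past the end of the final field.
            conf.setKeyFieldPartitionerOptions("-k2,2");
        }
    }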

Re: Different Hadoop Home on Slaves?

2009-02-24 Thread jason hadoop
If you manually start the daemons via hadoop-daemon.sh, the parent directory of the hadoop-daemon.sh script will be used as the root directory for the hadoop installation. I do believe, but do not know, that the namenode/jobtracker does not notice the actual file system location of the t

Re: why print this error when using MultipleOutputFormat?

2009-02-24 Thread jason hadoop
My first guess is that your application is running out of file descriptors, possibly because your MultipleOutputFormat instance is opening more output files than you expect. Opening lots of files in HDFS is generally a quick route to bad job performance, if not job failure. On Tue, Feb 24, 2009 at 6:

Re: re : How to use MapFile in C++ program

2009-02-24 Thread jason hadoop
You may wish to look at the documentation on Hadoop Pipes, which provides an interface for writing C++ map/reduce applications and a mechanism to pass key/value data to C++ from hadoop. The framework will read and write sequence files or mapfiles, and provide key/value pairs to the map function and r

Re: why print this error when using MultipleOutputFormat?

2009-02-25 Thread jason hadoop
the number of computers, can we solve this problem of > > running out of file descriptors? > > > > > > > > > > On Wed, Feb 25, 2009 at 11:07 AM, jason hadoop > > wrote: > > > My 1st guess is that your application is running out of file > > >

Re: MapReduce jobs with expensive initialization

2009-03-01 Thread jason hadoop
If you have to, you can reach through all of the class loaders and find the instance of your singleton class that has the data loaded. It is awkward, and I haven't done this in Java since the late '90s. It did work the last time I did it. On Sun, Mar 1, 2009 at 11:21 AM, Scott Carey wrote: > You

Re: What's the cause of this Exception

2009-03-01 Thread jason hadoop
The way you are specifying the section of your key to compare is reaching beyond the end of the last part of the key. Your key specification is not terminating explicitly on the last character of the final field of the key. If your key splits into N parts, and you are comparing on the Nth part,

Re: What's the cause of this Exception

2009-03-01 Thread jason hadoop
1"). When i limit the input size, > it works fine, i think this because i limit the total number of the > possible > "key1,key2,key3" compositions. but when i increate the input size, this > exception was thrown. > > 2009/3/2 jason hadoop > > > The way you ar

Re: master trying fetch data from slave using "localhost" hostname :)

2009-03-06 Thread jason hadoop
I see that when the host name of the node is also on the localhost line in /etc/hosts On Fri, Mar 6, 2009 at 9:38 AM, wrote: > > I see the same strange behavior on 2-node cluster with 0.18.3, 0.19.1 and > snv's branch-0.20.0... > 2 nodes: > "master1" running NameNode, JobTracker, DataNode, Task

Re: MapReduce jobs with expensive initialization

2009-03-07 Thread jason hadoop
You can have your item in a separate jar and pin the reference so that it becomes perm-gen resident, which will keep it loaded. Then you can search the class loader hierarchy for the reference. A quick scan through the Child.java main loop shows no magic with class loaders. I wrote some code to check this against

Re: DataNode gets 'stuck', ends up with two DataNode processes

2009-03-09 Thread jason hadoop
There were a couple of fork timing errors in JDK 1.5 that occasionally caused a subprocess fork to go bad; this could be the du/df being forked off by the datanode and dying. I can't find the references I had saved away at one point, from the Java forums, but perhaps this will get you started

Re: Reducer goes past 100% complete?

2009-03-09 Thread jason hadoop
speculative execution. On Mon, Mar 9, 2009 at 12:19 PM, Nathan Marz wrote: > I have the same problem with reducers going past 100% on some jobs. I've > seen reducers go as high as 120%. Would love to know what the issue is. > > > On Mar 9, 2009, at 8:45 AM, Doug Cook wrote: > > >> Hi folks, >>

Re: DataNode gets 'stuck', ends up with two DataNode processes

2009-03-09 Thread jason hadoop
be this bug: > > http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6671051 > > However, this is using Java 1.6.0_11, and that bug was marked as fixed in > 1.6.0_6 :( > > Any other ideas? > > Brian > > > On Mar 9, 2009, at 2:21 PM, jason hadoop wrote: > >

Re: Reducer goes past 100% complete?

2009-03-09 Thread jason hadoop
I noticed this getting much worse with block compression on the intermediate map outputs, in the Cloudera-patched 0.18.3. I just assumed it was speculative execution. I wonder if one of the patches in the Cloudera version has had an effect on this. On Mon, Mar 9, 2009 at 2:34 PM, Owen O'Malley wro

Re: Support for zipped input files

2009-03-09 Thread jason hadoop
Hadoop has support for S3; the compression support is handled at another level and should also work. On Mon, Mar 9, 2009 at 9:05 PM, Ken Weiner wrote: > I have a lot of large zipped (not gzipped) files sitting in an Amazon S3 > bucket that I want to process. What is the easiest way to process

Re: Extending ClusterMapReduceTestCase

2009-03-10 Thread jason hadoop
There are a couple of failures that happen in tests derived from ClusterMapReduceTestCase that are run outside of the hadoop unit test framework. The basic issue is that the unit test doesn't have the benefit of a runtime environment set up by the bin/hadoop script. The classpath is usually missin

Re: Extending ClusterMapReduceTestCase

2009-03-10 Thread jason hadoop
("javax.xml.parsers.SAXParserFactory","org.apache.xerces.jaxp.SAXParserFactoryImpl"); On Tue, Mar 10, 2009 at 2:28 PM, jason hadoop wrote: > There are a couple of failures that happen in tests derived from > ClusterMapReduceTestCase that are run outside of the hadoop unit test > framework. > &g

Re: Extending ClusterMapReduceTestCase

2009-03-11 Thread jason hadoop
d out the details, but it has been half a year since I dealt with this last. Unless you are forking to run your JUnit tests, Ant won't let you change the class path for your unit tests - much chaos will ensue. On Wed, Mar 11, 2009 at 4:39 AM, Steve Loughran wrote: > jason hadoop wr

Re: Extending ClusterMapReduceTestCase

2009-03-11 Thread jason hadoop
Finally remembered: we had Saxon 6.5.5 in the classpath, and the Jetty error was 09/03/11 08:23:20 WARN xml.XmlParser: EXCEPTION javax.xml.parsers.ParserConfigurationException: AElfred parser is non-validating On Wed, Mar 11, 2009 at 8:01 AM, jason hadoop wrote: > I am having trou

Re: tuning performance

2009-03-12 Thread jason hadoop
For a simple test, set the replication on your entire cluster to 6: hadoop dfs -setRep -R -w 6 / This will triple your disk usage and probably take a while, but then you are guaranteed that all data is local. You can also get a rough idea from the Job Counters; the 'Data-local map tasks' total field

Re: HTTP addressable files from HDFS?

2009-03-13 Thread jason hadoop
wget http://namenode:port/data/filename will return the file. The namenode will redirect the http request to a datanode that has at least some of the blocks in local storage to serve the actual request. The key piece of course is the /data prefix on the file name. port is the port that the w

Re: Temporary files for mapppers and reducers

2009-03-15 Thread jason hadoop
If you use the Java system property java.io.tmpdir, your reducer will use the ./tmp directory in the local working directory allocated by the framework for your task. If you have a specialty file system for transient data, such as a tmpfs, use that. On Sun, Mar 15, 2009 at 4:08 PM, Mark Kerzner
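A minimal sketch of picking up the task-local temp directory through the system property; the file prefix is illustrative:

    import java.io.File;
    import java.io.IOException;

    public class TaskScratch {
        public static File scratchFile() throws IOException {
            // Inside a task, the framework points java.io.tmpdir at ./tmp in
            // the task's working directory, so scratch data stays on local
            // disk and is removed with the task.
            File tmpDir = new File(System.getProperty("java.io.tmpdir"));
            return File.createTempFile("scratch-", ".dat", tmpDir);
        }
    }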

Re: where and how to get fuse-dfs?

2009-03-17 Thread jason hadoop
fuse_dfs is a contrib package that is part of the standard hadoop distribution tarball, but not compiled, and it does not compile without some special ant flags. There is a README in src/contrib/fuse-dfs/README, of the distribution, that walks you through the process of compiling and using fuse_dfs.

Re: Monitoring with Ganglia

2009-03-17 Thread jason hadoop
Make all of your hadoop-metrics properties use the standard IP address of your master node. Then add a straight UDP receive block to the gmond.conf of your master node. Then point your gmetad.conf at your master node. There are complete details in my forthcoming book, and with this in it, should be a

Re: Changing key/value separator in hadoop streaming

2009-03-21 Thread jason hadoop
For a job using TextOutputFormat, the final output key/value pairs will be separated by the string defined in the key mapred.textoutputformat.separator, which defaults to TAB. The string under stream.map.output.field.separator is used to split the lines read back from the mapper into key, value, f

Re: JNI and calling Hadoop jar files

2009-03-24 Thread jason hadoop
The exception reference to org.apache.hadoop.hdfs.DistributedFileSystem strongly implies that a hadoop-default.xml file, or at least a job.xml file, is present. Since hadoop-default.xml is bundled into the hadoop-0.X.Y-core.jar, the assumption is that the core jar is available. The class not fou

Re: Join Variation

2009-03-25 Thread jason hadoop
If the search file data set is large, the issue becomes ensuring that only the required portion of the search file is actually read, and that those reads are ordered in the search file's key order. If the data set is small, almost any of the common patterns will work. I haven't looked at Pig for a while,

Re: Join Variation

2009-03-26 Thread jason hadoop
don't see this as an issue yet, because I'm still puzzeled with how to > write > the job in plain MR. The join code is looking for an exact match in the > keys > and that is not what I need. Would a custom comperator which will look for > a > match in between the ranges, be

Re: Multiple k,v pairs from a single map - possible?

2009-03-27 Thread jason hadoop
You may write an arbitrary number of output.collect calls. You may even use MultipleOutputFormat to separate and stream the output.collect results to additional destinations. Caution must be taken to ensure that large numbers of files are not created when using MultipleOutputFormat. On Fri,
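A sketch of a map that emits several pairs per input record; a simple tokenizer stands in for the real logic:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class TokenMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {
        private final LongWritable one = new LongWritable(1);

        public void map(LongWritable key, Text value,
                OutputCollector<Text, LongWritable> output, Reporter reporter)
                throws IOException {
            // One input line may yield any number of collected pairs.
            for (String token : value.toString().split("\\s+")) {
                if (token.length() > 0) {
                    output.collect(new Text(token), one);
                }
            }
        }
    }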

Re: Join Variation

2009-04-01 Thread jason hadoop
Just for fun, chapter 9 in my book is a walk-through of solving this class of problem. On Thu, Mar 26, 2009 at 7:07 AM, jason hadoop wrote: > For the classic map/reduce job, you have 3 requirements. > > 1) a comparator that provide the keys in ip address order, such that all > ke

Re: Strange Reduce Bahavior

2009-04-02 Thread jason hadoop
1) When running in pseudo-distributed mode, only 2 values for the reduce count are accepted: 0 and 1. All other positive values are mapped to 1. 2) The single reduce task spawned has several steps, and each of these steps accounts for about 1/3 of its overall progress. The first third is collectin

Re: Join Variation

2009-04-02 Thread jason hadoop
It will probably be available in a week or so, as draft one isn't quite finished :) On Thu, Apr 2, 2009 at 1:45 AM, Stefan Podkowinski wrote: > .. and is not yet available as an alpha book chapter. Any chance uploading > it? > > On Thu, Apr 2, 2009 at 4:21 AM, jason hadoop > wro

Re: HDFS data block clarification

2009-04-02 Thread jason hadoop
HDFS only allocates as much physical disk space as is required for a block, up to the block size for the file (+ some header data). So if you write a 4k file, the single block for that file will be around 4k. If you write a 65M file, there will be two blocks, one of roughly 64M and one of roughly 1M

Re: joining two large files in hadoop

2009-04-04 Thread jason hadoop
This is discussed in chapter 8 of my book. In short, if both data sets are: in the same key order; partitioned with the same partitioner; and read with the same input format (necessary for this simple example only), a map-side join will present all the key/value pairs of e
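A sketch of configuring such a join with the mapred.join package, assuming both inputs are sequence files that satisfy the ordering and partitioning requirements above:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.join.CompositeInputFormat;

    public class MapSideJoinSetup {
        public static void configure(JobConf conf, Path left, Path right) {
            // Each map call then receives a key plus a TupleWritable holding
            // the matching values from both inputs; no reduce is required.
            conf.setInputFormat(CompositeInputFormat.class);
            conf.set("mapred.join.expr", CompositeInputFormat.compose(
                    "inner", SequenceFileInputFormat.class, left, right));
        }
    }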

Re: After a node goes down, I can't run jobs

2009-04-05 Thread jason hadoop
From the 0.19.0 FSNamesystem.java, it looks like the timeout by default is 2 * 3000 + 300000 = 306000 msec, or 5 minutes 6 seconds. If you have configured dfs.hosts.exclude in your hadoop-site.xml to point to an empty file that actually exists, you may add the name (as used in the slaves file) for

Re: joining two large files in hadoop

2009-04-05 Thread jason hadoop
Alpha chapters are available, and chapter 8 should be in the alphas as soon as draft one gets back from technical review. On Sun, Apr 5, 2009 at 7:43 AM, Christian Ulrik Søttrup wrote: > jason hadoop wrote: > >> This is discussed in chapter 8 of my book. >> >> &g

Re: hadoop 0.18.3 writing not flushing to hadoop server?

2009-04-06 Thread jason hadoop
The data is flushed when the file is closed, or when the amount written is an even multiple of the block size specified for the file, which by default is 64 MB. There is no other way to flush the data to HDFS at present. There is an attempt at this in 0.19.0, but it caused data corruption issues and wa

Re: Chaining Multiple Map reduce jobs.

2009-04-08 Thread jason hadoop
Chapter 8 of my book covers this in detail; the alpha chapter should be available at the Apress web site. Chain mapping rules! http://www.apress.com/book/view/1430219424 On Wed, Apr 8, 2009 at 3:30 PM, Nathan Marz wrote: > You can also try decreasing the replication factor for the intermediate >

Re: Multithreaded Reducer

2009-04-10 Thread jason hadoop
Hi Sagar! There is no reason for the body of your reduce method to do more than copy and queue the key/value set into an execution pool. The close method will need to wait until all of the items finish execution, and potentially keep the heartbeat up with the task tracker by periodically repor
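A sketch of that structure; the pool size and the per-key work method are placeholders:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class ThreadedReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        private final ExecutorService pool = Executors.newFixedThreadPool(4);
        private volatile Reporter reporter;

        public void reduce(Text key, Iterator<Text> values,
                final OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            this.reporter = reporter;
            // Copy key and values; the framework reuses the objects it hands in.
            final Text k = new Text(key);
            final List<Text> vals = new ArrayList<Text>();
            while (values.hasNext()) {
                vals.add(new Text(values.next()));
            }
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        Text result = work(vals); // stand-in for the real work
                        synchronized (output) {   // collectors are not thread-safe
                            output.collect(k, result);
                        }
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                }
            });
        }

        private static Text work(List<Text> vals) {
            return new Text(Integer.toString(vals.size()));
        }

        public void close() throws IOException {
            pool.shutdown();
            try {
                // Wait for queued work, reporting progress so the tasktracker
                // does not time the task out while the pool drains.
                while (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                    if (reporter != null) {
                        reporter.progress();
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }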

Re: Interesting Hadoop/FUSE-DFS access patterns

2009-04-13 Thread jason hadoop
The following very simple program will tell the VM to drop the pages being cached for a file. I tend to spin this in a for loop when making large tar files, or otherwise working with large files, and the system performance really smooths out. Since it uses open(path), it will churn through the inode

Re: Hadoop and Image analysis question

2009-04-13 Thread jason hadoop
If you pack your images into sequence files, as the value items, the cluster will automatically do a decent job of ensuring that the input splits made from the sequence files are local to the map task. We did this in production at a previous job and it worked very well for us. Might as well turn

Re: Extending ClusterMapReduceTestCase

2009-04-13 Thread jason hadoop
I have a nice variant of this in the ch7 examples section of my book, including a standalone wrapper around the virtual cluster for allowing multiple test instances to share the virtual cluster - and allowing an easier time poking around in the input and output datasets. It even works decently und

Re: Interesting Hadoop/FUSE-DFS access patterns

2009-04-14 Thread jason hadoop
be looking at the > performance both with and without the cache. > > Brian > > > On Apr 14, 2009, at 12:01 AM, jason hadoop wrote: > > The following very simple program will tell the VM to drop the pages being >> cached for a file. I tend to spin this in a for loop whe

Re: Extending ClusterMapReduceTestCase

2009-04-14 Thread jason hadoop
utely necessary for this test to work? > > Thanks again, > bc > > > > jason hadoop wrote: > > > > I have a nice variant of this in the ch7 examples section of my book, > > including a standalone wrapper around the virtual cluster for allowing > > multiple

Re: Extending ClusterMapReduceTestCase

2009-04-14 Thread jason hadoop
+ File.separator + "history"); looks like the hadoop.log.dir system property is not set; note: not an environment variable, not a configuration parameter, but a system property. Try a System.setProperty("hadoop.log.dir","/tmp"); in your code before you initialize the virtu

Re: Map-Reduce Slow Down

2009-04-15 Thread jason hadoop
Double check that there is no firewall in place. At one point a bunch of new machines were kickstarted and placed in a cluster, and they all failed with something similar. It turned out the kickstart script enabled the firewall with a rule that blocked ports in the 50k range. It took us a whi

Re: Complex workflows in Hadoop

2009-04-16 Thread jason hadoop
Chaining, described in chapter 8 of my book, provides this to a limited degree. Cascading, http://www.cascading.org/, also supports complex flows. I do not know how Cascading works under the covers. On Thu, Apr 16, 2009 at 8:23 AM, Shevek wrote: > On Tue, 2009-04-14 at 07:59 -0500, Pankil Doshi w

Re: Map-Reduce Slow Down

2009-04-16 Thread jason hadoop
you wrote or is it run when > the system turns on? > Mithila > > On Thu, Apr 16, 2009 at 1:06 AM, Mithila Nagendra > wrote: > > > Thanks Jason! Will check that out. > > Mithila > > > > > > On Thu, Apr 16, 2009 at 5:23 AM, jason hadoop >wrote: &g

Re: Map-Reduce Slow Down

2009-04-16 Thread jason hadoop
The firewall was run at system startup; I think there was an /etc/sysconfig/iptables file present which triggered the firewall. I don't currently have access to any CentOS 5 machines so I can't easily check. On Thu, Apr 16, 2009 at 6:54 PM, jason hadoop wrote: > The kicksta

Re: Map-Reduce Slow Down

2009-04-16 Thread jason hadoop
wall rules, if any for a linux machine. You should be able to use telnet to verify that you can connect from the remote machine. On Thu, Apr 16, 2009 at 9:18 PM, Mithila Nagendra wrote: > Thanks! I ll see what I can find out. > > On Fri, Apr 17, 2009 at 4:55 AM, jason hadoop >wrote:

Re: max value for a dataset

2009-04-18 Thread jason hadoop
The traditional approach would be a Mapper class that maintains a member variable holding the max value record; in the close method of your mapper you output a single record containing that value. The map method of course compares the current record against the max and stores current in
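A sketch of the map-side max pattern, assuming one numeric value per input line; the driver must also set NullWritable/LongWritable as the map output classes:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class MaxMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, NullWritable, LongWritable> {
        private long max = Long.MIN_VALUE;
        private boolean seen = false;
        private OutputCollector<NullWritable, LongWritable> out;

        public void map(LongWritable key, Text value,
                OutputCollector<NullWritable, LongWritable> output,
                Reporter reporter) throws IOException {
            out = output; // saved so close() can emit the final record
            long v = Long.parseLong(value.toString().trim());
            if (!seen || v > max) {
                max = v;
                seen = true;
            }
        }

        // One record per map task; a single reducer then takes the global max.
        public void close() throws IOException {
            if (seen) {
                out.collect(NullWritable.get(), new LongWritable(max));
            }
        }
    }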

Re: max value for a dataset

2009-04-20 Thread jason hadoop
f the work done in the reduce. On Mon, Apr 20, 2009 at 4:26 AM, Shevek wrote: > On Sat, 2009-04-18 at 09:57 -0700, jason hadoop wrote: > > The traditional approach would be a Mapper class that maintained a member > > variable that you kept the max value record, and in the close method

Re: max value for a dataset

2009-04-21 Thread jason hadoop
u're performing a > > SQL-like operation in MapReduce; not always the best way to approach this > > type of problem). > > > > Brian > > > > On Apr 20, 2009, at 8:25 PM, jason hadoop wrote: > > > >> The Hadoop Framework requires that a Map Phase b

Re: max value for a dataset

2009-04-21 Thread jason hadoop
ay to approach this > type of problem). > > Brian > > > On Apr 20, 2009, at 8:25 PM, jason hadoop wrote: > > The Hadoop Framework requires that a Map Phase be run before the Reduce >> Phase. >> By doing the initial 'reduce' in the map, a much smaller

Re: anyone knows why setting mapred.tasktracker.map.tasks.maximum not working?

2009-04-21 Thread jason hadoop
There must be only 2 input splits being produced for your job. Either you have 2 unsplittable files, or the input file(s) you have are not large enough compared to the block size to be split. Table 6-1 in chapter 6 gives a breakdown of all of the configuration parameters that affect split size in

Re: No route to host prevents from storing files to HDFS

2009-04-21 Thread jason hadoop
Most likely that machine is affected by some firewall somewhere that prevents traffic on port 50075. The 'no route to host' error is a strong indicator, particularly if the Datanode registered with the namenode. On Tue, Apr 21, 2009 at 4:18 PM, Philip Zeyliger wrote: > Very naively looking at the code, t

Re: getting DiskErrorException during map

2009-04-21 Thread jason hadoop
For reasons that I have never bothered to investigate, I have never had a cluster work when the hadoop.tmp.dir was not identical on all of the nodes. My solution has always been to just make a symbolic link so that hadoop.tmp.dir was identical, and on the machine in question it really ended up in the f

Re: max value for a dataset

2009-04-22 Thread jason hadoop
/numbers -output /tmp/numbers_max_output -reducer aggregate -mapper LongMax.pl -file /tmp/LongMax.pl On Tue, Apr 21, 2009 at 7:42 PM, jason hadoop wrote: > There is no reason to use a combiner in this case, as there is only a > single output record from the map. > > Combiners buy you da

Re: getting DiskErrorException during map

2009-04-22 Thread jason hadoop
ey Jason, > > We've never had the hadoop.tmp.dir identical on all our nodes. > > Brian > > > On Apr 22, 2009, at 10:54 AM, jason hadoop wrote: > > For reasons that I have never bothered to investigate I have never had a >> cluster work when the hadoop.tmp.dir was

Re: No route to host prevents from storing files to HDFS

2009-04-22 Thread jason hadoop
The 'no route to host' message means one of two things: either there is no actual route, which would have generated a different error, or some firewall is sending back a 'no route' message. I have seen the 'no route to host' problem several times, and it is usually because there is a firewall in place

Re: No route to host prevents from storing files to HDFS

2009-04-22 Thread jason hadoop
I wonder if this is an obscure case of running out of file descriptors. I would expect a different message out of the JVM core On Wed, Apr 22, 2009 at 5:34 PM, Matt Massie wrote: > Just for clarity: are you using any type of virtualization (e.g. vmware, > xen) or just running the DataNode java process o

Re: NameNode Startup Problem

2009-04-22 Thread jason hadoop
It looks like this is during the hdfs recovery phase of the cluster start. Perhaps a tmp cleaner has removed some of the files, and now this portion of the restart is causing a failure. I am not terribly familiar with the job recovery code. On Wed, Apr 22, 2009 at 11:44 AM, Tamir Kamara wrote:

Re: No route to host prevents from storing files to HDFS

2009-04-22 Thread jason hadoop
I believe the datanode is the same physical machine as the namenode, if I understand this problem correctly (which really puts paid to our suggestions about traceroute and firewalls). I have one question: is the ip address consistent? I think in one of the thread mails it was stated that the ip addr

Re: No route to host prevents from storing files to HDFS

2009-04-23 Thread jason hadoop
Can you give us your network topology? I see at least 3 ip addresses: 192.168.253.20, 192.168.253.32 and 192.168.253.21. In particular the fs.default.name which you have provided, the hadoop-site.xml for each machine, the slaves file, with ip address mappings if needed and a netstat -a -n -t -
