Re: Hadoop HDFS vs. NFS latency

2008-03-24 Thread Robert Krüger
Thank you very much for sharing your experience! This is very helpful and we will take a look at mogile. I have two questions regarding your decision against HDFS. You mention issues with scale regarding the number of files. Could you elaborate a bit? At which orders of magnitude would you

[core] datanode Exception error wiht dfs.replication to 1

2008-03-24 Thread Alfonso Olias Sanz
Hi, I set the property dfs.replication to 1 and I got this error while copying files to the DFS. [EMAIL PROTECTED] ~/software/Hadoop/hadoop-0.16.0]$ bin/hadoop dfs -copyFromLocal /home2/mtlinden/simdata/GASS-RDS-3-G/tm IDT 08/03/24 11:46:32 WARN fs.DFSClient: DataStreamer Exception:
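
For reference, dfs.replication is read by the writing client, so it can also be set programmatically before a copy. A minimal sketch (0.16-era API; the class name and paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyWithSingleReplica {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client asks the namenode for this many replicas per block;
        // 1 means no redundancy, so a single dead datanode loses the data.
        conf.setInt("dfs.replication", 1);

        FileSystem fs = FileSystem.get(conf);
        // Placeholder paths; substitute the real local and DFS locations.
        fs.copyFromLocalFile(new Path("/local/simdata/file.dat"),
                             new Path("/user/someone/file.dat"));
    }
}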

Re: Performance / cluster scaling question

2008-03-24 Thread André Martin
Thanks for the clarification, dhruba :-) Anyway, what can cause those other exceptions, such as "Could not get block locations" and "DataXceiver: java.io.EOFException"? Can anyone give me a little more insight about those exceptions? And does anyone have a similar workload (frequent writes and

Re: HoD and locality of TaskTrackers to data (on DataNodes)

2008-03-24 Thread Jiaqi Tan
Hi Hemanth, More design questions I'm wondering about: So what determines the spread/location of data blocks that are uploaded/added to HDFS outside of the Map/Reduce framework? For instance, if I use a dfs -put to upload files to the HDFS, does the dfs system try to spread the blocks out across

[core] problems while copying files from local file system to dfs

2008-03-24 Thread Alfonso Olias Sanz
Hi, I want to copy 1000 files (37GB) of data to the dfs. I have a setup of 9-10 nodes, each one has between 5 and 15GB of free space. While copying the files from the local file system on nodeA, the node gets full of data and the process gets stalled. I have another free node with 80GB of

Re: [core] problems while copying files from local file system to dfs

2008-03-24 Thread Ted Dunning
Copy from a machine that is *not* running as a data node in order to get better balancing. Using distcp may also help because the nodes actually doing the copying will be spread across the cluster. You should probably be running a rebalancing script as well if your nodes have differing sizes.

Re: Hadoop HDFS vs. NFS latency

2008-03-24 Thread Robert Krüger
Ted Dunning wrote: A few million files should fit pretty easily in hdfs. One problem is that hadoop is not designed with full high availability in mind. Mogile is easier to adapt to those needs. Sorry to be so persistent but what failure scenario would mogile handle better than hadoop hdfs or

Re: [core] problems while copying files from local file system to dfs

2008-03-24 Thread Alfonso Olias Sanz
Hi Ted, thanks for the info. But when running distcp I got this exception: bin/hadoop distcp -update file:///home2/mtlinden/simdata/GASS-RDS-3-G/tm /user/aolias/IDT With failures, global counters are inaccurate; consider running with -i Copy failed: org.apache.hadoop.ipc.RemoteException:

Re: [core] problems while copying files from local file system to dfs

2008-03-24 Thread Alfonso Olias Sanz
OK, it seems the file system is corrupted. How can I recover from this? bin/hadoop fsck / /tmp/hadoop-aolias/mapred/system/job_200803241610_0001/job.jar: Under replicated blk_4445907956276011533. Target Replicas is 10 but found 7 replica(s). ...
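
The under-replicated file reported here is the submitted job.jar, which is written with the job-submission replication factor (mapred.submit.replication, default 10) rather than dfs.replication, so a 9-10 node cluster cannot reach the target. A hedged sketch of lowering that factor, assuming that is indeed the cause; the value 3 is illustrative:

import org.apache.hadoop.mapred.JobConf;

public class SubmitReplicationExample {
    public static JobConf configure() {
        JobConf conf = new JobConf(SubmitReplicationExample.class);
        // Job files (job.jar, job.xml) are written with this replication
        // factor, which defaults to 10; lower it to something a small
        // cluster can actually satisfy.
        conf.setInt("mapred.submit.replication", 3);
        return conf;
    }
}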

setMapOutputValueClass doesn't work

2008-03-24 Thread Chang Hu
Hi, I have been unsuccessfully trying to set the map output value class to be different from the one the reduce outputs (in 0.16.0). AFAIK the following should do the trick: conf.setMapOutputValueClass(FooWritable.class) conf.setOutputValueClass(BarWritable.class) However I kept getting exceptions saying
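
For reference, a minimal sketch of the intended configuration with the 0.16 JobConf API; FooWritable and BarWritable stand in for the poster's own value classes, and the Text key classes are illustrative:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

public class OutputClassConfig {
    public static void configure(JobConf conf) {
        // Map emits <Text, FooWritable>; reduce emits <Text, BarWritable>.
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(FooWritable.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(BarWritable.class);
    }
}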

Re: setMapOutputValueClass doesn't work

2008-03-24 Thread Riccardo Boscolo
From the exception stack it appears that the map output class is correctly set to FooWritable.class but you are trying to collect BarWritable(s) in your map tasks. Best, RB On Mon, Mar 24, 2008 at 1:22 PM, Chang Hu [EMAIL PROTECTED] wrote: Hi, I have been unsuccessfully trying to set the

Re: MapReduce with related data from disparate files

2008-03-24 Thread Ted Dunning
Map-reduce excels at gluing together files like this. The map phase selects the key and makes sure that you have some way of telling what the source of the record is. The reduce phase takes all of the records with the same key and glues them together. It can do your processing, but it is also
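
A hedged sketch of the reduce side of that pattern (0.16-era API). It assumes the map has already prefixed each value with a one-letter source tag and that records are tab-separated; the tags and field layout are illustrative only:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class GlueReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        // Separate the records by the source tag the map attached.
        List<String> left = new ArrayList<String>();
        List<String> right = new ArrayList<String>();
        while (values.hasNext()) {
            String v = values.next().toString();
            if (v.startsWith("A\t")) {
                left.add(v.substring(2));
            } else {
                right.add(v.substring(2));
            }
        }
        // Emit every pairing of records that share the key.
        for (String l : left) {
            for (String r : right) {
                out.collect(key, new Text(l + "\t" + r));
            }
        }
    }
}

Emitting every pairing of records that share a key is one simple way to glue the sources together; a real job might instead merge fields or filter.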

Re: setMapOutputValueClass doesn't work

2008-03-24 Thread Chang Hu
Thanks Riccardo, but that's not the case. I checked and made sure it's collecting FooWritable. In fact, from the following thread: http://www.nabble.com/Different-output-classes-from-map-and-reducer-td15728122.html My exception is the same as if the map output value class was not set. - Chang

Re: [core] problems while copying files from local file system to dfs

2008-03-24 Thread lohit
If the client you use to copy is one of the datanodes, then the first replica goes to that datanode (the client) and the second goes to another random node in your cluster. This policy is designed to improve write performance. On the other hand, if you would like the data to be distributed, as Ted

RE: MapReduce with related data from disparate files

2008-03-24 Thread Nathan Wang
It's possible to do the whole thing in one round of map/reduce. The only requirement is to be able to differentiate between the 2 different types of input files, possibly using different file name extensions. One of my coworkers wrote a smart InputFormat class that creates a different
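
The coworker's InputFormat isn't shown here, but a simpler hedged sketch of the same idea tags records in the map itself using the file name of the current split (available as map.input.file); the extensions and tags are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SourceTaggingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private String tag;

    public void configure(JobConf job) {
        // The framework sets map.input.file to the path of the current split,
        // so the file name extension tells us which kind of record this is.
        String file = job.get("map.input.file", "");
        tag = file.endsWith(".a") ? "A" : "B";
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        // Assume tab-separated lines whose first field is the shared key.
        String[] fields = value.toString().split("\t", 2);
        if (fields.length == 2) {
            out.collect(new Text(fields[0]), new Text(tag + "\t" + fields[1]));
        }
    }
}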

Re: setMapOutputValueClass doesn't work

2008-03-24 Thread Chang Hu
Code below, also attached. I put this together from the word count example. package edu.umd.cs.mapreduce; import java.io.IOException; import java.util.Iterator; import java.util.StringTokenizer; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; import

Re: setMapOutputValueClass doesn't work

2008-03-24 Thread Doug Cutting
Chang Hu wrote: Code below, also attached. I put this together from the word count example. The problem is with your combiner. When a combiner is specified, it generates the final map output, since combination is a map-side operation. Your combiner takes Text,IntWritable generated by
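
A hedged sketch of what a type-correct combiner for this job could look like; FooWritable and its merge method are placeholders for the poster's own map output value class and combine logic:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// FooWritable stands in for whatever the map's output value class is.
public class FooCombiner extends MapReduceBase
        implements Reducer<Text, FooWritable, Text, FooWritable> {

    public void reduce(Text key, Iterator<FooWritable> values,
                       OutputCollector<Text, FooWritable> out, Reporter reporter)
            throws IOException {
        // A combiner runs on the map side, so both its input and its output
        // must use the map output classes. Here we fold the values into one;
        // the merge method is hypothetical and depends on FooWritable.
        FooWritable merged = new FooWritable();
        while (values.hasNext()) {
            merged.merge(values.next());
        }
        out.collect(key, merged);
    }
}

It would be registered with conf.setCombinerClass(FooCombiner.class); if the values cannot be meaningfully combined on the map side, running without a combiner is also fine, at the cost of shuffling more map output.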

Re: [core] problems while copying files from local file system to dfs

2008-03-24 Thread Alfonso Olias Sanz
Yes, I did the test and it worked. I also ran the distcp command and the parallel map/reduce copy. It improves performance because files are copied locally on that node, so there is no need for network transmission. But isn't that policy weaker? If that node crashes (the worst case), you

Re: setMapOutputValueClass doesn't work

2008-03-24 Thread Chang Hu
Thanks Doug! I am able to run the job after removing the setCombiner() line. Does it hurt efficiency, and how do I add a combiner? - Chang On Mon, Mar 24, 2008 at 6:26 PM, Doug Cutting [EMAIL PROTECTED] wrote: Chang Hu wrote: Code below, also attached. I put this together from the word

Re: setMapOutputValueClass doesn't work

2008-03-24 Thread Chang Hu
Good call. Thank you guys for helping me out. I'll do some experiments on efficiency later and keep you guys updated. - Chang On Mon, Mar 24, 2008 at 6:51 PM, Riccardo Boscolo [EMAIL PROTECTED] wrote: That's simple, add a combiner that looks exactly like your reducer, but collects

Re: [core] problems while copying files from local file system to dfs

2008-03-24 Thread lohit
It improves performance because files are copied locally on that node, so there is no need for network transmission. But isn't that policy weaker? If that node crashes (the worst case), you lose one redundancy level. This policy was for better write performance. As you mentioned, yes in

Re: [core] problems while copying files from local file system to dfs

2008-03-24 Thread Ted Dunning
I hate to point this out, but losing *any* data node will decrease the replication of some blocks. On 3/24/08 4:53 PM, lohit [EMAIL PROTECTED] wrote: It improves performance because files are copied locally on that node, so there is no need for network transmission. But isn't that policy

Re: How to use hadoop with tomcat!!

2008-03-24 Thread Josh Ma
sandybandy wrote: Hi, I have put hadoop-core.jar and all dependent JARs in the webapp lib, and also all XMLBeans jars in the webapp lib, since my map/reduce program uses these XMLBeans jars to process XML documents. But when I submit a job via a servlet it says