Re: HDFS load balancing for non-local reads

2012-01-06 Thread alo.alt
Ben, that's defined in ReplicationTargetChooser: first local, second same rack, then random. You're right - 50/50 if cases one and two do not match.

- Alex

--
Alexander Lorenz
http://mapredit.blogspot.com

On Jan 6, 2012, at 11:56 AM, Ben Clay wrote:
> Alex-
>
> Understood. We do not have a situation
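A toy Java sketch of that preference order (illustrative names only, not the actual ReplicationTargetChooser code):

import java.util.List;
import java.util.Random;

public class ReplicaChoiceSketch {
    // Models the order Alex describes: local node first, then same rack,
    // then a random replica.
    static String chooseReplica(String clientHost, String clientRack,
                                List<String> hosts, List<String> racks) {
        // 1. A replica on the client's own machine wins outright.
        for (String host : hosts) {
            if (host.equals(clientHost)) {
                return host;
            }
        }
        // 2. Otherwise prefer a replica in the client's rack.
        for (int i = 0; i < hosts.size(); i++) {
            if (racks.get(i).equals(clientRack)) {
                return hosts.get(i);
            }
        }
        // 3. Otherwise pick at random, which is what spreads reads
        //    roughly 50/50 across two equally distant replicas.
        return hosts.get(new Random().nextInt(hosts.size()));
    }
}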

Re: Combining AVRO files efficiently within HDFS

2012-01-06 Thread Joey Echeverria
I would use a MapReduce job to merge them.

-Joey

On Fri, Jan 6, 2012 at 11:55 AM, Frank Grimes wrote:
> Hi Joey,
>
> That's a very good suggestion and might suit us just fine.
>
> However, many of the files will be much smaller than the HDFS block size.
> That could affect the performance of the
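A minimal sketch of such a merge job, assuming an avro-mapred artifact that ships the org.apache.avro.mapreduce classes (the /event.avsc schema resource is hypothetical). The default identity map funnelled through a single reducer rewrites many small .avro files as one large one; note that the shuffle does not preserve record order:

import org.apache.avro.Schema;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AvroMergeJob {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
                AvroMergeJob.class.getResourceAsStream("/event.avsc"));

        Job job = new Job();
        job.setJarByClass(AvroMergeJob.class);
        job.setJobName("avro-merge");

        FileInputFormat.setInputPaths(job, new Path(args[0]));  // e.g. the hourly files
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // daily aggregate dir

        job.setInputFormatClass(AvroKeyInputFormat.class);
        job.setOutputFormatClass(AvroKeyOutputFormat.class);
        AvroJob.setInputKeySchema(job, schema);
        AvroJob.setMapOutputKeySchema(job, schema);
        AvroJob.setOutputKeySchema(job, schema);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputValueClass(NullWritable.class);

        // Default identity Mapper and Reducer; one reducer => one output file.
        job.setNumReduceTasks(1);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}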

Re: Combining AVRO files efficiently within HDFS

2012-01-06 Thread Steve Edison
I was exploring .har-based Hadoop archive files for a similar small log file scenario I have. I have millions of log files which are less than 64MB each and I want to put them into HDFS and run analysis. Still exploring if HDFS is a good option. Traditionally what I have learnt is that HDFS isn't g

RE: Combining AVRO files efficiently within HDFS

2012-01-06 Thread Dave Shine
Frank,

We have a very serious small file problem. I created an M/R job that combines files, as it seemed best to use all the resources of the cluster rather than opening a stream and combining files single-threaded, or trying to do something via the command line.

Dave

-Original Message-
Fr

Re: Combining AVRO files efficiently within HDFS

2012-01-06 Thread Frank Grimes
Hi Joey,

That's a very good suggestion and might suit us just fine.

However, many of the files will be much smaller than the HDFS block size. That could affect the performance of the MapReduce jobs, correct?

Also, from my understanding it would put more burden on the name node (memory usage) tha

RE: HDFS load balancing for non-local reads

2012-01-06 Thread Ben Clay
Alex-

Understood. We do not have a situation that extreme; I was just looking for conceptual verification that reads are balanced across replicas of equal distance.

From the PDF you linked: "For reading, the name node first checks if the client's computer is located in the cluster. If yes, block

Re: Combining AVRO files efficiently within HDFS

2012-01-06 Thread Joey Echeverria
I would do it by staging the machine data into a temporary directory and then renaming the directory when it's been verified. So, data would be written into directories like this:

2012-01/02/00/stage/machine1.log.avro
2012-01/02/00/stage/machine2.log.avro
2012-01/02/00/stage/machine3.log.avro

Aft
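A minimal sketch of that stage-then-rename pattern using the standard FileSystem API (the "ready" directory name and the verification step are assumptions, not from the thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PromoteStagedLogs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path stage = new Path("2012-01/02/00/stage");
        Path ready = new Path("2012-01/02/00/ready");

        // ... verify that every expected machineN.log.avro is present
        //     and complete before this point ...

        // rename() is a cheap metadata-only operation on the name node,
        // so readers only ever see fully verified directories.
        if (!fs.rename(stage, ready)) {
            throw new IllegalStateException("rename failed: " + stage);
        }
    }
}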

Re: HDFS load balancing for non-local reads

2012-01-06 Thread alo.alt
Ben, the scenario should not happen; if one DN has 20 clients and the other zero (same block), the cluster (or DN) has another problem. Rack Awareness is described here: https://issues.apache.org/jira/secure/attachment/12345251/Rack_aware_HDFS_proposal.pdf

- Alex

--
Alexander Lorenz
http://mapr

Re: Combining AVRO files efficiently within HDFS

2012-01-06 Thread Frank Grimes
Hi Bobby,

Actually, the problem we're trying to solve is one of completeness.

Say we have 3 machines generating log events and putting them into HDFS on an hourly basis, e.g.

2012-01/01/00/machine1.log.avro
2012-01/01/00/machine2.log.avro
2012-01/01/00/machine3.log.avro

Sometime after the hour,

Re: Mounting HDFS

2012-01-06 Thread alo.alt
Stuti, define in CLASSPATH="…." only the jars you really need. An export of all jars in a given directory is a red flag (done with *.jar).

- Alex

On Jan 6, 2012, at 7:23 AM, M. C. Srivas wrote:
>
> unique: 1, error: 0 (Success), outsize: 40
> unique: 2, opcode: GETATTR (3), nodeid: 1, i

Re: Combining AVRO files efficiently within HDFS

2012-01-06 Thread Robert Evans
Frank,

That depends on what you mean by combining. It sounds like you are trying to aggregate data from several days, which may involve doing a join, so I would say a MapReduce job is your best bet. If you are not going to do any processing at all, then why are you trying to combine them? Is th

Combining AVRO files efficiently within HDFS

2012-01-06 Thread Frank Grimes
Hi All,

I was wondering if there was an easy way to combine multiple .avro files efficiently, e.g. combining multiple hours of logs into a daily aggregate.

Note that our Avro schema might evolve to have new (nullable) fields added, but no fields will be removed.

I'd like to avoid needing to pull
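One approach, sketched below under the assumption that the Avro release in use provides DataFileWriter.appendAllFrom (the call behind the avro-tools "concat" command): it copies compressed data blocks without deserializing records. The catch is that every input must share the same writer schema, so files written before and after a schema change can't be mixed this way.

import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AvroConcat {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path hourlyDir = new Path(args[0]);  // e.g. 2012-01/01/00
        Path dailyFile = new Path(args[1]);  // e.g. 2012-01/01/daily.log.avro

        DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
                new GenericDatumWriter<GenericRecord>());
        DataFileWriter<GenericRecord> out = null;
        for (FileStatus status : fs.listStatus(hourlyDir)) {
            DataFileStream<GenericRecord> reader = new DataFileStream<GenericRecord>(
                    fs.open(status.getPath()), new GenericDatumReader<GenericRecord>());
            if (out == null) {
                // The first input seeds the output file's schema.
                out = writer.create(reader.getSchema(), fs.create(dailyFile));
            }
            // Copy raw data blocks; false = do not recompress them.
            out.appendAllFrom(reader, false);
            reader.close();
        }
        if (out != null) {
            out.close();
        }
    }
}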

Re: Mounting HDFS

2012-01-06 Thread M. C. Srivas
> unique: 1, error: 0 (Success), outsize: 40
> unique: 2, opcode: GETATTR (3), nodeid: 1, insize: 56
> Error occurred during initialization of VM
> java/lang/NoClassDefFoundError: java/lang/Object
>
> Exported Environment Variable:
>
> CLASSPATH="/root/FreshMount/hadoop-0.20.2/lib/*.jar:/root/F

How-to use DFSClient's BlockReader from Java

2012-01-06 Thread David Pavlis
Hi,

I am relatively new to Hadoop and I am trying to utilize HDFS for my own application, where I want to take advantage of the data partitioning HDFS performs. The idea is that I get a list of individual blocks - the BlockLocations of a particular file - and then directly read those (go to individual DataNodes).
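Worth noting: BlockReader belongs to HDFS's internal classes and can change between releases; the supported way to discover where a file's blocks live is FileSystem.getFileBlockLocations. A minimal sketch:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus file = fs.getFileStatus(new Path(args[0]));

        // One BlockLocation per block: its byte offset, length, and the
        // DataNodes holding a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(file, 0, file.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " len=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
    }
}

To read one block's worth of data, open the file with fs.open(path), seek(block.getOffset()), and read block.getLength() bytes; the client library routes the read to one of those hosts.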

RE: Mounting HDFS

2012-01-06 Thread Stuti Awasthi
Hi Guys,

I have been badly stuck with fuse-dfs for the last 3 days. Following are the errors I am facing:

[root@slave fuse-dfs]# ./fuse_dfs dfs://slave:54310 /root/FreshMount/mnt1/ -d
port=54310,server=slave
fuse-dfs didn't recognize /root/FreshMount/mnt1/,-2
fuse-dfs ignoring option -d
unique: 1, o