Re: Optimal Filesystem (and Settings) for HDFS

2009-05-19 Thread Bryan Duxbury
We use XFS for our data drives, and we've had somewhat mixed results. One of the biggest pros is that a freshly formatted XFS volume has more usable free space than ext3, even with ext3's reserved-space setting turned all the way down to 0. Another is that you can format a 1TB drive as XFS in about 0 seconds, versus minutes for

Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

2009-04-14 Thread Bryan Duxbury
I thought it a conspicuous omission to not discuss the cost of various approaches. Hadoop is free, though you have to spend developer time; how much does Vertica cost on 100 nodes? -Bryan On Apr 14, 2009, at 7:16 AM, Guilherme Germoglio wrote: (Hadoop is used in the benchmarks)

Issue distcp'ing from 0.19.2 to 0.18.3

2009-04-09 Thread Bryan Duxbury
Hey all, I was trying to copy some data from our cluster on 0.19.2 to a new cluster on 0.18.3 by using distcp and the hftp:// filesystem. Everything seemed to be going fine for a few hours, but then a few tasks failed because a few files returned 500 errors when being read from the 19

Re: Issue distcp'ing from 0.19.2 to 0.18.3

2009-04-09 Thread Bryan Duxbury
out at the same time? Thanks -Todd On Wed, Apr 8, 2009 at 11:39 PM, Bryan Duxbury br...@rapleaf.com wrote: Hey all, I was trying to copy some data from our cluster on 0.19.2 to a new cluster on 0.18.3 by using distcp and the hftp:// filesystem. Everything seemed to be going fine for a few

Re: HELP: I wanna store the output value into a list not write to the disk

2009-04-02 Thread Bryan Duxbury
I don't really see what the downside of reading it from disk is. A list of word counts should be pretty small on disk so it shouldn't take long to read it into a HashMap. Doing anything else is going to cause you to go a long way out of your way to end up with the same result. -Bryan On
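
A minimal sketch of that read-it-back approach, assuming the job wrote ordinary text part files in word<TAB>count form (the path handling and output layout here are assumptions, not details from the thread):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WordCountLoader {
  public static Map<String, Long> load(Path outputDir) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Map<String, Long> counts = new HashMap<String, Long>();
    // Read every part-* file the reducers wrote into the output directory.
    for (FileStatus status : fs.listStatus(outputDir)) {
      if (!status.getPath().getName().startsWith("part-")) continue;
      BufferedReader reader =
          new BufferedReader(new InputStreamReader(fs.open(status.getPath())));
      String line;
      while ((line = reader.readLine()) != null) {
        // Default TextOutputFormat layout: key, tab, value.
        String[] fields = line.split("\t");
        counts.put(fields[0], Long.parseLong(fields[1]));
      }
      reader.close();
    }
    return counts;
  }
}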

Re: Massive discrepancies in job's bytes written/read

2009-03-18 Thread Bryan Duxbury
Is there some area of the codebase that deals with aggregating counters that I should be looking at? -Bryan On Mar 17, 2009, at 10:20 PM, Owen O'Malley wrote: On Mar 17, 2009, at 7:44 PM, Bryan Duxbury wrote: There is no compression in the mix for us, so that's not the culprit. I'd

Re: Does HDFS provide a way to append A file to B ?

2009-03-17 Thread Bryan Duxbury
I believe the last word on appends right now is that the patch that was committed broke a lot of other things, so it's been disabled. As such, there is no working append in HDFS, and certainly not in hadoop-17.x. -Bryan On Mar 17, 2009, at 4:50 PM, Steve Gao wrote: Thanks, but I was told

Massive discrepancies in job's bytes written/read

2009-03-17 Thread Bryan Duxbury
Hey all, In looking at the stats for a number of our jobs, the amount of data that the UI claims we've read from or written to HDFS is vastly larger than the amount of data that should be involved in the job. For instance, we have a job that combines small files into big files that we're

Re: Does HDFS provide a way to append A file to B ?

2009-03-17 Thread Bryan Duxbury
No. There isn't *any* version of Hadoop with a (stable) append command. On Mar 17, 2009, at 5:08 PM, Steve Gao wrote: Thanks, Bryan. Does 0.18.3 have a built-in append command? --- On Tue, 3/17/09, Bryan Duxbury br...@rapleaf.com wrote: From: Bryan Duxbury br...@rapleaf.com Subject: Re: Does

Re: Massive discrepancies in job's bytes written/read

2009-03-17 Thread Bryan Duxbury
if it is due to all the disk activity that happens while processing spills in the mapper and the copy/shuffle/sort phase in the reducer. It would certainly be nice if all the byte counts were reported in a way that they're comparable. -- Stefan From: Bryan Duxbury br...@rapleaf.com Reply

Re: Profiling Hadoop

2009-02-27 Thread Bryan Duxbury
I've used YourKit Java Profiler pretty successfully. There's a JobConf parameter you can flip on that will cause a few maps and reduces to start with profiling on, so you won't be overwhelmed with info. -Bryan On Feb 27, 2009, at 11:12 AM, Sandy wrote: Hello, Could anyone recommend any
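
He's describing the mapred.task.profile knobs; a minimal sketch of flipping them on, assuming 0.19-era property names (the hprof agent string is Hadoop's stock default - you'd swap in the YourKit agent line instead):

import org.apache.hadoop.mapred.JobConf;

public class ProfilingConfExample {
  public static JobConf configure(JobConf conf) {
    // Ask the framework to run a handful of task attempts under a profiler.
    conf.set("mapred.task.profile", "true");
    // Only profile the first few map and reduce attempts so the output
    // stays manageable.
    conf.set("mapred.task.profile.maps", "0-2");
    conf.set("mapred.task.profile.reduces", "0-2");
    // Agent arguments passed to each profiled task JVM; replace with the
    // YourKit agent string if that is the profiler you use.
    conf.set("mapred.task.profile.params",
        "-agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y,verbose=n,file=%s");
    return conf;
  }
}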

Big HDFS deletes lead to dead datanodes

2009-02-24 Thread Bryan Duxbury
On occasion, I've deleted a few TB of stuff in DFS at once. I've noticed that when I do this, datanodes start taking a really long time to check in and ultimately get marked dead. Some time later, they'll get done deleting stuff and come back and get unmarked. I'm wondering, why do

Super-long reduce task timeouts in hadoop-0.19.0

2009-02-20 Thread Bryan Duxbury
(Repost from the dev list) I noticed some really odd behavior today while reviewing the job history of some of our jobs. Our Ganglia graphs showed really long periods of inactivity across the entire cluster, which should definitely not be the case - we have a really long string of jobs in

Re: Super-long reduce task timeouts in hadoop-0.19.0

2009-02-20 Thread Bryan Duxbury
We didn't customize this value, to my knowledge, so I'd suspect it's the default. -Bryan On Feb 20, 2009, at 5:00 PM, Ted Dunning wrote: How often do your reduce tasks report status? On Fri, Feb 20, 2009 at 3:58 PM, Bryan Duxbury br...@rapleaf.com wrote: (Repost from the dev list) I
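
For reference, the timeout involved is mapred.task.timeout (600,000 ms by default), and the standard way to keep a legitimately slow reduce alive is to report progress from inside it; a minimal sketch against the old org.apache.hadoop.mapred API of that era:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SlowReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {
  public void reduce(Text key, Iterator<LongWritable> values,
      OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    long sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
      // Tell the tasktracker we are still alive so the task is not
      // killed after mapred.task.timeout milliseconds of silence.
      reporter.progress();
    }
    output.collect(key, new LongWritable(sum));
  }
}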

Measuring IO time in map/reduce jobs?

2009-02-12 Thread Bryan Duxbury
Hey all, Does anyone have any experience trying to measure IO time spent in their map/reduce jobs? I know how to profile a sample of map and reduce tasks, but that appears to exclude IO time. Just subtracting the total cpu time from the total run time of a task seems like too coarse an

Re: java.io.IOException: Could not get block locations. Aborting...

2009-02-09 Thread Bryan Duxbury
Small files are bad for hadoop. You should avoid keeping a lot of small files if possible. That said, that error is something I've seen a lot. It usually happens when the number of xcievers hasn't been adjusted upwards from the default of 256. We run with 8000 xcievers, and that seems to
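
dfs.datanode.max.xcievers is a per-datanode setting that lives in hadoop-site.xml on every datanode (and yes, the property name really is spelled "xcievers"). A small, hypothetical helper for double-checking what value actually got picked up, using only the stock Configuration API:

import org.apache.hadoop.conf.Configuration;

public class XcieverCheck {
  public static void main(String[] args) {
    // new Configuration() loads hadoop-default.xml and hadoop-site.xml
    // from the classpath, which is where the limit has to be raised.
    Configuration conf = new Configuration();
    int xcievers = conf.getInt("dfs.datanode.max.xcievers", 256);
    System.out.println("dfs.datanode.max.xcievers = " + xcievers
        + (xcievers < 8000 ? "  (thread suggests raising this, e.g. to 8000)" : ""));
  }
}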

Re: java.io.IOException: Could not get block locations. Aborting...

2009-02-09 Thread Bryan Duxbury
Correct. +1 to Jason's more unix file handles suggestion. That's a must-have. -Bryan On Feb 9, 2009, at 3:09 PM, Scott Whitecross wrote: This would be an addition to the hadoop-site.xml file, to up dfs.datanode.max.xcievers? Thanks. On Feb 9, 2009, at 5:54 PM, Bryan Duxbury wrote

Re: Control over max map/reduce tasks per job

2009-02-03 Thread Bryan Duxbury
This sounds good enough for a JIRA ticket to me. -Bryan On Feb 3, 2009, at 11:44 AM, Jonathan Gray wrote: Chris, For my specific use cases, it would be best to be able to set N mappers/reducers per job per node (so I can explicitly say, run at most 2 at a time of this CPU bound task on any

Re: Question about HDFS capacity and remaining

2009-01-30 Thread Bryan Duxbury
Ext2 by default reserves 5% of the drive for use by root only. That'd be about 45GB of your 907GB capacity, which would account for most of the discrepancy. You can adjust this with tune2fs. Doug Bryan Duxbury wrote: There are no non-dfs files on the partitions in question. df -h indicates

Re: Question about HDFS capacity and remaining

2009-01-30 Thread Bryan Duxbury
it was mostly full (ext4 was not tested)... so, if you are thinking of pushing things to the limits, that might be something worth considering. Brian On Jan 30, 2009, at 11:18 AM, stephen mulcahy wrote: Bryan Duxbury wrote: Hm, very interesting. Didn't know about that. What's the purpose

Question about HDFS capacity and remaining

2009-01-29 Thread Bryan Duxbury
Hey all, I'm currently installing a new cluster, and noticed something a little confusing. My DFS is *completely* empty - 0 files in DFS. However, in the namenode web interface, the reported capacity is 3.49 TB, but the remaining is 3.25TB. Where'd that .24TB go? There are literally zero

Re: Question about HDFS capacity and remaining

2009-01-29 Thread Bryan Duxbury
files. Hairong On 1/29/09 3:23 PM, Bryan Duxbury br...@rapleaf.com wrote: Hey all, I'm currently installing a new cluster, and noticed something a little confusing. My DFS is *completely* empty - 0 files in DFS. However, in the namenode web interface, the reported capacity is 3.49 TB

Re: Q about storage architecture

2008-12-07 Thread Bryan Duxbury
If you are considering using it as a conventional filesystem from a few clients, then it most resembles NAS. However, I don't think it makes sense to try and classify it as SAN or NAS. HDFS is a distributed filesystem designed to be consumed in a massively distributed fashion, so it does

Re: Filesystem closed errors

2008-11-26 Thread Bryan Duxbury
My app isn't a map/reduce job. On Nov 25, 2008, at 9:07 PM, David B. Ritch wrote: Do you have speculative execution enabled? I've seen error messages like this caused by speculative execution. David Bryan Duxbury wrote: I have an app that runs for a long time with no problems, but when I

Re: Filesystem closed errors

2008-11-26 Thread Bryan Duxbury
to your problem. On Nov 25, 2008, at 9:07 PM, David B. Ritch wrote: Do you have speculative execution enabled? I've seen error messages like this caused by speculative execution. David Bryan Duxbury wrote: I have an app that runs for a long time with no problems, but when I signal it to shut

Filesystem closed errors

2008-11-25 Thread Bryan Duxbury
I have an app that runs for a long time with no problems, but when I signal it to shut down, I get errors like this: java.io.IOException: Filesystem closed at org.apache.hadoop.dfs.DFSClient.checkOpen(DFSClient.java:196) at
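
A likely culprit, though it's an assumption about this app rather than something established in the thread: FileSystem.get() hands back a cached instance shared across the JVM, so closing it in one place (a shutdown hook, say) breaks every other component still holding it. A contrived sketch of that failure shape:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsCloseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // FileSystem.get() returns a cached instance shared by every caller
    // that uses the same URI and conf.
    final FileSystem fs = FileSystem.get(conf);

    // A shutdown hook (or any other component) that closes the shared
    // instance makes later calls on it throw
    // "java.io.IOException: Filesystem closed" from DFSClient.checkOpen().
    Runtime.getRuntime().addShutdownHook(new Thread() {
      public void run() {
        try { fs.close(); } catch (Exception e) { /* ignore */ }
      }
    });

    // Any thread still doing I/O when the hook fires sees the exception.
    fs.exists(new Path("/tmp/example"));
  }
}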

Re: Hadoop Design Question

2008-11-06 Thread Bryan Duxbury
Comments inline. On Nov 6, 2008, at 9:29 AM, Ricky Ho wrote: Hi, While exploring how Hadoop fits in our usage scenarios, there are 4 recurring issues that keep popping up. I don't know if they are real issues or just our misunderstanding of Hadoop. Can any expert shed some light here?

Re: Can anyone recommend me a inter-language data file format?

2008-11-01 Thread Bryan Duxbury
Agree, we use Thrift at Rapleaf for this purpose. It's trivial to make a ThriftWritable if you want to be crafty, but you can also just use byte[]s and do the serialization and deserialization yourself. -Bryan On Nov 1, 2008, at 8:01 PM, Alex Loddengaard wrote: Take a look at Thrift:
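
A rough sketch of the byte[] route, assuming the Apache-era org.apache.thrift package names; the caller supplies whatever generated Thrift struct it uses, and none of this is Rapleaf's actual ThriftWritable:

import org.apache.hadoop.io.BytesWritable;
import org.apache.thrift.TBase;
import org.apache.thrift.TDeserializer;
import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TBinaryProtocol;

public class ThriftBytes {
  // Serialize any generated Thrift struct into a BytesWritable so it can
  // travel through a map/reduce job as an opaque value.
  public static BytesWritable toWritable(TBase thriftObj) throws TException {
    TSerializer serializer = new TSerializer(new TBinaryProtocol.Factory());
    return new BytesWritable(serializer.serialize(thriftObj));
  }

  // Deserialize back into a caller-supplied (empty) struct instance.
  public static <T extends TBase> T fromWritable(BytesWritable writable, T into)
      throws TException {
    TDeserializer deserializer = new TDeserializer(new TBinaryProtocol.Factory());
    // The writable was built directly from the serialized array, so
    // getBytes() holds exactly the payload with no padding.
    deserializer.deserialize(into, writable.getBytes());
    return into;
  }
}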

Re: Lazily deserializing Writables

2008-10-02 Thread Bryan Duxbury
We do this with some of our Thrift-serialized types. We account for this behavior explicitly in the ThriftWritable class and make it so that we can read the serialized version off the wire completely by prepending the size. Then, we can read in the raw bytes and hang on to them for later
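
A stripped-down sketch of that pattern: length-prefix the payload on write, slurp the raw bytes in readFields, and only decode when someone asks. Class and method names here are invented for illustration:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class LazyBytesWritable implements Writable {
  private byte[] raw;          // serialized payload, kept as-is
  private Object deserialized; // filled in lazily

  public void write(DataOutput out) throws IOException {
    // Prepend the size so a reader can pull the whole record off the
    // wire without understanding the payload format.
    out.writeInt(raw.length);
    out.write(raw);
  }

  public void readFields(DataInput in) throws IOException {
    int size = in.readInt();
    raw = new byte[size];
    in.readFully(raw);
    deserialized = null; // forget any previously cached object
  }

  public Object get() throws IOException {
    if (deserialized == null) {
      deserialized = decode(raw); // only pay the cost when someone asks
    }
    return deserialized;
  }

  // Placeholder: plug in Thrift (or any other) decoding here.
  private Object decode(byte[] bytes) throws IOException {
    throw new UnsupportedOperationException("decode not implemented in this sketch");
  }
}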

rename return values

2008-09-30 Thread Bryan Duxbury
Hey all, Why is it that FileSystem.rename returns true or false instead of throwing an exception? It seems incredibly inconvenient to get a false result and then have to go poring over the namenode logs looking for the actual error message. I had this case recently where I'd forgotten to
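
The usual workaround is a thin wrapper that turns the false into an exception with both paths in the message; a minimal sketch (the helper name is made up):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenameHelper {
  // FileSystem.rename returns false on failure instead of throwing,
  // so surface a real error that at least names both paths.
  public static void renameOrDie(FileSystem fs, Path src, Path dst)
      throws IOException {
    if (!fs.rename(src, dst)) {
      throw new IOException("rename failed: " + src + " -> " + dst
          + " (check the namenode log for the underlying cause)");
    }
  }
}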

Re: rename return values

2008-09-30 Thread Bryan Duxbury
if it did, it's not clear to FileSystem that the failure to rename is fatal/exceptional to the application. -C On Sep 30, 2008, at 1:37 PM, Bryan Duxbury wrote: Hey all, Why is it that FileSystem.rename returns true or false instead of throwing an exception? It seems incredibly inconvenient

Re: Could not get block locations. Aborting... exception

2008-09-29 Thread Bryan Duxbury
Ok, so, what might I do next to try and diagnose this? Does it sound like it might be an HDFS/mapreduce bug, or should I pore over my own code first? Also, did any of the other exceptions look interesting? -Bryan On Sep 29, 2008, at 10:40 AM, Raghu Angadi wrote: Raghu Angadi wrote: Doug

Could not get block locations. Aborting... exception

2008-09-26 Thread Bryan Duxbury
Hey all. We've been running into a very annoying problem pretty frequently lately. We'll be running some job, for instance a distcp, and it'll be moving along quite nicely, until all of the sudden, it sort of freezes up. It takes a while, and then we'll get an error like this one:

Re: Could not get block locations. Aborting... exception

2008-09-26 Thread Bryan Duxbury
: Bryan Duxbury [mailto:[EMAIL PROTECTED] Sent: Fri 9/26/2008 4:36 PM To: core-user@hadoop.apache.org Subject: Could not get block locations. Aborting... exception Hey all. We've been running into a very annoying problem pretty frequently lately. We'll be running some job, for instance a distcp

Hadoop job scheduling issue

2008-09-24 Thread Bryan Duxbury
I encountered an interesting situation today. I'm running Hadoop 0.17.1. What happened was that 3 jobs started simultaneously, which is expected in my workflow, but then resources got very mixed up. One of the jobs grabbed all the available reducers (5) and got one map task in before the

Re: Serialization format for structured data

2008-05-23 Thread Bryan Duxbury
On May 23, 2008, at 9:51 AM, Ted Dunning wrote: Relative to thrift, JSON has the advantage of not requiring a schema as well as the disadvantage of not having a schema. The advantage is that the data is more fluid and I don't have to generate code to handle the records. The disadvantage

Re: Trouble hooking up my app to HDFS

2008-05-14 Thread Bryan Duxbury
Nobody has any ideas about this? -Bryan On May 13, 2008, at 11:27 AM, Bryan Duxbury wrote: I'm trying to create a java application that writes to HDFS. I have it set up such that hadoop-0.16.3 is on my machine, and the env variables HADOOP_HOME and HADOOP_CONF_DIR point to the correct

Re: Trouble hooking up my app to HDFS

2008-05-14 Thread Bryan Duxbury
is present in your classpath. Make sure your generated classpath matches the same. And the conf dir (/Users/bryanduxbury/hadoop-0.16.3/conf) I hope it is the same as the one you are using for your hadoop installation. Thanks, lohit - Original Message From: Bryan Duxbury [EMAIL

Trouble hooking up my app to HDFS

2008-05-13 Thread Bryan Duxbury
I'm trying to create a java application that writes to HDFS. I have it set up such that hadoop-0.16.3 is on my machine, and the env variables HADOOP_HOME and HADOOP_CONF_DIR point to the correct respective directories. My app lives elsewhere, but generates its classpath by looking in
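
For anyone hitting the same wall, a bare-bones standalone HDFS writer looks like the sketch below; it assumes the Hadoop jars and a hadoop-site.xml with the right fs.default.name are on the application's classpath, which is what the replies in this thread point at:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriter {
  public static void main(String[] args) throws Exception {
    // new Configuration() picks up hadoop-default.xml and hadoop-site.xml
    // from the classpath; if hadoop-site.xml is missing you silently get
    // the local filesystem instead of HDFS.
    Configuration conf = new Configuration();
    System.out.println("fs.default.name = " + conf.get("fs.default.name"));

    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(new Path("/tmp/hdfs-writer-test.txt"));
    out.writeBytes("hello from a standalone app\n");
    out.close();
  }
}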

Re: Hadoop and retrieving data from HDFS

2008-04-24 Thread Bryan Duxbury
I think what you're saying is that you are mostly interested in data locality. I don't think it's done yet, but it would be pretty easy to make HBase provide start keys as well as region locations for splits for a MapReduce job. In theory, that would give you all the pieces you need to run

Re: ID Service with HBase?

2008-04-16 Thread Bryan Duxbury
HBASE-493 was created, and seems similar. It's a write-if-not-modified-since. I would guess that you probably don't want to use HBase to maintain a distributed auto-increment. You need to think of some other approach that produces unique ids across concurrent access, like hash or GUID or
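
If globally unique (rather than sequential) ids are acceptable, the GUID route needs no coordination at all; a trivial sketch:

import java.util.UUID;

public class IdGenerator {
  // Each client can mint ids independently: no shared counter in HBase,
  // no contention, at the cost of ids that are opaque rather than sequential.
  public static String newId() {
    return UUID.randomUUID().toString();
  }

  public static void main(String[] args) {
    System.out.println(newId());
  }
}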

Re: how to connect to hbase using php

2008-03-11 Thread Bryan Duxbury
To connect to HBase from PHP, you should use either REST or Thrift integration. -Bryan On Mar 11, 2008, at 4:20 AM, Ved Prakash wrote: I have seen examples to connect to hbase using php, which mentions of hshellconnect.class.php, I would like to know where can I download this file, or is

Re: loading data into hbase table

2008-03-11 Thread Bryan Duxbury
Ved, At the moment you're stuck loading the data via one of the APIs (Java, REST or Thrift) yourself. We would like to have import tools for HBase, but we haven't gotten around to it yet. Also, there's now a separate HBase mailing list at hbase-[EMAIL PROTECTED] Your questions about

Re: Hbase Matrix Package for Map/Reduce-based Parallel Matrix Computations

2008-01-30 Thread Bryan Duxbury
There's nothing stopping you from storing doubles in HBase. All you have to do is convert your double into a byte array. -Bryan On Jan 30, 2008, at 4:31 PM, Chanwit Kaewkasi wrote: Hi Edward, On 29/01/2008, edward yoon [EMAIL PROTECTED] wrote: Did you mean the MATLAB-like scientific
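
A minimal sketch of that conversion using only the JDK; HBase just stores the bytes, so the encoding is entirely up to you:

import java.nio.ByteBuffer;

public class DoubleBytes {
  // Encode a double as its 8-byte IEEE-754 big-endian representation.
  public static byte[] toBytes(double value) {
    return ByteBuffer.allocate(8).putDouble(value).array();
  }

  // Decode it again when reading the cell back.
  public static double fromBytes(byte[] bytes) {
    return ByteBuffer.wrap(bytes).getDouble();
  }

  public static void main(String[] args) {
    byte[] cell = toBytes(3.14159);
    System.out.println(fromBytes(cell)); // prints 3.14159
  }
}

One caveat on the design choice: this raw encoding is fine for values, but negative doubles won't sort in numeric order under byte-wise comparison, which matters if you ever use the encoded value as a row or column key.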