Re: RAID vs. JBOD

2009-01-12 Thread Colin Evans
Currently, Hadoop does round-robin allocation of blocks and data across multiple JBOD disks. We did some testing and found that there weren't significant differences between RAID-0 and JBOD. We went with JBOD because we figured that RAID-0 has a higher failure rate than JBOD -- any disk
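To make the two points in this snippet concrete, here is a minimal sketch, not Hadoop's actual code. It assumes a simple rotation over a list of volume paths and independent disk failures with a hypothetical per-disk failure probability. The function names, paths, and the 5% figure are all illustrative assumptions. The second function relies on the fact that a RAID-0 stripe set is lost as soon as any one member disk fails.

```python
from itertools import cycle


def round_robin_volumes(volumes, num_blocks):
    """Assign each of num_blocks blocks to a volume in simple rotation.

    Illustrative only: the real datanode logic may also account for
    free space and failed volumes.
    """
    chooser = cycle(volumes)
    return [next(chooser) for _ in range(num_blocks)]


def raid0_loss_probability(p_disk, num_disks):
    """Probability that at least one of num_disks disks fails.

    Assumes independent failures with the same (hypothetical) per-disk
    probability p_disk. For RAID-0 this is the chance of losing the
    whole array; with JBOD a single failure loses only that disk's blocks.
    """
    return 1.0 - (1.0 - p_disk) ** num_disks


if __name__ == "__main__":
    # Hypothetical mount points for three JBOD disks.
    print(round_robin_volumes(["/data1", "/data2", "/data3"], 7))
    # Hypothetical 5% per-disk failure probability, 5 disks per machine.
    print(raid0_loss_probability(0.05, 5))
```

Under these assumptions, the chance of losing a whole stripe set grows with the number of disks in it, while a JBOD layout confines each failure to one disk and HDFS can re-replicate the blocks that lived there.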

Re: A Scale-Out RDF Store for Distributed Processing on Map/Reduce

2008-10-20 Thread Colin Evans
Hi Edward, At Metaweb, we're experimenting with storing raw triples in HDFS flat files, and have written a simple query language and planner that executes the queries with chained map-reduce jobs. This approach works well for warehousing triple data, and doesn't require HBase. Queries may

Re: A Scale-Out RDF Store for Distributed Processing on Map/Reduce

2008-10-20 Thread Colin Evans
Engineering, Korea University 1, 5-ga, Anam-dong, Seongbuk-gu, Seoul, 136-713, Republic of Korea TEL : +82-2-3290-3580 - On Tue, Oct 21, 2008 at 10:23 AM, Colin Evans [EMAIL PROTECTED] wrote: Hi Edward, At Metaweb, we're experimenting

Re: Distributed cache Design

2008-10-16 Thread Colin Evans
At Freebase, we're mapping our large graphs into very large files of triples in HDFS and running large queries over them. Hadoop is optimized for processing streaming data off of disk, and we've found that trying to load a multi-GB graph and then access it in a Hadoop task has scaling

Re: LZO and native hadoop libraries

2008-09-30 Thread Colin Evans
There's a patch to get the native targets to build on Mac OS X: http://issues.apache.org/jira/browse/HADOOP-3659 You probably will need to monkey with LDFLAGS as well to get it to work, but we've been able to build the native libs for the Mac without too much trouble. Doug Cutting wrote:

Re: LZO and native hadoop libraries

2008-09-30 Thread Colin Evans
[exec] make[2]: *** [LzoCompressor.lo] Error 1 [exec] make[1]: *** [all-recursive] Error 1 [exec] make: *** [all] Error 2 Any ideas? On Sep 30, 2008, at 11:53 AM, Colin Evans wrote: There's a patch to get the native targets to build on Mac OS X: http://issues.apache.org/jira

Re: LZO and native hadoop libraries

2008-09-30 Thread Colin Evans
wrote: Unfortunately, setting those environment variables did not help my issue. It appears that the HADOOP_LZO_LIBRARY variable is not defined in both LzoCompressor.c and LzoDecompressor.c. Where is this variable supposed to be set? On Sep 30, 2008, at 12:33 PM, Colin Evans wrote: Hi Nathan

Hadoop + Python = Happy

2008-09-23 Thread Colin Evans
Freebase is finally open-sourcing our Jython-based framework for writing map-reduce jobs on Hadoop. Happy tightly embeds Jython into the Hadoop APIs, files off a lot of the sharp edges, and makes writing map-reduce programs a breeze. This is the 0.1 release, but we've been using Happy at

Hadoop presentations at the next Freebase user group meeting

2008-06-12 Thread Colin Evans
from the SEC in Freebase, a talk by Kurt Bollacker on data mining Wikipedia, and a talk by Kirrily Robert on new features in Freebase. Sign up if you're planning on coming - space can be limited. http://upcoming.yahoo.com/event/760574 Thanks Colin Evans

RAID-0 vs. JBOD?

2008-04-10 Thread Colin Evans
We're building a cluster of 40 machines with 5 drives each, and I'm curious what people's experiences have been for using RAID-0 for HDFS vs. configuring separate partitions (JBOD) and having the datanode balance between them. I took a look at the datanode code, and datanodes appear to write

Re: map/reduce function on xml string

2008-03-04 Thread Colin Evans
Here's the code. If folks are interested, I can submit it as a patch as well. Prasan Ary wrote: Colin, Is it possible that you share some of the code with us? thx, Prasan Colin Evans [EMAIL PROTECTED] wrote: We ended up subclassing TextInputFormat and adding a custom

Re: Question on DFS block placement and 'what is a rack' wrt DFS block placement

2008-02-12 Thread Colin Evans
:19 PM, Colin Evans [EMAIL PROTECTED] wrote: The big question for me is how well a dual-CPU 4-core (8 cores per box) configuration will do. Has anyone tried out this configuration with Intel or AMD CPUs? Is the memory throughput sufficient?

Re: Question on DFS block placement and 'what is a rack' wrt DFS block placement

2008-02-12 Thread Colin Evans
Because of acquiring servers of different capacities at different times, we have 2 servers with 1TB of disk each, and 11 servers with ~300GB each. The 1TB servers tend to be under-utilized by HDFS given their capacity. This makes sense, as block replicas need to be relatively evenly

Re: hadoop: how to find top N frequently occurring words

2008-02-04 Thread Colin Evans
Hi Ted, I've been building out a similar framework in JavaScript (Rhino) for work that I've been doing at MetaWeb, and we've been thinking about open sourcing it too. It's pretty clear that there are major benefits to using a dynamic scripting language with Hadoop. I'd love to see how