Hi,

We use HDFS to process data for the LHC - somewhat similar case here.  Our 
files are a bit larger, our total local data size is ~1PB logical, and we 
"bring our own" batch system, so no Map-Reduce.  We perform many random reads, 
so we are quite sensitive to underlying latency.

I don't see any mismatch between your requirements and HDFS capabilities 
obvious enough to eliminate it as a candidate without an evaluation.  
Do note that HDFS does not provide complete POSIX semantics - but you don't 
appear to need them?

IMHO, if you have the following requirements:
1) Proven petascale data store (never want to be on the bleeding edge of your 
filesystem's scaling!).
2) Has self-healing semantics (can recover from the loss of RAIDs or entire 
storage targets).
3) Open source (but do consider commercial companies - your time is worth 
something!).

You end up looking at a very small number of candidates.  Other filesystems 
that should be on your list:

1) Gluster.  A quite viable alternative.  Like HDFS, you can buy commercial 
support.  I personally don't know enough to provide a pros/cons list, but we 
keep it on our radar.
2) Ceph.  Not as proven IMHO.  I don't know of multiple petascale deployments.  
Requires a quite recent kernel.  Quite good on-paper design.
3) Lustre.  I think you'd be disappointed with the self-healing.  A very 
"traditional" HPC/clustered filesystem design.

For us, HDFS wins.  I think it has the possibility of being a winner in your 
case too.
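
For a rough sense of scale, here is a back-of-the-envelope sketch using the 
numbers from the original post (about 100TB of images at 50-100MB each).  The 
replication factor of 3 is the HDFS default; the helper names and the use of 
decimal units are my own assumptions, purely for illustration.

```python
# Back-of-the-envelope sizing for the archive described in the original post.
# Assumptions (not from the thread): decimal units, HDFS default replication of 3.
TB = 10**12
MB = 10**6


def file_count(archive_bytes, file_bytes):
    """Approximate number of files if every file has the same size."""
    return archive_bytes // file_bytes


def raw_capacity(archive_bytes, replication=3):
    """Raw disk needed to hold the logical data at a given replication factor."""
    return archive_bytes * replication


if __name__ == "__main__":
    archive = 100 * TB
    print("files at 50MB each: ", file_count(archive, 50 * MB))
    print("files at 100MB each:", file_count(archive, 100 * MB))
    print("raw TB at 3x replication:", raw_capacity(archive) // TB)
```

So the archive is on the order of one to two million files and a few hundred 
TB of raw disk, before leaving any headroom for growth.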

Brian

On Oct 15, 2012, at 3:21 PM, Jay Vyas <[email protected]> wrote:

> Seems like a heavyweight solution unless you are actually processing the 
> images? 
> 
> Wow, no MapReduce, no streaming writes, and relatively small files.  I'm 
> surprised that you are considering Hadoop at all?
> 
> I'm surprised there isn't a simpler solution that uses redundancy without all 
> the daemons and name nodes and task trackers and stuff.
> 
> Might make it kind of awkward as a normal file system. 
> 
> On Mon, Oct 15, 2012 at 4:08 PM, Harsh J <[email protected]> wrote:
> Hey Matt,
> 
> What do you mean by 'real-time' though? While HDFS has pretty good
> contiguous data read speeds (and you get N x replicas to read from),
> if you're looking to "cache" frequently accessed files into memory
> then HDFS does not natively have support for that. Otherwise, I agree
> with Brock, seems like you could make it work with HDFS (sans
> MapReduce - no need to run it if you don't need it).
> 
> The presence of NameNode audit logging will help your file access
> analysis requirement.
> 
> On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <[email protected]> wrote:
> > Hi,
> >
> > I am a new Hadoop user, and would really appreciate your opinions on whether
> > Hadoop is the right tool for what I'm thinking of using it for.
> >
> > I am investigating options for scaling an archive of around 100TB of image
> > data. These images are typically TIFF files of around 50-100MB each and need
> > to be made available online in realtime. Access to the files will be
> > sporadic and occasional, but writing the files will be a daily activity.
> > Speed of write is not particularly important.
> >
> > Our previous solution was a monolithic, expensive - and very full - SAN so I
> > am excited by Hadoop's distributed, extensible, redundant architecture.
> >
> > My concern is that a lot of the discussion on and use cases for Hadoop is
> > regarding data processing with MapReduce and - from what I understand -
> > using HDFS for the purpose of input for MapReduce jobs. My other concern is
> > vague indication that it's not a 'real-time' system. We may be using
> > MapReduce in small components of the application, but it will most likely be
> > in file access analysis rather than any processing on the files themselves.
> >
> > In other words, what I really want is a distributed, resilient, scalable
> > filesystem.
> >
> > Is Hadoop suitable if we just use this facility, or would I be misusing it
> > and inviting grief?
> >
> > M
> 
> 
> 
> --
> Harsh J
> 
> 
> 
> -- 
> Jay Vyas
> MMSB/UCHC
