If the goal is simply a cost-effective alternative to a SAN for storing large files, you might want to take a look at Gluster. It is an open-source, scale-out distributed filesystem that can use local storage. It also has distributed metadata and a POSIX interface, and it can be accessed through a number of clients, including FUSE, NFS, and CIFS. Supposedly you can even run Hadoop on top of Gluster.
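Once a volume is mounted (via the FUSE client, say), applications just see an ordinary directory tree, so no special client library is needed. A quick, untested Java sketch to illustrate; the /mnt/gluster mount point is made up:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class GlusterPosixDemo {
    public static void main(String[] args) throws IOException {
        // Assumes the Gluster volume is FUSE-mounted at /mnt/gluster
        // (a made-up mount point); to the JVM it is just a directory.
        Path archive = Paths.get("/mnt/gluster/images");
        Files.createDirectories(archive);

        // Write a file exactly as you would to local disk.
        Path tiff = archive.resolve("scan-0001.tif");
        Files.write(tiff, new byte[] {0x49, 0x49, 0x2A, 0x00}); // TIFF magic, little-endian

        // Read it back.
        byte[] data = Files.readAllBytes(tiff);
        System.out.println(tiff + ": " + data.length + " bytes");
    }
}

The same code would work against NFS or local disk, which is rather the point of the POSIX interface.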
I hope I don't start any sort of flame war by mentioning Gluster on a Hadoop mailing list. Note that I have no vested interest in this particular solution, although I am in the process of evaluating it myself.

From: Jay Vyas <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, October 15, 2012 1:21 PM
To: "[email protected]" <[email protected]>
Subject: Re: Suitability of HDFS for live file store

Seems like a heavyweight solution unless you are actually processing the images? Wow: no MapReduce, no streaming writes, and relatively small files. I'm surprised that you are considering Hadoop at all. I'm surprised there isn't a simpler solution that gives you redundancy without all the daemons and NameNodes and TaskTrackers and so on. It might make it kind of awkward as a normal file system.

On Mon, Oct 15, 2012 at 4:08 PM, Harsh J <[email protected]> wrote:

Hey Matt,

What do you mean by 'real-time', though? While HDFS has pretty good contiguous data read speeds (and you get N x replicas to read from), if you're looking to "cache" frequently accessed files into memory, HDFS does not natively support that.

Otherwise, I agree with Brock: it seems like you could make it work with HDFS (sans MapReduce; there is no need to run it if you don't need it). The presence of NameNode audit logging will help with your file access analysis requirement.

On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <[email protected]> wrote:
> Hi,
>
> I am a new Hadoop user and would really appreciate your opinions on whether
> Hadoop is the right tool for what I'm thinking of using it for.
>
> I am investigating options for scaling an archive of around 100 TB of image
> data. These images are typically TIFF files of around 50-100 MB each and
> need to be made available online in real time. Access to the files will be
> sporadic and occasional, but writing files will be a daily activity.
> Speed of write is not particularly important.
>
> Our previous solution was a monolithic, expensive - and very full - SAN, so
> I am excited by Hadoop's distributed, extensible, redundant architecture.
>
> My concern is that a lot of the discussion of, and use cases for, Hadoop
> concern data processing with MapReduce and - from what I understand -
> using HDFS as input for MapReduce jobs. My other concern is the vague
> indication that it is not a 'real-time' system. We may use MapReduce in
> small components of the application, but it will most likely be for file
> access analysis rather than for any processing of the files themselves.
>
> In other words, what I really want is a distributed, resilient, scalable
> filesystem.
>
> Is Hadoop suitable if we just use this facility, or would I be misusing it
> and inviting grief?
>
> M

--
Harsh J

--
Jay Vyas
MMSB/UCHC
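P.S. On Harsh's point about using HDFS sans MapReduce: the plain FileSystem API would be all that Matt's write-and-serve workflow needs. A rough, untested sketch; the paths are made up and it assumes fs.defaultFS is set in your core-site.xml:

import java.io.FileInputStream;
import java.io.FileOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsArchiveDemo {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a local TIFF into HDFS (paths are made up).
        Path dst = new Path("/archive/images/scan-0001.tif");
        try (FileInputStream in = new FileInputStream("scan-0001.tif");
             FSDataOutputStream out = fs.create(dst, true /* overwrite */)) {
            IOUtils.copyBytes(in, out, 4096, false);
        }

        // Read it back, e.g. to hand to a web tier.
        try (FSDataInputStream in = fs.open(dst);
             FileOutputStream out = new FileOutputStream("scan-0001-copy.tif")) {
            IOUtils.copyBytes(in, out, 4096, false);
        }
    }
}

Replication (dfs.replication, 3 by default) provides the redundancy; none of the MapReduce daemons need to run for any of this.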
