Thanks guys; really appreciated. I was deliberately vague about the notion of 'real-time' because I didn't know which metrics lead to Hadoop being considered a batch system - if that makes sense!
Essentially, the speed of access to the files stored in HDFS needs to be comparable to reading files off a native file system, since end users will be downloading them. While the bulk of the data on disk will be TIFF files, we will also be including JPEG derivatives which we intend to display inline in a web-based application. Our access patterns are typically sparse: we have millions of files, but each file may be viewed zero or one time over a year, so native in-memory caching isn't much of an issue.

M

On 16 October 2012 09:08, Harsh J <[email protected]> wrote:
> Hey Matt,
>
> What do you mean by 'real-time' though? While HDFS has pretty good
> contiguous data read speeds (and you get N x replicas to read from),
> if you're looking to "cache" frequently accessed files into memory
> then HDFS does not natively have support for that. Otherwise, I agree
> with Brock, seems like you could make it work with HDFS (sans
> MapReduce - no need to run it if you don't need it).
>
> The presence of NameNode audit logging will help your file access
> analysis requirement.
>
> On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <[email protected]> wrote:
> > Hi,
> >
> > I am a new Hadoop user, and would really appreciate your opinions on
> > whether Hadoop is the right tool for what I'm thinking of using it for.
> >
> > I am investigating options for scaling an archive of around 100Tb of
> > image data. These images are typically TIFF files of around 50-100Mb
> > each and need to be made available online in realtime. Access to the
> > files will be sporadic and occasional, but writing the files will be a
> > daily activity. Speed of write is not particularly important.
> >
> > Our previous solution was a monolithic, expensive - and very full -
> > SAN, so I am excited by Hadoop's distributed, extensible, redundant
> > architecture.
> >
> > My concern is that a lot of the discussion on and use cases for Hadoop
> > is regarding data processing with MapReduce and - from what I
> > understand - using HDFS for the purpose of input for MapReduce jobs.
> > My other concern is the vague indication that it's not a 'real-time'
> > system. We may be using MapReduce in small components of the
> > application, but it will most likely be in file access analysis rather
> > than any processing on the files themselves.
> >
> > In other words, what I really want is a distributed, resilient,
> > scalable filesystem.
> >
> > Is Hadoop suitable if we just use this facility, or would I be misusing
> > it and inviting grief?
> >
> > M
>
> --
> Harsh J

--
Matt Painter
[email protected]
+64 21 115 9378
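
For what it's worth, serving a file straight out of HDFS to an end user is just a stream copy with the Java FileSystem API - no MapReduce involved. A minimal sketch, assuming a NameNode at hdfs://namenode:8020 and an /archive/images path (both illustrative, not from the thread):

    // Sketch: stream a file from HDFS to any OutputStream, e.g. an
    // HTTP response for inline JPEG display. The NameNode URI and
    // file path below are assumptions for illustration only.
    import java.io.OutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsDownload {
        public static void copyToStream(OutputStream out) throws Exception {
            Configuration conf = new Configuration();
            // "fs.default.name" on older 1.x releases
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            FileSystem fs = FileSystem.get(conf);
            FSDataInputStream in = fs.open(new Path("/archive/images/sample.jpg"));
            try {
                // Copy in 4 KB chunks; never buffers the whole file in memory.
                IOUtils.copyBytes(in, out, 4096, false);
            } finally {
                in.close();
                fs.close();
            }
        }
    }

Reads go directly from the DataNodes holding the blocks, so throughput for large contiguous files like these TIFFs is close to native disk speed.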
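On the audit-logging point Harsh raises: the NameNode emits one audit line per file-system operation once the audit logger is enabled in its log4j.properties. A sketch of the usual configuration - the logger name is the stock FSNamesystem audit logger, while the appender settings and file path below are assumptions to adjust locally:

    # Enable HDFS audit logging (NameNode log4j.properties)
    hdfs.audit.logger=INFO,RFAAUDIT
    log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
    log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false
    log4j.appender.RFAAUDIT=org.apache.log4j.RollingFileAppender
    log4j.appender.RFAAUDIT.File=/var/log/hadoop/hdfs-audit.log
    log4j.appender.RFAAUDIT.layout=org.apache.log4j.PatternLayout
    log4j.appender.RFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n

Each line records the user, client IP, command, and path (ugi=... ip=... cmd=open src=... dst=... perm=...), which is exactly the raw material a file-access analysis job would consume.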
