Hi, Harsh makes a good point: there is no explicit way to say "these files should remain in memory". However, I would note that given available RAM on the datanodes, the operating system will cache recently accessed blocks.
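Since the question really boils down to "can I use HDFS as a plain filesystem", it may help to see that reading a file through the FileSystem API involves no MapReduce at all. Here is a minimal sketch in Java; the namenode address and image path are placeholders, not anything from your setup:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsImageRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder namenode address; normally picked up from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:8020");
    FileSystem fs = FileSystem.get(conf);

    // Stream one archived image to stdout; no MapReduce involved.
    Path image = new Path("/archive/tiffs/example.tif"); // hypothetical path
    FSDataInputStream in = fs.open(image);
    try {
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
    fs.close();
  }
}

Repeated reads of the same files would then be served largely out of the datanodes' page cache, which is the caching behavior I mentioned above.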
Brock

On Mon, Oct 15, 2012 at 3:08 PM, Harsh J <[email protected]> wrote:
> Hey Matt,
>
> What do you mean by 'real-time' though? While HDFS has pretty good
> contiguous data read speeds (and you get N x replicas to read from),
> if you're looking to "cache" frequently accessed files in memory,
> HDFS does not natively support that. Otherwise, I agree with Brock;
> it seems like you could make it work with HDFS (sans MapReduce - no
> need to run it if you don't need it).
>
> NameNode audit logging will help with your file-access analysis
> requirement.
>
> On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <[email protected]> wrote:
>> Hi,
>>
>> I am a new Hadoop user and would really appreciate your opinions on whether
>> Hadoop is the right tool for what I'm thinking of using it for.
>>
>> I am investigating options for scaling an archive of around 100 TB of image
>> data. These images are typically TIFF files of around 50-100 MB each and need
>> to be made available online in real time. Access to the files will be
>> sporadic and occasional, but writing the files will be a daily activity.
>> Speed of write is not particularly important.
>>
>> Our previous solution was a monolithic, expensive - and very full - SAN, so I
>> am excited by Hadoop's distributed, extensible, redundant architecture.
>>
>> My concern is that much of the discussion of, and use cases for, Hadoop
>> concerns data processing with MapReduce and - from what I understand -
>> using HDFS as input for MapReduce jobs. My other concern is the vague
>> indication that it's not a 'real-time' system. We may use MapReduce in
>> small components of the application, but most likely for file-access
>> analysis rather than any processing of the files themselves.
>>
>> In other words, what I really want is a distributed, resilient, scalable
>> filesystem.
>>
>> Is Hadoop suitable if we use just this facility, or would I be misusing it
>> and inviting grief?
>>
>> M
>
>
>
> --
> Harsh J

--
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
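P.S. On Harsh's audit-logging point: with the stock audit logger, each read shows up in the NameNode audit log as a line containing cmd=open and src=<path> (that field layout is an assumption to verify against your own logs). A small tally over those lines would cover the access-analysis requirement without MapReduce, e.g.:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AuditLogTally {
  // Matches the cmd= and src= fields of an audit line; assumed layout.
  private static final Pattern OPEN =
      Pattern.compile("cmd=open\\s+src=(\\S+)");

  public static void main(String[] args) throws Exception {
    // args[0]: path to a copy of the NameNode audit log.
    Map<String, Integer> reads = new HashMap<String, Integer>();
    BufferedReader log = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = log.readLine()) != null) {
      Matcher m = OPEN.matcher(line);
      if (m.find()) {
        Integer n = reads.get(m.group(1));
        reads.put(m.group(1), n == null ? 1 : n + 1);
      }
    }
    log.close();
    // Print read counts per file, tab-separated.
    for (Map.Entry<String, Integer> e : reads.entrySet()) {
      System.out.println(e.getValue() + "\t" + e.getKey());
    }
  }
}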
