I'm late to the party, and this isn't a Hadoop solution, but apparently Cassandra is pretty good at this:
https://medium.com/walmartlabs/building-object-store-storing-images-in-cassandra-walmart-scale-a6b9c02af593

On Wed, Sep 6, 2017 at 2:48 PM, Ralph Soika <ralph.so...@imixs.com> wrote:

> Hi,
>
> I want to thank you all for your answers and your good ideas on how to
> solve the Hadoop "small-file problem".
>
> I would now like to briefly summarize your answers and suggested
> solutions. First of all, here is my general use case once again:
>
> - An external enterprise application needs to store small photo files at
>   irregular intervals into a clustered big data storage.
> - Users need to read the files through the web interface of the
>   enterprise application, also at irregular intervals.
> - The solution needs to guarantee the data integrity of all files over a
>   long period of time.
> - A REST API is preferred for writing and reading the files.
>
> *1) Multiple small files in one sequence file:*
>
> Packing multiple small files into one sequence file is a possible
> solution, even if it is hard to implement. If the files are added by the
> enterprise application at irregular intervals (as in my case), the
> enterprise application needs to keep track of the current size of the
> sequence file and compute the correct offset. The offset and the size of
> the photo file are needed later to access a single photo file, e.g. via
> WebHDFS:
>
> http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN&offset=<LONG>&length=<LONG>
>
> If multiple threads try to append data in parallel, the problem becomes
> even more complex. But yes - this could be a possible solution.
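To make the bookkeeping in (1) a bit more concrete, below is a minimal Java sketch that appends a photo to a SequenceFile container and records the offset and length of the resulting record. It is only an illustration: the container path, the key format, and the Writer.appendIfExists() option (which needs a reasonably recent Hadoop release) are assumptions, not something from the thread.

    // Illustrative sketch: append one photo to a SequenceFile container and record
    // the offset/length of its record, so the application can locate it again later.
    // The container path, key format and appendIfExists option are assumptions.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PhotoArchiveWriter {

        /** Appends one photo and returns {offset, length} of its record in the container file. */
        public static long[] appendPhoto(Configuration conf, Path container,
                                         String photoName, byte[] photoBytes) throws IOException {
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(container),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class),
                    SequenceFile.Writer.appendIfExists(true))) {

                long offset = writer.getLength();            // record starts at the current file length
                writer.append(new Text(photoName), new BytesWritable(photoBytes));
                long length = writer.getLength() - offset;   // bytes written for this record (incl. framing)
                return new long[] { offset, length };        // persist these in an index for later reads
            }
        }
    }

The recorded offset and length can then be used in a WebHDFS OPEN request as shown above. Note that the byte range returned this way also contains the SequenceFile record header and key, so the caller either has to strip that framing or read the record back through SequenceFile.Reader.seek(offset) instead.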
> *2) Multiple small files in a Hadoop Archive (HAR):*
>
> Another solution is to pack the small files into a Hadoop Archive (HAR)
> file. But this solution is even more difficult to implement in my case. As
> I explained, the enterprise application writes data at irregular
> intervals. This means that the archive job needs to be decoupled from the
> enterprise application. For example, a scheduler could archive and delete
> files older than one day on a daily basis. This would reduce the number of
> small files significantly. The problem here is that the enterprise
> application needs to be aware of the new location of a single photo file.
> To access a 'packed' photo file, the offset and size within the HAR file
> need to be transferred back to the enterprise application. As a result,
> the complexity of the overall system increases unreasonably. To decouple
> things, the scheduler could create a kind of index file for each newly
> created HAR file. The index could be used by the enterprise application to
> look up the file path, offset and size. But since a single photo file can
> now either still be stored as a small file or already be part of a HAR
> file, the access method becomes quite tricky to implement.
> OK, this solution is possible, but the sequence file solution seems to be
> much easier.
>
> *3) The object store "OpenStack Swift":*
>
> It seems that the object store "OpenStack Swift" solves the small-file
> problem much better. It is certainly worth following this approach.
> However, since I am basically convinced of Hadoop, I will not make a
> fundamental change in architecture for now.
>
> *4) Intel-bigdata/SSM:*
>
> The "Transparent Small Files Support" from the Intel-Bigdata project is an
> interesting approach, and I believe it would solve my problems entirely.
> But I fear it is too early to start with it.
>
> *5) HDFS-7240 or Ozone:*
>
> HDFS-7240 or Ozone looks very promising. It looks to me that Ozone is the
> missing piece in the Hadoop project. Although it is not yet ready for use
> in production, I will follow this project.
>
> *6) MapR-FS:*
>
> MapR-FS could be an alternative, but I do not consider it here.
>
> *My conclusion:*
>
> So for my use case, in the short term the best solution seems to be to
> start with the sequence file approach. In the intermediate term I will see
> whether I can adapt my solution to the Hadoop Ozone project. In the long
> term I will probably support both approaches.
> Since my solution is part of the open source project Imixs-Workflow, I
> will also publish it on GitHub.
>
> So once again - thanks a lot for your help.
>
> Ralph
>
>
> On 04.09.2017 19:03, Ralph Soika wrote:
>
> Hi,
>
> I know that the small-file problem has been discussed frequently, not only
> on this mailing list. I have also already read some books about Hadoop and
> have started to work with it. But I still do not really understand whether
> Hadoop is the right choice for my goals.
>
> To simplify my problem domain, I would like to use the use case of a photo
> archive:
>
> - An external application produces about 10 million photos in one year.
>   The files contain important, business-critical data.
> - A single photo file has a size between 1 and 10 MB.
> - The photos need to be stored over several years (10-30 years).
> - The data store should support replication over several servers.
> - A checksum concept is needed to guarantee the data integrity of all
>   files over a long period of time.
> - A REST API is preferred for writing and reading the files.
>
> So far, Hadoop seems to be absolutely the perfect solution. But my last
> requirement seems to throw Hadoop out of the race:
>
> - The photos need to be readable with very low latency from an external
>   enterprise application.
>
> With Hadoop HDFS and the web proxy everything seems perfect. But it seems
> that most Hadoop experts advise against this usage if the size of my data
> files (1-10 MB) is well below the Hadoop block size of 64 or 128 MB.
>
> I think I have understood the concepts of HAR and sequence files. But if I
> pack, for example, my files together into a large file of many gigabytes,
> it becomes impossible to access a single photo from the Hadoop repository
> in a reasonable time. It makes no sense in my eyes to pack thousands of
> files into a large file just so that Hadoop jobs can handle it better. For
> simply accessing a single file from a web interface - as in my case - it
> seems counterproductive.
>
> So my question is: Is Hadoop only feasible for archiving large web-server
> log files, and not designed to handle big archives of small files that
> also contain business-critical data?
>
> Thanks for your advice in advance.
>
> Ralph
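And a correspondingly rough sketch of the read path discussed above: the external application fetches a single stored photo record over the WebHDFS REST API using the offset and length recorded at write time. The host, port, file path and the offset/length values are illustrative assumptions, and the returned bytes still include the SequenceFile record framing.

    // Illustrative sketch: read one photo record back over the WebHDFS REST API,
    // using the offset/length stored when the photo was appended. Host, port,
    // file path and the offset/length values below are purely illustrative.
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class WebHdfsPhotoReader {

        public static byte[] readRange(String host, int port, String file,
                                       long offset, long length) throws IOException {
            URL url = new URL(String.format(
                    "http://%s:%d/webhdfs/v1%s?op=OPEN&offset=%d&length=%d",
                    host, port, file, offset, length));
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setInstanceFollowRedirects(true);   // WebHDFS redirects the read to a datanode
            try (InputStream in = conn.getInputStream()) {
                return in.readAllBytes();            // Java 9+; bytes still carry the record framing
            } finally {
                conn.disconnect();
            }
        }

        public static void main(String[] args) throws IOException {
            byte[] record = readRange("namenode.example.com", 9870,  // 9870: Hadoop 3 NameNode HTTP port
                                      "/photos/archive.seq", 1048576L, 2097152L);
            System.out.println("Fetched " + record.length + " bytes");
        }
    }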