I'm late to the party, and this isn't a Hadoop solution, but apparently
Cassandra is pretty good at this.

https://medium.com/walmartlabs/building-object-store-storing-images-in-cassandra-walmart-scale-a6b9c02af593



On Wed, Sep 6, 2017 at 2:48 PM, Ralph Soika <ralph.so...@imixs.com> wrote:

> Hi
>
> I want to thank you all for your answers and your good ideas on how to solve
> the Hadoop "small-file problem".
>
> Now I would like to briefly summarize your answers and the suggested
> solutions. First of all, I will describe my general use case once again:
>
>
>    - An external enterprise application needs to store small photo files
>    in a clustered big data storage at irregular intervals.
>    - Users need to read the files through the web interface of the
>    enterprise application, also at irregular intervals.
>    - The solution needs to guarantee the data integrity of all files over
>    a long period of time.
>    - A REST API is preferred for writing and reading the files.
>
>
>
> *1) Multiple small-files in one sequence file:*
>
> Packing multiple small files into one sequence file is a possible solution,
> even if it is hard to implement. If the files are added by the
> enterprise application at irregular intervals (as in my case), the
> enterprise application needs to be aware of the latest file size of the
> sequence file and needs to compute the correct offset. The offset and the
> size of the photo file are needed to access a single photo file later, e.g.
> via WebHDFS:
>
> http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN[&offset=<LONG>][&length=<LONG>]
>
> If multiple threads try to append data in parallel, the problem becomes
> even more complex. But yes - this could be a possible solution.
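>
> Just to sketch what I have in mind (a minimal, hypothetical example assuming
> a Hadoop 2.x client on the classpath; class and method names are made up and
> not part of any final implementation):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.BytesWritable;
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.io.Text;
>
> public class PhotoSequenceStore {
>
>     // Appends one photo and returns {offset, length} of the new record.
>     // The enterprise application would have to persist these two values.
>     static long[] appendPhoto(Configuration conf, Path seqFile,
>                               String name, byte[] photoBytes) throws Exception {
>         try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
>                 SequenceFile.Writer.file(seqFile),
>                 SequenceFile.Writer.keyClass(Text.class),
>                 SequenceFile.Writer.valueClass(BytesWritable.class),
>                 SequenceFile.Writer.appendIfExists(true))) {
>             long offset = writer.getLength();    // the new record starts here
>             writer.append(new Text(name), new BytesWritable(photoBytes));
>             return new long[] { offset, writer.getLength() - offset };
>         }
>     }
>
>     // Reads a single photo back by seeking directly to the stored offset.
>     static byte[] readPhoto(Configuration conf, Path seqFile, long offset)
>             throws Exception {
>         try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
>                 SequenceFile.Reader.file(seqFile))) {
>             reader.seek(offset);                 // offset must be a record boundary
>             Text key = new Text();
>             BytesWritable value = new BytesWritable();
>             reader.next(key, value);
>             return value.copyBytes();
>         }
>     }
> }
>
> Note that the byte range returned by the WebHDFS OPEN call above would contain
> the raw sequence file record (key and length headers included), so a thin
> service that uses SequenceFile.Reader as sketched here might be easier for the
> web interface than parsing the record bytes on the client side.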
>
>
> *2) Multiple small-files in a Hadoop Archive (HAR):*
>
> Another solution is to pack the small files into a Hadoop Archive (HAR)
> file. But this solution is even more difficult to implement in my case. As
> I explained, the enterprise application writes data at irregular intervals.
> This means that the archive job needs to be decoupled from the enterprise
> application. For example, a scheduler could archive and delete files older
> than one day on a daily basis. This would reduce the number of small files
> significantly. The problem here is that the enterprise application needs to
> be aware of the new location of a single photo file. To access a 'packed'
> photo file, the offset and size in the HAR file need to be transferred back
> to the enterprise application. As a result, the complexity of the overall
> system increases unreasonably. To decouple things, the scheduler could
> create a kind of index file for each newly created HAR file. The index
> could be used by the enterprise application to look up the file path,
> offset, and size. But since a single photo file can now be stored either
> still as a small file or already as part of a HAR file, the access method
> becomes quite tricky to implement.
> OK, the solution is possible, but the sequence file solution seems to be
> much easier.
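>
> A rough sketch of such an access method could look like this (the paths
> /photos/incoming and /photos/archive/<day>.har as well as the class name are
> purely hypothetical): the application first checks whether the photo is still
> stored as a plain small file and otherwise falls back to the HAR. If this
> lookup runs inside a small service that uses the Hadoop FileSystem API, the
> har:// scheme resolves offset and size from the archive's own index files, so
> only the name of the HAR file has to be known.
>
> import java.io.ByteArrayOutputStream;
> import java.io.IOException;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.IOUtils;
>
> public class PhotoLookup {
>
>     // Opens a photo that is either still a plain small file or already
>     // archived into a daily HAR created by the scheduler.
>     static byte[] readPhoto(Configuration conf, String name, String day)
>             throws IOException {
>         Path rawPath = new Path("/photos/incoming/" + name);
>         FileSystem fs = rawPath.getFileSystem(conf);
>         Path path = fs.exists(rawPath)
>                 ? rawPath
>                 : new Path("har:///photos/archive/" + day + ".har/" + name);
>         try (FSDataInputStream in = path.getFileSystem(conf).open(path);
>              ByteArrayOutputStream bos = new ByteArrayOutputStream()) {
>             IOUtils.copyBytes(in, bos, conf, false);
>             return bos.toByteArray();
>         }
>     }
> }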
>
>
>
> *3) The object store "OpenStack Swift":*
>
> It seems that the object store "OpenStack Swift" solves the small-file
> problem much better. It is certainly worth following this approach.
> However, since I am basically convinced of Hadoop, I will not make a
> fundamental change in architecture for now.
>
>
> *4) Intel-bigdata/SSM:*
>
> The "Transparent Small Files Support" form the Intel-Bigdata project is an
> interesting approach and I believe this would solve my problems at all. But
> I fear it is to early to start here.
>
>
>
> *5) HDFS-7240 or Ozone:*
>
> HDFS-7240, or Ozone, looks very promising. It seems to me that Ozone is
> the missing piece in the Hadoop project. Although it is not yet ready for
> use in production, I will follow this project.
>
>
> *6) mapR-fs:*
> mapR-fs could be an alternative, but I do not consider it here.
>
>
>
>
> *My conclusion: *
>
> So for my use case, starting with the sequence file approach seems to be the
> best short-term solution. In the intermediate term I will see if I
> can adapt my solution to the Hadoop Ozone project. In the long term I will
> probably support both approaches.
> Since my solution is part of the open source project Imixs-Workflow, I
> will publish it on GitHub as well.
>
>
> So once again - thanks a lot for your help.
>
> Ralph
>
>
>
>
> On 04.09.2017 19:03, Ralph Soika wrote:
>
> Hi,
>
> I know that the small-file problem has been asked about frequently,
> not only on this mailing list.
> I have also already read some books about Hadoop and started to
> work with Hadoop. But I still do not really understand whether Hadoop is the
> right choice for my goals.
>
> To simplify my problem domain, I will describe it using the use case of a
> photo archive:
>
> - An external application produces about 10 million photos per year.
> The files contain important business-critical data.
> - A single photo file has a size between 1 and 10 MB.
> - The photos need to be stored for several years (10-30 years).
> - The data store should support replication over several servers.
> - A checksum concept is needed to guarantee the data integrity of all
> files over a long period of time (see the sketch after this list).
> - A REST API is preferred for writing and reading the files.
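>
> As an illustration of the checksum concept (just a hypothetical sketch,
> independent of the block checksums HDFS already maintains internally): the
> application could store a SHA-256 digest as metadata next to each photo and
> verify it again after every read.
>
> import java.security.MessageDigest;
>
> public class PhotoChecksum {
>
>     // Computes a SHA-256 digest that is stored when the photo is written
>     // and compared again on every read to detect silent corruption.
>     static String sha256Hex(byte[] photoBytes) throws Exception {
>         byte[] digest = MessageDigest.getInstance("SHA-256").digest(photoBytes);
>         StringBuilder hex = new StringBuilder();
>         for (byte b : digest) {
>             hex.append(String.format("%02x", b));
>         }
>         return hex.toString();
>     }
> }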
>
> So far Hadoop seems to be absolutely the perfect solution. But my last
> requirement seems to throw Hadoop out of the race.
>
> - The photos need to be readable with very short latency from an external
> enterprise application
>
> With Hadoop HDFS and the Web Proxy everything seems perfect. But it seems
> that most Hadoop experts advise against this usage if the size of my
> data files (1-10 MB) is well below the Hadoop block size of 64 or 128 MB.
>
> I think I understood the concepts of HAR and sequence files.
> But if I pack, for example, my files together into a large file of many
> gigabytes, it is impossible to access one single photo from the Hadoop
> repository in a reasonable time. It makes no sense in my eyes to pack
> thousands of files into a large file just so that Hadoop jobs can handle it
> better. For simply accessing a single file from a web interface - as in my
> case - this seems counterproductive.
>
> So my question is: Is Hadoop only feasible for archiving large web-server log
> files, and not designed to handle big archives of small files containing
> business-critical data?
>
>
> Thanks for your advice in advance.
>
> Ralph
> --
>
>
> --
> *Imixs*...extends the way people work together
> We are an open source company, read more at: www.imixs.org
> ------------------------------
> Imixs Software Solutions GmbH
> Agnes-Pockels-Bogen 1, 80992 München
> *Web:* www.imixs.com
> *Office:* +49 (0)89-452136 16 *Mobil:* +49-177-4128245
> Registergericht: Amtsgericht Muenchen, HRB 136045
> Geschaeftsfuehrer: Gaby Heinle u. Ralph Soika
>
