Hi Otis,

Flume was designed as a streaming event transport system, not as a general-purpose file transfer system. The two have quite different characteristics. Binary files could be transported by Flume, but if you tried to transport a 100MB PDF as a single event you may run into issues around memory allocation, GC, transfer speed, etc., since we hold at least one event at a time in memory. However, if you want to transfer a large log file and each line is an event, then it's a perfect use case, because you care about the individual events more than the file itself.
For transferring very large binary files that are not events or records, you may want to look for something that is good at being a single-hop system with resume capability, like rsync, to transfer the files. Then I suppose you could use the hadoop fs shell and a small script to store the data onto HDFS. You probably wouldn't need all the fancy tagging, routing, and serialization features that Flume has.

Hope this helps.

Regards,
Mike

On Sun, Oct 14, 2012 at 5:49 PM, Otis Gospodnetic <[email protected]> wrote:
> Hi,
>
> We're considering using Flume for transport of potentially large
> "documents" (think documents that can be as small as tweets or as large as
> PDF files).
>
> I'm wondering if Flume is suitable for transporting potentially large
> documents (in the most reliable mode, too) or if there is something
> inherent in Flume that makes it a poor choice for this use case?
>
> Thanks,
> Otis
> ----
> Performance Monitoring for Solr / ElasticSearch / HBase -
> http://sematext.com/spm
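[Editor's sketch of the rsync-then-HDFS idea above. All hostnames and paths are made-up placeholders, not anything from the thread; the commands are echoed rather than executed so the sketch is safe to paste as-is — drop the leading `echo` on each line to actually run it.]

```shell
#!/bin/sh
# Rough sketch: pull large binary files with rsync (resumable, single-hop),
# then load the staged copies into HDFS with the hadoop fs shell.

SRC="user@remotehost:/data/docs/"   # placeholder source
DEST_DIR="/tmp/staging"             # placeholder local staging directory
HDFS_DIR="/archive/docs"            # placeholder HDFS target

# --partial keeps partially transferred files so an interrupted copy can resume.
echo rsync -avz --partial "$SRC" "$DEST_DIR"

# Create the target directory in HDFS (if needed) and copy the files in.
echo hadoop fs -mkdir -p "$HDFS_DIR"
echo hadoop fs -put "$DEST_DIR"/* "$HDFS_DIR"
```

A wrapper like this could run from cron, giving you the transfer without any of Flume's event-level machinery.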
