>> sends message per data chunk

Interesting... and a little bit confusing...
I looked through the source code, and it seems that the component is
really intended to send one message per record, not one per chunk.
For per-record delivery, the simplest way is just to configure the
component with the native Hadoop InputFormats (as an additional benefit,
it will be possible to use every InputFormat available on the classpath,
not only Sequence, Map, BloomMap and Array files).

Also, with chunked data transfer it's really difficult to stream data to,
let's say, http, ftp, s3, etc. without storing the intermediate data
locally. In that case the overhead is not only the tmp file itself, but
also copying the same data multiple times without any necessity.

Regards,
Sergey

> Hi,
> related to hdfs2 and normal file, you might find
> that camel sends message per data chunk,
> NOT message per file (which I would expect).
> They probably don't intend to change it.
> It was reported
> as bug https://issues.apache.org/jira/browse/CAMEL-8040 (won't fix)
> and as doc enhancement
> https://issues.apache.org/jira/browse/CAMEL-8150
> (done).
> Btw nice catch with that tmp file :)
> Josef

> On 03/24/2015 09:19 PM, Sergey Zhemzhitsky wrote:
>> Hello,
>>
>> Really interesting question.
>> The answer is this jira issue:
>> https://issues.apache.org/jira/browse/CAMEL-4555
>> and this diff:
>> http://mail-archives.apache.org/mod_mbox/camel-commits/201110.mbox/%[email protected]%3E
>>
>> It would be really great if
>> 1. the component made this feature optional, to be able to stream
>> multigigabyte data from within hdfs directly
>> on a file-by-file basis
>> 2. the component merged the files on the fly without any intermediate
>> storage.
>>
>> Just raised the JIRA: https://issues.apache.org/jira/browse/CAMEL-8542
>>
>> Regards,
>> Sergey
>>
>>> Hi, all!
>>> I'm looking at ways to use hdfs2 component to read files stored in a Hadoop
>>> directory.
>>> As a quite new Hadoop user I assume that the simplest way is when
>>> data is stored in normal file format.
>>> I was looking at the code in
>>> 'org.apache.camel.component.hdfs2.HdfsFileType#NORMAL_FILE' class that is
>>> responsible for creating the input stream and noticed that it will copy the
>>> whole file to the local file system (in a temp file) before opening the input
>>> stream (the case when using an 'hdfs://' URI).
>>> I wonder what the reason behind this is? Isn't it possible that the file can be
>>> very large, and then this operation will be quite costly? Maybe I'm missing
>>> some basic restrictions on using normal files in Hadoop?
>>> Thanks in advance
>>> Alexey
>>
>>
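
P.S. To make the "no intermediate storage" point from this thread concrete, here is a minimal sketch of copying a source stream straight to a target stream through one fixed-size buffer, with no temp file in between. It is only an illustration under stated assumptions and is not how camel-hdfs2 is implemented: the class name DirectStreamSketch and its copy method are hypothetical, and in-memory streams stand in for the HDFS input stream and for the remote (http, ftp, s3, ...) output stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of camel-hdfs2.
final class DirectStreamSketch {

    // Copies everything from 'in' to 'out' through a single reusable buffer.
    // Only 'bufferSize' bytes are held in memory at a time and nothing is
    // written to local disk. Returns the number of bytes copied.
    static long copy(InputStream in, OutputStream out, int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        try {
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
                total += read;
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumption: in-memory streams replace the real source and target.
        byte[] source = new byte[10_000];
        Arrays.fill(source, (byte) 7);
        ByteArrayOutputStream target = new ByteArrayOutputStream();

        long copied = copy(new ByteArrayInputStream(source), target, 4096);
        System.out.println("copied " + copied + " bytes, identical: "
                + Arrays.equals(source, target.toByteArray()));
    }
}
```

With a real cluster the source would be whatever InputStream the Hadoop FileSystem hands back when the file is opened, so memory use stays bounded by the buffer size regardless of the file size.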
