Suiter,
I thought about the solution of using a cron job and hadoop commands. However,
in my system there are some sources (logsys, ...) that already use Flume, so I
prefer Flume for consistency and for its useful features (e.g. multiple
sinks, roll count, ...).
Thanks,
Cuong LUU
On 25/10/2013 00:57, DSuiter RDX wrote:
Luu,
You might want to set up some redundant/load-balancing channels and
sinks, so if one sink is tied up, the operation can be attempted on
another sink. I am not very experienced with that arrangement yet, and
so cannot guide you very much, but have seen that mentioned as a means
to ensure delivery when there is too much going on. The source does
not need to change, since it will replicate to any channels
automatically and the sinks can get their own channels for their input.
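The redundant/load-balancing arrangement described above can be sketched with a Flume sink group. This is only a sketch: the sink names hdfsSink1/hdfsSink2 and the group name g1 are placeholders, and the processor properties follow the standard Flume sink-group syntax:

```
# two sinks fed from their own channels, grouped behind one processor
agent.sinks = hdfsSink1 hdfsSink2
agent.sinkgroups = g1
agent.sinkgroups.g1.sinks = hdfsSink1 hdfsSink2
# load_balance spreads events across the sinks; a failing sink is
# temporarily backed off so delivery is retried on the other one
agent.sinkgroups.g1.processor.type = load_balance
agent.sinkgroups.g1.processor.selector = round_robin
agent.sinkgroups.g1.processor.backoff = true
```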
I'm not certain that Flume is a good way to handle such a large file;
Flume seems to be designed to take in many small events and aggregate
them into larger files, rather than to move one big file.
But if the file you are uploading is already on the local filesystem,
can't you just use a cron entry to run "hadoop fs -put $FILE
$HDFS/INPUT/PATH" to copy it into HDFS?
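As a rough sketch of that cron approach — the paths /local-dir, /local-dir/done, and /user/flume/input are placeholders for your own locations, and the trailing mv (to avoid re-uploading the same files next hour) is one possible bookkeeping choice, not the only one:

```
# crontab entry: every hour, push files from the local dir into HDFS,
# then move the uploaded files aside so they are not put twice
0 * * * * hadoop fs -put /local-dir/* /user/flume/input/ && mv /local-dir/* /local-dir/done/
```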
*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com <http://www.rdx.com/>
On Thu, Oct 24, 2013 at 11:35 AM, ltcuong211 <[email protected]
<mailto:[email protected]>> wrote:
Hi Jeff & JS,
I tried using the spooling directory source & memory channel. It still
takes ~4 minutes to copy 1 GB of data into HDFS.
By the way, thanks for suggesting the spooling source. I think it is
better than exec + cat in my case.
Cuong LUU
On 21/10/2013 22:50, Jeff Lord wrote:
Luu,
Have you tried using the spooling directory source?
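A minimal spooling directory source config might look like this — the source/channel names and the /local-dir/spool path are placeholders to adapt to your setup:

```
agent.sources = spoolSrc
agent.channels = memChannel
agent.sources.spoolSrc.type = spooldir
agent.sources.spoolSrc.channels = memChannel
# directory watched for new, immutable files
agent.sources.spoolSrc.spoolDir = /local-dir/spool
# fully ingested files are renamed with this suffix so they
# are not read again
agent.sources.spoolSrc.fileSuffix = .COMPLETED
```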
-Jeff
On Mon, Oct 21, 2013 at 3:25 AM, Cuong Luu <[email protected]
<mailto:[email protected]>> wrote:
Hi all,
I need to copy data from a local directory (on the Hadoop server) into
HDFS regularly and automatically. This is my Flume config:
agent.sources = execSource
agent.channels = fileChannel
agent.sinks = hdfsSink
agent.sources.execSource.type = exec
agent.sources.execSource.shell = /bin/bash -c
agent.sources.execSource.command = for i in /local-dir/*; do cat $i; done
agent.sources.execSource.restart = true
agent.sources.execSource.restartThrottle = 3600000
agent.sources.execSource.batchSize = 100
...
agent.sinks.hdfsSink.hdfs.rollInterval = 0
agent.sinks.hdfsSink.hdfs.rollSize = 262144000
agent.sinks.hdfsSink.hdfs.rollCount = 0
agent.sinks.hdfsSink.hdfs.batchSize = 100000
...
agent.channels.fileChannel.type = FILE
agent.channels.fileChannel.capacity = 100000
...
While the hadoop command takes about 30 seconds, Flume takes around 4
minutes to copy a 1 GB text file into HDFS. I am worried that either
my config is not good or I shouldn't use Flume in this case.
What is your opinion?