Hi Jon,

Chukwa can take files from a directory and ship them to HDFS, with some limitations. First, the data within a directory needs to be of the same type. Second, Chukwa does not ship identical copies of files to HDFS: it extracts files into records before the data is shipped to HDFS or HBase. The files written to HDFS are optimized for MapReduce jobs because they are closed at fixed intervals. The assumption is that the collector creates files of similar size, so that MapReduce tasks take roughly equal time and parallelize evenly. Chukwa is designed to ship entries of records from log files; it may not perform well shipping Word documents or images. Flume, by contrast, is designed to ship original files. So if your requirement is to ship original files rather than records, Flume may be the better choice for that problem.
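If all you need for now is to detect files landing in a directory (independent of Chukwa's record pipeline), a minimal standalone sketch using Java's standard WatchService might look roughly like this; the directory path and the println are placeholders for whatever handoff you end up doing:

    import java.nio.file.*;

    public class DirWatcher {
        public static void main(String[] args) throws Exception {
            // Hypothetical drop directory; replace with your own.
            Path dir = Paths.get("/var/log");
            WatchService watcher = FileSystems.getDefault().newWatchService();
            // Register for file-creation events only.
            dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            while (true) {
                WatchKey key = watcher.take(); // blocks until an event arrives
                for (WatchEvent<?> event : key.pollEvents()) {
                    // The event context is the new file's path relative to dir.
                    Path created = dir.resolve((Path) event.context());
                    System.out.println("New file detected: " + created);
                    // A real pipeline would hand the file to an HDFS writer here.
                }
                if (!key.reset()) {
                    break; // directory is no longer accessible
                }
            }
        }
    }

This is just the detection step; shipping the detected files into HDFS is what Chukwa or Flume would handle for you.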
For testing purposes, tailing files in a directory can be achieved by sending this command to the Chukwa agent's control port (9093 by default), e.g. over telnet:

    add DirTailingAdaptor logs /var/log/ *.log filetailer.CharFileTailingAdaptorUTF8 0

This will spawn a CharFileTailingAdaptorUTF8 for each log file in the directory. If a log file is removed, its adaptor is automatically shut down.

Hope this helps.

regards,
Eric

On Fri, Jun 27, 2014 at 1:38 PM, Jonathan Mervine <jmerv...@rcanalytics.com> wrote:
> Hey, I came across Chukwa from a blog post, and it looks like there is a
> real effort in collecting data from multiple sources and pumping it into
> HDFS.
>
> I was looking at this PDF from the wiki:
> https://wiki.apache.org/hadoop/Chukwa?action=AttachFile&do=view&target=ChukwaPoster.pdf
>
> The chart in the middle seems to imply that two of the agents you can
> have are one that takes in streaming data and one that is associated with
> Log4J and works with log files in particular.
>
> I'm pretty new to Hadoop, so I'm trying to learn a lot about it in a
> short time, but what I'm looking for is some kind of system that will
> monitor a directory somewhere for files being placed there. I don't know
> what kind of files they could be: CSVs, PSVs, docs, TXTs, and many
> others. A later stage would be formatting, parsing, and analyzing, but for
> now I just want to be able to detect when a file is placed there. After a
> file has been detected, it should be sent on its way to be placed into
> HDFS. This should be a completely autonomous and automatic process (or
> as much as possible).
>
> Is this something Chukwa can help me with? If not, do you know of any
> system that might do what I want? I've read a little about Oozie,
> Falcon, Flume, Scribe, and a couple of other projects, but I don't think
> I've found what I'm looking for. Any information you could provide to
> help me on my way or clear up any misunderstanding I may have would be
> great!
>
> Thanks
> jmerv...@rcanalytics.com