Hi Jon,

Chukwa can take files from a directory and ship them to HDFS, with some limitations. First, the data within a directory needs to be of the same type. Second, Chukwa does not ship identical copies of files to HDFS: it extracts files into records before the data is shipped to HDFS or HBase. The files written to HDFS are optimized for MapReduce jobs because they are closed at fixed intervals. The assumption is that the collector creates files of similar size, so that MapReduce tasks take roughly equal time and parallelize evenly. Chukwa is designed to ship entries of records from log files; it may not perform well shipping Word documents or images. Flume, by contrast, is designed to ship original files. So if your requirement is to ship original files rather than records, Flume may be the better choice for that problem.
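If all you need for now is to detect files landing in a directory (independent of Chukwa's record pipeline), a minimal standalone sketch using Java's standard WatchService might look roughly like this; the directory path and the println are placeholders for whatever handoff you end up doing:

    import java.nio.file.*;

    public class DirWatcher {
        public static void main(String[] args) throws Exception {
            // Hypothetical drop directory; replace with your own.
            Path dir = Paths.get("/var/log");
            WatchService watcher = FileSystems.getDefault().newWatchService();
            // Register for file-creation events only.
            dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            while (true) {
                WatchKey key = watcher.take(); // blocks until an event arrives
                for (WatchEvent<?> event : key.pollEvents()) {
                    // The event context is the new file's path relative to dir.
                    Path created = dir.resolve((Path) event.context());
                    System.out.println("New file detected: " + created);
                    // A real pipeline would hand the file to an HDFS writer here.
                }
                if (!key.reset()) {
                    break; // directory is no longer accessible
                }
            }
        }
    }

This is just the detection step; shipping the detected files into HDFS is what Chukwa or Flume would handle for you.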
For testing purposes, tailing files in a directory can be achieved by sending this command to the Chukwa agent's control port (9093 by default), e.g. over telnet:

    add DirTailingAdaptor logs /var/log/ *.log filetailer.CharFileTailingAdaptorUTF8 0

This will spawn a CharFileTailingAdaptorUTF8 for each log file in the directory. If a log file is removed, its adaptor is automatically shut down.

Hope this helps.

regards,
Eric

On Fri, Jun 27, 2014 at 1:38 PM, Jonathan Mervine <jmerv...@rcanalytics.com> wrote:
> Hey, I came across Chukwa from a blog post, and it looks like there is a
> real effort in collecting data from multiple sources and pumping it into
> HDFS.
>
> I was looking at this PDF from the wiki:
> https://wiki.apache.org/hadoop/Chukwa?action=AttachFile&do=view&target=ChukwaPoster.pdf
>
> The chart in the middle seems to imply that two of the agents you can
> have are one that takes in streaming data and one that is associated with
> Log4J and works with log files in particular.
>
> I'm pretty new to Hadoop, so I'm trying to learn a lot about it in a
> short time, but what I'm looking for is some kind of system that will
> monitor a directory somewhere for files being placed there. I don't know
> what kind of files they could be: CSVs, PSVs, docs, TXTs, and many
> others. A later stage would be formatting, parsing, and analyzing, but for
> now I just want to be able to detect when a file is placed there. After a
> file has been detected, it should be sent on its way to be placed into
> HDFS. This should be a completely autonomous and automatic process (or
> as much as possible).
>
> Is this something Chukwa can help me with? If not, do you know of any
> system that might do what I want? I've read a little about Oozie,
> Falcon, Flume, Scribe, and a couple of other projects, but I don't think
> I've found what I'm looking for. Any information you could provide to
> help me on my way or clear up any misunderstanding I may have would be
> great!
>
> Thanks
> jmerv...@rcanalytics.com