I'd be happy to, if and when it becomes a real thing. It's still very alpha quality right now.

On 8/7/13 10:58 AM, Russell Jurney wrote:
David, can you share the code on Github so we can take a look? This
sounds awesome.

Russell Jurney http://datasyndrome.com

On Aug 7, 2013, at 7:49 AM, Jun Rao <jun...@gmail.com> wrote:

David,

That's interesting. Kafka provides an infinite stream of data whereas Pig
works on a finite amount of data. How did you solve the mismatch?

Thanks,

Jun


On Wed, Aug 7, 2013 at 7:41 AM, David Arthur <mum...@gmail.com> wrote:

I've thrown together a Pig LoadFunc to read data from Kafka, so you could
load data like:

QUERY_LOGS = load 'kafka://localhost:9092/logs.query#8' using
com.mycompany.pig.KafkaAvroLoader('com.mycompany.Query');

The path part of the URI is the Kafka topic, and the fragment is the
number of partitions. The implementation I have makes one input split
per partition. Offsets are not really dealt with at this point - it's
a rough prototype.
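Roughly, the location handling looks something like this (a simplified
sketch, not the exact KafkaAvroLoader code; the class and variable names
here are just illustrative):

import java.net.URI;

// Minimal sketch: how a load location like
// kafka://localhost:9092/logs.query#8 could be decomposed into broker,
// topic, and partition count, with one split per partition as described
// above. Not the actual implementation; names are made up.
public class KafkaLocationSketch {
    public static void main(String[] args) throws Exception {
        URI uri = new URI("kafka://localhost:9092/logs.query#8");

        String broker = uri.getHost() + ":" + uri.getPort();  // localhost:9092
        String topic = uri.getPath().substring(1);            // path -> topic: logs.query
        int partitions = Integer.parseInt(uri.getFragment()); // fragment -> partition count: 8

        // One input split per partition, each one reading a single topic partition.
        for (int p = 0; p < partitions; p++) {
            System.out.printf("split %d -> broker=%s topic=%s partition=%d%n",
                    p, broker, topic, p);
        }
    }
}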

Anyone have thoughts on whether or not this is a good idea? I know the
usual pattern is: kafka -> hdfs -> mapreduce. If I'm only reading this
data from Kafka once, is there any reason why I can't skip writing to HDFS?

Thanks!
-David

