Cool stuff, a Pig Kafka UDF. Russell Jurney http://datasyndrome.com
Begin forwarded message:

*From:* David Arthur <[email protected]>
*Date:* August 7, 2013, 7:41:30 AM PDT
*To:* [email protected]
*Subject:* *Reading Kafka directly from Pig?*
*Reply-To:* [email protected]

I've thrown together a Pig LoadFunc to read data from Kafka, so you could load data like:

QUERY_LOGS = load 'kafka://localhost:9092/logs.query#8'
    using com.mycompany.pig.KafkaAvroLoader('com.mycompany.Query');

The path part of the URI is the Kafka topic, and the fragment is the number of partitions. In the implementation I have, it makes one input split per partition. Offsets are not really dealt with at this point - it's a rough prototype.

Anyone have thoughts on whether or not this is a good idea? I know the usual pattern is: kafka -> hdfs -> mapreduce. If I'm only reading this data from Kafka once, is there any reason why I can't skip writing to HDFS?

Thanks!
-David
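For anyone curious how the URI scheme above decomposes, here's a minimal sketch of the parsing a LoadFunc like this might do in its setLocation step. The class and field names are hypothetical (this is not David's actual code); it only illustrates the convention he describes: path = topic, fragment = partition count, one input split per partition.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper showing how a location string such as
// "kafka://localhost:9092/logs.query#8" could be decomposed.
public class KafkaLocation {
    final String host;
    final int port;
    final String topic;
    final int partitions;

    public KafkaLocation(String location) throws URISyntaxException {
        URI uri = new URI(location);
        this.host = uri.getHost();
        this.port = uri.getPort();
        // The URI path is the Kafka topic; strip the leading "/".
        this.topic = uri.getPath().substring(1);
        // The fragment is the partition count; the LoadFunc described
        // above would create one input split per partition.
        this.partitions = Integer.parseInt(uri.getFragment());
    }
}
```

Parsing "kafka://localhost:9092/logs.query#8" this way yields host "localhost", port 9092, topic "logs.query", and 8 partitions.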
