Re: Kafka connect HDFS connector
Thanks Ewen. We decided to update the producer side of our application to use the schema registry and post Avro messages, and I am now able to store Avro messages in HDFS using Connect. I have a couple more questions:

1) I am using TimeBasedPartitioner and trying to store data in hourly buckets, but the rotation for a particular hour XX happens only during hour XX+1, which is a problem when I have batch jobs reading data off the /XX bucket. For example, I have rotate.interval.ms=60 (5 minutes):

- at 3:58, one file gets rotated under //MM/dd/03 in HDFS
- at 4:03, one file gets rotated under //MM/dd/04 in HDFS for data from 4:00 to 4:03, and one file gets rotated under //MM/dd/03 in HDFS for data from 3:58 to 4:00

In this case, if I have an hourly batch job starting at 4:00 to process //MM/dd/03, it would miss one file.

Below is my connector config:

name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=raw-message-avro
hdfs.url=hdfs://localhost:8020
topics.dir=/raw/avro/hourly/
flush.size=1
partitioner.class=io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner
partition.duration.ms=12
rotate.interval.ms=60
timezone=UTC
path.format=/MM/dd/HH/
locale=US

2) Can I control the file commit based on size, as Flume does? Right now I only see flush.size and rotate.interval.ms related to file commit/flush. Is there any other config I am missing?

Thanks,
Venkatesh

On Tue, Feb 23, 2016 at 9:09 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote:

> Consuming plain JSON is a bit tricky for something like HDFS because all
> the output formats expect the data to have a schema. You can read the
> JSON data with the provided JsonConverter, but it'll be returned without
> a schema. The HDFS connector will currently fail on this because it
> expects a fixed structure.
>
> Note however that it *does not* depend on the data already being in Avro
> format. Kafka Connect is specifically designed to abstract away the
> serialization format of data in Kafka so that connectors don't need to
> be written a half-dozen times to support different formats.
>
> There are a couple of possibilities for allowing the HDFS connector to
> handle schemaless (i.e. JSON-like) data. One possibility is to infer the
> schema automatically based on the incoming data. If you can make
> guarantees about the compatibility of the data, this could work with the
> existing connector code. Alternatively, an option could be added to
> handle this type of data and force file rotation whenever a new schema
> is encountered. The risk with this is that if you have data interleaved
> with different schemas (as might happen as you transition an app to a
> new format) and no easy way to project between them, you'll have a lot
> of small HDFS files for a while.
>
> Dealing with schemaless data will be tricky for connectors like HDFS,
> but it is definitely possible. It's worth thinking through the right way
> to handle that data with a minimum of additional configuration options
> required.
>
> -Ewen
>
> On Wed, Feb 17, 2016 at 11:14 AM, Venkatesh Rudraraju <
> venkatengineer...@gmail.com> wrote:
>
>> Hi,
>>
>> I tried using the HDFS sink connector with kafka-connect and it works
>> as described here:
>> http://docs.confluent.io/2.0.0/connect/connect-hdfs/docs/index.html
>>
>> My scenario: I have plain JSON data in a Kafka topic. Can I still use
>> the HDFS sink connector to read data from the topic and write it to
>> HDFS in Avro format?
>>
>> As I read the documentation, the HDFS connector expects the data in
>> Kafka to already be in Avro format. Is there a workaround where I can
>> consume plain JSON and write to HDFS in Avro, say if I have a schema
>> for the plain JSON data?
>>
>> Thanks,
>> Venkatesh
>
> --
> Thanks,
> Ewen
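A note on the configuration above: partition.duration.ms=12 and rotate.interval.ms=60 appear truncated in the archive (60 ms is not the "5 minutes" described), and path.format seems to have lost its leading year token (hence the double slash in //MM/dd/03). As a hedged sketch only, with the millisecond values and the YYYY token being assumptions rather than values confirmed in the thread, an hourly-bucket TimeBasedPartitioner setup would typically look like:

    partitioner.class=io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner
    # assumed: one hour of data per bucket directory
    partition.duration.ms=3600000
    # assumed: commit open files every 5 minutes
    rotate.interval.ms=300000
    # assumed Joda-style pattern; yields e.g. 2016/02/23/03/
    path.format=YYYY/MM/dd/HH/
    locale=US
    timezone=UTC

Even with values like these, the behavior described in question 1 stands: a file whose records span an hour boundary is only committed after the boundary passes, so a batch job that starts exactly on the hour can still race the last file of the previous bucket.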
build for kafka-avro-serializer
How do I include "kafka-avro-serializer" in my Maven build? It's not available in the Maven repo, as mentioned here: http://docs.confluent.io/1.0/app-development.html#java-applications-serializers

Thanks,
Venkatesh
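A hedged sketch of one likely fix (not from the thread): Confluent publishes its artifacts in its own Maven repository rather than Maven Central, so adding that repository plus the dependency to pom.xml should let the build resolve it. The version below is an assumption; use the one matching your Confluent Platform release.

    <!-- Confluent artifacts are served from Confluent's repo, not Maven Central -->
    <repositories>
      <repository>
        <id>confluent</id>
        <url>http://packages.confluent.io/maven/</url>
      </repository>
    </repositories>

    <dependencies>
      <dependency>
        <groupId>io.confluent</groupId>
        <artifactId>kafka-avro-serializer</artifactId>
        <!-- assumed version; match your Confluent Platform release -->
        <version>1.0</version>
      </dependency>
    </dependencies>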
Kafka connect HDFS connector
Hi,

I tried using the HDFS sink connector with kafka-connect and it works as described here:
http://docs.confluent.io/2.0.0/connect/connect-hdfs/docs/index.html

My scenario: I have plain JSON data in a Kafka topic. Can I still use the HDFS sink connector to read data from the topic and write it to HDFS in Avro format?

As I read the documentation, the HDFS connector expects the data in Kafka to already be in Avro format. Is there a workaround where I can consume plain JSON and write to HDFS in Avro, say if I have a schema for the plain JSON data?

Thanks,
Venkatesh
Re: kafka connect (copycat) question
I tried building copycat-hdfs but it's not able to pull dependencies from Maven. Error trace:

Failed to execute goal on project kafka-connect-hdfs: Could not resolve dependencies for project io.confluent:kafka-connect-hdfs:jar:2.0.0-SNAPSHOT: The following artifacts could not be resolved: org.apache.kafka:connect-api:jar:0.9.0.0, io.confluent:kafka-connect-avro-converter:jar:2.0.0-SNAPSHOT, io.confluent:common-config:jar:2.0.0-SNAPSHOT: Could not find artifact org.apache.kafka:connect-api:jar:0.9.0.0 in confluent

On Thu, Nov 12, 2015 at 2:59 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote:

> Yes, though it's still awaiting some updates after some renaming and API
> modifications that happened in Kafka recently.
>
> -Ewen
>
> On Thu, Nov 12, 2015 at 9:10 AM, Venkatesh Rudraraju <
> venkatengineer...@gmail.com> wrote:
>
>> Ewen,
>>
>> How do I use a HdfsSinkConnector? I see the sink as part of a Confluent
>> project:
>> https://github.com/confluentinc/copycat-hdfs/blob/master/src/main/java/io/confluent/copycat/hdfs/HdfsSinkConnector.java
>> Does it mean that I build this project and add the jar to the Kafka
>> libs?
>>
>> On Tue, Nov 10, 2015 at 9:35 PM, Ewen Cheslack-Postava <
>> e...@confluent.io> wrote:
>>
>>> Venkatesh,
>>>
>>> 1. It only works with quotes because the message needs to be parsed as
>>> JSON -- a bare string without quotes is not valid JSON. If you're just
>>> using a file sink, you can also try the StringConverter, which only
>>> supports strings and uses a fixed schema, but is also very easy to use
>>> since it has minimal requirements. It's really meant for demonstration
>>> purposes more than anything else, but may be helpful just to get up
>>> and running.
>>> 2. Which JsonParser error? When processing a message fails, we need to
>>> be careful about how we handle it. Currently it will not proceed if it
>>> can't process a message, since for a lot of applications it isn't
>>> acceptable to drop messages. By default, we want at-least-once
>>> semantics, with exactly-once as long as we don't encounter any crashes
>>> or network errors. Manual intervention is currently required in that
>>> case.
>>>
>>> -Ewen
>>>
>>> On Tue, Nov 10, 2015 at 8:58 PM, Venkatesh Rudraraju <
>>> venkatengineer...@gmail.com> wrote:
>>>
>>>> Hi Ewen,
>>>>
>>>> Thanks for the explanation. With your suggested setting, I was able
>>>> to start just a sink connector like below:
>>>>
>>>> bin/connect-standalone.sh config/connect-standalone.properties
>>>> config/connect-file-sink.properties
>>>>
>>>> But I have a couple of remaining issues:
>>>> 1) Since I am only testing a simple file sink connector, I am
>>>> manually producing some messages to the 'connect-test' kafka topic,
>>>> where the sink task is reading from. And it works only if the message
>>>> is within double quotes.
>>>> 2) Once I hit the above JsonParser error on the sink task, the
>>>> connector is hung and doesn't take any more messages, even proper
>>>> ones.
>>>>
>>>> On Tue, Nov 10, 2015 at 1:59 PM, Ewen Cheslack-Postava <
>>>> e...@confluent.io> wrote:
>>>>
>>>>> Hi Venkatesh,
>>>>>
>>>>> If you're using the default settings included in the sample configs,
>>>>> it'll expect JSON data in a special format to support passing
>>>>> schemas along with the data. This is turned on by default because it
>>>>> makes it possible to work with a *lot* more connectors and data
>>>>> storage systems (many require schemas!), though it does mean
>>>>> consuming regular JSON data won't work out of the box. You can
>>>>> easily switch this off by changing these lines in the worker config:
>>>>>
>>>>> key.converter.schemas.enable=true
>>>>> value.converter.schemas.enable=true
>>>>>
>>>>> to be false instead. However, note that this will only work with
>>>>> connectors that can work with "schemaless" data.
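The resolution failure suggests the SNAPSHOT build needs its upstream artifacts installed into the local Maven repository first. As a hedged sketch (the exact steps and repository layout are assumptions based on the artifact coordinates in the error, not something confirmed in the thread), the missing jars would typically be built and installed like this:

    # Install org.apache.kafka:connect-api:0.9.0.0 from a Kafka checkout
    git clone https://github.com/apache/kafka.git && cd kafka
    git checkout 0.9.0.0
    gradle                # bootstrap the Gradle wrapper if needed
    ./gradlew installAll  # publishes the Kafka jars to the local Maven repo

    # Install the io.confluent 2.0.0-SNAPSHOT dependencies from source
    cd .. && git clone https://github.com/confluentinc/common.git && cd common
    mvn install -DskipTests
    cd .. && git clone https://github.com/confluentinc/schema-registry.git && cd schema-registry
    mvn install -DskipTests  # provides kafka-connect-avro-converter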
kafka connect (copycat) question
Hi,

I am trying out the new kafka connect service.

version: kafka_2.11-0.9.0.0
mode: standalone

I have a conceptual question on the service. Can I just start a sink connector which reads from Kafka and writes to, say, HDFS? From what I have tried, it expects a source connector as well, because the sink connector expects a particular pattern of message in the kafka-topic.

Thanks,
Venkat
Re: kafka connect (copycat) question
Hi Ewen,

Thanks for the explanation. With your suggested setting, I was able to start just a sink connector like below:

bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-sink.properties

But I have a couple of remaining issues:
1) Since I am only testing a simple file sink connector, I am manually producing some messages to the 'connect-test' kafka topic, where the sink task is reading from. And it works only if the message is within double quotes.
2) Once I hit the above JsonParser error on the sink task, the connector is hung and doesn't take any more messages, even proper ones.

On Tue, Nov 10, 2015 at 1:59 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote:

> Hi Venkatesh,
>
> If you're using the default settings included in the sample configs,
> it'll expect JSON data in a special format to support passing schemas
> along with the data. This is turned on by default because it makes it
> possible to work with a *lot* more connectors and data storage systems
> (many require schemas!), though it does mean consuming regular JSON data
> won't work out of the box. You can easily switch this off by changing
> these lines in the worker config:
>
> key.converter.schemas.enable=true
> value.converter.schemas.enable=true
>
> to be false instead. However, note that this will only work with
> connectors that can work with "schemaless" data. This wouldn't work for,
> e.g., writing Avro files in HDFS, since those need schema information,
> but it might work for other formats. This would allow you to consume
> JSON data from any topic it already existed in.
>
> Note that JSON is not the only format you can use. You can also
> substitute other implementations of the Converter interface. Confluent
> has implemented an Avro version that works well with our schema
> registry (
> https://github.com/confluentinc/schema-registry/tree/master/avro-converter
> ). The JSON implementation made sense to add as the one included with
> Kafka simply because it didn't introduce any other dependencies that
> weren't already in Kafka. It's also possible to write implementations
> for other formats (e.g. Thrift, Protocol Buffers, Cap'n Proto,
> MessagePack, and more), but I'm not aware of anyone who has started to
> tackle those converters yet.
>
> -Ewen
>
> On Tue, Nov 10, 2015 at 1:23 PM, Venkatesh Rudraraju <
> venkatengineer...@gmail.com> wrote:
>
>> Hi,
>>
>> I am trying out the new kafka connect service.
>>
>> version: kafka_2.11-0.9.0.0
>> mode: standalone
>>
>> I have a conceptual question on the service. Can I just start a sink
>> connector which reads from Kafka and writes to, say, HDFS? From what I
>> have tried, it expects a source connector as well, because the sink
>> connector expects a particular pattern of message in the kafka-topic.
>>
>> Thanks,
>> Venkat
>
> --
> Thanks,
> Ewen

--
Victory awaits him who has everything in order--luck, people call it.
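To make the suggested change concrete, here is a hedged sketch of the relevant worker config lines (connect-standalone.properties) for consuming plain JSON; only the schemas.enable flags differ from the shipped sample, and the rest of the file is assumed to be left as distributed:

    # JSON converter, without the schema-envelope format
    key.converter=org.apache.kafka.connect.json.JsonConverter
    value.converter=org.apache.kafka.connect.json.JsonConverter
    key.converter.schemas.enable=false
    value.converter.schemas.enable=false

With schemas.enable=true (the default), each message must be an envelope of the form {"schema": {...}, "payload": {...}}; with it set to false, bare JSON values (including quoted strings) are accepted, which matches why the unquoted-string test in this thread failed at the JSON-parsing stage.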