Re: Kafka connect HDFS connector

2016-02-24 Thread Venkatesh Rudraraju
Thanks, Ewen.
We decided to update the producer side of our application to use the
schema registry and post Avro messages. Now I am able to store Avro
messages in HDFS using Connect. I have a couple more questions:

1) I am using the TimeBasedPartitioner and trying to store data in hourly
buckets. But the rotation for a particular hour XX only happens in hour
XX+1, which is a problem when I have batch jobs reading data off the /XX
bucket.

For example, I have rotate.interval.ms=60 (5 minutes):
- 3:58: one file gets rotated under //MM/dd/03 in HDFS
- 4:03:
 -> one file gets rotated under //MM/dd/04 in HDFS for data from 4:00 to 4:03
 -> one file gets rotated under //MM/dd/03 in HDFS for data from 3:58 to 4:00

In this case, if I have an hourly batch job starting at 4:00 to process
//MM/dd/03, it would miss one file.

Below is my connector config:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=raw-message-avro
hdfs.url=hdfs://localhost:8020
topics.dir=/raw/avro/hourly/
flush.size=1
partitioner.class=io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner
partition.duration.ms=12
rotate.interval.ms=60
timezone=UTC
path.format=/MM/dd/HH/
locale=US
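
For completeness, the worker-side converter settings I'm running with the
schema registry are roughly the standard Avro ones (the registry URL below is
just my local setup, so treat this as a sketch):

key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081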


2) Can I control the file commit based on size, like Flume does? Right now I
only see flush.size and rotate.interval.ms related to file commit/flush. Is
there any other config I am missing?

Thanks,
Venkatesh

On Tue, Feb 23, 2016 at 9:09 PM, Ewen Cheslack-Postava <e...@confluent.io>
wrote:

> Consuming plain JSON is a bit tricky for something like HDFS because all
> the output formats expect the data to have a schema. You can read the JSON
> data with the provided JsonConverter, but it'll be returned without a
> schema. The HDFS connector will currently fail on this because it expects a
> fixed structure.
>
> Note however that it *does not* depend on the data already being in Avro format.
> Kafka Connect is specifically designed to abstract away the serialization
> format of data in Kafka so that connectors don't need to be written a
> half-dozen times to support different formats.
>
> There are a couple of possibilities to allow the HDFS connector to handle
> schemaless (i.e. JSON-like) data. One possibility is to infer the schema
> automatically based on the incoming data. If you can make guarantees about
> the compatibility of the data, this could work with the existing connector
> code. Alternatively, an option could be added to handle this type of data
> and force file rotation if a new schema was encountered. The risk with this
> is that if you have data interleaved with different schemas (as might
> happen as you transition an app to a new format) and no easy way to project
> between them, you'll have a lot of small HDFS files for a while.
>
> Dealing with schemaless data will be tricky for connectors like HDFS, but
> is definitely possible. But it's worth thinking through the right way to
> handle that data with a minimum of additional configuration options
> required.
>
> -Ewen
>
> On Wed, Feb 17, 2016 at 11:14 AM, Venkatesh Rudraraju <
> venkatengineer...@gmail.com> wrote:
>
>> Hi,
>>
>> I tried using the HDFS connector sink with kafka-connect and it works as
>> described here ->
>> http://docs.confluent.io/2.0.0/connect/connect-hdfs/docs/index.html
>>
>> My Scenario :
>>
>> I have plain JSON data in a Kafka topic. Can I still use the HDFS connector
>> sink to read data from the topic and write it to HDFS in Avro format?
>>
>> From what I read in the documentation, the HDFS connector expects the data
>> in Kafka to already be in Avro format. Is there a workaround where I can
>> consume plain JSON and write to HDFS in Avro? Say I have a schema for the
>> plain JSON data.
>>
>> Thanks,
>> Venkatesh
>>
>
>
>
> --
> Thanks,
> Ewen
>


build for kafka-avro-serializer

2016-02-21 Thread Venkatesh Rudraraju
How do I include "kafka-avro-serializer" in my maven build. It's not
available in the maven repo as mentioned here ->

http://docs.confluent.io/1.0/app-development.html#java-applications-serializers
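
Is the fix to add Confluent's Maven repository explicitly and then depend on
the artifact? This is what I was about to try; the repository URL and version
here are just my guesses from the docs, so take it as a sketch:

<repositories>
  <repository>
    <id>confluent</id>
    <url>http://packages.confluent.io/maven/</url>
  </repository>
</repositories>

<dependency>
  <groupId>io.confluent</groupId>
  <artifactId>kafka-avro-serializer</artifactId>
  <version>1.0</version>
</dependency>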

Thanks,
Venkatesh


kafka connect - HDFS sink connector issue

2016-02-18 Thread Venkatesh Rudraraju
Hi,

I tried using the HDFS connector sink with kafka-connect and it works as
described here ->
http://docs.confluent.io/2.0.0/connect/connect-hdfs/docs/index.html

My Scenario :

I have plain JSON data in a Kafka topic. Can I still use the HDFS connector
sink to read data from the topic and write it to HDFS in Avro format?

From what I read in the documentation, the HDFS connector expects the data
in Kafka to already be in Avro format. Is there a workaround where I can
consume plain JSON and write to HDFS in Avro? Say I have a schema for the
plain JSON data.

Thanks,
Venkatesh


Kafka connect HDFS connector

2016-02-17 Thread Venkatesh Rudraraju
Hi,

I tried using the HDFS connector sink with kafka-connect and it works as
described here ->
http://docs.confluent.io/2.0.0/connect/connect-hdfs/docs/index.html

My Scenario :

I have plain JSON data in a Kafka topic. Can I still use the HDFS connector
sink to read data from the topic and write it to HDFS in Avro format?

From what I read in the documentation, the HDFS connector expects the data
in Kafka to already be in Avro format. Is there a workaround where I can
consume plain JSON and write to HDFS in Avro? Say I have a schema for the
plain JSON data.

Thanks,
Venkatesh


Re: kafka connect (copycat) question

2015-11-14 Thread Venkatesh Rudraraju
I tried building copycat-hdfs but it's not able to pull dependencies from
Maven...

error trace :
---
 Failed to execute goal on project kafka-connect-hdfs: Could not resolve
dependencies for project
io.confluent:kafka-connect-hdfs:jar:2.0.0-SNAPSHOT: The following artifacts
could not be resolved: org.apache.kafka:connect-api:jar:0.9.0.0,
io.confluent:kafka-connect-avro-converter:jar:2.0.0-SNAPSHOT,
io.confluent:common-config:jar:2.0.0-SNAPSHOT: Could not find artifact
org.apache.kafka:connect-api:jar:0.9.0.0 in confluent
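
Do I need to build and install the snapshot dependencies into my local Maven
repo first? My guess (assuming the Kafka 0.9.0 sources and the Confluent
common/schema-registry repos build cleanly) would be something like:

# publish the Kafka jars (including connect-api) to the local Maven repo
cd kafka && ./gradlew installAll

# install the Confluent snapshot dependencies
cd ../common && mvn install -DskipTests
cd ../schema-registry && mvn install -DskipTests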

On Thu, Nov 12, 2015 at 2:59 PM, Ewen Cheslack-Postava <e...@confluent.io>
wrote:

> Yes, though it's still awaiting some updates after some renaming and API
> modifications that happened in Kafka recently.
>
> -Ewen
>
> On Thu, Nov 12, 2015 at 9:10 AM, Venkatesh Rudraraju <
> venkatengineer...@gmail.com> wrote:
>
> > Ewen,
> >
> > How do I use the HdfsSinkConnector? I see the sink as part of a Confluent
> > project (
> >
> >
> https://github.com/confluentinc/copycat-hdfs/blob/master/src/main/java/io/confluent/copycat/hdfs/HdfsSinkConnector.java
> > ).
> > Does it mean that I build this project and add the jar to kafka libs ?
> >
> >
> >
> >
> > On Tue, Nov 10, 2015 at 9:35 PM, Ewen Cheslack-Postava <
> e...@confluent.io>
> > wrote:
> >
> > > Venkatesh,
> > >
> > > 1. It only works with quotes because the message needs to be parsed as
> > JSON
> > > -- a bare string without quotes is not valid JSON. If you're just
> using a
> > > file sink, you can also try the StringConverter, which only supports
> > > strings and uses a fixed schema, but is also very easy to use since it
> > has
> > > minimal requirements. It's really meant for demonstration purposes more
> > > than anything else, but may be helpful just to get up and running.
> > > 2. Which JsonParser error? When processing a message fails, we need to
> be
> > > careful about how we handle it. Currently it will not proceed if it
> can't
> > > process a message since for a lot of applications it isn't acceptable
> to
> > > drop messages. By default, we want at least once semantics, with
> exactly
> > > once as long as we don't encounter any crashes or network errors.
> Manual
> > > intervention is currently required in that case.
> > >
> > > -Ewen
> > >
> > > On Tue, Nov 10, 2015 at 8:58 PM, Venkatesh Rudraraju <
> > > venkatengineer...@gmail.com> wrote:
> > >
> > > > Hi Ewen,
> > > >
> > > > Thanks for the explanation. With your suggested setting, I was able to
> > > > start just a sink connector like below:
> > > >
> > > > bin/connect-standalone.sh config/connect-standalone.properties
> > > > config/connect-file-sink.properties
> > > >
> > > > But I still have a couple of issues:
> > > > 1) Since I am only testing a simple file sink connector, I am manually
> > > > producing some messages to the 'connect-test' kafka topic, which the
> > > > sink task is reading from. And it works only if the message is within
> > > > double quotes.
> > > > 2) Once I hit the above JsonParser error on the SinkTask, the connector
> > > > is hung and doesn't take any more messages, even proper ones.
> > > >
> > > >
> > > > On Tue, Nov 10, 2015 at 1:59 PM, Ewen Cheslack-Postava <
> > > e...@confluent.io>
> > > > wrote:
> > > >
> > > > > Hi Venkatesh,
> > > > >
> > > > > If you're using the default settings included in the sample
> configs,
> > > > it'll
> > > > > expect JSON data in a special format to support passing schemas
> along
> > > > with
> > > > > the data. This is turned on by default because it makes it possible
> > to
> > > > work
> > > > > with a *lot* more connectors and data storage systems (many require
> > > > > schemas!), though it does mean consuming regular JSON data won't
> work
> > > out
> > > > > of the box. You can easily switch this off by changing these lines
> in
> > > the
> > > > > worker config:
> > > > >
> > > > > key.converter.schemas.enable=true
> > > > > value.converter.schemas.enable=true
> > > > >
> > > > > to be false instead. However, note that this will only work with
> > > > connectors

kafka connect (copycat) question

2015-11-10 Thread Venkatesh Rudraraju
Hi,

I am trying out the new kafka connect service.

version : kafka_2.11-0.9.0.0
mode: standalone

I have a conceptual question on the service.

Can I just start a sink connector which reads from Kafka and writes to, say,
HDFS?
From what I have tried, it's expecting a source connector as well, because
the sink connector is expecting a particular pattern for the messages in the
kafka topic.

Thanks,
Venkat


Re: kafka connect (copycat) question

2015-11-10 Thread Venkatesh Rudraraju
Hi Ewen,

Thanks for the explanation. With your suggested setting, I was able to
start just a sink connector like below:

bin/connect-standalone.sh config/connect-standalone.properties
config/connect-file-sink.properties

But I still have a couple of issues:
1) Since I am only testing a simple file sink connector, I am manually
producing some messages to the 'connect-test' kafka topic, which the
sink task is reading from. And it works only if the message is within
double quotes (see the example below).
2) Once I hit the above JsonParser error on the SinkTask, the connector
is hung and doesn't take any more messages, even proper ones.
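
For example, producing a bare hello fails to parse, while "hello" (valid JSON)
works. And if I turn schemas back on for the JsonConverter, my understanding is
that each message would have to carry a schema/payload envelope roughly like
this (my own sketch of the format, not copied from the docs):

{"schema": {"type": "string", "optional": false}, "payload": "hello"}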


On Tue, Nov 10, 2015 at 1:59 PM, Ewen Cheslack-Postava <e...@confluent.io>
wrote:

> Hi Venkatesh,
>
> If you're using the default settings included in the sample configs, it'll
> expect JSON data in a special format to support passing schemas along with
> the data. This is turned on by default because it makes it possible to work
> with a *lot* more connectors and data storage systems (many require
> schemas!), though it does mean consuming regular JSON data won't work out
> of the box. You can easily switch this off by changing these lines in the
> worker config:
>
> key.converter.schemas.enable=true
> value.converter.schemas.enable=true
>
> to be false instead. However, note that this will only work with connectors
> that can work with "schemaless" data. This wouldn't work for, e.g., writing
> Avro files in HDFS since they need schema information, but it might work
> for other formats. This would allow you to consume JSON data from any topic
> it already existed in.
>
> Note that JSON is not the only format you can use. You can also substitute
> other implementations of the Converter interface. Confluent has implemented
> an Avro version that works well with our schema registry (
> https://github.com/confluentinc/schema-registry/tree/master/avro-converter
> ).
> The JSON implementation made sense to add as the one included with Kafka
> simply because it didn't introduce any other dependencies that weren't
> already in Kafka. It's also possible to write implementations for other
> formats (e.g. Thrift, Protocol Buffers, Cap'n Proto, MessagePack, and
> more), but I'm not aware of anyone who has started to tackle those
> converters yet.
>
> -Ewen
>
> On Tue, Nov 10, 2015 at 1:23 PM, Venkatesh Rudraraju <
> venkatengineer...@gmail.com> wrote:
>
> > Hi,
> >
> > I am trying out the new kafka connect service.
> >
> > version : kafka_2.11-0.9.0.0
> > mode: standalone
> >
> > I have a conceptual question on the service.
> >
> > Can I just start a sink connector which reads from Kafka and writes to,
> > say, HDFS?
> > From what I have tried, it's expecting a source connector as well because
> > the sink connector is expecting a particular pattern for the messages in
> > the kafka topic.
> >
> > Thanks,
> > Venkat
> >
>
>
>
> --
> Thanks,
> Ewen
>



-- 
Victory awaits him who has everything in order--luck, people call it.