External DB as sink - with processing guarantees

2016-03-11 Thread Josh
Hi all,

I want to use an external data store (DynamoDB) as a sink with Flink. It looks 
like there's no connector for Dynamo at the moment, so I have two questions:

1. Is it easy to write my own sink for Flink and are there any docs around how 
to do this?
2. If I do this, will I still be able to keep Flink's processing guarantees? 
I.e., can I be sure that every tuple has contributed to the DynamoDB state 
either at-least-once or exactly-once?
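
For illustration, the kind of sink I have in mind is roughly the sketch below
(untested; it assumes the AWS SDK v1 client and a simple id/payload table, and
my possibly-wrong understanding is that it would be at-least-once, or
effectively exactly-once on the Dynamo side if the put is idempotent):

import java.util.{HashMap => JHashMap}

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient
import com.amazonaws.services.dynamodbv2.model.AttributeValue
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction

class DynamoDbSink(tableName: String) extends RichSinkFunction[(String, String)] {
  @transient private var client: AmazonDynamoDBClient = _

  override def open(parameters: Configuration): Unit = {
    // one client per parallel sink instance; credentials come from the default provider chain
    client = new AmazonDynamoDBClient()
  }

  override def invoke(value: (String, String)): Unit = {
    // keying the put by a unique id makes replays after a failure idempotent
    val item = new JHashMap[String, AttributeValue]()
    item.put("id", new AttributeValue(value._1))
    item.put("payload", new AttributeValue(value._2))
    client.putItem(tableName, item)
  }

  override def close(): Unit = {
    if (client != null) client.shutdown()
  }
}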

Thanks for any advice,
Josh

Re: Running Flink 1.0.0 on YARN

2016-03-11 Thread Robert Metzger
Hi,

the first issue you are describing is expected. Flink is starting the web
interface on the container running the JobManager, not on the resource
manager.
Also, the port is allocated dynamically to avoid port collisions, so it's
not started on 8081.
However, you can access the web interface from the proxy provided in the
application overview.
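(With the default YARN setup, that proxy link usually has the form
http://<resourcemanager>:8088/proxy/<application id>/ and is also shown as the
application's tracking URL.)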

Regarding the second error, can you check the log files of the TaskManager
(running on x.x.x.x:43272) which failed?
I'm pretty sure there is some information in there about why it didn't respond.


On Thu, Mar 10, 2016 at 9:45 AM, Ashutosh Kumar 
wrote:

> I have a yarn setup with 1 master and 2 slaves.
> When I run yarn session with  bin/yarn-session.sh -n 2 -jm 1024 -tm
> 1024 and  submit job with bin/flink run examples/batch/WordCount.jar , the
> job succeeds . It shows status on yarn UI http://x.x.x.x:8088/cluster .
> However it does not show anything on Flink UI
> http://x.x.x.x:8081/#/overview .
>
> Is this expected behavior ?
>
> If I run using bin/flink run -m yarn-cluster -yn 4 -yjm 1024 -ytm 4096
> examples/batch/WordCount.jar then the job fails with following error.
>
> java.lang.IllegalStateException: Update task on instance
> 451105022ff3b4cd6e2c307e239d1595 @ slave2 - 2 slots - URL:
> akka.tcp://flink@x.x.x.x:43272/user/taskmanager failed due to:
>     at org.apache.flink.runtime.executiongraph.Execution$6.onFailure(Execution.java:954)
>     at akka.dispatch.OnFailure.internal(Future.scala:228)
>     at akka.dispatch.OnFailure.internal(Future.scala:227)
>     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174)
>     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171)
>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>     at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:25)
>     at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
>     at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:134)
>     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>     at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: akka.pattern.AskTimeoutException: Ask timed out on
> [Actor[akka.tcp://flink@x.x.x.x:43272/user/taskmanager#1361901425]] after [1 ms]
>     at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333)
>     at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
>     at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
>     at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)
>     at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)
>     at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)
>     at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)
>     at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
>     at java.lang.Thread.run(Thread.java:745)
>
> 03/10/2016 08:36:46 Job execution switched to status FAILING.
>
> Thanks
>
> Ashutosh
>


Re: Flink and YARN ship folder

2016-03-11 Thread Ufuk Celebi
Everything in the lib folder should be added to the classpath. Can you
check the YARN client logs that the files are uploaded? Furthermore,
you can check the classpath of the JVM in the YARN logs of the
JobManager/TaskManager processes.

– Ufuk


On Fri, Mar 11, 2016 at 5:33 PM, Andrea Sella
 wrote:
> Hi,
>
> Is there a way to add external dependencies to a Flink job running on YARN
> without using HADOOP_CLASSPATH?
> I am looking for something similar to the lib folder used in standalone mode.
>
> BR,
> Andrea


Re: Log4j configuration on YARN

2016-03-11 Thread Ufuk Celebi
Hey Nick!

I just checked and the conf/log4j.properties file is copied and is
given as an argument to the JVM.

You should see the following:
- client logs that the conf/log4j.properties file is copied
- JobManager logs show log4j.configuration being passed to the JVM.

Can you confirm that these show up? If yes, but you still don't get
the expected logging, I would check via -Dlog4j.debug what is
configured (prints to stdout I think). Does this help?
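
For the DEBUG goal specifically, the usual log4j 1.x way would be a line like
log4j.logger.com.mycompany=DEBUG in that properties file (assuming plain
log4j 1.x property syntax applies here).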

– Ufuk


On Fri, Mar 11, 2016 at 6:02 PM, Nick Dimiduk  wrote:
> Can anyone tell me where I must place my application-specific
> log4j.properties to have them honored when running on a YARN cluster? Placing
> them in my application jar doesn't work. Putting them in the log4j files under
> flink/conf doesn't work either.
>
> My goal is to set the log level for 'com.mycompany' classes used in my flink
> application to DEBUG.
>
> Thanks,
> Nick
>


Log4j configuration on YARN

2016-03-11 Thread Nick Dimiduk
Can anyone tell me where I must place my application-specific
log4j.properties to have them honored when running on a YARN cluster? Placing
them in my application jar doesn't work. Putting them in the log4j files under
flink/conf doesn't work either.

My goal is to set the log level for 'com.mycompany' classes used in my
flink application to DEBUG.

Thanks,
Nick


Flink and YARN ship folder

2016-03-11 Thread Andrea Sella
Hi,

Is there a way to add external dependencies to a Flink job running on YARN
without using HADOOP_CLASSPATH?
I am looking for something similar to the lib folder used in standalone mode.

BR,
Andrea


Re: Stack overflow from self referencing Avro schema

2016-03-11 Thread David Kim
Thanks Stephan! :)

On Thu, Mar 10, 2016 at 11:06 AM, Stephan Ewen  wrote:

> The following issue should track that.
> https://issues.apache.org/jira/browse/FLINK-3602
>
> @Niels: Thanks for looking into this. At this point, I think it may
> actually be a Flink issue, since it concerns the interaction of Avro and
> Flink's TypeInformation.
>
> On Thu, Mar 10, 2016 at 6:00 PM, Stephan Ewen  wrote:
>
>> Hi!
>>
>> I think that is a TypeExtractor bug. It may actually be a bug for all
>> recursive types.
>> Let's check this and come up with a fix...
>>
>> Greetings,
>> Stephan
>>
>>
>> On Thu, Mar 10, 2016 at 4:11 PM, David Kim <
>> david@braintreepayments.com> wrote:
>>
>>> Hello!
>>>
>>> Just wanted to check up on this again. Has anyone else seen this before
>>> or have any suggestions?
>>>
>>> Thanks!
>>> David
>>>
>>> On Tue, Mar 8, 2016 at 12:12 PM, David Kim <
>>> david@braintreepayments.com> wrote:
>>>
 Hello all,

 I'm running into a StackOverflowError using flink 1.0.0. I have an Avro
 schema that has a self reference. For example:

 item.avsc

 {
   "namespace": "...",
   "type": "record",
   "name": "Item",
   "fields": [
     {
       "name": "parent",
       "type": ["null", "Item"]
     }
   ]
 }


 When running my Flink job, I'm running into the following error:

 Exception in thread "Thread-94" java.lang.StackOverflowError
at 
 org.apache.flink.api.java.typeutils.TypeExtractor.countTypeInHierarchy(TypeExtractor.java:1105)
at 
 org.apache.flink.api.java.typeutils.TypeExtractor.privateGetForClass(TypeExtractor.java:1397)
at 
 org.apache.flink.api.java.typeutils.TypeExtractor.privateGetForClass(TypeExtractor.java:1319)
at 
 org.apache.flink.api.java.typeutils.TypeExtractor.createTypeInfoWithTypeHierarchy(TypeExtractor.java:609)
at 
 org.apache.flink.api.java.typeutils.TypeExtractor.analyzePojo(TypeExtractor.java:1531)
at 
 org.apache.flink.api.java.typeutils.AvroTypeInfo.generateFieldsFromAvroSchema(AvroTypeInfo.java:53)
at 
 org.apache.flink.api.java.typeutils.AvroTypeInfo.<init>(AvroTypeInfo.java:48)
at 
 org.apache.flink.api.java.typeutils.TypeExtractor.privateGetForClass(TypeExtractor.java:1394)
at 
 org.apache.flink.api.java.typeutils.TypeExtractor.privateGetForClass(TypeExtractor.java:1319)
at 
 org.apache.flink.api.java.typeutils.TypeExtractor.createTypeInfoWithTypeHierarchy(TypeExtractor.java:609)
at 
 org.apache.flink.api.java.typeutils.TypeExtractor.analyzePojo(TypeExtractor.java:1531)
at 
 org.apache.flink.api.java.typeutils.AvroTypeInfo.generateFieldsFromAvroSchema(AvroTypeInfo.java:53)
at 
 org.apache.flink.api.java.typeutils.AvroTypeInfo.<init>(AvroTypeInfo.java:48)
at 
 org.apache.flink.api.java.typeutils.TypeExtractor.privateGetForClass(TypeExtractor.java:1394)
at 
 org.apache.flink.api.java.typeutils.TypeExtractor.privateGetForClass(TypeExtractor.java:1319)
at 
 org.apache.flink.api.java.typeutils.TypeExtractor.createTypeInfoWithTypeHierarchy(TypeExtractor.java:609)
at 
 org.apache.flink.api.java.typeutils.TypeExtractor.analyzePojo(TypeExtractor.java:1531)
at 
 org.apache.flink.api.java.typeutils.AvroTypeInfo.generateFieldsFromAvroSchema(AvroTypeInfo.java:53)


 Interestingly, if I change the type to an Avro array in the schema,
 this error is not thrown.

 Thanks!
 David

>>>
>>>
>>>
>>>
>>
>>
>




Re: kafka.javaapi.consumer.SimpleConsumer class not found

2016-03-11 Thread Balaji Rajagopalan
Robert,
  That did not fix it (using the same version for Flink and the connector). I tried
with Scala version 2.11, so I will try Scala 2.10 to see if it makes any
difference.

balaji

On Fri, Mar 11, 2016 at 8:06 PM, Robert Metzger  wrote:

> Hi,
>
> you have to use the same version for all dependencies from the
> "org.apache.flink" group.
>
> You said these are the versions you are using:
>
> flink.version = 0.10.2
> kafka.verison = 0.8.2
> flink.kafka.connection.verion=0.9.1
>
> For the connector, you also need to use 0.10.2.
>
>
>
> On Fri, Mar 11, 2016 at 9:56 AM, Balaji Rajagopalan <
> balaji.rajagopa...@olacabs.com> wrote:
>
>> I am trying to use the flink kafka connector. For this I have specified
>> the kafka connector dependency and created a fat jar since default flink
>> installation does not contain kafka connector jars. I have made sure that
>> flink-streaming-demo-0.1.jar has the
>> kafka.javaapi.consumer.SimpleConsumer.class but still I see the class not
>> found exception.
>>
>> The code for kafka connector in flink.
>>
>> val env = StreamExecutionEnvironment.getExecutionEnvironment
>> val prop:Properties = new Properties()
>> prop.setProperty("zookeeper.connect","somezookeer:2181")
>> prop.setProperty("group.id","some-group")
>> prop.setProperty("bootstrap.servers","somebroker:9092")
>>
>> val stream = env
>>   .addSource(new FlinkKafkaConsumer082[String]("location", new 
>> SimpleStringSchema, prop))
>>
>> jar tvf flink-streaming-demo-0.1.jar | grep
>> kafka.javaapi.consumer.SimpleConsumer
>>
>>   5111 Fri Mar 11 14:18:36 UTC 2016
>> kafka/javaapi/consumer/SimpleConsumer.class
>>
>> flink.version = 0.10.2
>> kafka.verison = 0.8.2
>> flink.kafka.connection.verion=0.9.1
>>
>> The command that I use to run the flink program in yarn cluster is below,
>>
>> HADOOP_CONF_DIR=/etc/hadoop/conf /usr/share/flink/bin/flink run -c
>> com.dataartisans.flink_demo.examples.DriverEventConsumer  -m yarn-cluster
>> -yn 2 /home/balajirajagopalan/flink-streaming-demo-0.1.jar
>>
>> java.lang.NoClassDefFoundError: kafka/javaapi/consumer/SimpleConsumer
>>
>> at
>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer.getPartitionsForTopic(FlinkKafkaConsumer.java:691)
>>
>> at
>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer.<init>(FlinkKafkaConsumer.java:281)
>>
>> at
>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer082.<init>(FlinkKafkaConsumer082.java:49)
>>
>> at
>> com.dataartisans.flink_demo.examples.DriverEventConsumer$.main(DriverEventConsumer.scala:53)
>>
>> at
>> com.dataartisans.flink_demo.examples.DriverEventConsumer.main(DriverEventConsumer.scala)
>>
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>
>> at java.lang.reflect.Method.invoke(Method.java:497)
>>
>> at
>> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:497)
>>
>> at
>> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:395)
>>
>> at org.apache.flink.client.program.Client.runBlocking(Client.java:252)
>>
>> at
>> org.apache.flink.client.CliFrontend.executeProgramBlocking(CliFrontend.java:676)
>>
>> at org.apache.flink.client.CliFrontend.run(CliFrontend.java:326)
>>
>> at
>> org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:978)
>>
>> at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1028)
>>
>> Caused by: java.lang.ClassNotFoundException:
>> kafka.javaapi.consumer.SimpleConsumer
>>
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>>
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>
>> ... 16 more
>>
>>
>> Any help appreciated.
>>
>>
>> balaji
>>
>
>


Re: protobuf messages from Kafka to elasticsearch using flink

2016-03-11 Thread Robert Metzger
Hi,

I think what you have to do is the following:

1. Create your own DeserializationSchema. There, the deserialize() method
gets a byte[] for each message in Kafka
2. Deserialize the byte[] using the generated classes from protobuf.
3. If your datatype is called "Foo", there should be a generated "Foo"
class with a "parseFrom()" accepting a byte[]. With that, you can turn each
byte[] into a "Foo" that you can then use in Flink.

Disclaimer: I haven't tested this myself. It's based on quick Stack Overflow
research.
Sources:
http://stackoverflow.com/questions/10984402/deserialize-protobuf-java-array
; https://developers.google.com/protocol-buffers/docs/javatutorial
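
A rough sketch of steps 1-3 (untested; "Foo" stands for the protobuf-generated
class, and the interface/package names are what I remember from the current
DeserializationSchema):

import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.java.typeutils.TypeExtractor
import org.apache.flink.streaming.util.serialization.DeserializationSchema

class FooDeserializationSchema extends DeserializationSchema[Foo] {
  // Kafka hands over the raw bytes of one message; the generated parser turns them into a Foo
  override def deserialize(message: Array[Byte]): Foo = Foo.parseFrom(message)

  // never mark the stream as finished
  override def isEndOfStream(nextElement: Foo): Boolean = false

  // tell Flink what type this schema produces
  override def getProducedType: TypeInformation[Foo] = TypeExtractor.getForClass(classOf[Foo])
}

You would then plug it into the consumer, e.g.
new FlinkKafkaConsumer082[Foo]("topic", new FooDeserializationSchema, props).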




On Wed, Mar 9, 2016 at 9:36 PM, Madhukar Thota 
wrote:

> Hi Fabian
>
> We are already using Flink to read json messages from kafka and index into
> elasticsearch. Now we have a requirement to read protobuf messages from
> kafka. I am new to protobuf and looking for help on how to deserialize
> protobuf using flink from kafka consumer.
>
> -Madhu
>
> On Wed, Mar 9, 2016 at 5:27 AM, Fabian Hueske  wrote:
>
>> Hi,
>>
>> I haven't used protobuf to serialize Kafka events but this blog post (+
>> the linked repository) shows how to write data from Flink into
>> Elasticsearch:
>>
>> -->
>> https://www.elastic.co/blog/building-real-time-dashboard-applications-with-apache-flink-elasticsearch-and-kibana
>>
>> Hope this helps,
>> Fabian
>>
>> 2016-03-09 2:52 GMT+01:00 Madhukar Thota :
>>
>>> Friends,
>>>
>>> Can someone guide me or share an example on  how to consume protobuf
>>> message from kafka and index into Elasticsearch using flink?
>>>
>>
>>
>


Re: kafka.javaapi.consumer.SimpleConsumer class not found

2016-03-11 Thread Robert Metzger
Hi,

you have to use the same version for all dependencies from the
"org.apache.flink" group.

You said these are the versions you are using:

flink.version = 0.10.2
kafka.verison = 0.8.2
flink.kafka.connection.verion=0.9.1

For the connector, you also need to use 0.10.2.
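
In sbt terms that would look roughly like this (the artifact names and
Scala-suffix handling are from memory, so please double-check them against the
0.10.2 artifacts on Maven Central):

val flinkVersion = "0.10.2"

libraryDependencies ++= Seq(
  "org.apache.flink" % "flink-streaming-scala" % flinkVersion,
  "org.apache.flink" % "flink-connector-kafka" % flinkVersion
)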



On Fri, Mar 11, 2016 at 9:56 AM, Balaji Rajagopalan <
balaji.rajagopa...@olacabs.com> wrote:

> I am trying to use the flink kafka connector. For this I have specified
> the kafka connector dependency and created a fat jar since default flink
> installation does not contain kafka connector jars. I have made sure that
> flink-streaming-demo-0.1.jar has the
> kafka.javaapi.consumer.SimpleConsumer.class but still I see the class not
> found exception.
>
> The code for kafka connector in flink.
>
> val env = StreamExecutionEnvironment.getExecutionEnvironment
> val prop:Properties = new Properties()
> prop.setProperty("zookeeper.connect","somezookeer:2181")
> prop.setProperty("group.id","some-group")
> prop.setProperty("bootstrap.servers","somebroker:9092")
>
> val stream = env
>   .addSource(new FlinkKafkaConsumer082[String]("location", new 
> SimpleStringSchema, prop))
>
> jar tvf flink-streaming-demo-0.1.jar | grep
> kafka.javaapi.consumer.SimpleConsumer
>
>   5111 Fri Mar 11 14:18:36 UTC 2016
> kafka/javaapi/consumer/SimpleConsumer.class
>
> flink.version = 0.10.2
> kafka.verison = 0.8.2
> flink.kafka.connection.verion=0.9.1
>
> The command that I use to run the flink program in yarn cluster is below,
>
> HADOOP_CONF_DIR=/etc/hadoop/conf /usr/share/flink/bin/flink run -c
> com.dataartisans.flink_demo.examples.DriverEventConsumer  -m yarn-cluster
> -yn 2 /home/balajirajagopalan/flink-streaming-demo-0.1.jar
>
> java.lang.NoClassDefFoundError: kafka/javaapi/consumer/SimpleConsumer
>
> at
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer.getPartitionsForTopic(FlinkKafkaConsumer.java:691)
>
> at
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer.<init>(FlinkKafkaConsumer.java:281)
>
> at
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer082.<init>(FlinkKafkaConsumer082.java:49)
>
> at
> com.dataartisans.flink_demo.examples.DriverEventConsumer$.main(DriverEventConsumer.scala:53)
>
> at
> com.dataartisans.flink_demo.examples.DriverEventConsumer.main(DriverEventConsumer.scala)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:497)
>
> at
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:497)
>
> at
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:395)
>
> at org.apache.flink.client.program.Client.runBlocking(Client.java:252)
>
> at
> org.apache.flink.client.CliFrontend.executeProgramBlocking(CliFrontend.java:676)
>
> at org.apache.flink.client.CliFrontend.run(CliFrontend.java:326)
>
> at
> org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:978)
>
> at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1028)
>
> Caused by: java.lang.ClassNotFoundException:
> kafka.javaapi.consumer.SimpleConsumer
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
> ... 16 more
>
>
> Any help appreciated.
>
>
> balaji
>


Re: Flink streaming throughput

2016-03-11 Thread Robert Metzger
Hi Hironori,

can you check with the kafka-console-consumer how many messages you can read
in one minute?
Maybe the broker's disk I/O is limited because everything is running in
virtual machines (potentially sharing one hard disk?)
I'm also not sure if running a Kafka 0.8 consumer against a 0.9 broker is
working as expected.

Our Kafka 0.8 consumer has been tested in environments where it's reading
at more than 100 MB/s from a broker.


On Fri, Mar 11, 2016 at 9:33 AM, おぎばやしひろのり  wrote:

> Aljoscha,
>
> Thank you for your response.
>
> I tried the case with no JSON parsing and no sink (DiscardingSink). The
> throughput was 8,228 msg/sec, slightly better than the JSON + Elasticsearch case.
> I also tried using socketTextStream instead of FlinkKafkaConsumer; in that
> case, the result was 60,000 msg/sec with just 40% Flink TaskManager CPU usage
> (the socket server was the bottleneck).
> That was amazing, although Flink's fault tolerance feature is not
> available with socketTextStream.
>
> Regards,
> Hironori
>
> 2016-03-08 21:36 GMT+09:00 Aljoscha Krettek :
> > Hi,
> > Another interesting test would be a combination of 3) and 2). I.e. no
> JSON parsing and no sink. This would show what the raw throughput can be
> before being slowed down by writing to Elasticsearch.
> >
> > Also .print() is also not feasible for production since it just prints
> every element to the stdout log on the TaskManagers, which itself can cause
> quite a slowdown. You could try:
> >
> > datastream.addSink(new DiscardingSink())
> >
> > which is a dummy sink that does nothing.
> >
> > Cheers,
> > Aljoscha
> >> On 08 Mar 2016, at 13:31, おぎばやしひろのり  wrote:
> >>
> >> Stephan,
> >>
> >> Sorry for the delay in my response.
> >> I tried 3 cases you suggested.
> >>
> >> This time, I set parallelism to 1 for simplicity.
> >>
> >> 0) base performance (same as the first e-mail): 1,480msg/sec
> >> 1) Disable checkpointing : almost same as 0)
> >> 2) No ES sink. just print() : 1,510msg/sec
> >> 3) JSON to TSV : 8,000msg/sec
> >>
> >> So, as you can see, the bottleneck was JSON parsing. I also want to
> >> try eliminating Kafka to see
> >> if there is room to improve performance. (Currently, I am using
> >> FlinkKafkaConsumer082 with Kafka 0.9
> >> I think I should try Flink 1.0 and FlinkKafkaConsumer09).
> >> Anyway, I think 8,000msg/sec with 1 CPU is not so bad thinking of
> >> Flink's scalability and fault tolerance.
> >> Thank you for your advice.
> >>
> >> Regards,
> >> Hironori Ogibayashi
> >>
> >> 2016-02-26 21:46 GMT+09:00 おぎばやしひろのり :
> >>> Stephan,
> >>>
> >>> Thank you for your quick response.
> >>> I will try and post the result later.
> >>>
> >>> Regards,
> >>> Hironori
> >>>
> >>> 2016-02-26 19:45 GMT+09:00 Stephan Ewen :
>  Hi!
> 
>  I would try and dig bit by bit into what the bottleneck is:
> 
>  1) Disable the checkpointing, see what difference that makes
>  2) Use a dummy sink (discarding) rather than elastic search, to see
> if that
>  is limiting
>  3) Check the JSON parsing. Many JSON libraries are very CPU intensive
> and
>  easily dominate the entire pipeline.
> 
>  Greetings,
>  Stephan
> 
> 
>  On Fri, Feb 26, 2016 at 11:23 AM, おぎばやしひろのり 
> wrote:
> >
> > Hello,
> >
> > I started evaluating Flink and tried simple performance test.
> > The result was just about 4000 messages/sec with 300% CPU usage. I
> > think this is quite low and wondering if it is a reasonable result.
> > If someone could check it, it would be great.
> >
> > Here is the detail:
> >
> > [servers]
> > - 3 Kafka broker with 3 partitions
> > - 3 Flink TaskManager + 1 JobManager
> > - 1 Elasticsearch
> > All of them are separate VM with 8vCPU, 8GB memory
> >
> > [test case]
> > The application counts access log by URI with in 1 minute window and
> > send the result to Elasticsearch. The actual code is below.
> > I used '-p 3' option to flink run command, so the task was
> distributed
> > to 3 TaskManagers.
> > In the test, I sent about 5000 logs/sec to Kafka.
> >
> > [result]
> > - From Elasticsearch records, the total access count for all URI was
> > about 260,000/min = 4300/sec. This is the entire throughput.
> > - Kafka consumer lag was keep growing.
> > - The CPU usage of each TaskManager machine was about 13-14%. From
> top
> > command output, Flink java process was using 100%(1 CPU full)
> >
> > So I thought the bottleneck here was CPU used by Flink Tasks.
> >
> > Here is the application code.
> > ---
> >val env = StreamExecutionEnvironment.getExecutionEnvironment
> >env.enableCheckpointing(1000)
> > ...
> >val stream = env
> >  .addSource(new FlinkKafkaConsumer082[String]("kafka.dummy", new
> > 

Re: DataSet -> DataStream

2016-03-11 Thread Prez Cannady
This is roughly the solution I have now.  On the other hand, I was hoping for a 
solution that doesn’t involve checking whether a file has updated.
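
Concretely, the hand-off I have now looks roughly like this (paths and variable
names are made up, and the Kafka producer part is left out):

// batch job: dump the JDBC-sourced DataSet to a file
dataSet.writeAsText("/tmp/handoff")   // hypothetical path
batchEnv.execute("dump")

// streaming job: read the file back in and continue as a DataStream
val streamEnv = StreamExecutionEnvironment.getExecutionEnvironment
val stream = streamEnv.readTextFile("/tmp/handoff")
// ...from here the stream can be written to the Kafka topic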

Prez Cannady  
p: 617 500 3378  
e: revp...@opencorrelate.org   
GH: https://github.com/opencorrelate   
LI: https://www.linkedin.com/in/revprez   

> On Mar 11, 2016, at 12:20 AM, Balaji Rajagopalan 
>  wrote:
> 
> You could I suppose write the dateset to a sink a file and then read the file 
> to a data stream. 
> 
> On Fri, Mar 11, 2016 at 4:18 AM, Prez Cannady  > wrote:
> 
> I’d like to pour some data I’ve collected into a DataSet via JDBC into a 
> Kafka topic, but I think I need to transform my DataSet into a DataStream 
> first.  If anyone has a clue how to proceed, I’d appreciate it; or let me 
> know if I’m completely off track.
> 
> 
> Prez Cannady  
> p: 617 500 3378  
> e: revp...@opencorrelate.org   
> GH: https://github.com/opencorrelate   
> LI: https://www.linkedin.com/in/revprez  
>  



Re: 404 error for Flink examples

2016-03-11 Thread Ufuk Celebi
I was wondering whether we should completely remove that page and just
link to the examples package on GitHub. Do you think that it would be
a good idea?

On Fri, Mar 11, 2016 at 10:45 AM, Maximilian Michels  wrote:
> Thanks for noticing, Janardhan. Fixed for the next release.
>
> On Fri, Mar 11, 2016 at 6:38 AM, janardhan shetty
>  wrote:
>> Thanks Balaji.
>>
>> This needs to be updated in the Job.java file of quickstart application.
>> I am using 1.0 version
>>
>> On Thu, Mar 10, 2016 at 9:23 PM, Balaji Rajagopalan
>>  wrote:
>>>
>>> You could try this link.
>>>
>>>
>>> https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/examples.html
>>>
>>> On Fri, Mar 11, 2016 at 9:56 AM, janardhan shetty 
>>> wrote:

 Hi,

 I was looking at the examples for Flink applications and the comment in
 quickstart/job results in 404 for the web page.

 http://flink.apache.org/docs/latest/examples.html

 This needs to be updated


>>>
>>


Re: 404 error for Flink examples

2016-03-11 Thread Maximilian Michels
Thanks for noticing, Janardhan. Fixed for the next release.

On Fri, Mar 11, 2016 at 6:38 AM, janardhan shetty
 wrote:
> Thanks Balaji.
>
> This needs to be updated in the Job.java file of quickstart application.
> I am using 1.0 version
>
> On Thu, Mar 10, 2016 at 9:23 PM, Balaji Rajagopalan
>  wrote:
>>
>> You could try this link.
>>
>>
>> https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/examples.html
>>
>> On Fri, Mar 11, 2016 at 9:56 AM, janardhan shetty 
>> wrote:
>>>
>>> Hi,
>>>
>>> I was looking at the examples for Flink applications and the comment in
>>> quickstart/job results in 404 for the web page.
>>>
>>> http://flink.apache.org/docs/latest/examples.html
>>>
>>> This needs to be updated
>>>
>>>
>>
>


kafka.javaapi.consumer.SimpleConsumer class not found

2016-03-11 Thread Balaji Rajagopalan
I am trying to use the flink kafka connector. For this I have specified the
kafka connector dependency and created a fat jar since default flink
installation does not contain kafka connector jars. I have made sure that
flink-streaming-demo-0.1.jar has the
kafka.javaapi.consumer.SimpleConsumer.class but still I see the class not
found exception.

The code for kafka connector in flink.

val env = StreamExecutionEnvironment.getExecutionEnvironment
val prop:Properties = new Properties()
prop.setProperty("zookeeper.connect","somezookeer:2181")
prop.setProperty("group.id","some-group")
prop.setProperty("bootstrap.servers","somebroker:9092")

val stream = env
  .addSource(new FlinkKafkaConsumer082[String]("location", new
SimpleStringSchema, prop))

jar tvf flink-streaming-demo-0.1.jar | grep
kafka.javaapi.consumer.SimpleConsumer

  5111 Fri Mar 11 14:18:36 UTC 2016 kafka/javaapi/consumer/SimpleConsumer.class

flink.version = 0.10.2
kafka.verison = 0.8.2
flink.kafka.connection.verion=0.9.1

The command that I use to run the flink program in yarn cluster is below,

HADOOP_CONF_DIR=/etc/hadoop/conf /usr/share/flink/bin/flink run -c
com.dataartisans.flink_demo.examples.DriverEventConsumer  -m yarn-cluster
-yn 2 /home/balajirajagopalan/flink-streaming-demo-0.1.jar

java.lang.NoClassDefFoundError: kafka/javaapi/consumer/SimpleConsumer

at
org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer.getPartitionsForTopic(FlinkKafkaConsumer.java:691)

at
org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer.<init>(FlinkKafkaConsumer.java:281)

at
org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer082.<init>(FlinkKafkaConsumer082.java:49)

at
com.dataartisans.flink_demo.examples.DriverEventConsumer$.main(DriverEventConsumer.scala:53)

at
com.dataartisans.flink_demo.examples.DriverEventConsumer.main(DriverEventConsumer.scala)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:497)

at
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:497)

at
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:395)

at org.apache.flink.client.program.Client.runBlocking(Client.java:252)

at
org.apache.flink.client.CliFrontend.executeProgramBlocking(CliFrontend.java:676)

at org.apache.flink.client.CliFrontend.run(CliFrontend.java:326)

at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:978)

at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1028)

Caused by: java.lang.ClassNotFoundException:
kafka.javaapi.consumer.SimpleConsumer

at java.net.URLClassLoader.findClass(URLClassLoader.java:381)

at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)

at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

... 16 more


Any help appreciated.


balaji


Re: Flink streaming throughput

2016-03-11 Thread おぎばやしひろのり
Aljoscha,

Thank you for your response.

I tried the case with no JSON parsing and no sink (DiscardingSink). The
throughput was 8,228 msg/sec, slightly better than the JSON + Elasticsearch case.
I also tried using socketTextStream instead of FlinkKafkaConsumer; in that
case, the result was 60,000 msg/sec with just 40% Flink TaskManager CPU usage
(the socket server was the bottleneck).
That was amazing, although Flink's fault tolerance feature is not
available with socketTextStream.

Regards,
Hironori

2016-03-08 21:36 GMT+09:00 Aljoscha Krettek :
> Hi,
> Another interesting test would be a combination of 3) and 2). I.e. no JSON 
> parsing and no sink. This would show what the raw throughput can be before 
> being slowed down by writing to Elasticsearch.
>
> Also .print() is also not feasible for production since it just prints every 
> element to the stdout log on the TaskManagers, which itself can cause quite a 
> slowdown. You could try:
>
> datastream.addSink(new DiscardingSink())
>
> which is a dummy sink that does nothing.
>
> Cheers,
> Aljoscha
>> On 08 Mar 2016, at 13:31, おぎばやしひろのり  wrote:
>>
>> Stephan,
>>
>> Sorry for the delay in my response.
>> I tried 3 cases you suggested.
>>
>> This time, I set parallelism to 1 for simplicity.
>>
>> 0) base performance (same as the first e-mail): 1,480msg/sec
>> 1) Disable checkpointing : almost same as 0)
>> 2) No ES sink. just print() : 1,510msg/sec
>> 3) JSON to TSV : 8,000msg/sec
>>
>> So, as you can see, the bottleneck was JSON parsing. I also want to
>> try eliminating Kafka to see
>> if there is room to improve performance. (Currently, I am using
>> FlinkKafkaConsumer082 with Kafka 0.9
>> I think I should try Flink 1.0 and FlinkKafkaConsumer09).
>> Anyway, I think 8,000msg/sec with 1 CPU is not so bad thinking of
>> Flink's scalability and fault tolerance.
>> Thank you for your advice.
>>
>> Regards,
>> Hironori Ogibayashi
>>
>> 2016-02-26 21:46 GMT+09:00 おぎばやしひろのり :
>>> Stephan,
>>>
>>> Thank you for your quick response.
>>> I will try and post the result later.
>>>
>>> Regards,
>>> Hironori
>>>
>>> 2016-02-26 19:45 GMT+09:00 Stephan Ewen :
 Hi!

 I would try and dig bit by bit into what the bottleneck is:

 1) Disable the checkpointing, see what difference that makes
 2) Use a dummy sink (discarding) rather than elastic search, to see if that
 is limiting
 3) Check the JSON parsing. Many JSON libraries are very CPU intensive and
 easily dominate the entire pipeline.

 Greetings,
 Stephan


 On Fri, Feb 26, 2016 at 11:23 AM, おぎばやしひろのり  wrote:
>
> Hello,
>
> I started evaluating Flink and tried simple performance test.
> The result was just about 4000 messages/sec with 300% CPU usage. I
> think this is quite low and wondering if it is a reasonable result.
> If someone could check it, it would be great.
>
> Here is the detail:
>
> [servers]
> - 3 Kafka broker with 3 partitions
> - 3 Flink TaskManager + 1 JobManager
> - 1 Elasticsearch
> All of them are separate VM with 8vCPU, 8GB memory
>
> [test case]
> The application counts access log by URI with in 1 minute window and
> send the result to Elasticsearch. The actual code is below.
> I used '-p 3' option to flink run command, so the task was distributed
> to 3 TaskManagers.
> In the test, I sent about 5000 logs/sec to Kafka.
>
> [result]
> - From Elasticsearch records, the total access count for all URI was
> about 260,000/min = 4300/sec. This is the entire throughput.
> - Kafka consumer lag was keep growing.
> - The CPU usage of each TaskManager machine was about 13-14%. From top
> command output, Flink java process was using 100%(1 CPU full)
>
> So I thought the bottleneck here was CPU used by Flink Tasks.
>
> Here is the application code.
> ---
>val env = StreamExecutionEnvironment.getExecutionEnvironment
>env.enableCheckpointing(1000)
> ...
>val stream = env
>  .addSource(new FlinkKafkaConsumer082[String]("kafka.dummy", new
> SimpleStringSchema(), properties))
>  .map{ json => JSON.parseFull(json).get.asInstanceOf[Map[String,
> AnyRef]] }
>  .map{ x => x.get("uri") match {
>case Some(y) => (y.asInstanceOf[String],1)
>case None => ("", 1)
>  }}
>  .keyBy(0)
>  .timeWindow(Time.of(1, TimeUnit.MINUTES))
>  .sum(1)
>  .map{ x => (System.currentTimeMillis(), x)}
>  .addSink(new ElasticsearchSink(config, transports, new
> IndexRequestBuilder[Tuple2[Long, Tuple2[String, Int]]]  {
>override def createIndexRequest(element: Tuple2[Long,
> Tuple2[String, Int]], ctx: RuntimeContext): IndexRequest = {
>  val json = new HashMap[String, AnyRef]
>