Re: Elasticsearch 5.x connection

2017-03-02 Thread Tzu-Li (Gordon) Tai
Hi, java.lang.NoSuchMethodError: org.apache.flink.util.InstantiationUtil.isSerializable(Ljava/lang/Object;)Z at org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkBase.<init>(ElasticsearchSinkBase.java:195) at

Flink using notebooks in EMR

2017-03-02 Thread Meghashyam Sandeep V
Hi there, Has anyone tried using flink interpreter in Zeppelin using AWS EMR? I tried creating a new interpreter using host as 'localhots' and port '6123' which didn't seem to work. Thanks, Sandeep

Re: Cross operation on two huge datasets

2017-03-02 Thread Xingcan Cui
Hi Gwen, in my view, indexing and searching are two isolated processes and they should be separated. Maybe you should take the RTree structure as a new dataset (fortunately it's static, right?) and store it in a distributed cache or DFS that can be accessed by operators from any node. That will
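
A minimal sketch of that approach, assuming the RTree has already been built and serialized to a file; the DFS path, the symbolic name "shapeIndex", and the point type are placeholders, not values from the thread:

    import java.io.File;

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.configuration.Configuration;

    public class DistributedCacheSketch {

        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Register the prebuilt, serialized RTree (stored on a DFS) under a symbolic name.
            env.registerCachedFile("hdfs:///indexes/shapes.rtree", "shapeIndex");

            DataSet<String> points = env.readTextFile("hdfs:///input/points");

            points.map(new RichMapFunction<String, String>() {
                private transient File indexFile;

                @Override
                public void open(Configuration parameters) throws Exception {
                    // Every task gets its own local copy of the cached file.
                    indexFile = getRuntimeContext().getDistributedCache().getFile("shapeIndex");
                    // ... deserialize the RTree from indexFile here ...
                }

                @Override
                public String map(String point) {
                    // ... look the point up in the deserialized RTree ...
                    return point;
                }
            }).print();
        }
    }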

Re: Serialization performance

2017-03-02 Thread Stephan Ewen
Hi! I can write some more details later; here's the short answer: - Your serializer would go into something like the AvroSerializer - When you instantiate the AvroSerializer in GenericTypeInfo.createSerializer(ExecutionConfig), you pre-register the type of the generic type, plus all types in

Re: Http Requests from Flink

2017-03-02 Thread Yassine MARZOUGUI
Hi Ulf, I've done HTTP requests in Flink using the OkHttp library. I found it practical and easy to use. Here is how I used it to make a POST request for each incoming element in the stream and output the response: DataStream stream = stream.map(new
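
A rough sketch of that pattern, assuming OkHttp 3 is on the classpath; the endpoint URL is a placeholder and the elements are taken to be JSON strings already:

    import okhttp3.MediaType;
    import okhttp3.OkHttpClient;
    import okhttp3.Request;
    import okhttp3.RequestBody;
    import okhttp3.Response;

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;

    public class HttpPostMap extends RichMapFunction<String, String> {

        private static final MediaType JSON = MediaType.parse("application/json; charset=utf-8");

        private transient OkHttpClient client;

        @Override
        public void open(Configuration parameters) {
            // OkHttpClient is not serializable, so build it per task in open().
            client = new OkHttpClient();
        }

        @Override
        public String map(String element) throws Exception {
            Request request = new Request.Builder()
                    .url("http://example.com/endpoint")      // placeholder URL
                    .post(RequestBody.create(JSON, element)) // POST the element as the request body
                    .build();
            try (Response response = client.newCall(request).execute()) {
                return response.body().string();             // emit the response downstream
            }
        }
    }

    // Usage (given an existing DataStream<String> named stream):
    //   DataStream<String> responses = stream.map(new HttpPostMap());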

RE: Serialization performance

2017-03-02 Thread Newport, Billy
This is what we’re using as our serializer: Somewhere: env.addDefaultKryoSerializer(Record.class, GRKryoSerializer.class); then public class GRKryoSerializer extends Serializer { static class AvroStuff { Schema schema; String comment; long
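
For readers without the full GRKryoSerializer, here is a self-contained sketch of the same idea (a Kryo serializer that delegates to Avro binary encoding), written against Avro's GenericRecord rather than the thread's generated Record class. Writing the schema next to every record is a simplification that the original avoids with its AvroStuff schema cache:

    import java.io.ByteArrayOutputStream;

    import com.esotericsoftware.kryo.Kryo;
    import com.esotericsoftware.kryo.Serializer;
    import com.esotericsoftware.kryo.io.Input;
    import com.esotericsoftware.kryo.io.Output;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    public class AvroGenericKryoSerializer extends Serializer<GenericRecord> {

        @Override
        public void write(Kryo kryo, Output output, GenericRecord record) {
            try {
                Schema schema = record.getSchema();
                output.writeString(schema.toString());
                ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(bytes, null);
                new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
                encoder.flush();
                byte[] payload = bytes.toByteArray();
                output.writeInt(payload.length);
                output.writeBytes(payload);
            } catch (Exception e) {
                throw new RuntimeException("Avro serialization failed", e);
            }
        }

        @Override
        public GenericRecord read(Kryo kryo, Input input, Class<GenericRecord> type) {
            try {
                Schema schema = new Schema.Parser().parse(input.readString());
                int length = input.readInt();
                byte[] payload = input.readBytes(length);
                BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
                return new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
            } catch (Exception e) {
                throw new RuntimeException("Avro deserialization failed", e);
            }
        }
    }

    // Registered the same way as in the thread:
    //   env.addDefaultKryoSerializer(GenericRecord.class, AvroGenericKryoSerializer.class);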

Re: Serialization performance

2017-03-02 Thread Stephan Ewen
Hi! Thanks for this writeup, very cool stuff! For part (1) - serialization: I think that can be made a bit nicer. Avro is a bit of an odd citizen in Flink, because Flink serialization is actually schema aware, but does not integrate with Avro. That's why Avro types go through Kryo. We should

Serialization performance

2017-03-02 Thread Newport, Billy
We've been working on performance for a while now. We're using Flink 1.2 right now. We are writing batch jobs which process Avro and Parquet input files and produce Parquet files. Flink serialization costs seem to be the most significant aspect of our wall clock time. We have written a

Re: Cross operation on two huge datasets

2017-03-02 Thread Jain, Ankit
If I understood correctly, you have just implemented Flink broadcasting by hand ☺. You are still sending out the whole points dataset to each shape partition – right? I think this could be optimized by using a keyBy or custom partitioner which is common across shapes & points – that should make
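
As a sketch of what that could look like with the DataSet API, assuming both datasets carry a precomputed grid-cell id as field 0 (an assumption, not something from the thread), a shared Partitioner routes matching cells to the same subtask:

    import org.apache.flink.api.common.functions.Partitioner;

    // Shared partitioner keyed on the grid-cell id (field 0 of both datasets).
    public class GridPartitioner implements Partitioner<Long> {
        @Override
        public int partition(Long cellId, int numPartitions) {
            return (int) (Math.abs(cellId) % numPartitions);
        }
    }

    // Shapes and points with the same cell id end up in the same partition, so the
    // cross only has to look at co-located elements:
    //   shapes.partitionCustom(new GridPartitioner(), 0);
    //   points.partitionCustom(new GridPartitioner(), 0);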

Re: How to write a Sink that needs to deal with conditions in thread-safe way?

2017-03-02 Thread Stephan Ewen
Hi! (1) RichSinkFunction is the best function for streaming sinks. (2) The "invoke()" method is never called by multiple threads concurrently. No need to synchronize. Stephan
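
Putting both answers together, a minimal batching sink sketch; flushBatch() is a hypothetical stand-in for the actual write to the external system:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

    public class BatchingSink extends RichSinkFunction<String> {

        private static final int BATCH_SIZE = 100;

        private transient List<String> buffer;

        @Override
        public void open(Configuration parameters) {
            buffer = new ArrayList<>();
        }

        @Override
        public void invoke(String value) throws Exception {
            // invoke() is called by a single thread per subtask, so no locking is needed.
            buffer.add(value);
            if (buffer.size() >= BATCH_SIZE) {
                flushBatch();
            }
        }

        @Override
        public void close() throws Exception {
            flushBatch(); // push out whatever is left when the task shuts down
        }

        private void flushBatch() {
            // ... write the buffered elements to the external system here ...
            buffer.clear();
        }
    }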

How to write a Sink that needs to deal with conditions in thread-safe way?

2017-03-02 Thread Hussein Baghdadi
Hello, I have some basic questions regarding Sinks in Flink. 1) To implement our own Sink, which class do we implement: RichSinkFunction, RichOutputFormat, etc.? 2) We are trying to write batches in our Sink. For that, in the overridden invoke(), we are calling a synchronised function: // events

Re: Data stream to write to multiple rds instances

2017-03-02 Thread Sathi Chowdhury
Hi Till, Thanks for your reply. I guess I will have to write a custom sink function that will use JdbcOutputFormat. I have a question about checkpointing support though: if I am reading a stream from Kinesis, streamA, and it is transformed to streamB, and that is written to db, as streamB is

RE: Cross operation on two huge datasets

2017-03-02 Thread Gwenhael Pasquiers
I made it so that I don't care where the next operator will be scheduled. I configured taskslots = 1 and parallelism = number of yarn nodes so that: (1) each node contains 1/Nth of the shapes (a simple repartition() of the shapes dataset); (2) the points will be cloned so that each partition

Elasticsearch 5.x connection

2017-03-02 Thread Fábio Dias
Hi, Last Friday I was running Elasticsearch 5.x with Flink 1.2.0. In the pom.xml I added this dependency: org.apache.flink flink-connector-elasticsearch-base_2.10 1.3-SNAPSHOT org.elasticsearch elasticsearch. And I added two files: the ElasticsearchSink.java and

Re: Cross operation on two huge datasets

2017-03-02 Thread Till Rohrmann
Yes you’re right about the “split” and broadcasting. Storing it in the JVM is not a good approach, since you don’t know where Flink will schedule the new operator instance. It might be the case that an operator responsible for another partition gets scheduled to this JVM and then it has the wrong

RE: Cross operation on two huge datasets

2017-03-02 Thread Gwenhael Pasquiers
The best for me would be to make it “persist” inside of the JVM heap in some map since I don’t even know if the structure is Serializable (I could try). But I understand. As for broadcasting, wouldn’t broadcasting the variable cancel the efforts I did to “split” the dataset parsing over the

Re: Connecting workflows in batch

2017-03-02 Thread Till Rohrmann
Hi Mohit, StreamExecutionEnvironment.execute() will only return, giving you the JobExecutionResult, after the job has reached a final state. If that works for you to schedule the second job, then it should be ok to combine both jobs in one program and execute the second job after the first one has
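
A small sketch of that pattern with both pipelines in one program; the sources and job names are placeholders:

    import org.apache.flink.api.common.JobExecutionResult;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class TwoJobsSketch {

        public static void main(String[] args) throws Exception {
            // --- first job ---
            StreamExecutionEnvironment firstEnv = StreamExecutionEnvironment.getExecutionEnvironment();
            firstEnv.fromElements(1, 2, 3).print();
            // execute() blocks until the job reaches a final state, then returns its result.
            JobExecutionResult firstResult = firstEnv.execute("first job");
            System.out.println("first job ran for " + firstResult.getNetRuntime() + " ms");

            // --- second job, built and submitted only after the first one has finished ---
            StreamExecutionEnvironment secondEnv = StreamExecutionEnvironment.getExecutionEnvironment();
            secondEnv.fromElements("a", "b", "c").print();
            secondEnv.execute("second job");
        }
    }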

Re: Data stream to write to multiple rds instances

2017-03-02 Thread Till Rohrmann
Hi Sathi, you can split, select, or filter your data stream based on the field's value. Then you are able to obtain multiple data streams which you can output using a JDBCOutputFormat for each data stream. Be aware, however, that the JDBCOutputFormat does not give you any processing guarantees
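
A sketch of that combination, assuming an existing DataStream<Row> named events whose field 0 holds the routing key; the driver, endpoints, and INSERT statement are placeholders, and (as noted above) JDBCOutputFormat gives no processing guarantees:

    import java.util.Collections;

    import org.apache.flink.api.java.io.jdbc.JDBCOutputFormat;
    import org.apache.flink.streaming.api.collector.selector.OutputSelector;
    import org.apache.flink.streaming.api.datastream.SplitStream;
    import org.apache.flink.types.Row;

    // Tag each record with the RDS instance it belongs to, based on field 0.
    SplitStream<Row> split = events.split(new OutputSelector<Row>() {
        @Override
        public Iterable<String> select(Row row) {
            return Collections.singletonList("east".equals(row.getField(0)) ? "east" : "west");
        }
    });

    split.select("east").writeUsingOutputFormat(
            JDBCOutputFormat.buildJDBCOutputFormat()
                    .setDrivername("com.mysql.jdbc.Driver")
                    .setDBUrl("jdbc:mysql://rds-east.example.com:3306/db")  // placeholder endpoint
                    .setQuery("INSERT INTO events (tenant, payload) VALUES (?, ?)")
                    .finish());

    split.select("west").writeUsingOutputFormat(
            JDBCOutputFormat.buildJDBCOutputFormat()
                    .setDrivername("com.mysql.jdbc.Driver")
                    .setDBUrl("jdbc:mysql://rds-west.example.com:3306/db")  // placeholder endpoint
                    .setQuery("INSERT INTO events (tenant, payload) VALUES (?, ?)")
                    .finish());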

Re: NPE while writing to s3://

2017-03-02 Thread Till Rohrmann
Hi Sathi, which version of Flink are you using? Since Flink 1.2 the RollingSink is deprecated. It is now recommended to use the BucketingSink instead. Maybe this problem is resolved with the newer sink. Cheers, Till
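
For comparison with the RollingSink snippet in the original post, a BucketingSink version might look roughly like this (the date pattern and batch size are assumptions, not values from the thread; stream is an existing DataStream<String>):

    import org.apache.flink.streaming.connectors.fs.StringWriter;
    import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
    import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;

    BucketingSink<String> s3Sink = new BucketingSink<>("s3://sc-sink1/");
    s3Sink.setBucketer(new DateTimeBucketer<String>("yyyy-MM-dd--HHmm"));
    s3Sink.setWriter(new StringWriter<String>());
    s3Sink.setBatchSize(1024 * 1024 * 64); // roll the part file after roughly 64 MB
    stream.addSink(s3Sink);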

Re: Cross operation on two huge datasets

2017-03-02 Thread Till Rohrmann
Hi Gwenhael, if you want to persist operator state, then you would have to write it out (e.g. to a shared directory, or by emitting the model and using one of Flink's sinks) and, when creating the new operators, you have to reread it from there (usually in the open method or from a Flink source

Re: Http Requests from Flink

2017-03-02 Thread Alex De Castro
Hi Ulf, I've had a similar problem before, but from a sink perspective: I had to create an HTTP sink for a Kafka REST API. I used scalaj-http (https://github.com/scalaj/scalaj-http), which is a wrapper for the corresponding Java lib. For example

Re: unclear exception when writing to elasticsearch

2017-03-02 Thread Martin Neumann
Hej, I finally found out what the problem was. I had added another dependency that was necessary to run things on Hops, and for some reason it broke things. When I remove it, everything works fine. I'm talking to the Hops guys about it to understand what's going on. Thanks for the help. Cheers, Martin

RE: Cross operation on two huge datasets

2017-03-02 Thread Gwenhael Pasquiers
I (almost) made it work the following way: 1st job: read all the shapes, repartition() them equally on my N nodes, then on each node fill a static RTree (thanks for the tip). 2nd job: read all the points, use a flatMap + custom partitioner to "clone" the dataset to all nodes, then apply a
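
The 2nd job's "clone" step could look roughly like this; points is assumed to be a DataSet<String> and numPartitions a placeholder matching the job parallelism:

    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.common.functions.Partitioner;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.util.Collector;

    final int numPartitions = 4; // placeholder: number of task slots / nodes

    // Emit one copy of every point per target partition.
    DataSet<Tuple2<Integer, String>> cloned = points.flatMap(
            new FlatMapFunction<String, Tuple2<Integer, String>>() {
                @Override
                public void flatMap(String point, Collector<Tuple2<Integer, String>> out) {
                    for (int p = 0; p < numPartitions; p++) {
                        out.collect(new Tuple2<>(p, point));
                    }
                }
            });

    // Route each copy to its target partition, where the locally built RTree can probe it.
    DataSet<Tuple2<Integer, String>> distributed = cloned.partitionCustom(
            new Partitioner<Integer>() {
                @Override
                public int partition(Integer target, int parallelism) {
                    return target % parallelism;
                }
            }, 0);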

Re: kinesis producer setCustomPartitioner use stream's own data

2017-03-02 Thread Sathi Chowdhury
Thanks Gordon, it was simple to resolve. Best, Sathi

NPE while writing to s3://

2017-03-02 Thread Sathi Chowdhury
I get the NPE from the below code. I am running this from my Mac in a local Flink cluster. RollingSink s3Sink = new RollingSink("s3://sc-sink1/"); s3Sink.setBucketer(new DateTimeBucketer("-MM-dd--HHmm")); s3Sink.setWriter(new StringWriter()); s3Sink.setBatchSize(200);