Re: checkpoint notifier not found?

2016-12-12 Thread Abhishek R. Singh
https://issues.apache.org/jira/browse/FLINK-5323 
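
For reference, a minimal sketch of a sink using the
org.apache.flink.runtime.state.CheckpointListener interface referenced below
(the class name and the invoke() body are placeholders, not code from this thread):

import org.apache.flink.runtime.state.CheckpointListener;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Hypothetical sink that is notified when a checkpoint completes.
public class AuditingSink extends RichSinkFunction<String> implements CheckpointListener {

    @Override
    public void invoke(String value) throws Exception {
        // write the record to the external system here
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) throws Exception {
        // called once the checkpoint with the given id has been completed
        System.out.println("Checkpoint " + checkpointId + " completed");
    }
}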


> On Dec 12, 2016, at 5:37 AM, Till Rohrmann  wrote:
> 
> Hi Abhishek,
> 
> great to hear that you like to become part of the Flink community. Here are 
> some information for how to contribute [1].
> 
> [1] http://flink.apache.org/how-to-contribute.html 
> 
> 
> Cheers,
> Till
> 
> On Mon, Dec 12, 2016 at 12:36 PM, Abhishek Singh wrote:
> Will be happy to. Could you guide me a bit in terms of what I need to do?
> 
> I am a newbie to open source contributing. And currently at Frankfurt 
> airport. When I hit ground will be happy to contribute back. Love the project 
> !!
> 
> Thanks for the awesomeness. 
> 
> 
> On Mon, Dec 12, 2016 at 12:29 PM Stephan Ewen wrote:
> Thanks for reporting this.
> It would be awesome if you could file a JIRA or a pull request for fixing the 
> docs for that.
> 
> On Sat, Dec 10, 2016 at 2:02 AM, Abhishek R. Singh wrote:
> I was following the official documentation: 
> https://ci.apache.org/projects/flink/flink-docs-release-1.1/apis/streaming/state.html
>  
> 
> 
> Looks like this is the right one to be using: import 
> org.apache.flink.runtime.state.CheckpointListener;
> 
> -Abhishek-
> 
>> On Dec 9, 2016, at 4:30 PM, Abhishek R. Singh wrote:
>> 
>> I can’t seem to find CheckpointNotifier. Appreciate help !
>> 
>> CheckpointNotifier is not a member of package 
>> org.apache.flink.streaming.api.checkpoint
>> 
>> From my pom.xml:
>> 
>> <dependency>
>>   <groupId>org.apache.flink</groupId>
>>   <artifactId>flink-scala_2.11</artifactId>
>>   <version>1.1.3</version>
>> </dependency>
>> <dependency>
>>   <groupId>org.apache.flink</groupId>
>>   <artifactId>flink-streaming-scala_2.11</artifactId>
>>   <version>1.1.3</version>
>> </dependency>
>> <dependency>
>>   <groupId>org.apache.flink</groupId>
>>   <artifactId>flink-clients_2.11</artifactId>
>>   <version>1.1.3</version>
>> </dependency>
>> <dependency>
>>   <groupId>org.apache.flink</groupId>
>>   <artifactId>flink-statebackend-rocksdb_2.11</artifactId>
>>   <version>1.1.3</version>
>> </dependency>
> 
> 
> 



use case for sliding windows

2016-12-12 Thread Meghashyam Sandeep V
Hi There,

I have a streaming job whose source is Kafka and whose sink is Cassandra. I
have a use case where I wouldn't want to write some events to Cassandra
when there are more than 100 events for a given 'id' (a field in my Pojo) in
5 minutes. Is this a good use case for sliding windows? Can I get the sliding
count for each key and then decide whether to add it to the sink or not?

Thanks,
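
A minimal sketch of one way to read this: key by the id, count events per key in
a 5-minute sliding window, and only forward the window's events when the count
stays at or below 100 (the Event Pojo, its 'id' field, and the 1-minute slide are
assumptions, not details from the question):

import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// events is the DataStream<Event> coming from the Kafka source
DataStream<Event> toCassandra = events
    .keyBy("id")
    .window(SlidingProcessingTimeWindows.of(Time.minutes(5), Time.minutes(1)))
    .apply(new WindowFunction<Event, Event, Tuple, TimeWindow>() {
        @Override
        public void apply(Tuple key, TimeWindow window, Iterable<Event> values,
                          Collector<Event> out) {
            int count = 0;
            for (Event ignored : values) {
                count++;
            }
            // drop the whole window for this id if it saw more than 100 events
            if (count <= 100) {
                for (Event e : values) {
                    out.collect(e);
                }
            }
        }
    });
// toCassandra.addSink(...) as before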


Re: Reg. custom sinks in Flink

2016-12-12 Thread Chesnay Schepler

Hello,

the query is generated automatically from the pojo by the datastax 
MappingManager in the CassandraPojoSink; Flink isn't generating anything 
itself.


On the MappingManager you can set the TTL for all queries (it also 
allows some other stuff). So, to allow the user to set the TTL we must 
add a hook
to configure the MappingManager; this can be done the same way the 
Cluster is configured using the ClusterBuilder.


Regards,
Chesnay
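
For illustration, this is roughly what such a hook would have to expose from the
underlying driver (a sketch against the DataStax Java driver's mapping API, not
Flink code; MyPojo, the keyspace and the 24-hour TTL are placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.mapping.Mapper;
import com.datastax.driver.mapping.MappingManager;

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect("my_keyspace");

MappingManager manager = new MappingManager(session);
Mapper<MyPojo> mapper = manager.mapper(MyPojo.class);

// every save issued through this mapper now carries a TTL of 24 hours
mapper.setDefaultSaveOptions(Mapper.Option.ttl(86400));
mapper.save(pojo);  // 'pojo' is an instance of the annotated POJO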

On 12.12.2016 19:12, Meghashyam Sandeep V wrote:
Thank you Till. I wanted to contribute towards Flink. Looks like this 
could be a good start. I couldn't find the place where the insert 
query is built for Pojo sinks in CassandraSink.java, 
CassandraPojoSink.java, or CassandraSinkBase.java. Could you throw 
some light about how that insert query is built automatically by the 
sink?


Thanks,

On Mon, Dec 12, 2016 at 7:56 AM, Till Rohrmann > wrote:


(1) A subtask is a parallel instance of an operator and thus
responsible for a partition (possibly infinite) of the whole
DataStream/DataSet.

(2) Maybe you can add this feature to Flink's Cassandra Sink.

Cheers,
Till

On Mon, Dec 12, 2016 at 4:30 PM, Meghashyam Sandeep V
> wrote:

Data piles up in Cassandra without TTL. Is there a workaround
for this problem? Is there a way to specify my query and still
use Pojo?

Thanks,

On Mon, Dec 12, 2016 at 7:26 AM, Chesnay Schepler
> wrote:

Regarding 2) I don't think so. That would require access
to the datastax MappingManager.
We could add something similar as the ClusterBuilder for
that though.

Regards,
Chesnay


On 12.12.2016 16:15, Meghashyam Sandeep V wrote:

Hi Till,

Thanks for the information.

1. What do you mean by 'subtask', is it every partition
or every message in the stream?

2. I tried using CassandraSink with a Pojo. Is there a
way to specify TTL as I can't use a query when I have a
datastream with Pojo?

CassandraSink.addSink(messageStream)
  .setClusterBuilder(new ClusterBuilder() {
  @Override protected Cluster 
buildCluster(Cluster.Builder builder) {
  return buildCassandraCluster();
  }
  })
  .build();
Thanks,
On Mon, Dec 12, 2016 at 3:17 AM, Till Rohrmann
> wrote:

Hi Meghashyam,

1.

You can perform initializations in the open
method of the |RichSinkFunction| interface. The
|open| method will be called once for every sub
task when initializing it. If you want to share
the resource across multiple sub tasks running in
the same JVM you can also store the |dbSession|
in a class variable.

2.

The Flink community is currently working on
adding security support including ssl encryption
to Flink. So maybe in the future you can use
Flink’s Cassandra sink again.

Cheers, Till

​
On Fri, Dec 9, 2016 at 8:05 PM, Meghashyam Sandeep V
> wrote:

Thanks a lot for the quick reply Shannon.
1. I will create a class that extends
SinkFunction and write my connection logic there.
My only question here is- will a dbSession be
created for each message/partition which might
affect the performance? Thats the reason why I
added this line to create a connection once and
use it along the datastream. if(dbSession == null
&& store!=null) { dbSession = getSession();}
2. I couldn't use flink-connector-cassandra as I
have SSL enabled for my C* cluster and I couldn't
get it work with all my SSL
config(truststore,keystore etc) added to cluster
building. I didn't find a proper example with SSL
enabled flink-connector-cassandra
Thanks
On Fri, Dec 9, 2016 at 10:54 AM, Shannon Carey
>
wrote:

You haven't really added a sink 

Flink 1.1.3 RollingSink - mismatch in the number of records consumed/produced

2016-12-12 Thread Dominik Safaric
Hi everyone,

As I’ve implemented a RollingSink writing messages consumed from a Kafka log,
I’ve observed that there is a significant mismatch between the number of messages
consumed and the number written to the file system.

Namely, the consumed Kafka topic contains 1.000.000 messages in total. The
topology does not perform any data transformation whatsoever; instead,
data from the source is pushed straight to the RollingSink.

After I’ve checksummed the output files, I’ve observed that the total number of
messages written to the output files is greater than 7.000.000 - a difference of
more than 6.000.000 records compared to what was consumed/available.

What is the cause of this behaviour? 

Regards,
Dominik   

Re: How to retrieve values from yarn.taskmanager.env in a Job?

2016-12-12 Thread Shannon Carey
Hi Till,

Yes, System.getenv() was the first thing I tried. It'd be great if someone else 
can reproduce the issue, but for now I'll submit a JIRA with the assumption 
that it really is not working right. 
https://issues.apache.org/jira/browse/FLINK-5322

-Shannon

From: Till Rohrmann >
Date: Monday, December 12, 2016 at 7:21 AM
To: >
Subject: Re: How to retrieve values from yarn.taskmanager.env in a Job?


Hi Shannon,

have you tried accessing the environment variables via System.getenv()? This 
should give you a map of string-string key value pairs where the key is the 
environment variable name.

If your values are not set in the returned map, then this indicates a bug in 
Flink and it would be great if you could open a JIRA issue.

Cheers,
Till

​

On Fri, Dec 9, 2016 at 7:33 PM, Shannon Carey 
> wrote:
This thread 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/passing-environment-variables-to-flink-program-td3337.html
 describes the impetus for the addition of yarn.taskmanager.env.

I have configured a value within yarn.taskmanager.env, and I see it appearing 
in the Flink web UI in the list underneath Job Manager -> Configuration. 
However, I can't figure out how to retrieve the value from within a Flink job. 
It doesn't appear in the environment, the system properties, or my 
ParameterTool instance, and I can't figure out how I would get to it via the 
StreamExecutionEnvironment. Can anyone point me in the right direction?

All I want to do is inform my Flink jobs which environment they're running on, 
so that programmers don't have to specify the environment as a job parameter 
every time they run it. I also see that there is a "env.java.opts" 
configuration… does that work in YARN apps (would my jobs be able to see it?)

Thanks!
Shannon



Re: How to retrieve values from yarn.taskmanager.env in a Job?

2016-12-12 Thread Shannon Carey
Hi Chesnay,

Since that configuration option is supposed to apply the environment variables 
to the task managers, I figured it would definitely be available within the 
stream operators. I'm not sure whether the job plan runs within a task manager 
or not, but hopefully it does?

In my particular code, I want to get the name of the environment in order to 
read the correct configuration file(s) so that properly populated config 
objects can be passed to various operators. Therefore, it would be sufficient 
for the job plan execution to have access to the environment. All the operators 
are capable of persisting any necessary configuration through serialization.

It really can work either way, but I think it'd be easiest if it was available 
everywhere. If it's only available during job planning then you have to make 
sure to serialize it everywhere you need it, and if it's only available during 
operator execution then it's less straightforward to do central configuration 
work. Either way it's lying in wait for a programmer to forget where it's 
accessible vs. not.

-Shannon

From: Chesnay Schepler >
Date: Monday, December 12, 2016 at 7:36 AM
To: >
Subject: Re: How to retrieve values from yarn.taskmanager.env in a Job?

Hello,

can you clarify one small thing for me: Do you want to access this parameter 
when you define the plan
(aka when you call methods on the StreamExecutionEnvironment or DataStream 
instances)
or from within your functions/operators?

Regards,
Chesnay Schepler


On 12.12.2016 14:21, Till Rohrmann wrote:

Hi Shannon,

have you tried accessing the environment variables via System.getenv()? This 
should give you a map of string-string key value pairs where the key is the 
environment variable name.

If your values are not set in the returned map, then this indicates a bug in 
Flink and it would be great if you could open a JIRA issue.

Cheers,
Till

​

On Fri, Dec 9, 2016 at 7:33 PM, Shannon Carey 
> wrote:
This thread 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/passing-environment-variables-to-flink-program-td3337.html
 describes the impetus for the addition of yarn.taskmanager.env.

I have configured a value within yarn.taskmanager.env, and I see it appearing 
in the Flink web UI in the list underneath Job Manager -> Configuration. 
However, I can't figure out how to retrieve the value from within a Flink job. 
It doesn't appear in the environment, the system properties, or my 
ParameterTool instance, and I can't figure out how I would get to it via the 
StreamExecutionEnvironment. Can anyone point me in the right direction?

All I want to do is inform my Flink jobs which environment they're running on, 
so that programmers don't have to specify the environment as a job parameter 
every time they run it. I also see that there is a "env.java.opts" 
configuration… does that work in YARN apps (would my jobs be able to see it?)

Thanks!
Shannon




Re: Reg. custom sinks in Flink

2016-12-12 Thread Meghashyam Sandeep V
Thank you Till. I wanted to contribute towards Flink. Looks like this could
be a good start. I couldn't find the place where the insert query is built
for Pojo sinks in CassandraSink.java, CassandraPojoSink.java, or
CassandraSinkBase.java. Could you shed some light on how that insert
query is built automatically by the sink?

Thanks,

On Mon, Dec 12, 2016 at 7:56 AM, Till Rohrmann  wrote:

> (1) A subtask is a parallel instance of an operator and thus responsible
> for a partition (possibly infinite) of the whole DataStream/DataSet.
>
> (2) Maybe you can add this feature to Flink's Cassandra Sink.
>
> Cheers,
> Till
>
> On Mon, Dec 12, 2016 at 4:30 PM, Meghashyam Sandeep V <
> vr1meghash...@gmail.com> wrote:
>
>> Data piles up in Cassandra without TTL. Is there a workaround for this
>> problem? Is there a way to specify my query and still use Pojo?
>>
>> Thanks,
>>
>> On Mon, Dec 12, 2016 at 7:26 AM, Chesnay Schepler 
>> wrote:
>>
>>> Regarding 2) I don't think so. That would require access to the datastax
>>> MappingManager.
>>> We could add something similar as the ClusterBuilder for that though.
>>>
>>> Regards,
>>> Chesnay
>>>
>>>
>>> On 12.12.2016 16:15, Meghashyam Sandeep V wrote:
>>>
>>> Hi Till,
>>>
>>> Thanks for the information.
>>>
>>> 1. What do you mean by 'subtask', is it every partition or every message
>>> in the stream?
>>>
>>> 2. I tried using CassandraSink with a Pojo. Is there a way to specify
>>> TTL as I can't use a query when I have a datastream with Pojo?
>>>
>>> CassandraSink.addSink(messageStream)
>>>  .setClusterBuilder(new ClusterBuilder() {
>>>  @Override protected Cluster 
>>> buildCluster(Cluster.Builder builder) {
>>>  return buildCassandraCluster();
>>>  }
>>>  })
>>>  .build();
>>>
>>> Thanks,
>>>
>>>
>>> On Mon, Dec 12, 2016 at 3:17 AM, Till Rohrmann 
>>> wrote:
>>>
 Hi Meghashyam,

1.

You can perform initializations in the open method of the
RichSinkFunction interface. The open method will be called once for
every sub task when initializing it. If you want to share the resource
across multiple sub tasks running in the same JVM you can also store the
dbSession in a class variable.
2.

The Flink community is currently working on adding security support
including ssl encryption to Flink. So maybe in the future you can use
Flink’s Cassandra sink again.

 Cheers,
 Till
 ​

 On Fri, Dec 9, 2016 at 8:05 PM, Meghashyam Sandeep V <
 vr1meghash...@gmail.com> wrote:

> Thanks a lot for the quick reply Shannon.
>
> 1. I will create a class that extends SinkFunction and write my
> connection logic there. My only question here is- will a dbSession be
> created for each message/partition which might affect the performance?
> Thats the reason why I added this line to create a connection once and use
> it along the datastream. if(dbSession == null && store!=null) { dbSession
> = getSession();}
> 2. I couldn't use flink-connector-cassandra as I have SSL enabled for
> my C* cluster and I couldn't get it work with all my SSL
> config(truststore,keystore etc) added to cluster building. I didn't find a
> proper example with SSL enabled flink-connector-cassandra
>
>
> Thanks
>
>
>
>
> On Fri, Dec 9, 2016 at 10:54 AM, Shannon Carey 
> wrote:
>
>> You haven't really added a sink in Flink terminology, you're just
>> performing a side effect within a map operator. So while it may work, if
>> you want to add a sink proper you need have an object that extends
>> SinkFunction or RichSinkFunction. The method call on the stream should be
>> ".addSink(…)".
>>
>> Also, the dbSession isn't really Flink state as it will not vary
>> based on the position in or content in the stream. It's a necessary 
>> helper
>> object, yes, but you don't need Flink to checkpoint it.
>>
>> You can still use the sinks provided with flink-connector-cassandra
>> and customize the cluster building by passing your own ClusterBuilder 
>> into
>> the constructor.
>>
>> -Shannon
>>
>> From: Meghashyam Sandeep V < 
>> vr1meghash...@gmail.com>
>> Date: Friday, December 9, 2016 at 12:26 PM
>> To: < user@flink.apache.org>, <
>> d...@flink.apache.org>
>> Subject: Reg. custom sinks in Flink
>>
>> Hi there,
>>
>> I have a flink streaming app where my source is Kafka and a custom
>> sink to Cassandra(I can't use standard C* sink that comes with flink as I
>> have customized auth to C*). I'm currently have the following:
>>
>> messageStream
>> .rebalance()

Re: Incremental aggregations - Example not working

2016-12-12 Thread Matt
Err, I meant if I'm not wrong *

On Mon, Dec 12, 2016 at 2:02 PM, Matt  wrote:

> I just checked with version 1.1.3 and it works fine, the problem is that
> in that version we can't use Kafka 0.10 if I'm not work. Thank you for the
> workaround!
>
> Best,
> Matt
>
> On Mon, Dec 12, 2016 at 1:52 PM, Yassine MARZOUGUI <
> y.marzou...@mindlytix.com> wrote:
>
>> Yes, it was suppoed to work. I looked into this, and as Chesnay said,
>> this is a bug in the fold function. I opened an issue in JIRA :
>> https://issues.apache.org/jira/browse/FLINK-5320, and will fix it very
>> soon, thank you for reporting it.
>> In the mean time you can workaround the problem by specifying the
>> TypeInformation along with the fold function as follows : fold(ACC,
>> FoldFunction, WindowFunction, foldAccumulatorType, resultType). In the
>> example, the foldAccumulatorType is new TupleTypeInfo> Long, Integer>>(), and the resultType is also new
>> TupleTypeInfo>().
>>
>> Best,
>> Yassine
>>
>> 2016-12-12 16:38 GMT+01:00 Matt :
>>
>>> I'm using 1.2-SNAPSHOT, should it work in that version?
>>>
>>> On Mon, Dec 12, 2016 at 12:18 PM, Yassine MARZOUGUI <
>>> y.marzou...@mindlytix.com> wrote:
>>>
 Hi Matt,

 What version of Flink are you using?
 The incremental agregation with fold(ACC, FoldFunction, WindowFunction)
 in a new change that will be part of Flink 1.2, for Flink 1.1 the correct
 way to perform incrementation aggregations is : apply(ACC,
 FoldFunction, WindowFunction) (see the docs for 1.1 [1])

 [1] : https://ci.apache.org/projects/flink/flink-docs-release-1.
 1/apis/streaming/windows.html#windowfunction-with-incrementa
 l-aggregation

 Best,
 Yassine

 2016-12-12 15:37 GMT+01:00 Chesnay Schepler :

> Hello Matt,
>
> This looks like a bug in the fold() function to me.
>
> I'm adding Timo to the discussion, he can probably shed some light on
> this.
>
> Regards,
> Chesnay
>
>
> On 12.12.2016 15:13, Matt wrote:
>
> In case this is important, if I remove the WindowFunction, and only
> use the FoldFunction it works fine.
>
> I don't see what is wrong...
>
> On Mon, Dec 12, 2016 at 10:53 AM, Matt  wrote:
>
>> Hi,
>>
>> I'm following the documentation [1] of window functions with
>> incremental aggregations, but I'm getting an "input mismatch" error.
>>
>> The code [2] is almost identical to the one in the documentation, at
>> the bottom you can find the exact error.
>>
>> What am I missing? Can you provide a working example of a fold
>> function with both a FoldFunction and a WindowFunction?
>>
>> Regards,
>> Matt
>>
>> [1] https://ci.apache.org/projects/flink/flink-docs-master/dev/w
>> indows.html#windowfunction-with-incremental-aggregation
>>
>> [2] https://gist.github.com/cc7ed5570e4ce30c3a482ab835e3983d
>>
>
>
>

>>>
>>
>


Re: Incremental aggregations - Example not working

2016-12-12 Thread Matt
I just checked with version 1.1.3 and it works fine, the problem is that in
that version we can't use Kafka 0.10 if I'm not work. Thank you for the
workaround!

Best,
Matt

On Mon, Dec 12, 2016 at 1:52 PM, Yassine MARZOUGUI <
y.marzou...@mindlytix.com> wrote:

> Yes, it was suppoed to work. I looked into this, and as Chesnay said,
> this is a bug in the fold function. I opened an issue in JIRA :
> https://issues.apache.org/jira/browse/FLINK-5320, and will fix it very
> soon, thank you for reporting it.
> In the mean time you can workaround the problem by specifying the
> TypeInformation along with the fold function as follows : fold(ACC,
> FoldFunction, WindowFunction, foldAccumulatorType, resultType). In the
> example, the foldAccumulatorType is new TupleTypeInfo Long, Integer>>(), and the resultType is also new
> TupleTypeInfo>().
>
> Best,
> Yassine
>
> 2016-12-12 16:38 GMT+01:00 Matt :
>
>> I'm using 1.2-SNAPSHOT, should it work in that version?
>>
>> On Mon, Dec 12, 2016 at 12:18 PM, Yassine MARZOUGUI <
>> y.marzou...@mindlytix.com> wrote:
>>
>>> Hi Matt,
>>>
>>> What version of Flink are you using?
>>> The incremental agregation with fold(ACC, FoldFunction, WindowFunction)
>>> in a new change that will be part of Flink 1.2, for Flink 1.1 the correct
>>> way to perform incrementation aggregations is : apply(ACC,
>>> FoldFunction, WindowFunction) (see the docs for 1.1 [1])
>>>
>>> [1] : https://ci.apache.org/projects/flink/flink-docs-release-1.
>>> 1/apis/streaming/windows.html#windowfunction-with-incrementa
>>> l-aggregation
>>>
>>> Best,
>>> Yassine
>>>
>>> 2016-12-12 15:37 GMT+01:00 Chesnay Schepler :
>>>
 Hello Matt,

 This looks like a bug in the fold() function to me.

 I'm adding Timo to the discussion, he can probably shed some light on
 this.

 Regards,
 Chesnay


 On 12.12.2016 15:13, Matt wrote:

 In case this is important, if I remove the WindowFunction, and only use
 the FoldFunction it works fine.

 I don't see what is wrong...

 On Mon, Dec 12, 2016 at 10:53 AM, Matt  wrote:

> Hi,
>
> I'm following the documentation [1] of window functions with
> incremental aggregations, but I'm getting an "input mismatch" error.
>
> The code [2] is almost identical to the one in the documentation, at
> the bottom you can find the exact error.
>
> What am I missing? Can you provide a working example of a fold
> function with both a FoldFunction and a WindowFunction?
>
> Regards,
> Matt
>
> [1] https://ci.apache.org/projects/flink/flink-docs-master/dev/w
> indows.html#windowfunction-with-incremental-aggregation
>
> [2] https://gist.github.com/cc7ed5570e4ce30c3a482ab835e3983d
>



>>>
>>
>


Re: Incremental aggregations - Example not working

2016-12-12 Thread Yassine MARZOUGUI
Yes, it was supposed to work. I looked into this, and as Chesnay said, this
is a bug in the fold function. I opened an issue in JIRA:
https://issues.apache.org/jira/browse/FLINK-5320, and will fix it very
soon, thank you for reporting it.
In the meantime you can work around the problem by specifying the
TypeInformation along with the fold function as follows: fold(ACC,
FoldFunction, WindowFunction, foldAccumulatorType, resultType). In the
example, the foldAccumulatorType is a new TupleTypeInfo for the accumulator
tuple, and the resultType is likewise a new TupleTypeInfo for the result tuple.

Best,
Yassine
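
As a concrete illustration of that workaround (a sketch only; the
Tuple2<String, Long> accumulator, the tuple key, and MyFoldFunction /
MyWindowFunction are placeholders rather than the exact types from the gist):

import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.typeutils.TupleTypeInfo;
import org.apache.flink.streaming.api.windowing.time.Time;

// stream is the keyed input (placeholder), MyFoldFunction/MyWindowFunction are the user's functions
stream
    .keyBy(0)
    .timeWindow(Time.minutes(1))
    .fold(new Tuple2<>("", 0L),
          new MyFoldFunction(),      // FoldFunction<IN, Tuple2<String, Long>>
          new MyWindowFunction(),    // WindowFunction over the folded Tuple2
          new TupleTypeInfo<Tuple2<String, Long>>(
              BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.LONG_TYPE_INFO),  // foldAccumulatorType
          new TupleTypeInfo<Tuple2<String, Long>>(
              BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.LONG_TYPE_INFO)); // resultType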

2016-12-12 16:38 GMT+01:00 Matt :

> I'm using 1.2-SNAPSHOT, should it work in that version?
>
> On Mon, Dec 12, 2016 at 12:18 PM, Yassine MARZOUGUI <
> y.marzou...@mindlytix.com> wrote:
>
>> Hi Matt,
>>
>> What version of Flink are you using?
>> The incremental agregation with fold(ACC, FoldFunction, WindowFunction)
>> in a new change that will be part of Flink 1.2, for Flink 1.1 the correct
>> way to perform incrementation aggregations is : apply(ACC, FoldFunction,
>> WindowFunction) (see the docs for 1.1 [1])
>>
>> [1] : https://ci.apache.org/projects/flink/flink-docs-release-1.
>> 1/apis/streaming/windows.html#windowfunction-with-incremental-aggregation
>>
>> Best,
>> Yassine
>>
>> 2016-12-12 15:37 GMT+01:00 Chesnay Schepler :
>>
>>> Hello Matt,
>>>
>>> This looks like a bug in the fold() function to me.
>>>
>>> I'm adding Timo to the discussion, he can probably shed some light on
>>> this.
>>>
>>> Regards,
>>> Chesnay
>>>
>>>
>>> On 12.12.2016 15:13, Matt wrote:
>>>
>>> In case this is important, if I remove the WindowFunction, and only use
>>> the FoldFunction it works fine.
>>>
>>> I don't see what is wrong...
>>>
>>> On Mon, Dec 12, 2016 at 10:53 AM, Matt  wrote:
>>>
 Hi,

 I'm following the documentation [1] of window functions with
 incremental aggregations, but I'm getting an "input mismatch" error.

 The code [2] is almost identical to the one in the documentation, at
 the bottom you can find the exact error.

 What am I missing? Can you provide a working example of a fold function
 with both a FoldFunction and a WindowFunction?

 Regards,
 Matt

 [1] https://ci.apache.org/projects/flink/flink-docs-master/dev/w
 indows.html#windowfunction-with-incremental-aggregation

 [2] https://gist.github.com/cc7ed5570e4ce30c3a482ab835e3983d

>>>
>>>
>>>
>>
>


Re: Avro Parquet/Flink/Beam

2016-12-12 Thread Jean-Baptiste Onofré

Hi Billy,

I will push my branch with ParquetIO on my github.

Yes, the Beam IO is independent from the runner.

Regards
JB

On 12/12/2016 05:29 PM, Newport, Billy wrote:

I don't mind writing one, is there a fork for the ParquetIO works that's 
already been done or is it in trunk?

The ParquetIO is independent of the runner being used? Is that right?

Thanks

-Original Message-
From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net]
Sent: Monday, December 12, 2016 11:25 AM
To: user@flink.apache.org
Subject: Re: Avro Parquet/Flink/Beam

Hi,

Beam provides a AvroCoder/AvroIO that you can use, but not yet a
ParquetIO (I created a Jira about that and started to work on it).

You can use the Avro reader to populate the PCollection and then use a
custom DoFn to create the Parquet (waiting for the ParquetIO).

Regards
JB

On 12/12/2016 05:19 PM, Newport, Billy wrote:

Are there any examples showing the use of beam with avro/parquet and a
flink runner? I see an avro reader for beam, is it a matter of writing
another one for avro-parquet or does this need to use the flink
HadoopOutputFormat for example?



Thanks

Billy







--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


RE: Avro Parquet/Flink/Beam

2016-12-12 Thread Newport, Billy
I don't mind writing one. Is there a fork for the ParquetIO work that's
already been done, or is it in trunk?

The ParquetIO is independent of the runner being used? Is that right?

Thanks

-Original Message-
From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net] 
Sent: Monday, December 12, 2016 11:25 AM
To: user@flink.apache.org
Subject: Re: Avro Parquet/Flink/Beam

Hi,

Beam provides a AvroCoder/AvroIO that you can use, but not yet a 
ParquetIO (I created a Jira about that and started to work on it).

You can use the Avro reader to populate the PCollection and then use a 
custom DoFn to create the Parquet (waiting for the ParquetIO).

Regards
JB

On 12/12/2016 05:19 PM, Newport, Billy wrote:
> Are there any examples showing the use of beam with avro/parquet and a
> flink runner? I see an avro reader for beam, is it a matter of writing
> another one for avro-parquet or does this need to use the flink
> HadoopOutputFormat for example?
>
>
>
> Thanks
>
> Billy
>
>
>

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Avro Parquet/Flink/Beam

2016-12-12 Thread Jean-Baptiste Onofré

Hi,

Beam provides an AvroCoder/AvroIO that you can use, but not yet a
ParquetIO (I created a Jira about that and started to work on it).


You can use the Avro reader to populate the PCollection and then use a 
custom DoFn to create the Parquet (waiting for the ParquetIO).


Regards
JB

On 12/12/2016 05:19 PM, Newport, Billy wrote:

Are there any examples showing the use of beam with avro/parquet and a
flink runner? I see an avro reader for beam, is it a matter of writing
another one for avro-parquet or does this need to use the flink
HadoopOutputFormat for example?



Thanks

Billy





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Avro Parquet/Flink/Beam

2016-12-12 Thread Newport, Billy
Are there any examples showing the use of beam with avro/parquet and a flink 
runner? I see an avro reader for beam, is it a matter of writing another one 
for avro-parquet or does this need to use the flink HadoopOutputFormat for 
example?

Thanks
Billy



Re: Reg. custom sinks in Flink

2016-12-12 Thread Till Rohrmann
(1) A subtask is a parallel instance of an operator and thus responsible
for a partition (possibly infinite) of the whole DataStream/DataSet.

(2) Maybe you can add this feature to Flink's Cassandra Sink.

Cheers,
Till

On Mon, Dec 12, 2016 at 4:30 PM, Meghashyam Sandeep V <
vr1meghash...@gmail.com> wrote:

> Data piles up in Cassandra without TTL. Is there a workaround for this
> problem? Is there a way to specify my query and still use Pojo?
>
> Thanks,
>
> On Mon, Dec 12, 2016 at 7:26 AM, Chesnay Schepler 
> wrote:
>
>> Regarding 2) I don't think so. That would require access to the datastax
>> MappingManager.
>> We could add something similar as the ClusterBuilder for that though.
>>
>> Regards,
>> Chesnay
>>
>>
>> On 12.12.2016 16:15, Meghashyam Sandeep V wrote:
>>
>> Hi Till,
>>
>> Thanks for the information.
>>
>> 1. What do you mean by 'subtask', is it every partition or every message
>> in the stream?
>>
>> 2. I tried using CassandraSink with a Pojo. Is there a way to specify TTL
>> as I can't use a query when I have a datastream with Pojo?
>>
>> CassandraSink.addSink(messageStream)
>>  .setClusterBuilder(new ClusterBuilder() {
>>  @Override protected Cluster 
>> buildCluster(Cluster.Builder builder) {
>>  return buildCassandraCluster();
>>  }
>>  })
>>  .build();
>>
>> Thanks,
>>
>>
>> On Mon, Dec 12, 2016 at 3:17 AM, Till Rohrmann 
>> wrote:
>>
>>> Hi Meghashyam,
>>>
>>>1.
>>>
>>>You can perform initializations in the open method of the
>>>RichSinkFunction interface. The open method will be called once for
>>>every sub task when initializing it. If you want to share the resource
>>>across multiple sub tasks running in the same JVM you can also store the
>>>dbSession in a class variable.
>>>2.
>>>
>>>The Flink community is currently working on adding security support
>>>including ssl encryption to Flink. So maybe in the future you can use
>>>Flink’s Cassandra sink again.
>>>
>>> Cheers,
>>> Till
>>> ​
>>>
>>> On Fri, Dec 9, 2016 at 8:05 PM, Meghashyam Sandeep V <
>>> vr1meghash...@gmail.com> wrote:
>>>
 Thanks a lot for the quick reply Shannon.

 1. I will create a class that extends SinkFunction and write my
 connection logic there. My only question here is- will a dbSession be
 created for each message/partition which might affect the performance?
 Thats the reason why I added this line to create a connection once and use
 it along the datastream. if(dbSession == null && store!=null) { dbSession
 = getSession();}
 2. I couldn't use flink-connector-cassandra as I have SSL enabled for
 my C* cluster and I couldn't get it work with all my SSL
 config(truststore,keystore etc) added to cluster building. I didn't find a
 proper example with SSL enabled flink-connector-cassandra


 Thanks




 On Fri, Dec 9, 2016 at 10:54 AM, Shannon Carey 
 wrote:

> You haven't really added a sink in Flink terminology, you're just
> performing a side effect within a map operator. So while it may work, if
> you want to add a sink proper you need have an object that extends
> SinkFunction or RichSinkFunction. The method call on the stream should be
> ".addSink(…)".
>
> Also, the dbSession isn't really Flink state as it will not vary based
> on the position in or content in the stream. It's a necessary helper
> object, yes, but you don't need Flink to checkpoint it.
>
> You can still use the sinks provided with flink-connector-cassandra
> and customize the cluster building by passing your own ClusterBuilder into
> the constructor.
>
> -Shannon
>
> From: Meghashyam Sandeep V < 
> vr1meghash...@gmail.com>
> Date: Friday, December 9, 2016 at 12:26 PM
> To: < user@flink.apache.org>, <
> d...@flink.apache.org>
> Subject: Reg. custom sinks in Flink
>
> Hi there,
>
> I have a flink streaming app where my source is Kafka and a custom
> sink to Cassandra(I can't use standard C* sink that comes with flink as I
> have customized auth to C*). I'm currently have the following:
>
> messageStream
> .rebalance()
>
> .map( s-> {
>
> return mapper.readValue(s, JsonNode.class);)
>
> .filter(//filter some messages)
>
> .map(
>
>  (MapFunction) message -> {
>
>  getDbSession.execute("QUERY_TO_EXEC")
>
>  })
>
> private static Session getDbSession() {
> if(dbSession == null && store!=null) {
> dbSession = getSession();
> }
>
> return dbSession;
> }
>
> 1. Is this the right way to add a custom 

Re: Incremental aggregations - Example not working

2016-12-12 Thread Matt
I'm using 1.2-SNAPSHOT, should it work in that version?

On Mon, Dec 12, 2016 at 12:18 PM, Yassine MARZOUGUI <
y.marzou...@mindlytix.com> wrote:

> Hi Matt,
>
> What version of Flink are you using?
> The incremental agregation with fold(ACC, FoldFunction, WindowFunction)
> in a new change that will be part of Flink 1.2, for Flink 1.1 the correct
> way to perform incrementation aggregations is : apply(ACC, FoldFunction,
> WindowFunction) (see the docs for 1.1 [1])
>
> [1] : https://ci.apache.org/projects/flink/flink-docs-
> release-1.1/apis/streaming/windows.html#windowfunction-
> with-incremental-aggregation
>
> Best,
> Yassine
>
> 2016-12-12 15:37 GMT+01:00 Chesnay Schepler :
>
>> Hello Matt,
>>
>> This looks like a bug in the fold() function to me.
>>
>> I'm adding Timo to the discussion, he can probably shed some light on
>> this.
>>
>> Regards,
>> Chesnay
>>
>>
>> On 12.12.2016 15:13, Matt wrote:
>>
>> In case this is important, if I remove the WindowFunction, and only use
>> the FoldFunction it works fine.
>>
>> I don't see what is wrong...
>>
>> On Mon, Dec 12, 2016 at 10:53 AM, Matt  wrote:
>>
>>> Hi,
>>>
>>> I'm following the documentation [1] of window functions with incremental
>>> aggregations, but I'm getting an "input mismatch" error.
>>>
>>> The code [2] is almost identical to the one in the documentation, at the
>>> bottom you can find the exact error.
>>>
>>> What am I missing? Can you provide a working example of a fold function
>>> with both a FoldFunction and a WindowFunction?
>>>
>>> Regards,
>>> Matt
>>>
>>> [1] https://ci.apache.org/projects/flink/flink-docs-master/dev/w
>>> indows.html#windowfunction-with-incremental-aggregation
>>>
>>> [2] https://gist.github.com/cc7ed5570e4ce30c3a482ab835e3983d
>>>
>>
>>
>>
>


Re: Reg. custom sinks in Flink

2016-12-12 Thread Meghashyam Sandeep V
Data piles up in Cassandra without TTL. Is there a workaround for this
problem? Is there a way to specify my query and still use Pojo?

Thanks,

On Mon, Dec 12, 2016 at 7:26 AM, Chesnay Schepler 
wrote:

> Regarding 2) I don't think so. That would require access to the datastax
> MappingManager.
> We could add something similar as the ClusterBuilder for that though.
>
> Regards,
> Chesnay
>
>
> On 12.12.2016 16:15, Meghashyam Sandeep V wrote:
>
> Hi Till,
>
> Thanks for the information.
>
> 1. What do you mean by 'subtask', is it every partition or every message
> in the stream?
>
> 2. I tried using CassandraSink with a Pojo. Is there a way to specify TTL
> as I can't use a query when I have a datastream with Pojo?
>
> CassandraSink.addSink(messageStream)
>  .setClusterBuilder(new ClusterBuilder() {
>  @Override protected Cluster 
> buildCluster(Cluster.Builder builder) {
>  return buildCassandraCluster();
>  }
>  })
>  .build();
>
> Thanks,
>
>
> On Mon, Dec 12, 2016 at 3:17 AM, Till Rohrmann 
> wrote:
>
>> Hi Meghashyam,
>>
>>1.
>>
>>You can perform initializations in the open method of the
>>RichSinkFunction interface. The open method will be called once for
>>every sub task when initializing it. If you want to share the resource
>>across multiple sub tasks running in the same JVM you can also store the
>>dbSession in a class variable.
>>2.
>>
>>The Flink community is currently working on adding security support
>>including ssl encryption to Flink. So maybe in the future you can use
>>Flink’s Cassandra sink again.
>>
>> Cheers,
>> Till
>> ​
>>
>> On Fri, Dec 9, 2016 at 8:05 PM, Meghashyam Sandeep V <
>> vr1meghash...@gmail.com> wrote:
>>
>>> Thanks a lot for the quick reply Shannon.
>>>
>>> 1. I will create a class that extends SinkFunction and write my
>>> connection logic there. My only question here is- will a dbSession be
>>> created for each message/partition which might affect the performance?
>>> Thats the reason why I added this line to create a connection once and use
>>> it along the datastream. if(dbSession == null && store!=null) { dbSession
>>> = getSession();}
>>> 2. I couldn't use flink-connector-cassandra as I have SSL enabled for
>>> my C* cluster and I couldn't get it work with all my SSL
>>> config(truststore,keystore etc) added to cluster building. I didn't find a
>>> proper example with SSL enabled flink-connector-cassandra
>>>
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>> On Fri, Dec 9, 2016 at 10:54 AM, Shannon Carey 
>>> wrote:
>>>
 You haven't really added a sink in Flink terminology, you're just
 performing a side effect within a map operator. So while it may work, if
 you want to add a sink proper you need have an object that extends
 SinkFunction or RichSinkFunction. The method call on the stream should be
 ".addSink(…)".

 Also, the dbSession isn't really Flink state as it will not vary based
 on the position in or content in the stream. It's a necessary helper
 object, yes, but you don't need Flink to checkpoint it.

 You can still use the sinks provided with flink-connector-cassandra and
 customize the cluster building by passing your own ClusterBuilder into the
 constructor.

 -Shannon

 From: Meghashyam Sandeep V < 
 vr1meghash...@gmail.com>
 Date: Friday, December 9, 2016 at 12:26 PM
 To: < user@flink.apache.org>, <
 d...@flink.apache.org>
 Subject: Reg. custom sinks in Flink

 Hi there,

 I have a flink streaming app where my source is Kafka and a custom sink
 to Cassandra(I can't use standard C* sink that comes with flink as I have
 customized auth to C*). I'm currently have the following:

 messageStream
 .rebalance()

 .map( s-> {

 return mapper.readValue(s, JsonNode.class);)

 .filter(//filter some messages)

 .map(

  (MapFunction) message -> {

  getDbSession.execute("QUERY_TO_EXEC")

  })

 private static Session getDbSession() {
 if(dbSession == null && store!=null) {
 dbSession = getSession();
 }

 return dbSession;
 }

 1. Is this the right way to add a custom sink? As you can see, I have 
 dbSession as a class variable here and I'm storing its state.

 2. This setup works fine in a standalone flink (java -jar MyJar.jar). When 
 I run using flink with YARN on EMR I get a NPE at the session which is 
 kind of weird.

 Thanks




Re: Reg. custom sinks in Flink

2016-12-12 Thread Chesnay Schepler
Regarding 2) I don't think so. That would require access to the datastax 
MappingManager.

We could add something similar as the ClusterBuilder for that though.

Regards,
Chesnay

On 12.12.2016 16:15, Meghashyam Sandeep V wrote:

Hi Till,

Thanks for the information.

1. What do you mean by 'subtask', is it every partition or every 
message in the stream?


2. I tried using CassandraSink with a Pojo. Is there a way to specify 
TTL as I can't use a query when I have a datastream with Pojo?


CassandraSink.addSink(messageStream)
  .setClusterBuilder(new ClusterBuilder() {
  @Override protected Cluster buildCluster(Cluster.Builder builder) 
{
  return buildCassandraCluster();
  }
  })
  .build();
Thanks,

On Mon, Dec 12, 2016 at 3:17 AM, Till Rohrmann > wrote:


Hi Meghashyam,

1.

You can perform initializations in the open method of the
|RichSinkFunction| interface. The |open| method will be called
once for every sub task when initializing it. If you want to
share the resource across multiple sub tasks running in the
same JVM you can also store the |dbSession| in a class variable.

2.

The Flink community is currently working on adding security
support including ssl encryption to Flink. So maybe in the
future you can use Flink’s Cassandra sink again.

Cheers,
Till

​

On Fri, Dec 9, 2016 at 8:05 PM, Meghashyam Sandeep V
> wrote:

Thanks a lot for the quick reply Shannon.

1. I will create a class that extends SinkFunction and write
my connection logic there. My only question here is- will a
dbSession be created for each message/partition which might
affect the performance? Thats the reason why I added this line
to create a connection once and use it along the datastream.
if(dbSession == null && store!=null) { dbSession = getSession();}
2. I couldn't use flink-connector-cassandra as I have SSL
enabled for my C* cluster and I couldn't get it work with all
my SSL config(truststore,keystore etc) added to cluster
building. I didn't find a proper example with SSL enabled
flink-connector-cassandra


Thanks




On Fri, Dec 9, 2016 at 10:54 AM, Shannon Carey
> wrote:

You haven't really added a sink in Flink terminology,
you're just performing a side effect within a map
operator. So while it may work, if you want to add a sink
proper you need have an object that extends SinkFunction
or RichSinkFunction. The method call on the stream should
be ".addSink(…)".

Also, the dbSession isn't really Flink state as it will
not vary based on the position in or content in the
stream. It's a necessary helper object, yes, but you don't
need Flink to checkpoint it.

You can still use the sinks provided with
flink-connector-cassandra and customize the cluster
building by passing your own ClusterBuilder into the
constructor.

-Shannon

From: Meghashyam Sandeep V >
Date: Friday, December 9, 2016 at 12:26 PM
To: >, >
Subject: Reg. custom sinks in Flink

Hi there,

I have a flink streaming app where my source is Kafka and
a custom sink to Cassandra(I can't use standard C* sink
that comes with flink as I have customized auth to C*).
I'm currently have the following:

messageStream
 .rebalance()

 .map( s-> {

 return mapper.readValue(s, JsonNode.class);)

 .filter(//filter some messages)

 .map(

  (MapFunction) message -> {

  getDbSession.execute("QUERY_TO_EXEC")

  })

private static Session getDbSession() {
 if (dbSession == null && store != null) {
 dbSession = getSession();
 }

 return dbSession;
}

1. Is this the right way to add a custom sink? As you can see, I 
have dbSession as a class variable here and I'm storing its state.

2. This setup works fine in a standalone flink (java -jar 
MyJar.jar). When I run using flink with YARN on EMR I get a NPE at the session 
which is kind of weird.

Thanks



Re: Incremental aggregations - Example not working

2016-12-12 Thread Yassine MARZOUGUI
Hi Matt,

What version of Flink are you using?
The incremental aggregation with fold(ACC, FoldFunction, WindowFunction) is
a new change that will be part of Flink 1.2; for Flink 1.1 the correct way
to perform incremental aggregations is: apply(ACC, FoldFunction,
WindowFunction) (see the docs for 1.1 [1])

[1] :
https://ci.apache.org/projects/flink/flink-docs-release-1.1/apis/streaming/windows.html#windowfunction-with-incremental-aggregation

Best,
Yassine
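
For completeness, the 1.1-style call looks roughly like this (a sketch; the
Tuple2 accumulator and the user functions are placeholders):

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.windowing.time.Time;

stream
    .keyBy(0)
    .timeWindow(Time.minutes(1))
    .apply(new Tuple2<>("", 0L),     // initial accumulator value
           new MyFoldFunction(),     // folds each element into the accumulator
           new MyWindowFunction());  // receives the folded value once per window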

2016-12-12 15:37 GMT+01:00 Chesnay Schepler :

> Hello Matt,
>
> This looks like a bug in the fold() function to me.
>
> I'm adding Timo to the discussion, he can probably shed some light on this.
>
> Regards,
> Chesnay
>
>
> On 12.12.2016 15:13, Matt wrote:
>
> In case this is important, if I remove the WindowFunction, and only use
> the FoldFunction it works fine.
>
> I don't see what is wrong...
>
> On Mon, Dec 12, 2016 at 10:53 AM, Matt  wrote:
>
>> Hi,
>>
>> I'm following the documentation [1] of window functions with incremental
>> aggregations, but I'm getting an "input mismatch" error.
>>
>> The code [2] is almost identical to the one in the documentation, at the
>> bottom you can find the exact error.
>>
>> What am I missing? Can you provide a working example of a fold function
>> with both a FoldFunction and a WindowFunction?
>>
>> Regards,
>> Matt
>>
>> [1] https://ci.apache.org/projects/flink/flink-docs-master/dev/
>> windows.html#windowfunction-with-incremental-aggregation
>>
>> [2] https://gist.github.com/cc7ed5570e4ce30c3a482ab835e3983d
>>
>
>
>


Re: Reg. custom sinks in Flink

2016-12-12 Thread Meghashyam Sandeep V
Hi Till,

Thanks for the information.

1. What do you mean by 'subtask', is it every partition or every message in
the stream?

2. I tried using CassandraSink with a Pojo. Is there a way to specify TTL
as I can't use a query when I have a datastream with Pojo?

CassandraSink.addSink(messageStream)
 .setClusterBuilder(new ClusterBuilder() {
 @Override
 protected Cluster buildCluster(Cluster.Builder builder) {
 return buildCassandraCluster();
 }
 })
 .build();


Thanks,
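
Related to the SSL point raised earlier in this thread: a sketch of how the same
ClusterBuilder hook shown above could also carry the SSL and auth settings (the
contact point, credentials, and reliance on the JVM's default truststore are
assumptions):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.JdkSSLOptions;
import org.apache.flink.streaming.connectors.cassandra.ClusterBuilder;

ClusterBuilder sslBuilder = new ClusterBuilder() {
    @Override
    protected Cluster buildCluster(Cluster.Builder builder) {
        return builder
            .addContactPoint("cassandra.example.com")
            .withCredentials("user", "password")
            // picks up the JVM's default SSLContext / truststore settings
            .withSSL(JdkSSLOptions.builder().build())
            .build();
    }
};
// CassandraSink.addSink(messageStream).setClusterBuilder(sslBuilder).build();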


On Mon, Dec 12, 2016 at 3:17 AM, Till Rohrmann  wrote:

> Hi Meghashyam,
>
>1.
>
>You can perform initializations in the open method of the
>RichSinkFunction interface. The open method will be called once for
>every sub task when initializing it. If you want to share the resource
>across multiple sub tasks running in the same JVM you can also store the
>dbSession in a class variable.
>2.
>
>The Flink community is currently working on adding security support
>including ssl encryption to Flink. So maybe in the future you can use
>Flink’s Cassandra sink again.
>
> Cheers,
> Till
> ​
>
> On Fri, Dec 9, 2016 at 8:05 PM, Meghashyam Sandeep V <
> vr1meghash...@gmail.com> wrote:
>
>> Thanks a lot for the quick reply Shannon.
>>
>> 1. I will create a class that extends SinkFunction and write my
>> connection logic there. My only question here is- will a dbSession be
>> created for each message/partition which might affect the performance?
>> Thats the reason why I added this line to create a connection once and use
>> it along the datastream. if(dbSession == null && store!=null) { dbSession
>> = getSession();}
>>
>> 2. I couldn't use flink-connector-cassandra as I have SSL enabled for my
>> C* cluster and I couldn't get it work with all my SSL
>> config(truststore,keystore etc) added to cluster building. I didn't find a
>> proper example with SSL enabled flink-connector-cassandra
>>
>>
>> Thanks
>>
>>
>>
>>
>> On Fri, Dec 9, 2016 at 10:54 AM, Shannon Carey 
>> wrote:
>>
>>> You haven't really added a sink in Flink terminology, you're just
>>> performing a side effect within a map operator. So while it may work, if
>>> you want to add a sink proper you need have an object that extends
>>> SinkFunction or RichSinkFunction. The method call on the stream should be
>>> ".addSink(…)".
>>>
>>> Also, the dbSession isn't really Flink state as it will not vary based
>>> on the position in or content in the stream. It's a necessary helper
>>> object, yes, but you don't need Flink to checkpoint it.
>>>
>>> You can still use the sinks provided with flink-connector-cassandra and
>>> customize the cluster building by passing your own ClusterBuilder into the
>>> constructor.
>>>
>>> -Shannon
>>>
>>> From: Meghashyam Sandeep V 
>>> Date: Friday, December 9, 2016 at 12:26 PM
>>> To: , 
>>> Subject: Reg. custom sinks in Flink
>>>
>>> Hi there,
>>>
>>> I have a flink streaming app where my source is Kafka and a custom sink
>>> to Cassandra(I can't use standard C* sink that comes with flink as I have
>>> customized auth to C*). I'm currently have the following:
>>>
>>> messageStream
>>> .rebalance()
>>>
>>> .map( s-> {
>>>
>>> return mapper.readValue(s, JsonNode.class);)
>>>
>>> .filter(//filter some messages)
>>>
>>> .map(
>>>
>>>  (MapFunction) message -> {
>>>
>>>  getDbSession.execute("QUERY_TO_EXEC")
>>>
>>>  })
>>>
>>> private static Session getDbSession() {
>>> if(dbSession == null && store!=null) {
>>> dbSession = getSession();
>>> }
>>>
>>> return dbSession;
>>> }
>>>
>>> 1. Is this the right way to add a custom sink? As you can see, I have 
>>> dbSession as a class variable here and I'm storing its state.
>>>
>>> 2. This setup works fine in a standalone flink (java -jar MyJar.jar). When 
>>> I run using flink with YARN on EMR I get a NPE at the session which is kind 
>>> of weird.
>>>
>>>
>>> Thanks
>>>
>>>
>>
>


Re: Incremental aggregations - Example not working

2016-12-12 Thread Matt
In case this is important, if I remove the WindowFunction, and only use the
FoldFunction it works fine.

I don't see what is wrong...

On Mon, Dec 12, 2016 at 10:53 AM, Matt  wrote:

> Hi,
>
> I'm following the documentation [1] of window functions with incremental
> aggregations, but I'm getting an "input mismatch" error.
>
> The code [2] is almost identical to the one in the documentation, at the
> bottom you can find the exact error.
>
> What am I missing? Can you provide a working example of a fold function
> with both a FoldFunction and a WindowFunction?
>
> Regards,
> Matt
>
> [1] https://ci.apache.org/projects/flink/flink-docs-
> master/dev/windows.html#windowfunction-with-incremental-aggregation
>
> [2] https://gist.github.com/cc7ed5570e4ce30c3a482ab835e3983d
>


Incremental aggregations - Example not working

2016-12-12 Thread Matt
Hi,

I'm following the documentation [1] of window functions with incremental
aggregations, but I'm getting an "input mismatch" error.

The code [2] is almost identical to the one in the documentation, at the
bottom you can find the exact error.

What am I missing? Can you provide a working example of a fold function
with both a FoldFunction and a WindowFunction?

Regards,
Matt

[1]
https://ci.apache.org/projects/flink/flink-docs-master/dev/windows.html#windowfunction-with-incremental-aggregation

[2] https://gist.github.com/cc7ed5570e4ce30c3a482ab835e3983d


Re: checkpoint notifier not found?

2016-12-12 Thread Till Rohrmann
Hi Abhishek,

great to hear that you like to become part of the Flink community. Here are
some information for how to contribute [1].

[1] http://flink.apache.org/how-to-contribute.html

Cheers,
Till

On Mon, Dec 12, 2016 at 12:36 PM, Abhishek Singh <
abhis...@tetrationanalytics.com> wrote:

> Will be happy to. Could you guide me a bit in terms of what I need to do?
>
> I am a newbie to open source contributing. And currently at Frankfurt
> airport. When I hit ground will be happy to contribute back. Love the
> project !!
>
> Thanks for the awesomeness.
>
>
> On Mon, Dec 12, 2016 at 12:29 PM Stephan Ewen  wrote:
>
>> Thanks for reporting this.
>> It would be awesome if you could file a JIRA or a pull request for fixing
>> the docs for that.
>>
>> On Sat, Dec 10, 2016 at 2:02 AM, Abhishek R. Singh <
>> abhis...@tetrationanalytics.com> wrote:
>>
>> I was following the official documentation: https://ci.
>> apache.org/projects/flink/flink-docs-release-1.1/apis/
>> streaming/state.html
>>
>> Looks like this is the right one to be using: import
>> org.apache.flink.runtime.state.CheckpointListener;
>>
>> -Abhishek-
>>
>> On Dec 9, 2016, at 4:30 PM, Abhishek R. Singh <
>> abhis...@tetrationanalytics.com> wrote:
>>
>> I can’t seem to find CheckpointNotifier. Appreciate help !
>>
>> CheckpointNotifier is not a member of package org.apache.flink.streaming.
>> api.checkpoint
>>
>> From my pom.xml:
>>
>> 
>> org.apache.flink
>> flink-scala_2.11
>> 1.1.3
>> 
>> 
>> org.apache.flink
>> flink-streaming-scala_2.11
>> 1.1.3
>> 
>> 
>> org.apache.flink
>> flink-clients_2.11
>> 1.1.3
>> 
>> 
>> org.apache.flink
>> flink-statebackend-rocksdb_2.11
>> 1.1.3
>> 
>>
>>
>>
>>


Re: How to retrieve values from yarn.taskmanager.env in a Job?

2016-12-12 Thread Chesnay Schepler

Hello,

can you clarify one small thing for me: Do you want to access this 
parameter when you define the plan
(aka when you call methods on the StreamExecutionEnvironment or 
DataStream instances)

or from within your functions/operators?

Regards,
Chesnay Schepler


On 12.12.2016 14:21, Till Rohrmann wrote:


Hi Shannon,

have you tried accessing the environment variables via 
|System.getenv()|? This should give you a map of string-string key 
value pairs where the key is the environment variable name.


If your values are not set in the returned map, then this indicates a 
bug in Flink and it would be great if you could open a JIRA issue.


Cheers,
Till

​

On Fri, Dec 9, 2016 at 7:33 PM, Shannon Carey > wrote:


This thread

http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/passing-environment-variables-to-flink-program-td3337.html


describes the impetus for the addition of yarn.taskmanager.env.

I have configured a value within yarn.taskmanager.env, and I see
it appearing in the Flink web UI in the list underneath Job
Manager -> Configuration. However, I can't figure out how to
retrieve the value from within a Flink job. It doesn't appear in
the environment, the system properties, or my ParameterTool
instance, and I can't figure out how I would get to it via the
StreamExecutionEnvironment. Can anyone point me in the right
direction?

All I want to do is inform my Flink jobs which environment they're
running on, so that programmers don't have to specify the
environment as a job parameter every time they run it. I also see
that there is a "env.java.opts" configuration… does that work in
YARN apps (would my jobs be able to see it?)

Thanks!
Shannon






Re: Testing Flink Streaming applications - controlling the clock

2016-12-12 Thread Till Rohrmann
Hi Rohit,

it depends a little bit on your tests. If you test individual operators you
can use the AbstractStreamOperatorTestHarness class, which allows you to set the
processing time via AbstractStreamOperatorTestHarness#setProcessingTime.
You can also set the ProcessingTimeService used by a StreamTask via the
StreamTask#setProcessingTimeService method.

If you have an IT case where you want to test the whole topology, there is,
unfortunately, not much tooling around to set the test up so that you can
easily define your own ProcessingTimeService. But if you like, then you
could contribute it to Flink. It will be highly appreciated.

Cheers,
Till
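
A sketch of what driving the clock by hand can look like with that harness
(constructor details differ between Flink versions; the StreamMap operator and
the timestamps here are placeholders):

import org.apache.flink.streaming.api.operators.StreamMap;
import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;
import org.apache.flink.streaming.util.OneInputStreamOperatorTestHarness;

OneInputStreamOperatorTestHarness<String, String> harness =
    new OneInputStreamOperatorTestHarness<>(
        new StreamMap<String, String>((String s) -> s.toUpperCase()));

harness.open();
harness.setProcessingTime(0L);           // start the manual processing-time clock
harness.processElement(new StreamRecord<>("a", 0L));
harness.setProcessingTime(61_000L);      // jump past a one-minute boundary
harness.close();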
​

On Fri, Dec 9, 2016 at 11:06 PM, Rohit Agarwal  wrote:

> Hi,
>
> I am writing tests for my flink streaming application. I mostly use
> event-time. But there are some aspects which are still controlled by
> wall-clock time. For example, I am using AssignerWithPeriodicWatermarks and
> so watermarks are triggered based on wall-clock time. Similarly,
> checkpoints are also triggered based on wall-clock time. Is there a way I
> can manually control the clock which flink uses from my tests.
>
> --
> Rohit Agarwal
>


Re: How to retrieve values from yarn.taskmanager.env in a Job?

2016-12-12 Thread Till Rohrmann
Hi Shannon,

have you tried accessing the environment variables via System.getenv()?
This should give you a map of string-string key value pairs where the key
is the environment variable name.

If your values are not set in the returned map, then this indicates a bug
in Flink and it would be great if you could open a JIRA issue.

Cheers,
Till
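
A minimal sketch of that inside a job (DEPLOY_ENV is a made-up variable name set
via yarn.taskmanager.env; the mapper itself is just an example):

import java.util.Map;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class EnvAwareMapper extends RichMapFunction<String, String> {

    private transient String deployEnv;

    @Override
    public void open(Configuration parameters) {
        Map<String, String> env = System.getenv();
        deployEnv = env.getOrDefault("DEPLOY_ENV", "unknown");
    }

    @Override
    public String map(String value) {
        return deployEnv + ": " + value;
    }
}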
​

On Fri, Dec 9, 2016 at 7:33 PM, Shannon Carey  wrote:

> This thread http://apache-flink-user-mailing-list-archive.
> 2336050.n4.nabble.com/passing-environment-variables-to-
> flink-program-td3337.html describes the impetus for the addition
> of yarn.taskmanager.env.
>
> I have configured a value within yarn.taskmanager.env, and I see it
> appearing in the Flink web UI in the list underneath Job Manager ->
> Configuration. However, I can't figure out how to retrieve the value from
> within a Flink job. It doesn't appear in the environment, the system
> properties, or my ParameterTool instance, and I can't figure out how I
> would get to it via the StreamExecutionEnvironment. Can anyone point me in
> the right direction?
>
> All I want to do is inform my Flink jobs which environment they're running
> on, so that programmers don't have to specify the environment as a job
> parameter every time they run it. I also see that there is a
> "env.java.opts" configuration… does that work in YARN apps (would my jobs
> be able to see it?)
>
> Thanks!
> Shannon
>


Re: checkpoint notifier not found?

2016-12-12 Thread Abhishek Singh
Will be happy to. Could you guide me a bit in terms of what I need to do?

I am a newbie to open source contributing. And currently at Frankfurt
airport. When I hit ground will be happy to contribute back. Love the
project !!

Thanks for the awesomeness.


On Mon, Dec 12, 2016 at 12:29 PM Stephan Ewen  wrote:

> Thanks for reporting this.
> It would be awesome if you could file a JIRA or a pull request for fixing
> the docs for that.
>
> On Sat, Dec 10, 2016 at 2:02 AM, Abhishek R. Singh <
> abhis...@tetrationanalytics.com> wrote:
>
> I was following the official documentation:
> https://ci.apache.org/projects/flink/flink-docs-release-1.1/apis/streaming/state.html
>
> Looks like this is the right one to be using: import
> org.apache.flink.runtime.state.CheckpointListener;
>
> -Abhishek-
>
> On Dec 9, 2016, at 4:30 PM, Abhishek R. Singh <
> abhis...@tetrationanalytics.com> wrote:
>
> I can’t seem to find CheckpointNotifier. Appreciate help !
>
> CheckpointNotifier is not a member of package
> org.apache.flink.streaming.api.checkpoint
>
> From my pom.xml:
>
> 
> org.apache.flink
> flink-scala_2.11
> 1.1.3
> 
> 
> org.apache.flink
> flink-streaming-scala_2.11
> 1.1.3
> 
> 
> org.apache.flink
> flink-clients_2.11
> 1.1.3
> 
> 
> org.apache.flink
> flink-statebackend-rocksdb_2.11
> 1.1.3
> 
>
>
>
>


Re: checkpoint notifier not found?

2016-12-12 Thread Stephan Ewen
Thanks for reporting this.
It would be awesome if you could file a JIRA or a pull request for fixing
the docs for that.

On Sat, Dec 10, 2016 at 2:02 AM, Abhishek R. Singh <
abhis...@tetrationanalytics.com> wrote:

> I was following the official documentation: https://ci.
> apache.org/projects/flink/flink-docs-release-1.1/apis/streaming/state.html
>
> Looks like this is the right one to be using: import
> org.apache.flink.runtime.state.CheckpointListener;
>
> -Abhishek-
>
> On Dec 9, 2016, at 4:30 PM, Abhishek R. Singh <
> abhis...@tetrationanalytics.com> wrote:
>
> I can’t seem to find CheckpointNotifier. Appreciate help !
>
> CheckpointNotifier is not a member of package org.apache.flink.streaming.
> api.checkpoint
>
> From my pom.xml:
>
> 
> org.apache.flink
> flink-scala_2.11
> 1.1.3
> 
> 
> org.apache.flink
> flink-streaming-scala_2.11
> 1.1.3
> 
> 
> org.apache.flink
> flink-clients_2.11
> 1.1.3
> 
> 
> org.apache.flink
> flink-statebackend-rocksdb_2.11
> 1.1.3
> 
>
>
>


Re: Reg. custom sinks in Flink

2016-12-12 Thread Till Rohrmann
Hi Meghashyam,

   1.

   You can perform initializations in the open method of the
   RichSinkFunction interface. The open method will be called once for
   every sub task when initializing it. If you want to share the resource
   across multiple sub tasks running in the same JVM you can also store the
   dbSession in a class variable.
   2.

   The Flink community is currently working on adding security support
   including ssl encryption to Flink. So maybe in the future you can use
   Flink’s Cassandra sink again.

Cheers,
Till
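
A sketch of point 1 (session initialization in open() plus static sharing across
sub tasks in one JVM; CassandraSessions.getOrCreate() stands in for the user's
own connection logic and is not an existing API):

import com.datastax.driver.core.Session;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class CustomCassandraSink extends RichSinkFunction<String> {

    // shared by all sub tasks running in the same JVM
    private static Session dbSession;

    @Override
    public void open(Configuration parameters) {
        synchronized (CustomCassandraSink.class) {
            if (dbSession == null) {
                dbSession = CassandraSessions.getOrCreate();  // placeholder
            }
        }
    }

    @Override
    public void invoke(String value) throws Exception {
        dbSession.execute("QUERY_TO_EXEC");  // use 'value' to build the real statement
    }
}

// messageStream.addSink(new CustomCassandraSink());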
​

On Fri, Dec 9, 2016 at 8:05 PM, Meghashyam Sandeep V <
vr1meghash...@gmail.com> wrote:

> Thanks a lot for the quick reply Shannon.
>
> 1. I will create a class that extends SinkFunction and write my connection
> logic there. My only question here is- will a dbSession be created for each
> message/partition which might affect the performance? Thats the reason why
> I added this line to create a connection once and use it along the
> datastream. if(dbSession == null && store!=null) { dbSession =
> getSession();}
>
> 2. I couldn't use flink-connector-cassandra as I have SSL enabled for my
> C* cluster and I couldn't get it work with all my SSL
> config(truststore,keystore etc) added to cluster building. I didn't find a
> proper example with SSL enabled flink-connector-cassandra
>
>
> Thanks
>
>
>
>
> On Fri, Dec 9, 2016 at 10:54 AM, Shannon Carey  wrote:
>
>> You haven't really added a sink in Flink terminology, you're just
>> performing a side effect within a map operator. So while it may work, if
>> you want to add a sink proper you need have an object that extends
>> SinkFunction or RichSinkFunction. The method call on the stream should be
>> ".addSink(…)".
>>
>> Also, the dbSession isn't really Flink state as it will not vary based on
>> the position in or content in the stream. It's a necessary helper object,
>> yes, but you don't need Flink to checkpoint it.
>>
>> You can still use the sinks provided with flink-connector-cassandra and
>> customize the cluster building by passing your own ClusterBuilder into the
>> constructor.
>>
>> -Shannon
>>
>> From: Meghashyam Sandeep V 
>> Date: Friday, December 9, 2016 at 12:26 PM
>> To: , 
>> Subject: Reg. custom sinks in Flink
>>
>> Hi there,
>>
>> I have a flink streaming app where my source is Kafka and a custom sink
>> to Cassandra(I can't use standard C* sink that comes with flink as I have
>> customized auth to C*). I'm currently have the following:
>>
>> messageStream
>> .rebalance()
>>
>> .map( s-> {
>>
>> return mapper.readValue(s, JsonNode.class);)
>>
>> .filter(//filter some messages)
>>
>> .map(
>>
>>  (MapFunction) message -> {
>>
>>  getDbSession.execute("QUERY_TO_EXEC")
>>
>>  })
>>
>> private static Session getDbSession() {
>> if(dbSession == null && store!=null) {
>> dbSession = getSession();
>> }
>>
>> return dbSession;
>> }
>>
>> 1. Is this the right way to add a custom sink? As you can see, I have 
>> dbSession as a class variable here and I'm storing its state.
>>
>> 2. This setup works fine in a standalone flink (java -jar MyJar.jar). When I 
>> run using flink with YARN on EMR I get a NPE at the session which is kind of 
>> weird.
>>
>>
>> Thanks
>>
>>
>