Re: Data Transfer between TM should be encrypted

2016-08-30 Thread vinay patil
Hi Vijay,

That's good news for me. I'm eagerly waiting for this change so that I can
integrate and test it before going live.

Regards,
Vinay Patil

On Tue, Aug 30, 2016 at 4:06 PM, Vijay Srinivasaraghavan [via Apache Flink
User Mailing List archive.] wrote:

> Hi Stephan,
>
> The dev work is almost complete except the Yarn mode deployment stuff that
> needs to be patched. We are expecting to send a PR in a week or two.
>
> Regards
> Vijay
>
>
> On Tuesday, August 30, 2016 12:39 AM, Stephan Ewen <[hidden email]> wrote:
>
>
> Let me loop in Vijay, I think he is the one working on this and can
> probably give the best estimate when it can be expected.
>
> @vijay: For the SSL/TLS transport encryption - do you have an estimate for
> the timeline of that feature?
>
>
> On Mon, Aug 29, 2016 at 8:54 PM, vinay patil <[hidden email]> wrote:
>
> Hi Stephan,
>
> Thank you for your reply.
>
> By when can I expect this feature to be integrated into master or a release
> version?
>
> We are going to get production data (financial data) at the end of October,
> so I want to have this feature before then.
>
> Regards,
> Vinay Patil
>
> On Mon, Aug 29, 2016 at 11:15 AM, Stephan Ewen [via Apache Flink User
> Mailing List archive.] <[hidden email]> wrote:
>
> Hi!
>
> The way that the JIRA issue you linked will achieve this is by hooking
> into the network stream pipeline directly and encrypting the raw network
> byte stream. We built the network stack on Netty, and will use Netty's
> SSL/TLS handlers for that.
>
> That should be much more efficient than manual encryption/decryption in
> each user function.
>
> Stephan
>
>
>
>
>
>
> On Mon, Aug 29, 2016 at 6:12 PM, vinay patil <[hidden email]> wrote:
>
> Hi Ufuk,
>
> This is regarding this issue
> https://issues.apache.org/jira/browse/FLINK-4404
>
> How can we achieve this? I am able to decrypt the data from Kafka coming
> in, but I want to make sure that the data is encrypted when flowing between
> TMs.
>
> One approach I can think of is to decrypt the data at the start of each
> operator and encrypt it at the end of each operator, but I feel this is not
> an efficient approach.
>
> I just want to check whether there are alternatives to this, and whether it
> can be achieved through configuration.
>
> Regards,
> Vinay Patil
>

Re: Firing windows multiple times

2016-08-30 Thread Shannon Carey
I appreciate your suggestion!

However, the main problem with your approach is the amount of time that goes by 
without an updated value from minuteAggregate and hourlyAggregate (lack of a 
continuously updated aggregate).

For example, if we use a tumbling window of 1 month duration, then we only get 
an update for that value once a month! The values from that stream will be on 
average 0.5 months stale. A year-long window is even worse.

-Shannon

From: Aljoscha Krettek
Date: Tuesday, August 30, 2016 at 9:08 AM
To: Shannon Carey, "user@flink.apache.org"
Subject: Re: Firing windows multiple times

Hi,
I think this can be neatly expressed by using something like a tree of windowed
aggregations, i.e. you specify your smallest window computation first and then
specify larger window computations based on the smaller windows. I've written an
example that showcases this approach:
https://gist.github.com/aljoscha/728ac69361f75c3ca87053b1a6f91fcd

The basic idea in pseudo code is this:

DataStream input = ...
dailyAggregate = input.keyBy(...).window(Time.days(1)).reduce(new Sum())
weeklyAggregate = dailyAggregate.keyBy(...).window(Time.days(7)).reduce(new Sum())
monthlyAggregate = weeklyAggregate.keyBy(...).window(Time.days(30)).reduce(new Sum())

The benefit of this approach is that you don't duplicate computation, and you
get incremental aggregation using a reduce function. When manually keeping
elements and evicting them based on time, the amount of state that would have to
be kept would be much larger.
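
To make this concrete, here is a minimal, self-contained sketch of such a
cascade (my own illustrative Count type and sample data, and processing-time
windows for brevity — this is not the code from the gist):

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object CascadedWindows {
  case class Count(key: String, value: Long)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val input: DataStream[Count] =
      env.fromElements(Count("user-1", 1L), Count("user-1", 2L), Count("user-2", 5L))

    // incremental aggregation: the same reduce is reused at every level
    val sum = (a: Count, b: Count) => Count(a.key, a.value + b.value)

    // smallest granularity first: daily sums per key
    val daily = input.keyBy(_.key).timeWindow(Time.days(1)).reduce(sum)

    // each larger window consumes the *output* of the smaller one, so an
    // element is aggregated once per level instead of once per sliding pane
    val weekly = daily.keyBy(_.key).timeWindow(Time.days(7)).reduce(sum)
    val monthly = weekly.keyBy(_.key).timeWindow(Time.days(30)).reduce(sum)

    monthly.print()
    env.execute("cascaded windows sketch")
  }
}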

Does that make sense and would it help your use case?

Cheers,
Aljoscha

On Mon, 29 Aug 2016 at 23:18 Shannon Carey wrote:
Yes, let me describe an example use-case that I'm trying to implement 
efficiently within Flink.

We've been asked to aggregate per-user data on a daily level, and from there 
produce aggregates on a variety of time frames. For example, 7 days, 30 days, 
180 days, and 365 days.

We can talk about the hardest one, the 365 day window, with the knowledge that 
adding the other time windows magnifies the problem.

I can easily use tumbling time windows of 1-day size for the first aggregation. 
However, for the longer aggregation, if I take the naive approach and use a 
sliding window, the window size would be 365 days and the slide would be one 
day. If a user comes back every day, I run the risk of magnifying the size of 
the data by up to 365 because each day of data will be included in up to 365 
year-long window panes. Also, if I want to fire the aggregate information more 
rapidly than once a day, then I have to worry about getting 365 different 
windows fired at the same time & trying to figure out which one to pay 
attention to, or coming up with a hare-brained custom firing trigger. We tried 
emitting each day-aggregate into a time series database and doing the final 365 
day aggregation as a query, but that was more complicated than we wanted: in 
particular, we'd like to have all the logic in the Flink job, not split across 
different technologies and infrastructure.

The work-around I'm thinking of is to use a single window that contains 365 
days of data (relative to the current watermark) on an ongoing basis. The 
windowing function would be responsible for evicting old data based on the 
current watermark.

Does that make sense? Does it seem logical, or am I misunderstanding something 
about how Flink works?

-Shannon


From: Aljoscha Krettek
Date: Monday, August 29, 2016 at 3:56 AM
To: "user@flink.apache.org"
Subject: Re: Firing windows multiple times

Hi,
that would certainly be possible. What do you think could be gained by having 
knowledge of the current watermark in the WindowFunction, in your specific case?

Cheers,
Aljoscha

On Wed, 24 Aug 2016 at 23:21 Shannon Carey wrote:
What do you think about adding the current watermark to the window function 
metadata in FLIP-2?

From: Shannon Carey
Date: Friday, August 12, 2016 at 6:24 PM
To: Aljoscha Krettek, "user@flink.apache.org"

Subject: Re: Firing windows multiple times

Thanks Aljoscha, I didn't know about those. Yes, they look like handy changes, 
especially to enable flexible approaches for eviction. In particular, having 
the current watermark available to the evictor via EvictorContext is helpful: 
it will be able to evict the old data more easily without needing to rely on 
Window#maxTimestamp().

However, I think you 

Re: Data Transfer between TM should be encrypted

2016-08-30 Thread Vijay Srinivasaraghavan
Hi Stephan,
The dev work is almost complete except the Yarn mode deployment stuff that 
needs to be patched. We are expecting to send a PR in a week or two.
Regards,
Vijay

On Tuesday, August 30, 2016 12:39 AM, Stephan Ewen wrote:

Let me loop in Vijay, I think he is the one working on this and can probably
give the best estimate when it can be expected.
@vijay: For the SSL/TLS transport encryption - do you have an estimate for the 
timeline of that feature?

On Mon, Aug 29, 2016 at 8:54 PM, vinay patil  wrote:

Hi Stephan,
Thank you for your reply.
By when can I expect this feature to be integrated into master or a release
version?

We are going to get production data (financial data) at the end of October, so I
want to have this feature before then.
Regards,
Vinay Patil
On Mon, Aug 29, 2016 at 11:15 AM, Stephan Ewen [via Apache Flink User Mailing 
List archive.] <[hidden email]> wrote:

Hi!
The way that the JIRA issue you linked will achieve this is by hooking into the
network stream pipeline directly and encrypting the raw network byte stream. We
built the network stack on Netty, and will use Netty's SSL/TLS handlers for
that.

That should be much more efficient than manual encryption/decryption in each 
user function.
Stephan
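
For illustration, a minimal sketch of what wiring Netty's SSL handler into a
channel pipeline looks like (hypothetical certificate paths — this is a generic
Netty snippet, not the actual Flink patch):

import java.io.File
import io.netty.channel.socket.SocketChannel
import io.netty.handler.ssl.{SslContext, SslContextBuilder}

object TlsPipelineSketch {
  // hypothetical certificate/key locations
  val sslCtx: SslContext = SslContextBuilder
    .forServer(new File("server.crt"), new File("server.key"))
    .build()

  // what a ChannelInitializer would do: the SslHandler sits first in the
  // pipeline, so every handler behind it sees plaintext while the wire
  // carries TLS
  def initChannel(ch: SocketChannel): Unit = {
    ch.pipeline().addFirst("ssl", sslCtx.newHandler(ch.alloc()))
  }
}

Because the encryption happens at the byte-stream level, user functions and
serializers stay untouched.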





On Mon, Aug 29, 2016 at 6:12 PM, vinay patil <[hidden email]> wrote:

Hi Ufuk,
This is regarding this issue: https://issues.apache.org/jira/browse/FLINK-4404

How can we achieve this? I am able to decrypt the data from Kafka coming in,
but I want to make sure that the data is encrypted when flowing between TMs.
One approach I can think of is to decrypt the data at the start of each 
operator and encrypt it at the end of each operator, but I feel this is not an 
efficient approach.
I just want to check whether there are alternatives to this, and whether it can
be achieved through configuration.
Regards,
Vinay Patil

"select as" in Flink SQL

2016-08-30 Thread Davran Muzafarov
I am trying to execute a simple SQL query like this:

 

DataSet dataSet0 = env.fromCollection( infos0 );
tableEnv.registerDataSet( "table0", dataSet0 );

Table table = tableEnv.sql( "select assetClass as \"asset class\" from table0" );

 

I am getting:

 

org.apache.calcite.sql.parser.SqlParseException: Encountered "as \"" at line 1, column 19.
Was expecting one of:
"FROM" ...
"," ...
"AS" <IDENTIFIER> ...
"AS" <QUOTED_IDENTIFIER> ...
"AS" <BACK_QUOTED_IDENTIFIER> ...
"AS" <BRACKET_QUOTED_IDENTIFIER> ...
"AS" <UNICODE_QUOTED_IDENTIFIER> ...
"." ...
"(" ...
"NOT" ...
"IN" ...
"BETWEEN" ...
"LIKE" ...
"SIMILAR" ...
"=" ...
">" ...
"<" ...
"<=" ...
">=" ...
"<>" ...
"+" ...
"-" ...
"*" ...
"/" ...
"||" ...
"AND" ...
"OR" ...
"IS" ...
"MEMBER" ...
"SUBMULTISET" ...
"MULTISET" ...
"[" ...

at org.apache.calcite.sql.parser.impl.SqlParserImpl.convertException(SqlParserImpl.java:388)
at org.apache.calcite.sql.parser.impl.SqlParserImpl.normalizeException(SqlParserImpl.java:119)
at org.apache.calcite.sql.parser.SqlParser.parseQuery(SqlParser.java:131)
at org.apache.calcite.sql.parser.SqlParser.parseStmt(SqlParser.java:156)
at org.apache.flink.api.table.FlinkPlannerImpl.parse(FlinkPlannerImpl.scala:75)
...

 

If I use [ instead of ", I am getting:

org.apache.calcite.sql.parser.SqlParseException: Encountered "as [" at line 1, column 19.
Was expecting one of the same alternatives as above.

at org.apache.calcite.sql.parser.impl.SqlParserImpl.convertException(SqlParserImpl.java:388)
at org.apache.calcite.sql.parser.impl.SqlParserImpl.normalizeException(SqlParserImpl.java:119)
at org.apache.calcite.sql.parser.SqlParser.parseQuery(SqlParser.java:131)
at org.apache.calcite.sql.parser.SqlParser.parseStmt(SqlParser.java:156)
at org.apache.flink.api.table.FlinkPlannerImpl.parse(FlinkPlannerImpl.scala:75)
at org.apache.flink.api.table.BatchTableEnvironment.sql(BatchTableEnvironment.scala:128)
...

 

How can I use an alias that contains spaces?
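
A hedged workaround sketch, not an authoritative answer: the two parse errors
show that neither " nor [ is the identifier-quoting character in this parser
configuration. In later Flink versions the SQL dialect quotes identifiers with
backticks, so a backtick-quoted alias may parse here too; otherwise the safe
route is a plain identifier, renamed for display downstream:

// may work, assuming backtick identifier quoting (as in later Flink SQL)
val t1 = tableEnv.sql("SELECT assetClass AS `asset class` FROM table0")

// always parses: plain identifier, no quoting needed
val t2 = tableEnv.sql("SELECT assetClass AS asset_class FROM table0")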

 

Thank you,

Davran.



flink dataStream operate dataSet

2016-08-30 Thread rimin515
Hi, I have a problem: a DataStream reads from RabbitMQ, and other data comes
from an HBase table as a DataSet. The two inputs look like this:

val words = connectHelper.readFromRabbitMq(...)   // words is DataStream[String]
val dataSet = HBaseWrite.fullScan()               // dataSet is DataSet[(Int, String)]

words.map { word =>
  val res = dataSet.map { y =>
    val score = computerScore(word, y)
    (word, score)
  }
  HBaseWrite.writeToTable(res, ...)
}

The error is "task not serializable". What is the solution? Within a
DataStream, how can I operate on a DataSet?
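
One common pattern — a hedged sketch, not from the thread: DataSet
transformations cannot run inside a DataStream operator, and the captured
DataSet is what makes the task non-serializable. Instead, load the HBase rows
on the worker in a rich function's open(), assuming the scan fits in memory
(fullScanAsSeq is a hypothetical helper standing in for HBaseWrite.fullScan,
and computerScore is the function from the code above):

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration

class ScoreWords extends RichMapFunction[String, Seq[(String, Double)]] {
  private var rows: Seq[(Int, String)] = _

  // open() runs once per parallel task on the worker, so nothing here is
  // serialized into the job graph
  override def open(parameters: Configuration): Unit = {
    rows = HBaseWrite.fullScanAsSeq() // hypothetical: scan HBase into a local Seq
  }

  override def map(word: String): Seq[(String, Double)] =
    rows.map(y => (word, computerScore(word, y)))
}

// usage: val scored = words.map(new ScoreWords), then write to HBase in a sink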







Re: Accessing state in connected streams

2016-08-30 Thread aris kol
Hi Aljoscha,


I removed the business objects and logic, etc. I am happy to post it here; I am
sure this is a common issue when you start to seriously mess with state.


Assuming a type for the Output, and assuming that there is a function
(EventA => String) in the mapWithState operator of typeAStream (implying the
state is just a Seq[String] per key):

def coFun = new CoFlatMapFunction[EventA, EventB, Option[Output]] {

  override def flatMap1(in: EventA, out: Collector[Option[Output]]) =
    out.collect(None)

  override def flatMap2(in: EventB, out: Collector[Option[Output]]) = {

    new RichFlatMapFunction[EventB, Option[Output]]
        with StatefulFunction[EventB, Option[Output], Seq[String]] {

      lazy val stateTypeInfo: TypeInformation[Seq[String]] =
        implicitly[TypeInformation[Seq[String]]]
      lazy val serializer: TypeSerializer[Seq[String]] =
        stateTypeInfo.createSerializer(getRuntimeContext.getExecutionConfig)
      override lazy val stateSerializer: TypeSerializer[Seq[String]] = serializer

      override def flatMap(in: EventB, out: Collector[Option[Output]]): Unit = {
        out.collect(
          applyWithState(
            in,
            (in, state) =>
              (state match {
                case None    => None
                case Some(s) => Some(Output(...))
              }, state)
          )
        )
      }

      flatMap(in, out)
    }
  }
}

applyWithState throws the exception, and my intuition says I am doing something
seriously wrong in the instantiation. I tried to make something work using the
mapWithState implementation as a guide and ended up here.

Thanks,
Aris


From: Aljoscha Krettek 
Sent: Tuesday, August 30, 2016 10:06 AM
To: user@flink.apache.org
Subject: Re: Accessing state in connected streams

Hi Aris,
I think you're on the right track with using a CoFlatMap for this. Could you 
maybe post the code of your CoFlatMapFunction (or you could send it to me 
privately if you have concerns with publicly posting it) then I could have a 
look.

Cheers,
Aljoscha

On Mon, 29 Aug 2016 at 15:48 aris kol wrote:

Any other opinion on this?


Thanks :)

Aris

From: aris kol
Sent: Sunday, August 28, 2016 12:04 AM

To: user@flink.apache.org
Subject: Re: Accessing state in connected streams

In the implementation I am passing just one CoFlatMapFunction, where flatMap1, 
which operates on EventA, just emits a None (doesn't do anything practically) 
and flatMap2 tries to access the state and throws the NPE.

It wouldn't make sense to use a mapper in this context; I would still want to 
flatten afterwards before pushing downstream.


Aris



From: Sameer W
Sent: Saturday, August 27, 2016 11:40 PM
To: user@flink.apache.org
Subject: Re: Accessing state in connected streams

Ok, sorry about that :-). I misunderstood, as I am not familiar with Scala code. 
Just curious though: how are you passing two MapFunctions to the flatMap 
function on the connected stream? The interface of ConnectedStreams requires 
just one CoMap function:
https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/api/datastream/ConnectedStreams.html

Sameer

On Sat, Aug 27, 2016 at 6:13 PM, aris kol wrote:

Let's say I have two types sharing the same trait

trait Event {
def id: Id
}

case class EventA(id: Id, info: InfoA) extends Event
case class EventB(id: Id, info: InfoB) extends Event

Each of these events gets pushed to a Kafka topic and gets consumed by a stream 
in Flink.

Let's say I have two streams

Events of type A create state:

val typeAStream = env.addSource(...)
.flatMap(someUnmarshallerForA)
.keyBy(_.id)
.mapWithState(...)

val typeBStream = env.addSource(...)
.flatMap(someUnmarshallerForB)
.keyBy(_.id)

I want now to process the events in typeBStream using the information stored in 
the State of typeAStream.

One approach would be to use the same stream for the two topics and then 
pattern match, but Event subclasses may grow in numbers and
may have different loads, thus I may want to keep things separate.

Would something along the lines of:

typeAStream.connect(typeBStream).
flatMap(
new IdentityFlatMapFunction(),
new SomeRichFlatMapFunctionForEventB[EventB, O] with StatefulFunction[EventB, O, 
G[EventA]] { ... }
)

work?

I tried this approach and I ended up with an NPE because the state object was 
not initialized (meaning it was not there).


Thanks,
Aris




Re: Accessing state in connected streams

2016-08-30 Thread Aljoscha Krettek
Hi Aris,
I think you're on the right track with using a CoFlatMap for this. Could
you maybe post the code of your CoFlatMapFunction (or you could send it to
me privately if you have concerns with publicly posting it) then I could
have a look.
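
A minimal sketch of that direction — one RichCoFlatMapFunction holding explicit
keyed state, instead of instantiating a second rich function inside flatMap2
(illustrative simplified types, not the code from this thread; the
ValueStateDescriptor constructor varies slightly across Flink versions):

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
import org.apache.flink.util.Collector

// simplified stand-ins for the thread's types
case class EventA(id: String, info: String)
case class EventB(id: String, info: String)
case class Output(id: String, fromA: String, fromB: String)

class JoinAB extends RichCoFlatMapFunction[EventA, EventB, Output] {
  // keyed state: the last EventA info seen for the current key
  private var lastA: ValueState[String] = _

  override def open(parameters: Configuration): Unit = {
    lastA = getRuntimeContext.getState(
      new ValueStateDescriptor[String]("lastA", classOf[String]))
  }

  // stream A only updates the state and emits nothing
  override def flatMap1(a: EventA, out: Collector[Output]): Unit =
    lastA.update(a.info)

  // stream B reads the state written by stream A for the same key
  override def flatMap2(b: EventB, out: Collector[Output]): Unit = {
    val seen = lastA.value()
    if (seen != null) out.collect(Output(b.id, seen, b.info))
  }
}

// usage: typeAStream.connect(typeBStream).keyBy(_.id, _.id).flatMap(new JoinAB)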

Cheers,
Aljoscha

On Mon, 29 Aug 2016 at 15:48 aris kol  wrote:

> Any other opinion on this?
>
>
> Thanks :)
>
> Aris
> From: aris kol
> Sent: Sunday, August 28, 2016 12:04 AM
>
> To: user@flink.apache.org
> Subject: Re: Accessing state in connected streams
>
> In the implementation I am passing just one CoFlatMapFunction, where
> flatMap1, which operates on EventA, just emits a None (doesn't do anything
> practically) and flatMap2 tries to access the state and throws the NPE.
>
> It wouldn't make sense to use a mapper in this context; I would still want
> to flatten afterwards before pushing downstream.
>
>
> Aris
>
>
> --
> From: Sameer W
> Sent: Saturday, August 27, 2016 11:40 PM
> To: user@flink.apache.org
> Subject: Re: Accessing state in connected streams
>
> Ok, sorry about that :-). I misunderstood, as I am not familiar with Scala
> code. Just curious though: how are you passing two MapFunctions to the
> flatMap function on the connected stream? The interface of ConnectedStreams
> requires just one CoMap function:
> https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/api/datastream/ConnectedStreams.html
>
> Sameer
>
> On Sat, Aug 27, 2016 at 6:13 PM, aris kol  wrote:
>
>> Let's say I have two types sharing the same trait
>>
>> trait Event {
>> def id: Id
>> }
>>
>> case class EventA(id: Id, info: InfoA) extends Event
>> case class EventB(id: Id, info: InfoB) extends Event
>>
>> Each of these events gets pushed to a Kafka topic and gets consumed by a
>> stream in Flink.
>>
>> Let's say I have two streams
>>
>> Events of type A create state:
>>
>> val typeAStream = env.addSource(...)
>> .flatMap(someUnmarshallerForA)
>> .keyBy(_.id)
>> .mapWithState(...)
>>
>> val typeBStream = env.addSource(...)
>> .flatMap(someUnmarshallerForB)
>> .keyBy(_.id)
>>
>> I want now to process the events in typeBStream using the information
>> stored in the State of typeAStream.
>>
>> One approach would be to use the same stream for the two topics and then
>> pattern match, but Event subclasses may grow in numbers and
>> may have different loads, thus I may want to keep things separate.
>>
>> Would something along the lines of:
>>
>> typeAStream.connect(typeBStream).
>> flatMap(
>> new IdentityFlatMapFunction(),
>> new SomeRichFlatMapFunctionForEventB[EventB, O] with
>> StatefulFunction[EventB, O, G[EventA]] { ... }
>> )
>>
>> work?
>>
>> I tried this approach and I ended up with an NPE because the state object
>> was not initialized (meaning it was not there).
>>
>>
>> Thanks,
>> Aris
>>
>>
>


Re: Data Transfer between TM should be encrypted

2016-08-30 Thread Stephan Ewen
Let me loop in Vijay, I think he is the one working on this and can
probably give the best estimate when it can be expected.

@vijay: For the SSL/TLS transport encryption - do you have an estimate for
the timeline of that feature?


On Mon, Aug 29, 2016 at 8:54 PM, vinay patil wrote:

> Hi Stephan,
>
> Thank you for your reply.
>
> By when can I expect this feature to be integrated into master or a release
> version?
>
> We are going to get production data (financial data) at the end of October,
> so I want to have this feature before then.
>
> Regards,
> Vinay Patil
>
> On Mon, Aug 29, 2016 at 11:15 AM, Stephan Ewen [via Apache Flink User
> Mailing List archive.] <[hidden email]> wrote:
>
>> Hi!
>>
>> The way that the JIRA issue you linked will achieve this is by hooking
>> into the network stream pipeline directly and encrypting the raw network
>> byte stream. We built the network stack on Netty, and will use Netty's
>> SSL/TLS handlers for that.
>>
>> That should be much more efficient than manual encryption/decryption in
>> each user function.
>>
>> Stephan
>>
>>
>>
>>
>>
>>
>> On Mon, Aug 29, 2016 at 6:12 PM, vinay patil <[hidden email]> wrote:
>>
>>> Hi Ufuk,
>>>
>>> This is regarding this issue
>>> https://issues.apache.org/jira/browse/FLINK-4404
>>>
>>> How can we achieve this? I am able to decrypt the data from Kafka coming
>>> in, but I want to make sure that the data is encrypted when flowing between
>>> TMs.
>>>
>>> One approach I can think of is to decrypt the data at the start of each
>>> operator and encrypt it at the end of each operator, but I feel this is not
>>> an efficient approach.
>>>
>>> I just want to check whether there are alternatives to this, and whether it
>>> can be achieved through configuration.
>>>
>>> Regards,
>>> Vinay Patil
>>>