[jira] [Created] (FLINK-6024) Need more fine-grained info for "InvalidProgramException: This type (...) cannot be used as key"

2017-03-10 Thread Luke Hutchison (JIRA)
Luke Hutchison created FLINK-6024:
-

 Summary: Need more fine-grained info for "InvalidProgramException: 
This type (...) cannot be used as key"
 Key: FLINK-6024
 URL: https://issues.apache.org/jira/browse/FLINK-6024
 Project: Flink
  Issue Type: Improvement
  Components: Core
Affects Versions: 1.2.0
Reporter: Luke Hutchison


I got this very confusing exception:

InvalidProgramException: This type (MyType) cannot be used as key

I dug through the code, and could not find what was causing this. The help text 
for type.isKeyType(), in Keys.java:329, right before the exception is thrown, 
says: "Checks whether this type can be used as a key. As a bare minimum, types 
have to be hashable and comparable to be keys." However, this didn't solve the 
problem.

I discovered that in my case, the error was occurring because I added a new 
constructor to the type, and I didn't have a default constructor. This is 
probably quite a common thing to happen for POJOs, so the error message should 
give some detail saying that this is the problem.

Other causes of this failure, such as the class not being public, the
constructor not being public, the key field not being public, the key field
not being a serializable type, the key not being Comparable, or the key not
being hashable, should likewise be named in the error message, depending on
the actual cause of the problem.
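For reference, a minimal sketch (in Scala; the type and field names are made up)
of a type that fails the key check for exactly this reason, next to one that
satisfies Flink's POJO rules:

```scala
// Fails with "This type (...) cannot be used as key": without a public
// no-arg constructor Flink cannot analyze this as a POJO, falls back to a
// generic type, and a generic type must implement Comparable to be a key.
class BadKeyType(var id: Int, var name: String)

// Accepted as a key type: public class, public (var) fields of supported
// types, and a public no-arg constructor.
class GoodKeyType(var id: Int, var name: String) {
  def this() = this(0, "")
}

// e.g. grouping with the whole object as the key (ds.groupBy(x => x))
// throws "InvalidProgramException: This type (...) cannot be used as key"
// for BadKeyType but works for GoodKeyType.
```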





Re: Flink CSV parsing

2017-03-10 Thread Flavio Pompermaier
If you already have an idea on how to proceed, maybe I can try to take care
of it and issue a PR using commons-csv or whatever library you prefer

On 10 Mar 2017 22:07, "Fabian Hueske"  wrote:

Hi Flavio,

Flink's CsvInputFormat was originally meant to be an efficient way to parse
structured text files and dates back to the very early days of the project
(probably 2011 or so).
It was never meant to be compliant with the RFC specification and initially
didn't support many features like quoting, quote escaping, etc. Some of
these were later added but others not.

I agree that the requirements for the CsvInputFormat have changed as more
people are using the project and that a standard compliant parser would be
desirable.
We could definitely look into using an existing library for the parsing,
but it would still need to be integrated with the way that Flink's
InputFormats work. For instance, your approach isn't standard compliant
either, because TextInputFormat is not aware of quotes and would break
records with quoted record delimiters (FLINK-6016 [1]).

I would be OK with having a less efficient format which is not based on the
current implementation but which is standard compliant.
IMO that would be a very useful contribution.

Best, Fabian

[1] https://issues.apache.org/jira/browse/FLINK-6016





2017-03-10 11:28 GMT+01:00 Flavio Pompermaier :

> Hi to all,
> I want to discuss with the dev group something about CSV parsing.
> Since I started using Flink with CSVs I always faced some little problem
> here and there and the new tickets about the CSV parsing seems to confirm
> that this part is still problematic.
> In my production jobs I gave up using Flink CSV parsing in favour of Apache
> commons-csv and it works great. It's perfectly configurable and robust.
> A working example is available at [1].
>
> Thus, why not use that library directly and contribute back (if needed)
> to another apache library if improvements are required to speed up the
> parsing? Have you ever tried to compare the performances of the 2 parsers?
>
> Best,
> Flavio
>
> [1]
> https://github.com/okkam-it/flink-examples/blob/master/src/main/java/it/okkam/datalinks/batch/flink/datasourcemanager/importers/Csv2RowExample.java
>
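For reference, a minimal sketch of the commons-csv parsing Flavio describes
(assuming commons-csv 1.3+ for withFirstRecordAsHeader; the sample data is made up):

```scala
import java.io.StringReader
import org.apache.commons.csv.CSVFormat
import scala.collection.JavaConverters._

// Record 2 contains a quoted field delimiter, record 3 a quoted record
// delimiter -- the case a line-based TextInputFormat splits incorrectly
// (FLINK-6016).
val data = "id,name\n1,plain\n2,\"Doe, John\"\n3,\"multi\nline\""
val parser = CSVFormat.RFC4180.withFirstRecordAsHeader()
  .parse(new StringReader(data))
try {
  for (record <- parser.getRecords.asScala) {
    println(s"${record.get("id")} -> ${record.get("name")}")
  }
} finally {
  parser.close()
}
```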


Re: Flink CSV parsing

2017-03-10 Thread Fabian Hueske
Hi Flavio,

Flink's CsvInputFormat was originally meant to be an efficient way to parse
structured text files and dates back to the very early days of the project
(probably 2011 or so).
It was never meant to be compliant with the RFC specification and initially
didn't support many features like quoting, quote escaping, etc. Some of
these were later added but others not.

I agree that the requirements for the CsvInputFormat have changed as more
people are using the project and that a standard compliant parser would be
desirable.
We could definitely look into using an existing library for the parsing,
but it would still need to be integrated with the way that Flink's
InputFormats work. For instance, your approach isn't standard compliant
either, because TextInputFormat is not aware of quotes and would break
records with quoted record delimiters (FLINK-6016 [1]).

I would be OK with having a less efficient format which is not based on the
current implementation but which is standard compliant.
IMO that would be a very useful contribution.

Best, Fabian

[1] https://issues.apache.org/jira/browse/FLINK-6016





2017-03-10 11:28 GMT+01:00 Flavio Pompermaier :

> Hi to all,
> I want to discuss with the dev group something about CSV parsing.
> Since I started using Flink with CSVs I always faced some little problem
> here and there and the new tickets about the CSV parsing seems to confirm
> that this part is still problematic.
> In my production jobs I gave up using Flink CSV parsing in favour of Apache
> commons-csv and it works great. It's perfectly configurable and robust.
> A working example is available at [1].
>
> Thus, why not use that library directly and contribute back (if needed)
> to another apache library if improvements are required to speed up the
> parsing? Have you ever tried to compare the performances of the 2 parsers?
>
> Best,
> Flavio
>
> [1]
> https://github.com/okkam-it/flink-examples/blob/master/src/main/java/it/okkam/datalinks/batch/flink/datasourcemanager/importers/Csv2RowExample.java
>


Re: Scala / Java window issue

2017-03-10 Thread Fabian Hueske
Hi Radu,

there are already several WindowFunction implementations in the Table API
that can help as a reference:

- IncrementalAggregateAllTimeWindowFunction [1]
- IncrementalAggregateAllWindowFunction [2]
- IncrementalAggregateTimeWindowFunction [3]

Also have a look at the DataStreamAggregate [4] class that assembles
the DataStream programs based on these functions.

Best, Fabian

[1]
https://github.com/apache/flink/blob/master/flink-libraries/flink-table/src/main/scala/org/apache/flink/table/runtime/aggregate/IncrementalAggregateAllTimeWindowFunction.scala
[2]
https://github.com/apache/flink/blob/master/flink-libraries/flink-table/src/main/scala/org/apache/flink/table/runtime/aggregate/IncrementalAggregateAllWindowFunction.scala
[3]
https://github.com/apache/flink/blob/master/flink-libraries/flink-table/src/main/scala/org/apache/flink/table/runtime/aggregate/IncrementalAggregateTimeWindowFunction.scala
[4]
https://github.com/apache/flink/blob/master/flink-libraries/flink-table/src/main/scala/org/apache/flink/table/plan/nodes/datastream/DataStreamAggregate.scala

2017-03-10 18:46 GMT+01:00 Radu Tudoran :

> Hi,
>
> I am struggling to move a working implementation from Java to Scala
> :(...this is for computing window aggregates (sliding window).
> As I am not proficient in Scala I got blocked on (probably a stupid
> error)...maybe someone can help me.
>
>
> I am trying to create a simple window function to be applied to the
> datastream after the window is created (I have one case with global windows
> and another case with keyed windows, so the question applies on both
> AllWindowFunction as well as to WindowFunction). However I get a
> type mismatch error when applying the function to the window.
>
> As I need to implement the function in scala... I tried 2 options, which
> both fail:
> Option 1: implement MyWindowFunction by extending the WindowFunction from
> the scala package (org.apache.flink.streaming.api.scala.function)
> ..in this case when I apply the function to the window it tells me that
> there is a type mismatch
> Option 2: implement MyWindowFunction by extending the WindowFunction from
> the default package (org.apache.flink.streaming.api.functions.windowing)
> ..in this case when I try to override the apply function I get a
> compilation error that the class needs to be abstract as it does not
> implement the apply function :(
>
> ...any solution?
>
>
>
>


Re: [DISCUSS] FLIP-17 Side Inputs

2017-03-10 Thread Kenneth Knowles
Hi all,

I thought I would briefly join this thread to mention some side input
lessons from Apache Beam. My knowledge of Flink is not deep enough,
technically or philosophically, to make any specific recommendations. And I
might just be repeating things that the docs and threads cover, but I hope
it might be helpful anyhow.

Side Input Visibility / matching: Beam started with a coupling between the
windowing on a stream and the way that windows are mapped between main
input and side input. This is actually not needed and we'll be making the
mapping explicit (with sensible defaults). In particular, the mapping
determines when you can garbage collect, when you know that no main input
element will ever map to a particular window again (so opaque mappings need
some metadata).

Side Input Readiness: There is an unpleasant asymmetry between waiting for
the first triggering of a side input but not waiting for any later
triggering. This manifests strongly when a user actually wants to know
something about the relationship to side input update latency and main
input processing. This echoes some of the concern here about user-defined
control over readiness. IMO this is a rather open area.

Default values for singleton side inputs: A special case of side input
readiness that is related also to windowing. By far the most useful
singleton side input is the result of a global reduction with an
associative operator. A lot of these operators also have an
identity element. It is nice for this identity element (known a priori) to
be "always available" on the side input, for every window, if it is
expected to be something that is continually updated. But if the
configuration is such that it is a one-time triggering of bounded data,
that behavior is not right. Related, after some amount of time we conclude
that no input will ever be received for a window, and the side input
becomes ready.

Map Side Inputs with triggers: When new data arrives for a key in Beam,
there's no way to know which value should "win", so you basically just
can't use map side inputs with triggers.

These are just some quick thoughts at a very high level.

Kenn

On Thu, Mar 9, 2017 at 7:59 AM, Aljoscha Krettek 
wrote:

> Hi Jamie,
> actually the approach where the .withSideInput() comes before the user
> function is only required for implementation proposal #1, which I like
> the least. For the other two it can be after the user function, which is
> also what I prefer.
>
> Regarding semantics: yes, we simply wait for anything to be available.
> For GlobalWindows, i.e. side inputs on a normal function where we simply
> don't have windows, this means that we wait for anything. For the
> windowed case, which I'm proposing as a second step we will wait for
> side input in a window to be available that matches the main-input
> window. For the keyed case we wait for something on the same key to be
> available, for the broadcast case we wait for anything.
>
> Best,
> Aljoscha
>
> On Thu, Mar 9, 2017, at 16:55, Jamie Grier wrote:
> > Hi, I think the proposal looks good.  The only thing I wasn't clear on
> > was
> > which API is actually being proposed.  The one where .withSideInput()
> > comes
> > before the user function or after.  I would definitely prefer it come
> > after
> > since that's the normal pattern in the Flink API.  I understood that
> > makes
> > the implementation different (maybe harder) but I think it helps keep the
> > API uniform which is really good.
> >
> > Overall I think the API looks good and yes there are some tricky
> > semantics
> > here but in general if, when processing keyed main streams, we always
> > wait
> > until there is a side-input available for that key we're off to a great
> > start and I think that was what you're suggesting in the design doc.
> >
> > -Jamie
> >
> >
> > On Thu, Mar 9, 2017 at 7:27 AM, Aljoscha Krettek 
> > wrote:
> >
> > > Hi,
> > > these are all valuable suggestions and I think that we should implement
> > > them when the time is right. However, I would like to first get a
> > > minimal viable version of this feature into Flink and then expand on
> it.
> > > I think the last few tries of tackling this problem fizzled out because
> > > we got too deep into discussing special semantics and features. I think
> > > the most important thing to agree on right now is the basic API and the
> > > implementation plan. What do you think about that?
> > >
> > > Regarding your suggestions, I have in fact a branch [1] from May 2016
> > > where I implemented a prototype implementation. This has an n-ary
> > > operator and inputs can be either bounded or unbounded and the
> > > implementation actually waits for all bounded inputs to finish before
> > > starting to process the unbounded inputs.
> > >
> > > In general, I think blocking on an input is only possible while you're
> > > waiting for a bounded input to finish. If all inputs are unbounded you
> > > cannot block because 

Re: Machine Learning on Flink - Next steps

2017-03-10 Thread Stavros Kontopoulos
Thanks Theodore,

I'd vote for

- Offline learning with Streaming API

- Low-latency prediction serving

Some comments...

Online learning

Good to have but my feeling is that it is not a strong requirement (if a
requirement at all) across the industry right now. May become hot in the
future.

Offline learning with Streaming API:

Although it requires engine changes or extensions (feasibility is an issue
here), my understanding is that it reflects the industry common practice
(train every few minutes at most) and it would be great if that was
supported out of the box providing a friendly API for the developer.

Offline learning with the batch API:

I would love to have a limited set of algorithms so someone does not have to
leave Flink to work with another tool
for some initial dataset if he wants to. In other words, let's reach a
mature state with some basic algos merged.
There is a lot of work pending; let's not waste it.

Low-latency prediction serving

Model serving is a long standing problem, we could definitely help with
that.

Regards,
Stavros



On Fri, Mar 10, 2017 at 4:08 PM, Till Rohrmann  wrote:

> Thanks Theo for steering Flink's ML effort here :-)
>
> I'd vote to concentrate on
>
> - Online learning
> - Low-latency prediction serving
>
> because of the following reasons:
>
> Online learning:
>
> I agree that this topic is highly researchy and it's not even clear whether
> it will ever be of any interest outside of academia. However, it was the
> same for other things as well. Adoption in industry is usually slow and
> sometimes one has to dare to explore something new.
>
> Low-latency prediction serving:
>
> Flink with its streaming engine seems to be the natural fit for such a task
> and it is a rather low hanging fruit. Furthermore, I think that users would
> directly benefit from such a feature.
>
> Offline learning with Streaming API:
>
> I'm not fully convinced yet that the streaming API is powerful enough
> (mainly due to lack of proper iteration support and spilling capabilities)
> to support a wide range of offline ML algorithms. And even then it will only
> support rather small problem sizes because streaming cannot gracefully
> spill the data to disk. There are still too many open issues with the
> streaming API to be applicable for this use case imo.
>
> Offline learning with the batch API:
>
> For offline learning the batch API is imo still better suited than the
> streaming API. I think it will only make sense to port the algorithms to
> the streaming API once batch and streaming are properly unified. Alone the
> highly efficient implementations for joining and sorting of data which can
> go out of memory are important to support big sized ML problems. In
> general, I think it might make sense to offer a basic set of ML primitives.
> However, already offering this basic set is a considerable amount of work.
>
> Concerning the independent organization for the development: I think it
> would be great if the development could still happen under the umbrella of
> Flink's ML library because otherwise we might risk some kind of
> fragmentation. In order for people to collaborate, one can also open PRs
> against a branch of a forked repo.
>
> I'm currently working on wrapping the project re-organization discussion
> up. The general position was that it would be best to have an incremental
> build and keep everything in the same repo. If this is not possible then we
> want to look into creating a sub repository for the libraries (maybe other
> components will follow later). I hope to make some progress on this front
> in the next couple of days/week. I'll keep you updated.
>
> As a general remark for the discussions on the google doc. I think it would
> be great if we could at least mirror the discussions happening in the
> google doc back on the mailing list or ideally conduct the discussions
> directly on the mailing list. That's at least what the ASF encourages.
>
> Cheers,
> Till
>
> On Fri, Mar 10, 2017 at 10:52 AM, Gábor Hermann 
> wrote:
>
> > Hey all,
> >
> > Sorry for the bit late response.
> >
> > I'd like to work on
> > - Offline learning with Streaming API
> > - Low-latency prediction serving
> >
> > I would drop the batch API ML because of past experience with lack of
> > support, and online learning because of the lack of use-cases.
> >
> > I completely agree with Kate that offline learning should be supported,
> > but given Flink's resources I prefer using the streaming API as Roberto
> > suggested. Also, full model lifecycle (or end-to-end ML) could be more
> > easily supported in one system (one API). Connecting Flink Batch with
> Flink
> > Streaming is currently cumbersome (although side inputs [1] might help).
> In
> > my opinion, a crucial part of end-to-end ML is low-latency predictions.
> >
> > As another direction, we could integrate Flink Streaming API with other
> > projects (such as Prediction IO). However, I believe it's better to 

Scala / Java window issue

2017-03-10 Thread Radu Tudoran
Hi,

I am struggling to move a working implementation from Java to Scala :(...this 
is for computing window aggregates (sliding window).
As I am not proficient in Scala I got blocked on (probably a stupid
error)...maybe someone can help me.


I am trying to create a simple window function to be applied to the datastream 
after the window is created (I have one case with global windows and another 
case with keyed windows, so the question applies on both AllWindowFunction as 
well as to WindowFunction). However I get a type mismatch error when applying
the function to the window.

As I need to implement the function in scala... I tried 2 options, which both 
fail:
Option 1: implement MyWindowFunction by extending the WindowFunction from the 
scala package (org.apache.flink.streaming.api.scala.function)
..in this case when I apply the function to the window it tells me that
there is a type mismatch
Option 2: implement MyWindowFunction by extending the WindowFunction from the
default package (org.apache.flink.streaming.api.functions.windowing)
..in this case when I try to override the apply function I get a compilation 
error that the class needs to be abstract as it does not implement the apply 
function :(

...any solution?
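For reference, a minimal sketch of Option 1 against the Flink 1.2 Scala API
(the tuple types are illustrative). The Scala-package WindowFunction declares
apply with a Scala Iterable and must be applied to a Scala DataStream; mixing
in the Java WindowFunction, or applying a Scala function to a Java DataStream,
produces exactly this kind of type mismatch:

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Sums the second tuple field per key and per sliding window.
class MyWindowFunction
    extends WindowFunction[(String, Long), (String, Long), String, TimeWindow] {
  override def apply(key: String, window: TimeWindow,
                     input: Iterable[(String, Long)],
                     out: Collector[(String, Long)]): Unit = {
    out.collect((key, input.map(_._2).sum))
  }
}

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.fromElements(("a", 1L), ("a", 2L), ("b", 3L))
  .keyBy(_._1)
  .timeWindow(Time.seconds(10), Time.seconds(5)) // sliding window
  .apply(new MyWindowFunction)
  .print()
env.execute()
```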





Re: [DISCUSS] FLIP-17 Side Inputs

2017-03-10 Thread Gábor Hermann

Hi all,

Thanks Aljoscha for going forward with the side inputs and for the nice 
proposal!


I'm also in favor of the implementation with N-ary input (3.) for the 
reasons Ventura explained. I'm strongly against managing side inputs at 
StreamTask level (2.), as it would create another abstraction for almost 
the same purposes as a TwoInputOperator. Making use of the second input 
of a 2-input operator (1.) could be useful for prototyping. I assume it 
would be easier to implement a minimal solution with that, but I'm not 
sure. If the N-ary input prototype is almost ready, then it's best to go 
with that.


For side input readiness, it would be better to wait for the side input 
to be completely ready. As Gyula has suggested, waiting only for the 
first record does not differ much from not waiting at all. I would also 
prefer user-defined readiness, but for the minimal solution we could fix 
this to a completely read side input and maybe go only for static side
inputs first.


I understand that we should push a minimal viable solution forward. The 
current API and implementation proposal seems like a good start. 
However, long term goals are also important, to avoid going in a wrong 
direction. As I have not participated in the discussion let me share 
also some longer term considerations in reply to the others. (Sorry for 
the length.)



How would side inputs help the users? For the simple, non-windowed cases 
with static input a CoFlatMap might be sufficient. The main input can be 
buffered while the side input is consumed and stored in the operator 
state. Thus, the user can decide inside the CoFlatMap UDF when to start 
consuming the stream input (e.g. when the side input is ready). Of 
course, this might be problematic to implement, so the side inputs API 
could help the user with this pattern.
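As a sketch of that waiting pattern with the current API (the String types are
placeholders, and the buffer is not checkpointed here for brevity):

```scala
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.util.Collector
import scala.collection.mutable.ArrayBuffer

// Input 1 is the main stream, input 2 the (static) side input. Main
// elements are buffered until the side input is "ready" (here: the first
// side element was seen; deciding real readiness is the hard part).
class BufferingEnricher extends CoFlatMapFunction[String, String, String] {
  private var side: Option[String] = None
  private val buffer = ArrayBuffer[String]()

  override def flatMap1(main: String, out: Collector[String]): Unit =
    side match {
      case Some(s) => out.collect(s"$main|$s")
      case None    => buffer += main // not checkpointed in this sketch
    }

  override def flatMap2(s: String, out: Collector[String]): Unit = {
    side = Some(s)
    buffer.foreach(m => out.collect(s"$m|$s"))
    buffer.clear()
  }
}
```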


1)
First, marking the end of side input is not easy. Every side input 
should broadcast some kind of EOF to the consuming operator. If we 
generalize to non-static (slowly changing) inputs, then progress 
tracking messages should be broadcast periodically. This is reminiscent 
of the watermark time tracking for windows.


I agree with Gyula that we should have user defined side input 
readiness. Although, couldn't we use windowing for this? It's not worth 
having two separate time tracking mechanisms (one for windows, one for 
side inputs). If the windowing is not flexible enough to handle such 
cases, then what about exposing watermark tracking to the user? E.g. we 
could have an extra user defined event handler in RichFunctions when 
time progress is made. This generalizes the two progress tracking. Of 
course, this approach requires more work so it's not for the minimal 
viable solution.


2)
Second, exposing a buffer to the user helps a bit, but the users could 
buffer the data simply in an operator state. How would a buffer help 
more? Of course, the interface could have multiple implementations, such 
as a spilling buffer, and the user could choose. That helps the "waiting 
pattern".


I agree with Wenlong's suggestion that blocking (or backpressure) must
be an option. It seems crucial to avoid consuming a large part of the 
main input, that would take a lot of space. I suggest not to expose a 
buffer, but to allow the users to control whether to read from the 
different inputs. E.g. in the N-ary input operator UDF the user could 
control this per input: startConsuming(), stopConsuming(). Then it's the 
user's responsibility not to get into deadlocks, but the runtime handles 
the buffering. For reading static side input, the user could stop 
consuming the main input until she considers the side input ready.
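To make that concrete, a purely hypothetical sketch of the per-input control
just described (no such interface exists in Flink today):

```scala
// Hypothetical API sketch, not an existing Flink interface: the UDF of an
// N-ary operator could steer which inputs the runtime reads, leaving the
// buffering and backpressure to the runtime.
trait InputFlowControl {
  /** Stop reading from the given input; the runtime buffers or backpressures it. */
  def stopConsuming(inputIndex: Int): Unit

  /** Resume reading from the given input. */
  def startConsuming(inputIndex: Int): Unit
}
```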


User controlled backpressure would also benefit avoiding deadlock in 
stream loops.


3)
I also agree with Wenlong's second point, that checkpointing should be
considered, but I don't think it's really important for the prototype. 
If we maintain the side input in the state of the consuming operator 
then the checkpoint would not stop once the static side input is 
finished, because the main input goes on, the operator stays running. 
Incremental checkpointing could prevent checkpointing static data at 
every checkpoint.



Cheers,
Gabor

On 2017-03-09 16:59, Aljoscha Krettek wrote:


Hi Jamie,
actually the approach where the .withSideInput() comes before the user
function is only required for implementation proposal #1, which I like
the least. For the other two it can be after the user function, which is
also what I prefer.

Regarding semantics: yes, we simply wait for anything to be available.
For GlobalWindows, i.e. side inputs on a normal function where we simply
don't have windows, this means that we wait for anything. For the
windowed case, which I'm proposing as a second step we will wait for
side input in a window to be available that matches the main-input
window. For the keyed case we wait for something on the same key to be
available, for the broadcast case we wait for anything.


Re: Flink 1.2 / YARN on Secure MapR Cluster

2017-03-10 Thread Till Rohrmann
Hi,

could it be that this issue is related to [1]? If so, then it should soon
be fixed.

[1] https://issues.apache.org/jira/browse/FLINK-5949

Cheers,
Till

On Wed, Mar 8, 2017 at 11:50 PM, dschexna  wrote:

> I am attempting to run flink / YARN on a secure MapR 5.2 cluster.  The
> cluster is secured using "MapR Native Security", not kerberos.
>
> I did include the MapR zookeeper when building:
>
> opt/apache-maven-3.3.9/bin/mvn clean install -DskipTests -Pvendor-repos
> -Dhadoop.version=2.7.0-mapr-1607 -Dzookeeper.version=3.4.5-mapr-1604
>
> When I try to launch yarn-session.sh, I receive the following error:
>
> [mapr@maprdemo flink]$ ./bin/yarn-session.sh -n 1
> 2017-03-08 14:47:24,223 INFO
> org.apache.flink.configuration.GlobalConfiguration- Loading
> configuration property: jobmanager.rpc.address, localhost
> 2017-03-08 14:47:24,224 INFO
> org.apache.flink.configuration.GlobalConfiguration- Loading
> configuration property: jobmanager.rpc.port, 6123
> 2017-03-08 14:47:24,224 INFO
> org.apache.flink.configuration.GlobalConfiguration- Loading
> configuration property: jobmanager.heap.mb, 256
> 2017-03-08 14:47:24,224 INFO
> org.apache.flink.configuration.GlobalConfiguration- Loading
> configuration property: taskmanager.heap.mb, 512
> 2017-03-08 14:47:24,225 INFO
> org.apache.flink.configuration.GlobalConfiguration- Loading
> configuration property: taskmanager.numberOfTaskSlots, 1
> 2017-03-08 14:47:24,225 INFO
> org.apache.flink.configuration.GlobalConfiguration- Loading
> configuration property: taskmanager.memory.preallocate, false
> 2017-03-08 14:47:24,225 INFO
> org.apache.flink.configuration.GlobalConfiguration- Loading
> configuration property: parallelism.default, 1
> 2017-03-08 14:47:24,225 INFO
> org.apache.flink.configuration.GlobalConfiguration- Loading
> configuration property: jobmanager.web.port, 8081
> 2017-03-08 14:47:24,292 INFO
> org.apache.flink.configuration.GlobalConfiguration- Loading
> configuration property: jobmanager.rpc.address, localhost
> 2017-03-08 14:47:24,292 INFO
> org.apache.flink.configuration.GlobalConfiguration- Loading
> configuration property: jobmanager.rpc.port, 6123
> 2017-03-08 14:47:24,293 INFO
> org.apache.flink.configuration.GlobalConfiguration- Loading
> configuration property: jobmanager.heap.mb, 256
> 2017-03-08 14:47:24,293 INFO
> org.apache.flink.configuration.GlobalConfiguration- Loading
> configuration property: taskmanager.heap.mb, 512
> 2017-03-08 14:47:24,293 INFO
> org.apache.flink.configuration.GlobalConfiguration- Loading
> configuration property: taskmanager.numberOfTaskSlots, 1
> 2017-03-08 14:47:24,293 INFO
> org.apache.flink.configuration.GlobalConfiguration- Loading
> configuration property: taskmanager.memory.preallocate, false
> 2017-03-08 14:47:24,293 INFO
> org.apache.flink.configuration.GlobalConfiguration- Loading
> configuration property: parallelism.default, 1
> 2017-03-08 14:47:24,294 INFO
> org.apache.flink.configuration.GlobalConfiguration- Loading
> configuration property: jobmanager.web.port, 8081
> 2017-03-08 14:47:26,107 WARN  org.apache.hadoop.util.NativeCodeLoader
> - Unable to load native-hadoop library for your platform... using
> builtin-java classes where applicable
> 2017-03-08 14:47:26,390 WARN
> org.apache.flink.runtime.security.modules.HadoopModule- Hadoop
> security is enabled but current login user does not have Kerberos
> credentials
> 2017-03-08 14:47:26,390 INFO
> org.apache.flink.runtime.security.modules.HadoopModule- Hadoop
> user
> set to mapr (auth:CUSTOM)
> 2017-03-08 14:47:26,431 INFO
> org.apache.flink.configuration.GlobalConfiguration- Loading
> configuration property: jobmanager.rpc.address, localhost
> 2017-03-08 14:47:26,431 INFO
> org.apache.flink.configuration.GlobalConfiguration- Loading
> configuration property: jobmanager.rpc.port, 6123
> 2017-03-08 14:47:26,431 INFO
> org.apache.flink.configuration.GlobalConfiguration- Loading
> configuration property: jobmanager.heap.mb, 256
> 2017-03-08 14:47:26,431 INFO
> org.apache.flink.configuration.GlobalConfiguration- Loading
> configuration property: taskmanager.heap.mb, 512
> 2017-03-08 14:47:26,432 INFO
> org.apache.flink.configuration.GlobalConfiguration- Loading
> configuration property: taskmanager.numberOfTaskSlots, 1
> 2017-03-08 14:47:26,432 INFO
> org.apache.flink.configuration.GlobalConfiguration- Loading
> configuration property: taskmanager.memory.preallocate, false
> 2017-03-08 14:47:26,433 INFO
> org.apache.flink.configuration.GlobalConfiguration- Loading
> configuration property: parallelism.default, 1
> 2017-03-08 14:47:26,433 INFO
> org.apache.flink.configuration.GlobalConfiguration- Loading
> configuration 

Re: Machine Learning on Flink - Next steps

2017-03-10 Thread Till Rohrmann
Thanks Theo for steering Flink's ML effort here :-)

I'd vote to concentrate on

- Online learning
- Low-latency prediction serving

because of the following reasons:

Online learning:

I agree that this topic is highly researchy and it's not even clear whether
it will ever be of any interest outside of academia. However, it was the
same for other things as well. Adoption in industry is usually slow and
sometimes one has to dare to explore something new.

Low-latency prediction serving:

Flink with its streaming engine seems to be the natural fit for such a task
and it is a rather low hanging fruit. Furthermore, I think that users would
directly benefit from such a feature.

Offline learning with Streaming API:

I'm not fully convinced yet that the streaming API is powerful enough
(mainly due to lack of proper iteration support and spilling capabilities)
to support a wide range of offline ML algorithms. And even then it will only
support rather small problem sizes because streaming cannot gracefully
spill the data to disk. There are still too many open issues with the
streaming API to be applicable for this use case imo.

Offline learning with the batch API:

For offline learning the batch API is imo still better suited than the
streaming API. I think it will only make sense to port the algorithms to
the streaming API once batch and streaming are properly unified. Alone the
highly efficient implementations for joining and sorting of data which can
go out of memory are important to support big sized ML problems. In
general, I think it might make sense to offer a basic set of ML primitives.
However, already offering this basic set is a considerable amount of work.

Concerning the independent organization for the development: I think it
would be great if the development could still happen under the umbrella of
Flink's ML library because otherwise we might risk some kind of
fragmentation. In order for people to collaborate, one can also open PRs
against a branch of a forked repo.

I'm currently working on wrapping the project re-organization discussion
up. The general position was that it would be best to have an incremental
build and keep everything in the same repo. If this is not possible then we
want to look into creating a sub repository for the libraries (maybe other
components will follow later). I hope to make some progress on this front
in the next couple of days/week. I'll keep you updated.

As a general remark for the discussions on the google doc. I think it would
be great if we could at least mirror the discussions happening in the
google doc back on the mailing list or ideally conduct the discussions
directly on the mailing list. That's at least what the ASF encourages.

Cheers,
Till

On Fri, Mar 10, 2017 at 10:52 AM, Gábor Hermann 
wrote:

> Hey all,
>
> Sorry for the bit late response.
>
> I'd like to work on
> - Offline learning with Streaming API
> - Low-latency prediction serving
>
> I would drop the batch API ML because of past experience with lack of
> support, and online learning because of the lack of use-cases.
>
> I completely agree with Kate that offline learning should be supported,
> but given Flink's resources I prefer using the streaming API as Roberto
> suggested. Also, full model lifecycle (or end-to-end ML) could be more
> easily supported in one system (one API). Connecting Flink Batch with Flink
> Streaming is currently cumbersome (although side inputs [1] might help). In
> my opinion, a crucial part of end-to-end ML is low-latency predictions.
>
> As another direction, we could integrate Flink Streaming API with other
> projects (such as Prediction IO). However, I believe it's better to first
> evaluate the capabilities and drawbacks of the streaming API with some
> prototype of using Flink Streaming for some ML task. Otherwise we could run
> into critical issues, just as the SystemML integration did with e.g. caching.
> These issues make the integration of the Batch API with other ML projects
> practically infeasible.
>
> I've already been experimenting with offline learning with the Streaming
> API. Hopefully, I can share some initial performance results next week on
> matrix factorization. Naturally, I've run into issues. E.g. I could only
> mark the end of input with some hacks, because this is not needed in a
> streaming job consuming input forever. AFAIK, this would be resolved by
> side inputs [1].
>
> @Theodore:
> +1 for doing the prototype project(s) separately from the main Flink
> repository. Although, I would strongly suggest to follow Flink development
> guidelines as closely as possible. As another note, there is already a
> GitHub organization for Flink related projects [2], but it seems like it
> has not been used much.
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API
> [2] https://github.com/project-flink
>
>
> On 2017-03-04 08:44, Roberto Bentivoglio wrote:
>
> Hi All,
>>
>> I'd like to start working 

Re: [DISCUSS] Flink ML roadmap

2017-03-10 Thread Till Rohrmann
Hi Roberto,

jpmml looks quite promising and this could be a first step towards the
model serving story. Thus, looking really forward seeing it being open
sourced by you guys :-)

@Katherin, I'm not saying that there is no interest in the community to
work on batch features. However, there is simply not much capacity left to
mentor such an effort at the moment. I fear without the mentoring from an
experienced contributor who has worked on the batch part, it will be
extremely hard to get such a change into the code base. But this will
hopefully change in the future.

I think the discussion from this thread moved over to [1] and I will
continue discussing there.

[1]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Machine-Learning-on-Flink-Next-steps-td16334.html#none

Cheers,
Till

On Wed, Mar 8, 2017 at 1:59 AM, Kavulya, Soila P 
wrote:

> Hi Theodore,
>
> We had put together a proposal for an ML DSL in Apache Beam. We had
> developed a couple of scoring engines as part of TAP
> https://github.com/tapanalyticstoolkit/model-scoring-java and
> https://github.com/tapanalyticstoolkit/scoring-pipelines. However, our group is no longer
> actively developing them.
>
> Thanks,
>
> Soila
>
> From: Theodore Vasiloudis [mailto:theodoros.vasilou...@gmail.com]
> Sent: Friday, March 3, 2017 4:11 AM
> To: dev@flink.apache.org
> Cc: Kavulya, Soila P 
> Subject: Re: [DISCUSS] Flink ML roadmap
>
> It seems like a relatively new project, backed by Intel.
>
> My impression from the doc Roberto linked is that they might switch to
> using Beam instead of Spark (?)
> I'm cc'ing Soila who is developer of TAP and has worked on FlinkML in the
> past, perhaps she has some input on how they plan to work with streaming
> and ML in TAP.
>
> Repos:
> [1] https://github.com/tapanalyticstoolkit/
>
> On Fri, Mar 3, 2017 at 12:24 PM, Stavros Kontopoulos <
> st.kontopou...@gmail.com> wrote:
> Interesting, thanx @Roberto. I see that only TAP Analytics Toolkit
> supports streaming. I am not aware of its market share, anyone?
>
> Best,
> Stavros
>
> On Fri, Mar 3, 2017 at 11:50 AM, Theodore Vasiloudis <
> theodoros.vasilou...@gmail.com>
> wrote:
>
> > Thank you for the links Roberto I did not know that Beam was working on
> an
> > ML abstraction as well. I'm sure we can learn from that.
> >
> > I'll start another thread today where we can discuss next steps and
> action
> > points now that we have a few different paths to follow listed on the
> > shared doc,
> > since our deadline was today. We welcome further discussions of course.
> >
> > Regards,
> > Theodore
> >
> > On Thu, Mar 2, 2017 at 10:52 AM, Roberto Bentivoglio <
> > roberto.bentivog...@radicalbit.io> wrote:
> >
> > > Hi All,
> > >
> > > First of all I'd like to introduce myself: my name is Roberto
> Bentivoglio
> > > and I'm currently working for Radicalbit as Andrea Spina (he already
> > wrote
> > > on this thread).
> > > I didn't have the chance to directly contribute on Flink up to now, but
> > > some colleagues of mine are doing that since at least one year (they
> > > contributed also on the machine learning library).
> > >
> > > I hope I'm not jumping into the discussion too late, it's really
> interesting
> > > and the analysis document is depicting really well the scenarios
> > currently
> > > available. Many thanks for your effort!
> > >
> > > If I can add my two cents to the discussion I'd like to add the
> > following:
> > >  - it's clear that currently the Flink community is more deeply focused
> > > on streaming features than on batch features. For this reason I think
> > > that implementing "Offline learning with Streaming API" is really a great idea.
> > >  - I think that the "Online learning" option is really a good fit for
> > > Flink, but maybe we could initially give a higher priority to the
> > > "Offline learning with Streaming API" option. However I think that this
> > > option will be the main goal for the mid/long term.
> > >  - we implemented a library based on jpmml-evaluator[1] and flink
> called
> > > "flink-jpmml". Using this library you can train the models on external
> > > systems and use those models, after exporting them in the standard PMML
> > > format, to run evaluations on top of the DataStream API. We don't have open
> > > sourced this library up to now, but we're planning to do this in the
> next
> > > weeks. We'd like to complete the documentation and the final code
> reviews
> > > before to share it. I hope it will be helpful for the community to
> > enhance
> > > the ML support on Flink
> > >  - I'd like also to mention that the Apache Beam community is thinking
> > > about an ML DSL. There is a design document and a couple of Jira tasks
> > > for that
> > > [2][3]
> > >
> > > We're really keen to focus our effort to improve the ML support on
> Flink
> > in
> > > 

Flink 1.2 / YARN on Secure MapR Cluster

2017-03-10 Thread dschexna
I am attempting to run flink / YARN on a secure MapR 5.2 cluster.  The
cluster is secured using "MapR Native Security", not kerberos.

I did include the MapR zookeeper when building:

opt/apache-maven-3.3.9/bin/mvn clean install -DskipTests -Pvendor-repos
-Dhadoop.version=2.7.0-mapr-1607 -Dzookeeper.version=3.4.5-mapr-1604

When I try to launch yarn-session.sh, I receive the following error:

[mapr@maprdemo flink]$ ./bin/yarn-session.sh -n 1
2017-03-08 14:47:24,223 INFO 
org.apache.flink.configuration.GlobalConfiguration- Loading
configuration property: jobmanager.rpc.address, localhost
2017-03-08 14:47:24,224 INFO 
org.apache.flink.configuration.GlobalConfiguration- Loading
configuration property: jobmanager.rpc.port, 6123
2017-03-08 14:47:24,224 INFO 
org.apache.flink.configuration.GlobalConfiguration- Loading
configuration property: jobmanager.heap.mb, 256
2017-03-08 14:47:24,224 INFO 
org.apache.flink.configuration.GlobalConfiguration- Loading
configuration property: taskmanager.heap.mb, 512
2017-03-08 14:47:24,225 INFO 
org.apache.flink.configuration.GlobalConfiguration- Loading
configuration property: taskmanager.numberOfTaskSlots, 1
2017-03-08 14:47:24,225 INFO 
org.apache.flink.configuration.GlobalConfiguration- Loading
configuration property: taskmanager.memory.preallocate, false
2017-03-08 14:47:24,225 INFO 
org.apache.flink.configuration.GlobalConfiguration- Loading
configuration property: parallelism.default, 1
2017-03-08 14:47:24,225 INFO 
org.apache.flink.configuration.GlobalConfiguration- Loading
configuration property: jobmanager.web.port, 8081
2017-03-08 14:47:24,292 INFO 
org.apache.flink.configuration.GlobalConfiguration- Loading
configuration property: jobmanager.rpc.address, localhost
2017-03-08 14:47:24,292 INFO 
org.apache.flink.configuration.GlobalConfiguration- Loading
configuration property: jobmanager.rpc.port, 6123
2017-03-08 14:47:24,293 INFO 
org.apache.flink.configuration.GlobalConfiguration- Loading
configuration property: jobmanager.heap.mb, 256
2017-03-08 14:47:24,293 INFO 
org.apache.flink.configuration.GlobalConfiguration- Loading
configuration property: taskmanager.heap.mb, 512
2017-03-08 14:47:24,293 INFO 
org.apache.flink.configuration.GlobalConfiguration- Loading
configuration property: taskmanager.numberOfTaskSlots, 1
2017-03-08 14:47:24,293 INFO 
org.apache.flink.configuration.GlobalConfiguration- Loading
configuration property: taskmanager.memory.preallocate, false
2017-03-08 14:47:24,293 INFO 
org.apache.flink.configuration.GlobalConfiguration- Loading
configuration property: parallelism.default, 1
2017-03-08 14:47:24,294 INFO 
org.apache.flink.configuration.GlobalConfiguration- Loading
configuration property: jobmanager.web.port, 8081
2017-03-08 14:47:26,107 WARN  org.apache.hadoop.util.NativeCodeLoader
- Unable to load native-hadoop library for your platform... using
builtin-java classes where applicable
2017-03-08 14:47:26,390 WARN 
org.apache.flink.runtime.security.modules.HadoopModule- Hadoop
security is enabled but current login user does not have Kerberos
credentials
2017-03-08 14:47:26,390 INFO 
org.apache.flink.runtime.security.modules.HadoopModule- Hadoop user
set to mapr (auth:CUSTOM)
2017-03-08 14:47:26,431 INFO 
org.apache.flink.configuration.GlobalConfiguration- Loading
configuration property: jobmanager.rpc.address, localhost
2017-03-08 14:47:26,431 INFO 
org.apache.flink.configuration.GlobalConfiguration- Loading
configuration property: jobmanager.rpc.port, 6123
2017-03-08 14:47:26,431 INFO 
org.apache.flink.configuration.GlobalConfiguration- Loading
configuration property: jobmanager.heap.mb, 256
2017-03-08 14:47:26,431 INFO 
org.apache.flink.configuration.GlobalConfiguration- Loading
configuration property: taskmanager.heap.mb, 512
2017-03-08 14:47:26,432 INFO 
org.apache.flink.configuration.GlobalConfiguration- Loading
configuration property: taskmanager.numberOfTaskSlots, 1
2017-03-08 14:47:26,432 INFO 
org.apache.flink.configuration.GlobalConfiguration- Loading
configuration property: taskmanager.memory.preallocate, false
2017-03-08 14:47:26,433 INFO 
org.apache.flink.configuration.GlobalConfiguration- Loading
configuration property: parallelism.default, 1
2017-03-08 14:47:26,433 INFO 
org.apache.flink.configuration.GlobalConfiguration- Loading
configuration property: jobmanager.web.port, 8081
2017-03-08 14:47:26,440 ERROR org.apache.flink.yarn.YarnClusterDescriptor
- Hadoop security is enabled but the login user does not have Kerberos
credentials
Error while deploying YARN cluster: Couldn't deploy Yarn cluster
java.lang.RuntimeException: Couldn't deploy Yarn cluster
at

[jira] [Created] (FLINK-6023) Fix Scala snippet in Process Function (Low-level Operations) Doc

2017-03-10 Thread Mauro Cortellazzi (JIRA)
Mauro Cortellazzi created FLINK-6023:


 Summary: Fix Scala snippet in Process Function (Low-level
Operations) Doc
 Key: FLINK-6023
 URL: https://issues.apache.org/jira/browse/FLINK-6023
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Reporter: Mauro Cortellazzi
Priority: Trivial
 Fix For: 1.3.0, 1.2.1


The current `/docs/dev/stream/process_function.md` has some errors in the Scala 
snippet





Re: [DISCUSS] Project build time and possible restructuring

2017-03-10 Thread Till Rohrmann
Thanks for all your input. In order to wrap the discussion up I'd like to
summarize the mentioned points:

The problem of increasing build times and complexity of the project has
been acknowledged. Ideally we would have everything in one repository using
an incremental build tool. Since Maven does not properly support this we
would have to switch our build tool to something like Gradle, for example.

Another option is introducing build profiles for different sets of modules
as well as separating integration and unit tests. The third alternative
would be creating sub-projects with their own repositories. I actually
think that these two proposals are not necessarily exclusive and it would
also make sense to have a separation between unit and integration tests if
we split the repository.

The overall consensus seems to be that we don't want to split the community
and want to keep everything under the same umbrella. I think this is the
right way to go, because otherwise some parts of the project could become
second class citizens. Given that and that we continue using Maven, I still
think that creating sub-projects for the libraries, for example, could be
beneficial. A split could reduce the project's complexity and make it
potentially easier for libraries to get actively developed. The main
concern is setting up the build infrastructure to aggregate docs from
multiple repositories and making them publicly available.

Since I started this thread and I would really like to see Flink's ML
library being revived again, I'd volunteer investigating first whether it
is doable establishing a proper incremental build for Flink. If that should
not be possible, I will look into splitting the repository, first only for
the libraries. I'll share my results with the community once I'm done with
the investigation.

Cheers,
Till

On Fri, Feb 24, 2017 at 3:50 PM, Robert Metzger  wrote:

> @Jin Mingjian: You cannot use the paid Travis version for open source
> projects. It only works for private repositories (at least back then when
> we asked them about that).
>
> @Stephan: I don't think that incremental builds will be available with
> Maven anytime soon.
>
> I agree that we need to fix the build time issue on Travis. I've recently
> pushed a commit to use now three instead of two test groups.
> But I don't think that this is feasible long-term solution.
>
> If this discussion is only about reducing the build and test time,
> introducing build profiles for different components as Aljoscha suggested
> would solve the problem Till mentioned.
> Also, if we decide that travis is not a good tool anymore for the testing,
> I guess we can find a different solution. There are now competitors to
> Travis that might be willing to offer a paid plan for an open source
> project, or we set up our own infra on a server sponsored by one of the
> contributing companies.
> If we want to solve "community issues" with the change as well, then I
> think it's worth the effort of splitting up Flink into different
> repositories.
>
> Splitting up repositories is not a trivial task in my opinion. As others
> have mentioned before, we need to consider the following things:
> - How are we going to build the documentation? Ideally every repo should
> contain its docs, so we would need to pull them together when building the
> main docs.
> - How do we organize the dependencies? If we have library repositories depend on
> snapshot Flink versions, we need to make sure that the snapshot deployment
> always works. This also means that people working on a library repository
> will pull from snapshot OR need to build first locally.
> - We need to update the release scripts
>
> If we commit to do these changes, we need to assign at least one committer
> (yes, in this case we need somebody who can commit, for example for
> updating the buildbot stuff) who volunteers to do the change.
> I've done a lot of infrastructure work in the past, but I'm currently
> pretty booked with many other things, so I don't realistically see myself
> doing that. Max who used to work on these things is taking some time off.
> I think we need, best case 3 days for the change, worst case 5 days. The
> problem is that there are no "unit tests" for the infra stuff, so many
> things are "trial and error" (like Apache's buildbot, our release scripts,
> the doc scripts, maven stuff, nightly builds).
>
>
>
> On Thu, Feb 23, 2017 at 1:33 PM, Stephan Ewen  wrote:
>
> > If we can get incremental builds to work, that would actually be the
> > preferred solution in my opinion.
> >
> > Many companies have invested heavily in making a "single repository" code
> > base work, because it has the advantage of not having to update/publish
> > several repositories first.
> > However, the strong prerequisite for that is an incremental build system
> > that builds only (fine grained) what it has to build. I am not sure how
> we
> > could make that work
> > with Maven 

[jira] [Created] (FLINK-6022) Improve support for Avro GenericRecord

2017-03-10 Thread Robert Metzger (JIRA)
Robert Metzger created FLINK-6022:
-

 Summary: Improve support for Avro GenericRecord
 Key: FLINK-6022
 URL: https://issues.apache.org/jira/browse/FLINK-6022
 Project: Flink
  Issue Type: Improvement
  Components: Type Serialization System
Reporter: Robert Metzger


Currently, Flink is serializing the schema for each Avro GenericRecord in the 
stream.
This leads to a lot of overhead over the wire/disk + high serialization costs.

Therefore, I'm proposing to improve the support for GenericRecord in Flink by
shipping the schema to each serializer through the AvroTypeInformation.
Then we would only support GenericRecords with the same type per stream, but the
performance will be much better.
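As an illustration of the direction (a sketch only, not the proposed Flink
change; note Avro's Schema is not Serializable, hence the transient lazy
fields): with one schema per stream, the schema string can be shipped once and
each record serialized schema-free:

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

// One schema per stream: ship the schema string once and serialize records
// with the binary encoding only, instead of writing the schema per record.
class GenericRecordSerde(schemaString: String) extends Serializable {
  @transient private lazy val schema = new Schema.Parser().parse(schemaString)
  @transient private lazy val writer = new GenericDatumWriter[GenericRecord](schema)
  @transient private lazy val reader = new GenericDatumReader[GenericRecord](schema)

  def serialize(record: GenericRecord): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    writer.write(record, encoder)
    encoder.flush()
    out.toByteArray
  }

  def deserialize(bytes: Array[Byte]): GenericRecord = {
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    reader.read(null, decoder)
  }
}
```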





Flink CSV parsing

2017-03-10 Thread Flavio Pompermaier
Hi to all,
I want to discuss with the dev group something about CSV parsing.
Since I started using Flink with CSVs I always faced some little problem
here and there and the new tickets about the CSV parsing seems to confirm
that this part is still problematic.
In my production jobs I gave up using Flink CSV parsing in favour of Apache
commons-csv and it works great. It's perfectly configurable and robust.
A working example is available at [1].

Thus, why not use that library directly and contribute back (if needed)
to another apache library if improvements are required to speed up the
parsing? Have you ever tried to compare the performances of the 2 parsers?

Best,
Flavio

[1]
https://github.com/okkam-it/flink-examples/blob/master/src/main/java/it/okkam/datalinks/batch/flink/datasourcemanager/importers/Csv2RowExample.java


[jira] [Created] (FLINK-6021) Downloads page references "Hadoop 1 version" which isn't an option

2017-03-10 Thread Patrick Lucas (JIRA)
Patrick Lucas created FLINK-6021:


 Summary: Downloads page references "Hadoop 1 version" which isn't 
an option
 Key: FLINK-6021
 URL: https://issues.apache.org/jira/browse/FLINK-6021
 Project: Flink
  Issue Type: Bug
  Components: Project Website
Reporter: Patrick Lucas


The downloads page says

{quote}
Apache Flink® 1.2.0 is our latest stable release.

You don’t have to install Hadoop to use Flink, but if you plan to use Flink 
with data stored in Hadoop, pick the version matching your installed Hadoop 
version. If you don’t want to do this, pick the Hadoop 1 version.
{quote}

But Hadoop 1 appears to no longer be an available alternative.





Re: Machine Learning on Flink - Next steps

2017-03-10 Thread Gábor Hermann

Hey all,

Sorry for the bit late response.

I'd like to work on
- Offline learning with Streaming API
- Low-latency prediction serving

I would drop the batch API ML because of past experience with lack of 
support, and online learning because of the lack of use-cases.


I completely agree with Kate that offline learning should be supported, 
but given Flink's resources I prefer using the streaming API as Roberto 
suggested. Also, full model lifecycle (or end-to-end ML) could be more 
easily supported in one system (one API). Connecting Flink Batch with 
Flink Streaming is currently cumbersome (although side inputs [1] might 
help). In my opinion, a crucial part of end-to-end ML is low-latency 
predictions.


As another direction, we could integrate Flink Streaming API with other 
projects (such as Prediction IO). However, I believe it's better to 
first evaluate the capabilities and drawbacks of the streaming API with 
some prototype of using Flink Streaming for some ML task. Otherwise we 
could run into critical issues, just as the SystemML integration did with
e.g. caching. These issues make the integration of the Batch API with other
ML projects practically infeasible.


I've already been experimenting with offline learning with the Streaming 
API. Hopefully, I can share some initial performance results next week 
on matrix factorization. Naturally, I've run into issues. E.g. I could 
only mark the end of input with some hacks, because this is not needed 
in a streaming job consuming input forever. AFAIK, this would be
resolved by side inputs [1].


@Theodore:
+1 for doing the prototype project(s) separately from the main Flink
repository. Although, I would strongly suggest to follow Flink 
development guidelines as closely as possible. As another note, there is 
already a GitHub organization for Flink related projects [2], but it 
seems like it has not been used much.


[1] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API

[2] https://github.com/project-flink

On 2017-03-04 08:44, Roberto Bentivoglio wrote:


Hi All,

I'd like to start working on:
  - Offline learning with Streaming API
  - Online learning

I also think that using a new organisation on GitHub, as Theodore proposed,
to keep some initial independence and speed up the prototyping and development
phases is really interesting.

I totally agree with Katherin, we need offline learning, but my opinion is
that it will be more straightforward to fix the streaming issues than batch
issues because we will have more support on that by the Flink community.

Thanks and have a nice weekend,
Roberto

On 3 March 2017 at 20:20, amir bahmanyari 
wrote:


Great points to start:
- Online learning
- Offline learning with the streaming API

Thanks + have a great weekend.

   From: Katherin Eri 
  To: dev@flink.apache.org
  Sent: Friday, March 3, 2017 7:41 AM
  Subject: Re: Machine Learning on Flink - Next steps

Thank you, Theodore.

Shortly speaking I vote for:
1) Online learning
2) Low-latency prediction serving -> Offline learning with the batch API

In details:
1) If streaming is the strong side of Flink, let's use it, and try to support
some online learning or lightweight in-memory learning algorithms. Try to
build a pipeline for them.

2) I think that Flink should be part of the production ecosystem, and if
production now requires ML support, multiple-model deployment and so on, we
should serve this. But in my opinion we shouldn’t compete with such
projects as PredictionIO, but serve them, and be an execution core. But
that means a lot:

a. Offline training should be supported, because typically most ML algs
are for offline training.
b. Model lifecycle should be supported:
ETL+transformation+training+scoring+exploitation quality monitoring

I understand that the batch world is full of competitors, but for me that
doesn’t mean that batch should be ignored. I think that separate
streaming/batch applications cause additional deployment and
operational overhead which is typically avoided. That means that
we should attract the community to this problem, in my opinion.


пт, 3 мар. 2017 г. в 15:34, Theodore Vasiloudis <
theodoros.vasilou...@gmail.com>:

Hello all,

 From our previous discussion started by Stavros, we decided to start a
planning document [1]
to figure out possible next steps for ML on Flink.

Our concerns were mainly ensuring active development while satisfying the
needs of
the community.

We have listed a number of proposals for future work in the document. In
short they are:

   - Offline learning with the batch API
   - Online learning
   - Offline learning with the streaming API
   - Low-latency prediction serving

I saw there is a number of people willing to work on ML for Flink, but the
truth is that we cannot
cover all of these suggestions without fragmenting the development too
much.

So my recommendation is to pick out 2 of these options, 

[jira] [Created] (FLINK-6020) Blob Server cannot handle multiple job submits (with same content) in parallel

2017-03-10 Thread Tao Wang (JIRA)
Tao Wang created FLINK-6020:
---

 Summary: Blob Server cannot handle multiple job submits (with same
content) in parallel
 Key: FLINK-6020
 URL: https://issues.apache.org/jira/browse/FLINK-6020
 Project: Flink
  Issue Type: Bug
Reporter: Tao Wang
Priority: Critical


In yarn-cluster mode, if we submit the same job multiple times in parallel, the
tasks will encounter class loading problems and lease occupation.

Because the blob server stores user jars under names generated from the sha1sum
of their contents, it first writes a temp file and then moves it to finalize. For
recovery it will also put them to HDFS under the same file name.

At the same time, when multiple clients submit the same job with the same jar, the
local jar files in the blob server and those files on HDFS will be handled by
multiple threads (BlobServerConnection) and impact each other.

It's better to have a way to handle this; two ideas come to mind:
1. lock the write operation, or
2. use some unique identifier as the file name instead of (or in addition to) the
sha1sum of the file contents.
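A sketch of what ideas 1 and 2 could look like for the local file side (a
hypothetical helper; exact move semantics vary by filesystem, and the HDFS copy
would need the same treatment):

```scala
import java.nio.file.{Files, Path, StandardCopyOption}
import java.util.UUID

// Stage under a per-submission unique name, then move onto the sha1-named
// target. Since an equal sha1 implies equal content, replacing is harmless.
def storeBlob(dir: Path, sha1: String, data: Array[Byte]): Path = {
  val target = dir.resolve(sha1)
  val tmp = dir.resolve(s"$sha1.${UUID.randomUUID()}.tmp") // no temp-name clash
  Files.write(tmp, data)
  Files.move(tmp, target,
    StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING)
  target
}
```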





[jira] [Created] (FLINK-6019) Some log4j messages do not have a loglevel field set, so they can't be suppressed

2017-03-10 Thread Luke Hutchison (JIRA)
Luke Hutchison created FLINK-6019:
-

 Summary: Some log4j messages do not have a loglevel field set, so 
they can't be suppressed
 Key: FLINK-6019
 URL: https://issues.apache.org/jira/browse/FLINK-6019
 Project: Flink
  Issue Type: Bug
  Components: Core
Affects Versions: 1.2.0
 Environment: Linux
Reporter: Luke Hutchison


Some of the log messages do not appear to have a loglevel value set, so they 
can't be suppressed by setting the log4j level to WARN. There's this line at 
the beginning which doesn't even have a timestamp:

Connected to JobManager at Actor[akka://flink/user/jobmanager_1#1844933939]

And then there are numerous lines like this, missing an "INFO" field:

03/10/2017 00:01:14 Job execution switched to status RUNNING.
03/10/2017 00:01:14 DataSource (at readTable(DBTableReader.java:165) 
(org.apache.flink.api.java.io.PojoCsvInputFormat))(1/8) switched to SCHEDULED 
03/10/2017 00:01:14 DataSink (count())(1/8) switched to SCHEDULED 
03/10/2017 00:01:14 DataSink (count())(3/8) switched to DEPLOYING 
03/10/2017 00:01:15 DataSink (count())(3/8) switched to RUNNING 
03/10/2017 00:01:17 DataSink (count())(6/8) switched to FINISHED 
03/10/2017 00:01:17 DataSource (at readTable(DBTableReader.java:165) 
(org.apache.flink.api.java.io.PojoCsvInputFormat))(6/8) switched to FINISHED 
03/10/2017 00:01:17 Job execution switched to status FINISHED.
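These unprefixed lines are printed by the client's job-status reporter directly
to stdout, bypassing log4j, which is why log4j levels don't affect them.
Assuming Flink 1.2's ExecutionConfig, the following sketch shows the existing
switch that silences them (a workaround, not a fix for the missing log levels):

```scala
import org.apache.flink.api.scala.ExecutionEnvironment

val env = ExecutionEnvironment.getExecutionEnvironment
// Turns off the client's stdout status lines such as
// "03/10/2017 00:01:14 Job execution switched to status RUNNING."
env.getConfig.disableSysoutLogging()
```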



