[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284484#comment-15284484
 ] 

Stavros Kontopoulos commented on FLINK-2147:


OK, I will have a look as well to get familiar with it.

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.
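
As an illustration of the data structure the issue describes, here is a minimal Count-Min sketch in Python. It is a sketch under stated assumptions, not Flink code and not the paper's reference implementation: the class name `CountMinSketch`, the `epsilon`/`delta` parameters, and the use of salted BLAKE2 digests in place of a pairwise-independent hash family are all choices made for this example. The `merge` method shows why the structure lends itself to a distributed implementation: two sketches built with the same dimensions and hash functions can be combined by adding their counters.

```python
import hashlib
import math


class CountMinSketch:
    """Count-Min sketch: `depth` rows of `width` counters, one hash per row."""

    def __init__(self, epsilon=0.01, delta=0.01):
        # Common sizing rule: width = ceil(e / epsilon), depth = ceil(ln(1 / delta)).
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _index(self, row, item):
        # Assumption for this example: derive one hash function per row by
        # salting a stable digest with the row number.
        digest = hashlib.blake2b(
            repr(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum over the rows
        # is the tightest estimate and never falls below the true count.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))

    def merge(self, other):
        # Element-wise addition of two sketches with identical layout.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must have identical dimensions")
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]
        return self


# Two partial sketches, e.g. built on different workers, then combined.
left, right = CountMinSketch(), CountMinSketch()
for word in ["a", "b", "a"]:
    left.add(word)
for word in ["a", "c"]:
    right.add(word)
merged = left.merge(right)
print(merged.estimate("a"))  # at least 3, the true count of "a"
```

Because every update is a plain addition, the merged table is identical to the one a single sketch would build from the concatenated input, whatever the arrival order.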



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284476#comment-15284476
 ] 

Stavros Kontopoulos commented on FLINK-2147:


OK, I agree. Then we calculate statistics per window in an isolated manner 
(sum, mean, etc.) without the aggregation in a buffer. OK, so let's see how we 
avoid that, correct?
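
To make the per-window idea concrete: statistics such as sum and mean need only a constant-size accumulator, so a window never has to hold its elements in a buffer. The following is a plain-Python sketch, not Flink API; `RunningStats` and `aggregate_windows` are hypothetical names introduced for this example.

```python
class RunningStats:
    """Sum, count and mean of a window, updated one element at a time."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def add(self, value):
        # Only the accumulator changes; the element itself is not stored.
        self.count += 1
        self.total += value

    @property
    def mean(self):
        return self.total / self.count if self.count else float("nan")


def aggregate_windows(windows):
    """Return (sum, mean) per window; each window is an iterable read once."""
    results = []
    for window in windows:
        stats = RunningStats()
        for value in window:
            stats.add(value)
        results.append((stats.total, stats.mean))
    return results


# Iterators stand in for windows whose elements are seen once and discarded.
print(aggregate_windows([iter([1, 2, 3]), iter([10, 20])]))
# [(6.0, 2.0), (30.0, 15.0)]
```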



[jira] [Commented] (FLINK-1984) Integrate Flink with Apache Mesos

2016-05-15 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15283960#comment-15283960
 ] 

Stavros Kontopoulos commented on FLINK-1984:


Hey, what's the status of this?

> Integrate Flink with Apache Mesos
> -
>
> Key: FLINK-1984
> URL: https://issues.apache.org/jira/browse/FLINK-1984
> Project: Flink
>  Issue Type: New Feature
>  Components: New Components
>Reporter: Robert Metzger
>Assignee: Eron Wright 
>Priority: Minor
> Attachments: 251.patch
>
>
> There are some users asking for an integration of Flink into Mesos.
> -There also is a pending pull request for adding Mesos support for Flink-: 
> https://github.com/apache/flink/pull/251
> Update (May '16):  a new effort is now underway, building on the recent 
> ResourceManager work.





[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-15 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15283961#comment-15283961
 ] 

Stavros Kontopoulos commented on FLINK-2147:


Anyone working on this?

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Gabor Gevay
>Priority: Minor
>  Labels: statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.





[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284431#comment-15284431
 ] 

Stavros Kontopoulos commented on FLINK-2147:


OK... BTW, I was looking at your old PR for median etc. I am wondering what 
the status of memory management for window buffering is in master.



[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284440#comment-15284440
 ] 

Stavros Kontopoulos commented on FLINK-2147:


For sure, if you apply it per window then you need to avoid keeping any data 
after you update your algorithm/structure. If you have window results, I guess 
you can update a statistic about the whole stream when this is valid, depending 
on the statistic.
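
One way to read "when this is valid, depending on the statistic" is that some summaries are mergeable and others are not. Count, mean and variance are: per-window results can be folded into a whole-stream value without revisiting the data. An exact median, by contrast, cannot in general be recovered from per-window medians. The sketch below assumes that reading; `Moments` is a hypothetical name, the per-element step is Welford's online update, and `merge` uses the pairwise combination formula of Chan et al.

```python
from dataclasses import dataclass


@dataclass
class Moments:
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    def add(self, x):
        # Welford's online update: no element is retained.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        # Combine two summaries as if their inputs had been concatenated.
        if other.count == 0:
            return self
        total = self.count + other.count
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.count * other.count / total
        self.mean += delta * other.count / total
        self.count = total
        return self

    @property
    def variance(self):
        # Population variance of everything summarised so far.
        return self.m2 / self.count if self.count else float("nan")


# Each window produces its own summary; the stream-wide one absorbs them.
whole_stream = Moments()
for window in ([1, 2, 3, 4], [5, 6, 7, 8]):
    window_result = Moments()
    for value in window:
        window_result.add(value)
    whole_stream.merge(window_result)

print(whole_stream.count, whole_stream.mean, whole_stream.variance)
```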



[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284410#comment-15284410
 ] 

Stavros Kontopoulos commented on FLINK-2147:


Yes, I agree the API is the hard part, but we could work on this if you want, 
or at least check whether it is a mature task now, considering stream API 
stability etc.

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Gabor Gevay
>  Labels: statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.





[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-16 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284446#comment-15284446
 ] 

Stavros Kontopoulos commented on FLINK-2147:


OK, so if there are multiple windows evaluated at different times in parallel, 
since data comes out of order, what kind of statistic is computable in this 
model? What are the correct semantics here? Emit a statistic update only when 
ordering is reconstructed (the appropriate windows are calculated) and delay 
future results? What about Count-Min?



[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-18 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15288771#comment-15288771
 ] 

Stavros Kontopoulos commented on FLINK-2147:


From a first look, something like StreamGroupedFold 
https://github.com/eBay/Flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/operators/StreamGroupedFold.java
would be enough, right? Define our own operator to keep the value updated.
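
The idea of an operator that "keeps the value updated" can be sketched as a keyed fold: one accumulator per key, updated and emitted for every incoming element. The code below is plain Python and only mirrors the idea in the comment; it is not Flink's operator API, and `keyed_fold`, `key_of`, `initial` and `fold` are hypothetical names.

```python
def keyed_fold(stream, key_of, initial, fold):
    """Yield (key, updated accumulator) for every element of the stream."""
    state = {}  # one accumulator per key
    for element in stream:
        key = key_of(element)
        # `fold` should return a new value rather than mutate `initial`,
        # because `initial` is shared by all keys.
        state[key] = fold(state.get(key, initial), element)
        yield key, state[key]


events = [("a", 1), ("b", 5), ("a", 2)]
updates = list(keyed_fold(events,
                          key_of=lambda e: e[0],
                          initial=0,
                          fold=lambda acc, e: acc + e[1]))
print(updates)  # [('a', 1), ('b', 5), ('a', 3)]
```

For the frequency use case, the accumulator could be a sketch structure instead of an integer, with `fold` returning the updated sketch.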



[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2016-05-18 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15289498#comment-15289498
 ] 

Stavros Kontopoulos commented on FLINK-2147:


How do you want to move on? Collaborate on a branch on a fork?



[jira] [Commented] (FLINK-1984) Integrate Flink with Apache Mesos

2016-07-20 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15385994#comment-15385994
 ] 

Stavros Kontopoulos commented on FLINK-1984:


+1, make it major please.

> Integrate Flink with Apache Mesos
> -
>
> Key: FLINK-1984
> URL: https://issues.apache.org/jira/browse/FLINK-1984
> Project: Flink
>  Issue Type: New Feature
>  Components: Cluster Management
>Reporter: Robert Metzger
>Assignee: Eron Wright 
>Priority: Minor
> Attachments: 251.patch
>
>
> There are some users asking for an integration of Flink into Mesos.
> -There also is a pending pull request for adding Mesos support for Flink-: 
> https://github.com/apache/flink/pull/251
> Update (May '16):  a new effort is now underway, building on the recent 
> ResourceManager work.
> Design document:  ([google 
> doc|https://docs.google.com/document/d/1WItafBmGbjlaBbP8Of5PAFOH9GUJQxf5S4hjEuPchuU/edit?usp=sharing])





[jira] (FLINK-2539) More unified code style for Scala code

2017-01-31 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos edited a comment on FLINK-2539

Re: More unified code style for Scala code

[~chiwanpark], [~aljoscha] Is there a document where we can add guidelines for review and iteration? The URL in the description is dead.

--
This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)



[jira] (FLINK-2539) More unified code style for Scala code

2017-01-31 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos commented on FLINK-2539

Re: More unified code style for Scala code

Chiwan Park, is there a document where we can add guidelines for review and iteration? The URL in the description is dead.



[jira] (FLINK-2539) More unified code style for Scala code

2017-01-31 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos edited a comment on FLINK-2539

Re: More unified code style for Scala code

[~chiwanpark], [~aljoscha] Is there a document where we can add guidelines for review and iteration? The URL in the description is dead. Current guidelines: https://cwiki.apache.org/confluence/display/FLINK/Coding+Guidelines+for+Scala



[jira] (FLINK-2539) More unified code style for Scala code

2017-01-31 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos edited a comment on FLINK-2539

Re: More unified code style for Scala code

[~chiwanpark], [~aljoscha] Is there a document (or, much better, a repo) where we can add guidelines for review and iteration? Is it possible to create one so it can be official? BTW, the URL in the description is dead. Current guidelines: https://cwiki.apache.org/confluence/display/FLINK/Coding+Guidelines+for+Scala



[jira] [Comment Edited] (FLINK-5525) Streaming Version of a Linear Regression model

2017-01-21 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833070#comment-15833070
 ] 

Stavros Kontopoulos edited comment on FLINK-5525 at 1/21/17 5:38 PM:
-

[~mtunqiue] Sure, I agree there are other algorithms, e.g. clustering, which 
may have a streaming version. Feel free to open other issues and work on them. 
If you want to co-ordinate on this, let me know.
For example, we need to set the abstractions. Check the Spark implementation 
for an example of what the abstractions might be. I didn't open other issues 
because I wanted to see what people think first. 


was (Author: skonto):
[~mtunqiue] Sure I agree there other algorithms eg. clustering which may have a 
streaming version feel free to open others and work on them. If you want to 
co-ordinate on this let me know.
For example we need to set the abstractions first like, check Spark 
implementation for an example. I didn't do that because I wanted to see what 
people think first. 

> Streaming Version of a Linear Regression model
> --
>
> Key: FLINK-5525
> URL: https://issues.apache.org/jira/browse/FLINK-5525
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>
> Given the nature of Flink we should have a streaming version of the 
> algorithms when possible.
> Update of the model should be done on a per window basis.
> An extreme case is: https://en.wikipedia.org/wiki/Online_machine_learning
> Resources
> [1] 
> http://scikit-learn.org/dev/modules/scaling_strategies.html#incremental-learning
> [2] 
> http://stats.stackexchange.com/questions/6920/efficient-online-linear-regression
> [3] https://spark.apache.org/docs/1.1.0/mllib-linear-methods.html
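
As a rough sketch of the per-window model update the description asks for, the following uses mini-batch gradient descent: each window triggers one gradient step on the squared error, after which the window's data can be discarded. This is not FlinkML or Spark API; the class name `StreamingLinearRegression`, the `update` method and the learning rate are assumptions made for the example.

```python
class StreamingLinearRegression:
    """Linear model updated with one gradient step per window."""

    def __init__(self, n_features, learning_rate=0.5):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, window):
        """Apply one step on half the mean squared error of (x, y) pairs."""
        n = len(window)
        if n == 0:
            return
        grad_w = [0.0] * len(self.weights)
        grad_b = 0.0
        for x, y in window:
            error = self.predict(x) - y
            grad_b += error
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
        # Average the gradient over the window, then move against it.
        self.bias -= self.learning_rate * grad_b / n
        self.weights = [w - self.learning_rate * g / n
                        for w, g in zip(self.weights, grad_w)]


# Synthetic stream: every window samples y = 2x + 1 on the same five points.
model = StreamingLinearRegression(n_features=1)
window = [([x], 2.0 * x + 1.0) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]
for _ in range(500):
    model.update(window)
print(model.weights[0], model.bias)  # approaches 2 and 1
```

The "extreme case" mentioned in the description, online learning, corresponds to windows that contain a single element.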





[jira] [Commented] (FLINK-5525) Streaming Version of a Linear Regression model

2017-01-21 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833070#comment-15833070
 ] 

Stavros Kontopoulos commented on FLINK-5525:


[~mtunqiue] Sure, I agree there are other algorithms, e.g. clustering, which 
may have a streaming version. Feel free to open others. I didn't do that 
because I wanted to see what people think first. 



[jira] [Comment Edited] (FLINK-5525) Streaming Version of a Linear Regression model

2017-01-21 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833070#comment-15833070
 ] 

Stavros Kontopoulos edited comment on FLINK-5525 at 1/21/17 5:38 PM:
-

[~mtunqiue] Sure, I agree there are other algorithms, e.g. clustering, which 
may have a streaming version. Feel free to open other issues and work on them. 
If you want to co-ordinate on this, let me know.
For example, we need to set the abstractions. Check the Spark implementation 
for an example of what the abstractions might be. I didn't open other issues 
because I wanted to see what people think first. 


was (Author: skonto):
[~mtunqiue] Sure I agree there other algorithms eg. clustering which may have a 
streaming version feel free to open others and work on them. If you want to 
co-ordinate on this let me know.
For example we need to set the abstractions. Check Spark implementation for an 
example of what the abstractions might be. I didn't open other issues because I 
wanted to see what people think first. 



[jira] [Comment Edited] (FLINK-5525) Streaming Version of a Linear Regression model

2017-01-21 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15833070#comment-15833070
 ] 

Stavros Kontopoulos edited comment on FLINK-5525 at 1/21/17 5:37 PM:
-

[~mtunqiue] Sure, I agree there are other algorithms, e.g. clustering, which 
may have a streaming version. Feel free to open others and work on them. If you 
want to co-ordinate on this, let me know.
For example, we need to set the abstractions first; check the Spark 
implementation for an example. I didn't do that because I wanted to see what 
people think first. 


was (Author: skonto):
[~mtunqiue] Sure I agree there other algorithms eg. clustering which may have a 
streaming version feel free to open others. I didn't do that because I wanted 
to see what people think first. 



[jira] [Comment Edited] (FLINK-5525) Streaming Version of a Linear Regression model

2017-01-20 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15831686#comment-15831686
 ] 

Stavros Kontopoulos edited comment on FLINK-5525 at 1/20/17 12:50 PM:
--

[~tvas] I was thinking of working on this. What do you think?
This depends on https://issues.apache.org/jira/browse/FLINK-2013


was (Author: skonto):
[~tvas]I was thinking of working on this what do you think?
Depends on this https://issues.apache.org/jira/browse/FLINK-2013

> Streaming Version of a Linear Regression model
> --
>
> Key: FLINK-5525
> URL: https://issues.apache.org/jira/browse/FLINK-5525
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>
> Given the nature of Flink we should have a streaming version of the 
> algorithms when possible.
> Update of the model should be done on a per window basis.
> An extreme case is: https://en.wikipedia.org/wiki/Online_machine_learning
> Resources
> [1] 
> http://scikit-learn.org/dev/modules/scaling_strategies.html#incremental-learning
> [2] 
> http://stats.stackexchange.com/questions/6920/efficient-online-linear-regression





[jira] [Comment Edited] (FLINK-5525) Streaming Version of a Linear Regression model

2017-01-20 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15831686#comment-15831686
 ] 

Stavros Kontopoulos edited comment on FLINK-5525 at 1/20/17 12:49 PM:
--

[~tvas] I was thinking of working on this. What do you think?
This depends on https://issues.apache.org/jira/browse/FLINK-2013


was (Author: skonto):
[~tvas]I was thinking of working on this what do you think?



[jira] [Commented] (FLINK-5588) Add a unit scaler based on different norms

2017-01-20 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15831735#comment-15831735
 ] 

Stavros Kontopoulos commented on FLINK-5588:


[~till.rohrmann] May I work on this?

> Add a unit scaler based on different norms
> --
>
> Key: FLINK-5588
> URL: https://issues.apache.org/jira/browse/FLINK-5588
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>Priority: Minor
>
> So far ML has two scalers: min-max and the standard.
> A third one frequently used, is the scaler to unit.
> We could implement a transformer for this type of scaling for different norms 
> available to the user.
> Resources
> [1] https://en.wikipedia.org/wiki/Feature_scaling
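
For reference, "scaling to unit" means dividing each feature vector by its norm so that the result has norm 1. The sketch below shows this for a configurable L-p norm in plain Python; it is not the FlinkML transformer interface, and `vector_norm` and `scale_to_unit` are hypothetical names chosen for the example.

```python
import math


def vector_norm(vector, p=2.0):
    """Return the L-p norm of a vector; pass math.inf for the max norm."""
    if p == math.inf:
        return max((abs(v) for v in vector), default=0.0)
    if p < 1:
        raise ValueError("p must be >= 1")
    return sum(abs(v) ** p for v in vector) ** (1.0 / p)


def scale_to_unit(vector, p=2.0):
    """Divide a vector by its L-p norm so the result has norm 1."""
    norm = vector_norm(vector, p)
    if norm == 0:
        # A zero vector has no direction to preserve; return it unchanged.
        return list(vector)
    return [v / norm for v in vector]


print(scale_to_unit([3.0, 4.0]))              # L2: components 3/5 and 4/5
print(scale_to_unit([3.0, 4.0], p=1))         # L1: components 3/7 and 4/7
print(scale_to_unit([3.0, 4.0], p=math.inf))  # max norm: components 3/4 and 1
```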





[jira] [Comment Edited] (FLINK-5588) Add a unit scaler based on different norms

2017-01-20 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15831735#comment-15831735
 ] 

Stavros Kontopoulos edited comment on FLINK-5588 at 1/20/17 1:26 PM:
-

[~till.rohrmann] May I work on this?


was (Author: skonto):
[~till.rohrmann]May I work on this?



[jira] [Commented] (FLINK-5525) Streaming Version of a Linear Regression model

2017-01-20 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15831703#comment-15831703
 ] 

Stavros Kontopoulos commented on FLINK-5525:


We need this first I guess.



[jira] [Created] (FLINK-5588) Add a unit scaler based on different norms

2017-01-20 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created FLINK-5588:
--

 Summary: Add a unit scaler based on different norms
 Key: FLINK-5588
 URL: https://issues.apache.org/jira/browse/FLINK-5588
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Stavros Kontopoulos
Priority: Minor


So far ML has two scalers: min-max and the standard.
A third one used is the scaler to unit.
We could implement a transformer for this type of scaling for different norms 
available to the user.

Resources
[1] https://en.wikipedia.org/wiki/Feature_scaling
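
To make "scaling to unit for different norms" concrete, here is a minimal sketch in plain Python. It divides a vector by its p-norm so that the result has norm 1. The function names `p_norm` and `scale_to_unit` are hypothetical and only illustrate the idea; they are not the proposed Flink ML API.

```python
import math
from typing import List, Sequence


def p_norm(v: Sequence[float], p: float) -> float:
    """Return the p-norm of v; p may be math.inf for the max norm."""
    if p == math.inf:
        return max((abs(x) for x in v), default=0.0)
    if p < 1:
        raise ValueError("p must be >= 1")
    return sum(abs(x) ** p for x in v) ** (1.0 / p)


def scale_to_unit(v: Sequence[float], p: float = 2.0) -> List[float]:
    """Divide v by its p-norm; a zero vector is returned unchanged."""
    norm = p_norm(v, p)
    if norm == 0.0:
        return list(v)
    return [x / norm for x in v]
```

For example, `scale_to_unit([3.0, 4.0], 2.0)` divides by the Euclidean length 5, and `scale_to_unit([1.0, -3.0], 1.0)` divides by the sum of absolute values 4.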





[jira] [Updated] (FLINK-5588) Add a unit scaler based on different norms

2017-01-20 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated FLINK-5588:
---
Description: 
So far ML has two scalers: min-max and the standard.
A third one frequently used, is the scaler to unit.
We could implement a transformer for this type of scaling for different norms 
available to the user.

Resources
[1] https://en.wikipedia.org/wiki/Feature_scaling

  was:
So far ML has two scalers: min-max and the standard.
A third one used is the scaler to unit.
We could implement a transformer for this type of scaling for different norms 
available to the user.

Resources
[1] https://en.wikipedia.org/wiki/Feature_scaling


> Add a unit scaler based on different norms
> --
>
> Key: FLINK-5588
> URL: https://issues.apache.org/jira/browse/FLINK-5588
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>Priority: Minor
>
> So far ML has two scalers: min-max and the standard.
> A third one frequently used, is the scaler to unit.
> We could implement a transformer for this type of scaling for different norms 
> available to the user.
> Resources
> [1] https://en.wikipedia.org/wiki/Feature_scaling





[jira] [Comment Edited] (FLINK-5588) Add a unit scaler based on different norms

2017-01-20 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15831735#comment-15831735
 ] 

Stavros Kontopoulos edited comment on FLINK-5588 at 1/20/17 1:58 PM:
-

[~till.rohrmann] [~twalthr] May I work on this? Can I get self-assign rights?


was (Author: skonto):
[~till.rohrmann] May I work on this?

> Add a unit scaler based on different norms
> --
>
> Key: FLINK-5588
> URL: https://issues.apache.org/jira/browse/FLINK-5588
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>Priority: Minor
>
> So far ML has two scalers: min-max and the standard.
> A third one frequently used, is the scaler to unit.
> We could implement a transformer for this type of scaling for different norms 
> available to the user.
> Resources
> [1] https://en.wikipedia.org/wiki/Feature_scaling





[jira] [Updated] (FLINK-5588) Add a unit scaler based on different norms

2017-01-26 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated FLINK-5588:
---
Description: 
So far ML has two scalers: min-max and the standard scaler.
A third one frequently used, is the scaler to unit.
We could implement a transformer for this type of scaling for different norms 
available to the user.
Axis for scaling either features or samples (0 for columns-features 1 for 
samples-rows). 
Right now the existing scalers support per feature normalization. I think it's 
trivial to add per sample normalization.

Resources
[1] https://en.wikipedia.org/wiki/Feature_scaling
[2] 
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
[3] https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html

  was:
So far ML has two scalers: min-max and the standard.
A third one frequently used, is the scaler to unit.
We could implement a transformer for this type of scaling for different norms 
available to the user.
Axis for scaling either features or samples (0 for columns-features 1 for 
samples-rows). 
Right now the existing scalers support per feature normalization. I think it's 
trivial to add per sample normalization.

Resources
[1] https://en.wikipedia.org/wiki/Feature_scaling
[2] 
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
[3] https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html


> Add a unit scaler based on different norms
> --
>
> Key: FLINK-5588
> URL: https://issues.apache.org/jira/browse/FLINK-5588
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Minor
>
> So far ML has two scalers: min-max and the standard scaler.
> A third one frequently used, is the scaler to unit.
> We could implement a transformer for this type of scaling for different norms 
> available to the user.
> Axis for scaling either features or samples (0 for columns-features 1 for 
> samples-rows). 
> Right now the existing scalers support per feature normalization. I think it's 
> trivial to add per sample normalization.
> Resources
> [1] https://en.wikipedia.org/wiki/Feature_scaling
> [2] 
> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
> [3] https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html
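
The axis convention in the description (0 for columns/features, 1 for rows/samples) can be sketched as follows. This is an illustrative plain-Python version that uses the L2 norm only; the function name `l2_normalize` is an assumption for the example and does not come from Flink ML.

```python
import math
from typing import List, Sequence


def l2_normalize(matrix: Sequence[Sequence[float]], axis: int) -> List[List[float]]:
    """Scale to unit L2 norm along an axis.

    axis=0 normalizes each column (feature), axis=1 each row (sample).
    """
    rows = [list(r) for r in matrix]
    if axis == 1:
        vectors = rows
    elif axis == 0:
        # Transpose so that every column becomes a vector to normalize.
        vectors = [list(c) for c in zip(*rows)]
    else:
        raise ValueError("axis must be 0 or 1")
    scaled = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        # Zero vectors are left unchanged to avoid division by zero.
        scaled.append([x / norm if norm > 0.0 else x for x in v])
    if axis == 0:
        # Transpose back to the original row-major layout.
        return [list(r) for r in zip(*scaled)]
    return scaled
```

Per-feature scaling needs a pass over the whole data set to compute each column norm, whereas per-sample scaling only looks at one row at a time.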





[jira] [Updated] (FLINK-5588) Add a unit scaler based on different norms

2017-01-26 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated FLINK-5588:
---
Description: 
So far ML has two scalers: min-max and the standard.
A third one frequently used, is the scaler to unit.
We could implement a transformer for this type of scaling for different norms 
available to the user.
Axis for scaling either features or samples (0 for columns-features 1 for 
samples-rows). 
Right now the existing scalers support per feature normalization. I think it's 
trivial to add per sample normalization.

Resources
[1] https://en.wikipedia.org/wiki/Feature_scaling
[2] 
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
[3] https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html

  was:
So far ML has two scalers: min-max and the standard.
A third one frequently used, is the scaler to unit.
We could implement a transformer for this type of scaling for different norms 
available to the user.
Axis for scaling either features or samples (0 for columns-features 1 for 
samples-rows)

Resources
[1] https://en.wikipedia.org/wiki/Feature_scaling
[2] 
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
[3] https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html


> Add a unit scaler based on different norms
> --
>
> Key: FLINK-5588
> URL: https://issues.apache.org/jira/browse/FLINK-5588
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Minor
>
> So far ML has two scalers: min-max and the standard.
> A third one frequently used, is the scaler to unit.
> We could implement a transformer for this type of scaling for different norms 
> available to the user.
> Axis for scaling either features or samples (0 for columns-features 1 for 
> samples-rows). 
> Right now the existing scalers support per feature normalization. I think it's 
> trivial to add per sample normalization.
> Resources
> [1] https://en.wikipedia.org/wiki/Feature_scaling
> [2] 
> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
> [3] https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html





[jira] [Updated] (FLINK-5588) Add a unit scaler based on different norms

2017-01-26 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated FLINK-5588:
---
Description: 
So far ML has two scalers: min-max and the standard.
A third one frequently used, is the scaler to unit.
We could implement a transformer for this type of scaling for different norms 
available to the user.
Axis for scaling either features or samples ( 0 for columns-features 1 for 
samples-rows)

Resources
[1] https://en.wikipedia.org/wiki/Feature_scaling
[2] 
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
[3] https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html

  was:
So far ML has two scalers: min-max and the standard.
A third one frequently used, is the scaler to unit.
We could implement a transformer for this type of scaling for different norms 
available to the user.

Resources
[1] https://en.wikipedia.org/wiki/Feature_scaling


> Add a unit scaler based on different norms
> --
>
> Key: FLINK-5588
> URL: https://issues.apache.org/jira/browse/FLINK-5588
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Minor
>
> So far ML has two scalers: min-max and the standard.
> A third one frequently used, is the scaler to unit.
> We could implement a transformer for this type of scaling for different norms 
> available to the user.
> Axis for scaling either features or samples ( 0 for columns-features 1 for 
> samples-rows)
> Resources
> [1] https://en.wikipedia.org/wiki/Feature_scaling
> [2] 
> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
> [3] https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html





[jira] [Updated] (FLINK-5588) Add a unit scaler based on different norms

2017-01-26 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated FLINK-5588:
---
Description: 
So far ML has two scalers: min-max and the standard.
A third one frequently used, is the scaler to unit.
We could implement a transformer for this type of scaling for different norms 
available to the user.
Axis for scaling either features or samples (0 for columns-features 1 for 
samples-rows)

Resources
[1] https://en.wikipedia.org/wiki/Feature_scaling
[2] 
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
[3] https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html

  was:
So far ML has two scalers: min-max and the standard.
A third one frequently used, is the scaler to unit.
We could implement a transformer for this type of scaling for different norms 
available to the user.
Axis for scaling either features or samples ( 0 for columns-features 1 for 
samples-rows)

Resources
[1] https://en.wikipedia.org/wiki/Feature_scaling
[2] 
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
[3] https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html


> Add a unit scaler based on different norms
> --
>
> Key: FLINK-5588
> URL: https://issues.apache.org/jira/browse/FLINK-5588
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Minor
>
> So far ML has two scalers: min-max and the standard.
> A third one frequently used, is the scaler to unit.
> We could implement a transformer for this type of scaling for different norms 
> available to the user.
> Axis for scaling either features or samples (0 for columns-features 1 for 
> samples-rows)
> Resources
> [1] https://en.wikipedia.org/wiki/Feature_scaling
> [2] 
> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
> [3] https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html





[jira] [Updated] (FLINK-5588) Add a unit scaler based on different norms

2017-01-27 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated FLINK-5588:
---
Description: 
So far ML has two scalers: min-max and the standard scaler.
A third one frequently used, is the scaler to unit.
We could implement a transformer for this type of scaling for different norms 
available to the user.

Axis for scaling either features or samples (0 for columns-features 1 for 
samples-rows). 
I will make this a separate class for the Normalization procedure by using the 
Transformer API.
Scikit-learn also has some calls available outside the Transformer API; we might 
want to add those in the future.
Right now the existing scalers in Flink ML support per feature normalization by 
using the Transformer API. 

Resources
[1] https://en.wikipedia.org/wiki/Feature_scaling
[2] 
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
[3] https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html
[4] http://scikit-learn.org/stable/modules/preprocessing.html

  was:
So far ML has two scalers: min-max and the standard scaler.
A third one frequently used, is the scaler to unit.
We could implement a transformer for this type of scaling for different norms 
available to the user.
Axis for scaling either features or samples (0 for columns-features 1 for 
samples-rows). 
Right now the existing scalers support per feature normalization. I think it's 
trivial to add per sample normalization.

Resources
[1] https://en.wikipedia.org/wiki/Feature_scaling
[2] 
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
[3] https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html


> Add a unit scaler based on different norms
> --
>
> Key: FLINK-5588
> URL: https://issues.apache.org/jira/browse/FLINK-5588
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Minor
>
> So far ML has two scalers: min-max and the standard scaler.
> A third one frequently used, is the scaler to unit.
> We could implement a transformer for this type of scaling for different norms 
> available to the user.
> Axis for scaling either features or samples (0 for columns-features 1 for 
> samples-rows). 
> I will make this a separate class for the Normalization procedure by using 
> the Transformer API.
> Scikit-learn also has some calls available outside the Transformer API; we 
> might want to add those in the future.
> Right now the existing scalers in Flink ML support per feature normalization 
> by using the Transformer API. 
> Resources
> [1] https://en.wikipedia.org/wiki/Feature_scaling
> [2] 
> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
> [3] https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html
> [4] http://scikit-learn.org/stable/modules/preprocessing.html





[jira] [Updated] (FLINK-5588) Add a unit scaler based on different norms

2017-01-27 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated FLINK-5588:
---
Description: 
So far ML has two scalers: min-max and the standard scaler.
A third one frequently used, is the scaler to unit.
We could implement a transformer for this type of scaling for different norms 
available to the user.
I will make a separate class for the Normalization procedure by using the 
Transformer API, because it is easy to add; 
the fit method does nothing in this case.
Scikit-learn also has some calls available outside the Transformer API; we might 
want to add those in the future.
These calls work on any axis, but they are not reusable in a pipeline [4].
Right now the existing scalers in Flink ML support per feature normalization by 
using the Transformer API. 

Resources
[1] https://en.wikipedia.org/wiki/Feature_scaling
[2] 
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
[3] https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html
[4] http://scikit-learn.org/stable/modules/preprocessing.html

  was:
So far ML has two scalers: min-max and the standard scaler.
A third one frequently used, is the scaler to unit.
We could implement a transformer for this type of scaling for different norms 
available to the user.

Axis for scaling either features or samples (0 for columns-features 1 for 
samples-rows). 
I will make this a separate class for the Normalization procedure by using the 
Transformer API.
Scikit-learn also has some calls available outside the Transformer API; we might 
want to add those in the future.
Right now the existing scalers in Flink ML support per feature normalization by 
using the Transformer API. 

Resources
[1] https://en.wikipedia.org/wiki/Feature_scaling
[2] 
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
[3] https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html
[4] http://scikit-learn.org/stable/modules/preprocessing.html


> Add a unit scaler based on different norms
> --
>
> Key: FLINK-5588
> URL: https://issues.apache.org/jira/browse/FLINK-5588
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Minor
>
> So far ML has two scalers: min-max and the standard scaler.
> A third one frequently used, is the scaler to unit.
> We could implement a transformer for this type of scaling for different norms 
> available to the user.
> I will make a separate class for the Normalization procedure by using the 
> Transformer API, because it is easy to add; 
> the fit method does nothing in this case.
> Scikit-learn also has some calls available outside the Transformer API; we 
> might want to add those in the future.
> These calls work on any axis, but they are not reusable in a pipeline [4].
> Right now the existing scalers in Flink ML support per feature normalization 
> by using the Transformer API. 
> Resources
> [1] https://en.wikipedia.org/wiki/Feature_scaling
> [2] 
> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
> [3] https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html
> [4] http://scikit-learn.org/stable/modules/preprocessing.html
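
The remark that the fit method does nothing can be illustrated with a small stateless transformer. The sketch below is plain Python in the style of a scikit-learn transformer, not the Flink ML Transformer API. The class name `Normalizer` and its `fit`/`transform` signatures are assumptions for illustration.

```python
from typing import List, Sequence


class Normalizer:
    """Hypothetical per-sample scaler to unit p-norm (illustration only)."""

    def __init__(self, p: float = 2.0) -> None:
        if p < 1:
            raise ValueError("p must be >= 1")
        self.p = p

    def fit(self, samples: Sequence[Sequence[float]]) -> "Normalizer":
        # Per-sample scaling needs no statistics from the training data,
        # so fit is a no-op that only keeps the pipeline interface uniform.
        return self

    def transform(self, samples: Sequence[Sequence[float]]) -> List[List[float]]:
        result = []
        for row in samples:
            norm = sum(abs(x) ** self.p for x in row) ** (1.0 / self.p)
            # Zero rows are passed through unchanged.
            result.append([x / norm for x in row] if norm > 0.0 else list(row))
        return result
```

Because `fit` returns the object itself, calls can be chained as `Normalizer().fit(data).transform(data)`, which is what makes such a class usable as a pipeline stage.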





[jira] [Updated] (FLINK-5588) Add a unit scaler based on different norms

2017-01-27 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated FLINK-5588:
---
Description: 
So far ML has two scalers: min-max and the standard scaler.
A third one frequently used, is the scaler to unit.
We could implement a transformer for this type of scaling for different norms 
available to the user.
I will make a separate class for the Normalization per sample procedure by 
using the Transformer API, because it is easy to add; 
the fit method does nothing in this case.
Scikit-learn also has some calls available outside the Transformer API; we might 
want to add those in the future.
These calls work on any axis, but they are not reusable in a pipeline [4].
Right now the existing scalers in Flink ML support per feature normalization by 
using the Transformer API. 

Resources
[1] https://en.wikipedia.org/wiki/Feature_scaling
[2] 
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
[3] https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html
[4] http://scikit-learn.org/stable/modules/preprocessing.html

  was:
So far ML has two scalers: min-max and the standard scaler.
A third one frequently used, is the scaler to unit.
We could implement a transformer for this type of scaling for different norms 
available to the user.
I will make a separate class for the Normalization procedure by using the 
Transformer API, because it is easy to add; 
the fit method does nothing in this case.
Scikit-learn also has some calls available outside the Transformer API; we might 
want to add those in the future.
These calls work on any axis, but they are not reusable in a pipeline [4].
Right now the existing scalers in Flink ML support per feature normalization by 
using the Transformer API. 

Resources
[1] https://en.wikipedia.org/wiki/Feature_scaling
[2] 
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
[3] https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html
[4] http://scikit-learn.org/stable/modules/preprocessing.html


> Add a unit scaler based on different norms
> --
>
> Key: FLINK-5588
> URL: https://issues.apache.org/jira/browse/FLINK-5588
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Minor
>
> So far ML has two scalers: min-max and the standard scaler.
> A third one frequently used, is the scaler to unit.
> We could implement a transformer for this type of scaling for different norms 
> available to the user.
> I will make a separate class for the Normalization per sample procedure by 
> using the Transformer API, because it is easy to add; 
> the fit method does nothing in this case.
> Scikit-learn also has some calls available outside the Transformer API; we 
> might want to add those in the future.
> These calls work on any axis, but they are not reusable in a pipeline [4].
> Right now the existing scalers in Flink ML support per feature normalization 
> by using the Transformer API. 
> Resources
> [1] https://en.wikipedia.org/wiki/Feature_scaling
> [2] 
> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
> [3] https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html
> [4] http://scikit-learn.org/stable/modules/preprocessing.html





[jira] [Updated] (FLINK-5588) Add a unit scaler based on different norms

2017-01-27 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated FLINK-5588:
---
Description: 
So far ML has two scalers: min-max and the standard scaler.
A third one frequently used, is the scaler to unit.
We could implement a transformer for this type of scaling for different norms 
available to the user.
I will make a separate class for the Normalization procedure by using the 
Transformer API, because it is easy to add; 
the fit method does nothing in this case.
Scikit-learn also has some calls available outside the Transformer API; we might 
want to add those in the future.
These calls work on any axis, but they are not reusable in a pipeline [4].
Right now the existing scalers in Flink ML support per feature normalization by 
using the Transformer API. 

Resources
[1] https://en.wikipedia.org/wiki/Feature_scaling
[2] 
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
[3] https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html
[4] http://scikit-learn.org/stable/modules/preprocessing.html

  was:
So far ML has two scalers: min-max and the standard scaler.
A third one frequently used, is the scaler to unit.
We could implement a transformer for this type of scaling for different norms 
available to the user.
I will make a separate class for the Normalization procedure by using the 
Transformer API, because it is easy to add; 
the fit method does nothing in this case.
Scikit-learn also has some calls available outside the Transformer API; we might 
want to add those in the future.
These calls work on any axis, but they are not reusable in a pipeline [4].
Right now the existing scalers in Flink ML support per feature normalization by 
using the Transforer API. 

Resources
[1] https://en.wikipedia.org/wiki/Feature_scaling
[2] 
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
[3] https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html
[4] http://scikit-learn.org/stable/modules/preprocessing.html


> Add a unit scaler based on different norms
> --
>
> Key: FLINK-5588
> URL: https://issues.apache.org/jira/browse/FLINK-5588
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Minor
>
> So far ML has two scalers: min-max and the standard scaler.
> A third one frequently used, is the scaler to unit.
> We could implement a transformer for this type of scaling for different norms 
> available to the user.
> I will make a separate class for the Normalization procedure by using the 
> Transformer API, because it is easy to add; 
> the fit method does nothing in this case.
> Scikit-learn also has some calls available outside the Transformer API; we 
> might want to add those in the future.
> These calls work on any axis, but they are not reusable in a pipeline [4].
> Right now the existing scalers in Flink ML support per feature normalization 
> by using the Transformer API. 
> Resources
> [1] https://en.wikipedia.org/wiki/Feature_scaling
> [2] 
> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
> [3] https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html
> [4] http://scikit-learn.org/stable/modules/preprocessing.html





[jira] [Commented] (FLINK-5841) Algorithms for each pipeline stage should handle NaN, infinity like in scikit-learn

2017-02-24 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882885#comment-15882885
 ] 

Stavros Kontopoulos commented on FLINK-5841:


Cool, will give it a shot :)

> Algorithms for each pipeline stage should handle NaN, infinity like in 
> scikit-learn
> ---
>
> Key: FLINK-5841
> URL: https://issues.apache.org/jira/browse/FLINK-5841
> Project: Flink
>  Issue Type: Bug
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>
> Algorithms in scikit-learn don't accept NaN, Infinity values. Since we are 
> following the scikit-learn approach we should conform to that.
> Right now values are propagated... check pre-processing algos.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (FLINK-5841) Algorithms for each pipeline stage should handle NaN, infinity like in scikit-learn

2017-02-18 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created FLINK-5841:
--

 Summary: Algorithms for each pipeline stage should handle NaN, 
infinity like in scikit-learn
 Key: FLINK-5841
 URL: https://issues.apache.org/jira/browse/FLINK-5841
 Project: Flink
  Issue Type: Bug
  Components: Machine Learning Library
Reporter: Stavros Kontopoulos
Assignee: Stavros Kontopoulos


Algorithms in scikit-learn don't accept NaN or Infinity values. Since we are 
following the scikit-learn approach, we should conform to that.
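
One way to conform to this would be to reject non-finite input before a pipeline stage runs. The sketch below is plain Python with a hypothetical helper name, `check_finite`. Raising `ValueError` mirrors, as far as I understand it, what scikit-learn does for such input, but the exact behaviour wanted for Flink ML is an open question here.

```python
import math
from typing import Sequence


def check_finite(samples: Sequence[Sequence[float]]) -> Sequence[Sequence[float]]:
    """Raise ValueError if any value is NaN or +/-Infinity, else return the input."""
    for i, row in enumerate(samples):
        for j, value in enumerate(row):
            if not math.isfinite(value):
                raise ValueError(
                    f"non-finite value {value!r} at row {i}, column {j}"
                )
    return samples
```

Failing fast at the start of a stage avoids silently propagating NaN through later stages.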





[jira] [Updated] (FLINK-5841) Algorithms for each pipeline stage should handle NaN, infinity like in scikit-learn

2017-02-18 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated FLINK-5841:
---
Description: 
Algorithms in scikit-learn don't accept NaN, Infinity values. Since we are 
following the scikit-learn approach we should conform to that.
Right now values are propagated... check pre-processing algos.

  was:Algorithms in scikit-learn don't accept NaN, Infinity values. Since we 
are following the scikit-learn approach we should conform to that.


> Algorithms for each pipeline stage should handle NaN, infinity like in 
> scikit-learn
> ---
>
> Key: FLINK-5841
> URL: https://issues.apache.org/jira/browse/FLINK-5841
> Project: Flink
>  Issue Type: Bug
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>
> Algorithms in scikit-learn don't accept NaN, Infinity values. Since we are 
> following the scikit-learn approach we should conform to that.
> Right now values are propagated... check pre-processing algos.





[jira] [Updated] (FLINK-5841) Algorithms for each pipeline stage should handle NaN, infinity like in scikit-learn

2017-02-18 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated FLINK-5841:
---
Description: Algorithms in scikit-learn don't accept NaN, Infinity values. 
Since we are following the scikit-learn approach we should conform to that.  
(was: Algorithms in scikit-learn don't accept NaN, Infinity values. Since we 
are following the scikit-learn approach we should conform that.)

> Algorithms for each pipeline stage should handle NaN, infinity like in 
> scikit-learn
> ---
>
> Key: FLINK-5841
> URL: https://issues.apache.org/jira/browse/FLINK-5841
> Project: Flink
>  Issue Type: Bug
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>
> Algorithms in scikit-learn don't accept NaN, Infinity values. Since we are 
> following the scikit-learn approach we should conform to that.





[jira] [Commented] (FLINK-5588) Add a unit scaler based on different norms

2017-02-14 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866685#comment-15866685
 ] 

Stavros Kontopoulos commented on FLINK-5588:


[~till.rohrmann] I have already implemented the Normalizer. I still need to check the 
floating-point arithmetic for the UnitScaler, because the sum might overflow:
Reference: 
http://www.scan2014.uni-wuerzburg.de/fileadmin/1003/scan2014/talks/B2_2.pdf...
The standard scaler uses this algorithm: 
http://www.cs.yale.edu/publications/techreports/tr222.pdf
I am OK with norms 1 and 2, but I am not sure about p>2.
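
To show the overflow concern and one common workaround, the sketch below scales every entry by the largest absolute value before raising it to the power p, so the intermediate sum stays small. This is a generic technique in plain Python, not necessarily what the linked talk recommends, and `safe_p_norm` is a hypothetical name.

```python
import math
from typing import Sequence


def safe_p_norm(v: Sequence[float], p: float = 2.0) -> float:
    """p-norm computed relative to the largest entry to avoid overflow.

    Assumes p >= 1 and that v contains no NaN values.
    """
    scale = max((abs(x) for x in v), default=0.0)
    if scale == 0.0 or math.isinf(scale):
        # All zeros, or an infinite entry: the norm is the scale itself.
        return scale
    # Each ratio is in [0, 1], so the powers cannot overflow.
    total = sum((abs(x) / scale) ** p for x in v)
    return scale * total ** (1.0 / p)
```

For `[1e200, 1e200]` the naive sum of squares is infinite in double precision, while the scaled version returns a finite value. The same scaling works for any p >= 1, including p>2.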

> Add a unit scaler based on different norms
> --
>
> Key: FLINK-5588
> URL: https://issues.apache.org/jira/browse/FLINK-5588
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Minor
>
> So far ML has two scalers: min-max and the standard scaler.
> A third one frequently used, is the scaler to unit.
> We could implement a transformer for this type of scaling for different norms 
> available to the user.
> I will make a separate class for the Normalization per sample procedure by 
> using the Transformer API, because it is easy to add; 
> the fit method does nothing in this case.
> Scikit-learn also has some calls available outside the Transformer API; we 
> might want to add those in the future.
> These calls work on any axis, but they are not reusable in a pipeline [4].
> Right now the existing scalers in Flink ML support per feature normalization 
> by using the Transformer API. 
> Resources
> [1] https://en.wikipedia.org/wiki/Feature_scaling
> [2] 
> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html
> [3] https://spark.apache.org/docs/2.1.0/mllib-feature-extraction.html
> [4] http://scikit-learn.org/stable/modules/preprocessing.html





[jira] [Comment Edited] (FLINK-5588) Add a unit scaler based on different norms

2017-02-14 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866685#comment-15866685
 ] 

Stavros Kontopoulos edited comment on FLINK-5588 at 2/14/17 11:30 PM:
--

[~till.rohrmann] Have already implemented the Normalizer... need to check 
floating arithmetic for the UnitScaler because the sum might lead to overflow, 
so have to find a proper algo if there is one.
 
Standard scaler uses this algo: 
http://www.cs.yale.edu/publications/techreports/tr222.pdf for variance on big 
vectors.


was (Author: skonto):
[~till.rohrmann] Have already implemented the Normalizer... need to check 
floating arithmetic for the UnitScaler because the sum might lead to overflow, 
so have to find a proper algo.
 
Standard scaler uses this algo: 
http://www.cs.yale.edu/publications/techreports/tr222.pdf for variance on big 
vectors.



[jira] [Comment Edited] (FLINK-5588) Add a unit scaler based on different norms

2017-02-14 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866685#comment-15866685
 ] 

Stavros Kontopoulos edited comment on FLINK-5588 at 2/14/17 11:30 PM:
--

[~till.rohrmann] Have already implemented the Normalizer... need to check 
floating arithmetic for the UnitScaler because the sum might lead to overflow, 
so have to find a proper algo.
 
Standard scaler uses this algo: 
http://www.cs.yale.edu/publications/techreports/tr222.pdf for variance on big 
vectors.


was (Author: skonto):
[~till.rohrmann] Have already implemented the Normalizer... need to check 
floating arithmetic for the UnitScaler because the sum might lead to overflow:
Reference: 
http://www.scan2014.uni-wuerzburg.de/fileadmin/1003/scan2014/talks/B2_2.pdf...
Standard scaler uses this algo: 
http://www.cs.yale.edu/publications/techreports/tr222.pdf for variance on big 
vectors



[jira] [Comment Edited] (FLINK-5588) Add a unit scaler based on different norms

2017-02-14 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866685#comment-15866685
 ] 

Stavros Kontopoulos edited comment on FLINK-5588 at 2/14/17 11:30 PM:
--

[~till.rohrmann] Have already implemented the Normalizer... need to check 
floating arithmetic for the UnitScaler because the sum might lead to overflow:
Reference: 
http://www.scan2014.uni-wuerzburg.de/fileadmin/1003/scan2014/talks/B2_2.pdf...
Standard scaler uses this algo: 
http://www.cs.yale.edu/publications/techreports/tr222.pdf for variance on big 
vectors


was (Author: skonto):
[~till.rohrmann] Have already implemented the Normalizer... need to check 
floating arithmetic for the UnitScaler because the sum might lead to overflow:
Reference: 
http://www.scan2014.uni-wuerzburg.de/fileadmin/1003/scan2014/talks/B2_2.pdf...
Standard scaler uses this algo: 
http://www.cs.yale.edu/publications/techreports/tr222.pdf




[jira] [Comment Edited] (FLINK-5588) Add a unit scaler based on different norms

2017-02-14 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866685#comment-15866685
 ] 

Stavros Kontopoulos edited comment on FLINK-5588 at 2/14/17 11:55 PM:
--

[~till.rohrmann] Have already implemented the Normalizer... need to check 
floating arithmetic for the UnitScaler because the sum might lead to overflow, 
so have to find a proper algo.

Thinking of dividing with Xmax to avoid overflow and use 
https://en.wikipedia.org/wiki/Kahan_summation_algorithm for the sum of many 
small numbers.
 
Standard scaler uses this algo: 
http://www.cs.yale.edu/publications/techreports/tr222.pdf for variance on big 
vectors.
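
The idea in the comment above (divide by Xmax, then use Kahan summation) could 
look roughly like the following sketch. The function names `kahan_sum`, 
`scaled_p_norm` and `to_unit_norm` are made up for illustration and are not 
taken from the Flink code base; the sketch assumes p >= 1 and a plain list of 
floats.

```python
import math


def kahan_sum(values):
    """Compensated (Kahan) summation of an iterable of floats."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for value in values:
        y = value - compensation
        t = total + y
        compensation = (t - total) - y
        total = t
    return total


def scaled_p_norm(vector, p=2.0):
    """p-norm computed as x_max * (sum(|x_i / x_max| ** p)) ** (1 / p)."""
    x_max = max((abs(x) for x in vector), default=0.0)
    if x_max == 0.0:
        return 0.0
    # Every ratio lies in [0, 1], so raising it to the power p cannot overflow.
    scaled_sum = kahan_sum((abs(x) / x_max) ** p for x in vector)
    return x_max * scaled_sum ** (1.0 / p)


def to_unit_norm(vector, p=2.0):
    """Rescale a vector to unit p-norm; an all-zero vector is returned as is."""
    norm = scaled_p_norm(vector, p)
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]
```

With this scaling, a vector such as `[1e200, 1e200]` has a finite 2-norm, 
whereas squaring its entries directly would exceed the double range.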


was (Author: skonto):
[~till.rohrmann] Have already implemented the Normalizer... need to check 
floating arithmetic for the UnitScaler because the sum might lead to overflow, 
so have to find a proper algo.

Thinking of dividing with Xmax and use 
https://en.wikipedia.org/wiki/Kahan_summation_algorithm
for the sum of many small numbers.
 
Standard scaler uses this algo: 
http://www.cs.yale.edu/publications/techreports/tr222.pdf for variance on big 
vectors.



[jira] [Comment Edited] (FLINK-5588) Add a unit scaler based on different norms

2017-02-14 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866685#comment-15866685
 ] 

Stavros Kontopoulos edited comment on FLINK-5588 at 2/14/17 10:15 PM:
--

[~till.rohrmann] Have already implemented the Normalizer... need to check 
floating arithmetic for the UnitScaler because the sum might lead to overflow:
Reference: 
http://www.scan2014.uni-wuerzburg.de/fileadmin/1003/scan2014/talks/B2_2.pdf...
Standard scaler uses this algo: 
http://www.cs.yale.edu/publications/techreports/tr222.pdf
I am ok with norms 1,2 but I am not sure about p>2


was (Author: skonto):
[~till.rohrmann] Have already implemented the Normalizer... need to check 
floating arithmetic for the UnitScaler because the sum might lead to overflow:
Reference: 
http://www.scan2014.uni-wuerzburg.de/fileadmin/1003/scan2014/talks/B2_2.pdf...
Standard scaler uses this algo: 
http://www.cs.yale.edu/publications/techreports/tr222.pdf
I am ok with norms 1,2 but i am not sure about p>2



[jira] [Comment Edited] (FLINK-5588) Add a unit scaler based on different norms

2017-02-14 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866685#comment-15866685
 ] 

Stavros Kontopoulos edited comment on FLINK-5588 at 2/14/17 10:17 PM:
--

[~till.rohrmann] Have already implemented the Normalizer... need to check 
floating arithmetic for the UnitScaler because the sum might lead to overflow:
Reference: 
http://www.scan2014.uni-wuerzburg.de/fileadmin/1003/scan2014/talks/B2_2.pdf...
Standard scaler uses this algo: 
http://www.cs.yale.edu/publications/techreports/tr222.pdf



was (Author: skonto):
[~till.rohrmann] Have already implemented the Normalizer... need to check 
floating arithmetic for the UnitScaler because the sum might lead to overflow:
Reference: 
http://www.scan2014.uni-wuerzburg.de/fileadmin/1003/scan2014/talks/B2_2.pdf...
Standard scaler uses this algo: 
http://www.cs.yale.edu/publications/techreports/tr222.pdf
I am ok with norms 1,2 but I am not sure about p>2



[jira] [Comment Edited] (FLINK-5588) Add a unit scaler based on different norms

2017-02-14 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866685#comment-15866685
 ] 

Stavros Kontopoulos edited comment on FLINK-5588 at 2/14/17 11:41 PM:
--

[~till.rohrmann] Have already implemented the Normalizer... need to check 
floating arithmetic for the UnitScaler because the sum might lead to overflow, 
so have to find a proper algo.

Thinking of dividing with Xmax and use 
https://en.wikipedia.org/wiki/Kahan_summation_algorithm
for the sum of many small numbers.
 
Standard scaler uses this algo: 
http://www.cs.yale.edu/publications/techreports/tr222.pdf for variance on big 
vectors.


was (Author: skonto):
[~till.rohrmann] Have already implemented the Normalizer... need to check 
floating arithmetic for the UnitScaler because the sum might lead to overflow, 
so have to find a proper algo if there is one.
 
Standard scaler uses this algo: 
http://www.cs.yale.edu/publications/techreports/tr222.pdf for variance on big 
vectors.



[jira] [Comment Edited] (FLINK-5588) Add a unit scaler based on different norms

2017-02-14 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866685#comment-15866685
 ] 

Stavros Kontopoulos edited comment on FLINK-5588 at 2/15/17 2:09 AM:
-

[~till.rohrmann] I have already implemented the Normalizer... I still need to check 
the floating-point arithmetic for the UnitScaler because the sum might overflow, 
so I have to find a proper algorithm.

Thinking of dividing by Xmax to avoid overflow and using 
https://en.wikipedia.org/wiki/Kahan_summation_algorithm for the sum of many 
small numbers.
 
The standard scaler uses this algorithm: 
http://www.cs.yale.edu/publications/techreports/tr222.pdf for the variance of big 
vectors; I need something similar.
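
For context on what a numerically stable variance computation can look like, 
here is a small sketch of Welford's single-pass update. This is a well-known 
stable method chosen for illustration; it is an assumption that it is 
comparable to what the linked report recommends, and the function name 
`welford_variance` is hypothetical.

```python
def welford_variance(values):
    """Return (mean, sample variance) in one pass without summing raw squares."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    if count < 2:
        return mean, 0.0
    return mean, m2 / (count - 1)
```

The update never forms the sum of x squared, so large offsets in the data do 
not cancel catastrophically as they can in the textbook two-sum formula.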


was (Author: skonto):
[~till.rohrmann] Have already implemented the Normalizer... need to check 
floating arithmetic for the UnitScaler because the sum might lead to overflow, 
so have to find a proper algo.

Thinking of dividing with Xmax to avoid overflow and use 
https://en.wikipedia.org/wiki/Kahan_summation_algorithm for the sum of many 
small numbers.
 
Standard scaler uses this algo: 
http://www.cs.yale.edu/publications/techreports/tr222.pdf for variance on big 
vectors.



[jira] [Assigned] (FLINK-5785) Add an Imputer for preparing data

2017-02-13 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos reassigned FLINK-5785:
--

Assignee: Stavros Kontopoulos

> Add an Imputer for preparing data
> -
>
> Key: FLINK-5785
> URL: https://issues.apache.org/jira/browse/FLINK-5785
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>
> We need to add an Imputer as described in [1].
> "The Imputer class provides basic strategies for imputing missing values, 
> either using the mean, the median or the most frequent value of the row or 
> column in which the missing values are located. This class also allows for 
> different missing values encodings."
> References
> 1. http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
> 2. 
> http://scikit-learn.org/stable/auto_examples/missing_values.html#sphx-glr-auto-examples-missing-values-py





[jira] [Commented] (FLINK-5588) Add a unit scaler based on different norms

2017-02-15 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867964#comment-15867964
 ] 

Stavros Kontopoulos commented on FLINK-5588:


Hi [~till.rohrmann] my pleasure. I will wait for the review, meanwhile I will 
continue working on the other stuff and reviews PRs.



[jira] [Comment Edited] (FLINK-5588) Add a unit scaler based on different norms

2017-02-15 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867964#comment-15867964
 ] 

Stavros Kontopoulos edited comment on FLINK-5588 at 2/15/17 2:57 PM:
-

Hi [~till.rohrmann] my pleasure. I will wait for the review, meanwhile I will 
continue working on the other stuff and review PRs. It is important at some 
point for people involved here to discuss roadmap and plan releases of several 
things.


was (Author: skonto):
Hi [~till.rohrmann] my pleasure. I will wait for the review, meanwhile I will 
continue working on the other stuff and reviews PRs.



[jira] [Comment Edited] (FLINK-5588) Add a unit scaler based on different norms

2017-02-15 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867964#comment-15867964
 ] 

Stavros Kontopoulos edited comment on FLINK-5588 at 2/15/17 2:58 PM:
-

Hi [~till.rohrmann], my pleasure. I will wait for the review; meanwhile I will 
continue working on the other stuff and reviewing PRs. At some point it is 
important for the people involved here to discuss the roadmap and plan releases 
of several things. Maybe it would be good to start the discussion on the list.


was (Author: skonto):
Hi [~till.rohrmann] my pleasure. I will wait for the review, meanwhile I will 
continue working on the other stuff and review PRs. It is important at some 
point for people involved here to discuss roadmap and plan releases of several 
things.



[jira] [Created] (FLINK-5785) Add an Imputer for preparing data

2017-02-13 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created FLINK-5785:
--

 Summary: Add an Imputer for preparing data
 Key: FLINK-5785
 URL: https://issues.apache.org/jira/browse/FLINK-5785
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Stavros Kontopoulos


We need to add an Imputer as described in [1].

"The Imputer class provides basic strategies for imputing missing values, 
either using the mean, the median or the most frequent value of the row or 
column in which the missing values are located. This class also allows for 
different missing values encodings."

References
1. http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
2. 
http://scikit-learn.org/stable/auto_examples/missing_values.html#sphx-glr-auto-examples-missing-values-py
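
As a rough illustration of the mean strategy quoted above, the following 
sketch fills missing entries with the mean of their column. It is plain Python 
rather than Flink ML code; the function name `impute_with_column_mean` and the 
`missing` parameter (the encoding used for missing values) are assumptions made 
for this example.

```python
import math


def _is_missing(value, missing):
    # NaN never compares equal to itself, so it needs a dedicated check.
    if isinstance(missing, float) and math.isnan(missing):
        return isinstance(value, float) and math.isnan(value)
    return value == missing


def impute_with_column_mean(rows, missing=float("nan")):
    """Replace missing entries with the mean of the observed values in their column."""
    n_cols = len(rows[0]) if rows else 0
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if not _is_missing(row[j], missing)]
        # A column with no observed values has no mean; its entries stay as they are.
        means.append(sum(observed) / len(observed) if observed else None)
    return [
        [means[j] if _is_missing(v, missing) and means[j] is not None else v
         for j, v in enumerate(row)]
        for row in rows
    ]
```

The median and most-frequent strategies would follow the same pattern, with 
only the per-column statistic changed.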





[jira] [Created] (FLINK-5525) Streaming Version of a Linear Regression model

2017-01-17 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created FLINK-5525:
--

 Summary: Streaming Version of a Linear Regression model
 Key: FLINK-5525
 URL: https://issues.apache.org/jira/browse/FLINK-5525
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Stavros Kontopoulos


Given the nature of Flink, we should have a streaming version of the algorithms 
when possible.
The model should be updated on a per-window basis.
An extreme case is online learning: https://en.wikipedia.org/wiki/Online_machine_learning

Resources

[1] 
http://scikit-learn.org/dev/modules/scaling_strategies.html#incremental-learning
[2] 
http://stats.stackexchange.com/questions/6920/efficient-online-linear-regression
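
One possible reading of "update per window" is a single gradient step on the 
squared error of each window. The sketch below shows that in plain Python; it 
is not Flink code, and the function name `update_on_window`, the learning rate 
and the (features, target) window layout are assumptions for illustration.

```python
def update_on_window(weights, bias, window, learning_rate=0.01):
    """One gradient step on the mean squared error of a window of (features, target) pairs."""
    n = len(window)
    if n == 0:
        return weights, bias
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for features, target in window:
        # Prediction error of the current model on this element.
        error = sum(w * x for w, x in zip(weights, features)) + bias - target
        for j, x in enumerate(features):
            grad_w[j] += error * x
        grad_b += error
    new_weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    new_bias = bias - learning_rate * grad_b / n
    return new_weights, new_bias
```

A window of size one reduces this to the online-learning case mentioned above, 
where the model changes after every element.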




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2013) Create generalized linear model framework

2017-01-17 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826000#comment-15826000
 ] 

Stavros Kontopoulos commented on FLINK-2013:


[~tvas] is this finished?

> Create generalized linear model framework
> -
>
> Key: FLINK-2013
> URL: https://issues.apache.org/jira/browse/FLINK-2013
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Theodore Vasiloudis
>Assignee: Theodore Vasiloudis
>  Labels: ML
>
> [Generalized linear 
> models|http://en.wikipedia.org/wiki/Generalized_linear_model] (GLMs) provide 
> an abstraction for many learning models that can be used for regression and 
> classification tasks.
> Some example GLMs are linear regression, logistic regression, LASSO and ridge 
> regression.
> The goal for this JIRA is to provide interfaces for the set of common 
> properties and functions these models share. 
> In order to achieve that, a design pattern similar to the one that 
> [sklearn|http://scikit-learn.org/stable/modules/linear_model.html] and 
> [MLlib|http://spark.apache.org/docs/1.3.0/mllib-linear-methods.html] employ 
> will be used.





[jira] [Commented] (FLINK-1934) Add approximative k-nearest-neighbours (kNN) algorithm to machine learning library

2017-01-17 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825887#comment-15825887
 ] 

Stavros Kontopoulos commented on FLINK-1934:


Hey guys, is this still active?

> Add approximative k-nearest-neighbours (kNN) algorithm to machine learning 
> library
> --
>
> Key: FLINK-1934
> URL: https://issues.apache.org/jira/browse/FLINK-1934
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Daniel Blazevski
>  Labels: ML
>
> kNN is still a widely used algorithm for classification and regression. 
> However, due to the computational costs of an exact implementation, it does 
> not scale well to large amounts of data. Therefore, it is worthwhile to also 
> add an approximative kNN implementation as proposed in [1,2].  Reference [3] 
> is cited a few times in [1], and gives necessary background on the z-value 
> approach.
> Resources:
> [1] https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf
> [2] http://www.computer.org/csdl/proceedings/wacv/2007/2794/00/27940028.pdf
> [3] http://cs.sjtu.edu.cn/~yaobin/papers/icde10_knn.pdf
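
As background for the z-value approach mentioned in the description, a z-value 
(Morton code) interleaves the bits of a point's integer coordinates so that 
points close in space tend to be close in the resulting one-dimensional order. 
The sketch below shows the two-dimensional case only; the function name 
`z_value` and the `bits` parameter are assumptions, and it does not reproduce 
the algorithms of the cited papers.

```python
def z_value(x, y, bits=16):
    """Interleave the low `bits` bits of non-negative integers x and y (x in even positions)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z
```

Sorting points by this value gives a linear order in which candidate neighbours 
can be searched, which is the general idea behind such approximate methods.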





[jira] [Commented] (FLINK-5588) Add a unit scaler based on different norms

2017-01-20 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15831872#comment-15831872
 ] 

Stavros Kontopoulos commented on FLINK-5588:


Thnx [~till.rohrmann] :) Will give it a shot :)

> Add a unit scaler based on different norms
> --
>
> Key: FLINK-5588
> URL: https://issues.apache.org/jira/browse/FLINK-5588
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Minor
>
> So far ML has two scalers: min-max and the standard.
> A third one frequently used, is the scaler to unit.
> We could implement a transformer for this type of scaling for different norms 
> available to the user.
> Resources
> [1] https://en.wikipedia.org/wiki/Feature_scaling
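
A minimal Python sketch of what scaling to unit length under different norms could look like. This is an illustration only, not the Flink ML transformer: the function names norm and scale_to_unit and the choice to leave zero vectors unchanged are assumptions.

```python
import math


def norm(vector, p):
    """p-norm of a vector; p may be math.inf for the maximum norm."""
    if p == math.inf:
        return max(abs(x) for x in vector)
    return sum(abs(x) ** p for x in vector) ** (1.0 / p)


def scale_to_unit(vector, p=2):
    """Divide every component by the vector's p-norm so the result has norm 1."""
    if p < 1:
        raise ValueError("p must be >= 1")
    if not vector:
        return []
    n = norm(vector, p)
    if n == 0:
        return list(vector)  # assumption: a zero vector is returned unchanged
    return [x / n for x in vector]


print(scale_to_unit([3.0, 4.0], p=2))
print(scale_to_unit([1.0, -1.0, 2.0], p=1))
print(scale_to_unit([1.0, -4.0, 2.0], p=math.inf))
```

Exposing p as a parameter is one way to make the different norms available to the user, as the description asks.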





[jira] [Updated] (FLINK-5525) Streaming Version of a Linear Regression model

2017-01-20 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-5525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated FLINK-5525:
---
Description: 
Given the nature of Flink we should have a streaming version of the algorithms 
when possible.
Update of the model should be done on a per window basis.
An extreme case is: https://en.wikipedia.org/wiki/Online_machine_learning

Resources

[1] 
http://scikit-learn.org/dev/modules/scaling_strategies.html#incremental-learning
[2] 
http://stats.stackexchange.com/questions/6920/efficient-online-linear-regression
[3] https://spark.apache.org/docs/1.1.0/mllib-linear-methods.html


  was:
Given the nature of Flink we should have a streaming version of the algorithms 
when possible.
Update of the model should be done on a per window basis.
An extreme case is: https://en.wikipedia.org/wiki/Online_machine_learning

Resources

[1] 
http://scikit-learn.org/dev/modules/scaling_strategies.html#incremental-learning
[2] 
http://stats.stackexchange.com/questions/6920/efficient-online-linear-regression



> Streaming Version of a Linear Regression model
> --
>
> Key: FLINK-5525
> URL: https://issues.apache.org/jira/browse/FLINK-5525
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>
> Given the nature of Flink we should have a streaming version of the 
> algorithms when possible.
> Update of the model should be done on a per window basis.
> An extreme case is: https://en.wikipedia.org/wiki/Online_machine_learning
> Resources
> [1] 
> http://scikit-learn.org/dev/modules/scaling_strategies.html#incremental-learning
> [2] 
> http://stats.stackexchange.com/questions/6920/efficient-online-linear-regression
> [3] https://spark.apache.org/docs/1.1.0/mllib-linear-methods.html
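
One possible reading of "update of the model on a per window basis" is a single gradient step per window. The Python sketch below illustrates that under stated assumptions: the function name update_on_window, the learning rate, the window size and the simulated stream are all made up for the example, and this is not Flink code.

```python
def update_on_window(weights, bias, window, learning_rate=0.05):
    """One gradient step on the mean squared error of a single window.

    `window` is a list of (features, target) pairs.
    """
    n = len(window)
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for features, target in window:
        error = sum(w * x for w, x in zip(weights, features)) + bias - target
        for i, x in enumerate(features):
            grad_w[i] += 2.0 * error * x / n
        grad_b += 2.0 * error / n
    new_weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
    return new_weights, bias - learning_rate * grad_b


# Simulated stream of (features, target) pairs drawn from y = 2x + 1.
stream = [([float(k % 4)], 2.0 * (k % 4) + 1.0) for k in range(4000)]

weights, bias = [0.0], 0.0
window_size = 4
for start in range(0, len(stream), window_size):
    window = stream[start:start + window_size]
    # The model is carried over from window to window and refined each time.
    weights, bias = update_on_window(weights, bias, window)

print(weights, bias)  # moves towards [2.0] and 1.0
```

With a window size of 1 this degenerates into the fully online case mentioned as the extreme in the description.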





[jira] [Commented] (FLINK-5525) Streaming Version of a Linear Regression model

2017-01-20 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15831686#comment-15831686
 ] 

Stavros Kontopoulos commented on FLINK-5525:


[~tvas] I was thinking of working on this. What do you think?

> Streaming Version of a Linear Regression model
> --
>
> Key: FLINK-5525
> URL: https://issues.apache.org/jira/browse/FLINK-5525
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>
> Given the nature of Flink we should have a streaming version of the 
> algorithms when possible.
> Update of the model should be done on a per window basis.
> An extreme case is: https://en.wikipedia.org/wiki/Online_machine_learning
> Resources
> [1] 
> http://scikit-learn.org/dev/modules/scaling_strategies.html#incremental-learning
> [2] 
> http://stats.stackexchange.com/questions/6920/efficient-online-linear-regression





[jira] [Commented] (FLINK-1743) Add multinomial logistic regression to machine learning library

2017-01-20 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15831692#comment-15831692
 ] 

Stavros Kontopoulos commented on FLINK-1743:


[~dedrummond] What is the status of this? Are you actively involved, and do you 
need help? I ask because many ML tasks depend on each other and I am trying to 
figure out where progress is possible at this point.

> Add multinomial logistic regression to machine learning library
> ---
>
> Key: FLINK-1743
> URL: https://issues.apache.org/jira/browse/FLINK-1743
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: David E Drummond
>  Labels: ML
>
> Multinomial logistic regression [1] would be good first classification 
> algorithm which can classify multiple classes. 
> Resources:
> [1] [http://en.wikipedia.org/wiki/Multinomial_logistic_regression]





[jira] [Commented] (FLINK-5785) Add an Imputer for preparing data

2017-03-28 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944851#comment-15944851
 ] 

Stavros Kontopoulos commented on FLINK-5785:


[~beera] Thanks, I will have a look ASAP.

> Add an Imputer for preparing data
> -
>
> Key: FLINK-5785
> URL: https://issues.apache.org/jira/browse/FLINK-5785
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>
> We need to add an Imputer as described in [1].
> "The Imputer class provides basic strategies for imputing missing values, 
> either using the mean, the median or the most frequent value of the row or 
> column in which the missing values are located. This class also allows for 
> different missing values encodings."
> References
> 1. http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
> 2. 
> http://scikit-learn.org/stable/auto_examples/missing_values.html#sphx-glr-auto-examples-missing-values-py
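
The strategies quoted above (mean, median, most frequent value, plus a configurable encoding for missing values) can be sketched in a few lines of Python. This is an illustration for a single column only; the function name impute_column and its parameters are hypothetical and do not mirror the scikit-learn or Flink ML API.

```python
import math
from collections import Counter
from statistics import mean, median


def impute_column(values, strategy="mean", missing=None):
    """Replace entries equal to `missing` with a statistic of the present values."""

    def is_missing(v):
        # NaN never compares equal to itself, so it needs a dedicated check.
        if isinstance(missing, float) and math.isnan(missing):
            return isinstance(v, float) and math.isnan(v)
        return v == missing

    present = [v for v in values if not is_missing(v)]
    if not present:
        raise ValueError("column has no observed values")

    if strategy == "mean":
        fill = mean(present)
    elif strategy == "median":
        fill = median(present)
    elif strategy == "most_frequent":
        fill = Counter(present).most_common(1)[0][0]
    else:
        raise ValueError("unknown strategy: " + strategy)

    return [fill if is_missing(v) else v for v in values]


print(impute_column([1.0, None, 3.0], strategy="mean"))
print(impute_column([2, 2, None, 5], strategy="most_frequent"))
print(impute_column([float("nan"), 4.0], strategy="mean", missing=float("nan")))
```

The `missing` argument stands in for the "different missing values encodings" mentioned in the quote.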



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2017-04-03 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15953422#comment-15953422
 ] 

Stavros Kontopoulos commented on FLINK-2147:


[~aljoscha] What is your suggestion for this?

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: DataStream API
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.





[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2017-04-04 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15955103#comment-15955103
 ] 

Stavros Kontopoulos commented on FLINK-2147:


I think Count-min sketch can be implemented in a way that each task keeps a local 
count-min sketch as state and, as a next step, emits the frequencies after an 
aggregation of the count-min sketches. This could be window-based and would 
involve implementing custom operators. This is a high-level description and may 
not fit the internals exactly.

A distributed implementation here:
https://www.slideshare.net/databricks/sketching-big-data-with-spark-randomized-algorithms-for-largescale-data-analytics
https://github.com/apache/spark/pull/10911/files
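
As a rough illustration of the idea above (one local sketch per task, merged afterwards), here is a minimal Python sketch. It is not Flink code: the class name CountMinSketch, the default width and depth, and the salted SHA-256 hashing are assumptions chosen for brevity, and the hashing is not the pairwise-independent family used in the paper. Merging is an element-wise sum of the counter tables, which only works if all sketches share the same dimensions and hash functions.

```python
import hashlib


class CountMinSketch:
    """Minimal Count-Min sketch: `depth` rows of `width` counters."""

    def __init__(self, width=272, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash per row, derived by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

    def merge(self, other):
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must have the same dimensions")
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]
        return self


# Two "tasks" each sketch their own partition of the stream.
task_a, task_b = CountMinSketch(), CountMinSketch()
for word in ["flink", "spark", "flink"]:
    task_a.add(word)
for word in ["flink", "storm"]:
    task_b.add(word)

# Aggregation step: combine the local sketches before emitting frequencies.
merged = CountMinSketch().merge(task_a).merge(task_b)
print(merged.estimate("flink"))  # never below the true count of 3
```

Because merging is an element-wise sum, the merged sketch is identical to one that a single task would have built from the whole stream.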

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: DataStream API
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.





[jira] [Comment Edited] (FLINK-2147) Approximate calculation of frequencies in data streams

2017-04-04 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15955103#comment-15955103
 ] 

Stavros Kontopoulos edited comment on FLINK-2147 at 4/4/17 1:01 PM:


I think Count-min sketch can be implemented in a way that each task keeps a local 
count-min sketch as state and, as a next step, emits the frequencies after an 
aggregation of the count-min sketches. Sketches of this type can be merged.

This could be window-based and would involve implementing custom operators. 
This is a high-level description and may not fit the internals exactly.

A distributed implementation here:
https://www.slideshare.net/databricks/sketching-big-data-with-spark-randomized-algorithms-for-largescale-data-analytics
https://github.com/apache/spark/pull/10911/files


was (Author: skonto):
I think Count-min sketch can be implemented in way that each task keeps a local 
count-min sketch as state, and as a next step it emits the frequencies after an 
aggregation of count-mi sketches. This could be windows based and would involve 
to implement custom operators. This is a high level description and may not fit 
exactly to the internals.

A distributed implementation here:
https://www.slideshare.net/databricks/sketching-big-data-with-spark-randomized-algorithms-for-largescale-data-analytics
https://github.com/apache/spark/pull/10911/files

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: DataStream API
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.





[jira] [Comment Edited] (FLINK-2147) Approximate calculation of frequencies in data streams

2017-04-04 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15955103#comment-15955103
 ] 

Stavros Kontopoulos edited comment on FLINK-2147 at 4/4/17 1:02 PM:


I think Count-min sketch can be implemented in a way that each task keeps a local 
count-min sketch as state and, as a next step, emits the frequencies after an 
aggregation of the count-min sketches. Sketches of this type can be merged.

This could be window-based and would involve implementing custom operators. 
This is a high-level description and may not fit the internals exactly.

A distributed implementation here:
https://www.slideshare.net/databricks/sketching-big-data-with-spark-randomized-algorithms-for-largescale-data-analytics
https://github.com/apache/spark/pull/10911/files


was (Author: skonto):
I think Count-min sketch can be implemented in way that each task keeps a local 
count-min sketch as state, and as a next step it emits the frequencies after an 
aggregation of count-min sketches. Sketched of this type can be merged.

This could be windows based and would involve to implement custom operators. 
This is a high level description and may not fit exactly to the internals.

A distributed implementation here:
https://www.slideshare.net/databricks/sketching-big-data-with-spark-randomized-algorithms-for-largescale-data-analytics
https://github.com/apache/spark/pull/10911/files

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: DataStream API
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.





[jira] [Comment Edited] (FLINK-2147) Approximate calculation of frequencies in data streams

2017-04-04 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15955164#comment-15955164
 ] 

Stavros Kontopoulos edited comment on FLINK-2147 at 4/4/17 2:01 PM:


You just pick one of the sketches, merge it with another one, and kill the task 
(the 3-down-to-2 case).
For 1 to N, just split the stream and create N count-min sketches. Wouldn't 
that work?


was (Author: skonto):
You just pick one of the sketches merge it with another one kill the task (3 
down to 2 case).
For 1 to N. Just split the stream and create N count-min sketched. Wouldn't 
that work?

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: DataStream API
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.





[jira] [Commented] (FLINK-2147) Approximate calculation of frequencies in data streams

2017-04-04 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15955164#comment-15955164
 ] 

Stavros Kontopoulos commented on FLINK-2147:


You just pick one of the sketches, merge it with another one, and kill the task 
(the 3-down-to-2 case).
For 1 to N, just split the stream and create N count-min sketches. Wouldn't 
that work?

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: DataStream API
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.





[jira] [Comment Edited] (FLINK-2147) Approximate calculation of frequencies in data streams

2017-04-04 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15955164#comment-15955164
 ] 

Stavros Kontopoulos edited comment on FLINK-2147 at 4/4/17 2:04 PM:


You just pick one of the sketches, merge it with another one, and kill the task 
(the 3-down-to-2 case).
For 1 to N, just split the stream, create N-1 new count-min sketches, and keep 
the first as is. Wouldn't that work?
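
The two rescaling cases described above can be sketched in Python on bare counter tables. This is only an illustration under assumptions: the helper names (empty_sketch, merge_tables, scale_down, scale_up, total) are hypothetical, the tables are tiny, and nothing here reflects how Flink redistributes operator state internally. The point is that the element-wise sum over all tasks is unchanged by either operation.

```python
def empty_sketch(width, depth):
    return [[0] * width for _ in range(depth)]


def merge_tables(a, b):
    """Element-wise sum of two counter tables with the same dimensions."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def scale_down(sketches):
    """Fold the last task's sketch into the first one and drop that task."""
    *rest, last = sketches
    rest[0] = merge_tables(rest[0], last)
    return rest


def scale_up(sketch, n):
    """Keep the existing sketch as is and start n - 1 new tasks with empty ones."""
    depth, width = len(sketch), len(sketch[0])
    return [sketch] + [empty_sketch(width, depth) for _ in range(n - 1)]


def total(sketches):
    """Global sketch: the element-wise sum over all tasks."""
    result = sketches[0]
    for s in sketches[1:]:
        result = merge_tables(result, s)
    return result


three = [[[1, 0, 2, 0]], [[0, 3, 0, 0]], [[4, 0, 0, 1]]]  # three tasks, 1 x 4 tables
two = scale_down(three)
print(two)                                # [[[5, 0, 2, 1]], [[0, 3, 0, 0]]]
print(total(three) == total(two))         # True

one = [total(two)]
many = scale_up(one[0], 3)
print(total(many) == total(one))          # True
```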


was (Author: skonto):
You just pick one of the sketches merge it with another one kill the task (3 
down to 2 case).
For 1 to N. Just split the stream and create N count-min sketches. Wouldn't 
that work?

> Approximate calculation of frequencies in data streams
> --
>
> Key: FLINK-2147
> URL: https://issues.apache.org/jira/browse/FLINK-2147
> Project: Flink
>  Issue Type: New Feature
>  Components: DataStream API
>Reporter: Gabor Gevay
>  Labels: approximate, statistics
>
> Count-Min sketch is a hashing-based algorithm for approximately keeping track 
> of the frequencies of elements in a data stream. It is described by Cormode 
> et al. in the following paper:
> http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf
> Note that this algorithm can be conveniently implemented in a distributed 
> way, as described in section 3.2 of the paper.
> The paper
> http://www.vldb.org/conf/2002/S10P03.pdf
> also describes algorithms for approximately keeping track of frequencies, but 
> here the user can specify a threshold below which she is not interested in 
> the frequency of an element. The error-bounds are also different than the 
> Count-min sketch algorithm.





[jira] [Comment Edited] (FLINK-5785) Add an Imputer for preparing data

2017-03-08 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901830#comment-15901830
 ] 

Stavros Kontopoulos edited comment on FLINK-5785 at 3/8/17 8:25 PM:


[~beera] Let me know if you want any kind of help.


was (Author: skonto):
[~beera] If you do that please follow my approach here for raising exceptions:
https://github.com/skonto/flink/blob/6736a66ae1bd2c0efbaa29cf170cabd18b281a8a/flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/Normalizer.scala#L127

I will finish that PR for unit scaling ASAP.

> Add an Imputer for preparing data
> -
>
> Key: FLINK-5785
> URL: https://issues.apache.org/jira/browse/FLINK-5785
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>
> We need to add an Imputer as described in [1].
> "The Imputer class provides basic strategies for imputing missing values, 
> either using the mean, the median or the most frequent value of the row or 
> column in which the missing values are located. This class also allows for 
> different missing values encodings."
> References
> 1. http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
> 2. 
> http://scikit-learn.org/stable/auto_examples/missing_values.html#sphx-glr-auto-examples-missing-values-py





[jira] [Comment Edited] (FLINK-5785) Add an Imputer for preparing data

2017-03-08 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901830#comment-15901830
 ] 

Stavros Kontopoulos edited comment on FLINK-5785 at 3/8/17 8:25 PM:


[~beera] If you do that please follow my approach here for raising exceptions:
https://github.com/skonto/flink/blob/6736a66ae1bd2c0efbaa29cf170cabd18b281a8a/flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/Normalizer.scala#L127

I will finish that PR for unit scaling ASAP.


was (Author: skonto):
[~beera] If you do that please follow my approach here:
https://github.com/skonto/flink/blob/6736a66ae1bd2c0efbaa29cf170cabd18b281a8a/flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/Normalizer.scala#L127

I will finish that PR for unit scaling ASAP.

> Add an Imputer for preparing data
> -
>
> Key: FLINK-5785
> URL: https://issues.apache.org/jira/browse/FLINK-5785
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>
> We need to add an Imputer as described in [1].
> "The Imputer class provides basic strategies for imputing missing values, 
> either using the mean, the median or the most frequent value of the row or 
> column in which the missing values are located. This class also allows for 
> different missing values encodings."
> References
> 1. http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
> 2. 
> http://scikit-learn.org/stable/auto_examples/missing_values.html#sphx-glr-auto-examples-missing-values-py





[jira] [Commented] (FLINK-5785) Add an Imputer for preparing data

2017-03-08 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901830#comment-15901830
 ] 

Stavros Kontopoulos commented on FLINK-5785:


[~beera] If you do that please follow my approach here:
https://github.com/skonto/flink/blob/6736a66ae1bd2c0efbaa29cf170cabd18b281a8a/flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/Normalizer.scala#L127
I will finish that PR ASAP.

> Add an Imputer for preparing data
> -
>
> Key: FLINK-5785
> URL: https://issues.apache.org/jira/browse/FLINK-5785
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>
> We need to add an Imputer as described in [1].
> "The Imputer class provides basic strategies for imputing missing values, 
> either using the mean, the median or the most frequent value of the row or 
> column in which the missing values are located. This class also allows for 
> different missing values encodings."
> References
> 1. http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
> 2. 
> http://scikit-learn.org/stable/auto_examples/missing_values.html#sphx-glr-auto-examples-missing-values-py





[jira] [Comment Edited] (FLINK-5785) Add an Imputer for preparing data

2017-03-08 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901830#comment-15901830
 ] 

Stavros Kontopoulos edited comment on FLINK-5785 at 3/8/17 8:26 PM:


[~beera] Let me know if you want any help. Also let me know when finished so I 
can review your work.  


was (Author: skonto):
[~beera] Let me know if you want any kind of help.

> Add an Imputer for preparing data
> -
>
> Key: FLINK-5785
> URL: https://issues.apache.org/jira/browse/FLINK-5785
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>
> We need to add an Imputer as described in [1].
> "The Imputer class provides basic strategies for imputing missing values, 
> either using the mean, the median or the most frequent value of the row or 
> column in which the missing values are located. This class also allows for 
> different missing values encodings."
> References
> 1. http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
> 2. 
> http://scikit-learn.org/stable/auto_examples/missing_values.html#sphx-glr-auto-examples-missing-values-py





[jira] [Comment Edited] (FLINK-5785) Add an Imputer for preparing data

2017-03-08 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901830#comment-15901830
 ] 

Stavros Kontopoulos edited comment on FLINK-5785 at 3/8/17 7:25 PM:


[~beera] If you do that please follow my approach here:
https://github.com/skonto/flink/blob/6736a66ae1bd2c0efbaa29cf170cabd18b281a8a/flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/Normalizer.scala#L127

I will finish that PR for unit scaling ASAP.


was (Author: skonto):
[~beera] If you do that please follow my approach here:
https://github.com/skonto/flink/blob/6736a66ae1bd2c0efbaa29cf170cabd18b281a8a/flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/Normalizer.scala#L127
I will finish that PR ASAP.

> Add an Imputer for preparing data
> -
>
> Key: FLINK-5785
> URL: https://issues.apache.org/jira/browse/FLINK-5785
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>
> We need to add an Imputer as described in [1].
> "The Imputer class provides basic strategies for imputing missing values, 
> either using the mean, the median or the most frequent value of the row or 
> column in which the missing values are located. This class also allows for 
> different missing values encodings."
> References
> 1. http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
> 2. 
> http://scikit-learn.org/stable/auto_examples/missing_values.html#sphx-glr-auto-examples-missing-values-py





[jira] [Commented] (FLINK-5536) Config option: HA

2017-05-10 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005585#comment-16005585
 ] 

Stavros Kontopoulos commented on FLINK-5536:


https://github.com/mesosphere/dcos-flink-service/pull/23
https://github.com/mesosphere/universe/pull/1163

> Config option: HA
> -
>
> Key: FLINK-5536
> URL: https://issues.apache.org/jira/browse/FLINK-5536
> Project: Flink
>  Issue Type: Sub-task
>  Components: Mesos
>Reporter: Eron Wright 
>Assignee: Stavros Kontopoulos
>
> Configure Flink HA thru package options plus good defaults.   The main 
> components are ZK configuration and state backend configuration.
> - The ZK information can be defaulted to `master.mesos` as with other packages
> - Evaluate whether ZK can be fully configured by default, even if a state 
> backend isn't configured.
> - Use DCOS HDFS as the filesystem for the state backend.  Evaluate whether to 
> assume that DCOS HDFS is installed by default, or whether to make it explicit.
> - To use DCOS HDFS, the init script should download the core-site.xml and 
> hdfs-site.xml from the HDFS 'connection' endpoint.   Supply a default value 
> for the endpoint address; see 
> [https://docs.mesosphere.com/service-docs/hdfs/connecting-clients/].





[jira] [Comment Edited] (FLINK-5536) Config option: HA

2017-05-10 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16005585#comment-16005585
 ] 

Stavros Kontopoulos edited comment on FLINK-5536 at 5/10/17 10:43 PM:
--

[~eronwright]
https://github.com/mesosphere/dcos-flink-service/pull/23
https://github.com/mesosphere/universe/pull/1163


was (Author: skonto):
https://github.com/mesosphere/dcos-flink-service/pull/23
https://github.com/mesosphere/universe/pull/1163

> Config option: HA
> -
>
> Key: FLINK-5536
> URL: https://issues.apache.org/jira/browse/FLINK-5536
> Project: Flink
>  Issue Type: Sub-task
>  Components: Mesos
>Reporter: Eron Wright 
>Assignee: Stavros Kontopoulos
>
> Configure Flink HA thru package options plus good defaults.   The main 
> components are ZK configuration and state backend configuration.
> - The ZK information can be defaulted to `master.mesos` as with other packages
> - Evaluate whether ZK can be fully configured by default, even if a state 
> backend isn't configured.
> - Use DCOS HDFS as the filesystem for the state backend.  Evaluate whether to 
> assume that DCOS HDFS is installed by default, or whether to make it explicit.
> - To use DCOS HDFS, the init script should download the core-site.xml and 
> hdfs-site.xml from the HDFS 'connection' endpoint.   Supply a default value 
> for the endpoint address; see 
> [https://docs.mesosphere.com/service-docs/hdfs/connecting-clients/].





[jira] [Created] (FLINK-6668) Add flink history server to DCOS

2017-05-22 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created FLINK-6668:
--

 Summary: Add flink history server to DCOS
 Key: FLINK-6668
 URL: https://issues.apache.org/jira/browse/FLINK-6668
 Project: Flink
  Issue Type: New Feature
  Components: Mesos
Reporter: Stavros Kontopoulos


We need to have history server within dc/os env as with the spark case.





[jira] [Commented] (FLINK-6668) Add flink history server to DCOS

2017-05-22 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16020233#comment-16020233
 ] 

Stavros Kontopoulos commented on FLINK-6668:


[~eronwright] What do you think? I created a separate issue for this. Is this 
viable to implement? I am willing to try it.

> Add flink history server to DCOS
> 
>
> Key: FLINK-6668
> URL: https://issues.apache.org/jira/browse/FLINK-6668
> Project: Flink
>  Issue Type: New Feature
>  Components: Mesos
>Reporter: Stavros Kontopoulos
>
> We need to have history server within dc/os env as with the spark case.





[jira] [Commented] (FLINK-6668) Add flink history server to DCOS

2017-05-23 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16020749#comment-16020749
 ] 

Stavros Kontopoulos commented on FLINK-6668:


In DC/OS, the Spark history server is just another service that reads the 
history data from HDFS. It runs independently of the dispatcher, which is why 
I was thinking of taking the work in FLINK-1579 and running it the same way. 
I will have a look.

> Add flink history server to DCOS
> 
>
> Key: FLINK-6668
> URL: https://issues.apache.org/jira/browse/FLINK-6668
> Project: Flink
>  Issue Type: New Feature
>  Components: Mesos
>Reporter: Stavros Kontopoulos
>
> We need to have history server within dc/os env as with the spark case.





[jira] [Comment Edited] (FLINK-6668) Add flink history server to DCOS

2017-05-23 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16020749#comment-16020749
 ] 

Stavros Kontopoulos edited comment on FLINK-6668 at 5/23/17 7:20 AM:
-

In DC/OS, the Spark history server is just another service that reads the 
history data from HDFS. It runs independently of the dispatcher, which is why 
I was thinking of taking the work in FLINK-1579 and running it the same way. 
Cool, I will have a look.


was (Author: skonto):
In DC/OS spark history server is just another service reading from hdfs the 
history data. It runs independently from the dispatcher that's why I was 
thinking to take the work in FLINK-1579 and run it the same way. I will have a 
look.

> Add flink history server to DCOS
> 
>
> Key: FLINK-6668
> URL: https://issues.apache.org/jira/browse/FLINK-6668
> Project: Flink
>  Issue Type: New Feature
>  Components: Mesos
>Reporter: Stavros Kontopoulos
>
> We need to have history server within dc/os env as with the spark case.





[jira] [Comment Edited] (FLINK-5536) Config option: HA

2017-05-02 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993627#comment-15993627
 ] 

Stavros Kontopoulos edited comment on FLINK-5536 at 5/2/17 8:06 PM:


Ok can do that. I verified that these work in dc/os:
extra-args: -Dhigh-availability=zookeeper 
-Dhigh-availability.zookeeper.quorum=master.mesos:2181 
-Dhigh-availability.zookeeper.storageDir=hdfs://hdfs/flink/recovery 
-Drecovery.zookeeper.path.mesos-workers=/flink

Hdfs: config-url: http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints

An HDFS folder is required.

Of course, we still need to verify that a job restarts correctly, etc.

Passing them to the service is trivial. As soon as I check this I will create a 
PR.


was (Author: skonto):
Ok can do that. I verified that these work in dc/os:
extra-args: -Dhigh-availability=zookeeper 
-Dhigh-availability.zookeeper.quorum=master.mesos:2181 
-Dhigh-availability.zookeeper.storageDir=hdfs://hdfs/flink/recovery 
-Drecovery.zookeeper.path.mesos-workers=/flink

Hdfs: config-url: http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints
Of course need to verify that a job restarts ok etc...

Passing them to the service is trivial. As noon as I check this I will create a 
PR.

> Config option: HA
> -
>
> Key: FLINK-5536
> URL: https://issues.apache.org/jira/browse/FLINK-5536
> Project: Flink
>  Issue Type: Sub-task
>  Components: Mesos
>Reporter: Eron Wright 
>Assignee: Stavros Kontopoulos
>
> Configure Flink HA thru package options plus good defaults.   The main 
> components are ZK configuration and state backend configuration.
> - The ZK information can be defaulted to `master.mesos` as with other packages
> - Evaluate whether ZK can be fully configured by default, even if a state 
> backend isn't configured.
> - Use DCOS HDFS as the filesystem for the state backend.  Evaluate whether to 
> assume that DCOS HDFS is installed by default, or whether to make it explicit.
> - To use DCOS HDFS, the init script should download the core-site.xml and 
> hdfs-site.xml from the HDFS 'connection' endpoint.   Supply a default value 
> for the endpoint address; see 
> [https://docs.mesosphere.com/service-docs/hdfs/connecting-clients/].





[jira] [Comment Edited] (FLINK-5536) Config option: HA

2017-05-02 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993627#comment-15993627
 ] 

Stavros Kontopoulos edited comment on FLINK-5536 at 5/2/17 8:06 PM:


Ok can do that. I verified that these work in dc/os:
extra-args: -Dhigh-availability=zookeeper 
-Dhigh-availability.zookeeper.quorum=master.mesos:2181 
-Dhigh-availability.zookeeper.storageDir=hdfs://hdfs/flink/recovery 
-Drecovery.zookeeper.path.mesos-workers=/flink

Hdfs: config-url: http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints

An HDFS folder is required.

Of course, we still need to verify that a job restarts correctly, etc.

Passing them to the service is trivial. As soon as I check this I will create a 
PR.
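
To illustrate how these extra args map onto configuration entries, here is a minimal sketch that splits a string of `-Dkey=value` options and renders them as `key: value` lines in the style of flink-conf.yaml. The helper names `parse_extra_args` and `to_flink_conf` are hypothetical and not part of Flink or the DC/OS package; the option values are the ones listed above.

```python
import shlex


def parse_extra_args(extra_args: str) -> dict:
    """Turn a string of -Dkey=value options into a {key: value} dict."""
    props = {}
    for token in shlex.split(extra_args):
        if not token.startswith("-D") or "=" not in token:
            raise ValueError(f"unsupported option: {token!r}")
        # Split on the first '=' only, so values may contain '=' themselves.
        key, value = token[2:].split("=", 1)
        props[key] = value
    return props


def to_flink_conf(props: dict) -> str:
    """Render the properties as 'key: value' lines, one per property."""
    return "\n".join(f"{key}: {value}" for key, value in props.items())


EXTRA_ARGS = (
    "-Dhigh-availability=zookeeper "
    "-Dhigh-availability.zookeeper.quorum=master.mesos:2181 "
    "-Dhigh-availability.zookeeper.storageDir=hdfs://hdfs/flink/recovery "
    "-Drecovery.zookeeper.path.mesos-workers=/flink"
)

if __name__ == "__main__":
    print(to_flink_conf(parse_extra_args(EXTRA_ARGS)))
```

Exposing these as package options would amount to generating the same key/value pairs from the package configuration instead of from a free-form string.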


was (Author: skonto):
Ok can do that. I verified that these work in dc/os:
extra-args: -Dhigh-availability=zookeeper 
-Dhigh-availability.zookeeper.quorum=master.mesos:2181 
-Dhigh-availability.zookeeper.storageDir=hdfs://hdfs/flink/recovery 
-Drecovery.zookeeper.path.mesos-workers=/flink

Hdfs: config-url: http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints

An hdfs folder is requirement.

Of course need to verify that a job restarts ok etc...

Passing them to the service is trivial. As noon as I check this I will create a 
PR.

> Config option: HA
> -
>
> Key: FLINK-5536
> URL: https://issues.apache.org/jira/browse/FLINK-5536
> Project: Flink
>  Issue Type: Sub-task
>  Components: Mesos
>Reporter: Eron Wright 
>Assignee: Stavros Kontopoulos
>
> Configure Flink HA thru package options plus good defaults.   The main 
> components are ZK configuration and state backend configuration.
> - The ZK information can be defaulted to `master.mesos` as with other packages
> - Evaluate whether ZK can be fully configured by default, even if a state 
> backend isn't configured.
> - Use DCOS HDFS as the filesystem for the state backend.  Evaluate whether to 
> assume that DCOS HDFS is installed by default, or whether to make it explicit.
> - To use DCOS HDFS, the init script should download the core-site.xml and 
> hdfs-site.xml from the HDFS 'connection' endpoint.   Supply a default value 
> for the endpoint address; see 
> [https://docs.mesosphere.com/service-docs/hdfs/connecting-clients/].





[jira] [Commented] (FLINK-5536) Config option: HA

2017-05-02 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15993627#comment-15993627
 ] 

Stavros Kontopoulos commented on FLINK-5536:


Ok can do that. I verified that these work in dc/os:
extra-args: -Dhigh-availability=zookeeper 
-Dhigh-availability.zookeeper.quorum=master.mesos:2181 
-Dhigh-availability.zookeeper.storageDir=hdfs://hdfs/flink/recovery 
-Drecovery.zookeeper.path.mesos-workers=/flink

Hdfs: config-url: http://api.hdfs.marathon.l4lb.thisdcos.directory/v1/endpoints

Of course, we still need to verify that a job restarts correctly, etc.

Passing them to the service is trivial. As soon as I check this I will create a 
PR.

> Config option: HA
> -
>
> Key: FLINK-5536
> URL: https://issues.apache.org/jira/browse/FLINK-5536
> Project: Flink
>  Issue Type: Sub-task
>  Components: Mesos
>Reporter: Eron Wright 
>Assignee: Stavros Kontopoulos
>
> Configure Flink HA thru package options plus good defaults.   The main 
> components are ZK configuration and state backend configuration.
> - The ZK information can be defaulted to `master.mesos` as with other packages
> - Evaluate whether ZK can be fully configured by default, even if a state 
> backend isn't configured.
> - Use DCOS HDFS as the filesystem for the state backend.  Evaluate whether to 
> assume that DCOS HDFS is installed by default, or whether to make it explicit.
> - To use DCOS HDFS, the init script should download the core-site.xml and 
> hdfs-site.xml from the HDFS 'connection' endpoint.   Supply a default value 
> for the endpoint address; see 
> [https://docs.mesosphere.com/service-docs/hdfs/connecting-clients/].





[jira] [Comment Edited] (FLINK-5536) Config option: HA

2017-05-02 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992696#comment-15992696
 ] 

Stavros Kontopoulos edited comment on FLINK-5536 at 5/2/17 10:43 AM:
-

[~eronwright] I see you can pass extra args: 
https://github.com/mesosphere/universe/blob/version-3.x/repo/packages/F/flink/1/config.json#L50
Is this enough? Is HA fully implemented for Flink on Mesos?



was (Author: skonto):
[~eronwright] I see you can pass extra args: 
https://github.com/mesosphere/universe/blob/version-3.x/repo/packages/F/flink/1/config.json#L50
 is this enough?


> Config option: HA
> -
>
> Key: FLINK-5536
> URL: https://issues.apache.org/jira/browse/FLINK-5536
> Project: Flink
>  Issue Type: Sub-task
>  Components: Mesos
>Reporter: Eron Wright 
>Assignee: Stavros Kontopoulos
>






[jira] [Commented] (FLINK-5536) Config option: HA

2017-05-02 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15992696#comment-15992696
 ] 

Stavros Kontopoulos commented on FLINK-5536:


[~eronwright] I see you can pass extra args: 
https://github.com/mesosphere/universe/blob/version-3.x/repo/packages/F/flink/1/config.json#L50
Is this enough?


> Config option: HA
> -
>
> Key: FLINK-5536
> URL: https://issues.apache.org/jira/browse/FLINK-5536
> Project: Flink
>  Issue Type: Sub-task
>  Components: Mesos
>Reporter: Eron Wright 
>Assignee: Stavros Kontopoulos
>






[jira] [Comment Edited] (FLINK-5536) Config option: HA

2017-05-06 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999615#comment-15999615
 ] 

Stavros Kontopoulos edited comment on FLINK-5536 at 5/6/17 10:31 PM:
-

I tested it, and it seems to work fine. You need to set up flink-conf.yaml plus 
the ZooKeeper namespace on the client side.
I will work on the PR. There are several properties we need to expose.


was (Author: skonto):
I tested it, i seems to work fine. You need to setup flink-config.yaml plus the 
zookeeper namespace at the client side.
I will work for the PR. There are several properties we need to expose.

> Config option: HA
> -
>
> Key: FLINK-5536
> URL: https://issues.apache.org/jira/browse/FLINK-5536
> Project: Flink
>  Issue Type: Sub-task
>  Components: Mesos
>Reporter: Eron Wright 
>Assignee: Stavros Kontopoulos
>
> Configure Flink HA thru package options plus good defaults.   The main 
> components are ZK configuration and state backend configuration.
> - The ZK information can be defaulted to `master.mesos` as with other packages
> - Evaluate whether ZK can be fully configured by default, even if a state 
> backend isn't configured.
> - Use DCOS HDFS as the filesystem for the state backend.  Evaluate whether to 
> assume that DCOS HDFS is installed by default, or whether to make it explicit.
> - To use DCOS HDFS, the init script should download the core-site.xml and 
> hdfs-site.xml from the HDFS 'connection' endpoint.   Supply a default value 
> for the endpoint address; see 
> [https://docs.mesosphere.com/service-docs/hdfs/connecting-clients/].





[jira] [Commented] (FLINK-5536) Config option: HA

2017-05-06 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-5536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15999615#comment-15999615
 ] 

Stavros Kontopoulos commented on FLINK-5536:


I tested it, and it seems to work fine. You need to set up flink-conf.yaml plus 
the ZooKeeper namespace on the client side.
I will work on the PR. There are several properties we need to expose.

> Config option: HA
> -
>
> Key: FLINK-5536
> URL: https://issues.apache.org/jira/browse/FLINK-5536
> Project: Flink
>  Issue Type: Sub-task
>  Components: Mesos
>Reporter: Eron Wright 
>Assignee: Stavros Kontopoulos
>
> Configure Flink HA thru package options plus good defaults.   The main 
> components are ZK configuration and state backend configuration.
> - The ZK information can be defaulted to `master.mesos` as with other packages
> - Evaluate whether ZK can be fully configured by default, even if a state 
> backend isn't configured.
> - Use DCOS HDFS as the filesystem for the state backend.  Evaluate whether to 
> assume that DCOS HDFS is installed by default, or whether to make it explicit.
> - To use DCOS HDFS, the init script should download the core-site.xml and 
> hdfs-site.xml from the HDFS 'connection' endpoint.   Supply a default value 
> for the endpoint address; see 
> [https://docs.mesosphere.com/service-docs/hdfs/connecting-clients/].





[jira] [Commented] (FLINK-7771) Make the operator state queryable

2017-10-12 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202692#comment-16202692
 ] 

Stavros Kontopoulos commented on FLINK-7771:


We could overcome some problems by allowing Flink to inform an external system 
about state changes. If a re-assignment happens, the client that issues the 
queries should know. It could subscribe to that event channel (or persisted 
log) in order to bind state together with (operator_id, task_id) and time. This 
way any query about state could always point to the correct task. Is this 
feasible, or does it add too much overhead?

> Make the operator state queryable
> -
>
> Key: FLINK-7771
> URL: https://issues.apache.org/jira/browse/FLINK-7771
> Project: Flink
>  Issue Type: Improvement
>  Components: Queryable State
>Affects Versions: 1.4.0
>Reporter: Kostas Kloudas
>Assignee: Kostas Kloudas
>
> There seem to be some requests for making the operator (non-keyed) state 
> queryable. This means that the user will specify the *uuid* of the operator 
> and the *taskId*, and he will be able to access the state that corresponds to 
> that operator and for that specific task.
> This issue will serve to document the discussion on the topic, so that 
> everybody can participate.
> I also link [~till.rohrmann] and [~skonto] as he also mentioned that this 
> feature could be helpful.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (FLINK-7771) Make the operator state queryable

2017-10-12 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202692#comment-16202692
 ] 

Stavros Kontopoulos edited comment on FLINK-7771 at 10/12/17 10:40 PM:
---

We could overcome some problems by allowing Flink to inform an external system 
about state changes. If a re-assignment happens, the client that issues the 
queries should know. It could subscribe to that event channel (or persisted 
log) in order to bind state together with (operator_id, task_id) and time. This 
way any query about state could always point to the correct task. Is this 
feasible, or does it add too much overhead?


was (Author: skonto):
We could overcome some problems by allowing Flink to inform an external system 
about state changes. If re-assignment is done the client who issues the queries 
should know. It could subscribe to that event channel (or persisted log) in 
order to bind together state with (operator_id, task_id) and time. This way any 
query about state could always point to the correct task. Is this feasible or 
too adds too much overhead?

> Make the operator state queryable
> -
>
> Key: FLINK-7771
> URL: https://issues.apache.org/jira/browse/FLINK-7771
> Project: Flink
>  Issue Type: Improvement
>  Components: Queryable State
>Affects Versions: 1.4.0
>Reporter: Kostas Kloudas
>Assignee: Kostas Kloudas
>
> There seem to be some requests for making the operator (non-keyed) state 
> queryable. This means that the user will specify the *uuid* of the operator 
> and the *taskId*, and he will be able to access the state that corresponds to 
> that operator and for that specific task.
> This issue will serve to document the discussion on the topic, so that 
> everybody can participate.
> I also link [~till.rohrmann] and [~skonto] as he also mentioned that this 
> feature could be helpful.





[jira] [Comment Edited] (FLINK-7771) Make the operator state queryable

2017-10-12 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202692#comment-16202692
 ] 

Stavros Kontopoulos edited comment on FLINK-7771 at 10/12/17 10:48 PM:
---

[~kkl0u] We could overcome some problems by allowing Flink to inform an 
external system about state changes. If a re-assignment happens, the client 
that issues the queries should know. It could subscribe to that event channel 
(plus checkpoint state changes for recovery) in order to correlate state with 
(operator_id, task_id) and time. This way any query about state could always 
point to the correct task. Is this feasible, or does it add too much overhead?


was (Author: skonto):
[~kkl0u] We could overcome some problems by allowing Flink to inform an 
external system about state changes. If re-assignment is done the client who 
issues the queries should know. It could subscribe to that event channel (or 
push changes to a distributed log for recovery) in order to correlate state 
with (operator_id, task_id) and time. This way any query about state could 
always point to the correct task. Is this feasible or adds too much overhead?

> Make the operator state queryable
> -
>
> Key: FLINK-7771
> URL: https://issues.apache.org/jira/browse/FLINK-7771
> Project: Flink
>  Issue Type: Improvement
>  Components: Queryable State
>Affects Versions: 1.4.0
>Reporter: Kostas Kloudas
>Assignee: Kostas Kloudas
>
> There seem to be some requests for making the operator (non-keyed) state 
> queryable. This means that the user will specify the *uuid* of the operator 
> and the *taskId*, and he will be able to access the state that corresponds to 
> that operator and for that specific task.
> This issue will serve to document the discussion on the topic, so that 
> everybody can participate.
> I also link [~till.rohrmann] and [~skonto] as he also mentioned that this 
> feature could be helpful.





[jira] [Comment Edited] (FLINK-7771) Make the operator state queryable

2017-10-12 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202692#comment-16202692
 ] 

Stavros Kontopoulos edited comment on FLINK-7771 at 10/12/17 10:50 PM:
---

[~kkl0u] We could overcome some problems by allowing Flink to inform an 
external system about state changes. If a re-assignment happens, the client 
that issues the queries should know. It could subscribe to that event channel 
(plus we should checkpoint state changes for recovery and for when the client 
wants to replay the event sequence) in order to correlate state with 
(operator_id, task_id) and time. This way any query about state could always 
point to the correct task. Is this feasible, or does it add too much overhead?


was (Author: skonto):
[~kkl0u] We could overcome some problems by allowing Flink to inform an 
external system about state changes. If re-assignment is done the client who 
issues the queries should know. It could subscribe to that event channel (plus 
checkpoint state changes for recovery and when client want to reply the event 
sequence) in order to correlate state with (operator_id, task_id) and time. 
This way any query about state could always point to the correct task. Is this 
feasible or adds too much overhead?

> Make the operator state queryable
> -
>
> Key: FLINK-7771
> URL: https://issues.apache.org/jira/browse/FLINK-7771
> Project: Flink
>  Issue Type: Improvement
>  Components: Queryable State
>Affects Versions: 1.4.0
>Reporter: Kostas Kloudas
>Assignee: Kostas Kloudas
>
> There seem to be some requests for making the operator (non-keyed) state 
> queryable. This means that the user will specify the *uuid* of the operator 
> and the *taskId*, and he will be able to access the state that corresponds to 
> that operator and for that specific task.
> This issue will serve to document the discussion on the topic, so that 
> everybody can participate.
> I also link [~till.rohrmann] and [~skonto] as he also mentioned that this 
> feature could be helpful.





[jira] [Comment Edited] (FLINK-7771) Make the operator state queryable

2017-10-12 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202692#comment-16202692
 ] 

Stavros Kontopoulos edited comment on FLINK-7771 at 10/12/17 10:52 PM:
---

[~kkl0u] We could overcome some problems by allowing Flink to inform an 
external system about state changes. If a re-assignment happens, the client 
that issues the queries should know. It could subscribe to that event channel 
(plus we should checkpoint state changes for recovery and for when the client 
wants to replay the event sequence, or we could write to a distributed log 
directly) in order to correlate state with (operator_id, task_id) and time. 
This way any query about state could always point to the correct task. Is this 
feasible, or does it add too much overhead?
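
As a rough sketch of the client side of this idea, the code below keeps a per-operator history of re-assignment events and answers "which task held this state at time t". Everything here is hypothetical: Flink does not expose such an event channel in this discussion, and the event shape (operator_id, partition, task_id, timestamp) and the class name `StateLocationIndex` are assumptions made only for illustration.

```python
import bisect


class StateLocationIndex:
    """Client-side index of which task served a piece of operator state over time."""

    def __init__(self):
        # (operator_id, partition) -> (sorted timestamps, task ids in the same order)
        self._history = {}

    def on_reassignment(self, operator_id, partition, task_id, timestamp):
        """Record that, from `timestamp` on, `task_id` serves this state partition."""
        times, tasks = self._history.setdefault((operator_id, partition), ([], []))
        # Events may arrive out of order (e.g. during replay), so insert in time order.
        pos = bisect.bisect_right(times, timestamp)
        times.insert(pos, timestamp)
        tasks.insert(pos, task_id)

    def task_at(self, operator_id, partition, timestamp):
        """Return the task that served the partition at `timestamp`, or None if unknown."""
        entry = self._history.get((operator_id, partition))
        if entry is None:
            return None
        times, tasks = entry
        # Latest event whose timestamp is <= the query time.
        pos = bisect.bisect_right(times, timestamp)
        return tasks[pos - 1] if pos else None
```

Replaying the persisted event sequence into a fresh index after a client restart would rebuild the same mapping, which is the recovery property mentioned above.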


was (Author: skonto):
[~kkl0u] We could overcome some problems by allowing Flink to inform an 
external system about state changes. If re-assignment is done the client who 
issues the queries should know. It could subscribe to that event channel (plus 
we should checkpoint state changes for recovery and when client wants to reply 
the event sequence) in order to correlate state with (operator_id, task_id) and 
time. This way any query about state could always point to the correct task. Is 
this feasible or adds too much overhead?

> Make the operator state queryable
> -
>
> Key: FLINK-7771
> URL: https://issues.apache.org/jira/browse/FLINK-7771
> Project: Flink
>  Issue Type: Improvement
>  Components: Queryable State
>Affects Versions: 1.4.0
>Reporter: Kostas Kloudas
>Assignee: Kostas Kloudas
>
> There seem to be some requests for making the operator (non-keyed) state 
> queryable. This means that the user will specify the *uuid* of the operator 
> and the *taskId*, and he will be able to access the state that corresponds to 
> that operator and for that specific task.
> This issue will serve to document the discussion on the topic, so that 
> everybody can participate.
> I also link [~till.rohrmann] and [~skonto] as he also mentioned that this 
> feature could be helpful.





[jira] [Comment Edited] (FLINK-7771) Make the operator state queryable

2017-10-12 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202692#comment-16202692
 ] 

Stavros Kontopoulos edited comment on FLINK-7771 at 10/12/17 10:47 PM:
---

[~kkl0u] We could overcome some problems by allowing Flink to inform an 
external system about state changes. If a re-assignment happens, the client 
that issues the queries should know. It could subscribe to that event channel 
(or push changes to a distributed log for recovery) in order to correlate state 
with (operator_id, task_id) and time. This way any query about state could 
always point to the correct task. Is this feasible, or does it add too much 
overhead?


was (Author: skonto):
[~kkl0u] We could overcome some problems by allowing Flink to inform an 
external system about state changes. If re-assignment is done the client who 
issues the queries should know. It could subscribe to that event channel (the 
client could push changes to a distributed log for recovery) in order to 
correlate state with (operator_id, task_id) and time. This way any query about 
state could always point to the correct task. Is this feasible or adds too much 
overhead?

> Make the operator state queryable
> -
>
> Key: FLINK-7771
> URL: https://issues.apache.org/jira/browse/FLINK-7771
> Project: Flink
>  Issue Type: Improvement
>  Components: Queryable State
>Affects Versions: 1.4.0
>Reporter: Kostas Kloudas
>Assignee: Kostas Kloudas
>
> There seem to be some requests for making the operator (non-keyed) state 
> queryable. This means that the user will specify the *uuid* of the operator 
> and the *taskId*, and he will be able to access the state that corresponds to 
> that operator and for that specific task.
> This issue will serve to document the discussion on the topic, so that 
> everybody can participate.
> I also link [~till.rohrmann] and [~skonto] as he also mentioned that this 
> feature could be helpful.





[jira] [Comment Edited] (FLINK-7771) Make the operator state queryable

2017-10-12 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202692#comment-16202692
 ] 

Stavros Kontopoulos edited comment on FLINK-7771 at 10/12/17 10:48 PM:
---

[~kkl0u] We could overcome some problems by allowing Flink to inform an 
external system about state changes. If a re-assignment happens, the client 
that issues the queries should know. It could subscribe to that event channel 
(plus checkpoint state changes for recovery and for when the client wants to 
replay the event sequence) in order to correlate state with (operator_id, 
task_id) and time. This way any query about state could always point to the 
correct task. Is this feasible, or does it add too much overhead?


was (Author: skonto):
[~kkl0u] We could overcome some problems by allowing Flink to inform an 
external system about state changes. If re-assignment is done the client who 
issues the queries should know. It could subscribe to that event channel (plus 
checkpoint state changes for recovery) in order to correlate state with 
(operator_id, task_id) and time. This way any query about state could always 
point to the correct task. Is this feasible or adds too much overhead?

> Make the operator state queryable
> -
>
> Key: FLINK-7771
> URL: https://issues.apache.org/jira/browse/FLINK-7771
> Project: Flink
>  Issue Type: Improvement
>  Components: Queryable State
>Affects Versions: 1.4.0
>Reporter: Kostas Kloudas
>Assignee: Kostas Kloudas
>
> There seem to be some requests for making the operator (non-keyed) state 
> queryable. This means that the user will specify the *uuid* of the operator 
> and the *taskId*, and he will be able to access the state that corresponds to 
> that operator and for that specific task.
> This issue will serve to document the discussion on the topic, so that 
> everybody can participate.
> I also link [~till.rohrmann] and [~skonto] as he also mentioned that this 
> feature could be helpful.





[jira] [Comment Edited] (FLINK-7771) Make the operator state queryable

2017-10-12 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202692#comment-16202692
 ] 

Stavros Kontopoulos edited comment on FLINK-7771 at 10/12/17 10:44 PM:
---

We could overcome some problems by allowing Flink to inform an external system 
about state changes. If a re-assignment happens, the client that issues the 
queries should know. It could subscribe to that event channel (the client could 
push changes to a distributed log for recovery) in order to correlate state 
with (operator_id, task_id) and time. This way any query about state could 
always point to the correct task. Is this feasible, or does it add too much 
overhead?


was (Author: skonto):
We could overcome some problems by allowing Flink to inform an external system 
about state changes. If re-assignment is done the client who issues the queries 
should know. It could subscribe to that event channel (or persisted log) in 
order to bind together state with (operator_id, task_id) and time. This way any 
query about state could always point to the correct task. Is this feasible or 
adds too much overhead?

> Make the operator state queryable
> -
>
> Key: FLINK-7771
> URL: https://issues.apache.org/jira/browse/FLINK-7771
> Project: Flink
>  Issue Type: Improvement
>  Components: Queryable State
>Affects Versions: 1.4.0
>Reporter: Kostas Kloudas
>Assignee: Kostas Kloudas
>
> There seem to be some requests for making the operator (non-keyed) state 
> queryable. This means that the user will specify the *uuid* of the operator 
> and the *taskId*, and he will be able to access the state that corresponds to 
> that operator and for that specific task.
> This issue will serve to document the discussion on the topic, so that 
> everybody can participate.
> I also link [~till.rohrmann] and [~skonto] as he also mentioned that this 
> feature could be helpful.





[jira] [Comment Edited] (FLINK-7771) Make the operator state queryable

2017-10-12 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202692#comment-16202692
 ] 

Stavros Kontopoulos edited comment on FLINK-7771 at 10/12/17 10:44 PM:
---

[~kkl0u] We could overcome some problems by allowing Flink to inform an 
external system about state changes. If a re-assignment happens, the client 
that issues the queries should know. It could subscribe to that event channel 
(the client could push changes to a distributed log for recovery) in order to 
correlate state with (operator_id, task_id) and time. This way any query about 
state could always point to the correct task. Is this feasible, or does it add 
too much overhead?


was (Author: skonto):
We could overcome some problems by allowing Flink to inform an external system 
about state changes. If re-assignment is done the client who issues the queries 
should know. It could subscribe to that event channel (the client could push 
changes to a distributed log for recovery) in order to correlate state with 
(operator_id, task_id) and time. This way any query about state could always 
point to the correct task. Is this feasible or adds too much overhead?

> Make the operator state queryable
> -
>
> Key: FLINK-7771
> URL: https://issues.apache.org/jira/browse/FLINK-7771
> Project: Flink
>  Issue Type: Improvement
>  Components: Queryable State
>Affects Versions: 1.4.0
>Reporter: Kostas Kloudas
>Assignee: Kostas Kloudas
>
> There seem to be some requests for making the operator (non-keyed) state 
> queryable. This means that the user will specify the *uuid* of the operator 
> and the *taskId*, and he will be able to access the state that corresponds to 
> that operator and for that specific task.
> This issue will serve to document the discussion on the topic, so that 
> everybody can participate.
> I also link [~till.rohrmann] and [~skonto] as he also mentioned that this 
> feature could be helpful.





[jira] [Commented] (FLINK-5005) Remove Scala 2.10 support; add Scala 2.12 support

2018-08-01 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-5005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565406#comment-16565406
 ] 

Stavros Kontopoulos commented on FLINK-5005:


[~aljoscha] since the [Flink closure 
cleaner|https://github.com/apache/flink/blob/d6126e7ca7a973635cc0d4ebaff52f35653df503/flink-scala/src/main/scala/org/apache/flink/api/scala/ClosureCleaner.scala#L33]
 relies on the Spark closure cleaner, I have an update: 
[https://github.com/apache/spark/pull/21930], so we should be close to adding 
support for Flink as well.

> Remove Scala 2.10 support; add Scala 2.12 support
> -
>
> Key: FLINK-5005
> URL: https://issues.apache.org/jira/browse/FLINK-5005
> Project: Flink
>  Issue Type: Improvement
>  Components: Scala API
>Reporter: Andrew Roberts
>Assignee: Aljoscha Krettek
>Priority: Major
> Fix For: 1.6.0
>
>
> Scala 2.12 was [released|http://www.scala-lang.org/news/2.12.0] today, and 
> offers many compile-time and runtime speed improvements. It would be great to 
> get artifacts up on maven central to allow Flink users to migrate to Scala 
> 2.12.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-6594) Implement Flink Dispatcher for Kubernetes

2018-03-14 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398419#comment-16398419
 ] 

Stavros Kontopoulos edited comment on FLINK-6594 at 3/14/18 11:38 AM:
--

[~till.rohrmann] Is there a task for making the ResourceManager aware of 
Kubernetes, for example for starting new TaskManagers? I don't see a task like 
"Implement FLIP-6 Kubernetes Resource Manager", although I see one each for 
Mesos and Yarn.


was (Author: skonto):
[~till.rohrmann] Is there a task for making the ResourceManager aware of 
Kubernetes? For example for starting new Taskmanagers.

> Implement Flink Dispatcher for Kubernetes
> -
>
> Key: FLINK-6594
> URL: https://issues.apache.org/jira/browse/FLINK-6594
> Project: Flink
>  Issue Type: New Feature
>  Components: Cluster Management
>Reporter: Larry Wu
>Assignee: Larry Wu
>Priority: Major
>  Labels: Kubernetes
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> This task is to implement a Flink Dispatcher for Kubernetes, which is 
> deployed to a Kubernetes cluster as a long-running pod. The Flink Dispatcher 
> accepts job submissions from Flink clients and asks the Kubernetes API Server 
> to create and monitor a virtual cluster consisting of a Flink JobManager pod 
> and Flink TaskManager pods.





[jira] [Commented] (FLINK-6594) Implement Flink Dispatcher for Kubernetes

2018-03-14 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398419#comment-16398419
 ] 

Stavros Kontopoulos commented on FLINK-6594:


[~till.rohrmann] Is there a task for making the ResourceManager aware of 
Kubernetes? For example for starting new Taskmanagers.

> Implement Flink Dispatcher for Kubernetes
> -
>
> Key: FLINK-6594
> URL: https://issues.apache.org/jira/browse/FLINK-6594
> Project: Flink
>  Issue Type: New Feature
>  Components: Cluster Management
>Reporter: Larry Wu
>Assignee: Larry Wu
>Priority: Major
>  Labels: Kubernetes
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> This task is to implement a Flink Dispatcher for Kubernetes, which is 
> deployed to a Kubernetes cluster as a long-running pod. The Flink Dispatcher 
> accepts job submissions from Flink clients and asks the Kubernetes API Server 
> to create and monitor a virtual cluster consisting of a Flink JobManager pod 
> and Flink TaskManager pods.





[jira] [Commented] (FLINK-6594) Implement Flink Dispatcher for Kubernetes

2018-03-19 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404641#comment-16404641
 ] 

Stavros Kontopoulos commented on FLINK-6594:


Sure will do [~till.rohrmann] 

> Implement Flink Dispatcher for Kubernetes
> -
>
> Key: FLINK-6594
> URL: https://issues.apache.org/jira/browse/FLINK-6594
> Project: Flink
>  Issue Type: New Feature
>  Components: Cluster Management
>Reporter: Larry Wu
>Assignee: Larry Wu
>Priority: Major
>  Labels: Kubernetes
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> This task is to implement a Flink Dispatcher for Kubernetes, which is 
> deployed to a Kubernetes cluster as a long-running pod. The Flink Dispatcher 
> accepts job submissions from Flink clients and asks the Kubernetes API Server 
> to create and monitor a virtual cluster consisting of a Flink JobManager pod 
> and Flink TaskManager pods.





[jira] [Comment Edited] (FLINK-10276) Job Manager and Task Manager Metrics Reporter Ports Configuration

2018-10-25 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664420#comment-16664420
 ] 

Stavros Kontopoulos edited comment on FLINK-10276 at 10/25/18 11:28 PM:


This might be a problem on K8s too, which does not cover port ranges: 
[https://github.com/prometheus/prometheus/issues/3756]


was (Author: skonto):
This might be a problem on K8s two, which does not cover port ranges: 
https://github.com/prometheus/prometheus/issues/3756

> Job Manager and Task Manager Metrics Reporter Ports Configuration
> -
>
> Key: FLINK-10276
> URL: https://issues.apache.org/jira/browse/FLINK-10276
> Project: Flink
>  Issue Type: New Feature
>  Components: Core
>Reporter: Deirdre Kong
>Priority: Major
>
> *Problem Statement:*
> When deploying Flink using YARN, the job manager and task manager can be on 
> the same node or different nodes.  Say I specify the port range to be 
> 9249-9250, if JM and TM are deployed on the same node, the port for JM will 
> be 9249 and the port for TM will be 9250.  If JM and TM are deployed on 
> different nodes, then the ports for JM and TM will be 9249.
> I can only configure Prometheus once for the ports to scrape JM and TM 
> metrics.  In this case, I won't know whether port 9249 is for JM or TM.  It 
> would be great if we could specify in flink-conf.yaml the ports we want for 
> the JM reporter and the TM reporter.
> *Comment from Till:*
> I think we could extend Vino's proposal for Yarn as well: Maybe it makes 
> sense to allow overriding certain configuration settings for the 
> TaskManagers when deploying on Yarn. That way one could define a fixed port 
> for the JM and a port range for the TMs. Having such a distinction you can 
> configure your Prometheus to scrape for the single JM and the TMs 
> individually. However, Flink does not yet support such a feature. You can 
> open a JIRA issue to track the problem.
>  
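
The port behaviour described in the problem statement above can be reproduced with a small model. This is a simplified sketch, not Flink's reporter code: it assumes each process takes the first port in the configured range that is still free on its own node, which matches the outcome reported in the issue. The function name `assign_reporter_ports` and the node names are made up for illustration.

```python
from typing import Dict, List, Set, Tuple


def assign_reporter_ports(
    placements: List[Tuple[str, str]], port_range: range
) -> Dict[str, int]:
    """Give each process the first port in port_range still free on its node.

    placements is a list of (process_name, node_name) pairs in start-up order.
    """
    used: Dict[str, Set[int]] = {}
    assigned: Dict[str, int] = {}
    for process, node in placements:
        taken = used.setdefault(node, set())
        port = next((p for p in port_range if p not in taken), None)
        if port is None:
            raise RuntimeError(f"no free port left on {node}")
        taken.add(port)
        assigned[process] = port
    return assigned


PORTS = range(9249, 9251)  # the 9249-9250 range from the issue

same_node = assign_reporter_ports([("JM", "node-a"), ("TM", "node-a")], PORTS)
different_nodes = assign_reporter_ports([("JM", "node-a"), ("TM", "node-b")], PORTS)

print(same_node)        # {'JM': 9249, 'TM': 9250}
print(different_nodes)  # {'JM': 9249, 'TM': 9249}
```

On different nodes both processes end up on 9249, so the port alone no longer tells a scraper whether it is talking to the JM or a TM. That ambiguity is what a separate fixed JM port and TM port range would remove.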



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

