[jira] [Commented] (FLINK-1514) [Gelly] Add a Gather-Sum-Apply iteration method

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504119#comment-14504119
 ] 

ASF GitHub Bot commented on FLINK-1514:
---

Github user vasia commented on the pull request:

https://github.com/apache/flink/pull/408#issuecomment-94606828
  
Hi @balidani!

I pulled your changes and made some improvements, mainly to the examples, 
so that they are consistent with the rest of Gelly's examples. You can see my 
changes in [this branch](https://github.com/vasia/flink/tree/gsa).

I've been running some experiments on a small cluster and so far everything 
runs smoothly. GSA SSSP actually seems to be faster than vertex-centric SSSP 
for both input datasets I'm using :smile:

I will run a few more experiments over the next few days, and if I find no 
problems and there are no objections, I will merge by the end of the week.


> [Gelly] Add a Gather-Sum-Apply iteration method
> ---
>
> Key: FLINK-1514
> URL: https://issues.apache.org/jira/browse/FLINK-1514
> Project: Flink
>  Issue Type: New Feature
>  Components: Gelly
>Affects Versions: 0.9
>Reporter: Vasia Kalavri
>Assignee: Daniel Bali
>
> This will be a method that implements the GAS computation model, but without 
> the "scatter" step. The phases can be mapped into the following steps inside 
> a delta iteration:
> gather: a map on each < srcVertex, edge, trgVertex > that produces a partial 
> value
> sum: a reduce that combines the partial values
> apply: join with vertex set to update the vertex values using the results of 
> sum and the previous state.
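The three steps above can be sketched as a plain-Java loop: a minimal illustration of one GSA superstep applied to SSSP, with hypothetical names, not the Gelly API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GsaSssp {

    /** A directed, weighted edge (illustrative type, not a Gelly class). */
    static class Edge {
        final int src, dst;
        final double weight;
        Edge(int src, int dst, double weight) { this.src = src; this.dst = dst; this.weight = weight; }
    }

    /**
     * One GSA superstep for SSSP:
     *   gather: map each <srcVertex, edge, trgVertex> to a partial value (dist[src] + weight)
     *   sum:    reduce the partial values per target vertex (min)
     *   apply:  join with the vertex state, keeping the smaller distance
     * Returns true if any vertex value changed (the delta-iteration convergence check).
     */
    static boolean superstep(double[] dist, List<Edge> edges) {
        Map<Integer, Double> sums = new HashMap<>();
        for (Edge e : edges) {                                   // gather
            double partial = dist[e.src] + e.weight;
            sums.merge(e.dst, partial, Math::min);               // sum
        }
        boolean changed = false;
        for (Map.Entry<Integer, Double> en : sums.entrySet()) {  // apply
            if (en.getValue() < dist[en.getKey()]) {
                dist[en.getKey()] = en.getValue();
                changed = true;
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        List<Edge> edges = List.of(new Edge(0, 1, 1.0), new Edge(1, 2, 2.0), new Edge(0, 2, 5.0));
        double[] dist = {0.0, Double.POSITIVE_INFINITY, Double.POSITIVE_INFINITY};
        while (superstep(dist, edges)) { }  // iterate until no distance improves
        System.out.println(Arrays.toString(dist)); // [0.0, 1.0, 3.0]
    }
}
```

In the real implementation the three phases become dataflow operators inside a delta iteration rather than in-memory loops, but the data movement per superstep is the same.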



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)



[jira] [Commented] (FLINK-1297) Add support for tracking statistics of intermediate results

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503979#comment-14503979
 ] 

ASF GitHub Bot commented on FLINK-1297:
---

Github user fhueske commented on a diff in the pull request:

https://github.com/apache/flink/pull/605#discussion_r28739811
  
--- Diff: 
flink-core/src/main/java/org/apache/flink/statistics/OperatorStatistics.java ---
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.statistics;
+
+import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+import com.clearspring.analytics.stream.cardinality.ICardinality;
+import com.clearspring.analytics.stream.cardinality.LinearCounting;
+import org.apache.flink.statistics.heavyhitters.IHeavyHitter;
+import org.apache.flink.statistics.heavyhitters.LossyCounting;
+import org.apache.flink.statistics.heavyhitters.CountMinHeavyHitter;
+import org.apache.flink.statistics.heavyhitters.HeavyHitterMergeException;
+
+import java.io.Serializable;
+import java.util.Map;
+
+/**
+ * Data structure that encapsulates statistical information of data that has only been processed by one pass.
+ * This statistical information is meant to help determine the distribution of the data that has been processed
+ * in an operator so that we can determine if it is necessary to repartition the data.
+ *
+ * The statistics to be gathered are configurable and represented by a {@link OperatorStatisticsConfig} object.
+ *
+ * The information encapsulated in this class is min, max, a structure enabling estimation of count distinct and a
+ * structure holding the heavy hitters along with their frequency.
+ */
+public class OperatorStatistics implements Serializable {
+
+   OperatorStatisticsConfig config;
+
+   Object min;
+   Object max;
+   ICardinality countDistinct;
+   IHeavyHitter heavyHitter;
+   long cardinality = 0;
+
+   public OperatorStatistics(OperatorStatisticsConfig config) {
+       this.config = config;
+       if (config.countDistinctAlgorithm.equals(OperatorStatisticsConfig.CountDistinctAlgorithm.LINEAR_COUNTING)) {
+           countDistinct = new LinearCounting(OperatorStatisticsConfig.COUNTD_BITMAP_SIZE);
+       }
+       if (config.countDistinctAlgorithm.equals(OperatorStatisticsConfig.CountDistinctAlgorithm.HYPERLOGLOG)) {
+           countDistinct = new HyperLogLog(OperatorStatisticsConfig.COUNTD_LOG2M);
+       }
+       if (config.heavyHitterAlgorithm.equals(OperatorStatisticsConfig.HeavyHitterAlgorithm.LOSSY_COUNTING)) {
+           heavyHitter = new LossyCounting(OperatorStatisticsConfig.HEAVY_HITTER_FRACTION,
+               OperatorStatisticsConfig.HEAVY_HITTER_ERROR);
+       }
+       if (config.heavyHitterAlgorithm.equals(OperatorStatisticsConfig.HeavyHitterAlgorithm.COUNT_MIN_SKETCH)) {
+           heavyHitter = new CountMinHeavyHitter(OperatorStatisticsConfig.HEAVY_HITTER_FRACTION,
+               OperatorStatisticsConfig.HEAVY_HITTER_ERROR,
+               OperatorStatisticsConfig.HEAVY_HITTER_CONFIDENCE,
+               OperatorStatisticsConfig.HEAVY_HITTER_SEED);
+       }
+   }
+
+   public void process(Object tupleObject){
--- End diff --

It looks like this method is expected to be called for each passing element if 
statistics collection is enabled. This could add significant processing 
overhead. Would it make sense to add an optional skip interval n that processes 
only every n-th element (except for increasing the `cardinality` counter)?

Also, it makes sense to perform as many checks outside of the function as 
possible. The type of the data elements is known before execution. So it can be 
checked if a data t
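The skip-interval idea raised in this review could look roughly like the sketch below. Class and method names are made up for illustration, and a plain HashSet stands in for the count-distinct sketches (LinearCounting/HyperLogLog); the point is only the control flow: the cheap cardinality counter is always maintained, while the expensive sketch updates run for every n-th element only.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sampling wrapper, not Flink API.
public class SampledStats {
    private final int skipInterval;
    private long cardinality = 0;
    private final Set<Object> distinctSample = new HashSet<>(); // stand-in for a real sketch

    public SampledStats(int skipInterval) {
        this.skipInterval = skipInterval;
    }

    public void process(Object element) {
        cardinality++;                          // cheap: always maintained
        if (cardinality % skipInterval != 0) {
            return;                             // skip the expensive path
        }
        distinctSample.add(element);            // expensive sketch update
    }

    public long getCardinality() { return cardinality; }
    public int sampledDistinct() { return distinctSample.size(); }
}
```

With a skip interval of n, the sketch-update cost drops by roughly a factor of n, at the price of estimates being computed from a sample.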




[jira] [Commented] (FLINK-1297) Add support for tracking statistics of intermediate results

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503961#comment-14503961
 ] 

ASF GitHub Bot commented on FLINK-1297:
---

Github user fhueske commented on a diff in the pull request:

https://github.com/apache/flink/pull/605#discussion_r28739333
  
--- Diff: 
flink-core/src/main/java/org/apache/flink/statistics/heavyhitters/IHeavyHitter.java
 ---
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.statistics.heavyhitters;
+
+import java.util.HashMap;
+
+/**
+ * Interface for classes that track heavy hitters. It follows the same 
design as the interfaces from
+ * {@link com.clearspring.analytics.stream}
+ */
+public interface IHeavyHitter {
--- End diff --

Interfaces do not follow a special naming convention in the Flink code base 
such as starting with a capital `I`. Please rename to `HeavyHitter` for 
consistency.


> Add support for tracking statistics of intermediate results
> ---
>
> Key: FLINK-1297
> URL: https://issues.apache.org/jira/browse/FLINK-1297
> Project: Flink
>  Issue Type: Improvement
>  Components: Distributed Runtime
>Reporter: Alexander Alexandrov
>Assignee: Alexander Alexandrov
> Fix For: 0.9
>
>   Original Estimate: 1,008h
>  Remaining Estimate: 1,008h
>
> One of the major problems related to the optimizer at the moment is the lack 
> of proper statistics.
> With the introduction of staged execution, it is possible to instrument the 
> runtime code with a statistics facility that collects the required 
> information for optimizing the next execution stage.
> I would therefore like to contribute code that can be used to gather basic 
> statistics for the (intermediate) result of dataflows (e.g. min, max, count, 
> count distinct) and make them available to the job manager.
> Before I start, I would like to hear some feedback form the other users.
> In particular, to handle skew (e.g. on grouping) it might be good to have 
> some sort of detailed sketch about the key distribution of an intermediate 
> result. I am not sure whether a simple histogram is the most effective way to 
> go. Maybe somebody would propose another lightweight sketch that provides 
> better accuracy.





[jira] [Commented] (FLINK-703) Use complete element as join key.

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503927#comment-14503927
 ] 

ASF GitHub Bot commented on FLINK-703:
--

Github user fhueske commented on the pull request:

https://github.com/apache/flink/pull/572#issuecomment-94591244
  
Thanks for the prompt update and the good work!
Will try the PR and merge it if everything works fine :-)


> Use complete element as join key.
> -
>
> Key: FLINK-703
> URL: https://issues.apache.org/jira/browse/FLINK-703
> Project: Flink
>  Issue Type: Improvement
>Reporter: GitHub Import
>Assignee: Chiwan Park
>Priority: Trivial
>  Labels: github-import
> Fix For: pre-apache
>
>
> In some situations such as semi-joins it could make sense to use a complete 
> element as join key. 
> Currently this can be done using a key-selector function, but we could offer 
> a shortcut for that.
> This is not an urgent issue, but might be helpful.
>  Imported from GitHub 
> Url: https://github.com/stratosphere/stratosphere/issues/703
> Created by: [fhueske|https://github.com/fhueske]
> Labels: enhancement, java api, user satisfaction, 
> Milestone: Release 0.6 (unplanned)
> Created at: Thu Apr 17 23:40:00 CEST 2014
> State: open
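The key-selector workaround described above can be sketched in plain Java: using an identity function as the key selector makes the complete element the join key. This is an illustration of the idea, not the Flink join API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class SelfKeyJoin {

    // Semi-join: keep elements of `left` whose key appears in `right`.
    // Passing Function.identity() as the selector uses the whole element as key.
    static <T, K> List<T> semiJoin(List<T> left, List<T> right, Function<T, K> key) {
        Set<K> rightKeys = new HashSet<>();
        for (T r : right) rightKeys.add(key.apply(r));
        List<T> out = new ArrayList<>();
        for (T l : left) {
            if (rightKeys.contains(key.apply(l))) out.add(l);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> a = List.of("x", "y", "z");
        List<String> b = List.of("y", "z", "w");
        System.out.println(semiJoin(a, b, Function.identity())); // [y, z]
    }
}
```

The proposed shortcut would simply spare users from writing the identity selector themselves.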







[jira] [Commented] (FLINK-1486) Add a string to the print method to identify output

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503920#comment-14503920
 ] 

ASF GitHub Bot commented on FLINK-1486:
---

Github user fhueske commented on the pull request:

https://github.com/apache/flink/pull/372#issuecomment-94590741
  
LGTM


> Add a string to the print method to identify output
> ---
>
> Key: FLINK-1486
> URL: https://issues.apache.org/jira/browse/FLINK-1486
> Project: Flink
>  Issue Type: Improvement
>  Components: Local Runtime
>Reporter: Maximilian Michels
>Assignee: Maximilian Michels
>Priority: Minor
>  Labels: usability
> Fix For: 0.9
>
>
> The output of the {{print}} method of {{DataSet}} is mainly used for debugging 
> purposes. Currently, it is difficult to identify the output.
> I would suggest adding another {{print(String str)}} method which allows the 
> user to supply a String to identify the output. This could be a prefix before 
> the actual output or a format string (which might be overkill).
> {code}
> DataSet data = env.fromElements(1,2,3,4,5);
> {code}
> For example, {{data.print("MyDataSet: ")}} would print
> {noformat}
> MyDataSet: 1
> MyDataSet: 2
> ...
> {noformat}
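The proposed behaviour can be sketched in a few lines (a plain-Java stand-in, not the actual DataSet API):

```java
import java.util.List;

public class PrefixedPrint {

    // Prefix each element with the user-supplied identifier string.
    static String format(String prefix, Object element) {
        return prefix + element;
    }

    public static void main(String[] args) {
        for (int v : List.of(1, 2, 3, 4, 5)) {
            System.out.println(format("MyDataSet: ", v)); // MyDataSet: 1 ... MyDataSet: 5
        }
    }
}
```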







[jira] [Commented] (FLINK-1908) JobManager startup delay isn't considered when using start-cluster.sh script

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503801#comment-14503801
 ] 

ASF GitHub Bot commented on FLINK-1908:
---

Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/609#issuecomment-94581541
  
@DarkKnightCZ that sounds strange. The TM should not terminate itself if it 
cannot connect to the JM unless a maximum registration duration has been 
configured. Could you link the log file of one of the failed TMs? That would 
allow us to investigate the problem more thoroughly.


> JobManager startup delay isn't considered when using start-cluster.sh script
> 
>
> Key: FLINK-1908
> URL: https://issues.apache.org/jira/browse/FLINK-1908
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Runtime
>Affects Versions: 0.9, 0.8.1
> Environment: Linux
>Reporter: Lukas Raska
>Priority: Minor
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> When starting Flink cluster via start-cluster.sh script, JobManager startup 
> can be delayed (as it's started asynchronously), which can result in failed 
> startup of several task managers.
> Solution is to wait certain amount of time and periodically check if RPC port 
> is accessible, then proceed with starting task managers.
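The wait-and-poll check proposed in the description could be sketched as below. The actual fix would live in the bash startup scripts; this Java helper is only an illustration of the polling logic, with hypothetical names.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class WaitForPort {

    // Poll host:port until a TCP connect succeeds or the timeout expires.
    // Returns true if the port became reachable in time.
    static boolean waitForPort(String host, int port, long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            try (Socket s = new Socket()) {
                s.connect(new InetSocketAddress(host, port), 250);
                return true;                     // port open: JobManager RPC is up
            } catch (IOException e) {
                try {
                    Thread.sleep(100);           // not yet reachable: retry
                } catch (InterruptedException ie) {
                    return false;
                }
            }
        }
        return false;                            // timed out
    }
}
```

start-cluster.sh would call such a check (e.g. via `nc -z` in bash) after launching the JobManager and before starting the TaskManagers.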







[jira] [Commented] (FLINK-1719) Add naive Bayes classification algorithm to machine learning library

2015-04-20 Thread Jonathan Hasenburg (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503788#comment-14503788
 ] 

Jonathan Hasenburg commented on FLINK-1719:
---

I will write my Bachelor Thesis about the implementation of the algorithm among 
other things. So please don't "steal" my topic :)

> Add naive Bayes classification algorithm to machine learning library
> 
>
> Key: FLINK-1719
> URL: https://issues.apache.org/jira/browse/FLINK-1719
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Jonathan Hasenburg
>  Labels: ML
>
> Add naive Bayes algorithm to Flink's machine learning library as a basic 
> classification algorithm. Maybe we can incorporate some of the improvements 
> developed by [Karl-Michael 
> Schneider|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.59.2085&rep=rep1&type=pdf],
>  [Sang-Bum Kim et 
> al.|http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=1704799] or 
> [Jason Rennie et 
> al.|http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf] into the 
> implementation.
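For reference, the basic form of the algorithm (before any of the cited refinements) can be sketched as a toy multinomial naive Bayes classifier with log-probabilities, add-one smoothing, and uniform class priors. Everything here is illustrative; the real library implementation would work on DataSets.

```java
import java.util.Map;

public class TinyNb {

    // counts: class -> (word -> count). Picks the class maximizing the
    // smoothed log-likelihood of the document, assuming uniform priors.
    static String classify(Map<String, Map<String, Integer>> counts, String[] doc, int vocabSize) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            int total = e.getValue().values().stream().mapToInt(Integer::intValue).sum();
            double score = 0;                                   // uniform class prior
            for (String w : doc) {                              // Laplace-smoothed likelihood
                score += Math.log((e.getValue().getOrDefault(w, 0) + 1.0) / (total + vocabSize));
            }
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return best;
    }
}
```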





[jira] [Commented] (FLINK-1758) Extend Gelly's neighborhood methods

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503665#comment-14503665
 ] 

ASF GitHub Bot commented on FLINK-1758:
---

Github user andralungu commented on the pull request:

https://github.com/apache/flink/pull/576#issuecomment-94572014
  
Apart from the fact that I don't agree with those two methods having the 
same name, everything else was updated. 


> Extend Gelly's neighborhood methods
> ---
>
> Key: FLINK-1758
> URL: https://issues.apache.org/jira/browse/FLINK-1758
> Project: Flink
>  Issue Type: Improvement
>  Components: Gelly
>Affects Versions: 0.9
>Reporter: Vasia Kalavri
>Assignee: Andra Lungu
>
> Currently, the neighborhood methods only allow returning a single value per 
> vertex. In many cases, it is desirable to return several or no value per 
> vertex. This is the case in clustering coefficient computation, 
> vertex-centric jaccard, algorithms where a vertex computes a value per edge 
> or when a vertex computes a value only for some of its neighbors.
> This issue proposes to 
> - change the current reduceOnEdges/reduceOnNeighbors methods to use 
> combinable reduce operations where possible
> - provide groupReduce-versions, which will use a Collector and allow 
> returning none or more values per vertex.
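The difference between the two variants proposed above is that a reduce returns exactly one value per vertex, while a groupReduce with a Collector may emit zero or more. A plain-Java sketch of the groupReduce shape (hypothetical names, not the Gelly API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class NeighborhoodSketch {

    // groupReduce-style neighborhood function: the collector may be called
    // zero, one, or several times per vertex. Here: emit one value per
    // neighbor whose value exceeds a threshold.
    static void groupReduceOnNeighbors(List<Integer> neighborValues, Consumer<Integer> collector) {
        for (int v : neighborValues) {
            if (v > 10) collector.accept(v * 2);
        }
    }

    public static void main(String[] args) {
        List<Integer> out = new ArrayList<>();
        groupReduceOnNeighbors(List.of(5, 20, 30), out::add);
        groupReduceOnNeighbors(List.of(1), out::add);  // emits nothing for this vertex
        System.out.println(out); // [40, 60]
    }
}
```

A combinable reduce cannot express this, which is why clustering coefficient or vertex-centric Jaccard need the collector variant.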







[jira] [Commented] (FLINK-1907) Scala Interactive Shell

2015-04-20 Thread Nikolaas Steenbergen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503349#comment-14503349
 ] 

Nikolaas Steenbergen commented on FLINK-1907:
-

Hey, I don't know if I'm doing this correctly.
I've cloned and branched the Flink repository and added a wrapper for the Scala 
ILoop:

[https://github.com/nikste/flink/tree/Scala_shell]

You can put in a custom welcome message and prompt (which is certainly nice).
It can write out the compiled code from user input via "scala-shell> 
writeFlinkVD" to the /tmp/ directory.

It will write a bunch of directories (line0, line1, etc.) with a bunch of classes, 
read.class and eval.class.

Initially I thought you could compile a jar out of it and send it somehow to 
Flink to process it. 
However, I am no longer sure whether this can solve the ClassNotFoundException.




> Scala Interactive Shell
> ---
>
> Key: FLINK-1907
> URL: https://issues.apache.org/jira/browse/FLINK-1907
> Project: Flink
>  Issue Type: New Feature
>  Components: Scala API
>Reporter: Nikolaas Steenbergen
>Assignee: Nikolaas Steenbergen
>Priority: Minor
>
> Build an interactive Shell for the Scala api.





[jira] [Created] (FLINK-1915) Faulty plan selection by optimizer

2015-04-20 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-1915:


 Summary: Faulty plan selection by optimizer
 Key: FLINK-1915
 URL: https://issues.apache.org/jira/browse/FLINK-1915
 Project: Flink
  Issue Type: Bug
Reporter: Till Rohrmann
Priority: Minor


For certain jobs, the optimizer selects a sub-optimal execution plan. 

For example, the {{WebLogAnalysis}} example job contains a coGroup input which 
consists of a {{Filter}} and a subsequent {{Projection}}. The optimizer inserts 
a hash partitioning between the filter and the mapper (projection) and a 
sorting after the projection. It would be more efficient to do the hash 
partitioning after the projection, because the data is smaller at that stage.

I could observe a similar behaviour for a larger job, where the hash 
partitioning was executed before a filter operation which was then used as 
input for a join operator. I suspect that the optimizer considers the two plans 
(hash partitioning before the filter and after the filter) as equivalent in the 
absence of proper size estimates. However, executing the hash partitioning 
after the filter should always be more efficient.
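The claim in the last sentence follows from simple record counting; a back-of-envelope sketch with illustrative numbers:

```java
public class ShipCost {

    // Records shipped over the network by the partitioning step, depending
    // on whether it runs before or after a filter of the given selectivity.
    static long shipped(long records, double selectivity, boolean partitionAfterFilter) {
        return partitionAfterFilter
            ? (long) (records * selectivity)  // filter first: only survivors are shipped
            : records;                        // partition first: everything is shipped
    }

    public static void main(String[] args) {
        long n = 1_000_000;
        double sel = 0.1;   // filter keeps 10% of the records
        System.out.println(shipped(n, sel, false) + " vs " + shipped(n, sel, true));
    }
}
```

Since selectivity is at most 1, partitioning after the filter never ships more records, which is why the two plans should not be treated as equivalent even without size estimates.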





[jira] [Commented] (FLINK-703) Use complete element as join key.

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503224#comment-14503224
 ] 

ASF GitHub Bot commented on FLINK-703:
--

Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/572#issuecomment-94512484
  
I updated this PR. Thanks for the advice. :)

* Remove `setParallelism(1)` in test code.
* Simplify `testGroupReduceWithAtomicValue` in `GroupReduceITCase`.
* Add a `null` check for `expressionsIn` in constructor of `ExpressionKeys`.


> Use complete element as join key.
> -
>
> Key: FLINK-703
> URL: https://issues.apache.org/jira/browse/FLINK-703
> Project: Flink
>  Issue Type: Improvement
>Reporter: GitHub Import
>Assignee: Chiwan Park
>Priority: Trivial
>  Labels: github-import
> Fix For: pre-apache
>
>
> In some situations such as semi-joins it could make sense to use a complete 
> element as join key. 
> Currently this can be done using a key-selector function, but we could offer 
> a shortcut for that.
> This is not an urgent issue, but might be helpful.
>  Imported from GitHub 
> Url: https://github.com/stratosphere/stratosphere/issues/703
> Created by: [fhueske|https://github.com/fhueske]
> Labels: enhancement, java api, user satisfaction, 
> Milestone: Release 0.6 (unplanned)
> Created at: Thu Apr 17 23:40:00 CEST 2014
> State: open







[jira] [Commented] (FLINK-1486) Add a string to the print method to identify output

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503147#comment-14503147
 ] 

ASF GitHub Bot commented on FLINK-1486:
---

Github user mxm commented on the pull request:

https://github.com/apache/flink/pull/372#issuecomment-94502825
  
I've updated the pull request. I decided to implement the concise method:
```
sinkId:taskId> output  <- sink id provided, parallelism > 1
sinkId> output <- sink id provided, parallelism == 1
taskId> output <- no sink id provided, parallelism > 1
output <- no sink id provided, parallelism == 1
```
If there are no objections, I will merge this tomorrow.
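For illustration, the decision table above could be composed roughly as follows. This is a hypothetical plain-Java sketch (the `prefix` helper and its signature are invented), not the actual implementation in the pull request:

```java
public class PrintPrefix {
    // Builds the "sinkId:taskId> " / "sinkId> " / "taskId> " / "" prefix
    // described in the four cases above.
    static String prefix(String sinkId, int taskId, int parallelism) {
        StringBuilder sb = new StringBuilder();
        if (sinkId != null) {
            sb.append(sinkId);
            if (parallelism > 1) {
                sb.append(':').append(taskId);
            }
        } else if (parallelism > 1) {
            sb.append(taskId);
        }
        if (sb.length() > 0) {
            sb.append("> ");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(prefix("mySink", 3, 4) + "output"); // mySink:3> output
        System.out.println(prefix("mySink", 0, 1) + "output"); // mySink> output
        System.out.println(prefix(null, 3, 4) + "output");     // 3> output
        System.out.println(prefix(null, 0, 1) + "output");     // output
    }
}
```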


> Add a string to the print method to identify output
> ---
>
> Key: FLINK-1486
> URL: https://issues.apache.org/jira/browse/FLINK-1486
> Project: Flink
>  Issue Type: Improvement
>  Components: Local Runtime
>Reporter: Maximilian Michels
>Assignee: Maximilian Michels
>Priority: Minor
>  Labels: usability
> Fix For: 0.9
>
>
> The output of the {{print}} method of {{DataSet}} is mainly used for 
> debugging purposes. Currently, it is difficult to identify the output.
> I would suggest adding another {{print(String str)}} method which allows the 
> user to supply a String to identify the output. This could be a prefix before 
> the actual output or a format string (which might be overkill).
> {code}
> DataSet<Integer> data = env.fromElements(1, 2, 3, 4, 5);
> {code}
> For example, {{data.print("MyDataSet: ")}} would print
> {noformat}
> MyDataSet: 1
> MyDataSet: 2
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[GitHub] flink pull request: [FLINK-703] Use complete element as join key

2015-04-20 Thread chiwanpark
Github user chiwanpark commented on a diff in the pull request:

https://github.com/apache/flink/pull/572#discussion_r28699341
  
--- Diff: 
flink-tests/src/test/java/org/apache/flink/test/javaApiOperators/GroupReduceITCase.java
 ---
@@ -1063,6 +1065,33 @@ public void reduce(Iterable> values
 
}
 
+	@Test
+	public void testGroupReduceWithAtomicValue() throws Exception {
+		final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
+		DataSet<Integer> ds = env.fromElements(1, 1, 2, 3, 4);
+		DataSet<Integer> reduceDs = ds.groupBy("*").reduceGroup(new GroupReduceFunction<Integer, Integer>() {
+			@Override
+			public void reduce(Iterable<Integer> values, Collector<Integer> out) throws Exception {
+				Set<Integer> set = new HashSet<Integer>();
+				for (Integer i : values) {
+					set.add(i);
+				}
+				for (Integer i : set) {
+					out.collect(i);
+				}
+			}
+		});
+
+   env.setParallelism(1);
--- End diff --

Oh, I just copied code from other test case. I will fix it :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---




[jira] [Commented] (FLINK-1799) Scala API does not support generic arrays

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503030#comment-14503030
 ] 

ASF GitHub Bot commented on FLINK-1799:
---

Github user aljoscha commented on the pull request:

https://github.com/apache/flink/pull/582#issuecomment-94483743
  
I fixed @StephanEwen's complaint. It was incorrect, but the type class of the 
TypeInformation does not seem to be used anywhere it matters.


> Scala API does not support generic arrays
> -
>
> Key: FLINK-1799
> URL: https://issues.apache.org/jira/browse/FLINK-1799
> Project: Flink
>  Issue Type: Bug
>Reporter: Till Rohrmann
>Assignee: Aljoscha Krettek
>
> The Scala API does not support generic arrays at the moment. It throws a 
> rather unhelpful error message ```InvalidTypesException: The given type is 
> not a valid object array```.
> Code to reproduce the problem is given below:
> {code}
> def main(args: Array[String]) {
>   foobar[Double]
> }
> def foobar[T: ClassTag: TypeInformation]: DataSet[Block[T]] = {
>   val tpe = createTypeInformation[Array[T]]
>   null
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Commented] (FLINK-703) Use complete element as join key.

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503018#comment-14503018
 ] 

ASF GitHub Bot commented on FLINK-703:
--

Github user fhueske commented on the pull request:

https://github.com/apache/flink/pull/572#issuecomment-94482477
  
Thanks for the update. Looks really good! 
I have just a few inline comments. Once these are resolved, I would try the 
code and merge it if everything works.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Commented] (FLINK-703) Use complete element as join key.

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503011#comment-14503011
 ] 

ASF GitHub Bot commented on FLINK-703:
--

Github user fhueske commented on a diff in the pull request:

https://github.com/apache/flink/pull/572#discussion_r28697764
  
--- Diff: 
flink-java/src/main/java/org/apache/flink/api/java/operators/Keys.java ---
@@ -274,24 +274,31 @@ private static int countNestedElementsBefore(CompositeType compositeType, int
 	 * Create ExpressionKeys from String-expressions
 	 */
 	public ExpressionKeys(String[] expressionsIn, TypeInformation<T> type) {
-		if(!(type instanceof CompositeType)) {
-			throw new IllegalArgumentException("Key expressions are only supported on POJO types and Tuples. "
-				+ "A type is considered a POJO if all its fields are public, or have both getters and setters defined");
-		}
-		CompositeType<T> cType = (CompositeType<T>) type;
-
-		String[] expressions = removeDuplicates(expressionsIn);
-		if(expressionsIn.length != expressions.length) {
-			LOG.warn("The key expressions contained duplicates. They are now unique");
-		}
-		// extract the keys on their flat position
-		keyFields = new ArrayList<FlatFieldDescriptor>(expressions.length);
-		for (int i = 0; i < expressions.length; i++) {
-			List<FlatFieldDescriptor> keys = cType.getFlatFields(expressions[i]); // use separate list to do a size check
-			if(keys.size() == 0) {
-				throw new IllegalArgumentException("Unable to extract key from expression '"+expressions[i]+"' on key "+cType);
+   if (type instanceof AtomicType) {
--- End diff --

can you add a `null` check for `expressionsIn`?
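The requested guard might look roughly like this. A minimal plain-Java sketch with an invented class name and simplified constructor, not the actual Flink `ExpressionKeys` code:

```java
import java.util.Arrays;

public class ExpressionKeysSketch {
    private final String[] expressions;

    public ExpressionKeysSketch(String[] expressionsIn) {
        // The guard asked for in the review: fail fast on a null expression array.
        if (expressionsIn == null) {
            throw new IllegalArgumentException("Field expressions must not be null.");
        }
        this.expressions = Arrays.copyOf(expressionsIn, expressionsIn.length);
    }

    public int numKeys() {
        return expressions.length;
    }

    public static void main(String[] args) {
        System.out.println(new ExpressionKeysSketch(new String[]{"*"}).numKeys());
        try {
            new ExpressionKeysSketch(null);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected null");
        }
    }
}
```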





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Commented] (FLINK-703) Use complete element as join key.

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502996#comment-14502996
 ] 

ASF GitHub Bot commented on FLINK-703:
--

Github user fhueske commented on a diff in the pull request:

https://github.com/apache/flink/pull/572#discussion_r28696934
  
--- Diff: 
flink-tests/src/test/java/org/apache/flink/test/javaApiOperators/GroupReduceITCase.java
 ---
@@ -1063,6 +1065,33 @@ public void reduce(Iterable> values
 
}
 
+	@Test
+	public void testGroupReduceWithAtomicValue() throws Exception {
+		final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
+		DataSet<Integer> ds = env.fromElements(1, 1, 2, 3, 4);
+		DataSet<Integer> reduceDs = ds.groupBy("*").reduceGroup(new GroupReduceFunction<Integer, Integer>() {
+			@Override
+			public void reduce(Iterable<Integer> values, Collector<Integer> out) throws Exception {
+				Set<Integer> set = new HashSet<Integer>();
+				for (Integer i : values) {
+					set.add(i);
+				}
+				for (Integer i : set) {
+					out.collect(i);
+				}
+			}
+		});
+
+   env.setParallelism(1);
--- End diff --

Why do you set the parallelism to 1?





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Commented] (FLINK-703) Use complete element as join key.

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502997#comment-14502997
 ] 

ASF GitHub Bot commented on FLINK-703:
--

Github user fhueske commented on a diff in the pull request:

https://github.com/apache/flink/pull/572#discussion_r28696965
  
--- Diff: 
flink-tests/src/test/java/org/apache/flink/test/javaApiOperators/JoinITCase.java
 ---
@@ -663,6 +663,39 @@ public void testNonPojoToVerifyNestedTupleElementSelectionWithFirstKeyFieldGreat
 			"((3,2,Hello world),(3,2,Hello world)),((3,2,Hello world),(3,2,Hello world))\n";
 	}
 
+	@Test
+	public void testJoinWithAtomicType1() throws Exception {
+		final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
+
+		DataSet<Tuple3<Integer, Long, String>> ds1 = CollectionDataSets.getSmall3TupleDataSet(env);
+		DataSet<Integer> ds2 = env.fromElements(1, 2);
+
+		DataSet<Tuple2<Tuple3<Integer, Long, String>, Integer>> joinDs = ds1.join(ds2).where(0).equalTo("*");
+
+		joinDs.writeAsCsv(resultPath);
+		env.setParallelism(1);
--- End diff --

Why parallelism == 1?





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-703) Use complete element as join key.

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502998#comment-14502998
 ] 

ASF GitHub Bot commented on FLINK-703:
--

Github user fhueske commented on a diff in the pull request:

https://github.com/apache/flink/pull/572#discussion_r28696976
  
--- Diff: 
flink-tests/src/test/java/org/apache/flink/test/javaApiOperators/JoinITCase.java
 ---
@@ -663,6 +663,39 @@ public void testNonPojoToVerifyNestedTupleElementSelectionWithFirstKeyFieldGreat
 			"((3,2,Hello world),(3,2,Hello world)),((3,2,Hello world),(3,2,Hello world))\n";
 	}
 
+	@Test
+	public void testJoinWithAtomicType1() throws Exception {
+		final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
+
+		DataSet<Tuple3<Integer, Long, String>> ds1 = CollectionDataSets.getSmall3TupleDataSet(env);
+		DataSet<Integer> ds2 = env.fromElements(1, 2);
+
+		DataSet<Tuple2<Tuple3<Integer, Long, String>, Integer>> joinDs = ds1.join(ds2).where(0).equalTo("*");
+
+		joinDs.writeAsCsv(resultPath);
+		env.setParallelism(1);
+		env.execute();
+
+		expected = "(1,1,Hi),1\n" +
+			"(2,2,Hello),2";
+	}
+
+	public void testJoinWithAtomicType2() throws Exception {
+		final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
+
+		DataSet<Integer> ds1 = env.fromElements(1, 2);
+		DataSet<Tuple3<Integer, Long, String>> ds2 = CollectionDataSets.getSmall3TupleDataSet(env);
+
+		DataSet<Tuple2<Integer, Tuple3<Integer, Long, String>>> joinDs = ds1.join(ds2).where("*").equalTo(0);
+
+		joinDs.writeAsCsv(resultPath);
+		env.setParallelism(1);
--- End diff --

Why parallelism == 1?





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)






[jira] [Commented] (FLINK-703) Use complete element as join key.

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502995#comment-14502995
 ] 

ASF GitHub Bot commented on FLINK-703:
--

Github user fhueske commented on a diff in the pull request:

https://github.com/apache/flink/pull/572#discussion_r28696843
  
--- Diff: 
flink-tests/src/test/java/org/apache/flink/test/javaApiOperators/GroupReduceITCase.java
 ---
@@ -1063,6 +1065,33 @@ public void reduce(Iterable> values
 
}
 
+	@Test
+	public void testGroupReduceWithAtomicValue() throws Exception {
+		final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
+		DataSet<Integer> ds = env.fromElements(1, 1, 2, 3, 4);
+		DataSet<Integer> reduceDs = ds.groupBy("*").reduceGroup(new GroupReduceFunction<Integer, Integer>() {
+			@Override
+			public void reduce(Iterable<Integer> values, Collector<Integer> out) throws Exception {
+				Set<Integer> set = new HashSet<Integer>();
--- End diff --

Since all elements of a group are identical when you do `groupBy("*")`, this 
could be implemented more simply by just returning the first element of the 
group (`out.collect(values.iterator().next())`).
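The suggestion can be illustrated outside Flink: for a group whose elements are all identical, deduplicating through a set and emitting only the first element give the same result. A plain-Java sketch (helper names invented here):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FirstElementSketch {
    // Dedup via a set, as in the test under review.
    static List<Integer> viaSet(List<Integer> group) {
        Set<Integer> set = new HashSet<Integer>(group);
        return new ArrayList<Integer>(set);
    }

    // The reviewer's shortcut: all elements of a groupBy("*") group are
    // identical, so emitting the first one is enough.
    static List<Integer> viaFirst(List<Integer> group) {
        return Collections.singletonList(group.iterator().next());
    }

    public static void main(String[] args) {
        List<Integer> group = Arrays.asList(1, 1, 1);
        System.out.println(viaSet(group));   // [1]
        System.out.println(viaFirst(group)); // [1]
    }
}
```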





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Commented] (FLINK-1867) TaskManagerFailureRecoveryITCase causes stalled travis builds

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502967#comment-14502967
 ] 

ASF GitHub Bot commented on FLINK-1867:
---

GitHub user aljoscha opened a pull request:

https://github.com/apache/flink/pull/612

[FLINK-1867/1880] Raise test timeouts in hope of fixing Travis fails



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aljoscha/flink raise-test-timeouts

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/612.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #612


commit 4df27ee0ec1ce68376d51c4a882116651fd52788
Author: Aljoscha Krettek 
Date:   2015-04-13T14:16:51Z

[FLINK-1867/1880] Raise test timeouts in hope of fixing Travis fails




> TaskManagerFailureRecoveryITCase causes stalled travis builds
> -
>
> Key: FLINK-1867
> URL: https://issues.apache.org/jira/browse/FLINK-1867
> Project: Flink
>  Issue Type: Bug
>  Components: TaskManager, Tests
>Affects Versions: 0.9
>Reporter: Robert Metzger
>Assignee: Aljoscha Krettek
>
> There are currently tests on travis failing:
> https://travis-ci.org/apache/flink/jobs/57943063



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Commented] (FLINK-1914) Wrong FS while starting YARN session without correct HADOOP_HOME

2015-04-20 Thread Robert Metzger (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502939#comment-14502939
 ] 

Robert Metzger commented on FLINK-1914:
---

I have to admit it's not very prominent, but it's there ;)
I'll add a catch around the FS access to make the error message a bit more 
helpful.

> Wrong FS while starting YARN session without correct HADOOP_HOME
> 
>
> Key: FLINK-1914
> URL: https://issues.apache.org/jira/browse/FLINK-1914
> Project: Flink
>  Issue Type: Bug
>  Components: YARN Client
>Reporter: Zoltán Zvara
>Priority: Trivial
>  Labels: yarn, yarn-client
>
> When a YARN session is invoked ({{yarn-session.sh}}) without a correct 
> {{HADOOP_HOME}}, the AM is still deployed (for example to {{0.0.0.0:8032}}), but 
> it fails with an {{IllegalArgumentException}}:
> {code}
> java.lang.IllegalArgumentException: Wrong FS: 
> file:/home/.../flink-dist-0.9-SNAPSHOT.jar, expected: hdfs://localhost:9000
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:181)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:92)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1106)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1102)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1102)
>   at org.apache.flink.yarn.Utils.registerLocalResource(Utils.java:105)
>   at 
> org.apache.flink.yarn.ApplicationMasterActor$$anonfun$org$apache$flink$yarn$ApplicationMasterActor$$startYarnSession$2.apply(ApplicationMasterActor.scala:436)
>   at 
> org.apache.flink.yarn.ApplicationMasterActor$$anonfun$org$apache$flink$yarn$ApplicationMasterActor$$startYarnSession$2.apply(ApplicationMasterActor.scala:371)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> org.apache.flink.yarn.ApplicationMasterActor$class.org$apache$flink$yarn$ApplicationMasterActor$$startYarnSession(ApplicationMasterActor.scala:371)
>   at 
> org.apache.flink.yarn.ApplicationMasterActor$$anonfun$receiveYarnMessages$1.applyOrElse(ApplicationMasterActor.scala:155)
>   at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
>   at 
> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37)
>   at 
> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30)
>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>   at 
> org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>   at 
> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:94)
> {code}
> IMO this {{IllegalArgumentException}} should get handled in 
> {{org.apache.flink.yarn.Utils.registerLocalResource}} or on an upper level to 
> provide a better error message. This needs to be looked up from YARN logs at 
> the moment, which is painful for a trivial mistake like missing 
> {{HADOOP_HOME}}.





[jira] [Commented] (FLINK-1914) Wrong FS while starting YARN session without correct HADOOP_HOME

2015-04-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/FLINK-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502932#comment-14502932
 ] 

Zoltán Zvara commented on FLINK-1914:
-

There is a warn! Thanks!

> Wrong FS while starting YARN session without correct HADOOP_HOME
> 
>
> Key: FLINK-1914
> URL: https://issues.apache.org/jira/browse/FLINK-1914
> Project: Flink
>  Issue Type: Bug
>  Components: YARN Client
>Reporter: Zoltán Zvara
>Priority: Trivial
>  Labels: yarn, yarn-client
>
> When a YARN session is invoked ({{yarn-session.sh}}) without a correct 
> {{HADOOP_HOME}}, the AM is still deployed (for example to {{0.0.0.0:8032}}), but 
> it fails with an {{IllegalArgumentException}}:
> {code}
> java.lang.IllegalArgumentException: Wrong FS: 
> file:/home/.../flink-dist-0.9-SNAPSHOT.jar, expected: hdfs://localhost:9000
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:181)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:92)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1106)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1102)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1102)
>   at org.apache.flink.yarn.Utils.registerLocalResource(Utils.java:105)
>   at 
> org.apache.flink.yarn.ApplicationMasterActor$$anonfun$org$apache$flink$yarn$ApplicationMasterActor$$startYarnSession$2.apply(ApplicationMasterActor.scala:436)
>   at 
> org.apache.flink.yarn.ApplicationMasterActor$$anonfun$org$apache$flink$yarn$ApplicationMasterActor$$startYarnSession$2.apply(ApplicationMasterActor.scala:371)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> org.apache.flink.yarn.ApplicationMasterActor$class.org$apache$flink$yarn$ApplicationMasterActor$$startYarnSession(ApplicationMasterActor.scala:371)
>   at 
> org.apache.flink.yarn.ApplicationMasterActor$$anonfun$receiveYarnMessages$1.applyOrElse(ApplicationMasterActor.scala:155)
>   at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
>   at 
> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37)
>   at 
> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30)
>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>   at 
> org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>   at 
> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:94)
> {code}
> IMO this {{IllegalArgumentException}} should get handled in 
> {{org.apache.flink.yarn.Utils.registerLocalResource}} or on an upper level to 
> provide a better error message. This needs to be looked up from YARN logs at 
> the moment, which is painful for a trivial mistake like missing 
> {{HADOOP_HOME}}.





[GitHub] flink pull request: [FLINK-1799][scala] Fix handling of generic ar...

2015-04-20 Thread aljoscha
Github user aljoscha commented on a diff in the pull request:

https://github.com/apache/flink/pull/582#discussion_r28692599
  
--- Diff: 
flink-java/src/main/java/org/apache/flink/api/java/typeutils/ObjectArrayTypeInfo.java
 ---
@@ -143,6 +143,18 @@ else if (type instanceof Class && ((Class) 
type).isArray()
throw new InvalidTypesException("The given type is not a valid 
object array.");
}
 
+   /**
+* Creates a new {@link 
org.apache.flink.api.java.typeutils.ObjectArrayTypeInfo} from a
+* {@link TypeInformation} for the component type.
+*
+* 
+* This must be used in cases where the complete type of the array is 
not available as a
+* {@link java.lang.reflect.Type} or {@link java.lang.Class}.
+*/
+   public static <T, C> ObjectArrayTypeInfo<T, C> getInfoFor(TypeInformation<C> componentInfo) {
+   return new ObjectArrayTypeInfo<T, C>(Object[].class, componentInfo.getTypeClass(), componentInfo);
--- End diff --

That's right, I didn't think of that. Will change.




[jira] [Commented] (FLINK-1745) Add k-nearest-neighbours algorithm to machine learning library

2015-04-20 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502909#comment-14502909
 ] 

Till Rohrmann commented on FLINK-1745:
--

Hi Chiwan and Raghav,

cool that you two want to implement kNN :-) I think it is a good idea to first 
start with the exact kNN join implementations (H-BNLJ, H-BRJ). Once we have the kNN 
join, we automatically obtain kNN by simply doing a self-join in which we exclude 
the pairs that result from joining an element with itself.

The most trivial implementation would be a simple cross operation followed by a 
group reduce which picks the k nearest points for each point. The other 
implementations (H-BNLJ and H-BRJ) are effectively the same, except that you block 
the data and work on sets of data points. In fact, I think that the cross 
implementation should perform equivalently to H-BNLJ, while having the 
advantage of being more robust. But it would be interesting to compare the 
performance of the different implementations.
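As a rough illustration of that cross-then-pick-k idea, here is a standalone plain-Java sketch (not the proposed Flink implementation; `knn` and `distance` are hypothetical helpers, and the "cross" is simulated by comparing the query against every data point):

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class BruteForceKnn {
    // "Cross" every data point with the query, then "group reduce":
    // sort by distance and keep the k nearest points.
    static List<double[]> knn(List<double[]> data, double[] query, int k) {
        return data.stream()
                .sorted(Comparator.comparingDouble(p -> distance(p, query)))
                .limit(k)
                .collect(Collectors.toList());
    }

    // Plain Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        List<double[]> data = Arrays.asList(
                new double[]{0, 0}, new double[]{1, 1}, new double[]{5, 5});
        List<double[]> nearest = knn(data, new double[]{0.2, 0.2}, 2);
        System.out.println(Arrays.toString(nearest.get(0))); // [0.0, 0.0]
    }
}
```

In a distributed setting the cross step would produce one record per (query point, data point) pair, and the group reduce would run per query point; the sketch above collapses both steps onto a single machine.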

I really like the idea of supporting different distance measures. It would be really 
good to define a trait for distance measures, something like:
{code}
trait DistanceMeasure {
  def distance(a: Vector, b: Vector): Double
}
{code}

and make them available as basic building blocks in the ML library. That way, 
other algorithms can use them as well.
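A minimal sketch of what such a building block could look like, written in plain Java rather than the Scala trait above (all names here are hypothetical, and plain `double[]` stands in for the ML library's `Vector` type):

```java
// Hypothetical Java counterpart of the proposed DistanceMeasure trait.
interface DistanceMeasure {
    double distance(double[] a, double[] b);
}

public class Distances {
    // Euclidean (L2) distance as one pluggable implementation.
    static final DistanceMeasure EUCLIDEAN = (a, b) -> {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    };

    // Manhattan (L1) distance as a second implementation, showing how
    // algorithms can stay agnostic of the concrete measure.
    static final DistanceMeasure MANHATTAN = (a, b) -> {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    };

    public static void main(String[] args) {
        double[] x = {0, 0}, y = {3, 4};
        System.out.println(EUCLIDEAN.distance(x, y)); // 5.0
        System.out.println(MANHATTAN.distance(x, y)); // 7.0
    }
}
```

An algorithm such as kNN would then accept any `DistanceMeasure` as a parameter instead of hard-coding one metric.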

I also really like the way you defined the interaction with the kNN 
implementation. The only remark I have is that kNN does not require 
{{LabeledVectors}}. It should also work with a {{DataSet[Vector]}}. Concerning 
the model, we have to think about how to identify the calculated neighbours in the 
resulting {{DataSet}}. Either one queries the model with only a single data 
point, so that all data points contained in the DataSet are its neighbours, or 
one groups the neighbours with respect to the queried data points. What do you 
think?

> Add k-nearest-neighbours algorithm to machine learning library
> --
>
> Key: FLINK-1745
> URL: https://issues.apache.org/jira/browse/FLINK-1745
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Chiwan Park
>  Labels: ML, Starter
>
> Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial 
> it is still used as a mean to classify data and to do regression.
> Could be a starter task.
> Resources:
> [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm]
> [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf]





[jira] [Commented] (FLINK-1799) Scala API does not support generic arrays

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502906#comment-14502906
 ] 

ASF GitHub Bot commented on FLINK-1799:
---

Github user aljoscha commented on a diff in the pull request:

https://github.com/apache/flink/pull/582#discussion_r28692599
  
--- Diff: 
flink-java/src/main/java/org/apache/flink/api/java/typeutils/ObjectArrayTypeInfo.java
 ---
@@ -143,6 +143,18 @@ else if (type instanceof Class && ((Class) 
type).isArray()
throw new InvalidTypesException("The given type is not a valid 
object array.");
}
 
+   /**
+* Creates a new {@link 
org.apache.flink.api.java.typeutils.ObjectArrayTypeInfo} from a
+* {@link TypeInformation} for the component type.
+*
+* 
+* This must be used in cases where the complete type of the array is 
not available as a
+* {@link java.lang.reflect.Type} or {@link java.lang.Class}.
+*/
+   public static <T, C> ObjectArrayTypeInfo<T, C> getInfoFor(TypeInformation<C> componentInfo) {
+   return new ObjectArrayTypeInfo<T, C>(Object[].class, componentInfo.getTypeClass(), componentInfo);
--- End diff --

That's right, I didn't think of that. Will change.


> Scala API does not support generic arrays
> -
>
> Key: FLINK-1799
> URL: https://issues.apache.org/jira/browse/FLINK-1799
> Project: Flink
>  Issue Type: Bug
>Reporter: Till Rohrmann
>Assignee: Aljoscha Krettek
>
> The Scala API does not support generic arrays at the moment. It throws a 
> rather unhelpful error message ```InvalidTypesException: The given type is 
> not a valid object array```.
> Code to reproduce the problem is given below:
> {code}
> def main(args: Array[String]) {
>   foobar[Double]
> }
> def foobar[T: ClassTag: TypeInformation]: DataSet[Block[T]] = {
>   val tpe = createTypeInformation[Array[T]]
>   null
> }
> {code}





[jira] [Commented] (FLINK-1914) Wrong FS while starting YARN session without correct HADOOP_HOME

2015-04-20 Thread Robert Metzger (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502901#comment-14502901
 ] 

Robert Metzger commented on FLINK-1914:
---

Hi,
thank you for the feedback.
The YARN client should have shown you a WARNING like the following when you 
started Flink on YARN:
{code}
WARN  org.apache.flink.yarn.FlinkYarnClient - The file 
system scheme is 'file'. This indicates that the specified Hadoop configuration 
path is wrong and the sytem is using the default Hadoop configuration 
values.The Flink YARN client needs to store its files in a distributed file 
system
{code}

I cannot let the YARN client fail when a user writes to a local file system, 
because it could be a valid setup (e.g. all NodeManagers share an NFS mount, or 
they all run on the same machine).

I can catch the exception and provide a better error message, but I don't see a 
good way to detect this situation on the client before starting the AM.

> Wrong FS while starting YARN session without correct HADOOP_HOME
> 
>
> Key: FLINK-1914
> URL: https://issues.apache.org/jira/browse/FLINK-1914
> Project: Flink
>  Issue Type: Bug
>  Components: YARN Client
>Reporter: Zoltán Zvara
>Priority: Trivial
>  Labels: yarn, yarn-client
>
> When a YARN session is invoked ({{yarn-session.sh}}) without a correct 
> {{HADOOP_HOME}}, the AM is still deployed (for example to {{0.0.0.0:8032}}), but 
> it fails with an {{IllegalArgumentException}}:
> {code}
> java.lang.IllegalArgumentException: Wrong FS: 
> file:/home/.../flink-dist-0.9-SNAPSHOT.jar, expected: hdfs://localhost:9000
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:181)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:92)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1106)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1102)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1102)
>   at org.apache.flink.yarn.Utils.registerLocalResource(Utils.java:105)
>   at 
> org.apache.flink.yarn.ApplicationMasterActor$$anonfun$org$apache$flink$yarn$ApplicationMasterActor$$startYarnSession$2.apply(ApplicationMasterActor.scala:436)
>   at 
> org.apache.flink.yarn.ApplicationMasterActor$$anonfun$org$apache$flink$yarn$ApplicationMasterActor$$startYarnSession$2.apply(ApplicationMasterActor.scala:371)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> org.apache.flink.yarn.ApplicationMasterActor$class.org$apache$flink$yarn$ApplicationMasterActor$$startYarnSession(ApplicationMasterActor.scala:371)
>   at 
> org.apache.flink.yarn.ApplicationMasterActor$$anonfun$receiveYarnMessages$1.applyOrElse(ApplicationMasterActor.scala:155)
>   at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
>   at 
> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37)
>   at 
> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30)
>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>   at 
> org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>   at 
> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:94)
> {code}
> IMO this {{IllegalArgumentException}} should get handled in 
> {{org.apache.flink.yarn.Utils.registerLocalResource}} or on an upper level to 
> provide a better error message. This needs to be looked up from YARN logs at 
> the moment, which is painful for a trivial mistake like missing 
> {{HADOOP_HOME}}.





[jira] [Updated] (FLINK-1260) Add custom partitioning to documentation

2015-04-20 Thread Chesnay Schepler (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chesnay Schepler updated FLINK-1260:

Assignee: (was: Chesnay Schepler)

> Add custom partitioning to documentation
> 
>
> Key: FLINK-1260
> URL: https://issues.apache.org/jira/browse/FLINK-1260
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation, Java API, Scala API
>Affects Versions: 0.7.0-incubating
>Reporter: Fabian Hueske
>Priority: Minor
>
> The APIs allow to define a custom partitioner to manually fix data skew.
> This feature is not documented.





[jira] [Commented] (FLINK-1297) Add support for tracking statistics of intermediate results

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502900#comment-14502900
 ] 

ASF GitHub Bot commented on FLINK-1297:
---

Github user StephanEwen commented on the pull request:

https://github.com/apache/flink/pull/605#issuecomment-94464757
  
Keeping the accumulators from each task separately is something that we 
want to change anyways in the main system. Then, this would not need special 
casing.

Until then, you should be able to encode the subtask into the accumulator 
name, like "taskname_i". Then, the accumulators do not get combined.


> Add support for tracking statistics of intermediate results
> ---
>
> Key: FLINK-1297
> URL: https://issues.apache.org/jira/browse/FLINK-1297
> Project: Flink
>  Issue Type: Improvement
>  Components: Distributed Runtime
>Reporter: Alexander Alexandrov
>Assignee: Alexander Alexandrov
> Fix For: 0.9
>
>   Original Estimate: 1,008h
>  Remaining Estimate: 1,008h
>
> One of the major problems related to the optimizer at the moment is the lack 
> of proper statistics.
> With the introduction of staged execution, it is possible to instrument the 
> runtime code with a statistics facility that collects the required 
> information for optimizing the next execution stage.
> I would therefore like to contribute code that can be used to gather basic 
> statistics for the (intermediate) result of dataflows (e.g. min, max, count, 
> count distinct) and make them available to the job manager.
> Before I start, I would like to hear some feedback form the other users.
> In particular, to handle skew (e.g. on grouping) it might be good to have 
> some sort of detailed sketch about the key distribution of an intermediate 
> result. I am not sure whether a simple histogram is the most effective way to 
> go. Maybe somebody would propose another lightweight sketch that provides 
> better accuracy.





[GitHub] flink pull request: [FLINK-1297] Added OperatorStatsAccumulator fo...

2015-04-20 Thread StephanEwen
Github user StephanEwen commented on the pull request:

https://github.com/apache/flink/pull/605#issuecomment-94464757
  
Keeping the accumulators from each task separately is something that we 
want to change anyways in the main system. Then, this would not need special 
casing.

Until then, you should be able to encode the subtask into the accumulator 
name, like "taskname_i". Then, the accumulators do not get combined.




[jira] [Updated] (FLINK-1422) Missing usage example for "withParameters"

2015-04-20 Thread Chesnay Schepler (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chesnay Schepler updated FLINK-1422:

Assignee: (was: Chesnay Schepler)

> Missing usage example for "withParameters"
> --
>
> Key: FLINK-1422
> URL: https://issues.apache.org/jira/browse/FLINK-1422
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 0.8.0
>Reporter: Alexander Alexandrov
>Priority: Trivial
> Fix For: 0.8.2
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> I am struggling to find a usage example of the "withParameters" method in the 
> documentation. At the moment I only see this note:
> {quote}
> Note: As the content of broadcast variables is kept in-memory on each node, 
> it should not become too large. For simpler things like scalar values you can 
> simply make parameters part of the closure of a function, or use the 
> withParameters(...) method to pass in a configuration.
> {quote}





[jira] [Commented] (FLINK-1799) Scala API does not support generic arrays

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502897#comment-14502897
 ] 

ASF GitHub Bot commented on FLINK-1799:
---

Github user StephanEwen commented on a diff in the pull request:

https://github.com/apache/flink/pull/582#discussion_r28691764
  
--- Diff: 
flink-java/src/main/java/org/apache/flink/api/java/typeutils/ObjectArrayTypeInfo.java
 ---
@@ -143,6 +143,18 @@ else if (type instanceof Class && ((Class) 
type).isArray()
throw new InvalidTypesException("The given type is not a valid 
object array.");
}
 
+   /**
+* Creates a new {@link 
org.apache.flink.api.java.typeutils.ObjectArrayTypeInfo} from a
+* {@link TypeInformation} for the component type.
+*
+* 
+* This must be used in cases where the complete type of the array is 
not available as a
+* {@link java.lang.reflect.Type} or {@link java.lang.Class}.
+*/
+   public static <T, C> ObjectArrayTypeInfo<T, C> getInfoFor(TypeInformation<C> componentInfo) {
+   return new ObjectArrayTypeInfo<T, C>(Object[].class, componentInfo.getTypeClass(), componentInfo);
--- End diff --

Is `Object[].class` actually correct? Why not give the class of the real 
array type? There are a few ways to get that, for example 
`Array.newInstance(clazz, 0).getClass();`


> Scala API does not support generic arrays
> -
>
> Key: FLINK-1799
> URL: https://issues.apache.org/jira/browse/FLINK-1799
> Project: Flink
>  Issue Type: Bug
>Reporter: Till Rohrmann
>Assignee: Aljoscha Krettek
>
> The Scala API does not support generic arrays at the moment. It throws a 
> rather unhelpful error message ```InvalidTypesException: The given type is 
> not a valid object array```.
> Code to reproduce the problem is given below:
> {code}
> def main(args: Array[String]) {
>   foobar[Double]
> }
> def foobar[T: ClassTag: TypeInformation]: DataSet[Block[T]] = {
>   val tpe = createTypeInformation[Array[T]]
>   null
> }
> {code}





[GitHub] flink pull request: [FLINK-1799][scala] Fix handling of generic ar...

2015-04-20 Thread StephanEwen
Github user StephanEwen commented on a diff in the pull request:

https://github.com/apache/flink/pull/582#discussion_r28691764
  
--- Diff: 
flink-java/src/main/java/org/apache/flink/api/java/typeutils/ObjectArrayTypeInfo.java
 ---
@@ -143,6 +143,18 @@ else if (type instanceof Class && ((Class) 
type).isArray()
throw new InvalidTypesException("The given type is not a valid 
object array.");
}
 
+   /**
+* Creates a new {@link 
org.apache.flink.api.java.typeutils.ObjectArrayTypeInfo} from a
+* {@link TypeInformation} for the component type.
+*
+* 
+* This must be used in cases where the complete type of the array is 
not available as a
+* {@link java.lang.reflect.Type} or {@link java.lang.Class}.
+*/
+   public static <T, C> ObjectArrayTypeInfo<T, C> getInfoFor(TypeInformation<C> componentInfo) {
+   return new ObjectArrayTypeInfo<T, C>(Object[].class, componentInfo.getTypeClass(), componentInfo);
--- End diff --

Is `Object[].class` actually correct? Why not give the class of the real 
array type? There are a few ways to get that, for example 
`Array.newInstance(clazz, 0).getClass();`
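The reflective trick suggested above can be checked with a standalone JDK-only snippet (this is not Flink code; it just contrasts `Object[].class` with the concrete array class recovered via `Array.newInstance`):

```java
import java.lang.reflect.Array;

public class ArrayClassDemo {
    public static void main(String[] args) {
        // Object[].class carries no information about the component type...
        Class<?> generic = Object[].class;

        // ...while creating a zero-length array recovers the concrete
        // array class for a component class known only at runtime.
        Class<?> concrete = Array.newInstance(String.class, 0).getClass();

        System.out.println(generic);                    // class [Ljava.lang.Object;
        System.out.println(concrete);                   // class [Ljava.lang.String;
        System.out.println(concrete == String[].class); // true
    }
}
```

The zero-length allocation is cheap, which is why it is a common idiom for obtaining `T[].class` when only `Class<T>` is available.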




[jira] [Commented] (FLINK-1758) Extend Gelly's neighborhood methods

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502885#comment-14502885
 ] 

ASF GitHub Bot commented on FLINK-1758:
---

Github user andralungu commented on the pull request:

https://github.com/apache/flink/pull/576#issuecomment-94461019
  
Hi @vasia ,

The inline comments have been addressed, I still have the big comment to 
look at :)

One thing (or two ^^) that could be discussed: 
- the inconsistency between the iterateOn* and the reduce* method names 
comes from the fact that the two methods behave differently: reduce takes 
the elements two by two and aggregates them, while iterateOn uses an iterator to 
go through the edges. Hence the different names. 
Imagine something like this:
Tuple2<K, Edge<K, EV>> iterateOnEdges(Tuple2<K, Edge<K, EV>> firstEdge, 
Tuple2<K, Edge<K, EV>> secondEdge);
when you're not iterating on anything (or, vice versa, reduceOnEdges(Iterable<Tuple2<K, Edge<K, EV>>> 
edges)).

- I know ReduceEdgesFunction is not a pretty name, but I don't have another 
one; ReduceFunction is already taken :P. When you think of a better name, tell 
me and I can do the refactoring.
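The behavioural difference described here (pairwise combination versus seeing the whole neighborhood at once) can be mimicked outside Gelly; this plain-Java sketch uses hypothetical names and bare integer edge values for brevity:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

public class NeighborhoodStyles {
    // reduce* style: the user function only ever sees two values at a
    // time, so the result must be a single combined value per vertex.
    static int reduceOnEdges(List<Integer> edgeValues, BinaryOperator<Integer> combine) {
        int acc = edgeValues.get(0);
        for (int i = 1; i < edgeValues.size(); i++) {
            acc = combine.apply(acc, edgeValues.get(i));
        }
        return acc;
    }

    // iterateOn* style: the function iterates over the whole neighborhood,
    // so it can emit zero, one, or several results per vertex.
    static List<Integer> iterateOnEdges(Iterable<Integer> edgeValues) {
        List<Integer> out = new ArrayList<>();
        for (int v : edgeValues) {
            if (v % 2 == 0) out.add(v); // e.g. keep only even-valued edges
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> edges = Arrays.asList(3, 8, 5, 2);
        System.out.println(reduceOnEdges(edges, Integer::sum)); // 18
        System.out.println(iterateOnEdges(edges));              // [8, 2]
    }
}
```

The pairwise form is what makes a reduce combinable (and thus cheap to pre-aggregate), while the iterator form corresponds to a group reduce with a collector.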
 



> Extend Gelly's neighborhood methods
> ---
>
> Key: FLINK-1758
> URL: https://issues.apache.org/jira/browse/FLINK-1758
> Project: Flink
>  Issue Type: Improvement
>  Components: Gelly
>Affects Versions: 0.9
>Reporter: Vasia Kalavri
>Assignee: Andra Lungu
>
> Currently, the neighborhood methods only allow returning a single value per 
> vertex. In many cases, it is desirable to return several or no value per 
> vertex. This is the case in clustering coefficient computation, 
> vertex-centric jaccard, algorithms where a vertex computes a value per edge 
> or when a vertex computes a value only for some of its neighbors.
> This issue proposes to 
> - change the current reduceOnEdges/reduceOnNeighbors methods to use 
> combinable reduce operations where possible
> - provide groupReduce-versions, which will use a Collector and allow 
> returning none or more values per vertex.





[jira] [Created] (FLINK-1914) Wrong FS while starting YARN session without correct HADOOP_HOME

2015-04-20 Thread JIRA
Zoltán Zvara created FLINK-1914:
---

 Summary: Wrong FS while starting YARN session without correct 
HADOOP_HOME
 Key: FLINK-1914
 URL: https://issues.apache.org/jira/browse/FLINK-1914
 Project: Flink
  Issue Type: Bug
  Components: YARN Client
Reporter: Zoltán Zvara
Priority: Trivial


When a YARN session is invoked ({{yarn-session.sh}}) without a correct 
{{HADOOP_HOME}}, the AM is still deployed (for example to {{0.0.0.0:8032}}), but 
it fails with an {{IllegalArgumentException}}:

{code}
java.lang.IllegalArgumentException: Wrong FS: 
file:/home/.../flink-dist-0.9-SNAPSHOT.jar, expected: hdfs://localhost:9000
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:181)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:92)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1106)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1102)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1102)
at org.apache.flink.yarn.Utils.registerLocalResource(Utils.java:105)
at 
org.apache.flink.yarn.ApplicationMasterActor$$anonfun$org$apache$flink$yarn$ApplicationMasterActor$$startYarnSession$2.apply(ApplicationMasterActor.scala:436)
at 
org.apache.flink.yarn.ApplicationMasterActor$$anonfun$org$apache$flink$yarn$ApplicationMasterActor$$startYarnSession$2.apply(ApplicationMasterActor.scala:371)
at scala.util.Try$.apply(Try.scala:161)
at 
org.apache.flink.yarn.ApplicationMasterActor$class.org$apache$flink$yarn$ApplicationMasterActor$$startYarnSession(ApplicationMasterActor.scala:371)
at 
org.apache.flink.yarn.ApplicationMasterActor$$anonfun$receiveYarnMessages$1.applyOrElse(ApplicationMasterActor.scala:155)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
at 
org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37)
at 
org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at 
org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at 
org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:94)
{code}

IMO this {{IllegalArgumentException}} should get handled in 
{{org.apache.flink.yarn.Utils.registerLocalResource}} or on an upper level to 
provide a better error message. This needs to be looked up from YARN logs at 
the moment, which is painful for a trivial mistake like missing {{HADOOP_HOME}}.





[GitHub] flink pull request: [FLINK-1758][gelly] Neighborhood Methods Exten...

2015-04-20 Thread andralungu
Github user andralungu commented on the pull request:

https://github.com/apache/flink/pull/576#issuecomment-94461019
  
Hi @vasia ,

The inline comments have been addressed, I still have the big comment to 
look at :)

One thing (or two ^^) that could be discussed: 
- the inconsistency between the iterateOn* and the reduce* method names 
comes from the fact that the two methods behave differently: reduce takes 
the elements two by two and aggregates them, while iterateOn uses an iterator to 
go through the edges. Hence the different names. 
Imagine something like this:
Tuple2<K, Edge<K, EV>> iterateOnEdges(Tuple2<K, Edge<K, EV>> firstEdge, 
Tuple2<K, Edge<K, EV>> secondEdge);
when you're not iterating on anything (or, vice versa, reduceOnEdges(Iterable<Tuple2<K, Edge<K, EV>>> 
edges)).

- I know ReduceEdgesFunction is not a pretty name, but I don't have another 
one; ReduceFunction is already taken :P. When you think of a better name, tell 
me and I can do the refactoring.
 





[GitHub] flink pull request: [FLINK-1906] [docs] Add tip to work around pla...

2015-04-20 Thread StephanEwen
Github user StephanEwen commented on the pull request:

https://github.com/apache/flink/pull/611#issuecomment-94457892
  
Good addition, +1 to merge




[jira] [Commented] (FLINK-1906) Add tip to work around plain Tuple return type of project operator

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502870#comment-14502870
 ] 

ASF GitHub Bot commented on FLINK-1906:
---

Github user StephanEwen commented on the pull request:

https://github.com/apache/flink/pull/611#issuecomment-94457892
  
Good addition, +1 to merge


> Add tip to work around plain Tuple return type of project operator
> --
>
> Key: FLINK-1906
> URL: https://issues.apache.org/jira/browse/FLINK-1906
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 0.9
>Reporter: Fabian Hueske
>Assignee: Chiwan Park
>Priority: Minor
>  Labels: starter
>
> The Java compiler is not able to infer the return type of the {{project}} 
> operator and defaults to {{Tuple}}. This can cause problems if another 
> operator is immediately called on the result of a {{project}} operator such 
> as:
> {code}
> DataSet> ds = 
> DataSet> ds2 = ds.project(0).distinct(0);
> {code} 
> This problem can be overcome by hinting the return type of {{project}} like 
> this:
> {code}
> DataSet> ds2 = ds.project(0).distinct(0);
> {code}
> We should add this description to the documentation of the project operator.





[jira] [Commented] (FLINK-1908) JobManager startup delay isn't considered when using start-cluster.sh script

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502792#comment-14502792
 ] 

ASF GitHub Bot commented on FLINK-1908:
---

Github user DarkKnightCZ commented on a diff in the pull request:

https://github.com/apache/flink/pull/609#discussion_r28688289
  
--- Diff: flink-dist/src/main/flink-bin/bin/start-cluster.sh ---
@@ -37,6 +37,26 @@ fi
 # cluster mode, bring up job manager locally and a task manager on every 
slave host
 "$FLINK_BIN_DIR"/jobmanager.sh start cluster
 
+# wait until jobmanager starts
+JOBMANAGER_ADDR=$(readFromConfig ${KEY_JOBM_RPC_ADDR} 
"${DEFAULT_JOBM_RPC_ADDR}" "${YAML_CONF}")
+JOBMANAGER_PORT=$(readFromConfig ${KEY_JOBM_RPC_PORT} 
"${DEFAULT_JOBM_RPC_PORT}" "${YAML_CONF}")
+
+echo "Waiting for job manager"
+for i in {1..30}; do
+  nc -z "${JOBMANAGER_ADDR}" $JOBMANAGER_PORT
--- End diff --

@rmetzger 
Since "-z" only does port-pinging (no actual payload is sent), nothing is 
visible in the logs (if you send some data, it's correctly logged as a WARN 
"incorrect header" by org.apache.flink.runtime.ipc.Server).
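
The `nc -z` loop in the script above just polls the port until a TCP connect succeeds. The same wait-for-port logic can be sketched in Python (a generic illustration; `wait_for_port` and the default values are made up for this example, not part of Flink):

```python
import socket
import time

def wait_for_port(host, port, attempts=30, delay=1.0):
    """Poll host:port until a TCP connection succeeds (i.e. the
    JobManager RPC endpoint is up) or the attempts are exhausted.
    Like `nc -z`, no payload is sent: the connection is opened and
    immediately closed again."""
    for _ in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(delay)
    return False
```

A startup script would call something like `wait_for_port(jobmanager_addr, jobmanager_port)` before launching the task managers, and abort if it returns False.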


> JobManager startup delay isn't considered when using start-cluster.sh script
> 
>
> Key: FLINK-1908
> URL: https://issues.apache.org/jira/browse/FLINK-1908
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Runtime
>Affects Versions: 0.9, 0.8.1
> Environment: Linux
>Reporter: Lukas Raska
>Priority: Minor
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> When starting a Flink cluster via the start-cluster.sh script, JobManager 
> startup can be delayed (as it's started asynchronously), which can result in 
> the failed startup of several task managers.
> The solution is to wait a certain amount of time and periodically check if the 
> RPC port is accessible, then proceed with starting the task managers.





[GitHub] flink pull request: [FLINK-1908] JobManager startup delay isn't co...

2015-04-20 Thread DarkKnightCZ
Github user DarkKnightCZ commented on a diff in the pull request:

https://github.com/apache/flink/pull/609#discussion_r28688289
  
--- Diff: flink-dist/src/main/flink-bin/bin/start-cluster.sh ---
@@ -37,6 +37,26 @@ fi
 # cluster mode, bring up job manager locally and a task manager on every 
slave host
 "$FLINK_BIN_DIR"/jobmanager.sh start cluster
 
+# wait until jobmanager starts
+JOBMANAGER_ADDR=$(readFromConfig ${KEY_JOBM_RPC_ADDR} 
"${DEFAULT_JOBM_RPC_ADDR}" "${YAML_CONF}")
+JOBMANAGER_PORT=$(readFromConfig ${KEY_JOBM_RPC_PORT} 
"${DEFAULT_JOBM_RPC_PORT}" "${YAML_CONF}")
+
+echo "Waiting for job manager"
+for i in {1..30}; do
+  nc -z "${JOBMANAGER_ADDR}" $JOBMANAGER_PORT
--- End diff --

@rmetzger 
Since "-z" only does port-pinging (no actual payload is sent), nothing is 
visible in the logs (if you send some data, it's correctly logged as a WARN 
"incorrect header" by org.apache.flink.runtime.ipc.Server).




[jira] [Commented] (FLINK-1908) JobManager startup delay isn't considered when using start-cluster.sh script

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502783#comment-14502783
 ] 

ASF GitHub Bot commented on FLINK-1908:
---

Github user DarkKnightCZ commented on the pull request:

https://github.com/apache/flink/pull/609#issuecomment-94451193
  
@tillrohrmann 
The problem that occurred was that the JM bound the IP:PORT with some delay, so 
the TMs failed to start, since they couldn't connect.

When I tried it in a 5-node environment, sometimes 2 or 3 TMs failed because the 
JM wasn't ready yet. There was no subsequent checking done; the TMs just stopped. 
I agree that the TM should indeed try to check several times whether the JM is 
available, so I will try to look at that as well.


> JobManager startup delay isn't considered when using start-cluster.sh script
> 
>
> Key: FLINK-1908
> URL: https://issues.apache.org/jira/browse/FLINK-1908
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Runtime
>Affects Versions: 0.9, 0.8.1
> Environment: Linux
>Reporter: Lukas Raska
>Priority: Minor
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> When starting a Flink cluster via the start-cluster.sh script, JobManager 
> startup can be delayed (as it's started asynchronously), which can result in 
> the failed startup of several task managers.
> The solution is to wait a certain amount of time and periodically check if the 
> RPC port is accessible, then proceed with starting the task managers.





[GitHub] flink pull request: [FLINK-1908] JobManager startup delay isn't co...

2015-04-20 Thread DarkKnightCZ
Github user DarkKnightCZ commented on the pull request:

https://github.com/apache/flink/pull/609#issuecomment-94451193
  
@tillrohrmann 
The problem that occurred was that the JM bound the IP:PORT with some delay, so 
the TMs failed to start, since they couldn't connect.

When I tried it in a 5-node environment, sometimes 2 or 3 TMs failed because the 
JM wasn't ready yet. There was no subsequent checking done; the TMs just stopped. 
I agree that the TM should indeed try to check several times whether the JM is 
available, so I will try to look at that as well.




[jira] [Commented] (FLINK-1746) Add linear discriminant analysis to machine learning library

2015-04-20 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502763#comment-14502763
 ] 

Till Rohrmann commented on FLINK-1746:
--

Hi Raghav, cool that you picked the topic up :-) Let me know if you need any 
help implementing the feature.

> Add linear discriminant analysis to machine learning library
> 
>
> Key: FLINK-1746
> URL: https://issues.apache.org/jira/browse/FLINK-1746
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Raghav Chalapathy
>  Labels: ML
>
> Linear discriminant analysis (LDA) [1] is used for dimensionality reduction 
> prior to classification. But it can also be used to calculate a linear 
> classifier on its own. Since dimensionality reduction is an important 
> preprocessing step, a distributed LDA implementation is a valuable addition 
> to flink-ml.
> Resources:
> [1] [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5946724]





[jira] [Created] (FLINK-1913) Document how to access data in HCatalog

2015-04-20 Thread Robert Metzger (JIRA)
Robert Metzger created FLINK-1913:
-

 Summary: Document how to access data in HCatalog
 Key: FLINK-1913
 URL: https://issues.apache.org/jira/browse/FLINK-1913
 Project: Flink
  Issue Type: Bug
  Components: flink-hcatalog, Documentation
Reporter: Robert Metzger


Reading from HCatalog was added in FLINK-1466, but it was not documented.

We should document how to use the code in {{flink-hcatalog}}.
Also, there should be an example of how to write to HCatalog using the Hadoop 
wrappers.





[jira] [Commented] (FLINK-1908) JobManager startup delay isn't considered when using start-cluster.sh script

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502736#comment-14502736
 ] 

ASF GitHub Bot commented on FLINK-1908:
---

Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/609#issuecomment-9121
  
The TaskManager's maximum registration duration is configured by the config 
value ```taskmanager.maxRegistrationDuration```. The default value is set to 
infinity for a dedicated Flink cluster.

Therefore, I'm wondering what exactly the problem with the startup delay 
is? @DarkKnightCZ maybe you can elaborate a little bit more on the problem you 
had.


> JobManager startup delay isn't considered when using start-cluster.sh script
> 
>
> Key: FLINK-1908
> URL: https://issues.apache.org/jira/browse/FLINK-1908
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Runtime
>Affects Versions: 0.9, 0.8.1
> Environment: Linux
>Reporter: Lukas Raska
>Priority: Minor
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> When starting a Flink cluster via the start-cluster.sh script, JobManager 
> startup can be delayed (as it's started asynchronously), which can result in 
> the failed startup of several task managers.
> The solution is to wait a certain amount of time and periodically check if the 
> RPC port is accessible, then proceed with starting the task managers.





[jira] [Commented] (FLINK-1908) JobManager startup delay isn't considered when using start-cluster.sh script

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502737#comment-14502737
 ] 

ASF GitHub Bot commented on FLINK-1908:
---

Github user rmetzger commented on the pull request:

https://github.com/apache/flink/pull/609#issuecomment-9159
  
As far as I understood it, the goal of the change is to wait until the JM 
has been started before starting the TMs.
So the TMs would not start if the JM failed to start.



> JobManager startup delay isn't considered when using start-cluster.sh script
> 
>
> Key: FLINK-1908
> URL: https://issues.apache.org/jira/browse/FLINK-1908
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Runtime
>Affects Versions: 0.9, 0.8.1
> Environment: Linux
>Reporter: Lukas Raska
>Priority: Minor
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> When starting a Flink cluster via the start-cluster.sh script, JobManager 
> startup can be delayed (as it's started asynchronously), which can result in 
> the failed startup of several task managers.
> The solution is to wait a certain amount of time and periodically check if the 
> RPC port is accessible, then proceed with starting the task managers.





[GitHub] flink pull request: [FLINK-1908] JobManager startup delay isn't co...

2015-04-20 Thread rmetzger
Github user rmetzger commented on the pull request:

https://github.com/apache/flink/pull/609#issuecomment-9159
  
As far as I understood it, the goal of the change is to wait until the JM 
has been started before starting the TMs.
So the TMs would not start if the JM failed to start.





[GitHub] flink pull request: [FLINK-1908] JobManager startup delay isn't co...

2015-04-20 Thread tillrohrmann
Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/609#issuecomment-9121
  
The TaskManager's maximum registration duration is configured by the config 
value ```taskmanager.maxRegistrationDuration```. The default value is set to 
infinity for a dedicated Flink cluster.

Therefore, I'm wondering what exactly the problem with the startup delay 
is? @DarkKnightCZ maybe you can elaborate a little bit more on the problem you 
had.




[jira] [Commented] (FLINK-1908) JobManager startup delay isn't considered when using start-cluster.sh script

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502734#comment-14502734
 ] 

ASF GitHub Bot commented on FLINK-1908:
---

Github user rmetzger commented on a diff in the pull request:

https://github.com/apache/flink/pull/609#discussion_r28685121
  
--- Diff: flink-dist/src/main/flink-bin/bin/start-cluster.sh ---
@@ -37,6 +37,26 @@ fi
 # cluster mode, bring up job manager locally and a task manager on every 
slave host
 "$FLINK_BIN_DIR"/jobmanager.sh start cluster
 
+# wait until jobmanager starts
+JOBMANAGER_ADDR=$(readFromConfig ${KEY_JOBM_RPC_ADDR} 
"${DEFAULT_JOBM_RPC_ADDR}" "${YAML_CONF}")
+JOBMANAGER_PORT=$(readFromConfig ${KEY_JOBM_RPC_PORT} 
"${DEFAULT_JOBM_RPC_PORT}" "${YAML_CONF}")
+
+echo "Waiting for job manager"
+for i in {1..30}; do
+  nc -z "${JOBMANAGER_ADDR}" $JOBMANAGER_PORT
--- End diff --

Is akka logging anything for these requests? (I suspect it's logging a 
WARNING that an invalid client tried to connect?)


> JobManager startup delay isn't considered when using start-cluster.sh script
> 
>
> Key: FLINK-1908
> URL: https://issues.apache.org/jira/browse/FLINK-1908
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Runtime
>Affects Versions: 0.9, 0.8.1
> Environment: Linux
>Reporter: Lukas Raska
>Priority: Minor
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> When starting a Flink cluster via the start-cluster.sh script, JobManager 
> startup can be delayed (as it's started asynchronously), which can result in 
> the failed startup of several task managers.
> The solution is to wait a certain amount of time and periodically check if the 
> RPC port is accessible, then proceed with starting the task managers.





[GitHub] flink pull request: [FLINK-1908] JobManager startup delay isn't co...

2015-04-20 Thread rmetzger
Github user rmetzger commented on a diff in the pull request:

https://github.com/apache/flink/pull/609#discussion_r28685121
  
--- Diff: flink-dist/src/main/flink-bin/bin/start-cluster.sh ---
@@ -37,6 +37,26 @@ fi
 # cluster mode, bring up job manager locally and a task manager on every 
slave host
 "$FLINK_BIN_DIR"/jobmanager.sh start cluster
 
+# wait until jobmanager starts
+JOBMANAGER_ADDR=$(readFromConfig ${KEY_JOBM_RPC_ADDR} 
"${DEFAULT_JOBM_RPC_ADDR}" "${YAML_CONF}")
+JOBMANAGER_PORT=$(readFromConfig ${KEY_JOBM_RPC_PORT} 
"${DEFAULT_JOBM_RPC_PORT}" "${YAML_CONF}")
+
+echo "Waiting for job manager"
+for i in {1..30}; do
+  nc -z "${JOBMANAGER_ADDR}" $JOBMANAGER_PORT
--- End diff --

Is akka logging anything for these requests? (I suspect it's logging a 
WARNING that an invalid client tried to connect?)




[jira] [Commented] (FLINK-377) Create a general purpose framework for language bindings

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502733#comment-14502733
 ] 

ASF GitHub Bot commented on FLINK-377:
--

Github user rmetzger commented on the pull request:

https://github.com/apache/flink/pull/202#issuecomment-94443599
  
Can you give me a +1 in the discussion on the ML as well?

On Mon, Apr 20, 2015 at 2:46 PM, Aljoscha Krettek 
wrote:

> I just ran it on the cluster. Works like a charm. [image: :smile:]
>
> For word count, python takes 12 minutes, java about 2:40. But this should
> be expected, I guess.
>
> Good to merge now, in my opinion.
>
> —
> Reply to this email directly or view it on GitHub
> .
>



-- 
Robert Metzger, Contact: metzg...@web.de, Mobile: 0171/7424461



> Create a general purpose framework for language bindings
> 
>
> Key: FLINK-377
> URL: https://issues.apache.org/jira/browse/FLINK-377
> Project: Flink
>  Issue Type: Improvement
>Reporter: GitHub Import
>Assignee: Chesnay Schepler
>  Labels: github-import
> Fix For: pre-apache
>
>
> A general purpose API to run operators with arbitrary binaries. 
> This will allow running Stratosphere programs written in Python, JavaScript, 
> Ruby, Go or whatever you like. 
> We suggest using Google Protocol Buffers for data serialization. This is the 
> list of languages that currently support ProtoBuf: 
> https://code.google.com/p/protobuf/wiki/ThirdPartyAddOns 
> Very early prototype with python: 
> https://github.com/rmetzger/scratch/tree/learn-protobuf (basically testing 
> protobuf)
> For Ruby: https://github.com/infochimps-labs/wukong
> Two new students working at Stratosphere (@skunert and @filiphaase) are 
> working on this.
> The reference binding language will be for Python, but other bindings are 
> very welcome.
> The best name for this so far is "stratosphere-lang-bindings".
> I created this issue to track the progress (and give everybody a chance to 
> comment on this)
>  Imported from GitHub 
> Url: https://github.com/stratosphere/stratosphere/issues/377
> Created by: [rmetzger|https://github.com/rmetzger]
> Labels: enhancement, 
> Assignee: [filiphaase|https://github.com/filiphaase]
> Created at: Tue Jan 07 19:47:20 CET 2014
> State: open





[GitHub] flink pull request: [FLINK-377] [FLINK-671] Generic Interface / PA...

2015-04-20 Thread rmetzger
Github user rmetzger commented on the pull request:

https://github.com/apache/flink/pull/202#issuecomment-94443599
  
Can you give me a +1 in the discussion on the ML as well?

On Mon, Apr 20, 2015 at 2:46 PM, Aljoscha Krettek 
wrote:

> I just ran it on the cluster. Works like a charm. [image: :smile:]
>
> For word count, python takes 12 minutes, java about 2:40. But this should
> be expected, I guess.
>
> Good to merge now, in my opinion.
>
> —
> Reply to this email directly or view it on GitHub
> .
>



-- 
Robert Metzger, Contact: metzg...@web.de, Mobile: 0171/7424461





[jira] [Assigned] (FLINK-1745) Add k-nearest-neighbours algorithm to machine learning library

2015-04-20 Thread Chiwan Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chiwan Park reassigned FLINK-1745:
--

Assignee: Chiwan Park

> Add k-nearest-neighbours algorithm to machine learning library
> --
>
> Key: FLINK-1745
> URL: https://issues.apache.org/jira/browse/FLINK-1745
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Chiwan Park
>  Labels: ML, Starter
>
> Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial, 
> it is still used as a means to classify data and to do regression.
> Could be a starter task.
> Resources:
> [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm]
> [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf]





[jira] [Commented] (FLINK-377) Create a general purpose framework for language bindings

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502729#comment-14502729
 ] 

ASF GitHub Bot commented on FLINK-377:
--

Github user aljoscha commented on the pull request:

https://github.com/apache/flink/pull/202#issuecomment-94443054
  
I just ran it on the cluster. Works like a charm. :smile: 

For word count, python takes 12 minutes, java about 2:40. But this should 
be expected, I guess.

Good to merge now, in my opinion.


> Create a general purpose framework for language bindings
> 
>
> Key: FLINK-377
> URL: https://issues.apache.org/jira/browse/FLINK-377
> Project: Flink
>  Issue Type: Improvement
>Reporter: GitHub Import
>Assignee: Chesnay Schepler
>  Labels: github-import
> Fix For: pre-apache
>
>
> A general purpose API to run operators with arbitrary binaries. 
> This will allow running Stratosphere programs written in Python, JavaScript, 
> Ruby, Go or whatever you like. 
> We suggest using Google Protocol Buffers for data serialization. This is the 
> list of languages that currently support ProtoBuf: 
> https://code.google.com/p/protobuf/wiki/ThirdPartyAddOns 
> Very early prototype with python: 
> https://github.com/rmetzger/scratch/tree/learn-protobuf (basically testing 
> protobuf)
> For Ruby: https://github.com/infochimps-labs/wukong
> Two new students working at Stratosphere (@skunert and @filiphaase) are 
> working on this.
> The reference binding language will be for Python, but other bindings are 
> very welcome.
> The best name for this so far is "stratosphere-lang-bindings".
> I created this issue to track the progress (and give everybody a chance to 
> comment on this)
>  Imported from GitHub 
> Url: https://github.com/stratosphere/stratosphere/issues/377
> Created by: [rmetzger|https://github.com/rmetzger]
> Labels: enhancement, 
> Assignee: [filiphaase|https://github.com/filiphaase]
> Created at: Tue Jan 07 19:47:20 CET 2014
> State: open





[GitHub] flink pull request: [FLINK-377] [FLINK-671] Generic Interface / PA...

2015-04-20 Thread aljoscha
Github user aljoscha commented on the pull request:

https://github.com/apache/flink/pull/202#issuecomment-94443054
  
I just ran it on the cluster. Works like a charm. :smile: 

For word count, python takes 12 minutes, java about 2:40. But this should 
be expected, I guess.

Good to merge now, in my opinion.




[jira] [Commented] (FLINK-1908) JobManager startup delay isn't considered when using start-cluster.sh script

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502728#comment-14502728
 ] 

ASF GitHub Bot commented on FLINK-1908:
---

Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/609#issuecomment-94442619
  
The TaskManager uses an exponential backoff strategy to resolve connection
problems with the JobManager.

On Mon, Apr 20, 2015 at 11:07 AM, Max  wrote:

> Thanks for the pull request. Seems to work fine. I was wondering,
> shouldn't the task managers repeatedly try to build up a connection to the
> job manager? For me, that seems to be a nicer way to solve this problem.
> That way, the startup script doesn't need to be aware of the job manager's
> rpc port.
>
> —
> Reply to this email directly or view it on GitHub
> .
>
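
The exponential backoff strategy mentioned above can be sketched as follows (a generic Python illustration of the idea, not Flink's actual TaskManager registration code; `retry_with_backoff` and its parameters are made up for this example):

```python
import time

def retry_with_backoff(connect, initial_delay=0.5, max_delay=30.0,
                       max_attempts=None):
    """Call `connect` until it succeeds, doubling the pause between
    failed attempts up to `max_delay`. With max_attempts=None it
    retries indefinitely, analogous to an infinite registration
    duration."""
    delay = initial_delay
    attempt = 0
    while True:
        try:
            return connect()
        except ConnectionError:
            attempt += 1
            if max_attempts is not None and attempt >= max_attempts:
                raise
            time.sleep(delay)
            delay = min(delay * 2, max_delay)
```

With this scheme the caller (here, a task manager) never needs to know when the server comes up; it just keeps trying with geometrically growing pauses, which keeps load on the not-yet-ready endpoint low.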



> JobManager startup delay isn't considered when using start-cluster.sh script
> 
>
> Key: FLINK-1908
> URL: https://issues.apache.org/jira/browse/FLINK-1908
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Runtime
>Affects Versions: 0.9, 0.8.1
> Environment: Linux
>Reporter: Lukas Raska
>Priority: Minor
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> When starting a Flink cluster via the start-cluster.sh script, JobManager 
> startup can be delayed (as it's started asynchronously), which can result in 
> the failed startup of several task managers.
> The solution is to wait a certain amount of time and periodically check if the 
> RPC port is accessible, then proceed with starting the task managers.





[GitHub] flink pull request: [FLINK-1908] JobManager startup delay isn't co...

2015-04-20 Thread tillrohrmann
Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/609#issuecomment-94442619
  
The TaskManager uses an exponential backoff strategy to resolve connection
problems with the JobManager.

On Mon, Apr 20, 2015 at 11:07 AM, Max  wrote:

> Thanks for the pull request. Seems to work fine. I was wondering,
> shouldn't the task managers repeatedly try to build up a connection to the
> job manager? For me, that seems to be a nicer way to solve this problem.
> That way, the startup script doesn't need to be aware of the job manager's
> rpc port.
>
> —
> Reply to this email directly or view it on GitHub
> .
>





[jira] [Commented] (FLINK-1883) Add Min Vertex ID Propagation Library Method and Example

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502712#comment-14502712
 ] 

ASF GitHub Bot commented on FLINK-1883:
---

Github user andralungu commented on the pull request:

https://github.com/apache/flink/pull/596#issuecomment-94438358
  
@vasia ,
The library method matches Spargel now. I also refactored the randomised 
edges test a bit. 

Tell me what you think!


> Add Min Vertex ID Propagation Library Method and Example
> 
>
> Key: FLINK-1883
> URL: https://issues.apache.org/jira/browse/FLINK-1883
> Project: Flink
>  Issue Type: Task
>  Components: Gelly
>Affects Versions: 0.9
>Reporter: Andra Lungu
>Assignee: Andra Lungu
>Priority: Trivial
>
> Port the example 
> here:http://ci.apache.org/projects/flink/flink-docs-release-0.6/spargel_guide.html#example:-propagate-minimum-vertex-id-in-graph
>  to Gelly
> FLINK-1871 uses this
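
The minimum-vertex-ID propagation the issue describes can be sketched outside of Gelly as a plain superstep loop (illustrative Python, not the Gelly or Spargel API; `propagate_min_ids` is a made-up name):

```python
def propagate_min_ids(edges):
    """Propagate the minimum vertex ID through an undirected graph:
    in each superstep every vertex adopts the smallest value among
    its own and its neighbours' values, until nothing changes (the
    classic connected-components-by-label-propagation scheme)."""
    neighbours = {}
    for src, dst in edges:
        neighbours.setdefault(src, set()).add(dst)
        neighbours.setdefault(dst, set()).add(src)
    # every vertex starts with its own ID as its value
    value = {v: v for v in neighbours}
    changed = True
    while changed:
        changed = False
        new_value = dict(value)
        for v, nbrs in neighbours.items():
            candidate = min(value[n] for n in nbrs)
            if candidate < new_value[v]:
                new_value[v] = candidate
                changed = True
        value = new_value
    return value
```

After convergence, all vertices in one connected component share the component's smallest vertex ID, which is exactly the fixpoint the vertex-centric version reaches.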





[GitHub] flink pull request: [FLINK-1883][gelly] Added Min Vertex Id Propag...

2015-04-20 Thread andralungu
Github user andralungu commented on the pull request:

https://github.com/apache/flink/pull/596#issuecomment-94438358
  
@vasia ,
The library method matches Spargel now. I also refactored the randomised 
edges test a bit. 

Tell me what you think!




[jira] [Commented] (FLINK-1871) Add a Spargel to Gelly migration guide

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502694#comment-14502694
 ] 

ASF GitHub Bot commented on FLINK-1871:
---

Github user andralungu commented on the pull request:

https://github.com/apache/flink/pull/600#issuecomment-94432522
  
Hi @vasia ,

The general guidelines were added.


> Add a Spargel to Gelly migration guide
> --
>
> Key: FLINK-1871
> URL: https://issues.apache.org/jira/browse/FLINK-1871
> Project: Flink
>  Issue Type: Task
>  Components: docs, Gelly, Spargel
>Affects Versions: 0.9
>Reporter: Vasia Kalavri
>Assignee: Andra Lungu
>Priority: Minor
>
> The guide should explain how users can port their existing Spargel 
> applications to Gelly.





[GitHub] flink pull request: [FLINK-1871][gelly] Added Spargel to Gelly mig...

2015-04-20 Thread andralungu
Github user andralungu commented on the pull request:

https://github.com/apache/flink/pull/600#issuecomment-94432522
  
Hi @vasia ,

The general guidelines were added.




[jira] [Commented] (FLINK-1745) Add k-nearest-neighbours algorithm to machine learning library

2015-04-20 Thread Chiwan Park (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502686#comment-14502686
 ] 

Chiwan Park commented on FLINK-1745:


I suggest a scenario of KNN like following:

{code}
val trainingDS: DataSet[LabeledVector] = ... // training data
// We need a distance measure to compute the distance between two vectors,
// e.g. Euclidean, Cosine, or Manhattan distance.
val distanceMeasure = ...

val knn = KNN().setK(10).setDistanceMeasure(distanceMeasure)
val model: KNNModel = knn.fit(trainingDS)

val testingDS = ... // testing data
val predictionDS: DataSet[LabeledVector] = model.transform(testingDS)
// We could also provide KNNModel.transform(Vector) to predict a single vector.
{code}

The method and class names are inspired by the CoCoA implementation in 
flink-ml. :)
The distance measure concept is inspired by the Mahout implementation 
(https://github.com/apache/mahout/blob/master/mr/src/main/java/org/apache/mahout/common/distance/DistanceMeasure.java).
I think we need a separate issue for the distance measure.

How about this scenario?
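To make the distance-measure idea concrete, here is a minimal sketch of what such an abstraction could look like. The trait and object names are hypothetical illustrations inspired by the Mahout interface linked above, not the actual flink-ml API:

```scala
// Hypothetical distance-measure abstraction; names are illustrative only.
trait DistanceMeasure {
  def distance(a: Array[Double], b: Array[Double]): Double
}

object EuclideanDistance extends DistanceMeasure {
  // Standard Euclidean (L2) distance between two dense vectors.
  def distance(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
  }
}

object ManhattanDistance extends DistanceMeasure {
  // L1 distance: sum of absolute coordinate differences.
  def distance(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    a.zip(b).map { case (x, y) => math.abs(x - y) }.sum
  }
}
```

A `KNN` implementation could then accept any `DistanceMeasure` via `setDistanceMeasure(...)` without committing to a specific metric.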

> Add k-nearest-neighbours algorithm to machine learning library
> --
>
> Key: FLINK-1745
> URL: https://issues.apache.org/jira/browse/FLINK-1745
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>  Labels: ML, Starter
>
> Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial, 
> it is still used as a means to classify data and to do regression.
> Could be a starter task.
> Resources:
> [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm]
> [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (FLINK-1733) Add PCA to machine learning library

2015-04-20 Thread Raghav Chalapathy (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raghav Chalapathy reassigned FLINK-1733:


Assignee: Raghav Chalapathy

> Add PCA to machine learning library
> ---
>
> Key: FLINK-1733
> URL: https://issues.apache.org/jira/browse/FLINK-1733
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Raghav Chalapathy
>  Labels: ML
>
> Dimension reduction is a crucial prerequisite for many data analysis tasks. 
> Therefore, Flink's machine learning library should contain a principal 
> components analysis (PCA) implementation. Maria-Florina Balcan et al. [1] 
> proposes a distributed PCA.
> Resources:
> [1] [http://arxiv.org/pdf/1408.5823v5.pdf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1745) Add k-nearest-neighbours algorithm to machine learning library

2015-04-20 Thread Chiwan Park (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502620#comment-14502620
 ] 

Chiwan Park commented on FLINK-1745:


[~raghav.chalapa...@gmail.com] Hi. We can collaborate on the development of 
this feature. I would also like to discuss the details.

> Add k-nearest-neighbours algorithm to machine learning library
> --
>
> Key: FLINK-1745
> URL: https://issues.apache.org/jira/browse/FLINK-1745
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>  Labels: ML, Starter
>
> Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial, 
> it is still used as a means to classify data and to do regression.
> Could be a starter task.
> Resources:
> [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm]
> [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (FLINK-1911) DataStream and DataSet projection is out of sync

2015-04-20 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/FLINK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Péter Szabó reassigned FLINK-1911:
--

Assignee: Péter Szabó

> DataStream and DataSet projection is out of sync
> 
>
> Key: FLINK-1911
> URL: https://issues.apache.org/jira/browse/FLINK-1911
> Project: Flink
>  Issue Type: Bug
>  Components: Java API, Streaming
>Reporter: Gyula Fora
>Assignee: Péter Szabó
>
> Since the DataSet projection has been reworked to no longer require the 
> .types(...) call, the Streaming and Batch methods are out of sync.
> The streaming api projection needs to be modified accordingly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1746) Add linear discriminant analysis to machine learning library

2015-04-20 Thread Raghav Chalapathy (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502581#comment-14502581
 ] 

Raghav Chalapathy commented on FLINK-1746:
--

Thanks,
Raghav


> Add linear discriminant analysis to machine learning library
> 
>
> Key: FLINK-1746
> URL: https://issues.apache.org/jira/browse/FLINK-1746
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Raghav Chalapathy
>  Labels: ML
>
> Linear discriminant analysis (LDA) [1] is used for dimensionality reduction 
> prior to classification. But it can also be used to calculate a linear 
> classifier on its own. Since dimensionality reduction is an important 
> preprocessing step, a distributed LDA implementation is a valuable addition 
> to flink-ml.
> Resources:
> [1] [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5946724]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (FLINK-1746) Add linear discriminant analysis to machine learning library

2015-04-20 Thread Raghav Chalapathy (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raghav Chalapathy reassigned FLINK-1746:


Assignee: Raghav Chalapathy

> Add linear discriminant analysis to machine learning library
> 
>
> Key: FLINK-1746
> URL: https://issues.apache.org/jira/browse/FLINK-1746
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Raghav Chalapathy
>  Labels: ML
>
> Linear discriminant analysis (LDA) [1] is used for dimensionality reduction 
> prior to classification. But it can also be used to calculate a linear 
> classifier on its own. Since dimensionality reduction is an important 
> preprocessing step, a distributed LDA implementation is a valuable addition 
> to flink-ml.
> Resources:
> [1] [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5946724]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1746) Add linear discriminant analysis to machine learning library

2015-04-20 Thread Robert Metzger (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502580#comment-14502580
 ] 

Robert Metzger commented on FLINK-1746:
---

Hi,
I gave you "Contributor" permissions in our JIRA. You can now assign issues to 
yourself.
I've assigned this one to you.

> Add linear discriminant analysis to machine learning library
> 
>
> Key: FLINK-1746
> URL: https://issues.apache.org/jira/browse/FLINK-1746
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>  Labels: ML
>
> Linear discriminant analysis (LDA) [1] is used for dimensionality reduction 
> prior to classification. But it can also be used to calculate a linear 
> classifier on its own. Since dimensionality reduction is an important 
> preprocessing step, a distributed LDA implementation is a valuable addition 
> to flink-ml.
> Resources:
> [1] [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5946724]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1746) Add linear discriminant analysis to machine learning library

2015-04-20 Thread Raghav Chalapathy (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502576#comment-14502576
 ] 

Raghav Chalapathy commented on FLINK-1746:
--

Hi Till,

I would like to start working on this issue. Could you please assign it to me?

With regards,
Raghav


> Add linear discriminant analysis to machine learning library
> 
>
> Key: FLINK-1746
> URL: https://issues.apache.org/jira/browse/FLINK-1746
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>  Labels: ML
>
> Linear discriminant analysis (LDA) [1] is used for dimensionality reduction 
> prior to classification. But it can also be used to calculate a linear 
> classifier on its own. Since dimensionality reduction is an important 
> preprocessing step, a distributed LDA implementation is a valuable addition 
> to flink-ml.
> Resources:
> [1] [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5946724]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1745) Add k-nearest-neighbours algorithm to machine learning library

2015-04-20 Thread Raghav Chalapathy (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502562#comment-14502562
 ] 

Raghav Chalapathy commented on FLINK-1745:
--

Hi Chiwan,

I would like to collaborate with you on the development of this new feature. 
Shall we discuss in detail how we can go about it?

With regards,
Raghav


> Add k-nearest-neighbours algorithm to machine learning library
> --
>
> Key: FLINK-1745
> URL: https://issues.apache.org/jira/browse/FLINK-1745
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>  Labels: ML, Starter
>
> Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial, 
> it is still used as a means to classify data and to do regression.
> Could be a starter task.
> Resources:
> [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm]
> [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-377) Create a general purpose framework for language bindings

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502558#comment-14502558
 ] 

ASF GitHub Bot commented on FLINK-377:
--

Github user zentol commented on the pull request:

https://github.com/apache/flink/pull/202#issuecomment-94403584
  
yes that is correct.


> Create a general purpose framework for language bindings
> 
>
> Key: FLINK-377
> URL: https://issues.apache.org/jira/browse/FLINK-377
> Project: Flink
>  Issue Type: Improvement
>Reporter: GitHub Import
>Assignee: Chesnay Schepler
>  Labels: github-import
> Fix For: pre-apache
>
>
> A general purpose API to run operators with arbitrary binaries. 
> This will allow to run Stratosphere programs written in Python, JavaScript, 
> Ruby, Go or whatever you like. 
> We suggest using Google Protocol Buffers for data serialization. This is the 
> list of languages that currently support ProtoBuf: 
> https://code.google.com/p/protobuf/wiki/ThirdPartyAddOns 
> Very early prototype with python: 
> https://github.com/rmetzger/scratch/tree/learn-protobuf (basically testing 
> protobuf)
> For Ruby: https://github.com/infochimps-labs/wukong
> Two new students working at Stratosphere (@skunert and @filiphaase) are 
> working on this.
> The reference binding language will be for Python, but other bindings are 
> very welcome.
> The best name for this so far is "stratosphere-lang-bindings".
> I created this issue to track the progress (and give everybody a chance to 
> comment on this)
>  Imported from GitHub 
> Url: https://github.com/stratosphere/stratosphere/issues/377
> Created by: [rmetzger|https://github.com/rmetzger]
> Labels: enhancement, 
> Assignee: [filiphaase|https://github.com/filiphaase]
> Created at: Tue Jan 07 19:47:20 CET 2014
> State: open



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-377] [FLINK-671] Generic Interface / PA...

2015-04-20 Thread zentol
Github user zentol commented on the pull request:

https://github.com/apache/flink/pull/202#issuecomment-94403584
  
yes that is correct.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-377) Create a general purpose framework for language bindings

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502555#comment-14502555
 ] 

ASF GitHub Bot commented on FLINK-377:
--

Github user aljoscha commented on the pull request:

https://github.com/apache/flink/pull/202#issuecomment-94402978
  
I was referring to the way that communication is handled between the Java 
host and the generic language client: communication between them is not based 
on a fixed set of messages (for example, messages defined using something like 
Protobuf or Avro); instead, the knowledge about how messages are structured 
is implicit in the code that does the messaging. So the Java side expects a 
sequence of primitives (integers, strings) in a certain order, and the Python 
side knows that order and sends them accordingly.
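A tiny sketch of such an implicit wire protocol (field names and layout are made up for illustration; this is not the actual Flink code). Nothing in the byte stream describes the layout; both sides simply have to agree on the write order by convention:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Writer side: emits an operator id (int) followed by a payload (UTF string).
def writeMessage(operatorId: Int, payload: String): Array[Byte] = {
  val buf = new ByteArrayOutputStream()
  val out = new DataOutputStream(buf)
  out.writeInt(operatorId)
  out.writeUTF(payload)
  out.flush()
  buf.toByteArray
}

// Reader side: must read the fields back in exactly the same order,
// or it silently misinterprets the bytes -- the fragility being discussed.
def readMessage(bytes: Array[Byte]): (Int, String) = {
  val in = new DataInputStream(new ByteArrayInputStream(bytes))
  val id = in.readInt()
  val payload = in.readUTF()
  (id, payload)
}
```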


> Create a general purpose framework for language bindings
> 
>
> Key: FLINK-377
> URL: https://issues.apache.org/jira/browse/FLINK-377
> Project: Flink
>  Issue Type: Improvement
>Reporter: GitHub Import
>Assignee: Chesnay Schepler
>  Labels: github-import
> Fix For: pre-apache
>
>
> A general purpose API to run operators with arbitrary binaries. 
> This will allow to run Stratosphere programs written in Python, JavaScript, 
> Ruby, Go or whatever you like. 
> We suggest using Google Protocol Buffers for data serialization. This is the 
> list of languages that currently support ProtoBuf: 
> https://code.google.com/p/protobuf/wiki/ThirdPartyAddOns 
> Very early prototype with python: 
> https://github.com/rmetzger/scratch/tree/learn-protobuf (basically testing 
> protobuf)
> For Ruby: https://github.com/infochimps-labs/wukong
> Two new students working at Stratosphere (@skunert and @filiphaase) are 
> working on this.
> The reference binding language will be for Python, but other bindings are 
> very welcome.
> The best name for this so far is "stratosphere-lang-bindings".
> I created this issue to track the progress (and give everybody a chance to 
> comment on this)
>  Imported from GitHub 
> Url: https://github.com/stratosphere/stratosphere/issues/377
> Created by: [rmetzger|https://github.com/rmetzger]
> Labels: enhancement, 
> Assignee: [filiphaase|https://github.com/filiphaase]
> Created at: Tue Jan 07 19:47:20 CET 2014
> State: open



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-377] [FLINK-671] Generic Interface / PA...

2015-04-20 Thread aljoscha
Github user aljoscha commented on the pull request:

https://github.com/apache/flink/pull/202#issuecomment-94402978
  
I was referring to the way that communication is handled between the Java 
host and the generic language client: communication between them is not based 
on a fixed set of messages (for example, messages defined using something like 
Protobuf or Avro); instead, the knowledge about how messages are structured 
is implicit in the code that does the messaging. So the Java side expects a 
sequence of primitives (integers, strings) in a certain order, and the Python 
side knows that order and sends them accordingly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-377) Create a general purpose framework for language bindings

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502551#comment-14502551
 ] 

ASF GitHub Bot commented on FLINK-377:
--

Github user zentol commented on the pull request:

https://github.com/apache/flink/pull/202#issuecomment-94402062
  
@aljoscha The timeout is removed. Data transfer is still done with memory-mapped 
files; access to these files is synchronized over TCP. I'm not sure what you 
mean by your last sentence.
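For readers unfamiliar with the mapped-file data path being discussed, here is a bare-bones, single-process sketch of writing through one mapping of a file and reading it back through a second mapping, mimicking two processes sharing the region (illustrative only; the actual PR additionally synchronizes access over TCP):

```scala
import java.io.RandomAccessFile
import java.nio.channels.FileChannel

// Create a small scratch file to map.
val file = java.io.File.createTempFile("flink-mmap-demo", ".bin")
file.deleteOnExit()

// Writer side: map the file read-write and store an int at offset 0.
val writerChan = new RandomAccessFile(file, "rw").getChannel
val writeBuf = writerChan.map(FileChannel.MapMode.READ_WRITE, 0, 8)
writeBuf.putInt(0, 42) // absolute put: no flip/rewind needed
writerChan.close()

// Reader side: a separate read-only mapping of the same file sees the value.
val readerChan = new RandomAccessFile(file, "r").getChannel
val readBuf = readerChan.map(FileChannel.MapMode.READ_ONLY, 0, 8)
val value = readBuf.getInt(0)
readerChan.close()
```

The missing piece, which the TCP channel supplies in the PR, is telling the other side *when* the data in the shared region is ready to read.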


> Create a general purpose framework for language bindings
> 
>
> Key: FLINK-377
> URL: https://issues.apache.org/jira/browse/FLINK-377
> Project: Flink
>  Issue Type: Improvement
>Reporter: GitHub Import
>Assignee: Chesnay Schepler
>  Labels: github-import
> Fix For: pre-apache
>
>
> A general purpose API to run operators with arbitrary binaries. 
> This will allow to run Stratosphere programs written in Python, JavaScript, 
> Ruby, Go or whatever you like. 
> We suggest using Google Protocol Buffers for data serialization. This is the 
> list of languages that currently support ProtoBuf: 
> https://code.google.com/p/protobuf/wiki/ThirdPartyAddOns 
> Very early prototype with python: 
> https://github.com/rmetzger/scratch/tree/learn-protobuf (basically testing 
> protobuf)
> For Ruby: https://github.com/infochimps-labs/wukong
> Two new students working at Stratosphere (@skunert and @filiphaase) are 
> working on this.
> The reference binding language will be for Python, but other bindings are 
> very welcome.
> The best name for this so far is "stratosphere-lang-bindings".
> I created this issue to track the progress (and give everybody a chance to 
> comment on this)
>  Imported from GitHub 
> Url: https://github.com/stratosphere/stratosphere/issues/377
> Created by: [rmetzger|https://github.com/rmetzger]
> Labels: enhancement, 
> Assignee: [filiphaase|https://github.com/filiphaase]
> Created at: Tue Jan 07 19:47:20 CET 2014
> State: open



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-377] [FLINK-671] Generic Interface / PA...

2015-04-20 Thread zentol
Github user zentol commented on the pull request:

https://github.com/apache/flink/pull/202#issuecomment-94402062
  
@aljoscha The timeout is removed. Data transfer is still done with memory-mapped 
files; access to these files is synchronized over TCP. I'm not sure what you 
mean by your last sentence.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-377) Create a general purpose framework for language bindings

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502545#comment-14502545
 ] 

ASF GitHub Bot commented on FLINK-377:
--

Github user rmetzger commented on the pull request:

https://github.com/apache/flink/pull/202#issuecomment-94401023
  
I wrote to the mailinglist to discuss the state of this PR


> Create a general purpose framework for language bindings
> 
>
> Key: FLINK-377
> URL: https://issues.apache.org/jira/browse/FLINK-377
> Project: Flink
>  Issue Type: Improvement
>Reporter: GitHub Import
>Assignee: Chesnay Schepler
>  Labels: github-import
> Fix For: pre-apache
>
>
> A general purpose API to run operators with arbitrary binaries. 
> This will allow to run Stratosphere programs written in Python, JavaScript, 
> Ruby, Go or whatever you like. 
> We suggest using Google Protocol Buffers for data serialization. This is the 
> list of languages that currently support ProtoBuf: 
> https://code.google.com/p/protobuf/wiki/ThirdPartyAddOns 
> Very early prototype with python: 
> https://github.com/rmetzger/scratch/tree/learn-protobuf (basically testing 
> protobuf)
> For Ruby: https://github.com/infochimps-labs/wukong
> Two new students working at Stratosphere (@skunert and @filiphaase) are 
> working on this.
> The reference binding language will be for Python, but other bindings are 
> very welcome.
> The best name for this so far is "stratosphere-lang-bindings".
> I created this issue to track the progress (and give everybody a chance to 
> comment on this)
>  Imported from GitHub 
> Url: https://github.com/stratosphere/stratosphere/issues/377
> Created by: [rmetzger|https://github.com/rmetzger]
> Labels: enhancement, 
> Assignee: [filiphaase|https://github.com/filiphaase]
> Created at: Tue Jan 07 19:47:20 CET 2014
> State: open



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1297) Add support for tracking statistics of intermediate results

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502547#comment-14502547
 ] 

ASF GitHub Bot commented on FLINK-1297:
---

Github user tammymendt commented on the pull request:

https://github.com/apache/flink/pull/605#issuecomment-94401122
  
Hey, thanks for the quick feedback. I agree with separating the 
contribution out as a library; however, I have a question. The reason I used a 
specialized method in the RuntimeEnvironment is that for the 
OperatorStatsAccumulator we do not only want to track the accumulated 
statistics per job, but also keep an array of the statistics collected at each 
task. The subtaskIndex, which is part of the RuntimeEnvironment, is passed as a 
parameter to the constructor of the accumulator so that we can track which 
subtask is associated with which stats. Any ideas as to how I could use the same 
accumulator infrastructure but still be able to associate an accumulated value 
with a given subtask?
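One hypothetical shape for such an accumulator (names and the long-count payload are made up for illustration; this is not the code in the PR): keep a per-subtask map, so the global view is the merged map while each subtask's contribution stays addressable by its index:

```scala
import scala.collection.mutable

// Illustrative per-subtask statistics accumulator.
class SubtaskStatsAccumulator {
  val statsBySubtask: mutable.Map[Int, Long] = mutable.Map.empty

  // Called on the task side; subtaskIndex would come from the runtime context.
  def add(subtaskIndex: Int, count: Long): Unit =
    statsBySubtask(subtaskIndex) = statsBySubtask.getOrElse(subtaskIndex, 0L) + count

  // Called when combining partial accumulators, e.g. on the job manager side.
  def merge(other: SubtaskStatsAccumulator): Unit =
    other.statsBySubtask.foreach { case (idx, c) =>
      statsBySubtask(idx) = statsBySubtask.getOrElse(idx, 0L) + c
    }

  // Job-wide aggregate over all subtasks.
  def total: Long = statsBySubtask.values.sum
}
```

Because the merge keys on the subtask index, this shape would fit the generic accumulator merge protocol without a specialized RuntimeEnvironment hook.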


> Add support for tracking statistics of intermediate results
> ---
>
> Key: FLINK-1297
> URL: https://issues.apache.org/jira/browse/FLINK-1297
> Project: Flink
>  Issue Type: Improvement
>  Components: Distributed Runtime
>Reporter: Alexander Alexandrov
>Assignee: Alexander Alexandrov
> Fix For: 0.9
>
>   Original Estimate: 1,008h
>  Remaining Estimate: 1,008h
>
> One of the major problems related to the optimizer at the moment is the lack 
> of proper statistics.
> With the introduction of staged execution, it is possible to instrument the 
> runtime code with a statistics facility that collects the required 
> information for optimizing the next execution stage.
> I would therefore like to contribute code that can be used to gather basic 
> statistics for the (intermediate) result of dataflows (e.g. min, max, count, 
> count distinct) and make them available to the job manager.
> Before I start, I would like to hear some feedback form the other users.
> In particular, to handle skew (e.g. on grouping) it might be good to have 
> some sort of detailed sketch about the key distribution of an intermediate 
> result. I am not sure whether a simple histogram is the most effective way to 
> go. Maybe somebody would propose another lightweight sketch that provides 
> better accuracy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-1297] Added OperatorStatsAccumulator fo...

2015-04-20 Thread tammymendt
Github user tammymendt commented on the pull request:

https://github.com/apache/flink/pull/605#issuecomment-94401122
  
Hey, thanks for the quick feedback. I agree with separating the 
contribution out as a library; however, I have a question. The reason I used a 
specialized method in the RuntimeEnvironment is that for the 
OperatorStatsAccumulator we do not only want to track the accumulated 
statistics per job, but also keep an array of the statistics collected at each 
task. The subtaskIndex, which is part of the RuntimeEnvironment, is passed as a 
parameter to the constructor of the accumulator so that we can track which 
subtask is associated with which stats. Any ideas as to how I could use the same 
accumulator infrastructure but still be able to associate an accumulated value 
with a given subtask?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-1908) JobManager startup delay isn't considered when using start-cluster.sh script

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502546#comment-14502546
 ] 

ASF GitHub Bot commented on FLINK-1908:
---

Github user mxm commented on the pull request:

https://github.com/apache/flink/pull/609#issuecomment-94401080
  
Thanks for the pull request. Seems to work fine. I was wondering, shouldn't 
the task managers repeatedly try to establish a connection to the job manager? 
To me, that seems like a nicer way to solve this problem. That way, the 
startup script doesn't need to be aware of the job manager's RPC port.
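The port check that either approach needs boils down to a connect-with-timeout probe. A sketch (illustrative only; not the script's or PR's actual code):

```scala
import java.net.{InetSocketAddress, Socket}

// Returns true if a TCP connection to host:port succeeds within timeoutMs.
// A startup script (or a retrying task manager) would poll this in a loop
// until the JobManager's RPC port answers.
def isPortOpen(host: String, port: Int, timeoutMs: Int = 500): Boolean = {
  val socket = new Socket()
  try {
    socket.connect(new InetSocketAddress(host, port), timeoutMs)
    true
  } catch {
    case _: java.io.IOException => false
  } finally {
    socket.close()
  }
}
```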


> JobManager startup delay isn't considered when using start-cluster.sh script
> 
>
> Key: FLINK-1908
> URL: https://issues.apache.org/jira/browse/FLINK-1908
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Runtime
>Affects Versions: 0.9, 0.8.1
> Environment: Linux
>Reporter: Lukas Raska
>Priority: Minor
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> When starting a Flink cluster via the start-cluster.sh script, JobManager startup 
> can be delayed (as it is started asynchronously), which can result in the failed 
> startup of several task managers.
> The solution is to wait a certain amount of time, periodically checking whether 
> the RPC port is accessible, and then proceed with starting the task managers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] flink pull request: [FLINK-1908] JobManager startup delay isn't co...

2015-04-20 Thread mxm
Github user mxm commented on the pull request:

https://github.com/apache/flink/pull/609#issuecomment-94401080
  
Thanks for the pull request. Seems to work fine. I was wondering, shouldn't 
the task managers repeatedly try to establish a connection to the job manager? 
To me, that seems like a nicer way to solve this problem. That way, the 
startup script doesn't need to be aware of the job manager's RPC port.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-377) Create a general purpose framework for language bindings

2015-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502544#comment-14502544
 ] 

ASF GitHub Bot commented on FLINK-377:
--

Github user aljoscha commented on the pull request:

https://github.com/apache/flink/pull/202#issuecomment-94400959
  
I'll test it again on a cluster. Could you please elaborate a bit: is the 
timeout still in? Is communication through TCP instead of the mapped files, but 
still with the same basic interface of writing primitive values for 
communication?


> Create a general purpose framework for language bindings
> 
>
> Key: FLINK-377
> URL: https://issues.apache.org/jira/browse/FLINK-377
> Project: Flink
>  Issue Type: Improvement
>Reporter: GitHub Import
>Assignee: Chesnay Schepler
>  Labels: github-import
> Fix For: pre-apache
>
>
> A general purpose API to run operators with arbitrary binaries. 
> This will allow to run Stratosphere programs written in Python, JavaScript, 
> Ruby, Go or whatever you like. 
> We suggest using Google Protocol Buffers for data serialization. This is the 
> list of languages that currently support ProtoBuf: 
> https://code.google.com/p/protobuf/wiki/ThirdPartyAddOns 
> Very early prototype with python: 
> https://github.com/rmetzger/scratch/tree/learn-protobuf (basically testing 
> protobuf)
> For Ruby: https://github.com/infochimps-labs/wukong
> Two new students working at Stratosphere (@skunert and @filiphaase) are 
> working on this.
> The reference binding language will be for Python, but other bindings are 
> very welcome.
> The best name for this so far is "stratosphere-lang-bindings".
> I created this issue to track the progress (and give everybody a chance to 
> comment on this)
>  Imported from GitHub 
> Url: https://github.com/stratosphere/stratosphere/issues/377
> Created by: [rmetzger|https://github.com/rmetzger]
> Labels: enhancement, 
> Assignee: [filiphaase|https://github.com/filiphaase]
> Created at: Tue Jan 07 19:47:20 CET 2014
> State: open



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
