[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596381#comment-14596381 ]

Robert Metzger commented on FLINK-1319:
---------------------------------------

I guess we can close this issue? (and FLINK-536 as well)

Add static code analysis for UDFs
---------------------------------

Key: FLINK-1319
URL: https://issues.apache.org/jira/browse/FLINK-1319
Project: Flink
Issue Type: New Feature
Components: Java API, Scala API
Reporter: Stephan Ewen
Assignee: Timo Walther
Priority: Minor

Flink's Optimizer uses information that tells it, for each UDF, which fields of the input elements are accessed, modified, or forwarded/copied. This information frequently helps to reuse partitionings, sorts, etc. It may speed up programs significantly, as it can frequently eliminate sorts and shuffles, which are costly.

Right now, users can add lightweight annotations to UDFs to provide this information (such as adding {{@ConstantFields("0->3, 1, 2->1")}}).

We worked with static code analysis of UDFs before, to determine this information automatically. This is an incredible feature, as it magically makes programs faster. For record-at-a-time operations (Map, Reduce, FlatMap, Join, Cross), this works surprisingly well in many cases. We used the Soot toolkit for the static code analysis. Unfortunately, Soot is LGPL licensed and thus we did not include any of the code so far.

I propose to add this functionality to Flink in the form of a drop-in addition, to work around the LGPL incompatibility with ASL 2.0. Users could simply download a special flink-code-analysis.jar and drop it into the lib folder to enable this functionality. We may even add a script to tools that downloads that library automatically into the lib folder. This should be legally fine, since we do not redistribute LGPL code and only dynamically link it (the incompatibility with ASL 2.0 is mainly in the patentability, if I remember correctly).

Prior work on this has been done by [~aljoscha] and [~skunert], which could provide a code base to start with.

*Appendix*

Homepage of the Soot static analysis toolkit: http://www.sable.mcgill.ca/soot/

Papers on static analysis for optimization: http://stratosphere.eu/assets/papers/EnablingOperatorReorderingSCA_12.pdf and http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf

Quick introduction to the Optimizer: http://stratosphere.eu/assets/papers/2014-VLDBJ_Stratosphere_Overview.pdf (Section 6)

Optimizer for Iterations: http://stratosphere.eu/assets/papers/spinningFastIterativeDataFlows_12.pdf (Sections 4.3 and 5.3)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
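To make the idea in the issue description concrete, here is a minimal sketch of how forwarded-field information lets an optimizer decide whether an existing partitioning or sort survives a UDF. It is a toy model only: the class `ForwardedFieldsSketch`, its methods, and the expression `"0->2, 1"` are hypothetical and are not Flink's actual classes or parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.OptionalInt;

/** Toy model of forwarded-field information. Hypothetical names; not Flink's classes. */
final class ForwardedFieldsSketch {

    /** Parses an expression such as "0->2, 1" into source index -> target index. */
    static Map<Integer, Integer> parse(String expr) {
        Map<Integer, Integer> forwarded = new LinkedHashMap<>();
        for (String part : expr.split(",")) {
            String p = part.trim();
            if (p.isEmpty()) {
                continue;
            }
            String[] sides = p.split("->");
            int source = Integer.parseInt(sides[0].trim());
            // A bare index such as "1" means the field keeps its position.
            int target = sides.length == 2 ? Integer.parseInt(sides[1].trim()) : source;
            forwarded.put(source, target);
        }
        return forwarded;
    }

    /** Output position of a key the input is partitioned or sorted on, if the UDF forwards it. */
    static OptionalInt survivingKey(Map<Integer, Integer> forwarded, int inputKey) {
        Integer target = forwarded.get(inputKey);
        return target == null ? OptionalInt.empty() : OptionalInt.of(target);
    }

    public static void main(String[] args) {
        Map<Integer, Integer> forwarded = parse("0->2, 1");
        // Field 0 is forwarded to position 2, so a partitioning on it could be reused.
        System.out.println(survivingKey(forwarded, 0));
        // Field 3 is not declared as forwarded, so a re-partitioning would be needed.
        System.out.println(survivingKey(forwarded, 3));
    }
}
```

Static analysis, as proposed in the issue, would aim to derive such a mapping from the UDF's code instead of requiring the user to write the annotation.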
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576642#comment-14576642 ]

ASF GitHub Bot commented on FLINK-1319:
---------------------------------------

Github user twalthr commented on the pull request:

    https://github.com/apache/flink/pull/729#issuecomment-109875609

    Great news :) Thanks Ufuk!
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574191#comment-14574191 ]

ASF GitHub Bot commented on FLINK-1319:
---------------------------------------

Github user twalthr commented on a diff in the pull request:

    https://github.com/apache/flink/pull/729#discussion_r31798470

    --- Diff: flink-java/src/main/java/org/apache/flink/api/java/functions/SemanticPropUtil.java ---
    @@ -309,15 +308,20 @@ public static DualInputSemanticProperties getSemanticPropsDual(
                 getSemanticPropsDualFromString(result, forwardedFirst, forwardedSecond, nonForwardedFirst, nonForwardedSecond, readFirst, readSecond, inType1, inType2, outType);
                 return result;
    -        } else {
    -            return new DualInputSemanticProperties();
             }
    +        return null;
    +    }
    +
    +    public static void getSemanticPropsSingleFromString(SingleInputSemanticProperties result,
    +            String[] forwarded, String[] nonForwarded, String[] readSet,
    +            TypeInformation<?> inType, TypeInformation<?> outType) {
    +        getSemanticPropsSingleFromString(result, forwarded, nonForwarded, readSet, inType, outType, false);
         }

         public static void getSemanticPropsSingleFromString(SingleInputSemanticProperties result,
                 String[] forwarded, String[] nonForwarded, String[] readSet,
    -            TypeInformation<?> inType, TypeInformation<?> outType)
    -    {
    +            TypeInformation<?> inType, TypeInformation<?> outType,
    +            boolean skipIncompatibleTypes) {
    --- End diff --

    Sometimes the analyzer works better than required. E.g. the analyzer outputs @ForwardedFields("*->record.customer.name"), but if customer is a GenericType in the output type, the types are incompatible. I thought it was better to reuse the type compatibility checking of the PropUtil than to reimplement everything, but to skip types that are incompatible without throwing exceptions.
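The behaviour described for the `skipIncompatibleTypes` flag can be pictured with a small, self-contained sketch. It assumes a much simpler model than Flink's: field types are plain strings and "compatible" just means equal. The class `SkipIncompatibleSketch` and the method `acceptForwarded` are hypothetical and do not reproduce the code in `SemanticPropUtil`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration of strict vs. lenient handling of forwarded-field candidates. */
final class SkipIncompatibleSketch {

    /**
     * Keeps the candidate fields whose input and output types match.
     * Strict mode (flag false) throws on a mismatch; lenient mode (flag true) drops the field.
     */
    static List<String> acceptForwarded(List<String> candidates,
            Map<String, String> inTypes, Map<String, String> outTypes,
            boolean skipIncompatibleTypes) {
        List<String> accepted = new ArrayList<>();
        for (String field : candidates) {
            String in = inTypes.get(field);
            boolean compatible = in != null && in.equals(outTypes.get(field));
            if (compatible) {
                accepted.add(field);
            } else if (!skipIncompatibleTypes) {
                throw new IllegalArgumentException("Incompatible types for forwarded field: " + field);
            }
            // With the flag set, an incompatible field is silently left out.
        }
        return accepted;
    }

    public static void main(String[] args) {
        Map<String, String> inTypes = new HashMap<>();
        inTypes.put("id", "Long");
        inTypes.put("customer", "Customer");
        Map<String, String> outTypes = new HashMap<>();
        outTypes.put("id", "Long");
        outTypes.put("customer", "GenericType");

        // Candidates as an analyzer might report them (assumed input).
        List<String> fromAnalyzer = Arrays.asList("id", "customer");
        System.out.println(acceptForwarded(fromAnalyzer, inTypes, outTypes, true));
    }
}
```

The design point is that user-written annotations can stay strict (a mismatch is a user error worth reporting), while analyzer output can be filtered leniently, since an over-eager analyzer result should not fail the program.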
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574204#comment-14574204 ]

ASF GitHub Bot commented on FLINK-1319:
---------------------------------------

Github user fhueske commented on a diff in the pull request:

    https://github.com/apache/flink/pull/729#discussion_r31799272

    --- Diff: flink-java/src/main/java/org/apache/flink/api/java/functions/SemanticPropUtil.java ---
    @@ -309,15 +308,20 @@ public static DualInputSemanticProperties getSemanticPropsDual(
                 getSemanticPropsDualFromString(result, forwardedFirst, forwardedSecond, nonForwardedFirst, nonForwardedSecond, readFirst, readSecond, inType1, inType2, outType);
                 return result;
    -        } else {
    -            return new DualInputSemanticProperties();
             }
    +        return null;
    +    }
    +
    +    public static void getSemanticPropsSingleFromString(SingleInputSemanticProperties result,
    +            String[] forwarded, String[] nonForwarded, String[] readSet,
    +            TypeInformation<?> inType, TypeInformation<?> outType) {
    +        getSemanticPropsSingleFromString(result, forwarded, nonForwarded, readSet, inType, outType, false);
         }

         public static void getSemanticPropsSingleFromString(SingleInputSemanticProperties result,
                 String[] forwarded, String[] nonForwarded, String[] readSet,
    -            TypeInformation<?> inType, TypeInformation<?> outType)
    -    {
    +            TypeInformation<?> inType, TypeInformation<?> outType,
    +            boolean skipIncompatibleTypes) {
    --- End diff --

    OK, got it. Thanks for explaining. :-)
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572454#comment-14572454 ]

ASF GitHub Bot commented on FLINK-1319:
---------------------------------------

Github user fhueske commented on a diff in the pull request:

    https://github.com/apache/flink/pull/729#discussion_r31706246

    --- Diff: flink-java/src/main/java/org/apache/flink/api/java/operators/SingleInputUdfOperator.java ---
    @@ -54,8 +54,11 @@
         private Map<String, DataSet<?>> broadcastVariables;

    +    // NOTE: only set this variable via setSemanticProperties()
    --- End diff --

    Just a quick question (haven't checked the code). Does the analyzer also respect semantic information provided via the Operator API (withForwardedFields()), i.e., not via function annotations?
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572472#comment-14572472 ]

ASF GitHub Bot commented on FLINK-1319:
---------------------------------------

Github user twalthr commented on a diff in the pull request:

    https://github.com/apache/flink/pull/729#discussion_r31707016

    --- Diff: flink-java/src/main/java/org/apache/flink/api/java/operators/SingleInputUdfOperator.java ---
    @@ -54,8 +54,11 @@
         private Map<String, DataSet<?>> broadcastVariables;

    +    // NOTE: only set this variable via setSemanticProperties()
    --- End diff --

    Yes, I have also added a test for that (see SemanticPropertiesPrecendenceTest).
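One way to picture what "respecting" user-provided semantic information could mean is a simple precedence rule: anything the user declared wins, and the analyzer result is only a fallback. The sketch below assumes exactly that rule and models semantic properties as plain strings; `PrecedenceSketch` and `effectiveForwardedFields` are hypothetical names, not the operator code touched by the pull request.

```java
import java.util.Optional;

/** Toy precedence rule: user-provided information wins over analyzer output. */
final class PrecedenceSketch {

    /**
     * Returns the forwarded-field expression to hand to the optimizer.
     * An empty string stands for "no information available" in this sketch.
     */
    static String effectiveForwardedFields(Optional<String> userProvided, Optional<String> analyzed) {
        return userProvided.orElseGet(() -> analyzed.orElse(""));
    }

    public static void main(String[] args) {
        // The user declared something, so the analyzer result is ignored.
        System.out.println(effectiveForwardedFields(Optional.of("f0->f1"), Optional.of("f0")));
        // Nothing declared by the user, so the analyzer result is used.
        System.out.println(effectiveForwardedFields(Optional.empty(), Optional.of("f0")));
    }
}
```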
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572455#comment-14572455 ]

ASF GitHub Bot commented on FLINK-1319:
---------------------------------------

Github user uce commented on the pull request:

    https://github.com/apache/flink/pull/729#issuecomment-108808990

    The build succeeds. :) I will have a look at the changes. Thanks for not force updating this PR.

    I will test it in a distributed setup and if everything runs fine, we can merge this. :-)
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572561#comment-14572561 ]

ASF GitHub Bot commented on FLINK-1319:
---------------------------------------

Github user fhueske commented on a diff in the pull request:

    https://github.com/apache/flink/pull/729#discussion_r31711766

    --- Diff: flink-java/src/main/java/org/apache/flink/api/java/functions/SemanticPropUtil.java ---
    @@ -309,15 +308,20 @@ public static DualInputSemanticProperties getSemanticPropsDual(
                 getSemanticPropsDualFromString(result, forwardedFirst, forwardedSecond, nonForwardedFirst, nonForwardedSecond, readFirst, readSecond, inType1, inType2, outType);
                 return result;
    -        } else {
    -            return new DualInputSemanticProperties();
             }
    +        return null;
    +    }
    +
    +    public static void getSemanticPropsSingleFromString(SingleInputSemanticProperties result,
    +            String[] forwarded, String[] nonForwarded, String[] readSet,
    +            TypeInformation<?> inType, TypeInformation<?> outType) {
    +        getSemanticPropsSingleFromString(result, forwarded, nonForwarded, readSet, inType, outType, false);
         }

         public static void getSemanticPropsSingleFromString(SingleInputSemanticProperties result,
                 String[] forwarded, String[] nonForwarded, String[] readSet,
    -            TypeInformation<?> inType, TypeInformation<?> outType)
    -    {
    +            TypeInformation<?> inType, TypeInformation<?> outType,
    +            boolean skipIncompatibleTypes) {
    --- End diff --

    Can you explain a bit why you introduced this flag? Why should it be possible to skip the compatibility checks?
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14571809#comment-14571809 ]

ASF GitHub Bot commented on FLINK-1319:
---------------------------------------

Github user twalthr commented on the pull request:

    https://github.com/apache/flink/pull/729#issuecomment-108640926

    Hey Ufuk, thank you very much for reviewing my code and all others for the feedback!

    I tried to consider all your feedback (I hope I didn't forget anything). I did a large refactoring again, added some comments to important parts of the code and fixed some bugs. I also added some additional test cases. I hope the PR is now ready to be merged (if the build succeeds) :)
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14571743#comment-14571743 ]

ASF GitHub Bot commented on FLINK-1319:
---------------------------------------

Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/729#issuecomment-108630545

    Great review, Ufuk.

    I agree with @uce and @rmetzger to add a comment how to disable it (in case).
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570367#comment-14570367 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user twalthr commented on a diff in the pull request:

    https://github.com/apache/flink/pull/729#discussion_r31597259

    --- Diff: flink-java/src/main/java/org/apache/flink/api/java/operators/SingleInputUdfOperator.java ---
    @@ -54,8 +54,11 @@

         private Map<String, DataSet<?>> broadcastVariables;

    +    // NOTE: only set this variable via setSemanticProperties()
    --- End diff --

    Manual annotations should always trump optimizer annotations. The analyzer cannot determine all semantic properties, e.g. when KeySelectors are used. The user should still be able to override the semantic properties in order to add more of them.
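The precedence rule described in this comment can be illustrated with a minimal sketch. It is not the code from the pull request: the class `SemanticPrecedenceSketch`, the method `resolve`, and the use of plain strings in place of Flink's semantic properties objects are all assumptions made for illustration. `Optional` is used here so that "nothing known" is an explicit empty value rather than `null`.

```java
import java.util.Optional;

// Hypothetical sketch; none of these names exist in Flink.
final class SemanticPrecedenceSketch {

    // Returns the manually annotated properties when they exist, otherwise
    // whatever the analyzer inferred. The result is empty when neither
    // source provides anything.
    static Optional<String> resolve(Optional<String> manual, Optional<String> analyzed) {
        return manual.isPresent() ? manual : analyzed;
    }

    public static void main(String[] args) {
        // Manual annotation present: the analyzer result is ignored.
        System.out.println(resolve(Optional.of("manual"), Optional.of("analyzed")));
        // No manual annotation: fall back to the analyzer result.
        System.out.println(resolve(Optional.empty(), Optional.of("analyzed")));
        // Neither source has information.
        System.out.println(resolve(Optional.empty(), Optional.empty()));
    }
}
```

Keeping the decision in one small function means the operator never has to distinguish "not analyzed" from "analyzed but overridden" at the call site.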
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569669#comment-14569669 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user rmetzger commented on the pull request:

    https://github.com/apache/flink/pull/729#issuecomment-108081964

    How about extending the `UDF contains obvious errors` message with some notes on how to completely disable the SCA? I fear that the message will appear (and block program execution) due to a bug in the SCA, and then users won't know how to get their programs to run.
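One possible shape for such a message is sketched below. The class `AnalysisErrorMessageSketch`, the method `withDisableHint`, and the wording of the hint are hypothetical; the thread does not say how the analysis is switched off, so that instruction is passed in as a parameter rather than guessed.

```java
// Hypothetical sketch; not the message text used in the pull request.
final class AnalysisErrorMessageSketch {

    // Appends an opt-out hint to the analyzer's error message so that a
    // false positive does not leave the user without a way forward.
    static String withDisableHint(String problem, String howToDisable) {
        return problem
                + " If you believe this is caused by a bug in the code analysis,"
                + " you can disable the analysis: " + howToDisable;
    }

    public static void main(String[] args) {
        System.out.println(withDisableHint(
                "UDF contains obvious errors.",
                "<instruction for disabling the analysis>"));
    }
}
```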
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569723#comment-14569723 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user uce commented on the pull request:

    https://github.com/apache/flink/pull/729#issuecomment-108093699

    I agree with Robert, but for the initial version it won't matter, as it should be disabled anyway.
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568988#comment-14568988 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user rmetzger commented on the pull request:

    https://github.com/apache/flink/pull/729#issuecomment-107924813

    I understand that concern. But output written to sysout will also be interleaved with the Client's sysout printing and the regular system logging. Also, I think it's very bad practice to print using System.out, because it's not controllable in any way. With log4j we can configure the analysis output the way we want. If you want the messages to look like regular sysout text, we can specify a custom output format for the classes in the sca Java package.
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569150#comment-14569150 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user uce commented on a diff in the pull request:

    https://github.com/apache/flink/pull/729#discussion_r31527840

    --- Diff: flink-core/src/main/java/org/apache/flink/api/common/UdfAnalysisMode.java ---
    @@ -0,0 +1,45 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.api.common;
    +
    +/**
    + * Specifies to which extent user-defined functions are analyzed in order
    + * to give the Flink optimizer an insight of UDF internals and inform
    + * the user about common implementation mistakes.
    + *
    + */
    +public enum UdfAnalysisMode {
    --- End diff --

    This is user-facing. I vote to rename it. @rmetzger agrees that, in his experience, the UDF part can be misleading. I understand why you chose this though... the operators make use of the UDF term all over the place. What about `CodeAnalysisMode`? After all, both this PR and the package are called *code analysis* and not UDF analysis.
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569154#comment-14569154 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user uce commented on a diff in the pull request:

    https://github.com/apache/flink/pull/729#discussion_r31528084

    --- Diff: flink-core/src/main/java/org/apache/flink/api/common/UdfAnalysisMode.java ---
    @@ -0,0 +1,45 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.api.common;
    +
    +/**
    + * Specifies to which extent user-defined functions are analyzed in order
    --- End diff --

    - I would make this more concrete. What about the list from your initial PR comment?
    - line 25: empty
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569152#comment-14569152 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user uce commented on a diff in the pull request:

    https://github.com/apache/flink/pull/729#discussion_r31527970

    --- Diff: flink-core/src/main/java/org/apache/flink/api/common/UdfAnalysisMode.java ---
    @@ -0,0 +1,45 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.api.common;
    +
    +/**
    + * Specifies to which extent user-defined functions are analyzed in order
    + * to give the Flink optimizer an insight of UDF internals and inform
    + * the user about common implementation mistakes.
    + *
    + */
    +public enum UdfAnalysisMode {
    +
    +    /**
    +     * UDF analysis does not take place.
    +     */
    +    DISABLED,
    +
    +    /**
    +     * Hints for improvement of the program are printed to the log.
    +     */
    +    HINTING_ENABLED,
    +
    +    /**
    +     * The program will be automatically optimized with knowledge from UDF
    +     * analysis.
    +     */
    +    OPTIMIZING_ENABLED;
    +
    +}
    --- End diff --

    Since the user will have to set this, what about keeping it short? `DISABLE`, `HINT`, `OPTIMIZE`?
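To make the naming suggestions in this review concrete, here is a minimal sketch of how the enum might look with the shorter constants and the proposed `CodeAnalysisMode`-style name. The type name `CodeAnalysisModeSketch` and the two helper methods are assumptions added for illustration; they are not part of the diff under review, and the final names were still being discussed at this point in the thread.

```java
// Hypothetical sketch of the renamed enum; helper methods are illustrative only.
enum CodeAnalysisModeSketch {

    /** Code analysis does not take place. */
    DISABLE,

    /** Hints for improvement of the program are printed to the log. */
    HINT,

    /** The program is automatically optimized with knowledge from the analysis. */
    OPTIMIZE;

    // True for every mode in which the analyzer is run at all.
    boolean runsAnalysis() {
        return this != DISABLE;
    }

    // True only when inferred properties are handed to the optimizer.
    boolean appliesResults() {
        return this == OPTIMIZE;
    }
}

final class CodeAnalysisModeDemo {
    public static void main(String[] args) {
        for (CodeAnalysisModeSketch mode : CodeAnalysisModeSketch.values()) {
            System.out.println(mode + ": runsAnalysis=" + mode.runsAnalysis()
                    + ", appliesResults=" + mode.appliesResults());
        }
    }
}
```

Putting the two checks on the enum itself would replace comparisons such as `mode != UdfAnalysisMode.DISABLED` at each call site, which is a design choice of this sketch rather than something proposed in the review.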
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569202#comment-14569202 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user uce commented on a diff in the pull request:

    https://github.com/apache/flink/pull/729#discussion_r31530742

    --- Diff: flink-java/src/main/java/org/apache/flink/api/java/operators/SingleInputUdfOperator.java ---
    @@ -54,8 +54,11 @@

         private Map<String, DataSet<?>> broadcastVariables;

    +    // NOTE: only set this variable via setSemanticProperties()
    --- End diff --

    I think this refactoring is quite fragile. The semantic properties utility does not return an empty properties object but null, and you take care of setting it correctly here depending on whether the forwarded fields have been set manually or not. If optimize is enabled and there are manual annotations, they will be overridden. I am wondering whether it is better to have manual annotations trump optimizer annotations. What's your opinion on this?
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569042#comment-14569042 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user rmetzger commented on the pull request:

    https://github.com/apache/flink/pull/729#issuecomment-107941730

    > Naming UDF. The feedback of committers giving talks about Flink some time ago was that the name UDF was sometimes confusing. @rmetzger, can you confirm this? We might take this into account and rename the UdfAnalysisMode to something else, for example just CodeAnalysisMode.

    That's right. People associate UDFs with SQL databases that allow users to pass in custom functions (which is right, but they start thinking Flink is a SQL database). In this case, it's not super critical because it's internal code.
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569242#comment-14569242 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user uce commented on a diff in the pull request:

    https://github.com/apache/flink/pull/729#discussion_r31532974

    --- Diff: flink-java/src/main/java/org/apache/flink/api/java/operators/UdfOperatorUtils.java ---
    @@ -0,0 +1,94 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.api.java.operators;
    +
    +import org.apache.flink.api.common.UdfAnalysisMode;
    +import org.apache.flink.api.common.functions.Function;
    +import org.apache.flink.api.common.functions.InvalidTypesException;
    +import org.apache.flink.api.common.operators.DualInputSemanticProperties;
    +import org.apache.flink.api.common.operators.SingleInputSemanticProperties;
    +import org.apache.flink.api.java.sca.UdfAnalyzer;
    +import org.apache.flink.api.java.sca.UdfAnalyzerException;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +public abstract class UdfOperatorUtils {
    +
    +    private static final Logger LOG = LoggerFactory.getLogger(UdfOperatorUtils.class);
    +
    +    public static void analyzeSingleInputUdf(SingleInputUdfOperator<?, ?, ?> operator, Class<?> udfBaseClass,
    +            Function udf, Keys<?> key) {
    +        final UdfAnalysisMode mode = operator.getExecutionEnvironment().getConfig().getUdfAnalysisMode();
    +        if (mode != UdfAnalysisMode.DISABLED) {
    +            try {
    +                final UdfAnalyzer analyzer = new UdfAnalyzer(udfBaseClass, udf.getClass(), operator.getInputType(), null,
    +                        operator.getResultType(), key, null, mode == UdfAnalysisMode.OPTIMIZING_ENABLED);
    +                final boolean success = analyzer.analyze();
    +                if (success) {
    +                    if (mode == UdfAnalysisMode.OPTIMIZING_ENABLED
    +                            && !operator.udfWithForwardedFieldsAnnotation(udf.getClass())) {
    +                        operator.setSemanticProperties((SingleInputSemanticProperties) analyzer.getSemanticProperties());
    +                        operator.setAnalyzedUdfSemanticsFlag();
    --- End diff --

    I think it would make sense to also print the inferred forwarded fields (at least for debugging purposes).
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569218#comment-14569218 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user uce commented on a diff in the pull request:

    https://github.com/apache/flink/pull/729#discussion_r31531758

    --- Diff: flink-java/src/main/java/org/apache/flink/api/java/operators/UdfOperatorUtils.java ---
    @@ -0,0 +1,94 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.api.java.operators;
    +
    +import org.apache.flink.api.common.UdfAnalysisMode;
    +import org.apache.flink.api.common.functions.Function;
    +import org.apache.flink.api.common.functions.InvalidTypesException;
    +import org.apache.flink.api.common.operators.DualInputSemanticProperties;
    +import org.apache.flink.api.common.operators.SingleInputSemanticProperties;
    +import org.apache.flink.api.java.sca.UdfAnalyzer;
    +import org.apache.flink.api.java.sca.UdfAnalyzerException;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +public abstract class UdfOperatorUtils {
    +
    +    private static final Logger LOG = LoggerFactory.getLogger(UdfOperatorUtils.class);
    +
    +    public static void analyzeSingleInputUdf(SingleInputUdfOperator<?, ?, ?> operator, Class<?> udfBaseClass,
    --- End diff --

    I vote to pass the name of the operator as well. The log output will then be more consistent. Currently the class name is used.
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569271#comment-14569271 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user uce commented on a diff in the pull request:

    https://github.com/apache/flink/pull/729#discussion_r31535397

    --- Diff: flink-java/src/main/java/org/apache/flink/api/java/operators/UdfOperatorUtils.java ---
    @@ -0,0 +1,94 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.api.java.operators;
    +
    +import org.apache.flink.api.common.UdfAnalysisMode;
    +import org.apache.flink.api.common.functions.Function;
    +import org.apache.flink.api.common.functions.InvalidTypesException;
    +import org.apache.flink.api.common.operators.DualInputSemanticProperties;
    +import org.apache.flink.api.common.operators.SingleInputSemanticProperties;
    +import org.apache.flink.api.java.sca.UdfAnalyzer;
    +import org.apache.flink.api.java.sca.UdfAnalyzerException;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +public abstract class UdfOperatorUtils {
    +
    +    private static final Logger LOG = LoggerFactory.getLogger(UdfOperatorUtils.class);
    +
    +    public static void analyzeSingleInputUdf(SingleInputUdfOperator<?, ?, ?> operator, Class<?> udfBaseClass,
    +            Function udf, Keys<?> key) {
    +        final UdfAnalysisMode mode = operator.getExecutionEnvironment().getConfig().getUdfAnalysisMode();
    +        if (mode != UdfAnalysisMode.DISABLED) {
    +            try {
    +                final UdfAnalyzer analyzer = new UdfAnalyzer(udfBaseClass, udf.getClass(), operator.getInputType(), null,
    +                        operator.getResultType(), key, null, mode == UdfAnalysisMode.OPTIMIZING_ENABLED);
    +                final boolean success = analyzer.analyze();
    +                if (success) {
    +                    if (mode == UdfAnalysisMode.OPTIMIZING_ENABLED
    +                            && !operator.udfWithForwardedFieldsAnnotation(udf.getClass())) {
    +                        operator.setSemanticProperties((SingleInputSemanticProperties) analyzer.getSemanticProperties());
    +                        operator.setAnalyzedUdfSemanticsFlag();
    +                    }
    +                    else if (mode == UdfAnalysisMode.HINTING_ENABLED) {
    +                        analyzer.addSemanticPropertiesHints();
    +                    }
    +                    LOG.info(analyzer.getHintsString());
    +                }
    +            }
    +            catch (InvalidTypesException e) {
    +                LOG.debug("Unable to do UDF analysis due to missing type information.", e);
    +            }
    +            catch (UdfAnalyzerException e) {
    +                LOG.debug("UDF analysis failed.", e);
    +            }
    +        }
    +    }
    +
    +    public static void analyzeDualInputUdf(TwoInputUdfOperator<?, ?, ?, ?> operator, Class<?> udfBaseClass,
    +            Function udf, Keys<?> key1, Keys<?> key2) {
    +        final UdfAnalysisMode mode = operator.getExecutionEnvironment().getConfig().getUdfAnalysisMode();
    +        if (mode != UdfAnalysisMode.DISABLED) {
    --- End diff --

    We could log that the analysis is disabled as well.
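The mode handling in the quoted diff can be summarized as a small decision function. The following is only a sketch under stated assumptions, not code from the pull request: the `Action` enum and the `decide` method are hypothetical names introduced for illustration, and only the three mode names are taken from the diff. It also marks the spot where a "disabled" log statement, as suggested in the comment above, could go.

```java
class AnalysisDispatchSketch {

    // Mode names as they appear in the quoted diff (UdfAnalysisMode).
    enum Mode { DISABLED, HINTING_ENABLED, OPTIMIZING_ENABLED }

    // Hypothetical: what the operator utility ends up doing after a successful analysis.
    enum Action { SKIP, ADD_HINTS, APPLY_INFERRED_PROPERTIES, KEEP_MANUAL_ANNOTATION }

    static Action decide(Mode mode, boolean hasForwardedFieldsAnnotation) {
        if (mode == Mode.DISABLED) {
            // A log statement such as "UDF analysis is disabled" could be emitted here.
            return Action.SKIP;
        }
        if (mode == Mode.OPTIMIZING_ENABLED) {
            // Manual annotations take precedence over inferred properties.
            return hasForwardedFieldsAnnotation
                    ? Action.KEEP_MANUAL_ANNOTATION
                    : Action.APPLY_INFERRED_PROPERTIES;
        }
        // HINTING_ENABLED: only report hints, never change the operator.
        return Action.ADD_HINTS;
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " / annotated: " + decide(mode, true)
                    + " / not annotated: " + decide(mode, false));
        }
    }
}
```

Keeping the decision separate from the side effects would also make it easy to test that manual annotations win over inferred ones.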
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569232#comment-14569232 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user uce commented on a diff in the pull request:

    https://github.com/apache/flink/pull/729#discussion_r31532320

    --- Diff: flink-java/src/main/java/org/apache/flink/api/java/sca/UdfAnalyzer.java ---
    @@ -0,0 +1,431 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.api.java.sca;
    +
    +import org.apache.flink.api.common.functions.CoGroupFunction;
    +import org.apache.flink.api.common.functions.CrossFunction;
    +import org.apache.flink.api.common.functions.FilterFunction;
    +import org.apache.flink.api.common.functions.FlatJoinFunction;
    +import org.apache.flink.api.common.functions.FlatMapFunction;
    +import org.apache.flink.api.common.functions.GroupReduceFunction;
    +import org.apache.flink.api.common.functions.JoinFunction;
    +import org.apache.flink.api.common.functions.MapFunction;
    +import org.apache.flink.api.common.functions.ReduceFunction;
    +import org.apache.flink.api.common.operators.DualInputSemanticProperties;
    +import org.apache.flink.api.common.operators.SemanticProperties;
    +import org.apache.flink.api.common.operators.SingleInputSemanticProperties;
    +import org.apache.flink.api.common.typeinfo.TypeInformation;
    +import org.apache.flink.api.java.functions.SemanticPropUtil;
    +import org.apache.flink.api.java.operators.Keys;
    +import org.apache.flink.api.java.operators.Keys.ExpressionKeys;
    +import org.apache.flink.api.java.sca.TaggedValue.Input;
    +import org.objectweb.asm.Type;
    +import org.objectweb.asm.tree.MethodNode;
    +
    +import java.lang.reflect.Method;
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +import static org.apache.flink.api.java.sca.UdfAnalyzerUtils.convertTypeInfoToTaggedValue;
    +import static org.apache.flink.api.java.sca.UdfAnalyzerUtils.findMethodNode;
    +import static org.apache.flink.api.java.sca.UdfAnalyzerUtils.mergeReturnValues;
    +import static org.apache.flink.api.java.sca.UdfAnalyzerUtils.removeUngroupedInputsFromContainer;
    +
    +public class UdfAnalyzer {
    +    // exclusion to suppress hints for API operators
    +    private static final String EXCLUDED_CLASSPATH = "org/apache/flink";
    --- End diff --

    Instead of excluding this class path, the consensus from reviews so far is to add an `@SkipCodeAnalysis` annotation. This will allow new users to play around with the Flink examples etc.
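The annotation-based exclusion proposed above could look roughly like the following. This is a minimal sketch, assuming a runtime-retained marker annotation: only the name `@SkipCodeAnalysis` comes from the review, while its definition, the `shouldAnalyze` helper, and the stand-in `MapFunction` interface and example classes are hypothetical and not taken from the pull request.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

class SkipCodeAnalysisSketch {

    // Hypothetical marker annotation; must be retained at runtime to be visible via reflection.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface SkipCodeAnalysis {}

    // Stand-in for a UDF interface, so the sketch compiles without Flink on the classpath.
    interface MapFunction<I, O> {
        O map(I value);
    }

    // A built-in function that opts out of the analysis.
    @SkipCodeAnalysis
    static class BuiltInIdentityMap implements MapFunction<String, String> {
        public String map(String value) {
            return value;
        }
    }

    // A user function in any package: analyzed, because it carries no annotation.
    static class UserLengthMap implements MapFunction<String, Integer> {
        public Integer map(String value) {
            return value.length();
        }
    }

    // Analyze everything except classes that are explicitly marked.
    static boolean shouldAnalyze(Class<?> udfClass) {
        return !udfClass.isAnnotationPresent(SkipCodeAnalysis.class);
    }

    public static void main(String[] args) {
        System.out.println(shouldAnalyze(BuiltInIdentityMap.class)); // prints false
        System.out.println(shouldAnalyze(UserLengthMap.class));      // prints true
    }
}
```

Unlike a package prefix check, this does not silently skip user code or examples that happen to live under `org.apache.flink`.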
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569312#comment-14569312 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user uce commented on the pull request:

    https://github.com/apache/flink/pull/729#issuecomment-107996871

    Hey Timo, I've tested this locally so far and it's working smoothly! Let's write a blog post about this very soon. :-)

    As soon as I have access to more machines, I will test it on a cluster.

    The stuff denoted with [RB] should be fixed in any case before merging imo. The rest we can also do afterwards.

    **USER FACING**

    1. Disable the analysis by default. [RB]

    2. This is cumbersome, but we should go for annotation-based exclusions instead of package-based ones (see inline comments). [RB]

    3. Currently manual annotations trump automatic ones, which is a good thing, because people won't have unexpected results. Could you add a test for this?

    4. I would give the analyzer util the call-location-based function name (String callLocation = Utils.getCallLocationName()). I think that will produce better output than just `Function 'Job$2' has been analyzed with the following result: ...`

    5. I would rename `UdfAnalysisMode` (see inline comments).

    6. The hints are currently logged on a single line without whitespace after each hint, like this:

    ```
    Function modifies static fields. This can lead to unexpected behaviour during runtime.Function returns 'null' values. This can lead to errors during runtime.A need for forwarded fields annotations could not be found.
    ```

    7. Some findings, such as a wrong tuple index access or returning null, fail the program before it is submitted, which is very nice. I would actually like to log these problems at log level WARN (instead of INFO) when only hints are enabled.

    8. When you have optimizing enabled and run a filter that changes its input, the error you get is the following. You can't tell what's wrong.

    ```
    Exception in thread "main" org.apache.flink.api.java.sca.UdfErrorException: UDF contains obvious errors.
    	at org.apache.flink.api.java.sca.UdfAnalyzer.analyze(UdfAnalyzer.java:300)
    	...
    ```

    For a wrong tuple index access, it is correct:

    ```
    Exception in thread "main" org.apache.flink.api.java.sca.UdfErrorException: UDF contains obvious errors.
    	at org.apache.flink.api.java.sca.UdfAnalyzer.analyze(UdfAnalyzer.java:300)
    	...
    Caused by: org.apache.flink.api.java.sca.UdfErrorException: Function contains tuple accesses with invalid indexes. This can lead to errors during runtime.
    	at org.apache.flink.api.java.sca.UdfAnalyzer.addHintOrThrowException(UdfAnalyzer.java:413)
    	...
    ```

    I'm not sure whether the thrown exception should be a `UdfErrorException` or a `UdfAnalysisException`.

    9. When optimizing is enabled, I would still print the inferred forwarded fields.

    10. Regarding the result messages: personally, I think the hints/messages could be more concise. For example, skip the `(should be kept to a minimum)` in the message about the number of object creations, or just say `Forwarded fields: none` instead of `A need for forwarded fields annotations could not be found.`

    **INTERNALS**

    1. I think the internals could use more comments. It's easy to get the general idea, but it would be nice to also get some high-level comments about the ASM-related stuff. I didn't have time to dig deep into it.

    ---

    Will report back after cluster tests.
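The points above about hints being concatenated on a single line and about logging real problems at WARN could be addressed along these lines. This is a sketch under stated assumptions, not the analyzer's actual code: `Hint`, `Severity`, `format`, and `logLevel` are hypothetical names, and the messages are the ones quoted in the review.

```java
import java.util.Arrays;
import java.util.List;

class HintFormattingSketch {

    // Hypothetical severity attached to each hint.
    enum Severity { INFO, WARN }

    // Hypothetical value class for a single analysis hint.
    static final class Hint {
        final Severity severity;
        final String message;

        Hint(Severity severity, String message) {
            this.severity = severity;
            this.message = message;
        }
    }

    // One hint per line instead of concatenating the messages without a separator.
    static String format(String functionName, List<Hint> hints) {
        StringBuilder sb = new StringBuilder();
        sb.append("Function '").append(functionName)
                .append("' has been analyzed with the following result:");
        for (Hint hint : hints) {
            sb.append('\n').append("  [").append(hint.severity).append("] ").append(hint.message);
        }
        return sb.toString();
    }

    // Log the whole report at WARN as soon as one hint describes a likely error.
    static Severity logLevel(List<Hint> hints) {
        for (Hint hint : hints) {
            if (hint.severity == Severity.WARN) {
                return Severity.WARN;
            }
        }
        return Severity.INFO;
    }

    public static void main(String[] args) {
        List<Hint> hints = Arrays.asList(
                new Hint(Severity.WARN, "Function modifies static fields. This can lead to unexpected behaviour during runtime."),
                new Hint(Severity.WARN, "Function returns 'null' values. This can lead to errors during runtime."),
                new Hint(Severity.INFO, "Forwarded fields: none"));
        System.out.println(logLevel(hints) + ": " + format("Job$2", hints));
    }
}
```

In real code the function name would preferably be the call-location name mentioned in the review rather than the class name `Job$2`, and the level would be mapped to the logging framework's own WARN and INFO calls.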
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567225#comment-14567225 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user uce commented on the pull request:

    https://github.com/apache/flink/pull/729#issuecomment-107419997

    OK, I will review this.

    I vote to stick to Stephan's suggested approach instead of package-based exclusions: analyze everything and allow exclusions with a `@SkipCodeAnalysis` annotation.

    Any further opinions on the output of the analysis (stdout vs. logging question)?
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567244#comment-14567244 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user rmetzger commented on the pull request:

    https://github.com/apache/flink/pull/729#issuecomment-107422340

    +1 for the annotation.

    I'm against using stdout. Logging frameworks are much better at controlling the output. The quickstart mvn archetype provides a log4j.properties file, so we can configure it the way we want.
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567146#comment-14567146 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/729#issuecomment-107393190

    Let us merge this for 0.9 and have it deactivated by default.

    Let's gradually activate it in the next releases as it gets exposure.
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567304#comment-14567304 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/729#issuecomment-107482998

    The concern about logging is that, when using the local mode inside the IDE, the system logs a lot and the hints get lost.

    If you don't want output on sysout, you could always deactivate the analysis.
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14562512#comment-14562512 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user StephanEwen commented on the pull request:
https://github.com/apache/flink/pull/729#issuecomment-106232545

I second Ufuk's comments: merge it and deactivate it by default. I can see a 0.9.1 or 0.10.0 release coming very soon afterwards, because we still have a big set of issues in the pipeline.

Initially, we could activate hinting in the local environment (which people use during debugging anyway) and keep it deactivated in the production environments (remote and context).

Other comments:

- How about printing the hints to sysout? I can see them getting lost among the logging statements. Also, people often do not have logging activated in the IDE.
- Package-based exclusions never worked; they were always an issue with the quickstarts. I assume you want the exclusion to make sure you do not analyze the built-in default join function, for example? What you can do is add an annotation that says DoNotAnalyze to those functions, and then simply analyze everything.
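The annotation-based exclusion suggested in the comment above could look roughly like the following. This is only a minimal sketch under stated assumptions: the `DoNotAnalyze` name is taken from the comment, while `SkipAnalysisDemo`, `shouldAnalyze`, and the two stand-in classes are hypothetical names invented for illustration. It checks the annotation via reflection, whereas an analyzer working on bytecode would more likely read it from the class file.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

class SkipAnalysisDemo {

    /** Hypothetical marker annotation; only the name comes from the discussion. */
    @Retention(RetentionPolicy.RUNTIME) // must be visible at runtime for the reflective check
    @Target(ElementType.TYPE)
    @interface DoNotAnalyze {}

    /** Stand-in for a built-in function that should not produce hints. */
    @DoNotAnalyze
    static class BuiltInJoinFunction {}

    /** Stand-in for a user-defined function. */
    static class UserMapFunction {}

    /** Analyze every class unless it carries the marker annotation. */
    static boolean shouldAnalyze(Class<?> udfClass) {
        return !udfClass.isAnnotationPresent(DoNotAnalyze.class);
    }

    public static void main(String[] args) {
        System.out.println(shouldAnalyze(BuiltInJoinFunction.class)); // prints "false"
        System.out.println(shouldAnalyze(UserMapFunction.class));     // prints "true"
    }
}
```

Compared with a package prefix check, this kind of opt-out does not depend on where user code lives, which is the quickstart problem mentioned above.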
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14562532#comment-14562532 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user twalthr commented on the pull request:
https://github.com/apache/flink/pull/729#issuecomment-106236797

+1 for activating hinting locally. I thought logging was the standard way to print, but I can change it to sysout.
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14562547#comment-14562547 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user uce commented on the pull request:
https://github.com/apache/flink/pull/729#issuecomment-106241536

- I agree with hints going to sysout and activating by default.
- For simple functions, most of the transitive allocations will be Flink internal, right? E.g. after calling the collect method. Would it make sense to exclude transitive allocations reached via collect?
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14562523#comment-14562523 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user twalthr commented on the pull request:
https://github.com/apache/flink/pull/729#issuecomment-106233384

I disabled the analyzer for all classes starting with org.apache.flink, because I wanted to reduce the output shown to the user for built-in UDFs (e.g. `org.apache.flink.api.java.Utils$CollectHelper` or UDFs within the Graph API). Initially I thought about an annotation `@SkipCodeAnalysis`, but there are too many UDFs where this annotation would then have to be placed. I think we can assume that UDFs shipped with Flink are either already implemented efficiently or are inefficient for example purposes only.

"Object creations in method" means that these objects are created directly in e.g. `map()`. The analyzer also follows method calls. Transitively created objects are objects created in the nested method calls.
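The distinction between objects created "in method" and "transitively" can be illustrated with a small stand-alone example. This is a sketch only: `AllocationDemo`, `map`, and `suffix` are hypothetical names, the class is not a real Flink `MapFunction`, and the comments mark explicit `new` expressions as described in the comment above rather than reproducing the analyzer's actual counting.

```java
class AllocationDemo {

    /** Mimics the shape of a map() UDF; hypothetical, not a Flink interface. */
    static String map(String value) {
        // Explicit allocation directly in map(): an "in method" object creation.
        StringBuilder sb = new StringBuilder();
        sb.append(value);
        // Nested call: explicit allocations inside suffix() are reached transitively.
        sb.append(suffix(value.length()));
        return sb.toString();
    }

    /** Helper called from map(); its allocation is only reachable through the call. */
    static String suffix(int n) {
        // Explicit allocation in a nested method: a "transitively" created object.
        int[] counts = new int[] {n};
        return "#" + counts[0];
    }

    public static void main(String[] args) {
        System.out.println(map("flink")); // prints "flink#5"
    }
}
```

A user can influence the first kind directly, for example by reusing an object across invocations, while the second kind depends on the code that the function calls.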
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14562553#comment-14562553 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user twalthr commented on the pull request:
https://github.com/apache/flink/pull/729#issuecomment-106243626

`collect()` is a special case; the analyzer does not follow it. But after thinking about it, I realized that I forgot `getRuntimeContext()`. I will fix this ;)
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560695#comment-14560695 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user StephanEwen commented on the pull request:
https://github.com/apache/flink/pull/729#issuecomment-105840358

The tests are failing with checkstyle violations. Can you fix those?
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560609#comment-14560609 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user StephanEwen commented on the pull request:
https://github.com/apache/flink/pull/729#issuecomment-105814322

Indeed, this looks like an impressive addition. Let's get it into 0.9 as a beta feature!
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560614#comment-14560614 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user StephanEwen commented on the pull request:
https://github.com/apache/flink/pull/729#issuecomment-105816818

Looks very nice, and it seems to have good test coverage as well. How well does it work with bytecode generated by the Scala compiler?
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560921#comment-14560921 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user twalthr commented on the pull request:
https://github.com/apache/flink/pull/729#issuecomment-105897738

The bytecode generated by the Scala compiler is the same, so that is not a problem for the analyzer. But Scala is not fully supported yet, because of the different Java/Scala tuples (fields starting with _ instead of f, etc.). I will add support for that in the near future.

If any exceptions are thrown, they are only visible in the debug log. So we could merge it into the 0.9 release without any disadvantages.

It builds now ;)
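The tuple naming difference mentioned above comes down to two conventions: Flink's Java tuples expose zero-based fields `f0`, `f1`, ..., while Scala tuples expose one-based fields `_1`, `_2`, .... A minimal sketch of the mapping follows; `TupleFieldNames` and `toScalaFieldName` are hypothetical names for illustration and are not part of the analyzer.

```java
class TupleFieldNames {

    /** Converts a Java tuple field name such as "f0" to the Scala name "_1". */
    static String toScalaFieldName(String javaField) {
        if (!javaField.matches("f\\d+")) {
            throw new IllegalArgumentException("Not a Java tuple field: " + javaField);
        }
        // Java tuple fields are zero-based, Scala tuple fields are one-based.
        int zeroBased = Integer.parseInt(javaField.substring(1));
        return "_" + (zeroBased + 1);
    }

    public static void main(String[] args) {
        System.out.println(toScalaFieldName("f0")); // prints "_1"
        System.out.println(toScalaFieldName("f1")); // prints "_2"
    }
}
```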
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561039#comment-14561039 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user StephanEwen commented on the pull request:
https://github.com/apache/flink/pull/729#issuecomment-105926378

Okay, pending a proper distributed test, you have my vote to add this!
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14561838#comment-14561838 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user uce commented on the pull request:
https://github.com/apache/flink/pull/729#issuecomment-106093530

This is very cool! Thanks for all the effort you put into this. I think this will be a great addition to the project. :-)

---

I've skimmed over the code and checked it out. Some initial feedback:

- **Naming UDF**. The feedback from committers giving talks about Flink some time ago was that the name UDF was sometimes confusing. @rmetzger, can you confirm this? We might take this into account and rename the `UdfAnalysisMode` to something else, for example just `CodeAnalysisMode`.

- **Default mode**: the default mode is currently set to hinting. When running the WordCount example, it didn't work out of the box though. After some debugging I found this in the `UdfAnalyzer`, which prevented the analysis:

```java
// Skip classes on the excluded classpath, except those under the sca package
if (internalUdfClassName.startsWith(EXCLUDED_CLASSPATH)
        && !internalUdfClassName.startsWith("org/apache/flink/api/java/sca")) {
    return false;
}
```

After commenting the second condition out, it worked fine.

For a WordCount with a reducer, which only touches the second field of the Tuple2, it worked like a charm. The produced output looks like this:

```
23:23:49,207 INFO org.apache.flink.api.java.operators.UdfOperatorUtils - Function 'org.apache.flink.examples.java.wordcount.WordCount$Tokenizer' has been analyzed with the following result:
Number of object creations (should be kept to a minimum): 1 in method / 36 transitively
A need for forwarded fields annotations could not be found.
23:23:49,217 INFO org.apache.flink.api.java.operators.UdfOperatorUtils - Function 'org.apache.flink.examples.java.wordcount.WordCount$1' has been analyzed with the following result:
Number of object creations (should be kept to a minimum): 1 in method / 1 transitively
You could use the following annotation:
@ForwardedFields(f0->f0;)
23:23:49,243 INFO org.apache.flink.api.java.operators.UdfOperatorUtils - Function 'org.apache.flink.api.java.Utils$CollectHelper' has been analyzed with the following result:
Number of object creations (should be kept to a minimum): 0 in method / 6 transitively
A need for forwarded fields annotations could not be found.
```

Very nice to see this working. This is great news for the optimizer. :-)

What do the transitive object creations refer to exactly? I'm wondering how a user could influence them. Reusing a result object in the WordCount sum reducer is correctly detected as 0 creations in the method as well.

I was wondering whether it would be possible to configure code analysis on a per-function level. For example, all library-related functions should not print the hints, in my opinion.

---

I very much like this. Regarding merging this before the release: I would vote to merge it only if we disable it by default. This will essentially affect every program written for the upcoming Flink release, and merging it without proper review, testing, and exposure seems rushed to me. But I wouldn't veto it if everyone wants it in as soon as possible. I think this feature alone would warrant a new 0.10 release soon after 0.9.

Add static code analysis for UDFs
Key: FLINK-1319
URL: https://issues.apache.org/jira/browse/FLINK-1319
Project: Flink
Issue Type: New Feature
Components: Java API, Scala API
Reporter: Stephan Ewen
Assignee: Timo Walther
Priority: Minor

Flink's Optimizer takes information that tells it for UDFs which fields of the input elements are accessed, modified, or forwarded/copied. This information frequently helps to reuse partitionings, sorts, etc.
It may speed up programs significantly, as it can frequently eliminate sorts and shuffles, which are costly. Right now, users can add lightweight annotations to UDFs to provide this information (such as adding {{@ConstandFields(0-3, 1, 2-1)}}. We worked with static code analysis of UDFs before, to determine this information automatically. This is an incredible feature, as it magically makes programs faster. For record-at-a-time operations (Map, Reduce, FlatMap, Join, Cross), this works surprisingly well in many cases. We used the Soot toolkit for the static code analysis. Unfortunately, Soot is LGPL licensed and thus we did not include any of the code so far. I propose to add this functionality to Flink, in the form of a drop-in addition, to work around the LGPL incompatibility with ALS 2.0. Users could
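The exclusion condition quoted in the review comment above can be read as a small predicate over JVM internal class names. The following is a minimal sketch of that reading, not the actual `UdfAnalyzer` code: the class and method names are hypothetical, the value of `EXCLUDED_CLASSPATH` is an assumption (the comment shows only the constant's name), and the `&&` joining the two conditions is assumed as well.

```java
/** Hypothetical sketch of the exclusion rule; not the actual UdfAnalyzer implementation. */
final class UdfExclusionSketch {

    // Assumed value: the review comment only shows the constant's name.
    static final String EXCLUDED_CLASSPATH = "org/apache/flink";

    // Package prefix taken from the condition quoted in the review comment.
    static final String SCA_PACKAGE = "org/apache/flink/api/java/sca";

    /** Returns true if a class, given by its '/'-separated internal name, would be analyzed. */
    static boolean shouldAnalyze(String internalUdfClassName) {
        boolean excluded = internalUdfClassName.startsWith(EXCLUDED_CLASSPATH)
                && !internalUdfClassName.startsWith(SCA_PACKAGE);
        return !excluded;
    }

    public static void main(String[] args) {
        // The class names below are made up for illustration.
        System.out.println(shouldAnalyze("com/example/MyMapper"));
        System.out.println(shouldAnalyze("org/apache/flink/api/java/sca/SomeTestUdf"));
        System.out.println(shouldAnalyze("org/apache/flink/runtime/SomeInternalFunction"));
    }
}
```

Under these assumptions, user classes and classes inside the analyzer's own package are analyzed, while all other classes below the excluded prefix are skipped.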
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560564#comment-14560564 ]

ASF GitHub Bot commented on FLINK-1319:
---

Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/729#issuecomment-105799678

Thanks a lot for this great work! I'll soon try out the code to see how it works. Given that we can disable it and that it's not automatically setting semantic properties, I would vote to merge it soon to include it in the 0.9 release.
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14559700#comment-14559700 ]

ASF GitHub Bot commented on FLINK-1319:
---

GitHub user twalthr opened a pull request: https://github.com/apache/flink/pull/729

[FLINK-1319][core] Add static code analysis for UDFs

This PR implements a Static Code Analyzer (SCA) that uses the ASM framework for interpreting Java bytecode of Flink UDFs. The analyzer is built on top of ASM's `BasicInterpreter`. Instead of ASM's `BasicValue`s, I introduced `TaggedValue`s, which extend `BasicValue` and allow for appending interesting information to values. Interesting values such as inputs, collectors, or constants are tagged so that atomic input fields can be tracked through the entire UDF (until the function returns or calls `collect()`).

The implementation is as conservative as possible, meaning that for cases or bytecode instructions that haven't been considered the analyzer will fall back to the ASM library (which removes TaggedValues).

61 JUnit tests cover the basic functionality, and 18 further JUnit tests exercise the analyzer with code examples from the real world.

The analyzer has 3 modes: DISABLED, OPTIMIZE, HINTS

The interpretation takes some time. It is possible that an analysis of a UDF takes up to 1 second. Therefore, I didn't enable the analyzer in TestEnvironment by default to reduce the build times, but if you uncomment the lines the analyzer supports all 280 UDFs within the entire Flink code.

The analyzer gives hints about:
- Main feature: ForwardedFields semantic properties for all types of Functions except for MapPartition and Combine
- Warnings if static fields are modified by a Function
- Warnings if a FilterFunction modifies its input objects
- Warnings if a Function returns `null`
- Warnings if a tuple access uses a wrong index
- Information about the number of object creations within a UDF (for manual optimization)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/twalthr/flink sca

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/729.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #729

commit c384fc9740013ec1ae89a2817695078542c47dfe
Author: twalthr <twal...@apache.org>
Date: 2015-05-26T18:22:03Z

[FLINK-1319][core] Add static code analysis for UDFs
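The tagging idea described in the pull request (values remember which input field they came from, and anything the analyzer does not understand loses its tag) can be illustrated with a toy model. This is only a sketch under stated assumptions: it does not use ASM, `Tagged` is a hypothetical stand-in rather than the PR's `TaggedValue`, and the output string merely mimics the forwarded-fields notation.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of origin tagging; hypothetical, unrelated to the real TaggedValue class. */
final class ForwardedFieldSketch {

    /** A symbolic value that remembers the input field it was copied from (-1 = unknown). */
    static final class Tagged {
        final int sourceField;

        Tagged(int sourceField) {
            this.sourceField = sourceField;
        }

        static Tagged inputField(int index) {
            return new Tagged(index);
        }

        /** Conservative rule: any computation on a value drops its tag. */
        Tagged modified() {
            return new Tagged(-1);
        }
    }

    /** Derives an expression such as "f0->f0" from the fields of the emitted record. */
    static String forwardedFields(Tagged[] output) {
        List<String> parts = new ArrayList<>();
        for (int out = 0; out < output.length; out++) {
            int in = output[out].sourceField;
            if (in >= 0) {
                parts.add("f" + in + "->f" + out);
            }
        }
        return String.join(";", parts);
    }

    public static void main(String[] args) {
        // Models a reducer on a 2-tuple that keeps field 0 and recomputes field 1.
        Tagged[] emitted = { Tagged.inputField(0), Tagged.inputField(1).modified() };
        System.out.println(forwardedFields(emitted)); // prints f0->f0
    }
}
```

Dropping the tag on every modification keeps the result safe: a field is only reported as forwarded if it reached the output untouched.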
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485131#comment-14485131 ]

Stephan Ewen commented on FLINK-1319:
-

In principle this is good, but I would put it in a utils class (the logic to read the config mode and then call the analysis). Otherwise you will probably have a lot of duplicated code.

Another thing to think about is to put the UDF code analysis in the operator constructor. Then it is eagerly executed at the point when the user calls flatMap(). That is sometimes easier to understand / debug, compared to the lazy evaluation of getSemanticProperties() and translateToDataflow().
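The difference between the two placements discussed in the comment above (analysis in the operator constructor versus analysis on the first call to `getSemanticProperties()`) can be sketched without any Flink classes. Everything below is hypothetical: the operator classes, the shared helper `analyzeIfEnabled`, and the placeholder result are illustrations only, and a boolean switch stands in for the config mode.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: one shared helper, invoked eagerly or lazily. */
final class EagerVsLazySketch {

    /** Stand-in for a utils method: checks the switch, then runs a fake analysis. */
    static String analyzeIfEnabled(boolean enabled, String udfName, List<String> log) {
        if (!enabled) {
            return "";
        }
        log.add("analyzed " + udfName);
        return "f0->f0"; // placeholder result, not derived from real bytecode
    }

    /** Eager variant: the analysis runs when the operator is constructed. */
    static final class EagerOperator {
        private final String props;

        EagerOperator(boolean enabled, String udfName, List<String> log) {
            this.props = analyzeIfEnabled(enabled, udfName, log);
        }

        String getSemanticProperties() {
            return props;
        }
    }

    /** Lazy variant: the analysis runs on the first call to getSemanticProperties(). */
    static final class LazyOperator {
        private final boolean enabled;
        private final String udfName;
        private final List<String> log;
        private String props;

        LazyOperator(boolean enabled, String udfName, List<String> log) {
            this.enabled = enabled;
            this.udfName = udfName;
            this.log = log;
        }

        String getSemanticProperties() {
            if (props == null) {
                props = analyzeIfEnabled(enabled, udfName, log);
            }
            return props;
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        new EagerOperator(true, "Tokenizer", log);
        LazyOperator lazy = new LazyOperator(true, "Summer", log);
        System.out.println(log); // only the eager operator has been analyzed so far
        lazy.getSemanticProperties();
        System.out.println(log); // now both have been analyzed
    }
}
```

With the eager variant, any message or failure from the analysis appears at the line where the user builds the operator, which is the debugging benefit the comment points out.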
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483480#comment-14483480 ]

Timo Walther commented on FLINK-1319:
-

I have integrated the analyzer in the FlatMapOperator overriding the getSemanticProperties() method: https://github.com/twalthr/flink/blob/a01b8334795d530b28d25f6bb20b09fca5c3cf27/flink-java/src/main/java/org/apache/flink/api/java/operators/FlatMapOperator.java

Do you think that's the right way to go? (before I implement that in all operators ;) )
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382269#comment-14382269 ]

Stephan Ewen commented on FLINK-1319:
-

I think there is no reason they are not available in the Scala API. They absolutely should be ;-) I vote to move them to the core project.
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382288#comment-14382288 ]

Timo Walther commented on FLINK-1319:
-

Sorry, I was a little bit confused yesterday because the SemanticPropUtil is in flink-java and has dependencies on Keys and Tuple. My analyzer also has dependencies on Tuple, Pojo, and Keys. I think I need to put the analyzer in flink-java. I also have to implement Scala support.
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379653#comment-14379653 ]

Timo Walther commented on FLINK-1319:
-

Thanks for your comments! The analyzer also supports group-at-a-time functions.

Supported:
CoGroupFunction
CrossFunction
FlatCombineFunction
FlatJoinFunction
FlatMapFunction
GroupReduceFunction
JoinFunction
MapFunction
MapPartitionFunction
ReduceFunction

Not supported yet:
FilterFunction
CombineFunction

I will implement Ufuk's summary.
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379764#comment-14379764 ]

Timo Walther commented on FLINK-1319:
-

Is there a reason why Semantic Properties are not available in the Scala API? Does it make sense to also move it to core / org.apache.flink.api.common?
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377902#comment-14377902 ]

Fabian Hueske commented on FLINK-1319:
--

(Copying my response from the mailing list)

I agree with Stephan. A separate repository is not necessary because this feature is visible for users (except for the activation switch) and could therefore be added to {{flink-core}} without problems, IMO.

The handling of forwarded fields for group-wise operators in the optimizer is not fully sorted out yet, so that might need to be adapted (see FLINK-1656 and PR #525).

For the switch we could offer three options:
- deactivated
- activated hinting (write extracted semantic information to log)
- activated optimizing (use extracted semantic info in optimizer)

Regarding additional checks we could:
- detect whether a Filter function modifies the record
- check if a Reduce function returns a new record or the first? input record.
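The three-way switch proposed in the comment above can be written down as a small enum plus a mapping to what each option does with the extracted information. This is a sketch only: the enum constants, the `Plan` class, and `planFor` are hypothetical names, and treating the optimizing mode as not writing hints to the log is an assumption read literally from the list above.

```java
/** Hypothetical sketch of the proposed three-way switch; names are illustrative only. */
final class AnalysisModeSketch {

    enum Mode { DISABLED, HINT, OPTIMIZE }

    /** What happens to the extracted semantic information in a given mode. */
    static final class Plan {
        final boolean runAnalysis;
        final boolean writeToLog;
        final boolean passToOptimizer;

        Plan(boolean runAnalysis, boolean writeToLog, boolean passToOptimizer) {
            this.runAnalysis = runAnalysis;
            this.writeToLog = writeToLog;
            this.passToOptimizer = passToOptimizer;
        }
    }

    static Plan planFor(Mode mode) {
        switch (mode) {
            case HINT:
                // "activated hinting": extracted information is only written to the log
                return new Plan(true, true, false);
            case OPTIMIZE:
                // "activated optimizing": extracted information is handed to the optimizer
                return new Plan(true, false, true);
            case DISABLED:
            default:
                // "deactivated": the analysis does not run at all
                return new Plan(false, false, false);
        }
    }

    public static void main(String[] args) {
        for (Mode m : Mode.values()) {
            Plan p = planFor(m);
            System.out.println(m + ": run=" + p.runAnalysis
                    + ", log=" + p.writeToLog + ", optimizer=" + p.passToOptimizer);
        }
    }
}
```

Keeping the decision in one place means every operator can ask for the plan instead of re-implementing the mode check.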
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377761#comment-14377761 ]

Maximilian Michels commented on FLINK-1319:
---

This looks like a very promising way to automatically optimize Flink jobs. +1 for including it in {{flink-staging}}. +1 for a switch in the {{ExecutionEnvironment}} to manually turn it on.
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378021#comment-14378021 ]

Ufuk Celebi commented on FLINK-1319:

I vote for a combination of Stephan's and Fabian's suggestion:
1. Core
2. ExecutionConfig
3. Three options (deactivated by default (except for tests), hint, active)
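A three-state switch like the one voted for above might look as follows. This is a minimal sketch under assumptions: the names {{AnalysisModeSketch}}, {{Mode}} and {{onAnalysisResult}} are hypothetical and not Flink's API, and the reading of "hint" as "report the finding but do not apply it" is an interpretation, not something the comment spells out.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: every name here is hypothetical and not part of Flink's API.
final class AnalysisModeSketch {

    /** The three options from the comment: off, report only, or apply. */
    enum Mode { DISABLED, HINT, ACTIVE }

    // Deactivated by default; test environments could switch this centrally.
    Mode mode = Mode.DISABLED;

    /** Messages that a HINT run would print for the user (stands in for a logger). */
    final List<String> hints = new ArrayList<>();

    /** Annotations that an ACTIVE run would hand to the optimizer. */
    final List<String> applied = new ArrayList<>();

    /** Called with the forwarded-fields string the analysis derived for one UDF. */
    void onAnalysisResult(String udfName, String forwardedFields) {
        switch (mode) {
            case DISABLED:
                // A real implementation would presumably skip the analysis entirely.
                break;
            case HINT:
                hints.add("UDF " + udfName + " could declare forwarded fields: " + forwardedFields);
                break;
            case ACTIVE:
                applied.add(udfName + " -> " + forwardedFields);
                break;
        }
    }
}
```

Keeping the mode in one configuration object means users never have to call a per-operator method to benefit from it.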
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376003#comment-14376003 ]

Ufuk Celebi commented on FLINK-1319:

Hey Timo, great news! :-)

1. What about adding it to staging?
2. I would very much like to have this on by default in the future. But I agree that we should not do this now. We really need to be certain that we don't introduce wrong annotations. They might cause some hard-to-understand problems for new users when enabled by default. As a first step it makes sense to make this as explicit as possible, for example with an {{optimizeUdf()}} method as you propose.
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376899#comment-14376899 ]

Stephan Ewen commented on FLINK-1319:

Very nice result, a very much anticipated feature.

Can you tell us how many functions are currently analyzed by this? Does the basic mechanism work with record-at-a-time functions only, or also with group-at-a-time functions?

To proceed:
- Do we need an extra project for this? I would actually not mind having this in core / java. It is sort of lightweight and we have the ASM dependency anyways (closure cleaning).
- To activate or deactivate it, I would use the ExecutionConfig in the ExecutionEnvironment. From my experience with users, no one bothers to call any of the parametrization methods ever (withForwardFields, withName, analyzeUdf, ...). If we make it dependent on that, it will effectively not be used.
- I would have it deactivated by default initially. Users can activate it globally with the ExecutionConfig. We should activate it in all tests to get code coverage with our test UDFs. This can be done centrally, where the test context environments are created.
- We can activate it by default in the next release, once we have given this some testing and exposure.

Other comments:
- I would vote to throw an exception (or at least print a warning) if you detect that any path in the program returns a null value.
- The ASM dependency version needs to be set by a variable (defined in root pom, interaction with shading).
- Can you format the POM xml like the other POMs (tabs)?
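The null-return check suggested in the comment above could be sketched as follows. It assumes, purely for illustration, that the analysis yields one abstract result per control-flow path of a UDF; the types {{PathResult}} and {{NullReturnException}} and the {{strict}} flag are hypothetical.

```java
import java.util.List;

// Sketch only: the path model and all names are hypothetical.
final class NullReturnCheck {

    /** Abstract return value of one control-flow path through a UDF. */
    enum PathResult { NON_NULL, NULL, UNKNOWN }

    /** Raised in strict mode when some path definitely returns null. */
    static final class NullReturnException extends RuntimeException {
        NullReturnException(String message) {
            super(message);
        }
    }

    /**
     * Throws (strict) or prints a warning (lenient) for every path that returns null.
     * UNKNOWN paths are ignored because nothing can be proven about them.
     */
    static void check(String udfName, List<PathResult> paths, boolean strict) {
        for (int i = 0; i < paths.size(); i++) {
            if (paths.get(i) == PathResult.NULL) {
                String message = "UDF '" + udfName + "' returns null on path " + i;
                if (strict) {
                    throw new NullReturnException(message);
                }
                System.err.println("WARNING: " + message);
            }
        }
    }
}
```

The {{strict}} flag mirrors the two alternatives named in the comment: fail with an exception, or at least print a warning.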
[jira] [Commented] (FLINK-1319) Add static code analysis for UDFs
[ https://issues.apache.org/jira/browse/FLINK-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307203#comment-14307203 ]

Timo Walther commented on FLINK-1319:

Actually, I don't like the drop-in approach. I think it would be much better if the code analysis can be included in the release. Especially once the code is stable enough, it would be great to enable it by default and speed up jobs automatically.

I did some research about other frameworks we could use instead. Soot is the best framework; however, I think we can also build the code analysis on top of the ObjectWeb ASM library [1]. It provides some functionality for data flow analysis [2]. The examples for BasicInterpreter and BasicVerifier look promising. Other projects use it to determine types [3]. Using ASM requires us to implement more, but it gives us full flexibility for further analysis use cases.

I would try to implement a simple proof-of-concept prototype. What do you think?

[1] http://asm.ow2.org/
[2] http://download.forge.objectweb.org/asm/asm4-guide.pdf, 115ff
[3] https://github.com/hraberg/enumerable/blob/master/src/main/java/org/enumerable/lambda/support/expression/ExpressionInterpreter.java
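To make the data flow idea concrete, here is a toy abstract interpreter that finds fields a UDF copies unchanged from input to output. It is only a sketch: it does not use ASM, the four-instruction "bytecode" ({{LOAD_FIELD}}, {{CONST}}, {{ADD}}, {{STORE_FIELD}}) is invented for the example, and it handles straight-line code only. A real prototype would interpret JVM bytecode and merge states at branches.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of forwarded-field detection; instruction set and names are hypothetical.
final class ForwardedFieldsSketch {

    enum Op { LOAD_FIELD, CONST, ADD, STORE_FIELD }

    /** One instruction of a straight-line UDF body; arg is a field index (unused for CONST/ADD). */
    static final class Insn {
        final Op op;
        final int arg;

        Insn(Op op, int arg) {
            this.op = op;
            this.arg = arg;
        }
    }

    /** Abstract value: index of an unmodified input field, or UNKNOWN for anything else. */
    static final int UNKNOWN = -1;

    /** Returns a map from output field to the input field that is copied into it unchanged. */
    static Map<Integer, Integer> analyze(List<Insn> body) {
        Deque<Integer> stack = new ArrayDeque<>();
        Map<Integer, Integer> forwarded = new TreeMap<>();
        for (Insn insn : body) {
            switch (insn.op) {
                case LOAD_FIELD:
                    stack.push(insn.arg);      // value is exactly input field 'arg'
                    break;
                case CONST:
                    stack.push(UNKNOWN);       // constants do not come from the input
                    break;
                case ADD:
                    stack.pop();
                    stack.pop();
                    stack.push(UNKNOWN);       // any computation yields a modified value
                    break;
                case STORE_FIELD: {
                    int value = stack.pop();
                    if (value == UNKNOWN) {
                        forwarded.remove(insn.arg);
                    } else {
                        forwarded.put(insn.arg, value);
                    }
                    break;
                }
            }
        }
        return forwarded;
    }

    public static void main(String[] args) {
        // Models: out.f0 = in.f0; out.f1 = in.f1 + 1;
        List<Insn> body = Arrays.asList(
                new Insn(Op.LOAD_FIELD, 0), new Insn(Op.STORE_FIELD, 0),
                new Insn(Op.LOAD_FIELD, 1), new Insn(Op.CONST, 0),
                new Insn(Op.ADD, 0), new Insn(Op.STORE_FIELD, 1));
        System.out.println(analyze(body)); // prints {0=0}
    }
}
```

The result says that output field 0 is a plain copy of input field 0, which is the kind of information the forwarded-field annotations in the issue description provide by hand.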