Re: Review Request 61265: PIG-5256: Bytecode generation for POFilter and POForeach
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61265/
-----------------------------------------------------------

(Updated Aug. 9, 2017, 3:59 a.m.)

Review request for pig, Daniel Dai and Koji Noguchi.

Changes
-------

During performance testing on large data, found that there were a lot of spills. Fixed that by avoiding materialization of the full databag in nested foreach, similar to how it happens in regular processing. In addition to UDFs, any repeated POSort or PODistinct is now processed only once. Those values are also cleared for garbage collection as soon as the last foreach plan referring to them is done, instead of at the end when the input is detached.

Bugs: PIG-5256
    https://issues.apache.org/jira/browse/PIG-5256

Repository: pig

Description
-------

For each POForEach or POFilter operator in the plan, a corresponding class is created by inlining code for the input plans. Bytecode is generated directly through ASM APIs and the generated class files are shipped to the tasks as a jar. On the frontend, the generated classes are dynamically added to the PigContext classloader. The explain command now decompiles the class files in the explain output directory. Refer to the golden files in the test for examples of generated code.

License:
The newly added fernflower.jar used for decompilation is from IntelliJ (the Java decompiler used by IntelliJ IDEA):
https://github.com/JetBrains/intellij-community/blob/master/plugins/java-decompiler/engine/src/org/jetbrains/java/decompiler/main/Fernflower.java
Its license is Apache 2.0:
https://github.com/JetBrains/intellij-community/blob/master/LICENSE.txt

TODO items to be addressed in a separate jira:
1) PIG-5279 - Support for MR and Spark. Currently only done for Tez. Will also add documentation in this jira
2) Support for Accumulator
3) Support for CROSS
4) Run the optimizer on combiner plans
5) Fix for test failures - TestScriptLanguage.runParallelTest2, Jython_CompileBindRun_3
   java.lang.LinkageError: loader (instance of org/apache/pig/impl/PigContext$ContextClassLoader): attempted duplicate class definition for name: "org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POForEach_scope_6"
   Since NodeIdGenerator is ThreadLocal, running the same script in parallel with different parameters using embedded Python causes a conflict. Requires ThreadLocal classloaders in PigContext. Will address in a separate jira.

Other jiras fixed as part of this:
1) PIG-4515: org.apache.pig.builtin.Distinct throws ClassCastException
2) When pig.opt.bytecode=true, UDFs defined as an alias inside nested foreach are executed only once. This was actually the primary goal of doing bytecode generation.
   PIG-3000: Optimize nested foreach
   PIG-1633: Using an alias withing Nested Foreach causes indeterminate behaviour

Diffs (updated)
-----

  http://svn.apache.org/repos/asf/pig/trunk/build.xml 1804148
  http://svn.apache.org/repos/asf/pig/trunk/ivy.xml 1804148
  http://svn.apache.org/repos/asf/pig/trunk/ivy/libraries.properties 1804148
  http://svn.apache.org/repos/asf/pig/trunk/shade/pom.xml PRE-CREATION
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/PigConfiguration.java 1804148
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/PigServer.java 1804148
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1804148
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRUtil.java 1804148
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/optimizer/AsmUtil.java PRE-CREATION
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/optimizer/CodeGenerator.java PRE-CREATION
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/optimizer/FilterCodeGenerator.java PRE-CREATION
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/optimizer/ForEachCodeGenerator.java PRE-CREATION
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/BagInputOperator.java PRE-CREATION
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/PhysicalOperator.java 1804148
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/Add.java 1804148
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/Divide.java 1804148
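
For readers unfamiliar with the LinkageError quoted in the review request above, the following is a minimal sketch of how it can arise. It is not code from the patch. It assumes a stand-in loader (GeneratedClassLoader, a hypothetical name) in place of PigContext's ContextClassLoader, and it hand-writes a tiny method-less class file instead of using ASM. Defining the same class name twice in one loader fails, while giving each caller its own loader (roughly the idea behind per-thread classloaders) does not.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class DuplicateDefinitionDemo {

    /** Hypothetical stand-in for a loader that accepts generated classes. */
    static final class GeneratedClassLoader extends ClassLoader {
        Class<?> define(String name, byte[] bytes) {
            return defineClass(name, bytes, 0, bytes.length);
        }
    }

    /** Emits a minimal public class with no fields or methods, extending Object. */
    static byte[] emptyClass(String internalName) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(0xCAFEBABE);                           // magic
            out.writeShort(0);                                  // minor version
            out.writeShort(52);                                 // major version (Java 8)
            out.writeShort(5);                                  // constant pool count (entries + 1)
            out.writeByte(7); out.writeShort(2);                // #1 Class -> #2
            out.writeByte(1); out.writeUTF(internalName);       // #2 Utf8 this class name
            out.writeByte(7); out.writeShort(4);                // #3 Class -> #4
            out.writeByte(1); out.writeUTF("java/lang/Object"); // #4 Utf8 super class name
            out.writeShort(0x0021);                             // ACC_PUBLIC | ACC_SUPER
            out.writeShort(1);                                  // this_class
            out.writeShort(3);                                  // super_class
            out.writeShort(0);                                  // interfaces
            out.writeShort(0);                                  // fields
            out.writeShort(0);                                  // methods
            out.writeShort(0);                                  // attributes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Two definitions of one name in a shared loader: the second is rejected. */
    static boolean sharedLoaderRejectsDuplicate() {
        byte[] bytes = emptyClass("POForEach_scope_6");
        GeneratedClassLoader shared = new GeneratedClassLoader();
        shared.define("POForEach_scope_6", bytes);
        try {
            shared.define("POForEach_scope_6", bytes);
            return false;
        } catch (LinkageError expected) {
            return true;
        }
    }

    /** One loader per caller: the same name can be defined independently. */
    static boolean separateLoadersAcceptDuplicate() {
        byte[] bytes = emptyClass("POForEach_scope_6");
        Class<?> first = new GeneratedClassLoader().define("POForEach_scope_6", bytes);
        Class<?> second = new GeneratedClassLoader().define("POForEach_scope_6", bytes);
        return first != second && first.getName().equals(second.getName());
    }

    public static void main(String[] args) {
        System.out.println("shared loader rejects duplicate: " + sharedLoaderRejectsDuplicate());
        System.out.println("separate loaders accept it: " + separateLoadersAcceptDuplicate());
    }
}
```

The class name is reused here only to mirror the error message; whether PigContext ends up with thread-local loaders or some other naming scheme is left to the follow-up jira.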
[jira] [Commented] (PIG-5283) Configuration is not passed to SparkPigSplits on the backend
[ https://issues.apache.org/jira/browse/PIG-5283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119287#comment-16119287 ]

liyunzhang_intel commented on PIG-5283:
---------------------------------------

[~szita]: I understand why {{CommonConfigurationKeys.IO_SERIALIZATIONS_KEY}} needs to be set. Why does {{PigConfiguration.PIG_COMPRESS_INPUT_SPLITS}} also need to be set in the configuration?

> Configuration is not passed to SparkPigSplits on the backend
> ------------------------------------------------------------
>
>                 Key: PIG-5283
>                 URL: https://issues.apache.org/jira/browse/PIG-5283
>             Project: Pig
>          Issue Type: Bug
>          Components: spark
>            Reporter: Adam Szita
>            Assignee: Adam Szita
>         Attachments: PIG-5283.0.patch, PIG-5283.1.patch
>
>
> When a Hadoop ObjectWritable is created during a Spark job, the instantiated
> PigSplit (wrapped into a SparkPigSplit) is given an empty Configuration
> instance.
> This happens
> [here|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SerializableWritable.scala#L44]

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (PIG-5283) Configuration is not passed to SparkPigSplits on the backend
[ https://issues.apache.org/jira/browse/PIG-5283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16118306#comment-16118306 ]

Adam Szita commented on PIG-5283:
---------------------------------

Attached [^PIG-5283.1.patch], which writes out only the necessary keys of the configuration. Unfortunately I don't see any way to write the config only once (instead of per split), as I need to have it ready at the very first stage of Spark task execution: the deserialization of the task.

[~kellyzly] yes, the PigInputFormatSpark#createRecordReader part comes much later in the execution, and will work just like before. In a way it is irrelevant to the current issue: it will set the full configuration on each split, but that is too late here, since we already need the configuration at task deserialization time.
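
To illustrate the idea of writing only the necessary configuration keys with each split, here is a minimal sketch. It is not the code in PIG-5283.1.patch. It assumes a plain Map in place of Hadoop's Configuration, and the class and method names (SplitConfSketch, writeSelected, readSelected) are hypothetical. The property name "io.serializations" is the value of CommonConfigurationKeys.IO_SERIALIZATIONS_KEY; "pig.compress.input.splits" is assumed to be the value behind PigConfiguration.PIG_COMPRESS_INPUT_SPLITS.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SplitConfSketch {

    /** Serializes only the listed keys that are actually present in conf. */
    static byte[] writeSelected(Map<String, String> conf, List<String> keys) {
        Map<String, String> subset = new LinkedHashMap<>();
        for (String key : keys) {
            String value = conf.get(key);
            if (value != null) {
                subset.put(key, value);
            }
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(subset.size());          // number of entries that follow
            for (Map.Entry<String, String> e : subset.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Rebuilds the small configuration on the receiving side. */
    static Map<String, String> readSelected(byte[] bytes) {
        Map<String, String> conf = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String key = in.readUTF();
                String value = in.readUTF();
                conf.put(key, value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new HashMap<>();
        jobConf.put("io.serializations", "org.apache.hadoop.io.serializer.WritableSerialization");
        jobConf.put("pig.compress.input.splits", "true");
        jobConf.put("some.unrelated.key", "ignored");   // stands in for the other several hundred entries
        byte[] wire = writeSelected(jobConf,
                Arrays.asList("io.serializations", "pig.compress.input.splits"));
        System.out.println(readSelected(wire));
    }
}
```

The per-split cost then scales with the handful of selected keys rather than with the full job configuration.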
[jira] [Updated] (PIG-5283) Configuration is not passed to SparkPigSplits on the backend
[ https://issues.apache.org/jira/browse/PIG-5283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Szita updated PIG-5283:
----------------------------
    Attachment: PIG-5283.1.patch
[jira] [Commented] (PIG-5283) Configuration is not passed to SparkPigSplits on the backend
[ https://issues.apache.org/jira/browse/PIG-5283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16118075#comment-16118075 ]

liyunzhang_intel commented on PIG-5283:
---------------------------------------

[~szita]:
{quote}
My only question is that if we should only write those properties that are required for a PigSplit instead of writing the full jobConf (6-700 entries) for optimization.
{quote}
There is no need to initialize all the items; it is OK to initialize just the few items needed to make it work. Will PigInputFormatSpark#createRecordReader initialize all items once the current issue is bypassed?
[jira] Subscription: PIG patch available
Issue Subscription
Filter: PIG patch available (35 issues)

Subscriber: pigdaily

Key         Summary
PIG-5268    Review of org.apache.pig.backend.hadoop.datastorage.HDataStorage
            https://issues-test.apache.org/jira/browse/PIG-5268
PIG-5267    Review of org.apache.pig.impl.io.BufferedPositionedInputStream
            https://issues-test.apache.org/jira/browse/PIG-5267
PIG-5264    Remove deprecated keys from PigConfiguration
            https://issues-test.apache.org/jira/browse/PIG-5264
PIG-5246    Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2
            https://issues-test.apache.org/jira/browse/PIG-5246
PIG-5160    SchemaTupleFrontend.java is not thread safe, cause PigServer thrown NPE in multithread env
            https://issues-test.apache.org/jira/browse/PIG-5160
PIG-5157    Upgrade to Spark 2.0
            https://issues-test.apache.org/jira/browse/PIG-5157
PIG-5115    Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias
            https://issues-test.apache.org/jira/browse/PIG-5115
PIG-5106    Optimize when mapreduce.input.fileinputformat.input.dir.recursive set to true
            https://issues-test.apache.org/jira/browse/PIG-5106
PIG-5081    Can not run pig on spark source code distribution
            https://issues-test.apache.org/jira/browse/PIG-5081
PIG-5080    Support store alias as spark table
            https://issues-test.apache.org/jira/browse/PIG-5080
PIG-5057    IndexOutOfBoundsException when pig reducer processOnePackageOutput
            https://issues-test.apache.org/jira/browse/PIG-5057
PIG-5029    Optimize sort case when data is skewed
            https://issues-test.apache.org/jira/browse/PIG-5029
PIG-4926    Modify the content of start.xml for spark mode
            https://issues-test.apache.org/jira/browse/PIG-4926
PIG-4913    Reduce jython function initiation during compilation
            https://issues-test.apache.org/jira/browse/PIG-4913
PIG-4849    pig on tez will cause tez-ui to crash,because the content from timeline server is too long.
            https://issues-test.apache.org/jira/browse/PIG-4849
PIG-4750    REPLACE_MULTI should compile Pattern once and reuse it
            https://issues-test.apache.org/jira/browse/PIG-4750
PIG-4684    Exception should be changed to warning when job diagnostics cannot be fetched
            https://issues-test.apache.org/jira/browse/PIG-4684
PIG-4656    Improve String serialization and comparator performance in BinInterSedes
            https://issues-test.apache.org/jira/browse/PIG-4656
PIG-4598    Allow user defined plan optimizer rules
            https://issues-test.apache.org/jira/browse/PIG-4598
PIG-4551    Partition filter is not pushed down in case of SPLIT
            https://issues-test.apache.org/jira/browse/PIG-4551
PIG-4539    New PigUnit
            https://issues-test.apache.org/jira/browse/PIG-4539
PIG-4515    org.apache.pig.builtin.Distinct throws ClassCastException
            https://issues-test.apache.org/jira/browse/PIG-4515
PIG-4323    PackageConverter hanging in Spark
            https://issues-test.apache.org/jira/browse/PIG-4323
PIG-4313    StackOverflowError in LIMIT operation on Spark
            https://issues-test.apache.org/jira/browse/PIG-4313
PIG-4251    Pig on Storm
            https://issues-test.apache.org/jira/browse/PIG-4251
PIG-4002    Disable combiner when map-side aggregation is used
            https://issues-test.apache.org/jira/browse/PIG-4002
PIG-3952    PigStorage accepts '-tagSplit' to return full split information
            https://issues-test.apache.org/jira/browse/PIG-3952
PIG-3911    Define unique fields with @OutputSchema
            https://issues-test.apache.org/jira/browse/PIG-3911
PIG-3877    Getting Geo Latitude/Longitude from Address Lines
            https://issues-test.apache.org/jira/browse/PIG-3877
PIG-3873    Geo distance calculation using Haversine
            https://issues-test.apache.org/jira/browse/PIG-3873
PIG-3864    ToDate(userstring, format, timezone) computes DateTime with strange handling of Daylight Saving Time with location based timezones
            https://issues-test.apache.org/jira/browse/PIG-3864
PIG-3668    COR built-in function when atleast one of the coefficient values is NaN
            https://issues-test.apache.org/jira/browse/PIG-3668
PIG-3655    BinStorage and InterStorage approach to record markers is broken
            https://issues-test.apache.org/jira/browse/PIG-3655
PIG-3587    add functionality for rolling over dates
            https://issues-test.apache.org/jira/browse/PIG-3587
PIG-1804    Alow Jython function to implement Algebraic and/or Accumulator interfaces
            https://issues-test.apache.org/jira/browse/PIG-1804

You may edit this subscription at:
https://issues-test.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328=12322384
[jira] Subscription: PIG patch available
Issue Subscription
Filter: PIG patch available (33 issues)

Subscriber: pigdaily

Key         Summary
PIG-5283    Configuration is not passed to SparkPigSplits on the backend
            https://issues.apache.org/jira/browse/PIG-5283
PIG-5268    Review of org.apache.pig.backend.hadoop.datastorage.HDataStorage
            https://issues.apache.org/jira/browse/PIG-5268
PIG-5267    Review of org.apache.pig.impl.io.BufferedPositionedInputStream
            https://issues.apache.org/jira/browse/PIG-5267
PIG-5256    Bytecode generation for POFilter and POForeach
            https://issues.apache.org/jira/browse/PIG-5256
PIG-5160    SchemaTupleFrontend.java is not thread safe, cause PigServer thrown NPE in multithread env
            https://issues.apache.org/jira/browse/PIG-5160
PIG-5115    Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias
            https://issues.apache.org/jira/browse/PIG-5115
PIG-5106    Optimize when mapreduce.input.fileinputformat.input.dir.recursive set to true
            https://issues.apache.org/jira/browse/PIG-5106
PIG-5081    Can not run pig on spark source code distribution
            https://issues.apache.org/jira/browse/PIG-5081
PIG-5080    Support store alias as spark table
            https://issues.apache.org/jira/browse/PIG-5080
PIG-5057    IndexOutOfBoundsException when pig reducer processOnePackageOutput
            https://issues.apache.org/jira/browse/PIG-5057
PIG-5029    Optimize sort case when data is skewed
            https://issues.apache.org/jira/browse/PIG-5029
PIG-4926    Modify the content of start.xml for spark mode
            https://issues.apache.org/jira/browse/PIG-4926
PIG-4913    Reduce jython function initiation during compilation
            https://issues.apache.org/jira/browse/PIG-4913
PIG-4849    pig on tez will cause tez-ui to crash,because the content from timeline server is too long.
            https://issues.apache.org/jira/browse/PIG-4849
PIG-4750    REPLACE_MULTI should compile Pattern once and reuse it
            https://issues.apache.org/jira/browse/PIG-4750
PIG-4684    Exception should be changed to warning when job diagnostics cannot be fetched
            https://issues.apache.org/jira/browse/PIG-4684
PIG-4656    Improve String serialization and comparator performance in BinInterSedes
            https://issues.apache.org/jira/browse/PIG-4656
PIG-4598    Allow user defined plan optimizer rules
            https://issues.apache.org/jira/browse/PIG-4598
PIG-4551    Partition filter is not pushed down in case of SPLIT
            https://issues.apache.org/jira/browse/PIG-4551
PIG-4539    New PigUnit
            https://issues.apache.org/jira/browse/PIG-4539
PIG-4515    org.apache.pig.builtin.Distinct throws ClassCastException
            https://issues.apache.org/jira/browse/PIG-4515
PIG-4323    PackageConverter hanging in Spark
            https://issues.apache.org/jira/browse/PIG-4323
PIG-4313    StackOverflowError in LIMIT operation on Spark
            https://issues.apache.org/jira/browse/PIG-4313
PIG-4251    Pig on Storm
            https://issues.apache.org/jira/browse/PIG-4251
PIG-4002    Disable combiner when map-side aggregation is used
            https://issues.apache.org/jira/browse/PIG-4002
PIG-3952    PigStorage accepts '-tagSplit' to return full split information
            https://issues.apache.org/jira/browse/PIG-3952
PIG-3911    Define unique fields with @OutputSchema
            https://issues.apache.org/jira/browse/PIG-3911
PIG-3877    Getting Geo Latitude/Longitude from Address Lines
            https://issues.apache.org/jira/browse/PIG-3877
PIG-3873    Geo distance calculation using Haversine
            https://issues.apache.org/jira/browse/PIG-3873
PIG-3864    ToDate(userstring, format, timezone) computes DateTime with strange handling of Daylight Saving Time with location based timezones
            https://issues.apache.org/jira/browse/PIG-3864
PIG-3668    COR built-in function when atleast one of the coefficient values is NaN
            https://issues.apache.org/jira/browse/PIG-3668
PIG-3587    add functionality for rolling over dates
            https://issues.apache.org/jira/browse/PIG-3587
PIG-1804    Alow Jython function to implement Algebraic and/or Accumulator interfaces
            https://issues.apache.org/jira/browse/PIG-1804

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328=12322384