Re: Review Request 61265: PIG-5256: Bytecode generation for POFilter and POForeach

2017-08-08 Thread Rohini Palaniswamy

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61265/
---

(Updated Aug. 9, 2017, 3:59 a.m.)


Review request for pig, Daniel Dai and Koji Noguchi.


Changes
---

During performance testing on large data, found that there were a lot of 
spills. Fixed that by avoiding materialization of the full databag in nested 
foreach, similar to how it happens in regular processing.

In addition to UDFs, any repeated POSort and PODistinct is now processed only 
once. Their values are also cleared for garbage collection as soon as the last 
foreach plan referring to them is done, instead of at the end when the input 
was detached.
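
The "process once, release after the last reader" idea can be sketched in 
plain Java as below. This is only an illustration under assumptions: 
SharedResult, its fields and the consumer count are hypothetical names, not 
classes from the patch, and the real change works on generated bytecode 
rather than on a helper like this.

```java
import java.util.function.Supplier;

/**
 * Hypothetical helper: computes a value at most once per input record and
 * drops its reference as soon as the last consumer has read it.
 */
final class SharedResult<T> {
    private final Supplier<T> compute; // e.g. a sort or distinct over a bag
    private final int consumers;       // number of foreach plans reading the result
    private int remaining;             // consumers that have not read it yet
    private T value;                   // cached result, null once released
    private boolean computed;
    int computeCount;                  // illustration only: how often compute ran

    SharedResult(Supplier<T> compute, int consumers) {
        this.compute = compute;
        this.consumers = consumers;
        this.remaining = consumers;
    }

    /** Called exactly once by each consumer while processing one record. */
    T get() {
        if (!computed) {
            value = compute.get();
            computed = true;
            computeCount++;
        }
        T result = value;
        if (--remaining == 0) {
            // Last consumer is done: release the reference right away
            // and reset the state for the next input record.
            value = null;
            computed = false;
            remaining = consumers;
        }
        return result;
    }

    boolean isHoldingValue() {
        return value != null;
    }
}
```

With two consumers, the first get() computes and caches the value, the second 
get() reuses it and then clears the cached reference, so nothing is held until 
the input is detached.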


Bugs: PIG-5256
https://issues.apache.org/jira/browse/PIG-5256


Repository: pig


Description
---

For each POForEach or POFilter operator in the plan, a corresponding class is 
generated with the code for its input plans inlined. 
Bytecode is generated directly through the ASM APIs and the generated class 
files are shipped to the tasks as a jar. On the frontend, the generated classes 
are dynamically added to the PigContext classloader. The explain command now 
decompiles the class files in the explain output directory. Refer to the golden 
files in the test for examples of generated code.
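
As a rough illustration of what inlining an input plan means, the sketch below 
contrasts an interpreted operator tree with a hand-written class that has the 
same filter condition fused into one method. All names here (Expr, Column, 
Const, GreaterThan, FilterInlined) are hypothetical and are not the Pig 
operators or the classes the patch generates; the patch emits bytecode, not 
Java source.

```java
import java.util.function.Predicate;

/** Interpreted form: one virtual call per operator per record. */
interface Expr {
    Object eval(Object[] tuple);
}

final class Column implements Expr {
    private final int index;
    Column(int index) { this.index = index; }
    public Object eval(Object[] tuple) { return tuple[index]; }
}

final class Const implements Expr {
    private final Object value;
    Const(Object value) { this.value = value; }
    public Object eval(Object[] tuple) { return value; }
}

final class GreaterThan implements Expr {
    private final Expr left;
    private final Expr right;
    GreaterThan(Expr left, Expr right) { this.left = left; this.right = right; }
    public Object eval(Object[] tuple) {
        Object l = left.eval(tuple);
        Object r = right.eval(tuple);
        if (l == null || r == null) {
            return null; // null operand: result is unknown
        }
        return ((Integer) l) > ((Integer) r);
    }
}

/** Hand-written stand-in for a generated class: the filter "$1 > 10" inlined. */
final class FilterInlined implements Predicate<Object[]> {
    public boolean test(Object[] tuple) {
        Object v = tuple[1];
        return v != null && ((Integer) v) > 10;
    }
}
```

Both forms accept the same rows; the inlined one avoids walking the expression 
tree for every record.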

License:
The newly added fernflower.jar used for decompilation is from IntelliJ (the 
Java decompiler used by IntelliJ IDEA):
https://github.com/JetBrains/intellij-community/blob/master/plugins/java-decompiler/engine/src/org/jetbrains/java/decompiler/main/Fernflower.java
Its license is Apache 2.0:
https://github.com/JetBrains/intellij-community/blob/master/LICENSE.txt

TODO items to be addressed in a separate jira:
1) PIG-5279 - Support for MR and Spark. Currently only done for Tez. Will also 
add documentation in this jira
2) Support for Accumulator
3) Support for CROSS
4) Run the optimizer on combiner plans
5) Fix for test failures - TestScriptLanguage.runParallelTest2, 
Jython_CompileBindRun_3
 java.lang.LinkageError: loader (instance of  
org/apache/pig/impl/PigContext$ContextClassLoader): attempted  duplicate class 
definition for name: 
"org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POForEach_scope_6"
 Since NodeIdGenerator is ThreadLocal, running the same script in parallel 
with different parameters using embedded Python causes a conflict: the threads 
can produce the same generated class name. Fixing this requires ThreadLocal 
classloaders in PigContext. Will address in a separate jira.
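
The failure mode in item 5 can be reproduced in isolation with plain Java. The 
sketch below is an assumption-laden illustration, not Pig code: 
MinimalClassFile, DefiningLoader and DuplicateDefinitionDemo are hypothetical 
names, and the class bytes are a hand-built empty class rather than real 
generated output. It shows that defining the same class name twice in one 
loader raises a LinkageError, while separate loaders each accept it.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class MinimalClassFile {
    /** Emits the bytes of an empty public class in the default package. */
    static byte[] emptyClass(String name) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(0xCAFEBABE);   // magic number
            out.writeShort(0);          // minor version
            out.writeShort(52);         // major version 52 (Java 8)
            out.writeShort(5);          // constant pool count (entries 1..4)
            out.writeByte(1); out.writeUTF(name);               // #1 Utf8: class name
            out.writeByte(7); out.writeShort(1);                // #2 Class -> #1
            out.writeByte(1); out.writeUTF("java/lang/Object"); // #3 Utf8: super name
            out.writeByte(7); out.writeShort(3);                // #4 Class -> #3
            out.writeShort(0x0021);     // access flags: ACC_PUBLIC | ACC_SUPER
            out.writeShort(2);          // this_class
            out.writeShort(4);          // super_class
            out.writeShort(0);          // interfaces count
            out.writeShort(0);          // fields count
            out.writeShort(0);          // methods count
            out.writeShort(0);          // attributes count
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }
}

/** Loader that exposes defineClass, standing in for a context classloader. */
final class DefiningLoader extends ClassLoader {
    Class<?> define(String name, byte[] bytes) {
        return defineClass(name, bytes, 0, bytes.length);
    }
}

final class DuplicateDefinitionDemo {
    /** Two definitions of one name in the same loader: the second one fails. */
    static boolean sameLoaderRejectsDuplicate(String name) {
        byte[] bytes = MinimalClassFile.emptyClass(name);
        DefiningLoader loader = new DefiningLoader();
        loader.define(name, bytes);
        try {
            loader.define(name, bytes);
            return false;
        } catch (LinkageError expected) {
            return true;
        }
    }

    /** One loader per caller: the same name can be defined independently. */
    static boolean separateLoadersAccept(String name) {
        byte[] bytes = MinimalClassFile.emptyClass(name);
        Class<?> a = new DefiningLoader().define(name, bytes);
        Class<?> b = new DefiningLoader().define(name, bytes);
        return a != b && a.getName().equals(b.getName());
    }
}
```

The second method hints at why per-thread classloaders would avoid the 
conflict: each loader has its own namespace for defined classes.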

Other jiras fixed as part of this:
1) PIG-4515: org.apache.pig.builtin.Distinct throws ClassCastException
2) When pig.opt.bytecode=true, UDFs defined as an alias inside nested foreach 
are executed only once. This was actually the primary goal of doing bytecode 
generation.
PIG-3000: Optimize nested foreach
PIG-1633: Using an alias within Nested Foreach causes indeterminate behaviour


Diffs (updated)
-

  http://svn.apache.org/repos/asf/pig/trunk/build.xml 1804148 
  http://svn.apache.org/repos/asf/pig/trunk/ivy.xml 1804148 
  http://svn.apache.org/repos/asf/pig/trunk/ivy/libraries.properties 1804148 
  http://svn.apache.org/repos/asf/pig/trunk/shade/pom.xml PRE-CREATION 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/PigConfiguration.java 1804148 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/PigServer.java 1804148 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1804148 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRUtil.java 1804148 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/optimizer/AsmUtil.java PRE-CREATION 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/optimizer/CodeGenerator.java PRE-CREATION 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/optimizer/FilterCodeGenerator.java PRE-CREATION 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/optimizer/ForEachCodeGenerator.java PRE-CREATION 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/BagInputOperator.java PRE-CREATION 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/PhysicalOperator.java 1804148 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/Add.java 1804148 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/Divide.java 1804148 
  

[jira] [Commented] (PIG-5283) Configuration is not passed to SparkPigSplits on the backend

2017-08-08 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119287#comment-16119287
 ] 

liyunzhang_intel commented on PIG-5283:
---

[~szita]: I understand why {{CommonConfigurationKeys.IO_SERIALIZATIONS_KEY}} 
needs to be set. Why does {{PigConfiguration.PIG_COMPRESS_INPUT_SPLITS}} also 
need to be set in the configuration?

> Configuration is not passed to SparkPigSplits on the backend
> 
>
> Key: PIG-5283
> URL: https://issues.apache.org/jira/browse/PIG-5283
> Project: Pig
>  Issue Type: Bug
>  Components: spark
> Reporter: Adam Szita
> Assignee: Adam Szita
> Attachments: PIG-5283.0.patch, PIG-5283.1.patch
>
>
> When a Hadoop ObjectWritable is created during a Spark job, the instantiated 
> PigSplit (wrapped into a SparkPigSplit) is given an empty Configuration 
> instance.
> This happens 
> [here|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SerializableWritable.scala#L44]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5283) Configuration is not passed to SparkPigSplits on the backend

2017-08-08 Thread Adam Szita (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16118306#comment-16118306
 ] 

Adam Szita commented on PIG-5283:
-

Attached [^PIG-5283.1.patch], which writes out only the necessary keys of the 
configuration.
Unfortunately I don't see any way to write the config only once (instead of per 
split), as I need to have it ready at the very first stage of Spark task 
execution: the deserialization of the task.

[~kellyzly] yes, the PigInputFormatSpark#createRecordReader part comes much 
later in the execution, and will work just like before. In a way it is 
irrelevant to the current issue: it will set the full configuration on each 
split, but that is too late here, since we already need the configuration at 
task deserialization time.
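
A minimal sketch of "write only the necessary keys per split" is shown below. 
It is not the attached patch: SplitConfigCodec is a hypothetical name, a plain 
Map stands in for the Hadoop Configuration, and the key strings in the usage 
note are assumed values of the constants discussed in this thread.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SplitConfigCodec {
    /** Serializes only the needed keys that are actually set in conf. */
    static byte[] writeSubset(Map<String, String> conf, List<String> neededKeys) {
        List<String> present = new ArrayList<>();
        for (String key : neededKeys) {
            if (conf.get(key) != null) {
                present.add(key);
            }
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(present.size());      // entry count first
            for (String key : present) {
                out.writeUTF(key);
                out.writeUTF(conf.get(key));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Rebuilds the small configuration on the reading side. */
    static Map<String, String> readSubset(byte[] bytes) {
        Map<String, String> conf = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String key = in.readUTF();
                String value = in.readUTF();
                conf.put(key, value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return conf;
    }
}
```

For example, calling writeSubset with a full job configuration and the list 
["io.serializations", "pig.compress.input.splits"] writes at most those two 
entries per split instead of several hundred.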




[jira] [Updated] (PIG-5283) Configuration is not passed to SparkPigSplits on the backend

2017-08-08 Thread Adam Szita (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated PIG-5283:

Attachment: PIG-5283.1.patch



[jira] [Commented] (PIG-5283) Configuration is not passed to SparkPigSplits on the backend

2017-08-08 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16118075#comment-16118075
 ] 

liyunzhang_intel commented on PIG-5283:
---

[~szita]:  
{quote}
My only question is that if we should only write those properties that are 
required for a PigSplit instead of writing the full jobConf (6-700 entries) for 
optimization.

{quote}

There is no need to initialize all the items; it is OK to initialize just the 
few needed to make it work. Will PigInputFormatSpark#createRecordReader 
initialize all items once the current issue is bypassed?

> Configuration is not passed to SparkPigSplits on the backend
> 
>
> Key: PIG-5283
> URL: https://issues.apache.org/jira/browse/PIG-5283
> Project: Pig
>  Issue Type: Bug
>  Components: spark
> Reporter: Adam Szita
> Assignee: Adam Szita
> Attachments: PIG-5283.0.patch
>
>
> When a Hadoop ObjectWritable is created during a Spark job, the instantiated 
> PigSplit (wrapped into a SparkPigSplit) is given an empty Configuration 
> instance.
> This happens 
> [here|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SerializableWritable.scala#L44]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] Subscription: PIG patch available

2017-08-08 Thread jira
Issue Subscription
Filter: PIG patch available (35 issues)

Subscriber: pigdaily

Key       Summary
PIG-5268  Review of org.apache.pig.backend.hadoop.datastorage.HDataStorage
          https://issues-test.apache.org/jira/browse/PIG-5268
PIG-5267  Review of org.apache.pig.impl.io.BufferedPositionedInputStream
          https://issues-test.apache.org/jira/browse/PIG-5267
PIG-5264  Remove deprecated keys from PigConfiguration
          https://issues-test.apache.org/jira/browse/PIG-5264
PIG-5246  Modify bin/pig about SPARK_HOME, SPARK_ASSEMBLY_JAR after upgrading spark to 2
          https://issues-test.apache.org/jira/browse/PIG-5246
PIG-5160  SchemaTupleFrontend.java is not thread safe, cause PigServer thrown NPE in multithread env
          https://issues-test.apache.org/jira/browse/PIG-5160
PIG-5157  Upgrade to Spark 2.0
          https://issues-test.apache.org/jira/browse/PIG-5157
PIG-5115  Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias
          https://issues-test.apache.org/jira/browse/PIG-5115
PIG-5106  Optimize when mapreduce.input.fileinputformat.input.dir.recursive set to true
          https://issues-test.apache.org/jira/browse/PIG-5106
PIG-5081  Can not run pig on spark source code distribution
          https://issues-test.apache.org/jira/browse/PIG-5081
PIG-5080  Support store alias as spark table
          https://issues-test.apache.org/jira/browse/PIG-5080
PIG-5057  IndexOutOfBoundsException when pig reducer processOnePackageOutput
          https://issues-test.apache.org/jira/browse/PIG-5057
PIG-5029  Optimize sort case when data is skewed
          https://issues-test.apache.org/jira/browse/PIG-5029
PIG-4926  Modify the content of start.xml for spark mode
          https://issues-test.apache.org/jira/browse/PIG-4926
PIG-4913  Reduce jython function initiation during compilation
          https://issues-test.apache.org/jira/browse/PIG-4913
PIG-4849  pig on tez will cause tez-ui to crash,because the content from timeline server is too long.
          https://issues-test.apache.org/jira/browse/PIG-4849
PIG-4750  REPLACE_MULTI should compile Pattern once and reuse it
          https://issues-test.apache.org/jira/browse/PIG-4750
PIG-4684  Exception should be changed to warning when job diagnostics cannot be fetched
          https://issues-test.apache.org/jira/browse/PIG-4684
PIG-4656  Improve String serialization and comparator performance in BinInterSedes
          https://issues-test.apache.org/jira/browse/PIG-4656
PIG-4598  Allow user defined plan optimizer rules
          https://issues-test.apache.org/jira/browse/PIG-4598
PIG-4551  Partition filter is not pushed down in case of SPLIT
          https://issues-test.apache.org/jira/browse/PIG-4551
PIG-4539  New PigUnit
          https://issues-test.apache.org/jira/browse/PIG-4539
PIG-4515  org.apache.pig.builtin.Distinct throws ClassCastException
          https://issues-test.apache.org/jira/browse/PIG-4515
PIG-4323  PackageConverter hanging in Spark
          https://issues-test.apache.org/jira/browse/PIG-4323
PIG-4313  StackOverflowError in LIMIT operation on Spark
          https://issues-test.apache.org/jira/browse/PIG-4313
PIG-4251  Pig on Storm
          https://issues-test.apache.org/jira/browse/PIG-4251
PIG-4002  Disable combiner when map-side aggregation is used
          https://issues-test.apache.org/jira/browse/PIG-4002
PIG-3952  PigStorage accepts '-tagSplit' to return full split information
          https://issues-test.apache.org/jira/browse/PIG-3952
PIG-3911  Define unique fields with @OutputSchema
          https://issues-test.apache.org/jira/browse/PIG-3911
PIG-3877  Getting Geo Latitude/Longitude from Address Lines
          https://issues-test.apache.org/jira/browse/PIG-3877
PIG-3873  Geo distance calculation using Haversine
          https://issues-test.apache.org/jira/browse/PIG-3873
PIG-3864  ToDate(userstring, format, timezone) computes DateTime with strange handling of Daylight Saving Time with location based timezones
          https://issues-test.apache.org/jira/browse/PIG-3864
PIG-3668  COR built-in function when atleast one of the coefficient values is NaN
          https://issues-test.apache.org/jira/browse/PIG-3668
PIG-3655  BinStorage and InterStorage approach to record markers is broken
          https://issues-test.apache.org/jira/browse/PIG-3655
PIG-3587  add functionality for rolling over dates
          https://issues-test.apache.org/jira/browse/PIG-3587
PIG-1804  Alow Jython function to implement Algebraic and/or Accumulator interfaces
          https://issues-test.apache.org/jira/browse/PIG-1804

You may edit this subscription at:
https://issues-test.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328=12322384


[jira] Subscription: PIG patch available

2017-08-08 Thread jira
Issue Subscription
Filter: PIG patch available (33 issues)

Subscriber: pigdaily

Key       Summary
PIG-5283  Configuration is not passed to SparkPigSplits on the backend
          https://issues.apache.org/jira/browse/PIG-5283
PIG-5268  Review of org.apache.pig.backend.hadoop.datastorage.HDataStorage
          https://issues.apache.org/jira/browse/PIG-5268
PIG-5267  Review of org.apache.pig.impl.io.BufferedPositionedInputStream
          https://issues.apache.org/jira/browse/PIG-5267
PIG-5256  Bytecode generation for POFilter and POForeach
          https://issues.apache.org/jira/browse/PIG-5256
PIG-5160  SchemaTupleFrontend.java is not thread safe, cause PigServer thrown NPE in multithread env
          https://issues.apache.org/jira/browse/PIG-5160
PIG-5115  Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias
          https://issues.apache.org/jira/browse/PIG-5115
PIG-5106  Optimize when mapreduce.input.fileinputformat.input.dir.recursive set to true
          https://issues.apache.org/jira/browse/PIG-5106
PIG-5081  Can not run pig on spark source code distribution
          https://issues.apache.org/jira/browse/PIG-5081
PIG-5080  Support store alias as spark table
          https://issues.apache.org/jira/browse/PIG-5080
PIG-5057  IndexOutOfBoundsException when pig reducer processOnePackageOutput
          https://issues.apache.org/jira/browse/PIG-5057
PIG-5029  Optimize sort case when data is skewed
          https://issues.apache.org/jira/browse/PIG-5029
PIG-4926  Modify the content of start.xml for spark mode
          https://issues.apache.org/jira/browse/PIG-4926
PIG-4913  Reduce jython function initiation during compilation
          https://issues.apache.org/jira/browse/PIG-4913
PIG-4849  pig on tez will cause tez-ui to crash,because the content from timeline server is too long.
          https://issues.apache.org/jira/browse/PIG-4849
PIG-4750  REPLACE_MULTI should compile Pattern once and reuse it
          https://issues.apache.org/jira/browse/PIG-4750
PIG-4684  Exception should be changed to warning when job diagnostics cannot be fetched
          https://issues.apache.org/jira/browse/PIG-4684
PIG-4656  Improve String serialization and comparator performance in BinInterSedes
          https://issues.apache.org/jira/browse/PIG-4656
PIG-4598  Allow user defined plan optimizer rules
          https://issues.apache.org/jira/browse/PIG-4598
PIG-4551  Partition filter is not pushed down in case of SPLIT
          https://issues.apache.org/jira/browse/PIG-4551
PIG-4539  New PigUnit
          https://issues.apache.org/jira/browse/PIG-4539
PIG-4515  org.apache.pig.builtin.Distinct throws ClassCastException
          https://issues.apache.org/jira/browse/PIG-4515
PIG-4323  PackageConverter hanging in Spark
          https://issues.apache.org/jira/browse/PIG-4323
PIG-4313  StackOverflowError in LIMIT operation on Spark
          https://issues.apache.org/jira/browse/PIG-4313
PIG-4251  Pig on Storm
          https://issues.apache.org/jira/browse/PIG-4251
PIG-4002  Disable combiner when map-side aggregation is used
          https://issues.apache.org/jira/browse/PIG-4002
PIG-3952  PigStorage accepts '-tagSplit' to return full split information
          https://issues.apache.org/jira/browse/PIG-3952
PIG-3911  Define unique fields with @OutputSchema
          https://issues.apache.org/jira/browse/PIG-3911
PIG-3877  Getting Geo Latitude/Longitude from Address Lines
          https://issues.apache.org/jira/browse/PIG-3877
PIG-3873  Geo distance calculation using Haversine
          https://issues.apache.org/jira/browse/PIG-3873
PIG-3864  ToDate(userstring, format, timezone) computes DateTime with strange handling of Daylight Saving Time with location based timezones
          https://issues.apache.org/jira/browse/PIG-3864
PIG-3668  COR built-in function when atleast one of the coefficient values is NaN
          https://issues.apache.org/jira/browse/PIG-3668
PIG-3587  add functionality for rolling over dates
          https://issues.apache.org/jira/browse/PIG-3587
PIG-1804  Alow Jython function to implement Algebraic and/or Accumulator interfaces
          https://issues.apache.org/jira/browse/PIG-1804

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328=12322384