[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-07-23 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16097845#comment-16097845
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~szita]: 
{quote}
it looks like you've missed adding an entry to CHANGES.txt upon commit. I've 
added it now: 
{quote}
 Thanks for catching that.

[~szita] or [~nkollar]: please spend some time reviewing PIG-5246 if you have time, thanks!

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157_15.patch, PIG-5157.patch, SkewedJoinInput1.txt, 
> SkewedJoinInput2.txt
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-07-22 Thread Adam Szita (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16097247#comment-16097247
 ] 

Adam Szita commented on PIG-5157:
-

[~nkollar] thanks for adding this feature!
[~kellyzly] it looks like you've missed adding an entry to CHANGES.txt upon 
commit. I've added it now: 
https://github.com/apache/pig/commit/ce8aa41d16dbb3e7c3b97b23c18fd2473e9a5938

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157_15.patch, PIG-5157.patch, SkewedJoinInput1.txt, 
> SkewedJoinInput2.txt
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-07-19 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16092811#comment-16092811
 ] 

Nandor Kollar commented on PIG-5157:


Thanks [~rohini], [~kellyzly] and [~szita] for your help in resolving this 
feature!

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157_15.patch, PIG-5157.patch, SkewedJoinInput1.txt, 
> SkewedJoinInput2.txt
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-07-18 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16092460#comment-16092460
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: committed PIG-5157_15.patch to trunk. Thanks for your development work, 
as upgrading to Spark 2 is a big feature, and thanks also to [~szita] and [~rohini] 
for the review.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch, SkewedJoinInput1.txt, 
> SkewedJoinInput2.txt
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-07-17 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089381#comment-16089381
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: left a comment on the review board, just a small fix! Meanwhile, please 
help review PIG-5246, thanks!

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch, SkewedJoinInput1.txt, 
> SkewedJoinInput2.txt
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-07-13 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085443#comment-16085443
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: attached SkewedJoinInput1.txt and SkewedJoinInput2.txt, which I used 
in testJoin.pig. Please check whether there is a similar error in your 
env.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch, SkewedJoinInput1.txt, 
> SkewedJoinInput2.txt
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-07-09 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079868#comment-16079868
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: sorry for the late reply.

Here is the result after solving the exception I mentioned last time,
on Spark 1 in yarn-client mode:
{noformat}
export 
SPARK_JAR=hdfs://zly1.sh.intel.com:8020/user/root/spark-assembly-1.6.1-hadoop2.6.0.jar
export SPARK_HOME=$SPARK161  # download spark 1.6.1
export HADOOP_USER_CLASSPATH_FIRST="true"
$PIG_HOME/bin/pig -x spark  $PIG_HOME/bin/testJoin.pig
{noformat}

pig.properties
{noformat}
pig.sort.readonce.loadfuncs=org.apache.pig.backend.hadoop.hbase.HBaseStorage,org.apache.pig.backend.hadoop.accumulo.AccumuloStorage
spark.master=yarn-client
{noformat}

testJoin.pig
{code}
A = load './SkewedJoinInput1.txt' as (id,name,n);
B = load './SkewedJoinInput2.txt' as (id,name);
D = join A by (id,name), B by (id,name) parallel 10; 
store D into './testJoin.out';
{code} 

The script fails to generate a result, and the exception found in the log is:
{noformat}

[task-result-getter-0] 2017-07-10 12:16:45,667 WARN  scheduler.TaskSetManager 
(Logging.scala:logWarning(70)) - Lost task 0.0 in stage 0.0 (TID 0, 
zly1.sh.intel.com): java.lang.IllegalStateException: unread block data
at 
java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

{noformat}

Can you verify whether the same problem occurs on your cluster in yarn-client 
mode (in my cluster, it passed in local mode but failed in yarn-client mode)? 
The error looks like a datanode problem, but I verified the environment with 
the spark branch code and it passed, so I guess the problem is caused by the 
patch.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-06-30 Thread Adam Szita (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069791#comment-16069791
 ] 

Adam Szita commented on PIG-5157:
-

[~kellyzly] can you check what Spark jars are on the classpath of the process 
which is throwing this exception? It seems like a mismatch between Spark 
classes (what Pig was built with vs. what is on the cluster now used for 
execution).
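For instance, one rough way to check it (assuming a standard Linux box; the exact paths depend on how Spark is installed):
{noformat}
# show the full command line (including classpath/--jars) of the running Spark JVMs
ps -ef | grep -i spark | grep java

# Spark 1.x ships a single assembly jar, Spark 2.x ships a jars/ directory
ls $SPARK_HOME/lib/spark-assembly-*.jar 2>/dev/null
ls $SPARK_HOME/jars/spark-network-common_* 2>/dev/null
{noformat}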

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-06-30 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069728#comment-16069728
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: tested with PIG-5157_13.patch from the review board.
The test result:
passes on Spark 1 and Spark 2 in local mode
fails in yarn-client mode (with spark.master=yarn-client added in conf/pig.properties)
The exception message found in the log:
{noformat}
 [shuffle-server-0] 2017-06-30 14:24:25,501 WARN  
server.TransportChannelHandler 
(TransportChannelHandler.java:exceptionCaught(79)) - Exception in connection from /10.239.47.58:58214
1264 java.lang.NoSuchMethodError: 
org.apache.spark.network.client.TransportClient.getChannel()Lio/netty/channel/Channel;
{noformat}

If you cannot reproduce it in your env, please tell me as well; I will check 
whether it is caused by configuration or not.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-06-26 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062815#comment-16062815
 ] 

Nandor Kollar commented on PIG-5157:


[~kellyzly] ok, I'm now executing the full e2e test suite on Spark 1.x, waiting 
for the results.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-06-26 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062805#comment-16062805
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: with PIG-5157_11.patch, the TestGrunt unit test passes. But there is 
a problem in the yarn-cluster env; I will investigate more.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-06-23 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060633#comment-16060633
 ] 

Nandor Kollar commented on PIG-5157:


I'll update on RB soon. This is what I ran with my latest patch:
{code}
ant  -Dtest.junit.output.format=xml clean  -Dtestcase=TestGrunt  
-Dexectype=spark  -Dhadoopversion=2  test
...
[junit] Tests run: 67, Failures: 0, Errors: 0, Skipped: 4, Time elapsed: 
64.505 sec
   [delete] Deleting directory 
/var/folders/0n/97lfzsrs3dj1nlgfghj3221wgp/T/pig_junit_tmp1871324592

BUILD SUCCESSFUL
Total time: 3 minutes 8 seconds
{code}

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-06-23 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060622#comment-16060622
 ] 

Nandor Kollar commented on PIG-5157:


[~kellyzly] I'll have a look at it. It seems there's another issue I noticed 
while trying to execute a simple script (not from a unit test): Pig hangs. The 
problem is with the synchronized methods in JobMetricsListener and SparkListener: 
since these are now two different objects, synchronized will lock on different 
objects, and wait/notify is also called on different instances. I'll update the 
patch so that every method in JobMetricsListener synchronizes on its 
sparkListener private field and calls wait() on that instance.
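For illustration, a minimal sketch of that locking pattern (field and method names are hypothetical, not the actual patch): the listener callback and the waiting code must synchronize and wait/notify on the same object, i.e. the sparkListener instance itself.
{code}
import java.util.HashSet;
import java.util.Set;

import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerJobEnd;

public class JobMetricsListener {

    private final Set<Integer> finishedJobIds = new HashSet<Integer>();

    // All waiting and notification happens on this single instance.
    private final SparkListener sparkListener = new SparkListener() {
        @Override
        public void onJobEnd(SparkListenerJobEnd jobEnd) {
            synchronized (this) {                  // "this" is the sparkListener instance
                finishedJobIds.add(jobEnd.jobId());
                notifyAll();                       // wakes up waitForJobToEnd() below
            }
        }
    };

    public SparkListener getSparkListener() {
        return sparkListener;
    }

    public void waitForJobToEnd(int jobId) throws InterruptedException {
        synchronized (sparkListener) {             // same monitor as the callback above
            while (!finishedJobIds.contains(jobId)) {
                sparkListener.wait();
            }
        }
    }
}
{code}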

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-06-23 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060574#comment-16060574
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: looks good, but I met some problems when testing in local and 
yarn-client mode; give me more time to verify whether the problem is caused by 
the configuration or something else. Thanks!
After applying this patch, I ran the unit test:
{code}
 ant  -Dtest.junit.output.format=xml clean  -Dtestcase=TestGrunt  
-Dexectype=spark  -Dhadoopversion=2  test

{code}
the result:
{noformat}
Tests run: 67, Failures: 1, Errors: 5, Skipped: 4, Time elapsed: 138.459 sec

{noformat}

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-06-22 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16059727#comment-16059727
 ] 

Rohini Palaniswamy commented on PIG-5157:
-

+1 for https://reviews.apache.org/r/59530/diff/7/ .  

[~nkollar],
  Can you upload the final patch here?

[~kellyzly],
Can you retry with the new patch and verify that it works for you? Please go 
ahead and commit if you are +1 on the patch.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-06-21 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16057075#comment-16057075
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: I applied the patch and tested a simple query in a yarn-client env.
Build jar:
{noformat}ant clean -v -Dhadoopversion=2 jar-spark12{noformat}
testJoin.pig
{code}
A = load './SkewedJoinInput1.txt' as (id,name,n);
B = load './SkewedJoinInput2.txt' as (id,name);
D = join A by (id,name), B by (id,name) parallel 10; 
store D into './testJoin.out';
{code}

spark1:
{noformat}
export SPARK_HOME=
export SPARK_JAR=hdfs://:8020/user/root/spark-assembly-1.6.1-hadoop2.6.0.jar
$PIG_HOME/bin/pig -x spark -logfile $PIG_HOME/logs/pig.log testJoin.pig
{noformat}
error in logs/pig:
{noformat}
java.lang.NoClassDefFoundError: 
org/apache/spark/scheduler/SparkListenerInterface
at 
org.apache.pig.backend.hadoop.executionengine.spark.SparkExecutionEngine.(SparkExecutionEngine.java:35)
at 
org.apache.pig.backend.hadoop.executionengine.spark.SparkExecType.getExecutionEngine(SparkExecType.java:42)
at org.apache.pig.impl.PigContext.(PigContext.java:269)
at org.apache.pig.impl.PigContext.(PigContext.java:256)
at org.apache.pig.Main.run(Main.java:389)
at org.apache.pig.Main.main(Main.java:175)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.ClassNotFoundException: 
org.apache.spark.scheduler.SparkListenerInterface
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 12 more
{noformat}

spark2 (with patch PIG-5246_2.patch):
{noformat}
export SPARK_HOME=
$PIG_HOME/bin/pig -x spark -logfile $PIG_HOME/logs/pig.log testJoin.pig
{noformat}
error in logs/pig:
{noformat}
[main] 2017-06-21 14:14:05,791 ERROR spark.JobGraphBuilder 
(JobGraphBuilder.java:sparkOperToRDD(187)) - throw exception in sparkOperToRDD: 
org.apache.spark.SparkException: Task not serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2037)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:763)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:762)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:762)
at 
org.apache.spark.api.java.JavaRDDLike$class.mapPartitions(JavaRDDLike.scala:166)
at 
org.apache.spark.api.java.AbstractJavaRDDLike.mapPartitions(JavaRDDLike.scala:45)
at 
org.apache.pig.backend.hadoop.executionengine.spark.converter.ForEachConverter.convert(ForEachConverter.java:64)
at 
org.apache.pig.backend.hadoop.executionengine.spark.converter.ForEachConverter.convert(ForEachConverter.java:45)
at 
org.apache.pig.backend.hadoop.executionengine.spark.JobGraphBuilder.physicalToRDD(JobGraphBuilder.java:292)
at 
org.apache.pig.backend.hadoop.executionengine.spark.JobGraphBuilder.physicalToRDD(JobGraphBuilder.java:248)
at 
org.apache.pig.backend.hadoop.executionengine.spark.JobGraphBuilder.physicalToRDD(JobGraphBuilder.java:248)
at 
org.apache.pig.backend.hadoop.executionengine.spark.JobGraphBuilder.physicalToRDD(JobGraphBuilder.java:248)
at 
org.apache.pig.backend.hadoop.executionengine.spark.JobGraphBuilder.sparkOperToRDD(JobGraphBuilder.java:182)
at 
org.apache.pig.backend.hadoop.executionengine.spark.JobGraphBuilder.visitSparkOp(JobGraphBuilder.java:112)
at 
org.apache.pig.backend.hadoop.executionengine.spark.plan.SparkOperator.visit(SparkOperator.java:140)
at 
org.apache.pig.backend.hadoop.executionengine.spark.plan.SparkOperator.visit(SparkOperator.java:37)
{noformat}

[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-06-19 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16053644#comment-16053644
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: It builds successfully in my env when using ant clean jar-spark12, 
but give me more time to test it on spark1 and spark2.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-06-15 Thread Adam Szita (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050198#comment-16050198
 ] 

Adam Szita commented on PIG-5157:
-

[~nkollar], [~kellyzly]: I checked the latest patch on RB and built the jar 
successfully with:
{code}
ant clean jar-spark12
{code}

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-06-15 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16050123#comment-16050123
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: after I downloaded the latest patch from RB, how should I compile it now?
When I use the following command
{code}
ant clean -v -Dhadoopversion=2 jar-spark12
{code}

I got the following error:
{noformat}
[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.7
94937 [javac] /home/zly/prj/oss/pig/src/org/apache/pig/backend/hadoop/executionengine/spark/SparkShim2.java:29: error: cannot find symbol
94938 [javac] import org.apache.spark.api.java.Optional;
94939 [javac] ^
94940 [javac]   symbol:   class Optional
94941 [javac]   location: package org.apache.spark.api.java
94942 [javac] /home/zly/prj/oss/pig/src/org/apache/pig/backend/hadoop/executionengine/spark/SparkShim2.java:99: error: no interface expected here
94943 [javac] private static class JobMetricsListener extends SparkListener {
94944 [javac] ^
94945 [javac] /home/zly/prj/oss/pig/build/ivy/lib/Pig/hbase-client-1.2.4.jar(org/apache/hadoop/hbase/filter/FilterList.class): warning: Cannot find annotation method 'value()' in type 'SuppressWarnings': class file for edu.umd.cs.findbugs.annotations.SuppressWarnings not found
94946 [javac] /home/zly/prj/oss/pig/build/ivy/lib/Pig/hbase-client-1.2.4.jar(org/apache/hadoop/hbase/filter/FilterList.class): warning: Cannot find annotation method 'justification()' in type 'SuppressWarnings'
94947 [javac] /home/zly/prj/oss/pig/build/ivy/lib/Pig/hbase-common-1.2.4.jar(org/apache/hadoop/hbase/io/ImmutableBytesWritable.class): warning: Cannot find annotation method 'value()' in type 'SuppressWarnings'
94948 [javac] /home/zly/prj/oss/pig/build/ivy/lib/Pig/hbase-common-1.2.4.jar(org/apache/hadoop/hbase/io/ImmutableBytesWritable.class): warning: Cannot find annotation method 'justification()' in type 'SuppressWarnings'
94949 [javac] /home/zly/prj/oss/pig/build/ivy/lib/Pig/hbase-server-1.2.4.jar(org/apache/hadoop/hbase/mapreduce/TableInputFormat.class): warning: Cannot find annotation method 'value()' in type 'SuppressWarnings'
94950 [javac] /home/zly/prj/oss/pig/build/ivy/lib/Pig/hbase-server-1.2.4.jar(org/apache/hadoop/hbase/mapreduce/TableInputFormat.class): warning: Cannot find annotation method 'justification()' in type 'SuppressWarnings'
94951 [javac] /home/zly/prj/oss/pig/src/org/apache/pig/backend/hadoop/executionengine/spark/SparkShim2.java:49: error:  is not abstract and does not override abstract method call(T) in FlatMapFunction
94952 [javac] return new FlatMapFunction() {
94953 [javac]^

{noformat}

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-06-14 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048906#comment-16048906
 ] 

Nandor Kollar commented on PIG-5157:


[~kellyzly] updated the patch on RB, can you apply it now?

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-06-13 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047531#comment-16047531
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: left some comments on the review board.
Can you update the patch with the latest code?
Latest code:
{noformat}
* 5c55102 - (origin/trunk, origin/HEAD) PIG-4700: Enable progress reporting for 
Tasks in Tez (satishsaley via rohini) (7 days ago) 
{noformat}
I downloaded the patch from the review board and applied it like the following:
{code}
 patch -p0
{code}

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-06-12 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047080#comment-16047080
 ] 

Nandor Kollar commented on PIG-5157:


Updated RB with Rohini's proposal (see discussion on PIG-5246 for details). 
[~szita], [~rohini], [~kellyzly], could you please review? Let me know your 
thoughts about it.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-06-01 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16032514#comment-16032514
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: I have tested it; we can remove JobLogger in spark16.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-31 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030991#comment-16030991
 ] 

Nandor Kollar commented on PIG-5157:


Ok, thanks, then I'll uncomment that and test it. As for 
{{spark.eventLog.enabled}}, it requires {{spark.eventLog.dir}} to be defined too; 
I think for Spark 2.x we don't have to care about it, since the user can set 
these if required. Though my change removed the logger completely, and it seems 
these properties are not available for Spark 1.x. My question is: do we need this 
for Spark 1.x? If so, I'm afraid this should be included in the shims too.
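For reference, enabling it on the user side would look roughly like this (the log directory below is only an illustrative assumption, not something the patch sets):
{noformat}
# e.g. in conf/pig.properties or spark-defaults.conf; the directory is just an example
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs:///tmp/spark-events
{noformat}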

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-31 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030799#comment-16030799
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]:
bq. in JobMetricsListener.java there's a huge code section commented out 
(uncomment and remove the code onTaskEnd until we fix PIG-5157). Should we 
enable that?
The reason for modifying it is that [~rohini] suggested memory is used a lot if 
we update metric info in onTaskEnd() (suppose there are thousands of tasks). In 
org.apache.pig.backend.hadoop.executionengine.spark.JobMetricsListener of 
spark21, we should use code like the following.
Notice: not fully tested, so I cannot guarantee it is right.
{code}
public void onStageCompleted(SparkListenerStageCompleted stageCompleted) {
    // Gather the metrics once per stage here; updating taskMetrics in onTaskEnd()
    // consumes a lot of memory when a job has thousands of tasks.
    int stageId = stageCompleted.stageInfo().stageId();
    int stageAttemptId = stageCompleted.stageInfo().attemptId();
    String stageIdentifier = stageId + "_" + stageAttemptId;
    Integer jobId = stageIdToJobId.get(stageId);
    if (jobId == null) {
        LOG.warn("Cannot find job id for stage[" + stageId + "].");
    } else {
        Map<String, List<TaskMetrics>> jobMetrics = allJobMetrics.get(jobId);
        if (jobMetrics == null) {
            jobMetrics = Maps.newHashMap();
            allJobMetrics.put(jobId, jobMetrics);
        }
        List<TaskMetrics> stageMetrics = jobMetrics.get(stageIdentifier);
        if (stageMetrics == null) {
            stageMetrics = Lists.newLinkedList();
            jobMetrics.put(stageIdentifier, stageMetrics);
        }
        stageMetrics.add(stageCompleted.stageInfo().taskMetrics());
    }
}

public synchronized void onTaskEnd(SparkListenerTaskEnd taskEnd) {
    // intentionally left empty: per-task metric collection is skipped, see above
}
{code}
bq. I removed JobLogger, do we need it? It seems that a property called 
'spark.eventLog.enabled' is the proper replacement for this class, should we 
use it instead? It looks like JobLogger became deprecated and was removed from 
Spark 2.
It seems we can remove JobLogger and enable {{spark.eventLog.enabled}} in spark2


> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-29 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16028651#comment-16028651
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: will review tomorrow, as I am out of the office Monday and Tuesday.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-29 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16028477#comment-16028477
 ] 

Nandor Kollar commented on PIG-5157:


[~kellyzly] could you please have a look at my patch? There are two 
questionable changes:
- in JobMetricsListener.java there's a huge code section commented out 
(uncomment and remove the code onTaskEnd until we fix PIG-5157). Should we 
enable that?
- I didn't find a proper replacement for JobLogger, hence it is removed. What 
was it used for? It looks like it became deprecated and was removed from Spark.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.17.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-26 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16026932#comment-16026932
 ] 

Rohini Palaniswamy commented on PIG-5157:
-

+1. [~szita], could you take care of committing this as well after the spark 
branch merge along with PIG-5207 and PIG-5194?

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
> Attachments: PIG-5157.patch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-24 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022694#comment-16022694
 ] 

Jeff Zhang commented on PIG-5157:
-

I think we can move to DataFrame for both Spark 1.6 and 2.x.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-24 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022529#comment-16022529
 ] 

Nandor Kollar commented on PIG-5157:


Thanks Jeff! Looks like you're more familiar with Spark than me. :) What do you 
think: should we keep RDDs for Spark 1.6, or should we move it to DataFrames 
too?

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-24 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022499#comment-16022499
 ] 

Jeff Zhang commented on PIG-5157:
-

bq.  I think (and correct me if I'm wrong) we don't have to change physical and 
logical plan, but we've to modify how the plan is mapped to Spark: modify the 
converters from RDD converter to DataSet converter.
That's correct.

bq. we should try to migrate to DataSet API only for spark 2.1. As far as I 
know Spark 1.6 has DataFrames API, but since it was experimental that time, I 
think we shouldn't change that, RDDs are fine for Spark 1.6
The DataFrame API is not experimental for Spark 1.6; it is pretty stable in 1.6. I 
guess you mean the DataSet API instead of the DataFrame API. In Spark 2.x, DataFrame 
is just an alias of DataSet[Row]. I think Pig doesn't need DataSet; it only needs 
DataFrame. DataSet is for strong typing such as Java beans, but it seems Pig only 
uses Tuple, so Pig doesn't need the features of DataSet; DataFrame is sufficient 
for Pig.
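As a rough illustration of that point (a minimal, self-contained sketch with made-up example data, not Pig code): in the Spark 2.x Java API a DataFrame is simply Dataset<Row>, and a Pig-style tuple maps naturally onto a Row plus an explicit schema, with no Java bean class required.
{code}
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class TupleAsRowSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("tuple-as-row").master("local").getOrCreate();

        // A Pig tuple is an ordered list of fields; a Row with a schema plays the same role.
        StructType schema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("id", DataTypes.IntegerType, true),
                DataTypes.createStructField("name", DataTypes.StringType, true)));

        List<Row> rows = Arrays.asList(
                RowFactory.create(1, "alice"),
                RowFactory.create(2, "bob"));

        // In Spark 2.x a "DataFrame" is just Dataset<Row>.
        Dataset<Row> df = spark.createDataFrame(rows, schema);
        df.show();
        spark.stop();
    }
}
{code}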

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-24 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022491#comment-16022491
 ] 

Nandor Kollar commented on PIG-5157:


[~kellyzly], [~jeffzhang] I think (and correct me if I'm wrong) we don't have 
to change physical and logical plan, but we've to modify how the plan is mapped 
to Spark: modify the converters from RDD converter to DataSet converter.
I'd recommend splitting this into two tasks. The first is upgrading to Spark 2.1 
while still being able to compile with Spark 1.6. I'm close to finishing this; 
there were a few API changes, and I'll attach the patch soon for comments. Once 
that is done, we should try to migrate to the DataSet API for Spark 2.1 only. As 
far as I know, Spark 1.6 has a DataFrames API, but since it was experimental at 
that time, I think we shouldn't change that; RDDs are fine for Spark 1.6. Any 
thoughts?
[~pallavi.rao] I saw you investigated DataFrames API for PoS before, but didn't 
find it suitable. What was the issue with it?

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-24 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022455#comment-16022455
 ] 

Jeff Zhang commented on PIG-5157:
-

I think Pig would still have its LogicalPlan & PhysicalPlan for the Spark engine. 
But there's no difference between Spark's LogicalPlan & PhysicalPlan from Pig's 
point of view, because that is delegated to Spark's DataFrame.



> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-24 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022446#comment-16022446
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]:
bq. the optimizations offered (project Tungsten and Catalyst optimizer) looks 
promising
If we use the Catalyst optimizer, do we still need 
{{org.apache.pig.newplan.logical.relational.LogicalPlan}} and {{org.apache.pig.backend.hadoop.executionengine.physicalLayer.plans.PhysicalPlan}}?
The {{Catalyst optimizer}} optimizes the Spark plan generated by Spark SQL.
bq. however it seems that it is build around Java beans
I guess the DataSet/DataFrame API provides row-based operations; see the 
[patch|https://issues.apache.org/jira/secure/attachment/12847623/PIG-5080-1.patch]
 of PIG-5080:
 {code}
  SparkContext context = SparkContext.getOrCreate();
SQLContext sqlContext = SQLContext.getOrCreate(context);
DataFrame df = sqlContext.table("complex_data");
Row[] rows = df.collect();
assertEquals(10, rows.length);
for (int i = 0; i < rows.length; i++) {
  assertEquals(i, rows[i].getJavaMap(0).get("key_" + i));
}
{code}

[~zjffdu]: we would appreciate it if you could give us your suggestion, as you 
are more familiar with Spark.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-23 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022041#comment-16022041
 ] 

Jeff Zhang commented on PIG-5157:
-

[~nkollar] [~kellyzly] IMO we should use DataFrame (aka DataSet[Row]), which 
matches Pig's Tuple perfectly.



> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-23 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021051#comment-16021051
 ] 

Nandor Kollar commented on PIG-5157:


[~kellyzly] what do you think, should we try Spark's DataFrames or DataSets API? I 
read a couple of blog posts, and the optimizations offered (project Tungsten 
and the Catalyst optimizer) look promising; however, it seems that it is built 
around Java beans, and I'm not sure whether this fits well into our generic 
Tuple-based data model.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: 0.18.0
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-21 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16019070#comment-16019070
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~zjffdu] and [~rohini]: thanks for your suggestion.
[~zjffdu]: 
bq.Supporting to spark2 could be done in the next release, maybe also changing 
from the rdd api to dataframe api in the next release.
yes, we will definitely not support spark2 in the first release.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-19 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017893#comment-16017893
 ] 

Jeff Zhang commented on PIG-5157:
-

A lot of users are still using Spark 1.x, as Spark 2 is incompatible with Spark 
1.x, and I don't think Spark 1.x will be dropped in a short time. So I think we 
should still support Spark 1.x. Actually, I would suggest using Spark 1.x as the 
only supported version of Pig on Spark: Pig on Spark is already behind schedule, 
and lots of people are looking forward to it. Adding support for Spark 2 would 
take more time and effort, and may bring in some issues, so I would suggest 
supporting only Spark 1.x in the first release of Pig on Spark. For users it is 
transparent, and it is easy to upgrade from Spark 1 to Spark 2.

Supporting Spark 2 could be done in the next release, maybe also changing 
from the RDD API to the DataFrame API in the next release.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-19 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017712#comment-16017712
 ] 

Rohini Palaniswamy commented on PIG-5157:
-

I am fine with supporting Spark 2.x or supporting both versions.

This depends on two things.
1) How well Spark 2 is adopted and how many distributions or users are 
still on Spark 1.x
2) When the Spark community is planning to deprecate or EOL Spark 1.x

[~szita] and [~nkollar] might be able to make a better call based on their 
users. 

[~zjffdu], 
   Do you know when support for Spark 1.x will be dropped by the Spark 
community?

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-19 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017009#comment-16017009
 ] 

Nandor Kollar commented on PIG-5157:


We can use reflection, or we can also use shims instead.
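For example, a minimal sketch of the shim idea (the interface and class names are invented for illustration, not the actual Pig shims): a thin version-specific layer hides the API differences, and the matching implementation is picked at runtime, here by probing for a class that only exists in Spark 2.x.
{code}
public interface SparkShims {
    // version-sensitive operations would be declared here
    String sparkMajorVersion();
}

class Spark1Shims implements SparkShims {
    @Override
    public String sparkMajorVersion() { return "1"; }
}

class Spark2Shims implements SparkShims {
    @Override
    public String sparkMajorVersion() { return "2"; }
}

final class SparkShimsFactory {
    private SparkShimsFactory() {}

    static SparkShims create() {
        try {
            // SparkListenerInterface exists only from Spark 2.x on
            Class.forName("org.apache.spark.scheduler.SparkListenerInterface");
            return new Spark2Shims();
        } catch (ClassNotFoundException e) {
            return new Spark1Shims();
        }
    }
}
{code}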

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-05-19 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16017007#comment-16017007
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~rohini], [~xuefuz], [~zjffdu]: Should we support Spark 2 only, or both 
Spark 1.6 and Spark 2? We may need to use reflection to support both versions 
(still under investigation). Please give us your opinion; in my view, we 
shouldn't support Spark 1.6 if we upgrade to Spark 2.0.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-03-24 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940012#comment-15940012
 ] 

liyunzhang_intel commented on PIG-5157:
---

After discussion with [~nkollar], [~szita] and [~kexianda], we don't plan to 
support Spark 2.0 before the first release in April.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-03-24 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939967#comment-15939967
 ] 

Jeff Zhang commented on PIG-5157:
-

Yeah, that's what I mean. So for this ticket, we need to run tests for two Spark 
versions.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-03-24 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939939#comment-15939939
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~zjffdu]: it is better to let the user choose the Spark version when building Pig.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-03-23 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15939467#comment-15939467
 ] 

Jeff Zhang commented on PIG-5157:
-

BTW, Spark 2.1.1 will be released soon.

Another thing I want to bring up: does this ticket mean Spark 1.6 is not 
supported, or could the user choose the Spark version when building Pig?



> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-03-02 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892269#comment-15892269
 ] 

Nandor Kollar commented on PIG-5157:


Thanks Jeff!

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
> Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-02-28 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889267#comment-15889267
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~zjffdu]: thanks your suggestion.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
> Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-02-28 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15888079#comment-15888079
 ] 

Jeff Zhang commented on PIG-5157:
-

I would suggest using Spark 2.0.2 or Spark 2.1.0, which are much more stable.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
> Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-02-28 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887655#comment-15887655
 ] 

Nandor Kollar commented on PIG-5157:


I don't know which would make more sense, but my guess is that upgrading to the 
latest one is better unless we have a good reason not to.

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
> Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-02-27 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15886570#comment-15886570
 ] 

liyunzhang_intel commented on PIG-5157:
---

[~nkollar]: it is sensible to upgrade to the latest Spark version. Is 2.0 or 2.1 
more sensible, or is either OK?

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
> Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5157) Upgrade to Spark 2.0

2017-02-27 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15885596#comment-15885596
 ] 

Nandor Kollar commented on PIG-5157:


Once PIG-5132 is completed we should upgrade to the latest Spark version. 
[~kellyzly] what do you think?

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
> Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)