pyspark.ml Pipeline stages are corrupted under multi-threaded access - is this a bug?

2017-01-24 Thread Vinayak Joshi5
Hi, the code we're executing constructs pyspark.ml.Pipeline objects concurrently in separate Python threads. We observe that the stages fed to the Pipeline object get corrupted, i.e., the stages supplied to a Pipeline object in one thread appear inside a different Pipeline object constructed in

Re: Spark 2.x Pyspark Spark SQL createDataframe Error

2016-12-01 Thread Vinayak Joshi5
) at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source) Regards, Vinayak Joshi From: Vinayak Joshi5/India/IBM@IBMIN To: "user.spark" <user@spark.apache.org> Date: 01/12/2016 10:53 PM Subject: Spark 2.x Pyspark Spark SQL createDataframe Error Wi

Spark 2.x Pyspark Spark SQL createDataframe Error

2016-12-01 Thread Vinayak Joshi5
With a local Spark instance built with Hive support (-Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver), the following script/sequence works in PySpark without any error against 1.6.x, but fails with 2.x. people = sc.parallelize(["Michael,30", "Andy,12", "Justin,19"])

Re: Spark 2.x Pyspark Spark SQL createDataframe Error

2016-12-02 Thread Vinayak Joshi5
Thanks, Michal. I have submitted a Spark issue and PR based on my understanding of why this changed in Spark 2.0. If interested, you can follow it at https://issues.apache.org/jira/browse/SPARK-18687 Regards, Vinayak. From: Michal Šenkýř <bina...@gmail.com> To: Vinayak Joshi5/Ind