[jira] [Commented] (SPARK-17969) I think it's user unfriendly to process standard json file with DataFrame

2016-10-16 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581273#comment-15581273
 ] 

Reynold Xin commented on SPARK-17969:
-

+1

It would be good to have a mode in which each file is a single JSON object.
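
One possible shape for such a mode, sketched here with a hypothetical reader option name (nothing below exists in Spark 2.0; the option name is an assumption):

{code}
// Hypothetical API sketch only: "wholeFile" is an assumed option name for the
// proposed mode, not an existing Spark 2.0 option.
val df = spark.read
  .option("wholeFile", "true")   // each input file would be parsed as a single JSON record
  .json("data/test.json")
{code}

Until something like that exists, the wholeTextFiles workaround quoted in the issue description below achieves the same effect.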


> I think it's user unfriendly to process standard json file with DataFrame 
> --
>
> Key: SPARK-17969
> URL: https://issues.apache.org/jira/browse/SPARK-17969
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Jianfei Wang
>Priority: Minor
>
> Currently, with the DataFrame API, we can't load a standard (pretty-printed, 
> multi-line) JSON file directly. Maybe we can provide an overloaded method to 
> handle this; the logic would be as below:
> ```
> val files = spark.sparkContext.wholeTextFiles("data/test.json")
> // wholeTextFiles yields (path, content) pairs; drop the path and collapse the
> // multi-line JSON onto one line (this also strips whitespace inside string values)
> val json_rdd = files.map { case (_, content) => content.replaceAll("\\s+", "") }
> val json_df = spark.read.json(json_rdd)
> ```






[jira] [Updated] (SPARK-17969) I think it's user unfriendly to process standard json file with DataFrame

2016-10-16 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17969:

Assignee: (was: Reynold Xin)

> I think it's user unfriendly to process standard json file with DataFrame 
> --
>
> Key: SPARK-17969
> URL: https://issues.apache.org/jira/browse/SPARK-17969
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Jianfei Wang
>Priority: Minor
>
> Currently, with the DataFrame API, we can't load a standard (pretty-printed, 
> multi-line) JSON file directly. Maybe we can provide an overloaded method to 
> handle this; the logic would be as below:
> ```
> val files = spark.sparkContext.wholeTextFiles("data/test.json")
> // wholeTextFiles yields (path, content) pairs; drop the path and collapse the
> // multi-line JSON onto one line (this also strips whitespace inside string values)
> val json_rdd = files.map { case (_, content) => content.replaceAll("\\s+", "") }
> val json_df = spark.read.json(json_rdd)
> ```






[jira] [Assigned] (SPARK-17969) I think it's user unfriendly to process standard json file with DataFrame

2016-10-16 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-17969:
---

Assignee: Reynold Xin

> I think it's user unfriendly to process standard json file with DataFrame 
> --
>
> Key: SPARK-17969
> URL: https://issues.apache.org/jira/browse/SPARK-17969
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Jianfei Wang
>Assignee: Reynold Xin
>Priority: Minor
>
> Currently, with the DataFrame API, we can't load a standard (pretty-printed, 
> multi-line) JSON file directly. Maybe we can provide an overloaded method to 
> handle this; the logic would be as below:
> ```
> val files = spark.sparkContext.wholeTextFiles("data/test.json")
> // wholeTextFiles yields (path, content) pairs; drop the path and collapse the
> // multi-line JSON onto one line (this also strips whitespace inside string values)
> val json_rdd = files.map { case (_, content) => content.replaceAll("\\s+", "") }
> val json_df = spark.read.json(json_rdd)
> ```






[jira] [Comment Edited] (SPARK-11524) Support SparkR with Mesos cluster

2016-10-16 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581268#comment-15581268
 ] 

Sun Rui edited comment on SPARK-11524 at 10/17/16 5:31 AM:
---

For cluster mode, the R script needs to be transferred to the slave node chosen 
to run the Spark driver. For YARN, this is done by adding it to the local 
resources for the application. For Mesos, I am not sure how to do it.


was (Author: sunrui):
For cluster mode, the R script needs to be transferred to the slave node chosen 
to run the Spark driver. For YARN, this is done by adding it to the local 
resources for the application. For Mesos, I think the file server mechanism can 
be utilized.

> Support SparkR with Mesos cluster
> -
>
> Key: SPARK-11524
> URL: https://issues.apache.org/jira/browse/SPARK-11524
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Sun Rui
>







[jira] [Commented] (SPARK-11524) Support SparkR with Mesos cluster

2016-10-16 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581268#comment-15581268
 ] 

Sun Rui commented on SPARK-11524:
-

For cluster mode, the R script needs to be transferred to the slave node chosen 
to run the Spark driver. For YARN, this is done by adding it to the local 
resources for the application. For Mesos, I think the file server mechanism can 
be utilized.

> Support SparkR with Mesos cluster
> -
>
> Key: SPARK-11524
> URL: https://issues.apache.org/jira/browse/SPARK-11524
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Sun Rui
>







[jira] [Commented] (SPARK-10590) Spark with YARN build is broken

2016-10-16 Thread Nirman Narang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581244#comment-15581244
 ] 

Nirman Narang commented on SPARK-10590:
---

Hi Sean, my FS is not encrypted.
I upgraded Maven from 3.0.5 to 3.3.9, and the build succeeds now.
Still no pointers on whether it was a Maven dependency issue. I will try the 
same on a different machine with the same environment to confirm.

> Spark with YARN build is broken
> ---
>
> Key: SPARK-10590
> URL: https://issues.apache.org/jira/browse/SPARK-10590
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
> Environment: CentOS 6.5
> Oracle JDK 1.7.0_75
> Maven 3.3.3
> Hadoop 2.6.0
> Spark 1.5.0
>Reporter: Kevin Tsai
>
> Hi, after upgrading to v1.5.0 and trying to build it, it shows:
> [ERROR] missing or invalid dependency detected while loading class file 
> 'WebUI.class'
> It was working on Spark 1.4.1.
> Build command: mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive 
> -Phive-thriftserver -Dscala-2.11 -DskipTests clean package
> Hope it helps.
> Regards,
> Kevin






[jira] [Updated] (SPARK-17819) Specified database in JDBC URL is ignored when connecting to thriftserver

2016-10-16 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17819:

Fix Version/s: 2.0.2

> Specified database in JDBC URL is ignored when connecting to thriftserver
> -
>
> Key: SPARK-17819
> URL: https://issues.apache.org/jira/browse/SPARK-17819
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0, 2.0.1
>Reporter: Todd Nemet
>Assignee: Dongjoon Hyun
> Fix For: 2.0.2, 2.1.0
>
>
> Filing this based on an email thread with Reynold Xin. From the 
> [docs|http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server],
>  the JDBC connection URL to the thriftserver looks like:
> {code}
> beeline> !connect 
> jdbc:hive2://<host>:<port>/<db>?hive.server2.transport.mode=http;hive.server2.thrift.http.path=<http_endpoint>
> {code}
> However, anything specified in <db> results in being put in the default 
> schema. I'm running these with -e commands, but the shell shows the same 
> behavior.
> In 2.0.1, I created a table foo in schema spark_jira:
> {code}
> [558|18:01:20] ~/Documents/spark/spark$ bin/beeline -u 
> jdbc:hive2://localhost:10006/spark_jira -n hive -e "show tables"
> Connecting to jdbc:hive2://localhost:10006/spark_jira
> 16/10/06 18:01:28 INFO jdbc.Utils: Supplied authorities: localhost:10006
> 16/10/06 18:01:28 INFO jdbc.Utils: Resolved authority: localhost:10006
> 16/10/06 18:01:28 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10006/spark_jira
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> +------------+--------------+--+
> | tableName  | isTemporary  |
> +------------+--------------+--+
> +------------+--------------+--+
> No rows selected (0.558 seconds)
> Beeline version 1.2.1.spark2 by Apache Hive
> Closing: 0: jdbc:hive2://localhost:10006/spark_jira
> [559|18:01:30] ~/Documents/spark/spark$ bin/beeline -u 
> jdbc:hive2://localhost:10006/spark_jira -n hive -e "show tables in spark_jira"
> Connecting to jdbc:hive2://localhost:10006/spark_jira
> 16/10/06 18:01:34 INFO jdbc.Utils: Supplied authorities: localhost:10006
> 16/10/06 18:01:34 INFO jdbc.Utils: Resolved authority: localhost:10006
> 16/10/06 18:01:34 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10006/spark_jira
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> +------------+--------------+--+
> | tableName  | isTemporary  |
> +------------+--------------+--+
> | foo        | false        |
> +------------+--------------+--+
> 1 row selected (0.664 seconds)
> Beeline version 1.2.1.spark2 by Apache Hive
> Closing: 0: jdbc:hive2://localhost:10006/spark_jira
> {code}
> I also see this in Spark 1.6.2:
> {code}
> [555|18:13:32] ~/Documents/spark/spark16$ bin/beeline -u 
> jdbc:hive2://localhost:10005/spark_jira -n hive -e "show tables"
> Connecting to jdbc:hive2://localhost:10005/spark_jira
> 16/10/06 18:13:37 INFO jdbc.Utils: Supplied authorities: localhost:10005
> 16/10/06 18:13:37 INFO jdbc.Utils: Resolved authority: localhost:10005
> 16/10/06 18:13:37 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10005/spark_jira
> Connected to: Spark SQL (version 1.6.2)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> +--------------+--------------+--+
> |  tableName   | isTemporary  |
> +--------------+--------------+--+
> | all_types    | false        |
> | order_items  | false        |
> | orders       | false        |
> | users        | false        |
> +--------------+--------------+--+
> 4 rows selected (0.653 seconds)
> Beeline version 1.2.1.spark2 by Apache Hive
> Closing: 0: jdbc:hive2://localhost:10005/spark_jira
> [556|18:13:39] ~/Documents/spark/spark16$ bin/beeline -u 
> jdbc:hive2://localhost:10005/spark_jira -n hive -e "show tables in spark_jira"
> Connecting to jdbc:hive2://localhost:10005/spark_jira
> 16/10/06 18:13:45 INFO jdbc.Utils: Supplied authorities: localhost:10005
> 16/10/06 18:13:45 INFO jdbc.Utils: Resolved authority: localhost:10005
> 16/10/06 18:13:45 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10005/spark_jira
> Connected to: Spark SQL (version 1.6.2)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> +------------+--------------+--+
> | tableName  | isTemporary  |
> +------------+--------------+--+
> | foo        | false        |
> +------------+--------------+--+
> 1 row selected (0.633 seconds)
> Beeline 

[jira] [Comment Edited] (SPARK-17954) FetchFailedException executor cannot connect to another worker executor

2016-10-16 Thread Vitaly Gerasimov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581147#comment-15581147
 ] 

Vitaly Gerasimov edited comment on SPARK-17954 at 10/17/16 4:50 AM:


I figured out this issue. The problem is that the Spark executor port is 
listening on localhost:
{code}
~# netstat -ntlp
tcp6   0  0 127.0.0.1:46721   :::*   LISTEN   11294/java
{code}

Are there any configuration changes that make the executor listen only on 
localhost? When I run Spark 1.6.2, the executor listens on all interfaces. I 
know it may depend on my /etc/hosts file, but this change seems unexpected to me.


was (Author: v-gerasimov):
I figured out this issue. The problem is that the Spark executor port is 
listening on localhost:
{code}
~# netstat -ntlp
tcp6   0  0 127.0.0.1:46721   :::*   LISTEN   11294/java
{code}

Are there any configuration changes that make the executor listen only on 
localhost? When I run Spark 1.6.2, the executor listens on all interfaces.

> FetchFailedException executor cannot connect to another worker executor
> ---
>
> Key: SPARK-17954
> URL: https://issues.apache.org/jira/browse/SPARK-17954
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Vitaly Gerasimov
>
> I have a standalone-mode Spark cluster which has three nodes:
> master.test
> worker1.test
> worker2.test
> I am trying to run the following code in the spark shell:
> {code}
> val json = spark.read.json("hdfs://master.test/json/a.js.gz", 
> "hdfs://master.test/json/b.js.gz")
> json.createOrReplaceTempView("messages")
> spark.sql("select count(*) from messages").show()
> {code}
> and I am getting the following exception:
> {code}
> org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
> worker1.test/x.x.x.x:51029
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to 
> worker1.test/x.x.x.x:51029
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96)
>   at 
> 

[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python

2016-10-16 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581167#comment-15581167
 ] 

Reynold Xin commented on SPARK-10915:
-

It is indeed very complicated to implement UDAFs in Python. That's why we have 
been asking.

One thing we can think about is allowing UDAFs to be defined using existing SQL 
expressions; that might be easier.
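
As a rough illustration of that idea (a sketch only, in Scala since no Python UDAF API exists; {{df}}, {{key}} and {{value}} are assumed example names), an aggregate like the inter-quartile range can already be expressed with existing SQL expressions instead of a hand-written UDAF:

{code}
import org.apache.spark.sql.functions.expr

// IQR built from the existing percentile_approx expression; no custom UDAF needed.
// "key" and "value" are assumed example columns of DataFrame df.
val iqr = expr("percentile_approx(value, 0.75) - percentile_approx(value, 0.25)")
val result = df.groupBy("key").agg(iqr.as("iqr"))
{code}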


> Add support for UDAFs in Python
> ---
>
> Key: SPARK-10915
> URL: https://issues.apache.org/jira/browse/SPARK-10915
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Justin Uang
>
> This should support python defined lambdas.






[jira] [Comment Edited] (SPARK-17954) FetchFailedException executor cannot connect to another worker executor

2016-10-16 Thread Vitaly Gerasimov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581147#comment-15581147
 ] 

Vitaly Gerasimov edited comment on SPARK-17954 at 10/17/16 4:12 AM:


I figured out this issue. The problem is that the Spark executor port is 
listening on localhost:
{code}
~# netstat -ntlp
tcp6   0  0 127.0.0.1:46721   :::*   LISTEN   11294/java
{code}

Are there any configuration changes that make the executor listen only on 
localhost? When I run Spark 1.6.2, the executor listens on all interfaces.


was (Author: v-gerasimov):
I figured out this issue. The problem is that the Spark executor port is 
listening on localhost:
{conf}
~# netstat -ntlp
tcp6   0  0 127.0.0.1:46721   :::*   LISTEN   11294/java
{conf}

Are there any configuration changes that make the executor listen only on 
localhost? When I run Spark 1.6.2, the executor listens on all interfaces.

> FetchFailedException executor cannot connect to another worker executor
> ---
>
> Key: SPARK-17954
> URL: https://issues.apache.org/jira/browse/SPARK-17954
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Vitaly Gerasimov
>
> I have a standalone-mode Spark cluster which has three nodes:
> master.test
> worker1.test
> worker2.test
> I am trying to run the following code in the spark shell:
> {code}
> val json = spark.read.json("hdfs://master.test/json/a.js.gz", 
> "hdfs://master.test/json/b.js.gz")
> json.createOrReplaceTempView("messages")
> spark.sql("select count(*) from messages").show()
> {code}
> and I am getting the following exception:
> {code}
> org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
> worker1.test/x.x.x.x:51029
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to 
> worker1.test/x.x.x.x:51029
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> 

[jira] [Commented] (SPARK-17954) FetchFailedException executor cannot connect to another worker executor

2016-10-16 Thread Vitaly Gerasimov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581147#comment-15581147
 ] 

Vitaly Gerasimov commented on SPARK-17954:
--

I figured out this issue. The problem is that the Spark executor port is 
listening on localhost:
{conf}
~# netstat -ntlp
tcp6   0  0 127.0.0.1:46721   :::*   LISTEN   11294/java
{conf}

Are there any configuration changes that make the executor listen only on 
localhost? When I run Spark 1.6.2, the executor listens on all interfaces.

> FetchFailedException executor cannot connect to another worker executor
> ---
>
> Key: SPARK-17954
> URL: https://issues.apache.org/jira/browse/SPARK-17954
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Vitaly Gerasimov
>
> I have a standalone-mode Spark cluster which has three nodes:
> master.test
> worker1.test
> worker2.test
> I am trying to run the following code in the spark shell:
> {code}
> val json = spark.read.json("hdfs://master.test/json/a.js.gz", 
> "hdfs://master.test/json/b.js.gz")
> json.createOrReplaceTempView("messages")
> spark.sql("select count(*) from messages").show()
> {code}
> and I am getting the following exception:
> {code}
> org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
> worker1.test/x.x.x.x:51029
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to 
> worker1.test/x.x.x.x:51029
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   ... 3 more
> Caused by: java.net.ConnectException: Connection refused: 
> 

[jira] [Resolved] (SPARK-17947) Document the impact of `spark.sql.debug`

2016-10-16 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17947.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15494
[https://github.com/apache/spark/pull/15494]

> Document the impact of `spark.sql.debug`
> 
>
> Key: SPARK-17947
> URL: https://issues.apache.org/jira/browse/SPARK-17947
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
> Fix For: 2.1.0
>
>
> Just document the impact of `spark.sql.debug`:
> when the debug flag is enabled, Spark SQL internal table properties are not 
> filtered out; however, some related DDL commands (e.g., ANALYZE TABLE and 
> CREATE TABLE LIKE) might not work properly.
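
A minimal sketch of how the flag would be used, with assumptions: {{spark.sql.debug}} appears to be an internal, static configuration, so it is supplied here when the session is created rather than toggled afterwards, and {{some_table}} is a placeholder name:

{code}
import org.apache.spark.sql.SparkSession

// Assumption: the flag must be set before the session exists, e.g. here or via
// spark-submit --conf spark.sql.debug=true
val spark = SparkSession.builder()
  .config("spark.sql.debug", "true")
  .getOrCreate()

// With debug on, internal table properties are no longer filtered out of
// metadata output such as DESC EXTENDED.
spark.sql("DESC EXTENDED some_table").show(truncate = false)
{code}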






[jira] [Updated] (SPARK-17947) Document the impact of `spark.sql.debug`

2016-10-16 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17947:

Assignee: Xiao Li

> Document the impact of `spark.sql.debug`
> 
>
> Key: SPARK-17947
> URL: https://issues.apache.org/jira/browse/SPARK-17947
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.0
>
>
> Just document the impact of `spark.sql.debug`:
> when the debug flag is enabled, Spark SQL internal table properties are not 
> filtered out; however, some related DDL commands (e.g., ANALYZE TABLE and 
> CREATE TABLE LIKE) might not work properly.






[jira] [Commented] (SPARK-17819) Specified database in JDBC URL is ignored when connecting to thriftserver

2016-10-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581130#comment-15581130
 ] 

Apache Spark commented on SPARK-17819:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/15507

> Specified database in JDBC URL is ignored when connecting to thriftserver
> -
>
> Key: SPARK-17819
> URL: https://issues.apache.org/jira/browse/SPARK-17819
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0, 2.0.1
>Reporter: Todd Nemet
>Assignee: Dongjoon Hyun
> Fix For: 2.1.0
>
>
> Filing this based on an email thread with Reynold Xin. From the 
> [docs|http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server],
>  the JDBC connection URL to the thriftserver looks like:
> {code}
> beeline> !connect 
> jdbc:hive2://<host>:<port>/<db>?hive.server2.transport.mode=http;hive.server2.thrift.http.path=<http_endpoint>
> {code}
> However, anything specified in <db> results in being put in the default 
> schema. I'm running these with -e commands, but the shell shows the same 
> behavior.
> In 2.0.1, I created a table foo in schema spark_jira:
> {code}
> [558|18:01:20] ~/Documents/spark/spark$ bin/beeline -u 
> jdbc:hive2://localhost:10006/spark_jira -n hive -e "show tables"
> Connecting to jdbc:hive2://localhost:10006/spark_jira
> 16/10/06 18:01:28 INFO jdbc.Utils: Supplied authorities: localhost:10006
> 16/10/06 18:01:28 INFO jdbc.Utils: Resolved authority: localhost:10006
> 16/10/06 18:01:28 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10006/spark_jira
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> +------------+--------------+--+
> | tableName  | isTemporary  |
> +------------+--------------+--+
> +------------+--------------+--+
> No rows selected (0.558 seconds)
> Beeline version 1.2.1.spark2 by Apache Hive
> Closing: 0: jdbc:hive2://localhost:10006/spark_jira
> [559|18:01:30] ~/Documents/spark/spark$ bin/beeline -u 
> jdbc:hive2://localhost:10006/spark_jira -n hive -e "show tables in spark_jira"
> Connecting to jdbc:hive2://localhost:10006/spark_jira
> 16/10/06 18:01:34 INFO jdbc.Utils: Supplied authorities: localhost:10006
> 16/10/06 18:01:34 INFO jdbc.Utils: Resolved authority: localhost:10006
> 16/10/06 18:01:34 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10006/spark_jira
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> +------------+--------------+--+
> | tableName  | isTemporary  |
> +------------+--------------+--+
> | foo        | false        |
> +------------+--------------+--+
> 1 row selected (0.664 seconds)
> Beeline version 1.2.1.spark2 by Apache Hive
> Closing: 0: jdbc:hive2://localhost:10006/spark_jira
> {code}
> I also see this in Spark 1.6.2:
> {code}
> [555|18:13:32] ~/Documents/spark/spark16$ bin/beeline -u 
> jdbc:hive2://localhost:10005/spark_jira -n hive -e "show tables"
> Connecting to jdbc:hive2://localhost:10005/spark_jira
> 16/10/06 18:13:37 INFO jdbc.Utils: Supplied authorities: localhost:10005
> 16/10/06 18:13:37 INFO jdbc.Utils: Resolved authority: localhost:10005
> 16/10/06 18:13:37 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10005/spark_jira
> Connected to: Spark SQL (version 1.6.2)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> +--------------+--------------+--+
> |  tableName   | isTemporary  |
> +--------------+--------------+--+
> | all_types    | false        |
> | order_items  | false        |
> | orders       | false        |
> | users        | false        |
> +--------------+--------------+--+
> 4 rows selected (0.653 seconds)
> Beeline version 1.2.1.spark2 by Apache Hive
> Closing: 0: jdbc:hive2://localhost:10005/spark_jira
> [556|18:13:39] ~/Documents/spark/spark16$ bin/beeline -u 
> jdbc:hive2://localhost:10005/spark_jira -n hive -e "show tables in spark_jira"
> Connecting to jdbc:hive2://localhost:10005/spark_jira
> 16/10/06 18:13:45 INFO jdbc.Utils: Supplied authorities: localhost:10005
> 16/10/06 18:13:45 INFO jdbc.Utils: Resolved authority: localhost:10005
> 16/10/06 18:13:45 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10005/spark_jira
> Connected to: Spark SQL (version 1.6.2)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> +------------+--------------+--+
> | tableName  | isTemporary  |
> 

[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python

2016-10-16 Thread Tobi Bosede (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581127#comment-15581127
 ] 

Tobi Bosede commented on SPARK-10915:
-

Is it complicated to implement a UDAF in Python? If you read the email thread 
at the link, you will see people are interested in using UDAFs in SQL 
statements as well as through the Spark SQL methods. In my case I wanted to do 
a pivot and have a UDAF for the inter-quartile range (IQR) applied. Do you 
think df.rdd.reduceByKey would make sense here?
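
For reference, a sketch of what such an RDD-level fallback could look like (assumed column names {{key}} and {{value}}; exact quantiles need all values per key, so this uses groupByKey rather than a plain reduceByKey, since quantiles are not associative):

{code}
// Sketch only: exact per-key IQR on the underlying RDD.
val iqrByKey = df.rdd
  .map(row => (row.getAs[String]("key"), row.getAs[Double]("value")))
  .groupByKey()
  .mapValues { values =>
    val sorted = values.toArray.sorted
    // nearest-rank percentile, adequate for a sketch
    def pct(p: Double): Double = sorted(((sorted.length - 1) * p).toInt)
    pct(0.75) - pct(0.25)
  }
{code}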

> Add support for UDAFs in Python
> ---
>
> Key: SPARK-10915
> URL: https://issues.apache.org/jira/browse/SPARK-10915
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Justin Uang
>
> This should support python defined lambdas.






[jira] [Commented] (SPARK-17930) The SerializerInstance instance used when deserializing a TaskResult is not reused

2016-10-16 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581119#comment-15581119
 ] 

Guoqiang Li commented on SPARK-17930:
-


TPC-DS 2 TB data (Parquet) and the SQL (query 2) => 
{noformat}
select  i_item_id, 
avg(ss_quantity) agg1,
avg(ss_list_price) agg2,
avg(ss_coupon_amt) agg3,
avg(ss_sales_price) agg4 
 from store_sales, customer_demographics, date_dim, item, promotion
 where ss_sold_date_sk = d_date_sk and
   ss_item_sk = i_item_sk and
   ss_cdemo_sk = cd_demo_sk and
   ss_promo_sk = p_promo_sk and
   cd_gender = 'M' and 
   cd_marital_status = 'M' and
   cd_education_status = '4 yr Degree' and
   (p_channel_email = 'N' or p_channel_event = 'N') and
   d_year = 2001 
 group by i_item_id
 order by i_item_id
 limit 100;
{noformat}

spark-defaults.conf =>

{noformat}
spark.master                                      yarn-client
spark.executor.instances                          20
spark.driver.memory                               16g
spark.executor.memory                             30g
spark.executor.cores                              5
spark.default.parallelism                         100
spark.sql.shuffle.partitions                      10
spark.serializer                                  org.apache.spark.serializer.KryoSerializer
spark.driver.maxResultSize                        0
spark.rpc.netty.dispatcher.numThreads             8
spark.executor.extraJavaOptions                   -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=256M
spark.cleaner.referenceTracking.blocking          true
spark.cleaner.referenceTracking.blocking.shuffle  true
{noformat}


Performance test results are as follows => 

||[SPARK-17930|https://github.com/witgo/spark/tree/SPARK-17930]||[ed14633|https://github.com/witgo/spark/commit/ed1463341455830b8867b721a1b34f291139baf3]||
|54.5 s|231.7 s|


> The SerializerInstance instance used when deserializing a TaskResult is not 
> reused 
> ---
>
> Key: SPARK-17930
> URL: https://issues.apache.org/jira/browse/SPARK-17930
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.1
>Reporter: Guoqiang Li
>
> The following code is called when the DirectTaskResult instance is 
> deserialized
> {noformat}
>   def value(): T = {
> if (valueObjectDeserialized) {
>   valueObject
> } else {
>   // Each deserialization creates a new instance of SerializerInstance, 
> which is very time-consuming
>   val resultSer = SparkEnv.get.serializer.newInstance()
>   valueObject = resultSer.deserialize(valueBytes)
>   valueObjectDeserialized = true
>   valueObject
> }
>   }
> {noformat}
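
A minimal sketch of the kind of reuse being proposed, assuming a per-thread cached instance is acceptable (SerializerInstance is not guaranteed to be thread-safe, hence the ThreadLocal); this is an illustration, not the actual patch:

{noformat}
import org.apache.spark.SparkEnv
import org.apache.spark.serializer.SerializerInstance

// Cache one SerializerInstance per thread instead of calling newInstance()
// on every TaskResult deserialization.
object CachedSerializer {
  private val cached = new ThreadLocal[SerializerInstance] {
    override def initialValue(): SerializerInstance =
      SparkEnv.get.serializer.newInstance()
  }
  def get: SerializerInstance = cached.get()
}
{noformat}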






[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2016-10-16 Thread Low Chin Wei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581113#comment-15581113
 ] 

Low Chin Wei commented on SPARK-13747:
--

It is running on Akka with a fork-join dispatcher. There are two actors running 
concurrently, doing different Spark jobs using the same SparkSession. I can't 
give the full stack, but here is the outline:

java.lang.IllegalArgumentException: spark.sql.execution.id is already set
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:81)
 ~[spark-sql_2.11-2.0.1.jar:2.0.1]
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2546) 
~[spark-sql_2.11-2.0.1.jar:2.0.1]
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2192)
 ~[spark-sql_2.11-2.0.1.jar:2.0.1]
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2199)
 ~[spark-sql_2.11-2.0.1.jar:2.0.1]
at 
org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2227) 
~[spark-sql_2.11-2.0.1.jar:2.0.1]
at 
org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2226) 
~[spark-sql_2.11-2.0.1.jar:2.0.1]
at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2559) 
~[spark-sql_2.11-2.0.1.jar:2.0.1]
at org.apache.spark.sql.Dataset.count(Dataset.scala:2226) 
~[spark-sql_2.11-2.0.1.jar:2.0.1]

<-- Here is the code that call the df.count  -->

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) 
[akka-actor_2.11-2.4.8.jar:na]
at akka.actor.ActorCell.invoke(ActorCell.scala:495) 
[akka-actor_2.11-2.4.8.jar:na]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) 
[akka-actor_2.11-2.4.8.jar:na]
at akka.dispatch.Mailbox.run(Mailbox.scala:224) 
[akka-actor_2.11-2.4.8.jar:na]
at akka.dispatch.Mailbox.exec(Mailbox.scala:234) 
[akka-actor_2.11-2.4.8.jar:na]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
[scala-library-2.11.8.jar:na]
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 [scala-library-2.11.8.jar:na]
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
[scala-library-2.11.8.jar:na]
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 [scala-library-2.11.8.jar:na]




> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Shixiong Zhu
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> Running the following code may fail:
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global) as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, the ForkJoinPool will run another 
> task in the same thread; however, the local properties have been polluted.
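
Pending a proper fix, a minimal sketch of the usual user-side workaround is to run the concurrent actions on a dedicated fixed thread pool instead of a ForkJoinPool (pool size and names below are illustrative; {{spark}} is assumed to be the shell's SparkSession):

{code}
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// A plain fixed thread pool never reuses a blocked caller thread to run another
// queued task, so the spark.sql.execution.id local property cannot leak between
// concurrently running jobs.
implicit val sparkJobEc: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

val counts = (1 to 100).map { _ =>
  Future { spark.range(5).toDF("a").count() }
}
{code}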






[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2016-10-16 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581094#comment-15581094
 ] 

Shixiong Zhu commented on SPARK-13747:
--

Could you post the full stack trace, please? It would be helpful to know who 
calls `count`.

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Shixiong Zhu
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> Running the following code may fail:
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global) as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, the ForkJoinPool will run another 
> task in the same thread; however, the local properties have been polluted.






[jira] [Commented] (SPARK-17963) Add examples (extend) in each function and improve documentation with arguments

2016-10-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581072#comment-15581072
 ] 

Hyukjin Kwon commented on SPARK-17963:
--

Thanks. Then, I will work on this.

> Add examples (extend) in each function and improve documentation with 
> arguments
> ---
>
> Key: SPARK-17963
> URL: https://issues.apache.org/jira/browse/SPARK-17963
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> Currently, it seems the function documentation is inconsistent and mostly 
> does not have examples ({{extended}}).
> For example, some functions have a bad indentation as below:
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED approx_count_distinct;
> Function: approx_count_distinct
> Class: org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus
> Usage: approx_count_distinct(expr) - Returns the estimated cardinality by 
> HyperLogLog++.
> approx_count_distinct(expr, relativeSD=0.05) - Returns the estimated 
> cardinality by HyperLogLog++
>   with relativeSD, the maximum estimation error allowed.
> Extended Usage:
> No example for approx_count_distinct.
> {code}
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED count;
> Function: count
> Class: org.apache.spark.sql.catalyst.expressions.aggregate.Count
> Usage: count(*) - Returns the total number of retrieved rows, including rows 
> containing NULL values.
> count(expr) - Returns the number of rows for which the supplied 
> expression is non-NULL.
> count(DISTINCT expr[, expr...]) - Returns the number of rows for which 
> the supplied expression(s) are unique and non-NULL.
> Extended Usage:
> No example for count.
> {code}
> whereas some do have a pretty one
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED percentile_approx;
> Function: percentile_approx
> Class: 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
> Usage:
>   percentile_approx(col, percentage [, accuracy]) - Returns the 
> approximate percentile value of numeric
>   column `col` at the given percentage. The value of percentage must be 
> between 0.0
>   and 1.0. The `accuracy` parameter (default: 1) is a positive 
> integer literal which
>   controls approximation accuracy at the cost of memory. Higher value of 
> `accuracy` yields
>   better accuracy, `1.0/accuracy` is the relative error of the 
> approximation.
>   percentile_approx(col, array(percentage1 [, percentage2]...) [, 
> accuracy]) - Returns the approximate
>   percentile array of column `col` at the given percentage array. Each 
> value of the
>   percentage array must be between 0.0 and 1.0. The `accuracy` parameter 
> (default: 1) is
>a positive integer literal which controls approximation accuracy at 
> the cost of memory.
>Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is 
> the relative error of
>the approximation.
> Extended Usage:
> No example for percentile_approx.
> {code}
> Also, there are several inconsistencies in spacing, for example, 
> {{_FUNC_(a,b)}} and {{_FUNC_(a, b)}} (note the spacing between the arguments).
> It'd be nicer if most of them had a good example with the possible argument 
> types.
> Suggested format is as below for multiple line usage:
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED rand;
> Function: rand
> Class: org.apache.spark.sql.catalyst.expressions.Rand
> Usage:
>   rand() - Returns a random column with i.i.d. uniformly distributed 
> values in [0, 1].
> seed is given randomly.
>   rand(seed) - Returns a random column with i.i.d. uniformly distributed 
> values in [0, 1].
> seed should be an integer/long/NULL literal.
> Extended Usage:
> > SELECT rand();
>  0.9629742951434543
> > SELECT rand(0);
>  0.8446490682263027
> > SELECT rand(NULL);
>  0.8446490682263027
> {code}
> For single line usage:
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED date_add;
> Function: date_add
> Class: org.apache.spark.sql.catalyst.expressions.DateAdd
> Usage: date_add(start_date, num_days) - Returns the date that is num_days 
> after start_date.
> Extended Usage:
> > SELECT date_add('2016-07-30', 1);
>  '2016-07-31'
> {code}
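
For reference, these usage strings come from the {{ExpressionDescription}} annotation on each catalyst expression (assuming the current 2.x layout, where {{usage}} and {{extended}} are plain strings and {{_FUNC_}} is replaced by the function name), so the work is largely edits of the following shape; the class and wording below are only an illustration:

{code}
import org.apache.spark.sql.catalyst.expressions.ExpressionDescription

// Illustrative only: the real annotation sits on the expression's case class,
// e.g. DateAdd for date_add.
@ExpressionDescription(
  usage = "_FUNC_(start_date, num_days) - Returns the date that is num_days after start_date.",
  extended = """
    > SELECT _FUNC_('2016-07-30', 1);
     '2016-07-31'
  """)
case class SomeExpressionPlaceholder()
{code}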






[jira] [Created] (SPARK-17969) I think it's user unfriendly to process standard json file with DataFrame

2016-10-16 Thread Jianfei Wang (JIRA)
Jianfei Wang created SPARK-17969:


 Summary: I think it's user unfriendly to process standard json 
file with DataFrame 
 Key: SPARK-17969
 URL: https://issues.apache.org/jira/browse/SPARK-17969
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.0.1
Reporter: Jianfei Wang
Priority: Minor


Currently, with the DataFrame API, we can't load a standard (pretty-printed, 
multi-line) JSON file directly. Maybe we can provide an overloaded method to 
handle this; the logic would be as below:
```
val files = spark.sparkContext.wholeTextFiles("data/test.json")
// wholeTextFiles yields (path, content) pairs; drop the path and collapse the
// multi-line JSON onto one line (this also strips whitespace inside string values)
val json_rdd = files.map { case (_, content) => content.replaceAll("\\s+", "") }
val json_df = spark.read.json(json_rdd)
```






[jira] [Resolved] (SPARK-17819) Specified database in JDBC URL is ignored when connecting to thriftserver

2016-10-16 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17819.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.1.0

> Specified database in JDBC URL is ignored when connecting to thriftserver
> -
>
> Key: SPARK-17819
> URL: https://issues.apache.org/jira/browse/SPARK-17819
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0, 2.0.1
>Reporter: Todd Nemet
>Assignee: Dongjoon Hyun
> Fix For: 2.1.0
>
>
> Filing this based on an email thread with Reynold Xin. From the 
> [docs|http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server],
>  the JDBC connection URL to the thriftserver looks like:
> {code}
> beeline> !connect 
> jdbc:hive2://<host>:<port>/<db>?hive.server2.transport.mode=http;hive.server2.thrift.http.path=<http_endpoint>
> {code}
> However, anything specified in <db> results in being put in the default 
> schema. I'm running these with -e commands, but the shell shows the same 
> behavior.
> In 2.0.1, I created a table foo in schema spark_jira:
> {code}
> [558|18:01:20] ~/Documents/spark/spark$ bin/beeline -u 
> jdbc:hive2://localhost:10006/spark_jira -n hive -e "show tables"
> Connecting to jdbc:hive2://localhost:10006/spark_jira
> 16/10/06 18:01:28 INFO jdbc.Utils: Supplied authorities: localhost:10006
> 16/10/06 18:01:28 INFO jdbc.Utils: Resolved authority: localhost:10006
> 16/10/06 18:01:28 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10006/spark_jira
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> +------------+--------------+--+
> | tableName  | isTemporary  |
> +------------+--------------+--+
> +------------+--------------+--+
> No rows selected (0.558 seconds)
> Beeline version 1.2.1.spark2 by Apache Hive
> Closing: 0: jdbc:hive2://localhost:10006/spark_jira
> [559|18:01:30] ~/Documents/spark/spark$ bin/beeline -u 
> jdbc:hive2://localhost:10006/spark_jira -n hive -e "show tables in spark_jira"
> Connecting to jdbc:hive2://localhost:10006/spark_jira
> 16/10/06 18:01:34 INFO jdbc.Utils: Supplied authorities: localhost:10006
> 16/10/06 18:01:34 INFO jdbc.Utils: Resolved authority: localhost:10006
> 16/10/06 18:01:34 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10006/spark_jira
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> +------------+--------------+--+
> | tableName  | isTemporary  |
> +------------+--------------+--+
> | foo        | false        |
> +------------+--------------+--+
> 1 row selected (0.664 seconds)
> Beeline version 1.2.1.spark2 by Apache Hive
> Closing: 0: jdbc:hive2://localhost:10006/spark_jira
> {code}
> I also see this in Spark 1.6.2:
> {code}
> [555|18:13:32] ~/Documents/spark/spark16$ bin/beeline -u 
> jdbc:hive2://localhost:10005/spark_jira -n hive -e "show tables"
> Connecting to jdbc:hive2://localhost:10005/spark_jira
> 16/10/06 18:13:37 INFO jdbc.Utils: Supplied authorities: localhost:10005
> 16/10/06 18:13:37 INFO jdbc.Utils: Resolved authority: localhost:10005
> 16/10/06 18:13:37 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10005/spark_jira
> Connected to: Spark SQL (version 1.6.2)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> +--------------+--------------+--+
> |  tableName   | isTemporary  |
> +--------------+--------------+--+
> | all_types    | false        |
> | order_items  | false        |
> | orders       | false        |
> | users        | false        |
> +--------------+--------------+--+
> 4 rows selected (0.653 seconds)
> Beeline version 1.2.1.spark2 by Apache Hive
> Closing: 0: jdbc:hive2://localhost:10005/spark_jira
> [556|18:13:39] ~/Documents/spark/spark16$ bin/beeline -u 
> jdbc:hive2://localhost:10005/spark_jira -n hive -e "show tables in spark_jira"
> Connecting to jdbc:hive2://localhost:10005/spark_jira
> 16/10/06 18:13:45 INFO jdbc.Utils: Supplied authorities: localhost:10005
> 16/10/06 18:13:45 INFO jdbc.Utils: Resolved authority: localhost:10005
> 16/10/06 18:13:45 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10005/spark_jira
> Connected to: Spark SQL (version 1.6.2)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> +------------+--------------+--+
> | tableName  | isTemporary  |
> +------------+--------------+--+
> | foo        | false        |
> 

[jira] [Commented] (SPARK-17819) Specified database in JDBC URL is ignored when connecting to thriftserver

2016-10-16 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581049#comment-15581049
 ] 

Dongjoon Hyun commented on SPARK-17819:
---

Hi, [~rxin].
Could you review the PR? In fact, it was an addition of 3 lines.

> Specified database in JDBC URL is ignored when connecting to thriftserver
> -
>
> Key: SPARK-17819
> URL: https://issues.apache.org/jira/browse/SPARK-17819
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0, 2.0.1
>Reporter: Todd Nemet
>
> Filing this based on an email thread with Reynold Xin. From the 
> [docs|http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server],
>  the JDBC connection URL to the thriftserver looks like:
> {code}
> beeline> !connect 
> jdbc:hive2://<host>:<port>/<db>?hive.server2.transport.mode=http;hive.server2.thrift.http.path=<http_endpoint>
> {code}
> However, anything specified in <db> results in being put in the default 
> schema. I'm running these with -e commands, but the shell shows the same 
> behavior.
> In 2.0.1, I created a table foo in schema spark_jira:
> {code}
> [558|18:01:20] ~/Documents/spark/spark$ bin/beeline -u 
> jdbc:hive2://localhost:10006/spark_jira -n hive -e "show tables"
> Connecting to jdbc:hive2://localhost:10006/spark_jira
> 16/10/06 18:01:28 INFO jdbc.Utils: Supplied authorities: localhost:10006
> 16/10/06 18:01:28 INFO jdbc.Utils: Resolved authority: localhost:10006
> 16/10/06 18:01:28 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10006/spark_jira
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> +------------+--------------+--+
> | tableName  | isTemporary  |
> +------------+--------------+--+
> +------------+--------------+--+
> No rows selected (0.558 seconds)
> Beeline version 1.2.1.spark2 by Apache Hive
> Closing: 0: jdbc:hive2://localhost:10006/spark_jira
> [559|18:01:30] ~/Documents/spark/spark$ bin/beeline -u 
> jdbc:hive2://localhost:10006/spark_jira -n hive -e "show tables in spark_jira"
> Connecting to jdbc:hive2://localhost:10006/spark_jira
> 16/10/06 18:01:34 INFO jdbc.Utils: Supplied authorities: localhost:10006
> 16/10/06 18:01:34 INFO jdbc.Utils: Resolved authority: localhost:10006
> 16/10/06 18:01:34 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10006/spark_jira
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> +------------+--------------+--+
> | tableName  | isTemporary  |
> +------------+--------------+--+
> | foo        | false        |
> +------------+--------------+--+
> 1 row selected (0.664 seconds)
> Beeline version 1.2.1.spark2 by Apache Hive
> Closing: 0: jdbc:hive2://localhost:10006/spark_jira
> {code}
> I also see this in Spark 1.6.2:
> {code}
> [555|18:13:32] ~/Documents/spark/spark16$ bin/beeline -u 
> jdbc:hive2://localhost:10005/spark_jira -n hive -e "show tables"
> Connecting to jdbc:hive2://localhost:10005/spark_jira
> 16/10/06 18:13:37 INFO jdbc.Utils: Supplied authorities: localhost:10005
> 16/10/06 18:13:37 INFO jdbc.Utils: Resolved authority: localhost:10005
> 16/10/06 18:13:37 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10005/spark_jira
> Connected to: Spark SQL (version 1.6.2)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> +--------------+--------------+--+
> |  tableName   | isTemporary  |
> +--------------+--------------+--+
> | all_types    | false        |
> | order_items  | false        |
> | orders       | false        |
> | users        | false        |
> +--------------+--------------+--+
> 4 rows selected (0.653 seconds)
> Beeline version 1.2.1.spark2 by Apache Hive
> Closing: 0: jdbc:hive2://localhost:10005/spark_jira
> [556|18:13:39] ~/Documents/spark/spark16$ bin/beeline -u 
> jdbc:hive2://localhost:10005/spark_jira -n hive -e "show tables in spark_jira"
> Connecting to jdbc:hive2://localhost:10005/spark_jira
> 16/10/06 18:13:45 INFO jdbc.Utils: Supplied authorities: localhost:10005
> 16/10/06 18:13:45 INFO jdbc.Utils: Resolved authority: localhost:10005
> 16/10/06 18:13:45 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10005/spark_jira
> Connected to: Spark SQL (version 1.6.2)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> +------------+--------------+--+
> | tableName  | isTemporary  |
> +------------+--------------+--+
> | foo        | false        |
> +------------+--------------+--+
> 1 row selected 
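
Until the fix lands, one client-side workaround (a sketch only, not the 3-line patch referenced above) is to issue an explicit {{USE}} right after connecting, since the database segment of the URL is currently ignored. The host/port, user, and schema below are taken from the report; everything else is an assumption.

{code}
// Minimal JDBC sketch of the workaround, assuming the thriftserver from the report
// is listening on localhost:10006 and the Hive JDBC driver is on the classpath.
import java.sql.DriverManager

val conn = DriverManager.getConnection(
  "jdbc:hive2://localhost:10006/spark_jira", "hive", "")
val stmt = conn.createStatement()
stmt.execute("USE spark_jira")             // explicit switch; the <database> in the URL is ignored
val rs = stmt.executeQuery("SHOW TABLES")
while (rs.next()) println(rs.getString("tableName"))
conn.close()
{code}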

[jira] [Commented] (SPARK-17963) Add examples (extend) in each function and improve documentation with arguments

2016-10-16 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581022#comment-15581022
 ] 

Reynold Xin commented on SPARK-17963:
-

It's definitely useful to do, but I don't think we need to have examples for 
every function (e.g. count).


> Add examples (extend) in each function and improve documentation with 
> arguments
> ---
>
> Key: SPARK-17963
> URL: https://issues.apache.org/jira/browse/SPARK-17963
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> Currently, it seems function documentation is inconsistent and mostly does not 
> have examples in the extended usage ({{Extended Usage}}).
> For example, some functions have a bad indentation as below:
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED approx_count_distinct;
> Function: approx_count_distinct
> Class: org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus
> Usage: approx_count_distinct(expr) - Returns the estimated cardinality by 
> HyperLogLog++.
> approx_count_distinct(expr, relativeSD=0.05) - Returns the estimated 
> cardinality by HyperLogLog++
>   with relativeSD, the maximum estimation error allowed.
> Extended Usage:
> No example for approx_count_distinct.
> {code}
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED count;
> Function: count
> Class: org.apache.spark.sql.catalyst.expressions.aggregate.Count
> Usage: count(*) - Returns the total number of retrieved rows, including rows 
> containing NULL values.
> count(expr) - Returns the number of rows for which the supplied 
> expression is non-NULL.
> count(DISTINCT expr[, expr...]) - Returns the number of rows for which 
> the supplied expression(s) are unique and non-NULL.
> Extended Usage:
> No example for count.
> {code}
> whereas some do have a pretty one
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED percentile_approx;
> Function: percentile_approx
> Class: 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
> Usage:
>   percentile_approx(col, percentage [, accuracy]) - Returns the 
> approximate percentile value of numeric
>   column `col` at the given percentage. The value of percentage must be 
> between 0.0
>   and 1.0. The `accuracy` parameter (default: 1) is a positive 
> integer literal which
>   controls approximation accuracy at the cost of memory. Higher value of 
> `accuracy` yields
>   better accuracy, `1.0/accuracy` is the relative error of the 
> approximation.
>   percentile_approx(col, array(percentage1 [, percentage2]...) [, 
> accuracy]) - Returns the approximate
>   percentile array of column `col` at the given percentage array. Each 
> value of the
>   percentage array must be between 0.0 and 1.0. The `accuracy` parameter 
> (default: 1) is
>a positive integer literal which controls approximation accuracy at 
> the cost of memory.
>Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is 
> the relative error of
>the approximation.
> Extended Usage:
> No example for percentile_approx.
> {code}
> Also, there is some inconsistent spacing, for example, {{_FUNC_(a,b)}} and 
> {{_FUNC_(a, b)}} (note the spacing between the arguments).
> It'd be nicer if most of them have a good example with possible argument 
> types.
> Suggested format is as below for multiple line usage:
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED rand;
> Function: rand
> Class: org.apache.spark.sql.catalyst.expressions.Rand
> Usage:
>   rand() - Returns a random column with i.i.d. uniformly distributed 
> values in [0, 1].
> seed is given randomly.
>   rand(seed) - Returns a random column with i.i.d. uniformly distributed 
> values in [0, 1].
> seed should be an integer/long/NULL literal.
> Extended Usage:
> > SELECT rand();
>  0.9629742951434543
> > SELECT rand(0);
>  0.8446490682263027
> > SELECT rand(NULL);
>  0.8446490682263027
> {code}
> For single line usage:
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED date_add;
> Function: date_add
> Class: org.apache.spark.sql.catalyst.expressions.DateAdd
> Usage: date_add(start_date, num_days) - Returns the date that is num_days 
> after start_date.
> Extended Usage:
> > SELECT date_add('2016-07-30', 1);
>  '2016-07-31'
> {code}
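
For anyone picking this up: the text printed by DESCRIBE FUNCTION comes from the {{@ExpressionDescription}} annotation on each Catalyst expression, so the suggested format above would translate into annotation strings roughly like the hedged sketch below. The dummy class and the doc wording are illustrative only, not the actual Spark source.

{code}
// Sketch only: where the DESCRIBE FUNCTION text is declared. In Spark's source the
// annotation sits on the real expression classes (e.g. o.a.s.sql.catalyst.expressions.Rand);
// the placeholder case class here just carries the annotation for illustration.
import org.apache.spark.sql.catalyst.expressions.ExpressionDescription

@ExpressionDescription(
  usage = """
    _FUNC_() - Returns a random column with i.i.d. uniformly distributed values in [0, 1].
      The seed is chosen randomly.

    _FUNC_(seed) - Same as above, but seeded with an integer/long/NULL literal.
  """,
  extended = """
    > SELECT _FUNC_(0);
     0.8446490682263027
  """)
case class RandDocSketch(seed: Long)
{code}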



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python

2016-10-16 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581031#comment-15581031
 ] 

Reynold Xin commented on SPARK-10915:
-

What's the use case? Is it not possible to just run df.rdd.reduceByKey? The 
main issue is that it is actually very convoluted and complex to implement this 
properly.
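
A rough sketch of the alternative mentioned above, i.e. dropping to the RDD API and aggregating with {{reduceByKey}} instead of a true UDAF (written in Scala here purely for illustration; the column names and the sum aggregation are assumptions, not from this ticket):

{code}
// Group-by-key aggregation via the RDD API instead of a UDAF.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("udaf-alternative").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1.0), ("a", 2.0), ("b", 3.0)).toDF("key", "value")

val aggregated = df.rdd
  .map(row => (row.getString(0), row.getDouble(1)))   // key each row
  .reduceByKey(_ + _)                                  // any associative combine function
aggregated.collect().foreach(println)
{code}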


> Add support for UDAFs in Python
> ---
>
> Key: SPARK-10915
> URL: https://issues.apache.org/jira/browse/SPARK-10915
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Justin Uang
>
> This should support python defined lambdas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11524) Support SparkR with Mesos cluster

2016-10-16 Thread Susan X. Huynh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581023#comment-15581023
 ] 

Susan X. Huynh commented on SPARK-11524:


Thanks for the advice and for breaking down the sub-issues. For Mesos _cluster_ 
mode, is there additional work, or is it just the same problem of locating 
sparkr.zip on the slave nodes?

> Support SparkR with Mesos cluster
> -
>
> Key: SPARK-11524
> URL: https://issues.apache.org/jira/browse/SPARK-11524
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Sun Rui
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17968) Support using 3rd-party R packages on Mesos

2016-10-16 Thread Sun Rui (JIRA)
Sun Rui created SPARK-17968:
---

 Summary: Support using 3rd-party R packages on Mesos
 Key: SPARK-17968
 URL: https://issues.apache.org/jira/browse/SPARK-17968
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 2.0.1
Reporter: Sun Rui






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2016-10-16 Thread Low Chin Wei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580989#comment-15580989
 ] 

Low Chin Wei commented on SPARK-13747:
--

java.lang.IllegalArgumentException: spark.sql.execution.id is already set
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:81)
 ~[spark-sql_2.11-2.0.1.jar:2.0.1]
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2546) 
~[spark-sql_2.11-2.0.1.jar:2.0.1]
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2192)
 ~[spark-sql_2.11-2.0.1.jar:2.0.1]
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2199)
 ~[spark-sql_2.11-2.0.1.jar:2.0.1]
at 
org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2227) 
~[spark-sql_2.11-2.0.1.jar:2.0.1]
at 
org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2226) 
~[spark-sql_2.11-2.0.1.jar:2.0.1]
at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2559) 
~[spark-sql_2.11-2.0.1.jar:2.0.1]
at org.apache.spark.sql.Dataset.count(Dataset.scala:2226) 
~[spark-sql_2.11-2.0.1.jar:2.0.1]


> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Shixiong Zhu
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> Run the following codes may fail
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global) as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task 
> in the same thread, and the thread-local properties get polluted.
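
A minimal sketch of the usual workaround, assuming a {{SparkSession}} named {{spark}} is in scope: run the concurrent jobs on a plain fixed-size thread pool instead of the global ForkJoinPool, so a thread blocked inside {{runJob}} is never reused for another task and its thread-local properties stay intact.

{code}
// Concurrent SQL jobs on a dedicated fixed thread pool (no ForkJoinPool work stealing).
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

val pool = Executors.newFixedThreadPool(8)
implicit val ec = ExecutionContext.fromExecutorService(pool)
import spark.implicits._

val counts = (1 to 100).map { _ =>
  Future {
    spark.sparkContext.parallelize(1 to 5).map(i => (i, i)).toDF("a", "b").count()
  }
}
counts.foreach(f => println(Await.result(f, Duration.Inf)))
pool.shutdown()
{code}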



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11524) Support SparkR with Mesos cluster

2016-10-16 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580960#comment-15580960
 ] 

Sun Rui commented on SPARK-11524:
-

Great, go ahead. Please look at the linked JIRA issues as references.
Basically, to enable SparkR in Mesos client mode, the problem is to locate 
sparkr.zip on the slave nodes and to set up the environment for R so that it can 
load the SparkR package. (There is no need to ship sparkr.zip from the driver 
node to the slave nodes, because the file is already part of the Spark 
distribution.) When locating sparkr.zip, two cases should be considered: a 
locally installed Spark distribution, or one downloaded from a publicly 
accessible place (check spark.mesos.executor.home).

I just broke this issue into sub-issues, so we can address them one by one.

> Support SparkR with Mesos cluster
> -
>
> Key: SPARK-11524
> URL: https://issues.apache.org/jira/browse/SPARK-11524
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Sun Rui
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17967) Support for list or other types as an option for datasources

2016-10-16 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-17967:


 Summary: Support for list or other types as an option for 
datasources
 Key: SPARK-17967
 URL: https://issues.apache.org/jira/browse/SPARK-17967
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.1, 2.0.0
Reporter: Hyukjin Kwon


This was discussed in SPARK-17878

For other datasources, string/long/boolean/double values seem to be enough as 
options, but that does not seem sufficient for a datasource such as CSV. As this 
is an interface for external datasources, I guess it'd affect several ones out 
there.

I took a first look, but it seems it'd be difficult to support this (it needs a 
lot of changes).

One suggestion is to support this as a JSON array; a rough sketch follows below.
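
A hypothetical sketch of that suggestion, staying within the existing String-valued option API; the {{nullValues}} option name is made up for illustration and is not an existing CSV option:

{code}
// Passing a list-valued option as a JSON array encoded in a plain String option.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("list-option-sketch").getOrCreate()

val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("nullValues", """["NA", "null", "-"]""")  // hypothetical list option as a JSON array
  .load("data/people.csv")                          // assumed path
{code}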



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17967) Support for list or other types as an option for datasources

2016-10-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580958#comment-15580958
 ] 

Hyukjin Kwon commented on SPARK-17967:
--

I am leaving SPARK-17878 as a related one but it does not mean this one blocks 
that JIRA.

> Support for list or other types as an option for datasources
> 
>
> Key: SPARK-17967
> URL: https://issues.apache.org/jira/browse/SPARK-17967
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Hyukjin Kwon
>
> This was discussed in SPARK-17878
> For other datasources, string/long/boolean/double values seem to be enough as 
> options, but that does not seem sufficient for a datasource such as CSV. As this 
> is an interface for external datasources, I guess it'd affect several ones out 
> there.
> I took a first look, but it seems it'd be difficult to support this (it needs a 
> lot of changes).
> One suggestion is to support this as a JSON array.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17966) Support Spark packages with R code on Mesos

2016-10-16 Thread Sun Rui (JIRA)
Sun Rui created SPARK-17966:
---

 Summary: Support Spark packages with R code on Mesos
 Key: SPARK-17966
 URL: https://issues.apache.org/jira/browse/SPARK-17966
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 2.0.1
Reporter: Sun Rui






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17965) Enable SparkR with Mesos cluster mode

2016-10-16 Thread Sun Rui (JIRA)
Sun Rui created SPARK-17965:
---

 Summary: Enable SparkR with Mesos cluster mode
 Key: SPARK-17965
 URL: https://issues.apache.org/jira/browse/SPARK-17965
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 2.0.1
Reporter: Sun Rui






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17964) Enable SparkR with Mesos client mode

2016-10-16 Thread Sun Rui (JIRA)
Sun Rui created SPARK-17964:
---

 Summary: Enable SparkR with Mesos client mode
 Key: SPARK-17964
 URL: https://issues.apache.org/jira/browse/SPARK-17964
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 2.0.1
Reporter: Sun Rui






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17963) Add examples (extend) in each function and improve documentation with arguments

2016-10-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-17963:
-
Description: 
Currently, it seems function documentation is inconsistent and mostly does not 
have examples in the extended usage ({{Extended Usage}}).

For example, some functions have a bad indentation as below:

{code}
spark-sql> DESCRIBE FUNCTION EXTENDED approx_count_distinct;
Function: approx_count_distinct
Class: org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus
Usage: approx_count_distinct(expr) - Returns the estimated cardinality by 
HyperLogLog++.
approx_count_distinct(expr, relativeSD=0.05) - Returns the estimated 
cardinality by HyperLogLog++
  with relativeSD, the maximum estimation error allowed.

Extended Usage:
No example for approx_count_distinct.
{code}

{code}
spark-sql> DESCRIBE FUNCTION EXTENDED count;
Function: count
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Count
Usage: count(*) - Returns the total number of retrieved rows, including rows 
containing NULL values.
count(expr) - Returns the number of rows for which the supplied expression 
is non-NULL.
count(DISTINCT expr[, expr...]) - Returns the number of rows for which the 
supplied expression(s) are unique and non-NULL.
Extended Usage:
No example for count.
{code}

whereas some do have a pretty one

{code}
spark-sql> DESCRIBE FUNCTION EXTENDED percentile_approx;
Function: percentile_approx
Class: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
Usage:
  percentile_approx(col, percentage [, accuracy]) - Returns the approximate 
percentile value of numeric
  column `col` at the given percentage. The value of percentage must be 
between 0.0
  and 1.0. The `accuracy` parameter (default: 1) is a positive integer 
literal which
  controls approximation accuracy at the cost of memory. Higher value of 
`accuracy` yields
  better accuracy, `1.0/accuracy` is the relative error of the 
approximation.

  percentile_approx(col, array(percentage1 [, percentage2]...) [, 
accuracy]) - Returns the approximate
  percentile array of column `col` at the given percentage array. Each 
value of the
  percentage array must be between 0.0 and 1.0. The `accuracy` parameter 
(default: 1) is
   a positive integer literal which controls approximation accuracy at the 
cost of memory.
   Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the 
relative error of
   the approximation.

Extended Usage:
No example for percentile_approx.
{code}

Also, there is some inconsistent spacing, for example, {{_FUNC_(a,b)}} 
and {{_FUNC_(a, b)}} (note the spacing between the arguments).

It'd be nicer if most of them have a good example with possible argument types.

Suggested format is as below for multiple line usage:

{code}
spark-sql> DESCRIBE FUNCTION EXTENDED rand;
Function: rand
Class: org.apache.spark.sql.catalyst.expressions.Rand
Usage:
  rand() - Returns a random column with i.i.d. uniformly distributed values 
in [0, 1].
seed is given randomly.

  rand(seed) - Returns a random column with i.i.d. uniformly distributed 
values in [0, 1].
seed should be an integer/long/NULL literal.

Extended Usage:
> SELECT rand();
 0.9629742951434543
> SELECT rand(0);
 0.8446490682263027
> SELECT rand(NULL);
 0.8446490682263027
{code}

For single line usage:

{code}
spark-sql> DESCRIBE FUNCTION EXTENDED date_add;
Function: date_add
Class: org.apache.spark.sql.catalyst.expressions.DateAdd
Usage: date_add(start_date, num_days) - Returns the date that is num_days after 
start_date.
Extended Usage:
> SELECT date_add('2016-07-30', 1);
 '2016-07-31'
{code}

  was:
Currently, it seems function documentation is inconsistent and mostly does not 
have examples in the extended usage ({{Extended Usage}}).

For example, some functions have a bad indentation as below:

{code}
spark-sql> DESCRIBE FUNCTION last;
Function: last
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Last
Usage: last(expr,isIgnoreNull) - Returns the last value of `child` for a group 
of rows.
last(expr,isIgnoreNull=false) - Returns the last value of `child` for a 
group of rows.
  If isIgnoreNull is true, returns only non-null values.
{code}

{code}
spark-sql> DESCRIBE FUNCTION EXTENDED count;
Function: count
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Count
Usage: count(*) - Returns the total number of retrieved rows, including rows 
containing NULL values.
count(expr) - Returns the number of rows for which the supplied expression 
is non-NULL.
count(DISTINCT expr[, expr...]) - Returns the number of rows for which the 
supplied expression(s) are unique and non-NULL.
Extended Usage:
No example for count.
{code}

whereas some do have a pretty one

{code}
spark-sql> DESCRIBE FUNCTION EXTENDED percentile_approx;
Function: percentile_approx
Class: 

[jira] [Updated] (SPARK-17963) Add examples (extend) in each function and improve documentation with arguments

2016-10-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-17963:
-
Description: 
Currently, it seems function documentation is inconsistent and mostly does not 
have examples in the extended usage ({{Extended Usage}}).

For example, some functions have a bad indentation as below:

{code}
spark-sql> DESCRIBE FUNCTION last;
Function: last
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Last
Usage: last(expr,isIgnoreNull) - Returns the last value of `child` for a group 
of rows.
last(expr,isIgnoreNull=false) - Returns the last value of `child` for a 
group of rows.
  If isIgnoreNull is true, returns only non-null values.
{code}

{code}
spark-sql> DESCRIBE FUNCTION EXTENDED count;
Function: count
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Count
Usage: count(*) - Returns the total number of retrieved rows, including rows 
containing NULL values.
count(expr) - Returns the number of rows for which the supplied expression 
is non-NULL.
count(DISTINCT expr[, expr...]) - Returns the number of rows for which the 
supplied expression(s) are unique and non-NULL.
Extended Usage:
No example for count.
{code}

whereas some do have a pretty one

{code}
spark-sql> DESCRIBE FUNCTION EXTENDED percentile_approx;
Function: percentile_approx
Class: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
Usage:
  percentile_approx(col, percentage [, accuracy]) - Returns the approximate 
percentile value of numeric
  column `col` at the given percentage. The value of percentage must be 
between 0.0
  and 1.0. The `accuracy` parameter (default: 1) is a positive integer 
literal which
  controls approximation accuracy at the cost of memory. Higher value of 
`accuracy` yields
  better accuracy, `1.0/accuracy` is the relative error of the 
approximation.

  percentile_approx(col, array(percentage1 [, percentage2]...) [, 
accuracy]) - Returns the approximate
  percentile array of column `col` at the given percentage array. Each 
value of the
  percentage array must be between 0.0 and 1.0. The `accuracy` parameter 
(default: 1) is
   a positive integer literal which controls approximation accuracy at the 
cost of memory.
   Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the 
relative error of
   the approximation.

Extended Usage:
No example for percentile_approx.
{code}

Also, there is some inconsistent spacing, for example, {{_FUNC_(a,b)}} 
and {{_FUNC_(a, b)}} (note the spacing between the arguments).

It'd be nicer if most of them have a good example with possible argument types.

Suggested format is as below for multiple line usage:

{code}
spark-sql> DESCRIBE FUNCTION EXTENDED rand;
Function: rand
Class: org.apache.spark.sql.catalyst.expressions.Rand
Usage:
  rand() - Returns a random column with i.i.d. uniformly distributed values 
in [0, 1].
seed is given randomly.

  rand(seed) - Returns a random column with i.i.d. uniformly distributed 
values in [0, 1].
seed should be an integer/long/NULL literal.

Extended Usage:
> SELECT rand();
 0.9629742951434543
> SELECT rand(0);
 0.8446490682263027
> SELECT rand(NULL);
 0.8446490682263027
{code}

For single line usage:

{code}
spark-sql> DESCRIBE FUNCTION EXTENDED date_add;
Function: date_add
Class: org.apache.spark.sql.catalyst.expressions.DateAdd
Usage: date_add(start_date, num_days) - Returns the date that is num_days after 
start_date.
Extended Usage:
> SELECT date_add('2016-07-30', 1);
 '2016-07-31'
{code}

  was:
Currently, it seems function documentation is inconsistent and mostly does not 
have examples in the extended usage ({{Extended Usage}}).

For example, some functions have a bad indentation as below:

{code}
spark-sql> DESCRIBE FUNCTION last;
Function: last
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Last
Usage: last(expr,isIgnoreNull) - Returns the last value of `child` for a group 
of rows.
last(expr,isIgnoreNull=false) - Returns the last value of `child` for a 
group of rows.
  If isIgnoreNull is true, returns only non-null values.
{code}

{code}
spark-sql> DESCRIBE FUNCTION EXTENDED count;
Function: count
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Count
Usage: count(*) - Returns the total number of retrieved rows, including rows 
containing NULL values.
count(expr) - Returns the number of rows for which the supplied expression 
is non-NULL.
count(DISTINCT expr[, expr...]) - Returns the number of rows for which the 
supplied expression(s) are unique and non-NULL.
Extended Usage:
No example for count.
{code}

whereas some do have a pretty one

{code}
spark-sql> DESCRIBE FUNCTION EXTENDED percentile_approx;
Function: percentile_approx
Class: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
Usage:
  percentile_approx(col, percentage [, accuracy]) - Returns the 

[jira] [Commented] (SPARK-17963) Add examples (extend) in each function and improve documentation with arguments

2016-10-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580942#comment-15580942
 ] 

Hyukjin Kwon commented on SPARK-17963:
--

Hi [~rxin] and [~srowen], I guess the PR would be pretty big, so I would like 
you both to confirm this first. Do you think it'd be sensible?

> Add examples (extend) in each function and improve documentation with 
> arguments
> ---
>
> Key: SPARK-17963
> URL: https://issues.apache.org/jira/browse/SPARK-17963
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> Currently, it seems function documentation is inconsistent and mostly does not 
> have examples in the extended usage ({{Extended Usage}}).
> For example, some functions have a bad indentation as below:
> {code}
> spark-sql> DESCRIBE FUNCTION last;
> Function: last
> Class: org.apache.spark.sql.catalyst.expressions.aggregate.Last
> Usage: last(expr,isIgnoreNull) - Returns the last value of `child` for a 
> group of rows.
> last(expr,isIgnoreNull=false) - Returns the last value of `child` for a 
> group of rows.
>   If isIgnoreNull is true, returns only non-null values.
> {code}
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED count;
> Function: count
> Class: org.apache.spark.sql.catalyst.expressions.aggregate.Count
> Usage: count(*) - Returns the total number of retrieved rows, including rows 
> containing NULL values.
> count(expr) - Returns the number of rows for which the supplied 
> expression is non-NULL.
> count(DISTINCT expr[, expr...]) - Returns the number of rows for which 
> the supplied expression(s) are unique and non-NULL.
> Extended Usage:
> No example for count.
> {code}
> whereas some do have a pretty one
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED percentile_approx;
> Function: percentile_approx
> Class: 
> org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
> Usage:
>   percentile_approx(col, percentage [, accuracy]) - Returns the 
> approximate percentile value of numeric
>   column `col` at the given percentage. The value of percentage must be 
> between 0.0
>   and 1.0. The `accuracy` parameter (default: 1) is a positive 
> integer literal which
>   controls approximation accuracy at the cost of memory. Higher value of 
> `accuracy` yields
>   better accuracy, `1.0/accuracy` is the relative error of the 
> approximation.
>   percentile_approx(col, array(percentage1 [, percentage2]...) [, 
> accuracy]) - Returns the approximate
>   percentile array of column `col` at the given percentage array. Each 
> value of the
>   percentage array must be between 0.0 and 1.0. The `accuracy` parameter 
> (default: 1) is
>a positive integer literal which controls approximation accuracy at 
> the cost of memory.
>Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is 
> the relative error of
>the approximation.
> Extended Usage:
> No example for percentile_approx.
> {code}
> Also, there is some inconsistent spacing, for example, {{_FUNC_(a,b)}} and 
> {{_FUNC_(a, b)}} (note the spacing between the arguments).
> It'd be nicer if most of them have a good example with possible argument 
> types.
> Suggested format is as below for multiple line usage:
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED rand;
> Function: rand
> Class: org.apache.spark.sql.catalyst.expressions.Rand
> Usage:
>   rand() - Returns a random column with i.i.d. uniformly distributed 
> values in [0, 1].
> seed is given randomly.
>   rand(seed) - Returns a random column with i.i.d. uniformly distributed 
> values in [0, 1].
> seed should be an integer/long/NULL literal.
> Extended Usage:
> > SELECT rand();
>  0.9629742951434543
> > SELECT rand(0);
>  0.8446490682263027
> > SELECT rand(NULL);
>  0.8446490682263027
> {code}
> For single line usage:
> {code}
> spark-sql> DESCRIBE FUNCTION EXTENDED date_add;
> Function: date_add
> Class: org.apache.spark.sql.catalyst.expressions.DateAdd
> Usage: date_add(start_date, num_days) - Returns the date that is num_days 
> after start_date.
> Extended Usage:
> > SELECT date_add('2016-07-30', 1);
>  '2016-07-31'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17963) Add examples (extend) in each function and improve documentation with arguments

2016-10-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-17963:
-
Description: 
Currently, it seems function documentation is inconsistent and mostly does not 
have examples in the extended usage ({{Extended Usage}}).

For example, some functions have a bad indentation as below:

{code}
spark-sql> DESCRIBE FUNCTION last;
Function: last
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Last
Usage: last(expr,isIgnoreNull) - Returns the last value of `child` for a group 
of rows.
last(expr,isIgnoreNull=false) - Returns the last value of `child` for a 
group of rows.
  If isIgnoreNull is true, returns only non-null values.
{code}

{code}
spark-sql> DESCRIBE FUNCTION EXTENDED count;
Function: count
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Count
Usage: count(*) - Returns the total number of retrieved rows, including rows 
containing NULL values.
count(expr) - Returns the number of rows for which the supplied expression 
is non-NULL.
count(DISTINCT expr[, expr...]) - Returns the number of rows for which the 
supplied expression(s) are unique and non-NULL.
Extended Usage:
No example for count.
{code}

whereas some do have a pretty one

{code}
spark-sql> DESCRIBE FUNCTION EXTENDED percentile_approx;
Function: percentile_approx
Class: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
Usage:
  percentile_approx(col, percentage [, accuracy]) - Returns the approximate 
percentile value of numeric
  column `col` at the given percentage. The value of percentage must be 
between 0.0
  and 1.0. The `accuracy` parameter (default: 1) is a positive integer 
literal which
  controls approximation accuracy at the cost of memory. Higher value of 
`accuracy` yields
  better accuracy, `1.0/accuracy` is the relative error of the 
approximation.

  percentile_approx(col, array(percentage1 [, percentage2]...) [, 
accuracy]) - Returns the approximate
  percentile array of column `col` at the given percentage array. Each 
value of the
  percentage array must be between 0.0 and 1.0. The `accuracy` parameter 
(default: 1) is
   a positive integer literal which controls approximation accuracy at the 
cost of memory.
   Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the 
relative error of
   the approximation.

Extended Usage:
No example for percentile_approx.
{code}

Also, there is some inconsistent spacing, for example, {{_FUNC_(a,b)}} 
and {{_FUNC_(a, b)}} (note the spacing between the arguments).

It'd be nicer if most of them have a good example with possible argument types.

Suggested format is as below for multiple line usage:

{code}
spark-sql> DESCRIBE FUNCTION EXTENDED rand;
Function: rand
Class: org.apache.spark.sql.catalyst.expressions.Rand
Usage:
  rand() - Returns a random column with i.i.d. uniformly distributed values 
in [0, 1].
seed is given randomly.

  rand(seed) - Returns a random column with i.i.d. uniformly distributed 
values in [0, 1].
seed should be an integer/long/NULL literal.

Extended Usage:
> SELECT rand();
 0.9629742951434543
> SELECT rand(0);
 0.8446490682263027
> SELECT rand(NULL);
 0.8446490682263027
{code}

For single line usage:

{code}
spark-sql> DESCRIBE FUNCTION EXTENDED date_add;
Function: date_add
Class: org.apache.spark.sql.catalyst.expressions.DateAdd
Usage: date_add(start_date, num_days) - Returns the date that is num_days after 
start_date.

Extended Usage:
> SELECT date_add('2016-07-30', 1);
 '2016-07-31'
{code}

  was:
Currently, it seems function documentation is inconsistent and mostly does not 
have examples in the extended usage ({{Extended Usage}}).

For example, some functions have a bad indentation as below:

{code}
spark-sql> DESCRIBE FUNCTION last;
Function: last
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Last
Usage: last(expr,isIgnoreNull) - Returns the last value of `child` for a group 
of rows.
last(expr,isIgnoreNull=false) - Returns the last value of `child` for a 
group of rows.
  If isIgnoreNull is true, returns only non-null values.
{code}

{code}
spark-sql> DESCRIBE FUNCTION EXTENDED count;
Function: count
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Count
Usage: count(*) - Returns the total number of retrieved rows, including rows 
containing NULL values.
count(expr) - Returns the number of rows for which the supplied expression 
is non-NULL.
count(DISTINCT expr[, expr...]) - Returns the number of rows for which the 
supplied expression(s) are unique and non-NULL.
Extended Usage:
No example for count.
{code}

whereas some do have a pretty one

{code}
spark-sql> DESCRIBE FUNCTION EXTENDED percentile_approx;
Function: percentile_approx
Class: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
Usage:
  percentile_approx(col, percentage [, accuracy]) - Returns the 

[jira] [Created] (SPARK-17963) Add examples (extend) in each function and improve documentation with arguments

2016-10-16 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-17963:


 Summary: Add examples (extend) in each function and improve 
documentation with arguments
 Key: SPARK-17963
 URL: https://issues.apache.org/jira/browse/SPARK-17963
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Reporter: Hyukjin Kwon


Currently, it seems function documentation is inconsistent and mostly does not 
have examples in the extended usage ({{Extended Usage}}).

For example, some functions have a bad indentation as below:

{code}
spark-sql> DESCRIBE FUNCTION last;
Function: last
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Last
Usage: last(expr,isIgnoreNull) - Returns the last value of `child` for a group 
of rows.
last(expr,isIgnoreNull=false) - Returns the last value of `child` for a 
group of rows.
  If isIgnoreNull is true, returns only non-null values.
{code}

{code}
spark-sql> DESCRIBE FUNCTION EXTENDED count;
Function: count
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Count
Usage: count(*) - Returns the total number of retrieved rows, including rows 
containing NULL values.
count(expr) - Returns the number of rows for which the supplied expression 
is non-NULL.
count(DISTINCT expr[, expr...]) - Returns the number of rows for which the 
supplied expression(s) are unique and non-NULL.
Extended Usage:
No example for count.
{code}

whereas some do have a pretty one

{code}
spark-sql> DESCRIBE FUNCTION EXTENDED percentile_approx;
Function: percentile_approx
Class: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
Usage:
  percentile_approx(col, percentage [, accuracy]) - Returns the approximate 
percentile value of numeric
  column `col` at the given percentage. The value of percentage must be 
between 0.0
  and 1.0. The `accuracy` parameter (default: 1) is a positive integer 
literal which
  controls approximation accuracy at the cost of memory. Higher value of 
`accuracy` yields
  better accuracy, `1.0/accuracy` is the relative error of the 
approximation.

  percentile_approx(col, array(percentage1 [, percentage2]...) [, 
accuracy]) - Returns the approximate
  percentile array of column `col` at the given percentage array. Each 
value of the
  percentage array must be between 0.0 and 1.0. The `accuracy` parameter 
(default: 1) is
   a positive integer literal which controls approximation accuracy at the 
cost of memory.
   Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the 
relative error of
   the approximation.

Extended Usage:
No example for percentile_approx.
{code}

Also, there is some inconsistent spacing, for example, {{_FUNC_(a,b)}} 
and {{_FUNC_(a, b)}} (note the spacing between the arguments).

It'd be nicer if most of them have a good example with possible argument types.

Suggested format is as below for multiple line usage:

{code}
spark-sql> DESCRIBE FUNCTION EXTENDED rand;
Function: rand
Class: org.apache.spark.sql.catalyst.expressions.Rand
Usage:
  rand() - Returns a random column with i.i.d. uniformly distributed values 
in [0, 1].
seed is given randomly.

  rand(seed) - Returns a random column with i.i.d. uniformly distributed 
values in [0, 1].
seed should be an integer/long/NULL literal.

Extended Usage:
> SELECT rand();
 0.9629742951434543
> SELECT rand(0);
 0.8446490682263027
> SELECT rand(NULL);
 0.8446490682263027
{code}

For single line usage:

{code}
spark-sql> DESCRIBE FUNCTION EXTENDED date_add;
Function: date_add
Class: org.apache.spark.sql.catalyst.expressions.DateAdd
Usage: date_add(start_date, num_days) - Returns the date that is num_days after 
start_date.
Extended Usage:
> SELECT date_add('2016-07-30', 1);
 '2016-07-31'
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python

2016-10-16 Thread Tobi Bosede (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580917#comment-15580917
 ] 

Tobi Bosede commented on SPARK-10915:
-

Thoughts [~davies] and [~mgummelt]? Refer to 
https://www.mail-archive.com/user@spark.apache.org/msg58125.html 

> Add support for UDAFs in Python
> ---
>
> Key: SPARK-10915
> URL: https://issues.apache.org/jira/browse/SPARK-10915
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Justin Uang
>
> This should support python defined lambdas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17898) --repositories needs username and password

2016-10-16 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580857#comment-15580857
 ] 

lichenglin commented on SPARK-17898:


I have found a way to declare the username and password:

--repositories 
http://username:passw...@wx.bjdv.com:8081/nexus/content/groups/bigdata/

Maybe we should add this to the documentation?



> --repositories  needs username and password
> ---
>
> Key: SPARK-17898
> URL: https://issues.apache.org/jira/browse/SPARK-17898
> Project: Spark
>  Issue Type: Wish
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> My private repositories need a username and password to visit.
> I can't find a way to declare the username and password when submitting a 
> Spark application.
> {code}
> bin/spark-submit --repositories   
> http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ --packages 
> com.databricks:spark-csv_2.10:1.2.0   --class 
> org.apache.spark.examples.SparkPi   --master local[8]   
> examples/jars/spark-examples_2.11-2.0.1.jar   100
> {code}
> The rep http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ need username 
> and password



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17962) DataFrame/Dataset join not producing correct results in Spark 2.0/Yarn

2016-10-16 Thread Stephen Hankinson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen Hankinson updated SPARK-17962:
--
Description: 
Environment can be reproduced via this git repo using the Deploy to Azure 
button: https://github.com/shankinson/spark (The cluster name must be the same 
as the resource group name used for this to launch properly, login with 
username hadoop, and launch the shell with 
/home/hadoop/spark-2.0.0-bin-hadoop2.7/bin/spark-shell --master yarn 
--deploy-mode client --conf 
spark.serializer=org.apache.spark.serializer.KryoSerializer --conf 
spark.sql.catalogImplementation=in-memory --driver-memory 10g --driver-cores 4)

We have a cluster running Apache Spark 2.0 on Hadoop 2.7.2, Centos 7.2. We had 
written some new code using the Spark DataFrame/DataSet APIs but are noticing 
incorrect results on a join after writing and then reading data to Windows 
Azure Storage Blobs (The default HDFS location). I've been able to duplicate 
the issue with the following snippet of code running on the cluster.

case class UserDimensions(user: Long, dimension: Long, score: Double)
case class CentroidClusterScore(dimension: Long, cluster: Int, score: Double)

val dims = sc.parallelize(Array(UserDimensions(12345, 0, 1.0))).toDS
val cent = sc.parallelize(Array(CentroidClusterScore(0, 1, 
1.0),CentroidClusterScore(1, 0, 1.0),CentroidClusterScore(2, 2, 1.0))).toDS

dims.show
cent.show
dims.join(cent, dims("dimension") === cent("dimension") ).show
outputs

|| user||dimension||score||
|12345|0|  1.0|

||dimension||cluster||score||
|0|  1|  1.0|
|1|  0|  1.0|
|2|  2|  1.0|

|| user||dimension||score||dimension||cluster||score||
|12345|0|  1.0|0|  1|  1.0|

which is correct. However after writing and reading the data, we see this

dims.write.mode("overwrite").save("/tmp/dims2.parquet")
cent.write.mode("overwrite").save("/tmp/cent2.parquet")

val dims2 = spark.read.load("/tmp/dims2.parquet").as[UserDimensions]
val cent2 = spark.read.load("/tmp/cent2.parquet").as[CentroidClusterScore]

dims2.show
cent2.show

dims2.join(cent2, dims2("dimension") === cent2("dimension") ).show
outputs

|| user||dimension||score||
|12345|0|  1.0|

||dimension||cluster||score||
|0|  1|  1.0|
|1|  0|  1.0|
|2|  2|  1.0|

|| user||dimension||score||dimension||cluster||score||
|12345|0|  1.0| null|   null| null|

However, using the RDD API produces the correct result

dims2.rdd.map( row => (row.dimension, row) ).join( cent2.rdd.map( row => 
(row.dimension, row) ) ).take(5)

res5: Array[(Long, (UserDimensions, CentroidClusterScore))] = 
Array((0,(UserDimensions(12345,0,1.0),CentroidClusterScore(0,1,1.0

We've tried changing the output format to ORC instead of Parquet, but we see 
the same results. Running Spark 2.0 locally, not on a cluster, does not have 
this issue. Running Spark in local mode on the master node of the Hadoop 
cluster also works. Only when running on top of YARN do we see this issue.

This also seems very similar to this issue: 
https://issues.apache.org/jira/browse/SPARK-10896

We have also determined this appears to be related to the memory settings of 
the cluster. The worker machines have 56000 MB available; the node manager 
memory is set to 54784M and executor memory to 48407M when we see this issue 
happen. Lowering the executor memory to something like 28407M stops the issue 
from happening.

  was:
Environment can be reproduced via this git repo using the Deploy to Azure 
button: https://github.com/shankinson/spark (The cluster name must be the same 
as the resource group name used for this to launch properly)

We have a cluster running Apache Spark 2.0 on Hadoop 2.7.2, Centos 7.2. We had 
written some new code using the Spark DataFrame/DataSet APIs but are noticing 
incorrect results on a join after writing and then reading data to Windows 
Azure Storage Blobs (The default HDFS location). I've been able to duplicate 
the issue with the following snippet of code running on the cluster.

case class UserDimensions(user: Long, dimension: Long, score: Double)
case class CentroidClusterScore(dimension: Long, cluster: Int, score: Double)

val dims = sc.parallelize(Array(UserDimensions(12345, 0, 1.0))).toDS
val cent = sc.parallelize(Array(CentroidClusterScore(0, 1, 
1.0),CentroidClusterScore(1, 0, 1.0),CentroidClusterScore(2, 2, 1.0))).toDS

dims.show
cent.show
dims.join(cent, dims("dimension") === cent("dimension") ).show
outputs

|| user||dimension||score||
|12345|0|  1.0|

||dimension||cluster||score||
|0|  1|  1.0|
|1|  0|  1.0|
|2|  2|  1.0|

|| user||dimension||score||dimension||cluster||score||
|12345|0|  1.0|0|  1|  1.0|

which is correct. However after writing and reading the data, we see this


[jira] [Updated] (SPARK-17962) DataFrame/Dataset join not producing correct results in Spark 2.0/Yarn

2016-10-16 Thread Stephen Hankinson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen Hankinson updated SPARK-17962:
--
Description: 
Environment can be reproduced via this git repo using the Deploy to Azure 
button: https://github.com/shankinson/spark (The cluster name must be the same 
as the resource group name used for this to launch properly)

We have a cluster running Apache Spark 2.0 on Hadoop 2.7.2, Centos 7.2. We had 
written some new code using the Spark DataFrame/DataSet APIs but are noticing 
incorrect results on a join after writing and then reading data to Windows 
Azure Storage Blobs (The default HDFS location). I've been able to duplicate 
the issue with the following snippet of code running on the cluster.

case class UserDimensions(user: Long, dimension: Long, score: Double)
case class CentroidClusterScore(dimension: Long, cluster: Int, score: Double)

val dims = sc.parallelize(Array(UserDimensions(12345, 0, 1.0))).toDS
val cent = sc.parallelize(Array(CentroidClusterScore(0, 1, 
1.0),CentroidClusterScore(1, 0, 1.0),CentroidClusterScore(2, 2, 1.0))).toDS

dims.show
cent.show
dims.join(cent, dims("dimension") === cent("dimension") ).show
outputs

|| user||dimension||score||
|12345|0|  1.0|

||dimension||cluster||score||
|0|  1|  1.0|
|1|  0|  1.0|
|2|  2|  1.0|

|| user||dimension||score||dimension||cluster||score||
|12345|0|  1.0|0|  1|  1.0|

which is correct. However after writing and reading the data, we see this

dims.write.mode("overwrite").save("/tmp/dims2.parquet")
cent.write.mode("overwrite").save("/tmp/cent2.parquet")

val dims2 = spark.read.load("/tmp/dims2.parquet").as[UserDimensions]
val cent2 = spark.read.load("/tmp/cent2.parquet").as[CentroidClusterScore]

dims2.show
cent2.show

dims2.join(cent2, dims2("dimension") === cent2("dimension") ).show
outputs

|| user||dimension||score||
|12345|0|  1.0|

||dimension||cluster||score||
|0|  1|  1.0|
|1|  0|  1.0|
|2|  2|  1.0|

|| user||dimension||score||dimension||cluster||score||
|12345|0|  1.0| null|   null| null|

However, using the RDD API produces the correct result

dims2.rdd.map( row => (row.dimension, row) ).join( cent2.rdd.map( row => 
(row.dimension, row) ) ).take(5)

res5: Array[(Long, (UserDimensions, CentroidClusterScore))] = 
Array((0,(UserDimensions(12345,0,1.0),CentroidClusterScore(0,1,1.0

We've tried changing the output format to ORC instead of Parquet, but we see 
the same results. Running Spark 2.0 locally, not on a cluster, does not have 
this issue. Running Spark in local mode on the master node of the Hadoop 
cluster also works. Only when running on top of YARN do we see this issue.

This also seems very similar to this issue: 
https://issues.apache.org/jira/browse/SPARK-10896

We have also determined this appears to be related to the memory settings of 
the cluster. The worker machines have 56000 MB available; the node manager 
memory is set to 54784M and executor memory to 48407M when we see this issue 
happen. Lowering the executor memory to something like 28407M stops the issue 
from happening.

  was:
Environment can be reproduced via this git repo using the Deploy to Azure 
button: https://github.com/shankinson/spark

We have a cluster running Apache Spark 2.0 on Hadoop 2.7.2, Centos 7.2. We had 
written some new code using the Spark DataFrame/DataSet APIs but are noticing 
incorrect results on a join after writing and then reading data to Windows 
Azure Storage Blobs (The default HDFS location). I've been able to duplicate 
the issue with the following snippet of code running on the cluster.

case class UserDimensions(user: Long, dimension: Long, score: Double)
case class CentroidClusterScore(dimension: Long, cluster: Int, score: Double)

val dims = sc.parallelize(Array(UserDimensions(12345, 0, 1.0))).toDS
val cent = sc.parallelize(Array(CentroidClusterScore(0, 1, 
1.0),CentroidClusterScore(1, 0, 1.0),CentroidClusterScore(2, 2, 1.0))).toDS

dims.show
cent.show
dims.join(cent, dims("dimension") === cent("dimension") ).show
outputs

|| user||dimension||score||
|12345|0|  1.0|

||dimension||cluster||score||
|0|  1|  1.0|
|1|  0|  1.0|
|2|  2|  1.0|

|| user||dimension||score||dimension||cluster||score||
|12345|0|  1.0|0|  1|  1.0|

which is correct. However after writing and reading the data, we see this

dims.write.mode("overwrite").save("/tmp/dims2.parquet")
cent.write.mode("overwrite").save("/tmp/cent2.parquet")

val dims2 = spark.read.load("/tmp/dims2.parquet").as[UserDimensions]
val cent2 = spark.read.load("/tmp/cent2.parquet").as[CentroidClusterScore]

dims2.show
cent2.show

dims2.join(cent2, dims2("dimension") === cent2("dimension") ).show
outputs

|| user||dimension||score||
|12345|0|  1.0|


[jira] [Updated] (SPARK-17962) DataFrame/Dataset join not producing correct results in Spark 2.0/Yarn

2016-10-16 Thread Stephen Hankinson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen Hankinson updated SPARK-17962:
--
Description: 
Environment can be reproduced via this git repo using the Deploy to Azure 
button: https://github.com/shankinson/spark

We have a cluster running Apache Spark 2.0 on Hadoop 2.7.2, Centos 7.2. We had 
written some new code using the Spark DataFrame/DataSet APIs but are noticing 
incorrect results on a join after writing and then reading data to Windows 
Azure Storage Blobs (The default HDFS location). I've been able to duplicate 
the issue with the following snippet of code running on the cluster.

case class UserDimensions(user: Long, dimension: Long, score: Double)
case class CentroidClusterScore(dimension: Long, cluster: Int, score: Double)

val dims = sc.parallelize(Array(UserDimensions(12345, 0, 1.0))).toDS
val cent = sc.parallelize(Array(CentroidClusterScore(0, 1, 
1.0),CentroidClusterScore(1, 0, 1.0),CentroidClusterScore(2, 2, 1.0))).toDS

dims.show
cent.show
dims.join(cent, dims("dimension") === cent("dimension") ).show
outputs

|| user||dimension||score||
|12345|0|  1.0|

||dimension||cluster||score||
|0|  1|  1.0|
|1|  0|  1.0|
|2|  2|  1.0|

|| user||dimension||score||dimension||cluster||score||
|12345|0|  1.0|0|  1|  1.0|

which is correct. However after writing and reading the data, we see this

dims.write.mode("overwrite").save("/tmp/dims2.parquet")
cent.write.mode("overwrite").save("/tmp/cent2.parquet")

val dims2 = spark.read.load("/tmp/dims2.parquet").as[UserDimensions]
val cent2 = spark.read.load("/tmp/cent2.parquet").as[CentroidClusterScore]

dims2.show
cent2.show

dims2.join(cent2, dims2("dimension") === cent2("dimension") ).show
outputs

|| user||dimension||score||
|12345|0|  1.0|

||dimension||cluster||score||
|0|  1|  1.0|
|1|  0|  1.0|
|2|  2|  1.0|

|| user||dimension||score||dimension||cluster||score||
|12345|0|  1.0| null|   null| null|

However, using the RDD API produces the correct result

dims2.rdd.map( row => (row.dimension, row) ).join( cent2.rdd.map( row => 
(row.dimension, row) ) ).take(5)

res5: Array[(Long, (UserDimensions, CentroidClusterScore))] = 
Array((0,(UserDimensions(12345,0,1.0),CentroidClusterScore(0,1,1.0

We've tried changing the output format to ORC instead of Parquet, but we see 
the same results. Running Spark 2.0 locally, not on a cluster, does not have 
this issue. Running Spark in local mode on the master node of the Hadoop 
cluster also works. Only when running on top of YARN do we see this issue.

This also seems very similar to this issue: 
https://issues.apache.org/jira/browse/SPARK-10896

We have also determined this appears to be related to the memory settings of 
the cluster. The worker machines have 56000 MB available; the node manager 
memory is set to 54784M and executor memory to 48407M when we see this issue 
happen. Lowering the executor memory to something like 28407M stops the issue 
from happening.

  was:
Environment can be reproduced via this git repo using the Deploy to Azure 
button: https://github.com/shankinson/spark

We have a cluster running Apache Spark 2.0 on Hadoop 2.7.2, Centos 7.2. We had 
written some new code using the Spark DataFrame/DataSet APIs but are noticing 
incorrect results on a join after writing and then reading data to Windows 
Azure Storage Blobs (The default HDFS location). I've been able to duplicate 
the issue with the following snippet of code running on the cluster.

case class UserDimensions(user: Long, dimension: Long, score: Double)
case class CentroidClusterScore(dimension: Long, cluster: Int, score: Double)

val dims = sc.parallelize(Array(UserDimensions(12345, 0, 1.0))).toDS
val cent = sc.parallelize(Array(CentroidClusterScore(0, 1, 
1.0),CentroidClusterScore(1, 0, 1.0),CentroidClusterScore(2, 2, 1.0))).toDS

dims.show
cent.show
dims.join(cent, dims("dimension") === cent("dimension") ).show
outputs

|| user||dimension||score||
|12345|0|  1.0|

||dimension||cluster||score||
|0|  1|  1.0|
|1|  0|  1.0|
|2|  2|  1.0|

|| user||dimension||score||dimension||cluster||score||
|12345|0|  1.0|0|  1|  1.0|

which is correct. However after writing and reading the data, we see this

dims.write.mode("overwrite").save("/tmp/dims2.parquet")
cent.write.mode("overwrite").save("/tmp/cent2.parquet")

val dims2 = spark.read.load("/tmp/dims2.parquet").as[UserDimensions]
val cent2 = spark.read.load("/tmp/cent2.parquet").as[CentroidClusterScore]

dims2.show
cent2.show

dims2.join(cent2, dims2("dimension") === cent2("dimension") ).show
outputs

|| user||dimension||score||
|12345|0|  1.0|

||dimension||cluster||score||
|0|  1|  1.0|
|1|  0|  1.0|
|2|  2|  1.0|

[jira] [Updated] (SPARK-17962) DataFrame/Dataset join not producing correct results in Spark 2.0/Yarn

2016-10-16 Thread Stephen Hankinson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen Hankinson updated SPARK-17962:
--
Description: 
Environment can be reproduced via this git repo using the Deploy to Azure 
button: https://github.com/shankinson/spark

We have a cluster running Apache Spark 2.0 on Hadoop 2.7.2, Centos 7.2. We had 
written some new code using the Spark DataFrame/DataSet APIs but are noticing 
incorrect results on a join after writing and then reading data to Windows 
Azure Storage Blobs (The default HDFS location). I've been able to duplicate 
the issue with the following snippet of code running on the cluster.

case class UserDimensions(user: Long, dimension: Long, score: Double)
case class CentroidClusterScore(dimension: Long, cluster: Int, score: Double)

val dims = sc.parallelize(Array(UserDimensions(12345, 0, 1.0))).toDS
val cent = sc.parallelize(Array(CentroidClusterScore(0, 1, 1.0),
  CentroidClusterScore(1, 0, 1.0), CentroidClusterScore(2, 2, 1.0))).toDS

dims.show
cent.show
dims.join(cent, dims("dimension") === cent("dimension") ).show
outputs

|| user||dimension||score||
|12345|0|  1.0|

||dimension||cluster||score||
|0|  1|  1.0|
|1|  0|  1.0|
|2|  2|  1.0|

|| user||dimension||score||dimension||cluster||score||
|12345|0|  1.0|0|  1|  1.0|

which is correct. However after writing and reading the data, we see this

dims.write.mode("overwrite").save("/tmp/dims2.parquet")
cent.write.mode("overwrite").save("/tmp/cent2.parquet")

val dims2 = spark.read.load("/tmp/dims2.parquet").as[UserDimensions]
val cent2 = spark.read.load("/tmp/cent2.parquet").as[CentroidClusterScore]

dims2.show
cent2.show

dims2.join(cent2, dims2("dimension") === cent2("dimension") ).show
outputs

|| user||dimension||score||
|12345|0|  1.0|

||dimension||cluster||score||
|0|  1|  1.0|
|1|  0|  1.0|
|2|  2|  1.0|

|| user||dimension||score||dimension||cluster||score||
|12345|0|  1.0| null|   null| null|

However, using the RDD API produces the correct result

dims2.rdd.map( row => (row.dimension, row) )
  .join( cent2.rdd.map( row => (row.dimension, row) ) )
  .take(5)

res5: Array[(Long, (UserDimensions, CentroidClusterScore))] = 
Array((0,(UserDimensions(12345,0,1.0),CentroidClusterScore(0,1,1.0

We've tried changing the output format to ORC instead of Parquet, but we see 
the same results. Running Spark 2.0 locally, not on a cluster, does not have 
this issue, and running Spark in local mode on the master node of the Hadoop 
cluster also works. Only when running on top of YARN do we see this issue.

This also seems very similar to this issue: 
https://issues.apache.org/jira/browse/SPARK-10896

We have also determined that this appears to be related to the memory settings 
of the cluster. The worker machines have 56000MB available; the node manager 
memory is set to 54784M and the executor memory to 48407M when we see this 
issue happen. Lowering the executor memory to something like 28407M stops the 
issue from happening.

  was:
Environment can be reproduced via this git repo using the Deploy to Azure 
button: https://github.com/shankinson/spark

We have a cluster running Apache Spark 2.0 on Hadoop 2.7.2, Centos 7.2. We had 
written some new code using the Spark DataFrame/DataSet APIs but are noticing 
incorrect results on a join after writing and then reading data to Windows 
Azure Storage Blobs (The default HDFS location). I've been able to duplicate 
the issue with the following snippet of code running on the cluster.

case class UserDimensions(user: Long, dimension: Long, score: Double)
case class CentroidClusterScore(dimension: Long, cluster: Int, score: Double)

val dims = sc.parallelize(Array(UserDimensions(12345, 0, 1.0))).toDS
val cent = sc.parallelize(Array(CentroidClusterScore(0, 1, 
1.0),CentroidClusterScore(1, 0, 1.0),CentroidClusterScore(2, 2, 1.0))).toDS

dims.show
cent.show
dims.join(cent, dims("dimension") === cent("dimension") ).show
outputs

+-+-+-+ 
| user|dimension|score|
+-+-+-+
|12345|0|  1.0|
+-+-+-+

+-+---+-+
|dimension|cluster|score|
+-+---+-+
|0|  1|  1.0|
|1|  0|  1.0|
|2|  2|  1.0|
+-+---+-+

+-+-+-+-+---+-+
| user|dimension|score|dimension|cluster|score|
+-+-+-+-+---+-+
|12345|0|  1.0|0|  1|  1.0|
+-+-+-+-+---+-+
which is correct. However after writing and reading the data, we see this

dims.write.mode("overwrite").save("/tmp/dims2.parquet")
cent.write.mode("overwrite").save("/tmp/cent2.parquet")

val dims2 = 

[jira] [Created] (SPARK-17962) DataFrame/Dataset join not producing correct results in Spark 2.0/Yarn

2016-10-16 Thread Stephen Hankinson (JIRA)
Stephen Hankinson created SPARK-17962:
-

 Summary: DataFrame/Dataset join not producing correct results in 
Spark 2.0/Yarn
 Key: SPARK-17962
 URL: https://issues.apache.org/jira/browse/SPARK-17962
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 2.0.0
 Environment: Centos 7.2, Hadoop 2.7.2, Spark 2.0.0
Reporter: Stephen Hankinson


Environment can be reproduced via this git repo using the Deploy to Azure 
button: https://github.com/shankinson/spark

We have a cluster running Apache Spark 2.0 on Hadoop 2.7.2, Centos 7.2. We had 
written some new code using the Spark DataFrame/DataSet APIs but are noticing 
incorrect results on a join after writing and then reading data to Windows 
Azure Storage Blobs (The default HDFS location). I've been able to duplicate 
the issue with the following snippet of code running on the cluster.

case class UserDimensions(user: Long, dimension: Long, score: Double)
case class CentroidClusterScore(dimension: Long, cluster: Int, score: Double)

val dims = sc.parallelize(Array(UserDimensions(12345, 0, 1.0))).toDS
val cent = sc.parallelize(Array(CentroidClusterScore(0, 1, 1.0),
  CentroidClusterScore(1, 0, 1.0), CentroidClusterScore(2, 2, 1.0))).toDS

dims.show
cent.show
dims.join(cent, dims("dimension") === cent("dimension") ).show
outputs

+-----+---------+-----+
| user|dimension|score|
+-----+---------+-----+
|12345|        0|  1.0|
+-----+---------+-----+

+---------+-------+-----+
|dimension|cluster|score|
+---------+-------+-----+
|        0|      1|  1.0|
|        1|      0|  1.0|
|        2|      2|  1.0|
+---------+-------+-----+

+-----+---------+-----+---------+-------+-----+
| user|dimension|score|dimension|cluster|score|
+-----+---------+-----+---------+-------+-----+
|12345|        0|  1.0|        0|      1|  1.0|
+-----+---------+-----+---------+-------+-----+
which is correct. However after writing and reading the data, we see this

dims.write.mode("overwrite").save("/tmp/dims2.parquet")
cent.write.mode("overwrite").save("/tmp/cent2.parquet")

val dims2 = spark.read.load("/tmp/dims2.parquet").as[UserDimensions]
val cent2 = spark.read.load("/tmp/cent2.parquet").as[CentroidClusterScore]

dims2.show
cent2.show

dims2.join(cent2, dims2("dimension") === cent2("dimension") ).show
outputs

+-----+---------+-----+
| user|dimension|score|
+-----+---------+-----+
|12345|        0|  1.0|
+-----+---------+-----+

+---------+-------+-----+
|dimension|cluster|score|
+---------+-------+-----+
|        0|      1|  1.0|
|        1|      0|  1.0|
|        2|      2|  1.0|
+---------+-------+-----+

+-----+---------+-----+---------+-------+-----+
| user|dimension|score|dimension|cluster|score|
+-----+---------+-----+---------+-------+-----+
|12345|        0|  1.0|     null|   null| null|
+-----+---------+-----+---------+-------+-----+

However, using the RDD API produces the correct result

dims2.rdd.map( row => (row.dimension, row) )
  .join( cent2.rdd.map( row => (row.dimension, row) ) )
  .take(5)

res5: Array[(Long, (UserDimensions, CentroidClusterScore))] = 
Array((0,(UserDimensions(12345,0,1.0),CentroidClusterScore(0,1,1.0

We've tried changing the output format to ORC instead of Parquet, but we see 
the same results. Running Spark 2.0 locally, not on a cluster, does not have 
this issue, and running Spark in local mode on the master node of the Hadoop 
cluster also works. Only when running on top of YARN do we see this issue.

This also seems very similar to this issue: 
https://issues.apache.org/jira/browse/SPARK-10896

We have also determined that this appears to be related to the memory settings 
of the cluster. The worker machines have 56000MB available; the node manager 
memory is set to 54784M and the executor memory to 48407M when we see this 
issue happen. Lowering the executor memory to something like 28407M stops the 
issue from happening.
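
For reference, a minimal sketch of lowering the executor memory programmatically 
(spark.executor.memory is the standard setting, equivalent to spark-submit's 
--executor-memory; the app name and value below are only illustrative, not a 
tuned recommendation):

{code}
// Sketch only: must be configured before the SparkSession/SparkContext is created.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-repro")                    // placeholder name
  .config("spark.executor.memory", "28g")   // illustrative value only
  .getOrCreate()
{code}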



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2016-10-16 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580502#comment-15580502
 ] 

Michael Schmeißer commented on SPARK-650:
-

What if I have a Hadoop InputFormat? Then, certain things happen before the 
first RDD exists, don't they?

I'll give the solution with the empty RDD a shot next week; this sounds a 
little bit better than what we have right now, but it still relies on certain 
Spark internals which are most likely undocumented and might change in the 
future. I've also had the feeling that Spark basically takes a functional 
approach to RDDs, so couldn't executing anything on an empty RDD be optimized 
to just do nothing?

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2016-10-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580135#comment-15580135
 ] 

Sean Owen commented on SPARK-650:
-

But, why do you need to do it before you have an RDD? You can easily make this 
a library function. Or, just some static init that happens on demand whenever a 
certain class is loaded. The nice thing about that is that it's transparent, 
just like with any singleton / static init in the JVM.

If you really want, you can make an empty RDD and repartition it and use that 
as a dummy, but it only serves to do some initialization early that would 
happen transparently anyway.
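
To make the singleton / static-init idea concrete, a minimal sketch (ReportingSetup, 
rdd and process are placeholder names, not part of any Spark API):

{code}
// Sketch only: a JVM singleton whose initialization runs on first use,
// at most once per executor JVM.
object ReportingSetup {
  lazy val ready: Boolean = {
    // e.g. configure a reporting / logging library here
    true
  }
}

rdd.map { x =>
  ReportingSetup.ready   // the first task on each executor pays the init cost
  process(x)             // process(...) stands in for the real work
}
{code}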

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17961) Add storageLevel to Dataset for SparkR

2016-10-16 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-17961:
---
Component/s: SQL
 SparkR

> Add storageLevel to Dataset for SparkR
> --
>
> Key: SPARK-17961
> URL: https://issues.apache.org/jira/browse/SPARK-17961
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, SQL
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Add storageLevel to Dataset for SparkR



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17961) Add storageLevel to Dataset for SparkR

2016-10-16 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-17961:
---
Issue Type: Improvement  (was: Bug)

> Add storageLevel to Dataset for SparkR
> --
>
> Key: SPARK-17961
> URL: https://issues.apache.org/jira/browse/SPARK-17961
> Project: Spark
>  Issue Type: Improvement
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Add storageLevel to Dataset for SparkR



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17961) Add storageLevel to Dataset for SparkR

2016-10-16 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580131#comment-15580131
 ] 

Weichen Xu commented on SPARK-17961:


I am working on it and will create a PR soon. 

> Add storageLevel to Dataset for SparkR
> --
>
> Key: SPARK-17961
> URL: https://issues.apache.org/jira/browse/SPARK-17961
> Project: Spark
>  Issue Type: Bug
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Add storageLevel to Dataset for SparkR



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17961) Add storageLevel to Dataset for SparkR

2016-10-16 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-17961:
--

 Summary: Add storageLevel to Dataset for SparkR
 Key: SPARK-17961
 URL: https://issues.apache.org/jira/browse/SPARK-17961
 Project: Spark
  Issue Type: Bug
Reporter: Weichen Xu


Add storageLevel to Dataset for SparkR




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2016-10-16 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580069#comment-15580069
 ] 

Michael Schmeißer commented on SPARK-650:
-

But I'll need to have an RDD to do this; I can't just do it during the 
SparkContext setup. Right now, we have multiple sources of RDDs, and every 
developer would still need to know that they have to run this code after 
creating an RDD, wouldn't they? Or is there some way to use a "pseudo-RDD" 
right after creating the SparkContext to execute the init code on the executors?

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2016-10-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580062#comment-15580062
 ] 

Sean Owen commented on SPARK-650:
-

This is still easy to do with mapPartitions, which can call 
{{initWithTheseParamsIfNotAlreadyInitialized(...)}} once per partition, which 
should guarantee it happens once per JVM before anything else proceeds. I don't 
think you need to bury it in serialization logic. I can see there are hard ways 
to implement this, but I believe an easy way is still readily available within 
the existing API mechanisms.
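
A minimal sketch of that pattern, assuming a user-side Setup object (none of 
these names come from Spark itself):

{code}
// Sketch only: idempotent init driven from mapPartitions, guarded once per JVM.
object Setup {
  private var done = false
  def initOnce(params: Map[String, String]): Unit = synchronized {
    if (!done) {
      // initialize logging / reporting with the driver-side params here
      done = true
    }
  }
}

val driverSideParams = Map("logDir" -> "/tmp/app-logs")   // illustrative only
rdd.mapPartitions { iter =>
  Setup.initOnce(driverSideParams)   // called once per partition, runs once per JVM
  iter.map(process)                  // process(...) is a placeholder
}
{code}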

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2016-10-16 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580055#comment-15580055
 ] 

Michael Schmeißer commented on SPARK-650:
-

Ok, let me explain the specific problems that we have encountered, which might 
help to understand the issue and possible solutions:

We need to run some code on the executors before anything gets processed, e.g. 
initialization of the log system or context setup. To do this, we need 
information that is present on the driver, but not on the executors. Our 
current solution is to provide a base class for Spark function implementations 
which contains the information from the driver and initializes everything in 
its readObject method. Since multiple narrow-dependent functions may be 
executed on the same executor JVM subsequently, this class needs to make sure 
that initialization doesn't run multiple times. Sure, that's not hard to do, 
but if you mix setup and cleanup logic for functions, partitions and/or the JVM 
itself, it can get quite confusing without explicit hooks.

So, our solution basically works, but with that approach, you can't use lambdas 
for Spark functions, which is quite inconvenient, especially for simple map 
operations. Even worse, if you use a lambda or otherwise forget to extend the 
required base class, the initialization doesn't occur and very weird exceptions 
follow, depending on which resource your function tries to access during its 
execution. Or if you have very bad luck, no exception will occur, but the log 
messages will get logged to an incorrect destination. It's very hard to prevent 
such cases without an explicit initialization mechanism and in a team with 
several developers, you can't expect everyone to know what is going on there.
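
Roughly, the pattern described above looks like the following sketch (all names 
are illustrative; nothing here is a Spark API):

{code}
// Sketch only: a serializable base class that re-initializes executor-side state
// when it is deserialized, guarded so the init runs at most once per JVM.
import java.io.ObjectInputStream

object ExecutorInit {
  private var done = false
  def ensure(info: Map[String, String]): Unit = synchronized {
    if (!done) {
      // set up the log system, context information, etc. from the driver-side info
      done = true
    }
  }
}

abstract class InitializingFunction(driverInfo: Map[String, String]) extends Serializable {
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()            // restores driverInfo on the executor
    ExecutorInit.ensure(driverInfo)   // idempotent, so repeated deserialization is safe
  }
}
{code}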

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14141) Let user specify datatypes of pandas dataframe in toPandas()

2016-10-16 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579949#comment-15579949
 ] 

holdenk commented on SPARK-14141:
-

Ah, sorry for the delay. The cache + count is done together because just doing 
a cache won't actually cache anything until an action is performed on the 
RDD/DataFrame; the count is used to force evaluation of the entire DataFrame.
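
In code terms the idea is just the following (shown with the Scala DataFrame API 
for brevity; df is a placeholder for whatever DataFrame is about to be converted):

{code}
// Sketch only: cache() is lazy, so a cheap action such as count() is used to
// force the whole DataFrame to be materialized (and cached) up front.
val cached = df.cache()   // marks the data for caching; nothing is computed yet
cached.count()            // action: evaluates the full DataFrame and fills the cache
{code}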

> Let user specify datatypes of pandas dataframe in toPandas()
> 
>
> Key: SPARK-14141
> URL: https://issues.apache.org/jira/browse/SPARK-14141
> Project: Spark
>  Issue Type: New Feature
>  Components: Input/Output, PySpark, SQL
>Reporter: Luke Miner
>Priority: Minor
>
> Would be nice to specify the dtypes of the pandas dataframe during the 
> toPandas() call. Something like:
> bq. pdf = df.toPandas(dtypes={'a': 'float64', 'b': 'datetime64', 'c': 'bool', 
> 'd': 'category'})
> Since dtypes like `category` are more memory efficient, you could potentially 
> load many more rows into a pandas dataframe with this option without running 
> out of memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas

2016-10-16 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579946#comment-15579946
 ] 

holdenk commented on SPARK-13534:
-

And now they have a release :) I'm not certain it's at the stage where we can 
use it - but I'll do some poking over the next few weeks :)

> Implement Apache Arrow serializer for Spark DataFrame for use in 
> DataFrame.toPandas
> ---
>
> Key: SPARK-13534
> URL: https://issues.apache.org/jira/browse/SPARK-13534
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Wes McKinney
>
> The current code path for accessing Spark DataFrame data in Python using 
> PySpark passes through an inefficient serialization-deserialiation process 
> that I've examined at a high level here: 
> https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] 
> objects are being deserialized in pure Python as a list of tuples, which are 
> then converted to pandas.DataFrame using its {{from_records}} alternate 
> constructor. This also uses a large amount of memory.
> For flat (no nested types) schemas, the Apache Arrow memory layout 
> (https://github.com/apache/arrow/tree/master/format) can be deserialized to 
> {{pandas.DataFrame}} objects with comparatively small overhead compared with 
> memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, 
> replacing the corresponding null values with pandas's sentinel values (None 
> or NaN as appropriate).
> I will be contributing patches to Arrow in the coming weeks for converting 
> between Arrow and pandas in the general case, so if Spark can send Arrow 
> memory to PySpark, we will hopefully be able to increase the Python data 
> access throughput by an order of magnitude or more. I propose to add an new 
> serializer for Spark DataFrame and a new method that can be invoked from 
> PySpark to request a Arrow memory-layout byte stream, prefixed by a data 
> header indicating array buffer offsets and sizes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12753) Import error during unit test while calling a function from reduceByKey()

2016-10-16 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579942#comment-15579942
 ] 

holdenk commented on SPARK-12753:
-

(Oh, as a follow-up, it appears the user answered their own question on Stack 
Overflow, so all is well.) :)

> Import error during unit test while calling a function from reduceByKey()
> -
>
> Key: SPARK-12753
> URL: https://issues.apache.org/jira/browse/SPARK-12753
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 1.6.0
> Environment: El Capitan, Single cluster Hadoop, Python 3, Spark 1.6, 
> Anaconda 
>Reporter: Dat Tran
>Priority: Trivial
>  Labels: pyspark, python3, unit-test
> Attachments: log.txt, map.py, test_map.py
>
>
> The current directory structure for my test script is as follows:
> project/
>script/
>   __init__.py 
>   map.py
>test/
>  __init.py__
>  test_map.py
> I have attached map.py and test_map.py file with this issue. 
> When I run the nosetest in the test directory, the test fails. I get no 
> module named "script" found error. 
> However when I modify the map_add function to replace the call to add within 
> reduceByKey in map.py like this:
> def map_add(df):
> result = df.map(lambda x: (x.key, x.value)).reduceByKey(lambda x,y: 
> x+y)
> return result
> The test passes.
> Also, when I run the original test_map.py from the project directory, the 
> test passes. 
> I am not able to figure out why the test doesn't detect the script module 
> when it is within the test directory. 
> I have also attached the log error file. Any help will be much appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12753) Import error during unit test while calling a function from reduceByKey()

2016-10-16 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk closed SPARK-12753.
---
Resolution: Not A Problem

I don't believe this is a PySpark issue; rather, it seems like a Python 
issue/question. If I've misunderstood feel free to re-open - but you might 
have better luck with the user list or Stack Overflow if this is a Python issue.

> Import error during unit test while calling a function from reduceByKey()
> -
>
> Key: SPARK-12753
> URL: https://issues.apache.org/jira/browse/SPARK-12753
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 1.6.0
> Environment: El Capitan, Single cluster Hadoop, Python 3, Spark 1.6, 
> Anaconda 
>Reporter: Dat Tran
>Priority: Trivial
>  Labels: pyspark, python3, unit-test
> Attachments: log.txt, map.py, test_map.py
>
>
> The current directory structure for my test script is as follows:
> project/
>script/
>   __init__.py 
>   map.py
>test/
>  __init.py__
>  test_map.py
> I have attached map.py and test_map.py file with this issue. 
> When I run the nosetest in the test directory, the test fails. I get no 
> module named "script" found error. 
> However when I modify the map_add function to replace the call to add within 
> reduceByKey in map.py like this:
> def map_add(df):
> result = df.map(lambda x: (x.key, x.value)).reduceByKey(lambda x,y: 
> x+y)
> return result
> The test passes.
> Also, when I run the original test_map.py from the project directory, the 
> test passes. 
> I am not able to figure out why the test doesn't detect the script module 
> when it is within the test directory. 
> I have also attached the log error file. Any help will be much appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2016-10-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579928#comment-15579928
 ] 

Sean Owen commented on SPARK-650:
-

Yeah that's a decent use case, because latency is an issue (streaming) and you 
potentially have time to set up before latency matters. 

You can still use this approach because empty RDDs arrive if no data has come 
in, and empty RDDs can still be repartitioned. Here's a way, if the first RDD 
has no data, to do something once per partition, which ought to amount to at 
least once per executor:

{code}
var first = true
lines.foreachRDD { rdd =>
  if (first) {
    if (rdd.isEmpty) {
      rdd.repartition(sc.defaultParallelism).foreachPartition(_ => Thing.initOnce())
    }
    first = false
  }
}
{code}

"ought", because, there isn't actually a guarantee that it will put the empty 
partitions on different executors. In practice, it seems to, when I just tried 
it.

That's a partial solution, but it's an optimization anyway, and maybe it helps 
you right now. I am still not sure it means this needs a whole mechanism, if 
this is the only type of use case. Maybe there are others.

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11223) PySpark CrossValidatorModel does not output metrics for every param in paramGrid

2016-10-16 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-11223.
-
Resolution: Fixed

Fixed in SPARK-12810 by [~vectorijk] :)

> PySpark CrossValidatorModel does not output metrics for every param in 
> paramGrid
> 
>
> Key: SPARK-11223
> URL: https://issues.apache.org/jira/browse/SPARK-11223
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Raela Wang
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11223) PySpark CrossValidatorModel does not output metrics for every param in paramGrid

2016-10-16 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579923#comment-15579923
 ] 

holdenk commented on SPARK-11223:
-

Oh wait, it looks like we've already done this and I was looking at the wrong 
tuning model. Going to go ahead and resolve :)

> PySpark CrossValidatorModel does not output metrics for every param in 
> paramGrid
> 
>
> Key: SPARK-11223
> URL: https://issues.apache.org/jira/browse/SPARK-11223
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Raela Wang
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11223) PySpark CrossValidatorModel does not output metrics for every param in paramGrid

2016-10-16 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579921#comment-15579921
 ] 

holdenk commented on SPARK-11223:
-

This could be a good starter issue for someone interested in ML or PySpark. You 
will want to expose some additional getters on the Python CrossValidator model 
so users can get at the avgMetrics. Feel free to ping me for help if you want 
:) If no one takes this by December I'll probably give it a shot myself :)

> PySpark CrossValidatorModel does not output metrics for every param in 
> paramGrid
> 
>
> Key: SPARK-11223
> URL: https://issues.apache.org/jira/browse/SPARK-11223
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Raela Wang
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11223) PySpark CrossValidatorModel does not output metrics for every param in paramGrid

2016-10-16 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-11223:

Labels: starter  (was: )

> PySpark CrossValidatorModel does not output metrics for every param in 
> paramGrid
> 
>
> Key: SPARK-11223
> URL: https://issues.apache.org/jira/browse/SPARK-11223
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Raela Wang
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10635) pyspark - running on a different host

2016-10-16 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579914#comment-15579914
 ] 

holdenk commented on SPARK-10635:
-

It would be a bit difficult, although as Py4J is speeding up the bridge it 
might not be impossible. I think realistically, though, this is probably a 
won't-fix, since there doesn't seem to be a lot of interest in this feature.

> pyspark - running on a different host
> -
>
> Key: SPARK-10635
> URL: https://issues.apache.org/jira/browse/SPARK-10635
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Ben Duffield
>
> At various points we assume we only ever talk to a driver on the same host.
> e.g. 
> https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615
> We use pyspark to connect to an existing driver (i.e. do not let pyspark 
> launch the driver itself, but instead construct the SparkContext with the 
> gateway and jsc arguments.
> There are a few reasons for this, but essentially it's to allow more 
> flexibility when running in AWS.
> Before 1.3.1 we were able to monkeypatch around this:  
> {code}
> def _load_from_socket(port, serializer):
> sock = socket.socket()
> sock.settimeout(3)
> try:
> sock.connect((host, port))
> rf = sock.makefile("rb", 65536)
> for item in serializer.load_stream(rf):
> yield item
> finally:
> sock.close()
> pyspark.rdd._load_from_socket = _load_from_socket
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10628) Add support for arbitrary RandomRDD generation to PySparkAPI

2016-10-16 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579910#comment-15579910
 ] 

holdenk commented on SPARK-10628:
-

For someone who is interested in doing this, we might be able to do something 
based on how parallelize for Python is special-cased for generators.

> Add support for arbitrary RandomRDD generation to PySparkAPI
> 
>
> Key: SPARK-10628
> URL: https://issues.apache.org/jira/browse/SPARK-10628
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Reporter: holdenk
>Priority: Minor
>
> SPARK-2724 added support for specific RandomRDDs, add support for arbitrary 
> Random RDD generation for feature parity with Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10525) Add Python example for VectorSlicer to user guide

2016-10-16 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk closed SPARK-10525.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Fixed in SPARK-14514 by [~podongfeng]

> Add Python example for VectorSlicer to user guide
> -
>
> Key: SPARK-10525
> URL: https://issues.apache.org/jira/browse/SPARK-10525
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10525) Add Python example for VectorSlicer to user guide

2016-10-16 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579906#comment-15579906
 ] 

holdenk commented on SPARK-10525:
-

It looks like it does; I'm going to go ahead and resolve this. Thanks [~amicon] 
:)

> Add Python example for VectorSlicer to user guide
> -
>
> Key: SPARK-10525
> URL: https://issues.apache.org/jira/browse/SPARK-10525
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10319) ALS training using PySpark throws a StackOverflowError

2016-10-16 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579901#comment-15579901
 ] 

holdenk commented on SPARK-10319:
-

Is this issue still occurring for you?

> ALS training using PySpark throws a StackOverflowError
> --
>
> Key: SPARK-10319
> URL: https://issues.apache.org/jira/browse/SPARK-10319
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.1
> Environment: Windows 10, spark - 1.4.1,
>Reporter: Velu nambi
>
> When attempting to train a machine learning model using ALS in Spark's MLLib 
> (1.4) on windows, Pyspark always terminates with a StackoverflowError. I 
> tried adding the checkpoint as described in 
> http://stackoverflow.com/a/31484461/36130 -- doesn't seem to help.
> Here's the training code and stack trace:
> {code:none}
> ranks = [8, 12]
> lambdas = [0.1, 10.0]
> numIters = [10, 20]
> bestModel = None
> bestValidationRmse = float("inf")
> bestRank = 0
> bestLambda = -1.0
> bestNumIter = -1
> for rank, lmbda, numIter in itertools.product(ranks, lambdas, numIters):
> ALS.checkpointInterval = 2
> model = ALS.train(training, rank, numIter, lmbda)
> validationRmse = computeRmse(model, validation, numValidation)
> if (validationRmse < bestValidationRmse):
>  bestModel = model
>  bestValidationRmse = validationRmse
>  bestRank = rank
>  bestLambda = lmbda
>  bestNumIter = numIter
> testRmse = computeRmse(bestModel, test, numTest)
> {code}
> Stacktrace:
> 15/08/27 02:02:58 ERROR Executor: Exception in task 3.0 in stage 56.0 (TID 
> 127)
> java.lang.StackOverflowError
> at java.io.ObjectInputStream$BlockDataInputStream.readInt(Unknown Source)
> at java.io.ObjectInputStream.readHandle(Unknown Source)
> at java.io.ObjectInputStream.readClassDesc(Unknown Source)
> at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
> at java.io.ObjectInputStream.readObject0(Unknown Source)
> at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
> at java.io.ObjectInputStream.readSerialData(Unknown Source)
> at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
> at java.io.ObjectInputStream.readObject0(Unknown Source)
> at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
> at java.io.ObjectInputStream.readSerialData(Unknown Source)
> at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
> at java.io.ObjectInputStream.readObject0(Unknown Source)
> at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
> at java.io.ObjectInputStream.readSerialData(Unknown Source)
> at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
> at java.io.ObjectInputStream.readObject0(Unknown Source)
> at java.io.ObjectInputStream.readObject(Unknown Source)
> at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
> at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at java.io.ObjectStreamClass.invokeReadObject(Unknown Source)
> at java.io.ObjectInputStream.readSerialData(Unknown Source)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17960) Upgrade to Py4J 0.10.4

2016-10-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579903#comment-15579903
 ] 

Sean Owen commented on SPARK-17960:
---

It seems OK. If the changes aren't essential, maybe we don't have to update to 
every single maintenance release, but getting a fix in and at least 
periodically updating this (and other components) seems like a win. I had 
thought to review the updates available from key dependencies before 2.1.0 and 
perform a few other updates like this too.

> Upgrade to Py4J 0.10.4
> --
>
> Key: SPARK-17960
> URL: https://issues.apache.org/jira/browse/SPARK-17960
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>Priority: Trivial
>  Labels: starter
>
> In general we should try and keep up to date with Py4J's new releases. The 
> changes in this one are small ( 
> https://github.com/bartdag/py4j/milestone/21?closed=1 ) and shouldn't impact 
> Spark in any significant way so I'm going to tag this as a starter issue for 
> someone looking to get a deeper understanding of how PySpark works.
> Upgrading Py4J can be a bit tricky compared to updating other packages. In 
> general the steps are:
> 1) Upgrade the Py4J version on the Java side
> 2) Update the py4j src zip file we bundle with Spark
> 3) Make sure everything still works (especially the streaming tests, because 
> we do weird things to make streaming work and it's the most likely place to 
> break during a Py4J upgrade).
> You can see how these bits have been done in past releases by looking in the 
> git log for the last time we changed the Py4J version numbers. Sometimes even 
> for "compatible" releases like this one we may need to make some small code 
> changes inside of PySpark because we hook into Py4J's internals, but I don't 
> think this should be the case here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10223) Add takeOrderedByKey function to extract top N records within each group

2016-10-16 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk closed SPARK-10223.
---
Resolution: Won't Fix

I don't see this feature being particularly popular, especially since it's 
relatively easy to implement outside of Spark itself. If you disagree, feel 
free to re-open this.
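
As a rough illustration of how this can live outside of Spark, a sketch built on 
the existing combineByKey API (it keeps a small sorted Seq per key instead of the 
fixed-size heap mentioned in the description, trading efficiency for brevity):

{code}
// Sketch only: top-N values per key using combineByKey.
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def topNByKey[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)], n: Int)
                                       (implicit ord: Ordering[V]): RDD[(K, Seq[V])] =
  rdd.combineByKey(
    (v: V) => Seq(v),                                                // start a group
    (acc: Seq[V], v: V) => (acc :+ v).sorted(ord.reverse).take(n),   // add one value
    (a: Seq[V], b: Seq[V]) => (a ++ b).sorted(ord.reverse).take(n)   // merge groups
  )

// e.g. top 3 scores per user, where pairs: RDD[(String, Double)]
// val top3 = topNByKey(pairs, 3)
{code}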

> Add takeOrderedByKey function to extract top N records within each group
> 
>
> Key: SPARK-10223
> URL: https://issues.apache.org/jira/browse/SPARK-10223
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Ritesh Agrawal
>Priority: Minor
>
> Currently PySpark has takeOrdered function that returns top N records. 
> However often you want to extract top N records within each group. This can 
> be easily implemented using combineByKey operation and using fixed size heap 
> to capture top N within each group. A working solution can be found over 
> [here](https://ragrawal.wordpress.com/2015/08/25/pyspark-top-n-records-in-each-group/)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9965) Scala, Python SQLContext input methods' deprecation statuses do not match

2016-10-16 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk closed SPARK-9965.
--
   Resolution: Resolved
Fix Version/s: 2.0.0

These methods were removed in 77ab49b8575d2ebd678065fa70b0343d532ab9c2 by 
[~rxin].

> Scala, Python SQLContext input methods' deprecation statuses do not match
> -
>
> Key: SPARK-9965
> URL: https://issues.apache.org/jira/browse/SPARK-9965
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
> Fix For: 2.0.0
>
>
> Scala's SQLContext has several methods for data input (jsonFile, jsonRDD, 
> etc.) deprecated.  These methods are not deprecated in Python's SQLContext.  
> They should be, but only after Python's DataFrameReader implements analogous 
> methods.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9931) Flaky test: mllib/tests.py StreamingLogisticRegressionWithSGDTests. test_training_and_prediction

2016-10-16 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579889#comment-15579889
 ] 

holdenk commented on SPARK-9931:


Is this still a test people are finding flaky, or did [~josephkb]'s fix work?

> Flaky test: mllib/tests.py StreamingLogisticRegressionWithSGDTests. 
> test_training_and_prediction
> 
>
> Key: SPARK-9931
> URL: https://issues.apache.org/jira/browse/SPARK-9931
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Reporter: Davies Liu
>Priority: Critical
>
> {code}
> FAIL: test_training_and_prediction 
> (__main__.StreamingLogisticRegressionWithSGDTests)
> Test that the model improves on toy data with no. of batches
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder/python/pyspark/mllib/tests.py",
>  line 1250, in test_training_and_prediction
> self.assertTrue(errors[1] - errors[-1] > 0.3)
> AssertionError: False is not true
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7653) ML Pipeline and meta-algs should take random seed param

2016-10-16 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579884#comment-15579884
 ] 

holdenk commented on SPARK-7653:


I think the simplest workaround would be exposing HasSeed to the public. What 
do you think?

> ML Pipeline and meta-algs should take random seed param
> ---
>
> Key: SPARK-7653
> URL: https://issues.apache.org/jira/browse/SPARK-7653
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> ML Pipelines and other meta-algorithms should implement HasSeed.  If the seed 
> is set, then the meta-alg will use that seed to generate new seeds to pass to 
> every component PipelineStage.
> * Note: This may require a little discussion about whether HasSeed should be 
> a public API.
> This will make it easier for users to have reproducible results for entire 
> pipelines (rather than setting the seed for each stage manually).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7941) Cache Cleanup Failure when job is killed by Spark

2016-10-16 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579878#comment-15579878
 ] 

holdenk commented on SPARK-7941:


So if it's ok - since I don't see other reports of this - unless this is an 
issue someone (including [~cqnguyen]) is still experiencing, I'll go ahead and 
soft-close this at the end of next week.

> Cache Cleanup Failure when job is killed by Spark 
> --
>
> Key: SPARK-7941
> URL: https://issues.apache.org/jira/browse/SPARK-7941
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 1.3.1
>Reporter: Cory Nguyen
> Attachments: screenshot-1.png
>
>
> Problem/Bug:
> If a job is running and Spark kills the job intentionally, the cache files 
> remains on the local/worker nodes and are not cleaned up properly. Over time 
> the old cache builds up and causes "No Space Left on Device" error. 
> The cache is cleaned up properly when the job succeeds. I have not verified 
> if the cached remains when the user intentionally kills the job. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7721) Generate test coverage report from Python

2016-10-16 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579874#comment-15579874
 ] 

holdenk commented on SPARK-7721:


[~joshrosen] is this something you're still looking at/interested in, or would 
this be a good place for someone else to step up and help out if you don't 
have the review bandwidth?

> Generate test coverage report from Python
> -
>
> Key: SPARK-7721
> URL: https://issues.apache.org/jira/browse/SPARK-7721
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Reporter: Reynold Xin
>
> Would be great to have test coverage report for Python. Compared with Scala, 
> it is tricker to understand the coverage without coverage reports in Python 
> because we employ both docstring tests and unit tests in test files. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3981) Consider a better approach to initialize SerDe on executors

2016-10-16 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579868#comment-15579868
 ] 

holdenk commented on SPARK-3981:


That's a good question. It seems like much of the code that was copied in 
SPARK-3971 is still around, but given our slow migration from MLlib to ML I'm 
thinking that we should maybe consider marking this as a "Won't Fix". What are 
your thoughts [~mengxr]?

> Consider a better approach to initialize SerDe on executors
> ---
>
> Key: SPARK-3981
> URL: https://issues.apache.org/jira/browse/SPARK-3981
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.2.0
>Reporter: Xiangrui Meng
>
> In SPARK-3971, we copied SerDe code from Core to MLlib in order to recognize 
> MLlib types on executors as a hotfix. This is not ideal. We should find a way 
> to add hooks to the SerDe in Core to support MLlib types in a pluggable way.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2868) Support named accumulators in Python

2016-10-16 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579866#comment-15579866
 ] 

holdenk commented on SPARK-2868:


ping [~davies] - would you be available to review if I got this switched around?

> Support named accumulators in Python
> 
>
> Key: SPARK-2868
> URL: https://issues.apache.org/jira/browse/SPARK-2868
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Patrick Wendell
>
> SPARK-2380 added this for Java/Scala. To allow this in Python we'll need to 
> make some additional changes. One potential path is to have a 1:1 
> correspondence with Scala accumulators (instead of a one-to-many). A 
> challenge is exposing the stringified values of the accumulators to the Scala 
> code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-10-16 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579847#comment-15579847
 ] 

holdenk commented on SPARK-9487:


Great, thanks for taking this issue on :)

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17960) Upgrade to Py4J 0.10.4

2016-10-16 Thread holdenk (JIRA)
holdenk created SPARK-17960:
---

 Summary: Upgrade to Py4J 0.10.4
 Key: SPARK-17960
 URL: https://issues.apache.org/jira/browse/SPARK-17960
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: holdenk
Priority: Trivial


In general we should try and keep up to date with Py4J's new releases. The 
changes in this one are small ( 
https://github.com/bartdag/py4j/milestone/21?closed=1 ) and shouldn't impact 
Spark in any significant way so I'm going to tag this as a starter issue for 
someone looking to get a deeper understanding of how PySpark works.

Upgrading Py4J can be a bit tricky compared to updating other packages. In 
general the steps are:
1) Upgrade the Py4J version on the Java side
2) Update the py4j src zip file we bundle with Spark
3) Make sure everything still works (especially the streaming tests, because we 
do weird things to make streaming work and it's the most likely place to break 
during a Py4J upgrade).

You can see how these bits have been done in past releases by looking in the 
git log for the last time we changed the Py4J version numbers. Sometimes even 
for "compatible" releases like this one we may need to make some small code 
changes inside of PySpark because we hook into Py4J's internals, but I don't 
think this should be the case here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17959) spark.sql.join.preferSortMergeJoin has no effect for simple join due to calculated size of LogicalRdd

2016-10-16 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-17959:

Summary: spark.sql.join.preferSortMergeJoin has no effect for simple join 
due to calculated size of LogicalRdd  (was: spark.sql.join.preferSortMergeJoin 
has no effect for simple join)

> spark.sql.join.preferSortMergeJoin has no effect for simple join due to 
> calculated size of LogicalRdd
> -
>
> Key: SPARK-17959
> URL: https://issues.apache.org/jira/browse/SPARK-17959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Stavros Kontopoulos
>
> Example code:   
> val df = spark.sparkContext.parallelize(List(("A", 10, "dss@s1"), ("A", 20, 
> "dss@s2"),
>   ("B", 1, "dss@qqa"), ("B", 2, "dss@qqb"))).toDF("Group", "Amount", 
> "Email")
> df.as("a").join(df.as("b"))
>   .where($"a.Group" === $"b.Group")
>   .explain()
> I always get the SortMerge strategy (never shuffle hash join) even if I set 
> spark.sql.join.preferSortMergeJoin to false since:
> sizeInBytes = 2^63-1
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala#L101
> and thus the condition here: 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L127
> is always false...
> I think this shouldn't be the case; my df has a specific size and number of 
> partitions (200, which is btw far from optimal)...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17959) spark.sql.join.preferSortMergeJoin has no effect for simple join

2016-10-16 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-17959:

Description: 
Example code:   
val df = spark.sparkContext.parallelize(List(("A", 10, "dss@s1"), ("A", 20, 
"dss@s2"),
  ("B", 1, "dss@qqa"), ("B", 2, "dss@qqb"))).toDF("Group", "Amount", 
"Email")

df.as("a").join(df.as("b"))
  .where($"a.Group" === $"b.Group")
  .explain()

I always get the SortMerge strategy even if I set 
spark.sql.join.preferSortMergeJoin to false, since
sizeInBytes = 2^63-1
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala#L101
and thus the condition here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L127
is always false.
I think this shouldn't be the case: my df has a specific size and number of 
partitions (200, which is by the way far from optimal).

  was:
Example code:   
val df = spark.sparkContext.parallelize(List(("A", 10, "dss@s1"), ("A", 20, 
"dss@s2"),
  ("B", 1, "dss@qqa"), ("B", 2, "dss@qqb"))).toDF("Group", "Amount", 
"Email")

df.as("a").join(df.as("b"))
  .where($"a.Group" === $"b.Group")
  .explain()

I always get the SortMerge strategy even if I set 
spark.sql.join.preferSortMergeJoin to false, since
sizeInBytes = 2^63-1
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala#L101
and thus the condition here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L127
is always false.
I think this shouldn't be the case.


> spark.sql.join.preferSortMergeJoin has no effect for simple join
> 
>
> Key: SPARK-17959
> URL: https://issues.apache.org/jira/browse/SPARK-17959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Stavros Kontopoulos
>
> Example code:   
> val df = spark.sparkContext.parallelize(List(("A", 10, "dss@s1"), ("A", 20, 
> "dss@s2"),
>   ("B", 1, "dss@qqa"), ("B", 2, "dss@qqb"))).toDF("Group", "Amount", 
> "Email")
> df.as("a").join(df.as("b"))
>   .where($"a.Group" === $"b.Group")
>   .explain()
> I always get the SortMerge strategy even if I set 
> spark.sql.join.preferSortMergeJoin to false, since
> sizeInBytes = 2^63-1
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala#L101
> and thus the condition here:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L127
> is always false.
> I think this shouldn't be the case: my df has a specific size and number of 
> partitions (200, which is by the way far from optimal).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17959) spark.sql.join.preferSortMergeJoin has no effect for simple join

2016-10-16 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-17959:

Description: 
Example code:   
val df = spark.sparkContext.parallelize(List(("A", 10, "dss@s1"), ("A", 20, 
"dss@s2"),
  ("B", 1, "dss@qqa"), ("B", 2, "dss@qqb"))).toDF("Group", "Amount", 
"Email")

df.as("a").join(df.as("b"))
  .where($"a.Group" === $"b.Group")
  .explain()

I always get the SortMerge strategy (never shuffle hash join) even if I set 
spark.sql.join.preferSortMergeJoin to false, since
sizeInBytes = 2^63-1
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala#L101
and thus the condition here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L127
is always false.
I think this shouldn't be the case: my df has a specific size and number of 
partitions (200, which is by the way far from optimal).

  was:
Example code:   
val df = spark.sparkContext.parallelize(List(("A", 10, "dss@s1"), ("A", 20, 
"dss@s2"),
  ("B", 1, "dss@qqa"), ("B", 2, "dss@qqb"))).toDF("Group", "Amount", 
"Email")

df.as("a").join(df.as("b"))
  .where($"a.Group" === $"b.Group")
  .explain()

I always get the SortMerge strategy even if I set 
spark.sql.join.preferSortMergeJoin to false, since
sizeInBytes = 2^63-1
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala#L101
and thus the condition here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L127
is always false.
I think this shouldn't be the case: my df has a specific size and number of 
partitions (200, which is by the way far from optimal).


> spark.sql.join.preferSortMergeJoin has no effect for simple join
> 
>
> Key: SPARK-17959
> URL: https://issues.apache.org/jira/browse/SPARK-17959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Stavros Kontopoulos
>
> Example code:   
> val df = spark.sparkContext.parallelize(List(("A", 10, "dss@s1"), ("A", 20, 
> "dss@s2"),
>   ("B", 1, "dss@qqa"), ("B", 2, "dss@qqb"))).toDF("Group", "Amount", 
> "Email")
> df.as("a").join(df.as("b"))
>   .where($"a.Group" === $"b.Group")
>   .explain()
> I always get the SortMerge strategy (never shuffle hash join) even if I set 
> spark.sql.join.preferSortMergeJoin to false, since
> sizeInBytes = 2^63-1
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala#L101
> and thus the condition here:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L127
> is always false.
> I think this shouldn't be the case: my df has a specific size and number of 
> partitions (200, which is by the way far from optimal).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17959) spark.sql.join.preferSortMergeJoin has no effect for simple join

2016-10-16 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-17959:

Description: 
Example code:   
val df = spark.sparkContext.parallelize(List(("A", 10, "dss@s1"), ("A", 20, 
"dss@s2"),
  ("B", 1, "dss@qqa"), ("B", 2, "dss@qqb"))).toDF("Group", "Amount", 
"Email")

df.as("a").join(df.as("b"))
  .where($"a.Group" === $"b.Group")
  .explain()

I always get the SortMerge strategy even if I set 
spark.sql.join.preferSortMergeJoin to false, since
sizeInBytes = 2^63-1
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala#L101
and thus the condition here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L127
is always false.
I think this shouldn't be the case.

  was:
Example code:   
val df = spark.sparkContext.parallelize(List(("A", 10, "dss@s1"), ("A", 20, 
"dss@s2"),
  ("B", 1, "dss@qqa"), ("B", 2, "dss@qqb"))).toDF("Group", "Amount", 
"Email")

df.as("a").join(df.as("b"))
  .where($"a.Group" === $"b.Group")
  .explain()

I always get the SortMerge strategy even if I set 
spark.sql.join.preferSortMergeJoin to false, since
sizeInBytes = 2^63-1
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala#L101
and thus the condition here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L127
is always false.


> spark.sql.join.preferSortMergeJoin has no effect for simple join
> 
>
> Key: SPARK-17959
> URL: https://issues.apache.org/jira/browse/SPARK-17959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Stavros Kontopoulos
>
> Example code:   
> val df = spark.sparkContext.parallelize(List(("A", 10, "dss@s1"), ("A", 20, 
> "dss@s2"),
>   ("B", 1, "dss@qqa"), ("B", 2, "dss@qqb"))).toDF("Group", "Amount", 
> "Email")
> df.as("a").join(df.as("b"))
>   .where($"a.Group" === $"b.Group")
>   .explain()
> I always get the SortMerge strategy even if I set 
> spark.sql.join.preferSortMergeJoin to false, since
> sizeInBytes = 2^63-1
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala#L101
> and thus the condition here:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L127
> is always false.
> I think this shouldn't be the case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17959) spark.sql.join.preferSortMergeJoin has no effect for simple join

2016-10-16 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created SPARK-17959:
---

 Summary: spark.sql.join.preferSortMergeJoin has no effect for 
simple join
 Key: SPARK-17959
 URL: https://issues.apache.org/jira/browse/SPARK-17959
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
Reporter: Stavros Kontopoulos


Example code:   
val df = spark.sparkContext.parallelize(List(("A", 10, "dss@s1"), ("A", 20, 
"dss@s2"),
  ("B", 1, "dss@qqa"), ("B", 2, "dss@qqb"))).toDF("Group", "Amount", 
"Email")

df.as("a").join(df.as("b"))
  .where($"a.Group" === $"b.Group")
  .explain()

I always get the SortMerge strategy even if I set 
spark.sql.join.preferSortMergeJoin to false, since
sizeInBytes = 2^63-1
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala#L101
and thus the condition here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L127
is always false.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2016-10-16 Thread Olivier Armand (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579737#comment-15579737
 ] 

Olivier Armand commented on SPARK-650:
--

Data doesn't necessarily arrive immediately, but we need to ensure that when 
it arrives, lazy initialization doesn't introduce latency.

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2016-10-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579720#comment-15579720
 ] 

Sean Owen commented on SPARK-650:
-

It would work in this case to immediately schedule initialization on the 
executors, because it sounds like data arrives right away in your case. The 
part I am missing is how initialization could happen any faster than this with 
another mechanism.
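
For what it's worth, a minimal sketch of what "immediately schedule initialization on the executors" could look like in a streaming app. The Connections object and warmUp helper are made-up names, not an existing API, and Spark does not guarantee that every executor receives one of these warm-up tasks:

{noformat}
import org.apache.spark.streaming.StreamingContext

// Hypothetical per-JVM initializer: the lazy val body runs at most once per
// executor JVM, the first time any task on that executor calls ensure().
object Connections {
  private lazy val opened: Unit = {
    // open client connections, load native libs, configure reporting, ...
  }
  def ensure(): Unit = opened
}

// Run a tiny job before ssc.start() so the init has already happened on
// whichever executors pick up these tasks.
def warmUp(ssc: StreamingContext, tasks: Int = 64): Unit =
  ssc.sparkContext.parallelize(1 to tasks, tasks).foreachPartition { _ =>
    Connections.ensure()
  }
{noformat}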

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2016-10-16 Thread Olivier Armand (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579710#comment-15579710
 ] 

Olivier Armand commented on SPARK-650:
--

> "just run a dummy mapPartitions at the outset on the same data that the first 
> job would touch"

But this wouldn't work for Spark Streaming (our case), would it?

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17958) Why I ran into issue " accumulator, copyandreset must be zero error"

2016-10-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17958.
---
Resolution: Invalid

Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Questions go to user@, not JIRA. You would also never set Blocker.

> Why I ran into issue " accumulator, copyandreset must be zero error"
> 
>
> Key: SPARK-17958
> URL: https://issues.apache.org/jira/browse/SPARK-17958
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.0.1
> Environment: linux. 
>Reporter: dianwei Han
>Priority: Blocker
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I used to run Spark code under Spark 1.5 with no problem at all. Right now, I run the 
> code on Spark 2.0 and see the error 
> "Exception in thread "main" java.lang.AssertionError: assertion failed: 
> copyAndReset must return a zero value copy
> at scala.Predef$.assert(Predef.scala:170)
> at 
> org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:157)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> Please help or give a suggestion on how to fix this issue.
> Really appreciate it,
> Dianwei
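
For anyone who lands here with the same assertion: it typically means a custom AccumulatorV2 whose reset() and copy() (and therefore the default copyAndReset()) do not produce an instance for which isZero is true. A minimal conforming sketch, with an illustrative class name and element type rather than anything from the report:

{noformat}
import org.apache.spark.util.AccumulatorV2

// Collects distinct strings; reset() returns the accumulator to a state where
// isZero is true, so copyAndReset() satisfies the writeReplace assertion.
class StringSetAccumulator extends AccumulatorV2[String, Set[String]] {
  private var set: Set[String] = Set.empty

  override def isZero: Boolean = set.isEmpty
  override def copy(): StringSetAccumulator = {
    val acc = new StringSetAccumulator
    acc.set = this.set
    acc
  }
  override def reset(): Unit = set = Set.empty
  override def add(v: String): Unit = set += v
  override def merge(other: AccumulatorV2[String, Set[String]]): Unit = set ++= other.value
  override def value: Set[String] = set
}

// Usage, assuming an existing SparkContext `sc`:
//   val acc = new StringSetAccumulator
//   sc.register(acc, "seenStrings")
{noformat}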



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2016-10-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579630#comment-15579630
 ] 

Sean Owen commented on SPARK-650:
-

Reopening doesn't do anything by itself, or cause anyone to consider this. If 
this just sits for another year, it will have been a tiny part of a larger 
problem. I would ask those asking to keep this open to advance the discussion, 
or else I think you'd agree it should eventually be closed. (Here, I'm really 
speaking about the hundreds of issues like this, not so much this one.)

Part of the problem is that I don't think the details of this feature request 
were ever elaborated. I think that if you dig into what it would mean, you'd 
find that a) it's kind of tricky to define and then implement all the right 
semantics, and b) almost any use case along these lines in my experience is 
resolved as I suggest, with a simple per-JVM initialization. If the response 
lately here is, well, we're not quite sure how that works, then we need to get 
to the bottom of that, not just insist that the issue stay open.

To your points:

- The executor is going to load user code into one classloader, so we do have 
that an executor = JVM = classloader. 
- You can fail things as fast as you like by invoking this init as early as you 
like in your app.
- It's clear where things execute; we must assume app developers understand 
this, or else all bets are off. The driver program executes things in the 
driver unless they're part of a distributed map() etc. operation, which clearly 
executes on the executors.

These IMHO aren't reasons to design a new, different, bespoke mechanism. That 
has a cost too, if you're positing that it's hard to understand when things run 
where. 

The one catch I see is that, by design, we don't control which tasks run on 
which executors, so we can't guarantee init code runs on all executors this 
way. But is it meaningful to initialize an executor that never sees an app's 
tasks? It can't be. Lazy init is a good thing and compatible with the Spark 
model. If startup time is an issue (and I'm still not clear on the latency 
problem mentioned above), then it gets a little more complicated, but that's 
also a little more niche: just run a dummy mapPartitions at the outset on the 
same data that the first job would touch, even asynchronously if you like, 
alongside other driver activities. No need to wait; it just gives the init a 
head-start on the executors that will need it straight away.

That's just my opinion of course, but I think those are the questions that 
would need to be answered to argue that something should happen here.
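
As a concrete illustration of the lazy per-JVM init plus head-start pattern above: the ExecutorInit object and headStart helper below are made-up names, not an existing Spark API, and the dummy pass only warms up the executors that happen to run its tasks:

{noformat}
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.rdd.RDD

// Hypothetical per-JVM initializer: the lazy val body runs at most once per
// executor JVM, the first time any task on that executor touches it.
object ExecutorInit {
  lazy val ready: Unit = {
    // configure reporting libraries, connection pools, ...
  }
}

// A dummy pass over the same partitions the first real job will touch; its only
// effect is to trigger ExecutorInit there. Wrapped in a Future so the driver can
// carry on with other setup work while the warm-up job runs.
def headStart(firstInput: RDD[_]): Future[Unit] = Future {
  firstInput.foreachPartition { _ => ExecutorInit.ready }
}
{noformat}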

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17931) taskScheduler has some unneeded serialization

2016-10-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579590#comment-15579590
 ] 

Apache Spark commented on SPARK-17931:
--

User 'witgo' has created a pull request for this issue:
https://github.com/apache/spark/pull/15505

> taskScheduler has some unneeded serialization
> -
>
> Key: SPARK-17931
> URL: https://issues.apache.org/jira/browse/SPARK-17931
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Guoqiang Li
>
> When the taskScheduler instantiates a TaskDescription, it calls 
> `Task.serializeWithDependencies(task, sched.sc.addedFiles, 
> sched.sc.addedJars, ser)`, which serializes the task and its dependencies. 
> But after SPARK-2521 was merged into master, the ResultTask and 
> ShuffleMapTask classes no longer contain RDD and closure objects, so the 
> TaskDescription class can be changed as below:
> {noformat}
> class TaskDescription[T](
> val taskId: Long,
> val attemptNumber: Int,
> val executorId: String,
> val name: String,
> val index: Int, 
> val task: Task[T]) extends Serializable
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17931) taskScheduler has some unneeded serialization

2016-10-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17931:


Assignee: (was: Apache Spark)

> taskScheduler has some unneeded serialization
> -
>
> Key: SPARK-17931
> URL: https://issues.apache.org/jira/browse/SPARK-17931
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Guoqiang Li
>
> When the taskScheduler instantiates a TaskDescription, it calls 
> `Task.serializeWithDependencies(task, sched.sc.addedFiles, 
> sched.sc.addedJars, ser)`, which serializes the task and its dependencies. 
> But after SPARK-2521 was merged into master, the ResultTask and 
> ShuffleMapTask classes no longer contain RDD and closure objects, so the 
> TaskDescription class can be changed as below:
> {noformat}
> class TaskDescription[T](
> val taskId: Long,
> val attemptNumber: Int,
> val executorId: String,
> val name: String,
> val index: Int, 
> val task: Task[T]) extends Serializable
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17931) taskScheduler has some unneeded serialization

2016-10-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17931:


Assignee: Apache Spark

> taskScheduler has some unneeded serialization
> -
>
> Key: SPARK-17931
> URL: https://issues.apache.org/jira/browse/SPARK-17931
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Guoqiang Li
>Assignee: Apache Spark
>
> When the taskScheduler instantiates a TaskDescription, it calls 
> `Task.serializeWithDependencies(task, sched.sc.addedFiles, 
> sched.sc.addedJars, ser)`, which serializes the task and its dependencies. 
> But after SPARK-2521 was merged into master, the ResultTask and 
> ShuffleMapTask classes no longer contain RDD and closure objects, so the 
> TaskDescription class can be changed as below:
> {noformat}
> class TaskDescription[T](
> val taskId: Long,
> val attemptNumber: Int,
> val executorId: String,
> val name: String,
> val index: Int, 
> val task: Task[T]) extends Serializable
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org