[jira] [Commented] (SPARK-17969) I think it's user unfriendly to process standard json file with DataFrame
[ https://issues.apache.org/jira/browse/SPARK-17969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581273#comment-15581273 ] Reynold Xin commented on SPARK-17969: - +1 It would be good to have a mode in which each file is a single JSON object. > I think it's user unfriendly to process standard json file with DataFrame > -- > > Key: SPARK-17969 > URL: https://issues.apache.org/jira/browse/SPARK-17969 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.0.1 > Reporter: Jianfei Wang > Priority: Minor > > Currently, with the DataFrame API, we can't load a standard (multi-line) JSON file directly. Perhaps we could provide an overloaded method for this; the logic is as below: > ``` > val files = spark.sparkContext.wholeTextFiles("data/test.json") // RDD[(path, content)] > val json_rdd = files.map(_._2) // keep only the file content; each element is one JSON document > val json_df = spark.read.json(json_rdd) > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
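The normalization the proposal aims at — turning a pretty-printed, multi-line JSON file into the one-record-per-line form that spark.read.json consumes by default — can be illustrated in plain Python (a standalone sketch, not Spark code; the to_json_lines helper is hypothetical):

```python
import json

def to_json_lines(text: str) -> str:
    """Collapse a pretty-printed JSON document (a single object or an
    array of objects) into JSON Lines: one compact record per line."""
    data = json.loads(text)
    records = data if isinstance(data, list) else [data]
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records)

pretty = """[
  {"name": "alice", "age": 30},
  {"name": "bob", "age": 25}
]"""
print(to_json_lines(pretty))
```

Writing the result back out, one record per line, yields a file that spark.read.json can load directly, without any wholeTextFiles workaround.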
[jira] [Updated] (SPARK-17969) I think it's user unfriendly to process standard json file with DataFrame
[ https://issues.apache.org/jira/browse/SPARK-17969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17969: Assignee: (was: Reynold Xin) > I think it's user unfriendly to process standard json file with DataFrame > -- > > Key: SPARK-17969 > URL: https://issues.apache.org/jira/browse/SPARK-17969 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.0.1 > Reporter: Jianfei Wang > Priority: Minor > > Currently, with the DataFrame API, we can't load a standard (multi-line) JSON file directly. Perhaps we could provide an overloaded method for this; the logic is as below: > ``` > val files = spark.sparkContext.wholeTextFiles("data/test.json") // RDD[(path, content)] > val json_rdd = files.map(_._2) // keep only the file content; each element is one JSON document > val json_df = spark.read.json(json_rdd) > ```
[jira] [Assigned] (SPARK-17969) I think it's user unfriendly to process standard json file with DataFrame
[ https://issues.apache.org/jira/browse/SPARK-17969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reassigned SPARK-17969: --- Assignee: Reynold Xin > I think it's user unfriendly to process standard json file with DataFrame > -- > > Key: SPARK-17969 > URL: https://issues.apache.org/jira/browse/SPARK-17969 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.0.1 > Reporter: Jianfei Wang > Assignee: Reynold Xin > Priority: Minor > > Currently, with the DataFrame API, we can't load a standard (multi-line) JSON file directly. Perhaps we could provide an overloaded method for this; the logic is as below: > ``` > val files = spark.sparkContext.wholeTextFiles("data/test.json") // RDD[(path, content)] > val json_rdd = files.map(_._2) // keep only the file content; each element is one JSON document > val json_df = spark.read.json(json_rdd) > ```
[jira] [Comment Edited] (SPARK-11524) Support SparkR with Mesos cluster
[ https://issues.apache.org/jira/browse/SPARK-11524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581268#comment-15581268 ] Sun Rui edited comment on SPARK-11524 at 10/17/16 5:31 AM: --- For cluster mode, the R script needs to be transferred to the slave node chosen to run the Spark driver. For YARN, this is done by adding it to the local resources for the application. For Mesos, I am not sure how to do it. was (Author: sunrui): For cluster mode, the R script needs to be transferred to the slave node chosen to run the Spark driver. For YARN, this is done by adding it to the local resources for the application. For Mesos, I think the file server mechanism can be utilized. > Support SparkR with Mesos cluster > - > > Key: SPARK-11524 > URL: https://issues.apache.org/jira/browse/SPARK-11524 > Project: Spark > Issue Type: New Feature > Components: SparkR > Affects Versions: 1.5.1 > Reporter: Sun Rui >
[jira] [Commented] (SPARK-11524) Support SparkR with Mesos cluster
[ https://issues.apache.org/jira/browse/SPARK-11524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581268#comment-15581268 ] Sun Rui commented on SPARK-11524: - For cluster mode, the R script needs to be transferred to the slave node chosen to run the Spark driver. For YARN, this is done by adding it to the local resources for the application. For Mesos, I think the file server mechanism can be utilized. > Support SparkR with Mesos cluster > - > > Key: SPARK-11524 > URL: https://issues.apache.org/jira/browse/SPARK-11524 > Project: Spark > Issue Type: New Feature > Components: SparkR > Affects Versions: 1.5.1 > Reporter: Sun Rui >
[jira] [Commented] (SPARK-10590) Spark with YARN build is broken
[ https://issues.apache.org/jira/browse/SPARK-10590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581244#comment-15581244 ] Nirman Narang commented on SPARK-10590: --- Hi Sean, my FS is not encrypted. I upgraded Maven from 3.0.5 to 3.3.9, and the build succeeds now. I still have no pointers as to whether it was a Maven dependency issue; I will try the same on a different machine with the same environment to confirm. > Spark with YARN build is broken > --- > > Key: SPARK-10590 > URL: https://issues.apache.org/jira/browse/SPARK-10590 > Project: Spark > Issue Type: Bug > Affects Versions: 1.5.0 > Environment: CentOS 6.5 > Oracle JDK 1.7.0_75 > Maven 3.3.3 > Hadoop 2.6.0 > Spark 1.5.0 > Reporter: Kevin Tsai > > Hi, after upgrading to v1.5.0 and trying to build it, it shows: > [ERROR] missing or invalid dependency detected while loading class file 'WebUI.class' > It was working on Spark 1.4.1. > Build command: mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive > -Phive-thriftserver -Dscala-2.11 -DskipTests clean package > Hope it helps. > Regards, > Kevin
[jira] [Updated] (SPARK-17819) Specified database in JDBC URL is ignored when connecting to thriftserver
[ https://issues.apache.org/jira/browse/SPARK-17819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17819: Fix Version/s: 2.0.2 > Specified database in JDBC URL is ignored when connecting to thriftserver > - > > Key: SPARK-17819 > URL: https://issues.apache.org/jira/browse/SPARK-17819 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 1.6.2, 2.0.0, 2.0.1 > Reporter: Todd Nemet > Assignee: Dongjoon Hyun > Fix For: 2.0.2, 2.1.0 > > > Filing this based on an email thread with Reynold Xin. From the > [docs|http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server], > the JDBC connection URL to the thriftserver looks like: > {code} > beeline> !connect > jdbc:hive2://<host>:<port>/<db>?hive.server2.transport.mode=http;hive.server2.thrift.http.path=<http_endpoint> > {code} > However, anything specified in <db> results in being put in the default > schema. I'm running these with -e commands, but the shell shows the same > behavior. 
> In 2.0.1, I created a table foo in schema spark_jira: > {code} > [558|18:01:20] ~/Documents/spark/spark$ bin/beeline -u > jdbc:hive2://localhost:10006/spark_jira -n hive -e "show tables" > Connecting to jdbc:hive2://localhost:10006/spark_jira > 16/10/06 18:01:28 INFO jdbc.Utils: Supplied authorities: localhost:10006 > 16/10/06 18:01:28 INFO jdbc.Utils: Resolved authority: localhost:10006 > 16/10/06 18:01:28 INFO jdbc.HiveConnection: Will try to open client transport > with JDBC Uri: jdbc:hive2://localhost:10006/spark_jira > Connected to: Spark SQL (version 2.0.1) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > ++--+--+ > | tableName | isTemporary | > ++--+--+ > ++--+--+ > No rows selected (0.558 seconds) > Beeline version 1.2.1.spark2 by Apache Hive > Closing: 0: jdbc:hive2://localhost:10006/spark_jira > [559|18:01:30] ~/Documents/spark/spark$ bin/beeline -u > jdbc:hive2://localhost:10006/spark_jira -n hive -e "show tables in spark_jira" > Connecting to jdbc:hive2://localhost:10006/spark_jira > 16/10/06 18:01:34 INFO jdbc.Utils: Supplied authorities: localhost:10006 > 16/10/06 18:01:34 INFO jdbc.Utils: Resolved authority: localhost:10006 > 16/10/06 18:01:34 INFO jdbc.HiveConnection: Will try to open client transport > with JDBC Uri: jdbc:hive2://localhost:10006/spark_jira > Connected to: Spark SQL (version 2.0.1) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > ++--+--+ > | tableName | isTemporary | > ++--+--+ > | foo| false| > ++--+--+ > 1 row selected (0.664 seconds) > Beeline version 1.2.1.spark2 by Apache Hive > Closing: 0: jdbc:hive2://localhost:10006/spark_jira > {code} > I also see this in Spark 1.6.2: > {code} > [555|18:13:32] ~/Documents/spark/spark16$ bin/beeline -u > jdbc:hive2://localhost:10005/spark_jira -n hive -e "show tables" > Connecting to jdbc:hive2://localhost:10005/spark_jira > 16/10/06 18:13:37 INFO jdbc.Utils: Supplied authorities: 
localhost:10005 > 16/10/06 18:13:37 INFO jdbc.Utils: Resolved authority: localhost:10005 > 16/10/06 18:13:37 INFO jdbc.HiveConnection: Will try to open client transport > with JDBC Uri: jdbc:hive2://localhost:10005/spark_jira > Connected to: Spark SQL (version 1.6.2) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > +--+--+--+ > | tableName | isTemporary | > +--+--+--+ > | all_types| false| > | order_items | false| > | orders | false| > | users| false| > +--+--+--+ > 4 rows selected (0.653 seconds) > Beeline version 1.2.1.spark2 by Apache Hive > Closing: 0: jdbc:hive2://localhost:10005/spark_jira > [556|18:13:39] ~/Documents/spark/spark16$ bin/beeline -u > jdbc:hive2://localhost:10005/spark_jira -n hive -e "show tables in spark_jira" > Connecting to jdbc:hive2://localhost:10005/spark_jira > 16/10/06 18:13:45 INFO jdbc.Utils: Supplied authorities: localhost:10005 > 16/10/06 18:13:45 INFO jdbc.Utils: Resolved authority: localhost:10005 > 16/10/06 18:13:45 INFO jdbc.HiveConnection: Will try to open client transport > with JDBC Uri: jdbc:hive2://localhost:10005/spark_jira > Connected to: Spark SQL (version 1.6.2) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > ++--+--+ > | tableName | isTemporary | > ++--+--+ > | foo| false| > ++--+--+ > 1 row selected (0.633 seconds) > Beeline
[jira] [Comment Edited] (SPARK-17954) FetchFailedException executor cannot connect to another worker executor
[ https://issues.apache.org/jira/browse/SPARK-17954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581147#comment-15581147 ] Vitaly Gerasimov edited comment on SPARK-17954 at 10/17/16 4:50 AM: I figured out this issue. The problem is that the Spark executor is listening only on localhost: {code} ~# netstat -ntlp tcp6 0 0 127.0.0.1:46721 :::*LISTEN 11294/java {code} Is there some change in configuration that makes the executor listen only on localhost? When I run Spark 1.6.2, the executor listens on all interfaces. I know it may depend on my /etc/hosts file, but this change seems unexpected to me. was (Author: v-gerasimov): I figured out this issue. The problem is that the Spark executor is listening only on localhost: {code} ~# netstat -ntlp tcp6 0 0 127.0.0.1:46721 :::*LISTEN 11294/java {code} Is there some change in configuration that makes the executor listen only on localhost? When I run Spark 1.6.2, the executor listens on all interfaces. > FetchFailedException executor cannot connect to another worker executor > --- > > Key: SPARK-17954 > URL: https://issues.apache.org/jira/browse/SPARK-17954 > Project: Spark > Issue Type: Bug > Affects Versions: 2.0.0, 2.0.1 > Reporter: Vitaly Gerasimov > > I have a standalone-mode Spark cluster which has three nodes: > master.test > worker1.test > worker2.test > I am trying to run the following code in the Spark shell: > {code} > val json = spark.read.json("hdfs://master.test/json/a.js.gz", > "hdfs://master.test/json/b.js.gz") > json.createOrReplaceTempView("messages") > spark.sql("select count(*) from messages").show() > {code} > and I am getting the following exception: > {code} > org.apache.spark.shuffle.FetchFailedException: Failed to connect to > worker1.test/x.x.x.x:51029 > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to connect to > worker1.test/x.x.x.x:51029 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179) > at > org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96) > at >
[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python
[ https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581167#comment-15581167 ] Reynold Xin commented on SPARK-10915: - It is indeed very complicated to implement a UDAF in Python; that's why we have been asking. One option is to allow defining UDAFs in terms of existing SQL expressions, which might make this easier. > Add support for UDAFs in Python > --- > > Key: SPARK-10915 > URL: https://issues.apache.org/jira/browse/SPARK-10915 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL > Reporter: Justin Uang > > This should support Python-defined lambdas.
[jira] [Comment Edited] (SPARK-17954) FetchFailedException executor cannot connect to another worker executor
[ https://issues.apache.org/jira/browse/SPARK-17954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581147#comment-15581147 ] Vitaly Gerasimov edited comment on SPARK-17954 at 10/17/16 4:12 AM: I figured out this issue. The problem is that the Spark executor is listening only on localhost: {code} ~# netstat -ntlp tcp6 0 0 127.0.0.1:46721 :::*LISTEN 11294/java {code} Is there some change in configuration that makes the executor listen only on localhost? When I run Spark 1.6.2, the executor listens on all interfaces. was (Author: v-gerasimov): I figured out this issue. The problem is that the Spark executor is listening only on localhost: {conf} ~# netstat -ntlp tcp6 0 0 127.0.0.1:46721 :::*LISTEN 11294/java {conf} Is there some change in configuration that makes the executor listen only on localhost? When I run Spark 1.6.2, the executor listens on all interfaces. > FetchFailedException executor cannot connect to another worker executor > --- > > Key: SPARK-17954 > URL: https://issues.apache.org/jira/browse/SPARK-17954 > Project: Spark > Issue Type: Bug > Affects Versions: 2.0.0, 2.0.1 > Reporter: Vitaly Gerasimov > > I have a standalone-mode Spark cluster which has three nodes: > master.test > worker1.test > worker2.test > I am trying to run the following code in the Spark shell: > {code} > val json = spark.read.json("hdfs://master.test/json/a.js.gz", > "hdfs://master.test/json/b.js.gz") > json.createOrReplaceTempView("messages") > spark.sql("select count(*) from messages").show() > {code} > and I am getting the following exception: > {code} > org.apache.spark.shuffle.FetchFailedException: Failed to connect to > worker1.test/x.x.x.x:51029 > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) > at 
scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: 
Failed to connect to > worker1.test/x.x.x.x:51029 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179) > at > org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) > at >
[jira] [Commented] (SPARK-17954) FetchFailedException executor cannot connect to another worker executor
[ https://issues.apache.org/jira/browse/SPARK-17954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581147#comment-15581147 ] Vitaly Gerasimov commented on SPARK-17954: -- I figured out this issue. The problem is that the Spark executor is listening only on localhost: {code} ~# netstat -ntlp tcp6 0 0 127.0.0.1:46721 :::*LISTEN 11294/java {code} Is there some change in configuration that makes the executor listen only on localhost? When I run Spark 1.6.2, the executor listens on all interfaces. > FetchFailedException executor cannot connect to another worker executor > --- > > Key: SPARK-17954 > URL: https://issues.apache.org/jira/browse/SPARK-17954 > Project: Spark > Issue Type: Bug > Affects Versions: 2.0.0, 2.0.1 > Reporter: Vitaly Gerasimov > > I have a standalone-mode Spark cluster which has three nodes: > master.test > worker1.test > worker2.test > I am trying to run the following code in the Spark shell: > {code} > val json = spark.read.json("hdfs://master.test/json/a.js.gz", > "hdfs://master.test/json/b.js.gz") > json.createOrReplaceTempView("messages") > spark.sql("select count(*) from messages").show() > {code} > and I am getting the following exception: > {code} > org.apache.spark.shuffle.FetchFailedException: Failed to connect to > worker1.test/x.x.x.x:51029 > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to connect to > worker1.test/x.x.x.x:51029 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179) > at > 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ... 3 more > Caused by: java.net.ConnectException: Connection refused: >
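A common cause of the localhost-only binding described in this thread is an /etc/hosts entry that maps the machine's hostname to a loopback address, so the executor resolves its own host to 127.0.0.1. A quick way to check name resolution is sketched below (plain Python, independent of Spark's actual binding logic; the helper name is illustrative):

```python
import socket

def resolves_to_loopback(hostname: str) -> bool:
    """Return True if the hostname resolves to an IPv4 loopback address
    (127.0.0.0/8), as happens with /etc/hosts entries like '127.0.1.1 myhost'."""
    try:
        addr = socket.gethostbyname(hostname)
    except socket.gaierror:
        return False
    return addr.startswith("127.")

# 'localhost' should always map to loopback; on a cluster node the
# interesting check is the machine's own name, socket.gethostname().
print(resolves_to_loopback("localhost"))
```

If the node's own hostname resolves to loopback, fixing the /etc/hosts mapping (or setting SPARK_LOCAL_IP to the node's real address) is the usual remedy.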
[jira] [Resolved] (SPARK-17947) Document the impact of `spark.sql.debug`
[ https://issues.apache.org/jira/browse/SPARK-17947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-17947. - Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15494 [https://github.com/apache/spark/pull/15494] > Document the impact of `spark.sql.debug` > > > Key: SPARK-17947 > URL: https://issues.apache.org/jira/browse/SPARK-17947 > Project: Spark > Issue Type: Documentation > Components: SQL > Affects Versions: 2.1.0 > Reporter: Xiao Li > Fix For: 2.1.0 > > > Just document the impact of `spark.sql.debug`: > When debug mode is enabled, Spark SQL internal table properties are not > filtered out; however, some related DDL commands (e.g., ANALYZE TABLE and > CREATE TABLE LIKE) might not work properly.
[jira] [Updated] (SPARK-17947) Document the impact of `spark.sql.debug`
[ https://issues.apache.org/jira/browse/SPARK-17947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-17947: Assignee: Xiao Li > Document the impact of `spark.sql.debug` > > > Key: SPARK-17947 > URL: https://issues.apache.org/jira/browse/SPARK-17947 > Project: Spark > Issue Type: Documentation > Components: SQL > Affects Versions: 2.1.0 > Reporter: Xiao Li > Assignee: Xiao Li > Fix For: 2.1.0 > > > Just document the impact of `spark.sql.debug`: > When debug mode is enabled, Spark SQL internal table properties are not > filtered out; however, some related DDL commands (e.g., ANALYZE TABLE and > CREATE TABLE LIKE) might not work properly.
[jira] [Commented] (SPARK-17819) Specified database in JDBC URL is ignored when connecting to thriftserver
[ https://issues.apache.org/jira/browse/SPARK-17819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581130#comment-15581130 ] Apache Spark commented on SPARK-17819: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/15507 > Specified database in JDBC URL is ignored when connecting to thriftserver > - > > Key: SPARK-17819 > URL: https://issues.apache.org/jira/browse/SPARK-17819 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 1.6.2, 2.0.0, 2.0.1 > Reporter: Todd Nemet > Assignee: Dongjoon Hyun > Fix For: 2.1.0 > > > Filing this based on an email thread with Reynold Xin. From the > [docs|http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server], > the JDBC connection URL to the thriftserver looks like: > {code} > beeline> !connect > jdbc:hive2://<host>:<port>/<db>?hive.server2.transport.mode=http;hive.server2.thrift.http.path=<http_endpoint> > {code} > However, anything specified in <db> results in being put in the default > schema. I'm running these with -e commands, but the shell shows the same > behavior. 
> In 2.0.1, I created a table foo in schema spark_jira: > {code} > [558|18:01:20] ~/Documents/spark/spark$ bin/beeline -u > jdbc:hive2://localhost:10006/spark_jira -n hive -e "show tables" > Connecting to jdbc:hive2://localhost:10006/spark_jira > 16/10/06 18:01:28 INFO jdbc.Utils: Supplied authorities: localhost:10006 > 16/10/06 18:01:28 INFO jdbc.Utils: Resolved authority: localhost:10006 > 16/10/06 18:01:28 INFO jdbc.HiveConnection: Will try to open client transport > with JDBC Uri: jdbc:hive2://localhost:10006/spark_jira > Connected to: Spark SQL (version 2.0.1) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > ++--+--+ > | tableName | isTemporary | > ++--+--+ > ++--+--+ > No rows selected (0.558 seconds) > Beeline version 1.2.1.spark2 by Apache Hive > Closing: 0: jdbc:hive2://localhost:10006/spark_jira > [559|18:01:30] ~/Documents/spark/spark$ bin/beeline -u > jdbc:hive2://localhost:10006/spark_jira -n hive -e "show tables in spark_jira" > Connecting to jdbc:hive2://localhost:10006/spark_jira > 16/10/06 18:01:34 INFO jdbc.Utils: Supplied authorities: localhost:10006 > 16/10/06 18:01:34 INFO jdbc.Utils: Resolved authority: localhost:10006 > 16/10/06 18:01:34 INFO jdbc.HiveConnection: Will try to open client transport > with JDBC Uri: jdbc:hive2://localhost:10006/spark_jira > Connected to: Spark SQL (version 2.0.1) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > ++--+--+ > | tableName | isTemporary | > ++--+--+ > | foo| false| > ++--+--+ > 1 row selected (0.664 seconds) > Beeline version 1.2.1.spark2 by Apache Hive > Closing: 0: jdbc:hive2://localhost:10006/spark_jira > {code} > I also see this in Spark 1.6.2: > {code} > [555|18:13:32] ~/Documents/spark/spark16$ bin/beeline -u > jdbc:hive2://localhost:10005/spark_jira -n hive -e "show tables" > Connecting to jdbc:hive2://localhost:10005/spark_jira > 16/10/06 18:13:37 INFO jdbc.Utils: Supplied authorities: 
localhost:10005 > 16/10/06 18:13:37 INFO jdbc.Utils: Resolved authority: localhost:10005 > 16/10/06 18:13:37 INFO jdbc.HiveConnection: Will try to open client transport > with JDBC Uri: jdbc:hive2://localhost:10005/spark_jira > Connected to: Spark SQL (version 1.6.2) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > +--+--+--+ > | tableName | isTemporary | > +--+--+--+ > | all_types| false| > | order_items | false| > | orders | false| > | users| false| > +--+--+--+ > 4 rows selected (0.653 seconds) > Beeline version 1.2.1.spark2 by Apache Hive > Closing: 0: jdbc:hive2://localhost:10005/spark_jira > [556|18:13:39] ~/Documents/spark/spark16$ bin/beeline -u > jdbc:hive2://localhost:10005/spark_jira -n hive -e "show tables in spark_jira" > Connecting to jdbc:hive2://localhost:10005/spark_jira > 16/10/06 18:13:45 INFO jdbc.Utils: Supplied authorities: localhost:10005 > 16/10/06 18:13:45 INFO jdbc.Utils: Resolved authority: localhost:10005 > 16/10/06 18:13:45 INFO jdbc.HiveConnection: Will try to open client transport > with JDBC Uri: jdbc:hive2://localhost:10005/spark_jira > Connected to: Spark SQL (version 1.6.2) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > ++--+--+ > | tableName | isTemporary | >
[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python
[ https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581127#comment-15581127 ] Tobi Bosede commented on SPARK-10915: - Is it complicated to implement a UDAF in Python? If you read the email thread at the link, you will see people are interested in using a UDAF in SQL statements as well as Spark SQL methods. In my case I wanted to do a pivot and have a UDAF for the inter-quartile range (IQR) applied. Do you think df.rdd.reduceByKey would make sense here? > Add support for UDAFs in Python > --- > > Key: SPARK-10915 > URL: https://issues.apache.org/jira/browse/SPARK-10915 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Justin Uang > > This should support Python-defined lambdas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
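In the absence of Python UDAF support, the pivot-plus-IQR use case discussed above can be approximated with plain RDD operations. The sketch below shows only the per-group aggregation logic in pure Python; the function names and the interpolation method are illustrative assumptions, not Spark APIs. In PySpark the same functions could be applied via something like rdd.groupByKey().mapValues(iqr).

```python
def percentile(sorted_vals, p):
    """Percentile with linear interpolation; sorted_vals must already be
    sorted and 0.0 <= p <= 1.0."""
    k = (len(sorted_vals) - 1) * p
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = k - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def iqr(values):
    """Inter-quartile range: Q3 - Q1."""
    s = sorted(values)
    return percentile(s, 0.75) - percentile(s, 0.25)

def iqr_by_key(pairs):
    """Group (key, value) pairs and compute the IQR per key -- the shape of
    a groupByKey/reduceByKey-based workaround."""
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return {key: iqr(vals) for key, vals in groups.items()}
```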
[jira] [Commented] (SPARK-17930) The SerializerInstance instance used when deserializing a TaskResult is not reused
[ https://issues.apache.org/jira/browse/SPARK-17930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581119#comment-15581119 ] Guoqiang Li commented on SPARK-17930: - TPC-DS 2T data (Parquet) and the SQL (query 2) => {noformat} select i_item_id, avg(ss_quantity) agg1, avg(ss_list_price) agg2, avg(ss_coupon_amt) agg3, avg(ss_sales_price) agg4 from store_sales, customer_demographics, date_dim, item, promotion where ss_sold_date_sk = d_date_sk and ss_item_sk = i_item_sk and ss_cdemo_sk = cd_demo_sk and ss_promo_sk = p_promo_sk and cd_gender = 'M' and cd_marital_status = 'M' and cd_education_status = '4 yr Degree' and (p_channel_email = 'N' or p_channel_event = 'N') and d_year = 2001 group by i_item_id order by i_item_id limit 100; {noformat} spark-defaults.conf => {noformat} spark.master yarn-client spark.executor.instances 20 spark.driver.memory 16g spark.executor.memory 30g spark.executor.cores 5 spark.default.parallelism 100 spark.sql.shuffle.partitions 10 spark.serializer org.apache.spark.serializer.KryoSerializer spark.driver.maxResultSize 0 spark.rpc.netty.dispatcher.numThreads 8 spark.executor.extraJavaOptions -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=256M spark.cleaner.referenceTracking.blocking true spark.cleaner.referenceTracking.blocking.shuffle true {noformat} Performance test results are as follows => ||[SPARK-17930|https://github.com/witgo/spark/tree/SPARK-17930]||[ed14633|https://github.com/witgo/spark/commit/ed1463341455830b8867b721a1b34f291139baf3]|| |54.5 s|231.7 s| > The SerializerInstance instance used when deserializing a TaskResult is not > reused > --- > > Key: SPARK-17930 > URL: https://issues.apache.org/jira/browse/SPARK-17930 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1, 2.0.1 >Reporter: Guoqiang Li > > The following code is called when the DirectTaskResult instance is > deserialized > {noformat} > def value(): T = { > if
(valueObjectDeserialized) { > valueObject > } else { > // Each deserialization creates a new instance of SerializerInstance, > which is very time-consuming > val resultSer = SparkEnv.get.serializer.newInstance() > valueObject = resultSer.deserialize(valueBytes) > valueObjectDeserialized = true > valueObject > } > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
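The fix implied by the code above is to construct the SerializerInstance once and reuse it. Since serializer instances are generally not thread-safe but are safe to reuse within a single thread, a per-thread cache is the usual pattern. A minimal sketch of that pattern in plain Python follows; the class and function names are illustrative stand-ins, not Spark's API.

```python
import threading

class ExpensiveDeserializer:
    """Stand-in for a costly-to-construct serializer instance."""
    constructions = 0  # counts instance constructions, for illustration only

    def __init__(self):
        ExpensiveDeserializer.constructions += 1

    def deserialize(self, data):
        return data  # placeholder for real deserialization work

_thread_cache = threading.local()

def cached_deserializer():
    """Return this thread's deserializer, constructing it at most once
    per thread instead of once per call."""
    if not hasattr(_thread_cache, "inst"):
        _thread_cache.inst = ExpensiveDeserializer()
    return _thread_cache.inst
```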
[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581113#comment-15581113 ] Low Chin Wei commented on SPARK-13747: -- It is running on Akka with a fork-join dispatcher. There are two actors running concurrently, each doing a different Spark job with the same SparkSession. I can't give the full stack trace, but here is the outline: java.lang.IllegalArgumentException: spark.sql.execution.id is already set at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:81) ~[spark-sql_2.11-2.0.1.jar:2.0.1] at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2546) ~[spark-sql_2.11-2.0.1.jar:2.0.1] at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2192) ~[spark-sql_2.11-2.0.1.jar:2.0.1] at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2199) ~[spark-sql_2.11-2.0.1.jar:2.0.1] at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2227) ~[spark-sql_2.11-2.0.1.jar:2.0.1] at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2226) ~[spark-sql_2.11-2.0.1.jar:2.0.1] at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2559) ~[spark-sql_2.11-2.0.1.jar:2.0.1] at org.apache.spark.sql.Dataset.count(Dataset.scala:2226) ~[spark-sql_2.11-2.0.1.jar:2.0.1] <-- Here is the code that calls df.count --> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) [akka-actor_2.11-2.4.8.jar:na] at akka.actor.ActorCell.invoke(ActorCell.scala:495) [akka-actor_2.11-2.4.8.jar:na] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) [akka-actor_2.11-2.4.8.jar:na] at akka.dispatch.Mailbox.run(Mailbox.scala:224) [akka-actor_2.11-2.4.8.jar:na] at akka.dispatch.Mailbox.exec(Mailbox.scala:234) [akka-actor_2.11-2.4.8.jar:na] at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [scala-library-2.11.8.jar:na] at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [scala-library-2.11.8.jar:na] at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [scala-library-2.11.8.jar:na] at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [scala-library-2.11.8.jar:na] > Concurrent execution in SQL doesn't work with Scala ForkJoinPool > > > Key: SPARK-13747 > URL: https://issues.apache.org/jira/browse/SPARK-13747 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Shixiong Zhu >Assignee: Andrew Or > Fix For: 2.0.0 > > > Running the following code may fail > {code} > (1 to 100).par.foreach { _ => > println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()) > } > java.lang.IllegalArgumentException: spark.sql.execution.id is already set > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) > {code} > This is because SparkContext.runJob can be suspended when using a > ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global) as it > calls Await.ready (introduced by https://github.com/apache/spark/pull/9264). > So when SparkContext.runJob is suspended, ForkJoinPool will run another task > in the same thread; however, the local properties have been polluted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
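The mechanism behind the error above can be modelled without Spark: withNewExecutionId stores an execution id in a thread-local property and refuses to nest. When a ForkJoinPool suspends a job at Await.ready and schedules a second task on the same thread, the second task sees the first task's thread-local property and fails. A minimal plain-Python model of that check (names are illustrative, not Spark's API):

```python
import threading

_props = threading.local()

def with_new_execution_id(body):
    """Model of SQLExecution.withNewExecutionId: sets a per-thread
    execution id around body() and raises if one is already set on
    this thread."""
    if getattr(_props, "execution_id", None) is not None:
        raise ValueError("spark.sql.execution.id is already set")
    _props.execution_id = object()
    try:
        return body()
    finally:
        _props.execution_id = None
```

A second task reusing the thread while the first is still inside body() corresponds to calling with_new_execution_id again before the finally block has run, which produces exactly the error in the stack trace above.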
[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581094#comment-15581094 ] Shixiong Zhu commented on SPARK-13747: -- Could you post the full stack trace, please? It would be helpful to know who calls `count`. > Concurrent execution in SQL doesn't work with Scala ForkJoinPool > > > Key: SPARK-13747 > URL: https://issues.apache.org/jira/browse/SPARK-13747 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Shixiong Zhu >Assignee: Andrew Or > Fix For: 2.0.0 > > > Running the following code may fail > {code} > (1 to 100).par.foreach { _ => > println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()) > } > java.lang.IllegalArgumentException: spark.sql.execution.id is already set > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) > {code} > This is because SparkContext.runJob can be suspended when using a > ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global) as it > calls Await.ready (introduced by https://github.com/apache/spark/pull/9264). > So when SparkContext.runJob is suspended, ForkJoinPool will run another task > in the same thread; however, the local properties have been polluted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17963) Add examples (extend) in each function and improve documentation with arguments
[ https://issues.apache.org/jira/browse/SPARK-17963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581072#comment-15581072 ] Hyukjin Kwon commented on SPARK-17963: -- Thanks. Then I will work on this. > Add examples (extend) in each function and improve documentation with > arguments > --- > > Key: SPARK-17963 > URL: https://issues.apache.org/jira/browse/SPARK-17963 > Project: Spark > Issue Type: Documentation > Components: SQL >Reporter: Hyukjin Kwon > > Currently, it seems the function documentation is inconsistent and, for many functions, does not have > examples ({{Extended Usage}}). > For example, some functions have bad indentation, as below: > {code} > spark-sql> DESCRIBE FUNCTION EXTENDED approx_count_distinct; > Function: approx_count_distinct > Class: org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus > Usage: approx_count_distinct(expr) - Returns the estimated cardinality by > HyperLogLog++. > approx_count_distinct(expr, relativeSD=0.05) - Returns the estimated > cardinality by HyperLogLog++ > with relativeSD, the maximum estimation error allowed. > Extended Usage: > No example for approx_count_distinct. > {code} > {code} > spark-sql> DESCRIBE FUNCTION EXTENDED count; > Function: count > Class: org.apache.spark.sql.catalyst.expressions.aggregate.Count > Usage: count(*) - Returns the total number of retrieved rows, including rows > containing NULL values. > count(expr) - Returns the number of rows for which the supplied > expression is non-NULL. > count(DISTINCT expr[, expr...]) - Returns the number of rows for which > the supplied expression(s) are unique and non-NULL. > Extended Usage: > No example for count. 
> {code} > whereas some do have a pretty one: > {code} > spark-sql> DESCRIBE FUNCTION EXTENDED percentile_approx; > Function: percentile_approx > Class: > org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile > Usage: > percentile_approx(col, percentage [, accuracy]) - Returns the > approximate percentile value of numeric > column `col` at the given percentage. The value of percentage must be > between 0.0 > and 1.0. The `accuracy` parameter (default: 1) is a positive > integer literal which > controls approximation accuracy at the cost of memory. Higher value of > `accuracy` yields > better accuracy, `1.0/accuracy` is the relative error of the > approximation. > percentile_approx(col, array(percentage1 [, percentage2]...) [, > accuracy]) - Returns the approximate > percentile array of column `col` at the given percentage array. Each > value of the > percentage array must be between 0.0 and 1.0. The `accuracy` parameter > (default: 1) is > a positive integer literal which controls approximation accuracy at > the cost of memory. > Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is > the relative error of > the approximation. > Extended Usage: > No example for percentile_approx. > {code} > Also, there is some inconsistent spacing, for example, > {{_FUNC_(a,b)}} and {{_FUNC_(a, b)}} (note the spacing between arguments). > It'd be nicer if most of them had a good example with possible argument > types. > Suggested format is as below for multiple line usage: > {code} > spark-sql> DESCRIBE FUNCTION EXTENDED rand; > Function: rand > Class: org.apache.spark.sql.catalyst.expressions.Rand > Usage: > rand() - Returns a random column with i.i.d. uniformly distributed > values in [0, 1]. > seed is given randomly. > rand(seed) - Returns a random column with i.i.d. uniformly distributed > values in [0, 1]. > seed should be an integer/long/NULL literal. 
> Extended Usage: > > SELECT rand(); > 0.9629742951434543 > > SELECT rand(0); > 0.8446490682263027 > > SELECT rand(NULL); > 0.8446490682263027 > {code} > For single line usage: > {code} > spark-sql> DESCRIBE FUNCTION EXTENDED date_add; > Function: date_add > Class: org.apache.spark.sql.catalyst.expressions.DateAdd > Usage: date_add(start_date, num_days) - Returns the date that is num_days > after start_date. > Extended Usage: > > SELECT date_add('2016-07-30', 1); > '2016-07-31' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17969) I think it's user unfriendly to process standard json file with DataFrame
Jianfei Wang created SPARK-17969: Summary: I think it's user unfriendly to process standard json file with DataFrame Key: SPARK-17969 URL: https://issues.apache.org/jira/browse/SPARK-17969 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.0.1 Reporter: Jianfei Wang Priority: Minor Currently, with the DataFrame API, we can't load a standard (multi-line) JSON file directly; perhaps we can provide an overloaded method to handle this. The logic is as below: ``` val df = spark.sparkContext.wholeTextFiles("data/test.json") val json_rdd = df.map( x => x.toString.replaceAll("\\s+","")).map{ x => val index = x.indexOf(',') x.substring(index + 1, x.length - 1) } val json_df = spark.read.json(json_rdd) ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
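The snippet in the report has two pitfalls: replaceAll("\\s+", "") also deletes whitespace inside JSON string values, and indexOf(',') can match a comma in the file path rather than the tuple separator. The essential idea, each whole file being one JSON document, only needs the content half of each (path, content) pair parsed as a single document. A pure-Python model of that behaviour for illustration (not Spark code; in Spark, feeding wholeTextFiles(...).map(_._2) to spark.read.json would be the analogous approach):

```python
import json

def parse_whole_file_json(name_content_pairs):
    """Treat each (filename, content) pair as one JSON document and parse
    only the content. Newlines and indentation inside the document are
    insignificant to the parser, so no whitespace-stripping is needed."""
    return [json.loads(content) for _, content in name_content_pairs]
```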
[jira] [Resolved] (SPARK-17819) Specified database in JDBC URL is ignored when connecting to thriftserver
[ https://issues.apache.org/jira/browse/SPARK-17819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-17819. - Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 2.1.0 > Specified database in JDBC URL is ignored when connecting to thriftserver > - > > Key: SPARK-17819 > URL: https://issues.apache.org/jira/browse/SPARK-17819 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.0.0, 2.0.1 >Reporter: Todd Nemet >Assignee: Dongjoon Hyun > Fix For: 2.1.0 > > > Filing this based on an email thread with Reynold Xin. From the > [docs|http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server], > the JDBC connection URL to the thriftserver looks like: > {code} > beeline> !connect > jdbc:hive2://<host>:<port>/<database>?hive.server2.transport.mode=http;hive.server2.thrift.http.path=<http_endpoint> > {code} > However, any database specified in the URL is ignored and the session ends up in the default > schema. I'm running these with -e commands, but the shell shows the same > behavior. 
> In 2.0.1, I created a table foo in schema spark_jira: > {code} > [558|18:01:20] ~/Documents/spark/spark$ bin/beeline -u > jdbc:hive2://localhost:10006/spark_jira -n hive -e "show tables" > Connecting to jdbc:hive2://localhost:10006/spark_jira > 16/10/06 18:01:28 INFO jdbc.Utils: Supplied authorities: localhost:10006 > 16/10/06 18:01:28 INFO jdbc.Utils: Resolved authority: localhost:10006 > 16/10/06 18:01:28 INFO jdbc.HiveConnection: Will try to open client transport > with JDBC Uri: jdbc:hive2://localhost:10006/spark_jira > Connected to: Spark SQL (version 2.0.1) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > ++--+--+ > | tableName | isTemporary | > ++--+--+ > ++--+--+ > No rows selected (0.558 seconds) > Beeline version 1.2.1.spark2 by Apache Hive > Closing: 0: jdbc:hive2://localhost:10006/spark_jira > [559|18:01:30] ~/Documents/spark/spark$ bin/beeline -u > jdbc:hive2://localhost:10006/spark_jira -n hive -e "show tables in spark_jira" > Connecting to jdbc:hive2://localhost:10006/spark_jira > 16/10/06 18:01:34 INFO jdbc.Utils: Supplied authorities: localhost:10006 > 16/10/06 18:01:34 INFO jdbc.Utils: Resolved authority: localhost:10006 > 16/10/06 18:01:34 INFO jdbc.HiveConnection: Will try to open client transport > with JDBC Uri: jdbc:hive2://localhost:10006/spark_jira > Connected to: Spark SQL (version 2.0.1) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > ++--+--+ > | tableName | isTemporary | > ++--+--+ > | foo| false| > ++--+--+ > 1 row selected (0.664 seconds) > Beeline version 1.2.1.spark2 by Apache Hive > Closing: 0: jdbc:hive2://localhost:10006/spark_jira > {code} > I also see this in Spark 1.6.2: > {code} > [555|18:13:32] ~/Documents/spark/spark16$ bin/beeline -u > jdbc:hive2://localhost:10005/spark_jira -n hive -e "show tables" > Connecting to jdbc:hive2://localhost:10005/spark_jira > 16/10/06 18:13:37 INFO jdbc.Utils: Supplied authorities: 
localhost:10005 > 16/10/06 18:13:37 INFO jdbc.Utils: Resolved authority: localhost:10005 > 16/10/06 18:13:37 INFO jdbc.HiveConnection: Will try to open client transport > with JDBC Uri: jdbc:hive2://localhost:10005/spark_jira > Connected to: Spark SQL (version 1.6.2) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > +--+--+--+ > | tableName | isTemporary | > +--+--+--+ > | all_types| false| > | order_items | false| > | orders | false| > | users| false| > +--+--+--+ > 4 rows selected (0.653 seconds) > Beeline version 1.2.1.spark2 by Apache Hive > Closing: 0: jdbc:hive2://localhost:10005/spark_jira > [556|18:13:39] ~/Documents/spark/spark16$ bin/beeline -u > jdbc:hive2://localhost:10005/spark_jira -n hive -e "show tables in spark_jira" > Connecting to jdbc:hive2://localhost:10005/spark_jira > 16/10/06 18:13:45 INFO jdbc.Utils: Supplied authorities: localhost:10005 > 16/10/06 18:13:45 INFO jdbc.Utils: Resolved authority: localhost:10005 > 16/10/06 18:13:45 INFO jdbc.HiveConnection: Will try to open client transport > with JDBC Uri: jdbc:hive2://localhost:10005/spark_jira > Connected to: Spark SQL (version 1.6.2) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > ++--+--+ > | tableName | isTemporary | > ++--+--+ > | foo| false| >
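The fix for this issue boils down to honouring the database segment of the JDBC URL when the session is opened. For illustration only (this is not Spark's code), extracting that segment from a hive2 URL, so that session setup can issue the equivalent of a USE statement, looks like:

```python
def database_from_jdbc_url(url):
    """Extract the database segment from a hive2 JDBC URL, e.g.
    'jdbc:hive2://localhost:10006/spark_jira;key=val' -> 'spark_jira'.
    Returns 'default' when no database is given. Illustrative sketch."""
    rest = url.split("://", 1)[1] if "://" in url else url
    # drop session/config parts introduced by ';' or '?'
    for sep in (";", "?"):
        rest = rest.split(sep, 1)[0]
    parts = rest.split("/", 1)
    if len(parts) == 2 and parts[1]:
        return parts[1]
    return "default"
```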
[jira] [Commented] (SPARK-17819) Specified database in JDBC URL is ignored when connecting to thriftserver
[ https://issues.apache.org/jira/browse/SPARK-17819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581049#comment-15581049 ] Dongjoon Hyun commented on SPARK-17819: --- Hi, [~rxin]. Could you review the PR? In fact, it was an addition of 3 lines. > Specified database in JDBC URL is ignored when connecting to thriftserver > - > > Key: SPARK-17819 > URL: https://issues.apache.org/jira/browse/SPARK-17819 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.0.0, 2.0.1 >Reporter: Todd Nemet > > Filing this based on an email thread with Reynold Xin. From the > [docs|http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server], > the JDBC connection URL to the thriftserver looks like: > {code} > beeline> !connect > jdbc:hive2://<host>:<port>/<database>?hive.server2.transport.mode=http;hive.server2.thrift.http.path=<http_endpoint> > {code} > However, any database specified in the URL is ignored and the session ends up in the default > schema. I'm running these with -e commands, but the shell shows the same > behavior. 
> In 2.0.1, I created a table foo in schema spark_jira: > {code} > [558|18:01:20] ~/Documents/spark/spark$ bin/beeline -u > jdbc:hive2://localhost:10006/spark_jira -n hive -e "show tables" > Connecting to jdbc:hive2://localhost:10006/spark_jira > 16/10/06 18:01:28 INFO jdbc.Utils: Supplied authorities: localhost:10006 > 16/10/06 18:01:28 INFO jdbc.Utils: Resolved authority: localhost:10006 > 16/10/06 18:01:28 INFO jdbc.HiveConnection: Will try to open client transport > with JDBC Uri: jdbc:hive2://localhost:10006/spark_jira > Connected to: Spark SQL (version 2.0.1) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > ++--+--+ > | tableName | isTemporary | > ++--+--+ > ++--+--+ > No rows selected (0.558 seconds) > Beeline version 1.2.1.spark2 by Apache Hive > Closing: 0: jdbc:hive2://localhost:10006/spark_jira > [559|18:01:30] ~/Documents/spark/spark$ bin/beeline -u > jdbc:hive2://localhost:10006/spark_jira -n hive -e "show tables in spark_jira" > Connecting to jdbc:hive2://localhost:10006/spark_jira > 16/10/06 18:01:34 INFO jdbc.Utils: Supplied authorities: localhost:10006 > 16/10/06 18:01:34 INFO jdbc.Utils: Resolved authority: localhost:10006 > 16/10/06 18:01:34 INFO jdbc.HiveConnection: Will try to open client transport > with JDBC Uri: jdbc:hive2://localhost:10006/spark_jira > Connected to: Spark SQL (version 2.0.1) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > ++--+--+ > | tableName | isTemporary | > ++--+--+ > | foo| false| > ++--+--+ > 1 row selected (0.664 seconds) > Beeline version 1.2.1.spark2 by Apache Hive > Closing: 0: jdbc:hive2://localhost:10006/spark_jira > {code} > I also see this in Spark 1.6.2: > {code} > [555|18:13:32] ~/Documents/spark/spark16$ bin/beeline -u > jdbc:hive2://localhost:10005/spark_jira -n hive -e "show tables" > Connecting to jdbc:hive2://localhost:10005/spark_jira > 16/10/06 18:13:37 INFO jdbc.Utils: Supplied authorities: 
localhost:10005 > 16/10/06 18:13:37 INFO jdbc.Utils: Resolved authority: localhost:10005 > 16/10/06 18:13:37 INFO jdbc.HiveConnection: Will try to open client transport > with JDBC Uri: jdbc:hive2://localhost:10005/spark_jira > Connected to: Spark SQL (version 1.6.2) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > +--+--+--+ > | tableName | isTemporary | > +--+--+--+ > | all_types| false| > | order_items | false| > | orders | false| > | users| false| > +--+--+--+ > 4 rows selected (0.653 seconds) > Beeline version 1.2.1.spark2 by Apache Hive > Closing: 0: jdbc:hive2://localhost:10005/spark_jira > [556|18:13:39] ~/Documents/spark/spark16$ bin/beeline -u > jdbc:hive2://localhost:10005/spark_jira -n hive -e "show tables in spark_jira" > Connecting to jdbc:hive2://localhost:10005/spark_jira > 16/10/06 18:13:45 INFO jdbc.Utils: Supplied authorities: localhost:10005 > 16/10/06 18:13:45 INFO jdbc.Utils: Resolved authority: localhost:10005 > 16/10/06 18:13:45 INFO jdbc.HiveConnection: Will try to open client transport > with JDBC Uri: jdbc:hive2://localhost:10005/spark_jira > Connected to: Spark SQL (version 1.6.2) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > ++--+--+ > | tableName | isTemporary | > ++--+--+ > | foo| false| > ++--+--+ > 1 row selected
[jira] [Commented] (SPARK-17963) Add examples (extend) in each function and improve documentation with arguments
[ https://issues.apache.org/jira/browse/SPARK-17963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581022#comment-15581022 ] Reynold Xin commented on SPARK-17963: - It's definitely useful to do, but I don't think we need to have examples for every function (e.g. count). > Add examples (extend) in each function and improve documentation with > arguments > --- > > Key: SPARK-17963 > URL: https://issues.apache.org/jira/browse/SPARK-17963 > Project: Spark > Issue Type: Documentation > Components: SQL >Reporter: Hyukjin Kwon > > Currently, it seems the function documentation is inconsistent and, for many functions, does not have > examples ({{Extended Usage}}). > For example, some functions have bad indentation, as below: > {code} > spark-sql> DESCRIBE FUNCTION EXTENDED approx_count_distinct; > Function: approx_count_distinct > Class: org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus > Usage: approx_count_distinct(expr) - Returns the estimated cardinality by > HyperLogLog++. > approx_count_distinct(expr, relativeSD=0.05) - Returns the estimated > cardinality by HyperLogLog++ > with relativeSD, the maximum estimation error allowed. > Extended Usage: > No example for approx_count_distinct. > {code} > {code} > spark-sql> DESCRIBE FUNCTION EXTENDED count; > Function: count > Class: org.apache.spark.sql.catalyst.expressions.aggregate.Count > Usage: count(*) - Returns the total number of retrieved rows, including rows > containing NULL values. > count(expr) - Returns the number of rows for which the supplied > expression is non-NULL. > count(DISTINCT expr[, expr...]) - Returns the number of rows for which > the supplied expression(s) are unique and non-NULL. > Extended Usage: > No example for count. 
> {code} > whereas some do have a pretty one: > {code} > spark-sql> DESCRIBE FUNCTION EXTENDED percentile_approx; > Function: percentile_approx > Class: > org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile > Usage: > percentile_approx(col, percentage [, accuracy]) - Returns the > approximate percentile value of numeric > column `col` at the given percentage. The value of percentage must be > between 0.0 > and 1.0. The `accuracy` parameter (default: 1) is a positive > integer literal which > controls approximation accuracy at the cost of memory. Higher value of > `accuracy` yields > better accuracy, `1.0/accuracy` is the relative error of the > approximation. > percentile_approx(col, array(percentage1 [, percentage2]...) [, > accuracy]) - Returns the approximate > percentile array of column `col` at the given percentage array. Each > value of the > percentage array must be between 0.0 and 1.0. The `accuracy` parameter > (default: 1) is > a positive integer literal which controls approximation accuracy at > the cost of memory. > Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is > the relative error of > the approximation. > Extended Usage: > No example for percentile_approx. > {code} > Also, there is some inconsistent spacing, for example, > {{_FUNC_(a,b)}} and {{_FUNC_(a, b)}} (note the spacing between arguments). > It'd be nicer if most of them had a good example with possible argument > types. > Suggested format is as below for multiple line usage: > {code} > spark-sql> DESCRIBE FUNCTION EXTENDED rand; > Function: rand > Class: org.apache.spark.sql.catalyst.expressions.Rand > Usage: > rand() - Returns a random column with i.i.d. uniformly distributed > values in [0, 1]. > seed is given randomly. > rand(seed) - Returns a random column with i.i.d. uniformly distributed > values in [0, 1]. > seed should be an integer/long/NULL literal. 
> Extended Usage: > > SELECT rand(); > 0.9629742951434543 > > SELECT rand(0); > 0.8446490682263027 > > SELECT rand(NULL); > 0.8446490682263027 > {code} > For single line usage: > {code} > spark-sql> DESCRIBE FUNCTION EXTENDED date_add; > Function: date_add > Class: org.apache.spark.sql.catalyst.expressions.DateAdd > Usage: date_add(start_date, num_days) - Returns the date that is num_days > after start_date. > Extended Usage: > > SELECT date_add('2016-07-30', 1); > '2016-07-31' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python
[ https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581031#comment-15581031 ] Reynold Xin commented on SPARK-10915: - What's the use case? Is it not possible to just run df.rdd.reduceByKey? The main issue is that it is actually very convoluted and complex to implement this properly. > Add support for UDAFs in Python > --- > > Key: SPARK-10915 > URL: https://issues.apache.org/jira/browse/SPARK-10915 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Justin Uang > > This should support python defined lambdas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11524) Support SparkR with Mesos cluster
[ https://issues.apache.org/jira/browse/SPARK-11524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581023#comment-15581023 ] Susan X. Huynh commented on SPARK-11524: Thanks for the advice and for breaking down the sub-issues. For Mesos _cluster_ mode, is there additional work, or is it just the same problem of locating sparkr.zip on the slave nodes? > Support SparkR with Mesos cluster > - > > Key: SPARK-11524 > URL: https://issues.apache.org/jira/browse/SPARK-11524 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 1.5.1 >Reporter: Sun Rui > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17968) Support using 3rd-party R packages on Mesos
Sun Rui created SPARK-17968: --- Summary: Support using 3rd-party R packages on Mesos Key: SPARK-17968 URL: https://issues.apache.org/jira/browse/SPARK-17968 Project: Spark Issue Type: Sub-task Components: SparkR Affects Versions: 2.0.1 Reporter: Sun Rui -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool
[ https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580989#comment-15580989 ] Low Chin Wei commented on SPARK-13747: -- java.lang.IllegalArgumentException: spark.sql.execution.id is already set at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:81) ~[spark-sql_2.11-2.0.1.jar:2.0.1] at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2546) ~[spark-sql_2.11-2.0.1.jar:2.0.1] at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2192) ~[spark-sql_2.11-2.0.1.jar:2.0.1] at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2199) ~[spark-sql_2.11-2.0.1.jar:2.0.1] at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2227) ~[spark-sql_2.11-2.0.1.jar:2.0.1] at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2226) ~[spark-sql_2.11-2.0.1.jar:2.0.1] at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2559) ~[spark-sql_2.11-2.0.1.jar:2.0.1] at org.apache.spark.sql.Dataset.count(Dataset.scala:2226) ~[spark-sql_2.11-2.0.1.jar:2.0.1] > Concurrent execution in SQL doesn't work with Scala ForkJoinPool > > > Key: SPARK-13747 > URL: https://issues.apache.org/jira/browse/SPARK-13747 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Shixiong Zhu >Assignee: Andrew Or > Fix For: 2.0.0 > > > Running the following code may fail > {code} > (1 to 100).par.foreach { _ => > println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()) > } > java.lang.IllegalArgumentException: spark.sql.execution.id is already set > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) > > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) > {code} > This is because SparkContext.runJob can be suspended when using a > 
ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global) as it > calls Await.ready (introduced by https://github.com/apache/spark/pull/9264). > So when SparkContext.runJob is suspended, ForkJoinPool will run another task > in the same thread; however, the local properties have been polluted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
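Since the root cause is the ForkJoinPool reusing a suspended thread (and with it inheritable local properties such as {{spark.sql.execution.id}}), a commonly suggested workaround is to drive the concurrent queries from a fixed-size thread pool instead, where a blocked thread never picks up a second task. A rough sketch, assuming a running spark-shell session with {{spark}} in scope:

{code}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Unlike scala.concurrent.ExecutionContext.Implicits.global (a ForkJoinPool),
// a fixed pool never schedules another task onto a thread that is blocked
// inside SparkContext.runJob, so spark.sql.execution.id is not clobbered.
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

import spark.implicits._

val futures = (1 to 100).map { _ =>
  Future {
    spark.sparkContext.parallelize(1 to 5).map(i => (i, i)).toDF("a", "b").count()
  }
}
futures.foreach(f => println(Await.result(f, Duration.Inf)))
{code}

This only sidesteps the thread reuse; the underlying inheritance of local properties across threads is what the linked PR discussion is about.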
[jira] [Commented] (SPARK-11524) Support SparkR with Mesos cluster
[ https://issues.apache.org/jira/browse/SPARK-11524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580960#comment-15580960 ] Sun Rui commented on SPARK-11524: - Great, go ahead. Please look at the linked JIRA issues as references. Basically, to enable SparkR in Mesos client mode, the problem is to locate sparkr.zip on the slave nodes and set the environment for R so that it can load the SparkR package. (No need to pass sparkr.zip from the driver node to the slave nodes, because the file is already part of the Spark distribution.) When locating sparkr.zip, two cases should be considered: a locally installed Spark distribution, or one downloaded from a publicly accessible location. (check spark.mesos.executor.home) I just broke this issue into sub-issues, so we can address them one by one > Support SparkR with Mesos cluster > - > > Key: SPARK-11524 > URL: https://issues.apache.org/jira/browse/SPARK-11524 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 1.5.1 >Reporter: Sun Rui > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17967) Support for list or other types as an option for datasources
Hyukjin Kwon created SPARK-17967: Summary: Support for list or other types as an option for datasources Key: SPARK-17967 URL: https://issues.apache.org/jira/browse/SPARK-17967 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.1, 2.0.0 Reporter: Hyukjin Kwon This was discussed in SPARK-17878. For most datasources, a string/long/boolean/double value seems okay as an option, but it is not enough for a datasource such as CSV. As this is an interface for external datasources, it would likely affect several ones out there. I took a first look, but it seems difficult to support this (a lot would need to change). One suggestion is to support this as a JSON array. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
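As an illustration of the JSON-array suggestion, the option value could stay a plain string at the existing API boundary and be decoded on the datasource side with json4s, which Spark already bundles. The option name {{nullValues}} below is hypothetical, purely for illustration:

{code}
import org.json4s.DefaultFormats
import org.json4s.jackson.JsonMethods.parse

implicit val formats = DefaultFormats

// Caller side: the list is serialized as a JSON array inside an ordinary
// string option, so the string-only option interface is unchanged.
val options = Map("nullValues" -> """["NA", "null", "-"]""")

// Datasource side: decode the string back into a Seq[String].
val nullValues: Seq[String] =
  options.get("nullValues").map(parse(_).extract[Seq[String]]).getOrElse(Nil)
{code}

This keeps source compatibility for existing datasources while letting list-valued options ride on the current interface.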
[jira] [Commented] (SPARK-17967) Support for list or other types as an option for datasources
[ https://issues.apache.org/jira/browse/SPARK-17967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580958#comment-15580958 ] Hyukjin Kwon commented on SPARK-17967: -- I am leaving SPARK-17878 as a related issue, but that does not mean this one blocks that JIRA. > Support for list or other types as an option for datasources > > > Key: SPARK-17967 > URL: https://issues.apache.org/jira/browse/SPARK-17967 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Hyukjin Kwon > > This was discussed in SPARK-17878. > For most datasources, a string/long/boolean/double value seems okay as > an option, but it is not enough for a datasource such as CSV. As this > is an interface for external datasources, it would likely affect several > ones out there. > I took a first look, but it seems difficult to support this (a lot would need to > change). > One suggestion is to support this as a JSON array. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17966) Support Spark packages with R code on Mesos
Sun Rui created SPARK-17966: --- Summary: Support Spark packages with R code on Mesos Key: SPARK-17966 URL: https://issues.apache.org/jira/browse/SPARK-17966 Project: Spark Issue Type: Sub-task Components: SparkR Affects Versions: 2.0.1 Reporter: Sun Rui -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17965) Enable SparkR with Mesos cluster mode
Sun Rui created SPARK-17965: --- Summary: Enable SparkR with Mesos cluster mode Key: SPARK-17965 URL: https://issues.apache.org/jira/browse/SPARK-17965 Project: Spark Issue Type: Sub-task Components: SparkR Affects Versions: 2.0.1 Reporter: Sun Rui -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17964) Enable SparkR with Mesos client mode
Sun Rui created SPARK-17964: --- Summary: Enable SparkR with Mesos client mode Key: SPARK-17964 URL: https://issues.apache.org/jira/browse/SPARK-17964 Project: Spark Issue Type: Sub-task Components: SparkR Affects Versions: 2.0.1 Reporter: Sun Rui -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17963) Add examples (extend) in each function and improve documentation with arguments
[ https://issues.apache.org/jira/browse/SPARK-17963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-17963: - Description: Currently, it seems function documentation is inconsistent and does not have examples ({{extend}} much. For example, some functions have a bad indentation as below: {code} spark-sql> DESCRIBE FUNCTION EXTENDED approx_count_distinct; Function: approx_count_distinct Class: org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus Usage: approx_count_distinct(expr) - Returns the estimated cardinality by HyperLogLog++. approx_count_distinct(expr, relativeSD=0.05) - Returns the estimated cardinality by HyperLogLog++ with relativeSD, the maximum estimation error allowed. Extended Usage: No example for approx_count_distinct. {code} {code} spark-sql> DESCRIBE FUNCTION EXTENDED count; Function: count Class: org.apache.spark.sql.catalyst.expressions.aggregate.Count Usage: count(*) - Returns the total number of retrieved rows, including rows containing NULL values. count(expr) - Returns the number of rows for which the supplied expression is non-NULL. count(DISTINCT expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL. Extended Usage: No example for count. {code} whereas some do have a pretty one {code} spark-sql> DESCRIBE FUNCTION EXTENDED percentile_approx; Function: percentile_approx Class: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile Usage: percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric column `col` at the given percentage. The value of percentage must be between 0.0 and 1.0. The `accuracy` parameter (default: 1) is a positive integer literal which controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the relative error of the approximation. 
percentile_approx(col, array(percentage1 [, percentage2]...) [, accuracy]) - Returns the approximate percentile array of column `col` at the given percentage array. Each value of the percentage array must be between 0.0 and 1.0. The `accuracy` parameter (default: 1) is a positive integer literal which controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the relative error of the approximation. Extended Usage: No example for percentile_approx. {code} Also, there are several inconsistent indentation, for example, {{_FUNC_(a,b)}} and {{_FUNC_(a, b)}} (note the indentation between arguments. It'd be nicer if most of them have a good example with possible argument types. Suggested format is as below for multiple line usage: {code} spark-sql> DESCRIBE FUNCTION EXTENDED rand; Function: rand Class: org.apache.spark.sql.catalyst.expressions.Rand Usage: rand() - Returns a random column with i.i.d. uniformly distributed values in [0, 1]. seed is given randomly. rand(seed) - Returns a random column with i.i.d. uniformly distributed values in [0, 1]. seed should be an integer/long/NULL literal. Extended Usage: > SELECT rand(); 0.9629742951434543 > SELECT rand(0); 0.8446490682263027 > SELECT rand(NULL); 0.8446490682263027 {code} For single line usage: {code} spark-sql> DESCRIBE FUNCTION EXTENDED date_add; Function: date_add Class: org.apache.spark.sql.catalyst.expressions.DateAdd Usage: date_add(start_date, num_days) - Returns the date that is num_days after start_date. Extended Usage: > SELECT date_add('2016-07-30', 1); '2016-07-31' {code} was: Currently, it seems function documentation is inconsistent and does not have examples ({{extend}} much. 
For example, some functions have a bad indentation as below: {code} spark-sql> DESCRIBE FUNCTION last; Function: last Class: org.apache.spark.sql.catalyst.expressions.aggregate.Last Usage: last(expr,isIgnoreNull) - Returns the last value of `child` for a group of rows. last(expr,isIgnoreNull=false) - Returns the last value of `child` for a group of rows. If isIgnoreNull is true, returns only non-null values. {code} {code} spark-sql> DESCRIBE FUNCTION EXTENDED count; Function: count Class: org.apache.spark.sql.catalyst.expressions.aggregate.Count Usage: count(*) - Returns the total number of retrieved rows, including rows containing NULL values. count(expr) - Returns the number of rows for which the supplied expression is non-NULL. count(DISTINCT expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL. Extended Usage: No example for count. {code} whereas some do have a pretty one {code} spark-sql> DESCRIBE FUNCTION EXTENDED percentile_approx; Function: percentile_approx Class:
[jira] [Updated] (SPARK-17963) Add examples (extend) in each function and improve documentation with arguments
[ https://issues.apache.org/jira/browse/SPARK-17963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-17963: - Description: Currently, it seems function documentation is inconsistent and does not have examples ({{extend}} much. For example, some functions have a bad indentation as below: {code} spark-sql> DESCRIBE FUNCTION last; Function: last Class: org.apache.spark.sql.catalyst.expressions.aggregate.Last Usage: last(expr,isIgnoreNull) - Returns the last value of `child` for a group of rows. last(expr,isIgnoreNull=false) - Returns the last value of `child` for a group of rows. If isIgnoreNull is true, returns only non-null values. {code} {code} spark-sql> DESCRIBE FUNCTION EXTENDED count; Function: count Class: org.apache.spark.sql.catalyst.expressions.aggregate.Count Usage: count(*) - Returns the total number of retrieved rows, including rows containing NULL values. count(expr) - Returns the number of rows for which the supplied expression is non-NULL. count(DISTINCT expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL. Extended Usage: No example for count. {code} whereas some do have a pretty one {code} spark-sql> DESCRIBE FUNCTION EXTENDED percentile_approx; Function: percentile_approx Class: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile Usage: percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric column `col` at the given percentage. The value of percentage must be between 0.0 and 1.0. The `accuracy` parameter (default: 1) is a positive integer literal which controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the relative error of the approximation. percentile_approx(col, array(percentage1 [, percentage2]...) [, accuracy]) - Returns the approximate percentile array of column `col` at the given percentage array. 
Each value of the percentage array must be between 0.0 and 1.0. The `accuracy` parameter (default: 1) is a positive integer literal which controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the relative error of the approximation. Extended Usage: No example for percentile_approx. {code} Also, there are several inconsistent indentation, for example, {{_FUNC_(a,b)}} and {{_FUNC_(a, b)}} (note the indentation between arguments. It'd be nicer if most of them have a good example with possible argument types. Suggested format is as below for multiple line usage: {code} spark-sql> DESCRIBE FUNCTION EXTENDED rand; Function: rand Class: org.apache.spark.sql.catalyst.expressions.Rand Usage: rand() - Returns a random column with i.i.d. uniformly distributed values in [0, 1]. seed is given randomly. rand(seed) - Returns a random column with i.i.d. uniformly distributed values in [0, 1]. seed should be an integer/long/NULL literal. Extended Usage: > SELECT rand(); 0.9629742951434543 > SELECT rand(0); 0.8446490682263027 > SELECT rand(NULL); 0.8446490682263027 {code} For single line usage: {code} spark-sql> DESCRIBE FUNCTION EXTENDED date_add; Function: date_add Class: org.apache.spark.sql.catalyst.expressions.DateAdd Usage: date_add(start_date, num_days) - Returns the date that is num_days after start_date. Extended Usage: > SELECT date_add('2016-07-30', 1); '2016-07-31' {code} was: Currently, it seems function documentation is inconsistent and does not have examples ({{extend}} much. For example, some functions have a bad indentation as below: {code} spark-sql> DESCRIBE FUNCTION last; Function: last Class: org.apache.spark.sql.catalyst.expressions.aggregate.Last Usage: last(expr,isIgnoreNull) - Returns the last value of `child` for a group of rows. last(expr,isIgnoreNull=false) - Returns the last value of `child` for a group of rows. If isIgnoreNull is true, returns only non-null values. 
{code} {code} spark-sql> DESCRIBE FUNCTION EXTENDED count; Function: count Class: org.apache.spark.sql.catalyst.expressions.aggregate.Count Usage: count(*) - Returns the total number of retrieved rows, including rows containing NULL values. count(expr) - Returns the number of rows for which the supplied expression is non-NULL. count(DISTINCT expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL. Extended Usage: No example for count. {code} whereas some do have a pretty one {code} spark-sql> DESCRIBE FUNCTION EXTENDED percentile_approx; Function: percentile_approx Class: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile Usage: percentile_approx(col, percentage [, accuracy]) - Returns the
[jira] [Commented] (SPARK-17963) Add examples (extend) in each function and improve documentation with arguments
[ https://issues.apache.org/jira/browse/SPARK-17963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580942#comment-15580942 ] Hyukjin Kwon commented on SPARK-17963: -- Hi [~rxin] and [~srowen], I guess the PR would be pretty big. So, I would like you both to confirm this first. Do you think it'd be sensible? > Add examples (extend) in each function and improve documentation with > arguments > --- > > Key: SPARK-17963 > URL: https://issues.apache.org/jira/browse/SPARK-17963 > Project: Spark > Issue Type: Documentation > Components: SQL >Reporter: Hyukjin Kwon > > Currently, it seems function documentation is inconsistent and does not have > many examples in {{Extended Usage}}. > For example, some functions have a bad indentation as below: > {code} > spark-sql> DESCRIBE FUNCTION last; > Function: last > Class: org.apache.spark.sql.catalyst.expressions.aggregate.Last > Usage: last(expr,isIgnoreNull) - Returns the last value of `child` for a > group of rows. > last(expr,isIgnoreNull=false) - Returns the last value of `child` for a > group of rows. > If isIgnoreNull is true, returns only non-null values. 
> {code} > whereas some do have a pretty one > {code} > spark-sql> DESCRIBE FUNCTION EXTENDED percentile_approx; > Function: percentile_approx > Class: > org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile > Usage: > percentile_approx(col, percentage [, accuracy]) - Returns the > approximate percentile value of numeric > column `col` at the given percentage. The value of percentage must be > between 0.0 > and 1.0. The `accuracy` parameter (default: 1) is a positive > integer literal which > controls approximation accuracy at the cost of memory. Higher value of > `accuracy` yields > better accuracy, `1.0/accuracy` is the relative error of the > approximation. > percentile_approx(col, array(percentage1 [, percentage2]...) [, > accuracy]) - Returns the approximate > percentile array of column `col` at the given percentage array. Each > value of the > percentage array must be between 0.0 and 1.0. The `accuracy` parameter > (default: 1) is >a positive integer literal which controls approximation accuracy at > the cost of memory. >Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is > the relative error of >the approximation. > Extended Usage: > No example for percentile_approx. > {code} > Also, there are several inconsistent indentation, for example, > {{_FUNC_(a,b)}} and {{_FUNC_(a, b)}} (note the indentation between arguments. > It'd be nicer if most of them have a good example with possible argument > types. > Suggested format is as below for multiple line usage: > {code} > spark-sql> DESCRIBE FUNCTION EXTENDED rand; > Function: rand > Class: org.apache.spark.sql.catalyst.expressions.Rand > Usage: > rand() - Returns a random column with i.i.d. uniformly distributed > values in [0, 1]. > seed is given randomly. > rand(seed) - Returns a random column with i.i.d. uniformly distributed > values in [0, 1]. > seed should be an integer/long/NULL literal. 
> Extended Usage: > > SELECT rand(); > 0.9629742951434543 > > SELECT rand(0); > 0.8446490682263027 > > SELECT rand(NULL); > 0.8446490682263027 > {code} > For single line usage: > {code} > spark-sql> DESCRIBE FUNCTION EXTENDED date_add; > Function: date_add > Class: org.apache.spark.sql.catalyst.expressions.DateAdd > Usage: date_add(start_date, num_days) - Returns the date that is num_days > after start_date. > Extended Usage: > > SELECT date_add('2016-07-30', 1); > '2016-07-31' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17963) Add examples (extend) in each function and improve documentation with arguments
Hyukjin Kwon created SPARK-17963: Summary: Add examples (extend) in each function and improve documentation with arguments Key: SPARK-17963 URL: https://issues.apache.org/jira/browse/SPARK-17963 Project: Spark Issue Type: Documentation Components: SQL Reporter: Hyukjin Kwon Currently, it seems function documentation is inconsistent and does not have many examples in {{Extended Usage}}. For example, some functions have a bad indentation as below: {code} spark-sql> DESCRIBE FUNCTION last; Function: last Class: org.apache.spark.sql.catalyst.expressions.aggregate.Last Usage: last(expr,isIgnoreNull) - Returns the last value of `child` for a group of rows. last(expr,isIgnoreNull=false) - Returns the last value of `child` for a group of rows. If isIgnoreNull is true, returns only non-null values. {code} {code} spark-sql> DESCRIBE FUNCTION EXTENDED count; Function: count Class: org.apache.spark.sql.catalyst.expressions.aggregate.Count Usage: count(*) - Returns the total number of retrieved rows, including rows containing NULL values. count(expr) - Returns the number of rows for which the supplied expression is non-NULL. count(DISTINCT expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL. Extended Usage: No example for count. {code} whereas some do have a pretty one: {code} spark-sql> DESCRIBE FUNCTION EXTENDED percentile_approx; Function: percentile_approx Class: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile Usage: percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric column `col` at the given percentage. The value of percentage must be between 0.0 and 1.0. The `accuracy` parameter (default: 1) is a positive integer literal which controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the relative error of the approximation. percentile_approx(col, array(percentage1 [, percentage2]...) 
[, accuracy]) - Returns the approximate percentile array of column `col` at the given percentage array. Each value of the percentage array must be between 0.0 and 1.0. The `accuracy` parameter (default: 1) is a positive integer literal which controls approximation accuracy at the cost of memory. A higher value of `accuracy` yields better accuracy; `1.0/accuracy` is the relative error of the approximation. Extended Usage: No example for percentile_approx. {code} Also, there are several indentation inconsistencies, for example, {{_FUNC_(a,b)}} and {{_FUNC_(a, b)}} (note the spacing between the arguments). It'd be nicer if most of them had a good example with possible argument types. Suggested format is as below for multi-line usage: {code} spark-sql> DESCRIBE FUNCTION EXTENDED rand; Function: rand Class: org.apache.spark.sql.catalyst.expressions.Rand Usage: rand() - Returns a random column with i.i.d. uniformly distributed values in [0, 1]. seed is given randomly. rand(seed) - Returns a random column with i.i.d. uniformly distributed values in [0, 1]. seed should be an integer/long/NULL literal. Extended Usage: > SELECT rand(); 0.9629742951434543 > SELECT rand(0); 0.8446490682263027 > SELECT rand(NULL); 0.8446490682263027 {code} For single-line usage: {code} spark-sql> DESCRIBE FUNCTION EXTENDED date_add; Function: date_add Class: org.apache.spark.sql.catalyst.expressions.DateAdd Usage: date_add(start_date, num_days) - Returns the date that is num_days after start_date. Extended Usage: > SELECT date_add('2016-07-30', 1); '2016-07-31' {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
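For orientation, the usage/extended text shown by DESCRIBE FUNCTION comes from the {{ExpressionDescription}} annotation on each Catalyst expression, so the proposal largely amounts to filling in the {{extended}} field consistently across expressions. A sketch of the shape only; the wording is illustrative, not the actual Spark source, and the fragment is not compilable on its own:

{code}
@ExpressionDescription(
  usage = "_FUNC_(start_date, num_days) - Returns the date that is num_days after start_date.",
  extended = """
    > SELECT _FUNC_('2016-07-30', 1);
    '2016-07-31'
  """)
case class DateAdd(startDate: Expression, days: Expression)
  extends BinaryExpression with ImplicitCastInputTypes {
  // ... implementation elided ...
}
{code}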
[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python
[ https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580917#comment-15580917 ] Tobi Bosede commented on SPARK-10915: - Thoughts [~davies] and [~mgummelt]? Refer to https://www.mail-archive.com/user@spark.apache.org/msg58125.html > Add support for UDAFs in Python > --- > > Key: SPARK-10915 > URL: https://issues.apache.org/jira/browse/SPARK-10915 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Justin Uang > > This should support python defined lambdas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17898) --repositories needs username and password
[ https://issues.apache.org/jira/browse/SPARK-17898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580857#comment-15580857 ] lichenglin commented on SPARK-17898: I have found a way to declare the username and password: --repositories http://username:passw...@wx.bjdv.com:8081/nexus/content/groups/bigdata/ Maybe we should add this feature to the documentation? > --repositories needs username and password > --- > > Key: SPARK-17898 > URL: https://issues.apache.org/jira/browse/SPARK-17898 > Project: Spark > Issue Type: Wish >Affects Versions: 2.0.1 >Reporter: lichenglin > > My private repositories require a username and password to access. > I can't find a way to declare the username and password when submitting a > Spark application > {code} > bin/spark-submit --repositories > http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ --packages > com.databricks:spark-csv_2.10:1.2.0 --class > org.apache.spark.examples.SparkPi --master local[8] > examples/jars/spark-examples_2.11-2.0.1.jar 100 > {code} > The repo http://wx.bjdv.com:8081/nexus/content/groups/bigdata/ needs a username > and password -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
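Spelling out the workaround from the comment above: the credentials are embedded in the --repositories URL itself in {{user:password@host}} form, which the commenter reports working with spark-submit's dependency resolution. A sketch with made-up credentials ({{deploy}}/{{secret}}); note the password ends up visible in the process list and shell history, so throwaway credentials are advisable:

{code}
# Hypothetical credentials, shown only to illustrate the URL shape.
REPO_URL="http://deploy:secret@wx.bjdv.com:8081/nexus/content/groups/bigdata/"

bin/spark-submit \
  --repositories "$REPO_URL" \
  --packages com.databricks:spark-csv_2.10:1.2.0 \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  examples/jars/spark-examples_2.11-2.0.1.jar 100
{code}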
[jira] [Updated] (SPARK-17962) DataFrame/Dataset join not producing correct results in Spark 2.0/Yarn
[ https://issues.apache.org/jira/browse/SPARK-17962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen Hankinson updated SPARK-17962: -- Description: Environment can be reproduced via this git repo using the Deploy to Azure button: https://github.com/shankinson/spark (The cluster name must be the same as the resource group name used for this to launch properly, login with username hadoop, and launch the shell with /home/hadoop/spark-2.0.0-bin-hadoop2.7/bin/spark-shell --master yarn --deploy-mode client --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.catalogImplementation=in-memory --driver-memory 10g --driver-cores 4) We have a cluster running Apache Spark 2.0 on Hadoop 2.7.2, Centos 7.2. We had written some new code using the Spark DataFrame/DataSet APIs but are noticing incorrect results on a join after writing and then reading data to Windows Azure Storage Blobs (The default HDFS location). I've been able to duplicate the issue with the following snippet of code running on the cluster. case class UserDimensions(user: Long, dimension: Long, score: Double) case class CentroidClusterScore(dimension: Long, cluster: Int, score: Double) val dims = sc.parallelize(Array(UserDimensions(12345, 0, 1.0))).toDS val cent = sc.parallelize(Array(CentroidClusterScore(0, 1, 1.0),CentroidClusterScore(1, 0, 1.0),CentroidClusterScore(2, 2, 1.0))).toDS dims.show cent.show dims.join(cent, dims("dimension") === cent("dimension") ).show outputs || user||dimension||score|| |12345|0| 1.0| ||dimension||cluster||score|| |0| 1| 1.0| |1| 0| 1.0| |2| 2| 1.0| || user||dimension||score||dimension||cluster||score|| |12345|0| 1.0|0| 1| 1.0| which is correct. 
However after writing and reading the data, we see this dims.write.mode("overwrite").save("/tmp/dims2.parquet") cent.write.mode("overwrite").save("/tmp/cent2.parquet") val dims2 = spark.read.load("/tmp/dims2.parquet").as[UserDimensions] val cent2 = spark.read.load("/tmp/cent2.parquet").as[CentroidClusterScore] dims2.show cent2.show dims2.join(cent2, dims2("dimension") === cent2("dimension") ).show outputs || user||dimension||score|| |12345|0| 1.0| ||dimension||cluster||score|| |0| 1| 1.0| |1| 0| 1.0| |2| 2| 1.0| || user||dimension||score||dimension||cluster||score|| |12345|0| 1.0| null| null| null| However, using the RDD API produces the correct result dims2.rdd.map( row => (row.dimension, row) ).join( cent2.rdd.map( row => (row.dimension, row) ) ).take(5) res5: Array[(Long, (UserDimensions, CentroidClusterScore))] = Array((0,(UserDimensions(12345,0,1.0),CentroidClusterScore(0,1,1.0 We've tried changing the output format to ORC instead of parquet, but we see the same results. Running Spark 2.0 locally, not on a cluster, does not have this issue. Also running spark in local mode on the master node of the Hadoop cluster also works. Only when running on top of YARN do we see this issue. This also seems very similar to this issue: https://issues.apache.org/jira/browse/SPARK-10896 We have also determined this appears to be related to the memory settings of the cluster. The worker machines have 56000MB available, the node manager memory is set to 54784M and executor memory set to 48407M when we see this issue happen. Lowering the executor memory to something like 28407M removes the issue from happening. was: Environment can be reproduced via this git repo using the Deploy to Azure button: https://github.com/shankinson/spark (The cluster name must be the same as the resource group name used for this to launch properly) We have a cluster running Apache Spark 2.0 on Hadoop 2.7.2, Centos 7.2. 
We had written some new code using the Spark DataFrame/DataSet APIs but are noticing incorrect results on a join after writing and then reading data to Windows Azure Storage Blobs (The default HDFS location). I've been able to duplicate the issue with the following snippet of code running on the cluster. case class UserDimensions(user: Long, dimension: Long, score: Double) case class CentroidClusterScore(dimension: Long, cluster: Int, score: Double) val dims = sc.parallelize(Array(UserDimensions(12345, 0, 1.0))).toDS val cent = sc.parallelize(Array(CentroidClusterScore(0, 1, 1.0),CentroidClusterScore(1, 0, 1.0),CentroidClusterScore(2, 2, 1.0))).toDS dims.show cent.show dims.join(cent, dims("dimension") === cent("dimension") ).show outputs || user||dimension||score|| |12345|0| 1.0| ||dimension||cluster||score|| |0| 1| 1.0| |1| 0| 1.0| |2| 2| 1.0| || user||dimension||score||dimension||cluster||score|| |12345|0| 1.0|0| 1| 1.0| which is correct. However after writing and reading the data, we see this
[jira] [Created] (SPARK-17962) DataFrame/Dataset join not producing correct results in Spark 2.0/Yarn
Stephen Hankinson created SPARK-17962:
--------------------------------------

Summary: DataFrame/Dataset join not producing correct results in Spark 2.0/Yarn
Key: SPARK-17962
URL: https://issues.apache.org/jira/browse/SPARK-17962
Project: Spark
Issue Type: Bug
Components: Spark Core, YARN
Affects Versions: 2.0.0
Environment: Centos 7.2, Hadoop 2.7.2, Spark 2.0.0
Reporter: Stephen Hankinson

Environment can be reproduced via this git repo using the Deploy to Azure button: https://github.com/shankinson/spark

We have a cluster running Apache Spark 2.0 on Hadoop 2.7.2, Centos 7.2. We had written some new code using the Spark DataFrame/DataSet APIs but are noticing incorrect results on a join after writing and then reading data to Windows Azure Storage Blobs (the default HDFS location). I've been able to duplicate the issue with the following snippet of code running on the cluster.

{code}
case class UserDimensions(user: Long, dimension: Long, score: Double)
case class CentroidClusterScore(dimension: Long, cluster: Int, score: Double)

val dims = sc.parallelize(Array(UserDimensions(12345, 0, 1.0))).toDS
val cent = sc.parallelize(Array(CentroidClusterScore(0, 1, 1.0), CentroidClusterScore(1, 0, 1.0), CentroidClusterScore(2, 2, 1.0))).toDS

dims.show
cent.show
dims.join(cent, dims("dimension") === cent("dimension")).show
{code}

outputs

{code}
+-----+---------+-----+
| user|dimension|score|
+-----+---------+-----+
|12345|        0|  1.0|
+-----+---------+-----+

+---------+-------+-----+
|dimension|cluster|score|
+---------+-------+-----+
|        0|      1|  1.0|
|        1|      0|  1.0|
|        2|      2|  1.0|
+---------+-------+-----+

+-----+---------+-----+---------+-------+-----+
| user|dimension|score|dimension|cluster|score|
+-----+---------+-----+---------+-------+-----+
|12345|        0|  1.0|        0|      1|  1.0|
+-----+---------+-----+---------+-------+-----+
{code}

which is correct. However, after writing and reading the data, we see this:

{code}
dims.write.mode("overwrite").save("/tmp/dims2.parquet")
cent.write.mode("overwrite").save("/tmp/cent2.parquet")

val dims2 = spark.read.load("/tmp/dims2.parquet").as[UserDimensions]
val cent2 = spark.read.load("/tmp/cent2.parquet").as[CentroidClusterScore]

dims2.show
cent2.show
dims2.join(cent2, dims2("dimension") === cent2("dimension")).show
{code}

outputs

{code}
+-----+---------+-----+
| user|dimension|score|
+-----+---------+-----+
|12345|        0|  1.0|
+-----+---------+-----+

+---------+-------+-----+
|dimension|cluster|score|
+---------+-------+-----+
|        0|      1|  1.0|
|        1|      0|  1.0|
|        2|      2|  1.0|
+---------+-------+-----+

+-----+---------+-----+---------+-------+-----+
| user|dimension|score|dimension|cluster|score|
+-----+---------+-----+---------+-------+-----+
|12345|        0|  1.0|     null|   null| null|
+-----+---------+-----+---------+-------+-----+
{code}

However, using the RDD API produces the correct result:

{code}
dims2.rdd.map( row => (row.dimension, row) ).join( cent2.rdd.map( row => (row.dimension, row) ) ).take(5)

res5: Array[(Long, (UserDimensions, CentroidClusterScore))] = Array((0,(UserDimensions(12345,0,1.0),CentroidClusterScore(0,1,1.0
{code}

We've tried changing the output format to ORC instead of parquet, but we see the same results. Running Spark 2.0 locally, not on a cluster, does not have this issue. Running Spark in local mode on the master node of the Hadoop cluster also works. Only when running on top of YARN do we see this issue. This also seems very similar to this issue: https://issues.apache.org/jira/browse/SPARK-10896

We have also determined this appears to be related to the memory settings of the cluster. The worker machines have 56000MB available; the node manager memory is set to 54784M and executor memory to 48407M when we see this issue happen. Lowering the executor memory to something like 28407M stops the issue from happening.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
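Editor's note: the RDD-style workaround in the report, keying both sides by the join column and joining on the key, can be illustrated without Spark. A minimal pure-Python sketch (the record layouts mirror the `UserDimensions` / `CentroidClusterScore` case classes above; `keyed_join` is a hypothetical helper, not a Spark API):

```python
from collections import defaultdict

dims = [(12345, 0, 1.0)]                        # (user, dimension, score)
cent = [(0, 1, 1.0), (1, 0, 1.0), (2, 2, 1.0)]  # (dimension, cluster, score)

def keyed_join(left, right, left_key, right_key):
    """Pure-Python analogue of rdd.map(row => (key, row)).join(...):
    build an index on the right side's key, then pair up matching rows."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    return [(l[left_key], (l, r)) for l in left for r in index[l[left_key]]]

# Join on dims.dimension (position 1) == cent.dimension (position 0)
result = keyed_join(dims, cent, left_key=1, right_key=0)
```

This mirrors the shape of the `take(5)` output in the report: one `(key, (leftRow, rightRow))` pair for the single matching dimension.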
[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580502#comment-15580502 ] Michael Schmeißer commented on SPARK-650: - What if I have a Hadoop InputFormat? Then certain things happen before the first RDD exists, don't they? I'll give the empty-RDD solution a shot next week; it sounds a little better than what we have right now, but it still relies on certain internals of Spark which are most likely undocumented and might change in the future. I've also had the feeling that Spark basically takes a functional approach with RDDs, so couldn't executing anything on an empty RDD be optimized to just do nothing? > Add a "setup hook" API for running initialization code on each executor > --- > > Key: SPARK-650 > URL: https://issues.apache.org/jira/browse/SPARK-650 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > > Would be useful to configure things like reporting libraries -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580135#comment-15580135 ] Sean Owen commented on SPARK-650: - But, why do you need to do it before you have an RDD? You can easily make this a library function. Or, just some static init that happens on demand whenever a certain class is loaded. The nice thing about that is that it's transparent, just like with any singleton / static init in the JVM. If you really want, you can make an empty RDD and repartition it and use that as a dummy, but it only serves to do some initialization early that would happen transparently anyway. > Add a "setup hook" API for running initialization code on each executor > --- > > Key: SPARK-650 > URL: https://issues.apache.org/jira/browse/SPARK-650 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > > Would be useful to configure things like reporting libraries -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
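Editor's note: the "static init on demand when a class is loaded" alternative suggested here can be sketched outside the JVM. A minimal Python analogue (all names hypothetical) where initialization runs transparently the first time the class is used, exactly once per process, with no explicit hook:

```python
import threading

class Reporting:
    """Stand-in for a reporting library: every entry point triggers
    initialization on demand, guarded so it runs once per process."""
    _lock = threading.Lock()
    _initialized = False
    init_count = 0  # exposed only so the once-only behavior is observable

    @classmethod
    def _ensure_initialized(cls):
        with cls._lock:  # safe if several tasks race inside one executor
            if not cls._initialized:
                cls.init_count += 1  # real code would configure the library
                cls._initialized = True

    @classmethod
    def report(cls, msg):
        cls._ensure_initialized()
        return f"reported: {msg}"

Reporting.report("a")
Reporting.report("b")  # second call finds init already done
```

The transparency Sean describes is the point: callers never invoke setup explicitly, just as JVM static initializers fire on first class use.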
[jira] [Updated] (SPARK-17961) Add storageLevel to Dataset for SparkR
[ https://issues.apache.org/jira/browse/SPARK-17961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-17961: --- Component/s: SQL SparkR > Add storageLevel to Dataset for SparkR > -- > > Key: SPARK-17961 > URL: https://issues.apache.org/jira/browse/SPARK-17961 > Project: Spark > Issue Type: Improvement > Components: SparkR, SQL >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > Add storageLevel to Dataset for SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17961) Add storageLevel to Dataset for SparkR
[ https://issues.apache.org/jira/browse/SPARK-17961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-17961: --- Issue Type: Improvement (was: Bug) > Add storageLevel to Dataset for SparkR > -- > > Key: SPARK-17961 > URL: https://issues.apache.org/jira/browse/SPARK-17961 > Project: Spark > Issue Type: Improvement >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > Add storageLevel to Dataset for SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17961) Add storageLevel to Dataset for SparkR
[ https://issues.apache.org/jira/browse/SPARK-17961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580131#comment-15580131 ] Weichen Xu commented on SPARK-17961: I am working on it and will create PR soon. > Add storageLevel to Dataset for SparkR > -- > > Key: SPARK-17961 > URL: https://issues.apache.org/jira/browse/SPARK-17961 > Project: Spark > Issue Type: Bug >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > Add storageLevel to Dataset for SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17961) Add storageLevel to Dataset for SparkR
Weichen Xu created SPARK-17961: -- Summary: Add storageLevel to Dataset for SparkR Key: SPARK-17961 URL: https://issues.apache.org/jira/browse/SPARK-17961 Project: Spark Issue Type: Bug Reporter: Weichen Xu Add storageLevel to Dataset for SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580069#comment-15580069 ] Michael Schmeißer commented on SPARK-650: - But I'll need to have an RDD to do this, I can't just do it during the SparkContext setup - right now, we have multiple sources of RDDs and every developer would still need to know that they have to run this code after creating an RDD, won't they? Or is there some way to use a "pseudo-RDD" right after creation of the SparkContext to execute the init code on the executors? > Add a "setup hook" API for running initialization code on each executor > --- > > Key: SPARK-650 > URL: https://issues.apache.org/jira/browse/SPARK-650 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > > Would be useful to configure things like reporting libraries -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580062#comment-15580062 ] Sean Owen commented on SPARK-650: - This is still easy to do with mapPartitions, which can call {{initWithTheseParamsIfNotAlreadyInitialized(...)}} once per partition, which should guarantee it happens once per JVM before anything else proceeds. I don't think you need to bury it in serialization logic. I can see there are hard ways to implement this, but I believe an easy way is still readily available within the existing API mechanisms. > Add a "setup hook" API for running initialization code on each executor > --- > > Key: SPARK-650 > URL: https://issues.apache.org/jira/browse/SPARK-650 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > > Would be useful to configure things like reporting libraries -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15580055#comment-15580055 ] Michael Schmeißer commented on SPARK-650: - Ok, let me explain the specific problems that we have encountered, which might help to understand the issue and possible solutions: We need to run some code on the executors before anything gets processed, e.g. initialization of the log system or context setup. To do this, we need information that is present on the driver, but not on the executors. Our current solution is to provide a base class for Spark function implementations which contains the information from the driver and initializes everything in its readObject method. Since multiple narrow-dependent functions may be executed on the same executor JVM subsequently, this class needs to make sure that initialization doesn't run multiple times. Sure, that's not hard to do, but if you mix setup and cleanup logic for functions, partitions and/or the JVM itself, it can get quite confusing without explicit hooks. So, our solution basically works, but with that approach, you can't use lambdas for Spark functions, which is quite inconvenient, especially for simple map operations. Even worse, if you use a lambda or otherwise forget to extend the required base class, the initialization doesn't occur and very weird exceptions follow, depending on which resource your function tries to access during its execution. Or if you have very bad luck, no exception will occur, but the log messages will get logged to an incorrect destination. It's very hard to prevent such cases without an explicit initialization mechanism and in a team with several developers, you can't expect everyone to know what is going on there. 
> Add a "setup hook" API for running initialization code on each executor > --- > > Key: SPARK-650 > URL: https://issues.apache.org/jira/browse/SPARK-650 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > > Would be useful to configure things like reporting libraries -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14141) Let user specify datatypes of pandas dataframe in toPandas()
[ https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579949#comment-15579949 ] holdenk commented on SPARK-14141: - Ah, sorry for the delay. The cache and count are done together because a cache alone won't actually cache anything until an action is performed on the RDD/DataFrame; the count is used to force evaluation of the entire dataframe. > Let user specify datatypes of pandas dataframe in toPandas() > > > Key: SPARK-14141 > URL: https://issues.apache.org/jira/browse/SPARK-14141 > Project: Spark > Issue Type: New Feature > Components: Input/Output, PySpark, SQL >Reporter: Luke Miner >Priority: Minor > > Would be nice to specify the dtypes of the pandas dataframe during the > toPandas() call. Something like: > bq. pdf = df.toPandas(dtypes={'a': 'float64', 'b': 'datetime64', 'c': 'bool', > 'd': 'category'}) > Since dtypes like `category` are more memory efficient, you could potentially > load many more rows into a pandas dataframe with this option without running > out of memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
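Editor's note: the proposed `dtypes=` argument would behave like pandas' own `astype`. What users do today is convert after `toPandas()` returns; a minimal sketch of that second pass, assuming pandas is installed (column names follow the example in the issue):

```python
import pandas as pd

# Stand-in for the DataFrame toPandas() would return: plain object/int columns.
pdf = pd.DataFrame({"a": [1, 2], "d": ["x", "y"]})

# Post-hoc conversion; the proposed dtypes= argument would apply these
# conversions during the collect instead of as a second full pass.
pdf = pdf.astype({"a": "float64", "d": "category"})
# 'category' stores each distinct string once, so repetitive string
# columns shrink substantially -- the memory argument made in the issue.
```

The catch the proposal addresses: the post-hoc version briefly holds both the unconverted and converted frames in memory, which is exactly when large collects fail.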
[jira] [Commented] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas
[ https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579946#comment-15579946 ] holdenk commented on SPARK-13534: - And now they have a release :) I'm not certain it's at the stage where we can use it - but I'll do some poking over the next few weeks :) > Implement Apache Arrow serializer for Spark DataFrame for use in > DataFrame.toPandas > --- > > Key: SPARK-13534 > URL: https://issues.apache.org/jira/browse/SPARK-13534 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Wes McKinney > > The current code path for accessing Spark DataFrame data in Python using > PySpark passes through an inefficient serialization-deserialization process > that I've examined at a high level here: > https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] > objects are being deserialized in pure Python as a list of tuples, which are > then converted to pandas.DataFrame using its {{from_records}} alternate > constructor. This also uses a large amount of memory. > For flat (no nested types) schemas, the Apache Arrow memory layout > (https://github.com/apache/arrow/tree/master/format) can be deserialized to > {{pandas.DataFrame}} objects with comparatively small overhead compared with > memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, > replacing the corresponding null values with pandas's sentinel values (None > or NaN as appropriate). > I will be contributing patches to Arrow in the coming weeks for converting > between Arrow and pandas in the general case, so if Spark can send Arrow > memory to PySpark, we will hopefully be able to increase the Python data > access throughput by an order of magnitude or more. I propose to add a new > serializer for Spark DataFrame and a new method that can be invoked from > PySpark to request an Arrow memory-layout byte stream, prefixed by a data > header indicating array buffer offsets and sizes. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12753) Import error during unit test while calling a function from reduceByKey()
[ https://issues.apache.org/jira/browse/SPARK-12753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579942#comment-15579942 ] holdenk commented on SPARK-12753: - (oh as a follow up it appears the user answered their own question on stack overflow so all is well) :) > Import error during unit test while calling a function from reduceByKey() > - > > Key: SPARK-12753 > URL: https://issues.apache.org/jira/browse/SPARK-12753 > Project: Spark > Issue Type: Question > Components: PySpark >Affects Versions: 1.6.0 > Environment: El Capitan, Single cluster Hadoop, Python 3, Spark 1.6, > Anaconda >Reporter: Dat Tran >Priority: Trivial > Labels: pyspark, python3, unit-test > Attachments: log.txt, map.py, test_map.py > > > The current directory structure for my test script is as follows: > project/ >script/ > __init__.py > map.py >test/ > __init.py__ > test_map.py > I have attached map.py and test_map.py file with this issue. > When I run the nosetest in the test directory, the test fails. I get no > module named "script" found error. > However when I modify the map_add function to replace the call to add within > reduceByKey in map.py like this: > def map_add(df): > result = df.map(lambda x: (x.key, x.value)).reduceByKey(lambda x,y: > x+y) > return result > The test passes. > Also, when I run the original test_map.py from the project directory, the > test passes. > I am not able to figure out why the test doesn't detect the script module > when it is within the test directory. > I have also attached the log error file. Any help will be much appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-12753) Import error during unit test while calling a function from reduceByKey()
[ https://issues.apache.org/jira/browse/SPARK-12753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk closed SPARK-12753. --- Resolution: Not A Problem I don't believe this is a PySpark issue; rather, it seems like a Python issue/question. If I've misunderstood, feel free to re-open - but you might have better luck with the user list or stack overflow if this is a Python issue. > Import error during unit test while calling a function from reduceByKey() > - > > Key: SPARK-12753 > URL: https://issues.apache.org/jira/browse/SPARK-12753 > Project: Spark > Issue Type: Question > Components: PySpark >Affects Versions: 1.6.0 > Environment: El Capitan, Single cluster Hadoop, Python 3, Spark 1.6, > Anaconda >Reporter: Dat Tran >Priority: Trivial > Labels: pyspark, python3, unit-test > Attachments: log.txt, map.py, test_map.py > > > The current directory structure for my test script is as follows: > project/ >script/ > __init__.py > map.py >test/ > __init.py__ > test_map.py > I have attached map.py and test_map.py file with this issue. > When I run the nosetest in the test directory, the test fails. I get no > module named "script" found error. > However when I modify the map_add function to replace the call to add within > reduceByKey in map.py like this: > def map_add(df): > result = df.map(lambda x: (x.key, x.value)).reduceByKey(lambda x,y: > x+y) > return result > The test passes. > Also, when I run the original test_map.py from the project directory, the > test passes. > I am not able to figure out why the test doesn't detect the script module > when it is within the test directory. > I have also attached the log error file. Any help will be much appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
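For the directory layout in the report, the usual fix is to put the project root on sys.path before the tests try to import the script package. A sketch of the standard pattern (the helper name is illustrative, not from the ticket; the paths follow the reporter's structure):

```python
import os
import sys

def add_project_root(test_file, levels_up=1):
    """Make the project root importable so `import script.map` works
    no matter which directory the test runner is invoked from."""
    root = os.path.abspath(
        os.path.join(os.path.dirname(test_file), *([".."] * levels_up)))
    if root not in sys.path:
        sys.path.insert(0, root)
    return root
```

Calling `add_project_root(__file__)` at the top of test_map.py makes `project/` importable, which is why running nose from the project directory worked while running it from `test/` did not.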
[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579928#comment-15579928 ] Sean Owen commented on SPARK-650: - Yeah that's a decent use case, because latency is an issue (streaming) and you potentially have time to set up before latency matters. You can still use this approach because empty RDDs arrive if no data has, and empty RDDs can still be repartitioned. Here's a way to, once, if the first RDD has no data, do something once per partition, which ought to amount to at least once per executor: {code} var first = true lines.foreachRDD { rdd => if (first) { if (rdd.isEmpty) { rdd.repartition(sc.defaultParallelism).foreachPartition(_ => Thing.initOnce()) } first = false } } {code} "ought", because, there isn't actually a guarantee that it will put the empty partitions on different executors. In practice, it seems to, when I just tried it. That's a partial solution, but it's an optimization anyway, and maybe it helps you right now. I am still not sure it means this needs a whole mechanism, if this is the only type of use case. Maybe there are others. > Add a "setup hook" API for running initialization code on each executor > --- > > Key: SPARK-650 > URL: https://issues.apache.org/jira/browse/SPARK-650 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > > Would be useful to configure things like reporting libraries -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11223) PySpark CrossValidatorModel does not output metrics for every param in paramGrid
[ https://issues.apache.org/jira/browse/SPARK-11223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-11223. - Resolution: Fixed Fixed in SPARK-12810 by [~vectorijk] :) > PySpark CrossValidatorModel does not output metrics for every param in > paramGrid > > > Key: SPARK-11223 > URL: https://issues.apache.org/jira/browse/SPARK-11223 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Raela Wang >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11223) PySpark CrossValidatorModel does not output metrics for every param in paramGrid
[ https://issues.apache.org/jira/browse/SPARK-11223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579923#comment-15579923 ] holdenk commented on SPARK-11223: - Oh wait it looks like we've already done this and I was looking at the wrong tuning model. Going to go ahead and resolve :) > PySpark CrossValidatorModel does not output metrics for every param in > paramGrid > > > Key: SPARK-11223 > URL: https://issues.apache.org/jira/browse/SPARK-11223 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Raela Wang >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11223) PySpark CrossValidatorModel does not output metrics for every param in paramGrid
[ https://issues.apache.org/jira/browse/SPARK-11223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579921#comment-15579921 ] holdenk commented on SPARK-11223: - This could be a good starter issue for someone interested in ML or PySpark. You will want to expose some additional getters on the Python CrossValidator model so users can get at the avgMetrics. Feel free to ping me for help if you want :) If no one takes this by December I'll probably give it a shot myself :) > PySpark CrossValidatorModel does not output metrics for every param in > paramGrid > > > Key: SPARK-11223 > URL: https://issues.apache.org/jira/browse/SPARK-11223 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Raela Wang >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
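Once avgMetrics is exposed, pairing it with the param grid is all a user needs to rank configurations. A pure-Python sketch of that pairing (toy values; the helper is illustrative, not part of the ML API):

```python
def rank_params(param_grid, avg_metrics, higher_is_better=True):
    """Pair each param map with its cross-validation metric and sort,
    mirroring what exposing CrossValidatorModel.avgMetrics enables."""
    pairs = list(zip(param_grid, avg_metrics))
    return sorted(pairs, key=lambda kv: kv[1], reverse=higher_is_better)
```

With the getter in place, `rank_params(cv.getEstimatorParamMaps(), model.avgMetrics)` would show the score for every combination in the grid, not just the winner.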
[jira] [Updated] (SPARK-11223) PySpark CrossValidatorModel does not output metrics for every param in paramGrid
[ https://issues.apache.org/jira/browse/SPARK-11223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-11223: Labels: starter (was: ) > PySpark CrossValidatorModel does not output metrics for every param in > paramGrid > > > Key: SPARK-11223 > URL: https://issues.apache.org/jira/browse/SPARK-11223 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Raela Wang >Priority: Minor > Labels: starter > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10635) pyspark - running on a different host
[ https://issues.apache.org/jira/browse/SPARK-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579914#comment-15579914 ] holdenk commented on SPARK-10635: - It would be a bit difficult, although as Py4J speeds up the bridge it might not be impossible. Realistically, though, this is probably a "Won't Fix", since there doesn't seem to be a lot of interest in this feature. > pyspark - running on a different host > - > > Key: SPARK-10635 > URL: https://issues.apache.org/jira/browse/SPARK-10635 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Ben Duffield > > At various points we assume we only ever talk to a driver on the same host. > e.g. > https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L615 > We use pyspark to connect to an existing driver (i.e. do not let pyspark > launch the driver itself, but instead construct the SparkContext with the > gateway and jsc arguments. > There are a few reasons for this, but essentially it's to allow more > flexibility when running in AWS. > Before 1.3.1 we were able to monkeypatch around this: > {code} > def _load_from_socket(port, serializer): > sock = socket.socket() > sock.settimeout(3) > try: > sock.connect((host, port)) > rf = sock.makefile("rb", 65536) > for item in serializer.load_stream(rf): > yield item > finally: > sock.close() > pyspark.rdd._load_from_socket = _load_from_socket > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10628) Add support for arbitrary RandomRDD generation to PySparkAPI
[ https://issues.apache.org/jira/browse/SPARK-10628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579910#comment-15579910 ] holdenk commented on SPARK-10628: - For someone who is interested in doing this, we might be able to do something based on how parallelize for Python is special-cased for generators. > Add support for arbitrary RandomRDD generation to PySparkAPI > > > Key: SPARK-10628 > URL: https://issues.apache.org/jira/browse/SPARK-10628 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Reporter: holdenk >Priority: Minor > > SPARK-2724 added support for specific RandomRDDs, add support for arbitrary > Random RDD generation for feature parity with Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-10525) Add Python example for VectorSlicer to user guide
[ https://issues.apache.org/jira/browse/SPARK-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk closed SPARK-10525. --- Resolution: Fixed Fix Version/s: 2.0.0 Fixed in SPARK-14514 by [~podongfeng] > Add Python example for VectorSlicer to user guide > - > > Key: SPARK-10525 > URL: https://issues.apache.org/jira/browse/SPARK-10525 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10525) Add Python example for VectorSlicer to user guide
[ https://issues.apache.org/jira/browse/SPARK-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579906#comment-15579906 ] holdenk commented on SPARK-10525: - It looks like it does, I'm going to go ahead and resolve this. Thanks [~amicon] :) > Add Python example for VectorSlicer to user guide > - > > Key: SPARK-10525 > URL: https://issues.apache.org/jira/browse/SPARK-10525 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Priority: Minor > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10319) ALS training using PySpark throws a StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579901#comment-15579901 ] holdenk commented on SPARK-10319: - Is this issue still occurring for you? > ALS training using PySpark throws a StackOverflowError > -- > > Key: SPARK-10319 > URL: https://issues.apache.org/jira/browse/SPARK-10319 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.1 > Environment: Windows 10, spark - 1.4.1, >Reporter: Velu nambi > > When attempting to train a machine learning model using ALS in Spark's MLLib > (1.4) on windows, Pyspark always terminates with a StackoverflowError. I > tried adding the checkpoint as described in > http://stackoverflow.com/a/31484461/36130 -- doesn't seem to help. > Here's the training code and stack trace: > {code:none} > ranks = [8, 12] > lambdas = [0.1, 10.0] > numIters = [10, 20] > bestModel = None > bestValidationRmse = float("inf") > bestRank = 0 > bestLambda = -1.0 > bestNumIter = -1 > for rank, lmbda, numIter in itertools.product(ranks, lambdas, numIters): > ALS.checkpointInterval = 2 > model = ALS.train(training, rank, numIter, lmbda) > validationRmse = computeRmse(model, validation, numValidation) > if (validationRmse < bestValidationRmse): > bestModel = model > bestValidationRmse = validationRmse > bestRank = rank > bestLambda = lmbda > bestNumIter = numIter > testRmse = computeRmse(bestModel, test, numTest) > {code} > Stacktrace: > 15/08/27 02:02:58 ERROR Executor: Exception in task 3.0 in stage 56.0 (TID > 127) > java.lang.StackOverflowError > at java.io.ObjectInputStream$BlockDataInputStream.readInt(Unknown Source) > at java.io.ObjectInputStream.readHandle(Unknown Source) > at java.io.ObjectInputStream.readClassDesc(Unknown Source) > at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) > at java.io.ObjectInputStream.readObject0(Unknown Source) > at java.io.ObjectInputStream.defaultReadFields(Unknown Source) > at 
java.io.ObjectInputStream.readSerialData(Unknown Source) > at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) > at java.io.ObjectInputStream.readObject0(Unknown Source) > at java.io.ObjectInputStream.defaultReadFields(Unknown Source) > at java.io.ObjectInputStream.readSerialData(Unknown Source) > at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) > at java.io.ObjectInputStream.readObject0(Unknown Source) > at java.io.ObjectInputStream.defaultReadFields(Unknown Source) > at java.io.ObjectInputStream.readSerialData(Unknown Source) > at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source) > at java.io.ObjectInputStream.readObject0(Unknown Source) > at java.io.ObjectInputStream.readObject(Unknown Source) > at scala.collection.immutable.$colon$colon.readObject(List.scala:362) > at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) > at java.lang.reflect.Method.invoke(Unknown Source) > at java.io.ObjectStreamClass.invokeReadObject(Unknown Source) > at java.io.ObjectInputStream.readSerialData(Unknown Source) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17960) Upgrade to Py4J 0.10.4
[ https://issues.apache.org/jira/browse/SPARK-17960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579903#comment-15579903 ] Sean Owen commented on SPARK-17960: --- It seems OK. If the changes aren't essential, maybe we don't have to update to every single maintenance release, but getting a fix in and at least periodically updating this (and other components) seems like a win. I had thought to review the updates available from key dependencies before 2.1.0 and perform a few other updates like this too. > Upgrade to Py4J 0.10.4 > -- > > Key: SPARK-17960 > URL: https://issues.apache.org/jira/browse/SPARK-17960 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: holdenk >Priority: Trivial > Labels: starter > > In general we should try and keep up to date with Py4J's new releases. The > changes in this one are small ( > https://github.com/bartdag/py4j/milestone/21?closed=1 ) and shouldn't impact > Spark in any significant way, so I'm going to tag this as a starter issue for > someone looking to get a deeper understanding of how PySpark works. > Upgrading Py4J can be a bit tricky compared to updating other packages; in > general the steps are: > 1) Upgrade the Py4J version on the Java side > 2) Update the py4j src zip file we bundle with Spark > 3) Make sure everything still works (especially the streaming tests, because > we do weird things to make streaming work and it's the most likely place to > break during a Py4J upgrade). > You can see how these bits have been done in past releases by looking in the > git log for the last time we changed the Py4J version numbers. Sometimes even > for "compatible" releases like this one we may need to make some small code > changes inside of PySpark because we hook into Py4J's internals, but I don't > think this should be the case here. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-10223) Add takeOrderedByKey function to extract top N records within each group
[ https://issues.apache.org/jira/browse/SPARK-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk closed SPARK-10223. --- Resolution: Won't Fix I don't see this feature being particularly popular, especially since it's relatively easy to implement outside of Spark itself. If you disagree, feel free to re-open this. > Add takeOrderedByKey function to extract top N records within each group > > > Key: SPARK-10223 > URL: https://issues.apache.org/jira/browse/SPARK-10223 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Ritesh Agrawal >Priority: Minor > > Currently PySpark has a takeOrdered function that returns the top N records. > However often you want to extract the top N records within each group. This can > be easily implemented using the combineByKey operation and a fixed-size heap > to capture the top N within each group. A working solution can be found over > [here](https://ragrawal.wordpress.com/2015/08/25/pyspark-top-n-records-in-each-group/) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
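For reference, the bounded-heap combiner logic mentioned in the description reduces to a few lines of pure Python (a sketch of the per-key merge one would pass to combineByKey, not a Spark API):

```python
import heapq

def top_n_by_key(pairs, n):
    """Keep the n largest values per key using a bounded min-heap,
    the same combiner logic a combineByKey implementation would use."""
    heaps = {}
    for key, value in pairs:
        heap = heaps.setdefault(key, [])
        if len(heap) < n:
            heapq.heappush(heap, value)
        elif value > heap[0]:
            # New value beats the smallest retained one; swap it in.
            heapq.heapreplace(heap, value)
    return {k: sorted(h, reverse=True) for k, h in heaps.items()}
```

Each per-key heap never grows past n elements, which is what keeps the memory footprint small regardless of group size.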
[jira] [Closed] (SPARK-9965) Scala, Python SQLContext input methods' deprecation statuses do not match
[ https://issues.apache.org/jira/browse/SPARK-9965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk closed SPARK-9965. -- Resolution: Resolved Fix Version/s: 2.0.0 These methods were removed in 77ab49b8575d2ebd678065fa70b0343d532ab9c2 by [~rxin]. > Scala, Python SQLContext input methods' deprecation statuses do not match > - > > Key: SPARK-9965 > URL: https://issues.apache.org/jira/browse/SPARK-9965 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Joseph K. Bradley >Priority: Minor > Fix For: 2.0.0 > > > Scala's SQLContext has several methods for data input (jsonFile, jsonRDD, > etc.) deprecated. These methods are not deprecated in Python's SQLContext. > They should be, but only after Python's DataFrameReader implements analogous > methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9931) Flaky test: mllib/tests.py StreamingLogisticRegressionWithSGDTests. test_training_and_prediction
[ https://issues.apache.org/jira/browse/SPARK-9931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579889#comment-15579889 ] holdenk commented on SPARK-9931: Is this still a test people are finding flaky, or did [~josephkb]'s fix work? > Flaky test: mllib/tests.py StreamingLogisticRegressionWithSGDTests. > test_training_and_prediction > > > Key: SPARK-9931 > URL: https://issues.apache.org/jira/browse/SPARK-9931 > Project: Spark > Issue Type: Bug > Components: PySpark, Tests >Reporter: Davies Liu >Priority: Critical > > {code} > FAIL: test_training_and_prediction > (__main__.StreamingLogisticRegressionWithSGDTests) > Test that the model improves on toy data with no. of batches > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/NewSparkPullRequestBuilder/python/pyspark/mllib/tests.py", > line 1250, in test_training_and_prediction > self.assertTrue(errors[1] - errors[-1] > 0.3) > AssertionError: False is not true > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7653) ML Pipeline and meta-algs should take random seed param
[ https://issues.apache.org/jira/browse/SPARK-7653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579884#comment-15579884 ] holdenk commented on SPARK-7653: I think the simplest workaround would be exposing HasSeed to the public. What do you think? > ML Pipeline and meta-algs should take random seed param > --- > > Key: SPARK-7653 > URL: https://issues.apache.org/jira/browse/SPARK-7653 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > ML Pipelines and other meta-algorithms should implement HasSeed. If the seed > is set, then the meta-alg will use that seed to generate new seeds to pass to > every component PipelineStage. > * Note: This may require a little discussion about whether HasSeed should be > a public API. > This will make it easier for users to have reproducible results for entire > pipelines (rather than setting the seed for each stage manually). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
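The scheme the description asks for — one pipeline seed deterministically generating per-stage seeds — can be sketched without any ML classes (illustrative only, not the HasSeed implementation):

```python
import random

def stage_seeds(pipeline_seed, num_stages):
    """Derive one seed per pipeline stage from a single pipeline seed,
    so an entire pipeline is reproducible without setting each stage
    manually."""
    rng = random.Random(pipeline_seed)
    return [rng.randrange(2**31) for _ in range(num_stages)]
```

Because the derivation is deterministic, setting one seed on the pipeline fixes every component stage's behavior, which is the reproducibility property the ticket describes.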
[jira] [Commented] (SPARK-7941) Cache Cleanup Failure when job is killed by Spark
[ https://issues.apache.org/jira/browse/SPARK-7941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579878#comment-15579878 ] holdenk commented on SPARK-7941: So if it's OK - since I don't see other reports of this - unless this is an issue someone (including [~cqnguyen]) is still experiencing, I'll go ahead and soft-close this at the end of next week. > Cache Cleanup Failure when job is killed by Spark > -- > > Key: SPARK-7941 > URL: https://issues.apache.org/jira/browse/SPARK-7941 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 1.3.1 >Reporter: Cory Nguyen > Attachments: screenshot-1.png > > > Problem/Bug: > If a job is running and Spark kills the job intentionally, the cache files > remain on the local/worker nodes and are not cleaned up properly. Over time > the old cache builds up and causes "No Space Left on Device" error. > The cache is cleaned up properly when the job succeeds. I have not verified > if the cache remains when the user intentionally kills the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7721) Generate test coverage report from Python
[ https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579874#comment-15579874 ] holdenk commented on SPARK-7721: [~joshrosen] is this something you're still looking at/interested in, or would this be a good place for someone else to step up and help out if you'd have the review bandwidth? > Generate test coverage report from Python > - > > Key: SPARK-7721 > URL: https://issues.apache.org/jira/browse/SPARK-7721 > Project: Spark > Issue Type: Test > Components: PySpark, Tests >Reporter: Reynold Xin > > Would be great to have a test coverage report for Python. Compared with Scala, > it is trickier to understand the coverage without coverage reports in Python > because we employ both docstring tests and unit tests in test files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3981) Consider a better approach to initialize SerDe on executors
[ https://issues.apache.org/jira/browse/SPARK-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579868#comment-15579868 ] holdenk commented on SPARK-3981: That's a good question. It seems like much of the code that was copied in SPARK-3971 is still around, but given our slow migration from MLlib to ML I'm thinking that we should maybe consider marking this as a "Won't Fix". What are your thoughts [~mengxr]? > Consider a better approach to initialize SerDe on executors > --- > > Key: SPARK-3981 > URL: https://issues.apache.org/jira/browse/SPARK-3981 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Affects Versions: 1.2.0 >Reporter: Xiangrui Meng > > In SPARK-3971, we copied SerDe code from Core to MLlib in order to recognize > MLlib types on executors as a hotfix. This is not ideal. We should find a way > to add hooks to the SerDe in Core to support MLlib types in a pluggable way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2868) Support named accumulators in Python
[ https://issues.apache.org/jira/browse/SPARK-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579866#comment-15579866 ] holdenk commented on SPARK-2868: ping [~davies] - would you be available to review if I got this switched around? > Support named accumulators in Python > > > Key: SPARK-2868 > URL: https://issues.apache.org/jira/browse/SPARK-2868 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: Patrick Wendell > > SPARK-2380 added this for Java/Scala. To allow this in Python we'll need to > make some additional changes. One potential path is to have a 1:1 > correspondence with Scala accumulators (instead of a one-to-many). A > challenge is exposing the stringified values of the accumulators to the Scala > code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
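For context, PySpark's existing accumulator contract is the AccumulatorParam pair zero()/addInPlace(). A minimal pure-Python illustration of how the driver merges task-side partial values under that contract (no SparkContext involved; `merge_partials` is our illustrative helper, not a PySpark function):

```python
class ListParam:
    """Same shape as pyspark.AccumulatorParam: zero() gives the
    identity value, addInPlace() merges two partial results."""

    def zero(self, initial):
        return []

    def addInPlace(self, acc1, acc2):
        acc1.extend(acc2)
        return acc1

def merge_partials(param, partials, initial=None):
    # Sketch of what the driver does as task-side partial values arrive.
    total = param.zero(initial)
    for p in partials:
        total = param.addInPlace(total, p)
    return total
```

Supporting named accumulators would mean carrying a display name alongside this merge logic and exposing the stringified running value to the Scala side, as the description notes.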
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579847#comment-15579847 ] holdenk commented on SPARK-9487: Great, thanks for taking this issue on :) > Use the same num. worker threads in Scala/Python unit tests > --- > > Key: SPARK-9487 > URL: https://issues.apache.org/jira/browse/SPARK-9487 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL, Tests >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng > Labels: starter > > In Python we use `local[4]` for unit tests, while in Scala/Java we use > `local[2]` and `local` for some unit tests in SQL, MLLib, and other > components. If the operation depends on partition IDs, e.g., random number > generator, this will lead to different result in Python and Scala/Java. It > would be nice to use the same number in all unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17960) Upgrade to Py4J 0.10.4
holdenk created SPARK-17960: --- Summary: Upgrade to Py4J 0.10.4 Key: SPARK-17960 URL: https://issues.apache.org/jira/browse/SPARK-17960 Project: Spark Issue Type: Improvement Components: PySpark Reporter: holdenk Priority: Trivial In general we should try and keep up to date with Py4J's new releases. The changes in this one are small ( https://github.com/bartdag/py4j/milestone/21?closed=1 ) and shouldn't impact Spark in any significant way, so I'm going to tag this as a starter issue for someone looking to get a deeper understanding of how PySpark works. Upgrading Py4J can be a bit tricky compared to updating other packages; in general the steps are: 1) Upgrade the Py4J version on the Java side 2) Update the py4j src zip file we bundle with Spark 3) Make sure everything still works (especially the streaming tests, because we do weird things to make streaming work and it's the most likely place to break during a Py4J upgrade). You can see how these bits have been done in past releases by looking in the git log for the last time we changed the Py4J version numbers. Sometimes even for "compatible" releases like this one we may need to make some small code changes inside of PySpark because we hook into Py4J's internals, but I don't think this should be the case here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17959) spark.sql.join.preferSortMergeJoin has no effect for simple join due to calculated size of LogicalRdd
[ https://issues.apache.org/jira/browse/SPARK-17959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-17959: Summary: spark.sql.join.preferSortMergeJoin has no effect for simple join due to calculated size of LogicalRdd (was: spark.sql.join.preferSortMergeJoin has no effect for simple join) > spark.sql.join.preferSortMergeJoin has no effect for simple join due to > calculated size of LogicalRdd > - > > Key: SPARK-17959 > URL: https://issues.apache.org/jira/browse/SPARK-17959 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Stavros Kontopoulos > > Example code: > val df = spark.sparkContext.parallelize(List(("A", 10, "dss@s1"), ("A", 20, > "dss@s2"), > ("B", 1, "dss@qqa"), ("B", 2, "dss@qqb"))).toDF("Group", "Amount", > "Email") > df.as("a").join(df.as("b")) > .where($"a.Group" === $"b.Group") > .explain() > I always get the SortMerge strategy (never shuffle hash join) even if I set > spark.sql.join.preferSortMergeJoin to false since: > sizeInBytes = 2^63-1 > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala#L101 > and thus: > condition here: > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L127 > is always false... > I think this shouldn't be the case; my df has a specific size and number of > partitions (200, which is btw far from optimal)... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17959) spark.sql.join.preferSortMergeJoin has no effect for simple join
Stavros Kontopoulos created SPARK-17959: --- Summary: spark.sql.join.preferSortMergeJoin has no effect for simple join Key: SPARK-17959 URL: https://issues.apache.org/jira/browse/SPARK-17959 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1 Reporter: Stavros Kontopoulos Example code: val df = spark.sparkContext.parallelize(List(("A", 10, "dss@s1"), ("A", 20, "dss@s2"), ("B", 1, "dss@qqa"), ("B", 2, "dss@qqb"))).toDF("Group", "Amount", "Email") df.as("a").join(df.as("b")) .where($"a.Group" === $"b.Group") .explain() I always get the SortMerge strategy even if I set spark.sql.join.preferSortMergeJoin to false since: sizeInBytes = 2^63-1 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala#L101 and thus: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L127 the condition there is always false... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
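The planner condition the reporter points at is, roughly, "the build side must be small enough to hash per partition". A simplified Python model of its shape (the constants and function name are illustrative, not Spark's exact code) shows why a default statistic of 2^63-1 for an RDD-backed plan can never pass it, making preferSortMergeJoin=false a no-op:

```python
# LogicalRDD reports Long.MaxValue when it has no real size statistics.
DEFAULT_SIZE_IN_BYTES = 2 ** 63 - 1

def can_build_local_hash_map(size_in_bytes, broadcast_threshold, num_shuffle_partitions):
    """Simplified form of the planner check: the build side must fit a
    per-partition in-memory hash map for shuffle hash join to be chosen."""
    return size_in_bytes < broadcast_threshold * num_shuffle_partitions

# With default stats (10 MB broadcast threshold, 200 shuffle partitions),
# the check always fails, so sort-merge join is always selected:
print(can_build_local_hash_map(DEFAULT_SIZE_IN_BYTES, 10 * 1024 * 1024, 200))  # False
```

With an honest size estimate the same check can succeed, which is why the bug is attributed to the calculated size of LogicalRdd rather than to the join strategy selection itself.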
[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579737#comment-15579737 ] Olivier Armand commented on SPARK-650: -- Data doesn't necessarily arrive immediately, but we need to ensure that when it arrives, lazy initialization doesn't introduce latency. > Add a "setup hook" API for running initialization code on each executor > --- > > Key: SPARK-650 > URL: https://issues.apache.org/jira/browse/SPARK-650 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > > Would be useful to configure things like reporting libraries -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579720#comment-15579720 ] Sean Owen commented on SPARK-650: - It would work in this case to immediately schedule initialization on the executors because it sounds like data arrives immediately in your case. The part I am missing is how it can occur faster than this with another mechanism. > Add a "setup hook" API for running initialization code on each executor > --- > > Key: SPARK-650 > URL: https://issues.apache.org/jira/browse/SPARK-650 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > > Would be useful to configure things like reporting libraries -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579710#comment-15579710 ] Olivier Armand commented on SPARK-650: -- > "just run a dummy mapPartitions at the outset on the same data that the first > job would touch" But this wouldn't work for Spark Streaming? (our case). > Add a "setup hook" API for running initialization code on each executor > --- > > Key: SPARK-650 > URL: https://issues.apache.org/jira/browse/SPARK-650 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > > Would be useful to configure things like reporting libraries -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17958) Why I ran into issue " accumulator, copyandreset must be zero error"
[ https://issues.apache.org/jira/browse/SPARK-17958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17958. --- Resolution: Invalid Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Questions go to user@, not JIRA. You would also never set Blocker. > Why I ran into issue " accumulator, copyandreset must be zero error" > > > Key: SPARK-17958 > URL: https://issues.apache.org/jira/browse/SPARK-17958 > Project: Spark > Issue Type: Question > Components: Java API >Affects Versions: 2.0.1 > Environment: linux. >Reporter: dianwei Han >Priority: Blocker > Original Estimate: 24h > Remaining Estimate: 24h > > I used to run Spark code under Spark 1.5 with no problem at all. Right now, I > run the code on Spark 2.0 and see the error > "Exception in thread "main" java.lang.AssertionError: assertion failed: > copyAndReset must return a zero value copy > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.util.AccumulatorV2.writeReplace(AccumulatorV2.scala:157) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > Please help or give a suggestion on how to fix the issue? > Really appreciate it, > Dianwei -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
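The assertion in that stack trace comes from the AccumulatorV2 contract introduced in Spark 2.0: copyAndReset must return a copy for which isZero is true. A minimal Python model of the contract (a toy class, not Spark's Scala API) makes the requirement concrete:

```python
class ListAccumulator:
    """Toy accumulator mirroring the AccumulatorV2 contract."""

    def __init__(self, values=None):
        self._values = list(values or [])

    def is_zero(self):
        """True iff the accumulator holds no accumulated state."""
        return len(self._values) == 0

    def add(self, v):
        self._values.append(v)

    def copy_and_reset(self):
        # Must return a *zero-valued* copy. Returning self, or a copy that
        # keeps the accumulated values, triggers the assertion in the
        # stack trace above when the accumulator is serialized.
        return ListAccumulator()

acc = ListAccumulator()
acc.add(1)
print(acc.copy_and_reset().is_zero())  # True: the invariant Spark asserts
```

Custom accumulators written against the Spark 1.x AccumulatorParam API typically fail this check until copyAndReset (and isZero) are implemented to honor the new contract.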
[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579630#comment-15579630 ] Sean Owen commented on SPARK-650: - Reopening doesn't do anything by itself, or cause anyone to consider this. If this just sits for another year, it will have been a tiny part of a larger problem. I would ask those asking to keep this open to advance the discussion, or else I think you'd agree it eventually should be closed. (Here, I'm really speaking about hundreds of issues like this, not so much this one.) Part of the problem is that I don't think the details of this feature request were ever elaborated. I think that if you dig into what it would mean, you'd find that a) it's kind of tricky to define and then implement all the right semantics, and b) almost any use case along these lines in my experience is resolved as I suggest, with a simple per-JVM initialization. If the response lately here is, well, we're not quite sure how that works, then we need to get to the bottom of that, not just insist the issue stay open. To your points: - The executor is going to load user code into one classloader, so we do have that an executor = JVM = classloader. - You can fail things as fast as you like by invoking this init as soon as you like in your app. - It's clear where things execute, or else we must assume app developers understand this, or all bets are off. The driver program executes things in the driver unless they're part of a distributed map() etc. operation, which clearly executes on the executor. These IMHO aren't reasons to design a new, different, bespoke mechanism. That has a cost too, if you're positing that it's hard to understand when things run where. The one catch I see is that, by design, we don't control which tasks run on what executors. We can't guarantee init code runs on all executors this way. But is it meaningful to initialize an executor that never sees an app's tasks? It can't be.
Lazy init is a good thing and compatible with the Spark model. If startup time is an issue (and I'm still not clear on the latency problem mentioned above), then it gets a little more complicated, but, that's also a little more niche: just run a dummy mapPartitions at the outset on the same data that the first job would touch, even asynchronously if you like with other driver activities. No need to wait; it just gives the init a head-start on the executors that will need it straight away. That's just my opinion of course, but I think those are the questions that would need to be answered to argue something happens here. > Add a "setup hook" API for running initialization code on each executor > --- > > Key: SPARK-650 > URL: https://issues.apache.org/jira/browse/SPARK-650 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > > Would be useful to configure things like reporting libraries -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
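The per-JVM lazy initialization plus dummy-job warm-up suggested above can be sketched without Spark; `get_client` below stands in for whatever expensive resource (e.g. a reporting library) each executor needs, and the function names are hypothetical:

```python
import functools

@functools.lru_cache(maxsize=None)
def get_client():
    """Initialized at most once per process (i.e., per executor JVM).
    In real code this would configure a reporting library, open a
    connection pool, etc."""
    return {"initialized": True}

def process_partition(rows):
    """Real work: the first task on each worker pays the init cost lazily."""
    client = get_client()
    return [(r, client["initialized"]) for r in rows]

def warm_up(rows):
    """Dummy pass over the data: pre-pays the init cost, returns nothing."""
    get_client()
    return []

# In Spark terms: rdd.mapPartitions(warm_up).count() front-loads the
# initialization on the executors that hold the data, so a subsequent
# rdd.mapPartitions(process_partition) sees no first-record latency.
print(get_client() is get_client())  # True: one instance per process
```

The `lru_cache` singleton plays the role of a Scala `object` with a `lazy val`: the resource exists once per JVM, regardless of how many tasks run there.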
[jira] [Commented] (SPARK-17931) taskScheduler has some unneeded serialization
[ https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579590#comment-15579590 ] Apache Spark commented on SPARK-17931: -- User 'witgo' has created a pull request for this issue: https://github.com/apache/spark/pull/15505 > taskScheduler has some unneeded serialization > - > > Key: SPARK-17931 > URL: https://issues.apache.org/jira/browse/SPARK-17931 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Guoqiang Li > > When taskScheduler instantiates TaskDescription, it calls > `Task.serializeWithDependencies(task, sched.sc.addedFiles, > sched.sc.addedJars, ser)`. It serializes task and its dependency. > But after SPARK-2521 has been merged into the master, the ResultTask class > and ShuffleMapTask class no longer contain rdd and closure objects. > TaskDescription class can be changed as below: > {noformat} > class TaskDescription[T]( > val taskId: Long, > val attemptNumber: Int, > val executorId: String, > val name: String, > val index: Int, > val task: Task[T]) extends Serializable > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
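The proposal's point is that, after SPARK-2521, the task object no longer drags the RDD and closure along, so the whole description can be serialized in one pass instead of going through serializeWithDependencies. A rough Python analogy using pickle (the field names mirror the proposed Scala class; the payload is a stand-in for the already-serialized task body):

```python
import pickle
from dataclasses import dataclass

@dataclass
class TaskDescription:
    """Analogue of the slimmer Scala TaskDescription proposed above."""
    task_id: int
    attempt_number: int
    executor_id: str
    name: str
    index: int
    task_payload: bytes = b""  # the task body, already serialized elsewhere

desc = TaskDescription(1, 0, "exec-1", "result-task-0", 0, b"\x00" * 16)
blob = pickle.dumps(desc)  # one serialization pass, no dependency bundling
print(pickle.loads(blob) == desc)  # True: round-trips cleanly
```

Because files and jars are shipped separately anyway, the description only needs scheduling metadata plus the task bytes, which is what makes the simpler `extends Serializable` form in the {noformat} block viable.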
[jira] [Assigned] (SPARK-17931) taskScheduler has some unneeded serialization
[ https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17931: Assignee: (was: Apache Spark) > taskScheduler has some unneeded serialization > - > > Key: SPARK-17931 > URL: https://issues.apache.org/jira/browse/SPARK-17931 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Guoqiang Li > > When taskScheduler instantiates TaskDescription, it calls > `Task.serializeWithDependencies(task, sched.sc.addedFiles, > sched.sc.addedJars, ser)`. It serializes task and its dependency. > But after SPARK-2521 has been merged into the master, the ResultTask class > and ShuffleMapTask class no longer contain rdd and closure objects. > TaskDescription class can be changed as below: > {noformat} > class TaskDescription[T]( > val taskId: Long, > val attemptNumber: Int, > val executorId: String, > val name: String, > val index: Int, > val task: Task[T]) extends Serializable > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17931) taskScheduler has some unneeded serialization
[ https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17931: Assignee: Apache Spark > taskScheduler has some unneeded serialization > - > > Key: SPARK-17931 > URL: https://issues.apache.org/jira/browse/SPARK-17931 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Guoqiang Li >Assignee: Apache Spark > > When taskScheduler instantiates TaskDescription, it calls > `Task.serializeWithDependencies(task, sched.sc.addedFiles, > sched.sc.addedJars, ser)`. It serializes task and its dependency. > But after SPARK-2521 has been merged into the master, the ResultTask class > and ShuffleMapTask class no longer contain rdd and closure objects. > TaskDescription class can be changed as below: > {noformat} > class TaskDescription[T]( > val taskId: Long, > val attemptNumber: Int, > val executorId: String, > val name: String, > val index: Int, > val task: Task[T]) extends Serializable > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org