[jira] [Resolved] (SPARK-20035) Spark 2.0.2 writes empty file if no record is in the dataset
[ https://issues.apache.org/jira/browse/SPARK-20035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-20035. -- Resolution: Duplicate I am pretty sure that this is a duplicate of SPARK-15473. Let me resolve this as a duplicate. Please reopen this if anyone feels the JIRAs are separate and the fixes will be separate too. > Spark 2.0.2 writes empty file if no record is in the dataset > > > Key: SPARK-20035 > URL: https://issues.apache.org/jira/browse/SPARK-20035 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.2 > Environment: Spark 2.0.2 > Linux/Windows >Reporter: Andrew > > When there is no record in a dataset, the call to write with the spark-csv > creates empty file (i.e. with no title line) > ``` > dataset.write().format("com.databricks.spark.csv").option("header", > "true").save("... file name here ..."); > or > dataset.write().option("header", "true").csv("... file name here ..."); > ``` > The same file then cannot be read by using the same format (i.e. spark-csv) > since it is empty as below. The same call works if the dataset has at least > one record. > ``` > sqlCtx.read().format("com.databricks.spark.csv").option("header", > "true").option("inferSchema", "true").load("... file name here ..."); > or > sparkSession.read().option("header", "true").option("inferSchema", > "true").csv("... file name here ..."); > ``` > This is not right, you should always be able to read the file that you wrote > to. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15473) CSV fails to write and read back empty dataframe
[ https://issues.apache.org/jira/browse/SPARK-15473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942609#comment-15942609 ] Hyukjin Kwon commented on SPARK-15473: -- I am not working on this. Please take over this if anyone is willing to. > CSV fails to write and read back empty dataframe > > > Key: SPARK-15473 > URL: https://issues.apache.org/jira/browse/SPARK-15473 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > Currently CSV data source fails to write and read empty data. > The code below: > {code} > val emptyDf = spark.range(10).filter(_ => false) > emptyDf.write > .format("csv") > .save(path.getCanonicalPath) > val copyEmptyDf = spark.read > .format("csv") > .load(path.getCanonicalPath) > copyEmptyDf.show() > {code} > throws an exception below: > {code} > Can not create a Path from an empty string > java.lang.IllegalArgumentException: Can not create a Path from an empty string > at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127) > at org.apache.hadoop.fs.Path.(Path.java:135) > at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:241) > at > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:987) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:987) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:178) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:178) > at scala.Option.map(Option.scala:146) > {code} > Note that this is a different case with the data below > {code} > val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema) > {code} > In this case, any writer is not initialised and created. (no calls of > {{WriterContainer.writeRows()}}. > Maybe, it should be able to read/write header for schemas as well as empty > data. > For Parquet and JSON, it works but CSV does not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20105) Add tests for checkType and type string in structField in R
[ https://issues.apache.org/jira/browse/SPARK-20105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20105: Assignee: Apache Spark > Add tests for checkType and type string in structField in R > --- > > Key: SPARK-20105 > URL: https://issues.apache.org/jira/browse/SPARK-20105 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > It seems {{checkType}} and the type string in {{structField}} are not being > tested closely. > This string format currently seems R-specific. Therefore, it seems nicer if > we test positive/negative cases in R side. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20105) Add tests for checkType and type string in structField in R
[ https://issues.apache.org/jira/browse/SPARK-20105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20105: Assignee: (was: Apache Spark) > Add tests for checkType and type string in structField in R > --- > > Key: SPARK-20105 > URL: https://issues.apache.org/jira/browse/SPARK-20105 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Priority: Minor > > It seems {{checkType}} and the type string in {{structField}} are not being > tested closely. > This string format currently seems R-specific. Therefore, it seems nicer if > we test positive/negative cases in R side. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20105) Add tests for checkType and type string in structField in R
[ https://issues.apache.org/jira/browse/SPARK-20105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942587#comment-15942587 ] Apache Spark commented on SPARK-20105: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/17439 > Add tests for checkType and type string in structField in R > --- > > Key: SPARK-20105 > URL: https://issues.apache.org/jira/browse/SPARK-20105 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Priority: Minor > > It seems {{checkType}} and the type string in {{structField}} are not being > tested closely. > This string format currently seems R-specific. Therefore, it seems nicer if > we test positive/negative cases in R side. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20104) Don't estimate IsNull or IsNotNull predicates for non-leaf node
[ https://issues.apache.org/jira/browse/SPARK-20104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20104: Assignee: (was: Apache Spark) > Don't estimate IsNull or IsNotNull predicates for non-leaf node > --- > > Key: SPARK-20104 > URL: https://issues.apache.org/jira/browse/SPARK-20104 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Zhenhua Wang > > In current stage, we don't have advanced statistics such as sketches or > histograms. As a result, some operator can't estimate `nullCount` accurately. > E.g. left outer join estimation does not accurately update `nullCount` > currently. So for IsNull and IsNotNull predicates, we only estimate them when > the child is a leaf node, whose `nullCount` is accurate. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20104) Don't estimate IsNull or IsNotNull predicates for non-leaf node
[ https://issues.apache.org/jira/browse/SPARK-20104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942583#comment-15942583 ] Apache Spark commented on SPARK-20104: -- User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/17438 > Don't estimate IsNull or IsNotNull predicates for non-leaf node > --- > > Key: SPARK-20104 > URL: https://issues.apache.org/jira/browse/SPARK-20104 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Zhenhua Wang > > In current stage, we don't have advanced statistics such as sketches or > histograms. As a result, some operator can't estimate `nullCount` accurately. > E.g. left outer join estimation does not accurately update `nullCount` > currently. So for IsNull and IsNotNull predicates, we only estimate them when > the child is a leaf node, whose `nullCount` is accurate. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20104) Don't estimate IsNull or IsNotNull predicates for non-leaf node
[ https://issues.apache.org/jira/browse/SPARK-20104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20104: Assignee: Apache Spark > Don't estimate IsNull or IsNotNull predicates for non-leaf node > --- > > Key: SPARK-20104 > URL: https://issues.apache.org/jira/browse/SPARK-20104 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Zhenhua Wang >Assignee: Apache Spark > > In current stage, we don't have advanced statistics such as sketches or > histograms. As a result, some operator can't estimate `nullCount` accurately. > E.g. left outer join estimation does not accurately update `nullCount` > currently. So for IsNull and IsNotNull predicates, we only estimate them when > the child is a leaf node, whose `nullCount` is accurate. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20105) Add tests for checkType and type string in structField in R
Hyukjin Kwon created SPARK-20105: Summary: Add tests for checkType and type string in structField in R Key: SPARK-20105 URL: https://issues.apache.org/jira/browse/SPARK-20105 Project: Spark Issue Type: Test Components: SparkR Affects Versions: 2.2.0 Reporter: Hyukjin Kwon Priority: Minor It seems {{checkType}} and the type string in {{structField}} are not being tested closely. This string format currently seems R-specific. Therefore, it seems nicer if we test positive/negative cases in R side. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20104) Don't estimate IsNull or IsNotNull predicates for non-leaf node
Zhenhua Wang created SPARK-20104: Summary: Don't estimate IsNull or IsNotNull predicates for non-leaf node Key: SPARK-20104 URL: https://issues.apache.org/jira/browse/SPARK-20104 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Zhenhua Wang In current stage, we don't have advanced statistics such as sketches or histograms. As a result, some operator can't estimate `nullCount` accurately. E.g. left outer join estimation does not accurately update `nullCount` currently. So for IsNull and IsNotNull predicates, we only estimate them when the child is a leaf node, whose `nullCount` is accurate. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20103) Spark structured steaming from kafka - last message processed again after resume from checkpoint
Rajesh Mutha created SPARK-20103: Summary: Spark structured steaming from kafka - last message processed again after resume from checkpoint Key: SPARK-20103 URL: https://issues.apache.org/jira/browse/SPARK-20103 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0 Environment: Linux, Spark 2.10 Reporter: Rajesh Mutha When the application starts after a failure or a graceful shutdown, it is consistently processing the last message of the previous batch even though it was already processed correctly without failure. We are making sure database writes are idempotent using postgres 9.6 feature. Is this the default behavior of spark? I added a code snippet with 2 streaming queries. One of the query is idempotent; since query2 is not idempotent, we are seeing duplicate entries in table. --- object StructuredStreaming { def main(args: Array[String]): Unit = { val db_url = "jdbc:postgresql://dynamic-milestone-dev.crv1otzbekk9.us-east-1.rds.amazonaws.com:5432/DYNAMICPOSTGRES?user=dsdbadmin=password" val spark = SparkSession .builder .appName("StructuredKafkaReader") .master("local[*]") .getOrCreate() spark.conf.set("spark.sql.streaming.checkpointLocation", "/tmp/checkpoint_research/") import spark.implicits._ val server = "10.205.82.113:9092" val topic = "checkpoint" val subscribeType="subscribe" val lines = spark .readStream .format("kafka") .option("kafka.bootstrap.servers", server) .option(subscribeType, topic) .load().selectExpr("CAST(value AS STRING)").as[String] lines.printSchema() import org.apache.spark.sql.ForeachWriter val writer = new ForeachWriter[String] { def open(partitionId: Long, version: Long): Boolean = { println("After db props"); true } def process(value: String) = { val conn = DriverManager.getConnection(db_url) try{ conn.createStatement().executeUpdate("INSERT INTO PUBLIC.checkpoint1 VALUES ('"+value+"')") } finally { conn.close() } } def close(errorOrNull: Throwable) = {} } import scala.concurrent.duration._ val query1 = lines.writeStream .outputMode("append") .queryName("checkpoint1") .trigger(ProcessingTime(30.seconds)) .foreach(writer) .start() val writer2 = new ForeachWriter[String] { def open(partitionId: Long, version: Long): Boolean = { println("After db props"); true } def process(value: String) = { val conn = DriverManager.getConnection(db_url) try{ conn.createStatement().executeUpdate("INSERT INTO PUBLIC.checkpoint2 VALUES ('"+value+"')") } finally { conn.close() } } def close(errorOrNull: Throwable) = {} } import scala.concurrent.duration._ val query2 = lines.writeStream .outputMode("append") .queryName("checkpoint2") .trigger(ProcessingTime(30.seconds)) .foreach(writer2) .start() query2.awaitTermination() query1.awaitTermination() }} --- -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6567) Large linear model parallelism via a join and reduceByKey
[ https://issues.apache.org/jira/browse/SPARK-6567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942498#comment-15942498 ] Joseph K. Bradley commented on SPARK-6567: -- Linking [SPARK-10078], which tracks adding vector-free L-BFGS. > Large linear model parallelism via a join and reduceByKey > - > > Key: SPARK-6567 > URL: https://issues.apache.org/jira/browse/SPARK-6567 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Reza Zadeh > Attachments: model-parallelism.pptx > > > To train a linear model, each training point in the training set needs its > dot product computed against the model, per iteration. If the model is large > (too large to fit in memory on a single machine) then SPARK-4590 proposes > using parameter server. > There is an easier way to achieve this without parameter servers. In > particular, if the data is held as a BlockMatrix and the model as an RDD, > then each block can be joined with the relevant part of the model, followed > by a reduceByKey to compute the dot products. > This obviates the need for a parameter server, at least for linear models. > However, it's unclear how it compares performance-wise to parameter servers. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19281) spark.ml Python API for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-19281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-19281. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17218 [https://github.com/apache/spark/pull/17218] > spark.ml Python API for FPGrowth > > > Key: SPARK-19281 > URL: https://issues.apache.org/jira/browse/SPARK-19281 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Maciej Szymkiewicz > Fix For: 2.2.0 > > > See parent issue. This is for a Python API *after* the Scala API has been > designed and implemented. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19281) spark.ml Python API for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-19281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-19281: - Assignee: Maciej Szymkiewicz > spark.ml Python API for FPGrowth > > > Key: SPARK-19281 > URL: https://issues.apache.org/jira/browse/SPARK-19281 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley >Assignee: Maciej Szymkiewicz > > See parent issue. This is for a Python API *after* the Scala API has been > designed and implemented. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19281) spark.ml Python API for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-19281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-19281: -- Shepherd: Joseph K. Bradley (was: Nick Pentreath) > spark.ml Python API for FPGrowth > > > Key: SPARK-19281 > URL: https://issues.apache.org/jira/browse/SPARK-19281 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley > > See parent issue. This is for a Python API *after* the Scala API has been > designed and implemented. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20072) Clarify ALS-WR documentation
[ https://issues.apache.org/jira/browse/SPARK-20072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942473#comment-15942473 ] chris snow commented on SPARK-20072: Will do. Thanks Sean. > Clarify ALS-WR documentation > > > Key: SPARK-20072 > URL: https://issues.apache.org/jira/browse/SPARK-20072 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.0 >Reporter: chris snow >Priority: Trivial > > https://www.mail-archive.com/user@spark.apache.org/msg62590.html > The documentation for collaborative filtering is as follows: > === > Scaling of the regularization parameter > Since v1.1, we scale the regularization parameter lambda in solving > each least squares problem by the number of ratings the user generated > in updating user factors, or the number of ratings the product > received in updating product factors. > === > I find this description confusing, probably because I lack a detailed > understanding of ALS. The wording suggest that the number of ratings > change ("generated", "received") during solving the least squares. > This is how I think I should be interpreting the description: > === > Since v1.1, we scale the regularization parameter lambda when solving > each least squares problem. When updating the user factors, we scale > the regularization parameter by the total number of ratings from the > user. Similarly, when updating the product factors, we scale the > regularization parameter by the total number of ratings for the > product. > === -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20102) Fix two minor build script issues blocking 2.1.1 RC + master snapshot builds
[ https://issues.apache.org/jira/browse/SPARK-20102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942454#comment-15942454 ] Apache Spark commented on SPARK-20102: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/17437 > Fix two minor build script issues blocking 2.1.1 RC + master snapshot builds > > > Key: SPARK-20102 > URL: https://issues.apache.org/jira/browse/SPARK-20102 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.1 >Reporter: Josh Rosen >Assignee: Josh Rosen > > The master snapshot publisher builds are currently broken due to two minor > build issues: > 1. For unknown reasons, the LFTP {{mkdir -p}} command started throwing errors > when the remote FTP directory already exists. To work around this, we should > update the script to ignore errors. > 2. The PySpark setup.py file references a non-existent module, causing Python > packaging to fail. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20102) Fix two minor build script issues blocking 2.1.1 RC + master snapshot builds
[ https://issues.apache.org/jira/browse/SPARK-20102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20102: Assignee: Apache Spark (was: Josh Rosen) > Fix two minor build script issues blocking 2.1.1 RC + master snapshot builds > > > Key: SPARK-20102 > URL: https://issues.apache.org/jira/browse/SPARK-20102 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.1 >Reporter: Josh Rosen >Assignee: Apache Spark > > The master snapshot publisher builds are currently broken due to two minor > build issues: > 1. For unknown reasons, the LFTP {{mkdir -p}} command started throwing errors > when the remote FTP directory already exists. To work around this, we should > update the script to ignore errors. > 2. The PySpark setup.py file references a non-existent module, causing Python > packaging to fail. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20102) Fix two minor build script issues blocking 2.1.1 RC + master snapshot builds
[ https://issues.apache.org/jira/browse/SPARK-20102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20102: Assignee: Josh Rosen (was: Apache Spark) > Fix two minor build script issues blocking 2.1.1 RC + master snapshot builds > > > Key: SPARK-20102 > URL: https://issues.apache.org/jira/browse/SPARK-20102 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.1 >Reporter: Josh Rosen >Assignee: Josh Rosen > > The master snapshot publisher builds are currently broken due to two minor > build issues: > 1. For unknown reasons, the LFTP {{mkdir -p}} command started throwing errors > when the remote FTP directory already exists. To work around this, we should > update the script to ignore errors. > 2. The PySpark setup.py file references a non-existent module, causing Python > packaging to fail. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20102) Fix two minor build script issues blocking 2.1.1 RC + master snapshot builds
Josh Rosen created SPARK-20102: -- Summary: Fix two minor build script issues blocking 2.1.1 RC + master snapshot builds Key: SPARK-20102 URL: https://issues.apache.org/jira/browse/SPARK-20102 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.1.1 Reporter: Josh Rosen Assignee: Josh Rosen The master snapshot publisher builds are currently broken due to two minor build issues: 1. For unknown reasons, the LFTP {{mkdir -p}} command started throwing errors when the remote FTP directory already exists. To work around this, we should update the script to ignore errors. 2. The PySpark setup.py file references a non-existent module, causing Python packaging to fail. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20086) issue with pyspark 2.1.0 window function
[ https://issues.apache.org/jira/browse/SPARK-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell reassigned SPARK-20086: - Assignee: Herman van Hovell > issue with pyspark 2.1.0 window function > > > Key: SPARK-20086 > URL: https://issues.apache.org/jira/browse/SPARK-20086 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: mandar uapdhye >Assignee: Herman van Hovell > Fix For: 2.1.1, 2.2.0 > > > original post at > [stackoverflow | > http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function] > I get error when working with pyspark window function. here is some example > code: > {code:title=borderStyle=solid} > import pyspark > import pyspark.sql.functions as sf > import pyspark.sql.types as sparktypes > from pyspark.sql import window > > sc = pyspark.SparkContext() > sqlc = pyspark.SQLContext(sc) > rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)]) > df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"]) > df.show() > {code} > gives: > | x|AmtPaid| > | 1|2.0| > | 1|3.0| > | 1|1.0| > | 1| -2.0| > | 1| -1.0| > next, compute cumulative sum > {code:title=test.py|borderStyle=solid} > win_spec_max = (window.Window > .partitionBy(['x']) > .rowsBetween(window.Window.unboundedPreceding, 0))) > df = df.withColumn('AmtPaidCumSum', >sf.sum(sf.col('AmtPaid')).over(win_spec_max)) > df.show() > {code} > gives, > | x|AmtPaid|AmtPaidCumSum| > | 1|2.0| 2.0| > | 1|3.0| 5.0| > | 1|1.0| 6.0| > | 1| -2.0| 4.0| > | 1| -1.0| 3.0| > next, compute cumulative max, > {code} > df = df.withColumn('AmtPaidCumSumMax', > sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max)) > df.show() > {code} > gives error log > {noformat} > Py4JJavaError: An error occurred while calling o2609.showString. > with traceback: > Py4JJavaErrorTraceback (most recent call last) > in () > > 1 df.show() > /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in > show(self, n, truncate) > 316 """ > 317 if isinstance(truncate, bool) and truncate: > --> 318 print(self._jdf.showString(n, 20)) > 319 else: > 320 print(self._jdf.showString(n, int(truncate))) > > /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc > in __call__(self, *args) >1131 answer = self.gateway_client.send_command(command) >1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, > self.name) >1134 >1135 for temp_arg in temp_args: > /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in > deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc > in get_return_value(answer, gateway_client, target_id, name) > 317 raise Py4JJavaError( > 318 "An error occurred while calling > {0}{1}{2}.\n". > --> 319 format(target_id, ".", name), value) > 320 else: > 321 raise Py4JError( > {noformat} > but interestingly enough, if i introduce another change before sencond window > operation, say inserting a column then it does not give that error: > {code} > df = df.withColumn('MaxBound', sf.lit(6.)) > df.show() > {code} > | x|AmtPaid|AmtPaidCumSum|MaxBound| > | 1|2.0| 2.0| 6.0| > | 1|3.0| 5.0| 6.0| > | 1|1.0| 6.0| 6.0| > | 1| -2.0| 4.0| 6.0| > | 1| -1.0| 3.0| 6.0| > {code} > #then apply the second window operations > df = df.withColumn('AmtPaidCumSumMax', > sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max)) > df.show() > {code} > | x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax| > | 1|2.0| 2.0| 6.0| 2.0| > | 1|3.0| 5.0| 6.0| 5.0| > | 1|1.0| 6.0| 6.0| 6.0| > | 1| -2.0| 4.0| 6.0| 6.0| > | 1| -1.0| 3.0| 6.0| 6.0| > I do not understand this behaviour > well, so far so
[jira] [Resolved] (SPARK-20086) issue with pyspark 2.1.0 window function
[ https://issues.apache.org/jira/browse/SPARK-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-20086. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 > issue with pyspark 2.1.0 window function > > > Key: SPARK-20086 > URL: https://issues.apache.org/jira/browse/SPARK-20086 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: mandar uapdhye >Assignee: Herman van Hovell > Fix For: 2.1.1, 2.2.0 > > > original post at > [stackoverflow | > http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function] > I get error when working with pyspark window function. here is some example > code: > {code:title=borderStyle=solid} > import pyspark > import pyspark.sql.functions as sf > import pyspark.sql.types as sparktypes > from pyspark.sql import window > > sc = pyspark.SparkContext() > sqlc = pyspark.SQLContext(sc) > rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)]) > df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"]) > df.show() > {code} > gives: > | x|AmtPaid| > | 1|2.0| > | 1|3.0| > | 1|1.0| > | 1| -2.0| > | 1| -1.0| > next, compute cumulative sum > {code:title=test.py|borderStyle=solid} > win_spec_max = (window.Window > .partitionBy(['x']) > .rowsBetween(window.Window.unboundedPreceding, 0))) > df = df.withColumn('AmtPaidCumSum', >sf.sum(sf.col('AmtPaid')).over(win_spec_max)) > df.show() > {code} > gives, > | x|AmtPaid|AmtPaidCumSum| > | 1|2.0| 2.0| > | 1|3.0| 5.0| > | 1|1.0| 6.0| > | 1| -2.0| 4.0| > | 1| -1.0| 3.0| > next, compute cumulative max, > {code} > df = df.withColumn('AmtPaidCumSumMax', > sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max)) > df.show() > {code} > gives error log > {noformat} > Py4JJavaError: An error occurred while calling o2609.showString. > with traceback: > Py4JJavaErrorTraceback (most recent call last) > in () > > 1 df.show() > /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in > show(self, n, truncate) > 316 """ > 317 if isinstance(truncate, bool) and truncate: > --> 318 print(self._jdf.showString(n, 20)) > 319 else: > 320 print(self._jdf.showString(n, int(truncate))) > > /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc > in __call__(self, *args) >1131 answer = self.gateway_client.send_command(command) >1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, > self.name) >1134 >1135 for temp_arg in temp_args: > /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in > deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc > in get_return_value(answer, gateway_client, target_id, name) > 317 raise Py4JJavaError( > 318 "An error occurred while calling > {0}{1}{2}.\n". > --> 319 format(target_id, ".", name), value) > 320 else: > 321 raise Py4JError( > {noformat} > but interestingly enough, if i introduce another change before sencond window > operation, say inserting a column then it does not give that error: > {code} > df = df.withColumn('MaxBound', sf.lit(6.)) > df.show() > {code} > | x|AmtPaid|AmtPaidCumSum|MaxBound| > | 1|2.0| 2.0| 6.0| > | 1|3.0| 5.0| 6.0| > | 1|1.0| 6.0| 6.0| > | 1| -2.0| 4.0| 6.0| > | 1| -1.0| 3.0| 6.0| > {code} > #then apply the second window operations > df = df.withColumn('AmtPaidCumSumMax', > sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max)) > df.show() > {code} > | x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax| > | 1|2.0| 2.0| 6.0| 2.0| > | 1|3.0| 5.0| 6.0| 5.0| > | 1|1.0| 6.0| 6.0| 6.0| > | 1| -2.0| 4.0| 6.0| 6.0| > | 1| -1.0| 3.0| 6.0| 6.0| > I do not
[jira] [Commented] (SPARK-19476) Running threads in Spark DataFrame foreachPartition() causes NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-19476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942431#comment-15942431 ] Gal Topper commented on SPARK-19476: The example in the description does indeed fully materialize the iterator to very simply reproduce the issue. To be clear, the real code I'm running doesn't do that :-)! Instead, it pulls items from the iterator on demand. The workaround I described basically makes sure that only the original executor thread ever calls next() on the iterator, which is still done on demand, not all-at-once. In my own experience, using threads works perfectly fine with the exception of this issue, and I've never read anything in the docs to discourage users from doing so. +1 for a note though, if that's really not something the authors intended. > Running threads in Spark DataFrame foreachPartition() causes > NullPointerException > - > > Key: SPARK-19476 > URL: https://issues.apache.org/jira/browse/SPARK-19476 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Gal Topper >Priority: Minor > > First reported on [Stack > overflow|http://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition]. > I use multiple threads inside foreachPartition(), which works great for me > except for when the underlying iterator is TungstenAggregationIterator. Here > is a minimal code snippet to reproduce: > {code:title=Reproduce.scala|borderStyle=solid} > import scala.concurrent.ExecutionContext.Implicits.global > import scala.concurrent.duration.Duration > import scala.concurrent.{Await, Future} > import org.apache.spark.SparkContext > import org.apache.spark.sql.SQLContext > object Reproduce extends App { > val sc = new SparkContext("local", "reproduce") > val sqlContext = new SQLContext(sc) > import sqlContext.implicits._ > val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count() > df.foreachPartition { iterator => > val f = Future(iterator.toVector) > Await.result(f, Duration.Inf) > } > } > {code} > When I run this, I get: > {noformat} > java.lang.NullPointerException > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > {noformat} > I believe I actually understand why this happens - > TungstenAggregationIterator uses a ThreadLocal variable that returns null > when called from a thread other than the original thread that got the > iterator from Spark. From examining the code, this does not appear to differ > between recent Spark versions. > However, this limitation is specific to TungstenAggregationIterator, and not > documented, as far as I'm aware. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19476) Running threads in Spark DataFrame foreachPartition() causes NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-19476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-19476: -- Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) You certainly access the iterator from Spark right? that's what I mean. You might just copy that off into a data structure with no relation to the iterator implementation. That of course may not be feasible. I think this is true over the whole API, that you're not intended to create your own threads. It could well work in many cases but don't think it's guaranteed to. Sure, maybe worth a note. > Running threads in Spark DataFrame foreachPartition() causes > NullPointerException > - > > Key: SPARK-19476 > URL: https://issues.apache.org/jira/browse/SPARK-19476 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Gal Topper >Priority: Minor > > First reported on [Stack > overflow|http://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition]. > I use multiple threads inside foreachPartition(), which works great for me > except for when the underlying iterator is TungstenAggregationIterator. Here > is a minimal code snippet to reproduce: > {code:title=Reproduce.scala|borderStyle=solid} > import scala.concurrent.ExecutionContext.Implicits.global > import scala.concurrent.duration.Duration > import scala.concurrent.{Await, Future} > import org.apache.spark.SparkContext > import org.apache.spark.sql.SQLContext > object Reproduce extends App { > val sc = new SparkContext("local", "reproduce") > val sqlContext = new SQLContext(sc) > import sqlContext.implicits._ > val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count() > df.foreachPartition { iterator => > val f = Future(iterator.toVector) > Await.result(f, Duration.Inf) > } > } > {code} > When I run this, I get: > {noformat} > java.lang.NullPointerException > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > {noformat} > I believe I actually understand why this happens - > TungstenAggregationIterator uses a ThreadLocal variable that returns null > when called from a thread other than the original thread that got the > iterator from Spark. From examining the code, this does not appear to differ > between recent Spark versions. > However, this limitation is specific to TungstenAggregationIterator, and not > documented, as far as I'm aware. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19476) Running threads in Spark DataFrame foreachPartition() causes NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-19476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942419#comment-15942419 ] Gal Topper commented on SPARK-19476: I'm pretty sure we're talking about different things. My code running inside foreachPartition naturally doesn't need any Spark internals. It just takes the data and writes it to a database, and that writing process happens not to be single-threaded. It doesn't copy any data, and it works pretty well using the (far from trivial) workaround described and provided above. If this thread local is too entrenched to feasibly fix the issue, I would at least suggest that this limitation be documented (e.g. "@param iterator may only be accessed by the original executor thread", and/or otherwise in the docs). That's what I'd do, anyway. > Running threads in Spark DataFrame foreachPartition() causes > NullPointerException > - > > Key: SPARK-19476 > URL: https://issues.apache.org/jira/browse/SPARK-19476 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Gal Topper > > First reported on [Stack > overflow|http://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition]. > I use multiple threads inside foreachPartition(), which works great for me > except for when the underlying iterator is TungstenAggregationIterator. Here > is a minimal code snippet to reproduce: > {code:title=Reproduce.scala|borderStyle=solid} > import scala.concurrent.ExecutionContext.Implicits.global > import scala.concurrent.duration.Duration > import scala.concurrent.{Await, Future} > import org.apache.spark.SparkContext > import org.apache.spark.sql.SQLContext > object Reproduce extends App { > val sc = new SparkContext("local", "reproduce") > val sqlContext = new SQLContext(sc) > import sqlContext.implicits._ > val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count() > df.foreachPartition { iterator => > val f = Future(iterator.toVector) > Await.result(f, Duration.Inf) > } > } > {code} > When I run this, I get: > {noformat} > java.lang.NullPointerException > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > {noformat} > I believe I actually understand why this happens - > TungstenAggregationIterator uses a ThreadLocal variable that returns null > when called from a thread other than the original thread that got the > iterator from Spark. From examining the code, this does not appear to differ > between recent Spark versions. > However, this limitation is specific to TungstenAggregationIterator, and not > documented, as far as I'm aware. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14922) Alter Table Drop Partition Using Predicate-based Partition Spec
[ https://issues.apache.org/jira/browse/SPARK-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942413#comment-15942413 ] Dongjoon Hyun edited comment on SPARK-14922 at 3/26/17 7:16 PM: [~smilegator], [~hyukjin.kwon]. I will reopen this issue instead of SPARK-17732. This issue includes SPARK-17732 . When I created SPARK-17732, I mistakenly didn't notice this one. SPARK-17732 tried to add operators like '<', but eventually it turns out that we also need to support *expressions for values* in predicates. was (Author: dongjoon): [~smilegator], [~hyukjin.kwon]. I will reopen this issue instead of SPARK-17732. This issue includes SPARK-17732 . When I created SPARK-17732, I mistakenly didn't notice this one. SPARK-17732 tried to add operators like '<', but eventually it turns out that we need to support *expressions* in predicates. > Alter Table Drop Partition Using Predicate-based Partition Spec > --- > > Key: SPARK-14922 > URL: https://issues.apache.org/jira/browse/SPARK-14922 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Below is allowed in Hive, but not allowed in Spark. > {noformat} > alter table ptestfilter drop partition (c='US', d<'2') > {noformat} > This example is copied from drop_partitions_filter.q -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-17732) ALTER TABLE DROP PARTITION should support comparators
[ https://issues.apache.org/jira/browse/SPARK-17732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-17732. - Resolution: Duplicate > ALTER TABLE DROP PARTITION should support comparators > - > > Key: SPARK-17732 > URL: https://issues.apache.org/jira/browse/SPARK-17732 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun > > This issue aims to support `comparators`, e.g. '<', '<=', '>', '>=', again in > Apache Spark 2.0 for backward compatibility. > *Spark 1.6.2* > {code} > scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, > quarter STRING)") > res0: org.apache.spark.sql.DataFrame = [result: string] > scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')") > res1: org.apache.spark.sql.DataFrame = [result: string] > {code} > *Spark 2.0* > {code} > scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, > quarter STRING)") > res0: org.apache.spark.sql.DataFrame = [] > scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')") > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '<' expecting {')', ','}(line 1, pos 42) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-14922) Alter Table Drop Partition Using Predicate-based Partition Spec
[ https://issues.apache.org/jira/browse/SPARK-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-14922: --- [~smilegator], [~hyukjin.kwon]. I will reopen this issue instead of SPARK-17732. This issue includes SPARK-17732 . When I created SPARK-17732, I mistakenly didn't notice this one. SPARK-17732 tried to add operators like '<', but eventually it turns out that we need to support *expressions* in predicates. > Alter Table Drop Partition Using Predicate-based Partition Spec > --- > > Key: SPARK-14922 > URL: https://issues.apache.org/jira/browse/SPARK-14922 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Below is allowed in Hive, but not allowed in Spark. > {noformat} > alter table ptestfilter drop partition (c='US', d<'2') > {noformat} > This example is copied from drop_partitions_filter.q -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19476) Running threads in Spark DataFrame foreachPartition() causes NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-19476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942405#comment-15942405 ] Sean Owen commented on SPARK-19476: --- You don't need an executor per I/O call, you need one slot. I don't believe what you're doing is generally going to work; you're building around Spark's execution model. You will probably need to rewrite to restrict the async processing to something that doesn't need to access these Spark objects somehow, or do something like copy the data off the iterator if that's feasible, to avoid this access. > Running threads in Spark DataFrame foreachPartition() causes > NullPointerException > - > > Key: SPARK-19476 > URL: https://issues.apache.org/jira/browse/SPARK-19476 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Gal Topper > > First reported on [Stack > overflow|http://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition]. > I use multiple threads inside foreachPartition(), which works great for me > except for when the underlying iterator is TungstenAggregationIterator. Here > is a minimal code snippet to reproduce: > {code:title=Reproduce.scala|borderStyle=solid} > import scala.concurrent.ExecutionContext.Implicits.global > import scala.concurrent.duration.Duration > import scala.concurrent.{Await, Future} > import org.apache.spark.SparkContext > import org.apache.spark.sql.SQLContext > object Reproduce extends App { > val sc = new SparkContext("local", "reproduce") > val sqlContext = new SQLContext(sc) > import sqlContext.implicits._ > val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count() > df.foreachPartition { iterator => > val f = Future(iterator.toVector) > Await.result(f, Duration.Inf) > } > } > {code} > When I run this, I get: > {noformat} > java.lang.NullPointerException > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > {noformat} > I believe I actually understand why this happens - > TungstenAggregationIterator uses a ThreadLocal variable that returns null > when called from a thread other than the original thread that got the > iterator from Spark. From examining the code, this does not appear to differ > between recent Spark versions. > However, this limitation is specific to TungstenAggregationIterator, and not > documented, as far as I'm aware. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19941) Spark should not schedule tasks on executors on decommissioning YARN nodes
[ https://issues.apache.org/jira/browse/SPARK-19941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19941. --- Resolution: Won't Fix > Spark should not schedule tasks on executors on decommissioning YARN nodes > -- > > Key: SPARK-19941 > URL: https://issues.apache.org/jira/browse/SPARK-19941 > Project: Spark > Issue Type: Improvement > Components: Scheduler, YARN >Affects Versions: 2.1.0 > Environment: Hadoop 2.8.0-rc1 >Reporter: Karthik Palaniappan > > Hadoop 2.8 added a mechanism to gracefully decommission Node Managers in > YARN: https://issues.apache.org/jira/browse/YARN-914 > Essentially you can mark nodes to be decommissioned, and let them a) finish > work in progress and b) finish serving shuffle data. But no new work will be > scheduled on the node. > Spark should respect when NMs are set to decommissioned, and similarly > decommission executors on those nodes by not scheduling any more tasks on > them. > It looks like in the future YARN may inform the app master when containers > will be killed: https://issues.apache.org/jira/browse/YARN-3784. However, I > don't think Spark should schedule based on a timeout. We should gracefully > decommission the executor as fast as possible (which is the spirit of > YARN-914). The app master can query the RM for NM statuses (if it doesn't > already have them) and stop scheduling on executors on NMs that are > decommissioning. > Stretch feature: The timeout may be useful in determining whether running > further tasks on the executor is even helpful. Spark may be able to tell that > shuffle data will not be consumed by the time the node is decommissioned, so > it is not worth computing. The executor can be killed immediately. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19977) Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from 5 ms to 30 seconds in Spark Streaming application
[ https://issues.apache.org/jira/browse/SPARK-19977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19977. --- Resolution: Not A Problem > Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from > 5 ms to 30 seconds in Spark Streaming application > -- > > Key: SPARK-19977 > URL: https://issues.apache.org/jira/browse/SPARK-19977 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Ray Qiu > > Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from > 5 ms to 30+ seconds in a Spark Streaming application, where multiple Kafka > direct streams are processed. These kafka streams are processed separately > (not combined via union). > It causes the task processing time to increase greatly and eventually stops > working. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20037) impossible to set kafka offsets using kafka 0.10 and spark 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-20037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942399#comment-15942399 ] Sean Owen commented on SPARK-20037: --- I'm not sure this coherently describes the problem in a way that's reproducible. It's not in general true that this fails or else nothing would work in Spark. I suspect your code is setting offsets incorrectly elsewhere. > impossible to set kafka offsets using kafka 0.10 and spark 2.0.0 > > > Key: SPARK-20037 > URL: https://issues.apache.org/jira/browse/SPARK-20037 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: Daniel Nuriyev > Attachments: offsets.png > > > I use kafka 0.10.1 and java code with the following dependencies: > > org.apache.kafka > kafka_2.11 > 0.10.1.1 > > > org.apache.kafka > kafka-clients > 0.10.1.1 > > > org.apache.spark > spark-streaming_2.11 > 2.0.0 > > > org.apache.spark > spark-streaming-kafka-0-10_2.11 > 2.0.0 > > The code tries to read the a topic starting with offsets. > The topic has 4 partitions that start somewhere before 585000 and end after > 674000. So I wanted to read all partitions starting with 585000 > fromOffsets.put(new TopicPartition(topic, 0), 585000L); > fromOffsets.put(new TopicPartition(topic, 1), 585000L); > fromOffsets.put(new TopicPartition(topic, 2), 585000L); > fromOffsets.put(new TopicPartition(topic, 3), 585000L); > Using 5 second batches: > jssc = new JavaStreamingContext(conf, Durations.seconds(5)); > The code immediately throws: > Beginning offset 585000 is after the ending offset 584464 for topic > commerce_item_expectation partition 1 > It does not make sense because this topic/partition starts at 584464, not ends > I use this as a base: > https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html > But I use direct stream: > KafkaUtils.createDirectStream(jssc,LocationStrategies.PreferConsistent(), > ConsumerStrategies.Subscribe( > topics, kafkaParams, fromOffsets > ) > ) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20101) Use OffHeapColumnVector when "spark.memory.offHeap.enabled" is set to "true"
[ https://issues.apache.org/jira/browse/SPARK-20101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942384#comment-15942384 ] Apache Spark commented on SPARK-20101: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/17436 > Use OffHeapColumnVector when "spark.memory.offHeap.enabled" is set to "true" > > > Key: SPARK-20101 > URL: https://issues.apache.org/jira/browse/SPARK-20101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki > > While {{ColumnVector}} has two implementations {{OnHeapColumnVector}} and > {{OffHeapColumnVector}}, only {{OnHeapColumnVector}} is used. > This JIRA enables to use {{OffHeapColumnVector}} when > {{spark.memory.offHeap.enabled}} is set to {{true}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20101) Use OffHeapColumnVector when "spark.memory.offHeap.enabled" is set to "true"
[ https://issues.apache.org/jira/browse/SPARK-20101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20101: Assignee: Apache Spark > Use OffHeapColumnVector when "spark.memory.offHeap.enabled" is set to "true" > > > Key: SPARK-20101 > URL: https://issues.apache.org/jira/browse/SPARK-20101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki >Assignee: Apache Spark > > While {{ColumnVector}} has two implementations {{OnHeapColumnVector}} and > {{OffHeapColumnVector}}, only {{OnHeapColumnVector}} is used. > This JIRA enables to use {{OffHeapColumnVector}} when > {{spark.memory.offHeap.enabled}} is set to {{true}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20101) Use OffHeapColumnVector when "spark.memory.offHeap.enabled" is set to "true"
[ https://issues.apache.org/jira/browse/SPARK-20101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20101: Assignee: (was: Apache Spark) > Use OffHeapColumnVector when "spark.memory.offHeap.enabled" is set to "true" > > > Key: SPARK-20101 > URL: https://issues.apache.org/jira/browse/SPARK-20101 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki > > While {{ColumnVector}} has two implementations {{OnHeapColumnVector}} and > {{OffHeapColumnVector}}, only {{OnHeapColumnVector}} is used. > This JIRA enables to use {{OffHeapColumnVector}} when > {{spark.memory.offHeap.enabled}} is set to {{true}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20098) DataType's typeName method returns with 'StructF' in case of StructField
[ https://issues.apache.org/jira/browse/SPARK-20098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942382#comment-15942382 ] Apache Spark commented on SPARK-20098: -- User 'szalai1' has created a pull request for this issue: https://github.com/apache/spark/pull/17435 > DataType's typeName method returns with 'StructF' in case of StructField > > > Key: SPARK-20098 > URL: https://issues.apache.org/jira/browse/SPARK-20098 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Peter Szalai > > Currently, if you want to get the name of a DateType and the DateType is a > `StructField`, you get `StructF`. > http://spark.apache.org/docs/2.1.0/api/python/_modules/pyspark/sql/types.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20098) DataType's typeName method returns with 'StructF' in case of StructField
[ https://issues.apache.org/jira/browse/SPARK-20098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20098: Assignee: (was: Apache Spark) > DataType's typeName method returns with 'StructF' in case of StructField > > > Key: SPARK-20098 > URL: https://issues.apache.org/jira/browse/SPARK-20098 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Peter Szalai > > Currently, if you want to get the name of a DateType and the DateType is a > `StructField`, you get `StructF`. > http://spark.apache.org/docs/2.1.0/api/python/_modules/pyspark/sql/types.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20098) DataType's typeName method returns with 'StructF' in case of StructField
[ https://issues.apache.org/jira/browse/SPARK-20098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20098: Assignee: Apache Spark > DataType's typeName method returns with 'StructF' in case of StructField > > > Key: SPARK-20098 > URL: https://issues.apache.org/jira/browse/SPARK-20098 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Peter Szalai >Assignee: Apache Spark > > Currently, if you want to get the name of a DateType and the DateType is a > `StructField`, you get `StructF`. > http://spark.apache.org/docs/2.1.0/api/python/_modules/pyspark/sql/types.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19476) Running threads in Spark DataFrame foreachPartition() causes NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-19476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942380#comment-15942380 ] Gal Topper commented on SPARK-19476: BTW, my workaround was quite painful: I created a single-threaded producer that used the original executor thread to feed the data into the async stream. Here's the code, in case anyone stumbles on this and is looking for a way out: {code:title=SingleThreadedPublisher.scala|borderStyle=solid} import org.reactivestreams.{Publisher, Subscriber, Subscription} /** A reactive streams publisher to work around SPARK-19476 by only using Spark's iterator from the original thread. */ private class SingleThreadedPublisher[T] extends Publisher[T] { private var cancelled = false private var demand = 0L private var subscriber: Subscriber[_ >: T] = _ private val waitForDemandObject = new Object override def subscribe(s: Subscriber[_ >: T]): Unit = { this.subscriber = s val subscription = new Subscription { override def cancel(): Unit = waitForDemandObject.synchronized { cancelled = true waitForDemandObject.notify() } override def request(n: Long): Unit = { waitForDemandObject.synchronized { demand += n waitForDemandObject.notify() } } } s.onSubscribe(subscription) } private def produce(element: T): Unit = { demand -= 1 subscriber.onNext(element) } def push(iterator: Iterator[T]): Unit = { iterator.takeWhile(_ => !cancelled).foreach { element => waitForDemandObject.synchronized { if (demand > 0L) { produce(element) } else { waitForDemandObject.wait() produce(element) } } } if (!cancelled) { subscriber.onComplete() } } } {code} > Running threads in Spark DataFrame foreachPartition() causes > NullPointerException > - > > Key: SPARK-19476 > URL: https://issues.apache.org/jira/browse/SPARK-19476 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Gal Topper > > First reported on [Stack > overflow|http://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition]. > I use multiple threads inside foreachPartition(), which works great for me > except for when the underlying iterator is TungstenAggregationIterator. Here > is a minimal code snippet to reproduce: > {code:title=Reproduce.scala|borderStyle=solid} > import scala.concurrent.ExecutionContext.Implicits.global > import scala.concurrent.duration.Duration > import scala.concurrent.{Await, Future} > import org.apache.spark.SparkContext > import org.apache.spark.sql.SQLContext > object Reproduce extends App { > val sc = new SparkContext("local", "reproduce") > val sqlContext = new SQLContext(sc) > import sqlContext.implicits._ > val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count() > df.foreachPartition { iterator => > val f = Future(iterator.toVector) > Await.result(f, Duration.Inf) > } > } > {code} > When I run this, I get: > {noformat} > java.lang.NullPointerException > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > {noformat} > I believe I actually understand why this happens - > TungstenAggregationIterator uses a ThreadLocal variable that returns null > when called from a thread other than the original thread that got the > iterator from Spark. From examining the code, this does not appear to differ > between recent Spark versions. > However, this limitation is specific to TungstenAggregationIterator, and not > documented, as far as I'm aware. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19476) Running threads in Spark DataFrame foreachPartition() causes NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-19476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942326#comment-15942326 ] Gal Topper edited comment on SPARK-19476 at 3/26/17 6:08 PM: - In our case, we use Akka Streams to make many parallel, async I/O calls. To speed this up, we allow a large number of in-flight requests. It would not be feasible to have as many executors as we do in-flight requests, nor should the two be coupled IMO. One solution that came to my mind is to simply assign TaskContext.get().taskMetrics() to a private val (see [here|https://github.com/apache/spark/blob/362ee93296a0de6342b4339e941e6a11f445c5b2/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala#L107]) and use that later on when the iterator is exhausted (see [here|https://github.com/apache/spark/blob/362ee93296a0de6342b4339e941e6a11f445c5b2/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala#L419]). Generally speaking, iterators are not expected to be thread-safe, but also not to be tethered to any one specific thread, and I think the suggested change would comply with this standard contract. I could make a small PR if the idea makes sense. EDIT: Nevermind, it's not that simple, because there is at least [one|https://github.com/apache/spark/blob/dd9049e0492cc70b629518fee9b3d1632374c612/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L313] more thread locality assumption that break when I try this. was (Author: gal.topper): In our case, we use Akka Streams to make many parallel, async I/O calls. To speed this up, we allow a large number of in-flight requests. It would not be feasible to have as many executors as we do in-flight requests, nor should the two be coupled IMO. One solution that came to my mind is to simply assign TaskContext.get().taskMetrics() to a private val (see [here|https://github.com/apache/spark/blob/362ee93296a0de6342b4339e941e6a11f445c5b2/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala#L107]) and use that later on when the iterator is exhausted (see [here|https://github.com/apache/spark/blob/362ee93296a0de6342b4339e941e6a11f445c5b2/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala#L419]). Generally speaking, iterators are not expected to be thread-safe, but also not to be tethered to any one specific thread, and I think the suggested change would comply with this standard contract. I could make a small PR if the idea makes sense. > Running threads in Spark DataFrame foreachPartition() causes > NullPointerException > - > > Key: SPARK-19476 > URL: https://issues.apache.org/jira/browse/SPARK-19476 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Gal Topper > > First reported on [Stack > overflow|http://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition]. > I use multiple threads inside foreachPartition(), which works great for me > except for when the underlying iterator is TungstenAggregationIterator. Here > is a minimal code snippet to reproduce: > {code:title=Reproduce.scala|borderStyle=solid} > import scala.concurrent.ExecutionContext.Implicits.global > import scala.concurrent.duration.Duration > import scala.concurrent.{Await, Future} > import org.apache.spark.SparkContext > import org.apache.spark.sql.SQLContext > object Reproduce extends App { > val sc = new SparkContext("local", "reproduce") > val sqlContext = new SQLContext(sc) > import sqlContext.implicits._ > val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count() > df.foreachPartition { iterator => > val f = Future(iterator.toVector) > Await.result(f, Duration.Inf) > } > } > {code} > When I run this, I get: > {noformat} > java.lang.NullPointerException > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > {noformat} > I believe I actually understand why this happens - > TungstenAggregationIterator uses a ThreadLocal variable that returns null > when called from a thread other than the
[jira] [Created] (SPARK-20101) Use OffHeapColumnVector when "spark.memory.offHeap.enabled" is set to "true"
Kazuaki Ishizaki created SPARK-20101: Summary: Use OffHeapColumnVector when "spark.memory.offHeap.enabled" is set to "true" Key: SPARK-20101 URL: https://issues.apache.org/jira/browse/SPARK-20101 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Kazuaki Ishizaki While {{ColumnVector}} has two implementations {{OnHeapColumnVector}} and {{OffHeapColumnVector}}, only {{OnHeapColumnVector}} is used. This JIRA enables to use {{OffHeapColumnVector}} when {{spark.memory.offHeap.enabled}} is set to {{true}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19476) Running threads in Spark DataFrame foreachPartition() causes NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-19476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942326#comment-15942326 ] Gal Topper commented on SPARK-19476: In our case, we use Akka Streams to make many parallel, async I/O calls. To speed this up, we allow a large number of in-flight requests. It would not be feasible to have as many executors as we do in-flight requests, nor should the two be coupled IMO. One solution that came to my mind is to simply assign TaskContext.get().taskMetrics() to a private val (see [here|https://github.com/apache/spark/blob/362ee93296a0de6342b4339e941e6a11f445c5b2/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala#L107]) and use that later on when the iterator is exhausted (see [here|https://github.com/apache/spark/blob/362ee93296a0de6342b4339e941e6a11f445c5b2/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala#L419]). Generally speaking, iterators are not expected to be thread-safe, but also not to be tethered to any one specific thread, and I think the suggested change would comply with this standard contract. I could make a small PR if the idea makes sense. > Running threads in Spark DataFrame foreachPartition() causes > NullPointerException > - > > Key: SPARK-19476 > URL: https://issues.apache.org/jira/browse/SPARK-19476 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Gal Topper > > First reported on [Stack > overflow|http://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition]. > I use multiple threads inside foreachPartition(), which works great for me > except for when the underlying iterator is TungstenAggregationIterator. Here > is a minimal code snippet to reproduce: > {code:title=Reproduce.scala|borderStyle=solid} > import scala.concurrent.ExecutionContext.Implicits.global > import scala.concurrent.duration.Duration > import scala.concurrent.{Await, Future} > import org.apache.spark.SparkContext > import org.apache.spark.sql.SQLContext > object Reproduce extends App { > val sc = new SparkContext("local", "reproduce") > val sqlContext = new SQLContext(sc) > import sqlContext.implicits._ > val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count() > df.foreachPartition { iterator => > val f = Future(iterator.toVector) > Await.result(f, Duration.Inf) > } > } > {code} > When I run this, I get: > {noformat} > java.lang.NullPointerException > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > {noformat} > I believe I actually understand why this happens - > TungstenAggregationIterator uses a ThreadLocal variable that returns null > when called from a thread other than the original thread that got the > iterator from Spark. From examining the code, this does not appear to differ > between recent Spark versions. > However, this limitation is specific to TungstenAggregationIterator, and not > documented, as far as I'm aware. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14726) Support for sampling when inferring schema in CSV data source
[ https://issues.apache.org/jira/browse/SPARK-14726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15247378#comment-15247378 ] Hyukjin Kwon edited comment on SPARK-14726 at 3/26/17 2:09 PM: --- This is currently not supported. I will work on this if it is decided to be supported. [~rxin] was (Author: hyukjin.kwon): This is currently not supported. I can work on this but I feel a bit hesitating because I believe CSV data source is ported mainly for "small data world". But I believe there are a lot of users dealing with large CSV files. I will work on this if it is decided to be supported. [~rxin] > Support for sampling when inferring schema in CSV data source > - > > Key: SPARK-14726 > URL: https://issues.apache.org/jira/browse/SPARK-14726 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Bomi Kim > > Currently, I am using CSV data source and trying to get used to Spark 2.0 > because it has built-in CSV data source. > I realized that CSV data source infers schema with all the data. JSON data > source supports sampling ratio option. > It would be great if CSV data source has this option too (or is this > supported already?). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14057) sql time stamps do not respect time zones
[ https://issues.apache.org/jira/browse/SPARK-14057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942285#comment-15942285 ] Hyukjin Kwon commented on SPARK-14057: -- [~vsparmar], would {{spark.sql.session.timeZone}} solve this problem? > sql time stamps do not respect time zones > - > > Key: SPARK-14057 > URL: https://issues.apache.org/jira/browse/SPARK-14057 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.6.0 >Reporter: Andrew Davidson >Priority: Minor > > we have time stamp data. The time stamp data is UTC how ever when we load the > data into spark data frames, the system assume the time stamps are in the > local time zone. This causes problems for our data scientists. Often they > pull data from our data center into their local macs. The data centers run > UTC. There computers are typically in PST or EST. > It is possible to hack around this problem > This cause a lot of errors in their analysis > A complete description of this issue can be found in the following mail msg > https://www.mail-archive.com/user@spark.apache.org/msg48121.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13169) CROSS JOIN slow or fails on tiny table
[ https://issues.apache.org/jira/browse/SPARK-13169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-13169. -- Resolution: Cannot Reproduce I am resolving this. I can't reproduce this against the current master as below: {code} val sql = """ SELECT `gwtlyrpywf`.`gear`,`gwtlyrpywf`.`cyl`,`vs` FROM ( SELECT DISTINCT * FROM ( SELECT `gear` AS `gear`, `cyl` AS `cyl`FROM `mtcars`) AS `zzz1`) AS `gwtlyrpywf` CROSS JOIN ( SELECT DISTINCT * FROM ( SELECT `vs` AS `vs` FROM `mtcars`) AS `zzz3`) AS `arytvfispy` """ spark.read.option("header", true).option("inferSchema", true).csv("mtcars.csv").createOrReplaceTempView("mtcars") spark.sql(sql).show() {code} {code} ++---+---+ |gear|cyl| vs| ++---+---+ | 5| 6| 1| | 5| 6| 0| | 5| 4| 1| | 5| 4| 0| | 4| 6| 1| | 4| 6| 0| | 3| 6| 1| | 3| 6| 0| | 5| 8| 1| | 5| 8| 0| | 3| 8| 1| | 3| 8| 0| | 3| 4| 1| | 3| 4| 0| | 4| 4| 1| | 4| 4| 0| ++---+---+ {code} This seems fixed in the master. It would be great if someone identifies the JIRA and backports this if applicable. > CROSS JOIN slow or fails on tiny table > -- > > Key: SPARK-13169 > URL: https://issues.apache.org/jira/browse/SPARK-13169 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Antonio Piccolboni > > I am running a cross join with a distinct select on both sides. Table is tiny > (32 X 16). Running query through the thriftserver. Data is here > (https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv). > Query never terminates before 200s, mostly fails (TTransportException) while > all cores are being used (single machine). > Query is > {code} > SELECT `gwtlyrpywf`.`gear`,`gwtlyrpywf`.`cyl`,`vs` FROM ( >SELECT DISTINCT * FROM ( > SELECT `gear` AS `gear`, `cyl` AS `cyl`FROM `mtcars`) > AS `zzz1`) > AS `gwtlyrpywf` >CROSS JOIN ( >SELECT DISTINCT * FROM ( > SELECT `vs` AS `vs` FROM `mtcars`) >AS `zzz3`) >AS `arytvfispy` > {code} > I know it can be simplified, but it comes from a generator and the generator > counts on the optimizer to do the right thing. EXPLAIN shows the following > {code} > plan > 1 > > > > > > >== Physical Plan == > 2 > > > > > > > Project [gear#21,cyl#22,vs#23] > 3 > > > > > > >+- CartesianProduct > 4 > > > > > >
[jira] [Comment Edited] (SPARK-11784) Support Timestamp filter pushdown in Parquet datasource
[ https://issues.apache.org/jira/browse/SPARK-11784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557940#comment-15557940 ] Hyukjin Kwon edited comment on SPARK-11784 at 3/26/17 1:42 PM: --- Could you fill up the description? it seems you referred the {{TimestampType}} support in Parquet data source? was (Author: hyukjin.kwon): Could you feel up the description? it seems you referred the {{TimestampType}} support in Parquet data source? > Support Timestamp filter pushdown in Parquet datasource > > > Key: SPARK-11784 > URL: https://issues.apache.org/jira/browse/SPARK-11784 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Ian > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20025) Driver fail over will not work, if SPARK_LOCAL* env is set.
[ https://issues.apache.org/jira/browse/SPARK-20025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20025: -- Target Version/s: 2.2.0 (was: 2.1.1, 2.2.0) > Driver fail over will not work, if SPARK_LOCAL* env is set. > --- > > Key: SPARK-20025 > URL: https://issues.apache.org/jira/browse/SPARK-20025 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Prashant Sharma > > In a bare metal system with No DNS setup, spark may be configured with > SPARK_LOCAL* for IP and host properties. > During a driver failover, in cluster deployment mode. SPARK_LOCAL* should be > ignored while auto deploying and should be picked up from target system's > local environment. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19570) Allow to disable hive in pyspark shell
[ https://issues.apache.org/jira/browse/SPARK-19570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-19570: -- Target Version/s: (was: 2.1.1) > Allow to disable hive in pyspark shell > -- > > Key: SPARK-19570 > URL: https://issues.apache.org/jira/browse/SPARK-19570 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Jeff Zhang >Priority: Minor > > SPARK-15236 do this for scala shell, this ticket is for pyspark shell. This > is not only for pyspark itself, but can also benefit downstream project like > livy which use shell.py for its interactive session. For now, livy has no > control of whether enable hive or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20100) Consolidate SessionState construction
[ https://issues.apache.org/jira/browse/SPARK-20100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1594#comment-1594 ] Apache Spark commented on SPARK-20100: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/17433 > Consolidate SessionState construction > - > > Key: SPARK-20100 > URL: https://issues.apache.org/jira/browse/SPARK-20100 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell > > The current SessionState initialization path is quite complex. A part of the > creation is done in the SessionState companion objects, a part of the > creation is one inside the SessionState class, and a part is done by passing > functions. > The proposal is to consolidate the SessionState initialization into a builder > class. This SessionState will not do any initialization and just becomes a > place holder for the various Spark SQL internals. The advantages of this > approach are the following: > - SessionState initialization is less dispersed. The builder should be a one > stop shop. > - This provides us with a start for removing the HiveSessionState. Removing > the hive session state would also require us to move resource loading into a > separate class, and to (re)move metadata hive. > - It is easier to customize the Spark Session. You just need to create a > custom version of the builder. I will add hooks to make this easier. Opening > up these API's will happen at a later point. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20100) Consolidate SessionState construction
[ https://issues.apache.org/jira/browse/SPARK-20100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20100: Assignee: Herman van Hovell (was: Apache Spark) > Consolidate SessionState construction > - > > Key: SPARK-20100 > URL: https://issues.apache.org/jira/browse/SPARK-20100 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell > > The current SessionState initialization path is quite complex. A part of the > creation is done in the SessionState companion objects, a part of the > creation is one inside the SessionState class, and a part is done by passing > functions. > The proposal is to consolidate the SessionState initialization into a builder > class. This SessionState will not do any initialization and just becomes a > place holder for the various Spark SQL internals. The advantages of this > approach are the following: > - SessionState initialization is less dispersed. The builder should be a one > stop shop. > - This provides us with a start for removing the HiveSessionState. Removing > the hive session state would also require us to move resource loading into a > separate class, and to (re)move metadata hive. > - It is easier to customize the Spark Session. You just need to create a > custom version of the builder. I will add hooks to make this easier. Opening > up these API's will happen at a later point. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20100) Consolidate SessionState construction
[ https://issues.apache.org/jira/browse/SPARK-20100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20100: Assignee: Apache Spark (was: Herman van Hovell) > Consolidate SessionState construction > - > > Key: SPARK-20100 > URL: https://issues.apache.org/jira/browse/SPARK-20100 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Herman van Hovell >Assignee: Apache Spark > > The current SessionState initialization path is quite complex. A part of the > creation is done in the SessionState companion objects, a part of the > creation is one inside the SessionState class, and a part is done by passing > functions. > The proposal is to consolidate the SessionState initialization into a builder > class. This SessionState will not do any initialization and just becomes a > place holder for the various Spark SQL internals. The advantages of this > approach are the following: > - SessionState initialization is less dispersed. The builder should be a one > stop shop. > - This provides us with a start for removing the HiveSessionState. Removing > the hive session state would also require us to move resource loading into a > separate class, and to (re)move metadata hive. > - It is easier to customize the Spark Session. You just need to create a > custom version of the builder. I will add hooks to make this easier. Opening > up these API's will happen at a later point. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20100) Consolidate SessionState construction
Herman van Hovell created SPARK-20100: - Summary: Consolidate SessionState construction Key: SPARK-20100 URL: https://issues.apache.org/jira/browse/SPARK-20100 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Herman van Hovell Assignee: Herman van Hovell The current SessionState initialization path is quite complex. A part of the creation is done in the SessionState companion objects, a part of the creation is one inside the SessionState class, and a part is done by passing functions. The proposal is to consolidate the SessionState initialization into a builder class. This SessionState will not do any initialization and just becomes a place holder for the various Spark SQL internals. The advantages of this approach are the following: - SessionState initialization is less dispersed. The builder should be a one stop shop. - This provides us with a start for removing the HiveSessionState. Removing the hive session state would also require us to move resource loading into a separate class, and to (re)move metadata hive. - It is easier to customize the Spark Session. You just need to create a custom version of the builder. I will add hooks to make this easier. Opening up these API's will happen at a later point. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20086) issue with pyspark 2.1.0 window function
[ https://issues.apache.org/jira/browse/SPARK-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20086: Assignee: (was: Apache Spark) > issue with pyspark 2.1.0 window function > > > Key: SPARK-20086 > URL: https://issues.apache.org/jira/browse/SPARK-20086 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: mandar uapdhye > > original post at > [stackoverflow | > http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function] > I get error when working with pyspark window function. here is some example > code: > {code:title=borderStyle=solid} > import pyspark > import pyspark.sql.functions as sf > import pyspark.sql.types as sparktypes > from pyspark.sql import window > > sc = pyspark.SparkContext() > sqlc = pyspark.SQLContext(sc) > rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)]) > df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"]) > df.show() > {code} > gives: > | x|AmtPaid| > | 1|2.0| > | 1|3.0| > | 1|1.0| > | 1| -2.0| > | 1| -1.0| > next, compute cumulative sum > {code:title=test.py|borderStyle=solid} > win_spec_max = (window.Window > .partitionBy(['x']) > .rowsBetween(window.Window.unboundedPreceding, 0))) > df = df.withColumn('AmtPaidCumSum', >sf.sum(sf.col('AmtPaid')).over(win_spec_max)) > df.show() > {code} > gives, > | x|AmtPaid|AmtPaidCumSum| > | 1|2.0| 2.0| > | 1|3.0| 5.0| > | 1|1.0| 6.0| > | 1| -2.0| 4.0| > | 1| -1.0| 3.0| > next, compute cumulative max, > {code} > df = df.withColumn('AmtPaidCumSumMax', > sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max)) > df.show() > {code} > gives error log > {noformat} > Py4JJavaError: An error occurred while calling o2609.showString. > with traceback: > Py4JJavaErrorTraceback (most recent call last) > in () > > 1 df.show() > /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in > show(self, n, truncate) > 316 """ > 317 if isinstance(truncate, bool) and truncate: > --> 318 print(self._jdf.showString(n, 20)) > 319 else: > 320 print(self._jdf.showString(n, int(truncate))) > > /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc > in __call__(self, *args) >1131 answer = self.gateway_client.send_command(command) >1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, > self.name) >1134 >1135 for temp_arg in temp_args: > /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in > deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc > in get_return_value(answer, gateway_client, target_id, name) > 317 raise Py4JJavaError( > 318 "An error occurred while calling > {0}{1}{2}.\n". > --> 319 format(target_id, ".", name), value) > 320 else: > 321 raise Py4JError( > {noformat} > but interestingly enough, if i introduce another change before sencond window > operation, say inserting a column then it does not give that error: > {code} > df = df.withColumn('MaxBound', sf.lit(6.)) > df.show() > {code} > | x|AmtPaid|AmtPaidCumSum|MaxBound| > | 1|2.0| 2.0| 6.0| > | 1|3.0| 5.0| 6.0| > | 1|1.0| 6.0| 6.0| > | 1| -2.0| 4.0| 6.0| > | 1| -1.0| 3.0| 6.0| > {code} > #then apply the second window operations > df = df.withColumn('AmtPaidCumSumMax', > sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max)) > df.show() > {code} > | x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax| > | 1|2.0| 2.0| 6.0| 2.0| > | 1|3.0| 5.0| 6.0| 5.0| > | 1|1.0| 6.0| 6.0| 6.0| > | 1| -2.0| 4.0| 6.0| 6.0| > | 1| -1.0| 3.0| 6.0| 6.0| > I do not understand this behaviour > well, so far so good, but then I try another operation then again get similar > error: > {code} >
[jira] [Assigned] (SPARK-20086) issue with pyspark 2.1.0 window function
[ https://issues.apache.org/jira/browse/SPARK-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20086: Assignee: Apache Spark > issue with pyspark 2.1.0 window function > > > Key: SPARK-20086 > URL: https://issues.apache.org/jira/browse/SPARK-20086 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: mandar uapdhye >Assignee: Apache Spark > > original post at > [stackoverflow | > http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function] > I get error when working with pyspark window function. here is some example > code: > {code:title=borderStyle=solid} > import pyspark > import pyspark.sql.functions as sf > import pyspark.sql.types as sparktypes > from pyspark.sql import window > > sc = pyspark.SparkContext() > sqlc = pyspark.SQLContext(sc) > rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)]) > df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"]) > df.show() > {code} > gives: > | x|AmtPaid| > | 1|2.0| > | 1|3.0| > | 1|1.0| > | 1| -2.0| > | 1| -1.0| > next, compute cumulative sum > {code:title=test.py|borderStyle=solid} > win_spec_max = (window.Window > .partitionBy(['x']) > .rowsBetween(window.Window.unboundedPreceding, 0))) > df = df.withColumn('AmtPaidCumSum', >sf.sum(sf.col('AmtPaid')).over(win_spec_max)) > df.show() > {code} > gives, > | x|AmtPaid|AmtPaidCumSum| > | 1|2.0| 2.0| > | 1|3.0| 5.0| > | 1|1.0| 6.0| > | 1| -2.0| 4.0| > | 1| -1.0| 3.0| > next, compute cumulative max, > {code} > df = df.withColumn('AmtPaidCumSumMax', > sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max)) > df.show() > {code} > gives error log > {noformat} > Py4JJavaError: An error occurred while calling o2609.showString. > with traceback: > Py4JJavaErrorTraceback (most recent call last) > in () > > 1 df.show() > /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in > show(self, n, truncate) > 316 """ > 317 if isinstance(truncate, bool) and truncate: > --> 318 print(self._jdf.showString(n, 20)) > 319 else: > 320 print(self._jdf.showString(n, int(truncate))) > > /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc > in __call__(self, *args) >1131 answer = self.gateway_client.send_command(command) >1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, > self.name) >1134 >1135 for temp_arg in temp_args: > /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in > deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc > in get_return_value(answer, gateway_client, target_id, name) > 317 raise Py4JJavaError( > 318 "An error occurred while calling > {0}{1}{2}.\n". > --> 319 format(target_id, ".", name), value) > 320 else: > 321 raise Py4JError( > {noformat} > but interestingly enough, if i introduce another change before sencond window > operation, say inserting a column then it does not give that error: > {code} > df = df.withColumn('MaxBound', sf.lit(6.)) > df.show() > {code} > | x|AmtPaid|AmtPaidCumSum|MaxBound| > | 1|2.0| 2.0| 6.0| > | 1|3.0| 5.0| 6.0| > | 1|1.0| 6.0| 6.0| > | 1| -2.0| 4.0| 6.0| > | 1| -1.0| 3.0| 6.0| > {code} > #then apply the second window operations > df = df.withColumn('AmtPaidCumSumMax', > sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max)) > df.show() > {code} > | x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax| > | 1|2.0| 2.0| 6.0| 2.0| > | 1|3.0| 5.0| 6.0| 5.0| > | 1|1.0| 6.0| 6.0| 6.0| > | 1| -2.0| 4.0| 6.0| 6.0| > | 1| -1.0| 3.0| 6.0| 6.0| > I do not understand this behaviour > well, so far so good, but then I try another operation then again get
[jira] [Commented] (SPARK-20086) issue with pyspark 2.1.0 window function
[ https://issues.apache.org/jira/browse/SPARK-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942190#comment-15942190 ] Apache Spark commented on SPARK-20086: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/17432 > issue with pyspark 2.1.0 window function > > > Key: SPARK-20086 > URL: https://issues.apache.org/jira/browse/SPARK-20086 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: mandar uapdhye > > original post at > [stackoverflow | > http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function] > I get error when working with pyspark window function. here is some example > code: > {code:title=borderStyle=solid} > import pyspark > import pyspark.sql.functions as sf > import pyspark.sql.types as sparktypes > from pyspark.sql import window > > sc = pyspark.SparkContext() > sqlc = pyspark.SQLContext(sc) > rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)]) > df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"]) > df.show() > {code} > gives: > | x|AmtPaid| > | 1|2.0| > | 1|3.0| > | 1|1.0| > | 1| -2.0| > | 1| -1.0| > next, compute cumulative sum > {code:title=test.py|borderStyle=solid} > win_spec_max = (window.Window > .partitionBy(['x']) > .rowsBetween(window.Window.unboundedPreceding, 0))) > df = df.withColumn('AmtPaidCumSum', >sf.sum(sf.col('AmtPaid')).over(win_spec_max)) > df.show() > {code} > gives, > | x|AmtPaid|AmtPaidCumSum| > | 1|2.0| 2.0| > | 1|3.0| 5.0| > | 1|1.0| 6.0| > | 1| -2.0| 4.0| > | 1| -1.0| 3.0| > next, compute cumulative max, > {code} > df = df.withColumn('AmtPaidCumSumMax', > sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max)) > df.show() > {code} > gives error log > {noformat} > Py4JJavaError: An error occurred while calling o2609.showString. > with traceback: > Py4JJavaErrorTraceback (most recent call last) > in () > > 1 df.show() > /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in > show(self, n, truncate) > 316 """ > 317 if isinstance(truncate, bool) and truncate: > --> 318 print(self._jdf.showString(n, 20)) > 319 else: > 320 print(self._jdf.showString(n, int(truncate))) > > /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc > in __call__(self, *args) >1131 answer = self.gateway_client.send_command(command) >1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, > self.name) >1134 >1135 for temp_arg in temp_args: > /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in > deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc > in get_return_value(answer, gateway_client, target_id, name) > 317 raise Py4JJavaError( > 318 "An error occurred while calling > {0}{1}{2}.\n". > --> 319 format(target_id, ".", name), value) > 320 else: > 321 raise Py4JError( > {noformat} > but interestingly enough, if i introduce another change before sencond window > operation, say inserting a column then it does not give that error: > {code} > df = df.withColumn('MaxBound', sf.lit(6.)) > df.show() > {code} > | x|AmtPaid|AmtPaidCumSum|MaxBound| > | 1|2.0| 2.0| 6.0| > | 1|3.0| 5.0| 6.0| > | 1|1.0| 6.0| 6.0| > | 1| -2.0| 4.0| 6.0| > | 1| -1.0| 3.0| 6.0| > {code} > #then apply the second window operations > df = df.withColumn('AmtPaidCumSumMax', > sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max)) > df.show() > {code} > | x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax| > | 1|2.0| 2.0| 6.0| 2.0| > | 1|3.0| 5.0| 6.0| 5.0| > | 1|1.0| 6.0| 6.0| 6.0| > | 1| -2.0| 4.0| 6.0| 6.0| > | 1| -1.0| 3.0| 6.0| 6.0| > I do not understand this behaviour
[jira] [Resolved] (SPARK-20046) Facilitate loop optimizations in a JIT compiler regarding sqlContext.read.parquet()
[ https://issues.apache.org/jira/browse/SPARK-20046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-20046. --- Resolution: Fixed Assignee: Kazuaki Ishizaki Fix Version/s: 2.2.0 > Facilitate loop optimizations in a JIT compiler regarding > sqlContext.read.parquet() > --- > > Key: SPARK-20046 > URL: https://issues.apache.org/jira/browse/SPARK-20046 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki > Fix For: 2.2.0 > > > [This > article|https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html] > suggests that better generated code can improve performance by facilitating > compiler optimizations. > This JIRA changes the generated code for {{sqlContext.read.parquet("file")}} > to facilitate loop optimizations in a JIT compiler for achieving better > performance. In particular, [this stackoverflow > entry|http://stackoverflow.com/questions/40629435/fast-parquet-row-count-in-spark] > suggests me to improve performance of > {{sqlContext.read.parquet("file").count}}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20092) Trigger AppVeyor R tests for changes in Scala code related with R API
[ https://issues.apache.org/jira/browse/SPARK-20092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-20092. -- Resolution: Fixed Assignee: Hyukjin Kwon Fix Version/s: 2.2.0 Target Version/s: 2.2.0 > Trigger AppVeyor R tests for changes in Scala code related with R API > - > > Key: SPARK-20092 > URL: https://issues.apache.org/jira/browse/SPARK-20092 > Project: Spark > Issue Type: Improvement > Components: Project Infra, SparkR >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.2.0 > > > We are currently detecting the changes in {{./R}} directory and then trigger > AppVeyor tests. > It seems we need to tests when there are some changes in > {{./core/src/main/scala/org/apache/spark/r}} and > {{./sql/core/src/main/scala/org/apache/spark/sql/api/r}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org