[jira] [Resolved] (SPARK-20035) Spark 2.0.2 writes empty file if no record is in the dataset

2017-03-26 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20035.
--
Resolution: Duplicate

I am pretty sure that this is a duplicate of SPARK-15473. Let me resolve this 
as a duplicate. Please reopen this if anyone feels the JIRAs are separate and 
the fixes will be separate too.

> Spark 2.0.2 writes empty file if no record is in the dataset
> 
>
> Key: SPARK-20035
> URL: https://issues.apache.org/jira/browse/SPARK-20035
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.2
> Environment: Spark 2.0.2
> Linux/Windows
>Reporter: Andrew
>
> When there is no record in a dataset, writing it with the spark-csv format 
> produces an empty file (i.e. one with no header line):
> ```
> dataset.write().format("com.databricks.spark.csv").option("header", 
> "true").save("... file name here ...");
> or 
> dataset.write().option("header", "true").csv("... file name here ...");
> ```
> The same file then cannot be read back using the same format (i.e. spark-csv), 
> since it is empty, as shown below. The same call works if the dataset has at 
> least one record.
> ```
> sqlCtx.read().format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load("... file name here ...");
> or 
> sparkSession.read().option("header", "true").option("inferSchema", 
> "true").csv("... file name here ...");
> ```
> This is not right; you should always be able to read back a file that you 
> just wrote.
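As a hedged aside, one defensive pattern for callers hitting this is to guard the write so an empty dataset is never persisted as a header-less file. This is only an illustrative Scala sketch, not the fix tracked by either JIRA; the helper name and the `outputPath` parameter are made up for the example.
```
import org.apache.spark.sql.{Dataset, Row}

// Hypothetical helper: write CSV only when there is at least one row.
def writeCsvIfNonEmpty(dataset: Dataset[Row], outputPath: String): Unit = {
  if (dataset.head(1).nonEmpty) {
    // At least one row exists, so a header line is emitted alongside the data.
    dataset.write.option("header", "true").csv(outputPath)
  }
  // Otherwise skip the write: an empty, header-less CSV cannot be read back.
}
```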



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15473) CSV fails to write and read back empty dataframe

2017-03-26 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942609#comment-15942609
 ] 

Hyukjin Kwon commented on SPARK-15473:
--

I am not working on this. Please take it over if you are willing to.

> CSV fails to write and read back empty dataframe
> 
>
> Key: SPARK-15473
> URL: https://issues.apache.org/jira/browse/SPARK-15473
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently, the CSV data source fails to write and then read back empty data.
> The code below:
> {code}
> val emptyDf = spark.range(10).filter(_ => false)
> emptyDf.write
>   .format("csv")
>   .save(path.getCanonicalPath)
> val copyEmptyDf = spark.read
>   .format("csv")
>   .load(path.getCanonicalPath)
> copyEmptyDf.show()
> {code}
> throws an exception below:
> {code}
> Can not create a Path from an empty string
> java.lang.IllegalArgumentException: Can not create a Path from an empty string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:135)
>   at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:241)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:987)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:987)
>   at 
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:178)
>   at 
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:178)
>   at scala.Option.map(Option.scala:146)
> {code}
> Note that this is a different case from the one with the data below:
> {code}
> val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
> {code}
> In this case, no writer is initialised or created (there are no calls to 
> {{WriterContainer.writeRows()}}).
> Maybe it should be possible to write and read back the header for the schema 
> even when the data is empty.
> For Parquet and JSON this works, but for CSV it does not.
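For context, here is a minimal sketch of the contrasting case mentioned above, spelled out with an assumed single-column schema (the actual {{schema}} in the report is not shown); {{spark}} is the usual shell session.
{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Assumed schema, for illustration only.
val schema = StructType(Seq(StructField("id", LongType, nullable = false)))

// An explicitly-typed empty DataFrame: no partition produces rows, so no
// writer task runs and no output file is created at all.
val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
{code}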



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20105) Add tests for checkType and type string in structField in R

2017-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20105:


Assignee: Apache Spark

> Add tests for checkType and type string in structField in R
> ---
>
> Key: SPARK-20105
> URL: https://issues.apache.org/jira/browse/SPARK-20105
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> It seems {{checkType}} and the type string in {{structField}} are not being 
> tested closely.
> This string format currently seems R-specific, so it would be better to test 
> both positive and negative cases on the R side.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20105) Add tests for checkType and type string in structField in R

2017-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20105:


Assignee: (was: Apache Spark)

> Add tests for checkType and type string in structField in R
> ---
>
> Key: SPARK-20105
> URL: https://issues.apache.org/jira/browse/SPARK-20105
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> It seems {{checkType}} and the type string in {{structField}} are not being 
> tested closely.
> This string format currently seems R-specific, so it would be better to test 
> both positive and negative cases on the R side.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20105) Add tests for checkType and type string in structField in R

2017-03-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942587#comment-15942587
 ] 

Apache Spark commented on SPARK-20105:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/17439

> Add tests for checkType and type string in structField in R
> ---
>
> Key: SPARK-20105
> URL: https://issues.apache.org/jira/browse/SPARK-20105
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> It seems {{checkType}} and the type string in {{structField}} are not being 
> tested closely.
> This string format currently seems R-specific, so it would be better to test 
> both positive and negative cases on the R side.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20104) Don't estimate IsNull or IsNotNull predicates for non-leaf node

2017-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20104:


Assignee: (was: Apache Spark)

> Don't estimate IsNull or IsNotNull predicates for non-leaf node
> ---
>
> Key: SPARK-20104
> URL: https://issues.apache.org/jira/browse/SPARK-20104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
>
> At the current stage, we don't have advanced statistics such as sketches or 
> histograms. As a result, some operators can't estimate `nullCount` accurately; 
> e.g. left outer join estimation does not accurately update `nullCount` today. 
> So for IsNull and IsNotNull predicates, we only estimate them when the child 
> is a leaf node, whose `nullCount` is accurate.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20104) Don't estimate IsNull or IsNotNull predicates for non-leaf node

2017-03-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942583#comment-15942583
 ] 

Apache Spark commented on SPARK-20104:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/17438

> Don't estimate IsNull or IsNotNull predicates for non-leaf node
> ---
>
> Key: SPARK-20104
> URL: https://issues.apache.org/jira/browse/SPARK-20104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
>
> At the current stage, we don't have advanced statistics such as sketches or 
> histograms. As a result, some operators can't estimate `nullCount` accurately; 
> e.g. left outer join estimation does not accurately update `nullCount` today. 
> So for IsNull and IsNotNull predicates, we only estimate them when the child 
> is a leaf node, whose `nullCount` is accurate.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20104) Don't estimate IsNull or IsNotNull predicates for non-leaf node

2017-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20104:


Assignee: Apache Spark

> Don't estimate IsNull or IsNotNull predicates for non-leaf node
> ---
>
> Key: SPARK-20104
> URL: https://issues.apache.org/jira/browse/SPARK-20104
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
>Assignee: Apache Spark
>
> At the current stage, we don't have advanced statistics such as sketches or 
> histograms. As a result, some operators can't estimate `nullCount` accurately; 
> e.g. left outer join estimation does not accurately update `nullCount` today. 
> So for IsNull and IsNotNull predicates, we only estimate them when the child 
> is a leaf node, whose `nullCount` is accurate.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20105) Add tests for checkType and type string in structField in R

2017-03-26 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-20105:


 Summary: Add tests for checkType and type string in structField in 
R
 Key: SPARK-20105
 URL: https://issues.apache.org/jira/browse/SPARK-20105
 Project: Spark
  Issue Type: Test
  Components: SparkR
Affects Versions: 2.2.0
Reporter: Hyukjin Kwon
Priority: Minor


It seems {{checkType}} and the type string in {{structField}} are not being 
tested closely.

This string format currently seems R-specific, so it would be better to test 
both positive and negative cases on the R side.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20104) Don't estimate IsNull or IsNotNull predicates for non-leaf node

2017-03-26 Thread Zhenhua Wang (JIRA)
Zhenhua Wang created SPARK-20104:


 Summary: Don't estimate IsNull or IsNotNull predicates for 
non-leaf node
 Key: SPARK-20104
 URL: https://issues.apache.org/jira/browse/SPARK-20104
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Zhenhua Wang


At the current stage, we don't have advanced statistics such as sketches or 
histograms. As a result, some operators can't estimate `nullCount` accurately; 
e.g. left outer join estimation does not accurately update `nullCount` today. 
So for IsNull and IsNotNull predicates, we only estimate them when the child 
is a leaf node, whose `nullCount` is accurate.
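As an illustrative aside (plain Scala, not Spark's actual Catalyst estimation code), the estimate in question reduces to a simple ratio once `nullCount` for the column can be trusted:
{code}
// Names here are made up for the example and do not mirror Spark internals.
def isNullSelectivity(nullCount: Long, rowCount: Long): Double =
  if (rowCount == 0) 0.0 else nullCount.toDouble / rowCount

def isNotNullSelectivity(nullCount: Long, rowCount: Long): Double =
  1.0 - isNullSelectivity(nullCount, rowCount)

// Example: 20 nulls out of 100 rows => IS NULL keeps ~20 rows, IS NOT NULL ~80.
val estimatedIsNullRows = (isNullSelectivity(20, 100) * 100).toLong
{code}
The point of the proposal is that this ratio is only trustworthy when the child is a leaf node; after operators such as left outer join, `nullCount` may be stale.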



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20103) Spark structured streaming from kafka - last message processed again after resume from checkpoint

2017-03-26 Thread Rajesh Mutha (JIRA)
Rajesh Mutha created SPARK-20103:


 Summary: Spark structured streaming from kafka - last message 
processed again after resume from checkpoint
 Key: SPARK-20103
 URL: https://issues.apache.org/jira/browse/SPARK-20103
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
 Environment: Linux, Spark 2.1.0
Reporter: Rajesh Mutha


When the application restarts after a failure or a graceful shutdown, it 
consistently reprocesses the last message of the previous batch, even though 
that message was already processed successfully.

We make database writes idempotent using a Postgres 9.6 feature. Is this the 
default behavior of Spark? I added a code snippet with two streaming queries. 
One of the queries is idempotent; since query2 is not, we see duplicate entries 
in its table.

---
import java.sql.DriverManager

import org.apache.spark.sql.{ForeachWriter, SparkSession}
import org.apache.spark.sql.streaming.ProcessingTime

object StructuredStreaming {
  def main(args: Array[String]): Unit = {
    val db_url = "jdbc:postgresql://dynamic-milestone-dev.crv1otzbekk9.us-east-1.rds.amazonaws.com:5432/DYNAMICPOSTGRES?user=dsdbadmin=password"
    val spark = SparkSession
      .builder
      .appName("StructuredKafkaReader")
      .master("local[*]")
      .getOrCreate()
    spark.conf.set("spark.sql.streaming.checkpointLocation", "/tmp/checkpoint_research/")
    import spark.implicits._

    val server = "10.205.82.113:9092"
    val topic = "checkpoint"
    val subscribeType = "subscribe"
    val lines = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", server)
      .option(subscribeType, topic)
      .load().selectExpr("CAST(value AS STRING)").as[String]
    lines.printSchema()

    // First sink: inserts each value into PUBLIC.checkpoint1
    val writer = new ForeachWriter[String] {
      def open(partitionId: Long, version: Long): Boolean = {
        println("After db props"); true
      }
      def process(value: String): Unit = {
        val conn = DriverManager.getConnection(db_url)
        try {
          conn.createStatement().executeUpdate(
            "INSERT INTO PUBLIC.checkpoint1 VALUES ('" + value + "')")
        } finally {
          conn.close()
        }
      }
      def close(errorOrNull: Throwable): Unit = {}
    }

    import scala.concurrent.duration._
    val query1 = lines.writeStream
      .outputMode("append")
      .queryName("checkpoint1")
      .trigger(ProcessingTime(30.seconds))
      .foreach(writer)
      .start()

    // Second sink: inserts each value into PUBLIC.checkpoint2
    val writer2 = new ForeachWriter[String] {
      def open(partitionId: Long, version: Long): Boolean = {
        println("After db props"); true
      }
      def process(value: String): Unit = {
        val conn = DriverManager.getConnection(db_url)
        try {
          conn.createStatement().executeUpdate(
            "INSERT INTO PUBLIC.checkpoint2 VALUES ('" + value + "')")
        } finally {
          conn.close()
        }
      }
      def close(errorOrNull: Throwable): Unit = {}
    }

    val query2 = lines.writeStream
      .outputMode("append")
      .queryName("checkpoint2")
      .trigger(ProcessingTime(30.seconds))
      .foreach(writer2)
      .start()

    query2.awaitTermination()
    query1.awaitTermination()
  }
}
---
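One way to make the second sink idempotent on the database side is presumably the Postgres ON CONFLICT clause (the "9.6 feature" mentioned above). The sketch below is not the reporter's code: it assumes PUBLIC.checkpoint2 has a single column named value with a UNIQUE constraint, so redelivered messages collide instead of duplicating rows.
{code}
import java.sql.DriverManager

// Hypothetical helper; the column name and the UNIQUE constraint are assumptions.
def idempotentInsert(dbUrl: String, value: String): Unit = {
  val conn = DriverManager.getConnection(dbUrl)
  try {
    val stmt = conn.prepareStatement(
      "INSERT INTO PUBLIC.checkpoint2 (value) VALUES (?) ON CONFLICT DO NOTHING")
    try {
      stmt.setString(1, value)   // parameterised, unlike the concatenation above
      stmt.executeUpdate()
    } finally {
      stmt.close()
    }
  } finally {
    conn.close()
  }
}
{code}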



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6567) Large linear model parallelism via a join and reduceByKey

2017-03-26 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942498#comment-15942498
 ] 

Joseph K. Bradley commented on SPARK-6567:
--

Linking [SPARK-10078], which tracks adding vector-free L-BFGS.

> Large linear model parallelism via a join and reduceByKey
> -
>
> Key: SPARK-6567
> URL: https://issues.apache.org/jira/browse/SPARK-6567
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Reza Zadeh
> Attachments: model-parallelism.pptx
>
>
> To train a linear model, each training point in the training set needs its 
> dot product computed against the model, per iteration. If the model is large 
> (too large to fit in memory on a single machine), SPARK-4590 proposes using a 
> parameter server.
> There is an easier way to achieve this without parameter servers. In 
> particular, if the data is held as a BlockMatrix and the model as an RDD, 
> then each block can be joined with the relevant part of the model, followed 
> by a reduceByKey to compute the dot products.
> This obviates the need for a parameter server, at least for linear models. 
> However, it's unclear how it compares performance-wise to parameter servers.
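A rough sketch of the join-plus-reduceByKey idea, with the data blocked by (row block, column block) and the model keyed by column block; the names and the dense block layout are assumptions for illustration, not the attached proposal.
{code}
import org.apache.spark.rdd.RDD

// dataBlocks:  ((rowBlockId, colBlockId), dense block of shape rows x cols)
// modelBlocks: (colBlockId, model slice of length cols)
// Returns per-row-block dot products, summed across column blocks.
def blockDotProducts(
    dataBlocks: RDD[((Int, Int), Array[Array[Double]])],
    modelBlocks: RDD[(Int, Array[Double])]): RDD[(Int, Array[Double])] = {
  dataBlocks
    .map { case ((rowBlock, colBlock), block) => (colBlock, (rowBlock, block)) }
    .join(modelBlocks)  // each data block meets only the model slice it needs
    .map { case (_, ((rowBlock, block), slice)) =>
      val partial = block.map(row => row.zip(slice).map { case (x, w) => x * w }.sum)
      (rowBlock, partial)
    }
    .reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y })  // sum partial dots
}
{code}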



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19281) spark.ml Python API for FPGrowth

2017-03-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-19281.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17218
[https://github.com/apache/spark/pull/17218]

> spark.ml Python API for FPGrowth
> 
>
> Key: SPARK-19281
> URL: https://issues.apache.org/jira/browse/SPARK-19281
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Maciej Szymkiewicz
> Fix For: 2.2.0
>
>
> See parent issue.  This is for a Python API *after* the Scala API has been 
> designed and implemented.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19281) spark.ml Python API for FPGrowth

2017-03-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-19281:
-

Assignee: Maciej Szymkiewicz

> spark.ml Python API for FPGrowth
> 
>
> Key: SPARK-19281
> URL: https://issues.apache.org/jira/browse/SPARK-19281
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Maciej Szymkiewicz
>
> See parent issue.  This is for a Python API *after* the Scala API has been 
> designed and implemented.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19281) spark.ml Python API for FPGrowth

2017-03-26 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19281:
--
Shepherd: Joseph K. Bradley  (was: Nick Pentreath)

> spark.ml Python API for FPGrowth
> 
>
> Key: SPARK-19281
> URL: https://issues.apache.org/jira/browse/SPARK-19281
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> See parent issue.  This is for a Python API *after* the Scala API has been 
> designed and implemented.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20072) Clarify ALS-WR documentation

2017-03-26 Thread chris snow (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942473#comment-15942473
 ] 

chris snow commented on SPARK-20072:


Will do.  Thanks Sean.

> Clarify ALS-WR documentation
> 
>
> Key: SPARK-20072
> URL: https://issues.apache.org/jira/browse/SPARK-20072
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: chris snow
>Priority: Trivial
>
> https://www.mail-archive.com/user@spark.apache.org/msg62590.html
> The documentation for collaborative filtering is as follows:
> ===
> Scaling of the regularization parameter
> Since v1.1, we scale the regularization parameter lambda in solving
> each least squares problem by the number of ratings the user generated
> in updating user factors, or the number of ratings the product
> received in updating product factors.
> ===
> I find this description confusing, probably because I lack a detailed
> understanding of ALS. The wording suggests that the number of ratings 
> changes ("generated", "received") while solving the least squares problem.
> This is how I think I should be interpreting the description:
> ===
> Since v1.1, we scale the regularization parameter lambda when solving
> each least squares problem.  When updating the user factors, we scale
> the regularization parameter by the total number of ratings from the
> user.  Similarly, when updating the product factors, we scale the
> regularization parameter by the total number of ratings for the
> product.
> ===
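For what it's worth, under the proposed reading the scaling is a fixed per-update multiplication of lambda by a rating count taken once from the input data; a tiny illustrative computation (not MLlib code, numbers made up):
{code}
val lambda = 0.1
val numRatingsByUser = 25       // n_u: ratings the user generated
val numRatingsForProduct = 400  // n_p: ratings the product received

val userRegularization = lambda * numRatingsByUser        // when updating user factors
val productRegularization = lambda * numRatingsForProduct // when updating product factors
{code}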



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20102) Fix two minor build script issues blocking 2.1.1 RC + master snapshot builds

2017-03-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942454#comment-15942454
 ] 

Apache Spark commented on SPARK-20102:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/17437

> Fix two minor build script issues blocking 2.1.1 RC + master snapshot builds
> 
>
> Key: SPARK-20102
> URL: https://issues.apache.org/jira/browse/SPARK-20102
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.1
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The master snapshot publisher builds are currently broken due to two minor 
> build issues:
> 1. For unknown reasons, the LFTP {{mkdir -p}} command started throwing errors 
> when the remote FTP directory already exists. To work around this, we should 
> update the script to ignore errors.
> 2. The PySpark setup.py file references a non-existent module, causing Python 
> packaging to fail.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20102) Fix two minor build script issues blocking 2.1.1 RC + master snapshot builds

2017-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20102:


Assignee: Apache Spark  (was: Josh Rosen)

> Fix two minor build script issues blocking 2.1.1 RC + master snapshot builds
> 
>
> Key: SPARK-20102
> URL: https://issues.apache.org/jira/browse/SPARK-20102
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.1
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> The master snapshot publisher builds are currently broken due to two minor 
> build issues:
> 1. For unknown reasons, the LFTP {{mkdir -p}} command started throwing errors 
> when the remote FTP directory already exists. To work around this, we should 
> update the script to ignore errors.
> 2. The PySpark setup.py file references a non-existent module, causing Python 
> packaging to fail.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20102) Fix two minor build script issues blocking 2.1.1 RC + master snapshot builds

2017-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20102:


Assignee: Josh Rosen  (was: Apache Spark)

> Fix two minor build script issues blocking 2.1.1 RC + master snapshot builds
> 
>
> Key: SPARK-20102
> URL: https://issues.apache.org/jira/browse/SPARK-20102
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.1
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The master snapshot publisher builds are currently broken due to two minor 
> build issues:
> 1. For unknown reasons, the LFTP {{mkdir -p}} command started throwing errors 
> when the remote FTP directory already exists. To work around this, we should 
> update the script to ignore errors.
> 2. The PySpark setup.py file references a non-existent module, causing Python 
> packaging to fail.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20102) Fix two minor build script issues blocking 2.1.1 RC + master snapshot builds

2017-03-26 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-20102:
--

 Summary: Fix two minor build script issues blocking 2.1.1 RC + 
master snapshot builds
 Key: SPARK-20102
 URL: https://issues.apache.org/jira/browse/SPARK-20102
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.1.1
Reporter: Josh Rosen
Assignee: Josh Rosen


The master snapshot publisher builds are currently broken due to two minor 
build issues:

1. For unknown reasons, the LFTP {{mkdir -p}} command started throwing errors 
when the remote FTP directory already exists. To work around this, we should 
update the script to ignore errors.
2. The PySpark setup.py file references a non-existent module, causing Python 
packaging to fail.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20086) issue with pyspark 2.1.0 window function

2017-03-26 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell reassigned SPARK-20086:
-

Assignee: Herman van Hovell

> issue with pyspark 2.1.0 window function
> 
>
> Key: SPARK-20086
> URL: https://issues.apache.org/jira/browse/SPARK-20086
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: mandar uapdhye
>Assignee: Herman van Hovell
> Fix For: 2.1.1, 2.2.0
>
>
> original  post at
> [stackoverflow | 
> http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function]
> I get an error when working with the pyspark window function. Here is some 
> example code:
> {code:title=borderStyle=solid}
> import pyspark
> import pyspark.sql.functions as sf
> import pyspark.sql.types as sparktypes
> from pyspark.sql import window
> 
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)])
> df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"])
> df.show()
> {code}
> gives:
> |  x|AmtPaid|
> |  1|2.0|
> |  1|3.0|
> |  1|1.0|
> |  1|   -2.0|
> |  1|   -1.0|
> next, compute cumulative sum
> {code:title=test.py|borderStyle=solid}
> win_spec_max = (window.Window
> .partitionBy(['x'])
> .rowsBetween(window.Window.unboundedPreceding, 0))
> df = df.withColumn('AmtPaidCumSum',
>sf.sum(sf.col('AmtPaid')).over(win_spec_max))
> df.show()
> {code}
> gives,
> |  x|AmtPaid|AmtPaidCumSum|
> |  1|2.0|  2.0|
> |  1|3.0|  5.0|
> |  1|1.0|  6.0|
> |  1|   -2.0|  4.0|
> |  1|   -1.0|  3.0|
> next, compute cumulative max,
> {code}
> df = df.withColumn('AmtPaidCumSumMax', 
> sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))
> df.show()
> {code}
> gives error log
> {noformat}
>  Py4JJavaError: An error occurred while calling o2609.showString.
> with traceback:
> Py4JJavaErrorTraceback (most recent call last)
>  in ()
> > 1 df.show()
> /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in 
> show(self, n, truncate)
> 316 """
> 317 if isinstance(truncate, bool) and truncate:
> --> 318 print(self._jdf.showString(n, 20))
> 319 else:
> 320 print(self._jdf.showString(n, int(truncate)))
> 
> /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc 
> in __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, 
> self.name)
>1134 
>1135 for temp_arg in temp_args:
> /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in 
> deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc 
> in get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling 
> {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> {noformat}
> but interestingly enough, if I introduce another change before the second 
> window operation, say inserting a column, then it does not give that error:
> {code}
> df = df.withColumn('MaxBound', sf.lit(6.))
> df.show()
> {code}
> |  x|AmtPaid|AmtPaidCumSum|MaxBound|
> |  1|2.0|  2.0| 6.0|
> |  1|3.0|  5.0| 6.0|
> |  1|1.0|  6.0| 6.0|
> |  1|   -2.0|  4.0| 6.0|
> |  1|   -1.0|  3.0| 6.0|
> {code}
> #then apply the second window operations
> df = df.withColumn('AmtPaidCumSumMax', 
> sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))
> df.show()
> {code}
> |  x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax|
> |  1|2.0|  2.0| 6.0| 2.0|
> |  1|3.0|  5.0| 6.0| 5.0|
> |  1|1.0|  6.0| 6.0| 6.0|
> |  1|   -2.0|  4.0| 6.0| 6.0|
> |  1|   -1.0|  3.0| 6.0| 6.0|
> I do not understand this behaviour
> well, so far so 

[jira] [Resolved] (SPARK-20086) issue with pyspark 2.1.0 window function

2017-03-26 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-20086.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

> issue with pyspark 2.1.0 window function
> 
>
> Key: SPARK-20086
> URL: https://issues.apache.org/jira/browse/SPARK-20086
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: mandar uapdhye
>Assignee: Herman van Hovell
> Fix For: 2.1.1, 2.2.0
>
>
> original  post at
> [stackoverflow | 
> http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function]
> I get an error when working with the pyspark window function. Here is some 
> example code:
> {code:title=borderStyle=solid}
> import pyspark
> import pyspark.sql.functions as sf
> import pyspark.sql.types as sparktypes
> from pyspark.sql import window
> 
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)])
> df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"])
> df.show()
> {code}
> gives:
> |  x|AmtPaid|
> |  1|2.0|
> |  1|3.0|
> |  1|1.0|
> |  1|   -2.0|
> |  1|   -1.0|
> next, compute cumulative sum
> {code:title=test.py|borderStyle=solid}
> win_spec_max = (window.Window
> .partitionBy(['x'])
> .rowsBetween(window.Window.unboundedPreceding, 0))
> df = df.withColumn('AmtPaidCumSum',
>sf.sum(sf.col('AmtPaid')).over(win_spec_max))
> df.show()
> {code}
> gives,
> |  x|AmtPaid|AmtPaidCumSum|
> |  1|2.0|  2.0|
> |  1|3.0|  5.0|
> |  1|1.0|  6.0|
> |  1|   -2.0|  4.0|
> |  1|   -1.0|  3.0|
> next, compute cumulative max,
> {code}
> df = df.withColumn('AmtPaidCumSumMax', 
> sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))
> df.show()
> {code}
> gives error log
> {noformat}
>  Py4JJavaError: An error occurred while calling o2609.showString.
> with traceback:
> Py4JJavaErrorTraceback (most recent call last)
>  in ()
> > 1 df.show()
> /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in 
> show(self, n, truncate)
> 316 """
> 317 if isinstance(truncate, bool) and truncate:
> --> 318 print(self._jdf.showString(n, 20))
> 319 else:
> 320 print(self._jdf.showString(n, int(truncate)))
> 
> /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc 
> in __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, 
> self.name)
>1134 
>1135 for temp_arg in temp_args:
> /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in 
> deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc 
> in get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling 
> {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> {noformat}
> but interestingly enough, if I introduce another change before the second 
> window operation, say inserting a column, then it does not give that error:
> {code}
> df = df.withColumn('MaxBound', sf.lit(6.))
> df.show()
> {code}
> |  x|AmtPaid|AmtPaidCumSum|MaxBound|
> |  1|2.0|  2.0| 6.0|
> |  1|3.0|  5.0| 6.0|
> |  1|1.0|  6.0| 6.0|
> |  1|   -2.0|  4.0| 6.0|
> |  1|   -1.0|  3.0| 6.0|
> {code}
> #then apply the second window operations
> df = df.withColumn('AmtPaidCumSumMax', 
> sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))
> df.show()
> {code}
> |  x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax|
> |  1|2.0|  2.0| 6.0| 2.0|
> |  1|3.0|  5.0| 6.0| 5.0|
> |  1|1.0|  6.0| 6.0| 6.0|
> |  1|   -2.0|  4.0| 6.0| 6.0|
> |  1|   -1.0|  3.0| 6.0| 6.0|
> I do not 

[jira] [Commented] (SPARK-19476) Running threads in Spark DataFrame foreachPartition() causes NullPointerException

2017-03-26 Thread Gal Topper (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942431#comment-15942431
 ] 

Gal Topper commented on SPARK-19476:


The example in the description does indeed fully materialize the iterator to 
very simply reproduce the issue. To be clear, the real code I'm running doesn't 
do that :-)! Instead, it pulls items from the iterator on demand. The 
workaround I described basically makes sure that only the original executor 
thread ever calls next() on the iterator, which is still done on demand, not 
all-at-once.

In my own experience, using threads works perfectly fine with the exception of 
this issue, and I've never read anything in the docs to discourage users from 
doing so. +1 for a note though, if that's really not something the authors 
intended.
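
For readers landing here, a hedged sketch of the workaround shape described above (assumed structure, not the commenter's actual code): only the calling executor thread ever touches iterator.next(), and items are handed to worker threads through a bounded queue.
{code}
import java.util.concurrent.{ArrayBlockingQueue, Executors, TimeUnit}

def drainWithWorkers[T](iterator: Iterator[T], numWorkers: Int)(write: T => Unit): Unit = {
  val queue = new ArrayBlockingQueue[Option[T]](1024)
  val pool = Executors.newFixedThreadPool(numWorkers)
  (1 to numWorkers).foreach { _ =>
    pool.submit(new Runnable {
      def run(): Unit = {
        var next = queue.take()
        while (next.isDefined) {
          write(next.get)     // I/O happens on worker threads
          next = queue.take()
        }
      }
    })
  }
  iterator.foreach(item => queue.put(Some(item)))  // only this thread calls next()
  (1 to numWorkers).foreach(_ => queue.put(None))  // poison pills stop the workers
  pool.shutdown()
  pool.awaitTermination(Long.MaxValue, TimeUnit.NANOSECONDS)
}
{code}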

> Running threads in Spark DataFrame foreachPartition() causes 
> NullPointerException
> -
>
> Key: SPARK-19476
> URL: https://issues.apache.org/jira/browse/SPARK-19476
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Gal Topper
>Priority: Minor
>
> First reported on [Stack 
> overflow|http://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition].
> I use multiple threads inside foreachPartition(), which works great for me 
> except for when the underlying iterator is TungstenAggregationIterator. Here 
> is a minimal code snippet to reproduce:
> {code:title=Reproduce.scala|borderStyle=solid}
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent.duration.Duration
> import scala.concurrent.{Await, Future}
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.SQLContext
> object Reproduce extends App {
>   val sc = new SparkContext("local", "reproduce")
>   val sqlContext = new SQLContext(sc)
>   import sqlContext.implicits._
>   val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count()
>   df.foreachPartition { iterator =>
> val f = Future(iterator.toVector)
> Await.result(f, Duration.Inf)
>   }
> }
> {code}
> When I run this, I get:
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> {noformat}
> I believe I actually understand why this happens - 
> TungstenAggregationIterator uses a ThreadLocal variable that returns null 
> when called from a thread other than the original thread that got the 
> iterator from Spark. From examining the code, this does not appear to differ 
> between recent Spark versions.
> However, this limitation is specific to TungstenAggregationIterator, and not 
> documented, as far as I'm aware.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19476) Running threads in Spark DataFrame foreachPartition() causes NullPointerException

2017-03-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19476:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

You certainly access the iterator from Spark, right? That's what I mean. You 
might just copy it off into a data structure with no relation to the iterator 
implementation; that, of course, may not be feasible.

I think this holds across the whole API: you're not intended to create your own 
threads. It could well work in many cases, but I don't think it's guaranteed 
to. Sure, maybe worth a note.

> Running threads in Spark DataFrame foreachPartition() causes 
> NullPointerException
> -
>
> Key: SPARK-19476
> URL: https://issues.apache.org/jira/browse/SPARK-19476
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Gal Topper
>Priority: Minor
>
> First reported on [Stack 
> overflow|http://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition].
> I use multiple threads inside foreachPartition(), which works great for me 
> except for when the underlying iterator is TungstenAggregationIterator. Here 
> is a minimal code snippet to reproduce:
> {code:title=Reproduce.scala|borderStyle=solid}
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent.duration.Duration
> import scala.concurrent.{Await, Future}
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.SQLContext
> object Reproduce extends App {
>   val sc = new SparkContext("local", "reproduce")
>   val sqlContext = new SQLContext(sc)
>   import sqlContext.implicits._
>   val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count()
>   df.foreachPartition { iterator =>
> val f = Future(iterator.toVector)
> Await.result(f, Duration.Inf)
>   }
> }
> {code}
> When I run this, I get:
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> {noformat}
> I believe I actually understand why this happens - 
> TungstenAggregationIterator uses a ThreadLocal variable that returns null 
> when called from a thread other than the original thread that got the 
> iterator from Spark. From examining the code, this does not appear to differ 
> between recent Spark versions.
> However, this limitation is specific to TungstenAggregationIterator, and not 
> documented, as far as I'm aware.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19476) Running threads in Spark DataFrame foreachPartition() causes NullPointerException

2017-03-26 Thread Gal Topper (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942419#comment-15942419
 ] 

Gal Topper commented on SPARK-19476:


I'm pretty sure we're talking about different things. My code running inside 
foreachPartition naturally doesn't need any Spark internals. It just takes the 
data and writes it to a database, and that writing process happens not to be 
single-threaded. It doesn't copy any data, and it works pretty well using the 
(far from trivial) workaround described and provided above.

If this thread local is too entrenched to feasibly fix the issue, I would at 
least suggest that this limitation be documented (e.g. "@param iterator may 
only be accessed by the original executor thread", and/or otherwise in the 
docs). That's what I'd do, anyway.

> Running threads in Spark DataFrame foreachPartition() causes 
> NullPointerException
> -
>
> Key: SPARK-19476
> URL: https://issues.apache.org/jira/browse/SPARK-19476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Gal Topper
>
> First reported on [Stack 
> overflow|http://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition].
> I use multiple threads inside foreachPartition(), which works great for me 
> except for when the underlying iterator is TungstenAggregationIterator. Here 
> is a minimal code snippet to reproduce:
> {code:title=Reproduce.scala|borderStyle=solid}
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent.duration.Duration
> import scala.concurrent.{Await, Future}
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.SQLContext
> object Reproduce extends App {
>   val sc = new SparkContext("local", "reproduce")
>   val sqlContext = new SQLContext(sc)
>   import sqlContext.implicits._
>   val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count()
>   df.foreachPartition { iterator =>
> val f = Future(iterator.toVector)
> Await.result(f, Duration.Inf)
>   }
> }
> {code}
> When I run this, I get:
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> {noformat}
> I believe I actually understand why this happens - 
> TungstenAggregationIterator uses a ThreadLocal variable that returns null 
> when called from a thread other than the original thread that got the 
> iterator from Spark. From examining the code, this does not appear to differ 
> between recent Spark versions.
> However, this limitation is specific to TungstenAggregationIterator, and not 
> documented, as far as I'm aware.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14922) Alter Table Drop Partition Using Predicate-based Partition Spec

2017-03-26 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942413#comment-15942413
 ] 

Dongjoon Hyun edited comment on SPARK-14922 at 3/26/17 7:16 PM:


[~smilegator], [~hyukjin.kwon].
I will reopen this issue instead of SPARK-17732. This issue includes 
SPARK-17732. When I created SPARK-17732, I mistakenly didn't notice this one. 
SPARK-17732 tried to add operators like '<', but it eventually turned out that 
we also need to support *expressions for values* in predicates.


was (Author: dongjoon):
[~smilegator], [~hyukjin.kwon].
I will reopen this issue instead of SPARK-17732. This issue includes 
SPARK-17732 . When I created SPARK-17732, I mistakenly didn't notice this one. 
SPARK-17732 tried to add operators like '<', but eventually it turns out that 
we need to support *expressions* in predicates.

> Alter Table Drop Partition Using Predicate-based Partition Spec
> ---
>
> Key: SPARK-14922
> URL: https://issues.apache.org/jira/browse/SPARK-14922
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Below is allowed in Hive, but not allowed in Spark.
> {noformat}
> alter table ptestfilter drop partition (c='US', d<'2')
> {noformat}
> This example is copied from drop_partitions_filter.q



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17732) ALTER TABLE DROP PARTITION should support comparators

2017-03-26 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-17732.
-
Resolution: Duplicate

> ALTER TABLE DROP PARTITION should support comparators
> -
>
> Key: SPARK-17732
> URL: https://issues.apache.org/jira/browse/SPARK-17732
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>
> This issue aims to support `comparators`, e.g. '<', '<=', '>', '>=', again in 
> Apache Spark 2.0 for backward compatibility.
> *Spark 1.6.2*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = [result: string]
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> res1: org.apache.spark.sql.DataFrame = [result: string]
> {code}
> *Spark 2.0*
> {code}
> scala> sql("CREATE TABLE sales(id INT) PARTITIONED BY (country STRING, 
> quarter STRING)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> sql("ALTER TABLE sales DROP PARTITION (country < 'KR')")
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '<' expecting {')', ','}(line 1, pos 42)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-14922) Alter Table Drop Partition Using Predicate-based Partition Spec

2017-03-26 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-14922:
---

[~smilegator], [~hyukjin.kwon].
I will reopen this issue instead of SPARK-17732. This issue includes 
SPARK-17732. When I created SPARK-17732, I mistakenly didn't notice this one. 
SPARK-17732 tried to add operators like '<', but it eventually turned out that 
we need to support *expressions* in predicates.

> Alter Table Drop Partition Using Predicate-based Partition Spec
> ---
>
> Key: SPARK-14922
> URL: https://issues.apache.org/jira/browse/SPARK-14922
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Below is allowed in Hive, but not allowed in Spark.
> {noformat}
> alter table ptestfilter drop partition (c='US', d<'2')
> {noformat}
> This example is copied from drop_partitions_filter.q



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19476) Running threads in Spark DataFrame foreachPartition() causes NullPointerException

2017-03-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942405#comment-15942405
 ] 

Sean Owen commented on SPARK-19476:
---

You don't need an executor per I/O call, you need one slot. I don't believe 
what you're doing is generally going to work; you're building around Spark's 
execution model. You will probably need to rewrite to restrict the async 
processing to something that doesn't need to access these Spark objects 
somehow, or do something like copy the data off the iterator if that's 
feasible, to avoid this access.

> Running threads in Spark DataFrame foreachPartition() causes 
> NullPointerException
> -
>
> Key: SPARK-19476
> URL: https://issues.apache.org/jira/browse/SPARK-19476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Gal Topper
>
> First reported on [Stack 
> overflow|http://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition].
> I use multiple threads inside foreachPartition(), which works great for me 
> except for when the underlying iterator is TungstenAggregationIterator. Here 
> is a minimal code snippet to reproduce:
> {code:title=Reproduce.scala|borderStyle=solid}
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent.duration.Duration
> import scala.concurrent.{Await, Future}
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.SQLContext
> object Reproduce extends App {
>   val sc = new SparkContext("local", "reproduce")
>   val sqlContext = new SQLContext(sc)
>   import sqlContext.implicits._
>   val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count()
>   df.foreachPartition { iterator =>
> val f = Future(iterator.toVector)
> Await.result(f, Duration.Inf)
>   }
> }
> {code}
> When I run this, I get:
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> {noformat}
> I believe I actually understand why this happens - 
> TungstenAggregationIterator uses a ThreadLocal variable that returns null 
> when called from a thread other than the original thread that got the 
> iterator from Spark. From examining the code, this does not appear to differ 
> between recent Spark versions.
> However, this limitation is specific to TungstenAggregationIterator, and not 
> documented, as far as I'm aware.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19941) Spark should not schedule tasks on executors on decommissioning YARN nodes

2017-03-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19941.
---
Resolution: Won't Fix

> Spark should not schedule tasks on executors on decommissioning YARN nodes
> --
>
> Key: SPARK-19941
> URL: https://issues.apache.org/jira/browse/SPARK-19941
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, YARN
>Affects Versions: 2.1.0
> Environment: Hadoop 2.8.0-rc1
>Reporter: Karthik Palaniappan
>
> Hadoop 2.8 added a mechanism to gracefully decommission Node Managers in 
> YARN: https://issues.apache.org/jira/browse/YARN-914
> Essentially you can mark nodes to be decommissioned, and let them a) finish 
> work in progress and b) finish serving shuffle data. But no new work will be 
> scheduled on the node.
> Spark should respect when NMs are set to decommissioned, and similarly 
> decommission executors on those nodes by not scheduling any more tasks on 
> them.
> It looks like in the future YARN may inform the app master when containers 
> will be killed: https://issues.apache.org/jira/browse/YARN-3784. However, I 
> don't think Spark should schedule based on a timeout. We should gracefully 
> decommission the executor as fast as possible (which is the spirit of 
> YARN-914). The app master can query the RM for NM statuses (if it doesn't 
> already have them) and stop scheduling on executors on NMs that are 
> decommissioning.
> Stretch feature: The timeout may be useful in determining whether running 
> further tasks on the executor is even helpful. Spark may be able to tell that 
> shuffle data will not be consumed by the time the node is decommissioned, so 
> it is not worth computing. The executor can be killed immediately.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19977) Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from 5 ms to 30 seconds in Spark Streaming application

2017-03-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19977.
---
Resolution: Not A Problem

> Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from 
> 5 ms to 30 seconds in Spark Streaming application
> --
>
> Key: SPARK-19977
> URL: https://issues.apache.org/jira/browse/SPARK-19977
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Ray Qiu
>
> Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from 
> 5 ms to 30+ seconds in a Spark Streaming application, where multiple Kafka 
> direct streams are processed.  These kafka streams are processed separately 
> (not combined via union).  
> It causes the task processing time to increase greatly and eventually stops 
> working.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20037) impossible to set kafka offsets using kafka 0.10 and spark 2.0.0

2017-03-26 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942399#comment-15942399
 ] 

Sean Owen commented on SPARK-20037:
---

I'm not sure this coherently describes the problem in a way that's 
reproducible. It's not in general true that this fails or else nothing would 
work in Spark. I suspect your code is setting offsets incorrectly elsewhere.

> impossible to set kafka offsets using kafka 0.10 and spark 2.0.0
> 
>
> Key: SPARK-20037
> URL: https://issues.apache.org/jira/browse/SPARK-20037
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Daniel Nuriyev
> Attachments: offsets.png
>
>
> I use kafka 0.10.1 and java code with the following dependencies:
> <dependency>
>     <groupId>org.apache.kafka</groupId>
>     <artifactId>kafka_2.11</artifactId>
>     <version>0.10.1.1</version>
> </dependency>
> <dependency>
>     <groupId>org.apache.kafka</groupId>
>     <artifactId>kafka-clients</artifactId>
>     <version>0.10.1.1</version>
> </dependency>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-streaming_2.11</artifactId>
>     <version>2.0.0</version>
> </dependency>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
>     <version>2.0.0</version>
> </dependency>
> The code tries to read a topic starting from specific offsets. 
> The topic has 4 partitions that start somewhere before 585000 and end after 
> 674000, so I wanted to read all partitions starting at 585000:
> fromOffsets.put(new TopicPartition(topic, 0), 585000L);
> fromOffsets.put(new TopicPartition(topic, 1), 585000L);
> fromOffsets.put(new TopicPartition(topic, 2), 585000L);
> fromOffsets.put(new TopicPartition(topic, 3), 585000L);
> Using 5 second batches:
> jssc = new JavaStreamingContext(conf, Durations.seconds(5));
> The code immediately throws:
> Beginning offset 585000 is after the ending offset 584464 for topic 
> commerce_item_expectation partition 1
> This does not make sense, because 584464 is where this topic/partition starts, not where it ends.
> I use this as a base: 
> https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
> But I use direct stream:
> KafkaUtils.createDirectStream(jssc,LocationStrategies.PreferConsistent(),
> ConsumerStrategies.Subscribe(
> topics, kafkaParams, fromOffsets
> )
> )
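
For reference, a short sketch (not part of the original report) of checking each 
partition's valid offset range with the 0.10.1 consumer API before building the 
{{fromOffsets}} map; the broker address is an assumption and must be adjusted.

{code:title=CheckOffsets.scala|borderStyle=solid}
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")  // assumption: point this at the real brokers
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
val partitions = (0 until 4).map(new TopicPartition("commerce_item_expectation", _)).asJava

// beginningOffsets/endOffsets are available from kafka-clients 0.10.1 on.
val beginning = consumer.beginningOffsets(partitions).asScala
val end = consumer.endOffsets(partitions).asScala
beginning.foreach { case (tp, start) =>
  println(s"$tp: valid offset range [$start, ${end(tp)})")  // fromOffsets must fall inside this range
}
consumer.close()
{code}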



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20101) Use OffHeapColumnVector when "spark.memory.offHeap.enabled" is set to "true"

2017-03-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942384#comment-15942384
 ] 

Apache Spark commented on SPARK-20101:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/17436

> Use OffHeapColumnVector when "spark.memory.offHeap.enabled" is set to "true"
> 
>
> Key: SPARK-20101
> URL: https://issues.apache.org/jira/browse/SPARK-20101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> While {{ColumnVector}} has two implementations {{OnHeapColumnVector}} and 
> {{OffHeapColumnVector}}, only {{OnHeapColumnVector}} is used.
> This JIRA enables the use of {{OffHeapColumnVector}} when 
> {{spark.memory.offHeap.enabled}} is set to {{true}}.
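
For context, a minimal sketch of the configuration this change keys on; both settings are 
existing Spark configs, and the size value is only illustrative ({{spark.memory.offHeap.size}} 
must be set for off-heap allocation to work at all).

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("offheap-columnvector-demo")
  .master("local[*]")
  .config("spark.memory.offHeap.enabled", "true")  // the switch this JIRA keys on
  .config("spark.memory.offHeap.size", "2g")       // illustrative size
  .getOrCreate()
{code}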



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20101) Use OffHeapColumnVector when "spark.memory.offHeap.enabled" is set to "true"

2017-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20101:


Assignee: Apache Spark

> Use OffHeapColumnVector when "spark.memory.offHeap.enabled" is set to "true"
> 
>
> Key: SPARK-20101
> URL: https://issues.apache.org/jira/browse/SPARK-20101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> While {{ColumnVector}} has two implementations {{OnHeapColumnVector}} and 
> {{OffHeapColumnVector}}, only {{OnHeapColumnVector}} is used.
> This JIRA enables the use of {{OffHeapColumnVector}} when 
> {{spark.memory.offHeap.enabled}} is set to {{true}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20101) Use OffHeapColumnVector when "spark.memory.offHeap.enabled" is set to "true"

2017-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20101:


Assignee: (was: Apache Spark)

> Use OffHeapColumnVector when "spark.memory.offHeap.enabled" is set to "true"
> 
>
> Key: SPARK-20101
> URL: https://issues.apache.org/jira/browse/SPARK-20101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> While {{ColumnVector}} has two implementations {{OnHeapColumnVector}} and 
> {{OffHeapColumnVector}}, only {{OnHeapColumnVector}} is used.
> This JIRA enables the use of {{OffHeapColumnVector}} when 
> {{spark.memory.offHeap.enabled}} is set to {{true}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20098) DataType's typeName method returns with 'StructF' in case of StructField

2017-03-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942382#comment-15942382
 ] 

Apache Spark commented on SPARK-20098:
--

User 'szalai1' has created a pull request for this issue:
https://github.com/apache/spark/pull/17435

> DataType's typeName method returns with 'StructF' in case of StructField
> 
>
> Key: SPARK-20098
> URL: https://issues.apache.org/jira/browse/SPARK-20098
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Peter Szalai
>
> Currently, if you want to get the name of a DataType and the DataType is a 
> `StructField`, you get `StructF`. 
> http://spark.apache.org/docs/2.1.0/api/python/_modules/pyspark/sql/types.html 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20098) DataType's typeName method returns with 'StructF' in case of StructField

2017-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20098:


Assignee: (was: Apache Spark)

> DataType's typeName method returns with 'StructF' in case of StructField
> 
>
> Key: SPARK-20098
> URL: https://issues.apache.org/jira/browse/SPARK-20098
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Peter Szalai
>
> Currently, if you want to get the name of a DataType and the DataType is a 
> `StructField`, you get `StructF`. 
> http://spark.apache.org/docs/2.1.0/api/python/_modules/pyspark/sql/types.html 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20098) DataType's typeName method returns with 'StructF' in case of StructField

2017-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20098:


Assignee: Apache Spark

> DataType's typeName method returns with 'StructF' in case of StructField
> 
>
> Key: SPARK-20098
> URL: https://issues.apache.org/jira/browse/SPARK-20098
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Peter Szalai
>Assignee: Apache Spark
>
> Currently, if you want to get the name of a DataType and the DataType is a 
> `StructField`, you get `StructF`. 
> http://spark.apache.org/docs/2.1.0/api/python/_modules/pyspark/sql/types.html 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19476) Running threads in Spark DataFrame foreachPartition() causes NullPointerException

2017-03-26 Thread Gal Topper (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942380#comment-15942380
 ] 

Gal Topper commented on SPARK-19476:


BTW, my workaround was quite painful: I created a single-threaded producer that 
used the original executor thread to feed the data into the async stream. 
Here's the code, in case anyone stumbles on this and is looking for a way out:

{code:title=SingleThreadedPublisher.scala|borderStyle=solid}
import org.reactivestreams.{Publisher, Subscriber, Subscription}

/** A reactive streams publisher to work around SPARK-19476 by only using 
Spark's iterator from the original thread. */
private class SingleThreadedPublisher[T] extends Publisher[T] {

  private var cancelled = false
  private var demand = 0L
  private var subscriber: Subscriber[_ >: T] = _
  private val waitForDemandObject = new Object

  override def subscribe(s: Subscriber[_ >: T]): Unit = {
this.subscriber = s
val subscription = new Subscription {
  override def cancel(): Unit = waitForDemandObject.synchronized {
cancelled = true
waitForDemandObject.notify()
  }

  override def request(n: Long): Unit = {
waitForDemandObject.synchronized {
  demand += n
  waitForDemandObject.notify()
}
  }
}
s.onSubscribe(subscription)
  }

  private def produce(element: T): Unit = {
demand -= 1
subscriber.onNext(element)
  }

  def push(iterator: Iterator[T]): Unit = {
iterator.takeWhile(_ => !cancelled).foreach { element =>
  waitForDemandObject.synchronized {
if (demand > 0L) {
  produce(element)
} else {
  waitForDemandObject.wait()
  produce(element)
}
  }
}
if (!cancelled) {
  subscriber.onComplete()
}
  }
}
{code}
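
A hedged usage sketch, not part of the workaround above: given a DataFrame {{df}} like the 
aggregation in the reproduction snippet below, the publisher is driven only from the executor 
thread that owns Spark's iterator. The toy {{Subscriber}} is illustrative; in the real setup an 
Akka Streams sink (e.g. {{Source.fromPublisher}}) would take its place.

{code:title=SingleThreadedPublisherUsage.scala|borderStyle=solid}
import org.reactivestreams.{Subscriber, Subscription}
import org.apache.spark.sql.Row

df.foreachPartition { (iterator: Iterator[Row]) =>
  val publisher = new SingleThreadedPublisher[Row]
  publisher.subscribe(new Subscriber[Row] {
    override def onSubscribe(s: Subscription): Unit = s.request(Long.MaxValue)
    override def onNext(row: Row): Unit = println(row)  // stand-in for the real async sink
    override def onError(t: Throwable): Unit = t.printStackTrace()
    override def onComplete(): Unit = ()
  })
  publisher.push(iterator)  // consumes Spark's iterator on the original thread only
}
{code}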

> Running threads in Spark DataFrame foreachPartition() causes 
> NullPointerException
> -
>
> Key: SPARK-19476
> URL: https://issues.apache.org/jira/browse/SPARK-19476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Gal Topper
>
> First reported on [Stack 
> overflow|http://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition].
> I use multiple threads inside foreachPartition(), which works great for me 
> except for when the underlying iterator is TungstenAggregationIterator. Here 
> is a minimal code snippet to reproduce:
> {code:title=Reproduce.scala|borderStyle=solid}
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent.duration.Duration
> import scala.concurrent.{Await, Future}
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.SQLContext
> object Reproduce extends App {
>   val sc = new SparkContext("local", "reproduce")
>   val sqlContext = new SQLContext(sc)
>   import sqlContext.implicits._
>   val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count()
>   df.foreachPartition { iterator =>
> val f = Future(iterator.toVector)
> Await.result(f, Duration.Inf)
>   }
> }
> {code}
> When I run this, I get:
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> {noformat}
> I believe I actually understand why this happens - 
> TungstenAggregationIterator uses a ThreadLocal variable that returns null 
> when called from a thread other than the original thread that got the 
> iterator from Spark. From examining the code, this does not appear to differ 
> between recent Spark versions.
> However, this limitation is specific to TungstenAggregationIterator, and not 
> documented, as far as I'm aware.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19476) Running threads in Spark DataFrame foreachPartition() causes NullPointerException

2017-03-26 Thread Gal Topper (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942326#comment-15942326
 ] 

Gal Topper edited comment on SPARK-19476 at 3/26/17 6:08 PM:
-

In our case, we use Akka Streams to make many parallel, async I/O calls. To 
speed this up, we allow a large number of in-flight requests. It would not be 
feasible to have as many executors as we do in-flight requests, nor should the 
two be coupled IMO.

One solution that came to my mind is to simply assign 
TaskContext.get().taskMetrics() to a private val (see 
[here|https://github.com/apache/spark/blob/362ee93296a0de6342b4339e941e6a11f445c5b2/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala#L107])
 and use that later on when the iterator is exhausted (see 
[here|https://github.com/apache/spark/blob/362ee93296a0de6342b4339e941e6a11f445c5b2/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala#L419]).
 Generally speaking, iterators are not expected to be thread-safe, but also not 
to be tethered to any one specific thread, and I think the suggested change 
would comply with this standard contract.

I could make a small PR if the idea makes sense.

EDIT: Nevermind, it's not that simple, because there is at least 
[one|https://github.com/apache/spark/blob/dd9049e0492cc70b629518fee9b3d1632374c612/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L313]
 more thread locality assumption that breaks when I try this.


was (Author: gal.topper):
In our case, we use Akka Streams to make many parallel, async I/O calls. To 
speed this up, we allow a large number of in-flight requests. It would not be 
feasible to have as many executors as we do in-flight requests, nor should the 
two be coupled IMO.

One solution that came to my mind is to simply assign 
TaskContext.get().taskMetrics() to a private val (see 
[here|https://github.com/apache/spark/blob/362ee93296a0de6342b4339e941e6a11f445c5b2/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala#L107])
 and use that later on when the iterator is exhausted (see 
[here|https://github.com/apache/spark/blob/362ee93296a0de6342b4339e941e6a11f445c5b2/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala#L419]).
 Generally speaking, iterators are not expected to be thread-safe, but also not 
to be tethered to any one specific thread, and I think the suggested change 
would comply with this standard contract.

I could make a small PR if the idea makes sense.

> Running threads in Spark DataFrame foreachPartition() causes 
> NullPointerException
> -
>
> Key: SPARK-19476
> URL: https://issues.apache.org/jira/browse/SPARK-19476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Gal Topper
>
> First reported on [Stack 
> overflow|http://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition].
> I use multiple threads inside foreachPartition(), which works great for me 
> except for when the underlying iterator is TungstenAggregationIterator. Here 
> is a minimal code snippet to reproduce:
> {code:title=Reproduce.scala|borderStyle=solid}
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent.duration.Duration
> import scala.concurrent.{Await, Future}
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.SQLContext
> object Reproduce extends App {
>   val sc = new SparkContext("local", "reproduce")
>   val sqlContext = new SQLContext(sc)
>   import sqlContext.implicits._
>   val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count()
>   df.foreachPartition { iterator =>
> val f = Future(iterator.toVector)
> Await.result(f, Duration.Inf)
>   }
> }
> {code}
> When I run this, I get:
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> {noformat}
> I believe I actually understand why this happens - 
> TungstenAggregationIterator uses a ThreadLocal variable that returns null 
> when called from a thread other than the 

[jira] [Created] (SPARK-20101) Use OffHeapColumnVector when "spark.memory.offHeap.enabled" is set to "true"

2017-03-26 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-20101:


 Summary: Use OffHeapColumnVector when 
"spark.memory.offHeap.enabled" is set to "true"
 Key: SPARK-20101
 URL: https://issues.apache.org/jira/browse/SPARK-20101
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Kazuaki Ishizaki


While {{ColumnVector}} has two implementations {{OnHeapColumnVector}} and 
{{OffHeapColumnVector}}, only {{OnHeapColumnVector}} is used.
This JIRA enables the use of {{OffHeapColumnVector}} when 
{{spark.memory.offHeap.enabled}} is set to {{true}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19476) Running threads in Spark DataFrame foreachPartition() causes NullPointerException

2017-03-26 Thread Gal Topper (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942326#comment-15942326
 ] 

Gal Topper commented on SPARK-19476:


In our case, we use Akka Streams to make many parallel, async I/O calls. To 
speed this up, we allow a large number of in-flight requests. It would not be 
feasible to have as many executors as we do in-flight requests, nor should the 
two be coupled IMO.

One solution that came to my mind is to simply assign 
TaskContext.get().taskMetrics() to a private val (see 
[here|https://github.com/apache/spark/blob/362ee93296a0de6342b4339e941e6a11f445c5b2/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala#L107])
 and use that later on when the iterator is exhausted (see 
[here|https://github.com/apache/spark/blob/362ee93296a0de6342b4339e941e6a11f445c5b2/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregationIterator.scala#L419]).
 Generally speaking, iterators are not expected to be thread-safe, but also not 
to be tethered to any one specific thread, and I think the suggested change 
would comply with this standard contract.

I could make a small PR if the idea makes sense.

> Running threads in Spark DataFrame foreachPartition() causes 
> NullPointerException
> -
>
> Key: SPARK-19476
> URL: https://issues.apache.org/jira/browse/SPARK-19476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Gal Topper
>
> First reported on [Stack 
> overflow|http://stackoverflow.com/questions/41674069/running-threads-in-spark-dataframe-foreachpartition].
> I use multiple threads inside foreachPartition(), which works great for me 
> except for when the underlying iterator is TungstenAggregationIterator. Here 
> is a minimal code snippet to reproduce:
> {code:title=Reproduce.scala|borderStyle=solid}
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent.duration.Duration
> import scala.concurrent.{Await, Future}
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.SQLContext
> object Reproduce extends App {
>   val sc = new SparkContext("local", "reproduce")
>   val sqlContext = new SQLContext(sc)
>   import sqlContext.implicits._
>   val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count()
>   df.foreachPartition { iterator =>
> val f = Future(iterator.toVector)
> Await.result(f, Duration.Inf)
>   }
> }
> {code}
> When I run this, I get:
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> {noformat}
> I believe I actually understand why this happens - 
> TungstenAggregationIterator uses a ThreadLocal variable that returns null 
> when called from a thread other than the original thread that got the 
> iterator from Spark. From examining the code, this does not appear to differ 
> between recent Spark versions.
> However, this limitation is specific to TungstenAggregationIterator, and not 
> documented, as far as I'm aware.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14726) Support for sampling when inferring schema in CSV data source

2017-03-26 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15247378#comment-15247378
 ] 

Hyukjin Kwon edited comment on SPARK-14726 at 3/26/17 2:09 PM:
---

This is currently not supported. I will work on this if it is decided to be 
supported. [~rxin]


was (Author: hyukjin.kwon):
This is currently not supported. I can work on this, but I feel a bit hesitant 
because I believe the CSV data source was ported mainly for the "small data world". That 
said, I believe there are a lot of users dealing with large CSV files. 
I will work on this if it is decided to be supported. [~rxin]

> Support for sampling when inferring schema in CSV data source
> -
>
> Key: SPARK-14726
> URL: https://issues.apache.org/jira/browse/SPARK-14726
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Bomi Kim
>
> Currently, I am using the CSV data source and trying to get used to Spark 2.0 
> because it has a built-in CSV data source.
> I realized that the CSV data source infers the schema from all of the data, 
> while the JSON data source supports a sampling ratio option.
> It would be great if the CSV data source had this option too (or is this 
> supported already?).
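
A short sketch of the requested API; the JSON option already exists, while the CSV 
{{samplingRatio}} option shown here is the feature being asked for, not an existing setting 
at the time of this ticket (file paths are illustrative).

{code}
// Existing behaviour: JSON schema inference can sample.
val jsonDf = spark.read.option("samplingRatio", "0.1").json("events.json")

// Requested behaviour: let CSV schema inference sample as well, instead of scanning all rows.
val csvDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("samplingRatio", "0.1")  // the option this ticket proposes
  .csv("events.csv")
{code}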



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14057) sql time stamps do not respect time zones

2017-03-26 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942285#comment-15942285
 ] 

Hyukjin Kwon commented on SPARK-14057:
--

[~vsparmar], would {{spark.sql.session.timeZone}} solve this problem?
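
A minimal sketch of that setting, assuming a Spark version where {{spark.sql.session.timeZone}} 
is available; the path and column name are illustrative.

{code}
// Render session-local timestamps in UTC instead of the JVM default time zone.
spark.conf.set("spark.sql.session.timeZone", "UTC")

spark.read.parquet("/data/events")            // illustrative path
  .selectExpr("cast(event_time as string)")   // illustrative column, now formatted in UTC
  .show()
{code}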

> sql time stamps do not respect time zones
> -
>
> Key: SPARK-14057
> URL: https://issues.apache.org/jira/browse/SPARK-14057
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Andrew Davidson
>Priority: Minor
>
> We have timestamp data. The timestamp data is UTC; however, when we load the 
> data into Spark data frames, the system assumes the timestamps are in the 
> local time zone. This causes problems for our data scientists. Often they 
> pull data from our data center onto their local Macs. The data centers run 
> UTC, while their computers are typically in PST or EST.
> It is possible to hack around this problem.
> This causes a lot of errors in their analysis.
> A complete description of this issue can be found in the following mail message:
> https://www.mail-archive.com/user@spark.apache.org/msg48121.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13169) CROSS JOIN slow or fails on tiny table

2017-03-26 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-13169.
--
Resolution: Cannot Reproduce

I am resolving this. I can't reproduce this against the current master as below:

{code}
val sql = """
  SELECT `gwtlyrpywf`.`gear`,`gwtlyrpywf`.`cyl`,`vs` FROM (
    SELECT DISTINCT * FROM (
      SELECT `gear` AS `gear`, `cyl` AS `cyl` FROM `mtcars`)
    AS `zzz1`)
  AS `gwtlyrpywf`
  CROSS JOIN (
    SELECT DISTINCT * FROM (
      SELECT `vs` AS `vs` FROM `mtcars`)
    AS `zzz3`)
  AS `arytvfispy`
"""

spark.read.option("header", true).option("inferSchema", 
true).csv("mtcars.csv").createOrReplaceTempView("mtcars")
spark.sql(sql).show()
{code}

{code}
++---+---+
|gear|cyl| vs|
++---+---+
|   5|  6|  1|
|   5|  6|  0|
|   5|  4|  1|
|   5|  4|  0|
|   4|  6|  1|
|   4|  6|  0|
|   3|  6|  1|
|   3|  6|  0|
|   5|  8|  1|
|   5|  8|  0|
|   3|  8|  1|
|   3|  8|  0|
|   3|  4|  1|
|   3|  4|  0|
|   4|  4|  1|
|   4|  4|  0|
++---+---+
{code}

This seems to be fixed in master. It would be great if someone could identify the 
JIRA that fixed it and backport it if applicable.

> CROSS JOIN slow or fails on tiny table
> --
>
> Key: SPARK-13169
> URL: https://issues.apache.org/jira/browse/SPARK-13169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Antonio Piccolboni
>
> I am running a cross join with a distinct select on both sides. The table is tiny 
> (32 x 16). I am running the query through the thrift server. The data is here 
> (https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv). 
> The query never terminates in under 200s and mostly fails (TTransportException) while 
> all cores are being used (single machine).
> Query is
> {code}
> SELECT `gwtlyrpywf`.`gear`,`gwtlyrpywf`.`cyl`,`vs` FROM (
>SELECT DISTINCT * FROM (
>  SELECT `gear` AS `gear`, `cyl` AS `cyl`FROM `mtcars`) 
>   AS `zzz1`)
>  AS `gwtlyrpywf`
>CROSS JOIN (
>SELECT DISTINCT * FROM (
>  SELECT `vs` AS `vs` FROM `mtcars`)
>AS `zzz3`)
>AS `arytvfispy`
> {code}
> I know it can be simplified, but it comes from a generator and the generator 
> counts on the optimizer to do the right thing. EXPLAIN shows the following
> {code}
> plan
> == Physical Plan ==
> Project [gear#21,cyl#22,vs#23]
> +- CartesianProduct

[jira] [Comment Edited] (SPARK-11784) Support Timestamp filter pushdown in Parquet datasource

2017-03-26 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557940#comment-15557940
 ] 

Hyukjin Kwon edited comment on SPARK-11784 at 3/26/17 1:42 PM:
---

Could you fill in the description? It seems you are referring to the {{TimestampType}} 
support in the Parquet data source.


was (Author: hyukjin.kwon):
Could you feel up the description? it seems you referred the {{TimestampType}} 
support in Parquet data source?

> Support Timestamp filter pushdown in Parquet datasource 
> 
>
> Key: SPARK-11784
> URL: https://issues.apache.org/jira/browse/SPARK-11784
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Ian
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20025) Driver fail over will not work, if SPARK_LOCAL* env is set.

2017-03-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20025:
--
Target Version/s: 2.2.0  (was: 2.1.1, 2.2.0)

> Driver fail over will not work, if SPARK_LOCAL* env is set.
> ---
>
> Key: SPARK-20025
> URL: https://issues.apache.org/jira/browse/SPARK-20025
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Prashant Sharma
>
> In a bare-metal system with no DNS setup, Spark may be configured with 
> SPARK_LOCAL* for the IP and host properties.
> During a driver failover in cluster deployment mode, SPARK_LOCAL* should be 
> ignored during auto deployment and instead be picked up from the target 
> system's local environment.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19570) Allow to disable hive in pyspark shell

2017-03-26 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19570:
--
Target Version/s:   (was: 2.1.1)

> Allow to disable hive in pyspark shell
> --
>
> Key: SPARK-19570
> URL: https://issues.apache.org/jira/browse/SPARK-19570
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Jeff Zhang
>Priority: Minor
>
> SPARK-15236 does this for the Scala shell; this ticket is for the PySpark shell. This 
> is not only for PySpark itself, but can also benefit downstream projects like 
> Livy, which uses shell.py for its interactive sessions. For now, Livy has no 
> control over whether Hive is enabled or not. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20100) Consolidate SessionState construction

2017-03-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1594#comment-1594
 ] 

Apache Spark commented on SPARK-20100:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/17433

> Consolidate SessionState construction
> -
>
> Key: SPARK-20100
> URL: https://issues.apache.org/jira/browse/SPARK-20100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>
> The current SessionState initialization path is quite complex. A part of the 
> creation is done in the SessionState companion objects, a part of the 
> creation is done inside the SessionState class, and a part is done by passing 
> functions.
> The proposal is to consolidate the SessionState initialization into a builder 
> class. This SessionState will not do any initialization and just becomes a 
> place holder for the various Spark SQL internals. The advantages of this 
> approach are the following:
> - SessionState initialization is less dispersed. The builder should be a one 
> stop shop.
> - This provides us with a start for removing the HiveSessionState. Removing 
> the hive session state would also require us to move resource loading into a 
> separate class, and to (re)move metadata hive.
> - It is easier to customize the Spark Session. You just need to create a 
> custom version of the builder. I will add hooks to make this easier. Opening 
> up these API's will happen at a later point.
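
A schematic sketch of the proposed shape; every class and member name here is purely 
illustrative and none of them are the actual Spark internals. The state object becomes a plain 
holder and all wiring lives in one builder that can be customized by overriding its parts.

{code:title=IllustrativeBuilder.scala|borderStyle=solid}
// Illustrative only: a holder with no initialization logic of its own.
final case class IllustrativeSessionState(
    conf: Map[String, String],
    catalogImplementation: String,
    analyzerRules: Seq[String])

// Illustrative only: the single place where the pieces are wired together.
class IllustrativeSessionStateBuilder {
  protected def conf: Map[String, String] = Map("spark.sql.example" -> "value")
  protected def catalogImplementation: String = "in-memory"
  protected def analyzerRules: Seq[String] = Seq("ResolveReferences", "ResolveFunctions")

  def build(): IllustrativeSessionState =
    IllustrativeSessionState(conf, catalogImplementation, analyzerRules)
}

// Customizing a session then means overriding parts of the builder.
class HiveLikeSessionStateBuilder extends IllustrativeSessionStateBuilder {
  override protected def catalogImplementation: String = "hive"
}
{code}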



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20100) Consolidate SessionState construction

2017-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20100:


Assignee: Herman van Hovell  (was: Apache Spark)

> Consolidate SessionState construction
> -
>
> Key: SPARK-20100
> URL: https://issues.apache.org/jira/browse/SPARK-20100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>
> The current SessionState initialization path is quite complex. A part of the 
> creation is done in the SessionState companion objects, a part of the 
> creation is done inside the SessionState class, and a part is done by passing 
> functions.
> The proposal is to consolidate the SessionState initialization into a builder 
> class. This SessionState will not do any initialization and just becomes a 
> place holder for the various Spark SQL internals. The advantages of this 
> approach are the following:
> - SessionState initialization is less dispersed. The builder should be a one 
> stop shop.
> - This provides us with a start for removing the HiveSessionState. Removing 
> the hive session state would also require us to move resource loading into a 
> separate class, and to (re)move metadata hive.
> - It is easier to customize the Spark Session. You just need to create a 
> custom version of the builder. I will add hooks to make this easier. Opening 
> up these API's will happen at a later point.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20100) Consolidate SessionState construction

2017-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20100:


Assignee: Apache Spark  (was: Herman van Hovell)

> Consolidate SessionState construction
> -
>
> Key: SPARK-20100
> URL: https://issues.apache.org/jira/browse/SPARK-20100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>
> The current SessionState initialization path is quite complex. A part of the 
> creation is done in the SessionState companion objects, a part of the 
> creation is done inside the SessionState class, and a part is done by passing 
> functions.
> The proposal is to consolidate the SessionState initialization into a builder 
> class. This SessionState will not do any initialization and just becomes a 
> place holder for the various Spark SQL internals. The advantages of this 
> approach are the following:
> - SessionState initialization is less dispersed. The builder should be a one 
> stop shop.
> - This provides us with a start for removing the HiveSessionState. Removing 
> the hive session state would also require us to move resource loading into a 
> separate class, and to (re)move metadata hive.
> - It is easier to customize the Spark Session. You just need to create a 
> custom version of the builder. I will add hooks to make this easier. Opening 
> up these API's will happen at a later point.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20100) Consolidate SessionState construction

2017-03-26 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-20100:
-

 Summary: Consolidate SessionState construction
 Key: SPARK-20100
 URL: https://issues.apache.org/jira/browse/SPARK-20100
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Herman van Hovell
Assignee: Herman van Hovell


The current SessionState initialization path is quite complex. A part of the 
creation is done in the SessionState companion objects, a part of the creation 
is done inside the SessionState class, and a part is done by passing functions.

The proposal is to consolidate the SessionState initialization into a builder 
class. This SessionState will not do any initialization and just becomes a 
place holder for the various Spark SQL internals. The advantages of this 
approach are the following:
- SessionState initialization is less dispersed. The builder should be a one 
stop shop.
- This provides us with a start for removing the HiveSessionState. Removing the 
hive session state would also require us to move resource loading into a 
separate class, and to (re)move metadata hive.
- It is easier to customize the Spark Session. You just need to create a custom 
version of the builder. I will add hooks to make this easier. Opening up these 
API's will happen at a later point.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20086) issue with pyspark 2.1.0 window function

2017-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20086:


Assignee: (was: Apache Spark)

> issue with pyspark 2.1.0 window function
> 
>
> Key: SPARK-20086
> URL: https://issues.apache.org/jira/browse/SPARK-20086
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: mandar uapdhye
>
> original  post at
> [stackoverflow | 
> http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function]
> I get error when working with pyspark window function. here is some example 
> code:
> {code:title=borderStyle=solid}
> import pyspark
> import pyspark.sql.functions as sf
> import pyspark.sql.types as sparktypes
> from pyspark.sql import window
> 
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)])
> df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"])
> df.show()
> {code}
> gives:
> |  x|AmtPaid|
> |  1|2.0|
> |  1|3.0|
> |  1|1.0|
> |  1|   -2.0|
> |  1|   -1.0|
> next, compute cumulative sum
> {code:title=test.py|borderStyle=solid}
> win_spec_max = (window.Window
> .partitionBy(['x'])
> .rowsBetween(window.Window.unboundedPreceding, 0))
> df = df.withColumn('AmtPaidCumSum',
>sf.sum(sf.col('AmtPaid')).over(win_spec_max))
> df.show()
> {code}
> gives,
> |  x|AmtPaid|AmtPaidCumSum|
> |  1|2.0|  2.0|
> |  1|3.0|  5.0|
> |  1|1.0|  6.0|
> |  1|   -2.0|  4.0|
> |  1|   -1.0|  3.0|
> next, compute cumulative max,
> {code}
> df = df.withColumn('AmtPaidCumSumMax', 
> sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))
> df.show()
> {code}
> gives error log
> {noformat}
>  Py4JJavaError: An error occurred while calling o2609.showString.
> with traceback:
> Py4JJavaErrorTraceback (most recent call last)
>  in ()
> > 1 df.show()
> /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in 
> show(self, n, truncate)
> 316 """
> 317 if isinstance(truncate, bool) and truncate:
> --> 318 print(self._jdf.showString(n, 20))
> 319 else:
> 320 print(self._jdf.showString(n, int(truncate)))
> 
> /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc 
> in __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, 
> self.name)
>1134 
>1135 for temp_arg in temp_args:
> /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in 
> deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc 
> in get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling 
> {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> {noformat}
> but interestingly enough, if I introduce another change before the second window 
> operation, say inserting a column, then it does not give that error:
> {code}
> df = df.withColumn('MaxBound', sf.lit(6.))
> df.show()
> {code}
> |  x|AmtPaid|AmtPaidCumSum|MaxBound|
> |  1|2.0|  2.0| 6.0|
> |  1|3.0|  5.0| 6.0|
> |  1|1.0|  6.0| 6.0|
> |  1|   -2.0|  4.0| 6.0|
> |  1|   -1.0|  3.0| 6.0|
> {code}
> #then apply the second window operations
> df = df.withColumn('AmtPaidCumSumMax', 
> sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))
> df.show()
> {code}
> |  x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax|
> |  1|2.0|  2.0| 6.0| 2.0|
> |  1|3.0|  5.0| 6.0| 5.0|
> |  1|1.0|  6.0| 6.0| 6.0|
> |  1|   -2.0|  4.0| 6.0| 6.0|
> |  1|   -1.0|  3.0| 6.0| 6.0|
> I do not understand this behaviour.
> Well, so far so good, but then I try another operation and again get a similar 
> error:
> {code}
> 

[jira] [Assigned] (SPARK-20086) issue with pyspark 2.1.0 window function

2017-03-26 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20086:


Assignee: Apache Spark

> issue with pyspark 2.1.0 window function
> 
>
> Key: SPARK-20086
> URL: https://issues.apache.org/jira/browse/SPARK-20086
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: mandar uapdhye
>Assignee: Apache Spark
>
> original  post at
> [stackoverflow | 
> http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function]
> I get error when working with pyspark window function. here is some example 
> code:
> {code:title=borderStyle=solid}
> import pyspark
> import pyspark.sql.functions as sf
> import pyspark.sql.types as sparktypes
> from pyspark.sql import window
> 
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)])
> df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"])
> df.show()
> {code}
> gives:
> |  x|AmtPaid|
> |  1|2.0|
> |  1|3.0|
> |  1|1.0|
> |  1|   -2.0|
> |  1|   -1.0|
> next, compute cumulative sum
> {code:title=test.py|borderStyle=solid}
> win_spec_max = (window.Window
> .partitionBy(['x'])
> .rowsBetween(window.Window.unboundedPreceding, 0))
> df = df.withColumn('AmtPaidCumSum',
>sf.sum(sf.col('AmtPaid')).over(win_spec_max))
> df.show()
> {code}
> gives,
> |  x|AmtPaid|AmtPaidCumSum|
> |  1|2.0|  2.0|
> |  1|3.0|  5.0|
> |  1|1.0|  6.0|
> |  1|   -2.0|  4.0|
> |  1|   -1.0|  3.0|
> next, compute cumulative max,
> {code}
> df = df.withColumn('AmtPaidCumSumMax', 
> sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))
> df.show()
> {code}
> gives error log
> {noformat}
>  Py4JJavaError: An error occurred while calling o2609.showString.
> with traceback:
> Py4JJavaErrorTraceback (most recent call last)
>  in ()
> > 1 df.show()
> /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in 
> show(self, n, truncate)
> 316 """
> 317 if isinstance(truncate, bool) and truncate:
> --> 318 print(self._jdf.showString(n, 20))
> 319 else:
> 320 print(self._jdf.showString(n, int(truncate)))
> 
> /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc 
> in __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, 
> self.name)
>1134 
>1135 for temp_arg in temp_args:
> /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in 
> deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc 
> in get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling 
> {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> {noformat}
> but interestingly enough, if I introduce another change before the second window 
> operation, say inserting a column, then it does not give that error:
> {code}
> df = df.withColumn('MaxBound', sf.lit(6.))
> df.show()
> {code}
> |  x|AmtPaid|AmtPaidCumSum|MaxBound|
> |  1|2.0|  2.0| 6.0|
> |  1|3.0|  5.0| 6.0|
> |  1|1.0|  6.0| 6.0|
> |  1|   -2.0|  4.0| 6.0|
> |  1|   -1.0|  3.0| 6.0|
> {code}
> #then apply the second window operations
> df = df.withColumn('AmtPaidCumSumMax', 
> sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))
> df.show()
> {code}
> |  x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax|
> |  1|2.0|  2.0| 6.0| 2.0|
> |  1|3.0|  5.0| 6.0| 5.0|
> |  1|1.0|  6.0| 6.0| 6.0|
> |  1|   -2.0|  4.0| 6.0| 6.0|
> |  1|   -1.0|  3.0| 6.0| 6.0|
> I do not understand this behaviour.
> Well, so far so good, but then I try another operation and again get 

[jira] [Commented] (SPARK-20086) issue with pyspark 2.1.0 window function

2017-03-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942190#comment-15942190
 ] 

Apache Spark commented on SPARK-20086:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/17432

> issue with pyspark 2.1.0 window function
> 
>
> Key: SPARK-20086
> URL: https://issues.apache.org/jira/browse/SPARK-20086
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: mandar uapdhye
>
> original  post at
> [stackoverflow | 
> http://stackoverflow.com/questions/43007433/pyspark-2-1-0-error-when-working-with-window-function]
> I get error when working with pyspark window function. here is some example 
> code:
> {code:title=borderStyle=solid}
> import pyspark
> import pyspark.sql.functions as sf
> import pyspark.sql.types as sparktypes
> from pyspark.sql import window
> 
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> rdd = sc.parallelize([(1, 2.0), (1, 3.0), (1, 1.), (1, -2.), (1, -1.)])
> df = sqlc.createDataFrame(rdd, ["x", "AmtPaid"])
> df.show()
> {code}
> gives:
> |  x|AmtPaid|
> |  1|2.0|
> |  1|3.0|
> |  1|1.0|
> |  1|   -2.0|
> |  1|   -1.0|
> next, compute cumulative sum
> {code:title=test.py|borderStyle=solid}
> win_spec_max = (window.Window
> .partitionBy(['x'])
> .rowsBetween(window.Window.unboundedPreceding, 0))
> df = df.withColumn('AmtPaidCumSum',
>sf.sum(sf.col('AmtPaid')).over(win_spec_max))
> df.show()
> {code}
> gives,
> |  x|AmtPaid|AmtPaidCumSum|
> |  1|2.0|  2.0|
> |  1|3.0|  5.0|
> |  1|1.0|  6.0|
> |  1|   -2.0|  4.0|
> |  1|   -1.0|  3.0|
> next, compute cumulative max,
> {code}
> df = df.withColumn('AmtPaidCumSumMax', 
> sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))
> df.show()
> {code}
> gives error log
> {noformat}
>  Py4JJavaError: An error occurred while calling o2609.showString.
> with traceback:
> Py4JJavaErrorTraceback (most recent call last)
>  in ()
> > 1 df.show()
> /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.pyc in 
> show(self, n, truncate)
> 316 """
> 317 if isinstance(truncate, bool) and truncate:
> --> 318 print(self._jdf.showString(n, 20))
> 319 else:
> 320 print(self._jdf.showString(n, int(truncate)))
> 
> /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/java_gateway.pyc 
> in __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, 
> self.name)
>1134 
>1135 for temp_arg in temp_args:
> /Users/<>/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in 
> deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /Users/<>/.virtualenvs/<>/lib/python2.7/site-packages/py4j/protocol.pyc 
> in get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling 
> {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> {noformat}
> but interestingly enough, if I introduce another change before the second window 
> operation, say inserting a column, then it does not give that error:
> {code}
> df = df.withColumn('MaxBound', sf.lit(6.))
> df.show()
> {code}
> |  x|AmtPaid|AmtPaidCumSum|MaxBound|
> |  1|2.0|  2.0| 6.0|
> |  1|3.0|  5.0| 6.0|
> |  1|1.0|  6.0| 6.0|
> |  1|   -2.0|  4.0| 6.0|
> |  1|   -1.0|  3.0| 6.0|
> {code}
> #then apply the second window operations
> df = df.withColumn('AmtPaidCumSumMax', 
> sf.max(sf.col('AmtPaidCumSum')).over(win_spec_max))
> df.show()
> {code}
> |  x|AmtPaid|AmtPaidCumSum|MaxBound|AmtPaidCumSumMax|
> |  1|2.0|  2.0| 6.0| 2.0|
> |  1|3.0|  5.0| 6.0| 5.0|
> |  1|1.0|  6.0| 6.0| 6.0|
> |  1|   -2.0|  4.0| 6.0| 6.0|
> |  1|   -1.0|  3.0| 6.0| 6.0|
> I do not understand this behaviour

[jira] [Resolved] (SPARK-20046) Facilitate loop optimizations in a JIT compiler regarding sqlContext.read.parquet()

2017-03-26 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-20046.
---
   Resolution: Fixed
 Assignee: Kazuaki Ishizaki
Fix Version/s: 2.2.0

> Facilitate loop optimizations in a JIT compiler regarding 
> sqlContext.read.parquet()
> ---
>
> Key: SPARK-20046
> URL: https://issues.apache.org/jira/browse/SPARK-20046
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.2.0
>
>
> [This 
> article|https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html]
>  suggests that better generated code can improve performance by facilitating 
> compiler optimizations.
> This JIRA changes the generated code for {{sqlContext.read.parquet("file")}} 
> to facilitate loop optimizations in a JIT compiler and thereby achieve better 
> performance. In particular, [this stackoverflow 
> entry|http://stackoverflow.com/questions/40629435/fast-parquet-row-count-in-spark]
>  suggested that I improve the performance of 
> {{sqlContext.read.parquet("file").count}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20092) Trigger AppVeyor R tests for changes in Scala code related with R API

2017-03-26 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-20092.
--
  Resolution: Fixed
Assignee: Hyukjin Kwon
   Fix Version/s: 2.2.0
Target Version/s: 2.2.0

> Trigger AppVeyor R tests for changes in Scala code related with R API
> -
>
> Key: SPARK-20092
> URL: https://issues.apache.org/jira/browse/SPARK-20092
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra, SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.2.0
>
>
> We currently detect changes in the {{./R}} directory and then trigger the 
> AppVeyor tests.
> It seems we also need to run the tests when there are changes in 
> {{./core/src/main/scala/org/apache/spark/r}} and 
> {{./sql/core/src/main/scala/org/apache/spark/sql/api/r}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org