[jira] [Commented] (SPARK-18717) Datasets - crash (compile exception) when mapping to immutable scala map

2016-12-05 Thread Damian Momot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724702#comment-15724702
 ] 

Damian Momot commented on SPARK-18717:
--

Yep, it's already worked around this way in my code, but using standard Scala 
types is simply more natural.
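
For reference, a minimal sketch of the kind of workaround meant here (an assumption on my part: declaring the field with the non-immutable scala.collection.Map supertype, which matches the scala.collection.Map the generated code passes to the constructor):

{code}
// Hedged sketch, not necessarily the exact workaround referenced above.
import org.apache.spark.sql.SparkSession

// Same shape as Test above, but with the broader map type.
case class Test2(id: String, map_test: scala.collection.Map[Long, String])

val spark = SparkSession.builder().master("local[2]").appName("map-encoder-sketch").getOrCreate()
import spark.implicits._

val ds = Seq(Test2("a", Map(1L -> "x"))).toDS()
ds.map(t => t).collect()   // no CompileException with the non-immutable Map type
{code}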

> Datasets - crash (compile exception) when mapping to immutable scala map
> 
>
> Key: SPARK-18717
> URL: https://issues.apache.org/jira/browse/SPARK-18717
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
>Reporter: Damian Momot
>
> {code}
> val spark: SparkSession = ???
> case class Test(id: String, map_test: Map[Long, String])
> spark.sql("CREATE TABLE xyz.map_test (id string, map_test map<bigint,string>) 
> STORED AS PARQUET")
> spark.sql("SELECT * FROM xyz.map_test").as[Test].map(t => t).collect()
> {code}
> {code}
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 307, Column 108: No applicable constructor/method found for actual parameters 
> "java.lang.String, scala.collection.Map"; candidates are: 
> "$line14.$read$$iw$$iw$Test(java.lang.String, scala.collection.immutable.Map)"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18737) Serialization setting "spark.serializer" ignored in Spark 2.x

2016-12-05 Thread Dr. Michael Menzel (JIRA)
Dr. Michael Menzel created SPARK-18737:
--

 Summary: Serialization setting "spark.serializer" ignored in Spark 
2.x
 Key: SPARK-18737
 URL: https://issues.apache.org/jira/browse/SPARK-18737
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.1, 2.0.0
Reporter: Dr. Michael Menzel


The following exception occurs although the JavaSerializer has been activated:

16/11/22 10:49:24 INFO TaskSetManager: Starting task 0.0 in stage 9.0 (TID 77, 
ip-10-121-14-147.eu-central-1.compute.internal, partition 1, RACK_LOCAL, 5621 
bytes)
16/11/22 10:49:24 INFO YarnSchedulerBackend$YarnDriverEndpoint: Launching task 
77 on executor id: 2 hostname: ip-10-121-14-147.eu-central-1.compute.internal.
16/11/22 10:49:24 INFO BlockManagerInfo: Added broadcast_11_piece0 in memory on 
ip-10-121-14-147.eu-central-1.compute.internal:45059 (size: 879.0 B, free: 
410.4 MB)
16/11/22 10:49:24 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 77, 
ip-10-121-14-147.eu-central-1.compute.internal): 
com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
13994
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.util.NextIterator.foreach(NextIterator.scala:21)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at org.apache.spark.util.NextIterator.to(NextIterator.scala:21)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at org.apache.spark.util.NextIterator.toBuffer(NextIterator.scala:21)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at org.apache.spark.util.NextIterator.toArray(NextIterator.scala:21)
at 
org.apache.spark.rdd.RDD$$anonfun$toLocalIterator$1$$anonfun$org$apache$spark$rdd$RDD$$anonfun$$collectPartition$1$1.apply(RDD.scala:927)
at 
org.apache.spark.rdd.RDD$$anonfun$toLocalIterator$1$$anonfun$org$apache$spark$rdd$RDD$$anonfun$$collectPartition$1$1.apply(RDD.scala:927)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

The code runs perfectly with Spark 1.6.0. Since we moved to 2.0.0 and now 
2.0.1, we see the Kryo deserialization exception, and over time the Spark 
streaming job stops processing because too many tasks have failed.

Our action was to use conf.set("spark.serializer", 
"org.apache.spark.serializer.JavaSerializer") and to disable Kryo class 
registration with conf.set("spark.kryo.registrationRequired", false). We hope 
to identify the root cause of the exception. 
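
For reference, the configuration above as code (app name and master below are placeholders, not from our job):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("serializer-sketch")   // placeholder
  .setMaster("local[2]")             // placeholder
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
  .set("spark.kryo.registrationRequired", "false")
val sc = new SparkContext(conf)
{code}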

However, setting the serializer to JavaSerializer is obviously ignored by the 
Spark internals. Despite the setting, we still see the exception printed in the 
log and tasks fail. The occurrence seems to be non-deterministic, but to become 
more frequent over time.

Several questions we could not answer during our troubleshooting:
1. How can the debug log for Kryo be enabled? -- We tried following the MinLog 
documentation, but no output could be found.
2. Is the serializer setting effective for Spark-internal serialization? How 
can the JavaSerializer be forced for internal serialization in worker-to-driver 
communication?




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18736) [SQL] CreateMap allow non-unique keys

2016-12-05 Thread Eyal Farago (JIRA)
Eyal Farago created SPARK-18736:
---

 Summary: [SQL] CreateMap allow non-unique keys
 Key: SPARK-18736
 URL: https://issues.apache.org/jira/browse/SPARK-18736
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Eyal Farago


In Spark SQL, CreateMap does not enforce unique keys, i.e. it's possible to create 
a map with two identical keys: 
CreateMap(Literal(1), Literal(11),
          Literal(1), Literal(12))

This does not behave like standard maps in common programming languages.
A proper behavior should be chosen:
1. first key 'wins'
2. last key 'wins'
3. runtime error

Currently GetMapValue implements option #1. Even if this is the desired 
behavior, CreateMap should still return a map with unique keys.
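
A minimal reproduction sketch via the SQL map() function, which is backed by CreateMap (assuming a SparkSession named spark; the outputs in the comments are indicative):

{code}
val df = spark.sql("SELECT map(1, 11, 1, 12) AS m")
df.show(false)                 // e.g. [Map(1 -> 11, 1 -> 12)] -- both entries kept
df.selectExpr("m[1]").show()   // GetMapValue returns the first match, i.e. 11
{code}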



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18735) Why don't we destroy the broadcast variable after each iteration?

2016-12-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724620#comment-15724620
 ] 

Sean Owen commented on SPARK-18735:
---

Should be because they are used in a computation that produces a simple value 
on the driver, not an RDD. Problem 2 is not an issue in this context.

> Why don't we destroy the broadcast variable after each iteration?
> -
>
> Key: SPARK-18735
> URL: https://issues.apache.org/jira/browse/SPARK-18735
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.2
>Reporter: Jianfei Wang
>
> I think we should destroy the broadcast variable bcWeights explicitly in 
> spark.mllib.GradientDescent.runMiniBatchSGD
> see the code below:
>   while (!converged && i <= numIterations) {
>   val start = System.nanoTime()
>   val bcWeights = data.context.broadcast(weights)
>  // some other code
> }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18735) Why don't we destroy the broadcast variable after each iteration?

2016-12-05 Thread Jianfei Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724605#comment-15724605
 ] 

Jianfei Wang commented on SPARK-18735:
--

Oh no, I just saw that KMeans and GaussianMixture both use destroy.

> Why don't we destroy the broadcast variable after each iteration?
> -
>
> Key: SPARK-18735
> URL: https://issues.apache.org/jira/browse/SPARK-18735
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.2
>Reporter: Jianfei Wang
>
> I think we should destroy the broadcast variable bcWeights explicitly in 
> spark.mllib.GradientDescent.runMiniBatchSGD
> see the code below:
>   while (!converged && i <= numIterations) {
>   val start = System.nanoTime()
>   val bcWeights = data.context.broadcast(weights)
>  // some other code
> }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18735) Why don't we destroy the broadcast variable after each iteration?

2016-12-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18735.
---
Resolution: Invalid

(Please ask questions on the mailing list)

Two reasons:

1) If the computation that uses the broadcast isn't triggered within the loop 
iteration, then you would be removing the broadcast before it's used.
2) Even after the broadcast has definitely been used to compute, say, a cached 
RDD, destroy()-ing it is a problem, because the cached RDD might still be 
recomputed and a destroyed broadcast can't be used. unpersist() is more correct 
in this case.
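
To illustrate the distinction, a self-contained sketch (not the actual GradientDescent code; all names and values are made up):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch").setMaster("local[2]"))
val data = sc.parallelize(1 to 1000)
var weights = Array.fill(10)(0.01)

for (i <- 1 to 5) {
  val bcWeights = sc.broadcast(weights)
  // The broadcast is consumed by an action that returns a simple value to the
  // driver within this iteration, so reason 1 does not apply and destroy() is safe.
  val loss = data.map(x => x * bcWeights.value.sum).sum()
  bcWeights.destroy()
  weights = weights.map(_ - 1e-9 * loss)
}
// Had bcWeights been used to build a cached RDD that might be recomputed later,
// unpersist() rather than destroy() would be the correct call (reason 2).
sc.stop()
{code}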

> Why don't we destroy the broadcast variable after each iteration?
> -
>
> Key: SPARK-18735
> URL: https://issues.apache.org/jira/browse/SPARK-18735
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.2
>Reporter: Jianfei Wang
>
> I think we should destroy the broadcast variable bcWeights explicitly in 
> spark.mllib.GradientDescent.runMiniBatchSGD
> see the code below:
>   while (!converged && i <= numIterations) {
>   val start = System.nanoTime()
>   val bcWeights = data.context.broadcast(weights)
>  // some other code
> }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18735) Why don't we destroy the broadcast variable after each iteration?

2016-12-05 Thread Jianfei Wang (JIRA)
Jianfei Wang created SPARK-18735:


 Summary: Why don't we destroy the broadcast variable after each 
iteration?
 Key: SPARK-18735
 URL: https://issues.apache.org/jira/browse/SPARK-18735
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.0.2
Reporter: Jianfei Wang


I think we should destroy the broadcast variable bcWeights explicitly in 
spark.mllib.GradientDescent.runMiniBatchSGD; see the code below:
{code}
while (!converged && i <= numIterations) {
  val start = System.nanoTime()
  val bcWeights = data.context.broadcast(weights)
  // some other code
}
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18731) Task size in K-means is so large

2016-12-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724488#comment-15724488
 ] 

Sean Owen commented on SPARK-18731:
---

Yes, the scheduler delay comes from having to transmit the huge centroids at 
each step. This is inherent in how k-means is implemented. It proceeds by 
broadcasting the centroids, which at least means they're transmitted just once 
per executor. I don't think more partitions help here.

Thinking about it, I would actually be surprised if the broadcast variable's 
size were part of the task size, because it's not transmitted as part of the 
task closure. That makes me wonder whether the k-means implementation is 
inadvertently also sending the centroids a second time as part of the task 
closure. 

I haven't looked into it, but are you able to enable DEBUG logging and then 
look for messages from util.ClosureCleaner? It will print some details about 
what it's serializing. Or attach a debugger to the driver, break at the 
statement where it logs the warning, and see what objects are part of the task 
closure. If I'm right, it shouldn't contain a copy of the centroids, because 
those are broadcast. If it has something huge, I think that's a problem.
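
For what it's worth, a hedged sketch of one way to enable that DEBUG output from the driver, assuming the stock log4j 1.x setup that ships with Spark:

{code}
// Raise only the ClosureCleaner logger to DEBUG so its output about what it
// serializes shows up, without enabling DEBUG globally.
import org.apache.log4j.{Level, Logger}

Logger.getLogger("org.apache.spark.util.ClosureCleaner").setLevel(Level.DEBUG)
{code}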

> Task size in K-means is so large
> 
>
> Key: SPARK-18731
> URL: https://issues.apache.org/jira/browse/SPARK-18731
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: Xiaoye Sun
>Priority: Minor
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> When run the KMeans algorithm for a large model (e.g. 100k features and 100 
> centers), there will be warning shown for many of the stages saying that the 
> task size is very large. Here is an example warning. 
> WARN TaskSetManager: Stage 23 contains a task of very large size (56256 KB). 
> The maximum recommended task size is 100 KB.
> This could happen at (sum at KMeansModel.scala:88), (takeSample at 
> KMeans.scala:378), (aggregate at KMeans.scala:404) and (collect at 
> KMeans.scala:436). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18712) keep the order of sql expression and support short circuit

2016-12-05 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724442#comment-15724442
 ] 

Xiao Li commented on SPARK-18712:
-

[~cloud_fan] Sure, let me take this.

> keep the order of sql expression and support short circuit
> --
>
> Key: SPARK-18712
> URL: https://issues.apache.org/jira/browse/SPARK-18712
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: Ubuntu 16.04
>Reporter: yahsuan, chang
>
> The following python code fails with spark 2.0.2, but works with spark 1.5.2
> {code}
> # a.py
> import pyspark
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> table = {5: True, 6: False}
> df = sqlc.range(10)
> df = df.where(df['id'].isin(5, 6))
> f = F.udf(lambda x: table[x], T.BooleanType())
> df = df.where(f(df['id']))
> # df.explain(True)
> print(df.count())
> {code}
> here is the exception 
> {code}
> KeyError: 0
> {code}
> I guess the problem is about the order of sql expression.
> the following are the explain of two spark version
> {code}
> # explain of spark 2.0.2
> == Parsed Logical Plan ==
> Filter (id#0L)
> +- Filter cast(id#0L as bigint) IN (cast(5 as bigint),cast(6 as bigint))
>+- Range (0, 10, step=1, splits=Some(4))
> == Analyzed Logical Plan ==
> id: bigint
> Filter (id#0L)
> +- Filter cast(id#0L as bigint) IN (cast(5 as bigint),cast(6 as bigint))
>+- Range (0, 10, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Filter (id#0L IN (5,6) && (id#0L))
> +- Range (0, 10, step=1, splits=Some(4))
> == Physical Plan ==
> *Project [id#0L]
> +- *Filter (id#0L IN (5,6) && pythonUDF0#5)
>+- BatchEvalPython [(id#0L)], [id#0L, pythonUDF0#5]
>   +- *Range (0, 10, step=1, splits=Some(4))
> {code}
> {code}
> # explain of spark 1.5.2
> == Parsed Logical Plan ==
> 'Project [*,PythonUDF#(id#0L) AS sad#1]
>  Filter id#0L IN (cast(5 as bigint),cast(6 as bigint))
>   LogicalRDD [id#0L], MapPartitionsRDD[3] at range at 
> NativeMethodAccessorImpl.java:-2
> == Analyzed Logical Plan ==
> id: bigint, sad: int
> Project [id#0L,sad#1]
>  Project [id#0L,pythonUDF#2 AS sad#1]
>   EvaluatePython PythonUDF#(id#0L), pythonUDF#2
>Filter id#0L IN (cast(5 as bigint),cast(6 as bigint))
> LogicalRDD [id#0L], MapPartitionsRDD[3] at range at 
> NativeMethodAccessorImpl.java:-2
> == Optimized Logical Plan ==
> Project [id#0L,pythonUDF#2 AS sad#1]
>  EvaluatePython PythonUDF#(id#0L), pythonUDF#2
>   Filter id#0L IN (5,6)
>LogicalRDD [id#0L], MapPartitionsRDD[3] at range at 
> NativeMethodAccessorImpl.java:-2
> == Physical Plan ==
> TungstenProject [id#0L,pythonUDF#2 AS sad#1]
>  !BatchPythonEvaluation PythonUDF#(id#0L), [id#0L,pythonUDF#2]
>   Filter id#0L IN (5,6)
>Scan PhysicalRDD[id#0L]
> Code Generation: true
> {code}
> Also, I am not sure if the sql expression support short circuit evaluation, 
> so I do the following experiment
> {code}
> import pyspark
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> def f(x):
> print('in f')
> return True
> f_udf = F.udf(f, T.BooleanType())
> df = sqlc.createDataFrame([(1, 2)], schema=['a', 'b'])
> df = df.where(f_udf('a') | f_udf('b'))
> df.show()
> {code}
> and I got the following output for both spark 1.5.2 and spark 2.0.2
> {code}
> in f
> in f
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  2|
> +---+---+
> {code}
> there is only one element in dataframe df, but the function f has been called 
> twice, so I guess no short circuit.
> the result seems to conflict with #SPARK-1461



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18712) keep the order of sql expression and support short circuit

2016-12-05 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724391#comment-15724391
 ] 

Wenchen Fan edited comment on SPARK-18712 at 12/6/16 5:35 AM:
--

Spark SQL makes no guarantee about the execution order of filter conditions as 
long as the conditions are deterministic. Since UDFs must be deterministic, I'm 
afraid you can't write a filter-condition UDF that depends on previous 
conditions. A workaround is to embed the previous conditions in your UDF.
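
As an illustration of that workaround (shown in Scala rather than the reporter's PySpark, with the lookup table assumed from the JIRA description):

{code}
// Hedged sketch: make the UDF total over all ids instead of assuming the
// isin() filter already ran, so the evaluation order no longer matters.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[2]").appName("udf-order-sketch").getOrCreate()
import spark.implicits._

val table = Map(5L -> true, 6L -> false)
val safeLookup = udf((id: Long) => table.getOrElse(id, false))

val df = spark.range(10).where($"id".isin(5, 6)).where(safeLookup($"id"))
println(df.count())
{code}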

The code snippet works in 1.5 because there is an issue in 2.0: the 
`PushDownPredicates` rule doesn't work for `BatchEvalPythonExec` anymore. 
Anyone wanna work on it? cc [~smilegator] [~dongjoon]. And we also need to 
document the behaviour of filter and say that deterministic filter conditions 
should not depend on previous ones.


was (Author: cloud_fan):
Spark SQL has no guarantee about the filter conditions execution order, if the 
condition is deterministic. According to the fact that UDF must be 
deterministic, I'm afraid you can't write a filter condition UDF that depends 
on previous conditions. A workaround is, embed the previous conditions in your 
UDF.

The code snippet works in 1.5 because there is an issue in 2.0. The 
`PushDownPredicates` rule doesn't work for `BatchEvalPythonExec` anymore, 
anyone wanna work on it? cc [~smilegator] [~dongjoon]

> keep the order of sql expression and support short circuit
> --
>
> Key: SPARK-18712
> URL: https://issues.apache.org/jira/browse/SPARK-18712
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: Ubuntu 16.04
>Reporter: yahsuan, chang
>
> The following python code fails with spark 2.0.2, but works with spark 1.5.2
> {code}
> # a.py
> import pyspark
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> table = {5: True, 6: False}
> df = sqlc.range(10)
> df = df.where(df['id'].isin(5, 6))
> f = F.udf(lambda x: table[x], T.BooleanType())
> df = df.where(f(df['id']))
> # df.explain(True)
> print(df.count())
> {code}
> here is the exception 
> {code}
> KeyError: 0
> {code}
> I guess the problem is about the order of sql expression.
> the following are the explain of two spark version
> {code}
> # explain of spark 2.0.2
> == Parsed Logical Plan ==
> Filter (id#0L)
> +- Filter cast(id#0L as bigint) IN (cast(5 as bigint),cast(6 as bigint))
>+- Range (0, 10, step=1, splits=Some(4))
> == Analyzed Logical Plan ==
> id: bigint
> Filter (id#0L)
> +- Filter cast(id#0L as bigint) IN (cast(5 as bigint),cast(6 as bigint))
>+- Range (0, 10, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Filter (id#0L IN (5,6) && (id#0L))
> +- Range (0, 10, step=1, splits=Some(4))
> == Physical Plan ==
> *Project [id#0L]
> +- *Filter (id#0L IN (5,6) && pythonUDF0#5)
>+- BatchEvalPython [(id#0L)], [id#0L, pythonUDF0#5]
>   +- *Range (0, 10, step=1, splits=Some(4))
> {code}
> {code}
> # explain of spark 1.5.2
> == Parsed Logical Plan ==
> 'Project [*,PythonUDF#(id#0L) AS sad#1]
>  Filter id#0L IN (cast(5 as bigint),cast(6 as bigint))
>   LogicalRDD [id#0L], MapPartitionsRDD[3] at range at 
> NativeMethodAccessorImpl.java:-2
> == Analyzed Logical Plan ==
> id: bigint, sad: int
> Project [id#0L,sad#1]
>  Project [id#0L,pythonUDF#2 AS sad#1]
>   EvaluatePython PythonUDF#(id#0L), pythonUDF#2
>Filter id#0L IN (cast(5 as bigint),cast(6 as bigint))
> LogicalRDD [id#0L], MapPartitionsRDD[3] at range at 
> NativeMethodAccessorImpl.java:-2
> == Optimized Logical Plan ==
> Project [id#0L,pythonUDF#2 AS sad#1]
>  EvaluatePython PythonUDF#(id#0L), pythonUDF#2
>   Filter id#0L IN (5,6)
>LogicalRDD [id#0L], MapPartitionsRDD[3] at range at 
> NativeMethodAccessorImpl.java:-2
> == Physical Plan ==
> TungstenProject [id#0L,pythonUDF#2 AS sad#1]
>  !BatchPythonEvaluation PythonUDF#(id#0L), [id#0L,pythonUDF#2]
>   Filter id#0L IN (5,6)
>Scan PhysicalRDD[id#0L]
> Code Generation: true
> {code}
> Also, I am not sure if the sql expression support short circuit evaluation, 
> so I do the following experiment
> {code}
> import pyspark
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> def f(x):
> print('in f')
> return True
> f_udf = F.udf(f, T.BooleanType())
> df = sqlc.createDataFrame([(1, 2)], schema=['a', 'b'])
> df = df.where(f_udf('a') | f_udf('b'))
> df.show()
> {code}
> and I got the following output for both spark 1.5.2 and spark 2.0.2
> {code}
> in f
> in f
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  2|
> +---+---+
> {code}
> there is only one element in dataframe df, but the function f has been called 
> twice, so I guess no 

[jira] [Commented] (SPARK-18731) Task size in K-means is so large

2016-12-05 Thread Xiaoye Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724415#comment-15724415
 ] 

Xiaoye Sun commented on SPARK-18731:


My concern is not about improving the overall performance of K-means (improving 
the performance may be the outcome, but it is not my immediate goal). My 
concern is only about the huge "scheduler delay" shown on the Web UI page. I am 
working on a network system for Spark, and our system experiment prefers small 
task sizes: in that case, tasks in the same stage can start almost at the same 
time across all the workers. K-means is one of the use cases of our system. I 
am wondering if there can be a K-means implementation with very small task 
sizes, where the data used by the tasks is retrieved after the tasks have 
been deployed on the workers.

I am relatively new to Spark, so maybe what I want is not the way Spark works.

I think the task size should always be small, since I saw the warning at 
the driver complaining about large task sizes. I suppose Spark prefers small 
task sizes, and there should be a way to work around large ones. 

In other words, my question is: "for a task processing large data, can we have 
a small task size so that the scheduler can deploy it quickly and the executor 
retrieves the data after the task has been deployed on the worker?"

Please educate me if I say something wrong.

Thanks!

> Task size in K-means is so large
> 
>
> Key: SPARK-18731
> URL: https://issues.apache.org/jira/browse/SPARK-18731
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: Xiaoye Sun
>Priority: Minor
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> When run the KMeans algorithm for a large model (e.g. 100k features and 100 
> centers), there will be warning shown for many of the stages saying that the 
> task size is very large. Here is an example warning. 
> WARN TaskSetManager: Stage 23 contains a task of very large size (56256 KB). 
> The maximum recommended task size is 100 KB.
> This could happen at (sum at KMeansModel.scala:88), (takeSample at 
> KMeans.scala:378), (aggregate at KMeans.scala:404) and (collect at 
> KMeans.scala:436). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18712) keep the order of sql expression and support short circuit

2016-12-05 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724391#comment-15724391
 ] 

Wenchen Fan commented on SPARK-18712:
-

Spark SQL makes no guarantee about the execution order of filter conditions as 
long as the conditions are deterministic. Since UDFs must be deterministic, I'm 
afraid you can't write a filter-condition UDF that depends on previous 
conditions. A workaround is to embed the previous conditions in your UDF.

The code snippet works in 1.5 because there is an issue in 2.0: the 
`PushDownPredicates` rule doesn't work for `BatchEvalPythonExec` anymore. 
Anyone wanna work on it? cc [~smilegator] [~dongjoon]

> keep the order of sql expression and support short circuit
> --
>
> Key: SPARK-18712
> URL: https://issues.apache.org/jira/browse/SPARK-18712
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: Ubuntu 16.04
>Reporter: yahsuan, chang
>
> The following python code fails with spark 2.0.2, but works with spark 1.5.2
> {code}
> # a.py
> import pyspark
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> table = {5: True, 6: False}
> df = sqlc.range(10)
> df = df.where(df['id'].isin(5, 6))
> f = F.udf(lambda x: table[x], T.BooleanType())
> df = df.where(f(df['id']))
> # df.explain(True)
> print(df.count())
> {code}
> here is the exception 
> {code}
> KeyError: 0
> {code}
> I guess the problem is about the order of sql expression.
> the following are the explain of two spark version
> {code}
> # explain of spark 2.0.2
> == Parsed Logical Plan ==
> Filter (id#0L)
> +- Filter cast(id#0L as bigint) IN (cast(5 as bigint),cast(6 as bigint))
>+- Range (0, 10, step=1, splits=Some(4))
> == Analyzed Logical Plan ==
> id: bigint
> Filter (id#0L)
> +- Filter cast(id#0L as bigint) IN (cast(5 as bigint),cast(6 as bigint))
>+- Range (0, 10, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Filter (id#0L IN (5,6) && (id#0L))
> +- Range (0, 10, step=1, splits=Some(4))
> == Physical Plan ==
> *Project [id#0L]
> +- *Filter (id#0L IN (5,6) && pythonUDF0#5)
>+- BatchEvalPython [(id#0L)], [id#0L, pythonUDF0#5]
>   +- *Range (0, 10, step=1, splits=Some(4))
> {code}
> {code}
> # explain of spark 1.5.2
> == Parsed Logical Plan ==
> 'Project [*,PythonUDF#(id#0L) AS sad#1]
>  Filter id#0L IN (cast(5 as bigint),cast(6 as bigint))
>   LogicalRDD [id#0L], MapPartitionsRDD[3] at range at 
> NativeMethodAccessorImpl.java:-2
> == Analyzed Logical Plan ==
> id: bigint, sad: int
> Project [id#0L,sad#1]
>  Project [id#0L,pythonUDF#2 AS sad#1]
>   EvaluatePython PythonUDF#(id#0L), pythonUDF#2
>Filter id#0L IN (cast(5 as bigint),cast(6 as bigint))
> LogicalRDD [id#0L], MapPartitionsRDD[3] at range at 
> NativeMethodAccessorImpl.java:-2
> == Optimized Logical Plan ==
> Project [id#0L,pythonUDF#2 AS sad#1]
>  EvaluatePython PythonUDF#(id#0L), pythonUDF#2
>   Filter id#0L IN (5,6)
>LogicalRDD [id#0L], MapPartitionsRDD[3] at range at 
> NativeMethodAccessorImpl.java:-2
> == Physical Plan ==
> TungstenProject [id#0L,pythonUDF#2 AS sad#1]
>  !BatchPythonEvaluation PythonUDF#(id#0L), [id#0L,pythonUDF#2]
>   Filter id#0L IN (5,6)
>Scan PhysicalRDD[id#0L]
> Code Generation: true
> {code}
> Also, I am not sure if the sql expression support short circuit evaluation, 
> so I do the following experiment
> {code}
> import pyspark
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> def f(x):
> print('in f')
> return True
> f_udf = F.udf(f, T.BooleanType())
> df = sqlc.createDataFrame([(1, 2)], schema=['a', 'b'])
> df = df.where(f_udf('a') | f_udf('b'))
> df.show()
> {code}
> and I got the following output for both spark 1.5.2 and spark 2.0.2
> {code}
> in f
> in f
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  2|
> +---+---+
> {code}
> there is only one element in dataframe df, but the function f has been called 
> twice, so I guess no short circuit.
> the result seems to conflict with #SPARK-1461



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18712) keep the order of sql expression and support short circuit

2016-12-05 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724381#comment-15724381
 ] 

Cheng Lian edited comment on SPARK-18712 at 12/6/16 5:10 AM:
-

I think the contract here is that, for a DataFrame {{df}} and one or more 
consecutive filter predicates applied to {{df}}, each filter predicate must be 
a full function over the output of {{df}}. Only in this way can we ensure that 
the execution order of the filter predicates is irrelevant. This contract is 
important for optimizations like filter push-down: if we had to preserve the 
execution order of all filter predicates, you wouldn't be able to push down 
{{a}} in {{a AND b}}, and would lose lots of optimization opportunities.

In the case of the snippet in the JIRA description, the first predicate is a 
full function over the output of the original {{df}}, while the second is only 
a partial function.


was (Author: lian cheng):
I think the contract here is that for a DataFrame {{df}} and 1 or more 
consecutive filter predicates applied to {{df}}, each filter predicate must be 
a full function over the output of {{df}}. Only in this way, we can ensure that 
the execution order of all the filter predicates can be irrelevant. This 
contract is important for optimizations like filter push-down. If we have to 
preserve execution order of all filter predicates, you won't be able to push 
down {{a}} in {{a AND b}}, and lose lots of optimization opportunities.

> keep the order of sql expression and support short circuit
> --
>
> Key: SPARK-18712
> URL: https://issues.apache.org/jira/browse/SPARK-18712
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: Ubuntu 16.04
>Reporter: yahsuan, chang
>
> The following python code fails with spark 2.0.2, but works with spark 1.5.2
> {code}
> # a.py
> import pyspark
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> table = {5: True, 6: False}
> df = sqlc.range(10)
> df = df.where(df['id'].isin(5, 6))
> f = F.udf(lambda x: table[x], T.BooleanType())
> df = df.where(f(df['id']))
> # df.explain(True)
> print(df.count())
> {code}
> here is the exception 
> {code}
> KeyError: 0
> {code}
> I guess the problem is about the order of sql expression.
> the following are the explain of two spark version
> {code}
> # explain of spark 2.0.2
> == Parsed Logical Plan ==
> Filter (id#0L)
> +- Filter cast(id#0L as bigint) IN (cast(5 as bigint),cast(6 as bigint))
>+- Range (0, 10, step=1, splits=Some(4))
> == Analyzed Logical Plan ==
> id: bigint
> Filter (id#0L)
> +- Filter cast(id#0L as bigint) IN (cast(5 as bigint),cast(6 as bigint))
>+- Range (0, 10, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Filter (id#0L IN (5,6) && (id#0L))
> +- Range (0, 10, step=1, splits=Some(4))
> == Physical Plan ==
> *Project [id#0L]
> +- *Filter (id#0L IN (5,6) && pythonUDF0#5)
>+- BatchEvalPython [(id#0L)], [id#0L, pythonUDF0#5]
>   +- *Range (0, 10, step=1, splits=Some(4))
> {code}
> {code}
> # explain of spark 1.5.2
> == Parsed Logical Plan ==
> 'Project [*,PythonUDF#(id#0L) AS sad#1]
>  Filter id#0L IN (cast(5 as bigint),cast(6 as bigint))
>   LogicalRDD [id#0L], MapPartitionsRDD[3] at range at 
> NativeMethodAccessorImpl.java:-2
> == Analyzed Logical Plan ==
> id: bigint, sad: int
> Project [id#0L,sad#1]
>  Project [id#0L,pythonUDF#2 AS sad#1]
>   EvaluatePython PythonUDF#(id#0L), pythonUDF#2
>Filter id#0L IN (cast(5 as bigint),cast(6 as bigint))
> LogicalRDD [id#0L], MapPartitionsRDD[3] at range at 
> NativeMethodAccessorImpl.java:-2
> == Optimized Logical Plan ==
> Project [id#0L,pythonUDF#2 AS sad#1]
>  EvaluatePython PythonUDF#(id#0L), pythonUDF#2
>   Filter id#0L IN (5,6)
>LogicalRDD [id#0L], MapPartitionsRDD[3] at range at 
> NativeMethodAccessorImpl.java:-2
> == Physical Plan ==
> TungstenProject [id#0L,pythonUDF#2 AS sad#1]
>  !BatchPythonEvaluation PythonUDF#(id#0L), [id#0L,pythonUDF#2]
>   Filter id#0L IN (5,6)
>Scan PhysicalRDD[id#0L]
> Code Generation: true
> {code}
> Also, I am not sure if the sql expression support short circuit evaluation, 
> so I do the following experiment
> {code}
> import pyspark
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> def f(x):
> print('in f')
> return True
> f_udf = F.udf(f, T.BooleanType())
> df = sqlc.createDataFrame([(1, 2)], schema=['a', 'b'])
> df = df.where(f_udf('a') | f_udf('b'))
> df.show()
> {code}
> and I got the following output for both spark 1.5.2 and spark 2.0.2
> {code}
> in f
> in f
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  2|
> +---+---+
> {code}
> there is only one 

[jira] [Commented] (SPARK-18712) keep the order of sql expression and support short circuit

2016-12-05 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724381#comment-15724381
 ] 

Cheng Lian commented on SPARK-18712:


I think the contract here is that, for a DataFrame {{df}} and one or more 
consecutive filter predicates applied to {{df}}, each filter predicate must be 
a full function over the output of {{df}}. Only in this way can we ensure that 
the execution order of the filter predicates is irrelevant. This contract is 
important for optimizations like filter push-down: if we had to preserve the 
execution order of all filter predicates, you wouldn't be able to push down 
{{a}} in {{a AND b}}, and would lose lots of optimization opportunities.

> keep the order of sql expression and support short circuit
> --
>
> Key: SPARK-18712
> URL: https://issues.apache.org/jira/browse/SPARK-18712
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: Ubuntu 16.04
>Reporter: yahsuan, chang
>
> The following python code fails with spark 2.0.2, but works with spark 1.5.2
> {code}
> # a.py
> import pyspark
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> table = {5: True, 6: False}
> df = sqlc.range(10)
> df = df.where(df['id'].isin(5, 6))
> f = F.udf(lambda x: table[x], T.BooleanType())
> df = df.where(f(df['id']))
> # df.explain(True)
> print(df.count())
> {code}
> here is the exception 
> {code}
> KeyError: 0
> {code}
> I guess the problem is about the order of sql expression.
> the following are the explain of two spark version
> {code}
> # explain of spark 2.0.2
> == Parsed Logical Plan ==
> Filter (id#0L)
> +- Filter cast(id#0L as bigint) IN (cast(5 as bigint),cast(6 as bigint))
>+- Range (0, 10, step=1, splits=Some(4))
> == Analyzed Logical Plan ==
> id: bigint
> Filter (id#0L)
> +- Filter cast(id#0L as bigint) IN (cast(5 as bigint),cast(6 as bigint))
>+- Range (0, 10, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Filter (id#0L IN (5,6) && (id#0L))
> +- Range (0, 10, step=1, splits=Some(4))
> == Physical Plan ==
> *Project [id#0L]
> +- *Filter (id#0L IN (5,6) && pythonUDF0#5)
>+- BatchEvalPython [(id#0L)], [id#0L, pythonUDF0#5]
>   +- *Range (0, 10, step=1, splits=Some(4))
> {code}
> {code}
> # explain of spark 1.5.2
> == Parsed Logical Plan ==
> 'Project [*,PythonUDF#(id#0L) AS sad#1]
>  Filter id#0L IN (cast(5 as bigint),cast(6 as bigint))
>   LogicalRDD [id#0L], MapPartitionsRDD[3] at range at 
> NativeMethodAccessorImpl.java:-2
> == Analyzed Logical Plan ==
> id: bigint, sad: int
> Project [id#0L,sad#1]
>  Project [id#0L,pythonUDF#2 AS sad#1]
>   EvaluatePython PythonUDF#(id#0L), pythonUDF#2
>Filter id#0L IN (cast(5 as bigint),cast(6 as bigint))
> LogicalRDD [id#0L], MapPartitionsRDD[3] at range at 
> NativeMethodAccessorImpl.java:-2
> == Optimized Logical Plan ==
> Project [id#0L,pythonUDF#2 AS sad#1]
>  EvaluatePython PythonUDF#(id#0L), pythonUDF#2
>   Filter id#0L IN (5,6)
>LogicalRDD [id#0L], MapPartitionsRDD[3] at range at 
> NativeMethodAccessorImpl.java:-2
> == Physical Plan ==
> TungstenProject [id#0L,pythonUDF#2 AS sad#1]
>  !BatchPythonEvaluation PythonUDF#(id#0L), [id#0L,pythonUDF#2]
>   Filter id#0L IN (5,6)
>Scan PhysicalRDD[id#0L]
> Code Generation: true
> {code}
> Also, I am not sure if the sql expression support short circuit evaluation, 
> so I do the following experiment
> {code}
> import pyspark
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> def f(x):
> print('in f')
> return True
> f_udf = F.udf(f, T.BooleanType())
> df = sqlc.createDataFrame([(1, 2)], schema=['a', 'b'])
> df = df.where(f_udf('a') | f_udf('b'))
> df.show()
> {code}
> and I got the following output for both spark 1.5.2 and spark 2.0.2
> {code}
> in f
> in f
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  2|
> +---+---+
> {code}
> there is only one element in dataframe df, but the function f has been called 
> twice, so I guess no short circuit.
> the result seems to conflict with #SPARK-1461



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18734) Represent timestamp in StreamingQueryProgress as formatted string instead of millis

2016-12-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18734:


Assignee: Apache Spark  (was: Tathagata Das)

> Represent timestamp in StreamingQueryProgress as formatted string instead of 
> millis
> ---
>
> Key: SPARK-18734
> URL: https://issues.apache.org/jira/browse/SPARK-18734
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> Easier to read when debugging



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18734) Represent timestamp in StreamingQueryProgress as formatted string instead of millis

2016-12-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18734:


Assignee: Tathagata Das  (was: Apache Spark)

> Represent timestamp in StreamingQueryProgress as formatted string instead of 
> millis
> ---
>
> Key: SPARK-18734
> URL: https://issues.apache.org/jira/browse/SPARK-18734
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Easier to read when debugging



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18734) Represent timestamp in StreamingQueryProgress as formatted string instead of millis

2016-12-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724371#comment-15724371
 ] 

Apache Spark commented on SPARK-18734:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/16166

> Represent timestamp in StreamingQueryProgress as formatted string instead of 
> millis
> ---
>
> Key: SPARK-18734
> URL: https://issues.apache.org/jira/browse/SPARK-18734
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Easier to read when debugging



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18734) Represent timestamp in StreamingQueryProgress as formatted string instead of millis

2016-12-05 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-18734:
-

 Summary: Represent timestamp in StreamingQueryProgress as 
formatted string instead of millis
 Key: SPARK-18734
 URL: https://issues.apache.org/jira/browse/SPARK-18734
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das


Easier to read when debugging
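
A small illustration of the readability difference (hypothetical helper code, not the actual patch):

{code}
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

// Format epoch millis as an ISO-8601-style UTC string.
val fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
println(1480912345678L)                        // hard to read at a glance
println(fmt.format(new Date(1480912345678L)))  // 2016-12-05T04:32:25.678Z
{code}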



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18734) Represent timestamp in StreamingQueryProgress as formatted string instead of millis

2016-12-05 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-18734:
--
Issue Type: Improvement  (was: Bug)

> Represent timestamp in StreamingQueryProgress as formatted string instead of 
> millis
> ---
>
> Key: SPARK-18734
> URL: https://issues.apache.org/jira/browse/SPARK-18734
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> Easier to read when debugging



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18733) Spark history server file cleaner excludes in-progress files

2016-12-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18733:


Assignee: (was: Apache Spark)

> Spark history server file cleaner excludes in-progress files
> 
>
> Key: SPARK-18733
> URL: https://issues.apache.org/jira/browse/SPARK-18733
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.2
>Reporter: Ergin Seyfe
>
> When we restart history server, it does spend a lot of time to load/replay  
> incomplete applications which mean the inprogress log files in the log folder.
> We have already enabled "spark.history.fs.cleaner.enabled" but  seems like 
> it's skipping the inprogress files.
> I checked the log folder and saw that there are many old orphan files. 
> Probably files left over due to spark-driver failures or OOMs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18733) Spark history server file cleaner excludes in-progress files

2016-12-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18733:


Assignee: Apache Spark

> Spark history server file cleaner excludes in-progress files
> 
>
> Key: SPARK-18733
> URL: https://issues.apache.org/jira/browse/SPARK-18733
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.2
>Reporter: Ergin Seyfe
>Assignee: Apache Spark
>
> When we restart history server, it does spend a lot of time to load/replay  
> incomplete applications which mean the inprogress log files in the log folder.
> We have already enabled "spark.history.fs.cleaner.enabled" but  seems like 
> it's skipping the inprogress files.
> I checked the log folder and saw that there are many old orphan files. 
> Probably files left over due to spark-driver failures or OOMs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18733) Spark history server file cleaner excludes in-progress files

2016-12-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724362#comment-15724362
 ] 

Apache Spark commented on SPARK-18733:
--

User 'seyfe' has created a pull request for this issue:
https://github.com/apache/spark/pull/16165

> Spark history server file cleaner excludes in-progress files
> 
>
> Key: SPARK-18733
> URL: https://issues.apache.org/jira/browse/SPARK-18733
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.2
>Reporter: Ergin Seyfe
>
> When we restart history server, it does spend a lot of time to load/replay  
> incomplete applications which mean the inprogress log files in the log folder.
> We have already enabled "spark.history.fs.cleaner.enabled" but  seems like 
> it's skipping the inprogress files.
> I checked the log folder and saw that there are many old orphan files. 
> Probably files left over due to spark-driver failures or OOMs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18733) Spark history server file cleaner excludes in-progress files

2016-12-05 Thread Ergin Seyfe (JIRA)
Ergin Seyfe created SPARK-18733:
---

 Summary: Spark history server file cleaner excludes in-progress 
files
 Key: SPARK-18733
 URL: https://issues.apache.org/jira/browse/SPARK-18733
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.0.2
Reporter: Ergin Seyfe


When we restart the history server, it spends a lot of time loading/replaying 
incomplete applications, i.e. the in-progress log files in the log folder.

We have already enabled "spark.history.fs.cleaner.enabled", but it seems to be 
skipping the in-progress files.

I checked the log folder and saw that there are many old orphan files, probably 
files left over from spark-driver failures or OOMs.
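
For context, a sketch of the cleaner settings involved (spark-defaults.conf on the history server host; the interval and max-age values below are the documented defaults, not necessarily ours):

{code}
spark.history.fs.cleaner.enabled   true
spark.history.fs.cleaner.interval  1d
spark.history.fs.cleaner.maxAge    7d
{code}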



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18731) Task size in K-means is so large

2016-12-05 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724335#comment-15724335
 ] 

yuhao yang commented on SPARK-18731:


Based on my experience, KMeans is generally fast even for large data sizes. 
100K * 100 is not that large (80 MB). If the warning concerns you, just try 
different partition sizes to see if there's any performance improvement 
(probably not). You can also try to optimize K-Means with GEMM operations 
and use a native BLAS library, which can make it much faster depending on 
your data size.



> Task size in K-means is so large
> 
>
> Key: SPARK-18731
> URL: https://issues.apache.org/jira/browse/SPARK-18731
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: Xiaoye Sun
>Priority: Minor
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> When run the KMeans algorithm for a large model (e.g. 100k features and 100 
> centers), there will be warning shown for many of the stages saying that the 
> task size is very large. Here is an example warning. 
> WARN TaskSetManager: Stage 23 contains a task of very large size (56256 KB). 
> The maximum recommended task size is 100 KB.
> This could happen at (sum at KMeansModel.scala:88), (takeSample at 
> KMeans.scala:378), (aggregate at KMeans.scala:404) and (collect at 
> KMeans.scala:436). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18721) ForeachSink breaks Watermark in append mode

2016-12-05 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-18721.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 16160
[https://github.com/apache/spark/pull/16160]

> ForeachSink breaks Watermark in append mode
> ---
>
> Key: SPARK-18721
> URL: https://issues.apache.org/jira/browse/SPARK-18721
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Cristian Opris
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.1.0
>
>
> The watermark is not updated in append mode with a ForeachSink
> Because ForeachSink creates a separate IncrementalExecution instance, the 
> physical plan will be recreated for the logical plan, which results in a new 
> EventTimeWatermarkExec operator being created, that's unreachable from 
> StreamExecution. This results in the watermark never being updated, and 
> append mode never emits results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18672) Close recordwriter in SparkHadoopMapReduceWriter before committing

2016-12-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18672.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16098
[https://github.com/apache/spark/pull/16098]

> Close recordwriter in SparkHadoopMapReduceWriter before committing
> --
>
> Key: SPARK-18672
> URL: https://issues.apache.org/jira/browse/SPARK-18672
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Hyukjin Kwon
> Fix For: 2.2.0
>
>
> It seems some APIs such as {{PairRDDFunctions.saveAsHadoopDataset()}} do not 
> close the record writer before issuing the commit for the task.
> On Windows, the output in the temp directory is being open and output 
> committer tries to rename it from temp directory to the output directory 
> after finishing writing. 
> So, it fails to move the file. It seems we should close the writer actually 
> before committing the task like the other writers such as 
> {{FileFormatWriter}}.
> Identified failure was as below:
> {code}
> FAILURE! - in org.apache.spark.JavaAPISuite
> writeWithNewAPIHadoopFile(org.apache.spark.JavaAPISuite)  Time elapsed: 0.25 
> sec  <<< ERROR!
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.JavaAPISuite.writeWithNewAPIHadoopFile(JavaAPISuite.java:1231)
> Caused by: org.apache.spark.SparkException: 
> Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor 
> driver): org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.org$apache$spark$internal$io$SparkHadoopMapReduceWriter$$executeTask(SparkHadoopMapReduceWriter.scala:182)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$3.apply(SparkHadoopMapReduceWriter.scala:100)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$3.apply(SparkHadoopMapReduceWriter.scala:99)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Could not rename 
> file:/C:/projects/spark/core/target/tmp/1480553515529-0/output/_temporary/0/_temporary/attempt_20161201005155__r_00_0
>  to 
> file:/C:/projects/spark/core/target/tmp/1480553515529-0/output/_temporary/0/task_20161201005155__r_00
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:436)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:415)
>   at 
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:50)
>   at 
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:76)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitTask(HadoopMapReduceCommitProtocol.scala:153)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$4.apply(SparkHadoopMapReduceWriter.scala:167)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$4.apply(SparkHadoopMapReduceWriter.scala:156)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.org$apache$spark$internal$io$SparkHadoopMapReduceWriter$$executeTask(SparkHadoopMapReduceWriter.scala:168)
>   ... 8 more
> Driver stacktrace:
>   at 
> org.apache.spark.JavaAPISuite.writeWithNewAPIHadoopFile(JavaAPISuite.java:1231)
> Caused by: org.apache.spark.SparkException: Task failed while writing rows
> Caused by: java.io.IOException: Could not rename 
> file:/C:/projects/spark/core/target/tmp/1480553515529-0/output/_temporary/0/_temporary/attempt_20161201005155__r_00_0
>  to 
> file:/C:/projects/spark/core/target/tmp/1480553515529-0/output/_temporary/0/task_20161201005155__r_00
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17591) Fix/investigate the failure of tests in Scala On Windows

2016-12-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17591:
--
Assignee: Hyukjin Kwon

> Fix/investigate the failure of tests in Scala On Windows
> 
>
> Key: SPARK-17591
> URL: https://issues.apache.org/jira/browse/SPARK-17591
> Project: Spark
>  Issue Type: Test
>  Components: Build, DStreams, Spark Core, SQL
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>
> {code}
> Tests run: 90, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 17.53 sec 
> <<< FAILURE! - in org.apache.spark.JavaAPISuite
> wholeTextFiles(org.apache.spark.JavaAPISuite)  Time elapsed: 0.313 sec  <<< 
> FAILURE!
> java.lang.AssertionError: 
> expected: > but was:
>   at org.apache.spark.JavaAPISuite.wholeTextFiles(JavaAPISuite.java:1089)
> {code}
> {code}
> Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.062 sec <<< 
> FAILURE! - in org.apache.spark.launcher.SparkLauncherSuite
> testChildProcLauncher(org.apache.spark.launcher.SparkLauncherSuite)  Time 
> elapsed: 0.047 sec  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<1>
>   at 
> org.apache.spark.launcher.SparkLauncherSuite.testChildProcLauncher(SparkLauncherSuite.java:177)
> {code}
> {code}
> Tests run: 53, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.325 sec 
> <<< FAILURE! - in org.apache.spark.streaming.JavaAPISuite
> testCheckpointMasterRecovery(org.apache.spark.streaming.JavaAPISuite)  Time 
> elapsed: 3.418 sec  <<< ERROR!
> java.io.IOException: Failed to delete: 
> C:\projects\spark\streaming\target\tmp\1474255953021-0
>   at 
> org.apache.spark.streaming.JavaAPISuite.testCheckpointMasterRecovery(JavaAPISuite.java:1808)
> Running org.apache.spark.streaming.JavaDurationSuite
> {code}
> {code}
> Running org.apache.spark.streaming.JavaAPISuite
> Tests run: 53, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.325 sec 
> <<< FAILURE! - in org.apache.spark.streaming.JavaAPISuite
> testCheckpointMasterRecovery(org.apache.spark.streaming.JavaAPISuite)  Time 
> elapsed: 3.418 sec  <<< ERROR!
> java.io.IOException: Failed to delete: 
> C:\projects\spark\streaming\target\tmp\1474255953021-0
>   at 
> org.apache.spark.streaming.JavaAPISuite.testCheckpointMasterRecovery(JavaAPISuite.java:1808)
> {code}
> {code}
> Results :
> Tests in error: 
>   JavaAPISuite.testCheckpointMasterRecovery:1808 � IO Failed to delete: 
> C:\proje...
> Tests run: 74, Failures: 0, Errors: 1, Skipped: 0
> {code}
> The tests were aborted for an unknown reason during the SQL tests - 
> {{BroadcastJoinSuite}} kept emitting the exceptions below continuously:
> {code}
> 20:48:09.876 ERROR org.apache.spark.deploy.worker.ExecutorRunner: Error 
> running executor
> java.io.IOException: Cannot run program "C:\Progra~1\Java\jdk1.8.0\bin\java" 
> (in directory "C:\projects\spark\work\app-20160918204809-\0"): 
> CreateProcess error=206, The filename or extension is too long
>   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
>   at 
> org.apache.spark.deploy.worker.ExecutorRunner.org$apache$spark$deploy$worker$ExecutorRunner$$fetchAndRunExecutor(ExecutorRunner.scala:167)
>   at 
> org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:73)
> Caused by: java.io.IOException: CreateProcess error=206, The filename or 
> extension is too long
>   at java.lang.ProcessImpl.create(Native Method)
>   at java.lang.ProcessImpl.(ProcessImpl.java:386)
>   at java.lang.ProcessImpl.start(ProcessImpl.java:137)
>   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
>   ... 2 more
> {code}
> Here is the full log for the test - 
> https://ci.appveyor.com/project/spark-test/spark/build/15-scala-tests
> We may have to create sub-tasks if these are actual issues on Windows rather 
> than just mistakes in the tests.
> I am willing to test this again after fixing some of the issues here, in 
> particular the last one.
> I trigger the build with the commands below:
> {code}
> mvn -DskipTests -Phadoop-2.6 -Phive -Phive-thriftserver package
> mvn -Phadoop-2.6 -Phive -Phive-thriftserver --fail-never test
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18672) Close recordwriter in SparkHadoopMapReduceWriter before committing

2016-12-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18672:
--
Assignee: Hyukjin Kwon

> Close recordwriter in SparkHadoopMapReduceWriter before committing
> --
>
> Key: SPARK-18672
> URL: https://issues.apache.org/jira/browse/SPARK-18672
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.2.0
>
>
> It seems some APIs such as {{PairRDDFunctions.saveAsHadoopDataset()}} do not 
> close the record writer before issuing the commit for the task.
> On Windows, the output file in the temp directory is still open when the 
> output committer tries to rename it from the temp directory to the output 
> directory after writing finishes, so the rename fails. It seems we should 
> close the writer before committing the task, as the other writers such as 
> {{FileFormatWriter}} already do.
> Identified failure was as below:
> {code}
> FAILURE! - in org.apache.spark.JavaAPISuite
> writeWithNewAPIHadoopFile(org.apache.spark.JavaAPISuite)  Time elapsed: 0.25 
> sec  <<< ERROR!
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.JavaAPISuite.writeWithNewAPIHadoopFile(JavaAPISuite.java:1231)
> Caused by: org.apache.spark.SparkException: 
> Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor 
> driver): org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.org$apache$spark$internal$io$SparkHadoopMapReduceWriter$$executeTask(SparkHadoopMapReduceWriter.scala:182)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$3.apply(SparkHadoopMapReduceWriter.scala:100)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$3.apply(SparkHadoopMapReduceWriter.scala:99)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Could not rename 
> file:/C:/projects/spark/core/target/tmp/1480553515529-0/output/_temporary/0/_temporary/attempt_20161201005155__r_00_0
>  to 
> file:/C:/projects/spark/core/target/tmp/1480553515529-0/output/_temporary/0/task_20161201005155__r_00
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:436)
>   at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:415)
>   at 
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:50)
>   at 
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:76)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitTask(HadoopMapReduceCommitProtocol.scala:153)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$4.apply(SparkHadoopMapReduceWriter.scala:167)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$$anonfun$4.apply(SparkHadoopMapReduceWriter.scala:156)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.org$apache$spark$internal$io$SparkHadoopMapReduceWriter$$executeTask(SparkHadoopMapReduceWriter.scala:168)
>   ... 8 more
> Driver stacktrace:
>   at 
> org.apache.spark.JavaAPISuite.writeWithNewAPIHadoopFile(JavaAPISuite.java:1231)
> Caused by: org.apache.spark.SparkException: Task failed while writing rows
> Caused by: java.io.IOException: Could not rename 
> file:/C:/projects/spark/core/target/tmp/1480553515529-0/output/_temporary/0/_temporary/attempt_20161201005155__r_00_0
>  to 
> file:/C:/projects/spark/core/target/tmp/1480553515529-0/output/_temporary/0/task_20161201005155__r_00
> {code}
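For readers landing here from the stack trace, the ordering constraint described above can be sketched as follows. This is a hedged illustration using the plain Hadoop mapreduce API with placeholder names (`writeAndCommit`, `writer`, `committer`, `taskContext`), not the actual SparkHadoopMapReduceWriter code:

{code}
import org.apache.hadoop.mapreduce.{OutputCommitter, RecordWriter, TaskAttemptContext}

// Sketch of the close-before-commit ordering: the record writer must be closed
// (releasing the temporary output file) before the committer renames it.
def writeAndCommit[K, V](
    writer: RecordWriter[K, V],
    committer: OutputCommitter,
    taskContext: TaskAttemptContext,
    records: Iterator[(K, V)]): Unit = {
  try {
    records.foreach { case (k, v) => writer.write(k, v) }
  } finally {
    // Close first: on Windows an open file cannot be renamed by the committer.
    writer.close(taskContext)
  }
  // Commit (which renames the temporary output) only after the writer is
  // closed, and only if writing succeeded; a write failure propagates out of
  // the finally block before commitTask is reached.
  committer.commitTask(taskContext)
}
{code}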



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18684) Spark Executors off-heap memory usage keeps increasing while running spark streaming

2016-12-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18684.
---
Resolution: Not A Problem

> Spark Executors off-heap memory usage keeps increasing while running spark 
> streaming
> 
>
> Key: SPARK-18684
> URL: https://issues.apache.org/jira/browse/SPARK-18684
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.2, 2.0.2
>Reporter: Krishna Gandra
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18572) Use the hive client method "getPartitionNames" to answer "SHOW PARTITIONS" queries on partitioned Hive tables

2016-12-05 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18572.
-
   Resolution: Fixed
 Assignee: Michael Allman
Fix Version/s: 2.1.0

> Use the hive client method "getPartitionNames" to answer "SHOW PARTITIONS" 
> queries on partitioned Hive tables
> -
>
> Key: SPARK-18572
> URL: https://issues.apache.org/jira/browse/SPARK-18572
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Allman
>Assignee: Michael Allman
> Fix For: 2.1.0
>
>
> Currently Spark answers the {{SHOW PARTITIONS}} query by fetching all of the 
> table's partition metadata from the external catalog and constructing 
> partition names therefrom. The Hive client has a {{getPartitionNames}} method 
> which is orders of magnitude faster, with the performance improvement scaling 
> up with the number of partitions in the table. I believe we can use this 
> method to great effect.
> Further details are provided in the associated PR.
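For context, a minimal usage sketch of the query whose metadata path this change speeds up; the SparkSession setup and the table name `db.partitioned_table` are assumptions for illustration:

{code}
import org.apache.spark.sql.SparkSession

// Hive support is needed so SHOW PARTITIONS goes through the Hive client.
val spark = SparkSession.builder()
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

// With this change, answering the query only asks Hive for partition names
// instead of fetching full partition metadata for every partition.
spark.sql("SHOW PARTITIONS db.partitioned_table").show()
{code}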



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18709) Automatic null conversion bug (instead of throwing error) when creating a Spark DataFrame with incompatible types for fields.

2016-12-05 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724144#comment-15724144
 ] 

Dongjoon Hyun commented on SPARK-18709:
---

I'll check which commit added the guard condition.

> Automatic null conversion bug (instead of throwing error) when creating a 
> Spark DataFrame with incompatible types for fields.
> 
>
> Key: SPARK-18709
> URL: https://issues.apache.org/jira/browse/SPARK-18709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 1.6.3
>Reporter: Amogh Param
>  Labels: bug
> Fix For: 2.0.0
>
>
> When converting an RDD with a `float` type field to a spark dataframe with an 
> `IntegerType` / `LongType` schema field, spark 1.6.2 and 1.6.3 silently 
> convert the field values to nulls instead of throwing an error like `LongType 
> can not accept object ___ in type `. However, this seems to be 
> fixed in Spark 2.0.2.
> The following example should make the problem clear:
> {code}
> from pyspark.sql.types import StructField, StructType, LongType, DoubleType
> schema = StructType([
> StructField("0", LongType(), True),
> StructField("1", DoubleType(), True),
> ])
> data = [[1.0, 1.0], [nan, 2.0]]
> spark_df = sqlContext.createDataFrame(sc.parallelize(data), schema)
> spark_df.show()
> {code}
> Instead of throwing an error like:
> {code}
> LongType can not accept object 1.0 in type 
> {code}
> Spark converts all the values in the first column to nulls
> Running `spark_df.show()` gives:
> {code}
> ++---+
> |   0|  1|
> ++---+
> |null|1.0|
> |null|1.0|
> ++---+
> {code}
> For the purposes of my computation, I run `mapPartitions` on a Spark data 
> frame, convert each partition into a pandas data frame, do a few computations 
> on it, and return a list of lists, which `mapPartitions` turns back into an 
> RDD (across all partitions). This RDD is then converted into a Spark data 
> frame, similar to the example above, using 
> `sqlContext.createDataFrame(rdd, schema)`. The RDD has a column that should 
> become a `LongType` in the Spark data frame, but since it has missing values, 
> it is a `float` type. When Spark tries to create the data frame, it converts 
> all the values in that column to nulls instead of throwing an error that 
> there is a type mismatch.
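One way to sidestep the silent coercion, sketched here in Scala rather than PySpark, is to declare the column with its actual type and make the Long conversion an explicit cast; the SparkSession and sample data below are illustrative assumptions, not taken from the report:

{code}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Declare the column with the type it really has (nullable Double), then cast.
val schema = StructType(Seq(
  StructField("0", DoubleType, nullable = true),
  StructField("1", DoubleType, nullable = true)))

val rdd = spark.sparkContext.parallelize(Seq(
  Row(1.0, 1.0),
  Row(null, 2.0)))   // missing value represented as null up front

// The Double-to-Long conversion is now an explicit, visible cast in the plan
// instead of a silent coercion during createDataFrame.
spark.createDataFrame(rdd, schema)
  .withColumn("0", col("0").cast(LongType))
  .show()
{code}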



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18731) Task size in K-means is so large

2016-12-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724147#comment-15724147
 ] 

Sean Owen commented on SPARK-18731:
---

Although later versions may be more efficient, I don't think this is a problem. 
You have 100 huge centroids, that are inherently dense, so they are going to be 
large to send around. The warning is correct but it doesn't mean there's a 
problem.
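For reference, the usual way to keep task closures small when a large dense structure such as a centroid array must be shared is to broadcast it once per executor instead of shipping it with every task; a generic sketch (not the KMeans internals), assuming a SparkContext named `sc`:

{code}
// Roughly the shape of the data in question: 100 dense centers of 100k features.
val bigCenters: Array[Array[Double]] = Array.fill(100)(Array.fill(100000)(1.0))

// Broadcasting sends the array to each executor once; referencing the broadcast
// handle inside the closure keeps the serialized task itself small.
val centersBc = sc.broadcast(bigCenters)

val result = sc.parallelize(0 until 1000).map { i =>
  centersBc.value.length + i   // use centersBc.value, not bigCenters, in tasks
}
result.count()
{code}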

> Task size in K-means is so large
> 
>
> Key: SPARK-18731
> URL: https://issues.apache.org/jira/browse/SPARK-18731
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: Xiaoye Sun
>Priority: Minor
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> When running the KMeans algorithm for a large model (e.g. 100k features and 
> 100 centers), a warning is shown for many of the stages saying that the task 
> size is very large. Here is an example warning:
> WARN TaskSetManager: Stage 23 contains a task of very large size (56256 KB). 
> The maximum recommended task size is 100 KB.
> This could happen at (sum at KMeansModel.scala:88), (takeSample at 
> KMeans.scala:378), (aggregate at KMeans.scala:404) and (collect at 
> KMeans.scala:436). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18732) The Y axis ranges of "schedulingDelay", "processingTime", and "totalDelay" should not keep the same.

2016-12-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18732:


Assignee: (was: Apache Spark)

> The Y axis ranges of "schedulingDelay", "processingTime", and "totalDelay" 
> should not keep the same.
> 
>
> Key: SPARK-18732
> URL: https://issues.apache.org/jira/browse/SPARK-18732
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.0.2
>Reporter: Genmao Yu
>Priority: Minor
>
> Currently, the Y axis ranges of "schedulingDelay", "processingTime", and 
> "totalDelay" are kept the same. This is not convenient for users to read, as 
> shown below:
> !https://cloud.githubusercontent.com/assets/7402327/20911149/09004f26-bba1-11e6-9ff4-af2052979dce.png!
> So, we may need to separate the Y axis ranges of "schedulingDelay", 
> "processingTime", and "totalDelay".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18732) The Y axis ranges of "schedulingDelay", "processingTime", and "totalDelay" should not keep the same.

2016-12-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724139#comment-15724139
 ] 

Apache Spark commented on SPARK-18732:
--

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16164

> The Y axis ranges of "schedulingDelay", "processingTime", and "totalDelay" 
> should not keep the same.
> 
>
> Key: SPARK-18732
> URL: https://issues.apache.org/jira/browse/SPARK-18732
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.0.2
>Reporter: Genmao Yu
>Priority: Minor
>
> Currently, the Y axis ranges of "schedulingDelay", "processingTime", and 
> "totalDelay" are kept the same. This is not convenient for users to read, as 
> shown below:
> !https://cloud.githubusercontent.com/assets/7402327/20911149/09004f26-bba1-11e6-9ff4-af2052979dce.png!
> So, we may need to separate the Y axis ranges of "schedulingDelay", 
> "processingTime", and "totalDelay".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18732) The Y axis ranges of "schedulingDelay", "processingTime", and "totalDelay" should not keep the same.

2016-12-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18732:


Assignee: Apache Spark

> The Y axis ranges of "schedulingDelay", "processingTime", and "totalDelay" 
> should not keep the same.
> 
>
> Key: SPARK-18732
> URL: https://issues.apache.org/jira/browse/SPARK-18732
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.0.2
>Reporter: Genmao Yu
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, the Y axis ranges of "schedulingDelay", "processingTime", and 
> "totalDelay" are kept the same. This is not convenient for users to read, as 
> shown below:
> !https://cloud.githubusercontent.com/assets/7402327/20911149/09004f26-bba1-11e6-9ff4-af2052979dce.png!
> So, we may need to separate the Y axis ranges of "schedulingDelay", 
> "processingTime", and "totalDelay".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18722) Move no data rate limit from StreamExecution to ProgressReporter

2016-12-05 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-18722.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 16155
[https://github.com/apache/spark/pull/16155]

> Move no data rate limit from StreamExecution to ProgressReporter
> 
>
> Key: SPARK-18722
> URL: https://issues.apache.org/jira/browse/SPARK-18722
> Project: Spark
>  Issue Type: Bug
>Reporter: Shixiong Zhu
> Fix For: 2.1.0
>
>
> So that we can also limit items in `recentProgresses`



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18732) The Y axis ranges of "schedulingDelay", "processingTime", and "totalDelay" should not keep the same.

2016-12-05 Thread Genmao Yu (JIRA)
Genmao Yu created SPARK-18732:
-

 Summary: The Y axis ranges of "schedulingDelay", "processingTime", 
and "totalDelay" should not keep the same.
 Key: SPARK-18732
 URL: https://issues.apache.org/jira/browse/SPARK-18732
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.0.2
Reporter: Genmao Yu
Priority: Minor


Currently, the Y axis ranges of "schedulingDelay", "processingTime", and 
"totalDelay" are kept the same. This is not convenient for users to read, as 
shown below:

!https://cloud.githubusercontent.com/assets/7402327/20911149/09004f26-bba1-11e6-9ff4-af2052979dce.png!

So, we may need to separate the Y axis ranges of "schedulingDelay", 
"processingTime", and "totalDelay".





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18593) JDBCRDD returns incorrect results for filters on CHAR of PostgreSQL

2016-12-05 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724129#comment-15724129
 ] 

Dongjoon Hyun commented on SPARK-18593:
---

Oops. Thank you for correction.

> JDBCRDD returns incorrect results for filters on CHAR of PostgreSQL
> ---
>
> Key: SPARK-18593
> URL: https://issues.apache.org/jira/browse/SPARK-18593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 1.6.3
>Reporter: Durga Prasad Gunturu
>Assignee: Takeshi Yamamuro
>Priority: Minor
>  Labels: correctness
> Fix For: 2.0.0
>
>
> In Apache Spark 1.6.x, JDBCRDD returns incorrect results for a query with 
> filters on a CHAR column of PostgreSQL CHAR type. The root cause is that 
> PostgreSQL returns a `space padded string` in the result, so the 
> post-processing filter `Filter (a#0 = A)` evaluates to false. Spark 2.0.0 
> removes the post filter because it is already handled in the database by 
> `PushedFilters: [EqualTo(a,A)]`.
> {code}
> scala> val t_char = sqlContext.read.option("user", 
> "postgres").option("password", 
> "rootpass").jdbc("jdbc:postgresql://localhost:5432/postgres", "t_char", new 
> java.util.Properties())
> t_char: org.apache.spark.sql.DataFrame = [a: string]
> scala> val t_varchar = sqlContext.read.option("user", 
> "postgres").option("password", 
> "rootpass").jdbc("jdbc:postgresql://localhost:5432/postgres", "t_varchar", 
> new java.util.Properties())
> t_varchar: org.apache.spark.sql.DataFrame = [a: string]
> scala> t_char.show
> +--+
> | a|
> +--+
> |A |
> |AA|
> |AAA   |
> +--+
> scala> t_varchar.show
> +---+
> |  a|
> +---+
> |  A|
> | AA|
> |AAA|
> +---+
> scala> t_char.filter(t_char("a")==="A").show
> +---+
> |  a|
> +---+
> +---+
> scala> t_char.filter(t_char("a")==="A ").show
> +--+
> | a|
> +--+
> |A |
> +--+
> scala> t_varchar.filter(t_varchar("a")==="A").show
> +---+
> |  a|
> +---+
> |  A|
> +---+
> scala> t_char.filter(t_char("a")==="A").explain
> == Physical Plan ==
> Filter (a#0 = A)
> +- Scan 
> JDBCRelation(jdbc:postgresql://localhost:5432/postgres,t_char,[Lorg.apache.spark.Partition;@2f65c341,{user=postgres,
>  password=rootpass})[a#0] PushedFilters: [EqualTo(a,A)]
> {code}
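For users stuck on 1.6.x, one possible workaround (an illustration only, not the fix) is to strip the CHAR padding before comparing, reusing the `t_char` DataFrame from the reproduction above:

{code}
import org.apache.spark.sql.functions.rtrim

// Compare against the right-trimmed column so PostgreSQL's CHAR space padding
// does not defeat the post-processing filter.
t_char.filter(rtrim(t_char("a")) === "A").show()
{code}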



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18731) Task size in K-means is so large

2016-12-05 Thread Xiaoye Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724121#comment-15724121
 ] 

Xiaoye Sun commented on SPARK-18731:


Could you please provide a link to the solution?
Large task sizes may result in huge scheduler delay in the performance metric 
shown on the web UI. Is there any other way to overcome this?

> Task size in K-means is so large
> 
>
> Key: SPARK-18731
> URL: https://issues.apache.org/jira/browse/SPARK-18731
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: Xiaoye Sun
>Priority: Minor
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> When running the KMeans algorithm for a large model (e.g. 100k features and 
> 100 centers), a warning is shown for many of the stages saying that the task 
> size is very large. Here is an example warning:
> WARN TaskSetManager: Stage 23 contains a task of very large size (56256 KB). 
> The maximum recommended task size is 100 KB.
> This could happen at (sum at KMeansModel.scala:88), (takeSample at 
> KMeans.scala:378), (aggregate at KMeans.scala:404) and (collect at 
> KMeans.scala:436). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18555) na.fill messes up original values in long integers

2016-12-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18555.
-
   Resolution: Fixed
 Assignee: Song Jun
Fix Version/s: 2.2.0

> na.fill messes up original values in long integers
> 
>
> Key: SPARK-18555
> URL: https://issues.apache.org/jira/browse/SPARK-18555
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Mahmoud Rawas
>Assignee: Song Jun
>Priority: Critical
> Fix For: 2.2.0
>
>
> Mainly, the issue is clarified by the following example:
> Given a Dataset: 
> scala> data.show
> |  a|  b|
> |  1|  2|
> | -1| -2|
> |9123146099426677101|9123146560113991650|
> In theory, calling na.fill(0) should change nothing, but the current result 
> is:
> scala> data.na.fill(0).show
> |  a|  b|
> |  1|  2|
> | -1| -2|
> |9123146099426676736|9123146560113991680|
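The corrupted values above are exactly what a round trip through Double produces, which is consistent with the fill path converting the long column; a minimal plain-Scala check:

{code}
// Longs this large cannot be represented exactly as Double, so a round trip
// drops the low-order bits and yields the corrupted value shown above.
val original = 9123146099426677101L
val roundTripped = original.toDouble.toLong   // 9123146099426676736
assert(roundTripped != original)
{code}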



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18711) NPE in generated SpecificMutableProjection for Aggregator

2016-12-05 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724115#comment-15724115
 ] 

koert kuipers commented on SPARK-18711:
---

confirmed it resolved the issue for me. thanks

> NPE in generated SpecificMutableProjection for Aggregator
> -
>
> Key: SPARK-18711
> URL: https://issues.apache.org/jira/browse/SPARK-18711
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: koert kuipers
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>
> this is a bug in the branch-2.1, but i don't think it was in 2.1.0-rc1
> code (contrived, but based on real code we run):
> {noformat}
>   case class Holder(i: Int)
>   val agg1 = new Aggregator[Int, Tuple1[Option[Holder]], Seq[(String, Int, 
> Int)]] {
> def zero: Tuple1[Option[Holder]] = {
>   val x = Tuple1(None)
>   println(s"zero ${x}")
>   x
> }
> def reduce(b: Tuple1[Option[Holder]], a: Int): Tuple1[Option[Holder]] = {
>   println(s"reduce ${b} ${a}")
>   Tuple1(Some(Holder(b._1.map(_.i + a).getOrElse(a
> }
> def merge(b1: Tuple1[Option[Holder]], b2: Tuple1[Option[Holder]]): 
> Tuple1[Option[Holder]] = {
>   println(s"merge ${b1} ${b2}")
>   (b1._1, b2._1) match {
> case (Some(Holder(i1)), Some(Holder(i2))) => Tuple1(Some(Holder(i1 + 
> i2)))
> case (Some(Holder(i1)), _) => Tuple1(Some(Holder(i1)))
> case (_, Some(Holder(i2))) => Tuple1(Some(Holder(i2)))
> case _ => Tuple1(None)
>   }
> }
> def finish(reduction: Tuple1[Option[Holder]]): Seq[(String, Int, Int)] = {
>   println(s"finish ${reduction}")
>   Seq(("ha", reduction._1.get.i, 0))
> }
> def bufferEncoder: Encoder[Tuple1[Option[Holder]]] = 
> ExpressionEncoder[Tuple1[Option[Holder]]]()
> def outputEncoder: Encoder[Seq[(String, Int, Int)]] = 
> ExpressionEncoder[Seq[(String, Int, Int)]]()
>   }
>   val x = Seq(("a", 1), ("a", 2))
> .toDS
> .groupByKey(_._1)
> .mapValues(_._2)
> .agg(agg1.toColumn)
>   x.printSchema
>   x.show
> {noformat}
> result is:
> {noformat}
> org.apache.spark.executor.Executor: Exception in task 1.0 in stage 146.0 (TID 
> 423)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$1.apply(AggregationIterator.scala:223)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$1.apply(AggregationIterator.scala:221)
>   at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:159)
>   at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:232)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> the error seems to be in the code generation for the aggregator result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-05 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724110#comment-15724110
 ] 

Liang-Chi Hsieh commented on SPARK-18539:
-

That's cool.

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType))))
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", 
> IntegerType, nullable = true))))
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   

[jira] [Updated] (SPARK-18657) Persist UUID across query restart

2016-12-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18657:
--
Assignee: Tathagata Das

> Persist UUID across query restart
> -
>
> Key: SPARK-18657
> URL: https://issues.apache.org/jira/browse/SPARK-18657
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Tathagata Das
>Priority: Critical
> Fix For: 2.1.0
>
>
> We probably also want to add an instance Id or something that changes when 
> the query restarts



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18728) Consider using Algebird's Aggregator instead of org.apache.spark.sql.expressions.Aggregator

2016-12-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18728:
--
Issue Type: Improvement  (was: Bug)

I think the questions will be: what does it gain? and what is the cost of 
introducing a dependency just for this class?

> Consider using Algebird's Aggregator instead of 
> org.apache.spark.sql.expressions.Aggregator
> ---
>
> Key: SPARK-18728
> URL: https://issues.apache.org/jira/browse/SPARK-18728
> Project: Spark
>  Issue Type: Improvement
>Reporter: Alex Levenson
>Priority: Minor
>
> Mansur (https://twitter.com/mansur_ashraf) pointed out this comment in 
> spark's Aggregator here:
> "Based loosely on Aggregator from Algebird: 
> https://github.com/twitter/algebird;
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala#L46
> Which got a few of us wondering, given that this API is still experimental, 
> would you consider using algebird's Aggregator API directly instead?
> The algebird API is not coupled with any implementation details, and 
> shouldn't have any extra dependencies.
> Are there any blockers to doing that?
> Thanks!
> Alex
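For comparison, a minimal sketch of how Spark's own typed Aggregators already compose in a single pass through the Dataset API; the SparkSession, sample data, and aggregator definitions are assumptions for illustration, and this uses Spark's Aggregator, not the Algebird one proposed above:

{code}
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Sums the Int values of (key, value) pairs.
val sumAgg = new Aggregator[(String, Int), Long, Long] {
  def zero: Long = 0L
  def reduce(b: Long, a: (String, Int)): Long = b + a._2
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(b: Long): Long = b
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

// Counts the pairs per key.
val countAgg = new Aggregator[(String, Int), Long, Long] {
  def zero: Long = 0L
  def reduce(b: Long, a: (String, Int)): Long = b + 1L
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(b: Long): Long = b
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

// Both aggregations run in one pass; the typed agg(...) overloads only accept
// a small fixed number of aggregators, which is part of what is being discussed.
Seq(("a", 1), ("a", 2), ("b", 3)).toDS()
  .groupByKey(_._1)
  .agg(sumAgg.toColumn, countAgg.toColumn)
  .show()
{code}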



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18593) JDBCRDD returns incorrect results for filters on CHAR of PostgreSQL

2016-12-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18593:
--
Assignee: Takeshi Yamamuro

> JDBCRDD returns incorrect results for filters on CHAR of PostgreSQL
> ---
>
> Key: SPARK-18593
> URL: https://issues.apache.org/jira/browse/SPARK-18593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 1.6.3
>Reporter: Durga Prasad Gunturu
>Assignee: Takeshi Yamamuro
>Priority: Minor
>  Labels: correctness
> Fix For: 2.0.0
>
>
> In Apache Spark 1.6.x, JDBCRDD returns incorrect results for a query with 
> filters on a CHAR column of PostgreSQL CHAR type. The root cause is that 
> PostgreSQL returns a `space padded string` in the result, so the 
> post-processing filter `Filter (a#0 = A)` evaluates to false. Spark 2.0.0 
> removes the post filter because it is already handled in the database by 
> `PushedFilters: [EqualTo(a,A)]`.
> {code}
> scala> val t_char = sqlContext.read.option("user", 
> "postgres").option("password", 
> "rootpass").jdbc("jdbc:postgresql://localhost:5432/postgres", "t_char", new 
> java.util.Properties())
> t_char: org.apache.spark.sql.DataFrame = [a: string]
> scala> val t_varchar = sqlContext.read.option("user", 
> "postgres").option("password", 
> "rootpass").jdbc("jdbc:postgresql://localhost:5432/postgres", "t_varchar", 
> new java.util.Properties())
> t_varchar: org.apache.spark.sql.DataFrame = [a: string]
> scala> t_char.show
> +--+
> | a|
> +--+
> |A |
> |AA|
> |AAA   |
> +--+
> scala> t_varchar.show
> +---+
> |  a|
> +---+
> |  A|
> | AA|
> |AAA|
> +---+
> scala> t_char.filter(t_char("a")==="A").show
> +---+
> |  a|
> +---+
> +---+
> scala> t_char.filter(t_char("a")==="A ").show
> +--+
> | a|
> +--+
> |A |
> +--+
> scala> t_varchar.filter(t_varchar("a")==="A").show
> +---+
> |  a|
> +---+
> |  A|
> +---+
> scala> t_char.filter(t_char("a")==="A").explain
> == Physical Plan ==
> Filter (a#0 = A)
> +- Scan 
> JDBCRelation(jdbc:postgresql://localhost:5432/postgres,t_char,[Lorg.apache.spark.Partition;@2f65c341,{user=postgres,
>  password=rootpass})[a#0] PushedFilters: [EqualTo(a,A)]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18709) Automatic null conversion bug (instead of throwing error) when creating a Spark DataFrame with incompatible types for fields.

2016-12-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724101#comment-15724101
 ] 

Sean Owen commented on SPARK-18709:
---

BTW do you know what change fixed this, by any chance?

> Automatic null conversion bug (instead of throwing error) when creating a 
> Spark DataFrame with incompatible types for fields.
> 
>
> Key: SPARK-18709
> URL: https://issues.apache.org/jira/browse/SPARK-18709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 1.6.3
>Reporter: Amogh Param
>  Labels: bug
> Fix For: 2.0.0
>
>
> When converting an RDD with a `float` type field to a spark dataframe with an 
> `IntegerType` / `LongType` schema field, spark 1.6.2 and 1.6.3 silently 
> convert the field values to nulls instead of throwing an error like `LongType 
> can not accept object ___ in type `. However, this seems to be 
> fixed in Spark 2.0.2.
> The following example should make the problem clear:
> {code}
> from pyspark.sql.types import StructField, StructType, LongType, DoubleType
> schema = StructType([
> StructField("0", LongType(), True),
> StructField("1", DoubleType(), True),
> ])
> data = [[1.0, 1.0], [nan, 2.0]]
> spark_df = sqlContext.createDataFrame(sc.parallelize(data), schema)
> spark_df.show()
> {code}
> Instead of throwing an error like:
> {code}
> LongType can not accept object 1.0 in type 
> {code}
> Spark converts all the values in the first column to nulls
> Running `spark_df.show()` gives:
> {code}
> ++---+
> |   0|  1|
> ++---+
> |null|1.0|
> |null|1.0|
> ++---+
> {code}
> For the purposes of my computation, I run `mapPartitions` on a Spark data 
> frame, convert each partition into a pandas data frame, do a few computations 
> on it, and return a list of lists, which `mapPartitions` turns back into an 
> RDD (across all partitions). This RDD is then converted into a Spark data 
> frame, similar to the example above, using 
> `sqlContext.createDataFrame(rdd, schema)`. The RDD has a column that should 
> become a `LongType` in the Spark data frame, but since it has missing values, 
> it is a `float` type. When Spark tries to create the data frame, it converts 
> all the values in that column to nulls instead of throwing an error that 
> there is a type mismatch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18720) Code Refactoring of withColumn

2016-12-05 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18720.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16152
[https://github.com/apache/spark/pull/16152]

> Code Refactoring of withColumn
> --
>
> Key: SPARK-18720
> URL: https://issues.apache.org/jira/browse/SPARK-18720
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Xiao Li
>Priority: Minor
> Fix For: 2.2.0
>
>
> Our existing withColumn for adding metadata can simply use the existing 
> public withColumn API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18720) Code Refactoring of withColumn

2016-12-05 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-18720:

Affects Version/s: (was: 2.0.2)

> Code Refactoring of withColumn
> --
>
> Key: SPARK-18720
> URL: https://issues.apache.org/jira/browse/SPARK-18720
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Minor
> Fix For: 2.2.0
>
>
> Our existing withColumn for adding metadata can simply use the existing 
> public withColumn API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18720) Code Refactoring of withColumn

2016-12-05 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-18720:

Assignee: Xiao Li

> Code Refactoring of withColumn
> --
>
> Key: SPARK-18720
> URL: https://issues.apache.org/jira/browse/SPARK-18720
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Minor
> Fix For: 2.2.0
>
>
> Our existing withColumn for adding metadata can simply use the existing 
> public withColumn API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18731) Task size in K-means is so large

2016-12-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724082#comment-15724082
 ] 

Sean Owen commented on SPARK-18731:
---

This may have been improved in more recent versions, by the way. In any event, 
it's just a warning, and is a valid one, if you're using huge dense features. I 
don't think this is a problem per se?

> Task size in K-means is so large
> 
>
> Key: SPARK-18731
> URL: https://issues.apache.org/jira/browse/SPARK-18731
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: Xiaoye Sun
>Priority: Minor
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> When running the KMeans algorithm for a large model (e.g. 100k features and 
> 100 centers), a warning is shown for many of the stages saying that the task 
> size is very large. Here is an example warning:
> WARN TaskSetManager: Stage 23 contains a task of very large size (56256 KB). 
> The maximum recommended task size is 100 KB.
> This could happen at (sum at KMeansModel.scala:88), (takeSample at 
> KMeans.scala:378), (aggregate at KMeans.scala:404) and (collect at 
> KMeans.scala:436). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18668) Do not auto-generate query name

2016-12-05 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-18668.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 16113
[https://github.com/apache/spark/pull/16113]

> Do not auto-generate query name
> ---
>
> Key: SPARK-18668
> URL: https://issues.apache.org/jira/browse/SPARK-18668
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
> Fix For: 2.1.0
>
>
> With SPARK-18657 making StreamingQuery.id persistent and truly unique, it 
> does not make sense to use an auto-generated name. Rather, the name should be 
> only a purely optional, pretty identifier set by the user, or remain null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18657) Persist UUID across query restart

2016-12-05 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-18657.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 16113
[https://github.com/apache/spark/pull/16113]

> Persist UUID across query restart
> -
>
> Key: SPARK-18657
> URL: https://issues.apache.org/jira/browse/SPARK-18657
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Priority: Critical
> Fix For: 2.1.0
>
>
> We probably also want to add an instance Id or something that changes when 
> the query restarts



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18729) MemorySink should not call DataFrame.collect when holding a lock

2016-12-05 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-18729.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 16162
[https://github.com/apache/spark/pull/16162]

> MemorySink should not call DataFrame.collect when holding a lock
> 
>
> Key: SPARK-18729
> URL: https://issues.apache.org/jira/browse/SPARK-18729
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.1.0
>
>
> Otherwise, other threads cannot query the content in MemorySink when 
> `DataFrame.collect` takes a long time to finish.
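The underlying pattern is to run the expensive `DataFrame.collect` outside the lock and synchronize only the cheap update; a hypothetical sketch of that pattern (not the actual MemorySink code):

{code}
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.{DataFrame, Row}

class SimpleMemorySink {
  private val batches = ArrayBuffer.empty[Seq[Row]]

  def addBatch(data: DataFrame): Unit = {
    // Expensive and potentially slow: no lock is held here, so readers of
    // allRows are not blocked while the batch is being collected.
    val collected = data.collect().toSeq
    // Cheap: only the append happens under the lock.
    synchronized { batches += collected }
  }

  def allRows: Seq[Row] = synchronized(batches.flatten.toList)
}
{code}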



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-05 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15724013#comment-15724013
 ] 

Cheng Lian commented on SPARK-18539:


[~xwu0226], thanks for the new use case!

[~viirya], I do think this is a valid use case as long as all the missing 
columns are nullable. The only reason that this use case doesn't work right now 
is PARQUET-389.

I got some vague idea about a possible cleaner fix for this issue. Will post it 
later.
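Until a cleaner fix lands, one mitigation that may help (assuming, as the stack trace suggests, that the failure comes from the predicate pushed down to the Parquet reader) is to disable Parquet filter pushdown; a sketch reusing the `sc` SparkSession from the reproduction quoted below:

{code}
// Mitigation sketch: with pushdown disabled, the "b is not null" filter is
// evaluated by Spark instead of the Parquet reader, at the cost of losing
// row-group skipping.
sc.conf.set("spark.sql.parquet.filterPushdown", "false")

sc.sql("select b from table where b is not null").show()
{code}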

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType))))
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", 
> IntegerType, nullable = true))))
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> 

[jira] [Resolved] (SPARK-18634) Corruption and Correctness issues with exploding Python UDFs

2016-12-05 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-18634.
---
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.1.0
   2.0.3

> Corruption and Correctness issues with exploding Python UDFs
> 
>
> Key: SPARK-18634
> URL: https://issues.apache.org/jira/browse/SPARK-18634
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.3, 2.1.0
>
>
> There are some weird issues with exploding Python UDFs in SparkSQL.
> There are 2 cases where, based on the DataType of the exploded column, the 
> result can be flat-out wrong or corrupt. It seems something bad is happening 
> when telling Tungsten the schema of the rows during or after applying the 
> UDF.
> Please check the code below for reproduction.
> Notebook: 
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6186780348633019/3425836135165635/4343791953238323/latest.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18729) MemorySink should not call DataFrame.collect when holding a lock

2016-12-05 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-18729:
--
Priority: Critical  (was: Major)

> MemorySink should not call DataFrame.collect when holding a lock
> 
>
> Key: SPARK-18729
> URL: https://issues.apache.org/jira/browse/SPARK-18729
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Critical
>
> Otherwise, other threads cannot query the content in MemorySink when 
> `DataFrame.collect` takes a long time to finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18729) MemorySink should not call DataFrame.collect when holding a lock

2016-12-05 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-18729:
--
Target Version/s: 2.1.0

> MemorySink should not call DataFrame.collect when holding a lock
> 
>
> Key: SPARK-18729
> URL: https://issues.apache.org/jira/browse/SPARK-18729
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Otherwise, other threads cannot query the content in MemorySink when 
> `DataFrame.collect` takes a long time to finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18671) Add tests to ensure stability of all Structured Streaming log formats

2016-12-05 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-18671:
--
Priority: Major  (was: Critical)

> Add tests to ensure stability of all Structured Streaming log formats
> --
>
> Key: SPARK-18671
> URL: https://issues.apache.org/jira/browse/SPARK-18671
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Tathagata Das
>
> To be able to restart StreamingQueries across Spark versions, we have already 
> made the logs (offset log, file source log, file sink log) use json. We should 
> add tests with actual json files checked into Spark so that any incompatible 
> change in reading the logs is immediately caught.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18728) Consider using Algebird's Aggregator instead of org.apache.spark.sql.expressions.Aggregator

2016-12-05 Thread Mansur Ashraf (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723786#comment-15723786
 ] 

Mansur Ashraf edited comment on SPARK-18728 at 12/6/16 1:37 AM:


Alex,

Thanks for opening the issue. Let me add some more detail to it. 

We have tons of jobs on Spark 1.6 that use Algebird Aggregators through the 
`aggregateByKey` or `combineByKey` functions on RDDs. Since Algebird aggregators 
are composable (meaning you can combine any number of aggregators into one 
combined aggregator), in our jobs we combine 10+ aggregators and do single-pass 
aggregations using aggregateByKey/combineByKey. As we try to upgrade to Spark 
2.0.0 and the new Dataset 
API(http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset),
 we are finding that aggregateByKey/combineByKey are gone, so we can't pass 
Algebird aggregators directly; instead there is a new aggregator API based on 
Algebird that (as far as I can tell) does not allow joining multiple aggregators 
and limits the number of aggregators to 4.

It would be really nice if Spark used Algebird aggregators instead of creating 
its own, or allowed users to pass Algebird aggregators in the Dataset API in 
addition to Spark aggregators.

Thanks


was (Author: mashraf):
Alex,

Thanks for opening the issue. Let me add some more detail to it. 

We have tons of jobs on Spark 1.6 that are using Algebird Aggregators through 
`aggregateByKey` or `combineByKey` functions on RDD. Since Algebird aggregators 
are composable (meaning you can combine X number of aggregators to get 1 
combined aggregators), in our jobs we are combining 10+ number of aggregators 
and doing single pass aggregations using aggregateByKey/combineByKey. As we 
upgrade to Spark 2.0.0 and new Dataset 
API(http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset),
 we find out that aggregateByKey/combineByKey are all gone so we cant pass 
algebird aggregators directly, instead there is a new aggregator API based on 
algebird except (as far as I can tell) does not allow joining multiple 
aggregators and limiting number of aggregators to 4.  

It would be really nice if Spark use Algebird aggregators instead of creating 
its own or allow users to pass algebird aggregators in Dataset API in addition 
to Spark aggregators

Thanks

> Consider using Algebird's Aggregator instead of 
> org.apache.spark.sql.expressions.Aggregator
> ---
>
> Key: SPARK-18728
> URL: https://issues.apache.org/jira/browse/SPARK-18728
> Project: Spark
>  Issue Type: Bug
>Reporter: Alex Levenson
>Priority: Minor
>
> Mansur (https://twitter.com/mansur_ashraf) pointed out this comment in 
> spark's Aggregator here:
> "Based loosely on Aggregator from Algebird: 
> https://github.com/twitter/algebird;
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala#L46
> Which got a few of us wondering, given that this API is still experimental, 
> would you consider using algebird's Aggregator API directly instead?
> The algebird API is not coupled with any implementation details, and 
> shouldn't have any extra dependencies.
> Are there any blockers to doing that?
> Thanks!
> Alex



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18731) Task size in K-means is too large

2016-12-05 Thread Xiaoye Sun (JIRA)
Xiaoye Sun created SPARK-18731:
--

 Summary: Task size in K-means is too large
 Key: SPARK-18731
 URL: https://issues.apache.org/jira/browse/SPARK-18731
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.6.1
Reporter: Xiaoye Sun
Priority: Minor


When running the KMeans algorithm with a large model (e.g. 100k features and 100 
centers), a warning is shown for many of the stages saying that the task size is 
very large. Here is an example warning:
WARN TaskSetManager: Stage 23 contains a task of very large size (56256 KB). 
The maximum recommended task size is 100 KB.

This could happen at (sum at KMeansModel.scala:88), (takeSample at 
KMeans.scala:378), (aggregate at KMeans.scala:404) and (collect at 
KMeans.scala:436). 
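
For illustration only, here is a generic sketch (not the MLlib KMeans code; all 
names, sizes and the local master are hypothetical) of the usual way to keep 
serialized task sizes small: broadcast large read-only data instead of capturing 
it in the task closure.
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example: `bigCenters` stands in for large model state such as
// cluster centers. Broadcasting it means each task ships only a small handle,
// not the full array, which keeps the serialized task size small.
val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("broadcast-sketch"))
val bigCenters = Array.fill(100000)(1.0)
val centersBc = sc.broadcast(bigCenters)

val total = sc.parallelize(1 to 1000, 8)
  .map(i => centersBc.value(i % centersBc.value.length) * i)  // tasks read via the broadcast
  .sum()

sc.stop()
{code}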



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18730) Ask the build script to link to Jenkins test report page instead of full console output page when posting to GitHub

2016-12-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723958#comment-15723958
 ] 

Apache Spark commented on SPARK-18730:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/16163

> Ask the build script to link to Jenkins test report page instead of full 
> console output page when posting to GitHub
> ---
>
> Key: SPARK-18730
> URL: https://issues.apache.org/jira/browse/SPARK-18730
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> Currently, the full console output page of a Spark Jenkins PR build can be as 
> large as several megabytes. It takes a relatively long time to load and may 
> even freeze the browser for quite a while.
> I'd suggest posting the test report page link to GitHub instead, which is way 
> more concise and is usually the first page I'd like to check when 
> investigating a Jenkins build failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18730) Ask the build script to link to Jenkins test report page instead of full console output page when posting to GitHub

2016-12-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18730:


Assignee: Cheng Lian  (was: Apache Spark)

> Ask the build script to link to Jenkins test report page instead of full 
> console output page when posting to GitHub
> ---
>
> Key: SPARK-18730
> URL: https://issues.apache.org/jira/browse/SPARK-18730
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> Currently, the full console output page of a Spark Jenkins PR build can be as 
> large as several megabytes. It takes a relatively long time to load and may 
> even freeze the browser for quite a while.
> I'd suggest posting the test report page link to GitHub instead, which is way 
> more concise and is usually the first page I'd like to check when 
> investigating a Jenkins build failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18730) Ask the build script to link to Jenkins test report page instead of full console output page when posting to GitHub

2016-12-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18730:


Assignee: Apache Spark  (was: Cheng Lian)

> Ask the build script to link to Jenkins test report page instead of full 
> console output page when posting to GitHub
> ---
>
> Key: SPARK-18730
> URL: https://issues.apache.org/jira/browse/SPARK-18730
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Cheng Lian
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, the full console output page of a Spark Jenkins PR build can be as 
> large as several megabytes. It takes a relatively long time to load and may 
> even freeze the browser for quite a while.
> I'd suggest posting the test report page link to GitHub instead, which is way 
> more concise and is usually the first page I'd like to check when 
> investigating a Jenkins build failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-05 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723960#comment-15723960
 ] 

Liang-Chi Hsieh commented on SPARK-18539:
-

Actually, I am not sure this is valid usage; I tend to think it is not. You 
specify a non-existing column, and the system reports that the column was not 
found in the schema, which looks reasonable to me.


> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", 
> IntegerType, nullable = true
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at 

[jira] [Updated] (SPARK-18730) Ask the build script to link to Jenkins test report page instead of full console output page when posting to GitHub

2016-12-05 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-18730:
---
Priority: Minor  (was: Major)

> Ask the build script to link to Jenkins test report page instead of full 
> console output page when posting to GitHub
> ---
>
> Key: SPARK-18730
> URL: https://issues.apache.org/jira/browse/SPARK-18730
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
>
> Currently, the full console output page of a Spark Jenkins PR build can be as 
> large as several megabytes. It takes a relatively long time to load and may 
> even freeze the browser for quite a while.
> I'd suggest posting the test report page link to GitHub instead, which is way 
> more concise and is usually the first page I'd like to check when 
> investigating a Jenkins build failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-05 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723949#comment-15723949
 ] 

Liang-Chi Hsieh commented on SPARK-18539:
-

Because we respect the user-specified schema, we won't infer the schema, and 
schema merging of course won't step in.
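
For illustration, here is a minimal sketch of the two read paths being discussed 
(assuming a {{spark}} session as in spark-shell; the path is a hypothetical 
throwaway location): with an inferred, merged schema the column missing from one 
file comes back as null, while a user-specified schema skips inference/merging 
and the pushed-down filter can hit "Column [b] was not found in schema!".
{code}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Two Parquet files under the same path, one of them missing column b.
val df = spark.range(1).coalesce(1)
df.selectExpr("id AS a", "id AS b").write.parquet("/tmp/spark-18539-sketch")
df.selectExpr("id AS a").write.mode("append").parquet("/tmp/spark-18539-sketch")

// Inferred + merged schema: b is discovered from the first file and read back
// as null for the file that lacks it, so the filter runs.
spark.read.option("mergeSchema", "true").parquet("/tmp/spark-18539-sketch").filter("b is not null").count()

// User-specified schema: no inference and no merging; the Parquet filter
// pushdown validates b against each file's schema and can fail for the file
// that has no column b.
val schema = StructType(Seq(StructField("a", LongType), StructField("b", LongType, nullable = true)))
spark.read.schema(schema).parquet("/tmp/spark-18539-sketch").filter("b is not null").count()
{code}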

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", 
> IntegerType, nullable = true
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 

[jira] [Created] (SPARK-18730) Ask the build script to link to Jenkins test report page instead of full console output page when posting to GitHub

2016-12-05 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-18730:
--

 Summary: Ask the build script to link to Jenkins test report page 
instead of full console output page when posting to GitHub
 Key: SPARK-18730
 URL: https://issues.apache.org/jira/browse/SPARK-18730
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Cheng Lian
Assignee: Cheng Lian


Currently, the full console output page of a Spark Jenkins PR build can be as 
large as several megabytes. It takes a relatively long time to load and may 
even freeze the browser for quite a while.

I'd suggest posting the test report page link to GitHub instead, which is way 
more concise and is usually the first page I'd like to check when investigating 
a Jenkins build failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-05 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723888#comment-15723888
 ] 

Xin Wu edited comment on SPARK-18539 at 12/6/16 12:46 AM:
--

I think we will hit the issue if we use a user-specified schema. Here is what I 
tried in a spark-shell built from the master branch:
{code}
val df = spark.range(1).coalesce(1)
df.selectExpr("id AS a").write.parquet("/Users/xinwu/spark-test/data/spark-18539")
val schema = StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType)))
spark.read.option("mergeSchema", "true").schema(schema)
  .parquet("/Users/xinwu/spark-test/data/spark-18539")
  .filter("b is null").count()
{code}

The exception is 
{code}
Caused by: java.lang.IllegalArgumentException: Column [b] was not found in 
schema!
  at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:181)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:169)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:151)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:91)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:58)
  at 
org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:121)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:58)
  at 
org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:308)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:63)
  at 
org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
  at 
org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
  at 
org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
  at 
org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
  at 
org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:377)
{code}

Here I have one parquet file that is missing column b, and I query with a 
user-specified schema (a, b).




was (Author: xwu0226):
I think we will hit the issue if we use a user-specified schema. Here is what I 
tried in a spark-shell built from the master branch:
{code}
val df = spark.range(1).coalesce(1)
df.selectExpr("id AS a").write.parquet("/Users/xinwu/spark-test/data/spark-18539")
val schema = StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType)))
spark.read.option("mergeSchema", "true").schema(schema)
  .parquet("/Users/xinwu/spark-test/data/spark-18539")
  .filter("b < 0").count()
{code}

The exception is 
{code}
Caused by: java.lang.IllegalArgumentException: Column [b] was not found in 
schema!
  at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:181)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:169)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:151)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:91)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:58)
  at 
org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:121)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:58)
  at 
org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:308)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:63)
  at 
org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
  at 

[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-05 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723888#comment-15723888
 ] 

Xin Wu commented on SPARK-18539:


I think we will hit the issue if we use a user-specified schema. Here is what I 
tried in a spark-shell built from the master branch:
{code}
val df = spark.range(1).coalesce(1)
df.selectExpr("id AS a").write.parquet("/Users/xinwu/spark-test/data/spark-18539")
val schema = StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType)))
spark.read.option("mergeSchema", "true").schema(schema)
  .parquet("/Users/xinwu/spark-test/data/spark-18539")
  .filter("b < 0").count()
{code}

The exception is 
{code}
Caused by: java.lang.IllegalArgumentException: Column [b] was not found in 
schema!
  at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:181)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:169)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:151)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:91)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:58)
  at 
org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:121)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:58)
  at 
org.apache.parquet.filter2.predicate.Operators$And.accept(Operators.java:308)
  at 
org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:63)
  at 
org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
  at 
org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
  at 
org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
  at 
org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
  at 
org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:377)
{code}

Here I have one parquet file that is missing column b, and I query with a 
user-specified schema (a, b).



> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", 
> IntegerType, nullable = true
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> 

[jira] [Commented] (SPARK-18728) Consider using Algebird's Aggregator instead of org.apache.spark.sql.expressions.Aggregator

2016-12-05 Thread Alex Levenson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723866#comment-15723866
 ] 

Alex Levenson commented on SPARK-18728:
---

I think the main selling points of Algebird aggregators are:

1) They are composable (you can take a Min aggregator and combine it with a Max 
aggregator to get an aggregator that gets both the Min + Max in 1 pass) -- as 
[~mashraf] points out, you can compose many times to get lots of aggregations 
in 1 pass

2) They have the option for efficient addition methods -- they use algebird's 
Semigroup, which has both plus(a,b) for adding 2 items, and sumOption(iter: 
TraversableOnce[T]) for adding N items. This allows for opting in to efficient 
additions without having a mutable API (sumOption can be mutable internally, 
but it has to be referentially transparent)

3) There are many already-built implementations of Aggregator available in 
Algebird, for common types as well as probabilistic data structures.
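
As a concrete illustration of point 1, here is a minimal sketch assuming Algebird 
(com.twitter.algebird) is on the classpath and using its Aggregator.min, 
Aggregator.max and join combinators:
{code}
import com.twitter.algebird.Aggregator

// Compose a min and a max aggregator; the joined aggregator computes both
// results in a single pass over the input.
val minAgg = Aggregator.min[Int]
val maxAgg = Aggregator.max[Int]
val minAndMax = minAgg.join(maxAgg)

val (lowest, highest) = minAndMax(Seq(3, 1, 4, 1, 5))  // (1, 5)
{code}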

> Consider using Algebird's Aggregator instead of 
> org.apache.spark.sql.expressions.Aggregator
> ---
>
> Key: SPARK-18728
> URL: https://issues.apache.org/jira/browse/SPARK-18728
> Project: Spark
>  Issue Type: Bug
>Reporter: Alex Levenson
>Priority: Minor
>
> Mansur (https://twitter.com/mansur_ashraf) pointed out this comment in 
> spark's Aggregator here:
> "Based loosely on Aggregator from Algebird: 
> https://github.com/twitter/algebird;
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala#L46
> Which got a few of us wondering, given that this API is still experimental, 
> would you consider using algebird's Aggregator API directly instead?
> The algebird API is not coupled with any implementation details, and 
> shouldn't have any extra dependencies.
> Are there any blockers to doing that?
> Thanks!
> Alex



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14280) Update change-version.sh and pom.xml to add Scala 2.12 profiles

2016-12-05 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723814#comment-15723814
 ] 

Jakob Odersky commented on SPARK-14280:
---

You're welcome to pull the changes back into your repo, of course; however, I'd 
also gladly continue working on adding 2.12 support!

Btw, how should this kind of all-or-nothing change be integrated into Spark? I 
don't want to make a pull request for a half-baked feature, but I also feel like 
continuing to pile features onto this will result in a huge changeset that is 
impossible to review.

> Update change-version.sh and pom.xml to add Scala 2.12 profiles
> ---
>
> Key: SPARK-14280
> URL: https://issues.apache.org/jira/browse/SPARK-14280
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The following instructions will be kept quasi-up-to-date and are the best 
> starting point for building a Spark snapshot with Scala 2.12.0-M4:
> * Check out https://github.com/JoshRosen/spark/tree/build-for-2.12.
> * Install dependencies:
> ** chill: check out https://github.com/twitter/chill/pull/253 and run 
> {{sbt ++2.12.0-M4 publishLocal}}
> * Run {{./dev/change-scala-version.sh 2.12.0-M4}}
> * To compile Spark, run {{build/sbt -Dscala-2.12}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14280) Update change-version.sh and pom.xml to add Scala 2.12 profiles

2016-12-05 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723807#comment-15723807
 ] 

Jakob Odersky commented on SPARK-14280:
---

Hi [~joshrosen],

I rebased your initial work onto the latest master and upgraded dependencies. 
You can see the changes here 
https://github.com/apache/spark/compare/master...jodersky:scala-2.12

There were a few merge conflicts, often related to dependency version 
mismatches. I tried to resolve the conflicts cleanly; however, considering that 
I also had to take into account libraries that were recently built for 2.12, it 
could be that some of your changes in the pom.xml were lost.

There are still quite a few dependency issues for the latest Scala versions, but 
core still builds :)

> Update change-version.sh and pom.xml to add Scala 2.12 profiles
> ---
>
> Key: SPARK-14280
> URL: https://issues.apache.org/jira/browse/SPARK-14280
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The following instructions will be kept quasi-up-to-date and are the best 
> starting point for building a Spark snapshot with Scala 2.12.0-M4:
> * Check out https://github.com/JoshRosen/spark/tree/build-for-2.12.
> * Install dependencies:
> ** chill: check out https://github.com/twitter/chill/pull/253 and run 
> {{sbt ++2.12.0-M4 publishLocal}}
> * Run {{./dev/change-scala-version.sh 2.12.0-M4}}
> * To compile Spark, run {{build/sbt -Dscala-2.12}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18728) Consider using Algebird's Aggregator instead of org.apache.spark.sql.expressions.Aggregator

2016-12-05 Thread Mansur Ashraf (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723786#comment-15723786
 ] 

Mansur Ashraf edited comment on SPARK-18728 at 12/5/16 11:55 PM:
-

Alex,

Thanks for opening the issue. Let me add some more detail to it. 

We have tons of jobs on Spark 1.6 that are using Algebird Aggregators through 
`aggregateByKey` or `combineByKey` functions on RDD. Since Algebird aggregators 
are composable (meaning you can combine X number of aggregators to get 1 
combined aggregators), in our jobs we are combining 10+ number of aggregators 
and doing single pass aggregations using aggregateByKey/combineByKey. As we 
upgrade to Spark 2.0.0 and new Dataset 
API(http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset),
 we find out that aggregateByKey/combineByKey are all gone so we cant pass 
algebird aggregators directly, instead there is a new aggregator API based on 
algebird except (as far as I can tell) does not allow joining multiple 
aggregators and limiting number of aggregators to 4.  

It would be really nice if Spark use Algebird aggregators instead of creating 
its own or allow users to pass algebird aggregators in Dataset API in addition 
to Spark aggregators

Thanks


was (Author: mashraf):
Alex,

Thanks for opening the issue. Let me add some more detail to it. 

We have tons of job on Spark 1.6 that are using Algebird Aggregators through 
`aggregateByKey` or `combineByKey` functions on RDD. Since Algebird aggregators 
are composable (meaning you can combine X number of aggregators to get 1 
combined aggregators), in our jobs we are combining 10+ number of aggregators 
and doing single pass aggregations using aggregateByKey/combineByKey. As we 
upgrade to Spark 2.0.0 and new Dataset 
API(http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset),
 we find out that aggregateByKey/combineByKey are all gone so we cant pass 
algebird aggregators directly, instead there is a new aggregator API based on 
algebird except (as far as I can tell) does not allow joining multiple 
aggregators and limiting number of aggregators to 4.  

It would be really nice if Spark use Algebird aggregators instead of creating 
its own or allow users to pass algebird aggregators in Dataset API in addition 
to Spark aggregators

Thanks

> Consider using Algebird's Aggregator instead of 
> org.apache.spark.sql.expressions.Aggregator
> ---
>
> Key: SPARK-18728
> URL: https://issues.apache.org/jira/browse/SPARK-18728
> Project: Spark
>  Issue Type: Bug
>Reporter: Alex Levenson
>Priority: Minor
>
> Mansur (https://twitter.com/mansur_ashraf) pointed out this comment in 
> spark's Aggregator here:
> "Based loosely on Aggregator from Algebird: 
> https://github.com/twitter/algebird;
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala#L46
> Which got a few of us wondering, given that this API is still experimental, 
> would you consider using algebird's Aggregator API directly instead?
> The algebird API is not coupled with any implementation details, and 
> shouldn't have any extra dependencies.
> Are there any blockers to doing that?
> Thanks!
> Alex



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18728) Consider using Algebird's Aggregator instead of org.apache.spark.sql.expressions.Aggregator

2016-12-05 Thread Mansur Ashraf (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723786#comment-15723786
 ] 

Mansur Ashraf commented on SPARK-18728:
---

Alex,

Thanks for opening the issue. Let me add some more detail to it. 

We have tons of job on Spark 1.6 that are using Algebird Aggregators through 
`aggregateByKey` or `combineByKey` functions on RDD. Since Algebird aggregators 
are composable (meaning you can combine X number of aggregators to get 1 
combined aggregators), in our jobs we are combining 10+ number of aggregators 
and doing single pass aggregations using aggregateByKey/combineByKey. As we 
upgrade to Spark 2.0.0 and new Dataset 
API(http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset),
 we find out that aggregateByKey/combineByKey are all gone so we cant pass 
algebird aggregators directly, instead there is a new aggregator API based on 
algebird except (as far as I can tell) does not allow joining multiple 
aggregators and limiting number of aggregators to 4.  

It would be really nice if Spark use Algebird aggregators instead of creating 
its own or allow users to pass algebird aggregators in Dataset API in addition 
to Spark aggregators

Thanks

> Consider using Algebird's Aggregator instead of 
> org.apache.spark.sql.expressions.Aggregator
> ---
>
> Key: SPARK-18728
> URL: https://issues.apache.org/jira/browse/SPARK-18728
> Project: Spark
>  Issue Type: Bug
>Reporter: Alex Levenson
>Priority: Minor
>
> Mansur (https://twitter.com/mansur_ashraf) pointed out this comment in 
> spark's Aggregator here:
> "Based loosely on Aggregator from Algebird: 
> https://github.com/twitter/algebird;
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala#L46
> Which got a few of us wondering, given that this API is still experimental, 
> would you consider using algebird's Aggregator API directly instead?
> The algebird API is not coupled with any implementation details, and 
> shouldn't have any extra dependencies.
> Are there any blockers to doing that?
> Thanks!
> Alex



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-05 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723781#comment-15723781
 ] 

Cheng Lian commented on SPARK-18539:


Please remind me if I missed anything important; otherwise, we can resolve this 
ticket as "Not a Problem".

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", 
> IntegerType, nullable = true
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 

[jira] [Assigned] (SPARK-18729) MemorySink should not call DataFrame.collect when holding a lock

2016-12-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18729:


Assignee: Shixiong Zhu  (was: Apache Spark)

> MemorySink should not call DataFrame.collect when holding a lock
> 
>
> Key: SPARK-18729
> URL: https://issues.apache.org/jira/browse/SPARK-18729
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Otherwise, other threads cannot query the content in MemorySink when 
> `DataFrame.collect` takes a long time to finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18729) MemorySink should not call DataFrame.collect when holding a lock

2016-12-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723772#comment-15723772
 ] 

Apache Spark commented on SPARK-18729:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16162

> MemorySink should not call DataFrame.collect when holding a lock
> 
>
> Key: SPARK-18729
> URL: https://issues.apache.org/jira/browse/SPARK-18729
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Otherwise, other threads cannot query the content in MemorySink when 
> `DataFrame.collect` takes a long time to finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18729) MemorySink should not call DataFrame.collect when holding a lock

2016-12-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18729:


Assignee: Apache Spark  (was: Shixiong Zhu)

> MemorySink should not call DataFrame.collect when holding a lock
> 
>
> Key: SPARK-18729
> URL: https://issues.apache.org/jira/browse/SPARK-18729
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> Otherwise, other threads cannot query the content in MemorySink when 
> `DataFrame.collect` takes a long time to finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14660) Executors show up active tasks indefinitely after stage is killed

2016-12-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-14660:
---
Component/s: Scheduler

> Executors show up active tasks indefinitely after stage is killed
> -
>
> Key: SPARK-14660
> URL: https://issues.apache.org/jira/browse/SPARK-14660
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core, YARN
>Affects Versions: 1.6.1
> Environment: YARN
>Reporter: saurabh paliwal
>Priority: Minor
>
> If a job is running and we kill it from all stages UI page, the executors on 
> which the tasks were running for that job keep showing them as active tasks, 
> and the executor won't get lost even after executorIdleTimeout seconds if you 
> have dynamic allocated executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18729) MemorySink should not call DataFrame.collect when holding a lock

2016-12-05 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-18729:


 Summary: MemorySink should not call DataFrame.collect when holding 
a lock
 Key: SPARK-18729
 URL: https://issues.apache.org/jira/browse/SPARK-18729
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


Otherwise, other threads cannot query the content in MemorySink when 
`DataFrame.collect` takes a long time to finish.
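
A minimal sketch of the locking pattern being asked for (illustrative only, not 
the actual MemorySink code; the class and member names below are hypothetical): 
run the expensive collect outside the lock and only touch shared state inside a 
short critical section.
{code}
import org.apache.spark.sql.{DataFrame, Row}
import scala.collection.mutable.ArrayBuffer

// Hypothetical sink sketch: collect() runs without holding the lock, so readers
// calling allRows are only blocked for the quick buffer update.
class SketchMemorySink {
  private val lock = new Object
  private val batches = ArrayBuffer.empty[Seq[Row]]

  def addBatch(df: DataFrame): Unit = {
    val rows = df.collect().toSeq          // potentially slow; no lock held here
    lock.synchronized { batches += rows }  // short critical section
  }

  def allRows: Seq[Row] = lock.synchronized(batches.flatten.toList)
}
{code}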



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18729) MemorySink should not call DataFrame.collect when holding a lock

2016-12-05 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18729:
-
Issue Type: Improvement  (was: Bug)

> MemorySink should not call DataFrame.collect when holding a lock
> 
>
> Key: SPARK-18729
> URL: https://issues.apache.org/jira/browse/SPARK-18729
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Otherwise, other threads cannot query the content in MemorySink when 
> `DataFrame.collect` takes a long time to finish.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-05 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723747#comment-15723747
 ] 

Cheng Lian edited comment on SPARK-18539 at 12/5/16 11:43 PM:
--

[~v-gerasimov], [~smilegator], and [~xwu0226], after some investigation, I 
don't think this is a bug now.

Just tested the master branch using the following test cases:
{code}
  for {
useVectorizedReader <- Seq(true, false)
mergeSchema <- Seq(true, false)
  } {
test(s"foo - mergeSchema: $mergeSchema - vectorized: $useVectorizedReader") 
{
  withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> 
useVectorizedReader.toString) {
withTempPath { dir =>
  val path = dir.getCanonicalPath
  val df = spark.range(1).coalesce(1)

  df.selectExpr("id AS a", "id AS b").write.parquet(path)
  df.selectExpr("id AS a").write.mode("append").parquet(path)

  assertResult(0) {
spark.read
  .option("mergeSchema", mergeSchema.toString)
  .parquet(path)
  .filter("b < 0")
  .count()
  }
}
  }
}
  }
{code}
It turned out that this issue only happens when schema merging is turned off. 
This also explains why PR #9940 doesn't prevent PARQUET-389: because the trick 
PR #9940 employs happens during the schema merging phase. On the other hand, you 
can't expect missing columns to be properly read when schema merging is turned 
off. Therefore, I don't think it's a bug.

The fix for the snippet mentioned in the ticket description is easy: just add 
{{.option("mergeSchema", "true")}} to enable schema merging.


was (Author: lian cheng):
[~v-gerasimov], [~smilegator], and [~xwu0226], after some investigation, I 
don't think this is a bug now.

Just tested the master branch using the following test cases:
{code}
  for {
useVectorizedReader <- Seq(true, false)
mergeSchema <- Seq(true, false)
  } {
test(s"foo - mergeSchema: $mergeSchema - vectorized: $useVectorizedReader") 
{
  withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> 
useVectorizedReader.toString) {
withTempPath { dir =>
  val path = dir.getCanonicalPath
  val df = spark.range(1).coalesce(1)

  df.selectExpr("id AS a", "id AS b").write.parquet(path)
  df.selectExpr("id AS a").write.mode("append").parquet(path)

  assertResult(0) {
spark.read
  .option("mergeSchema", mergeSchema.toString)
  .parquet(path)
  .filter("b < 0")
  .count()
  }
}
  }
}
  }
{code}
It turned out that this issue only happens when schema merging is turned off. 
This also explains why PR #9940 doesn't prevent PARQUET-389: the trick PR #9940 
employs happens during the schema-merging phase. On the other hand, you can't 
expect missing columns to be read properly when schema merging is turned off. 
Therefore, I don't think it's a bug.

The fix for the snippet mentioned in the ticket description is easy: just add 
{{.option("mergeSchema", "true")}} to enable schema merging.

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", 
> IntegerType, nullable = true
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> 

[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-05 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723747#comment-15723747
 ] 

Cheng Lian commented on SPARK-18539:


[~v-gerasimov], [~smilegator], and [~xwu0226], after some investigation, I 
don't think this is a bug now.

Just tested the master branch using the following test cases:
{code}
  for {
useVectorizedReader <- Seq(true, false)
mergeSchema <- Seq(true, false)
  } {
test(s"foo - mergeSchema: $mergeSchema - vectorized: $useVectorizedReader") 
{
  withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> 
useVectorizedReader.toString) {
withTempPath { dir =>
  val path = dir.getCanonicalPath
  val df = spark.range(1).coalesce(1)

  df.selectExpr("id AS a", "id AS b").write.parquet(path)
  df.selectExpr("id AS a").write.mode("append").parquet(path)

  assertResult(0) {
spark.read
  .option("mergeSchema", mergeSchema.toString)
  .parquet(path)
  .filter("b < 0")
  .count()
  }
}
  }
}
  }
{code}
It turned out that this issue only happens when schema merging is turned off. 
This also explains why PR #9940 doesn't prevent PARQUET-389: the trick PR #9940 
employs happens during the schema-merging phase. On the other hand, you can't 
expect missing columns to be read properly when schema merging is turned off. 
Therefore, I don't think it's a bug.

The fix for the snippet mentioned in the ticket description is easy: just add 
{{.option("mergeSchema", "true")}} to enable schema merging.

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", 
> IntegerType, nullable = true
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>   at 
> 

[jira] [Commented] (SPARK-18539) Cannot filter by nonexisting column in parquet file

2016-12-05 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723718#comment-15723718
 ] 

Cheng Lian commented on SPARK-18539:


As commented on GitHub, there are two issues right now:
# This bug also affects the normal Parquet reader code path, where 
{{ParquetRecordReader}} is a third-party class closed for modification, so we 
can't capture the exception there.
# [PR #9940|https://github.com/apache/spark/pull/9940] should have already 
fixed this issue, but somehow it is broken right now.

> Cannot filter by nonexisting column in parquet file
> ---
>
> Key: SPARK-18539
> URL: https://issues.apache.org/jira/browse/SPARK-18539
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.0.2
>Reporter: Vitaly Gerasimov
>Priority: Critical
>
> {code}
>   import org.apache.spark.SparkConf
>   import org.apache.spark.sql.SparkSession
>   import org.apache.spark.sql.types.DataTypes._
>   import org.apache.spark.sql.types.{StructField, StructType}
>   val sc = SparkSession.builder().config(new 
> SparkConf().setMaster("local")).getOrCreate()
>   val jsonRDD = sc.sparkContext.parallelize(Seq("""{"a":1}"""))
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType
> .json(jsonRDD)
> .write
> .parquet("/tmp/test")
>   sc.read
> .schema(StructType(Seq(StructField("a", IntegerType), StructField("b", 
> IntegerType, nullable = true
> .load("/tmp/test")
> .createOrReplaceTempView("table")
>   sc.sql("select b from table where b is not null").show()
> {code}
> returns:
> {code}
> 16/11/22 17:43:47 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.IllegalArgumentException: Column [b] was not found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:100)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$NotEq.accept(Operators.java:194)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:110)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:341)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> 

[jira] [Updated] (SPARK-18721) ForeachSink breaks Watermark in append mode

2016-12-05 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18721:
-
Target Version/s: 2.1.0

> ForeachSink breaks Watermark in append mode
> ---
>
> Key: SPARK-18721
> URL: https://issues.apache.org/jira/browse/SPARK-18721
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Cristian Opris
>Assignee: Shixiong Zhu
>Priority: Critical
>
> The watermark is not updated in append mode with a ForeachSink.
> Because ForeachSink creates a separate IncrementalExecution instance, the 
> physical plan is recreated for the logical plan, which results in a new 
> EventTimeWatermarkExec operator that is unreachable from StreamExecution. 
> As a result, the watermark is never updated and append mode never emits 
> results.
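A hedged sketch of the kind of query that hits this; the {{rate}} source 
(available in later Spark versions) and the window sizes are illustrative only, 
any event-time aggregation written in append mode through a {{ForeachWriter}} 
should do.
{code}
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Event-time aggregation with a watermark.
val counts = spark.readStream
  .format("rate").load()                        // illustrative source with a timestamp column
  .withWatermark("timestamp", "10 seconds")
  .groupBy(window($"timestamp", "10 seconds"))
  .count()

// Append mode through a ForeachWriter: with the bug described above, the
// watermark never advances for this plan, so no rows are ever emitted.
val query = counts.writeStream
  .outputMode("append")
  .foreach(new ForeachWriter[Row] {
    def open(partitionId: Long, version: Long): Boolean = true
    def process(value: Row): Unit = println(value)
    def close(errorCause: Throwable): Unit = ()
  })
  .start()
{code}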



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18717) Datasets - crash (compile exception) when mapping to immutable scala map

2016-12-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18717:


Assignee: (was: Apache Spark)

> Datasets - crash (compile exception) when mapping to immutable scala map
> 
>
> Key: SPARK-18717
> URL: https://issues.apache.org/jira/browse/SPARK-18717
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
>Reporter: Damian Momot
>
> {code}
> val spark: SparkSession = ???
> case class Test(id: String, map_test: Map[Long, String])
> spark.sql("CREATE TABLE xyz.map_test (id string, map_test map) 
> STORED AS PARQUET")
> spark.sql("SELECT * FROM xyz.map_test").as[Test].map(t => t).collect()
> {code}
> {code}
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 307, Column 108: No applicable constructor/method found for actual parameters 
> "java.lang.String, scala.collection.Map"; candidates are: 
> "$line14.$read$$iw$$iw$Test(java.lang.String, scala.collection.immutable.Map)"
> {code}
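A hedged workaround sketch, not a fix: declaring the field as 
{{scala.collection.Map}} matches the type produced by the generated 
deserializer and avoids the constructor mismatch shown above. The session setup 
below is illustrative.
{code}
import org.apache.spark.sql.SparkSession

// Use the generic scala.collection.Map rather than the default immutable Map.
case class TestCompat(id: String, map_test: scala.collection.Map[Long, String])

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
import spark.implicits._

spark.sql("SELECT * FROM xyz.map_test").as[TestCompat].map(t => t).collect()
{code}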



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18717) Datasets - crash (compile exception) when mapping to immutable scala map

2016-12-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723694#comment-15723694
 ] 

Apache Spark commented on SPARK-18717:
--

User 'aray' has created a pull request for this issue:
https://github.com/apache/spark/pull/16161

> Datasets - crash (compile exception) when mapping to immutable scala map
> 
>
> Key: SPARK-18717
> URL: https://issues.apache.org/jira/browse/SPARK-18717
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
>Reporter: Damian Momot
>
> {code}
> val spark: SparkSession = ???
> case class Test(id: String, map_test: Map[Long, String])
> spark.sql("CREATE TABLE xyz.map_test (id string, map_test map) 
> STORED AS PARQUET")
> spark.sql("SELECT * FROM xyz.map_test").as[Test].map(t => t).collect()
> {code}
> {code}
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 307, Column 108: No applicable constructor/method found for actual parameters 
> "java.lang.String, scala.collection.Map"; candidates are: 
> "$line14.$read$$iw$$iw$Test(java.lang.String, scala.collection.immutable.Map)"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18717) Datasets - crash (compile exception) when mapping to immutable scala map

2016-12-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18717:


Assignee: Apache Spark

> Datasets - crash (compile exception) when mapping to immutable scala map
> 
>
> Key: SPARK-18717
> URL: https://issues.apache.org/jira/browse/SPARK-18717
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
>Reporter: Damian Momot
>Assignee: Apache Spark
>
> {code}
> val spark: SparkSession = ???
> case class Test(id: String, map_test: Map[Long, String])
> spark.sql("CREATE TABLE xyz.map_test (id string, map_test map) 
> STORED AS PARQUET")
> spark.sql("SELECT * FROM xyz.map_test").as[Test].map(t => t).collect()
> {code}
> {code}
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 307, Column 108: No applicable constructor/method found for actual parameters 
> "java.lang.String, scala.collection.Map"; candidates are: 
> "$line14.$read$$iw$$iw$Test(java.lang.String, scala.collection.immutable.Map)"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18721) ForeachSink breaks Watermark in append mode

2016-12-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18721:


Assignee: Shixiong Zhu  (was: Apache Spark)

> ForeachSink breaks Watermark in append mode
> ---
>
> Key: SPARK-18721
> URL: https://issues.apache.org/jira/browse/SPARK-18721
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Cristian Opris
>Assignee: Shixiong Zhu
>Priority: Critical
>
> The watermark is not updated in append mode with a ForeachSink.
> Because ForeachSink creates a separate IncrementalExecution instance, the 
> physical plan is recreated for the logical plan, which results in a new 
> EventTimeWatermarkExec operator that is unreachable from StreamExecution. 
> As a result, the watermark is never updated and append mode never emits 
> results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18721) ForeachSink breaks Watermark in append mode

2016-12-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18721:


Assignee: Apache Spark  (was: Shixiong Zhu)

> ForeachSink breaks Watermark in append mode
> ---
>
> Key: SPARK-18721
> URL: https://issues.apache.org/jira/browse/SPARK-18721
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Cristian Opris
>Assignee: Apache Spark
>Priority: Critical
>
> The watermark is not updated in append mode with a ForeachSink.
> Because ForeachSink creates a separate IncrementalExecution instance, the 
> physical plan is recreated for the logical plan, which results in a new 
> EventTimeWatermarkExec operator that is unreachable from StreamExecution. 
> As a result, the watermark is never updated and append mode never emits 
> results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18721) ForeachSink breaks Watermark in append mode

2016-12-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723691#comment-15723691
 ] 

Apache Spark commented on SPARK-18721:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16160

> ForeachSink breaks Watermark in append mode
> ---
>
> Key: SPARK-18721
> URL: https://issues.apache.org/jira/browse/SPARK-18721
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Cristian Opris
>Assignee: Shixiong Zhu
>Priority: Critical
>
> The watermark is not updated in append mode with a ForeachSink.
> Because ForeachSink creates a separate IncrementalExecution instance, the 
> physical plan is recreated for the logical plan, which results in a new 
> EventTimeWatermarkExec operator that is unreachable from StreamExecution. 
> As a result, the watermark is never updated and append mode never emits 
> results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18728) Consider using Algebird's Aggregator instead of org.apache.spark.sql.expressions.Aggregator

2016-12-05 Thread Alex Levenson (JIRA)
Alex Levenson created SPARK-18728:
-

 Summary: Consider using Algebird's Aggregator instead of 
org.apache.spark.sql.expressions.Aggregator
 Key: SPARK-18728
 URL: https://issues.apache.org/jira/browse/SPARK-18728
 Project: Spark
  Issue Type: Bug
Reporter: Alex Levenson
Priority: Minor


Mansur (https://twitter.com/mansur_ashraf) pointed out this comment in Spark's 
Aggregator:
"Based loosely on Aggregator from Algebird: https://github.com/twitter/algebird"
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala#L46

This got a few of us wondering: given that this API is still experimental, 
would you consider using Algebird's Aggregator API directly instead?

The Algebird API is not coupled to any implementation details and shouldn't 
pull in any extra dependencies.

Are there any blockers to doing that?

Thanks!
Alex
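For readers comparing the two, here is a trivial aggregator written against 
Spark's current API to make its shape concrete (zero / reduce / merge / finish 
plus encoders); Algebird's Aggregator is, roughly, built around 
prepare / semigroup / present instead, so the shapes are close but not 
identical. The example below is only an illustration, not code from either 
project.
{code}
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// A minimal sum aggregator using Spark's Aggregator API.
object LongSum extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(buffer: Long, value: Long): Long = buffer + value
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(reduction: Long): Long = reduction
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

Seq(1L, 2L, 3L).toDS().select(LongSum.toColumn).show()
{code}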



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14932) Allow DataFrame.replace() to replace values with None

2016-12-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-14932:
---
Labels: starter  (was: )

> Allow DataFrame.replace() to replace values with None
> -
>
> Key: SPARK-14932
> URL: https://issues.apache.org/jira/browse/SPARK-14932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nicholas Chammas
>Priority: Minor
>  Labels: starter
>
> Current doc: 
> http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace
> I would like to specify {{None}} as the value to substitute in. This is 
> currently 
> [disallowed|https://github.com/apache/spark/blob/9797cc20c0b8fb34659df11af8eccb9ed293c52c/python/pyspark/sql/dataframe.py#L1144-L1145].
>  My use case is for replacing bad values with {{None}} so I can then ignore 
> them with {{dropna()}}.
> For example, I have a dataset that incorrectly includes empty strings where 
> there should be {{None}} values. I would like to replace the empty strings 
> with {{None}} and then drop all null data with {{dropna()}}.
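Until {{replace}} accepts null as a replacement value, a hedged workaround 
sketch for the use case above, written in Scala with made-up column names: 
convert empty strings to nulls with {{when/otherwise}}, then drop the null rows.
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, when}
import org.apache.spark.sql.types.StringType

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("a", "x"), ("", "y")).toDF("k", "v")

// Turn empty strings in "k" into nulls, then drop rows containing nulls.
val cleaned = df
  .withColumn("k", when($"k" === "", lit(null).cast(StringType)).otherwise($"k"))
  .na.drop()
{code}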



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18694) Add StreamingQuery.explain and exception to Python and fix StreamingQueryException

2016-12-05 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-18694.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

> Add StreamingQuery.explain and exception to Python and fix 
> StreamingQueryException
> --
>
> Key: SPARK-18694
> URL: https://issues.apache.org/jira/browse/SPARK-18694
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Blocker
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14932) Allow DataFrame.replace() to replace values with None

2016-12-05 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723620#comment-15723620
 ] 

Josh Rosen commented on SPARK-14932:


I think that there's a similar issue impacting the Scala / Java equivalent of 
this API. Running

{code}
df.na.replace("*", Map[String, String]("NULL" -> null))
{code}

will produce the exception

{code}
java.lang.IllegalArgumentException: Unsupported value type java.lang.String 
(NULL).
at 
org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$convertToDouble(DataFrameNaFunctions.scala:436)
at 
org.apache.spark.sql.DataFrameNaFunctions$$anonfun$4.apply(DataFrameNaFunctions.scala:348)
at 
org.apache.spark.sql.DataFrameNaFunctions$$anonfun$4.apply(DataFrameNaFunctions.scala:348)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.DataFrameNaFunctions.replace0(DataFrameNaFunctions.scala:348)
at 
org.apache.spark.sql.DataFrameNaFunctions.replace(DataFrameNaFunctions.scala:313)
{code}

The "convertToDouble" appearing in the stracktrace is because the pattern match 
at 
https://github.com/apache/spark/blob/v2.0.2/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala#L345
 doesn't have a case to handle nulls so it ends up falling through to the 
convertToDouble case.

I bet this will be easy to fix: just add a {{case null: }} at the start of the 
pattern match, then do a change similar to what [~nchammas] is suggesting here 
to fix things for Python users.
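To make that concrete, a purely illustrative sketch of the suggested shape (a 
hypothetical stand-in, not the actual {{DataFrameNaFunctions}} code):
{code}
// Hypothetical stand-in for the replacement-value conversion: handle null
// explicitly instead of letting it fall through to the numeric conversion.
def convertReplacementValue(v: Any): Any = v match {
  case null                => null
  case s: String           => s
  case n: java.lang.Number => n.doubleValue()
  case other => throw new IllegalArgumentException(
    s"Unsupported value type ${other.getClass.getName} ($other).")
}
{code}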

> Allow DataFrame.replace() to replace values with None
> -
>
> Key: SPARK-14932
> URL: https://issues.apache.org/jira/browse/SPARK-14932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Current doc: 
> http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace
> I would like to specify {{None}} as the value to substitute in. This is 
> currently 
> [disallowed|https://github.com/apache/spark/blob/9797cc20c0b8fb34659df11af8eccb9ed293c52c/python/pyspark/sql/dataframe.py#L1144-L1145].
>  My use case is for replacing bad values with {{None}} so I can then ignore 
> them with {{dropna()}}.
> For example, I have a dataset that incorrectly includes empty strings where 
> there should be {{None}} values. I would like to replace the empty strings 
> with {{None}} and then drop all null data with {{dropna()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18721) ForeachSink breaks Watermark in append mode

2016-12-05 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-18721:


Assignee: Shixiong Zhu

> ForeachSink breaks Watermark in append mode
> ---
>
> Key: SPARK-18721
> URL: https://issues.apache.org/jira/browse/SPARK-18721
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Cristian Opris
>Assignee: Shixiong Zhu
>Priority: Critical
>
> The watermark is not updated in append mode with a ForeachSink.
> Because ForeachSink creates a separate IncrementalExecution instance, the 
> physical plan is recreated for the logical plan, which results in a new 
> EventTimeWatermarkExec operator that is unreachable from StreamExecution. 
> As a result, the watermark is never updated and append mode never emits 
> results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18719) Document spark.ui.showConsoleProgress

2016-12-05 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-18719:
---
Assignee: Nicholas

> Document spark.ui.showConsoleProgress
> -
>
> Key: SPARK-18719
> URL: https://issues.apache.org/jira/browse/SPARK-18719
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Nicholas Chammas
>Assignee: Nicholas
>Priority: Minor
> Fix For: 2.2.0
>
>
> There is currently no documentation for {{spark.ui.showConsoleProgress}}. The 
> only way to find out about it is via Stack Overflow or by searching through 
> the code.
> We should add documentation for this setting to [the config table on our 
> Configuration 
> docs|https://spark.apache.org/docs/latest/configuration.html#spark-ui].
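Until the docs are updated, a minimal sketch of toggling the setting 
programmatically; the value shown is only an example.
{code}
import org.apache.spark.sql.SparkSession

// Disable the console progress bar for this session.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.ui.showConsoleProgress", "false")
  .getOrCreate()
{code}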



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18719) Document spark.ui.showConsoleProgress

2016-12-05 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-18719.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16151
[https://github.com/apache/spark/pull/16151]

> Document spark.ui.showConsoleProgress
> -
>
> Key: SPARK-18719
> URL: https://issues.apache.org/jira/browse/SPARK-18719
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Nicholas Chammas
>Priority: Minor
> Fix For: 2.2.0
>
>
> There is currently no documentation for {{spark.ui.showConsoleProgress}}. The 
> only way to find out about it is via Stack Overflow or by searching through 
> the code.
> We should add documentation for this setting to [the config table on our 
> Configuration 
> docs|https://spark.apache.org/docs/latest/configuration.html#spark-ui].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18719) Document spark.ui.showConsoleProgress

2016-12-05 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-18719:
---
Assignee: Nicholas Chammas  (was: Nicholas)

> Document spark.ui.showConsoleProgress
> -
>
> Key: SPARK-18719
> URL: https://issues.apache.org/jira/browse/SPARK-18719
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 2.2.0
>
>
> There is currently no documentation for {{spark.ui.showConsoleProgress}}. The 
> only way to find out about it is via Stack Overflow or by searching through 
> the code.
> We should add documentation for this setting to [the config table on our 
> Configuration 
> docs|https://spark.apache.org/docs/latest/configuration.html#spark-ui].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18697) Upgrade sbt plugins

2016-12-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723579#comment-15723579
 ] 

Apache Spark commented on SPARK-18697:
--

User 'weiqingy' has created a pull request for this issue:
https://github.com/apache/spark/pull/16159

> Upgrade sbt plugins
> ---
>
> Key: SPARK-18697
> URL: https://issues.apache.org/jira/browse/SPARK-18697
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Weiqing Yang
>Priority: Trivial
>
> For 2.2.x, it's better to bring the sbt plugins up to date. The following sbt 
> plugins will be upgraded:
> {code}
> sbt-assembly: 0.11.2 -> 0.14.3
> sbteclipse-plugin: 4.0.0 -> 5.0.1
> sbt-mima-plugin: 0.1.11 -> 0.1.12
> org.ow2.asm/asm: 5.0.3 -> 5.1 
> org.ow2.asm/asm-commons: 5.0.3 -> 5.1 
> {code}
> All other plugins are up-to-date. 
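For reference, a hedged sketch of what the corresponding {{project/plugins.sbt}} 
entries might look like after the upgrade; the organization IDs are the 
commonly used ones and should be verified against the actual file.
{code}
// project/plugins.sbt (sketch); organization IDs from memory, verify before use.
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "5.0.1")
addSbtPlugin("com.typesafe" % "sbt-mima-plugin" % "0.1.12")

// The org.ow2.asm/asm and asm-commons 5.0.3 -> 5.1 bumps are plain library
// dependencies elsewhere in the build, not sbt plugins.
{code}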



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


