[jira] [Commented] (SPARK-17248) Add native Scala enum support to Dataset Encoders

2017-03-09 Thread Lee Dongjin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903245#comment-15903245
 ] 

Lee Dongjin commented on SPARK-17248:
-

[~pdxleif] // Although this question may be stale, let me answer it.

There are two ways of implementing enum types in Scala. For more information, 
see: 
http://alvinalexander.com/scala/how-to-use-scala-enums-enumeration-examples
The 'enumeration objects' mentioned in the tuning guide seem to refer to the 
second way, that is, using sealed traits and case objects.

However, it seems the sentence "Consider using numeric IDs or enumeration 
objects instead of strings for keys." in the tuning guide does not apply to 
Dataset, as in your case. As far as I know, Dataset supports only case classes 
whose fields are primitive types, not fields that are other case classes or 
objects, in this case the enum objects.

If you want a workaround, please check out this example: 
https://github.com/dongjinleekr/spark-dataset/blob/master/src/main/scala/com/github/dongjinleekr/spark/dataset/Titanic.scala
It shows an example of a Dataset built from one of the public datasets. Please pay 
attention to how I matched the Passenger case class and its corresponding 
SchemaType for the age, pClass, sex, and embarked fields.
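
For reference, here is a minimal sketch of the general workaround idea (not taken 
from the linked file), assuming a Spark shell with {{spark.implicits._}} imported:

{code}
// Store the enum as its primitive id inside the case class, so the built-in
// product encoder applies, and expose the enum value through a method.
object MyEnum extends Enumeration {
  val EnumVal1, EnumVal2 = Value
}

case class MyRow(colId: Int) {                 // only primitive constructor fields are encoded
  def col: MyEnum.Value = MyEnum(colId)        // view the stored id as the enum when needed
}

val ds = Seq(MyRow(MyEnum.EnumVal1.id), MyRow(MyEnum.EnumVal2.id)).toDS()
{code}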

[~srowen] // It seems like lots of users are experiencing similar problems. How 
about changing this issue into one about providing more examples and explanations 
in the official documentation? If needed, I would like to take the issue.

> Add native Scala enum support to Dataset Encoders
> -
>
> Key: SPARK-17248
> URL: https://issues.apache.org/jira/browse/SPARK-17248
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Silvio Fiorito
>
> Enable support for Scala enums in Encoders. Ideally, users should be able to 
> use enums as part of case classes automatically.
> Currently, this code...
> {code}
> object MyEnum extends Enumeration {
>   type MyEnum = Value
>   val EnumVal1, EnumVal2 = Value
> }
> case class MyClass(col: MyEnum.MyEnum)
> val data = Seq(MyClass(MyEnum.EnumVal1), MyClass(MyEnum.EnumVal2)).toDS()
> {code}
> ...results in this stacktrace:
> {code}
> java.lang.UnsupportedOperationException: No Encoder found for MyEnum.MyEnum
> - field (class: "scala.Enumeration.Value", name: "col")
> - root class: 
> "line550c9f34c5144aa1a1e76bcac863244717.$read.$iwC.$iwC.$iwC.$iwC.MyClass"
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:598)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:592)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:583)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:583)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:425)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:61)
>   at org.apache.spark.sql.Encoders$.product(Encoders.scala:274)
>   at 
> org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:47)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19884) Add the ability to get all registered functions from a SparkSession

2017-03-09 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903257#comment-15903257
 ] 

Herman van Hovell commented on SPARK-19884:
---

You can use the catalog for that by calling 
{{spark.catalog.listFunctions().show()}}, or you can use SQL and issue the 
{{SHOW FUNCTIONS}} command.
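
For example, in a Spark shell with an active {{spark}} session:

{code}
// Catalog API: returns a Dataset[Function] covering built-in and registered functions
spark.catalog.listFunctions().show()

// Equivalent SQL command
spark.sql("SHOW FUNCTIONS").show()
{code}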

> Add the ability to get all registered functions from a SparkSession
> ---
>
> Key: SPARK-19884
> URL: https://issues.apache.org/jira/browse/SPARK-19884
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yael Aharon
>
> It would be very useful to get the list of functions that are registered with 
> a SparkSession. Built-in and otherwise.
> This would be useful e.g. for auto-completion support in editors built around 
> Spark SQL.
> thanks!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19884) Add the ability to get all registered functions from a SparkSession

2017-03-09 Thread Yael Aharon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yael Aharon resolved SPARK-19884.
-
Resolution: Not A Problem

Thank you so much for your reply

> Add the ability to get all registered functions from a SparkSession
> ---
>
> Key: SPARK-19884
> URL: https://issues.apache.org/jira/browse/SPARK-19884
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yael Aharon
>
> It would be very useful to get the list of functions that are registered with 
> a SparkSession. Built-in and otherwise.
> This would be useful e.g. for auto-completion support in editors built around 
> Spark SQL.
> thanks!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15790) Audit @Since annotations in ML

2017-03-09 Thread Ehsun Behravesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904652#comment-15904652
 ] 

Ehsun Behravesh commented on SPARK-15790:
-

Does this JIRA still need someone to work on? 

> Audit @Since annotations in ML
> --
>
> Key: SPARK-15790
> URL: https://issues.apache.org/jira/browse/SPARK-15790
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> Many classes & methods in ML are missing {{@Since}} annotations. Audit what's 
> missing and add annotations to public API constructors, vals and methods.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17557) SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary

2017-03-09 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-17557.
--
Resolution: Cannot Reproduce

^ I can't reproduce this either. Let me resolve it. Please reopen this if anyone 
still faces the issue and I was mistaken.

> SQL query on parquet table java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary
> -
>
> Key: SPARK-17557
> URL: https://issues.apache.org/jira/browse/SPARK-17557
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>
> Working on 1.6.2, broken on 2.0
> {code}
> select * from logs.a where year=2016 and month=9 and day=14 limit 100
> {code}
> {code}
> java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary
>   at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6634) Allow replacing columns in Transformers

2017-03-09 Thread Tree Field (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904445#comment-15904445
 ] 

Tree Field commented on SPARK-6634:
---

I want this feature too, because I often override UnaryTransformer myself to 
enable this.

It seems the replacement is only prevented in the transformSchema method. Unlike 
before v1.4, the DataFrame withColumn method used in UnaryTransformer now allows 
replacing the input column.

Are there any other reasons this is not allowed in Transformers, especially in 
UnaryTransformer?
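
As a minimal sketch of the {{withColumn}} behavior mentioned above (assuming a 
Spark shell with {{spark.implicits._}} and {{org.apache.spark.sql.functions._}} 
imported):

{code}
// withColumn overwrites an existing column when the name matches, which is what
// a Transformer configured with inputCol == outputCol would rely on.
val df = Seq(("a", 1), ("b", 2)).toDF("text", "value")
val replaced = df.withColumn("text", upper($"text"))   // the "text" column is replaced in place
replaced.show()
{code}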





> Allow replacing columns in Transformers
> ---
>
> Key: SPARK-6634
> URL: https://issues.apache.org/jira/browse/SPARK-6634
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Currently, Transformers do not allow input and output columns to share the 
> same name.  (In fact, this is not allowed but also not even checked.)
> Short-term proposal: Disallow input and output columns with the same name, 
> and add a check in transformSchema.
> Long-term proposal: Allow input & output columns with the same name, and 
> where the behavior is that the output columns replace input columns with the 
> same name.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions

2017-03-09 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904460#comment-15904460
 ] 

Kazuaki Ishizaki commented on SPARK-14083:
--

I rebased this onto master: 
https://github.com/kiszk/spark/tree/expression-analysis4

> Analyze JVM bytecode and turn closures into Catalyst expressions
> 
>
> Key: SPARK-14083
> URL: https://issues.apache.org/jira/browse/SPARK-14083
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> One big advantage of the Dataset API is the type safety, at the cost of 
> performance due to heavy reliance on user-defined closures/lambdas. These 
> closures are typically slower than expressions because we have more 
> flexibility to optimize expressions (known data types, no virtual function 
> calls, etc). In many cases, it's actually not going to be very difficult to 
> look into the byte code of these closures and figure out what they are trying 
> to do. If we can understand them, then we can turn them directly into 
> Catalyst expressions for more optimized executions.
> Some examples are:
> {code}
> df.map(_.name)  // equivalent to expression col("name")
> ds.groupBy(_.gender)  // equivalent to expression col("gender")
> df.filter(_.age > 18)  // equivalent to expression GreaterThan(col("age"), lit(18))
> df.map(_.id + 1)  // equivalent to Add(col("id"), lit(1))
> {code}
> The goal of this ticket is to design a small framework for byte code analysis 
> and use that to convert closures/lambdas into Catalyst expressions in order 
> to speed up Dataset execution. It is a little bit futuristic, but I believe 
> it is very doable. The framework should be easy to reason about (e.g. similar 
> to Catalyst).
> Note that there is a big emphasis on "small" and "easy to reason about". A patch 
> should be rejected if it is too complicated or difficult to reason about.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19892) Implement findAnalogies method for Word2VecModel

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19892:


Assignee: Apache Spark

> Implement findAnalogies method for Word2VecModel 
> -
>
> Key: SPARK-19892
> URL: https://issues.apache.org/jira/browse/SPARK-19892
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Benjamin Radford
>Assignee: Apache Spark
>Priority: Minor
>  Labels: features, newbie
>
> Word2VecModel is missing a method that allows for performing analogy-like 
> queries on word vectors (e.g. King + Woman - Man = Queen). This is a 
> functionality common to other word2vec implementations (see gensim) and is a 
> major component of word2vec's appeal as cited in seminal works on the model 
> (https://code.google.com/archive/p/word2vec/). 
> An implementation of this method, findAnalogies, should accept three 
> arguments:
> * positive - Array[String] of similar words
> * negative - Array[String] of dissimilar words
> * num - Int number of synonyms or nearest neighbors to calculated vector



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19892) Implement findAnalogies method for Word2VecModel

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904471#comment-15904471
 ] 

Apache Spark commented on SPARK-19892:
--

User 'benradford' has created a pull request for this issue:
https://github.com/apache/spark/pull/17234

> Implement findAnalogies method for Word2VecModel 
> -
>
> Key: SPARK-19892
> URL: https://issues.apache.org/jira/browse/SPARK-19892
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Benjamin Radford
>Priority: Minor
>  Labels: features, newbie
>
> Word2VecModel is missing a method that allows for performing analogy-like 
> queries on word vectors (e.g. King + Woman - Man = Queen). This is a 
> functionality common to other word2vec implementations (see gensim) and is a 
> major component of word2vec's appeal as cited in seminal works on the model 
> (https://code.google.com/archive/p/word2vec/). 
> An implementation of this method, findAnalogies, should accept three 
> arguments:
> * positive - Array[String] of similar words
> * negative - Array[String] of dissimilar words
> * num - Int number of synonyms or nearest neighbors to calculated vector



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19892) Implement findAnalogies method for Word2VecModel

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19892:


Assignee: (was: Apache Spark)

> Implement findAnalogies method for Word2VecModel 
> -
>
> Key: SPARK-19892
> URL: https://issues.apache.org/jira/browse/SPARK-19892
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Benjamin Radford
>Priority: Minor
>  Labels: features, newbie
>
> Word2VecModel is missing a method that allows for performing analogy-like 
> queries on word vectors (e.g. King + Woman - Man = Queen). This is a 
> functionality common to other word2vec implementations (see gensim) and is a 
> major component of word2vec's appeal as cited in seminal works on the model 
> (https://code.google.com/archive/p/word2vec/). 
> An implementation of this method, findAnalogies, should accept three 
> arguments:
> * positive - Array[String] of similar words
> * negative - Array[String] of dissimilar words
> * num - Int number of synonyms or nearest neighbors to calculated vector



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19320) Allow guaranteed amount of GPU to be used when launching jobs

2017-03-09 Thread Ji Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904479#comment-15904479
 ] 

Ji Yan commented on SPARK-19320:


I'm proposing to add a configuration parameter (spark.mesos.gpus) to guarantee a 
hard limit on the number of GPUs. To avoid conflicts, it will override 
spark.mesos.gpus.max whenever spark.mesos.gpus is greater than 0.
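
For illustration, a hypothetical configuration under this proposal 
({{spark.mesos.gpus}} does not exist yet; {{spark.mesos.gpus.max}} is the existing 
setting it would override):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.mesos.gpus", "2")       // proposed: hard guarantee on the number of GPUs
  .set("spark.mesos.gpus.max", "4")   // existing: upper bound, overridden when spark.mesos.gpus > 0
{code}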

> Allow guaranteed amount of GPU to be used when launching jobs
> -
>
> Key: SPARK-19320
> URL: https://issues.apache.org/jira/browse/SPARK-19320
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>
> Currently the only configuration for using GPUs with Mesos sets the 
> maximum number of GPUs a job will take from an offer, but it doesn't 
> guarantee exactly how many.
> We should have a configuration that sets this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18112:


Assignee: Apache Spark  (was: Xiao Li)

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Apache Spark
>Priority: Critical
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have 
> also been out for a long time, but so far Spark only supports reading Hive 
> metastore data from Hive 1.2.1 and older versions. Since Hive 2.x fixes many 
> bugs and improves performance, it is better and urgent to upgrade Spark to 
> support Hive 2.x.
> Failed to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)
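> For illustration only, a hypothetical session configuration once 2.x support 
> lands (these config keys already exist; accepting a 2.x version value is the 
> assumed outcome of this issue):
> {code}
> val spark = org.apache.spark.sql.SparkSession.builder()
>   .config("spark.sql.hive.metastore.version", "2.1")  // assumption: accepted after this fix
>   .config("spark.sql.hive.metastore.jars", "maven")   // fetch matching Hive client jars
>   .enableHiveSupport()
>   .getOrCreate()
> {code}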



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18112:


Assignee: Xiao Li  (was: Apache Spark)

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have 
> also been out for a long time, but so far Spark only supports reading Hive 
> metastore data from Hive 1.2.1 and older versions. Since Hive 2.x fixes many 
> bugs and improves performance, it is better and urgent to upgrade Spark to 
> support Hive 2.x.
> Failed to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904429#comment-15904429
 ] 

Apache Spark commented on SPARK-18112:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17232

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have 
> also been out for a long time, but so far Spark only supports reading Hive 
> metastore data from Hive 1.2.1 and older versions. Since Hive 2.x fixes many 
> bugs and improves performance, it is better and urgent to upgrade Spark to 
> support Hive 2.x.
> Failed to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19892) Implement findAnalogies method for Word2VecModel

2017-03-09 Thread Benjamin Radford (JIRA)
Benjamin Radford created SPARK-19892:


 Summary: Implement findAnalogies method for Word2VecModel 
 Key: SPARK-19892
 URL: https://issues.apache.org/jira/browse/SPARK-19892
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 2.1.0
Reporter: Benjamin Radford
Priority: Minor


Word2VecModel is missing a method that allows for performing analogy-like 
queries on word vectors (e.g. King + Woman - Man = Queen). This is a 
functionality common to other word2vec implementations (see gensim) and is a 
major component of word2vec's appeal as cited in seminal works on the model 
(https://code.google.com/archive/p/word2vec/). 

An implementation of this method, findAnalogies, should accept three arguments:
* positive - Array[String] of similar words
* negative - Array[String] of dissimilar words
* num - Int number of synonyms or nearest neighbors to calculated vector
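
A possible Scala signature, sketched only to make the three arguments above 
concrete (names and return type are assumptions, not a committed API):

{code}
// Returns the `num` vocabulary words whose vectors are closest to
// sum(positive word vectors) - sum(negative word vectors), with similarity scores.
def findAnalogies(positive: Array[String],
                  negative: Array[String],
                  num: Int): Array[(String, Double)] = ???
{code}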



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11569) StringIndexer transform fails when column contains nulls

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904448#comment-15904448
 ] 

Apache Spark commented on SPARK-11569:
--

User 'crackcell' has created a pull request for this issue:
https://github.com/apache/spark/pull/17233

> StringIndexer transform fails when column contains nulls
> 
>
> Key: SPARK-11569
> URL: https://issues.apache.org/jira/browse/SPARK-11569
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 1.4.0, 1.5.0, 1.6.0
>Reporter: Maciej Szymkiewicz
>
> Transforming a column containing {{null}} values using {{StringIndexer}} 
> results in a {{java.lang.NullPointerException}}
> {code}
> from pyspark.ml.feature import StringIndexer
> df = sqlContext.createDataFrame([("a", 1), (None, 2)], ("k", "v"))
> df.printSchema()
> ## root
> ##  |-- k: string (nullable = true)
> ##  |-- v: long (nullable = true)
> indexer = StringIndexer(inputCol="k", outputCol="kIdx")
> indexer.fit(df).transform(df)
> ##  py4j.protocol.Py4JJavaError: An error occurred while calling o75.json.
> ## : java.lang.NullPointerException
> {code}
> Problem disappears when we drop 
> {code}
> df1 = df.na.drop()
> indexer.fit(df1).transform(df1)
> {code}
> or replace {{nulls}}
> {code}
> from pyspark.sql.functions import col, when
> k = col("k")
> df2 = df.withColumn("k", when(k.isNull(), "__NA__").otherwise(k))
> indexer.fit(df2).transform(df2)
> {code}
> and cannot be reproduced using Scala API
> {code}
> import org.apache.spark.ml.feature.StringIndexer
> val df = sc.parallelize(Seq(("a", 1), (null, 2))).toDF("k", "v")
> df.printSchema
> // root
> //  |-- k: string (nullable = true)
> //  |-- v: integer (nullable = false)
> val indexer = new StringIndexer().setInputCol("k").setOutputCol("kIdx")
> indexer.fit(df).transform(df).count
> // 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19320) Allow guaranteed amount of GPU to be used when launching jobs

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19320:


Assignee: Apache Spark

> Allow guaranteed amount of GPU to be used when launching jobs
> -
>
> Key: SPARK-19320
> URL: https://issues.apache.org/jira/browse/SPARK-19320
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>Assignee: Apache Spark
>
> Currently the only configuration for using GPUs with Mesos sets the 
> maximum number of GPUs a job will take from an offer, but it doesn't 
> guarantee exactly how many.
> We should have a configuration that sets this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19320) Allow guaranteed amount of GPU to be used when launching jobs

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19320:


Assignee: (was: Apache Spark)

> Allow guaranteed amount of GPU to be used when launching jobs
> -
>
> Key: SPARK-19320
> URL: https://issues.apache.org/jira/browse/SPARK-19320
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>
> Currently the only configuration for using GPUs with Mesos sets the 
> maximum number of GPUs a job will take from an offer, but it doesn't 
> guarantee exactly how many.
> We should have a configuration that sets this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19320) Allow guaranteed amount of GPU to be used when launching jobs

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904484#comment-15904484
 ] 

Apache Spark commented on SPARK-19320:
--

User 'yanji84' has created a pull request for this issue:
https://github.com/apache/spark/pull/17235

> Allow guaranteed amount of GPU to be used when launching jobs
> -
>
> Key: SPARK-19320
> URL: https://issues.apache.org/jira/browse/SPARK-19320
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>
> Currently the only configuration for using GPUs with Mesos sets the 
> maximum number of GPUs a job will take from an offer, but it doesn't 
> guarantee exactly how many.
> We should have a configuration that sets this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19757) Executor with task scheduled could be killed due to idleness

2017-03-09 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-19757.

   Resolution: Fixed
 Assignee: Jimmy Xiang
Fix Version/s: 2.2.0

> Executor with task scheduled could be killed due to idleness
> 
>
> Key: SPARK-19757
> URL: https://issues.apache.org/jira/browse/SPARK-19757
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Jimmy Xiang
>Assignee: Jimmy Xiang
>Priority: Minor
> Fix For: 2.2.0
>
>
> With dynamic executor allocation enabled in YARN mode, if another job is 
> submitted a while after one job finishes, there is a race between killing 
> idle executors and scheduling new tasks on those executors. Sometimes an 
> executor is killed right after a task is scheduled on it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19793) Use clock.getTimeMillis when mark task as finished in TaskSetManager.

2017-03-09 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-19793.

   Resolution: Fixed
 Assignee: jin xing
Fix Version/s: 2.2.0

> Use clock.getTimeMillis when mark task as finished in TaskSetManager.
> -
>
> Key: SPARK-19793
> URL: https://issues.apache.org/jira/browse/SPARK-19793
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: jin xing
>Assignee: jin xing
>Priority: Minor
> Fix For: 2.2.0
>
>
> TaskSetManager currently uses *System.currentTimeMillis* when marking a task as 
> finished in *handleSuccessfulTask* and *handleFailedTask*, so developers 
> cannot set a task's finishing time in unit tests. In 
> *handleSuccessfulTask*, the task's duration = System.currentTimeMillis - 
> launchTime (where launchTime can be set via the *clock*), so the result is not correct.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19882) Pivot with null as the pivot value throws NPE

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903643#comment-15903643
 ] 

Apache Spark commented on SPARK-19882:
--

User 'aray' has created a pull request for this issue:
https://github.com/apache/spark/pull/17226

> Pivot with null as the pivot value throws NPE
> -
>
> Key: SPARK-19882
> URL: https://issues.apache.org/jira/browse/SPARK-19882
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> This seems a regression.
> - Spark 1.6
> {code}
> Seq(Tuple1(None), 
> Tuple1(Some(1))).toDF("a").groupBy($"a").pivot("a").count().show()
> {code}
> prints
> {code}
> +----+----+---+
> |   a|null|  1|
> +----+----+---+
> |null|   0|  0|
> |   1|   0|  1|
> +----+----+---+
> {code}
> - Current master
> {code}
> Seq(Tuple1(None), 
> Tuple1(Some(1))).toDF("a").groupBy($"a").pivot("a").count().show()
> {code}
> prints
> {code}
> java.lang.NullPointerException was thrown.
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.PivotFirst$$anonfun$4.apply(PivotFirst.scala:145)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.PivotFirst$$anonfun$4.apply(PivotFirst.scala:143)
>   at scala.collection.immutable.List.map(List.scala:273)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.PivotFirst.<init>(PivotFirst.scala:143)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolvePivot$$anonfun$apply$7$$anonfun$24.apply(Analyzer.scala:509)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ not interpreted as NULL partition value in partitioned persisted tables

2017-03-09 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903749#comment-15903749
 ] 

Cheng Lian commented on SPARK-19887:


cc [~cloud_fan]

> __HIVE_DEFAULT_PARTITION__ not interpreted as NULL partition value in 
> partitioned persisted tables
> --
>
> Key: SPARK-19887
> URL: https://issues.apache.org/jira/browse/SPARK-19887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Cheng Lian
>
> The following Spark shell snippet under Spark 2.1 reproduces this issue:
> {code}
> val data = Seq(
>   ("p1", 1, 1),
>   ("p2", 2, 2),
>   (null, 3, 3)
> )
> // Correct case: Saving partitioned data to file system.
> val path = "/tmp/partitioned"
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   parquet(path)
> spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
> // +---+---+---+
> // |c  |a  |b  |
> // +---+---+---+
> // |2  |p2 |2  |
> // |1  |p1 |1  |
> // +---+---+---+
> // Incorrect case: Saving partitioned data as persisted table.
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   saveAsTable("test_null")
> spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
> // +---+--------------------------+---+
> // |c  |a                         |b  |
> // +---+--------------------------+---+
> // |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
> // |1  |p1                        |1  |
> // |2  |p2                        |2  |
> // +---+--------------------------+---+
> {code}
> Hive-style partitioned tables use the magic string 
> {{"__HIVE_DEFAULT_PARTITION__"}} to indicate {{NULL}} partition values in 
> partition directory names. However, in the case of persisted partitioned tables, 
> this magic string is not interpreted as {{NULL}} but as a regular string.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19885) The config ignoreCorruptFiles doesn't work for CSV

2017-03-09 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-19885:


 Summary: The config ignoreCorruptFiles doesn't work for CSV
 Key: SPARK-19885
 URL: https://issues.apache.org/jira/browse/SPARK-19885
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Shixiong Zhu


CSVFileFormat.inferSchema doesn't use FileScanRDD so the SQL 
"ignoreCorruptFiles" doesn't work.

{code}
java.io.EOFException: Unexpected end of input stream
at 
org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145)
at 
org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at 
org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:248)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:266)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:211)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at 
scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:214)
at scala.collection.AbstractIterator.aggregate(Iterator.scala:1336)
at 
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1112)
at 
org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1112)
at 
org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
at 
org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1981)
at org.apache.spark.rdd.RDD$$anonfun$aggregate$1.apply(RDD.scala:1114)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at 

[jira] [Resolved] (SPARK-19715) Option to Strip Paths in FileSource

2017-03-09 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19715.
--
   Resolution: Fixed
 Assignee: Liwei Lin
Fix Version/s: 2.2.0

> Option to Strip Paths in FileSource
> ---
>
> Key: SPARK-19715
> URL: https://issues.apache.org/jira/browse/SPARK-19715
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Michael Armbrust
>Assignee: Liwei Lin
> Fix For: 2.2.0
>
>
> Today, we compare the whole path when deciding if a file is new in the 
> FileSource for structured streaming.  However, this can cause false 
> negatives in the case where the path has changed in a cosmetic way (i.e. 
> changing s3n to s3a).  We should add an option {{fileNameOnly}} that causes 
> the new file check to be based only on the filename (but still store the 
> whole path in the log).
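> A minimal usage sketch of the new option once it is available (the path is 
> hypothetical):
> {code}
> val stream = spark.readStream
>   .format("text")
>   .option("fileNameOnly", "true")   // compare files by name only, ignoring scheme/prefix changes
>   .load("s3a://bucket/input")
> {code}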



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19507) pyspark.sql.types._verify_type() exceptions too broad to debug collections or nested data

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903680#comment-15903680
 ] 

Apache Spark commented on SPARK-19507:
--

User 'dgingrich' has created a pull request for this issue:
https://github.com/apache/spark/pull/17227

> pyspark.sql.types._verify_type() exceptions too broad to debug collections or 
> nested data
> -
>
> Key: SPARK-19507
> URL: https://issues.apache.org/jira/browse/SPARK-19507
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: macOS Sierra 10.12.3
> Spark 2.1.0, installed via Homebrew
>Reporter: David Gingrich
>Priority: Trivial
>
> The private function pyspark.sql.types._verify_type() recursively checks an 
> object against a datatype, raising an exception if the object does not 
> satisfy the type.  These messages are not specific enough to debug a data 
> error in a collection or nested data, for instance:
> {quote}
> >>> import pyspark.sql.types as typ
> >>> schema = typ.StructType([typ.StructField('nest1', 
> >>> typ.MapType(typ.StringType(), typ.ArrayType(typ.FloatType())))])
> >>> typ._verify_type({'nest1': {'nest2': [1]}}, schema)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1355, in 
> _verify_type
> _verify_type(obj.get(f.name), f.dataType, f.nullable, name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1349, in 
> _verify_type
> _verify_type(v, dataType.valueType, dataType.valueContainsNull, 
> name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1342, in 
> _verify_type
> _verify_type(i, dataType.elementType, dataType.containsNull, 
> name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1325, in 
> _verify_type
> % (name, dataType, obj, type(obj)))
> TypeError: FloatType can not accept object 1 in type <class 'int'>
> {quote}
> Passing and printing a field name would make debugging easier.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19861) watermark should not be a negative time.

2017-03-09 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19861.
--
   Resolution: Fixed
 Assignee: Genmao Yu
Fix Version/s: 2.2.0
   2.1.1

> watermark should not be a negative time.
> 
>
> Key: SPARK-19861
> URL: https://issues.apache.org/jira/browse/SPARK-19861
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Assignee: Genmao Yu
>Priority: Minor
> Fix For: 2.1.1, 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19886) reportDataLoss cause != null check is wrong for Structured Streaming KafkaSource

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19886:


Assignee: Apache Spark  (was: Burak Yavuz)

> reportDataLoss cause != null check is wrong for Structured Streaming 
> KafkaSource
> 
>
> Key: SPARK-19886
> URL: https://issues.apache.org/jira/browse/SPARK-19886
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ not interpreted as NULL partition value in partitioned persisted tables

2017-03-09 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-19887:
--

 Summary: __HIVE_DEFAULT_PARTITION__ not interpreted as NULL 
partition value in partitioned persisted tables
 Key: SPARK-19887
 URL: https://issues.apache.org/jira/browse/SPARK-19887
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Cheng Lian


The following Spark shell snippet under Spark 2.1 reproduces this issue:

{code}
val data = Seq(
  ("p1", 1, 1),
  ("p2", 2, 2),
  (null, 3, 3)
)

// Correct case: Saving partitioned data to file system.

val path = "/tmp/partitioned"

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  parquet(path)

spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
// +---+---+---+
// |c  |a  |b  |
// +---+---+---+
// |2  |p2 |2  |
// |1  |p1 |1  |
// +---+---+---+

// Incorrect case: Saving partitioned data as persisted table.

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  saveAsTable("test_null")

spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
// +---+--------------------------+---+
// |c  |a                         |b  |
// +---+--------------------------+---+
// |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
// |1  |p1                        |1  |
// |2  |p2                        |2  |
// +---+--------------------------+---+
{code}

Hive-style partitioned tables use the magic string {{"__HIVE_DEFAULT_PARTITION__"}} 
to indicate {{NULL}} partition values in partition directory names. However, in 
the case of persisted partitioned tables, this magic string is not interpreted as 
{{NULL}} but as a regular string.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19886) reportDataLoss cause != null check is wrong for Structured Streaming KafkaSource

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19886:


Assignee: Burak Yavuz  (was: Apache Spark)

> reportDataLoss cause != null check is wrong for Structured Streaming 
> KafkaSource
> 
>
> Key: SPARK-19886
> URL: https://issues.apache.org/jira/browse/SPARK-19886
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19886) reportDataLoss cause != null check is wrong for Structured Streaming KafkaSource

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903733#comment-15903733
 ] 

Apache Spark commented on SPARK-19886:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/17228

> reportDataLoss cause != null check is wrong for Structured Streaming 
> KafkaSource
> 
>
> Key: SPARK-19886
> URL: https://issues.apache.org/jira/browse/SPARK-19886
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19886) reportDataLoss cause != null check is wrong for Structured Streaming KafkaSource

2017-03-09 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-19886:
---

 Summary: reportDataLoss cause != null check is wrong for 
Structured Streaming KafkaSource
 Key: SPARK-19886
 URL: https://issues.apache.org/jira/browse/SPARK-19886
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Burak Yavuz
Assignee: Burak Yavuz






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19067) mapGroupsWithState - arbitrary stateful operations with Structured Streaming (similar to DStream.mapWithState)

2017-03-09 Thread Amit Sela (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903532#comment-15903532
 ] 

Amit Sela edited comment on SPARK-19067 at 3/9/17 6:15 PM:
---

[~tdas] I just read the PR, and I'm very excited for Spark to provide such a 
powerful stateful operator!
I added a few comments in the PR based on my first impressions as a user, hope 
you don't mind.
I assume that for event-time timeouts you'd look at the watermark time instead 
of wall time, correct? How would that work? If I understand it right, it's all 
represented as a table, so the "Watermark Manager" would constantly write 
updates to the table in the "Watermark Column"?


was (Author: amitsela):
[~tdas] I just read the PR, and I'm very excited for Spark to provide such a 
powerful stateful operator!
I added a few comments in the PR based on my first impressions, hope you don't 
mind.
I assume that for event-time-timeouts you'd look at the Watermark time instead 
of Wall time, correct ? how would that work ? If I get it right it's all 
represented as a table so the "Watermark Manager" would constantly right 
updates to the table in the "Watermark Column" ?

> mapGroupsWithState - arbitrary stateful operations with Structured Streaming 
> (similar to DStream.mapWithState)
> --
>
> Key: SPARK-19067
> URL: https://issues.apache.org/jira/browse/SPARK-19067
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Tathagata Das
>Priority: Critical
>
> Right now the only way to do stateful operations is with Aggregator or 
> UDAF.  However, this does not give users control of emission or expiration of 
> state, making it hard to implement things like sessionization.  We should add 
> a more general construct (probably similar to {{DStream.mapWithState}}) to 
> structured streaming. Here is the design. 
> *Requirements*
> - Users should be able to specify a function that can do the following
> - Access the input row corresponding to a key
> - Access the previous state corresponding to a key
> - Optionally, update or remove the state
> - Output any number of new rows (or none at all)
> *Proposed API*
> {code}
> //  New methods on KeyValueGroupedDataset 
> class KeyValueGroupedDataset[K, V] {
>   // Scala friendly
>   def mapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], State[S]) => U)
>   def flatMapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], State[S]) => Iterator[U])
>   // Java friendly
>   def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
>   def flatMapGroupsWithState[S, U](func: FlatMapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
> }
> // --- New Java-friendly function classes --- 
> public interface MapGroupsWithStateFunction<K, V, S, R> extends Serializable {
>   R call(K key, Iterator<V> values, State<S> state) throws Exception;
> }
> public interface FlatMapGroupsWithStateFunction<K, V, S, R> extends Serializable {
>   Iterator<R> call(K key, Iterator<V> values, State<S> state) throws Exception;
> }
> // -- Wrapper class for state data -- 
> trait KeyedState[S] {
>   def exists(): Boolean
>   def get(): S                 // throws an Exception if state does not exist
>   def getOption(): Option[S]
>   def update(newState: S): Unit
>   def remove(): Unit           // exists() will be false after this
> }
> {code}
> Key semantics of the KeyedState class
> - The state can be null.
> - If state.remove() is called, then state.exists() will return false, and 
> getOption will return None.
> - After state.update(newState) is called, state.exists() will 
> return true, and getOption will return Some(...). 
> - None of the operations are thread-safe. This is to avoid memory barriers.
> *Usage*
> {code}
> val stateFunc = (word: String, words: Iterator[String], runningCount: KeyedState[Long]) => {
>   val newCount = words.size + runningCount.getOption.getOrElse(0L)
>   runningCount.update(newCount)
>   (word, newCount)
> }
> dataset                                                 // type is Dataset[String]
>   .groupByKey[String](w => w)                           // generates KeyValueGroupedDataset[String, String]
>   .mapGroupsWithState[Long, (String, Long)](stateFunc)  // returns Dataset[(String, Long)]
> {code}
> *Future Directions*
> - Timeout-based state expiration (for keys that have not received data for a while)
> - General expression-based expiration
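
To make the last requirement above concrete ("output any number of new rows, or 
none at all"), here is a minimal sketch against the proposed API, which had not 
been merged at the time of this thread; {{sessionFunc}} and the threshold are 
hypothetical, not part of the design doc:

{code}
// Illustrative only -- written against the proposed, not-yet-merged API above.
// Emit a row only once a key has accumulated at least 10 events; otherwise emit nothing.
val sessionFunc = (key: String, events: Iterator[String], total: KeyedState[Long]) => {
  val newTotal = total.getOption.getOrElse(0L) + events.size
  total.update(newTotal)
  if (newTotal >= 10) Iterator.single((key, newTotal)) else Iterator.empty
}

dataset                                                      // Dataset[String]
  .groupByKey(w => w)                                        // KeyValueGroupedDataset[String, String]
  .flatMapGroupsWithState[Long, (String, Long)](sessionFunc) // Dataset[(String, Long)]
{code}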

[jira] [Updated] (SPARK-19757) Executor with task scheduled could be killed due to idleness

2017-03-09 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-19757:
---
Reporter: jin xing  (was: Jimmy Xiang)

> Executor with task scheduled could be killed due to idleness
> 
>
> Key: SPARK-19757
> URL: https://issues.apache.org/jira/browse/SPARK-19757
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: jin xing
>Assignee: Jimmy Xiang
>Priority: Minor
> Fix For: 2.2.0
>
>
> With dynamic executor allocation enabled on yarn mode, after one job is 
> finished for a while, submit another job, then there is race between killing 
> idle executors and scheduling new task on these executors. Sometimes, some 
> executor is killed right after a task is scheduled on it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in partitioned persisted tables

2017-03-09 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19887:
---
Summary: __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition 
value in partitioned persisted tables  (was: __HIVE_DEFAULT_PARTITION__ not 
interpreted as NULL partition value in partitioned persisted tables)

> __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in 
> partitioned persisted tables
> -
>
> Key: SPARK-19887
> URL: https://issues.apache.org/jira/browse/SPARK-19887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Cheng Lian
>
> The following Spark shell snippet under Spark 2.1 reproduces this issue:
> {code}
> val data = Seq(
>   ("p1", 1, 1),
>   ("p2", 2, 2),
>   (null, 3, 3)
> )
> // Correct case: Saving partitioned data to file system.
> val path = "/tmp/partitioned"
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   parquet(path)
> spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
> // +---+---+---+
> // |c  |a  |b  |
> // +---+---+---+
> // |2  |p2 |2  |
> // |1  |p1 |1  |
> // +---+---+---+
> // Incorrect case: Saving partitioned data as persisted table.
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   saveAsTable("test_null")
> spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
> // +---+--+---+
> // |c  |a |b  |
> // +---+--+---+
> // |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
> // |1  |p1|1  |
> // |2  |p2|2  |
> // +---+--+---+
> {code}
> Hive-style partitioned tables use the magic string 
> {{"__HIVE_DEFAULT_PARTITION__"}} to indicate {{NULL}} partition values in 
> partition directory names. However, in the case of a persisted partitioned table, 
> this magic string is not interpreted as {{NULL}} but as a regular string.
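
Until this is fixed, one possible client-side workaround is to map the magic 
string back to {{NULL}} after reading; this is only a sketch and assumes column 
{{a}} never legitimately contains the magic string:

{code}
import org.apache.spark.sql.functions.{col, when}

// when(...) without otherwise(...) yields NULL for non-matching rows,
// so the marker string collapses back into a real NULL partition value.
val repaired = spark.table("test_null")
  .withColumn("a", when(col("a") =!= "__HIVE_DEFAULT_PARTITION__", col("a")))

repaired.filter(col("a").isNotNull).show(truncate = false)
{code}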



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19067) mapGroupsWithState - arbitrary stateful operations with Structured Streaming (similar to DStream.mapWithState)

2017-03-09 Thread Amit Sela (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903532#comment-15903532
 ] 

Amit Sela commented on SPARK-19067:
---

[~tdas] I just read the PR, and I'm very excited for Spark to provide such a 
powerful stateful operator!
I added a few comments in the PR based on my first impressions; hope you don't 
mind.
I assume that for event-time timeouts you'd look at the watermark time instead 
of wall-clock time, correct? How would that work? If I understand it correctly, 
everything is represented as a table, so would the "Watermark Manager" 
constantly write updates to the table in the "Watermark Column"?

> mapGroupsWithState - arbitrary stateful operations with Structured Streaming 
> (similar to DStream.mapWithState)
> --
>
> Key: SPARK-19067
> URL: https://issues.apache.org/jira/browse/SPARK-19067
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Tathagata Das
>Priority: Critical
>
> Right now the only way to do stateful operations is with Aggregator or 
> UDAF.  However, this does not give users control over emission or expiration 
> of state, making it hard to implement things like sessionization.  We should 
> add a more general construct (probably similar to {{DStream.mapWithState}}) to 
> structured streaming. Here is the design. 
> *Requirements*
> - Users should be able to specify a function that can do the following
> - Access the input row corresponding to a key
> - Access the previous state corresponding to a key
> - Optionally, update or remove the state
> - Output any number of new rows (or none at all)
> *Proposed API*
> {code}
> //  New methods on KeyValueGroupedDataset 
> class KeyValueGroupedDataset[K, V] {
>   // Scala friendly
>   def mapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], KeyedState[S]) => U)
>   def flatMapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], KeyedState[S]) => Iterator[U])
>   // Java friendly
>   def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
>   def flatMapGroupsWithState[S, U](func: FlatMapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
> }
> // --- New Java-friendly function classes --- 
> public interface MapGroupsWithStateFunction<K, V, S, R> extends Serializable {
>   R call(K key, Iterator<V> values, KeyedState<S> state) throws Exception;
> }
> public interface FlatMapGroupsWithStateFunction<K, V, S, R> extends Serializable {
>   Iterator<R> call(K key, Iterator<V> values, KeyedState<S> state) throws Exception;
> }
> // -- Wrapper class for state data -- 
> trait KeyedState[S] {
>   def exists(): Boolean
>   def get(): S                   // throws an exception if state does not exist
>   def getOption(): Option[S]
>   def update(newState: S): Unit
>   def remove(): Unit             // exists() will be false after this
> }
> {code}
> Key semantics of the KeyedState class
> - The state can be null.
> - If state.remove() is called, then state.exists() will return false, and 
> getOption will return None.
> - After state.update(newState) is called, state.exists() will 
> return true, and getOption will return Some(...). 
> - None of the operations are thread-safe. This is to avoid memory barriers.
> *Usage*
> {code}
> val stateFunc = (word: String, words: Iterator[String], runningCount: KeyedState[Long]) => {
>   val newCount = words.size + runningCount.getOption.getOrElse(0L)
>   runningCount.update(newCount)
>   (word, newCount)
> }
> dataset                                                 // type is Dataset[String]
>   .groupByKey[String](w => w)                           // generates KeyValueGroupedDataset[String, String]
>   .mapGroupsWithState[Long, (String, Long)](stateFunc)  // returns Dataset[(String, Long)]
> {code}
> *Future Directions*
> - Timeout-based state expiration (for keys that have not received data for a while)
> - General expression-based expiration



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19875) Map->filter on many columns gets stuck in constraint inference optimization code

2017-03-09 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903584#comment-15903584
 ] 

Kazuaki Ishizaki commented on SPARK-19875:
--

I got the following stack trace. The hang seems to occur during constraint 
propagation.
With the SPARK-19846 patch applied, the job no longer gets stuck on a 50-column 
CSV dataset.
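
For context, here is a minimal reproduction sketch based only on the issue 
title (the column count and file path are hypothetical); the stack trace below 
shows where the optimizer spins:

{code}
// Alias every column of a wide (e.g. 50-column) dataset and filter on all of them,
// forcing InferFiltersFromConstraints to combine constraints across all aliases.
import org.apache.spark.sql.functions.col

val df = spark.read.option("header", "true").csv("/tmp/wide-50-cols.csv")
val projected = df.select(df.columns.map(c => col(c).as(c + "_x")): _*)
val filtered  = projected.filter(projected.columns.map(c => col(c).isNotNull).reduce(_ && _))
filtered.explain(true)  // optimization takes an extremely long time before SPARK-19846
{code}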

{code}
"ScalaTest-run-running-Test" #1 prio=5 os_prio=0 tid=0x02a41000 
nid=0x1bb4c runnable [0x02d5b000]
   java.lang.Thread.State: RUNNABLE
at 
scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:148)
at scala.collection.mutable.HashSet.addEntry(HashSet.scala:40)
at 
scala.collection.mutable.FlatHashTable$class.growTable(FlatHashTable.scala:225)
at 
scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:159)
at scala.collection.mutable.HashSet.addEntry(HashSet.scala:40)
at 
scala.collection.mutable.FlatHashTable$class.addElem(FlatHashTable.scala:139)
at scala.collection.mutable.HashSet.addElem(HashSet.scala:40)
at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:59)
at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:40)
at 
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
at 
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.AbstractSet.$plus$plus$eq(Set.scala:46)
at scala.collection.mutable.HashSet.clone(HashSet.scala:83)
at scala.collection.mutable.HashSet.clone(HashSet.scala:40)
at 
org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus(ExpressionSet.scala:65)
at 
org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus(ExpressionSet.scala:50)
at 
scala.collection.SetLike$$anonfun$$plus$plus$1.apply(SetLike.scala:141)
at 
scala.collection.SetLike$$anonfun$$plus$plus$1.apply(SetLike.scala:141)
at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at 
scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316)
at 
scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
at 
scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
at 
scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972)
at 
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractTraversable.foldLeft(Traversable.scala:104)
at 
scala.collection.TraversableOnce$class.$div$colon(TraversableOnce.scala:151)
at 
scala.collection.AbstractTraversable.$div$colon(Traversable.scala:104)
at scala.collection.SetLike$class.$plus$plus(SetLike.scala:141)
at 
org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus$plus(ExpressionSet.scala:50)
at 
org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$getAliasedConstraints$1.apply(LogicalPlan.scala:325)
at 
org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$getAliasedConstraints$1.apply(LogicalPlan.scala:322)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.sql.catalyst.plans.logical.UnaryNode.getAliasedConstraints(LogicalPlan.scala:322)
at 
org.apache.spark.sql.catalyst.plans.logical.Project.validConstraints(basicLogicalOperators.scala:57)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.constraints$lzycompute(QueryPlan.scala:187)
- locked <0x000682bd9118> (a 
org.apache.spark.sql.catalyst.plans.logical.Project)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.constraints(QueryPlan.scala:187)
at 
org.apache.spark.sql.catalyst.plans.logical.Filter.validConstraints(basicLogicalOperators.scala:130)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.constraints$lzycompute(QueryPlan.scala:187)
- locked <0x000682bd9188> (a 
org.apache.spark.sql.catalyst.plans.logical.Filter)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.constraints(QueryPlan.scala:187)
at 
org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints$$anonfun$apply$13.applyOrElse(Optimizer.scala:612)
at 
org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints$$anonfun$apply$13.applyOrElse(Optimizer.scala:610)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at 

[jira] [Resolved] (SPARK-12334) Support read from multiple input paths for orc file in DataFrameReader.orc

2017-03-09 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-12334.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 10307
[https://github.com/apache/spark/pull/10307]

> Support read from multiple input paths for orc file in DataFrameReader.orc
> --
>
> Key: SPARK-12334
> URL: https://issues.apache.org/jira/browse/SPARK-12334
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.6.0
>Reporter: Jeff Zhang
>Priority: Minor
> Fix For: 2.2.0
>
>
> DataFrameReader.json/text/parquet support multiple input paths; orc should be 
> consistent with that. 
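
With the fix, the call shape becomes consistent with the other readers. A small 
sketch, assuming the merged API takes varargs paths like json/text/parquet (the 
paths are illustrative):

{code}
// Multiple input paths in a single call, matching json/text/parquet.
val df = spark.read.orc("/data/events/2017-03-08", "/data/events/2017-03-09")
{code}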



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12334) Support read from multiple input paths for orc file in DataFrameReader.orc

2017-03-09 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-12334:
---

Assignee: Jeff Zhang

> Support read from multiple input paths for orc file in DataFrameReader.orc
> --
>
> Key: SPARK-12334
> URL: https://issues.apache.org/jira/browse/SPARK-12334
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.6.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>Priority: Minor
> Fix For: 2.2.0
>
>
> DataFrameReader.json/text/parquet support multiple input paths; orc should be 
> consistent with that. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19611) Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files

2017-03-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19611.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16944
[https://github.com/apache/spark/pull/16944]

> Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
> ---
>
> Key: SPARK-19611
> URL: https://issues.apache.org/jira/browse/SPARK-19611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Adam Budde
> Fix For: 2.2.0
>
>
> This issue replaces 
> [SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR 
> #16797|https://github.com/apache/spark/pull/16797]
> [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the 
> schema inference from the HiveMetastoreCatalog class when converting a 
> MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in 
> favor of simply using the schema returned by the metastore. This is an 
> optimization, as the underlying file statuses no longer need to be resolved 
> until after the partition pruning step, reducing the number of files to be 
> touched significantly in some cases. The downside is that the data schema 
> used may no longer match the underlying file schema for case-sensitive 
> formats such as Parquet.
> [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support 
> for saving a case-sensitive copy of the schema in the metastore table 
> properties, which HiveExternalCatalog will read in as the table's schema if 
> it is present. If it is not present, it will fall back to the 
> case-insensitive metastore schema.
> Unfortunately, this silently breaks queries over tables where the underlying 
> data fields are case-sensitive but a case-sensitive schema wasn't written to 
> the table properties by Spark. This situation will occur for any Hive table 
> that wasn't created by Spark or that was created prior to Spark 2.1.0. If a 
> user attempts to run a query over such a table containing a case-sensitive 
> field name in the query projection or in the query filter, the query will 
> return 0 results in every case.
> The change we are proposing is to bring back the schema inference that was 
> used prior to Spark 2.1.0 if a case-sensitive schema can't be read from the 
> table properties.
> - INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive 
> schema can be read from the table properties. Attempt to save the inferred 
> schema in the table properties to avoid future inference.
> - INFER_ONLY: Infer the schema if no case-sensitive schema can be read but 
> don't attempt to save it.
> - NEVER_INFER: Fall back to using the case-insensitive schema returned by the 
> Hive Metastore. Useful if the user knows that none of the underlying data is 
> case-sensitive.
> See [PR #16797|https://github.com/apache/spark/pull/16797] for more 
> discussion of this issue and the proposed solution.
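
For later readers: the three modes above surface as a SQL configuration. The 
key name below reflects my reading of the final PR rather than text from this 
issue, so treat it as a sketch:

{code}
// Assumed config key from the merged change; valid values: INFER_AND_SAVE, INFER_ONLY, NEVER_INFER.
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
{code}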



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19364) Stream Blocks in Storage Persists Forever when Kinesis Checkpoints are enabled and an exception is thrown

2017-03-09 Thread Andrew Milkowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903778#comment-15903778
 ] 

Andrew Milkowski edited comment on SPARK-19364 at 3/9/17 8:17 PM:
--

Thanks @Takeshi Yamamuro, I will try to make this error reproducible 
(we see it in prod non-stop, and it is consistent). I will see if I can throw the 
exception from inside the Kinesis receiver (Java lib) and watch the stream blocks 
grow in Spark, and I will provide the line change needed to reproduce the problem. 
It is tied to the Kinesis Java lib faulting on checkpoint (throwing an exception) 
and Spark persisting stream blocks and never releasing them from memory until an 
eventual OutOfMemoryError.


was (Author: amilkowski):
thanks @Takeshi Yamamuro , will try to see if I can make this error consistent 
(we see it in prod non stop and it is consistent) I will see if I can throw the 
exception from in the kinesis receiver (java lib) and see stream blocks grow in 
spark, will provide line change to re-produce problem.. it is tired to kinesis 
java lib faulting on checkpoint throwing exception and spark persisting stream 
blocks and never releasing em from memory till eventual OME

> Stream Blocks in Storage Persists Forever when Kinesis Checkpoints are 
> enabled and an exception is thrown 
> --
>
> Key: SPARK-19364
> URL: https://issues.apache.org/jira/browse/SPARK-19364
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: ubuntu unix
> spark 2.0.2
> application is java
>Reporter: Andrew Milkowski
>Priority: Blocker
>
> -- update --- we found that the situation below occurs when we encounter
> "com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: 
> Can't update checkpoint - instance doesn't hold the lease for this shard"
> https://github.com/awslabs/amazon-kinesis-client/issues/108
> we use an S3 directory (and DynamoDB) to store checkpoints, but when this occurs, 
> blocks should not get stuck; they should continue to be evicted gracefully from 
> memory. Obviously the Kinesis library race condition is a problem unto itself...
> -- exception leading to a block not being freed up --
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/mnt/yarn/usercache/hadoop/filecache/24/__spark_libs__7928020266533182031.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 17/01/26 13:52:00 ERROR KinesisRecordProcessor: ShutdownException:  Caught 
> shutdown exception, skipping checkpoint.
> com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: 
> Can't update checkpoint - instance doesn't hold the lease for this shard
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.setCheckpoint(KinesisClientLibLeaseCoordinator.java:120)
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.advancePosition(RecordProcessorCheckpointer.java:216)
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:137)
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:103)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply$mcV$sp(KinesisCheckpointer.scala:81)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.streaming.kinesis.KinesisRecordProcessor$.retryRandom(KinesisRecordProcessor.scala:144)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:81)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:75)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.checkpoint(KinesisCheckpointer.scala:75)
>   at 
> 

[jira] [Commented] (SPARK-19364) Stream Blocks in Storage Persists Forever when Kinesis Checkpoints are enabled and an exception is thrown

2017-03-09 Thread Andrew Milkowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903778#comment-15903778
 ] 

Andrew Milkowski commented on SPARK-19364:
--

Thanks @Takeshi Yamamuro, I will try to make this error reproducible 
(we see it in prod non-stop, and it is consistent). I will see if I can throw the 
exception from inside the Kinesis receiver (Java lib) and watch the stream blocks 
grow in Spark, and I will provide the line change needed to reproduce the problem. 
It is tied to the Kinesis Java lib faulting on checkpoint (throwing an exception) 
and Spark persisting stream blocks and never releasing them from memory until an 
eventual OutOfMemoryError.

> Stream Blocks in Storage Persists Forever when Kinesis Checkpoints are 
> enabled and an exception is thrown 
> --
>
> Key: SPARK-19364
> URL: https://issues.apache.org/jira/browse/SPARK-19364
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: ubuntu unix
> spark 2.0.2
> application is java
>Reporter: Andrew Milkowski
>Priority: Blocker
>
> -- update --- we found that the situation below occurs when we encounter
> "com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: 
> Can't update checkpoint - instance doesn't hold the lease for this shard"
> https://github.com/awslabs/amazon-kinesis-client/issues/108
> we use an S3 directory (and DynamoDB) to store checkpoints, but when this occurs, 
> blocks should not get stuck; they should continue to be evicted gracefully from 
> memory. Obviously the Kinesis library race condition is a problem unto itself...
> -- exception leading to a block not being freed up --
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/mnt/yarn/usercache/hadoop/filecache/24/__spark_libs__7928020266533182031.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 17/01/26 13:52:00 ERROR KinesisRecordProcessor: ShutdownException:  Caught 
> shutdown exception, skipping checkpoint.
> com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: 
> Can't update checkpoint - instance doesn't hold the lease for this shard
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.setCheckpoint(KinesisClientLibLeaseCoordinator.java:120)
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.advancePosition(RecordProcessorCheckpointer.java:216)
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:137)
>   at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:103)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply$mcV$sp(KinesisCheckpointer.scala:81)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.streaming.kinesis.KinesisRecordProcessor$.retryRandom(KinesisRecordProcessor.scala:144)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:81)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:75)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.checkpoint(KinesisCheckpointer.scala:75)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer.org$apache$spark$streaming$kinesis$KinesisCheckpointer$$checkpointAll(KinesisCheckpointer.scala:103)
>   at 
> org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$1.apply$mcVJ$sp(KinesisCheckpointer.scala:117)
>   at 
> org.apache.spark.streaming.util.RecurringTimer.triggerActionForNextInterval(RecurringTimer.scala:94)
>   at 
> org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:106)
>   at 
> org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:29)
> running standard kinesis stream ingestion with a 

[jira] [Assigned] (SPARK-19611) Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files

2017-03-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19611:
---

Assignee: Adam Budde

> Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
> ---
>
> Key: SPARK-19611
> URL: https://issues.apache.org/jira/browse/SPARK-19611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Adam Budde
>Assignee: Adam Budde
> Fix For: 2.2.0
>
>
> This issue replaces 
> [SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR 
> #16797|https://github.com/apache/spark/pull/16797]
> [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the 
> schema inference from the HiveMetastoreCatalog class when converting a 
> MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in 
> favor of simply using the schema returned by the metastore. This is an 
> optimization, as the underlying file statuses no longer need to be resolved 
> until after the partition pruning step, reducing the number of files to be 
> touched significantly in some cases. The downside is that the data schema 
> used may no longer match the underlying file schema for case-sensitive 
> formats such as Parquet.
> [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support 
> for saving a case-sensitive copy of the schema in the metastore table 
> properties, which HiveExternalCatalog will read in as the table's schema if 
> it is present. If it is not present, it will fall back to the 
> case-insensitive metastore schema.
> Unfortunately, this silently breaks queries over tables where the underlying 
> data fields are case-sensitive but a case-sensitive schema wasn't written to 
> the table properties by Spark. This situation will occur for any Hive table 
> that wasn't created by Spark or that was created prior to Spark 2.1.0. If a 
> user attempts to run a query over such a table containing a case-sensitive 
> field name in the query projection or in the query filter, the query will 
> return 0 results in every case.
> The change we are proposing is to bring back the schema inference that was 
> used prior to Spark 2.1.0 if a case-sensitive schema can't be read from the 
> table properties.
> - INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive 
> schema can be read from the table properties. Attempt to save the inferred 
> schema in the table properties to avoid future inference.
> - INFER_ONLY: Infer the schema if no case-sensitive schema can be read but 
> don't attempt to save it.
> - NEVER_INFER: Fall back to using the case-insensitive schema returned by the 
> Hive Metastore. Useful if the user knows that none of the underlying data is 
> case-sensitive.
> See [PR #16797|https://github.com/apache/spark/pull/16797] for more 
> discussion of this issue and the proposed solution.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19890) Make MetastoreRelation statistics estimation more accurately

2017-03-09 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-19890:
--

 Summary: Make MetastoreRelation statistics estimation more 
accurately
 Key: SPARK-19890
 URL: https://issues.apache.org/jira/browse/SPARK-19890
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Zhan Zhang
Priority: Minor


Currently the MetastoreRelation statistics are retrieved during the analyze 
phase, and the size is based on the whole table. For a partitioned table these 
statistics are not useful, as the table size may be 100x+ larger than the input 
partition size. As a result, the join optimization techniques are not applicable.

It would be great if we could postpone collecting the statistics to the 
optimization phase, after partition information is available but before the 
physical plan generation phase, so that JoinSelection can choose a better join 
method (broadcast, shuffled hash join, or sort-merge join).

Although the MetastoreRelation is not associated with partitions, we can get 
the partition info for the table through PhysicalOperation. Multiple plans can 
use the same MetastoreRelation, but the estimation is still much better than 
the table size. This way, retrieving statistics is straightforward.

Another possible way is to have another data structure associating the 
metastore relation and partitions with the plan, to get the most accurate 
estimation.
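
A small sketch of the mis-estimation described above (the table names and the 
partition filter are hypothetical): the query scans a single small partition, 
yet JoinSelection sees whole-table statistics and declines to broadcast:

{code}
import org.apache.spark.sql.functions.col

// events is a (hypothetical) Hive table partitioned by dt; one day is tiny,
// but MetastoreRelation reports the size of the entire table.
val oneDay = spark.table("events").filter(col("dt") === "2017-03-09")
oneDay.join(spark.table("users"), "user_id").explain()
// Planned as a sort-merge join even though the pruned input would fit under
// spark.sql.autoBroadcastJoinThreshold.
{code}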



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19890) Make MetastoreRelation statistics estimation more accurately

2017-03-09 Thread Zhan Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhan Zhang updated SPARK-19890:
---
Description: 
Currently the MetastoreRelation statistics are retrieved during the analyze 
phase, and the size is based on the whole table. For a partitioned table these 
statistics are not useful, as the table size may be 100x+ larger than the input 
partition size. As a result, the join optimization techniques are not applicable.

It would be great if we could postpone collecting the statistics to the 
optimization phase, after partition information is available but before the 
physical plan generation phase, so that JoinSelection can choose a better join 
method (broadcast, shuffled hash join, or sort-merge join).

Although the MetastoreRelation is not associated with partitions, we can get 
the partition info for the table through PhysicalOperation. Multiple plans can 
use the same MetastoreRelation, but the estimation is still much better than 
the table size. This way, retrieving statistics is straightforward.

Another possible way is to have another data structure associating the 
metastore relation and partitions with the plan, to get the most accurate 
estimation.

  was:
Currently the MetastoreRelation statistics is retrieved on the analyze phase, 
and the size is based on the table scope. But for partitioned table, this 
statistics is not useful as table size may 100x+ larger than the input 
partition size. As a result, the join optimization techniques is not applicable.

It would be great if we can postpone the statistics to the optimization phase 
to get partition information but before physical plan generation phase so that 
JoinSelection can choose better join methd (broadcast, shuffledjoin, or 
sortmerjoin).

Although the metastorerelation does not associated with partitions, but through 
PhysicalOperation we can get the partition info for the table. Although 
multiple plan can use the same meatstorerelation, but the estimation still much 
better than table size. This way, retrieving statistics is straightforward.

Another possible way is to have a another data structure associating the 
metastore relation and partitions with the plan to get most accurate estimation.


> Make MetastoreRelation statistics estimation more accurately
> 
>
> Key: SPARK-19890
> URL: https://issues.apache.org/jira/browse/SPARK-19890
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhan Zhang
>Priority: Minor
>
> Currently the MetastoreRelation statistics are retrieved during the analyze 
> phase, and the size is based on the whole table. For a partitioned table these 
> statistics are not useful, as the table size may be 100x+ larger than the 
> input partition size. As a result, the join optimization techniques are not 
> applicable.
> It would be great if we could postpone collecting the statistics to the 
> optimization phase, after partition information is available but before the 
> physical plan generation phase, so that JoinSelection can choose a better 
> join method (broadcast, shuffled hash join, or sort-merge join).
> Although the MetastoreRelation is not associated with partitions, we can get 
> the partition info for the table through PhysicalOperation. Multiple plans 
> can use the same MetastoreRelation, but the estimation is still much better 
> than the table size. This way, retrieving statistics is straightforward.
> Another possible way is to have another data structure associating the 
> metastore relation and partitions with the plan, to get the most accurate 
> estimation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19611) Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904033#comment-15904033
 ] 

Apache Spark commented on SPARK-19611:
--

User 'budde' has created a pull request for this issue:
https://github.com/apache/spark/pull/17229

> Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
> ---
>
> Key: SPARK-19611
> URL: https://issues.apache.org/jira/browse/SPARK-19611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Adam Budde
>Assignee: Adam Budde
> Fix For: 2.2.0
>
>
> This issue replaces 
> [SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR 
> #16797|https://github.com/apache/spark/pull/16797]
> [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the 
> schema inference from the HiveMetastoreCatalog class when converting a 
> MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in 
> favor of simply using the schema returned by the metastore. This is an 
> optimization, as the underlying file statuses no longer need to be resolved 
> until after the partition pruning step, reducing the number of files to be 
> touched significantly in some cases. The downside is that the data schema 
> used may no longer match the underlying file schema for case-sensitive 
> formats such as Parquet.
> [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support 
> for saving a case-sensitive copy of the schema in the metastore table 
> properties, which HiveExternalCatalog will read in as the table's schema if 
> it is present. If it is not present, it will fall back to the 
> case-insensitive metastore schema.
> Unfortunately, this silently breaks queries over tables where the underlying 
> data fields are case-sensitive but a case-sensitive schema wasn't written to 
> the table properties by Spark. This situation will occur for any Hive table 
> that wasn't created by Spark or that was created prior to Spark 2.1.0. If a 
> user attempts to run a query over such a table containing a case-sensitive 
> field name in the query projection or in the query filter, the query will 
> return 0 results in every case.
> The change we are proposing is to bring back the schema inference that was 
> used prior to Spark 2.1.0 if a case-sensitive schema can't be read from the 
> table properties.
> - INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive 
> schema can be read from the table properties. Attempt to save the inferred 
> schema in the table properties to avoid future inference.
> - INFER_ONLY: Infer the schema if no case-sensitive schema can be read but 
> don't attempt to save it.
> - NEVER_INFER: Fall back to using the case-insensitive schema returned by the 
> Hive Metastore. Useful if the user knows that none of the underlying data is 
> case-sensitive.
> See [PR #16797|https://github.com/apache/spark/pull/16797] for more 
> discussion of this issue and the proposed solution.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19889) Make TaskContext synchronized

2017-03-09 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-19889:
-

 Summary: Make TaskContext synchronized
 Key: SPARK-19889
 URL: https://issues.apache.org/jira/browse/SPARK-19889
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Herman van Hovell


In some cases you want to fork off some part of a task to a different thread. In 
these cases it would be very useful if {{TaskContext}} methods were synchronized 
and we could synchronize on the task context.
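
A sketch of the use case (the RDD and the listener body are hypothetical): a 
task captures its {{TaskContext}} and hands it to a worker thread, so both 
threads may touch the same context concurrently:

{code}
import org.apache.spark.TaskContext

rdd.mapPartitions { iter =>               // rdd is a hypothetical RDD
  val ctx = TaskContext.get()             // captured on the task's own thread
  val worker = new Thread(new Runnable {
    def run(): Unit = {
      // Runs concurrently with the task thread; registering a listener here is
      // only safe if TaskContext is synchronized.
      ctx.addTaskCompletionListener(_ => println(s"partition ${ctx.partitionId()} done"))
    }
  })
  worker.start()
  iter
}
{code}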



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19888) Seeing offsets not resetting even when reset policy is configured explicitly

2017-03-09 Thread Justin Miller (JIRA)
Justin Miller created SPARK-19888:
-

 Summary: Seeing offsets not resetting even when reset policy is 
configured explicitly
 Key: SPARK-19888
 URL: https://issues.apache.org/jira/browse/SPARK-19888
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Justin Miller


I was told to post this in a Spark ticket from KAFKA-4396:

I've been seeing a curious error with kafka 0.10 (spark 2.11); these may be two 
separate errors, I'm not sure. What's puzzling is that I'm setting 
auto.offset.reset to latest and it's still throwing an 
OffsetOutOfRangeException, behavior that's contrary to the code. Please help! :)

{code}
val kafkaParams = Map[String, Object](
  "group.id" -> consumerGroup,
  "bootstrap.servers" -> bootstrapServers,
  "key.deserializer" -> classOf[ByteArrayDeserializer],
  "value.deserializer" -> classOf[MessageRowDeserializer],
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean),
  "max.poll.records" -> persisterConfig.maxPollRecords,
  "request.timeout.ms" -> persisterConfig.requestTimeoutMs,
  "session.timeout.ms" -> persisterConfig.sessionTimeoutMs,
  "heartbeat.interval.ms" -> persisterConfig.heartbeatIntervalMs,
  "connections.max.idle.ms"-> persisterConfig.connectionsMaxIdleMs
)
{code}

{code}
16/11/09 23:10:17 INFO BlockManagerInfo: Added broadcast_154_piece0 in memory 
on xyz (size: 146.3 KB, free: 8.4 GB)
16/11/09 23:10:23 WARN TaskSetManager: Lost task 15.0 in stage 151.0 (TID 
38837, xyz): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: 
Offsets out of range with no configured reset policy for partitions: 
{topic=231884473}
at 
org.apache.kafka.clients.consumer.internals.Fetcher.parseFetchedData(Fetcher.java:588)
at 
org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:354)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1000)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938)
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99)
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:70)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
at 
org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

16/11/09 23:10:29 INFO TaskSetManager: Finished task 10.0 in stage 154.0 (TID 
39388) in 12043 ms on xyz (1/16)
16/11/09 23:10:31 INFO TaskSetManager: Finished task 0.0 in stage 154.0 (TID 
39375) in 13444 ms on xyz (2/16)
16/11/09 23:10:44 WARN TaskSetManager: Lost task 1.0 in stage 151.0 (TID 38843, 
xyz): java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
multi-threaded access
at 
org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
at 
org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:929)
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99)
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:73)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
at 

[jira] [Commented] (SPARK-19353) Support binary I/O in PipedRDD

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904058#comment-15904058
 ] 

Apache Spark commented on SPARK-19353:
--

User 'superbobry' has created a pull request for this issue:
https://github.com/apache/spark/pull/17230

> Support binary I/O in PipedRDD
> --
>
> Key: SPARK-19353
> URL: https://issues.apache.org/jira/browse/SPARK-19353
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sergei Lebedev
>Priority: Minor
>
> The current design of RDD.pipe is very restrictive. 
> It is line-based: each element of the input RDD [gets 
> serialized|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala#L143]
>  into one or more lines. Similarly, for the output of the child process, one 
> line corresponds to a single element of the output RDD. 
> It allows customizing the output format via {{printRDDElement}}, but not the 
> input format.
> It is not designed for extensibility. The only way to get a "BinaryPipedRDD" 
> is to copy/paste most of it and change the relevant parts.
> These limitations have been discussed on 
> [SO|http://stackoverflow.com/questions/27986830/how-to-pipe-binary-data-in-apache-spark]
>  and the mailing list, but alas no issue has been created.
> A possible solution to at least the first two limitations is to factor out 
> the format into a separate object (or objects). For instance, {{InputWriter}} 
> and {{OutputReader}}, following the Hadoop streaming API. 
> {code}
> trait InputWriter[T] {
> def write(os: OutputStream, elem: T)
> }
> trait OutputReader[T] {
> def read(is: InputStream): T
> }
> {code}
> The default configuration would be to write and read in a line-based format, 
> but users would also be able to selectively swap those for the appropriate 
> implementations.
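
A minimal sketch, assuming the proposed {{InputWriter}}/{{OutputReader}} traits 
above, of what the default line-based implementations could look like:

{code}
import java.io.{BufferedReader, InputStream, InputStreamReader, OutputStream}
import java.nio.charset.StandardCharsets

// Default line-based format: one UTF-8 line per RDD element.
object LineBasedFormat {
  val writer: InputWriter[String] = new InputWriter[String] {
    def write(os: OutputStream, elem: String): Unit =
      os.write((elem + "\n").getBytes(StandardCharsets.UTF_8))
  }

  val reader: OutputReader[String] = new OutputReader[String] {
    def read(is: InputStream): String =
      new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8)).readLine()
  }
}
{code}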



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19891) Await Batch Lock not signaled on stream execution exit

2017-03-09 Thread Tyson Condie (JIRA)
Tyson Condie created SPARK-19891:


 Summary: Await Batch Lock not signaled on stream execution exit
 Key: SPARK-19891
 URL: https://issues.apache.org/jira/browse/SPARK-19891
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Tyson Condie






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19891) Await Batch Lock not signaled on stream execution exit

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19891:


Assignee: Apache Spark

> Await Batch Lock not signaled on stream execution exit
> --
>
> Key: SPARK-19891
> URL: https://issues.apache.org/jira/browse/SPARK-19891
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tyson Condie
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19891) Await Batch Lock not signaled on stream execution exit

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19891:


Assignee: (was: Apache Spark)

> Await Batch Lock not signaled on stream execution exit
> --
>
> Key: SPARK-19891
> URL: https://issues.apache.org/jira/browse/SPARK-19891
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tyson Condie
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19891) Await Batch Lock not signaled on stream execution exit

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904116#comment-15904116
 ] 

Apache Spark commented on SPARK-19891:
--

User 'tcondie' has created a pull request for this issue:
https://github.com/apache/spark/pull/17231

> Await Batch Lock not signaled on stream execution exit
> --
>
> Key: SPARK-19891
> URL: https://issues.apache.org/jira/browse/SPARK-19891
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tyson Condie
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-19507) pyspark.sql.types._verify_type() exceptions too broad to debug collections or nested data

2017-03-09 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-19507:
--

I am reopening this per 
https://github.com/apache/spark/pull/17213#issuecomment-285530248 and resolving 
SPARK-19871 as a duplicate.

> pyspark.sql.types._verify_type() exceptions too broad to debug collections or 
> nested data
> -
>
> Key: SPARK-19507
> URL: https://issues.apache.org/jira/browse/SPARK-19507
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: macOS Sierra 10.12.3
> Spark 2.1.0, installed via Homebrew
>Reporter: David Gingrich
>Priority: Trivial
>
> The private function pyspark.sql.types._verify_type() recursively checks an 
> object against a datatype, raising an exception if the object does not 
> satisfy the type.  These messages are not specific enough to debug a data 
> error in a collection or nested data, for instance:
> {quote}
> >>> import pyspark.sql.types as typ
> >>> schema = typ.StructType([typ.StructField('nest1', 
> >>> typ.MapType(typ.StringType(), typ.ArrayType(typ.FloatType())))])
> >>> typ._verify_type({'nest1': {'nest2': [1]}}, schema)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1355, in 
> _verify_type
> _verify_type(obj.get(f.name), f.dataType, f.nullable, name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1349, in 
> _verify_type
> _verify_type(v, dataType.valueType, dataType.valueContainsNull, 
> name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1342, in 
> _verify_type
> _verify_type(i, dataType.elementType, dataType.containsNull, 
> name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1325, in 
> _verify_type
> % (name, dataType, obj, type(obj)))
> TypeError: FloatType can not accept object 1 in type 
> {quote}
> Passing and printing a field name would make debugging easier.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19871) Improve error message in verify_type to indicate which field the error is for

2017-03-09 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19871.
--
Resolution: Duplicate

> Improve error message in verify_type to indicate which field the error is for
> -
>
> Key: SPARK-19871
> URL: https://issues.apache.org/jira/browse/SPARK-19871
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Len Frodgers
>
> The error message for _verify_type is too vague. It should specify for which 
> field the type verification failed.
> e.g. error: "This field is not nullable, but got None" – but what is the 
> field?!
> In a dataframe with many non-nullable fields, it is a nightmare trying to 
> hunt down which field is null, since one has to resort to basic (and very 
> slow) trial and error.
> I have happily created a PR to fix this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19507) pyspark.sql.types._verify_type() exceptions too broad to debug collections or nested data

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19507:


Assignee: (was: Apache Spark)

> pyspark.sql.types._verify_type() exceptions too broad to debug collections or 
> nested data
> -
>
> Key: SPARK-19507
> URL: https://issues.apache.org/jira/browse/SPARK-19507
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: macOS Sierra 10.12.3
> Spark 2.1.0, installed via Homebrew
>Reporter: David Gingrich
>Priority: Trivial
>
> The private function pyspark.sql.types._verify_type() recursively checks an 
> object against a datatype, raising an exception if the object does not 
> satisfy the type.  These messages are not specific enough to debug a data 
> error in a collection or nested data, for instance:
> {quote}
> >>> import pyspark.sql.types as typ
> >>> schema = typ.StructType([typ.StructField('nest1', 
> >>> typ.MapType(typ.StringType(), typ.ArrayType(typ.FloatType())))])
> >>> typ._verify_type({'nest1': {'nest2': [1]}}, schema)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1355, in 
> _verify_type
> _verify_type(obj.get(f.name), f.dataType, f.nullable, name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1349, in 
> _verify_type
> _verify_type(v, dataType.valueType, dataType.valueContainsNull, 
> name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1342, in 
> _verify_type
> _verify_type(i, dataType.elementType, dataType.containsNull, 
> name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1325, in 
> _verify_type
> % (name, dataType, obj, type(obj)))
> TypeError: FloatType can not accept object 1 in type 
> {quote}
> Passing and printing a field name would make debugging easier.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19507) pyspark.sql.types._verify_type() exceptions too broad to debug collections or nested data

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19507:


Assignee: Apache Spark

> pyspark.sql.types._verify_type() exceptions too broad to debug collections or 
> nested data
> -
>
> Key: SPARK-19507
> URL: https://issues.apache.org/jira/browse/SPARK-19507
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: macOS Sierra 10.12.3
> Spark 2.1.0, installed via Homebrew
>Reporter: David Gingrich
>Assignee: Apache Spark
>Priority: Trivial
>
> The private function pyspark.sql.types._verify_type() recursively checks an 
> object against a datatype, raising an exception if the object does not 
> satisfy the type.  These messages are not specific enough to debug a data 
> error in a collection or nested data, for instance:
> {quote}
> >>> import pyspark.sql.types as typ
> >>> schema = typ.StructType([typ.StructField('nest1', 
> >>> typ.MapType(typ.StringType(), typ.ArrayType(typ.FloatType())))])
> >>> typ._verify_type({'nest1': {'nest2': [1]}}, schema)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1355, in 
> _verify_type
> _verify_type(obj.get(f.name), f.dataType, f.nullable, name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1349, in 
> _verify_type
> _verify_type(v, dataType.valueType, dataType.valueContainsNull, 
> name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1342, in 
> _verify_type
> _verify_type(i, dataType.elementType, dataType.containsNull, 
> name=new_name)
>   File "/Users/david/src/3p/spark/python/pyspark/sql/types.py", line 1325, in 
> _verify_type
> % (name, dataType, obj, type(obj)))
> TypeError: FloatType can not accept object 1 in type 
> {quote}
> Passing and printing a field name would make debugging easier.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19886) reportDataLoss cause != null check is wrong for Structured Streaming KafkaSource

2017-03-09 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19886.
--
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

> reportDataLoss cause != null check is wrong for Structured Streaming 
> KafkaSource
> 
>
> Key: SPARK-19886
> URL: https://issues.apache.org/jira/browse/SPARK-19886
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.1.1, 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19893) Cannot run intersect/except with map type

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19893:


Assignee: Apache Spark  (was: Wenchen Fan)

> Cannot run intersect/except with map type
> -
>
> Key: SPARK-19893
> URL: https://issues.apache.org/jira/browse/SPARK-19893
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0, 2.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19893) Cannot run intersect/except with map type

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19893:


Assignee: Wenchen Fan  (was: Apache Spark)

> Cannot run intersect/except with map type
> -
>
> Key: SPARK-19893
> URL: https://issues.apache.org/jira/browse/SPARK-19893
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0, 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19894) Tasks entirely assigned to one executor on Yarn-cluster mode for default-rack

2017-03-09 Thread Yuechen Chen (JIRA)
Yuechen Chen created SPARK-19894:


 Summary: Tasks entirely assigned to one executor on Yarn-cluster 
mode for default-rack
 Key: SPARK-19894
 URL: https://issues.apache.org/jira/browse/SPARK-19894
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, YARN
Affects Versions: 2.1.0
 Environment: Yarn-cluster
Reporter: Yuechen Chen


In YARN-cluster mode, if the driver has no rack information for two different hosts, 
both hosts are recognized as "/default-rack", which may cause bugs.
For example, if the hosts of one executor and one external data source are unknown 
to the driver, the two hosts are recognized as belonging to the same rack, 
"/default-rack", and all tasks are then assigned to that executor.
This bug would be avoided if getRackForHost("unknown host") in YarnScheduler 
returned None instead of Some("/default-rack").
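
A minimal sketch of the proposed behaviour, using Hadoop's RackResolver the way 
YarnScheduler does (illustrative only; the standalone method shape and the explicit 
Configuration argument are assumptions, not the actual patch):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.net.NetworkTopology
import org.apache.hadoop.yarn.util.RackResolver

// Resolve a host to its rack, but report None instead of the "/default-rack"
// fallback, so two unresolved hosts are not treated as rack-local to each other.
def rackForHost(conf: Configuration, host: String): Option[String] =
  Option(RackResolver.resolve(conf, host).getNetworkLocation)
    .filter(_ != NetworkTopology.DEFAULT_RACK)
{code}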



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19891) Await Batch Lock not signaled on stream execution exit

2017-03-09 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19891.
--
   Resolution: Fixed
 Assignee: Tyson Condie
Fix Version/s: 2.2.0
   2.1.1

> Await Batch Lock not signaled on stream execution exit
> --
>
> Key: SPARK-19891
> URL: https://issues.apache.org/jira/browse/SPARK-19891
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tyson Condie
>Assignee: Tyson Condie
> Fix For: 2.1.1, 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19852) StringIndexer.setHandleInvalid should have another option 'new': Python API and docs

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19852:


Assignee: Apache Spark

> StringIndexer.setHandleInvalid should have another option 'new': Python API 
> and docs
> 
>
> Key: SPARK-19852
> URL: https://issues.apache.org/jira/browse/SPARK-19852
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> Update Python API for StringIndexer so setHandleInvalid doc is correct.  This 
> will probably require:
> * putting HandleInvalid within StringIndexer to update its built-in doc (See 
> Bucketizer for an example.)
> * updating API docs and maybe the guide



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19852) StringIndexer.setHandleInvalid should have another option 'new': Python API and docs

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904590#comment-15904590
 ] 

Apache Spark commented on SPARK-19852:
--

User 'VinceShieh' has created a pull request for this issue:
https://github.com/apache/spark/pull/17237

> StringIndexer.setHandleInvalid should have another option 'new': Python API 
> and docs
> 
>
> Key: SPARK-19852
> URL: https://issues.apache.org/jira/browse/SPARK-19852
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Update Python API for StringIndexer so setHandleInvalid doc is correct.  This 
> will probably require:
> * putting HandleInvalid within StringIndexer to update its built-in doc (See 
> Bucketizer for an example.)
> * updating API docs and maybe the guide



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19852) StringIndexer.setHandleInvalid should have another option 'new': Python API and docs

2017-03-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19852:


Assignee: (was: Apache Spark)

> StringIndexer.setHandleInvalid should have another option 'new': Python API 
> and docs
> 
>
> Key: SPARK-19852
> URL: https://issues.apache.org/jira/browse/SPARK-19852
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Update Python API for StringIndexer so setHandleInvalid doc is correct.  This 
> will probably require:
> * putting HandleInvalid within StringIndexer to update its built-in doc (See 
> Bucketizer for an example.)
> * updating API docs and maybe the guide



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19885) The config ignoreCorruptFiles doesn't work for CSV

2017-03-09 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904606#comment-15904606
 ] 

Wenchen Fan commented on SPARK-19885:
-

This is because we support different charsets for CSV files, while our text file 
format only supports UTF-8, so we have to use `HadoopRDD` when inferring the schema 
for the CSV data source.

I've checked the history: this feature has been there since the day the CSV data 
source was introduced. However, all other text-based data sources support only UTF-8, 
and CSV with wholeFile enabled also only supports UTF-8.

Shall we just remove the support for different charsets, or support this 
feature for all text-based data sources?

cc [~hyukjin.kwon]
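
For context, a small sketch of the two pieces involved (assuming a SparkSession 
named {{spark}}; the option and config names are the ones available in Spark 2.x):

{code}
// The CSV-specific charset support in question ("encoding" is accepted as an alias):
val df = spark.read
  .option("charset", "ISO-8859-1")
  .option("header", "true")
  .csv("/path/to/latin1-files")

// The flag that should skip corrupt files, but is not applied while
// CSVFileFormat.inferSchema reads the files through a HadoopRDD:
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
{code}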

> The config ignoreCorruptFiles doesn't work for CSV
> --
>
> Key: SPARK-19885
> URL: https://issues.apache.org/jira/browse/SPARK-19885
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>
> CSVFileFormat.inferSchema doesn't use FileScanRDD so the SQL 
> "ignoreCorruptFiles" doesn't work.
> {code}
> java.io.EOFException: Unexpected end of input stream
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145)
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
>   at java.io.InputStream.read(InputStream.java:101)
>   at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
>   at 
> org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
>   at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
>   at 
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:248)
>   at 
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:266)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:211)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:214)
>   at scala.collection.AbstractIterator.aggregate(Iterator.scala:1336)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1112)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1112)
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>   at 
> 

[jira] [Comment Edited] (SPARK-19885) The config ignoreCorruptFiles doesn't work for CSV

2017-03-09 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904606#comment-15904606
 ] 

Wenchen Fan edited comment on SPARK-19885 at 3/10/17 7:22 AM:
--

This is because we support different charsets for CSV files, while our text file 
format only supports UTF-8, so we have to use `HadoopRDD` when inferring the schema 
for the CSV data source, and `HadoopRDD` doesn't recognize the ignoreCorruptFiles option.

I've checked the history: this feature has been there since the day the CSV data 
source was introduced. However, all other text-based data sources support only UTF-8, 
and CSV with wholeFile enabled also only supports UTF-8.

Shall we just remove the support for different charsets, or support this 
feature for all text-based data sources?

cc [~hyukjin.kwon]


was (Author: cloud_fan):
This is because we support different charsets for CSV files, while our text file 
format only supports UTF-8, so we have to use `HadoopRDD` when inferring the schema 
for the CSV data source.

I've checked the history: this feature has been there since the day the CSV data 
source was introduced. However, all other text-based data sources support only UTF-8, 
and CSV with wholeFile enabled also only supports UTF-8.

Shall we just remove the support for different charsets, or support this 
feature for all text-based data sources?

cc [~hyukjin.kwon]

> The config ignoreCorruptFiles doesn't work for CSV
> --
>
> Key: SPARK-19885
> URL: https://issues.apache.org/jira/browse/SPARK-19885
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>
> CSVFileFormat.inferSchema doesn't use FileScanRDD so the SQL 
> "ignoreCorruptFiles" doesn't work.
> {code}
> java.io.EOFException: Unexpected end of input stream
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145)
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
>   at java.io.InputStream.read(InputStream.java:101)
>   at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
>   at 
> org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
>   at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
>   at 
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:248)
>   at 
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:266)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:211)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:214)
>   at scala.collection.AbstractIterator.aggregate(Iterator.scala:1336)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1112)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1112)
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at 

[jira] [Commented] (SPARK-19894) Tasks entirely assigned to one executor on Yarn-cluster mode for default-rack

2017-03-09 Thread Yuechen Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904621#comment-15904621
 ] 

Yuechen Chen commented on SPARK-19894:
--

https://github.com/apache/spark/pull/17238

> Tasks entirely assigned to one executor on Yarn-cluster mode for default-rack
> -
>
> Key: SPARK-19894
> URL: https://issues.apache.org/jira/browse/SPARK-19894
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, YARN
>Affects Versions: 2.1.0
> Environment: Yarn-cluster
>Reporter: Yuechen Chen
>
> In YARN-cluster mode, if the driver has no rack information for two different 
> hosts, both hosts are recognized as "/default-rack", which may cause bugs.
> For example, if the hosts of one executor and one external data source are unknown 
> to the driver, the two hosts are recognized as belonging to the same rack, 
> "/default-rack", and all tasks are then assigned to that executor.
> This bug would be avoided if getRackForHost("unknown host") in YarnScheduler 
> returned None instead of Some("/default-rack").



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16255) Spark2.0 doesn't support the following SQL statement:"insert into directory "/u_qa_user/hive_testdata/test1/t1" select * from d_test_tpc_2g_txt.auction" while Hive supp

2017-03-09 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-16255.
--
Resolution: Duplicate

I think this is a duplicate of SPARK-4131. Please reopen this if I am mistaken.

> Spark2.0 doesn't support the following SQL statement:"insert into directory 
> "/u_qa_user/hive_testdata/test1/t1" select * from d_test_tpc_2g_txt.auction" 
> while Hive supports
> 
>
> Key: SPARK-16255
> URL: https://issues.apache.org/jira/browse/SPARK-16255
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: marymwu
>Priority: Minor
>
> Spark2.0 doesn't support the following SQL statement:"insert into directory 
> "/u_qa_user/hive_testdata/test1/t1" select * from d_test_tpc_2g_txt.auction" 
> while Hive supports



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2017-03-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18112:

Priority: Critical  (was: Major)

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have also 
> been out for a long time, but Spark still only supports reading Hive metastore 
> data from Hive 1.2.1 and older versions. Since Hive 2.x has many bug fixes and 
> performance improvements, it is better and urgent to upgrade to support Hive 2.x.
> Loading data from a Hive 2.x metastore fails:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2017-03-09 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904362#comment-15904362
 ] 

Xiao Li commented on SPARK-18112:
-

Let me resolve it by adding support for the Hive 2.1.0 metastore.
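
For reference, a rough sketch of how a newer metastore client would be selected once 
this is supported (the config keys already exist; "2.1.0" only becomes a valid value 
with the fix):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-2.x-metastore")
  // Version of the Hive metastore client Spark should talk to.
  .config("spark.sql.hive.metastore.version", "2.1.0")
  // Where to get the matching Hive client jars ("maven", or an explicit classpath).
  .config("spark.sql.hive.metastore.jars", "maven")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
{code}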

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have also 
> been out for a long time, but Spark still only supports reading Hive metastore 
> data from Hive 1.2.1 and older versions. Since Hive 2.x has many bug fixes and 
> performance improvements, it is better and urgent to upgrade to support Hive 2.x.
> Loading data from a Hive 2.x metastore fails:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2017-03-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18112:

Component/s: (was: Spark Submit)

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have also 
> been out for a long time, but Spark still only supports reading Hive metastore 
> data from Hive 1.2.1 and older versions. Since Hive 2.x has many bug fixes and 
> performance improvements, it is better and urgent to upgrade to support Hive 2.x.
> Loading data from a Hive 2.x metastore fails:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2017-03-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-18112:
---

Assignee: Xiao Li

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have also 
> been out for a long time, but Spark still only supports reading Hive metastore 
> data from Hive 1.2.1 and older versions. Since Hive 2.x has many bug fixes and 
> performance improvements, it is better and urgent to upgrade to support Hive 2.x.
> Loading data from a Hive 2.x metastore fails:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16754) NPE when defining case class and searching Encoder in the same line

2017-03-09 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904272#comment-15904272
 ] 

Hyukjin Kwon commented on SPARK-16754:
--

Today, I tested this out of curiosity. It now seems to print a different error, 
as below:

{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

case class TestCaseClass(value: Int)
import spark.implicits._
Seq(TestCaseClass(1)).toDS().collect()


// Exiting paste mode, now interpreting.

java.lang.RuntimeException: Error while decoding: java.lang.NullPointerException
newInstance(class TestCaseClass)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:303)
  at 
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collectFromPlan$1.apply(Dataset.scala:2807)
  at 
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collectFromPlan$1.apply(Dataset.scala:2807)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2807)
  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2360)
  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2360)
  at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2791)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2790)
  at org.apache.spark.sql.Dataset.collect(Dataset.scala:2360)
  ... 47 elided
Caused by: java.lang.NullPointerException
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:497)
  at 
org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70)
  at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$5.apply(objects.scala:320)
  at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$5.apply(objects.scala:320)
  at scala.Option.map(Option.scala:146)
  at 
org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:320)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:104)
  at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:101)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:101)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$$anonfun$3.apply(GenerateSafeProjection.scala:145)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$$anonfun$3.apply(GenerateSafeProjection.scala:142)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:142)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:36)
  at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:887)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.constructProjection$lzycompute(ExpressionEncoder.scala:269)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.constructProjection(ExpressionEncoder.scala:269)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:300)
  ... 62 more
{code}

> NPE when defining case class and searching Encoder in the same line
> ---
>
> Key: SPARK-16754
> URL: https://issues.apache.org/jira/browse/SPARK-16754
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Spark Shell for Scala 2.11
>Reporter: Shixiong Zhu
>Priority: Minor
>
> Reproducer:
> 

[jira] [Commented] (SPARK-17322) 'ANY n' clause for SQL queries to increase the ease of use of WHERE clause predicates

2017-03-09 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904285#comment-15904285
 ] 

Hyukjin Kwon commented on SPARK-17322:
--

Let me leave a link that might be helpful - 
http://stackoverflow.com/questions/8771596/are-sql-any-and-some-keywords-synonyms-in-all-sql-dialects
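
For comparison, 'any n of m' can already be expressed by summing boolean predicates; 
a sketch against the example in the description below (assuming a DataFrame named 
{{stocks}} with those columns):

{code}
import org.apache.spark.sql.functions._

// Keep rows where at least 3 of the 4 predicates hold.
val preds = Seq(
  col("market_cap") > 5.7e9,
  col("analysts_recommend") > 10,
  col("moving_avg") > 49.2,
  col("pe_ratio") > 15.4)

val atLeastThree = preds.map(p => when(p, 1).otherwise(0)).reduce(_ + _) >= 3
val result = stocks.filter(atLeastThree).select("symbol")
{code}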

> 'ANY n' clause for SQL queries to increase the ease of use of WHERE clause 
> predicates
> -
>
> Key: SPARK-17322
> URL: https://issues.apache.org/jira/browse/SPARK-17322
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Suman Somasundar
>Priority: Minor
>
> If the user is interested in getting the results that meet 'any n' criteria 
> out of m where clause predicates (m > n), then the 'any n' clause greatly 
> simplifies writing a SQL query.
> An example is given below:
> select symbol from stocks where (market_cap > 5.7b, analysts_recommend > 10, 
> moving_avg > 49.2, pe_ratio >15.4) ANY 3



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19711) Bug in gapply function

2017-03-09 Thread Yeonseop Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904346#comment-15904346
 ] 

Yeonseop Kim commented on SPARK-19711:
--

It seems nicer to use "stringsAsFactors = FALSE", I think:

schema <- structType(structField("CNPJ", "string"))
result <- gapply(
  ds,
  c("CNPJ", "PID"),
  function(key, x) {
    data.frame(CNPJ = x$CNPJ, stringsAsFactors = FALSE)
  },
  schema)

> Bug in gapply function
> --
>
> Key: SPARK-19711
> URL: https://issues.apache.org/jira/browse/SPARK-19711
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
> Environment: Using Databricks plataform.
>Reporter: Luis Felipe Sant Ana
> Attachments: mv_demand_20170221.csv, resume.R
>
>
> I have a dataframe in SparkR like:
>              CNPJ    PID       DATA N
> 1 10140281000131 100021 2015-04-23 1
> 2 10140281000131 100021 2015-04-27 1
> 3 10140281000131 100021 2015-04-02 1
> 4 10140281000131 100021 2015-11-10 1
> 5 10140281000131 100021 2016-11-14 1
> 6 10140281000131 100021 2015-04-03 1
> And I want to group by the columns CNPJ and PID using the gapply() function, filling 
> in the column DATA with dates, and then fill in the missing dates with zeros.
> The code:
> schema <- structType(structField("CNPJ", "string"), 
>  structField("PID", "string"),
>  structField("DATA", "date"),
>  structField("N", "double"))
> result <- gapply(
>   ds_filtered,
>   c("CNPJ", "PID"),
>   function(key, x) {
> dts <- data.frame(key, DATA = seq(min(as.Date(x$DATA)), as.Date(e_date), 
> "days"))
> colnames(dts)[c(1, 2)] <- c("CNPJ", "PID")
> 
> y <- data.frame(key, DATA = as.Date(x$DATA), N = x$N)
> colnames(y)[c(1, 2)] <- c("CNPJ", "PID")
> 
> y <- dplyr::left_join(dts, 
>  y,
>  by = c("CNPJ", "PID", "DATA"))
> 
> y[is.na(y$N), 4] <- 0
> 
> data.frame(CNPJ = as.character(y$CNPJ),
>PID = as.character(y$PID),
>DATA = y$DATA,
>N = y$N)
>   }, 
>   schema)
> Error:
> Error in handleErrors(returnStatus, conn) : 
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 92.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
> 92.0 (TID 7032, 10.93.243.111, executor 0): org.apache.spark.SparkException: 
> R computation failed with
>  Error in writeType(con, serdeType) : 
>   Unsupported type for serialization factor
> Calls: outputResult ... serializeRow -> writeList -> writeObject -> writeType
> Execution halted
>   at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
>   at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$12.apply(objects.scala:404)
>   at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$12.apply(objects.scala:386)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)

[jira] [Reopened] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2017-03-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-18112:
-

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have also 
> been out for a long time, but Spark still only supports reading Hive metastore 
> data from Hive 1.2.1 and older versions. Since Hive 2.x has many bug fixes and 
> performance improvements, it is better and urgent to upgrade to support Hive 2.x.
> Loading data from a Hive 2.x metastore fails:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16283) Implement percentile_approx SQL function

2017-03-09 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904397#comment-15904397
 ] 

Zhenhua Wang edited comment on SPARK-16283 at 3/10/17 4:09 AM:
---

[~erlu] I think it's been made clear from the above discussions, Spark's result 
doesn't have to be the same as Hive's result.


was (Author: zenwzh):
[~erlu] I think it's been made clear from above discussions, Spark' result 
doesn't have to be the same as Hive's result.

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Sean Zhong
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16283) Implement percentile_approx SQL function

2017-03-09 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904397#comment-15904397
 ] 

Zhenhua Wang commented on SPARK-16283:
--

[~erlu] I think it's been made clear from above discussions, Spark' result 
doesn't have to be the same as Hive's result.
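
For anyone comparing the two engines: both compute an approximation and use different 
algorithms, so exact agreement is not expected. Spark's percentile_approx also takes an 
optional accuracy argument; a quick sketch with a hypothetical {{sales}} DataFrame:

{code}
import org.apache.spark.sql.functions.expr

// Default accuracy (10000).
val medianDefault = sales.select(expr("percentile_approx(price, 0.5)"))

// A larger accuracy trades memory for an estimate closer to the exact percentile.
val medianTighter = sales.select(expr("percentile_approx(price, 0.5, 100000)"))
{code}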

> Implement percentile_approx SQL function
> 
>
> Key: SPARK-16283
> URL: https://issues.apache.org/jira/browse/SPARK-16283
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Sean Zhong
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19143) API in Spark for distributing new delegation tokens (Improve delegation token handling in secure clusters)

2017-03-09 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904497#comment-15904497
 ] 

Saisai Shao commented on SPARK-19143:
-

Hi all, I wrote a rough design doc based on the comments above; here is the 
link: 
https://docs.google.com/document/d/1DFWGHu4_GJapbbfXGWsot_z_W9Wka_39DFNmg9r9SAI/edit?usp=sharing

[~tgraves] [~vanzin] [~mridulm80], please review and comment; I would greatly 
appreciate your suggestions.

> API in Spark for distributing new delegation tokens (Improve delegation token 
> handling in secure clusters)
> --
>
> Key: SPARK-19143
> URL: https://issues.apache.org/jira/browse/SPARK-19143
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Ruslan Dautkhanov
>
> Spin off from SPARK-14743 and comments chain in [recent comments| 
> https://issues.apache.org/jira/browse/SPARK-5493?focusedCommentId=15802179=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15802179]
>  in SPARK-5493.
> Spark currently doesn't have a way to distribute new delegation tokens. 
> Quoting [~vanzin] from SPARK-5493 
> {quote}
> IIRC Livy doesn't yet support delegation token renewal. Once it reaches the 
> TTL, the session is unusable.
> There might be ways to hack support for that without changes in Spark, but 
> I'd like to see a proper API in Spark for distributing new delegation tokens. 
> I mentioned that in SPARK-14743, but although that bug is closed, that 
> particular feature hasn't been implemented yet.
> {quote}
> Other thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19893) Cannot run intersect/except with map type

2017-03-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904587#comment-15904587
 ] 

Apache Spark commented on SPARK-19893:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17236

> Cannot run intersect/except with map type
> -
>
> Key: SPARK-19893
> URL: https://issues.apache.org/jira/browse/SPARK-19893
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0, 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19008) Avoid boxing/unboxing overhead of calling a lambda with primitive type from Dataset program

2017-03-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19008:
---

Assignee: Kazuaki Ishizaki

> Avoid boxing/unboxing overhead of calling a lambda with primitive type from 
> Dataset program
> ---
>
> Key: SPARK-19008
> URL: https://issues.apache.org/jira/browse/SPARK-19008
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.2.0
>
>
> In a 
> [discussion|https://github.com/apache/spark/pull/16391#discussion_r93788919] 
> between [~cloud_fan] and [~kiszk], we noticed an opportunity to avoid 
> boxing/unboxing overhead when a Dataset program calls a lambda, which 
> operates on a primitive type, written in Scala.
> In such a case, Catalyst can directly call a primitive-typed {{apply()}} method 
> instead of {{Object apply(Object);}}.
> Of course, the best solution seems to be 
> [here|https://issues.apache.org/jira/browse/SPARK-14083].
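
A quick plain-Scala illustration of the overhead (not Spark's actual generated code; 
the {{IntUnary}} trait below is just a stand-in for a primitive-typed interface):

{code}
// The lambda passed to Dataset.map(...) is a Function1. Through the generic
// interface the call site is Object apply(Object), so every value is boxed:
val inc: Int => Int = _ + 1
val viaGeneric: Any = inc.asInstanceOf[Any => Any].apply(1)   // boxes 1 and the result

// A primitive-typed interface avoids the boxing entirely; this is the shape of
// call Catalyst could emit instead:
trait IntUnary { def apply(x: Int): Int }
val incPrimitive: IntUnary = new IntUnary { def apply(x: Int): Int = x + 1 }
val viaPrimitive: Int = incPrimitive.apply(1)                 // stays primitive
{code}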



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19893) Cannot run intersect/except with map type

2017-03-09 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-19893:
---

 Summary: Cannot run intersect/except with map type
 Key: SPARK-19893
 URL: https://issues.apache.org/jira/browse/SPARK-19893
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 2.0.2, 2.2.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19008) Avoid boxing/unboxing overhead of calling a lambda with primitive type from Dataset program

2017-03-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19008.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17172
[https://github.com/apache/spark/pull/17172]

> Avoid boxing/unboxing overhead of calling a lambda with primitive type from 
> Dataset program
> ---
>
> Key: SPARK-19008
> URL: https://issues.apache.org/jira/browse/SPARK-19008
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Kazuaki Ishizaki
> Fix For: 2.2.0
>
>
> In a 
> [discussion|https://github.com/apache/spark/pull/16391#discussion_r93788919] 
> between [~cloud_fan] and [~kiszk], we noticed an opportunity to avoid 
> boxing/unboxing overhead when a Dataset program calls a lambda, which 
> operates on a primitive type, written in Scala.
> In such a case, Catalyst can directly call a primitive-typed {{apply()}} method 
> instead of {{Object apply(Object);}}.
> Of course, the best solution seems to be 
> [here|https://issues.apache.org/jira/browse/SPARK-14083].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-09 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902649#comment-15902649
 ] 

Nick Pentreath commented on SPARK-14409:


[~josephkb] in reference to your [PR 
comment|https://github.com/apache/spark/pull/17090#issuecomment-284827573]:

Really, the input schema for evaluation is fairly simple - a set of ground-truth 
ids and a (sorted) set of predicted ids for each query (/user). The exact 
format (arrays as in the {{mllib}} version, or the "exploded" version proposed in this 
JIRA) is not relevant in itself. Rather, the format selected is actually 
dictated by the {{Pipeline}} API - specifically, a model's prediction output 
schema from {{transform}} must be compatible with the evaluator's input schema 
for {{evaluate}}.

The schema proposed above is - I believe - the only one that is compatible with 
both "linear model" style estimators such as `LogisticRegression` for ad CTR 
prediction and learning-to-rank settings, as well as recommendation tasks.
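
To make the two shapes concrete, here is a sketch (column names are illustrative 
placeholders rather than a settled API; assumes a SparkSession named {{spark}} with 
{{spark.implicits._}} imported):

{code}
import spark.implicits._

// "Exploded" layout: one row per (query, item), the same shape a ranking
// model's transform() emits and an evaluator's evaluate() would consume.
val predictions = Seq(
  ("u1", "i3", 0.0, 0.93),
  ("u1", "i7", 1.0, 0.87),
  ("u2", "i1", 1.0, 0.55)
).toDF("query", "item", "label", "prediction")

// Array layout, as used by mllib.evaluation.RankingMetrics: one row per query.
val arrays = Seq(
  ("u1", Seq("i3", "i7"), Seq("i7", "i9"))
).toDF("query", "predicted", "relevant")
{code}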

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19875) Map->filter on many columns gets stuck in constraint inference optimization code

2017-03-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902668#comment-15902668
 ] 

Sean Owen commented on SPARK-19875:
---

It's easier to inline the code in a comment:

{code:scala}
package test.spark

import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

object TestFilter extends App {

  val conf = new SparkConf().setMaster("local[1]").setAppName("tester")

  val session = SparkSession.builder().config(conf).getOrCreate()
  val sc = session.sparkContext
  val sqlContext = session.sqlContext

  val df = sqlContext.read.format("csv").load("test50cols.csv")

  // some map operation on all columns
  val cols = df.columns.map { col => upper(df.col(col))  }
  val df2 = df.select(cols: _*)

  // filter header
  val filter = (0 until df.columns.length)
.foldLeft(lit(false))((e, index) => e.or(df2.col(df2.columns(index)) =!= 
s"COLUMN${index+1}"))
  val df3 = df2.filter(filter)

  // some filter operation
  val df4 = df3.filter(df3.col(df3.columns(0)).isNotNull)

  df4.show(100)  // stuck here with a 50 column dataset

}
{code}

What do you mean it gets stuck -- do you have a thread dump?

> Map->filter on many columns gets stuck in constraint inference optimization 
> code
> 
>
> Key: SPARK-19875
> URL: https://issues.apache.org/jira/browse/SPARK-19875
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jay Pranavamurthi
> Attachments: test10cols.csv, test50cols.csv, TestFilter.scala
>
>
> The attached code (TestFilter.scala) works with a 10-column csv dataset, but 
> gets stuck with a 50-column csv dataset. Both datasets are attached.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19875) Map->filter on many columns gets stuck in constraint inference optimization code

2017-03-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902668#comment-15902668
 ] 

Sean Owen edited comment on SPARK-19875 at 3/9/17 8:32 AM:
---

It's easier to inline the code in a comment:

{code}
package test.spark

import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

object TestFilter extends App {

  val conf = new SparkConf().setMaster("local[1]").setAppName("tester")

  val session = SparkSession.builder().config(conf).getOrCreate()
  val sc = session.sparkContext
  val sqlContext = session.sqlContext

  val df = sqlContext.read.format("csv").load("test50cols.csv")

  // some map operation on all columns
  val cols = df.columns.map { col => upper(df.col(col))  }
  val df2 = df.select(cols: _*)

  // filter header
  val filter = (0 until df.columns.length)
    .foldLeft(lit(false))((e, index) => e.or(df2.col(df2.columns(index)) =!= s"COLUMN${index+1}"))
  val df3 = df2.filter(filter)

  // some filter operation
  val df4 = df3.filter(df3.col(df3.columns(0)).isNotNull)

  df4.show(100)  // stuck here with a 50 column dataset

}
{code}

What do you mean it gets stuck -- do you have a thread dump?


was (Author: srowen):
It's easier to inline the code in a comment:

{code:scala}
package test.spark

import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

object TestFilter extends App {

  val conf = new SparkConf().setMaster("local[1]").setAppName("tester")

  val session = SparkSession.builder().config(conf).getOrCreate()
  val sc = session.sparkContext
  val sqlContext = session.sqlContext

  val df = sqlContext.read.format("csv").load("test50cols.csv")

  // some map operation on all columns
  val cols = df.columns.map { col => upper(df.col(col))  }
  val df2 = df.select(cols: _*)

  // filter header
  val filter = (0 until df.columns.length)
    .foldLeft(lit(false))((e, index) => e.or(df2.col(df2.columns(index)) =!= s"COLUMN${index+1}"))
  val df3 = df2.filter(filter)

  // some filter operation
  val df4 = df3.filter(df3.col(df3.columns(0)).isNotNull)

  df4.show(100)  // stuck here with a 50 column dataset

}
{code}

What do you mean it gets stuck -- do you have a thread dump?

> Map->filter on many columns gets stuck in constraint inference optimization 
> code
> 
>
> Key: SPARK-19875
> URL: https://issues.apache.org/jira/browse/SPARK-19875
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jay Pranavamurthi
> Attachments: test10cols.csv, test50cols.csv, TestFilter.scala
>
>
> The attached code (TestFilter.scala) works with a 10-column csv dataset, but 
> gets stuck with a 50-column csv dataset. Both datasets are attached.






[jira] [Resolved] (SPARK-19880) About spark2.0.2 and spark1.4.1 beeline to show the database, use the default operation such as dealing with different

2017-03-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19880.
---
Resolution: Invalid

I'm not clear what this means. It may be better to clarify and re-ask on the 
mailing list.

> About spark2.0.2 and spark1.4.1 beeline to show the database, use the default 
> operation such as dealing with different
> --
>
> Key: SPARK-19880
> URL: https://issues.apache.org/jira/browse/SPARK-19880
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: guoxiaolong
>
> In spark 2.0.2, beeline commands such as "show databases" and "use default" 
> execute a Spark job. When a long-running job (for example a large query) is 
> already running, these commands are queued behind it and have to wait. In 
> spark 1.4.1 the same commands return without waiting. Why do "show databases" 
> and "use default" need to execute a job in spark 2.0.2?






[jira] [Issue Comment Deleted] (SPARK-19862) In SparkEnv.scala,shortShuffleMgrNames tungsten-sort can be deleted.

2017-03-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19862:
--
Comment: was deleted

(was: It's true, though I'm not sure this alone is worth changing, let alone a 
JIRA.)

> In SparkEnv.scala,shortShuffleMgrNames tungsten-sort can be deleted. 
> -
>
> Key: SPARK-19862
> URL: https://issues.apache.org/jira/browse/SPARK-19862
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: guoxiaolong
>Priority: Trivial
>
> "tungsten-sort" -> 
> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName can be 
> deleted, because it is the same as "sort" -> 
> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName.






[jira] [Resolved] (SPARK-19862) In SparkEnv.scala,shortShuffleMgrNames tungsten-sort can be deleted.

2017-03-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19862.
---
Resolution: Won't Fix

> In SparkEnv.scala,shortShuffleMgrNames tungsten-sort can be deleted. 
> -
>
> Key: SPARK-19862
> URL: https://issues.apache.org/jira/browse/SPARK-19862
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: guoxiaolong
>Priority: Trivial
>
> "tungsten-sort" -> 
> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName can be 
> deleted, because it is the same as "sort" -> 
> classOf[org.apache.spark.shuffle.sort.SortShuffleManager].getName.






[jira] [Commented] (SPARK-12694) The detailed rest API documentation for each field is missing

2017-03-09 Thread Ehsun Behravesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902689#comment-15902689
 ] 

Ehsun Behravesh commented on SPARK-12694:
-

So this JIRA ticket should be closed?

> The detailed rest API documentation for each field is missing
> -
>
> Key: SPARK-12694
> URL: https://issues.apache.org/jira/browse/SPARK-12694
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Zhuo Liu
>
> In 1.4, rest API is added. However, we now only have high-level description 
> here (http://spark.apache.org/docs/latest/monitoring.html). We still lack 
> detailed documentation for the name, type and description of each field in 
> the REST API. For example, "startTime" in an application's attempt.
> Adding this will help users a lot.






[jira] [Assigned] (SPARK-19763) qualified external datasource table location stored in catalog

2017-03-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19763:
---

Assignee: Song Jun

> qualified external datasource table location stored in catalog
> --
>
> Key: SPARK-19763
> URL: https://issues.apache.org/jira/browse/SPARK-19763
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Assignee: Song Jun
> Fix For: 2.2.0
>
>
> If we create an external datasource table with a non-qualified location, we 
> should qualify it before storing it in the catalog.
> {code}
> CREATE TABLE t(a string)
> USING parquet
> LOCATION '/path/xx'
> CREATE TABLE t1(a string, b string)
> USING parquet
> PARTITIONED BY(b)
> LOCATION '/path/xx'
> {code}
> When we get the table back from the catalog, the location should be 
> qualified, e.g. 'file:/path/xxx'.
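
For illustration only (not necessarily how the actual patch does it), a location 
string could be qualified against the default filesystem with the Hadoop {{Path}} 
API, along these lines:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object QualifyLocationSketch {
  // Qualify a scheme-less location before storing it in the catalog,
  // e.g. "/path/xx" becomes "file:/path/xx" when the default filesystem is local.
  def qualify(location: String, hadoopConf: Configuration): String = {
    val path = new Path(location)
    val fs = path.getFileSystem(hadoopConf)
    path.makeQualified(fs.getUri, fs.getWorkingDirectory).toString
  }
}
{code}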






[jira] [Comment Edited] (SPARK-12009) Avoid re-allocate yarn container while driver want to stop all Executors

2017-03-09 Thread Swaranga Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902634#comment-15902634
 ] 

Swaranga Sarma edited comment on SPARK-12009 at 3/9/17 7:59 AM:


The JIRA says that the issue is fixed but I still see this error in Spark 2.1.0

{code}
try (JavaSparkContext sc = new JavaSparkContext(new SparkConf())) {
  //run the job
}
{code}


was (Author: swaranga):
The JIRA says that the issue is fixed but I still see this error in Spark 2.1.0

{code}
try (JavaSparkContext sc = new JavaSparkContext(new SparkConf()) {
  //run the job
}
{code}

> Avoid re-allocate yarn container while driver want to stop all Executors
> 
>
> Key: SPARK-12009
> URL: https://issues.apache.org/jira/browse/SPARK-12009
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.2
>Reporter: SuYan
>Assignee: SuYan
>Priority: Minor
> Fix For: 2.0.0
>
>
> Log based 1.4.0
> 2015-11-26,03:05:16,176 WARN 
> org.spark-project.jetty.util.thread.QueuedThreadPool: 8 threads could not be 
> stopped
> 2015-11-26,03:05:16,177 INFO org.apache.spark.ui.SparkUI: Stopped Spark web 
> UI at http://
> 2015-11-26,03:05:16,401 INFO org.apache.spark.scheduler.DAGScheduler: 
> Stopping DAGScheduler
> 2015-11-26,03:05:16,450 INFO 
> org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Shutting down 
> all executors
> 2015-11-26,03:05:16,525 INFO 
> org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Asking each 
> executor to shut down
> 2015-11-26,03:05:16,791 INFO 
> org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint: Driver terminated 
> or disconnected! Shutting down. XX.XX.XX.XX:38734
> 2015-11-26,03:05:16,847 ERROR org.apache.spark.scheduler.LiveListenerBus: 
> SparkListenerBus has already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(164,WrappedArray())
> 2015-11-26,03:05:27,242 INFO org.apache.spark.deploy.yarn.YarnAllocator: Will 
> request 13 executor containers, each with 1 cores and 4608 MB memory 
> including 1024 MB overhead






[jira] [Commented] (SPARK-12009) Avoid re-allocate yarn container while driver want to stop all Executors

2017-03-09 Thread Swaranga Sarma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902634#comment-15902634
 ] 

Swaranga Sarma commented on SPARK-12009:


The JIRA says that the issue is fixed but I still see this error in Spark 2.1.0

{code}
try (JavaSparkContext sc = new JavaSparkContext(new SparkConf())) {
  //run the job
}
{code}

> Avoid re-allocate yarn container while driver want to stop all Executors
> 
>
> Key: SPARK-12009
> URL: https://issues.apache.org/jira/browse/SPARK-12009
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.2
>Reporter: SuYan
>Assignee: SuYan
>Priority: Minor
> Fix For: 2.0.0
>
>
> Log based 1.4.0
> 2015-11-26,03:05:16,176 WARN 
> org.spark-project.jetty.util.thread.QueuedThreadPool: 8 threads could not be 
> stopped
> 2015-11-26,03:05:16,177 INFO org.apache.spark.ui.SparkUI: Stopped Spark web 
> UI at http://
> 2015-11-26,03:05:16,401 INFO org.apache.spark.scheduler.DAGScheduler: 
> Stopping DAGScheduler
> 2015-11-26,03:05:16,450 INFO 
> org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Shutting down 
> all executors
> 2015-11-26,03:05:16,525 INFO 
> org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Asking each 
> executor to shut down
> 2015-11-26,03:05:16,791 INFO 
> org.apache.spark.deploy.yarn.ApplicationMaster$AMEndpoint: Driver terminated 
> or disconnected! Shutting down. XX.XX.XX.XX:38734
> 2015-11-26,03:05:16,847 ERROR org.apache.spark.scheduler.LiveListenerBus: 
> SparkListenerBus has already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(164,WrappedArray())
> 2015-11-26,03:05:27,242 INFO org.apache.spark.deploy.yarn.YarnAllocator: Will 
> request 13 executor containers, each with 1 cores and 4608 MB memory 
> including 1024 MB overhead






[jira] [Resolved] (SPARK-12694) The detailed rest API documentation for each field is missing

2017-03-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12694.
---
Resolution: Duplicate

Yes, probably superseded by other changes.

> The detailed rest API documentation for each field is missing
> -
>
> Key: SPARK-12694
> URL: https://issues.apache.org/jira/browse/SPARK-12694
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Zhuo Liu
>
> In 1.4, rest API is added. However, we now only have high-level description 
> here (http://spark.apache.org/docs/latest/monitoring.html). We still lack 
> detailed documentation for the name, type and description of each field in 
> the REST API. For example, "startTime" in an application's attempt.
> Adding this will help users a lot.






[jira] [Commented] (SPARK-19880) About spark2.0.2 and spark1.4.1 beeline to show the database, use the default operation such as dealing with different

2017-03-09 Thread guoxiaolong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902679#comment-15902679
 ] 

guoxiaolong commented on SPARK-19880:
-

But when a running job takes a long time (for example a large query), the 
subsequent "show databases" and "use default" commands have to wait a long time 
behind it.

> About spark2.0.2 and spark1.4.1 beeline to show the database, use the default 
> operation such as dealing with different
> --
>
> Key: SPARK-19880
> URL: https://issues.apache.org/jira/browse/SPARK-19880
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: guoxiaolong
>
> In spark 2.0.2, beeline commands such as "show databases" and "use default" 
> execute a Spark job. When a long-running job (for example a large query) is 
> already running, these commands are queued behind it and have to wait. In 
> spark 1.4.1 the same commands return without waiting. Why do "show databases" 
> and "use default" need to execute a job in spark 2.0.2?






[jira] [Commented] (SPARK-12694) The detailed rest API documentation for each field is missing

2017-03-09 Thread Ehsun Behravesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902692#comment-15902692
 ] 

Ehsun Behravesh commented on SPARK-12694:
-

Thanks

> The detailed rest API documentation for each field is missing
> -
>
> Key: SPARK-12694
> URL: https://issues.apache.org/jira/browse/SPARK-12694
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Zhuo Liu
>
> In 1.4, rest API is added. However, we now only have high-level description 
> here (http://spark.apache.org/docs/latest/monitoring.html). We still lack 
> detailed documentation for the name, type and description of each field in 
> the REST API. For example, "startTime" in an application's attempt.
> Adding this will help users a lot.






[jira] [Commented] (SPARK-12694) The detailed rest API documentation for each field is missing

2017-03-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902690#comment-15902690
 ] 

Sean Owen commented on SPARK-12694:
---

Already done, yes.

> The detailed rest API documentation for each field is missing
> -
>
> Key: SPARK-12694
> URL: https://issues.apache.org/jira/browse/SPARK-12694
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Zhuo Liu
>
> In 1.4, rest API is added. However, we now only have high-level description 
> here (http://spark.apache.org/docs/latest/monitoring.html). We still lack 
> detailed documentation for the name, type and description of each field in 
> the REST API. For example, "startTime" in an application's attempt.
> Adding this will help users a lot.






[jira] [Issue Comment Deleted] (SPARK-12694) The detailed rest API documentation for each field is missing

2017-03-09 Thread Ehsun Behravesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ehsun Behravesh updated SPARK-12694:

Comment: was deleted

(was: So this JIRA ticket should be close?)

> The detailed rest API documentation for each field is missing
> -
>
> Key: SPARK-12694
> URL: https://issues.apache.org/jira/browse/SPARK-12694
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Zhuo Liu
>
> In 1.4, rest API is added. However, we now only have high-level description 
> here (http://spark.apache.org/docs/latest/monitoring.html). We still lack 
> detailed documentation for the name, type and description of each field in 
> the REST API. For example, "startTime" in an application's attempt.
> Adding this will help users a lot.






[jira] [Resolved] (SPARK-19763) qualified external datasource table location stored in catalog

2017-03-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19763.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17095
[https://github.com/apache/spark/pull/17095]

> qualified external datasource table location stored in catalog
> --
>
> Key: SPARK-19763
> URL: https://issues.apache.org/jira/browse/SPARK-19763
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
> Fix For: 2.2.0
>
>
> If we create an external datasource table with a non-qualified location, we 
> should qualify it before storing it in the catalog.
> {code}
> CREATE TABLE t(a string)
> USING parquet
> LOCATION '/path/xx'
> CREATE TABLE t1(a string, b string)
> USING parquet
> PARTITIONED BY(b)
> LOCATION '/path/xx'
> {code}
> When we get the table back from the catalog, the location should be 
> qualified, e.g. 'file:/path/xxx'.





