[jira] [Assigned] (SPARK-23728) ML test with expected exceptions testing streaming fails on 2.3

2018-03-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23728:


Assignee: Apache Spark

> ML test with expected exceptions testing streaming fails on 2.3
> ---
>
> Key: SPARK-23728
> URL: https://issues.apache.org/jira/browse/SPARK-23728
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Attila Zsolt Piros
>Assignee: Apache Spark
>Priority: Major
>
> The testTransformerByInterceptingException helper fails to catch the expected 
> message because on 2.3, when an exception happens within an ML feature during 
> streaming, the feature-generated message is not on the direct cause exception 
> but one level deeper in the cause chain.






[jira] [Assigned] (SPARK-23728) ML test with expected exceptions testing streaming fails on 2.3

2018-03-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23728:


Assignee: (was: Apache Spark)

> ML test with expected exceptions testing streaming fails on 2.3
> ---
>
> Key: SPARK-23728
> URL: https://issues.apache.org/jira/browse/SPARK-23728
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> The testTransformerByInterceptingException helper fails to catch the expected 
> message because on 2.3, when an exception happens within an ML feature during 
> streaming, the feature-generated message is not on the direct cause exception 
> but one level deeper in the cause chain.






[jira] [Commented] (SPARK-23728) ML test with expected exceptions testing streaming fails on 2.3

2018-03-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403852#comment-16403852
 ] 

Apache Spark commented on SPARK-23728:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/20852

> ML test with expected exceptions testing streaming fails on 2.3
> ---
>
> Key: SPARK-23728
> URL: https://issues.apache.org/jira/browse/SPARK-23728
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> The testTransformerByInterceptingException helper fails to catch the expected 
> message because on 2.3, when an exception happens within an ML feature during 
> streaming, the feature-generated message is not on the direct cause exception 
> but one level deeper in the cause chain.






[jira] [Updated] (SPARK-23728) ML test with expected exceptions testing streaming fails on 2.3

2018-03-17 Thread Attila Zsolt Piros (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-23728:
---
Summary: ML test with expected exceptions testing streaming fails on 2.3  
(was: ML test with expected exceptions via streaming fails on 2.3)

> ML test with expected exceptions testing streaming fails on 2.3
> ---
>
> Key: SPARK-23728
> URL: https://issues.apache.org/jira/browse/SPARK-23728
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> The testTransformerByInterceptingException helper fails to catch the expected 
> message because on 2.3, when an exception happens within an ML feature during 
> streaming, the feature-generated message is not on the direct cause exception 
> but one level deeper in the cause chain.






[jira] [Created] (SPARK-23728) ML test with expected exceptions via streaming fails on 2.3

2018-03-17 Thread Attila Zsolt Piros (JIRA)
Attila Zsolt Piros created SPARK-23728:
--

 Summary: ML test with expected exceptions via streaming fails on 
2.3
 Key: SPARK-23728
 URL: https://issues.apache.org/jira/browse/SPARK-23728
 Project: Spark
  Issue Type: Bug
  Components: ML, Tests
Affects Versions: 2.3.0
Reporter: Attila Zsolt Piros


The testTransformerByInterceptingException helper fails to catch the expected 
message because on 2.3, when an exception happens within an ML feature during 
streaming, the feature-generated message is not on the direct cause exception 
but one level deeper in the cause chain.
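
For illustration only, here is a minimal sketch of checking the whole cause 
chain rather than only the direct cause. The real helper lives in Spark's ML 
test utilities; the function name, signature and behaviour below are 
assumptions for this example, not the actual fix:

{noformat}
import scala.annotation.tailrec

// Hypothetical check: does the expected message appear anywhere in the cause
// chain of `t`? On 2.3 the feature-generated message can sit one level deeper
// than the direct cause when the exception is thrown during streaming, so a
// check pinned to t.getCause alone misses it.
@tailrec
def causeChainContains(t: Throwable, expectedMessagePart: String): Boolean = {
  if (t == null) {
    false
  } else if (t.getMessage != null && t.getMessage.contains(expectedMessagePart)) {
    true
  } else {
    causeChainContains(t.getCause, expectedMessagePart)
  }
}
{noformat}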






[jira] [Commented] (SPARK-23728) ML test with expected exceptions via streaming fails on 2.3

2018-03-17 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403850#comment-16403850
 ] 

Attila Zsolt Piros commented on SPARK-23728:


I will create a PR soon.

> ML test with expected exceptions via streaming fails on 2.3
> ---
>
> Key: SPARK-23728
> URL: https://issues.apache.org/jira/browse/SPARK-23728
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> The testTransformerByInterceptingException helper fails to catch the expected 
> message because on 2.3, when an exception happens within an ML feature during 
> streaming, the feature-generated message is not on the direct cause exception 
> but one level deeper in the cause chain.






[jira] [Assigned] (SPARK-23727) Support DATE predicate push down in Parquet

2018-03-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23727:


Assignee: (was: Apache Spark)

> Support DATE predicate push down in Parquet
> -
>
> Key: SPARK-23727
> URL: https://issues.apache.org/jira/browse/SPARK-23727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: yucai
>Priority: Major
>
> DATE predicate push down is missing and should be supported.






[jira] [Assigned] (SPARK-23727) Support DATE predicate push down in Parquet

2018-03-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23727:


Assignee: Apache Spark

> Support DATE predicate push down in Parquet
> -
>
> Key: SPARK-23727
> URL: https://issues.apache.org/jira/browse/SPARK-23727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: yucai
>Assignee: Apache Spark
>Priority: Major
>
> DATE predicate push down is missing and should be supported.






[jira] [Commented] (SPARK-23727) Support DATE predicate push down in Parquet

2018-03-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403846#comment-16403846
 ] 

Apache Spark commented on SPARK-23727:
--

User 'yucai' has created a pull request for this issue:
https://github.com/apache/spark/pull/20851

> Support DATE predicate push down in Parquet
> -
>
> Key: SPARK-23727
> URL: https://issues.apache.org/jira/browse/SPARK-23727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: yucai
>Priority: Major
>
> DATE predicate push down is missing and should be supported.






[jira] [Created] (SPARK-23727) Support DATE predicate push down in Parquet

2018-03-17 Thread yucai (JIRA)
yucai created SPARK-23727:
-

 Summary: Support DATE predicate push down in Parquet
 Key: SPARK-23727
 URL: https://issues.apache.org/jira/browse/SPARK-23727
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: yucai


DATE predicate push down is missing and should be supported.
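
To illustrate the kind of query affected (the path and column name below are 
made up for the example), a filter on a DATE column read from Parquet currently 
shows up only as a post-scan filter rather than in PushedFilters:

{noformat}
import java.sql.Date
import org.apache.spark.sql.functions.{col, lit}

// Hypothetical Parquet data with a DATE column named "event_date".
val df = spark.read.parquet("/tmp/events.parquet")

// Without DATE predicate push down this comparison is evaluated in Spark after
// the scan; check the PushedFilters section of the physical plan.
val recent = df.filter(col("event_date") >= lit(Date.valueOf("2018-01-01")))
recent.explain()
{noformat}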






[jira] [Commented] (SPARK-23714) Add metrics for cached KafkaConsumer

2018-03-17 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403792#comment-16403792
 ] 

Ted Yu commented on SPARK-23714:


Ryan reminded me of the possibility of using the existing metrics system.
I think that implies passing SparkEnv to KafkaDataConsumer.
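
A minimal sketch of what using the existing metrics system could look like, 
assuming access to Spark-internal classes (Source and MetricsSystem are 
private[spark]); the source and counter names below are made up, not an agreed 
design:

{noformat}
import com.codahale.metrics.MetricRegistry
import org.apache.spark.SparkEnv
import org.apache.spark.metrics.source.Source

// Hypothetical metrics source for the Kafka consumer cache. Because the
// referenced classes are private[spark], this would live inside the
// kafka-0-10-sql module under the org.apache.spark package.
class KafkaConsumerCacheSource extends Source {
  override val sourceName: String = "KafkaConsumerCache"
  override val metricRegistry: MetricRegistry = new MetricRegistry
  val hits = metricRegistry.counter("hits")
  val misses = metricRegistry.counter("misses")
  val evictions = metricRegistry.counter("evictions")
}

// Registration via SparkEnv, as suggested above.
SparkEnv.get.metricsSystem.registerSource(new KafkaConsumerCacheSource)
{noformat}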

> Add metrics for cached KafkaConsumer
> 
>
> Key: SPARK-23714
> URL: https://issues.apache.org/jira/browse/SPARK-23714
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Ted Yu
>Priority: Major
>
> SPARK-23623 added KafkaDataConsumer to avoid concurrent use of cached 
> KafkaConsumer.
> This JIRA is to add metrics for measuring the operations of the cache so that 
> users can gain insight into the caching solution.






[jira] [Updated] (SPARK-23623) Avoid concurrent use of cached KafkaConsumer in CachedKafkaConsumer (kafka-0-10-sql)

2018-03-17 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-23623:
-
Fix Version/s: 2.3.1

> Avoid concurrent use of cached KafkaConsumer in CachedKafkaConsumer 
> (kafka-0-10-sql)
> 
>
> Key: SPARK-23623
> URL: https://issues.apache.org/jira/browse/SPARK-23623
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
> Fix For: 2.3.1, 2.4.0
>
>
> CachedKafkaConsumer in the `kafka-0-10-sql` module is designed to maintain a 
> pool of KafkaConsumers that can be reused. However, it was built with the 
> assumption that there will be only one task trying to read the same Kafka 
> TopicPartition at a time. Hence, the cache was keyed by the TopicPartition a 
> consumer is supposed to read. For any case where this assumption may not hold, 
> there is a SparkPlan flag to disable the use of the cache. So it was up to the 
> planner to correctly identify when it was not safe to use the cache and set 
> the flag accordingly.
> Fundamentally, this is the wrong way to approach the problem. It is HARD for 
> a high-level planner to reason about the low-level execution model, i.e. 
> whether there will be multiple tasks in the same query trying to read the same 
> partition. Case in point: 2.3.0 introduced stream-stream joins, and you can 
> build a streaming self-join query on Kafka. It's pretty non-trivial to figure 
> out how this leads to two tasks reading the same partition, possibly 
> concurrently. And due to that non-triviality, it is hard to figure this out in 
> the planner and set the flag to avoid the cache / consumer pool. This can 
> inadvertently lead to {{ConcurrentModificationException}}, or worse, silent 
> reading of incorrect data.
> Here is a better way to design this. The planner shouldn't have to understand 
> these low-level optimizations. Rather, the consumer pool should be smart 
> enough to avoid concurrent use of a cached consumer. Currently, it tries to do 
> so but incorrectly (the flag {{inuse}} is not checked when returning a cached 
> consumer; see 
> [this|https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala#L403]).
> If there is another request for the same partition as a currently in-use 
> consumer, the pool should automatically return a fresh consumer that should 
> be closed when the task is done.
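
A very rough sketch of that proposed behaviour, with made-up names and a String 
key standing in for TopicPartition; it is meant only to show the acquire/release 
idea, not the shape of the eventual fix:

{noformat}
import scala.collection.mutable

// Hypothetical cached consumer: a handle plus an in-use flag.
class PooledConsumer(val topicPartition: String) {
  var inUse: Boolean = false
  def close(): Unit = ()  // placeholder for KafkaConsumer.close()
}

class ConsumerPool {
  private val cache = mutable.Map[String, PooledConsumer]()

  def acquire(topicPartition: String): PooledConsumer = synchronized {
    cache.get(topicPartition) match {
      case Some(c) if !c.inUse =>
        c.inUse = true             // reuse the cached consumer
        c
      case Some(_) =>
        // Cached consumer is busy: hand out a fresh, non-cached consumer that
        // the caller must close when the task is done.
        new PooledConsumer(topicPartition)
      case None =>
        val c = new PooledConsumer(topicPartition)
        c.inUse = true
        cache(topicPartition) = c  // cache it for later reuse
        c
    }
  }

  def release(consumer: PooledConsumer): Unit = synchronized {
    if (cache.get(consumer.topicPartition).contains(consumer)) {
      consumer.inUse = false       // return the cached consumer to the pool
    } else {
      consumer.close()             // fresh consumer: just close it
    }
  }
}
{noformat}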






[jira] [Resolved] (SPARK-23726) standalone quickstart fails loading files with Hadoop's java.net.ConnectException: Connection refused

2018-03-17 Thread Tim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim resolved SPARK-23726.
-
Resolution: Invalid

> standalone quickstart fails loading files with Hadoop's 
> java.net.ConnectException: Connection refused
> -
>
> Key: SPARK-23726
> URL: https://issues.apache.org/jira/browse/SPARK-23726
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0
> Environment: local mac with jvm "Using Scala version 2.11.8 (Java 
> HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)"
>Reporter: Tim
>Priority: Blocker
>
> 1) downloaded latest 2.3.0 release from 
> [https://www.apache.org/dyn/closer.lua/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz]
> 2) un-gzip it and start up the Spark shell with
> {{./bin/spark-shell --master local[2]}}
> 3) once console starts up, try to read in a file per 
> [https://spark.apache.org/docs/latest/quick-start.html]
> scala> val textFile = spark.read.textFile("README.md")
>  
> this produces the following exception:
>  
> Macs-MBP:spark-2.3.0-bin-hadoop2.7 macuser$ ./bin/spark-shell --master 
> local[2]
> 2018-03-17 18:43:33 WARN  NativeCodeLoader:62 - Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> Spark context Web UI available at http://macs-mbp.fios-router.home:4040
> Spark context available as 'sc' (master = local[2], app id = 
> local-1521326617762).
> Spark session available as 'spark'.
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
>       /_/
>
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val textFile = spark.read.textFile("README.md")
> 2018-03-17 18:43:41 WARN  FileStreamSink:66 - Error while looking for 
> metadata directory.
> java.net.ConnectException: Call From Macs-MBP.fios-router.home/192.168.1.154 
> to localhost:8020 failed on connection exception: java.net.ConnectException: 
> Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1479)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1412)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>   at com.sun.proxy.$Proxy19.getFileInfo(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy20.getFileInfo(Unknown Source)
>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2108)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:714)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389)
>   at 
> 

[jira] [Commented] (SPARK-23726) standalone quickstart fails loading files with Hadoop's java.net.ConnectException: Connection refused

2018-03-17 Thread Tim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403782#comment-16403782
 ] 

Tim commented on SPARK-23726:
-

The error is due to locally defined HADOOP_HOME and HADOOP_CONF_DIR env vars. 
Once I cleared those, Spark started working locally. The issue can be closed.
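
For reference (not part of the reporter's steps): with fs.defaultFS pointing at 
an HDFS namenode, relative paths resolve against HDFS, which is why the 
quick-start example tried to contact localhost:8020. An explicit file:// URI 
forces the local filesystem regardless of the Hadoop configuration; the path 
below is illustrative:

{noformat}
val textFile = spark.read.textFile("file:///path/to/spark-2.3.0-bin-hadoop2.7/README.md")
{noformat}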

> standalone quickstart fails loading files with Hadoop's 
> java.net.ConnectException: Connection refused
> -
>
> Key: SPARK-23726
> URL: https://issues.apache.org/jira/browse/SPARK-23726
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.3.0
> Environment: local mac with jvm "Using Scala version 2.11.8 (Java 
> HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)"
>Reporter: Tim
>Priority: Blocker
>
> 1) downloaded latest 2.3.0 release from 
> [https://www.apache.org/dyn/closer.lua/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz]
> 2) un-gzip it and start up the Spark shell with
> {{./bin/spark-shell --master local[2]}}
> 3) once console starts up, try to read in a file per 
> [https://spark.apache.org/docs/latest/quick-start.html]
> scala> val textFile = spark.read.textFile("README.md")
>  
> this produces the following exception:
>  
> Macs-MBP:spark-2.3.0-bin-hadoop2.7 macuser$ ./bin/spark-shell --master 
> local[2]
> 2018-03-17 18:43:33 WARN  NativeCodeLoader:62 - Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> Spark context Web UI available at http://macs-mbp.fios-router.home:4040
> Spark context available as 'sc' (master = local[2], app id = 
> local-1521326617762).
> Spark session available as 'spark'.
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
>       /_/
>
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val textFile = spark.read.textFile("README.md")
> 2018-03-17 18:43:41 WARN  FileStreamSink:66 - Error while looking for 
> metadata directory.
> java.net.ConnectException: Call From Macs-MBP.fios-router.home/192.168.1.154 
> to localhost:8020 failed on connection exception: java.net.ConnectException: 
> Connection refused; For more details see:  
> http://wiki.apache.org/hadoop/ConnectionRefused
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
>   at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1479)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1412)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>   at com.sun.proxy.$Proxy19.getFileInfo(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy20.getFileInfo(Unknown Source)
>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2108)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:714)
>   at 
> 

[jira] [Created] (SPARK-23726) standalone quickstart fails loading files with Hadoop's java.net.ConnectException: Connection refused

2018-03-17 Thread Tim (JIRA)
Tim created SPARK-23726:
---

 Summary: standalone quickstart fails loading files with Hadoop's 
java.net.ConnectException: Connection refused
 Key: SPARK-23726
 URL: https://issues.apache.org/jira/browse/SPARK-23726
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.3.0
 Environment: local mac with jvm "Using Scala version 2.11.8 (Java 
HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)"
Reporter: Tim


1) downloaded latest 2.3.0 release from 
[https://www.apache.org/dyn/closer.lua/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz]

2) un-gzip it and start up the Spark shell with

{{./bin/spark-shell --master local[2]}}

3) once console starts up, try to read in a file per 
[https://spark.apache.org/docs/latest/quick-start.html]

scala> val textFile = spark.read.textFile("README.md")

 

this produces the following exception:

 

Macs-MBP:spark-2.3.0-bin-hadoop2.7 macuser$ ./bin/spark-shell --master local[2]
2018-03-17 18:43:33 WARN  NativeCodeLoader:62 - Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Spark context Web UI available at http://macs-mbp.fios-router.home:4040
Spark context available as 'sc' (master = local[2], app id = 
local-1521326617762).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val textFile = spark.read.textFile("README.md")
2018-03-17 18:43:41 WARN  FileStreamSink:66 - Error while looking for metadata 
directory.
java.net.ConnectException: Call From Macs-MBP.fios-router.home/192.168.1.154 to 
localhost:8020 failed on connection exception: java.net.ConnectException: 
Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
  at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
  at org.apache.hadoop.ipc.Client.call(Client.java:1479)
  at org.apache.hadoop.ipc.Client.call(Client.java:1412)
  at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
  at com.sun.proxy.$Proxy19.getFileInfo(Unknown Source)
  at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
  at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
  at com.sun.proxy.$Proxy20.getFileInfo(Unknown Source)
  at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2108)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
  at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
  at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
  at 
org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:714)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389)
  at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:344)
  at 

[jira] [Updated] (SPARK-23560) Group by on struct field can add extra shuffle

2018-03-17 Thread Bruce Robbins (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-23560:
--
Summary: Group by on struct field can add extra shuffle  (was: A joinWith 
followed by groupBy requires extra shuffle)

> Group by on struct field can add extra shuffle
> --
>
> Key: SPARK-23560
> URL: https://issues.apache.org/jira/browse/SPARK-23560
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: debian 8.9, macos x high sierra
>Reporter: Bruce Robbins
>Priority: Major
>
> Depending on the size of the input, a joinWith followed by a groupBy requires 
> more shuffles than a join followed by a groupBy.
> For example, here's a joinWith on two CSV files, followed by a groupBy:
> {noformat}
> import org.apache.spark.sql.types._
> val schema = StructType(StructField("id1", LongType) :: StructField("id2", 
> LongType) :: Nil)
> val df1 = spark.read.schema(schema).csv("ds1.csv")
> val df2 = spark.read.schema(schema).csv("ds2.csv")
> val result1 = df1.joinWith(df2, df1.col("id1") === 
> df2.col("id2")).groupBy("_1.id1").count
> result1.explain
> == Physical Plan ==
> *(6) HashAggregate(keys=[_1#8.id1#19L], functions=[count(1)])
> +- Exchange hashpartitioning(_1#8.id1#19L, 200)
>+- *(5) HashAggregate(keys=[_1#8.id1 AS _1#8.id1#19L], 
> functions=[partial_count(1)])
>   +- *(5) Project [_1#8]
>  +- *(5) SortMergeJoin [_1#8.id1], [_2#9.id2], Inner
> :- *(2) Sort [_1#8.id1 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(_1#8.id1, 200)
> : +- *(1) Project [named_struct(id1, id1#0L, id2, id2#1L) AS 
> _1#8]
> :+- *(1) FileScan csv [id1#0L,id2#1L] Batched: false, 
> Format: CSV, Location: InMemoryFileIndex[file:.../ds1.csv], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct
> +- *(4) Sort [_2#9.id2 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(_2#9.id2, 200)
>   +- *(3) Project [named_struct(id1, id1#4L, id2, id2#5L) AS 
> _2#9]
>  +- *(3) FileScan csv [id1#4L,id2#5L] Batched: false, 
> Format: CSV, Location: InMemoryFileIndex[file:...ds2.csv], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct
> {noformat}
> Using join, there is one less shuffle:
> {noformat}
> val result2 = df1.join(df2,  df1.col("id1") === 
> df2.col("id2")).groupBy(df1("id1")).count
> result2.explain
> == Physical Plan ==
> *(5) HashAggregate(keys=[id1#0L], functions=[count(1)])
> +- *(5) HashAggregate(keys=[id1#0L], functions=[partial_count(1)])
>+- *(5) Project [id1#0L]
>   +- *(5) SortMergeJoin [id1#0L], [id2#5L], Inner
>  :- *(2) Sort [id1#0L ASC NULLS FIRST], false, 0
>  :  +- Exchange hashpartitioning(id1#0L, 200)
>  : +- *(1) Project [id1#0L]
>  :+- *(1) Filter isnotnull(id1#0L)
>  :   +- *(1) FileScan csv [id1#0L] Batched: false, Format: 
> CSV, Location: InMemoryFileIndex[file:.../ds1.csv], PartitionFilters: [], 
> PushedFilters: [IsNotNull(id1)], ReadSchema: struct
>  +- *(4) Sort [id2#5L ASC NULLS FIRST], false, 0
> +- Exchange hashpartitioning(id2#5L, 200)
>+- *(3) Project [id2#5L]
>   +- *(3) Filter isnotnull(id2#5L)
>  +- *(3) FileScan csv [id2#5L] Batched: false, Format: 
> CSV, Location: InMemoryFileIndex[file:...ds2.csv], PartitionFilters: [], 
> PushedFilters: [IsNotNull(id2)], ReadSchema: struct
> {noformat}
> -The extra exchange is reflected in the run time of the query.- Actually, I 
> recant this bit. In my particular tests, the extra exchange has negligible 
> impact on run time. All the difference is in stage 2.
> My tests were on inputs with more than 2 million records.






[jira] [Updated] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values

2018-03-17 Thread Bruce Robbins (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-23715:
--
Description: 
This produces the expected answer:
{noformat}
df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" 
).as("dt")).show
+---+
| dt|
+---+
|2018-03-13 07:18:23|
+---+
{noformat}
However, the equivalent UTC input (but with an explicit timezone) produces a 
wrong answer:
{noformat}
df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" 
).as("dt")).show
+---+
| dt|
+---+
|2018-03-13 00:18:23|
+---+
{noformat}
Additionally, the equivalent Unix time (1520921903, which is also 
"2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer:
{noformat}
df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" 
).as("dt")).show
+---+
| dt|
+---+
|2018-03-13 00:18:23|
+---+
{noformat}
Digging a little into the code, I see the following:

There is sometimes a mismatch in expectations between the (string => timestamp) 
cast and FromUTCTimestamp. Also, since the FromUTCTimestamp expression never 
sees the actual input string (the cast "intercepts" the input and converts it 
to a long timestamp before FromUTCTimestamp uses the value), FromUTCTimestamp 
cannot reject any input value that would exercise this mismatch in expectations.

There is a similar mismatch in expectations in the (integer => timestamp) cast 
and FromUTCTimestamp. As a result, Unix time input almost always produces 
incorrect output.
h3. When things work as expected for String input:

When from_utc_timestamp is passed a string time value with no time zone, 
DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the 
datetime string as though it's in the user's local time zone. Because 
DateTimeUtils.stringToTimestamp is a general function, this is reasonable.

As a result, FromUTCTimestamp's input is a timestamp shifted by the local time 
zone's offset. FromUTCTimestamp assumes this (or more accurately, a utility 
function called by FromUTCTimestamp assumes this), so the first thing it does 
is reverse-shift it to get back the correct value. Now that the long value has 
been shifted back to the correct timestamp value, it can process it (by 
shifting it again based on the specified time zone).
h3. When things go wrong with String input:

When from_utc_timestamp is passed a string datetime value with an explicit time 
zone, stringToTimestamp honors that timezone and ignores the local time zone. 
stringToTimestamp does not shift the timestamp by the local timezone's offset, 
but by the timezone specified on the datetime string.

Unfortunately, FromUTCTimestamp, which has no insight into the actual input or 
the conversion, still assumes the timestamp is shifted by the local time zone. 
So it reverse-shifts the long value by the local time zone's offset, which 
produces an incorrect timestamp (except in the case where the input datetime 
string just happened to have an explicit timezone that matches the local 
timezone). FromUTCTimestamp then uses this incorrect value for further 
processing.
h3. When things go wrong for Unix time input:

The cast in this case simply multiplies the integer by 1,000,000 (converting 
seconds to microseconds). The cast does not shift the resulting timestamp by 
the local time zone's offset.

Again, because FromUTCTimestamp's evaluation assumes a shifted timestamp, the 
result is wrong.
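
For concreteness, assuming (hypothetically) a session time zone of 
America/Los_Angeles, which is UTC-7 on 2018-03-13, the arithmetic behind the 
second example above is roughly:

{noformat}
cast("2018-03-13T06:18:23+00:00")  =>  06:18:23 UTC (shifted by the string's +00:00, not by the local zone)
correct: 06:18:23 + 1h (target GMT+1)                           = 07:18:23
actual:  06:18:23 - 7h (unwarranted reverse shift) + 1h (GMT+1) = 00:18:23
{noformat}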

  was:
This produces the expected answer:
{noformat}
df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" 
).as("dt")).show
+---+
| dt|
+---+
|2018-03-13 07:18:23|
+---+
{noformat}
However, the equivalent UTC input (but with an explicit timezone) produces a 
wrong answer:
{noformat}
df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" 
).as("dt")).show
+---+
| dt|
+---+
|2018-03-13 00:18:23|
+---+
{noformat}
Additionally, the equivalent Unix time (1520921903, which is also 
"2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer:
{noformat}
df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" 
).as("dt")).show
+---+
| dt|
+---+
|2018-03-13 00:18:23|
+---+
{noformat}
Digging a little into the code, I see the following:

There is sometimes a mismatch in expectations between the (string => timestamp) 
cast and FromUTCTimestamp. Also, since the FromUTCTimestamp expression never 
sees the actual input string (the cast "intercepts" the input and converts it 
to a long timestamp before FromUTCTimestamp uses the value), FromUTCTimestamp 
cannot reject any input value 

[jira] [Updated] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values

2018-03-17 Thread Bruce Robbins (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-23715:
--
Description: 
This produces the expected answer:
{noformat}
df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" 
).as("dt")).show
+---+
| dt|
+---+
|2018-03-13 07:18:23|
+---+
{noformat}
However, the equivalent UTC input (but with an explicit timezone) produces a 
wrong answer:
{noformat}
df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" 
).as("dt")).show
+---+
| dt|
+---+
|2018-03-13 00:18:23|
+---+
{noformat}
Additionally, the equivalent Unix time (1520921903, which is also 
"2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer:
{noformat}
df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" 
).as("dt")).show
+---+
| dt|
+---+
|2018-03-13 00:18:23|
+---+
{noformat}
Digging a little into the code, I see the following:

There is sometimes a mismatch in expectations between the (string => timestamp) 
cast and FromUTCTimestamp. Also, since the FromUTCTimestamp expression never 
sees the actual input string (the cast "intercepts" the input and converts it 
to a long timestamp before FromUTCTimestamp uses the value), FromUTCTimestamp 
cannot reject any input value that would exercise this mismatch in expectations.

There is a similar mismatch in expectations in the (integer => timestamp) cast 
and FromUTCTimestamp. As a result, Unix time input almost always produces 
incorrect output.
h3. When things work as expected for String input:

When from_utc_timestamp is passed a string time value with no time zone, 
DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the 
datetime string as though it's in the user's local time zone. Because 
DateTimeUtils.stringToTimestamp is a general function, this is reasonable.

As a result, FromUTCTimestamp's input is a timestamp shifted by the local time 
zone's offset. FromUTCTimestamp assumes this (or more accurately, a utility 
function called by FromUTCTimestamp assumes this), so the first thing it does 
is reverse-shift it to get back the correct value. Now that the long value has 
been shifted back to the correct timestamp value, it can process it (by 
shifting it again based on the specified time zone).
h3. When things go wrong with String input:

When from_utc_timestamp is passed a string datetime value with an explicit time 
zone, stringToTimestamp honors that timezone and ignores the local time zone. 
stringToTimestamp does not shift the timestamp by the local timezone, but by 
the timezone specified on the datetime string.

Unfortunately, FromUTCTimestamp, which has no insight into the actual input or 
the conversion, still assumes the timestamp is shifted by the local time zone. 
So it reverse-shifts the long value by the local time zone's offset, which 
produces an incorrect timestamp (except in the case where the input datetime 
string just happened to have an explicit timezone that matches the local 
timezone). FromUTCTimestamp then uses this incorrect value for further 
processing.
h3. When things go wrong for Unix time input:

The cast in this case simply multiplies the integer by 1,000,000 (converting 
seconds to microseconds). The cast does not shift the resulting timestamp by 
the local time zone's offset.

Again, because FromUTCTimestamp's evaluation assumes a shifted timestamp, the 
result is wrong.

  was:
This produces the expected answer:
{noformat}
df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" 
).as("dt")).show
+---+
| dt|
+---+
|2018-03-13 07:18:23|
+---+
{noformat}
However, the equivalent UTC input (but with an explicit timezone) produces a 
wrong answer:
{noformat}
df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" 
).as("dt")).show
+---+
| dt|
+---+
|2018-03-13 00:18:23|
+---+
{noformat}
Additionally, the equivalent Unix time (1520921903, which is also 
"2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer:
{noformat}
df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" 
).as("dt")).show
+---+
| dt|
+---+
|2018-03-13 00:18:23|
+---+
{noformat}
Digging a little into the code, I see the following:

There is sometimes a mismatch in expectations between the (string => timestamp) 
cast and FromUTCTimestamp. Also, since the FromUTCTimestamp expression never 
sees the actual input string (the cast "intercepts" the input and converts it 
to a long timestamp before FromUTCTimestamp uses the value), FromUTCTimestamp 
cannot reject any input value that would 

[jira] [Updated] (SPARK-23715) from_utc_timestamp returns incorrect results for some UTC date/time values

2018-03-17 Thread Bruce Robbins (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-23715:
--
Description: 
This produces the expected answer:
{noformat}
df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" 
).as("dt")).show
+---+
| dt|
+---+
|2018-03-13 07:18:23|
+---+
{noformat}
However, the equivalent UTC input (but with an explicit timezone) produces a 
wrong answer:
{noformat}
df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" 
).as("dt")).show
+---+
| dt|
+---+
|2018-03-13 00:18:23|
+---+
{noformat}
Additionally, the equivalent Unix time (1520921903, which is also 
"2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer:
{noformat}
df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" 
).as("dt")).show
+---+
| dt|
+---+
|2018-03-13 00:18:23|
+---+
{noformat}
Digging a little into the code, I see the following:

There is sometimes a mismatch in expectations between the (string => timestamp) 
cast and FromUTCTimestamp. Also, since the FromUTCTimestamp expression never 
sees the actual input string (the cast "intercepts" the input and converts it 
to a long timestamp before FromUTCTimestamp uses the value), FromUTCTimestamp 
cannot reject any input value that would exercise this mismatch in expectations.

There is a similar mismatch in expectations in the (integer => timestamp) cast 
and FromUTCTimestamp. As a result, Unix time input almost always produces 
incorrect output.
h3. When things work as expected for String input:

When from_utc_timestamp is passed a string time value with no time zone, 
DateTimeUtils.stringToTimestamp (called from a Cast expression) treats the 
datetime string as though it's in the user's local time zone. Because 
DateTimeUtils.stringToTimestamp is a general function, this is reasonable.

As a result, FromUTCTimestamp's input is a timestamp shifted by the local time 
zone's offset. FromUTCTimestamp assumes this (or more accurately, a utility 
function called by FromUTCTimestamp assumes this), so the first thing it does 
is reverse-shift it to get back the correct value. Now that the long value has 
been shifted back to the correct timestamp value, it can process it (by 
shifting it again based on the specified time zone).
h3. When things go wrong with String input:

When from_utc_timestamp is passed a string time value with an explicit time 
zone, stringToTimestamp honors that timezone and ignores the local time zone. 
stringToTimestamp does not shift the timestamp by the local timezone, but by 
the timezone specified on the datetime string.

Unfortunately, FromUTCTimestamp, which has no insight into the actual input or 
the conversion, still assumes the timestamp is shifted by the local time zone. 
So it reverse-shifts the long value by the local time zone's offset, which 
produces an incorrect timestamp (except in the case where the input datetime 
string just happened to have an explicit timezone that matches the local 
timezone). FromUTCTimestamp then uses this incorrect value for further 
processing.
h3. When things go wrong for Unix time input:

The cast in this case simply multiplies the integer by 1,000,000 (converting 
seconds to microseconds). The cast does not shift the resulting timestamp by 
the local time zone's offset.

Again, because FromUTCTimestamp's evaluation assumes a shifted timestamp, the 
result is wrong.

  was:
This produces the expected answer:
{noformat}
df.select(from_utc_timestamp(lit("2018-03-13T06:18:23"), "GMT+1" 
).as("dt")).show
+---+
| dt|
+---+
|2018-03-13 07:18:23|
+---+
{noformat}
However, the equivalent UTC input (but with an explicit timezone) produces a 
wrong answer:
{noformat}
df.select(from_utc_timestamp(lit("2018-03-13T06:18:23+00:00"), "GMT+1" 
).as("dt")).show
+---+
| dt|
+---+
|2018-03-13 00:18:23|
+---+
{noformat}
Additionally, the equivalent Unix time (1520921903, which is also 
"2018-03-13T06:18:23" in the UTC time zone) produces the same wrong answer:
{noformat}
df.select(from_utc_timestamp(to_timestamp(lit(1520921903)), "GMT+1" 
).as("dt")).show
+---+
| dt|
+---+
|2018-03-13 00:18:23|
+---+
{noformat}
Digging a little into the code, I see the following:

There is sometimes a mismatch in expectations between the (string => timestamp) 
cast and FromUTCTimestamp. Also, since the FromUTCTimestamp expression never 
sees the actual input string (the cast "intercepts" the input and converts it 
to a long timestamp before FromUTCTimestamp uses the value), FromUTCTimestamp 
cannot reject any input value that would 

[jira] [Assigned] (SPARK-23713) Clean-up UnsafeWriter classes

2018-03-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23713:


Assignee: (was: Apache Spark)

> Clean-up UnsafeWriter classes
> -
>
> Key: SPARK-23713
> URL: https://issues.apache.org/jira/browse/SPARK-23713
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Herman van Hovell
>Priority: Major
>
> The current UnsafeWriter classes and their consumers have a few (small) issues:
>  * They contain quite a bit of duplication.
>  * The buffer holder is semi-internal/external. We should make it internal.
>  * The setOffsetAndSize function can be improved.






[jira] [Assigned] (SPARK-23713) Clean-up UnsafeWriter classes

2018-03-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23713:


Assignee: Apache Spark

> Clean-up UnsafeWriter classes
> -
>
> Key: SPARK-23713
> URL: https://issues.apache.org/jira/browse/SPARK-23713
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>Priority: Major
>
> The current UnsafeWriter classes and their consumers have a few (small) issues:
>  * They contain quite a bit of duplication.
>  * The buffer holder is semi-internal/external. We should make it internal.
>  * The setOffsetAndSize function can be improved.






[jira] [Commented] (SPARK-23713) Clean-up UnsafeWriter classes

2018-03-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403719#comment-16403719
 ] 

Apache Spark commented on SPARK-23713:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/20850

> Clean-up UnsafeWriter classes
> -
>
> Key: SPARK-23713
> URL: https://issues.apache.org/jira/browse/SPARK-23713
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Herman van Hovell
>Priority: Major
>
> The current UnsafeWriter classes and their consumers have a few (small) issues:
>  * They contain quite a bit of duplication.
>  * The buffer holder is semi-internal/external. We should make it internal.
>  * The setOffsetAndSize function can be improved.






[jira] [Commented] (SPARK-19767) API Doc pages for Streaming with Kafka 0.10 not current

2018-03-17 Thread Nick Afshartous (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403702#comment-16403702
 ] 

Nick Afshartous commented on SPARK-19767:
-

Sounds good [~c...@koeninger.org].  I'll close the current PR and submit a new 
one. 

> API Doc pages for Streaming with Kafka 0.10 not current
> ---
>
> Key: SPARK-19767
> URL: https://issues.apache.org/jira/browse/SPARK-19767
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Nick Afshartous
>Priority: Minor
>
> The API docs linked from the Spark Kafka 0.10 Integration page are not 
> current.  For instance, on the page
>https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
> the code examples show the new API (i.e. class ConsumerStrategies).  However, 
> following the links
> API Docs --> (Scala | Java)
> leads to API pages that do not have the class ConsumerStrategies. The API doc 
> package names also have {code}streaming.kafka{code} as opposed to 
> {code}streaming.kafka10{code} 
> as in the code examples on streaming-kafka-0-10-integration.html.  






[jira] [Commented] (SPARK-1485) Implement AllReduce

2018-03-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403463#comment-16403463
 ] 

Apache Spark commented on SPARK-1485:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/506

> Implement AllReduce
> ---
>
> Key: SPARK-1485
> URL: https://issues.apache.org/jira/browse/SPARK-1485
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> The current implementations of machine learning algorithms rely on the driver 
> for some computation and data broadcasting. This will create a bottleneck at 
> the driver for both computation and communication, especially in multi-model 
> training. An efficient implementation of AllReduce (or AllAggregate) can help 
> free the driver:
> allReduce(RDD[T], (T, T) => T): RDD[T]
> This JIRA is created for discussing how to implement AllReduce efficiently 
> and possible alternatives.
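
For context, a naive driver-based version of that signature is straightforward 
(sketch below); it is exactly this driver round-trip that an efficient 
AllReduce would avoid:

{noformat}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Naive allReduce: reduce to the driver, then broadcast the result back so
// every partition sees it. Correct, but the driver stays the bottleneck this
// JIRA wants to eliminate.
def naiveAllReduce[T: ClassTag](rdd: RDD[T], f: (T, T) => T): RDD[T] = {
  val reduced = rdd.reduce(f)
  val bc = rdd.sparkContext.broadcast(reduced)
  rdd.mapPartitions(_ => Iterator(bc.value))
}
{noformat}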






[jira] [Created] (SPARK-23725) Improve Hadoop's LineReader to support charsets different from UTF-8

2018-03-17 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-23725:
--

 Summary: Improve Hadoop's LineReader to support charsets different 
from UTF-8
 Key: SPARK-23725
 URL: https://issues.apache.org/jira/browse/SPARK-23725
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


If the record delimiter is not specified, Hadoop LineReader splits 
lines/records by '\n', '\r' and/or '\r\n' in UTF-8 encoding: 
[https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L173-L177]
 . The implementation should be improved to support any charset.
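
A small illustration (plain Scala, independent of Hadoop) of why scanning for 
the UTF-8 line-ending bytes breaks on other charsets: '\n' is not the single 
byte 0x0A in every encoding.

{noformat}
import java.nio.charset.StandardCharsets

// UTF-8: a newline is the single byte 0x0A.
"\n".getBytes(StandardCharsets.UTF_8).map(b => f"0x$b%02X").mkString(" ")    // 0x0A
// UTF-16LE: the same character is the two bytes 0x0A 0x00, so a reader that
// looks for the lone byte 0x0A splits a record in the middle of a code unit.
"\n".getBytes(StandardCharsets.UTF_16LE).map(b => f"0x$b%02X").mkString(" ") // 0x0A 0x00
{noformat}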






[jira] [Created] (SPARK-23724) Custom record separator for jsons in charsets different from UTF-8

2018-03-17 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-23724:
--

 Summary: Custom record separator for jsons in charsets different 
from UTF-8
 Key: SPARK-23724
 URL: https://issues.apache.org/jira/browse/SPARK-23724
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


The option should define a sequence of bytes between two consecutive JSON 
records. Currently the separator is detected automatically by the Hadoop library:
 
[https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L185-L254]
 
The method is able to recognize only *\r*, *\n* and *\r\n* in UTF-8 encoding. It 
doesn't work when the encoding of the input stream is different from UTF-8. 
The option should allow users to explicitly set the separator/delimiter of JSON 
records.
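
To make the proposal concrete, usage might look roughly like the sketch below; 
the option name and value are purely illustrative, not a decided API:

{noformat}
// Hypothetical: explicitly set the record separator for JSON records encoded
// in UTF-16LE, where the automatic \r / \n / \r\n detection (which assumes
// UTF-8) does not apply.
val df = spark.read
  .option("recordSeparator", "\n")   // illustrative option name and value
  .option("charset", "UTF-16LE")     // see SPARK-23723 for the charset option
  .json("/tmp/records-utf16le.json")
{noformat}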






[jira] [Assigned] (SPARK-23723) New charset option for json datasource

2018-03-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23723:


Assignee: Apache Spark

> New charset option for json datasource
> --
>
> Key: SPARK-23723
> URL: https://issues.apache.org/jira/browse/SPARK-23723
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Currently JSON Reader can read json files in different charset/encodings. The 
> JSON Reader uses the jackson-json library to automatically detect the charset 
> of input text/stream. Here you can see the method which detects encoding: 
> [https://github.com/FasterXML/jackson-core/blob/master/src/main/java/com/fasterxml/jackson/core/json/ByteSourceJsonBootstrapper.java#L111-L174]
>  
> The detectEncoding method checks the BOM 
> ([https://en.wikipedia.org/wiki/Byte_order_mark]) at the beginning of a text. 
> The BOM can be in the file but it is not mandatory. If it is not present, the 
> auto-detection mechanism can select the wrong charset and, as a consequence, 
> the user cannot read the JSON file. *The proposed option will allow users to 
> bypass the auto-detection mechanism and set the charset explicitly.*
>  
> The charset option is already exposed as a CSV option: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L87-L88]
>  . I propose to add the same option for JSON.
>  
> Regarding the JSON Writer, *the charset option will give the user the 
> opportunity* to read JSON files in a charset different from UTF-8, modify the 
> dataset, and *write the results back to JSON files in the original encoding.* 
> At the moment this is not possible because the result can be saved only in 
> UTF-8.
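
For comparison, the CSV option referenced above can already be used today 
(paths are made up); the commented JSON line is only a sketch of the proposed 
equivalent, which does not exist yet:

{noformat}
// Existing behaviour: the CSV reader accepts an explicit charset/encoding.
val csvDf = spark.read
  .option("charset", "ISO-8859-1")
  .option("header", "true")
  .csv("/tmp/data-latin1.csv")

// Proposed (hypothetical) JSON equivalent from this ticket:
// val jsonDf = spark.read.option("charset", "UTF-16BE").json("/tmp/data-utf16be.json")
{noformat}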






[jira] [Commented] (SPARK-23723) New charset option for json datasource

2018-03-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403448#comment-16403448
 ] 

Apache Spark commented on SPARK-23723:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/20849

> New charset option for json datasource
> --
>
> Key: SPARK-23723
> URL: https://issues.apache.org/jira/browse/SPARK-23723
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently the JSON reader can read JSON files in different charsets/encodings. 
> It uses the jackson-core library to automatically detect the charset of the 
> input text/stream; the detection method is here: 
> [https://github.com/FasterXML/jackson-core/blob/master/src/main/java/com/fasterxml/jackson/core/json/ByteSourceJsonBootstrapper.java#L111-L174]
>  
> The detectEncoding method checks for a BOM 
> ([https://en.wikipedia.org/wiki/Byte_order_mark]) at the beginning of the text. 
> A BOM may be present in the file, but it is not mandatory. If it is absent, the 
> auto-detection mechanism can select the wrong charset, and as a consequence the 
> user cannot read the JSON file. *The proposed option will allow users to bypass 
> the auto-detection mechanism and set the charset explicitly.*
>  
> The charset option is already exposed for CSV: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L87-L88]
> I propose to add the same option for JSON.
>  
> Regarding the JSON writer, *the charset option will give the user the 
> opportunity* to read JSON files in a charset other than UTF-8, modify the 
> dataset, and *write the results back to JSON files in the original encoding.* 
> At the moment this is not possible because the result can be saved only in 
> UTF-8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23723) New charset option for json datasource

2018-03-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23723:


Assignee: (was: Apache Spark)

> New charset option for json datasource
> --
>
> Key: SPARK-23723
> URL: https://issues.apache.org/jira/browse/SPARK-23723
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently the JSON reader can read JSON files in different charsets/encodings. 
> It uses the jackson-core library to automatically detect the charset of the 
> input text/stream; the detection method is here: 
> [https://github.com/FasterXML/jackson-core/blob/master/src/main/java/com/fasterxml/jackson/core/json/ByteSourceJsonBootstrapper.java#L111-L174]
>  
> The detectEncoding method checks for a BOM 
> ([https://en.wikipedia.org/wiki/Byte_order_mark]) at the beginning of the text. 
> A BOM may be present in the file, but it is not mandatory. If it is absent, the 
> auto-detection mechanism can select the wrong charset, and as a consequence the 
> user cannot read the JSON file. *The proposed option will allow users to bypass 
> the auto-detection mechanism and set the charset explicitly.*
>  
> The charset option is already exposed for CSV: 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L87-L88]
> I propose to add the same option for JSON.
>  
> Regarding the JSON writer, *the charset option will give the user the 
> opportunity* to read JSON files in a charset other than UTF-8, modify the 
> dataset, and *write the results back to JSON files in the original encoding.* 
> At the moment this is not possible because the result can be saved only in 
> UTF-8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23723) New charset option for json datasource

2018-03-17 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-23723:
--

 Summary: New charset option for json datasource
 Key: SPARK-23723
 URL: https://issues.apache.org/jira/browse/SPARK-23723
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


Currently the JSON reader can read JSON files in different charsets/encodings. 
It uses the jackson-core library to automatically detect the charset of the 
input text/stream; the detection method is here: 
[https://github.com/FasterXML/jackson-core/blob/master/src/main/java/com/fasterxml/jackson/core/json/ByteSourceJsonBootstrapper.java#L111-L174]
 
The detectEncoding method checks for a BOM 
([https://en.wikipedia.org/wiki/Byte_order_mark]) at the beginning of the text. 
A BOM may be present in the file, but it is not mandatory. If it is absent, the 
auto-detection mechanism can select the wrong charset, and as a consequence the 
user cannot read the JSON file. *The proposed option will allow users to bypass 
the auto-detection mechanism and set the charset explicitly.*
 
The charset option is already exposed for CSV: 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L87-L88]
I propose to add the same option for JSON.
 
Regarding the JSON writer, *the charset option will give the user the 
opportunity* to read JSON files in a charset other than UTF-8, modify the 
dataset, and *write the results back to JSON files in the original encoding.* 
At the moment this is not possible because the result can be saved only in UTF-8.
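
A minimal sketch of the proposed round trip, assuming an illustrative option 
name {{charset}} for JSON (mirroring the existing CSV option; the final name 
depends on the implementation):

{code:scala}
import org.apache.spark.sql.SparkSession

object JsonCharsetOptionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("json-charset-option")
      .master("local[*]")
      .getOrCreate()

    // Name the charset explicitly instead of relying on BOM-based detection.
    val df = spark.read
      .option("charset", "UTF-16LE")
      .json("/data/in/utf16le.json")

    // ... transformations on df ...

    // Write the result back in the original (non-UTF-8) encoding.
    df.write
      .option("charset", "UTF-16LE")
      .json("/data/out/utf16le")

    spark.stop()
  }
}
{code}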



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23710) Upgrade Hive to 2.3.2

2018-03-17 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-23710:

Description: 
h1. Main changes
 * Maven dependency changes:
 hive.version from {{1.2.1.spark2}} to {{2.3.2}}, and change {{hive.classifier}} 
to {{core}}
 calcite.version from {{1.2.0-incubating}} to {{1.10.0}}
 datanucleus-core.version from {{3.2.10}} to {{4.1.17}}
 remove {{orc.classifier}}, which means ORC uses {{hive.storage.api}}; see 
ORC-174
 add new dependencies {{avatica}} and {{hive.storage.api}}

 * ORC compatibility changes:
 OrcColumnVector.java, OrcColumnarBatchReader.java, OrcDeserializer.scala, 
OrcFilters.scala, OrcSerializer.scala, OrcFilterSuite.scala

 * hive-thriftserver Java file updates:
 update {{sql/hive-thriftserver/if/TCLIService.thrift}} to Hive 2.3.2
 update {{sql/hive-thriftserver/src/main/java/org/apache/hive/service/*}} to 
Hive 2.3.2

 * Test suites to update:
||TestSuite||Reason||
|StatisticsSuite|HIVE-16098|
|SessionCatalogSuite|Similar to [VersionsSuite.scala#L427|#L427]|
|CliSuite, HiveThriftServer2Suites, HiveSparkSubmitSuite, HiveQuerySuite, 
SQLQuerySuite|Update hive-hcatalog-core-0.13.1.jar to 
hive-hcatalog-core-2.3.2.jar|
|SparkExecuteStatementOperationSuite|Interface changed from 
org.apache.hive.service.cli.Type.NULL_TYPE to 
org.apache.hadoop.hive.serde2.thrift.Type.NULL_TYPE|
|ClasspathDependenciesSuite|org.apache.hive.com.esotericsoftware.kryo.Kryo 
changed to com.esotericsoftware.kryo.Kryo|
|HiveMetastoreCatalogSuite|Result format changed from Seq("1.1\t1", "2.1\t2") 
to Seq("1.100\t1", "2.100\t2")|
|HiveOrcFilterSuite|Result format changed|
|HiveDDLSuite|Remove $ (This change needs to be reconsidered)|
|HiveExternalCatalogVersionsSuite| java.lang.ClassCastException: 
org.datanucleus.identity.DatastoreIdImpl cannot be cast to 
org.datanucleus.identity.OID|

 * Other changes:
Turn off Hive schema verification: 
[HiveClientImpl.scala#L251|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L251]
 and 
[HiveExternalCatalog.scala#L58|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L58]
Update 
[IsolatedClientLoader.scala#L189-L192|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L189-L192]
Because Hive 2.3.2's {{org.apache.hadoop.hive.ql.metadata.Hive}} cannot connect 
to a Hive 1.x metastore, we should use {{HiveMetaStoreClient.getDelegationToken}} 
instead of {{Hive.getDelegationToken}} and update {{HiveClientImpl.toHiveTable}} 
(a minimal sketch follows below).

All changes can be found at 
[PR-20659|https://github.com/apache/spark/pull/20659].
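
The following is a minimal sketch, not the actual patch, of fetching a metastore 
delegation token through {{HiveMetaStoreClient}} rather than 
{{org.apache.hadoop.hive.ql.metadata.Hive}}; the exact wiring inside 
{{HiveClientImpl}} differs, and the helper name below is illustrative:

{code:scala}
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient

object DelegationTokenSketch {
  // Illustrative helper: obtain the delegation token string directly from the
  // metastore client, avoiding Hive 2.3.2's Hive class (which cannot talk to
  // a Hive 1.x metastore).
  def fetchDelegationToken(conf: HiveConf, owner: String, renewer: String): String = {
    val client = new HiveMetaStoreClient(conf)
    try {
      client.getDelegationToken(owner, renewer)
    } finally {
      client.close()
    }
  }
}
{code}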

  was:
h1. Main changes
 * Maven dependency changes:
 hive.version from {{1.2.1.spark2}} to {{2.3.2}}, and change {{hive.classifier}} 
to {{core}}
 calcite.version from {{1.2.0-incubating}} to {{1.10.0}}
 datanucleus-core.version from {{3.2.10}} to {{4.1.17}}
 remove {{orc.classifier}}, which means ORC uses {{hive.storage.api}}; see 
ORC-174
 add new dependencies {{avatica}} and {{hive.storage.api}}

 * ORC compatibility changes:
 OrcColumnVector.java, OrcColumnarBatchReader.java, OrcDeserializer.scala, 
OrcFilters.scala, OrcSerializer.scala, OrcFilterSuite.scala

 * hive-thriftserver Java file updates:
 update {{sql/hive-thriftserver/if/TCLIService.thrift}} to Hive 2.3.2
 update {{sql/hive-thriftserver/src/main/java/org/apache/hive/service/*}} to 
Hive 2.3.2

 * Test suites to update:
||TestSuite||Reason||
|StatisticsSuite|HIVE-16098|
|SessionCatalogSuite|Similar to [VersionsSuite.scala#L427|#L427]|
|CliSuite, HiveThriftServer2Suites, HiveSparkSubmitSuite, HiveQuerySuite, 
SQLQuerySuite|Update hive-hcatalog-core-0.13.1.jar to 
hive-hcatalog-core-2.3.2.jar|
|SparkExecuteStatementOperationSuite|Interface changed from 
org.apache.hive.service.cli.Type.NULL_TYPE to 
org.apache.hadoop.hive.serde2.thrift.Type.NULL_TYPE|
|ClasspathDependenciesSuite|org.apache.hive.com.esotericsoftware.kryo.Kryo 
changed to com.esotericsoftware.kryo.Kryo|
|HiveMetastoreCatalogSuite|Result format changed from Seq("1.1\t1", "2.1\t2") 
to Seq("1.100\t1", "2.100\t2")|
|HiveOrcFilterSuite|Result format changed|
|HiveDDLSuite|Remove $ (This change needs to be reconsidered)|

 * Other changes:
Turn off Hive schema verification: 
[HiveClientImpl.scala#L251|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L251]
 and 
[HiveExternalCatalog.scala#L58|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L58]
Update