[jira] [Updated] (SPARK-23518) Avoid metastore access when users only want to read and store data frames

2018-02-26 Thread Feng Liu (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Liu updated SPARK-23518:
-
Description: https://issues.apache.org/jira/browse/SPARK-21732 added a patch 
that allows a Spark session to be created while the Hive metastore server is 
down. However, it does not allow running any commands with that session, so 
users still cannot read or write data frames while the metastore server is 
down.  (was: This is to followup 
https://issues.apache.org/jira/browse/SPARK-21732, which allows a spark session 
to be created when the hive metastore server is down. However, it does not 
allow running any commands with the spark session. )

> Avoid metastore access when users only want to read and store data frames
> -
>
> Key: SPARK-23518
> URL: https://issues.apache.org/jira/browse/SPARK-23518
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Feng Liu
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-21732 added a patch that allows a 
> Spark session to be created while the Hive metastore server is down. However, 
> it does not allow running any commands with that session, so users still 
> cannot read or write data frames while the metastore server is down.
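
For context, a minimal sketch of the kind of purely file-based workload that 
should succeed without any metastore access (paths are hypothetical):

{code:java}
// No table or database lookup is logically required here, so a down Hive
// metastore server should not prevent these commands from running.
val df = spark.read.parquet("/tmp/metastore_free_input")
df.write.mode("overwrite").parquet("/tmp/metastore_free_output")
{code}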






[jira] [Updated] (SPARK-23518) Avoid metastore access when users only want to read and store data frames

2018-02-26 Thread Feng Liu (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Liu updated SPARK-23518:
-
Summary: Avoid metastore access when users only want to read and store data 
frames  (was: Completely remove metastore access if the query is not using 
tables)

> Avoid metastore access when users only want to read and store data frames
> -
>
> Key: SPARK-23518
> URL: https://issues.apache.org/jira/browse/SPARK-23518
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Feng Liu
>Priority: Major
>
> This is a follow-up to https://issues.apache.org/jira/browse/SPARK-21732, 
> which allows a Spark session to be created when the Hive metastore server is 
> down. However, it does not allow running any commands with the Spark session.






[jira] [Updated] (SPARK-23518) Completely remove metastore access if the query is not using tables

2018-02-26 Thread Feng Liu (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Feng Liu updated SPARK-23518:
-
Description: This is a follow-up to 
https://issues.apache.org/jira/browse/SPARK-21732, which allows a Spark session 
to be created when the Hive metastore server is down. However, it does not 
allow running any commands with the Spark session.  (was: This is to followup 
https://issues.apache.org/jira/browse/SPARK-21732, which allows a spark session 
to be created when the hive metastore server is down. However, it does not 
allow running any commands with spark sessions. )

> Completely remove metastore access if the query is not using tables
> ---
>
> Key: SPARK-23518
> URL: https://issues.apache.org/jira/browse/SPARK-23518
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Feng Liu
>Priority: Major
>
> This is a follow-up to https://issues.apache.org/jira/browse/SPARK-21732, 
> which allows a Spark session to be created when the Hive metastore server is 
> down. However, it does not allow running any commands with the Spark session.






[jira] [Created] (SPARK-23518) Completely remove metastore access if the query is not using tables

2018-02-26 Thread Feng Liu (JIRA)
Feng Liu created SPARK-23518:


 Summary: Completely remove metastore access if the query is not 
using tables
 Key: SPARK-23518
 URL: https://issues.apache.org/jira/browse/SPARK-23518
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Feng Liu


This is a follow-up to https://issues.apache.org/jira/browse/SPARK-21732, which 
allows a Spark session to be created when the Hive metastore server is down. 
However, it does not allow running any commands with Spark sessions.






[jira] [Created] (SPARK-23379) remove redundant metastore access if the current database name is the same

2018-02-09 Thread Feng Liu (JIRA)
Feng Liu created SPARK-23379:


 Summary: remove redundant metastore access if the current database 
name is the same
 Key: SPARK-23379
 URL: https://issues.apache.org/jira/browse/SPARK-23379
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Feng Liu


We should be able to save one metastore access when the target database name is 
the same as the current database:

https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L295
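
A sketch of the proposed guard, as a fragment of `setCurrentDatabase` in 
HiveClientImpl (the early-return shape is the suggestion; surrounding names 
follow the linked code):

{code:java}
override def setCurrentDatabase(databaseName: String): Unit = withHiveState {
  // Only touch the metastore when the database name actually changes.
  if (state.getCurrentDatabase != databaseName) {
    if (databaseExists(databaseName)) {
      state.setCurrentDatabase(databaseName)
    } else {
      throw new NoSuchDatabaseException(databaseName)
    }
  }
}
{code}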






[jira] [Created] (SPARK-23378) move setCurrentDatabase from HiveExternalCatalog to HiveClientImpl

2018-02-09 Thread Feng Liu (JIRA)
Feng Liu created SPARK-23378:


 Summary: move setCurrentDatabase from HiveExternalCatalog to 
HiveClientImpl
 Key: SPARK-23378
 URL: https://issues.apache.org/jira/browse/SPARK-23378
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Feng Liu


Conceptually, no method of HiveExternalCatalog other than `setCurrentDatabase` 
should change the `currentDatabase` in the Hive session state. We can enforce 
this rule by removing the usages of `setCurrentDatabase` from 
HiveExternalCatalog.
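
One possible shape, sketched as a hypothetical helper inside HiveClientImpl 
(the name `withCurrentDatabase` is illustrative, not existing code):

{code:java}
// Run a block against a given database and restore the previous one, so no
// caller outside HiveClientImpl ever mutates the session's currentDatabase.
private def withCurrentDatabase[T](db: String)(body: => T): T = {
  val original = state.getCurrentDatabase
  state.setCurrentDatabase(db)
  try body finally state.setCurrentDatabase(original)
}
{code}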






[jira] [Created] (SPARK-23259) Clean up legacy code around hive external catalog

2018-01-29 Thread Feng Liu (JIRA)
Feng Liu created SPARK-23259:


 Summary: Clean up legacy code around hive external catalog
 Key: SPARK-23259
 URL: https://issues.apache.org/jira/browse/SPARK-23259
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Feng Liu


Some legacy code around the Hive metastore catalog needs to be removed for 
further code improvement:
 # In HiveExternalCatalog: the `withClient` wrapper is not necessary for the 
private method `getRawTable`.
 # In HiveClientImpl: the `runSqlHive()` statement is not necessary in the 
`addJar` method once the jar has been added to the single class loader.
 # In HiveClientImpl: there is some redundant code shared by the `tableExists` 
and `getTableOption` methods (a possible de-duplication is sketched below).
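
For item 3, one possible de-duplication, as a HiveClientImpl fragment 
(illustrative, not the final patch):

{code:java}
// Let tableExists reuse the lookup in getTableOption instead of duplicating
// the raw-table retrieval logic.
override def tableExists(dbName: String, tableName: String): Boolean = withHiveState {
  getTableOption(dbName, tableName).isDefined
}
{code}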






[jira] [Commented] (SPARK-22891) NullPointerException when use udf

2017-12-28 Thread Feng Liu (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-22891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305881#comment-16305881 ]

Feng Liu commented on SPARK-22891:
--

A side note: if we don't want to merge 
https://github.com/apache/spark/pull/20029, we should make the creation of the 
Hive client lazy inside the HiveSessionResourceLoader: 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionStateBuilder.scala#L123 
Hive client creation is known to be expensive, so it does not make sense to 
materialize a client that is never used.
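
A rough sketch of that lazy shape (constructor parameters are illustrative):

{code:java}
// Take a builder instead of a materialized client, and defer the expensive
// Hive client creation until a resource is actually added.
class HiveSessionResourceLoader(session: SparkSession, clientBuilder: () => HiveClient)
  extends SessionResourceLoader(session) {
  private lazy val client = clientBuilder()
  override def addJar(path: String): Unit = {
    client.addJar(path)
    super.addJar(path)
  }
}
{code}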

> NullPointerException when use udf
> -
>
> Key: SPARK-22891
> URL: https://issues.apache.org/jira/browse/SPARK-22891
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1
> Environment: hadoop 2.7.2
>Reporter: gaoyang
>Priority: Minor
>
> In my application, I use multiple threads. Each thread has a SparkSession and 
> uses SparkSession.sqlContext.udf.register to register my UDF. Sometimes an 
> exception like this is thrown:
> {code:java}
> java.lang.IllegalArgumentException: Error while instantiating 
> 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
>   at 
> org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1062)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:137)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:136)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:136)
>   at 
> org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:133)
>   at org.apache.spark.sql.SparkSession.udf(SparkSession.scala:207)
>   at org.apache.spark.sql.SQLContext.udf(SQLContext.scala:203)
>   at 
> com.game.data.stat.clusterTask.tools.standard.IpConverterRegister$.run(IpConverterRegister.scala:63)
>   at 
>   ... 20 more
> Caused by: java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.newSession(HiveClientImpl.scala:789)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.newSession(HiveClientImpl.scala:79)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.resourceLoader$lzycompute(HiveSessionStateBuilder.scala:45)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.resourceLoader(HiveSessionStateBuilder.scala:44)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:61)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:35)
>   at 
> org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:289)
>   at 
> org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1059)
>   ... 20 more
> Caused by: java.lang.RuntimeException: 
> org.apache.hadoop.hive.ql.metadata.HiveException
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.setupAuth(SessionState.java:744)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.getAuthenticator(SessionState.java:1391)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.(HiveClientImpl.scala:210)
>   ... 34 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.setAuthorizerV2Config(SessionState.java:769)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.setupAuth(SessionState.java:736)
>   ... 36 more
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.isCompatibleWith(HiveMetaStoreClient.java:287)
>   at sun.reflect.GeneratedMethodAccessor45.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
>   at 

[jira] [Comment Edited] (SPARK-22891) NullPointerException when use udf

2017-12-28 Thread Feng Liu (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-22891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305862#comment-16305862 ]

Feng Liu edited comment on SPARK-22891 at 12/29/17 12:56 AM:
-

This is definitely caused by the race from 
https://issues.apache.org/jira/browse/HIVE-11935. 

In Spark 2.1, the `metadataHive` is created lazily, only when `addJar` is 
called 
(https://github.com/apache/spark/blob/branch-2.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionState.scala#L40),
 so the race can only be triggered by concurrent `addJar` calls (hard to 
imagine in practice).

In Spark 2.2, the `metadataHive` creation is tied to the `resourceLoader` 
creation (see the stack trace), so the race can now be triggered by new Spark 
session creation. In https://github.com/apache/spark/pull/20029, I argue that 
it is safe to remove the new Hive client creation there. Besides that change, I 
think we should also make the Hive client creation thread safe: 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L251
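
A minimal sketch of the thread-safety part (`instantiateHiveClient()` is a 
hypothetical stand-in for the reflective construction at the linked line):

{code:java}
// Hold the loader's own lock for the whole construction, so concurrent
// session creation cannot race on the shared isolated-classloader state.
private[hive] def createClient(): HiveClient = synchronized {
  instantiateHiveClient()
}
{code}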




was (Author: liufeng...@gmail.com):
This is definitely caused by the race from 
https://issues.apache.org/jira/browse/HIVE-11935. 

In spark 2.1, spark creates the `metadataHive` lazily until 
`addJar`(https://github.com/apache/spark/blob/branch-2.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionState.scala#L40),
 so this can only be triggered by concurrent `addJar` (can't imagine this will 
happen in practice)

In spark 2.2, the `metadataHive` creation is tied to the `resourceLoader` 
creation (see the stack trace), so it starts to be triggered by new spark 
session creation. In https://github.com/apache/spark/pull/20029, I'm trying to 
make an argument that it is safe to remove the new hive client creation. 
Besides this change, I think should also make the hive client creation thread 
safe: 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L251




[jira] [Comment Edited] (SPARK-22891) NullPointerException when use udf

2017-12-28 Thread Feng Liu (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-22891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305862#comment-16305862 ]

Feng Liu edited comment on SPARK-22891 at 12/29/17 12:49 AM:
-

This is definitely caused by the race from 
https://issues.apache.org/jira/browse/HIVE-11935. 

In spark 2.1, spark creates the `metadataHive` lazily until 
`addJar`(https://github.com/apache/spark/blob/branch-2.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionState.scala#L40),
 so this can only be triggered by concurrent `addJar` (can't imagine this will 
happen in practice)

In spark 2.2, the `metadataHive` creation is tied to the `resourceLoader` 
creation (see the stack trace), so it starts to be triggered by new spark 
session creation. In https://github.com/apache/spark/pull/20029, I'm trying to 
make an argument that it is safe to remove the new hive client creation. 
Besides this change, I think we should also make the hive client creation 
thread 
safe: 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L251




was (Author: liufeng...@gmail.com):
This is definitely caused by the race from 
https://issues.apache.org/jira/browse/HIVE-11935. 

In spark 2.1, spark creates the `metadataHive` lazily until 
`addJar`(https://github.com/apache/spark/blob/branch-2.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionState.scala#L40),
 so this can only be triggered by concurrent `addJar` (can't imagine this will 
happen in practice)

In spark 2.2, the `metadataHive` creation is tied to the `resourceLoader` (see 
the stack trace), so it starts to be triggered by new spark session creation. 
In https://github.com/apache/spark/pull/20029, I'm trying to make an argument 
that it is safe to remove the new hive client creation. Besides change, I think 
should also make the hive client creation thread safe: 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L251




[jira] [Commented] (SPARK-22891) NullPointerException when use udf

2017-12-28 Thread Feng Liu (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-22891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305862#comment-16305862 ]

Feng Liu commented on SPARK-22891:
--

This is definitely caused by the race from 
https://issues.apache.org/jira/browse/HIVE-11935. 

In spark 2.1, spark creates the `metadataHive` lazily until 
`addJar`(https://github.com/apache/spark/blob/branch-2.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionState.scala#L40),
 so this can only be triggered by concurrent `addJar` (can't imagine this will 
happen in practice)

In spark 2.2, the `metadataHive` creation is tied to the `resourceLoader` (see 
the stack trace), so it starts to be triggered by new spark session creation. 
In https://github.com/apache/spark/pull/20029, I'm trying to make an argument 
that it is safe to remove the new hive client creation. Besides this change, I 
think we should also make the hive client creation thread safe: 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L251




[jira] [Created] (SPARK-22916) shouldn't bias towards build right if user does not specify

2017-12-27 Thread Feng Liu (JIRA)
Feng Liu created SPARK-22916:


 Summary: shouldn't bias towards build right if user does not 
specify
 Key: SPARK-22916
 URL: https://issues.apache.org/jira/browse/SPARK-22916
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Feng Liu


This is an issue very similar to SPARK-22489. When there are no broadcast 
hints, the current Spark strategies prefer to build on the right side, without 
considering the sizes of the two sides. To reproduce:

{code:java}
import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec

spark.createDataFrame(Seq((1, "4"), (2, "2")))
  .toDF("key", "value").createTempView("table1")
spark.createDataFrame(Seq((1, "1"), (2, "2"), (3, "3")))
  .toDF("key", "value").createTempView("table2")

val bl = sql("SELECT * FROM table1 t1 JOIN table2 t2 ON t1.key = t2.key")
  .queryExecution.executedPlan

// Inspect which side the planner decided to broadcast.
bl.collect { case j: BroadcastHashJoinExec => j.buildSide }.foreach(println)
{code}

The plan will broadcast the right side (`t2`), even though it is the larger 
one; a size-aware rule is sketched below.
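
A hedged sketch of the expected rule (the strings stand in for Spark's 
BuildLeft/BuildRight objects; the real logic lives in SparkStrategies):

{code:java}
// When both sides are eligible for broadcast and the user gave no hint, pick
// the side with the smaller size estimate instead of defaulting to the right.
def pickBuildSide(leftSizeInBytes: BigInt, rightSizeInBytes: BigInt): String =
  if (leftSizeInBytes < rightSizeInBytes) "BuildLeft" else "BuildRight"
{code}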







[jira] [Created] (SPARK-22254) clean up the implementation of `growToSize` in CompactBuffer

2017-10-11 Thread Feng Liu (JIRA)
Feng Liu created SPARK-22254:


 Summary: clean up the implementation of `growToSize` in 
CompactBuffer
 Key: SPARK-22254
 URL: https://issues.apache.org/jira/browse/SPARK-22254
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: Feng Liu


Two issues:

1. The arrayMax should be `ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH`.
2. I believe some `-2` terms were introduced because `Integer.MAX_VALUE` was 
used previously. We should make the calculation of newArrayLen concise (see the 
sketch below).
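
A sketch of the concise calculation (assuming Spark's 
`ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH`; the function shape is 
illustrative):

{code:java}
import org.apache.spark.unsafe.array.ByteArrayMethods

// Double the capacity, cap it at the maximum safely-allocatable array length,
// and fail fast when the requested size can never fit.
def newArrayLen(requestedSize: Int, curSize: Int): Int = {
  val arrayMax = ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH
  if (requestedSize > arrayMax) {
    throw new UnsupportedOperationException(
      s"Can't grow buffer beyond $arrayMax elements")
  }
  math.min(arrayMax.toLong, math.max(curSize.toLong * 2, requestedSize.toLong)).toInt
}
{code}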






[jira] [Created] (SPARK-22222) Fix the ARRAY_MAX in BufferHolder and add a test

2017-10-08 Thread Feng Liu (JIRA)
Feng Liu created SPARK-2:


 Summary: Fix the ARRAY_MAX in BufferHolder and add a test
 Key: SPARK-2
 URL: https://issues.apache.org/jira/browse/SPARK-2
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: Feng Liu


This is actually a follow-up to SPARK-22033, which set `ARRAY_MAX` to 
`Int.MaxValue - 8`. That is not a valid bound, because it makes the following 
line fail when such a large byte array is allocated: 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java#L86
 We need to make sure the new length is a multiple of 8.

We also need to add a test for the fix. Note that the test should pass 
regardless of the heap size of the test JVM.
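
To illustrate the alignment constraint (the `ARRAY_MAX` value below is an 
assumption mirroring `ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH`):

{code:java}
// A grown length must stay a multiple of 8 and must not exceed a cap that
// still leaves room for the rounding.
val ARRAY_MAX = Integer.MAX_VALUE - 15
def roundUpTo8(needed: Int): Int = {
  require(needed <= ARRAY_MAX, s"cannot grow beyond $ARRAY_MAX bytes")
  ((needed + 7) / 8) * 8 // always a multiple of 8, never overflows given the cap
}
{code}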






[jira] [Created] (SPARK-22003) vectorized reader does not work with UDF when the column is array

2017-09-13 Thread Feng Liu (JIRA)
Feng Liu created SPARK-22003:


 Summary: vectorized reader does not work with UDF when the column 
is array
 Key: SPARK-22003
 URL: https://issues.apache.org/jira/browse/SPARK-22003
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Feng Liu


The UDF needs to deserialize the UnsafeRow. When the column type is an array, 
the `get` method of the ColumnVector used by the vectorized reader is called, 
but unfortunately this method is not implemented.

Code to reproduce the issue:

{code:java}
val fileName = "testfile"
val str = """{ "choices": ["key1", "key2", "key3"] }"""
val rdd = sc.parallelize(Seq(str))
val df = spark.read.json(rdd)
df.write.mode("overwrite").parquet(s"file:///tmp/$fileName")

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
spark.udf.register("acf", (rows: Seq[Row]) => Option[String](null))
spark.read.parquet(s"file:///tmp/$fileName").select(expr("acf(choices)")).show
{code}









[jira] [Created] (SPARK-21188) releaseAllLocksForTask should synchronize the whole method

2017-06-22 Thread Feng Liu (JIRA)
Feng Liu created SPARK-21188:


 Summary: releaseAllLocksForTask should synchronize the whole method
 Key: SPARK-21188
 URL: https://issues.apache.org/jira/browse/SPARK-21188
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Affects Versions: 2.1.0, 2.2.0
Reporter: Feng Liu


Since the objects `readLocksByTask`, `writeLocksByTask`, and `infos` are 
coupled and may be modified by other threads concurrently, all reads and writes 
of them in the `releaseAllLocksForTask` method should be protected by a single 
synchronized block. The fine-grained synchronization in the current code can 
cause test flakiness.
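
A self-contained sketch of the intended shape (simplified types; the real maps 
live in BlockInfoManager):

{code:java}
import scala.collection.mutable

class LockRegistry {
  private val readLocksByTask = mutable.Map[Long, List[String]]()
  private val writeLocksByTask = mutable.Map[Long, List[String]]()

  // Whole-method synchronization: no other thread can observe or mutate the
  // coupled maps between the read-lock and write-lock cleanup steps.
  def releaseAllLocksForTask(taskId: Long): Seq[String] = synchronized {
    readLocksByTask.remove(taskId).getOrElse(Nil) ++
      writeLocksByTask.remove(taskId).getOrElse(Nil)
  }
}
{code}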






[jira] [Created] (SPARK-20991) BROADCAST_TIMEOUT conf should be a timeoutConf

2017-06-05 Thread Feng Liu (JIRA)
Feng Liu created SPARK-20991:


 Summary: BROADCAST_TIMEOUT conf should be a timeoutConf
 Key: SPARK-20991
 URL: https://issues.apache.org/jira/browse/SPARK-20991
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0, 2.2.1
Reporter: Feng Liu
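
A hedged sketch of the suggested declaration (builder methods from Spark's 
internal config API; the default shown is an assumption):

{code:java}
import java.util.concurrent.TimeUnit

// Declaring the timeout with timeConf makes the unit explicit, instead of a
// bare intConf that holds seconds only by convention.
val BROADCAST_TIMEOUT = buildConf("spark.sql.broadcastTimeout")
  .doc("Maximum time to wait for the broadcast in broadcast joins.")
  .timeConf(TimeUnit.SECONDS)
  .createWithDefault(5L * 60)
{code}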





