[jira] [Updated] (SPARK-26407) For an external non-partitioned table, if add a directory named with k=v to the table path, select result will be wrong

2018-12-23 Thread Bao Yunz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bao Yunz updated SPARK-26407:
-
Description: 
Scenario 1

Create an external non-partitioned table whose schema is (id, name), for 
example, and whose location directory contains a subdirectory named "part=1" 
with some data in it. If we then describe the table, we find that "part" has 
been added to the table schema as a column. When we insert two columns of data 
into the table, an exception is thrown saying the target table has 3 columns 
but the inserted data has 2 columns.

Scenario 2

Create an external non-partitioned table whose location path is empty and whose 
schema is (id, name), for example. After several insert operations, we add a 
directory named "part=1", containing some data, under the table location 
directory. Subsequent insert and select operations then scan 
"tablePath/part=1" instead of the table path, so we get a wrong result.

 The right logic is that for a non-partitioned table, adding a partition-like 
folder under tablePath should not change its schema or its select results.
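
A minimal spark-shell sketch of Scenario 2 (table name, schema, and location are 
illustrative, and the external-table DDL assumes Hive support is enabled):

{code:java}
// Scenario 2: external non-partitioned table, then a "part=1" folder appears under its path.
spark.sql("CREATE EXTERNAL TABLE t (id INT, name STRING) STORED AS PARQUET LOCATION '/tmp/t'")
spark.sql("INSERT INTO t VALUES (1, 'a')")

// Outside of Spark, copy or write some parquet files into /tmp/t/part=1.

// Expected: the extra folder is ignored, because t is not partitioned.
// Observed (per this report): the scan path switches to /tmp/t/part=1 and the result is wrong.
spark.sql("SELECT * FROM t").show()
{code}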

  was:
Scenario 1

Create an external non-partitioned table, in which location directory has a 
directory named with "part=1" and its schema is (id, name), for example. And 
there is some data in the "part=1" directory. Then desc the table, we will find 
the "part" is added in table scehma as table column. when insert into the table 
with two columns data, will throw a exception that  target table has 3 columns 
but the inserted data has 2 columns. 

Scenario 2

Create an external non-partitioned table, which location path is empty and its 
scema is (id, name), for example. After several times insert operation, we add 
a directory named with "part=1" in the table location directory.  And there is 
some data in the "part=1" directory.  Then do insert and select operation, we 
will find the scan path is changed to "tablePath/part=1",so that we will get a 
wrong result.

 The right logic should be that if a table is a non-partitioned table, adding a 
partition-like folder under tablePath should not change its schema and select 
result.


> For an external non-partitioned table, if add a directory named with k=v to 
> the table path, select result will be wrong
> ---
>
> Key: SPARK-26407
> URL: https://issues.apache.org/jira/browse/SPARK-26407
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bao Yunz
>Priority: Major
>  Labels: usability
>
> Scenario 1
> Create an external non-partitioned table whose schema is (id, name), for 
> example, and whose location directory contains a subdirectory named "part=1" 
> with some data in it. If we then describe the table, we find that "part" has 
> been added to the table schema as a column. When we insert two columns of data 
> into the table, an exception is thrown saying the target table has 3 columns 
> but the inserted data has 2 columns.
> Scenario 2
> Create an external non-partitioned table whose location path is empty and 
> whose schema is (id, name), for example. After several insert operations, we 
> add a directory named "part=1", containing some data, under the table location 
> directory. Subsequent insert and select operations then scan 
> "tablePath/part=1" instead of the table path, so we get a wrong result.
>  The right logic is that for a non-partitioned table, adding a partition-like 
> folder under tablePath should not change its schema or its select results.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26419) spark metric source

2018-12-23 Thread Si Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Si Chen resolved SPARK-26419.
-
Resolution: Invalid

> spark metric source
> ---
>
> Key: SPARK-26419
> URL: https://issues.apache.org/jira/browse/SPARK-26419
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Si Chen
>Priority: Major
> Attachments: image-2018-12-20-17-05-42-245.png, 
> image-2018-12-20-17-07-40-920.png, image-2018-12-20-17-07-48-020.png, 
> image-2018-12-20-17-11-44-568.png, image-2018-12-20-17-14-35-157.png
>
>
> Today I wrote a metric source to collect HikariCP metrics.
>  My source code looks like this:
>  !image-2018-12-20-17-05-42-245.png|width=475,height=184!
>  Metrics.properties:
>  !image-2018-12-20-17-07-48-020.png|width=533,height=121!
>  My application runs in yarn-cluster mode.
>  The driver runs normally, and in Graphite I can see the HikariCP metrics:
>  !image-2018-12-20-17-11-44-568.png|width=468,height=118!
> But the executor did not report the HikariCP metrics to Graphite, so I looked 
> at the executor's log and found the following:
>  !image-2018-12-20-17-14-35-157.png|width=666,height=331!
>  So it cannot reflectively instantiate this class because it cannot find it, 
> but I'm sure the class is packaged in the jar, and the driver can find it. 
> Why can't the executor find this class?
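
For reference, a custom source of this kind is usually just a small class 
implementing Spark's Source trait; a minimal sketch (class name and gauge are 
illustrative, and since the trait is package-private in some Spark versions, 
custom sources are commonly declared under an org.apache.spark package):

{code:java}
package org.apache.spark.metrics.source

import com.codahale.metrics.{Gauge, MetricRegistry}

// Minimal custom metric source; the registry is what the configured sinks
// (e.g. the Graphite sink in metrics.properties) read from.
class HikariCpSource extends Source {
  override val sourceName: String = "HikariCP"
  override val metricRegistry: MetricRegistry = new MetricRegistry

  // Illustrative gauge; a real source would delegate to HikariCP's own metrics.
  metricRegistry.register(MetricRegistry.name("activeConnections"),
    new Gauge[Int] { override def getValue: Int = 0 })
}
{code}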



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14023) Make exceptions consistent regarding fields and columns

2018-12-23 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-14023.
---
   Resolution: Fixed
 Assignee: Sean Owen  (was: Rekha Joshi)
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/23373

> Make exceptions consistent regarding fields and columns
> ---
>
> Key: SPARK-14023
> URL: https://issues.apache.org/jira/browse/SPARK-14023
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Assignee: Sean Owen
>Priority: Trivial
> Fix For: 3.0.0
>
>
> As you can see below, a column is called a field depending on where an 
> exception is thrown. I think it should be "column" everywhere (since that's 
> what has a type from a schema).
> {code}
> scala> lr
> res32: org.apache.spark.ml.regression.LinearRegression = linReg_d9bfe808e743
> scala> lr.fit(ds)
> java.lang.IllegalArgumentException: Field "features" does not exist.
>   at 
> org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:214)
>   at 
> org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:214)
>   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>   at scala.collection.AbstractMap.getOrElse(Map.scala:59)
>   at org.apache.spark.sql.types.StructType.apply(StructType.scala:213)
>   at 
> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
>   at 
> org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:50)
>   at 
> org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71)
>   at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116)
>   at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:89)
>   ... 51 elided
> scala> lr.fit(ds)
> java.lang.IllegalArgumentException: requirement failed: Column label must be 
> of type DoubleType but was actually StringType.
>   at scala.Predef$.require(Predef.scala:219)
>   at 
> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
>   at 
> org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
>   at 
> org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71)
>   at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116)
>   at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:89)
>   ... 51 elided
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26358) Spark deployed mode question

2018-12-23 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16722121#comment-16722121
 ] 

Hyukjin Kwon edited comment on SPARK-26358 at 12/24/18 5:10 AM:


Hm, can you copy and paste the code you ran? I have a YARN cluster, so I could 
follow what you did exactly, step by step, and verify the issue.


was (Author: hyukjin.kwon):
Hm, can you copy and paste the comments you run? I have a yarn cluster. I could 
copy what you did exactly step by step and verify the issue.

> Spark deployed mode question
> 
>
> Key: SPARK-26358
> URL: https://issues.apache.org/jira/browse/SPARK-26358
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.0
> Environment: spark2.3.0
> hadoop2.7.3
>Reporter: Si Chen
>Priority: Major
> Attachments: sparkbug.jpg
>
>
> When I submit my job in yarn-client mode: if I have not visited the application 
> web UI and an executor hits an exception, the application exits. But if I have 
> visited the application web UI and an executor hits an exception, the 
> application does not exit!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26383) NPE when use DataFrameReader.jdbc with wrong URL

2018-12-23 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728146#comment-16728146
 ] 

Hyukjin Kwon commented on SPARK-26383:
--

Yea, can you open a PR to fix the error message?

> NPE when use DataFrameReader.jdbc with wrong URL
> 
>
> Key: SPARK-26383
> URL: https://issues.apache.org/jira/browse/SPARK-26383
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: clouds
>Priority: Minor
>
> When passing a wrong url to jdbc:
> {code:java}
> val opts = Map(
>   "url" -> "jdbc:mysql://localhost/db",
>   "dbtable" -> "table",
>   "driver" -> "org.postgresql.Driver"
> )
> var df = spark.read.format("jdbc").options(opts).load
> {code}
> It throws an NPE instead of complaining that the connection failed. (Note 
> that the url and the driver do not match here.)
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:71)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
> at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
> {code}
> as the [postgresql jdbc driver 
> document|https://jdbc.postgresql.org/development/privateapi/org/postgresql/Driver.html#connect-java.lang.String-java.util.Properties-]
>  says, the driver should return "null" if it realizes it is the wrong kind of 
> driver to connect to the given URL,
> while 
> [ConnectionFactory|https://github.com/apache/spark/blob/e743e848484bf7d97e1b4f33ea83f8520ae7da04/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L56]
>  does not check whether conn is null.
> {code:java}
> val conn: Connection = JdbcUtils.createConnectionFactory(options)()
> {code}
>  and tries to close the conn anyway
> {code:java}
> try {
>   ...
> } finally {
>   conn.close()
> }
> {code}
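
A self-contained sketch of the guard being asked for here (method name and 
message are illustrative, not the actual Spark fix): java.sql.Driver.connect 
returns null when the URL is not for that driver, so checking the result turns 
the later NPE into a clear error.

{code:java}
import java.sql.Connection
import java.util.Properties

def connectOrFail(url: String, driverClass: String): Connection = {
  // Instantiate the configured driver, e.g. "org.postgresql.Driver".
  val driver = Class.forName(driverClass).getDeclaredConstructor()
    .newInstance().asInstanceOf[java.sql.Driver]
  // connect() returns null if the driver does not recognize the URL.
  val conn = driver.connect(url, new Properties())
  require(conn != null, s"Driver $driverClass did not accept url $url; check that they match.")
  conn
}
{code}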



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26385) YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in cache

2018-12-23 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728145#comment-16728145
 ] 

Hyukjin Kwon commented on SPARK-26385:
--

Mind adding the code you ran? It's also better to iterate on the dev mailing 
list before filing an issue in JIRA.

> YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in 
> cache
> ---
>
> Key: SPARK-26385
> URL: https://issues.apache.org/jira/browse/SPARK-26385
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Hadoop 2.6.0, Spark 2.4.0
>Reporter: T M
>Priority: Major
>
>  
> Hello,
>  
> I have a Spark Structured Streaming job running on YARN (Hadoop 2.6.0, 
> Spark 2.4.0). After 25-26 hours, my job stops working with the following error:
> {code:java}
> 2018-12-16 22:35:17 ERROR 
> org.apache.spark.internal.Logging$class.logError(Logging.scala:91): Query 
> TestQuery[id = a61ce197-1d1b-4e82-a7af-60162953488b, runId = 
> a56878cf-dfc7-4f6a-ad48-02cf738ccc2f] terminated with error 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (token for REMOVED: HDFS_DELEGATION_TOKEN owner=REMOVED, renewer=yarn, 
> realUser=, issueDate=1544903057122, maxDate=1545507857122, 
> sequenceNumber=10314, masterKeyId=344) can't be found in cache at 
> org.apache.hadoop.ipc.Client.call(Client.java:1470) at 
> org.apache.hadoop.ipc.Client.call(Client.java:1401) at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>  at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source) at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:752)
>  at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>  at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source) at 
> org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1977) at 
> org.apache.hadoop.fs.Hdfs.getFileStatus(Hdfs.java:133) at 
> org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1120) at 
> org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1116) at 
> org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at 
> org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1116) at 
> org.apache.hadoop.fs.FileContext$Util.exists(FileContext.java:1581) at 
> org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.exists(CheckpointFileManager.scala:326)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:142)
>  at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.add(HDFSMetadataLog.scala:110)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply$mcV$sp(MicroBatchExecution.scala:544)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:542)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.withProgressLocked(MicroBatchExecution.scala:554)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:542)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>  at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> 

[jira] [Resolved] (SPARK-26396) Kafka consumer cache overflow since 2.4.x

2018-12-23 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26396.
--
Resolution: Invalid

> Kafka consumer cache overflow since 2.4.x
> -
>
> Key: SPARK-26396
> URL: https://issues.apache.org/jira/browse/SPARK-26396
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
> Environment: Spark 2.4 standalone client mode
>Reporter: Kaspar Tint
>Priority: Major
>
> We are experiencing an issue where the Kafka consumer cache seems to overflow 
> constantly upon starting the application. This issue appeared after upgrading 
> to Spark 2.4.
> We would get constant warnings like this:
> {code:java}
> 18/12/18 07:03:29 WARN KafkaDataConsumer: KafkaConsumer cache hitting max 
> capacity of 180, removing consumer for 
> CacheKey(spark-kafka-source-6f66e0d2-beaf-4ff2-ade8-8996611de6ae--1081651087-executor,kafka-topic-76)
> 18/12/18 07:03:32 WARN KafkaDataConsumer: KafkaConsumer cache hitting max 
> capacity of 180, removing consumer for 
> CacheKey(spark-kafka-source-6f66e0d2-beaf-4ff2-ade8-8996611de6ae--1081651087-executor,kafka-topic-30)
> 18/12/18 07:03:32 WARN KafkaDataConsumer: KafkaConsumer cache hitting max 
> capacity of 180, removing consumer for 
> CacheKey(spark-kafka-source-f41d1f9e-1700-4994-9d26-2b9c0ee57881--215746753-executor,kafka-topic-57)
> 18/12/18 07:03:32 WARN KafkaDataConsumer: KafkaConsumer cache hitting max 
> capacity of 180, removing consumer for 
> CacheKey(spark-kafka-source-f41d1f9e-1700-4994-9d26-2b9c0ee57881--215746753-executor,kafka-topic-43)
> {code}
> This application runs 4 different Spark Structured Streaming queries against 
> the same Kafka topic, which has 90 partitions. We used to run it with just the 
> default settings, so it defaulted to a cache size of 64 on Spark 2.3, but now 
> we have tried setting it to 180 or 360. With 360 there is a lot less noise 
> about the overflow, but the resource need increases substantially.
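
For reference, the cache size mentioned above is controlled by a SparkConf 
entry in the Spark 2.4 Kafka source (key assumed to be 
spark.sql.kafkaConsumerCache.capacity, default 64); a sketch of raising it:

{code:java}
import org.apache.spark.sql.SparkSession

// 4 queries x 90 partitions = 360 consumers in the worst case (illustrative sizing).
val spark = SparkSession.builder()
  .appName("kafka-cache-sizing")
  .config("spark.sql.kafkaConsumerCache.capacity", "360")
  .getOrCreate()
{code}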



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26406) Add option to skip rows when reading csv files

2018-12-23 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26406.
--
Resolution: Won't Fix

I won't fix this, although I understand the use case and the problem. Spark is 
being very conservative here, so let's add only absolutely required APIs or 
options.

> Add option to skip rows when reading csv files
> --
>
> Key: SPARK-26406
> URL: https://issues.apache.org/jira/browse/SPARK-26406
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Thomas Kastl
>Priority: Minor
>
> Real-world data can contain multiple header lines. Spark currently does not 
> offer any way to skip more than one header row.
> Several workarounds are proposed on stackoverflow (manually editing each csv 
> file by adding "#" to the rows and using the comment option, or filtering 
> after reading) but all of them are workarounds with more or less obvious 
> drawbacks and restrictions.
> The option
> {code:java}
> header=True{code}
> already treats the first row of csv files differently, so the argument that 
> Spark wants to be row-order agnostic does not really hold here in my opinion. 
> A solution like pandas'
> {code:java}
> skiprows={code}
> would be highly preferable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26406) Add option to skip rows when reading csv files

2018-12-23 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728143#comment-16728143
 ] 

Hyukjin Kwon commented on SPARK-26406:
--

Spark allows RDD operations. You can also read the file as text, skip a few 
lines explicitly, and load the rest via the `csv(Dataset[String])` API.
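
A sketch of that workaround (path, header option, and the number of skipped 
lines are illustrative):

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Read the raw file as text, drop the extra header lines explicitly,
// then let the CSV reader parse whatever remains.
val raw = spark.read.textFile("/path/to/file.csv")            // Dataset[String]
val skipped = raw.rdd.zipWithIndex()
  .filter { case (_, idx) => idx >= 3 }                       // skip the first 3 lines
  .map { case (line, _) => line }
  .toDS()
val df = spark.read.option("header", "true").csv(skipped)     // csv(Dataset[String]) API
{code}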

> Add option to skip rows when reading csv files
> --
>
> Key: SPARK-26406
> URL: https://issues.apache.org/jira/browse/SPARK-26406
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Thomas Kastl
>Priority: Minor
>
> Real-world data can contain multiple header lines. Spark currently does not 
> offer any way to skip more than one header row.
> Several workarounds are proposed on stackoverflow (manually editing each csv 
> file by adding "#" to the rows and using the comment option, or filtering 
> after reading) but all of them are workarounds with more or less obvious 
> drawbacks and restrictions.
> The option
> {code:java}
> header=True{code}
> already treats the first row of csv files differently, so the argument that 
> Spark wants to be row-order agnostic does not really hold here in my opinion. 
> A solution like pandas'
> {code:java}
> skiprows={code}
> would be highly preferable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26407) For an external non-partitioned table, if add a directory named with k=v to the table path, select result will be wrong

2018-12-23 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728142#comment-16728142
 ] 

Hyukjin Kwon commented on SPARK-26407:
--

Why don't you just avoid directory names like part=1 or empty strings? It 
doesn't look like a good practice to allow.

> For an external non-partitioned table, if add a directory named with k=v to 
> the table path, select result will be wrong
> ---
>
> Key: SPARK-26407
> URL: https://issues.apache.org/jira/browse/SPARK-26407
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bao Yunz
>Priority: Major
>  Labels: usability
>
> Scenario 1
> Create an external non-partitioned table whose schema is (id, name), for 
> example, and whose location directory contains a subdirectory named "part=1" 
> with some data in it. If we then describe the table, we find that "part" has 
> been added to the table schema as a column. When we insert two columns of data 
> into the table, an exception is thrown saying the target table has 3 columns 
> but the inserted data has 2 columns.
> Scenario 2
> Create an external non-partitioned table whose location path is empty and 
> whose schema is (id, name), for example. After several insert operations, we 
> add a directory named "part=1", containing some data, under the table location 
> directory. Subsequent insert and select operations then scan 
> "tablePath/part=1" instead of the table path, so we get a wrong result.
>  The right logic is that for a non-partitioned table, adding a partition-like 
> folder under tablePath should not change its schema or its select results.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26408) java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:347)

2018-12-23 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728140#comment-16728140
 ] 

Hyukjin Kwon commented on SPARK-26408:
--

Yes, it doesn't look like Spark's problem.

> java.util.NoSuchElementException: None.get at 
> scala.None$.get(Option.scala:347)
> ---
>
> Key: SPARK-26408
> URL: https://issues.apache.org/jira/browse/SPARK-26408
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: Spark version 2.3.2
> Scala version 2.11.8
> Hbase version 1.4.7
>Reporter: Amit Siddhu
>Priority: Major
>
> {code:java}
> sudo spark-shell --packages com.hortonworks:shc-core:1.1.1-2.1-s_2.11 
> --repositories http://repo.hortonworks.com/content/groups/public/
> {code}
> {code:java}
> import org.apache.spark.sql.{SQLContext, _}
> import org.apache.spark.sql.execution.datasources.hbase._
> import org.apache.spark.{SparkConf, SparkContext}
> import spark.sqlContext.implicits._
> {code}
> {code:java}
> def withCatalog(cat: String): DataFrame = {
>   spark.sqlContext
>   .read
>   .options(Map(HBaseTableCatalog.tableCatalog->cat))
>   .format("org.apache.spark.sql.execution.datasources.hbase")
>   .load()
> }
> {code}
> {code:java}
> def motorQuoteCatatog = s"""{ |"table":{"namespace":"default", 
> "name":"public.motor_product_quote", "tableCoder":"PrimitiveType"}, 
> |"rowkey":"id", |"columns":{ |"id":{"cf":"rowkey", "col":"id", 
> "type":"string"}, |"quote_id":{"cf":"motor_product_quote", "col":"quote_id", 
> "type":"string"}, |"vehicle_id":{"cf":"motor_product_quote", 
> "col":"vehicle_id", "type":"bigint"}, |"is_new":{"cf":"motor_product_quote", 
> "col":"is_new", "type":"boolean"}, 
> |"date_of_manufacture":{"cf":"motor_product_quote", 
> "col":"date_of_manufacture", "type":"string"}, 
> |"raw_data":{"cf":"motor_product_quote", "col":"raw_data", "type":"string"}, 
> |"is_processed":{"cf":"motor_product_quote", "col":"is_processed", 
> "type":"boolean"}, |"created_on":{"cf":"motor_product_quote", 
> "col":"created_on", "type":"string"}, |"type":{"cf":"motor_product_quote", 
> "col":"type", "type":"string"}, 
> |"requirement_id":{"cf":"motor_product_quote", "col":"requirement_id", 
> "type":"int"}, |"previous_policy_id":{"cf":"motor_product_quote", 
> "col":"type", "previous_policy_id":"int"}, 
> |"parent_quote_id":{"cf":"motor_product_quote", "col":"type", 
> "parent_quote_id":"int"}, |"ticket_id":{"cf":"motor_product_quote", 
> "col":"type", "ticket_id":"int"}, |"tracker_id":{"cf":"motor_product_quote", 
> "col":"tracker_id", "type":"int"}, |"category":{"cf":"motor_product_quote", 
> "col":"category", "type":"string"}, 
> |"sales_channel_id":{"cf":"motor_product_quote", "col":"sales_channel_id", 
> "type":"int"}, |"policy_type":{"cf":"motor_product_quote", 
> "col":"policy_type", "type":"string"}, 
> |"original_quote_created_by_id":{"cf":"motor_product_quote", "col":"type", 
> "original_quote_created_by_id":"int"}, 
> |"created_by_id":{"cf":"motor_product_quote", "col":"created_by_id", 
> "type":"int"}, |"mobile":{"cf":"motor_product_quote", "col":"mobile", 
> "type":"string"}, |"registration_number":{"cf":"motor_product_quote", 
> "col":"registration_number", "type":"string"} |} |}""".stripMargin
> {code}
>  
> {code:java}
> val df = withCatalog(motorQuoteCatatog){code}
> {code:java}
> java.util.NoSuchElementException: None.get
>  at scala.None$.get(Option.scala:347)
>  at scala.None$.get(Option.scala:345)
>  at org.apache.spark.sql.execution.datasources.hbase.Field.   
> (HBaseTableCatalog.scala:102)
>  at  
> org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog$$anonfun$ap
>  ply$3.apply(HBaseTableCatalog.scala:286)
>  at 
> org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog$$anonfun$apply$3.apply(HBaseTableCatalog.scala:281)
> at scala.collection.immutable.List.foreach(List.scala:381)
>  at 
> org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog$.apply(HBaseTableCatalog.scala:281)
>  at 
> org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.(HBaseRelation.scala:80)
>  at 
> org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:51)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:341)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
>  at withCatalog(:38)
>  ... 55 elided
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: 

[jira] [Commented] (SPARK-26413) SPIP: RDD Arrow Support in Spark Core and PySpark

2018-12-23 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728136#comment-16728136
 ] 

Hyukjin Kwon commented on SPARK-26413:
--

I was wondering whether it would be better to simply expose some kind of 
utility that converts from the Arrow format to Spark internal rows, rather than 
exposing RDD APIs, if something like this is needed. Could this be an 
alternative as well?

> SPIP: RDD Arrow Support in Spark Core and PySpark
> -
>
> Key: SPARK-26413
> URL: https://issues.apache.org/jira/browse/SPARK-26413
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0
>Reporter: Richard Whitcomb
>Priority: Minor
>
> h2. Background and Motivation
> Arrow is becoming a standard interchange format for columnar structured 
> data.  This is already true in Spark with the use of Arrow in the pandas UDF 
> functions in the DataFrame API.
> However the current implementation of arrow in spark is limited to two use 
> cases.
>  * Pandas UDF that allows for operations on one or more columns in the 
> DataFrame API.
>  * Collect as Pandas which pulls back the entire dataset to the driver in a 
> Pandas Dataframe.
> What is still hard however is making use of all of the columns in a Dataframe 
> while staying distributed across the workers.  The only way to do this 
> currently is to drop down into RDDs and collect the rows into a dataframe. 
> However pickling is very slow and the collecting is expensive.
> The proposal is to extend spark in a way that allows users to operate on an 
> Arrow Table fully while still making use of Spark's underlying technology.  
> Some examples of possibilities with this new API. 
>  * Pass the Arrow Table with Zero Copy to PyTorch for predictions.
>  * Pass to Nvidia Rapids for an algorithm to be run on the GPU.
>  * Distribute data across many GPUs making use of the new Barriers API.
> h2. Targets users and personas
> ML, data scientists, and future library authors.
> h2. Goals
>  * Conversion from any Dataset[Row] or PySpark Dataframe to RDD[Table]
>  * Conversion back from any RDD[Table] to Dataset[Row], RDD[Row], Pyspark 
> Dataframe
>  * Open the possibilities to tighter integration between Arrow/Pandas/Spark 
> especially at a library level.
> h2. Non-Goals
>  * Not creating a new API but instead using existing APIs.
> h2. Proposed API changes
> h3. Data Objects
> case class ArrowTable(schema: Schema, batches: Iterable[ArrowRecordBatch])
> h3. Dataset.scala
> {code:java}
> // Converts a Dataset to an RDD of Arrow Tables
> // Each RDD row is an Iterable of Arrow Batches.
> def arrowRDD: RDD[ArrowTable]
>  
> // Utility Function to convert to RDD Arrow Table for PySpark
> private[sql] def javaToPythonArrow: JavaRDD[Array[Byte]]
> {code}
> h3. RDD.scala
> {code:java}
>  // Converts RDD[ArrowTable] to an Dataframe by inspecting the Arrow Schema
>  def arrowToDataframe(implicit ev: T <:< ArrowTable): Dataframe
>   
>  // Converts RDD[ArrowTable] to an RDD of Rows
>  def arrowToRDD(implicit ev: T <:< ArrowTable): RDD[Row]{code}
> h3. Serializers.py
> {code:java}
> # Serializer that takes a Serialized Arrow Tables and returns a pyarrow Table.
> class ArrowSerializer(FramedSerializer)
> {code}
> h3. RDD.py
> {code}
> # New RDD Class that has an RDD[ArrowTable] behind it and uses the new 
> ArrowSerializer instead of the normal Pickle Serializer
> class ArrowRDD(RDD){code}
>  
> h3. Dataframe.py
> {code}
> // New Function that converts a pyspark dataframe into an ArrowRDD
> def arrow(self):
> {code}
>  
> h2. Example API Usage
> h3. Pyspark
> {code}
> # Select a Single Column Using Pandas
> def map_table(arrow_table):
>   import pyarrow as pa
>   pdf = arrow_table.to_pandas()
>   pdf = pdf[['email']]
>   return pa.Table.from_pandas(pdf)
> # Convert to Arrow RDD, map over tables, convert back to dataframe
> df.arrow.map(map_table).dataframe 
> {code}
> h3. Scala
>  
> {code:java}
> // Find N Centroids using Cuda Rapids kMeans
> def runCuKmeans(table: ArrowTable, clusters: Int): ArrowTable
>  
> // Convert Dataset[Row] to RDD[ArrowTable] and back to Dataset[Row]
> df.arrowRDD.map(table => runCuKmeans(table, N)).arrowToDataframe.show(10)
> {code}
>  
> h2. Implementation Details
> As mentioned in the first section, the goal is to make it easier for Spark 
> users to interact with Arrow tools and libraries.  This however does come 
> with some considerations from a Spark perspective.
>  Arrow is column based instead of Row based.  In the above API proposal of 
> RDD[ArrowTable] each RDD row will in fact be a block of data.  Another 
> proposal in this regard is to introduce a new parameter to Spark called 
> arrow.sql.execution.arrow.maxRecordsPerTable.  The goal of this parameter is 
> to decide how 

[jira] [Commented] (SPARK-26413) SPIP: RDD Arrow Support in Spark Core and PySpark

2018-12-23 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728135#comment-16728135
 ] 

Hyukjin Kwon commented on SPARK-26413:
--

For clarification, Arrow column vector APIs are exposed under 
[https://github.com/apache/spark/tree/86cc907448f0102ad0c185e87fcc897d0a32707f/sql/core/src/main/java/org/apache/spark/sql/vectorized].
 So it is quite feasible for a third party to implement this. For instance, the 
company I belong to used this approach to consume the Arrow format (see also 
https://community.hortonworks.com/articles/223626/integrating-apache-hive-with-apache-spark-hive-war.html).
 It is feasible to integrate with {{ColumnarBatch}}. 

Since we are very conservative about RDD APIs, I would like to be very sure that 
we have a strong reason to add them. For instance, is it impossible to use the 
API I pointed out? Since Arrow is self-describing and structured, it mostly only 
makes sense to use it with Spark SQL within Apache Spark. In that case, wouldn't 
it make more sense to make the current vector APIs easier to use?
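
For illustration, a minimal sketch of going from an Arrow vector to rows 
through that public vectorized API (the allocator/vector calls are the standard 
Arrow Java ones; names and values are illustrative):

{code:java}
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.IntVector
import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnVector, ColumnarBatch}

// Build a tiny Arrow vector, wrap it in Spark's ArrowColumnVector,
// and expose it as rows through ColumnarBatch.
val allocator = new RootAllocator(Long.MaxValue)
val ids = new IntVector("id", allocator)
ids.allocateNew(3)
(0 until 3).foreach(i => ids.setSafe(i, i * 10))
ids.setValueCount(3)

val batch = new ColumnarBatch(Array[ColumnVector](new ArrowColumnVector(ids)))
batch.setNumRows(3)
val it = batch.rowIterator()
while (it.hasNext) println(it.next().getInt(0))   // prints 0, 10, 20
batch.close()
{code}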

> SPIP: RDD Arrow Support in Spark Core and PySpark
> -
>
> Key: SPARK-26413
> URL: https://issues.apache.org/jira/browse/SPARK-26413
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0
>Reporter: Richard Whitcomb
>Priority: Minor
>
> h2. Background and Motivation
> Arrow is becoming a standard interchange format for columnar structured 
> data.  This is already true in Spark with the use of Arrow in the pandas UDF 
> functions in the DataFrame API.
> However the current implementation of arrow in spark is limited to two use 
> cases.
>  * Pandas UDF that allows for operations on one or more columns in the 
> DataFrame API.
>  * Collect as Pandas which pulls back the entire dataset to the driver in a 
> Pandas Dataframe.
> What is still hard however is making use of all of the columns in a Dataframe 
> while staying distributed across the workers.  The only way to do this 
> currently is to drop down into RDDs and collect the rows into a dataframe. 
> However pickling is very slow and the collecting is expensive.
> The proposal is to extend spark in a way that allows users to operate on an 
> Arrow Table fully while still making use of Spark's underlying technology.  
> Some examples of possibilities with this new API. 
>  * Pass the Arrow Table with Zero Copy to PyTorch for predictions.
>  * Pass to Nvidia Rapids for an algorithm to be run on the GPU.
>  * Distribute data across many GPUs making use of the new Barriers API.
> h2. Targets users and personas
> ML, data scientists, and future library authors.
> h2. Goals
>  * Conversion from any Dataset[Row] or PySpark Dataframe to RDD[Table]
>  * Conversion back from any RDD[Table] to Dataset[Row], RDD[Row], Pyspark 
> Dataframe
>  * Open the possibilities to tighter integration between Arrow/Pandas/Spark 
> especially at a library level.
> h2. Non-Goals
>  * Not creating a new API but instead using existing APIs.
> h2. Proposed API changes
> h3. Data Objects
> case class ArrowTable(schema: Schema, batches: Iterable[ArrowRecordBatch])
> h3. Dataset.scala
> {code:java}
> // Converts a Dataset to an RDD of Arrow Tables
> // Each RDD row is an Iterable of Arrow Batches.
> def arrowRDD: RDD[ArrowTable]
>  
> // Utility Function to convert to RDD Arrow Table for PySpark
> private[sql] def javaToPythonArrow: JavaRDD[Array[Byte]]
> {code}
> h3. RDD.scala
> {code:java}
>  // Converts RDD[ArrowTable] to an Dataframe by inspecting the Arrow Schema
>  def arrowToDataframe(implicit ev: T <:< ArrowTable): Dataframe
>   
>  // Converts RDD[ArrowTable] to an RDD of Rows
>  def arrowToRDD(implicit ev: T <:< ArrowTable): RDD[Row]{code}
> h3. Serializers.py
> {code:java}
> # Serializer that takes a Serialized Arrow Tables and returns a pyarrow Table.
> class ArrowSerializer(FramedSerializer)
> {code}
> h3. RDD.py
> {code}
> # New RDD Class that has an RDD[ArrowTable] behind it and uses the new 
> ArrowSerializer instead of the normal Pickle Serializer
> class ArrowRDD(RDD){code}
>  
> h3. Dataframe.py
> {code}
> // New Function that converts a pyspark dataframe into an ArrowRDD
> def arrow(self):
> {code}
>  
> h2. Example API Usage
> h3. Pyspark
> {code}
> # Select a Single Column Using Pandas
> def map_table(arrow_table):
>   import pyarrow as pa
>   pdf = arrow_table.to_pandas()
>   pdf = pdf[['email']]
>   return pa.Table.from_pandas(pdf)
> # Convert to Arrow RDD, map over tables, convert back to dataframe
> df.arrow.map(map_table).dataframe 
> {code}
> h3. Scala
>  
> {code:java}
> // Find N Centroids using Cuda Rapids kMeans
> def runCuKmeans(table: ArrowTable, clusters: Int): ArrowTable
>  
> // Convert Dataset[Row] to RDD[ArrowTable] and back to Dataset[Row]

[jira] [Updated] (SPARK-26419) spark metric source

2018-12-23 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26419:
-
Priority: Major  (was: Blocker)

> spark metric source
> ---
>
> Key: SPARK-26419
> URL: https://issues.apache.org/jira/browse/SPARK-26419
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Si Chen
>Priority: Major
> Attachments: image-2018-12-20-17-05-42-245.png, 
> image-2018-12-20-17-07-40-920.png, image-2018-12-20-17-07-48-020.png, 
> image-2018-12-20-17-11-44-568.png, image-2018-12-20-17-14-35-157.png
>
>
> Today I wrote a metric source to collect HikariCP metrics.
>  My source code looks like this:
>  !image-2018-12-20-17-05-42-245.png|width=475,height=184!
>  Metrics.properties:
>  !image-2018-12-20-17-07-48-020.png|width=533,height=121!
>  My application runs in yarn-cluster mode.
>  The driver runs normally, and in Graphite I can see the HikariCP metrics:
>  !image-2018-12-20-17-11-44-568.png|width=468,height=118!
> But the executor did not report the HikariCP metrics to Graphite, so I looked 
> at the executor's log and found the following:
>  !image-2018-12-20-17-14-35-157.png|width=666,height=331!
>  So it cannot reflectively instantiate this class because it cannot find it, 
> but I'm sure the class is packaged in the jar, and the driver can find it. 
> Why can't the executor find this class?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26419) spark metric source

2018-12-23 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728125#comment-16728125
 ] 

Hyukjin Kwon commented on SPARK-26419:
--

Please avoid setting Critical+ priority, which is usually reserved for committers.

> spark metric source
> ---
>
> Key: SPARK-26419
> URL: https://issues.apache.org/jira/browse/SPARK-26419
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Si Chen
>Priority: Major
> Attachments: image-2018-12-20-17-05-42-245.png, 
> image-2018-12-20-17-07-40-920.png, image-2018-12-20-17-07-48-020.png, 
> image-2018-12-20-17-11-44-568.png, image-2018-12-20-17-14-35-157.png
>
>
> Today I wrote a metric source to collect HikariCP metrics.
>  My source code looks like this:
>  !image-2018-12-20-17-05-42-245.png|width=475,height=184!
>  Metrics.properties:
>  !image-2018-12-20-17-07-48-020.png|width=533,height=121!
>  My application runs in yarn-cluster mode.
>  The driver runs normally, and in Graphite I can see the HikariCP metrics:
>  !image-2018-12-20-17-11-44-568.png|width=468,height=118!
> But the executor did not report the HikariCP metrics to Graphite, so I looked 
> at the executor's log and found the following:
>  !image-2018-12-20-17-14-35-157.png|width=666,height=331!
>  So it cannot reflectively instantiate this class because it cannot find it, 
> but I'm sure the class is packaged in the jar, and the driver can find it. 
> Why can't the executor find this class?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26405) OOM

2018-12-23 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728122#comment-16728122
 ] 

Hyukjin Kwon commented on SPARK-26405:
--

BTW, please don't report a JIRA with a title like "OOM". I had no idea what 
the JIRA was about.

> OOM
> ---
>
> Key: SPARK-26405
> URL: https://issues.apache.org/jira/browse/SPARK-26405
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Scheduler, Shuffle, Spark Core, Spark Submit
>Affects Versions: 2.2.0
>Reporter: lu
>Priority: Major
>
> A heap memory overflow occurred in the user portrait analysis; the data 
> volume analyzed was about 10 million records.
> spark work memory: 4G
> using RestSubmissionClient to submit the job
> both the driver memory and executor memory: 4g
> total executor cores: 6
> spark cores: 2
> cluster size: 3
>  
> INFO worker.WorkerWatcher: Connecting to worker 
> spark://Worker@192.168.44.181:45315
> Exception in thread "broadcast-exchange-3" java.lang.OutOfMemoryError: Not 
> enough memory to build and broadcast the table to all worker nodes. As a 
> workaround, you can either disable broadcast by setting 
> spark.sql.autoBroadcastJoinThreshold to -1 or increase the spark driver 
> memory by setting spark.driver.memory to a higher value
>  at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:102)
>  at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:73)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:103)
>  at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
>  at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> Exception in thread "main" java.lang.reflect.InvocationTargetException
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
>  at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after 
> [300 seconds]
>  at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>  at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>  at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
>  at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:123)
>  at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:248)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:126)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
>  at 
> org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:36)
>  at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:68)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
>  at 
> org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:88)
>  at 
> 

[jira] [Resolved] (SPARK-26405) OOM

2018-12-23 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26405.
--
Resolution: Invalid

> OOM
> ---
>
> Key: SPARK-26405
> URL: https://issues.apache.org/jira/browse/SPARK-26405
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Scheduler, Shuffle, Spark Core, Spark Submit
>Affects Versions: 2.2.0
>Reporter: lu
>Priority: Major
>
> A heap memory overflow occurred in the user portrait analysis; the data 
> volume analyzed was about 10 million records.
> spark work memory: 4G
> using RestSubmissionClient to submit the job
> both the driver memory and executor memory: 4g
> total executor cores: 6
> spark cores: 2
> cluster size: 3
>  
> INFO worker.WorkerWatcher: Connecting to worker 
> spark://Worker@192.168.44.181:45315
> Exception in thread "broadcast-exchange-3" java.lang.OutOfMemoryError: Not 
> enough memory to build and broadcast the table to all worker nodes. As a 
> workaround, you can either disable broadcast by setting 
> spark.sql.autoBroadcastJoinThreshold to -1 or increase the spark driver 
> memory by setting spark.driver.memory to a higher value
>  at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:102)
>  at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:73)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:103)
>  at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
>  at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> Exception in thread "main" java.lang.reflect.InvocationTargetException
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
>  at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after 
> [300 seconds]
>  at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>  at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>  at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
>  at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:123)
>  at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:248)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:126)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
>  at 
> org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:36)
>  at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:68)
>  at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
>  at 
> org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:88)
>  at 
> org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:209)
>  at 
> 

[jira] [Commented] (SPARK-26432) Not able to connect Hbase 2.1 service Getting NoSuchMethodException while trying to obtain token from Hbase 2.1 service.

2018-12-23 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728121#comment-16728121
 ] 

Dongjoon Hyun commented on SPARK-26432:
---

Thank you for reporting, [~S71955].
If you update the JIRA description with a reproducible example, that would be 
helpful. I have two questions.
- Is HBase 2.1 the only broken one among HBase versions? Could you link here the 
Apache HBase issue that removed that API?
- Is it enough to make `HBaseDelegationTokenProvider` support HBase 2.1?

> Not able to connect Hbase 2.1 service Getting NoSuchMethodException while 
> trying to obtain token from Hbase 2.1 service.
> 
>
> Key: SPARK-26432
> URL: https://issues.apache.org/jira/browse/SPARK-26432
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sujith
>Priority: Major
> Attachments: hbase-dep-obtaintok.png
>
>
> Getting NoSuchMethodException:
> org.apache.hadoop.hbase.security.token.TokenUtil(org.apache.hadoop.conf.Configuration)
> while trying to connect to an HBase 2.1 service from Spark.
> This is mainly happening because Spark uses a deprecated HBase API,
> public static Token obtainToken(Configuration conf),
> for obtaining the token, and the same has been removed in HBase 2.1.
>  
> Attached is a snapshot of the error logs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23693) SQL function uuid()

2018-12-23 Thread Reynold Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728078#comment-16728078
 ] 

Reynold Xin commented on SPARK-23693:
-

[~tashoyan] the issue with calling uuid directly is that it is 
non-deterministic, and when recomputation happens due to a fault, the ids are 
not stable. We'd need a different way to generate a uuid that can be 
deterministic based on some seed.
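
One possible deterministic alternative, sketched under the assumption that each 
row already carries a stable business key (column and seed names are 
illustrative): derive a name-based (type 3) UUID from the key plus a fixed seed, 
so recomputation after a fault yields the same ids.

{code:java}
import java.nio.charset.StandardCharsets
import java.util.UUID
import org.apache.spark.sql.functions.udf

// The same (seed, key) pair always maps to the same UUID string.
val seed = "2018-12-23-run-1"
val deterministicUuid = udf { key: String =>
  UUID.nameUUIDFromBytes(s"$seed:$key".getBytes(StandardCharsets.UTF_8)).toString
}

// Usage (df and "business_key" are illustrative):
// val withId = df.withColumn("uuid", deterministicUuid($"business_key"))
{code}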

> SQL function uuid()
> ---
>
> Key: SPARK-23693
> URL: https://issues.apache.org/jira/browse/SPARK-23693
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Arseniy Tashoyan
>Priority: Minor
>
> Add function uuid() to org.apache.spark.sql.functions that returns 
> [Universally Unique 
> ID|https://en.wikipedia.org/wiki/Universally_unique_identifier].
> Sometimes it is necessary to uniquely identify each row in a DataFrame.
> Currently the following ways are available:
>  * monotonically_increasing_id() function
>  * row_number() function over some window
>  * convert the DataFrame to RDD and zipWithIndex()
> All these approaches do not work when appending this DataFrame to another 
> DataFrame (union). Collisions may occur - two rows in different DataFrames 
> may have the same ID. Re-generating IDs on the resulting DataFrame is not an 
> option, because some data in some other system may already refer to old IDs.
> The proposed solution is to add new function:
> {code:scala}
> def uuid(): Column
> {code}
> that returns String representation of UUID.
> UUID is represented as a 128-bit number (two long numbers). Such numbers are 
> not supported in Scala or Java. In addition, some storage systems do not 
> support 128-bit numbers (Parquet's largest numeric type is INT96). This is 
> the reason for the uuid() function to return String.
> I already have a simple implementation based on 
> [java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I 
> can share it as a PR.
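
For illustration only (uuid() is the proposed function and does not exist in 
org.apache.spark.sql.functions), a sketch of how it would be used next to 
today's UDF workaround:

{code:scala}
import org.apache.spark.sql.functions.udf

// current workaround: a random-UUID UDF, with the non-determinism caveat
// discussed in the comments above
val randomUuid = udf(() => java.util.UUID.randomUUID().toString)

// proposed:  df.withColumn("id", uuid())
// today:     df.withColumn("id", randomUuid())
{code}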



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26432) Not able to connect Hbase 2.1 service Getting NoSuchMethodException while trying to obtain token from Hbase 2.1 service.

2018-12-23 Thread Sujith (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sujith updated SPARK-26432:
---
Description: 
Getting NoSuchMethodException :

org.apache.hadoop.hbase.security.token.TokenUtil(org.apache.hadoop.conf.Configuration)

while trying to connect to the HBase 2.1 service from Spark.

This is mainly happening because Spark uses a deprecated HBase API,

public static Token<AuthenticationTokenIdentifier> obtainToken(Configuration conf)

for obtaining the token, and that API has been removed in HBase 2.1.

 

The error log snapshot is attached.

  was:
Getting NoSuchMethodException :

org.apache.hadoop.hbase.security.token.TokenUtil(org.apache.hadoop.conf.Configuration)

while trying to connect to the HBase 2.1 service from Spark.

This is mainly happening because Spark uses a deprecated HBase API,

public static Token<AuthenticationTokenIdentifier> obtainToken(Configuration conf)

for obtaining the token, and that API has been removed in HBase 2.1.


> Not able to connect Hbase 2.1 service Getting NoSuchMethodException while 
> trying to obtain token from Hbase 2.1 service.
> 
>
> Key: SPARK-26432
> URL: https://issues.apache.org/jira/browse/SPARK-26432
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sujith
>Priority: Major
> Attachments: hbase-dep-obtaintok.png
>
>
> Getting NoSuchMethodException :
> org.apache.hadoop.hbase.security.token.TokenUtil(org.apache.hadoop.conf.Configuration)
> while trying to connect to the HBase 2.1 service from Spark.
> This is mainly happening because Spark uses a deprecated HBase API,
> public static Token<AuthenticationTokenIdentifier> obtainToken(Configuration conf),
> for obtaining the token, and that API has been removed in HBase 2.1.
>  
> The error log snapshot is attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24374) SPIP: Support Barrier Execution Mode in Apache Spark

2018-12-23 Thread Debasish Das (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728020#comment-16728020
 ] 

Debasish Das commented on SPARK-24374:
--

Hi [~mengxr], with barrier mode available, is it not possible to use the native 
TF parameter server in place of MPI? Although we are offloading compute from 
Spark to TF workers/PS, if an exception comes out, tracking it with the native 
TF API might be easier than with MPI exceptions. Great work, by the way. I was 
looking for a cloud-ml alternative using Spark over AWS/Azure/GCP, and it looks 
like barrier mode should help a lot. However, I am still not clear on the 
limitations of the TensorFlowOnSpark project from Yahoo 
[https://github.com/yahoo/TensorFlowOnSpark], which tried to introduce 
barrier-like syntax: if a few partitions fail on TFRecord read or communication 
exceptions, can it re-run the full job, or will it only re-run the failed 
partitions? I guess the exceptions from the failed partitions can be thrown 
back to the Spark driver, and the driver can decide to re-run. When multiple TF 
training jobs get scheduled on the same Spark cluster, I suspect TFoS might 
have issues as well.

> SPIP: Support Barrier Execution Mode in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Epic
>  Components: ML, Spark Core
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.
> {quote}
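
A minimal sketch of the barrier API this SPIP introduced in Spark 2.4 (the 
MPI/TF launch is only indicated by a comment):

{code:scala}
import org.apache.spark.{BarrierTaskContext, SparkContext}

def runBarrierStage(sc: SparkContext): Unit = {
  sc.parallelize(1 to 100, numSlices = 4)
    .barrier()                          // all 4 tasks are launched together
    .mapPartitions { iter =>
      val ctx = BarrierTaskContext.get()
      // e.g. start an MPI/TF worker here, using ctx.getTaskInfos() for peer addresses
      ctx.barrier()                     // wait until every task reaches this point
      iter
    }
    .count()
}
{code}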



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26432) Not able to connect Hbase 2.1 service Getting NoSuchMethodException while trying to obtain token from Hbase 2.1 service.

2018-12-23 Thread Sujith (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728014#comment-16728014
 ] 

Sujith edited comment on SPARK-26432 at 12/23/18 5:49 PM:
--

This is mainly happening because Spark uses a deprecated HBase API,

public static Token<AuthenticationTokenIdentifier> obtainToken(Configuration conf)

for obtaining the Kerberos security token, and that API has been removed in 
HBase 2.1. As far as I analyzed, there is a more stable API,

public static Token<AuthenticationTokenIdentifier> obtainToken(Connection conn)

in the TokenUtil class; I think Spark should use this stable API for getting 
the delegation token.

To invoke this API, a Connection object first has to be retrieved from 
ConnectionFactory, and that connection can then be passed to 
obtainToken(Connection conn) to get the token.

 

I can raise a PR soon to handle this issue; please let me know if you have any 
clarifications or suggestions.
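
A hedged sketch of the shape of that call sequence (Spark's actual provider 
would likely keep invoking these classes via reflection; the method name 
obtainHBaseToken is illustrative):

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.security.token.TokenUtil

// Obtain the delegation token through a Connection instead of the removed
// obtainToken(Configuration) overload.
def obtainHBaseToken(hadoopConf: Configuration) = {
  val connection = ConnectionFactory.createConnection(HBaseConfiguration.create(hadoopConf))
  try {
    TokenUtil.obtainToken(connection)  // Token[AuthenticationTokenIdentifier]
  } finally {
    connection.close()
  }
}
{code}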


was (Author: s71955):
This is mainly happening because Spark uses a deprecated HBase API,

public static Token<AuthenticationTokenIdentifier> obtainToken(Configuration conf)

for obtaining the Kerberos security token, and that API has been removed in 
HBase 2.1. As far as I analyzed, there is a more stable API,

public static Token<AuthenticationTokenIdentifier> obtainToken(Connection conn)

in the TokenUtil class; I think Spark should use this stable API for getting 
the delegation token.

To invoke this API, a Connection object first has to be retrieved from 
ConnectionFactory, and that connection can then be passed to 
obtainToken(Connection conn) to get the token.

 

I can raise a PR soon to handle this issue; please let me know if you have any 
clarifications or suggestions.

> Not able to connect Hbase 2.1 service Getting NoSuchMethodException while 
> trying to obtain token from Hbase 2.1 service.
> 
>
> Key: SPARK-26432
> URL: https://issues.apache.org/jira/browse/SPARK-26432
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sujith
>Priority: Major
> Attachments: hbase-dep-obtaintok.png
>
>
> Getting NoSuchMethodException :
> org.apache.hadoop.hbase.security.token.TokenUtil(org.apache.hadoop.conf.Configuration)
> while trying to connect to the HBase 2.1 service from Spark.
> This is mainly happening because Spark uses a deprecated HBase API,
> public static Token<AuthenticationTokenIdentifier> obtainToken(Configuration conf),
> for obtaining the token, and that API has been removed in HBase 2.1.
>  
> The error log snapshot is attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26432) Not able to connect Hbase 2.1 service Getting NoSuchMethodException while trying to obtain token from Hbase 2.1 service.

2018-12-23 Thread Sujith (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728014#comment-16728014
 ] 

Sujith edited comment on SPARK-26432 at 12/23/18 5:48 PM:
--

This is mainly happening because Spark uses a deprecated HBase API,

public static Token<AuthenticationTokenIdentifier> obtainToken(Configuration conf)

for obtaining the Kerberos security token, and that API has been removed in 
HBase 2.1. As far as I analyzed, there is a more stable API,

public static Token<AuthenticationTokenIdentifier> obtainToken(Connection conn)

in the TokenUtil class; I think Spark should use this stable API for getting 
the delegation token.

To invoke this API, a Connection object first has to be retrieved from 
ConnectionFactory, and that connection can then be passed to 
obtainToken(Connection conn) to get the token.

 

I can raise a PR soon to handle this issue; please let me know if you have any 
clarifications or suggestions.


was (Author: s71955):
This is mainly happening because Spark uses a deprecated HBase API,

public static Token<AuthenticationTokenIdentifier> obtainToken(Configuration conf)

for obtaining the Kerberos security token, and that API has been removed in 
HBase 2.1. As far as I analyzed, there is a more stable API,

public static Token<AuthenticationTokenIdentifier> obtainToken(Connection conn)

in the TokenUtil class; I think Spark should use this stable API for getting 
the delegation token.

To invoke this API, a Connection object first has to be retrieved from 
ConnectionFactory, and that connection can then be passed to 
obtainToken(Connection conn) to get the token.

 

I can raise a PR soon to handle this issue; please let me know if you have any 
clarifications or suggestions.

> Not able to connect Hbase 2.1 service Getting NoSuchMethodException while 
> trying to obtain token from Hbase 2.1 service.
> 
>
> Key: SPARK-26432
> URL: https://issues.apache.org/jira/browse/SPARK-26432
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sujith
>Priority: Major
> Attachments: hbase-dep-obtaintok.png
>
>
> Getting NoSuchMethodException :
> org.apache.hadoop.hbase.security.token.TokenUtil(org.apache.hadoop.conf.Configuration)
> while trying to connect to the HBase 2.1 service from Spark.
> This is mainly happening because Spark uses a deprecated HBase API,
> public static Token<AuthenticationTokenIdentifier> obtainToken(Configuration conf),
> for obtaining the token, and that API has been removed in HBase 2.1.
>  
> The error log snapshot is attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26432) Not able to connect Hbase 2.1 service Getting NoSuchMethodException while trying to obtain token from Hbase 2.1 service.

2018-12-23 Thread Sujith (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sujith updated SPARK-26432:
---
Summary: Not able to connect Hbase 2.1 service Getting 
NoSuchMethodException while trying to obtain token from Hbase 2.1 service.  
(was: Getting NoSuchMethodException while trying to obtain token from Hbase 2.1 
service)

> Not able to connect Hbase 2.1 service Getting NoSuchMethodException while 
> trying to obtain token from Hbase 2.1 service.
> 
>
> Key: SPARK-26432
> URL: https://issues.apache.org/jira/browse/SPARK-26432
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sujith
>Priority: Major
> Attachments: hbase-dep-obtaintok.png
>
>
> Getting NoSuchMethodException :
> org.apache.hadoop.hbase.security.token.TokenUtil(org.apache.hadoop.conf.Configuration)
> while trying to connect to the HBase 2.1 service from Spark.
> This is mainly happening because Spark uses a deprecated HBase API,
> public static Token<AuthenticationTokenIdentifier> obtainToken(Configuration conf),
> for obtaining the token, and that API has been removed in HBase 2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26432) Getting NoSuchMethodException while trying to obtain token from Hbase 2.1 service

2018-12-23 Thread Sujith (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728015#comment-16728015
 ] 

Sujith commented on SPARK-26432:


cc [~cloud_fan]  [~vanzin]

> Getting NoSuchMethodException while trying to obtain token from Hbase 2.1 
> service
> -
>
> Key: SPARK-26432
> URL: https://issues.apache.org/jira/browse/SPARK-26432
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sujith
>Priority: Major
> Attachments: hbase-dep-obtaintok.png
>
>
> Getting NoSuchMethodException :
> org.apache.hadoop.hbase.security.token.TokenUtil(org.apache.hadoop.conf.Configuration)
> while trying to connect to the HBase 2.1 service from Spark.
> This is mainly happening because Spark uses a deprecated HBase API,
> public static Token<AuthenticationTokenIdentifier> obtainToken(Configuration conf),
> for obtaining the token, and that API has been removed in HBase 2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26432) Getting NoSuchMethodException while trying to obtain token from Hbase 2.1 service

2018-12-23 Thread Sujith (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728014#comment-16728014
 ] 

Sujith commented on SPARK-26432:


This is mainly happening because Spark uses a deprecated HBase API,

public static Token<AuthenticationTokenIdentifier> obtainToken(Configuration conf)

for obtaining the Kerberos security token, and that API has been removed in 
HBase 2.1. As far as I analyzed, there is a more stable API,

public static Token<AuthenticationTokenIdentifier> obtainToken(Connection conn)

in the TokenUtil class; I think Spark should use this stable API for getting 
the delegation token.

To invoke this API, a Connection object first has to be retrieved from 
ConnectionFactory, and that connection can then be passed to 
obtainToken(Connection conn) to get the token.

 

I can raise a PR soon to handle this issue; please let me know if you have any 
clarifications or suggestions.

> Getting NoSuchMethodException while trying to obtain token from Hbase 2.1 
> service
> -
>
> Key: SPARK-26432
> URL: https://issues.apache.org/jira/browse/SPARK-26432
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sujith
>Priority: Major
> Attachments: hbase-dep-obtaintok.png
>
>
> Getting NoSuchMethodException :
> org.apache.hadoop.hbase.security.token.TokenUtil(org.apache.hadoop.conf.Configuration)
> while trying to connect to the HBase 2.1 service from Spark.
> This is mainly happening because Spark uses a deprecated HBase API,
> public static Token<AuthenticationTokenIdentifier> obtainToken(Configuration conf),
> for obtaining the token, and that API has been removed in HBase 2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26432) Getting NoSuchMethodException while trying to obtain token from Hbase 2.1 service

2018-12-23 Thread Sujith (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sujith updated SPARK-26432:
---
Description: 
Getting NoSuchMethodException :

org.apache.hadoop.hbase.security.token.TokenUtil(org.apache.hadoop.conf.Configuration)

while trying to connect to the HBase 2.1 service from Spark.

This is mainly happening because Spark uses a deprecated HBase API,

public static Token<AuthenticationTokenIdentifier> obtainToken(Configuration conf)

for obtaining the token, and that API has been removed in HBase 2.1.

  was:
Getting NoSuchMethodException :

org.apache.hadoop.hbase.security.token.TokenUtil(org.apache.hadoop.conf.Configuration)

while trying to connect to the HBase 2.1 service from Spark.

This is mainly happening because Spark uses a deprecated HBase API for obtaining 
the token, and that API has been removed in HBase 2.1.


> Getting NoSuchMethodException while trying to obtain token from Hbase 2.1 
> service
> -
>
> Key: SPARK-26432
> URL: https://issues.apache.org/jira/browse/SPARK-26432
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sujith
>Priority: Major
> Attachments: hbase-dep-obtaintok.png
>
>
> Getting NoSuchMethodException :
> org.apache.hadoop.hbase.security.token.TokenUtil(org.apache.hadoop.conf.Configuration)
> while trying to connect to the HBase 2.1 service from Spark.
> This is mainly happening because Spark uses a deprecated HBase API,
> public static Token<AuthenticationTokenIdentifier> obtainToken(Configuration conf),
> for obtaining the token, and that API has been removed in HBase 2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26432) Getting NoSuchMethodException while trying to obtain token from Hbase 2.1 service

2018-12-23 Thread Sujith (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sujith updated SPARK-26432:
---
Attachment: hbase-dep-obtaintok.png

> Getting NoSuchMethodException while trying to obtain token from Hbase 2.1 
> service
> -
>
> Key: SPARK-26432
> URL: https://issues.apache.org/jira/browse/SPARK-26432
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sujith
>Priority: Major
> Attachments: hbase-dep-obtaintok.png
>
>
> Getting NoSuchMethodException :
> org.apache.hadoop.hbase.security.token.TokenUtil(org.apache.hadoop.conf.Configuration)
> while trying to connect to the HBase 2.1 service from Spark.
> This is mainly happening because Spark was using a deprecated HBase API for 
> obtaining the token, and that API has been removed in HBase 2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26432) Getting NoSuchMethodException while trying to obtain token from Hbase 2.1 service

2018-12-23 Thread Sujith (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sujith updated SPARK-26432:
---
Description: 
Getting NoSuchMethodException :

org.apache.hadoop.hbase.security.token.TokenUtil(org.apache.hadoop.conf.Configuration)

while trying to connect to the HBase 2.1 service from Spark.

This is mainly happening because Spark uses a deprecated HBase API for obtaining 
the token, and that API has been removed in HBase 2.1.

  was:
Getting NoSuchMethodException :

org.apache.hadoop.hbase.security.token.TokenUtil(org.apache.hadoop.conf.Configuration)

while trying to connect to the HBase 2.1 service from Spark.

This is mainly happening because Spark was using a deprecated HBase API for 
obtaining the token, and that API has been removed in HBase 2.1.


> Getting NoSuchMethodException while trying to obtain token from Hbase 2.1 
> service
> -
>
> Key: SPARK-26432
> URL: https://issues.apache.org/jira/browse/SPARK-26432
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sujith
>Priority: Major
> Attachments: hbase-dep-obtaintok.png
>
>
> Getting NoSuchMethodException :
> org.apache.hadoop.hbase.security.token.TokenUtil(org.apache.hadoop.conf.Configuration)
> while trying to connect to the HBase 2.1 service from Spark.
> This is mainly happening because Spark uses a deprecated HBase API for 
> obtaining the token, and that API has been removed in HBase 2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26432) Getting NoSuchMethodException while trying to obtain token from Hbase 2.1 service

2018-12-23 Thread Sujith (JIRA)
Sujith created SPARK-26432:
--

 Summary: Getting NoSuchMethodException while trying to obtain 
token from Hbase 2.1 service
 Key: SPARK-26432
 URL: https://issues.apache.org/jira/browse/SPARK-26432
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0, 2.3.2
Reporter: Sujith


Getting NoSuchMethodException :

org.apache.hadoop.hbase.security.token.TokenUtil(org.apache.hadoop.conf.Configuration)

while trying to connect to the HBase 2.1 service from Spark.

This is mainly happening because Spark was using a deprecated HBase API for 
obtaining the token, and that API has been removed in HBase 2.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26431) Update availableSlots by availableCpus for barrier taskset

2018-12-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26431:


Assignee: Apache Spark

> Update availableSlots by availableCpus for barrier taskset
> --
>
> Key: SPARK-26431
> URL: https://issues.apache.org/jira/browse/SPARK-26431
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Major
> Fix For: 2.4.0
>
>
> availableCpus decreases as tasks are allocated, so we should update 
> availableSlots from availableCpus for a barrier task set to avoid unnecessary 
> resourceOffer processing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26431) Update availableSlots by availableCpus for barrier taskset

2018-12-23 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26431:


Assignee: (was: Apache Spark)

> Update availableSlots by availableCpus for barrier taskset
> --
>
> Key: SPARK-26431
> URL: https://issues.apache.org/jira/browse/SPARK-26431
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: wuyi
>Priority: Major
> Fix For: 2.4.0
>
>
> availableCpus decreases as tasks are allocated, so we should update 
> availableSlots from availableCpus for a barrier task set to avoid unnecessary 
> resourceOffer processing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26431) Update availableSlots by availableCpus for barrier taskset

2018-12-23 Thread wuyi (JIRA)
wuyi created SPARK-26431:


 Summary: Update availableSlots by availableCpus for barrier taskset
 Key: SPARK-26431
 URL: https://issues.apache.org/jira/browse/SPARK-26431
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: wuyi
 Fix For: 2.4.0


availableCpus decreases as tasks are allocated, so we should update availableSlots 
from availableCpus for a barrier task set to avoid unnecessary resourceOffer 
processing.
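
A hedged sketch of the intended bookkeeping (the names availableCpus, 
cpusPerTask and the helper are illustrative, not the actual TaskSchedulerImpl 
fields):

{code:scala}
// Recompute the slot count from the CPUs that are still free, so a barrier
// task set is only offered resources when all of its tasks can be launched.
def availableSlots(availableCpus: Array[Int], cpusPerTask: Int): Int =
  availableCpus.map(_ / cpusPerTask).sum

// In resourceOffers, a barrier task set with numTasks tasks would be skipped when:
//   availableSlots(availableCpus, cpusPerTask) < numTasks
{code}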



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org