[jira] [Commented] (SPARK-21117) Built-in SQL Function Support - WIDTH_BUCKET

2020-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162520#comment-17162520
 ] 

Apache Spark commented on SPARK-21117:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29183

> Built-in SQL Function Support - WIDTH_BUCKET
> 
>
> Key: SPARK-21117
> URL: https://issues.apache.org/jira/browse/SPARK-21117
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.1.0
>
>
> For a given expression, the {{WIDTH_BUCKET}} function returns the bucket 
> number into which the value of this expression would fall after being 
> evaluated.
> {code:sql}
> WIDTH_BUCKET (expr , min_value , max_value , num_buckets)
> {code}
> Ref: 
> https://docs.oracle.com/cd/B28359_01/olap.111/b28126/dml_functions_2137.htm#OLADM717
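> For illustration, a minimal usage sketch (not from the original report),
> assuming the Oracle-style semantics linked above: the range
> [min_value, max_value) is split into num_buckets equal-width buckets, values
> below min_value map to bucket 0, and values at or above max_value map to
> bucket num_buckets + 1.
> {code:scala}
> // Equal-width bucketing of a value over [0, 100) with 10 buckets.
> spark.sql("SELECT WIDTH_BUCKET(35, 0, 100, 10)").show()   // 4  -> 35 falls in [30, 40)
> spark.sql("SELECT WIDTH_BUCKET(-5, 0, 100, 10)").show()   // 0  -> below min_value
> spark.sql("SELECT WIDTH_BUCKET(100, 0, 100, 10)").show()  // 11 -> at or above max_value
> {code}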






[jira] [Assigned] (SPARK-31922) "RpcEnv already stopped" error when exit spark-shell with local-cluster mode

2020-07-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31922:
-

Assignee: wuyi

> "RpcEnv already stopped" error when exit spark-shell with local-cluster mode
> 
>
> Key: SPARK-31922
> URL: https://issues.apache.org/jira/browse/SPARK-31922
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.6, 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> There's always an error from TransportRequestHandler when exiting spark-shell 
> under local-cluster mode:
>  
> {code:java}
> 20/06/06 23:08:29 ERROR TransportRequestHandler: Error while invoking 
> RpcHandler#receive() for one-way message.20/06/06 23:08:29 ERROR 
> TransportRequestHandler: Error while invoking RpcHandler#receive() for 
> one-way message.org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped. at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:167) at 
> org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:150) 
> at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:691) 
> at 
> org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:253)
>  at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111)
>  at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
>  at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
>  at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>  at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>  at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>  at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>  at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>  at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>  at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) 
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576) 
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) at 
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>  at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) 
> at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>  at java.lang.Thread.run(Thread.java:748)20/06/06 23:08:29 ERROR 
> TransportRequestHandler: Error while invoking RpcHandler#receive() for 
> one-way message.org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped. at 
> org.apache.spark.rpc.nett

[jira] [Resolved] (SPARK-31922) "RpcEnv already stopped" error when exit spark-shell with local-cluster mode

2020-07-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31922.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28746
[https://github.com/apache/spark/pull/28746]

> "RpcEnv already stopped" error when exit spark-shell with local-cluster mode
> 
>
> Key: SPARK-31922
> URL: https://issues.apache.org/jira/browse/SPARK-31922
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.6, 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.1.0
>
>
> There's always an error from TransportRequestHandler when exiting spark-shell 
> under local-cluster mode:
>  
> {code:java}
> 20/06/06 23:08:29 ERROR TransportRequestHandler: Error while invoking 
> RpcHandler#receive() for one-way message.20/06/06 23:08:29 ERROR 
> TransportRequestHandler: Error while invoking RpcHandler#receive() for 
> one-way message.org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped. at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:167) at 
> org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:150) 
> at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:691) 
> at 
> org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:253)
>  at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111)
>  at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
>  at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
>  at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>  at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>  at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>  at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>  at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>  at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>  at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>  at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) 
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576) 
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) at 
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>  at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) 
> at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>  at java.lang.Thread.run(Thread.java:748)20/06/06 23:08:29 ERROR 
> TransportRequestHandler: Error while invoking RpcHand

[jira] [Commented] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost

2020-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162479#comment-17162479
 ] 

Apache Spark commented on SPARK-32003:
--

User 'wypoon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29182

> Shuffle files for lost executor are not unregistered if fetch failure occurs 
> after executor is lost
> ---
>
> Key: SPARK-32003
> URL: https://issues.apache.org/jira/browse/SPARK-32003
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Wing Yew Poon
>Priority: Major
>
> A customer's cluster has a node that goes down while a Spark application is 
> running. (They are running Spark on YARN with the external shuffle service 
> enabled.) An executor is lost (apparently the only one running on the node). 
> This executor lost event is handled in the DAGScheduler, which removes the 
> executor from its BlockManagerMaster. At this point, there is no 
> unregistering of shuffle files for the executor or the node. Soon after, 
> tasks trying to fetch shuffle files output by that executor fail with 
> FetchFailed (because the node is down, there is no NodeManager available to 
> serve shuffle files). In principle, such fetch failures should cause the shuffle 
> files for the executor to be unregistered, but they do not.
> Due to the task failures, the stage is re-attempted. Tasks continue to fail due 
> to fetch failures from the lost executor's shuffle output. This time, since the 
> failed epoch for the executor is higher, the executor is removed again (this 
> doesn't really do anything, as the executor was already removed when it was 
> lost), and this time the shuffle output is unregistered.
> So it takes two stage attempts instead of one to clear the shuffle output. We 
> get 4 attempts by default. The customer was unlucky and two nodes went down 
> during the stage, i.e., the same problem happened twice. So they used up all 4 
> stage attempts, the stage failed, and thus so did the job.
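> To make the two-attempt behaviour concrete, here is a deliberately simplified 
> sketch of the epoch guard being described (hypothetical names, not the actual 
> DAGScheduler code):
> {code:scala}
> // Hypothetical simplification: shuffle output for an executor is only
> // unregistered when the incoming failure carries a newer epoch than the one
> // already recorded for that executor.
> def maybeUnregisterShuffleOutput(
>     execId: String,
>     failureEpoch: Long,
>     recordedEpoch: scala.collection.mutable.Map[String, Long]): Boolean = {
>   if (recordedEpoch.get(execId).forall(_ < failureEpoch)) {
>     recordedEpoch(execId) = failureEpoch
>     true   // later stage attempt: higher epoch, shuffle output gets unregistered
>   } else {
>     false  // first fetch failure after the executor-lost event: same epoch, skipped
>   }
> }
> {code}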






[jira] [Commented] (SPARK-32351) Partially pushed partition filters are not explained

2020-07-21 Thread pavithra ramachandran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162477#comment-17162477
 ] 

pavithra ramachandran commented on SPARK-32351:
---

I would like to check this.

> Partially pushed partition filters are not explained
> 
>
> Key: SPARK-32351
> URL: https://issues.apache.org/jira/browse/SPARK-32351
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:scala}
> spark.sql(
>   s"""
>  |CREATE TABLE t(i INT, p STRING)
>  |USING parquet
>  |PARTITIONED BY (p)""".stripMargin)
> spark.range(0, 1000).selectExpr("id as col").createOrReplaceTempView("temp")
> for (part <- Seq(1, 2, 3, 4)) {
>   sql(s"""
>  |INSERT OVERWRITE TABLE t PARTITION (p='$part')
>  |SELECT col FROM temp""".stripMargin)
> }
> spark.sql("SELECT * FROM t WHERE  WHERE (p = '1' AND i = 1) OR (p = '2' and i 
> = 2)").explain
> {code}
> We have pushed down {{p = '1' or p = '2'}} since SPARK-28169, but this pushed 
> down filter not in explain
> {noformat}
> == Physical Plan ==
> *(1) Filter (((p#21 = 1) AND (i#20 = 1)) OR ((p#21 = 2) AND (i#20 = 2)))
> +- *(1) ColumnarToRow
>+- FileScan parquet default.t[i#20,p#21] Batched: true, DataFilters: [], 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/spark/SPARK-32289/sql/core/spark-warehouse/org.apache.spark...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> {noformat}






[jira] [Updated] (SPARK-32059) Nested Schema Pruning not Working in Window Functions

2020-07-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32059:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Nested Schema Pruning not Working in Window Functions
> -
>
> Key: SPARK-32059
> URL: https://issues.apache.org/jira/browse/SPARK-32059
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Frank Yin
>Priority: Major
>
> Using tables and data structures in `SchemaPruningSuite.scala`
>  
> {code:java}
> // code placeholder
> case class FullName(first: String, middle: String, last: String)
> case class Company(name: String, address: String)
> case class Employer(id: Int, company: Company)
> case class Contact(
>   id: Int,
>   name: FullName,
>   address: String,
>   pets: Int,
>   friends: Array[FullName] = Array.empty,
>   relatives: Map[String, FullName] = Map.empty,
>   employer: Employer = null,
>   relations: Map[FullName, String] = Map.empty)
> case class Department(
>   depId: Int,
>   depName: String,
>   contactId: Int,
>   employer: Employer)
> {code}
>  
> The query to run:
> {code:java}
> // code placeholder
> select a.name.first from (select row_number() over (partition by address 
> order by id desc) as __rank, contacts.* from contacts) a where a.name.first = 
> 'A' AND a.__rank = 1
> {code}
>  
> The current physical plan:
> {code:java}
> // code placeholder
> == Physical Plan ==
> *(3) Project [name#46.first AS first#74]
> +- *(3) Filter (((isnotnull(name#46) AND isnotnull(__rank#71)) AND 
> (name#46.first = A)) AND (__rank#71 = 1))
>+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS 
> LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) 
> AS __rank#71], [address#47], [id#45 DESC NULLS LAST]
>   +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], 
> false, 0
>  +- Exchange hashpartitioning(address#47, 5), true, [id=#52]
> +- *(1) Project [id#45, name#46, address#47]
>+- FileScan parquet [id#45,name#46,address#47,p#53] Batched: 
> false, DataFilters: [], Format: Parquet, Location: 
> InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-85d173af-42...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct,address:string>
> {code}
>  
> The desired physical plan:
>  
> {code:java}
> // code placeholder
> == Physical Plan ==
> *(3) Project [_gen_alias_77#77 AS first#74]
> +- *(3) Filter (((isnotnull(_gen_alias_77#77) AND isnotnull(__rank#71)) AND 
> (_gen_alias_77#77 = A)) AND (__rank#71 = 1))
>+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS 
> LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) 
> AS __rank#71], [address#47], [id#45 DESC NULLS LAST]
>   +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], 
> false, 0
>  +- Exchange hashpartitioning(address#47, 5), true, [id=#52]
> +- *(1) Project [id#45, name#46.first AS _gen_alias_77#77, 
> address#47]
>+- FileScan parquet [id#45,name#46,address#47,p#53] Batched: 
> false, DataFilters: [], Format: Parquet, Location: 
> InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-c64e0b29-d9...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct,address:string>
> {code}






[jira] [Resolved] (SPARK-32350) Add batch write support on LevelDB to improve performance of HybridStore

2020-07-21 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-32350.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29149
[https://github.com/apache/spark/pull/29149]

> Add batch write support on LevelDB to improve performance of HybridStore
> 
>
> Key: SPARK-32350
> URL: https://issues.apache.org/jira/browse/SPARK-32350
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Baohe Zhang
>Assignee: Baohe Zhang
>Priority: Major
> Fix For: 3.1.0
>
>
> The idea is to improve the performance of HybridStore by adding batch write 
> support to LevelDB. https://issues.apache.org/jira/browse/SPARK-31608 
> introduces HybridStore. HybridStore writes data to InMemoryStore first and 
> uses a background thread to dump the data to LevelDB once the writing to 
> InMemoryStore is completed. In the comments section of 
> [https://github.com/apache/spark/pull/28412], Mridul Muralidharan mentioned 
> that batch writing can improve the performance of this dumping process, and 
> he wrote the code for writeAll(). (A sketch of the batch-write pattern 
> follows the tables below.)
> I compared the HybridStore switching time between one-by-one writes and 
> batch writes on an HDD. When the disk is idle, the batch write gives around a 
> 25% improvement; when the disk is 100% busy, it gives a 7x - 10x improvement.
> when the disk is at 0% utilization:
>  
> ||log size, jobs and tasks per job||original switching time, with write()||switching time with writeAll()||
> |133m, 400 jobs, 100 tasks per job|16s|13s|
> |265m, 400 jobs, 200 tasks per job|30s|23s|
> |1.3g, 1000 jobs, 400 tasks per job|136s|108s|
>  
> when the disk is at 100% utilization:
> ||log size, jobs and tasks per job||original switching time, with write()||switching time with writeAll()||
> |133m, 400 jobs, 100 tasks per job|116s|17s|
> |265m, 400 jobs, 200 tasks per job|251s|26s|
> I also ran some write related benchmarking tests on LevelDBBenchmark.java and 
> measured the total time of writing 1024 objects.
> when the disk is at 0% utilization:
>  
> ||Benchmark test||with write(), ms||with writeAll(), ms ||
> |randomUpdatesIndexed|213.060|157.356|
> |randomUpdatesNoIndex|57.869|35.439|
> |randomWritesIndexed|298.854|229.274|
> |randomWritesNoIndex|66.764|38.361|
> |sequentialUpdatesIndexed|87.019|56.219|
> |sequentialUpdatesNoIndex|61.851|41.942|
> |sequentialWritesIndexed|94.044|56.534|
> |sequentialWritesNoIndex|118.345|66.483|
>  
> when the disk is at 50% utilization:
> ||Benchmark test||with write(), ms||with writeAll(), ms||
> |randomUpdatesIndexed|230.386|180.817|
> |randomUpdatesNoIndex|58.935|50.113|
> |randomWritesIndexed|315.241|254.400|
> |randomWritesNoIndex|96.709|41.164|
> |sequentialUpdatesIndexed|89.971|70.387|
> |sequentialUpdatesNoIndex|72.021|53.769|
> |sequentialWritesIndexed|103.052|67.358|
> |sequentialWritesNoIndex|76.194|99.037|
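> For context, a minimal sketch of the batch-write pattern against the 
> underlying LevelDB JNI API (illustrative only, assuming the org.iq80.leveldb / 
> leveldbjni classes that the kvstore module builds on; not the Spark patch 
> itself):
> {code:scala}
> import java.io.File
> import java.nio.charset.StandardCharsets.UTF_8
> import org.fusesource.leveldbjni.JniDBFactory.factory
> import org.iq80.leveldb.Options
> 
> val db = factory.open(new File("/tmp/kvstore-demo"), new Options().createIfMissing(true))
> try {
>   // One-by-one writes: every put() is its own LevelDB write operation.
>   db.put("k1".getBytes(UTF_8), "v1".getBytes(UTF_8))
> 
>   // Batch write: buffer many puts and commit them with a single write() call,
>   // which is where a writeAll()-style API saves time on a busy disk.
>   val batch = db.createWriteBatch()
>   try {
>     (1 to 1024).foreach { i =>
>       batch.put(s"key-$i".getBytes(UTF_8), s"value-$i".getBytes(UTF_8))
>     }
>     db.write(batch)
>   } finally {
>     batch.close()
>   }
> } finally {
>   db.close()
> }
> {code}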






[jira] [Assigned] (SPARK-32350) Add batch write support on LevelDB to improve performance of HybridStore

2020-07-21 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-32350:


Assignee: Baohe Zhang

> Add batch write support on LevelDB to improve performance of HybridStore
> 
>
> Key: SPARK-32350
> URL: https://issues.apache.org/jira/browse/SPARK-32350
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Baohe Zhang
>Assignee: Baohe Zhang
>Priority: Major
>
> The idea is to improve the performance of HybridStore by adding batch write 
> support to LevelDB. https://issues.apache.org/jira/browse/SPARK-31608 
> introduces HybridStore. HybridStore writes data to InMemoryStore first and 
> uses a background thread to dump the data to LevelDB once the writing to 
> InMemoryStore is completed. In the comments section of 
> [https://github.com/apache/spark/pull/28412], Mridul Muralidharan mentioned 
> that batch writing can improve the performance of this dumping process, and 
> he wrote the code for writeAll().
> I compared the HybridStore switching time between one-by-one writes and 
> batch writes on an HDD. When the disk is idle, the batch write gives around a 
> 25% improvement; when the disk is 100% busy, it gives a 7x - 10x improvement.
> when the disk is at 0% utilization:
>  
> ||log size, jobs and tasks per job||original switching time, with write()||switching time with writeAll()||
> |133m, 400 jobs, 100 tasks per job|16s|13s|
> |265m, 400 jobs, 200 tasks per job|30s|23s|
> |1.3g, 1000 jobs, 400 tasks per job|136s|108s|
>  
> when the disk is at 100% utilization:
> ||log size, jobs and tasks per job||original switching time, with write()||switching time with writeAll()||
> |133m, 400 jobs, 100 tasks per job|116s|17s|
> |265m, 400 jobs, 200 tasks per job|251s|26s|
> I also ran some write related benchmarking tests on LevelDBBenchmark.java and 
> measured the total time of writing 1024 objects.
> when the disk is at 0% utilization:
>  
> ||Benchmark test||with write(), ms||with writeAll(), ms ||
> |randomUpdatesIndexed|213.060|157.356|
> |randomUpdatesNoIndex|57.869|35.439|
> |randomWritesIndexed|298.854|229.274|
> |randomWritesNoIndex|66.764|38.361|
> |sequentialUpdatesIndexed|87.019|56.219|
> |sequentialUpdatesNoIndex|61.851|41.942|
> |sequentialWritesIndexed|94.044|56.534|
> |sequentialWritesNoIndex|118.345|66.483|
>  
> when the disk is at 50% utilization:
> ||Benchmark test||with write(), ms||with writeAll(), ms||
> |randomUpdatesIndexed|230.386|180.817|
> |randomUpdatesNoIndex|58.935|50.113|
> |randomWritesIndexed|315.241|254.400|
> |randomWritesNoIndex|96.709|41.164|
> |sequentialUpdatesIndexed|89.971|70.387|
> |sequentialUpdatesNoIndex|72.021|53.769|
> |sequentialWritesIndexed|103.052|67.358|
> |sequentialWritesNoIndex|76.194|99.037|






[jira] [Updated] (SPARK-32330) Preserve shuffled hash join build side partitioning

2020-07-21 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-32330:

Priority: Major  (was: Trivial)

> Preserve shuffled hash join build side partitioning
> ---
>
> Key: SPARK-32330
> URL: https://issues.apache.org/jira/browse/SPARK-32330
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently `ShuffledHashJoin.outputPartitioning` inherits from 
> `HashJoin.outputPartitioning`, which only preserves stream side partitioning:
> `HashJoin.scala`
> {code:java}
> override def outputPartitioning: Partitioning = 
> streamedPlan.outputPartitioning
> {code}
> This loses the build-side partitioning information and causes an extra 
> shuffle if there is another join or group-by after this join.
> Example:
>  
> {code:java}
> // code placeholder
> withSQLConf(
> SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50",
> SQLConf.SHUFFLE_PARTITIONS.key -> "2",
> SQLConf.PREFER_SORTMERGEJOIN.key -> "false") {
>   val df1 = spark.range(10).select($"id".as("k1"))
>   val df2 = spark.range(30).select($"id".as("k2"))
>   Seq("inner", "cross").foreach(joinType => {
> val plan = df1.join(df2, $"k1" === $"k2", joinType).groupBy($"k1").count()
>   .queryExecution.executedPlan
> assert(plan.collect { case _: ShuffledHashJoinExec => true }.size === 1)
> // No extra shuffle before aggregate
> assert(plan.collect { case _: ShuffleExchangeExec => true }.size === 2)
>   })
> }{code}
>  
> Current physical plan (having an extra shuffle on `k1` before aggregate)
>  
> {code:java}
> *(4) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, 
> count#235L])
> +- Exchange hashpartitioning(k1#220L, 2), true, [id=#117]
>+- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], 
> output=[k1#220L, count#239L])
>   +- *(3) Project [k1#220L]
>  +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft
> :- Exchange hashpartitioning(k1#220L, 2), true, [id=#109]
> :  +- *(1) Project [id#218L AS k1#220L]
> : +- *(1) Range (0, 10, step=1, splits=2)
> +- Exchange hashpartitioning(k2#224L, 2), true, [id=#111]
>+- *(2) Project [id#222L AS k2#224L]
>   +- *(2) Range (0, 30, step=1, splits=2){code}
>  
> Ideal physical plan (no shuffle on `k1` before aggregate)
> {code:java}
>  *(3) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, 
> count#235L])
> +- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], 
> output=[k1#220L, count#239L])
>+- *(3) Project [k1#220L]
>   +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft
>  :- Exchange hashpartitioning(k1#220L, 2), true, [id=#107]
>  :  +- *(1) Project [id#218L AS k1#220L]
>  : +- *(1) Range (0, 10, step=1, splits=2)
>  +- Exchange hashpartitioning(k2#224L, 2), true, [id=#109]
> +- *(2) Project [id#222L AS k2#224L]
>+- *(2) Range (0, 30, step=1, splits=2){code}
>  
> This can be fixed by overriding the `outputPartitioning` method in 
> `ShuffledHashJoinExec`, similar to `SortMergeJoinExec` (see the sketch below).
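> A rough sketch of the kind of override meant here (simplified to inner-like 
> joins only; a real change has to handle the other join types the way 
> `SortMergeJoinExec` does):
> {code:scala}
> // Inside ShuffledHashJoinExec (sketch only): expose both children's partitioning
> // for inner-like joins so a following join/aggregate on the build-side key can
> // reuse the existing shuffle instead of adding another Exchange.
> override def outputPartitioning: Partitioning = joinType match {
>   case _: InnerLike =>
>     PartitioningCollection(Seq(left.outputPartitioning, right.outputPartitioning))
>   case _ =>
>     streamedPlan.outputPartitioning  // fall back to the current stream-side behavior
> }
> {code}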






[jira] [Commented] (SPARK-23844) Socket Stream recovering from checkpoint will throw exception

2020-07-21 Thread pengzhiwei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162438#comment-17162438
 ] 

pengzhiwei commented on SPARK-23844:


Thanks [~jerryshao2015], I have met the same issue as you.

> Socket Stream recovering from checkpoint will throw exception
> -
>
> Key: SPARK-23844
> URL: https://issues.apache.org/jira/browse/SPARK-23844
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Priority: Major
>
> When we specify a checkpoint location and use the socket streaming source, it 
> will throw an exception after a rerun:
> {noformat}
> 18/04/02 14:11:28 ERROR MicroBatchExecution: Query test [id = 
> c5ca82b2-550b-4c3d-9127-869f1aeae477, runId = 
> 552d5bd4-a7e7-44e5-a85a-2f04f666ff6a] terminated with error
> java.lang.RuntimeException: Offsets committed out of order: 0 followed by -1
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.execution.streaming.sources.TextSocketMicroBatchReader.commit(socket.scala:196)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcV$sp$5.apply(MicroBatchExecution.scala:373)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcV$sp$5.apply(MicroBatchExecution.scala:370)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at 
> org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply$mcV$sp(MicroBatchExecution.scala:370)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply(MicroBatchExecution.scala:353)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply(MicroBatchExecution.scala:353)
> at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch(MicroBatchExecution.scala:353)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:142)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:135)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:135)
> at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:135)
> at 
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:131)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189){noformat}
> Basically it means that {{TextSocketMicroBatchReader}} is honoring the 
> offsets recovered from the checkpoint, but this is not correct for the socket 
> source, as it doesn't support recovering from a checkpoint. Even though the 
> offset is recovered, the real data no longer matches this offset.
> To reproduce this issue,
> {code:java}
> val socket = spark.readStream.format("socket").options(Map("host" -> 
> "localhost", "port" -> "")).load
> spark.conf.set("spark.sql.streaming.checkpointLocation", "./checkpoint")
> socket.writeStream.format("parquet").option("path", 
> "./result").queryName("test").s

[jira] [Assigned] (SPARK-32383) Preserve hash join (BHJ and SHJ) stream side ordering

2020-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32383:


Assignee: (was: Apache Spark)

> Preserve hash join (BHJ and SHJ) stream side ordering
> -
>
> Key: SPARK-32383
> URL: https://issues.apache.org/jira/browse/SPARK-32383
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Trivial
>
> Currently `BroadcastHashJoinExec` and `ShuffledHashJoinExec` do not preserve 
> the children's output ordering information (they inherit from 
> `SparkPlan.outputOrdering`, which is Nil). This can add an unnecessary sort in 
> complex queries involving multiple joins.
> Example:
>  
> {code:java}
> withSQLConf(
>   SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50") {
>   val df1 = spark.range(100).select($"id".as("k1"))
>   val df2 = spark.range(100).select($"id".as("k2"))
>   val df3 = spark.range(3).select($"id".as("k3"))
>   val df4 = spark.range(100).select($"id".as("k4"))
>   val plan = df1.join(df2, $"k1" === $"k2")
> .join(df3, $"k1" === $"k3")
> .join(df4, $"k1" === $"k4")
> .queryExecution
> .executedPlan
> }
> {code}
>  
> Current physical plan (extra sort on `k1` before top sort merge join):
> {code:java}
> *(9) SortMergeJoin [k1#220L], [k4#232L], Inner
> :- *(6) Sort [k1#220L ASC NULLS FIRST], false, 0
> :  +- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight
> : :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner
> : :  :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0
> : :  :  +- Exchange hashpartitioning(k1#220L, 5), true, [id=#128]
> : :  : +- *(1) Project [id#218L AS k1#220L]
> : :  :+- *(1) Range (0, 100, step=1, splits=2)
> : :  +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0
> : : +- Exchange hashpartitioning(k2#224L, 5), true, [id=#134]
> : :+- *(3) Project [id#222L AS k2#224L]
> : :   +- *(3) Range (0, 100, step=1, splits=2)
> : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
> false])), [id=#141]
> :+- *(5) Project [id#226L AS k3#228L]
> :   +- *(5) Range (0, 3, step=1, splits=2)
> +- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(k4#232L, 5), true, [id=#148]
>   +- *(7) Project [id#230L AS k4#232L]
>  +- *(7) Range (0, 100, step=1, splits=2)
> {code}
> Ideal physical plan (no extra sort on `k1` before top sort merge join):
> {code:java}
> *(9) SortMergeJoin [k1#220L], [k4#232L], Inner
> :- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight
> :  :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner
> :  :  :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0
> :  :  :  +- Exchange hashpartitioning(k1#220L, 5), true, [id=#127]
> :  :  : +- *(1) Project [id#218L AS k1#220L]
> :  :  :+- *(1) Range (0, 100, step=1, splits=2)
> :  :  +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0
> :  : +- Exchange hashpartitioning(k2#224L, 5), true, [id=#133]
> :  :+- *(3) Project [id#222L AS k2#224L]
> :  :   +- *(3) Range (0, 100, step=1, splits=2)
> :  +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
> false])), [id=#140]
> : +- *(5) Project [id#226L AS k3#228L]
> :+- *(5) Range (0, 3, step=1, splits=2)
> +- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(k4#232L, 5), true, [id=#146]
>   +- *(7) Project [id#230L AS k4#232L]
>  +- *(7) Range (0, 100, step=1, splits=2){code}
>  






[jira] [Commented] (SPARK-32383) Preserve hash join (BHJ and SHJ) stream side ordering

2020-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162413#comment-17162413
 ] 

Apache Spark commented on SPARK-32383:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/29181

> Preserve hash join (BHJ and SHJ) stream side ordering
> -
>
> Key: SPARK-32383
> URL: https://issues.apache.org/jira/browse/SPARK-32383
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Trivial
>
> Currently `BroadcastHashJoinExec` and `ShuffledHashJoinExec` do not preserve 
> the children's output ordering information (they inherit from 
> `SparkPlan.outputOrdering`, which is Nil). This can add an unnecessary sort in 
> complex queries involving multiple joins.
> Example:
>  
> {code:java}
> withSQLConf(
>   SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50") {
>   val df1 = spark.range(100).select($"id".as("k1"))
>   val df2 = spark.range(100).select($"id".as("k2"))
>   val df3 = spark.range(3).select($"id".as("k3"))
>   val df4 = spark.range(100).select($"id".as("k4"))
>   val plan = df1.join(df2, $"k1" === $"k2")
> .join(df3, $"k1" === $"k3")
> .join(df4, $"k1" === $"k4")
> .queryExecution
> .executedPlan
> }
> {code}
>  
> Current physical plan (extra sort on `k1` before top sort merge join):
> {code:java}
> *(9) SortMergeJoin [k1#220L], [k4#232L], Inner
> :- *(6) Sort [k1#220L ASC NULLS FIRST], false, 0
> :  +- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight
> : :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner
> : :  :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0
> : :  :  +- Exchange hashpartitioning(k1#220L, 5), true, [id=#128]
> : :  : +- *(1) Project [id#218L AS k1#220L]
> : :  :+- *(1) Range (0, 100, step=1, splits=2)
> : :  +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0
> : : +- Exchange hashpartitioning(k2#224L, 5), true, [id=#134]
> : :+- *(3) Project [id#222L AS k2#224L]
> : :   +- *(3) Range (0, 100, step=1, splits=2)
> : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
> false])), [id=#141]
> :+- *(5) Project [id#226L AS k3#228L]
> :   +- *(5) Range (0, 3, step=1, splits=2)
> +- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(k4#232L, 5), true, [id=#148]
>   +- *(7) Project [id#230L AS k4#232L]
>  +- *(7) Range (0, 100, step=1, splits=2)
> {code}
> Ideal physical plan (no extra sort on `k1` before top sort merge join):
> {code:java}
> *(9) SortMergeJoin [k1#220L], [k4#232L], Inner
> :- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight
> :  :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner
> :  :  :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0
> :  :  :  +- Exchange hashpartitioning(k1#220L, 5), true, [id=#127]
> :  :  : +- *(1) Project [id#218L AS k1#220L]
> :  :  :+- *(1) Range (0, 100, step=1, splits=2)
> :  :  +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0
> :  : +- Exchange hashpartitioning(k2#224L, 5), true, [id=#133]
> :  :+- *(3) Project [id#222L AS k2#224L]
> :  :   +- *(3) Range (0, 100, step=1, splits=2)
> :  +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
> false])), [id=#140]
> : +- *(5) Project [id#226L AS k3#228L]
> :+- *(5) Range (0, 3, step=1, splits=2)
> +- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(k4#232L, 5), true, [id=#146]
>   +- *(7) Project [id#230L AS k4#232L]
>  +- *(7) Range (0, 100, step=1, splits=2){code}
>  






[jira] [Assigned] (SPARK-32383) Preserve hash join (BHJ and SHJ) stream side ordering

2020-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32383:


Assignee: Apache Spark

> Preserve hash join (BHJ and SHJ) stream side ordering
> -
>
> Key: SPARK-32383
> URL: https://issues.apache.org/jira/browse/SPARK-32383
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Apache Spark
>Priority: Trivial
>
> Currently `BroadcastHashJoinExec` and `ShuffledHashJoinExec` do not preserve 
> the children's output ordering information (they inherit from 
> `SparkPlan.outputOrdering`, which is Nil). This can add an unnecessary sort in 
> complex queries involving multiple joins.
> Example:
>  
> {code:java}
> withSQLConf(
>   SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50") {
>   val df1 = spark.range(100).select($"id".as("k1"))
>   val df2 = spark.range(100).select($"id".as("k2"))
>   val df3 = spark.range(3).select($"id".as("k3"))
>   val df4 = spark.range(100).select($"id".as("k4"))
>   val plan = df1.join(df2, $"k1" === $"k2")
> .join(df3, $"k1" === $"k3")
> .join(df4, $"k1" === $"k4")
> .queryExecution
> .executedPlan
> }
> {code}
>  
> Current physical plan (extra sort on `k1` before top sort merge join):
> {code:java}
> *(9) SortMergeJoin [k1#220L], [k4#232L], Inner
> :- *(6) Sort [k1#220L ASC NULLS FIRST], false, 0
> :  +- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight
> : :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner
> : :  :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0
> : :  :  +- Exchange hashpartitioning(k1#220L, 5), true, [id=#128]
> : :  : +- *(1) Project [id#218L AS k1#220L]
> : :  :+- *(1) Range (0, 100, step=1, splits=2)
> : :  +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0
> : : +- Exchange hashpartitioning(k2#224L, 5), true, [id=#134]
> : :+- *(3) Project [id#222L AS k2#224L]
> : :   +- *(3) Range (0, 100, step=1, splits=2)
> : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
> false])), [id=#141]
> :+- *(5) Project [id#226L AS k3#228L]
> :   +- *(5) Range (0, 3, step=1, splits=2)
> +- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(k4#232L, 5), true, [id=#148]
>   +- *(7) Project [id#230L AS k4#232L]
>  +- *(7) Range (0, 100, step=1, splits=2)
> {code}
> Ideal physical plan (no extra sort on `k1` before top sort merge join):
> {code:java}
> *(9) SortMergeJoin [k1#220L], [k4#232L], Inner
> :- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight
> :  :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner
> :  :  :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0
> :  :  :  +- Exchange hashpartitioning(k1#220L, 5), true, [id=#127]
> :  :  : +- *(1) Project [id#218L AS k1#220L]
> :  :  :+- *(1) Range (0, 100, step=1, splits=2)
> :  :  +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0
> :  : +- Exchange hashpartitioning(k2#224L, 5), true, [id=#133]
> :  :+- *(3) Project [id#222L AS k2#224L]
> :  :   +- *(3) Range (0, 100, step=1, splits=2)
> :  +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
> false])), [id=#140]
> : +- *(5) Project [id#226L AS k3#228L]
> :+- *(5) Range (0, 3, step=1, splits=2)
> +- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(k4#232L, 5), true, [id=#146]
>   +- *(7) Project [id#230L AS k4#232L]
>  +- *(7) Range (0, 100, step=1, splits=2){code}
>  






[jira] [Created] (SPARK-32383) Preserve hash join (BHJ and SHJ) stream side ordering

2020-07-21 Thread Cheng Su (Jira)
Cheng Su created SPARK-32383:


 Summary: Preserve hash join (BHJ and SHJ) stream side ordering
 Key: SPARK-32383
 URL: https://issues.apache.org/jira/browse/SPARK-32383
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Cheng Su


Currently `BroadcastHashJoinExec` and `ShuffledHashJoinExec` do not preserve 
the children's output ordering information (they inherit from 
`SparkPlan.outputOrdering`, which is Nil). This can add an unnecessary sort in 
complex queries involving multiple joins.

Example:

 
{code:java}
withSQLConf(
  SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50") {
  val df1 = spark.range(100).select($"id".as("k1"))
  val df2 = spark.range(100).select($"id".as("k2"))
  val df3 = spark.range(3).select($"id".as("k3"))
  val df4 = spark.range(100).select($"id".as("k4"))
  val plan = df1.join(df2, $"k1" === $"k2")
.join(df3, $"k1" === $"k3")
.join(df4, $"k1" === $"k4")
.queryExecution
.executedPlan
}
{code}
 

Current physical plan (extra sort on `k1` before top sort merge join):
{code:java}
*(9) SortMergeJoin [k1#220L], [k4#232L], Inner
:- *(6) Sort [k1#220L ASC NULLS FIRST], false, 0
:  +- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight
: :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner
: :  :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0
: :  :  +- Exchange hashpartitioning(k1#220L, 5), true, [id=#128]
: :  : +- *(1) Project [id#218L AS k1#220L]
: :  :+- *(1) Range (0, 100, step=1, splits=2)
: :  +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(k2#224L, 5), true, [id=#134]
: :+- *(3) Project [id#222L AS k2#224L]
: :   +- *(3) Range (0, 100, step=1, splits=2)
: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
false])), [id=#141]
:+- *(5) Project [id#226L AS k3#228L]
:   +- *(5) Range (0, 3, step=1, splits=2)
+- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(k4#232L, 5), true, [id=#148]
  +- *(7) Project [id#230L AS k4#232L]
 +- *(7) Range (0, 100, step=1, splits=2)
{code}
Ideal physical plan (no extra sort on `k1` before top sort merge join):
{code:java}
*(9) SortMergeJoin [k1#220L], [k4#232L], Inner
:- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight
:  :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner
:  :  :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0
:  :  :  +- Exchange hashpartitioning(k1#220L, 5), true, [id=#127]
:  :  : +- *(1) Project [id#218L AS k1#220L]
:  :  :+- *(1) Range (0, 100, step=1, splits=2)
:  :  +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0
:  : +- Exchange hashpartitioning(k2#224L, 5), true, [id=#133]
:  :+- *(3) Project [id#222L AS k2#224L]
:  :   +- *(3) Range (0, 100, step=1, splits=2)
:  +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
false])), [id=#140]
: +- *(5) Project [id#226L AS k3#228L]
:+- *(5) Range (0, 3, step=1, splits=2)
+- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(k4#232L, 5), true, [id=#146]
  +- *(7) Project [id#230L AS k4#232L]
 +- *(7) Range (0, 100, step=1, splits=2){code}
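A rough sketch of the kind of change implied (simplified and hypothetical in 
placement; an actual fix needs to cover both hash join exec nodes):
{code:scala}
// Sketch: in the hash join exec nodes, surface the streamed child's ordering
// instead of the SparkPlan default (Nil), so EnsureRequirements can drop the
// extra Sort on the already-sorted stream side.
override def outputOrdering: Seq[SortOrder] = streamedPlan.outputOrdering
{code}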
 






[jira] [Resolved] (SPARK-32286) Coalesce bucketed tables for shuffled hash join if applicable

2020-07-21 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-32286.
--
Fix Version/s: 3.1.0
 Assignee: Cheng Su
   Resolution: Fixed

Resolved by 
[https://github.com/apache/spark/pull/29079|https://github.com/apache/spark/pull/29079#]

> Coalesce bucketed tables for shuffled hash join if applicable
> -
>
> Key: SPARK-32286
> URL: https://issues.apache.org/jira/browse/SPARK-32286
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Trivial
> Fix For: 3.1.0
>
>
> Based on a follow-up comment in PR 
> [#28123|https://github.com/apache/spark/pull/28123], we can coalesce 
> buckets for shuffled hash join as well. Note that we only coalesce the 
> buckets on the shuffled hash join stream side (i.e. the side not building the 
> hash map), so we don't need to worry about OOM when coalescing multiple 
> buckets into one task for building the hash map.






[jira] [Resolved] (SPARK-24266) Spark client terminates while driver is still running

2020-07-21 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau resolved SPARK-24266.
--
   Fix Version/s: 3.1.0
Target Version/s: 3.1.0  (was: 2.4.7, 3.1.0)
  Resolution: Fixed

> Spark client terminates while driver is still running
> -
>
> Key: SPARK-24266
> URL: https://issues.apache.org/jira/browse/SPARK-24266
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.3.0, 3.0.0
>Reporter: Chun Chen
>Priority: Major
> Fix For: 3.1.0
>
>
> {code}
> Warning: Ignoring non-spark config property: Default=system properties 
> included when running spark-submit.
> 18/05/11 14:50:12 WARN Config: Error reading service account token from: 
> [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
> 18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: 
> Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf)
> 18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> 18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. 
> Mounting Hadoop specific files
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: N/A
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: 2018-05-11T06:50:17Z
>container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
>phase: Pending
>status: [ContainerStatus(containerID=null, 
> image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
> lastState=ContainerState(running=null, terminated=null, waiting=null, 
> additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
> restartCount=0, state=ContainerState(running=null, terminated=null, 
> waiting=ContainerStateWaiting(message=null, reason=PodInitializing, 
> additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to 
> finish...
> 18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hado

[jira] [Updated] (SPARK-32377) CaseInsensitiveMap should be deterministic for addition

2020-07-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32377:
--
Reporter: Girish A Pandit  (was: Dongjoon Hyun)

> CaseInsensitiveMap should be deterministic for addition
> ---
>
> Key: SPARK-32377
> URL: https://issues.apache.org/jira/browse/SPARK-32377
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: Girish A Pandit
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.7, 3.0.1, 3.1.0
>
>
> {code}
> import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
> var m = CaseInsensitiveMap(Map.empty[String, String])
> Seq(("paTh", "1"), ("PATH", "2"), ("Path", "3"), ("patH", "4"), ("path", 
> "5")).foreach { kv =>
>   m = (m + kv).asInstanceOf[CaseInsensitiveMap[String]]
>   println(m.get("path"))
> }
> Some(1)
> Some(2)
> Some(3)
> Some(4)
> Some(1)
> {code}






[jira] [Assigned] (SPARK-17333) Make pyspark interface friendly with mypy static analysis

2020-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-17333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17333:


Assignee: Apache Spark

> Make pyspark interface friendly with mypy static analysis
> -
>
> Key: SPARK-17333
> URL: https://issues.apache.org/jira/browse/SPARK-17333
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Assaf Mendelson
>Assignee: Apache Spark
>Priority: Trivial
>
> Static analysis tools, such as those common in IDEs for auto-completion and 
> error marking, tend to have poor results with pyspark.
> This is caused by two separate issues:
> The first is that many elements are created programmatically, such as the max 
> function in pyspark.sql.functions.
> The second is that we tend to use pyspark in a functional manner, meaning 
> that we chain many actions (e.g. df.filter().groupby().agg()), and since 
> python has no type information this can become difficult to understand.
> I would suggest changing the interface to improve it. 
> The way I see it we can either change the interface or provide interface 
> enhancements.
> Changing the interface means defining (when possible) all functions directly, 
> i.e. instead of having a __functions__ dictionary in pyspark.sql.functions.py 
> and then generating the functions programmatically by using _create_function, 
> create the function directly. 
> def max(col):
>"""
>docstring
>"""
>_create_function(max,"docstring")
> Second, we can add type indications to all functions, as defined in PEP 484 or 
> PyCharm's legacy type hinting 
> (https://www.jetbrains.com/help/pycharm/2016.1/type-hinting-in-pycharm.html#legacy).
> So, for example, max might look like this:
> def max(col):
>"""
>does  a max.
>   :type col: Column
>   :rtype Column
>"""
> This would provide a wide range of support, as these types of hints, while 
> old, are pretty common.
> A second option is to use PEP 3107 to define interfaces (pyi files); in this 
> case we might have a functions.pyi file which would contain something 
> like:
> def max(col: Column) -> Column:
> """
> Aggregate function: returns the maximum value of the expression in a 
> group.
> """
> ...
> This has the advantage of easier-to-understand types and of not touching the 
> code (only the supporting code), but has the disadvantage of being managed 
> separately (i.e. a greater chance of making a mistake) and of requiring some 
> configuration in the IDE/static analysis tool instead of working out of the 
> box.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17333) Make pyspark interface friendly with mypy static analysis

2020-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-17333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17333:


Assignee: (was: Apache Spark)

> Make pyspark interface friendly with mypy static analysis
> -
>
> Key: SPARK-17333
> URL: https://issues.apache.org/jira/browse/SPARK-17333
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Assaf Mendelson
>Priority: Trivial
>
> Static analysis tools, such as those common in IDEs for auto-completion and 
> error marking, tend to have poor results with pyspark.
> This is caused by two separate issues:
> The first is that many elements are created programmatically, such as the max 
> function in pyspark.sql.functions.
> The second is that we tend to use pyspark in a functional manner, meaning 
> that we chain many actions (e.g. df.filter().groupby().agg()), and since 
> python has no type information, this can become difficult to understand.
> I would suggest changing the interface to improve this.
> The way I see it, we can either change the interface or provide interface 
> enhancements.
> Changing the interface means defining (when possible) all functions directly, 
> i.e. instead of having a __functions__ dictionary in pyspark.sql.functions.py 
> and then generating the functions programmatically by using _create_function, 
> create the function directly:
> def max(col):
>"""
>docstring
>"""
>_create_function(max,"docstring")
> Second, we can add type indications to all functions, as defined in PEP 484 or 
> PyCharm's legacy type hinting 
> (https://www.jetbrains.com/help/pycharm/2016.1/type-hinting-in-pycharm.html#legacy).
> So, for example, max might look like this:
> def max(col):
>"""
>does  a max.
>   :type col: Column
>   :rtype Column
>"""
> This would provide a wide range of support, as these types of hints, while 
> old, are pretty common.
> A second option is to use PEP 3107 to define interfaces (pyi files); in this 
> case we might have a functions.pyi file which would contain something 
> like:
> def max(col: Column) -> Column:
> """
> Aggregate function: returns the maximum value of the expression in a 
> group.
> """
> ...
> This has the advantage of easier-to-understand types and of not touching the 
> code (only the supporting code), but has the disadvantage of being managed 
> separately (i.e. a greater chance of making a mistake) and of requiring some 
> configuration in the IDE/static analysis tool instead of working out of the 
> box.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17333) Make pyspark interface friendly with mypy static analysis

2020-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-17333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162339#comment-17162339
 ] 

Apache Spark commented on SPARK-17333:
--

User 'Fokko' has created a pull request for this issue:
https://github.com/apache/spark/pull/29180

> Make pyspark interface friendly with mypy static analysis
> -
>
> Key: SPARK-17333
> URL: https://issues.apache.org/jira/browse/SPARK-17333
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Assaf Mendelson
>Priority: Trivial
>
> Static analysis tools, such as those common in IDEs for auto-completion and 
> error marking, tend to have poor results with pyspark.
> This is caused by two separate issues:
> The first is that many elements are created programmatically, such as the max 
> function in pyspark.sql.functions.
> The second is that we tend to use pyspark in a functional manner, meaning 
> that we chain many actions (e.g. df.filter().groupby().agg()), and since 
> python has no type information, this can become difficult to understand.
> I would suggest changing the interface to improve this.
> The way I see it, we can either change the interface or provide interface 
> enhancements.
> Changing the interface means defining (when possible) all functions directly, 
> i.e. instead of having a __functions__ dictionary in pyspark.sql.functions.py 
> and then generating the functions programmatically by using _create_function, 
> create the function directly:
> def max(col):
>"""
>docstring
>"""
>_create_function(max,"docstring")
> Second, we can add type indications to all functions, as defined in PEP 484 or 
> PyCharm's legacy type hinting 
> (https://www.jetbrains.com/help/pycharm/2016.1/type-hinting-in-pycharm.html#legacy).
> So, for example, max might look like this:
> def max(col):
>"""
>does  a max.
>   :type col: Column
>   :rtype Column
>"""
> This would provide a wide range of support, as these types of hints, while 
> old, are pretty common.
> A second option is to use PEP 3107 to define interfaces (pyi files); in this 
> case we might have a functions.pyi file which would contain something 
> like:
> def max(col: Column) -> Column:
> """
> Aggregate function: returns the maximum value of the expression in a 
> group.
> """
> ...
> This has the advantage of easier-to-understand types and of not touching the 
> code (only the supporting code), but has the disadvantage of being managed 
> separately (i.e. a greater chance of making a mistake) and of requiring some 
> configuration in the IDE/static analysis tool instead of working out of the 
> box.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32382) Override table renaming in JDBC dialects

2020-07-21 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-32382:
--

 Summary: Override table renaming in JDBC dialects
 Key: SPARK-32382
 URL: https://issues.apache.org/jira/browse/SPARK-32382
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Maxim Gekk


SPARK-32375 adds a new method, renameTable, to JdbcDialect with the default 
implementation:
{code:sql}
ALTER TABLE table_name RENAME TO new_table_name;
{code}
This syntax is supported by Oracle, MySQL, MariaDB, PostgreSQL and SQLite, but 
other dialects might not support it; for instance, SQL Server renames tables 
through the stored procedure sp_rename:
{code:sql}
sp_rename 'table_name', 'new_table_name';
{code}

This ticket aims to support table renaming in all JDBC dialects.
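
As a rough illustration of the kind of per-dialect override this asks for, a SQL Server dialect could map the rename to sp_rename. The sketch below is hedged: the dialect object name is hypothetical and the renameTable signature is an assumption; the actual method added by SPARK-32375 may differ, in which case this would be an override of it.

{code:scala}
// Hedged sketch only: MsSqlServerRenameDialect is a hypothetical name and the
// renameTable signature is assumed, not taken from SPARK-32375.
import org.apache.spark.sql.jdbc.JdbcDialect

case object MsSqlServerRenameDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.toLowerCase.startsWith("jdbc:sqlserver")

  // SQL Server has no ALTER TABLE ... RENAME TO, so use sp_rename instead.
  def renameTable(oldTable: String, newTable: String): String =
    s"EXEC sp_rename '$oldTable', '$newTable'"
}
{code}

A dialect like this would be picked up for matching JDBC URLs once registered via JdbcDialects.registerDialect.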



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32381) Expose the ability for users to use parallel file & avoid location information discovery in RDDs

2020-07-21 Thread Holden Karau (Jira)
Holden Karau created SPARK-32381:


 Summary: Expose the ability for users to use parallel file & avoid 
location information discovery in RDDs
 Key: SPARK-32381
 URL: https://issues.apache.org/jira/browse/SPARK-32381
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Holden Karau


We already have this in SQL so it's mostly a matter of re-organizing the code a 
bit and agreeing on how to best expose this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32381) Expose the ability for users to use parallel file & avoid location information discovery in RDDs

2020-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32381:


Assignee: (was: Apache Spark)

> Expose the ability for users to use parallel file & avoid location 
> information discovery in RDDs
> 
>
> Key: SPARK-32381
> URL: https://issues.apache.org/jira/browse/SPARK-32381
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Priority: Major
>
> We already have this in SQL so it's mostly a matter of re-organizing the code 
> a bit and agreeing on how to best expose this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32381) Expose the ability for users to use parallel file & avoid location information discovery in RDDs

2020-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32381:


Assignee: Apache Spark

> Expose the ability for users to use parallel file & avoid location 
> information discovery in RDDs
> 
>
> Key: SPARK-32381
> URL: https://issues.apache.org/jira/browse/SPARK-32381
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Assignee: Apache Spark
>Priority: Major
>
> We already have this in SQL so it's mostly a matter of re-organizing the code 
> a bit and agreeing on how to best expose this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32381) Expose the ability for users to use parallel file & avoid location information discovery in RDDs

2020-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162314#comment-17162314
 ] 

Apache Spark commented on SPARK-32381:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/29179

> Expose the ability for users to use parallel file & avoid location 
> information discovery in RDDs
> 
>
> Key: SPARK-32381
> URL: https://issues.apache.org/jira/browse/SPARK-32381
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Priority: Major
>
> We already have this in SQL so it's mostly a matter of re-organizing the code 
> a bit and agreeing on how to best expose this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2020-07-21 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162313#comment-17162313
 ] 

Holden Karau commented on SPARK-26345:
--

We don't normally assign issues until after the merge. Leaving a comment when 
you start working on it is a best practice to avoid people stepping on each 
other's toes.

> Parquet support Column indexes
> --
>
> Key: SPARK-26345
> URL: https://issues.apache.org/jira/browse/SPARK-26345
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Parquet 1.11.0 supports column indexing. Spark can support this feature for 
> better read performance.
> More details:
> https://issues.apache.org/jira/browse/PARQUET-1201



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2020-07-21 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162311#comment-17162311
 ] 

Felix Kizhakkel Jose commented on SPARK-26345:
--

[~sha...@uber.com] I don't have permission to assign it to you. Probably 
someone who is part of the committers list can assign it to you.

> Parquet support Column indexes
> --
>
> Key: SPARK-26345
> URL: https://issues.apache.org/jira/browse/SPARK-26345
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Parquet 1.11.0 supports column indexing. Spark can support this feature for 
> better read performance.
> More details:
> https://issues.apache.org/jira/browse/PARQUET-1201



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32348) Get tests working for Scala 2.13 build

2020-07-21 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162280#comment-17162280
 ] 

Sean R. Owen commented on SPARK-32348:
--

I've found a few more easy test fixes, but also discovered we need scalatest 
3.2.0 for this fix: 
https://github.com/scalatest/scalatest/commit/7c89416aa9f3e7f2730a343ad6d3bdcff65809de

> Get tests working for Scala 2.13 build
> --
>
> Key: SPARK-32348
> URL: https://issues.apache.org/jira/browse/SPARK-32348
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, Spark Core, SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Major
>
> This is a placeholder for the general task of getting the tests to pass in 
> the Scala 2.13 build, after it compiles.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32334) Investigate commonizing Columnar and Row data transformations

2020-07-21 Thread Robert Joseph Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162147#comment-17162147
 ] 

Robert Joseph Evans commented on SPARK-32334:
-

I think I can get the conversation started here.

{{SparkPlan}} supports a few APIs for columnar processing right now:
* {{supportsColumnar}}: returns true if {{executeColumnar}} should be 
called to process columnar data.
* {{vectorTypes}}: an optional set of class names for the columnar output of 
this stage, which is a performance improvement for the code-generation phase of 
converting the data to rows.
* {{executeColumnar}}: the main entry point to columnar execution.
* {{doExecuteColumnar}}: what users are expected to implement if 
{{supportsColumnar}} returns true.

When {{supportsColumnar}} returns true, it is assumed that both the input and 
the output of the stage will be columnar data. With this information, 
{{ApplyColumnarRulesAndInsertTransitions}} will insert {{RowToColumnarExec}} 
and {{ColumnarToRowExec}} transitions. {{ColumnarToRowExec}} is by far the more 
optimized of the two because it is widely used today.

One of the goals of this issue is to try to make something like 
{{ArrowEvalPythonExec}} columnar. If we just made {{supportsColumnar}} return 
true for it, the incoming data layout would be columnar, but it most likely 
would not be Arrow formatted, so it would still require some kind of transition 
from one columnar format to an Arrow-based format. There is also no guarantee 
that the size of the batch will correspond to what this operator wants: 
{{RowToColumnarExec}} goes off of the 
{{spark.sql.inMemoryColumnarStorage.batchSize}} config, but 
{{ArrowEvalPythonExec}} uses {{spark.sql.execution.arrow.maxRecordsPerBatch}}.

To get around both of these issues, I would propose that we let {{SparkPlan}} 
optionally ask for both a specific type of input and a specific target size. We 
might also want a better way to say what type of output it is going to produce, 
so we can optimize away some transitions if they are not needed.
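
For reference, here is a minimal sketch of how an operator opts into these hooks today. The class name MyColumnarPassThroughExec is hypothetical and the body simply forwards the child's batches; it is meant to ground the discussion, not to propose the new input-type/target-size API.

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.execution.{SparkPlan, UnaryExecNode}
import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
import org.apache.spark.sql.vectorized.ColumnarBatch

// Hypothetical pass-through operator, shown only to illustrate the hooks above.
case class MyColumnarPassThroughExec(child: SparkPlan) extends UnaryExecNode {
  override def output: Seq[Attribute] = child.output

  // Tells the planner this node consumes and produces columnar batches, so
  // ApplyColumnarRulesAndInsertTransitions can add RowToColumnarExec /
  // ColumnarToRowExec around it where needed.
  override def supportsColumnar: Boolean = true

  // Optional hint about the concrete vector classes, used to speed up the
  // generated columnar-to-row conversion code.
  override def vectorTypes: Option[Seq[String]] =
    Some(output.map(_ => classOf[OnHeapColumnVector].getName))

  // Row-based execution should not be reached when supportsColumnar is true.
  override protected def doExecute(): RDD[InternalRow] =
    throw new UnsupportedOperationException("row-based execution is not used here")

  // The columnar entry point: just hand back the child's batches unchanged.
  override protected def doExecuteColumnar(): RDD[ColumnarBatch] =
    child.executeColumnar()
}
{code}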



> Investigate commonizing Columnar and Row data transformations 
> --
>
> Key: SPARK-32334
> URL: https://issues.apache.org/jira/browse/SPARK-32334
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> We introduced more Columnar Support with SPARK-27396.
> With that we recognized that there is code that is doing very similar 
> transformations from ColumnarBatch or Arrow into InternalRow and vice versa.  
> For instance: 
> [https://github.com/apache/spark/blob/a4382f7fe1c36a51c64f460c6cb91e93470e0825/sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala#L56-L58]
> [https://github.com/apache/spark/blob/a4382f7fe1c36a51c64f460c6cb91e93470e0825/sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala#L389]
> We should investigate if we can commonize that code.
> We are also looking at making the internal caching serialization pluggable to 
> allow for different cache implementations. 
> ([https://github.com/apache/spark/pull/29067]). 
> It was recently brought up that we should investigate if using the data 
> source v2 api makes sense and is feasible for some of these transformations 
> to allow it to be easily extended.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32377) CaseInsensitiveMap should be deterministic for addition

2020-07-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32377:
--
Fix Version/s: 2.4.7

> CaseInsensitiveMap should be deterministic for addition
> ---
>
> Key: SPARK-32377
> URL: https://issues.apache.org/jira/browse/SPARK-32377
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.7, 3.0.1, 3.1.0
>
>
> {code}
> import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
> var m = CaseInsensitiveMap(Map.empty[String, String])
> Seq(("paTh", "1"), ("PATH", "2"), ("Path", "3"), ("patH", "4"), ("path", 
> "5")).foreach { kv =>
>   m = (m + kv).asInstanceOf[CaseInsensitiveMap[String]]
>   println(m.get("path"))
> }
> Some(1)
> Some(2)
> Some(3)
> Some(4)
> Some(1)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2020-07-21 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162063#comment-17162063
 ] 

Xinli Shang commented on SPARK-26345:
-

[~yumwang][~FelixKJose], you can assign this JIRA to me. When I have time, I 
can start working on it.

> Parquet support Column indexes
> --
>
> Key: SPARK-26345
> URL: https://issues.apache.org/jira/browse/SPARK-26345
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Parquet 1.11.0 supports column indexing. Spark can support this feature for 
> better read performance.
> More details:
> https://issues.apache.org/jira/browse/PARQUET-1201



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32380) sparksql cannot access hive table while data in hbase

2020-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162059#comment-17162059
 ] 

Apache Spark commented on SPARK-32380:
--

User 'DeyinZhong' has created a pull request for this issue:
https://github.com/apache/spark/pull/29178

> sparksql cannot access hive table while data in hbase
> -
>
> Key: SPARK-32380
> URL: https://issues.apache.org/jira/browse/SPARK-32380
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: ||component||version||
> |hadoop|2.8.5|
> |hive|2.3.7|
> |spark|3.0.0|
> |hbase|1.4.9|
>Reporter: deyzhong
>Priority: Major
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> * step1: create hbase table
> {code:java}
>  hbase(main):001:0>create 'hbase_test1', 'cf1'
>  hbase(main):001:0> put 'hbase_test', 'r1', 'cf1:c1', '123'
> {code}
>  * step2: create hive table related to hbase table
>  
> {code:java}
> hive> 
> CREATE EXTERNAL TABLE `hivetest.hbase_test`(
>   `key` string COMMENT '', 
>   `value` string COMMENT '')
> ROW FORMAT SERDE 
>   'org.apache.hadoop.hive.hbase.HBaseSerDe' 
> STORED BY 
>   'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
> WITH SERDEPROPERTIES ( 
>   'hbase.columns.mapping'=':key,cf1:v1', 
>   'serialization.format'='1')
> TBLPROPERTIES (
>   'hbase.table.name'='hbase_test')
>  {code}
>  * step3: sparksql query hive table while data in hbase
> {code:java}
> spark-sql --master yarn -e "select * from hivetest.hbase_test"
> {code}
>  
> The error log is as follows:
> java.io.IOException: Cannot create a record reader because of a previous 
> error. Please look at the previous logs lines from the task's full log for 
> more details.
>  at 
> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
>  at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
>  at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412)
>  at 
> org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:58)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
>  at scala.collection.Iterator.foreach(Iterator.scala:941)
>  at scala.collection.It

[jira] [Assigned] (SPARK-32380) sparksql cannot access hive table while data in hbase

2020-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32380:


Assignee: Apache Spark

> sparksql cannot access hive table while data in hbase
> -
>
> Key: SPARK-32380
> URL: https://issues.apache.org/jira/browse/SPARK-32380
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: ||component||version||
> |hadoop|2.8.5|
> |hive|2.3.7|
> |spark|3.0.0|
> |hbase|1.4.9|
>Reporter: deyzhong
>Assignee: Apache Spark
>Priority: Major
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> * step1: create hbase table
> {code:java}
>  hbase(main):001:0>create 'hbase_test1', 'cf1'
>  hbase(main):001:0> put 'hbase_test', 'r1', 'cf1:c1', '123'
> {code}
>  * step2: create hive table related to hbase table
>  
> {code:java}
> hive> 
> CREATE EXTERNAL TABLE `hivetest.hbase_test`(
>   `key` string COMMENT '', 
>   `value` string COMMENT '')
> ROW FORMAT SERDE 
>   'org.apache.hadoop.hive.hbase.HBaseSerDe' 
> STORED BY 
>   'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
> WITH SERDEPROPERTIES ( 
>   'hbase.columns.mapping'=':key,cf1:v1', 
>   'serialization.format'='1')
> TBLPROPERTIES (
>   'hbase.table.name'='hbase_test')
>  {code}
>  * step3: sparksql query hive table while data in hbase
> {code:java}
> spark-sql --master yarn -e "select * from hivetest.hbase_test"
> {code}
>  
> The error log is as follows:
> java.io.IOException: Cannot create a record reader because of a previous 
> error. Please look at the previous logs lines from the task's full log for 
> more details.
>  at 
> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
>  at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
>  at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412)
>  at 
> org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:58)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
>  at scala.collection.Iterator.foreach(Iterator.scala:941)
>  at scala.collection.Iterator.foreach$(Iterator.scala:941)
>  at scala.collection.AbstractIterator.foreach(Iterator

[jira] [Assigned] (SPARK-32380) sparksql cannot access hive table while data in hbase

2020-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32380:


Assignee: (was: Apache Spark)

> sparksql cannot access hive table while data in hbase
> -
>
> Key: SPARK-32380
> URL: https://issues.apache.org/jira/browse/SPARK-32380
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: ||component||version||
> |hadoop|2.8.5|
> |hive|2.3.7|
> |spark|3.0.0|
> |hbase|1.4.9|
>Reporter: deyzhong
>Priority: Major
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> * step1: create hbase table
> {code:java}
>  hbase(main):001:0>create 'hbase_test1', 'cf1'
>  hbase(main):001:0> put 'hbase_test', 'r1', 'cf1:c1', '123'
> {code}
>  * step2: create hive table related to hbase table
>  
> {code:java}
> hive> 
> CREATE EXTERNAL TABLE `hivetest.hbase_test`(
>   `key` string COMMENT '', 
>   `value` string COMMENT '')
> ROW FORMAT SERDE 
>   'org.apache.hadoop.hive.hbase.HBaseSerDe' 
> STORED BY 
>   'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
> WITH SERDEPROPERTIES ( 
>   'hbase.columns.mapping'=':key,cf1:v1', 
>   'serialization.format'='1')
> TBLPROPERTIES (
>   'hbase.table.name'='hbase_test')
>  {code}
>  * step3: sparksql query hive table while data in hbase
> {code:java}
> spark-sql --master yarn -e "select * from hivetest.hbase_test"
> {code}
>  
> The error log is as follows:
> java.io.IOException: Cannot create a record reader because of a previous 
> error. Please look at the previous logs lines from the task's full log for 
> more details.
>  at 
> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
>  at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
>  at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412)
>  at 
> org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:58)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
>  at scala.collection.Iterator.foreach(Iterator.scala:941)
>  at scala.collection.Iterator.foreach$(Iterator.scala:941)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>  at scala.

[jira] [Commented] (SPARK-32380) sparksql cannot access hive table while data in hbase

2020-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162058#comment-17162058
 ] 

Apache Spark commented on SPARK-32380:
--

User 'DeyinZhong' has created a pull request for this issue:
https://github.com/apache/spark/pull/29178

> sparksql cannot access hive table while data in hbase
> -
>
> Key: SPARK-32380
> URL: https://issues.apache.org/jira/browse/SPARK-32380
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: ||component||version||
> |hadoop|2.8.5|
> |hive|2.3.7|
> |spark|3.0.0|
> |hbase|1.4.9|
>Reporter: deyzhong
>Priority: Major
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> * step1: create hbase table
> {code:java}
>  hbase(main):001:0>create 'hbase_test1', 'cf1'
>  hbase(main):001:0> put 'hbase_test', 'r1', 'cf1:c1', '123'
> {code}
>  * step2: create hive table related to hbase table
>  
> {code:java}
> hive> 
> CREATE EXTERNAL TABLE `hivetest.hbase_test`(
>   `key` string COMMENT '', 
>   `value` string COMMENT '')
> ROW FORMAT SERDE 
>   'org.apache.hadoop.hive.hbase.HBaseSerDe' 
> STORED BY 
>   'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
> WITH SERDEPROPERTIES ( 
>   'hbase.columns.mapping'=':key,cf1:v1', 
>   'serialization.format'='1')
> TBLPROPERTIES (
>   'hbase.table.name'='hbase_test')
>  {code}
>  * step3: sparksql query hive table while data in hbase
> {code:java}
> spark-sql --master yarn -e "select * from hivetest.hbase_test"
> {code}
>  
> The error log is as follows:
> java.io.IOException: Cannot create a record reader because of a previous 
> error. Please look at the previous logs lines from the task's full log for 
> more details.
>  at 
> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
>  at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
>  at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412)
>  at 
> org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:58)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
>  at scala.collection.Iterator.foreach(Iterator.scala:941)
>  at scala.collection.It

[jira] [Assigned] (SPARK-32363) Flaky pip installation test in Jenkins

2020-07-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32363:


Assignee: Hyukjin Kwon

> Flaky pip installation test in Jenkins
> --
>
> Key: SPARK-32363
> URL: https://issues.apache.org/jira/browse/SPARK-32363
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Currently pip packaging test is flaky in Jenkins:
> {code}
> Installing collected packages: py4j, pyspark
>   Attempting uninstall: py4j
> Found existing installation: py4j 0.10.9
> Uninstalling py4j-0.10.9:
>   Successfully uninstalled py4j-0.10.9
>   Attempting uninstall: pyspark
> Found existing installation: pyspark 3.1.0.dev0
> ERROR: Exception:
> Traceback (most recent call last):
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/cli/base_command.py",
>  line 188, in _main
> status = self.run(options, args)
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/cli/req_command.py",
>  line 185, in wrapper
> return func(self, options, args)
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/commands/install.py",
>  line 407, in run
> use_user_site=options.use_user_site,
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/__init__.py",
>  line 64, in install_given_reqs
> auto_confirm=True
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/req_install.py",
>  line 675, in uninstall
> uninstalled_pathset = UninstallPathSet.from_dist(dist)
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/req_uninstall.py",
>  line 545, in from_dist
> link_pointer, dist.project_name, dist.location)
> AssertionError: Egg-link 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/python does 
> not match installed location of pyspark (at 
> /home/jenkins/workspace/SparkPullRequestBuilder@2/python)
> Cleaning up temporary directory - /tmp/tmp.GX6lHKLHZK
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32363) Flaky pip installation test in Jenkins

2020-07-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32363.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29117
[https://github.com/apache/spark/pull/29117]

> Flaky pip installation test in Jenkins
> --
>
> Key: SPARK-32363
> URL: https://issues.apache.org/jira/browse/SPARK-32363
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently pip packaging test is flaky in Jenkins:
> {code}
> Installing collected packages: py4j, pyspark
>   Attempting uninstall: py4j
> Found existing installation: py4j 0.10.9
> Uninstalling py4j-0.10.9:
>   Successfully uninstalled py4j-0.10.9
>   Attempting uninstall: pyspark
> Found existing installation: pyspark 3.1.0.dev0
> ERROR: Exception:
> Traceback (most recent call last):
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/cli/base_command.py",
>  line 188, in _main
> status = self.run(options, args)
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/cli/req_command.py",
>  line 185, in wrapper
> return func(self, options, args)
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/commands/install.py",
>  line 407, in run
> use_user_site=options.use_user_site,
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/__init__.py",
>  line 64, in install_given_reqs
> auto_confirm=True
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/req_install.py",
>  line 675, in uninstall
> uninstalled_pathset = UninstallPathSet.from_dist(dist)
>   File 
> "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/req_uninstall.py",
>  line 545, in from_dist
> link_pointer, dist.project_name, dist.location)
> AssertionError: Egg-link 
> /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/python does 
> not match installed location of pyspark (at 
> /home/jenkins/workspace/SparkPullRequestBuilder@2/python)
> Cleaning up temporary directory - /tmp/tmp.GX6lHKLHZK
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32380) sparksql cannot access hive table while data in hbase

2020-07-21 Thread deyzhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

deyzhong updated SPARK-32380:
-
Description: 
* step1: create hbase table

{code:java}
 hbase(main):001:0>create 'hbase_test1', 'cf1'
 hbase(main):001:0> put 'hbase_test', 'r1', 'cf1:c1', '123'
{code}
 * step2: create hive table related to hbase table

 
{code:java}
hive> 
CREATE EXTERNAL TABLE `hivetest.hbase_test`(
  `key` string COMMENT '', 
  `value` string COMMENT '')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.hbase.HBaseSerDe' 
STORED BY 
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES ( 
  'hbase.columns.mapping'=':key,cf1:v1', 
  'serialization.format'='1')
TBLPROPERTIES (
  'hbase.table.name'='hbase_test')
 {code}
 * step3: sparksql query hive table while data in hbase

{code:java}
spark-sql --master yarn -e "select * from hivetest.hbase_test"
{code}
 

The error log is as follows:

java.io.IOException: Cannot create a record reader because of a previous error. 
Please look at the previous logs lines from the task's full log for more 
details.
 at 
org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270)
 at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
 at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
 at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
 at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
 at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
 at 
org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412)
 at 
org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:58)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
 at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
 at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
 at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
 at scala.collection.Iterator.foreach(Iterator.scala:941)
 at scala.collection.Iterator.foreach$(Iterator.scala:941)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
 at scala.collection.IterableLike.foreach(IterableLike.scala:74)
 at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
 at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:490)
 at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
 at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:474)
 at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:490)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:206)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.refle

[jira] [Commented] (SPARK-32380) sparksql cannot access hive table while data in hbase

2020-07-21 Thread deyzhong (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161959#comment-17161959
 ] 

deyzhong commented on SPARK-32380:
--

I have solved this bug by modifying TableReader.scala.

The solution is that when the input format class is 
org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat, the old HadoopRDD is 
created instead. I have tested this in my production environment as well.

Can I submit a PR to Spark?

[~apachespark]

 

> sparksql cannot access hive table while data in hbase
> -
>
> Key: SPARK-32380
> URL: https://issues.apache.org/jira/browse/SPARK-32380
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: ||component||version||
> |hadoop|2.8.5|
> |hive|2.3.7|
> |spark|3.0.0|
> |hbase|1.4.9|
>Reporter: deyzhong
>Priority: Major
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> * step1: create hbase table
> {code:java}
>  hbase(main):001:0>create 'hbase_test1', 'cf1'
>  hbase(main):001:0> put 'hbase_test', 'r1', 'cf1:c1', '123'
> {code}
>  * step2: create hive table related to hbase table
>  
> {code:java}
> hive> 
> CREATE EXTERNAL TABLE `hivetest.hbase_test`(
>   `key` string COMMENT '', 
>   `value` string COMMENT '')
> ROW FORMAT SERDE 
>   'org.apache.hadoop.hive.hbase.HBaseSerDe' 
> STORED BY 
>   'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
> WITH SERDEPROPERTIES ( 
>   'hbase.columns.mapping'=':key,cf1:v1', 
>   'serialization.format'='1')
> TBLPROPERTIES (
>   'hbase.table.name'='hbase_test')
>  {code}
>  * sparksql query hive table while data in hbase
> {code:java}
> spark-sql --master yarn -e "select * from hivetest.hbase_test"
> {code}
>  
> The error log is as follows:
> java.io.IOException: Cannot create a record reader because of a previous 
> error. Please look at the previous logs lines from the task's full log for 
> more details.
>  at 
> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
>  at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
>  at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412)
>  at 
> org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:58)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
>  at 
> org.apache.spark.sql.hive.thriftserve

[jira] [Updated] (SPARK-32380) sparksql cannot access hive table while data in hbase

2020-07-21 Thread deyzhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

deyzhong updated SPARK-32380:
-
Description: 
* step1: create hbase table

{code:java}
 hbase(main):001:0>create 'hbase_test1', 'cf1'
 hbase(main):001:0> put 'hbase_test', 'r1', 'cf1:c1', '123'
{code}
 * step 2: create a Hive external table mapped to the HBase table

 
{code:java}
hive> 
CREATE EXTERNAL TABLE `hivetest.hbase_test`(
  `key` string COMMENT '', 
  `value` string COMMENT '')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.hbase.HBaseSerDe' 
STORED BY 
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES ( 
  'hbase.columns.mapping'=':key,cf1:v1', 
  'serialization.format'='1')
TBLPROPERTIES (
  'hbase.table.name'='hbase_test')
 {code}
 * step 3: query the Hive table with Spark SQL (the data resides in HBase)

{code:java}
spark-sql --master yarn -e "select * from hivetest.hbase_test"
{code}

The error log is as follows:

java.io.IOException: Cannot create a record reader because of a previous error. 
Please look at the previous logs lines from the task's full log for more 
details.
 at 
org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270)
 at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
 at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
 at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
 at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
 at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
 at 
org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412)
 at 
org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:58)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
 at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
 at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
 at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
 at scala.collection.Iterator.foreach(Iterator.scala:941)
 at scala.collection.Iterator.foreach$(Iterator.scala:941)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
 at scala.collection.IterableLike.foreach(IterableLike.scala:74)
 at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
 at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:490)
 at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
 at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:474)
 at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:490)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:206)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.Nati
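
A frequent cause of TableInputFormatBase.getSplits failing with "Cannot create a record reader because of a previous error" is that the HBase client configuration (hbase-site.xml with the ZooKeeper quorum) or the hive-hbase-handler and HBase client jars are not visible to the Spark driver and executors. The snippet below is only a hypothetical sketch for ruling that out from PySpark; the jar paths and versions are illustrative assumptions, not taken from this report.

{code:python}
# Hypothetical sketch: make the HBase client jars and hbase-site.xml visible to
# Spark before querying the Hive-on-HBase table. All paths and versions are
# assumptions for illustration; in practice they are usually supplied at launch,
# e.g. spark-submit --jars ... --driver-class-path /etc/hbase/conf.
from pyspark.sql import SparkSession

hbase_conf_dir = "/etc/hbase/conf"  # directory containing hbase-site.xml
hbase_jars = ",".join([
    "/opt/hive/lib/hive-hbase-handler-2.3.7.jar",
    "/opt/hbase/lib/hbase-client-1.4.9.jar",
    "/opt/hbase/lib/hbase-common-1.4.9.jar",
    "/opt/hbase/lib/hbase-server-1.4.9.jar",
])

spark = (
    SparkSession.builder
    .appName("hive-hbase-check")
    .config("spark.jars", hbase_jars)                          # HBase / Hive handler jars
    .config("spark.executor.extraClassPath", hbase_conf_dir)   # hbase-site.xml on executors
    .enableHiveSupport()
    .getOrCreate()
)

# If this succeeds with the jars and configuration supplied explicitly, the
# original failure is likely a deployment/classpath issue rather than a Spark SQL bug.
spark.sql("SELECT * FROM hivetest.hbase_test").show()
{code}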

[jira] [Updated] (SPARK-32380) sparksql cannot access hive table while data in hbase

2020-07-21 Thread deyzhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

deyzhong updated SPARK-32380:
-
Description: 
* step 1: create the HBase table

{code:java}
 create 'hbase_test2', 'cf1'
{code}
 * step 2: create a Hive external table mapped to the HBase table

 
{code:java}
hive> 
CREATE EXTERNAL TABLE `hivetest.hbase_test`(
  `key` string COMMENT '', 
  `value` string COMMENT '')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.hbase.HBaseSerDe' 
STORED BY 
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES ( 
  'hbase.columns.mapping'=':key,cf1:v1', 
  'serialization.format'='1')
TBLPROPERTIES (
  'hbase.table.name'='hbase_test')
{code}

 * step 3: query the Hive table with Spark SQL (the data resides in HBase)

{code:java}
spark-sql --master yarn -e "select * from hivetest.hbase_test"
{code}

The error log is as follows:

java.io.IOException: Cannot create a record reader because of a previous error. 
Please look at the previous logs lines from the task's full log for more 
details.
 at 
org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270)
 at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
 at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
 at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
 at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
 at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
 at 
org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412)
 at 
org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:58)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
 at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
 at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
 at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
 at scala.collection.Iterator.foreach(Iterator.scala:941)
 at scala.collection.Iterator.foreach$(Iterator.scala:941)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
 at scala.collection.IterableLike.foreach(IterableLike.scala:74)
 at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
 at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:490)
 at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
 at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:474)
 at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:490)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:206)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImp

[jira] [Updated] (SPARK-32380) sparksql cannot access hive table while data in hbase

2020-07-21 Thread deyzhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

deyzhong updated SPARK-32380:
-
Description: 
* step 1: create the HBase table
{code:java}
 create 'hbase_test2', 'cf1'
{code}

 * step 2: create a Hive external table mapped to the HBase table

 
{code:java}
hive> 
CREATE EXTERNAL TABLE `hivetest.hbase_test`(
  `key` string COMMENT '', 
  `value` string COMMENT '')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.hbase.HBaseSerDe' 
STORED BY 
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES ( 
  'hbase.columns.mapping'=':key,cf1:v1', 
  'serialization.format'='1')
TBLPROPERTIES (
  'hbase.table.name'='hbase_test')
{code}

 * step 3: query the Hive table with Spark SQL (the data resides in HBase)

{code:java}
spark-sql --master yarn -e "select * from hivetest.hbase_test"
{code}

The error log is as follows:

java.io.IOException: Cannot create a record reader because of a previous error. 
Please look at the previous logs lines from the task's full log for more 
details.
 at 
org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270)
 at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
 at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
 at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
 at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
 at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
 at 
org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412)
 at 
org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:58)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
 at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
 at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
 at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
 at scala.collection.Iterator.foreach(Iterator.scala:941)
 at scala.collection.Iterator.foreach$(Iterator.scala:941)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
 at scala.collection.IterableLike.foreach(IterableLike.scala:74)
 at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
 at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:490)
 at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
 at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:474)
 at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:490)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:206)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccess

[jira] [Updated] (SPARK-32380) sparksql cannot access hive table while data in hbase

2020-07-21 Thread deyzhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

deyzhong updated SPARK-32380:
-
Description: 
* step 1: create the HBase table
{code:java}
 create 'hbase_test2', 'cf1'
{code}

 * step 2: create a Hive external table mapped to the HBase table

 
{code:java}
hive> 
CREATE EXTERNAL TABLE `hivetest.hbase_test`(
  `key` string COMMENT '', 
  `value` string COMMENT '')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.hbase.HBaseSerDe' 
STORED BY 
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES ( 
  'hbase.columns.mapping'=':key,cf1:v1', 
  'serialization.format'='1')
TBLPROPERTIES (
  'hbase.table.name'='hbase_test')
{code}

 * step 3: query the Hive table with Spark SQL (the data resides in HBase)

{code:java}
spark-sql --master yarn -e "select * from hivetest.hbase_test"
{code}

The error log is as follows:

java.io.IOException: Cannot create a record reader because of a previous error. 
Please look at the previous logs lines from the task's full log for more 
details.
 at 
org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270)
 at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
 at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
 at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
 at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
 at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
 at 
org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412)
 at 
org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:58)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
 at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
 at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
 at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
 at scala.collection.Iterator.foreach(Iterator.scala:941)
 at scala.collection.Iterator.foreach$(Iterator.scala:941)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
 at scala.collection.IterableLike.foreach(IterableLike.scala:74)
 at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
 at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:490)
 at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
 at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:474)
 at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:490)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:206)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAcc

[jira] [Updated] (SPARK-32380) sparksql cannot access hive table while data in hbase

2020-07-21 Thread deyzhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

deyzhong updated SPARK-32380:
-
Summary: sparksql cannot access hive table while data in hbase  (was: 
sparksql cannot access hive table while data on hbase)

> sparksql cannot access hive table while data in hbase
> -
>
> Key: SPARK-32380
> URL: https://issues.apache.org/jira/browse/SPARK-32380
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: ||component||version||
> |hadoop|2.8.5|
> |hive|2.3.7|
> |spark|3.0.0|
> |hbase|1.4.9|
>Reporter: deyzhong
>Priority: Major
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> * step 1: create the HBase table
> {code:java}
>  create 'hbase_test2', 'cf1'
> {code}
>  * step 2: create a Hive external table mapped to the HBase table
>  
> {code:java}
> hive> 
> CREATE EXTERNAL TABLE `hivetest.hbase_test`(
>   `key` string COMMENT '', 
>   `value` string COMMENT '')
> ROW FORMAT SERDE 
>   'org.apache.hadoop.hive.hbase.HBaseSerDe' 
> STORED BY 
>   'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
> WITH SERDEPROPERTIES ( 
>   'hbase.columns.mapping'=':key,cf1:v1', 
>   'serialization.format'='1')
> TBLPROPERTIES (
>   'hbase.table.name'='hbase_test')
> {code}
>
> step 3: query the Hive table with Spark SQL (the data resides in HBase). The error log is as follows:
>
> java.io.IOException: Cannot create a record reader because of a previous 
> error. Please look at the previous logs lines from the task's full log for 
> more details.
>  at 
> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
>  at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
>  at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412)
>  at 
> org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:58)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
>  at scala.collection.Iterator.foreach(Iterator.scala:941)
>  at scala.collection.Iterator.foreach$(Iterator.scala:941)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)

[jira] [Updated] (SPARK-32380) sparksql cannot access hive table while data on hbase

2020-07-21 Thread deyzhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

deyzhong updated SPARK-32380:
-
Summary: sparksql cannot access hive table while data on hbase  (was: 
sparksql cannot access hbase external table in hive)

> sparksql cannot access hive table while data on hbase
> -
>
> Key: SPARK-32380
> URL: https://issues.apache.org/jira/browse/SPARK-32380
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: ||component||version||
> |hadoop|2.8.5|
> |hive|2.3.7|
> |spark|3.0.0|
> |hbase|1.4.9|
>Reporter: deyzhong
>Priority: Major
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> * step 1: create the HBase table
> {code:java}
>  create 'hbase_test2', 'cf1'
> {code}
>  * step 2: create a Hive external table mapped to the HBase table
>  
> {code:java}
> hive> 
> CREATE EXTERNAL TABLE `hivetest.hbase_test`(
>   `key` string COMMENT '', 
>   `value` string COMMENT '')
> ROW FORMAT SERDE 
>   'org.apache.hadoop.hive.hbase.HBaseSerDe' 
> STORED BY 
>   'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
> WITH SERDEPROPERTIES ( 
>   'hbase.columns.mapping'=':key,cf1:v1', 
>   'serialization.format'='1')
> TBLPROPERTIES (
>   'hbase.table.name'='hbase_test')
> {code}
>
> step 3: query the Hive table with Spark SQL (the data resides in HBase). The error log is as follows:
>
> java.io.IOException: Cannot create a record reader because of a previous 
> error. Please look at the previous logs lines from the task's full log for 
> more details.
>  at 
> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
>  at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
>  at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412)
>  at 
> org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:58)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
>  at scala.collection.Iterator.foreach(Iterator.scala:941)
>  at scala.collection.Iterator.foreach$(Iterator.scala:941)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>  at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>  at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> 

[jira] [Updated] (SPARK-32380) sparksql cannot access hbase external table in hive

2020-07-21 Thread deyzhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

deyzhong updated SPARK-32380:
-
Description: 
* step 1: create the HBase table
{code:java}
 create 'hbase_test2', 'cf1'
{code}

 * step 2: create a Hive external table mapped to the HBase table

 
{code:java}
hive> 
CREATE EXTERNAL TABLE `hivetest.hbase_test`(
  `key` string COMMENT '', 
  `value` string COMMENT '')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.hbase.HBaseSerDe' 
STORED BY 
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES ( 
  'hbase.columns.mapping'=':key,cf1:v1', 
  'serialization.format'='1')
TBLPROPERTIES (
  'hbase.table.name'='hbase_test')
{code}

step 3: query the Hive table with Spark SQL (the data resides in HBase). The error log is as follows:

java.io.IOException: Cannot create a record reader because of a previous error. 
Please look at the previous logs lines from the task's full log for more 
details.
 at 
org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270)
 at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
 at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
 at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
 at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
 at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
 at 
org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412)
 at 
org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:58)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
 at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
 at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
 at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
 at scala.collection.Iterator.foreach(Iterator.scala:941)
 at scala.collection.Iterator.foreach$(Iterator.scala:941)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
 at scala.collection.IterableLike.foreach(IterableLike.scala:74)
 at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
 at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:490)
 at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
 at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:474)
 at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:490)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:206)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke

[jira] [Updated] (SPARK-32380) sparksql cannot access hbase external table in hive

2020-07-21 Thread deyzhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

deyzhong updated SPARK-32380:
-
Description: 
* step 1: create the HBase table
{code:java}
 create 'hbase_test2', 'cf1'
{code}

 * step 2: create a Hive external table mapped to the HBase table

 
{code:java}
hive> 
CREATE EXTERNAL TABLE `hivetest.hbase_test`(
  `key` string COMMENT '', 
  `value` string COMMENT '')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.hbase.HBaseSerDe' 
STORED BY 
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES ( 
  'hbase.columns.mapping'=':key,cf1:v1', 
  'serialization.format'='1')
TBLPROPERTIES (
  'hbase.table.name'='hbase_test')
{code}

step 3: query the Hive table with Spark SQL. The error log is as follows:

java.io.IOException: Cannot create a record reader because of a previous error. 
Please look at the previous logs lines from the task's full log for more 
details.
 at 
org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270)
 at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
 at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
 at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
 at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
 at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
 at 
org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412)
 at 
org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:58)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
 at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
 at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
 at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
 at scala.collection.Iterator.foreach(Iterator.scala:941)
 at scala.collection.Iterator.foreach$(Iterator.scala:941)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
 at scala.collection.IterableLike.foreach(IterableLike.scala:74)
 at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
 at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:490)
 at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
 at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:474)
 at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:490)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:206)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 a

[jira] [Commented] (SPARK-32317) Parquet file loading with different schema(Decimal(N, P)) in files is not working as expected

2020-07-21 Thread Krish (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161952#comment-17161952
 ] 

Krish commented on SPARK-32317:
---

Yes, I agree with your second point: if we map the required schema to the schema 
stored in each file, we will be able to achieve the desired result. I will look 
for an update from you on this.
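
A possible interim workaround, sketched below under the assumption that the files written with the two different decimal scales can be listed or globbed separately (this sketch is not part of the ticket): read each group of files with its own on-disk schema, cast AMOUNT to a single decimal type, and union the results. Because the cast happens after each file has been decoded with the schema it was written with, the footer/requested-schema mismatch never occurs.

{code:python}
# Hypothetical workaround sketch: normalise AMOUNT to one decimal type per
# batch of files and union the batches. File paths are illustrative assumptions.
from functools import reduce

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("decimal-normalise").getOrCreate()

# Each entry groups files that share the same on-disk AMOUNT type.
batches = [
    "output/file1.snappy.parquet",   # AMOUNT written as DECIMAL(15,6)
    "output/file2.snappy.parquet",   # AMOUNT written as DECIMAL(15,2)
]

dfs = [
    spark.read.parquet(path)
         .withColumn("AMOUNT", col("AMOUNT").cast("decimal(15,6)"))
    for path in batches
]

# unionByName aligns columns by name rather than by position.
df = reduce(lambda left, right: left.unionByName(right), dfs)
df.show()
{code}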

> Parquet file loading with different schema(Decimal(N, P)) in files is not 
> working as expected
> -
>
> Key: SPARK-32317
> URL: https://issues.apache.org/jira/browse/SPARK-32317
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
> Environment: Its failing in all environments that I tried.
>Reporter: Krish
>Priority: Major
>  Labels: easyfix
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hi,
>  
> We generate Parquet files partitioned by Date on a daily basis, and we sometimes 
> send updates to historical data. What we noticed is that, due to a configuration 
> error, the patch data schema is inconsistent with the earlier files.
> Assume we had files generated with a schema having ID and AMOUNT as fields. 
> The historical data has a schema like ID INT, AMOUNT DECIMAL(15,6), while the 
> files we send as updates have AMOUNT as DECIMAL(15,2).
>  
> With two different schemas inside one Date partition, when we load that Date's 
> data into Spark, the data loads but the AMOUNT values come back corrupted.
>  
> file1.snappy.parquet
>  ID: INT
>  AMOUNT: DECIMAL(15,6)
>  Content:
>  1,19500.00
>  2,198.34
> file2.snappy.parquet
>  ID: INT
>  AMOUNT: DECIMAL(15,2)
>  Content:
>  1,19500.00
>  3,198.34
> Load these two files together:
> df3 = spark.read.parquet("output/")
> df3.show()  # we can see the AMOUNT values getting manipulated here
> +---+--------+
> | ID|  AMOUNT|
> +---+--------+
> |  1|    1.95|
> |  3|0.019834|
> |  1|19500.00|
> |  2|  198.34|
> +---+--------+
> Options Tried:
> We tried to give the schema as String for all fields, but that didn't work:
> df3 = spark.read.format("parquet").schema(schema).load("output/")
> Error: "org.apache.spark.sql.execution.QueryExecutionException: Parquet 
> column cannot be converted in file file*.snappy.parquet. Column: 
> [AMOUNT], Expected: string, Found: INT64"
>  
> I know schema merging works if it finds a few extra columns in one file, but the 
> fields the files have in common need to have the same schema. That might not work 
> here.
>  
> I am looking for a workaround here, or if there is an option I haven't tried, 
> please point me to it.
>  
> With schema merging I got the error below:
> An error occurred while calling o2272.parquet. : 
> org.apache.spark.SparkException: Failed merging schema: root |-- ID: string 
> (nullable = true) |-- AMOUNT: decimal(15,6) (nullable = true) at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5(SchemaMergeUtils.scala:100)
>  at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5$adapted(SchemaMergeUtils.scala:95)
>  at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.mergeSchemasInParallel(SchemaMergeUtils.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.mergeSchemasInParallel(ParquetFileFormat.scala:485)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$.inferSchema(ParquetUtils.scala:107)
>  at 
> org.apache.spark.sql.execution.datasources.v2.parquet.ParquetTable.inferSchema(ParquetTable.scala:44)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
>  at scala.Option.orElse(Option.scala:447) at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
>  at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$.create(DataSourceV2Relation.scala:141)
>  at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:225)
>  at scala.Option.map(Option.scala:230) at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:206) at 
> org.apache.spark.sql.DataFrameReader.parquet(Dat

[jira] [Created] (SPARK-32380) sparksql cannot access hbase external table in hive

2020-07-21 Thread deyzhong (Jira)
deyzhong created SPARK-32380:


 Summary: sparksql cannot access hbase external table in hive
 Key: SPARK-32380
 URL: https://issues.apache.org/jira/browse/SPARK-32380
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
 Environment: ||component||version||
|hadoop|2.8.5|
|hive|2.3.7|
|spark|3.0.0|
|hbase|1.4.9|
Reporter: deyzhong


java.io.IOException: Cannot create a record reader because of a previous error. 
Please look at the previous logs lines from the task's full log for more 
details.
 at 
org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270)
 at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
 at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
 at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
 at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
 at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
 at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
 at 
org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412)
 at 
org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:58)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
 at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
 at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
 at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
 at scala.collection.Iterator.foreach(Iterator.scala:941)
 at scala.collection.Iterator.foreach$(Iterator.scala:941)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
 at scala.collection.IterableLike.foreach(IterableLike.scala:74)
 at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
 at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:490)
 at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
 at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:474)
 at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:490)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:206)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
 at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
 at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
 at org.apache.spark.deplo

[jira] [Assigned] (SPARK-32379) docker based spark release script should use correct CRAN repo.

2020-07-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32379:


Assignee: Prashant Sharma

> docker based spark release script should use correct CRAN repo.
> ---
>
> Key: SPARK-32379
> URL: https://issues.apache.org/jira/browse/SPARK-32379
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.6
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Blocker
>
> While running the dev/create-release/do-release-docker.sh script, it fails 
> with the following errors:
> {code}
> [root@kyok-test-1 ~]# tail docker-build.log 
> distribution that some required packages have not yet been created
> or been moved out of Incoming.
> The following information may help to resolve the situation:
> The following packages have unmet dependencies:
>  r-base : Depends: r-base-core (>= 4.0.2-1.1804.0) but it is not going to be 
> installed
>   Depends: r-recommended (= 4.0.2-1.1804.0) but it is not going to be 
> installed
>  r-base-dev : Depends: r-base-core (>= 4.0.2-1.1804.0) but it is not going to 
> be installed
> E: Unable to correct problems, you have held broken packages.
> The command '/bin/sh -c apt-get clean && apt-get update && $APT_INSTALL gnupg 
> ca-certificates apt-transport-https &&   echo 'deb 
> https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/' >> 
> /etc/apt/sources.list &&   gpg --keyserver keyserver.ubuntu.com --recv-key 
> E298A3A825C0D65DFD57CBB651716619E084DAB9 &&   gpg -a --export E084DAB9 | 
> apt-key add - &&   apt-get clean &&   rm -rf /var/lib/apt/lists/* &&   
> apt-get clean &&   apt-get update &&   $APT_INSTALL 
> software-properties-common &&   apt-add-repository -y ppa:brightbox/ruby-ng 
> &&   apt-get update &&   $APT_INSTALL openjdk-8-jdk &&   update-alternatives 
> --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java &&   $APT_INSTALL 
> curl wget git maven ivy subversion make gcc lsof libffi-dev pandoc 
> pandoc-citeproc libssl-dev libcurl4-openssl-dev libxml2-dev &&   ln -s -T 
> /usr/share/java/ivy.jar /usr/share/ant/lib/ivy.jar &&   curl -sL 
> https://deb.nodesource.com/setup_4.x | bash &&   $APT_INSTALL nodejs &&   
> $APT_INSTALL libpython2.7-dev libpython3-dev python-pip python3-pip &&   pip 
> install --upgrade pip && hash -r pip &&   pip install setuptools &&   pip 
> install $BASE_PIP_PKGS &&   pip install $PIP_PKGS &&   cd &&   virtualenv -p 
> python3 /opt/p35 &&   . /opt/p35/bin/activate &&   pip install setuptools &&  
>  pip install $BASE_PIP_PKGS &&   pip install $PIP_PKGS &&   $APT_INSTALL 
> r-base r-base-dev &&   $APT_INSTALL texlive-latex-base texlive 
> texlive-fonts-extra texinfo qpdf &&   Rscript -e "install.packages(c('curl', 
> 'xml2', 'httr', 'devtools', 'testthat', 'knitr', 'rmarkdown', 'roxygen2', 
> 'e1071', 'survival'), repos='https://cloud.r-project.org/')" &&   Rscript -e 
> "devtools::install_github('jimhester/lintr')" &&   $APT_INSTALL ruby2.3 
> ruby2.3-dev mkdocs &&   gem install jekyll --no-rdoc --no-ri -v 3.8.6 &&   
> gem install jekyll-redirect-from -v 0.15.0 &&   gem install pygments.rb' 
> returned a non-zero code: 100
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32379) docker based spark release script should use correct CRAN repo.

2020-07-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32379.
--
Fix Version/s: 2.4.7
   Resolution: Fixed

Issue resolved by pull request 29177
[https://github.com/apache/spark/pull/29177]

> docker based spark release script should use correct CRAN repo.
> ---
>
> Key: SPARK-32379
> URL: https://issues.apache.org/jira/browse/SPARK-32379
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.6
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Blocker
> Fix For: 2.4.7
>
>
> While running the dev/create-release/do-release-docker.sh script, it fails 
> with the following errors:
> {code}
> [root@kyok-test-1 ~]# tail docker-build.log 
> distribution that some required packages have not yet been created
> or been moved out of Incoming.
> The following information may help to resolve the situation:
> The following packages have unmet dependencies:
>  r-base : Depends: r-base-core (>= 4.0.2-1.1804.0) but it is not going to be 
> installed
>   Depends: r-recommended (= 4.0.2-1.1804.0) but it is not going to be 
> installed
>  r-base-dev : Depends: r-base-core (>= 4.0.2-1.1804.0) but it is not going to 
> be installed
> E: Unable to correct problems, you have held broken packages.
> The command '/bin/sh -c apt-get clean && apt-get update && $APT_INSTALL gnupg 
> ca-certificates apt-transport-https &&   echo 'deb 
> https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/' >> 
> /etc/apt/sources.list &&   gpg --keyserver keyserver.ubuntu.com --recv-key 
> E298A3A825C0D65DFD57CBB651716619E084DAB9 &&   gpg -a --export E084DAB9 | 
> apt-key add - &&   apt-get clean &&   rm -rf /var/lib/apt/lists/* &&   
> apt-get clean &&   apt-get update &&   $APT_INSTALL 
> software-properties-common &&   apt-add-repository -y ppa:brightbox/ruby-ng 
> &&   apt-get update &&   $APT_INSTALL openjdk-8-jdk &&   update-alternatives 
> --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java &&   $APT_INSTALL 
> curl wget git maven ivy subversion make gcc lsof libffi-dev pandoc 
> pandoc-citeproc libssl-dev libcurl4-openssl-dev libxml2-dev &&   ln -s -T 
> /usr/share/java/ivy.jar /usr/share/ant/lib/ivy.jar &&   curl -sL 
> https://deb.nodesource.com/setup_4.x | bash &&   $APT_INSTALL nodejs &&   
> $APT_INSTALL libpython2.7-dev libpython3-dev python-pip python3-pip &&   pip 
> install --upgrade pip && hash -r pip &&   pip install setuptools &&   pip 
> install $BASE_PIP_PKGS &&   pip install $PIP_PKGS &&   cd &&   virtualenv -p 
> python3 /opt/p35 &&   . /opt/p35/bin/activate &&   pip install setuptools &&  
>  pip install $BASE_PIP_PKGS &&   pip install $PIP_PKGS &&   $APT_INSTALL 
> r-base r-base-dev &&   $APT_INSTALL texlive-latex-base texlive 
> texlive-fonts-extra texinfo qpdf &&   Rscript -e "install.packages(c('curl', 
> 'xml2', 'httr', 'devtools', 'testthat', 'knitr', 'rmarkdown', 'roxygen2', 
> 'e1071', 'survival'), repos='https://cloud.r-project.org/')" &&   Rscript -e 
> "devtools::install_github('jimhester/lintr')" &&   $APT_INSTALL ruby2.3 
> ruby2.3-dev mkdocs &&   gem install jekyll --no-rdoc --no-ri -v 3.8.6 &&   
> gem install jekyll-redirect-from -v 0.15.0 &&   gem install pygments.rb' 
> returned a non-zero code: 100
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32379) docker based spark release script should use correct CRAN repo.

2020-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161848#comment-17161848
 ] 

Apache Spark commented on SPARK-32379:
--

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/29177

> docker based spark release script should use correct CRAN repo.
> ---
>
> Key: SPARK-32379
> URL: https://issues.apache.org/jira/browse/SPARK-32379
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.6
>Reporter: Prashant Sharma
>Priority: Blocker
>
> While running the dev/create-release/do-release-docker.sh script, it fails with 
> the following errors:
> {code}
> [root@kyok-test-1 ~]# tail docker-build.log 
> distribution that some required packages have not yet been created
> or been moved out of Incoming.
> The following information may help to resolve the situation:
> The following packages have unmet dependencies:
>  r-base : Depends: r-base-core (>= 4.0.2-1.1804.0) but it is not going to be 
> installed
>   Depends: r-recommended (= 4.0.2-1.1804.0) but it is not going to be 
> installed
>  r-base-dev : Depends: r-base-core (>= 4.0.2-1.1804.0) but it is not going to 
> be installed
> E: Unable to correct problems, you have held broken packages.
> The command '/bin/sh -c apt-get clean && apt-get update && $APT_INSTALL gnupg 
> ca-certificates apt-transport-https &&   echo 'deb 
> https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/' >> 
> /etc/apt/sources.list &&   gpg --keyserver keyserver.ubuntu.com --recv-key 
> E298A3A825C0D65DFD57CBB651716619E084DAB9 &&   gpg -a --export E084DAB9 | 
> apt-key add - &&   apt-get clean &&   rm -rf /var/lib/apt/lists/* &&   
> apt-get clean &&   apt-get update &&   $APT_INSTALL 
> software-properties-common &&   apt-add-repository -y ppa:brightbox/ruby-ng 
> &&   apt-get update &&   $APT_INSTALL openjdk-8-jdk &&   update-alternatives 
> --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java &&   $APT_INSTALL 
> curl wget git maven ivy subversion make gcc lsof libffi-dev pandoc 
> pandoc-citeproc libssl-dev libcurl4-openssl-dev libxml2-dev &&   ln -s -T 
> /usr/share/java/ivy.jar /usr/share/ant/lib/ivy.jar &&   curl -sL 
> https://deb.nodesource.com/setup_4.x | bash &&   $APT_INSTALL nodejs &&   
> $APT_INSTALL libpython2.7-dev libpython3-dev python-pip python3-pip &&   pip 
> install --upgrade pip && hash -r pip &&   pip install setuptools &&   pip 
> install $BASE_PIP_PKGS &&   pip install $PIP_PKGS &&   cd &&   virtualenv -p 
> python3 /opt/p35 &&   . /opt/p35/bin/activate &&   pip install setuptools &&  
>  pip install $BASE_PIP_PKGS &&   pip install $PIP_PKGS &&   $APT_INSTALL 
> r-base r-base-dev &&   $APT_INSTALL texlive-latex-base texlive 
> texlive-fonts-extra texinfo qpdf &&   Rscript -e "install.packages(c('curl', 
> 'xml2', 'httr', 'devtools', 'testthat', 'knitr', 'rmarkdown', 'roxygen2', 
> 'e1071', 'survival'), repos='https://cloud.r-project.org/')" &&   Rscript -e 
> "devtools::install_github('jimhester/lintr')" &&   $APT_INSTALL ruby2.3 
> ruby2.3-dev mkdocs &&   gem install jekyll --no-rdoc --no-ri -v 3.8.6 &&   
> gem install jekyll-redirect-from -v 0.15.0 &&   gem install pygments.rb' 
> returned a non-zero code: 100
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32379) docker based spark release script should use correct CRAN repo.

2020-07-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161846#comment-17161846
 ] 

Apache Spark commented on SPARK-32379:
--

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/29177

> docker based spark release script should use correct CRAN repo.
> ---
>
> Key: SPARK-32379
> URL: https://issues.apache.org/jira/browse/SPARK-32379
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.6
>Reporter: Prashant Sharma
>Priority: Blocker
>
> While running the dev/create-release/do-release-docker.sh script, it fails with 
> the following errors:
> {code}
> [root@kyok-test-1 ~]# tail docker-build.log 
> distribution that some required packages have not yet been created
> or been moved out of Incoming.
> The following information may help to resolve the situation:
> The following packages have unmet dependencies:
>  r-base : Depends: r-base-core (>= 4.0.2-1.1804.0) but it is not going to be 
> installed
>   Depends: r-recommended (= 4.0.2-1.1804.0) but it is not going to be 
> installed
>  r-base-dev : Depends: r-base-core (>= 4.0.2-1.1804.0) but it is not going to 
> be installed
> E: Unable to correct problems, you have held broken packages.
> The command '/bin/sh -c apt-get clean && apt-get update && $APT_INSTALL gnupg 
> ca-certificates apt-transport-https &&   echo 'deb 
> https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/' >> 
> /etc/apt/sources.list &&   gpg --keyserver keyserver.ubuntu.com --recv-key 
> E298A3A825C0D65DFD57CBB651716619E084DAB9 &&   gpg -a --export E084DAB9 | 
> apt-key add - &&   apt-get clean &&   rm -rf /var/lib/apt/lists/* &&   
> apt-get clean &&   apt-get update &&   $APT_INSTALL 
> software-properties-common &&   apt-add-repository -y ppa:brightbox/ruby-ng 
> &&   apt-get update &&   $APT_INSTALL openjdk-8-jdk &&   update-alternatives 
> --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java &&   $APT_INSTALL 
> curl wget git maven ivy subversion make gcc lsof libffi-dev pandoc 
> pandoc-citeproc libssl-dev libcurl4-openssl-dev libxml2-dev &&   ln -s -T 
> /usr/share/java/ivy.jar /usr/share/ant/lib/ivy.jar &&   curl -sL 
> https://deb.nodesource.com/setup_4.x | bash &&   $APT_INSTALL nodejs &&   
> $APT_INSTALL libpython2.7-dev libpython3-dev python-pip python3-pip &&   pip 
> install --upgrade pip && hash -r pip &&   pip install setuptools &&   pip 
> install $BASE_PIP_PKGS &&   pip install $PIP_PKGS &&   cd &&   virtualenv -p 
> python3 /opt/p35 &&   . /opt/p35/bin/activate &&   pip install setuptools &&  
>  pip install $BASE_PIP_PKGS &&   pip install $PIP_PKGS &&   $APT_INSTALL 
> r-base r-base-dev &&   $APT_INSTALL texlive-latex-base texlive 
> texlive-fonts-extra texinfo qpdf &&   Rscript -e "install.packages(c('curl', 
> 'xml2', 'httr', 'devtools', 'testthat', 'knitr', 'rmarkdown', 'roxygen2', 
> 'e1071', 'survival'), repos='https://cloud.r-project.org/')" &&   Rscript -e 
> "devtools::install_github('jimhester/lintr')" &&   $APT_INSTALL ruby2.3 
> ruby2.3-dev mkdocs &&   gem install jekyll --no-rdoc --no-ri -v 3.8.6 &&   
> gem install jekyll-redirect-from -v 0.15.0 &&   gem install pygments.rb' 
> returned a non-zero code: 100
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32379) docker based spark release script should use correct CRAN repo.

2020-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32379:


Assignee: Apache Spark

> docker based spark release script should use correct CRAN repo.
> ---
>
> Key: SPARK-32379
> URL: https://issues.apache.org/jira/browse/SPARK-32379
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.6
>Reporter: Prashant Sharma
>Assignee: Apache Spark
>Priority: Blocker
>
> While running the dev/create-release/do-release-docker.sh script, it fails with 
> the following errors:
> {code}
> [root@kyok-test-1 ~]# tail docker-build.log 
> distribution that some required packages have not yet been created
> or been moved out of Incoming.
> The following information may help to resolve the situation:
> The following packages have unmet dependencies:
>  r-base : Depends: r-base-core (>= 4.0.2-1.1804.0) but it is not going to be 
> installed
>   Depends: r-recommended (= 4.0.2-1.1804.0) but it is not going to be 
> installed
>  r-base-dev : Depends: r-base-core (>= 4.0.2-1.1804.0) but it is not going to 
> be installed
> E: Unable to correct problems, you have held broken packages.
> The command '/bin/sh -c apt-get clean && apt-get update && $APT_INSTALL gnupg 
> ca-certificates apt-transport-https &&   echo 'deb 
> https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/' >> 
> /etc/apt/sources.list &&   gpg --keyserver keyserver.ubuntu.com --recv-key 
> E298A3A825C0D65DFD57CBB651716619E084DAB9 &&   gpg -a --export E084DAB9 | 
> apt-key add - &&   apt-get clean &&   rm -rf /var/lib/apt/lists/* &&   
> apt-get clean &&   apt-get update &&   $APT_INSTALL 
> software-properties-common &&   apt-add-repository -y ppa:brightbox/ruby-ng 
> &&   apt-get update &&   $APT_INSTALL openjdk-8-jdk &&   update-alternatives 
> --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java &&   $APT_INSTALL 
> curl wget git maven ivy subversion make gcc lsof libffi-dev pandoc 
> pandoc-citeproc libssl-dev libcurl4-openssl-dev libxml2-dev &&   ln -s -T 
> /usr/share/java/ivy.jar /usr/share/ant/lib/ivy.jar &&   curl -sL 
> https://deb.nodesource.com/setup_4.x | bash &&   $APT_INSTALL nodejs &&   
> $APT_INSTALL libpython2.7-dev libpython3-dev python-pip python3-pip &&   pip 
> install --upgrade pip && hash -r pip &&   pip install setuptools &&   pip 
> install $BASE_PIP_PKGS &&   pip install $PIP_PKGS &&   cd &&   virtualenv -p 
> python3 /opt/p35 &&   . /opt/p35/bin/activate &&   pip install setuptools &&  
>  pip install $BASE_PIP_PKGS &&   pip install $PIP_PKGS &&   $APT_INSTALL 
> r-base r-base-dev &&   $APT_INSTALL texlive-latex-base texlive 
> texlive-fonts-extra texinfo qpdf &&   Rscript -e "install.packages(c('curl', 
> 'xml2', 'httr', 'devtools', 'testthat', 'knitr', 'rmarkdown', 'roxygen2', 
> 'e1071', 'survival'), repos='https://cloud.r-project.org/')" &&   Rscript -e 
> "devtools::install_github('jimhester/lintr')" &&   $APT_INSTALL ruby2.3 
> ruby2.3-dev mkdocs &&   gem install jekyll --no-rdoc --no-ri -v 3.8.6 &&   
> gem install jekyll-redirect-from -v 0.15.0 &&   gem install pygments.rb' 
> returned a non-zero code: 100
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32379) docker based spark release script should use correct CRAN repo.

2020-07-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32379:


Assignee: (was: Apache Spark)

> docker based spark release script should use correct CRAN repo.
> ---
>
> Key: SPARK-32379
> URL: https://issues.apache.org/jira/browse/SPARK-32379
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.6
>Reporter: Prashant Sharma
>Priority: Blocker
>
> While running the dev/create-release/do-release-docker.sh script, it fails with 
> the following errors:
> {code}
> [root@kyok-test-1 ~]# tail docker-build.log 
> distribution that some required packages have not yet been created
> or been moved out of Incoming.
> The following information may help to resolve the situation:
> The following packages have unmet dependencies:
>  r-base : Depends: r-base-core (>= 4.0.2-1.1804.0) but it is not going to be 
> installed
>   Depends: r-recommended (= 4.0.2-1.1804.0) but it is not going to be 
> installed
>  r-base-dev : Depends: r-base-core (>= 4.0.2-1.1804.0) but it is not going to 
> be installed
> E: Unable to correct problems, you have held broken packages.
> The command '/bin/sh -c apt-get clean && apt-get update && $APT_INSTALL gnupg 
> ca-certificates apt-transport-https &&   echo 'deb 
> https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/' >> 
> /etc/apt/sources.list &&   gpg --keyserver keyserver.ubuntu.com --recv-key 
> E298A3A825C0D65DFD57CBB651716619E084DAB9 &&   gpg -a --export E084DAB9 | 
> apt-key add - &&   apt-get clean &&   rm -rf /var/lib/apt/lists/* &&   
> apt-get clean &&   apt-get update &&   $APT_INSTALL 
> software-properties-common &&   apt-add-repository -y ppa:brightbox/ruby-ng 
> &&   apt-get update &&   $APT_INSTALL openjdk-8-jdk &&   update-alternatives 
> --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java &&   $APT_INSTALL 
> curl wget git maven ivy subversion make gcc lsof libffi-dev pandoc 
> pandoc-citeproc libssl-dev libcurl4-openssl-dev libxml2-dev &&   ln -s -T 
> /usr/share/java/ivy.jar /usr/share/ant/lib/ivy.jar &&   curl -sL 
> https://deb.nodesource.com/setup_4.x | bash &&   $APT_INSTALL nodejs &&   
> $APT_INSTALL libpython2.7-dev libpython3-dev python-pip python3-pip &&   pip 
> install --upgrade pip && hash -r pip &&   pip install setuptools &&   pip 
> install $BASE_PIP_PKGS &&   pip install $PIP_PKGS &&   cd &&   virtualenv -p 
> python3 /opt/p35 &&   . /opt/p35/bin/activate &&   pip install setuptools &&  
>  pip install $BASE_PIP_PKGS &&   pip install $PIP_PKGS &&   $APT_INSTALL 
> r-base r-base-dev &&   $APT_INSTALL texlive-latex-base texlive 
> texlive-fonts-extra texinfo qpdf &&   Rscript -e "install.packages(c('curl', 
> 'xml2', 'httr', 'devtools', 'testthat', 'knitr', 'rmarkdown', 'roxygen2', 
> 'e1071', 'survival'), repos='https://cloud.r-project.org/')" &&   Rscript -e 
> "devtools::install_github('jimhester/lintr')" &&   $APT_INSTALL ruby2.3 
> ruby2.3-dev mkdocs &&   gem install jekyll --no-rdoc --no-ri -v 3.8.6 &&   
> gem install jekyll-redirect-from -v 0.15.0 &&   gem install pygments.rb' 
> returned a non-zero code: 100
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32379) docker based spark release script should use correct CRAN repo.

2020-07-21 Thread Prashant Sharma (Jira)
Prashant Sharma created SPARK-32379:
---

 Summary: docker based spark release script should use correct CRAN 
repo.
 Key: SPARK-32379
 URL: https://issues.apache.org/jira/browse/SPARK-32379
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.4.6
Reporter: Prashant Sharma


While running the dev/create-release/do-release-docker.sh script, it fails with 
the following errors:

{code}
[root@kyok-test-1 ~]# tail docker-build.log 
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 r-base : Depends: r-base-core (>= 4.0.2-1.1804.0) but it is not going to be 
installed
  Depends: r-recommended (= 4.0.2-1.1804.0) but it is not going to be 
installed
 r-base-dev : Depends: r-base-core (>= 4.0.2-1.1804.0) but it is not going to 
be installed
E: Unable to correct problems, you have held broken packages.
The command '/bin/sh -c apt-get clean && apt-get update && $APT_INSTALL gnupg 
ca-certificates apt-transport-https &&   echo 'deb 
https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/' >> 
/etc/apt/sources.list &&   gpg --keyserver keyserver.ubuntu.com --recv-key 
E298A3A825C0D65DFD57CBB651716619E084DAB9 &&   gpg -a --export E084DAB9 | 
apt-key add - &&   apt-get clean &&   rm -rf /var/lib/apt/lists/* &&   apt-get 
clean &&   apt-get update &&   $APT_INSTALL software-properties-common &&   
apt-add-repository -y ppa:brightbox/ruby-ng &&   apt-get update &&   
$APT_INSTALL openjdk-8-jdk &&   update-alternatives --set java 
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java &&   $APT_INSTALL curl wget git 
maven ivy subversion make gcc lsof libffi-dev pandoc pandoc-citeproc 
libssl-dev libcurl4-openssl-dev libxml2-dev &&   ln -s -T 
/usr/share/java/ivy.jar /usr/share/ant/lib/ivy.jar &&   curl -sL 
https://deb.nodesource.com/setup_4.x | bash &&   $APT_INSTALL nodejs &&   
$APT_INSTALL libpython2.7-dev libpython3-dev python-pip python3-pip &&   pip 
install --upgrade pip && hash -r pip &&   pip install setuptools &&   pip 
install $BASE_PIP_PKGS &&   pip install $PIP_PKGS &&   cd &&   virtualenv -p 
python3 /opt/p35 &&   . /opt/p35/bin/activate &&   pip install setuptools &&   
pip install $BASE_PIP_PKGS &&   pip install $PIP_PKGS &&   $APT_INSTALL r-base 
r-base-dev &&   $APT_INSTALL texlive-latex-base texlive texlive-fonts-extra 
texinfo qpdf &&   Rscript -e "install.packages(c('curl', 'xml2', 'httr', 
'devtools', 'testthat', 'knitr', 'rmarkdown', 'roxygen2', 'e1071', 'survival'), 
repos='https://cloud.r-project.org/')" &&   Rscript -e 
"devtools::install_github('jimhester/lintr')" &&   $APT_INSTALL ruby2.3 
ruby2.3-dev mkdocs &&   gem install jekyll --no-rdoc --no-ri -v 3.8.6 &&   gem 
install jekyll-redirect-from -v 0.15.0 &&   gem install pygments.rb' returned a 
non-zero code: 100

{code}
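
Before rebuilding the whole image, a quick check from a bare bionic container can 
confirm which r-base candidates a given CRAN repository line actually resolves. A 
hypothetical diagnostic along those lines (the cran35 path is an assumption, not 
something taken from the fix itself):

{code}
# Add one candidate CRAN line to a clean Ubuntu 18.04 container and ask apt which
# R packages it can resolve, without building the full release image.
docker run --rm ubuntu:18.04 bash -c '
  apt-get update -qq &&
  apt-get install -y -qq gnupg dirmngr ca-certificates >/dev/null &&
  echo "deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/" >> /etc/apt/sources.list &&
  apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9 &&
  apt-get update -qq &&
  apt-cache policy r-base r-base-core r-recommended'
{code}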




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org