[jira] [Commented] (SPARK-21117) Built-in SQL Function Support - WIDTH_BUCKET
[ https://issues.apache.org/jira/browse/SPARK-21117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162520#comment-17162520 ] Apache Spark commented on SPARK-21117: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/29183 > Built-in SQL Function Support - WIDTH_BUCKET > > > Key: SPARK-21117 > URL: https://issues.apache.org/jira/browse/SPARK-21117 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.1.0 > > > For a given expression, the {{WIDTH_BUCKET}} function returns the bucket > number into which the value of this expression would fall after being > evaluated. > {code:sql} > WIDTH_BUCKET (expr , min_value , max_value , num_buckets) > {code} > Ref: > https://docs.oracle.com/cd/B28359_01/olap.111/b28126/dml_functions_2137.htm#OLADM717 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
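The bucket arithmetic described above can be sketched in a few lines of Python. This is a minimal model of the semantics in the linked Oracle reference (equal-width buckets, underflow bucket 0, overflow bucket num_buckets + 1), not Spark's eventual implementation; it also assumes the ascending case min_value < max_value.

```python
def width_bucket(expr, min_value, max_value, num_buckets):
    """Bucket number of `expr` over num_buckets equal-width buckets
    spanning [min_value, max_value); assumes min_value < max_value."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    if expr < min_value:
        return 0                    # underflow bucket
    if expr >= max_value:
        return num_buckets + 1      # overflow bucket
    bucket_width = (max_value - min_value) / num_buckets
    return int((expr - min_value) / bucket_width) + 1

print(width_bucket(5.35, 0.024, 10.06, 5))  # -> 3
```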
[jira] [Assigned] (SPARK-31922) "RpcEnv already stopped" error when exit spark-shell with local-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-31922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31922: - Assignee: wuyi > "RpcEnv already stopped" error when exit spark-shell with local-cluster mode > > > Key: SPARK-31922 > URL: https://issues.apache.org/jira/browse/SPARK-31922 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.6, 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > There's always an error from TransportRequestHandler when exiting spark-shell > under local-cluster mode: > > {code:java} > 20/06/06 23:08:29 ERROR TransportRequestHandler: Error while invoking > RpcHandler#receive() for one-way message.20/06/06 23:08:29 ERROR > TransportRequestHandler: Error while invoking RpcHandler#receive() for > one-way message.org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already > stopped. at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:167) at > org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:150) > at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:691) > at > org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:253) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) > at > 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) at > io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) > at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) > at > io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > at java.lang.Thread.run(Thread.java:748)20/06/06 23:08:29 ERROR > TransportRequestHandler: Error while invoking RpcHandler#receive() for > one-way message.org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already > stopped. at > org.apache.spark.rpc.nett
[jira] [Resolved] (SPARK-31922) "RpcEnv already stopped" error when exit spark-shell with local-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-31922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31922. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28746 [https://github.com/apache/spark/pull/28746] > "RpcEnv already stopped" error when exit spark-shell with local-cluster mode > > > Key: SPARK-31922 > URL: https://issues.apache.org/jira/browse/SPARK-31922 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.6, 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.1.0 > > > There's always an error from TransportRequestHandler when exiting spark-shell > under local-cluster mode: > > {code:java} > 20/06/06 23:08:29 ERROR TransportRequestHandler: Error while invoking > RpcHandler#receive() for one-way message.20/06/06 23:08:29 ERROR > TransportRequestHandler: Error while invoking RpcHandler#receive() for > one-way message.org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already > stopped. 
at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:167) at > org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:150) > at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:691) > at > org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:253) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) > at > 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) > at > io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) at > io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) > at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) > at > io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > at java.lang.Thread.run(Thread.java:748)20/06/06 23:08:29 ERROR > TransportRequestHandler: Error while invoking RpcHand
[jira] [Commented] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost
[ https://issues.apache.org/jira/browse/SPARK-32003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162479#comment-17162479 ] Apache Spark commented on SPARK-32003: -- User 'wypoon' has created a pull request for this issue: https://github.com/apache/spark/pull/29182 > Shuffle files for lost executor are not unregistered if fetch failure occurs > after executor is lost > --- > > Key: SPARK-32003 > URL: https://issues.apache.org/jira/browse/SPARK-32003 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.4.6, 3.0.0 >Reporter: Wing Yew Poon >Priority: Major > > A customer's cluster has a node that goes down while a Spark application is > running. (They are running Spark on YARN with the external shuffle service > enabled.) An executor is lost (apparently the only one running on the node). > This executor-lost event is handled in the DAGScheduler, which removes the > executor from its BlockManagerMaster. At this point, there is no > unregistering of shuffle files for the executor or the node. Soon after, > tasks trying to fetch shuffle files output by that executor fail with > FetchFailed (because the node is down, there is no NodeManager available to > serve shuffle files). By rights, such fetch failures should cause the shuffle > files for the executor to be unregistered, but they do not. > Due to task failure, the stage is re-attempted. Tasks continue to fail due to > fetch failure from the lost executor's shuffle output. This time, since the > failed epoch for the executor is higher, the executor is removed again (this > doesn't really do anything, the executor was already removed when it was > lost) and this time the shuffle output is unregistered. > So it takes two stage attempts instead of one to clear the shuffle output. We > get 4 attempts by default. The customer was unlucky and two nodes went down > during the stage, i.e., the same problem happened twice.
So they used up 4 > stage attempts and the stage failed and thus the job.
[jira] [Commented] (SPARK-32351) Partially pushed partition filters are not explained
[ https://issues.apache.org/jira/browse/SPARK-32351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162477#comment-17162477 ] pavithra ramachandran commented on SPARK-32351: --- I would like to check this. > Partially pushed partition filters are not explained > > > Key: SPARK-32351 > URL: https://issues.apache.org/jira/browse/SPARK-32351 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce this issue: > {code:scala} > spark.sql( > s""" > |CREATE TABLE t(i INT, p STRING) > |USING parquet > |PARTITIONED BY (p)""".stripMargin) > spark.range(0, 1000).selectExpr("id as col").createOrReplaceTempView("temp") > for (part <- Seq(1, 2, 3, 4)) { > sql(s""" > |INSERT OVERWRITE TABLE t PARTITION (p='$part') > |SELECT col FROM temp""".stripMargin) > } > spark.sql("SELECT * FROM t WHERE (p = '1' AND i = 1) OR (p = '2' and i > = 2)").explain > {code} > We have pushed down {{p = '1' or p = '2'}} since SPARK-28169, but this > pushed-down filter does not appear in the explain output > {noformat} > == Physical Plan == > *(1) Filter (((p#21 = 1) AND (i#20 = 1)) OR ((p#21 = 2) AND (i#20 = 2))) > +- *(1) ColumnarToRow >+- FileScan parquet default.t[i#20,p#21] Batched: true, DataFilters: [], > Format: Parquet, Location: > InMemoryFileIndex[file:/Users/yumwang/spark/SPARK-32289/sql/core/spark-warehouse/org.apache.spark..., > PartitionFilters: [], PushedFilters: [], ReadSchema: struct > {noformat}
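The partial pushdown from SPARK-28169 works by weakening the filter: each OR branch keeps only its partition-column conjuncts, and a filter can be pushed only if every branch contributes one. A rough Python sketch of that idea, using a toy predicate representation (this is an illustration, not Spark's Catalyst code):

```python
# A predicate in disjunctive normal form: a list of OR branches,
# each branch a list of (column, value) equality conjuncts.
def extract_partition_filter(branches, partition_cols):
    pushed = []
    for conjuncts in branches:
        part_only = [c for c in conjuncts if c[0] in partition_cols]
        if not part_only:
            # One branch constrains no partition column, so no partition
            # filter weaker than the original can be derived safely.
            return None
        pushed.append(part_only)
    return pushed

# (p = '1' AND i = 1) OR (p = '2' AND i = 2)  =>  p = '1' OR p = '2'
f = [[("p", "1"), ("i", 1)], [("p", "2"), ("i", 2)]]
print(extract_partition_filter(f, {"p"}))  # -> [[('p', '1')], [('p', '2')]]
```

The bug in this ticket is that the derived filter, although applied, never shows up under `PartitionFilters` in the explain output.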
[jira] [Updated] (SPARK-32059) Nested Schema Pruning not Working in Window Functions
[ https://issues.apache.org/jira/browse/SPARK-32059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32059: -- Affects Version/s: (was: 3.0.0) 3.1.0 > Nested Schema Pruning not Working in Window Functions > - > > Key: SPARK-32059 > URL: https://issues.apache.org/jira/browse/SPARK-32059 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Frank Yin >Priority: Major > > Using tables and data structures in `SchemaPruningSuite.scala` > > {code:java} > // code placeholder > case class FullName(first: String, middle: String, last: String) > case class Company(name: String, address: String) > case class Employer(id: Int, company: Company) > case class Contact( > id: Int, > name: FullName, > address: String, > pets: Int, > friends: Array[FullName] = Array.empty, > relatives: Map[String, FullName] = Map.empty, > employer: Employer = null, > relations: Map[FullName, String] = Map.empty) > case class Department( > depId: Int, > depName: String, > contactId: Int, > employer: Employer) > {code} > > The query to run: > {code:java} > // code placeholder > select a.name.first from (select row_number() over (partition by address > order by id desc) as __rank, contacts.* from contacts) a where a.name.first = > 'A' AND a.__rank = 1 > {code} > > The current physical plan: > {code:java} > // code placeholder > == Physical Plan == > *(3) Project [name#46.first AS first#74] > +- *(3) Filter (((isnotnull(name#46) AND isnotnull(__rank#71)) AND > (name#46.first = A)) AND (__rank#71 = 1)) >+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS > LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS __rank#71], [address#47], [id#45 DESC NULLS LAST] > +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], > false, 0 > +- Exchange hashpartitioning(address#47, 5), true, [id=#52] > +- *(1) Project [id#45, name#46, address#47] >+- FileScan parquet 
[id#45,name#46,address#47,p#53] Batched: > false, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-85d173af-42..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct,address:string> > {code} > > The desired physical plan: > > {code:java} > // code placeholder > == Physical Plan == > *(3) Project [_gen_alias_77#77 AS first#74] > +- *(3) Filter (((isnotnull(_gen_alias_77#77) AND isnotnull(__rank#71)) AND > (_gen_alias_77#77 = A)) AND (__rank#71 = 1)) >+- Window [row_number() windowspecdefinition(address#47, id#45 DESC NULLS > LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) > AS __rank#71], [address#47], [id#45 DESC NULLS LAST] > +- *(2) Sort [address#47 ASC NULLS FIRST, id#45 DESC NULLS LAST], > false, 0 > +- Exchange hashpartitioning(address#47, 5), true, [id=#52] > +- *(1) Project [id#45, name#46.first AS _gen_alias_77#77, > address#47] >+- FileScan parquet [id#45,name#46,address#47,p#53] Batched: > false, DataFilters: [], Format: Parquet, Location: > InMemoryFileIndex[file:/private/var/folders/_c/4r2j33dd14n9ldfc2xqyzs40gn/T/spark-c64e0b29-d9..., > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct,address:string> > {code}
[jira] [Resolved] (SPARK-32350) Add batch write support on LevelDB to improve performance of HybridStore
[ https://issues.apache.org/jira/browse/SPARK-32350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-32350. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29149 [https://github.com/apache/spark/pull/29149] > Add batch write support on LevelDB to improve performance of HybridStore > > > Key: SPARK-32350 > URL: https://issues.apache.org/jira/browse/SPARK-32350 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.1, 3.1.0 >Reporter: Baohe Zhang >Assignee: Baohe Zhang >Priority: Major > Fix For: 3.1.0 > > > The idea is to improve the performance of HybridStore by adding batch write > support to LevelDB. https://issues.apache.org/jira/browse/SPARK-31608 > introduces HybridStore. HybridStore will write data to InMemoryStore at first > and use a background thread to dump data to LevelDB once the writing to > InMemoryStore is completed. In the comments section of > [https://github.com/apache/spark/pull/28412], Mridul Muralidharan mentioned > using batch writing can improve the performance of this dumping process and > he wrote the code of writeAll(). > I did the comparison of the HybridStore switching time between one-by-one > write and batch write on an HDD disk. When the disk is free, the batch-write > has around 25% improvement, and when the disk is 100% busy, the batch-write > has 7x - 10x improvement. 
> when the disk is at 0% utilization:
>
> ||log size, jobs and tasks per job||original switching time, with write()||switching time with writeAll()||
> |133m, 400 jobs, 100 tasks per job|16s|13s|
> |265m, 400 jobs, 200 tasks per job|30s|23s|
> |1.3g, 1000 jobs, 400 tasks per job|136s|108s|
>
> when the disk is at 100% utilization:
>
> ||log size, jobs and tasks per job||original switching time, with write()||switching time with writeAll()||
> |133m, 400 jobs, 100 tasks per job|116s|17s|
> |265m, 400 jobs, 200 tasks per job|251s|26s|
>
> I also ran some write related benchmarking tests on LevelDBBenchmark.java and measured the total time of writing 1024 objects.
>
> when the disk is at 0% utilization:
>
> ||Benchmark test||with write(), ms||with writeAll(), ms||
> |randomUpdatesIndexed|213.060|157.356|
> |randomUpdatesNoIndex|57.869|35.439|
> |randomWritesIndexed|298.854|229.274|
> |randomWritesNoIndex|66.764|38.361|
> |sequentialUpdatesIndexed|87.019|56.219|
> |sequentialUpdatesNoIndex|61.851|41.942|
> |sequentialWritesIndexed|94.044|56.534|
> |sequentialWritesNoIndex|118.345|66.483|
>
> when the disk is at 50% utilization:
>
> ||Benchmark test||with write(), ms||with writeAll(), ms||
> |randomUpdatesIndexed|230.386|180.817|
> |randomUpdatesNoIndex|58.935|50.113|
> |randomWritesIndexed|315.241|254.400|
> |randomWritesNoIndex|96.709|41.164|
> |sequentialUpdatesIndexed|89.971|70.387|
> |sequentialUpdatesNoIndex|72.021|53.769|
> |sequentialWritesIndexed|103.052|67.358|
> |sequentialWritesNoIndex|76.194|99.037|
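The writeAll() idea is simply to amortize per-write overhead by committing many puts in one batch instead of one commit per put. LevelDB's WriteBatch is not available in a few stdlib lines, so the sketch below uses sqlite3 as a stand-in key-value store; this is an illustration of the batching pattern only, not the actual patch.

```python
import sqlite3

def write_one_by_one(conn, pairs):
    # One implicit commit per put: the pattern batch writing replaces.
    for k, v in pairs:
        with conn:
            conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (k, v))

def write_all(conn, pairs):
    # One transaction for the whole dump, analogous to a LevelDB WriteBatch.
    with conn:
        conn.executemany("INSERT OR REPLACE INTO kv VALUES (?, ?)", pairs)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
write_all(conn, ((str(i), "task-%d" % i) for i in range(1024)))
print(conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0])  # -> 1024
```

With an in-memory database the two functions perform similarly; the gap the benchmark tables above show comes from fsync-per-commit on a real disk, which batching collapses to one.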
[jira] [Assigned] (SPARK-32350) Add batch write support on LevelDB to improve performance of HybridStore
[ https://issues.apache.org/jira/browse/SPARK-32350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-32350: Assignee: Baohe Zhang > Add batch write support on LevelDB to improve performance of HybridStore > > > Key: SPARK-32350 > URL: https://issues.apache.org/jira/browse/SPARK-32350 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.1, 3.1.0 >Reporter: Baohe Zhang >Assignee: Baohe Zhang >Priority: Major > > The idea is to improve the performance of HybridStore by adding batch write > support to LevelDB. https://issues.apache.org/jira/browse/SPARK-31608 > introduces HybridStore. HybridStore will write data to InMemoryStore at first > and use a background thread to dump data to LevelDB once the writing to > InMemoryStore is completed. In the comments section of > [https://github.com/apache/spark/pull/28412], Mridul Muralidharan mentioned > using batch writing can improve the performance of this dumping process and > he wrote the code of writeAll(). > I did the comparison of the HybridStore switching time between one-by-one > write and batch write on an HDD disk. When the disk is free, the batch-write > has around 25% improvement, and when the disk is 100% busy, the batch-write > has 7x - 10x improvement. > when the disk is at 0% utilization: > > ||log size, jobs and tasks per job||original switching time, with > write()||switching time with writeAll()|| > |133m, 400 jobs, 100 tasks per job|16s|13s| > |265m, 400 jobs, 200 tasks per job|30s|23s| > |1.3g, 1000 jobs, 400 tasks per job|136s|108s| > > when the disk is at 100% utilization: > ||log size, jobs and tasks per job||original switching time, with > write()||switching time with writeAll()|| > |133m, 400 jobs, 100 tasks per job|116s|17s| > |265m, 400 jobs, 200 tasks per job|251s|26s| > I also ran some write related benchmarking tests on LevelDBBenchmark.java and > measured the total time of writing 1024 objects. 
> when the disk is at 0% utilization:
>
> ||Benchmark test||with write(), ms||with writeAll(), ms||
> |randomUpdatesIndexed|213.060|157.356|
> |randomUpdatesNoIndex|57.869|35.439|
> |randomWritesIndexed|298.854|229.274|
> |randomWritesNoIndex|66.764|38.361|
> |sequentialUpdatesIndexed|87.019|56.219|
> |sequentialUpdatesNoIndex|61.851|41.942|
> |sequentialWritesIndexed|94.044|56.534|
> |sequentialWritesNoIndex|118.345|66.483|
>
> when the disk is at 50% utilization:
>
> ||Benchmark test||with write(), ms||with writeAll(), ms||
> |randomUpdatesIndexed|230.386|180.817|
> |randomUpdatesNoIndex|58.935|50.113|
> |randomWritesIndexed|315.241|254.400|
> |randomWritesNoIndex|96.709|41.164|
> |sequentialUpdatesIndexed|89.971|70.387|
> |sequentialUpdatesNoIndex|72.021|53.769|
> |sequentialWritesIndexed|103.052|67.358|
> |sequentialWritesNoIndex|76.194|99.037|
[jira] [Updated] (SPARK-32330) Preserve shuffled hash join build side partitioning
[ https://issues.apache.org/jira/browse/SPARK-32330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-32330: Priority: Major (was: Trivial) > Preserve shuffled hash join build side partitioning > --- > > Key: SPARK-32330 > URL: https://issues.apache.org/jira/browse/SPARK-32330 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Major > Fix For: 3.1.0 > > > Currently `ShuffledHashJoin.outputPartitioning` inherits from > `HashJoin.outputPartitioning`, which only preserves stream side partitioning: > `HashJoin.scala` > {code:java} > override def outputPartitioning: Partitioning = > streamedPlan.outputPartitioning > {code} > This loses build side partitioning information, and causes extra shuffle if > there's another join / group-by after this join. > Example: > > {code:java} > // code placeholder > withSQLConf( > SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50", > SQLConf.SHUFFLE_PARTITIONS.key -> "2", > SQLConf.PREFER_SORTMERGEJOIN.key -> "false") { > val df1 = spark.range(10).select($"id".as("k1")) > val df2 = spark.range(30).select($"id".as("k2")) > Seq("inner", "cross").foreach(joinType => { > val plan = df1.join(df2, $"k1" === $"k2", joinType).groupBy($"k1").count() > .queryExecution.executedPlan > assert(plan.collect { case _: ShuffledHashJoinExec => true }.size === 1) > // No extra shuffle before aggregate > assert(plan.collect { case _: ShuffleExchangeExec => true }.size === 2) > }) > }{code} > > Current physical plan (having an extra shuffle on `k1` before aggregate) > > {code:java} > *(4) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, > count#235L]) > +- Exchange hashpartitioning(k1#220L, 2), true, [id=#117] >+- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], > output=[k1#220L, count#239L]) > +- *(3) Project [k1#220L] > +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft > :- Exchange 
hashpartitioning(k1#220L, 2), true, [id=#109] > : +- *(1) Project [id#218L AS k1#220L] > : +- *(1) Range (0, 10, step=1, splits=2) > +- Exchange hashpartitioning(k2#224L, 2), true, [id=#111] >+- *(2) Project [id#222L AS k2#224L] > +- *(2) Range (0, 30, step=1, splits=2){code} > > Ideal physical plan (no shuffle on `k1` before aggregate) > {code:java} > *(3) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, > count#235L]) > +- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], > output=[k1#220L, count#239L]) >+- *(3) Project [k1#220L] > +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft > :- Exchange hashpartitioning(k1#220L, 2), true, [id=#107] > : +- *(1) Project [id#218L AS k1#220L] > : +- *(1) Range (0, 10, step=1, splits=2) > +- Exchange hashpartitioning(k2#224L, 2), true, [id=#109] > +- *(2) Project [id#222L AS k2#224L] >+- *(2) Range (0, 30, step=1, splits=2){code} > > This can be fixed by overriding `outputPartitioning` method in > `ShuffledHashJoinExec`, similar to `SortMergeJoinExec`.
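The reason the extra exchange is avoidable: after a shuffled hash join on k1 = k2, rows are already hash-partitioned by the join key, so a following group-by on k1 can run partition-locally. A toy Python model of that co-partitioning argument (not Spark code; the two-partition layout mirrors SHUFFLE_PARTITIONS=2 above):

```python
NUM_PARTITIONS = 2

def partition_of(key):
    # Stand-in for hash partitioning on the join key.
    return hash(key) % NUM_PARTITIONS

left = list(range(10))    # k1 side, already exchanged on k1
right = list(range(30))   # k2 side, already exchanged on k2
parts = [([], []) for _ in range(NUM_PARTITIONS)]
for k in left:
    parts[partition_of(k)][0].append(k)
for k in right:
    parts[partition_of(k)][1].append(k)

# Join then count(*) group by k1, computed entirely within each partition:
# equal keys land in the same partition on both sides, so the per-partition
# results are already the global ones and no further shuffle is needed.
counts = {}
for lks, rks in parts:
    rset = set(rks)
    for k in lks:
        if k in rset:
            counts[k] = counts.get(k, 0) + 1
```

The fix makes `ShuffledHashJoinExec` report this fact through `outputPartitioning`, so the planner stops inserting the redundant exchange.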
[jira] [Commented] (SPARK-23844) Socket Stream recovering from checkpoint will throw exception
[ https://issues.apache.org/jira/browse/SPARK-23844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162438#comment-17162438 ] pengzhiwei commented on SPARK-23844: Thanks [~jerryshao2015], I have met the same issue. > Socket Stream recovering from checkpoint will throw exception > - > > Key: SPARK-23844 > URL: https://issues.apache.org/jira/browse/SPARK-23844 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Saisai Shao >Priority: Major > > When we specify a checkpoint location while using socket streaming, it > will throw an exception after rerun: > {noformat} > 18/04/02 14:11:28 ERROR MicroBatchExecution: Query test [id = > c5ca82b2-550b-4c3d-9127-869f1aeae477, runId = > 552d5bd4-a7e7-44e5-a85a-2f04f666ff6a] terminated with error > java.lang.RuntimeException: Offsets committed out of order: 0 followed by -1 > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.streaming.sources.TextSocketMicroBatchReader.commit(socket.scala:196) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcV$sp$5.apply(MicroBatchExecution.scala:373) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1$$anonfun$apply$mcV$sp$5.apply(MicroBatchExecution.scala:370) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at > org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) > at > 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply$mcV$sp(MicroBatchExecution.scala:370) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply(MicroBatchExecution.scala:353) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply(MicroBatchExecution.scala:353) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch(MicroBatchExecution.scala:353) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:142) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:135) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:135) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:135) > at > org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56) > at > 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:131) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189){noformat} > Basically, this means that {{TextSocketMicroBatchReader}} honors the offsets recovered from the checkpoint, which is not correct for the socket source, as it does not support recovery from a checkpoint. Even though the offset is recovered, the real data no longer matches that offset. > To reproduce this issue, > {code:java} > val socket = spark.readStream.format("socket").options(Map("host" -> > "localhost", "port" -> "")).load > spark.conf.set("spark.sql.streaming.checkpointLocation", "./checkpoint") > socket.writeStream.format("parquet").option("path", > "./result").queryName("test").s
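The failure above reduces to one invariant: a source must never commit an offset behind the last committed one, and a restarted socket source starts back at -1 while the recovered checkpoint has already committed 0. A minimal Python model of that check (class and method names are hypothetical, not Spark's actual API):

```python
class CommitOrderError(RuntimeError):
    pass

class OffsetTracker:
    """Tracks the last committed offset and rejects regressions, mirroring
    the check that trips after a checkpoint restore with the socket source."""
    def __init__(self):
        self.last_committed = -1  # no batch committed yet

    def commit(self, offset):
        # After restart, the recovered checkpoint replays a commit (e.g. 0)
        # while the source itself has reset to -1, so the next commit regresses.
        if offset < self.last_committed:
            raise CommitOrderError(
                f"Offsets committed out of order: "
                f"{self.last_committed} followed by {offset}")
        self.last_committed = offset

tracker = OffsetTracker()
tracker.commit(0)       # first batch commits fine
try:
    tracker.commit(-1)  # restarted source hands back -1 -> out of order
except CommitOrderError as e:
    print(e)            # Offsets committed out of order: 0 followed by -1
```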
[jira] [Assigned] (SPARK-32383) Preserve hash join (BHJ and SHJ) stream side ordering
[ https://issues.apache.org/jira/browse/SPARK-32383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32383: Assignee: (was: Apache Spark) > Preserve hash join (BHJ and SHJ) stream side ordering > - > > Key: SPARK-32383 > URL: https://issues.apache.org/jira/browse/SPARK-32383 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Trivial > > Currently `BroadcastHashJoinExec` and `ShuffledHashJoinExec` do not preserve > children output ordering information (inherit from > `SparkPlan.outputOrdering`, which is Nil). This can add unnecessary sort in > complex queries involved multiple joins. > Example: > > {code:java} > withSQLConf( > SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50") { > val df1 = spark.range(100).select($"id".as("k1")) > val df2 = spark.range(100).select($"id".as("k2")) > val df3 = spark.range(3).select($"id".as("k3")) > val df4 = spark.range(100).select($"id".as("k4")) > val plan = df1.join(df2, $"k1" === $"k2") > .join(df3, $"k1" === $"k3") > .join(df4, $"k1" === $"k4") > .queryExecution > .executedPlan > } > {code} > > Current physical plan (extra sort on `k1` before top sort merge join): > {code:java} > *(9) SortMergeJoin [k1#220L], [k4#232L], Inner > :- *(6) Sort [k1#220L ASC NULLS FIRST], false, 0 > : +- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight > : :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner > : : :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0 > : : : +- Exchange hashpartitioning(k1#220L, 5), true, [id=#128] > : : : +- *(1) Project [id#218L AS k1#220L] > : : :+- *(1) Range (0, 100, step=1, splits=2) > : : +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0 > : : +- Exchange hashpartitioning(k2#224L, 5), true, [id=#134] > : :+- *(3) Project [id#222L AS k2#224L] > : : +- *(3) Range (0, 100, step=1, splits=2) > : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, > false])), [id=#141] > :+- 
*(5) Project [id#226L AS k3#228L] > : +- *(5) Range (0, 3, step=1, splits=2) > +- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(k4#232L, 5), true, [id=#148] > +- *(7) Project [id#230L AS k4#232L] > +- *(7) Range (0, 100, step=1, splits=2) > {code} > Ideal physical plan (no extra sort on `k1` before top sort merge join): > {code:java} > *(9) SortMergeJoin [k1#220L], [k4#232L], Inner > :- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight > : :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner > : : :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0 > : : : +- Exchange hashpartitioning(k1#220L, 5), true, [id=#127] > : : : +- *(1) Project [id#218L AS k1#220L] > : : :+- *(1) Range (0, 100, step=1, splits=2) > : : +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0 > : : +- Exchange hashpartitioning(k2#224L, 5), true, [id=#133] > : :+- *(3) Project [id#222L AS k2#224L] > : : +- *(3) Range (0, 100, step=1, splits=2) > : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, > false])), [id=#140] > : +- *(5) Project [id#226L AS k3#228L] > :+- *(5) Range (0, 3, step=1, splits=2) > +- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(k4#232L, 5), true, [id=#146] > +- *(7) Project [id#230L AS k4#232L] > +- *(7) Range (0, 100, step=1, splits=2){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32383) Preserve hash join (BHJ and SHJ) stream side ordering
[ https://issues.apache.org/jira/browse/SPARK-32383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162413#comment-17162413 ] Apache Spark commented on SPARK-32383: -- User 'c21' has created a pull request for this issue: https://github.com/apache/spark/pull/29181 > Preserve hash join (BHJ and SHJ) stream side ordering > - > > Key: SPARK-32383 > URL: https://issues.apache.org/jira/browse/SPARK-32383 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32383) Preserve hash join (BHJ and SHJ) stream side ordering
[ https://issues.apache.org/jira/browse/SPARK-32383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32383: Assignee: Apache Spark > Preserve hash join (BHJ and SHJ) stream side ordering > - > > Key: SPARK-32383 > URL: https://issues.apache.org/jira/browse/SPARK-32383 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Assignee: Apache Spark >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32383) Preserve hash join (BHJ and SHJ) stream side ordering
Cheng Su created SPARK-32383: Summary: Preserve hash join (BHJ and SHJ) stream side ordering Key: SPARK-32383 URL: https://issues.apache.org/jira/browse/SPARK-32383 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Cheng Su Currently `BroadcastHashJoinExec` and `ShuffledHashJoinExec` do not preserve children output ordering information (inherit from `SparkPlan.outputOrdering`, which is Nil). This can add unnecessary sort in complex queries involved multiple joins. Example: {code:java} withSQLConf( SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50") { val df1 = spark.range(100).select($"id".as("k1")) val df2 = spark.range(100).select($"id".as("k2")) val df3 = spark.range(3).select($"id".as("k3")) val df4 = spark.range(100).select($"id".as("k4")) val plan = df1.join(df2, $"k1" === $"k2") .join(df3, $"k1" === $"k3") .join(df4, $"k1" === $"k4") .queryExecution .executedPlan } {code} Current physical plan (extra sort on `k1` before top sort merge join): {code:java} *(9) SortMergeJoin [k1#220L], [k4#232L], Inner :- *(6) Sort [k1#220L ASC NULLS FIRST], false, 0 : +- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight : :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner : : :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0 : : : +- Exchange hashpartitioning(k1#220L, 5), true, [id=#128] : : : +- *(1) Project [id#218L AS k1#220L] : : :+- *(1) Range (0, 100, step=1, splits=2) : : +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0 : : +- Exchange hashpartitioning(k2#224L, 5), true, [id=#134] : :+- *(3) Project [id#222L AS k2#224L] : : +- *(3) Range (0, 100, step=1, splits=2) : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])), [id=#141] :+- *(5) Project [id#226L AS k3#228L] : +- *(5) Range (0, 3, step=1, splits=2) +- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(k4#232L, 5), true, [id=#148] +- *(7) Project [id#230L AS k4#232L] +- *(7) Range (0, 100, step=1, splits=2) {code} 
Ideal physical plan (no extra sort on `k1` before top sort merge join): {code:java} *(9) SortMergeJoin [k1#220L], [k4#232L], Inner :- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight : :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner : : :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0 : : : +- Exchange hashpartitioning(k1#220L, 5), true, [id=#127] : : : +- *(1) Project [id#218L AS k1#220L] : : :+- *(1) Range (0, 100, step=1, splits=2) : : +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0 : : +- Exchange hashpartitioning(k2#224L, 5), true, [id=#133] : :+- *(3) Project [id#222L AS k2#224L] : : +- *(3) Range (0, 100, step=1, splits=2) : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])), [id=#140] : +- *(5) Project [id#226L AS k3#228L] :+- *(5) Range (0, 3, step=1, splits=2) +- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(k4#232L, 5), true, [id=#146] +- *(7) Project [id#230L AS k4#232L] +- *(7) Range (0, 100, step=1, splits=2){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
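The improvement amounts to having the hash join operators report the stream side's output ordering instead of Nil, so the parent sort merge join no longer re-sorts on `k1`. A toy Python model of that pruning rule (names are hypothetical, not Spark's planner API):

```python
class Node:
    """Base plan node; by default no ordering is reported (models Nil)."""
    def output_ordering(self):
        return []

class Sort(Node):
    def __init__(self, keys, child):
        self.keys, self.child = keys, child
    def output_ordering(self):
        return self.keys

class HashJoin(Node):
    def __init__(self, stream, build, preserve=True):
        self.stream, self.build, self.preserve = stream, build, preserve
    def output_ordering(self):
        # preserve=True models the proposed fix: inherit the stream side's
        # ordering. preserve=False models the current default of Nil,
        # which forces the parent to add a redundant Sort.
        return self.stream.output_ordering() if self.preserve else []

def ensure_sorted(keys, child):
    """Add a Sort only when the child does not already satisfy `keys`."""
    return child if child.output_ordering() == keys else Sort(keys, child)

base = Sort(["k1"], Node())  # stream side already sorted on k1
old = ensure_sorted(["k1"], HashJoin(base, Node(), preserve=False))
new = ensure_sorted(["k1"], HashJoin(base, Node(), preserve=True))
print(isinstance(old, Sort))  # True: extra sort inserted
print(isinstance(new, Sort))  # False: sort elided
```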
[jira] [Resolved] (SPARK-32286) Coalesce bucketed tables for shuffled hash join if applicable
[ https://issues.apache.org/jira/browse/SPARK-32286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-32286. -- Fix Version/s: 3.1.0 Assignee: Cheng Su Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/29079|https://github.com/apache/spark/pull/29079#] > Coalesce bucketed tables for shuffled hash join if applicable > - > > Key: SPARK-32286 > URL: https://issues.apache.org/jira/browse/SPARK-32286 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Trivial > Fix For: 3.1.0 > > > Based on a follow-up comment in PR [#28123|https://github.com/apache/spark/pull/28123], we can coalesce buckets for shuffled hash join as well. Note that we only coalesce buckets on the shuffled hash join stream side (i.e. the side not building the hash map), so we don't need to worry about OOM when coalescing multiple buckets into one task while building the hash map. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
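Why coalescing is safe at all: Spark assigns a row to bucket hash(key) mod numBuckets, so when one table's bucket count divides the other's, buckets of the larger table fold onto the smaller table's bucketing without a shuffle. A sketch of that arithmetic (illustrative only, not Spark's implementation):

```python
def bucket_id(key_hash, num_buckets):
    # A row lands in bucket hash(key) mod numBuckets (sign handling elided).
    return key_hash % num_buckets

def coalesced_bucket(big_bucket, big_n, small_n):
    """Map a bucket of the larger table onto the smaller table's bucketing.
    Valid only when big_n is a multiple of small_n."""
    assert big_n % small_n == 0
    return big_bucket % small_n

# Every row keeps a consistent bucket after coalescing 8 buckets down to 4,
# because (h mod 8) mod 4 == h mod 4 whenever 4 divides 8.
for h in range(1000):
    assert coalesced_bucket(bucket_id(h, 8), 8, 4) == bucket_id(h, 4)
print("coalescing 8 -> 4 buckets preserves co-partitioning")
```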
[jira] [Resolved] (SPARK-24266) Spark client terminates while driver is still running
[ https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-24266. -- Fix Version/s: 3.1.0 Target Version/s: 3.1.0 (was: 2.4.7, 3.1.0) Resolution: Fixed > Spark client terminates while driver is still running > - > > Key: SPARK-24266 > URL: https://issues.apache.org/jira/browse/SPARK-24266 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.3.0, 3.0.0 >Reporter: Chun Chen >Priority: Major > Fix For: 3.1.0 > > > {code} > Warning: Ignoring non-spark config property: Default=system properties > included when running spark-submit. > 18/05/11 14:50:12 WARN Config: Error reading service account token from: > [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring. > 18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: > Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf) > 18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads > feature cannot be used because libhadoop cannot be loaded. > 18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. 
> Mounting Hadoop specific files > 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: N/A >start time: N/A >container images: N/A >phase: Pending >status: [] > 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: tbds-100-98-45-69 >start time: N/A >container images: N/A >phase: Pending >status: [] > 18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: tbds-100-98-45-69 >start time: 2018-05-11T06:50:17Z >container images: 
docker.oa.com:8080/gaia/spark-driver-cos:20180503_9 >phase: Pending >status: [ContainerStatus(containerID=null, > image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, > lastState=ContainerState(running=null, terminated=null, waiting=null, > additionalProperties={}), name=spark-kubernetes-driver, ready=false, > restartCount=0, state=ContainerState(running=null, terminated=null, > waiting=ContainerStateWaiting(message=null, reason=PodInitializing, > additionalProperties={}), additionalProperties={}), additionalProperties={})] > 18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to > finish... > 18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hado
[jira] [Updated] (SPARK-32377) CaseInsensitiveMap should be deterministic for addition
[ https://issues.apache.org/jira/browse/SPARK-32377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32377: -- Reporter: Girish A Pandit (was: Dongjoon Hyun) > CaseInsensitiveMap should be deterministic for addition > --- > > Key: SPARK-32377 > URL: https://issues.apache.org/jira/browse/SPARK-32377 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0 >Reporter: Girish A Pandit >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.4.7, 3.0.1, 3.1.0 > > > {code} > import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap > var m = CaseInsensitiveMap(Map.empty[String, String]) > Seq(("paTh", "1"), ("PATH", "2"), ("Path", "3"), ("patH", "4"), ("path", > "5")).foreach { kv => > m = (m + kv).asInstanceOf[CaseInsensitiveMap[String]] > println(m.get("path")) > } > Some(1) > Some(2) > Some(3) > Some(4) > Some(1) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
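The bug above is that repeated addition of differently cased keys cycles back to a stale value (the final lookup returns Some(1) instead of Some(5)). A deterministic variant normalizes the key on every write so the last addition always wins. A Python sketch of the intended behavior (Spark's actual class is Scala; this is illustration only):

```python
class CaseInsensitiveMap(dict):
    """Keys compare case-insensitively; the last addition wins, so adding
    a key-value pair is deterministic regardless of prior key casing."""
    @staticmethod
    def _norm(key):
        return key.lower()
    def __setitem__(self, key, value):
        super().__setitem__(self._norm(key), value)
    def __getitem__(self, key):
        return super().__getitem__(self._norm(key))
    def get(self, key, default=None):
        return super().get(self._norm(key), default)

m = CaseInsensitiveMap()
for k, v in [("paTh", "1"), ("PATH", "2"), ("Path", "3"),
             ("patH", "4"), ("path", "5")]:
    m[k] = v
    print(m.get("path"))  # 1, 2, 3, 4, 5 -- last write wins, deterministically
```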
[jira] [Assigned] (SPARK-17333) Make pyspark interface friendly with mypy static analysis
[ https://issues.apache.org/jira/browse/SPARK-17333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17333: Assignee: Apache Spark > Make pyspark interface friendly with mypy static analysis > - > > Key: SPARK-17333 > URL: https://issues.apache.org/jira/browse/SPARK-17333 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Assaf Mendelson >Assignee: Apache Spark >Priority: Trivial > > Static analysis tools, such as those IDEs use for auto-completion and error marking, tend to have poor results with pyspark. > This is caused by two separate issues: > The first is that many elements are created programmatically, such as the max function in pyspark.sql.functions. > The second is that we tend to use pyspark in a functional manner, meaning that we chain many actions (e.g. df.filter().groupby().agg()), and since Python has no type information this can become difficult to understand. > I would suggest changing the interface to improve it. > The way I see it, we can either change the interface or provide interface enhancements. > Changing the interface means defining (when possible) all functions directly, i.e. instead of having a __functions__ dictionary in pyspark.sql.functions.py and then generating the functions programmatically by using _create_function, create the function directly. > def max(col): >""" >docstring >""" >_create_function(max, "docstring") > Second, we can add type indications to all functions as defined in PEP 484, or PyCharm's legacy type hinting > (https://www.jetbrains.com/help/pycharm/2016.1/type-hinting-in-pycharm.html#legacy). > So for example max might look like this: > def max(col): >""" >does a max. > :type col: Column > :rtype Column >""" > This would provide a wide range of support, as these types of hints, while old, are pretty common. > A second option is to use PEP 484 stub files (pyi files); in this case we might have a functions.pyi file which would contain something like: > def max(col: Column) -> Column: > """ > Aggregate function: returns the maximum value of the expression in a group. > """ > ... > This has the advantage of easier-to-understand types and of not touching the code itself, but has the disadvantage of being separately managed (i.e. a greater chance of making a mistake) and the fact that some configuration would be needed in the IDE/static analysis tool instead of it working out of the box. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
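The direct-definition route the description proposes would look roughly like this with PEP 484 annotations (the `Column` class below is a stand-in for `pyspark.sql.Column`, for illustration only):

```python
class Column:
    """Minimal stand-in for pyspark.sql.Column (illustration only)."""
    def __init__(self, name: str) -> None:
        self.name = name

def max(col: Column) -> Column:
    """Aggregate function: returns the maximum value of the expression
    in a group.

    Defined directly with an annotated signature instead of being generated
    through _create_function, so mypy and IDEs can check chained calls.
    """
    # Shadows the builtin max, as pyspark.sql.functions deliberately does.
    return Column(f"max({col.name})")

print(max(Column("age")).name)  # max(age)
```

With this shape, `mypy` flags `max("age")` as an error at analysis time instead of letting it fail at runtime.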
[jira] [Assigned] (SPARK-17333) Make pyspark interface friendly with mypy static analysis
[ https://issues.apache.org/jira/browse/SPARK-17333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17333: Assignee: (was: Apache Spark) > Make pyspark interface friendly with mypy static analysis > - > > Key: SPARK-17333 > URL: https://issues.apache.org/jira/browse/SPARK-17333 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Assaf Mendelson >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17333) Make pyspark interface friendly with mypy static analysis
[ https://issues.apache.org/jira/browse/SPARK-17333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162339#comment-17162339 ] Apache Spark commented on SPARK-17333: -- User 'Fokko' has created a pull request for this issue: https://github.com/apache/spark/pull/29180 > Make pyspark interface friendly with mypy static analysis > - > > Key: SPARK-17333 > URL: https://issues.apache.org/jira/browse/SPARK-17333 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Assaf Mendelson >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32382) Override table renaming in JDBC dialects
Maxim Gekk created SPARK-32382: -- Summary: Override table renaming in JDBC dialects Key: SPARK-32382 URL: https://issues.apache.org/jira/browse/SPARK-32382 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Maxim Gekk SPARK-32375 adds a new method, renameTable, to JdbcDialect with the default implementation: {code:sql} ALTER TABLE table_name RENAME TO new_table_name; {code} which is supported by Oracle, MySQL, MariaDB, PostgreSQL, and SQLite. Other dialects might not support this syntax; SQL Server, for instance, renames tables through the stored procedure sp_rename: {code:sql} sp_rename 'table_name', 'new_table_name'; {code} The ticket aims to support table renaming in all JDBC dialects. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
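One way to structure the per-dialect override, sketched in Python (Spark's actual `JdbcDialect` is a Scala class, and the dialect class name here is assumed):

```python
class JdbcDialect:
    def rename_table(self, old, new):
        # Default statement, understood by Oracle, MySQL, MariaDB,
        # PostgreSQL, and SQLite.
        return f"ALTER TABLE {old} RENAME TO {new}"

class MsSqlServerDialect(JdbcDialect):
    def rename_table(self, old, new):
        # SQL Server renames through the sp_rename stored procedure instead.
        return f"EXEC sp_rename '{old}', '{new}'"

print(JdbcDialect().rename_table("t", "t2"))         # ALTER TABLE t RENAME TO t2
print(MsSqlServerDialect().rename_table("t", "t2"))  # EXEC sp_rename 't', 't2'
```

The JDBC write path can then ask the dialect registered for the connection URL for the statement, rather than hard-coding the ALTER TABLE form.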
[jira] [Created] (SPARK-32381) Expose the ability for users to use parallel file & avoid location information discovery in RDDs
Holden Karau created SPARK-32381: Summary: Expose the ability for users to use parallel file & avoid location information discovery in RDDs Key: SPARK-32381 URL: https://issues.apache.org/jira/browse/SPARK-32381 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.0 Reporter: Holden Karau We already have this in SQL so it's mostly a matter of re-organizing the code a bit and agreeing on how to best expose this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
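In miniature, "parallel file discovery without location information" means walking several root directories concurrently and skipping any per-file block-location lookup. A hedged Python sketch of the idea (illustrative only, not Spark's Scala listing code):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def list_files_parallel(dirs, max_threads=8):
    """List leaf files under several directories in parallel, without any
    per-file block-location (locality) lookup -- the expensive step the
    ticket wants to let RDD users skip, as SQL's DataSource scans can."""
    def list_one(d):
        return [os.path.join(root, f)
                for root, _, files in os.walk(d) for f in files]
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        results = pool.map(list_one, dirs)  # one listing task per root dir
    return [path for sub in results for path in sub]

# Tiny demo against a throwaway directory:
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "a.txt"), "w").close()
    open(os.path.join(d, "b.txt"), "w").close()
    print(sorted(os.path.basename(p) for p in list_files_parallel([d])))
    # ['a.txt', 'b.txt']
```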
[jira] [Assigned] (SPARK-32381) Expose the ability for users to use parallel file & avoid location information discovery in RDDs
[ https://issues.apache.org/jira/browse/SPARK-32381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32381: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-32381) Expose the ability for users to use parallel file & avoid location information discovery in RDDs
[ https://issues.apache.org/jira/browse/SPARK-32381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32381: Assignee: Apache Spark
[jira] [Commented] (SPARK-32381) Expose the ability for users to use parallel file & avoid location information discovery in RDDs
[ https://issues.apache.org/jira/browse/SPARK-32381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162314#comment-17162314 ] Apache Spark commented on SPARK-32381: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/29179
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162313#comment-17162313 ] Holden Karau commented on SPARK-26345: -- We normally don't assign issues until after the merge. Leaving a comment when you start working on an issue is a best practice to avoid people stepping on each other's toes. > Parquet support Column indexes > -- > > Key: SPARK-26345 > URL: https://issues.apache.org/jira/browse/SPARK-26345 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > Parquet 1.11.0 supports column indexing. Spark can support this feature for > better read performance. > More details: > https://issues.apache.org/jira/browse/PARQUET-1201
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162311#comment-17162311 ] Felix Kizhakkel Jose commented on SPARK-26345: -- [~sha...@uber.com] I don't have permission to assign it to you. Probably someone on the committers list can assign it to you.
[jira] [Commented] (SPARK-32348) Get tests working for Scala 2.13 build
[ https://issues.apache.org/jira/browse/SPARK-32348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162280#comment-17162280 ] Sean R. Owen commented on SPARK-32348: -- I've found a few more easy test fixes, but also discovered we need scalatest 3.2.0 for this fix: https://github.com/scalatest/scalatest/commit/7c89416aa9f3e7f2730a343ad6d3bdcff65809de > Get tests working for Scala 2.13 build > -- > > Key: SPARK-32348 > URL: https://issues.apache.org/jira/browse/SPARK-32348 > Project: Spark > Issue Type: Sub-task > Components: ML, Spark Core, SQL, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Major > > This is a placeholder for the general task of getting the tests to pass in > the Scala 2.13 build, after it compiles. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32334) Investigate commonizing Columnar and Row data transformations
[ https://issues.apache.org/jira/browse/SPARK-32334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162147#comment-17162147 ] Robert Joseph Evans commented on SPARK-32334: - I think I can get the conversation started here. {{SparkPlan}} supports a few APIs for columnar processing right now: * {{supportsColumnar}}, which returns true if {{executeColumnar}} should be called to process columnar data. * {{vectorTypes}}, an optional set of class names for the columnar output of this stage, which is a performance improvement for the code generation phase of converting the data to rows. * {{executeColumnar}}, the main entry point to columnar execution. * {{doExecuteColumnar}}, what users are expected to implement if {{supportsColumnar}} returns true. When {{supportsColumnar}} returns true, it is assumed that both the input and the output of the stage will be columnar data. With this information {{ApplyColumnarRulesAndInsertTransitions}} will insert {{RowToColumnarExec}} and {{ColumnarToRowExec}} transitions. {{ColumnarToRowExec}} is by far the more optimized of the two because it is widely used today. One of the goals of this issue is to try to make something like {{ArrowEvalPythonExec}} be columnar. If we just made {{supportsColumnar}} return true for it, the incoming data layout would be columnar, but it most likely would not be Arrow formatted, so it would still require some kind of transition from one columnar format to an Arrow-based format. There is also no guarantee that the size of the batch will correspond to what this operator wants: {{RowToColumnarExec}} goes off of the {{spark.sql.inMemoryColumnarStorage.batchSize}} config, but {{ArrowEvalPythonExec}} uses {{spark.sql.execution.arrow.maxRecordsPerBatch}}. To get around both of these issues I would propose that we let {{SparkPlan}} optionally ask for both a specific type of input and a specific target size. 
We might also want a better way to say what type of output it is going to produce so we can optimize away some transitions if they are not needed. > Investigate commonizing Columnar and Row data transformations > -- > > Key: SPARK-32334 > URL: https://issues.apache.org/jira/browse/SPARK-32334 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Priority: Major > > We introduced more Columnar Support with SPARK-27396. > With that we recognized that there is code that is doing very similar > transformations from ColumnarBatch or Arrow into InternalRow and vice versa. > For instance: > [https://github.com/apache/spark/blob/a4382f7fe1c36a51c64f460c6cb91e93470e0825/sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala#L56-L58] > [https://github.com/apache/spark/blob/a4382f7fe1c36a51c64f460c6cb91e93470e0825/sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala#L389] > We should investigate if we can commonize that code. > We are also looking at making the internal caching serialization pluggable to > allow for different cache implementations. > ([https://github.com/apache/spark/pull/29067]). > It was recently brought up that we should investigate if using the data > source v2 api makes sense and is feasible for some of these transformations > to allow it to be easily extended. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
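The row-to-columnar and columnar-to-row transitions under discussion can be pictured as a transpose between row-oriented and column-oriented layouts. This is only a toy sketch (batch sizing, Arrow formatting, and Spark's actual vector classes are all omitted), not Spark's implementation:

```python
def rows_to_columns(rows):
    """Transpose row-oriented records into column vectors (toy RowToColumnarExec)."""
    return [list(col) for col in zip(*rows)]


def columns_to_rows(cols):
    """Transpose column vectors back into rows (toy ColumnarToRowExec)."""
    return [list(row) for row in zip(*cols)]


rows = [[1, "a"], [2, "b"], [3, "c"]]
cols = rows_to_columns(rows)           # [[1, 2, 3], ['a', 'b', 'c']]
assert columns_to_rows(cols) == rows   # the round trip is lossless
```

The cost being discussed is exactly these transposes (plus format conversion, e.g. to Arrow), which is why avoiding unnecessary transitions between stages matters.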
[jira] [Updated] (SPARK-32377) CaseInsensitiveMap should be deterministic for addition
[ https://issues.apache.org/jira/browse/SPARK-32377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-32377: -- Fix Version/s: 2.4.7 > CaseInsensitiveMap should be deterministic for addition > --- > > Key: SPARK-32377 > URL: https://issues.apache.org/jira/browse/SPARK-32377 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.4.7, 3.0.1, 3.1.0 > > > {code} > import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap > var m = CaseInsensitiveMap(Map.empty[String, String]) > Seq(("paTh", "1"), ("PATH", "2"), ("Path", "3"), ("patH", "4"), ("path", > "5")).foreach { kv => > m = (m + kv).asInstanceOf[CaseInsensitiveMap[String]] > println(m.get("path")) > } > Some(1) > Some(2) > Some(3) > Some(4) > Some(1) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
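The deterministic behavior this fix aims for can be sketched with a toy case-insensitive dict in which lookups ignore case and the most recent addition always wins, regardless of key casing. This is a sketch only, not Spark's CaseInsensitiveMap:

```python
class CIMap(dict):
    """Toy case-insensitive map: lookups ignore case; the last insert wins."""

    def __setitem__(self, key, value):
        super().__setitem__(key.lower(), value)

    def __getitem__(self, key):
        return super().__getitem__(key.lower())


m = CIMap()
for key, value in [("paTh", "1"), ("PATH", "2"), ("Path", "3"),
                   ("patH", "4"), ("path", "5")]:
    m[key] = value
    print(m["path"])  # prints 1, 2, 3, 4, 5 -- deterministic at every step
```

Contrast this with the buggy output quoted in the issue, where the final addition of "path" -> "5" unexpectedly yielded Some(1).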
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162063#comment-17162063 ] Xinli Shang commented on SPARK-26345: - [~yumwang][~FelixKJose], you can assign this JIRA to me. When I have time, I can start working on it.
[jira] [Commented] (SPARK-32380) sparksql cannot access hive table while data in hbase
[ https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162059#comment-17162059 ] Apache Spark commented on SPARK-32380: -- User 'DeyinZhong' has created a pull request for this issue: https://github.com/apache/spark/pull/29178 > sparksql cannot access hive table while data in hbase > - > > Key: SPARK-32380 > URL: https://issues.apache.org/jira/browse/SPARK-32380 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: ||component||version|| > |hadoop|2.8.5| > |hive|2.3.7| > |spark|3.0.0| > |hbase|1.4.9| >Reporter: deyzhong >Priority: Major > Original Estimate: 72h > Remaining Estimate: 72h > > * step1: create hbase table > {code:java} > hbase(main):001:0>create 'hbase_test1', 'cf1' > hbase(main):001:0> put 'hbase_test', 'r1', 'cf1:c1', '123' > {code} > * step2: create hive table related to hbase table > > {code:java} > hive> > CREATE EXTERNAL TABLE `hivetest.hbase_test`( > `key` string COMMENT '', > `value` string COMMENT '') > ROW FORMAT SERDE > 'org.apache.hadoop.hive.hbase.HBaseSerDe' > STORED BY > 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' > WITH SERDEPROPERTIES ( > 'hbase.columns.mapping'=':key,cf1:v1', > 'serialization.format'='1') > TBLPROPERTIES ( > 'hbase.table.name'='hbase_test') > {code} > * step3: sparksql query hive table while data in hbase > {code:java} > spark-sql --master yarn -e "select * from hivetest.hbase_test" > {code} > > The error log is as follows: > java.io.IOException: Cannot create a record reader because of a previous > error. Please look at the previous logs lines from the task's full log for > more details. 
> at > org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270) > at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:272) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:272) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:272) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:272) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:272) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158) > at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) > at org.apache.spark.rdd.RDD.collect(RDD.scala:1003) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412) > at > 
org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.It
[jira] [Assigned] (SPARK-32380) sparksql cannot access hive table while data in hbase
[ https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32380: Assignee: Apache Spark
[jira] [Assigned] (SPARK-32380) sparksql cannot access hive table while data in hbase
[ https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32380: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-32380) sparksql cannot access hive table while data in hbase
[ https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162058#comment-17162058 ] Apache Spark commented on SPARK-32380: -- User 'DeyinZhong' has created a pull request for this issue: https://github.com/apache/spark/pull/29178
[jira] [Assigned] (SPARK-32363) Flaky pip installation test in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-32363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32363: Assignee: Hyukjin Kwon > Flaky pip installation test in Jenkins > -- > > Key: SPARK-32363 > URL: https://issues.apache.org/jira/browse/SPARK-32363 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > Currently pip packaging test is flaky in Jenkins: > {code} > Installing collected packages: py4j, pyspark > Attempting uninstall: py4j > Found existing installation: py4j 0.10.9 > Uninstalling py4j-0.10.9: > Successfully uninstalled py4j-0.10.9 > Attempting uninstall: pyspark > Found existing installation: pyspark 3.1.0.dev0 > ERROR: Exception: > Traceback (most recent call last): > File > "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/cli/base_command.py", > line 188, in _main > status = self.run(options, args) > File > "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/cli/req_command.py", > line 185, in wrapper > return func(self, options, args) > File > "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/commands/install.py", > line 407, in run > use_user_site=options.use_user_site, > File > "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/__init__.py", > line 64, in install_given_reqs > auto_confirm=True > File > "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/req_install.py", > line 675, in uninstall > uninstalled_pathset = UninstallPathSet.from_dist(dist) > File > "/tmp/tmp.GX6lHKLHZK/3.6/lib/python3.6/site-packages/pip/_internal/req/req_uninstall.py", > line 545, in from_dist > link_pointer, dist.project_name, dist.location) > AssertionError: Egg-link > /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/python does > not match installed location of pyspark (at > 
/home/jenkins/workspace/SparkPullRequestBuilder@2/python) > Cleaning up temporary directory - /tmp/tmp.GX6lHKLHZK > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32363) Flaky pip installation test in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-32363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32363. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29117 [https://github.com/apache/spark/pull/29117]
[jira] [Updated] (SPARK-32380) sparksql cannot access hive table while data in hbase
[ https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] deyzhong updated SPARK-32380: - Description: * step1: create hbase table {code:java} hbase(main):001:0> create 'hbase_test', 'cf1' hbase(main):001:0> put 'hbase_test', 'r1', 'cf1:c1', '123' {code} * step2: create hive table related to the hbase table {code:java} hive> CREATE EXTERNAL TABLE `hivetest.hbase_test`( `key` string COMMENT '', `value` string COMMENT '') ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe' STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( 'hbase.columns.mapping'=':key,cf1:v1', 'serialization.format'='1') TBLPROPERTIES ( 'hbase.table.name'='hbase_test') {code} * step3: spark-sql queries the hive table whose data is in hbase {code:java} spark-sql --master yarn -e "select * from hivetest.hbase_test" {code} The error log is as follows: java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous log lines from the task's full log for more details. 
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270) at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131) at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.rdd.RDD.partitions(RDD.scala:272) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.rdd.RDD.partitions(RDD.scala:272) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.rdd.RDD.partitions(RDD.scala:272) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.rdd.RDD.partitions(RDD.scala:272) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.rdd.RDD.partitions(RDD.scala:272) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158) at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) at org.apache.spark.rdd.RDD.collect(RDD.scala:1003) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385) at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412) at org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:58) at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:490) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336) at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:474) at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:490) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:206) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.refle
[jira] [Commented] (SPARK-32380) sparksql cannot access hive table while data in hbase
[ https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161959#comment-17161959 ] deyzhong commented on SPARK-32380: -- I have solved this bug by modifying TableReader.scala. The solution: when the input format class is org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat, create a HadoopRDD (based on the old Hadoop mapred API) instead of a NewHadoopRDD. I have also tested this in my production environment. May I submit a PR to Spark? [~apachespark] > sparksql cannot access hive table while data in hbase > - > > Key: SPARK-32380 > URL: https://issues.apache.org/jira/browse/SPARK-32380 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: ||component||version|| > |hadoop|2.8.5| > |hive|2.3.7| > |spark|3.0.0| > |hbase|1.4.9| >Reporter: deyzhong >Priority: Major > Original Estimate: 72h > Remaining Estimate: 72h
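The dispatch the comment above proposes can be sketched as follows. This is an illustrative sketch only, not the actual Spark patch: the object and method names (`RddChoice`, `chooseRdd`) are hypothetical, and the real TableReader change would construct the RDDs rather than return their names.

```scala
// Hedged sketch of the proposed TableReader.scala dispatch (names hypothetical).
// HiveHBaseTableInputFormat implements the old org.apache.hadoop.mapred
// InputFormat API, so it must be read through a HadoopRDD; NewHadoopRDD
// expects the newer org.apache.hadoop.mapreduce API and fails on it.
object RddChoice {
  def chooseRdd(inputFormatClassName: String): String =
    if (inputFormatClassName == "org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat")
      "HadoopRDD"    // old mapred API path
    else
      "NewHadoopRDD" // default new mapreduce API path
}
```

In effect, the fix routes Hive storage-handler tables that expose only the old mapred InputFormat through the legacy RDD path instead of assuming the mapreduce API.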
[jira] [Updated] (SPARK-32380) sparksql cannot access hive table while data in hbase
[ https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] deyzhong updated SPARK-32380: - Summary: sparksql cannot access hive table while data in hbase (was: sparksql cannot access hive table while data on hbase) > sparksql cannot access hive table while data in hbase > - > > Key: SPARK-32380 > URL: https://issues.apache.org/jira/browse/SPARK-32380 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: ||component||version|| > |hadoop|2.8.5| > |hive|2.3.7| > |spark|3.0.0| > |hbase|1.4.9| >Reporter: deyzhong >Priority: Major > Original Estimate: 72h > Remaining Estimate: 72h > > * step 1: create the hbase table > {code:java} > create 'hbase_test2', 'cf1' > {code} > * step 2: create a hive external table mapped to the hbase table > {code:java} > hive> > CREATE EXTERNAL TABLE `hivetest.hbase_test`( > `key` string COMMENT '', > `value` string COMMENT '') > ROW FORMAT SERDE > 'org.apache.hadoop.hive.hbase.HBaseSerDe' > STORED BY > 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' > WITH SERDEPROPERTIES ( > 'hbase.columns.mapping'=':key,cf1:v1', > 'serialization.format'='1') > TBLPROPERTIES ( > 'hbase.table.name'='hbase_test') > {code} > * step 3: query the hive table from spark-sql; it fails with: > java.io.IOException: Cannot create a record reader because of a previous > error. Please look at the previous logs lines from the task's full log for > more details. 
> at > org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270) > at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:272) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:272) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:272) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:272) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:272) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158) > at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:388) > at org.apache.spark.rdd.RDD.collect(RDD.scala:1003) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412) > at > 
org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
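The TableInputFormatBase failure quoted above is, in many reports, a deployment problem rather than a Spark planner bug: Spark SQL needs the Hive HBase storage-handler jar, the HBase client jars, and an hbase-site.xml pointing at a reachable ZooKeeper quorum on its classpath. As a hedged sketch only (the jar locations and the exact jar set are assumptions chosen to match the reported hive 2.3.7 / hbase 1.4.9 environment, not taken from the issue), one way to launch spark-sql with them attached:

```shell
# Illustrative only: adjust the jar and config paths to your installation.
spark-sql \
  --jars /opt/hive/lib/hive-hbase-handler-2.3.7.jar,/opt/hbase/lib/hbase-client-1.4.9.jar,/opt/hbase/lib/hbase-common-1.4.9.jar,/opt/hbase/lib/hbase-server-1.4.9.jar \
  --files /opt/hbase/conf/hbase-site.xml \
  -e "SELECT key, value FROM hivetest.hbase_test LIMIT 10"
```

If the same query works from the Hive CLI but not from spark-sql, a missing jar or an hbase-site.xml that Spark cannot see is the first thing to rule out.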
[jira] [Updated] (SPARK-32380) sparksql cannot access hive table while data on hbase
[ https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] deyzhong updated SPARK-32380: - Summary: sparksql cannot access hive table while data on hbase (was: sparksql cannot access hbase external table in hive)
[jira] [Updated] (SPARK-32380) sparksql cannot access hbase external table in hive
[ https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] deyzhong updated SPARK-32380: - Description: * step1: create hbase table {code:java} create 'hbase_test2', 'cf1' {code} * create hive table related to hbase table {code:java} hive> CREATE EXTERNAL TABLE `hivetest.hbase_test`( `key` string COMMENT '', `value` string COMMENT '') ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe' STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( 'hbase.columns.mapping'=':key,cf1:v1', 'serialization.format'='1') TBLPROPERTIES ( 'hbase.table.name'='hbase_test') {code} sparksql query hive table while data in hbase: java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details. at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270)
[jira] [Updated] (SPARK-32380) sparksql cannot access hbase external table in hive
[ https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] deyzhong updated SPARK-32380: - Description: sparksql query hive table: java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details. at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270)
[jira] [Commented] (SPARK-32317) Parquet file loading with different schema(Decimal(N, P)) in files is not working as expected
[ https://issues.apache.org/jira/browse/SPARK-32317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161952#comment-17161952 ] Krish commented on SPARK-32317: --- Yes, I do agree with your second point: if we map the required schema to the schema stored in each file, we will be able to achieve the desired result. Will look for an update from you on this. > Parquet file loading with different schema(Decimal(N, P)) in files is not > working as expected > - > > Key: SPARK-32317 > URL: https://issues.apache.org/jira/browse/SPARK-32317 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 > Environment: It fails in all environments that I tried. >Reporter: Krish >Priority: Major > Labels: easyfix > Original Estimate: 24h > Remaining Estimate: 24h > > Hi, > > We generate Parquet files partitioned on Date on a daily basis, and we sometimes send updates to historical data. We noticed that, due to a configuration error, the patch data's schema is inconsistent with the earlier files. Assume files were generated with ID and AMOUNT as fields: the historical data has the schema ID INT, AMOUNT DECIMAL(15,6), while the files we send as updates have AMOUNT as DECIMAL(15,2). > > With two different schemas in one Date partition, loading that Date's data into Spark succeeds, but the AMOUNT values are silently corrupted. 
> > file1.snappy.parquet > ID: INT > AMOUNT: DECIMAL(15,6) > Content: > 1,19500.00 > 2,198.34 > file2.snappy.parquet > ID: INT > AMOUNT: DECIMAL(15,2) > Content: > 1,19500.00 > 3,198.34 > Load these two files together: > df3 = spark.read.parquet("output/") > df3.show() # we can see the amount getting corrupted here: > +-+---+ > |ID| AMOUNT| > +-+---+ > |1|1.95| > |3|0.019834| > |1|19500.00| > |2|198.34| > +-+---+ > Options tried: > We tried to give the schema as String for all fields, but that didn't work: > df3 = spark.read.format("parquet").schema(schema).load("output/") > Error: "org.apache.spark.sql.execution.QueryExecutionException: Parquet > column cannot be converted in file file*.snappy.parquet. Column: > [AMOUNT], Expected: string, Found: INT64" > > I know schema merging works if it finds a few extra columns in one file, but the fields the files have in common need to have the same schema, so that might not work here. > > Looking for a workaround here, or if there is an option I haven't tried, you can point me to that. > > With schema merging I got the below error: 
: > org.apache.spark.SparkException: Failed merging schema: root |-- ID: string > (nullable = true) |-- AMOUNT: decimal(15,6) (nullable = true) at > org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5(SchemaMergeUtils.scala:100) > at > org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5$adapted(SchemaMergeUtils.scala:95) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at > org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.mergeSchemasInParallel(SchemaMergeUtils.scala:95) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.mergeSchemasInParallel(ParquetFileFormat.scala:485) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$.inferSchema(ParquetUtils.scala:107) > at > org.apache.spark.sql.execution.datasources.v2.parquet.ParquetTable.inferSchema(ParquetTable.scala:44) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69) > at scala.Option.orElse(Option.scala:447) at > org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80) > at > org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$.create(DataSourceV2Relation.scala:141) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:225) > at scala.Option.map(Option.scala:230) at > org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:206) at > 
org.apache.spark.sql.DataFrameReader.parquet(Dat
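The corrupted values in the `df3.show()` output above are exactly what you get when a reader applies one file's decimal scale to the other file's raw values: Parquet stores a DECIMAL as an unscaled integer, with the scale kept only in the schema metadata. A minimal pure-Python sketch of that mismatch (`decode_decimal` is a hypothetical helper for illustration, not Spark code):

```python
from decimal import Decimal

def decode_decimal(unscaled: int, scale: int) -> Decimal:
    # Parquet stores DECIMAL values as unscaled integers; the scale
    # lives only in the file's schema metadata.
    return Decimal(unscaled).scaleb(-scale)

# 19500.00 and 198.34 written as DECIMAL(15,2) are stored as 1950000 and 19834.
print(decode_decimal(1950000, 2))  # 19500.00  (decoded with the file's own scale)
print(decode_decimal(19834, 2))    # 198.34
# A reader that instead applies the other file's DECIMAL(15,6) scale sees:
print(decode_decimal(1950000, 6))  # 1.950000  -> displayed as 1.95
print(decode_decimal(19834, 6))    # 0.019834
```

This reproduces the 1.95 and 0.019834 rows from the report, which is consistent with the suggestion in the comment: the reader must map the requested schema to each file's own stored schema rather than assume one scale for the whole partition.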
[jira] [Created] (SPARK-32380) sparksql cannot access hbase external table in hive
deyzhong created SPARK-32380: Summary: sparksql cannot access hbase external table in hive Key: SPARK-32380 URL: https://issues.apache.org/jira/browse/SPARK-32380 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Environment: ||component||version|| |hadoop|2.8.5| |hive|2.3.7| |spark|3.0.0| |hbase|1.4.9| Reporter: deyzhong java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details. at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270) at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
[jira] [Assigned] (SPARK-32379) docker based spark release script should use correct CRAN repo.
[ https://issues.apache.org/jira/browse/SPARK-32379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32379: Assignee: Prashant Sharma > docker based spark release script should use correct CRAN repo. > --- > > Key: SPARK-32379 > URL: https://issues.apache.org/jira/browse/SPARK-32379 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.6 >Reporter: Prashant Sharma >Assignee: Prashant Sharma >Priority: Blocker > > While running the dev/create-release/do-release-docker.sh script, it fails > with the following errors: > {code} > [root@kyok-test-1 ~]# tail docker-build.log > distribution that some required packages have not yet been created > or been moved out of Incoming. > The following information may help to resolve the situation: > The following packages have unmet dependencies: > r-base : Depends: r-base-core (>= 4.0.2-1.1804.0) but it is not going to be > installed > Depends: r-recommended (= 4.0.2-1.1804.0) but it is not going to be > installed > r-base-dev : Depends: r-base-core (>= 4.0.2-1.1804.0) but it is not going to > be installed > E: Unable to correct problems, you have held broken packages. 
> The command '/bin/sh -c apt-get clean && apt-get update && $APT_INSTALL gnupg > ca-certificates apt-transport-https && echo 'deb > https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/' >> > /etc/apt/sources.list && gpg --keyserver keyserver.ubuntu.com --recv-key > E298A3A825C0D65DFD57CBB651716619E084DAB9 && gpg -a --export E084DAB9 | > apt-key add - && apt-get clean && rm -rf /var/lib/apt/lists/* && > apt-get clean && apt-get update && $APT_INSTALL > software-properties-common && apt-add-repository -y ppa:brightbox/ruby-ng > && apt-get update && $APT_INSTALL openjdk-8-jdk && update-alternatives > --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java && $APT_INSTALL > curl wget git maven ivy subversion make gcc lsof libffi-dev pandoc > pandoc-citeproc libssl-dev libcurl4-openssl-dev libxml2-dev && ln -s -T > /usr/share/java/ivy.jar /usr/share/ant/lib/ivy.jar && curl -sL > https://deb.nodesource.com/setup_4.x | bash && $APT_INSTALL nodejs && > $APT_INSTALL libpython2.7-dev libpython3-dev python-pip python3-pip && pip > install --upgrade pip && hash -r pip && pip install setuptools && pip > install $BASE_PIP_PKGS && pip install $PIP_PKGS && cd && virtualenv -p > python3 /opt/p35 && . 
/opt/p35/bin/activate && pip install setuptools && > pip install $BASE_PIP_PKGS && pip install $PIP_PKGS && $APT_INSTALL > r-base r-base-dev && $APT_INSTALL texlive-latex-base texlive > texlive-fonts-extra texinfo qpdf && Rscript -e "install.packages(c('curl', > 'xml2', 'httr', 'devtools', 'testthat', 'knitr', 'rmarkdown', 'roxygen2', > 'e1071', 'survival'), repos='https://cloud.r-project.org/')" && Rscript -e > "devtools::install_github('jimhester/lintr')" && $APT_INSTALL ruby2.3 > ruby2.3-dev mkdocs && gem install jekyll --no-rdoc --no-ri -v 3.8.6 && > gem install jekyll-redirect-from -v 0.15.0 && gem install pygments.rb' > returned a non-zero code: 100 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
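The unmet dependencies arise because the bionic-cran40 apt line in the Dockerfile now resolves r-base to the R 4.0 series, which conflicts with the other packages pinned in the image. Purely as an illustration of the idea in the issue title (pointing apt at a CRAN repository matching the R series the image expects), and not necessarily the change that was actually merged, the apt line could pin an older fixed series such as bionic-cran35:

```shell
# Hypothetical Dockerfile fragment: pin the CRAN apt repo to the R 3.5 series
# instead of bionic-cran40, so r-base resolves to a compatible version.
echo 'deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/' \
    >> /etc/apt/sources.list
apt-get update && apt-get install -y r-base r-base-dev
```

The repository suffix to pin depends on which R series the release tooling was built against; treat the cran35 choice above as an assumption.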
[jira] [Resolved] (SPARK-32379) docker based spark release script should use correct CRAN repo.
[ https://issues.apache.org/jira/browse/SPARK-32379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-32379.
--
    Fix Version/s: 2.4.7
       Resolution: Fixed

Issue resolved by pull request 29177
[https://github.com/apache/spark/pull/29177]

> docker based spark release script should use correct CRAN repo.
> ---
>
>                 Key: SPARK-32379
>                 URL: https://issues.apache.org/jira/browse/SPARK-32379
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 2.4.6
>            Reporter: Prashant Sharma
>            Assignee: Prashant Sharma
>            Priority: Blocker
>             Fix For: 2.4.7
>
> While running the dev/create-release/do-release-docker.sh script, it fails with the following errors:
> {code}
> [root@kyok-test-1 ~]# tail docker-build.log
> distribution that some required packages have not yet been created
> or been moved out of Incoming.
> The following information may help to resolve the situation:
> The following packages have unmet dependencies:
>  r-base : Depends: r-base-core (>= 4.0.2-1.1804.0) but it is not going to be installed
>           Depends: r-recommended (= 4.0.2-1.1804.0) but it is not going to be installed
>  r-base-dev : Depends: r-base-core (>= 4.0.2-1.1804.0) but it is not going to be installed
> E: Unable to correct problems, you have held broken packages.
> The command '/bin/sh -c apt-get clean && apt-get update && [...] && gem install pygments.rb' returned a non-zero code: 100
> {code}
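The apt output quoted in these reports names the conflict precisely: r-base from the new suite needs r-base-core >= 4.0.2-1.1804.0, which is not installable in this image. When triaging several such build logs, the required version bound can be pulled out with standard text tools; a small illustrative sketch (the log line is copied verbatim from the report above):

```shell
# Extract the version bound from an apt "unmet dependencies" line.
# \(...\) is a BRE capture group; -n with the p flag prints only matches.
log='r-base : Depends: r-base-core (>= 4.0.2-1.1804.0) but it is not going to be installed'
required=$(printf '%s\n' "$log" | sed -n 's/.*(>= \([^)]*\)).*/\1/p')
echo "$required"   # 4.0.2-1.1804.0
```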
[jira] [Commented] (SPARK-32379) docker based spark release script should use correct CRAN repo.
[ https://issues.apache.org/jira/browse/SPARK-32379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161848#comment-17161848 ]

Apache Spark commented on SPARK-32379:
--

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/29177

> docker based spark release script should use correct CRAN repo.
> ---
>
>                 Key: SPARK-32379
>                 URL: https://issues.apache.org/jira/browse/SPARK-32379
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 2.4.6
>            Reporter: Prashant Sharma
>            Priority: Blocker
[jira] [Commented] (SPARK-32379) docker based spark release script should use correct CRAN repo.
[ https://issues.apache.org/jira/browse/SPARK-32379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161846#comment-17161846 ]

Apache Spark commented on SPARK-32379:
--

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/29177

> docker based spark release script should use correct CRAN repo.
> ---
>
>                 Key: SPARK-32379
>                 URL: https://issues.apache.org/jira/browse/SPARK-32379
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 2.4.6
>            Reporter: Prashant Sharma
>            Priority: Blocker
[jira] [Assigned] (SPARK-32379) docker based spark release script should use correct CRAN repo.
[ https://issues.apache.org/jira/browse/SPARK-32379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32379:

    Assignee: Apache Spark

> docker based spark release script should use correct CRAN repo.
> ---
>
>                 Key: SPARK-32379
>                 URL: https://issues.apache.org/jira/browse/SPARK-32379
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 2.4.6
>            Reporter: Prashant Sharma
>            Assignee: Apache Spark
>            Priority: Blocker
[jira] [Assigned] (SPARK-32379) docker based spark release script should use correct CRAN repo.
[ https://issues.apache.org/jira/browse/SPARK-32379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32379:

    Assignee: (was: Apache Spark)

> docker based spark release script should use correct CRAN repo.
> ---
>
>                 Key: SPARK-32379
>                 URL: https://issues.apache.org/jira/browse/SPARK-32379
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 2.4.6
>            Reporter: Prashant Sharma
>            Priority: Blocker
[jira] [Created] (SPARK-32379) docker based spark release script should use correct CRAN repo.
Prashant Sharma created SPARK-32379:
---

             Summary: docker based spark release script should use correct CRAN repo.
                 Key: SPARK-32379
                 URL: https://issues.apache.org/jira/browse/SPARK-32379
             Project: Spark
          Issue Type: Bug
          Components: Build
    Affects Versions: 2.4.6
            Reporter: Prashant Sharma

While running the dev/create-release/do-release-docker.sh script, it fails with the following errors:

{code}
[root@kyok-test-1 ~]# tail docker-build.log
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:
The following packages have unmet dependencies:
 r-base : Depends: r-base-core (>= 4.0.2-1.1804.0) but it is not going to be installed
          Depends: r-recommended (= 4.0.2-1.1804.0) but it is not going to be installed
 r-base-dev : Depends: r-base-core (>= 4.0.2-1.1804.0) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
The command '/bin/sh -c apt-get clean && apt-get update && [...] && gem install pygments.rb' returned a non-zero code: 100
{code}