[jira] [Commented] (SPARK-23801) Consistent SIGSEGV after upgrading to Spark v2.3.0

2018-03-27 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416835#comment-16416835
 ] 

Kazuaki Ishizaki commented on SPARK-23801:
--

While we know it is time-consuming work to narrow this down and create a
repro, we would appreciate it if you could prepare one.

> Consistent SIGSEGV after upgrading to Spark v2.3.0
> --
>
> Key: SPARK-23801
> URL: https://issues.apache.org/jira/browse/SPARK-23801
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
> Environment: Mesos coarse grained executor
> 18 * r3.4xlarge (16 core boxes) with 105G of executor memory
>Reporter: Nathan Kleyn
>Priority: Major
> Attachments: spark-executor-failure.coredump.log
>
>
> After upgrading to Spark v2.3.0 from Spark v2.1.1, we are seeing consistent 
> segfaults in a large Spark job (18 * r3.4xlarge 16-core boxes with 105G of 
> executor memory). I've attached the full coredump, but here is an excerpt:
> {code:java}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f1467427fdc, pid=1315, tid=0x7f1464f2d700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_161-b12) (build 
> 1.8.0_161-b12)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.161-b12 mixed mode 
> linux-amd64 )
> # Problematic frame:
> # V  [libjvm.so+0x995fdc]  oopDesc* 
> PSPromotionManager::copy_to_survivor_space(oopDesc*)+0x7c
> #
> # Core dump written. Default location: 
> /var/lib/mesos/slave/slaves/92f50385-a83b-4f36-b1a3-53d9b8716544-S203/frameworks/92f50385-a83b-4f36-b1a3-53d9b8716544-0095/executors/14/runs/2e6b3a6e-b811-47d1-9393-66301d923b98/spark-2.3.0-bin-hadoop2.7/core
>  or core.1315
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #{code}
> {code:java}
> ---  T H R E A D  ---
> Current thread (0x7f146005b000):  GCTaskThread [stack: 
> 0x7f1464e2d000,0x7f1464f2e000] [id=1363]
> siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 
> 0x
> Registers:
> RAX=0x17e907feccbc6d20, RBX=0x7ef9c035f8c8, RCX=0x7f1464f2c9f0, 
> RDX=0x
> RSP=0x7f1464f2c1a0, RBP=0x7f1464f2c210, RSI=0x0068, 
> RDI=0x7ef7bc30bda8
> R8 =0x7f1464f2c3d0, R9 =0x1741, R10=0x7f1467a52819, 
> R11=0x7f14671240e0
> R12=0x7f130912c998, R13=0x17e907feccbc6d20, R14=0x0002, 
> R15=0x000d
> RIP=0x7f1467427fdc, EFLAGS=0x00010202, CSGSFS=0x002b0033, 
> ERR=0x
>   TRAPNO=0x000d
> Top of Stack: (sp=0x7f1464f2c1a0)
> 0x7f1464f2c1a0:   7f146005b000 0001
> 0x7f1464f2c1b0:   0004 7f14600bb640
> 0x7f1464f2c1c0:   7f1464f2c210 7f14673aeed6
> 0x7f1464f2c1d0:   7f1464f2c2c0 7f1464f2c250
> 0x7f1464f2c1e0:   7f11bde31b70 7ef9c035f8c8
> 0x7f1464f2c1f0:   7ef8a80a7060 1741
> 0x7f1464f2c200:   0002 
> 0x7f1464f2c210:   7f1464f2c230 7f146742b005
> 0x7f1464f2c220:   7ef8a80a7050 1741
> 0x7f1464f2c230:   7f1464f2c2d0 7f14673ae9fb
> 0x7f1464f2c240:   7f1467a5d880 7f14673ad9a0
> 0x7f1464f2c250:   7f1464f2c9f0 7f1464f2c3d0
> 0x7f1464f2c260:   7f1464f2c3a0 7f146005b620
> 0x7f1464f2c270:   7ef8b843d7c8 00020006
> 0x7f1464f2c280:   7f1464f2c340 7f14600bb640
> 0x7f1464f2c290:   17417f1453fb9cec 7f1453fb
> 0x7f1464f2c2a0:   7f1453fb819e 7f1464f2c3a0
> 0x7f1464f2c2b0:   0001 
> 0x7f1464f2c2c0:   7f1464f2c3d0 7f1464f2c9d0
> 0x7f1464f2c2d0:   7f1464f2c340 7f1467025f22
> 0x7f1464f2c2e0:   7f145427cb5c 7f1464f2c3a0
> 0x7f1464f2c2f0:   7f1464f2c370 7f146005b000
> 0x7f1464f2c300:   7f1464f2c9f0 7ef850009800
> 0x7f1464f2c310:   7f1464f2c9f0 7f1464f2c3a0
> 0x7f1464f2c320:   7f1464f2c3d0 7f146005b000
> 0x7f1464f2c330:   7f1464f2c9f0 7ef850009800
> 0x7f1464f2c340:   7f1464f2c9c0 7f1467508191
> 0x7f1464f2c350:   7ef9c16f7890 7f1464f2c370
> 0x7f1464f2c360:   7f1464f2c9d0 
> 0x7f1464f2c370:   7ef9c035f8c0 7f145427cb5c
> 0x7f1464f2c380:   7f145427ba90 7ef9
> 0x7f1464f2c390:   0078 7ef9c035f8c0 
> Instructions: (pc=0x7f1467427fdc)
> 0x7f1467427fbc:   01 0f 85 f5 00 00 00 89 f0 c1 f8 03 41 f6 c5 01
> 0x7f1467427fcc:   4c 63 f8 0f 85 04 01 00 00 4c 89 e8 48 83 e0 fd
> 0x7f1467427fdc:   48 8b 00 48 c1 e8 03 

[jira] [Commented] (SPARK-22618) RDD.unpersist can cause fatal exception when used with dynamic allocation

2018-03-27 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416821#comment-16416821
 ] 

Wenchen Fan commented on SPARK-22618:
-

Looks like we can apply the same fix to `Broadcast.unpersist`. Do you want to 
send a PR to fix it? Thanks!
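
For illustration, here is a minimal, self-contained sketch of the catch-and-log 
pattern proposed in the issue description below. The names and the stubbed RPC 
call are hypothetical; Spark's actual BlockManagerMaster internals differ.

{code:scala}
import java.io.IOException

object UnpersistSketch {
  // Stand-in for the blocking RPC that asks executors to drop the blocks.
  // An executor torn down by dynamic allocation can reset the connection
  // while the removal is in flight.
  def removeBlocks(id: Long): Unit =
    throw new IOException("Connection reset by peer")

  // Proposed behavior: log the IOException instead of letting it propagate
  // and kill the job, matching how lost executors are handled elsewhere.
  def safeUnpersist(id: Long): Unit =
    try removeBlocks(id)
    catch {
      case e: IOException =>
        println(s"WARN: failed to unpersist $id: ${e.getMessage}")
    }

  def main(args: Array[String]): Unit = safeUnpersist(42L)
}
{code}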

> RDD.unpersist can cause fatal exception when used with dynamic allocation
> -
>
> Key: SPARK-22618
> URL: https://issues.apache.org/jira/browse/SPARK-22618
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Brad
>Assignee: Brad
>Priority: Minor
> Fix For: 2.3.0
>
>
> If you use rdd.unpersist() with dynamic allocation, then an executor can be 
> deallocated while your RDD is being removed, which will throw an uncaught 
> exception, killing your job. 
> I looked into different ways of preventing this error from occurring but 
> couldn't come up with anything that wouldn't require a big change. I propose 
> that the best fix is just to catch and log IOExceptions in unpersist() so 
> they don't kill your job. This matches the effective behavior when executors 
> are lost through dynamic allocation in other parts of the code.
> In the worst-case scenario I think this could lead to RDD partitions getting 
> left on executors after they were unpersisted, but this is probably better 
> than the whole job failing. I think in most cases the IOException would be 
> due to the executor dying for some reason, which is effectively the same 
> result as unpersisting the RDD from that executor anyway.
> I noticed this exception in a job that loads a 100GB dataset on a cluster 
> where we use dynamic allocation heavily. Here is the relevant stack trace:
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
> at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
> at 
> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:276)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
> Exception in thread "main" org.apache.spark.SparkException: Exception thrown 
> in awaitResult:
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> at 
> org.apache.spark.storage.BlockManagerMaster.removeRdd(BlockManagerMaster.scala:131)
> at org.apache.spark.SparkContext.unpersistRDD(SparkContext.scala:1806)
> at org.apache.spark.rdd.RDD.unpersist(RDD.scala:217)
> at 
> com.ibm.sparktc.sparkbench.workload.exercise.CacheTest.doWorkload(CacheTest.scala:62)
> at 
> com.ibm.sparktc.sparkbench.workload.Workload$class.run(Workload.scala:40)
> at 
> com.ibm.sparktc.sparkbench.workload.exercise.CacheTest.run(CacheTest.scala:33)
> at 
> com.ibm.sparktc.sparkbench.workload.SuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$SuiteKickoff$$runSerially$1.apply(SuiteKickoff.scala:78)
> at 
> com.ibm.sparktc.sparkbench.workload.SuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$SuiteKickoff$$runSerially$1.apply(SuiteKickoff.scala:78)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:285)
> at 
> 

[jira] [Commented] (SPARK-23291) SparkR : substr : In a SparkR dataframe, the starting and ending position arguments in "substr" give the wrong result when the position is greater than 1

2018-03-27 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416763#comment-16416763
 ] 

Wenchen Fan commented on SPARK-23291:
-

Shall we backport this bug fix to 2.3?

> SparkR : substr : In a SparkR dataframe, the starting and ending position 
> arguments in "substr" give the wrong result when the position is greater 
> than 1
> --
>
> Key: SPARK-23291
> URL: https://issues.apache.org/jira/browse/SPARK-23291
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.2, 2.2.0, 2.2.1, 2.3.0
>Reporter: Narendra
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> Defect Description :
> -
> For example, an input string "2017-12-01" is read into a SparkR dataframe 
> "df" with column name "col1".
>  The target is to create a new column named "col2" with the value "12", 
> which is inside the string. "12" can be extracted with "starting position" 
> "6" and "ending position" "7"
>  (the starting position of the first character is considered to be "1").
> But the current code that needs to be written is:
>  
>  df <- withColumn(df, "col2", substr(df$col1, 7, 8))
> Observe that the first argument in the "substr" API, which indicates the 
> 'starting position', is given as "7".
>  Also observe that the second argument in the "substr" API, which indicates 
> the 'ending position', is given as "8".
> i.e. the number that has to be passed to indicate a position is the 
> "actual position + 1".
> Expected behavior :
> 
> The code that needs to be written is:
>  
>  df <- withColumn(df, "col2", substr(df$col1, 6, 7))
> Note :
> ---
>  This defect is observed only when the starting position is greater than 
> 1.
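
As a cross-check (not part of the original report): the Scala/SQL substring 
function is also 1-based but takes (start, length) rather than (start, end). A 
minimal sketch, assuming a local SparkSession, that extracts "12" starting at 
position 6:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.substring

object SubstrCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("substr-check").getOrCreate()
    import spark.implicits._
    val df = Seq("2017-12-01").toDF("col1")
    // substring is 1-based and takes (start, length): start 6, length 2
    // extracts "12"; with the fix, SparkR's substr(df$col1, 6, 7) agrees.
    df.select(substring($"col1", 6, 2).as("col2")).show()
    spark.stop()
  }
}
{code}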






[jira] [Resolved] (SPARK-23699) PySpark should raise same Error when Arrow fallback is disabled

2018-03-27 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-23699.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20839
[https://github.com/apache/spark/pull/20839]

> PySpark should raise same Error when Arrow fallback is disabled
> ---
>
> Key: SPARK-23699
> URL: https://issues.apache.org/jira/browse/SPARK-23699
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Minor
> Fix For: 2.4.0
>
>
> When a schema or import error is encountered while using Arrow for 
> createDataFrame or toPandas and fallback is disabled, a RuntimeError is 
> raised. It would be better to raise the same type of error as the one 
> originally encountered.






[jira] [Assigned] (SPARK-23699) PySpark should raise same Error when Arrow fallback is disabled

2018-03-27 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-23699:


Assignee: Bryan Cutler

> PySpark should raise same Error when Arrow fallback is disabled
> ---
>
> Key: SPARK-23699
> URL: https://issues.apache.org/jira/browse/SPARK-23699
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Minor
>
> When a schema or import error is encountered while using Arrow for 
> createDataFrame or toPandas and fallback is disabled, a RuntimeError is 
> raised. It would be better to raise the same type of error as the one 
> originally encountered.






[jira] [Commented] (SPARK-22618) RDD.unpersist can cause fatal exception when used with dynamic allocation

2018-03-27 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416309#comment-16416309
 ] 

Thomas Graves commented on SPARK-22618:
---

Thanks for fixing this; we're hitting it now in Spark 2.2. I think this same 
issue can happen with broadcast variables if it's told to wait. Did you happen 
to look at that at the same time?

> RDD.unpersist can cause fatal exception when used with dynamic allocation
> -
>
> Key: SPARK-22618
> URL: https://issues.apache.org/jira/browse/SPARK-22618
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Brad
>Assignee: Brad
>Priority: Minor
> Fix For: 2.3.0
>
>
> If you use rdd.unpersist() with dynamic allocation, then an executor can be 
> deallocated while your RDD is being removed, which will throw an uncaught 
> exception, killing your job. 
> I looked into different ways of preventing this error from occurring but 
> couldn't come up with anything that wouldn't require a big change. I propose 
> that the best fix is just to catch and log IOExceptions in unpersist() so 
> they don't kill your job. This matches the effective behavior when executors 
> are lost through dynamic allocation in other parts of the code.
> In the worst-case scenario I think this could lead to RDD partitions getting 
> left on executors after they were unpersisted, but this is probably better 
> than the whole job failing. I think in most cases the IOException would be 
> due to the executor dying for some reason, which is effectively the same 
> result as unpersisting the RDD from that executor anyway.
> I noticed this exception in a job that loads a 100GB dataset on a cluster 
> where we use dynamic allocation heavily. Here is the relevant stack trace:
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
> at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
> at 
> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:276)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> at java.lang.Thread.run(Thread.java:748)
> Exception in thread "main" org.apache.spark.SparkException: Exception thrown 
> in awaitResult:
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> at 
> org.apache.spark.storage.BlockManagerMaster.removeRdd(BlockManagerMaster.scala:131)
> at org.apache.spark.SparkContext.unpersistRDD(SparkContext.scala:1806)
> at org.apache.spark.rdd.RDD.unpersist(RDD.scala:217)
> at 
> com.ibm.sparktc.sparkbench.workload.exercise.CacheTest.doWorkload(CacheTest.scala:62)
> at 
> com.ibm.sparktc.sparkbench.workload.Workload$class.run(Workload.scala:40)
> at 
> com.ibm.sparktc.sparkbench.workload.exercise.CacheTest.run(CacheTest.scala:33)
> at 
> com.ibm.sparktc.sparkbench.workload.SuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$SuiteKickoff$$runSerially$1.apply(SuiteKickoff.scala:78)
> at 
> com.ibm.sparktc.sparkbench.workload.SuiteKickoff$$anonfun$com$ibm$sparktc$sparkbench$workload$SuiteKickoff$$runSerially$1.apply(SuiteKickoff.scala:78)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:285)
> at 
> 

[jira] [Resolved] (SPARK-23096) Migrate rate source to v2

2018-03-27 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-23096.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 20688
[https://github.com/apache/spark/pull/20688]

> Migrate rate source to v2
> -
>
> Key: SPARK-23096
> URL: https://issues.apache.org/jira/browse/SPARK-23096
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Saisai Shao
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Assigned] (SPARK-23096) Migrate rate source to v2

2018-03-27 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-23096:
-

Assignee: Saisai Shao

> Migrate rate source to v2
> -
>
> Key: SPARK-23096
> URL: https://issues.apache.org/jira/browse/SPARK-23096
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Saisai Shao
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Commented] (SPARK-19964) Flaky test: SparkSubmitSuite "includes jars passed in through --packages"

2018-03-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416239#comment-16416239
 ] 

Apache Spark commented on SPARK-19964:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20916

> Flaky test: SparkSubmitSuite "includes jars passed in through --packages"
> -
>
> Key: SPARK-19964
> URL: https://issues.apache.org/jira/browse/SPARK-19964
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Tests
>Affects Versions: 2.2.0
>Reporter: Eren Avsarogullari
>Priority: Major
>  Labels: flaky-test
> Attachments: SparkSubmitSuite_Stacktrace
>
>
> The following test case has failed due to a TestFailedDueToTimeoutException:
> *Test Suite:* SparkSubmitSuite
> *Test Case:* includes jars passed in through --packages
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74413/testReport/
> *Stacktrace is also attached.*






[jira] [Resolved] (SPARK-23804) Flaky test: SparkSubmitSuite "repositories"

2018-03-27 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23804.

Resolution: Duplicate

Same root cause as SPARK-19964.

> Flaky test: SparkSubmitSuite "repositories"
> ---
>
> Key: SPARK-23804
> URL: https://issues.apache.org/jira/browse/SPARK-23804
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> Seen on an unrelated PR:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88624/testReport/org.apache.spark.deploy/SparkSubmitSuite/repositories/
> {noformat}
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> failAfter did not complete within 60 seconds.
>   at java.lang.Thread.getStackTrace(Thread.java:1552)
>   at 
> org.scalatest.concurrent.TimeLimits$class.failAfterImpl(TimeLimits.scala:234)
>   at 
> org.apache.spark.deploy.SparkSubmitSuite$.failAfterImpl(SparkSubmitSuite.scala:1066)
>   at 
> org.scalatest.concurrent.TimeLimits$class.failAfter(TimeLimits.scala:230)
>   at 
> org.apache.spark.deploy.SparkSubmitSuite$.failAfter(SparkSubmitSuite.scala:1066)
>   at 
> org.apache.spark.deploy.SparkSubmitSuite$.runSparkSubmit(SparkSubmitSuite.scala:1085)
>   at 
> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$10$$anonfun$apply$mcV$sp$2.apply(SparkSubmitSuite.scala:545)
>   at 
> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$10$$anonfun$apply$mcV$sp$2.apply(SparkSubmitSuite.scala:534)
>   at 
> org.apache.spark.deploy.IvyTestUtils$.withRepository(IvyTestUtils.scala:377)
>   at 
> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$10.apply$mcV$sp(SparkSubmitSuite.scala:534)
>   at 
> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$10.apply(SparkSubmitSuite.scala:529)
>   at 
> org.apache.spark.deploy.SparkSubmitSuite$$anonfun$10.apply(SparkSubmitSuite.scala:529)
> {noformat}






[jira] [Assigned] (SPARK-19964) Flaky test: SparkSubmitSuite "includes jars passed in through --packages"

2018-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19964:


Assignee: (was: Apache Spark)

> Flaky test: SparkSubmitSuite "includes jars passed in through --packages"
> -
>
> Key: SPARK-19964
> URL: https://issues.apache.org/jira/browse/SPARK-19964
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Tests
>Affects Versions: 2.2.0
>Reporter: Eren Avsarogullari
>Priority: Major
>  Labels: flaky-test
> Attachments: SparkSubmitSuite_Stacktrace
>
>
> The following test case has failed due to a TestFailedDueToTimeoutException:
> *Test Suite:* SparkSubmitSuite
> *Test Case:* includes jars passed in through --packages
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74413/testReport/
> *Stacktrace is also attached.*






[jira] [Assigned] (SPARK-19964) Flaky test: SparkSubmitSuite "includes jars passed in through --packages"

2018-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19964:


Assignee: Apache Spark

> Flaky test: SparkSubmitSuite "includes jars passed in through --packages"
> -
>
> Key: SPARK-19964
> URL: https://issues.apache.org/jira/browse/SPARK-19964
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Tests
>Affects Versions: 2.2.0
>Reporter: Eren Avsarogullari
>Assignee: Apache Spark
>Priority: Major
>  Labels: flaky-test
> Attachments: SparkSubmitSuite_Stacktrace
>
>
> The following test case has failed due to a TestFailedDueToTimeoutException:
> *Test Suite:* SparkSubmitSuite
> *Test Case:* includes jars passed in through --packages
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74413/testReport/
> *Stacktrace is also attached.*






[jira] [Commented] (SPARK-19964) Flaky test: SparkSubmitSuite "includes jars passed in through --packages"

2018-03-27 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416141#comment-16416141
 ] 

Marcelo Vanzin commented on SPARK-19964:


http://dl.bintray.com doesn't seem to be loading right now, and that repo is 
hardcoded in {{SparkSubmit.scala}}. It might be good to make the tests skip 
these remote repos when they don't need them.

> Flaky test: SparkSubmitSuite "includes jars passed in through --packages"
> -
>
> Key: SPARK-19964
> URL: https://issues.apache.org/jira/browse/SPARK-19964
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Tests
>Affects Versions: 2.2.0
>Reporter: Eren Avsarogullari
>Priority: Major
>  Labels: flaky-test
> Attachments: SparkSubmitSuite_Stacktrace
>
>
> The following test case has failed due to a TestFailedDueToTimeoutException:
> *Test Suite:* SparkSubmitSuite
> *Test Case:* includes jars passed in through --packages
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74413/testReport/
> *Stacktrace is also attached.*






[jira] [Commented] (SPARK-23714) Add metrics for cached KafkaConsumer

2018-03-27 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416138#comment-16416138
 ] 

Ted Yu commented on SPARK-23714:


[~tdas]:
What do you think?

Thanks

> Add metrics for cached KafkaConsumer
> 
>
> Key: SPARK-23714
> URL: https://issues.apache.org/jira/browse/SPARK-23714
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Ted Yu
>Priority: Major
>
> SPARK-23623 added KafkaDataConsumer to avoid concurrent use of cached 
> KafkaConsumer.
> This JIRA is to add metrics for measuring the operations of the cache so that 
> users can gain insight into the caching solution.
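
As a rough illustration of the kind of metrics meant here (purely hypothetical 
names, not an existing Spark API), a cache wrapper that counts hits and misses:

{code:scala}
import java.util.concurrent.atomic.AtomicLong
import scala.collection.concurrent.TrieMap

// Hypothetical sketch: wrap a consumer cache with hit/miss counters so the
// behavior of the cache can be observed and reported as metrics.
class MeteredCache[K, V] {
  private val underlying = TrieMap.empty[K, V]
  val hits = new AtomicLong(0)
  val misses = new AtomicLong(0)

  def getOrCreate(key: K)(create: => V): V =
    underlying.get(key) match {
      case Some(v) =>
        hits.incrementAndGet()
        v
      case None =>
        misses.incrementAndGet()
        val created = create
        // Keep whichever value won a concurrent race to insert.
        underlying.putIfAbsent(key, created).getOrElse(created)
    }
}
{code}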






[jira] [Commented] (SPARK-19964) Flaky test: SparkSubmitSuite "includes jars passed in through --packages"

2018-03-27 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416091#comment-16416091
 ] 

Marcelo Vanzin commented on SPARK-19964:


I ran into this failure in our internal jenkins also... the logs look similar 
to the above failure. It seems the code is taking a long time inside ivy 
libraries:

{noformat}
18/03/27 02:20:43.516 Utils: SLF4J: Class path contains multiple SLF4J bindings.
...
18/03/27 02:21:22.384 Utils:found my.great.lib#mylib;0.1 in repo-1
{noformat}

Those are the first and last log lines for the test. In our internal jenkins 
spark-submit makes further progress, but still the timeout is caused by the 
call into ivy taking a long time:

{noformat}
18/03/27 11:21:20.307 Ivy Default Cache set to: /var/lib/jenkins/.ivy2/cache
...
18/03/27 11:21:20.582 :: resolving dependencies :: 
org.apache.spark#spark-submit-parent;1.0
18/03/27 11:21:20.582   confs: [default]
18/03/27 11:21:41.618   found my.great.lib#mylib;0.1 in repo-1
18/03/27 11:22:11.271   found my.great.dep#mylib;0.1 in repo-1
...
18/03/27 11:22:18.878 INFO BlockManagerMasterEndpoint: Registering block 
manager 172.28.195.10:58362 with 366.3 MB RAM, BlockManagerId(0, 172.28.195.10, 
58362, None)
{noformat}

I wonder if it has anything to do with ivy trying to access the network during 
these tests, or maybe some local lock (which would be affected by multiple 
jenkins jobs on the same machine).

> Flaky test: SparkSubmitSuite "includes jars passed in through --packages"
> -
>
> Key: SPARK-19964
> URL: https://issues.apache.org/jira/browse/SPARK-19964
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Tests
>Affects Versions: 2.2.0
>Reporter: Eren Avsarogullari
>Priority: Major
>  Labels: flaky-test
> Attachments: SparkSubmitSuite_Stacktrace
>
>
> The following test case has failed due to a TestFailedDueToTimeoutException:
> *Test Suite:* SparkSubmitSuite
> *Test Case:* includes jars passed in through --packages
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74413/testReport/
> *Stacktrace is also attached.*






[jira] [Updated] (SPARK-9209) Using executor allocation, an executor is removed but it still appears in the ExecutorsPage of the web UI

2018-03-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9209:
-
Due Date: (was: 21/Jul/15)

> Using executor allocation, an executor is removed but it still appears in 
> the ExecutorsPage of the web UI
> --
>
> Key: SPARK-9209
> URL: https://issues.apache.org/jira/browse/SPARK-9209
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.1
>Reporter: KaiXinXIaoLei
>Priority: Minor
> Attachments: A Executor exists in web.png, executor is removed.png
>
>
> I set "spark.dynamicAllocation.enabled = true”, and  run a big job. In 
> driver, a executor is asked to remove, and it's remove successfully, and the 
> process of this executor is not exist. But it exists in ExecutorsPage of the 
> web ui.
> The log in driver :
> 2015-07-17 11:48:14,543 | INFO  | 
> [sparkDriver-akka.actor.default-dispatcher-3] | Removing block manager 
> BlockManagerId(264, 172.1.1.8, 23811) 
> 2015-07-17 11:48:14,543 | INFO  | [dag-scheduler-event-loop] | Removed 264 
> successfully in removeExecutor 
> 2015-07-17 11:48:21,226 | INFO  | 
> [sparkDriver-akka.actor.default-dispatcher-3] | Registering block manager 
> 172.1.1.8:23811 with 10.4 GB RAM, BlockManagerId(264, 172.1.1.8, 23811) 
> 2015-07-17 11:48:21,228 | INFO  | 
> [sparkDriver-akka.actor.default-dispatcher-3] | Added broadcast_781_piece0 in 
> memory on 172.1.1.8:23811 (size: 38.6 KB, free: 10.4 GB)  
> 2015-07-17 11:48:35,277 | ERROR | 
> [sparkDriver-akka.actor.default-dispatcher-16] | Lost executor 264 on 
> datasight-195: remote Rpc client disassociated 
> 2015-07-17 11:48:35,277 | WARN  | 
> [sparkDriver-akka.actor.default-dispatcher-4] | Association with remote 
> system [akka.tcp://sparkExecutor@datasight-195:23929] has failed, address is 
> now gated for [5000] ms. Reason is: [Disassociated].
> 2015-07-17 11:48:35,277 | INFO  | 
> [sparkDriver-akka.actor.default-dispatcher-16] | Re-queueing tasks for 264 
> from TaskSet 415.0 
> 2015-07-17 11:48:35,804 | INFO  | [SparkListenerBus] | Existing executor 264 
> has been removed (new total is 10)






[jira] [Updated] (SPARK-23694) The staging directory should be under hive.exec.stagingdir if we set hive.exec.stagingdir to a path not under the table directory

2018-03-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23694:
--
Target Version/s:   (was: 2.3.0)

> The staging directory should be under hive.exec.stagingdir if we set 
> hive.exec.stagingdir to a path not under the table directory
> -
>
> Key: SPARK-23694
> URL: https://issues.apache.org/jira/browse/SPARK-23694
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Yifeng Dong
>Priority: Major
>
> When we set hive.exec.stagingdir to a path that is not under the table 
> directory, for example /tmp/hive-staging, I think the staging directory 
> should be created under /tmp/hive-staging, not under /tmp/ as 
> /tmp/hive-staging_xxx.
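
For reference, a minimal sketch of the configuration being discussed. The table 
names are placeholders, and the comment describes the proposed (not current) 
directory layout:

{code:scala}
import org.apache.spark.sql.SparkSession

object StagingDirExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("staging-dir-example")
      .config("hive.exec.stagingdir", "/tmp/hive-staging")
      .enableHiveSupport()
      .getOrCreate()
    // Proposed behavior: the temporary staging directories created by this
    // INSERT would live under /tmp/hive-staging/ instead of appearing as
    // sibling directories such as /tmp/hive-staging_xxx directly under /tmp/.
    spark.sql("INSERT INTO TABLE t1 SELECT * FROM t2")
    spark.stop()
  }
}
{code}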






[jira] [Updated] (SPARK-22797) Add multiple column support to PySpark Bucketizer

2018-03-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22797:
--
Target Version/s:   (was: 2.3.0)

> Add multiple column support to PySpark Bucketizer
> -
>
> Key: SPARK-22797
> URL: https://issues.apache.org/jira/browse/SPARK-22797
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: zhengruifeng
>Priority: Major
>







[jira] [Updated] (SPARK-10473) EventLog will lose messages in long-running secure applications

2018-03-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10473:
--
Due Date: (was: 7/Sep/15)

> EventLog will lose messages in long-running secure applications
> ---
>
> Key: SPARK-10473
> URL: https://issues.apache.org/jira/browse/SPARK-10473
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.5.0, 1.6.0
>Reporter: carlmartin
>Priority: Major
>
> In the implementation of *EventLoggingListener*, there is only one 
> OutputStream writing event messages to HDFS.
> But when the token of the *DFSClient* behind the "OutputStream" expires, the 
> "DFSClient" no longer has the right to write, and all subsequent messages 
> are lost.






[jira] [Updated] (SPARK-23425) load data for HDFS file paths with wildcard usage is not working properly

2018-03-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23425:
--
Due Date: (was: 15/Feb/18)

> load data for HDFS file paths with wildcard usage is not working properly
> -
>
> Key: SPARK-23425
> URL: https://issues.apache.org/jira/browse/SPARK-23425
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Sujith
>Priority: Major
> Attachments: wildcard_issue.PNG
>
>
> The load data command for loading data from non-local file paths using 
> wildcard strings like * is not working.
> e.g.:
> "load data inpath 'hdfs://hacluster/user/ext*' into table t1"
> Getting an AnalysisException while executing this query (see the attached 
> screenshot).






[jira] [Commented] (SPARK-22809) pyspark is sensitive to imports with dots

2018-03-27 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416038#comment-16416038
 ] 

holdenk commented on SPARK-22809:
-

This _should_ be resolved by SPARK-23169, but I'll double-check when I've got 
some cycles set aside this Friday.

> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Cricket Temple
>Assignee: holdenk
>Priority: Major
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> raise Exception("Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate2.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> # But now it works!
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> {noformat}






[jira] [Updated] (SPARK-19964) Flaky test: SparkSubmitSuite "includes jars passed in through --packages"

2018-03-27 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-19964:
---
Summary: Flaky test: SparkSubmitSuite "includes jars passed in through 
--packages"  (was: Flaky test: SparkSubmitSuite fails due to Timeout)

> Flaky test: SparkSubmitSuite "includes jars passed in through --packages"
> -
>
> Key: SPARK-19964
> URL: https://issues.apache.org/jira/browse/SPARK-19964
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Tests
>Affects Versions: 2.2.0
>Reporter: Eren Avsarogullari
>Priority: Major
>  Labels: flaky-test
> Attachments: SparkSubmitSuite_Stacktrace
>
>
> The following test case has failed due to a TestFailedDueToTimeoutException:
> *Test Suite:* SparkSubmitSuite
> *Test Case:* includes jars passed in through --packages
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74413/testReport/
> *Stacktrace is also attached.*






[jira] [Created] (SPARK-23804) Flaky test: SparkSubmitSuite "repositories"

2018-03-27 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-23804:
--

 Summary: Flaky test: SparkSubmitSuite "repositories"
 Key: SPARK-23804
 URL: https://issues.apache.org/jira/browse/SPARK-23804
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.4.0
Reporter: Marcelo Vanzin


Seen on an unrelated PR:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88624/testReport/org.apache.spark.deploy/SparkSubmitSuite/repositories/

{noformat}
sbt.ForkMain$ForkError: 
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
failAfter did not complete within 60 seconds.
at java.lang.Thread.getStackTrace(Thread.java:1552)
at 
org.scalatest.concurrent.TimeLimits$class.failAfterImpl(TimeLimits.scala:234)
at 
org.apache.spark.deploy.SparkSubmitSuite$.failAfterImpl(SparkSubmitSuite.scala:1066)
at 
org.scalatest.concurrent.TimeLimits$class.failAfter(TimeLimits.scala:230)
at 
org.apache.spark.deploy.SparkSubmitSuite$.failAfter(SparkSubmitSuite.scala:1066)
at 
org.apache.spark.deploy.SparkSubmitSuite$.runSparkSubmit(SparkSubmitSuite.scala:1085)
at 
org.apache.spark.deploy.SparkSubmitSuite$$anonfun$10$$anonfun$apply$mcV$sp$2.apply(SparkSubmitSuite.scala:545)
at 
org.apache.spark.deploy.SparkSubmitSuite$$anonfun$10$$anonfun$apply$mcV$sp$2.apply(SparkSubmitSuite.scala:534)
at 
org.apache.spark.deploy.IvyTestUtils$.withRepository(IvyTestUtils.scala:377)
at 
org.apache.spark.deploy.SparkSubmitSuite$$anonfun$10.apply$mcV$sp(SparkSubmitSuite.scala:534)
at 
org.apache.spark.deploy.SparkSubmitSuite$$anonfun$10.apply(SparkSubmitSuite.scala:529)
at 
org.apache.spark.deploy.SparkSubmitSuite$$anonfun$10.apply(SparkSubmitSuite.scala:529)
{noformat}






[jira] [Commented] (SPARK-19964) Flaky test: SparkSubmitSuite fails due to Timeout

2018-03-27 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416033#comment-16416033
 ] 

Marcelo Vanzin commented on SPARK-19964:


Still flaky:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88624/testReport/org.apache.spark.deploy/SparkSubmitSuite/includes_jars_passed_in_through___packages/

{noformat}
sbt.ForkMain$ForkError: 
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
failAfter did not complete within 60 seconds.
at java.lang.Thread.getStackTrace(Thread.java:1552)
at 
org.scalatest.concurrent.TimeLimits$class.failAfterImpl(TimeLimits.scala:234)
at 
org.apache.spark.deploy.SparkSubmitSuite$.failAfterImpl(SparkSubmitSuite.scala:1066)
at 
org.scalatest.concurrent.TimeLimits$class.failAfter(TimeLimits.scala:230)
at 
org.apache.spark.deploy.SparkSubmitSuite$.failAfter(SparkSubmitSuite.scala:1066)
at 
org.apache.spark.deploy.SparkSubmitSuite$.runSparkSubmit(SparkSubmitSuite.scala:1085)
at 
org.apache.spark.deploy.SparkSubmitSuite$$anonfun$9$$anonfun$apply$mcV$sp$1.apply(SparkSubmitSuite.scala:525)
at 
org.apache.spark.deploy.SparkSubmitSuite$$anonfun$9$$anonfun$apply$mcV$sp$1.apply(SparkSubmitSuite.scala:514)
at 
org.apache.spark.deploy.IvyTestUtils$.withRepository(IvyTestUtils.scala:377)
at 
org.apache.spark.deploy.SparkSubmitSuite$$anonfun$9.apply$mcV$sp(SparkSubmitSuite.scala:514)
at 
org.apache.spark.deploy.SparkSubmitSuite$$anonfun$9.apply(SparkSubmitSuite.scala:510)
at 
org.apache.spark.deploy.SparkSubmitSuite$$anonfun$9.apply(SparkSubmitSuite.scala:510)
{noformat}

> Flaky test: SparkSubmitSuite fails due to Timeout
> -
>
> Key: SPARK-19964
> URL: https://issues.apache.org/jira/browse/SPARK-19964
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Tests
>Affects Versions: 2.2.0
>Reporter: Eren Avsarogullari
>Priority: Major
>  Labels: flaky-test
> Attachments: SparkSubmitSuite_Stacktrace
>
>
> The following test case has failed due to a TestFailedDueToTimeoutException:
> *Test Suite:* SparkSubmitSuite
> *Test Case:* includes jars passed in through --packages
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74413/testReport/
> *Stacktrace is also attached.*






[jira] [Updated] (SPARK-23195) Hint of cached data is lost

2018-03-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23195:
--
Target Version/s: 2.3.1  (was: 2.3.0)

> Hint of cached data is lost
> ---
>
> Key: SPARK-23195
> URL: https://issues.apache.org/jira/browse/SPARK-23195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> {noformat}
> withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
>   val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
> "value")
>   val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", 
> "value")
>   broadcast(df2).cache()
>   df2.collect()
>   val df3 = df1.join(df2, Seq("key"), "inner")
>   val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
> case b: BroadcastHashJoinExec => b
>   }.size
>   assert(numBroadCastHashJoin === 1)
> }
> {noformat}
> The broadcast hint is not respected.






[jira] [Updated] (SPARK-22809) pyspark is sensitive to imports with dots

2018-03-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22809:
--
Target Version/s: 2.3.1, 2.4.0  (was: 2.3.0, 2.4.0)

> pyspark is sensitive to imports with dots
> -
>
> Key: SPARK-22809
> URL: https://issues.apache.org/jira/browse/SPARK-22809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Cricket Temple
>Assignee: holdenk
>Priority: Major
>
> User code can fail with dotted imports.  Here's a repro script.
> {noformat}
> import numpy as np
> import pandas as pd
> import pyspark
> import scipy.interpolate
> import scipy.interpolate as scipy_interpolate
> import py4j
> scipy_interpolate2 = scipy.interpolate
> sc = pyspark.SparkContext()
> spark_session = pyspark.SQLContext(sc)
> ###
> # The details of this dataset are irrelevant  #
> # Sorry if you'd have preferred something more boring #
> ###
> x__ = np.linspace(0,10,1000)
> freq__ = np.arange(1,5)
> x_, freq_ = np.ix_(x__, freq__)
> y = np.sin(x_ * freq_).ravel()
> x = (x_ * np.ones(freq_.shape)).ravel()
> freq = (np.ones(x_.shape) * freq_).ravel()
> df_pd = pd.DataFrame(np.stack([x,y,freq]).T, columns=['x','y','freq'])
> df_sk = spark_session.createDataFrame(df_pd)
> assert(df_sk.toPandas() == df_pd).all().all()
> try:
> import matplotlib.pyplot as plt
> for f, data in df_pd.groupby("freq"):
> plt.plot(*data[['x','y']].values.T)
> plt.show()
> except:
> print("I guess we can't plot anything")
> def mymap(x, interp_fn):
> df = pd.DataFrame.from_records([row.asDict() for row in list(x)])
> return interp_fn(df.x.values, df.y.values)(np.pi)
> df_by_freq = df_sk.rdd.keyBy(lambda x: x.freq).groupByKey()
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> try:
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> raise Exception("Not going to reach this line")
> except py4j.protocol.Py4JJavaError, e:
> print("See?")
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy_interpolate2.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> # But now it works!
> result = df_by_freq.mapValues(lambda x: mymap(x, 
> scipy.interpolate.interp1d)).collect()
> assert(np.allclose(np.array(zip(*result)[1]), np.zeros(len(freq__)), 
> atol=1e-6))
> {noformat}






[jira] [Updated] (SPARK-23292) python tests related to pandas are skipped with python 2

2018-03-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23292:
--
Target Version/s:   (was: 2.3.0)

> python tests related to pandas are skipped with python 2
> 
>
> Key: SPARK-23292
> URL: https://issues.apache.org/jira/browse/SPARK-23292
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Yin Huai
>Priority: Critical
>
> I was running python tests and found that 
> [pyspark.sql.tests.GroupbyAggPandasUDFTests.test_unsupported_types|https://github.com/apache/spark/blob/52e00f70663a87b5837235bdf72a3e6f84e11411/python/pyspark/sql/tests.py#L4528-L4548]
>  does not run with Python 2 because the test uses "assertRaisesRegex" 
> (supported by Python 3) instead of "assertRaisesRegexp" (supported by Python 
> 2). However, Spark's jenkins does not fail because of this issue (see the 
> run history at 
> [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/]).
>  After looking into this issue, [it seems the test script will skip tests 
> related to pandas if pandas is not 
> installed|https://github.com/apache/spark/blob/2ac895be909de7e58e1051dc2a1bba98a25bf4be/python/pyspark/sql/tests.py#L51-L63],
>  which means that jenkins does not have pandas installed. 
>  
> Since pyarrow-related tests have the same skipping logic, we will need to 
> check whether jenkins has pyarrow installed correctly as well. 
>  
> Since features using pandas and pyarrow are in 2.3, we should fix the test 
> issue and make sure all tests pass before we make the release.






[jira] [Updated] (SPARK-23309) Spark 2.3 cached query performance 20-30% worse than Spark 2.2

2018-03-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23309:
--
Target Version/s:   (was: 2.3.0)

> Spark 2.3 cached query performance 20-30% worse than Spark 2.2
> --
>
> Key: SPARK-23309
> URL: https://issues.apache.org/jira/browse/SPARK-23309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Major
>
> I was testing Spark 2.3 RC2 and I am seeing a performance regression in SQL 
> queries on cached data.
> The size of the data: 10.4 GB input from Hive ORC files / 18.8 GB cached / 
> 5592 partitions.
> Here is the example query:
> val dailycached = spark.sql("select something from table where dt = 
> '20170301' AND something IS NOT NULL")
> dailycached.createOrReplaceTempView("dailycached") 
> spark.catalog.cacheTable("dailyCached")
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show()
>  
> On Spark 2.2 I see query times averaging 13 seconds.
> On the same nodes I see Spark 2.3 query times averaging 17 seconds.
> Note these are times of queries after the initial caching, i.e. just running 
> the last line again: 
> spark.sql("SELECT COUNT(DISTINCT(something)) from dailycached").show() 
> multiple times.
>  
> I also ran a query over more data (335 GB input / 587.5 GB cached) and saw a 
> similar discrepancy in the performance of querying cached data between Spark 
> 2.3 and Spark 2.2, where 2.2 was better by roughly 20%.






[jira] [Updated] (SPARK-17859) persist should not impede Spark's ability to perform a broadcast join

2018-03-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17859:
--
Fix Version/s: (was: 2.2.1)
   (was: 2.0.2)

> persist should not impede Spark's ability to perform a broadcast join
> ---
>
> Key: SPARK-17859
> URL: https://issues.apache.org/jira/browse/SPARK-17859
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.0.0
> Environment: spark 2.0.0 , Linux RedHat
>Reporter: Franck Tago
>Priority: Major
>
> I am using Spark 2.0.0.
> My investigation leads me to conclude that calling persist can prevent a 
> broadcast join from happening.
> Example
> Case1: No persist call 
> var  df1 =spark.range(100).select($"id".as("id1"))
> df1: org.apache.spark.sql.DataFrame = [id1: bigint]
>  var df2 =spark.range(1000).select($"id".as("id2"))
> df2: org.apache.spark.sql.DataFrame = [id2: bigint]
>  df1.join(df2 , $"id1" === $"id2" ).explain 
> == Physical Plan ==
> *BroadcastHashJoin [id1#117L], [id2#123L], Inner, BuildRight
> :- *Project [id#114L AS id1#117L]
> :  +- *Range (0, 100, splits=2)
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
> false]))
>+- *Project [id#120L AS id2#123L]
>   +- *Range (0, 1000, splits=2)
> Case 2:  persist call 
>  df1.persist.join(df2 , $"id1" === $"id2" ).explain 
> 16/10/10 15:50:21 WARN CacheManager: Asked to cache already cached data.
> == Physical Plan ==
> *SortMergeJoin [id1#3L], [id2#9L], Inner
> :- *Sort [id1#3L ASC], false, 0
> :  +- Exchange hashpartitioning(id1#3L, 10)
> : +- InMemoryTableScan [id1#3L]
> ::  +- InMemoryRelation [id1#3L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> :: :  +- *Project [id#0L AS id1#3L]
> :: : +- *Range (0, 100, splits=2)
> +- *Sort [id2#9L ASC], false, 0
>+- Exchange hashpartitioning(id2#9L, 10)
>   +- InMemoryTableScan [id2#9L]
>  :  +- InMemoryRelation [id2#9L], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
>  : :  +- *Project [id#6L AS id2#9L]
>  : : +- *Range (0, 1000, splits=2)
> Why does the persist call prevent the broadcast join?
> My opinion is that it should not.
> I was made aware that the persist call is lazy and that might have something 
> to do with it, but I still contend that it should not.
> Losing broadcast joins is really costly.
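
A possible workaround sketch (not from the original report, and the behavior 
may vary across versions): apply the broadcast hint explicitly at the join 
site, which can steer the planner back toward a BroadcastHashJoin even when an 
input is persisted:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object PersistBroadcastSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]").appName("persist-broadcast").getOrCreate()
    import spark.implicits._
    val df1 = spark.range(100).select($"id".as("id1"))
    val df2 = spark.range(1000).select($"id".as("id2"))
    df1.persist()
    // Explicit hint at the join site; inspect the plan for BroadcastHashJoin.
    df1.join(broadcast(df2), $"id1" === $"id2").explain()
    spark.stop()
  }
}
{code}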






[jira] [Updated] (SPARK-23625) Long-running Spark SQL job dies

2018-03-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23625:
--
Target Version/s:   (was: 1.6.2)
   Fix Version/s: (was: 1.6.2)

> Long-running Spark SQL job dies
> ---
>
> Key: SPARK-23625
> URL: https://issues.apache.org/jira/browse/SPARK-23625
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2
>Reporter: Yu Wang
>Priority: Major
> Attachments: 1520489823.png, 1520489833.png, 1520489848.png, 
> 1520489854.png, 1520489861.png, 1520489867.png
>
>
> A long-running Spark SQL job dies.






[jira] [Updated] (SPARK-23657) Document InternalRow and expose it as a stable interface

2018-03-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23657:
--
Fix Version/s: (was: 2.4.0)

> Document InternalRow and expose it as a stable interface
> 
>
> Key: SPARK-23657
> URL: https://issues.apache.org/jira/browse/SPARK-23657
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> The new DataSourceV2 API needs to stabilize the {{InternalRow}} interface so 
> that it can be used by new data source implementations. It already exposes 
> {{UnsafeRow}} for reads and {{InternalRow}} for writes, and the 
> representations are unlikely to change so this is primarily documentation 
> work.
> For more discussion, see SPARK-23325.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22839) Refactor Kubernetes code for configuring driver/executor pods to use consistent and cleaner abstraction

2018-03-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22839:
--
Fix Version/s: (was: 2.4.0)

> Refactor Kubernetes code for configuring driver/executor pods to use 
> consistent and cleaner abstraction
> ---
>
> Key: SPARK-22839
> URL: https://issues.apache.org/jira/browse/SPARK-22839
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Yinan Li
>Priority: Major
>
> As discussed in https://github.com/apache/spark/pull/19954, the current code 
> for configuring the driver pod vs the code for configuring the executor pods 
> are not using the same abstraction. Besides that, the current code leaves a 
> lot to be desired in terms of the level and cleanness of abstraction. For 
> example, the current code passes many pieces of information around 
> different class hierarchies, which makes code review and maintenance 
> challenging. We need some thorough refactoring of the current code to achieve 
> better, cleaner, and consistent abstraction.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23803) Support bucket pruning to optimize filtering on a bucketed column

2018-03-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415997#comment-16415997
 ] 

Apache Spark commented on SPARK-23803:
--

User 'sabanas' has created a pull request for this issue:
https://github.com/apache/spark/pull/20915

> Support bucket pruning to optimize filtering on a bucketed column
> -
>
> Key: SPARK-23803
> URL: https://issues.apache.org/jira/browse/SPARK-23803
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Asher Saban
>Priority: Major
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Support bucket pruning when filtering on a single bucketed column with the 
> following predicates:
>  # EqualTo
>  # EqualNullSafe
>  # In
>  # (1)-(3) combined in And/Or predicates
>  
> This is based on [~smilegator]'s work in SPARK-12850, which was removed from 
> the code base.
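> As a hedged illustration of the intended effect (the table, column, and 
> variable names here are hypothetical, not from the original report):
> {code:java}
> import spark.implicits._
> 
> // Write a table hash-bucketed by id into 8 buckets.
> val df = spark.range(1000).toDF("id")
> df.write.bucketBy(8, "id").sortBy("id").saveAsTable("bucketed_t")
> 
> // With bucket pruning, an EqualTo filter on the bucketed column would only
> // need to scan the one bucket that id = 42 hashes to, instead of all 8.
> spark.table("bucketed_t").filter($"id" === 42).explain
> {code}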



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23803) Support bucket pruning to optimize filtering on a bucketed column

2018-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23803:


Assignee: Apache Spark

> Support bucket pruning to optimize filtering on a bucketed column
> -
>
> Key: SPARK-23803
> URL: https://issues.apache.org/jira/browse/SPARK-23803
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Asher Saban
>Assignee: Apache Spark
>Priority: Major
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Support bucket pruning when filtering on a single bucketed column with the 
> following predicates:
>  # EqualTo
>  # EqualNullSafe
>  # In
>  # (1)-(3) combined in And/Or predicates
>  
> This is based on [~smilegator]'s work in SPARK-12850, which was removed from 
> the code base.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23803) Support bucket pruning to optimize filtering on a bucketed column

2018-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23803:


Assignee: (was: Apache Spark)

> Support bucket pruning to optimize filtering on a bucketed column
> -
>
> Key: SPARK-23803
> URL: https://issues.apache.org/jira/browse/SPARK-23803
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Asher Saban
>Priority: Major
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Support bucket pruning when filtering on a single bucketed column with the 
> following predicates:
>  # EqualTo
>  # EqualNullSafe
>  # In
>  # (1)-(3) combined in And/Or predicates
>  
> This is based on [~smilegator]'s work in SPARK-12850, which was removed from 
> the code base.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)

2018-03-27 Thread imran shaik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415985#comment-16415985
 ] 

imran shaik edited comment on SPARK-5594 at 3/27/18 5:48 PM:
-

[~NeoYE]
Thank you for your awesome comment!!
It worked!!
Previously I was getting the same issue, *_"Caused by: 
org.apache.spark.SparkException: Failed to get broadcast_2_piece0 of 
broadcast_2"_*.
The problem was that there were multiple contexts running.
Why multiple contexts? A SparkSession created with the command below starts 
everything except a StreamingContext, and my mistake was to start a brand-new 
context for streaming.
Once I started the StreamingContext from the created SparkSession's 
SparkContext, the problem went away.

Here is the working code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val warehouseLocation = "file:${system:user.dir}/spark-warehouse"
val spark = SparkSession.builder()
  .appName("somename")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .getOrCreate()

// Build the streaming context on the existing session's SparkContext
// instead of creating a second, competing context.
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))

I haven't used the spark.cleaner.ttl parameter.







was (Author: imranshaik):
[~NeoYE]
Thank you for your awesome comment!!
It worked!!
Previously I was getting the same issue, *_"Caused by: 
org.apache.spark.SparkException: Failed to get broadcast_2_piece0 of 
broadcast_2"_*.
The problem was that there were multiple contexts running.
Why multiple contexts? A SparkSession created with the command below starts 
everything except a StreamingContext, and my mistake was to start a brand-new 
context for streaming.
Once I started the StreamingContext from the created SparkSession's 
SparkContext, the problem went away.

Here is the working code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val warehouseLocation = "file:${system:user.dir}/spark-warehouse"
val spark = SparkSession.builder()
  .appName("somename")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .getOrCreate()

// Build the streaming context on the existing session's SparkContext
// instead of creating a second, competing context.
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))






> SparkException: Failed to get broadcast (TorrentBroadcast)
> --
>
> Key: SPARK-5594
> URL: https://issues.apache.org/jira/browse/SPARK-5594
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0
>Reporter: John Sandiford
>Priority: Critical
>
> I am uncertain whether this is a bug; however, I am getting the error below 
> when running on a cluster (it works locally), and have no idea what is 
> causing it or where to look for more information.
> Any help is appreciated. Others appear to experience the same issue, but I 
> have not found any solutions online.
> Please note that this only happens with certain code and is repeatable; all 
> my other Spark jobs work fine.
> {noformat}
> ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: 
> Lost task 3.3 in stage 6.0 (TID 24, ): java.io.IOException: 
> org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of 
> broadcast_6
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 
> of broadcast_6
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> 

[jira] [Commented] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)

2018-03-27 Thread imran shaik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415985#comment-16415985
 ] 

imran shaik commented on SPARK-5594:


[~NeoYE]
Thank you for your awesome comment!!
It worked!!
Previously I was getting the same issue, *_"Caused by: 
org.apache.spark.SparkException: Failed to get broadcast_2_piece0 of 
broadcast_2"_*.
The problem was that there were multiple contexts running.
Why multiple contexts? A SparkSession created with the command below starts 
everything except a StreamingContext, and my mistake was to start a brand-new 
context for streaming.
Once I started the StreamingContext from the created SparkSession's 
SparkContext, the problem went away.

Here is the working code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val warehouseLocation = "file:${system:user.dir}/spark-warehouse"
val spark = SparkSession.builder()
  .appName("somename")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .getOrCreate()

// Build the streaming context on the existing session's SparkContext
// instead of creating a second, competing context.
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))






> SparkException: Failed to get broadcast (TorrentBroadcast)
> --
>
> Key: SPARK-5594
> URL: https://issues.apache.org/jira/browse/SPARK-5594
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0
>Reporter: John Sandiford
>Priority: Critical
>
> I am uncertain whether this is a bug; however, I am getting the error below 
> when running on a cluster (it works locally), and have no idea what is 
> causing it or where to look for more information.
> Any help is appreciated. Others appear to experience the same issue, but I 
> have not found any solutions online.
> Please note that this only happens with certain code and is repeatable; all 
> my other Spark jobs work fine.
> {noformat}
> ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: 
> Lost task 3.3 in stage 6.0 (TID 24, ): java.io.IOException: 
> org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of 
> broadcast_6
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 
> of broadcast_6
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008)
> ... 11 more
> {noformat}
> Driver stacktrace:
> {noformat}
> at 
> 

[jira] [Updated] (SPARK-23803) Support bucket pruning to optimize filtering on a bucketed column

2018-03-27 Thread Asher Saban (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asher Saban updated SPARK-23803:

Target Version/s:   (was: 2.3.1)

> Support bucket pruning to optimize filtering on a bucketed column
> -
>
> Key: SPARK-23803
> URL: https://issues.apache.org/jira/browse/SPARK-23803
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Asher Saban
>Priority: Major
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Support bucket pruning when filtering on a single bucketed column with the 
> following predicates:
>  # EqualTo
>  # EqualNullSafe
>  # In
>  # (1)-(3) combined in And/Or predicates
>  
> This is based on [~smilegator]'s work in SPARK-12850, which was removed from 
> the code base.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23803) Support bucket pruning to optimize filtering on a bucketed column

2018-03-27 Thread Asher Saban (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asher Saban updated SPARK-23803:

Fix Version/s: (was: 2.3.1)

> Support bucket pruning to optimize filtering on a bucketed column
> -
>
> Key: SPARK-23803
> URL: https://issues.apache.org/jira/browse/SPARK-23803
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Asher Saban
>Priority: Major
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Support bucket pruning when filtering on a single bucketed column with the 
> following predicates:
>  # EqualTo
>  # EqualNullSafe
>  # In
>  # (1)-(3) combined in And/Or predicates
>  
> This is based on [~smilegator]'s work in SPARK-12850, which was removed from 
> the code base.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23803) Support bucket pruning to optimize filtering on a bucketed column

2018-03-27 Thread Asher Saban (JIRA)
Asher Saban created SPARK-23803:
---

 Summary: Support bucket pruning to optimize filtering on a 
bucketed column
 Key: SPARK-23803
 URL: https://issues.apache.org/jira/browse/SPARK-23803
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Asher Saban
 Fix For: 2.3.1


Support bucket pruning when filtering on a single bucketed column with the 
following predicates:
 # EqualTo
 # EqualNullSafe
 # In
 # (1)-(3) combined in And/Or predicates

 

This is based on [~smilegator]'s work in SPARK-12850, which was removed from 
the code base.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23801) Consistent SIGSEGV after upgrading to Spark v2.3.0

2018-03-27 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415911#comment-16415911
 ] 

Marcelo Vanzin commented on SPARK-23801:


[~joshrosen], it looks like the code you added to detect memory corruption 
found something.

> Consistent SIGSEGV after upgrading to Spark v2.3.0
> --
>
> Key: SPARK-23801
> URL: https://issues.apache.org/jira/browse/SPARK-23801
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
> Environment: Mesos coarse grained executor
> 18 * r3.4xlarge (16 core boxes) with 105G of executor memory
>Reporter: Nathan Kleyn
>Priority: Major
> Attachments: spark-executor-failure.coredump.log
>
>
> After upgrading to Spark v2.3.0 from Spark v2.1.1, we are seeing consistent 
> segfaults in a large Spark job (18 * r3.4xlarge 16 core boxes with 105G of 
> executor memory). I've attached the full coredump but here is an excerpt:
> {code:java}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f1467427fdc, pid=1315, tid=0x7f1464f2d700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_161-b12) (build 
> 1.8.0_161-b12)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.161-b12 mixed mode 
> linux-amd64 )
> # Problematic frame:
> # V  [libjvm.so+0x995fdc]  oopDesc* 
> PSPromotionManager::copy_to_survivor_space(oopDesc*)+0x7c
> #
> # Core dump written. Default location: 
> /var/lib/mesos/slave/slaves/92f50385-a83b-4f36-b1a3-53d9b8716544-S203/frameworks/92f50385-a83b-4f36-b1a3-53d9b8716544-0095/executors/14/runs/2e6b3a6e-b811-47d1-9393-66301d923b98/spark-2.3.0-bin-hadoop2.7/core
>  or core.1315
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #{code}
> {code:java}
> ---  T H R E A D  ---
> Current thread (0x7f146005b000):  GCTaskThread [stack: 
> 0x7f1464e2d000,0x7f1464f2e000] [id=1363]
> siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 
> 0x
> Registers:
> RAX=0x17e907feccbc6d20, RBX=0x7ef9c035f8c8, RCX=0x7f1464f2c9f0, 
> RDX=0x
> RSP=0x7f1464f2c1a0, RBP=0x7f1464f2c210, RSI=0x0068, 
> RDI=0x7ef7bc30bda8
> R8 =0x7f1464f2c3d0, R9 =0x1741, R10=0x7f1467a52819, 
> R11=0x7f14671240e0
> R12=0x7f130912c998, R13=0x17e907feccbc6d20, R14=0x0002, 
> R15=0x000d
> RIP=0x7f1467427fdc, EFLAGS=0x00010202, CSGSFS=0x002b0033, 
> ERR=0x
>   TRAPNO=0x000d
> Top of Stack: (sp=0x7f1464f2c1a0)
> 0x7f1464f2c1a0:   7f146005b000 0001
> 0x7f1464f2c1b0:   0004 7f14600bb640
> 0x7f1464f2c1c0:   7f1464f2c210 7f14673aeed6
> 0x7f1464f2c1d0:   7f1464f2c2c0 7f1464f2c250
> 0x7f1464f2c1e0:   7f11bde31b70 7ef9c035f8c8
> 0x7f1464f2c1f0:   7ef8a80a7060 1741
> 0x7f1464f2c200:   0002 
> 0x7f1464f2c210:   7f1464f2c230 7f146742b005
> 0x7f1464f2c220:   7ef8a80a7050 1741
> 0x7f1464f2c230:   7f1464f2c2d0 7f14673ae9fb
> 0x7f1464f2c240:   7f1467a5d880 7f14673ad9a0
> 0x7f1464f2c250:   7f1464f2c9f0 7f1464f2c3d0
> 0x7f1464f2c260:   7f1464f2c3a0 7f146005b620
> 0x7f1464f2c270:   7ef8b843d7c8 00020006
> 0x7f1464f2c280:   7f1464f2c340 7f14600bb640
> 0x7f1464f2c290:   17417f1453fb9cec 7f1453fb
> 0x7f1464f2c2a0:   7f1453fb819e 7f1464f2c3a0
> 0x7f1464f2c2b0:   0001 
> 0x7f1464f2c2c0:   7f1464f2c3d0 7f1464f2c9d0
> 0x7f1464f2c2d0:   7f1464f2c340 7f1467025f22
> 0x7f1464f2c2e0:   7f145427cb5c 7f1464f2c3a0
> 0x7f1464f2c2f0:   7f1464f2c370 7f146005b000
> 0x7f1464f2c300:   7f1464f2c9f0 7ef850009800
> 0x7f1464f2c310:   7f1464f2c9f0 7f1464f2c3a0
> 0x7f1464f2c320:   7f1464f2c3d0 7f146005b000
> 0x7f1464f2c330:   7f1464f2c9f0 7ef850009800
> 0x7f1464f2c340:   7f1464f2c9c0 7f1467508191
> 0x7f1464f2c350:   7ef9c16f7890 7f1464f2c370
> 0x7f1464f2c360:   7f1464f2c9d0 
> 0x7f1464f2c370:   7ef9c035f8c0 7f145427cb5c
> 0x7f1464f2c380:   7f145427ba90 7ef9
> 0x7f1464f2c390:   0078 7ef9c035f8c0 
> Instructions: (pc=0x7f1467427fdc)
> 0x7f1467427fbc:   01 0f 85 f5 00 00 00 89 f0 c1 f8 03 41 f6 c5 01
> 0x7f1467427fcc:   4c 63 f8 0f 85 04 01 00 00 4c 89 e8 48 83 e0 fd
> 0x7f1467427fdc:   48 8b 00 48 c1 e8 03 89 c2 48 8b 05 04 74 5e 00
> 

[jira] [Updated] (SPARK-23784) Cannot use custom Aggregator with groupBy/agg

2018-03-27 Thread Joshua Howard (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Howard updated SPARK-23784:
--
Description: I have code 
[here|http://stackoverflow.com/questions/49440766/trouble-getting-spark-aggregators-to-work]
 where I am trying to use an Aggregator with both the select and agg functions. 
I cannot seem to get this to work in Spark 2.3.0. 
[Here|https://docs.cloud.databricks.com/docs/spark/1.6/examples/Dataset%20Aggregator.html]
 is a blog post that appears to be using this functionality in Spark 1.6, but 
it appears to no longer work.   (was: {{I have code 
[here|http://stackoverflow.com/questions/49440766/trouble-getting-spark-aggregators-to-work]
 where I am trying to use an Aggregator with both the `select` and `agg` 
functions. I cannot seem to get this to work in Spark 2.3.0. 
[Here|https://docs.cloud.databricks.com/docs/spark/1.6/examples/Dataset%20Aggregator.html]
 is a blog post that appears to be using this functionality in Spark 1.6, but 
it appears to no longer work. }})

> Cannot use custom Aggregator with groupBy/agg 
> --
>
> Key: SPARK-23784
> URL: https://issues.apache.org/jira/browse/SPARK-23784
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joshua Howard
>Priority: Major
>
> I have code 
> [here|http://stackoverflow.com/questions/49440766/trouble-getting-spark-aggregators-to-work]
>  where I am trying to use an Aggregator with both the select and agg 
> functions. I cannot seem to get this to work in Spark 2.3.0. 
> [Here|https://docs.cloud.databricks.com/docs/spark/1.6/examples/Dataset%20Aggregator.html]
>  is a blog post that appears to be using this functionality in Spark 1.6, but 
> it appears to no longer work. 
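> A minimal, self-contained sketch of the kind of custom Aggregator involved 
> (the names are illustrative, not taken from the linked code):
> {code:java}
> import org.apache.spark.sql.{Encoder, Encoders}
> import org.apache.spark.sql.expressions.Aggregator
> 
> object SumAgg extends Aggregator[Long, Long, Long] {
>   def zero: Long = 0L                            // identity element
>   def reduce(b: Long, a: Long): Long = b + a     // fold one input into the buffer
>   def merge(b1: Long, b2: Long): Long = b1 + b2  // combine partial buffers
>   def finish(r: Long): Long = r                  // produce the final result
>   def bufferEncoder: Encoder[Long] = Encoders.scalaLong
>   def outputEncoder: Encoder[Long] = Encoders.scalaLong
> }
> 
> // ds.select(SumAgg.toColumn) aggregates a whole Dataset[Long], while
> // ds.groupByKey(identity).agg(SumAgg.toColumn) is the grouped form at issue.
> {code}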



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23784) Cannot use custom Aggregator with groupBy/agg

2018-03-27 Thread Joshua Howard (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Howard updated SPARK-23784:
--
Description: {{I have code 
[here|http://stackoverflow.com/questions/49440766/trouble-getting-spark-aggregators-to-work]
 where I am trying to use an Aggregator with both the `select` and `agg` 
functions. I cannot seem to get this to work in Spark 2.3.0. 
[Here|https://docs.cloud.databricks.com/docs/spark/1.6/examples/Dataset%20Aggregator.html]
 is a blog post that appears to be using this functionality in Spark 1.6, but 
it appears to no longer work. }}  (was: {{I have code 
[here|[https://stackoverflow.com/questions/49440766/trouble-getting-spark-aggregators-to-work],]
 where I am trying to use an Aggregator with both the `select` and `agg` 
functions. I cannot seem to get this to work in Spark 2.3.0. 
[Here|https://docs.cloud.databricks.com/docs/spark/1.6/examples/Dataset%20Aggregator.html]
 is a blog post that appears to be using this functionality in Spark 1.6, but 
it appears to no longer work. }})

> Cannot use custom Aggregator with groupBy/agg 
> --
>
> Key: SPARK-23784
> URL: https://issues.apache.org/jira/browse/SPARK-23784
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joshua Howard
>Priority: Major
>
> {{I have code 
> [here|http://stackoverflow.com/questions/49440766/trouble-getting-spark-aggregators-to-work]
>  where I am trying to use an Aggregator with both the `select` and `agg` 
> functions. I cannot seem to get this to work in Spark 2.3.0. 
> [Here|https://docs.cloud.databricks.com/docs/spark/1.6/examples/Dataset%20Aggregator.html]
>  is a blog post that appears to be using this functionality in Spark 1.6, but 
> it appears to no longer work. }}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23802) PropagateEmptyRelation can leave query plan in unresolved state

2018-03-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415832#comment-16415832
 ] 

Apache Spark commented on SPARK-23802:
--

User 'robert3005' has created a pull request for this issue:
https://github.com/apache/spark/pull/20914

> PropagateEmptyRelation can leave query plan in unresolved state
> ---
>
> Key: SPARK-23802
> URL: https://issues.apache.org/jira/browse/SPARK-23802
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Since [https://github.com/apache/spark/pull/19825] taught 
> PropagateEmptyRelation to handle more cases, it can leave the optimized query 
> plan unresolved.
> A simple repro is to run the following through the optimizer:
> {code:java}
> LocalRelation.fromExternalRows(Seq('a.int), data = Seq(Row(1))) 
> .join(LocalRelation('a.int, 'b.int), UsingJoin(FullOuter, "a" :: Nil), 
> None){code}
> which results in
> {code:java}
> Project [coalesce(a#0, null) AS a#7, null AS b#6]
> +- LocalRelation [a#0]{code}
> This then fails the type check on the coalesce expression, since `a` and null 
> have different types.
>  
> A simple, targeted fix is to change PropagateEmptyRelation to add casts around 
> the nulls. A more comprehensive fix would be to run type coercion at the end 
> of optimization so it can repair cases like this. Alternatively, the type 
> checking code could treat NullType as equal to any other type and not fail 
> the type check in the first place.
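> A hedged sketch of the "casts around nulls" idea (assuming Catalyst's 
> expression API; this is not the submitted patch): emit a null cast to each 
> original attribute's type so the rewritten Project still resolves:
> {code:java}
> import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast, Literal, NamedExpression}
> import org.apache.spark.sql.types.NullType
> 
> // For each attribute the pruned side should have produced, emit a typed null
> // so the enclosing coalesce/Project stays type-consistent.
> def nullPadding(output: Seq[Attribute]): Seq[NamedExpression] =
>   output.map { a =>
>     Alias(Cast(Literal(null, NullType), a.dataType), a.name)(exprId = a.exprId)
>   }
> {code}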



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23802) PropagateEmptyRelation can leave query plan in unresolved state

2018-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23802:


Assignee: (was: Apache Spark)

> PropagateEmptyRelation can leave query plan in unresolved state
> ---
>
> Key: SPARK-23802
> URL: https://issues.apache.org/jira/browse/SPARK-23802
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Robert Kruszewski
>Priority: Minor
>
> Since [https://github.com/apache/spark/pull/19825] taught 
> PropagateEmptyRelation to handle more cases, it can leave the optimized query 
> plan unresolved.
> A simple repro is to run the following through the optimizer:
> {code:java}
> LocalRelation.fromExternalRows(Seq('a.int), data = Seq(Row(1))) 
> .join(LocalRelation('a.int, 'b.int), UsingJoin(FullOuter, "a" :: Nil), 
> None){code}
> which results in
> {code:java}
> Project [coalesce(a#0, null) AS a#7, null AS b#6]
> +- LocalRelation [a#0]{code}
> This then fails the type check on the coalesce expression, since `a` and null 
> have different types.
>  
> A simple, targeted fix is to change PropagateEmptyRelation to add casts around 
> the nulls. A more comprehensive fix would be to run type coercion at the end 
> of optimization so it can repair cases like this. Alternatively, the type 
> checking code could treat NullType as equal to any other type and not fail 
> the type check in the first place.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15276) CREATE TABLE with LOCATION should imply EXTERNAL

2018-03-27 Thread Volodymyr Glushak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415825#comment-16415825
 ] 

Volodymyr Glushak edited comment on SPARK-15276 at 3/27/18 3:55 PM:


[~andrewor14],

Hive does not "externalise" a table if LOCATION is specified (and I reckon 
Impala doesn't either).

Why does Apache Spark introduce different behaviour? 

 

 


was (Author: rumoku):
[~andrewor14],

Hive does not "externalise" a table if LOCATION is specified.

Why does Apache Spark introduce different behaviour? 

 

 

> CREATE TABLE with LOCATION should imply EXTERNAL
> 
>
> Key: SPARK-15276
> URL: https://issues.apache.org/jira/browse/SPARK-15276
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Major
> Fix For: 2.0.0
>
>
> If the user runs `CREATE TABLE some_table ... LOCATION /some/path`, then this 
> will still be a managed table even though the table's data is stored at 
> /some/path. The problem is that when we drop the table we'll also delete the 
> data at /some/path. This could cause problems if /some/path contains existing 
> data.
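> A short illustration of the hazard (assuming Hive support is enabled; the 
> table names and path are hypothetical):
> {code:java}
> // Managed table despite the explicit path: dropping it deletes /some/path.
> spark.sql("CREATE TABLE some_table (id INT) LOCATION '/some/path'")
> spark.sql("DROP TABLE some_table")  // the files under /some/path go away too
> 
> // The external form leaves the files in place on drop.
> spark.sql("CREATE EXTERNAL TABLE safe_table (id INT) LOCATION '/some/path'")
> {code}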



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23802) PropagateEmptyRelation can leave query plan in unresolved state

2018-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23802:


Assignee: Apache Spark

> PropagateEmptyRelation can leave query plan in unresolved state
> ---
>
> Key: SPARK-23802
> URL: https://issues.apache.org/jira/browse/SPARK-23802
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Robert Kruszewski
>Assignee: Apache Spark
>Priority: Minor
>
> Since [https://github.com/apache/spark/pull/19825] taught 
> PropagateEmptyRelation to handle more cases, it can leave the optimized query 
> plan unresolved.
> A simple repro is to run the following through the optimizer:
> {code:java}
> LocalRelation.fromExternalRows(Seq('a.int), data = Seq(Row(1))) 
> .join(LocalRelation('a.int, 'b.int), UsingJoin(FullOuter, "a" :: Nil), 
> None){code}
> which results in
> {code:java}
> Project [coalesce(a#0, null) AS a#7, null AS b#6]
> +- LocalRelation [a#0]{code}
> This then fails the type check on the coalesce expression, since `a` and null 
> have different types.
>  
> A simple, targeted fix is to change PropagateEmptyRelation to add casts around 
> the nulls. A more comprehensive fix would be to run type coercion at the end 
> of optimization so it can repair cases like this. Alternatively, the type 
> checking code could treat NullType as equal to any other type and not fail 
> the type check in the first place.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23802) PropagateEmptyRelation can leave query plan in unresolved state

2018-03-27 Thread Robert Kruszewski (JIRA)
Robert Kruszewski created SPARK-23802:
-

 Summary: PropagateEmptyRelation can leave query plan in unresolved 
state
 Key: SPARK-23802
 URL: https://issues.apache.org/jira/browse/SPARK-23802
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Robert Kruszewski


Since [https://github.com/apache/spark/pull/19825] taught 
PropagateEmptyRelation to handle more cases, it can leave the optimized query 
plan unresolved.

A simple repro is to run the following through the optimizer:
{code:java}
LocalRelation.fromExternalRows(Seq('a.int), data = Seq(Row(1))) 
.join(LocalRelation('a.int, 'b.int), UsingJoin(FullOuter, "a" :: Nil), 
None){code}
which results in
{code:java}
Project [coalesce(a#0, null) AS a#7, null AS b#6]
+- LocalRelation [a#0]{code}
This then fails the type check on the coalesce expression, since `a` and null 
have different types.

 

A simple, targeted fix is to change PropagateEmptyRelation to add casts around 
the nulls. A more comprehensive fix would be to run type coercion at the end 
of optimization so it can repair cases like this. Alternatively, the type 
checking code could treat NullType as equal to any other type and not fail 
the type check in the first place.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15276) CREATE TABLE with LOCATION should imply EXTERNAL

2018-03-27 Thread Volodymyr Glushak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415825#comment-16415825
 ] 

Volodymyr Glushak edited comment on SPARK-15276 at 3/27/18 3:51 PM:


[~andrewor14],

Hive does not "externalise" a table if LOCATION is specified.

Why does Apache Spark introduce different behaviour? 

 

 


was (Author: rumoku):
[~andrewor14],

Hive does not add the EXTERNAL keyword if LOCATION is specified.

Why does Apache Spark introduce different behaviour? 

 

 

> CREATE TABLE with LOCATION should imply EXTERNAL
> 
>
> Key: SPARK-15276
> URL: https://issues.apache.org/jira/browse/SPARK-15276
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Major
> Fix For: 2.0.0
>
>
> If the user runs `CREATE TABLE some_table ... LOCATION /some/path`, then this 
> will still be a managed table even though the table's data is stored at 
> /some/path. The problem is that when we drop the table we'll also delete the 
> data at /some/path. This could cause problems if /some/path contains existing 
> data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15276) CREATE TABLE with LOCATION should imply EXTERNAL

2018-03-27 Thread Volodymyr Glushak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415825#comment-16415825
 ] 

Volodymyr Glushak commented on SPARK-15276:
---

[~andrewor14],

Hive does not add the EXTERNAL keyword if LOCATION is specified.

Why does Apache Spark introduce different behaviour? 

 

 

> CREATE TABLE with LOCATION should imply EXTERNAL
> 
>
> Key: SPARK-15276
> URL: https://issues.apache.org/jira/browse/SPARK-15276
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Major
> Fix For: 2.0.0
>
>
> If the user runs `CREATE TABLE some_table ... LOCATION /some/path`, then this 
> will still be a managed table even though the table's data is stored at 
> /some/path. The problem is that when we drop the table we'll also delete the 
> data /some/path. This could cause problems if /some/path contains existing 
> data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23801) Consistent SIGSEGV after upgrading to Spark v2.3.0

2018-03-27 Thread Nathan Kleyn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Kleyn updated SPARK-23801:
-
Description: 
After upgrading to Spark v2.3.0 from Spark v2.1.1, we are seeing consistent 
segfaults in a large Spark job (18 * r3.4xlarge 16 core boxes with 105G of 
executor memory). I've attached the full coredump but here is an excerpt:
{code:java}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7f1467427fdc, pid=1315, tid=0x7f1464f2d700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_161-b12) (build 
1.8.0_161-b12)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.161-b12 mixed mode linux-amd64 
)
# Problematic frame:
# V  [libjvm.so+0x995fdc]  oopDesc* 
PSPromotionManager::copy_to_survivor_space(oopDesc*)+0x7c
#
# Core dump written. Default location: 
/var/lib/mesos/slave/slaves/92f50385-a83b-4f36-b1a3-53d9b8716544-S203/frameworks/92f50385-a83b-4f36-b1a3-53d9b8716544-0095/executors/14/runs/2e6b3a6e-b811-47d1-9393-66301d923b98/spark-2.3.0-bin-hadoop2.7/core
 or core.1315
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#{code}
{code:java}
---  T H R E A D  ---

Current thread (0x7f146005b000):  GCTaskThread [stack: 
0x7f1464e2d000,0x7f1464f2e000] [id=1363]

siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 
0x

Registers:
RAX=0x17e907feccbc6d20, RBX=0x7ef9c035f8c8, RCX=0x7f1464f2c9f0, 
RDX=0x
RSP=0x7f1464f2c1a0, RBP=0x7f1464f2c210, RSI=0x0068, 
RDI=0x7ef7bc30bda8
R8 =0x7f1464f2c3d0, R9 =0x1741, R10=0x7f1467a52819, 
R11=0x7f14671240e0
R12=0x7f130912c998, R13=0x17e907feccbc6d20, R14=0x0002, 
R15=0x000d
RIP=0x7f1467427fdc, EFLAGS=0x00010202, CSGSFS=0x002b0033, 
ERR=0x
  TRAPNO=0x000d

Top of Stack: (sp=0x7f1464f2c1a0)
0x7f1464f2c1a0:   7f146005b000 0001
0x7f1464f2c1b0:   0004 7f14600bb640
0x7f1464f2c1c0:   7f1464f2c210 7f14673aeed6
0x7f1464f2c1d0:   7f1464f2c2c0 7f1464f2c250
0x7f1464f2c1e0:   7f11bde31b70 7ef9c035f8c8
0x7f1464f2c1f0:   7ef8a80a7060 1741
0x7f1464f2c200:   0002 
0x7f1464f2c210:   7f1464f2c230 7f146742b005
0x7f1464f2c220:   7ef8a80a7050 1741
0x7f1464f2c230:   7f1464f2c2d0 7f14673ae9fb
0x7f1464f2c240:   7f1467a5d880 7f14673ad9a0
0x7f1464f2c250:   7f1464f2c9f0 7f1464f2c3d0
0x7f1464f2c260:   7f1464f2c3a0 7f146005b620
0x7f1464f2c270:   7ef8b843d7c8 00020006
0x7f1464f2c280:   7f1464f2c340 7f14600bb640
0x7f1464f2c290:   17417f1453fb9cec 7f1453fb
0x7f1464f2c2a0:   7f1453fb819e 7f1464f2c3a0
0x7f1464f2c2b0:   0001 
0x7f1464f2c2c0:   7f1464f2c3d0 7f1464f2c9d0
0x7f1464f2c2d0:   7f1464f2c340 7f1467025f22
0x7f1464f2c2e0:   7f145427cb5c 7f1464f2c3a0
0x7f1464f2c2f0:   7f1464f2c370 7f146005b000
0x7f1464f2c300:   7f1464f2c9f0 7ef850009800
0x7f1464f2c310:   7f1464f2c9f0 7f1464f2c3a0
0x7f1464f2c320:   7f1464f2c3d0 7f146005b000
0x7f1464f2c330:   7f1464f2c9f0 7ef850009800
0x7f1464f2c340:   7f1464f2c9c0 7f1467508191
0x7f1464f2c350:   7ef9c16f7890 7f1464f2c370
0x7f1464f2c360:   7f1464f2c9d0 
0x7f1464f2c370:   7ef9c035f8c0 7f145427cb5c
0x7f1464f2c380:   7f145427ba90 7ef9
0x7f1464f2c390:   0078 7ef9c035f8c0 

Instructions: (pc=0x7f1467427fdc)
0x7f1467427fbc:   01 0f 85 f5 00 00 00 89 f0 c1 f8 03 41 f6 c5 01
0x7f1467427fcc:   4c 63 f8 0f 85 04 01 00 00 4c 89 e8 48 83 e0 fd
0x7f1467427fdc:   48 8b 00 48 c1 e8 03 89 c2 48 8b 05 04 74 5e 00
0x7f1467427fec:   83 e2 0f 3b 10 0f 82 fd 00 00 00 48 8b 45 a8 4e 

Register to memory mapping:

RAX=0x17e907feccbc6d20 is an unknown value
RBX=0x7ef9c035f8c8 is pointing into the stack for thread: 0x7ef850009800
RCX=0x7f1464f2c9f0 is an unknown value
RDX=0x is an unknown value
RSP=0x7f1464f2c1a0 is an unknown value
RBP=0x7f1464f2c210 is an unknown value
RSI=0x0068 is an unknown value
RDI=0x7ef7bc30bda8 is pointing into metadata
R8 =0x7f1464f2c3d0 is an unknown value
R9 =0x1741 is an unknown value
R10=0x7f1467a52819:  in 
/usr/lib/jvm/java-8-oracle/jre/lib/amd64/server/libjvm.so at 0x7f1466a92000
R11=0x7f14671240e0:  in 
/usr/lib/jvm/java-8-oracle/jre/lib/amd64/server/libjvm.so at 0x7f1466a92000
R12=0x7f130912c998 is an oop

[jira] [Updated] (SPARK-23801) Consistent SIGSEGV after upgrading to Spark v2.3.0

2018-03-27 Thread Nathan Kleyn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Kleyn updated SPARK-23801:
-
Environment: 
Mesos coarse grained executor
18 * r3.4xlarge (16 core boxes) with 105G of executor memory

  was:18 * r3.4xlarge (16 core boxes) with 105G of executor memory


> Consistent SIGSEGV after upgrading to Spark v2.3.0
> --
>
> Key: SPARK-23801
> URL: https://issues.apache.org/jira/browse/SPARK-23801
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
> Environment: Mesos coarse grained executor
> 18 * r3.4xlarge (16 core boxes) with 105G of executor memory
>Reporter: Nathan Kleyn
>Priority: Major
> Attachments: spark-executor-failure.coredump.log
>
>
> After upgrading to Spark v2.3.0 from Spark v2.1.1, we are seeing consistent 
> segfaults in a large Spark job (18 * r3.4xlarge 16 core boxes with 105G of 
> executor memory). I've attached the full coredump but here is an excerpt:
> {code:java}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f1467427fdc, pid=1315, tid=0x7f1464f2d700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_161-b12) (build 
> 1.8.0_161-b12)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.161-b12 mixed mode 
> linux-amd64 )
> # Problematic frame:
> # V  [libjvm.so+0x995fdc]  oopDesc* 
> PSPromotionManager::copy_to_survivor_space(oopDesc*)+0x7c
> #
> # Core dump written. Default location: 
> /var/lib/mesos/slave/slaves/92f50385-a83b-4f36-b1a3-53d9b8716544-S203/frameworks/92f50385-a83b-4f36-b1a3-53d9b8716544-0095/executors/14/runs/2e6b3a6e-b811-47d1-9393-66301d923b98/spark-2.3.0-bin-hadoop2.7/core
>  or core.1315
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #{code}
> {code:java}
> ---  T H R E A D  ---
> Current thread (0x7f146005b000):  GCTaskThread [stack: 
> 0x7f1464e2d000,0x7f1464f2e000] [id=1363]
> siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 
> 0x
> Registers:
> RAX=0x17e907feccbc6d20, RBX=0x7ef9c035f8c8, RCX=0x7f1464f2c9f0, 
> RDX=0x
> RSP=0x7f1464f2c1a0, RBP=0x7f1464f2c210, RSI=0x0068, 
> RDI=0x7ef7bc30bda8
> R8 =0x7f1464f2c3d0, R9 =0x1741, R10=0x7f1467a52819, 
> R11=0x7f14671240e0
> R12=0x7f130912c998, R13=0x17e907feccbc6d20, R14=0x0002, 
> R15=0x000d
> RIP=0x7f1467427fdc, EFLAGS=0x00010202, CSGSFS=0x002b0033, 
> ERR=0x
>   TRAPNO=0x000d
> Top of Stack: (sp=0x7f1464f2c1a0)
> 0x7f1464f2c1a0:   7f146005b000 0001
> 0x7f1464f2c1b0:   0004 7f14600bb640
> 0x7f1464f2c1c0:   7f1464f2c210 7f14673aeed6
> 0x7f1464f2c1d0:   7f1464f2c2c0 7f1464f2c250
> 0x7f1464f2c1e0:   7f11bde31b70 7ef9c035f8c8
> 0x7f1464f2c1f0:   7ef8a80a7060 1741
> 0x7f1464f2c200:   0002 
> 0x7f1464f2c210:   7f1464f2c230 7f146742b005
> 0x7f1464f2c220:   7ef8a80a7050 1741
> 0x7f1464f2c230:   7f1464f2c2d0 7f14673ae9fb
> 0x7f1464f2c240:   7f1467a5d880 7f14673ad9a0
> 0x7f1464f2c250:   7f1464f2c9f0 7f1464f2c3d0
> 0x7f1464f2c260:   7f1464f2c3a0 7f146005b620
> 0x7f1464f2c270:   7ef8b843d7c8 00020006
> 0x7f1464f2c280:   7f1464f2c340 7f14600bb640
> 0x7f1464f2c290:   17417f1453fb9cec 7f1453fb
> 0x7f1464f2c2a0:   7f1453fb819e 7f1464f2c3a0
> 0x7f1464f2c2b0:   0001 
> 0x7f1464f2c2c0:   7f1464f2c3d0 7f1464f2c9d0
> 0x7f1464f2c2d0:   7f1464f2c340 7f1467025f22
> 0x7f1464f2c2e0:   7f145427cb5c 7f1464f2c3a0
> 0x7f1464f2c2f0:   7f1464f2c370 7f146005b000
> 0x7f1464f2c300:   7f1464f2c9f0 7ef850009800
> 0x7f1464f2c310:   7f1464f2c9f0 7f1464f2c3a0
> 0x7f1464f2c320:   7f1464f2c3d0 7f146005b000
> 0x7f1464f2c330:   7f1464f2c9f0 7ef850009800
> 0x7f1464f2c340:   7f1464f2c9c0 7f1467508191
> 0x7f1464f2c350:   7ef9c16f7890 7f1464f2c370
> 0x7f1464f2c360:   7f1464f2c9d0 
> 0x7f1464f2c370:   7ef9c035f8c0 7f145427cb5c
> 0x7f1464f2c380:   7f145427ba90 7ef9
> 0x7f1464f2c390:   0078 7ef9c035f8c0 
> Instructions: (pc=0x7f1467427fdc)
> 0x7f1467427fbc:   01 0f 85 f5 00 00 00 89 f0 c1 f8 03 41 f6 c5 01
> 0x7f1467427fcc:   4c 63 f8 0f 85 04 01 00 00 4c 89 e8 48 83 e0 fd
> 0x7f1467427fdc:   48 8b 00 48 c1 

[jira] [Updated] (SPARK-23801) Consistent SIGSEGV after upgrading to Spark v2.3.0

2018-03-27 Thread Nathan Kleyn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Kleyn updated SPARK-23801:
-
Environment: 18 * r3.4xlarge (16 core boxes) with 105G of executor memory

> Consistent SIGSEGV after upgrading to Spark v2.3.0
> --
>
> Key: SPARK-23801
> URL: https://issues.apache.org/jira/browse/SPARK-23801
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
> Environment: 18 * r3.4xlarge (16 core boxes) with 105G of executor 
> memory
>Reporter: Nathan Kleyn
>Priority: Major
> Attachments: spark-executor-failure.coredump.log
>
>
> After upgrading to Spark v2.3.0 from Spark v2.1.1, we are seeing consistent 
> segfaults in a large Spark job (18 * r3.4xlarge 16 core boxes with 105G of 
> executor memory). I've attached the full coredump but here is an excerpt:
> {code:java}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f1467427fdc, pid=1315, tid=0x7f1464f2d700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_161-b12) (build 
> 1.8.0_161-b12)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.161-b12 mixed mode 
> linux-amd64 )
> # Problematic frame:
> # V  [libjvm.so+0x995fdc]  oopDesc* 
> PSPromotionManager::copy_to_survivor_space(oopDesc*)+0x7c
> #
> # Core dump written. Default location: 
> /var/lib/mesos/slave/slaves/92f50385-a83b-4f36-b1a3-53d9b8716544-S203/frameworks/92f50385-a83b-4f36-b1a3-53d9b8716544-0095/executors/14/runs/2e6b3a6e-b811-47d1-9393-66301d923b98/spark-2.3.0-bin-hadoop2.7/core
>  or core.1315
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #{code}
> {code:java}
> ---  T H R E A D  ---
> Current thread (0x7f146005b000):  GCTaskThread [stack: 
> 0x7f1464e2d000,0x7f1464f2e000] [id=1363]
> siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 
> 0x
> Registers:
> RAX=0x17e907feccbc6d20, RBX=0x7ef9c035f8c8, RCX=0x7f1464f2c9f0, 
> RDX=0x
> RSP=0x7f1464f2c1a0, RBP=0x7f1464f2c210, RSI=0x0068, 
> RDI=0x7ef7bc30bda8
> R8 =0x7f1464f2c3d0, R9 =0x1741, R10=0x7f1467a52819, 
> R11=0x7f14671240e0
> R12=0x7f130912c998, R13=0x17e907feccbc6d20, R14=0x0002, 
> R15=0x000d
> RIP=0x7f1467427fdc, EFLAGS=0x00010202, CSGSFS=0x002b0033, 
> ERR=0x
>   TRAPNO=0x000d
> Top of Stack: (sp=0x7f1464f2c1a0)
> 0x7f1464f2c1a0:   7f146005b000 0001
> 0x7f1464f2c1b0:   0004 7f14600bb640
> 0x7f1464f2c1c0:   7f1464f2c210 7f14673aeed6
> 0x7f1464f2c1d0:   7f1464f2c2c0 7f1464f2c250
> 0x7f1464f2c1e0:   7f11bde31b70 7ef9c035f8c8
> 0x7f1464f2c1f0:   7ef8a80a7060 1741
> 0x7f1464f2c200:   0002 
> 0x7f1464f2c210:   7f1464f2c230 7f146742b005
> 0x7f1464f2c220:   7ef8a80a7050 1741
> 0x7f1464f2c230:   7f1464f2c2d0 7f14673ae9fb
> 0x7f1464f2c240:   7f1467a5d880 7f14673ad9a0
> 0x7f1464f2c250:   7f1464f2c9f0 7f1464f2c3d0
> 0x7f1464f2c260:   7f1464f2c3a0 7f146005b620
> 0x7f1464f2c270:   7ef8b843d7c8 00020006
> 0x7f1464f2c280:   7f1464f2c340 7f14600bb640
> 0x7f1464f2c290:   17417f1453fb9cec 7f1453fb
> 0x7f1464f2c2a0:   7f1453fb819e 7f1464f2c3a0
> 0x7f1464f2c2b0:   0001 
> 0x7f1464f2c2c0:   7f1464f2c3d0 7f1464f2c9d0
> 0x7f1464f2c2d0:   7f1464f2c340 7f1467025f22
> 0x7f1464f2c2e0:   7f145427cb5c 7f1464f2c3a0
> 0x7f1464f2c2f0:   7f1464f2c370 7f146005b000
> 0x7f1464f2c300:   7f1464f2c9f0 7ef850009800
> 0x7f1464f2c310:   7f1464f2c9f0 7f1464f2c3a0
> 0x7f1464f2c320:   7f1464f2c3d0 7f146005b000
> 0x7f1464f2c330:   7f1464f2c9f0 7ef850009800
> 0x7f1464f2c340:   7f1464f2c9c0 7f1467508191
> 0x7f1464f2c350:   7ef9c16f7890 7f1464f2c370
> 0x7f1464f2c360:   7f1464f2c9d0 
> 0x7f1464f2c370:   7ef9c035f8c0 7f145427cb5c
> 0x7f1464f2c380:   7f145427ba90 7ef9
> 0x7f1464f2c390:   0078 7ef9c035f8c0 
> Instructions: (pc=0x7f1467427fdc)
> 0x7f1467427fbc:   01 0f 85 f5 00 00 00 89 f0 c1 f8 03 41 f6 c5 01
> 0x7f1467427fcc:   4c 63 f8 0f 85 04 01 00 00 4c 89 e8 48 83 e0 fd
> 0x7f1467427fdc:   48 8b 00 48 c1 e8 03 89 c2 48 8b 05 04 74 5e 00
> 0x7f1467427fec:   83 e2 0f 3b 10 0f 82 fd 00 00 00 48 8b 45 a8 4e 
> Register to memory 

[jira] [Created] (SPARK-23801) Consistent SIGSEGV after upgrading to Spark v2.3.0

2018-03-27 Thread Nathan Kleyn (JIRA)
Nathan Kleyn created SPARK-23801:


 Summary: Consistent SIGSEGV after upgrading to Spark v2.3.0
 Key: SPARK-23801
 URL: https://issues.apache.org/jira/browse/SPARK-23801
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 2.3.0
Reporter: Nathan Kleyn
 Attachments: spark-executor-failure.coredump.log

After upgrading to Spark v2.3.0 from Spark v2.1.1, we are seeing consistent 
segfaults in a large Spark job (18 * r3.4xlarge 16 core boxes with 105G of 
executor memory). I've attached the full coredump but here is an excerpt:


{code:java}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7f1467427fdc, pid=1315, tid=0x7f1464f2d700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_161-b12) (build 
1.8.0_161-b12)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.161-b12 mixed mode linux-amd64 
)
# Problematic frame:
# V  [libjvm.so+0x995fdc]  oopDesc* 
PSPromotionManager::copy_to_survivor_space(oopDesc*)+0x7c
#
# Core dump written. Default location: 
/var/lib/mesos/slave/slaves/92f50385-a83b-4f36-b1a3-53d9b8716544-S203/frameworks/92f50385-a83b-4f36-b1a3-53d9b8716544-0095/executors/14/runs/2e6b3a6e-b811-47d1-9393-66301d923b98/spark-2.3.0-bin-hadoop2.7/core
 or core.1315
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#{code}
{code:java}
---  T H R E A D  ---

Current thread (0x7f146005b000):  GCTaskThread [stack: 
0x7f1464e2d000,0x7f1464f2e000] [id=1363]

siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 
0x

Registers:
RAX=0x17e907feccbc6d20, RBX=0x7ef9c035f8c8, RCX=0x7f1464f2c9f0, 
RDX=0x
RSP=0x7f1464f2c1a0, RBP=0x7f1464f2c210, RSI=0x0068, 
RDI=0x7ef7bc30bda8
R8 =0x7f1464f2c3d0, R9 =0x1741, R10=0x7f1467a52819, 
R11=0x7f14671240e0
R12=0x7f130912c998, R13=0x17e907feccbc6d20, R14=0x0002, 
R15=0x000d
RIP=0x7f1467427fdc, EFLAGS=0x00010202, CSGSFS=0x002b0033, 
ERR=0x
  TRAPNO=0x000d

Top of Stack: (sp=0x7f1464f2c1a0)
0x7f1464f2c1a0:   7f146005b000 0001
0x7f1464f2c1b0:   0004 7f14600bb640
0x7f1464f2c1c0:   7f1464f2c210 7f14673aeed6
0x7f1464f2c1d0:   7f1464f2c2c0 7f1464f2c250
0x7f1464f2c1e0:   7f11bde31b70 7ef9c035f8c8
0x7f1464f2c1f0:   7ef8a80a7060 1741
0x7f1464f2c200:   0002 
0x7f1464f2c210:   7f1464f2c230 7f146742b005
0x7f1464f2c220:   7ef8a80a7050 1741
0x7f1464f2c230:   7f1464f2c2d0 7f14673ae9fb
0x7f1464f2c240:   7f1467a5d880 7f14673ad9a0
0x7f1464f2c250:   7f1464f2c9f0 7f1464f2c3d0
0x7f1464f2c260:   7f1464f2c3a0 7f146005b620
0x7f1464f2c270:   7ef8b843d7c8 00020006
0x7f1464f2c280:   7f1464f2c340 7f14600bb640
0x7f1464f2c290:   17417f1453fb9cec 7f1453fb
0x7f1464f2c2a0:   7f1453fb819e 7f1464f2c3a0
0x7f1464f2c2b0:   0001 
0x7f1464f2c2c0:   7f1464f2c3d0 7f1464f2c9d0
0x7f1464f2c2d0:   7f1464f2c340 7f1467025f22
0x7f1464f2c2e0:   7f145427cb5c 7f1464f2c3a0
0x7f1464f2c2f0:   7f1464f2c370 7f146005b000
0x7f1464f2c300:   7f1464f2c9f0 7ef850009800
0x7f1464f2c310:   7f1464f2c9f0 7f1464f2c3a0
0x7f1464f2c320:   7f1464f2c3d0 7f146005b000
0x7f1464f2c330:   7f1464f2c9f0 7ef850009800
0x7f1464f2c340:   7f1464f2c9c0 7f1467508191
0x7f1464f2c350:   7ef9c16f7890 7f1464f2c370
0x7f1464f2c360:   7f1464f2c9d0 
0x7f1464f2c370:   7ef9c035f8c0 7f145427cb5c
0x7f1464f2c380:   7f145427ba90 7ef9
0x7f1464f2c390:   0078 7ef9c035f8c0 

Instructions: (pc=0x7f1467427fdc)
0x7f1467427fbc:   01 0f 85 f5 00 00 00 89 f0 c1 f8 03 41 f6 c5 01
0x7f1467427fcc:   4c 63 f8 0f 85 04 01 00 00 4c 89 e8 48 83 e0 fd
0x7f1467427fdc:   48 8b 00 48 c1 e8 03 89 c2 48 8b 05 04 74 5e 00
0x7f1467427fec:   83 e2 0f 3b 10 0f 82 fd 00 00 00 48 8b 45 a8 4e 

Register to memory mapping:

RAX=0x17e907feccbc6d20 is an unknown value
RBX=0x7ef9c035f8c8 is pointing into the stack for thread: 0x7ef850009800
RCX=0x7f1464f2c9f0 is an unknown value
RDX=0x is an unknown value
RSP=0x7f1464f2c1a0 is an unknown value
RBP=0x7f1464f2c210 is an unknown value
RSI=0x0068 is an unknown value
RDI=0x7ef7bc30bda8 is pointing into metadata
R8 =0x7f1464f2c3d0 is an unknown value
R9 =0x1741 is an unknown value
R10=0x7f1467a52819:  in 

[jira] [Updated] (SPARK-23801) Consistent SIGSEGV after upgrading to Spark v2.3.0

2018-03-27 Thread Nathan Kleyn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Kleyn updated SPARK-23801:
-
Attachment: spark-executor-failure.coredump.log

> Consistent SIGSEGV after upgrading to Spark v2.3.0
> --
>
> Key: SPARK-23801
> URL: https://issues.apache.org/jira/browse/SPARK-23801
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Nathan Kleyn
>Priority: Major
> Attachments: spark-executor-failure.coredump.log
>
>
> After upgrading to Spark v2.3.0 from Spark v2.1.1, we are seeing consistent 
> segfaults in a large Spark job (18 * r3.4xlarge 16 core boxes with 105G of 
> executor memory). I've attached the full coredump but here is an excerpt:
> {code:java}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f1467427fdc, pid=1315, tid=0x7f1464f2d700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_161-b12) (build 
> 1.8.0_161-b12)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.161-b12 mixed mode 
> linux-amd64 )
> # Problematic frame:
> # V  [libjvm.so+0x995fdc]  oopDesc* 
> PSPromotionManager::copy_to_survivor_space(oopDesc*)+0x7c
> #
> # Core dump written. Default location: 
> /var/lib/mesos/slave/slaves/92f50385-a83b-4f36-b1a3-53d9b8716544-S203/frameworks/92f50385-a83b-4f36-b1a3-53d9b8716544-0095/executors/14/runs/2e6b3a6e-b811-47d1-9393-66301d923b98/spark-2.3.0-bin-hadoop2.7/core
>  or core.1315
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #{code}
> {code:java}
> ---  T H R E A D  ---
> Current thread (0x7f146005b000):  GCTaskThread [stack: 
> 0x7f1464e2d000,0x7f1464f2e000] [id=1363]
> siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 
> 0x
> Registers:
> RAX=0x17e907feccbc6d20, RBX=0x7ef9c035f8c8, RCX=0x7f1464f2c9f0, 
> RDX=0x
> RSP=0x7f1464f2c1a0, RBP=0x7f1464f2c210, RSI=0x0068, 
> RDI=0x7ef7bc30bda8
> R8 =0x7f1464f2c3d0, R9 =0x1741, R10=0x7f1467a52819, 
> R11=0x7f14671240e0
> R12=0x7f130912c998, R13=0x17e907feccbc6d20, R14=0x0002, 
> R15=0x000d
> RIP=0x7f1467427fdc, EFLAGS=0x00010202, CSGSFS=0x002b0033, 
> ERR=0x
>   TRAPNO=0x000d
> Top of Stack: (sp=0x7f1464f2c1a0)
> 0x7f1464f2c1a0:   7f146005b000 0001
> 0x7f1464f2c1b0:   0004 7f14600bb640
> 0x7f1464f2c1c0:   7f1464f2c210 7f14673aeed6
> 0x7f1464f2c1d0:   7f1464f2c2c0 7f1464f2c250
> 0x7f1464f2c1e0:   7f11bde31b70 7ef9c035f8c8
> 0x7f1464f2c1f0:   7ef8a80a7060 1741
> 0x7f1464f2c200:   0002 
> 0x7f1464f2c210:   7f1464f2c230 7f146742b005
> 0x7f1464f2c220:   7ef8a80a7050 1741
> 0x7f1464f2c230:   7f1464f2c2d0 7f14673ae9fb
> 0x7f1464f2c240:   7f1467a5d880 7f14673ad9a0
> 0x7f1464f2c250:   7f1464f2c9f0 7f1464f2c3d0
> 0x7f1464f2c260:   7f1464f2c3a0 7f146005b620
> 0x7f1464f2c270:   7ef8b843d7c8 00020006
> 0x7f1464f2c280:   7f1464f2c340 7f14600bb640
> 0x7f1464f2c290:   17417f1453fb9cec 7f1453fb
> 0x7f1464f2c2a0:   7f1453fb819e 7f1464f2c3a0
> 0x7f1464f2c2b0:   0001 
> 0x7f1464f2c2c0:   7f1464f2c3d0 7f1464f2c9d0
> 0x7f1464f2c2d0:   7f1464f2c340 7f1467025f22
> 0x7f1464f2c2e0:   7f145427cb5c 7f1464f2c3a0
> 0x7f1464f2c2f0:   7f1464f2c370 7f146005b000
> 0x7f1464f2c300:   7f1464f2c9f0 7ef850009800
> 0x7f1464f2c310:   7f1464f2c9f0 7f1464f2c3a0
> 0x7f1464f2c320:   7f1464f2c3d0 7f146005b000
> 0x7f1464f2c330:   7f1464f2c9f0 7ef850009800
> 0x7f1464f2c340:   7f1464f2c9c0 7f1467508191
> 0x7f1464f2c350:   7ef9c16f7890 7f1464f2c370
> 0x7f1464f2c360:   7f1464f2c9d0 
> 0x7f1464f2c370:   7ef9c035f8c0 7f145427cb5c
> 0x7f1464f2c380:   7f145427ba90 7ef9
> 0x7f1464f2c390:   0078 7ef9c035f8c0 
> Instructions: (pc=0x7f1467427fdc)
> 0x7f1467427fbc:   01 0f 85 f5 00 00 00 89 f0 c1 f8 03 41 f6 c5 01
> 0x7f1467427fcc:   4c 63 f8 0f 85 04 01 00 00 4c 89 e8 48 83 e0 fd
> 0x7f1467427fdc:   48 8b 00 48 c1 e8 03 89 c2 48 8b 05 04 74 5e 00
> 0x7f1467427fec:   83 e2 0f 3b 10 0f 82 fd 00 00 00 48 8b 45 a8 4e 
> Register to memory mapping:
> RAX=0x17e907feccbc6d20 is an unknown value
> RBX=0x7ef9c035f8c8 is pointing into the stack for 

[jira] [Updated] (SPARK-23799) [CBO] FilterEstimation.evaluateInSet produces division by zero in a case of empty table with analyzed statistics

2018-03-27 Thread Michael Shtelma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Shtelma updated SPARK-23799:

Description: 
Spark 2.2.1 and 2.3.0 can produce a NumberFormatException (see below) during the 
analysis of queries that use previously analyzed Hive tables. 

The NumberFormatException occurs because in [FilterEstimation.scala on lines 50 
and 
52|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala?utf8=%E2%9C%93#L50-L52]
 the method calculateFilterSelectivity returns NaN, which is caused by division 
by zero. This leads to a NumberFormatException during the conversion from Double 
to BigDecimal. 

The NaN comes from a division by zero in the evaluateInSet method. 

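To make the failure mode concrete, here is a minimal standalone Scala sketch (illustrative only, not the actual Spark code) of how an empty table turns the selectivity into NaN and then into this exception:

{code:java}
// Illustrative sketch: selectivity is roughly (matched rows) / (total rows).
// For an empty table with analyzed statistics, both counts are 0.
val matchedRows = 0.0
val totalRows = 0.0
val selectivity = matchedRows / totalRows // 0.0 / 0.0 == Double.NaN

// Converting NaN from Double to BigDecimal throws, which is the
// java.lang.NumberFormatException reported below.
val percent: BigDecimal = BigDecimal.decimal(selectivity)
{code}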
Exception:

java.lang.NumberFormatException

at java.math.BigDecimal.<init>(BigDecimal.java:494)

at java.math.BigDecimal.<init>(BigDecimal.java:824)

at scala.math.BigDecimal$.decimal(BigDecimal.scala:52)

at scala.math.BigDecimal$.decimal(BigDecimal.scala:55)

at scala.math.BigDecimal$.double2bigDecimal(BigDecimal.scala:343)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.FilterEstimation.estimate(FilterEstimation.scala:52)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitFilter(BasicStatsPlanVisitor.scala:43)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitFilter(BasicStatsPlanVisitor.scala:25)

at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor$class.visit(LogicalPlanVisitor.scala:30)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visit(BasicStatsPlanVisitor.scala:25)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:35)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:33)

at scala.Option.getOrElse(Option.scala:121)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$class.stats(LogicalPlanStats.scala:33)

at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$$anonfun$rowCountsExist$1.apply(EstimationUtils.scala:32)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$$anonfun$rowCountsExist$1.apply(EstimationUtils.scala:32)

at 
scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)

at 
scala.collection.IndexedSeqOptimized$class.forall(IndexedSeqOptimized.scala:43)

at scala.collection.mutable.WrappedArray.forall(WrappedArray.scala:35)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$.rowCountsExist(EstimationUtils.scala:32)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.ProjectEstimation$.estimate(ProjectEstimation.scala:27)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitProject(BasicStatsPlanVisitor.scala:63)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitProject(BasicStatsPlanVisitor.scala:25)

at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor$class.visit(LogicalPlanVisitor.scala:37)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visit(BasicStatsPlanVisitor.scala:25)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:35)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:33)

at scala.Option.getOrElse(Option.scala:121)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$class.stats(LogicalPlanStats.scala:33)

at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30)

at 
org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$2.apply(CostBasedJoinReorder.scala:64)

at 
org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$2.apply(CostBasedJoinReorder.scala:64)

at scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:83)

at scala.collection.immutable.List.forall(List.scala:84)

at 
org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.org$apache$spark$sql$catalyst$optimizer$CostBasedJoinReorder$$reorder(CostBasedJoinReorder.scala:64)

at 
org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:46)

at 
org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:43)

at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)

at 

[jira] [Updated] (SPARK-23799) [CBO] FilterEstimation.evaluateInSet produces division by zero in a case of empty table with analyzed statistics

2018-03-27 Thread Michael Shtelma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Shtelma updated SPARK-23799:

Description: 
Spark 2.2.1 and 2.3.0 can produce a NumberFormatException (see below) during the 
analysis of queries if CBO is activated and the Hive tables involved have been 
analyzed. 

The NumberFormatException occurs because in [FilterEstimation.scala on lines 50 
and 
52|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala?utf8=✓#L50-L52]
 the method calculateFilterSelectivity returns NaN, which is caused by division 
by zero. This leads to a NumberFormatException during the conversion from Double 
to BigDecimal. 

The NaN comes from a division by zero in the evaluateInSet method. 

Exception:

java.lang.NumberFormatException

at java.math.BigDecimal.<init>(BigDecimal.java:494)

at java.math.BigDecimal.<init>(BigDecimal.java:824)

at scala.math.BigDecimal$.decimal(BigDecimal.scala:52)

at scala.math.BigDecimal$.decimal(BigDecimal.scala:55)

at scala.math.BigDecimal$.double2bigDecimal(BigDecimal.scala:343)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.FilterEstimation.estimate(FilterEstimation.scala:52)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitFilter(BasicStatsPlanVisitor.scala:43)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitFilter(BasicStatsPlanVisitor.scala:25)

at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor$class.visit(LogicalPlanVisitor.scala:30)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visit(BasicStatsPlanVisitor.scala:25)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:35)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:33)

at scala.Option.getOrElse(Option.scala:121)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$class.stats(LogicalPlanStats.scala:33)

at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$$anonfun$rowCountsExist$1.apply(EstimationUtils.scala:32)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$$anonfun$rowCountsExist$1.apply(EstimationUtils.scala:32)

at 
scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)

at 
scala.collection.IndexedSeqOptimized$class.forall(IndexedSeqOptimized.scala:43)

at scala.collection.mutable.WrappedArray.forall(WrappedArray.scala:35)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$.rowCountsExist(EstimationUtils.scala:32)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.ProjectEstimation$.estimate(ProjectEstimation.scala:27)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitProject(BasicStatsPlanVisitor.scala:63)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitProject(BasicStatsPlanVisitor.scala:25)

at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor$class.visit(LogicalPlanVisitor.scala:37)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visit(BasicStatsPlanVisitor.scala:25)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:35)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:33)

at scala.Option.getOrElse(Option.scala:121)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$class.stats(LogicalPlanStats.scala:33)

at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30)

at 
org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$2.apply(CostBasedJoinReorder.scala:64)

at 
org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$2.apply(CostBasedJoinReorder.scala:64)

at scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:83)

at scala.collection.immutable.List.forall(List.scala:84)

at 
org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.org$apache$spark$sql$catalyst$optimizer$CostBasedJoinReorder$$reorder(CostBasedJoinReorder.scala:64)

at 
org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:46)

at 
org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:43)

at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)


[jira] [Updated] (SPARK-23800) Support partial function and callable object with pandas UDF

2018-03-27 Thread Li Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-23800:
---
Description: 
Per discussion here: 
[https://github.com/apache/spark/pull/20900#issuecomment-376195597]

 

> Support partial function and callable object with pandas UDF
> 
>
> Key: SPARK-23800
> URL: https://issues.apache.org/jira/browse/SPARK-23800
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Priority: Minor
>
> Per discussion here: 
> [https://github.com/apache/spark/pull/20900#issuecomment-376195597]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23800) Support partial function and callable object with pandas UDF

2018-03-27 Thread Li Jin (JIRA)
Li Jin created SPARK-23800:
--

 Summary: Support partial function and callable object with pandas 
UDF
 Key: SPARK-23800
 URL: https://issues.apache.org/jira/browse/SPARK-23800
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 2.3.0
 Environment: Per discussion here: 
https://github.com/apache/spark/pull/20900#issuecomment-376195597
Reporter: Li Jin

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23800) Support partial function and callable object with pandas UDF

2018-03-27 Thread Li Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-23800:
---
Environment: (was: Per discussion here: 
https://github.com/apache/spark/pull/20900#issuecomment-376195597)

> Support partial function and callable object with pandas UDF
> 
>
> Key: SPARK-23800
> URL: https://issues.apache.org/jira/browse/SPARK-23800
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23799) [CBO] FilterEstimation.evaluateInSet produces division by zero in a case of empty table with analyzed statistics

2018-03-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415652#comment-16415652
 ] 

Apache Spark commented on SPARK-23799:
--

User 'mshtelma' has created a pull request for this issue:
https://github.com/apache/spark/pull/20913

> [CBO] FilterEstimation.evaluateInSet produces division by zero in a case of 
> empty table with analyzed statistics
> 
>
> Key: SPARK-23799
> URL: https://issues.apache.org/jira/browse/SPARK-23799
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Michael Shtelma
>Priority: Major
>
> FilterEstimation.evaluateInSet can perform division by zero in the case of an 
> empty table with analyzed statistics, which leads to the following exception: 
>  
> java.lang.NumberFormatException
> at java.math.BigDecimal.<init>(BigDecimal.java:494)
> at java.math.BigDecimal.<init>(BigDecimal.java:824)
> at scala.math.BigDecimal$.decimal(BigDecimal.scala:52)
> at scala.math.BigDecimal$.decimal(BigDecimal.scala:55)
> at scala.math.BigDecimal$.double2bigDecimal(BigDecimal.scala:343)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.FilterEstimation.estimate(FilterEstimation.scala:52)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitFilter(BasicStatsPlanVisitor.scala:43)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitFilter(BasicStatsPlanVisitor.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor$class.visit(LogicalPlanVisitor.scala:30)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visit(BasicStatsPlanVisitor.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:35)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:33)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$class.stats(LogicalPlanStats.scala:33)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$$anonfun$rowCountsExist$1.apply(EstimationUtils.scala:32)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$$anonfun$rowCountsExist$1.apply(EstimationUtils.scala:32)
> at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
> at 
> scala.collection.IndexedSeqOptimized$class.forall(IndexedSeqOptimized.scala:43)
> at scala.collection.mutable.WrappedArray.forall(WrappedArray.scala:35)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$.rowCountsExist(EstimationUtils.scala:32)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.ProjectEstimation$.estimate(ProjectEstimation.scala:27)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitProject(BasicStatsPlanVisitor.scala:63)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitProject(BasicStatsPlanVisitor.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor$class.visit(LogicalPlanVisitor.scala:37)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visit(BasicStatsPlanVisitor.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:35)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:33)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$class.stats(LogicalPlanStats.scala:33)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30)
> at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$2.apply(CostBasedJoinReorder.scala:64)
> at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$2.apply(CostBasedJoinReorder.scala:64)
> at 
> scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:83)
> at scala.collection.immutable.List.forall(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.org$apache$spark$sql$catalyst$optimizer$CostBasedJoinReorder$$reorder(CostBasedJoinReorder.scala:64)
> at 
> 

[jira] [Assigned] (SPARK-23799) [CBO] FilterEstimation.evaluateInSet produces division by zero in a case of empty table with analyzed statistics

2018-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23799:


Assignee: (was: Apache Spark)

> [CBO] FilterEstimation.evaluateInSet produces division by zero in a case of 
> empty table with analyzed statistics
> 
>
> Key: SPARK-23799
> URL: https://issues.apache.org/jira/browse/SPARK-23799
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Michael Shtelma
>Priority: Major
>
> FilterEstimation.evaluateInSet can perform division by zero in the case of an 
> empty table with analyzed statistics, which leads to the following exception: 
>  
> java.lang.NumberFormatException
> at java.math.BigDecimal.<init>(BigDecimal.java:494)
> at java.math.BigDecimal.<init>(BigDecimal.java:824)
> at scala.math.BigDecimal$.decimal(BigDecimal.scala:52)
> at scala.math.BigDecimal$.decimal(BigDecimal.scala:55)
> at scala.math.BigDecimal$.double2bigDecimal(BigDecimal.scala:343)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.FilterEstimation.estimate(FilterEstimation.scala:52)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitFilter(BasicStatsPlanVisitor.scala:43)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitFilter(BasicStatsPlanVisitor.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor$class.visit(LogicalPlanVisitor.scala:30)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visit(BasicStatsPlanVisitor.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:35)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:33)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$class.stats(LogicalPlanStats.scala:33)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$$anonfun$rowCountsExist$1.apply(EstimationUtils.scala:32)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$$anonfun$rowCountsExist$1.apply(EstimationUtils.scala:32)
> at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
> at 
> scala.collection.IndexedSeqOptimized$class.forall(IndexedSeqOptimized.scala:43)
> at scala.collection.mutable.WrappedArray.forall(WrappedArray.scala:35)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$.rowCountsExist(EstimationUtils.scala:32)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.ProjectEstimation$.estimate(ProjectEstimation.scala:27)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitProject(BasicStatsPlanVisitor.scala:63)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitProject(BasicStatsPlanVisitor.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor$class.visit(LogicalPlanVisitor.scala:37)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visit(BasicStatsPlanVisitor.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:35)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:33)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$class.stats(LogicalPlanStats.scala:33)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30)
> at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$2.apply(CostBasedJoinReorder.scala:64)
> at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$2.apply(CostBasedJoinReorder.scala:64)
> at 
> scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:83)
> at scala.collection.immutable.List.forall(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.org$apache$spark$sql$catalyst$optimizer$CostBasedJoinReorder$$reorder(CostBasedJoinReorder.scala:64)
> at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:46)
> at 
> 

[jira] [Assigned] (SPARK-23799) [CBO] FilterEstimation.evaluateInSet produces division by zero in a case of empty table with analyzed statistics

2018-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23799:


Assignee: Apache Spark

> [CBO] FilterEstimation.evaluateInSet produces division by zero in a case of 
> empty table with analyzed statistics
> 
>
> Key: SPARK-23799
> URL: https://issues.apache.org/jira/browse/SPARK-23799
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Michael Shtelma
>Assignee: Apache Spark
>Priority: Major
>
> FilterEstimation.evaluateInSet can perform division by zero in the case of an 
> empty table with analyzed statistics, which leads to the following exception: 
>  
> java.lang.NumberFormatException
> at java.math.BigDecimal.<init>(BigDecimal.java:494)
> at java.math.BigDecimal.<init>(BigDecimal.java:824)
> at scala.math.BigDecimal$.decimal(BigDecimal.scala:52)
> at scala.math.BigDecimal$.decimal(BigDecimal.scala:55)
> at scala.math.BigDecimal$.double2bigDecimal(BigDecimal.scala:343)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.FilterEstimation.estimate(FilterEstimation.scala:52)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitFilter(BasicStatsPlanVisitor.scala:43)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitFilter(BasicStatsPlanVisitor.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor$class.visit(LogicalPlanVisitor.scala:30)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visit(BasicStatsPlanVisitor.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:35)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:33)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$class.stats(LogicalPlanStats.scala:33)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$$anonfun$rowCountsExist$1.apply(EstimationUtils.scala:32)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$$anonfun$rowCountsExist$1.apply(EstimationUtils.scala:32)
> at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
> at 
> scala.collection.IndexedSeqOptimized$class.forall(IndexedSeqOptimized.scala:43)
> at scala.collection.mutable.WrappedArray.forall(WrappedArray.scala:35)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$.rowCountsExist(EstimationUtils.scala:32)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.ProjectEstimation$.estimate(ProjectEstimation.scala:27)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitProject(BasicStatsPlanVisitor.scala:63)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitProject(BasicStatsPlanVisitor.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor$class.visit(LogicalPlanVisitor.scala:37)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visit(BasicStatsPlanVisitor.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:35)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:33)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$class.stats(LogicalPlanStats.scala:33)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30)
> at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$2.apply(CostBasedJoinReorder.scala:64)
> at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$2.apply(CostBasedJoinReorder.scala:64)
> at 
> scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:83)
> at scala.collection.immutable.List.forall(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.org$apache$spark$sql$catalyst$optimizer$CostBasedJoinReorder$$reorder(CostBasedJoinReorder.scala:64)
> at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:46)
> at 
> 

[jira] [Created] (SPARK-23799) [CBO] FilterEstimation.evaluateInSet produces division by zero in a case of empty table with analyzed statistics

2018-03-27 Thread Michael Shtelma (JIRA)
Michael Shtelma created SPARK-23799:
---

 Summary: [CBO] FilterEstimation.evaluateInSet produces division by 
zero in a case of empty table with analyzed statistics
 Key: SPARK-23799
 URL: https://issues.apache.org/jira/browse/SPARK-23799
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 2.3.0, 2.2.1
Reporter: Michael Shtelma


FilterEstimation.evaluateInSet can perform division by zero in the case of an 
empty table with analyzed statistics, which leads to the following exception: 

 

java.lang.NumberFormatException

at java.math.BigDecimal.<init>(BigDecimal.java:494)

at java.math.BigDecimal.<init>(BigDecimal.java:824)

at scala.math.BigDecimal$.decimal(BigDecimal.scala:52)

at scala.math.BigDecimal$.decimal(BigDecimal.scala:55)

at scala.math.BigDecimal$.double2bigDecimal(BigDecimal.scala:343)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.FilterEstimation.estimate(FilterEstimation.scala:52)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitFilter(BasicStatsPlanVisitor.scala:43)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitFilter(BasicStatsPlanVisitor.scala:25)

at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor$class.visit(LogicalPlanVisitor.scala:30)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visit(BasicStatsPlanVisitor.scala:25)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:35)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:33)

at scala.Option.getOrElse(Option.scala:121)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$class.stats(LogicalPlanStats.scala:33)

at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$$anonfun$rowCountsExist$1.apply(EstimationUtils.scala:32)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$$anonfun$rowCountsExist$1.apply(EstimationUtils.scala:32)

at 
scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)

at 
scala.collection.IndexedSeqOptimized$class.forall(IndexedSeqOptimized.scala:43)

at scala.collection.mutable.WrappedArray.forall(WrappedArray.scala:35)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$.rowCountsExist(EstimationUtils.scala:32)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.ProjectEstimation$.estimate(ProjectEstimation.scala:27)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitProject(BasicStatsPlanVisitor.scala:63)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitProject(BasicStatsPlanVisitor.scala:25)

at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor$class.visit(LogicalPlanVisitor.scala:37)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visit(BasicStatsPlanVisitor.scala:25)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:35)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:33)

at scala.Option.getOrElse(Option.scala:121)

at 
org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$class.stats(LogicalPlanStats.scala:33)

at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30)

at 
org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$2.apply(CostBasedJoinReorder.scala:64)

at 
org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$2.apply(CostBasedJoinReorder.scala:64)

at scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:83)

at scala.collection.immutable.List.forall(List.scala:84)

at 
org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.org$apache$spark$sql$catalyst$optimizer$CostBasedJoinReorder$$reorder(CostBasedJoinReorder.scala:64)

at 
org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:46)

at 
org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:43)

at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)

at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)

at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)

at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)

at 

[jira] [Commented] (SPARK-6162) Handle missing values in GBM

2018-03-27 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415600#comment-16415600
 ] 

Barry Becker commented on SPARK-6162:
-

If we all agree that this is something that would be very nice to have, why is 
it closed as Won't Fix instead of just being deferred to a future release?

This seems like a big limitation of tree models in Spark.

> Handle missing values in GBM
> 
>
> Key: SPARK-6162
> URL: https://issues.apache.org/jira/browse/SPARK-6162
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.1
>Reporter: Devesh Parekh
>Priority: Major
>
> We build a lot of predictive models over data combined from multiple sources, 
> where some entries may not have all sources of data and so some values are 
> missing in each feature vector. Another place this might come up is if you 
> have features from slightly heterogeneous items (or items composed of 
> heterogeneous subcomponents) that share many features in common but may have 
> extra features for different types, and you don't want to manually train 
> models for every different type.
> R's GBM library, which is what we are currently using, deals with this type 
> of data nicely by making "missing" nodes in the decision tree (a surrogate 
> split) for features that can have missing values. We'd like to do the same 
> with MLLib, but LabeledPoint would need to support missing values, and 
> GradientBoostedTrees would need to be modified to deal with them.
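(As a present-day aside: until tree models handle missing values natively, a common stopgap is to impute them before assembling feature vectors. Below is a minimal sketch using Spark ML's Imputer, available in newer Spark versions; the column names and data are hypothetical, and mean imputation is of course weaker than the surrogate splits requested above.)

{code:java}
import org.apache.spark.ml.feature.Imputer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("impute-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data: NaN marks a missing value in a feature column.
val df = Seq((1.0, Double.NaN), (Double.NaN, 3.0), (4.0, 5.0)).toDF("f1", "f2")

// Replace NaN with the per-column mean before building feature vectors.
val imputer = new Imputer()
  .setInputCols(Array("f1", "f2"))
  .setOutputCols(Array("f1_imputed", "f2_imputed"))
  .setStrategy("mean")

val imputed = imputer.fit(df).transform(df)
imputed.show()
{code}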



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23798) The CreateArray and ConcatArray should return the default array type when no children are provided

2018-03-27 Thread Marek Novotny (JIRA)
Marek Novotny created SPARK-23798:
-

 Summary: The CreateArray and ConcatArray should return the default 
array type when no children are provided
 Key: SPARK-23798
 URL: https://issues.apache.org/jira/browse/SPARK-23798
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Marek Novotny


The expressions should return ArrayType.defaultConcreteType.
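For illustration, a sketch of what this would look like at the API surface (assuming the proposed behavior, where the default concrete type is an array of nulls; the schema shown in the comments is an assumption, not confirmed output):

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.array

val spark = SparkSession.builder().master("local[*]").appName("array-sketch").getOrCreate()

// array() with no children: under the proposal its data type would be the
// default concrete array type rather than an arbitrarily chosen one.
val df = spark.range(1).select(array().as("empty_arr"))
df.printSchema()
// Expected (assumed) schema after the change:
//  |-- empty_arr: array
//  |    |-- element: null
{code}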



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23794) UUID() should be stateful

2018-03-27 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-23794.
---
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.4.0

> UUID() should be stateful
> -
>
> Key: SPARK-23794
> URL: https://issues.apache.org/jira/browse/SPARK-23794
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Herman van Hovell
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> The UUID() expression is stateful and should implement the Stateful trait 
> instead of the Nondeterministic trait.
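Roughly speaking (an illustrative Scala sketch, not the actual Catalyst code), a stateful expression owns mutable state that must not be shared between copies of a plan, so a Stateful-style contract adds a way to obtain an independent fresh copy:

{code:java}
import java.util.UUID
import scala.util.Random

// Illustrative only: the RNG is per-instance mutable state, seeded per
// partition. Sharing one instance across plan branches would correlate
// their outputs, so each use needs its own fresh copy.
class UuidSketch(seed: Long) {
  private var rng: Random = _
  def initialize(partitionIndex: Int): Unit =
    rng = new Random(seed + partitionIndex)
  def eval(): UUID = new UUID(rng.nextLong(), rng.nextLong())
  def freshCopy(): UuidSketch = new UuidSketch(seed) // independent state
}
{code}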



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22513) Provide build profile for hadoop 2.8

2018-03-27 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415320#comment-16415320
 ] 

Steve Loughran commented on SPARK-22513:


bq. So I guess at the summary level Sean was correct

pretty much. 

> Provide build profile for hadoop 2.8
> 
>
> Key: SPARK-22513
> URL: https://issues.apache.org/jira/browse/SPARK-22513
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Christine Koppelt
>Priority: Major
>
> Hadoop 2.8 comes with a patch that is necessary to make it run on NixOS [1]. 
> Therefore it would be cool to have a Spark version pre-built for Hadoop 2.8.
> [1] 
> https://github.com/apache/hadoop/commit/5231c527aaf19fb3f4bd59dcd2ab19bfb906d377#diff-19821342174c77119be4a99dc3f3618d



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23504) Flaky test: RateSourceV2Suite.basic microbatch execution

2018-03-27 Thread Marek Novotny (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415313#comment-16415313
 ] 

Marek Novotny commented on SPARK-23504:
---

Experienced the [same 
problem|https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88605/testReport/org.apache.spark.sql.execution.streaming/RateSourceV2Suite/basic_microbatch_execution/].

> Flaky test: RateSourceV2Suite.basic microbatch execution
> 
>
> Key: SPARK-23504
> URL: https://issues.apache.org/jira/browse/SPARK-23504
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
>
> Seen on an unrelated change:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87635/testReport/org.apache.spark.sql.execution.streaming/RateSourceV2Suite/basic_microbatch_execution/
> {noformat}
> Error Message
> org.scalatest.exceptions.TestFailedException:   == Results == !== Correct 
> Answer - 10 == == Spark Answer - 0 == !struct<_1:timestamp,_2:int>   
> struct<> ![1969-12-31 16:00:00.0,0]  ![1969-12-31 16:00:00.1,1]  
> ![1969-12-31 16:00:00.2,2]  ![1969-12-31 16:00:00.3,3]  ![1969-12-31 
> 16:00:00.4,4]  ![1969-12-31 16:00:00.5,5]  ![1969-12-31 16:00:00.6,6] 
>  ![1969-12-31 16:00:00.7,7]  ![1969-12-31 16:00:00.8,8]  
> ![1969-12-31 16:00:00.9,9]== Progress ==
> AdvanceRateManualClock(1) => CheckLastBatch: [1969-12-31 
> 16:00:00.0,0],[1969-12-31 16:00:00.1,1],[1969-12-31 16:00:00.2,2],[1969-12-31 
> 16:00:00.3,3],[1969-12-31 16:00:00.4,4],[1969-12-31 16:00:00.5,5],[1969-12-31 
> 16:00:00.6,6],[1969-12-31 16:00:00.7,7],[1969-12-31 16:00:00.8,8],[1969-12-31 
> 16:00:00.9,9]StopStream
> StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@22bc97a,Map(),null)
> AdvanceRateManualClock(2)CheckLastBatch: [1969-12-31 
> 16:00:01.0,10],[1969-12-31 16:00:01.1,11],[1969-12-31 
> 16:00:01.2,12],[1969-12-31 16:00:01.3,13],[1969-12-31 
> 16:00:01.4,14],[1969-12-31 16:00:01.5,15],[1969-12-31 
> 16:00:01.6,16],[1969-12-31 16:00:01.7,17],[1969-12-31 
> 16:00:01.8,18],[1969-12-31 16:00:01.9,19]  == Stream == Output Mode: Append 
> Stream state: 
> {org.apache.spark.sql.execution.streaming.sources.RateStreamMicroBatchReader@75b88292:
>  {"0":{"value":-1,"runTimeMs":0}}} Thread state: alive Thread stack trace: 
> sun.misc.Unsafe.park(Native Method) 
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153) 
> org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:222) 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:633) 
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2030) 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:84)
>  
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294) 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
>  org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722) 
> org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253) 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252) 
> org.apache.spark.sql.Dataset.collect(Dataset.scala:2722) 
> 

[jira] [Assigned] (SPARK-23794) UUID() should be stateful

2018-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23794:


Assignee: (was: Apache Spark)

> UUID() should be stateful
> -
>
> Key: SPARK-23794
> URL: https://issues.apache.org/jira/browse/SPARK-23794
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Herman van Hovell
>Priority: Major
>
> The UUID() expression is stateful and should implement the Stateful trait 
> instead of the Nondeterministic trait.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23794) UUID() should be stateful

2018-03-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415295#comment-16415295
 ] 

Apache Spark commented on SPARK-23794:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/20912

> UUID() should be stateful
> -
>
> Key: SPARK-23794
> URL: https://issues.apache.org/jira/browse/SPARK-23794
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Herman van Hovell
>Priority: Major
>
> The UUID() expression is stateful and should implement the Stateful trait 
> instead of the Nondeterministic trait.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23794) UUID() should be stateful

2018-03-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23794:


Assignee: Apache Spark

> UUID() should be stateful
> -
>
> Key: SPARK-23794
> URL: https://issues.apache.org/jira/browse/SPARK-23794
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>Priority: Major
>
> The UUID() expression is stateful and should implement the Stateful trait 
> instead of the Nondeterministic trait.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2

2018-03-27 Thread Jepson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jepson reopened SPARK-22968:


This issue has reappeared.

> java.lang.IllegalStateException: No current assignment for partition kssh-2
> ---
>
> Key: SPARK-22968
> URL: https://issues.apache.org/jira/browse/SPARK-22968
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: Kafka:  0.10.0  (CDH5.12.0)
> Apache Spark 2.1.1
> Spark streaming+Kafka
>Reporter: Jepson
>Priority: Major
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> *Kafka Broker:*
> {code:java}
>message.max.bytes : 2621440  
> {code}
> *Spark Streaming+Kafka Code:*
> {code:java}
> , "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) //default: 
> 1048576
> , "request.timeout.ms" -> (9: java.lang.Integer) //default: 6
> , "session.timeout.ms" -> (6: java.lang.Integer) //default: 3
> , "heartbeat.interval.ms" -> (5000: java.lang.Integer)
> , "receive.buffer.bytes" -> (10485760: java.lang.Integer)
> {code}
> *Error message:*
> {code:java}
> 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously 
> assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, 
> kssh-1] for group use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group 
> use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined 
> group use_a_separate_group_id_for_each_stream with generation 4
> 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned 
> partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group 
> use_a_separate_group_id_for_each_stream
> 18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for 
> time 1515116907000 ms
> java.lang.IllegalStateException: No current assignment for partition kssh-2
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231)
>   at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197)
>   at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
>   at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249)
>   at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> 
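For orientation, a minimal sketch (the broker address, topic name, and batch interval are hypothetical) of how consumer params like the ones quoted above are wired into the direct stream whose offset seek fails here:

{code:java}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf().setAppName("kafka-sketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "use_a_separate_group_id_for_each_stream"
)

// DirectKafkaInputDStream.latestOffsets() seeks to the end of the assigned
// partitions each batch; the IllegalStateException above fires when a
// partition (kssh-2) drops out of the consumer's assignment after a rebalance.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("kssh"), kafkaParams)
)
{code}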

[jira] [Comment Edited] (SPARK-23191) Workers registration fails in case of network drop

2018-03-27 Thread Sujith Jay Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415194#comment-16415194
 ] 

Sujith Jay Nair edited comment on SPARK-23191 at 3/27/18 8:10 AM:
--

This is a known race condition, in which the reconnection attempt made by the 
worker after the network outage is seen as a request to register a duplicate 
worker on the (same) master and hence, the attempt fails. Details on this can 
be found in SPARK-4592. Although the race condition is resolved for the case in 
which a new master replaces the unresponsive master, it still exists when the 
same old master recovers, which is the case here.


was (Author: suj1th):
This is a known race condition, in which the reconnection attempt made by the 
worker after the network outage is seen as a request to register a duplicate 
worker on the (same) master and hence, the attempt fails. Details on this can 
be found in [#SPARK-4592]. Although the race condition is resolved for the case 
in which a new master replaces the unresponsive master, it still exists when 
the same old master recovers, which is the case here.

> Workers registration fails in case of network drop
> ---
>
> Key: SPARK-23191
> URL: https://issues.apache.org/jira/browse/SPARK-23191
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.1, 2.3.0
> Environment: OS:- Centos 6.9(64 bit)
>  
>Reporter: Neeraj Gupta
>Priority: Critical
>
> We have a 3-node cluster. We were facing issues of multiple drivers running in 
> some scenarios in production.
> On further investigation we were able to reproduce the scenario in both 1.6.3 
> and 2.2.1 with the following steps:
>  # Set up a 3-node cluster. Start the master and slaves.
>  # On any node where the worker process is running, block connections on 
> port 7077 using iptables.
> {code:java}
> iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
>  # After about 10-15 seconds we get an error on the node that it is unable 
> to connect to the master.
> {code:java}
> 2018-01-23 12:08:51,639 [rpc-client-1-1] WARN  
> org.apache.spark.network.server.TransportChannelHandler - Exception in 
> connection from 
> java.io.IOException: Connection timed out
>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>     at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
>     at 
> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
>     at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
>     at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
>     at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
>     at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>     at java.lang.Thread.run(Thread.java:745)
> 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR 
> org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting 
> for master to reconnect...
> 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR 
> org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting 
> for master to reconnect...
> {code}
>  # Once we get this exception we re-enable the connections to port 7077 using
> {code:java}
> iptables -D OUTPUT -p tcp --dport 7077 -j DROP
> {code}
>  # The worker tries to register again with the master but is unable to do 
> so. It gives the following error:
> {code:java}
> 2018-01-23 12:08:58,657 [worker-register-master-threadpool-2] WARN  
> org.apache.spark.deploy.worker.Worker - Failed to connect to master 
> :7077
> org.apache.spark.SparkException: Exception thrown in awaitResult:
>     at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
>     at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>     at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
>     at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
>     at 
> 

[jira] [Commented] (SPARK-23191) Workers registration fails in case of network drop

2018-03-27 Thread Sujith Jay Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415194#comment-16415194
 ] 

Sujith Jay Nair commented on SPARK-23191:
-

This is a known race condition: the reconnection attempt made by the worker 
after the network outage is seen as a request to register a duplicate worker 
on the (same) master, and hence the attempt fails. Details can be found in 
[#SPARK-4592]. Although the race condition is resolved for the case in which a 
new master replaces the unresponsive master, it still exists when the same old 
master recovers, which is the case here.
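
To make the race concrete, below is a minimal sketch of the registration check 
that rejects the returning worker. It is modelled loosely on the standalone 
Master's handling of worker registration; the message and field names are 
simplified stand-ins, not the actual org.apache.spark.deploy.master.Master code:

{code:java}
// Simplified sketch of why re-registration with the *same* master fails.
// Names are illustrative; this is not the real Master implementation.
case class RegisterWorker(workerId: String, host: String, port: Int)

class MasterSketch {
  // Workers this master still believes are alive, keyed by worker ID.
  private val idToWorker = scala.collection.mutable.Map.empty[String, String]

  def handleRegisterWorker(req: RegisterWorker): String =
    if (idToWorker.contains(req.workerId)) {
      // During the outage the old master never marked the worker dead, so the
      // worker's re-registration is indistinguishable from a duplicate.
      s"RegisterWorkerFailed: duplicate worker ID ${req.workerId}"
    } else {
      idToWorker(req.workerId) = s"${req.host}:${req.port}"
      s"RegisteredWorker: ${req.workerId}"
    }
}
{code}

A newly elected master starts with an empty worker table, which is why the 
SPARK-4592 fix covers failover to a new master but not recovery of the same 
master, as described above.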

> Workers registration fails in case of network drop
> ---
>
> Key: SPARK-23191
> URL: https://issues.apache.org/jira/browse/SPARK-23191
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.1, 2.3.0
> Environment: OS: CentOS 6.9 (64-bit)
>  
>Reporter: Neeraj Gupta
>Priority: Critical
>
> We have a 3-node cluster. We were facing issues with multiple drivers running 
> in some scenarios in production.
> On further investigation we were able to reproduce the scenario in both the 
> 1.6.3 and 2.2.1 versions with the following steps:
>  # Set up a 3-node cluster. Start the master and slaves.
>  # On any node where the worker process is running, block outbound connections 
> to port 7077 using iptables.
> {code:java}
> iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
>  # After about 10-15 seconds we get an error on the node that it is unable to 
> connect to the master.
> {code:java}
> 2018-01-23 12:08:51,639 [rpc-client-1-1] WARN  
> org.apache.spark.network.server.TransportChannelHandler - Exception in 
> connection from 
> java.io.IOException: Connection timed out
>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>     at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
>     at 
> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
>     at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
>     at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
>     at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
>     at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>     at java.lang.Thread.run(Thread.java:745)
> 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR 
> org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting 
> for master to reconnect...
> 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR 
> org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting 
> for master to reconnect...
> {code}
>  # Once we get this exception we re-enable the connections to port 7077 using
> {code:java}
> iptables -D OUTPUT -p tcp --dport 7077 -j DROP
> {code}
>  # The worker tries to register again with the master but is unable to do so. 
> It gives the following error:
> {code:java}
> 2018-01-23 12:08:58,657 [worker-register-master-threadpool-2] WARN  
> org.apache.spark.deploy.worker.Worker - Failed to connect to master 
> :7077
> org.apache.spark.SparkException: Exception thrown in awaitResult:
>     at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
>     at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>     at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
>     at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
>     at 
> org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:241)
>     at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at 

[jira] [Commented] (SPARK-23500) Filters on named_structs could be pushed into scans

2018-03-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415141#comment-16415141
 ] 

Apache Spark commented on SPARK-23500:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/20911

> Filters on named_structs could be pushed into scans
> ---
>
> Key: SPARK-23500
> URL: https://issues.apache.org/jira/browse/SPARK-23500
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Henry Robinson
>Assignee: Henry Robinson
>Priority: Major
> Fix For: 2.4.0
>
>
> Simple filters on dataframes joined with {{joinWith()}} miss an opportunity to 
> be pushed into the scan because they are written in terms of a 
> {{named_struct}} that could be removed by the optimizer.
> Given the following simple query over two dataframes:
> {code:java}
> scala> val df = spark.read.parquet("one_million")
> df: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> val df2 = spark.read.parquet("one_million")
> df2: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> df.joinWith(df2, df2.col("id") === df.col("id2")).filter("_2.id > 
> 30").explain
> == Physical Plan ==
> *(2) BroadcastHashJoin [_1#94.id2], [_2#95.id], Inner, BuildRight
> :- *(2) Project [named_struct(id, id#0L, id2, id2#1L) AS _1#94]
> :  +- *(2) FileScan parquet [id#0L,id2#1L] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/Users/henry/src/spark/one_million], 
> PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct<id:bigint,id2:bigint>
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, 
> struct<id:bigint,id2:bigint>, false].id))
>+- *(1) Project [named_struct(id, id#90L, id2, id2#91L) AS _2#95]
>   +- *(1) Filter (named_struct(id, id#90L, id2, id2#91L).id > 30)
>  +- *(1) FileScan parquet [id#90L,id2#91L] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/Users/henry/src/spark/one_million], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct<id:bigint,id2:bigint>
> {code}
> Using {{joinWith}} means that the filter is placed on a {{named_struct}}, and 
> is then not pushed down. When the filter is just above the scan, the 
> wrapping-and-projection of {{named_struct(id...).id}} is a no-op and could be 
> removed; then the filter could be pushed down to Parquet.
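
Until the optimizer eliminates the no-op {{named_struct(...).id}} wrapper, a 
practical workaround is to apply the predicate before {{joinWith}}, so it 
remains a plain column reference and reaches the scan. A minimal sketch for 
spark-shell, assuming the same "one_million" Parquet directory used above:

{code:java}
// Filtering before joinWith keeps the predicate on a bare column, so it
// shows up under PushedFilters in the FileScan instead of being evaluated
// against the named_struct built for the join output.
import org.apache.spark.sql.functions.col

val df  = spark.read.parquet("one_million")
val df2 = spark.read.parquet("one_million").filter(col("id") > 30)
df.joinWith(df2, df2.col("id") === df.col("id2")).explain()
{code}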



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23766) Not able to execute multiple queries in spark structured streaming

2018-03-27 Thread Apeksha Agnihotri (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apeksha Agnihotri updated SPARK-23766:
--
Component/s: (was: Spark Core)
 Structured Streaming

> Not able to execute multiple queries in spark structured streaming
> --
>
> Key: SPARK-23766
> URL: https://issues.apache.org/jira/browse/SPARK-23766
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Apeksha Agnihotri
>Priority: Major
>
> I am able to receive output from the first query (i.e. "reader") only, 
> although the logs show all the queries running. No data is stored in HDFS 
> either.
>  
> {code:java}
> // Imports reconstructed for readability; the generic parameters below were
> // stripped by the tracker's formatting and are restored as Dataset<Row>.
> import java.io.Serializable;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.functions;
> import org.apache.spark.sql.streaming.StreamingQueryException;
>
> public class A extends D implements Serializable {
>     public Dataset<Row> getDataSet(SparkSession session) {
>         // hostname and port are assumed to be fields defined elsewhere
>         Dataset<Row> dfs = session.readStream().format("socket")
>                 .option("host", hostname).option("port", port).load();
>         publish(dfs.toDF(), "reader");
>         return dfs;
>     }
> }
>
> public class B extends D implements Serializable {
>     public Dataset<Row> execute(Dataset<Row> ds) {
>         Dataset<Row> d = ds.select(
>                 functions.explode(functions.split(ds.col("value"), "\\s+")));
>         publish(d.toDF(), "component");
>         return d;
>     }
> }
>
> public class C extends D implements Serializable {
>     public Dataset<Row> execute(Dataset<Row> ds) {
>         publish(ds.toDF(), "console"); // was inputDataSet, an undefined name
>         ds.writeStream().format("csv")
>                 .option("path", "hdfs://hostname:9000/user/abc/data1/")
>                 .option("checkpointLocation", "hdfs://hostname:9000/user/abc/cp")
>                 .outputMode("append").start();
>         return ds;
>     }
> }
>
> public class D {
>     public void publish(Dataset<Row> dataset, String name) {
>         dataset.writeStream().format("csv").queryName(name)
>                 .option("path", "hdfs://hostname:9000/user/abc/" + name)
>                 .option("checkpointLocation",
>                         "hdfs://hostname:9000/user/abc/checkpoint/" + name)
>                 .outputMode("append").start(); // was "+ directory", undefined
>     }
> }
>
> // The driver entry point; createSession() is the reporter's helper that
> // builds the SparkSession.
> public static void main(String[] args) {
>     SparkSession session = createSession();
>     try {
>         A a = new A();
>         Dataset<Row> records = a.getDataSet(session);
>         B b = new B();
>         Dataset<Row> ds = b.execute(records);
>         C c = new C();
>         c.execute(ds);
>         session.streams().awaitAnyTermination();
>     } catch (StreamingQueryException e) {
>         e.printStackTrace();
>     }
> }
> {code}
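
For what it is worth, the multi-sink pattern itself works when each query gets 
a distinct checkpoint location. A common explanation for the symptom above is 
the socket source: every started query opens its own connection to the socket 
server, and a typical nc -lk server serves one connection at a time, so only 
the first query sees data. Below is a self-contained sketch of the working 
pattern using the rate source instead; the paths and query names are 
placeholders, not taken from the report:

{code:java}
// Sketch: fanning one streaming source out to two sinks. Each query needs
// its own checkpointLocation; both run until awaitAnyTermination returns.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multi-sink-sketch").getOrCreate()
val src = spark.readStream.format("rate").option("rowsPerSecond", "1").load()

val reader = src.writeStream.format("console").queryName("reader")
  .option("checkpointLocation", "/tmp/cp/reader")      // placeholder path
  .outputMode("append").start()

val component = src.selectExpr("value * 2 AS doubled").writeStream
  .format("console").queryName("component")
  .option("checkpointLocation", "/tmp/cp/component")   // placeholder path
  .outputMode("append").start()

spark.streams.awaitAnyTermination()
{code}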



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23739) Spark structured streaming long running problem

2018-03-27 Thread Davide Isoardi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415093#comment-16415093
 ] 

Davide Isoardi commented on SPARK-23739:


We know that the Kafka 0.8 client does not have this class, but the Druid 
package does.

Is it possible that this issue is caused by the Druid package? If so, is it 
possible to have Druid installed and still use Structured Streaming to get 
data from Kafka?
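
One way to verify whether a Druid-bundled jar is shadowing the expected 
kafka-clients artifact is to ask the JVM where it loads the Kafka classes 
from. A small diagnostic sketch for spark-shell on the affected classpath; the 
second class name is taken from the stack trace below, the rest is plain JDK 
reflection:

{code:java}
// Prints the jar that actually provides each class at runtime. If the
// consumer classes resolve to a Kafka 0.8-era or Druid-bundled jar rather
// than the kafka-clients version expected by spark-sql-kafka, that
// classpath conflict would explain the NoClassDefFoundError below.
def jarOf(className: String): String =
  try {
    Option(Class.forName(className).getProtectionDomain.getCodeSource)
      .map(_.getLocation.toString)
      .getOrElse("<bootstrap classpath>")
  } catch {
    case _: ClassNotFoundException => "<not on classpath>"
  }

println(jarOf("org.apache.kafka.clients.consumer.KafkaConsumer"))
println(jarOf("org.apache.kafka.common.requests.LeaveGroupResponse"))
{code}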

> Spark structured streaming long running problem
> ---
>
> Key: SPARK-23739
> URL: https://issues.apache.org/jira/browse/SPARK-23739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Florencio
>Priority: Critical
>  Labels: spark, streaming, structured
>
> I had a problem with a long-running Spark Structured Streaming job in Spark 
> 2.1, caused by: java.lang.ClassNotFoundException: 
> org.apache.kafka.common.requests.LeaveGroupResponse.
> The detailed error is the following:
> 18/03/16 16:10:57 INFO StreamExecution: Committed offsets for batch 2110. 
> Metadata OffsetSeqMetadata(0,1521216656590)
> 18/03/16 16:10:57 INFO KafkaSource: GetBatch called with start = 
> Some(\{"TopicName":{"2":5520197,"1":5521045,"3":5522054,"0":5527915}}), end = 
> \{"TopicName":{"2":5522730,"1":5523577,"3":5524586,"0":5530441}}
> 18/03/16 16:10:57 INFO KafkaSource: Partitions added: Map()
> 18/03/16 16:10:57 ERROR StreamExecution: Query [id = 
> a233b9ff-cc39-44d3-b953-a255986c04bf, runId = 
> 8520e3c0-2455-4ac1-9021-8518fb58b3f8] terminated with error
> java.util.zip.ZipException: invalid code lengths set
>  at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>  at java.io.FilterInputStream.read(FilterInputStream.java:133)
>  at java.io.FilterInputStream.read(FilterInputStream.java:107)
>  at 
> org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354)
>  at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322)
>  at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303)
>  at org.apache.spark.util.Utils$.copyStream(Utils.scala:362)
>  at 
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:45)
>  at 
> org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:83)
>  at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:173)
>  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>  at org.apache.spark.SparkContext.clean(SparkContext.scala:2101)
>  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
>  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>  at org.apache.spark.rdd.RDD.map(RDD.scala:369)
>  at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:287)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:503)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:499)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>  at 
> org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
>  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> 18/03/16 16:10:57 ERROR ClientUtils: Failed to close coordinator
> java.lang.NoClassDefFoundError: 
> org/apache/kafka/common/requests/LeaveGroupResponse
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.sendLeaveGroupRequest(AbstractCoordinator.java:575)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.maybeLeaveGroup(AbstractCoordinator.java:566)
>  at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator.close(AbstractCoordinator.java:555)
>  at 
> org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.close(ConsumerCoordinator.java:377)
>  at org.apache.kafka.clients.ClientUtils.closeQuietly(ClientUtils.java:66)
>  at 
>