[jira] [Created] (SPARK-22515) Estimation relation size based on numRows * rowSize

2017-11-13 Thread Zhenhua Wang (JIRA)
Zhenhua Wang created SPARK-22515:


 Summary: Estimation relation size based on numRows * rowSize
 Key: SPARK-22515
 URL: https://issues.apache.org/jira/browse/SPARK-22515
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Zhenhua Wang
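For context, a minimal sketch of the estimation the summary describes, with hypothetical names (this is not the actual CBO code):

```scala
// Hypothetical sketch: estimate a relation's size from row count and average row size,
// falling back to the on-disk file size when statistics are not available.
def estimateSizeInBytes(numRows: Option[BigInt], rowSize: Option[Long], fileSizeInBytes: Long): BigInt =
  (numRows, rowSize) match {
    case (Some(n), Some(w)) => n * w
    case _                  => BigInt(fileSizeInBytes)
  }
```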









[jira] [Updated] (SPARK-22490) PySpark doc has misleading string for SparkSession.builder

2017-11-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22490:
-
Affects Version/s: 2.2.2

> PySpark doc has misleading string for SparkSession.builder
> --
>
> Key: SPARK-22490
> URL: https://issues.apache.org/jira/browse/SPARK-22490
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.2
>Reporter: Xiao Li
>Priority: Minor
>
> We need to fix the following line in our PySpark doc 
> http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html
> {noformat}
> SparkSession.builder = <pyspark.sql.session.Builder object at 0x7f51f134a110>¶
> {noformat}






[jira] [Updated] (SPARK-22511) Update maven central repo address

2017-11-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22511:
-
Affects Version/s: 2.3.0

> Update maven central repo address
> -
>
> Key: SPARK-22511
> URL: https://issues.apache.org/jira/browse/SPARK-22511
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.1, 2.3.0, 2.2.2
>Reporter: Felix Cheung
>
> As part of building 2.2.1, we hit an issue with Sonatype:
> https://issues.sonatype.org/browse/MVNCENTRAL-2870
> To work around it, we switched the address to repo.maven.apache.org in branch-2.2.
> We should decide whether to keep that or revert it after 2.2.1 is released.






[jira] [Updated] (SPARK-21098) Set lineseparator csv multiline and csv write to \n

2017-11-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-21098:
-
Affects Version/s: (was: 2.2.1)
   2.2.2

> Set lineseparator csv multiline and csv write to \n
> ---
>
> Key: SPARK-21098
> URL: https://issues.apache.org/jira/browse/SPARK-21098
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, SQL
>Affects Versions: 2.2.2
>Reporter: Daniel van der Ende
>Priority: Minor
>
> The Univocity-parser library uses the system line ending character as the 
> default line ending character. Rather than remain dependent on the setting in 
> this lib, we could set the default to \n.  We cannot make this configurable 
> for reading as it depends on LineReader from Hadoop, which has a hardcoded \n 
> as line ending.
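As a rough illustration of the library setting in question, here is a hedged sketch of how a univocity writer's line separator can be pinned to "\n" instead of the platform default (illustrative only; not the change proposed in this ticket):

```scala
import java.io.StringWriter
import com.univocity.parsers.csv.{CsvWriter, CsvWriterSettings}

// univocity defaults to the system line separator; setting it explicitly
// makes the output use "\n" regardless of platform.
val settings = new CsvWriterSettings()
settings.getFormat.setLineSeparator("\n")
val out = new StringWriter()
val writer = new CsvWriter(out, settings)
writer.writeRow("a", "b")
writer.close()
```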






[jira] [Updated] (SPARK-21859) SparkFiles.get failed on driver in yarn-cluster and yarn-client mode

2017-11-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-21859:
-
Affects Version/s: (was: 2.2.1)

> SparkFiles.get failed on driver in yarn-cluster and yarn-client mode
> 
>
> Key: SPARK-21859
> URL: https://issues.apache.org/jira/browse/SPARK-21859
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: Cyanny
>
> When using SparkFiles.get to access a file on the driver in yarn-client or
> yarn-cluster mode, it reports a file-not-found exception.
> This exception only happens on the driver; SparkFiles.get works fine on
> executors.
> 
> We can reproduce the bug as follows:
> ```scala
> import java.io.File
> import scala.io.Source
> import org.apache.spark.SparkFiles
> 
> def testOnDriver(fileName: String) = {
>   val file = new File(SparkFiles.get(fileName))
>   if (!file.exists()) {
>     logging.info(s"$file not exist")
>   } else {
>     // print file content on driver
>     val content = Source.fromFile(file).getLines().mkString("\n")
>     logging.info(s"File content: ${content}")
>   }
> }
> // the output will be "file not exist"
> ```
> 
> ```python
> import os
> from pyspark import SparkConf, SparkContext, SparkFiles
> 
> conf = SparkConf().setAppName("test files")
> sc = SparkContext(appName="spark files test")
> 
> def test_on_driver(filename):
>     file = SparkFiles.get(filename)
>     print("file path: {}".format(file))
>     if os.path.exists(file):
>         with open(file) as f:
>             lines = f.readlines()
>         print(lines)
>     else:
>         print("file doesn't exist")
>         run_command("ls .")  # run_command is a helper defined elsewhere by the reporter
> ```






[jira] [Updated] (SPARK-22511) Update maven central repo address

2017-11-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22511:
-
Affects Version/s: 2.2.2

> Update maven central repo address
> -
>
> Key: SPARK-22511
> URL: https://issues.apache.org/jira/browse/SPARK-22511
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.1, 2.2.2
>Reporter: Felix Cheung
>
> As part of building 2.2.1, we hit an issue with Sonatype:
> https://issues.sonatype.org/browse/MVNCENTRAL-2870
> To work around it, we switched the address to repo.maven.apache.org in branch-2.2.
> We should decide whether to keep that or revert it after 2.2.1 is released.






[jira] [Updated] (SPARK-21245) Resolve code duplication for classification/regression summarizers

2017-11-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-21245:
-
Affects Version/s: (was: 2.2.1)
   2.2.2

> Resolve code duplication for classification/regression summarizers
> --
>
> Key: SPARK-21245
> URL: https://issues.apache.org/jira/browse/SPARK-21245
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.2
>Reporter: Seth Hendrickson
>Priority: Minor
>  Labels: starter
>
> In several places (LogReg, LinReg, SVC) in Spark ML, we collect summary 
> information about training data using {{MultivariateOnlineSummarizer}} and 
> {{MulticlassSummarizer}}. We have the same code appearing in several places 
> (and including test suites). We can eliminate this by creating a common 
> implementation somewhere.
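A hedged sketch of the kind of shared helper the issue suggests (names are illustrative and not the API that was eventually added):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
import org.apache.spark.rdd.RDD

// Illustrative helper: compute the feature summary once so that LogReg/LinReg/SVC
// do not each repeat the same treeAggregate boilerplate.
def summarizeFeatures(features: RDD[Vector]): MultivariateOnlineSummarizer =
  features.treeAggregate(new MultivariateOnlineSummarizer)(
    seqOp = (summary, v) => summary.add(v),
    combOp = (s1, s2) => s1.merge(s2))
```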






[jira] [Updated] (SPARK-21259) More rules for scalastyle

2017-11-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-21259:
-
Affects Version/s: (was: 2.2.1)
   2.2.2

> More rules for scalastyle
> -
>
> Key: SPARK-21259
> URL: https://issues.apache.org/jira/browse/SPARK-21259
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.2
>Reporter: Gengliang Wang
>Priority: Minor
>
> During code review, we spent so much time on code style issues.
> It would be great if we add rules:
> 1) disallow space before colon
> 2) disallow space before right parentheses
> 3) disallow space after left parentheses






[jira] [Commented] (SPARK-22510) Exceptions caused by 64KB JVM bytecode limit

2017-11-13 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250839#comment-16250839
 ] 

Kazuaki Ishizaki commented on SPARK-22510:
--

[~smilegator] Thanks, good idea. I will add other JIRA entries.

> Exceptions caused by 64KB JVM bytecode limit 
> -
>
> Key: SPARK-22510
> URL: https://issues.apache.org/jira/browse/SPARK-22510
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> Codegen can throw an exception due to the 64KB JVM bytecode limit.
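For readers landing on this umbrella, a hedged illustration of the failure mode (assumes a SparkSession named spark; whether it actually fails depends on the Spark version and the shape of the generated code):

```scala
import org.apache.spark.sql.functions.col

// A query with a very large number of generated expressions can push a single
// generated Java method past the JVM's 64KB bytecode limit on affected versions,
// making Janino fail with a "grows beyond 64 KB" style error during codegen.
val base = spark.range(1).toDF("c0")
val wide = (1 to 3000).foldLeft(base) { (df, i) => df.withColumn(s"c$i", col("c0") + i) }
wide.collect()
```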






[jira] [Commented] (SPARK-10496) Efficient DataFrame cumulative sum

2017-11-13 Thread Thomas Han (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250802#comment-16250802
 ] 

Thomas Han commented on SPARK-10496:


Any updates on this?


> Efficient DataFrame cumulative sum
> --
>
> Key: SPARK-10496
> URL: https://issues.apache.org/jira/browse/SPARK-10496
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Goal: Given a DataFrame with a numeric column X, create a new column Y which 
> is the cumulative sum of X.
> This can be done with window functions, but it is not efficient for a large 
> number of rows.  It could be done more efficiently using a prefix sum/scan.
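For reference, the window-function formulation mentioned above looks roughly like this (hedged sketch; assumes a DataFrame `df` with an ordering column "id" and a numeric column "x"). Because the window is unpartitioned, Spark pulls all rows into a single partition, which is why it scales poorly:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

// Cumulative sum of "x" ordered by "id" via a window function.
val w = Window.orderBy("id").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val withCumSum = df.withColumn("y", sum(col("x")).over(w))
```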






[jira] [Assigned] (SPARK-22511) Update maven central repo address

2017-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22511:


Assignee: Apache Spark

> Update maven central repo address
> -
>
> Key: SPARK-22511
> URL: https://issues.apache.org/jira/browse/SPARK-22511
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Felix Cheung
>Assignee: Apache Spark
>
> As part of building 2.2.1, we hit an issue with Sonatype:
> https://issues.sonatype.org/browse/MVNCENTRAL-2870
> To work around it, we switched the address to repo.maven.apache.org in branch-2.2.
> We should decide whether to keep that or revert it after 2.2.1 is released.






[jira] [Commented] (SPARK-22511) Update maven central repo address

2017-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250729#comment-16250729
 ] 

Apache Spark commented on SPARK-22511:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/19742

> Update maven central repo address
> -
>
> Key: SPARK-22511
> URL: https://issues.apache.org/jira/browse/SPARK-22511
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Felix Cheung
>
> As part of building 2.2.1, we hit an issue with Sonatype:
> https://issues.sonatype.org/browse/MVNCENTRAL-2870
> To work around it, we switched the address to repo.maven.apache.org in branch-2.2.
> We should decide whether to keep that or revert it after 2.2.1 is released.






[jira] [Assigned] (SPARK-22511) Update maven central repo address

2017-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22511:


Assignee: (was: Apache Spark)

> Update maven central repo address
> -
>
> Key: SPARK-22511
> URL: https://issues.apache.org/jira/browse/SPARK-22511
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Felix Cheung
>
> As part of building 2.2.1, we hit an issue with Sonatype:
> https://issues.sonatype.org/browse/MVNCENTRAL-2870
> To work around it, we switched the address to repo.maven.apache.org in branch-2.2.
> We should decide whether to keep that or revert it after 2.2.1 is released.






[jira] [Assigned] (SPARK-14228) Lost executor of RPC disassociated, and occurs exception: Could not find CoarseGrainedScheduler or it has been stopped

2017-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14228:


Assignee: (was: Apache Spark)

> Lost executor of RPC disassociated, and occurs exception: Could not find 
> CoarseGrainedScheduler or it has been stopped
> --
>
> Key: SPARK-14228
> URL: https://issues.apache.org/jira/browse/SPARK-14228
> Project: Spark
>  Issue Type: Bug
>Reporter: meiyoula
>
> When I start 1000 executors and then stop the process, it calls
> SparkContext.stop to stop all executors. During this process, executors that
> have already been killed lose their RPC connection with the driver and try to
> reviveOffers, but cannot find CoarseGrainedScheduler because it has been stopped.
> {quote}
> 16/03/29 01:45:45 ERROR YarnScheduler: Lost executor 610 on 51-196-152-8: 
> remote Rpc client disassociated
> 16/03/29 01:45:45 ERROR Inbox: Ignoring error
> org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it 
> has been stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:161)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:131)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:173)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:398)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:314)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:482)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.removeExecutor(CoarseGrainedSchedulerBackend.scala:261)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$onDisconnected$1.apply(CoarseGrainedSchedulerBackend.scala:207)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$onDisconnected$1.apply(CoarseGrainedSchedulerBackend.scala:207)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.onDisconnected(CoarseGrainedSchedulerBackend.scala:207)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:144)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
>   at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {quote}






[jira] [Assigned] (SPARK-14228) Lost executor of RPC disassociated, and occurs exception: Could not find CoarseGrainedScheduler or it has been stopped

2017-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14228:


Assignee: Apache Spark

> Lost executor of RPC disassociated, and occurs exception: Could not find 
> CoarseGrainedScheduler or it has been stopped
> --
>
> Key: SPARK-14228
> URL: https://issues.apache.org/jira/browse/SPARK-14228
> Project: Spark
>  Issue Type: Bug
>Reporter: meiyoula
>Assignee: Apache Spark
>
> When I start 1000 executors and then stop the process, it calls
> SparkContext.stop to stop all executors. During this process, executors that
> have already been killed lose their RPC connection with the driver and try to
> reviveOffers, but cannot find CoarseGrainedScheduler because it has been stopped.
> {quote}
> 16/03/29 01:45:45 ERROR YarnScheduler: Lost executor 610 on 51-196-152-8: 
> remote Rpc client disassociated
> 16/03/29 01:45:45 ERROR Inbox: Ignoring error
> org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it 
> has been stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:161)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:131)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:173)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:398)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:314)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:482)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.removeExecutor(CoarseGrainedSchedulerBackend.scala:261)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$onDisconnected$1.apply(CoarseGrainedSchedulerBackend.scala:207)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$onDisconnected$1.apply(CoarseGrainedSchedulerBackend.scala:207)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.onDisconnected(CoarseGrainedSchedulerBackend.scala:207)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:144)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
>   at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {quote}






[jira] [Commented] (SPARK-14228) Lost executor of RPC disassociated, and occurs exception: Could not find CoarseGrainedScheduler or it has been stopped

2017-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250714#comment-16250714
 ] 

Apache Spark commented on SPARK-14228:
--

User 'devaraj-kavali' has created a pull request for this issue:
https://github.com/apache/spark/pull/19741

> Lost executor of RPC disassociated, and occurs exception: Could not find 
> CoarseGrainedScheduler or it has been stopped
> --
>
> Key: SPARK-14228
> URL: https://issues.apache.org/jira/browse/SPARK-14228
> Project: Spark
>  Issue Type: Bug
>Reporter: meiyoula
>
> When I start 1000 executors and then stop the process, it calls
> SparkContext.stop to stop all executors. During this process, executors that
> have already been killed lose their RPC connection with the driver and try to
> reviveOffers, but cannot find CoarseGrainedScheduler because it has been stopped.
> {quote}
> 16/03/29 01:45:45 ERROR YarnScheduler: Lost executor 610 on 51-196-152-8: 
> remote Rpc client disassociated
> 16/03/29 01:45:45 ERROR Inbox: Ignoring error
> org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it 
> has been stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:161)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:131)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:173)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:398)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:314)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:482)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.removeExecutor(CoarseGrainedSchedulerBackend.scala:261)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$onDisconnected$1.apply(CoarseGrainedSchedulerBackend.scala:207)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$onDisconnected$1.apply(CoarseGrainedSchedulerBackend.scala:207)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.onDisconnected(CoarseGrainedSchedulerBackend.scala:207)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:144)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
>   at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {quote}






[jira] [Commented] (SPARK-9104) expose network layer memory usage

2017-11-13 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250675#comment-16250675
 ] 

Saisai Shao commented on SPARK-9104:


[~vsr] I think SPARK-21934 already exposed Netty shuffle metrics to the metrics 
system; you can follow SPARK-21934 for the details.

For other Netty contexts like RPC, I don't feel strongly about supporting 
them, because memory usage is usually not heavy for a context like NettyRpcEnv.

> expose network layer memory usage
> -
>
> Key: SPARK-9104
> URL: https://issues.apache.org/jira/browse/SPARK-9104
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Zhang, Liye
>Assignee: Saisai Shao
> Fix For: 2.3.0
>
>
> The default network transport is Netty, and when transferring blocks for
> shuffle the network layer consumes a decent amount of memory. We should
> collect the memory usage of this part and expose it.






[jira] [Assigned] (SPARK-12375) VectorIndexer: allow unknown categories

2017-11-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-12375:
-

Assignee: Weichen Xu  (was: yuhao yang)

> VectorIndexer: allow unknown categories
> ---
>
> Key: SPARK-12375
> URL: https://issues.apache.org/jira/browse/SPARK-12375
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>
> Add option for allowing unknown categories, probably via a parameter like 
> "allowUnknownCategories."
> If true, then handle unknown categories during transform by assigning them to 
> an extra category index.
> The API should resemble the API used for StringIndexer.
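For comparison, a hedged example of the StringIndexer behavior the proposal wants to mirror (in recent Spark versions, handleInvalid set to "keep" assigns unseen labels to an extra index at transform time):

```scala
import org.apache.spark.ml.feature.StringIndexer

// Unseen labels encountered during transform get an extra "unknown" index instead
// of failing, which is the behavior the VectorIndexer proposal asks for.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("keep")
```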






[jira] [Commented] (SPARK-22504) Optimization in overwrite table in case of failure

2017-11-13 Thread xuchuanyin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250621#comment-16250621
 ] 

xuchuanyin commented on SPARK-22504:


[~srowen] thanks for your reply. My opinion is as follows:

how do you ensure the new table is cleaned up in case of failure?
A: We can just `drop` the temp tables in case of failure. As for 'how to 
make sure', I would have to say that it depends on the implementation of `drop`.

how do you make sure the old table is deleted?
A: The answer is the same as above.

what about the implications of having two of the tables' storage at once?
A: I think there is no impact other than having two copies of the data for a 
while, plus some metadata operations.

I think the current semantics are correct and as expected in case of a failure.
A: I still can't agree with this opinion.
`insert overwrite` differs from `insert` in that it truncates the target data; 
in other words, it replaces the old data with the new data. For such a 
replacement, the old data should only be dropped if the replace succeeds; if the 
replace fails, the old data should remain unchanged -- it tries to keep the 
operation (weakly) atomic.
Besides, in the test case I provided in the issue description, the failure of 
the overwrite causes the original data to go missing -- I don't think that's what 
the user wanted.

SparkSQL currently does not provide an `update` operation on a table. A user who 
wants to update an existing table can only read from it and overwrite it at 
the end. If SparkSQL won't support the above semantics, then it is the user's 
responsibility to keep the operation atomic.
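A hedged sketch of the write-then-swap pattern argued for above (names are illustrative; this is not Spark's implementation and it ignores concurrency and permission details):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Write to a temporary table first; the existing table is only dropped and replaced
// after the write has succeeded, so a failed load leaves the old data untouched.
def overwriteViaSwap(spark: SparkSession, df: DataFrame, table: String): Unit = {
  val tmp = s"${table}_overwrite_tmp"
  df.write.format("parquet").saveAsTable(tmp)   // may fail; the old table is still intact
  spark.sql(s"DROP TABLE IF EXISTS $table")     // only reached after a successful write
  spark.sql(s"ALTER TABLE $tmp RENAME TO $table")
}
```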


> Optimization in overwrite table in case of failure
> --
>
> Key: SPARK-22504
> URL: https://issues.apache.org/jira/browse/SPARK-22504
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: xuchuanyin
>
> Optimization in overwrite table in case of failure
> # SCENARIO
> Currently, the `Overwrite` operation in Spark is performed in the following steps:
> 1. DROP : drop the old table
> 2. WRITE: create and write data into the new table
> If a runtime error occurs in Step 2, the original table is lost
> along with its data -- I think this is a serious problem if someone
> performs `read-update-flushback` actions. The problem can be reproduced by the
> following code:
> ```scala
> 01: test("test spark df overwrite failed") {
> 02:   // prepare table
> 03:   val tableName = "test_spark_overwrite_failed"
> 04:   sql(s"DROP TABLE IF EXISTS $tableName")
> 05:   sql(s"CREATE TABLE IF NOT EXISTS $tableName ( field_int int, field_string String)" +
> 06:     s" STORED AS parquet").collect()
> 07:
> 08:   // load data first
> 09:   val schema = StructType(
> 10:     Seq(StructField("field_int", DataTypes.IntegerType, nullable = false),
> 11:       StructField("field_string", DataTypes.StringType, nullable = false)))
> 12:   val rdd1 = sqlContext.sparkContext.parallelize(
> 13:     Row(20, "q") ::
> 14:     Row(21, "qw") ::
> 15:     Row(23, "qwe") :: Nil)
> 16:   val dataFrame = sqlContext.createDataFrame(rdd1, schema)
> 17:   dataFrame.write.format("parquet").mode(SaveMode.Overwrite).saveAsTable(tableName)
> 18:   sql(s"SELECT * FROM $tableName").show()
> 19:
> 20:   // load data again, the following data will cause failure in data loading
> 21:   try {
> 22:     val rdd2 = sqlContext.sparkContext.parallelize(
> 23:       Row(31, "qwer") ::
> 24:       Row(null, "qwer") ::
> 25:       Row(32, "long_than_5") :: Nil)
> 26:     val dataFrame2 = sqlContext.createDataFrame(rdd2, schema)
> 27:
> 28:     dataFrame2.write.format("parquet").mode(SaveMode.Overwrite).saveAsTable(tableName)
> 29:   } catch {
> 30:     case e: Exception => LOGGER.error(e, "write overwrite failure")
> 31:   }
> 32:   // table `test_spark_overwrite_failed` has been dropped
> 33:   sql(s"show tables").show(20, truncate = false)
> 34:   // the content is empty even if table exists. We want it to be the same as
> 35:   sql(s"SELECT * FROM $tableName").show()
> 36: }
> ```
> In Line 24, we create a `null` element while the schema is `notnull` -- this
> causes a runtime error while loading the data.
> In Line 33, table `test_spark_overwrite_failed` has already been dropped and
> no longer exists in the catalog, so of course Line 35 will fail.
> Instead, we want Line 35 to show the original data, just as Line 18 does.
> # ANALYZE
> I am thinking of optimizing `overwrite` in Spark -- the goal is to keep the
> old data until the load has finished successfully. The old data can only be
> cleaned up when the load is successful.
> Since SparkSQL already supports a `rename` operation, we can optimize
> `overwrite` in the following steps:
> 1. 
> 1. 

[jira] [Commented] (SPARK-22451) Reduce decision tree aggregate size for unordered features from O(2^numCategories) to O(numCategories)

2017-11-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250612#comment-16250612
 ] 

Joseph K. Bradley commented on SPARK-22451:
---

Whoops yes I think you're right.  Funny we never realized this before!  :P  
This would be great to get in.

> Reduce decision tree aggregate size for unordered features from 
> O(2^numCategories) to O(numCategories)
> --
>
> Key: SPARK-22451
> URL: https://issues.apache.org/jira/browse/SPARK-22451
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We do not need to generate all possible splits for unordered features before
> aggregating.
> In the aggregation (executor side):
> 1. Change `mixedBinSeqOp` so that, for each unordered feature, we collect the same
> statistics as for ordered features. For unordered features we then only need
> O(numCategories) space per feature.
> 2. After the driver gets the aggregate result, generate all possible split
> combinations and compute the best split.
> This reduces the decision tree aggregate size for each unordered feature from
> O(2^numCategories) to O(numCategories), where `numCategories` is the arity of the
> unordered feature.
> It also reduces the CPU cost on the executor side, cutting the time complexity for
> such a feature from O(numPoints * 2^numCategories) to O(numPoints), and it does not
> increase the time complexity of computing the best split for unordered features on
> the driver side.
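A hedged illustration of the driver-side step (not Spark's aggregation code): keep one stat per category on the executors, then enumerate the 2^(numCategories-1) - 1 candidate subsets on the driver and sum the per-category stats for each one:

```scala
// Toy per-category statistic; Spark's real impurity stats carry more information.
final case class CategoryStat(count: Long, labelSum: Double)

// Enumerate left-side subsets that exclude the last category, which avoids counting
// each split and its mirror image twice: 2^(k-1) - 1 candidates for arity k.
def candidateSplits(perCategory: Array[CategoryStat]): Seq[(Set[Int], CategoryStat)] = {
  val k = perCategory.length
  (1 until (1 << (k - 1))).map { bits =>
    val left = (0 until k - 1).filter(i => ((bits >> i) & 1) == 1).toSet
    val leftStat = left.foldLeft(CategoryStat(0L, 0.0)) { (acc, i) =>
      CategoryStat(acc.count + perCategory(i).count, acc.labelSum + perCategory(i).labelSum)
    }
    (left, leftStat)
  }
}
```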






[jira] [Commented] (SPARK-22514) move ColumnVector.Array and ColumnarBatch.Row to individual files

2017-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250610#comment-16250610
 ] 

Apache Spark commented on SPARK-22514:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/19740

> move ColumnVector.Array and ColumnarBatch.Row to individual files
> -
>
> Key: SPARK-22514
> URL: https://issues.apache.org/jira/browse/SPARK-22514
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Assigned] (SPARK-22514) move ColumnVector.Array and ColumnarBatch.Row to individual files

2017-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22514:


Assignee: Wenchen Fan  (was: Apache Spark)

> move ColumnVector.Array and ColumnarBatch.Row to individual files
> -
>
> Key: SPARK-22514
> URL: https://issues.apache.org/jira/browse/SPARK-22514
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Assigned] (SPARK-22514) move ColumnVector.Array and ColumnarBatch.Row to individual files

2017-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22514:


Assignee: Apache Spark  (was: Wenchen Fan)

> move ColumnVector.Array and ColumnarBatch.Row to individual files
> -
>
> Key: SPARK-22514
> URL: https://issues.apache.org/jira/browse/SPARK-22514
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22514) move ColumnVector.Array and ColumnarBatch.Row to individual files

2017-11-13 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-22514:
---

 Summary: move ColumnVector.Array and ColumnarBatch.Row to 
individual files
 Key: SPARK-22514
 URL: https://issues.apache.org/jira/browse/SPARK-22514
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Commented] (SPARK-22471) SQLListener consumes much memory causing OutOfMemoryError

2017-11-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250589#comment-16250589
 ] 

Felix Cheung commented on SPARK-22471:
--

yes, RC1 has been cut.

> SQLListener consumes much memory causing OutOfMemoryError
> -
>
> Key: SPARK-22471
> URL: https://issues.apache.org/jira/browse/SPARK-22471
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0, Linux
>Reporter: Arseniy Tashoyan
>Assignee: Arseniy Tashoyan
>  Labels: memory-leak, sql
> Fix For: 2.2.2
>
> Attachments: SQLListener_retained_size.png, 
> SQLListener_stageIdToStageMetrics_retained_size.png
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> _SQLListener_ may grow very large when Spark runs complex multi-stage
> requests. The listener tracks metrics for all stages in the
> __stageIdToStageMetrics_ hash map. _SQLListener_ has some means of cleaning up
> this hash map regularly, but it is not enough. Specifically, the method
> _trimExecutionsIfNecessary_ ensures that __stageIdToStageMetrics_ does not
> hold metrics for very old data; this method runs on each execution completion.
> However, if an execution has many stages, _SQLListener_ keeps adding new
> entries to __stageIdToStageMetrics_ without calling
> _trimExecutionsIfNecessary_, and the hash map may grow to an enormous size.
> Strictly speaking, it is not a memory leak, because _trimExecutionsIfNecessary_
> eventually cleans the hash map. However, the driver program
> is very likely to crash with OutOfMemoryError (and it does).
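A hedged mitigation sketch while on an affected version (these configuration keys exist in Spark; how much they help depends on the workload):

```scala
import org.apache.spark.sql.SparkSession

// Lowering the retained execution/stage counts bounds how much the UI listeners,
// including SQLListener, keep in driver memory.
val spark = SparkSession.builder()
  .appName("bounded-ui-history")
  .config("spark.sql.ui.retainedExecutions", "100")
  .config("spark.ui.retainedStages", "500")
  .getOrCreate()
```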






[jira] [Updated] (SPARK-22042) ReorderJoinPredicates can break when child's partitioning is not decided

2017-11-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22042:
-
Target Version/s: 2.2.2  (was: 2.2.1)

> ReorderJoinPredicates can break when child's partitioning is not decided
> 
>
> Key: SPARK-22042
> URL: https://issues.apache.org/jira/browse/SPARK-22042
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Tejas Patil
>
> When `ReorderJoinPredicates` tries to get the `outputPartitioning` of its 
> children, the children may not be properly constructed as the child-subtree 
> has to still go through other planner rules.
> In this particular case, the child is `SortMergeJoinExec`. Since the required 
> `Exchange` operators are not in place (because `EnsureRequirements` runs 
> _after_ `ReorderJoinPredicates`), the join's children would not have 
> partitioning defined. This breaks while creating the `PartitioningCollection` 
> here : 
> https://github.com/apache/spark/blob/94439997d57875838a8283c543f9b44705d3a503/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L69
> Small repro:
> {noformat}
> context.sql("SET spark.sql.autoBroadcastJoinThreshold=0")
> val df = (0 until 50).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", "k")
> df.write.format("parquet").saveAsTable("table1")
> df.write.format("parquet").saveAsTable("table2")
> df.write.format("parquet").bucketBy(8, "j", "k").saveAsTable("bucketed_table")
> sql("""
>   SELECT *
>   FROM (
> SELECT a.i, a.j, a.k
> FROM bucketed_table a
> JOIN table1 b
> ON a.i = b.i
>   ) c
>   JOIN table2
>   ON c.i = table2.i
> """).explain
> {noformat}
> This fails with :
> {noformat}
> java.lang.IllegalArgumentException: requirement failed: 
> PartitioningCollection requires all of its partitionings have the same 
> numPartitions.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.catalyst.plans.physical.PartitioningCollection.<init>(partitioning.scala:324)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputPartitioning(SortMergeJoinExec.scala:69)
>   at 
> org.apache.spark.sql.execution.ProjectExec.outputPartitioning(basicPhysicalOperators.scala:82)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates$$anonfun$apply$1.applyOrElse(ReorderJoinPredicates.scala:91)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates$$anonfun$apply$1.applyOrElse(ReorderJoinPredicates.scala:76)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates.apply(ReorderJoinPredicates.scala:76)
>   at 
> org.apache.spark.sql.execution.joins.ReorderJoinPredicates.apply(ReorderJoinPredicates.scala:34)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:100)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:90)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:90)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:201)
>   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:201)
>   at 
> org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:114)
>   at 
> org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:201)
>   at 
> org.apache.spark.sql.execution.command.ExplainCommand.run(commands.scala:147)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:78)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:75)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:91)
>   at org.apache.spark.sql.Dataset.explain(Dataset.scala:464)
>   at org.apache.spark.sql.Dataset.explain(Dataset.scala:477)
>   ... 60 elided
> {noformat}




[jira] [Updated] (SPARK-22471) SQLListener consumes much memory causing OutOfMemoryError

2017-11-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22471:
-
Target Version/s: 2.2.2

> SQLListener consumes much memory causing OutOfMemoryError
> -
>
> Key: SPARK-22471
> URL: https://issues.apache.org/jira/browse/SPARK-22471
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0, Linux
>Reporter: Arseniy Tashoyan
>Assignee: Arseniy Tashoyan
>  Labels: memory-leak, sql
> Fix For: 2.2.2
>
> Attachments: SQLListener_retained_size.png, 
> SQLListener_stageIdToStageMetrics_retained_size.png
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> _SQLListener_ may grow very large when Spark runs complex multi-stage
> requests. The listener tracks metrics for all stages in the
> __stageIdToStageMetrics_ hash map. _SQLListener_ has some means of cleaning up
> this hash map regularly, but it is not enough. Specifically, the method
> _trimExecutionsIfNecessary_ ensures that __stageIdToStageMetrics_ does not
> hold metrics for very old data; this method runs on each execution completion.
> However, if an execution has many stages, _SQLListener_ keeps adding new
> entries to __stageIdToStageMetrics_ without calling
> _trimExecutionsIfNecessary_, and the hash map may grow to an enormous size.
> Strictly speaking, it is not a memory leak, because _trimExecutionsIfNecessary_
> eventually cleans the hash map. However, the driver program
> is very likely to crash with OutOfMemoryError (and it does).






[jira] [Updated] (SPARK-22471) SQLListener consumes much memory causing OutOfMemoryError

2017-11-13 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-22471:
-
Fix Version/s: (was: 2.2.1)
   2.2.2

> SQLListener consumes much memory causing OutOfMemoryError
> -
>
> Key: SPARK-22471
> URL: https://issues.apache.org/jira/browse/SPARK-22471
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0, Linux
>Reporter: Arseniy Tashoyan
>Assignee: Arseniy Tashoyan
>  Labels: memory-leak, sql
> Fix For: 2.2.2
>
> Attachments: SQLListener_retained_size.png, 
> SQLListener_stageIdToStageMetrics_retained_size.png
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> _SQLListener_ may grow very large when Spark runs complex multi-stage
> requests. The listener tracks metrics for all stages in the
> __stageIdToStageMetrics_ hash map. _SQLListener_ has some means of cleaning up
> this hash map regularly, but it is not enough. Specifically, the method
> _trimExecutionsIfNecessary_ ensures that __stageIdToStageMetrics_ does not
> hold metrics for very old data; this method runs on each execution completion.
> However, if an execution has many stages, _SQLListener_ keeps adding new
> entries to __stageIdToStageMetrics_ without calling
> _trimExecutionsIfNecessary_, and the hash map may grow to an enormous size.
> Strictly speaking, it is not a memory leak, because _trimExecutionsIfNecessary_
> eventually cleans the hash map. However, the driver program
> is very likely to crash with OutOfMemoryError (and it does).






[jira] [Resolved] (SPARK-21046) simplify the array offset and length in ColumnVector

2017-11-13 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21046.
-
Resolution: Not A Problem

> simplify the array offset and length in ColumnVector
> 
>
> Key: SPARK-21046
> URL: https://issues.apache.org/jira/browse/SPARK-21046
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>







[jira] [Reopened] (SPARK-21046) simplify the array offset and length in ColumnVector

2017-11-13 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-21046:
-

> simplify the array offset and length in ColumnVector
> 
>
> Key: SPARK-21046
> URL: https://issues.apache.org/jira/browse/SPARK-21046
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>







[jira] [Assigned] (SPARK-22513) Provide build profile for hadoop 2.8

2017-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22513:


Assignee: (was: Apache Spark)

> Provide build profile for hadoop 2.8
> 
>
> Key: SPARK-22513
> URL: https://issues.apache.org/jira/browse/SPARK-22513
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Christine Koppelt
>
> hadoop 2.8 comes with a patch which is necessary to make it run on NixOS [1]. 
> Therefore it would be cool to have a Spark version pre-built for Hadoop 2.8.
> [1] 
> https://github.com/apache/hadoop/commit/5231c527aaf19fb3f4bd59dcd2ab19bfb906d377#diff-19821342174c77119be4a99dc3f3618d






[jira] [Assigned] (SPARK-22513) Provide build profile for hadoop 2.8

2017-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22513:


Assignee: Apache Spark

> Provide build profile for hadoop 2.8
> 
>
> Key: SPARK-22513
> URL: https://issues.apache.org/jira/browse/SPARK-22513
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Christine Koppelt
>Assignee: Apache Spark
>
> hadoop 2.8 comes with a patch which is necessary to make it run on NixOS [1]. 
> Therefore it would be cool to have a Spark version pre-built for Hadoop 2.8.
> [1] 
> https://github.com/apache/hadoop/commit/5231c527aaf19fb3f4bd59dcd2ab19bfb906d377#diff-19821342174c77119be4a99dc3f3618d






[jira] [Commented] (SPARK-22513) Provide build profile for hadoop 2.8

2017-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250521#comment-16250521
 ] 

Apache Spark commented on SPARK-22513:
--

User 'cko' has created a pull request for this issue:
https://github.com/apache/spark/pull/19739

> Provide build profile for hadoop 2.8
> 
>
> Key: SPARK-22513
> URL: https://issues.apache.org/jira/browse/SPARK-22513
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Christine Koppelt
>
> hadoop 2.8 comes with a patch which is necessary to make it run on NixOS [1]. 
> Therefore it would be cool to have a Spark version pre-built for Hadoop 2.8.
> [1] 
> https://github.com/apache/hadoop/commit/5231c527aaf19fb3f4bd59dcd2ab19bfb906d377#diff-19821342174c77119be4a99dc3f3618d






[jira] [Commented] (SPARK-13127) Upgrade Parquet to 1.9 (Fixes parquet sorting)

2017-11-13 Thread Dong Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250512#comment-16250512
 ] 

Dong Jiang commented on SPARK-13127:


[~igozali], I think you are referring to this parquet ticket: 
https://issues.apache.org/jira/browse/PARQUET-686
The parquet ticket indicated the fix is in 1.9.0, so we still need Spark to 
upgrade parquet to 1.9.0
I have examined the parquet file generated by Spark 2.2, the string column 
doesn't have the min/max generated in the footer. I believe it is disabled.

> Upgrade Parquet to 1.9 (Fixes parquet sorting)
> --
>
> Key: SPARK-13127
> URL: https://issues.apache.org/jira/browse/SPARK-13127
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Justin Pihony
>
> Currently, when you write a sorted DataFrame to Parquet, then reading the 
> data back out is not sorted by default. [This is due to a bug in 
> Parquet|https://issues.apache.org/jira/browse/PARQUET-241] that was fixed in 
> 1.9.
> There is a workaround to read the file back in using a file glob (filepath/*).
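The glob workaround mentioned in the description, as a hedged one-liner (the path is illustrative; assumes a SparkSession named spark):

```scala
// Per the workaround described above: read the files back via a glob
// instead of the bare directory path.
val sorted = spark.read.parquet("/data/sorted_table/*")
```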






[jira] [Comment Edited] (SPARK-13127) Upgrade Parquet to 1.9 (Fixes parquet sorting)

2017-11-13 Thread Dong Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250512#comment-16250512
 ] 

Dong Jiang edited comment on SPARK-13127 at 11/13/17 11:56 PM:
---

[~igozali], I think you are referring to this parquet ticket: 
https://issues.apache.org/jira/browse/PARQUET-686
The parquet ticket indicated the fix is in 1.9.0, so we still need Spark to 
upgrade parquet to 1.9.0
I have examined the parquet file generated by Spark 2.2, the string column 
doesn't have the min/max generated in the footer. I believe it is disabled.
Do we have any progress on this issue? Will it be included in Spark 2.3?


was (Author: djiangxu):
[~igozali], I think you are referring to this parquet ticket: 
https://issues.apache.org/jira/browse/PARQUET-686
The parquet ticket indicated the fix is in 1.9.0, so we still need Spark to 
upgrade parquet to 1.9.0
I have examined the parquet file generated by Spark 2.2, the string column 
doesn't have the min/max generated in the footer. I believe it is disabled.

> Upgrade Parquet to 1.9 (Fixes parquet sorting)
> --
>
> Key: SPARK-13127
> URL: https://issues.apache.org/jira/browse/SPARK-13127
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Justin Pihony
>
> Currently, when you write a sorted DataFrame to Parquet, then reading the 
> data back out is not sorted by default. [This is due to a bug in 
> Parquet|https://issues.apache.org/jira/browse/PARQUET-241] that was fixed in 
> 1.9.
> There is a workaround to read the file back in using a file glob (filepath/*).






[jira] [Updated] (SPARK-21646) Add new type coercion rules to compatible with Hive

2017-11-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21646:

Target Version/s: 2.3.0

> Add new type coercion rules to compatible with Hive
> ---
>
> Key: SPARK-21646
> URL: https://issues.apache.org/jira/browse/SPARK-21646
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yuming Wang
> Attachments: Type_coercion_rules_to_compatible_with_Hive.pdf
>
>
> How to reproduce:
> hive:
> {code:sql}
> $ hive -S
> hive> create table spark_21646(c1 string, c2 string);
> hive> insert into spark_21646 values('92233720368547758071', 'a');
> hive> insert into spark_21646 values('21474836471', 'b');
> hive> insert into spark_21646 values('10', 'c');
> hive> select * from spark_21646 where c1 > 0;
> 92233720368547758071  a
> 10c
> 21474836471   b
> hive>
> {code}
> spark-sql:
> {code:sql}
> $ spark-sql -S
> spark-sql> select * from spark_21646 where c1 > 0;
> 10  c 
>   
> spark-sql> select * from spark_21646 where c1 > 0L;
> 21474836471   b
> 10c
> spark-sql> explain select * from spark_21646 where c1 > 0;
> == Physical Plan ==
> *Project [c1#14, c2#15]
> +- *Filter (isnotnull(c1#14) && (cast(c1#14 as int) > 0))
>+- *FileScan parquet spark_21646[c1#14,c2#15] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[viewfs://cluster4/user/hive/warehouse/spark_21646], 
> PartitionFilters: [], PushedFilters: [IsNotNull(c1)], ReadSchema: 
> struct<c1:string,c2:string>
> spark-sql> 
> {code}
> As you can see, Spark automatically casts c1 to int; if the value is out of integer
> range, the result differs from Hive.






[jira] [Updated] (SPARK-22469) Accuracy problem in comparison with string and numeric

2017-11-13 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22469:

Labels:   (was: release-notes)

> Accuracy problem in comparison with string and numeric 
> ---
>
> Key: SPARK-22469
> URL: https://issues.apache.org/jira/browse/SPARK-22469
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Lijia Liu
>
> {code:sql}
> select '1.5' > 0.5; // Result is NULL in Spark but is true in Hive.
> {code}
> IIUC, we can cast string as double like Hive.
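A hedged workaround sketch until the coercion behavior changes (assumes a SparkSession named spark): make the cast explicit so both sides of the comparison are numeric.

```scala
// Explicitly casting the string avoids the string-vs-numeric coercion that yields NULL here.
spark.sql("SELECT CAST('1.5' AS DOUBLE) > 0.5").show()  // the result column shows true
```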






[jira] [Commented] (SPARK-22513) Provide build profile for hadoop 2.8

2017-11-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250485#comment-16250485
 ] 

Sean Owen commented on SPARK-22513:
---

It's not required. It should work fine with 2.8 as-is if you use the hadoop-2.7 
profile.

> Provide build profile for hadoop 2.8
> 
>
> Key: SPARK-22513
> URL: https://issues.apache.org/jira/browse/SPARK-22513
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Christine Koppelt
>
> hadoop 2.8 comes with a patch which is necessary to make it run on NixOS [1]. 
> Therefore it would be cool to have a Spark version pre-built for Hadoop 2.8.
> [1] 
> https://github.com/apache/hadoop/commit/5231c527aaf19fb3f4bd59dcd2ab19bfb906d377#diff-19821342174c77119be4a99dc3f3618d



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22469) Accuracy problem in comparison with string and numeric

2017-11-13 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22469:

Labels: release-notes  (was: )

> Accuracy problem in comparison with string and numeric 
> ---
>
> Key: SPARK-22469
> URL: https://issues.apache.org/jira/browse/SPARK-22469
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Lijia Liu
>  Labels: release-notes
>
> {code:sql}
> select '1.5' > 0.5; // Result is NULL in Spark but is true in Hive.
> {code}
> IIUC, we can cast string as double like Hive.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22377) Maven nightly snapshot jenkins jobs are broken on multiple workers due to lsof

2017-11-13 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-22377:


Assignee: Hyukjin Kwon

> Maven nightly snapshot jenkins jobs are broken on multiple workers due to lsof
> --
>
> Key: SPARK-22377
> URL: https://issues.apache.org/jira/browse/SPARK-22377
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Xin Lu
>Assignee: Hyukjin Kwon
> Fix For: 2.1.3, 2.2.1, 2.3.0
>
>
> It looks like multiple workers in the amplab jenkins cannot execute lsof.  
> Example log below:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.1-maven-snapshots/182/console
> spark-build/dev/create-release/release-build.sh: line 344: lsof: command not 
> found
> usage: kill [ -s signal | -p ] [ -a ] pid ...
>kill -l [ signal ]
> I looked at the jobs and it looks like only  amp-jenkins-worker-01 works so 
> you are getting a successful build every week or so.  Unclear if the snapshot 
> is actually released.  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22377) Maven nightly snapshot jenkins jobs are broken on multiple workers due to lsof

2017-11-13 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22377.
--
   Resolution: Fixed
Fix Version/s: 2.1.3
   2.3.0
   2.2.1

Issue resolved by pull request 19695
[https://github.com/apache/spark/pull/19695]

> Maven nightly snapshot jenkins jobs are broken on multiple workers due to lsof
> --
>
> Key: SPARK-22377
> URL: https://issues.apache.org/jira/browse/SPARK-22377
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Xin Lu
> Fix For: 2.2.1, 2.3.0, 2.1.3
>
>
> It looks like multiple workers in the amplab jenkins cannot execute lsof.  
> Example log below:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.1-maven-snapshots/182/console
> spark-build/dev/create-release/release-build.sh: line 344: lsof: command not 
> found
> usage: kill [ -s signal | -p ] [ -a ] pid ...
>kill -l [ signal ]
> I looked at the jobs and it looks like only  amp-jenkins-worker-01 works so 
> you are getting a successful build every week or so.  Unclear if the snapshot 
> is actually released.  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22513) Provide build profile for hadoop 2.8

2017-11-13 Thread Christine Koppelt (JIRA)
Christine Koppelt created SPARK-22513:
-

 Summary: Provide build profile for hadoop 2.8
 Key: SPARK-22513
 URL: https://issues.apache.org/jira/browse/SPARK-22513
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.2.0
Reporter: Christine Koppelt


hadoop 2.8 comes with a patch which is necessary to make it run on NixOS [1]. 
Therefore it would be cool to have a Spark version pre-built for Hadoop 2.8.


[1] 
https://github.com/apache/hadoop/commit/5231c527aaf19fb3f4bd59dcd2ab19bfb906d377#diff-19821342174c77119be4a99dc3f3618d



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22509) Spark Streaming: jobs with same batch length all start at the same time, permit jobs to be offset

2017-11-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-22509.
--
Resolution: Not A Bug

I don't think it's worth doing such an improvement in Spark Streaming. Even if 
Spark Streaming submitted a Spark job at an offset as you describe, Spark doesn't 
guarantee when the job will actually run.

> Spark Streaming: jobs with same batch length all start at the same time, 
> permit jobs to be offset
> -
>
> Key: SPARK-22509
> URL: https://issues.apache.org/jira/browse/SPARK-22509
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Wallace Baggaley
>Priority: Minor
>
> Using Spark Streaming, a batch with a batch length of five minutes, for example, 
> will run precisely on the zeroes and fives (12:00, 12:05, 12:10, 12:15, etc.). It 
> would be beneficial for performance to permit running Spark jobs on offset minutes 
> (1s and 6s, or 2s and 7s) when so configured.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-22509) Spark Streaming: jobs with same batch length all start at the same time, permit jobs to be offset

2017-11-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reopened SPARK-22509:
--

> Spark Streaming: jobs with same batch length all start at the same time, 
> permit jobs to be offset
> -
>
> Key: SPARK-22509
> URL: https://issues.apache.org/jira/browse/SPARK-22509
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Wallace Baggaley
>Priority: Minor
>
> Using Spark Streaming, a batch with a batch length of five minutes, for example, 
> will run precisely on the zeroes and fives (12:00, 12:05, 12:10, 12:15, etc.). It 
> would be beneficial for performance to permit running Spark jobs on offset minutes 
> (1s and 6s, or 2s and 7s) when so configured.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22509) Spark Streaming: jobs with same batch length all start at the same time, permit jobs to be offset

2017-11-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-22509.
--
Resolution: Duplicate

> Spark Streaming: jobs with same batch length all start at the same time, 
> permit jobs to be offset
> -
>
> Key: SPARK-22509
> URL: https://issues.apache.org/jira/browse/SPARK-22509
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Wallace Baggaley
>Priority: Minor
>
> Using Spark Streaming, a batch with a batch length of five minutes, for example, 
> will run precisely on the zeroes and fives (12:00, 12:05, 12:10, 12:15, etc.). It 
> would be beneficial for performance to permit running Spark jobs on offset minutes 
> (1s and 6s, or 2s and 7s) when so configured.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22509) Spark Streaming: jobs with same batch length all start at the same time, permit jobs to be offset

2017-11-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-22509:
-
Component/s: (was: Structured Streaming)
 DStreams

> Spark Streaming: jobs with same batch length all start at the same time, 
> permit jobs to be offset
> -
>
> Key: SPARK-22509
> URL: https://issues.apache.org/jira/browse/SPARK-22509
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Wallace Baggaley
>Priority: Minor
>
> Using Spark Streaming, a batch with a batch length of five minutes, for example, 
> will run precisely on the zeroes and fives (12:00, 12:05, 12:10, 12:15, etc.). It 
> would be beneficial for performance to permit running Spark jobs on offset minutes 
> (1s and 6s, or 2s and 7s) when so configured.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22509) Spark Streaming: jobs with same batch length all start at the same time, permit jobs to be offset

2017-11-13 Thread Wallace Baggaley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wallace Baggaley updated SPARK-22509:
-
Description: Using Spark Streaming, a batch with batch length of five for 
example will run precisely on the zeroes and fives. (12:00, 12:05, 12:10, 
12:15, etc.) Would be beneficial for performance to permit running spark jobs 
on offset minutes (1s and 6s or 2s and 7s), if configured so to do.  (was: 
Using Spark Streaming, a batch will run precisely on the zeroes and fives. 
Would be beneficial for performance to permit running spark jobs on offset 
minutes (1s and 6s or 2s and 7s), if configured so to do.)

> Spark Streaming: jobs with same batch length all start at the same time, 
> permit jobs to be offset
> -
>
> Key: SPARK-22509
> URL: https://issues.apache.org/jira/browse/SPARK-22509
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Wallace Baggaley
>Priority: Minor
>
> Using Spark Streaming, a batch with a batch length of five minutes, for example, 
> will run precisely on the zeroes and fives (12:00, 12:05, 12:10, 12:15, etc.). It 
> would be beneficial for performance to permit running Spark jobs on offset minutes 
> (1s and 6s, or 2s and 7s) when so configured.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22509) Spark Streaming: jobs with same batch length all start at the same time, permit jobs to be offset

2017-11-13 Thread Wallace Baggaley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wallace Baggaley updated SPARK-22509:
-
Summary: Spark Streaming: jobs with same batch length all start at the same 
time, permit jobs to be offset  (was: Spark Streaming: jobs with 5 minute batch 
length all start at the same time, permit jobs to be offset)

> Spark Streaming: jobs with same batch length all start at the same time, 
> permit jobs to be offset
> -
>
> Key: SPARK-22509
> URL: https://issues.apache.org/jira/browse/SPARK-22509
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Wallace Baggaley
>Priority: Minor
>
> Using Spark Streaming, a batch will run precisely on the zeroes and fives. It 
> would be beneficial for performance to permit running Spark jobs on offset 
> minutes (1s and 6s, or 2s and 7s) when so configured.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22512) How do we send UUID in spark dataset (using Java) to postgreSQL

2017-11-13 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-22512.

Resolution: Invalid

Please use the mailing lists for questions.
http://spark.apache.org/community.html

> How do we send UUID in spark dataset (using Java) to postgreSQL
> ---
>
> Key: SPARK-22512
> URL: https://issues.apache.org/jira/browse/SPARK-22512
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Abhijit Parasnis
>
> We have a PostgreSQL table which has UUID as one of the columns. How do we 
> send a UUID field in a Spark dataset (using Java) to the PostgreSQL DB?
> We are not able to find a uuid type in org.apache.spark.sql.types.DataTypes.
> Please advise.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22512) How do we send UUID in spark dataset (using Java) to postgreSQL

2017-11-13 Thread Abhijit Parasnis (JIRA)
Abhijit Parasnis created SPARK-22512:


 Summary: How do we send UUID in spark dataset (using Java) to 
postgreSQL
 Key: SPARK-22512
 URL: https://issues.apache.org/jira/browse/SPARK-22512
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.2.0
Reporter: Abhijit Parasnis


We have a PostgreSQL table which has UUID as one of the columns. How do we send 
a UUID field in a Spark dataset (using Java) to the PostgreSQL DB?

We are not able to find a uuid type in org.apache.spark.sql.types.DataTypes.

Please advise.
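
One common approach (an assumption on my part, not from the ticket) is to carry the UUID as a string column and let the PostgreSQL JDBC driver cast it by appending stringtype=unspecified to the JDBC URL. A minimal sketch; all connection details are hypothetical and the PostgreSQL driver jar must be on the classpath:

{code:python}
# Sketch only; connection details are hypothetical. Carry the UUID as a string column
# and let the PostgreSQL JDBC driver cast it via stringtype=unspecified on the URL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("123e4567-e89b-12d3-a456-426614174000", "alice")],
    ["id", "name"])

(df.write
   .format("jdbc")
   .option("url", "jdbc:postgresql://dbhost:5432/mydb?stringtype=unspecified")
   .option("dbtable", "public.users")
   .option("user", "spark")
   .option("password", "secret")
   .mode("append")
   .save())
{code}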



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22471) SQLListener consumes much memory causing OutOfMemoryError

2017-11-13 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250343#comment-16250343
 ] 

Marcelo Vanzin commented on SPARK-22471:


If RC1 has been cut then there probably should be a 2.2.2 version in jira.

> SQLListener consumes much memory causing OutOfMemoryError
> -
>
> Key: SPARK-22471
> URL: https://issues.apache.org/jira/browse/SPARK-22471
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0, Linux
>Reporter: Arseniy Tashoyan
>Assignee: Arseniy Tashoyan
>  Labels: memory-leak, sql
> Fix For: 2.2.1
>
> Attachments: SQLListener_retained_size.png, 
> SQLListener_stageIdToStageMetrics_retained_size.png
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> _SQLListener_ may grow very large when Spark runs complex multi-stage 
> requests. The listener tracks metrics for all stages in the 
> __stageIdToStageMetrics_ hash map. _SQLListener_ has some means to clean up 
> this hash map regularly, but this is not enough. Precisely, the method 
> _trimExecutionsIfNecessary_ ensures that __stageIdToStageMetrics_ does not 
> keep metrics for very old data; this method runs on each execution completion.
> However, if an execution has many stages, _SQLListener_ keeps adding new 
> entries to __stageIdToStageMetrics_ without calling 
> _trimExecutionsIfNecessary_. The hash map may grow to an enormous size.
> Strictly speaking, it is not a memory leak, because _trimExecutionsIfNecessary_ 
> eventually cleans the hash map. However, the driver program is very likely to 
> crash with an OutOfMemoryError (and it does).
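
As a possible mitigation until the fix lands (an assumption, not the change from the pull request), the number of retained SQL executions can be lowered via spark.sql.ui.retainedExecutions:

{code:python}
# Mitigation sketch, not the fix from the pull request: retain fewer SQL executions
# so the listener's maps are trimmed more aggressively (the default is 1000).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.ui.retainedExecutions", "100")
         .getOrCreate())
{code}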



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22471) SQLListener consumes much memory causing OutOfMemoryError

2017-11-13 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250332#comment-16250332
 ] 

Dongjoon Hyun commented on SPARK-22471:
---

Thank you for merging, [~vanzin].

Although this is late for RC1, ping [~felixcheung] anyway.

> SQLListener consumes much memory causing OutOfMemoryError
> -
>
> Key: SPARK-22471
> URL: https://issues.apache.org/jira/browse/SPARK-22471
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0, Linux
>Reporter: Arseniy Tashoyan
>Assignee: Arseniy Tashoyan
>  Labels: memory-leak, sql
> Fix For: 2.2.1
>
> Attachments: SQLListener_retained_size.png, 
> SQLListener_stageIdToStageMetrics_retained_size.png
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> _SQLListener_ may grow very large when Spark runs complex multi-stage 
> requests. The listener tracks metrics for all stages in the 
> __stageIdToStageMetrics_ hash map. _SQLListener_ has some means to clean up 
> this hash map regularly, but this is not enough. Precisely, the method 
> _trimExecutionsIfNecessary_ ensures that __stageIdToStageMetrics_ does not 
> keep metrics for very old data; this method runs on each execution completion.
> However, if an execution has many stages, _SQLListener_ keeps adding new 
> entries to __stageIdToStageMetrics_ without calling 
> _trimExecutionsIfNecessary_. The hash map may grow to an enormous size.
> Strictly speaking, it is not a memory leak, because _trimExecutionsIfNecessary_ 
> eventually cleans the hash map. However, the driver program is very likely to 
> crash with an OutOfMemoryError (and it does).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22471) SQLListener consumes much memory causing OutOfMemoryError

2017-11-13 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-22471:
--

Assignee: Arseniy Tashoyan

> SQLListener consumes much memory causing OutOfMemoryError
> -
>
> Key: SPARK-22471
> URL: https://issues.apache.org/jira/browse/SPARK-22471
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0, Linux
>Reporter: Arseniy Tashoyan
>Assignee: Arseniy Tashoyan
>  Labels: memory-leak, sql
> Fix For: 2.2.1
>
> Attachments: SQLListener_retained_size.png, 
> SQLListener_stageIdToStageMetrics_retained_size.png
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> _SQLListener_ may grow very large when Spark runs complex multi-stage 
> requests. The listener tracks metrics for all stages in the 
> __stageIdToStageMetrics_ hash map. _SQLListener_ has some means to clean up 
> this hash map regularly, but this is not enough. Precisely, the method 
> _trimExecutionsIfNecessary_ ensures that __stageIdToStageMetrics_ does not 
> keep metrics for very old data; this method runs on each execution completion.
> However, if an execution has many stages, _SQLListener_ keeps adding new 
> entries to __stageIdToStageMetrics_ without calling 
> _trimExecutionsIfNecessary_. The hash map may grow to an enormous size.
> Strictly speaking, it is not a memory leak, because _trimExecutionsIfNecessary_ 
> eventually cleans the hash map. However, the driver program is very likely to 
> crash with an OutOfMemoryError (and it does).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22471) SQLListener consumes much memory causing OutOfMemoryError

2017-11-13 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-22471.

   Resolution: Fixed
Fix Version/s: 2.2.1

Issue resolved by pull request 19711
[https://github.com/apache/spark/pull/19711]

> SQLListener consumes much memory causing OutOfMemoryError
> -
>
> Key: SPARK-22471
> URL: https://issues.apache.org/jira/browse/SPARK-22471
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0, Linux
>Reporter: Arseniy Tashoyan
>  Labels: memory-leak, sql
> Fix For: 2.2.1
>
> Attachments: SQLListener_retained_size.png, 
> SQLListener_stageIdToStageMetrics_retained_size.png
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> _SQLListener_ may grow very large when Spark runs complex multi-stage 
> requests. The listener tracks metrics for all stages in the 
> __stageIdToStageMetrics_ hash map. _SQLListener_ has some means to clean up 
> this hash map regularly, but this is not enough. Precisely, the method 
> _trimExecutionsIfNecessary_ ensures that __stageIdToStageMetrics_ does not 
> keep metrics for very old data; this method runs on each execution completion.
> However, if an execution has many stages, _SQLListener_ keeps adding new 
> entries to __stageIdToStageMetrics_ without calling 
> _trimExecutionsIfNecessary_. The hash map may grow to an enormous size.
> Strictly speaking, it is not a memory leak, because _trimExecutionsIfNecessary_ 
> eventually cleans the hash map. However, the driver program is very likely to 
> crash with an OutOfMemoryError (and it does).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22511) Update maven central repo address

2017-11-13 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-22511:


 Summary: Update maven central repo address
 Key: SPARK-22511
 URL: https://issues.apache.org/jira/browse/SPARK-22511
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.2.1
Reporter: Felix Cheung


As a part of building 2.2.1, we hit an issue with Sonatype:
https://issues.sonatype.org/browse/MVNCENTRAL-2870
To work around it, we switched the address to repo.maven.apache.org in branch-2.2.

We should decide whether to keep that or revert after 2.2.1 is released.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22510) Exceptions caused by 64KB JVM bytecode limit

2017-11-13 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250292#comment-16250292
 ] 

Xiao Li edited comment on SPARK-22510 at 11/13/17 9:42 PM:
---

[~kiszk] Could you just add the new subtasks regarding 64KB JVM limit under 
this umbrella ticket? It can help us manage the issues. Thanks!


was (Author: smilegator):
[~kiszk] Could you just add the new subtasks under this umbrella ticket? It can 
help us manage the issues. Thanks!

> Exceptions caused by 64KB JVM bytecode limit 
> -
>
> Key: SPARK-22510
> URL: https://issues.apache.org/jira/browse/SPARK-22510
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> Codegen can throw an exception due to the 64KB JVM bytecode limit.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22510) Exceptions caused by 64KB JVM bytecode limit

2017-11-13 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250292#comment-16250292
 ] 

Xiao Li commented on SPARK-22510:
-

[~kiszk] Could you just add the new subtasks under this umbrella ticket? It can 
help us manage the issues. Thanks!

> Exceptions caused by 64KB JVM bytecode limit 
> -
>
> Key: SPARK-22510
> URL: https://issues.apache.org/jira/browse/SPARK-22510
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> Codegen can throw an exception due to the 64KB JVM bytecode limit.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21720) Filter predicate with many conditions throw stackoverflow error

2017-11-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21720:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-22510

> Filter predicate with many conditions throw stackoverflow error
> ---
>
> Key: SPARK-21720
> URL: https://issues.apache.org/jira/browse/SPARK-21720
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: srinivasan
>Assignee: Kazuaki Ishizaki
> Fix For: 2.2.1, 2.3.0
>
>
> When trying to filter a dataset with many predicate conditions, using either 
> Spark SQL or the dataset filter transformation as described below, Spark throws 
> a stack overflow exception.
> Case 1: Filter Transformation on Data
> Dataset filter = sourceDataset.filter(String.format("not(%s)", 
> buildQuery()));
> filter.show();
> where buildQuery() returns
> Field1 = "" and  Field2 = "" and  Field3 = "" and  Field4 = "" and  Field5 = 
> "" and  BLANK_5 = "" and  Field7 = "" and  Field8 = "" and  Field9 = "" and  
> Field10 = "" and  Field11 = "" and  Field12 = "" and  Field13 = "" and  
> Field14 = "" and  Field15 = "" and  Field16 = "" and  Field17 = "" and  
> Field18 = "" and  Field19 = "" and  Field20 = "" and  Field21 = "" and  
> Field22 = "" and  Field23 = "" and  Field24 = "" and  Field25 = "" and  
> Field26 = "" and  Field27 = "" and  Field28 = "" and  Field29 = "" and  
> Field30 = "" and  Field31 = "" and  Field32 = "" and  Field33 = "" and  
> Field34 = "" and  Field35 = "" and  Field36 = "" and  Field37 = "" and  
> Field38 = "" and  Field39 = "" and  Field40 = "" and  Field41 = "" and  
> Field42 = "" and  Field43 = "" and  Field44 = "" and  Field45 = "" and  
> Field46 = "" and  Field47 = "" and  Field48 = "" and  Field49 = "" and  
> Field50 = "" and  Field51 = "" and  Field52 = "" and  Field53 = "" and  
> Field54 = "" and  Field55 = "" and  Field56 = "" and  Field57 = "" and  
> Field58 = "" and  Field59 = "" and  Field60 = "" and  Field61 = "" and  
> Field62 = "" and  Field63 = "" and  Field64 = "" and  Field65 = "" and  
> Field66 = "" and  Field67 = "" and  Field68 = "" and  Field69 = "" and  
> Field70 = "" and  Field71 = "" and  Field72 = "" and  Field73 = "" and  
> Field74 = "" and  Field75 = "" and  Field76 = "" and  Field77 = "" and  
> Field78 = "" and  Field79 = "" and  Field80 = "" and  Field81 = "" and  
> Field82 = "" and  Field83 = "" and  Field84 = "" and  Field85 = "" and  
> Field86 = "" and  Field87 = "" and  Field88 = "" and  Field89 = "" and  
> Field90 = "" and  Field91 = "" and  Field92 = "" and  Field93 = "" and  
> Field94 = "" and  Field95 = "" and  Field96 = "" and  Field97 = "" and  
> Field98 = "" and  Field99 = "" and  Field100 = "" and  Field101 = "" and  
> Field102 = "" and  Field103 = "" and  Field104 = "" and  Field105 = "" and  
> Field106 = "" and  Field107 = "" and  Field108 = "" and  Field109 = "" and  
> Field110 = "" and  Field111 = "" and  Field112 = "" and  Field113 = "" and  
> Field114 = "" and  Field115 = "" and  Field116 = "" and  Field117 = "" and  
> Field118 = "" and  Field119 = "" and  Field120 = "" and  Field121 = "" and  
> Field122 = "" and  Field123 = "" and  Field124 = "" and  Field125 = "" and  
> Field126 = "" and  Field127 = "" and  Field128 = "" and  Field129 = "" and  
> Field130 = "" and  Field131 = "" and  Field132 = "" and  Field133 = "" and  
> Field134 = "" and  Field135 = "" and  Field136 = "" and  Field137 = "" and  
> Field138 = "" and  Field139 = "" and  Field140 = "" and  Field141 = "" and  
> Field142 = "" and  Field143 = "" and  Field144 = "" and  Field145 = "" and  
> Field146 = "" and  Field147 = "" and  Field148 = "" and  Field149 = "" and  
> Field150 = "" and  Field151 = "" and  Field152 = "" and  Field153 = "" and  
> Field154 = "" and  Field155 = "" and  Field156 = "" and  Field157 = "" and  
> Field158 = "" and  Field159 = "" and  Field160 = "" and  Field161 = "" and  
> Field162 = "" and  Field163 = "" and  Field164 = "" and  Field165 = "" and  
> Field166 = "" and  Field167 = "" and  Field168 = "" and  Field169 = "" and  
> Field170 = "" and  Field171 = "" and  Field172 = "" and  Field173 = "" and  
> Field174 = "" and  Field175 = "" and  Field176 = "" and  Field177 = "" and  
> Field178 = "" and  Field179 = "" and  Field180 = "" and  Field181 = "" and  
> Field182 = "" and  Field183 = "" and  Field184 = "" and  Field185 = "" and  
> Field186 = "" and  Field187 = "" and  Field188 = "" and  Field189 = "" and  
> Field190 = "" and  Field191 = "" and  Field192 = "" and  Field193 = "" and  
> Field194 = "" and  Field195 = "" and  Field196 = "" and  Field197 = "" and  
> Field198 = "" and  Field199 = "" and  Field200 = "" and  Field201 = "" and  
> Field202 = "" and  Field203 = "" and  

[jira] [Updated] (SPARK-22494) Coalesce and AtLeastNNonNulls can cause 64KB JVM bytecode limit exception

2017-11-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22494:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-22510

> Coalesce and AtLeastNNonNulls can cause 64KB JVM bytecode limit exception
> -
>
> Key: SPARK-22494
> URL: https://issues.apache.org/jira/browse/SPARK-22494
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>
> Both `Coalesce` and `AtLeastNNonNulls` can cause the 64KB limit exception 
> when used with a lot of arguments and/or complex expressions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22498) 64KB JVM bytecode limit problem with concat and concat_ws

2017-11-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22498:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-22510

> 64KB JVM bytecode limit problem with concat and concat_ws
> -
>
> Key: SPARK-22498
> URL: https://issues.apache.org/jira/browse/SPARK-22498
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> Both {{concat}} and {{concat_ws}} can throw an exception due to the 64KB JVM 
> bytecode limit when they are used with a lot of arguments.
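
A hedged repro sketch; the column count is arbitrary and whether it actually fails depends on the Spark version:

{code:python}
# Repro sketch only; whether it actually fails depends on the argument count and the
# Spark version. A concat over a very large number of columns can exceed the limit.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

n = 3000  # hypothetical, chosen only to make the generated method very large
wide = spark.range(1).selectExpr(*["cast(id as string) as c%d" % i for i in range(n)])
wide.selectExpr("concat(%s) as s" % ", ".join("c%d" % i for i in range(n))).collect()
{code}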



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22510) Exceptions caused by 64KB JVM bytecode limit

2017-11-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22510:

Summary: Exceptions caused by 64KB JVM bytecode limit   (was: 64KB JVM 
bytecode limit )

> Exceptions caused by 64KB JVM bytecode limit 
> -
>
> Key: SPARK-22510
> URL: https://issues.apache.org/jira/browse/SPARK-22510
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> Codegen can throw an exception due to the 64KB JVM bytecode limit.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22499) 64KB JVM bytecode limit problem with least and greatest

2017-11-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22499:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-22510

> 64KB JVM bytecode limit problem with least and greatest
> ---
>
> Key: SPARK-22499
> URL: https://issues.apache.org/jira/browse/SPARK-22499
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> Both {{least}} and {{greatest}} can throw an exception due to the 64KB JVM 
> bytecode limit when they are used with a lot of arguments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22500) 64KB JVM bytecode limit problem with cast

2017-11-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22500:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-22510

> 64KB JVM bytecode limit problem with cast
> -
>
> Key: SPARK-22500
> URL: https://issues.apache.org/jira/browse/SPARK-22500
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> {{Cast}} can throw an exception due to the 64KB JVM bytecode limit when it is 
> used with a lot of struct fields.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22508) 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create()

2017-11-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22508:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-22510

> 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create()
> -
>
> Key: SPARK-22508
> URL: https://issues.apache.org/jira/browse/SPARK-22508
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> {{GenerateUnsafeRowJoiner.create()}} can throw an exception due to the 64KB 
> JVM bytecode limit when it is used with a schema that has a lot of fields.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22501) 64KB JVM bytecode limit problem with in

2017-11-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22501:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-22510

> 64KB JVM bytecode limit problem with in
> ---
>
> Key: SPARK-22501
> URL: https://issues.apache.org/jira/browse/SPARK-22501
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> {{In}} can throw an exception due to the 64KB JVM bytecode limit when it is 
> used with a lot of arguments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22510) 64KB JVM bytecode limit

2017-11-13 Thread Xiao Li (JIRA)
Xiao Li created SPARK-22510:
---

 Summary: 64KB JVM bytecode limit 
 Key: SPARK-22510
 URL: https://issues.apache.org/jira/browse/SPARK-22510
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 2.2.0
Reporter: Xiao Li


Codegen can throw an exception due to the 64KB JVM bytecode limit.
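
As a possible stop-gap while the individual subtasks are fixed (an assumption on my part, not from the ticket), whole-stage codegen can be disabled so affected queries fall back to slower execution paths; this does not cover generators that run outside whole-stage codegen:

{code:python}
# Stop-gap sketch only: disabling whole-stage codegen sidesteps some (not all) of these
# failures at a performance cost; non-whole-stage generators are unaffected by it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.codegen.wholeStage", "false")
{code}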




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22509) Spark Streaming: jobs with 5 minute batch length all start at the same time, permit jobs to be offset

2017-11-13 Thread Wallace Baggaley (JIRA)
Wallace Baggaley created SPARK-22509:


 Summary: Spark Streaming: jobs with 5 minute batch length all 
start at the same time, permit jobs to be offset
 Key: SPARK-22509
 URL: https://issues.apache.org/jira/browse/SPARK-22509
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Wallace Baggaley
Priority: Minor


Using Spark Streaming, a batch will run precisely on the zeroes and fives. It 
would be beneficial for performance to permit running Spark jobs on offset 
minutes (1s and 6s, or 2s and 7s) when so configured.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22495) Fix setup of SPARK_HOME variable on Windows

2017-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22495:


Assignee: Apache Spark

> Fix setup of SPARK_HOME variable on Windows
> ---
>
> Key: SPARK-22495
> URL: https://issues.apache.org/jira/browse/SPARK-22495
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Windows
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> On Windows, pip-installed PySpark is unable to find the Spark home. There is 
> already a proposed change, with sufficient details and discussion, in 
> https://github.com/apache/spark/pull/19370 and SPARK-18136
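
A workaround sketch under the assumption that setting SPARK_HOME explicitly before importing pyspark avoids the failing lookup; the path below is hypothetical, not from the ticket:

{code:python}
# Workaround sketch; the path is hypothetical. Point SPARK_HOME at the
# pip-installed package before pyspark tries to resolve it.
import os
os.environ["SPARK_HOME"] = r"C:\Python36\Lib\site-packages\pyspark"

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
{code}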



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22495) Fix setup of SPARK_HOME variable on Windows

2017-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22495:


Assignee: (was: Apache Spark)

> Fix setup of SPARK_HOME variable on Windows
> ---
>
> Key: SPARK-22495
> URL: https://issues.apache.org/jira/browse/SPARK-22495
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Windows
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> On Windows, pip-installed PySpark is unable to find the Spark home. There is 
> already a proposed change, with sufficient details and discussion, in 
> https://github.com/apache/spark/pull/19370 and SPARK-18136



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22495) Fix setup of SPARK_HOME variable on Windows

2017-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250137#comment-16250137
 ] 

Apache Spark commented on SPARK-22495:
--

User 'jsnowacki' has created a pull request for this issue:
https://github.com/apache/spark/pull/19370

> Fix setup of SPARK_HOME variable on Windows
> ---
>
> Key: SPARK-22495
> URL: https://issues.apache.org/jira/browse/SPARK-22495
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Windows
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> On Windows, pip-installed PySpark is unable to find the Spark home. There is 
> already a proposed change, with sufficient details and discussion, in 
> https://github.com/apache/spark/pull/19370 and SPARK-18136



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22505) toDF() / createDataFrame() type inference doesn't work as expected

2017-11-13 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250054#comment-16250054
 ] 

Ruslan Dautkhanov edited comment on SPARK-22505 at 11/13/17 7:47 PM:
-

Looks like we already discussed very similar topic 1.5 years ago on github )
https://github.com/databricks/spark-csv/issues/264#issuecomment-184943114
Any chance this can be added as a core Spark functionality? 
Not sure if we can even call that CsvParser().csvRdd from pySpark.. 

This is what I am asking exactly  
support for transforming dataframe to RDD[String] #188
https://github.com/databricks/spark-csv/commit/2eb90153a2d6a77b9cde4aee3f6e382df3da1746

I don't see CsvRdd from spark-csv module anywhere in Spark codebase 
https://github.com/databricks/spark-csv/commit/2eb90153a2d6a77b9cde4aee3f6e382df3da1746#diff-c6f09c5a3e6aedc2e6bfb1c16358e970R123

Did it make its way into Spark?

What I see is only private class CSVInferSchema
https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L29

Any other way to achieve this? Thanks a lot for any leads.


was (Author: tagar):
Looks like we already discussed very similar topic 1.5 years ago on github )
https://github.com/databricks/spark-csv/issues/264#issuecomment-184943114
Any chance this can be added as a core Spark functionality? 
Not sure if we can even call that CsvParser().csvRdd from pySpark.. 

> toDF() / createDataFrame() type inference doesn't work as expected
> --
>
> Key: SPARK-22505
> URL: https://issues.apache.org/jira/browse/SPARK-22505
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.2.0
>Reporter: Ruslan Dautkhanov
>  Labels: csvparser, inference, pyspark, schema, spark-sql
>
> {code}
> df = 
> sc.parallelize([('1','a'),('2','b'),('3','c')]).toDF(['should_be_int','should_be_str'])
> df.printSchema()
> {code}
> produces
> {noformat}
> root
>  |-- should_be_int: string (nullable = true)
>  |-- should_be_str: string (nullable = true)
> {noformat}
> Notice `should_be_int` has `string` datatype, according to documentation:
> https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
> {quote}
> Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the 
> datatypes. Rows are constructed by passing a list of key/value pairs as 
> kwargs to the Row class. The keys of this list define the column names of the 
> table, *and the types are inferred by sampling the whole dataset*, similar to 
> the inference that is performed on JSON files.
> {quote}
> Schema inference works as expected when reading delimited files like
> {code}
> spark.read.format('csv').option('inferSchema', True)...
> {code}
> but not when using toDF() / createDataFrame() API calls.
> Spark 2.2.
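
A minimal sketch of the two usual alternatives (an illustration, not from the report): convert the values to the intended Python types, or pass an explicit schema, since toDF()/createDataFrame() infer from the Python object types rather than parsing string contents:

{code:python}
# Sketch only: toDF()/createDataFrame() infer from the Python object types, so pass
# ints as ints, or supply an explicit schema instead of relying on inference.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([('1', 'a'), ('2', 'b'), ('3', 'c')])
schema = StructType([
    StructField("should_be_int", IntegerType(), True),
    StructField("should_be_str", StringType(), True),
])
spark.createDataFrame(rdd.map(lambda r: (int(r[0]), r[1])), schema).printSchema()
{code}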



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself

2017-11-13 Thread Srinivasa Reddy Vundela (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250066#comment-16250066
 ] 

Srinivasa Reddy Vundela commented on SPARK-21994:
-

[~srowen] That's right, it is not available in a public release yet. I just 
posted it for reference.

> Spark 2.2 can not read Parquet table created by itself
> --
>
> Key: SPARK-21994
> URL: https://issues.apache.org/jira/browse/SPARK-21994
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2 on Cloudera CDH 5.10.1, Hive 1.1
>Reporter: Jurgis Pods
>
> This seems to be a new bug introduced in Spark 2.2, since it did not occur 
> under Spark 2.1.
> When writing a dataframe to a table in Parquet format, Spark SQL does not 
> write the 'path' of the table to the Hive metastore, unlike in previous 
> versions.
> As a consequence, Spark 2.2 is not able to read the table it just created. It 
> just outputs the table header without any row content. 
> A parallel installation of Spark 1.6 at least produces an appropriate error 
> trace:
> {code:java}
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found 
> in metastore. hive.metastore.schema.verification is not enabled so recording 
> the schema version 1.1.0
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default, 
> returning NoSuchObjectException
> org.spark-project.guava.util.concurrent.UncheckedExecutionException: 
> java.util.NoSuchElementException: key not found: path
> [...]
> {code}
> h3. Steps to reproduce:
> Run the following in spark2-shell:
> {code:java}
> scala> val df = spark.sql("show databases")
> scala> df.show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> scala> df.write.format("parquet").saveAsTable("test.spark22_test")
> scala> spark.sql("select * from test.spark22_test").show()
> ++
> |databaseName|
> ++
> ++{code}
> When manually setting the path (causing the data to be saved as external 
> table), it works:
> {code:java}
> scala> df.write.option("path", 
> "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path")
> scala> spark.sql("select * from test.spark22_parquet_with_path").show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> {code}
> A second workaround is to update the metadata of the managed table created by 
> Spark 2.2:
> {code}
> spark.sql("alter table test.spark22_test set SERDEPROPERTIES 
> ('path'='hdfs://my-cluster-name:8020/hadoop/eco/hive/warehouse/test.db/spark22_test')")
> spark.catalog.refreshTable("test.spark22_test")
> spark.sql("select * from test.spark22_test").show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> {code}
> It is kind of a disaster that we are not able to read tables created by the 
> very same Spark version and have to manually specify the path as an explicit 
> option.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22505) toDF() / createDataFrame() type inference doesn't work as expected

2017-11-13 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250054#comment-16250054
 ] 

Ruslan Dautkhanov commented on SPARK-22505:
---

Looks like we already discussed very similar topic 1.5 years ago on github )
https://github.com/databricks/spark-csv/issues/264#issuecomment-184943114
Any chance this can be added as a core Spark functionality? 
Not sure if we can even call that CsvParser().csvRdd from pySpark.. 

> toDF() / createDataFrame() type inference doesn't work as expected
> --
>
> Key: SPARK-22505
> URL: https://issues.apache.org/jira/browse/SPARK-22505
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.2.0
>Reporter: Ruslan Dautkhanov
>  Labels: csvparser, inference, pyspark, schema, spark-sql
>
> {code}
> df = 
> sc.parallelize([('1','a'),('2','b'),('3','c')]).toDF(['should_be_int','should_be_str'])
> df.printSchema()
> {code}
> produces
> {noformat}
> root
>  |-- should_be_int: string (nullable = true)
>  |-- should_be_str: string (nullable = true)
> {noformat}
> Notice `should_be_int` has `string` datatype, according to documentation:
> https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
> {quote}
> Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the 
> datatypes. Rows are constructed by passing a list of key/value pairs as 
> kwargs to the Row class. The keys of this list define the column names of the 
> table, *and the types are inferred by sampling the whole dataset*, similar to 
> the inference that is performed on JSON files.
> {quote}
> Schema inference works as expected when reading delimited files like
> {code}
> spark.read.format('csv').option('inferSchema', True)...
> {code}
> but not when using toDF() / createDataFrame() API calls.
> Spark 2.2.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself

2017-11-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250039#comment-16250039
 ] 

Sean Owen commented on SPARK-21994:
---

(Don't think that would be meaningful outside Cloudera at the moment; the 
commit doesn't exist in the public release/repo yet)

> Spark 2.2 can not read Parquet table created by itself
> --
>
> Key: SPARK-21994
> URL: https://issues.apache.org/jira/browse/SPARK-21994
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2 on Cloudera CDH 5.10.1, Hive 1.1
>Reporter: Jurgis Pods
>
> This seems to be a new bug introduced in Spark 2.2, since it did not occur 
> under Spark 2.1.
> When writing a dataframe to a table in Parquet format, Spark SQL does not 
> write the 'path' of the table to the Hive metastore, unlike in previous 
> versions.
> As a consequence, Spark 2.2 is not able to read the table it just created. It 
> just outputs the table header without any row content. 
> A parallel installation of Spark 1.6 at least produces an appropriate error 
> trace:
> {code:java}
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found 
> in metastore. hive.metastore.schema.verification is not enabled so recording 
> the schema version 1.1.0
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default, 
> returning NoSuchObjectException
> org.spark-project.guava.util.concurrent.UncheckedExecutionException: 
> java.util.NoSuchElementException: key not found: path
> [...]
> {code}
> h3. Steps to reproduce:
> Run the following in spark2-shell:
> {code:java}
> scala> val df = spark.sql("show databases")
> scala> df.show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> scala> df.write.format("parquet").saveAsTable("test.spark22_test")
> scala> spark.sql("select * from test.spark22_test").show()
> ++
> |databaseName|
> ++
> ++{code}
> When manually setting the path (causing the data to be saved as external 
> table), it works:
> {code:java}
> scala> df.write.option("path", 
> "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path")
> scala> spark.sql("select * from test.spark22_parquet_with_path").show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> {code}
> A second workaround is to update the metadata of the managed table created by 
> Spark 2.2:
> {code}
> spark.sql("alter table test.spark22_test set SERDEPROPERTIES 
> ('path'='hdfs://my-cluster-name:8020/hadoop/eco/hive/warehouse/test.db/spark22_test')")
> spark.catalog.refreshTable("test.spark22_test")
> spark.sql("select * from test.spark22_test").show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> {code}
> It is kind of a disaster that we are not able to read tables created by the 
> very same Spark version and have to manually specify the path as an explicit 
> option.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself

2017-11-13 Thread Srinivasa Reddy Vundela (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250032#comment-16250032
 ] 

Srinivasa Reddy Vundela commented on SPARK-21994:
-

commit d5e3ba3e970c7241298db2578f0d7965b6e16ae3
Author: Srinivasa Reddy Vundela 
Date:   Mon Oct 9 14:25:01 2017 -0700

CDH-60037. Not able to read hive table from Cloudera version of Spark 2.2

> Spark 2.2 can not read Parquet table created by itself
> --
>
> Key: SPARK-21994
> URL: https://issues.apache.org/jira/browse/SPARK-21994
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2 on Cloudera CDH 5.10.1, Hive 1.1
>Reporter: Jurgis Pods
>
> This seems to be a new bug introduced in Spark 2.2, since it did not occur 
> under Spark 2.1.
> When writing a dataframe to a table in Parquet format, Spark SQL does not 
> write the 'path' of the table to the Hive metastore, unlike in previous 
> versions.
> As a consequence, Spark 2.2 is not able to read the table it just created. It 
> just outputs the table header without any row content. 
> A parallel installation of Spark 1.6 at least produces an appropriate error 
> trace:
> {code:java}
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found 
> in metastore. hive.metastore.schema.verification is not enabled so recording 
> the schema version 1.1.0
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default, 
> returning NoSuchObjectException
> org.spark-project.guava.util.concurrent.UncheckedExecutionException: 
> java.util.NoSuchElementException: key not found: path
> [...]
> {code}
> h3. Steps to reproduce:
> Run the following in spark2-shell:
> {code:java}
> scala> val df = spark.sql("show databases")
> scala> df.show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> scala> df.write.format("parquet").saveAsTable("test.spark22_test")
> scala> spark.sql("select * from test.spark22_test").show()
> ++
> |databaseName|
> ++
> ++{code}
> When manually setting the path (causing the data to be saved as external 
> table), it works:
> {code:java}
> scala> df.write.option("path", 
> "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path")
> scala> spark.sql("select * from test.spark22_parquet_with_path").show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> {code}
> A second workaround is to update the metadata of the managed table created by 
> Spark 2.2:
> {code}
> spark.sql("alter table test.spark22_test set SERDEPROPERTIES 
> ('path'='hdfs://my-cluster-name:8020/hadoop/eco/hive/warehouse/test.db/spark22_test')")
> spark.catalog.refreshTable("test.spark22_test")
> spark.sql("select * from test.spark22_test").show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> {code}
> It is kind of a disaster that we are not able to read tables created by the 
> very same Spark version and have to manually specify the path as an explicit 
> option.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20791) Use Apache Arrow to Improve Spark createDataFrame from Pandas.DataFrame

2017-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250029#comment-16250029
 ] 

Apache Spark commented on SPARK-20791:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/19738

> Use Apache Arrow to Improve Spark createDataFrame from Pandas.DataFrame
> ---
>
> Key: SPARK-20791
> URL: https://issues.apache.org/jira/browse/SPARK-20791
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.1.1
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
> Fix For: 2.3.0
>
>
> The current code for creating a Spark DataFrame from a Pandas DataFrame uses 
> `to_records` to convert the DataFrame to a list of records and then converts 
> each record to a list.  Following this, there are a number of calls to 
> serialize and transfer this data to the JVM.  This process is very 
> inefficient and also discards all schema metadata, requiring another pass 
> over the data to infer types.
> Using Apache Arrow, the Pandas DataFrame could be efficiently converted to 
> Arrow data and directly transferred to the JVM to create the Spark DataFrame. 
>  The performance will be better and the Pandas schema will also be used so 
> that the correct types will be used.  
> Issues with the poor type inference have come up before, causing confusion 
> and frustration with users because it is not clear why it fails or doesn't 
> use the same type from Pandas.  Fixing this with Apache Arrow will solve 
> another pain point for Python users and the following JIRAs could be closed:
> * SPARK-17804
> * SPARK-18178



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22490) PySpark doc has misleading string for SparkSession.builder

2017-11-13 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250025#comment-16250025
 ] 

Dongjoon Hyun commented on SPARK-22490:
---

Hi, [~smilegator]. Could you review the PR? I assume this is a doc issue for 
2.2.1, too.
Although this is a minor doc issue, cc [~felixcheung] just FYI.


> PySpark doc has misleading string for SparkSession.builder
> --
>
> Key: SPARK-22490
> URL: https://issues.apache.org/jira/browse/SPARK-22490
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Priority: Minor
>
> We need to fix the following line in our PySpark doc 
> http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html
> {noformat}
>  SparkSession.builder =  0x7f51f134a110>¶
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22508) 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create()

2017-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250001#comment-16250001
 ] 

Apache Spark commented on SPARK-22508:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/19737

> 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create()
> -
>
> Key: SPARK-22508
> URL: https://issues.apache.org/jira/browse/SPARK-22508
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> {{GenerateUnsafeRowJoiner.create()}} can throw an exception due to the 64KB 
> JVM bytecode limit when it is used with a schema that has a lot of fields



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22508) 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create()

2017-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22508:


Assignee: (was: Apache Spark)

> 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create()
> -
>
> Key: SPARK-22508
> URL: https://issues.apache.org/jira/browse/SPARK-22508
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> {{GenerateUnsafeRowJoiner.create()}} can throw an exception due to the 64KB 
> JVM bytecode limit when it is used with a schema that has a lot of fields



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22508) 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create()

2017-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22508:


Assignee: Apache Spark

> 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create()
> -
>
> Key: SPARK-22508
> URL: https://issues.apache.org/jira/browse/SPARK-22508
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> {{GenerateUnsafeRowJoiner.create()}} can throw an exception due to the 64KB 
> JVM bytecode limit when it is used with a schema that has a lot of fields



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22508) 64KB JVM bytecode limit problem with GenerateUnsafeRowJoiner.create()

2017-11-13 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-22508:


 Summary: 64KB JVM bytecode limit problem with 
GenerateUnsafeRowJoiner.create()
 Key: SPARK-22508
 URL: https://issues.apache.org/jira/browse/SPARK-22508
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Kazuaki Ishizaki


{{GenerateUnsafeRowJoiner.create()}} can throw an exception due to the 64KB JVM 
bytecode limit when it is used with a schema that has a lot of fields
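
For context, a rough (unverified) reproduction sketch for spark-shell: joining two very wide DataFrames makes the generated joiner cover both schemas, which is the kind of shape that can push a single generated method past the 64KB limit. The column count below is arbitrary, not a known threshold.

{code:scala}
import org.apache.spark.sql.functions.col

// Two DataFrames with a few thousand computed columns each, joined on "id".
val leftCols  = col("id") +: (1 to 2000).map(i => (col("id") + i).as(s"a_$i"))
val rightCols = col("id") +: (1 to 2000).map(i => (col("id") + i).as(s"b_$i"))
val wideLeft  = spark.range(10).select(leftCols: _*)
val wideRight = spark.range(10).select(rightCols: _*)

// Collect so the full joined row (both schemas) has to be produced.
wideLeft.join(wideRight, "id").collect()
{code}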



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9104) expose network layer memory usage

2017-11-13 Thread Srinivasa Reddy Vundela (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249904#comment-16249904
 ] 

Srinivasa Reddy Vundela edited comment on SPARK-9104 at 11/13/17 6:39 PM:
--

Hi [~jerryshao] Thanks for the PR that exposes the Netty buffered pool memory 
usage; it's a very good start towards exposing the unaccounted memory. However, 
I see that these metrics are not registered with the metrics system or the Web 
UI. I was wondering whether you have plans to expose them to the metrics system, 
or would it be okay if I send a PR and you help with reviewing? I see that you 
have included Netty metrics for the ExternalShuffleService, and I was wondering 
about the other parts that use TransportServer and TransportClientFactory, such 
as NettyRpcEnv.


was (Author: vsr):
Hi [~jerryshao] Thanks for the PR that exposes the Netty buffered pool memory 
usage; it's a very good start towards exposing the unaccounted memory. However, 
I see that these metrics are not registered with the metrics system or the Web 
UI. I was wondering whether you have plans to expose them to the metrics system, 
or would it be okay if I send a PR and you help with reviewing?

> expose network layer memory usage
> -
>
> Key: SPARK-9104
> URL: https://issues.apache.org/jira/browse/SPARK-9104
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Zhang, Liye
>Assignee: Saisai Shao
> Fix For: 2.3.0
>
>
> The default network transportation is netty, and when transfering blocks for 
> shuffle, the network layer will consume a decent size of memory, we shall 
> collect the memory usage of this part and expose it. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9104) expose network layer memory usage

2017-11-13 Thread Srinivasa Reddy Vundela (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249904#comment-16249904
 ] 

Srinivasa Reddy Vundela commented on SPARK-9104:


Hi [~jerryshao] Thanks for the PR that exposes the Netty buffered pool memory 
usage; it's a very good start towards exposing the unaccounted memory. However, 
I see that these metrics are not registered with the metrics system or the Web 
UI. I was wondering whether you have plans to expose them to the metrics system, 
or would it be okay if I send a PR and you help with reviewing?

> expose network layer memory usage
> -
>
> Key: SPARK-9104
> URL: https://issues.apache.org/jira/browse/SPARK-9104
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Zhang, Liye
>Assignee: Saisai Shao
> Fix For: 2.3.0
>
>
> The default network transportation is netty, and when transfering blocks for 
> shuffle, the network layer will consume a decent size of memory, we shall 
> collect the memory usage of this part and expose it. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22507) Cannot register inner class with Kryo using SparkConf

2017-11-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249844#comment-16249844
 ] 

Sean Owen commented on SPARK-22507:
---

What if you make it a non-inner class? Although that should be OK, it's best to 
narrow this down. I doubt it has to do with the inner class.

> Cannot register inner class with Kryo using SparkConf
> -
>
> Key: SPARK-22507
> URL: https://issues.apache.org/jira/browse/SPARK-22507
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Yu LIU
>Priority: Critical
>
> When I create the _SparkConf_, I use the _registerKryoClasses_ method to 
> register some custom classes that I created. But when these classes are inner 
> classes, they cannot be registered successfully and I get the following error:
> {noformat}
> [ERROR]: org.apache.spark.scheduler.TaskSetManager - Task 0 in stage 0.0 
> failed 4 times; aborting job
> [ERROR]: org.apache.spark.streaming.scheduler.JobScheduler - Error running 
> job streaming job 1510585911000 ms.0
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 3, tal-qa188.talend.lan): java.io.IOException: 
> org.apache.spark.SparkException: Failed to register classes with Kryo
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
>   at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Failed to register classes with 
> Kryo
>   at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:128)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:273)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:258)
>   at 
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:174)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:215)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:178)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1219)
>   ... 11 more
> Caused by: java.lang.ClassNotFoundException: 
> local_project.maprstreamsin_0_1.MapRStreamsIn$TalendKryoRegistrator
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$4.apply(KryoSerializer.scala:120)
>   at 
> org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$4.apply(KryoSerializer.scala:120)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:120)
>   ... 17 more
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
>   

[jira] [Commented] (SPARK-22507) Cannot register inner class with Kryo using SparkConf

2017-11-13 Thread Yu LIU (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249835#comment-16249835
 ] 

Yu LIU commented on SPARK-22507:


I did something like this:

{code:java}
package local_project.maprstreamproducer_0_1;

public class MapRStreamsIn {
    ...
    public static class TalendKryoRegistrator implements KryoRegistrator {
        // implement a Kryo registrator
    }
    ...
    public static void main(String[] args) {
        SparkConf sc = new SparkConf();
        sc.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        sc.registerKryoClasses(new Class[] { TalendKryoRegistrator.class });
    }
    ...
}
{code}
And then, when I start the streaming job, I always get the above error:
{noformat}
org.apache.spark.SparkException: Failed to register classes with Kryo
java.lang.ClassNotFoundException: 
local_project.maprstreamsin_0_1.MapRStreamsIn$TalendKryoRegistrator
{noformat}
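
For comparison, a minimal Scala sketch of one way to wire this up, using a top-level registrator class and the spark.kryo.registrator setting instead of passing the registrator to registerKryoClasses; it assumes the registrator is compiled into the application jar that is shipped to the executors, since the ClassNotFoundException in the stack trace above is raised on the executor side when Kryo cannot load the class by name.

{code:scala}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Top-level registrator: executors only need its fully qualified name plus the
// jar on their classpath (e.g. a fat jar or spark-submit --jars).
class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Array[String]]) // register application classes here
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyKryoRegistrator") // fully qualified name in a real app
{code}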


> Cannot register inner class with Kryo using SparkConf
> -
>
> Key: SPARK-22507
> URL: https://issues.apache.org/jira/browse/SPARK-22507
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Yu LIU
>Priority: Critical
>
> When I create the _SparkConf_, I use the _registerKryoClasses_ method to 
> register some custom classes that I created. But when these classes are inner 
> classes, they cannot be registered successfully and I get the following error:
> {noformat}
> [ERROR]: org.apache.spark.scheduler.TaskSetManager - Task 0 in stage 0.0 
> failed 4 times; aborting job
> [ERROR]: org.apache.spark.streaming.scheduler.JobScheduler - Error running 
> job streaming job 1510585911000 ms.0
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 3, tal-qa188.talend.lan): java.io.IOException: 
> org.apache.spark.SparkException: Failed to register classes with Kryo
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
>   at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Failed to register classes with 
> Kryo
>   at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:128)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:273)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:258)
>   at 
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:174)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:215)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:178)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1219)
>   ... 11 more
> Caused by: java.lang.ClassNotFoundException: 
> local_project.maprstreamsin_0_1.MapRStreamsIn$TalendKryoRegistrator
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$4.apply(KryoSerializer.scala:120)
>   at 
> org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$4.apply(KryoSerializer.scala:120)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:120)
>   ... 17 more
> Driver stacktrace:
>   at 
> 

[jira] [Commented] (SPARK-22507) Cannot register inner class with Kryo using SparkConf

2017-11-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249812#comment-16249812
 ] 

Sean Owen commented on SPARK-22507:
---

Are you sure the class is on the classpath? How do you try to register it?
This may be related to other serializer/classpath issues, so it may be a duplicate.

> Cannot register inner class with Kryo using SparkConf
> -
>
> Key: SPARK-22507
> URL: https://issues.apache.org/jira/browse/SPARK-22507
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Yu LIU
>Priority: Critical
>
> When I create the _SparkConf_, I use the _registerKryoClasses_ method to 
> register some custom classes that I created. But when these classes are inner 
> classes, they cannot be registered successfully and I get the following error:
> {noformat}
> [ERROR]: org.apache.spark.scheduler.TaskSetManager - Task 0 in stage 0.0 
> failed 4 times; aborting job
> [ERROR]: org.apache.spark.streaming.scheduler.JobScheduler - Error running 
> job streaming job 1510585911000 ms.0
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 3, tal-qa188.talend.lan): java.io.IOException: 
> org.apache.spark.SparkException: Failed to register classes with Kryo
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
>   at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Failed to register classes with 
> Kryo
>   at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:128)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:273)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:258)
>   at 
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:174)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:215)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:178)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1219)
>   ... 11 more
> Caused by: java.lang.ClassNotFoundException: 
> local_project.maprstreamsin_0_1.MapRStreamsIn$TalendKryoRegistrator
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$4.apply(KryoSerializer.scala:120)
>   at 
> org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$4.apply(KryoSerializer.scala:120)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:120)
>   ... 17 more
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
>   at 
> 

[jira] [Created] (SPARK-22507) Cannot register inner class with Kryo using SparkConf

2017-11-13 Thread Yu LIU (JIRA)
Yu LIU created SPARK-22507:
--

 Summary: Cannot register inner class with Kryo using SparkConf
 Key: SPARK-22507
 URL: https://issues.apache.org/jira/browse/SPARK-22507
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Yu LIU
Priority: Critical


When I create the _SparkConf_, I use the _registerKryoClasses_ method to register 
some custom classes that I created. But when these classes are inner classes, 
they cannot be registered successfully and I get the following error:

{noformat}
[ERROR]: org.apache.spark.scheduler.TaskSetManager - Task 0 in stage 0.0 failed 
4 times; aborting job
[ERROR]: org.apache.spark.streaming.scheduler.JobScheduler - Error running job 
streaming job 1510585911000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 
3, tal-qa188.talend.lan): java.io.IOException: org.apache.spark.SparkException: 
Failed to register classes with Kryo
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
at 
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at 
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at 
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at 
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to register classes with Kryo
at 
org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:128)
at 
org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:273)
at 
org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:258)
at 
org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:174)
at 
org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:215)
at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:178)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1219)
... 11 more
Caused by: java.lang.ClassNotFoundException: 
local_project.maprstreamsin_0_1.MapRStreamsIn$TalendKryoRegistrator
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at 
org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$4.apply(KryoSerializer.scala:120)
at 
org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$4.apply(KryoSerializer.scala:120)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:120)
... 17 more

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at 

[jira] [Commented] (SPARK-22431) Creating Permanent view with illegal type

2017-11-13 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249756#comment-16249756
 ] 

Herman van Hovell commented on SPARK-22431:
---

I look forward to the PR :)

> Creating Permanent view with illegal type
> -
>
> Key: SPARK-22431
> URL: https://issues.apache.org/jira/browse/SPARK-22431
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>
> It is possible in Spark SQL to create a permanent view that uses a nested 
> field with an illegal name.
> For example if we create the following view:
> {noformat}
> create view x as select struct('a' as `$q`, 1 as b) q
> {noformat}
> A simple select fails with the following exception:
> {noformat}
> select * from x;
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> struct<$q:string,b:int>
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:812)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
> ...
> {noformat}
> Dropping the view isn't possible either:
> {noformat}
> drop view x;
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> struct<$q:string,b:int>
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:812)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22505) toDF() / createDataFrame() type inference doesn't work as expected

2017-11-13 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249702#comment-16249702
 ] 

Ruslan Dautkhanov commented on SPARK-22505:
---

In a way, you can think of this as Pandas' infer_dtype() call:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.api.types.infer_dtype.html?highlight=infer#pandas.api.types.infer_dtype

One workaround for this missing Spark functionality is writing the file back out 
as delimited text and then reading it back in, so we can use spark-csv schema 
inference. But this would be super inefficient. Again, it would be great to 
somehow engage the same type inference as in spark-csv from an RDD of arbitrary 
tuples of strings (or arrays).
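
One way to get the spark-csv style of inference without a round trip through files is the csv(Dataset[String]) overload on DataFrameReader (available since Spark 2.2, if I recall correctly). A minimal Scala sketch of the idea, using made-up sample rows:

{code:scala}
import spark.implicits._

// Turn the in-memory delimited strings into a Dataset[String] and let the CSV
// reader's schema inference type the columns, instead of writing them back out.
val lines = Seq("1,a", "2,b", "3,c").toDS()
val df = spark.read
  .option("inferSchema", "true")
  .csv(lines)
  .toDF("should_be_int", "should_be_str")

df.printSchema()  // should_be_int comes back as int, should_be_str as string
{code}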

> toDF() / createDataFrame() type inference doesn't work as expected
> --
>
> Key: SPARK-22505
> URL: https://issues.apache.org/jira/browse/SPARK-22505
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.2.0
>Reporter: Ruslan Dautkhanov
>  Labels: csvparser, inference, pyspark, schema, spark-sql
>
> {code}
> df = 
> sc.parallelize([('1','a'),('2','b'),('3','c')]).toDF(['should_be_int','should_be_str'])
> df.printSchema()
> {code}
> produces
> {noformat}
> root
>  |-- should_be_int: string (nullable = true)
>  |-- should_be_str: string (nullable = true)
> {noformat}
> Notice `should_be_int` has `string` datatype, according to documentation:
> https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
> {quote}
> Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the 
> datatypes. Rows are constructed by passing a list of key/value pairs as 
> kwargs to the Row class. The keys of this list define the column names of the 
> table, *and the types are inferred by sampling the whole dataset*, similar to 
> the inference that is performed on JSON files.
> {quote}
> Schema inference works as expected when reading delimited files like
> {code}
> spark.read.format('csv').option('inferSchema', True)...
> {code}
> but not when using toDF() / createDataFrame() API calls.
> Spark 2.2.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22505) toDF() / createDataFrame() type inference doesn't work as expected

2017-11-13 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249689#comment-16249689
 ] 

Ruslan Dautkhanov commented on SPARK-22505:
---

[~hyukjin.kwon] Yep, '1' is of type 'str'. This was done specifically to 
demonstrate my point.
As I said in the JIRA description, I want the same schema inference that works 
as expected when reading delimited files (in the good old spark-csv module).
As an example, we read in fixed-width files using sc.binaryRecords(hdfsFile, 
recordLength) and then, after rdd.map(), basically get a very wide modeling 
dataset in which all elements / "columns" are strings.
We want to engage the same spark-csv style of schema inference, so that Spark 
analyzes all of the strings to come up with the actual data types.
We have other scenarios where we want the toDF() and/or createDataFrame() API 
calls to engage the same schema inference by reading the whole dataset and 
seeing, as in the example above, that the "least common" type of '1', '2', '3' 
is 'int' - again, exactly what the spark-csv logic does. Is this possible in 
Spark?

> toDF() / createDataFrame() type inference doesn't work as expected
> --
>
> Key: SPARK-22505
> URL: https://issues.apache.org/jira/browse/SPARK-22505
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.2.0
>Reporter: Ruslan Dautkhanov
>  Labels: csvparser, inference, pyspark, schema, spark-sql
>
> {code}
> df = 
> sc.parallelize([('1','a'),('2','b'),('3','c')]).toDF(['should_be_int','should_be_str'])
> df.printSchema()
> {code}
> produces
> {noformat}
> root
>  |-- should_be_int: string (nullable = true)
>  |-- should_be_str: string (nullable = true)
> {noformat}
> Notice `should_be_int` has `string` datatype, according to documentation:
> https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
> {quote}
> Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the 
> datatypes. Rows are constructed by passing a list of key/value pairs as 
> kwargs to the Row class. The keys of this list define the column names of the 
> table, *and the types are inferred by sampling the whole dataset*, similar to 
> the inference that is performed on JSON files.
> {quote}
> Schema inference works as expected when reading delimited files like
> {code}
> spark.read.format('csv').option('inferSchema', True)...
> {code}
> but not when using toDF() / createDataFrame() API calls.
> Spark 2.2.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20387) Permissive mode is not replacing corrupt record with null

2017-11-13 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249563#comment-16249563
 ] 

Hyukjin Kwon commented on SPARK-20387:
--

BTW, the example and input indeed have a problem; I had to replace a few 
additional options, etc. In addition, I believe Sean's description is correct 
for the current Spark releases (Spark 2.2.0 and later).

> Permissive mode is not replacing corrupt record with null
> -
>
> Key: SPARK-20387
> URL: https://issues.apache.org/jira/browse/SPARK-20387
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.0
>Reporter: Navya Krishnappa
>
> When reading the below mentioned time value by specifying "mode" as 
> PERMISSIVE.
> Source File: 
> String,int,f1,bool1
> abc,23111,23.07738,true
> abc,23111,23.07738,true
> abc,23111,true,true
> Source code1:
> Dataset dataset = getSqlContext().read()
> .option(PARSER_LIB, "commons")
> .option(INFER_SCHEMA, "true")
> .option(DELIMITER, ",")
> .option(QUOTE, "\"")
> .option(MODE, Mode.PERMISSIVE)
> .csv(sourceFile);
> dataset.collect();
> Result: Error is thrown
> stack trace: 
> ERROR Executor: Exception in task 0.0 in stage 15.0 (TID 15)
> java.lang.IllegalArgumentException: For input string: "23.07738"
> at 
> scala.collection.immutable.StringLike$class.parseBoolean(StringLike.scala:290)
> at 
> scala.collection.immutable.StringLike$class.toBoolean(StringLike.scala:260)
> at scala.collection.immutable.StringOps.toBoolean(StringOps.scala:29)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:270)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:125)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20387) Permissive mode is not replacing corrupt record with null

2017-11-13 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249553#comment-16249553
 ] 

Hyukjin Kwon edited comment on SPARK-20387 at 11/13/17 1:38 PM:


It sounds like it is related to SPARK-21263. I just double-checked that it does 
not happen in master, and double-checked that the PR for SPARK-21263 fixes it.




was (Author: hyukjin.kwon):
Yup, all sounds correct ^ and it sounds like it is related to SPARK-21263. I 
just double-checked that it does not happen in master, and double-checked that 
the PR for SPARK-21263 fixes it.



> Permissive mode is not replacing corrupt record with null
> -
>
> Key: SPARK-20387
> URL: https://issues.apache.org/jira/browse/SPARK-20387
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.0
>Reporter: Navya Krishnappa
>
> When reading the below mentioned time value by specifying "mode" as 
> PERMISSIVE.
> Source File: 
> String,int,f1,bool1
> abc,23111,23.07738,true
> abc,23111,23.07738,true
> abc,23111,true,true
> Source code1:
> Dataset dataset = getSqlContext().read()
> .option(PARSER_LIB, "commons")
> .option(INFER_SCHEMA, "true")
> .option(DELIMITER, ",")
> .option(QUOTE, "\"")
> .option(MODE, Mode.PERMISSIVE)
> .csv(sourceFile);
> dataset.collect();
> Result: Error is thrown
> stack trace: 
> ERROR Executor: Exception in task 0.0 in stage 15.0 (TID 15)
> java.lang.IllegalArgumentException: For input string: "23.07738"
> at 
> scala.collection.immutable.StringLike$class.parseBoolean(StringLike.scala:290)
> at 
> scala.collection.immutable.StringLike$class.toBoolean(StringLike.scala:260)
> at scala.collection.immutable.StringOps.toBoolean(StringOps.scala:29)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:270)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:125)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20387) Permissive mode is not replacing corrupt record with null

2017-11-13 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20387.
--
Resolution: Duplicate

> Permissive mode is not replacing corrupt record with null
> -
>
> Key: SPARK-20387
> URL: https://issues.apache.org/jira/browse/SPARK-20387
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.0
>Reporter: Navya Krishnappa
>
> When reading the below mentioned time value by specifying "mode" as 
> PERMISSIVE.
> Source File: 
> String,int,f1,bool1
> abc,23111,23.07738,true
> abc,23111,23.07738,true
> abc,23111,true,true
> Source code1:
> Dataset dataset = getSqlContext().read()
> .option(PARSER_LIB, "commons")
> .option(INFER_SCHEMA, "true")
> .option(DELIMITER, ",")
> .option(QUOTE, "\"")
> .option(MODE, Mode.PERMISSIVE)
> .csv(sourceFile);
> dataset.collect();
> Result: Error is thrown
> stack trace: 
> ERROR Executor: Exception in task 0.0 in stage 15.0 (TID 15)
> java.lang.IllegalArgumentException: For input string: "23.07738"
> at 
> scala.collection.immutable.StringLike$class.parseBoolean(StringLike.scala:290)
> at 
> scala.collection.immutable.StringLike$class.toBoolean(StringLike.scala:260)
> at scala.collection.immutable.StringOps.toBoolean(StringOps.scala:29)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:270)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:125)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20387) Permissive mode is not replacing corrupt record with null

2017-11-13 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249553#comment-16249553
 ] 

Hyukjin Kwon commented on SPARK-20387:
--

Yup, all sounds correct ^ and it sounds like it is related to SPARK-21263. I 
just double-checked that it does not happen in master, and double-checked that 
the PR for SPARK-21263 fixes it.



> Permissive mode is not replacing corrupt record with null
> -
>
> Key: SPARK-20387
> URL: https://issues.apache.org/jira/browse/SPARK-20387
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.0
>Reporter: Navya Krishnappa
>
> When reading the below mentioned time value by specifying "mode" as 
> PERMISSIVE.
> Source File: 
> String,int,f1,bool1
> abc,23111,23.07738,true
> abc,23111,23.07738,true
> abc,23111,true,true
> Source code1:
> Dataset dataset = getSqlContext().read()
> .option(PARSER_LIB, "commons")
> .option(INFER_SCHEMA, "true")
> .option(DELIMITER, ",")
> .option(QUOTE, "\"")
> .option(MODE, Mode.PERMISSIVE)
> .csv(sourceFile);
> dataset.collect();
> Result: Error is thrown
> stack trace: 
> ERROR Executor: Exception in task 0.0 in stage 15.0 (TID 15)
> java.lang.IllegalArgumentException: For input string: "23.07738"
> at 
> scala.collection.immutable.StringLike$class.parseBoolean(StringLike.scala:290)
> at 
> scala.collection.immutable.StringLike$class.toBoolean(StringLike.scala:260)
> at scala.collection.immutable.StringOps.toBoolean(StringOps.scala:29)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:270)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:125)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22431) Creating Permanent view with illegal type

2017-11-13 Thread Sunitha Kambhampati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249517#comment-16249517
 ] 

Sunitha Kambhampati commented on SPARK-22431:
-

Thanks for the response.   Option 1 sounds good.  I'll go ahead and create a PR 
for the same.

> Creating Permanent view with illegal type
> -
>
> Key: SPARK-22431
> URL: https://issues.apache.org/jira/browse/SPARK-22431
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>
> It is possible in Spark SQL to create a permanent view that uses a nested 
> field with an illegal name.
> For example if we create the following view:
> {noformat}
> create view x as select struct('a' as `$q`, 1 as b) q
> {noformat}
> A simple select fails with the following exception:
> {noformat}
> select * from x;
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> struct<$q:string,b:int>
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:812)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
> ...
> {noformat}
> Dropping the view isn't possible either:
> {noformat}
> drop view x;
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> struct<$q:string,b:int>
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:812)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22431) Creating Permanent view with illegal type

2017-11-13 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249511#comment-16249511
 ] 

Herman van Hovell commented on SPARK-22431:
---

[~ksunitha] Thanks for the thorough analysis! I think this is mainly a Hive 
metastore issue (that does not support column names with weird characters), and 
so I think we should keep this localized to the Hive code. I would go with 
option '1' for now.

> Creating Permanent view with illegal type
> -
>
> Key: SPARK-22431
> URL: https://issues.apache.org/jira/browse/SPARK-22431
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>
> It is possible in Spark SQL to create a permanent view that uses a nested 
> field with an illegal name.
> For example if we create the following view:
> {noformat}
> create view x as select struct('a' as `$q`, 1 as b) q
> {noformat}
> A simple select fails with the following exception:
> {noformat}
> select * from x;
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> struct<$q:string,b:int>
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:812)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
> ...
> {noformat}
> Dropping the view isn't possible either:
> {noformat}
> drop view x;
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> struct<$q:string,b:int>
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:812)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20387) Permissive mode is not replacing corrupt record with null

2017-11-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249508#comment-16249508
 ] 

Sean Owen commented on SPARK-20387:
---

That's not the same example. I believe the underlying number parsing from the 
JDK will, in permissive mode, parse as much of a string as it can as a number 
and ignore the rest. I think that's consistent then. [~hyukjin.kwon]?
There also appears to be a problem with your input -- an extra blank column.
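
For reference, a rough Scala equivalent of the read in the description against the three-line sample file (the path is a placeholder; a header option is added since the sample file has a header row, and the parserLib option is dropped because Spark 2.x's built-in CSV reader uses the univocity parser):

{code:scala}
val dataset = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", ",")
  .option("quote", "\"")
  .option("mode", "PERMISSIVE")
  .csv("path/to/sourceFile.csv")

// Per the comments above, this no longer throws on master (SPARK-21263);
// the unparseable value is expected to surface as a null/corrupt entry instead.
dataset.collect()
{code}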

> Permissive mode is not replacing corrupt record with null
> -
>
> Key: SPARK-20387
> URL: https://issues.apache.org/jira/browse/SPARK-20387
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.0
>Reporter: Navya Krishnappa
>
> When reading the below mentioned time value by specifying "mode" as 
> PERMISSIVE.
> Source File: 
> String,int,f1,bool1
> abc,23111,23.07738,true
> abc,23111,23.07738,true
> abc,23111,true,true
> Source code1:
> Dataset dataset = getSqlContext().read()
> .option(PARSER_LIB, "commons")
> .option(INFER_SCHEMA, "true")
> .option(DELIMITER, ",")
> .option(QUOTE, "\"")
> .option(MODE, Mode.PERMISSIVE)
> .csv(sourceFile);
> dataset.collect();
> Result: Error is thrown
> stack trace: 
> ERROR Executor: Exception in task 0.0 in stage 15.0 (TID 15)
> java.lang.IllegalArgumentException: For input string: "23.07738"
> at 
> scala.collection.immutable.StringLike$class.parseBoolean(StringLike.scala:290)
> at 
> scala.collection.immutable.StringLike$class.toBoolean(StringLike.scala:260)
> at scala.collection.immutable.StringOps.toBoolean(StringOps.scala:29)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:270)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:125)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:94)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:167)
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:166)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22431) Creating Permanent view with illegal type

2017-11-13 Thread Sunitha Kambhampati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249505#comment-16249505
 ] 

Sunitha Kambhampati commented on SPARK-22431:
-

*Observations:*
I ran a few tests with a STRUCT containing `$a` for the following scenarios:
a) create table, b) create view, c) create datasource table, against both the 
Hive and in-memory catalogs.

*A. Hive Catalog*
+1. Create Table (CTAS) - illegal type+   - Results in Error
{code:java}
spark-sql> CREATE TABLE t AS SELECT STRUCT('a' AS `$a`, 1 AS b) q;
17/11/10 22:50:45 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.IllegalArgumentException: Error: name expected at the position 7 of 
'struct<$a:string,b:int>' but '$' is found.;
{code}

+2 Create Table – illegal type+  -  Results in Error

{code:java}
CREATE TABLE t(q STRUCT<`$a`:INT,col2:STRING>, i1 INT);
Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.IllegalArgumentException: Error: name expected at the position 7 of 
'struct<$a:int,col2:string>:int' but '$' is found.;
{code}

+3 Create DataSourceTable - illegal type+ - Successful
Spark tries to store the table in a Hive-compatible way if possible; if that 
fails, it stores the metadata in the Spark SQL specific format. With Parquet, 
there is an error when trying to store it in a Hive-compatible way, so it falls 
back to persisting the metadata in the Spark SQL specific format.
{code:java}
CREATE TABLE t(q STRUCT<`$a`:INT,col2:STRING>, i1 INT) USING PARQUET;   
17/11/10 22:52:40 WARN HiveExternalCatalog: Could not persist `default`.`t` in 
a Hive compatible way. Persisting it into Hive metastore in Spark SQL specific 
format.
org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.IllegalArgumentException: Error: name expected at the position 7 of 
'struct<$a:int,col2:string>:int' but '$' is found.
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:720)
Caused by: java.lang.IllegalArgumentException: Error: name expected at the 
position 7 of 'struct<$a:int,col2:string>:int' but '$' is found.
{code}
Retrieving the table metadata –  OK

{code:java}
select * from t;
Time taken: 0.912 seconds
spark-sql> describe formatted t;
q   struct<$a:int,col2:string>  NULL
i1  int NULL

# Detailed Table Information
Databasedefault 
Table   t   
Owner   ksunitha
Created TimeFri Nov 10 22:52:40 IST 2017
Last Access Thu Jan 01 05:30:00 IST 1970
Created By  Spark 2.3.0-SNAPSHOT
TypeMANAGED 
ProviderPARQUET 
Table Properties[transient_lastDdlTime=1510334560]  
Locationfile:/Users/ksunitha/projects/trunk/spark/spark-warehouse/t 
Serde Library   org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  
InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormatorg.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat   
Storage Properties  [serialization.format=1]
Time taken: 0.071 seconds, Fetched 18 row(s)
{code}

+4. Create View - illegal type+ 
Creation is successful.
Retrieving the view metadata fails, so select and drop also fail.

{code:java}
CREATE VIEW t AS SELECT STRUCT('a' AS `$a`, 1 AS b) q;
Time taken: 0.036 seconds
spark-sql> select * from t;
17/11/10 22:57:22 ERROR SparkSQLDriver: Failed in [select * from t]
org.apache.spark.SparkException: Cannot recognize hive type string: 
struct<$a:string,b:int>
{code}
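
The error above comes from the step where Spark parses the column type string it 
reads back from the Hive metastore. Below is a minimal sketch (Scala, spark-shell; 
CatalystSqlParser is an internal API, so its exact behavior here is an assumption) 
of that parse step:

{code:java}
// Sketch only: the type string stored for the view has the field name unquoted,
// and parsing it fails; with backticks around the field name it parses fine.
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// Fails with a ParseException, because '$' is not legal in an unquoted field name:
// CatalystSqlParser.parseDataType("struct<$a:string,b:int>")

// Parses once the field name is quoted (backticked):
CatalystSqlParser.parseDataType("struct<`$a`:string,b:int>")
{code}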

--
*B. InMemoryCatalog*
+1. Create Table - illegal type+
N/A – Hive support is needed.

+2. Create DataSourceTable  - illegal type+
OK/Successful
{code:java}
CREATE TABLE t(q STRUCT<`$a`:INT,col2:STRING>, i1 INT) USING PARQUET
{code}
Retrieving the table metadata – Select Query – OK

+3. Create View - illegal type+
Creation successful.
{code:java}
CREATE VIEW t AS SELECT STRUCT('a' AS `$a`, 1 AS b) q
{code}
Retrieving the view metadata and select query – OK 



*Cause:*
# When the table metadata is stored with Hive as the provider, the schema 
containing the struct with the illegal field name is written to the catalog.
a.  If *table metadata* is being stored, the underlying serde 
initialization catches the illegal field name and throws an exception. 
b.  If *view metadata* is being stored, Hive has special-case logic that 
skips these checks, so the metadata for the view gets stored (created) in the Hive 
metastore. However, when retrieving a Hive table's metadata, Spark verifies that it 
can read the schema and that the schema is compatible with Spark: it checks that 
each column's data type string retrieved from the Hive metastore can be parsed by 
Spark. Spark cannot parse the struct with the illegal field name because the name 
is not quoted (backticked) in the stored type string. 
# When you store the table metadata for 

[jira] [Commented] (SPARK-22504) Optimization in overwrite table in case of failure

2017-11-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249502#comment-16249502
 ] 

Sean Owen commented on SPARK-22504:
---

Removing the original table isn't a problem; the user asked for that.
Your suggestion just swaps it out for a new set of problems: how do you ensure the 
new table is cleaned up in case of failure? How do you make sure the old table is 
deleted? What about the implications of holding the storage for two copies of the 
table at once?
I think the current semantics are correct and behave as expected in case of a failure.

> Optimization in overwrite table in case of failure
> --
>
> Key: SPARK-22504
> URL: https://issues.apache.org/jira/browse/SPARK-22504
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: xuchuanyin
>
> Optimization in overwrite table in case of failure
> # SCENARIO
> Currently, the `Overwrite` operation in Spark is performed in the following steps: 
> 1. DROP : drop the old table
> 2. WRITE: create and write data into the new table
> If some runtime error occurs in step 2, then the original table is lost 
> along with its data -- I think this is a serious problem if someone 
> performs `read-update-flushback` actions. The problem can be reproduced by the 
> following code:
> ```scala
> 01: test("test spark df overwrite failed") {
> 02: // prepare table
> 03: val tableName = "test_spark_overwrite_failed"
> 04: sql(s"DROP TABLE IF EXISTS $tableName")
> 05: sql(s"CREATE TABLE IF NOT EXISTS $tableName ( field_int int, field_string String)" +
> 06: s" STORED AS parquet").collect()
> 07: 
> 08: // load data first
> 09: val schema = StructType(
> 10:   Seq(StructField("field_int", DataTypes.IntegerType, nullable = false),
> 11: StructField("field_string", DataTypes.StringType, nullable = false)))
> 12: val rdd1 = sqlContext.sparkContext.parallelize(
> 13:   Row(20, "q") ::
> 14:   Row(21, "qw") ::
> 15:   Row(23, "qwe") :: Nil)
> 16: val dataFrame = sqlContext.createDataFrame(rdd1, schema)
> 17: dataFrame.write.format("parquet").mode(SaveMode.Overwrite).saveAsTable(tableName)
> 18: sql(s"SELECT * FROM $tableName").show()
> 19: 
> 20: // load data again, the following data will cause failure in data loading
> 21: try {
> 22:   val rdd2 = sqlContext.sparkContext.parallelize(
> 23: Row(31, "qwer") ::
> 24: Row(null, "qwer") ::
> 25: Row(32, "long_than_5") :: Nil)
> 26:   val dataFrame2 = sqlContext.createDataFrame(rdd2, schema)
> 27: 
> 28:   dataFrame2.write.format("parquet").mode(SaveMode.Overwrite).saveAsTable(tableName)
> 29: } catch {
> 30:   case e: Exception => LOGGER.error(e, "write overwrite failure")
> 31: }
> 32: // table `test_spark_overwrite_failed` has been dropped
> 33: sql(s"show tables").show(20, truncate = false)
> 34: // the content is empty even if the table exists. We want it to be the same as at line 18
> 35: sql(s"SELECT * FROM $tableName").show()
> 36:   }
> ```
> At line 24, we create a `null` element while the schema declares the column as 
> `nullable = false` -- this causes a runtime error while loading the data.
> At line 33, table `test_spark_overwrite_failed` has already been dropped and 
> no longer exists in the catalog, so of course line 35 fails.
> Instead, we want line 35 to show the original data, just as line 18 does.
> # ANALYZE
> I am thinking of optimizing `overwrite` in Spark -- the goal is to keep the 
> old data until the load has finished successfully; the old data is only 
> cleaned up once the load succeeds.
> Since Spark SQL already supports a `rename` operation, we can optimize 
> `overwrite` with the following steps:
> 1. WRITE: create and write data to tempTable
> 2. SWAP: swap tempTable with targetTable using the rename operation
> 3. CLEAN: clean up the old data
> If step 1 succeeds, swap tempTable with targetTable and clean up the old 
> data; otherwise, leave the target table unchanged.
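
For reference, a user-level sketch of the write-then-swap pattern proposed above 
(assumptions: a Hive-enabled SparkSession named `spark`, `ALTER TABLE ... RENAME TO ...` 
available for the table's catalog, and a hypothetical staging-table name):

{code:java}
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

// Sketch only: write the new data to a staging table first and replace the target
// only after the write succeeds, so a failed load leaves the target table intact.
def overwriteViaStaging(spark: SparkSession, df: DataFrame, target: String): Unit = {
  val staging = target + "_staging"   // hypothetical staging-table name
  spark.sql(s"DROP TABLE IF EXISTS $staging")
  // 1. WRITE: materialize the new data into the staging table
  df.write.format("parquet").mode(SaveMode.ErrorIfExists).saveAsTable(staging)
  // 2. SWAP: the write succeeded, so replace the target
  spark.sql(s"DROP TABLE IF EXISTS $target")
  spark.sql(s"ALTER TABLE $staging RENAME TO $target")
  // 3. CLEAN: nothing further to clean; a failure in step 1 leaves the target untouched
}
{code}

This still has the gaps raised in the comment above: the staging table needs cleanup 
if the write fails, and there is a window between the DROP and the RENAME, so it 
illustrates the idea rather than solving it.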



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-22439) Not able to get numeric columns for the file having decimal values

2017-11-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-22439.
-

> Not able to get numeric columns for the file having decimal values
> --
>
> Key: SPARK-22439
> URL: https://issues.apache.org/jira/browse/SPARK-22439
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.2.0
>Reporter: Navya Krishnappa
>
> When reading the below-mentioned decimal values with header specified as true:
> SourceFile: 
> 8.95977565356765764E+20
> 8.95977565356765764E+20
> 8.95977565356765764E+20
> Source code1:
> Dataset dataset = getSqlContext().read()
> .option(PARSER_LIB, "commons")
> .option(INFER_SCHEMA, "true")
> .option(HEADER, "true")
> .option(DELIMITER, ",")
> .option(QUOTE, "\"")
> .option(ESCAPE, "
> ")
> .option(MODE, Mode.PERMISSIVE)
> .csv(sourceFile);
> dataset.numericColumns()
> Result: 
> Caused by: java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$numericColumns$2.apply(Dataset.scala:223)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$numericColumns$2.apply(Dataset.scala:222)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
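
A user-level sketch (Scala; the file path is hypothetical) that inspects the 
inferred schema and selects numeric columns through the public schema API instead 
of the internal numericColumns():

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.NumericType

val spark = SparkSession.builder().appName("inspect-inferred-schema").getOrCreate()

// Sketch only: read the CSV with schema inference and check what type the column gets.
val df = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv("/tmp/decimals.csv")   // hypothetical path to the source file above

df.printSchema()

// DecimalType extends NumericType, so numeric columns can be picked out via the schema.
val numericCols = df.schema.fields.collect {
  case f if f.dataType.isInstanceOf[NumericType] => f.name
}
println(numericCols.mkString(", "))
{code}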



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


