[jira] [Assigned] (SPARK-26134) Upgrading Hadoop to 2.7.4 to fix java.version problem

2018-11-21 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-26134:
-

Assignee: Takanobu Asanuma

> Upgrading Hadoop to 2.7.4 to fix java.version problem
> -
>
> Key: SPARK-26134
> URL: https://issues.apache.org/jira/browse/SPARK-26134
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Takanobu Asanuma
>Assignee: Takanobu Asanuma
>Priority: Major
> Fix For: 3.0.0
>
>
> When I ran spark-shell on JDK 11+28 (2018-09-25), it failed with the error
> below.
> {noformat}
> Exception in thread "main" java.lang.ExceptionInInitializerError
>   at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
>   at 
> org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
>   at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
>   at 
> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
>   at 
> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
>   at 
> org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2427)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2427)
>   at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
>   at 
> org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:359)
>   at org.apache.spark.deploy.SparkSubmit.secMgr$1(SparkSubmit.scala:359)
>   at 
> org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$9(SparkSubmit.scala:367)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:367)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
>   at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3319)
>   at java.base/java.lang.String.substring(String.java:1874)
>   at org.apache.hadoop.util.Shell.<clinit>(Shell.java:52)
> {noformat}
> This is a Hadoop issue: it fails to parse some {{java.version}} values. It has
> been fixed since Hadoop 2.7.4 (see HADOOP-14586).
> Note that Hadoop 2.7.5 and later have another problem with Spark (SPARK-25330),
> so upgrading to 2.7.4 would be fine for now.
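For context, the failure comes from version-string parsing that assumes the old
"1.x.y" format of {{java.version}}. A minimal Scala sketch of the failing
pattern (an illustration of the pre-HADOOP-14586 logic, not the exact Hadoop
source):

{code:scala}
// On JDK 11, java.version is just "11" (two characters), so taking the first
// three characters blows up exactly as in the stack trace above.
val javaVersion = "11" // what System.getProperty("java.version") returns on JDK 11
val prefix = javaVersion.substring(0, 3) // throws java.lang.StringIndexOutOfBoundsException:
                                         //   begin 0, end 3, length 2
val isJava7OrAbove = prefix.compareTo("1.7") >= 0
{code}

HADOOP-14586 removed this fragile prefix parsing, which is why upgrading to
Hadoop 2.7.4 avoids the crash.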






[jira] [Resolved] (SPARK-26134) Upgrading Hadoop to 2.7.4 to fix java.version problem

2018-11-21 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26134.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/23101

> Upgrading Hadoop to 2.7.4 to fix java.version problem
> -
>
> Key: SPARK-26134
> URL: https://issues.apache.org/jira/browse/SPARK-26134
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Takanobu Asanuma
>Priority: Major
> Fix For: 3.0.0
>
>
> When I ran spark-shell on JDK 11+28 (2018-09-25), it failed with the error
> below.
> {noformat}
> Exception in thread "main" java.lang.ExceptionInInitializerError
>   at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
>   at 
> org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
>   at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
>   at 
> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
>   at 
> org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
>   at 
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
>   at 
> org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2427)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2427)
>   at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:79)
>   at 
> org.apache.spark.deploy.SparkSubmit.secMgr$lzycompute$1(SparkSubmit.scala:359)
>   at org.apache.spark.deploy.SparkSubmit.secMgr$1(SparkSubmit.scala:359)
>   at 
> org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$9(SparkSubmit.scala:367)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:367)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.StringIndexOutOfBoundsException: begin 0, end 3, length 2
>   at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3319)
>   at java.base/java.lang.String.substring(String.java:1874)
>   at org.apache.hadoop.util.Shell.<clinit>(Shell.java:52)
> {noformat}
> This is a Hadoop issue: it fails to parse some {{java.version}} values. It has
> been fixed since Hadoop 2.7.4 (see HADOOP-14586).
> Note that Hadoop 2.7.5 and later have another problem with Spark (SPARK-25330),
> so upgrading to 2.7.4 would be fine for now.






[jira] [Updated] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-21 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26118:
--
Fix Version/s: 2.3.3
   2.2.3

> Make Jetty's requestHeaderSize configurable in Spark
> 
>
> Key: SPARK-26118
> URL: https://issues.apache.org/jira/browse/SPARK-26118
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>
>
> For long authorization fields, the request header size can exceed the
> default limit (8192 bytes); in that case Jetty replies with HTTP 413 (Request
> Entity Too Large).
> This issue may occur if the user is a member of many Active Directory user
> groups: the HTTP request to the server carries the Kerberos token in the
> WWW-Authenticate header, and the header size grows with the number of user
> groups.
> Currently there is no way in Spark to override this limit.
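For reference, the limit in question is Jetty's per-connection request header
size. A minimal sketch using the plain Jetty API (a standalone server, not
Spark's actual UI wiring) of where the 8192-byte default lives and how a larger
value would be set:

{code:scala}
import org.eclipse.jetty.server.{HttpConfiguration, HttpConnectionFactory, Server, ServerConnector}

// Raise the request-header limit from Jetty's 8 KiB default so that large
// Kerberos/SPNEGO tokens in the authentication headers fit.
val server = new Server()
val httpConfig = new HttpConfiguration()
httpConfig.setRequestHeaderSize(32 * 1024)

val connector = new ServerConnector(server, new HttpConnectionFactory(httpConfig))
connector.setPort(4040)
server.addConnector(connector)
server.start()
{code}

Making this configurable in Spark means threading a user-supplied size into the
equivalent HttpConfiguration used by the UI's embedded Jetty server.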






[jira] [Commented] (SPARK-24553) Job UI redirect causing http 302 error

2018-11-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695601#comment-16695601
 ] 

Apache Spark commented on SPARK-24553:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/23116

> Job UI redirect causing http 302 error
> --
>
> Key: SPARK-24553
> URL: https://issues.apache.org/jira/browse/SPARK-24553
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.1, 2.3.0, 2.3.1
>Reporter: Steven Kallman
>Assignee: Steven Kallman
>Priority: Minor
> Fix For: 2.4.0
>
>
> On the Spark UI (port 4040) Jobs or Stages tab, the href links for the
> individual jobs or stages are missing a '/' before the '?id'. This causes a
> redirect to the address with a '/', which breaks the use of a reverse
> proxy:
>  
> localhost:4040/jobs/job?id=2 --> localhost:4040/jobs/job/?id=2
> localhost:4040/stages/stage?id=3=0 --> 
> localhost:4040/stages/stage/?id=3=0
>  
> Updated with Pull Request --> https://github.com/apache/spark/pull/21600
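To make the fix concrete, a small hypothetical sketch of the link construction
(not the actual Spark UI code): the only change is the '/' before the query
string, which removes the extra 302 round trip that the reverse proxy cannot
follow correctly.

{code:scala}
// Hypothetical helpers illustrating the before/after href formats.
def jobLinkBefore(base: String, jobId: Int): String =
  s"$base/jobs/job?id=$jobId"   // Jetty answers with a 302 redirect to .../job/?id=...

def jobLinkAfter(base: String, jobId: Int): String =
  s"$base/jobs/job/?id=$jobId"  // served directly, nothing for the proxy to rewrite

println(jobLinkAfter("http://localhost:4040", 2)) // http://localhost:4040/jobs/job/?id=2
{code}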






[jira] [Commented] (SPARK-24553) Job UI redirect causing http 302 error

2018-11-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695604#comment-16695604
 ] 

Apache Spark commented on SPARK-24553:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/23116

> Job UI redirect causing http 302 error
> --
>
> Key: SPARK-24553
> URL: https://issues.apache.org/jira/browse/SPARK-24553
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.1, 2.3.0, 2.3.1
>Reporter: Steven Kallman
>Assignee: Steven Kallman
>Priority: Minor
> Fix For: 2.4.0
>
>
> On the Spark UI (port 4040) Jobs or Stages tab, the href links for the
> individual jobs or stages are missing a '/' before the '?id'. This causes a
> redirect to the address with a '/', which breaks the use of a reverse
> proxy:
>  
> localhost:4040/jobs/job?id=2 --> localhost:4040/jobs/job/?id=2
> localhost:4040/stages/stage?id=3=0 --> 
> localhost:4040/stages/stage/?id=3=0
>  
> Updated with Pull Request --> https://github.com/apache/spark/pull/21600






[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-11-21 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695565#comment-16695565
 ] 

Hyukjin Kwon commented on SPARK-23410:
--

[~x1q1j1], can you point me to the Flink PR?

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf16WithBOM.json
>
>
> Currently the JSON parser is forced to read JSON files in UTF-8. This
> behavior breaks backward compatibility with Spark 2.2.1 and earlier versions,
> which could read JSON files in UTF-16, UTF-32 and other encodings thanks to
> the auto-detection mechanism of the Jackson library. We need to give users
> back the ability to read JSON files in a specified charset and/or to detect
> the charset automatically, as before.
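For reference, a sketch of what reading such a file could look like once an
explicit charset can be requested (run in spark-shell; the {{encoding}} and
{{multiLine}} reader option names are the ones later introduced for this work
and should be treated as an assumption in the context of this ticket):

{code:scala}
// Read the attached utf16WithBOM.json with an explicit charset instead of
// relying on Jackson's auto-detection.
val df = spark.read
  .option("multiLine", value = true)  // whole-file parsing, needed for UTF-16/32 input
  .option("encoding", "UTF-16")
  .json("utf16WithBOM.json")
df.show()
{code}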






[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8

2018-11-21 Thread xuqianjin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695539#comment-16695539
 ] 

xuqianjin commented on SPARK-23410:
---

[~maxgekk] I want to support UTF-16 and UTF-32 with BOMs. Not long ago I also
submitted a PR to Flink to support UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE
with BOMs, so I would like to try it out here as well.

> Unable to read jsons in charset different from UTF-8
> 
>
> Key: SPARK-23410
> URL: https://issues.apache.org/jira/browse/SPARK-23410
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: utf16WithBOM.json
>
>
> Currently the JSON parser is forced to read JSON files in UTF-8. This
> behavior breaks backward compatibility with Spark 2.2.1 and earlier versions,
> which could read JSON files in UTF-16, UTF-32 and other encodings thanks to
> the auto-detection mechanism of the Jackson library. We need to give users
> back the ability to read JSON files in a specified charset and/or to detect
> the charset automatically, as before.






[jira] [Commented] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet leads to OOM errors

2018-11-21 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695526#comment-16695526
 ] 

Hyukjin Kwon commented on SPARK-26116:
--

Please describe that fact in the JIRA as well.

> Spark SQL - Sort when writing partitioned parquet leads to OOM errors
> -
>
> Key: SPARK-26116
> URL: https://issues.apache.org/jira/browse/SPARK-26116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Pierre Lienhart
>Priority: Major
>
> When writing partitioned Parquet using {{partitionBy}}, it looks like Spark
> sorts each partition before writing, but this sort consumes a huge amount of
> memory compared to the size of the data. The executors can then go OOM and
> get killed by YARN. As a consequence, it also forces users to provision huge
> amounts of memory relative to the data being written.
> Error messages found in the Spark UI look like the following:
> {code:java}
> Spark UI description of failure : Job aborted due to stage failure: Task 169 
> in stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 
> 2.0 (TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure 
> (executor 1 exited caused by one of the running tasks) Reason: Container 
> killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory 
> used. Consider boosting spark.yarn.executor.memoryOverhead.
> {code}
>  
> {code:java}
> Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most 
> recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, 
> executor 1): org.apache.spark.SparkException: Task failed while writing rows
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:99)
>  at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.OutOfMemoryError: error while calling spill() on 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : 
> /app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad
>  (No such file or directory)
>  at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)
>  at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)
>  at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)
>  at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)
>  at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)
>  at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
>  at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
>  ... 8 more{code}
>  
> In the stderr logs, we can see that a huge amount of sort data (the partition
> being sorted here is 250 MB when persisted in memory, deserialized) is
> being spilled to disk ({{INFO UnsafeExternalSorter: Thread 155 spilling
> sort data of 3.6 GB to disk}}). Sometimes the 

[jira] [Reopened] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet leads to OOM errors

2018-11-21 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-26116:
--

> Spark SQL - Sort when writing partitioned parquet leads to OOM errors
> -
>
> Key: SPARK-26116
> URL: https://issues.apache.org/jira/browse/SPARK-26116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Pierre Lienhart
>Priority: Major
>
> When writing partitioned Parquet using {{partitionBy}}, it looks like Spark
> sorts each partition before writing, but this sort consumes a huge amount of
> memory compared to the size of the data. The executors can then go OOM and
> get killed by YARN. As a consequence, it also forces users to provision huge
> amounts of memory relative to the data being written.
> Error messages found in the Spark UI look like the following:
> {code:java}
> Spark UI description of failure : Job aborted due to stage failure: Task 169 
> in stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 
> 2.0 (TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure 
> (executor 1 exited caused by one of the running tasks) Reason: Container 
> killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory 
> used. Consider boosting spark.yarn.executor.memoryOverhead.
> {code}
>  
> {code:java}
> Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most 
> recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, 
> executor 1): org.apache.spark.SparkException: Task failed while writing rows
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:99)
>  at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.OutOfMemoryError: error while calling spill() on 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : 
> /app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad
>  (No such file or directory)
>  at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)
>  at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)
>  at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)
>  at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)
>  at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)
>  at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
>  at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
>  ... 8 more{code}
>  
> In the stderr logs, we can see that a huge amount of sort data (the partition
> being sorted here is 250 MB when persisted in memory, deserialized) is
> being spilled to disk ({{INFO UnsafeExternalSorter: Thread 155 spilling
> sort data of 3.6 GB to disk}}). Sometimes the data is spilled to disk in time
> and the sort completes ({{INFO FileFormatWriter: 
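A common mitigation for the write pattern quoted above, sketched below for
spark-shell under the assumption of a single partition column named {{date}}
(the example data and output path are hypothetical): repartitioning by the
partition columns before the write means each task sorts and writes only a
small number of output partitions, which keeps the per-task sorter footprint
down.

{code:scala}
import org.apache.spark.sql.functions.col

// Example data: many rows spread over 30 distinct values of the partition column.
val df = spark.range(0, 1000000L)
  .selectExpr("id", "date_add(current_date(), cast(id % 30 as int)) as date")

// Align the shuffle partitioning with the output partitioning so that each
// task sorts and writes only its own partition values.
df.repartition(col("date"))
  .write
  .partitionBy("date")
  .parquet("/tmp/partitioned_output")
{code}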

[jira] [Assigned] (SPARK-26099) Verification of the corrupt column in from_csv/from_json

2018-11-21 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-26099:


Assignee: Maxim Gekk

> Verification of the corrupt column in from_csv/from_json
> 
>
> Key: SPARK-26099
> URL: https://issues.apache.org/jira/browse/SPARK-26099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> The corrupt column specified via the JSON/CSV option *columnNameOfCorruptRecord*
> must be of string type and nullable. This check already exists in
> DataFrameReader and JSON/CSVFileFormat; the same check should be added to
> CsvToStructs and JsonToStructs.
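A small spark-shell sketch of the case the verification covers (the field and
column names are just examples):

{code:scala}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

// The corrupt-record field must be a nullable StringType field of the schema.
val schema = new StructType()
  .add("a", IntegerType)
  .add("_corrupt_record", StringType)

val options = Map("columnNameOfCorruptRecord" -> "_corrupt_record")

Seq("""{"a": 1}""", """{"a":""").toDF("json")
  .select(from_json($"json", schema, options).as("parsed"))
  .show(truncate = false)
{code}

With the proposed check, passing a corrupt column of the wrong type or
nullability through these expressions would fail analysis instead of silently
misbehaving.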






[jira] [Resolved] (SPARK-26085) Key attribute of primitive type under typed aggregation should be named as "key" too

2018-11-21 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26085.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23054
[https://github.com/apache/spark/pull/23054]

> Key attribute of primitive type under typed aggregation should be named as 
> "key" too
> 
>
> Key: SPARK-26085
> URL: https://issues.apache.org/jira/browse/SPARK-26085
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> When doing typed aggregation on a Dataset, for a complex key type the key
> attribute is named "key", but for a primitive key type it is named "value".
> The key attribute should also be named "key" for primitive types.
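A minimal spark-shell sketch of the naming difference (the data is arbitrary):

{code:scala}
// Complex (tuple) key: the grouping column appears in the schema as "key".
val complexKeyed = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()
complexKeyed.groupByKey(x => (x._1, x._2 % 2)).count().printSchema()

// Primitive (Int) key: before this fix the grouping column appears as "value";
// with the fix it is named "key" here as well.
val primitiveKeyed = Seq(1, 2, 3).toDS()
primitiveKeyed.groupByKey(_ % 2).count().printSchema()
{code}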






[jira] [Resolved] (SPARK-26099) Verification of the corrupt column in from_csv/from_json

2018-11-21 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26099.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23070
[https://github.com/apache/spark/pull/23070]

> Verification of the corrupt column in from_csv/from_json
> 
>
> Key: SPARK-26099
> URL: https://issues.apache.org/jira/browse/SPARK-26099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> The corrupt column specified via the JSON/CSV option *columnNameOfCorruptRecord*
> must be of string type and nullable. This check already exists in
> DataFrameReader and JSON/CSVFileFormat; the same check should be added to
> CsvToStructs and JsonToStructs.






[jira] [Assigned] (SPARK-26085) Key attribute of primitive type under typed aggregation should be named as "key" too

2018-11-21 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26085:
---

Assignee: Liang-Chi Hsieh

> Key attribute of primitive type under typed aggregation should be named as 
> "key" too
> 
>
> Key: SPARK-26085
> URL: https://issues.apache.org/jira/browse/SPARK-26085
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>
> When doing typed aggregation on a Dataset, for a complex key type the key
> attribute is named "key", but for a primitive key type it is named "value".
> The key attribute should also be named "key" for primitive types.






[jira] [Commented] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695491#comment-16695491
 ] 

Apache Spark commented on SPARK-26118:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/23115

> Make Jetty's requestHeaderSize configurable in Spark
> 
>
> Key: SPARK-26118
> URL: https://issues.apache.org/jira/browse/SPARK-26118
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> For long authorization fields, the request header size can exceed the
> default limit (8192 bytes); in that case Jetty replies with HTTP 413 (Request
> Entity Too Large).
> This issue may occur if the user is a member of many Active Directory user
> groups: the HTTP request to the server carries the Kerberos token in the
> WWW-Authenticate header, and the header size grows with the number of user
> groups.
> Currently there is no way in Spark to override this limit.






[jira] [Commented] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695493#comment-16695493
 ] 

Apache Spark commented on SPARK-26118:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/23115

> Make Jetty's requestHeaderSize configurable in Spark
> 
>
> Key: SPARK-26118
> URL: https://issues.apache.org/jira/browse/SPARK-26118
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> For long authorization fields, the request header size can exceed the
> default limit (8192 bytes); in that case Jetty replies with HTTP 413 (Request
> Entity Too Large).
> This issue may occur if the user is a member of many Active Directory user
> groups: the HTTP request to the server carries the Kerberos token in the
> WWW-Authenticate header, and the header size grows with the number of user
> groups.
> Currently there is no way in Spark to override this limit.






[jira] [Resolved] (SPARK-25935) Prevent null rows from JSON parser

2018-11-21 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25935.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22938
[https://github.com/apache/spark/pull/22938]

> Prevent null rows from JSON parser
> --
>
> Key: SPARK-25935
> URL: https://issues.apache.org/jira/browse/SPARK-25935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently, the JSON parser can produce nulls if it cannot detect any valid JSON
> token at the root level, see
> https://github.com/apache/spark/blob/4d6704db4d490bd1830ed3c757525f41058523e0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L402 .
> As a consequence, the from_json() function can produce null in
> PERMISSIVE mode. To prevent that, we need to throw an exception, which should
> be treated as a bad record and handled according to the specified mode.
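A small spark-shell sketch of the behavior being changed (the malformed input
string is arbitrary):

{code:scala}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val schema = new StructType().add("a", IntegerType)

// "#" is not a valid JSON token at the root level. Before this change,
// from_json could yield null for the whole struct even in PERMISSIVE mode;
// with the change, the input is treated as a bad record and handled according
// to the configured parse mode (e.g. a row of nulls in PERMISSIVE mode).
Seq("#").toDF("json")
  .select(from_json($"json", schema).as("parsed"))
  .show()
{code}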






[jira] [Assigned] (SPARK-25935) Prevent null rows from JSON parser

2018-11-21 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-25935:
---

Assignee: Maxim Gekk

> Prevent null rows from JSON parser
> --
>
> Key: SPARK-25935
> URL: https://issues.apache.org/jira/browse/SPARK-25935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Currently, the JSON parser can produce nulls if it cannot detect any valid JSON
> token at the root level, see
> https://github.com/apache/spark/blob/4d6704db4d490bd1830ed3c757525f41058523e0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala#L402 .
> As a consequence, the from_json() function can produce null in
> PERMISSIVE mode. To prevent that, we need to throw an exception, which should
> be treated as a bad record and handled according to the specified mode.






[jira] [Commented] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695449#comment-16695449
 ] 

Apache Spark commented on SPARK-26118:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/23114

> Make Jetty's requestHeaderSize configurable in Spark
> 
>
> Key: SPARK-26118
> URL: https://issues.apache.org/jira/browse/SPARK-26118
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> For long authorization fields, the request header size can exceed the
> default limit (8192 bytes); in that case Jetty replies with HTTP 413 (Request
> Entity Too Large).
> This issue may occur if the user is a member of many Active Directory user
> groups: the HTTP request to the server carries the Kerberos token in the
> WWW-Authenticate header, and the header size grows with the number of user
> groups.
> Currently there is no way in Spark to override this limit.






[jira] [Commented] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695448#comment-16695448
 ] 

Apache Spark commented on SPARK-26118:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/23114

> Make Jetty's requestHeaderSize configurable in Spark
> 
>
> Key: SPARK-26118
> URL: https://issues.apache.org/jira/browse/SPARK-26118
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> For long authorization fields, the request header size can exceed the
> default limit (8192 bytes); in that case Jetty replies with HTTP 413 (Request
> Entity Too Large).
> This issue may occur if the user is a member of many Active Directory user
> groups: the HTTP request to the server carries the Kerberos token in the
> WWW-Authenticate header, and the header size grows with the number of user
> groups.
> Currently there is no way in Spark to override this limit.






[jira] [Comment Edited] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-21 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695427#comment-16695427
 ] 

Ruslan Dautkhanov edited comment on SPARK-26019 at 11/22/18 12:42 AM:
--

Thank you [~irashid]

I confirm that swapping those two lines doesn't fix things.

The fix targets a race condition in accumulators.py: _start_update_server():
 # SocketServer.TCPServer defaults bind_and_activate to True
 [https://github.com/python/cpython/blob/2.7/Lib/SocketServer.py#L413]

 # Also {{handle()}} is defined in the derived class _UpdateRequestHandler here
 [https://github.com/apache/spark/blob/master/python/pyspark/accumulators.py#L232]

Please help review [https://github.com/apache/spark/pull/23113]

Basically, the fix is to bind and activate SocketServer.TCPServer only in the
dedicated thread that serves the AccumulatorServer, to avoid the race condition
that could happen if we start listening and accepting connections in the main
thread.

I manually verified that it fixes things for us.

Thank you.


was (Author: tagar):
Thank you [~irashid]

I confirm that swapping those two lines doesn't fix things.

Fixing race condition that happens in accumulators.py: _start_update_server()
 # SocketServer:TCPServer defaults bind_and_activate to True
[https://github.com/python/cpython/blob/2.7/Lib/SocketServer.py#L413]

 # Also {{handle()}} is defined in derived class _UpdateRequestHandler here
[https://github.com/apache/spark/blob/master/python/pyspark/accumulators.py#L232]

Please help review [https://github.com/apache/spark/pull/23113] 

Basically fix is to bind and activate SocketServer.TCPServer only in that 
dedicated thread to serve AccumulatorServer, 
to avoid race condition that happens that could happen if we start listening 
and accepting connections in main thread. 

I manually verified and it fixes things for us.

Thank you.

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems the error is flaky; on the next rerun it didn't happen.






[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-21 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695427#comment-16695427
 ] 

Ruslan Dautkhanov commented on SPARK-26019:
---

Thank you [~irashid]

I confirm that swapping those two lines doesn't fix things.

The fix targets a race condition in accumulators.py: _start_update_server():
 # SocketServer.TCPServer defaults bind_and_activate to True
 [https://github.com/python/cpython/blob/2.7/Lib/SocketServer.py#L413]

 # Also {{handle()}} is defined in the derived class _UpdateRequestHandler here
 [https://github.com/apache/spark/blob/master/python/pyspark/accumulators.py#L232]

Please help review [https://github.com/apache/spark/pull/23113]

Basically, the fix is to bind and activate SocketServer.TCPServer only in the
dedicated thread that serves the AccumulatorServer, to avoid the race condition
that could happen if we start listening and accepting connections in the main
thread.

I manually verified that it fixes things for us.

Thank you.

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems the error is flaky; on the next rerun it didn't happen.






[jira] [Assigned] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26019:


Assignee: (was: Apache Spark)

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems the error is flaky; on the next rerun it didn't happen.






[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695421#comment-16695421
 ] 

Apache Spark commented on SPARK-26019:
--

User 'Tagar' has created a pull request for this issue:
https://github.com/apache/spark/pull/23113

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems the error is flaky; on the next rerun it didn't happen.






[jira] [Assigned] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26019:


Assignee: Apache Spark

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Assignee: Apache Spark
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems the error is flaky; on the next rerun it didn't happen.






[jira] [Reopened] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-21 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov reopened SPARK-26019:
---

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems the error is flaky; on the next rerun it didn't happen.






[jira] [Resolved] (SPARK-26106) Prioritizes ML unittests over the doctests in PySpark

2018-11-21 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26106.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23078
[https://github.com/apache/spark/pull/23078]

> Prioritizes ML unittests over the doctests in PySpark
> -
>
> Key: SPARK-26106
> URL: https://issues.apache.org/jira/browse/SPARK-26106
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> Arguably, unit tests usually take longer than doctests, so we had better
> prioritize unit tests over doctests.
> Other modules already prioritize unit tests over doctests; it looks like the
> ML module was simply missed in the first place.






[jira] [Assigned] (SPARK-26106) Prioritizes ML unittests over the doctests in PySpark

2018-11-21 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-26106:


Assignee: Hyukjin Kwon

> Prioritizes ML unittests over the doctests in PySpark
> -
>
> Key: SPARK-26106
> URL: https://issues.apache.org/jira/browse/SPARK-26106
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> Arguably, unittests usually take longer than doctests, so we should prioritize 
> unittests over doctests.
> Other modules already prioritize unittests over doctests. It looks like the ML module 
> was simply missed in the first place.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25957) Skip building spark-r docker image if spark distribution does not have R support

2018-11-21 Thread Matt Cheah (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Cheah resolved SPARK-25957.

   Resolution: Fixed
Fix Version/s: 3.0.0

> Skip building spark-r docker image if spark distribution does not have R 
> support
> 
>
> Key: SPARK-25957
> URL: https://issues.apache.org/jira/browse/SPARK-25957
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Nagaram Prasad Addepally
>Priority: Major
> Fix For: 3.0.0
>
>
> The [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh]
>  script tries to build the spark-r image by default. We may not always build 
> the Spark distribution with R support. It would be good to skip building and 
> publishing the spark-r image if R support is not available in the Spark 
> distribution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22865) Publish Official Apache Spark Docker images

2018-11-21 Thread Andrew Korzhuev (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695338#comment-16695338
 ] 

Andrew Korzhuev commented on SPARK-22865:
-

I've added builds for vanilla 2.3.1, 2.3.2 and 2.4.0 images; you can grab them 
from [https://hub.docker.com/r/andrusha/spark-k8s/tags/], and the scripts were also 
updated. I would love to help make this happen for Spark, but I need somebody 
to show me around the Apache CI/CD infrastructure.

> Publish Official Apache Spark Docker images
> ---
>
> Key: SPARK-22865
> URL: https://issues.apache.org/jira/browse/SPARK-22865
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26127) Remove deprecated setters from tree regression and classification models

2018-11-21 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26127:
-

Assignee: Marco Gaido

> Remove deprecated setters from tree regression and classification models
> 
>
> Key: SPARK-26127
> URL: https://issues.apache.org/jira/browse/SPARK-26127
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Trivial
>
> Many {{set***}} methods are present on the regression and classification 
> tree models. They are useless, have been deprecated since 2.1, and are targeted 
> for removal in 3.0. This JIRA tracks their removal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26127) Remove deprecated setters from tree regression and classification models

2018-11-21 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26127.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23093
[https://github.com/apache/spark/pull/23093]

> Remove deprecated setters from tree regression and classification models
> 
>
> Key: SPARK-26127
> URL: https://issues.apache.org/jira/browse/SPARK-26127
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Trivial
> Fix For: 3.0.0
>
>
> Many {{set***}} methods are present on the regression and classification 
> tree models. They are useless, have been deprecated since 2.1, and are targeted 
> for removal in 3.0. This JIRA tracks their removal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22865) Publish Official Apache Spark Docker images

2018-11-21 Thread Andrew Korzhuev (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695338#comment-16695338
 ] 

Andrew Korzhuev edited comment on SPARK-22865 at 11/21/18 11:00 PM:


I've added builds for vanilla 2.3.1, 2.3.2 and 2.4.0 images, you can grab them 
from [https://hub.docker.com/r/andrusha/spark-k8s/tags/] scripts were also 
updated. I would love to help making this happen for Spark, but I need somebody 
to show me around Apache CI/CD infrastructure.


was (Author: akorzhuev):
I've added builds for vanilla 2.3.1, 2.3.2 and 2.4.0 images, you can grab them 
from 
[https://hub.docker.com/r/andrusha/spark-k8s/tags/|https://hub.docker.com/r/andrusha/spark-k8s/tags/,]
 scripts were also updated. I would love to help making this happen for Spark, 
but I need somebody to show me around Apache CI/CD infrastructure.

> Publish Official Apache Spark Docker images
> ---
>
> Key: SPARK-22865
> URL: https://issues.apache.org/jira/browse/SPARK-22865
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22865) Publish Official Apache Spark Docker images

2018-11-21 Thread Andrew Korzhuev (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695338#comment-16695338
 ] 

Andrew Korzhuev edited comment on SPARK-22865 at 11/21/18 11:00 PM:


I've added builds for vanilla 2.3.1, 2.3.2 and 2.4.0 images, you can grab them 
from 
[https://hub.docker.com/r/andrusha/spark-k8s/tags/|https://hub.docker.com/r/andrusha/spark-k8s/tags/,]
 scripts were also updated. I would love to help making this happen for Spark, 
but I need somebody to show me around Apache CI/CD infrastructure.


was (Author: akorzhuev):
I've added builds for vanilla 2.3.1, 2.3.2 and 2.4.0 images, you can grab them 
from 
[https://hub.docker.com/r/andrusha/spark-k8s/tags/|https://hub.docker.com/r/andrusha/spark-k8s/tags/,]
 scripts were also updated. I would love to help making this happen for Spark, 
but I need somebody to show me around Apache CI/CD infrastructure.

> Publish Official Apache Spark Docker images
> ---
>
> Key: SPARK-22865
> URL: https://issues.apache.org/jira/browse/SPARK-22865
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22865) Publish Official Apache Spark Docker images

2018-11-21 Thread Andrew Korzhuev (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695338#comment-16695338
 ] 

Andrew Korzhuev edited comment on SPARK-22865 at 11/21/18 10:59 PM:


I've added builds for vanilla 2.3.1, 2.3.2 and 2.4.0 images, you can grab them 
from 
[https://hub.docker.com/r/andrusha/spark-k8s/tags/|https://hub.docker.com/r/andrusha/spark-k8s/tags/,]
 scripts were also updated. I would love to help making this happen for Spark, 
but I need somebody to show me around Apache CI/CD infrastructure.


was (Author: akorzhuev):
I've added builds for vanilla 2.3.1, 2.3.2 and 2.4.0 images, you can grab them 
from [https://hub.docker.com/r/andrusha/spark-k8s/tags/,] scripts were also 
updated. I would love to help making this happen for Spark, but I need somebody 
to show me around Apache CI/CD infrastructure.

> Publish Official Apache Spark Docker images
> ---
>
> Key: SPARK-22865
> URL: https://issues.apache.org/jira/browse/SPARK-22865
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26129) Instrumentation for query planning time

2018-11-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695316#comment-16695316
 ] 

Apache Spark commented on SPARK-26129:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23110

> Instrumentation for query planning time
> ---
>
> Key: SPARK-26129
> URL: https://issues.apache.org/jira/browse/SPARK-26129
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
> Fix For: 3.0.0
>
>
> We currently don't have good visibility into query planning time (analysis vs 
> optimization vs physical planning). This patch adds a simple utility to track 
> the runtime of various rules and various planning phases.
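To make the idea concrete, here is a toy sketch of per-phase timing in Python; it only illustrates the concept and is not the actual utility added by this patch:
{code:python}
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock time per planning phase.
phase_times = defaultdict(float)

@contextmanager
def track_phase(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_times[name] += time.perf_counter() - start

# Stand-ins for running analyzer and optimizer rules.
with track_phase("analysis"):
    time.sleep(0.01)
with track_phase("optimization"):
    time.sleep(0.02)

print(dict(phase_times))
{code}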



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26075) Cannot broadcast the table that is larger than 8GB : Spark 2.3

2018-11-21 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695261#comment-16695261
 ] 

Maxim Gekk commented on SPARK-26075:


The restriction of 8GB still exists 
https://github.com/apache/spark/blob/79c66894296840cc4a5bf6c8718ecfd2b08bcca8/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L102-L105
and it is related to the maximum amount of memory that can be allocated for the hash 
relation.

> Upon checking the size of the dataframes its merely 50 MB and I have set the 
> threshold to 200 MB as well.

The decision to use a Broadcast Hash Join is not based on the actual size of your 
dataframes. Spark tries to estimate the sizes in advance. If it makes a 
mistake, it may try to broadcast a huge relation. So the problem is in the 
size estimation of the build relations.

> However, Disabling the broadcasting is working fine. 
> 'spark.sql.autoBroadcastJoinThreshold': '-1'

Right, because that disables Broadcast Hash Join entirely.
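For illustration, a minimal PySpark sketch of the two knobs discussed above; the dataframes are made up and stand in for the reporter's tables:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Disable automatic broadcast joins entirely, as in the workaround above.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

# Hypothetical dataframes; in the real job these come from the reporter's tables.
left = spark.range(1000).withColumnRenamed("id", "key")
right = spark.range(100).withColumnRenamed("id", "key")

# Alternatively, force a broadcast of one side explicitly instead of relying
# on Spark's size estimation of the build relation.
joined = left.join(broadcast(right), "key")
joined.explain()
{code}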

> Cannot broadcast the table that is larger than 8GB : Spark 2.3
> --
>
> Key: SPARK-26075
> URL: https://issues.apache.org/jira/browse/SPARK-26075
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Neeraj Bhadani
>Priority: Major
>
> I am trying to use a broadcast join but am getting the error below in Spark 2.3. 
> However, the same code works fine in Spark 2.2.
>  
> Upon checking, the size of the dataframes is merely 50 MB, and I have set the 
> threshold to 200 MB as well. As I mentioned above, the same code works fine 
> in Spark 2.2.
>  
> {{Error: "Cannot broadcast the table that is larger than 8GB". }}
> However, disabling broadcasting works fine.
> {{'spark.sql.autoBroadcastJoinThreshold': '-1'}}
>  
> {{Regards,}}
> {{Neeraj}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26069) Flaky test: RpcIntegrationSuite.sendRpcWithStreamFailures

2018-11-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695201#comment-16695201
 ] 

Apache Spark commented on SPARK-26069:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/23109

> Flaky test: RpcIntegrationSuite.sendRpcWithStreamFailures
> -
>
> Key: SPARK-26069
> URL: https://issues.apache.org/jira/browse/SPARK-26069
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> {code}
> sbt.ForkMain$ForkError: java.lang.AssertionError: expected:<1> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.spark.network.RpcIntegrationSuite.assertErrorAndClosed(RpcIntegrationSuite.java:386)
>   at 
> org.apache.spark.network.RpcIntegrationSuite.sendRpcWithStreamFailures(RpcIntegrationSuite.java:347)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:27)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
>   at com.novocode.junit.JUnitRunner$1.execute(JUnitRunner.java:132)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25993) Add test cases for resolution of ORC table location

2018-11-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25993:


Assignee: Apache Spark

> Add test cases for resolution of ORC table location
> ---
>
> Key: SPARK-25993
> URL: https://issues.apache.org/jira/browse/SPARK-25993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.3.2
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>  Labels: starter
>
> Add a test case based on the following example. The behavior was changed in 
> the 2.3 release. We also need to update the migration guide.
> {code:java}
> val someDF1 = Seq(
>   (1, 1, "blah"),
>   (1, 2, "blahblah")
> ).toDF("folder", "number", "word").repartition(1)
> someDF1.write.orc("/tmp/orctab1/dir1/")
> someDF1.write.orc("/mnt/orctab1/dir2/")
> create external table tab1(folder int, number int, word string) STORED AS ORC 
> LOCATION '/tmp/orctab1/';
> select * from tab1;
> create external table tab2(folder int, number int, word string) STORED AS ORC 
> LOCATION '/tmp/orctab1/*';
> select * from tab2;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25993) Add test cases for resolution of ORC table location

2018-11-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695194#comment-16695194
 ] 

Apache Spark commented on SPARK-25993:
--

User 'kevinyu98' has created a pull request for this issue:
https://github.com/apache/spark/pull/23108

> Add test cases for resolution of ORC table location
> ---
>
> Key: SPARK-25993
> URL: https://issues.apache.org/jira/browse/SPARK-25993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.3.2
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> Add a test case based on the following example. The behavior was changed in 
> the 2.3 release. We also need to update the migration guide.
> {code:java}
> val someDF1 = Seq(
>   (1, 1, "blah"),
>   (1, 2, "blahblah")
> ).toDF("folder", "number", "word").repartition(1)
> someDF1.write.orc("/tmp/orctab1/dir1/")
> someDF1.write.orc("/mnt/orctab1/dir2/")
> create external table tab1(folder int, number int, word string) STORED AS ORC 
> LOCATION '/tmp/orctab1/';
> select * from tab1;
> create external table tab2(folder int, number int, word string) STORED AS ORC 
> LOCATION '/tmp/orctab1/*';
> select * from tab2;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25993) Add test cases for resolution of ORC table location

2018-11-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25993:


Assignee: (was: Apache Spark)

> Add test cases for resolution of ORC table location
> ---
>
> Key: SPARK-25993
> URL: https://issues.apache.org/jira/browse/SPARK-25993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.3.2
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> Add a test case based on the following example. The behavior was changed in 
> the 2.3 release. We also need to update the migration guide.
> {code:java}
> val someDF1 = Seq(
>   (1, 1, "blah"),
>   (1, 2, "blahblah")
> ).toDF("folder", "number", "word").repartition(1)
> someDF1.write.orc("/tmp/orctab1/dir1/")
> someDF1.write.orc("/mnt/orctab1/dir2/")
> create external table tab1(folder int, number int, word string) STORED AS ORC 
> LOCATION '/tmp/orctab1/';
> select * from tab1;
> create external table tab2(folder int, number int, word string) STORED AS ORC 
> LOCATION '/tmp/orctab1/*';
> select * from tab2;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25153) Improve error messages for columns with dots/periods

2018-11-21 Thread Bradley LaVigne (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695162#comment-16695162
 ] 

Bradley LaVigne commented on SPARK-25153:
-

I'll take a crack at this one; I took a look at the source (albeit in 2.2.1), 
and believe I understand the work to be done. I'll have some free time over the 
holiday weekend, and hope to have something by Monday.

> Improve error messages for columns with dots/periods
> 
>
> Key: SPARK-25153
> URL: https://issues.apache.org/jira/browse/SPARK-25153
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>  Labels: starter
>
> When we fail to resolve a column name with a dot in it, and the column name 
> is present as a string literal, the error message could mention using 
> backticks to have the string treated as a literal column name.
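A small PySpark illustration of the situation being described; the column names are made up:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A DataFrame with a literal dot in one of its column names.
df = spark.createDataFrame([(1, "x")], ["a.b", "c"])

# df.select("a.b") fails to resolve, because "a.b" is parsed as field "b" of column "a".

# Backticks make Spark treat the whole string as a single column name.
df.select("`a.b`").show()
{code}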



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26143) Shuffle shuffle default storage level

2018-11-21 Thread Avi minsky (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avi minsky updated SPARK-26143:
---
Summary: Shuffle shuffle default storage level  (was: Shuffle shuffle 
default persist type)

> Shuffle shuffle default storage level
> -
>
> Key: SPARK-26143
> URL: https://issues.apache.org/jira/browse/SPARK-26143
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 2.3.0
>Reporter: Avi minsky
>Priority: Major
>
> Currently a developer can set the storage level explicitly only via the persist 
> command, but shuffling can occur in many cases (group by, join, etc.). Why can't we 
> set a default storage level for shuffle data? This could be helpful in many cases (it 
> would also automatically allow replication of shuffle blocks if set to MEMORY_ONLY_2, 
> for example).
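For context, a small PySpark sketch of what is possible today: the storage level can be chosen explicitly per persisted RDD or DataFrame, but there is no equivalent default for the blocks produced by a shuffle (the data here is made up):
{code:python}
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1000)).map(lambda x: (x % 10, x))

# Explicit, per-RDD storage level with replication.
rdd.persist(StorageLevel.MEMORY_ONLY_2)

# The shuffle triggered by this aggregation uses Spark's built-in shuffle
# storage; there is no configuration like the one proposed here to give it
# a default storage level.
print(rdd.reduceByKey(lambda a, b: a + b).count())
{code}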



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26143) Shuffle shuffle default persist type

2018-11-21 Thread Avi minsky (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avi minsky updated SPARK-26143:
---
Description: Currently developer can set storage level explicitly only on 
persist command but shuffling can occur in many cases (group by, join, etc..) 
why can we set a default persist type, this can be helpful in many cases (also 
will automatically allow replication of shuffle blocks if set to memory_only_2, 
for example)  (was: Currently developer can set persist type explicitly only on 
persist command but shuffling can occur in many cases (group by, join, etc..) 
why can we set a default shuffle mode, this can be helpful in many cases (also 
will automatically allow replication of shuffle blocks if set to memory_only_2, 
for example))

> Shuffle shuffle default persist type
> 
>
> Key: SPARK-26143
> URL: https://issues.apache.org/jira/browse/SPARK-26143
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 2.3.0
>Reporter: Avi minsky
>Priority: Major
>
> Currently a developer can set the storage level explicitly only via the persist 
> command, but shuffling can occur in many cases (group by, join, etc.). Why can't we 
> set a default storage level for shuffle data? This could be helpful in many cases (it 
> would also automatically allow replication of shuffle blocks if set to MEMORY_ONLY_2, 
> for example).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26143) Shuffle shuffle default persist type

2018-11-21 Thread Avi minsky (JIRA)
Avi minsky created SPARK-26143:
--

 Summary: Shuffle shuffle default persist type
 Key: SPARK-26143
 URL: https://issues.apache.org/jira/browse/SPARK-26143
 Project: Spark
  Issue Type: New Feature
  Components: Shuffle
Affects Versions: 2.3.0
Reporter: Avi minsky


Currently a developer can set the persist type explicitly only via the persist command, 
but shuffling can occur in many cases (group by, join, etc.). Why can't we set a 
default shuffle mode? This could be helpful in many cases (it would also 
automatically allow replication of shuffle blocks if set to MEMORY_ONLY_2, for 
example).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26066) Moving truncatedString to sql/catalyst

2018-11-21 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26066.
---
Resolution: Fixed
  Assignee: Maxim Gekk

This is resolved via https://github.com/apache/spark/pull/23039

> Moving truncatedString to sql/catalyst
> --
>
> Key: SPARK-26066
> URL: https://issues.apache.org/jira/browse/SPARK-26066
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> The truncatedString method is used to convert elements of TreeNodes and 
> expressions to strings, and it is called only from sql.* packages. The ticket aims 
> to move the method out of core. We also need to introduce a SQL config to 
> control the maximum number of fields shown by default.
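For readers unfamiliar with the helper, here is a rough Python rendition of the behaviour it provides; this is illustrative only and does not match Spark's actual Scala signature:
{code:python}
def truncated_string(items, sep=", ", max_fields=25):
    """Join items, capping the output at max_fields entries and noting how many were omitted."""
    items = list(items)
    if len(items) <= max_fields:
        return sep.join(str(x) for x in items)
    shown = sep.join(str(x) for x in items[:max_fields])
    return "%s%s... %d more fields" % (shown, sep, len(items) - max_fields)

print(truncated_string(range(5), max_fields=3))  # 0, 1, 2, ... 2 more fields
{code}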



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26066) Moving truncatedString to sql/catalyst

2018-11-21 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26066:
--
Fix Version/s: 3.0.0

> Moving truncatedString to sql/catalyst
> --
>
> Key: SPARK-26066
> URL: https://issues.apache.org/jira/browse/SPARK-26066
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> The truncatedString method is used to convert elements of TreeNodes and 
> expressions to strings, and it is called only from sql.* packages. The ticket aims 
> to move the method out of core. We also need to introduce a SQL config to 
> control the maximum number of fields shown by default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18180) pyspark.sql.Row does not serialize well to json

2018-11-21 Thread Oleg V Korchagin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694971#comment-16694971
 ] 

Oleg V Korchagin commented on SPARK-18180:
--

I'll take a look on this if there are no objections.

> pyspark.sql.Row does not serialize well to json
> ---
>
> Key: SPARK-18180
> URL: https://issues.apache.org/jira/browse/SPARK-18180
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
> Environment: HDP 2.3.4, Spark 2.0.1, 
>Reporter: Miguel Cabrera
>Priority: Major
>
> {{Row}} does not serialize well automatically. Although Rows are dict-like in 
> Python, the json module does not seem to be able to serialize them.
> {noformat}
> from  pyspark.sql import Row
> import json
> r = Row(field1='hello', field2='world')
> json.dumps(r)
> {noformat}
> Results:
> {noformat}
> '["hello", "world"]'
> {noformat}
> Expected:
> {noformat}
> {'field1': 'hello', 'field2': 'world'}
> {noformat}
> The workaround is to call the {{asDict()}} method of Row. However, this 
> makes custom serialization of nested objects really painful, as the person has 
> to be aware that they are serializing a Row object. In particular, with SPARK-17695, 
> you cannot serialize DataFrames easily if you have some empty or null 
> fields, so you have to customize the serialization process.
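A short sketch of the workaround described above, using the values from the example:
{code:python}
import json

from pyspark.sql import Row

r = Row(field1='hello', field2='world')

# json.dumps(r) serializes the Row as a plain list and loses the field names.
print(json.dumps(r))           # ["hello", "world"]

# Converting to a dict first keeps the field names.
print(json.dumps(r.asDict()))  # {"field1": "hello", "field2": "world"}
{code}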



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25919) Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned

2018-11-21 Thread Pawan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pawan updated SPARK-25919:
--
Priority: Blocker  (was: Major)

> Date value corrupts when tables are "ParquetHiveSerDe" formatted and target 
> table is Partitioned
> 
>
> Key: SPARK-25919
> URL: https://issues.apache.org/jira/browse/SPARK-25919
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, Spark Submit
>Affects Versions: 2.1.0, 2.2.1
>Reporter: Pawan
>Priority: Blocker
>
> Hi
> I found a really strange issue. Below are the steps to reproduce it. This 
> issue occurs only when the table row format is ParquetHiveSerDe and the 
> target table is Partitioned
> *Hive:*
> Login in to hive terminal on cluster and create below tables.
> {code:java}
> create table t_src(
> name varchar(10),
> dob timestamp
> )
> ROW FORMAT SERDE 
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
> STORED AS INPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
> OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> create table t_tgt(
> name varchar(10),
> dob timestamp
> )
> PARTITIONED BY (city varchar(10))
> ROW FORMAT SERDE 
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
> STORED AS INPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
> OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
> {code}
> Insert data into the source table (t_src)
> {code:java}
> INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 
> 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 
> 00:00:00.0');{code}
> *Spark-shell:*
> Get on to spark-shell. 
> Execute below commands on spark shell:
> {code:java}
> import org.apache.spark.sql.hive.HiveContext
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> val q0 = "TRUNCATE table t_tgt"
> val q1 = "SELECT CAST(alias.name AS STRING) as a0, alias.dob as a1 FROM 
> DEFAULT.t_src alias"
> val q2 = "INSERT INTO TABLE DEFAULT.t_tgt PARTITION (city) SELECT tbl0.a0 as 
> c0, tbl0.a1 as c1, NULL as c2 FROM tbl0"
> sqlContext.sql(q0)
> sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0")
> sqlContext.sql(q2)
> {code}
>  After this check the contents of target table t_tgt. You will see the date 
> "0001-01-01 00:00:00" changed to "0002-01-01 00:00:00". Below snippets shows 
> the contents of both the tables:
> {code:java}
> select * from t_src;
> +------------+------------------------+
> | t_src.name | t_src.dob              |
> +------------+------------------------+
> | p1         | 0001-01-01 00:00:00.0  |
> | p2         | 0002-01-01 00:00:00.0  |
> | p3         | 0003-01-01 00:00:00.0  |
> | p4         | 0004-01-01 00:00:00.0  |
> +------------+------------------------+
> select * from t_tgt;
> +------------+------------------------+------------+
> | t_src.name | t_src.dob              | t_tgt.city |
> +------------+------------------------+------------+
> | p1         | 0002-01-01 00:00:00.0  | __HIVE_DEF |
> | p2         | 0002-01-01 00:00:00.0  | __HIVE_DEF |
> | p3         | 0003-01-01 00:00:00.0  | __HIVE_DEF |
> | p4         | 0004-01-01 00:00:00.0  | __HIVE_DEF |
> +------------+------------------------+------------+
> {code}
>  
> Is this a known issue? Is it fixed in any subsequent releases?
> Thanks & regards,
> Pawan Lawale



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26141) Enable custom shuffle metrics implementation in shuffle write

2018-11-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26141:


Assignee: Reynold Xin  (was: Apache Spark)

> Enable custom shuffle metrics implementation in shuffle write
> -
>
> Key: SPARK-26141
> URL: https://issues.apache.org/jira/browse/SPARK-26141
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26141) Enable custom shuffle metrics implementation in shuffle write

2018-11-21 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-26141:

Summary: Enable custom shuffle metrics implementation in shuffle write  
(was: Enable passing in custom shuffle metrics implementation in shuffle write)

> Enable custom shuffle metrics implementation in shuffle write
> -
>
> Key: SPARK-26141
> URL: https://issues.apache.org/jira/browse/SPARK-26141
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26141) Enable custom shuffle metrics implementation in shuffle write

2018-11-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694878#comment-16694878
 ] 

Apache Spark commented on SPARK-26141:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23106

> Enable custom shuffle metrics implementation in shuffle write
> -
>
> Key: SPARK-26141
> URL: https://issues.apache.org/jira/browse/SPARK-26141
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26141) Enable custom shuffle metrics implementation in shuffle write

2018-11-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26141:


Assignee: Apache Spark  (was: Reynold Xin)

> Enable custom shuffle metrics implementation in shuffle write
> -
>
> Key: SPARK-26141
> URL: https://issues.apache.org/jira/browse/SPARK-26141
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26142) Implement shuffle read metrics in SQL

2018-11-21 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-26142:
---

 Summary: Implement shuffle read metrics in SQL
 Key: SPARK-26142
 URL: https://issues.apache.org/jira/browse/SPARK-26142
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Reynold Xin






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8288) ScalaReflection should also try apply methods defined in companion objects when inferring schema from a Product type

2018-11-21 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-8288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8288.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23062
[https://github.com/apache/spark/pull/23062]

> ScalaReflection should also try apply methods defined in companion objects 
> when inferring schema from a Product type
> 
>
> Key: SPARK-8288
> URL: https://issues.apache.org/jira/browse/SPARK-8288
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Cheng Lian
>Assignee: Drew Robb
>Priority: Major
> Fix For: 3.0.0
>
>
> This ticket is derived from PARQUET-293 (which actually describes a Spark SQL 
> issue).
> My comment on that issue quoted below:
> {quote}
> ...  The reason of this exception is that, the Scala code Scrooge generates 
> is actually a trait extending {{Product}}:
> {code}
> trait Junk
>   extends ThriftStruct
>   with scala.Product2[Long, String]
>   with java.io.Serializable
> {code}
> while Spark expects a case class, something like:
> {code}
> case class Junk(junkID: Long, junkString: String)
> {code}
> The key difference here is that the latter case class version has a 
> constructor whose arguments can be transformed into fields of the DataFrame 
> schema.  The exception was thrown because Spark can't find such a constructor 
> from trait {{Junk}}.
> {quote}
> We can make {{ScalaReflection}} try {{apply}} methods in companion objects, 
> so that trait types generated by Scrooge can also be used for Spark SQL 
> schema inference.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26140) Enable custom shuffle metrics reporter in shuffle reader

2018-11-21 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-26140:

Summary: Enable custom shuffle metrics reporter in shuffle reader  (was: 
Enable custom shuffle metrics reporter into shuffle reader)

> Enable custom shuffle metrics reporter in shuffle reader
> 
>
> Key: SPARK-26140
> URL: https://issues.apache.org/jira/browse/SPARK-26140
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> The first step is to pull the creation of TempShuffleReadMetrics out of the shuffle 
> layer, so it can be driven by an external caller. Then, in SQL execution, we can 
> pass in a special metrics reporter that allows updating 
> ShuffleExchangeExec's metrics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26140) Enable custom shuffle metrics implementation in shuffle reader

2018-11-21 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-26140:

Summary: Enable custom shuffle metrics implementation in shuffle reader  
(was: Enable custom shuffle metrics reporter in shuffle reader)

> Enable custom shuffle metrics implementation in shuffle reader
> --
>
> Key: SPARK-26140
> URL: https://issues.apache.org/jira/browse/SPARK-26140
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> The first step is to pull the creation of TempShuffleReadMetrics out of the shuffle 
> layer, so it can be driven by an external caller. Then, in SQL execution, we can 
> pass in a special metrics reporter that allows updating 
> ShuffleExchangeExec's metrics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26140) Enable custom shuffle metrics reporter into shuffle reader

2018-11-21 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-26140:

Summary: Enable custom shuffle metrics reporter into shuffle reader  (was: 
Enable passing in a custom shuffle metrics reporter into shuffle reader)

> Enable custom shuffle metrics reporter into shuffle reader
> --
>
> Key: SPARK-26140
> URL: https://issues.apache.org/jira/browse/SPARK-26140
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> The first step is to pull the creation of TempShuffleReadMetrics out of the shuffle 
> layer, so it can be driven by an external caller. Then, in SQL execution, we can 
> pass in a special metrics reporter that allows updating 
> ShuffleExchangeExec's metrics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26141) Enable passing in custom shuffle metrics implementation in shuffle write

2018-11-21 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-26141:
---

 Summary: Enable passing in custom shuffle metrics implementation 
in shuffle write
 Key: SPARK-26141
 URL: https://issues.apache.org/jira/browse/SPARK-26141
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26109) Duration in the task summary metrics table and the task table are different

2018-11-21 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26109.
---
   Resolution: Fixed
Fix Version/s: 2.4.1
   3.0.0
   2.3.3

Issue resolved by pull request 23081
[https://github.com/apache/spark/pull/23081]

> Duration in the task summary metrics table and the task table are different
> ---
>
> Key: SPARK-26109
> URL: https://issues.apache.org/jira/browse/SPARK-26109
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2, 2.4.0
>Reporter: shahid
>Assignee: shahid
>Priority: Minor
> Fix For: 2.3.3, 3.0.0, 2.4.1
>
>
> The duration in the summary metrics table and the tasks table differs even 
> though the other metrics in the tables are correct.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26129) Instrumentation for query planning time

2018-11-21 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-26129.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> Instrumentation for query planning time
> ---
>
> Key: SPARK-26129
> URL: https://issues.apache.org/jira/browse/SPARK-26129
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
> Fix For: 3.0.0
>
>
> We currently don't have good visibility into query planning time (analysis vs 
> optimization vs physical planning). This patch adds a simple utility to track 
> the runtime of various rules and various planning phases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8288) ScalaReflection should also try apply methods defined in companion objects when inferring schema from a Product type

2018-11-21 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-8288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-8288:


Assignee: Drew Robb

> ScalaReflection should also try apply methods defined in companion objects 
> when inferring schema from a Product type
> 
>
> Key: SPARK-8288
> URL: https://issues.apache.org/jira/browse/SPARK-8288
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Cheng Lian
>Assignee: Drew Robb
>Priority: Major
>
> This ticket is derived from PARQUET-293 (which actually describes a Spark SQL 
> issue).
> My comment on that issue quoted below:
> {quote}
> ...  The reason of this exception is that, the Scala code Scrooge generates 
> is actually a trait extending {{Product}}:
> {code}
> trait Junk
>   extends ThriftStruct
>   with scala.Product2[Long, String]
>   with java.io.Serializable
> {code}
> while Spark expects a case class, something like:
> {code}
> case class Junk(junkID: Long, junkString: String)
> {code}
> The key difference here is that the latter case class version has a 
> constructor whose arguments can be transformed into fields of the DataFrame 
> schema.  The exception was thrown because Spark can't find such a constructor 
> from trait {{Junk}}.
> {quote}
> We can make {{ScalaReflection}} try {{apply}} methods in companion objects, 
> so that trait types generated by Scrooge can also be used for Spark SQL 
> schema inference.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26109) Duration in the task summary metrics table and the task table are different

2018-11-21 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26109:
-

Assignee: shahid

> Duration in the task summary metrics table and the task table are different
> ---
>
> Key: SPARK-26109
> URL: https://issues.apache.org/jira/browse/SPARK-26109
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2, 2.4.0
>Reporter: shahid
>Assignee: shahid
>Priority: Minor
>
> The duration in the summary metrics table and the tasks table differs even 
> though the other metrics in the tables are correct.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25678) SPIP: Adding support in Spark for HPC cluster manager (PBS Professional)

2018-11-21 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25678.
---
Resolution: Won't Fix

If there is more work to be done to make resource managers pluggable, I'd put 
that under SPARK-19700. If it's about merging support for this cluster manager, 
then no, I am pretty certain that would not happen in Spark.

> SPIP: Adding support in Spark for HPC cluster manager (PBS Professional)
> 
>
> Key: SPARK-25678
> URL: https://issues.apache.org/jira/browse/SPARK-25678
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: Utkarsh Maheshwari
>Priority: Major
>
> I sent an email on the dev mailing list but got no response, hence filing a 
> JIRA ticket.
>  
> PBS (Portable Batch System) Professional is an open-sourced workload 
> management system for HPC clusters. Many organizations using PBS to manage 
> their cluster also use Spark for Big Data, but they are forced to divide the 
> cluster into a Spark cluster and a PBS cluster, either by physically dividing the 
> cluster nodes into two groups or by starting the Spark Standalone cluster manager's 
> Master and Slaves as PBS jobs, leading to underutilization of resources.
>  
>  I am trying to add support in Spark to use PBS as a pluggable cluster 
> manager. Going through the Spark codebase and looking at Mesos and Kubernetes 
> integration, I found that we can get this working as follows:
>  
>  - Extend `ExternalClusterManager`.
>  - Extend `CoarseGrainedSchedulerBackend`
>    - This class can start `Executors` as PBS jobs.
>    - The initial set of `Executors` is started in `onStart`.
>    - More `Executors` can be started as and when required using 
> `doRequestTotalExecutors`.
>    - `Executors` can be killed using `doKillExecutors`.
>  - Extend `SparkApplication` to start `Driver` as a PBS job in cluster deploy 
> mode.
>    - This extended class can submit the Spark application again as a PBS job 
> with deploy mode = client, so that the application driver is started on 
> a node in the cluster.
>  
>  I have a couple of questions:
>  - Does this seem like a good idea to do this or should we look at other 
> options?
>  - What are the expectations from the initial prototype?
>  - If this works, would Spark maintainers look forward to merging this or 
> would they want it to be maintained as a fork?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26140) Enable passing in a custom shuffle metrics reporter into shuffle reader

2018-11-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694815#comment-16694815
 ] 

Apache Spark commented on SPARK-26140:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23105

> Enable passing in a custom shuffle metrics reporter into shuffle reader
> ---
>
> Key: SPARK-26140
> URL: https://issues.apache.org/jira/browse/SPARK-26140
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> The first step is to pull the creation of TempShuffleReadMetrics out of the shuffle 
> layer, so it can be driven by an external caller. Then, in SQL execution, we can 
> pass in a special metrics reporter that allows updating 
> ShuffleExchangeExec's metrics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26140) Enable passing in a custom shuffle metrics reporter into shuffle reader

2018-11-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26140:


Assignee: Apache Spark  (was: Reynold Xin)

> Enable passing in a custom shuffle metrics reporter into shuffle reader
> ---
>
> Key: SPARK-26140
> URL: https://issues.apache.org/jira/browse/SPARK-26140
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>Priority: Major
>
> The first step is to pull the creation of TempShuffleReadMetrics out of the shuffle 
> layer, so it can be driven by an external caller. Then, in SQL execution, we can 
> pass in a special metrics reporter that allows updating 
> ShuffleExchangeExec's metrics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26140) Enable passing in a custom shuffle metrics reporter into shuffle reader

2018-11-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26140:


Assignee: Reynold Xin  (was: Apache Spark)

> Enable passing in a custom shuffle metrics reporter into shuffle reader
> ---
>
> Key: SPARK-26140
> URL: https://issues.apache.org/jira/browse/SPARK-26140
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> The first step is to pull the creation of TempShuffleReadMetrics out of the shuffle 
> layer, so it can be driven by an external caller. Then, in SQL execution, we can 
> pass in a special metrics reporter that allows updating 
> ShuffleExchangeExec's metrics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26140) Enable passing in a custom shuffle metrics reporter into shuffle reader

2018-11-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694813#comment-16694813
 ] 

Apache Spark commented on SPARK-26140:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23105

> Enable passing in a custom shuffle metrics reporter into shuffle reader
> ---
>
> Key: SPARK-26140
> URL: https://issues.apache.org/jira/browse/SPARK-26140
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> The first step is to pull the creation of TempShuffleReadMetrics out of the shuffle 
> layer, so it can be driven by an external caller. Then, in SQL execution, we can 
> pass in a special metrics reporter that allows updating 
> ShuffleExchangeExec's metrics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26140) Pull TempShuffleReadMetrics creation out of shuffle layer

2018-11-21 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-26140:
---

 Summary: Pull TempShuffleReadMetrics creation out of shuffle layer
 Key: SPARK-26140
 URL: https://issues.apache.org/jira/browse/SPARK-26140
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Reynold Xin
Assignee: Reynold Xin


The first step is to pull the creation of TempShuffleReadMetrics out of the shuffle 
layer, so it can be driven by an external caller. Then, in SQL execution, we can pass 
in a special metrics reporter that allows updating ShuffleExchangeExec's 
metrics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26140) Enable passing in a custom shuffle metrics reporter into shuffle reader

2018-11-21 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-26140:

Summary: Enable passing in a custom shuffle metrics reporter into shuffle 
reader  (was: Allow passing in a custom shuffle metrics reporter into shuffle 
reader)

> Enable passing in a custom shuffle metrics reporter into shuffle reader
> ---
>
> Key: SPARK-26140
> URL: https://issues.apache.org/jira/browse/SPARK-26140
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> The first step is to pull the creation of TempShuffleReadMetrics out of the 
> shuffle layer so that it can be driven by an external caller. Then, in SQL 
> execution, we can pass in a special metrics reporter that allows updating 
> ShuffleExchangeExec's metrics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26140) Allow passing in a custom shuffle metrics reporter into shuffle reader

2018-11-21 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-26140:

Summary: Allow passing in a custom shuffle metrics reporter into shuffle 
reader  (was: Pull TempShuffleReadMetrics creation out of shuffle layer)

> Allow passing in a custom shuffle metrics reporter into shuffle reader
> --
>
> Key: SPARK-26140
> URL: https://issues.apache.org/jira/browse/SPARK-26140
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> The first step is to pull the creation of TempShuffleReadMetrics out of the 
> shuffle layer so that it can be driven by an external caller. Then, in SQL 
> execution, we can pass in a special metrics reporter that allows updating 
> ShuffleExchangeExec's metrics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26139) Support passing shuffle metrics to exchange operator

2018-11-21 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-26139:
---

 Summary: Support passing shuffle metrics to exchange operator
 Key: SPARK-26139
 URL: https://issues.apache.org/jira/browse/SPARK-26139
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.0
Reporter: Reynold Xin
Assignee: Reynold Xin


Due to the way Spark is architected (SQL is defined on top of the RDD API), 
there are two separate metrics systems used in core vs. SQL. Ideally, we'd want 
to be able to get the shuffle metrics for each exchange operator independently, 
e.g. blocks read and number of records.
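
For illustration, the bridge between the two metrics systems could be as small 
as an adapter that forwards core shuffle-read counts into the SQLMetric 
instances an exchange operator already exposes. The adapter class below is an 
assumption for this sketch; only SQLMetric.add is an existing API.

{code:scala}
import org.apache.spark.sql.execution.metric.SQLMetric

// Hypothetical adapter: pushes shuffle-read counts from the core metrics
// system into the SQL metrics displayed for a ShuffleExchangeExec node.
class SQLShuffleMetricsAdapter(recordsRead: SQLMetric, remoteBytesRead: SQLMetric) {
  def onRecordsRead(n: Long): Unit = recordsRead.add(n)
  def onRemoteBytesRead(n: Long): Unit = remoteBytesRead.add(n)
}
{code}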

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26138) LimitPushDown cross join requires maybePushLocalLimit

2018-11-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26138:


Assignee: Apache Spark

> LimitPushDown cross join requires maybePushLocalLimit
> -
>
> Key: SPARK-26138
> URL: https://issues.apache.org/jira/browse/SPARK-26138
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: guoxiaolong
>Assignee: Apache Spark
>Priority: Minor
>
> In LimitPushDown batch, cross join can push down the limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26138) LimitPushDown cross join requires maybePushLocalLimit

2018-11-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26138:


Assignee: (was: Apache Spark)

> LimitPushDown cross join requires maybePushLocalLimit
> -
>
> Key: SPARK-26138
> URL: https://issues.apache.org/jira/browse/SPARK-26138
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: guoxiaolong
>Priority: Minor
>
> In LimitPushDown batch, cross join can push down the limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26138) LimitPushDown cross join requires maybePushLocalLimit

2018-11-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694712#comment-16694712
 ] 

Apache Spark commented on SPARK-26138:
--

User 'guoxiaolongzte' has created a pull request for this issue:
https://github.com/apache/spark/pull/23104

> LimitPushDown cross join requires maybePushLocalLimit
> -
>
> Key: SPARK-26138
> URL: https://issues.apache.org/jira/browse/SPARK-26138
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: guoxiaolong
>Priority: Minor
>
> In LimitPushDown batch, cross join can push down the limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26138) LimitPushDown cross join requires maybePushLocalLimit

2018-11-21 Thread guoxiaolong (JIRA)
guoxiaolong created SPARK-26138:
---

 Summary: LimitPushDown cross join requires maybePushLocalLimit
 Key: SPARK-26138
 URL: https://issues.apache.org/jira/browse/SPARK-26138
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0, 2.3.2
Reporter: guoxiaolong


In LimitPushDown batch, cross join can push down the limit.
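
For reference, a minimal example of the query shape this targets (table names 
are made up); the idea is that a LocalLimit could sit below both sides of the 
cross join in the optimized plan:

{code:scala}
// Hypothetical example: with LIMIT 10 over a cross join, each side only needs
// to produce at most 10 rows before the join.
val q = spark.sql("SELECT * FROM t1 CROSS JOIN t2 LIMIT 10")
q.explain(true)  // check whether LocalLimit appears below the Join in the optimized plan
{code}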



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26121) [Structured Streaming] Allow users to define prefix of Kafka's consumer group (group.id)

2018-11-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26121:


Assignee: (was: Apache Spark)

> [Structured Streaming] Allow users to define prefix of Kafka's consumer group 
> (group.id)
> 
>
> Key: SPARK-26121
> URL: https://issues.apache.org/jira/browse/SPARK-26121
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Anastasios Zouzias
>Priority: Minor
>
> I ran into the following situation with Spark Structured Streaming (SS) using 
> Kafka.
>  
> In a project that I work on, there is already a secured Kafka setup where ops 
> can issue an SSL certificate per {{group.id}}, which should be predefined (or 
> its prefix should be predefined).
>  
> On the other hand, Spark SS fixes the {{group.id}} to 
>  
> val uniqueGroupId = 
> s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"
>  
> see, i.e.,
>  
> [https://github.com/apache/spark/blob/v2.4.0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L124]
>  
> https://github.com/apache/spark/blob/v2.4.0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L81
>  
> I guess the Spark developers had a good reason to fix it, but is it possible 
> to make the prefix of the above uniqueGroupId ("spark-kafka-source") 
> configurable?
>  
> The rationale is that Spark users are not forced to use the same certificate 
> for group ids of the form spark-kafka-source-*.
>  
> DoD:
> * Allow Spark SS users to define the group.id prefix as an input parameter.
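
A sketch of the requested usage; the option name groupIdPrefix is only the 
proposal discussed here, not an option that exists in Spark 2.4.0, and the 
broker/topic values are illustrative:

{code:scala}
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .option("groupIdPrefix", "my-team")  // proposed: group.id becomes my-team-<uuid>-<hash>
  .load()
{code}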



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26121) [Structured Streaming] Allow users to define prefix of Kafka's consumer group (group.id)

2018-11-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694558#comment-16694558
 ] 

Apache Spark commented on SPARK-26121:
--

User 'zouzias' has created a pull request for this issue:
https://github.com/apache/spark/pull/23103

> [Structured Streaming] Allow users to define prefix of Kafka's consumer group 
> (group.id)
> 
>
> Key: SPARK-26121
> URL: https://issues.apache.org/jira/browse/SPARK-26121
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Anastasios Zouzias
>Priority: Minor
>
> I ran into the following situation with Spark Structured Streaming (SS) using 
> Kafka.
>  
> In a project that I work on, there is already a secured Kafka setup where ops 
> can issue an SSL certificate per {{group.id}}, which should be predefined (or 
> its prefix should be predefined).
>  
> On the other hand, Spark SS fixes the {{group.id}} to 
>  
> val uniqueGroupId = 
> s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"
>  
> see, i.e.,
>  
> [https://github.com/apache/spark/blob/v2.4.0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L124]
>  
> https://github.com/apache/spark/blob/v2.4.0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L81
>  
> I guess the Spark developers had a good reason to fix it, but is it possible 
> to make the prefix of the above uniqueGroupId ("spark-kafka-source") 
> configurable?
>  
> The rationale is that Spark users are not forced to use the same certificate 
> for group ids of the form spark-kafka-source-*.
>  
> DoD:
> * Allow Spark SS users to define the group.id prefix as an input parameter.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26121) [Structured Streaming] Allow users to define prefix of Kafka's consumer group (group.id)

2018-11-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26121:


Assignee: Apache Spark

> [Structured Streaming] Allow users to define prefix of Kafka's consumer group 
> (group.id)
> 
>
> Key: SPARK-26121
> URL: https://issues.apache.org/jira/browse/SPARK-26121
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Anastasios Zouzias
>Assignee: Apache Spark
>Priority: Minor
>
> I ran into the following situation with Spark Structured Streaming (SS) using 
> Kafka.
>  
> In a project that I work on, there is already a secured Kafka setup where ops 
> can issue an SSL certificate per {{group.id}}, which should be predefined (or 
> its prefix should be predefined).
>  
> On the other hand, Spark SS fixes the {{group.id}} to 
>  
> val uniqueGroupId = 
> s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"
>  
> see, i.e.,
>  
> [https://github.com/apache/spark/blob/v2.4.0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L124]
>  
> https://github.com/apache/spark/blob/v2.4.0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L81
>  
> I guess the Spark developers had a good reason to fix it, but is it possible 
> to make the prefix of the above uniqueGroupId ("spark-kafka-source") 
> configurable?
>  
> The rationale is that Spark users are not forced to use the same certificate 
> for group ids of the form spark-kafka-source-*.
>  
> DoD:
> * Allow Spark SS users to define the group.id prefix as an input parameter.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet leads to OOM errors

2018-11-21 Thread Pierre Lienhart (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694544#comment-16694544
 ] 

Pierre Lienhart commented on SPARK-26116:
-

OK, so I started from a situation where I had the above-described crashes and 
then increased the off-heap memory size by setting 
spark.yarn.executor.memoryOverhead to 4g, 8g and 16g, the other settings 
remaining the same. It still crashes with the same logs: the 
UnsafeExternalSorter keeps spilling sort data to disk while the 
TaskMemoryManager unsuccessfully tries to allocate more pages until {{ERROR 
CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM}}. Note that the stdout of 
the crashed executor mentions {{java.lang.OutOfMemoryError: Java heap space}}. 
Same thing with -XX:MaxDirectMemorySize.

I know that the error message suggests increasing 
spark.yarn.executor.memoryOverhead, but it does not seem to work in this case 
(and I forgot to mention in my first message that I had already tried it).
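
Not from the ticket itself, but a commonly suggested mitigation for this write 
pattern, sketched here with made-up column and path names: cluster the rows by 
the partition column before writing so each task handles fewer output 
partitions at once.

{code:scala}
import org.apache.spark.sql.functions.col

df.repartition(col("event_date"))   // cluster rows of one partition into one task
  .write
  .partitionBy("event_date")
  .parquet("/data/out")
{code}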

> Spark SQL - Sort when writing partitioned parquet leads to OOM errors
> -
>
> Key: SPARK-26116
> URL: https://issues.apache.org/jira/browse/SPARK-26116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Pierre Lienhart
>Priority: Major
>
> When writing partitioned parquet using {{partitionBy}}, it looks like Spark 
> sorts each partition before writing, but this sort consumes a huge amount of 
> memory compared to the size of the data. The executors can then go OOM and 
> get killed by YARN. As a consequence, it also forces us to provision huge 
> amounts of memory compared to the amount of data to be written.
> Error messages found in the Spark UI are like the following :
> {code:java}
> Spark UI description of failure : Job aborted due to stage failure: Task 169 
> in stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 
> 2.0 (TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure 
> (executor 1 exited caused by one of the running tasks) Reason: Container 
> killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory 
> used. Consider boosting spark.yarn.executor.memoryOverhead.
> {code}
>  
> {code:java}
> Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most 
> recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, 
> executor 1): org.apache.spark.SparkException: Task failed while writing rows
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:99)
>  at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.OutOfMemoryError: error while calling spill() on 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : 
> /app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad
>  (No such file or directory)
>  at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)
>  at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)
>  at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)
>  at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)
>  at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)
>  at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
>  at 
> 

[jira] [Assigned] (SPARK-26137) Linux file separator is hard coded in DependencyUtils used in deploy process

2018-11-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26137:


Assignee: Apache Spark

> Linux file separator is hard coded in DependencyUtils used in deploy process
> 
>
> Key: SPARK-26137
> URL: https://issues.apache.org/jira/browse/SPARK-26137
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Mark Pavey
>Assignee: Apache Spark
>Priority: Major
>
> During deployment, while downloading dependencies the code tries to remove 
> multiple copies of the application jar from the driver classpath. The Linux 
> file separator ("/") is hard coded here so on Windows multiple copies of the 
> jar are not removed.
> This has a knock on effect when trying to use elasticsearch-spark as this 
> library does not run if there are multiple copies of the application jar on 
> the driver classpath.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26137) Linux file separator is hard coded in DependencyUtils used in deploy process

2018-11-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26137:


Assignee: (was: Apache Spark)

> Linux file separator is hard coded in DependencyUtils used in deploy process
> 
>
> Key: SPARK-26137
> URL: https://issues.apache.org/jira/browse/SPARK-26137
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Mark Pavey
>Priority: Major
>
> During deployment, while downloading dependencies the code tries to remove 
> multiple copies of the application jar from the driver classpath. The Linux 
> file separator ("/") is hard coded here so on Windows multiple copies of the 
> jar are not removed.
> This has a knock on effect when trying to use elasticsearch-spark as this 
> library does not run if there are multiple copies of the application jar on 
> the driver classpath.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26137) Linux file separator is hard coded in DependencyUtils used in deploy process

2018-11-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694521#comment-16694521
 ] 

Apache Spark commented on SPARK-26137:
--

User 'markpavey' has created a pull request for this issue:
https://github.com/apache/spark/pull/23102

> Linux file separator is hard coded in DependencyUtils used in deploy process
> 
>
> Key: SPARK-26137
> URL: https://issues.apache.org/jira/browse/SPARK-26137
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Mark Pavey
>Priority: Major
>
> During deployment, while downloading dependencies the code tries to remove 
> multiple copies of the application jar from the driver classpath. The Linux 
> file separator ("/") is hard coded here so on Windows multiple copies of the 
> jar are not removed.
> This has a knock on effect when trying to use elasticsearch-spark as this 
> library does not run if there are multiple copies of the application jar on 
> the driver classpath.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25829) Duplicated map keys are not handled consistently

2018-11-21 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663118#comment-16663118
 ] 

Wenchen Fan edited comment on SPARK-25829 at 11/21/18 10:06 AM:


More investigation on "later entry wins".

If we still allow duplicated keys in maps physically, the following functions 
need to be updated:
Explode, PosExplode, GetMapValue, MapKeys, MapValues, MapEntries, 
TransformKeys, TransformValues, MapZipWith

If we want to forbid duplicated keys in maps, the following functions need to 
be updated:
CreateMap, MapFromArrays, MapFromEntries, StringToMap, MapConcat, 
TransformKeys, MapFilter, and also reading maps from data sources.

So the "later entry wins" semantic is more ideal but needs more work.


was (Author: cloud_fan):
More investigation on "later entry wins".

If we still allow duplicated keys in maps physically, the following functions 
need to be updated:
Explode, PosExplode, GetMapValue, MapKeys, MapValues, MapEntries, 
TransformKeys, TransformValues, MapZipWith

If we want to forbid duplicated keys in maps, the following functions need to 
be updated:
CreateMap, MapFromArrays, MapFromEntries, MapFromString, MapConcat, 
TransformKeys, MapFilter, and also reading maps from data sources.

So the "later entry wins" semantic is more ideal but needs more work.

> Duplicated map keys are not handled consistently
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark SQL, we apply "earlier entry wins" semantic to duplicated map keys. 
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +------------------+
> |map(1, 2, 1, 3)[1]|
> +------------------+
> |                 2|
> +------------------+
> {code}
> However, this handling is not applied consistently.
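
A quick spark-shell probe of the inconsistency; the point of this ticket is 
that these builders are not guaranteed to agree on duplicate keys, and the 
exact results of the last two lines depend on the Spark version:

{code:scala}
spark.sql("SELECT map(1, 2, 1, 3)[1]").show()                    // earlier entry wins here: 2
spark.sql("SELECT str_to_map('1:2,1:3')[1]").show()              // may differ
spark.sql("SELECT map_concat(map(1, 2), map(1, 3))[1]").show()   // may differ
{code}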



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25599) Stateful aggregation in PySpark

2018-11-21 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25599.
--
Resolution: Duplicate

> Stateful aggregation in PySpark
> ---
>
> Key: SPARK-25599
> URL: https://issues.apache.org/jira/browse/SPARK-25599
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Vincent Grosbois
>Priority: Minor
>
> Hi!
>  
> From PySpark, I am trying to define a custom aggregator *that is accumulating 
> state*. Is it possible in Spark 2.3 ?
> AFAIK, it is now possible to define a custom UDAF in PySpark since Spark 2.3 
> (cf [How to define and use a User-Defined Aggregate Function in Spark 
> SQL?|https://stackoverflow.com/questions/32100973/how-to-define-and-use-a-user-defined-aggregate-function-in-spark-sql]),
>  by calling {{pandas_udf}} with the {{PandasUDFType.GROUPED_AGG}} keyword.
> However, given that it just takes a function as a parameter, I don't think it 
> is possible to carry state around during the aggregation with this function.
> From Scala, I see it is possible to have stateful aggregation by either 
> extending {{UserDefinedAggregateFunction}} or 
> {{org.apache.spark.sql.expressions.Aggregator}} , but is there a similar 
> thing I can do on python-side only?
> If no, is this planned in a future release?
> thanks!
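
For reference, the Scala-side stateful aggregation the reporter refers to looks 
roughly like the sketch below (the class and value names are illustrative); the 
ticket asks for an equivalent that can be written purely in Python:

{code:scala}
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class AvgBuffer(sum: Double, count: Long)

// A stateful aggregator: the buffer carries running state across input rows.
object MyAvg extends Aggregator[Double, AvgBuffer, Double] {
  def zero: AvgBuffer = AvgBuffer(0.0, 0L)
  def reduce(b: AvgBuffer, a: Double): AvgBuffer = AvgBuffer(b.sum + a, b.count + 1)
  def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer =
    AvgBuffer(b1.sum + b2.sum, b1.count + b2.count)
  def finish(b: AvgBuffer): Double = if (b.count == 0) 0.0 else b.sum / b.count
  def bufferEncoder: Encoder[AvgBuffer] = Encoders.product[AvgBuffer]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
// usage: ds.select(MyAvg.toColumn) on a Dataset[Double]
{code}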



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26137) Linux file separator is hard coded in DependencyUtils used in deploy process

2018-11-21 Thread Mark Pavey (JIRA)
Mark Pavey created SPARK-26137:
--

 Summary: Linux file separator is hard coded in DependencyUtils 
used in deploy process
 Key: SPARK-26137
 URL: https://issues.apache.org/jira/browse/SPARK-26137
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 2.4.0, 2.3.2, 2.3.1, 2.3.0
Reporter: Mark Pavey


During deployment, while downloading dependencies the code tries to remove 
multiple copies of the application jar from the driver classpath. The Linux 
file separator ("/") is hard coded here so on Windows multiple copies of the 
jar are not removed.

This has a knock on effect when trying to use elasticsearch-spark as this 
library does not run if there are multiple copies of the application jar on the 
driver classpath.
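
The general shape of an OS-independent fix, shown as a hedged sketch rather 
than the actual patch, is to compare entries by file name instead of splitting 
on a hard-coded "/":

{code:scala}
import java.io.File

// Hypothetical helper: true when a classpath entry points at the same jar file
// name as the application jar, regardless of "/" or "\" separators.
def refersToAppJar(classpathEntry: String, appJarPath: String): Boolean =
  new File(classpathEntry).getName == new File(appJarPath).getName
{code}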



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26136) Row.getAs return null value in some condition

2018-11-21 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694470#comment-16694470
 ] 

Hyukjin Kwon commented on SPARK-26136:
--

That's minor tho. Please reopen and go ahead for a PR if you have a fix in your 
mind. Otherwise let's leave it resolved.

> Row.getAs return null value in some condition
> -
>
> Key: SPARK-26136
> URL: https://issues.apache.org/jira/browse/SPARK-26136
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0, 2.3.2, 2.4.0
> Environment: Windows 10
> JDK 1.8.0_181
> scala 2.11.12
> spark 2.4.0 / 2.3.2 / 2.3.0
>  
>Reporter: Charlie Feng
>Priority: Major
>
> {{Row.getAs("fieldName")}} will return null value when all below conditions 
> met:
>  * Used in {{DataFrame.flatMap()}}
>  * {{Another map()}} call inside {{flatMap}}
>  * call {{row.getAs("fieldName")}} inside a {{Tuple}}.
> Source code to reproduce the bug:
> {code}
> import org.apache.spark.sql.SparkSession
> object FlatMapGetAsBug {
> def main(args: Array[String]) {
>   val spark = 
> SparkSession.builder.appName("SparkUtil").master("local").getOrCreate
>   import spark.implicits._;
>   val df = Seq(("a1", "b1", "x,y,z")).toDF("A", "B", "XYZ")
>   df.show();
>   val df2 = df.flatMap { row =>
> row.getAs[String]("XYZ").split(",").map { xyz =>
>   var colA: String = row.getAs("A");
>   var col0: String = row.getString(0);
>   (row.getAs("A"), colA, row.getString(0), col0, row.getString(1), xyz)
> }
>   }.toDF("ColumnA_API1", "ColumnA_API2", "ColumnA_API3", "ColumnA_API4", 
> "ColumnB", "ColumnXYZ")
>   df2.show();
>   spark.close()
>  }
> }
> {code}
> Console Output:
> {code}
> +---+---+-----+
> |  A|  B|  XYZ|
> +---+---+-----+
> | a1| b1|x,y,z|
> +---+---+-----+
> +------------+------------+------------+------------+-------+---------+
> |ColumnA_API1|ColumnA_API2|ColumnA_API3|ColumnA_API4|ColumnB|ColumnXYZ|
> +------------+------------+------------+------------+-------+---------+
> |        null|          a1|          a1|          a1|     b1|        x|
> |        null|          a1|          a1|          a1|     b1|        y|
> |        null|          a1|          a1|          a1|     b1|        z|
> +------------+------------+------------+------------+-------+---------+
> {code}
> We try to get "A" column with 4 approach
> 1. call {{row.getAs("A")}} inside a tuple
> 2. call {{row.getAs("A")}}, save result into a variable "colA", and add 
> variable into the tuple
> 3. call {{row.getString(0)}} inside a tuple
> 4. call {{row.getString(0)}}, save result into a variable "col0", and add 
> variable into the tuple 
> And we found that approach 2~4 get value "a1" successfully, but approach 1 
> get "null"
> This issue existing in spark 2.4.0/2.3.2/2.3.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26136) Row.getAs return null value in some condition

2018-11-21 Thread Charlie Feng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1669#comment-1669
 ] 

Charlie Feng commented on SPARK-26136:
--

Thanks for quick feedback.

> Row.getAs return null value in some condition
> -
>
> Key: SPARK-26136
> URL: https://issues.apache.org/jira/browse/SPARK-26136
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0, 2.3.2, 2.4.0
> Environment: Windows 10
> JDK 1.8.0_181
> scala 2.11.12
> spark 2.4.0 / 2.3.2 / 2.3.0
>  
>Reporter: Charlie Feng
>Priority: Major
>
> {{Row.getAs("fieldName")}} will return null value when all below conditions 
> met:
>  * Used in {{DataFrame.flatMap()}}
>  * {{Another map()}} call inside {{flatMap}}
>  * call {{row.getAs("fieldName")}} inside a {{Tuple}}.
> Source code to reproduce the bug:
> {code}
> import org.apache.spark.sql.SparkSession
> object FlatMapGetAsBug {
> def main(args: Array[String]) {
>   val spark = 
> SparkSession.builder.appName("SparkUtil").master("local").getOrCreate
>   import spark.implicits._;
>   val df = Seq(("a1", "b1", "x,y,z")).toDF("A", "B", "XYZ")
>   df.show();
>   val df2 = df.flatMap { row =>
> row.getAs[String]("XYZ").split(",").map { xyz =>
>   var colA: String = row.getAs("A");
>   var col0: String = row.getString(0);
>   (row.getAs("A"), colA, row.getString(0), col0, row.getString(1), xyz)
> }
>   }.toDF("ColumnA_API1", "ColumnA_API2", "ColumnA_API3", "ColumnA_API4", 
> "ColumnB", "ColumnXYZ")
>   df2.show();
>   spark.close()
>  }
> }
> {code}
> Console Output:
> {code}
> +---+---+-----+
> |  A|  B|  XYZ|
> +---+---+-----+
> | a1| b1|x,y,z|
> +---+---+-----+
> +------------+------------+------------+------------+-------+---------+
> |ColumnA_API1|ColumnA_API2|ColumnA_API3|ColumnA_API4|ColumnB|ColumnXYZ|
> +------------+------------+------------+------------+-------+---------+
> |        null|          a1|          a1|          a1|     b1|        x|
> |        null|          a1|          a1|          a1|     b1|        y|
> |        null|          a1|          a1|          a1|     b1|        z|
> +------------+------------+------------+------------+-------+---------+
> {code}
> We try to get "A" column with 4 approach
> 1. call {{row.getAs("A")}} inside a tuple
> 2. call {{row.getAs("A")}}, save result into a variable "colA", and add 
> variable into the tuple
> 3. call {{row.getString(0)}} inside a tuple
> 4. call {{row.getString(0)}}, save result into a variable "col0", and add 
> variable into the tuple 
> And we found that approach 2~4 get value "a1" successfully, but approach 1 
> get "null"
> This issue existing in spark 2.4.0/2.3.2/2.3.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26136) Row.getAs return null value in some condition

2018-11-21 Thread Charlie Feng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694464#comment-16694464
 ] 

Charlie Feng commented on SPARK-26136:
--

And I'm thinking that when row.getAs() can't infer the type correctly, it 
should throw a runtime exception instead of returning a null value.
When a user gets an exception, they will find the issue, debug the code, and fix it.
But when Spark returns a null value, the user may not notice the error, which 
can lead to a production issue.

> Row.getAs return null value in some condition
> -
>
> Key: SPARK-26136
> URL: https://issues.apache.org/jira/browse/SPARK-26136
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0, 2.3.2, 2.4.0
> Environment: Windows 10
> JDK 1.8.0_181
> scala 2.11.12
> spark 2.4.0 / 2.3.2 / 2.3.0
>  
>Reporter: Charlie Feng
>Priority: Major
>
> {{Row.getAs("fieldName")}} will return null value when all below conditions 
> met:
>  * Used in {{DataFrame.flatMap()}}
>  * {{Another map()}} call inside {{flatMap}}
>  * call {{row.getAs("fieldName")}} inside a {{Tuple}}.
> Source code to reproduce the bug:
> {code}
> import org.apache.spark.sql.SparkSession
> object FlatMapGetAsBug {
> def main(args: Array[String]) {
>   val spark = 
> SparkSession.builder.appName("SparkUtil").master("local").getOrCreate
>   import spark.implicits._;
>   val df = Seq(("a1", "b1", "x,y,z")).toDF("A", "B", "XYZ")
>   df.show();
>   val df2 = df.flatMap { row =>
> row.getAs[String]("XYZ").split(",").map { xyz =>
>   var colA: String = row.getAs("A");
>   var col0: String = row.getString(0);
>   (row.getAs("A"), colA, row.getString(0), col0, row.getString(1), xyz)
> }
>   }.toDF("ColumnA_API1", "ColumnA_API2", "ColumnA_API3", "ColumnA_API4", 
> "ColumnB", "ColumnXYZ")
>   df2.show();
>   spark.close()
>  }
> }
> {code}
> Console Output:
> {code}
> +---+---+-----+
> |  A|  B|  XYZ|
> +---+---+-----+
> | a1| b1|x,y,z|
> +---+---+-----+
> +------------+------------+------------+------------+-------+---------+
> |ColumnA_API1|ColumnA_API2|ColumnA_API3|ColumnA_API4|ColumnB|ColumnXYZ|
> +------------+------------+------------+------------+-------+---------+
> |        null|          a1|          a1|          a1|     b1|        x|
> |        null|          a1|          a1|          a1|     b1|        y|
> |        null|          a1|          a1|          a1|     b1|        z|
> +------------+------------+------------+------------+-------+---------+
> {code}
> We try to get "A" column with 4 approach
> 1. call {{row.getAs("A")}} inside a tuple
> 2. call {{row.getAs("A")}}, save result into a variable "colA", and add 
> variable into the tuple
> 3. call {{row.getString(0)}} inside a tuple
> 4. call {{row.getString(0)}}, save result into a variable "col0", and add 
> variable into the tuple 
> And we found that approach 2~4 get value "a1" successfully, but approach 1 
> get "null"
> This issue existing in spark 2.4.0/2.3.2/2.3.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-26136) Row.getAs return null value in some condition

2018-11-21 Thread Charlie Feng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charlie Feng updated SPARK-26136:
-
Comment: was deleted

(was: And I'm thinking that when row.getAs() can't infer the type correctly, it 
should throw a runtime exception instead of returning a null value.
When a user gets an exception, they will find the issue, debug the code, and fix it.
But when Spark returns a null value, the user may not notice the error, which 
can lead to a production issue.)

> Row.getAs return null value in some condition
> -
>
> Key: SPARK-26136
> URL: https://issues.apache.org/jira/browse/SPARK-26136
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0, 2.3.2, 2.4.0
> Environment: Windows 10
> JDK 1.8.0_181
> scala 2.11.12
> spark 2.4.0 / 2.3.2 / 2.3.0
>  
>Reporter: Charlie Feng
>Priority: Major
>
> {{Row.getAs("fieldName")}} will return null value when all below conditions 
> met:
>  * Used in {{DataFrame.flatMap()}}
>  * {{Another map()}} call inside {{flatMap}}
>  * call {{row.getAs("fieldName")}} inside a {{Tuple}}.
> Source code to reproduce the bug:
> {code}
> import org.apache.spark.sql.SparkSession
> object FlatMapGetAsBug {
> def main(args: Array[String]) {
>   val spark = 
> SparkSession.builder.appName("SparkUtil").master("local").getOrCreate
>   import spark.implicits._;
>   val df = Seq(("a1", "b1", "x,y,z")).toDF("A", "B", "XYZ")
>   df.show();
>   val df2 = df.flatMap { row =>
> row.getAs[String]("XYZ").split(",").map { xyz =>
>   var colA: String = row.getAs("A");
>   var col0: String = row.getString(0);
>   (row.getAs("A"), colA, row.getString(0), col0, row.getString(1), xyz)
> }
>   }.toDF("ColumnA_API1", "ColumnA_API2", "ColumnA_API3", "ColumnA_API4", 
> "ColumnB", "ColumnXYZ")
>   df2.show();
>   spark.close()
>  }
> }
> {code}
> Console Output:
> {code}
> +---+---+-----+
> |  A|  B|  XYZ|
> +---+---+-----+
> | a1| b1|x,y,z|
> +---+---+-----+
> +------------+------------+------------+------------+-------+---------+
> |ColumnA_API1|ColumnA_API2|ColumnA_API3|ColumnA_API4|ColumnB|ColumnXYZ|
> +------------+------------+------------+------------+-------+---------+
> |        null|          a1|          a1|          a1|     b1|        x|
> |        null|          a1|          a1|          a1|     b1|        y|
> |        null|          a1|          a1|          a1|     b1|        z|
> +------------+------------+------------+------------+-------+---------+
> {code}
> We try to get "A" column with 4 approach
> 1. call {{row.getAs("A")}} inside a tuple
> 2. call {{row.getAs("A")}}, save result into a variable "colA", and add 
> variable into the tuple
> 3. call {{row.getString(0)}} inside a tuple
> 4. call {{row.getString(0)}}, save result into a variable "col0", and add 
> variable into the tuple 
> And we found that approach 2~4 get value "a1" successfully, but approach 1 
> get "null"
> This issue existing in spark 2.4.0/2.3.2/2.3.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26136) Row.getAs return null value in some condition

2018-11-21 Thread Charlie Feng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694459#comment-16694459
 ] 

Charlie Feng commented on SPARK-26136:
--

And I'm thinking that when row.getAs() can't infer the type correctly, it 
should throw a runtime exception instead of returning a null value.
When a user gets an exception, they will find the issue, debug the code, and fix it.
But when Spark returns a null value, the user may not notice the error, which 
can lead to a production issue.

> Row.getAs return null value in some condition
> -
>
> Key: SPARK-26136
> URL: https://issues.apache.org/jira/browse/SPARK-26136
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0, 2.3.2, 2.4.0
> Environment: Windows 10
> JDK 1.8.0_181
> scala 2.11.12
> spark 2.4.0 / 2.3.2 / 2.3.0
>  
>Reporter: Charlie Feng
>Priority: Major
>
> {{Row.getAs("fieldName")}} will return null value when all below conditions 
> met:
>  * Used in {{DataFrame.flatMap()}}
>  * {{Another map()}} call inside {{flatMap}}
>  * call {{row.getAs("fieldName")}} inside a {{Tuple}}.
> Source code to reproduce the bug:
> {code}
> import org.apache.spark.sql.SparkSession
> object FlatMapGetAsBug {
> def main(args: Array[String]) {
>   val spark = 
> SparkSession.builder.appName("SparkUtil").master("local").getOrCreate
>   import spark.implicits._;
>   val df = Seq(("a1", "b1", "x,y,z")).toDF("A", "B", "XYZ")
>   df.show();
>   val df2 = df.flatMap { row =>
> row.getAs[String]("XYZ").split(",").map { xyz =>
>   var colA: String = row.getAs("A");
>   var col0: String = row.getString(0);
>   (row.getAs("A"), colA, row.getString(0), col0, row.getString(1), xyz)
> }
>   }.toDF("ColumnA_API1", "ColumnA_API2", "ColumnA_API3", "ColumnA_API4", 
> "ColumnB", "ColumnXYZ")
>   df2.show();
>   spark.close()
>  }
> }
> {code}
> Console Output:
> {code}
> +---+---+-----+
> |  A|  B|  XYZ|
> +---+---+-----+
> | a1| b1|x,y,z|
> +---+---+-----+
> +------------+------------+------------+------------+-------+---------+
> |ColumnA_API1|ColumnA_API2|ColumnA_API3|ColumnA_API4|ColumnB|ColumnXYZ|
> +------------+------------+------------+------------+-------+---------+
> |        null|          a1|          a1|          a1|     b1|        x|
> |        null|          a1|          a1|          a1|     b1|        y|
> |        null|          a1|          a1|          a1|     b1|        z|
> +------------+------------+------------+------------+-------+---------+
> {code}
> We try to get "A" column with 4 approach
> 1. call {{row.getAs("A")}} inside a tuple
> 2. call {{row.getAs("A")}}, save result into a variable "colA", and add 
> variable into the tuple
> 3. call {{row.getString(0)}} inside a tuple
> 4. call {{row.getString(0)}}, save result into a variable "col0", and add 
> variable into the tuple 
> And we found that approach 2~4 get value "a1" successfully, but approach 1 
> get "null"
> This issue existing in spark 2.4.0/2.3.2/2.3.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26136) Row.getAs return null value in some condition

2018-11-21 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26136.
--
Resolution: Invalid

For questions, please ask the mailing list next time.
When filing an issue, please make it as readable as possible.

> Row.getAs return null value in some condition
> -
>
> Key: SPARK-26136
> URL: https://issues.apache.org/jira/browse/SPARK-26136
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0, 2.3.2, 2.4.0
> Environment: Windows 10
> JDK 1.8.0_181
> scala 2.11.12
> spark 2.4.0 / 2.3.2 / 2.3.0
>  
>Reporter: Charlie Feng
>Priority: Major
>
> {{Row.getAs("fieldName")}} will return null value when all below conditions 
> met:
>  * Used in {{DataFrame.flatMap()}}
>  * {{Another map()}} call inside {{flatMap}}
>  * call {{row.getAs("fieldName")}} inside a {{Tuple}}.
> Source code to reproduce the bug:
> {code}
> import org.apache.spark.sql.SparkSession
> object FlatMapGetAsBug {
> def main(args: Array[String]) {
>   val spark = 
> SparkSession.builder.appName("SparkUtil").master("local").getOrCreate
>   import spark.implicits._;
>   val df = Seq(("a1", "b1", "x,y,z")).toDF("A", "B", "XYZ")
>   df.show();
>   val df2 = df.flatMap { row =>
> row.getAs[String]("XYZ").split(",").map { xyz =>
>   var colA: String = row.getAs("A");
>   var col0: String = row.getString(0);
>   (row.getAs("A"), colA, row.getString(0), col0, row.getString(1), xyz)
> }
>   }.toDF("ColumnA_API1", "ColumnA_API2", "ColumnA_API3", "ColumnA_API4", 
> "ColumnB", "ColumnXYZ")
>   df2.show();
>   spark.close()
>  }
> }
> {code}
> Console Output:
> {code}
> +---+---+-----+
> |  A|  B|  XYZ|
> +---+---+-----+
> | a1| b1|x,y,z|
> +---+---+-----+
> +------------+------------+------------+------------+-------+---------+
> |ColumnA_API1|ColumnA_API2|ColumnA_API3|ColumnA_API4|ColumnB|ColumnXYZ|
> +------------+------------+------------+------------+-------+---------+
> |        null|          a1|          a1|          a1|     b1|        x|
> |        null|          a1|          a1|          a1|     b1|        y|
> |        null|          a1|          a1|          a1|     b1|        z|
> +------------+------------+------------+------------+-------+---------+
> {code}
> We try to get "A" column with 4 approach
> 1. call {{row.getAs("A")}} inside a tuple
> 2. call {{row.getAs("A")}}, save result into a variable "colA", and add 
> variable into the tuple
> 3. call {{row.getString(0)}} inside a tuple
> 4. call {{row.getString(0)}}, save result into a variable "col0", and add 
> variable into the tuple 
> And we found that approach 2~4 get value "a1" successfully, but approach 1 
> get "null"
> This issue existing in spark 2.4.0/2.3.2/2.3.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26136) Row.getAs return null value in some condition

2018-11-21 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694427#comment-16694427
 ] 

Hyukjin Kwon commented on SPARK-26136:
--

The type should be specified, as in {{row.getAs[String]("A")}}; otherwise, it 
can't infer the type correctly.

{code}
-   (row.getAs("A"), colA, row.getString(0), col0, row.getString(1), xyz)
+  (row.getAs[String]("A"), colA, row.getString(0), col0, row.getString(1), 
xyz)
{code}
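
A minimal corrected version of the reproduction, assuming the explicit type 
parameter suggested above is the intended fix (column names shortened for 
brevity):

{code:scala}
val df2 = df.flatMap { row =>
  row.getAs[String]("XYZ").split(",").map { xyz =>
    // explicit [String] so the tuple element type does not have to be inferred
    (row.getAs[String]("A"), row.getString(1), xyz)
  }
}.toDF("ColumnA", "ColumnB", "ColumnXYZ")
{code}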

> Row.getAs return null value in some condition
> -
>
> Key: SPARK-26136
> URL: https://issues.apache.org/jira/browse/SPARK-26136
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0, 2.3.2, 2.4.0
> Environment: Windows 10
> JDK 1.8.0_181
> scala 2.11.12
> spark 2.4.0 / 2.3.2 / 2.3.0
>  
>Reporter: Charlie Feng
>Priority: Major
>
> {{Row.getAs("fieldName")}} will return null value when all below conditions 
> met:
>  * Used in {{DataFrame.flatMap()}}
>  * {{Another map()}} call inside {{flatMap}}
>  * call {{row.getAs("fieldName")}} inside a {{Tuple}}.
> Source code to reproduce the bug:
> {code}
> import org.apache.spark.sql.SparkSession
> object FlatMapGetAsBug {
> def main(args: Array[String]) {
>   val spark = 
> SparkSession.builder.appName("SparkUtil").master("local").getOrCreate
>   import spark.implicits._;
>   val df = Seq(("a1", "b1", "x,y,z")).toDF("A", "B", "XYZ")
>   df.show();
>   val df2 = df.flatMap { row =>
> row.getAs[String]("XYZ").split(",").map { xyz =>
>   var colA: String = row.getAs("A");
>   var col0: String = row.getString(0);
>   (row.getAs("A"), colA, row.getString(0), col0, row.getString(1), xyz)
> }
>   }.toDF("ColumnA_API1", "ColumnA_API2", "ColumnA_API3", "ColumnA_API4", 
> "ColumnB", "ColumnXYZ")
>   df2.show();
>   spark.close()
>  }
> }
> {code}
> Console Output:
> {code}
> +---+---+-----+
> |  A|  B|  XYZ|
> +---+---+-----+
> | a1| b1|x,y,z|
> +---+---+-----+
> +------------+------------+------------+------------+-------+---------+
> |ColumnA_API1|ColumnA_API2|ColumnA_API3|ColumnA_API4|ColumnB|ColumnXYZ|
> +------------+------------+------------+------------+-------+---------+
> |        null|          a1|          a1|          a1|     b1|        x|
> |        null|          a1|          a1|          a1|     b1|        y|
> |        null|          a1|          a1|          a1|     b1|        z|
> +------------+------------+------------+------------+-------+---------+
> {code}
> We try to get "A" column with 4 approach
> 1. call {{row.getAs("A")}} inside a tuple
> 2. call {{row.getAs("A")}}, save result into a variable "colA", and add 
> variable into the tuple
> 3. call {{row.getString(0)}} inside a tuple
> 4. call {{row.getString(0)}}, save result into a variable "col0", and add 
> variable into the tuple 
> And we found that approach 2~4 get value "a1" successfully, but approach 1 
> get "null"
> This issue existing in spark 2.4.0/2.3.2/2.3.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26136) Row.getAs return null value in some condition

2018-11-21 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26136:
-
Description: 
{{Row.getAs("fieldName")}} will return null value when all below conditions met:

 * Used in {{DataFrame.flatMap()}}

 * {{Another map()}} call inside {{flatMap}}

 * call {{row.getAs("fieldName")}} inside a {{Tuple}}.

Source code to reproduce the bug:

{code}
import org.apache.spark.sql.SparkSession

object FlatMapGetAsBug {

def main(args: Array[String]) {
  val spark = 
SparkSession.builder.appName("SparkUtil").master("local").getOrCreate
  import spark.implicits._;

  val df = Seq(("a1", "b1", "x,y,z")).toDF("A", "B", "XYZ")
  df.show();
  val df2 = df.flatMap(row => row.getAs[String]("XYZ").split(",")
.map(xyz => {
  var colA: String = row.getAs("A");
  var col0: String = row.getString(0);
  (row.getAs("A"), colA, row.getString(0), col0, row.getString(1), xyz)
   })).toDF("ColumnA_API1", "ColumnA_API2", "ColumnA_API3", "ColumnA_API4", 
"ColumnB", "ColumnXYZ")

  df2.show();
  spark.close()
 }
}
{code}


Console Output:

{code}

+---+---+-----+
|  A|  B|  XYZ|
+---+---+-----+
| a1| b1|x,y,z|
+---+---+-----+

+------------+------------+------------+------------+-------+---------+
|ColumnA_API1|ColumnA_API2|ColumnA_API3|ColumnA_API4|ColumnB|ColumnXYZ|
+------------+------------+------------+------------+-------+---------+
|        null|          a1|          a1|          a1|     b1|        x|
|        null|          a1|          a1|          a1|     b1|        y|
|        null|          a1|          a1|          a1|     b1|        z|
+------------+------------+------------+------------+-------+---------+
{code}

We try to get "A" column with 4 approach

1. call {{row.getAs("A")}} inside a tuple

2. call {{row.getAs("A")}}, save result into a variable "colA", and add 
variable into the tuple

3. call {{row.getString(0)}} inside a tuple

4. call {{row.getString(0)}}, save result into a variable "col0", and add 
variable into the tuple 

And we found that approach 2~4 get value "a1" successfully, but approach 1 get 
"null"

This issue existing in spark 2.4.0/2.3.2/2.3.0


  was:
Row.getAs("fieldName") will return null value when all below conditions met:
 * Used in DataFrame.flatMap()
 * Another map() call inside flatMap
 * call row.getAs("fieldName") inside a Tuple.

*Source code to reproduce the bug:*

import org.apache.spark.sql.SparkSession

object FlatMapGetAsBug {

def main(args: Array[String]) {
 val spark = 
SparkSession.builder.appName("SparkUtil").master("local").getOrCreate
 import spark.implicits._;

val df = Seq(("a1", "b1", "x,y,z")).toDF("A", "B", "XYZ")
 df.show();
 val df2 = df.flatMap(row => row.getAs[String]("XYZ").split(",")
 .map(xyz => {
 var colA: String = row.getAs("A");
 var col0: String = row.getString(0);
 (row.getAs("A"), colA, row.getString(0), col0, row.getString(1), xyz)
 })).toDF("ColumnA_API1", "ColumnA_API2", "ColumnA_API3", "ColumnA_API4", 
"ColumnB", "ColumnXYZ")

df2.show();
 spark.close()
 }
}

*Console Output:*

+---+---+-----+
|  A|  B|  XYZ|
+---+---+-----+
| a1| b1|x,y,z|
+---+---+-----+

+------------+------------+------------+------------+-------+---------+
|ColumnA_API1|ColumnA_API2|ColumnA_API3|ColumnA_API4|ColumnB|ColumnXYZ|
+------------+------------+------------+------------+-------+---------+
|        null|          a1|          a1|          a1|     b1|        x|
|        null|          a1|          a1|          a1|     b1|        y|
|        null|          a1|          a1|          a1|     b1|        z|
+------------+------------+------------+------------+-------+---------+

We try to get "A" column with 4 approach
1) call row.getAs("A") inside a tuple
2) call row.getAs("A"), save result into a variable "colA", and add variable 
into the tuple
3) call row.getString(0) inside a tuple
4) call row.getString(0), save result into a variable "col0", and add variable 
into the tuple 

And we found that approach 2~4 get value "a1" successfully, but approach 1 get 
"null"

This issue existing in spark 2.4.0/2.3.2/2.3.0

 


> Row.getAs return null value in some condition
> -
>
> Key: SPARK-26136
> URL: https://issues.apache.org/jira/browse/SPARK-26136
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0, 2.3.2, 2.4.0
> Environment: Windows 10
> JDK 1.8.0_181
> scala 2.11.12
> spark 2.4.0 / 2.3.2 / 2.3.0
>  
>Reporter: Charlie Feng
>Priority: Major
>
> {{Row.getAs("fieldName")}} will return null value when all below conditions 
> met:
>  * Used in {{DataFrame.flatMap()}}
>  * {{Another map()}} call inside {{flatMap}}
>  * call {{row.getAs("fieldName")}} inside a {{Tuple}}.
> Source code to reproduce the bug:
> {code}
> import org.apache.spark.sql.SparkSession
> object FlatMapGetAsBug {
> def main(args: Array[String]) {
>   val spark = 
> SparkSession.builder.appName("SparkUtil").master("local").getOrCreate
>   import spark.implicits._;
>   val df = Seq(("a1", "b1", "x,y,z")).toDF("A", "B", "XYZ")
>   df.show();
>   val df2 = 

[jira] [Updated] (SPARK-26136) Row.getAs return null value in some condition

2018-11-21 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26136:
-
Description: 
{{Row.getAs("fieldName")}} will return null value when all below conditions met:

 * Used in {{DataFrame.flatMap()}}

 * {{Another map()}} call inside {{flatMap}}

 * call {{row.getAs("fieldName")}} inside a {{Tuple}}.

Source code to reproduce the bug:

{code}
import org.apache.spark.sql.SparkSession

object FlatMapGetAsBug {

def main(args: Array[String]) {
  val spark = 
SparkSession.builder.appName("SparkUtil").master("local").getOrCreate
  import spark.implicits._;

  val df = Seq(("a1", "b1", "x,y,z")).toDF("A", "B", "XYZ")
  df.show();
  val df2 = df.flatMap { row =>
row.getAs[String]("XYZ").split(",").map { xyz =>
  var colA: String = row.getAs("A");
  var col0: String = row.getString(0);
  (row.getAs("A"), colA, row.getString(0), col0, row.getString(1), xyz)
}
  }.toDF("ColumnA_API1", "ColumnA_API2", "ColumnA_API3", "ColumnA_API4", 
"ColumnB", "ColumnXYZ")

  df2.show();
  spark.close()
 }
}
{code}


Console Output:

{code}

+---+---+-----+
|  A|  B|  XYZ|
+---+---+-----+
| a1| b1|x,y,z|
+---+---+-----+

+------------+------------+------------+------------+-------+---------+
|ColumnA_API1|ColumnA_API2|ColumnA_API3|ColumnA_API4|ColumnB|ColumnXYZ|
+------------+------------+------------+------------+-------+---------+
|        null|          a1|          a1|          a1|     b1|        x|
|        null|          a1|          a1|          a1|     b1|        y|
|        null|          a1|          a1|          a1|     b1|        z|
+------------+------------+------------+------------+-------+---------+
{code}

We try to get "A" column with 4 approach

1. call {{row.getAs("A")}} inside a tuple

2. call {{row.getAs("A")}}, save result into a variable "colA", and add 
variable into the tuple

3. call {{row.getString(0)}} inside a tuple

4. call {{row.getString(0)}}, save result into a variable "col0", and add 
variable into the tuple 

And we found that approach 2~4 get value "a1" successfully, but approach 1 get 
"null"

This issue existing in spark 2.4.0/2.3.2/2.3.0


  was:
{{Row.getAs("fieldName")}} will return null value when all below conditions met:

 * Used in {{DataFrame.flatMap()}}

 * {{Another map()}} call inside {{flatMap}}

 * call {{row.getAs("fieldName")}} inside a {{Tuple}}.

Source code to reproduce the bug:

{code}
import org.apache.spark.sql.SparkSession

object FlatMapGetAsBug {

def main(args: Array[String]) {
  val spark = 
SparkSession.builder.appName("SparkUtil").master("local").getOrCreate
  import spark.implicits._;

  val df = Seq(("a1", "b1", "x,y,z")).toDF("A", "B", "XYZ")
  df.show();
  val df2 = df.flatMap(row => row.getAs[String]("XYZ").split(",")
.map(xyz => {
  var colA: String = row.getAs("A");
  var col0: String = row.getString(0);
  (row.getAs("A"), colA, row.getString(0), col0, row.getString(1), xyz)
   })).toDF("ColumnA_API1", "ColumnA_API2", "ColumnA_API3", "ColumnA_API4", 
"ColumnB", "ColumnXYZ")

  df2.show();
  spark.close()
 }
}
{code}


Console Output:

{code}

+---+---+-----+
|  A|  B|  XYZ|
+---+---+-----+
| a1| b1|x,y,z|
+---+---+-----+

+------------+------------+------------+------------+-------+---------+
|ColumnA_API1|ColumnA_API2|ColumnA_API3|ColumnA_API4|ColumnB|ColumnXYZ|
+------------+------------+------------+------------+-------+---------+
|        null|          a1|          a1|          a1|     b1|        x|
|        null|          a1|          a1|          a1|     b1|        y|
|        null|          a1|          a1|          a1|     b1|        z|
+------------+------------+------------+------------+-------+---------+
{code}

We try to get "A" column with 4 approach

1. call {{row.getAs("A")}} inside a tuple

2. call {{row.getAs("A")}}, save result into a variable "colA", and add 
variable into the tuple

3. call {{row.getString(0)}} inside a tuple

4. call {{row.getString(0)}}, save result into a variable "col0", and add 
variable into the tuple 

And we found that approach 2~4 get value "a1" successfully, but approach 1 get 
"null"

This issue existing in spark 2.4.0/2.3.2/2.3.0



> Row.getAs return null value in some condition
> -
>
> Key: SPARK-26136
> URL: https://issues.apache.org/jira/browse/SPARK-26136
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0, 2.3.2, 2.4.0
> Environment: Windows 10
> JDK 1.8.0_181
> scala 2.11.12
> spark 2.4.0 / 2.3.2 / 2.3.0
>  
>Reporter: Charlie Feng
>Priority: Major
>
> {{Row.getAs("fieldName")}} will return null value when all below conditions 
> met:
>  * Used in {{DataFrame.flatMap()}}
>  * {{Another map()}} call inside {{flatMap}}
>  * call {{row.getAs("fieldName")}} inside a {{Tuple}}.
> Source code to reproduce the bug:
> {code}
> import org.apache.spark.sql.SparkSession
> object FlatMapGetAsBug {
> def main(args: Array[String]) {
>   val spark = 
> SparkSession.builder.appName("SparkUtil").master("local").getOrCreate
>   import 

[jira] [Reopened] (SPARK-26108) Support custom lineSep in CSV datasource

2018-11-21 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-26108:
--

> Support custom lineSep in CSV datasource
> 
>
> Key: SPARK-26108
> URL: https://issues.apache.org/jira/browse/SPARK-26108
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently the CSV datasource can detect and parse CSV text with '\n', '\r' and 
> '\r\n' as line separators. This ticket aims to support a custom lineSep with a 
> maximum length of 2 characters due to a current restriction of the uniVocity 
> parser. 
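
A sketch of the intended usage; the lineSep option mirrors the JSON/text 
datasources and is what this ticket proposes for CSV, so it is not available 
before this change, and the separator and path values are illustrative:

{code:scala}
val df = spark.read
  .option("lineSep", "\u0001")   // custom separator, at most 2 characters
  .option("header", "true")
  .csv("/path/to/data.csv")
{code}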



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26102) Common CSV/JSON functions tests

2018-11-21 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26102.
--
Resolution: Won't Fix

> Common CSV/JSON functions tests
> ---
>
> Key: SPARK-26102
> URL: https://issues.apache.org/jira/browse/SPARK-26102
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> *CsvFunctionsSuite* and *JsonFunctionsSuite* have similar tests. The common 
> tests need to be extracted to a shared place.
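> One possible shape for the extraction, as a rough sketch only (the trait name, 
> the abstract helper, and the sample test are illustrative assumptions, not the 
> actual refactoring): a shared trait that both suites mix in.
> {code}
> import org.scalatest.FunSuite
>
> // Hypothetical shared-test trait; CsvFunctionsSuite and JsonFunctionsSuite
> // would mix this in and supply the datasource-specific parsing under test.
> trait CommonFunctionTests { self: FunSuite =>
>   // Supplied by the concrete suite (CSV- or JSON-backed).
>   def parseRecords(input: String): Seq[String]
>
>   test("empty input produces no records") {
>     assert(parseRecords("").isEmpty)
>   }
> }
> {code}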



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26108) Support custom lineSep in CSV datasource

2018-11-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26108:


Assignee: (was: Apache Spark)

> Support custom lineSep in CSV datasource
> 
>
> Key: SPARK-26108
> URL: https://issues.apache.org/jira/browse/SPARK-26108
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently the CSV datasource can detect and parse CSV text with '\n', '\r' and 
> '\r\n' as line separators. This ticket aims to support a custom lineSep with a 
> maximum length of 2 characters, due to a current restriction of the uniVocity 
> parser.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26136) Row.getAs return null value in some condition

2018-11-21 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26136:
-
Docs Text:   (was: import org.apache.spark.sql.SparkSession

object FlatMapGetAsBug {

  def main(args: Array[String]) {
val spark = 
SparkSession.builder.appName("SparkUtil").master("local").getOrCreate
import spark.implicits._;

val df = Seq(("a1", "b1", "x,y,z")).toDF("A", "B", "XYZ")
df.show();
val df2 = df.flatMap(row => row.getAs[String]("XYZ").split(",")
  .map(xyz => {
var colA: String = row.getAs("A");
var col0: String = row.getString(0);
(row.getAs("A"), colA, row.getString(0), col0, row.getString(1), xyz)
  })).toDF("ColumnA_API1", "ColumnA_API2", "ColumnA_API3", "ColumnA_API4", 
"ColumnB", "ColumnXYZ")

df2.show();
spark.close()
  }
}
)

> Row.getAs return null value in some condition
> -
>
> Key: SPARK-26136
> URL: https://issues.apache.org/jira/browse/SPARK-26136
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0, 2.3.2, 2.4.0
> Environment: Windows 10
> JDK 1.8.0_181
> scala 2.11.12
> spark 2.4.0 / 2.3.2 / 2.3.0
>  
>Reporter: Charlie Feng
>Priority: Major
>
> Row.getAs("fieldName") will return a null value when all of the conditions below are met:
>  * Used in DataFrame.flatMap()
>  * Another map() call inside flatMap
>  * row.getAs("fieldName") is called inside a Tuple.
> *Source code to reproduce the bug:*
> import org.apache.spark.sql.SparkSession
> object FlatMapGetAsBug {
>   def main(args: Array[String]) {
>     val spark = SparkSession.builder.appName("SparkUtil").master("local").getOrCreate
>     import spark.implicits._;
>     val df = Seq(("a1", "b1", "x,y,z")).toDF("A", "B", "XYZ")
>     df.show();
>     val df2 = df.flatMap(row => row.getAs[String]("XYZ").split(",")
>       .map(xyz => {
>         var colA: String = row.getAs("A");
>         var col0: String = row.getString(0);
>         (row.getAs("A"), colA, row.getString(0), col0, row.getString(1), xyz)
>       })).toDF("ColumnA_API1", "ColumnA_API2", "ColumnA_API3", "ColumnA_API4", "ColumnB", "ColumnXYZ")
>     df2.show();
>     spark.close()
>   }
> }
> *Console Output:*
> +---+---+-----+
> |  A|  B|  XYZ|
> +---+---+-----+
> | a1| b1|x,y,z|
> +---+---+-----+
> +------------+------------+------------+------------+-------+---------+
> |ColumnA_API1|ColumnA_API2|ColumnA_API3|ColumnA_API4|ColumnB|ColumnXYZ|
> +------------+------------+------------+------------+-------+---------+
> |        null|          a1|          a1|          a1|     b1|        x|
> |        null|          a1|          a1|          a1|     b1|        y|
> |        null|          a1|          a1|          a1|     b1|        z|
> +------------+------------+------------+------------+-------+---------+
> We try to get the "A" column with 4 approaches:
> 1) call row.getAs("A") inside a tuple
> 2) call row.getAs("A"), save the result into a variable "colA", and add the 
> variable to the tuple
> 3) call row.getString(0) inside a tuple
> 4) call row.getString(0), save the result into a variable "col0", and add the 
> variable to the tuple
> We found that approaches 2~4 get the value "a1" successfully, but approach 1 
> gets "null".
> This issue exists in Spark 2.4.0/2.3.2/2.3.0.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26108) Support custom lineSep in CSV datasource

2018-11-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26108:


Assignee: Apache Spark

> Support custom lineSep in CSV datasource
> 
>
> Key: SPARK-26108
> URL: https://issues.apache.org/jira/browse/SPARK-26108
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Currently the CSV datasource can detect and parse CSV text with '\n', '\r' and 
> '\r\n' as line separators. This ticket aims to support a custom lineSep with a 
> maximum length of 2 characters, due to a current restriction of the uniVocity 
> parser.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26108) Support custom lineSep in CSV datasource

2018-11-21 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26108.
--
Resolution: Won't Fix

> Support custom lineSep in CSV datasource
> 
>
> Key: SPARK-26108
> URL: https://issues.apache.org/jira/browse/SPARK-26108
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently the CSV datasource can detect and parse CSV text with '\n', '\r' and 
> '\r\n' as line separators. This ticket aims to support a custom lineSep with a 
> maximum length of 2 characters, due to a current restriction of the uniVocity 
> parser.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


