[jira] [Assigned] (SPARK-26990) Difference in handling of mixed-case partition column names after SPARK-26188

2019-02-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26990:


Assignee: Apache Spark

> Difference in handling of mixed-case partition column names after SPARK-26188
> -
>
> Key: SPARK-26990
> URL: https://issues.apache.org/jira/browse/SPARK-26990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Major
>
> I noticed that the [PR for 
> SPARK-26188|https://github.com/apache/spark/pull/23165] changed how 
> mixed-cased partition columns are handled when the user provides a schema.
> Say I have this file structure (note that each instance of `pS` is mixed 
> case):
> {noformat}
> bash-3.2$ find partitioned5 -type d
> partitioned5
> partitioned5/pi=2
> partitioned5/pi=2/pS=foo
> partitioned5/pi=2/pS=bar
> partitioned5/pi=1
> partitioned5/pi=1/pS=foo
> partitioned5/pi=1/pS=bar
> bash-3.2$
> {noformat}
> If I load the file with a user-provided schema in 2.4 (before the PR was 
> committed) or 2.3, I see:
> {noformat}
> scala> val df = spark.read.schema("intField int, pi int, ps 
> string").parquet("partitioned5")
> df: org.apache.spark.sql.DataFrame = [intField: int, pi: int ... 1 more field]
> scala> df.printSchema
> root
>  |-- intField: integer (nullable = true)
>  |-- pi: integer (nullable = true)
>  |-- ps: string (nullable = true)
> scala>
> {noformat}
> However, using 2.4 after the PR was committed, I see:
> {noformat}
> scala> val df = spark.read.schema("intField int, pi int, ps 
> string").parquet("partitioned5")
> df: org.apache.spark.sql.DataFrame = [intField: int, pi: int ... 1 more field]
> scala> df.printSchema
> root
>  |-- intField: integer (nullable = true)
>  |-- pi: integer (nullable = true)
>  |-- pS: string (nullable = true)
> scala>
> {noformat}
> Spark is picking up the mixed-case column name {{pS}} from the directory 
> name, not the lower-case {{ps}} from my specified schema.
> In all tests, {{spark.sql.caseSensitive}} is set to the default (false).
> Not sure if this is a bug, but it is a difference.
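For anyone trying to reproduce this locally, here is a minimal sketch that produces the directory layout above. The row values are illustrative assumptions; only the column names and the partitioning scheme come from the report.

{code:scala}
// Writes a small Parquet dataset partitioned by pi and pS, yielding
// partitioned5/pi=<n>/pS=<s> directories like the listing above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("SPARK-26990 repro").getOrCreate()
import spark.implicits._

Seq((10, 1, "foo"), (20, 1, "bar"), (30, 2, "foo"), (40, 2, "bar"))
  .toDF("intField", "pi", "pS")
  .write
  .partitionBy("pi", "pS")
  .parquet("partitioned5")
{code}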






[jira] [Commented] (SPARK-24778) DateTimeUtils.getTimeZone method returns GMT time if timezone cannot be parsed

2019-02-25 Thread Renkai Ge (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777680#comment-16777680
 ] 

Renkai Ge commented on SPARK-24778:
---

I think this issue is already resolved by
https://issues.apache.org/jira/browse/SPARK-26903

> DateTimeUtils.getTimeZone method returns GMT time if timezone cannot be parsed
> --
>
> Key: SPARK-24778
> URL: https://issues.apache.org/jira/browse/SPARK-24778
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Vinitha Reddy Gankidi
>Priority: Major
>
> {{DateTimeUtils.getTimeZone}} calls Java's {{TimeZone.getTimeZone}} method, 
> which defaults to GMT if the timezone cannot be parsed. This can be misleading 
> for users, and it's better to return NULL instead of returning an incorrect 
> value.
> To reproduce: {{from_utc_timestamp}} is one of the functions that calls 
> {{DateTimeUtils.getTimeZone}}. The session timezone is GMT for the following 
> queries.
> {code:java}
> SELECT from_utc_timestamp('2018-07-10 12:00:00', 'GMT+05:00') -> 2018-07-10 
> 17:00:00 
> SELECT from_utc_timestamp('2018-07-10 12:00:00', '+05:00') -> 2018-07-10 
> 12:00:00 (Defaults to GMT as the timezone is not recognized){code}
> We could fix it by using the workaround mentioned here: 
> [https://bugs.openjdk.java.net/browse/JDK-4412864].
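A plain-JVM sketch of the fallback described above; the ZoneId check at the end is just one possible way to detect unparseable IDs, not necessarily the exact workaround from the linked JDK bug.

{code:scala}
import java.util.TimeZone
import java.time.ZoneId

TimeZone.getTimeZone("GMT+05:00").getID  // "GMT+05:00" -- recognized
TimeZone.getTimeZone("+05:00").getID     // "GMT" -- unrecognized ID silently falls back to GMT

// java.time throws instead of silently defaulting, so a caller could return
// NULL (or raise an error) for time zones it cannot parse.
ZoneId.of("+05:00")                      // ZoneOffset +05:00 -- parsed
// ZoneId.of("bogus/zone")               // would throw a DateTimeException
{code}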






[jira] [Updated] (SPARK-26992) Fix STS scheduler pool correct delivery

2019-02-25 Thread dzcxzl (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-26992:
---
Attachment: error_stage.png

> Fix STS scheduler pool correct delivery
> ---
>
> Key: SPARK-26992
> URL: https://issues.apache.org/jira/browse/SPARK-26992
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.4.0
>Reporter: dzcxzl
>Priority: Minor
> Attachments: error_session.png, error_stage.png
>
>
> The user sets the value of spark.sql.thriftserver.scheduler.pool.
> The Spark Thrift Server stores this value in a thread-local LocalProperty but 
> does not clear it after the statement finishes, so other sessions end up 
> running in the pool that a previous session set.
>  
> For example: the second session does not set a pool name manually, so the 
> default pool should be used, but the pool name from the previous user's 
> setting is used instead. This is incorrect.
>  






[jira] [Created] (SPARK-26993) _minRegisteredRatio default value is zero not 0.8 for Yarn

2019-02-25 Thread Yifan Guo (JIRA)
Yifan Guo created SPARK-26993:
-

 Summary: _minRegisteredRatio default value is zero not 0.8 for Yarn
 Key: SPARK-26993
 URL: https://issues.apache.org/jira/browse/SPARK-26993
 Project: Spark
  Issue Type: Question
  Components: YARN
Affects Versions: 2.4.0
Reporter: Yifan Guo


private[spark]

class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: 
RpcEnv)
 extends ExecutorAllocationClient with SchedulerBackend with Logging {

 // Use an atomic variable to track total number of cores in the cluster for 
simplicity and speed
 protected val totalCoreCount = new AtomicInteger(0)
 // Total number of executors that are currently registered
 protected val totalRegisteredExecutors = new AtomicInteger(0)
 protected val conf = scheduler.sc.conf
 private val maxRpcMessageSize = RpcUtils.maxMessageSizeBytes(conf)
 private val defaultAskTimeout = RpcUtils.askRpcTimeout(conf)
 // Submit tasks only after (registered resources / total expected resources)
 // is equal to at least this value, that is double between 0 and 1.
 private val _minRegisteredRatio =
 math.min(1, conf.getDouble("spark.scheduler.minRegisteredResourcesRatio", 0))

 

override val minRegisteredRatio =
 if (conf.getOption("spark.scheduler.minRegisteredResourcesRatio").isEmpty) {
 0.8
 } else {
 super.minRegisteredRatio
 }

 

Apparently, if "spark.scheduler.minRegisteredResourcesRatio" is not configured, 
the default value is zero, not 0.8.

Is that on purpose?
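Regardless of which default applies, the ratio can always be set explicitly; a small sketch (the 0.8 value is just an example):

{code:scala}
import org.apache.spark.sql.SparkSession

// Setting the config explicitly removes any ambiguity about the default.
val spark = SparkSession.builder()
  .config("spark.scheduler.minRegisteredResourcesRatio", "0.8")
  .getOrCreate()
{code}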

 

 

 

 

 

 






[jira] [Created] (SPARK-26992) Fix STS scheduler pool correct delivery

2019-02-25 Thread dzcxzl (JIRA)
dzcxzl created SPARK-26992:
--

 Summary: Fix STS scheduler pool correct delivery
 Key: SPARK-26992
 URL: https://issues.apache.org/jira/browse/SPARK-26992
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0, 2.0.0
Reporter: dzcxzl


The user sets the value of spark.sql.thriftserver.scheduler.pool.
The Spark Thrift Server stores this value in a thread-local LocalProperty but 
does not clear it after the statement finishes, so other sessions end up 
running in the pool that a previous session set.
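A minimal sketch of the general pattern at issue (an illustration, not the actual Thrift Server code): scheduler pools are selected through a thread-local SparkContext local property, so a server that reuses threads must reset it after each statement.

{code:scala}
import org.apache.spark.SparkContext

// Run a block of work in a named scheduler pool, then clear the thread-local
// property so later work on the same thread falls back to the default pool.
def runInPool[T](sc: SparkContext, pool: String)(body: => T): T = {
  sc.setLocalProperty("spark.scheduler.pool", pool)
  try body
  finally sc.setLocalProperty("spark.scheduler.pool", null) // null removes the property
}
{code}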






[jira] [Updated] (SPARK-26992) Fix STS scheduler pool correct delivery

2019-02-25 Thread dzcxzl (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-26992:
---
Description: 
The user sets the value of spark.sql.thriftserver.scheduler.pool.
The Spark Thrift Server stores this value in a thread-local LocalProperty but 
does not clear it after the statement finishes, so other sessions end up 
running in the pool that a previous session set.

For example: the second session does not set a pool name manually, so the 
default pool should be used, but the pool name from the previous user's 
setting is used instead. This is incorrect.

!error_session.png!

 

!error_stage.png!

 

  was:
The user sets the value of spark.sql.thriftserver.scheduler.pool.
The Spark Thrift Server stores this value in a thread-local LocalProperty but 
does not clear it after the statement finishes, so other sessions end up 
running in the pool that a previous session set.

For example: the second session does not set a pool name manually, so the 
default pool should be used, but the pool name from the previous user's 
setting is used instead. This is incorrect.

 


> Fix STS scheduler pool correct delivery
> ---
>
> Key: SPARK-26992
> URL: https://issues.apache.org/jira/browse/SPARK-26992
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.4.0
>Reporter: dzcxzl
>Priority: Minor
> Attachments: error_session.png, error_stage.png
>
>
> The user sets the value of spark.sql.thriftserver.scheduler.pool.
> The Spark Thrift Server stores this value in a thread-local LocalProperty but 
> does not clear it after the statement finishes, so other sessions end up 
> running in the pool that a previous session set.
>  
> For example: the second session does not set a pool name manually, so the 
> default pool should be used, but the pool name from the previous user's 
> setting is used instead. This is incorrect.
> !error_session.png!
>  
> !error_stage.png!
>  






[jira] [Updated] (SPARK-26992) Fix STS scheduler pool correct delivery

2019-02-25 Thread dzcxzl (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-26992:
---
Attachment: error_session.png

> Fix STS scheduler pool correct delivery
> ---
>
> Key: SPARK-26992
> URL: https://issues.apache.org/jira/browse/SPARK-26992
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.4.0
>Reporter: dzcxzl
>Priority: Minor
> Attachments: error_session.png, error_stage.png
>
>
> The user sets the value of spark.sql.thriftserver.scheduler.pool.
> The Spark Thrift Server stores this value in a thread-local LocalProperty but 
> does not clear it after the statement finishes, so other sessions end up 
> running in the pool that a previous session set.
>  
> For example: the second session does not set a pool name manually, so the 
> default pool should be used, but the pool name from the previous user's 
> setting is used instead. This is incorrect.
>  






[jira] [Updated] (SPARK-26992) Fix STS scheduler pool correct delivery

2019-02-25 Thread dzcxzl (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-26992:
---
Description: 
The user sets the value of spark.sql.thriftserver.scheduler.pool.
The Spark Thrift Server stores this value in a thread-local LocalProperty but 
does not clear it after the statement finishes, so other sessions end up 
running in the pool that a previous session set.

For example: the second session does not set a pool name manually, so the 
default pool should be used, but the pool name from the previous user's 
setting is used instead. This is incorrect.

 

  was:
The user sets the value of spark.sql.thriftserver.scheduler.pool.
The Spark Thrift Server stores this value in a thread-local LocalProperty but 
does not clear it after the statement finishes, so other sessions end up 
running in the pool that a previous session set.

For example: the second session does not set a pool name manually, so the 
default pool should be used, but the pool name from the previous user's 
setting is used instead. This is incorrect.

!image-2019-02-26-15-20-51-076.png!

 

!image-2019-02-26-15-21-02-966.png!


> Fix STS scheduler pool correct delivery
> ---
>
> Key: SPARK-26992
> URL: https://issues.apache.org/jira/browse/SPARK-26992
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.4.0
>Reporter: dzcxzl
>Priority: Minor
>
> The user sets the value of spark.sql.thriftserver.scheduler.pool.
> The Spark Thrift Server stores this value in a thread-local LocalProperty but 
> does not clear it after the statement finishes, so other sessions end up 
> running in the pool that a previous session set.
>  
> For example: the second session does not set a pool name manually, so the 
> default pool should be used, but the pool name from the previous user's 
> setting is used instead. This is incorrect.
>  






[jira] [Updated] (SPARK-26992) Fix STS scheduler pool correct delivery

2019-02-25 Thread dzcxzl (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-26992:
---
Description: 
The user sets the value of spark.sql.thriftserver.scheduler.pool.
The Spark Thrift Server stores this value in a thread-local LocalProperty but 
does not clear it after the statement finishes, so other sessions end up 
running in the pool that a previous session set.

For example: the second session does not set a pool name manually, so the 
default pool should be used, but the pool name from the previous user's 
setting is used instead. This is incorrect.

!image-2019-02-26-15-20-51-076.png!

 

!image-2019-02-26-15-21-02-966.png!

  was:
The user sets the value of spark.sql.thriftserver.scheduler.pool.
The Spark Thrift Server stores this value in a thread-local LocalProperty but 
does not clear it after the statement finishes, so other sessions end up 
running in the pool that a previous session set.


> Fix STS scheduler pool correct delivery
> ---
>
> Key: SPARK-26992
> URL: https://issues.apache.org/jira/browse/SPARK-26992
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.4.0
>Reporter: dzcxzl
>Priority: Minor
>
> The user sets the value of spark.sql.thriftserver.scheduler.pool.
> The Spark Thrift Server stores this value in a thread-local LocalProperty but 
> does not clear it after the statement finishes, so other sessions end up 
> running in the pool that a previous session set.
>  
> For example: the second session does not set a pool name manually, so the 
> default pool should be used, but the pool name from the previous user's 
> setting is used instead. This is incorrect.
> !image-2019-02-26-15-20-51-076.png!
>  
> !image-2019-02-26-15-21-02-966.png!






[jira] [Assigned] (SPARK-26992) Fix STS scheduler pool correct delivery

2019-02-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26992:


Assignee: (was: Apache Spark)

> Fix STS scheduler pool correct delivery
> ---
>
> Key: SPARK-26992
> URL: https://issues.apache.org/jira/browse/SPARK-26992
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.4.0
>Reporter: dzcxzl
>Priority: Minor
>
> The user sets the value of spark.sql.thriftserver.scheduler.pool.
> The Spark Thrift Server stores this value in a thread-local LocalProperty but 
> does not clear it after the statement finishes, so other sessions end up 
> running in the pool that a previous session set.






[jira] [Assigned] (SPARK-26992) Fix STS scheduler pool correct delivery

2019-02-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26992:


Assignee: Apache Spark

> Fix STS scheduler pool correct delivery
> ---
>
> Key: SPARK-26992
> URL: https://issues.apache.org/jira/browse/SPARK-26992
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.4.0
>Reporter: dzcxzl
>Assignee: Apache Spark
>Priority: Minor
>
> The user sets the value of spark.sql.thriftserver.scheduler.pool.
> The Spark Thrift Server stores this value in a thread-local LocalProperty but 
> does not clear it after the statement finishes, so other sessions end up 
> running in the pool that a previous session set.






[jira] [Commented] (SPARK-26992) Fix STS scheduler pool correct delivery

2019-02-25 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777667#comment-16777667
 ] 

Apache Spark commented on SPARK-26992:
--

User 'cxzl25' has created a pull request for this issue:
https://github.com/apache/spark/pull/23895

> Fix STS scheduler pool correct delivery
> ---
>
> Key: SPARK-26992
> URL: https://issues.apache.org/jira/browse/SPARK-26992
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.4.0
>Reporter: dzcxzl
>Priority: Minor
>
> The user sets the value of spark.sql.thriftserver.scheduler.pool.
> The Spark Thrift Server stores this value in a thread-local LocalProperty but 
> does not clear it after the statement finishes, so other sessions end up 
> running in the pool that a previous session set.






[jira] [Assigned] (SPARK-26990) Difference in handling of mixed-case partition column names after SPARK-26188

2019-02-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26990:


Assignee: (was: Apache Spark)

> Difference in handling of mixed-case partition column names after SPARK-26188
> -
>
> Key: SPARK-26990
> URL: https://issues.apache.org/jira/browse/SPARK-26990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Bruce Robbins
>Priority: Major
>
> I noticed that the [PR for 
> SPARK-26188|https://github.com/apache/spark/pull/23165] changed how 
> mixed-cased partition columns are handled when the user provides a schema.
> Say I have this file structure (note that each instance of `pS` is mixed 
> case):
> {noformat}
> bash-3.2$ find partitioned5 -type d
> partitioned5
> partitioned5/pi=2
> partitioned5/pi=2/pS=foo
> partitioned5/pi=2/pS=bar
> partitioned5/pi=1
> partitioned5/pi=1/pS=foo
> partitioned5/pi=1/pS=bar
> bash-3.2$
> {noformat}
> If I load the file with a user-provided schema in 2.4 (before the PR was 
> committed) or 2.3, I see:
> {noformat}
> scala> val df = spark.read.schema("intField int, pi int, ps 
> string").parquet("partitioned5")
> df: org.apache.spark.sql.DataFrame = [intField: int, pi: int ... 1 more field]
> scala> df.printSchema
> root
>  |-- intField: integer (nullable = true)
>  |-- pi: integer (nullable = true)
>  |-- ps: string (nullable = true)
> scala>
> {noformat}
> However, using 2.4 after the PR was committed, I see:
> {noformat}
> scala> val df = spark.read.schema("intField int, pi int, ps 
> string").parquet("partitioned5")
> df: org.apache.spark.sql.DataFrame = [intField: int, pi: int ... 1 more field]
> scala> df.printSchema
> root
>  |-- intField: integer (nullable = true)
>  |-- pi: integer (nullable = true)
>  |-- pS: string (nullable = true)
> scala>
> {noformat}
> Spark is picking up the mixed-case column name {{pS}} from the directory 
> name, not the lower-case {{ps}} from my specified schema.
> In all tests, {{spark.sql.caseSensitive}} is set to the default (false).
> Not sure if this is a bug, but it is a difference.






[jira] [Commented] (SPARK-26932) Orc compatibility between hive and spark

2019-02-25 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777654#comment-16777654
 ] 

Dongjoon Hyun commented on SPARK-26932:
---

`Migration Guide` might be the best place for that. Please use the migration 
guide from 2.3 to 2.4.
- 
https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-23-to-24

> Orc compatibility between hive and spark
> 
>
> Key: SPARK-26932
> URL: https://issues.apache.org/jira/browse/SPARK-26932
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: Bo Hai
>Priority: Minor
>
> As of Spark 2.3 and Hive 2.3, both support using apache/orc as the ORC writer 
> and reader. Older versions of Hive use their own ORC reader, which is not 
> forward-compatible.
> So Hive 2.2 and older cannot read ORC tables created by Spark 2.3 and newer 
> when they use apache/orc instead of the Hive ORC implementation.
> I think we should add this information to the Spark 2.4 ORC documentation 
> page: https://spark.apache.org/docs/2.4.0/sql-data-sources-orc.html






[jira] [Commented] (SPARK-26990) Difference in handling of mixed-case partition column names after SPARK-26188

2019-02-25 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777663#comment-16777663
 ] 

Apache Spark commented on SPARK-26990:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/23894

> Difference in handling of mixed-case partition column names after SPARK-26188
> -
>
> Key: SPARK-26990
> URL: https://issues.apache.org/jira/browse/SPARK-26990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Bruce Robbins
>Priority: Major
>
> I noticed that the [PR for 
> SPARK-26188|https://github.com/apache/spark/pull/23165] changed how 
> mixed-cased partition columns are handled when the user provides a schema.
> Say I have this file structure (note that each instance of `pS` is mixed 
> case):
> {noformat}
> bash-3.2$ find partitioned5 -type d
> partitioned5
> partitioned5/pi=2
> partitioned5/pi=2/pS=foo
> partitioned5/pi=2/pS=bar
> partitioned5/pi=1
> partitioned5/pi=1/pS=foo
> partitioned5/pi=1/pS=bar
> bash-3.2$
> {noformat}
> If I load the file with a user-provided schema in 2.4 (before the PR was 
> committed) or 2.3, I see:
> {noformat}
> scala> val df = spark.read.schema("intField int, pi int, ps 
> string").parquet("partitioned5")
> df: org.apache.spark.sql.DataFrame = [intField: int, pi: int ... 1 more field]
> scala> df.printSchema
> root
>  |-- intField: integer (nullable = true)
>  |-- pi: integer (nullable = true)
>  |-- ps: string (nullable = true)
> scala>
> {noformat}
> However, using 2.4 after the PR was committed, I see:
> {noformat}
> scala> val df = spark.read.schema("intField int, pi int, ps 
> string").parquet("partitioned5")
> df: org.apache.spark.sql.DataFrame = [intField: int, pi: int ... 1 more field]
> scala> df.printSchema
> root
>  |-- intField: integer (nullable = true)
>  |-- pi: integer (nullable = true)
>  |-- pS: string (nullable = true)
> scala>
> {noformat}
> Spark is picking up the mixed-case column name {{pS}} from the directory 
> name, not the lower-case {{ps}} from my specified schema.
> In all tests, {{spark.sql.caseSensitive}} is set to the default (false).
> Not sure if this is a bug, but it is a difference.






[jira] [Commented] (SPARK-26932) Orc compatibility between hive and spark

2019-02-25 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777648#comment-16777648
 ] 

Dongjoon Hyun commented on SPARK-26932:
---

Thank you for updating, [~haiboself]. 

So, does Apache Hive also have a document for this? For example, Hive 2.3.x 
generates some ORC tables which Hive 2.2.1 cannot read. We can add a reference 
to that Hive document if it exists. In general, this is a Hive-side read issue, 
isn't it?

BTW, as I wrote in the mailing list, Spark 2.3.x has `spark.sql.orc.impl=hive` 
by default. So, I don't think we need a document for that. For Spark 2.4, 
please make a PR. I'm +1.
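For reference, a quick spark-shell sketch of checking or switching the ORC implementation mentioned above; the config name comes from the comment, and the stated defaults reflect my understanding of 2.3 vs 2.4.

{code:scala}
// "hive" uses the Hive-based ORC reader/writer; "native" uses apache/orc.
spark.conf.get("spark.sql.orc.impl")
spark.conf.set("spark.sql.orc.impl", "hive")  // e.g. keep output readable by older Hive readers
{code}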

> Orc compatibility between hive and spark
> 
>
> Key: SPARK-26932
> URL: https://issues.apache.org/jira/browse/SPARK-26932
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0
>Reporter: Bo Hai
>Priority: Minor
>
> As of Spark 2.3 and Hive 2.3, both support using apache/orc as the ORC writer 
> and reader. Older versions of Hive use their own ORC reader, which is not 
> forward-compatible.
> So Hive 2.2 and older cannot read ORC tables created by Spark 2.3 and newer 
> when they use apache/orc instead of the Hive ORC implementation.
> I think we should add this information to the Spark 2.4 ORC documentation 
> page: https://spark.apache.org/docs/2.4.0/sql-data-sources-orc.html






[jira] [Commented] (SPARK-26777) SQL worked in 2.3.2 and fails in 2.4.0

2019-02-25 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777642#comment-16777642
 ] 

Jungtaek Lim commented on SPARK-26777:
--

[~ilya745]
You may want to post a new issue to update the summary and description with your 
reproducer. If your reproducer requires some input files, zipping your project 
and attaching it would be very helpful, like the attachment in SPARK-22000.

> SQL worked in 2.3.2 and fails in 2.4.0
> --
>
> Key: SPARK-26777
> URL: https://issues.apache.org/jira/browse/SPARK-26777
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Yuri Budilov
>Priority: Major
>
> Following SQL worked in Spark 2.3.2 and now fails on 2.4.0 (AWS EMR Spark)
>  PySpark call below:
> spark.sql("select partition_year_utc,partition_month_utc,partition_day_utc \
> from datalake_reporting.copy_of_leads_notification \
> where partition_year_utc = (select max(partition_year_utc) from 
> datalake_reporting.copy_of_leads_notification) \
> and partition_month_utc = \
>  (select max(partition_month_utc) from 
> datalake_reporting.copy_of_leads_notification as m \
>  where \
>  m.partition_year_utc = (select max(partition_year_utc) from 
> datalake_reporting.copy_of_leads_notification)) \
>  and partition_day_utc = (select max(d.partition_day_utc) from 
> datalake_reporting.copy_of_leads_notification as d \
>  where d.partition_month_utc = \
>  (select max(m1.partition_month_utc) from 
> datalake_reporting.copy_of_leads_notification as m1 \
>  where m1.partition_year_utc = \
>  (select max(y.partition_year_utc) from 
> datalake_reporting.copy_of_leads_notification as y) \
>  ) \
>  ) \
>  order by 1 desc, 2 desc, 3 desc limit 1 ").show(1,False)
> Error (no data is needed; this is a query-compilation issue):
> py4j.protocol.Py4JJavaError: An error occurred while calling o1326.showString.
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#4495 []
>  
> Note: all 3 columns in the query are partition columns (see the bottom of the 
> schema).
>  
> Hive EMR AWS Schema is:
>  
> CREATE EXTERNAL TABLE `copy_of_leads_notification`(
> `message.environment.siteorigin` string, `dcpheader.dcploaddateutc` string, 
> `message.id` int, `source.properties._country` string, `message.created` 
> string, `dcpheader.generatedmessageid` string, `message.tags` bigint, 
> `source.properties._enqueuedtimeutc` string, `source.properties._leadtype` 
> string, `message.itemid` string, `message.prospect.postcode` string, 
> `message.prospect.email` string, `message.referenceid` string, 
> `message.item.year` string, `message.identifier` string, 
> `dcpheader.dcploadmonthutc` string, `message.processed` string, 
> `source.properties._tenant` string, `message.item.price` string, 
> `message.subscription.confirmresponse` boolean, `message.itemtype` string, 
> `message.prospect.lastname` string, `message.subscription.insurancequote` 
> boolean, `source.exchangename` string, 
> `message.prospect.identificationnumbers` bigint, 
> `message.environment.ipaddress` string, `dcpheader.dcploaddayutc` string, 
> `source.properties._itemtype` string, `source.properties._requesttype` 
> string, `message.item.make` string, `message.prospect.firstname` string, 
> `message.subscription.survey` boolean, `message.prospect.homephone` string, 
> `message.extendedproperties` bigint, `message.subscription.financequote` 
> boolean, `message.uniqueidentifier` string, `source.properties._id` string, 
> `dcpheader.sourcemessageguid` string, `message.requesttype` string, 
> `source.routingkey` string, `message.service` string, `message.item.model` 
> string, `message.environment.pagesource` string, `source.source` string, 
> `message.sellerid` string, `partition_date_utc` string, 
> `message.selleridentifier` string, `message.subscription.newsletter` boolean, 
> `dcpheader.dcploadyearutc` string, `message.leadtype` string, 
> `message.history` bigint, `message.callconnect.calloutcome` string, 
> `message.callconnect.datecreatedutc` string, 
> `message.callconnect.callrecordingurl` string, 
> `message.callconnect.transferoutcome` string, 
> `message.callconnect.hiderecording` boolean, 
> `message.callconnect.callstartutc` string, `message.callconnect.code` string, 
> `message.callconnect.callduration` string, `message.fraudnetinfo` string, 
> `message.callconnect.answernumber` string, `message.environment.sourcedevice` 
> string, `message.comments` string, `message.fraudinfo.servervariables` 
> bigint, `message.callconnect.servicenumber` string, 
> `message.callconnect.callid` string, `message.callconnect.voicemailurl` 
> string, `message.item.stocknumber` string, 
> `message.callconnect.answerduration` string, `message.callconnect.callendutc` 
> string, `message.item.series` string, 

[jira] [Resolved] (SPARK-26952) Row count statics should respect the data reported by data source

2019-02-25 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26952.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23853
[https://github.com/apache/spark/pull/23853]

> Row count statics should respect the data reported by data source
> -
>
> Key: SPARK-26952
> URL: https://issues.apache.org/jira/browse/SPARK-26952
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xianyang Liu
>Assignee: Xianyang Liu
>Priority: Major
> Fix For: 3.0.0
>
>
> In Data Source V2, if the data source scan implements 
> `SupportsReportStatistics`, `DataSourceV2Relation` should respect the row 
> count reported by the data source.






[jira] [Assigned] (SPARK-26952) Row count statics should respect the data reported by data source

2019-02-25 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26952:
---

Assignee: Xianyang Liu

> Row count statics should respect the data reported by data source
> -
>
> Key: SPARK-26952
> URL: https://issues.apache.org/jira/browse/SPARK-26952
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xianyang Liu
>Assignee: Xianyang Liu
>Priority: Major
>
> In Data Source V2, if the data source scan implements 
> `SupportsReportStatistics`, `DataSourceV2Relation` should respect the row 
> count reported by the data source.






[jira] [Created] (SPARK-26991) Investigate difference of `returnNullable` between ScalaReflection.deserializerFor and JavaTypeInference.deserializerFor

2019-02-25 Thread Jungtaek Lim (JIRA)
Jungtaek Lim created SPARK-26991:


 Summary: Investigate difference of `returnNullable` between 
ScalaReflection.deserializerFor and JavaTypeInference.deserializerFor
 Key: SPARK-26991
 URL: https://issues.apache.org/jira/browse/SPARK-26991
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


This issue tracks the investigation of the difference between 
ScalaReflection.deserializerFor and JavaTypeInference.deserializerFor, 
especially the reason why the Java side uses `returnNullable = true` whereas 
the Scala side uses `returnNullable = false`.

The origin discussion is linked here:
https://github.com/apache/spark/pull/23854#discussion_r260117702







[jira] [Commented] (SPARK-26984) Incompatibility between Spark releases - Some(null)

2019-02-25 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777609#comment-16777609
 ] 

Jungtaek Lim commented on SPARK-26984:
--

IMHO, the behavior of Spark 2.2 looks wrong, and the current behavior looks correct 
to me. Consider how the values should be interpreted in Spark: Some(null) and 
None are literally not the same. If `None` maps to null, what would be the 
correct value for `Some(null)`?

I'd recommend using "Option()" if you're uncertain about the nullability of a value.
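A quick plain-Scala illustration of that recommendation:

{code:scala}
val a: Option[String] = Some(null)   // a Some that really wraps null
val b: Option[String] = Option(null) // None -- Option() turns null into None
{code}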

> Incompatibility between Spark releases - Some(null) 
> 
>
> Key: SPARK-26984
> URL: https://issues.apache.org/jira/browse/SPARK-26984
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Linux CentOS, Databricks.
>Reporter: Gerard Alexander
>Priority: Minor
>  Labels: newbie
> Fix For: 2.4.1, 2.4.2
>
>
> Please refer to 
> [https://stackoverflow.com/questions/54851205/why-does-somenull-throw-nullpointerexception-in-spark-2-4-but-worked-in-2-2/54861152#54861152.]
> NB: Not sure the priority is correct - no doubt someone will evaluate it.
> It is noted that the following:
> {{val df = Seq( }}
> {{  (1, Some("a"), Some(1)), }}
> {{  (2, Some(null), Some(2)), }}
> {{  (3, Some("c"), Some(3)), }}
> {{  (4, None, None) ).toDF("c1", "c2", "c3")}}
> In Spark 2.2.1 (on MapR) the Some(null) works fine; in Spark 2.4.0 on 
> Databricks an error ensues.
> {{java.lang.RuntimeException: Error while encoding: 
> java.lang.NullPointerException assertnotnull(assertnotnull(input[0, 
> scala.Tuple3, true]))._1 AS _1#6 staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> unwrapoption(ObjectType(class java.lang.String), 
> assertnotnull(assertnotnull(input[0, scala.Tuple3, true]))._2), true, false) 
> AS _2#7 unwrapoption(IntegerType, assertnotnull(assertnotnull(input[0, 
> scala.Tuple3, true]))._3) AS _3#8 at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:293)
>  at 
> org.apache.spark.sql.SparkSession.$anonfun$createDataset$1(SparkSession.scala:472)
>  at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:233) at 
> scala.collection.immutable.List.foreach(List.scala:388) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:233) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:226) at 
> scala.collection.immutable.List.map(List.scala:294) at 
> org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:472) at 
> org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377) at 
> org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:228)
>  ... 57 elided Caused by: java.lang.NullPointerException at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:289)
>  ... 66 more}}
>  
> You can argue it is solvable otherwise, but there may well be an existing 
> code base that could be affected.
>  
>  






[jira] [Resolved] (SPARK-26968) option("quoteMode", "NON_NUMERIC") have no effect on a CSV generation

2019-02-25 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26968.
--
Resolution: Won't Fix

> option("quoteMode", "NON_NUMERIC") have no effect on a CSV generation
> -
>
> Key: SPARK-26968
> URL: https://issues.apache.org/jira/browse/SPARK-26968
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: M. Le Bihan
>Priority: Minor
>
> I have a CSV to write that has this schema:
> {code:java}
> StructType s = schema.add("codeCommuneCR", StringType, false);
> s = s.add("nomCommuneCR", StringType, false);
> s = s.add("populationCR", IntegerType, false);
> s = s.add("resultatComptable", IntegerType, false);{code}
> Whether I don't provide a "_quoteMode_" option at all or I set it to 
> {{NON_NUMERIC}}, like this:
> {code:java}
> ds.coalesce(1).write().mode(SaveMode.Overwrite) .option("header", "true") 
> .option("quoteMode", "NON_NUMERIC").option("quote", "\"") 
> .csv("./target/out_200071470.csv");{code}
> the CSV written by {{Spark}} is this one :
> {code:java}
> codeCommuneCR,nomCommuneCR,populationCR,resultatComptable
> 03142,LENAX,267,43{code}
> If I set the "_quoteAll_" option instead, like this:
> {code:java}
> ds.coalesce(1).write().mode(SaveMode.Overwrite) .option("header", "true") 
> .option("quoteAll", true).option("quote", "\"") 
> .csv("./target/out_200071470.csv");{code}
> it generates :
> {code:java}
> "codeCommuneCR","nomCommuneCR","populationCR","resultatComptable" 
> "03142","LENAX","267","43"{code}
> It seems that the {{.option("quoteMode", "NON_NUMERIC")}} is broken. It 
> should generate:
>  
> {code:java}
> "codeCommuneCR","nomCommuneCR","populationCR","resultatComptable"
> "03142","LENAX",267,43
> {code}
>  






[jira] [Commented] (SPARK-26968) option("quoteMode", "NON_NUMERIC") have no effect on a CSV generation

2019-02-25 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777508#comment-16777508
 ] 

Hyukjin Kwon commented on SPARK-26968:
--

There's no official compatibility between the external spark-csv library and 
Spark's internal CSV datasource. This option existed in the external library, 
and it was dropped when the datasource was brought into Spark. It's technically 
invalid to call it a regression.

I don't think it's worth fixing since the core parser does not support this 
- we cannot add options for every corner case. Please don't reopen it until 
there is a clear intention to fix it. Resolving the JIRA doesn't necessarily 
mean we can't discuss it anymore.
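For completeness, a sketch of the closest behaviour the built-in CSV writer offers today, mirroring the quoteAll example from the report (it quotes every field, including numeric ones); `ds` is the Dataset from the report.

{code:scala}
// quoteAll forces quoting of all fields; there is no NON_NUMERIC mode here.
ds.coalesce(1).write.mode("overwrite")
  .option("header", "true")
  .option("quoteAll", "true")
  .option("quote", "\"")
  .csv("./target/out_200071470.csv")
{code}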

> option("quoteMode", "NON_NUMERIC") have no effect on a CSV generation
> -
>
> Key: SPARK-26968
> URL: https://issues.apache.org/jira/browse/SPARK-26968
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: M. Le Bihan
>Priority: Minor
>
> I have a CSV to write that has this schema:
> {code:java}
> StructType s = schema.add("codeCommuneCR", StringType, false);
> s = s.add("nomCommuneCR", StringType, false);
> s = s.add("populationCR", IntegerType, false);
> s = s.add("resultatComptable", IntegerType, false);{code}
> Whether I don't provide a "_quoteMode_" option at all or I set it to 
> {{NON_NUMERIC}}, like this:
> {code:java}
> ds.coalesce(1).write().mode(SaveMode.Overwrite) .option("header", "true") 
> .option("quoteMode", "NON_NUMERIC").option("quote", "\"") 
> .csv("./target/out_200071470.csv");{code}
> the CSV written by {{Spark}} is this one :
> {code:java}
> codeCommuneCR,nomCommuneCR,populationCR,resultatComptable
> 03142,LENAX,267,43{code}
> If I set the "_quoteAll_" option instead, like this:
> {code:java}
> ds.coalesce(1).write().mode(SaveMode.Overwrite) .option("header", "true") 
> .option("quoteAll", true).option("quote", "\"") 
> .csv("./target/out_200071470.csv");{code}
> it generates :
> {code:java}
> "codeCommuneCR","nomCommuneCR","populationCR","resultatComptable" 
> "03142","LENAX","267","43"{code}
> It seems that the {{.option("quoteMode", "NON_NUMERIC")}} is broken. It 
> should generate:
>  
> {code:java}
> "codeCommuneCR","nomCommuneCR","populationCR","resultatComptable"
> "03142","LENAX",267,43
> {code}
>  






[jira] [Updated] (SPARK-26990) Difference in handling of mixed-case partition column names after SPARK-26188

2019-02-25 Thread Bruce Robbins (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-26990:
--
Summary: Difference in handling of mixed-case partition column names after 
SPARK-26188  (was: Difference in handling of mixed-case partition columns after 
SPARK-26188)

> Difference in handling of mixed-case partition column names after SPARK-26188
> -
>
> Key: SPARK-26990
> URL: https://issues.apache.org/jira/browse/SPARK-26990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Bruce Robbins
>Priority: Major
>
> I noticed that the [PR for 
> SPARK-26188|https://github.com/apache/spark/pull/23165] changed how 
> mixed-cased partition columns are handled when the user provides a schema.
> Say I have this file structure (note that each instance of `pS` is mixed 
> case):
> {noformat}
> bash-3.2$ find partitioned5 -type d
> partitioned5
> partitioned5/pi=2
> partitioned5/pi=2/pS=foo
> partitioned5/pi=2/pS=bar
> partitioned5/pi=1
> partitioned5/pi=1/pS=foo
> partitioned5/pi=1/pS=bar
> bash-3.2$
> {noformat}
> If I load the file with a user-provided schema in 2.4 (before the PR was 
> committed) or 2.3, I see:
> {noformat}
> scala> val df = spark.read.schema("intField int, pi int, ps 
> string").parquet("partitioned5")
> df: org.apache.spark.sql.DataFrame = [intField: int, pi: int ... 1 more field]
> scala> df.printSchema
> root
>  |-- intField: integer (nullable = true)
>  |-- pi: integer (nullable = true)
>  |-- ps: string (nullable = true)
> scala>
> {noformat}
> However, using 2.4 after the PR was committed, I see:
> {noformat}
> scala> val df = spark.read.schema("intField int, pi int, ps 
> string").parquet("partitioned5")
> df: org.apache.spark.sql.DataFrame = [intField: int, pi: int ... 1 more field]
> scala> df.printSchema
> root
>  |-- intField: integer (nullable = true)
>  |-- pi: integer (nullable = true)
>  |-- pS: string (nullable = true)
> scala>
> {noformat}
> Spark is picking up the mixed-case column name {{pS}} from the directory 
> name, not the lower-case {{ps}} from my specified schema.
> In all tests, {{spark.sql.caseSensitive}} is set to the default (false).
> Not sure if this is a bug, but it is a difference.






[jira] [Created] (SPARK-26990) Difference in handling of mixed-case partition columns after SPARK-26188

2019-02-25 Thread Bruce Robbins (JIRA)
Bruce Robbins created SPARK-26990:
-

 Summary: Difference in handling of mixed-case partition columns 
after SPARK-26188
 Key: SPARK-26990
 URL: https://issues.apache.org/jira/browse/SPARK-26990
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.1
Reporter: Bruce Robbins


I noticed that the [PR for 
SPARK-26188|https://github.com/apache/spark/pull/23165] changed how mixed-cased 
partition columns are handled when the user provides a schema.

Say I have this file structure (note that each instance of `pS` is mixed case):
{noformat}
bash-3.2$ find partitioned5 -type d
partitioned5
partitioned5/pi=2
partitioned5/pi=2/pS=foo
partitioned5/pi=2/pS=bar
partitioned5/pi=1
partitioned5/pi=1/pS=foo
partitioned5/pi=1/pS=bar
bash-3.2$
{noformat}
If I load the file with a user-provided schema in 2.4 (before the PR was 
committed) or 2.3, I see:
{noformat}
scala> val df = spark.read.schema("intField int, pi int, ps 
string").parquet("partitioned5")
df: org.apache.spark.sql.DataFrame = [intField: int, pi: int ... 1 more field]
scala> df.printSchema
root
 |-- intField: integer (nullable = true)
 |-- pi: integer (nullable = true)
 |-- ps: string (nullable = true)
scala>
{noformat}
However, using 2.4 after the PR was committed, I see:
{noformat}
scala> val df = spark.read.schema("intField int, pi int, ps 
string").parquet("partitioned5")
df: org.apache.spark.sql.DataFrame = [intField: int, pi: int ... 1 more field]
scala> df.printSchema
root
 |-- intField: integer (nullable = true)
 |-- pi: integer (nullable = true)
 |-- pS: string (nullable = true)
scala>
{noformat}
Spark is picking up the mixed-case column name {{pS}} from the directory name, 
not the lower-case {{ps}} from my specified schema.

In all tests, {{spark.sql.caseSensitive}} is set to the default (false).

Not sure if this is a bug, but it is a difference.






[jira] [Resolved] (SPARK-26674) Consolidate CompositeByteBuf when reading large frame

2019-02-25 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26674.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23602
[https://github.com/apache/spark/pull/23602]

> Consolidate CompositeByteBuf when reading large frame
> -
>
> Key: SPARK-26674
> URL: https://issues.apache.org/jira/browse/SPARK-26674
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: liupengcheng
>Assignee: liupengcheng
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, TransportFrameDecoder does not consolidate the buffers read from 
> the network, which may waste memory. In most cases a ByteBuf's writerIndex is 
> far less than its capacity, so we can optimize this by consolidating the 
> buffers.
>  






[jira] [Assigned] (SPARK-26674) Consolidate CompositeByteBuf when reading large frame

2019-02-25 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-26674:
--

Assignee: liupengcheng

> Consolidate CompositeByteBuf when reading large frame
> -
>
> Key: SPARK-26674
> URL: https://issues.apache.org/jira/browse/SPARK-26674
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: liupengcheng
>Assignee: liupengcheng
>Priority: Major
>
> Currently, TransportFrameDecoder does not consolidate the buffers read from 
> the network, which may waste memory. In most cases a ByteBuf's writerIndex is 
> far less than its capacity, so we can optimize this by consolidating the 
> buffers.
>  






[jira] [Created] (SPARK-26989) Flaky test:DAGSchedulerSuite.Barrier task failures from the same stage attempt don't trigger multiple stage retries

2019-02-25 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-26989:
--

 Summary: Flaky test:DAGSchedulerSuite.Barrier task failures from 
the same stage attempt don't trigger multiple stage retries
 Key: SPARK-26989
 URL: https://issues.apache.org/jira/browse/SPARK-26989
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Marcelo Vanzin


https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/102761/testReport/junit/org.apache.spark.scheduler/DAGSchedulerSuite/Barrier_task_failures_from_the_same_stage_attempt_don_t_trigger_multiple_stage_retries/

{noformat}
org.apache.spark.scheduler.DAGSchedulerSuite.Barrier task failures from the 
same stage attempt don't trigger multiple stage retries

Error Message
org.scalatest.exceptions.TestFailedException: ArrayBuffer() did not equal 
List(0)

Stacktrace
sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
ArrayBuffer() did not equal List(0)
at 
org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
at 
org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527)
at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
at 
org.apache.spark.scheduler.DAGSchedulerSuite.$anonfun$new$144(DAGSchedulerSuite.scala:2644)
at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:104)
at 
org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
at 
org.apache.spark.scheduler.DAGSchedulerSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(DAGSchedulerSuite.scala:122)
{noformat}







[jira] [Commented] (SPARK-23836) Support returning StructType to the level support in GroupedMap Arrow's "scalar" UDFS (or similar)

2019-02-25 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777416#comment-16777416
 ] 

Bryan Cutler commented on SPARK-23836:
--

I can work on this

> Support returning StructType to the level support in GroupedMap Arrow's 
> "scalar" UDFS (or similar)
> --
>
> Key: SPARK-23836
> URL: https://issues.apache.org/jira/browse/SPARK-23836
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Major
>
> Currently not all of the supported types can be returned from a scalar 
> pandas UDF. This means that if someone wants to return a struct type from a 
> map operation right now, they either have to do a "junk" groupBy or use the 
> non-vectorized results.






[jira] [Resolved] (SPARK-24103) BinaryClassificationEvaluator should use sample weight data

2019-02-25 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24103.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 17084
[https://github.com/apache/spark/pull/17084]

> BinaryClassificationEvaluator should use sample weight data
> ---
>
> Key: SPARK-24103
> URL: https://issues.apache.org/jira/browse/SPARK-24103
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Ilya Matiach
>Assignee: Ilya Matiach
>Priority: Major
>  Labels: starter
> Fix For: 3.0.0
>
>
> The LogisticRegression and LinearRegression models support training with a 
> weight column, but the corresponding evaluators do not support computing 
> metrics using those weights. This breaks model selection using CrossValidator.
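A sketch of the resulting usage, assuming the Spark 3.0.0 evaluator API that this change introduces; the column names are illustrative.

{code:scala}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
  .setWeightCol("weight")  // sample weights are now used when computing metrics
// evaluator.evaluate(predictions)  // predictions must contain the "weight" column
{code}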






[jira] [Assigned] (SPARK-24103) BinaryClassificationEvaluator should use sample weight data

2019-02-25 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-24103:
-

Assignee: Ilya Matiach

> BinaryClassificationEvaluator should use sample weight data
> ---
>
> Key: SPARK-24103
> URL: https://issues.apache.org/jira/browse/SPARK-24103
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Ilya Matiach
>Assignee: Ilya Matiach
>Priority: Major
>  Labels: starter
>
> The LogisticRegression and LinearRegression models support training with a 
> weight column, but the corresponding evaluators do not support computing 
> metrics using those weights. This breaks model selection using CrossValidator.






[jira] [Commented] (SPARK-24335) Dataset.map schema not applied in some cases

2019-02-25 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777379#comment-16777379
 ] 

Jungtaek Lim commented on SPARK-24335:
--

As a workaround, SPARK-26987 will help Java API users explicitly create a Row 
that understands its schema.
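For illustration only (this sketch uses the internal catalyst GenericRowWithSchema class, not whatever API SPARK-26987 ends up exposing): a Row constructed with an attached schema, so getAs by field name works on it.

{code:scala}
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", StringType, nullable = false),
  StructField("f1", LongType, nullable = true)))

val row = new GenericRowWithSchema(Array("id1", 42L), schema)
row.getAs[Long]("f1")  // works because the row carries its schema
{code}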

> Dataset.map schema not applied in some cases
> 
>
> Key: SPARK-24335
> URL: https://issues.apache.org/jira/browse/SPARK-24335
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.3.0
>Reporter: Robert Reid
>Priority: Major
>
> In the following code an {color:#808080}UnsupportedOperationException{color} 
> is thrown in the filter() call just after the Dataset.map() call unless 
> withWatermark() is added between them. The error reports 
> `{color:#808080}fieldIndex on a Row without schema is undefined{color}`.  I 
> expect the map() method to have applied the schema and for it to be 
> accessible in filter().  Without the extra withWatermark() call my debugger 
> reports that the `row` objects in the filter lambda are `GenericRow`.  With 
> the watermark call it reports that they are `GenericRowWithSchema`.
> I should add that I'm new to working with Structured Streaming.  So if I'm 
> overlooking some implied dependency please fill me in.
> I'm encountering this in new code for a new production job. The presented 
> code is distilled down to demonstrate the problem.  While the problem can be 
> worked around simply by adding withWatermark() I'm concerned that this will 
> leave the code in a fragile state.  With this simplified code if this error 
> occurs again it will be easy to identify what change led to the error.  But 
> in the code I'm writing, with this functionality delegated to other classes, 
> it is (and has been) very challenging to identify the cause.
>  
> {code:java}
> public static void main(String[] args) {
> SparkSession sparkSession = 
> SparkSession.builder().master("local").getOrCreate();
> sparkSession.conf().set(
> "spark.sql.streaming.checkpointLocation",
> "hdfs://localhost:9000/search_relevance/checkpoint" // for spark 
> 2.3
> // "spark.sql.streaming.checkpointLocation", "tmp/checkpoint" // 
> for spark 2.1
> );
> StructType inSchema = DataTypes.createStructType(
> new StructField[] {
> DataTypes.createStructField("id", DataTypes.StringType
>   , false),
> DataTypes.createStructField("ts", DataTypes.TimestampType 
>   , false),
> DataTypes.createStructField("f1", DataTypes.LongType  
>   , true)
> }
> );
> Dataset<Row> rawSet = sparkSession.sqlContext().readStream()
> .format("rate")
> .option("rowsPerSecond", 1)
> .load()
> .map(   (MapFunction<Row, Row>) raw -> {
> Object[] fields = new Object[3];
> fields[0] = "id1";
> fields[1] = raw.getAs("timestamp");
> fields[2] = raw.getAs("value");
> return RowFactory.create(fields);
> },
> RowEncoder.apply(inSchema)
> )
> // If withWatermark() is included above the filter() line then 
> this works.  Without it we get:
> //Caused by: java.lang.UnsupportedOperationException: 
> fieldIndex on a Row without schema is undefined.
> // at the row.getAs() call.
> // .withWatermark("ts", "10 seconds")  // <-- This is required 
> for row.getAs("f1") to work ???
> .filter((FilterFunction<Row>) row -> !row.getAs("f1").equals(0L))
> .withWatermark("ts", "10 seconds")
> ;
> StreamingQuery streamingQuery = rawSet
> .select("*")
> .writeStream()
> .format("console")
> .outputMode("append")
> .start();
> try {
> streamingQuery.awaitTermination(30_000);
> } catch (StreamingQueryException e) {
> System.out.println("Caught exception at 'awaitTermination':");
> e.printStackTrace();
> }
> }{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18381) Wrong date conversion between spark and python for dates before 1583

2019-02-25 Thread Nicolas Tilmans (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777362#comment-16777362
 ] 

Nicolas Tilmans commented on SPARK-18381:
-

Any updates on this issue? We are encountering it with the same use case as 
[~nchammas]: a placeholder date from a client is blowing up pyspark pipelines. 
Although a change to the documentation is welcome, a recommendation for how to 
actually handle the error elegantly would also be appreciated. We could remap 
to a different, more compatible date, but that just seems hacky.
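For what it's worth, one hedged sketch of that remapping idea in PySpark is to keep the
raw value as a string and only cast it to a date inside the range that converts cleanly;
the column name and the cutoff below are illustrative, not a recommendation from the
Spark docs:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([("0001-01-01",), ("2012-01-21",)], ["date_string"])

# Zero-padded ISO strings compare lexicographically, so a plain string
# comparison is enough to null out (or remap) placeholder dates before they
# ever hit the Julian/Gregorian conversion path.
safe_date = (F.when(F.col("date_string") >= F.lit("1583-01-01"),
                    F.to_date("date_string"))
              .otherwise(F.lit(None).cast("date")))
df.withColumn("date_cast", safe_date).show()
{code}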

> Wrong date conversion between spark and python for dates before 1583
> 
>
> Key: SPARK-18381
> URL: https://issues.apache.org/jira/browse/SPARK-18381
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Luca Caniparoli
>Priority: Major
>
> Dates before 1583 (Julian/Gregorian calendar transition) are processed 
> incorrectly. 
> * With python udf (datetime.strptime), .show() returns wrong dates but 
> .collect() returns correct dates
> * With pyspark.sql.functions.to_date, .show() shows correct dates but 
> .collect() returns wrong dates. Additionally, collecting '0001-01-01' returns 
> error when collecting dataframe. 
> {code:none}
> from pyspark.sql.types import DateType
> from pyspark.sql.functions import to_date, udf
> from datetime import datetime
> strToDate =  udf (lambda x: datetime.strptime(x, '%Y-%m-%d'), DateType())
> l = [('0002-01-01', 1), ('1581-01-01', 2), ('1582-01-01', 3), ('1583-01-01', 
> 4), ('1584-01-01', 5), ('2012-01-21', 6)]
> l_older = [('0001-01-01', 1)]
> test_df = spark.createDataFrame(l, ["date_string", "number"])
> test_df_older = spark.createDataFrame(l_older, ["date_string", "number"])
> test_df_strptime = test_df.withColumn( "date_cast", 
> strToDate(test_df["date_string"]))
> test_df_todate = test_df.withColumn( "date_cast", 
> to_date(test_df["date_string"]))
> test_df_older_todate = test_df_older.withColumn( "date_cast", 
> to_date(test_df_older["date_string"]))
> test_df_strptime.show()
> test_df_todate.show()
> print test_df_strptime.collect()
> print test_df_todate.collect()
> print test_df_older_todate.collect()
> {code}
> {noformat}
> +---+--+--+
> |date_string|number| date_cast|
> +---+--+--+
> | 0002-01-01| 1|0002-01-03|
> | 1581-01-01| 2|1580-12-22|
> | 1582-01-01| 3|1581-12-22|
> | 1583-01-01| 4|1583-01-01|
> | 1584-01-01| 5|1584-01-01|
> | 2012-01-21| 6|2012-01-21|
> +---+--+--+
> +---+--+--+
> |date_string|number| date_cast|
> +---+--+--+
> | 0002-01-01| 1|0002-01-01|
> | 1581-01-01| 2|1581-01-01|
> | 1582-01-01| 3|1582-01-01|
> | 1583-01-01| 4|1583-01-01|
> | 1584-01-01| 5|1584-01-01|
> | 2012-01-21| 6|2012-01-21|
> +---+--+--+
> [Row(date_string=u'0002-01-01', number=1, date_cast=datetime.date(2, 1, 1)), 
> Row(date_string=u'1581-01-01', number=2, date_cast=datetime.date(1581, 1, 
> 1)), Row(date_string=u'1582-01-01', number=3, date_cast=datetime.date(1582, 
> 1, 1)), Row(date_string=u'1583-01-01', number=4, 
> date_cast=datetime.date(1583, 1, 1)), Row(date_string=u'1584-01-01', 
> number=5, date_cast=datetime.date(1584, 1, 1)), 
> Row(date_string=u'2012-01-21', number=6, date_cast=datetime.date(2012, 1, 
> 21))]
> [Row(date_string=u'0002-01-01', number=1, date_cast=datetime.date(1, 12, 
> 30)), Row(date_string=u'1581-01-01', number=2, date_cast=datetime.date(1581, 
> 1, 11)), Row(date_string=u'1582-01-01', number=3, 
> date_cast=datetime.date(1582, 1, 11)), Row(date_string=u'1583-01-01', 
> number=4, date_cast=datetime.date(1583, 1, 1)), 
> Row(date_string=u'1584-01-01', number=5, date_cast=datetime.date(1584, 1, 
> 1)), Row(date_string=u'2012-01-21', number=6, date_cast=datetime.date(2012, 
> 1, 21))]
> Traceback (most recent call last):
>   File "/tmp/zeppelin_pyspark-6043517212596195478.py", line 267, in 
> raise Exception(traceback.format_exc())
> Exception: Traceback (most recent call last):
>   File "/tmp/zeppelin_pyspark-6043517212596195478.py", line 265, in 
> exec(code)
>   File "", line 15, in 
>   File "/usr/local/spark/python/pyspark/sql/dataframe.py", line 311, in 
> collect
> return list(_load_from_socket(port, 
> BatchedSerializer(PickleSerializer(
>   File "/usr/local/spark/python/pyspark/rdd.py", line 142, in 
> _load_from_socket
> for item in serializer.load_stream(rf):
>   File "/usr/local/spark/python/pyspark/serializers.py", line 139, in 
> load_stream
> yield self._read_with_length(stream)
>   File "/usr/local/spark/python/pyspark/serializers.py", line 164, in 
> _read_with_length
> return self.loads(obj)
>   File 

[jira] [Created] (SPARK-26988) Spark overwrites spark.scheduler.pool if set in configs

2019-02-25 Thread Dave DeCaprio (JIRA)
Dave DeCaprio created SPARK-26988:
-

 Summary: Spark overwrites spark.scheduler.pool if set in configs
 Key: SPARK-26988
 URL: https://issues.apache.org/jira/browse/SPARK-26988
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.4.0
Reporter: Dave DeCaprio


If you set a default spark.scheduler.pool in your configuration when you create 
a SparkSession and then you attempt to override that configuration by calling 
setLocalProperty on a SparkSession, as described in the Spark documentation - 
[https://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools] 
- it won't work.

Spark will go with the original pool name.

I've traced this down to SQLExecution.withSQLConfPropagated, which copies any 
key that starts with "spark" from the session state to the local properties. 
This can end up overwriting the scheduler pool, which is set by 
spark.scheduler.pool.
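A minimal sketch of the pattern described above, following the fair-scheduler
documentation; the pool names and the job are illustrative, and on affected versions the
action may still run in the pool from the config rather than the overridden one:

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .config("spark.scheduler.mode", "FAIR")
         .config("spark.scheduler.pool", "default_pool")
         .getOrCreate())

# The documented way to route jobs submitted from this thread to another pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "override_pool")

# For SQL/DataFrame actions, the "spark."-prefixed session confs propagated by
# SQLExecution.withSQLConfPropagated can clobber the local property set above.
spark.range(1000000).selectExpr("sum(id)").collect()
{code}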



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26987) Add a new method to RowFactory: Row with schema

2019-02-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26987:


Assignee: Apache Spark

> Add a new method to RowFactory: Row with schema
> ---
>
> Key: SPARK-26987
> URL: https://issues.apache.org/jira/browse/SPARK-26987
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> In the Java API, RowFactory is supposed to be the official way to create a Row. 
> There's only the "create()" method, which doesn't attach a schema to the Row, 
> hence the Java API only guarantees "access by index" on a Row, though "access 
> by column name" works in some queries depending on the query plan.
>  
> We could add a "createWithSchema()" method to RowFactory so that end users can 
> consistently access Row fields by column name.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26987) Add a new method to RowFactory: Row with schema

2019-02-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26987:


Assignee: (was: Apache Spark)

> Add a new method to RowFactory: Row with schema
> ---
>
> Key: SPARK-26987
> URL: https://issues.apache.org/jira/browse/SPARK-26987
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> In the Java API, RowFactory is supposed to be the official way to create a Row. 
> There's only the "create()" method, which doesn't attach a schema to the Row, 
> hence the Java API only guarantees "access by index" on a Row, though "access 
> by column name" works in some queries depending on the query plan.
>  
> We could add a "createWithSchema()" method to RowFactory so that end users can 
> consistently access Row fields by column name.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26987) Add a new method to RowFactory: Row with schema

2019-02-25 Thread Jungtaek Lim (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-26987:
-
Summary: Add a new method to RowFactory: Row with schema  (was: Add new 
method to RowFactory: Row with schema)

> Add a new method to RowFactory: Row with schema
> ---
>
> Key: SPARK-26987
> URL: https://issues.apache.org/jira/browse/SPARK-26987
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> In the Java API, RowFactory is supposed to be the official way to create a Row. 
> There's only the "create()" method, which doesn't attach a schema to the Row, 
> hence the Java API only guarantees "access by index" on a Row, though "access 
> by column name" works in some queries depending on the query plan.
>  
> We could add a "createWithSchema()" method to RowFactory so that end users can 
> consistently access Row fields by column name.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26987) Add new method to RowFactory: Row with schema

2019-02-25 Thread Jungtaek Lim (JIRA)
Jungtaek Lim created SPARK-26987:


 Summary: Add new method to RowFactory: Row with schema
 Key: SPARK-26987
 URL: https://issues.apache.org/jira/browse/SPARK-26987
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


In the Java API, RowFactory is supposed to be the official way to create a Row. 
There's only the "create()" method, which doesn't attach a schema to the Row, 
hence the Java API only guarantees "access by index" on a Row, though "access by 
column name" works in some queries depending on the query plan.
 
We could add a "createWithSchema()" method to RowFactory so that end users can 
consistently access Row fields by column name.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26788) Remove SchedulerExtensionService

2019-02-25 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26788.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23839
[https://github.com/apache/spark/pull/23839]

> Remove SchedulerExtensionService
> 
>
> Key: SPARK-26788
> URL: https://issues.apache.org/jira/browse/SPARK-26788
> Project: Spark
>  Issue Type: Task
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 3.0.0
>
>
> This was added in SPARK-11314, but it has a few issues:
> - it's in the YARN module, which is not a public Spark API.
> - because of that it's also YARN specific
> - it's not used by Spark in any way other than providing this as an extension 
> point
> For the latter, it probably makes sense to use listeners instead, and enhance 
> the listener interface if it's lacking.
> Also pinging [~steve_l] since he added that code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26788) Remove SchedulerExtensionService

2019-02-25 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26788:
-

Assignee: Marcelo Vanzin

> Remove SchedulerExtensionService
> 
>
> Key: SPARK-26788
> URL: https://issues.apache.org/jira/browse/SPARK-26788
> Project: Spark
>  Issue Type: Task
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
>
> This was added in SPARK-11314, but it has a few issues:
> - it's in the YARN module, which is not a public Spark API.
> - because of that it's also YARN specific
> - it's not used by Spark in any way other than providing this as an extension 
> point
> For the latter, it probably makes sense to use listeners instead, and enhance 
> the listener interface if it's lacking.
> Also pinging [~steve_l] since he added that code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25035) Replicating disk-stored blocks should avoid memory mapping

2019-02-25 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25035.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23688
[https://github.com/apache/spark/pull/23688]

> Replicating disk-stored blocks should avoid memory mapping
> --
>
> Key: SPARK-25035
> URL: https://issues.apache.org/jira/browse/SPARK-25035
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Imran Rashid
>Assignee: Attila Zsolt Piros
>Priority: Major
>  Labels: memory-analysis
> Fix For: 3.0.0
>
>
> This is a follow-up to SPARK-24296.
> When replicating a disk-cached block, even if we fetch-to-disk, we still 
> memory-map the file, just to copy it to another location.
> Ideally we'd just move the tmp file to the right location.  But even without 
> that, we could read the file as an input stream, instead of memory-mapping 
> the whole thing.  Memory-mapping is particularly a problem when running under 
> yarn, as the OS may believe there is plenty of memory available, meanwhile 
> yarn decides to kill the process for exceeding memory limits.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25035) Replicating disk-stored blocks should avoid memory mapping

2019-02-25 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25035:
--

Assignee: Attila Zsolt Piros

> Replicating disk-stored blocks should avoid memory mapping
> --
>
> Key: SPARK-25035
> URL: https://issues.apache.org/jira/browse/SPARK-25035
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Imran Rashid
>Assignee: Attila Zsolt Piros
>Priority: Major
>  Labels: memory-analysis
>
> This is a follow-up to SPARK-24296.
> When replicating a disk-cached block, even if we fetch-to-disk, we still 
> memory-map the file, just to copy it to another location.
> Ideally we'd just move the tmp file to the right location.  But even without 
> that, we could read the file as an input stream, instead of memory-mapping 
> the whole thing.  Memory-mapping is particularly a problem when running under 
> yarn, as the OS may believe there is plenty of memory available, meanwhile 
> yarn decides to kill the process for exceeding memory limits.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.

2019-02-25 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777240#comment-16777240
 ] 

Jungtaek Lim commented on SPARK-24295:
--

[~alfredo-gimenez-bv]
Unfortunately no, because I don't think the metadata on the file stream sink can 
be reliably purged, as I explained earlier. My PR resolves the issue, but it also 
breaks the internals of the metadata, so it has to introduce an undesirable 
option too.

Btw, there's another concern, SPARK-26411, which you may be interested in 
considering.

> Purge Structured streaming FileStreamSinkLog metadata compact file data.
> 
>
> Key: SPARK-24295
> URL: https://issues.apache.org/jira/browse/SPARK-24295
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Iqbal Singh
>Priority: Major
> Attachments: spark_metadatalog_compaction_perfbug_repro.tar.gz
>
>
> FileStreamSinkLog metadata logs are concatenated into a single compact file 
> after a defined compact interval.
> For long-running jobs, the compact file can grow to tens of GBs, causing 
> slowness while reading data from the FileStreamSinkLog dir, as Spark defaults 
> to the "_spark_metadata" dir for the read.
> We need functionality to purge the compact file data.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data

2019-02-25 Thread Yifei Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777231#comment-16777231
 ] 

Yifei Huang commented on SPARK-25299:
-

Here is an updated doc with the progress since the last update: 
[https://docs.google.com/document/d/1NQW1XgJ6bwktjq5iPyxnvasV9g-XsauiRRayQcGLiik/edit].
 Thanks!

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26986) Add JAXB reference impl to build for Java 9+

2019-02-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26986:


Assignee: Sean Owen  (was: Apache Spark)

> Add JAXB reference impl to build for Java 9+
> 
>
> Key: SPARK-26986
> URL: https://issues.apache.org/jira/browse/SPARK-26986
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
> Environment: Under Java 9+, the Java JAXB implementation isn't 
> accessible (or not shipped?) It leads to errors when running PMML-related 
> tests, as it can't find an implementation. We should add the reference JAXB 
> impl from Glassfish.
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26986) Add JAXB reference impl to build for Java 9+

2019-02-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26986:


Assignee: Apache Spark  (was: Sean Owen)

> Add JAXB reference impl to build for Java 9+
> 
>
> Key: SPARK-26986
> URL: https://issues.apache.org/jira/browse/SPARK-26986
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
> Environment: Under Java 9+, the Java JAXB implementation isn't 
> accessible (or not shipped?) It leads to errors when running PMML-related 
> tests, as it can't find an implementation. We should add the reference JAXB 
> impl from Glassfish.
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer

2019-02-25 Thread Huaxin Gao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777222#comment-16777222
 ] 

Huaxin Gao commented on SPARK-26970:


I will work on this. 

> Can't load PipelineModel that was created in Scala with Python due to missing 
> Interaction transformer
> -
>
> Key: SPARK-26970
> URL: https://issues.apache.org/jira/browse/SPARK-26970
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Andrew Crosby
>Priority: Major
>
> The Interaction transformer 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala]
>  is missing from the set of pyspark feature transformers 
> [https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py]
>  
> This means that it is impossible to create a model that includes an 
> Interaction transformer with pyspark. It also means that attempting to load a 
> PipelineModel created in Scala that includes an Interaction transformer with 
> pyspark fails with the following error:
> {code:java}
> AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
> {code}
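For context, a short sketch of what the missing wrapper looks like in use, assuming a
pyspark.ml.feature.Interaction that mirrors the Scala transformer's inputCols/outputCol
params (i.e. what adding the wrapper would enable); the data is made up:

{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.feature import Interaction, VectorAssembler

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)], ["a", "b", "c"])

# Interaction multiplies every combination of the input columns/vector slots.
assembler = VectorAssembler(inputCols=["b", "c"], outputCol="bc")
interaction = Interaction(inputCols=["a", "bc"], outputCol="a_x_bc")

interaction.transform(assembler.transform(df)).select("a_x_bc").show(truncate=False)
{code}

With such a wrapper registered, PipelineModel.load() on a Scala-saved pipeline
containing an Interaction stage would no longer fail with the AttributeError above.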



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26449) Missing Dataframe.transform API in Python API

2019-02-25 Thread Erik Christiansen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777208#comment-16777208
 ] 

Erik Christiansen commented on SPARK-26449:
---

merged

https://github.com/apache/spark/pull/23877

> Missing Dataframe.transform API in Python API
> -
>
> Key: SPARK-26449
> URL: https://issues.apache.org/jira/browse/SPARK-26449
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Hanan Shteingart
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I would like to chain custom transformations as is suggested in this [blog 
> post|https://medium.com/@mrpowers/chaining-custom-pyspark-transformations-4f38a8c7ae55]
> This will allow to write something like the following:
>  
>  
> {code:java}
>  
> def with_greeting(df):
> return df.withColumn("greeting", lit("hi"))
> def with_something(df, something):
> return df.withColumn("something", lit(something))
> data = [("jose", 1), ("li", 2), ("liz", 3)]
> source_df = spark.createDataFrame(data, ["name", "age"])
> actual_df = (source_df
> .transform(with_greeting)
> .transform(lambda df: with_something(df, "crazy")))
> print(actual_df.show())
> ++---++-+
> |name|age|greeting|something|
> ++---++-+
> |jose|  1|  hi|crazy|
> |  li|  2|  hi|crazy|
> | liz|  3|  hi|crazy|
> ++---++-+
> {code}
> The only thing needed to accomplish this is the following simple method for 
> DataFrame:
> {code:java}
> from pyspark.sql.dataframe import DataFrame 
> def transform(self, f): 
> return f(self) 
> DataFrame.transform = transform
> {code}
> I volunteer to do the pull request if approved (at least the python part)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24417) Build and Run Spark on JDK11

2019-02-25 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777201#comment-16777201
 ] 

Sean Owen commented on SPARK-24417:
---

[~michael.atef] looks like you compiled with Java 11 and are running with Java 
< 11. This isn't specific to Spark; that's what happens whenever you do that. 
Java 11 isn't supported yet; that's what this JIRA is about.

> Build and Run Spark on JDK11
> 
>
> Key: SPARK-24417
> URL: https://issues.apache.org/jira/browse/SPARK-24417
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>
> This is an umbrella JIRA for Apache Spark to support JDK11
> As JDK8 is reaching EOL, and JDK9 and 10 are already end of life, per 
> community discussion, we will skip JDK9 and 10 to support JDK 11 directly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.

2019-02-25 Thread Alfredo Gimenez (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777197#comment-16777197
 ] 

Alfredo Gimenez edited comment on SPARK-24295 at 2/25/19 6:57 PM:
--

We've run into the exact same issue, I uploaded a minimal reproducible example 
showing the continuously growing metadata compaction files. This is especially 
an issue in streaming jobs that rely on checkpointing, as we cannot purge 
metadata files and restart–the checkpointing mechanism depends on the metadata. 
A current workaround we have is to manually grab the last checkpoint offsets, 
purge both checkpoints and metadata, and set the "startingOffsets" to the 
latest offsets that we grabbed. This is obviously not ideal, as it relies on 
the current serialized data structure for the checkpoints, which can change 
with spark versions. It also introduces the possibility of losing checkpoint 
data if a spark job fails before creating a new checkpoint file.

[~kabhwan] taking a look at your PR now, thanks!

Is there another reliable workaround for this setup?


was (Author: alfredo-gimenez-bv):
We've run into the exact same issue, I uploaded a minimal reproducible example 
showing the continuously growing metadata compaction files. This is especially 
an issue in streaming jobs that rely on checkpointing, as we cannot purge 
metadata files and restart–the checkpointing mechanism depends on the metadata. 
A current workaround we have is to manually grab the last checkpoint offsets, 
purge both checkpoints and metadata, and set the "startingOffsets" to the 
latest offsets that we grabbed. This is obviously not ideal, as it relies on 
the current serialized data structure for the checkpoints, which can change 
with spark versions. It also introduces the possibility of losing checkpoint 
data if a spark job fails before creating a new checkpoint file.

[~kabhwan] can you point us to your PR? 

Is there another reliable workaround for this setup?

> Purge Structured streaming FileStreamSinkLog metadata compact file data.
> 
>
> Key: SPARK-24295
> URL: https://issues.apache.org/jira/browse/SPARK-24295
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Iqbal Singh
>Priority: Major
> Attachments: spark_metadatalog_compaction_perfbug_repro.tar.gz
>
>
> FileStreamSinkLog metadata logs are concatenated into a single compact file 
> after a defined compact interval.
> For long-running jobs, the compact file can grow to tens of GBs, causing 
> slowness while reading data from the FileStreamSinkLog dir, as Spark defaults 
> to the "_spark_metadata" dir for the read.
> We need functionality to purge the compact file data.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.

2019-02-25 Thread Alfredo Gimenez (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777197#comment-16777197
 ] 

Alfredo Gimenez commented on SPARK-24295:
-

We've run into the exact same issue, I uploaded a minimal reproducible example 
showing the continuously growing metadata compaction files. This is especially 
an issue in streaming jobs that rely on checkpointing, as we cannot purge 
metadata files and restart–the checkpointing mechanism depends on the metadata. 
A current workaround we have is to manually grab the last checkpoint offsets, 
purge both checkpoints and metadata, and set the "startingOffsets" to the 
latest offsets that we grabbed. This is obviously not ideal, as it relies on 
the current serialized data structure for the checkpoints, which can change 
with spark versions. It also introduces the possibility of losing checkpoint 
data if a spark job fails before creating a new checkpoint file.

[~kabhwan] can you point us to your PR? 

Is there another reliable workaround for this setup?
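(A rough, hedged sketch of that manual workaround for a single Kafka source with a
local checkpoint directory; it depends on the internal offset log layout, exactly the
fragility described above, and the paths are made up:)

{code:python}
import glob
import json

checkpoint_dir = "/tmp/my-query-checkpoint"  # hypothetical local path

# The offset log keeps one file per batch under <checkpoint>/offsets; for a
# single Kafka source the last JSON line of the newest file is the offsets map.
offset_files = sorted((p for p in glob.glob(checkpoint_dir + "/offsets/*")
                       if p.rsplit("/", 1)[-1].isdigit()),
                      key=lambda p: int(p.rsplit("/", 1)[-1]))
with open(offset_files[-1]) as f:
    starting_offsets = f.read().strip().splitlines()[-1]
json.loads(starting_offsets)  # sanity-check that it is valid JSON

# After deleting the checkpoint and _spark_metadata directories out of band,
# restart the query with e.g.
#   spark.readStream.format("kafka").option("startingOffsets", starting_offsets)
{code}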

> Purge Structured streaming FileStreamSinkLog metadata compact file data.
> 
>
> Key: SPARK-24295
> URL: https://issues.apache.org/jira/browse/SPARK-24295
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Iqbal Singh
>Priority: Major
> Attachments: spark_metadatalog_compaction_perfbug_repro.tar.gz
>
>
> FileStreamSinkLog metadata logs are concatenated into a single compact file 
> after a defined compact interval.
> For long-running jobs, the compact file can grow to tens of GBs, causing 
> slowness while reading data from the FileStreamSinkLog dir, as Spark defaults 
> to the "_spark_metadata" dir for the read.
> We need functionality to purge the compact file data.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24417) Build and Run Spark on JDK11

2019-02-25 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777198#comment-16777198
 ] 

Sean Owen commented on SPARK-24417:
---

[~mlebihan] How familiar are you with the JDK release cycle at the moment? Java 
8 was only publicly EOL last year, and remains widely deployed. Part of the 
complication is we need to keep it working on Java 8 even in Spark 3. 
Anecdotally, I think only a minority of production systems have moved off Java 
8 yet, but will increasingly do so this year and beyond.

Spark 3 will support Java 11 if we can resolve all the things that broke from 
Java 9. The module system and various behavior changes, coupled with the need 
for _Scala_ to support these things, makes this hard. In 2.4 we had to deal 
with major changes for Scala 2.12 support; these changes are pretty big and 
won't be back-ported to 2.4.x. Java 11 is now next.

Spark 3 is due sometime in the middle of this year.

I am not sure what your third question means. Java 17 is the next LTS release 
if I understand correctly, and so far the Java 12/13 changes don't look like 
big ones. Java 9 was a big, breaking change, and Java 11 is the first LTS 
release after 9. Therefore it's the thing to support next. I don't see a reason 
it won't work on 12, 13, etc.

These things get done by doing them; please try to resolve the remaining issues 
with us if you want this to get done.

> Build and Run Spark on JDK11
> 
>
> Key: SPARK-24417
> URL: https://issues.apache.org/jira/browse/SPARK-24417
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>
> This is an umbrella JIRA for Apache Spark to support JDK11
> As JDK8 is reaching EOL, and JDK9 and 10 are already end of life, per 
> community discussion, we will skip JDK9 and 10 to support JDK 11 directly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.

2019-02-25 Thread Alfredo Gimenez (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alfredo Gimenez updated SPARK-24295:

Attachment: spark_metadatalog_compaction_perfbug_repro.tar.gz

> Purge Structured streaming FileStreamSinkLog metadata compact file data.
> 
>
> Key: SPARK-24295
> URL: https://issues.apache.org/jira/browse/SPARK-24295
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Iqbal Singh
>Priority: Major
> Attachments: spark_metadatalog_compaction_perfbug_repro.tar.gz
>
>
> FileStreamSinkLog metadata logs are concatenated into a single compact file 
> after a defined compact interval.
> For long-running jobs, the compact file can grow to tens of GBs, causing 
> slowness while reading data from the FileStreamSinkLog dir, as Spark defaults 
> to the "_spark_metadata" dir for the read.
> We need functionality to purge the compact file data.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26986) Add JAXB reference impl to build for Java 9+

2019-02-25 Thread Sean Owen (JIRA)
Sean Owen created SPARK-26986:
-

 Summary: Add JAXB reference impl to build for Java 9+
 Key: SPARK-26986
 URL: https://issues.apache.org/jira/browse/SPARK-26986
 Project: Spark
  Issue Type: Sub-task
  Components: ML, Spark Core
Affects Versions: 3.0.0
 Environment: Under Java 9+, the Java JAXB implementation isn't 
accessible (or not shipped?) It leads to errors when running PMML-related 
tests, as it can't find an implementation. We should add the reference JAXB 
impl from Glassfish.
Reporter: Sean Owen
Assignee: Sean Owen






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26839) on JDK11, IsolatedClientLoader must be able to load java.sql classes

2019-02-25 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777185#comment-16777185
 ] 

Sean Owen commented on SPARK-26839:
---

Is this the same error? I'm seeing this while running Hive tests on Java 11:
{code}
[ERROR] 
saveExternalTableWithSchemaAndQueryIt(org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite)
  Time elapsed: 0.021 s  <<< ERROR!
java.lang.IllegalArgumentException: Unable to locate hive jars to connect to 
metastore. Please set spark.sql.hive.metastore.jars.
at 
org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite.tearDown(JavaMetastoreDataSourcesSuite.java:92)

[INFO] Running org.apache.spark.sql.hive.JavaDataFrameSuite
15:55:50.365 WARN org.apache.spark.sql.execution.command.DropTableCommand: 
java.lang.IllegalArgumentException: Unable to locate hive jars to connect to 
metastore. Please set spark.sql.hive.metastore.jars.
java.lang.IllegalArgumentException: Unable to locate hive jars to connect to 
metastore. Please set spark.sql.hive.metastore.jars.
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:335)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:295)
at 
org.apache.spark.sql.hive.test.TestHiveExternalCatalog.$anonfun$client$1(TestHive.scala:85)
at scala.Option.getOrElse(Option.scala:138)
at 
org.apache.spark.sql.hive.test.TestHiveExternalCatalog.client$lzycompute(TestHive.scala:85)
at 
org.apache.spark.sql.hive.test.TestHiveExternalCatalog.client(TestHive.scala:83)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:217)
at 
scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:217)
at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.databaseExists(ExternalCatalogWithListener.scala:69)
...
{code}

> on JDK11, IsolatedClientLoader must be able to load java.sql classes
> 
>
> Key: SPARK-26839
> URL: https://issues.apache.org/jira/browse/SPARK-26839
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> This might be very specific to my fork & a kind of weird system setup I'm 
> working on, I haven't completely confirmed yet, but I wanted to report it 
> anyway in case anybody else sees this.
> When I try to do anything which touches the metastore on java11, I 
> immediately get errors from IsolatedClientLoader that it can't load anything 
> in java.sql.  eg.
> {noformat}
> scala> spark.sql("show tables").show()
> java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: 
> java/sql/SQLTransientException when creating Hive client using classpath: 
> file:/home/systest/jdk-11.0.2/, ...
> ...
> Caused by: java.lang.ClassNotFoundException: java.sql.SQLTransientException
>   at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:230)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:219)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
> {noformat}
> After a bit of debugging, I also discovered that the {{rootClassLoader}} is 
> {{null}} in {{IsolatedClientLoader}}.  I think this would work if either 
> {{rootClassLoader}} could load those classes, or if {{isShared()}} was 
> changed to allow any class starting with "java."  (I'm not sure why it only 
> allows "java.lang" and "java.net" currently.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26764) [SPIP] Spark Relational Cache

2019-02-25 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16777060#comment-16777060
 ] 

Ruslan Dautkhanov commented on SPARK-26764:
---

That seems to be closely related to Hive materialized views - implemented in 
Hive 3.2
HIVE-10459 



> [SPIP] Spark Relational Cache
> -
>
> Key: SPARK-26764
> URL: https://issues.apache.org/jira/browse/SPARK-26764
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Adrian Wang
>Priority: Major
> Attachments: Relational+Cache+SPIP.pdf
>
>
> In modern database systems, relational cache is a common technology to boost 
> ad-hoc queries. While Spark provides cache natively, Spark SQL should be able 
> to utilize the relationship between relations to boost all possible queries. 
> In this SPIP, we will make Spark be able to utilize all defined cached 
> relations if possible, without explicit substitution in user query, as well 
> as keep some user defined cache available in different sessions. Materialized 
> views in many database systems provide similar function.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26968) option("quoteMode", "NON_NUMERIC") have no effect on a CSV generation

2019-02-25 Thread M. Le Bihan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776903#comment-16776903
 ] 

M. Le Bihan edited comment on SPARK-26968 at 2/25/19 2:24 PM:
--

It's still a problem: I see no way with Univocity to obtain the result I expect, 
which is string values surrounded by quotes, but numeric values left unquoted.

Otherwise, a classic import of that CSV into Excel or OpenCalc cannot easily 
apply its default type conversions.

 
{code:java}
"codeCommuneCR","nomCommuneCR","populationCR","resultatComptable"
"03142","LENAX",267,43{code}
This issue can be considered a regression if Univocity is unable to do it, 
because it was possible before, and the issue should only be closed when this 
result can be reached again.

 

Please don't close this issue too early.

 

P.S.: Also, I don't understand why databricks would keep the previous CSV 
system, as shown here on the master branch [on line 504 of this unit 
test|https://github.com/databricks/spark-csv/blob/master/src/test/scala/com/databricks/spark/csv/CsvSuite.scala]
 which still exercises and checks NON_NUMERIC specifically, while it has been 
replaced by _Univocity_ in spark_core or spark_sql without checking that it 
keeps the ability to give all the same results as before.
checking that it keeps abilities to give all the same results than before ?


was (Author: mlebihan):
It's still a problem, 
 I see no equivalent with Univocity to obtain the result I expect, which is  :

String values surrounded by quotes
 But the numeric values, not.

Else, the classic importation of that CSV in an Excel or OpenCalc program 
cannot easily do default conversions.

 
{code:java}
"codeCommuneCR","nomCommuneCR","populationCR","resultatComptable"
"03142","LENAX",267,43{code}
This issue can be set as a regression if Univocity is unable to do it. Because 
before, it was possible. And the issue will be closed when this result could be 
reached again.

 

Don't close this issue too early please.

 

P.S. : Adding to that, I don't understand why databricks would keep previous 
CSV system, as it is shown here on master branch :

[https://github.com/databricks/spark-csv/blob/master/src/test/scala/com/databricks/spark/csv/CsvSuite.scala|https://github.com/databricks/spark-csv/blob/master/src/test/scala/com/databricks/spark/csv/CsvSuite.scalahttps://github.com/databricks/spark-csv/blob/master/src/test/scala/com/databricks/spark/csv/CsvSuite.scala]

with the unit test on line 504,

and have been exchanged with _Univocity_ in spark_core or spark_sql, without 
checking that it keeps abilities to give all the same results than before ?

> option("quoteMode", "NON_NUMERIC") have no effect on a CSV generation
> -
>
> Key: SPARK-26968
> URL: https://issues.apache.org/jira/browse/SPARK-26968
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: M. Le Bihan
>Priority: Minor
>
> I have a CSV to write that has that schema :
> {code:java}
> StructType s = schema.add("codeCommuneCR", StringType, false);
> s = s.add("nomCommuneCR", StringType, false);
> s = s.add("populationCR", IntegerType, false);
> s = s.add("resultatComptable", IntegerType, false);{code}
> If I don't provide an option "_quoteMode_" or even if I set it to 
> {{NON_NUMERIC}}, this way :
> {code:java}
> ds.coalesce(1).write().mode(SaveMode.Overwrite) .option("header", "true") 
> .option("quoteMode", "NON_NUMERIC").option("quote", "\"") 
> .csv("./target/out_200071470.csv");{code}
> the CSV written by {{Spark}} is this one :
> {code:java}
> codeCommuneCR,nomCommuneCR,populationCR,resultatComptable
> 03142,LENAX,267,43{code}
> If I set an option "_quoteAll_" instead, like that :
> {code:java}
> ds.coalesce(1).write().mode(SaveMode.Overwrite) .option("header", "true") 
> .option("quoteAll", true).option("quote", "\"") 
> .csv("./target/out_200071470.csv");{code}
> it generates :
> {code:java}
> "codeCommuneCR","nomCommuneCR","populationCR","resultatComptable" 
> "03142","LENAX","267","43"{code}
> It seems that the {{.option("quoteMode", "NON_NUMERIC")}} is broken. It 
> should generate:
>  
> {code:java}
> "codeCommuneCR","nomCommuneCR","populationCR","resultatComptable"
> "03142","LENAX",267,43
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26968) option("quoteMode", "NON_NUMERIC") have no effect on a CSV generation

2019-02-25 Thread M. Le Bihan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776903#comment-16776903
 ] 

M. Le Bihan edited comment on SPARK-26968 at 2/25/19 2:22 PM:
--

It's still a problem, 
 I see no equivalent with Univocity to obtain the result I expect, which is  :

String values surrounded by quotes
 But the numeric values, not.

Else, the classic importation of that CSV in an Excel or OpenCalc program 
cannot easily do default conversions.

 
{code:java}
"codeCommuneCR","nomCommuneCR","populationCR","resultatComptable"
"03142","LENAX",267,43{code}
This issue can be set as a regression if Univocity is unable to do it. Because 
before, it was possible. And the issue will be closed when this result could be 
reached again.

 

Don't close this issue too early please.

 

P.S. : Adding to that, I don't understand why databricks would keep previous 
CSV system, as it is shown here on master branch :

[https://github.com/databricks/spark-csv/blob/master/src/test/scala/com/databricks/spark/csv/CsvSuite.scala|https://github.com/databricks/spark-csv/blob/master/src/test/scala/com/databricks/spark/csv/CsvSuite.scalahttps://github.com/databricks/spark-csv/blob/master/src/test/scala/com/databricks/spark/csv/CsvSuite.scala]

with the unit test on line 504,

and have been exchanged with _Univocity_ in spark_core or spark_sql, without 
checking that it keeps abilities to give all the same results than before ?


was (Author: mlebihan):
It's still a problem, 
 I see no equivalent with Univocity to obtain the result I expect, which is  :

String values surrounded by quotes
 But the numeric values, not.

Else, the classic importation of that CSV in an Excel or OpenCalc program 
cannot easily do default conversions.

 
{code:java}
"codeCommuneCR","nomCommuneCR","populationCR","resultatComptable"
"03142","LENAX",267,43{code}
This issue can be set as a regression if Univocity is unable to do it. Because 
before, it was possible. And the issue will be closed when this result could be 
reached again.

 

Don't close this issue too early please.

 

P.S. : Adding to that, I don't understand why databricks would keep previous 
CSV system, as it is shown here on master branch :

[https://github.com/databricks/spark-csv/blob/master/src/test/scala/com/databricks/spark/csv/CsvSuite.scalahttps://github.com/databricks/spark-csv/blob/master/src/test/scala/com/databricks/spark/csv/CsvSuite.scala]

with the unit test on line 504,

and have been exchanged with _Univocity_ in spark_core or spark_sql, without 
checking that it keeps abilities to give all the same results than before ?

> option("quoteMode", "NON_NUMERIC") have no effect on a CSV generation
> -
>
> Key: SPARK-26968
> URL: https://issues.apache.org/jira/browse/SPARK-26968
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: M. Le Bihan
>Priority: Minor
>
> I have a CSV to write that has that schema :
> {code:java}
> StructType s = schema.add("codeCommuneCR", StringType, false);
> s = s.add("nomCommuneCR", StringType, false);
> s = s.add("populationCR", IntegerType, false);
> s = s.add("resultatComptable", IntegerType, false);{code}
> If I don't provide an option "_quoteMode_" or even if I set it to 
> {{NON_NUMERIC}}, this way :
> {code:java}
> ds.coalesce(1).write().mode(SaveMode.Overwrite) .option("header", "true") 
> .option("quoteMode", "NON_NUMERIC").option("quote", "\"") 
> .csv("./target/out_200071470.csv");{code}
> the CSV written by {{Spark}} is this one :
> {code:java}
> codeCommuneCR,nomCommuneCR,populationCR,resultatComptable
> 03142,LENAX,267,43{code}
> If I set an option "_quoteAll_" instead, like that :
> {code:java}
> ds.coalesce(1).write().mode(SaveMode.Overwrite) .option("header", "true") 
> .option("quoteAll", true).option("quote", "\"") 
> .csv("./target/out_200071470.csv");{code}
> it generates :
> {code:java}
> "codeCommuneCR","nomCommuneCR","populationCR","resultatComptable" 
> "03142","LENAX","267","43"{code}
> It seems that the {{.option("quoteMode", "NON_NUMERIC")}} is broken. It 
> should generate:
>  
> {code:java}
> "codeCommuneCR","nomCommuneCR","populationCR","resultatComptable"
> "03142","LENAX",267,43
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26968) option("quoteMode", "NON_NUMERIC") have no effect on a CSV generation

2019-02-25 Thread M. Le Bihan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776903#comment-16776903
 ] 

M. Le Bihan edited comment on SPARK-26968 at 2/25/19 2:21 PM:
--

It's still a problem, 
 I see no equivalent with Univocity to obtain the result I expect, which is  :

String values surrounded by quotes
 But the numeric values, not.

Else, the classic importation of that CSV in an Excel or OpenCalc program 
cannot easily do default conversions.

 
{code:java}
"codeCommuneCR","nomCommuneCR","populationCR","resultatComptable"
"03142","LENAX",267,43{code}
This issue can be set as a regression if Univocity is unable to do it. Because 
before, it was possible. And the issue will be closed when this result could be 
reached again.

 

Don't close this issue too early please.

 

P.S. : Adding to that, I don't understand why databricks would keep previous 
CSV system, as it is shown here on master branch :

[https://github.com/databricks/spark-csv/blob/master/src/test/scala/com/databricks/spark/csv/CsvSuite.scalahttps://github.com/databricks/spark-csv/blob/master/src/test/scala/com/databricks/spark/csv/CsvSuite.scala]

with the unit test on line 504,

and have been exchanged with _Univocity_ in spark_core or spark_sql, without 
checking that it keeps abilities to give all the same results than before ?


was (Author: mlebihan):
It's still a problem, 
 I see no equivalent with Univocity to obtain the result I expect, which is  :

String values surrounded by quotes
 But the numeric values, not.

Else, the classic importation of that CSV in an Excel or OpenCalc program 
cannot easily do default conversions.

 
{code:java}
"codeCommuneCR","nomCommuneCR","populationCR","resultatComptable"
"03142","LENAX",267,43{code}

 This issue can be set as a regression if Univocity is unable to do it. Because 
before, it was possible. And the issue will be closed when this result could be 
reached again.

 

Don't close this issue too early please.

> option("quoteMode", "NON_NUMERIC") have no effect on a CSV generation
> -
>
> Key: SPARK-26968
> URL: https://issues.apache.org/jira/browse/SPARK-26968
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: M. Le Bihan
>Priority: Minor
>
> I have a CSV to write that has that schema :
> {code:java}
> StructType s = schema.add("codeCommuneCR", StringType, false);
> s = s.add("nomCommuneCR", StringType, false);
> s = s.add("populationCR", IntegerType, false);
> s = s.add("resultatComptable", IntegerType, false);{code}
> If I don't provide an option "_quoteMode_" or even if I set it to 
> {{NON_NUMERIC}}, this way :
> {code:java}
> ds.coalesce(1).write().mode(SaveMode.Overwrite) .option("header", "true") 
> .option("quoteMode", "NON_NUMERIC").option("quote", "\"") 
> .csv("./target/out_200071470.csv");{code}
> the CSV written by {{Spark}} is this one :
> {code:java}
> codeCommuneCR,nomCommuneCR,populationCR,resultatComptable
> 03142,LENAX,267,43{code}
> If I set an option "_quoteAll_" instead, like that :
> {code:java}
> ds.coalesce(1).write().mode(SaveMode.Overwrite) .option("header", "true") 
> .option("quoteAll", true).option("quote", "\"") 
> .csv("./target/out_200071470.csv");{code}
> it generates :
> {code:java}
> "codeCommuneCR","nomCommuneCR","populationCR","resultatComptable" 
> "03142","LENAX","267","43"{code}
> It seems that the {{.option("quoteMode", "NON_NUMERIC")}} is broken. It 
> should generate:
>  
> {code:java}
> "codeCommuneCR","nomCommuneCR","populationCR","resultatComptable"
> "03142","LENAX",267,43
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26968) option("quoteMode", "NON_NUMERIC") have no effect on a CSV generation

2019-02-25 Thread M. Le Bihan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776903#comment-16776903
 ] 

M. Le Bihan edited comment on SPARK-26968 at 2/25/19 2:01 PM:
--

It's still a problem, 
 I see no equivalent with Univocity to obtain the result I expect, which is  :

String values surrounded by quotes
 But the numeric values, not.

Else, the classic importation of that CSV in an Excel or OpenCalc program 
cannot easily do default conversions.
"codeCommuneCR","nomCommuneCR","populationCR","resultatComptable""03142","LENAX",267,43
This issue can be set as a regression if Univocity is unable to do it. Because 
before, it was possible. And the issue will be closed when this result could be 
reached again.

 

Don't close this issue too early please.


was (Author: mlebihan):
It's still a problem, 
I see no equivalent with Univocity to obtain the result I expect, which is  :

String values surrounded by quotes
But the numeric values, not.

Else, the classic importation of that CSV in an Excel or OpenCalc program 
cannot easily do default conversions.

> option("quoteMode", "NON_NUMERIC") have no effect on a CSV generation
> -
>
> Key: SPARK-26968
> URL: https://issues.apache.org/jira/browse/SPARK-26968
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: M. Le Bihan
>Priority: Minor
>
> I have a CSV to write that has that schema :
> {code:java}
> StructType s = schema.add("codeCommuneCR", StringType, false);
> s = s.add("nomCommuneCR", StringType, false);
> s = s.add("populationCR", IntegerType, false);
> s = s.add("resultatComptable", IntegerType, false);{code}
> If I don't provide an option "_quoteMode_" or even if I set it to 
> {{NON_NUMERIC}}, this way :
> {code:java}
> ds.coalesce(1).write().mode(SaveMode.Overwrite) .option("header", "true") 
> .option("quoteMode", "NON_NUMERIC").option("quote", "\"") 
> .csv("./target/out_200071470.csv");{code}
> the CSV written by {{Spark}} is this one :
> {code:java}
> codeCommuneCR,nomCommuneCR,populationCR,resultatComptable
> 03142,LENAX,267,43{code}
> If I set an option "_quoteAll_" instead, like that :
> {code:java}
> ds.coalesce(1).write().mode(SaveMode.Overwrite) .option("header", "true") 
> .option("quoteAll", true).option("quote", "\"") 
> .csv("./target/out_200071470.csv");{code}
> it generates :
> {code:java}
> "codeCommuneCR","nomCommuneCR","populationCR","resultatComptable" 
> "03142","LENAX","267","43"{code}
> It seems that the {{.option("quoteMode", "NON_NUMERIC")}} is broken. It 
> should generate:
>  
> {code:java}
> "codeCommuneCR","nomCommuneCR","populationCR","resultatComptable"
> "03142","LENAX",267,43
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26968) option("quoteMode", "NON_NUMERIC") have no effect on a CSV generation

2019-02-25 Thread M. Le Bihan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776903#comment-16776903
 ] 

M. Le Bihan edited comment on SPARK-26968 at 2/25/19 2:01 PM:
--

It's still a problem, 
 I see no equivalent with Univocity to obtain the result I expect, which is  :

String values surrounded by quotes
 But the numeric values, not.

Else, the classic importation of that CSV in an Excel or OpenCalc program 
cannot easily do default conversions.

 
{code:java}
"codeCommuneCR","nomCommuneCR","populationCR","resultatComptable"
"03142","LENAX",267,43{code}

 This issue can be set as a regression if Univocity is unable to do it. Because 
before, it was possible. And the issue will be closed when this result could be 
reached again.

 

Don't close this issue too early please.


was (Author: mlebihan):
It's still a problem, 
 I see no equivalent with Univocity to obtain the result I expect, which is  :

String values surrounded by quotes
 But the numeric values, not.

Else, the classic importation of that CSV in an Excel or OpenCalc program 
cannot easily do default conversions.
"codeCommuneCR","nomCommuneCR","populationCR","resultatComptable""03142","LENAX",267,43
This issue can be set as a regression if Univocity is unable to do it. Because 
before, it was possible. And the issue will be closed when this result could be 
reached again.

 

Don't close this issue too early please.

> option("quoteMode", "NON_NUMERIC") have no effect on a CSV generation
> -
>
> Key: SPARK-26968
> URL: https://issues.apache.org/jira/browse/SPARK-26968
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: M. Le Bihan
>Priority: Minor
>
> I have a CSV to write that has that schema :
> {code:java}
> StructType s = schema.add("codeCommuneCR", StringType, false);
> s = s.add("nomCommuneCR", StringType, false);
> s = s.add("populationCR", IntegerType, false);
> s = s.add("resultatComptable", IntegerType, false);{code}
> If I don't provide an option "_quoteMode_" or even if I set it to 
> {{NON_NUMERIC}}, this way :
> {code:java}
> ds.coalesce(1).write().mode(SaveMode.Overwrite) .option("header", "true") 
> .option("quoteMode", "NON_NUMERIC").option("quote", "\"") 
> .csv("./target/out_200071470.csv");{code}
> the CSV written by {{Spark}} is this one :
> {code:java}
> codeCommuneCR,nomCommuneCR,populationCR,resultatComptable
> 03142,LENAX,267,43{code}
> If I set an option "_quoteAll_" instead, like that :
> {code:java}
> ds.coalesce(1).write().mode(SaveMode.Overwrite) .option("header", "true") 
> .option("quoteAll", true).option("quote", "\"") 
> .csv("./target/out_200071470.csv");{code}
> it generates :
> {code:java}
> "codeCommuneCR","nomCommuneCR","populationCR","resultatComptable" 
> "03142","LENAX","267","43"{code}
> It seems that the {{.option("quoteMode", "NON_NUMERIC")}} is broken. It 
> should generate:
>  
> {code:java}
> "codeCommuneCR","nomCommuneCR","populationCR","resultatComptable"
> "03142","LENAX",267,43
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-26968) option("quoteMode", "NON_NUMERIC") have no effect on a CSV generation

2019-02-25 Thread M. Le Bihan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

M. Le Bihan reopened SPARK-26968:
-

It's still a problem. 
I see no equivalent with Univocity to obtain the result I expect, which is:

String values surrounded by quotes,
but not the numeric values.

Otherwise, a classic import of that CSV into an Excel or OpenCalc program 
cannot easily apply its default conversions.

> option("quoteMode", "NON_NUMERIC") have no effect on a CSV generation
> -
>
> Key: SPARK-26968
> URL: https://issues.apache.org/jira/browse/SPARK-26968
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: M. Le Bihan
>Priority: Minor
>
> I have a CSV to write that has that schema :
> {code:java}
> StructType s = schema.add("codeCommuneCR", StringType, false);
> s = s.add("nomCommuneCR", StringType, false);
> s = s.add("populationCR", IntegerType, false);
> s = s.add("resultatComptable", IntegerType, false);{code}
> If I don't provide an option "_quoteMode_" or even if I set it to 
> {{NON_NUMERIC}}, this way :
> {code:java}
> ds.coalesce(1).write().mode(SaveMode.Overwrite) .option("header", "true") 
> .option("quoteMode", "NON_NUMERIC").option("quote", "\"") 
> .csv("./target/out_200071470.csv");{code}
> the CSV written by {{Spark}} is this one :
> {code:java}
> codeCommuneCR,nomCommuneCR,populationCR,resultatComptable
> 03142,LENAX,267,43{code}
> If I set an option "_quoteAll_" instead, like that :
> {code:java}
> ds.coalesce(1).write().mode(SaveMode.Overwrite) .option("header", "true") 
> .option("quoteAll", true).option("quote", "\"") 
> .csv("./target/out_200071470.csv");{code}
> it generates :
> {code:java}
> "codeCommuneCR","nomCommuneCR","populationCR","resultatComptable" 
> "03142","LENAX","267","43"{code}
> It seems that the {{.option("quoteMode", "NON_NUMERIC")}} is broken. It 
> should generate:
>  
> {code:java}
> "codeCommuneCR","nomCommuneCR","populationCR","resultatComptable"
> "03142","LENAX",267,43
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26523) Getting this error while reading from kinesis :- Could not read until the end sequence number of the range: SequenceNumberRange

2019-02-25 Thread Alexey Romanenko (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776885#comment-16776885
 ] 

Alexey Romanenko commented on SPARK-26523:
--

[~dchinu97] Did you manage to reproduce this issue? We have seen a similar one 
but, unfortunately, we struggle to reproduce it locally.  

> Getting this error while reading from kinesis :- Could not read until the end 
> sequence number of the range: SequenceNumberRange
> ---
>
> Key: SPARK-26523
> URL: https://issues.apache.org/jira/browse/SPARK-26523
> Project: Spark
>  Issue Type: Brainstorming
>  Components: DStreams, Spark Submit, Structured Streaming
>Affects Versions: 2.4.0
>Reporter: CHIRAG YADAV
>Priority: Major
>
> I am using Spark to read data from a Kinesis stream, and after reading data for 
> some time I get this error: ERROR Executor: Exception in task 74.0 in stage 
> 52.0 (TID 339) org.apache.spark.SparkException: Could not read until the end 
> sequence number of the range: 
> SequenceNumberRange(godel-logs,shardId-0007,49591040259365283625183097566179815847537156031957172338,49591040259365283625183097600068424422974441881954418802,4517)
>  
> Can someone please tell me why I am getting this error and how to resolve it?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24335) Dataset.map schema not applied in some cases

2019-02-25 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776867#comment-16776867
 ] 

Jungtaek Lim commented on SPARK-24335:
--

FYI: 
{code}
return new GenericRowWithSchema(fields, inSchema);
{code}

instead of 

{code}
return RowFactory.create(fields); 
{code}

would mitigate the issue, but I agree the behavior looks magical to others and 
definitely not intuitive when "withWatermark" impacts availability of fields.
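
A sketch of that mitigation, applied to the Java code in the description below ({{inSchema}} is the same StructType built there; {{GenericRowWithSchema}} lives in an internal catalyst package, so treat this as a workaround rather than a stable public API):

{code:java}
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

StructType inSchema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("id", DataTypes.StringType, false),
    DataTypes.createStructField("ts", DataTypes.TimestampType, false),
    DataTypes.createStructField("f1", DataTypes.LongType, true)
});

// Drop-in replacement for the lambda passed to .map(..., RowEncoder.apply(inSchema))
// below: returning a GenericRowWithSchema keeps the schema attached to each row, so
// row.getAs("f1") in the following filter() works without an intervening
// withWatermark() call.
MapFunction<Row, Row> mapWithSchema = raw -> {
    Object[] fields = new Object[3];
    fields[0] = "id1";
    fields[1] = raw.getAs("timestamp");
    fields[2] = raw.getAs("value");
    return new GenericRowWithSchema(fields, inSchema);
};
{code}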

> Dataset.map schema not applied in some cases
> 
>
> Key: SPARK-24335
> URL: https://issues.apache.org/jira/browse/SPARK-24335
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.3.0
>Reporter: Robert Reid
>Priority: Major
>
> In the following code an {color:#808080}UnsupportedOperationException{color} 
> is thrown in the filter() call just after the Dataset.map() call unless 
> withWatermark() is added between them. The error reports 
> `{color:#808080}fieldIndex on a Row without schema is undefined{color}`.  I 
> expect the map() method to have applied the schema and for it to be 
> accessible in filter().  Without the extra withWatermark() call my debugger 
> reports that the `row` objects in the filter lambda are `GenericRow`.  With 
> the watermark call it reports that they are `GenericRowWithSchema`.
> I should add that I'm new to working with Structured Streaming.  So if I'm 
> overlooking some implied dependency please fill me in.
> I'm encountering this in new code for a new production job. The presented 
> code is distilled down to demonstrate the problem.  While the problem can be 
> worked around simply by adding withWatermark() I'm concerned that this will 
> leave the code in a fragile state.  With this simplified code if this error 
> occurs again it will be easy to identify what change led to the error.  But 
> in the code I'm writing, with this functionality delegated to other classes, 
> it is (and has been) very challenging to identify the cause.
>  
> {code:java}
> public static void main(String[] args) {
> SparkSession sparkSession = 
> SparkSession.builder().master("local").getOrCreate();
> sparkSession.conf().set(
> "spark.sql.streaming.checkpointLocation",
> "hdfs://localhost:9000/search_relevance/checkpoint" // for spark 
> 2.3
> // "spark.sql.streaming.checkpointLocation", "tmp/checkpoint" // 
> for spark 2.1
> );
> StructType inSchema = DataTypes.createStructType(
> new StructField[] {
> DataTypes.createStructField("id", DataTypes.StringType
>   , false),
> DataTypes.createStructField("ts", DataTypes.TimestampType 
>   , false),
> DataTypes.createStructField("f1", DataTypes.LongType  
>   , true)
> }
> );
> Dataset rawSet = sparkSession.sqlContext().readStream()
> .format("rate")
> .option("rowsPerSecond", 1)
> .load()
> .map(   (MapFunction) raw -> {
> Object[] fields = new Object[3];
> fields[0] = "id1";
> fields[1] = raw.getAs("timestamp");
> fields[2] = raw.getAs("value");
> return RowFactory.create(fields);
> },
> RowEncoder.apply(inSchema)
> )
> // If withWatermark() is included above the filter() line then 
> this works.  Without it we get:
> //Caused by: java.lang.UnsupportedOperationException: 
> fieldIndex on a Row without schema is undefined.
> // at the row.getAs() call.
> // .withWatermark("ts", "10 seconds")  // <-- This is required 
> for row.getAs("f1") to work ???
> .filter((FilterFunction) row -> !row.getAs("f1").equals(0L))
> .withWatermark("ts", "10 seconds")
> ;
> StreamingQuery streamingQuery = rawSet
> .select("*")
> .writeStream()
> .format("console")
> .outputMode("append")
> .start();
> try {
> streamingQuery.awaitTermination(30_000);
> } catch (StreamingQueryException e) {
> System.out.println("Caught exception at 'awaitTermination':");
> e.printStackTrace();
> }
> }{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26942) spark v 2.3.2 test failure in hive module

2019-02-25 Thread ketan kunde (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776808#comment-16776808
 ] 

ketan kunde commented on SPARK-26942:
-

Logs attached

 

test statistics of LogicalRelation converted from Hive serde tables *** FAILED 
***
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 131.0 failed 1 times, most recent failure: Lost task 0.0 in stage 131.0 
(TID 191, localhost, executor driver): org.iq80.snappy.CorruptionException: 
Invalid copy offset for opcode starting at 4841
 at 
org.iq80.snappy.SnappyDecompressor.decompressAllTags(SnappyDecompressor.java:165)
 at org.iq80.snappy.SnappyDecompressor.uncompress(SnappyDecompressor.java:76)
 at org.iq80.snappy.Snappy.uncompress(Snappy.java:43)
 at org.apache.hadoop.hive.ql.io.orc.SnappyCodec.decompress(SnappyCodec.java:71)
 at 
org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:214)
 at 
org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:238)
 at java.io.InputStream.read(InputStream.java:113)
 at 
org.apache.hive.com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:737)
 at 
org.apache.hive.com.google.protobuf.CodedInputStream.isAtEnd(CodedInputStream.java:701)
 at 
org.apache.hive.com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:99)
 at org.apache.hadoop.hive.ql.io.orc.OrcProto$Footer.(OrcProto.java:15780)
 at org.apache.hadoop.hive.ql.io.orc.OrcProto$Footer.(OrcProto.java:15744)
 at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$Footer$1.parsePartialFrom(OrcProto.java:15886)
 at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$Footer$1.parsePartialFrom(OrcProto.java:15881)
 at 
org.apache.hive.com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
 at 
org.apache.hive.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217)
 at 
org.apache.hive.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223)
 at 
org.apache.hive.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
 at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$Footer.parseFrom(OrcProto.java:16226)
 at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl$MetaInfoObjExtractor.(ReaderImpl.java:479)
 at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.(ReaderImpl.java:319)
 at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:187)
 at 
org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$getFileReader$2.apply(OrcFileOperator.scala:75)
 at 
org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$getFileReader$2.apply(OrcFileOperator.scala:73)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
 at 
scala.collection.TraversableOnce$class.collectFirst(TraversableOnce.scala:145)
 at scala.collection.AbstractIterator.collectFirst(Iterator.scala:1336)
 at 
org.apache.spark.sql.hive.orc.OrcFileOperator$.getFileReader(OrcFileOperator.scala:86)
 at 
org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:95)
 at 
org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:95)
 at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
 at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
 at scala.collection.immutable.List.foreach(List.scala:381)
 at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
 at scala.collection.immutable.List.flatMap(List.scala:344)
 at 
org.apache.spark.sql.hive.orc.OrcFileOperator$.readSchema(OrcFileOperator.scala:95)
 at 
org.apache.spark.sql.hive.orc.OrcFileFormat$$anonfun$buildReader$2.apply(OrcFileFormat.scala:145)
 at 
org.apache.spark.sql.hive.orc.OrcFileFormat$$anonfun$buildReader$2.apply(OrcFileFormat.scala:136)
 at 
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:148)
 at 
org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:132)
 at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:128)
 at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)
 at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(generated.java:36)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:64)
 at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
 at 

[jira] [Updated] (SPARK-20984) Reading back from ORC format gives error on big endian systems.

2019-02-25 Thread ketan kunde (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ketan kunde updated SPARK-20984:

Attachment: hive_test_failure_log.txt

> Reading back from ORC format gives error on big endian systems.
> ---
>
> Key: SPARK-20984
> URL: https://issues.apache.org/jira/browse/SPARK-20984
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Redhat 7 on power 7 Big endian platform.
> [testuser@soe10-vm12 spark]$ cat /etc/redhat-
> redhat-access-insights/ redhat-release
> [testuser@soe10-vm12 spark]$ cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 7.2 (Maipo)
> [testuser@soe10-vm12 spark]$ lscpu
> Architecture:  ppc64
> CPU op-mode(s):32-bit, 64-bit
> Byte Order:Big Endian
> CPU(s):8
> On-line CPU(s) list:   0-7
> Thread(s) per core:1
> Core(s) per socket:1
> Socket(s): 8
> NUMA node(s):  1
> Model: IBM pSeries (emulated by qemu)
> L1d cache: 32K
> L1i cache: 32K
> NUMA node0 CPU(s): 0-7
> [testuser@soe10-vm12 spark]$
>Reporter: Mahesh
>Priority: Major
>  Labels: big-endian
> Attachments: hive_test_failure_log.txt
>
>
> All ORC test cases seem to be failing here. It looks like Spark is not able to 
> read back what is written. The following is a way to check it in the spark-shell. I 
> am also pasting the test case, which probably passes on x86. 
> All test cases in OrcHadoopFsRelationSuite.scala are failing.
>  test("SPARK-12218: 'Not' is included in ORC filter pushdown") {
> import testImplicits._
> withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") {
>   withTempPath { dir =>
> val path = s"${dir.getCanonicalPath}/table1"
> (1 to 5).map(i => (i, (i % 2).toString)).toDF("a", 
> "b").write.orc(path)
> checkAnswer(
>   spark.read.orc(path).where("not (a = 2) or not(b in ('1'))"),
>   (1 to 5).map(i => Row(i, (i % 2).toString)))
> checkAnswer(
>   spark.read.orc(path).where("not (a = 2 and b in ('1'))"),
>   (1 to 5).map(i => Row(i, (i % 2).toString)))
>   }
> }
>   }
> Same can be reproduced on spark shell
> **Create a DF and write it in orc
> scala> (1 to 5).map(i => (i, (i % 2).toString)).toDF("a", 
> "b").write.orc("test")
> **Now try to read it back
> scala> spark.read.orc("test").where("not (a = 2) or not(b in ('1'))").show
> 17/06/05 04:20:48 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> org.iq80.snappy.CorruptionException: Invalid copy offset for opcode starting 
> at 13
> at 
> org.iq80.snappy.SnappyDecompressor.decompressAllTags(SnappyDecompressor.java:165)
> at 
> org.iq80.snappy.SnappyDecompressor.uncompress(SnappyDecompressor.java:76)
> at org.iq80.snappy.Snappy.uncompress(Snappy.java:43)
> at 
> org.apache.hadoop.hive.ql.io.orc.SnappyCodec.decompress(SnappyCodec.java:71)
> at 
> org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:214)
> at 
> org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:238)
> at java.io.InputStream.read(InputStream.java:101)
> at 
> org.apache.hive.com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:737)
> at 
> org.apache.hive.com.google.protobuf.CodedInputStream.isAtEnd(CodedInputStream.java:701)
> at 
> org.apache.hive.com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:99)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter.(OrcProto.java:10661)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter.(OrcProto.java:10625)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:10730)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:10725)
> at 
> org.apache.hive.com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
> at 
> org.apache.hive.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217)
> at 
> org.apache.hive.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223)
> at 
> org.apache.hive.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeFooter.parseFrom(OrcProto.java:10937)
> at 
> org.apache.hadoop.hive.ql.io.orc.MetadataReader.readStripeFooter(MetadataReader.java:113)
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:228)
> at 
> 

[jira] [Resolved] (SPARK-26959) Join of two tables, bucketed the same way, on bucket columns and one or more other columns should not need a shuffle

2019-02-25 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-26959.
-
Resolution: Duplicate

> Join of two tables, bucketed the same way, on bucket columns and one or more 
> other columns should not need a shuffle
> 
>
> Key: SPARK-26959
> URL: https://issues.apache.org/jira/browse/SPARK-26959
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1, 2.4.0
>Reporter: Natang
>Priority: Major
>
> _When two tables, that are bucketed the same way, are joined using bucket 
> columns and one or more other columns, Spark should be able to perform the 
> join without doing a shuffle._
> Consider the example below. There are two tables, 'join_left_table' and 
> 'join_right_table', bucketed by 'col1' into 4 buckets. When these tables are 
> joined on 'col1' and 'col2', Spark should be able to do the join without 
> having to do a shuffle. All entries for a given value of 'col1' would be in 
> the same bucket in both tables, irrespective of the value of 'col2'.
>  
> 
>  
>  
> {noformat}
> def randomInt1to100 = scala.util.Random.nextInt(100)+1
> val left = sc.parallelize(
>   Seq.fill(100){(randomInt1to100,randomInt1to100,randomInt1to100)}
> ).toDF("col1", "col2", "col3")
> val right = sc.parallelize(
>   Seq.fill(100){(randomInt1to100,randomInt1to100,randomInt1to100)}
> ).toDF("col1", "col2", "col3")
> import org.apache.spark.sql.SaveMode
> left.write
> .bucketBy(4,"col1")
> .sortBy("col1", "col2")
> .mode(SaveMode.Overwrite)
> .saveAsTable("join_left_table")
> 
> right.write
> .bucketBy(4,"col1")
> .sortBy("col1", "col2")
> .mode(SaveMode.Overwrite)
> .saveAsTable("join_right_table")
> val left_table = spark.read.table("join_left_table")
> val right_table = spark.read.table("join_right_table")
> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
> val join_on_col1=left_table.join(
> right_table,
> Seq("col1"))
> join_on_col1.explain
> ### BEGIN Output
> join_on_col1: org.apache.spark.sql.DataFrame = [col1: int, col2: int ... 3 
> more fields]
> == Physical Plan ==
> *Project [col1#250, col2#251, col3#252, col2#258, col3#259]
> +- *SortMergeJoin [col1#250], [col1#257], Inner
>:- *Sort [col1#250 ASC NULLS FIRST], false, 0
>:  +- *Project [col1#250, col2#251, col3#252]
>: +- *Filter isnotnull(col1#250)
>:+- *FileScan parquet 
> default.join_left_table[col1#250,col2#251,col3#252] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[hdfs://ip-10-0-108-205.ec2.internal:8020/user/spark/warehouse/join_left_table],
>  PartitionFilters: [], PushedFilters: [IsNotNull(col1)], ReadSchema: 
> struct
>+- *Sort [col1#257 ASC NULLS FIRST], false, 0
>   +- *Project [col1#257, col2#258, col3#259]
>  +- *Filter isnotnull(col1#257)
> +- *FileScan parquet 
> default.join_right_table[col1#257,col2#258,col3#259] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[hdfs://ip-10-0-108-205.ec2.internal:8020/user/spark/warehouse/join_right_table],
>  PartitionFilters: [], PushedFilters: [IsNotNull(col1)], ReadSchema: 
> struct
> ### END Output
> val join_on_col1_col2=left_table.join(
> right_table,
> Seq("col1","col2"))
> join_on_col1_col2.explain
> ### BEGIN Output
> join_on_col1_col2: org.apache.spark.sql.DataFrame = [col1: int, col2: int ... 
> 2 more fields]
> == Physical Plan ==
> *Project [col1#250, col2#251, col3#252, col3#259]
> +- *SortMergeJoin [col1#250, col2#251], [col1#257, col2#258], Inner
>:- *Sort [col1#250 ASC NULLS FIRST, col2#251 ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(col1#250, col2#251, 200)
>: +- *Project [col1#250, col2#251, col3#252]
>:+- *Filter (isnotnull(col2#251) && isnotnull(col1#250))
>:   +- *FileScan parquet 
> default.join_left_table[col1#250,col2#251,col3#252] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[hdfs://ip-10-0-108-205.ec2.internal:8020/user/spark/warehouse/join_left_table],
>  PartitionFilters: [], PushedFilters: [IsNotNull(col2), IsNotNull(col1)], 
> ReadSchema: struct
>+- *Sort [col1#257 ASC NULLS FIRST, col2#258 ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(col1#257, col2#258, 200)
>  +- *Project [col1#257, col2#258, col3#259]
> +- *Filter (isnotnull(col2#258) && isnotnull(col1#257))
>+- *FileScan parquet 
> default.join_right_table[col1#257,col2#258,col3#259] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[hdfs://ip-10-0-108-205.ec2.internal:8020/user/spark/warehouse/join_right_table],
>  PartitionFilters: [], 

[jira] [Resolved] (SPARK-25590) kubernetes-model-2.0.0.jar masks default Spark logging config

2019-02-25 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25590.
---
   Resolution: Fixed
 Assignee: Jiaxin Shan
Fix Version/s: 3.0.0

Resolved by https://github.com/apache/spark/pull/23814 , per Marcelo

> kubernetes-model-2.0.0.jar masks default Spark logging config
> -
>
> Key: SPARK-25590
> URL: https://issues.apache.org/jira/browse/SPARK-25590
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Jiaxin Shan
>Priority: Major
> Fix For: 3.0.0
>
>
> That jar file, which is packaged when the k8s profile is enabled, has a log4j 
> configuration embedded in it:
> {noformat}
> $ jar tf /path/to/kubernetes-model-2.0.0.jar | grep log4j
> log4j.properties
> {noformat}
> What this causes is that Spark will always use that log4j configuration 
> instead of its own default (log4j-defaults.properties), unless the user 
> overrides it by somehow adding their own in the classpath before the 
> kubernetes one.
> You can see that by running spark-shell. With the k8s jar in:
> {noformat}
> $ ./bin/spark-shell 
> ...
> Setting default log level to "WARN"
> {noformat}
> Removing the k8s jar:
> {noformat}
> $ ./bin/spark-shell 
> ...
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> {noformat}
> The proper fix would be for the k8s jar to not ship that file, and then just 
> upgrade the dependency in Spark, but if there's something easy we can do in 
> the meantime...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26742) Bump Kubernetes Client Version to 4.1.1

2019-02-25 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26742:
-

Assignee: Jiaxin Shan

> Bump Kubernetes Client Version to 4.1.1
> ---
>
> Key: SPARK-26742
> URL: https://issues.apache.org/jira/browse/SPARK-26742
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Kubernetes
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Steve Davids
>Assignee: Jiaxin Shan
>Priority: Major
>  Labels: easyfix
>
> Spark 2.x is using Kubernetes Client 3.x, which is pretty old, and the master 
> branch has 4.0. The client should be upgraded to 4.1.1 to have the broadest 
> Kubernetes compatibility support for newer clusters: 
> https://github.com/fabric8io/kubernetes-client#compatibility-matrix



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26742) Bump Kubernetes Client Version to 4.1.1

2019-02-25 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26742.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23814
[https://github.com/apache/spark/pull/23814]

> Bump Kubernetes Client Version to 4.1.1
> ---
>
> Key: SPARK-26742
> URL: https://issues.apache.org/jira/browse/SPARK-26742
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Kubernetes
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Steve Davids
>Assignee: Jiaxin Shan
>Priority: Major
>  Labels: easyfix
> Fix For: 3.0.0
>
>
> Spark 2.x is using Kubernetes Client 3.x, which is pretty old, and the master 
> branch has 4.0. The client should be upgraded to 4.1.1 to have the broadest 
> Kubernetes compatibility support for newer clusters: 
> https://github.com/fabric8io/kubernetes-client#compatibility-matrix



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26985) Test "access only some column of the all of columns " fails on big endian

2019-02-25 Thread Anuja Jakhade (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anuja Jakhade updated SPARK-26985:
--
Description: 
While running tests on Apache Spark v2.3.2 with AdoptJDK on big endian, I am 
observing test failures for 2 Suites of Project SQL.
 1. InMemoryColumnarQuerySuite
 2. DataFrameTungstenSuite
 In both the cases test "access only some column of the all of columns" fails 
due to mismatch in the final assert.
 Seems that the difference in mapping of float and decimal on big endian is 
causing the assert to fail.

Inside test !!- access only some column of the all of columns *** FAILED ***

  was:
While running tests on Apache Spark v2.3.2 with AdoptJDK on big endian, I am 
observing test failures for 2 Suites of Project SQL.
 1. InMemoryColumnarQuerySuite
 2. DataFrameTungstenSuite
 In both the cases test "access only some column of the all of columns" fails 
due to mismatch in the final assert.
 Seems that the difference in mapping of float and decimal on big endian is 
causing the assert to fail.

Inside test !!- access only some column of the all of columns *** FAILED ***
 99 did not equal 9 (InMemoryColumnarQuerySuite.scala:153)


> Test "access only some column of the all of columns " fails on big endian
> -
>
> Key: SPARK-26985
> URL: https://issues.apache.org/jira/browse/SPARK-26985
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: Linux Ubuntu 16.04 
> openjdk version "1.8.0_202"
> OpenJDK Runtime Environment (build 1.8.0_202-b08)
> Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 64-Bit Compressed 
> References 20190205_218 (JIT enabled, AOT enabled)
> OpenJ9 - 90dd8cb40
> OMR - d2f4534b
> JCL - d002501a90 based on jdk8u202-b08)
>  
>Reporter: Anuja Jakhade
>Priority: Critical
> Attachments: DataFrameTungstenSuite.txt, 
> InMemoryColumnarQuerySuite.txt
>
>
> While running tests on Apache Spark v2.3.2 with AdoptJDK on big endian, I am 
> observing test failures for 2 Suites of Project SQL.
>  1. InMemoryColumnarQuerySuite
>  2. DataFrameTungstenSuite
>  In both the cases test "access only some column of the all of columns" fails 
> due to mismatch in the final assert.
>  Seems that the difference in mapping of float and decimal on big endian is 
> causing the assert to fail.
> Inside test !!- access only some column of the all of columns *** FAILED 
> ***



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26940) Observed greater deviation on big endian platform for SingletonReplSuite test case

2019-02-25 Thread Anuja Jakhade (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anuja Jakhade updated SPARK-26940:
--
Priority: Critical  (was: Major)

> Observed greater deviation on big endian platform for SingletonReplSuite test 
> case
> --
>
> Key: SPARK-26940
> URL: https://issues.apache.org/jira/browse/SPARK-26940
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.2
> Environment: Ubuntu 16.04 LTS
> openjdk version "1.8.0_202"
> OpenJDK Runtime Environment (build 1.8.0_202-b08)
> Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 Linux (JIT enabled, AOT 
> enabled)
> OpenJ9 - 90dd8cb40
> OMR - d2f4534b
> JCL - d002501a90 based on jdk8u202-b08)
>Reporter: Anuja Jakhade
>Priority: Critical
> Attachments: failure_log.txt
>
>
> I have built Apache Spark v2.3.2 on Big Endian platform with AdoptJDK OpenJ9 
> 1.8.0_202.
> My build is successful. However, while running the Scala tests of the "*Spark 
> Project REPL*" module, I am facing failures in SingletonReplSuite; the error 
> log is attached.
> The deviation observed on big endian is greater than the acceptable deviation 
> of 0.2.
> How acceptable would it be to increase the deviation defined in 
> SingletonReplSuite.scala?
> Can this be fixed? 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26985) Test "access only some column of the all of columns " fails on big endian

2019-02-25 Thread Anuja Jakhade (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anuja Jakhade updated SPARK-26985:
--
Attachment: InMemoryColumnarQuerySuite.txt
DataFrameTungstenSuite.txt

> Test "access only some column of the all of columns " fails on big endian
> -
>
> Key: SPARK-26985
> URL: https://issues.apache.org/jira/browse/SPARK-26985
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: Linux Ubuntu 16.04 
> openjdk version "1.8.0_202"
> OpenJDK Runtime Environment (build 1.8.0_202-b08)
> Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 64-Bit Compressed 
> References 20190205_218 (JIT enabled, AOT enabled)
> OpenJ9 - 90dd8cb40
> OMR - d2f4534b
> JCL - d002501a90 based on jdk8u202-b08)
>  
>Reporter: Anuja Jakhade
>Priority: Critical
> Attachments: DataFrameTungstenSuite.txt, 
> InMemoryColumnarQuerySuite.txt
>
>
> I am running tests on Apache Spark v2.3.2 with AdoptJDK on big endian
>  I am observing test failures for 2 Suites of Project SQL.
>  1. InMemoryColumnarQuerySuite
>  2. DataFrameTungstenSuite
>  In both the cases test "access only some column of the all of columns" fails 
> due to mismatch in the final assert.
>  Seems that the difference in mapping of float and decimal on big endian is 
> causing the assert to fail.
> Inside test !!- access only some column of the all of columns *** FAILED 
> ***
>  99 did not equal 9 (InMemoryColumnarQuerySuite.scala:153)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26985) Test "access only some column of the all of columns " fails on big endian

2019-02-25 Thread Anuja Jakhade (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anuja Jakhade updated SPARK-26985:
--
Description: 
While running tests on Apache Spark v2.3.2 with AdoptJDK on big endian, I am 
observing test failures for 2 Suites of Project SQL.
 1. InMemoryColumnarQuerySuite
 2. DataFrameTungstenSuite
 In both the cases test "access only some column of the all of columns" fails 
due to mismatch in the final assert.
 Seems that the difference in mapping of float and decimal on big endian is 
causing the assert to fail.

Inside test !!- access only some column of the all of columns *** FAILED ***
 99 did not equal 9 (InMemoryColumnarQuerySuite.scala:153)

  was:
I am running tests on Apache Spark v2.3.2 with AdoptJDK on big endian
 I am observing test failures for 2 Suites of Project SQL.
 1. InMemoryColumnarQuerySuite
 2. DataFrameTungstenSuite
 In both the cases test "access only some column of the all of columns" fails 
due to mismatch in the final assert.
 Seems that the difference in mapping of float and decimal on big endian is 
causing the assert to fail.

Inside test !!- access only some column of the all of columns *** FAILED ***
 99 did not equal 9 (InMemoryColumnarQuerySuite.scala:153)


> Test "access only some column of the all of columns " fails on big endian
> -
>
> Key: SPARK-26985
> URL: https://issues.apache.org/jira/browse/SPARK-26985
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: Linux Ubuntu 16.04 
> openjdk version "1.8.0_202"
> OpenJDK Runtime Environment (build 1.8.0_202-b08)
> Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 64-Bit Compressed 
> References 20190205_218 (JIT enabled, AOT enabled)
> OpenJ9 - 90dd8cb40
> OMR - d2f4534b
> JCL - d002501a90 based on jdk8u202-b08)
>  
>Reporter: Anuja Jakhade
>Priority: Critical
> Attachments: DataFrameTungstenSuite.txt, 
> InMemoryColumnarQuerySuite.txt
>
>
> While running tests on Apache Spark v2.3.2 with AdoptJDK on big endian, I am 
> observing test failures for 2 Suites of Project SQL.
>  1. InMemoryColumnarQuerySuite
>  2. DataFrameTungstenSuite
>  In both the cases test "access only some column of the all of columns" fails 
> due to mismatch in the final assert.
>  Seems that the difference in mapping of float and decimal on big endian is 
> causing the assert to fail.
> Inside test !!- access only some column of the all of columns *** FAILED 
> ***
>  99 did not equal 9 (InMemoryColumnarQuerySuite.scala:153)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26985) Test "access only some column of the all of columns " fails on big endian

2019-02-25 Thread Anuja Jakhade (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anuja Jakhade updated SPARK-26985:
--
Description: 
I am running tests on Apache Spark v2.3.2 with AdoptJDK on big endian
 I am observing test failures for 2 Suites of Project SQL.
 1. InMemoryColumnarQuerySuite
 2. DataFrameTungstenSuite
 In both the cases test "access only some column of the all of columns" fails 
due to mismatch in the final assert.
 Seems that the difference in mapping of float and decimal on big endian is 
causing the assert to fail.

Inside test !!- access only some column of the all of columns *** FAILED ***
 99 did not equal 9 (InMemoryColumnarQuerySuite.scala:153)

  was:
I am running tests on Apache Spark v2.3.2 with AdoptJDK on big endian
I am obsorving test failures at 2 Suites of Prject SQL.
1. InMemoryColumnarQuerySuite
2. DataFrameTungstenSuite
In both the cases test "access only some column of the all of columns" fails 
due to mismatch in the final assert.
Seems that the differnce in mapping of float and decimal on big endian is 
causing the assert to fail.

Inside test !!- access only some column of the all of columns *** FAILED ***
 99 did not equal 9 (InMemoryColumnarQuerySuite.scala:153)


> Test "access only some column of the all of columns " fails on big endian
> -
>
> Key: SPARK-26985
> URL: https://issues.apache.org/jira/browse/SPARK-26985
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: Linux Ubuntu 16.04 
> openjdk version "1.8.0_202"
> OpenJDK Runtime Environment (build 1.8.0_202-b08)
> Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 64-Bit Compressed 
> References 20190205_218 (JIT enabled, AOT enabled)
> OpenJ9 - 90dd8cb40
> OMR - d2f4534b
> JCL - d002501a90 based on jdk8u202-b08)
>  
>Reporter: Anuja Jakhade
>Priority: Critical
>
> I am running tests on Apache Spark v2.3.2 with AdoptJDK on big endian
>  I am observing test failures for 2 Suites of Project SQL.
>  1. InMemoryColumnarQuerySuite
>  2. DataFrameTungstenSuite
>  In both the cases test "access only some column of the all of columns" fails 
> due to mismatch in the final assert.
>  Seems that the difference in mapping of float and decimal on big endian is 
> causing the assert to fail.
> Inside test !!- access only some column of the all of columns *** FAILED 
> ***
>  99 did not equal 9 (InMemoryColumnarQuerySuite.scala:153)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26985) Test "access only some column of the all of columns " fails on big endian

2019-02-25 Thread Anuja Jakhade (JIRA)
Anuja Jakhade created SPARK-26985:
-

 Summary: Test "access only some column of the all of columns " 
fails on big endian
 Key: SPARK-26985
 URL: https://issues.apache.org/jira/browse/SPARK-26985
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.2
 Environment: Linux Ubuntu 16.04 

openjdk version "1.8.0_202"
OpenJDK Runtime Environment (build 1.8.0_202-b08)
Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 64-Bit Compressed References 
20190205_218 (JIT enabled, AOT enabled)
OpenJ9 - 90dd8cb40
OMR - d2f4534b
JCL - d002501a90 based on jdk8u202-b08)

 
Reporter: Anuja Jakhade


I am running tests on Apache Spark v2.3.2 with AdoptJDK on big endian.
I am observing test failures in 2 suites of Project SQL:
1. InMemoryColumnarQuerySuite
2. DataFrameTungstenSuite
In both cases the test "access only some column of the all of columns" fails 
due to a mismatch in the final assert.
It seems that the difference in mapping of float and decimal on big endian is 
causing the assert to fail.

Inside test !!- access only some column of the all of columns *** FAILED ***
 99 did not equal 9 (InMemoryColumnarQuerySuite.scala:153)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26966) Update JPMML to 1.4.8 for Java 9+

2019-02-25 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26966.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23868
[https://github.com/apache/spark/pull/23868]

> Update JPMML to 1.4.8 for Java 9+
> -
>
> Key: SPARK-26966
> URL: https://issues.apache.org/jira/browse/SPARK-26966
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 3.0.0
>
>
> JPMML apparently only supports Java 9 in 1.4.2+. We are seeing test failures 
> from JPMML relating to JAXB when running on Java 11. It's shaded and not a 
> big change, so should be safe.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26983) Spark PassThroughSuite,ColumnVectorSuite failure on bigendian

2019-02-25 Thread salamani (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

salamani updated SPARK-26983:
-
Description: 
Following failures are observed in Spark Project SQL  on big endian system

PassThroughSuite :
 - PassThrough with FLOAT: empty column for decompress()
 - PassThrough with FLOAT: long random series for decompress() *** FAILED ***
 Expected 0.10990685, but got -6.6357654E14 Wrong 0-th decoded float value 
(PassThroughEncodingSuite.scala:146)
 - PassThrough with FLOAT: simple case with null for decompress() *** FAILED ***
 Expected 2.0, but got 9.0E-44 Wrong 0-th decoded float value 
(PassThroughEncodingSuite.scala:146)
 - PassThrough with DOUBLE: empty column
 - PassThrough with DOUBLE: long random series
 - PassThrough with DOUBLE: empty column for decompress()
 - PassThrough with DOUBLE: long random series for decompress() *** FAILED ***
 Expected 0.20634564007984624, but got 5.902392643940031E-230 Wrong 0-th 
decoded double value (PassThroughEncodingSuite.scala:150)
 - PassThrough with DOUBLE: simple case with null for decompress() *** FAILED 
***
 Expected 2.0, but got 3.16E-322 Wrong 0-th decoded double value 
(PassThroughEncodingSuite.scala:150)
 Run completed in 9 seconds, 72 milliseconds.
 Total number of tests run: 30
 Suites: completed 2, aborted 0
 Tests: succeeded 26, failed 4, canceled 0, ignored 0, pending 0
 ** 
 *** 4 TESTS FAILED ***

 

ColumnVectorSuite:
 - CachedBatch long Apis
 - CachedBatch float Apis *** FAILED ***
 4.6006E-41 did not equal 1.0 (ColumnVectorSuite.scala:378)
 - CachedBatch double Apis *** FAILED ***
 3.03865E-319 did not equal 1.0 (ColumnVectorSuite.scala:402)
 Run completed in 8 seconds, 183 milliseconds.
 Total number of tests run: 21
 Suites: completed 2, aborted 0
 Tests: succeeded 19, failed 2, canceled 0, ignored 0, pending 0
 ** 
 *** 2 TESTS FAILED ***

  was:
Following failures are observed for PassThroughSuite in Spark Project SQL  on 
big endian system
 - PassThrough with FLOAT: empty column for decompress()
 - PassThrough with FLOAT: long random series for decompress() *** FAILED ***
 Expected 0.10990685, but got -6.6357654E14 Wrong 0-th decoded float value 
(PassThroughEncodingSuite.scala:146)
 - PassThrough with FLOAT: simple case with null for decompress() *** FAILED ***
 Expected 2.0, but got 9.0E-44 Wrong 0-th decoded float value 
(PassThroughEncodingSuite.scala:146)
 - PassThrough with DOUBLE: empty column
 - PassThrough with DOUBLE: long random series
 - PassThrough with DOUBLE: empty column for decompress()
 - PassThrough with DOUBLE: long random series for decompress() *** FAILED ***
 Expected 0.20634564007984624, but got 5.902392643940031E-230 Wrong 0-th 
decoded double value (PassThroughEncodingSuite.scala:150)
 - PassThrough with DOUBLE: simple case with null for decompress() *** FAILED 
***
 Expected 2.0, but got 3.16E-322 Wrong 0-th decoded double value 
(PassThroughEncodingSuite.scala:150)
 Run completed in 9 seconds, 72 milliseconds.
 Total number of tests run: 30
 Suites: completed 2, aborted 0
 Tests: succeeded 26, failed 4, canceled 0, ignored 0, pending 0
 ** 
 *** 4 TESTS FAILED ***

Summary: Spark PassThroughSuite,ColumnVectorSuite failure on bigendian  
(was: Spark PassThroughSuite failure on bigendian)

> Spark PassThroughSuite,ColumnVectorSuite failure on bigendian
> -
>
> Key: SPARK-26983
> URL: https://issues.apache.org/jira/browse/SPARK-26983
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: salamani
>Priority: Major
> Fix For: 2.3.2
>
>
> Following failures are observed in Spark Project SQL  on big endian system
> PassThroughSuite :
>  - PassThrough with FLOAT: empty column for decompress()
>  - PassThrough with FLOAT: long random series for decompress() *** FAILED ***
>  Expected 0.10990685, but got -6.6357654E14 Wrong 0-th decoded float value 
> (PassThroughEncodingSuite.scala:146)
>  - PassThrough with FLOAT: simple case with null for decompress() *** FAILED 
> ***
>  Expected 2.0, but got 9.0E-44 Wrong 0-th decoded float value 
> (PassThroughEncodingSuite.scala:146)
>  - PassThrough with DOUBLE: empty column
>  - PassThrough with DOUBLE: long random series
>  - PassThrough with DOUBLE: empty column for decompress()
>  - PassThrough with DOUBLE: long random series for decompress() *** FAILED ***
>  Expected 0.20634564007984624, but got 5.902392643940031E-230 Wrong 0-th 
> decoded double value (PassThroughEncodingSuite.scala:150)
>  - PassThrough with DOUBLE: simple case with null for decompress() *** FAILED 
> ***
>  Expected 2.0, but got 3.16E-322 Wrong 0-th decoded double value 
> (PassThroughEncodingSuite.scala:150)
>  Run completed in 9 seconds, 72 milliseconds.
>  Total number of tests run: 30
>  Suites: 

[jira] [Commented] (SPARK-20049) Writing data to Parquet with partitions takes very long after the job finishes

2019-02-25 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776677#comment-16776677
 ] 

Yuming Wang commented on SPARK-20049:
-

Could you try to set 
{{spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2}}?
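
For reference, a minimal sketch of where that setting could go (the source path and app name below are placeholders; since the report uses PySpark, the same key can equally be passed as a --conf flag to spark-submit):

{code:java}
import org.apache.spark.sql.SparkSession;

// Algorithm version 2 commits task output directly into the destination directory
// during task commit, instead of renaming everything out of _temporary at job
// commit, which is the slow phase described in this report.
SparkSession spark = SparkSession.builder()
    .appName("partitioned-parquet-write")  // placeholder app name
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate();

spark.read().parquet("src_dir")            // placeholder source path
    .write()
    .partitionBy("date")
    .parquet("dest_dir");
{code}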

> Writing data to Parquet with partitions takes very long after the job finishes
> --
>
> Key: SPARK-20049
> URL: https://issues.apache.org/jira/browse/SPARK-20049
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, PySpark, SQL
>Affects Versions: 2.1.0
> Environment: Spark 2.1.0, CDH 5.8, Python 3.4, Java 8, Debian 
> GNU/Linux 8.7 (jessie)
>Reporter: Jakub Nowacki
>Priority: Minor
>
> I was testing writing a DataFrame to partitioned Parquet files. The command is 
> quite straightforward and the data set is really a sample from a larger data 
> set in Parquet; the job is done in PySpark on YARN and written to HDFS:
> {code}
> # there is column 'date' in df
> df.write.partitionBy("date").parquet("dest_dir")
> {code}
> The reading part took as long as usual, but after the job had been marked as 
> finished in PySpark and the UI, the Python interpreter was still showing it as 
> busy. Indeed, when I checked the HDFS folder I noticed that the files were 
> still being transferred from {{dest_dir/_temporary}} to all the {{dest_dir/date=*}} 
> folders. 
> First of all it takes much longer than saving the same set without 
> partitioning. Second, it is done in the background, without visible progress 
> of any kind. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25061) Spark SQL Thrift Server fails to not pick up hiveconf passing parameter

2019-02-25 Thread Udbhav Agrawal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776646#comment-16776646
 ] 

Udbhav Agrawal commented on SPARK-25061:


In my case it is able to overwrite the value successfully via passed parameters, 
both with --conf and with --hiveconf.

 

>  Spark SQL Thrift Server fails to not pick up hiveconf passing parameter
> 
>
> Key: SPARK-25061
> URL: https://issues.apache.org/jira/browse/SPARK-25061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zineng Yuan
>Priority: Critical
>
> Spark Thrift Server should use the value of a passed parameter and overwrite the 
> same conf from hive-site.xml. For example, the server should overwrite what 
> exists in hive-site.xml when started with:
>  ./sbin/start-thriftserver.sh --master yarn-client ...
> --hiveconf 
> "hive.server2.authentication.kerberos.principal=" ...
> while hive-site.xml contains:
> <property>
>   <name>hive.server2.authentication.kerberos.principal</name>
>   <value>hive/_HOST@</value>
> </property>
> However, the server takes what is in hive-site.xml.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26824) Streaming queries may store checkpoint data in a wrong directory

2019-02-25 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26824:
-
Docs Text: Earlier versions of Spark incorrectly escaped paths when writing 
out checkpoints and "_spark_metadata" for Structured Streaming. Queries 
affected by this issue will fail when running on Spark 3.0, which will report 
instructions on how to migrate them.

> Streaming queries may store checkpoint data in a wrong directory
> 
>
> Key: SPARK-26824
> URL: https://issues.apache.org/jira/browse/SPARK-26824
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> When a user specifies a checkpoint location containing special characters that 
> need to be escaped in a path, the streaming query will store its checkpoint data 
> in the wrong place. For example, if you use "/chk chk", the metadata will be stored 
> in "/chk%20chk". The file sink's "_spark_metadata" directory has the same issue. 
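
A minimal sketch (hypothetical paths and the built-in "rate" test source; not taken from the report) of a query shape affected by this: the checkpoint location contains a space, which the affected versions escape to "/chk%20chk" before writing:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-escape-demo").getOrCreate()

val query = spark.readStream
  .format("rate")                                // built-in testing source
  .load()
  .writeStream
  .format("parquet")
  .option("path", "/data/out dir")               // hypothetical output path containing a space
  .option("checkpointLocation", "/chk chk")      // checkpoint path containing a space
  .start()                                       // checkpoint data may end up under "/chk%20chk"
{code}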



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26984) Incompatibility between Spark releases - Some(null)

2019-02-25 Thread Gerard Alexander (JIRA)
Gerard Alexander created SPARK-26984:


 Summary: Incompatibility between Spark releases - Some(null) 
 Key: SPARK-26984
 URL: https://issues.apache.org/jira/browse/SPARK-26984
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
 Environment: Linux CentOS, Databricks.
Reporter: Gerard Alexander
 Fix For: 2.4.1, 2.4.2


Please refer to 
[https://stackoverflow.com/questions/54851205/why-does-somenull-throw-nullpointerexception-in-spark-2-4-but-worked-in-2-2/54861152#54861152].

NB: Not sure the priority is correct - no doubt someone will evaluate it.

It is noted that the following:

{code}
val df = Seq(
  (1, Some("a"), Some(1)),
  (2, Some(null), Some(2)),
  (3, Some("c"), Some(3)),
  (4, None, None)
).toDF("c1", "c2", "c3")
{code}

In Spark 2.2.1 (on MapR), Some(null) works fine; in Spark 2.4.0 on 
Databricks, an error ensues.

{code}
java.lang.RuntimeException: Error while encoding: java.lang.NullPointerException
assertnotnull(assertnotnull(input[0, scala.Tuple3, true]))._1 AS _1#6
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, unwrapoption(ObjectType(class java.lang.String), assertnotnull(assertnotnull(input[0, scala.Tuple3, true]))._2), true, false) AS _2#7
unwrapoption(IntegerType, assertnotnull(assertnotnull(input[0, scala.Tuple3, true]))._3) AS _3#8
  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:293)
  at org.apache.spark.sql.SparkSession.$anonfun$createDataset$1(SparkSession.scala:472)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:233)
  at scala.collection.immutable.List.foreach(List.scala:388)
  at scala.collection.TraversableLike.map(TraversableLike.scala:233)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:226)
  at scala.collection.immutable.List.map(List.scala:294)
  at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:472)
  at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377)
  at org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:228)
  ... 57 elided
Caused by: java.lang.NullPointerException
  at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:289)
  ... 66 more
{code}

 

One can argue it is solvable in other ways, but there may well be existing code 
bases that could be affected.
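
One possible workaround, sketched below under the assumption that the intent is simply a nullable string column (a suggestion, not an official fix): wrap nullable values with {{Option(...)}} instead of {{Some(...)}}, since {{Option(null)}} evaluates to {{None}} and avoids handing the encoder a {{Some}} that contains null.

{code}
// Assumes spark-shell, so spark.implicits._ is already in scope.
val df = Seq(
  (1, Option("a"), Option(1)),
  (2, Option(null: String), Option(2)),   // Option(null) == None, no NPE in the encoder
  (3, Option("c"), Option(3)),
  (4, None: Option[String], None: Option[Int])
).toDF("c1", "c2", "c3")

df.show()
{code}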

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16784) Configurable log4j settings

2019-02-25 Thread Narcis Andrei Moga (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776633#comment-16776633
 ] 

Narcis Andrei Moga edited comment on SPARK-16784 at 2/25/19 8:33 AM:
-

I have the same problem after migrating from Spark 2.2.1 to 2.4.0 with deploy 
mode cluster & the standalone manager (it does not happen in client deploy mode).

I tested in Docker and all required files are present in all containers (1 master 
& 2 workers - Spark has no config for this test - it is just untarred).

*1) Executor command observed in the stderr file*

Spark Executor Command: "/srv/java/jdk/bin/java" "-cp" 
"/usr/lib/spark/conf/:/usr/lib/spark/jars/*" "-Xmx1024M" 
"-Dspark.driver.port=45431" "-Dspark.cassandra.connection.port=9042" 
 "-Dspark.rpc.askTimeout=10s" "-Dspark.application.ldap.port=55389" 
_*"-Duser.timezone=UTC"*_ 
_*"-Dlog4j.configuration=[file:///log4j.properties.executor]"*_

"-Dcom.sun.management.jmxremote" 
 "-Dcom.sun.management.jmxremote.authenticate=false"

"-Dcom.sun.management.jmxremote.local.only=false"

"-Dcom.sun.management.jmxremote.ssl=false" "-Djava.net.preferIPv4Stack=true" 
 "-Dcom.sun.management.jmxremote.port=0" 
"-Djava.util.logging.config.file=/jmx-logging.properties" 
"org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
 "spark://CoarseGrainedScheduler@c1-spark-executor2:45431" "--executor-id" "1" 
"--hostname" "172.18.0.22" "--cores" "1" "--app-id" "app-20190224171936-0010" 
"--worker-url" 
 "spark://Worker@172.18.0.22:36555"

*2) Partial command of the Driver observed in the stderr file*

Launch Command: "/srv/java/jdk/bin/java" "-cp" 
"/usr/lib/spark/conf/:/usr/lib/spark/jars/*" "-Xmx1024M" 
 _*"-Dspark.driver.extraJavaOptions=-Duser.timezone=UTC 
-Dlog4j.configuration=[file:///log4j.properties.driver*_]
 "-Dspark.kafka.ppu.topic.name=..." 
 

*3) Submit command*

spark-submit \
 --deploy-mode cluster \
 --master spark://172.18.0.20:7077 \
 --properties-file /application.properties \
 --class com... \
 /logs-correlation-2.4.1-1.noarch.jar

*4) application.properties contains*

spark.driver.extraJavaOptions=-Duser.timezone=UTC 
-Dlog4j.configuration=[file:///log4j.properties.driver]

spark.executor.extraJavaOptions=-Duser.timezone=UTC 
-Dlog4j.configuration=[file:///log4j.properties.executor]

 

 


was (Author: andreim):
I have the same problem after migration from Spark 2.2.1 to 2.4.0 and deploy 
mode  cluster & standalone namager (it not happens in client mode deploy)

I test in docker and all required files are present in all containers (1 master 
& 2 workers - Spark have no config for this test - it is just untar)

*1) Executor command observed in the stderr file*

Spark Executor Command: "/srv/java/jdk/bin/java" "-cp" 
"/usr/lib/spark/conf/:/usr/lib/spark/jars/*" "-Xmx1024M" 
"-Dspark.driver.port=45431" "-Dspark.cassandra.connection.port=9042" 
"-Dspark.rpc.askTimeout=10s" "-Dspark.application.ldap.port=55389" 
_*"-Duser.timezone=UTC"*_ 
_*"-Dlog4j.configuration=file:///log4j.properties.executor"*_ 
"-Dcom.sun.management.jmxremote" 
"-Dcom.sun.management.jmxremote.authenticate=false" 
"-Dcom.sun.management.jmxremote.local.only=false" 
"-Dcom.sun.management.jmxremote.ssl=false" "-Djava.net.preferIPv4Stack=true" 
"-Dcom.sun.management.jmxremote.port=0" 
"-Djava.util.logging.config.file=/jmx-logging.properties" 
"org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
"spark://CoarseGrainedScheduler@c1-spark-executor2:45431" "--executor-id" "1" 
"--hostname" "172.18.0.22" "--cores" "1" "--app-id" "app-20190224171936-0010" 
"--worker-url" 
"spark://Worker@172.18.0.22:36555"

*2) Partial command of the Driver observed in the stderr file*

Launch Command: "/srv/java/jdk/bin/java" "-cp" 
"/usr/lib/spark/conf/:/usr/lib/spark/jars/*" "-Xmx1024M" 
_*"-Dspark.driver.extraJavaOptions=-Duser.timezone=UTC 
-Dlog4j.configuration=file:///log4j.properties.driver*_
"-Dspark.kafka.ppu.topic.name=..." 


*3) Submit command*

spark-submit \
--deploy-mode cluster \
--master spark://172.18.0.20:7077 \
--properties-file /application.properties \
--class com... \
/logs-correlation-2.4.1-1.noarch.jar

*4) application.properties contains*

spark.driver.extraJavaOptions=-Duser.timezone=UTC 
-Dlog4j.configuration=file:///log4j.properties.driver

spark.executor.extraJavaOptions=-Duser.timezone=UTC 
-Dlog4j.configuration=file:///log4j.properties.executor

 

 

> Configurable log4j settings
> ---
>
> Key: SPARK-16784
> URL: https://issues.apache.org/jira/browse/SPARK-16784
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Michael Gummelt
>Priority: Major
>
> I often want to change the logging configuration on a single spark job.  This 
> is easy in client mode.  I just modify log4j.properties.  It's difficult 

[jira] [Commented] (SPARK-16784) Configurable log4j settings

2019-02-25 Thread Narcis Andrei Moga (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16776633#comment-16776633
 ] 

Narcis Andrei Moga commented on SPARK-16784:


I have the same problem after migrating from Spark 2.2.1 to 2.4.0 with deploy 
mode cluster & the standalone manager (it does not happen in client deploy mode).

I tested in Docker and all required files are present in all containers (1 master 
& 2 workers - Spark has no config for this test - it is just untarred).

*1) Executor command observed in the stderr file*

Spark Executor Command: "/srv/java/jdk/bin/java" "-cp" 
"/usr/lib/spark/conf/:/usr/lib/spark/jars/*" "-Xmx1024M" 
"-Dspark.driver.port=45431" "-Dspark.cassandra.connection.port=9042" 
"-Dspark.rpc.askTimeout=10s" "-Dspark.application.ldap.port=55389" 
_*"-Duser.timezone=UTC"*_ 
_*"-Dlog4j.configuration=file:///log4j.properties.executor"*_ 
"-Dcom.sun.management.jmxremote" 
"-Dcom.sun.management.jmxremote.authenticate=false" 
"-Dcom.sun.management.jmxremote.local.only=false" 
"-Dcom.sun.management.jmxremote.ssl=false" "-Djava.net.preferIPv4Stack=true" 
"-Dcom.sun.management.jmxremote.port=0" 
"-Djava.util.logging.config.file=/jmx-logging.properties" 
"org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
"spark://CoarseGrainedScheduler@c1-spark-executor2:45431" "--executor-id" "1" 
"--hostname" "172.18.0.22" "--cores" "1" "--app-id" "app-20190224171936-0010" 
"--worker-url" 
"spark://Worker@172.18.0.22:36555"

*2) Partial command of the Driver observed in the stderr file*

Launch Command: "/srv/java/jdk/bin/java" "-cp" 
"/usr/lib/spark/conf/:/usr/lib/spark/jars/*" "-Xmx1024M" 
_*"-Dspark.driver.extraJavaOptions=-Duser.timezone=UTC 
-Dlog4j.configuration=file:///log4j.properties.driver*_
"-Dspark.kafka.ppu.topic.name=..." 


*3) Submit command*

spark-submit \
--deploy-mode cluster \
--master spark://172.18.0.20:7077 \
--properties-file /application.properties \
--class com... \
/logs-correlation-2.4.1-1.noarch.jar

*4) application.properties contains*

spark.driver.extraJavaOptions=-Duser.timezone=UTC 
-Dlog4j.configuration=file:///log4j.properties.driver

spark.executor.extraJavaOptions=-Duser.timezone=UTC 
-Dlog4j.configuration=file:///log4j.properties.executor

 

 

> Configurable log4j settings
> ---
>
> Key: SPARK-16784
> URL: https://issues.apache.org/jira/browse/SPARK-16784
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Michael Gummelt
>Priority: Major
>
> I often want to change the logging configuration on a single spark job.  This 
> is easy in client mode.  I just modify log4j.properties.  It's difficult in 
> cluster mode, because I need to modify the log4j.properties in the 
> distribution in which the driver runs.  I'd like a way of setting this 
> dynamically, such as a java system property.  Some brief searching showed 
> that log4j doesn't seem to accept such a property, but I'd like to open up 
> this idea for further comment.  Maybe we can find a solution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26983) Spark PassThroughSuite failure on bigendian

2019-02-25 Thread salamani (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

salamani updated SPARK-26983:
-
Description: 
The following failures are observed for PassThroughSuite in Spark Project SQL:


 - PassThrough with FLOAT: empty column for decompress()
 - PassThrough with FLOAT: long random series for decompress() *** FAILED ***
 Expected 0.10990685, but got -6.6357654E14 Wrong 0-th decoded float value 
(PassThroughEncodingSuite.scala:146)
 - PassThrough with FLOAT: simple case with null for decompress() *** FAILED ***
 Expected 2.0, but got 9.0E-44 Wrong 0-th decoded float value 
(PassThroughEncodingSuite.scala:146)
 - PassThrough with DOUBLE: empty column
 - PassThrough with DOUBLE: long random series
 - PassThrough with DOUBLE: empty column for decompress()
 - PassThrough with DOUBLE: long random series for decompress() *** FAILED ***
 Expected 0.20634564007984624, but got 5.902392643940031E-230 Wrong 0-th 
decoded double value (PassThroughEncodingSuite.scala:150)
 - PassThrough with DOUBLE: simple case with null for decompress() *** FAILED 
***
 Expected 2.0, but got 3.16E-322 Wrong 0-th decoded double value 
(PassThroughEncodingSuite.scala:150)
 Run completed in 9 seconds, 72 milliseconds.
 Total number of tests run: 30
 Suites: completed 2, aborted 0
 Tests: succeeded 26, failed 4, canceled 0, ignored 0, pending 0
 ** 
 *** 4 TESTS FAILED ***


  was:
Following failures are observed for PassThroughSuite in Spark Project SQL  

```
 - PassThrough with FLOAT: empty column for decompress()
 - PassThrough with FLOAT: long random series for decompress() *** FAILED ***
 Expected 0.10990685, but got -6.6357654E14 Wrong 0-th decoded float value 
(PassThroughEncodingSuite.scala:146)
 - PassThrough with FLOAT: simple case with null for decompress() *** FAILED ***
 Expected 2.0, but got 9.0E-44 Wrong 0-th decoded float value 
(PassThroughEncodingSuite.scala:146)
 - PassThrough with DOUBLE: empty column
 - PassThrough with DOUBLE: long random series
 - PassThrough with DOUBLE: empty column for decompress()
 - PassThrough with DOUBLE: long random series for decompress() *** FAILED ***
 Expected 0.20634564007984624, but got 5.902392643940031E-230 Wrong 0-th 
decoded double value (PassThroughEncodingSuite.scala:150)
 - PassThrough with DOUBLE: simple case with null for decompress() *** FAILED 
***
 Expected 2.0, but got 3.16E-322 Wrong 0-th decoded double value 
(PassThroughEncodingSuite.scala:150)
 Run completed in 9 seconds, 72 milliseconds.
 Total number of tests run: 30
 Suites: completed 2, aborted 0
 Tests: succeeded 26, failed 4, canceled 0, ignored 0, pending 0
 ** 
 *** 4 TESTS FAILED ***
 ```


> Spark PassThroughSuite failure on bigendian
> ---
>
> Key: SPARK-26983
> URL: https://issues.apache.org/jira/browse/SPARK-26983
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: salamani
>Priority: Major
> Fix For: 2.3.2
>
>
> The following failures are observed for PassThroughSuite in Spark Project SQL:
>  - PassThrough with FLOAT: empty column for decompress()
>  - PassThrough with FLOAT: long random series for decompress() *** FAILED ***
>  Expected 0.10990685, but got -6.6357654E14 Wrong 0-th decoded float value 
> (PassThroughEncodingSuite.scala:146)
>  - PassThrough with FLOAT: simple case with null for decompress() *** FAILED 
> ***
>  Expected 2.0, but got 9.0E-44 Wrong 0-th decoded float value 
> (PassThroughEncodingSuite.scala:146)
>  - PassThrough with DOUBLE: empty column
>  - PassThrough with DOUBLE: long random series
>  - PassThrough with DOUBLE: empty column for decompress()
>  - PassThrough with DOUBLE: long random series for decompress() *** FAILED ***
>  Expected 0.20634564007984624, but got 5.902392643940031E-230 Wrong 0-th 
> decoded double value (PassThroughEncodingSuite.scala:150)
>  - PassThrough with DOUBLE: simple case with null for decompress() *** FAILED 
> ***
>  Expected 2.0, but got 3.16E-322 Wrong 0-th decoded double value 
> (PassThroughEncodingSuite.scala:150)
>  Run completed in 9 seconds, 72 milliseconds.
>  Total number of tests run: 30
>  Suites: completed 2, aborted 0
>  Tests: succeeded 26, failed 4, canceled 0, ignored 0, pending 0
>  ** 
>  *** 4 TESTS FAILED ***
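
The wrong values above are consistent with the raw IEEE-754 bits being reinterpreted with the wrong byte order. A small, self-contained sketch (plain JVM, no Spark; purely illustrative) reproduces the reported 9.0E-44 and 3.16E-322 from 2.0 by swapping the byte order of its bit pattern:

{code}
// Illustrative only: byte-swapping the bits of 2.0f / 2.0d yields exactly the
// denormal values quoted in the failed assertions.
val swappedFloat = java.lang.Float.intBitsToFloat(
  java.lang.Integer.reverseBytes(java.lang.Float.floatToIntBits(2.0f)))
val swappedDouble = java.lang.Double.longBitsToDouble(
  java.lang.Long.reverseBytes(java.lang.Double.doubleToLongBits(2.0)))

println(swappedFloat)   // 9.0E-44   (cf. "Expected 2.0, but got 9.0E-44")
println(swappedDouble)  // 3.16E-322 (cf. "Expected 2.0, but got 3.16E-322")
{code}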



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org