[jira] [Created] (SPARK-31242) Clone SparkSession should respect spark.sql.legacy.sessionInitWithConfigDefaults

2020-03-24 Thread wuyi (Jira)
wuyi created SPARK-31242:


 Summary: Clone SparkSession should respect 
spark.sql.legacy.sessionInitWithConfigDefaults
 Key: SPARK-31242
 URL: https://issues.apache.org/jira/browse/SPARK-31242
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.1.0
Reporter: wuyi


In SQL tests, a conf specified via `withSQLConf` can be reverted to its "original 
value" after cloning the SparkSession if that "original value" is already set at 
the SparkConf level. This happens because `WithTestConf` does not respect 
spark.sql.legacy.sessionInitWithConfigDefaults and always merges SQLConf with 
SparkConf.
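
For context, a minimal sketch of the withSQLConf-style set-and-restore pattern involved here (simplified; not the actual test-helper code, and it only uses the public `spark.conf` API):

{code:scala}
// Simplified sketch of a withSQLConf-style helper: set session-level SQL confs,
// run the body, then restore the previous values. The ticket is about a cloned
// session losing such session-level values when the same key is also set in SparkConf.
import org.apache.spark.sql.SparkSession

def withSQLConf[T](spark: SparkSession)(pairs: (String, String)*)(body: => T): T = {
  // Remember the current session-level values (None if the key is unset).
  val previous = pairs.map { case (k, _) => k -> spark.conf.getOption(k) }
  pairs.foreach { case (k, v) => spark.conf.set(k, v) }
  try body finally {
    // Restore: put back the old value, or unset the key entirely.
    previous.foreach {
      case (k, Some(v)) => spark.conf.set(k, v)
      case (k, None)    => spark.conf.unset(k)
    }
  }
}
{code}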






[jira] [Resolved] (SPARK-31207) Ensure the total number of blocks to fetch equals to the sum of local/hostLocal/remote blocks

2020-03-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31207.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27972
[https://github.com/apache/spark/pull/27972]

> Ensure the total number of blocks to fetch equals to the sum of 
> local/hostLocal/remote blocks
> -
>
> Key: SPARK-31207
> URL: https://issues.apache.org/jira/browse/SPARK-31207
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Xingbo Jiang
>Assignee: Xingbo Jiang
>Priority: Major
> Fix For: 3.0.0
>
>
> Assert the number of blocks to fetch equals the number of local blocks + the 
> number of hostLocal blocks + the number of remote blocks in 
> ShuffleBlockFetcherIterator.
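
As a standalone illustration of the invariant described above (the block lists and count below are hypothetical placeholders, not the actual ShuffleBlockFetcherIterator fields):

{code:scala}
// Hypothetical block lists standing in for what ShuffleBlockFetcherIterator partitions
// its input into; the sanity check is that no block is dropped or double-counted.
val localBlocks     = Seq("shuffle_0_0_0", "shuffle_0_1_0")
val hostLocalBlocks = Seq("shuffle_0_2_0")
val remoteBlocks    = Seq("shuffle_0_3_0", "shuffle_0_4_0")

val numBlocksToFetch = 5  // total number of blocks requested (assumed)
assert(numBlocksToFetch == localBlocks.size + hostLocalBlocks.size + remoteBlocks.size,
  s"expected $numBlocksToFetch blocks but partitioned into " +
  s"${localBlocks.size} local + ${hostLocalBlocks.size} host-local + ${remoteBlocks.size} remote")
{code}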






[jira] [Resolved] (SPARK-31239) Fix flaky test: WorkerDecommissionSuite.verify a task with all workers decommissioned succeeds

2020-03-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31239.
--
Target Version/s: 3.1.0
  Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/28010

> Fix flaky test: WorkerDecommissionSuite.verify a task with all workers 
> decommissioned succeeds
> --
>
> Key: SPARK-31239
> URL: https://issues.apache.org/jira/browse/SPARK-31239
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Xingbo Jiang
>Assignee: Xingbo Jiang
>Priority: Minor
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120284/testReport/junit/org.apache.spark.scheduler/WorkerDecommissionSuite/verify_a_task_with_all_workers_decommissioned_succeeds/
> Error Message
> java.util.concurrent.TimeoutException: Futures timed out after [2 seconds]
> Stacktrace
> sbt.ForkMain$ForkError: java.util.concurrent.TimeoutException: Futures timed 
> out after [2 seconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:187)
>   at org.apache.spark.SimpleFutureAction.result(FutureAction.scala:130)
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:295)
>   at 
> org.apache.spark.scheduler.WorkerDecommissionSuite.$anonfun$new$3(WorkerDecommissionSuite.scala:73)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:151)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
>   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:58)






[jira] [Created] (SPARK-31241) Support Hive on DataSourceV2

2020-03-24 Thread Jackey Lee (Jira)
Jackey Lee created SPARK-31241:
--

 Summary: Support Hive on DataSourceV2
 Key: SPARK-31241
 URL: https://issues.apache.org/jira/browse/SPARK-31241
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Jackey Lee


There are three reasons why we need to support Hive on DataSourceV2.
1. Hive itself is one of Spark's data sources.
2. HiveTable is essentially a FileTable with its own input and output
formats, so it fits the FileTable abstraction well.
3. HiveTable should be stateless, so users can freely read or write Hive
using batch or micro-batch.






[jira] [Resolved] (SPARK-31240) Constant fold deterministic Scala UDFs with foldable arguments

2020-03-24 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-31240.
--
Resolution: Duplicate

> Constant fold deterministic Scala UDFs with foldable arguments
> --
>
> Key: SPARK-31240
> URL: https://issues.apache.org/jira/browse/SPARK-31240
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kris Mok
>Priority: Major
>
> Constant fold deterministic Scala UDFs with foldable arguments, 
> conservatively.
> ScalaUDFs that meet all of the following criteria are subject to constant folding in 
> this feature:
> * deterministic
> * all arguments are foldable
> * evaluating the UDF for constant folding does not throw an exception






[jira] [Created] (SPARK-31240) Constant fold deterministic Scala UDFs with foldable arguments

2020-03-24 Thread Kris Mok (Jira)
Kris Mok created SPARK-31240:


 Summary: Constant fold deterministic Scala UDFs with foldable 
arguments
 Key: SPARK-31240
 URL: https://issues.apache.org/jira/browse/SPARK-31240
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kris Mok


Constant fold deterministic Scala UDFs with foldable arguments, conservatively.

ScalaUDFs that meet all of the following criteria are subject to constant folding in 
this feature:
* deterministic
* all arguments are foldable
* evaluating the UDF for constant folding does not throw an exception
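
As a rough illustration of these criteria (a sketch, not code from the ticket; the names are made up), the first UDF below would be a folding candidate while the non-deterministic variant would not:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, udf}

val spark = SparkSession.builder().master("local[1]").appName("udf-folding-sketch").getOrCreate()

// Deterministic UDF applied to a literal (foldable) argument: a candidate for folding.
val toUpper = udf((s: String) => s.toUpperCase)
spark.range(1).select(toUpper(lit("spark"))).explain(true)

// The same function marked non-deterministic would not qualify.
val toUpperNonDet = udf((s: String) => s.toUpperCase).asNondeterministic()
spark.range(1).select(toUpperNonDet(lit("spark"))).explain(true)

spark.stop()
{code}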






[jira] [Created] (SPARK-31239) Fix flaky test: WorkerDecommissionSuite.verify a task with all workers decommissioned succeeds

2020-03-24 Thread Xingbo Jiang (Jira)
Xingbo Jiang created SPARK-31239:


 Summary: Fix flaky test: WorkerDecommissionSuite.verify a task 
with all workers decommissioned succeeds
 Key: SPARK-31239
 URL: https://issues.apache.org/jira/browse/SPARK-31239
 Project: Spark
  Issue Type: Test
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Xingbo Jiang
Assignee: Xingbo Jiang


https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120284/testReport/junit/org.apache.spark.scheduler/WorkerDecommissionSuite/verify_a_task_with_all_workers_decommissioned_succeeds/

Error Message
java.util.concurrent.TimeoutException: Futures timed out after [2 seconds]
Stacktrace
sbt.ForkMain$ForkError: java.util.concurrent.TimeoutException: Futures timed 
out after [2 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:187)
at org.apache.spark.SimpleFutureAction.result(FutureAction.scala:130)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:295)
at 
org.apache.spark.scheduler.WorkerDecommissionSuite.$anonfun$new$3(WorkerDecommissionSuite.scala:73)
at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:151)
at 
org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:58)






[jira] [Commented] (SPARK-31238) Incompatible ORC dates with Spark 2.4

2020-03-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066210#comment-17066210
 ] 

Dongjoon Hyun commented on SPARK-31238:
---

I confirmed the issue, too.

> Incompatible ORC dates with Spark 2.4
> -
>
> Key: SPARK-31238
> URL: https://issues.apache.org/jira/browse/SPARK-31238
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bruce Robbins
>Priority: Blocker
>
> Using Spark 2.4.5, write pre-1582 date to ORC file and then read it:
> {noformat}
> $ export TZ=UTC
> $ bin/spark-shell --conf spark.sql.session.timeZone=UTC
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.5-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_161)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sql("select cast('1200-01-01' as date) 
> dt").write.mode("overwrite").orc("/tmp/datefile")
> scala> spark.read.orc("/tmp/datefile").show
> +--+
> |dt|
> +--+
> |1200-01-01|
> +--+
> scala> :quit
> {noformat}
> Using Spark 3.0 (branch-3.0 at commit a934142f24), read the same file:
> {noformat}
> $ export TZ=UTC
> $ bin/spark-shell --conf spark.sql.session.timeZone=UTC
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_161)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> spark.read.orc("/tmp/datefile").show
> +--+
> |dt|
> +--+
> |1200-01-08|
> +--+
> scala>
> {noformat}
> Dates are off.
> Timestamps, on the other hand, appear to work as expected.






[jira] [Updated] (SPARK-31236) Spark error while consuming data from Kinesis direct end point

2020-03-24 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-31236:

Priority: Critical  (was: Blocker)

> Spark error while consuming data from Kinesis direct end point
> --
>
> Key: SPARK-31236
> URL: https://issues.apache.org/jira/browse/SPARK-31236
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Java API
>Affects Versions: 2.4.5
>Reporter: Thukarama Prabhu
>Priority: Critical
>
> Here is a summary of the issue I am experiencing when using the Kinesis direct 
> URL for consuming data with Spark.
> *Kinesis direct URL:* 
> [https://kinesis-ae1.hdw.r53.deap.tv|https://kinesis-ae1.hdw.r53.deap.tv/] 
> (Failing with Credential should be scoped to a valid region, not 'ae1')
> *Kinesis default URL:* 
> [https://kinesis.us-east-1.amazonaws.com|https://kinesis.us-east-1.amazonaws.com/]
>  (Working)
> Spark code for consuming data
> SparkAWSCredentials credentials = 
> commonService.getSparkAWSCredentials(kinApp.propConfig);
> KinesisInputDStream kinesisStream = KinesisInputDStream.builder()
>     .streamingContext(jssc)
>     .checkpointAppName(applicationName)
>     .streamName(streamName)
>     .endpointUrl(endpointURL)
>     .regionName(regionName)
>     
> .initialPosition(KinesisInitialPositions.fromKinesisInitialPosition(initPosition))
>     .checkpointInterval(checkpointInterval)
>     .kinesisCredentials(credentials)
>     .storageLevel(StorageLevel.MEMORY_AND_DISK_2()).build();
>  
> Spark version 2.4.4
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-streaming-kinesis-asl_2.11</artifactId>
>     <version>2.4.5</version>
> </dependency>
> <dependency>
>     <groupId>com.amazonaws</groupId>
>     <artifactId>amazon-kinesis-client</artifactId>
>     <version>1.13.3</version>
> </dependency>
> <dependency>
>     <groupId>com.amazonaws</groupId>
>     <artifactId>aws-java-sdk</artifactId>
>     <version>1.11.747</version>
> </dependency>
> The Spark application works fine when I use the default URL but fails with the 
> error below when I change to the direct URL. The direct URL works when I try to 
> publish to it; the issue occurs only when I try to consume data.
>  
> 2020-03-24 08:43:40,650 ERROR - Caught exception while sync'ing Kinesis 
> shards and leases
> com.amazonaws.services.kinesis.model.AmazonKinesisException: Credential 
> should be scoped to a valid region, not 'ae1'.  (Service: AmazonKinesis; 
> Status Code: 400; Error Code: InvalidSignatureException; Request ID: 
> fb43b636-8ce2-ec77-adb7-a8ead9e038c2)
>     at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1799)
>     at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1383)
>     at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1359)
>     at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1139)
>     at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:796)
>     at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:764)
>     at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:738)
>     at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:698)
>     at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:680)
>     at 
> com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:544)
>     at 
> com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:524)
>     at 
> com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:2809)
>     at 
> com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2776)
>     at 
> com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2765)
>     at 
> com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1557)
>     at 
> com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
>     at 
> com.amazonaws.services.kinesis.clientlibrary.proxies.KinesisProxy.listShards(KinesisProxy.java:326)
>     at 
> com.amazonaws.services.kinesis.clientlibrary.proxies.KinesisProxy.getShardList(KinesisProxy.java:441)
>     at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisShardSyncer.getShardList(KinesisShardSyncer.java:349)
>     at 
> com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisShardSyncer.syncShardLeases(KinesisShardSyncer.java:159)
>     at 
> 

[jira] [Commented] (SPARK-31081) Make display of stageId/stageAttemptId/taskId of sql metrics toggleable

2020-03-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066188#comment-17066188
 ] 

Dongjoon Hyun commented on SPARK-31081:
---

Hi, [~Gengliang.Wang]. Please set the `Fix Version`.

> Make display of stageId/stageAttemptId/taskId of sql metrics toggleable
> ---
>
> Key: SPARK-31081
> URL: https://issues.apache.org/jira/browse/SPARK-31081
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: Kousuke Saruta
>Priority: Major
>
> The metrics become harder to read after SPARK-30209, and users may not be interested 
> in the extra info ({{stageId/stageAttemptId/taskId}}) when they do not need to debug.






[jira] [Updated] (SPARK-31081) Make display of stageId/stageAttemptId/taskId of sql metrics toggleable

2020-03-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31081:
--
Fix Version/s: 3.0.0

> Make display of stageId/stageAttemptId/taskId of sql metrics toggleable
> ---
>
> Key: SPARK-31081
> URL: https://issues.apache.org/jira/browse/SPARK-31081
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.0.0
>
>
> The metrics become harder to read after SPARK-30209, and users may not be interested 
> in the extra info ({{stageId/stageAttemptId/taskId}}) when they do not need to debug.






[jira] [Commented] (SPARK-31238) Incompatible ORC dates with Spark 2.4

2020-03-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066179#comment-17066179
 ] 

Dongjoon Hyun commented on SPARK-31238:
---

This issue inherits the `Priority` of the parent issue, SPARK-30951.

> Incompatible ORC dates with Spark 2.4
> -
>
> Key: SPARK-31238
> URL: https://issues.apache.org/jira/browse/SPARK-31238
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bruce Robbins
>Priority: Blocker
>
> Using Spark 2.4.5, write pre-1582 date to ORC file and then read it:
> {noformat}
> $ export TZ=UTC
> $ bin/spark-shell --conf spark.sql.session.timeZone=UTC
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.5-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_161)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sql("select cast('1200-01-01' as date) 
> dt").write.mode("overwrite").orc("/tmp/datefile")
> scala> spark.read.orc("/tmp/datefile").show
> +--+
> |dt|
> +--+
> |1200-01-01|
> +--+
> scala> :quit
> {noformat}
> Using Spark 3.0 (branch-3.0 at commit a934142f24), read the same file:
> {noformat}
> $ export TZ=UTC
> $ bin/spark-shell --conf spark.sql.session.timeZone=UTC
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_161)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> spark.read.orc("/tmp/datefile").show
> +--+
> |dt|
> +--+
> |1200-01-08|
> +--+
> scala>
> {noformat}
> Dates are off.
> Timestamps, on the other hand, appear to work as expected.






[jira] [Updated] (SPARK-31238) Incompatible ORC dates with Spark 2.4

2020-03-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31238:
--
Priority: Blocker  (was: Major)

> Incompatible ORC dates with Spark 2.4
> -
>
> Key: SPARK-31238
> URL: https://issues.apache.org/jira/browse/SPARK-31238
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bruce Robbins
>Priority: Blocker
>
> Using Spark 2.4.5, write pre-1582 date to ORC file and then read it:
> {noformat}
> $ export TZ=UTC
> $ bin/spark-shell --conf spark.sql.session.timeZone=UTC
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.5-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_161)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sql("select cast('1200-01-01' as date) 
> dt").write.mode("overwrite").orc("/tmp/datefile")
> scala> spark.read.orc("/tmp/datefile").show
> +--+
> |dt|
> +--+
> |1200-01-01|
> +--+
> scala> :quit
> {noformat}
> Using Spark 3.0 (branch-3.0 at commit a934142f24), read the same file:
> {noformat}
> $ export TZ=UTC
> $ bin/spark-shell --conf spark.sql.session.timeZone=UTC
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_161)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> spark.read.orc("/tmp/datefile").show
> +--+
> |dt|
> +--+
> |1200-01-08|
> +--+
> scala>
> {noformat}
> Dates are off.
> Timestamps, on the other hand, appear to work as expected.






[jira] [Updated] (SPARK-31238) Incompatible ORC dates with Spark 2.4

2020-03-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31238:
--
Target Version/s: 3.0.0

> Incompatible ORC dates with Spark 2.4
> -
>
> Key: SPARK-31238
> URL: https://issues.apache.org/jira/browse/SPARK-31238
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bruce Robbins
>Priority: Blocker
>
> Using Spark 2.4.5, write pre-1582 date to ORC file and then read it:
> {noformat}
> $ export TZ=UTC
> $ bin/spark-shell --conf spark.sql.session.timeZone=UTC
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.5-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_161)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sql("select cast('1200-01-01' as date) 
> dt").write.mode("overwrite").orc("/tmp/datefile")
> scala> spark.read.orc("/tmp/datefile").show
> +--+
> |dt|
> +--+
> |1200-01-01|
> +--+
> scala> :quit
> {noformat}
> Using Spark 3.0 (branch-3.0 at commit a934142f24), read the same file:
> {noformat}
> $ export TZ=UTC
> $ bin/spark-shell --conf spark.sql.session.timeZone=UTC
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_161)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> spark.read.orc("/tmp/datefile").show
> +--+
> |dt|
> +--+
> |1200-01-08|
> +--+
> scala>
> {noformat}
> Dates are off.
> Timestamps, on the other hand, appear to work as expected.






[jira] [Commented] (SPARK-30951) Potential data loss for legacy applications after switch to proleptic Gregorian calendar

2020-03-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066177#comment-17066177
 ] 

Dongjoon Hyun commented on SPARK-30951:
---

Thank you so much, [~bersprockets]!

> Potential data loss for legacy applications after switch to proleptic 
> Gregorian calendar
> 
>
> Key: SPARK-30951
> URL: https://issues.apache.org/jira/browse/SPARK-30951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bruce Robbins
>Assignee: Maxim Gekk
>Priority: Blocker
> Fix For: 3.0.0
>
>
> tl;dr: We recently discovered some Spark 2.x sites that have lots of data 
> containing dates before October 15, 1582. This could be an issue when such 
> sites try to upgrade to Spark 3.0.
> From SPARK-26651:
> {quote}"The changes might impact on the results for dates and timestamps 
> before October 15, 1582 (Gregorian)
> {quote}
> We recently discovered that some large scale Spark 2.x applications rely on 
> dates before October 15, 1582.
> Two cases came up recently:
>  * An application that uses a commercial third-party library to encode 
> sensitive dates. On insert, the library encodes the actual date as some other 
> date. On select, the library decodes the date back to the original date. The 
> encoded value could be any date, including one before October 15, 1582 (e.g., 
> "0602-04-04").
>  * An application that uses a specific unlikely date (e.g., "1200-01-01") as 
> a marker to indicate "unknown date" (in lieu of null)
> Both sites ran into problems after another component in their system was 
> upgraded to use the proleptic Gregorian calendar. Spark applications that 
> read files created by the upgraded component were interpreting encoded or 
> marker dates incorrectly, and vice versa. Also, their data now had a mix of 
> calendars (hybrid and proleptic Gregorian) with no metadata to indicate which 
> file used which calendar.
> Both sites had enormous amounts of existing data, so re-encoding the dates 
> using some other scheme was not a feasible solution.
> This is relevant to Spark 3:
> Any Spark 2 application that uses such date-encoding schemes may run into 
> trouble when run on Spark 3. The application may not properly interpret the 
> dates previously written by Spark 2. Also, once the Spark 3 version of the 
> application writes data, the tables will have a mix of calendars (hybrid and 
> proleptic gregorian) with no metadata to indicate which file uses which 
> calendar.
> Similarly, sites might run with mixed Spark versions, resulting in data 
> written by one version that cannot be interpreted by the other. And as above, 
> the tables will now have a mix of calendars with no way to detect which file 
> uses which calendar.
> As with the two real-life example cases, these applications may have enormous 
> amounts of legacy data, so re-encoding the dates using some other scheme may 
> not be feasible.
> We might want to consider a configuration setting to allow the user to 
> specify the calendar for storing and retrieving date and timestamp values 
> (not sure how such a flag would affect other date and timestamp-related 
> functions). I realize the change is far bigger than just adding a 
> configuration setting.
> Here's a quick example of where trouble may happen, using the real-life case 
> of the marker date.
> In Spark 2.4:
> {noformat}
> scala> spark.read.orc(s"$home/data/datefile").filter("dt == 
> '1200-01-01'").count
> res0: Long = 1
> scala>
> {noformat}
> In Spark 3.0 (reading from the same legacy file):
> {noformat}
> scala> spark.read.orc(s"$home/data/datefile").filter("dt == 
> '1200-01-01'").count
> res0: Long = 0
> scala> 
> {noformat}
> By the way, Hive had a similar problem. Hive switched from hybrid calendar to 
> proleptic Gregorian calendar between 2.x and 3.x. After some upgrade 
> headaches related to dates before 1582, the Hive community made the following 
> changes:
>  * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive 
> checks a configuration setting to determine which calendar to use.
>  * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive 
> stores the calendar type in the metadata.
>  * When reading date or timestamp data from ORC, Parquet, and Avro files, 
> Hive checks the metadata for the calendar type.
>  * When reading date or timestamp data from ORC, Parquet, and Avro files that 
> lack calendar metadata, Hive's behavior is determined by a configuration 
> setting. This allows Hive to read legacy data (note: if the data already 
> consists of a mix of calendar types with no metadata, there is no good 
> solution).





[jira] [Commented] (SPARK-30951) Potential data loss for legacy applications after switch to proleptic Gregorian calendar

2020-03-24 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066170#comment-17066170
 ] 

Bruce Robbins commented on SPARK-30951:
---

I added a subtask for ORC. The issue affects only the date type; the timestamp type 
seems fine, as far as I can tell.

> Potential data loss for legacy applications after switch to proleptic 
> Gregorian calendar
> 
>
> Key: SPARK-30951
> URL: https://issues.apache.org/jira/browse/SPARK-30951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bruce Robbins
>Assignee: Maxim Gekk
>Priority: Blocker
> Fix For: 3.0.0
>
>
> tl;dr: We recently discovered some Spark 2.x sites that have lots of data 
> containing dates before October 15, 1582. This could be an issue when such 
> sites try to upgrade to Spark 3.0.
> From SPARK-26651:
> {quote}"The changes might impact on the results for dates and timestamps 
> before October 15, 1582 (Gregorian)
> {quote}
> We recently discovered that some large scale Spark 2.x applications rely on 
> dates before October 15, 1582.
> Two cases came up recently:
>  * An application that uses a commercial third-party library to encode 
> sensitive dates. On insert, the library encodes the actual date as some other 
> date. On select, the library decodes the date back to the original date. The 
> encoded value could be any date, including one before October 15, 1582 (e.g., 
> "0602-04-04").
>  * An application that uses a specific unlikely date (e.g., "1200-01-01") as 
> a marker to indicate "unknown date" (in lieu of null)
> Both sites ran into problems after another component in their system was 
> upgraded to use the proleptic Gregorian calendar. Spark applications that 
> read files created by the upgraded component were interpreting encoded or 
> marker dates incorrectly, and vice versa. Also, their data now had a mix of 
> calendars (hybrid and proleptic Gregorian) with no metadata to indicate which 
> file used which calendar.
> Both sites had enormous amounts of existing data, so re-encoding the dates 
> using some other scheme was not a feasible solution.
> This is relevant to Spark 3:
> Any Spark 2 application that uses such date-encoding schemes may run into 
> trouble when run on Spark 3. The application may not properly interpret the 
> dates previously written by Spark 2. Also, once the Spark 3 version of the 
> application writes data, the tables will have a mix of calendars (hybrid and 
> proleptic gregorian) with no metadata to indicate which file uses which 
> calendar.
> Similarly, sites might run with mixed Spark versions, resulting in data 
> written by one version that cannot be interpreted by the other. And as above, 
> the tables will now have a mix of calendars with no way to detect which file 
> uses which calendar.
> As with the two real-life example cases, these applications may have enormous 
> amounts of legacy data, so re-encoding the dates using some other scheme may 
> not be feasible.
> We might want to consider a configuration setting to allow the user to 
> specify the calendar for storing and retrieving date and timestamp values 
> (not sure how such a flag would affect other date and timestamp-related 
> functions). I realize the change is far bigger than just adding a 
> configuration setting.
> Here's a quick example of where trouble may happen, using the real-life case 
> of the marker date.
> In Spark 2.4:
> {noformat}
> scala> spark.read.orc(s"$home/data/datefile").filter("dt == 
> '1200-01-01'").count
> res0: Long = 1
> scala>
> {noformat}
> In Spark 3.0 (reading from the same legacy file):
> {noformat}
> scala> spark.read.orc(s"$home/data/datefile").filter("dt == 
> '1200-01-01'").count
> res0: Long = 0
> scala> 
> {noformat}
> By the way, Hive had a similar problem. Hive switched from hybrid calendar to 
> proleptic Gregorian calendar between 2.x and 3.x. After some upgrade 
> headaches related to dates before 1582, the Hive community made the following 
> changes:
>  * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive 
> checks a configuration setting to determine which calendar to use.
>  * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive 
> stores the calendar type in the metadata.
>  * When reading date or timestamp data from ORC, Parquet, and Avro files, 
> Hive checks the metadata for the calendar type.
>  * When reading date or timestamp data from ORC, Parquet, and Avro files that 
> lack calendar metadata, Hive's behavior is determined by a configuration 
> setting. This allows Hive to read legacy data (note: if the data already 
> consists of a mix of calendar types with no metadata, there is no good 
> solution).




[jira] [Created] (SPARK-31238) Incompatible ORC dates with Spark 2.4

2020-03-24 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-31238:
-

 Summary: Incompatible ORC dates with Spark 2.4
 Key: SPARK-31238
 URL: https://issues.apache.org/jira/browse/SPARK-31238
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Bruce Robbins


Using Spark 2.4.5, write pre-1582 date to ORC file and then read it:
{noformat}
$ export TZ=UTC
$ bin/spark-shell --conf spark.sql.session.timeZone=UTC
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5-SNAPSHOT
  /_/
 
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_161)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sql("select cast('1200-01-01' as date) 
dt").write.mode("overwrite").orc("/tmp/datefile")

scala> spark.read.orc("/tmp/datefile").show
+--+
|dt|
+--+
|1200-01-01|
+--+

scala> :quit
{noformat}
Using Spark 3.0 (branch-3.0 at commit a934142f24), read the same file:
{noformat}
$ export TZ=UTC
$ bin/spark-shell --conf spark.sql.session.timeZone=UTC
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
  /_/
 
Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_161)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.read.orc("/tmp/datefile").show
+--+
|dt|
+--+
|1200-01-08|
+--+

scala>
{noformat}
Dates are off.

Timestamps, on the other hand, appear to work as expected.






[jira] [Updated] (SPARK-31081) Make display of stageId/stageAttemptId/taskId of sql metrics toggleable

2020-03-24 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-31081:
---
Summary: Make display of stageId/stageAttemptId/taskId of sql metrics 
toggleable  (was: Make the display of stageId/stageAttemptId/taskId of sql 
metrics configurable in UI )

> Make display of stageId/stageAttemptId/taskId of sql metrics toggleable
> ---
>
> Key: SPARK-31081
> URL: https://issues.apache.org/jira/browse/SPARK-31081
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: Kousuke Saruta
>Priority: Major
>
> The metrics become harder to read after SPARK-30209, and users may not be interested 
> in the extra info ({{stageId/stageAttemptId/taskId}}) when they do not need to debug.






[jira] [Resolved] (SPARK-31081) Make the display of stageId/stageAttemptId/taskId of sql metrics configurable in UI

2020-03-24 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-31081.

  Assignee: Kousuke Saruta
Resolution: Fixed

The issue is resolved in https://github.com/apache/spark/pull/27927

> Make the display of stageId/stageAttemptId/taskId of sql metrics configurable 
> in UI 
> 
>
> Key: SPARK-31081
> URL: https://issues.apache.org/jira/browse/SPARK-31081
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: Kousuke Saruta
>Priority: Major
>
> The metrics become harder to read after SPARK-30209, and users may not be interested 
> in the extra info ({{stageId/stageAttemptId/taskId}}) when they do not need to debug.






[jira] [Updated] (SPARK-31081) Make the display of stageId/stageAttemptId/taskId of sql metrics configurable in UI

2020-03-24 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-31081:
---
Affects Version/s: (was: 3.1.0)
   3.0.0

> Make the display of stageId/stageAttemptId/taskId of sql metrics configurable 
> in UI 
> 
>
> Key: SPARK-31081
> URL: https://issues.apache.org/jira/browse/SPARK-31081
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: wuyi
>Priority: Major
>
> The metrics become harder to read after SPARK-30209, and users may not be interested 
> in the extra info ({{stageId/stageAttemptId/taskId}}) when they do not need to debug.






[jira] [Updated] (SPARK-30494) Duplicates cached RDD when create or replace an existing view

2020-03-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30494:
--
Fix Version/s: 2.4.6

> Duplicates cached RDD when create or replace an existing view
> -
>
> Key: SPARK-30494
> URL: https://issues.apache.org/jira/browse/SPARK-30494
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Major
> Fix For: 3.0.0, 2.4.6
>
>
> We can reproduce by below commands:
> {code}
> beeline> create or replace temporary view temp1 as select 1
> beeline> cache table temp1
> beeline> create or replace temporary view temp1 as select 1, 2
> beeline> cache table temp1
> {code}
> The cached RDD for the plan "select 1" stays in memory forever until the session 
> closes. This cached data can never be used again since the view temp1 has been 
> replaced by another plan, so it is a memory leak.
> assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1, 
> 2")).isDefined)
> assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 
> 1")).isDefined)






[jira] [Commented] (SPARK-30951) Potential data loss for legacy applications after switch to proleptic Gregorian calendar

2020-03-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066159#comment-17066159
 ] 

Dongjoon Hyun commented on SPARK-30951:
---

Hi, [~bersprockets]. 
Could you file a new JIRA about that ORC issue as a subtask of this JIRA issue?

> Potential data loss for legacy applications after switch to proleptic 
> Gregorian calendar
> 
>
> Key: SPARK-30951
> URL: https://issues.apache.org/jira/browse/SPARK-30951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bruce Robbins
>Assignee: Maxim Gekk
>Priority: Blocker
> Fix For: 3.0.0
>
>
> tl;dr: We recently discovered some Spark 2.x sites that have lots of data 
> containing dates before October 15, 1582. This could be an issue when such 
> sites try to upgrade to Spark 3.0.
> From SPARK-26651:
> {quote}"The changes might impact on the results for dates and timestamps 
> before October 15, 1582 (Gregorian)
> {quote}
> We recently discovered that some large scale Spark 2.x applications rely on 
> dates before October 15, 1582.
> Two cases came up recently:
>  * An application that uses a commercial third-party library to encode 
> sensitive dates. On insert, the library encodes the actual date as some other 
> date. On select, the library decodes the date back to the original date. The 
> encoded value could be any date, including one before October 15, 1582 (e.g., 
> "0602-04-04").
>  * An application that uses a specific unlikely date (e.g., "1200-01-01") as 
> a marker to indicate "unknown date" (in lieu of null)
> Both sites ran into problems after another component in their system was 
> upgraded to use the proleptic Gregorian calendar. Spark applications that 
> read files created by the upgraded component were interpreting encoded or 
> marker dates incorrectly, and vice versa. Also, their data now had a mix of 
> calendars (hybrid and proleptic Gregorian) with no metadata to indicate which 
> file used which calendar.
> Both sites had enormous amounts of existing data, so re-encoding the dates 
> using some other scheme was not a feasible solution.
> This is relevant to Spark 3:
> Any Spark 2 application that uses such date-encoding schemes may run into 
> trouble when run on Spark 3. The application may not properly interpret the 
> dates previously written by Spark 2. Also, once the Spark 3 version of the 
> application writes data, the tables will have a mix of calendars (hybrid and 
> proleptic gregorian) with no metadata to indicate which file uses which 
> calendar.
> Similarly, sites might run with mixed Spark versions, resulting in data 
> written by one version that cannot be interpreted by the other. And as above, 
> the tables will now have a mix of calendars with no way to detect which file 
> uses which calendar.
> As with the two real-life example cases, these applications may have enormous 
> amounts of legacy data, so re-encoding the dates using some other scheme may 
> not be feasible.
> We might want to consider a configuration setting to allow the user to 
> specify the calendar for storing and retrieving date and timestamp values 
> (not sure how such a flag would affect other date and timestamp-related 
> functions). I realize the change is far bigger than just adding a 
> configuration setting.
> Here's a quick example of where trouble may happen, using the real-life case 
> of the marker date.
> In Spark 2.4:
> {noformat}
> scala> spark.read.orc(s"$home/data/datefile").filter("dt == 
> '1200-01-01'").count
> res0: Long = 1
> scala>
> {noformat}
> In Spark 3.0 (reading from the same legacy file):
> {noformat}
> scala> spark.read.orc(s"$home/data/datefile").filter("dt == 
> '1200-01-01'").count
> res0: Long = 0
> scala> 
> {noformat}
> By the way, Hive had a similar problem. Hive switched from hybrid calendar to 
> proleptic Gregorian calendar between 2.x and 3.x. After some upgrade 
> headaches related to dates before 1582, the Hive community made the following 
> changes:
>  * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive 
> checks a configuration setting to determine which calendar to use.
>  * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive 
> stores the calendar type in the metadata.
>  * When reading date or timestamp data from ORC, Parquet, and Avro files, 
> Hive checks the metadata for the calendar type.
>  * When reading date or timestamp data from ORC, Parquet, and Avro files that 
> lack calendar metadata, Hive's behavior is determined by a configuration 
> setting. This allows Hive to read legacy data (note: if the data already 
> consists of a mix of calendar types with no metadata, there is no good 
> solution).




[jira] [Created] (SPARK-31237) Replace 3-letter time zones by zone offsets

2020-03-24 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-31237:
--

 Summary: Replace 3-letter time zones by zone offsets
 Key: SPARK-31237
 URL: https://issues.apache.org/jira/browse/SPARK-31237
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


3-letter time zones are ambiguous and have already been deprecated in the JDK, see 
[https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html]. Also, 
some short names are mapped to region-based zone IDs and don't conform to their 
actual definitions. For example, the PST short name is mapped to 
America/Los_Angeles, which has different zone offsets in the Java 7 and Java 8 APIs:
{code:scala}
scala> TimeZone.getTimeZone("PST").getOffset(Timestamp.valueOf("2016-11-05 23:00:00").getTime)/3600000.0
res11: Double = -7.0
scala> TimeZone.getTimeZone("PST").getOffset(Timestamp.valueOf("2016-11-06 00:00:00").getTime)/3600000.0
res12: Double = -7.0
scala> TimeZone.getTimeZone("PST").getOffset(Timestamp.valueOf("2016-11-06 01:00:00").getTime)/3600000.0
res13: Double = -8.0
scala> TimeZone.getTimeZone("PST").getOffset(Timestamp.valueOf("2016-11-06 02:00:00").getTime)/3600000.0
res14: Double = -8.0
scala> TimeZone.getTimeZone("PST").getOffset(Timestamp.valueOf("2016-11-06 03:00:00").getTime)/3600000.0
res15: Double = -8.0
{code}
and in the Java 8 API: 
https://github.com/apache/spark/pull/27980#discussion_r396287278

By definition, PST must be a constant offset equal to UTC-08:00, see 
https://www.timeanddate.com/time/zones/pst

The ticket aims to replace all short time zone names with zone offsets in tests.
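
For illustration (a sketch, not a change from the ticket), this is the kind of substitution the tests would make, using java.time with a fixed offset instead of the region-based zone the PST short name maps to:

{code:scala}
import java.time.{Instant, ZoneId, ZoneOffset}

// The region-based zone that the PST short name maps to: its offset depends on DST rules.
val regionBased = ZoneId.of("America/Los_Angeles")
println(regionBased.getRules.getOffset(Instant.parse("2016-11-05T23:00:00Z")))  // -07:00 (DST)
println(regionBased.getRules.getOffset(Instant.parse("2016-11-06T23:00:00Z")))  // -08:00

// A fixed offset, as the tests would use after this change: always -08:00.
val fixed = ZoneOffset.of("-08:00")
println(fixed)
{code}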
 






[jira] [Commented] (SPARK-30951) Potential data loss for legacy applications after switch to proleptic Gregorian calendar

2020-03-24 Thread Bruce Robbins (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066153#comment-17066153
 ] 

Bruce Robbins commented on SPARK-30951:
---

[~cloud_fan]
{quote}
For ORC, it follows the Java `Timestamp`/`Date` semantic and Spark still 
respects it in 3.0, so there is no legacy data as nothing changed in 3.0.
{quote}
Sorry if I misunderstand, but my example case (above in the description) uses ORC, 
and I can still reproduce it with the latest master vs. Spark 2.4.

> Potential data loss for legacy applications after switch to proleptic 
> Gregorian calendar
> 
>
> Key: SPARK-30951
> URL: https://issues.apache.org/jira/browse/SPARK-30951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bruce Robbins
>Assignee: Maxim Gekk
>Priority: Blocker
> Fix For: 3.0.0
>
>
> tl;dr: We recently discovered some Spark 2.x sites that have lots of data 
> containing dates before October 15, 1582. This could be an issue when such 
> sites try to upgrade to Spark 3.0.
> From SPARK-26651:
> {quote}"The changes might impact on the results for dates and timestamps 
> before October 15, 1582 (Gregorian)
> {quote}
> We recently discovered that some large scale Spark 2.x applications rely on 
> dates before October 15, 1582.
> Two cases came up recently:
>  * An application that uses a commercial third-party library to encode 
> sensitive dates. On insert, the library encodes the actual date as some other 
> date. On select, the library decodes the date back to the original date. The 
> encoded value could be any date, including one before October 15, 1582 (e.g., 
> "0602-04-04").
>  * An application that uses a specific unlikely date (e.g., "1200-01-01") as 
> a marker to indicate "unknown date" (in lieu of null)
> Both sites ran into problems after another component in their system was 
> upgraded to use the proleptic Gregorian calendar. Spark applications that 
> read files created by the upgraded component were interpreting encoded or 
> marker dates incorrectly, and vice versa. Also, their data now had a mix of 
> calendars (hybrid and proleptic Gregorian) with no metadata to indicate which 
> file used which calendar.
> Both sites had enormous amounts of existing data, so re-encoding the dates 
> using some other scheme was not a feasible solution.
> This is relevant to Spark 3:
> Any Spark 2 application that uses such date-encoding schemes may run into 
> trouble when run on Spark 3. The application may not properly interpret the 
> dates previously written by Spark 2. Also, once the Spark 3 version of the 
> application writes data, the tables will have a mix of calendars (hybrid and 
> proleptic gregorian) with no metadata to indicate which file uses which 
> calendar.
> Similarly, sites might run with mixed Spark versions, resulting in data 
> written by one version that cannot be interpreted by the other. And as above, 
> the tables will now have a mix of calendars with no way to detect which file 
> uses which calendar.
> As with the two real-life example cases, these applications may have enormous 
> amounts of legacy data, so re-encoding the dates using some other scheme may 
> not be feasible.
> We might want to consider a configuration setting to allow the user to 
> specify the calendar for storing and retrieving date and timestamp values 
> (not sure how such a flag would affect other date and timestamp-related 
> functions). I realize the change is far bigger than just adding a 
> configuration setting.
> Here's a quick example of where trouble may happen, using the real-life case 
> of the marker date.
> In Spark 2.4:
> {noformat}
> scala> spark.read.orc(s"$home/data/datefile").filter("dt == 
> '1200-01-01'").count
> res0: Long = 1
> scala>
> {noformat}
> In Spark 3.0 (reading from the same legacy file):
> {noformat}
> scala> spark.read.orc(s"$home/data/datefile").filter("dt == 
> '1200-01-01'").count
> res0: Long = 0
> scala> 
> {noformat}
> By the way, Hive had a similar problem. Hive switched from hybrid calendar to 
> proleptic Gregorian calendar between 2.x and 3.x. After some upgrade 
> headaches related to dates before 1582, the Hive community made the following 
> changes:
>  * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive 
> checks a configuration setting to determine which calendar to use.
>  * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive 
> stores the calendar type in the metadata.
>  * When reading date or timestamp data from ORC, Parquet, and Avro files, 
> Hive checks the metadata for the calendar type.
>  * When reading date or timestamp data from ORC, Parquet, and Avro files that 
> lack calendar metadata, Hive's behavior is determined by a configuration 
> setting. This allows Hive to read legacy data (note: if the data already 
> consists of a mix of calendar types with no metadata, there is no good 
> solution).

[jira] [Commented] (SPARK-31209) Not compatible with new version of scalatest (3.1.0 and above)

2020-03-24 Thread Timothy Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17066090#comment-17066090
 ] 

Timothy Zhang commented on SPARK-31209:
---

Sure, I'll try to work on it soon. 

> Not compatible with new version of scalatest (3.1.0 and above)
> --
>
> Key: SPARK-31209
> URL: https://issues.apache.org/jira/browse/SPARK-31209
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Timothy Zhang
>Priority: Major
>
> Since ScalaTest's style traits and classes were moved and renamed 
> ([http://www.scalatest.org/release_notes/3.1.0]), there are errors such as FunSpec 
> not being found when I add the new version of scalatest to the library dependencies.
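
For illustration, a minimal example of the relocation in ScalaTest 3.1.0 (the spec class below is hypothetical, not from the ticket):

{code:scala}
// ScalaTest 3.1.0 moved the style traits into dedicated packages, e.g.
// org.scalatest.FunSpec became org.scalatest.funspec.AnyFunSpec.
import org.scalatest.funspec.AnyFunSpec

class ExampleSpec extends AnyFunSpec {
  describe("a spec compiled against scalatest 3.1.0+") {
    it("uses the relocated style trait") {
      assert(1 + 1 == 2)
    }
  }
}
{code}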






[jira] [Resolved] (SPARK-31161) Refactor the on-click timeline action in streaming-page.js

2020-03-24 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-31161.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27921
[https://github.com/apache/spark/pull/27921]

> Refactor the on-click timeline action in streaming-page.js
> ---
>
> Key: SPARK-31161
> URL: https://issues.apache.org/jira/browse/SPARK-31161
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.1.0
>
>
> streaming-page.js is used by both the Streaming page and the Structured Streaming page, 
> but the implementation of the on-click timeline action is strongly dependent on 
> the Streaming page.
> So let's refactor it to remove the dependency.






[jira] [Assigned] (SPARK-30127) UDF should work for case class like Dataset operations

2020-03-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30127:
---

Assignee: wuyi

> UDF should work for case class like Dataset operations
> --
>
> Key: SPARK-30127
> URL: https://issues.apache.org/jira/browse/SPARK-30127
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, Spark UDFs can only work on data types like java.lang.String, 
> o.a.s.sql.Row, Seq[_], etc. This is inconvenient if you want to apply an 
> operation to a column of struct type: you must access the data 
> through a Row object instead of your domain object, as Dataset operations allow. It 
> would be great if UDFs could work on the types supported by Dataset, e.g. 
> case classes.
> Note that there are multiple ways to register a UDF, and it is only possible 
> to support this feature if the UDF is registered through a Scala API that 
> provides a type tag, e.g. `def udf[RT: TypeTag, A1: TypeTag](f: Function1[A1, 
> RT])`
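
A sketch of the kind of usage this enables (the case class and column names are assumptions for illustration; it relies on the Scala `udf` API with type tags noted above):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, udf}

case class Point(x: Double, y: Double)

val spark = SparkSession.builder().master("local[1]").appName("case-class-udf-sketch").getOrCreate()
import spark.implicits._

// A struct column built from two double columns named x and y.
val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("x", "y")
  .select(struct(col("x"), col("y")).as("p"))

// With this feature the UDF can take the case class directly instead of a Row.
val norm = udf((p: Point) => math.sqrt(p.x * p.x + p.y * p.y))
df.select(norm(col("p")).as("norm")).show()

spark.stop()
{code}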






[jira] [Resolved] (SPARK-30127) UDF should work for case class like Dataset operations

2020-03-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30127.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27937
[https://github.com/apache/spark/pull/27937]

> UDF should work for case class like Dataset operations
> --
>
> Key: SPARK-30127
> URL: https://issues.apache.org/jira/browse/SPARK-30127
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, a Spark UDF can only work on data types like java.lang.String, 
> o.a.s.sql.Row, Seq[_], etc. This is inconvenient if you want to apply an 
> operation on one column and the column is of struct type: you must access data 
> from a Row object instead of your domain object, as Dataset operations do. It 
> would be great if UDFs could work on the types that are supported by Dataset, e.g. 
> case classes.
> Note that there are multiple ways to register a UDF, and it's only possible 
> to support this feature if the UDF is registered using a Scala API that 
> provides a type tag, e.g. `def udf[RT: TypeTag, A1: TypeTag](f: Function1[A1, 
> RT])`



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31236) Spark error while consuming data from Kinesis direct end point

2020-03-24 Thread Thukarama Prabhu (Jira)
Thukarama Prabhu created SPARK-31236:


 Summary: Spark error while consuming data from Kinesis direct end 
point
 Key: SPARK-31236
 URL: https://issues.apache.org/jira/browse/SPARK-31236
 Project: Spark
  Issue Type: Bug
  Components: DStreams, Java API
Affects Versions: 2.4.5
Reporter: Thukarama Prabhu


Here is a summary of the issue I am experiencing when using the Kinesis direct 
URL to consume data with Spark.

*Kinesis direct URL:* 
[https://kinesis-ae1.hdw.r53.deap.tv|https://kinesis-ae1.hdw.r53.deap.tv/] 
(Failing with Credential should be scoped to a valid region, not 'ae1')

*Kinesis default URL:* 
[https://kinesis.us-east-1.amazonaws.com|https://kinesis.us-east-1.amazonaws.com/]
 (Working)

Spark code for consuming data

SparkAWSCredentials credentials =
    commonService.getSparkAWSCredentials(kinApp.propConfig);
KinesisInputDStream kinesisStream = KinesisInputDStream.builder()
    .streamingContext(jssc)
    .checkpointAppName(applicationName)
    .streamName(streamName)
    .endpointUrl(endpointURL)
    .regionName(regionName)
    .initialPosition(KinesisInitialPositions.fromKinesisInitialPosition(initPosition))
    .checkpointInterval(checkpointInterval)
    .kinesisCredentials(credentials)
    .storageLevel(StorageLevel.MEMORY_AND_DISK_2()).build();

 

Spark version 2.4.4


<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kinesis-asl_2.11</artifactId>
    <version>2.4.5</version>
</dependency>

<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>amazon-kinesis-client</artifactId>
    <version>1.13.3</version>
</dependency>

<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk</artifactId>
    <version>1.11.747</version>
</dependency>

 

The Spark application works fine when I use the default URL but fails with the 
error below when I switch to the direct URL. The direct URL works when I publish 
to Kinesis; the issue occurs only when I try to consume data.

 

2020-03-24 08:43:40,650 ERROR - Caught exception while sync'ing Kinesis shards 
and leases

com.amazonaws.services.kinesis.model.AmazonKinesisException: Credential should 
be scoped to a valid region, not 'ae1'.  (Service: AmazonKinesis; Status Code: 
400; Error Code: InvalidSignatureException; Request ID: 
fb43b636-8ce2-ec77-adb7-a8ead9e038c2)

    at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1799)

    at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1383)

    at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1359)

    at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1139)

    at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:796)

    at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:764)

    at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:738)

    at 
com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:698)

    at 
com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:680)

    at 
com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:544)

    at 
com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:524)

    at 
com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:2809)

    at 
com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2776)

    at 
com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2765)

    at 
com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1557)

    at 
com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)

    at 
com.amazonaws.services.kinesis.clientlibrary.proxies.KinesisProxy.listShards(KinesisProxy.java:326)

    at 
com.amazonaws.services.kinesis.clientlibrary.proxies.KinesisProxy.getShardList(KinesisProxy.java:441)

    at 
com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisShardSyncer.getShardList(KinesisShardSyncer.java:349)

    at 
com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisShardSyncer.syncShardLeases(KinesisShardSyncer.java:159)

    at 
com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisShardSyncer.checkAndCreateLeasesForNewShards(KinesisShardSyncer.java:112)

    at 
com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardSyncTask.call(ShardSyncTask.java:84)

    at 
com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)

    

[jira] [Resolved] (SPARK-31221) Rebase all dates/timestamps in conversion in Java types

2020-03-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31221.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27980
[https://github.com/apache/spark/pull/27980]

> Rebase all dates/timestamps in conversion in Java types
> ---
>
> Key: SPARK-31221
> URL: https://issues.apache.org/jira/browse/SPARK-31221
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, the fromJavaDate(), toJavaDate(), toJavaTimestamp() and 
> fromJavaTimestamp() methods of DateTimeUtils perform rebasing only for dates before 
> the Gregorian cutover date 1582-10-15, assuming that the Gregorian calendar behaves 
> the same in the Java 7 and Java 8 APIs. The assumption is incorrect, in 
> particular for zone offsets, for instance:
> {code:scala}
> scala> java.time.ZoneId.systemDefault
> res16: java.time.ZoneId = America/Los_Angeles
> scala> java.sql.Timestamp.valueOf("1883-11-10 00:00:00").getTimezoneOffset / 
> 60.0
> warning: there was one deprecation warning; re-run with -deprecation for 
> details
> res17: Double = 8.0
> scala> 
> java.time.ZoneId.of("America/Los_Angeles").getRules.getOffset(java.time.LocalDateTime.parse("1883-11-10T00:00:00"))
> res18: java.time.ZoneOffset = -07:52:58
> {code}
> Java 7 is not accurate: America/Los_Angeles changed its time zone offset from
> {code}
> -7:52:58
> {code}
> to
> {code}
> -8:00 
> {code}
> The ticket aims to perform rebasing for all dates/timestamps independently 
> of the calendar cutover date.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31235) Separates different categories of applications

2020-03-24 Thread wangzhun (Jira)
wangzhun created SPARK-31235:


 Summary: Separates different categories of applications
 Key: SPARK-31235
 URL: https://issues.apache.org/jira/browse/SPARK-31235
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 3.0.0
Reporter: wangzhun
 Fix For: 3.0.0


Currently, an application defaults to the SPARK type.
In fact, different types of applications have different characteristics and are 
suited to different scenarios. For example: SPARK-SQL, SPARK-STREAMING.
I recommend distinguishing them by the parameter `spark.yarn.applicationType` 
so that we can more easily manage and maintain different types of applications.
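
A minimal sketch of how the tag could be supplied when building a session; note that `spark.yarn.applicationType` is the parameter proposed by this ticket, so treat the key and value below as illustrative:

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Tag the YARN application with its category (proposed config; illustrative only).
val conf = new SparkConf()
  .setAppName("daily-report")
  .set("spark.yarn.applicationType", "SPARK-SQL")

val spark = SparkSession.builder().config(conf).getOrCreate()
{code}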



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31221) Rebase all dates/timestamps in conversion in Java types

2020-03-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31221:
---

Assignee: Maxim Gekk

> Rebase all dates/timestamps in conversion in Java types
> ---
>
> Key: SPARK-31221
> URL: https://issues.apache.org/jira/browse/SPARK-31221
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> Currently, the fromJavaDate(), toJavaDate(), toJavaTimestamp() and 
> fromJavaTimestamp() methods of DateTimeUtils perform rebasing only for dates before 
> the Gregorian cutover date 1582-10-15, assuming that the Gregorian calendar behaves 
> the same in the Java 7 and Java 8 APIs. The assumption is incorrect, in 
> particular for zone offsets, for instance:
> {code:scala}
> scala> java.time.ZoneId.systemDefault
> res16: java.time.ZoneId = America/Los_Angeles
> scala> java.sql.Timestamp.valueOf("1883-11-10 00:00:00").getTimezoneOffset / 
> 60.0
> warning: there was one deprecation warning; re-run with -deprecation for 
> details
> res17: Double = 8.0
> scala> 
> java.time.ZoneId.of("America/Los_Angeles").getRules.getOffset(java.time.LocalDateTime.parse("1883-11-10T00:00:00"))
> res18: java.time.ZoneOffset = -07:52:58
> {code}
> Java 7 is not accurate: America/Los_Angeles changed its time zone offset from
> {code}
> -7:52:58
> {code}
> to
> {code}
> -8:00 
> {code}
> The ticket aims to perform rebasing for all dates/timestamps independently 
> of the calendar cutover date.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31234) ResetCommand should not wipe out all configs

2020-03-24 Thread Kent Yao (Jira)
Kent Yao created SPARK-31234:


 Summary: ResetCommand should not wipe out all configs
 Key: SPARK-31234
 URL: https://issues.apache.org/jira/browse/SPARK-31234
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Kent Yao


Currently, ResetCommand clears all configurations, including SQL configs, static 
SQL configs and Spark context level configs.
For example:
```
spark-sql> set xyz=abc;
xyz abc
spark-sql> set;
spark.app.id    local-1585055396930
spark.app.name  SparkSQL::10.242.189.214
spark.driver.host   10.242.189.214
spark.driver.port   65094
spark.executor.id   driver
spark.jars
spark.master    local[*]
spark.sql.catalogImplementation hive
spark.sql.hive.version  1.2.1
spark.submit.deployMode client
xyz abc
spark-sql> reset;
spark-sql> set;
spark-sql> set spark.sql.hive.version;
spark.sql.hive.version  1.2.1
spark-sql> set spark.app.id;
spark.app.id
```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31233) TimeoutException contains null remoteAddr

2020-03-24 Thread Yi Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Huang updated SPARK-31233:
-
Description: 
Application log: 
{code:java}
Failed to process batch org.apache.spark.rpc.RpcTimeoutException: Futures timed 
out after [800 seconds]. This timeout is controlled by spark.network.timeout:
org.apache.spark.rpc.RpcTimeout.org
$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
{code}
Driver log:
{code:java}
[block-manager-ask-thread-pool-149] WARN 
org.apache.spark.storage.BlockManagerMaster - Failed to remove RDD 25344 - 
Cannot receive any reply from null in 800 seconds. This timeout is controlled 
by spark.network.timeout org.apache.spark.rpc.RpcTimeoutException: Cannot 
receive any reply from null in 800 seconds. This timeout is controlled by 
spark.network.timeout{code}
The log message does not provide RpcAddress of the destination RpcEndpoint. It 
is due to 
{noformat}
* The `rpcAddress` may be null, in which case the endpoint is registered via a 
client-only
* connection and can only be reached via the client that sent the endpoint 
reference.{noformat}
 Solution:

Use the rpcAddress from the client of the NettyRpcEndpointRef when such an endpoint 
resides in client mode.

 

  was:
Application log: 
{code:java}
Failed to process batch org.apache.spark.rpc.RpcTimeoutException: Futures timed 
out after [800 seconds]. This timeout is controlled by spark.network.timeout:
org.apache.spark.rpc.RpcTimeout.org
$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
{code}
Driver log:
{code:java}
[block-manager-ask-thread-pool-149] WARN 
org.apache.spark.storage.BlockManagerMaster - Failed to remove RDD 25344 - 
Cannot receive any reply from null in 800 seconds. This timeout is controlled 
by spark.network.timeout org.apache.spark.rpc.RpcTimeoutException: Cannot 
receive any reply from null in 800 seconds. This timeout is controlled by 
spark.network.timeout{code}
The log message does not provide RpcAddress of the destination RpcEndpoint. It 
is due to 
{noformat}
* The `rpcAddress` may be null, in which case the endpoint is registered via a 
client-only
* connection and can only be reached via the client that sent the endpoint 
reference.{noformat}
 Solution:

Check if 'remoteReceAddr' is null. 

 


> TimeoutException contains null remoteAddr
> -
>
> Key: SPARK-31233
> URL: https://issues.apache.org/jira/browse/SPARK-31233
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4
>Reporter: Yi Huang
>Priority: Minor
>
> Application log: 
> {code:java}
> Failed to process batch org.apache.spark.rpc.RpcTimeoutException: Futures 
> timed out after [800 seconds]. This timeout is controlled by 
> spark.network.timeout:
> org.apache.spark.rpc.RpcTimeout.org
> $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
> {code}
> Driver log:
> {code:java}
> [block-manager-ask-thread-pool-149] WARN 
> org.apache.spark.storage.BlockManagerMaster - Failed to remove RDD 25344 - 
> Cannot receive any reply from null in 800 seconds. This timeout is controlled 
> by spark.network.timeout org.apache.spark.rpc.RpcTimeoutException: Cannot 
> receive any reply from null in 800 seconds. This timeout is controlled by 
> spark.network.timeout{code}
> The log message does not provide RpcAddress of the destination RpcEndpoint. 
> It is due to 
> {noformat}
> * The `rpcAddress` may be null, in which case the endpoint is registered via 
> a client-only
> * connection and can only be reached via the client that sent the endpoint 
> reference.{noformat}
>  Solution:
> Use the rpcAddress from the client of the NettyRpcEndpointRef when such an endpoint 
> resides in client mode.
>  
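
For illustration, a minimal sketch of the kind of null-safe handling described above; the names and signature are hypothetical placeholders, not the actual Spark internals:

{code:scala}
import java.util.concurrent.TimeoutException

// remoteAddr may be null for client-only endpoints; fall back to a descriptive label.
def timeoutError(remoteAddr: AnyRef, timeoutSecs: Long): TimeoutException = {
  val target = Option(remoteAddr).map(_.toString)
    .getOrElse("a client-only endpoint (no RpcAddress)")
  new TimeoutException(s"Cannot receive any reply from $target in $timeoutSecs seconds")
}
{code}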



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31233) TimeoutException contains null remoteAddr

2020-03-24 Thread Yi Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Huang updated SPARK-31233:
-
Description: 
Application log: 
{code:java}
Failed to process batch org.apache.spark.rpc.RpcTimeoutException: Futures timed 
out after [800 seconds]. This timeout is controlled by spark.network.timeout:
org.apache.spark.rpc.RpcTimeout.org
$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
{code}
Driver log:
{code:java}
[block-manager-ask-thread-pool-149] WARN 
org.apache.spark.storage.BlockManagerMaster - Failed to remove RDD 25344 - 
Cannot receive any reply from null in 800 seconds. This timeout is controlled 
by spark.network.timeout org.apache.spark.rpc.RpcTimeoutException: Cannot 
receive any reply from null in 800 seconds. This timeout is controlled by 
spark.network.timeout{code}
The log message does not provide RpcAddress of the destination RpcEndpoint. It 
is due to 
{noformat}
* The `rpcAddress` may be null, in which case the endpoint is registered via a 
client-only
* connection and can only be reached via the client that sent the endpoint 
reference.{noformat}
 Solution:

Check if 'remoteReceAddr' is null. 

 

  was:
Application log: 
{code:java}
Failed to process batch org.apache.spark.rpc.RpcTimeoutException: Futures timed 
out after [800 seconds]. This timeout is controlled by spark.network.timeout:
org.apache.spark.rpc.RpcTimeout.org
$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
{code}
Driver log:
{code:java}
[block-manager-ask-thread-pool-149] WARN 
org.apache.spark.storage.BlockManagerMaster - Failed to remove RDD 25344 - 
Cannot receive any reply from null in 800 seconds. This timeout is controlled 
by spark.network.timeout org.apache.spark.rpc.RpcTimeoutException: Cannot 
receive any reply from null in 800 seconds. This timeout is controlled by 
spark.network.timeout{code}
The log message does not provide RpcAddress of the destination RpcEndpoint. It 
is due to 
{noformat}
* The `rpcAddress` may be null, in which case the endpoint is registered via a 
client-only
* connection and can only be reached via the client that sent the endpoint 
reference.{noformat}
 

 


> TimeoutException contains null remoteAddr
> -
>
> Key: SPARK-31233
> URL: https://issues.apache.org/jira/browse/SPARK-31233
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4
>Reporter: Yi Huang
>Priority: Minor
>
> Application log: 
> {code:java}
> Failed to process batch org.apache.spark.rpc.RpcTimeoutException: Futures 
> timed out after [800 seconds]. This timeout is controlled by 
> spark.network.timeout:
> org.apache.spark.rpc.RpcTimeout.org
> $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
> {code}
> Driver log:
> {code:java}
> [block-manager-ask-thread-pool-149] WARN 
> org.apache.spark.storage.BlockManagerMaster - Failed to remove RDD 25344 - 
> Cannot receive any reply from null in 800 seconds. This timeout is controlled 
> by spark.network.timeout org.apache.spark.rpc.RpcTimeoutException: Cannot 
> receive any reply from null in 800 seconds. This timeout is controlled by 
> spark.network.timeout{code}
> The log message does not provide RpcAddress of the destination RpcEndpoint. 
> It is due to 
> {noformat}
> * The `rpcAddress` may be null, in which case the endpoint is registered via 
> a client-only
> * connection and can only be reached via the client that sent the endpoint 
> reference.{noformat}
>  Solution:
> Check if 'remoteReceAddr' is null. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31233) TimeoutException contains null remoteAddr

2020-03-24 Thread Yi Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Huang updated SPARK-31233:
-
Description: 
Application log: 
{code:java}
Failed to process batch org.apache.spark.rpc.RpcTimeoutException: Futures timed 
out after [800 seconds]. This timeout is controlled by spark.network.timeout:
org.apache.spark.rpc.RpcTimeout.org
$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
{code}
Driver log:
{code:java}
[block-manager-ask-thread-pool-149] WARN 
org.apache.spark.storage.BlockManagerMaster - Failed to remove RDD 25344 - 
Cannot receive any reply from null in 800 seconds. This timeout is controlled 
by spark.network.timeout org.apache.spark.rpc.RpcTimeoutException: Cannot 
receive any reply from null in 800 seconds. This timeout is controlled by 
spark.network.timeout{code}
The log message does not provide RpcAddress of the destination RpcEndpoint. It 
is due to 
{noformat}
* The `rpcAddress` may be null, in which case the endpoint is registered via a 
client-only
* connection and can only be reached via the client that sent the endpoint 
reference.{noformat}
 

 

  was:
 

 

 
{code:java}
Failed to process batch org.apache.spark.rpc.RpcTimeoutException: Futures timed 
out after [800 seconds]. This timeout is controlled by spark.network.timeout:
org.apache.spark.rpc.RpcTimeout.org
$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
{code}
 


> TimeoutException contains null remoteAddr
> -
>
> Key: SPARK-31233
> URL: https://issues.apache.org/jira/browse/SPARK-31233
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4
>Reporter: Yi Huang
>Priority: Minor
>
> Application log: 
> {code:java}
> Failed to process batch org.apache.spark.rpc.RpcTimeoutException: Futures 
> timed out after [800 seconds]. This timeout is controlled by 
> spark.network.timeout:
> org.apache.spark.rpc.RpcTimeout.org
> $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
> {code}
> Driver log:
> {code:java}
> [block-manager-ask-thread-pool-149] WARN 
> org.apache.spark.storage.BlockManagerMaster - Failed to remove RDD 25344 - 
> Cannot receive any reply from null in 800 seconds. This timeout is controlled 
> by spark.network.timeout org.apache.spark.rpc.RpcTimeoutException: Cannot 
> receive any reply from null in 800 seconds. This timeout is controlled by 
> spark.network.timeout{code}
> The log message does not provide RpcAddress of the destination RpcEndpoint. 
> It is due to 
> {noformat}
> * The `rpcAddress` may be null, in which case the endpoint is registered via 
> a client-only
> * connection and can only be reached via the client that sent the endpoint 
> reference.{noformat}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31233) TimeoutException contains null remoteAddr

2020-03-24 Thread Yi Huang (Jira)
Yi Huang created SPARK-31233:


 Summary: TimeoutException contains null remoteAddr
 Key: SPARK-31233
 URL: https://issues.apache.org/jira/browse/SPARK-31233
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.4
Reporter: Yi Huang


 

 

 
{code:java}
Failed to process batch org.apache.spark.rpc.RpcTimeoutException: Futures timed 
out after [800 seconds]. This timeout is controlled by spark.network.timeout:
org.apache.spark.rpc.RpcTimeout.org
$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31232) Specify formats of `spark.sql.session.timeZone`

2020-03-24 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-31232:
--

 Summary: Specify formats of `spark.sql.session.timeZone`
 Key: SPARK-31232
 URL: https://issues.apache.org/jira/browse/SPARK-31232
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Affects Versions: 2.4.5, 3.0.0
Reporter: Maxim Gekk


There are two distinct types of ID (see 
https://docs.oracle.com/javase/8/docs/api/java/time/ZoneId.html):
# Fixed offsets - a fully resolved offset from UTC/Greenwich, that uses the 
same offset for all local date-times
# Geographical regions - an area where a specific set of rules for finding the 
offset from UTC/Greenwich apply

For example, three-letter time zone IDs are ambiguous and depend on the locale. 
They have already been deprecated in the JDK, see 
https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html :
{code}
For compatibility with JDK 1.1.x, some other three-letter time zone IDs (such 
as "PST", "CTT", "AST") are also supported. However, their use is deprecated 
because the same abbreviation is often used for multiple time zones (for 
example, "CST" could be U.S. "Central Standard Time" and "China Standard 
Time"), and the Java platform can then only recognize one of them.
{code}

The ticket aims to specify the accepted formats of the SQL config 
*spark.sql.session.timeZone* in the two forms mentioned above.
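
A quick illustration of the two forms, using example values (the config itself already exists; only the values are illustrative):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Geographical region ID: the offset (including DST transitions) follows the tz database rules.
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

// Fixed offset from UTC/Greenwich: the same offset applies to all local date-times.
spark.conf.set("spark.sql.session.timeZone", "UTC+08:00")
{code}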





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31231) Support setuptools 46.1.0+ in PySpark packaging

2020-03-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31231:
-
Priority: Blocker  (was: Critical)

> Support setuptools 46.1.0+ in PySpark packaging
> ---
>
> Key: SPARK-31231
> URL: https://issues.apache.org/jira/browse/SPARK-31231
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Blocker
>
> The PIP packaging test started to fail (see 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120218/testReport/)
> as of the setuptools 46.1.0 release.
> In https://github.com/pypa/setuptools/issues/1424, they decided not to keep 
> the file modes in {{package_data}}. In the PySpark pip installation, we keep the 
> executable scripts in {{package_data}} 
> https://github.com/apache/spark/blob/master/python/setup.py#L199-L200, and 
> expose their symbolic links as executable scripts.
> So, the symbolic links (or copied scripts) execute the scripts copied from 
> {{package_data}}, which no longer keep their modes:
> {code}
> /tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: 
> /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: 
> Permission denied
> /tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: exec: 
> /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: 
> cannot execute: Permission denied
> {code}
> The current issue is being tracked at 
> https://github.com/pypa/setuptools/issues/2041



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31231) Support setuptools 46.1.0+ in PySpark packaging

2020-03-24 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-31231:


 Summary: Support setuptools 46.1.0+ in PySpark packaging
 Key: SPARK-31231
 URL: https://issues.apache.org/jira/browse/SPARK-31231
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.5, 3.0.0, 3.1.0
Reporter: Hyukjin Kwon


The PIP packaging test started to fail (see 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120218/testReport/)
as of the setuptools 46.1.0 release.

In https://github.com/pypa/setuptools/issues/1424, they decided not to keep 
the file modes in {{package_data}}. In the PySpark pip installation, we keep the 
executable scripts in {{package_data}} 
https://github.com/apache/spark/blob/master/python/setup.py#L199-L200, and 
expose their symbolic links as executable scripts.

So, the symbolic links (or copied scripts) execute the scripts copied from 
{{package_data}}, which no longer keep their modes:

{code}
/tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: 
/tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: 
Permission denied
/tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: exec: 
/tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: 
cannot execute: Permission denied
{code}

The current issue is being tracked at 
https://github.com/pypa/setuptools/issues/2041




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-31220) repartition obeys spark.sql.adaptive.coalescePartitions.initialPartitionNum when spark.sql.adaptive.enabled

2020-03-24 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-31220:

Comment: was deleted

(was: I'm working on.)

> repartition obeys spark.sql.adaptive.coalescePartitions.initialPartitionNum 
> when spark.sql.adaptive.enabled
> ---
>
> Key: SPARK-31220
> URL: https://issues.apache.org/jira/browse/SPARK-31220
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:scala}
> spark.sql("CREATE TABLE spark_31220(id int)")
> spark.sql("set 
> spark.sql.adaptive.coalescePartitions.initialPartitionNum=1000")
> spark.sql("set spark.sql.adaptive.enabled=true")
> {code}
> {noformat}
> scala> spark.sql("SELECT id from spark_31220 GROUP BY id").explain
> == Physical Plan ==
> AdaptiveSparkPlan(isFinalPlan=false)
> +- HashAggregate(keys=[id#5], functions=[])
>+- Exchange hashpartitioning(id#5, 1000), true, [id=#171]
>   +- HashAggregate(keys=[id#5], functions=[])
>  +- FileScan parquet default.spark_31220[id#5] Batched: true, 
> DataFilters: [], Format: Parquet, Location: 
> InMemoryFileIndex[file:/root/opensource/apache-spark/spark-warehouse/spark_31220],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> scala> spark.sql("SELECT id from spark_31220 DISTRIBUTE BY id").explain
> == Physical Plan ==
> AdaptiveSparkPlan(isFinalPlan=false)
> +- Exchange hashpartitioning(id#5, 200), false, [id=#179]
>+- FileScan parquet default.spark_31220[id#5] Batched: true, DataFilters: 
> [], Format: Parquet, Location: 
> InMemoryFileIndex[file:/root/opensource/apache-spark/spark-warehouse/spark_31220],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org