[jira] [Created] (SPARK-25849) Improve document for task cancellation.

2018-10-25 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-25849:


 Summary: Improve document for task cancellation.
 Key: SPARK-25849
 URL: https://issues.apache.org/jira/browse/SPARK-25849
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.4.0
Reporter: Shixiong Zhu


As suggested by [~markhamstra] in 
https://github.com/apache/spark/pull/22771#discussion_r228371144, we should 
update the documentation to clarify how task cancellation works in Spark.
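
A minimal Scala sketch of the behavior such a clarification would need to cover 
(illustration only, not proposed wording for the docs): cancellation is requested 
per job group, and running task threads are interrupted only when 
interruptOnCancel is set.

{noformat}
import org.apache.spark.sql.SparkSession

object TaskCancellationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cancellation-sketch").getOrCreate()
    val sc = spark.sparkContext

    val runner = new Thread(() => {
      // setJobGroup is per submitting thread; it tags every job started from here.
      sc.setJobGroup("slow-group", "demo job", interruptOnCancel = true)
      try {
        sc.parallelize(1 to 1000000, 100).map { i => Thread.sleep(10); i }.count()
      } catch {
        case e: Exception => println(s"Job ended early: ${e.getMessage}")
      }
    })
    runner.start()

    Thread.sleep(5000)
    // Asks the scheduler to cancel every active job in the group. Running tasks
    // are interrupted only because interruptOnCancel was set above; otherwise
    // cancellation waits for each running task to finish its current work.
    sc.cancelJobGroup("slow-group")
    runner.join()
    spark.stop()
  }
}
{noformat}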



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20780) Spark Kafka10 Consumer Hangs

2018-10-25 Thread Donghyun Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664686#comment-16664686
 ] 

Donghyun Kim commented on SPARK-20780:
--

Hello [~jayadeepj]. I am facing the same problem. Have you found a solution to 
this issue?

> Spark Kafka10 Consumer Hangs
> 
>
> Key: SPARK-20780
> URL: https://issues.apache.org/jira/browse/SPARK-20780
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
> Environment: Spark 2.1.0
> Spark Streaming Kafka 010
> Yarn - Cluster Mode
> CDH 5.8.4
> CentOS Linux release 7.2
>Reporter: jayadeepj
>Priority: Major
> Attachments: streaming_1.png, streaming_2.png, tasks_timing_out_3.png
>
>
> We have recently upgraded our streaming app with Direct Stream to Spark 2 
> (spark-streaming-kafka-0-10 - 2.1.0) with Kafka version 0.10.0.0 and the 0-10 
> consumer. We see abnormal delays after the application has run for a couple of 
> hours or has consumed approximately 5 million records.
> See screenshots 1 & 2.
> There is a sudden jump in the processing time from ~15 seconds (usual for this 
> app) to ~3 minutes, and from then on the processing time keeps degrading.
> We have seen that the delay is due to certain tasks taking exactly the duration 
> configured in the Kafka consumer 'request.timeout.ms'. We have tested this by 
> varying the timeout property across different values.
> See screenshot 3.
> I think the get(offset: Long, timeout: Long): ConsumerRecord[K, V] method and 
> the subsequent poll(timeout) call in CachedKafkaConsumer.scala are actually 
> timing out on some of the partitions without reading data, but the executor 
> logs the task as successfully completed after exactly the timeout duration. 
> Note that most other tasks complete successfully in milliseconds. The timeout 
> is most likely from org.apache.kafka.clients.consumer.KafkaConsumer; we did 
> not observe any network latency difference.
> We have observed this across multiple clusters and multiple apps, with and 
> without TLS/SSL. Spark 1.6 with the 0-8 consumer seems to be fine, with 
> consistent performance.
> 17/05/17 10:30:06 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 446288
> 17/05/17 10:30:06 INFO executor.Executor: Running task 11.0 in stage 5663.0 
> (TID 446288)
> 17/05/17 10:30:06 INFO kafka010.KafkaRDD: Computing topic XX-XXX-XX, 
> partition 0 offsets 776843 -> 779591
> 17/05/17 10:30:06 INFO kafka010.CachedKafkaConsumer: Initial fetch for 
> spark-executor-default1 XX-XXX-XX 0 776843
> 17/05/17 10:30:56 INFO executor.Executor: Finished task 11.0 in stage 5663.0 
> (TID 446288). 1699 bytes result sent to driver
> 17/05/17 10:30:56 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 446329
> 17/05/17 10:30:56 INFO executor.Executor: Running task 0.0 in stage 5667.0 
> (TID 446329)
> 17/05/17 10:30:56 INFO spark.MapOutputTrackerWorker: Updating epoch to 3116 
> and clearing cache
> 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Started reading broadcast 
> variable 6807
> 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807_piece0 stored 
> as bytes in memory (estimated size 13.1 KB, free 4.1 GB)
> 17/05/17 10:30:56 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 6807 took 4 ms
> 17/05/17 10:30:56 INFO memory.MemoryStore: Block broadcast_6807 stored as 
> values in m
> We can see that the timestamps of the log statements differ by exactly the 
> timeout duration.
> Our consumer config is below.
> 17/05/17 12:33:13 INFO dstream.ForEachDStream: Initialized and validated 
> org.apache.spark.streaming.dstream.ForEachDStream@1171dde4
> 17/05/17 12:33:13 INFO consumer.ConsumerConfig: ConsumerConfig values:
>   metric.reporters = []
>   metadata.max.age.ms = 30
>   partition.assignment.strategy = 
> [org.apache.kafka.clients.consumer.RangeAssignor]
>   reconnect.backoff.ms = 50
>   sasl.kerberos.ticket.renew.window.factor = 0.8
>   max.partition.fetch.bytes = 1048576
>   bootstrap.servers = [x.xxx.xxx:9092]
>   ssl.keystore.type = JKS
>   enable.auto.commit = true
>   sasl.mechanism = GSSAPI
>   interceptor.classes = null
>   exclude.internal.topics = true
>   ssl.truststore.password = null
>   client.id =
>   ssl.endpoint.identification.algorithm = null
>   max.poll.records = 2147483647
>   check.crcs = true
>   request.timeout.ms = 5
>   heartbeat.interval.ms = 3000
>   auto.commit.interval.ms = 5000
>   receive.buffer.bytes = 65536
>   ssl.truststore.type = JKS
>   ssl.truststore.location = null
>   ssl.keystore.password = null
>   fetch.min.bytes = 1
>   send.buffer.bytes = 131072
>   value.deserializer = class 
> 
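
For reference, a minimal sketch (with hypothetical broker, topic, and group names) 
of where the timeout the reporter varied is configured: "request.timeout.ms" is 
passed through the 0-10 direct stream's kafkaParams straight to the cached 
KafkaConsumer used by the tasks.

{noformat}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object DirectStreamTimeoutSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("timeout-sketch"), Seconds(15))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",              // placeholder
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "default1",                           // placeholder
      "enable.auto.commit" -> (true: java.lang.Boolean),
      // The property the reporter experimented with: a fetch that returns no
      // data is only given up after this duration.
      "request.timeout.ms" -> "50000"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("some-topic"), kafkaParams))

    stream.foreachRDD(rdd => println(s"Batch record count: ${rdd.count()}"))
    ssc.start()
    ssc.awaitTermination()
  }
}
{noformat}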

[jira] [Updated] (SPARK-25848) Refactor CSVBenchmarks to use main method

2018-10-25 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-25848:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-25475

> Refactor CSVBenchmarks to use main method
> -
>
> Key: SPARK-25848
> URL: https://issues.apache.org/jira/browse/SPARK-25848
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 2.4.1
>Reporter: caoxuewen
>Priority: Major
>
> use spark-submit:
> bin/spark-submit --class  
> org.apache.spark.sql.execution.datasources.csv.CSVBenchmarks --jars 
> ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar 
> ./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
> Generate benchmark result:
> SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
> org.apache.spark.sql.execution.datasources.csv.CSVBenchmarks"
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25848) Refactor CSVBenchmarks to use main method

2018-10-25 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-25848:

Affects Version/s: (was: 2.4.1)
   3.0.0

> Refactor CSVBenchmarks to use main method
> -
>
> Key: SPARK-25848
> URL: https://issues.apache.org/jira/browse/SPARK-25848
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: caoxuewen
>Priority: Major
>
> use spark-submit:
> bin/spark-submit --class  
> org.apache.spark.sql.execution.datasources.csv.CSVBenchmarks --jars 
> ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar 
> ./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
> Generate benchmark result:
> SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
> org.apache.spark.sql.execution.datasources.csv.CSVBenchmarks"
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25847) Refactor JSONBenchmarks to use main method

2018-10-25 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-25847:

Affects Version/s: (was: 2.4.1)
   3.0.0

> Refactor JSONBenchmarks to use main method
> --
>
> Key: SPARK-25847
> URL: https://issues.apache.org/jira/browse/SPARK-25847
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: caoxuewen
>Priority: Major
>
> use spark-submit:
> bin/spark-submit --class  
> org.apache.spark.sql.execution.datasources.json.JSONBenchmarks --jars 
> ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar 
> ./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
> Generate benchmark result:
> SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
> org.apache.spark.sql.execution.datasources.json.JSONBenchmarks"
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25847) Refactor JSONBenchmarks to use main method

2018-10-25 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-25847:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-25475

> Refactor JSONBenchmarks to use main method
> --
>
> Key: SPARK-25847
> URL: https://issues.apache.org/jira/browse/SPARK-25847
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 2.4.1
>Reporter: caoxuewen
>Priority: Major
>
> use spark-submit:
> bin/spark-submit --class  
> org.apache.spark.sql.execution.datasources.json.JSONBenchmarks --jars 
> ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar 
> ./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
> Generate benchmark result:
> SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
> org.apache.spark.sql.execution.datasources.json.JSONBenchmarks"
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25842) Deprecate APIs introduced in SPARK-21608

2018-10-25 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25842.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22841
[https://github.com/apache/spark/pull/22841]

> Deprecate APIs introduced in SPARK-21608
> 
>
> Key: SPARK-25842
> URL: https://issues.apache.org/jira/browse/SPARK-25842
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
> Fix For: 2.4.0
>
>
> See the parent ticket for more information. The newly introduced API is not 
> only confusing but also doesn't work. We should deprecate it in 2.4 and 
> introduce a new version in 3.0.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25822) Fix a race condition when releasing a Python worker

2018-10-25 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-25822.
---
   Resolution: Fixed
Fix Version/s: 2.4.0
   2.3.3

Issue resolved by pull request 22816
https://github.com/apache/spark/pull/22816

> Fix a race condition when releasing a Python worker
> ---
>
> Key: SPARK-25822
> URL: https://issues.apache.org/jira/browse/SPARK-25822
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.3.3, 2.4.0
>
>
> There is a race condition when releasing a Python worker. If 
> "ReaderIterator.handleEndOfDataSection" is not running in the task thread, 
> then when a task is terminated early (for example, by "take(N)"), the task 
> completion listener may close the worker while "handleEndOfDataSection" can 
> still put the worker back into the worker pool for reuse.
> https://github.com/zsxwing/spark/commit/0e07b483d2e7c68f3b5c3c118d0bf58c501041b7
>  is a patch to reproduce this issue.
> I also found a user reported this in the mail list: 
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCAAUq=h+yluepd23nwvq13ms5hostkhx3ao4f4zqv6sgo5zm...@mail.gmail.com%3E
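
This is not Spark's actual code, but a stripped-down Scala illustration of the 
interleaving described above, with a hypothetical Worker and idle pool standing 
in for the Python worker machinery:

{noformat}
import java.util.concurrent.ConcurrentLinkedQueue

class Worker {
  @volatile var closed = false
  def close(): Unit = closed = true
}

object WorkerReleaseRace {
  // Pool of idle workers that later tasks borrow from.
  val idlePool = new ConcurrentLinkedQueue[Worker]()

  def main(args: Array[String]): Unit = {
    val worker = new Worker

    // Thread A: task completion listener fired by an early-terminated task (take(N)).
    val completionListener = new Thread(() => worker.close())
    // Thread B: end-of-data handler that still believes the worker is reusable.
    val endOfDataHandler = new Thread(() => idlePool.add(worker))

    completionListener.start(); endOfDataHandler.start()
    completionListener.join(); endOfDataHandler.join()

    // Without coordination between the two paths, a closed worker can land in
    // the reuse pool and poison the next task that borrows it.
    Option(idlePool.poll()).foreach(w => println(s"Reused worker, closed = ${w.closed}"))
  }
}
{noformat}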



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25848) Refactor CSVBenchmarks to use main method

2018-10-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25848:


Assignee: Apache Spark

> Refactor CSVBenchmarks to use main method
> -
>
> Key: SPARK-25848
> URL: https://issues.apache.org/jira/browse/SPARK-25848
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.1
>Reporter: caoxuewen
>Assignee: Apache Spark
>Priority: Major
>
> use spark-submit:
> bin/spark-submit --class  
> org.apache.spark.sql.execution.datasources.csv.CSVBenchmarks --jars 
> ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar 
> ./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
> Generate benchmark result:
> SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
> org.apache.spark.sql.execution.datasources.csv.CSVBenchmarks"
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25848) Refactor CSVBenchmarks to use main method

2018-10-25 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664637#comment-16664637
 ] 

Apache Spark commented on SPARK-25848:
--

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22845

> Refactor CSVBenchmarks to use main method
> -
>
> Key: SPARK-25848
> URL: https://issues.apache.org/jira/browse/SPARK-25848
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.1
>Reporter: caoxuewen
>Priority: Major
>
> use spark-submit:
> bin/spark-submit --class  
> org.apache.spark.sql.execution.datasources.csv.CSVBenchmarks --jars 
> ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar 
> ./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
> Generate benchmark result:
> SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
> org.apache.spark.sql.execution.datasources.csv.CSVBenchmarks"
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25848) Refactor CSVBenchmarks to use main method

2018-10-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25848:


Assignee: (was: Apache Spark)

> Refactor CSVBenchmarks to use main method
> -
>
> Key: SPARK-25848
> URL: https://issues.apache.org/jira/browse/SPARK-25848
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.1
>Reporter: caoxuewen
>Priority: Major
>
> use spark-submit:
> bin/spark-submit --class  
> org.apache.spark.sql.execution.datasources.csv.CSVBenchmarks --jars 
> ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar 
> ./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
> Generate benchmark result:
> SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
> org.apache.spark.sql.execution.datasources.csv.CSVBenchmarks"
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25848) Refactor CSVBenchmarks to use main method

2018-10-25 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664636#comment-16664636
 ] 

Apache Spark commented on SPARK-25848:
--

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22845

> Refactor CSVBenchmarks to use main method
> -
>
> Key: SPARK-25848
> URL: https://issues.apache.org/jira/browse/SPARK-25848
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.1
>Reporter: caoxuewen
>Priority: Major
>
> use spark-submit:
> bin/spark-submit --class  
> org.apache.spark.sql.execution.datasources.csv.CSVBenchmarks --jars 
> ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar 
> ./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
> Generate benchmark result:
> SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
> org.apache.spark.sql.execution.datasources.csv.CSVBenchmarks"
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25848) Refactor CSVBenchmarks to use main method

2018-10-25 Thread caoxuewen (JIRA)
caoxuewen created SPARK-25848:
-

 Summary: Refactor CSVBenchmarks to use main method
 Key: SPARK-25848
 URL: https://issues.apache.org/jira/browse/SPARK-25848
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 2.4.1
Reporter: caoxuewen


use spark-submit:
bin/spark-submit --class  
org.apache.spark.sql.execution.datasources.csv.CSVBenchmarks --jars 
./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar 
./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
Generate benchmark result:
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
org.apache.spark.sql.execution.datasources.csv.CSVBenchmarks"
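
A rough Scala sketch of the target shape, with a hypothetical 
MainMethodBenchmarkBase standing in for the shared base class used across the 
SPARK-25475 umbrella (the real base class, method signatures, and output file 
name may differ):

{noformat}
import java.io.{File, FileOutputStream, PrintStream}

// Hypothetical stand-in for the shared benchmark base class.
abstract class MainMethodBenchmarkBase {
  // Subclasses register their measurements here instead of inside a test suite.
  def runBenchmarkSuite(): Unit

  def main(args: Array[String]): Unit = {
    // With SPARK_GENERATE_BENCHMARK_FILES=1, results are written to a checked-in
    // file so they can be regenerated reproducibly; otherwise they go to stdout.
    val out =
      if (sys.env.get("SPARK_GENERATE_BENCHMARK_FILES").contains("1")) {
        new PrintStream(new FileOutputStream(new File("benchmarks/CSVBenchmark-results.txt")))
      } else {
        System.out
      }
    Console.withOut(out) { runBenchmarkSuite() }
    if (out ne System.out) out.close()
  }
}

object CSVBenchmarksSketch extends MainMethodBenchmarkBase {
  override def runBenchmarkSuite(): Unit = {
    // A real implementation would build a local SparkSession here and time the
    // CSV read paths; this placeholder only marks where that code goes.
    println("CSV parsing benchmarks would run here")
  }
}
{noformat}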
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25847) Refactor JSONBenchmarks to use main method

2018-10-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25847:


Assignee: Apache Spark

> Refactor JSONBenchmarks to use main method
> --
>
> Key: SPARK-25847
> URL: https://issues.apache.org/jira/browse/SPARK-25847
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.1
>Reporter: caoxuewen
>Assignee: Apache Spark
>Priority: Major
>
> use spark-submit:
> bin/spark-submit --class  
> org.apache.spark.sql.execution.datasources.json.JSONBenchmarks --jars 
> ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar 
> ./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
> Generate benchmark result:
> SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
> org.apache.spark.sql.execution.datasources.json.JSONBenchmarks"
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25847) Refactor JSONBenchmarks to use main method

2018-10-25 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664633#comment-16664633
 ] 

Apache Spark commented on SPARK-25847:
--

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22844

> Refactor JSONBenchmarks to use main method
> --
>
> Key: SPARK-25847
> URL: https://issues.apache.org/jira/browse/SPARK-25847
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.1
>Reporter: caoxuewen
>Priority: Major
>
> use spark-submit:
> bin/spark-submit --class  
> org.apache.spark.sql.execution.datasources.json.JSONBenchmarks --jars 
> ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar 
> ./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
> Generate benchmark result:
> SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
> org.apache.spark.sql.execution.datasources.json.JSONBenchmarks"
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25847) Refactor JSONBenchmarks to use main method

2018-10-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25847:


Assignee: (was: Apache Spark)

> Refactor JSONBenchmarks to use main method
> --
>
> Key: SPARK-25847
> URL: https://issues.apache.org/jira/browse/SPARK-25847
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.1
>Reporter: caoxuewen
>Priority: Major
>
> use spark-submit:
> bin/spark-submit --class  
> org.apache.spark.sql.execution.datasources.json.JSONBenchmarks --jars 
> ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar 
> ./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
> Generate benchmark result:
> SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
> org.apache.spark.sql.execution.datasources.json.JSONBenchmarks"
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15545) R remove non-exported unused methods, like jsonRDD

2018-10-25 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-15545.
--
Resolution: Duplicate

> R remove non-exported unused methods, like jsonRDD
> --
>
> Key: SPARK-15545
> URL: https://issues.apache.org/jira/browse/SPARK-15545
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.2
>Reporter: Felix Cheung
>Priority: Minor
>
> Need to review what should be removed.
> One reason not to remove this right away is that we have been talking 
> about calling internal methods via `SparkR:::jsonRDD` for this and other RDD 
> methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15545) R remove non-exported unused methods, like jsonRDD

2018-10-25 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-15545:
-
Affects Version/s: 2.3.2
External issue ID: SPARK-12172

> R remove non-exported unused methods, like jsonRDD
> --
>
> Key: SPARK-15545
> URL: https://issues.apache.org/jira/browse/SPARK-15545
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.2
>Reporter: Felix Cheung
>Priority: Minor
>
> Need to review what should be removed.
> One reason not to remove this right away is that we have been talking 
> about calling internal methods via `SparkR:::jsonRDD` for this and other RDD 
> methods.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25847) Refactor JSONBenchmarks to use main method

2018-10-25 Thread caoxuewen (JIRA)
caoxuewen created SPARK-25847:
-

 Summary: Refactor JSONBenchmarks to use main method
 Key: SPARK-25847
 URL: https://issues.apache.org/jira/browse/SPARK-25847
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 2.4.1
Reporter: caoxuewen


use spark-submit:
bin/spark-submit --class  
org.apache.spark.sql.execution.datasources.json.JSONBenchmarks --jars 
./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar 
./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
Generate benchmark result:
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
org.apache.spark.sql.execution.datasources.json.JSONBenchmarks"
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12172) Consider removing SparkR internal RDD APIs

2018-10-25 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664631#comment-16664631
 ] 

Felix Cheung edited comment on SPARK-12172 at 10/26/18 4:11 AM:


ok, what's our option for spark.lapply?

I'll consider at least removing all other methods that are not used for 
spark.lapply in spark 3.0.0


was (Author: felixcheung):
ok, what's our option for spark.lapply?

> Consider removing SparkR internal RDD APIs
> --
>
> Key: SPARK-12172
> URL: https://issues.apache.org/jira/browse/SPARK-12172
> Project: Spark
>  Issue Type: Task
>  Components: SparkR
>Reporter: Felix Cheung
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12172) Consider removing SparkR internal RDD APIs

2018-10-25 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664631#comment-16664631
 ] 

Felix Cheung commented on SPARK-12172:
--

ok, what's our option for spark.lapply?

> Consider removing SparkR internal RDD APIs
> --
>
> Key: SPARK-12172
> URL: https://issues.apache.org/jira/browse/SPARK-12172
> Project: Spark
>  Issue Type: Task
>  Components: SparkR
>Reporter: Felix Cheung
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions

2018-10-25 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664628#comment-16664628
 ] 

Felix Cheung commented on SPARK-16611:
--

ping - we are going to consider removing RDD methods in spark 3.0.0

> Expose several hidden DataFrame/RDD functions
> -
>
> Key: SPARK-16611
> URL: https://issues.apache.org/jira/browse/SPARK-16611
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Oscar D. Lara Yejas
>Priority: Major
>
> Expose the following functions:
> - lapply or map
> - lapplyPartition or mapPartition
> - flatMap
> - RDD
> - toRDD
> - getJRDD
> - cleanup.jobj
> cc:
> [~javierluraschi] [~j...@rstudio.com] [~shivaram]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16611) Expose several hidden DataFrame/RDD functions

2018-10-25 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664630#comment-16664630
 ] 

Felix Cheung commented on SPARK-16611:
--

see SPARK-12172

> Expose several hidden DataFrame/RDD functions
> -
>
> Key: SPARK-16611
> URL: https://issues.apache.org/jira/browse/SPARK-16611
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Oscar D. Lara Yejas
>Priority: Major
>
> Expose the following functions:
> - lapply or map
> - lapplyPartition or mapPartition
> - flatMap
> - RDD
> - toRDD
> - getJRDD
> - cleanup.jobj
> cc:
> [~javierluraschi] [~j...@rstudio.com] [~shivaram]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16693) Remove R deprecated methods

2018-10-25 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664627#comment-16664627
 ] 

Apache Spark commented on SPARK-16693:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/22843

> Remove R deprecated methods
> ---
>
> Key: SPARK-16693
> URL: https://issues.apache.org/jira/browse/SPARK-16693
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Felix Cheung
>Priority: Major
>
> For methods deprecated in Spark 2.0.0, we should remove them in 2.1.0 -> 3.0.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16693) Remove R deprecated methods

2018-10-25 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664626#comment-16664626
 ] 

Felix Cheung commented on SPARK-16693:
--

rebuilt this on spark 3.0.0

> Remove R deprecated methods
> ---
>
> Key: SPARK-16693
> URL: https://issues.apache.org/jira/browse/SPARK-16693
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Felix Cheung
>Priority: Major
>
> For methods deprecated in Spark 2.0.0, we should remove them in 2.1.0 -> 3.0.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23084) Add unboundedPreceding(), unboundedFollowing() and currentRow() to PySpark

2018-10-25 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664619#comment-16664619
 ] 

Dongjoon Hyun commented on SPARK-23084:
---

Unfortunately, this is going to be reverted via 
https://github.com/apache/spark/pull/22841.

> Add unboundedPreceding(), unboundedFollowing() and currentRow() to PySpark 
> ---
>
> Key: SPARK-23084
> URL: https://issues.apache.org/jira/browse/SPARK-23084
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 2.4.0
>
>
> Add the new APIs (introduced by https://github.com/apache/spark/pull/18814) 
> to PySpark. Also update the rangeBetween API
> {noformat}
> /**
>  * Window function: returns the special frame boundary that represents the 
> first row in the
>  * window partition.
>  *
>  * @group window_funcs
>  * @since 2.3.0
>  */
>  def unboundedPreceding(): Column = Column(UnboundedPreceding)
> /**
>  * Window function: returns the special frame boundary that represents the 
> last row in the
>  * window partition.
>  *
>  * @group window_funcs
>  * @since 2.3.0
>  */
>  def unboundedFollowing(): Column = Column(UnboundedFollowing)
> /**
>  * Window function: returns the special frame boundary that represents the 
> current row in the
>  * window partition.
>  *
>  * @group window_funcs
>  * @since 2.3.0
>  */
>  def currentRow(): Column = Column(CurrentRow)
> {noformat}
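
A minimal Scala usage sketch of the Column-based boundaries quoted above (the 
same overloads that SPARK-25842 later deprecates), assuming a DataFrame df with 
dept and salary columns:

{noformat}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Frame from the start of the partition up to the current row's ordering value.
val runningTotalWindow = Window
  .partitionBy(col("dept"))
  .orderBy(col("salary"))
  .rangeBetween(unboundedPreceding(), currentRow())

val withTotals = df.withColumn("running_total", sum(col("salary")).over(runningTotalWindow))
{noformat}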



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23938) High-order function: map_zip_with(map<K, V1>, map<K, V2>, function<K, V1, V2, V3>) → map<K, V3>

2018-10-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23938:
--
Target Version/s:   (was: 2.4.0)

> High-order function: map_zip_with(map<K, V1>, map<K, V2>, function<K, V1, V2, 
> V3>) → map<K, V3>
> ---
>
> Key: SPARK-23938
> URL: https://issues.apache.org/jira/browse/SPARK-23938
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marek Novotny
>Priority: Major
> Fix For: 3.0.0
>
>
> Ref:  https://prestodb.io/docs/current/functions/map.html
> Merges the two given maps into a single map by applying function to the pair 
> of values with the same key. For keys only present in one map, NULL will be 
> passed as the value for the missing key.
> {noformat}
> SELECT map_zip_with(MAP(ARRAY[1, 2, 3], ARRAY['a', 'b', 'c']), -- {1 -> ad, 2 
> -> be, 3 -> cf}
> MAP(ARRAY[1, 2, 3], ARRAY['d', 'e', 'f']),
> (k, v1, v2) -> concat(v1, v2));
> SELECT map_zip_with(MAP(ARRAY['k1', 'k2'], ARRAY[1, 2]), -- {k1 -> ROW(1, 
> null), k2 -> ROW(2, 4), k3 -> ROW(null, 9)}
> MAP(ARRAY['k2', 'k3'], ARRAY[4, 9]),
> (k, v1, v2) -> (v1, v2));
> SELECT map_zip_with(MAP(ARRAY['a', 'b', 'c'], ARRAY[1, 8, 27]), -- {a -> a1, 
> b -> b4, c -> c9}
> MAP(ARRAY['a', 'b', 'c'], ARRAY[1, 2, 3]),
> (k, v1, v2) -> k || CAST(v1/v2 AS VARCHAR));
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23938) High-order function: map_zip_with(map<K, V1>, map<K, V2>, function<K, V1, V2, V3>) → map<K, V3>

2018-10-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23938:
--
Fix Version/s: (was: 2.4.0)
   3.0.0

> High-order function: map_zip_with(map<K, V1>, map<K, V2>, function<K, V1, V2, 
> V3>) → map<K, V3>
> ---
>
> Key: SPARK-23938
> URL: https://issues.apache.org/jira/browse/SPARK-23938
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marek Novotny
>Priority: Major
> Fix For: 3.0.0
>
>
> Ref:  https://prestodb.io/docs/current/functions/map.html
> Merges the two given maps into a single map by applying function to the pair 
> of values with the same key. For keys only present in one map, NULL will be 
> passed as the value for the missing key.
> {noformat}
> SELECT map_zip_with(MAP(ARRAY[1, 2, 3], ARRAY['a', 'b', 'c']), -- {1 -> ad, 2 
> -> be, 3 -> cf}
> MAP(ARRAY[1, 2, 3], ARRAY['d', 'e', 'f']),
> (k, v1, v2) -> concat(v1, v2));
> SELECT map_zip_with(MAP(ARRAY['k1', 'k2'], ARRAY[1, 2]), -- {k1 -> ROW(1, 
> null), k2 -> ROW(2, 4), k3 -> ROW(null, 9)}
> MAP(ARRAY['k2', 'k3'], ARRAY[4, 9]),
> (k, v1, v2) -> (v1, v2));
> SELECT map_zip_with(MAP(ARRAY['a', 'b', 'c'], ARRAY[1, 8, 27]), -- {a -> a1, 
> b -> b4, c -> c9}
> MAP(ARRAY['a', 'b', 'c'], ARRAY[1, 2, 3]),
> (k, v1, v2) -> k || CAST(v1/v2 AS VARCHAR));
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23939) High-order function: transform_keys(map<K1, V>, function<K1, V, K2>) → map<K2, V>

2018-10-25 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664616#comment-16664616
 ] 

Dongjoon Hyun commented on SPARK-23939:
---

I updated the version since this is reverted from `branch-2.4` during RC4.

> High-order function: transform_keys(map<K1, V>, function<K1, V, K2>) → 
> map<K2, V>
> 
>
> Key: SPARK-23939
> URL: https://issues.apache.org/jira/browse/SPARK-23939
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Neha Patil
>Priority: Major
> Fix For: 3.0.0
>
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map that applies function to each entry of map and transforms the 
> keys.
> {noformat}
> SELECT transform_keys(MAP(ARRAY[], ARRAY[]), (k, v) -> k + 1); -- {}
> SELECT transform_keys(MAP(ARRAY [1, 2, 3], ARRAY ['a', 'b', 'c']), (k, v) -> 
> k + 1); -- {2 -> a, 3 -> b, 4 -> c}
> SELECT transform_keys(MAP(ARRAY ['a', 'b', 'c'], ARRAY [1, 2, 3]), (k, v) -> 
> v * v); -- {1 -> 1, 4 -> 2, 9 -> 3}
> SELECT transform_keys(MAP(ARRAY ['a', 'b'], ARRAY [1, 2]), (k, v) -> k || 
> CAST(v as VARCHAR)); -- {a1 -> 1, b2 -> 2}
> SELECT transform_keys(MAP(ARRAY [1, 2], ARRAY [1.0, 1.4]), -- {one -> 1.0, 
> two -> 1.4}
>   (k, v) -> MAP(ARRAY[1, 2], ARRAY['one', 'two'])[k]);
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23938) High-order function: map_zip_with(map<K, V1>, map<K, V2>, function<K, V1, V2, V3>) → map<K, V3>

2018-10-25 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664617#comment-16664617
 ] 

Dongjoon Hyun commented on SPARK-23938:
---

I updated the version since this is reverted from `branch-2.4` during RC4.

> High-order function: map_zip_with(map<K, V1>, map<K, V2>, function<K, V1, V2, 
> V3>) → map<K, V3>
> ---
>
> Key: SPARK-23938
> URL: https://issues.apache.org/jira/browse/SPARK-23938
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marek Novotny
>Priority: Major
> Fix For: 3.0.0
>
>
> Ref:  https://prestodb.io/docs/current/functions/map.html
> Merges the two given maps into a single map by applying function to the pair 
> of values with the same key. For keys only present in one map, NULL will be 
> passed as the value for the missing key.
> {noformat}
> SELECT map_zip_with(MAP(ARRAY[1, 2, 3], ARRAY['a', 'b', 'c']), -- {1 -> ad, 2 
> -> be, 3 -> cf}
> MAP(ARRAY[1, 2, 3], ARRAY['d', 'e', 'f']),
> (k, v1, v2) -> concat(v1, v2));
> SELECT map_zip_with(MAP(ARRAY['k1', 'k2'], ARRAY[1, 2]), -- {k1 -> ROW(1, 
> null), k2 -> ROW(2, 4), k3 -> ROW(null, 9)}
> MAP(ARRAY['k2', 'k3'], ARRAY[4, 9]),
> (k, v1, v2) -> (v1, v2));
> SELECT map_zip_with(MAP(ARRAY['a', 'b', 'c'], ARRAY[1, 8, 27]), -- {a -> a1, 
> b -> b4, c -> c9}
> MAP(ARRAY['a', 'b', 'c'], ARRAY[1, 2, 3]),
> (k, v1, v2) -> k || CAST(v1/v2 AS VARCHAR));
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23940) High-order function: transform_values(map<K, V1>, function<K, V1, V2>) → map<K, V2>

2018-10-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23940:
--
Target Version/s:   (was: 2.4.0)

> High-order function: transform_values(map<K, V1>, function<K, V1, V2>) → 
> map<K, V2>
> ---
>
> Key: SPARK-23940
> URL: https://issues.apache.org/jira/browse/SPARK-23940
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Neha Patil
>Priority: Major
>  Labels: starter
> Fix For: 3.0.0
>
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map that applies function to each entry of map and transforms the 
> values.
> {noformat}
> SELECT transform_values(MAP(ARRAY[], ARRAY[]), (k, v) -> v + 1); -- {}
> SELECT transform_values(MAP(ARRAY [1, 2, 3], ARRAY [10, 20, 30]), (k, v) -> v 
> + k); -- {1 -> 11, 2 -> 22, 3 -> 33}
> SELECT transform_values(MAP(ARRAY [1, 2, 3], ARRAY ['a', 'b', 'c']), (k, v) 
> -> k * k); -- {1 -> 1, 2 -> 4, 3 -> 9}
> SELECT transform_values(MAP(ARRAY ['a', 'b'], ARRAY [1, 2]), (k, v) -> k || 
> CAST(v as VARCHAR)); -- {a -> a1, b -> b2}
> SELECT transform_values(MAP(ARRAY [1, 2], ARRAY [1.0, 1.4]), -- {1 -> 
> one_1.0, 2 -> two_1.4}
> (k, v) -> MAP(ARRAY[1, 2], ARRAY['one', 'two'])[k] || 
> '_' || CAST(v AS VARCHAR));
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23939) High-order function: transform_keys(map<K1, V>, function<K1, V, K2>) → map<K2, V>

2018-10-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23939:
--
Fix Version/s: (was: 2.4.0)
   3.0.0

> High-order function: transform_keys(map<K1, V>, function<K1, V, K2>) → 
> map<K2, V>
> 
>
> Key: SPARK-23939
> URL: https://issues.apache.org/jira/browse/SPARK-23939
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Neha Patil
>Priority: Major
> Fix For: 3.0.0
>
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map that applies function to each entry of map and transforms the 
> keys.
> {noformat}
> SELECT transform_keys(MAP(ARRAY[], ARRAY[]), (k, v) -> k + 1); -- {}
> SELECT transform_keys(MAP(ARRAY [1, 2, 3], ARRAY ['a', 'b', 'c']), (k, v) -> 
> k + 1); -- {2 -> a, 3 -> b, 4 -> c}
> SELECT transform_keys(MAP(ARRAY ['a', 'b', 'c'], ARRAY [1, 2, 3]), (k, v) -> 
> v * v); -- {1 -> 1, 4 -> 2, 9 -> 3}
> SELECT transform_keys(MAP(ARRAY ['a', 'b'], ARRAY [1, 2]), (k, v) -> k || 
> CAST(v as VARCHAR)); -- {a1 -> 1, b2 -> 2}
> SELECT transform_keys(MAP(ARRAY [1, 2], ARRAY [1.0, 1.4]), -- {one -> 1.0, 
> two -> 1.4}
>   (k, v) -> MAP(ARRAY[1, 2], ARRAY['one', 'two'])[k]);
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23940) High-order function: transform_values(map<K, V1>, function<K, V1, V2>) → map<K, V2>

2018-10-25 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664615#comment-16664615
 ] 

Dongjoon Hyun commented on SPARK-23940:
---

I updated the version since this is reverted from `branch-2.4` during RC4.

> High-order function: transform_values(map<K, V1>, function<K, V1, V2>) → 
> map<K, V2>
> ---
>
> Key: SPARK-23940
> URL: https://issues.apache.org/jira/browse/SPARK-23940
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Neha Patil
>Priority: Major
>  Labels: starter
> Fix For: 3.0.0
>
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map that applies function to each entry of map and transforms the 
> values.
> {noformat}
> SELECT transform_values(MAP(ARRAY[], ARRAY[]), (k, v) -> v + 1); -- {}
> SELECT transform_values(MAP(ARRAY [1, 2, 3], ARRAY [10, 20, 30]), (k, v) -> v 
> + k); -- {1 -> 11, 2 -> 22, 3 -> 33}
> SELECT transform_values(MAP(ARRAY [1, 2, 3], ARRAY ['a', 'b', 'c']), (k, v) 
> -> k * k); -- {1 -> 1, 2 -> 4, 3 -> 9}
> SELECT transform_values(MAP(ARRAY ['a', 'b'], ARRAY [1, 2]), (k, v) -> k || 
> CAST(v as VARCHAR)); -- {a -> a1, b -> b2}
> SELECT transform_values(MAP(ARRAY [1, 2], ARRAY [1.0, 1.4]), -- {1 -> 
> one_1.0, 2 -> two_1.4}
> (k, v) -> MAP(ARRAY[1, 2], ARRAY['one', 'two'])[k] || 
> '_' || CAST(v AS VARCHAR));
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23939) High-order function: transform_keys(map<K1, V>, function<K1, V, K2>) → map<K2, V>

2018-10-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23939:
--
Target Version/s:   (was: 2.4.0)

> High-order function: transform_keys(map<K1, V>, function<K1, V, K2>) → 
> map<K2, V>
> 
>
> Key: SPARK-23939
> URL: https://issues.apache.org/jira/browse/SPARK-23939
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Neha Patil
>Priority: Major
> Fix For: 3.0.0
>
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map that applies function to each entry of map and transforms the 
> keys.
> {noformat}
> SELECT transform_keys(MAP(ARRAY[], ARRAY[]), (k, v) -> k + 1); -- {}
> SELECT transform_keys(MAP(ARRAY [1, 2, 3], ARRAY ['a', 'b', 'c']), (k, v) -> 
> k + 1); -- {2 -> a, 3 -> b, 4 -> c}
> SELECT transform_keys(MAP(ARRAY ['a', 'b', 'c'], ARRAY [1, 2, 3]), (k, v) -> 
> v * v); -- {1 -> 1, 4 -> 2, 9 -> 3}
> SELECT transform_keys(MAP(ARRAY ['a', 'b'], ARRAY [1, 2]), (k, v) -> k || 
> CAST(v as VARCHAR)); -- {a1 -> 1, b2 -> 2}
> SELECT transform_keys(MAP(ARRAY [1, 2], ARRAY [1.0, 1.4]), -- {one -> 1.0, 
> two -> 1.4}
>   (k, v) -> MAP(ARRAY[1, 2], ARRAY['one', 'two'])[k]);
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23940) High-order function: transform_values(map<K, V1>, function<K, V1, V2>) → map<K, V2>

2018-10-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23940:
--
Fix Version/s: (was: 2.4.0)
   3.0.0

> High-order function: transform_values(map<K, V1>, function<K, V1, V2>) → 
> map<K, V2>
> ---
>
> Key: SPARK-23940
> URL: https://issues.apache.org/jira/browse/SPARK-23940
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Neha Patil
>Priority: Major
>  Labels: starter
> Fix For: 3.0.0
>
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map that applies function to each entry of map and transforms the 
> values.
> {noformat}
> SELECT transform_values(MAP(ARRAY[], ARRAY[]), (k, v) -> v + 1); -- {}
> SELECT transform_values(MAP(ARRAY [1, 2, 3], ARRAY [10, 20, 30]), (k, v) -> v 
> + k); -- {1 -> 11, 2 -> 22, 3 -> 33}
> SELECT transform_values(MAP(ARRAY [1, 2, 3], ARRAY ['a', 'b', 'c']), (k, v) 
> -> k * k); -- {1 -> 1, 2 -> 4, 3 -> 9}
> SELECT transform_values(MAP(ARRAY ['a', 'b'], ARRAY [1, 2]), (k, v) -> k || 
> CAST(v as VARCHAR)); -- {a -> a1, b -> b2}
> SELECT transform_values(MAP(ARRAY [1, 2], ARRAY [1.0, 1.4]), -- {1 -> 
> one_1.0, 2 -> two_1.4}
> (k, v) -> MAP(ARRAY[1, 2], ARRAY['one', 'two'])[k] || 
> '_' || CAST(v AS VARCHAR));
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23937) High-order function: map_filter(map<K, V>, function<K, V, boolean>) → MAP<K, V>

2018-10-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23937:
--
Target Version/s:   (was: 2.4.0)

> High-order function: map_filter(map<K, V>, function<K, V, boolean>) → MAP<K, V>
> --
>
> Key: SPARK-23937
> URL: https://issues.apache.org/jira/browse/SPARK-23937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 3.0.0
>
>
> Constructs a map from those entries of map for which function returns true:
> {noformat}
> SELECT map_filter(MAP(ARRAY[], ARRAY[]), (k, v) -> true); -- {}
> SELECT map_filter(MAP(ARRAY[10, 20, 30], ARRAY['a', NULL, 'c']), (k, v) -> v 
> IS NOT NULL); -- {10 -> a, 30 -> c}
> SELECT map_filter(MAP(ARRAY['k1', 'k2', 'k3'], ARRAY[20, 3, 15]), (k, v) -> v 
> > 10); -- {k1 -> 20, k3 -> 15}
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23937) High-order function: map_filter(map<K, V>, function<K, V, boolean>) → MAP<K, V>

2018-10-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23937:
--
Fix Version/s: (was: 2.4.0)
   3.0.0

> High-order function: map_filter(map<K, V>, function<K, V, boolean>) → MAP<K, V>
> --
>
> Key: SPARK-23937
> URL: https://issues.apache.org/jira/browse/SPARK-23937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 3.0.0
>
>
> Constructs a map from those entries of map for which function returns true:
> {noformat}
> SELECT map_filter(MAP(ARRAY[], ARRAY[]), (k, v) -> true); -- {}
> SELECT map_filter(MAP(ARRAY[10, 20, 30], ARRAY['a', NULL, 'c']), (k, v) -> v 
> IS NOT NULL); -- {10 -> a, 30 -> c}
> SELECT map_filter(MAP(ARRAY['k1', 'k2', 'k3'], ARRAY[20, 3, 15]), (k, v) -> v 
> > 10); -- {k1 -> 20, k3 -> 15}
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23937) High-order function: map_filter(map<K, V>, function<K, V, boolean>) → MAP<K, V>

2018-10-25 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664614#comment-16664614
 ] 

Dongjoon Hyun commented on SPARK-23937:
---

I updated the version since this is reverted from `branch-2.4` during RC4.

> High-order function: map_filter(map<K, V>, function<K, V, boolean>) → MAP<K, V>
> --
>
> Key: SPARK-23937
> URL: https://issues.apache.org/jira/browse/SPARK-23937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 3.0.0
>
>
> Constructs a map from those entries of map for which function returns true:
> {noformat}
> SELECT map_filter(MAP(ARRAY[], ARRAY[]), (k, v) -> true); -- {}
> SELECT map_filter(MAP(ARRAY[10, 20, 30], ARRAY['a', NULL, 'c']), (k, v) -> v 
> IS NOT NULL); -- {10 -> a, 30 -> c}
> SELECT map_filter(MAP(ARRAY['k1', 'k2', 'k3'], ARRAY[20, 3, 15]), (k, v) -> v 
> > 10); -- {k1 -> 20, k3 -> 15}
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23935) High-order function: map_entries(map<K, V>) → array<row<K, V>>

2018-10-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23935:
--
Fix Version/s: (was: 2.4.0)
   3.0.0

> High-order function: map_entries(map<K, V>) → array<row<K, V>>
> -
>
> Key: SPARK-23935
> URL: https://issues.apache.org/jira/browse/SPARK-23935
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marek Novotny
>Priority: Major
> Fix For: 3.0.0
>
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns an array of all entries in the given map.
> {noformat}
> SELECT map_entries(MAP(ARRAY[1, 2], ARRAY['x', 'y'])); -- [ROW(1, 'x'), 
> ROW(2, 'y')]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23935) High-order function: map_entries(map<K, V>) → array<row<K, V>>

2018-10-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23935:
--
Target Version/s:   (was: 2.4.0)

> High-order function: map_entries(map<K, V>) → array<row<K, V>>
> -
>
> Key: SPARK-23935
> URL: https://issues.apache.org/jira/browse/SPARK-23935
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marek Novotny
>Priority: Major
> Fix For: 3.0.0
>
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns an array of all entries in the given map.
> {noformat}
> SELECT map_entries(MAP(ARRAY[1, 2], ARRAY['x', 'y'])); -- [ROW(1, 'x'), 
> ROW(2, 'y')]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25819) Support parse mode option for function `from_avro`

2018-10-25 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-25819:


Assignee: Gengliang Wang

> Support parse mode option for function `from_avro`
> --
>
> Key: SPARK-25819
> URL: https://issues.apache.org/jira/browse/SPARK-25819
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently the function `from_avro` throws an exception when reading corrupt 
> records. To follow the behavior of `from_csv` and `from_json`, let's support 
> the parse modes "PERMISSIVE" and "FAILFAST" in `from_avro`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25819) Support parse mode option for function `from_avro`

2018-10-25 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25819.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22814
[https://github.com/apache/spark/pull/22814]

> Support parse mode option for function `from_avro`
> --
>
> Key: SPARK-25819
> URL: https://issues.apache.org/jira/browse/SPARK-25819
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently the function `from_avro` throws an exception when reading corrupt 
> records. To follow the behavior of `from_csv` and `from_json`, let's support 
> the parse modes "PERMISSIVE" and "FAILFAST" in `from_avro`.
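
A hedged Scala usage sketch, assuming the three-argument from_avro overload added 
by the resolving PR and "mode" as the option key (mirroring from_json; not 
verified against the final API), with kafkaDf standing in for a DataFrame that 
has a binary value column:

{noformat}
import scala.collection.JavaConverters._
import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.col

val avroSchema =
  """{"type":"record","name":"Rec","fields":[{"name":"id","type":"long"}]}"""

// PERMISSIVE keeps corrupt records as null rows; FAILFAST throws on the first one.
val parsed = kafkaDf.select(
  from_avro(col("value"), avroSchema, Map("mode" -> "PERMISSIVE").asJava).as("rec"))
{noformat}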



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16693) Remove R deprecated methods

2018-10-25 Thread Felix Cheung (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-16693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-16693:
-
Description: For methods deprecated in Spark 2.0.0, we should remove them 
in 2.1.0 -> 3.0.0  (was: For methods deprecated in Spark 2.0.0, we should 
remove them in 2.1.0)

> Remove R deprecated methods
> ---
>
> Key: SPARK-16693
> URL: https://issues.apache.org/jira/browse/SPARK-16693
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Felix Cheung
>Priority: Major
>
> For methods deprecated in Spark 2.0.0, we should remove them in 2.1.0 -> 3.0.0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25840) `make-distribution.sh` should not fail due to missing LICENSE-binary

2018-10-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25840:
--
Affects Version/s: (was: 3.0.0)
   2.4.0

> `make-distribution.sh` should not fail due to missing LICENSE-binary
> 
>
> Key: SPARK-25840
> URL: https://issues.apache.org/jira/browse/SPARK-25840
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.0
>
>
> We vote for the artifacts. All releases are in the form of the *source* 
> materials needed to make changes to the software being released.
> http://www.apache.org/legal/release-policy.html#artifacts
> Since Spark 2.4.0, the source artifact and the binary artifact contain their 
> own proper LICENSE files (LICENSE and LICENSE-binary). It's great to have 
> them. However, unfortunately, `dev/make-distribution.sh` inside the source 
> artifact starts to fail because it expects `LICENSE-binary`, while the source 
> artifact has only the LICENSE file.
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/spark-2.4.0.tgz
> dev/make-distribution.sh is used during the voting phase because we are 
> voting on that source artifact instead of the GitHub repository. Individual 
> contributors usually don't have the downstream repository and try to build 
> the voting source artifact to help verify it during the voting phase. 
> (Personally, I have done so before.)
> This issue aims to make that script work in any case. It doesn't aim to make 
> the source artifact reproduce the compiled artifacts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25840) `make-distribution.sh` should not fail due to missing LICENSE-binary

2018-10-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25840.
---
   Resolution: Fixed
Fix Version/s: 2.4.0
   3.0.0

Issue resolved by pull request 22840
[https://github.com/apache/spark/pull/22840]

> `make-distribution.sh` should not fail due to missing LICENSE-binary
> 
>
> Key: SPARK-25840
> URL: https://issues.apache.org/jira/browse/SPARK-25840
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0, 2.4.0
>
>
> We vote for the artifacts. All releases are in the form of the *source* 
> materials needed to make changes to the software being released.
> http://www.apache.org/legal/release-policy.html#artifacts
> Since Spark 2.4.0, the source artifact and the binary artifact contain their 
> own proper LICENSE files (LICENSE and LICENSE-binary). It's great to have 
> them. However, unfortunately, `dev/make-distribution.sh` inside the source 
> artifact starts to fail because it expects `LICENSE-binary`, while the source 
> artifact has only the LICENSE file.
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/spark-2.4.0.tgz
> dev/make-distribution.sh is used during the voting phase because we are 
> voting on that source artifact instead of the GitHub repository. Individual 
> contributors usually don't have the downstream repository and try to build 
> the voting source artifact to help verify it during the voting phase. 
> (Personally, I have done so before.)
> This issue aims to make that script work in any case. It doesn't aim to make 
> the source artifact reproduce the compiled artifacts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25840) `make-distribution.sh` should not fail due to missing LICENSE-binary

2018-10-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-25840:
-

Assignee: Dongjoon Hyun

> `make-distribution.sh` should not fail due to missing LICENSE-binary
> 
>
> Key: SPARK-25840
> URL: https://issues.apache.org/jira/browse/SPARK-25840
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.0
>
>
> We vote for the artifacts. All releases are in the form of the *source* 
> materials needed to make changes to the software being released.
> http://www.apache.org/legal/release-policy.html#artifacts
> Since Spark 2.4.0, the source artifact and the binary artifact contain their 
> own proper LICENSE files (LICENSE and LICENSE-binary). It's great to have 
> them. However, unfortunately, `dev/make-distribution.sh` inside the source 
> artifact starts to fail because it expects `LICENSE-binary`, while the source 
> artifact has only the LICENSE file.
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/spark-2.4.0.tgz
> dev/make-distribution.sh is used during the voting phase because we are 
> voting on that source artifact instead of the GitHub repository. Individual 
> contributors usually don't have the downstream repository and try to build 
> the voting source artifact to help verify it during the voting phase. 
> (Personally, I have done so before.)
> This issue aims to make that script work in any case. It doesn't aim to make 
> the source artifact reproduce the compiled artifacts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25840) `make-distribution.sh` should not fail due to missing LICENSE-binary

2018-10-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25840:
--
Fix Version/s: (was: 3.0.0)

> `make-distribution.sh` should not fail due to missing LICENSE-binary
> 
>
> Key: SPARK-25840
> URL: https://issues.apache.org/jira/browse/SPARK-25840
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.0
>
>
> We vote for the artifacts. All releases are in the form of the *source* 
> materials needed to make changes to the software being released.
> http://www.apache.org/legal/release-policy.html#artifacts
> Since Spark 2.4.0, the source artifact and the binary artifact contain their 
> own proper LICENSE files (LICENSE and LICENSE-binary). It's great to have 
> them. However, unfortunately, `dev/make-distribution.sh` inside the source 
> artifact starts to fail because it expects `LICENSE-binary`, while the source 
> artifact has only the LICENSE file.
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/spark-2.4.0.tgz
> dev/make-distribution.sh is used during the voting phase because we are 
> voting on that source artifact instead of the GitHub repository. Individual 
> contributors usually don't have the downstream repository and try to build 
> the voting source artifact to help verify it during the voting phase. 
> (Personally, I have done so before.)
> This issue aims to make that script work in any case. It doesn't aim to make 
> the source artifact reproduce the compiled artifacts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25793) Loading model bug in BisectingKMeans

2018-10-25 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-25793:
---

Assignee: Huaxin Gao

> Loading model bug in BisectingKMeans
> 
>
> Key: SPARK-25793
> URL: https://issues.apache.org/jira/browse/SPARK-25793
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 2.4.0
>
>
> See this line:
> [https://github.com/apache/spark/blob/fc64e83f9538d6b7e13359a4933a454ba7ed89ec/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala#L129]
>  
> This also affects `ml.clustering.BisectingKMeansModel`
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25793) Loading model bug in BisectingKMeans

2018-10-25 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25793.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22790
[https://github.com/apache/spark/pull/22790]

> Loading model bug in BisectingKMeans
> 
>
> Key: SPARK-25793
> URL: https://issues.apache.org/jira/browse/SPARK-25793
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>
> See this line:
> [https://github.com/apache/spark/blob/fc64e83f9538d6b7e13359a4933a454ba7ed89ec/mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala#L129]
>  
> This also affects `ml.clustering.BisectingKMeansModel`
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25846) Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark to use main method

2018-10-25 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-25846.
-
Resolution: Duplicate

> Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark to use main method
> -
>
> Key: SPARK-25846
> URL: https://issues.apache.org/jira/browse/SPARK-25846
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.1
>Reporter: caoxuewen
>Priority: Major
>
> use spark-submit:
> bin/spark-submit --class  
> org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark 
> --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar 
> ./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
> Generate benchmark result:
> SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
> org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark"
>  
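
For illustration, a rough sketch of the refactoring pattern this asks for: move the 
benchmark body out of the test suite into an object with a `main` method, so it can 
be launched with spark-submit or `build/sbt ... runMain` as shown above. The object 
and helper below are hypothetical, not the actual Spark benchmark framework classes.

{code:java}
// Hypothetical standalone benchmark entry point.
object ExternalAppendOnlyUnsafeRowArrayBenchmarkSketch {

  private def runBenchmark(name: String)(body: => Unit): Unit = {
    val start = System.nanoTime()
    body
    val elapsedMs = (System.nanoTime() - start) / 1e6
    // The real benchmarks write results to a file when SPARK_GENERATE_BENCHMARK_FILES
    // is set; this sketch only prints them.
    println(f"$name%-40s $elapsedMs%10.1f ms")
  }

  def main(args: Array[String]): Unit = {
    runBenchmark("append 100k byte arrays") {
      val buffer = scala.collection.mutable.ArrayBuffer.empty[Array[Byte]]
      (0 until 100000).foreach(_ => buffer += new Array[Byte](64))
    }
  }
}
{code}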



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25846) Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark to use main method

2018-10-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25846:


Assignee: Apache Spark

> Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark to use main method
> -
>
> Key: SPARK-25846
> URL: https://issues.apache.org/jira/browse/SPARK-25846
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.1
>Reporter: caoxuewen
>Assignee: Apache Spark
>Priority: Major
>
> use spark-submit:
> bin/spark-submit --class  
> org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark 
> --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar 
> ./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
> Generate benchmark result:
> SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
> org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark"
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25846) Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark to use main method

2018-10-25 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664555#comment-16664555
 ] 

Apache Spark commented on SPARK-25846:
--

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22842

> Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark to use main method
> -
>
> Key: SPARK-25846
> URL: https://issues.apache.org/jira/browse/SPARK-25846
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.1
>Reporter: caoxuewen
>Priority: Major
>
> use spark-submit:
> bin/spark-submit --class  
> org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark 
> --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar 
> ./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
> Generate benchmark result:
> SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
> org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark"
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25846) Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark to use main method

2018-10-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25846:


Assignee: (was: Apache Spark)

> Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark to use main method
> -
>
> Key: SPARK-25846
> URL: https://issues.apache.org/jira/browse/SPARK-25846
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.1
>Reporter: caoxuewen
>Priority: Major
>
> use spark-submit:
> bin/spark-submit --class  
> org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark 
> --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar 
> ./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
> Generate benchmark result:
> SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
> org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark"
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25846) Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark to use main method

2018-10-25 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664556#comment-16664556
 ] 

Apache Spark commented on SPARK-25846:
--

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22842

> Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark to use main method
> -
>
> Key: SPARK-25846
> URL: https://issues.apache.org/jira/browse/SPARK-25846
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.4.1
>Reporter: caoxuewen
>Priority: Major
>
> use spark-submit:
> bin/spark-submit --class  
> org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark 
> --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar 
> ./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
> Generate benchmark result:
> SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
> org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark"
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25846) Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark to use main method

2018-10-25 Thread caoxuewen (JIRA)
caoxuewen created SPARK-25846:
-

 Summary: Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark to use 
main method
 Key: SPARK-25846
 URL: https://issues.apache.org/jira/browse/SPARK-25846
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 2.4.1
Reporter: caoxuewen


use spark-submit:
bin/spark-submit --class  
org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark --jars 
./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar 
./sql/catalyst/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar
Generate benchmark result:
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArrayBenchmark"
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-10-25 Thread Jackey Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664529#comment-16664529
 ] 

Jackey Lee commented on SPARK-24630:


[~shijinkui] [~kabhwan] [~uncleGen]

I have removed the stream keyword. Table API is supported in SQLStreaming now.

> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Priority: Minor
>  Labels: SQLStreaming
> Attachments: SQLStreaming SPIP.pdf
>
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), 
> SQLStream, and StormSQL all provide a streaming SQL interface, with which 
> users with little knowledge about streaming can easily develop a stream 
> processing model. In Spark, we can also support a SQL API based on 
> Structured Streaming.
> To support SQL Streaming, there are two key points: 
> 1. Analysis should be able to parse streaming-type SQL. 
> 2. The Analyzer should be able to map metadata information to the 
> corresponding Relation. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25841) Redesign window function rangeBetween API

2018-10-25 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664454#comment-16664454
 ] 

Wenchen Fan commented on SPARK-25841:
-

Makes sense to me, we need a better API.

> Redesign window function rangeBetween API
> -
>
> Key: SPARK-25841
> URL: https://issues.apache.org/jira/browse/SPARK-25841
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> As I was reviewing the Spark API changes for 2.4, I found that through 
> organic, ad-hoc evolution the current API for window functions in Scala is 
> pretty bad.
>   
>  To illustrate the problem, we have two rangeBetween functions in Window 
> class:
>   
> {code:java}
> class Window {
>  def unboundedPreceding: Long
>  ...
>  def rangeBetween(start: Long, end: Long): WindowSpec
>  def rangeBetween(start: Column, end: Column): WindowSpec
> }{code}
>  
>  The Column version of rangeBetween was added in Spark 2.3 because the 
> previous version (Long) could only support integral values and not time 
> intervals. Now in order to support specifying unboundedPreceding in the 
> rangeBetween(Column, Column) API, we added an unboundedPreceding that returns 
> a Column in functions.scala.
>   
>  There are a few issues I have with the API:
>   
>  1. To the end user, this can be just super confusing. Why are there two 
> unboundedPreceding functions, in different classes, that are named the same 
> but return different types?
>   
>  2. Using Column as the parameter signature implies this can be an actual 
> Column, but in practice rangeBetween can only accept literal values.
>   
>  3. We added the new APIs to support intervals, but they don't actually work, 
> because in the implementation we try to validate the start is less than the 
> end, but calendar interval types are not comparable, and as a result we throw 
> a type mismatch exception at runtime: scala.MatchError: CalendarIntervalType 
> (of class org.apache.spark.sql.types.CalendarIntervalType$)
>   
>  4. In order to make interval work, users need to create an interval using 
> CalendarInterval, which is an internal class that has no documentation and no 
> stable API.
>   
>   
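
To make issue 1 above concrete, a minimal sketch of the two same-named variants as 
they exist in 2.3/2.4 (column names are hypothetical):

{code:java}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Long-based API: Window.unboundedPreceding / Window.currentRow are Long constants.
val longFrame = Window
  .partitionBy("userId")
  .orderBy("amount")
  .rangeBetween(Window.unboundedPreceding, Window.currentRow)

// Column-based API added in 2.3: unboundedPreceding() / currentRow() here are the
// functions.scala variants that return Column, despite having the same names.
val columnFrame = Window
  .partitionBy("userId")
  .orderBy("amount")
  .rangeBetween(unboundedPreceding(), currentRow())

// Both frames are then used the same way, e.g. sum("amount").over(longFrame).
{code}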



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25845) Fix MatchError for calendar interval type in rangeBetween

2018-10-25 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-25845:
---

 Summary: Fix MatchError for calendar interval type in rangeBetween
 Key: SPARK-25845
 URL: https://issues.apache.org/jira/browse/SPARK-25845
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Reynold Xin


WindowSpecDefinition checks that the frame start is less than the end, but 
CalendarIntervalType is not comparable, so it would throw the following exception 
at runtime:

 
 
{noformat}
 scala.MatchError: CalendarIntervalType (of class 
org.apache.spark.sql.types.CalendarIntervalType$)  at 
org.apache.spark.sql.catalyst.util.TypeUtils$.getInterpretedOrdering(TypeUtils.scala:58)
 at 
org.apache.spark.sql.catalyst.expressions.BinaryComparison.ordering$lzycompute(predicates.scala:592)
 at 
org.apache.spark.sql.catalyst.expressions.BinaryComparison.ordering(predicates.scala:592)
 at 
org.apache.spark.sql.catalyst.expressions.GreaterThan.nullSafeEval(predicates.scala:797)
 at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:496)
 at 
org.apache.spark.sql.catalyst.expressions.SpecifiedWindowFrame.isGreaterThan(windowExpressions.scala:245)
 at 
org.apache.spark.sql.catalyst.expressions.SpecifiedWindowFrame.checkInputDataTypes(windowExpressions.scala:216)
 at 
org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:171)
 at 
org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:171)
 at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:183)
 at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:183)
 at 
scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
 at 
scala.collection.IndexedSeqOptimized$class.forall(IndexedSeqOptimized.scala:43) 
at scala.collection.mutable.ArrayBuffer.forall(ArrayBuffer.scala:48) at 
org.apache.spark.sql.catalyst.expressions.Expression.childrenResolved(Expression.scala:183)
 at 
org.apache.spark.sql.catalyst.expressions.WindowSpecDefinition.resolved$lzycompute(windowExpressions.scala:48)
 at 
org.apache.spark.sql.catalyst.expressions.WindowSpecDefinition.resolved(windowExpressions.scala:48)
 at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:183)
 at 
org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:183)
 at 
scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:83)   
 {noformat}
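
For context on the failure mode: a Scala match expression with no case for the value 
at hand throws scala.MatchError at runtime, which is what happens above when the 
ordering lookup has no case for CalendarIntervalType. A generic, self-contained 
illustration (not Spark's actual code):

{code:java}
sealed trait DataTypeLike
case object LongTypeLike extends DataTypeLike
case object CalendarIntervalTypeLike extends DataTypeLike

// Non-exhaustive match: there is no case for CalendarIntervalTypeLike,
// so passing it throws scala.MatchError at runtime.
def orderingFor(dt: DataTypeLike): Ordering[Long] = dt match {
  case LongTypeLike => Ordering.Long
}

orderingFor(CalendarIntervalTypeLike)   // throws scala.MatchError
{code}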
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25844) Implement Python API once we have a new API

2018-10-25 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-25844:
---

 Summary: Implement Python API once we have a new API
 Key: SPARK-25844
 URL: https://issues.apache.org/jira/browse/SPARK-25844
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Reynold Xin






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21608) Window rangeBetween() API should allow literal boundary

2018-10-25 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664439#comment-16664439
 ] 

Apache Spark commented on SPARK-21608:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22841

> Window rangeBetween() API should allow literal boundary
> ---
>
> Key: SPARK-21608
> URL: https://issues.apache.org/jira/browse/SPARK-21608
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xingbo Jiang
>Assignee: Xingbo Jiang
>Priority: Major
> Fix For: 2.3.0
>
>
> The Window rangeBetween() API should allow literal boundaries; that means the 
> window range frame can be computed over double/date/timestamp values.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21608) Window rangeBetween() API should allow literal boundary

2018-10-25 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664440#comment-16664440
 ] 

Apache Spark commented on SPARK-21608:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22841

> Window rangeBetween() API should allow literal boundary
> ---
>
> Key: SPARK-21608
> URL: https://issues.apache.org/jira/browse/SPARK-21608
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xingbo Jiang
>Assignee: Xingbo Jiang
>Priority: Major
> Fix For: 2.3.0
>
>
> The Window rangeBetween() API should allow literal boundaries; that means the 
> window range frame can be computed over double/date/timestamp values.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25842) Deprecate APIs introduced in SPARK-21608

2018-10-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25842:


Assignee: Apache Spark  (was: Reynold Xin)

> Deprecate APIs introduced in SPARK-21608
> 
>
> Key: SPARK-25842
> URL: https://issues.apache.org/jira/browse/SPARK-25842
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>Priority: Major
>
> See the parent ticket for more information. The newly introduced API is not 
> only confusing, but doesn't work. We should deprecate it in 2.4, and 
> introduce a new version in 3.0.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21608) Window rangeBetween() API should allow literal boundary

2018-10-25 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664438#comment-16664438
 ] 

Apache Spark commented on SPARK-21608:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22841

> Window rangeBetween() API should allow literal boundary
> ---
>
> Key: SPARK-21608
> URL: https://issues.apache.org/jira/browse/SPARK-21608
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xingbo Jiang
>Assignee: Xingbo Jiang
>Priority: Major
> Fix For: 2.3.0
>
>
> The Window rangeBetween() API should allow literal boundaries; that means the 
> window range frame can be computed over double/date/timestamp values.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25842) Deprecate APIs introduced in SPARK-21608

2018-10-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25842:


Assignee: Reynold Xin  (was: Apache Spark)

> Deprecate APIs introduced in SPARK-21608
> 
>
> Key: SPARK-25842
> URL: https://issues.apache.org/jira/browse/SPARK-25842
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> See the parent ticket for more information. The newly introduced API is not 
> only confusing, but doesn't work. We should deprecate it in 2.4, and 
> introduce a new version in 3.0.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25842) Deprecate APIs introduced in SPARK-21608

2018-10-25 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664435#comment-16664435
 ] 

Apache Spark commented on SPARK-25842:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22841

> Deprecate APIs introduced in SPARK-21608
> 
>
> Key: SPARK-25842
> URL: https://issues.apache.org/jira/browse/SPARK-25842
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> See the parent ticket for more information. The newly introduced API is not 
> only confusing, but doesn't work. We should deprecate it in 2.4, and 
> introduce a new version in 3.0.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25842) Deprecate APIs introduced in SPARK-21608

2018-10-25 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664437#comment-16664437
 ] 

Apache Spark commented on SPARK-25842:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22841

> Deprecate APIs introduced in SPARK-21608
> 
>
> Key: SPARK-25842
> URL: https://issues.apache.org/jira/browse/SPARK-25842
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> See the parent ticket for more information. The newly introduced API is not 
> only confusing, but doesn't work. We should deprecate it in 2.4, and 
> introduce a new version in 3.0.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25832) remove newly added map related functions

2018-10-25 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-25832:

Summary: remove newly added map related functions  (was: remove newly added 
map related functions from FunctionRegistry)

> remove newly added map related functions
> 
>
> Key: SPARK-25832
> URL: https://issues.apache.org/jira/browse/SPARK-25832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Xiao Li
>Priority: Blocker
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25832) remove newly added map related functions from FunctionRegistry

2018-10-25 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-25832:
---

Assignee: Xiao Li  (was: Wenchen Fan)

> remove newly added map related functions from FunctionRegistry
> --
>
> Key: SPARK-25832
> URL: https://issues.apache.org/jira/browse/SPARK-25832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Xiao Li
>Priority: Blocker
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25832) remove newly added map related functions from FunctionRegistry

2018-10-25 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25832.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22827
[https://github.com/apache/spark/pull/22827]

> remove newly added map related functions from FunctionRegistry
> --
>
> Key: SPARK-25832
> URL: https://issues.apache.org/jira/browse/SPARK-25832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25842) Deprecate APIs introduced in SPARK-21608

2018-10-25 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-25842:

Target Version/s: 2.4.0

> Deprecate APIs introduced in SPARK-21608
> 
>
> Key: SPARK-25842
> URL: https://issues.apache.org/jira/browse/SPARK-25842
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> See the parent ticket for more information. The newly introduced API is not 
> only confusing, but doesn't work. We should deprecate it in 2.4, and 
> introduce a new version in 3.0.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25843) Redesign rangeBetween API

2018-10-25 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-25843:
---

 Summary: Redesign rangeBetween API
 Key: SPARK-25843
 URL: https://issues.apache.org/jira/browse/SPARK-25843
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Reynold Xin
Assignee: Reynold Xin


See parent ticket for more information. I have a rough design that I will post 
later.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25842) Deprecate APIs introduced in SPARK-21608

2018-10-25 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-25842:
---

 Summary: Deprecate APIs introduced in SPARK-21608
 Key: SPARK-25842
 URL: https://issues.apache.org/jira/browse/SPARK-25842
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Reynold Xin
Assignee: Reynold Xin


See the parent ticket for more information. The newly introduced API is not 
only confusing, but doesn't work. We should deprecate it in 2.4, and introduce 
a new version in 3.0.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25841) Redesign window function rangeBetween API

2018-10-25 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-25841:

Description: 
As I was reviewing the Spark API changes for 2.4, I found that through organic, 
ad-hoc evolution the current API for window functions in Scala is pretty bad.
  
 To illustrate the problem, we have two rangeBetween functions in Window class:
  
{code:java}
class Window {
 def unboundedPreceding: Long
 ...
 def rangeBetween(start: Long, end: Long): WindowSpec
 def rangeBetween(start: Column, end: Column): WindowSpec
}{code}
 
 The Column version of rangeBetween was added in Spark 2.3 because the previous 
version (Long) could only support integral values and not time intervals. Now 
in order to support specifying unboundedPreceding in the rangeBetween(Column, 
Column) API, we added an unboundedPreceding that returns a Column in 
functions.scala.
  
 There are a few issues I have with the API:
  
 1. To the end user, this can be just super confusing. Why are there two 
unboundedPreceding functions, in different classes, that are named the same but 
return different types?
  
 2. Using Column as the parameter signature implies this can be an actual 
Column, but in practice rangeBetween can only accept literal values.
  
 3. We added the new APIs to support intervals, but they don't actually work, 
because in the implementation we try to validate the start is less than the 
end, but calendar interval types are not comparable, and as a result we throw a 
type mismatch exception at runtime: scala.MatchError: CalendarIntervalType (of 
class org.apache.spark.sql.types.CalendarIntervalType$)
  
 4. In order to make interval work, users need to create an interval using 
CalendarInterval, which is an internal class that has no documentation and no 
stable API.
  
  

  was:
As I was reviewing the Spark API changes for 2.4, I found that through organic, 
ad-hoc evolution the current API for window functions in Scala is pretty bad.
  
 To illustrate the problem, we have two rangeBetween functions in Window class:
  
class Window {
  def unboundedPreceding: Long
  ...
  def rangeBetween(start: Long, end: Long): WindowSpec
  def rangeBetween(start: Column, end: Column): WindowSpec

}
 
 The Column version of rangeBetween was added in Spark 2.3 because the previous 
version (Long) could only support integral values and not time intervals. Now 
in order to support specifying unboundedPreceding in the rangeBetween(Column, 
Column) API, we added an unboundedPreceding that returns a Column in 
functions.scala.
  
 There are a few issues I have with the API:
  
 1. To the end user, this can be just super confusing. Why are there two 
unboundedPreceding functions, in different classes, that are named the same but 
return different types?
  
 2. Using Column as the parameter signature implies this can be an actual 
Column, but in practice rangeBetween can only accept literal values.
  
 3. We added the new APIs to support intervals, but they don't actually work, 
because in the implementation we try to validate the start is less than the 
end, but calendar interval types are not comparable, and as a result we throw a 
type mismatch exception at runtime: scala.MatchError: CalendarIntervalType (of 
class org.apache.spark.sql.types.CalendarIntervalType$)
  
 4. In order to make interval work, users need to create an interval using 
CalendarInterval, which is an internal class that has no documentation and no 
stable API.
  
  


> Redesign window function rangeBetween API
> -
>
> Key: SPARK-25841
> URL: https://issues.apache.org/jira/browse/SPARK-25841
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> As I was reviewing the Spark API changes for 2.4, I found that through 
> organic, ad-hoc evolution the current API for window functions in Scala is 
> pretty bad.
>   
>  To illustrate the problem, we have two rangeBetween functions in Window 
> class:
>   
> {code:java}
> class Window {
>  def unboundedPreceding: Long
>  ...
>  def rangeBetween(start: Long, end: Long): WindowSpec
>  def rangeBetween(start: Column, end: Column): WindowSpec
> }{code}
>  
>  The Column version of rangeBetween was added in Spark 2.3 because the 
> previous version (Long) could only support integral values and not time 
> intervals. Now in order to support specifying unboundedPreceding in the 
> rangeBetween(Column, Column) API, we added an unboundedPreceding that returns 
> a Column in functions.scala.
>   
>  There are a few issues I have with the API:
>   
>  1. To the end user, this can be just super confusing. Why are there two 
> unboundedPreceding functions, in different classes, that are 

[jira] [Created] (SPARK-25841) Redesign window function rangeBetween API

2018-10-25 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-25841:
---

 Summary: Redesign window function rangeBetween API
 Key: SPARK-25841
 URL: https://issues.apache.org/jira/browse/SPARK-25841
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 2.3.2, 2.4.0
Reporter: Reynold Xin
Assignee: Reynold Xin


As I was reviewing the Spark API changes for 2.4, I found that through organic, 
ad-hoc evolution the current API for window functions in Scala is pretty bad.
 
To illustrate the problem, we have two rangeBetween functions in Window class:
 
class Window {
  def unboundedPreceding: Long
  ...
  def rangeBetween(start: Long, end: Long): WindowSpec
  def rangeBetween(start: Column, end: Column): WindowSpec

}
 
The Column version of rangeBetween was added in Spark 2.3 because the previous 
version (Long) could only support integral values and not time intervals. Now 
in order to support specifying unboundedPreceding in the rangeBetween(Column, 
Column) API, we added an unboundedPreceding that returns a Column in 
functions.scala.
 
There are a few issues I have with the API:
 
1. To the end user, this can be just super confusing. Why are there two 
unboundedPreceding functions, in different classes, that are named the same but 
return different types?
 
2. Using Column as the parameter signature implies this can be an actual 
Column, but in practice rangeBetween can only accept literal values.
 
3. We added the new APIs to support intervals, but they don't actually work, 
because in the implementation we try to validate the start is less than the 
end, but calendar interval types are not comparable, and as a result we throw a 
type mismatch exception at runtime: scala.MatchError: CalendarIntervalType (of 
class org.apache.spark.sql.types.CalendarIntervalType$)
 
4. In order to make interval work, users need to create an interval using 
CalendarInterval, which is an internal class that has no documentation and no 
stable API.
 
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25841) Redesign window function rangeBetween API

2018-10-25 Thread Reynold Xin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-25841:

Description: 
As I was reviewing the Spark API changes for 2.4, I found that through organic, 
ad-hoc evolution the current API for window functions in Scala is pretty bad.
  
 To illustrate the problem, we have two rangeBetween functions in Window class:
  
class Window {
  def unboundedPreceding: Long
  ...
  def rangeBetween(start: Long, end: Long): WindowSpec
  def rangeBetween(start: Column, end: Column): WindowSpec

}
 
 The Column version of rangeBetween was added in Spark 2.3 because the previous 
version (Long) could only support integral values and not time intervals. Now 
in order to support specifying unboundedPreceding in the rangeBetween(Column, 
Column) API, we added an unboundedPreceding that returns a Column in 
functions.scala.
  
 There are a few issues I have with the API:
  
 1. To the end user, this can be just super confusing. Why are there two 
unboundedPreceding functions, in different classes, that are named the same but 
return different types?
  
 2. Using Column as the parameter signature implies this can be an actual 
Column, but in practice rangeBetween can only accept literal values.
  
 3. We added the new APIs to support intervals, but they don't actually work, 
because in the implementation we try to validate the start is less than the 
end, but calendar interval types are not comparable, and as a result we throw a 
type mismatch exception at runtime: scala.MatchError: CalendarIntervalType (of 
class org.apache.spark.sql.types.CalendarIntervalType$)
  
 4. In order to make interval work, users need to create an interval using 
CalendarInterval, which is an internal class that has no documentation and no 
stable API.
  
  

  was:
As I was reviewing the Spark API changes for 2.4, I found that through organic, 
ad-hoc evolution the current API for window functions in Scala is pretty bad.
 
To illustrate the problem, we have two rangeBetween functions in Window class:
 
class Window {
  def unboundedPreceding: Long
  ...
  def rangeBetween(start: Long, end: Long): WindowSpec
  def rangeBetween(start: Column, end: Column): WindowSpec

}
 
The Column version of rangeBetween was added in Spark 2.3 because the previous 
version (Long) could only support integral values and not time intervals. Now 
in order to support specifying unboundedPreceding in the rangeBetween(Column, 
Column) API, we added an unboundedPreceding that returns a Column in 
functions.scala.
 
There are a few issues I have with the API:
 
1. To the end user, this can be just super confusing. Why are there two 
unboundedPreceding functions, in different classes, that are named the same but 
return different types?
 
2. Using Column as the parameter signature implies this can be an actual 
Column, but in practice rangeBetween can only accept literal values.
 
3. We added the new APIs to support intervals, but they don't actually work, 
because in the implementation we try to validate the start is less than the 
end, but calendar interval types are not comparable, and as a result we throw a 
type mismatch exception at runtime: scala.MatchError: CalendarIntervalType (of 
class org.apache.spark.sql.types.CalendarIntervalType$)
 
4. In order to make interval work, users need to create an interval using 
CalendarInterval, which is an internal class that has no documentation and no 
stable API.
 
 


> Redesign window function rangeBetween API
> -
>
> Key: SPARK-25841
> URL: https://issues.apache.org/jira/browse/SPARK-25841
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Major
>
> As I was reviewing the Spark API changes for 2.4, I found that through 
> organic, ad-hoc evolution the current API for window functions in Scala is 
> pretty bad.
>   
>  To illustrate the problem, we have two rangeBetween functions in Window 
> class:
>   
> class Window {
>   def unboundedPreceding: Long
>   ...
>   def rangeBetween(start: Long, end: Long): WindowSpec
>   def rangeBetween(start: Column, end: Column): WindowSpec
> }
>  
>  The Column version of rangeBetween was added in Spark 2.3 because the 
> previous version (Long) could only support integral values and not time 
> intervals. Now in order to support specifying unboundedPreceding in the 
> rangeBetween(Column, Column) API, we added an unboundedPreceding that returns 
> a Column in functions.scala.
>   
>  There are a few issues I have with the API:
>   
>  1. To the end user, this can be just super confusing. Why are there two 
> unboundedPreceding functions, in different classes, that are named the same 
> but return different types?
>  

[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

2018-10-25 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664418#comment-16664418
 ] 

Hyukjin Kwon commented on SPARK-18673:
--

Please provide some input in SPARK-20202 or 
https://github.com/apache/spark/pull/21588

> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> --
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT 
>Reporter: Steve Loughran
>Priority: Major
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader 
> considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it 
> will need to be updated to match.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

2018-10-25 Thread Dagang Wei (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664286#comment-16664286
 ] 

Dagang Wei edited comment on SPARK-18673 at 10/25/18 9:20 PM:
--

Is it possible to fix in org.spark-project.hive before SPARK-20202 "Remove 
references to org.spark-project.hive" is resolved? In my Hadoop deployment 
(Hadoop 3.1.0, Hive 3.1.0 and Spark 2.3.1), when I run spark-shell, I got

java.lang.IllegalArgumentException: Unrecognized Hadoop major version number: 
3.1.0
 at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:174)
 at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:139)
 at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:100)
 at org.apache.hadoop.hive.conf.HiveConf$ConfVars.(HiveConf.java:368)
 at org.apache.hadoop.hive.conf.HiveConf.(HiveConf.java:105)

After examining the JARs, it turns out that the 
org.apache.hadoop.hive.shims.ShimLoader class was from 
/jars/hive-exec-1.2.1.spark2.jar (instead of 
/lib/hive-shims-common-3.1.0.jar). Could somebody let me know where 
the source code of hive-exec-1.2.1.spark2.jar is? Or, in general, how the Spark 
fork of Hive works, so that I can fix the problem in it.

 


was (Author: functicons):
Is it possible to fix in org.spark-project.hive before SPARK-20202 "Remove 
references to org.spark-project.hive" is resolved? In my Hadoop deployment 
(Hadoop 3.1.0, Hive 3.1.0 and Spark 2.3.1), when I run spark-shell, I got

 java.lang.IllegalArgumentException: Unrecognized Hadoop major version number: 
3.1.0
 at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:174)
 at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:139)
 at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:100)
 at org.apache.hadoop.hive.conf.HiveConf$ConfVars.(HiveConf.java:368)
 at org.apache.hadoop.hive.conf.HiveConf.(HiveConf.java:105)

After examining the JARs, it turns out that the 
org.apache.hadoop.hive.shims.ShimLoader class that spark-shell was trying to load 
was from /jars/hive-exec-1.2.1.spark2.jar (instead of 
/lib/hive-shims-common-3.1.0.jar). Could somebody let me know where 
the source code of hive-exec-1.2.1.spark2.jar is? Or, in general, how the Spark 
fork of Hive works, so that I can fix the problem in it.

 

> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> --
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT 
>Reporter: Steve Loughran
>Priority: Major
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader 
> considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it 
> will need to be updated to match.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25656) Add an example section about how to use Parquet/ORC library options

2018-10-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25656:
--
Fix Version/s: 2.4.0

> Add an example section about how to use Parquet/ORC library options
> ---
>
> Key: SPARK-25656
> URL: https://issues.apache.org/jira/browse/SPARK-25656
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Examples, SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.0, 3.0.0
>
>
> Our current doc does not explain that we are passing the data-source-specific 
> options to the underlying data source:
> - 
> https://spark.apache.org/docs/latest/sql-programming-guide.html#manually-specifying-options
> We can add an introduction section for both the Parquet and ORC examples 
> there. We had better give configuration examples for both the read and write 
> sides, too. One example candidate is dictionary encoding: 
> `parquet.enable.dictionary` and `orc.dictionary.key.threshold`, among others.
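
As a sketch of what such an example section could show (the DataFrame name and 
output paths below are hypothetical), the library options are simply passed through 
the read/write option API so the underlying Parquet/ORC writers see them:

{code:java}
// Write side: hand the Parquet/ORC library options through DataFrameWriter.option.
usersDF.write
  .option("parquet.enable.dictionary", "false")   // Parquet library option
  .parquet("/tmp/users_parquet")

usersDF.write
  .option("orc.dictionary.key.threshold", "0.8")  // ORC library option
  .orc("/tmp/users_orc")

// Read-side options are passed the same way, e.g. spark.read.option(...).parquet(...).
{code}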



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25840) `make-distribution.sh` should not fail due to missing LICENSE-binary

2018-10-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25840:
--
Description: 
We vote for the artifacts. All releases are in the form of the *source* 
materials needed to make changes to the software being released.

http://www.apache.org/legal/release-policy.html#artifacts

Since Spark 2.4.0, the source artifact and the binary artifact contain their own 
proper LICENSE files (LICENSE and LICENSE-binary). It's great to have them. 
However, unfortunately, `dev/make-distribution.sh` inside the source artifact 
starts to fail because it expects `LICENSE-binary`, while the source artifact has 
only the LICENSE file.

https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/spark-2.4.0.tgz

dev/make-distribution.sh is used during the voting phase because we are voting 
on that source artifact instead of the GitHub repository. Individual contributors 
usually don't have the downstream repository and try to build the voting source 
artifact to help verify it during the voting phase. (Personally, I have done so 
before.)

This issue aims to make that script work in any case. It doesn't aim to make the 
source artifact reproduce the compiled artifacts.

  was:
We vote for the artifacts. All releases are in the form of the *source* 
materials needed to make changes to the software being released.

http://www.apache.org/legal/release-policy.html#artifacts

>From Spark 2.4.0, the source artifact and binary artifact starts to contain 
>own proper LICENSE files (LICENSE, LICENSE-binary). It's great to have them. 
>However, unfortunately, `dev/make-distribution.sh` inside source artifacts 
>start to fail because it expects `LICENSE-binary` and source artifact have 
>only the LICENSE file.

https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/spark-2.4.0.tgz

`dev/make-distribution.sh` is used during the voting phase because we vote on 
that source artifact instead of the GitHub repository.

This issue aims to make that script work again. It does not aim to make the 
source artifact reproduce the compiled artifacts.


> `make-distribution.sh` should not fail due to missing LICENSE-binary
> 
>
> Key: SPARK-25840
> URL: https://issues.apache.org/jira/browse/SPARK-25840
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> We vote for the artifacts. All releases are in the form of the *source* 
> materials needed to make changes to the software being released.
> http://www.apache.org/legal/release-policy.html#artifacts
> From Spark 2.4.0, the source artifact and binary artifact start to contain 
> their own proper LICENSE files (LICENSE, LICENSE-binary). It's great to have 
> them. However, `dev/make-distribution.sh` inside the source artifact starts 
> to fail because it expects `LICENSE-binary`, while the source artifact has 
> only the LICENSE file.
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/spark-2.4.0.tgz
> dev/make-distribution.sh is used during the voting phase because we vote on 
> that source artifact instead of the GitHub repository. Individual 
> contributors usually don't have the downstream repository and try to build 
> the voting source artifact to help verify it during the voting phase. 
> (Personally, I have done this before.)
> This issue aims to make that script work again. It does not aim to make the 
> source artifact reproduce the compiled artifacts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25840) `make-distribution.sh` should not fail due to missing LICENSE-binary

2018-10-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25840:


Assignee: (was: Apache Spark)

> `make-distribution.sh` should not fail due to missing LICENSE-binary
> 
>
> Key: SPARK-25840
> URL: https://issues.apache.org/jira/browse/SPARK-25840
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> We vote for the artifacts. All releases are in the form of the *source* 
> materials needed to make changes to the software being released.
> http://www.apache.org/legal/release-policy.html#artifacts
> From Spark 2.4.0, the source artifact and binary artifact start to contain 
> their own proper LICENSE files (LICENSE, LICENSE-binary). It's great to have 
> them. However, `dev/make-distribution.sh` inside the source artifact starts 
> to fail because it expects `LICENSE-binary`, while the source artifact has 
> only the LICENSE file.
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/spark-2.4.0.tgz
> `dev/make-distribution.sh` is used during the voting phase because we vote 
> on that source artifact instead of the GitHub repository.
> This issue aims to make that script work again. It does not aim to make the 
> source artifact reproduce the compiled artifacts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25840) `make-distribution.sh` should not fail due to missing LICENSE-binary

2018-10-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25840:


Assignee: Apache Spark

> `make-distribution.sh` should not fail due to missing LICENSE-binary
> 
>
> Key: SPARK-25840
> URL: https://issues.apache.org/jira/browse/SPARK-25840
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> We vote for the artifacts. All releases are in the form of the *source* 
> materials needed to make changes to the software being released.
> http://www.apache.org/legal/release-policy.html#artifacts
> From Spark 2.4.0, the source artifact and binary artifact start to contain 
> their own proper LICENSE files (LICENSE, LICENSE-binary). It's great to have 
> them. However, `dev/make-distribution.sh` inside the source artifact starts 
> to fail because it expects `LICENSE-binary`, while the source artifact has 
> only the LICENSE file.
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/spark-2.4.0.tgz
> `dev/make-distribution.sh` is used during the voting phase because we vote 
> on that source artifact instead of the GitHub repository.
> This issue aims to make that script work again. It does not aim to make the 
> source artifact reproduce the compiled artifacts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25840) `make-distribution.sh` should not fail due to missing LICENSE-binary

2018-10-25 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664293#comment-16664293
 ] 

Apache Spark commented on SPARK-25840:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/22840

> `make-distribution.sh` should not fail due to missing LICENSE-binary
> 
>
> Key: SPARK-25840
> URL: https://issues.apache.org/jira/browse/SPARK-25840
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> We vote for the artifacts. All releases are in the form of the *source* 
> materials needed to make changes to the software being released.
> http://www.apache.org/legal/release-policy.html#artifacts
> From Spark 2.4.0, the source artifact and binary artifact start to contain 
> their own proper LICENSE files (LICENSE, LICENSE-binary). It's great to have 
> them. However, `dev/make-distribution.sh` inside the source artifact starts 
> to fail because it expects `LICENSE-binary`, while the source artifact has 
> only the LICENSE file.
> https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/spark-2.4.0.tgz
> `dev/make-distribution.sh` is used during the voting phase because we vote 
> on that source artifact instead of the GitHub repository.
> This issue aims to make that script work again. It does not aim to make the 
> source artifact reproduce the compiled artifacts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25840) `make-distribution.sh` should not fail due to missing LICENSE-binary

2018-10-25 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-25840:
-

 Summary: `make-distribution.sh` should not fail due to missing 
LICENSE-binary
 Key: SPARK-25840
 URL: https://issues.apache.org/jira/browse/SPARK-25840
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


We vote for the artifacts. All releases are in the form of the *source* 
materials needed to make changes to the software being released.

http://www.apache.org/legal/release-policy.html#artifacts

From Spark 2.4.0, the source artifact and binary artifact start to contain 
their own proper LICENSE files (LICENSE, LICENSE-binary). It's great to have 
them. However, `dev/make-distribution.sh` inside the source artifact starts to 
fail because it expects `LICENSE-binary`, while the source artifact has only 
the LICENSE file.

https://dist.apache.org/repos/dist/dev/spark/v2.4.0-rc4-bin/spark-2.4.0.tgz

`dev/make-distribution.sh` is used during the voting phase because we vote on 
that source artifact instead of the GitHub repository.

This issue aims to make that script work again. It does not aim to make the 
source artifact reproduce the compiled artifacts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging

2018-10-25 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24787.

   Resolution: Fixed
Fix Version/s: 2.4.1
   3.0.0

Issue resolved by pull request 22752
[https://github.com/apache/spark/pull/22752]

> Events being dropped at an alarming rate due to hsync being slow for 
> eventLogging
> -
>
> Key: SPARK-24787
> URL: https://issues.apache.org/jira/browse/SPARK-24787
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Sanket Reddy
>Assignee: Devaraj K
>Priority: Minor
> Fix For: 3.0.0, 2.4.1
>
>
> [https://github.com/apache/spark/pull/16924/files] updates the length of the 
> inprogress files, allowing the history server to stay responsive.
> However, we have a production job that has 6 tasks per stage, and because 
> hsync is slow it starts dropping events, leaving the history server with 
> wrong stats.
> A viable solution is to avoid syncing so frequently, or to make the sync 
> frequency configurable.
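As an aside (not part of the issue description), the "avoid syncing so 
frequently" idea is just time-based throttling of the expensive hsync call. A 
minimal sketch, with made-up names and a hypothetical 2-second interval; this 
is not Spark's actual event logging code:

{code:scala}
// Sketch: call the underlying sync at most once per flushIntervalMs.
// `doSync` would wrap something like hadoopDataStream.hsync() in practice.
class ThrottledSyncer(doSync: () => Unit, flushIntervalMs: Long = 2000L) {
  private var lastSyncTime = 0L

  def maybeSync(): Unit = {
    val now = System.currentTimeMillis()
    if (now - lastSyncTime >= flushIntervalMs) {
      doSync()
      lastSyncTime = now
    }
  }
}
{code}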



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

2018-10-25 Thread Dagang Wei (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664286#comment-16664286
 ] 

Dagang Wei commented on SPARK-18673:


Is it possible to fix this in org.spark-project.hive before SPARK-20202 "Remove 
references to org.spark-project.hive" is resolved? In my Hadoop deployment 
(Hadoop 3.1.0, Hive 3.1.0 and Spark 2.3.1), when I run spark-shell, I get

 java.lang.IllegalArgumentException: Unrecognized Hadoop major version number: 
3.1.0
 at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:174)
 at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:139)
 at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:100)
 at org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:368)
 at org.apache.hadoop.hive.conf.HiveConf.<clinit>(HiveConf.java:105)

After examining the JARs, it turns out that the 
org.apache.hadoop.hive.shims.ShimLoader class that spark-shell was trying to 
load came from /jars/hive-exec-1.2.1.spark2.jar (instead of 
/lib/hive-shims-common-3.1.0.jar). Could somebody let me know where the source 
code of hive-exec-1.2.1.spark2.jar is? Or, in general, how the Spark fork of 
Hive works, so that I can fix the problem there.

 

> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> --
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT 
>Reporter: Steve Loughran
>Priority: Major
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader 
> considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it 
> will need to be updated to match.
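As an illustration only (this is not the actual Hive source), the failure above 
comes from a major-version gate of roughly this shape inside the bundled Hive 
1.2.x ShimLoader; the shim names in the sketch are placeholders:

{code:scala}
// Sketch of a version gate that only knows Hadoop 1.x/2.x and throws for
// anything else, which is why Hadoop 3.x is rejected.
def getMajorVersion(hadoopVersion: String): String =
  hadoopVersion.split("\\.").headOption match {
    case Some("1") => "hadoop1-shims"   // placeholder shim name
    case Some("2") => "hadoop2-shims"   // placeholder shim name
    case _ =>
      throw new IllegalArgumentException(
        s"Unrecognized Hadoop major version number: $hadoopVersion")
  }
{code}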



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging

2018-10-25 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24787:
--

Assignee: Devaraj K

> Events being dropped at an alarming rate due to hsync being slow for 
> eventLogging
> -
>
> Key: SPARK-24787
> URL: https://issues.apache.org/jira/browse/SPARK-24787
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Sanket Reddy
>Assignee: Devaraj K
>Priority: Minor
>
> [https://github.com/apache/spark/pull/16924/files] updates the length of the 
> inprogress files, allowing the history server to stay responsive.
> However, we have a production job that has 6 tasks per stage, and because 
> hsync is slow it starts dropping events, leaving the history server with 
> wrong stats.
> A viable solution is to avoid syncing so frequently, or to make the sync 
> frequency configurable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25803) The -n option to docker-image-tool.sh causes other options to be ignored

2018-10-25 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25803.

   Resolution: Fixed
Fix Version/s: 3.0.0
   2.4.1

> The -n option to docker-image-tool.sh causes other options to be ignored
> 
>
> Key: SPARK-25803
> URL: https://issues.apache.org/jira/browse/SPARK-25803
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
> Environment: * OS X 10.14
>  * iTerm2
>  * bash3
>  * Docker 2.0.0.0-beta1-mac75 (27117)
> (NB: I don't believe the above has a bearing; I imagine this issue is present 
> also on linux and can confirm if needed.)
>Reporter: Steve Larkin
>Assignee: Steve Larkin
>Priority: Minor
> Fix For: 2.4.1, 3.0.0
>
>
> To reproduce:-
> 1. Build spark
>  $ ./build/mvn -Pkubernetes -DskipTests clean package
> 2. Create a Dockerfile (a simple one, just for demonstration)
>  $ cat > hello-world.dockerfile <<EOF
>  > FROM hello-world
>  > EOF
> 3. Build container images with our Dockerfile
>  $ ./bin/docker-image-tool.sh -R hello-world.dockerfile -r docker.io/myrepo 
> -t myversion build
> The result is that the -R option is honoured and the hello-world image is 
> built for spark-r, as expected.
> 4. Build container images with our Dockerfile and the -n option
>  $ ./bin/docker-image-tool.sh -n -R hello-world.dockerfile -r 
> docker.io/myrepo -t myversion build
> The result is that the -R option is ignored and the default container image 
> for R is built.
> docker-image-tool.sh uses 
> [getopts|http://pubs.opengroup.org/onlinepubs/9699919799/utilities/getopts.html]
>  in which a colon, ':', signifies that an option takes an argument. Since -n 
> does not take an argument, it should not have a colon.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25803) The -n option to docker-image-tool.sh causes other options to be ignored

2018-10-25 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25803:
--

Assignee: Steve Larkin

> The -n option to docker-image-tool.sh causes other options to be ignored
> 
>
> Key: SPARK-25803
> URL: https://issues.apache.org/jira/browse/SPARK-25803
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
> Environment: * OS X 10.14
>  * iTerm2
>  * bash3
>  * Docker 2.0.0.0-beta1-mac75 (27117)
> (NB: I don't believe the above has a bearing; I imagine this issue is present 
> also on linux and can confirm if needed.)
>Reporter: Steve Larkin
>Assignee: Steve Larkin
>Priority: Minor
>
> To reproduce:-
> 1. Build spark
>  $ ./build/mvn -Pkubernetes -DskipTests clean package
> 2. Create a Dockerfile (a simple one, just for demonstration)
>  $ cat > hello-world.dockerfile <<EOF
>  > FROM hello-world
>  > EOF
> 3. Build container images with our Dockerfile
>  $ ./bin/docker-image-tool.sh -R hello-world.dockerfile -r docker.io/myrepo 
> -t myversion build
> The result is that the -R option is honoured and the hello-world image is 
> built for spark-r, as expected.
> 4. Build container images with our Dockerfile and the -n option
>  $ ./bin/docker-image-tool.sh -n -R hello-world.dockerfile -r 
> docker.io/myrepo -t myversion build
> The result is that the -R option is ignored and the default container image 
> for R is built.
> docker-image-tool.sh uses 
> [getopts|http://pubs.opengroup.org/onlinepubs/9699919799/utilities/getopts.html]
>  in which a colon, ':', signifies that an option takes an argument. Since -n 
> does not take an argument, it should not have a colon.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25665) Refactor ObjectHashAggregateExecBenchmark to use main method

2018-10-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25665.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22804
[https://github.com/apache/spark/pull/22804]

> Refactor ObjectHashAggregateExecBenchmark to use main method
> 
>
> Key: SPARK-25665
> URL: https://issues.apache.org/jira/browse/SPARK-25665
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25665) Refactor ObjectHashAggregateExecBenchmark to use main method

2018-10-25 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-25665:
-

Assignee: Peter Toth

> Refactor ObjectHashAggregateExecBenchmark to use main method
> 
>
> Key: SPARK-25665
> URL: https://issues.apache.org/jira/browse/SPARK-25665
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Peter Toth
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25656) Add an example section about how to use Parquet/ORC library options

2018-10-25 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664213#comment-16664213
 ] 

Apache Spark commented on SPARK-25656:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/22839

> Add an example section about how to use Parquet/ORC library options
> ---
>
> Key: SPARK-25656
> URL: https://issues.apache.org/jira/browse/SPARK-25656
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Examples, SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> Our current doc does not explain that we pass the data source specific 
> options to the underlying data source:
> - 
> https://spark.apache.org/docs/latest/sql-programming-guide.html#manually-specifying-options
> We can add an introduction section with both Parquet and ORC examples there. 
> We should also give both read-side and write-side configuration examples. 
> One candidate example is dictionary encoding: `parquet.enable.dictionary` 
> and `orc.dictionary.key.threshold`, among others.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25656) Add an example section about how to use Parquet/ORC library options

2018-10-25 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664212#comment-16664212
 ] 

Apache Spark commented on SPARK-25656:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/22839

> Add an example section about how to use Parquet/ORC library options
> ---
>
> Key: SPARK-25656
> URL: https://issues.apache.org/jira/browse/SPARK-25656
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Examples, SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> Our current doc does not explain that we pass the data source specific 
> options to the underlying data source:
> - 
> https://spark.apache.org/docs/latest/sql-programming-guide.html#manually-specifying-options
> We can add an introduction section with both Parquet and ORC examples there. 
> We should also give both read-side and write-side configuration examples. 
> One candidate example is dictionary encoding: `parquet.enable.dictionary` 
> and `orc.dictionary.key.threshold`, among others.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25839) Implement use of KryoPool in KryoSerializer

2018-10-25 Thread Patrick Brown (JIRA)
Patrick Brown created SPARK-25839:
-

 Summary: Implement use of KryoPool in KryoSerializer
 Key: SPARK-25839
 URL: https://issues.apache.org/jira/browse/SPARK-25839
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.2, 2.3.1, 2.0.2
Reporter: Patrick Brown


The current implementation of KryoSerializer does not use KryoPool, which is 
recommended by Kryo due to the creation of a Kryo instance being slow.

 

The current implementation treats the KryoSerializerInstance private variable 
cachedKryo as, effectively, a pool of size 1. However (in my admittedly 
somewhat limited research), it seems that frequently (such as in the 
ClosureCleaner ensureSerializable method) a new KryoSerializerInstance is 
created, which in turn forces a new Kryo instance to be created; that instance 
is then dropped from scope, so the "pool" is never re-used.

 

I have a small set of proposed changes that we have been using on an internal 
production application (running 24x7 for 6+ months, processing 10k+ jobs a 
day). They implement a KryoPool inside KryoSerializer, which each 
KryoSerializerInstance then uses to borrow a Kryo instance.

 

I believe this is mainly a performance improvement for applications processing 
a large number of small jobs, where the cost of instantiating Kryo instances is 
a larger portion of execution time compared to larger jobs.

 

I have discussed this proposed change on the dev mailing list, and it was 
suggested that I create this issue and a PR. It was also suggested that I 
accompany the PR with some performance metrics, which I plan to do.
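
For readers unfamiliar with Kryo's pooling API, here is a minimal sketch of the 
pattern the proposal describes (this is not the actual patch; registration and 
configuration details are application-specific assumptions):

{code:scala}
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.pool.{KryoFactory, KryoPool}

// Factory that builds configured Kryo instances; the pool reuses them so the
// expensive Kryo construction does not happen for every serializer instance.
val factory = new KryoFactory {
  override def create(): Kryo = {
    val kryo = new Kryo()
    // class registration / configuration would go here
    kryo
  }
}

val pool: KryoPool = new KryoPool.Builder(factory).softReferences().build()

// Borrow an instance instead of constructing a new Kryo each time.
val kryo = pool.borrow()
try {
  // ... use kryo to serialize/deserialize ...
} finally {
  pool.release(kryo)
}
{code}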



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25835) Propagate scala 2.12 profile in k8s integration tests

2018-10-25 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664194#comment-16664194
 ] 

Apache Spark commented on SPARK-25835:
--

User 'skonto' has created a pull request for this issue:
https://github.com/apache/spark/pull/22838

> Propagate scala 2.12 profile in k8s integration tests
> -
>
> Key: SPARK-25835
> URL: https://issues.apache.org/jira/browse/SPARK-25835
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Stavros Kontopoulos
>Priority: Minor
>
> The 
> [line|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh#L106]
>  that calls k8s integration tests ignores the scala version: 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25835) Propagate scala 2.12 profile in k8s integration tests

2018-10-25 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664190#comment-16664190
 ] 

Apache Spark commented on SPARK-25835:
--

User 'skonto' has created a pull request for this issue:
https://github.com/apache/spark/pull/22838

> Propagate scala 2.12 profile in k8s integration tests
> -
>
> Key: SPARK-25835
> URL: https://issues.apache.org/jira/browse/SPARK-25835
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Stavros Kontopoulos
>Priority: Minor
>
> The 
> [line|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh#L106]
>  that calls k8s integration tests ignores the scala version: 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25835) Propagate scala 2.12 profile in k8s integration tests

2018-10-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25835:


Assignee: (was: Apache Spark)

> Propagate scala 2.12 profile in k8s integration tests
> -
>
> Key: SPARK-25835
> URL: https://issues.apache.org/jira/browse/SPARK-25835
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Stavros Kontopoulos
>Priority: Minor
>
> The 
> [line|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh#L106]
>  that calls k8s integration tests ignores the scala version: 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25835) Propagate scala 2.12 profile in k8s integration tests

2018-10-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25835:


Assignee: Apache Spark

> Propagate scala 2.12 profile in k8s integration tests
> -
>
> Key: SPARK-25835
> URL: https://issues.apache.org/jira/browse/SPARK-25835
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Stavros Kontopoulos
>Assignee: Apache Spark
>Priority: Minor
>
> The 
> [line|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh#L106]
>  that calls k8s integration tests ignores the scala version: 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25838) Remove formatVersion from Saveable

2018-10-25 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16664110#comment-16664110
 ] 

Apache Spark commented on SPARK-25838:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/22830

> Remove formatVersion from Saveable
> --
>
> Key: SPARK-25838
> URL: https://issues.apache.org/jira/browse/SPARK-25838
> Project: Spark
>  Issue Type: Task
>  Components: MLlib
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Priority: Trivial
>
> The {{Saveable}} interface introduces a {{formatVersion}} member which is 
> used nowhere and is protected. So this JIRA proposes to get rid of it, since 
> it is useless.
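For context (simplified sketch, not the verbatim Spark source), the trait in 
question has roughly this shape; the protected {{formatVersion}} member is the 
part proposed for removal:

{code:scala}
import org.apache.spark.SparkContext

trait Saveable {
  // Public entry point actually used by callers.
  def save(sc: SparkContext, path: String): Unit

  // Protected and never read anywhere else, hence the proposal to drop it.
  protected def formatVersion: String
}
{code}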



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25838) Remove formatVersion from Saveable

2018-10-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25838:


Assignee: Apache Spark

> Remove formatVersion from Saveable
> --
>
> Key: SPARK-25838
> URL: https://issues.apache.org/jira/browse/SPARK-25838
> Project: Spark
>  Issue Type: Task
>  Components: MLlib
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Assignee: Apache Spark
>Priority: Trivial
>
> The {{Saveable}} interface introduces a {{formatVersion}} member which is 
> used nowhere and is protected. So this JIRA proposes to get rid of it, since 
> it is useless.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25838) Remove formatVersion from Saveable

2018-10-25 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25838:


Assignee: (was: Apache Spark)

> Remove formatVersion from Saveable
> --
>
> Key: SPARK-25838
> URL: https://issues.apache.org/jira/browse/SPARK-25838
> Project: Spark
>  Issue Type: Task
>  Components: MLlib
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Priority: Trivial
>
> The {{Saveable}} interface introduces a {{formatVersion}} member which is 
> used nowhere and is protected. So this JIRA proposes to get rid of it, since 
> it is useless.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


