[jira] [Commented] (SPARK-26265) deadlock between TaskMemoryManager and BytesToBytesMap$MapIterator

2018-12-05 Thread qian han (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16711079#comment-16711079
 ] 

qian han commented on SPARK-26265:
--

# There are hundreds of thousands of applications running on our cluster per day, 
and this deadlock has happened only once. It cannot be reproduced easily.
 # I ran the following Spark SQL query: INSERT OVERWRITE TABLE dm_abtest.rpt_live_tag_metric_daily 
PARTITION(date='20181129_bak') select vid, tag_name, tag_value, count(*) 
impr_user, avg(impr) impr_per_u, stddev_pop(impr) var_impr_per_u, avg(read) 
read_per_u, stddev_pop(read) var_read_per_u, avg(stay) stay_per_u, 
stddev_pop(stay) var_stay_per_u, sum(stay)/sum(read) stay_per_r, 
sum(read)/sum(impr) read_per_i, avg(finish) finish_per_u, stddev_pop(finish) 
var_finish_per_u from ( select vid, user_uid, user_uid_type, tag_name, 
tag_value, sum(impr) impr, sum(read) read, sum(stay) stay, sum(stay_count) 
stay_count, 0 finish from ( select 
transform(vid,user_uid,user_uid_type,tags,impr,read,stay,stay_count) USING 
'python transform.py 11' AS 
(vids,user_uid,user_uid_type,tag_name,tag_value,impr,read,stay,stay_count) from 
( SELECT vid, user_uid, user_uid_type, tags, count(*) impr, sum(all_read) read, 
sum(video_stay) stay, sum(if(video_stay>0, 1, 0)) stay_count FROM 
dm_abtest.stg_live_impression_stats_daily WHERE date='20181129' and vid <> '' 
GROUP BY vid,user_uid,user_uid_type,tags ) t distribute by 
vids,user_uid,user_uid_type,tag_name,tag_value ) t lateral view 
explode(split(vids, ',')) b as vid group by 
vid,user_uid,user_uid_type,tag_name,tag_value ) t group by 
vid,tag_name,tag_value
 # When the deadlock happens, the executor hangs and does nothing.

> deadlock between TaskMemoryManager and BytesToBytesMap$MapIterator
> --
>
> Key: SPARK-26265
> URL: https://issues.apache.org/jira/browse/SPARK-26265
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: qian han
>Priority: Major
>
> The application is running on a cluster with 72000 cores and 182000 GB of memory.
> Environment:
> |spark.dynamicAllocation.minExecutors|5|
> |spark.dynamicAllocation.initialExecutors|30|
> |spark.dynamicAllocation.maxExecutors|400|
> |spark.executor.cores|4|
> |spark.executor.memory|20g|
>  
>   
> Stage description:
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:364)
>  org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:422) 
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:357) 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:193)
>  
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>  sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  java.lang.reflect.Method.invoke(Method.java:498) 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
>  org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198) 
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228) 
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137) 
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>  
> jstack information as follow:
> Found one Java-level deadlock:
> =============================
> "Thread-ScriptTransformation-Feed":
>   waiting to lock monitor 0x00e0cb18 (object 0x0002f1641538, a org.apache.spark.memory.TaskMemoryManager),
>   which is held by "Executor task launch worker for task 18899"
> "Executor task launch worker for task 18899":
>   waiting to lock monitor 0x00e09788 (object 0x000302faa3b0, a org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator),
>   which is held by "Thread-ScriptTransformation-Feed"
>
> Java stack information for the threads listed above:
> ===================================================
> "Thread-ScriptTransformation-Feed":
>   at org.apache.spark.memory.TaskMemoryManager.freePage(TaskMemoryManager.java:332)
>   - waiting to lock <0x0002f1641538> (a org.apache.spark.memory.TaskMemoryManager)
>   at org.apache.spark.memory.MemoryConsumer.freePage(MemoryConsumer.java:130)
>   at org.apache.spark.unsafe.map.BytesToBytesMap.access$300(BytesToBytesMap.java:66)
>   at org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.advanceToNextPage(BytesToBytesMap.java:274)
>   - locked <0x000302faa3b0> (a org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator)
>   at org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.
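
The jstack above shows a classic lock-ordering deadlock: the feed thread holds the MapIterator monitor and waits for the TaskMemoryManager monitor, while the task worker holds them in the opposite order. For illustration only, a minimal Scala sketch of the same pattern, with plain objects standing in for TaskMemoryManager and BytesToBytesMap$MapIterator (this is not Spark's actual code):
{code:scala}
object DeadlockSketch {
  // Stand-ins for the two monitors named in the jstack output.
  private val taskMemoryManager = new Object // plays TaskMemoryManager
  private val mapIterator       = new Object // plays BytesToBytesMap$MapIterator

  def main(args: Array[String]): Unit = {
    // Mirrors "Thread-ScriptTransformation-Feed": locks the iterator, then tries
    // to free a page, which requires the memory manager's monitor.
    val feed = new Thread(() => {
      mapIterator.synchronized {
        Thread.sleep(100) // give the other thread time to grab the manager
        taskMemoryManager.synchronized { println("feed: freed page") }
      }
    }, "Thread-ScriptTransformation-Feed")

    // Mirrors the task launch worker: locks the memory manager (e.g. while
    // spilling) and then calls back into the iterator, which needs its monitor.
    val worker = new Thread(() => {
      taskMemoryManager.synchronized {
        Thread.sleep(100)
        mapIterator.synchronized { println("worker: spilled iterator") }
      }
    }, "Executor task launch worker")

    feed.start(); worker.start()
    feed.join(); worker.join() // never returns: each thread waits on the other's lock
  }
}
{code}
Breaking the cycle requires both code paths to take the two locks in the same order, or to avoid holding one monitor while acquiring the other, which is how this kind of deadlock is normally resolved.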

[jira] [Commented] (SPARK-26288) add initRegisteredExecutorsDB in ExternalShuffleService

2018-12-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16711068#comment-16711068
 ] 

Apache Spark commented on SPARK-26288:
--

User 'weixiuli' has created a pull request for this issue:
https://github.com/apache/spark/pull/23243

> add initRegisteredExecutorsDB in ExternalShuffleService
> ---
>
> Key: SPARK-26288
> URL: https://issues.apache.org/jira/browse/SPARK-26288
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Shuffle
>Affects Versions: 2.4.0
>Reporter: weixiuli
>Priority: Major
> Fix For: 2.4.0
>
>
> As we all know, Spark on YARN uses a DB to record RegisteredExecutors 
> information, so that when the ExternalShuffleService restarts the information 
> can be reloaded and used again.
> However, neither Spark standalone nor Spark on K8s records its 
> RegisteredExecutors information in a DB or anywhere else, so when the 
> ExternalShuffleService restarts that information is lost, which is not what we 
> want.
> This commit adds initRegisteredExecutorsDB, which can be used by either Spark 
> standalone or Spark on K8s to record RegisteredExecutors information, so that 
> when the ExternalShuffleService restarts the information can be reloaded and 
> used again.
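
For illustration, a minimal Scala sketch of the idea being proposed: persist each executor's registration when it arrives and reload the map when the shuffle service starts. The types and the Properties-file store are hypothetical stand-ins chosen to keep the sketch self-contained (the real change would use the same kind of on-disk DB the YARN shuffle service already relies on):
{code:scala}
import java.io.{File, FileInputStream, FileOutputStream}
import java.util.Properties
import scala.collection.concurrent.TrieMap
import scala.jdk.CollectionConverters._

// Hypothetical stand-in for the metadata the shuffle service keeps per executor.
case class ExecutorShuffleInfo(localDirs: String)

// Sketch of a shuffle service whose registrations survive a restart.
class RestartableShuffleService(registeredExecutorsDB: File) {
  private val executors = TrieMap.empty[String, ExecutorShuffleInfo]

  // Reload previously registered executors, mirroring what an
  // initRegisteredExecutorsDB step would do at service startup.
  def init(): Unit = {
    if (registeredExecutorsDB.exists()) {
      val props = new Properties()
      val in = new FileInputStream(registeredExecutorsDB)
      try props.load(in) finally in.close()
      props.asScala.foreach { case (key, dirs) => executors.put(key, ExecutorShuffleInfo(dirs)) }
    }
  }

  // Record a new executor and flush the map back to disk.
  def registerExecutor(appId: String, execId: String, info: ExecutorShuffleInfo): Unit = {
    executors.put(s"$appId:$execId", info)
    val props = new Properties()
    executors.foreach { case (key, i) => props.setProperty(key, i.localDirs) }
    val out = new FileOutputStream(registeredExecutorsDB)
    try props.store(out, "registered executors") finally out.close()
  }
}
{code}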



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26288) add initRegisteredExecutorsDB in ExternalShuffleService

2018-12-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26288:


Assignee: (was: Apache Spark)

> add initRegisteredExecutorsDB in ExternalShuffleService
> ---
>
> Key: SPARK-26288
> URL: https://issues.apache.org/jira/browse/SPARK-26288
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Shuffle
>Affects Versions: 2.4.0
>Reporter: weixiuli
>Priority: Major
> Fix For: 2.4.0
>
>
> As we all know, Spark on YARN uses a DB to record RegisteredExecutors 
> information, so that when the ExternalShuffleService restarts the information 
> can be reloaded and used again.
> However, neither Spark standalone nor Spark on K8s records its 
> RegisteredExecutors information in a DB or anywhere else, so when the 
> ExternalShuffleService restarts that information is lost, which is not what we 
> want.
> This commit adds initRegisteredExecutorsDB, which can be used by either Spark 
> standalone or Spark on K8s to record RegisteredExecutors information, so that 
> when the ExternalShuffleService restarts the information can be reloaded and 
> used again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26288) add initRegisteredExecutorsDB in ExternalShuffleService

2018-12-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26288:


Assignee: Apache Spark

> add initRegisteredExecutorsDB in ExternalShuffleService
> ---
>
> Key: SPARK-26288
> URL: https://issues.apache.org/jira/browse/SPARK-26288
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Shuffle
>Affects Versions: 2.4.0
>Reporter: weixiuli
>Assignee: Apache Spark
>Priority: Major
> Fix For: 2.4.0
>
>
> As we all know, Spark on YARN uses a DB to record RegisteredExecutors 
> information, so that when the ExternalShuffleService restarts the information 
> can be reloaded and used again.
> However, neither Spark standalone nor Spark on K8s records its 
> RegisteredExecutors information in a DB or anywhere else, so when the 
> ExternalShuffleService restarts that information is lost, which is not what we 
> want.
> This commit adds initRegisteredExecutorsDB, which can be used by either Spark 
> standalone or Spark on K8s to record RegisteredExecutors information, so that 
> when the ExternalShuffleService restarts the information can be reloaded and 
> used again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26288) add initRegisteredExecutorsDB in ExternalShuffleService

2018-12-05 Thread weixiuli (JIRA)
weixiuli created SPARK-26288:


 Summary: add initRegisteredExecutorsDB in ExternalShuffleService
 Key: SPARK-26288
 URL: https://issues.apache.org/jira/browse/SPARK-26288
 Project: Spark
  Issue Type: New Feature
  Components: Kubernetes, Shuffle
Affects Versions: 2.4.0
Reporter: weixiuli
 Fix For: 2.4.0


As we all know, Spark on YARN uses a DB to record RegisteredExecutors 
information, so that when the ExternalShuffleService restarts the information 
can be reloaded and used again.

However, neither Spark standalone nor Spark on K8s records its 
RegisteredExecutors information in a DB or anywhere else, so when the 
ExternalShuffleService restarts that information is lost, which is not what we 
want.

This commit adds initRegisteredExecutorsDB, which can be used by either Spark 
standalone or Spark on K8s to record RegisteredExecutors information, so that 
when the ExternalShuffleService restarts the information can be reloaded and 
used again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26182) Cost increases when optimizing scalaUDF

2018-12-05 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16711014#comment-16711014
 ] 

Takeshi Yamamuro commented on SPARK-26182:
--

This is expected behaviour and a known issue, e.g., 
https://issues.apache.org/jira/browse/SPARK-15282. It is not a bug because 
it doesn't affect correctness.

> Cost increases when optimizing scalaUDF
> ---
>
> Key: SPARK-26182
> URL: https://issues.apache.org/jira/browse/SPARK-26182
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.4.0
>Reporter: Jiayi Liao
>Priority: Major
>
> Let's assume that we have a UDF called splitUDF which outputs map data.
>  The SQL
> {code:java}
> select
> g['a'], g['b']
> from
>( select splitUDF(x) as g from table) tbl
> {code}
> will be optimized to the same logical plan as
> {code:java}
> select splitUDF(x)['a'], splitUDF(x)['b'] from table
> {code}
> which means that splitUDF is executed twice instead of once.
> The optimization comes from CollapseProject. 
>  I'm not sure whether this is a bug or not. Please tell me if I'm wrong 
> about this.
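
One commonly suggested workaround for the double evaluation described above is to mark the UDF non-deterministic, since CollapseProject avoids merging projections that share non-deterministic output (worth verifying on your Spark version). A hedged Scala sketch; splitUDF, its parsing logic, and the column name x are assumptions made up for the example:
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object SplitUdfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("split-udf").getOrCreate()
    import spark.implicits._

    val table = Seq("a:1,b:2", "a:3,b:4").toDF("x")

    // Hypothetical splitUDF: parses "a:1,b:2" into Map("a" -> "1", "b" -> "2").
    val splitUDF = udf((s: String) =>
      s.split(",").map(_.split(":")).map(kv => kv(0) -> kv(1)).toMap)

    // Marking the UDF non-deterministic keeps the optimizer from inlining it
    // into the outer projection, so it runs once per row instead of once per key.
    val splitOnce = splitUDF.asNondeterministic()

    table.select(splitOnce($"x").as("g"))
      .select($"g"("a"), $"g"("b"))
      .show()

    spark.stop()
  }
}
{code}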



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26182) Cost increases when optimizing scalaUDF

2018-12-05 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-26182:
-
Issue Type: Improvement  (was: Bug)

> Cost increases when optimizing scalaUDF
> ---
>
> Key: SPARK-26182
> URL: https://issues.apache.org/jira/browse/SPARK-26182
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.4.0
>Reporter: Jiayi Liao
>Priority: Major
>
> Let's assume that we have a UDF called splitUDF which outputs map data.
>  The SQL
> {code:java}
> select
> g['a'], g['b']
> from
>( select splitUDF(x) as g from table) tbl
> {code}
> will be optimized to the same logical plan as
> {code:java}
> select splitUDF(x)['a'], splitUDF(x)['b'] from table
> {code}
> which means that splitUDF is executed twice instead of once.
> The optimization comes from CollapseProject. 
>  I'm not sure whether this is a bug or not. Please tell me if I'm wrong 
> about this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26287) Don't need to create an empty spill file when memory has no records

2018-12-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16711004#comment-16711004
 ] 

Apache Spark commented on SPARK-26287:
--

User 'wangjiaochun' has created a pull request for this issue:
https://github.com/apache/spark/pull/23225

> Don't need to create an empty spill file when memory has no records
> ---
>
> Key: SPARK-26287
> URL: https://issues.apache.org/jira/browse/SPARK-26287
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: wangjiaochun
>Priority: Minor
>
> In the function writeSortedFile of the class ShuffleExternalSorter, if there 
> are no records in memory then we don't need to create an empty spill file.
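
A minimal Scala sketch of the guard being proposed; the types and method shape are hypothetical, since the real change targets the Java method ShuffleExternalSorter.writeSortedFile:
{code:scala}
import java.io.File

object SpillSketch {
  final case class SpillInfo(file: File, numRecords: Long)

  // Skip creating a spill file entirely when the in-memory sorter holds no records.
  def writeSortedFile(numRecordsInMemory: Long, spillDir: File): Option[SpillInfo] = {
    if (numRecordsInMemory == 0L) {
      None // nothing to spill, so no empty file is created
    } else {
      val spillFile = File.createTempFile("spill-", ".data", spillDir)
      // ... sort the in-memory records and stream them into spillFile ...
      Some(SpillInfo(spillFile, numRecordsInMemory))
    }
  }
}
{code}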



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26287) Don't need to create an empty spill file when memory has no records

2018-12-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26287:


Assignee: (was: Apache Spark)

> Don't need to create an empty spill file when memory has no records
> ---
>
> Key: SPARK-26287
> URL: https://issues.apache.org/jira/browse/SPARK-26287
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: wangjiaochun
>Priority: Minor
>
> In the function writeSortedFile of the class ShuffleExternalSorter, if there 
> are no records in memory then we don't need to create an empty spill file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26287) Don't need to create an empty spill file when memory has no records

2018-12-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16711001#comment-16711001
 ] 

Apache Spark commented on SPARK-26287:
--

User 'wangjiaochun' has created a pull request for this issue:
https://github.com/apache/spark/pull/23225

> Don't need to create an empty spill file when memory has no records
> ---
>
> Key: SPARK-26287
> URL: https://issues.apache.org/jira/browse/SPARK-26287
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: wangjiaochun
>Priority: Minor
>
> In the function writeSortedFile of the class ShuffleExternalSorter, if there 
> are no records in memory then we don't need to create an empty spill file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26287) Don't need to create an empty spill file when memory has no records

2018-12-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26287:


Assignee: Apache Spark

> Don't need to create an empty spill file when memory has no records
> ---
>
> Key: SPARK-26287
> URL: https://issues.apache.org/jira/browse/SPARK-26287
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: wangjiaochun
>Assignee: Apache Spark
>Priority: Minor
>
> In the function writeSortedFile of the class ShuffleExternalSorter, if there 
> are no records in memory then we don't need to create an empty spill file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26287) Don't need to create an empty spill file when memory has no records

2018-12-05 Thread wangjiaochun (JIRA)
wangjiaochun created SPARK-26287:


 Summary: Don't need to create an empty spill file when memory has 
no records
 Key: SPARK-26287
 URL: https://issues.apache.org/jira/browse/SPARK-26287
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: wangjiaochun


In the function writeSortedFile of the class ShuffleExternalSorter, if there are 
no records in memory then we don't need to create an empty spill file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26286) Add MAXIMUM_PAGE_SIZE_BYTES Exception unit test

2018-12-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26286:


Assignee: (was: Apache Spark)

> Add MAXIMUM_PAGE_SIZE_BYTES Exception unit test
> ---
>
> Key: SPARK-26286
> URL: https://issues.apache.org/jira/browse/SPARK-26286
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: wangjiaochun
>Priority: Minor
>
> Add a bounds-checking unit test for the MAXIMUM_PAGE_SIZE_BYTES exception.
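
A hedged ScalaTest sketch of what such a bounds check could look like. The allocator below is a hypothetical stand-in for TaskMemoryManager.allocatePage, and the constant's value is illustrative:
{code:scala}
import org.scalatest.funsuite.AnyFunSuite

object PageAllocator {
  val MAXIMUM_PAGE_SIZE_BYTES: Long = ((1L << 31) - 1) * 8L

  def allocatePage(size: Long): Long = {
    require(size <= MAXIMUM_PAGE_SIZE_BYTES,
      s"Cannot allocate a page with more than $MAXIMUM_PAGE_SIZE_BYTES bytes")
    size // dummy "address"; a real allocator would hand back a memory block
  }
}

class MaxPageSizeSuite extends AnyFunSuite {
  test("requesting a page above MAXIMUM_PAGE_SIZE_BYTES throws") {
    val tooBig = PageAllocator.MAXIMUM_PAGE_SIZE_BYTES + 1
    val e = intercept[IllegalArgumentException] {
      PageAllocator.allocatePage(tooBig)
    }
    assert(e.getMessage.contains("Cannot allocate a page"))
  }
}
{code}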



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26286) Add MAXIMUM_PAGE_SIZE_BYTES Exception unit test

2018-12-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26286:


Assignee: Apache Spark

> Add MAXIMUM_PAGE_SIZE_BYTES Exception unit test
> ---
>
> Key: SPARK-26286
> URL: https://issues.apache.org/jira/browse/SPARK-26286
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: wangjiaochun
>Assignee: Apache Spark
>Priority: Minor
>
> Add a bounds-checking unit test for the MAXIMUM_PAGE_SIZE_BYTES exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26286) Add MAXIMUM_PAGE_SIZE_BYTES Exception unit test

2018-12-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710997#comment-16710997
 ] 

Apache Spark commented on SPARK-26286:
--

User 'wangjiaochun' has created a pull request for this issue:
https://github.com/apache/spark/pull/23226

> Add MAXIMUM_PAGE_SIZE_BYTES Exception unit test
> ---
>
> Key: SPARK-26286
> URL: https://issues.apache.org/jira/browse/SPARK-26286
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: wangjiaochun
>Priority: Minor
>
> Add a bounds-checking unit test for the MAXIMUM_PAGE_SIZE_BYTES exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26286) Add MAXIMUM_PAGE_SIZE_BYTES Exception unit test

2018-12-05 Thread wangjiaochun (JIRA)
wangjiaochun created SPARK-26286:


 Summary: Add MAXIMUM_PAGE_SIZE_BYTES Exception unit test
 Key: SPARK-26286
 URL: https://issues.apache.org/jira/browse/SPARK-26286
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 2.4.0
Reporter: wangjiaochun


Add a bounds-checking unit test for the MAXIMUM_PAGE_SIZE_BYTES exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26285) Add a metric source for accumulators (aka AccumulatorSource)

2018-12-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26285:


Assignee: (was: Apache Spark)

> Add a metric source for accumulators (aka AccumulatorSource)
> 
>
> Key: SPARK-26285
> URL: https://issues.apache.org/jira/browse/SPARK-26285
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Alessandro Bellina
>Priority: Minor
>
> We'd like a simple mechanism to register spark accumulators against the 
> codahale metrics registry. 
> This task proposes adding a LongAccumulatorSource and a 
> DoubleAccumulatorSource.
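
For illustration, a hedged Scala sketch of the idea: expose a LongAccumulator's current value as a Codahale Gauge in a MetricRegistry so any configured reporter can scrape it. The registration helper and metric names are made up for the example; the actual proposal wires this through Spark's metrics Source interface:
{code:scala}
import com.codahale.metrics.{Gauge, MetricRegistry}
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.LongAccumulator

object AccumulatorSourceSketch {
  // Register an accumulator's value as a gauge so a metrics reporter can read it.
  def register(registry: MetricRegistry, name: String, acc: LongAccumulator): Unit = {
    registry.register(name, new Gauge[Long] {
      override def getValue: Long = acc.value
    })
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("acc-source").getOrCreate()
    val registry = new MetricRegistry()

    val rowsSeen = spark.sparkContext.longAccumulator("rowsSeen")
    register(registry, "my.app.rowsSeen", rowsSeen)

    spark.sparkContext.parallelize(1 to 1000).foreach(_ => rowsSeen.add(1))
    println(registry.getGauges.get("my.app.rowsSeen").getValue) // 1000

    spark.stop()
  }
}
{code}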



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26285) Add a metric source for accumulators (aka AccumulatorSource)

2018-12-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26285:


Assignee: Apache Spark

> Add a metric source for accumulators (aka AccumulatorSource)
> 
>
> Key: SPARK-26285
> URL: https://issues.apache.org/jira/browse/SPARK-26285
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Alessandro Bellina
>Assignee: Apache Spark
>Priority: Minor
>
> We'd like a simple mechanism to register spark accumulators against the 
> codahale metrics registry. 
> This task proposes adding a LongAccumulatorSource and a 
> DoubleAccumulatorSource.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26285) Add a metric source for accumulators (aka AccumulatorSource)

2018-12-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710975#comment-16710975
 ] 

Apache Spark commented on SPARK-26285:
--

User 'abellina' has created a pull request for this issue:
https://github.com/apache/spark/pull/23242

> Add a metric source for accumulators (aka AccumulatorSource)
> 
>
> Key: SPARK-26285
> URL: https://issues.apache.org/jira/browse/SPARK-26285
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Alessandro Bellina
>Priority: Minor
>
> We'd like a simple mechanism to register spark accumulators against the 
> codahale metrics registry. 
> This task proposes adding a LongAccumulatorSource and a 
> DoubleAccumulatorSource.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26285) Add a metric source for accumulators (aka AccumulatorSource)

2018-12-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710974#comment-16710974
 ] 

Apache Spark commented on SPARK-26285:
--

User 'abellina' has created a pull request for this issue:
https://github.com/apache/spark/pull/23242

> Add a metric source for accumulators (aka AccumulatorSource)
> 
>
> Key: SPARK-26285
> URL: https://issues.apache.org/jira/browse/SPARK-26285
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Alessandro Bellina
>Priority: Minor
>
> We'd like a simple mechanism to register spark accumulators against the 
> codahale metrics registry. 
> This task proposes adding a LongAccumulatorSource and a 
> DoubleAccumulatorSource.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26285) Add a metric source for accumulators (aka AccumulatorSource)

2018-12-05 Thread Alessandro Bellina (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710970#comment-16710970
 ] 

Alessandro Bellina commented on SPARK-26285:


I can't assign this issue, but I am putting up a PR for it.

> Add a metric source for accumulators (aka AccumulatorSource)
> 
>
> Key: SPARK-26285
> URL: https://issues.apache.org/jira/browse/SPARK-26285
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Alessandro Bellina
>Priority: Minor
>
> We'd like a simple mechanism to register spark accumulators against the 
> codahale metrics registry. 
> This task proposes adding a LongAccumulatorSource and a 
> DoubleAccumulatorSource.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26285) Add a metric source for accumulators (aka AccumulatorSource)

2018-12-05 Thread Alessandro Bellina (JIRA)
Alessandro Bellina created SPARK-26285:
--

 Summary: Add a metric source for accumulators (aka 
AccumulatorSource)
 Key: SPARK-26285
 URL: https://issues.apache.org/jira/browse/SPARK-26285
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Alessandro Bellina


We'd like a simple mechanism to register spark accumulators against the 
codahale metrics registry. 

This task proposes adding a LongAccumulatorSource and a DoubleAccumulatorSource.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26261) Spark does not check completeness temporary file

2018-12-05 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710902#comment-16710902
 ] 

Hyukjin Kwon commented on SPARK-26261:
--

It would be easier to verify if the code were posted here, so that other 
people could work on this if you're not going to work on it. 

> Spark does not check completeness temporary file 
> -
>
> Key: SPARK-26261
> URL: https://issues.apache.org/jira/browse/SPARK-26261
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Jialin LIu
>Priority: Minor
>
> Spark does not check temporary files' completeness. When persisting to disk 
> is enabled on some RDDs, a bunch of temporary files will be created in the 
> blockmgr folder. The block manager is able to detect missing blocks, but it is 
> not able to detect file content being modified during execution. 
> Our initial test shows that if we truncate a block file before it is used 
> by executors, the program finishes without detecting any error, but the 
> result content is totally wrong.
> We believe there should be a checksum for every RDD block file, and these 
> files should be protected by that checksum.
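
For illustration, a hedged Scala sketch of the kind of protection being suggested: record a CRC32 when a block file is written and verify it before the block is read back. The API here is made up; it is not the block manager's interface:
{code:scala}
import java.io.{File, FileInputStream, FileOutputStream}
import java.util.zip.CRC32

object BlockChecksumSketch {
  // Compute a CRC32 over a block file's contents.
  def checksum(file: File): Long = {
    val crc = new CRC32()
    val in = new FileInputStream(file)
    try {
      val buf = new Array[Byte](8192)
      var n = in.read(buf)
      while (n != -1) { crc.update(buf, 0, n); n = in.read(buf) }
    } finally in.close()
    crc.getValue
  }

  // Write a block and return the checksum to store alongside it (e.g. in block metadata).
  def writeBlock(file: File, data: Array[Byte]): Long = {
    val out = new FileOutputStream(file)
    try out.write(data) finally out.close()
    checksum(file)
  }

  // Detects truncation or modification that a missing-file check would not catch.
  def readBlock(file: File, expectedCrc: Long): Array[Byte] = {
    require(checksum(file) == expectedCrc, s"Block file ${file.getName} is corrupt or truncated")
    java.nio.file.Files.readAllBytes(file.toPath)
  }
}
{code}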



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors

2018-12-05 Thread Alan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710898#comment-16710898
 ] 

Alan commented on SPARK-12312:
--

I agree! Can we please get this implemented as soon as possible?  This prevents 
us from being compliant with our internal security policies.  

> JDBC connection to Kerberos secured databases fails on remote executors
> ---
>
> Key: SPARK-12312
> URL: https://issues.apache.org/jira/browse/SPARK-12312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: nabacg
>Priority: Minor
>
> When loading DataFrames from a JDBC data source with Kerberos authentication, 
> remote executors (yarn-client/cluster etc. modes) fail to establish a 
> connection because they lack a Kerberos ticket or the ability to generate one. 
> This is a real issue when trying to ingest data from kerberized data sources 
> (SQL Server, Oracle) in an enterprise environment where exposing simple 
> authentication access is not an option due to IT policy.
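
A common workaround, sketched below with hypothetical principal, keytab, table, and URL, is to ship a keytab to the executors and log in with UserGroupInformation inside the partition-processing code before opening the JDBC connection, so each executor obtains its own ticket instead of relying on the driver's. Whether the login is actually honoured depends on the JDBC driver's Kerberos/GSS support, so treat this as a sketch rather than a guaranteed recipe:
{code:scala}
import java.security.PrivilegedExceptionAction
import java.sql.DriverManager

import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.sql.SparkSession

object KerberizedJdbcSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kerberized-jdbc").getOrCreate()

    // Hypothetical values; the keytab must be distributed to every executor
    // (e.g. with --files client.keytab) so the relative path resolves there.
    val principal = "etl@EXAMPLE.COM"
    val keytab    = "client.keytab"
    val url       = "jdbc:sqlserver://db.example.com;databaseName=sales;" +
      "integratedSecurity=true;authenticationScheme=JavaKerberos"

    val counts = spark.sparkContext.parallelize(1 to 4, 4).mapPartitions { _ =>
      // Runs on the executor: log in from the keytab and open the connection
      // inside doAs so the JDBC driver sees a Kerberos-authenticated Subject.
      val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
      val count = ugi.doAs(new PrivilegedExceptionAction[Long] {
        override def run(): Long = {
          val conn = DriverManager.getConnection(url)
          try {
            val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM dbo.orders")
            rs.next()
            rs.getLong(1)
          } finally conn.close()
        }
      })
      Iterator.single(count)
    }
    counts.collect().foreach(println)
    spark.stop()
  }
}
{code}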



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26261) Spark does not check completeness temporary file

2018-12-05 Thread Jialin LIu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710883#comment-16710883
 ] 

Jialin LIu commented on SPARK-26261:


Our initial test is:

We start a word count workflow including persisting blocks to disk. After we 
make sure that there are some blocks on the disk already, we use the truncate 
command to truncate part of the block. We compare the result with the result 
produced by the workflow without fault injection. 

> Spark does not check completeness temporary file 
> -
>
> Key: SPARK-26261
> URL: https://issues.apache.org/jira/browse/SPARK-26261
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Jialin LIu
>Priority: Minor
>
> Spark does not check temporary files' completeness. When persisting to disk 
> is enabled on some RDDs, a bunch of temporary files will be created in the 
> blockmgr folder. The block manager is able to detect missing blocks, but it is 
> not able to detect file content being modified during execution. 
> Our initial test shows that if we truncate a block file before it is used 
> by executors, the program finishes without detecting any error, but the 
> result content is totally wrong.
> We believe there should be a checksum for every RDD block file, and these 
> files should be protected by that checksum.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26275) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction

2018-12-05 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26275.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23236
[https://github.com/apache/spark/pull/23236]

> Flaky test: pyspark.mllib.tests.test_streaming_algorithms 
> StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
> --
>
> Key: SPARK-26275
> URL: https://issues.apache.org/jira/browse/SPARK-26275
> Project: Spark
>  Issue Type: Test
>  Components: MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> Looks like this test is flaky:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99704/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99569/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99644/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99548/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99454/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99609/console
> {code}
> ==
> FAIL: test_training_and_prediction 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> Test that the model improves on toy data with no. of batches
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 367, in test_training_and_prediction
> self._eventually(condition)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 78, in _eventually
> % (timeout, lastValue))
> AssertionError: Test failed due to timeout after 30 sec, with last condition 
> returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 
> 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74
> --
> Ran 13 tests in 185.051s
> FAILED (failures=1, skipped=1)
> {code}
> This looks to be happening after increasing the parallelism in Jenkins to speed 
> builds up. I am able to reproduce this manually when resource usage is heavy, 
> by manually decreasing the timeout.
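
The _eventually helper in the traceback polls a condition until it holds or a timeout expires, which is why a fixed 30-second deadline becomes fragile when the Jenkins workers are heavily loaded. A hedged Scala sketch of that polling pattern (not the PySpark helper itself, which lives in test_streaming_algorithms.py):
{code:scala}
object Eventually {
  // Poll `condition` until it returns true or `timeoutMs` elapses.
  def eventually(timeoutMs: Long, intervalMs: Long = 100L)(condition: () => Boolean): Unit = {
    val deadline = System.currentTimeMillis() + timeoutMs
    var ok = condition()
    while (!ok && System.currentTimeMillis() < deadline) {
      Thread.sleep(intervalMs)
      ok = condition()
    }
    if (!ok) {
      throw new AssertionError(s"Test failed due to timeout after ${timeoutMs / 1000} sec")
    }
  }
}
{code}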



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26275) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction

2018-12-05 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-26275:


Assignee: Hyukjin Kwon

> Flaky test: pyspark.mllib.tests.test_streaming_algorithms 
> StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
> --
>
> Key: SPARK-26275
> URL: https://issues.apache.org/jira/browse/SPARK-26275
> Project: Spark
>  Issue Type: Test
>  Components: MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> Looks like this test is flaky:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99704/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99569/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99644/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99548/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99454/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99609/console
> {code}
> ==
> FAIL: test_training_and_prediction 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> Test that the model improves on toy data with no. of batches
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 367, in test_training_and_prediction
> self._eventually(condition)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 78, in _eventually
> % (timeout, lastValue))
> AssertionError: Test failed due to timeout after 30 sec, with last condition 
> returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 
> 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74
> --
> Ran 13 tests in 185.051s
> FAILED (failures=1, skipped=1)
> {code}
> This looks to be happening after increasing the parallelism in Jenkins to speed 
> builds up. I am able to reproduce this manually when resource usage is heavy, 
> by manually decreasing the timeout.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-25148) Executors launched with Spark on K8s client mode should prefix name with spark.app.name

2018-12-05 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reopened SPARK-25148:


> Executors launched with Spark on K8s client mode should prefix name with 
> spark.app.name
> ---
>
> Key: SPARK-25148
> URL: https://issues.apache.org/jira/browse/SPARK-25148
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Timothy Chen
>Priority: Major
>
> With the recently added client mode for Spark on K8s, executors are all named 
> "spark-exec-#" by default. This means that when multiple jobs are launched in 
> the same cluster, they often have to retry to find unused pod names, and it is 
> hard to correlate which executors were launched for which Spark app. The 
> workaround is to manually set the executor prefix configuration for each job.
> Ideally the experience should be the same as in cluster mode, where each 
> executor name is prefixed with spark.app.name by default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25148) Executors launched with Spark on K8s client mode should prefix name with spark.app.name

2018-12-05 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710726#comment-16710726
 ] 

Marcelo Vanzin commented on SPARK-25148:


Actually there was a separate bug for the same issue. Duping...

> Executors launched with Spark on K8s client mode should prefix name with 
> spark.app.name
> ---
>
> Key: SPARK-25148
> URL: https://issues.apache.org/jira/browse/SPARK-25148
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Timothy Chen
>Priority: Major
>
> With the recently added client mode for Spark on K8s, executors are all named 
> "spark-exec-#" by default. This means that when multiple jobs are launched in 
> the same cluster, they often have to retry to find unused pod names, and it is 
> hard to correlate which executors were launched for which Spark app. The 
> workaround is to manually set the executor prefix configuration for each job.
> Ideally the experience should be the same as in cluster mode, where each 
> executor name is prefixed with spark.app.name by default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25148) Executors launched with Spark on K8s client mode should prefix name with spark.app.name

2018-12-05 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25148.

Resolution: Duplicate

> Executors launched with Spark on K8s client mode should prefix name with 
> spark.app.name
> ---
>
> Key: SPARK-25148
> URL: https://issues.apache.org/jira/browse/SPARK-25148
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Timothy Chen
>Priority: Major
>
> With the recently added client mode for Spark on K8s, executors are all named 
> "spark-exec-#" by default. This means that when multiple jobs are launched in 
> the same cluster, they often have to retry to find unused pod names, and it is 
> hard to correlate which executors were launched for which Spark app. The 
> workaround is to manually set the executor prefix configuration for each job.
> Ideally the experience should be the same as in cluster mode, where each 
> executor name is prefixed with spark.app.name by default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25148) Executors launched with Spark on K8s client mode should prefix name with spark.app.name

2018-12-05 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25148.

Resolution: Cannot Reproduce

This seems to work for me locally. Executor pods are prefixed with a unique 
identifier based on the app name, unless overridden with 
{{spark.kubernetes.executor.podNamePrefix}}.

> Executors launched with Spark on K8s client mode should prefix name with 
> spark.app.name
> ---
>
> Key: SPARK-25148
> URL: https://issues.apache.org/jira/browse/SPARK-25148
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Timothy Chen
>Priority: Major
>
> With the recently added client mode for Spark on K8s, executors are all named 
> "spark-exec-#" by default. This means that when multiple jobs are launched in 
> the same cluster, they often have to retry to find unused pod names, and it is 
> hard to correlate which executors were launched for which Spark app. The 
> workaround is to manually set the executor prefix configuration for each job.
> Ideally the experience should be the same as in cluster mode, where each 
> executor name is prefixed with spark.app.name by default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26282) Update JVM to 8u191 on jenkins workers

2018-12-05 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710630#comment-16710630
 ] 

shane knapp edited comment on SPARK-26282 at 12/5/18 9:02 PM:
--

and the centos workers are updated:
{noformat}
[ sknapp@amp-jenkins-master ] [ ~ ]
$ pssh -h jenkins_workers.txt -i "PATH=/usr/java/jdk1.8.0_191/bin:$PATH; java 
-version"
[1] 12:57:19 [SUCCESS] amp-jenkins-worker-04
Stderr: java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
[2] 12:57:19 [SUCCESS] amp-jenkins-worker-02
Stderr: java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
[3] 12:57:19 [SUCCESS] amp-jenkins-worker-03
Stderr: java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
[4] 12:57:19 [SUCCESS] amp-jenkins-worker-06
Stderr: java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
[5] 12:57:19 [SUCCESS] amp-jenkins-worker-05
Stderr: java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
[6] 12:57:19 [SUCCESS] amp-jenkins-worker-01
Stderr: java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode){noformat}
i have a PR open to update the jenkins job configs, and once that's approved 
i'll deploy it immediately.


was (Author: shaneknapp):
and the centos workers are updated:
{noformat}
[ sknapp@amp-jenkins-master ] [ ~ ]
$ pssh -h jenkins_workers.txt -i "PATH=/usr/java/jdk1.8.0_191/bin:$PATH; java 
-version"
[1] 12:57:19 [SUCCESS] amp-jenkins-worker-04
Stderr: java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
[2] 12:57:19 [SUCCESS] amp-jenkins-worker-02
Stderr: java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
[3] 12:57:19 [SUCCESS] amp-jenkins-worker-03
Stderr: java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
[4] 12:57:19 [SUCCESS] amp-jenkins-worker-06
Stderr: java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
[5] 12:57:19 [SUCCESS] amp-jenkins-worker-05
Stderr: java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
[6] 12:57:19 [SUCCESS] amp-jenkins-worker-01
Stderr: java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode){noformat}
i have a PR open to update the jenkins job configs, and once that's approved 
i'll deploy that immediately.

> Update JVM to 8u191 on jenkins workers
> --
>
> Key: SPARK-26282
> URL: https://issues.apache.org/jira/browse/SPARK-26282
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> the jvm we're using to build/test spark on the centos workers is a bit... 
> long in the tooth:
> {noformat}
> [sknapp@amp-jenkins-worker-04 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat}
> on the ubuntu nodes, it's only a little bit less old:
> {noformat}
> sknapp@amp-jenkins-staging-worker-01:~$ java -version
> java version "1.8.0_171"
> Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> steps to update on centos:
>  * manually install new(er) java
>  * update /etc/alternatives
>  * update JJB configs and update JAVA_HOME/JAVA_BIN
> steps to update on ubuntu:
>  * update ansible to install newer java
>  * deploy ansible
> questions:
>  * do we stick w/java8 for now?
>  * which version is sufficient?
> [~srowen]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26282) Update JVM to 8u191 on jenkins workers

2018-12-05 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710630#comment-16710630
 ] 

shane knapp commented on SPARK-26282:
-

and the centos workers are updated:
{noformat}
[ sknapp@amp-jenkins-master ] [ ~ ]
$ pssh -h jenkins_workers.txt -i "PATH=/usr/java/jdk1.8.0_191/bin:$PATH; java 
-version"
[1] 12:57:19 [SUCCESS] amp-jenkins-worker-04
Stderr: java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
[2] 12:57:19 [SUCCESS] amp-jenkins-worker-02
Stderr: java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
[3] 12:57:19 [SUCCESS] amp-jenkins-worker-03
Stderr: java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
[4] 12:57:19 [SUCCESS] amp-jenkins-worker-06
Stderr: java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
[5] 12:57:19 [SUCCESS] amp-jenkins-worker-05
Stderr: java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
[6] 12:57:19 [SUCCESS] amp-jenkins-worker-01
Stderr: java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode){noformat}
i have a PR open to update the jenkins job configs, and once that's approved 
i'll deploy that immediately.

> Update JVM to 8u191 on jenkins workers
> --
>
> Key: SPARK-26282
> URL: https://issues.apache.org/jira/browse/SPARK-26282
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> the jvm we're using to build/test spark on the centos workers is a bit... 
> long in the tooth:
> {noformat}
> [sknapp@amp-jenkins-worker-04 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat}
> on the ubuntu nodes, it's only a little bit less old:
> {noformat}
> sknapp@amp-jenkins-staging-worker-01:~$ java -version
> java version "1.8.0_171"
> Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> steps to update on centos:
>  * manually install new(er) java
>  * update /etc/alternatives
>  * update JJB configs and update JAVA_HOME/JAVA_BIN
> steps to update on ubuntu:
>  * update ansible to install newer java
>  * deploy ansible
> questions:
>  * do we stick w/java8 for now?
>  * which version is sufficient?
> [~srowen]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26281) Duration column of task table should be executor run time instead of real duration

2018-12-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710622#comment-16710622
 ] 

Apache Spark commented on SPARK-26281:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/23160

> Duration column of task table should be executor run time instead of real 
> duration
> --
>
> Key: SPARK-26281
> URL: https://issues.apache.org/jira/browse/SPARK-26281
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In PR https://github.com/apache/spark/pull/23081/ , the duration column was 
> changed to executor run time. That behavior is consistent with the summary 
> metrics table and previous Spark versions.
> However, after PR https://github.com/apache/spark/pull/21688, the issue can 
> be reproduced again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions

2018-12-05 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26233:
--
Fix Version/s: 2.4.1
   2.3.3
   2.2.3

> Incorrect decimal value with java beans and first/last/max... functions
> ---
>
> Key: SPARK-26233
> URL: https://issues.apache.org/jira/browse/SPARK-26233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: Miquel Canes
>Assignee: Marco Gaido
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.2.3, 2.3.3, 2.4.1, 3.0.0
>
>
> Decimal values from Java beans are incorrectly scaled when used with 
> functions like first/last/max...
> This problem arises because Encoders.bean always sets Decimal values as 
> _DecimalType(this.MAX_PRECISION(), 18)._
> Usually it's not a problem if you use numeric functions like *sum*, but for 
> functions like *first*/*last*/*max*... it is a problem.
> How to reproduce this error:
> Using this class as an example:
> {code:java}
> public class Foo implements Serializable {
>   private String group;
>   private BigDecimal var;
>   public BigDecimal getVar() {
> return var;
>   }
>   public void setVar(BigDecimal var) {
> this.var = var;
>   }
>   public String getGroup() {
> return group;
>   }
>   public void setGroup(String group) {
> this.group = group;
>   }
> }
> {code}
>  
> And a dummy code to create some objects:
> {code:java}
> Dataset ds = spark.range(5)
> .map(l -> {
>   Foo foo = new Foo();
>   foo.setGroup("" + l);
>   foo.setVar(BigDecimal.valueOf(l + 0.));
>   return foo;
> }, Encoders.bean(Foo.class));
> ds.printSchema();
> ds.show();
> +-+--+
> |group| var|
> +-+--+
> | 0|0.|
> | 1|1.|
> | 2|2.|
> | 3|3.|
> | 4|4.|
> +-+--+
> {code}
> We can see that the DecimalType has precision 38 and scale 18 and all values 
> are shown correctly.
> But if we use the first function, they are scaled incorrectly:
> {code:java}
> ds.groupBy(col("group"))
> .agg(
> first("var")
> )
> .show();
> +-+-+
> |group|first(var, false)|
> +-+-+
> | 3| 3.E-14|
> | 0| 1.111E-15|
> | 1| 1.E-14|
> | 4| 4.E-14|
> | 2| 2.E-14|
> +-+-+
> {code}
> This incorrect behavior cannot be reproduced if we use "numerical" functions 
> like sum or if the column is cast to a new DecimalType.
> {code:java}
> ds.groupBy(col("group"))
> .agg(
> sum("var")
> )
> .show();
> +-++
> |group| sum(var)|
> +-++
> | 3|3.00|
> | 0|0.00|
> | 1|1.00|
> | 4|4.00|
> | 2|2.00|
> +-++
> ds.groupBy(col("group"))
> .agg(
> first(col("var").cast(new DecimalType(38, 8)))
> )
> .show();
> +-++
> |group|first(CAST(var AS DECIMAL(38,8)), false)|
> +-++
> | 3| 3.|
> | 0| 0.|
> | 1| 1.|
> | 4| 4.|
> | 2| 2.|
> +-++
> {code}
>    
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26284) Spark History server object vs file storage behavior difference

2018-12-05 Thread Damien Doucet-Girard (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damien Doucet-Girard updated SPARK-26284:
-
Description: 
I am using the spark history server in order to view running/complete jobs on 
spark using the kubernetes scheduling backend introduced in 2.3.0. Using a 
local file path in both {{spark.eventLog.dir}} and 
{{spark.history.fs.logDirectory}}, I have no issue seeing both incomplete and 
completed tasks, with {{.inprogress}} files being flushed regularly. However, 
when using an {{s3a://}} path, it seems the calls to flush the file 
([https://github.com/apache/spark/blob/dd518a196c2d40ae48034b8b0950d1c8045c02ed/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L152-L154)]
 don't actually upload the file to s3. Due to this, I am unable to see 
currently incomplete tasks using an s3a path.

>From the behavior I've observed, it only uploads on completion of the task 
>(hadoop 2.7) or upon the log file filling up the block size set for s3a 
>{{spark.hadoop.fs.s3a.multipart.size}} (hadoop 3.0.0). Is this intended 
>behavior?

  was:
I am using the spark history server in order to view running/complete jobs on 
spark using the kubernetes scheduling backend introduced in 2.3.0. Using a 
local file path in both `spark.eventLog.dir` and 
`spark.history.fs.logDirectory`, I have no issue seeing both incomplete and 
completed tasks, with `.inprogress` files being flushed regularly. However, 
when using an `s3a://` path, it seems the calls to flush the file 
([https://github.com/apache/spark/blob/dd518a196c2d40ae48034b8b0950d1c8045c02ed/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L152-L154])
don't actually upload the file to s3. Due to this, I am unable to see 
currently incomplete tasks using an s3a path.

From the behavior I've observed, it only uploads on completion of the task 
(hadoop 2.7) or upon the log file filling up the block size set for s3a 
`spark.hadoop.fs.s3a.multipart.size` (hadoop 3.0.0). Is this intended behavior?


> Spark History server object vs file storage behavior difference
> ---
>
> Key: SPARK-26284
> URL: https://issues.apache.org/jira/browse/SPARK-26284
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Damien Doucet-Girard
>Priority: Minor
>
> I am using the spark history server in order to view running/complete jobs on 
> spark using the kubernetes scheduling backend introduced in 2.3.0. Using a 
> local file path in both {{spark.eventLog.dir}} and 
> {{spark.history.fs.logDirectory}}, I have no issue seeing both incomplete and 
> completed tasks, with {{.inprogress}} files being flushed regularly. However, 
> when using an {{s3a://}} path, it seems the calls to flush the file 
> ([https://github.com/apache/spark/blob/dd518a196c2d40ae48034b8b0950d1c8045c02ed/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L152-L154])
> don't actually upload the file to s3. Due to this, I am unable to see 
> currently incomplete tasks using an s3a path.
> From the behavior I've observed, it only uploads on completion of the task 
> (hadoop 2.7) or upon the log file filling up the block size set for s3a 
> {{spark.hadoop.fs.s3a.multipart.size}} (hadoop 3.0.0). Is this intended 
> behavior?
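A minimal sketch (bucket name and sizes are assumptions, not from this ticket) of the 
configuration being described: event logs written to an s3a:// path that the history 
server also reads. With this setup the {{.inprogress}} file only becomes visible in S3 
once it is closed or a multipart part is uploaded, which is the behavior being asked 
about.

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-event-log-example")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "s3a://my-bucket/spark-events")    // assumed bucket
  .config("spark.hadoop.fs.s3a.multipart.size", "104857600")       // 100 MB parts
  .getOrCreate()

spark.range(1000000L).selectExpr("sum(id)").show()
spark.stop()

// The history server would point spark.history.fs.logDirectory at the same
// s3a://my-bucket/spark-events path.
{code}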



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26284) Spark History server object vs file storage behavior difference

2018-12-05 Thread Damien Doucet-Girard (JIRA)
Damien Doucet-Girard created SPARK-26284:


 Summary: Spark History server object vs file storage behavior 
difference
 Key: SPARK-26284
 URL: https://issues.apache.org/jira/browse/SPARK-26284
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Damien Doucet-Girard


I am using the spark history server in order to view running/complete jobs on 
spark using the kubernetes scheduling backend introduced in 2.3.0. Using a 
local file path in both `spark.eventLog.dir` and 
`spark.history.fs.logDirectory`, I have no issue seeing both incomplete and 
completed tasks, with `.inprogress` files being flushed regularly. However, 
when using an `s3a://` path, it seems the calls to flush the file 
([https://github.com/apache/spark/blob/dd518a196c2d40ae48034b8b0950d1c8045c02ed/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L152-L154])
don't actually upload the file to s3. Due to this, I am unable to see 
currently incomplete tasks using an s3a path.

From the behavior I've observed, it only uploads on completion of the task 
(hadoop 2.7) or upon the log file filling up the block size set for s3a 
`spark.hadoop.fs.s3a.multipart.size` (hadoop 3.0.0). Is this intended behavior?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26282) Update JVM to 8u191 on jenkins workers

2018-12-05 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26282:
--
Summary: Update JVM to 8u191 on jenkins workers  (was: update jvm on 
jenkins workers)

> Update JVM to 8u191 on jenkins workers
> --
>
> Key: SPARK-26282
> URL: https://issues.apache.org/jira/browse/SPARK-26282
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> the jvm we're using to build/test spark on the centos workers is a bit...  
> long in the teeth:
> {noformat}
> [sknapp@amp-jenkins-worker-04 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat}
> on the ubuntu nodes, it's only a little bit less old:
> {noformat}
> sknapp@amp-jenkins-staging-worker-01:~$ java -version
> java version "1.8.0_171"
> Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> steps to update on centos:
>  * manually install new(er) java
>  * update /etc/alternatives
>  * update JJB configs and update JAVA_HOME/JAVA_BIN
> steps to update on ubuntu:
>  * update ansible to install newer java
>  * deploy ansible
> questions:
>  * do we stick w/java8 for now?
>  * which version is sufficient?
> [~srowen]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25919) Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned

2018-12-05 Thread Pawan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pawan resolved SPARK-25919.
---
Resolution: Fixed

This was fixed by Hive in a later version of the hive-exec jar, which is not 
currently used by Spark.

https://issues.apache.org/jira/browse/HIVE-11771

> Date value corrupts when tables are "ParquetHiveSerDe" formatted and target 
> table is Partitioned
> 
>
> Key: SPARK-25919
> URL: https://issues.apache.org/jira/browse/SPARK-25919
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, Spark Submit
>Affects Versions: 2.1.0, 2.2.1
>Reporter: Pawan
>Priority: Blocker
>
> Hi
> I found a really strange issue. Below are the steps to reproduce it. This 
> issue occurs only when the table row format is ParquetHiveSerDe and the 
> target table is Partitioned
> *Hive:*
> Log in to the hive terminal on the cluster and create the tables below.
> {code:java}
> create table t_src(
> name varchar(10),
> dob timestamp
> )
> ROW FORMAT SERDE 
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
> STORED AS INPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
> OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> create table t_tgt(
> name varchar(10),
> dob timestamp
> )
> PARTITIONED BY (city varchar(10))
> ROW FORMAT SERDE 
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
> STORED AS INPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
> OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
> {code}
> Insert data into the source table (t_src)
> {code:java}
> INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 
> 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 
> 00:00:00.0');{code}
> *Spark-shell:*
> Get on to spark-shell. 
> Execute below commands on spark shell:
> {code:java}
> import org.apache.spark.sql.hive.HiveContext
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> val q0 = "TRUNCATE table t_tgt"
> val q1 = "SELECT CAST(alias.name AS STRING) as a0, alias.dob as a1 FROM 
> DEFAULT.t_src alias"
> val q2 = "INSERT INTO TABLE DEFAULT.t_tgt PARTITION (city) SELECT tbl0.a0 as 
> c0, tbl0.a1 as c1, NULL as c2 FROM tbl0"
> sqlContext.sql(q0)
> sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0")
> sqlContext.sql(q2)
> {code}
>  After this, check the contents of the target table t_tgt. You will see the date 
> "0001-01-01 00:00:00" changed to "0002-01-01 00:00:00". The snippets below show 
> the contents of both tables:
> {code:java}
> select * from t_src;
> +------------+------------------------+
> | t_src.name | t_src.dob              |
> +------------+------------------------+
> | p1         | 0001-01-01 00:00:00.0  |
> | p2         | 0002-01-01 00:00:00.0  |
> | p3         | 0003-01-01 00:00:00.0  |
> | p4         | 0004-01-01 00:00:00.0  |
> +------------+------------------------+
> select * from t_tgt;
> +------------+------------------------+------------+
> | t_src.name | t_src.dob              | t_tgt.city |
> +------------+------------------------+------------+
> | p1         | 0002-01-01 00:00:00.0  | __HIVE_DEF |
> | p2         | 0002-01-01 00:00:00.0  | __HIVE_DEF |
> | p3         | 0003-01-01 00:00:00.0  | __HIVE_DEF |
> | p4         | 0004-01-01 00:00:00.0  | __HIVE_DEF |
> +------------+------------------------+------------+
> {code}
>  
> Is this a known issue? Is it fixed in any subsequent releases?
> Thanks & regards,
> Pawan Lawale



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26282) Update JVM to 8u191 on jenkins workers

2018-12-05 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710565#comment-16710565
 ] 

shane knapp commented on SPARK-26282:
-

ubuntu workers are done...
{noformat}
[ sknapp@amp-jenkins-master ] [ ~ ]
$ pssh -h ubuntu_workers.txt -i "java -version"
[1] 11:53:05 [SUCCESS] research-jenkins-worker-07
Stderr: java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
[2] 11:53:05 [SUCCESS] amp-jenkins-staging-worker-02
Stderr: java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
[3] 11:53:05 [SUCCESS] amp-jenkins-staging-worker-01
Stderr: java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
[4] 11:53:05 [SUCCESS] research-jenkins-worker-08
Stderr: java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode){noformat}
 

i'll get to the centos workers after lunch.

> Update JVM to 8u191 on jenkins workers
> --
>
> Key: SPARK-26282
> URL: https://issues.apache.org/jira/browse/SPARK-26282
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> the jvm we're using to build/test spark on the centos workers is a bit...  
> long in the teeth:
> {noformat}
> [sknapp@amp-jenkins-worker-04 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat}
> on the ubuntu nodes, it's only a little bit less old:
> {noformat}
> sknapp@amp-jenkins-staging-worker-01:~$ java -version
> java version "1.8.0_171"
> Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> steps to update on centos:
>  * manually install new(er) java
>  * update /etc/alternatives
>  * update JJB configs and update JAVA_HOME/JAVA_BIN
> steps to update on ubuntu:
>  * update ansible to install newer java
>  * deploy ansible
> questions:
>  * do we stick w/java8 for now?
>  * which version is sufficient?
> [~srowen]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26282) Update JVM to 8u191 on jenkins workers

2018-12-05 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26282:
--
Affects Version/s: (was: 2.4.0)
   3.0.0

> Update JVM to 8u191 on jenkins workers
> --
>
> Key: SPARK-26282
> URL: https://issues.apache.org/jira/browse/SPARK-26282
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> the jvm we're using to build/test spark on the centos workers is a bit...  
> long in the teeth:
> {noformat}
> [sknapp@amp-jenkins-worker-04 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat}
> on the ubuntu nodes, it's only a little bit less old:
> {noformat}
> sknapp@amp-jenkins-staging-worker-01:~$ java -version
> java version "1.8.0_171"
> Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> steps to update on centos:
>  * manually install new(er) java
>  * update /etc/alternatives
>  * update JJB configs and update JAVA_HOME/JAVA_BIN
> steps to update on ubuntu:
>  * update ansible to install newer java
>  * deploy ansible
> questions:
>  * do we stick w/java8 for now?
>  * which version is sufficient?
> [~srowen]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26282) Update JVM to 8u191 on jenkins workers

2018-12-05 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710558#comment-16710558
 ] 

Dongjoon Hyun commented on SPARK-26282:
---

+1, great!

> Update JVM to 8u191 on jenkins workers
> --
>
> Key: SPARK-26282
> URL: https://issues.apache.org/jira/browse/SPARK-26282
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> the jvm we're using to build/test spark on the centos workers is a bit...  
> long in the teeth:
> {noformat}
> [sknapp@amp-jenkins-worker-04 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat}
> on the ubuntu nodes, it's only a little bit less old:
> {noformat}
> sknapp@amp-jenkins-staging-worker-01:~$ java -version
> java version "1.8.0_171"
> Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> steps to update on centos:
>  * manually install new(er) java
>  * update /etc/alternatives
>  * update JJB configs and update JAVA_HOME/JAVA_BIN
> steps to update on ubuntu:
>  * update ansible to install newer java
>  * deploy ansible
> questions:
>  * do we stick w/java8 for now?
>  * which version is sufficient?
> [~srowen]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26282) update jvm on jenkins workers

2018-12-05 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710540#comment-16710540
 ] 

shane knapp commented on SPARK-26282:
-

looks like 191 is the most current java8...  deploying that today.

 

> update jvm on jenkins workers
> -
>
> Key: SPARK-26282
> URL: https://issues.apache.org/jira/browse/SPARK-26282
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> the jvm we're using to build/test spark on the centos workers is a bit...  
> long in the teeth:
> {noformat}
> [sknapp@amp-jenkins-worker-04 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat}
> on the ubuntu nodes, it's only a little bit less old:
> {noformat}
> sknapp@amp-jenkins-staging-worker-01:~$ java -version
> java version "1.8.0_171"
> Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> steps to update on centos:
>  * manually install new(er) java
>  * update /etc/alternatives
>  * update JJB configs and update JAVA_HOME/JAVA_BIN
> steps to update on ubuntu:
>  * update ansible to install newer java
>  * deploy ansible
> questions:
>  * do we stick w/java8 for now?
>  * which version is sufficient?
> [~srowen]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-25919) Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned

2018-12-05 Thread Pawan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pawan closed SPARK-25919.
-

> Date value corrupts when tables are "ParquetHiveSerDe" formatted and target 
> table is Partitioned
> 
>
> Key: SPARK-25919
> URL: https://issues.apache.org/jira/browse/SPARK-25919
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, Spark Submit
>Affects Versions: 2.1.0, 2.2.1
>Reporter: Pawan
>Priority: Blocker
>
> Hi
> I found a really strange issue. Below are the steps to reproduce it. This 
> issue occurs only when the table row format is ParquetHiveSerDe and the 
> target table is Partitioned
> *Hive:*
> Log in to the hive terminal on the cluster and create the tables below.
> {code:java}
> create table t_src(
> name varchar(10),
> dob timestamp
> )
> ROW FORMAT SERDE 
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
> STORED AS INPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
> OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> create table t_tgt(
> name varchar(10),
> dob timestamp
> )
> PARTITIONED BY (city varchar(10))
> ROW FORMAT SERDE 
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
> STORED AS INPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
> OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
> {code}
> Insert data into the source table (t_src)
> {code:java}
> INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 
> 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 
> 00:00:00.0');{code}
> *Spark-shell:*
> Get on to spark-shell. 
> Execute below commands on spark shell:
> {code:java}
> import org.apache.spark.sql.hive.HiveContext
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> val q0 = "TRUNCATE table t_tgt"
> val q1 = "SELECT CAST(alias.name AS STRING) as a0, alias.dob as a1 FROM 
> DEFAULT.t_src alias"
> val q2 = "INSERT INTO TABLE DEFAULT.t_tgt PARTITION (city) SELECT tbl0.a0 as 
> c0, tbl0.a1 as c1, NULL as c2 FROM tbl0"
> sqlContext.sql(q0)
> sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0")
> sqlContext.sql(q2)
> {code}
>  After this, check the contents of the target table t_tgt. You will see the date 
> "0001-01-01 00:00:00" changed to "0002-01-01 00:00:00". The snippets below show 
> the contents of both tables:
> {code:java}
> select * from t_src;
> +------------+------------------------+
> | t_src.name | t_src.dob              |
> +------------+------------------------+
> | p1         | 0001-01-01 00:00:00.0  |
> | p2         | 0002-01-01 00:00:00.0  |
> | p3         | 0003-01-01 00:00:00.0  |
> | p4         | 0004-01-01 00:00:00.0  |
> +------------+------------------------+
> select * from t_tgt;
> +------------+------------------------+------------+
> | t_src.name | t_src.dob              | t_tgt.city |
> +------------+------------------------+------------+
> | p1         | 0002-01-01 00:00:00.0  | __HIVE_DEF |
> | p2         | 0002-01-01 00:00:00.0  | __HIVE_DEF |
> | p3         | 0003-01-01 00:00:00.0  | __HIVE_DEF |
> | p4         | 0004-01-01 00:00:00.0  | __HIVE_DEF |
> +------------+------------------------+------------+
> {code}
>  
> Is this a known issue? Is it fixed in any subsequent releases?
> Thanks & regards,
> Pawan Lawale



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25919) Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned

2018-12-05 Thread Pawan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710535#comment-16710535
 ] 

Pawan commented on SPARK-25919:
---

I just figured out why this happens. It is because of the hive-exec jar 
packaged with Spark. The version packaged with Spark-2.1.0 through Spark-2.3.1 
is hive-exec-1.2.1.spark2.jar. However, the Parquet timestamp bug was fixed by 
Hive in hive-exec-2.0.0.jar, which is not included in the Spark packages 
mentioned above.

It was fixed as a part of below Hive Jira

https://issues.apache.org/jira/browse/HIVE-11771

Thanks & regards,

Pawan Lawale
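A small spark-shell sketch (the class picked is just one class known to live in 
hive-exec) that one could use to confirm which hive-exec jar the running Spark 
session actually loaded:

{code:java}
// Prints the location of the jar providing the Hive execution classes, e.g.
// .../hive-exec-1.2.1.spark2.jar on the affected Spark versions.
val hiveExecJar = Class.forName("org.apache.hadoop.hive.ql.exec.Utilities")
  .getProtectionDomain.getCodeSource.getLocation
println(hiveExecJar)
{code}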

> Date value corrupts when tables are "ParquetHiveSerDe" formatted and target 
> table is Partitioned
> 
>
> Key: SPARK-25919
> URL: https://issues.apache.org/jira/browse/SPARK-25919
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, Spark Submit
>Affects Versions: 2.1.0, 2.2.1
>Reporter: Pawan
>Priority: Blocker
>
> Hi
> I found a really strange issue. Below are the steps to reproduce it. This 
> issue occurs only when the table row format is ParquetHiveSerDe and the 
> target table is Partitioned
> *Hive:*
> Log in to the hive terminal on the cluster and create the tables below.
> {code:java}
> create table t_src(
> name varchar(10),
> dob timestamp
> )
> ROW FORMAT SERDE 
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
> STORED AS INPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
> OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> create table t_tgt(
> name varchar(10),
> dob timestamp
> )
> PARTITIONED BY (city varchar(10))
> ROW FORMAT SERDE 
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
> STORED AS INPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
> OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
> {code}
> Insert data into the source table (t_src)
> {code:java}
> INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 
> 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 
> 00:00:00.0');{code}
> *Spark-shell:*
> Get on to spark-shell. 
> Execute below commands on spark shell:
> {code:java}
> import org.apache.spark.sql.hive.HiveContext
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> val q0 = "TRUNCATE table t_tgt"
> val q1 = "SELECT CAST(alias.name AS STRING) as a0, alias.dob as a1 FROM 
> DEFAULT.t_src alias"
> val q2 = "INSERT INTO TABLE DEFAULT.t_tgt PARTITION (city) SELECT tbl0.a0 as 
> c0, tbl0.a1 as c1, NULL as c2 FROM tbl0"
> sqlContext.sql(q0)
> sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0")
> sqlContext.sql(q2)
> {code}
>  After this, check the contents of the target table t_tgt. You will see the date 
> "0001-01-01 00:00:00" changed to "0002-01-01 00:00:00". The snippets below show 
> the contents of both tables:
> {code:java}
> select * from t_src;
> +------------+------------------------+
> | t_src.name | t_src.dob              |
> +------------+------------------------+
> | p1         | 0001-01-01 00:00:00.0  |
> | p2         | 0002-01-01 00:00:00.0  |
> | p3         | 0003-01-01 00:00:00.0  |
> | p4         | 0004-01-01 00:00:00.0  |
> +------------+------------------------+
> select * from t_tgt;
> +------------+------------------------+------------+
> | t_src.name | t_src.dob              | t_tgt.city |
> +------------+------------------------+------------+
> | p1         | 0002-01-01 00:00:00.0  | __HIVE_DEF |
> | p2         | 0002-01-01 00:00:00.0  | __HIVE_DEF |
> | p3         | 0003-01-01 00:00:00.0  | __HIVE_DEF |
> | p4         | 0004-01-01 00:00:00.0  | __HIVE_DEF |
> +------------+------------------------+------------+
> {code}
>  
> Is this a known issue? Is it fixed in any subsequent releases?
> Thanks & regards,
> Pawan Lawale



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26283) When zstd compression enabled, Inprogress application in the history server appUI showing finished job as running

2018-12-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710508#comment-16710508
 ] 

Apache Spark commented on SPARK-26283:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/23241

> When zstd compression enabled, Inprogress application in the history server 
> appUI showing finished job as running
> -
>
> Key: SPARK-26283
> URL: https://issues.apache.org/jira/browse/SPARK-26283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.0, 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> When zstd compression enabled, Inprogress application in the history server 
> appUI showing finished job as running



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26283) When zstd compression enabled, Inprogress application in the history server appUI showing finished job as running

2018-12-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710506#comment-16710506
 ] 

Apache Spark commented on SPARK-26283:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/23241

> When zstd compression enabled, Inprogress application in the history server 
> appUI showing finished job as running
> -
>
> Key: SPARK-26283
> URL: https://issues.apache.org/jira/browse/SPARK-26283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.0, 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> When zstd compression enabled, Inprogress application in the history server 
> appUI showing finished job as running



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26283) When zstd compression enabled, Inprogress application in the history server appUI showing finished job as running

2018-12-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26283:


Assignee: (was: Apache Spark)

> When zstd compression enabled, Inprogress application in the history server 
> appUI showing finished job as running
> -
>
> Key: SPARK-26283
> URL: https://issues.apache.org/jira/browse/SPARK-26283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.0, 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> When zstd compression enabled, Inprogress application in the history server 
> appUI showing finished job as running



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26283) When zstd compression enabled, Inprogress application in the history server appUI showing finished job as running

2018-12-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26283:


Assignee: Apache Spark

> When zstd compression enabled, Inprogress application in the history server 
> appUI showing finished job as running
> -
>
> Key: SPARK-26283
> URL: https://issues.apache.org/jira/browse/SPARK-26283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.0, 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: Apache Spark
>Priority: Minor
>
> When zstd compression enabled, Inprogress application in the history server 
> appUI showing finished job as running



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26282) update jvm on jenkins workers

2018-12-05 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710451#comment-16710451
 ] 

Sean Owen commented on SPARK-26282:
---

Yes the latest Java 8 JDK (_192?) is best. That may well be one of the final 
releases anyway. Whatever most recent version you can easily install through 
the OS updates is fine, as it will be much newer than _60.

You're welcome to also install Java 11 while you're at it, as we will need it 
in the medium term to start running tests against Java 11. 

> update jvm on jenkins workers
> -
>
> Key: SPARK-26282
> URL: https://issues.apache.org/jira/browse/SPARK-26282
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> the jvm we're using to build/test spark on the centos workers is a bit...  
> long in the teeth:
> {noformat}
> [sknapp@amp-jenkins-worker-04 ~]$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat}
> on the ubuntu nodes, it's only a little bit less old:
> {noformat}
> sknapp@amp-jenkins-staging-worker-01:~$ java -version
> java version "1.8.0_171"
> Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> steps to update on centos:
>  * manually install new(er) java
>  * update /etc/alternatives
>  * update JJB configs and update JAVA_HOME/JAVA_BIN
> steps to update on ubuntu:
>  * update ansible to install newer java
>  * deploy ansible
> questions:
>  * do we stick w/java8 for now?
>  * which version is sufficient?
> [~srowen]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26283) When zstd compression enabled, Inprogress application in the history server appUI showing finished job as running

2018-12-05 Thread ABHISHEK KUMAR GUPTA (JIRA)
ABHISHEK KUMAR GUPTA created SPARK-26283:


 Summary: When zstd compression enabled, Inprogress application in 
the history server appUI showing finished job as running
 Key: SPARK-26283
 URL: https://issues.apache.org/jira/browse/SPARK-26283
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Affects Versions: 2.4.0, 3.0.0
Reporter: ABHISHEK KUMAR GUPTA


When zstd compression enabled, Inprogress application in the history server 
appUI showing finished job as running



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26283) When zstd compression enabled, Inprogress application in the history server appUI showing finished job as running

2018-12-05 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710444#comment-16710444
 ] 

shahid commented on SPARK-26283:


Thanks. I am working on it.

> When zstd compression enabled, Inprogress application in the history server 
> appUI showing finished job as running
> -
>
> Key: SPARK-26283
> URL: https://issues.apache.org/jira/browse/SPARK-26283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.0, 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> When zstd compression enabled, Inprogress application in the history server 
> appUI showing finished job as running



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26282) update jvm on jenkins workers

2018-12-05 Thread shane knapp (JIRA)
shane knapp created SPARK-26282:
---

 Summary: update jvm on jenkins workers
 Key: SPARK-26282
 URL: https://issues.apache.org/jira/browse/SPARK-26282
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.4.0
Reporter: shane knapp
Assignee: shane knapp


the jvm we're using to build/test spark on the centos workers is a bit...  long 
in the teeth:
{noformat}
[sknapp@amp-jenkins-worker-04 ~]$ java -version
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat}
on the ubuntu nodes, it's only a little bit less old:
{noformat}
sknapp@amp-jenkins-staging-worker-01:~$ java -version
java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
steps to update on centos:
 * manually install new(er) java
 * update /etc/alternatives
 * update JJB configs and update JAVA_HOME/JAVA_BIN

steps to update on ubuntu:
 * update ansible to install newer java
 * deploy ansible

questions:
 * do we stick w/java8 for now?
 * which version is sufficient?

[~srowen]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26281) Duration column of task table should be executor run time instead of real duration

2018-12-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26281:


Assignee: Apache Spark

> Duration column of task table should be executor run time instead of real 
> duration
> --
>
> Key: SPARK-26281
> URL: https://issues.apache.org/jira/browse/SPARK-26281
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> In PR https://github.com/apache/spark/pull/23081/ , the duration column is 
> changed to executor run time. The behavior is consistent with the summary 
> metrics table and previous Spark version.
> However, after PR https://github.com/apache/spark/pull/21688, the issue can 
> be reproduced again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26281) Duration column of task table should be executor run time instead of real duration

2018-12-05 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-26281:
--

 Summary: Duration column of task table should be executor run time 
instead of real duration
 Key: SPARK-26281
 URL: https://issues.apache.org/jira/browse/SPARK-26281
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.0.0
Reporter: Gengliang Wang


In PR https://github.com/apache/spark/pull/23081/ , the duration column is 
changed to executor run time. The behavior is consistent with the summary 
metrics table and previous Spark version.

However, after PR https://github.com/apache/spark/pull/21688, the issue can be 
reproduced again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26281) Duration column of task table should be executor run time instead of real duration

2018-12-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26281:


Assignee: (was: Apache Spark)

> Duration column of task table should be executor run time instead of real 
> duration
> --
>
> Key: SPARK-26281
> URL: https://issues.apache.org/jira/browse/SPARK-26281
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In PR https://github.com/apache/spark/pull/23081/ , the duration column is 
> changed to executor run time. The behavior is consistent with the summary 
> metrics table and previous Spark version.
> However, after PR https://github.com/apache/spark/pull/21688, the issue can 
> be reproduced again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26281) Duration column of task table should be executor run time instead of real duration

2018-12-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710403#comment-16710403
 ] 

Apache Spark commented on SPARK-26281:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/23240

> Duration column of task table should be executor run time instead of real 
> duration
> --
>
> Key: SPARK-26281
> URL: https://issues.apache.org/jira/browse/SPARK-26281
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In PR https://github.com/apache/spark/pull/23081/ , the duration column is 
> changed to executor run time. The behavior is consistent with the summary 
> metrics table and previous Spark version.
> However, after PR https://github.com/apache/spark/pull/21688, the issue can 
> be reproduced again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26278) V2 Streaming sources cannot be written to V1 sinks

2018-12-05 Thread Seth Fitzsimmons (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710397#comment-16710397
 ] 

Seth Fitzsimmons commented on SPARK-26278:
--

I was thinking specifically of the SerializedOffset / Offset incompatibility 
referenced in SPARK-25257 and fixed in SPARK-23092 (but just the part that 
affects v2 source -> v1 sinks).

> V2 Streaming sources cannot be written to V1 sinks
> --
>
> Key: SPARK-26278
> URL: https://issues.apache.org/jira/browse/SPARK-26278
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Structured Streaming
>Affects Versions: 2.3.2
>Reporter: Justin Polchlopek
>Priority: Major
>
> Starting from a streaming DataFrame derived from a custom v2 MicroBatch 
> reader, we have
> {code:java}
> val df: DataFrame = ... 
> assert(df.isStreaming)
> val outputFormat = "orc" // also applies to "csv" and "json" but not 
> "console" 
> df.writeStream
>   .format(outputFormat)
>   .option("checkpointLocation", "/tmp/checkpoints")
>   .option("path", "/tmp/result")
>   .start
> {code}
> This code fails with the following stack trace:
> {code:java}
> 2018-12-04 08:24:27 ERROR MicroBatchExecution:91 - Query [id = 
> 193f97bf-8064-4658-8aa6-0f481919eafe, runId = 
> e96ed7e5-aaf4-4ef4-a3f3-05fe0b01a715] terminated with error
> java.lang.ClassCastException: 
> org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to 
> org.apache.spark.sql.sources.v2.reader.streaming.Offset
>     at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:405)
>     at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:390)
>     at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>     at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>     at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>     at 
> org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
>     at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>     at 
> org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25)
>     at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390)
>     at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390)
>     at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>     at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>     at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:389)
>     at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
>     at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>     at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>     at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>     at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>     at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
>     at 
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
>     at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
>     at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
>     at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189){code}
> I'm filing th

[jira] [Commented] (SPARK-26222) Scan: track file listing time

2018-12-05 Thread Yuanjian Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710303#comment-16710303
 ] 

Yuanjian Li commented on SPARK-26222:
-

Leaving some thoughts for further discussion (see the sketch after this list):
 * There is already one place that tracks file listing duration, in 
`FileSourceScanExec`; the metric name is `metadataTime` (maybe an inaccurate name, 
it should be changed to file listing time). We should add the phase tracking there.
 * We should also add the duration and phase tracking in these 2 places:
 ** HiveMetastoreCatalog schema inference.
 ** replaceTableScanWithPartitionMetadata in the OptimizeMetadataOnlyQuery rule.
 * IIUC, the phase tracking can use `QueryPlanningTracker` directly because it is 
thread-local and passed through within all `RuleExecution`s.
 * About the meaning of listing time, maybe we can define it as only referring to 
reads without the cache, because loading from the cache is not the 'heavy' operation 
we want to track and also spends less time. The listing time then covers not only the 
first `listFiles` call, but also every call made after the cache is refreshed.
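A rough sketch (plain Scala, not the actual Spark internals) of the kind of phase 
tracking discussed above: wrap the listing call and record the phase name plus start 
and end timestamps, so both a duration metric and a timeline can be built from it.

{code:java}
// Wrap an arbitrary listing call and report the phase name with start/end timestamps.
def timedListing[T](recordPhase: (String, Long, Long) => Unit)(listFiles: => T): T = {
  val start = System.currentTimeMillis()
  val result = listFiles                  // the potentially slow file listing
  val end = System.currentTimeMillis()
  recordPhase("fileListing", start, end)  // phase name is illustrative
  result
}

// Example usage with a plain java.io listing; in Spark the callback would feed an
// SQL metric such as the existing metadataTime.
val files = timedListing((phase, s, e) => println(s"$phase took ${e - s} ms")) {
  new java.io.File("/tmp").listFiles().toSeq
}
{code}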

> Scan: track file listing time
> -
>
> Key: SPARK-26222
> URL: https://issues.apache.org/jira/browse/SPARK-26222
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Reynold Xin
>Priority: Major
>
> We should track file listing time and add it to the scan node's SQL metric, 
> so we have visibility how much is spent in file listing. It'd be useful to 
> track not just duration, but also start and end time so we can construct a 
> timeline.
> This requires a little bit design to define what file listing time means, 
> when we are reading from cache, vs not cache.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26021) -0.0 and 0.0 not treated consistently, doesn't match Hive

2018-12-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710266#comment-16710266
 ] 

Apache Spark commented on SPARK-26021:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/23239

> -0.0 and 0.0 not treated consistently, doesn't match Hive
> -
>
> Key: SPARK-26021
> URL: https://issues.apache.org/jira/browse/SPARK-26021
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Assignee: Alon Doron
>Priority: Critical
> Fix For: 2.4.1, 3.0.0
>
>
> Per [~adoron] and [~mccheah] and SPARK-24834, I'm splitting this out as a new 
> issue:
> The underlying issue is how Spark and Hive treat 0.0 and -0.0, which are 
> numerically identical but not the same double value:
> In hive, 0.0 and -0.0 are equal since 
> https://issues.apache.org/jira/browse/HIVE-11174.
>  That's not the case with spark sql as "group by" (non-codegen) treats them 
> as different values. Since their hash is different they're put in different 
> buckets of UnsafeFixedWidthAggregationMap.
> In addition there's an inconsistency when using the codegen, for example the 
> following unit test:
> {code:java}
> println(Seq(0.0d, 0.0d, 
> -0.0d).toDF("i").groupBy("i").count().collect().mkString(", "))
> {code}
> [0.0,3]
> {code:java}
> println(Seq(0.0d, -0.0d, 
> 0.0d).toDF("i").groupBy("i").count().collect().mkString(", "))
> {code}
> [0.0,1], [-0.0,2]
> {code:java}
> spark.conf.set("spark.sql.codegen.wholeStage", "false")
> println(Seq(0.0d, -0.0d, 
> 0.0d).toDF("i").groupBy("i").count().collect().mkString(", "))
> {code}
> [0.0,2], [-0.0,1]
> Note that the only difference between the first 2 lines is the order of the 
> elements in the Seq.
> This inconsistency results from the different partitioning of the Seq and the 
> usage of the generated fast hash map in the first, partial aggregation.
> It looks like we need to add a specific check for -0.0 before hashing (both 
> in codegen and non-codegen modes) if we want to fix this.
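A minimal sketch (not the fix itself; names are illustrative) of a user-level 
workaround for the behavior described above: normalizing -0.0 to 0.0 before grouping 
makes both values hash identically regardless of codegen.

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

val spark = SparkSession.builder().appName("neg-zero-groupby").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(0.0d, -0.0d, 0.0d).toDF("i")

// -0.0 compares equal to 0.0, so this rewrites it to the canonical 0.0 before grouping.
val normalized = df.withColumn("i", when(col("i") === 0.0d, 0.0d).otherwise(col("i")))

normalized.groupBy("i").count().show()  // expected: a single group for 0.0 with count 3
{code}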



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26280) Spark will read entire CSV file even when limit is used

2018-12-05 Thread Amir Bar-Or (JIRA)
Amir Bar-Or created SPARK-26280:
---

 Summary: Spark will read entire CSV file even when limit is used
 Key: SPARK-26280
 URL: https://issues.apache.org/jira/browse/SPARK-26280
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Amir Bar-Or


When you read a CSV file as below, the parser still wastes time and reads the entire 
file:

var lineDF1 = spark.read
 .format("com.databricks.spark.csv")
 .option("header", "true") // reading the headers
 .option("mode", "DROPMALFORMED")
 .option("delimiter", ",")
 .option("inferSchema", "false")
 .schema(line_schema)
 .load(i_lineitem)
 .limit(10)

Even though a LocalLimit is created, this does not stop the FileScan and the 
parser from parsing the entire file. Is it possible to push the limit down and 
stop the parsing?
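A small sketch (spark-shell; `line_schema` and `i_lineitem` are the schema and path 
from the report above and are not defined here) showing how to inspect the physical 
plan of the limited read; the plan still contains a FileScan of the CSV source with 
the limit applied on top, which matches the behavior described.

{code:java}
val limited = spark.read
  .format("csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .schema(line_schema)
  .load(i_lineitem)
  .limit(10)

limited.explain()
// Expect something like:
//   CollectLimit 10
//   +- *(1) FileScan csv [...]
{code}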



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26273) Add OneHotEncoderEstimator as alias to OneHotEncoder

2018-12-05 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26273:
--
Priority: Minor  (was: Major)

> Add OneHotEncoderEstimator as alias to OneHotEncoder
> 
>
> Key: SPARK-26273
> URL: https://issues.apache.org/jira/browse/SPARK-26273
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> SPARK-26133 removed deprecated OneHotEncoder and renamed 
> OneHotEncoderEstimator to OneHotEncoder.
> Based on ml migration doc, we need to keep OneHotEncoderEstimator as an alias 
> to OneHotEncoder.
> This task is going to add it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25132) Case-insensitive field resolution when reading from Parquet

2018-12-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710219#comment-16710219
 ] 

Apache Spark commented on SPARK-25132:
--

User 'seancxmao' has created a pull request for this issue:
https://github.com/apache/spark/pull/23238

> Case-insensitive field resolution when reading from Parquet
> ---
>
> Key: SPARK-25132
> URL: https://issues.apache.org/jira/browse/SPARK-25132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.1
>Reporter: Chenxiao Mao
>Assignee: Chenxiao Mao
>Priority: Major
>  Labels: Parquet
> Fix For: 2.4.0
>
>
> Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
> schema are in different letter cases, regardless of spark.sql.caseSensitive 
> set to true or false.
> Here is a simple example to reproduce this issue:
> scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1")
> spark-sql> show create table t1;
> CREATE TABLE `t1` (`id` BIGINT)
> USING parquet
> OPTIONS (
>  `serialization.format` '1'
> )
> spark-sql> CREATE TABLE `t2` (`ID` BIGINT)
>  > USING parquet
>  > LOCATION 'hdfs://localhost/user/hive/warehouse/t1';
> spark-sql> select * from t1;
> 0
> 1
> 2
> 3
> 4
> spark-sql> select * from t2;
> NULL
> NULL
> NULL
> NULL
> NULL
>  
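A sketch of a possible workaround (paths are taken from the reproduction above and 
assumed to exist): bypass the metastore schema and read the Parquet files with an 
explicit schema whose field name matches the case used when the data was written.

{code:java}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Lower-case `id` matches the Parquet files written by the first saveAsTable call.
val schema = StructType(Seq(StructField("id", LongType)))
val df = spark.read
  .schema(schema)
  .parquet("hdfs://localhost/user/hive/warehouse/t1")
df.show()  // returns 0..4 instead of NULLs
{code}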



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26273) Add OneHotEncoderEstimator as alias to OneHotEncoder

2018-12-05 Thread Liang-Chi Hsieh (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh resolved SPARK-26273.
-
Resolution: Won't Fix

> Add OneHotEncoderEstimator as alias to OneHotEncoder
> 
>
> Key: SPARK-26273
> URL: https://issues.apache.org/jira/browse/SPARK-26273
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> SPARK-26133 removed deprecated OneHotEncoder and renamed 
> OneHotEncoderEstimator to OneHotEncoder.
> Based on ml migration doc, we need to keep OneHotEncoderEstimator as an alias 
> to OneHotEncoder.
> This task is going to add it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25132) Case-insensitive field resolution when reading from Parquet

2018-12-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710216#comment-16710216
 ] 

Apache Spark commented on SPARK-25132:
--

User 'seancxmao' has created a pull request for this issue:
https://github.com/apache/spark/pull/23238

> Case-insensitive field resolution when reading from Parquet
> ---
>
> Key: SPARK-25132
> URL: https://issues.apache.org/jira/browse/SPARK-25132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.1
>Reporter: Chenxiao Mao
>Assignee: Chenxiao Mao
>Priority: Major
>  Labels: Parquet
> Fix For: 2.4.0
>
>
> Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
> schema are in different letter cases, regardless of spark.sql.caseSensitive 
> set to true or false.
> Here is a simple example to reproduce this issue:
> scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1")
> spark-sql> show create table t1;
> CREATE TABLE `t1` (`id` BIGINT)
> USING parquet
> OPTIONS (
>  `serialization.format` '1'
> )
> spark-sql> CREATE TABLE `t2` (`ID` BIGINT)
>  > USING parquet
>  > LOCATION 'hdfs://localhost/user/hive/warehouse/t1';
> spark-sql> select * from t1;
> 0
> 1
> 2
> 3
> 4
> spark-sql> select * from t2;
> NULL
> NULL
> NULL
> NULL
> NULL
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26273) Add OneHotEncoderEstimator as alias to OneHotEncoder

2018-12-05 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710192#comment-16710192
 ] 

Liang-Chi Hsieh commented on SPARK-26273:
-

For now, the consensus from the PR is that we don't need to keep such an alias, even 
though it is mentioned in the ml migration guide. So I'm closing this issue and the PR.

> Add OneHotEncoderEstimator as alias to OneHotEncoder
> 
>
> Key: SPARK-26273
> URL: https://issues.apache.org/jira/browse/SPARK-26273
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> SPARK-26133 removed deprecated OneHotEncoder and renamed 
> OneHotEncoderEstimator to OneHotEncoder.
> Based on ml migration doc, we need to keep OneHotEncoderEstimator as an alias 
> to OneHotEncoder.
> This task is going to add it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26279) Remove unused method in Logging

2018-12-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26279:


Assignee: (was: Apache Spark)

> Remove unused method in Logging
> ---
>
> Key: SPARK-26279
> URL: https://issues.apache.org/jira/browse/SPARK-26279
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Major
>
> The method isTraceEnabled is not used anywhere. We should remove it to avoid 
> confusion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2629) Improved state management for Spark Streaming (mapWithState)

2018-12-05 Thread Dan Dutrow (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710144#comment-16710144
 ] 

Dan Dutrow commented on SPARK-2629:
---

This PR should not reference SPARK-2629





> Improved state management for Spark Streaming (mapWithState)
> 
>
> Key: SPARK-2629
> URL: https://issues.apache.org/jira/browse/SPARK-2629
> Project: Spark
>  Issue Type: Epic
>  Components: DStreams
>Affects Versions: 0.9.2, 1.0.2, 1.2.2, 1.3.1, 1.4.1, 1.5.1
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 1.6.0
>
>
>  Current updateStateByKey provides stateful processing in Spark Streaming. It 
> allows the user to maintain per-key state and manage that state using an 
> updateFunction. The updateFunction is called for each key, and it uses new 
> data and existing state of the key, to generate an updated state. However, 
> based on community feedback, we have learnt the following lessons.
> - Need for more optimized state management that does not scan every key
> - Need to make it easier to implement common use cases - (a) timeout of idle 
> data, (b) returning items other than state
> The high-level idea that I am proposing is:
> - Introduce a new API -trackStateByKey- *mapWithState* that allows the user 
> to update per-key state and emit arbitrary records. The new API is necessary 
> as this will have significantly different semantics than the existing 
> updateStateByKey API. This API will have direct support for timeouts.
> - Internally, the system will keep the state data as a map/list within the 
> partitions of the state RDDs. The new data RDDs will be partitioned 
> appropriately, and for all the key-value data, it will lookup the map/list in 
> the state RDD partition and create a new list/map of updated state data. The 
> new state RDD partition will be created based on the update data and if 
> necessary, with old data. 
> Here is the detailed design doc (*outdated, to be updated*). Please take a 
> look and provide feedback as comments.
> https://docs.google.com/document/d/1NoALLyd83zGs1hNGMm0Pc5YOVgiPpMHugGMk6COqxxE/edit#heading=h.ph3w0clkd4em
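For readers arriving here now, a minimal sketch of the mapWithState API that this 
proposal turned into (shipped in 1.6); wordDstream is assumed to be a 
DStream[(String, Int)] of per-batch word counts:

{code:scala}
import org.apache.spark.streaming.{Minutes, State, StateSpec}

// Mapping function: (key, new value in this batch, running per-key state) => emitted record.
val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)   // update the per-key state
  (word, sum)         // emit an arbitrary record, not necessarily the state itself
}

// StateSpec carries the function plus options such as a timeout for idle keys.
val spec = StateSpec.function(mappingFunc).timeout(Minutes(10))
val stateDStream = wordDstream.mapWithState(spec)
{code}

The timeout covers the "idle data" use case called out above, and the return value 
of the mapping function covers "returning items other than state".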



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26279) Remove unused method in Logging

2018-12-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26279:


Assignee: Apache Spark

> Remove unused method in Logging
> ---
>
> Key: SPARK-26279
> URL: https://issues.apache.org/jira/browse/SPARK-26279
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Assignee: Apache Spark
>Priority: Major
>
> The method isTraceEnabled is not used anywhere. We should remove it to avoid 
> confusion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26279) Remove unused method in Logging

2018-12-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710109#comment-16710109
 ] 

Apache Spark commented on SPARK-26279:
--

User 'seancxmao' has created a pull request for this issue:
https://github.com/apache/spark/pull/23237

> Remove unused method in Logging
> ---
>
> Key: SPARK-26279
> URL: https://issues.apache.org/jira/browse/SPARK-26279
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Major
>
> The method isTraceEnabled is not used anywhere. We should remove it to avoid 
> confusion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26279) Remove unused method in Logging

2018-12-05 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-26279:
-
Summary: Remove unused method in Logging  (was: Remove unused methods in 
Logging)

> Remove unused method in Logging
> ---
>
> Key: SPARK-26279
> URL: https://issues.apache.org/jira/browse/SPARK-26279
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Major
>
> The method isTraceEnabled is not used anywhere. We should remove it to avoid 
> confusion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26279) Remove unused methods in Logging

2018-12-05 Thread Chenxiao Mao (JIRA)
Chenxiao Mao created SPARK-26279:


 Summary: Remove unused methods in Logging
 Key: SPARK-26279
 URL: https://issues.apache.org/jira/browse/SPARK-26279
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Chenxiao Mao


The method isTraceEnabled is not used anywhere. We should remove it to avoid 
confusion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24417) Build and Run Spark on JDK11

2018-12-05 Thread M. Le Bihan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710091#comment-16710091
 ] 

M. Le Bihan edited comment on SPARK-24417 at 12/5/18 1:53 PM:
--

Hello, 

Unaware of the problem with JDK 11, I used it with _Spark 2.3.x_ without 
trouble for months, mostly calling _lookup()_ functions on RDDs.

But when I attempted a _collect()_, I had a failure (an 
_IllegalArgumentException_). I upgraded to _Spark 2.4.0_ and a message from a 
class in _org.apache.xbean_ explained: "_Unsupported minor major version 55._".

Is this an issue coming from memory management or from the _Scala_ language?

If _Spark 2.x_ ultimately cannot support _JDK 11_ and we have to wait for 
_Spark 3.0_, when is that version planned to be released?

 

Sorry if this is off topic, but:

Will the next major version still be built on _Scala_ (meaning that it has to 
wait for the _Scala_ project to catch up with _Java_ JDK versions), or only on 
_Java_, with _Scala_ offered as an independent option?

Because it seems to me, as someone who programs _Spark_ in plain _Java_ rather 
than _Scala_, that _Scala_ is a cause of underlying troubles. Having a _Spark_ 
without _Scala_, as it is possible to have a _Spark_ without _Hadoop_, would 
comfort me: a cause of issues would disappear.

 

Regards,


was (Author: mlebihan):
Hello, 

Unaware if the problem with the JDK 11, I used it with _Spark 2.3.2_ without 
troubles for months, calling most of the times _lookup()_ functions on RDDs.

But when I attempted a _collect()_, I had a failure (an 
_IllegalArgumentException_). I upgraded to _Spark 2.4.0_ and a message from a 
class in _org.apache.xbean_ explained: "_Unsupported minor major version 55._".


Is it a trouble coming from memory management or from _Scala_ language ?


If, eventually, _Spark 2.x_ cannot support _JDK 11_ and that we have to wait 
for _Spark 3.0,_ when this version is planned to be released ?

 

Sorry if it's out of subject, but :

Will this next major version still be built over _Scala_ (meaning that it has 
to wait that _Scala_ project can follow _Java_ JDK versions) or only over 
_Java_, with _Scala_ offered as an independant option ?

Because it seems to me, who do not use _Scala_ for programming _Spark_ but 
plain _Java_ only, that _Scala_ is a cause of underlying troubles. Having a 
_Spark_ without _Scala_ like it is possible to have a _Spark_ without _Hadoop_ 
would confort me : a cause of issues would disappear.

 

Regards,

> Build and Run Spark on JDK11
> 
>
> Key: SPARK-24417
> URL: https://issues.apache.org/jira/browse/SPARK-24417
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>
> This is an umbrella JIRA for Apache Spark to support JDK11
> As JDK8 is reaching EOL, and JDK9 and 10 are already end of life, per 
> community discussion, we will skip JDK9 and 10 to support JDK 11 directly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24417) Build and Run Spark on JDK11

2018-12-05 Thread M. Le Bihan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710091#comment-16710091
 ] 

M. Le Bihan commented on SPARK-24417:
-

Hello, 

Unaware of the problem with JDK 11, I used it with _Spark 2.3.2_ without 
trouble for months, mostly calling _lookup()_ functions on RDDs.

But when I attempted a _collect()_, I had a failure (an 
_IllegalArgumentException_). I upgraded to _Spark 2.4.0_ and a message from a 
class in _org.apache.xbean_ explained: "_Unsupported minor major version 55._".


Is this an issue coming from memory management or from the _Scala_ language?


If _Spark 2.x_ ultimately cannot support _JDK 11_ and we have to wait for 
_Spark 3.0_, when is that version planned to be released?

 

Sorry if this is off topic, but:

Will the next major version still be built on _Scala_ (meaning that it has to 
wait for the _Scala_ project to catch up with _Java_ JDK versions), or only on 
_Java_, with _Scala_ offered as an independent option?

Because it seems to me, as someone who programs _Spark_ in plain _Java_ rather 
than _Scala_, that _Scala_ is a cause of underlying troubles. Having a _Spark_ 
without _Scala_, as it is possible to have a _Spark_ without _Hadoop_, would 
comfort me: a cause of issues would disappear.

 

Regards,

> Build and Run Spark on JDK11
> 
>
> Key: SPARK-24417
> URL: https://issues.apache.org/jira/browse/SPARK-24417
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Major
>
> This is an umbrella JIRA for Apache Spark to support JDK11
> As JDK8 is reaching EOL, and JDK9 and 10 are already end of life, per 
> community discussion, we will skip JDK9 and 10 to support JDK 11 directly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26278) V2 Streaming sources cannot be written to V1 sinks

2018-12-05 Thread Justin Polchlopek (JIRA)
Justin Polchlopek created SPARK-26278:
-

 Summary: V2 Streaming sources cannot be written to V1 sinks
 Key: SPARK-26278
 URL: https://issues.apache.org/jira/browse/SPARK-26278
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, Structured Streaming
Affects Versions: 2.3.2
Reporter: Justin Polchlopek


Starting from a streaming DataFrame derived from a custom v2 MicroBatch reader, 
we have
{code:java}
val df: DataFrame = ... 
assert(df.isStreaming)

val outputFormat = "orc" // also applies to "csv" and "json" but not "console" 

df.writeStream
  .format(outputFormat)
  .option("checkpointLocation", "/tmp/checkpoints")
  .option("path", "/tmp/result")
  .start
{code}
This code fails with the following stack trace:
{code:java}
2018-12-04 08:24:27 ERROR MicroBatchExecution:91 - Query [id = 
193f97bf-8064-4658-8aa6-0f481919eafe, runId = 
e96ed7e5-aaf4-4ef4-a3f3-05fe0b01a715] terminated with error
java.lang.ClassCastException: 
org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to 
org.apache.spark.sql.sources.v2.reader.streaming.Offset
    at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:405)
    at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:390)
    at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at 
org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
    at 
org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25)
    at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390)
    at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390)
    at 
org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
    at 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
    at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:389)
    at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
    at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
    at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
    at 
org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
    at 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
    at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
    at 
org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
    at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
    at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
    at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189){code}
I'm filing this issue at the suggestion of [~mojodna], who suggests that this 
problem could be resolved by backporting the streaming sinks from Spark 2.4.0.
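For background on the types in the cast above: a v2 micro-batch source supplies 
its own Offset implementation and is responsible for turning checkpointed JSON 
back into it via deserializeOffset. A minimal sketch, assuming a source whose 
position is a single long (the names here are illustrative only):

{code:scala}
import org.apache.spark.sql.sources.v2.reader.streaming.Offset

// Hypothetical offset for a source that tracks one monotonically increasing position.
case class MyLongOffset(position: Long) extends Offset {
  override def json(): String = position.toString
}

// Inside the custom MicroBatchReader one would implement:
//   override def deserializeOffset(json: String): Offset = MyLongOffset(json.toLong)
{code}

The stack trace suggests that on this code path the engine casts the restored 
SerializedOffset straight to the v2 Offset type instead of routing it through 
deserializeOffset, which is why the source's own handling never gets a chance to 
run.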



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26277) WholeStageCodegen metrics should be tested with whole-stage codegen enabled

2018-12-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26277:


Assignee: (was: Apache Spark)

> WholeStageCodegen metrics should be tested with whole-stage codegen enabled
> ---
>
> Key: SPARK-26277
> URL: https://issues.apache.org/jira/browse/SPARK-26277
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Major
>
> In {{org.apache.spark.sql.execution.metric.SQLMetricsSuite}}, there's a test 
> case named "WholeStageCodegen metrics". However, it is executed with 
> whole-stage codegen disabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26270) Having clause does not work with explode anymore

2018-12-05 Thread Olli Kuonanoja (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710026#comment-16710026
 ] 

Olli Kuonanoja commented on SPARK-26270:


Makes sense, thanks [~mgaido]

> Having clause does not work with explode anymore
> 
>
> Key: SPARK-26270
> URL: https://issues.apache.org/jira/browse/SPARK-26270
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Olli Kuonanoja
>Priority: Major
>
> Hi,
> In Spark 2.3.0 it was possible to execute queries like
> {code:sql}
> select explode(col1) as v from values array(1,2) having v>1
> {code}
> but in 2.4.0 it leads to 
> {noformat}
> org.apache.spark.sql.AnalysisException: Generators are not supported outside 
> the SELECT clause, but got: 'Aggregate [explode(col1#1) AS v#0];
> {noformat}
> Before looking into a fix I'm trying to understand if this has been changed 
> on purpose and if there is an alternate construct available. Could not find 
> any pre-existing tests for the explode-having combination.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26277) WholeStageCodegen metrics should be tested with whole-stage codegen enabled

2018-12-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710037#comment-16710037
 ] 

Apache Spark commented on SPARK-26277:
--

User 'seancxmao' has created a pull request for this issue:
https://github.com/apache/spark/pull/23224

> WholeStageCodegen metrics should be tested with whole-stage codegen enabled
> ---
>
> Key: SPARK-26277
> URL: https://issues.apache.org/jira/browse/SPARK-26277
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Major
>
> In {{org.apache.spark.sql.execution.metric.SQLMetricsSuite}}, there's a test 
> case named "WholeStageCodegen metrics". However, it is executed with 
> whole-stage codegen disabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26277) WholeStageCodegen metrics should be tested with whole-stage codegen enabled

2018-12-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710039#comment-16710039
 ] 

Apache Spark commented on SPARK-26277:
--

User 'seancxmao' has created a pull request for this issue:
https://github.com/apache/spark/pull/23224

> WholeStageCodegen metrics should be tested with whole-stage codegen enabled
> ---
>
> Key: SPARK-26277
> URL: https://issues.apache.org/jira/browse/SPARK-26277
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Major
>
> In {{org.apache.spark.sql.execution.metric.SQLMetricsSuite}}, there's a test 
> case named "WholeStageCodegen metrics". However, it is executed with 
> whole-stage codegen disabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26277) WholeStageCodegen metrics should be tested with whole-stage codegen enabled

2018-12-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26277:


Assignee: Apache Spark

> WholeStageCodegen metrics should be tested with whole-stage codegen enabled
> ---
>
> Key: SPARK-26277
> URL: https://issues.apache.org/jira/browse/SPARK-26277
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Assignee: Apache Spark
>Priority: Major
>
> In {{org.apache.spark.sql.execution.metric.SQLMetricsSuite}}, there's a test 
> case named "WholeStageCodegen metrics". However, it is executed with 
> whole-stage codegen disabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26277) WholeStageCodegen metrics should be tested with whole-stage codegen enabled

2018-12-05 Thread Chenxiao Mao (JIRA)
Chenxiao Mao created SPARK-26277:


 Summary: WholeStageCodegen metrics should be tested with 
whole-stage codegen enabled
 Key: SPARK-26277
 URL: https://issues.apache.org/jira/browse/SPARK-26277
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.4.0
Reporter: Chenxiao Mao


In {{org.apache.spark.sql.execution.metric.SQLMetricsSuite}}, there's a test 
case named "WholeStageCodegen metrics". However, it is executed with 
whole-stage codegen disabled.
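A minimal sketch of forcing the flag on around a query from a plain spark-shell 
session; inside the suite itself one would more likely wrap the assertions in 
withSQLConf:

{code:scala}
// Explicitly enable whole-stage codegen around the query under test,
// restoring the previous value afterwards.
val key = "spark.sql.codegen.wholeStage"
val previous = spark.conf.get(key)
spark.conf.set(key, "true")
try {
  spark.range(10).groupBy().count().collect()
} finally {
  spark.conf.set(key, previous)
}
{code}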



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26276) Broken link on download page

2018-12-05 Thread Sebb (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebb resolved SPARK-26276.
--
Resolution: Invalid

Wrong project

> Broken link on download page
> 
>
> Key: SPARK-26276
> URL: https://issues.apache.org/jira/browse/SPARK-26276
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.3.2
>Reporter: Sebb
>Priority: Major
>
> The download page [1] links to release notes at
> http://bahir.apache.org/releases/spark/2.3.2/release-notes
> This does not exist.
> [1] http://bahir.apache.org/downloads/spark/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26276) Broken link on download page

2018-12-05 Thread Sebb (JIRA)
Sebb created SPARK-26276:


 Summary: Broken link on download page
 Key: SPARK-26276
 URL: https://issues.apache.org/jira/browse/SPARK-26276
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 2.3.2
Reporter: Sebb


The download page [1] links to release notes at

http://bahir.apache.org/releases/spark/2.3.2/release-notes

This does not exist.

[1] http://bahir.apache.org/downloads/spark/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26275) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction

2018-12-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709994#comment-16709994
 ] 

Apache Spark commented on SPARK-26275:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/23236

> Flaky test: pyspark.mllib.tests.test_streaming_algorithms 
> StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
> --
>
> Key: SPARK-26275
> URL: https://issues.apache.org/jira/browse/SPARK-26275
> Project: Spark
>  Issue Type: Test
>  Components: MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Looks this test is flaky
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99704/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99569/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99644/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99548/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99454/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99609/console
> {code}
> ==
> FAIL: test_training_and_prediction 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> Test that the model improves on toy data with no. of batches
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 367, in test_training_and_prediction
> self._eventually(condition)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 78, in _eventually
> % (timeout, lastValue))
> AssertionError: Test failed due to timeout after 30 sec, with last condition 
> returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 
> 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74
> --
> Ran 13 tests in 185.051s
> FAILED (failures=1, skipped=1)
> {code}
> This looks happening after increasing the parallelism in Jenkins to speed up. 
> I am able to reproduce this manually when the resource usage is heavy with 
> manual decrease of timeout.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26275) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction

2018-12-05 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26275:
-
Priority: Minor  (was: Major)

> Flaky test: pyspark.mllib.tests.test_streaming_algorithms 
> StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
> --
>
> Key: SPARK-26275
> URL: https://issues.apache.org/jira/browse/SPARK-26275
> Project: Spark
>  Issue Type: Test
>  Components: MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Looks this test is flaky
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99704/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99569/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99644/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99548/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99454/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99609/console
> {code}
> ==
> FAIL: test_training_and_prediction 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> Test that the model improves on toy data with no. of batches
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 367, in test_training_and_prediction
> self._eventually(condition)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 78, in _eventually
> % (timeout, lastValue))
> AssertionError: Test failed due to timeout after 30 sec, with last condition 
> returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 
> 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74
> --
> Ran 13 tests in 185.051s
> FAILED (failures=1, skipped=1)
> {code}
> This looks happening after increasing the parallelism in Jenkins to speed up. 
> I am able to reproduce this manually when the resource usage is heavy with 
> manual decrease of timeout.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26275) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction

2018-12-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26275:


Assignee: Apache Spark

> Flaky test: pyspark.mllib.tests.test_streaming_algorithms 
> StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
> --
>
> Key: SPARK-26275
> URL: https://issues.apache.org/jira/browse/SPARK-26275
> Project: Spark
>  Issue Type: Test
>  Components: MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Looks this test is flaky
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99704/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99569/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99644/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99548/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99454/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99609/console
> {code}
> ==
> FAIL: test_training_and_prediction 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> Test that the model improves on toy data with no. of batches
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 367, in test_training_and_prediction
> self._eventually(condition)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 78, in _eventually
> % (timeout, lastValue))
> AssertionError: Test failed due to timeout after 30 sec, with last condition 
> returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 
> 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74
> --
> Ran 13 tests in 185.051s
> FAILED (failures=1, skipped=1)
> {code}
> This looks happening after increasing the parallelism in Jenkins to speed up. 
> I am able to reproduce this manually when the resource usage is heavy with 
> manual decrease of timeout.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26275) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction

2018-12-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26275:


Assignee: (was: Apache Spark)

> Flaky test: pyspark.mllib.tests.test_streaming_algorithms 
> StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
> --
>
> Key: SPARK-26275
> URL: https://issues.apache.org/jira/browse/SPARK-26275
> Project: Spark
>  Issue Type: Test
>  Components: MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Looks this test is flaky
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99704/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99569/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99644/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99548/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99454/console
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99609/console
> {code}
> ==
> FAIL: test_training_and_prediction 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> Test that the model improves on toy data with no. of batches
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 367, in test_training_and_prediction
> self._eventually(condition)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 78, in _eventually
> % (timeout, lastValue))
> AssertionError: Test failed due to timeout after 30 sec, with last condition 
> returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 
> 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74
> --
> Ran 13 tests in 185.051s
> FAILED (failures=1, skipped=1)
> {code}
> This looks happening after increasing the parallelism in Jenkins to speed up. 
> I am able to reproduce this manually when the resource usage is heavy with 
> manual decrease of timeout.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26151) Return partial results for bad CSV records

2018-12-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709991#comment-16709991
 ] 

Apache Spark commented on SPARK-26151:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/23235

> Return partial results for bad CSV records
> --
>
> Key: SPARK-26151
> URL: https://issues.apache.org/jira/browse/SPARK-26151
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently, the CSV datasource and from_csv return rows with all nulls for bad 
> CSV records in PERMISSIVE mode even when some of the fields were parsed and 
> converted successfully. For example, the CSV input:
> {code}
> 0,2013-111-11 12:13:14
> 1,1983-08-04
> {code}
> for the first line the returned row is Row(null, null), but the value 0 can be 
> parsed and converted successfully, so the result could be Row(0, null). This 
> ticket aims to change the implementation of UnivocityParser to return the 
> partial result.
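To make the PERMISSIVE behavior concrete, a minimal sketch, assuming the two 
lines above are stored in /tmp/bad.csv and the intended schema is 
(id INT, ts TIMESTAMP):

{code:scala}
import org.apache.spark.sql.types.{IntegerType, StructType, TimestampType}

val schema = new StructType()
  .add("id", IntegerType)
  .add("ts", TimestampType)

val df = spark.read
  .schema(schema)
  .option("mode", "PERMISSIVE")   // keep bad records instead of dropping or failing
  .csv("/tmp/bad.csv")

// Before this change the first record comes back as Row(null, null);
// with partial results it comes back as Row(0, null).
df.show()
{code}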



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26275) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction

2018-12-05 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-26275:


 Summary: Flaky test: pyspark.mllib.tests.test_streaming_algorithms 
StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
 Key: SPARK-26275
 URL: https://issues.apache.org/jira/browse/SPARK-26275
 Project: Spark
  Issue Type: Test
  Components: MLlib, PySpark
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


Looks like this test is flaky:

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99704/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99569/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99644/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99548/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99454/console
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99609/console

{code}
==
FAIL: test_training_and_prediction 
(pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
Test that the model improves on toy data with no. of batches
--
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py",
 line 367, in test_training_and_prediction
self._eventually(condition)
  File 
"/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py",
 line 78, in _eventually
% (timeout, lastValue))
AssertionError: Test failed due to timeout after 30 sec, with last condition 
returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 
0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74

--
Ran 13 tests in 185.051s

FAILED (failures=1, skipped=1)
{code}

This looks like it started happening after the Jenkins parallelism was increased 
to speed up builds. I am able to reproduce it manually under heavy resource usage 
by manually decreasing the timeout.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions

2018-12-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709970#comment-16709970
 ] 

Apache Spark commented on SPARK-26233:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/23233

> Incorrect decimal value with java beans and first/last/max... functions
> ---
>
> Key: SPARK-26233
> URL: https://issues.apache.org/jira/browse/SPARK-26233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: Miquel Canes
>Assignee: Marco Gaido
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.0.0
>
>
> Decimal values from Java beans are incorrectly scaled when used with 
> functions like first/last/max...
> This problem arises because Encoders.bean always sets Decimal values as 
> _DecimalType(this.MAX_PRECISION(), 18)._
> Usually it's not a problem if you use numeric functions like *sum*, but for 
> functions like *first*/*last*/*max*... it is.
> How to reproduce this error:
> Using this class as an example:
> {code:java}
> public class Foo implements Serializable {
>   private String group;
>   private BigDecimal var;
>   public BigDecimal getVar() {
> return var;
>   }
>   public void setVar(BigDecimal var) {
> this.var = var;
>   }
>   public String getGroup() {
> return group;
>   }
>   public void setGroup(String group) {
> this.group = group;
>   }
> }
> {code}
>  
> And a dummy code to create some objects:
> {code:java}
> Dataset ds = spark.range(5)
> .map(l -> {
>   Foo foo = new Foo();
>   foo.setGroup("" + l);
>   foo.setVar(BigDecimal.valueOf(l + 0.));
>   return foo;
> }, Encoders.bean(Foo.class));
> ds.printSchema();
> ds.show();
> +-+--+
> |group| var|
> +-+--+
> | 0|0.|
> | 1|1.|
> | 2|2.|
> | 3|3.|
> | 4|4.|
> +-+--+
> {code}
> We can see that the DecimalType has precision 38 and scale 18 and all values 
> are shown correctly.
> But if we use the first function, they are scaled incorrectly:
> {code:java}
> ds.groupBy(col("group"))
> .agg(
> first("var")
> )
> .show();
> +-+-+
> |group|first(var, false)|
> +-+-+
> | 3| 3.E-14|
> | 0| 1.111E-15|
> | 1| 1.E-14|
> | 4| 4.E-14|
> | 2| 2.E-14|
> +-+-+
> {code}
> This incorrect behavior cannot be reproduced if we use "numerical" functions 
> like sum or if the column is cast to a new DecimalType.
> {code:java}
> ds.groupBy(col("group"))
> .agg(
> sum("var")
> )
> .show();
> +-++
> |group| sum(var)|
> +-++
> | 3|3.00|
> | 0|0.00|
> | 1|1.00|
> | 4|4.00|
> | 2|2.00|
> +-++
> ds.groupBy(col("group"))
> .agg(
> first(col("var").cast(new DecimalType(38, 8)))
> )
> .show();
> +-++
> |group|first(CAST(var AS DECIMAL(38,8)), false)|
> +-++
> | 3| 3.|
> | 0| 0.|
> | 1| 1.|
> | 4| 4.|
> | 2| 2.|
> +-++
> {code}
>    
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26149) Read UTF8String from Parquet/ORC may be incorrect

2018-12-05 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709975#comment-16709975
 ] 

Hyukjin Kwon commented on SPARK-26149:
--

Thanks for the details, [~yumwang]

> Read UTF8String from Parquet/ORC may be incorrect
> -
>
> Key: SPARK-26149
> URL: https://issues.apache.org/jira/browse/SPARK-26149
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: SPARK-26149.snappy.parquet, 
> image-2018-12-04-10-55-49-369.png
>
>
> How to reproduce:
> {code:bash}
> scala> 
> spark.read.parquet("/Users/yumwang/SPARK-26149/SPARK-26149.snappy.parquet").selectExpr("s1
>  = s2").show
> +-+
> |(s1 = s2)|
> +-+
> |false|
> +-+
> scala> val first = 
> spark.read.parquet("/Users/yumwang/SPARK-26149/SPARK-26149.snappy.parquet").collect().head
> first: org.apache.spark.sql.Row = 
> [a0750c1f13f0k5��F8j���b�Ro'4da96,a0750c1f13f0k5��F8j���b�Ro'4da96]
> scala> println(first.getString(0).equals(first.getString(1)))
> true
> {code}
> {code:sql}
> hive> CREATE TABLE `tb1` (`s1` STRING, `s2` STRING)
> > stored as parquet
> > location "/Users/yumwang/SPARK-26149";
> OK
> Time taken: 0.224 seconds
> hive> select s1 = s2 from tb1;
> OK
> true
> Time taken: 0.167 seconds, Fetched: 1 row(s)
> {code}
> As you can see, only UTF8String returns {{false}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions

2018-12-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709968#comment-16709968
 ] 

Apache Spark commented on SPARK-26233:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/23234

> Incorrect decimal value with java beans and first/last/max... functions
> ---
>
> Key: SPARK-26233
> URL: https://issues.apache.org/jira/browse/SPARK-26233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: Miquel Canes
>Assignee: Marco Gaido
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.0.0
>
>
> Decimal values from Java beans are incorrectly scaled when used with 
> functions like first/last/max...
> This problem came because Encoders.bean always set Decimal values as 
> _DecimalType(this.MAX_PRECISION(), 18)._
> Usually it's not a problem if you use numeric functions like *sum* but for 
> functions like *first*/*last*/*max*... it is a problem.
> How to reproduce this error:
> Using this class as an example:
> {code:java}
> public class Foo implements Serializable {
>   private String group;
>   private BigDecimal var;
>   public BigDecimal getVar() {
> return var;
>   }
>   public void setVar(BigDecimal var) {
> this.var = var;
>   }
>   public String getGroup() {
> return group;
>   }
>   public void setGroup(String group) {
> this.group = group;
>   }
> }
> {code}
>  
> And a dummy code to create some objects:
> {code:java}
> Dataset ds = spark.range(5)
> .map(l -> {
>   Foo foo = new Foo();
>   foo.setGroup("" + l);
>   foo.setVar(BigDecimal.valueOf(l + 0.));
>   return foo;
> }, Encoders.bean(Foo.class));
> ds.printSchema();
> ds.show();
> +-+--+
> |group| var|
> +-+--+
> | 0|0.|
> | 1|1.|
> | 2|2.|
> | 3|3.|
> | 4|4.|
> +-+--+
> {code}
> We can see that the DecimalType is precision 38 and 18 scale and all values 
> are show correctly.
> But if we use a first function, they are scaled incorrectly:
> {code:java}
> ds.groupBy(col("group"))
> .agg(
> first("var")
> )
> .show();
> +-+-+
> |group|first(var, false)|
> +-+-+
> | 3| 3.E-14|
> | 0| 1.111E-15|
> | 1| 1.E-14|
> | 4| 4.E-14|
> | 2| 2.E-14|
> +-+-+
> {code}
> This incorrect behavior cannot be reproduced if we use "numerical "functions 
> like sum or if the column is cast a new Decimal Type.
> {code:java}
> ds.groupBy(col("group"))
> .agg(
> sum("var")
> )
> .show();
> +-++
> |group| sum(var)|
> +-++
> | 3|3.00|
> | 0|0.00|
> | 1|1.00|
> | 4|4.00|
> | 2|2.00|
> +-++
> ds.groupBy(col("group"))
> .agg(
> first(col("var").cast(new DecimalType(38, 8)))
> )
> .show();
> +-++
> |group|first(CAST(var AS DECIMAL(38,8)), false)|
> +-++
> | 3| 3.|
> | 0| 0.|
> | 1| 1.|
> | 4| 4.|
> | 2| 2.|
> +-++
> {code}
>    
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions

2018-12-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709964#comment-16709964
 ] 

Apache Spark commented on SPARK-26233:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/23232

> Incorrect decimal value with java beans and first/last/max... functions
> ---
>
> Key: SPARK-26233
> URL: https://issues.apache.org/jira/browse/SPARK-26233
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: Miquel Canes
>Assignee: Marco Gaido
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.0.0
>
>
> Decimal values from Java beans are incorrectly scaled when used with 
> functions like first/last/max...
> This problem came because Encoders.bean always set Decimal values as 
> _DecimalType(this.MAX_PRECISION(), 18)._
> Usually it's not a problem if you use numeric functions like *sum* but for 
> functions like *first*/*last*/*max*... it is a problem.
> How to reproduce this error:
> Using this class as an example:
> {code:java}
> public class Foo implements Serializable {
>   private String group;
>   private BigDecimal var;
>   public BigDecimal getVar() {
> return var;
>   }
>   public void setVar(BigDecimal var) {
> this.var = var;
>   }
>   public String getGroup() {
> return group;
>   }
>   public void setGroup(String group) {
> this.group = group;
>   }
> }
> {code}
>  
> And a dummy code to create some objects:
> {code:java}
> Dataset ds = spark.range(5)
> .map(l -> {
>   Foo foo = new Foo();
>   foo.setGroup("" + l);
>   foo.setVar(BigDecimal.valueOf(l + 0.));
>   return foo;
> }, Encoders.bean(Foo.class));
> ds.printSchema();
> ds.show();
> +-+--+
> |group| var|
> +-+--+
> | 0|0.|
> | 1|1.|
> | 2|2.|
> | 3|3.|
> | 4|4.|
> +-+--+
> {code}
> We can see that the DecimalType is precision 38 and 18 scale and all values 
> are show correctly.
> But if we use a first function, they are scaled incorrectly:
> {code:java}
> ds.groupBy(col("group"))
> .agg(
> first("var")
> )
> .show();
> +-+-+
> |group|first(var, false)|
> +-+-+
> | 3| 3.E-14|
> | 0| 1.111E-15|
> | 1| 1.E-14|
> | 4| 4.E-14|
> | 2| 2.E-14|
> +-+-+
> {code}
> This incorrect behavior cannot be reproduced if we use "numerical "functions 
> like sum or if the column is cast a new Decimal Type.
> {code:java}
> ds.groupBy(col("group"))
> .agg(
> sum("var")
> )
> .show();
> +-++
> |group| sum(var)|
> +-++
> | 3|3.00|
> | 0|0.00|
> | 1|1.00|
> | 4|4.00|
> | 2|2.00|
> +-++
> ds.groupBy(col("group"))
> .agg(
> first(col("var").cast(new DecimalType(38, 8)))
> )
> .show();
> +-++
> |group|first(CAST(var AS DECIMAL(38,8)), false)|
> +-++
> | 3| 3.|
> | 0| 0.|
> | 1| 1.|
> | 4| 4.|
> | 2| 2.|
> +-++
> {code}
>    
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26270) Having clause does not work with explode anymore

2018-12-05 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709953#comment-16709953
 ] 

Marco Gaido commented on SPARK-26270:
-

This is caused by SPARK-25708. You can find more details on that ticket. If you 
want to switch back to the previous behavior Spark had in this case, you can set 
{{spark.sql.legacy.parser.havingWithoutGroupByAsWhere}} to {{true}} (see the 
sketch below). In any case, this query doesn't work in Postgres either, so I 
don't think it should be "fixed".

Since there is already a config which fits your needs, I am closing this 
ticket. Please feel free to re-open if you think some further action is 
required instead. Thanks.
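A minimal sketch of the two options from a Spark 2.4 session; the first relies on 
the legacy config named above, the second rewrites the query so the generator 
stays in the SELECT clause:

{code:scala}
// Option 1: restore the 2.3 parsing of HAVING-without-GROUP-BY as a WHERE-style
// filter, as suggested above.
spark.conf.set("spark.sql.legacy.parser.havingWithoutGroupByAsWhere", "true")
spark.sql("select explode(col1) as v from values array(1,2) having v > 1").show()

// Option 2: keep explode in the SELECT of a subquery and filter with WHERE.
spark.sql(
  "select v from (select explode(col1) as v from values array(1,2)) t where v > 1"
).show()
{code}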

> Having clause does not work with explode anymore
> 
>
> Key: SPARK-26270
> URL: https://issues.apache.org/jira/browse/SPARK-26270
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Olli Kuonanoja
>Priority: Major
>
> Hi,
> In Spark 2.3.0 it was possible to execute queries like
> {code:sql}
> select explode(col1) as v from values array(1,2) having v>1
> {code}
> but in 2.4.0 it leads to 
> {noformat}
> org.apache.spark.sql.AnalysisException: Generators are not supported outside 
> the SELECT clause, but got: 'Aggregate [explode(col1#1) AS v#0];
> {noformat}
> Before looking into a fix I'm trying to understand if this has been changed 
> on purpose and if there is an alternate construct available. Could not find 
> any pre-existing tests for the explode-having combination.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26270) Having clause does not work with explode anymore

2018-12-05 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-26270.
-
Resolution: Invalid

> Having clause does not work with explode anymore
> 
>
> Key: SPARK-26270
> URL: https://issues.apache.org/jira/browse/SPARK-26270
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Olli Kuonanoja
>Priority: Major
>
> Hi,
> In Spark 2.3.0 it was possible to execute queries like
> {code:sql}
> select explode(col1) as v from values array(1,2) having v>1
> {code}
> but in 2.4.0 it leads to 
> {noformat}
> org.apache.spark.sql.AnalysisException: Generators are not supported outside 
> the SELECT clause, but got: 'Aggregate [explode(col1#1) AS v#0];
> {noformat}
> Before looking into a fix I'm trying to understand if this has been changed 
> on purpose and if there is an alternate construct available. Could not find 
> any pre-existing tests for the explode-having combination.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26273) Add OneHotEncoderEstimator as alias to OneHotEncoder

2018-12-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709886#comment-16709886
 ] 

Apache Spark commented on SPARK-26273:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/23231

> Add OneHotEncoderEstimator as alias to OneHotEncoder
> 
>
> Key: SPARK-26273
> URL: https://issues.apache.org/jira/browse/SPARK-26273
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> SPARK-26133 removed deprecated OneHotEncoder and renamed 
> OneHotEncoderEstimator to OneHotEncoder.
> Based on ml migration doc, we need to keep OneHotEncoderEstimator as an alias 
> to OneHotEncoder.
> This task is going to add it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26273) Add OneHotEncoderEstimator as alias to OneHotEncoder

2018-12-05 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709882#comment-16709882
 ] 

Apache Spark commented on SPARK-26273:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/23231

> Add OneHotEncoderEstimator as alias to OneHotEncoder
> 
>
> Key: SPARK-26273
> URL: https://issues.apache.org/jira/browse/SPARK-26273
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> SPARK-26133 removed deprecated OneHotEncoder and renamed 
> OneHotEncoderEstimator to OneHotEncoder.
> Based on ml migration doc, we need to keep OneHotEncoderEstimator as an alias 
> to OneHotEncoder.
> This task is going to add it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26273) Add OneHotEncoderEstimator as alias to OneHotEncoder

2018-12-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26273:


Assignee: Apache Spark

> Add OneHotEncoderEstimator as alias to OneHotEncoder
> 
>
> Key: SPARK-26273
> URL: https://issues.apache.org/jira/browse/SPARK-26273
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-26133 removed deprecated OneHotEncoder and renamed 
> OneHotEncoderEstimator to OneHotEncoder.
> Based on ml migration doc, we need to keep OneHotEncoderEstimator as an alias 
> to OneHotEncoder.
> This task is going to add it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26273) Add OneHotEncoderEstimator as alias to OneHotEncoder

2018-12-05 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26273:


Assignee: (was: Apache Spark)

> Add OneHotEncoderEstimator as alias to OneHotEncoder
> 
>
> Key: SPARK-26273
> URL: https://issues.apache.org/jira/browse/SPARK-26273
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> SPARK-26133 removed deprecated OneHotEncoder and renamed 
> OneHotEncoderEstimator to OneHotEncoder.
> Based on ml migration doc, we need to keep OneHotEncoderEstimator as an alias 
> to OneHotEncoder.
> This task is going to add it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26274) Download page must link to https://www.apache.org/dist/spark for current releases

2018-12-05 Thread Sebb (JIRA)
Sebb created SPARK-26274:


 Summary: Download page must link to 
https://www.apache.org/dist/spark for current releases
 Key: SPARK-26274
 URL: https://issues.apache.org/jira/browse/SPARK-26274
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Documentation, Web UI
Affects Versions: 2.4.0, 2.3.2
Reporter: Sebb


The download page currently uses the archive server:
https://archive.apache.org/dist/spark/...
for all sigs and hashes.
This is fine for archived releases; however, current ones must link to the 
mirror system, i.e.
https://www.apache.org/dist/spark/...

Also, the page does not link directly to the hash or sig.
This makes it very difficult for the user, as they have to choose the correct 
file.
The download page must link directly to the actual sig or hash.

Ideally do so for the archived releases as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


