[jira] [Commented] (SPARK-26265) deadlock between TaskMemoryManager and BytesToBytesMap$MapIterator
[ https://issues.apache.org/jira/browse/SPARK-26265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16711079#comment-16711079 ] qian han commented on SPARK-26265:
# Hundreds of thousands of applications run on our cluster per day, and this deadlock has happened only once, so it cannot be reproduced easily.
# I ran the following Spark SQL query:
{code:sql}
INSERT OVERWRITE TABLE dm_abtest.rpt_live_tag_metric_daily PARTITION(date='20181129_bak')
select vid, tag_name, tag_value, count(*) impr_user,
       avg(impr) impr_per_u, stddev_pop(impr) var_impr_per_u,
       avg(read) read_per_u, stddev_pop(read) var_read_per_u,
       avg(stay) stay_per_u, stddev_pop(stay) var_stay_per_u,
       sum(stay)/sum(read) stay_per_r, sum(read)/sum(impr) read_per_i,
       avg(finish) finish_per_u, stddev_pop(finish) var_finish_per_u
from (
  select vid, user_uid, user_uid_type, tag_name, tag_value,
         sum(impr) impr, sum(read) read, sum(stay) stay,
         sum(stay_count) stay_count, 0 finish
  from (
    select transform(vid, user_uid, user_uid_type, tags, impr, read, stay, stay_count)
           USING 'python transform.py 11'
           AS (vids, user_uid, user_uid_type, tag_name, tag_value, impr, read, stay, stay_count)
    from (
      SELECT vid, user_uid, user_uid_type, tags, count(*) impr,
             sum(all_read) read, sum(video_stay) stay,
             sum(if(video_stay>0, 1, 0)) stay_count
      FROM dm_abtest.stg_live_impression_stats_daily
      WHERE date='20181129' and vid <> ''
      GROUP BY vid, user_uid, user_uid_type, tags
    ) t
    distribute by vids, user_uid, user_uid_type, tag_name, tag_value
  ) t lateral view explode(split(vids, ',')) b as vid
  group by vid, user_uid, user_uid_type, tag_name, tag_value
) t
group by vid, tag_name, tag_value
{code}
# When the deadlock happens, the executor hangs and does nothing.
> deadlock between TaskMemoryManager and BytesToBytesMap$MapIterator
> --
>
> Key: SPARK-26265
> URL: https://issues.apache.org/jira/browse/SPARK-26265
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.2
> Reporter: qian han
> Priority: Major
>
> The application is running on a cluster with 72000 cores and 182000G of memory.
> Environment:
> |spark.dynamicAllocation.minExecutors|5|
> |spark.dynamicAllocation.initialExecutors|30|
> |spark.dynamicAllocation.maxExecutors|400|
> |spark.executor.cores|4|
> |spark.executor.memory|20g|
>
> Stage description:
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:364)
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:422)
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:357)
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:193)
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> java.lang.reflect.Method.invoke(Method.java:498)
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> jstack information as follows:
> Found one Java-level deadlock:
> "Thread-ScriptTransformation-Feed":
>   waiting to lock monitor 0x00e0cb18 (object 0x0002f1641538, a org.apache.spark.memory.TaskMemoryManager),
>   which is held by "Executor task launch worker for task 18899"
> "Executor task launch worker for task 18899":
>   waiting to lock monitor 0x00e09788 (object 0x000302faa3b0, a org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator),
>   which is held by "Thread-ScriptTransformation-Feed"
>
> Java stack information for the threads listed above:
> "Thread-ScriptTransformation-Feed":
> at org.apache.spark.memory.TaskMemoryManager.freePage(TaskMemoryManager.java:332)
> - waiting to lock <0x0002f1641538> (a org.apache.spark.memory.TaskMemoryManager)
> at org.apache.spark.memory.MemoryConsumer.freePage(MemoryConsumer.java:130)
> at org.apache.spark.unsafe.map.BytesToBytesMap.access$300(BytesToBytesMap.java:66)
> at org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.advanceToNextPage(BytesToBytesMap.java:274)
> - locked <0x000302faa3b0> (a org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator)
> at org.apache.spark.unsafe.map.BytesToBytesMap$MapIterator.
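The cycle in the jstack output above is a classic lock-ordering deadlock: one thread holds the MapIterator monitor and wants the TaskMemoryManager monitor, while the other holds them in the opposite order. The standard remedy is a global lock order. A minimal sketch in plain Python threading (not Spark code; the lock and function names below are stand-ins chosen to mirror the trace):

```python
import threading

# Stand-ins for the two monitors named in the jstack output.
task_memory_manager_lock = threading.Lock()
map_iterator_lock = threading.Lock()

# The deadlock arises when two threads take these locks in opposite orders.
# The fix sketched here: every code path acquires task_memory_manager_lock
# first, then map_iterator_lock, so no wait-for cycle can form.

def free_page():
    with task_memory_manager_lock:      # outer lock first...
        with map_iterator_lock:         # ...then inner lock
            return "freed"

def advance_to_next_page():
    # Same order as free_page(), so the two threads can never cycle.
    with task_memory_manager_lock:
        with map_iterator_lock:
            return "advanced"

results = []
threads = [threading.Thread(target=lambda: results.append(free_page())),
           threading.Thread(target=lambda: results.append(advance_to_next_page()))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # both threads complete: ['advanced', 'freed']
```

With consistent ordering, both threads run to completion; the hang described in the report can only occur when the two monitors are acquired in opposite orders by concurrent threads.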
[jira] [Commented] (SPARK-26288) add initRegisteredExecutorsDB in ExternalShuffleService
[ https://issues.apache.org/jira/browse/SPARK-26288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16711068#comment-16711068 ] Apache Spark commented on SPARK-26288:
User 'weixiuli' has created a pull request for this issue: https://github.com/apache/spark/pull/23243
> add initRegisteredExecutorsDB in ExternalShuffleService
> ---
>
> Key: SPARK-26288
> URL: https://issues.apache.org/jira/browse/SPARK-26288
> Project: Spark
> Issue Type: New Feature
> Components: Kubernetes, Shuffle
> Affects Versions: 2.4.0
> Reporter: weixiuli
> Priority: Major
> Fix For: 2.4.0
>
> Spark on YARN uses a DB to record RegisteredExecutors information, so when the ExternalShuffleService restarts, that information can be reloaded and reused.
> Neither Spark standalone nor Spark on Kubernetes records its RegisteredExecutors information in a DB or elsewhere, so when the ExternalShuffleService restarts, that information is lost, which is not what we want.
> This commit adds initRegisteredExecutorsDB, which can be used by either Spark standalone or Spark on Kubernetes to record RegisteredExecutors information so that it can be reloaded and reused when the ExternalShuffleService restarts.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26288) add initRegisteredExecutorsDB in ExternalShuffleService
[ https://issues.apache.org/jira/browse/SPARK-26288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26288:
Assignee: (was: Apache Spark)
> add initRegisteredExecutorsDB in ExternalShuffleService
> ---
>
> Key: SPARK-26288
> URL: https://issues.apache.org/jira/browse/SPARK-26288
> Project: Spark
> Issue Type: New Feature
> Components: Kubernetes, Shuffle
> Affects Versions: 2.4.0
> Reporter: weixiuli
> Priority: Major
> Fix For: 2.4.0
>
> Spark on YARN uses a DB to record RegisteredExecutors information, so when the ExternalShuffleService restarts, that information can be reloaded and reused.
> Neither Spark standalone nor Spark on Kubernetes records its RegisteredExecutors information in a DB or elsewhere, so when the ExternalShuffleService restarts, that information is lost, which is not what we want.
> This commit adds initRegisteredExecutorsDB, which can be used by either Spark standalone or Spark on Kubernetes to record RegisteredExecutors information so that it can be reloaded and reused when the ExternalShuffleService restarts.
[jira] [Assigned] (SPARK-26288) add initRegisteredExecutorsDB in ExternalShuffleService
[ https://issues.apache.org/jira/browse/SPARK-26288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26288:
Assignee: Apache Spark
> add initRegisteredExecutorsDB in ExternalShuffleService
> ---
>
> Key: SPARK-26288
> URL: https://issues.apache.org/jira/browse/SPARK-26288
> Project: Spark
> Issue Type: New Feature
> Components: Kubernetes, Shuffle
> Affects Versions: 2.4.0
> Reporter: weixiuli
> Assignee: Apache Spark
> Priority: Major
> Fix For: 2.4.0
>
> Spark on YARN uses a DB to record RegisteredExecutors information, so when the ExternalShuffleService restarts, that information can be reloaded and reused.
> Neither Spark standalone nor Spark on Kubernetes records its RegisteredExecutors information in a DB or elsewhere, so when the ExternalShuffleService restarts, that information is lost, which is not what we want.
> This commit adds initRegisteredExecutorsDB, which can be used by either Spark standalone or Spark on Kubernetes to record RegisteredExecutors information so that it can be reloaded and reused when the ExternalShuffleService restarts.
[jira] [Created] (SPARK-26288) add initRegisteredExecutorsDB in ExternalShuffleService
weixiuli created SPARK-26288:
Summary: add initRegisteredExecutorsDB in ExternalShuffleService
Key: SPARK-26288
URL: https://issues.apache.org/jira/browse/SPARK-26288
Project: Spark
Issue Type: New Feature
Components: Kubernetes, Shuffle
Affects Versions: 2.4.0
Reporter: weixiuli
Fix For: 2.4.0

Spark on YARN uses a DB to record RegisteredExecutors information, so when the ExternalShuffleService restarts, that information can be reloaded and reused.
Neither Spark standalone nor Spark on Kubernetes records its RegisteredExecutors information in a DB or elsewhere, so when the ExternalShuffleService restarts, that information is lost, which is not what we want.
This commit adds initRegisteredExecutorsDB, which can be used by either Spark standalone or Spark on Kubernetes to record RegisteredExecutors information so that it can be reloaded and reused when the ExternalShuffleService restarts.
[jira] [Commented] (SPARK-26182) Cost increases when optimizing scalaUDF
[ https://issues.apache.org/jira/browse/SPARK-26182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16711014#comment-16711014 ] Takeshi Yamamuro commented on SPARK-26182:
This is expected behaviour and a known issue, e.g., https://issues.apache.org/jira/browse/SPARK-15282. This is not a bug because it doesn't affect correctness.
> Cost increases when optimizing scalaUDF
> ---
>
> Key: SPARK-26182
> URL: https://issues.apache.org/jira/browse/SPARK-26182
> Project: Spark
> Issue Type: Bug
> Components: Optimizer
> Affects Versions: 2.4.0
> Reporter: Jiayi Liao
> Priority: Major
>
> Let's assume we have a UDF called splitUDF that outputs a map.
> The SQL
> {code:java}
> select
>   g['a'], g['b']
> from
>   ( select splitUDF(x) as g from table) tbl
> {code}
> will be optimized to the same logical plan as
> {code:java}
> select splitUDF(x)['a'], splitUDF(x)['b'] from table
> {code}
> which means that splitUDF is executed twice instead of once.
> The optimization comes from CollapseProject.
> I'm not sure whether this is a bug or not. Please tell me if I'm wrong about this.
[jira] [Updated] (SPARK-26182) Cost increases when optimizing scalaUDF
[ https://issues.apache.org/jira/browse/SPARK-26182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-26182:
Issue Type: Improvement (was: Bug)
> Cost increases when optimizing scalaUDF
> ---
>
> Key: SPARK-26182
> URL: https://issues.apache.org/jira/browse/SPARK-26182
> Project: Spark
> Issue Type: Improvement
> Components: Optimizer
> Affects Versions: 2.4.0
> Reporter: Jiayi Liao
> Priority: Major
>
> Let's assume we have a UDF called splitUDF that outputs a map.
> The SQL
> {code:java}
> select
>   g['a'], g['b']
> from
>   ( select splitUDF(x) as g from table) tbl
> {code}
> will be optimized to the same logical plan as
> {code:java}
> select splitUDF(x)['a'], splitUDF(x)['b'] from table
> {code}
> which means that splitUDF is executed twice instead of once.
> The optimization comes from CollapseProject.
> I'm not sure whether this is a bug or not. Please tell me if I'm wrong about this.
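The double evaluation described in this issue can be reproduced outside Spark: inlining a subquery alias into each use site duplicates the call, which is exactly what CollapseProject does to the alias `g`. A plain-Python sketch with an illustrative `split_udf` and a call counter (none of these names are Spark APIs):

```python
call_count = 0

def split_udf(x):
    """Stand-in for the expensive UDF; counts how often it runs."""
    global call_count
    call_count += 1
    return {"a": x + 1, "b": x + 2}

# Original plan: the subquery computes g once, the outer query reads two keys.
g = split_udf(10)
once = (g["a"], g["b"])
calls_before_collapse = call_count            # the UDF ran once

# After CollapseProject the alias is inlined into each projection,
# so the UDF runs once per referenced key.
collapsed = (split_udf(10)["a"], split_udf(10)["b"])
calls_after_collapse = call_count - calls_before_collapse  # now it ran twice

print(calls_before_collapse, calls_after_collapse)  # 1 2
```

In Spark itself, one known workaround (assuming Spark 2.3+) is marking the UDF non-deterministic via `udf.asNondeterministic()`, which stops CollapseProject from duplicating it, at the cost of disabling some other optimizations.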
[jira] [Commented] (SPARK-26287) Don't need to create an empty spill file when memory has no records
[ https://issues.apache.org/jira/browse/SPARK-26287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16711004#comment-16711004 ] Apache Spark commented on SPARK-26287:
User 'wangjiaochun' has created a pull request for this issue: https://github.com/apache/spark/pull/23225
> Don't need to create an empty spill file when memory has no records
> ---
>
> Key: SPARK-26287
> URL: https://issues.apache.org/jira/browse/SPARK-26287
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: wangjiaochun
> Priority: Minor
>
> In the function writeSortedFile of the class ShuffleExternalSorter, if there are no records in memory, we don't need to create an empty spill file.
[jira] [Assigned] (SPARK-26287) Don't need to create an empty spill file when memory has no records
[ https://issues.apache.org/jira/browse/SPARK-26287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26287:
Assignee: (was: Apache Spark)
> Don't need to create an empty spill file when memory has no records
> ---
>
> Key: SPARK-26287
> URL: https://issues.apache.org/jira/browse/SPARK-26287
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: wangjiaochun
> Priority: Minor
>
> In the function writeSortedFile of the class ShuffleExternalSorter, if there are no records in memory, we don't need to create an empty spill file.
[jira] [Commented] (SPARK-26287) Don't need to create an empty spill file when memory has no records
[ https://issues.apache.org/jira/browse/SPARK-26287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16711001#comment-16711001 ] Apache Spark commented on SPARK-26287:
User 'wangjiaochun' has created a pull request for this issue: https://github.com/apache/spark/pull/23225
> Don't need to create an empty spill file when memory has no records
> ---
>
> Key: SPARK-26287
> URL: https://issues.apache.org/jira/browse/SPARK-26287
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: wangjiaochun
> Priority: Minor
>
> In the function writeSortedFile of the class ShuffleExternalSorter, if there are no records in memory, we don't need to create an empty spill file.
[jira] [Assigned] (SPARK-26287) Don't need to create an empty spill file when memory has no records
[ https://issues.apache.org/jira/browse/SPARK-26287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26287:
Assignee: Apache Spark
> Don't need to create an empty spill file when memory has no records
> ---
>
> Key: SPARK-26287
> URL: https://issues.apache.org/jira/browse/SPARK-26287
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: wangjiaochun
> Assignee: Apache Spark
> Priority: Minor
>
> In the function writeSortedFile of the class ShuffleExternalSorter, if there are no records in memory, we don't need to create an empty spill file.
[jira] [Created] (SPARK-26287) Don't need to create an empty spill file when memory has no records
wangjiaochun created SPARK-26287:
Summary: Don't need to create an empty spill file when memory has no records
Key: SPARK-26287
URL: https://issues.apache.org/jira/browse/SPARK-26287
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 2.4.0
Reporter: wangjiaochun

In the function writeSortedFile of the class ShuffleExternalSorter, if there are no records in memory, we don't need to create an empty spill file.
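The proposed guard amounts to an early return before any file is created. A minimal plain-Python sketch of that behaviour (the function and file layout here are illustrative, not Spark's actual `ShuffleExternalSorter.writeSortedFile`):

```python
import os
import tempfile

def write_sorted_file(records, spill_dir):
    """Spill in-memory records to a file on disk, skipping the file
    entirely when there is nothing to write (the proposed behaviour)."""
    if not records:                  # no records in memory:
        return None                  # don't create an empty spill file
    fd, path = tempfile.mkstemp(dir=spill_dir, suffix=".spill")
    with os.fdopen(fd, "w") as f:
        for rec in sorted(records):
            f.write(f"{rec}\n")
    return path

spill_dir = tempfile.mkdtemp()
assert write_sorted_file([], spill_dir) is None   # nothing spilled...
assert os.listdir(spill_dir) == []                # ...and no empty file left behind
path = write_sorted_file([3, 1, 2], spill_dir)
assert open(path).read() == "1\n2\n3\n"           # real records still spill normally
```

The early return both avoids an unnecessary disk write and keeps the spill directory free of zero-length files that later merge phases would otherwise have to open and skip.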
[jira] [Assigned] (SPARK-26286) Add MAXIMUM_PAGE_SIZE_BYTES Exception unit test
[ https://issues.apache.org/jira/browse/SPARK-26286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26286:
Assignee: (was: Apache Spark)
> Add MAXIMUM_PAGE_SIZE_BYTES Exception unit test
> ---
>
> Key: SPARK-26286
> URL: https://issues.apache.org/jira/browse/SPARK-26286
> Project: Spark
> Issue Type: Test
> Components: Tests
> Affects Versions: 2.4.0
> Reporter: wangjiaochun
> Priority: Minor
>
> Add a bounds-checking test for the maximum page size exception.
[jira] [Assigned] (SPARK-26286) Add MAXIMUM_PAGE_SIZE_BYTES Exception unit test
[ https://issues.apache.org/jira/browse/SPARK-26286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26286:
Assignee: Apache Spark
> Add MAXIMUM_PAGE_SIZE_BYTES Exception unit test
> ---
>
> Key: SPARK-26286
> URL: https://issues.apache.org/jira/browse/SPARK-26286
> Project: Spark
> Issue Type: Test
> Components: Tests
> Affects Versions: 2.4.0
> Reporter: wangjiaochun
> Assignee: Apache Spark
> Priority: Minor
>
> Add a bounds-checking test for the maximum page size exception.
[jira] [Commented] (SPARK-26286) Add MAXIMUM_PAGE_SIZE_BYTES Exception unit test
[ https://issues.apache.org/jira/browse/SPARK-26286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710997#comment-16710997 ] Apache Spark commented on SPARK-26286:
User 'wangjiaochun' has created a pull request for this issue: https://github.com/apache/spark/pull/23226
> Add MAXIMUM_PAGE_SIZE_BYTES Exception unit test
> ---
>
> Key: SPARK-26286
> URL: https://issues.apache.org/jira/browse/SPARK-26286
> Project: Spark
> Issue Type: Test
> Components: Tests
> Affects Versions: 2.4.0
> Reporter: wangjiaochun
> Priority: Minor
>
> Add a bounds-checking test for the maximum page size exception.
[jira] [Created] (SPARK-26286) Add MAXIMUM_PAGE_SIZE_BYTES Exception unit test
wangjiaochun created SPARK-26286:
Summary: Add MAXIMUM_PAGE_SIZE_BYTES Exception unit test
Key: SPARK-26286
URL: https://issues.apache.org/jira/browse/SPARK-26286
Project: Spark
Issue Type: Test
Components: Tests
Affects Versions: 2.4.0
Reporter: wangjiaochun

Add a bounds-checking test for the maximum page size exception.
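A bounds-check test of this kind exercises both sides of the limit: a request at the cap succeeds, and a request just over it raises. A generic plain-Python sketch (the constant, `allocate_page`, and the error message are illustrative stand-ins, not Spark's TaskMemoryManager API):

```python
MAXIMUM_PAGE_SIZE_BYTES = 1 << 26   # illustrative 64 MB cap

def allocate_page(size):
    """Reject requests above the maximum page size; otherwise hand back
    a small stand-in buffer (a real allocator would map `size` bytes)."""
    if size > MAXIMUM_PAGE_SIZE_BYTES:
        raise ValueError(
            f"Cannot allocate a page with more than {MAXIMUM_PAGE_SIZE_BYTES} bytes")
    return bytearray(min(size, 1024))

# The unit test checks the boundary from both sides.
assert len(allocate_page(MAXIMUM_PAGE_SIZE_BYTES)) > 0   # at the cap: allowed
try:
    allocate_page(MAXIMUM_PAGE_SIZE_BYTES + 1)           # one past: rejected
    raise AssertionError("expected ValueError")
except ValueError as e:
    assert "Cannot allocate" in str(e)
```

Testing exactly at and exactly past the boundary is what distinguishes a bounds-checking test from one that only probes far-out-of-range values.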
[jira] [Assigned] (SPARK-26285) Add a metric source for accumulators (aka AccumulatorSource)
[ https://issues.apache.org/jira/browse/SPARK-26285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26285:
Assignee: (was: Apache Spark)
> Add a metric source for accumulators (aka AccumulatorSource)
> ---
>
> Key: SPARK-26285
> URL: https://issues.apache.org/jira/browse/SPARK-26285
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: Alessandro Bellina
> Priority: Minor
>
> We'd like a simple mechanism to register Spark accumulators against the Codahale metrics registry.
> This task proposes adding a LongAccumulatorSource and a DoubleAccumulatorSource.
[jira] [Assigned] (SPARK-26285) Add a metric source for accumulators (aka AccumulatorSource)
[ https://issues.apache.org/jira/browse/SPARK-26285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26285:
Assignee: Apache Spark
> Add a metric source for accumulators (aka AccumulatorSource)
> ---
>
> Key: SPARK-26285
> URL: https://issues.apache.org/jira/browse/SPARK-26285
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: Alessandro Bellina
> Assignee: Apache Spark
> Priority: Minor
>
> We'd like a simple mechanism to register Spark accumulators against the Codahale metrics registry.
> This task proposes adding a LongAccumulatorSource and a DoubleAccumulatorSource.
[jira] [Commented] (SPARK-26285) Add a metric source for accumulators (aka AccumulatorSource)
[ https://issues.apache.org/jira/browse/SPARK-26285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710975#comment-16710975 ] Apache Spark commented on SPARK-26285:
User 'abellina' has created a pull request for this issue: https://github.com/apache/spark/pull/23242
> Add a metric source for accumulators (aka AccumulatorSource)
> ---
>
> Key: SPARK-26285
> URL: https://issues.apache.org/jira/browse/SPARK-26285
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: Alessandro Bellina
> Priority: Minor
>
> We'd like a simple mechanism to register Spark accumulators against the Codahale metrics registry.
> This task proposes adding a LongAccumulatorSource and a DoubleAccumulatorSource.
[jira] [Commented] (SPARK-26285) Add a metric source for accumulators (aka AccumulatorSource)
[ https://issues.apache.org/jira/browse/SPARK-26285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710974#comment-16710974 ] Apache Spark commented on SPARK-26285:
User 'abellina' has created a pull request for this issue: https://github.com/apache/spark/pull/23242
> Add a metric source for accumulators (aka AccumulatorSource)
> ---
>
> Key: SPARK-26285
> URL: https://issues.apache.org/jira/browse/SPARK-26285
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: Alessandro Bellina
> Priority: Minor
>
> We'd like a simple mechanism to register Spark accumulators against the Codahale metrics registry.
> This task proposes adding a LongAccumulatorSource and a DoubleAccumulatorSource.
[jira] [Commented] (SPARK-26285) Add a metric source for accumulators (aka AccumulatorSource)
[ https://issues.apache.org/jira/browse/SPARK-26285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710970#comment-16710970 ] Alessandro Bellina commented on SPARK-26285:
I can't assign this issue, but I am putting up a PR for it.
> Add a metric source for accumulators (aka AccumulatorSource)
> ---
>
> Key: SPARK-26285
> URL: https://issues.apache.org/jira/browse/SPARK-26285
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: Alessandro Bellina
> Priority: Minor
>
> We'd like a simple mechanism to register Spark accumulators against the Codahale metrics registry.
> This task proposes adding a LongAccumulatorSource and a DoubleAccumulatorSource.
[jira] [Created] (SPARK-26285) Add a metric source for accumulators (aka AccumulatorSource)
Alessandro Bellina created SPARK-26285:
Summary: Add a metric source for accumulators (aka AccumulatorSource)
Key: SPARK-26285
URL: https://issues.apache.org/jira/browse/SPARK-26285
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 2.4.0
Reporter: Alessandro Bellina

We'd like a simple mechanism to register Spark accumulators against the Codahale metrics registry.
This task proposes adding a LongAccumulatorSource and a DoubleAccumulatorSource.
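The idea, sketched in Python terms: a metric source exposes each accumulator's current value as a read-on-demand gauge, the way Codahale/Dropwizard gauges work (polled by a reporter rather than pushed). All names below are illustrative stand-ins, not the proposed Spark API:

```python
class LongAccumulator:
    """Minimal stand-in for a Spark long accumulator."""
    def __init__(self):
        self.value = 0
    def add(self, n):
        self.value += n

class AccumulatorSource:
    """Registers accumulators as named gauges; a metrics reporter can
    then poll get_value(name) on its own schedule."""
    def __init__(self):
        self._gauges = {}
    def register(self, name, acc):
        # A gauge is just a zero-arg callable that reads the live value.
        self._gauges[name] = lambda: acc.value
    def get_value(self, name):
        return self._gauges[name]()

rows_seen = LongAccumulator()
source = AccumulatorSource()
source.register("my.app.rows_seen", rows_seen)

rows_seen.add(5)
assert source.get_value("my.app.rows_seen") == 5   # gauge reflects updates
rows_seen.add(2)
assert source.get_value("my.app.rows_seen") == 7
```

The gauge indirection is the key design point: the registry holds a reader, not a snapshot, so the reported value always tracks the accumulator without extra bookkeeping.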
[jira] [Commented] (SPARK-26261) Spark does not check completeness of temporary files
[ https://issues.apache.org/jira/browse/SPARK-26261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710902#comment-16710902 ] Hyukjin Kwon commented on SPARK-26261:
It would be easier to verify if the code were posted along with the report, so that other people could work on this if you're not going to.
> Spark does not check completeness of temporary files
> ---
>
> Key: SPARK-26261
> URL: https://issues.apache.org/jira/browse/SPARK-26261
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.2
> Reporter: Jialin LIu
> Priority: Minor
>
> Spark does not check temporary files' completeness. When persisting to disk is enabled on some RDDs, a number of temporary files are created in the blockmgr folder. The block manager is able to detect missing blocks, but it is not able to detect file content being modified during execution.
> Our initial test shows that if we truncate a block file before it is used by the executors, the program finishes without detecting any error, but the resulting content is totally wrong.
> We believe there should be a checksum for every RDD file block, and these files should be protected by it.
[jira] [Commented] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors
[ https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710898#comment-16710898 ] Alan commented on SPARK-12312:
I agree! Can we please get this implemented as soon as possible? This prevents us from being compliant with our internal security policies.
> JDBC connection to Kerberos secured databases fails on remote executors
> ---
>
> Key: SPARK-12312
> URL: https://issues.apache.org/jira/browse/SPARK-12312
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.2
> Reporter: nabacg
> Priority: Minor
>
> When loading DataFrames from a JDBC datasource with Kerberos authentication, remote executors (yarn-client/cluster etc. modes) fail to establish a connection due to the lack of a Kerberos ticket or the ability to generate it.
> This is a real issue when trying to ingest data from kerberized data sources (SQL Server, Oracle) in an enterprise environment where exposing simple authentication access is not an option due to IT policy issues.
[jira] [Commented] (SPARK-26261) Spark does not check completeness of temporary files
[ https://issues.apache.org/jira/browse/SPARK-26261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710883#comment-16710883 ] Jialin LIu commented on SPARK-26261:
Our initial test is: we start a word-count workflow that persists blocks to disk. After making sure there are some blocks on the disk, we use the truncate command to cut off part of a block. We then compare the result with the result produced by the workflow without fault injection.
> Spark does not check completeness of temporary files
> ---
>
> Key: SPARK-26261
> URL: https://issues.apache.org/jira/browse/SPARK-26261
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.2
> Reporter: Jialin LIu
> Priority: Minor
>
> Spark does not check temporary files' completeness. When persisting to disk is enabled on some RDDs, a number of temporary files are created in the blockmgr folder. The block manager is able to detect missing blocks, but it is not able to detect file content being modified during execution.
> Our initial test shows that if we truncate a block file before it is used by the executors, the program finishes without detecting any error, but the resulting content is totally wrong.
> We believe there should be a checksum for every RDD file block, and these files should be protected by it.
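The protection the report asks for amounts to storing a checksum next to each block file and verifying it before the block is read back, so truncation fails loudly instead of producing a silently wrong result. A minimal sketch in plain Python using CRC32 (the sidecar `.crc` layout and function names are illustrative, not Spark's block manager):

```python
import os
import tempfile
import zlib

def write_block(path, data: bytes):
    """Write a block and a sidecar checksum of its contents."""
    with open(path, "wb") as f:
        f.write(data)
    with open(path + ".crc", "w") as f:
        f.write(str(zlib.crc32(data)))

def read_block(path) -> bytes:
    """Read a block back, failing loudly if it was truncated or modified."""
    data = open(path, "rb").read()
    expected = int(open(path + ".crc").read())
    if zlib.crc32(data) != expected:
        raise IOError(f"block {path} is corrupt")
    return data

d = tempfile.mkdtemp()
p = os.path.join(d, "rdd_0_0")
write_block(p, b"some block bytes")
assert read_block(p) == b"some block bytes"   # intact block round-trips

# Simulate the fault injection from the comment: truncate part of the block.
with open(p, "r+b") as f:
    f.truncate(4)
try:
    read_block(p)
    raise AssertionError("expected corruption to be detected")
except IOError:
    pass  # truncation is detected instead of yielding a wrong result
```

HDFS uses the same sidecar-checksum pattern for its block files, which is one reason the proposal is plausible for the blockmgr directory as well.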
[jira] [Resolved] (SPARK-26275) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
[ https://issues.apache.org/jira/browse/SPARK-26275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26275. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23236 [https://github.com/apache/spark/pull/23236] > Flaky test: pyspark.mllib.tests.test_streaming_algorithms > StreamingLogisticRegressionWithSGDTests.test_training_and_prediction > -- > > Key: SPARK-26275 > URL: https://issues.apache.org/jira/browse/SPARK-26275 > Project: Spark > Issue Type: Test > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > Looks this test is flaky > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99704/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99569/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99644/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99548/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99454/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99609/console > {code} > == > FAIL: test_training_and_prediction > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) > Test that the model improves on toy data with no. 
of batches > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 367, in test_training_and_prediction > self._eventually(condition) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 78, in _eventually > % (timeout, lastValue)) > AssertionError: Test failed due to timeout after 30 sec, with last condition > returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, > 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74 > -- > Ran 13 tests in 185.051s > FAILED (failures=1, skipped=1) > {code} > This looks happening after increasing the parallelism in Jenkins to speed up. > I am able to reproduce this manually when the resource usage is heavy with > manual decrease of timeout. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
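The `_eventually` helper that times out in the traceback above is essentially a poll-until-true loop. A minimal sketch of that pattern (the real helper lives in the PySpark test module; the names below are illustrative):

```python
import time

def eventually(condition, timeout=30.0, interval=0.1):
    """Poll `condition` until it returns True or the timeout elapses.

    Raises AssertionError carrying the last returned value -- the
    failure mode seen above when the streaming model has not
    converged within the 30-second window.
    """
    deadline = time.time() + timeout
    last = None
    while time.time() < deadline:
        last = condition()
        if last is True:
            return
        time.sleep(interval)
    raise AssertionError(
        "Test failed due to timeout after %s sec, with last condition "
        "returning: %s" % (timeout, last))
```

Under heavy Jenkins load the condition simply converges more slowly, so a fixed timeout makes the test flaky; raising the timeout (as the linked PR did) is the usual remedy.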
[jira] [Assigned] (SPARK-26275) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
[ https://issues.apache.org/jira/browse/SPARK-26275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-26275: Assignee: Hyukjin Kwon > Flaky test: pyspark.mllib.tests.test_streaming_algorithms > StreamingLogisticRegressionWithSGDTests.test_training_and_prediction > -- > > Key: SPARK-26275 > URL: https://issues.apache.org/jira/browse/SPARK-26275 > Project: Spark > Issue Type: Test > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > Looks this test is flaky > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99704/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99569/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99644/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99548/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99454/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99609/console > {code} > == > FAIL: test_training_and_prediction > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) > Test that the model improves on toy data with no. 
of batches > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 367, in test_training_and_prediction > self._eventually(condition) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 78, in _eventually > % (timeout, lastValue)) > AssertionError: Test failed due to timeout after 30 sec, with last condition > returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, > 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74 > -- > Ran 13 tests in 185.051s > FAILED (failures=1, skipped=1) > {code} > This looks happening after increasing the parallelism in Jenkins to speed up. > I am able to reproduce this manually when the resource usage is heavy with > manual decrease of timeout. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-25148) Executors launched with Spark on K8s client mode should prefix name with spark.app.name
[ https://issues.apache.org/jira/browse/SPARK-25148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reopened SPARK-25148: > Executors launched with Spark on K8s client mode should prefix name with > spark.app.name > --- > > Key: SPARK-25148 > URL: https://issues.apache.org/jira/browse/SPARK-25148 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Timothy Chen >Priority: Major > > With the latest added client mode with Spark on k8s, executors launched by > default are all named "spark-exec-#". Which means when multiple jobs are > launched in the same cluster, they often have to retry to find unused pod > names. Also it's hard to correlate which executors are launched for which > spark app. The work around is to manually use the executor prefix > configuration for each job launched. > Ideally the experience should be the same for cluster mode, which each > executor is default prefix with the spark.app.name. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25148) Executors launched with Spark on K8s client mode should prefix name with spark.app.name
[ https://issues.apache.org/jira/browse/SPARK-25148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710726#comment-16710726 ] Marcelo Vanzin commented on SPARK-25148: Actually there was a separate bug for the same issue. Duping... > Executors launched with Spark on K8s client mode should prefix name with > spark.app.name > --- > > Key: SPARK-25148 > URL: https://issues.apache.org/jira/browse/SPARK-25148 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Timothy Chen >Priority: Major > > With the latest added client mode with Spark on k8s, executors launched by > default are all named "spark-exec-#". Which means when multiple jobs are > launched in the same cluster, they often have to retry to find unused pod > names. Also it's hard to correlate which executors are launched for which > spark app. The work around is to manually use the executor prefix > configuration for each job launched. > Ideally the experience should be the same for cluster mode, which each > executor is default prefix with the spark.app.name. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25148) Executors launched with Spark on K8s client mode should prefix name with spark.app.name
[ https://issues.apache.org/jira/browse/SPARK-25148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-25148. Resolution: Duplicate > Executors launched with Spark on K8s client mode should prefix name with > spark.app.name > --- > > Key: SPARK-25148 > URL: https://issues.apache.org/jira/browse/SPARK-25148 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Timothy Chen >Priority: Major > > With the latest added client mode with Spark on k8s, executors launched by > default are all named "spark-exec-#". Which means when multiple jobs are > launched in the same cluster, they often have to retry to find unused pod > names. Also it's hard to correlate which executors are launched for which > spark app. The work around is to manually use the executor prefix > configuration for each job launched. > Ideally the experience should be the same for cluster mode, which each > executor is default prefix with the spark.app.name. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25148) Executors launched with Spark on K8s client mode should prefix name with spark.app.name
[ https://issues.apache.org/jira/browse/SPARK-25148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-25148. Resolution: Cannot Reproduce This seems to work for me locally. Executor pods are prefixed with a unique identifier based on the app name, unless overridden with {{spark.kubernetes.executor.podNamePrefix}}. > Executors launched with Spark on K8s client mode should prefix name with > spark.app.name > --- > > Key: SPARK-25148 > URL: https://issues.apache.org/jira/browse/SPARK-25148 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Timothy Chen >Priority: Major > > With the latest added client mode with Spark on k8s, executors launched by > default are all named "spark-exec-#". Which means when multiple jobs are > launched in the same cluster, they often have to retry to find unused pod > names. Also it's hard to correlate which executors are launched for which > spark app. The work around is to manually use the executor prefix > configuration for each job launched. > Ideally the experience should be the same for cluster mode, which each > executor is default prefix with the spark.app.name. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
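Deriving a unique executor pod prefix from `spark.app.name`, as the resolution describes, requires making the name DNS-1123 safe (lowercase alphanumerics and hyphens only). A hedged illustration of what such sanitization could look like — this is not Spark's actual implementation, just the shape of the problem:

```python
import re
import uuid

def executor_pod_prefix(app_name: str) -> str:
    # Kubernetes pod names must be DNS-1123 labels: lowercase
    # alphanumerics and '-', starting and ending with an alphanumeric.
    safe = re.sub(r"[^a-z0-9-]+", "-", app_name.lower()).strip("-")
    # A short unique suffix avoids the pod-name collisions described
    # above when several apps share one name on the same cluster.
    suffix = uuid.uuid4().hex[:8]
    return f"{safe}-{suffix}" if safe else suffix
```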
[jira] [Comment Edited] (SPARK-26282) Update JVM to 8u191 on jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710630#comment-16710630 ] shane knapp edited comment on SPARK-26282 at 12/5/18 9:02 PM: -- and the centos workers are updated: {noformat} [ sknapp@amp-jenkins-master ] [ ~ ] $ pssh -h jenkins_workers.txt -i "PATH=/usr/java/jdk1.8.0_191/bin:$PATH; java -version" [1] 12:57:19 [SUCCESS] amp-jenkins-worker-04 Stderr: java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode) [2] 12:57:19 [SUCCESS] amp-jenkins-worker-02 Stderr: java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode) [3] 12:57:19 [SUCCESS] amp-jenkins-worker-03 Stderr: java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode) [4] 12:57:19 [SUCCESS] amp-jenkins-worker-06 Stderr: java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode) [5] 12:57:19 [SUCCESS] amp-jenkins-worker-05 Stderr: java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode) [6] 12:57:19 [SUCCESS] amp-jenkins-worker-01 Stderr: java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode){noformat} i have a PR open to update the jenkins job configs, and once that's approved i'll deploy it immediately. 
was (Author: shaneknapp): and the centos workers are updated: {noformat} [ sknapp@amp-jenkins-master ] [ ~ ] $ pssh -h jenkins_workers.txt -i "PATH=/usr/java/jdk1.8.0_191/bin:$PATH; java -version" [1] 12:57:19 [SUCCESS] amp-jenkins-worker-04 Stderr: java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode) [2] 12:57:19 [SUCCESS] amp-jenkins-worker-02 Stderr: java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode) [3] 12:57:19 [SUCCESS] amp-jenkins-worker-03 Stderr: java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode) [4] 12:57:19 [SUCCESS] amp-jenkins-worker-06 Stderr: java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode) [5] 12:57:19 [SUCCESS] amp-jenkins-worker-05 Stderr: java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode) [6] 12:57:19 [SUCCESS] amp-jenkins-worker-01 Stderr: java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode){noformat} i have a PR open to update the jenkins job configs, and once that's approved i'll deploy that immediately. > Update JVM to 8u191 on jenkins workers > -- > > Key: SPARK-26282 > URL: https://issues.apache.org/jira/browse/SPARK-26282 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > the jvm we're using to build/test spark on the centos workers is a bit... 
> long in the teeth: > {noformat} > [sknapp@amp-jenkins-worker-04 ~]$ java -version > java version "1.8.0_60" > Java(TM) SE Runtime Environment (build 1.8.0_60-b27) > Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat} > on the ubuntu nodes, it's only a little bit less old: > {noformat} > sknapp@amp-jenkins-staging-worker-01:~$ java -version > java version "1.8.0_171" > Java(TM) SE Runtime Environment (build 1.8.0_171-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat} > steps to update on centos: > * manually install new(er) java > * update /etc/alternatives > * update JJB configs and update JAVA_HOME/JAVA_BIN > steps to update on ubuntu: > * update ansible to install newer java > * deploy ansible > questions: > * do we stick w/java8 for now? > * which version is sufficient? > [~srowen] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
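The pssh check used in this thread gates on the `java -version` banner printed to stderr. A small sketch of parsing that banner to confirm a worker is at or above 8u191, assuming the pre-JDK9 `1.8.0_NNN` version scheme shown in the output above:

```python
import re

def java_update_at_least(banner: str, major: int, update: int) -> bool:
    # Parse the `java version "1.8.0_191"` line emitted by
    # `java -version` (pre-JDK9 versioning assumed).
    m = re.search(r'java version "1\.(\d+)\.0_(\d+)"', banner)
    if not m:
        return False
    got = (int(m.group(1)), int(m.group(2)))
    return got >= (major, update)
```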
[jira] [Commented] (SPARK-26282) Update JVM to 8u191 on jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710630#comment-16710630 ] shane knapp commented on SPARK-26282: - and the centos workers are updated: {noformat} [ sknapp@amp-jenkins-master ] [ ~ ] $ pssh -h jenkins_workers.txt -i "PATH=/usr/java/jdk1.8.0_191/bin:$PATH; java -version" [1] 12:57:19 [SUCCESS] amp-jenkins-worker-04 Stderr: java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode) [2] 12:57:19 [SUCCESS] amp-jenkins-worker-02 Stderr: java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode) [3] 12:57:19 [SUCCESS] amp-jenkins-worker-03 Stderr: java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode) [4] 12:57:19 [SUCCESS] amp-jenkins-worker-06 Stderr: java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode) [5] 12:57:19 [SUCCESS] amp-jenkins-worker-05 Stderr: java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode) [6] 12:57:19 [SUCCESS] amp-jenkins-worker-01 Stderr: java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode){noformat} i have a PR open to update the jenkins job configs, and once that's approved i'll deploy that immediately. 
> Update JVM to 8u191 on jenkins workers > -- > > Key: SPARK-26282 > URL: https://issues.apache.org/jira/browse/SPARK-26282 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > the jvm we're using to build/test spark on the centos workers is a bit... > long in the teeth: > {noformat} > [sknapp@amp-jenkins-worker-04 ~]$ java -version > java version "1.8.0_60" > Java(TM) SE Runtime Environment (build 1.8.0_60-b27) > Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat} > on the ubuntu nodes, it's only a little bit less old: > {noformat} > sknapp@amp-jenkins-staging-worker-01:~$ java -version > java version "1.8.0_171" > Java(TM) SE Runtime Environment (build 1.8.0_171-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat} > steps to update on centos: > * manually install new(er) java > * update /etc/alternatives > * update JJB configs and update JAVA_HOME/JAVA_BIN > steps to update on ubuntu: > * update ansible to install newer java > * deploy ansible > questions: > * do we stick w/java8 for now? > * which version is sufficient? > [~srowen] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26281) Duration column of task table should be executor run time instead of real duration
[ https://issues.apache.org/jira/browse/SPARK-26281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710622#comment-16710622 ] Apache Spark commented on SPARK-26281: -- User 'shahidki31' has created a pull request for this issue: https://github.com/apache/spark/pull/23160 > Duration column of task table should be executor run time instead of real > duration > -- > > Key: SPARK-26281 > URL: https://issues.apache.org/jira/browse/SPARK-26281 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > In PR https://github.com/apache/spark/pull/23081/ , the duration column is > changed to executor run time. The behavior is consistent with the summary > metrics table and previous Spark version. > However, after PR https://github.com/apache/spark/pull/21688, the issue can > be reproduced again. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
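The distinction this issue draws can be made concrete: wall-clock duration covers everything from task launch to finish, while executor run time excludes overheads such as scheduler delay and task deserialization, which is why the task table and the summary metrics table could disagree. An illustrative sketch — the field names are assumptions, not Spark's API:

```python
def wall_clock_duration_ms(launch_ms: int, finish_ms: int) -> int:
    # What the task table showed before the fix: total elapsed time.
    return finish_ms - launch_ms

def executor_run_time_ms(launch_ms: int, finish_ms: int,
                         scheduler_delay_ms: int,
                         deserialize_ms: int) -> int:
    # What the summary metrics report: time actually spent running
    # the task body on the executor, net of the listed overheads.
    return finish_ms - launch_ms - scheduler_delay_ms - deserialize_ms
```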
[jira] [Updated] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26233: -- Fix Version/s: 2.4.1 2.3.3 2.2.3 > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Miquel Canes >Assignee: Marco Gaido >Priority: Blocker > Labels: correctness > Fix For: 2.2.3, 2.3.3, 2.4.1, 3.0.0 > > > Decimal values from Java beans are incorrectly scaled when used with > functions like first/last/max... > This problem came because Encoders.bean always set Decimal values as > _DecimalType(this.MAX_PRECISION(), 18)._ > Usually it's not a problem if you use numeric functions like *sum* but for > functions like *first*/*last*/*max*... it is a problem. > How to reproduce this error: > Using this class as an example: > {code:java} > public class Foo implements Serializable { > private String group; > private BigDecimal var; > public BigDecimal getVar() { > return var; > } > public void setVar(BigDecimal var) { > this.var = var; > } > public String getGroup() { > return group; > } > public void setGroup(String group) { > this.group = group; > } > } > {code} > > And a dummy code to create some objects: > {code:java} > Dataset ds = spark.range(5) > .map(l -> { > Foo foo = new Foo(); > foo.setGroup("" + l); > foo.setVar(BigDecimal.valueOf(l + 0.)); > return foo; > }, Encoders.bean(Foo.class)); > ds.printSchema(); > ds.show(); > +-+--+ > |group| var| > +-+--+ > | 0|0.| > | 1|1.| > | 2|2.| > | 3|3.| > | 4|4.| > +-+--+ > {code} > We can see that the DecimalType is precision 38 and 18 scale and all values > are show correctly. 
> But if we use a first function, they are scaled incorrectly: > {code:java} > ds.groupBy(col("group")) > .agg( > first("var") > ) > .show(); > +-+-+ > |group|first(var, false)| > +-+-+ > | 3| 3.E-14| > | 0| 1.111E-15| > | 1| 1.E-14| > | 4| 4.E-14| > | 2| 2.E-14| > +-+-+ > {code} > This incorrect behavior cannot be reproduced if we use "numerical "functions > like sum or if the column is cast a new Decimal Type. > {code:java} > ds.groupBy(col("group")) > .agg( > sum("var") > ) > .show(); > +-++ > |group| sum(var)| > +-++ > | 3|3.00| > | 0|0.00| > | 1|1.00| > | 4|4.00| > | 2|2.00| > +-++ > ds.groupBy(col("group")) > .agg( > first(col("var").cast(new DecimalType(38, 8))) > ) > .show(); > +-++ > |group|first(CAST(var AS DECIMAL(38,8)), false)| > +-++ > | 3| 3.| > | 0| 0.| > | 1| 1.| > | 4| 4.| > | 2| 2.| > +-++ > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
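The corrupted values above are consistent with an unscaled decimal being reinterpreted at the bean encoder's fixed scale of 18: for example, 3.0000 stored at scale 4 has unscaled value 30000, and reading those digits back at scale 18 yields 3.0000E-14, matching the `first(var)` output. A small illustration with Python's `decimal` module — an analogy to `DecimalType(38, 18)`, not Spark code:

```python
from decimal import Decimal

def reinterpret_unscaled(value: Decimal, assumed_scale: int) -> Decimal:
    # Take the digits of `value` as a raw unscaled integer and apply a
    # different scale -- mimicking a value written at its own scale but
    # read back at the schema's fixed scale of 18.
    sign, digits, _exponent = value.as_tuple()
    unscaled = int("".join(map(str, digits))) * (-1 if sign else 1)
    return Decimal(unscaled).scaleb(-assumed_scale)
```

This also explains why `sum` is unaffected (it normalizes scales arithmetically) while `first`/`last`/`max` pass the stored representation through unchanged.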
[jira] [Updated] (SPARK-26284) Spark History server object vs file storage behavior difference
[ https://issues.apache.org/jira/browse/SPARK-26284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Damien Doucet-Girard updated SPARK-26284: - Description: I am using the spark history server in order to view running/complete jobs on spark using the kubernetes scheduling backend introduced in 2.3.0. Using a local file path in both {color:#33}{{spark.eventLog.dir}}{color} and {{spark.history.fs.logDirectory}}, I have no issue seeing both incomplete and completed tasks, with {{.inprogress}} files being flushed regularly. However, when using an {{s3a://}} path, it seems the calls to flush the file ([https://github.com/apache/spark/blob/dd518a196c2d40ae48034b8b0950d1c8045c02ed/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L152-L154)] don't actually upload the file to s3. Due to this, I am unable to see currently incomplete tasks using an s3a path. >From the behavior I've observed, it only uploads on completion of the task >(hadoop 2.7) or upon the log file filling up the block size set for s3a >{{spark.hadoop.fs.s3a.multipart.size}} (hadoop 3.0.0). Is this intended >behavior? was: I am using the spark history server in order to view running/complete jobs on spark using the kubernetes scheduling backend introduced in 2.3.0. Using a local file path in both `{color:#33}spark.eventLog.dir{color}` and `spark.history.fs.logDirectory`, I have no issue seeing both incomplete and completed tasks, with `.inprogress` files being flushed regularly. However, when using an `s3a://` path, it seems the calls to flush the file ([https://github.com/apache/spark/blob/dd518a196c2d40ae48034b8b0950d1c8045c02ed/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L152-L154)] don't actually upload the file to s3. Due to this, I am unable to see currently incomplete tasks using an s3a path. 
>From the behavior I've observed, it only uploads on completion of the task >(hadoop 2.7) or upon the log file filling up the block size set for s3a >`{color:#6a8759}{color:#33}spark.hadoop.fs.s3a.multipart.size{color}` >{color}(hadoop 3.0.0). Is this intended behavior? > Spark History server object vs file storage behavior difference > --- > > Key: SPARK-26284 > URL: https://issues.apache.org/jira/browse/SPARK-26284 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Damien Doucet-Girard >Priority: Minor > > I am using the spark history server in order to view running/complete jobs on > spark using the kubernetes scheduling backend introduced in 2.3.0. Using a > local file path in both {color:#33}{{spark.eventLog.dir}}{color} and > {{spark.history.fs.logDirectory}}, I have no issue seeing both incomplete and > completed tasks, with {{.inprogress}} files being flushed regularly. However, > when using an {{s3a://}} path, it seems the calls to flush the file > ([https://github.com/apache/spark/blob/dd518a196c2d40ae48034b8b0950d1c8045c02ed/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L152-L154)] > don't actually upload the file to s3. Due to this, I am unable to see > currently incomplete tasks using an s3a path. > From the behavior I've observed, it only uploads on completion of the task > (hadoop 2.7) or upon the log file filling up the block size set for s3a > {{spark.hadoop.fs.s3a.multipart.size}} (hadoop 3.0.0). Is this intended > behavior? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
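The visibility gap described in this report follows from how a multipart upload buffers data: flush-style calls don't force an upload, so nothing is observable in the bucket until a part fills or the stream closes. A toy model of that behavior — the names are illustrative, not the Hadoop s3a API:

```python
class ToyMultipartUploader:
    """Buffers writes and only 'uploads' full parts, mimicking why an
    in-progress event log is invisible in S3 until a part fills."""

    def __init__(self, part_size: int):
        self.part_size = part_size
        self.buffer = b""
        self.uploaded_parts = []  # what a reader of the bucket sees

    def write(self, data: bytes) -> None:
        self.buffer += data
        while len(self.buffer) >= self.part_size:
            self.uploaded_parts.append(self.buffer[:self.part_size])
            self.buffer = self.buffer[self.part_size:]

    def flush(self) -> None:
        # Deliberately a no-op below the part boundary: this is the
        # observed behavior of the s3a event log in this report.
        pass

    def close(self) -> None:
        # Completing the upload finally makes the tail visible,
        # matching "only uploads on completion" above.
        if self.buffer:
            self.uploaded_parts.append(self.buffer)
            self.buffer = b""
```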
[jira] [Created] (SPARK-26284) Spark History server object vs file storage behavior difference
Damien Doucet-Girard created SPARK-26284: Summary: Spark History server object vs file storage behavior difference Key: SPARK-26284 URL: https://issues.apache.org/jira/browse/SPARK-26284 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Reporter: Damien Doucet-Girard I am using the spark history server in order to view running/complete jobs on spark using the kubernetes scheduling backend introduced in 2.3.0. Using a local file path in both `{color:#33}spark.eventLog.dir{color}` and `spark.history.fs.logDirectory`, I have no issue seeing both incomplete and completed tasks, with `.inprogress` files being flushed regularly. However, when using an `s3a://` path, it seems the calls to flush the file ([https://github.com/apache/spark/blob/dd518a196c2d40ae48034b8b0950d1c8045c02ed/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L152-L154)] don't actually upload the file to s3. Due to this, I am unable to see currently incomplete tasks using an s3a path. >From the behavior I've observed, it only uploads on completion of the task >(hadoop 2.7) or upon the log file filling up the block size set for s3a >`{color:#6a8759}{color:#33}spark.hadoop.fs.s3a.multipart.size{color}` >{color}(hadoop 3.0.0). Is this intended behavior? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26282) Update JVM to 8u191 on jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26282: -- Summary: Update JVM to 8u191 on jenkins workers (was: update jvm on jenkins workers) > Update JVM to 8u191 on jenkins workers > -- > > Key: SPARK-26282 > URL: https://issues.apache.org/jira/browse/SPARK-26282 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > the jvm we're using to build/test spark on the centos workers is a bit... > long in the teeth: > {noformat} > [sknapp@amp-jenkins-worker-04 ~]$ java -version > java version "1.8.0_60" > Java(TM) SE Runtime Environment (build 1.8.0_60-b27) > Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat} > on the ubuntu nodes, it's only a little bit less old: > {noformat} > sknapp@amp-jenkins-staging-worker-01:~$ java -version > java version "1.8.0_171" > Java(TM) SE Runtime Environment (build 1.8.0_171-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat} > steps to update on centos: > * manually install new(er) java > * update /etc/alternatives > * update JJB configs and update JAVA_HOME/JAVA_BIN > steps to update on ubuntu: > * update ansible to install newer java > * deploy ansible > questions: > * do we stick w/java8 for now? > * which version is sufficient? > [~srowen] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25919) Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned
[ https://issues.apache.org/jira/browse/SPARK-25919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pawan resolved SPARK-25919. --- Resolution: Fixed This was fixed by Hive in later versions of Jar which are not currently used by Spark yet. https://issues.apache.org/jira/browse/HIVE-11771 > Date value corrupts when tables are "ParquetHiveSerDe" formatted and target > table is Partitioned > > > Key: SPARK-25919 > URL: https://issues.apache.org/jira/browse/SPARK-25919 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 2.1.0, 2.2.1 >Reporter: Pawan >Priority: Blocker > > Hi > I found a really strange issue. Below are the steps to reproduce it. This > issue occurs only when the table row format is ParquetHiveSerDe and the > target table is Partitioned > *Hive:* > Login in to hive terminal on cluster and create below tables. > {code:java} > create table t_src( > name varchar(10), > dob timestamp > ) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' > create table t_tgt( > name varchar(10), > dob timestamp > ) > PARTITIONED BY (city varchar(10)) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'; > {code} > Insert data into the source table (t_src) > {code:java} > INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 > 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 > 00:00:00.0');{code} > *Spark-shell:* > Get on to spark-shell. 
> Execute below commands on spark shell: > {code:java} > import org.apache.spark.sql.hive.HiveContext > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > val q0 = "TRUNCATE table t_tgt" > val q1 = "SELECT CAST(alias.name AS STRING) as a0, alias.dob as a1 FROM > DEFAULT.t_src alias" > val q2 = "INSERT INTO TABLE DEFAULT.t_tgt PARTITION (city) SELECT tbl0.a0 as > c0, tbl0.a1 as c1, NULL as c2 FROM tbl0" > sqlContext.sql(q0) > sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0") > sqlContext.sql(q2) > {code} > After this check the contents of target table t_tgt. You will see the date > "0001-01-01 00:00:00" changed to "0002-01-01 00:00:00". Below snippets shows > the contents of both the tables: > {code:java} > select * from t_src; > +-++--+ > | t_src.name | t_src.dob | > +-++--+ > | p1 | 0001-01-01 00:00:00.0 | > | p2 | 0002-01-01 00:00:00.0 | > | p3 | 0003-01-01 00:00:00.0 | > | p4 | 0004-01-01 00:00:00.0 | > +-++–+ > select * from t_tgt; > +-++--+ > | t_src.name | t_src.dob | t_tgt.city | > +-++--+ > | p1 | 0002-01-01 00:00:00.0 |__HIVE_DEF | > | p2 | 0002-01-01 00:00:00.0 |__HIVE_DEF | > | p3 | 0003-01-01 00:00:00.0 |__HIVE_DEF | > | p4 | 0004-01-01 00:00:00.0 |__HIVE_DEF | > +-++--+ > {code} > > Is this a known issue? Is it fixed in any subsequent releases? > Thanks & regards, > Pawan Lawale -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
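Until the fixed Hive jars referenced in the resolution (HIVE-11771) are picked up, corruption like the 0001 → 0002 shift above can at least be detected by comparing source and target rows after the insert. A hedged validation sketch using the tables shown above, not a fix for the underlying bug:

```python
def find_corrupted_rows(src_rows, tgt_rows):
    """Compare (name, dob) pairs from the source and target tables and
    return the names whose timestamp changed during the insert -- the
    check that surfaces the 0001 -> 0002 shift shown above."""
    src = dict(src_rows)
    return [name for name, dob in tgt_rows if src.get(name) != dob]
```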
[jira] [Commented] (SPARK-26282) Update JVM to 8u191 on jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710565#comment-16710565 ] shane knapp commented on SPARK-26282: - ubuntu workers are done... {noformat} [ sknapp@amp-jenkins-master ] [ ~ ] $ pssh -h ubuntu_workers.txt -i "java -version" [1] 11:53:05 [SUCCESS] research-jenkins-worker-07 Stderr: java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode) [2] 11:53:05 [SUCCESS] amp-jenkins-staging-worker-02 Stderr: java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode) [3] 11:53:05 [SUCCESS] amp-jenkins-staging-worker-01 Stderr: java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode) [4] 11:53:05 [SUCCESS] research-jenkins-worker-08 Stderr: java version "1.8.0_191" Java(TM) SE Runtime Environment (build 1.8.0_191-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode){noformat} i'll get to the centos workers after lunch. > Update JVM to 8u191 on jenkins workers > -- > > Key: SPARK-26282 > URL: https://issues.apache.org/jira/browse/SPARK-26282 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > the jvm we're using to build/test spark on the centos workers is a bit... 
> long in the teeth: > {noformat} > [sknapp@amp-jenkins-worker-04 ~]$ java -version > java version "1.8.0_60" > Java(TM) SE Runtime Environment (build 1.8.0_60-b27) > Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat} > on the ubuntu nodes, it's only a little bit less old: > {noformat} > sknapp@amp-jenkins-staging-worker-01:~$ java -version > java version "1.8.0_171" > Java(TM) SE Runtime Environment (build 1.8.0_171-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat} > steps to update on centos: > * manually install new(er) java > * update /etc/alternatives > * update JJB configs and update JAVA_HOME/JAVA_BIN > steps to update on ubuntu: > * update ansible to install newer java > * deploy ansible > questions: > * do we stick w/java8 for now? > * which version is sufficient? > [~srowen] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26282) Update JVM to 8u191 on jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26282: -- Affects Version/s: (was: 2.4.0) 3.0.0 > Update JVM to 8u191 on jenkins workers > -- > > Key: SPARK-26282 > URL: https://issues.apache.org/jira/browse/SPARK-26282 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > the jvm we're using to build/test spark on the centos workers is a bit... > long in the teeth: > {noformat} > [sknapp@amp-jenkins-worker-04 ~]$ java -version > java version "1.8.0_60" > Java(TM) SE Runtime Environment (build 1.8.0_60-b27) > Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat} > on the ubuntu nodes, it's only a little bit less old: > {noformat} > sknapp@amp-jenkins-staging-worker-01:~$ java -version > java version "1.8.0_171" > Java(TM) SE Runtime Environment (build 1.8.0_171-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat} > steps to update on centos: > * manually install new(er) java > * update /etc/alternatives > * update JJB configs and update JAVA_HOME/JAVA_BIN > steps to update on ubuntu: > * update ansible to install newer java > * deploy ansible > questions: > * do we stick w/java8 for now? > * which version is sufficient? > [~srowen] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26282) Update JVM to 8u191 on jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710558#comment-16710558 ] Dongjoon Hyun commented on SPARK-26282: --- +1, great! > Update JVM to 8u191 on jenkins workers > -- > > Key: SPARK-26282 > URL: https://issues.apache.org/jira/browse/SPARK-26282 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > the jvm we're using to build/test spark on the centos workers is a bit... > long in the teeth: > {noformat} > [sknapp@amp-jenkins-worker-04 ~]$ java -version > java version "1.8.0_60" > Java(TM) SE Runtime Environment (build 1.8.0_60-b27) > Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat} > on the ubuntu nodes, it's only a little bit less old: > {noformat} > sknapp@amp-jenkins-staging-worker-01:~$ java -version > java version "1.8.0_171" > Java(TM) SE Runtime Environment (build 1.8.0_171-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat} > steps to update on centos: > * manually install new(er) java > * update /etc/alternatives > * update JJB configs and update JAVA_HOME/JAVA_BIN > steps to update on ubuntu: > * update ansible to install newer java > * deploy ansible > questions: > * do we stick w/java8 for now? > * which version is sufficient? > [~srowen] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26282) update jvm on jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710540#comment-16710540 ] shane knapp commented on SPARK-26282: - looks like 191 is the most current java8... deploying that today. > update jvm on jenkins workers > - > > Key: SPARK-26282 > URL: https://issues.apache.org/jira/browse/SPARK-26282 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > the jvm we're using to build/test spark on the centos workers is a bit... > long in the teeth: > {noformat} > [sknapp@amp-jenkins-worker-04 ~]$ java -version > java version "1.8.0_60" > Java(TM) SE Runtime Environment (build 1.8.0_60-b27) > Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat} > on the ubuntu nodes, it's only a little bit less old: > {noformat} > sknapp@amp-jenkins-staging-worker-01:~$ java -version > java version "1.8.0_171" > Java(TM) SE Runtime Environment (build 1.8.0_171-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat} > steps to update on centos: > * manually install new(er) java > * update /etc/alternatives > * update JJB configs and update JAVA_HOME/JAVA_BIN > steps to update on ubuntu: > * update ansible to install newer java > * deploy ansible > questions: > * do we stick w/java8 for now? > * which version is sufficient? > [~srowen] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-25919) Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned
[ https://issues.apache.org/jira/browse/SPARK-25919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pawan closed SPARK-25919. - > Date value corrupts when tables are "ParquetHiveSerDe" formatted and target > table is Partitioned > > > Key: SPARK-25919 > URL: https://issues.apache.org/jira/browse/SPARK-25919 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 2.1.0, 2.2.1 >Reporter: Pawan >Priority: Blocker > > Hi > I found a really strange issue. Below are the steps to reproduce it. This > issue occurs only when the table row format is ParquetHiveSerDe and the > target table is Partitioned > *Hive:* > Login in to hive terminal on cluster and create below tables. > {code:java} > create table t_src( > name varchar(10), > dob timestamp > ) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' > create table t_tgt( > name varchar(10), > dob timestamp > ) > PARTITIONED BY (city varchar(10)) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'; > {code} > Insert data into the source table (t_src) > {code:java} > INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 > 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 > 00:00:00.0');{code} > *Spark-shell:* > Get on to spark-shell. 
> Execute the commands below on the spark shell:
> {code:java}
> import org.apache.spark.sql.hive.HiveContext
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> val q0 = "TRUNCATE table t_tgt"
> val q1 = "SELECT CAST(alias.name AS STRING) as a0, alias.dob as a1 FROM DEFAULT.t_src alias"
> val q2 = "INSERT INTO TABLE DEFAULT.t_tgt PARTITION (city) SELECT tbl0.a0 as c0, tbl0.a1 as c1, NULL as c2 FROM tbl0"
> sqlContext.sql(q0)
> sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0")
> sqlContext.sql(q2)
> {code}
> After this, check the contents of the target table t_tgt. You will see the date
> "0001-01-01 00:00:00" changed to "0002-01-01 00:00:00". The snippets below show
> the contents of both tables:
> {code:java}
> select * from t_src;
> +------------+------------------------+
> | t_src.name | t_src.dob              |
> +------------+------------------------+
> | p1         | 0001-01-01 00:00:00.0  |
> | p2         | 0002-01-01 00:00:00.0  |
> | p3         | 0003-01-01 00:00:00.0  |
> | p4         | 0004-01-01 00:00:00.0  |
> +------------+------------------------+
> select * from t_tgt;
> +------------+------------------------+------------+
> | t_src.name | t_src.dob              | t_tgt.city |
> +------------+------------------------+------------+
> | p1         | 0002-01-01 00:00:00.0  | __HIVE_DEF |
> | p2         | 0002-01-01 00:00:00.0  | __HIVE_DEF |
> | p3         | 0003-01-01 00:00:00.0  | __HIVE_DEF |
> | p4         | 0004-01-01 00:00:00.0  | __HIVE_DEF |
> +------------+------------------------+------------+
> {code}
>
> Is this a known issue? Is it fixed in any subsequent releases?
> Thanks & regards,
> Pawan Lawale
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25919) Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned
[ https://issues.apache.org/jira/browse/SPARK-25919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710535#comment-16710535 ] Pawan commented on SPARK-25919: --- I just figured out why this is the issue. It's because of the hive-exec jar packaged with Spark. The latest version packaged with Spark-2.1.0 through Spark-2.3.1 is hive-exec-1.2.1.spark2.jar. However, the parquet timestamp bug was fixed by Hive in hive-exec-2.0.0.jar, which is not available in the Spark packages I mentioned earlier. It was fixed as part of the Hive JIRA below: https://issues.apache.org/jira/browse/HIVE-11771 Thanks & regards, Pawan Lawale > Date value corrupts when tables are "ParquetHiveSerDe" formatted and target > table is Partitioned > > > Key: SPARK-25919 > URL: https://issues.apache.org/jira/browse/SPARK-25919 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 2.1.0, 2.2.1 >Reporter: Pawan >Priority: Blocker > > Hi > I found a really strange issue. Below are the steps to reproduce it. This > issue occurs only when the table row format is ParquetHiveSerDe and the > target table is Partitioned > *Hive:* > Log in to the hive terminal on the cluster and create the tables below.
> {code:java} > create table t_src( > name varchar(10), > dob timestamp > ) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' > create table t_tgt( > name varchar(10), > dob timestamp > ) > PARTITIONED BY (city varchar(10)) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'; > {code} > Insert data into the source table (t_src) > {code:java} > INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 > 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 > 00:00:00.0');{code} > *Spark-shell:* > Get on to spark-shell. > Execute below commands on spark shell: > {code:java} > import org.apache.spark.sql.hive.HiveContext > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > val q0 = "TRUNCATE table t_tgt" > val q1 = "SELECT CAST(alias.name AS STRING) as a0, alias.dob as a1 FROM > DEFAULT.t_src alias" > val q2 = "INSERT INTO TABLE DEFAULT.t_tgt PARTITION (city) SELECT tbl0.a0 as > c0, tbl0.a1 as c1, NULL as c2 FROM tbl0" > sqlContext.sql(q0) > sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0") > sqlContext.sql(q2) > {code} > After this check the contents of target table t_tgt. You will see the date > "0001-01-01 00:00:00" changed to "0002-01-01 00:00:00". 
The snippets below show
> the contents of both tables:
> {code:java}
> select * from t_src;
> +------------+------------------------+
> | t_src.name | t_src.dob              |
> +------------+------------------------+
> | p1         | 0001-01-01 00:00:00.0  |
> | p2         | 0002-01-01 00:00:00.0  |
> | p3         | 0003-01-01 00:00:00.0  |
> | p4         | 0004-01-01 00:00:00.0  |
> +------------+------------------------+
> select * from t_tgt;
> +------------+------------------------+------------+
> | t_src.name | t_src.dob              | t_tgt.city |
> +------------+------------------------+------------+
> | p1         | 0002-01-01 00:00:00.0  | __HIVE_DEF |
> | p2         | 0002-01-01 00:00:00.0  | __HIVE_DEF |
> | p3         | 0003-01-01 00:00:00.0  | __HIVE_DEF |
> | p4         | 0004-01-01 00:00:00.0  | __HIVE_DEF |
> +------------+------------------------+------------+
> {code}
>
> Is this a known issue? Is it fixed in any subsequent releases?
> Thanks & regards,
> Pawan Lawale
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26283) When zstd compression enabled, Inprogress application in the history server appUI showing finished job as running
[ https://issues.apache.org/jira/browse/SPARK-26283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710508#comment-16710508 ] Apache Spark commented on SPARK-26283: -- User 'shahidki31' has created a pull request for this issue: https://github.com/apache/spark/pull/23241 > When zstd compression enabled, Inprogress application in the history server > appUI showing finished job as running > - > > Key: SPARK-26283 > URL: https://issues.apache.org/jira/browse/SPARK-26283 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.4.0, 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > When zstd compression enabled, Inprogress application in the history server > appUI showing finished job as running -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26283) When zstd compression enabled, Inprogress application in the history server appUI showing finished job as running
[ https://issues.apache.org/jira/browse/SPARK-26283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710506#comment-16710506 ] Apache Spark commented on SPARK-26283: -- User 'shahidki31' has created a pull request for this issue: https://github.com/apache/spark/pull/23241 > When zstd compression enabled, Inprogress application in the history server > appUI showing finished job as running > - > > Key: SPARK-26283 > URL: https://issues.apache.org/jira/browse/SPARK-26283 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.4.0, 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > When zstd compression enabled, Inprogress application in the history server > appUI showing finished job as running -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26283) When zstd compression enabled, Inprogress application in the history server appUI showing finished job as running
[ https://issues.apache.org/jira/browse/SPARK-26283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26283: Assignee: (was: Apache Spark) > When zstd compression enabled, Inprogress application in the history server > appUI showing finished job as running > - > > Key: SPARK-26283 > URL: https://issues.apache.org/jira/browse/SPARK-26283 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.4.0, 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > When zstd compression enabled, Inprogress application in the history server > appUI showing finished job as running -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26283) When zstd compression enabled, Inprogress application in the history server appUI showing finished job as running
[ https://issues.apache.org/jira/browse/SPARK-26283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26283: Assignee: Apache Spark > When zstd compression enabled, Inprogress application in the history server > appUI showing finished job as running > - > > Key: SPARK-26283 > URL: https://issues.apache.org/jira/browse/SPARK-26283 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.4.0, 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Assignee: Apache Spark >Priority: Minor > > When zstd compression enabled, Inprogress application in the history server > appUI showing finished job as running -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26282) update jvm on jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-26282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710451#comment-16710451 ] Sean Owen commented on SPARK-26282: --- Yes the latest Java 8 JDK (_192?) is best. That may well be one of the final releases anyway. Whatever most recent version you can easily install through the OS updates is fine, as it will be much newer than _60. You're welcome to also install Java 11 while you're at it, as we will need it in the medium term to start running tests against Java 11. > update jvm on jenkins workers > - > > Key: SPARK-26282 > URL: https://issues.apache.org/jira/browse/SPARK-26282 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > the jvm we're using to build/test spark on the centos workers is a bit... > long in the teeth: > {noformat} > [sknapp@amp-jenkins-worker-04 ~]$ java -version > java version "1.8.0_60" > Java(TM) SE Runtime Environment (build 1.8.0_60-b27) > Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat} > on the ubuntu nodes, it's only a little bit less old: > {noformat} > sknapp@amp-jenkins-staging-worker-01:~$ java -version > java version "1.8.0_171" > Java(TM) SE Runtime Environment (build 1.8.0_171-b11) > Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat} > steps to update on centos: > * manually install new(er) java > * update /etc/alternatives > * update JJB configs and update JAVA_HOME/JAVA_BIN > steps to update on ubuntu: > * update ansible to install newer java > * deploy ansible > questions: > * do we stick w/java8 for now? > * which version is sufficient? > [~srowen] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26283) When zstd compression enabled, Inprogress application in the history server appUI showing finished job as running
ABHISHEK KUMAR GUPTA created SPARK-26283: Summary: When zstd compression enabled, Inprogress application in the history server appUI showing finished job as running Key: SPARK-26283 URL: https://issues.apache.org/jira/browse/SPARK-26283 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 2.4.0, 3.0.0 Reporter: ABHISHEK KUMAR GUPTA When zstd compression enabled, Inprogress application in the history server appUI showing finished job as running -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26283) When zstd compression enabled, Inprogress application in the history server appUI showing finished job as running
[ https://issues.apache.org/jira/browse/SPARK-26283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710444#comment-16710444 ] shahid commented on SPARK-26283: Thanks. I am working on it. > When zstd compression enabled, Inprogress application in the history server > appUI showing finished job as running > - > > Key: SPARK-26283 > URL: https://issues.apache.org/jira/browse/SPARK-26283 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.4.0, 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > When zstd compression enabled, Inprogress application in the history server > appUI showing finished job as running -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26282) update jvm on jenkins workers
shane knapp created SPARK-26282: --- Summary: update jvm on jenkins workers Key: SPARK-26282 URL: https://issues.apache.org/jira/browse/SPARK-26282 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 2.4.0 Reporter: shane knapp Assignee: shane knapp the jvm we're using to build/test spark on the centos workers is a bit... long in the teeth: {noformat} [sknapp@amp-jenkins-worker-04 ~]$ java -version java version "1.8.0_60" Java(TM) SE Runtime Environment (build 1.8.0_60-b27) Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode){noformat} on the ubuntu nodes, it's only a little bit less old: {noformat} sknapp@amp-jenkins-staging-worker-01:~$ java -version java version "1.8.0_171" Java(TM) SE Runtime Environment (build 1.8.0_171-b11) Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat} steps to update on centos: * manually install new(er) java * update /etc/alternatives * update JJB configs and update JAVA_HOME/JAVA_BIN steps to update on ubuntu: * update ansible to install newer java * deploy ansible questions: * do we stick w/java8 for now? * which version is sufficient? [~srowen] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26281) Duration column of task table should be executor run time instead of real duration
[ https://issues.apache.org/jira/browse/SPARK-26281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26281: Assignee: Apache Spark > Duration column of task table should be executor run time instead of real > duration > -- > > Key: SPARK-26281 > URL: https://issues.apache.org/jira/browse/SPARK-26281 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > In PR https://github.com/apache/spark/pull/23081/ , the duration column is > changed to executor run time. The behavior is consistent with the summary > metrics table and previous Spark version. > However, after PR https://github.com/apache/spark/pull/21688, the issue can > be reproduced again. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26281) Duration column of task table should be executor run time instead of real duration
Gengliang Wang created SPARK-26281: -- Summary: Duration column of task table should be executor run time instead of real duration Key: SPARK-26281 URL: https://issues.apache.org/jira/browse/SPARK-26281 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.0.0 Reporter: Gengliang Wang In PR https://github.com/apache/spark/pull/23081/ , the duration column is changed to executor run time. The behavior is consistent with the summary metrics table and previous Spark version. However, after PR https://github.com/apache/spark/pull/21688, the issue can be reproduced again. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26281) Duration column of task table should be executor run time instead of real duration
[ https://issues.apache.org/jira/browse/SPARK-26281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26281: Assignee: (was: Apache Spark) > Duration column of task table should be executor run time instead of real > duration > -- > > Key: SPARK-26281 > URL: https://issues.apache.org/jira/browse/SPARK-26281 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > In PR https://github.com/apache/spark/pull/23081/ , the duration column is > changed to executor run time. The behavior is consistent with the summary > metrics table and previous Spark version. > However, after PR https://github.com/apache/spark/pull/21688, the issue can > be reproduced again. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26281) Duration column of task table should be executor run time instead of real duration
[ https://issues.apache.org/jira/browse/SPARK-26281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710403#comment-16710403 ] Apache Spark commented on SPARK-26281: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/23240 > Duration column of task table should be executor run time instead of real > duration > -- > > Key: SPARK-26281 > URL: https://issues.apache.org/jira/browse/SPARK-26281 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > In PR https://github.com/apache/spark/pull/23081/ , the duration column is > changed to executor run time. The behavior is consistent with the summary > metrics table and previous Spark version. > However, after PR https://github.com/apache/spark/pull/21688, the issue can > be reproduced again. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26278) V2 Streaming sources cannot be written to V1 sinks
[ https://issues.apache.org/jira/browse/SPARK-26278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710397#comment-16710397 ] Seth Fitzsimmons commented on SPARK-26278: -- I was thinking specifically of the SerializedOffset / Offset incompatibility referenced in SPARK-25257 and fixed in SPARK-23092 (but just the part that affects v2 source -> v1 sinks). > V2 Streaming sources cannot be written to V1 sinks > -- > > Key: SPARK-26278 > URL: https://issues.apache.org/jira/browse/SPARK-26278 > Project: Spark > Issue Type: Bug > Components: Input/Output, Structured Streaming >Affects Versions: 2.3.2 >Reporter: Justin Polchlopek >Priority: Major > > Starting from a streaming DataFrame derived from a custom v2 MicroBatch > reader, we have > {code:java} > val df: DataFrame = ... > assert(df.isStreaming) > val outputFormat = "orc" // also applies to "csv" and "json" but not > "console" > df.writeStream > .format(outputFormat) > .option("checkpointLocation", "/tmp/checkpoints") > .option("path", "/tmp/result") > .start > {code} > This code fails with the following stack trace: > {code:java} > 2018-12-04 08:24:27 ERROR MicroBatchExecution:91 - Query [id = > 193f97bf-8064-4658-8aa6-0f481919eafe, runId = > e96ed7e5-aaf4-4ef4-a3f3-05fe0b01a715] terminated with error > java.lang.ClassCastException: > org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to > org.apache.spark.sql.sources.v2.reader.streaming.Offset > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:405) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:390) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at > org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at > org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:389) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121) > at > org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189){code} > I'm filing th
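The failing cast above happens because an offset restored from the checkpoint log comes back as raw serialized JSON, while the v2 code path expects the source's own offset type; the source (not the engine) has to re-parse it. A minimal sketch of that pattern outside Spark, with all class and function names hypothetical rather than Spark's actual API:

```python
import json

class SerializedOffset:
    """Raw offset text as restored from the checkpoint log."""
    def __init__(self, raw_json):
        self.json = raw_json

class ExampleSourceOffset:
    """A source-specific, v2-style offset (hypothetical example source)."""
    def __init__(self, position):
        self.position = position

    def to_json(self):
        return json.dumps({"position": self.position})

def deserialize_offset(raw_json):
    # Only the source knows how to turn raw JSON back into its offset type.
    return ExampleSourceOffset(json.loads(raw_json)["position"])

def as_v2_offset(offset):
    # Engine-side shim: re-parse restored offsets instead of casting them.
    if isinstance(offset, SerializedOffset):
        return deserialize_offset(offset.json)
    return offset

restored = SerializedOffset(ExampleSourceOffset(42).to_json())
assert as_v2_offset(restored).position == 42
```

This mirrors the shape of the fix referenced in SPARK-23092: pattern-match on the restored offset and route it through the source's deserializer.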
[jira] [Commented] (SPARK-26222) Scan: track file listing time
[ https://issues.apache.org/jira/browse/SPARK-26222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710303#comment-16710303 ] Yuanjian Li commented on SPARK-26222: - Leaving some thoughts for further discussion: * There's one place that tracks file listing duration now, in `FileSourceScanExec`; the metric name is `metadataTime` (maybe an inaccurate name, it should be changed to file listing time). We should add the phase tracking here. * We should also add duration and phase tracking in these 2 places: ** HiveMetastoreCatalog schema inference. ** replaceTableScanWithPartitionMetadata in the OptimizeMetadataOnlyQuery rule. * IIUC, the phase tracking can use `QueryPlanningTracker` directly because it's thread-local and passed through all `RuleExecution`s. * About the meaning of listing time: maybe we can define it to refer only to reads that bypass the cache, because loading from the cache is not the 'heavy' operation we want to track and also takes less time. The listing time then covers not only the first `listFiles` call, but also each call after the cache is refreshed. > Scan: track file listing time > - > > Key: SPARK-26222 > URL: https://issues.apache.org/jira/browse/SPARK-26222 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Reynold Xin >Priority: Major > > We should track file listing time and add it to the scan node's SQL metric, > so we have visibility into how much is spent in file listing. It'd be useful to > track not just duration, but also start and end time so we can construct a > timeline. > This requires a little bit of design to define what file listing time means > when we are reading from cache vs. not cache. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
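The "track start and end time, not just duration" idea from the issue can be sketched generically. This is a hypothetical illustration of the pattern only, not Spark's actual `QueryPlanningTracker` API:

```python
import time

class PhaseTracker:
    """Records (start, end) wall-clock timestamps per named phase.

    A minimal sketch of duration-plus-timeline tracking; the real
    QueryPlanningTracker in Spark is thread-local and much richer.
    """
    def __init__(self):
        self.phases = {}

    def measure_phase(self, name, fn):
        start = time.time()
        try:
            return fn()
        finally:
            # Keep both endpoints so a timeline can be reconstructed,
            # not just the elapsed duration.
            self.phases[name] = (start, time.time())

tracker = PhaseTracker()
files = tracker.measure_phase("fileListing", lambda: ["part-00000", "part-00001"])
start, end = tracker.phases["fileListing"]
print(f"fileListing: {end - start:.6f}s, {len(files)} files")
```

Keeping both endpoints (rather than only `end - start`) is what allows overlapping phases from different operators to be laid out on a single timeline.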
[jira] [Commented] (SPARK-26021) -0.0 and 0.0 not treated consistently, doesn't match Hive
[ https://issues.apache.org/jira/browse/SPARK-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710266#comment-16710266 ] Apache Spark commented on SPARK-26021: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/23239 > -0.0 and 0.0 not treated consistently, doesn't match Hive > - > > Key: SPARK-26021 > URL: https://issues.apache.org/jira/browse/SPARK-26021 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Sean Owen >Assignee: Alon Doron >Priority: Critical > Fix For: 2.4.1, 3.0.0 > > > Per [~adoron] and [~mccheah] and SPARK-24834, I'm splitting this out as a new > issue: > The underlying issue is how Spark and Hive treat 0.0 and -0.0, which are > numerically identical but not the same double value: > In Hive, 0.0 and -0.0 are equal since > https://issues.apache.org/jira/browse/HIVE-11174. > That's not the case with Spark SQL, as "group by" (non-codegen) treats them > as different values. Since their hashes differ, they're put in different > buckets of UnsafeFixedWidthAggregationMap. > In addition, there's an inconsistency when using codegen, as for example in the > following unit test: > {code:java} > println(Seq(0.0d, 0.0d, > -0.0d).toDF("i").groupBy("i").count().collect().mkString(", ")) > {code} > [0.0,3] > {code:java} > println(Seq(0.0d, -0.0d, > 0.0d).toDF("i").groupBy("i").count().collect().mkString(", ")) > {code} > [0.0,1], [-0.0,2] > {code:java} > spark.conf.set("spark.sql.codegen.wholeStage", "false") > println(Seq(0.0d, -0.0d, > 0.0d).toDF("i").groupBy("i").count().collect().mkString(", ")) > {code} > [0.0,2], [-0.0,1] > Note that the only difference between the first 2 lines is the order of the > elements in the Seq. > This inconsistency results from the different partitioning of the Seq and the > use of the generated fast hash map in the first, partial, aggregation. 
> It looks like we need to add a specific check for -0.0 before hashing (both > in codegen and non-codegen modes) if we want to fix this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
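The "-0.0 check before hashing" suggested above can be illustrated in plain Java. This is a hedged sketch of the underlying JVM behavior, not Spark's actual hashing code; the class and method names are invented for the example.

```java
// Why 0.0 and -0.0 land in different hash buckets: they compare equal with
// ==, but their raw IEEE-754 bit patterns differ, so any bit-based hash
// separates them. Normalizing -0.0 to 0.0 before hashing restores consistency.
public class NegativeZeroDemo {
    // Map -0.0 to 0.0; the comparison -0.0 == 0.0 is true, so both sides
    // normalize to the same bit pattern.
    static double normalize(double d) {
        return d == 0.0d ? 0.0d : d;
    }

    public static void main(String[] args) {
        double pos = 0.0d, neg = -0.0d;
        System.out.println(pos == neg);                                   // true
        System.out.println(Double.doubleToRawLongBits(pos)
                == Double.doubleToRawLongBits(neg));                      // false
        System.out.println(Double.hashCode(pos) == Double.hashCode(neg)); // false
        System.out.println(Double.hashCode(normalize(pos))
                == Double.hashCode(normalize(neg)));                      // true
    }
}
```

The same normalization would need to be applied in both the codegen and non-codegen hashing paths to make the grouping results order-independent.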
[jira] [Created] (SPARK-26280) Spark will read entire CSV file even when limit is used
Amir Bar-Or created SPARK-26280: --- Summary: Spark will read entire CSV file even when limit is used Key: SPARK-26280 URL: https://issues.apache.org/jira/browse/SPARK-26280 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.1 Reporter: Amir Bar-Or When you read a CSV file as below, the parser still wastes time reading the entire file: var lineDF1 = spark.read .format("com.databricks.spark.csv") .option("header", "true") // reading the headers .option("mode", "DROPMALFORMED") .option("delimiter", ",") .option("inferSchema", "false") .schema(line_schema) .load(i_lineitem) .limit(10) Even though a LocalLimit is created, this does not stop the FileScan and the parser from parsing the entire file. Is it possible to push the limit down and stop the parsing? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26273) Add OneHotEncoderEstimator as alias to OneHotEncoder
[ https://issues.apache.org/jira/browse/SPARK-26273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-26273: -- Priority: Minor (was: Major) > Add OneHotEncoderEstimator as alias to OneHotEncoder > > > Key: SPARK-26273 > URL: https://issues.apache.org/jira/browse/SPARK-26273 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Minor > > SPARK-26133 removed deprecated OneHotEncoder and renamed > OneHotEncoderEstimator to OneHotEncoder. > Based on ml migration doc, we need to keep OneHotEncoderEstimator as an alias > to OneHotEncoder. > This task is going to add it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25132) Case-insensitive field resolution when reading from Parquet
[ https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710219#comment-16710219 ] Apache Spark commented on SPARK-25132: -- User 'seancxmao' has created a pull request for this issue: https://github.com/apache/spark/pull/23238 > Case-insensitive field resolution when reading from Parquet > --- > > Key: SPARK-25132 > URL: https://issues.apache.org/jira/browse/SPARK-25132 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.1 >Reporter: Chenxiao Mao >Assignee: Chenxiao Mao >Priority: Major > Labels: Parquet > Fix For: 2.4.0 > > > Spark SQL returns NULL for a column whose Hive metastore schema and Parquet > schema are in different letter cases, regardless of spark.sql.caseSensitive > set to true or false. > Here is a simple example to reproduce this issue: > scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1") > spark-sql> show create table t1; > CREATE TABLE `t1` (`id` BIGINT) > USING parquet > OPTIONS ( > `serialization.format` '1' > ) > spark-sql> CREATE TABLE `t2` (`ID` BIGINT) > > USING parquet > > LOCATION 'hdfs://localhost/user/hive/warehouse/t1'; > spark-sql> select * from t1; > 0 > 1 > 2 > 3 > 4 > spark-sql> select * from t2; > NULL > NULL > NULL > NULL > NULL > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26273) Add OneHotEncoderEstimator as alias to OneHotEncoder
[ https://issues.apache.org/jira/browse/SPARK-26273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh resolved SPARK-26273. - Resolution: Won't Fix > Add OneHotEncoderEstimator as alias to OneHotEncoder > > > Key: SPARK-26273 > URL: https://issues.apache.org/jira/browse/SPARK-26273 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > SPARK-26133 removed deprecated OneHotEncoder and renamed > OneHotEncoderEstimator to OneHotEncoder. > Based on ml migration doc, we need to keep OneHotEncoderEstimator as an alias > to OneHotEncoder. > This task is going to add it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25132) Case-insensitive field resolution when reading from Parquet
[ https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710216#comment-16710216 ] Apache Spark commented on SPARK-25132: -- User 'seancxmao' has created a pull request for this issue: https://github.com/apache/spark/pull/23238 > Case-insensitive field resolution when reading from Parquet > --- > > Key: SPARK-25132 > URL: https://issues.apache.org/jira/browse/SPARK-25132 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.1 >Reporter: Chenxiao Mao >Assignee: Chenxiao Mao >Priority: Major > Labels: Parquet > Fix For: 2.4.0 > > > Spark SQL returns NULL for a column whose Hive metastore schema and Parquet > schema are in different letter cases, regardless of spark.sql.caseSensitive > set to true or false. > Here is a simple example to reproduce this issue: > scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1") > spark-sql> show create table t1; > CREATE TABLE `t1` (`id` BIGINT) > USING parquet > OPTIONS ( > `serialization.format` '1' > ) > spark-sql> CREATE TABLE `t2` (`ID` BIGINT) > > USING parquet > > LOCATION 'hdfs://localhost/user/hive/warehouse/t1'; > spark-sql> select * from t1; > 0 > 1 > 2 > 3 > 4 > spark-sql> select * from t2; > NULL > NULL > NULL > NULL > NULL > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26273) Add OneHotEncoderEstimator as alias to OneHotEncoder
[ https://issues.apache.org/jira/browse/SPARK-26273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710192#comment-16710192 ] Liang-Chi Hsieh commented on SPARK-26273: - For now, the consensus from the PR is that we don't need to keep such an alias, even though it is mentioned in the ml migration guide. So I am closing this issue and the PR. > Add OneHotEncoderEstimator as alias to OneHotEncoder > > > Key: SPARK-26273 > URL: https://issues.apache.org/jira/browse/SPARK-26273 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > SPARK-26133 removed deprecated OneHotEncoder and renamed > OneHotEncoderEstimator to OneHotEncoder. > Based on ml migration doc, we need to keep OneHotEncoderEstimator as an alias > to OneHotEncoder. > This task is going to add it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26279) Remove unused method in Logging
[ https://issues.apache.org/jira/browse/SPARK-26279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26279: Assignee: (was: Apache Spark) > Remove unused method in Logging > --- > > Key: SPARK-26279 > URL: https://issues.apache.org/jira/browse/SPARK-26279 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Priority: Major > > The method isTraceEnabled is not used anywhere. We should remove it to avoid > confusion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2629) Improved state management for Spark Streaming (mapWithState)
[ https://issues.apache.org/jira/browse/SPARK-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710144#comment-16710144 ] Dan Dutrow commented on SPARK-2629: --- This PR should not reference SPARK-2629 > Improved state management for Spark Streaming (mapWithState) > > > Key: SPARK-2629 > URL: https://issues.apache.org/jira/browse/SPARK-2629 > Project: Spark > Issue Type: Epic > Components: DStreams >Affects Versions: 0.9.2, 1.0.2, 1.2.2, 1.3.1, 1.4.1, 1.5.1 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > Fix For: 1.6.0 > > > Current updateStateByKey provides stateful processing in Spark Streaming. It > allows the user to maintain per-key state and manage that state using an > updateFunction. The updateFunction is called for each key, and it uses new > data and existing state of the key, to generate an updated state. However, > based on community feedback, we have learnt the following lessons. > - Need for more optimized state management that does not scan every key > - Need to make it easier to implement common use cases - (a) timeout of idle > data, (b) returning items other than state > The high level idea that I am proposing is > - Introduce a new API -trackStateByKey- *mapWithState* that, allows the user > to update per-key state, and emit arbitrary records. The new API is necessary > as this will have significantly different semantics than the existing > updateStateByKey API. This API will have direct support for timeouts. > - Internally, the system will keep the state data as a map/list within the > partitions of the state RDDs. The new data RDDs will be partitioned > appropriately, and for all the key-value data, it will lookup the map/list in > the state RDD partition and create a new list/map of updated state data. The > new state RDD partition will be created based on the update data and if > necessary, with old data. > Here is the detailed design doc (*outdated, to be updated*). 
Please take a > look and provide feedback as comments. > https://docs.google.com/document/d/1NoALLyd83zGs1hNGMm0Pc5YOVgiPpMHugGMk6COqxxE/edit#heading=h.ph3w0clkd4em -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
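The high-level design above (per-key state folded with an update function, with direct support for timing out idle keys) can be modeled in a few lines of plain Java. This is an illustrative model of the semantics only, not Spark's distributed implementation; every name in it (StateModel, update, evictIdle) is invented for the sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal single-machine model of the mapWithState idea: per-key state lives
// in a map, an update function folds each new record into the state for its
// key, and keys idle longer than a timeout are evicted.
public class StateModel {
    static class Entry {
        long count;    // the per-key state: here, a simple event count
        long lastSeen; // logical time of the last update, used for timeouts
    }

    final Map<String, Entry> state = new HashMap<>();

    // Fold one new record for 'key' at logical time 'now'; returns the new
    // count, standing in for the arbitrary records mapWithState can emit.
    long update(String key, long now) {
        Entry e = state.computeIfAbsent(key, k -> new Entry());
        e.count += 1;
        e.lastSeen = now;
        return e.count;
    }

    // Timeout support: drop every key idle for longer than 'timeout' ticks.
    void evictIdle(long now, long timeout) {
        state.entrySet().removeIf(en -> now - en.getValue().lastSeen > timeout);
    }

    public static void main(String[] args) {
        StateModel m = new StateModel();
        m.update("a", 1);
        m.update("a", 2);
        m.update("b", 2);
        System.out.println(m.state.get("a").count); // 2
        m.evictIdle(10, 5);                         // both keys idle > 5 ticks
        System.out.println(m.state.isEmpty());      // true
    }
}
```

In Spark itself the state map is sharded across the partitions of the state RDDs, as the design above describes, rather than held in one in-memory map.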
[jira] [Assigned] (SPARK-26279) Remove unused method in Logging
[ https://issues.apache.org/jira/browse/SPARK-26279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26279: Assignee: Apache Spark > Remove unused method in Logging > --- > > Key: SPARK-26279 > URL: https://issues.apache.org/jira/browse/SPARK-26279 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Assignee: Apache Spark >Priority: Major > > The method isTraceEnabled is not used anywhere. We should remove it to avoid > confusion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26279) Remove unused method in Logging
[ https://issues.apache.org/jira/browse/SPARK-26279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710109#comment-16710109 ] Apache Spark commented on SPARK-26279: -- User 'seancxmao' has created a pull request for this issue: https://github.com/apache/spark/pull/23237 > Remove unused method in Logging > --- > > Key: SPARK-26279 > URL: https://issues.apache.org/jira/browse/SPARK-26279 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Priority: Major > > The method isTraceEnabled is not used anywhere. We should remove it to avoid > confusion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26279) Remove unused method in Logging
[ https://issues.apache.org/jira/browse/SPARK-26279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenxiao Mao updated SPARK-26279: - Summary: Remove unused method in Logging (was: Remove unused methods in Logging) > Remove unused method in Logging > --- > > Key: SPARK-26279 > URL: https://issues.apache.org/jira/browse/SPARK-26279 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Priority: Major > > The method isTraceEnabled is not used anywhere. We should remove it to avoid > confusion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26279) Remove unused methods in Logging
Chenxiao Mao created SPARK-26279: Summary: Remove unused methods in Logging Key: SPARK-26279 URL: https://issues.apache.org/jira/browse/SPARK-26279 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.0 Reporter: Chenxiao Mao The method isTraceEnabled is not used anywhere. We should remove it to avoid confusion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24417) Build and Run Spark on JDK11
[ https://issues.apache.org/jira/browse/SPARK-24417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710091#comment-16710091 ] M. Le Bihan edited comment on SPARK-24417 at 12/5/18 1:53 PM: -- Hello, Unaware of the problem with JDK 11, I used it with _Spark 2.3.x_ without trouble for months, mostly calling _lookup()_ functions on RDDs. But when I attempted a _collect()_, I got a failure (an _IllegalArgumentException_). I upgraded to _Spark 2.4.0_, and a message from a class in _org.apache.xbean_ explained: "_Unsupported minor major version 55._". Does this trouble come from memory management or from the _Scala_ language? If, in the end, _Spark 2.x_ cannot support _JDK 11_ and we have to wait for _Spark 3.0_, when is that version planned to be released? Sorry if this is off topic, but: will the next major version still be built on _Scala_ (meaning it has to wait for the _Scala_ project to catch up with _Java_ JDK versions), or only on _Java_, with _Scala_ offered as an independent option? It seems to me, as someone who programs _Spark_ in plain _Java_ rather than _Scala_, that _Scala_ is a cause of underlying troubles. Having a _Spark_ without _Scala_, as one can have a _Spark_ without _Hadoop_, would comfort me: one cause of issues would disappear. Regards, was (Author: mlebihan): Hello, Unaware of the problem with JDK 11, I used it with _Spark 2.3.2_ without trouble for months, mostly calling _lookup()_ functions on RDDs. But when I attempted a _collect()_, I got a failure (an _IllegalArgumentException_). I upgraded to _Spark 2.4.0_, and a message from a class in _org.apache.xbean_ explained: "_Unsupported minor major version 55._". Does this trouble come from memory management or from the _Scala_ language? If, in the end, _Spark 2.x_ cannot support _JDK 11_ and we have to wait for _Spark 3.0_, when is that version planned to be released?
Sorry if this is off topic, but: will the next major version still be built on _Scala_ (meaning it has to wait for the _Scala_ project to catch up with _Java_ JDK versions), or only on _Java_, with _Scala_ offered as an independent option? It seems to me, as someone who programs _Spark_ in plain _Java_ rather than _Scala_, that _Scala_ is a cause of underlying troubles. Having a _Spark_ without _Scala_, as one can have a _Spark_ without _Hadoop_, would comfort me: one cause of issues would disappear. Regards, > Build and Run Spark on JDK11 > > > Key: SPARK-24417 > URL: https://issues.apache.org/jira/browse/SPARK-24417 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 2.3.0 >Reporter: DB Tsai >Priority: Major > > This is an umbrella JIRA for Apache Spark to support JDK11 > As JDK8 is reaching EOL, and JDK9 and 10 are already end of life, per > community discussion, we will skip JDK9 and 10 to support JDK 11 directly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24417) Build and Run Spark on JDK11
[ https://issues.apache.org/jira/browse/SPARK-24417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710091#comment-16710091 ] M. Le Bihan commented on SPARK-24417: - Hello, Unaware of the problem with JDK 11, I used it with _Spark 2.3.2_ without trouble for months, mostly calling _lookup()_ functions on RDDs. But when I attempted a _collect()_, I got a failure (an _IllegalArgumentException_). I upgraded to _Spark 2.4.0_, and a message from a class in _org.apache.xbean_ explained: "_Unsupported minor major version 55._". Does this trouble come from memory management or from the _Scala_ language? If, in the end, _Spark 2.x_ cannot support _JDK 11_ and we have to wait for _Spark 3.0_, when is that version planned to be released? Sorry if this is off topic, but: will the next major version still be built on _Scala_ (meaning it has to wait for the _Scala_ project to catch up with _Java_ JDK versions), or only on _Java_, with _Scala_ offered as an independent option? It seems to me, as someone who programs _Spark_ in plain _Java_ rather than _Scala_, that _Scala_ is a cause of underlying troubles. Having a _Spark_ without _Scala_, as one can have a _Spark_ without _Hadoop_, would comfort me: one cause of issues would disappear. Regards, > Build and Run Spark on JDK11 > > > Key: SPARK-24417 > URL: https://issues.apache.org/jira/browse/SPARK-24417 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 2.3.0 >Reporter: DB Tsai >Priority: Major > > This is an umbrella JIRA for Apache Spark to support JDK11 > As JDK8 is reaching EOL, and JDK9 and 10 are already end of life, per > community discussion, we will skip JDK9 and 10 to support JDK 11 directly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26278) V2 Streaming sources cannot be written to V1 sinks
Justin Polchlopek created SPARK-26278: - Summary: V2 Streaming sources cannot be written to V1 sinks Key: SPARK-26278 URL: https://issues.apache.org/jira/browse/SPARK-26278 Project: Spark Issue Type: Bug Components: Input/Output, Structured Streaming Affects Versions: 2.3.2 Reporter: Justin Polchlopek Starting from a streaming DataFrame derived from a custom v2 MicroBatch reader, we have {code:java} val df: DataFrame = ... assert(df.isStreaming) val outputFormat = "orc" // also applies to "csv" and "json" but not "console" df.writeStream .format(outputFormat) .option("checkpointLocation", "/tmp/checkpoints") .option("path", "/tmp/result") .start {code} This code fails with the following stack trace: {code:java} 2018-12-04 08:24:27 ERROR MicroBatchExecution:91 - Query [id = 193f97bf-8064-4658-8aa6-0f481919eafe, runId = e96ed7e5-aaf4-4ef4-a3f3-05fe0b01a715] terminated with error java.lang.ClassCastException: org.apache.spark.sql.execution.streaming.SerializedOffset cannot be cast to org.apache.spark.sql.sources.v2.reader.streaming.Offset at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:405) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1$$anonfun$apply$9.apply(MicroBatchExecution.scala:390) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at 
org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$1.apply(MicroBatchExecution.scala:390) at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:389) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121) at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117) at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189){code} I'm filing this issue on the suggestion of [~mojodna] who suggests that this problem could be resolved by backporting streaming sinks from spark 2.4.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26277) WholeStageCodegen metrics should be tested with whole-stage codegen enabled
[ https://issues.apache.org/jira/browse/SPARK-26277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26277: Assignee: (was: Apache Spark) > WholeStageCodegen metrics should be tested with whole-stage codegen enabled > --- > > Key: SPARK-26277 > URL: https://issues.apache.org/jira/browse/SPARK-26277 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Priority: Major > > In {{org.apache.spark.sql.execution.metric.SQLMetricsSuite}}, there's a test > case named "WholeStageCodegen metrics". However, it is executed with > whole-stage codegen disabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26270) Having clause does not work with explode anymore
[ https://issues.apache.org/jira/browse/SPARK-26270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710026#comment-16710026 ] Olli Kuonanoja commented on SPARK-26270: Makes sense, thanks [~mgaido] > Having clause does not work with explode anymore > > > Key: SPARK-26270 > URL: https://issues.apache.org/jira/browse/SPARK-26270 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Olli Kuonanoja >Priority: Major > > Hi, > In Spark 2.3.0 it was possible to execute queries like > {code:sql} > select explode(col1) as v from values array(1,2) having v>1 > {code} > but in 2.4.0 it leads to > {noformat} > org.apache.spark.sql.AnalysisException: Generators are not supported outside > the SELECT clause, but got: 'Aggregate [explode(col1#1) AS v#0]; > {noformat} > Before looking into a fix I'm trying to understand if this has been changed > on purpose and if there is an alternate construct available. Could not find > any pre-existing tests for the explode-having combination. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26277) WholeStageCodegen metrics should be tested with whole-stage codegen enabled
[ https://issues.apache.org/jira/browse/SPARK-26277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710037#comment-16710037 ] Apache Spark commented on SPARK-26277: -- User 'seancxmao' has created a pull request for this issue: https://github.com/apache/spark/pull/23224 > WholeStageCodegen metrics should be tested with whole-stage codegen enabled > --- > > Key: SPARK-26277 > URL: https://issues.apache.org/jira/browse/SPARK-26277 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Priority: Major > > In {{org.apache.spark.sql.execution.metric.SQLMetricsSuite}}, there's a test > case named "WholeStageCodegen metrics". However, it is executed with > whole-stage codegen disabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26277) WholeStageCodegen metrics should be tested with whole-stage codegen enabled
[ https://issues.apache.org/jira/browse/SPARK-26277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710039#comment-16710039 ] Apache Spark commented on SPARK-26277: -- User 'seancxmao' has created a pull request for this issue: https://github.com/apache/spark/pull/23224 > WholeStageCodegen metrics should be tested with whole-stage codegen enabled > --- > > Key: SPARK-26277 > URL: https://issues.apache.org/jira/browse/SPARK-26277 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Priority: Major > > In {{org.apache.spark.sql.execution.metric.SQLMetricsSuite}}, there's a test > case named "WholeStageCodegen metrics". However, it is executed with > whole-stage codegen disabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26277) WholeStageCodegen metrics should be tested with whole-stage codegen enabled
[ https://issues.apache.org/jira/browse/SPARK-26277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26277: Assignee: Apache Spark > WholeStageCodegen metrics should be tested with whole-stage codegen enabled > --- > > Key: SPARK-26277 > URL: https://issues.apache.org/jira/browse/SPARK-26277 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Assignee: Apache Spark >Priority: Major > > In {{org.apache.spark.sql.execution.metric.SQLMetricsSuite}}, there's a test > case named "WholeStageCodegen metrics". However, it is executed with > whole-stage codegen disabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26277) WholeStageCodegen metrics should be tested with whole-stage codegen enabled
Chenxiao Mao created SPARK-26277: Summary: WholeStageCodegen metrics should be tested with whole-stage codegen enabled Key: SPARK-26277 URL: https://issues.apache.org/jira/browse/SPARK-26277 Project: Spark Issue Type: Test Components: SQL Affects Versions: 2.4.0 Reporter: Chenxiao Mao In {{org.apache.spark.sql.execution.metric.SQLMetricsSuite}}, there's a test case named "WholeStageCodegen metrics". However, it is executed with whole-stage codegen disabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26276) Broken link on download page
[ https://issues.apache.org/jira/browse/SPARK-26276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebb resolved SPARK-26276. -- Resolution: Invalid Wrong project > Broken link on download page > > > Key: SPARK-26276 > URL: https://issues.apache.org/jira/browse/SPARK-26276 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.3.2 >Reporter: Sebb >Priority: Major > > The download page [1] links to release notes at > http://bahir.apache.org/releases/spark/2.3.2/release-notes > This does not exist. > [1] http://bahir.apache.org/downloads/spark/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26276) Broken link on download page
Sebb created SPARK-26276: Summary: Broken link on download page Key: SPARK-26276 URL: https://issues.apache.org/jira/browse/SPARK-26276 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 2.3.2 Reporter: Sebb The download page [1] links to release notes at http://bahir.apache.org/releases/spark/2.3.2/release-notes This does not exist. [1] http://bahir.apache.org/downloads/spark/
[jira] [Commented] (SPARK-26275) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
[ https://issues.apache.org/jira/browse/SPARK-26275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709994#comment-16709994 ] Apache Spark commented on SPARK-26275: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/23236 > Flaky test: pyspark.mllib.tests.test_streaming_algorithms > StreamingLogisticRegressionWithSGDTests.test_training_and_prediction > -- > > Key: SPARK-26275 > URL: https://issues.apache.org/jira/browse/SPARK-26275 > Project: Spark > Issue Type: Test > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > Looks this test is flaky > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99704/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99569/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99644/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99548/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99454/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99609/console > {code} > == > FAIL: test_training_and_prediction > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) > Test that the model improves on toy data with no. 
of batches > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 367, in test_training_and_prediction > self._eventually(condition) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 78, in _eventually > % (timeout, lastValue)) > AssertionError: Test failed due to timeout after 30 sec, with last condition > returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, > 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74 > -- > Ran 13 tests in 185.051s > FAILED (failures=1, skipped=1) > {code} > This looks happening after increasing the parallelism in Jenkins to speed up. > I am able to reproduce this manually when the resource usage is heavy with > manual decrease of timeout.
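The traceback points at an `_eventually(condition)` helper that polls until a condition holds or a timeout expires, which is why heavy Jenkins load turns a slow convergence into a failure. A minimal sketch of that kind of polling helper (the name, defaults, and message format here are assumptions modeled on the traceback, not PySpark's actual `_eventually`):

```python
import time

def eventually(condition, timeout=30.0, interval=0.1):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Raises AssertionError with the last observed value on timeout, so the
    failure message shows how close the condition was to passing.
    """
    start = time.time()
    last_value = None
    while time.time() - start < timeout:
        last_value = condition()
        if last_value is True:
            return
        time.sleep(interval)
    raise AssertionError(
        "Test failed due to timeout after %s sec, with last condition "
        "returning: %s" % (timeout, last_value))
```

Under resource pressure the condition (model error dropping below a threshold) simply needs more wall-clock time than the fixed timeout allows, which matches the flakiness described above.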
[jira] [Updated] (SPARK-26275) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
[ https://issues.apache.org/jira/browse/SPARK-26275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26275: - Priority: Minor (was: Major) > Flaky test: pyspark.mllib.tests.test_streaming_algorithms > StreamingLogisticRegressionWithSGDTests.test_training_and_prediction > -- > > Key: SPARK-26275 > URL: https://issues.apache.org/jira/browse/SPARK-26275 > Project: Spark > Issue Type: Test > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Looks this test is flaky > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99704/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99569/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99644/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99548/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99454/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99609/console > {code} > == > FAIL: test_training_and_prediction > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) > Test that the model improves on toy data with no. 
of batches > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 367, in test_training_and_prediction > self._eventually(condition) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 78, in _eventually > % (timeout, lastValue)) > AssertionError: Test failed due to timeout after 30 sec, with last condition > returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, > 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74 > -- > Ran 13 tests in 185.051s > FAILED (failures=1, skipped=1) > {code} > This looks happening after increasing the parallelism in Jenkins to speed up. > I am able to reproduce this manually when the resource usage is heavy with > manual decrease of timeout.
[jira] [Assigned] (SPARK-26275) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
[ https://issues.apache.org/jira/browse/SPARK-26275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26275: Assignee: Apache Spark > Flaky test: pyspark.mllib.tests.test_streaming_algorithms > StreamingLogisticRegressionWithSGDTests.test_training_and_prediction > -- > > Key: SPARK-26275 > URL: https://issues.apache.org/jira/browse/SPARK-26275 > Project: Spark > Issue Type: Test > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > Looks this test is flaky > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99704/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99569/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99644/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99548/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99454/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99609/console > {code} > == > FAIL: test_training_and_prediction > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) > Test that the model improves on toy data with no. 
of batches > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 367, in test_training_and_prediction > self._eventually(condition) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 78, in _eventually > % (timeout, lastValue)) > AssertionError: Test failed due to timeout after 30 sec, with last condition > returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, > 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74 > -- > Ran 13 tests in 185.051s > FAILED (failures=1, skipped=1) > {code} > This looks happening after increasing the parallelism in Jenkins to speed up. > I am able to reproduce this manually when the resource usage is heavy with > manual decrease of timeout.
[jira] [Assigned] (SPARK-26275) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
[ https://issues.apache.org/jira/browse/SPARK-26275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26275: Assignee: (was: Apache Spark) > Flaky test: pyspark.mllib.tests.test_streaming_algorithms > StreamingLogisticRegressionWithSGDTests.test_training_and_prediction > -- > > Key: SPARK-26275 > URL: https://issues.apache.org/jira/browse/SPARK-26275 > Project: Spark > Issue Type: Test > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > Looks this test is flaky > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99704/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99569/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99644/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99548/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99454/console > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99609/console > {code} > == > FAIL: test_training_and_prediction > (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) > Test that the model improves on toy data with no. 
of batches > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 367, in test_training_and_prediction > self._eventually(condition) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", > line 78, in _eventually > % (timeout, lastValue)) > AssertionError: Test failed due to timeout after 30 sec, with last condition > returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, > 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74 > -- > Ran 13 tests in 185.051s > FAILED (failures=1, skipped=1) > {code} > This looks happening after increasing the parallelism in Jenkins to speed up. > I am able to reproduce this manually when the resource usage is heavy with > manual decrease of timeout.
[jira] [Commented] (SPARK-26151) Return partial results for bad CSV records
[ https://issues.apache.org/jira/browse/SPARK-26151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709991#comment-16709991 ] Apache Spark commented on SPARK-26151: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/23235 > Return partial results for bad CSV records > -- > > Key: SPARK-26151 > URL: https://issues.apache.org/jira/browse/SPARK-26151 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > Currently, the CSV datasource and from_csv return rows with all nulls for bad > CSV records in PERMISSIVE mode, even if some of the fields were parsed and > converted successfully. For example, given the CSV input: > {code} > 0,2013-111-11 12:13:14 > 1,1983-08-04 > {code} > the row returned for the first line is Row(null, null), but the value 0 can be parsed > and converted successfully, so the result could be Row(0, null). This ticket aims > to change the implementation of UnivocityParser and return the partial result.
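The requested behavior can be illustrated in a few lines of pure Python (this is a sketch of per-field permissive parsing, not Spark's UnivocityParser; the two-column int/timestamp schema matches the ticket's example):

```python
from datetime import datetime

def parse_permissive(line):
    """Parse an 'int,timestamp' CSV line, keeping per-field partial results.

    A field that fails conversion becomes None instead of nulling out the
    whole row -- the change the ticket asks for.
    """
    raw_int, raw_ts = line.split(",", 1)
    try:
        i = int(raw_int)
    except ValueError:
        i = None  # bad integer field only
    try:
        ts = datetime.strptime(raw_ts, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        ts = None  # bad timestamp field only (e.g. month "111")
    return (i, ts)
```

For the ticket's first input line this yields `(0, None)` rather than `(None, None)`: the good field survives the bad one.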
[jira] [Created] (SPARK-26275) Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
Hyukjin Kwon created SPARK-26275: Summary: Flaky test: pyspark.mllib.tests.test_streaming_algorithms StreamingLogisticRegressionWithSGDTests.test_training_and_prediction Key: SPARK-26275 URL: https://issues.apache.org/jira/browse/SPARK-26275 Project: Spark Issue Type: Test Components: MLlib, PySpark Affects Versions: 3.0.0 Reporter: Hyukjin Kwon Looks this test is flaky https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99704/console https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99569/console https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99644/console https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99548/console https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99454/console https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99609/console {code} == FAIL: test_training_and_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests) Test that the model improves on toy data with no. of batches -- Traceback (most recent call last): File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 367, in test_training_and_prediction self._eventually(condition) File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 78, in _eventually % (timeout, lastValue)) AssertionError: Test failed due to timeout after 30 sec, with last condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74 -- Ran 13 tests in 185.051s FAILED (failures=1, skipped=1) {code} This looks happening after increasing the parallelism in Jenkins to speed up. I am able to reproduce this manually when the resource usage is heavy with manual decrease of timeout. 
[jira] [Commented] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709970#comment-16709970 ] Apache Spark commented on SPARK-26233: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/23233 > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Miquel Canes >Assignee: Marco Gaido >Priority: Blocker > Labels: correctness > Fix For: 3.0.0 > > > Decimal values from Java beans are incorrectly scaled when used with > functions like first/last/max... > This problem arises because Encoders.bean always sets Decimal values as > _DecimalType(this.MAX_PRECISION(), 18)._ > Usually it's not a problem if you use numeric functions like *sum*, but for > functions like *first*/*last*/*max*... it is a problem. > How to reproduce this error: > Using this class as an example: > {code:java} > public class Foo implements Serializable { > private String group; > private BigDecimal var; > public BigDecimal getVar() { > return var; > } > public void setVar(BigDecimal var) { > this.var = var; > } > public String getGroup() { > return group; > } > public void setGroup(String group) { > this.group = group; > } > } > {code} > > And some dummy code to create some objects: > {code:java} > Dataset ds = spark.range(5) > .map(l -> { > Foo foo = new Foo(); > foo.setGroup("" + l); > foo.setVar(BigDecimal.valueOf(l + 0.)); > return foo; > }, Encoders.bean(Foo.class)); > ds.printSchema(); > ds.show(); > +-+--+ > |group| var| > +-+--+ > | 0|0.| > | 1|1.| > | 2|2.| > | 3|3.| > | 4|4.| > +-+--+ > {code} > We can see that the DecimalType has precision 38 and scale 18, and all values > are shown correctly.
> But if we use a first function, they are scaled incorrectly: > {code:java} > ds.groupBy(col("group")) > .agg( > first("var") > ) > .show(); > +-+-+ > |group|first(var, false)| > +-+-+ > | 3| 3.E-14| > | 0| 1.111E-15| > | 1| 1.E-14| > | 4| 4.E-14| > | 2| 2.E-14| > +-+-+ > {code} > This incorrect behavior cannot be reproduced if we use "numerical" functions > like sum or if the column is cast to a new DecimalType. > {code:java} > ds.groupBy(col("group")) > .agg( > sum("var") > ) > .show(); > +-++ > |group| sum(var)| > +-++ > | 3|3.00| > | 0|0.00| > | 1|1.00| > | 4|4.00| > | 2|2.00| > +-++ > ds.groupBy(col("group")) > .agg( > first(col("var").cast(new DecimalType(38, 8))) > ) > .show(); > +-++ > |group|first(CAST(var AS DECIMAL(38,8)), false)| > +-++ > | 3| 3.| > | 0| 0.| > | 1| 1.| > | 4| 4.| > | 2| 2.| > +-++ > {code} > >
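The shape of these corrupted values can be seen in pure Python, outside Spark's code path: a fixed-precision decimal is stored as an integer "unscaled value" plus a scale, and reading the unscaled integer back under a different declared scale silently shifts the number. The scales below are illustrative, not the exact ones Spark uses internally:

```python
from decimal import Decimal

def unscaled(value, scale):
    # Fixed-point representation: numeric value = unscaled * 10**-scale,
    # so the stored integer is value * 10**scale.
    return int(Decimal(value).scaleb(scale))

def reinterpret(unscaled_value, scale):
    # Reading the same unscaled integer back with a mismatched scale
    # changes the numeric value -- the mechanism behind results
    # like 3E-14 where 3 was expected.
    return Decimal(unscaled_value).scaleb(-scale)
```

For example, a value written at one scale and read back at scale 18 (the scale `Encoders.bean` declares) comes out many orders of magnitude too small, matching the `first(var)` output shown above.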
[jira] [Commented] (SPARK-26149) Read UTF8String from Parquet/ORC may be incorrect
[ https://issues.apache.org/jira/browse/SPARK-26149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709975#comment-16709975 ] Hyukjin Kwon commented on SPARK-26149: -- Thanks for details, [~yumwang] > Read UTF8String from Parquet/ORC may be incorrect > - > > Key: SPARK-26149 > URL: https://issues.apache.org/jira/browse/SPARK-26149 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0 >Reporter: Yuming Wang >Priority: Major > Attachments: SPARK-26149.snappy.parquet, > image-2018-12-04-10-55-49-369.png > > > How to reproduce: > {code:bash} > scala> > spark.read.parquet("/Users/yumwang/SPARK-26149/SPARK-26149.snappy.parquet").selectExpr("s1 > = s2").show > +-+ > |(s1 = s2)| > +-+ > |false| > +-+ > scala> val first = > spark.read.parquet("/Users/yumwang/SPARK-26149/SPARK-26149.snappy.parquet").collect().head > first: org.apache.spark.sql.Row = > [a0750c1f13f0k5��F8j���b�Ro'4da96,a0750c1f13f0k5��F8j���b�Ro'4da96] > scala> println(first.getString(0).equals(first.getString(1))) > true > {code} > {code:sql} > hive> CREATE TABLE `tb1` (`s1` STRING, `s2` STRING) > > stored as parquet > > location "/Users/yumwang/SPARK-26149"; > OK > Time taken: 0.224 seconds > hive> select s1 = s2 from tb1; > OK > true > Time taken: 0.167 seconds, Fetched: 1 row(s) > {code} > As you can see, only UTF8String returns {{false}}.
[jira] [Commented] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709968#comment-16709968 ] Apache Spark commented on SPARK-26233: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/23234 > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Miquel Canes >Assignee: Marco Gaido >Priority: Blocker > Labels: correctness > Fix For: 3.0.0 > > > Decimal values from Java beans are incorrectly scaled when used with > functions like first/last/max... > This problem arises because Encoders.bean always sets Decimal values as > _DecimalType(this.MAX_PRECISION(), 18)._ > Usually it's not a problem if you use numeric functions like *sum*, but for > functions like *first*/*last*/*max*... it is a problem. > How to reproduce this error: > Using this class as an example: > {code:java} > public class Foo implements Serializable { > private String group; > private BigDecimal var; > public BigDecimal getVar() { > return var; > } > public void setVar(BigDecimal var) { > this.var = var; > } > public String getGroup() { > return group; > } > public void setGroup(String group) { > this.group = group; > } > } > {code} > > And some dummy code to create some objects: > {code:java} > Dataset ds = spark.range(5) > .map(l -> { > Foo foo = new Foo(); > foo.setGroup("" + l); > foo.setVar(BigDecimal.valueOf(l + 0.)); > return foo; > }, Encoders.bean(Foo.class)); > ds.printSchema(); > ds.show(); > +-+--+ > |group| var| > +-+--+ > | 0|0.| > | 1|1.| > | 2|2.| > | 3|3.| > | 4|4.| > +-+--+ > {code} > We can see that the DecimalType has precision 38 and scale 18, and all values > are shown correctly.
> But if we use a first function, they are scaled incorrectly: > {code:java} > ds.groupBy(col("group")) > .agg( > first("var") > ) > .show(); > +-+-+ > |group|first(var, false)| > +-+-+ > | 3| 3.E-14| > | 0| 1.111E-15| > | 1| 1.E-14| > | 4| 4.E-14| > | 2| 2.E-14| > +-+-+ > {code} > This incorrect behavior cannot be reproduced if we use "numerical" functions > like sum or if the column is cast to a new DecimalType. > {code:java} > ds.groupBy(col("group")) > .agg( > sum("var") > ) > .show(); > +-++ > |group| sum(var)| > +-++ > | 3|3.00| > | 0|0.00| > | 1|1.00| > | 4|4.00| > | 2|2.00| > +-++ > ds.groupBy(col("group")) > .agg( > first(col("var").cast(new DecimalType(38, 8))) > ) > .show(); > +-++ > |group|first(CAST(var AS DECIMAL(38,8)), false)| > +-++ > | 3| 3.| > | 0| 0.| > | 1| 1.| > | 4| 4.| > | 2| 2.| > +-++ > {code} > >
[jira] [Commented] (SPARK-26233) Incorrect decimal value with java beans and first/last/max... functions
[ https://issues.apache.org/jira/browse/SPARK-26233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709964#comment-16709964 ] Apache Spark commented on SPARK-26233: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/23232 > Incorrect decimal value with java beans and first/last/max... functions > --- > > Key: SPARK-26233 > URL: https://issues.apache.org/jira/browse/SPARK-26233 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Miquel Canes >Assignee: Marco Gaido >Priority: Blocker > Labels: correctness > Fix For: 3.0.0 > > > Decimal values from Java beans are incorrectly scaled when used with > functions like first/last/max... > This problem arises because Encoders.bean always sets Decimal values as > _DecimalType(this.MAX_PRECISION(), 18)._ > Usually it's not a problem if you use numeric functions like *sum*, but for > functions like *first*/*last*/*max*... it is a problem. > How to reproduce this error: > Using this class as an example: > {code:java} > public class Foo implements Serializable { > private String group; > private BigDecimal var; > public BigDecimal getVar() { > return var; > } > public void setVar(BigDecimal var) { > this.var = var; > } > public String getGroup() { > return group; > } > public void setGroup(String group) { > this.group = group; > } > } > {code} > > And some dummy code to create some objects: > {code:java} > Dataset ds = spark.range(5) > .map(l -> { > Foo foo = new Foo(); > foo.setGroup("" + l); > foo.setVar(BigDecimal.valueOf(l + 0.)); > return foo; > }, Encoders.bean(Foo.class)); > ds.printSchema(); > ds.show(); > +-+--+ > |group| var| > +-+--+ > | 0|0.| > | 1|1.| > | 2|2.| > | 3|3.| > | 4|4.| > +-+--+ > {code} > We can see that the DecimalType has precision 38 and scale 18, and all values > are shown correctly.
> But if we use a first function, they are scaled incorrectly: > {code:java} > ds.groupBy(col("group")) > .agg( > first("var") > ) > .show(); > +-+-+ > |group|first(var, false)| > +-+-+ > | 3| 3.E-14| > | 0| 1.111E-15| > | 1| 1.E-14| > | 4| 4.E-14| > | 2| 2.E-14| > +-+-+ > {code} > This incorrect behavior cannot be reproduced if we use "numerical" functions > like sum or if the column is cast to a new DecimalType. > {code:java} > ds.groupBy(col("group")) > .agg( > sum("var") > ) > .show(); > +-++ > |group| sum(var)| > +-++ > | 3|3.00| > | 0|0.00| > | 1|1.00| > | 4|4.00| > | 2|2.00| > +-++ > ds.groupBy(col("group")) > .agg( > first(col("var").cast(new DecimalType(38, 8))) > ) > .show(); > +-++ > |group|first(CAST(var AS DECIMAL(38,8)), false)| > +-++ > | 3| 3.| > | 0| 0.| > | 1| 1.| > | 4| 4.| > | 2| 2.| > +-++ > {code} > >
[jira] [Commented] (SPARK-26270) Having clause does not work with explode anymore
[ https://issues.apache.org/jira/browse/SPARK-26270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709953#comment-16709953 ] Marco Gaido commented on SPARK-26270: - This is caused by SPARK-25708. You can find more details on that ticket. If you want to switch to the previous behavior Spark had in this case you can set {{spark.sql.legacy.parser.havingWithoutGroupByAsWhere}} as {{true}}. This query, anyway, doesn't work in Postgres either, so I don't think it should be "fixed". Since there is already a config which fits your needs, I am closing this ticket. Please feel free to re-open if you think some further action is required instead. Thanks. > Having clause does not work with explode anymore > > > Key: SPARK-26270 > URL: https://issues.apache.org/jira/browse/SPARK-26270 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Olli Kuonanoja >Priority: Major > > Hi, > In Spark 2.3.0 it was possible to execute queries like > {code:sql} > select explode(col1) as v from values array(1,2) having v>1 > {code} > but in 2.4.0 it leads to > {noformat} > org.apache.spark.sql.AnalysisException: Generators are not supported outside > the SELECT clause, but got: 'Aggregate [explode(col1#1) AS v#0]; > {noformat} > Before looking into a fix I'm trying to understand if this has been changed > on purpose and if there is an alternate construct available. Could not find > any pre-existing tests for the explode-having combination.
[jira] [Resolved] (SPARK-26270) Having clause does not work with explode anymore
[ https://issues.apache.org/jira/browse/SPARK-26270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido resolved SPARK-26270. - Resolution: Invalid > Having clause does not work with explode anymore > > > Key: SPARK-26270 > URL: https://issues.apache.org/jira/browse/SPARK-26270 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Olli Kuonanoja >Priority: Major > > Hi, > In Spark 2.3.0 it was possible to execute queries like > {code:sql} > select explode(col1) as v from values array(1,2) having v>1 > {code} > but in 2.4.0 it leads to > {noformat} > org.apache.spark.sql.AnalysisException: Generators are not supported outside > the SELECT clause, but got: 'Aggregate [explode(col1#1) AS v#0]; > {noformat} > Before looking into a fix I'm trying to understand if this has been changed > on purpose and if there is an alternate construct available. Could not find > any pre-existing tests for the explode-having combination.
[jira] [Commented] (SPARK-26273) Add OneHotEncoderEstimator as alias to OneHotEncoder
[ https://issues.apache.org/jira/browse/SPARK-26273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709886#comment-16709886 ] Apache Spark commented on SPARK-26273: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/23231 > Add OneHotEncoderEstimator as alias to OneHotEncoder > > > Key: SPARK-26273 > URL: https://issues.apache.org/jira/browse/SPARK-26273 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > SPARK-26133 removed deprecated OneHotEncoder and renamed > OneHotEncoderEstimator to OneHotEncoder. > Based on ml migration doc, we need to keep OneHotEncoderEstimator as an alias > to OneHotEncoder. > This task is going to add it.
[jira] [Commented] (SPARK-26273) Add OneHotEncoderEstimator as alias to OneHotEncoder
[ https://issues.apache.org/jira/browse/SPARK-26273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709882#comment-16709882 ] Apache Spark commented on SPARK-26273: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/23231 > Add OneHotEncoderEstimator as alias to OneHotEncoder > > > Key: SPARK-26273 > URL: https://issues.apache.org/jira/browse/SPARK-26273 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > SPARK-26133 removed deprecated OneHotEncoder and renamed > OneHotEncoderEstimator to OneHotEncoder. > Based on ml migration doc, we need to keep OneHotEncoderEstimator as an alias > to OneHotEncoder. > This task is going to add it.
[jira] [Assigned] (SPARK-26273) Add OneHotEncoderEstimator as alias to OneHotEncoder
[ https://issues.apache.org/jira/browse/SPARK-26273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26273: Assignee: Apache Spark > Add OneHotEncoderEstimator as alias to OneHotEncoder > > > Key: SPARK-26273 > URL: https://issues.apache.org/jira/browse/SPARK-26273 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark >Priority: Major > > SPARK-26133 removed deprecated OneHotEncoder and renamed > OneHotEncoderEstimator to OneHotEncoder. > Based on ml migration doc, we need to keep OneHotEncoderEstimator as an alias > to OneHotEncoder. > This task is going to add it.
[jira] [Assigned] (SPARK-26273) Add OneHotEncoderEstimator as alias to OneHotEncoder
[ https://issues.apache.org/jira/browse/SPARK-26273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26273: Assignee: (was: Apache Spark) > Add OneHotEncoderEstimator as alias to OneHotEncoder > > > Key: SPARK-26273 > URL: https://issues.apache.org/jira/browse/SPARK-26273 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > SPARK-26133 removed deprecated OneHotEncoder and renamed > OneHotEncoderEstimator to OneHotEncoder. > Based on ml migration doc, we need to keep OneHotEncoderEstimator as an alias > to OneHotEncoder. > This task is going to add it.
[jira] [Created] (SPARK-26274) Download page must link to https://www.apache.org/dist/spark for current releases
Sebb created SPARK-26274: Summary: Download page must link to https://www.apache.org/dist/spark for current releases Key: SPARK-26274 URL: https://issues.apache.org/jira/browse/SPARK-26274 Project: Spark Issue Type: Bug Components: Deploy, Documentation, Web UI Affects Versions: 2.4.0, 2.3.2 Reporter: Sebb The download page currently uses the archive server: https://archive.apache.org/dist/spark/... for all sigs and hashes. This is fine for archived releases; however, current ones must link to the mirror system, i.e. https://www.apache.org/dist/spark/... Also, the page does not link directly to the hash or sig. This makes it very difficult for the user, as they have to choose the correct file. The download page must link directly to the actual sig or hash. Ideally do so for the archived releases as well.