[jira] [Created] (SPARK-26333) FsHistoryProviderSuite failed because setReadable doesn't work in RedHat
deshanxiao created SPARK-26333:

Summary: FsHistoryProviderSuite failed because setReadable doesn't work in RedHat
Key: SPARK-26333
URL: https://issues.apache.org/jira/browse/SPARK-26333
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.4.0
Reporter: deshanxiao

FsHistoryProviderSuite fails in the case "SPARK-3697: ignore files that cannot be read.". Invoking logFile2.canRead after calling setReadable(false, false) returns true on this machine, whereas on my Ubuntu 16.04 machine it returns false.

The environment:
RedHat: Linux version 3.10.0-693.2.2.el7.x86_64 (buil...@kbuilder.dev.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC)) #1 SMP Tue Sep 12 22:26:13 UTC 2017
JDK: Java version 1.8.0_151, vendor: Oracle Corporation

{code:java}
org.scalatest.exceptions.TestFailedException: 2 was not equal to 1
  at org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340)
  at org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668)
  at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704)
  at org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:183)
  at org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:182)
  at org.apache.spark.deploy.history.FsHistoryProviderSuite.org$apache$spark$deploy$history$FsHistoryProviderSuite$$updateAndCheck(FsHistoryProviderSuite.scala:841)
  at org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:182)
  at org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:148)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
  at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
  at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
  at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
  at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
  at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
  at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
  at org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$runTest(FsHistoryProviderSuite.scala:51)
  at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:203)
  at org.apache.spark.deploy.history.FsHistoryProviderSuite.runTest(FsHistoryProviderSuite.scala:51)
  at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
  at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
  at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
  at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
{code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
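A plausible cause, worth verifying on the affected machine: on POSIX systems the superuser bypasses permission bits, so if the test suite runs as root, File.setReadable(false, false) succeeds yet canRead() still reports true. A small self-contained Java check (not part of the Spark suite) to confirm the behavior:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class SetReadableCheck {
    public static void main(String[] args) throws IOException {
        File f = Files.createTempFile("perm-check", ".log").toFile();
        f.deleteOnExit();

        // Drop read permission for everyone, owner included.
        boolean changed = f.setReadable(false, false);

        // When the JVM runs as root, the kernel grants access regardless of
        // the permission bits, so canRead() typically still returns true.
        System.out.println("setReadable returned: " + changed);
        System.out.println("canRead after setReadable(false, false): " + f.canRead());
        System.out.println("running as: " + System.getProperty("user.name"));
    }
}
```

If this prints `canRead ... true` while running as root, the failure is environmental rather than a Spark bug, and the test would need to be skipped for the root user.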
[jira] [Commented] (SPARK-26262) Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE
[ https://issues.apache.org/jira/browse/SPARK-26262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716426#comment-16716426 ]

ASF GitHub Bot commented on SPARK-26262:

viirya commented on issue #23213: [SPARK-26262][SQL] Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE
URL: https://github.com/apache/spark/pull/23213#issuecomment-446104710

I think wholeStageCodegen doesn't disallow using those objects in interpreted mode. The objects can run in interpreted mode if execution falls back from codegen on a compilation error.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE
> Key: SPARK-26262
> URL: https://issues.apache.org/jira/browse/SPARK-26262
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Takeshi Yamamuro
> Priority: Minor
>
> For better test coverage, we need to run `SQLQueryTestSuite` on 4 mixed config sets:
> 1. WHOLESTAGE_CODEGEN_ENABLED=true, CODEGEN_FACTORY_MODE=CODEGEN_ONLY
> 2. WHOLESTAGE_CODEGEN_ENABLED=false, CODEGEN_FACTORY_MODE=CODEGEN_ONLY
> 3. WHOLESTAGE_CODEGEN_ENABLED=true, CODEGEN_FACTORY_MODE=NO_CODEGEN
> 4. WHOLESTAGE_CODEGEN_ENABLED=false, CODEGEN_FACTORY_MODE=NO_CODEGEN
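The four config sets listed in the issue are just the cross product of the two settings; a minimal, Spark-free Java sketch of enumerating them (the suite itself would presumably pin each combination through SQLConf before running the queries):

```java
import java.util.ArrayList;
import java.util.List;

public class ConfigMatrix {
    /** Enumerate the cross product of the two settings named in the issue. */
    public static List<String> combinations() {
        boolean[] wholeStage = {true, false};
        String[] factoryModes = {"CODEGEN_ONLY", "NO_CODEGEN"};
        List<String> combos = new ArrayList<>();
        for (boolean ws : wholeStage) {
            for (String mode : factoryModes) {
                combos.add("WHOLESTAGE_CODEGEN_ENABLED=" + ws
                        + ", CODEGEN_FACTORY_MODE=" + mode);
            }
        }
        return combos; // exactly the 4 config sets from the issue
    }

    public static void main(String[] args) {
        combinations().forEach(System.out::println);
    }
}
```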
[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716419#comment-16716419 ]

ASF GitHub Bot commented on SPARK-26311:

HeartSaVioR commented on issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446103258

@vanzin Thanks for the detailed review! Addressed review comments.

> [YARN] New feature: custom log URL for stdout/stderr
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
> Issue Type: Improvement
> Components: YARN
> Affects Versions: 2.4.0
> Reporter: Jungtaek Lim
> Priority: Major
>
> Spark sets static log URLs for YARN applications, pointing to the NodeManager web app. Normally this works for both running and finished apps, but there are other approaches to maintaining application logs, such as an external log service, which avoids the application log URL becoming a dead link when the NodeManager is not accessible (node decommissioned, elastic nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, which end users can set to point application logs at an external log service.
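One way such a configuration could plausibly work is placeholder substitution over a URL pattern; the placeholder names and pattern syntax below are purely illustrative, not the ones defined by the PR. A minimal Java sketch:

```java
import java.util.Map;

public class LogUrlTemplate {
    /**
     * Replace {{NAME}} placeholders in a log-URL pattern with concrete
     * values. Placeholder names here are hypothetical examples.
     */
    public static String substitute(String pattern, Map<String, String> values) {
        String url = pattern;
        for (Map.Entry<String, String> e : values.entrySet()) {
            url = url.replace("{{" + e.getKey() + "}}", e.getValue());
        }
        return url;
    }

    public static void main(String[] args) {
        // Hypothetical pattern pointing at an external log service.
        String pattern = "https://logs.example.com/{{NM_HOST}}/{{CONTAINER_ID}}/{{FILE_NAME}}";
        System.out.println(substitute(pattern, Map.of(
                "NM_HOST", "node1.cluster",
                "CONTAINER_ID", "container_1544_0001_01_000002",
                "FILE_NAME", "stderr")));
    }
}
```

Because the pattern is resolved per container, the resulting link survives NodeManager decommissioning as long as the external service retains the logs.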
[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716406#comment-16716406 ]

ASF GitHub Bot commented on SPARK-26311:

AmplabJenkins commented on issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446100271

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99955/
[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716407#comment-16716407 ]

ASF GitHub Bot commented on SPARK-26311:

SparkQA removed a comment on issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446096365

**[Test build #99955 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99955/testReport)** for PR 23260 at commit [`dbeade7`](https://github.com/apache/spark/commit/dbeade7e41f861c9240c70058796293b239db96c).
[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716409#comment-16716409 ]

ASF GitHub Bot commented on SPARK-26311:

AmplabJenkins removed a comment on issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446100271

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99955/
[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716408#comment-16716408 ]

ASF GitHub Bot commented on SPARK-26311:

AmplabJenkins removed a comment on issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446100267

Merged build finished. Test PASSed.
[jira] [Commented] (SPARK-25277) YARN applicationMaster metrics should not register static and JVM metrics
[ https://issues.apache.org/jira/browse/SPARK-25277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716415#comment-16716415 ]

ASF GitHub Bot commented on SPARK-25277:

LucaCanali commented on issue #22279: [SPARK-25277][YARN] YARN applicationMaster metrics should not register static metrics
URL: https://github.com/apache/spark/pull/22279#issuecomment-446102201

Thanks @vanzin for looking at this.

> YARN applicationMaster metrics should not register static and JVM metrics
> Key: SPARK-25277
> URL: https://issues.apache.org/jira/browse/SPARK-25277
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 2.4.0
> Reporter: Luca Canali
> Priority: Minor
>
> YARN applicationMaster metrics registration introduced in SPARK-24594 causes further registration of static metrics (Codegenerator and HiveExternalCatalog) and of JVM metrics, which I believe do not belong in this context.
[jira] [Created] (SPARK-26332) Spark sql write orc table on viewFS throws exception
Bang Xiao created SPARK-26332:

Summary: Spark sql write orc table on viewFS throws exception
Key: SPARK-26332
URL: https://issues.apache.org/jira/browse/SPARK-26332
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.1
Reporter: Bang Xiao

Writing an ORC table on viewFs with Spark SQL causes an exception:

{code:java}
Task failed while writing rows.
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.fs.viewfs.NotInMountpointException: getDefaultReplication on empty path is invalid
  at org.apache.hadoop.fs.viewfs.ViewFileSystem.getDefaultReplication(ViewFileSystem.java:634)
  at org.apache.hadoop.hive.ql.io.orc.WriterImpl.getStream(WriterImpl.java:2103)
  at org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:2120)
  at org.apache.hadoop.hive.ql.io.orc.WriterImpl.checkMemory(WriterImpl.java:352)
  at org.apache.hadoop.hive.ql.io.orc.MemoryManager.notifyWriters(MemoryManager.java:168)
  at org.apache.hadoop.hive.ql.io.orc.MemoryManager.addedRow(MemoryManager.java:157)
  at org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:2413)
  at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:86)
  at org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:392)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
  at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
  ... 8 more
  Suppressed: org.apache.hadoop.fs.viewfs.NotInMountpointException: getDefaultReplication on empty path is invalid
    at org.apache.hadoop.fs.viewfs.ViewFileSystem.getDefaultReplication(ViewFileSystem.java:634)
    at org.apache.hadoop.hive.ql.io.orc.WriterImpl.getStream(WriterImpl.java:2103)
    at org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:2120)
    at org.apache.hadoop.hive.ql.io.orc.WriterImpl.close(WriterImpl.java:2425)
    at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.close(OrcOutputFormat.java:106)
    at org.apache.spark.sql.hive.execution.HiveOutputWriter.close(HiveFileFormat.scala:154)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:405)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$1.apply$mcV$sp(FileFormatWriter.scala:275)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1423)
    ... 9 more
{code}

The exception can be reproduced with the following SQL statements:

{code:java}
spark-sql> CREATE EXTERNAL TABLE test_orc(test_id INT, test_age INT, test_rank INT) STORED AS ORC LOCATION 'viewfs://nsX/user/hive/warehouse/ultraman_tmp.db/test_orc';
spark-sql> CREATE TABLE source(id INT, age INT, rank INT);
spark-sql> INSERT INTO source VALUES(1,1,1);
spark-sql> INSERT OVERWRITE TABLE test_orc SELECT * FROM source;
{code}

This is related to https://issues.apache.org/jira/browse/HIVE-10790, which was resolved in Hive 2.0.0, while Spark SQL depends on hive-1.2.1-Spark2.
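The NotInMountpointException is easier to read once you note that ViewFileSystem must resolve a concrete path against its mount table to know which backing filesystem's default replication to report; the zero-argument getDefaultReplication() (as called by the older Hive ORC writer) gives it no path to resolve. A toy Java sketch of that constraint, illustrative only and not Hadoop's actual implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MountTableSketch {
    // Toy mount table: path prefix -> default replication of the backing filesystem.
    private static final Map<String, Integer> MOUNTS = new LinkedHashMap<>();
    static {
        MOUNTS.put("/user", 3);
        MOUNTS.put("/tmp", 2);
    }

    // Resolving requires a concrete path: an empty path cannot be matched
    // to any mount point, mirroring "getDefaultReplication on empty path is invalid".
    public static int getDefaultReplication(String path) {
        if (path == null || path.isEmpty()) {
            throw new IllegalArgumentException("getDefaultReplication on empty path is invalid");
        }
        for (Map.Entry<String, Integer> e : MOUNTS.entrySet()) {
            if (path.startsWith(e.getKey())) {
                return e.getValue();
            }
        }
        throw new IllegalArgumentException(path + " is not in the mount table");
    }

    public static void main(String[] args) {
        // Resolves through the /user mount entry.
        System.out.println(getDefaultReplication("/user/hive/warehouse"));
    }
}
```

The Hive-side fix referenced above amounts to passing the file's actual path into the replication lookup instead of calling the path-less variant.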
[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716405#comment-16716405 ]

ASF GitHub Bot commented on SPARK-26311:

AmplabJenkins commented on issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446100267

Merged build finished. Test PASSed.
[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716404#comment-16716404 ]

ASF GitHub Bot commented on SPARK-26311:

SparkQA commented on issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446100212

**[Test build #99955 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99955/testReport)** for PR 23260 at commit [`dbeade7`](https://github.com/apache/spark/commit/dbeade7e41f861c9240c70058796293b239db96c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[jira] [Commented] (SPARK-25212) Support Filter in ConvertToLocalRelation
[ https://issues.apache.org/jira/browse/SPARK-25212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716393#comment-16716393 ]

ASF GitHub Bot commented on SPARK-25212:

AmplabJenkins removed a comment on issue #23273: [SPARK-25212][SQL][FOLLOWUP][DOC] Fix comments of ConvertToLocalRelation rule
URL: https://github.com/apache/spark/pull/23273#issuecomment-446097111

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99944/

> Support Filter in ConvertToLocalRelation
> Key: SPARK-25212
> URL: https://issues.apache.org/jira/browse/SPARK-25212
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Bogdan Raducanu
> Assignee: Bogdan Raducanu
> Priority: Major
> Fix For: 2.4.0
>
> ConvertToLocalRelation can make short queries faster but currently it only supports Project and Limit.
> It can be extended with other operators such as Filter.
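Conceptually, extending ConvertToLocalRelation to Filter means evaluating the filter predicate eagerly over the already-materialized local rows at optimization time, so the optimized plan carries a smaller LocalRelation instead of a Filter node. A plain-Java sketch of that idea (not the actual Catalyst rule):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class LocalRelationFilterSketch {
    /**
     * "Fold" a Filter into a local relation: the rows are already on the
     * driver, so the predicate can be applied eagerly during optimization.
     */
    public static List<Map<String, Object>> applyFilter(
            List<Map<String, Object>> rows,
            Predicate<Map<String, Object>> predicate) {
        return rows.stream().filter(predicate).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Toy local relation with a single "id" column.
        List<Map<String, Object>> rows =
                List.of(Map.of("id", 1), Map.of("id", 2), Map.of("id", 3));
        List<Map<String, Object>> filtered =
                applyFilter(rows, row -> (Integer) row.get("id") > 1);
        System.out.println(filtered.size()); // 2
    }
}
```

The same folding already happens for Project and Limit; the issue proposes applying it to Filter as well so short queries avoid a distributed filter stage entirely.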
[jira] [Commented] (SPARK-25212) Support Filter in ConvertToLocalRelation
[ https://issues.apache.org/jira/browse/SPARK-25212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716387#comment-16716387 ]

ASF GitHub Bot commented on SPARK-25212:

SparkQA removed a comment on issue #23273: [SPARK-25212][SQL][FOLLOWUP][DOC] Fix comments of ConvertToLocalRelation rule
URL: https://github.com/apache/spark/pull/23273#issuecomment-446057878

**[Test build #99944 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99944/testReport)** for PR 23273 at commit [`dfd0f71`](https://github.com/apache/spark/commit/dfd0f71afb8d95253ea4f64d00cea53c306b6e1c).
[jira] [Commented] (SPARK-25212) Support Filter in ConvertToLocalRelation
[ https://issues.apache.org/jira/browse/SPARK-25212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716390#comment-16716390 ]

ASF GitHub Bot commented on SPARK-25212:

AmplabJenkins commented on issue #23273: [SPARK-25212][SQL][FOLLOWUP][DOC] Fix comments of ConvertToLocalRelation rule
URL: https://github.com/apache/spark/pull/23273#issuecomment-446097111

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99944/
[jira] [Commented] (SPARK-25212) Support Filter in ConvertToLocalRelation
[ https://issues.apache.org/jira/browse/SPARK-25212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716389#comment-16716389 ]

ASF GitHub Bot commented on SPARK-25212:

AmplabJenkins commented on issue #23273: [SPARK-25212][SQL][FOLLOWUP][DOC] Fix comments of ConvertToLocalRelation rule
URL: https://github.com/apache/spark/pull/23273#issuecomment-446097107

Merged build finished. Test PASSed.
[jira] [Commented] (SPARK-25212) Support Filter in ConvertToLocalRelation
[ https://issues.apache.org/jira/browse/SPARK-25212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716392#comment-16716392 ]

ASF GitHub Bot commented on SPARK-25212:

AmplabJenkins removed a comment on issue #23273: [SPARK-25212][SQL][FOLLOWUP][DOC] Fix comments of ConvertToLocalRelation rule
URL: https://github.com/apache/spark/pull/23273#issuecomment-446097107

Merged build finished. Test PASSed.
[jira] [Commented] (SPARK-19827) spark.ml R API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716385#comment-16716385 ]

ASF GitHub Bot commented on SPARK-19827:

felixcheung commented on a change in pull request #23072: [SPARK-19827][R] spark.ml R API for PIC
URL: https://github.com/apache/spark/pull/23072#discussion_r240493499

## File path: R/pkg/R/mllib_clustering.R ##

@@ -610,3 +616,59 @@ setMethod("write.ml", signature(object = "LDAModel", path = "character"),
           function(object, path, overwrite = FALSE) {
             write_internal(object, path, overwrite)
           })
+
+#' PowerIterationClustering
+#'
+#' A scalable graph clustering algorithm. Users can call \code{spark.assignClusters} to
+#' return a cluster assignment for each input vertex.
+#'
+# Run the PIC algorithm and returns a cluster assignment for each input vertex.
+#' @param data a SparkDataFrame.
+#' @param k the number of clusters to create.
+#' @param initMode the initialization algorithm.
+#' @param maxIter the maximum number of iterations.
+#' @param sourceCol the name of the input column for source vertex IDs.
+#' @param destinationCol the name of the input column for destination vertex IDs
+#' @param weightCol weight column name. If this is not set or \code{NULL},
+#'                  we treat all instance weights as 1.0.
+#' @param ... additional argument(s) passed to the method.
+#' @return A dataset that contains columns of vertex id and the corresponding cluster for the id.
+#'         The schema of it will be:
+#'         \code{id: Long}
+#'         \code{cluster: Int}
+#' @rdname spark.powerIterationClustering
+#' @aliases assignClusters,PowerIterationClustering-method,SparkDataFrame-method
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+#'                            list(1L, 2L, 1.0), list(3L, 4L, 1.0),
+#'                            list(4L, 0L, 0.1)),
+#'                       schema = c("src", "dst", "weight"))
+#' clusters <- spark.assignClusters(df, initMode="degree", weightCol="weight")
+#' showDF(clusters)
+#' }
+#' @note spark.assignClusters(SparkDataFrame) since 3.0.0
+setMethod("spark.assignClusters",
+          signature(data = "SparkDataFrame"),
+          function(data, k = 2L, initMode = c("random", "degree"), maxIter = 20L,
+                   sourceCol = "src", destinationCol = "dst", weightCol = NULL) {
+            if (!is.numeric(k) || k < 1) {
+              stop("k should be a number with value >= 1.")
+            }
+            if (!is.integer(maxIter) || maxIter <= 0) {

Review comment: if maxIter should be an integer, should we also check that k is an integer? It's fixed when it is passed, so this is just a minor consistency point on the value checks.

> spark.ml R API for PIC
> ----------------------
>
>                 Key: SPARK-19827
>                 URL: https://issues.apache.org/jira/browse/SPARK-19827
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, SparkR
>    Affects Versions: 2.1.0
>            Reporter: Felix Cheung
>            Assignee: Huaxin Gao
>            Priority: Major
>             Fix For: 3.0.0
[jira] [Commented] (SPARK-19827) spark.ml R API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716382#comment-16716382 ]

ASF GitHub Bot commented on SPARK-19827:

felixcheung commented on a change in pull request #23072: [SPARK-19827][R] spark.ml R API for PIC
URL: https://github.com/apache/spark/pull/23072#discussion_r240492789

## File path: R/pkg/R/mllib_clustering.R ##

+#' @rdname spark.powerIterationClustering
+#' @aliases assignClusters,PowerIterationClustering-method,SparkDataFrame-method

Review comment: wait, this aliases doesn't make sense. Could you test whether `?assignClusters` works in an R shell? This should be `@aliases spark.assignClusters,SparkDataFrame-method`.
[jira] [Commented] (SPARK-19827) spark.ml R API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716381#comment-16716381 ]

ASF GitHub Bot commented on SPARK-19827:

felixcheung commented on a change in pull request #23072: [SPARK-19827][R] spark.ml R API for PIC
URL: https://github.com/apache/spark/pull/23072#discussion_r240491948

## File path: R/pkg/R/mllib_clustering.R ##

+#' A scalable graph clustering algorithm. Users can call \code{spark.assignClusters} to
+#' return a cluster assignment for each input vertex.
+#'

Review comment: remove the empty line - empty lines are significant in roxygen2.
[jira] [Commented] (SPARK-19827) spark.ml R API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716386#comment-16716386 ]

ASF GitHub Bot commented on SPARK-19827:

felixcheung commented on a change in pull request #23072: [SPARK-19827][R] spark.ml R API for PIC
URL: https://github.com/apache/spark/pull/23072#discussion_r240492482

## File path: R/pkg/R/mllib_clustering.R ##

+#' @return A dataset that contains columns of vertex id and the corresponding cluster for the id.
+#'         The schema of it will be:
+#'         \code{id: Long}
+#'         \code{cluster: Int}

Review comment: mm, this won't format correctly - roxygen strips all the whitespace. Also, Long and Int are not proper types in R.
[jira] [Commented] (SPARK-25212) Support Filter in ConvertToLocalRelation
[ https://issues.apache.org/jira/browse/SPARK-25212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716380#comment-16716380 ]

ASF GitHub Bot commented on SPARK-25212:

SparkQA commented on issue #23273: [SPARK-25212][SQL][FOLLOWUP][DOC] Fix comments of ConvertToLocalRelation rule
URL: https://github.com/apache/spark/pull/23273#issuecomment-446096788

**[Test build #99944 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99944/testReport)** for PR 23273 at commit [`dfd0f71`](https://github.com/apache/spark/commit/dfd0f71afb8d95253ea4f64d00cea53c306b6e1c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[jira] [Commented] (SPARK-19827) spark.ml R API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716383#comment-16716383 ]

ASF GitHub Bot commented on SPARK-19827:

felixcheung commented on a change in pull request #23072: [SPARK-19827][R] spark.ml R API for PIC
URL: https://github.com/apache/spark/pull/23072#discussion_r240492041

## File path: R/pkg/R/mllib_clustering.R ##

+#' @param data a SparkDataFrame.
+#' @param k the number of clusters to create.
+#' @param initMode the initialization algorithm.

Review comment: add `One of "random", "degree"`?
[jira] [Commented] (SPARK-19827) spark.ml R API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716384#comment-16716384 ]

ASF GitHub Bot commented on SPARK-19827:

felixcheung commented on a change in pull request #23072: [SPARK-19827][R] spark.ml R API for PIC
URL: https://github.com/apache/spark/pull/23072#discussion_r240492887

## File path: R/pkg/R/mllib_clustering.R ##

+#' clusters <- spark.assignClusters(df, initMode="degree", weightCol="weight")

Review comment: put spaces around `=`, as a matter of style.
[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr
[ https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716379#comment-16716379 ]

ASF GitHub Bot commented on SPARK-26311:

SparkQA commented on issue #23260: [SPARK-26311][YARN] New feature: custom log URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446096365

**[Test build #99955 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99955/testReport)** for PR 23260 at commit [`dbeade7`](https://github.com/apache/spark/commit/dbeade7e41f861c9240c70058796293b239db96c).

> [YARN] New feature: custom log URL for stdout/stderr
> ----------------------------------------------------
>
>                 Key: SPARK-26311
>                 URL: https://issues.apache.org/jira/browse/SPARK-26311
>             Project: Spark
>          Issue Type: Improvement
>          Components: YARN
>    Affects Versions: 2.4.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>
> Spark has been setting static log URLs for YARN applications, which point to
> the NodeManager webapp. Normally this works for both running and finished apps,
> but there are also other approaches to maintaining application logs, such as an
> external log service, which makes it possible to keep the application log URL
> from becoming a dead link when the NodeManager is not accessible (node
> decommissioned, elastic nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, which
> end users can set to point application logs to an external log service.
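The SPARK-26311 proposal above is essentially placeholder substitution: end users configure a URL pattern pointing at their external log service, and Spark fills in per-container attributes. A minimal Python sketch of that mechanism (the placeholder names and the service URL here are hypothetical illustrations, not the ones the patch defines):

```python
from string import Template

# Hypothetical placeholders (appId, containerId, fileName); the actual
# placeholder set is defined by the SPARK-26311 patch, not shown here.
def build_log_url(pattern: str, **attrs: str) -> str:
    """Fill a user-configured log URL pattern with container attributes."""
    return Template(pattern).substitute(**attrs)

url = build_log_url(
    "https://logservice.example.com/${appId}/${containerId}/${fileName}",
    appId="application_1544", containerId="container_01", fileName="stderr",
)
```

Because the pattern is user-supplied configuration, the link survives even when the NodeManager that produced the log is gone.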
[jira] [Commented] (SPARK-26318) Enhance function merge performance in Row
[ https://issues.apache.org/jira/browse/SPARK-26318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716363#comment-16716363 ]

ASF GitHub Bot commented on SPARK-26318:

KyleLi1985 commented on a change in pull request #23271: [SPARK-26318][SQL] Enhance function merge performance in Row
URL: https://github.com/apache/spark/pull/23271#discussion_r240491652

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala ##

@@ -58,8 +58,21 @@ object Row {
    * Merge multiple rows into a single row, one after another.
    */
   def merge(rows: Row*): Row = {
-    // TODO: Improve the performance of this if used in performance critical part.
-    new GenericRow(rows.flatMap(_.toSeq).toArray)
+    val size = rows.size
+    var number = 0
+    for (i <- 0 until size) {
+      number = number + rows(i).size
+    }
+    val container = Array.ofDim[Any](number)
+    var n = 0
+    for (i <- 0 until size) {
+      val subSize = rows(i).size
+      for (j <- 0 until subSize) {
+        container(n) = rows(i)(j)
+        n = n + 1
+      }
+    }
+    new GenericRow(container)

Review comment: Definitely, it is important.

> Enhance function merge performance in Row
> -----------------------------------------
>
>                 Key: SPARK-26318
>                 URL: https://issues.apache.org/jira/browse/SPARK-26318
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Liang Li
>            Priority: Minor
>
> Enhance function merge performance in Row.
> For example, doing Row.merge for the input
> val row1 = Row("name", "work", 2314, "null", 1, "") needs 108458 milliseconds;
> after the enhancement, it only needs 24967 milliseconds.
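The patch in the hunk above replaces `rows.flatMap(_.toSeq).toArray` with a two-pass approach: first sum the row sizes, then copy values by index into a pre-allocated array. The same technique, sketched in Python rather than the patch's Scala (the function name is illustrative):

```python
def merge_rows(*rows):
    """Merge several row-like sequences into one, pre-allocating the result."""
    # First pass: compute the total length once.
    total = sum(len(r) for r in rows)
    # Second pass: fill a pre-sized container instead of building
    # intermediate per-row sequences and flattening them.
    container = [None] * total
    n = 0
    for row in rows:
        for value in row:
            container[n] = value
            n += 1
    return container

merged = merge_rows(("name", "work", 2314), ("null", 1, ""))
```

The win comes from avoiding the intermediate sequences and repeated re-sizing that a naive flatten-and-collect performs.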
[jira] [Commented] (SPARK-26098) Show associated SQL query in Job page
[ https://issues.apache.org/jira/browse/SPARK-26098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716369#comment-16716369 ]

ASF GitHub Bot commented on SPARK-26098:

AmplabJenkins removed a comment on issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job page
URL: https://github.com/apache/spark/pull/23068#issuecomment-446094075

Merged build finished. Test PASSed.

> Show associated SQL query in Job page
> -------------------------------------
>
>                 Key: SPARK-26098
>                 URL: https://issues.apache.org/jira/browse/SPARK-26098
>             Project: Spark
>          Issue Type: Improvement
>          Components: Web UI
>    Affects Versions: 3.0.0
>            Reporter: Gengliang Wang
>            Priority: Major
>
> For jobs associated with SQL queries, it would be easier to understand the
> context by showing the SQL query in the Job detail page.
[jira] [Commented] (SPARK-26098) Show associated SQL query in Job page
[ https://issues.apache.org/jira/browse/SPARK-26098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716370#comment-16716370 ]

ASF GitHub Bot commented on SPARK-26098:

AmplabJenkins removed a comment on issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job page
URL: https://github.com/apache/spark/pull/23068#issuecomment-446094078

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5958/
[jira] [Commented] (SPARK-26098) Show associated SQL query in Job page
[ https://issues.apache.org/jira/browse/SPARK-26098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716368#comment-16716368 ]

ASF GitHub Bot commented on SPARK-26098:

SparkQA commented on issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job page
URL: https://github.com/apache/spark/pull/23068#issuecomment-446094215

**[Test build #99954 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99954/testReport)** for PR 23068 at commit [`0a63604`](https://github.com/apache/spark/commit/0a636049ecc721cdd31cd676fce79aeb6582dd7c).
[jira] [Commented] (SPARK-26098) Show associated SQL query in Job page
[ https://issues.apache.org/jira/browse/SPARK-26098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716366#comment-16716366 ]

ASF GitHub Bot commented on SPARK-26098:

AmplabJenkins commented on issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job page
URL: https://github.com/apache/spark/pull/23068#issuecomment-446094078

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5958/
[jira] [Commented] (SPARK-26098) Show associated SQL query in Job page
[ https://issues.apache.org/jira/browse/SPARK-26098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716365#comment-16716365 ]

ASF GitHub Bot commented on SPARK-26098:

AmplabJenkins commented on issue #23068: [SPARK-26098][WebUI] Show associated SQL query in Job page
URL: https://github.com/apache/spark/pull/23068#issuecomment-446094075

Merged build finished. Test PASSed.
[jira] [Commented] (SPARK-26318) Enhance function merge performance in Row
[ https://issues.apache.org/jira/browse/SPARK-26318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716364#comment-16716364 ]

ASF GitHub Bot commented on SPARK-26318:

KyleLi1985 commented on a change in pull request #23271: [SPARK-26318][SQL] Enhance function merge performance in Row
URL: https://github.com/apache/spark/pull/23271#discussion_r240491672

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala ##

+    val container = Array.ofDim[Any](number)
+    var n = 0
+    for (i <- 0 until size) {

Review comment: Using only the primitive size, subSize, and number values to drive the container fill improves the performance further: calling Row.merge(row1) once needs 18064 milliseconds, while calling Row.merge(rows: _*) once needs 25651 milliseconds.
[jira] [Commented] (SPARK-26303) Return partial results for bad JSON records
[ https://issues.apache.org/jira/browse/SPARK-26303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716355#comment-16716355 ]

ASF GitHub Bot commented on SPARK-26303:

HyukjinKwon commented on a change in pull request #23253: [SPARK-26303][SQL] Return partial results for bad JSON records
URL: https://github.com/apache/spark/pull/23253#discussion_r240489089

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala ##

@@ -20,6 +20,16 @@ package org.apache.spark.sql.catalyst.util
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Exception thrown when the underlying parser returns a partial result of parsing.
+ * @param partialResult the partial result of parsing a bad record.
+ * @param cause the actual exception about why the parser cannot return full result.
+ */
+case class PartialResultException(

Review comment: I mean, we don't have to standardise the name, but let's use another name that doesn't conflict with Java's libraries.

> Return partial results for bad JSON records
> -------------------------------------------
>
>                 Key: SPARK-26303
>                 URL: https://issues.apache.org/jira/browse/SPARK-26303
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Maxim Gekk
>            Priority: Minor
>
> Currently, JSON datasource and JSON functions return a row with all nulls for a
> malformed JSON string in the PERMISSIVE mode when the specified schema has the
> struct type. All nulls are returned even if some of the fields were parsed and
> converted to desired types successfully. The ticket aims to solve the problem
> by returning the already parsed fields. The corrupted column specified via the
> JSON option `columnNameOfCorruptRecord` or SQL config should contain the whole
> original JSON string.
> For example, if the input has one JSON string:
> {code:json}
> {"a":0.1,"b":{},"c":"def"}
> {code}
> and the specified schema is:
> {code:sql}
> a DOUBLE, b ARRAY, c STRING, _corrupt_record STRING
> {code}
> the expected output of `from_json` in the PERMISSIVE mode is:
> {code}
> +---+----+---+--------------------------+
> |a  |b   |c  |_corrupt_record           |
> +---+----+---+--------------------------+
> |0.1|null|def|{"a":0.1,"b":{},"c":"def"}|
> +---+----+---+--------------------------+
> {code}
[jira] [Commented] (SPARK-26303) Return partial results for bad JSON records
[ https://issues.apache.org/jira/browse/SPARK-26303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716353#comment-16716353 ] ASF GitHub Bot commented on SPARK-26303: HyukjinKwon commented on a change in pull request #23253: [SPARK-26303][SQL] Return partial results for bad JSON records URL: https://github.com/apache/spark/pull/23253#discussion_r240488920 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala ## @@ -20,6 +20,16 @@ package org.apache.spark.sql.catalyst.util import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.unsafe.types.UTF8String +/** + * Exception thrown when the underlying parser returns a partial result of parsing. + * @param partialResult the partial result of parsing a bad record. + * @param cause the actual exception about why the parser cannot return full result. + */ +case class PartialResultException( Review comment: Wait .. but let's just rename it if possible .. the cost of renaming is 0 but there are some benefits by that .. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Return partial results for bad JSON records > --- > > Key: SPARK-26303 > URL: https://issues.apache.org/jira/browse/SPARK-26303 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > Currently, JSON datasource and JSON functions return row with all null for a > malformed JSON string in the PERMISSIVE mode when specified schema has the > struct type. All nulls are returned even some of fields were parsed and > converted to desired types successfully. The ticket aims to solve the problem > by returning already parsed fields. 
The corrupted column specified via JSON option `columnNameOfCorruptRecord` or SQL config should contain the whole original JSON string.
> For example, if the input has one JSON string:
> {code:json}
> {"a":0.1,"b":{},"c":"def"}
> {code}
> and the specified schema is:
> {code:sql}
> a DOUBLE, b ARRAY, c STRING, _corrupt_record STRING
> {code}
> expected output of `from_json` in the PERMISSIVE mode:
> {code}
> +---+----+---+--------------------------+
> |a  |b   |c  |_corrupt_record           |
> +---+----+---+--------------------------+
> |0.1|null|def|{"a":0.1,"b":{},"c":"def"}|
> +---+----+---+--------------------------+
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partially revert SPARK-21052: Add hash map metrics to join
[ https://issues.apache.org/jira/browse/SPARK-26316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716345#comment-16716345 ] ASF GitHub Bot commented on SPARK-26316: SparkQA removed a comment on issue #23269: [SPARK-26316] Revert hash join metrics in spark 21052 that causes performance degradation URL: https://github.com/apache/spark/pull/23269#issuecomment-446057021 **[Test build #99943 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99943/testReport)** for PR 23269 at commit [`8de1bcc`](https://github.com/apache/spark/commit/8de1bcca55a8b0b1448841871c47abee8101d917). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Because of the perf degradation in TPC-DS, we currently partial revert > SPARK-21052:Add hash map metrics to join, > > > Key: SPARK-26316 > URL: https://issues.apache.org/jira/browse/SPARK-26316 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Ke Jia >Priority: Major > > The code of > [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486] > and > [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487] > in SPARK-21052 cause performance degradation in spark2.3. The result of > all queries in TPC-DS with 1TB is in [TPC-DS > result|https://docs.google.com/spreadsheets/d/18a5BdOlmm8euTaRodyeWum9yu92mbWWu6JbhGXtr7yE/edit#gid=0] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14948) Exception when joining DataFrames derived from the same DataFrame
[ https://issues.apache.org/jira/browse/SPARK-14948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716350#comment-16716350 ] Michael Chirico commented on SPARK-14948:
- This issue comes up a _lot_ in non-trivial ETLs. I have one script right now where the same problem comes up three separate times! The workaround is quite cumbersome and unintuitive, and it makes the scripts substantially harder to read...
> Exception when joining DataFrames derived from the same DataFrame
> -
>
> Key: SPARK-14948
> URL: https://issues.apache.org/jira/browse/SPARK-14948
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0
> Reporter: Saurabh Santhosh
> Priority: Major
>
> h2. Spark Analyser is throwing the following exception in a specific scenario:
> h2. Exception :
> org.apache.spark.sql.AnalysisException: resolved attribute(s) F1#3 missing from asd#5,F2#4,F1#6,F2#7 in operator !Project [asd#5,F1#3];
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> h2.
Code :
> {code:title=SparkClient.java|borderStyle=solid}
> StructField[] fields = new StructField[2];
> fields[0] = new StructField("F1", DataTypes.StringType, true, Metadata.empty());
> fields[1] = new StructField("F2", DataTypes.StringType, true, Metadata.empty());
> JavaRDD<Row> rdd = sparkClient.getJavaSparkContext().parallelize(Arrays.asList(RowFactory.create("a", "b")));
> DataFrame df = sparkClient.getSparkHiveContext().createDataFrame(rdd, new StructType(fields));
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t1");
> DataFrame aliasedDf = sparkClient.getSparkHiveContext().sql("select F1 as asd, F2 from t1");
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(aliasedDf, "t2");
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t3");
>
> DataFrame join = aliasedDf.join(df, aliasedDf.col("F2").equalTo(df.col("F2")), "inner");
> DataFrame select = join.select(aliasedDf.col("asd"), df.col("F1"));
> select.collect();
> {code}
> h2. Observations :
> * This issue is related to the data type of the fields of the initial DataFrame. (If the data type is not String, it works.)
> * It works fine if the data frame is registered as a temporary table and an SQL query (select a.asd,b.F1 from t2 a inner join t3 b on a.F2=b.F2) is written.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
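The `resolved attribute(s) F1#3 missing` error quoted above comes from Catalyst resolving columns by internal expression IDs rather than by name: when both join sides derive from the same DataFrame, the analyzer rewrites the right side's attribute IDs to disambiguate the self-join, so a `Column` captured from the original DataFrame (still pointing at `F1#3`) no longer matches anything in the join output. A toy sketch of that mechanism (plain Python, not Catalyst; `attr`, `self_join_output`, and `resolve` are illustrative stand-ins):

```python
import itertools

_ids = itertools.count(3)  # unique expression ids; starting at 3 to mirror the error above

def attr(name):
    # An attribute reference: Catalyst identifies it by exprId, not by name.
    return (name, next(_ids))

def self_join_output(left, right):
    # When both sides contain the same attributes, the analyzer rewrites the
    # whole right side with fresh exprIds to disambiguate the self-join.
    left_ids = {i for _, i in left}
    if any(i in left_ids for _, i in right):
        right = [attr(n) for n, _ in right]
    return left + right

def resolve(output, column):
    # Resolve a captured Column against a plan's output, by exprId.
    name, expr_id = column
    if all(i != expr_id for _, i in output):
        raise LookupError("resolved attribute(s) %s#%d missing from %s"
                          % (name, expr_id, ",".join("%s#%d" % a for a in output)))
    return column

df = [attr("F1"), attr("F2")]           # F1#3, F2#4
aliased = [attr("asd"), df[1]]          # select F1 as asd, F2 from t1 -> asd#5, F2#4
output = self_join_output(aliased, df)  # df's F2#4 collides -> right becomes F1#6, F2#7
captured = df[0]                        # df.col("F1") still holds F1#3
try:
    resolve(output, captured)
except LookupError as e:
    print(e)  # same shape as the AnalysisException in the report
```

This also explains the second observation: going through temporary tables and an SQL string makes the analyzer resolve `a.asd` and `b.F1` by name against the de-duplicated plan, so the stale exprId never enters the picture.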
[jira] [Commented] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partially revert SPARK-21052: Add hash map metrics to join
[ https://issues.apache.org/jira/browse/SPARK-26316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716349#comment-16716349 ] ASF GitHub Bot commented on SPARK-26316: AmplabJenkins removed a comment on issue #23269: [SPARK-26316] Revert hash join metrics in spark 21052 that causes performance degradation URL: https://github.com/apache/spark/pull/23269#issuecomment-446090350 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99943/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Because of the perf degradation in TPC-DS, we currently partial revert > SPARK-21052:Add hash map metrics to join, > > > Key: SPARK-26316 > URL: https://issues.apache.org/jira/browse/SPARK-26316 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Ke Jia >Priority: Major > > The code of > [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486] > and > [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487] > in SPARK-21052 cause performance degradation in spark2.3. The result of > all queries in TPC-DS with 1TB is in [TPC-DS > result|https://docs.google.com/spreadsheets/d/18a5BdOlmm8euTaRodyeWum9yu92mbWWu6JbhGXtr7yE/edit#gid=0] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partially revert SPARK-21052: Add hash map metrics to join
[ https://issues.apache.org/jira/browse/SPARK-26316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716348#comment-16716348 ] ASF GitHub Bot commented on SPARK-26316: AmplabJenkins removed a comment on issue #23269: [SPARK-26316] Revert hash join metrics in spark 21052 that causes performance degradation URL: https://github.com/apache/spark/pull/23269#issuecomment-446090346 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Because of the perf degradation in TPC-DS, we currently partial revert > SPARK-21052:Add hash map metrics to join, > > > Key: SPARK-26316 > URL: https://issues.apache.org/jira/browse/SPARK-26316 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Ke Jia >Priority: Major > > The code of > [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486] > and > [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487] > in SPARK-21052 cause performance degradation in spark2.3. The result of > all queries in TPC-DS with 1TB is in [TPC-DS > result|https://docs.google.com/spreadsheets/d/18a5BdOlmm8euTaRodyeWum9yu92mbWWu6JbhGXtr7yE/edit#gid=0] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partially revert SPARK-21052: Add hash map metrics to join
[ https://issues.apache.org/jira/browse/SPARK-26316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716347#comment-16716347 ] ASF GitHub Bot commented on SPARK-26316: AmplabJenkins commented on issue #23269: [SPARK-26316] Revert hash join metrics in spark 21052 that causes performance degradation URL: https://github.com/apache/spark/pull/23269#issuecomment-446090350 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99943/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Because of the perf degradation in TPC-DS, we currently partial revert > SPARK-21052:Add hash map metrics to join, > > > Key: SPARK-26316 > URL: https://issues.apache.org/jira/browse/SPARK-26316 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Ke Jia >Priority: Major > > The code of > [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486] > and > [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487] > in SPARK-21052 cause performance degradation in spark2.3. The result of > all queries in TPC-DS with 1TB is in [TPC-DS > result|https://docs.google.com/spreadsheets/d/18a5BdOlmm8euTaRodyeWum9yu92mbWWu6JbhGXtr7yE/edit#gid=0] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partially revert SPARK-21052: Add hash map metrics to join
[ https://issues.apache.org/jira/browse/SPARK-26316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716346#comment-16716346 ] ASF GitHub Bot commented on SPARK-26316: AmplabJenkins commented on issue #23269: [SPARK-26316] Revert hash join metrics in spark 21052 that causes performance degradation URL: https://github.com/apache/spark/pull/23269#issuecomment-446090346 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Because of the perf degradation in TPC-DS, we currently partial revert > SPARK-21052:Add hash map metrics to join, > > > Key: SPARK-26316 > URL: https://issues.apache.org/jira/browse/SPARK-26316 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Ke Jia >Priority: Major > > The code of > [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486] > and > [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487] > in SPARK-21052 cause performance degradation in spark2.3. The result of > all queries in TPC-DS with 1TB is in [TPC-DS > result|https://docs.google.com/spreadsheets/d/18a5BdOlmm8euTaRodyeWum9yu92mbWWu6JbhGXtr7yE/edit#gid=0] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partially revert SPARK-21052: Add hash map metrics to join
[ https://issues.apache.org/jira/browse/SPARK-26316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716344#comment-16716344 ] ASF GitHub Bot commented on SPARK-26316: SparkQA commented on issue #23269: [SPARK-26316] Revert hash join metrics in spark 21052 that causes performance degradation URL: https://github.com/apache/spark/pull/23269#issuecomment-44608 **[Test build #99943 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99943/testReport)** for PR 23269 at commit [`8de1bcc`](https://github.com/apache/spark/commit/8de1bcca55a8b0b1448841871c47abee8101d917). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Because of the perf degradation in TPC-DS, we currently partial revert > SPARK-21052:Add hash map metrics to join, > > > Key: SPARK-26316 > URL: https://issues.apache.org/jira/browse/SPARK-26316 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Ke Jia >Priority: Major > > The code of > [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486] > and > [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487] > in SPARK-21052 cause performance degradation in spark2.3. 
The result of > all queries in TPC-DS with 1TB is in [TPC-DS > result|https://docs.google.com/spreadsheets/d/18a5BdOlmm8euTaRodyeWum9yu92mbWWu6JbhGXtr7yE/edit#gid=0] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
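The two cited lines update hash-map metric counters on every lookup, i.e. inside the innermost loop of the join, which is where the TPC-DS regression comes from. A schematic sketch of that trade-off (plain Python, not Spark's `HashedRelation`; `JoinMetrics` and the lookup helpers are illustrative):

```python
class JoinMetrics:
    # Stand-in for the hash-map metrics added by SPARK-21052.
    def __init__(self):
        self.key_lookups = 0
        self.probes = 0

def lookup_with_metrics(table, key, metrics):
    # Pre-revert style: bookkeeping on every probe, i.e. inside the join's
    # innermost loop -- cheap per call, but paid once per joined row.
    metrics.key_lookups += 1
    metrics.probes += 1  # the real code also counts extra collision probes
    return table.get(key)

def lookup_plain(table, key):
    # Post-revert style: no per-lookup accounting on the hot path.
    return table.get(key)

table = {i: i * i for i in range(1000)}
m = JoinMetrics()
with_metrics = [lookup_with_metrics(table, k, m) for k in range(1000)]
plain = [lookup_plain(table, k) for k in range(1000)]
assert with_metrics == plain  # the counters never change results, only cost
```

The revert is safe precisely because the counters are observational: both lookup styles return identical rows, so only the per-row overhead disappears.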
[jira] [Commented] (SPARK-26262) Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE
[ https://issues.apache.org/jira/browse/SPARK-26262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716343#comment-16716343 ] ASF GitHub Bot commented on SPARK-26262: cloud-fan commented on issue #23213: [SPARK-26262][SQL] Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE URL: https://github.com/apache/spark/pull/23213#issuecomment-446089670
when wholeStageCodegen is on, there is no way to avoid codegen, so codegenFactoryMode doesn't make a difference.
This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
> Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE
>
> Key: SPARK-26262
> URL: https://issues.apache.org/jira/browse/SPARK-26262
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Takeshi Yamamuro
> Priority: Minor
>
> For better test coverage, we need to run `SQLQueryTestSuite` on 4 mixed config sets:
> 1. WHOLESTAGE_CODEGEN_ENABLED=true, CODEGEN_FACTORY_MODE=CODEGEN_ONLY
> 2. WHOLESTAGE_CODEGEN_ENABLED=false, CODEGEN_FACTORY_MODE=CODEGEN_ONLY
> 3. WHOLESTAGE_CODEGEN_ENABLED=true, CODEGEN_FACTORY_MODE=NO_CODEGEN
> 4. WHOLESTAGE_CODEGEN_ENABLED=false, CODEGEN_FACTORY_MODE=NO_CODEGEN
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
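The four config sets are just the cross product of the two dimensions, and the review comment points out that one cell of the matrix is redundant. A sketch (plain Python; the `spark.sql.codegen.*` key strings here are assumptions about the conf names, not taken from the ticket):

```python
from itertools import product

WHOLESTAGE = ["true", "false"]
FACTORY_MODE = ["CODEGEN_ONLY", "NO_CODEGEN"]

# The 4 mixed config sets listed in the ticket, as a cross product.
config_sets = [
    {"spark.sql.codegen.wholeStage": ws, "spark.sql.codegen.factoryMode": fm}
    for ws, fm in product(WHOLESTAGE, FACTORY_MODE)
]
assert len(config_sets) == 4

def effective(cfg):
    # cloud-fan's point: with whole-stage codegen on, codegen cannot be
    # avoided, so the factory mode makes no observable difference there.
    if cfg["spark.sql.codegen.wholeStage"] == "true":
        return ("wholestage",)
    return ("no-wholestage", cfg["spark.sql.codegen.factoryMode"])

distinct = {effective(c) for c in config_sets}
print(sorted(distinct))  # only 3 distinct behaviours out of the 4 config sets
```

So the suite still cycles through all four sets for coverage, but only three of them can exercise distinct code paths.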
[jira] [Commented] (SPARK-24102) RegressionEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716247#comment-16716247 ] ASF GitHub Bot commented on SPARK-24102: SparkQA commented on issue #17085: [SPARK-24102][ML][MLLIB] ML Evaluators should use weight column - added weight column for regression evaluator URL: https://github.com/apache/spark/pull/17085#issuecomment-446079811 **[Test build #99948 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99948/testReport)** for PR 17085 at commit [`0480721`](https://github.com/apache/spark/commit/04807214d8694dcff7a2fe042457934e67eb8d57). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > RegressionEvaluator should use sample weight data > - > > Key: SPARK-24102 > URL: https://issues.apache.org/jira/browse/SPARK-24102 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Ilya Matiach >Priority: Major > Labels: starter > > The LogisticRegression and LinearRegression models support training with a > weight column, but the corresponding evaluators do not support computing > metrics using those weights. This breaks model selection using CrossValidator. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
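The metric change the ticket asks for is small but load-bearing for CrossValidator: each squared error is scaled by the row's weight and the mean is taken over the total weight, not the row count. A minimal sketch of the weighted RMSE (plain Python, not Spark ML; the row layout is illustrative):

```python
import math

def weighted_rmse(rows):
    """rows: iterable of (label, prediction, weight) triples."""
    num = sum(w * (y - p) ** 2 for y, p, w in rows)
    den = sum(w for _, _, w in rows)
    return math.sqrt(num / den)

rows = [(3.0, 2.0, 1.0), (5.0, 5.0, 3.0)]
# With all weights equal this reduces to the ordinary RMSE; the weighted form
# is what CrossValidator needs so model selection honours the weight column.
print(weighted_rmse(rows))
```

Ignoring the weights here would report RMSE ≈ 0.707 instead of 0.5, which is exactly how an evaluator that drops the weight column skews model selection.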
[jira] [Resolved] (SPARK-26293) Cast exception when having python udf in subquery
[ https://issues.apache.org/jira/browse/SPARK-26293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26293. - Resolution: Fixed Fix Version/s: 2.4.1 3.0.0 Issue resolved by pull request 23248 [https://github.com/apache/spark/pull/23248] > Cast exception when having python udf in subquery > - > > Key: SPARK-26293 > URL: https://issues.apache.org/jira/browse/SPARK-26293 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0, 2.4.1 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26262) Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE
[ https://issues.apache.org/jira/browse/SPARK-26262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716337#comment-16716337 ] ASF GitHub Bot commented on SPARK-26262: HyukjinKwon commented on issue #23213: [SPARK-26262][SQL] Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE URL: https://github.com/apache/spark/pull/23213#issuecomment-446088412 Ah, I had the same question as https://github.com/apache/spark/pull/23213#issuecomment-444824164. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and > CODEGEN_FACTORY_MODE > > > Key: SPARK-26262 > URL: https://issues.apache.org/jira/browse/SPARK-26262 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > For better test coverage, we need to run `SQLQueryTestSuite` on 4 mixed > config sets: > 1. WHOLESTAGE_CODEGEN_ENABLED=true, CODEGEN_FACTORY_MODE=CODEGEN_ONLY > 2. WHOLESTAGE_CODEGEN_ENABLED=false, CODEGEN_FACTORY_MODE=CODEGEN_ONLY > 3. WHOLESTAGE_CODEGEN_ENABLED=true, CODEGEN_FACTORY_MODE=NO_CODEGEN > 4. WHOLESTAGE_CODEGEN_ENABLED=false, CODEGEN_FACTORY_MODE=NO_CODEGEN -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26262) Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE
[ https://issues.apache.org/jira/browse/SPARK-26262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716338#comment-16716338 ] ASF GitHub Bot commented on SPARK-26262: HyukjinKwon edited a comment on issue #23213: [SPARK-26262][SQL] Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE URL: https://github.com/apache/spark/pull/23213#issuecomment-446088412 Ah, I had the same question as https://github.com/apache/spark/pull/23213#issuecomment-444824164. It would be good to update PR description :-). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and > CODEGEN_FACTORY_MODE > > > Key: SPARK-26262 > URL: https://issues.apache.org/jira/browse/SPARK-26262 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > For better test coverage, we need to run `SQLQueryTestSuite` on 4 mixed > config sets: > 1. WHOLESTAGE_CODEGEN_ENABLED=true, CODEGEN_FACTORY_MODE=CODEGEN_ONLY > 2. WHOLESTAGE_CODEGEN_ENABLED=false, CODEGEN_FACTORY_MODE=CODEGEN_ONLY > 3. WHOLESTAGE_CODEGEN_ENABLED=true, CODEGEN_FACTORY_MODE=NO_CODEGEN > 4. WHOLESTAGE_CODEGEN_ENABLED=false, CODEGEN_FACTORY_MODE=NO_CODEGEN -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run
[ https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716322#comment-16716322 ] ASF GitHub Bot commented on SPARK-25272: SparkQA commented on issue #22273: [SPARK-25272][PYTHON][TEST] Add test to better indicate pyarrow is installed and related tests will run URL: https://github.com/apache/spark/pull/22273#issuecomment-446086071 **[Test build #99950 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99950/testReport)** for PR 22273 at commit [`8574291`](https://github.com/apache/spark/commit/8574291a0b84574626ca213bc6f95dc0db73b0ef). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class HaveArrowTests(unittest.TestCase):` This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Show some kind of test output to indicate pyarrow tests were run > > > Key: SPARK-25272 > URL: https://issues.apache.org/jira/browse/SPARK-25272 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Tests >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > > Right now tests only output status when they are skipped and there is no way > to really see from the logs that pyarrow tests, like ArrowTests, have been > run except by the absence of a skipped message. We can add a test that is > skipped if pyarrow is installed, which will give an output in our Jenkins > test. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
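The trick in the ticket is an inverted skip: a test that is skipped precisely when pyarrow *is* installed, so a "skipped" line in the Jenkins log positively proves pyarrow was present and the Arrow tests ran. A hedged sketch of the pattern (plain `unittest`; the class name mirrors the `HaveArrowTests` mentioned in the build output, but the body is illustrative, not Spark's actual test):

```python
import importlib.util
import unittest

# Hypothetical stand-in for the availability check Spark's tests perform.
have_pyarrow = importlib.util.find_spec("pyarrow") is not None

class HaveArrowTests(unittest.TestCase):
    # Inverted skip: skipped exactly when pyarrow IS installed, so the skip
    # message in the log documents that the real Arrow tests were run.
    @unittest.skipIf(have_pyarrow, "pyarrow is installed; Arrow-related tests will run")
    def test_pyarrow_not_installed(self):
        # Reaching here means pyarrow is absent and ArrowTests were skipped.
        self.assertFalse(have_pyarrow)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(HaveArrowTests)
result = unittest.TestResult()
suite.run(result)
# Either way the marker test is green: skipped (pyarrow present) or a
# trivial pass (pyarrow absent) -- the log line is the point.
print("skipped" if result.skipped else "ran")
```

Either outcome leaves an explicit trace in the test output, which is exactly what the current logs lack.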
[jira] [Commented] (SPARK-26293) Cast exception when having python udf in subquery
[ https://issues.apache.org/jira/browse/SPARK-26293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716332#comment-16716332 ] ASF GitHub Bot commented on SPARK-26293: cloud-fan commented on issue #23248: [SPARK-26293][SQL] Cast exception when having python udf in subquery URL: https://github.com/apache/spark/pull/23248#issuecomment-446086659 thanks, merging to master/2.4! This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Cast exception when having python udf in subquery > - > > Key: SPARK-26293 > URL: https://issues.apache.org/jira/browse/SPARK-26293 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run
[ https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716326#comment-16716326 ] ASF GitHub Bot commented on SPARK-25272: AmplabJenkins removed a comment on issue #22273: [SPARK-25272][PYTHON][TEST] Add test to better indicate pyarrow is installed and related tests will run URL: https://github.com/apache/spark/pull/22273#issuecomment-446086290 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Show some kind of test output to indicate pyarrow tests were run > > > Key: SPARK-25272 > URL: https://issues.apache.org/jira/browse/SPARK-25272 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Tests >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > > Right now tests only output status when they are skipped and there is no way > to really see from the logs that pyarrow tests, like ArrowTests, have been > run except by the absence of a skipped message. We can add a test that is > skipped if pyarrow is installed, which will give an output in our Jenkins > test. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run
[ https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716327#comment-16716327 ] ASF GitHub Bot commented on SPARK-25272: AmplabJenkins removed a comment on issue #22273: [SPARK-25272][PYTHON][TEST] Add test to better indicate pyarrow is installed and related tests will run URL: https://github.com/apache/spark/pull/22273#issuecomment-446086294 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99950/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Show some kind of test output to indicate pyarrow tests were run > > > Key: SPARK-25272 > URL: https://issues.apache.org/jira/browse/SPARK-25272 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Tests >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > > Right now tests only output status when they are skipped and there is no way > to really see from the logs that pyarrow tests, like ArrowTests, have been > run except by the absence of a skipped message. We can add a test that is > skipped if pyarrow is installed, which will give an output in our Jenkins > test. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26293) Cast exception when having python udf in subquery
[ https://issues.apache.org/jira/browse/SPARK-26293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716329#comment-16716329 ] ASF GitHub Bot commented on SPARK-26293: asfgit closed pull request #23248: [SPARK-26293][SQL] Cast exception when having python udf in subquery URL: https://github.com/apache/spark/pull/23248 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:
diff --git a/python/pyspark/sql/tests/test_udf.py b/python/pyspark/sql/tests/test_udf.py
index ed298f724d551..12cf8c7de1dad 100644
--- a/python/pyspark/sql/tests/test_udf.py
+++ b/python/pyspark/sql/tests/test_udf.py
@@ -23,7 +23,7 @@
 from pyspark import SparkContext
 from pyspark.sql import SparkSession, Column, Row
-from pyspark.sql.functions import UserDefinedFunction
+from pyspark.sql.functions import UserDefinedFunction, udf
 from pyspark.sql.types import *
 from pyspark.sql.utils import AnalysisException
 from pyspark.testing.sqlutils import ReusedSQLTestCase, test_compiled, test_not_compiled_message
@@ -102,7 +102,6 @@ def test_udf_registration_return_type_not_none(self):
     def test_nondeterministic_udf(self):
         # Test that nondeterministic UDFs are evaluated only once in chained UDF evaluations
-        from pyspark.sql.functions import udf
         import random
         udf_random_col = udf(lambda: int(100 * random.random()), IntegerType()).asNondeterministic()
         self.assertEqual(udf_random_col.deterministic, False)
@@ -113,7 +112,6 @@
     def test_nondeterministic_udf2(self):
         import random
-        from pyspark.sql.functions import udf
         random_udf = udf(lambda: random.randint(6, 6), IntegerType()).asNondeterministic()
         self.assertEqual(random_udf.deterministic, False)
         random_udf1 = self.spark.catalog.registerFunction("randInt", random_udf)
@@ -132,7 +130,6 @@ def test_nondeterministic_udf2(self):
     def test_nondeterministic_udf3(self):
         # regression test for SPARK-23233
-        from pyspark.sql.functions import udf
         f = udf(lambda x: x)
         # Here we cache the JVM UDF instance.
         self.spark.range(1).select(f("id"))
@@ -144,7 +141,7 @@ def test_nondeterministic_udf3(self):
         self.assertFalse(deterministic)
     def test_nondeterministic_udf_in_aggregate(self):
-        from pyspark.sql.functions import udf, sum
+        from pyspark.sql.functions import sum
         import random
         udf_random_col = udf(lambda: int(100 * random.random()), 'int').asNondeterministic()
         df = self.spark.range(10)
@@ -181,7 +178,6 @@ def test_multiple_udfs(self):
         self.assertEqual(tuple(row), (6, 5))
     def test_udf_in_filter_on_top_of_outer_join(self):
-        from pyspark.sql.functions import udf
         left = self.spark.createDataFrame([Row(a=1)])
         right = self.spark.createDataFrame([Row(a=1)])
         df = left.join(right, on='a', how='left_outer')
@@ -190,7 +186,6 @@ def test_udf_in_filter_on_top_of_outer_join(self):
     def test_udf_in_filter_on_top_of_join(self):
         # regression test for SPARK-18589
-        from pyspark.sql.functions import udf
         left = self.spark.createDataFrame([Row(a=1)])
         right = self.spark.createDataFrame([Row(b=1)])
         f = udf(lambda a, b: a == b, BooleanType())
@@ -199,7 +194,6 @@ def test_udf_in_filter_on_top_of_join(self):
     def test_udf_in_join_condition(self):
         # regression test for SPARK-25314
-        from pyspark.sql.functions import udf
         left = self.spark.createDataFrame([Row(a=1)])
         right = self.spark.createDataFrame([Row(b=1)])
         f = udf(lambda a, b: a == b, BooleanType())
@@ -211,7 +205,7 @@ def test_udf_in_join_condition(self):
     def test_udf_in_left_outer_join_condition(self):
         # regression test for SPARK-26147
-        from pyspark.sql.functions import udf, col
+        from pyspark.sql.functions import col
         left = self.spark.createDataFrame([Row(a=1)])
         right = self.spark.createDataFrame([Row(b=1)])
         f = udf(lambda a: str(a), StringType())
@@ -223,7 +217,6 @@ def test_udf_in_left_outer_join_condition(self):
     def test_udf_in_left_semi_join_condition(self):
         # regression test for SPARK-25314
-        from pyspark.sql.functions import udf
         left = self.spark.createDataFrame([Row(a=1, a1=1, a2=1), Row(a=2, a1=2, a2=2)])
         right = self.spark.createDataFrame([Row(b=1, b1=1, b2=1)])
         f = udf(lambda a, b: a == b, BooleanType())
@@ -236,7 +229,6 @@ def test_udf_in_left_semi_join_condition(self):
     def test_udf_and_common_filter_in_join_condition(self):
         # regression test for SPARK-25314
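The tests in this diff exercise `udf(...).asNondeterministic()`. The flag they check can be modeled without a running Spark session; the following is a pure-Python sketch (the `ToyUDF` class is illustrative, not pyspark's real `UserDefinedFunction`, which wraps a JVM object):

```python
import random

class ToyUDF:
    """Pure-Python stand-in for a pyspark UDF, used only to illustrate
    the deterministic flag checked by the tests above."""
    def __init__(self, func, return_type="string"):
        self.func = func
        self.returnType = return_type
        self.deterministic = True  # UDFs are deterministic by default

    def asNondeterministic(self):
        # Mirrors udf(...).asNondeterministic(): flip the flag and return
        # the UDF itself so the call can be chained, as in the tests.
        self.deterministic = False
        return self

    def __call__(self, *args):
        return self.func(*args)

# random.randint(6, 6) always yields 6, matching test_nondeterministic_udf2
random_udf = ToyUDF(lambda: random.randint(6, 6), "int").asNondeterministic()
```

Marking a UDF nondeterministic tells the optimizer it may not collapse, reorder, or re-evaluate the call, which is why the tests assert on the flag explicitly.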
[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run
[ https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716323#comment-16716323 ] ASF GitHub Bot commented on SPARK-25272: SparkQA removed a comment on issue #22273: [SPARK-25272][PYTHON][TEST] Add test to better indicate pyarrow is installed and related tests will run URL: https://github.com/apache/spark/pull/22273#issuecomment-446081233 **[Test build #99950 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99950/testReport)** for PR 22273 at commit [`8574291`](https://github.com/apache/spark/commit/8574291a0b84574626ca213bc6f95dc0db73b0ef).
[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run
[ https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716325#comment-16716325 ] ASF GitHub Bot commented on SPARK-25272: AmplabJenkins commented on issue #22273: [SPARK-25272][PYTHON][TEST] Add test to better indicate pyarrow is installed and related tests will run URL: https://github.com/apache/spark/pull/22273#issuecomment-446086294 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99950/ Test PASSed.
[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run
[ https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716324#comment-16716324 ] ASF GitHub Bot commented on SPARK-25272: AmplabJenkins commented on issue #22273: [SPARK-25272][PYTHON][TEST] Add test to better indicate pyarrow is installed and related tests will run URL: https://github.com/apache/spark/pull/22273#issuecomment-446086290 Merged build finished. Test PASSed.
[jira] [Commented] (SPARK-26303) Return partial results for bad JSON records
[ https://issues.apache.org/jira/browse/SPARK-26303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716311#comment-16716311 ] ASF GitHub Bot commented on SPARK-26303: AmplabJenkins removed a comment on issue #23253: [SPARK-26303][SQL] Return partial results for bad JSON records URL: https://github.com/apache/spark/pull/23253#issuecomment-446084120 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5957/ Test PASSed.
> Return partial results for bad JSON records
> -------------------------------------------
> Key: SPARK-26303
> URL: https://issues.apache.org/jira/browse/SPARK-26303
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Maxim Gekk
> Priority: Minor
>
> Currently, the JSON datasource and JSON functions return a row of all nulls for a
> malformed JSON string in the PERMISSIVE mode when the specified schema has the
> struct type. All nulls are returned even if some of the fields were parsed and
> converted to the desired types successfully. This ticket aims to solve the problem
> by returning the already parsed fields. The corrupt-record column, specified via the JSON
> option `columnNameOfCorruptRecord` or the SQL config, should contain the whole
> original JSON string.
> For example, if the input has one JSON string:
> {code:json}
> {"a":0.1,"b":{},"c":"def"}
> {code}
> and the specified schema is:
> {code:sql}
> a DOUBLE, b ARRAY, c STRING, _corrupt_record STRING
> {code}
> the expected output of `from_json` in the PERMISSIVE mode is:
> {code}
> +---+----+---+--------------------------+
> |a  |b   |c  |_corrupt_record           |
> +---+----+---+--------------------------+
> |0.1|null|def|{"a":0.1,"b":{},"c":"def"}|
> +---+----+---+--------------------------+
> {code}
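The proposed PERMISSIVE-mode behavior can be sketched with the standard-library `json` module (a hypothetical helper, not Spark's implementation): keep every field that converts, and put the whole original string into `_corrupt_record` when any part of the record is bad.

```python
import json

def as_array(value):
    # Strict ARRAY conversion: reject JSON objects such as {} instead of
    # silently coercing them.
    if not isinstance(value, list):
        raise TypeError("not a JSON array")
    return value

def from_json_permissive(record, fields):
    """Sketch of partial-result parsing: fields is a list of
    (name, converter) pairs; converters raise on bad input."""
    row = {name: None for name, _ in fields}
    row["_corrupt_record"] = None
    try:
        data = json.loads(record)
    except ValueError:
        row["_corrupt_record"] = record  # wholly malformed: all nulls
        return row
    for name, convert in fields:
        try:
            row[name] = convert(data[name])
        except (KeyError, TypeError, ValueError):
            # Partial result: keep what parsed, record the original string.
            row["_corrupt_record"] = record
    return row

fields = [("a", float), ("b", as_array), ("c", str)]
row = from_json_permissive('{"a":0.1,"b":{},"c":"def"}', fields)
```

Here `a` and `c` survive while the bad `b` field lands the original string in `_corrupt_record`, matching the expected output table in the description.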
[jira] [Updated] (SPARK-26288) add initRegisteredExecutorsDB in ExternalShuffleService
[ https://issues.apache.org/jira/browse/SPARK-26288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26288: -- Fix Version/s: (was: 2.4.0)
> add initRegisteredExecutorsDB in ExternalShuffleService
> -------------------------------------------------------
> Key: SPARK-26288
> URL: https://issues.apache.org/jira/browse/SPARK-26288
> Project: Spark
> Issue Type: New Feature
> Components: Kubernetes, Shuffle
> Affects Versions: 2.4.0
> Reporter: weixiuli
> Priority: Major
>
> Spark on YARN uses a DB to record RegisteredExecutors
> information, which can be reloaded and used again when the
> ExternalShuffleService is restarted.
> The RegisteredExecutors information is not recorded in either
> standalone mode or Spark on Kubernetes, so it is lost
> whenever the ExternalShuffleService is restarted.
> To solve the problem above, a method is proposed and committed.
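The recovery idea in the description can be illustrated with a toy file-backed store (hypothetical code; Spark's actual shuffle services persist registrations to LevelDB): registrations are written through to disk, so a restarted service instance reloads them instead of losing every executor.

```python
import json
import os
import tempfile

class RegisteredExecutorsDB:
    """Toy write-through store illustrating SPARK-26288: persist executor
    registrations so a restarted shuffle service can reload them."""
    def __init__(self, path):
        self.path = path
        self.executors = {}
        if os.path.exists(path):
            with open(path) as f:
                self.executors = json.load(f)  # reload survivors on restart

    def register(self, app_id, exec_id, info):
        self.executors["%s/%s" % (app_id, exec_id)] = info
        with open(self.path, "w") as f:
            json.dump(self.executors, f)  # write through on every change

# Simulate a restart: a second instance opened on the same path sees the
# registration made by the first one.
path = os.path.join(tempfile.mkdtemp(), "registeredExecutors.db")
RegisteredExecutorsDB(path).register("app-1", "0", {"localDirs": ["/tmp/a"]})
recovered = RegisteredExecutorsDB(path)
```

The design point is only durability across restarts; the real implementation must also handle concurrent updates and cleanup of stale applications.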
[jira] [Commented] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partially revert SPARK-21052: Add hash map metrics to join
[ https://issues.apache.org/jira/browse/SPARK-26316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716299#comment-16716299 ] ASF GitHub Bot commented on SPARK-26316: AmplabJenkins removed a comment on issue #23269: [SPARK-26316] Revert hash join metrics in spark 21052 that causes performance degradation URL: https://github.com/apache/spark/pull/23269#issuecomment-446083193 Merged build finished. Test PASSed.
> Because of the perf degradation in TPC-DS, we currently partially revert
> SPARK-21052: Add hash map metrics to join
>
> Key: SPARK-26316
> URL: https://issues.apache.org/jira/browse/SPARK-26316
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
> Reporter: Ke Jia
> Priority: Major
>
> The code at
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
> and
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
> added in SPARK-21052 causes performance degradation in Spark 2.3. The results of
> all TPC-DS queries at 1 TB are in the [TPC-DS
> result|https://docs.google.com/spreadsheets/d/18a5BdOlmm8euTaRodyeWum9yu92mbWWu6JbhGXtr7yE/edit#gid=0] spreadsheet.
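The revert targets per-lookup metric updates inside HashedRelation's hot probe loop. The cost pattern can be sketched abstractly (illustrative Python with a made-up metric name, not Spark's Scala code): updating a shared metric on every probe adds work to the innermost join loop, whereas accumulating locally and flushing once per batch yields the same totals.

```python
def probe_with_per_row_metric(keys, table, metric):
    # The SPARK-21052 pattern: touch a shared metric on every single probe.
    out = []
    for k in keys:
        metric["numKeyLookups"] += 1  # per-row update in the hot loop
        if k in table:
            out.append(table[k])
    return out

def probe_with_batched_metric(keys, table, metric):
    # Cheaper alternative: accumulate in a local, update the shared
    # metric once after the loop. Same result, less hot-loop work.
    out = []
    lookups = 0
    for k in keys:
        lookups += 1
        if k in table:
            out.append(table[k])
    metric["numKeyLookups"] += lookups
    return out

table = {1: "x", 3: "y"}
m1, m2 = {"numKeyLookups": 0}, {"numKeyLookups": 0}
r1 = probe_with_per_row_metric([1, 2, 3], table, m1)
r2 = probe_with_batched_metric([1, 2, 3], table, m2)
```

In generated join code the per-row version also inhibits some JIT optimizations, which is why even a single counter increment per probe showed up in the TPC-DS numbers.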
[jira] [Commented] (SPARK-24102) RegressionEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716292#comment-16716292 ] ASF GitHub Bot commented on SPARK-24102: AmplabJenkins removed a comment on issue #17085: [SPARK-24102][ML][MLLIB] ML Evaluators should use weight column - added weight column for regression evaluator URL: https://github.com/apache/spark/pull/17085#issuecomment-446083121 Merged build finished. Test PASSed.
> RegressionEvaluator should use sample weight data
> -------------------------------------------------
> Key: SPARK-24102
> URL: https://issues.apache.org/jira/browse/SPARK-24102
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.0.2
> Reporter: Ilya Matiach
> Priority: Major
> Labels: starter
>
> The LogisticRegression and LinearRegression models support training with a
> weight column, but the corresponding evaluators do not support computing
> metrics using those weights. This breaks model selection using CrossValidator.
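The metric change being requested can be written down directly. A minimal sketch of a weight-aware RMSE (the formula such an evaluator would compute; not Spark's API):

```python
import math

def weighted_rmse(labels, predictions, weights):
    # sqrt( sum(w_i * (y_i - yhat_i)^2) / sum(w_i) ); with all weights 1.0
    # this reduces to the plain RMSE the evaluator computes today.
    num = sum(w * (y - p) ** 2 for y, p, w in zip(labels, predictions, weights))
    return math.sqrt(num / sum(weights))
```

Because CrossValidator selects models by the evaluator's metric, evaluating unweighted while training weighted makes model selection inconsistent, which is the breakage the ticket describes.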
[jira] [Commented] (SPARK-26288) add initRegisteredExecutorsDB in ExternalShuffleService
[ https://issues.apache.org/jira/browse/SPARK-26288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716314#comment-16716314 ] ASF GitHub Bot commented on SPARK-26288: dongjoon-hyun commented on a change in pull request #23243: [SPARK-26288][ExternalShuffleService]add initRegisteredExecutorsDB URL: https://github.com/apache/spark/pull/23243#discussion_r240483099
## File path: core/src/test/scala/org/apache/spark/deploy/worker/WorkerSuite.scala
## @@ -243,4 +243,13 @@ class WorkerSuite extends SparkFunSuite with Matchers with BeforeAndAfter {
       ExecutorStateChanged("app1", 0, ExecutorState.EXITED, None, None))
     assert(cleanupCalled.get() == value)
   }
+  test("test initRegisteredExecutorsDB ") {
+    val sparkConf = new SparkConf()
+    Utils.loadDefaultSparkProperties(sparkConf)
+    val securityManager = new SecurityManager(sparkConf)
+    sparkConf.set(config.SHUFFLE_SERVICE_DB_ENABLED.key, "true")
+    sparkConf.set(config.SHUFFLE_SERVICE_ENABLED.key, "true")
+    sparkConf.set("spark.local.dir", "/tmp")
+    val externalShuffleService = new ExternalShuffleService(sparkConf, securityManager)
Review comment: Does this test case fail without your patch?
[jira] [Commented] (SPARK-26303) Return partial results for bad JSON records
[ https://issues.apache.org/jira/browse/SPARK-26303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716310#comment-16716310 ] ASF GitHub Bot commented on SPARK-26303: AmplabJenkins removed a comment on issue #23253: [SPARK-26303][SQL] Return partial results for bad JSON records URL: https://github.com/apache/spark/pull/23253#issuecomment-446084116 Merged build finished. Test PASSed.
[jira] [Commented] (SPARK-26303) Return partial results for bad JSON records
[ https://issues.apache.org/jira/browse/SPARK-26303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716308#comment-16716308 ] ASF GitHub Bot commented on SPARK-26303: AmplabJenkins commented on issue #23253: [SPARK-26303][SQL] Return partial results for bad JSON records URL: https://github.com/apache/spark/pull/23253#issuecomment-446084120 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5957/ Test PASSed.
[jira] [Commented] (SPARK-26303) Return partial results for bad JSON records
[ https://issues.apache.org/jira/browse/SPARK-26303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716306#comment-16716306 ] ASF GitHub Bot commented on SPARK-26303: SparkQA commented on issue #23253: [SPARK-26303][SQL] Return partial results for bad JSON records URL: https://github.com/apache/spark/pull/23253#issuecomment-446084058 **[Test build #99953 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99953/testReport)** for PR 23253 at commit [`9ca9248`](https://github.com/apache/spark/commit/9ca9248ed3f9314747c1415bd19760c53019bf36).
[jira] [Commented] (SPARK-26288) add initRegisteredExecutorsDB in ExternalShuffleService
[ https://issues.apache.org/jira/browse/SPARK-26288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716309#comment-16716309 ] Dongjoon Hyun commented on SPARK-26288: --- [~weixiuli], thank you for the contribution. Please don't specify the Target Version or Fix Version; those are handled by committers. There is a helpful guide for starting contributions: [http://spark.apache.org/contributing.html]
[jira] [Commented] (SPARK-26303) Return partial results for bad JSON records
[ https://issues.apache.org/jira/browse/SPARK-26303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716307#comment-16716307 ] ASF GitHub Bot commented on SPARK-26303: AmplabJenkins commented on issue #23253: [SPARK-26303][SQL] Return partial results for bad JSON records URL: https://github.com/apache/spark/pull/23253#issuecomment-446084116 Merged build finished. Test PASSed.
[jira] [Updated] (SPARK-26288) add initRegisteredExecutorsDB in ExternalShuffleService
[ https://issues.apache.org/jira/browse/SPARK-26288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26288: -- Target Version/s: (was: 2.4.0)
[jira] [Commented] (SPARK-26288) add initRegisteredExecutorsDB in ExternalShuffleService
[ https://issues.apache.org/jira/browse/SPARK-26288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716304#comment-16716304 ] ASF GitHub Bot commented on SPARK-26288: dongjoon-hyun commented on a change in pull request #23243: [SPARK-26288][ExternalShuffleService]add initRegisteredExecutorsDB URL: https://github.com/apache/spark/pull/23243#discussion_r240482510
## File path: core/src/test/scala/org/apache/spark/deploy/worker/WorkerSuite.scala
## @@ -19,20 +19,20 @@ package org.apache.spark.deploy.worker
 import java.util.concurrent.atomic.AtomicBoolean
 import java.util.function.Supplier
-
Review comment: Please execute `dev/scalastyle` to check the coding style. You should not remove this blank line.
[jira] [Commented] (SPARK-26288) add initRegisteredExecutorsDB in ExternalShuffleService
[ https://issues.apache.org/jira/browse/SPARK-26288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716303#comment-16716303 ] ASF GitHub Bot commented on SPARK-26288: dongjoon-hyun commented on a change in pull request #23243: [SPARK-26288][ExternalShuffleService]add initRegisteredExecutorsDB URL: https://github.com/apache/spark/pull/23243#discussion_r240482510 ## File path: core/src/test/scala/org/apache/spark/deploy/worker/WorkerSuite.scala ## @@ -19,20 +19,20 @@ package org.apache.spark.deploy.worker import java.util.concurrent.atomic.AtomicBoolean import java.util.function.Supplier - Review comment: Please execute `dev/scalastyle` to check the coding style.
[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716302#comment-16716302 ] Darcy Shen commented on SPARK-25075: I maintain a list of the Scala libraries that Spark uses: https://github.com/scala/scala-dev/issues/563#issuecomment-425363609 > Build and test Spark against Scala 2.13 > --- > > Key: SPARK-25075 > URL: https://issues.apache.org/jira/browse/SPARK-25075 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Affects Versions: 2.1.0 >Reporter: Guillaume Massé >Priority: Major > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.13 milestone.
[jira] [Commented] (SPARK-26288) add initRegisteredExecutorsDB in ExternalShuffleService
[ https://issues.apache.org/jira/browse/SPARK-26288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716298#comment-16716298 ] ASF GitHub Bot commented on SPARK-26288: dongjoon-hyun edited a comment on issue #23243: [SPARK-26288][ExternalShuffleService]add initRegisteredExecutorsDB URL: https://github.com/apache/spark/pull/23243#issuecomment-446083308 Hi, @weixiuli . You can use `[CORE]` instead of `[ExternalShuffleService]` in the PR title.
[jira] [Commented] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partial revert SPARK-21052:Add hash map metrics to join,
[ https://issues.apache.org/jira/browse/SPARK-26316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716300#comment-16716300 ] ASF GitHub Bot commented on SPARK-26316: AmplabJenkins removed a comment on issue #23269: [SPARK-26316] Revert hash join metrics in spark 21052 that causes performance degradation URL: https://github.com/apache/spark/pull/23269#issuecomment-446083198 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5955/ Test PASSed. > Because of the perf degradation in TPC-DS, we currently partial revert > SPARK-21052:Add hash map metrics to join, > > > Key: SPARK-26316 > URL: https://issues.apache.org/jira/browse/SPARK-26316 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Ke Jia >Priority: Major > > The code at > [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486] > and > [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487] > added in SPARK-21052 causes performance degradation in Spark 2.3. The results of > all TPC-DS queries at 1 TB are in the [TPC-DS > result|https://docs.google.com/spreadsheets/d/18a5BdOlmm8euTaRodyeWum9yu92mbWWu6JbhGXtr7yE/edit#gid=0] spreadsheet.
[jira] [Commented] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partial revert SPARK-21052:Add hash map metrics to join,
[ https://issues.apache.org/jira/browse/SPARK-26316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716294#comment-16716294 ] ASF GitHub Bot commented on SPARK-26316: AmplabJenkins commented on issue #23269: [SPARK-26316] Revert hash join metrics in spark 21052 that causes performance degradation URL: https://github.com/apache/spark/pull/23269#issuecomment-446083193 Merged build finished. Test PASSed.
[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run
[ https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716267#comment-16716267 ] ASF GitHub Bot commented on SPARK-25272: SparkQA commented on issue #22273: [SPARK-25272][PYTHON][TEST] Add test to better indicate pyarrow is installed and related tests will run URL: https://github.com/apache/spark/pull/22273#issuecomment-446081233 **[Test build #99950 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99950/testReport)** for PR 22273 at commit [`8574291`](https://github.com/apache/spark/commit/8574291a0b84574626ca213bc6f95dc0db73b0ef). > Show some kind of test output to indicate pyarrow tests were run > > > Key: SPARK-25272 > URL: https://issues.apache.org/jira/browse/SPARK-25272 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Tests >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > > Right now, tests only output status when they are skipped, and there is no way > to tell from the logs that pyarrow tests, like ArrowTests, have been > run except by the absence of a skip message. We can add a test that is > skipped only when pyarrow is installed, which will leave a visible line in our > Jenkins output.
[jira] [Commented] (SPARK-26288) add initRegisteredExecutorsDB in ExternalShuffleService
[ https://issues.apache.org/jira/browse/SPARK-26288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716297#comment-16716297 ] ASF GitHub Bot commented on SPARK-26288: dongjoon-hyun commented on issue #23243: [SPARK-26288][ExternalShuffleService]add initRegisteredExecutorsDB URL: https://github.com/apache/spark/pull/23243#issuecomment-446083308 Hi, @weixiuli . You can use `[CORE]` instead of `[ExternalShuffleService]`.
[jira] [Commented] (SPARK-24102) RegressionEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716293#comment-16716293 ] ASF GitHub Bot commented on SPARK-24102: AmplabJenkins removed a comment on issue #17085: [SPARK-24102][ML][MLLIB] ML Evaluators should use weight column - added weight column for regression evaluator URL: https://github.com/apache/spark/pull/17085#issuecomment-446083128 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5956/ Test PASSed. > RegressionEvaluator should use sample weight data > - > > Key: SPARK-24102 > URL: https://issues.apache.org/jira/browse/SPARK-24102 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Ilya Matiach >Priority: Major > Labels: starter > > The LogisticRegression and LinearRegression models support training with a > weight column, but the corresponding evaluators do not support computing > metrics using those weights. This breaks model selection using CrossValidator.
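The gap described in this ticket is that the evaluator's metrics are unweighted averages. A minimal sketch of what a weight-aware metric looks like (a hypothetical helper, not the Spark API):

```python
import math

def weighted_rmse(labels, predictions, weights=None):
    """Weighted root-mean-squared error.

    With weights=None every sample counts equally, which is what the
    evaluator effectively does before a fix like SPARK-24102; passing the
    training weights makes the metric consistent with the weighted fit.
    """
    if weights is None:
        weights = [1.0] * len(labels)
    total_weight = sum(weights)
    weighted_sq_err = sum(w * (y - p) ** 2
                          for y, p, w in zip(labels, predictions, weights))
    return math.sqrt(weighted_sq_err / total_weight)

labels, preds = [1.0, 2.0, 4.0], [1.0, 2.0, 3.0]
print(weighted_rmse(labels, preds))                    # unweighted RMSE
print(weighted_rmse(labels, preds, [1.0, 1.0, 0.0]))   # miss is down-weighted
```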
[jira] [Commented] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partial revert SPARK-21052:Add hash map metrics to join,
[ https://issues.apache.org/jira/browse/SPARK-26316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716295#comment-16716295 ] ASF GitHub Bot commented on SPARK-26316: AmplabJenkins commented on issue #23269: [SPARK-26316] Revert hash join metrics in spark 21052 that causes performance degradation URL: https://github.com/apache/spark/pull/23269#issuecomment-446083198 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5955/ Test PASSed.
[jira] [Commented] (SPARK-26303) Return partial results for bad JSON records
[ https://issues.apache.org/jira/browse/SPARK-26303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716296#comment-16716296 ] ASF GitHub Bot commented on SPARK-26303: HyukjinKwon commented on issue #23253: [SPARK-26303][SQL] Return partial results for bad JSON records URL: https://github.com/apache/spark/pull/23253#issuecomment-446083244 retest this please > Return partial results for bad JSON records > --- > > Key: SPARK-26303 > URL: https://issues.apache.org/jira/browse/SPARK-26303 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > Currently, the JSON datasource and JSON functions return a row of all nulls for a > malformed JSON string in the PERMISSIVE mode when the specified schema has the > struct type. All nulls are returned even when some fields were parsed and > converted to the desired types successfully. This ticket aims to solve the problem > by returning the already-parsed fields. The corrupt-record column, specified via the JSON > option `columnNameOfCorruptRecord` or the SQL config, should contain the whole > original JSON string. > For example, if the input has one JSON string: > {code:json} > {"a":0.1,"b":{},"c":"def"} > {code} > and the specified schema is: > {code:sql} > a DOUBLE, b ARRAY, c STRING, _corrupt_record STRING > {code} > the expected output of `from_json` in the PERMISSIVE mode is: > {code} > +---+----+---+--------------------------+ > |a  |b   |c  |_corrupt_record           | > +---+----+---+--------------------------+ > |0.1|null|def|{"a":0.1,"b":{},"c":"def"}| > +---+----+---+--------------------------+ > {code}
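The PERMISSIVE-mode behavior requested above can be sketched in plain Python (a toy stand-in for the JSON datasource; `parse_permissive`, `to_int_array`, and the dict-based schema are all hypothetical): fields that convert cleanly are kept, a failing field nulls out only its own column, and the raw input is copied into the corrupt-record column.

```python
import json

def to_int_array(value):
    # Hypothetical converter for an ARRAY column: reject non-list values.
    if not isinstance(value, list):
        raise TypeError("expected a JSON array")
    return [int(x) for x in value]

def parse_permissive(raw, schema):
    """Toy PERMISSIVE parser: schema maps column name -> converter.

    A field that fails conversion stays None, but fields that parsed
    successfully are kept (the partial result this ticket asks for),
    and the whole input is copied into _corrupt_record.
    """
    row = {name: None for name in schema}
    row["_corrupt_record"] = None
    try:
        data = json.loads(raw)
    except ValueError:
        row["_corrupt_record"] = raw  # wholly malformed record
        return row
    for name, convert in schema.items():
        try:
            row[name] = convert(data[name])
        except (KeyError, TypeError, ValueError):
            row["_corrupt_record"] = raw  # record is only partially usable
    return row

schema = {"a": float, "b": to_int_array, "c": str}
print(parse_permissive('{"a":0.1,"b":{},"c":"def"}', schema))
# {'a': 0.1, 'b': None, 'c': 'def', '_corrupt_record': '{"a":0.1,"b":{},"c":"def"}'}
```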
[jira] [Commented] (SPARK-24102) RegressionEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716290#comment-16716290 ] ASF GitHub Bot commented on SPARK-24102: AmplabJenkins commented on issue #17085: [SPARK-24102][ML][MLLIB] ML Evaluators should use weight column - added weight column for regression evaluator URL: https://github.com/apache/spark/pull/17085#issuecomment-446083128 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5956/ Test PASSed.
[jira] [Commented] (SPARK-24102) RegressionEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716291#comment-16716291 ] ASF GitHub Bot commented on SPARK-24102: SparkQA commented on issue #17085: [SPARK-24102][ML][MLLIB] ML Evaluators should use weight column - added weight column for regression evaluator URL: https://github.com/apache/spark/pull/17085#issuecomment-446083138 **[Test build #99952 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99952/testReport)** for PR 17085 at commit [`0cb2daf`](https://github.com/apache/spark/commit/0cb2daf35888d80c5c223e16505354571d87d383).
[jira] [Commented] (SPARK-24102) RegressionEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716289#comment-16716289 ] ASF GitHub Bot commented on SPARK-24102: AmplabJenkins commented on issue #17085: [SPARK-24102][ML][MLLIB] ML Evaluators should use weight column - added weight column for regression evaluator URL: https://github.com/apache/spark/pull/17085#issuecomment-446083121 Merged build finished. Test PASSed.
[jira] [Commented] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partial revert SPARK-21052:Add hash map metrics to join,
[ https://issues.apache.org/jira/browse/SPARK-26316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716288#comment-16716288 ] ASF GitHub Bot commented on SPARK-26316: SparkQA commented on issue #23269: [SPARK-26316] Revert hash join metrics in spark 21052 that causes performance degradation URL: https://github.com/apache/spark/pull/23269#issuecomment-446083119 **[Test build #99951 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99951/testReport)** for PR 23269 at commit [`a46d18e`](https://github.com/apache/spark/commit/a46d18e2a6ae822a1e1d903e54ab928096cb2339).
[jira] [Commented] (SPARK-26318) Enhance function merge performance in Row
[ https://issues.apache.org/jira/browse/SPARK-26318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716283#comment-16716283 ] ASF GitHub Bot commented on SPARK-26318: HyukjinKwon commented on issue #23271: [SPARK-26318][SQL] Enhance function merge performance in Row URL: https://github.com/apache/spark/pull/23271#issuecomment-446082473 +1 for deprecation. > Enhance function merge performance in Row > - > > Key: SPARK-26318 > URL: https://issues.apache.org/jira/browse/SPARK-26318 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Liang Li >Priority: Minor > > Enhance the performance of the merge function in Row. > For example, a benchmark calling Row.merge on the input > val row1 = Row("name", "work", 2314, "null", 1, "") took 108458 > milliseconds; after the enhancement, it takes only 24967 milliseconds.
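The JIRA does not show the patch, but speedups of this shape usually come from avoiding repeated intermediate-sequence construction when concatenating row values. A generic illustration of that trade-off (not the actual Row.merge code):

```python
import time

def merge_naive(rows):
    # Builds a fresh tuple on every step: O(n^2) total copying.
    merged = ()
    for row in rows:
        merged = merged + row
    return merged

def merge_flat(rows):
    # Single pass over all values, one allocation for the result.
    return tuple(value for row in rows for value in row)

rows = [("name", "work", 2314, "null", 1, "")] * 500
t0 = time.perf_counter()
naive = merge_naive(rows)
t_naive = time.perf_counter() - t0
t0 = time.perf_counter()
flat = merge_flat(rows)
t_flat = time.perf_counter() - t0
assert naive == flat  # same result, very different amount of copying
print(f"naive: {t_naive:.4f}s, flat: {t_flat:.4f}s")
```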
[jira] [Commented] (SPARK-26300) The `checkForStreaming` method may be called twice in `createQuery`
[ https://issues.apache.org/jira/browse/SPARK-26300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716281#comment-16716281 ] ASF GitHub Bot commented on SPARK-26300: AmplabJenkins removed a comment on issue #23251: [SPARK-26300][SS] Remove a redundant `checkForStreaming` call URL: https://github.com/apache/spark/pull/23251#issuecomment-446082207 Merged build finished. Test PASSed. > The `checkForStreaming` method may be called twice in `createQuery` > - > > Key: SPARK-26300 > URL: https://issues.apache.org/jira/browse/SPARK-26300 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: liuxian >Priority: Minor > > If {{checkForContinuous}} is called ({{checkForStreaming}} is called inside > {{checkForContinuous}}), the {{checkForStreaming}} method is called twice in > {{createQuery}}. This is unnecessary, and since the {{checkForStreaming}} method > contains a lot of statements, it is better to remove the redundant call.
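The redundancy described above is easy to see in a toy model (hypothetical Python stand-ins for the Scala methods, not the actual StreamingQueryManager code): checkForContinuous already runs the streaming checks, so createQuery only needs to invoke one entry point.

```python
def check_for_streaming(plan, calls):
    calls.append("streaming")  # stands in for the real (costly) validation
    return plan

def check_for_continuous(plan, calls):
    check_for_streaming(plan, calls)  # continuous checks subsume streaming ones
    calls.append("continuous")
    return plan

def create_query(plan, continuous, calls):
    # After the fix: call exactly one entry point. Before it, the streaming
    # check also ran unconditionally here, so the continuous path validated
    # the plan twice.
    if continuous:
        return check_for_continuous(plan, calls)
    return check_for_streaming(plan, calls)

calls = []
create_query("plan", True, calls)
print(calls)  # ['streaming', 'continuous'] -- streaming validation runs once
```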
[jira] [Commented] (SPARK-26300) The `checkForStreaming` method may be called twice in `createQuery`
[ https://issues.apache.org/jira/browse/SPARK-26300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716266#comment-16716266 ] ASF GitHub Bot commented on SPARK-26300: SparkQA commented on issue #23251: [SPARK-26300][SS] Remove a redundant `checkForStreaming` call URL: https://github.com/apache/spark/pull/23251#issuecomment-446081221 **[Test build #99949 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99949/testReport)** for PR 23251 at commit [`b1e71ee`](https://github.com/apache/spark/commit/b1e71ee7a723d63f1cf3c0754f2372eb185439d3).
[jira] [Commented] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partial revert SPARK-21052:Add hash map metrics to join,
[ https://issues.apache.org/jira/browse/SPARK-26316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716280#comment-16716280 ] ASF GitHub Bot commented on SPARK-26316: JkSelf commented on a change in pull request #23269: [SPARK-26316] Revert hash join metrics in spark 21052 that causes performance degradation URL: https://github.com/apache/spark/pull/23269#discussion_r240481284 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala ## @@ -62,8 +62,7 @@ case class HashAggregateExec( "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"), "peakMemory" -> SQLMetrics.createSizeMetric(sparkContext, "peak memory"), "spillSize" -> SQLMetrics.createSizeMetric(sparkContext, "spill size"), -"aggTime" -> SQLMetrics.createTimingMetric(sparkContext, "aggregate time"), -"avgHashProbe" -> SQLMetrics.createAverageMetric(sparkContext, "avg hash probe")) +"aggTime" -> SQLMetrics.createTimingMetric(sparkContext, "aggregate time")) Review comment: Yes, updated. Thanks.
[jira] [Commented] (SPARK-26300) The `checkForStreaming` method may be called twice in `createQuery`
[ https://issues.apache.org/jira/browse/SPARK-26300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716282#comment-16716282 ] ASF GitHub Bot commented on SPARK-26300: AmplabJenkins removed a comment on issue #23251: [SPARK-26300][SS] Remove a redundant `checkForStreaming` call URL: https://github.com/apache/spark/pull/23251#issuecomment-446082209 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5954/ Test PASSed.
> The `checkForStreaming` method may be called twice in `createQuery`
> -
>
> Key: SPARK-26300
> URL: https://issues.apache.org/jira/browse/SPARK-26300
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 2.4.0
> Reporter: liuxian
> Priority: Minor
>
> If {{checkForContinuous}} is called ({{checkForStreaming}} is called inside
> {{checkForContinuous}}), the {{checkForStreaming}} method will be called
> twice in {{createQuery}}. This is unnecessary, and since the
> {{checkForStreaming}} method has a lot of statements, it's better to remove
> one of the calls.
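The redundancy described in the issue can be sketched without Spark (the names below are plain stand-ins mirroring the report, not the real `StreamingQueryManager` code): because `checkForContinuous` already invokes `checkForStreaming`, calling both from `createQuery` runs the streaming validation twice.

```scala
// Counts how often the expensive streaming validation runs.
var streamingChecks = 0

def checkForStreaming(): Unit = {
  streamingChecks += 1
  // ... many validation statements in the real method ...
}

def checkForContinuous(): Unit = {
  checkForStreaming() // continuous mode reuses the streaming validation
  // ... continuous-specific checks ...
}

// Before the fix: createQuery invokes both entry points,
// so the streaming validation runs twice.
def createQueryBefore(): Unit = {
  checkForStreaming()
  checkForContinuous()
}

// After the fix: only the mode-specific entry point is invoked.
def createQueryAfter(): Unit = {
  checkForContinuous()
}

createQueryBefore()
val callsBefore = streamingChecks // 2
streamingChecks = 0
createQueryAfter()
val callsAfter = streamingChecks // 1
```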
[jira] [Commented] (SPARK-26300) The `checkForStreaming` method may be called twice in `createQuery`
[ https://issues.apache.org/jira/browse/SPARK-26300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716278#comment-16716278 ] ASF GitHub Bot commented on SPARK-26300: AmplabJenkins commented on issue #23251: [SPARK-26300][SS] Remove a redundant `checkForStreaming` call URL: https://github.com/apache/spark/pull/23251#issuecomment-446082209 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5954/ Test PASSed.
[jira] [Commented] (SPARK-26300) The `checkForStreaming` method may be called twice in `createQuery`
[ https://issues.apache.org/jira/browse/SPARK-26300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716277#comment-16716277 ] ASF GitHub Bot commented on SPARK-26300: AmplabJenkins commented on issue #23251: [SPARK-26300][SS] Remove a redundant `checkForStreaming` call URL: https://github.com/apache/spark/pull/23251#issuecomment-446082207 Merged build finished. Test PASSed.
[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run
[ https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716272#comment-16716272 ] ASF GitHub Bot commented on SPARK-25272: AmplabJenkins removed a comment on issue #22273: [SPARK-25272][PYTHON][TEST] Add test to better indicate pyarrow is installed and related tests will run URL: https://github.com/apache/spark/pull/22273#issuecomment-446081265 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5953/ Test PASSed.
> Show some kind of test output to indicate pyarrow tests were run
> -
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark, Tests
> Affects Versions: 2.4.0
> Reporter: Bryan Cutler
> Assignee: Bryan Cutler
> Priority: Major
>
> Right now tests only output status when they are skipped, and there is no
> way to really see from the logs that pyarrow tests, like ArrowTests, have
> been run except by the absence of a skipped message. We can add a test that
> is skipped if pyarrow is installed, which will give an output in our Jenkins
> test.
[jira] [Commented] (SPARK-24102) RegressionEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716250#comment-16716250 ] ASF GitHub Bot commented on SPARK-24102: SparkQA removed a comment on issue #17085: [SPARK-24102][ML][MLLIB] ML Evaluators should use weight column - added weight column for regression evaluator URL: https://github.com/apache/spark/pull/17085#issuecomment-446078542 **[Test build #99948 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99948/testReport)** for PR 17085 at commit [`0480721`](https://github.com/apache/spark/commit/04807214d8694dcff7a2fe042457934e67eb8d57).
> RegressionEvaluator should use sample weight data
> -
>
> Key: SPARK-24102
> URL: https://issues.apache.org/jira/browse/SPARK-24102
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.0.2
> Reporter: Ilya Matiach
> Priority: Major
> Labels: starter
>
> The LogisticRegression and LinearRegression models support training with a
> weight column, but the corresponding evaluators do not support computing
> metrics using those weights. This breaks model selection using
> CrossValidator.
[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run
[ https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716271#comment-16716271 ] ASF GitHub Bot commented on SPARK-25272: AmplabJenkins removed a comment on issue #22273: [SPARK-25272][PYTHON][TEST] Add test to better indicate pyarrow is installed and related tests will run URL: https://github.com/apache/spark/pull/22273#issuecomment-446081262 Merged build finished. Test PASSed.
[jira] [Commented] (SPARK-26327) Metrics in FileSourceScanExec not updated correctly
[ https://issues.apache.org/jira/browse/SPARK-26327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716270#comment-16716270 ] ASF GitHub Bot commented on SPARK-26327: HyukjinKwon commented on issue #23277: [SPARK-26327][SQL] Metrics in FileSourceScanExec not update correctly URL: https://github.com/apache/spark/pull/23277#issuecomment-446081444 Looks fine to me
> Metrics in FileSourceScanExec not updated correctly
> --
>
> Key: SPARK-26327
> URL: https://issues.apache.org/jira/browse/SPARK-26327
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Yuanjian Li
> Priority: Major
>
> In the current approach in `FileSourceScanExec`, the "numFiles" and
> "metadataTime" (file listing time) metrics are updated when the lazy val
> `selectedPartitions` is initialized. But `selectedPartitions` is first
> initialized by `metadata`, which is called by `queryExecution.toString` in
> `SQLExecution.withNewExecutionId`. So when `SQLMetrics.postDriverMetricUpdates`
> is called, there is no corresponding entry in `liveExecutions` in
> SQLAppStatusListener, and the metrics update does not work.
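The ordering problem in the description can be reproduced without Spark. In the sketch below (all names are illustrative stand-ins, not Spark's actual classes), a metric posted while a lazy val is forced during plan stringification is dropped, because the listener only registers the execution afterwards.

```scala
import scala.collection.mutable

// Stand-in for SQLAppStatusListener's map of live executions.
val liveExecutions = mutable.Map.empty[Long, mutable.Map[String, Long]]

// Updates addressed to an unregistered execution are silently dropped,
// mirroring the behaviour the report describes.
def postDriverMetricUpdate(executionId: Long, name: String, value: Long): Unit =
  liveExecutions.get(executionId).foreach(_ += name -> value)

class ScanNode(executionId: Long) {
  // Forcing this lazy val posts "numFiles" as a side effect,
  // like selectedPartitions in FileSourceScanExec.
  lazy val selectedPartitions: Seq[String] = {
    postDriverMetricUpdate(executionId, "numFiles", 42L)
    Seq("part=0")
  }
  override def toString: String = s"Scan(${selectedPartitions.size} partitions)"
}

val execId = 1L
val node = new ScanNode(execId)

// withNewExecutionId stringifies the plan *before* the execution is
// registered, which forces the lazy val too early...
node.toString
// ...and only then does the listener learn about the execution.
liveExecutions(execId) = mutable.Map.empty[String, Long]

// The "numFiles" update was posted while no live execution existed,
// so it is lost:
val recorded = liveExecutions(execId).get("numFiles") // None
```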
[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run
[ https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716269#comment-16716269 ] ASF GitHub Bot commented on SPARK-25272: AmplabJenkins commented on issue #22273: [SPARK-25272][PYTHON][TEST] Add test to better indicate pyarrow is installed and related tests will run URL: https://github.com/apache/spark/pull/22273#issuecomment-446081265 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5953/ Test PASSed.
[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run
[ https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716268#comment-16716268 ] ASF GitHub Bot commented on SPARK-25272: AmplabJenkins commented on issue #22273: [SPARK-25272][PYTHON][TEST] Add test to better indicate pyarrow is installed and related tests will run URL: https://github.com/apache/spark/pull/22273#issuecomment-446081262 Merged build finished. Test PASSed.
[jira] [Commented] (SPARK-26300) The `checkForStreaming` method may be called twice in `createQuery`
[ https://issues.apache.org/jira/browse/SPARK-26300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716265#comment-16716265 ] ASF GitHub Bot commented on SPARK-26300: dongjoon-hyun commented on issue #23251: [SPARK-26300][SS] Remove a redundant `checkForStreaming` call URL: https://github.com/apache/spark/pull/23251#issuecomment-446081142 cc @tdas , too.
[jira] [Commented] (SPARK-26300) The `checkForStreaming` method may be called twice in `createQuery`
[ https://issues.apache.org/jira/browse/SPARK-26300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716264#comment-16716264 ] ASF GitHub Bot commented on SPARK-26300: dongjoon-hyun commented on issue #23251: [SPARK-26300][SS] Remove a redundant `checkForStreaming` call URL: https://github.com/apache/spark/pull/23251#issuecomment-446081024 Retest this please.
[jira] [Commented] (SPARK-26327) Metrics in FileSourceScanExec not updated correctly
[ https://issues.apache.org/jira/browse/SPARK-26327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716261#comment-16716261 ] ASF GitHub Bot commented on SPARK-26327: HyukjinKwon commented on a change in pull request #23277: [SPARK-26327][SQL] Metrics in FileSourceScanExec not update correctly URL: https://github.com/apache/spark/pull/23277#discussion_r240480046

File path: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala

{code:java}
@@ -316,7 +313,7 @@ case class FileSourceScanExec(
 override lazy val metrics = Map("numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"),
   "numFiles" -> SQLMetrics.createMetric(sparkContext, "number of files"),
-  "metadataTime" -> SQLMetrics.createMetric(sparkContext, "metadata time (ms)"),
+  "fileListingTime" -> SQLMetrics.createMetric(sparkContext, "file listing time (ms)"),
{code}

Review comment: Yea, please fix PR description and title accordingly.
[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run
[ https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716256#comment-16716256 ] ASF GitHub Bot commented on SPARK-25272: BryanCutler commented on issue #22273: [SPARK-25272][PYTHON][TEST] Add test to better indicate pyarrow is installed and related tests will run URL: https://github.com/apache/spark/pull/22273#issuecomment-446080278 retest this please
[jira] [Commented] (SPARK-24102) RegressionEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716253#comment-16716253 ] ASF GitHub Bot commented on SPARK-24102: AmplabJenkins removed a comment on issue #17085: [SPARK-24102][ML][MLLIB] ML Evaluators should use weight column - added weight column for regression evaluator URL: https://github.com/apache/spark/pull/17085#issuecomment-446079822 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99948/ Test FAILed.
[jira] [Commented] (SPARK-24102) RegressionEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716251#comment-16716251 ] ASF GitHub Bot commented on SPARK-24102: AmplabJenkins removed a comment on issue #17085: [SPARK-24102][ML][MLLIB] ML Evaluators should use weight column - added weight column for regression evaluator URL: https://github.com/apache/spark/pull/17085#issuecomment-446079818 Merged build finished. Test FAILed.
[jira] [Commented] (SPARK-24102) RegressionEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716248#comment-16716248 ] ASF GitHub Bot commented on SPARK-24102: AmplabJenkins commented on issue #17085: [SPARK-24102][ML][MLLIB] ML Evaluators should use weight column - added weight column for regression evaluator URL: https://github.com/apache/spark/pull/17085#issuecomment-446079818 Merged build finished. Test FAILed.
[jira] [Commented] (SPARK-24102) RegressionEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716231#comment-16716231 ] ASF GitHub Bot commented on SPARK-24102: SparkQA commented on issue #17085: [SPARK-24102][ML][MLLIB] ML Evaluators should use weight column - added weight column for regression evaluator URL: https://github.com/apache/spark/pull/17085#issuecomment-446078542 **[Test build #99948 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99948/testReport)** for PR 17085 at commit [`0480721`](https://github.com/apache/spark/commit/04807214d8694dcff7a2fe042457934e67eb8d57).