[jira] [Created] (SPARK-26333) FsHistoryProviderSuite failed because setReadable doesn't work in RedHat

2018-12-10 Thread deshanxiao (JIRA)
deshanxiao created SPARK-26333:
--

 Summary: FsHistoryProviderSuite failed because setReadable doesn't 
work in RedHat
 Key: SPARK-26333
 URL: https://issues.apache.org/jira/browse/SPARK-26333
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: deshanxiao


FsHistoryProviderSuite fails in the case "SPARK-3697: ignore files that cannot 
be read.". I invoked logFile2.canRead after calling "setReadable(false, 
false)" and found that it returns true on RedHat, whereas on my Ubuntu 16.04 
it returns false.
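A likely explanation is the underlying POSIX behavior: permission bits are honored for ordinary users, but the superuser bypasses them, so a file stripped of read permission can still appear readable when the test runs as root. A minimal Python sketch of that behavior (illustrating the OS semantics behind Java's setReadable/canRead, not the Spark test itself):

```python
import os
import tempfile

# Create a temp file and drop all permission bits, analogous to
# File.setReadable(false, false) in the failing test.
fd, path = tempfile.mkstemp()
os.close(fd)
os.chmod(path, 0o000)

# For a non-root user this reports False; for root it reports True,
# because root bypasses file permission checks on Linux.
readable = os.access(path, os.R_OK)
print(f"readable with mode 000: {readable} (euid={os.geteuid()})")

# Restore permissions so the file can be cleaned up.
os.chmod(path, 0o600)
os.remove(path)
```

If the RedHat machine runs the suite as root, this would explain why canRead still returns true there.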

The environment:

RedHat:
Linux version 3.10.0-693.2.2.el7.x86_64 (buil...@kbuilder.dev.centos.org) (gcc 
version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Tue Sep 12 22:26:13 
UTC 2017

JDK
Java version: 1.8.0_151, vendor: Oracle Corporation

{code:java}
 org.scalatest.exceptions.TestFailedException: 2 was not equal to 1
  at org.scalatest.MatchersHelper$.indicateFailure(MatchersHelper.scala:340)
  at 
org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6668)
  at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6704)
  at 
org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:183)
  at 
org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12$$anonfun$apply$7.apply(FsHistoryProviderSuite.scala:182)
  at 
org.apache.spark.deploy.history.FsHistoryProviderSuite.org$apache$spark$deploy$history$FsHistoryProviderSuite$$updateAndCheck(FsHistoryProviderSuite.scala:841)
  at 
org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:182)
  at 
org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$12.apply(FsHistoryProviderSuite.scala:148)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
  at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
  at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
  at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
  at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
  at 
org.apache.spark.deploy.history.FsHistoryProviderSuite.org$scalatest$BeforeAndAfter$$super$runTest(FsHistoryProviderSuite.scala:51)
  at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:203)
  at 
org.apache.spark.deploy.history.FsHistoryProviderSuite.runTest(FsHistoryProviderSuite.scala:51)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
  at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
  at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26262) Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716426#comment-16716426
 ] 

ASF GitHub Bot commented on SPARK-26262:


viirya commented on issue #23213: [SPARK-26262][SQL] Runs SQLQueryTestSuite on 
mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE
URL: https://github.com/apache/spark/pull/23213#issuecomment-446104710
 
 
   I think wholeStageCodegen doesn't disallow using those objects in 
interpreted mode. The objects can run in interpreted mode if codegen rolls 
back due to a compilation error.
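The rollback behavior being discussed can be sketched as a try-compiled, fall-back-to-interpreted pattern. A minimal Python illustration (the function names here are stand-ins, not Spark's internal API):

```python
# Try the compiled (codegen) implementation first; if compilation
# fails, roll back to an interpreted implementation of the same
# expressions. This mirrors the fallback idea only.
def create_object(exprs, compile_fn, interpret_fn):
    try:
        return compile_fn(exprs)
    except Exception:
        # Compilation error: fall back to interpreted mode.
        return interpret_fn(exprs)

def failing_compile(exprs):
    raise RuntimeError("simulated compilation error")

obj = create_object(["a + 1"], failing_compile, lambda e: ("interpreted", e))
print(obj)  # ('interpreted', ['a + 1'])
```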


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and 
> CODEGEN_FACTORY_MODE
> 
>
> Key: SPARK-26262
> URL: https://issues.apache.org/jira/browse/SPARK-26262
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> For better test coverage, we need to run `SQLQueryTestSuite` on 4 mixed 
> config sets:
> 1. WHOLESTAGE_CODEGEN_ENABLED=true, CODEGEN_FACTORY_MODE=CODEGEN_ONLY
> 2. WHOLESTAGE_CODEGEN_ENABLED=false, CODEGEN_FACTORY_MODE=CODEGEN_ONLY
> 3. WHOLESTAGE_CODEGEN_ENABLED=true, CODEGEN_FACTORY_MODE=NO_CODEGEN
> 4. WHOLESTAGE_CODEGEN_ENABLED=false, CODEGEN_FACTORY_MODE=NO_CODEGEN
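The four combinations above can be generated mechanically as a cross product. A hedged Python sketch of driving a test body over them (the config keys are spelled from Spark's option names and `run_query_test` is a hypothetical stand-in for the suite's per-query check):

```python
from itertools import product

# Config keys as suggested by the ticket's option names (assumed here).
WHOLESTAGE = "spark.sql.codegen.wholeStage"
FACTORY_MODE = "spark.sql.codegen.factoryMode"

def run_on_mixed_configs(run_query_test):
    """Run the given test body under every combination of the two
    codegen-related configs and collect the results."""
    results = []
    for wholestage, mode in product(["true", "false"],
                                    ["CODEGEN_ONLY", "NO_CODEGEN"]):
        configs = {WHOLESTAGE: wholestage, FACTORY_MODE: mode}
        results.append((configs, run_query_test(configs)))
    return results

outcomes = run_on_mixed_configs(lambda cfg: "ok")
print(len(outcomes))  # 4
```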






[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716419#comment-16716419
 ] 

ASF GitHub Bot commented on SPARK-26311:


HeartSaVioR commented on issue #23260: [SPARK-26311][YARN] New feature: custom 
log URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446103258
 
 
   @vanzin Thanks for the detailed review! Addressed review comments.




> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark currently sets static log URLs for YARN applications, which point to 
> the NodeManager web app. Normally this works for both running and finished 
> apps, but there are other approaches to maintaining application logs, such 
> as an external log service, which avoids the application log URL becoming a 
> dead link when the NodeManager is not accessible (node decommissioned, 
> elastic nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, 
> which end users can set to point application logs to an external log 
> service.
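A custom log URL would plausibly be a template expanded per container. A toy Python sketch of that idea (the placeholder names and URL are illustrative assumptions, not Spark's actual configuration):

```python
import re

def expand_log_url(pattern, attrs):
    """Expand {{NAME}} placeholders in a log URL pattern using
    per-container attributes; unknown placeholders are an error."""
    def repl(match):
        key = match.group(1)
        if key not in attrs:
            raise KeyError(f"unknown placeholder: {key}")
        return attrs[key]
    return re.sub(r"\{\{(\w+)\}\}", repl, pattern)

url = expand_log_url(
    "http://logservice.example.com/{{APP_ID}}/{{CONTAINER_ID}}/{{FILE_NAME}}",
    {"APP_ID": "application_1", "CONTAINER_ID": "container_1",
     "FILE_NAME": "stderr"},
)
print(url)  # http://logservice.example.com/application_1/container_1/stderr
```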






[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716406#comment-16716406
 ] 

ASF GitHub Bot commented on SPARK-26311:


AmplabJenkins commented on issue #23260: [SPARK-26311][YARN] New feature: 
custom log URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446100271
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99955/
   Test PASSed.




> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark currently sets static log URLs for YARN applications, which point to 
> the NodeManager web app. Normally this works for both running and finished 
> apps, but there are other approaches to maintaining application logs, such 
> as an external log service, which avoids the application log URL becoming a 
> dead link when the NodeManager is not accessible (node decommissioned, 
> elastic nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, 
> which end users can set to point application logs to an external log 
> service.






[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716407#comment-16716407
 ] 

ASF GitHub Bot commented on SPARK-26311:


SparkQA removed a comment on issue #23260: [SPARK-26311][YARN] New feature: 
custom log URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446096365
 
 
   **[Test build #99955 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99955/testReport)**
 for PR 23260 at commit 
[`dbeade7`](https://github.com/apache/spark/commit/dbeade7e41f861c9240c70058796293b239db96c).




> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark currently sets static log URLs for YARN applications, which point to 
> the NodeManager web app. Normally this works for both running and finished 
> apps, but there are other approaches to maintaining application logs, such 
> as an external log service, which avoids the application log URL becoming a 
> dead link when the NodeManager is not accessible (node decommissioned, 
> elastic nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, 
> which end users can set to point application logs to an external log 
> service.






[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716409#comment-16716409
 ] 

ASF GitHub Bot commented on SPARK-26311:


AmplabJenkins removed a comment on issue #23260: [SPARK-26311][YARN] New 
feature: custom log URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446100271
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99955/
   Test PASSed.




> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark currently sets static log URLs for YARN applications, which point to 
> the NodeManager web app. Normally this works for both running and finished 
> apps, but there are other approaches to maintaining application logs, such 
> as an external log service, which avoids the application log URL becoming a 
> dead link when the NodeManager is not accessible (node decommissioned, 
> elastic nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, 
> which end users can set to point application logs to an external log 
> service.






[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716408#comment-16716408
 ] 

ASF GitHub Bot commented on SPARK-26311:


AmplabJenkins removed a comment on issue #23260: [SPARK-26311][YARN] New 
feature: custom log URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446100267
 
 
   Merged build finished. Test PASSed.




> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark currently sets static log URLs for YARN applications, which point to 
> the NodeManager web app. Normally this works for both running and finished 
> apps, but there are other approaches to maintaining application logs, such 
> as an external log service, which avoids the application log URL becoming a 
> dead link when the NodeManager is not accessible (node decommissioned, 
> elastic nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, 
> which end users can set to point application logs to an external log 
> service.






[jira] [Commented] (SPARK-25277) YARN applicationMaster metrics should not register static and JVM metrics

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716415#comment-16716415
 ] 

ASF GitHub Bot commented on SPARK-25277:


LucaCanali commented on issue #22279: [SPARK-25277][YARN] YARN 
applicationMaster metrics should not register static metrics
URL: https://github.com/apache/spark/pull/22279#issuecomment-446102201
 
 
   Thanks @vanzin for looking at this.




> YARN applicationMaster metrics should not register static and JVM metrics
> -
>
> Key: SPARK-25277
> URL: https://issues.apache.org/jira/browse/SPARK-25277
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Luca Canali
>Priority: Minor
>
> YARN applicationMaster metrics registration introduced in SPARK-24594 causes 
> further registration of static metrics (Codegenerator and 
> HiveExternalCatalog) and of JVM metrics, which I believe do not belong in 
> this context.






[jira] [Created] (SPARK-26332) Spark sql write orc table on viewFS throws exception

2018-12-10 Thread Bang Xiao (JIRA)
Bang Xiao created SPARK-26332:
-

 Summary: Spark sql write orc table on viewFS throws exception
 Key: SPARK-26332
 URL: https://issues.apache.org/jira/browse/SPARK-26332
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Bang Xiao


Using Spark SQL to write an ORC table on viewFS causes an exception:
{code:java}
Task failed while writing rows.
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.fs.viewfs.NotInMountpointException: 
getDefaultReplication on empty path is invalid
at 
org.apache.hadoop.fs.viewfs.ViewFileSystem.getDefaultReplication(ViewFileSystem.java:634)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.getStream(WriterImpl.java:2103)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:2120)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.checkMemory(WriterImpl.java:352)
at 
org.apache.hadoop.hive.ql.io.orc.MemoryManager.notifyWriters(MemoryManager.java:168)
at 
org.apache.hadoop.hive.ql.io.orc.MemoryManager.addedRow(MemoryManager.java:157)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:2413)
at 
org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:86)
at 
org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:392)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
... 8 more
Suppressed: org.apache.hadoop.fs.viewfs.NotInMountpointException: 
getDefaultReplication on empty path is invalid
at 
org.apache.hadoop.fs.viewfs.ViewFileSystem.getDefaultReplication(ViewFileSystem.java:634)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.getStream(WriterImpl.java:2103)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.flushStripe(WriterImpl.java:2120)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.close(WriterImpl.java:2425)
at 
org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.close(OrcOutputFormat.java:106)
at 
org.apache.spark.sql.hive.execution.HiveOutputWriter.close(HiveFileFormat.scala:154)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.releaseResources(FileFormatWriter.scala:405)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$1.apply$mcV$sp(FileFormatWriter.scala:275)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1423)
... 9 more{code}
This exception can be reproduced with the following SQL:
{code:java}
spark-sql> CREATE EXTERNAL TABLE test_orc(test_id INT, test_age INT, test_rank 
INT) STORED AS ORC LOCATION 
'viewfs://nsX/user/hive/warehouse/ultraman_tmp.db/test_orc';
spark-sql> CREATE TABLE source(id INT, age INT, rank INT);
spark-sql> INSERT INTO source VALUES(1,1,1);
spark-sql> INSERT OVERWRITE TABLE test_orc SELECT * FROM source;

{code}
This is related to https://issues.apache.org/jira/browse/HIVE-10790, which 
was resolved in Hive 2.0.0, while Spark SQL depends on hive-1.2.1-Spark2.
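The root cause can be modeled simply: a viewfs:// filesystem is a mount table, so defaults like replication depend on which mounted filesystem a path resolves to, and asking for the default with no path has no answer. A toy Python model of that dispatch (class and method names are illustrative, not the Hadoop API):

```python
class NotInMountpointException(Exception):
    pass

class ViewFileSystemModel:
    """Toy mount-table filesystem: prefix -> default replication."""
    def __init__(self, mounts):
        self.mounts = mounts

    def get_default_replication(self, path=None):
        if path is None:
            # With no path there is no way to pick a mount point;
            # this mirrors the "empty path is invalid" error above.
            raise NotInMountpointException(
                "getDefaultReplication on empty path is invalid")
        for prefix, repl in self.mounts.items():
            if path.startswith(prefix):
                return repl
        raise NotInMountpointException(path)

fs = ViewFileSystemModel({"/user/hive/warehouse": 3})
print(fs.get_default_replication("/user/hive/warehouse/test_orc"))  # 3
```

The HIVE-10790 fix amounts to the ORC writer passing the target path so the mount point can be resolved, instead of calling the no-argument form.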






[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716405#comment-16716405
 ] 

ASF GitHub Bot commented on SPARK-26311:


AmplabJenkins commented on issue #23260: [SPARK-26311][YARN] New feature: 
custom log URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446100267
 
 
   Merged build finished. Test PASSed.




> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark currently sets static log URLs for YARN applications, which point to 
> the NodeManager web app. Normally this works for both running and finished 
> apps, but there are other approaches to maintaining application logs, such 
> as an external log service, which avoids the application log URL becoming a 
> dead link when the NodeManager is not accessible (node decommissioned, 
> elastic nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, 
> which end users can set to point application logs to an external log 
> service.






[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716404#comment-16716404
 ] 

ASF GitHub Bot commented on SPARK-26311:


SparkQA commented on issue #23260: [SPARK-26311][YARN] New feature: custom log 
URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446100212
 
 
   **[Test build #99955 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99955/testReport)**
 for PR 23260 at commit 
[`dbeade7`](https://github.com/apache/spark/commit/dbeade7e41f861c9240c70058796293b239db96c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.




> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark currently sets static log URLs for YARN applications, which point to 
> the NodeManager web app. Normally this works for both running and finished 
> apps, but there are other approaches to maintaining application logs, such 
> as an external log service, which avoids the application log URL becoming a 
> dead link when the NodeManager is not accessible (node decommissioned, 
> elastic nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, 
> which end users can set to point application logs to an external log 
> service.






[jira] [Commented] (SPARK-25212) Support Filter in ConvertToLocalRelation

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716393#comment-16716393
 ] 

ASF GitHub Bot commented on SPARK-25212:


AmplabJenkins removed a comment on issue #23273: 
[SPARK-25212][SQL][FOLLOWUP][DOC] Fix comments of ConvertToLocalRelation rule
URL: https://github.com/apache/spark/pull/23273#issuecomment-446097111
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99944/
   Test PASSed.




> Support Filter in ConvertToLocalRelation
> 
>
> Key: SPARK-25212
> URL: https://issues.apache.org/jira/browse/SPARK-25212
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Bogdan Raducanu
>Assignee: Bogdan Raducanu
>Priority: Major
> Fix For: 2.4.0
>
>
> ConvertToLocalRelation can make short queries faster but currently it only 
> supports Project and Limit.
> It can be extended with other operators such as Filter.
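The idea of extending the rule to Filter can be sketched in a few lines: when the child of a Filter is already a local, in-memory relation, the operator can be evaluated eagerly at optimization time instead of at execution time. A toy Python illustration of the concept (this mirrors the rule's intent only; it is not Spark's optimizer code):

```python
# Eagerly evaluate a Filter over a local relation's rows, the way
# ConvertToLocalRelation eagerly evaluates Project and Limit.
def convert_filter_to_local(rows, predicate):
    return [r for r in rows if predicate(r)]

local_relation = [(1, "a"), (2, "b"), (3, "c")]
filtered = convert_filter_to_local(local_relation, lambda r: r[0] > 1)
print(filtered)  # [(2, 'b'), (3, 'c')]
```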






[jira] [Commented] (SPARK-25212) Support Filter in ConvertToLocalRelation

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716387#comment-16716387
 ] 

ASF GitHub Bot commented on SPARK-25212:


SparkQA removed a comment on issue #23273: [SPARK-25212][SQL][FOLLOWUP][DOC] 
Fix comments of ConvertToLocalRelation rule
URL: https://github.com/apache/spark/pull/23273#issuecomment-446057878
 
 
   **[Test build #99944 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99944/testReport)**
 for PR 23273 at commit 
[`dfd0f71`](https://github.com/apache/spark/commit/dfd0f71afb8d95253ea4f64d00cea53c306b6e1c).




> Support Filter in ConvertToLocalRelation
> 
>
> Key: SPARK-25212
> URL: https://issues.apache.org/jira/browse/SPARK-25212
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Bogdan Raducanu
>Assignee: Bogdan Raducanu
>Priority: Major
> Fix For: 2.4.0
>
>
> ConvertToLocalRelation can make short queries faster but currently it only 
> supports Project and Limit.
> It can be extended with other operators such as Filter.






[jira] [Commented] (SPARK-25212) Support Filter in ConvertToLocalRelation

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716390#comment-16716390
 ] 

ASF GitHub Bot commented on SPARK-25212:


AmplabJenkins commented on issue #23273: [SPARK-25212][SQL][FOLLOWUP][DOC] Fix 
comments of ConvertToLocalRelation rule
URL: https://github.com/apache/spark/pull/23273#issuecomment-446097111
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99944/
   Test PASSed.




> Support Filter in ConvertToLocalRelation
> 
>
> Key: SPARK-25212
> URL: https://issues.apache.org/jira/browse/SPARK-25212
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Bogdan Raducanu
>Assignee: Bogdan Raducanu
>Priority: Major
> Fix For: 2.4.0
>
>
> ConvertToLocalRelation can make short queries faster but currently it only 
> supports Project and Limit.
> It can be extended with other operators such as Filter.






[jira] [Commented] (SPARK-25212) Support Filter in ConvertToLocalRelation

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716389#comment-16716389
 ] 

ASF GitHub Bot commented on SPARK-25212:


AmplabJenkins commented on issue #23273: [SPARK-25212][SQL][FOLLOWUP][DOC] Fix 
comments of ConvertToLocalRelation rule
URL: https://github.com/apache/spark/pull/23273#issuecomment-446097107
 
 
   Merged build finished. Test PASSed.




> Support Filter in ConvertToLocalRelation
> 
>
> Key: SPARK-25212
> URL: https://issues.apache.org/jira/browse/SPARK-25212
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Bogdan Raducanu
>Assignee: Bogdan Raducanu
>Priority: Major
> Fix For: 2.4.0
>
>
> ConvertToLocalRelation can make short queries faster but currently it only 
> supports Project and Limit.
> It can be extended with other operators such as Filter.






[jira] [Commented] (SPARK-25212) Support Filter in ConvertToLocalRelation

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716392#comment-16716392
 ] 

ASF GitHub Bot commented on SPARK-25212:


AmplabJenkins removed a comment on issue #23273: 
[SPARK-25212][SQL][FOLLOWUP][DOC] Fix comments of ConvertToLocalRelation rule
URL: https://github.com/apache/spark/pull/23273#issuecomment-446097107
 
 
   Merged build finished. Test PASSed.




> Support Filter in ConvertToLocalRelation
> 
>
> Key: SPARK-25212
> URL: https://issues.apache.org/jira/browse/SPARK-25212
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Bogdan Raducanu
>Assignee: Bogdan Raducanu
>Priority: Major
> Fix For: 2.4.0
>
>
> ConvertToLocalRelation can make short queries faster but currently it only 
> supports Project and Limit.
> It can be extended with other operators such as Filter.






[jira] [Commented] (SPARK-19827) spark.ml R API for PIC

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716385#comment-16716385
 ] 

ASF GitHub Bot commented on SPARK-19827:


felixcheung commented on a change in pull request #23072: 
[SPARK-19827][R]spark.ml R API for PIC
URL: https://github.com/apache/spark/pull/23072#discussion_r240493499
 
 

 ##
 File path: R/pkg/R/mllib_clustering.R
 ##
 @@ -610,3 +616,59 @@ setMethod("write.ml", signature(object = "LDAModel", path 
= "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' PowerIterationClustering
+#'
+#' A scalable graph clustering algorithm. Users can call 
\code{spark.assignClusters} to
+#' return a cluster assignment for each input vertex.
+#'
+#  Run the PIC algorithm and returns a cluster assignment for each input 
vertex.
+#' @param data a SparkDataFrame.
+#' @param k the number of clusters to create.
+#' @param initMode the initialization algorithm.
+#' @param maxIter the maximum number of iterations.
+#' @param sourceCol the name of the input column for source vertex IDs.
+#' @param destinationCol the name of the input column for destination vertex 
IDs
+#' @param weightCol weight column name. If this is not set or \code{NULL},
+#'  we treat all instance weights as 1.0.
+#' @param ... additional argument(s) passed to the method.
+#' @return A dataset that contains columns of vertex id and the corresponding 
cluster for the id.
+#' The schema of it will be:
+#' \code{id: Long}
+#' \code{cluster: Int}
+#' @rdname spark.powerIterationClustering
+#' @aliases 
assignClusters,PowerIterationClustering-method,SparkDataFrame-method
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+#'list(1L, 2L, 1.0), list(3L, 4L, 1.0),
+#'list(4L, 0L, 0.1)),
+#'   schema = c("src", "dst", "weight"))
+#' clusters <- spark.assignClusters(df, initMode="degree", weightCol="weight")
+#' showDF(clusters)
+#' }
+#' @note spark.assignClusters(SparkDataFrame) since 3.0.0
+setMethod("spark.assignClusters",
+  signature(data = "SparkDataFrame"),
+  function(data, k = 2L, initMode = c("random", "degree"), maxIter = 
20L,
+sourceCol = "src", destinationCol = "dst", weightCol = NULL) {
+if (!is.numeric(k) || k < 1) {
+  stop("k should be a number with value >= 1.")
+}
+if (!is.integer(maxIter) || maxIter <= 0) {
 
 Review comment:
   if maxIter should be an integer, should we also check that k is an integer? It's fixed 
when it is passed, so this is just a minor consistency note on the value check.
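For background on what the documented method computes, below is a rough standalone sketch of power iteration clustering: build a symmetric affinity matrix, run power iteration on the row-normalized matrix, then cluster the resulting 1-D embedding. This is a simplified illustration in plain Python/NumPy, not Spark's implementation, and the function name is hypothetical.

```python
import numpy as np

def assign_clusters(edges, k=2, max_iter=20):
    # edges: list of (src, dst, weight); treated as an undirected affinity graph
    nodes = sorted({u for e in edges for u in e[:2]})
    idx = {u: i for i, u in enumerate(nodes)}
    n = len(nodes)
    A = np.zeros((n, n))
    for s, d, w in edges:
        A[idx[s], idx[d]] = w
        A[idx[d], idx[s]] = w
    # Row-normalize: W = D^-1 A (guard isolated vertices against division by zero)
    deg = A.sum(axis=1)
    deg[deg == 0] = 1.0
    W = A / deg[:, None]
    # Power iteration: v converges toward a pseudo-eigenvector whose values
    # separate well-connected groups of vertices
    v = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        v = W @ v
        v /= np.abs(v).sum()
    # Simple 1-D k-means on the embedding v
    centers = np.linspace(v.min(), v.max(), k)
    for _ in range(10):
        labels = np.argmin(np.abs(v[:, None] - centers[None, :]), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = v[labels == c].mean()
    return {node: int(labels[idx[node]]) for node in nodes}
```

The edge list in the roxygen example above (two triangle vertices plus a weakly linked pair) is the kind of input this sketch accepts.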




> spark.ml R API for PIC
> --
>
> Key: SPARK-19827
> URL: https://issues.apache.org/jira/browse/SPARK-19827
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Commented] (SPARK-19827) spark.ml R API for PIC

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716382#comment-16716382
 ] 

ASF GitHub Bot commented on SPARK-19827:


felixcheung commented on a change in pull request #23072: 
[SPARK-19827][R]spark.ml R API for PIC
URL: https://github.com/apache/spark/pull/23072#discussion_r240492789
 
 

 ##
 File path: R/pkg/R/mllib_clustering.R
 ##
 @@ -610,3 +616,59 @@ setMethod("write.ml", signature(object = "LDAModel", path 
= "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' PowerIterationClustering
+#'
+#' A scalable graph clustering algorithm. Users can call 
\code{spark.assignClusters} to
+#' return a cluster assignment for each input vertex.
+#'
+#  Run the PIC algorithm and returns a cluster assignment for each input 
vertex.
+#' @param data a SparkDataFrame.
+#' @param k the number of clusters to create.
+#' @param initMode the initialization algorithm.
+#' @param maxIter the maximum number of iterations.
+#' @param sourceCol the name of the input column for source vertex IDs.
+#' @param destinationCol the name of the input column for destination vertex 
IDs
+#' @param weightCol weight column name. If this is not set or \code{NULL},
+#'  we treat all instance weights as 1.0.
+#' @param ... additional argument(s) passed to the method.
+#' @return A dataset that contains columns of vertex id and the corresponding 
cluster for the id.
+#' The schema of it will be:
+#' \code{id: Long}
+#' \code{cluster: Int}
+#' @rdname spark.powerIterationClustering
+#' @aliases 
assignClusters,PowerIterationClustering-method,SparkDataFrame-method
 
 Review comment:
   wait, this alias doesn't make sense. Could you test `?assignClusters` 
in an R shell to check whether it works?
   
   this should be `@aliases spark.assignClusters,SparkDataFrame-method`




> spark.ml R API for PIC
> --
>
> Key: SPARK-19827
> URL: https://issues.apache.org/jira/browse/SPARK-19827
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Commented] (SPARK-19827) spark.ml R API for PIC

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716381#comment-16716381
 ] 

ASF GitHub Bot commented on SPARK-19827:


felixcheung commented on a change in pull request #23072: 
[SPARK-19827][R]spark.ml R API for PIC
URL: https://github.com/apache/spark/pull/23072#discussion_r240491948
 
 

 ##
 File path: R/pkg/R/mllib_clustering.R
 ##
 @@ -610,3 +616,59 @@ setMethod("write.ml", signature(object = "LDAModel", path 
= "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' PowerIterationClustering
+#'
+#' A scalable graph clustering algorithm. Users can call 
\code{spark.assignClusters} to
+#' return a cluster assignment for each input vertex.
+#'
 
 Review comment:
   remove the empty line - empty lines are significant in roxygen2




> spark.ml R API for PIC
> --
>
> Key: SPARK-19827
> URL: https://issues.apache.org/jira/browse/SPARK-19827
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Commented] (SPARK-19827) spark.ml R API for PIC

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716386#comment-16716386
 ] 

ASF GitHub Bot commented on SPARK-19827:


felixcheung commented on a change in pull request #23072: 
[SPARK-19827][R]spark.ml R API for PIC
URL: https://github.com/apache/spark/pull/23072#discussion_r240492482
 
 

 ##
 File path: R/pkg/R/mllib_clustering.R
 ##
 @@ -610,3 +616,59 @@ setMethod("write.ml", signature(object = "LDAModel", path 
= "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' PowerIterationClustering
+#'
+#' A scalable graph clustering algorithm. Users can call 
\code{spark.assignClusters} to
+#' return a cluster assignment for each input vertex.
+#'
+#  Run the PIC algorithm and returns a cluster assignment for each input 
vertex.
+#' @param data a SparkDataFrame.
+#' @param k the number of clusters to create.
+#' @param initMode the initialization algorithm.
+#' @param maxIter the maximum number of iterations.
+#' @param sourceCol the name of the input column for source vertex IDs.
+#' @param destinationCol the name of the input column for destination vertex 
IDs
+#' @param weightCol weight column name. If this is not set or \code{NULL},
+#'  we treat all instance weights as 1.0.
+#' @param ... additional argument(s) passed to the method.
+#' @return A dataset that contains columns of vertex id and the corresponding 
cluster for the id.
+#' The schema of it will be:
+#' \code{id: Long}
+#' \code{cluster: Int}
 
 Review comment:
   mm, this won't format correctly - roxygen strips all the whitespace.
   Also, Long and Int are not proper types in R.




> spark.ml R API for PIC
> --
>
> Key: SPARK-19827
> URL: https://issues.apache.org/jira/browse/SPARK-19827
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Commented] (SPARK-25212) Support Filter in ConvertToLocalRelation

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716380#comment-16716380
 ] 

ASF GitHub Bot commented on SPARK-25212:


SparkQA commented on issue #23273: [SPARK-25212][SQL][FOLLOWUP][DOC] Fix 
comments of ConvertToLocalRelation rule
URL: https://github.com/apache/spark/pull/23273#issuecomment-446096788
 
 
   **[Test build #99944 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99944/testReport)**
 for PR 23273 at commit 
[`dfd0f71`](https://github.com/apache/spark/commit/dfd0f71afb8d95253ea4f64d00cea53c306b6e1c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.




> Support Filter in ConvertToLocalRelation
> 
>
> Key: SPARK-25212
> URL: https://issues.apache.org/jira/browse/SPARK-25212
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Bogdan Raducanu
>Assignee: Bogdan Raducanu
>Priority: Major
> Fix For: 2.4.0
>
>
> ConvertToLocalRelation can make short queries faster but currently it only 
> supports Project and Limit.
> It can be extended with other operators such as Filter.






[jira] [Commented] (SPARK-19827) spark.ml R API for PIC

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716383#comment-16716383
 ] 

ASF GitHub Bot commented on SPARK-19827:


felixcheung commented on a change in pull request #23072: 
[SPARK-19827][R]spark.ml R API for PIC
URL: https://github.com/apache/spark/pull/23072#discussion_r240492041
 
 

 ##
 File path: R/pkg/R/mllib_clustering.R
 ##
 @@ -610,3 +616,59 @@ setMethod("write.ml", signature(object = "LDAModel", path 
= "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' PowerIterationClustering
+#'
+#' A scalable graph clustering algorithm. Users can call 
\code{spark.assignClusters} to
+#' return a cluster assignment for each input vertex.
+#'
+#  Run the PIC algorithm and returns a cluster assignment for each input 
vertex.
+#' @param data a SparkDataFrame.
+#' @param k the number of clusters to create.
+#' @param initMode the initialization algorithm.
 
 Review comment:
   add `One of "random", "degree"`?




> spark.ml R API for PIC
> --
>
> Key: SPARK-19827
> URL: https://issues.apache.org/jira/browse/SPARK-19827
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Commented] (SPARK-19827) spark.ml R API for PIC

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716384#comment-16716384
 ] 

ASF GitHub Bot commented on SPARK-19827:


felixcheung commented on a change in pull request #23072: 
[SPARK-19827][R]spark.ml R API for PIC
URL: https://github.com/apache/spark/pull/23072#discussion_r240492887
 
 

 ##
 File path: R/pkg/R/mllib_clustering.R
 ##
 @@ -610,3 +616,59 @@ setMethod("write.ml", signature(object = "LDAModel", path 
= "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' PowerIterationClustering
+#'
+#' A scalable graph clustering algorithm. Users can call 
\code{spark.assignClusters} to
+#' return a cluster assignment for each input vertex.
+#'
+#  Run the PIC algorithm and returns a cluster assignment for each input 
vertex.
+#' @param data a SparkDataFrame.
+#' @param k the number of clusters to create.
+#' @param initMode the initialization algorithm.
+#' @param maxIter the maximum number of iterations.
+#' @param sourceCol the name of the input column for source vertex IDs.
+#' @param destinationCol the name of the input column for destination vertex 
IDs
+#' @param weightCol weight column name. If this is not set or \code{NULL},
+#'  we treat all instance weights as 1.0.
+#' @param ... additional argument(s) passed to the method.
+#' @return A dataset that contains columns of vertex id and the corresponding 
cluster for the id.
+#' The schema of it will be:
+#' \code{id: Long}
+#' \code{cluster: Int}
+#' @rdname spark.powerIterationClustering
+#' @aliases 
assignClusters,PowerIterationClustering-method,SparkDataFrame-method
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+#'list(1L, 2L, 1.0), list(3L, 4L, 1.0),
+#'list(4L, 0L, 0.1)),
+#'   schema = c("src", "dst", "weight"))
+#' clusters <- spark.assignClusters(df, initMode="degree", weightCol="weight")
 
 Review comment:
   space around `=` as style




> spark.ml R API for PIC
> --
>
> Key: SPARK-19827
> URL: https://issues.apache.org/jira/browse/SPARK-19827
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Commented] (SPARK-26311) [YARN] New feature: custom log URL for stdout/stderr

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716379#comment-16716379
 ] 

ASF GitHub Bot commented on SPARK-26311:


SparkQA commented on issue #23260: [SPARK-26311][YARN] New feature: custom log 
URL for stdout/stderr
URL: https://github.com/apache/spark/pull/23260#issuecomment-446096365
 
 
   **[Test build #99955 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99955/testReport)**
 for PR 23260 at commit 
[`dbeade7`](https://github.com/apache/spark/commit/dbeade7e41f861c9240c70058796293b239db96c).




> [YARN] New feature: custom log URL for stdout/stderr
> 
>
> Key: SPARK-26311
> URL: https://issues.apache.org/jira/browse/SPARK-26311
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Spark has been setting static log URLs for YARN applications, which point to 
> the NodeManager webapp. Normally this works for both running and finished 
> apps, but there are also other approaches to maintaining application logs, 
> such as an external log service, which keeps the application log URL from 
> becoming a dead link when the NodeManager is not accessible (node 
> decommissioned, elastic nodes, etc.).
> Spark can provide a new configuration for a custom log URL in YARN mode, 
> which end users can set to point application logs to an external log 
> service.






[jira] [Commented] (SPARK-26318) Enhance function merge performance in Row

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716363#comment-16716363
 ] 

ASF GitHub Bot commented on SPARK-26318:


KyleLi1985 commented on a change in pull request #23271: [SPARK-26318][SQL] 
Enhance function merge performance in Row
URL: https://github.com/apache/spark/pull/23271#discussion_r240491652
 
 

 ##
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala
 ##
 @@ -58,8 +58,21 @@ object Row {
    * Merge multiple rows into a single row, one after another.
    */
   def merge(rows: Row*): Row = {
-    // TODO: Improve the performance of this if used in performance critical part.
-    new GenericRow(rows.flatMap(_.toSeq).toArray)
+    val size = rows.size
+    var number = 0
+    for (i <- 0 until size) {
+      number = number + rows(i).size
+    }
+    val container = Array.ofDim[Any](number)
+    var n = 0
+    for (i <- 0 until size) {
+      val subSize = rows(i).size
+      for (j <- 0 until subSize) {
+        container(n) = rows(i)(j)
+        n = n + 1
+      }
+    }
+    new GenericRow(container)
 
 Review comment:
   Definitely, it is important.
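For intuition, the two strategies compared in this review (the existing flatMap-style flatten versus the patch's pre-sized array copy) can be mimicked in plain Python. This is an illustrative analogue only, not Spark's Row implementation.

```python
def merge_flatmap(rows):
    # Baseline: flatten all rows into one new sequence, building intermediate
    # collections along the way (analogue of rows.flatMap(_.toSeq).toArray)
    return [value for row in rows for value in row]

def merge_preallocated(rows):
    # Patch's approach: compute the total size once, allocate the output
    # container up front, then copy values in by index
    total = sum(len(row) for row in rows)
    out = [None] * total
    n = 0
    for row in rows:
        for value in row:
            out[n] = value
            n += 1
    return out
```

Both produce the same merged sequence; the second avoids intermediate allocations, which is where the reported speedup in the Scala patch comes from.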




> Enhance function merge performance in Row
> -
>
> Key: SPARK-26318
> URL: https://issues.apache.org/jira/browse/SPARK-26318
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang Li
>Priority: Minor
>
> Enhance function merge performance in Row
> For example, doing 1 Row.merge call for input 
> val row1 = Row("name", "work", 2314, "null", 1, "") takes 108458 
> milliseconds.
> After the enhancement, it takes only 24967 milliseconds.






[jira] [Commented] (SPARK-26098) Show associated SQL query in Job page

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716369#comment-16716369
 ] 

ASF GitHub Bot commented on SPARK-26098:


AmplabJenkins removed a comment on issue #23068: [SPARK-26098][WebUI] Show 
associated SQL query in Job page
URL: https://github.com/apache/spark/pull/23068#issuecomment-446094075
 
 
   Merged build finished. Test PASSed.




> Show associated SQL query in Job page
> -
>
> Key: SPARK-26098
> URL: https://issues.apache.org/jira/browse/SPARK-26098
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> For jobs associated with SQL queries, it would be easier to understand the 
> context by showing the SQL query in the Job detail page.






[jira] [Commented] (SPARK-26098) Show associated SQL query in Job page

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716370#comment-16716370
 ] 

ASF GitHub Bot commented on SPARK-26098:


AmplabJenkins removed a comment on issue #23068: [SPARK-26098][WebUI] Show 
associated SQL query in Job page
URL: https://github.com/apache/spark/pull/23068#issuecomment-446094078
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5958/
   Test PASSed.




> Show associated SQL query in Job page
> -
>
> Key: SPARK-26098
> URL: https://issues.apache.org/jira/browse/SPARK-26098
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> For jobs associated with SQL queries, it would be easier to understand the 
> context by showing the SQL query in the Job detail page.






[jira] [Commented] (SPARK-26098) Show associated SQL query in Job page

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716368#comment-16716368
 ] 

ASF GitHub Bot commented on SPARK-26098:


SparkQA commented on issue #23068: [SPARK-26098][WebUI] Show associated SQL 
query in Job page
URL: https://github.com/apache/spark/pull/23068#issuecomment-446094215
 
 
   **[Test build #99954 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99954/testReport)**
 for PR 23068 at commit 
[`0a63604`](https://github.com/apache/spark/commit/0a636049ecc721cdd31cd676fce79aeb6582dd7c).




> Show associated SQL query in Job page
> -
>
> Key: SPARK-26098
> URL: https://issues.apache.org/jira/browse/SPARK-26098
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> For jobs associated with SQL queries, it would be easier to understand the 
> context by showing the SQL query in the Job detail page.






[jira] [Commented] (SPARK-26098) Show associated SQL query in Job page

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716366#comment-16716366
 ] 

ASF GitHub Bot commented on SPARK-26098:


AmplabJenkins commented on issue #23068: [SPARK-26098][WebUI] Show associated 
SQL query in Job page
URL: https://github.com/apache/spark/pull/23068#issuecomment-446094078
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5958/
   Test PASSed.




> Show associated SQL query in Job page
> -
>
> Key: SPARK-26098
> URL: https://issues.apache.org/jira/browse/SPARK-26098
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> For jobs associated with SQL queries, it would be easier to understand the 
> context by showing the SQL query in the Job detail page.






[jira] [Commented] (SPARK-26098) Show associated SQL query in Job page

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716365#comment-16716365
 ] 

ASF GitHub Bot commented on SPARK-26098:


AmplabJenkins commented on issue #23068: [SPARK-26098][WebUI] Show associated 
SQL query in Job page
URL: https://github.com/apache/spark/pull/23068#issuecomment-446094075
 
 
   Merged build finished. Test PASSed.




> Show associated SQL query in Job page
> -
>
> Key: SPARK-26098
> URL: https://issues.apache.org/jira/browse/SPARK-26098
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> For jobs associated with SQL queries, it would be easier to understand the 
> context by showing the SQL query in the Job detail page.






[jira] [Commented] (SPARK-26318) Enhance function merge performance in Row

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716364#comment-16716364
 ] 

ASF GitHub Bot commented on SPARK-26318:


KyleLi1985 commented on a change in pull request #23271: [SPARK-26318][SQL] 
Enhance function merge performance in Row
URL: https://github.com/apache/spark/pull/23271#discussion_r240491672
 
 

 ##
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala
 ##
 @@ -58,8 +58,21 @@ object Row {
    * Merge multiple rows into a single row, one after another.
    */
   def merge(rows: Row*): Row = {
-    // TODO: Improve the performance of this if used in performance critical part.
-    new GenericRow(rows.flatMap(_.toSeq).toArray)
+    val size = rows.size
+    var number = 0
+    for (i <- 0 until size) {
+      number = number + rows(i).size
+    }
+    val container = Array.ofDim[Any](number)
+    var n = 0
+    for (i <- 0 until size) {
 
 Review comment:
   Only primitively use size, subSize, and number information and control the 
container will improve the performance more.
   up to 
   call 1 time Row.merge(row1) need 18064 millisecond
   call 1 time Row.merge(rows:_*) need 25651 millisecond




> Enhance function merge performance in Row
> -
>
> Key: SPARK-26318
> URL: https://issues.apache.org/jira/browse/SPARK-26318
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang Li
>Priority: Minor
>
> Enhance function merge performance in Row
> For example, doing 1 Row.merge call for input 
> val row1 = Row("name", "work", 2314, "null", 1, "") takes 108458 
> milliseconds.
> After the enhancement, it takes only 24967 milliseconds.






[jira] [Commented] (SPARK-26303) Return partial results for bad JSON records

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716355#comment-16716355
 ] 

ASF GitHub Bot commented on SPARK-26303:


HyukjinKwon commented on a change in pull request #23253: [SPARK-26303][SQL] 
Return partial results for bad JSON records
URL: https://github.com/apache/spark/pull/23253#discussion_r240489089
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala
 ##
 @@ -20,6 +20,16 @@ package org.apache.spark.sql.catalyst.util
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.unsafe.types.UTF8String
 
 +/**
 + * Exception thrown when the underlying parser returns a partial result of parsing.
 + * @param partialResult the partial result of parsing a bad record.
 + * @param cause the actual exception about why the parser cannot return full result.
 + */
 +case class PartialResultException(
 
 Review comment:
   I mean, we don't have to standardise the name but let's use another name 
that doesn't conflict with Java's libraries.
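For context, the JDK itself already ships a class with this simple name: `javax.naming.PartialResultException` (present since Java 1.3), which is presumably the conflict being pointed at. A quick check:

```java
public class NameClashCheck {
    public static void main(String[] args) throws ClassNotFoundException {
        // The JDK's own PartialResultException lives in javax.naming, so a
        // Spark class with the same simple name invites import confusion.
        Class<?> clash = Class.forName("javax.naming.PartialResultException");
        System.out.println(clash.getName());   // javax.naming.PartialResultException
    }
}
```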




> Return partial results for bad JSON records
> ---
>
> Key: SPARK-26303
> URL: https://issues.apache.org/jira/browse/SPARK-26303
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, the JSON datasource and JSON functions return a row with all fields 
> null for a malformed JSON string in PERMISSIVE mode when the specified schema 
> has a struct type. All nulls are returned even if some of the fields were 
> parsed and converted to the desired types successfully. This ticket aims to 
> solve the problem by returning the already-parsed fields. The corrupt column, 
> specified via the JSON option `columnNameOfCorruptRecord` or the SQL config, 
> should contain the whole original JSON string. 
> For example, if the input has one JSON string:
> {code:json}
> {"a":0.1,"b":{},"c":"def"}
> {code}
> and specified schema is:
> {code:sql}
> a DOUBLE, b ARRAY, c STRING, _corrupt_record STRING
> {code}
> expected output of `from_json` in the PERMISSIVE mode:
> {code}
> +---+----+---+--------------------------+
> |a  |b   |c  |_corrupt_record           |
> +---+----+---+--------------------------+
> |0.1|null|def|{"a":0.1,"b":{},"c":"def"}|
> +---+----+---+--------------------------+
> {code}
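The intended behaviour can be sketched, far outside Spark's real JSON parser, as a toy per-field converter: keep successfully converted fields, null the failures, and copy the raw record into the corrupt column only when something failed. All names here (`PartialResultSketch`, `parseRow`) are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class PartialResultSketch {
    // Toy model of PERMISSIVE-mode partial results: convert each field
    // independently; failures become nulls rather than discarding the row,
    // and the raw record is kept in the corrupt-record column on any failure.
    static Map<String, Object> parseRow(Map<String, String> rawFields,
                                        Map<String, Function<String, Object>> converters,
                                        String rawRecord) {
        Map<String, Object> row = new LinkedHashMap<>();
        boolean anyFailure = false;
        for (Map.Entry<String, Function<String, Object>> e : converters.entrySet()) {
            try {
                row.put(e.getKey(), e.getValue().apply(rawFields.get(e.getKey())));
            } catch (RuntimeException ex) {
                row.put(e.getKey(), null);      // keep the row, null the bad field
                anyFailure = true;
            }
        }
        row.put("_corrupt_record", anyFailure ? rawRecord : null);
        return row;
    }

    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("a", "0.1");
        raw.put("b", "{}");                     // does not convert to an array
        raw.put("c", "def");
        Map<String, Function<String, Object>> conv = new LinkedHashMap<>();
        conv.put("a", Double::parseDouble);
        conv.put("b", s -> { throw new NumberFormatException("not an array"); });
        conv.put("c", s -> s);
        System.out.println(parseRow(raw, conv, "{\"a\":0.1,\"b\":{},\"c\":\"def\"}"));
    }
}
```

Running `main` reproduces the shape of the expected table: `a` and `c` survive, `b` is null, and `_corrupt_record` holds the original string.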






[jira] [Commented] (SPARK-26303) Return partial results for bad JSON records

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716353#comment-16716353
 ] 

ASF GitHub Bot commented on SPARK-26303:


HyukjinKwon commented on a change in pull request #23253: [SPARK-26303][SQL] 
Return partial results for bad JSON records
URL: https://github.com/apache/spark/pull/23253#discussion_r240488920
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/BadRecordException.scala
 ##
 @@ -20,6 +20,16 @@ package org.apache.spark.sql.catalyst.util
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.unsafe.types.UTF8String
 
 +/**
 + * Exception thrown when the underlying parser returns a partial result of parsing.
 + * @param partialResult the partial result of parsing a bad record.
 + * @param cause the actual exception about why the parser cannot return full result.
 + */
 +case class PartialResultException(
 
 Review comment:
   Wait... but let's just rename it if possible. The cost of renaming is zero, 
and there are some benefits to doing so.




> Return partial results for bad JSON records
> ---
>
> Key: SPARK-26303
> URL: https://issues.apache.org/jira/browse/SPARK-26303
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, the JSON datasource and JSON functions return a row with all fields 
> null for a malformed JSON string in PERMISSIVE mode when the specified schema 
> has a struct type. All nulls are returned even if some of the fields were 
> parsed and converted to the desired types successfully. This ticket aims to 
> solve the problem by returning the already-parsed fields. The corrupt column, 
> specified via the JSON option `columnNameOfCorruptRecord` or the SQL config, 
> should contain the whole original JSON string. 
> For example, if the input has one JSON string:
> {code:json}
> {"a":0.1,"b":{},"c":"def"}
> {code}
> and specified schema is:
> {code:sql}
> a DOUBLE, b ARRAY, c STRING, _corrupt_record STRING
> {code}
> expected output of `from_json` in the PERMISSIVE mode:
> {code}
> +---+----+---+--------------------------+
> |a  |b   |c  |_corrupt_record           |
> +---+----+---+--------------------------+
> |0.1|null|def|{"a":0.1,"b":{},"c":"def"}|
> +---+----+---+--------------------------+
> {code}






[jira] [Commented] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partial revert SPARK-21052:Add hash map metrics to join,

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716345#comment-16716345
 ] 

ASF GitHub Bot commented on SPARK-26316:


SparkQA removed a comment on issue #23269: [SPARK-26316] Revert hash join 
metrics in spark 21052 that causes performance degradation 
URL: https://github.com/apache/spark/pull/23269#issuecomment-446057021
 
 
   **[Test build #99943 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99943/testReport)**
 for PR 23269 at commit 
[`8de1bcc`](https://github.com/apache/spark/commit/8de1bcca55a8b0b1448841871c47abee8101d917).




> Because of the perf degradation in TPC-DS, we currently partial revert 
> SPARK-21052:Add hash map metrics to join,
> 
>
> Key: SPARK-26316
> URL: https://issues.apache.org/jira/browse/SPARK-26316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> The code at 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>  introduced in SPARK-21052 causes a performance degradation in Spark 2.3. The 
> results of all TPC-DS queries at 1TB scale are in [TPC-DS 
> result|https://docs.google.com/spreadsheets/d/18a5BdOlmm8euTaRodyeWum9yu92mbWWu6JbhGXtr7yE/edit#gid=0]






[jira] [Commented] (SPARK-14948) Exception when joining DataFrames derived form the same DataFrame

2018-12-10 Thread Michael Chirico (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716350#comment-16716350
 ] 

Michael Chirico commented on SPARK-14948:
-

This issue comes up a _lot_ in non-trivial ETLs.

I have one script right now where the same problem comes up three separate 
times!

The workaround is quite cumbersome and unintuitive, and it makes the scripts 
substantially harder to read...

 

> Exception when joining DataFrames derived form the same DataFrame
> -
>
> Key: SPARK-14948
> URL: https://issues.apache.org/jira/browse/SPARK-14948
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Saurabh Santhosh
>Priority: Major
>
> h2. Spark Analyser is throwing the following exception in a specific scenario 
> :
> h2. Exception :
> org.apache.spark.sql.AnalysisException: resolved attribute(s) F1#3 missing 
> from asd#5,F2#4,F1#6,F2#7 in operator !Project [asd#5,F1#3];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> h2. Code :
> {code:title=SparkClient.java|borderStyle=solid}
> StructField[] fields = new StructField[2];
> fields[0] = new StructField("F1", DataTypes.StringType, true, 
> Metadata.empty());
> fields[1] = new StructField("F2", DataTypes.StringType, true, 
> Metadata.empty());
> JavaRDD rdd = sparkClient.getJavaSparkContext()
>     .parallelize(Arrays.asList(RowFactory.create("a", "b")));
> DataFrame df = sparkClient.getSparkHiveContext().createDataFrame(rdd, new 
> StructType(fields));
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t1");
> DataFrame aliasedDf = sparkClient.getSparkHiveContext().sql("select F1 as 
> asd, F2 from t1");
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(aliasedDf, 
> "t2");
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t3");
> 
> DataFrame join = aliasedDf.join(df, 
> aliasedDf.col("F2").equalTo(df.col("F2")), "inner");
> DataFrame select = join.select(aliasedDf.col("asd"), df.col("F1"));
> select.collect();
> {code}
> h2. Observations :
> * This issue is related to the data type of the fields of the initial 
> DataFrame. (If the data type is not String, it works.)
> * It works fine if the data frames are registered as temporary tables and a 
> SQL query (select a.asd, b.F1 from t2 a inner join t3 b on a.F2=b.F2) is used.






[jira] [Commented] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partial revert SPARK-21052:Add hash map metrics to join,

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716349#comment-16716349
 ] 

ASF GitHub Bot commented on SPARK-26316:


AmplabJenkins removed a comment on issue #23269: [SPARK-26316] Revert hash join 
metrics in spark 21052 that causes performance degradation 
URL: https://github.com/apache/spark/pull/23269#issuecomment-446090350
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99943/
   Test PASSed.




> Because of the perf degradation in TPC-DS, we currently partial revert 
> SPARK-21052:Add hash map metrics to join,
> 
>
> Key: SPARK-26316
> URL: https://issues.apache.org/jira/browse/SPARK-26316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> The code at 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>  introduced in SPARK-21052 causes a performance degradation in Spark 2.3. The 
> results of all TPC-DS queries at 1TB scale are in [TPC-DS 
> result|https://docs.google.com/spreadsheets/d/18a5BdOlmm8euTaRodyeWum9yu92mbWWu6JbhGXtr7yE/edit#gid=0]






[jira] [Commented] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partial revert SPARK-21052:Add hash map metrics to join,

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716348#comment-16716348
 ] 

ASF GitHub Bot commented on SPARK-26316:


AmplabJenkins removed a comment on issue #23269: [SPARK-26316] Revert hash join 
metrics in spark 21052 that causes performance degradation 
URL: https://github.com/apache/spark/pull/23269#issuecomment-446090346
 
 
   Merged build finished. Test PASSed.




> Because of the perf degradation in TPC-DS, we currently partial revert 
> SPARK-21052:Add hash map metrics to join,
> 
>
> Key: SPARK-26316
> URL: https://issues.apache.org/jira/browse/SPARK-26316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> The code at 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>  introduced in SPARK-21052 causes a performance degradation in Spark 2.3. The 
> results of all TPC-DS queries at 1TB scale are in [TPC-DS 
> result|https://docs.google.com/spreadsheets/d/18a5BdOlmm8euTaRodyeWum9yu92mbWWu6JbhGXtr7yE/edit#gid=0]






[jira] [Commented] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partial revert SPARK-21052:Add hash map metrics to join,

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716347#comment-16716347
 ] 

ASF GitHub Bot commented on SPARK-26316:


AmplabJenkins commented on issue #23269: [SPARK-26316] Revert hash join metrics 
in spark 21052 that causes performance degradation 
URL: https://github.com/apache/spark/pull/23269#issuecomment-446090350
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99943/
   Test PASSed.




> Because of the perf degradation in TPC-DS, we currently partial revert 
> SPARK-21052:Add hash map metrics to join,
> 
>
> Key: SPARK-26316
> URL: https://issues.apache.org/jira/browse/SPARK-26316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> The code at 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>  introduced in SPARK-21052 causes a performance degradation in Spark 2.3. The 
> results of all TPC-DS queries at 1TB scale are in [TPC-DS 
> result|https://docs.google.com/spreadsheets/d/18a5BdOlmm8euTaRodyeWum9yu92mbWWu6JbhGXtr7yE/edit#gid=0]






[jira] [Commented] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partial revert SPARK-21052:Add hash map metrics to join,

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716346#comment-16716346
 ] 

ASF GitHub Bot commented on SPARK-26316:


AmplabJenkins commented on issue #23269: [SPARK-26316] Revert hash join metrics 
in spark 21052 that causes performance degradation 
URL: https://github.com/apache/spark/pull/23269#issuecomment-446090346
 
 
   Merged build finished. Test PASSed.




> Because of the perf degradation in TPC-DS, we currently partial revert 
> SPARK-21052:Add hash map metrics to join,
> 
>
> Key: SPARK-26316
> URL: https://issues.apache.org/jira/browse/SPARK-26316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> The code at 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>  introduced in SPARK-21052 causes a performance degradation in Spark 2.3. The 
> results of all TPC-DS queries at 1TB scale are in [TPC-DS 
> result|https://docs.google.com/spreadsheets/d/18a5BdOlmm8euTaRodyeWum9yu92mbWWu6JbhGXtr7yE/edit#gid=0]






[jira] [Commented] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partial revert SPARK-21052:Add hash map metrics to join,

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716344#comment-16716344
 ] 

ASF GitHub Bot commented on SPARK-26316:


SparkQA commented on issue #23269: [SPARK-26316] Revert hash join metrics in 
spark 21052 that causes performance degradation 
URL: https://github.com/apache/spark/pull/23269#issuecomment-44608
 
 
   **[Test build #99943 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99943/testReport)**
 for PR 23269 at commit 
[`8de1bcc`](https://github.com/apache/spark/commit/8de1bcca55a8b0b1448841871c47abee8101d917).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.




> Because of the perf degradation in TPC-DS, we currently partial revert 
> SPARK-21052:Add hash map metrics to join,
> 
>
> Key: SPARK-26316
> URL: https://issues.apache.org/jira/browse/SPARK-26316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> The code at 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>  introduced in SPARK-21052 causes a performance degradation in Spark 2.3. The 
> results of all TPC-DS queries at 1TB scale are in [TPC-DS 
> result|https://docs.google.com/spreadsheets/d/18a5BdOlmm8euTaRodyeWum9yu92mbWWu6JbhGXtr7yE/edit#gid=0]






[jira] [Commented] (SPARK-26262) Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716343#comment-16716343
 ] 

ASF GitHub Bot commented on SPARK-26262:


cloud-fan commented on issue #23213: [SPARK-26262][SQL] Runs SQLQueryTestSuite 
on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE
URL: https://github.com/apache/spark/pull/23213#issuecomment-446089670
 
 
   when whole-stage codegen is on, there is no way to avoid codegen, so 
codegenFactoryMode doesn't make a difference.




> Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and 
> CODEGEN_FACTORY_MODE
> 
>
> Key: SPARK-26262
> URL: https://issues.apache.org/jira/browse/SPARK-26262
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> For better test coverage, we need to run `SQLQueryTestSuite` on 4 mixed 
> config sets:
> 1. WHOLESTAGE_CODEGEN_ENABLED=true, CODEGEN_FACTORY_MODE=CODEGEN_ONLY
> 2. WHOLESTAGE_CODEGEN_ENABLED=false, CODEGEN_FACTORY_MODE=CODEGEN_ONLY
> 3. WHOLESTAGE_CODEGEN_ENABLED=true, CODEGEN_FACTORY_MODE=NO_CODEGEN
> 4. WHOLESTAGE_CODEGEN_ENABLED=false, CODEGEN_FACTORY_MODE=NO_CODEGEN
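The four sets are just the cross product of the two flags; a throwaway sketch (class and method names are illustrative) that enumerates them in the order listed:

```java
import java.util.ArrayList;
import java.util.List;

public class CodegenConfigMatrix {
    // Enumerate the cross product of the two flags in the order given above.
    // Note: when whole-stage codegen is enabled, codegen cannot be avoided,
    // so the factory mode makes no observable difference in those two sets.
    static List<String> matrix() {
        List<String> sets = new ArrayList<>();
        for (String factoryMode : new String[]{"CODEGEN_ONLY", "NO_CODEGEN"}) {
            for (boolean wholeStage : new boolean[]{true, false}) {
                sets.add("WHOLESTAGE_CODEGEN_ENABLED=" + wholeStage
                        + ", CODEGEN_FACTORY_MODE=" + factoryMode);
            }
        }
        return sets;
    }

    public static void main(String[] args) {
        matrix().forEach(System.out::println);
    }
}
```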






[jira] [Commented] (SPARK-24102) RegressionEvaluator should use sample weight data

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716247#comment-16716247
 ] 

ASF GitHub Bot commented on SPARK-24102:


SparkQA commented on issue #17085: [SPARK-24102][ML][MLLIB] ML Evaluators 
should use weight column - added weight column for regression evaluator
URL: https://github.com/apache/spark/pull/17085#issuecomment-446079811
 
 
   **[Test build #99948 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99948/testReport)**
 for PR 17085 at commit 
[`0480721`](https://github.com/apache/spark/commit/04807214d8694dcff7a2fe042457934e67eb8d57).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.




> RegressionEvaluator should use sample weight data
> -
>
> Key: SPARK-24102
> URL: https://issues.apache.org/jira/browse/SPARK-24102
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Ilya Matiach
>Priority: Major
>  Labels: starter
>
> The LogisticRegression and LinearRegression models support training with a 
> weight column, but the corresponding evaluators do not support computing 
> metrics using those weights. This breaks model selection using CrossValidator.






[jira] [Resolved] (SPARK-26293) Cast exception when having python udf in subquery

2018-12-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26293.
-
   Resolution: Fixed
Fix Version/s: 2.4.1
   3.0.0

Issue resolved by pull request 23248
[https://github.com/apache/spark/pull/23248]

> Cast exception when having python udf in subquery
> -
>
> Key: SPARK-26293
> URL: https://issues.apache.org/jira/browse/SPARK-26293
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0, 2.4.1
>
>







[jira] [Commented] (SPARK-26262) Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716337#comment-16716337
 ] 

ASF GitHub Bot commented on SPARK-26262:


HyukjinKwon commented on issue #23213: [SPARK-26262][SQL] Runs 
SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and 
CODEGEN_FACTORY_MODE
URL: https://github.com/apache/spark/pull/23213#issuecomment-446088412
 
 
   Ah, I had the same question as 
https://github.com/apache/spark/pull/23213#issuecomment-444824164.




> Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and 
> CODEGEN_FACTORY_MODE
> 
>
> Key: SPARK-26262
> URL: https://issues.apache.org/jira/browse/SPARK-26262
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> For better test coverage, we need to run `SQLQueryTestSuite` on 4 mixed 
> config sets:
> 1. WHOLESTAGE_CODEGEN_ENABLED=true, CODEGEN_FACTORY_MODE=CODEGEN_ONLY
> 2. WHOLESTAGE_CODEGEN_ENABLED=false, CODEGEN_FACTORY_MODE=CODEGEN_ONLY
> 3. WHOLESTAGE_CODEGEN_ENABLED=true, CODEGEN_FACTORY_MODE=NO_CODEGEN
> 4. WHOLESTAGE_CODEGEN_ENABLED=false, CODEGEN_FACTORY_MODE=NO_CODEGEN






[jira] [Commented] (SPARK-26262) Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716338#comment-16716338
 ] 

ASF GitHub Bot commented on SPARK-26262:


HyukjinKwon edited a comment on issue #23213: [SPARK-26262][SQL] Runs 
SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and 
CODEGEN_FACTORY_MODE
URL: https://github.com/apache/spark/pull/23213#issuecomment-446088412
 
 
   Ah, I had the same question as 
https://github.com/apache/spark/pull/23213#issuecomment-444824164. It would be 
good to update the PR description :-).




> Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and 
> CODEGEN_FACTORY_MODE
> 
>
> Key: SPARK-26262
> URL: https://issues.apache.org/jira/browse/SPARK-26262
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> For better test coverage, we need to run `SQLQueryTestSuite` on 4 mixed 
> config sets:
> 1. WHOLESTAGE_CODEGEN_ENABLED=true, CODEGEN_FACTORY_MODE=CODEGEN_ONLY
> 2. WHOLESTAGE_CODEGEN_ENABLED=false, CODEGEN_FACTORY_MODE=CODEGEN_ONLY
> 3. WHOLESTAGE_CODEGEN_ENABLED=true, CODEGEN_FACTORY_MODE=NO_CODEGEN
> 4. WHOLESTAGE_CODEGEN_ENABLED=false, CODEGEN_FACTORY_MODE=NO_CODEGEN






[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716322#comment-16716322
 ] 

ASF GitHub Bot commented on SPARK-25272:


SparkQA commented on issue #22273: [SPARK-25272][PYTHON][TEST] Add test to 
better indicate pyarrow is installed and related tests will run
URL: https://github.com/apache/spark/pull/22273#issuecomment-446086071
 
 
   **[Test build #99950 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99950/testReport)**
 for PR 22273 at commit 
[`8574291`](https://github.com/apache/spark/commit/8574291a0b84574626ca213bc6f95dc0db73b0ef).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
 * `class HaveArrowTests(unittest.TestCase):`




> Show some kind of test output to indicate pyarrow tests were run
> 
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Right now tests only output status when they are skipped and there is no way 
> to really see from the logs that pyarrow tests, like ArrowTests, have been 
> run except by the absence of a skipped message.  We can add a test that is 
> skipped if pyarrow is installed, which will give an output in our Jenkins 
> test.






[jira] [Commented] (SPARK-26293) Cast exception when having python udf in subquery

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716332#comment-16716332
 ] 

ASF GitHub Bot commented on SPARK-26293:


cloud-fan commented on issue #23248: [SPARK-26293][SQL] Cast exception when 
having python udf in subquery
URL: https://github.com/apache/spark/pull/23248#issuecomment-446086659
 
 
   thanks, merging to master/2.4!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Cast exception when having python udf in subquery
> -
>
> Key: SPARK-26293
> URL: https://issues.apache.org/jira/browse/SPARK-26293
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716326#comment-16716326
 ] 

ASF GitHub Bot commented on SPARK-25272:


AmplabJenkins removed a comment on issue #22273: [SPARK-25272][PYTHON][TEST] 
Add test to better indicate pyarrow is installed and related tests will run
URL: https://github.com/apache/spark/pull/22273#issuecomment-446086290
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Show some kind of test output to indicate pyarrow tests were run
> 
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Right now tests only output status when they are skipped and there is no way 
> to really see from the logs that pyarrow tests, like ArrowTests, have been 
> run except by the absence of a skipped message.  We can add a test that is 
> skipped if pyarrow is installed, which will give an output in our Jenkins 
> test.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716327#comment-16716327
 ] 

ASF GitHub Bot commented on SPARK-25272:


AmplabJenkins removed a comment on issue #22273: [SPARK-25272][PYTHON][TEST] 
Add test to better indicate pyarrow is installed and related tests will run
URL: https://github.com/apache/spark/pull/22273#issuecomment-446086294
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99950/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Show some kind of test output to indicate pyarrow tests were run
> 
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Right now tests only output status when they are skipped and there is no way 
> to really see from the logs that pyarrow tests, like ArrowTests, have been 
> run except by the absence of a skipped message.  We can add a test that is 
> skipped if pyarrow is installed, which will give an output in our Jenkins 
> test.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26293) Cast exception when having python udf in subquery

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716329#comment-16716329
 ] 

ASF GitHub Bot commented on SPARK-26293:


asfgit closed pull request #23248: [SPARK-26293][SQL] Cast exception when 
having python udf in subquery
URL: https://github.com/apache/spark/pull/23248
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/python/pyspark/sql/tests/test_udf.py 
b/python/pyspark/sql/tests/test_udf.py
index ed298f724d551..12cf8c7de1dad 100644
--- a/python/pyspark/sql/tests/test_udf.py
+++ b/python/pyspark/sql/tests/test_udf.py
@@ -23,7 +23,7 @@
 
 from pyspark import SparkContext
 from pyspark.sql import SparkSession, Column, Row
-from pyspark.sql.functions import UserDefinedFunction
+from pyspark.sql.functions import UserDefinedFunction, udf
 from pyspark.sql.types import *
 from pyspark.sql.utils import AnalysisException
 from pyspark.testing.sqlutils import ReusedSQLTestCase, test_compiled, 
test_not_compiled_message
@@ -102,7 +102,6 @@ def test_udf_registration_return_type_not_none(self):
 
 def test_nondeterministic_udf(self):
 # Test that nondeterministic UDFs are evaluated only once in chained 
UDF evaluations
-from pyspark.sql.functions import udf
 import random
 udf_random_col = udf(lambda: int(100 * random.random()), 
IntegerType()).asNondeterministic()
 self.assertEqual(udf_random_col.deterministic, False)
@@ -113,7 +112,6 @@ def test_nondeterministic_udf(self):
 
 def test_nondeterministic_udf2(self):
 import random
-from pyspark.sql.functions import udf
 random_udf = udf(lambda: random.randint(6, 6), 
IntegerType()).asNondeterministic()
 self.assertEqual(random_udf.deterministic, False)
 random_udf1 = self.spark.catalog.registerFunction("randInt", 
random_udf)
@@ -132,7 +130,6 @@ def test_nondeterministic_udf2(self):
 
 def test_nondeterministic_udf3(self):
 # regression test for SPARK-23233
-from pyspark.sql.functions import udf
 f = udf(lambda x: x)
 # Here we cache the JVM UDF instance.
 self.spark.range(1).select(f("id"))
@@ -144,7 +141,7 @@ def test_nondeterministic_udf3(self):
 self.assertFalse(deterministic)
 
 def test_nondeterministic_udf_in_aggregate(self):
-from pyspark.sql.functions import udf, sum
+from pyspark.sql.functions import sum
 import random
 udf_random_col = udf(lambda: int(100 * random.random()), 
'int').asNondeterministic()
 df = self.spark.range(10)
@@ -181,7 +178,6 @@ def test_multiple_udfs(self):
 self.assertEqual(tuple(row), (6, 5))
 
 def test_udf_in_filter_on_top_of_outer_join(self):
-from pyspark.sql.functions import udf
 left = self.spark.createDataFrame([Row(a=1)])
 right = self.spark.createDataFrame([Row(a=1)])
 df = left.join(right, on='a', how='left_outer')
@@ -190,7 +186,6 @@ def test_udf_in_filter_on_top_of_outer_join(self):
 
 def test_udf_in_filter_on_top_of_join(self):
 # regression test for SPARK-18589
-from pyspark.sql.functions import udf
 left = self.spark.createDataFrame([Row(a=1)])
 right = self.spark.createDataFrame([Row(b=1)])
 f = udf(lambda a, b: a == b, BooleanType())
@@ -199,7 +194,6 @@ def test_udf_in_filter_on_top_of_join(self):
 
 def test_udf_in_join_condition(self):
 # regression test for SPARK-25314
-from pyspark.sql.functions import udf
 left = self.spark.createDataFrame([Row(a=1)])
 right = self.spark.createDataFrame([Row(b=1)])
 f = udf(lambda a, b: a == b, BooleanType())
@@ -211,7 +205,7 @@ def test_udf_in_join_condition(self):
 
 def test_udf_in_left_outer_join_condition(self):
 # regression test for SPARK-26147
-from pyspark.sql.functions import udf, col
+from pyspark.sql.functions import col
 left = self.spark.createDataFrame([Row(a=1)])
 right = self.spark.createDataFrame([Row(b=1)])
 f = udf(lambda a: str(a), StringType())
@@ -223,7 +217,6 @@ def test_udf_in_left_outer_join_condition(self):
 
 def test_udf_in_left_semi_join_condition(self):
 # regression test for SPARK-25314
-from pyspark.sql.functions import udf
 left = self.spark.createDataFrame([Row(a=1, a1=1, a2=1), Row(a=2, 
a1=2, a2=2)])
 right = self.spark.createDataFrame([Row(b=1, b1=1, b2=1)])
 f = udf(lambda a, b: a == b, BooleanType())
@@ -236,7 +229,6 @@ def test_udf_in_left_semi_join_condition(self):
 def test_udf_and_common_filter_in_join_condition(self):
 # regression test for SPARK-25314
 

[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716323#comment-16716323
 ] 

ASF GitHub Bot commented on SPARK-25272:


SparkQA removed a comment on issue #22273: [SPARK-25272][PYTHON][TEST] Add test 
to better indicate pyarrow is installed and related tests will run
URL: https://github.com/apache/spark/pull/22273#issuecomment-446081233
 
 
   **[Test build #99950 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99950/testReport)**
 for PR 22273 at commit 
[`8574291`](https://github.com/apache/spark/commit/8574291a0b84574626ca213bc6f95dc0db73b0ef).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Show some kind of test output to indicate pyarrow tests were run
> 
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Right now tests only output status when they are skipped and there is no way 
> to really see from the logs that pyarrow tests, like ArrowTests, have been 
> run except by the absence of a skipped message.  We can add a test that is 
> skipped if pyarrow is installed, which will give an output in our Jenkins 
> test.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716325#comment-16716325
 ] 

ASF GitHub Bot commented on SPARK-25272:


AmplabJenkins commented on issue #22273: [SPARK-25272][PYTHON][TEST] Add test 
to better indicate pyarrow is installed and related tests will run
URL: https://github.com/apache/spark/pull/22273#issuecomment-446086294
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99950/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Show some kind of test output to indicate pyarrow tests were run
> 
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Right now tests only output status when they are skipped and there is no way 
> to really see from the logs that pyarrow tests, like ArrowTests, have been 
> run except by the absence of a skipped message.  We can add a test that is 
> skipped if pyarrow is installed, which will give an output in our Jenkins 
> test.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716324#comment-16716324
 ] 

ASF GitHub Bot commented on SPARK-25272:


AmplabJenkins commented on issue #22273: [SPARK-25272][PYTHON][TEST] Add test 
to better indicate pyarrow is installed and related tests will run
URL: https://github.com/apache/spark/pull/22273#issuecomment-446086290
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Show some kind of test output to indicate pyarrow tests were run
> 
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Right now tests only output status when they are skipped and there is no way 
> to really see from the logs that pyarrow tests, like ArrowTests, have been 
> run except by the absence of a skipped message.  We can add a test that is 
> skipped if pyarrow is installed, which will give an output in our Jenkins 
> test.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26303) Return partial results for bad JSON records

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716311#comment-16716311
 ] 

ASF GitHub Bot commented on SPARK-26303:


AmplabJenkins removed a comment on issue #23253: [SPARK-26303][SQL] Return 
partial results for bad JSON records
URL: https://github.com/apache/spark/pull/23253#issuecomment-446084120
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5957/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Return partial results for bad JSON records
> ---
>
> Key: SPARK-26303
> URL: https://issues.apache.org/jira/browse/SPARK-26303
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, the JSON datasource and JSON functions return a row with all 
> nulls for a malformed JSON string in the PERMISSIVE mode when the specified 
> schema has the struct type. All nulls are returned even if some of the 
> fields were parsed and converted to the desired types successfully. This 
> ticket aims to solve the problem by returning the already-parsed fields. 
> The corrupt column specified via the JSON option `columnNameOfCorruptRecord` 
> or the SQL config should contain the whole original JSON string. 
> For example, if the input has one JSON string:
> {code:json}
> {"a":0.1,"b":{},"c":"def"}
> {code}
> and the specified schema is:
> {code:sql}
> a DOUBLE, b ARRAY<INT>, c STRING, _corrupt_record STRING
> {code}
> the expected output of `from_json` in the PERMISSIVE mode is:
> {code}
> +---+----+---+--------------------------+
> |a  |b   |c  |_corrupt_record           |
> +---+----+---+--------------------------+
> |0.1|null|def|{"a":0.1,"b":{},"c":"def"}|
> +---+----+---+--------------------------+
> {code}
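The intended PERMISSIVE semantics can be sketched outside Spark with nothing but the standard library. This is an illustrative toy parser, not Spark's actual `JacksonParser` logic; the helper names and casters are hypothetical:

```python
import json


def as_int_array(value):
    # Reject non-arrays so a JSON object like {} fails the cast, as it
    # would when the declared Spark type is an array type.
    if not isinstance(value, list):
        raise TypeError("not a JSON array")
    return [int(x) for x in value]


def parse_permissive(line, schema):
    """Toy model of PERMISSIVE mode with partial results: every field that
    converts cleanly is kept, failed fields become null, and the raw record
    is preserved in _corrupt_record whenever anything went wrong."""
    row = {name: None for name in schema}
    corrupt = None
    try:
        data = json.loads(line)
        for name, cast in schema.items():
            try:
                row[name] = cast(data[name])
            except (KeyError, TypeError, ValueError):
                corrupt = line  # partial failure: keep the parsed fields
    except json.JSONDecodeError:
        corrupt = line          # fully malformed: all fields stay null
    row["_corrupt_record"] = corrupt
    return row


schema = {"a": float, "b": as_int_array, "c": str}
row = parse_permissive('{"a":0.1,"b":{},"c":"def"}', schema)
# "b" fails the array cast, yet "a" and "c" survive and the whole raw
# record lands in _corrupt_record, matching the expected output above.
```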



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26288) add initRegisteredExecutorsDB in ExternalShuffleService

2018-12-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26288:
--
Fix Version/s: (was: 2.4.0)

> add initRegisteredExecutorsDB in ExternalShuffleService
> ---
>
> Key: SPARK-26288
> URL: https://issues.apache.org/jira/browse/SPARK-26288
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Shuffle
>Affects Versions: 2.4.0
>Reporter: weixiuli
>Priority: Major
>
> Spark on YARN uses a DB to record RegisteredExecutors information, which 
> can be reloaded and reused when the ExternalShuffleService is restarted.
> This information is not recorded in standalone mode or in Spark on 
> Kubernetes, so it is lost whenever the ExternalShuffleService restarts.
> To solve the problem above, a method is proposed and committed.
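The recovery idea can be sketched in a few lines. This is a hypothetical stand-in (the class and method names are invented, and JSON is used for brevity; the real shuffle service persists to a local LevelDB file):

```python
import json
import os


class ExecutorRegistry:
    """Hypothetical sketch of the DB-backed recovery described above: each
    executor registration is persisted to a local file so a restarted
    shuffle service can reload it instead of losing all state."""

    def __init__(self, db_path):
        self.db_path = db_path
        self.executors = {}
        if os.path.exists(db_path):
            # Restart path: reload the executors registered before.
            with open(db_path) as f:
                self.executors = json.load(f)

    def register(self, executor_id, shuffle_info):
        self.executors[executor_id] = shuffle_info
        # Persist on every registration so a restart loses nothing.
        with open(self.db_path, "w") as f:
            json.dump(self.executors, f)
```

A service instance constructed over the same `db_path` after a restart sees the executors that were registered before it went down.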



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partial revert SPARK-21052:Add hash map metrics to join,

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716299#comment-16716299
 ] 

ASF GitHub Bot commented on SPARK-26316:


AmplabJenkins removed a comment on issue #23269: [SPARK-26316] Revert hash join 
metrics in spark 21052 that causes performance degradation 
URL: https://github.com/apache/spark/pull/23269#issuecomment-446083193
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Because of the perf degradation in TPC-DS, we currently partial revert 
> SPARK-21052:Add hash map metrics to join,
> 
>
> Key: SPARK-26316
> URL: https://issues.apache.org/jira/browse/SPARK-26316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> The code of  
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>  in SPARK-21052 causes performance degradation in Spark 2.3. The results of 
> all TPC-DS queries at 1 TB scale are in [TPC-DS 
> result|https://docs.google.com/spreadsheets/d/18a5BdOlmm8euTaRodyeWum9yu92mbWWu6JbhGXtr7yE/edit#gid=0]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24102) RegressionEvaluator should use sample weight data

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716292#comment-16716292
 ] 

ASF GitHub Bot commented on SPARK-24102:


AmplabJenkins removed a comment on issue #17085: [SPARK-24102][ML][MLLIB] ML 
Evaluators should use weight column - added weight column for regression 
evaluator
URL: https://github.com/apache/spark/pull/17085#issuecomment-446083121
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> RegressionEvaluator should use sample weight data
> -
>
> Key: SPARK-24102
> URL: https://issues.apache.org/jira/browse/SPARK-24102
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Ilya Matiach
>Priority: Major
>  Labels: starter
>
> The LogisticRegression and LinearRegression models support training with a 
> weight column, but the corresponding evaluators do not support computing 
> metrics using those weights. This breaks model selection using CrossValidator.
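The metric the ticket asks for is straightforward. A weighted RMSE, for instance, weights each squared residual by its sample weight; the following is a plain-Python sketch, not the evaluator's actual code:

```python
import math


def weighted_rmse(labels, predictions, weights):
    # Each squared residual contributes proportionally to its sample
    # weight, and the mean is taken over the total weight.
    sq_err = sum(w * (y - p) ** 2
                 for y, p, w in zip(labels, predictions, weights))
    return math.sqrt(sq_err / sum(weights))


# Doubling a point's weight doubles its residual's influence on the metric.
rmse = weighted_rmse([1.0, 2.0, 3.0], [1.5, 2.0, 2.5], [2.0, 1.0, 1.0])
```

With all weights equal, this reduces to the ordinary RMSE, which is why supporting a weight column in the evaluator is a strict generalization of the current behavior.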



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26288) add initRegisteredExecutorsDB in ExternalShuffleService

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716314#comment-16716314
 ] 

ASF GitHub Bot commented on SPARK-26288:


dongjoon-hyun commented on a change in pull request #23243: 
[SPARK-26288][ExternalShuffleService]add initRegisteredExecutorsDB
URL: https://github.com/apache/spark/pull/23243#discussion_r240483099
 
 

 ##
 File path: core/src/test/scala/org/apache/spark/deploy/worker/WorkerSuite.scala
 ##
 @@ -243,4 +243,13 @@ class WorkerSuite extends SparkFunSuite with Matchers 
with BeforeAndAfter {
   ExecutorStateChanged("app1", 0, ExecutorState.EXITED, None, None))
 assert(cleanupCalled.get() == value)
   }
+  test("test  initRegisteredExecutorsDB  ") {
+val sparkConf = new SparkConf()
+Utils.loadDefaultSparkProperties(sparkConf)
+val securityManager = new SecurityManager(sparkConf)
+sparkConf.set(config.SHUFFLE_SERVICE_DB_ENABLED.key, "true")
+sparkConf.set(config.SHUFFLE_SERVICE_ENABLED.key, "true")
+sparkConf.set("spark.local.dir", "/tmp")
+val externalShuffleService = new ExternalShuffleService(sparkConf, 
securityManager)
 
 Review comment:
   Does this test case fail without your patch?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> add initRegisteredExecutorsDB in ExternalShuffleService
> ---
>
> Key: SPARK-26288
> URL: https://issues.apache.org/jira/browse/SPARK-26288
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Shuffle
>Affects Versions: 2.4.0
>Reporter: weixiuli
>Priority: Major
>
> Spark on YARN uses a DB to record RegisteredExecutors information, which 
> can be reloaded and reused when the ExternalShuffleService is restarted.
> This information is not recorded in standalone mode or in Spark on 
> Kubernetes, so it is lost whenever the ExternalShuffleService restarts.
> To solve the problem above, a method is proposed and committed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26303) Return partial results for bad JSON records

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716310#comment-16716310
 ] 

ASF GitHub Bot commented on SPARK-26303:


AmplabJenkins removed a comment on issue #23253: [SPARK-26303][SQL] Return 
partial results for bad JSON records
URL: https://github.com/apache/spark/pull/23253#issuecomment-446084116
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Return partial results for bad JSON records
> ---
>
> Key: SPARK-26303
> URL: https://issues.apache.org/jira/browse/SPARK-26303
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, the JSON datasource and JSON functions return a row with all 
> nulls for a malformed JSON string in the PERMISSIVE mode when the specified 
> schema has the struct type. All nulls are returned even if some of the 
> fields were parsed and converted to the desired types successfully. This 
> ticket aims to solve the problem by returning the already-parsed fields. 
> The corrupt column specified via the JSON option `columnNameOfCorruptRecord` 
> or the SQL config should contain the whole original JSON string. 
> For example, if the input has one JSON string:
> {code:json}
> {"a":0.1,"b":{},"c":"def"}
> {code}
> and the specified schema is:
> {code:sql}
> a DOUBLE, b ARRAY<INT>, c STRING, _corrupt_record STRING
> {code}
> the expected output of `from_json` in the PERMISSIVE mode is:
> {code}
> +---+----+---+--------------------------+
> |a  |b   |c  |_corrupt_record           |
> +---+----+---+--------------------------+
> |0.1|null|def|{"a":0.1,"b":{},"c":"def"}|
> +---+----+---+--------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26303) Return partial results for bad JSON records

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716308#comment-16716308
 ] 

ASF GitHub Bot commented on SPARK-26303:


AmplabJenkins commented on issue #23253: [SPARK-26303][SQL] Return partial 
results for bad JSON records
URL: https://github.com/apache/spark/pull/23253#issuecomment-446084120
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5957/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Return partial results for bad JSON records
> ---
>
> Key: SPARK-26303
> URL: https://issues.apache.org/jira/browse/SPARK-26303
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, the JSON datasource and JSON functions return a row with all 
> nulls for a malformed JSON string in the PERMISSIVE mode when the specified 
> schema has the struct type. All nulls are returned even if some of the 
> fields were parsed and converted to the desired types successfully. This 
> ticket aims to solve the problem by returning the already-parsed fields. 
> The corrupt column specified via the JSON option `columnNameOfCorruptRecord` 
> or the SQL config should contain the whole original JSON string. 
> For example, if the input has one JSON string:
> {code:json}
> {"a":0.1,"b":{},"c":"def"}
> {code}
> and the specified schema is:
> {code:sql}
> a DOUBLE, b ARRAY<INT>, c STRING, _corrupt_record STRING
> {code}
> the expected output of `from_json` in the PERMISSIVE mode is:
> {code}
> +---+----+---+--------------------------+
> |a  |b   |c  |_corrupt_record           |
> +---+----+---+--------------------------+
> |0.1|null|def|{"a":0.1,"b":{},"c":"def"}|
> +---+----+---+--------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26303) Return partial results for bad JSON records

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716306#comment-16716306
 ] 

ASF GitHub Bot commented on SPARK-26303:


SparkQA commented on issue #23253: [SPARK-26303][SQL] Return partial results 
for bad JSON records
URL: https://github.com/apache/spark/pull/23253#issuecomment-446084058
 
 
   **[Test build #99953 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99953/testReport)**
 for PR 23253 at commit 
[`9ca9248`](https://github.com/apache/spark/commit/9ca9248ed3f9314747c1415bd19760c53019bf36).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Return partial results for bad JSON records
> ---
>
> Key: SPARK-26303
> URL: https://issues.apache.org/jira/browse/SPARK-26303
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, the JSON datasource and JSON functions return a row with all 
> nulls for a malformed JSON string in the PERMISSIVE mode when the specified 
> schema has the struct type. All nulls are returned even if some of the 
> fields were parsed and converted to the desired types successfully. This 
> ticket aims to solve the problem by returning the already-parsed fields. 
> The corrupt column specified via the JSON option `columnNameOfCorruptRecord` 
> or the SQL config should contain the whole original JSON string. 
> For example, if the input has one JSON string:
> {code:json}
> {"a":0.1,"b":{},"c":"def"}
> {code}
> and the specified schema is:
> {code:sql}
> a DOUBLE, b ARRAY<INT>, c STRING, _corrupt_record STRING
> {code}
> the expected output of `from_json` in the PERMISSIVE mode is:
> {code}
> +---+----+---+--------------------------+
> |a  |b   |c  |_corrupt_record           |
> +---+----+---+--------------------------+
> |0.1|null|def|{"a":0.1,"b":{},"c":"def"}|
> +---+----+---+--------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26288) add initRegisteredExecutorsDB in ExternalShuffleService

2018-12-10 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716309#comment-16716309
 ] 

Dongjoon Hyun commented on SPARK-26288:
---

[~weixiuli] . Thank you for the contribution.

Please don't specify the Target Versions and Fix Versions. It should be handled 
by committers.

There is a helpful guide for you to start contributions; 
[http://spark.apache.org/contributing.html] .

> add initRegisteredExecutorsDB in ExternalShuffleService
> ---
>
> Key: SPARK-26288
> URL: https://issues.apache.org/jira/browse/SPARK-26288
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Shuffle
>Affects Versions: 2.4.0
>Reporter: weixiuli
>Priority: Major
>
> As is well known, Spark on YARN uses a DB to record RegisteredExecutors 
> information, which can be reloaded and reused when the 
> ExternalShuffleService is restarted.
> This information is not recorded in either Spark standalone mode or Spark 
> on Kubernetes, so the RegisteredExecutors information is lost when the 
> ExternalShuffleService is restarted.
> To solve this problem, a method is proposed and committed.
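The recovery mechanism described above can be illustrated with a toy persistence layer. The following is a hypothetical Python sketch (a JSON file stands in for the LevelDB that the YARN shuffle service uses; the class and field names are made up for illustration): every registration is written to disk, and a restarted service instance reloads the prior state.

```python
import json
import os
import tempfile

class RegisteredExecutorsDB:
    """Toy stand-in for a registered-executors DB: each registration is
    persisted to a file, and a new instance reloads it on construction."""

    def __init__(self, path):
        self.path = path
        self.executors = {}
        if os.path.exists(path):         # service restart: reload prior state
            with open(path) as f:
                self.executors = json.load(f)

    def register(self, executor_id, shuffle_info):
        self.executors[executor_id] = shuffle_info
        with open(self.path, "w") as f:  # persist on every registration
            json.dump(self.executors, f)

# Simulate a restart: the second instance sees the first one's registrations.
db_path = os.path.join(tempfile.mkdtemp(), "registeredExecutors.json")
service = RegisteredExecutorsDB(db_path)
service.register("exec-1", {"localDirs": ["/tmp/blockmgr-0"]})
restarted = RegisteredExecutorsDB(db_path)
```

Without the on-disk file, a restarted instance would start with an empty map, which is the data loss the ticket describes for standalone and Kubernetes deployments.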






[jira] [Commented] (SPARK-26303) Return partial results for bad JSON records

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716307#comment-16716307
 ] 

ASF GitHub Bot commented on SPARK-26303:


AmplabJenkins commented on issue #23253: [SPARK-26303][SQL] Return partial 
results for bad JSON records
URL: https://github.com/apache/spark/pull/23253#issuecomment-446084116
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Return partial results for bad JSON records
> ---
>
> Key: SPARK-26303
> URL: https://issues.apache.org/jira/browse/SPARK-26303
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, the JSON datasource and JSON functions return a row with all 
> fields set to null for a malformed JSON string in PERMISSIVE mode when the 
> specified schema has a struct type. All nulls are returned even if some of 
> the fields were parsed and converted to the desired types successfully. This 
> ticket aims to solve the problem by returning the already-parsed fields. The 
> corrupt column, specified via the JSON option `columnNameOfCorruptRecord` or 
> the SQL config, should contain the whole original JSON string. 
> For example, if the input has one JSON string:
> {code:json}
> {"a":0.1,"b":{},"c":"def"}
> {code}
> and specified schema is:
> {code:sql}
> a DOUBLE, b ARRAY, c STRING, _corrupt_record STRING
> {code}
> expected output of `from_json` in the PERMISSIVE mode:
> {code}
> +---+----+---+--------------------------+
> |a  |b   |c  |_corrupt_record           |
> +---+----+---+--------------------------+
> |0.1|null|def|{"a":0.1,"b":{},"c":"def"}|
> +---+----+---+--------------------------+
> {code}
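The intended PERMISSIVE-mode semantics can be sketched outside Spark. Below is a minimal, hypothetical Python model (the `parse_permissive` helper and its `{field: type}` schema are inventions for illustration, not Spark's API): fields that convert successfully are kept as a partial result, and any failure stores the whole original string in the corrupt column.

```python
import json

def parse_permissive(record, schema, corrupt_col="_corrupt_record"):
    """Parse one JSON record against a simple {field: python_type} schema,
    keeping successfully converted fields (partial result) and storing the
    whole original string in the corrupt column on any failure."""
    row = {name: None for name in schema}
    row[corrupt_col] = None
    try:
        data = json.loads(record)
    except ValueError:
        row[corrupt_col] = record      # unparseable: whole record is corrupt
        return row
    for name, expected_type in schema.items():
        value = data.get(name)
        if value is None:
            continue                   # missing field stays null, not corrupt
        if isinstance(value, expected_type):
            row[name] = value          # keep the already-parsed field
        else:
            row[corrupt_col] = record  # type mismatch: remember raw string
    return row

# "b" is an object but the schema expects an array, so "a" and "c" survive
# while the corrupt column captures the original string.
row = parse_permissive('{"a":0.1,"b":{},"c":"def"}',
                       {"a": float, "b": list, "c": str})
```

This mirrors the expected table above: the good fields come through and the corrupt column holds the full input, instead of every field being nulled out.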






[jira] [Updated] (SPARK-26288) add initRegisteredExecutorsDB in ExternalShuffleService

2018-12-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26288:
--
Target Version/s:   (was: 2.4.0)

> add initRegisteredExecutorsDB in ExternalShuffleService
> ---
>
> Key: SPARK-26288
> URL: https://issues.apache.org/jira/browse/SPARK-26288
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Shuffle
>Affects Versions: 2.4.0
>Reporter: weixiuli
>Priority: Major
> Fix For: 2.4.0
>
>
> As is well known, Spark on YARN uses a DB to record RegisteredExecutors 
> information, which can be reloaded and reused when the 
> ExternalShuffleService is restarted.
> This information is not recorded in either Spark standalone mode or Spark 
> on Kubernetes, so the RegisteredExecutors information is lost when the 
> ExternalShuffleService is restarted.
> To solve this problem, a method is proposed and committed.






[jira] [Commented] (SPARK-26288) add initRegisteredExecutorsDB in ExternalShuffleService

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716304#comment-16716304
 ] 

ASF GitHub Bot commented on SPARK-26288:


dongjoon-hyun commented on a change in pull request #23243: 
[SPARK-26288][ExternalShuffleService]add initRegisteredExecutorsDB
URL: https://github.com/apache/spark/pull/23243#discussion_r240482510
 
 

 ##
 File path: core/src/test/scala/org/apache/spark/deploy/worker/WorkerSuite.scala
 ##
 @@ -19,20 +19,20 @@ package org.apache.spark.deploy.worker
 
 import java.util.concurrent.atomic.AtomicBoolean
 import java.util.function.Supplier
-
 
 Review comment:
   Please execute `dev/scalastyle` to check the coding style. You should not 
remove this blank line.




> add initRegisteredExecutorsDB in ExternalShuffleService
> ---
>
> Key: SPARK-26288
> URL: https://issues.apache.org/jira/browse/SPARK-26288
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Shuffle
>Affects Versions: 2.4.0
>Reporter: weixiuli
>Priority: Major
> Fix For: 2.4.0
>
>
> As is well known, Spark on YARN uses a DB to record RegisteredExecutors 
> information, which can be reloaded and reused when the 
> ExternalShuffleService is restarted.
> This information is not recorded in either Spark standalone mode or Spark 
> on Kubernetes, so the RegisteredExecutors information is lost when the 
> ExternalShuffleService is restarted.
> To solve this problem, a method is proposed and committed.






[jira] [Commented] (SPARK-26288) add initRegisteredExecutorsDB in ExternalShuffleService

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716303#comment-16716303
 ] 

ASF GitHub Bot commented on SPARK-26288:


dongjoon-hyun commented on a change in pull request #23243: 
[SPARK-26288][ExternalShuffleService]add initRegisteredExecutorsDB
URL: https://github.com/apache/spark/pull/23243#discussion_r240482510
 
 

 ##
 File path: core/src/test/scala/org/apache/spark/deploy/worker/WorkerSuite.scala
 ##
 @@ -19,20 +19,20 @@ package org.apache.spark.deploy.worker
 
 import java.util.concurrent.atomic.AtomicBoolean
 import java.util.function.Supplier
-
 
 Review comment:
   Please execute `dev/scalastyle` to check the coding style.




> add initRegisteredExecutorsDB in ExternalShuffleService
> ---
>
> Key: SPARK-26288
> URL: https://issues.apache.org/jira/browse/SPARK-26288
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Shuffle
>Affects Versions: 2.4.0
>Reporter: weixiuli
>Priority: Major
> Fix For: 2.4.0
>
>
> As is well known, Spark on YARN uses a DB to record RegisteredExecutors 
> information, which can be reloaded and reused when the 
> ExternalShuffleService is restarted.
> This information is not recorded in either Spark standalone mode or Spark 
> on Kubernetes, so the RegisteredExecutors information is lost when the 
> ExternalShuffleService is restarted.
> To solve this problem, a method is proposed and committed.






[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13

2018-12-10 Thread Darcy Shen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716302#comment-16716302
 ] 

Darcy Shen commented on SPARK-25075:


I maintain a list of the Scala libraries that Spark uses:

https://github.com/scala/scala-dev/issues/563#issuecomment-425363609

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Guillaume Massé
>Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.






[jira] [Commented] (SPARK-26288) add initRegisteredExecutorsDB in ExternalShuffleService

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716298#comment-16716298
 ] 

ASF GitHub Bot commented on SPARK-26288:


dongjoon-hyun edited a comment on issue #23243: 
[SPARK-26288][ExternalShuffleService]add initRegisteredExecutorsDB
URL: https://github.com/apache/spark/pull/23243#issuecomment-446083308
 
 
   Hi, @weixiuli . You can use `[CORE]` instead of `[ExternalShuffleService]` 
in the PR title.




> add initRegisteredExecutorsDB in ExternalShuffleService
> ---
>
> Key: SPARK-26288
> URL: https://issues.apache.org/jira/browse/SPARK-26288
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Shuffle
>Affects Versions: 2.4.0
>Reporter: weixiuli
>Priority: Major
> Fix For: 2.4.0
>
>
> As is well known, Spark on YARN uses a DB to record RegisteredExecutors 
> information, which can be reloaded and reused when the 
> ExternalShuffleService is restarted.
> This information is not recorded in either Spark standalone mode or Spark 
> on Kubernetes, so the RegisteredExecutors information is lost when the 
> ExternalShuffleService is restarted.
> To solve this problem, a method is proposed and committed.






[jira] [Commented] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partial revert SPARK-21052:Add hash map metrics to join,

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716300#comment-16716300
 ] 

ASF GitHub Bot commented on SPARK-26316:


AmplabJenkins removed a comment on issue #23269: [SPARK-26316] Revert hash join 
metrics in spark 21052 that causes performance degradation 
URL: https://github.com/apache/spark/pull/23269#issuecomment-446083198
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5955/
   Test PASSed.




> Because of the perf degradation in TPC-DS, we currently partial revert 
> SPARK-21052:Add hash map metrics to join,
> 
>
> Key: SPARK-26316
> URL: https://issues.apache.org/jira/browse/SPARK-26316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> The code at 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>  in SPARK-21052 causes performance degradation in Spark 2.3. The results of 
> all TPC-DS queries at the 1 TB scale are in [TPC-DS 
> result|https://docs.google.com/spreadsheets/d/18a5BdOlmm8euTaRodyeWum9yu92mbWWu6JbhGXtr7yE/edit#gid=0]
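The cost pattern behind the revert can be illustrated generically. The sketch below is not Spark's code; it contrasts a hash-join probe that invokes a metric callback for every matched row with one that defers the update to a single bulk call after the loop, the kind of per-row hot-loop overhead such a revert removes.

```python
def join_count_with_metric(build_side, probe_keys, metric):
    """Hash-join probe that updates a metric callback per matched row --
    the pattern that adds overhead inside the hot loop."""
    hash_table = {}
    for key, value in build_side:
        hash_table.setdefault(key, []).append(value)
    matches = 0
    for key in probe_keys:
        for _value in hash_table.get(key, ()):
            matches += 1
            metric(1)          # per-row metric update inside the hot loop
    return matches

def join_count_deferred(build_side, probe_keys, metric):
    """Same join, but the metric is updated once after the loop,
    avoiding one callback invocation per matched row."""
    hash_table = {}
    for key, value in build_side:
        hash_table.setdefault(key, []).append(value)
    matches = 0
    for key in probe_keys:
        matches += len(hash_table.get(key, ()))
    metric(matches)            # single bulk update after the loop
    return matches
```

Both variants report the same totals; only the number of metric calls in the inner loop differs.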






[jira] [Commented] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partial revert SPARK-21052:Add hash map metrics to join,

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716294#comment-16716294
 ] 

ASF GitHub Bot commented on SPARK-26316:


AmplabJenkins commented on issue #23269: [SPARK-26316] Revert hash join metrics 
in spark 21052 that causes performance degradation 
URL: https://github.com/apache/spark/pull/23269#issuecomment-446083193
 
 
   Merged build finished. Test PASSed.




> Because of the perf degradation in TPC-DS, we currently partial revert 
> SPARK-21052:Add hash map metrics to join,
> 
>
> Key: SPARK-26316
> URL: https://issues.apache.org/jira/browse/SPARK-26316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> The code at 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>  in SPARK-21052 causes performance degradation in Spark 2.3. The results of 
> all TPC-DS queries at the 1 TB scale are in [TPC-DS 
> result|https://docs.google.com/spreadsheets/d/18a5BdOlmm8euTaRodyeWum9yu92mbWWu6JbhGXtr7yE/edit#gid=0]






[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716267#comment-16716267
 ] 

ASF GitHub Bot commented on SPARK-25272:


SparkQA commented on issue #22273: [SPARK-25272][PYTHON][TEST] Add test to 
better indicate pyarrow is installed and related tests will run
URL: https://github.com/apache/spark/pull/22273#issuecomment-446081233
 
 
   **[Test build #99950 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99950/testReport)**
 for PR 22273 at commit 
[`8574291`](https://github.com/apache/spark/commit/8574291a0b84574626ca213bc6f95dc0db73b0ef).




> Show some kind of test output to indicate pyarrow tests were run
> 
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Right now, tests only output a status when they are skipped, and there is no 
> way to tell from the logs that the pyarrow tests, such as ArrowTests, have 
> been run, except by the absence of a skip message. We can add a test that is 
> skipped when pyarrow is installed, which will produce output in our Jenkins 
> runs.
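One way to get such an output, sketched here with plain `unittest` rather than Spark's actual test harness (class and function names are illustrative), is an inverse guard: a test that is skipped exactly when pyarrow is installed, so a skip line in the log is positive evidence that the Arrow-dependent tests will run.

```python
import importlib.util
import unittest

def have_pyarrow():
    """True when pyarrow can be imported in this environment."""
    return importlib.util.find_spec("pyarrow") is not None

class PyArrowInstalledMarker(unittest.TestCase):
    # Inverse of the usual guard: this test is SKIPPED when pyarrow IS
    # installed, so its skip message in the log proves Arrow tests ran.
    @unittest.skipIf(have_pyarrow(),
                     "pyarrow is installed; Arrow tests will run")
    def test_pyarrow_missing(self):
        self.assertFalse(have_pyarrow())

suite = unittest.defaultTestLoader.loadTestsFromTestCase(PyArrowInstalledMarker)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The suite passes in both environments: with pyarrow present the test is reported as skipped (the desired log marker), and without it the assertion holds.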






[jira] [Commented] (SPARK-26288) add initRegisteredExecutorsDB in ExternalShuffleService

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716297#comment-16716297
 ] 

ASF GitHub Bot commented on SPARK-26288:


dongjoon-hyun commented on issue #23243: 
[SPARK-26288][ExternalShuffleService]add initRegisteredExecutorsDB
URL: https://github.com/apache/spark/pull/23243#issuecomment-446083308
 
 
   Hi, @weixiuli . You can use `[CORE]` instead of `[ExternalShuffleService]`.




> add initRegisteredExecutorsDB in ExternalShuffleService
> ---
>
> Key: SPARK-26288
> URL: https://issues.apache.org/jira/browse/SPARK-26288
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Shuffle
>Affects Versions: 2.4.0
>Reporter: weixiuli
>Priority: Major
> Fix For: 2.4.0
>
>
> As is well known, Spark on YARN uses a DB to record RegisteredExecutors 
> information, which can be reloaded and reused when the 
> ExternalShuffleService is restarted.
> This information is not recorded in either Spark standalone mode or Spark 
> on Kubernetes, so the RegisteredExecutors information is lost when the 
> ExternalShuffleService is restarted.
> To solve this problem, a method is proposed and committed.






[jira] [Commented] (SPARK-24102) RegressionEvaluator should use sample weight data

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716293#comment-16716293
 ] 

ASF GitHub Bot commented on SPARK-24102:


AmplabJenkins removed a comment on issue #17085: [SPARK-24102][ML][MLLIB] ML 
Evaluators should use weight column - added weight column for regression 
evaluator
URL: https://github.com/apache/spark/pull/17085#issuecomment-446083128
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5956/
   Test PASSed.




> RegressionEvaluator should use sample weight data
> -
>
> Key: SPARK-24102
> URL: https://issues.apache.org/jira/browse/SPARK-24102
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Ilya Matiach
>Priority: Major
>  Labels: starter
>
> The LogisticRegression and LinearRegression models support training with a 
> weight column, but the corresponding evaluators do not support computing 
> metrics using those weights. This breaks model selection using CrossValidator.
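For reference, one plausible way to fold a weight column into a regression metric (a sketch of the idea, not the MLlib implementation) is to scale each squared residual by its sample weight and normalize by the total weight:

```python
import math

def weighted_rmse(labels, predictions, weights):
    """RMSE where each squared residual is scaled by its sample weight
    and the sum is normalized by the total weight."""
    num = sum(w * (y - p) ** 2
              for y, p, w in zip(labels, predictions, weights))
    return math.sqrt(num / sum(weights))
```

With all weights equal to 1 this reduces to the ordinary RMSE; a weight of 0 removes a sample from the metric entirely, which is what model selection over weighted data needs.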






[jira] [Commented] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partial revert SPARK-21052:Add hash map metrics to join,

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716295#comment-16716295
 ] 

ASF GitHub Bot commented on SPARK-26316:


AmplabJenkins commented on issue #23269: [SPARK-26316] Revert hash join metrics 
in spark 21052 that causes performance degradation 
URL: https://github.com/apache/spark/pull/23269#issuecomment-446083198
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5955/
   Test PASSed.




> Because of the perf degradation in TPC-DS, we currently partial revert 
> SPARK-21052:Add hash map metrics to join,
> 
>
> Key: SPARK-26316
> URL: https://issues.apache.org/jira/browse/SPARK-26316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> The code at 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>  in SPARK-21052 causes performance degradation in Spark 2.3. The results of 
> all TPC-DS queries at the 1 TB scale are in [TPC-DS 
> result|https://docs.google.com/spreadsheets/d/18a5BdOlmm8euTaRodyeWum9yu92mbWWu6JbhGXtr7yE/edit#gid=0]






[jira] [Commented] (SPARK-26303) Return partial results for bad JSON records

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716296#comment-16716296
 ] 

ASF GitHub Bot commented on SPARK-26303:


HyukjinKwon commented on issue #23253: [SPARK-26303][SQL] Return partial 
results for bad JSON records
URL: https://github.com/apache/spark/pull/23253#issuecomment-446083244
 
 
   retest this please




> Return partial results for bad JSON records
> ---
>
> Key: SPARK-26303
> URL: https://issues.apache.org/jira/browse/SPARK-26303
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, the JSON datasource and JSON functions return a row with all 
> fields set to null for a malformed JSON string in PERMISSIVE mode when the 
> specified schema has a struct type. All nulls are returned even if some of 
> the fields were parsed and converted to the desired types successfully. This 
> ticket aims to solve the problem by returning the already-parsed fields. The 
> corrupt column, specified via the JSON option `columnNameOfCorruptRecord` or 
> the SQL config, should contain the whole original JSON string. 
> For example, if the input has one JSON string:
> {code:json}
> {"a":0.1,"b":{},"c":"def"}
> {code}
> and specified schema is:
> {code:sql}
> a DOUBLE, b ARRAY, c STRING, _corrupt_record STRING
> {code}
> expected output of `from_json` in the PERMISSIVE mode:
> {code}
> +---+----+---+--------------------------+
> |a  |b   |c  |_corrupt_record           |
> +---+----+---+--------------------------+
> |0.1|null|def|{"a":0.1,"b":{},"c":"def"}|
> +---+----+---+--------------------------+
> {code}






[jira] [Commented] (SPARK-24102) RegressionEvaluator should use sample weight data

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716290#comment-16716290
 ] 

ASF GitHub Bot commented on SPARK-24102:


AmplabJenkins commented on issue #17085: [SPARK-24102][ML][MLLIB] ML Evaluators 
should use weight column - added weight column for regression evaluator
URL: https://github.com/apache/spark/pull/17085#issuecomment-446083128
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5956/
   Test PASSed.




> RegressionEvaluator should use sample weight data
> -
>
> Key: SPARK-24102
> URL: https://issues.apache.org/jira/browse/SPARK-24102
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Ilya Matiach
>Priority: Major
>  Labels: starter
>
> The LogisticRegression and LinearRegression models support training with a 
> weight column, but the corresponding evaluators do not support computing 
> metrics using those weights. This breaks model selection using CrossValidator.






[jira] [Commented] (SPARK-24102) RegressionEvaluator should use sample weight data

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716291#comment-16716291
 ] 

ASF GitHub Bot commented on SPARK-24102:


SparkQA commented on issue #17085: [SPARK-24102][ML][MLLIB] ML Evaluators 
should use weight column - added weight column for regression evaluator
URL: https://github.com/apache/spark/pull/17085#issuecomment-446083138
 
 
   **[Test build #99952 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99952/testReport)**
 for PR 17085 at commit 
[`0cb2daf`](https://github.com/apache/spark/commit/0cb2daf35888d80c5c223e16505354571d87d383).




> RegressionEvaluator should use sample weight data
> -
>
> Key: SPARK-24102
> URL: https://issues.apache.org/jira/browse/SPARK-24102
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Ilya Matiach
>Priority: Major
>  Labels: starter
>
> The LogisticRegression and LinearRegression models support training with a 
> weight column, but the corresponding evaluators do not support computing 
> metrics using those weights. This breaks model selection using CrossValidator.






[jira] [Commented] (SPARK-24102) RegressionEvaluator should use sample weight data

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716289#comment-16716289
 ] 

ASF GitHub Bot commented on SPARK-24102:


AmplabJenkins commented on issue #17085: [SPARK-24102][ML][MLLIB] ML Evaluators 
should use weight column - added weight column for regression evaluator
URL: https://github.com/apache/spark/pull/17085#issuecomment-446083121
 
 
   Merged build finished. Test PASSed.




> RegressionEvaluator should use sample weight data
> -
>
> Key: SPARK-24102
> URL: https://issues.apache.org/jira/browse/SPARK-24102
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Ilya Matiach
>Priority: Major
>  Labels: starter
>
> The LogisticRegression and LinearRegression models support training with a 
> weight column, but the corresponding evaluators do not support computing 
> metrics using those weights. This breaks model selection using CrossValidator.






[jira] [Commented] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partial revert SPARK-21052:Add hash map metrics to join,

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716288#comment-16716288
 ] 

ASF GitHub Bot commented on SPARK-26316:


SparkQA commented on issue #23269: [SPARK-26316] Revert hash join metrics in 
spark 21052 that causes performance degradation 
URL: https://github.com/apache/spark/pull/23269#issuecomment-446083119
 
 
   **[Test build #99951 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99951/testReport)**
 for PR 23269 at commit 
[`a46d18e`](https://github.com/apache/spark/commit/a46d18e2a6ae822a1e1d903e54ab928096cb2339).




> Because of the perf degradation in TPC-DS, we currently partial revert 
> SPARK-21052:Add hash map metrics to join,
> 
>
> Key: SPARK-26316
> URL: https://issues.apache.org/jira/browse/SPARK-26316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> The code at 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>  in SPARK-21052 causes performance degradation in Spark 2.3. The results of 
> all TPC-DS queries at the 1 TB scale are in [TPC-DS 
> result|https://docs.google.com/spreadsheets/d/18a5BdOlmm8euTaRodyeWum9yu92mbWWu6JbhGXtr7yE/edit#gid=0]






[jira] [Commented] (SPARK-26318) Enhance function merge performance in Row

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716283#comment-16716283
 ] 

ASF GitHub Bot commented on SPARK-26318:


HyukjinKwon commented on issue #23271: [SPARK-26318][SQL] Enhance function 
merge performance in Row
URL: https://github.com/apache/spark/pull/23271#issuecomment-446082473
 
 
   +1 for deprecation.




> Enhance function merge performance in Row
> -----------------------------------------
>
> Key: SPARK-26318
> URL: https://issues.apache.org/jira/browse/SPARK-26318
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang Li
>Priority: Minor
>
> Enhance the performance of the merge function in Row.
> For example, one call to Row.merge for the input 
> val row1 = Row("name", "work", 2314, "null", 1, "") took 108458 
> milliseconds; after some enhancement it takes only 24967 milliseconds.






[jira] [Commented] (SPARK-26300) The `checkForStreaming` method may be called twice in `createQuery`

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716281#comment-16716281
 ] 

ASF GitHub Bot commented on SPARK-26300:


AmplabJenkins removed a comment on issue #23251: [SPARK-26300][SS] Remove a 
redundant `checkForStreaming` call
URL: https://github.com/apache/spark/pull/23251#issuecomment-446082207
 
 
   Merged build finished. Test PASSed.




> The `checkForStreaming` method may be called twice in `createQuery`
> -------------------------------------------------------------------
>
> Key: SPARK-26300
> URL: https://issues.apache.org/jira/browse/SPARK-26300
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> If {{checkForContinuous}} is called ({{checkForStreaming}} is invoked inside 
> {{checkForContinuous}}), the {{checkForStreaming}} method will be called 
> twice in {{createQuery}}. This is unnecessary, and since the 
> {{checkForStreaming}} method does a fair amount of work, it is better to 
> remove one of the calls.
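A hypothetical model of the call pattern described above (the names mirror the Scala methods, but this is not Spark's code): the query-creation path runs the streaming check directly, then the continuous-specific check repeats it.

```python
# Track how many times each validation runs.
calls = []

def check_for_streaming(plan):
    calls.append("streaming")  # stands in for the expensive validation

def check_for_continuous(plan):
    check_for_streaming(plan)  # already includes the streaming check
    calls.append("continuous")

def create_query(plan, continuous):
    check_for_streaming(plan)  # redundant when check_for_continuous follows
    if continuous:
        check_for_continuous(plan)

create_query("plan", continuous=True)
print(calls.count("streaming"))  # 2 — the streaming check ran twice
```

Removing the direct call (or skipping it on the continuous path) leaves each plan validated exactly once.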






[jira] [Commented] (SPARK-26300) The `checkForStreaming` method may be called twice in `createQuery`

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716266#comment-16716266
 ] 

ASF GitHub Bot commented on SPARK-26300:


SparkQA commented on issue #23251: [SPARK-26300][SS] Remove a redundant 
`checkForStreaming` call
URL: https://github.com/apache/spark/pull/23251#issuecomment-446081221
 
 
   **[Test build #99949 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99949/testReport)**
 for PR 23251 at commit 
[`b1e71ee`](https://github.com/apache/spark/commit/b1e71ee7a723d63f1cf3c0754f2372eb185439d3).




> The `checkForStreaming` method may be called twice in `createQuery`
> -------------------------------------------------------------------
>
> Key: SPARK-26300
> URL: https://issues.apache.org/jira/browse/SPARK-26300
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> If {{checkForContinuous}} is called ({{checkForStreaming}} is invoked inside 
> {{checkForContinuous}}), the {{checkForStreaming}} method will be called 
> twice in {{createQuery}}. This is unnecessary, and since the 
> {{checkForStreaming}} method does a fair amount of work, it is better to 
> remove one of the calls.






[jira] [Commented] (SPARK-26316) Because of the perf degradation in TPC-DS, we currently partially revert SPARK-21052: Add hash map metrics to join

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716280#comment-16716280
 ] 

ASF GitHub Bot commented on SPARK-26316:


JkSelf commented on a change in pull request #23269: [SPARK-26316] Revert hash 
join metrics in spark 21052 that causes performance degradation 
URL: https://github.com/apache/spark/pull/23269#discussion_r240481284
 
 

 ##
 File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
 ##
 @@ -62,8 +62,7 @@ case class HashAggregateExec(
 "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output 
rows"),
 "peakMemory" -> SQLMetrics.createSizeMetric(sparkContext, "peak memory"),
 "spillSize" -> SQLMetrics.createSizeMetric(sparkContext, "spill size"),
-"aggTime" -> SQLMetrics.createTimingMetric(sparkContext, "aggregate time"),
-"avgHashProbe" -> SQLMetrics.createAverageMetric(sparkContext, "avg hash 
probe"))
+"aggTime" -> SQLMetrics.createTimingMetric(sparkContext, "aggregate time"))
 
 Review comment:
   Yes, updated. Thanks.




> Because of the perf degradation in TPC-DS, we currently partially revert 
> SPARK-21052: Add hash map metrics to join
> 
>
> Key: SPARK-26316
> URL: https://issues.apache.org/jira/browse/SPARK-26316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> The code at 
> [L486|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L486]
>  and 
> [L487|https://github.com/apache/spark/blob/1d3dd58d21400b5652b75af7e7e53aad85a31528/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L487]
>  introduced in SPARK-21052 causes performance degradation in Spark 2.3. The results of 
> all TPC-DS queries at the 1 TB scale are in [TPC-DS 
> result|https://docs.google.com/spreadsheets/d/18a5BdOlmm8euTaRodyeWum9yu92mbWWu6JbhGXtr7yE/edit#gid=0]






[jira] [Commented] (SPARK-26300) The `checkForStreaming` method may be called twice in `createQuery`

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716282#comment-16716282
 ] 

ASF GitHub Bot commented on SPARK-26300:


AmplabJenkins removed a comment on issue #23251: [SPARK-26300][SS] Remove a 
redundant `checkForStreaming` call
URL: https://github.com/apache/spark/pull/23251#issuecomment-446082209
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5954/
   Test PASSed.




> The `checkForStreaming` method may be called twice in `createQuery`
> -------------------------------------------------------------------
>
> Key: SPARK-26300
> URL: https://issues.apache.org/jira/browse/SPARK-26300
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> If {{checkForContinuous}} is called ({{checkForStreaming}} is invoked inside 
> {{checkForContinuous}}), the {{checkForStreaming}} method will be called 
> twice in {{createQuery}}. This is unnecessary, and since the 
> {{checkForStreaming}} method does a fair amount of work, it is better to 
> remove one of the calls.






[jira] [Commented] (SPARK-26300) The `checkForStreaming` method may be called twice in `createQuery`

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716278#comment-16716278
 ] 

ASF GitHub Bot commented on SPARK-26300:


AmplabJenkins commented on issue #23251: [SPARK-26300][SS] Remove a redundant 
`checkForStreaming` call
URL: https://github.com/apache/spark/pull/23251#issuecomment-446082209
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5954/
   Test PASSed.




> The `checkForStreaming` method may be called twice in `createQuery`
> -------------------------------------------------------------------
>
> Key: SPARK-26300
> URL: https://issues.apache.org/jira/browse/SPARK-26300
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> If {{checkForContinuous}} is called ({{checkForStreaming}} is invoked inside 
> {{checkForContinuous}}), the {{checkForStreaming}} method will be called 
> twice in {{createQuery}}. This is unnecessary, and since the 
> {{checkForStreaming}} method does a fair amount of work, it is better to 
> remove one of the calls.






[jira] [Commented] (SPARK-26300) The `checkForStreaming` method may be called twice in `createQuery`

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716277#comment-16716277
 ] 

ASF GitHub Bot commented on SPARK-26300:


AmplabJenkins commented on issue #23251: [SPARK-26300][SS] Remove a redundant 
`checkForStreaming` call
URL: https://github.com/apache/spark/pull/23251#issuecomment-446082207
 
 
   Merged build finished. Test PASSed.




> The `checkForStreaming` method may be called twice in `createQuery`
> -------------------------------------------------------------------
>
> Key: SPARK-26300
> URL: https://issues.apache.org/jira/browse/SPARK-26300
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> If {{checkForContinuous}} is called ({{checkForStreaming}} is invoked inside 
> {{checkForContinuous}}), the {{checkForStreaming}} method will be called 
> twice in {{createQuery}}. This is unnecessary, and since the 
> {{checkForStreaming}} method does a fair amount of work, it is better to 
> remove one of the calls.






[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716272#comment-16716272
 ] 

ASF GitHub Bot commented on SPARK-25272:


AmplabJenkins removed a comment on issue #22273: [SPARK-25272][PYTHON][TEST] 
Add test to better indicate pyarrow is installed and related tests will run
URL: https://github.com/apache/spark/pull/22273#issuecomment-446081265
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5953/
   Test PASSed.




> Show some kind of test output to indicate pyarrow tests were run
> 
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Right now tests only output a status when they are skipped, and there is no 
> way to really tell from the logs that pyarrow tests, such as ArrowTests, have 
> been run except by the absence of a skipped message.  We can add a test that 
> is skipped if pyarrow is installed, which will produce output in our Jenkins 
> test runs.
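A sketch of the proposed companion test (assumed names, not the actual PySpark test): a test that is skipped only when pyarrow IS installed, so the test log mentions pyarrow whether or not the Arrow tests ran.

```python
import unittest

# Detect pyarrow the same way the real Arrow tests gate themselves.
try:
    import pyarrow  # noqa: F401
    have_arrow = True
except ImportError:
    have_arrow = False

class ArrowPresenceTests(unittest.TestCase):
    # Inverse of the usual guard: skipped (with a visible message) when
    # pyarrow is present, so its skip line confirms ArrowTests will run.
    @unittest.skipIf(have_arrow, "pyarrow is installed; ArrowTests will run")
    def test_pyarrow_missing(self):
        # Only runs when pyarrow is absent, mirroring the skipped Arrow tests.
        self.assertFalse(have_arrow)

suite = unittest.TestLoader().loadTestsFromTestCase(ArrowPresenceTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Either way, exactly one test is reported, so the log always carries an explicit pyarrow signal.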






[jira] [Commented] (SPARK-24102) RegressionEvaluator should use sample weight data

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716250#comment-16716250
 ] 

ASF GitHub Bot commented on SPARK-24102:


SparkQA removed a comment on issue #17085: [SPARK-24102][ML][MLLIB] ML 
Evaluators should use weight column - added weight column for regression 
evaluator
URL: https://github.com/apache/spark/pull/17085#issuecomment-446078542
 
 
   **[Test build #99948 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99948/testReport)**
 for PR 17085 at commit 
[`0480721`](https://github.com/apache/spark/commit/04807214d8694dcff7a2fe042457934e67eb8d57).




> RegressionEvaluator should use sample weight data
> -
>
> Key: SPARK-24102
> URL: https://issues.apache.org/jira/browse/SPARK-24102
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Ilya Matiach
>Priority: Major
>  Labels: starter
>
> The LogisticRegression and LinearRegression models support training with a 
> weight column, but the corresponding evaluators do not support computing 
> metrics using those weights. This breaks model selection using CrossValidator.
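As a small illustration (not Spark's implementation) of what "use sample weight data" means for a regression metric: a weighted RMSE down-weights or ignores points according to the weight column, and the unweighted metric is just the all-weights-equal-one case.

```python
import math

def weighted_rmse(labels, preds, weights=None):
    # Default to unit weights, which recovers plain RMSE.
    if weights is None:
        weights = [1.0] * len(labels)
    sq_err = sum(w * (y - p) ** 2 for y, p, w in zip(labels, preds, weights))
    return math.sqrt(sq_err / sum(weights))

# The second point has weight 0, so its error contributes nothing.
print(weighted_rmse([1.0, 2.0], [1.0, 4.0], weights=[1.0, 0.0]))  # 0.0
```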






[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716271#comment-16716271
 ] 

ASF GitHub Bot commented on SPARK-25272:


AmplabJenkins removed a comment on issue #22273: [SPARK-25272][PYTHON][TEST] 
Add test to better indicate pyarrow is installed and related tests will run
URL: https://github.com/apache/spark/pull/22273#issuecomment-446081262
 
 
   Merged build finished. Test PASSed.




> Show some kind of test output to indicate pyarrow tests were run
> 
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Right now tests only output a status when they are skipped, and there is no 
> way to really tell from the logs that pyarrow tests, such as ArrowTests, have 
> been run except by the absence of a skipped message.  We can add a test that 
> is skipped if pyarrow is installed, which will produce output in our Jenkins 
> test runs.






[jira] [Commented] (SPARK-26327) Metrics in FileSourceScanExec not update correctly

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716270#comment-16716270
 ] 

ASF GitHub Bot commented on SPARK-26327:


HyukjinKwon commented on issue #23277: [SPARK-26327][SQL] Metrics in 
FileSourceScanExec not update correctly
URL: https://github.com/apache/spark/pull/23277#issuecomment-446081444
 
 
   Looks fine to me




> Metrics in FileSourceScanExec not update correctly
> --
>
> Key: SPARK-26327
> URL: https://issues.apache.org/jira/browse/SPARK-26327
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuanjian Li
>Priority: Major
>
> In the current approach in `FileSourceScanExec`, the "numFiles" and 
> "metadataTime" (fileListingTime) metrics are updated when the lazy val 
> `selectedPartitions` is initialized. But `selectedPartitions` is first 
> initialized by `metadata`, which is called by `queryExecution.toString` in 
> `SQLExecution.withNewExecutionId`. So when `SQLMetrics.postDriverMetricUpdates` 
> is called, there is no corresponding live execution in SQLAppStatusListener, 
> and the metric update does not take effect.
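A minimal model of the ordering bug described above (hypothetical names, not Spark's classes): the metric is posted as a side effect of the first access to a cached property, but that first access happens before the listener has registered the execution, so the update is silently dropped.

```python
posted = []
live_executions = set()

def post_driver_metric_update(metric):
    # Like the listener described above: updates for unknown executions
    # are ignored rather than queued.
    if "exec-1" in live_executions:
        posted.append(metric)

class Scan:
    def __init__(self):
        self._partitions = None

    @property
    def selected_partitions(self):  # stands in for the Scala lazy val
        if self._partitions is None:
            self._partitions = ["p0", "p1"]
            # Side effect fires only on the very first access.
            post_driver_metric_update(("numFiles", 2))
        return self._partitions

scan = Scan()
scan.selected_partitions        # triggered early, e.g. while printing the plan
live_executions.add("exec-1")   # execution becomes live only afterwards
scan.selected_partitions        # cached now; the metric is never re-posted
print(posted)  # [] — the update was dropped
```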






[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716269#comment-16716269
 ] 

ASF GitHub Bot commented on SPARK-25272:


AmplabJenkins commented on issue #22273: [SPARK-25272][PYTHON][TEST] Add test 
to better indicate pyarrow is installed and related tests will run
URL: https://github.com/apache/spark/pull/22273#issuecomment-446081265
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5953/
   Test PASSed.




> Show some kind of test output to indicate pyarrow tests were run
> 
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Right now tests only output a status when they are skipped, and there is no 
> way to really tell from the logs that pyarrow tests, such as ArrowTests, have 
> been run except by the absence of a skipped message.  We can add a test that 
> is skipped if pyarrow is installed, which will produce output in our Jenkins 
> test runs.






[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716268#comment-16716268
 ] 

ASF GitHub Bot commented on SPARK-25272:


AmplabJenkins commented on issue #22273: [SPARK-25272][PYTHON][TEST] Add test 
to better indicate pyarrow is installed and related tests will run
URL: https://github.com/apache/spark/pull/22273#issuecomment-446081262
 
 
   Merged build finished. Test PASSed.




> Show some kind of test output to indicate pyarrow tests were run
> 
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Right now tests only output a status when they are skipped, and there is no 
> way to really tell from the logs that pyarrow tests, such as ArrowTests, have 
> been run except by the absence of a skipped message.  We can add a test that 
> is skipped if pyarrow is installed, which will produce output in our Jenkins 
> test runs.






[jira] [Commented] (SPARK-26300) The `checkForStreaming` method may be called twice in `createQuery`

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716265#comment-16716265
 ] 

ASF GitHub Bot commented on SPARK-26300:


dongjoon-hyun commented on issue #23251: [SPARK-26300][SS] Remove a redundant 
`checkForStreaming` call
URL: https://github.com/apache/spark/pull/23251#issuecomment-446081142
 
 
   cc @tdas , too.




> The `checkForStreaming` method may be called twice in `createQuery`
> -------------------------------------------------------------------
>
> Key: SPARK-26300
> URL: https://issues.apache.org/jira/browse/SPARK-26300
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> If {{checkForContinuous}} is called ({{checkForStreaming}} is invoked inside 
> {{checkForContinuous}}), the {{checkForStreaming}} method will be called 
> twice in {{createQuery}}. This is unnecessary, and since the 
> {{checkForStreaming}} method does a fair amount of work, it is better to 
> remove one of the calls.






[jira] [Commented] (SPARK-26300) The `checkForStreaming` method may be called twice in `createQuery`

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716264#comment-16716264
 ] 

ASF GitHub Bot commented on SPARK-26300:


dongjoon-hyun commented on issue #23251: [SPARK-26300][SS] Remove a redundant 
`checkForStreaming` call
URL: https://github.com/apache/spark/pull/23251#issuecomment-446081024
 
 
   Retest this please.




> The `checkForStreaming` method may be called twice in `createQuery`
> -------------------------------------------------------------------
>
> Key: SPARK-26300
> URL: https://issues.apache.org/jira/browse/SPARK-26300
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> If {{checkForContinuous}} is called ({{checkForStreaming}} is invoked inside 
> {{checkForContinuous}}), the {{checkForStreaming}} method will be called 
> twice in {{createQuery}}. This is unnecessary, and since the 
> {{checkForStreaming}} method does a fair amount of work, it is better to 
> remove one of the calls.






[jira] [Commented] (SPARK-26327) Metrics in FileSourceScanExec not update correctly

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716261#comment-16716261
 ] 

ASF GitHub Bot commented on SPARK-26327:


HyukjinKwon commented on a change in pull request #23277: [SPARK-26327][SQL] 
Metrics in FileSourceScanExec not update correctly
URL: https://github.com/apache/spark/pull/23277#discussion_r240480046
 
 

 ##
 File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
 ##
 @@ -316,7 +313,7 @@ case class FileSourceScanExec(
   override lazy val metrics =
 Map("numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of 
output rows"),
   "numFiles" -> SQLMetrics.createMetric(sparkContext, "number of files"),
-  "metadataTime" -> SQLMetrics.createMetric(sparkContext, "metadata time 
(ms)"),
+  "fileListingTime" -> SQLMetrics.createMetric(sparkContext, "file listing 
time (ms)"),
 
 Review comment:
   Yea, please fix PR description and title accordingly.




> Metrics in FileSourceScanExec not update correctly
> --
>
> Key: SPARK-26327
> URL: https://issues.apache.org/jira/browse/SPARK-26327
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuanjian Li
>Priority: Major
>
> In the current approach in `FileSourceScanExec`, the "numFiles" and 
> "metadataTime" (fileListingTime) metrics are updated when the lazy val 
> `selectedPartitions` is initialized. But `selectedPartitions` is first 
> initialized by `metadata`, which is called by `queryExecution.toString` in 
> `SQLExecution.withNewExecutionId`. So when `SQLMetrics.postDriverMetricUpdates` 
> is called, there is no corresponding live execution in SQLAppStatusListener, 
> and the metric update does not take effect.






[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716256#comment-16716256
 ] 

ASF GitHub Bot commented on SPARK-25272:


BryanCutler commented on issue #22273: [SPARK-25272][PYTHON][TEST] Add test to 
better indicate pyarrow is installed and related tests will run
URL: https://github.com/apache/spark/pull/22273#issuecomment-446080278
 
 
   retest this please




> Show some kind of test output to indicate pyarrow tests were run
> 
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Right now tests only output status when they are skipped and there is no way 
> to really see from the logs that pyarrow tests, like ArrowTests, have been 
> run except by the absence of a skipped message.  We can add a test that is 
> skipped if pyarrow is installed, which will give an output in our Jenkins 
> test.






[jira] [Commented] (SPARK-24102) RegressionEvaluator should use sample weight data

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716253#comment-16716253
 ] 

ASF GitHub Bot commented on SPARK-24102:


AmplabJenkins removed a comment on issue #17085: [SPARK-24102][ML][MLLIB] ML 
Evaluators should use weight column - added weight column for regression 
evaluator
URL: https://github.com/apache/spark/pull/17085#issuecomment-446079822
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99948/
   Test FAILed.




> RegressionEvaluator should use sample weight data
> -
>
> Key: SPARK-24102
> URL: https://issues.apache.org/jira/browse/SPARK-24102
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Ilya Matiach
>Priority: Major
>  Labels: starter
>
> The LogisticRegression and LinearRegression models support training with a 
> weight column, but the corresponding evaluators do not support computing 
> metrics using those weights. This breaks model selection using CrossValidator.






[jira] [Commented] (SPARK-24102) RegressionEvaluator should use sample weight data

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716251#comment-16716251
 ] 

ASF GitHub Bot commented on SPARK-24102:


AmplabJenkins removed a comment on issue #17085: [SPARK-24102][ML][MLLIB] ML 
Evaluators should use weight column - added weight column for regression 
evaluator
URL: https://github.com/apache/spark/pull/17085#issuecomment-446079818
 
 
   Merged build finished. Test FAILed.










[jira] [Commented] (SPARK-24102) RegressionEvaluator should use sample weight data

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716248#comment-16716248
 ] 

ASF GitHub Bot commented on SPARK-24102:


AmplabJenkins commented on issue #17085: [SPARK-24102][ML][MLLIB] ML Evaluators 
should use weight column - added weight column for regression evaluator
URL: https://github.com/apache/spark/pull/17085#issuecomment-446079818
 
 
   Merged build finished. Test FAILed.










[jira] [Commented] (SPARK-24102) RegressionEvaluator should use sample weight data

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716231#comment-16716231
 ] 

ASF GitHub Bot commented on SPARK-24102:


SparkQA commented on issue #17085: [SPARK-24102][ML][MLLIB] ML Evaluators 
should use weight column - added weight column for regression evaluator
URL: https://github.com/apache/spark/pull/17085#issuecomment-446078542
 
 
   **[Test build #99948 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99948/testReport)**
 for PR 17085 at commit 
[`0480721`](https://github.com/apache/spark/commit/04807214d8694dcff7a2fe042457934e67eb8d57).









