[jira] [Updated] (SPARK-29835) Remove the unnecessary conversion from Statement to LogicalPlan for DELETE/UPDATE

2019-11-10 Thread Xianyin Xin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianyin Xin updated SPARK-29835:

Summary: Remove the unnecessary conversion from Statement to LogicalPlan 
for DELETE/UPDATE  (was: Remove the unneeded conversion from Statement to 
LogicalPlan for DELETE/UPDATE)

> Remove the unnecessary conversion from Statement to LogicalPlan for 
> DELETE/UPDATE
> -
>
> Key: SPARK-29835
> URL: https://issues.apache.org/jira/browse/SPARK-29835
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xianyin Xin
>Priority: Major
>
> The current parse-and-analyze flow for DELETE is: 1) the SQL string is 
> first parsed into a `DeleteFromStatement`; 2) the `DeleteFromStatement` is 
> then converted into a `DeleteFromTable`. However, the SQL string can be 
> parsed into `DeleteFromTable` directly, which makes the intermediate 
> `DeleteFromStatement` redundant.
> The same applies to UPDATE.






[jira] [Created] (SPARK-29835) Remove the unneeded conversion from Statement to LogicalPlan for DELETE/UPDATE

2019-11-10 Thread Xianyin Xin (Jira)
Xianyin Xin created SPARK-29835:
---

 Summary: Remove the unneeded conversion from Statement to 
LogicalPlan for DELETE/UPDATE
 Key: SPARK-29835
 URL: https://issues.apache.org/jira/browse/SPARK-29835
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Xianyin Xin


The current parse-and-analyze flow for DELETE is: 1) the SQL string is first 
parsed into a `DeleteFromStatement`; 2) the `DeleteFromStatement` is then 
converted into a `DeleteFromTable`. However, the SQL string can be parsed into 
`DeleteFromTable` directly, which makes the intermediate `DeleteFromStatement` 
redundant.

The same applies to UPDATE.
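
To make the proposed shape concrete, here is a minimal, self-contained sketch 
of the idea in Scala; the case classes and the helper are simplified stand-ins, 
not the actual Spark parser code:
{code:scala}
// Simplified stand-ins for the analyzer-side nodes (not the real Spark classes).
case class UnresolvedRelation(multipartName: Seq[String])
case class DeleteFromTable(table: UnresolvedRelation, condition: Option[String])

// Before: SQL text -> DeleteFromStatement -> conversion rule -> DeleteFromTable
// After:  SQL text -> DeleteFromTable directly (the conversion step disappears)
def parseDelete(tableName: Seq[String], whereClause: Option[String]): DeleteFromTable =
  DeleteFromTable(UnresolvedRelation(tableName), whereClause)
{code}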






[jira] [Resolved] (SPARK-29421) Using 'USING provider' to specify a different table provider in CREATE TABLE LIKE

2019-11-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29421.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26097
[https://github.com/apache/spark/pull/26097]

> Using 'USING provider' to specify a different table provider in CREATE TABLE 
> LIKE
> -
>
> Key: SPARK-29421
> URL: https://issues.apache.org/jira/browse/SPARK-29421
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Major
> Fix For: 3.0.0
>
>
> The CREATE TABLE tb1 LIKE tb2 command creates an empty table tb1 based on 
> the definition of table tb2. The most common use case is to create tb1 with 
> the same schema as tb2. An inconvenience is that this command also copies 
> the FileFormat from tb2; it cannot change the input/output format or the 
> serde. The ability to change the file format is useful for scenarios such as 
> upgrading a table from a low-performance file format to a high-performance 
> one (Parquet, ORC).
> Hive supports specifying a new file format with STORED AS:
> {code}
> CREATE TABLE tbl(a int) STORED AS TEXTFILE;
> CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET;
> {code}
> We add a similar syntax for Spark. We split this into two features:
> 1. specify a different table provider in CREATE TABLE LIKE
> 2. Hive compatibility
> In this PR, we address the first one:
> Using `USING provider` to specify a different table provider in CREATE TABLE 
> LIKE.
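
A hedged usage sketch of the syntax this ticket adds (assuming a running 
SparkSession `spark`; the table names and the csv/parquet providers are only 
examples):
{code:scala}
// Create a source table with some provider.
spark.sql("CREATE TABLE src (a INT) USING csv")
// Create an empty table with src's schema, but backed by the Parquet provider.
spark.sql("CREATE TABLE dst LIKE src USING parquet")
{code}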






[jira] [Assigned] (SPARK-29421) Using 'USING provider' to specify a different table provider in CREATE TABLE LIKE

2019-11-10 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29421:
---

Assignee: Lantao Jin

> Using 'USING provider' to specify a different table provider in CREATE TABLE 
> LIKE
> -
>
> Key: SPARK-29421
> URL: https://issues.apache.org/jira/browse/SPARK-29421
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Major
>
> The CREATE TABLE tb1 LIKE tb2 command creates an empty table tb1 based on 
> the definition of table tb2. The most common use case is to create tb1 with 
> the same schema as tb2. An inconvenience is that this command also copies 
> the FileFormat from tb2; it cannot change the input/output format or the 
> serde. The ability to change the file format is useful for scenarios such as 
> upgrading a table from a low-performance file format to a high-performance 
> one (Parquet, ORC).
> Hive supports specifying a new file format with STORED AS:
> {code}
> CREATE TABLE tbl(a int) STORED AS TEXTFILE;
> CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET;
> {code}
> We add a similar syntax for Spark. We split this into two features:
> 1. specify a different table provider in CREATE TABLE LIKE
> 2. Hive compatibility
> In this PR, we address the first one:
> Using `USING provider` to specify a different table provider in CREATE TABLE 
> LIKE.






[jira] [Created] (SPARK-29834) DESC DATABASE should look up catalog like v2 commands

2019-11-10 Thread Hu Fuwang (Jira)
Hu Fuwang created SPARK-29834:
-

 Summary: DESC DATABASE should look up catalog like v2 commands
 Key: SPARK-29834
 URL: https://issues.apache.org/jira/browse/SPARK-29834
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hu Fuwang









[jira] [Commented] (SPARK-29834) DESC DATABASE should look up catalog like v2 commands

2019-11-10 Thread Hu Fuwang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971345#comment-16971345
 ] 

Hu Fuwang commented on SPARK-29834:
---

Working on this.

> DESC DATABASE should look up catalog like v2 commands
> -
>
> Key: SPARK-29834
> URL: https://issues.apache.org/jira/browse/SPARK-29834
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hu Fuwang
>Priority: Major
>







[jira] [Commented] (SPARK-29827) Wrong persist strategy in mllib.clustering.BisectingKMeans.run

2019-11-10 Thread shahid (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971336#comment-16971336
 ] 

shahid commented on SPARK-29827:


I would like to analyze this issue

> Wrong persist strategy in mllib.clustering.BisectingKMeans.run
> --
>
> Key: SPARK-29827
> URL: https://issues.apache.org/jira/browse/SPARK-29827
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> There are three persist misuses in mllib.clustering.BisectingKMeans.run.
>  * First, the rdd _input_ should be persisted, because it is used not only 
> by the action _first()_ but also by other actions later in the method.
>  * Second, the rdd _assignments_ should be persisted. It is used more than 
> once in the function _summarize()_, which contains an action on 
> _assignments_.
>  * Third, once the rdd _assignments_ is persisted, persisting the rdd 
> _norms_ becomes unnecessary: _norms_ is an intermediate rdd, and since its 
> child rdd _assignments_ is persisted, there is no need to persist _norms_ as 
> well.
> {code:scala}
>   private[spark] def run(
>   input: RDD[Vector],
>   instr: Option[Instrumentation]): BisectingKMeansModel = {
> if (input.getStorageLevel == StorageLevel.NONE) {
>   logWarning(s"The input RDD ${input.id} is not directly cached, which 
> may hurt performance if"
> + " its parent RDDs are also not cached.")
> }
> // Needs to persist input
> val d = input.map(_.size).first() 
> logInfo(s"Feature dimension: $d.")
> val dMeasure: DistanceMeasure = 
> DistanceMeasure.decodeFromString(this.distanceMeasure)
> // Compute and cache vector norms for fast distance computation.
> val norms = input.map(v => Vectors.norm(v, 
> 2.0)).persist(StorageLevel.MEMORY_AND_DISK)  // Unnecessary persist
> val vectors = input.zip(norms).map { case (x, norm) => new 
> VectorWithNorm(x, norm) }
> var assignments = vectors.map(v => (ROOT_INDEX, v))  // Needs to persist
> var activeClusters = summarize(d, assignments, dMeasure)
> {code}
> This issue is reported by our tool CacheCheck, which dynamically detects 
> persist()/unpersist() API misuses.
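
A minimal, self-contained sketch of the persist strategy suggested above 
(illustrative only, not the actual Spark patch; VectorWithNorm and ROOT_INDEX 
are simplified stand-ins for the private mllib definitions, and the storage 
level is assumed):
{code:scala}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object BisectingKMeansPersistSketch {
  case class VectorWithNorm(vector: Vector, norm: Double)
  val ROOT_INDEX: Long = 1L

  // Persist the RDDs that feed more than one action (input, assignments) and
  // drop the persist on the intermediate RDD norms.
  def prepare(input: RDD[Vector]): RDD[(Long, VectorWithNorm)] = {
    input.persist(StorageLevel.MEMORY_AND_DISK)       // reused by first() and later actions
    val d = input.map(_.size).first()                 // feature dimension, as in run()
    val norms = input.map(v => Vectors.norm(v, 2.0))  // intermediate only, no persist needed
    input.zip(norms)
      .map { case (x, n) => (ROOT_INDEX, VectorWithNorm(x, n)) }
      .persist(StorageLevel.MEMORY_AND_DISK)          // reused by summarize() more than once
  }
}
{code}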






[jira] [Commented] (SPARK-28307) Support Interval comparison

2019-11-10 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971325#comment-16971325
 ] 

Kent Yao commented on SPARK-28307:
--

Resolved by the PR: [https://github.com/apache/spark/pull/26337]

> Support Interval comparison
> ---
>
> Key: SPARK-28307
> URL: https://issues.apache.org/jira/browse/SPARK-28307
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> {code}
> postgres=# SELECT INTERVAL '0:0' HOUR TO SECOND = INTERVAL '0:0' HOUR TO 
> SECOND;
>  ?column?
> --
>  t
> (1 row)
> {code}
> {code}
> spark-sql> SELECT INTERVAL '0:0' HOUR TO SECOND = INTERVAL '0:0' HOUR TO 
> SECOND;
> Error in query: cannot resolve '(interval 0 microseconds = interval 0 
> microseconds)' due to data type mismatch: EqualTo does not support ordering 
> on type calendarinterval; line 1 pos 7;
> 'Project [unresolvedalias((interval 0 microseconds = interval 0 
> microseconds), None)]
> +- OneRowRelation
> {code}
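
Since the linked PR adds comparison support for intervals, the failing query 
above is expected to work after the change. A hedged check (assuming a Spark 
build that includes the fix and a SparkSession `spark`):
{code:scala}
// Expected to return true instead of failing analysis with
// "EqualTo does not support ordering on type calendarinterval".
spark.sql("SELECT INTERVAL '0:0' HOUR TO SECOND = INTERVAL '0:0' HOUR TO SECOND").show()
{code}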






[jira] [Updated] (SPARK-29833) Add FileNotFoundException check for spark.yarn.jars

2019-11-10 Thread ulysses you (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ulysses you updated SPARK-29833:

Summary: Add FileNotFoundException check  for spark.yarn.jars  (was: Add 
FileNotFoundException for spark.yarn.jars)

> Add FileNotFoundException check  for spark.yarn.jars
> 
>
> Key: SPARK-29833
> URL: https://issues.apache.org/jira/browse/SPARK-29833
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.4
>Reporter: ulysses you
>Priority: Minor
>
> When `spark.yarn.jars=/xxx/xxx` is set to a path without a scheme, Spark 
> throws a NullPointerException.
> The reason is that HDFS returns null when the path passed to 
> pathFs.globStatus(path) does not exist, and Spark just calls 
> `pathFs.globStatus(path).filter(_.isFile())` without checking for null.
> The related Globber code is:
> {noformat}
> /*
>  * When the input pattern "looks" like just a simple filename, and we
>  * can't find it, we return null rather than an empty array.
>  * This is a special case which the shell relies on.
>  *
>  * To be more precise: if there were no results, AND there were no
>  * groupings (aka brackets), and no wildcards in the input (aka stars),
>  * we return null.
>  */
> if ((!sawWildcard) && results.isEmpty() &&
> (flattenedPatterns.size() <= 1)) {
>   return null;
> }
> {noformat}






[jira] [Created] (SPARK-29833) Add FileNotFoundException for spark.yarn.jars

2019-11-10 Thread ulysses you (Jira)
ulysses you created SPARK-29833:
---

 Summary: Add FileNotFoundException for spark.yarn.jars
 Key: SPARK-29833
 URL: https://issues.apache.org/jira/browse/SPARK-29833
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 2.4.4
Reporter: ulysses you


When `spark.yarn.jars=/xxx/xxx` is set to a path without a scheme, Spark 
throws a NullPointerException.

The reason is that HDFS returns null when the path passed to 
pathFs.globStatus(path) does not exist, and Spark just calls 
`pathFs.globStatus(path).filter(_.isFile())` without checking for null.

The related Globber code is:
{noformat}
/*
 * When the input pattern "looks" like just a simple filename, and we
 * can't find it, we return null rather than an empty array.
 * This is a special case which the shell relies on.
 *
 * To be more precise: if there were no results, AND there were no
 * groupings (aka brackets), and no wildcards in the input (aka stars),
 * we return null.
 */
if ((!sawWildcard) && results.isEmpty() &&
(flattenedPatterns.size() <= 1)) {
  return null;
}
{noformat}
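
A minimal sketch of the kind of null check this ticket asks for (the helper 
name is illustrative, not the actual YARN Client code):
{code:scala}
import java.io.FileNotFoundException
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// globStatus() may return null for a non-existent, wildcard-free path, so guard it
// and raise FileNotFoundException instead of hitting an NPE in the .filter call.
def globJarFiles(pathFs: FileSystem, path: Path): Array[FileStatus] = {
  val matched = Option(pathFs.globStatus(path)).getOrElse(
    throw new FileNotFoundException(s"Path $path does not exist (check spark.yarn.jars)"))
  matched.filter(_.isFile())
}
{code}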






[jira] [Created] (SPARK-29832) Unnecessary persist on instances in ml.regression.IsotonicRegression.fit

2019-11-10 Thread Dong Wang (Jira)
Dong Wang created SPARK-29832:
-

 Summary: Unnecessary persist on instances in 
ml.regression.IsotonicRegression.fit
 Key: SPARK-29832
 URL: https://issues.apache.org/jira/browse/SPARK-29832
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: Dong Wang


Persisting `instances` in ml.regression.IsotonicRegression.fit() is 
unnecessary, because it is only used once, in run(instances).
{code:scala}
  override def fit(dataset: Dataset[_]): IsotonicRegressionModel = instrumented 
{ instr =>
transformSchema(dataset.schema, logging = true)
// Extract columns from data.  If dataset is persisted, do not persist 
oldDataset.
val instances = extractWeightedLabeledPoints(dataset)
val handlePersistence = dataset.storageLevel == StorageLevel.NONE
// Unnecessary persist
if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
instr.logPipelineStage(this)
instr.logDataset(dataset)
instr.logParams(this, labelCol, featuresCol, weightCol, predictionCol, 
featureIndex, isotonic)
instr.logNumFeatures(1)
val isotonicRegression = new 
MLlibIsotonicRegression().setIsotonic($(isotonic))
val oldModel = isotonicRegression.run(instances) // Only use once here
if (handlePersistence) instances.unpersist()
{code}

This issue is reported by our tool CacheCheck, which dynamically detects 
persist()/unpersist() API misuses.






[jira] [Created] (SPARK-29831) Scan Hive partitioned table should not dramatically increase data parallelism

2019-11-10 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-29831:
---

 Summary: Scan Hive partitioned table should not dramatically 
increase data parallelism
 Key: SPARK-29831
 URL: https://issues.apache.org/jira/browse/SPARK-29831
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


The Hive table scan operator reads each Hive partition as a HadoopRDD and 
unions all the RDDs. The data parallelism of the resulting RDD can increase 
dramatically when reading many partitions that contain many files.

Although users can also coalesce by themselves (see the sketch below), this 
ticket proposes adding a config that limits the maximum data parallelism, 
because:

1. End users might not understand the details and may be confused by the large 
partition number; they might not know why/when/where to add coalesce.
2. Users would need to add coalesce after every Hive table scan, which is 
annoying.
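
A hedged illustration of the manual workaround only (the table name and the 
1000-partition cap are made up); the config proposed here would make this 
per-query coalesce unnecessary:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val scanned = spark.table("db.partitioned_hive_table") // can yield a huge number of partitions
val capped = scanned.coalesce(1000)                    // manually cap the data parallelism
capped.count()
{code}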









[jira] [Updated] (SPARK-29805) Enable nested schema pruning and pruning on expressions by default

2019-11-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29805:
-
Component/s: (was: Spark Core)
 SQL

> Enable nested schema pruning and pruning on expressions by default
> --
>
> Key: SPARK-29805
> URL: https://issues.apache.org/jira/browse/SPARK-29805
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>







[jira] [Updated] (SPARK-29421) Using 'USING provider' to specify a different table provider in CREATE TABLE LIKE

2019-11-10 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-29421:
---
Summary: Using 'USING provider' to specify a different table provider in 
CREATE TABLE LIKE  (was: Add an opportunity to change the file format of 
command CREATE TABLE LIKE)

> Using 'USING provider' to specify a different table provider in CREATE TABLE 
> LIKE
> -
>
> Key: SPARK-29421
> URL: https://issues.apache.org/jira/browse/SPARK-29421
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> The CREATE TABLE tb1 LIKE tb2 command creates an empty table tb1 based on 
> the definition of table tb2. The most common use case is to create tb1 with 
> the same schema as tb2. An inconvenience is that this command also copies 
> the FileFormat from tb2; it cannot change the input/output format or the 
> serde. The ability to change the file format is useful for scenarios such as 
> upgrading a table from a low-performance file format to a high-performance 
> one (Parquet, ORC).
> Hive supports specifying a new file format with STORED AS:
> {code}
> CREATE TABLE tbl(a int) STORED AS TEXTFILE;
> CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET;
> {code}
> We add a similar syntax for Spark. We split this into two features:
> 1. specify a different table provider in CREATE TABLE LIKE
> 2. Hive compatibility
> In this PR, we address the first one:
> Using `USING provider` to specify a different table provider in CREATE TABLE 
> LIKE.






[jira] [Updated] (SPARK-29421) Add an opportunity to change the file format of command CREATE TABLE LIKE

2019-11-10 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-29421:
---
Description: 
The CREATE TABLE tb1 LIKE tb2 command creates an empty table tb1 based on the 
definition of table tb2. The most common use case is to create tb1 with the 
same schema as tb2. An inconvenience is that this command also copies the 
FileFormat from tb2; it cannot change the input/output format or the serde. 
The ability to change the file format is useful for scenarios such as 
upgrading a table from a low-performance file format to a high-performance one 
(Parquet, ORC).

Hive supports specifying a new file format with STORED AS:
{code}
CREATE TABLE tbl(a int) STORED AS TEXTFILE;
CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET;
{code}
We add a similar syntax for Spark. We split this into two features:
1. specify a different table provider in CREATE TABLE LIKE
2. Hive compatibility

In this PR, we address the first one:
Using `USING provider` to specify a different table provider in CREATE TABLE 
LIKE.

  was:
The CREATE TABLE tb1 LIKE tb2 command creates an empty table tb1 based on the 
definition of table tb2. The most common use case is to create tb1 with the 
same schema as tb2. An inconvenience is that this command also copies the 
FileFormat from tb2; it cannot change the input/output format or the serde. 
The ability to change the file format is useful for scenarios such as 
upgrading a table from a low-performance file format to a high-performance one 
(Parquet, ORC).

There are two options to enhance it.
Option 1: add a configuration {{spark.sql.createTableLike.fileformat}} whose 
default value is "none", which keeps the current behaviour of copying the file 
format from the source table. After running SET 
spark.sql.createTableLike.fileformat=parquet (or any other valid file format 
defined in {{HiveSerDe}}), {{CREATE TABLE ... LIKE}} will use the new file 
format type.

Option 2: add a {{USING fileformat}} syntax after {{CREATE TABLE ... LIKE}}. 
For example,
{code}
CREATE TABLE tb1 LIKE tb2 USING parquet;
{code}
If the USING keyword is omitted, the behaviour also stays the same as today: 
the file format is copied from the source table.

Both options preserve the current behaviour by default.
We use option 1 with the Parquet file format in our production Thrift server, 
because we need to change the behaviour of many existing SQL scripts without 
modifying them. For the community, option 2 could be treated as a new feature, 
since it requires the user to write an additional USING clause.

cc [~dongjoon] [~hyukjin.kwon] [~joshrosen] [~cloud_fan] [~yumwang]


> Add an opportunity to change the file format of command CREATE TABLE LIKE
> -
>
> Key: SPARK-29421
> URL: https://issues.apache.org/jira/browse/SPARK-29421
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> The CREATE TABLE tb1 LIKE tb2 command creates an empty table tb1 based on 
> the definition of table tb2. The most common use case is to create tb1 with 
> the same schema as tb2. An inconvenience is that this command also copies 
> the FileFormat from tb2; it cannot change the input/output format or the 
> serde. The ability to change the file format is useful for scenarios such as 
> upgrading a table from a low-performance file format to a high-performance 
> one (Parquet, ORC).
> Hive supports specifying a new file format with STORED AS:
> {code}
> CREATE TABLE tbl(a int) STORED AS TEXTFILE;
> CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET;
> {code}
> We add a similar syntax for Spark. We split this into two features:
> 1. specify a different table provider in CREATE TABLE LIKE
> 2. Hive compatibility
> In this PR, we address the first one:
> Using `USING provider` to specify a different table provider in CREATE TABLE 
> LIKE.






[jira] [Updated] (SPARK-29396) Extend Spark plugin interface to driver

2019-11-10 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-29396:

Labels: release-notes  (was: )

> Extend Spark plugin interface to driver
> ---
>
> Key: SPARK-29396
> URL: https://issues.apache.org/jira/browse/SPARK-29396
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Masiero Vanzin
>Priority: Major
>  Labels: release-notes
>
> Spark provides an extension API for people to implement executor plugins, 
> added in SPARK-24918 and later extended in SPARK-28091.
> That API does not offer any functionality for doing similar things on the 
> driver side, though. As a consequence of that, there is not a good way for 
> the executor plugins to get information or communicate in any way with the 
> Spark driver.
> I've been playing with such an improved API for developing some new 
> functionality. I'll file a few child bugs for the work to get the changes in.






[jira] [Updated] (SPARK-28091) Extend Spark metrics system with user-defined metrics using executor plugins

2019-11-10 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28091:

Labels: release-notes  (was: )

> Extend Spark metrics system with user-defined metrics using executor plugins
> 
>
> Key: SPARK-28091
> URL: https://issues.apache.org/jira/browse/SPARK-28091
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> This proposes to improve Spark instrumentation by adding a hook for 
> user-defined metrics, extending Spark’s Dropwizard/Codahale metrics system.
> The original motivation of this work was to add instrumentation for S3 
> filesystem access metrics from Spark jobs. Currently, [[ExecutorSource]] 
> instruments HDFS and local filesystem metrics. Rather than extending the 
> code there, this JIRA proposes to add a metrics plugin system that is more 
> flexible and of general use.
> Context: the Spark metrics system provides a large variety of metrics that 
> are useful for monitoring and troubleshooting Spark workloads. A typical 
> workflow is to sink the metrics to a storage system and build dashboards on 
> top of that.
> Highlights:
>  * The metrics plugin system makes it easy to implement instrumentation for 
> S3 access by Spark jobs.
>  * The metrics plugin system allows for easy extension of how Spark collects 
> HDFS-related workload metrics. This is currently done using the Hadoop 
> Filesystem GetAllStatistics method, which is deprecated in recent versions 
> of Hadoop. Recent versions of the Hadoop Filesystem recommend the method 
> GetGlobalStorageStatistics, which also provides several additional metrics. 
> GetGlobalStorageStatistics is not available in Hadoop 2.7 (it was introduced 
> in Hadoop 2.8). A metrics plugin for Spark would give those deploying 
> suitable Hadoop versions an easy way to “opt in” to such new API calls.
>  * We also have the use case of adding Hadoop filesystem monitoring for a 
> custom Hadoop-compliant filesystem in use in our organization (EOS using the 
> XRootD protocol). The metrics plugin infrastructure makes this easy to do. 
> Others may have similar use cases.
>  * More generally, this method makes it straightforward to plug Filesystem 
> and other metrics into the Spark monitoring system. Future work on plugin 
> implementations can extend monitoring to measure the usage of external 
> resources (OS, filesystem, network, accelerator cards, etc.) that would not 
> normally be considered general enough for inclusion in Apache Spark code, 
> but that can nevertheless be useful for specialized use cases, tests or 
> troubleshooting.
> Implementation:
> The proposed implementation builds on top of the Executor Plugin work of 
> SPARK-24918 and on recent work extending Spark executor metrics, such as 
> SPARK-25228.
> Tests and examples:
> So far this has been manually tested by running Spark on YARN and K8S 
> clusters, in particular for monitoring S3 and for extending HDFS 
> instrumentation with the Hadoop Filesystem “GetGlobalStorageStatistics” 
> metrics. An executor metrics plugin example and the code used for testing 
> are available.
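
To make the idea concrete, a hedged sketch of what a user-defined metric hook 
could look like against the Dropwizard/Codahale API; the class name and the 
`registerMetrics` callback are illustrative stand-ins, not the plugin API 
introduced by this work:
{code:scala}
import com.codahale.metrics.{Gauge, MetricRegistry}

// Sketch only: user code registering a custom gauge (e.g. S3 bytes read) with a
// Codahale MetricRegistry that a metrics plugin hands over.
class S3MetricsHook {
  def registerMetrics(registry: MetricRegistry): Unit = {
    registry.register("filesystem.s3a.bytesRead", new Gauge[Long] {
      override def getValue(): Long = currentS3BytesRead()
    })
  }
  // Placeholder for the real statistics source (e.g. Hadoop GlobalStorageStatistics).
  private def currentS3BytesRead(): Long = 0L
}
{code}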






[jira] [Updated] (SPARK-28781) Unneccesary persist in PeriodicCheckpointer.update()

2019-11-10 Thread Dong Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Wang updated SPARK-28781:
--
Description: 
Once the function _update()_ is called, the RDD _newData_ is persisted at 
line 82. However, the persisted rdd _newData_ is only used a second time, 
inside _checkpoint()_ (the checkpoint at line 97), when the checkpointing 
condition at line 94 is met. In all other cases _newData_ is used only once, 
and persisting the rdd is unnecessary. Although persistedQueue is checked to 
avoid keeping too much unnecessary cached data, it would be better to avoid 
every unnecessary persist operation.
{code:scala}
def update(newData: T): Unit = {
persist(newData)
persistedQueue.enqueue(newData)
// We try to maintain 2 Datasets in persistedQueue to support the semantics 
of this class:
// Users should call [[update()]] when a new Dataset has been created,
// before the Dataset has been materialized.
while (persistedQueue.size > 3) {
  val dataToUnpersist = persistedQueue.dequeue()
  unpersist(dataToUnpersist)
}
updateCount += 1

// Handle checkpointing (after persisting)
if (checkpointInterval != -1 && (updateCount % checkpointInterval) == 0
  && sc.getCheckpointDir.nonEmpty) {
  // Add new checkpoint before removing old checkpoints.
  checkpoint(newData)
{code}

  was:
Once update() is called, newData is persisted at line 82. However, the 
persisted data is only used a second time (the checkpoint at line 97) when the 
checkpoint condition at line 94 is satisfied. Data that does not satisfy the 
checkpoint condition does not need to be cached. The persistedQueue avoids 
keeping too much unnecessary cached data, but it is best to avoid every 
unnecessary persist operation.
{code:scala}
def update(newData: T): Unit = {
persist(newData)
persistedQueue.enqueue(newData)
// We try to maintain 2 Datasets in persistedQueue to support the semantics 
of this class:
// Users should call [[update()]] when a new Dataset has been created,
// before the Dataset has been materialized.
while (persistedQueue.size > 3) {
  val dataToUnpersist = persistedQueue.dequeue()
  unpersist(dataToUnpersist)
}
updateCount += 1

// Handle checkpointing (after persisting)
if (checkpointInterval != -1 && (updateCount % checkpointInterval) == 0
  && sc.getCheckpointDir.nonEmpty) {
  // Add new checkpoint before removing old checkpoints.
  checkpoint(newData)
{code}


> Unneccesary persist in PeriodicCheckpointer.update()
> 
>
> Key: SPARK-28781
> URL: https://issues.apache.org/jira/browse/SPARK-28781
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dong Wang
>Priority: Major
>
> Once the function _update()_ is called, the RDD _newData_ is persisted at 
> line 82. However, the persisted rdd _newData_ is only used a second time, 
> inside _checkpoint()_ (the checkpoint at line 97), when the checkpointing 
> condition at line 94 is met. In all other cases _newData_ is used only once, 
> and persisting the rdd is unnecessary. Although persistedQueue is checked to 
> avoid keeping too much unnecessary cached data, it would be better to avoid 
> every unnecessary persist operation.
> {code:scala}
> def update(newData: T): Unit = {
> persist(newData)
> persistedQueue.enqueue(newData)
> // We try to maintain 2 Datasets in persistedQueue to support the 
> semantics of this class:
> // Users should call [[update()]] when a new Dataset has been created,
> // before the Dataset has been materialized.
> while (persistedQueue.size > 3) {
>   val dataToUnpersist = persistedQueue.dequeue()
>   unpersist(dataToUnpersist)
> }
> updateCount += 1
> // Handle checkpointing (after persisting)
> if (checkpointInterval != -1 && (updateCount % checkpointInterval) == 0
>   && sc.getCheckpointDir.nonEmpty) {
>   // Add new checkpoint before removing old checkpoints.
>   checkpoint(newData)
> {code}
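
A minimal sketch of one way to act on this (illustrative only; the parameters 
are simplified stand-ins for the PeriodicCheckpointer members): decide whether 
this update will checkpoint before persisting, and only persist in that case.
{code:scala}
// Sketch only, not a patch; persistedQueue handling is omitted for brevity.
def update[T](newData: T,
              persist: T => Unit,
              checkpoint: T => Unit,
              checkpointInterval: Int,
              updateCount: Int,
              checkpointDirDefined: Boolean): Int = {
  val newCount = updateCount + 1
  val willCheckpoint =
    checkpointInterval != -1 && (newCount % checkpointInterval) == 0 && checkpointDirDefined
  if (willCheckpoint) {
    persist(newData)   // persist only when checkpoint() will read newData a second time
    checkpoint(newData)
  }
  newCount
}
{code}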






[jira] [Updated] (SPARK-29823) Improper persist strategy in mllib.clustering.KMeans.run()

2019-11-10 Thread Dong Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Wang updated SPARK-29823:
--
Description: 
In mllib.clustering.KMeans.run(), the rdd _norms_ is persisted. But _norms_ 
only has a single child, the rdd _zippedData_, which is not persisted. So all 
the actions that rely on _norms_ also rely on _zippedData_. The rdd 
_zippedData_ is used multiple times in _runAlgorithm()_, so _zippedData_ 
should be persisted, not _norms_.
{code:scala}
  private[spark] def run(
  data: RDD[Vector],
  instr: Option[Instrumentation]): KMeansModel = {
if (data.getStorageLevel == StorageLevel.NONE) {
  logWarning("The input data is not directly cached, which may hurt 
performance if its"
+ " parent RDDs are also uncached.")
}
// Compute squared norms and cache them.
val norms = data.map(Vectors.norm(_, 2.0))
norms.persist() // Unnecessary persist. Only used to generate zippedData.
val zippedData = data.zip(norms).map { case (v, norm) =>
  new VectorWithNorm(v, norm)
} // needs to persist
val model = runAlgorithm(zippedData, instr)
norms.unpersist() // Change to zippedData.unpersist()
{code}
This issue is reported by our tool CacheCheck, which dynamically detects 
persist()/unpersist() API misuses.

  was:
In mllib.clustering.KMeans.run(), the rdd _norms_ is persisted. But _norms_ 
only has a single child, the rdd _zippedData_, which is not persisted. So all 
the actions that rely on _norms_ also rely on _zippedData_. The rdd 
_zippedData_ is used multiple times in _runAlgorithm()_, so _zippedData_ 
should be persisted, not _norms_.
{code:scala}
  private[spark] def run(
  data: RDD[Vector],
  instr: Option[Instrumentation]): KMeansModel = {
if (data.getStorageLevel == StorageLevel.NONE) {
  logWarning("The input data is not directly cached, which may hurt 
performance if its"
+ " parent RDDs are also uncached.")
}
// Compute squared norms and cache them.
val norms = data.map(Vectors.norm(_, 2.0))
norms.persist() // Unnecessary persist. Only used to generate zippedData.
val zippedData = data.zip(norms).map { case (v, norm) =>
  new VectorWithNorm(v, norm)
} // needs to persist
val model = runAlgorithm(zippedData, instr)
norms.unpersist() // Change to zippedData.unpersist()
{code}
This issue is reported by our tool CacheCheck, which dynamically detects 
persist()/unpersist() API misuses.


> Improper persist strategy in mllib.clustering.KMeans.run()
> --
>
> Key: SPARK-29823
> URL: https://issues.apache.org/jira/browse/SPARK-29823
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> In mllib.clustering.KMeans.run(), the rdd _norms_ is persisted. But _norms_ 
> only has a single child, the rdd _zippedData_, which is not persisted. So 
> all the actions that rely on _norms_ also rely on _zippedData_. The rdd 
> _zippedData_ is used multiple times in _runAlgorithm()_, so _zippedData_ 
> should be persisted, not _norms_.
> {code:scala}
>   private[spark] def run(
>   data: RDD[Vector],
>   instr: Option[Instrumentation]): KMeansModel = {
> if (data.getStorageLevel == StorageLevel.NONE) {
>   logWarning("The input data is not directly cached, which may hurt 
> performance if its"
> + " parent RDDs are also uncached.")
> }
> // Compute squared norms and cache them.
> val norms = data.map(Vectors.norm(_, 2.0))
> norms.persist() // Unnecessary persist. Only used to generate zippedData.
> val zippedData = data.zip(norms).map { case (v, norm) =>
>   new VectorWithNorm(v, norm)
> } // needs to persist
> val model = runAlgorithm(zippedData, instr)
> norms.unpersist() // Change to zippedData.unpersist()
> {code}
> This issue is reported by our tool CacheCheck, which dynamically detects 
> persist()/unpersist() API misuses.
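
A self-contained sketch of the change proposed above (illustrative only; 
VectorWithNorm is a simplified stand-in for the private mllib class, and the 
storage level is assumed):
{code:scala}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object KMeansPersistSketch {
  case class VectorWithNorm(vector: Vector, norm: Double)

  // Persist zippedData, which the iterative runAlgorithm() reuses, rather than
  // the intermediate RDD norms; the caller unpersists it after runAlgorithm().
  def prepare(data: RDD[Vector]): RDD[VectorWithNorm] = {
    val norms = data.map(Vectors.norm(_, 2.0))        // intermediate, no persist
    data.zip(norms)
      .map { case (v, n) => VectorWithNorm(v, n) }
      .persist(StorageLevel.MEMORY_AND_DISK)
  }
}
{code}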





[jira] [Updated] (SPARK-29823) Improper persist strategy in mllib.clustering.KMeans.run()

2019-11-10 Thread Dong Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Wang updated SPARK-29823:
--
Description: 
In mllib.clustering.KMeans.run(), the rdd _norms_ is persisted. But _norms_ 
only has a single child, the rdd _zippedData_, which is not persisted. So all 
the actions that rely on _norms_ also rely on _zippedData_. The rdd 
_zippedData_ is used multiple times in _runAlgorithm()_, so _zippedData_ 
should be persisted, not _norms_.
{code:scala}
  private[spark] def run(
  data: RDD[Vector],
  instr: Option[Instrumentation]): KMeansModel = {
if (data.getStorageLevel == StorageLevel.NONE) {
  logWarning("The input data is not directly cached, which may hurt 
performance if its"
+ " parent RDDs are also uncached.")
}
// Compute squared norms and cache them.
val norms = data.map(Vectors.norm(_, 2.0))
norms.persist() // Unnecessary persist. Only used to generate zippedData.
val zippedData = data.zip(norms).map { case (v, norm) =>
  new VectorWithNorm(v, norm)
} // needs to persist
val model = runAlgorithm(zippedData, instr)
norms.unpersist() // Change to zippedData.unpersist()
{code}
This issue is reported by our tool CacheCheck, which dynamically detects 
persist()/unpersist() API misuses.

  was:
In mllib.clustering.KMeans.run(), the rdd _norms_ is persisted. But _norms_ 
only has a single child, the rdd _zippedData_, which is not persisted. So all 
the actions that rely on _norms_ also rely on _zippedData_. The rdd 
_zippedData_ is used multiple times in _runAlgorithm()_, so _zippedData_ 
should be persisted, not _norms_.
{code:scala}
  private[spark] def run(
  data: RDD[Vector],
  instr: Option[Instrumentation]): KMeansModel = {
if (data.getStorageLevel == StorageLevel.NONE) {
  logWarning("The input data is not directly cached, which may hurt 
performance if its"
+ " parent RDDs are also uncached.")
}
// Compute squared norms and cache them.
val norms = data.map(Vectors.norm(_, 2.0))
norms.persist() // Unnecessary persist. Only used to generate zippedData.
val zippedData = data.zip(norms).map { case (v, norm) =>
  new VectorWithNorm(v, norm)
} // needs to persist
val model = runAlgorithm(zippedData, instr)
norms.unpersist() // Change to zippedData.unpersist()
{code}
This issue is reported by our tool CacheCheck, which dynamically detects 
persist()/unpersist() API misuses.


> Improper persist strategy in mllib.clustering.KMeans.run()
> --
>
> Key: SPARK-29823
> URL: https://issues.apache.org/jira/browse/SPARK-29823
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> In mllib.clustering.KMeans.run(), the rdd _norms_ is persisted. But _norms_ 
> only has a single child, the rdd _zippedData_, which is not persisted. So 
> all the actions that rely on _norms_ also rely on _zippedData_. The rdd 
> _zippedData_ is used multiple times in _runAlgorithm()_, so _zippedData_ 
> should be persisted, not _norms_.
> {code:scala}
>   private[spark] def run(
>   data: RDD[Vector],
>   instr: Option[Instrumentation]): KMeansModel = {
> if (data.getStorageLevel == StorageLevel.NONE) {
>   logWarning("The input data is not directly cached, which may hurt 
> performance if its"
> + " parent RDDs are also uncached.")
> }
> // Compute squared norms and cache them.
> val norms = data.map(Vectors.norm(_, 2.0))
> norms.persist() // Unnecessary persist. Only used to generate zippedData.
> val zippedData = data.zip(norms).map { case (v, norm) =>
>   new VectorWithNorm(v, norm)
> } // needs to persist
> val model = runAlgorithm(zippedData, instr)
> norms.unpersist() // Change to zippedData.unpersist()
> {code}
> This issue is reported by our tool CacheCheck, which dynamically detects 
> persist()/unpersist() API misuses.





[jira] [Assigned] (SPARK-29393) Add the make_interval() function

2019-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29393:
-

Assignee: Maxim Gekk

> Add the make_interval() function
> 
>
> Key: SPARK-29393
> URL: https://issues.apache.org/jira/browse/SPARK-29393
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> PostgreSQL allows making an interval with the make_interval() function:
> {{make_interval(years int DEFAULT 0, months int DEFAULT 0, weeks int DEFAULT 
> 0, days int DEFAULT 0, hours int DEFAULT 0, mins int DEFAULT 0, secs double 
> precision DEFAULT 0.0)}} returns an {{interval}}: it creates an interval from 
> years, months, weeks, days, hours, minutes and seconds fields, e.g. 
> {{make_interval(days => 10)}} gives {{10 days}}.
> See https://www.postgresql.org/docs/12/functions-datetime.html






[jira] [Updated] (SPARK-29823) Improper persist strategy in mllib.clustering.KMeans.run()

2019-11-10 Thread Dong Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Wang updated SPARK-29823:
--
Description: 
In mllib.clustering.KMeans.run(), the rdd _norms_ is persisted. But _norms_ 
only has a single child, the rdd _zippedData_, which is not persisted. So all 
the actions that rely on _norms_ also rely on _zippedData_. The rdd 
_zippedData_ is used multiple times in _runAlgorithm()_, so _zippedData_ 
should be persisted, not _norms_.
{code:scala}
  private[spark] def run(
  data: RDD[Vector],
  instr: Option[Instrumentation]): KMeansModel = {
if (data.getStorageLevel == StorageLevel.NONE) {
  logWarning("The input data is not directly cached, which may hurt 
performance if its"
+ " parent RDDs are also uncached.")
}
// Compute squared norms and cache them.
val norms = data.map(Vectors.norm(_, 2.0))
norms.persist() // Unnecessary persist. Only used to generate zippedData.
val zippedData = data.zip(norms).map { case (v, norm) =>
  new VectorWithNorm(v, norm)
} // needs to persist
val model = runAlgorithm(zippedData, instr)
norms.unpersist() // Change to zippedData.unpersist()
{code}
This issue is reported by our tool CacheCheck, which dynamically detects 
persist()/unpersist() API misuses.

  was:
In mllib.clustering.KMeans.run(), the rdd _norms_ is persisted. But _norms_ 
only has a single child, the rdd _zippedData_, which is not persisted. So all 
the actions that rely on _norms_ also rely on _zippedData_. The rdd 
_zippedData_ is used multiple times in _runAlgorithm()_, so _zippedData_ 
should be persisted, not _norms_.
{code:scala}
  private[spark] def run(
  data: RDD[Vector],
  instr: Option[Instrumentation]): KMeansModel = {
if (data.getStorageLevel == StorageLevel.NONE) {
  logWarning("The input data is not directly cached, which may hurt 
performance if its"
+ " parent RDDs are also uncached.")
}
// Compute squared norms and cache them.
val norms = data.map(Vectors.norm(_, 2.0))
norms.persist() // Unnecessary persist. Only used to generate zippedData.
val zippedData = data.zip(norms).map { case (v, norm) =>
  new VectorWithNorm(v, norm)
} // needs to persist
val model = runAlgorithm(zippedData, instr)
norms.unpersist() // Change to zippedData.unpersist()
{code}
This issue is reported by our tool CacheCheck, which dynamically detects 
persist()/unpersist() API misuses.


> Improper persist strategy in mllib.clustering.KMeans.run()
> --
>
> Key: SPARK-29823
> URL: https://issues.apache.org/jira/browse/SPARK-29823
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> In mllib.clustering.KMeans.run(), the rdd _norms_ is persisted. But _norms_ 
> only has a single child, the rdd _zippedData_, which is not persisted. So 
> all the actions that rely on _norms_ also rely on _zippedData_. The rdd 
> _zippedData_ is used multiple times in _runAlgorithm()_, so _zippedData_ 
> should be persisted, not _norms_.
> {code:scala}
>   private[spark] def run(
>   data: RDD[Vector],
>   instr: Option[Instrumentation]): KMeansModel = {
> if (data.getStorageLevel == StorageLevel.NONE) {
>   logWarning("The input data is not directly cached, which may hurt 
> performance if its"
> + " parent RDDs are also uncached.")
> }
> // Compute squared norms and cache them.
> val norms = data.map(Vectors.norm(_, 2.0))
> norms.persist() // Unnecessary persist. Only used to generate zippedData.
> val zippedData = data.zip(norms).map { case (v, norm) =>
>   new VectorWithNorm(v, norm)
> } // needs to persist
> val model = runAlgorithm(zippedData, instr)
> norms.unpersist() // Change to zippedData.unpersist()
> {code}
> This issue is reported by our tool CacheCheck, which dynamically detects 
> persist()/unpersist() API misuses.





[jira] [Resolved] (SPARK-29393) Add the make_interval() function

2019-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29393.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26446
[https://github.com/apache/spark/pull/26446]

> Add the make_interval() function
> 
>
> Key: SPARK-29393
> URL: https://issues.apache.org/jira/browse/SPARK-29393
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> PostgreSQL allows making an interval with the make_interval() function:
> {{make_interval(years int DEFAULT 0, months int DEFAULT 0, weeks int DEFAULT 
> 0, days int DEFAULT 0, hours int DEFAULT 0, mins int DEFAULT 0, secs double 
> precision DEFAULT 0.0)}} returns an {{interval}}: it creates an interval from 
> years, months, weeks, days, hours, minutes and seconds fields, e.g. 
> {{make_interval(days => 10)}} gives {{10 days}}.
> See https://www.postgresql.org/docs/12/functions-datetime.html
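
A hedged usage sketch (assuming the Spark function takes positional arguments 
in the PostgreSQL order years, months, weeks, days, hours, mins, secs, and a 
SparkSession `spark`; the named-argument form shown for PostgreSQL is not 
assumed to be supported):
{code:scala}
// Expected to produce an interval of 10 days.
spark.sql("SELECT make_interval(0, 0, 0, 10, 0, 0, 0)").show()
{code}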






[jira] [Updated] (SPARK-29823) Improper persist strategy in mllib.clustering.KMeans.run()

2019-11-10 Thread Dong Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Wang updated SPARK-29823:
--
Description: 
In mllib.clustering.KMeans.run(), the rdd _norms_ is persisted. But _norms_ 
only has a single child, the rdd _zippedData_, which is not persisted. So all 
the actions that rely on _norms_ also rely on _zippedData_. The rdd 
_zippedData_ is used multiple times in _runAlgorithm()_, so _zippedData_ 
should be persisted, not _norms_.
{code:scala}
  private[spark] def run(
  data: RDD[Vector],
  instr: Option[Instrumentation]): KMeansModel = {
if (data.getStorageLevel == StorageLevel.NONE) {
  logWarning("The input data is not directly cached, which may hurt 
performance if its"
+ " parent RDDs are also uncached.")
}
// Compute squared norms and cache them.
val norms = data.map(Vectors.norm(_, 2.0))
norms.persist() // Unnecessary persist. Only used to generate zippedData.
val zippedData = data.zip(norms).map { case (v, norm) =>
  new VectorWithNorm(v, norm)
} // needs to persist
val model = runAlgorithm(zippedData, instr)
norms.unpersist() // Change to zippedData.unpersist()
{code}
This issue is reported by our tool CacheCheck, which dynamically detects 
persist()/unpersist() API misuses.

  was:
In mllib.clustering.KMeans.run(), the rdd norms is persisted. But it only has 
a single child rdd, zippedData, so the persist is unnecessary. On the other 
hand, norms's child rdd zippedData is used multiple times in runAlgorithm, so 
zippedData should be persisted.

{code:scala}
  private[spark] def run(
  data: RDD[Vector],
  instr: Option[Instrumentation]): KMeansModel = {
if (data.getStorageLevel == StorageLevel.NONE) {
  logWarning("The input data is not directly cached, which may hurt 
performance if its"
+ " parent RDDs are also uncached.")
}
// Compute squared norms and cache them.
val norms = data.map(Vectors.norm(_, 2.0))
norms.persist() // Unnecessary persist. Only used to generate zippedData.
val zippedData = data.zip(norms).map { case (v, norm) =>
  new VectorWithNorm(v, norm)
} // needs to persist
val model = runAlgorithm(zippedData, instr)
norms.unpersist() // Change to zippedData.unpersist()
{code}

This issue is reported by our tool CacheCheck, which dynamically detects 
persist()/unpersist() API misuses.

Summary: Improper persist strategy in mllib.clustering.KMeans.run()  
(was: Wrong persist strategy in mllib.clustering.KMeans.run())

> Improper persist strategy in mllib.clustering.KMeans.run()
> --
>
> Key: SPARK-29823
> URL: https://issues.apache.org/jira/browse/SPARK-29823
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> In mllib.clustering.KMeans.run(), the rdd _norms_ is persisted. But _norms_ 
> only has a single child, the rdd _zippedData_, which is not persisted. So 
> all the actions that rely on _norms_ also rely on _zippedData_. The rdd 
> _zippedData_ is used multiple times in _runAlgorithm()_, so _zippedData_ 
> should be persisted, not _norms_.
> {code:scala}
>   private[spark] def run(
>   data: RDD[Vector],
>   instr: Option[Instrumentation]): KMeansModel = {
> if (data.getStorageLevel == StorageLevel.NONE) {
>   logWarning("The input data is not directly cached, which may hurt 
> performance if its"
> + " parent RDDs are also uncached.")
> }
> // Compute squared norms and cache them.
> val norms = data.map(Vectors.norm(_, 2.0))
> norms.persist() // Unnecessary persist. Only used to generate zippedData.
> val zippedData = data.zip(norms).map { case (v, norm) =>
>   new VectorWithNorm(v, norm)
> } // needs to persist
> val model = runAlgorithm(zippedData, instr)
> norms.unpersist() // Change to zippedData.unpersist()
> {code}
> This issue is reported by our tool CacheCheck, which dynamically detects 
> persist()/unpersist() API misuses.




[jira] [Updated] (SPARK-29393) Add make_interval() function

2019-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29393:
--
Summary: Add make_interval() function  (was: Add the make_interval() 
function)

> Add make_interval() function
> 
>
> Key: SPARK-29393
> URL: https://issues.apache.org/jira/browse/SPARK-29393
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> PostgreSQL allows making an interval with the make_interval() function:
> {{make_interval(years int DEFAULT 0, months int DEFAULT 0, weeks int DEFAULT 
> 0, days int DEFAULT 0, hours int DEFAULT 0, mins int DEFAULT 0, secs double 
> precision DEFAULT 0.0)}} returns an {{interval}}: it creates an interval from 
> years, months, weeks, days, hours, minutes and seconds fields, e.g. 
> {{make_interval(days => 10)}} gives {{10 days}}.
> See https://www.postgresql.org/docs/12/functions-datetime.html






[jira] [Updated] (SPARK-29827) Wrong persist strategy in mllib.clustering.BisectingKMeans.run

2019-11-10 Thread Dong Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Wang updated SPARK-29827:
--
Description: 
There are three persist misuses in mllib.clustering.BisectingKMeans.run.
 * First, the rdd {color:#de350b}_input_{color} should be persisted, because it is 
not only used by the action _first()_ but also by other actions in the following 
code.
 * Second, the rdd {color:#de350b}_assignments_{color} should be persisted. It is 
used in the function _summarize()_ more than once, and _summarize()_ contains an 
action on _assignments_.
 * Third, once the rdd {color:#de350b}_assignments_{color} is persisted, persisting 
the rdd {color:#de350b}_norms_{color} becomes unnecessary: {color:#de350b}_norms_{color} 
is only an intermediate rdd, and since its child rdd {color:#de350b}_assignments_{color} 
is persisted, there is no need to persist {color:#de350b}_norms_{color} as well.

{code:scala}
  private[spark] def run(
  input: RDD[Vector],
  instr: Option[Instrumentation]): BisectingKMeansModel = {
if (input.getStorageLevel == StorageLevel.NONE) {
  logWarning(s"The input RDD ${input.id} is not directly cached, which may 
hurt performance if"
+ " its parent RDDs are also not cached.")
}
// Needs to persist input
val d = input.map(_.size).first() 
logInfo(s"Feature dimension: $d.")
val dMeasure: DistanceMeasure = 
DistanceMeasure.decodeFromString(this.distanceMeasure)
// Compute and cache vector norms for fast distance computation.
val norms = input.map(v => Vectors.norm(v, 
2.0)).persist(StorageLevel.MEMORY_AND_DISK)  // Unnecessary persist
val vectors = input.zip(norms).map { case (x, norm) => new 
VectorWithNorm(x, norm) }
var assignments = vectors.map(v => (ROOT_INDEX, v))  // Needs to persist
var activeClusters = summarize(d, assignments, dMeasure)
{code}
This issue is reported by our tool CacheCheck, which is used to dynamically 
detect persist()/unpersist() API misuses.

  was:
There are three persist misuses in mllib.clustering.BisectingKMeans.run.

First, the rdd {{ _input _}} should be persisted, because it is used by action 
first and actions in the following code.
Second, rdd assignments should be persisted. It is used in summarize() more 
than once, which has an action on assignment.
Third, persisting rdd norms is unnecessary, because it is just a mediate rdd. 
Since its child rdd assignments is persisted, it is unnecessary to be persisted.

{code:scala}
  private[spark] def run(
  input: RDD[Vector],
  instr: Option[Instrumentation]): BisectingKMeansModel = {
if (input.getStorageLevel == StorageLevel.NONE) {
  logWarning(s"The input RDD ${input.id} is not directly cached, which may 
hurt performance if"
+ " its parent RDDs are also not cached.")
}
// Needs to persist input
val d = input.map(_.size).first() 
logInfo(s"Feature dimension: $d.")
val dMeasure: DistanceMeasure = 
DistanceMeasure.decodeFromString(this.distanceMeasure)
// Compute and cache vector norms for fast distance computation.
val norms = input.map(v => Vectors.norm(v, 
2.0)).persist(StorageLevel.MEMORY_AND_DISK)  // Unnecessary persist
val vectors = input.zip(norms).map { case (x, norm) => new 
VectorWithNorm(x, norm) }
var assignments = vectors.map(v => (ROOT_INDEX, v))  // Needs to persist
var activeClusters = summarize(d, assignments, dMeasure)
{code}

This issue is reported by our tool CacheCheck, which is used to dynamically 
detecting persist()/unpersist() api misuses.


> Wrong persist strategy in mllib.clustering.BisectingKMeans.run
> --
>
> Key: SPARK-29827
> URL: https://issues.apache.org/jira/browse/SPARK-29827
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> There are three persist misuses in mllib.clustering.BisectingKMeans.run.
>  * First, the rdd {color:#de350b}_input_{color} should be persisted, because it 
> is not only used by the action _first()_ but also by other actions in the 
> following code.
>  * Second, the rdd {color:#de350b}_assignments_{color} should be persisted. It is 
> used in the function _summarize()_ more than once, and _summarize()_ contains an 
> action on _assignments_.
>  * Third, once the rdd {color:#de350b}_assignments_{color} is persisted, persisting 
> the rdd {color:#de350b}_norms_{color} becomes unnecessary: {color:#de350b}_norms_{color} 
> is only an intermediate rdd, and since its child rdd {color:#de350b}_assignments_{color} 
> is persisted, there is no need to persist {color:#de350b}_norms_{color} as well.
> {code:scala}
>   private[spark] def run(
>   input: RDD[Vector],
>   instr: Option[Instrumentation]): 

[jira] [Updated] (SPARK-29827) Wrong persist strategy in mllib.clustering.BisectingKMeans.run

2019-11-10 Thread Dong Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Wang updated SPARK-29827:
--
Description: 
There are three persist misuses in mllib.clustering.BisectingKMeans.run.

First, the rdd {{ _input _}} should be persisted, because it is used by action 
first and actions in the following code.
Second, rdd assignments should be persisted. It is used in summarize() more 
than once, which has an action on assignment.
Third, persisting rdd norms is unnecessary, because it is just a mediate rdd. 
Since its child rdd assignments is persisted, it is unnecessary to be persisted.

{code:scala}
  private[spark] def run(
  input: RDD[Vector],
  instr: Option[Instrumentation]): BisectingKMeansModel = {
if (input.getStorageLevel == StorageLevel.NONE) {
  logWarning(s"The input RDD ${input.id} is not directly cached, which may 
hurt performance if"
+ " its parent RDDs are also not cached.")
}
// Needs to persist input
val d = input.map(_.size).first() 
logInfo(s"Feature dimension: $d.")
val dMeasure: DistanceMeasure = 
DistanceMeasure.decodeFromString(this.distanceMeasure)
// Compute and cache vector norms for fast distance computation.
val norms = input.map(v => Vectors.norm(v, 
2.0)).persist(StorageLevel.MEMORY_AND_DISK)  // Unnecessary persist
val vectors = input.zip(norms).map { case (x, norm) => new 
VectorWithNorm(x, norm) }
var assignments = vectors.map(v => (ROOT_INDEX, v))  // Needs to persist
var activeClusters = summarize(d, assignments, dMeasure)
{code}

This issue is reported by our tool CacheCheck, which is used to dynamically 
detecting persist()/unpersist() api misuses.

  was:
There are three persist misuses in mllib.clustering.BisectingKMeans.run.
First, rdd input should be persisted, because it is used by action first and 
actions in the following code.
Second, rdd assignments should be persisted. It is used in summarize() more 
than once, which has an action on assignment.
Third, persisting rdd norms is unnecessary, because it is just a mediate rdd. 
Since its child rdd assignments is persisted, it is unnecessary to be persisted.

{code:scala}
  private[spark] def run(
  input: RDD[Vector],
  instr: Option[Instrumentation]): BisectingKMeansModel = {
if (input.getStorageLevel == StorageLevel.NONE) {
  logWarning(s"The input RDD ${input.id} is not directly cached, which may 
hurt performance if"
+ " its parent RDDs are also not cached.")
}
// Needs to persist input
val d = input.map(_.size).first() 
logInfo(s"Feature dimension: $d.")
val dMeasure: DistanceMeasure = 
DistanceMeasure.decodeFromString(this.distanceMeasure)
// Compute and cache vector norms for fast distance computation.
val norms = input.map(v => Vectors.norm(v, 
2.0)).persist(StorageLevel.MEMORY_AND_DISK)  // Unnecessary persist
val vectors = input.zip(norms).map { case (x, norm) => new 
VectorWithNorm(x, norm) }
var assignments = vectors.map(v => (ROOT_INDEX, v))  // Needs to persist
var activeClusters = summarize(d, assignments, dMeasure)
{code}

This issue is reported by our tool CacheCheck, which is used to dynamically 
detecting persist()/unpersist() api misuses.


> Wrong persist strategy in mllib.clustering.BisectingKMeans.run
> --
>
> Key: SPARK-29827
> URL: https://issues.apache.org/jira/browse/SPARK-29827
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> There are three persist misuses in mllib.clustering.BisectingKMeans.run.
> First, the rdd {{ _input _}} should be persisted, because it is used by 
> action first and actions in the following code.
> Second, rdd assignments should be persisted. It is used in summarize() more 
> than once, which has an action on assignment.
> Third, persisting rdd norms is unnecessary, because it is just a mediate rdd. 
> Since its child rdd assignments is persisted, it is unnecessary to be 
> persisted.
> {code:scala}
>   private[spark] def run(
>   input: RDD[Vector],
>   instr: Option[Instrumentation]): BisectingKMeansModel = {
> if (input.getStorageLevel == StorageLevel.NONE) {
>   logWarning(s"The input RDD ${input.id} is not directly cached, which 
> may hurt performance if"
> + " its parent RDDs are also not cached.")
> }
> // Needs to persist input
> val d = input.map(_.size).first() 
> logInfo(s"Feature dimension: $d.")
> val dMeasure: DistanceMeasure = 
> DistanceMeasure.decodeFromString(this.distanceMeasure)
> // Compute and cache vector norms for fast distance computation.
> val norms = input.map(v => Vectors.norm(v, 
> 2.0)).persist(StorageLevel.MEMORY_AND_DISK)  // Unnecessary 

[jira] [Updated] (SPARK-28939) SQL configuration are not always propagated

2019-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28939:
--
Affects Version/s: 2.3.4

> SQL configuration are not always propagated
> ---
>
> Key: SPARK-28939
> URL: https://issues.apache.org/jira/browse/SPARK-28939
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> The SQL configurations are propagated to executors in order to be effective.
> Unfortunately, in some cases we fail to propagate them, making them ineffective.
> The problem happens every time {{rdd}} or {{queryExecution.toRdd}} is used, 
> and this is pretty frequent in the codebase.
> Please note that there are two parts to this issue:
>  - when a user directly uses those APIs
>  - when Spark invokes them (e.g. throughout the ML lib and other usages, or the 
> {{describe}} method on the {{Dataset}} class)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28939) SQL configuration are not always propagated

2019-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28939:
--
Fix Version/s: 2.4.5

> SQL configuration are not always propagated
> ---
>
> Key: SPARK-28939
> URL: https://issues.apache.org/jira/browse/SPARK-28939
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> The SQL configurations are propagated to executors in order to be effective.
> Unfortunately, in some cases we fail to propagate them, making them ineffective.
> The problem happens every time {{rdd}} or {{queryExecution.toRdd}} is used, 
> and this is pretty frequent in the codebase.
> Please note that there are two parts to this issue:
>  - when a user directly uses those APIs
>  - when Spark invokes them (e.g. throughout the ML lib and other usages, or the 
> {{describe}} method on the {{Dataset}} class)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29830) PySpark.context.Sparkcontext.binaryfiles improved memory with buffer

2019-11-10 Thread Jira
Jörn Franke created SPARK-29830:
---

 Summary: PySpark.context.Sparkcontext.binaryfiles improved memory 
with buffer
 Key: SPARK-29830
 URL: https://issues.apache.org/jira/browse/SPARK-29830
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.4
Reporter: Jörn Franke


At the moment, PySpark reads binary files into a byte array directly. This 
means it reads the full binary file immediately into memory, which is 1) memory 
inefficient and 2) different from the Scala implementation (see PySpark here: 
[https://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/context.html#SparkContext.binaryFiles]).

In Scala, Spark returns a PortableDataStream, which means the application does 
not need to read the full content of the stream into memory to work on it (see 
[https://spark.apache.org/docs/2.4.0/api/scala/index.html#org.apache.spark.SparkContext]).

Hence, it is proposed to adapt the PySpark implementation to return something 
similar to Scala's PortableDataStream (e.g. 
[BytesIO|https://docs.python.org/3/library/io.html#io.BytesIO]).

 

Reading binary files in an efficient manner is crucial for many IoT 
applications, but potentially also other fields (e.g. disk image analysis in 
forensics).
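For comparison, a short sketch of the Scala behaviour referenced above: binaryFiles() 
hands back lazy PortableDataStream handles, so only the bytes the caller asks for are 
materialized. The input path and the "first 16 bytes" use case are illustrative 
assumptions.

{code:scala}
import java.io.DataInputStream
import org.apache.spark.sql.SparkSession

object BinaryFilesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("binary-files-demo").master("local[*]").getOrCreate()
    // Each record is (file path, PortableDataStream); nothing is read until open() is called.
    val headers = spark.sparkContext.binaryFiles("/tmp/images").map { case (path, stream) =>
      val in: DataInputStream = stream.open()
      try {
        val buf = new Array[Byte](16)
        val read = in.read(buf)
        (path, buf.take(math.max(read, 0)))      // only these bytes are materialized
      } finally {
        in.close()
      }
    }.collect()
    headers.foreach { case (path, bytes) => println(s"$path -> ${bytes.length} bytes") }
    spark.stop()
  }
}
{code}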



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29829) SHOW TABLE EXTENDED should look up catalog/table like v2 commands

2019-11-10 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971191#comment-16971191
 ] 

Pablo Langa Blanco commented on SPARK-29829:


I am working on this

> SHOW TABLE EXTENDED should look up catalog/table like v2 commands
> -
>
> Key: SPARK-29829
> URL: https://issues.apache.org/jira/browse/SPARK-29829
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Pablo Langa Blanco
>Priority: Major
>
> SHOW TABLE EXTENDED should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29829) SHOW TABLE EXTENDED should look up catalog/table like v2 commands

2019-11-10 Thread Pablo Langa Blanco (Jira)
Pablo Langa Blanco created SPARK-29829:
--

 Summary: SHOW TABLE EXTENDED should look up catalog/table like v2 
commands
 Key: SPARK-29829
 URL: https://issues.apache.org/jira/browse/SPARK-29829
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Pablo Langa Blanco


SHOW TABLE EXTENDED should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29821) Allow calling non-aggregate SQL functions with column name

2019-11-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29821.
--
Resolution: Not A Problem

> Allow calling non-aggregate SQL functions with column name
> --
>
> Key: SPARK-29821
> URL: https://issues.apache.org/jira/browse/SPARK-29821
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Affects Versions: 2.4.4
>Reporter: Max Härtwig
>Priority: Minor
>
> Some functions in sql/functions.scala can be called with a Column or a String 
> parameter, while others can only be called with a Column parameter. This 
> behavior should be made consistent.
>  
> Example:
> Exists:{{ sqrt(e: Column): Column}}
> Exists:{{ sqrt(columnName: String): Column}}
> Exists:{{ isnan(e: Column): Column}}
> Doesn't exist: {{isnan(columnName: String): Column}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29826) Missing persist on data in mllib.feature.ChiSqSelector.fit

2019-11-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29826.
--
Resolution: Duplicate

> Missing persist on data in mllib.feature.ChiSqSelector.fit
> --
>
> Key: SPARK-29826
> URL: https://issues.apache.org/jira/browse/SPARK-29826
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> The rdd data in mllib.feature.ChiSqSelector.fit() is used by an action in 
> Statistics.chiSqTest(data) and other actions in the following code, but it is 
> not persisted.
> {code:scala} 
>   def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
> val chiSqTestResult = Statistics.chiSqTest(data).zipWithIndex
> val features = selectorType match {
>   case ChiSqSelector.NumTopFeatures =>
> chiSqTestResult
>   .sortBy { case (res, _) => res.pValue }
>   .take(numTopFeatures)
> {code}
> This issue is reported by our tool CacheCheck, which is used to dynamically 
> detect persist()/unpersist() API misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29813) Missing persist in mllib.PrefixSpan.findFrequentItems()

2019-11-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29813.
--
Resolution: Duplicate

> Missing persist in mllib.PrefixSpan.findFrequentItems()
> ---
>
> Key: SPARK-29813
> URL: https://issues.apache.org/jira/browse/SPARK-29813
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> There are three actions in this piece of code: reduceByKey, sortBy, and 
> collect. But data is not persisted, which will cause recomputation.
> {code:scala}
>   private[fpm] def findFrequentItems[Item: ClassTag](
>   data: RDD[Array[Array[Item]]],
>   minCount: Long): Array[Item] = {
> data.flatMap { itemsets =>
>   val uniqItems = mutable.Set.empty[Item]
>   itemsets.foreach(set => uniqItems ++= set)
>   uniqItems.toIterator.map((_, 1L))
> }.reduceByKey(_ + _).filter { case (_, count) =>
>   count >= minCount
> }.sortBy(-_._2).map(_._1).collect()
>   }
> {code}
> This issue is reported by our tool CacheCheck, which is used to dynamically 
> detect persist()/unpersist() API misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29809) Missing persist in Word2Vec.fit()

2019-11-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29809.
--
Resolution: Duplicate

> Missing persist in Word2Vec.fit()
> -
>
> Key: SPARK-29809
> URL: https://issues.apache.org/jira/browse/SPARK-29809
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> The RDD dataset is used by multiple actions, in learnVocab(dataset) and in 
> doFit(), so it needs to be persisted.
> {code:scala}
> def fit[S <: Iterable[String]](dataset: RDD[S]): Word2VecModel = {
> // Needs to persist dataset here
> learnVocab(dataset) // has action on dataset
> createBinaryTree()
> val sc = dataset.context
> val expTable = sc.broadcast(createExpTable())
> val bcVocab = sc.broadcast(vocab)
> val bcVocabHash = sc.broadcast(vocabHash)
> try {
>   doFit(dataset, sc, expTable, bcVocab, bcVocabHash) // has action on 
> dataset
> {code}
> This issue is reported by our tool _CacheCheck_, which is used to dynamically 
> detect persist()/unpersist() API misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29811) Missing persist on oldDataset in ml.RandomForestRegressor.train()

2019-11-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29811.
--
Resolution: Duplicate

> Missing persist on oldDataset in ml.RandomForestRegressor.train()
> -
>
> Key: SPARK-29811
> URL: https://issues.apache.org/jira/browse/SPARK-29811
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> The rdd oldDataset in ml.regression.RandomForestRegressor.train() needs to be 
> persisted, because it is used in two actions: in RandomForest.run() and in 
> oldDataset.first().
> {code:scala}
> override protected def train(
>   dataset: Dataset[_]): RandomForestRegressionModel = instrumented { 
> instr =>
> val categoricalFeatures: Map[Int, Int] =
>   MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
> val oldDataset: RDD[LabeledPoint] = extractLabeledPoints(dataset) // 
> Needs to persist
> val strategy =
>   super.getOldStrategy(categoricalFeatures, numClasses = 0, 
> OldAlgo.Regression, getOldImpurity)
> instr.logPipelineStage(this)
> instr.logDataset(dataset)
> instr.logParams(this, labelCol, featuresCol, predictionCol, impurity, 
> numTrees,
>   featureSubsetStrategy, maxDepth, maxBins, maxMemoryInMB, minInfoGain,
>   minInstancesPerNode, seed, subsamplingRate, cacheNodeIds, 
> checkpointInterval)
>// First use oldDataset
> val trees = RandomForest
>   .run(oldDataset, strategy, getNumTrees, getFeatureSubsetStrategy, 
> getSeed, Some(instr))
>   .map(_.asInstanceOf[DecisionTreeRegressionModel])
>// Second use oldDataset
> val numFeatures = oldDataset.first().features.size
> instr.logNamedValue(Instrumentation.loggerTags.numFeatures, numFeatures)
> new RandomForestRegressionModel(uid, trees, numFeatures)
>   }
> {code}
> The same situation exists in ml.classification.RandomForestClassifier.train.
> This issue is reported by our tool CacheCheck, which is used to dynamically 
> detect persist()/unpersist() API misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29812) Missing persist on predictionAndLabels in MulticlassClassificationEvaluator

2019-11-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29812.
--
Resolution: Duplicate

> Missing persist on predictionAndLabels in MulticlassClassificationEvaluator
> ---
>
> Key: SPARK-29812
> URL: https://issues.apache.org/jira/browse/SPARK-29812
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> The rdd predictionAndLabels in 
> ml.evaluation.MulticlassClassificationEvaluator.evaluate() needs to be persisted. 
> When MulticlassMetrics uses predictionAndLabels to initialize its fields, there 
> will be at least five actions executed on predictionAndLabels.
> {code:scala}
>   override def evaluate(dataset: Dataset[_]): Double = {
> val schema = dataset.schema
> SchemaUtils.checkColumnType(schema, $(predictionCol), DoubleType)
> SchemaUtils.checkNumericType(schema, $(labelCol))
> // Needs to be persisted
> val predictionAndLabels =
>   dataset.select(col($(predictionCol)), 
> col($(labelCol)).cast(DoubleType)).rdd.map {
> case Row(prediction: Double, label: Double) => (prediction, label)
>   }
> // The initialization will use predictionAndLabels multi times in 
> different actions.
> val metrics = new MulticlassMetrics(predictionAndLabels)
> val metric = $(metricName) match {
>   case "f1" => metrics.weightedFMeasure
>   case "weightedPrecision" => metrics.weightedPrecision
>   case "weightedRecall" => metrics.weightedRecall
>   case "accuracy" => metrics.accuracy
> }
> metric
>   }
> {code}
> This issue is reported by our tool CacheCheck, which is used to dynamically 
> detect persist()/unpersist() API misuses.
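As a standalone illustration of why this matters (toy data and object name are 
assumptions, not the evaluator code itself), MulticlassMetrics fires several jobs 
over the same (prediction, label) RDD while computing its summaries, so caching 
that RDD avoids repeated recomputation:

{code:scala}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object MetricsCaching {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("metrics-demo").master("local[*]").getOrCreate()
    // (prediction, label) pairs; cached because MulticlassMetrics scans them several times.
    val predictionAndLabels = spark.sparkContext
      .parallelize(Seq((0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (2.0, 2.0)))
      .persist(StorageLevel.MEMORY_AND_DISK)

    val metrics = new MulticlassMetrics(predictionAndLabels)
    println(s"accuracy=${metrics.accuracy} weightedF1=${metrics.weightedFMeasure}")

    predictionAndLabels.unpersist()
    spark.stop()
  }
}
{code}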



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29810) Missing persist on retaggedInput in RandomForest.run()

2019-11-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29810.
--
Resolution: Duplicate

> Missing persist on retaggedInput in RandomForest.run()
> --
>
> Key: SPARK-29810
> URL: https://issues.apache.org/jira/browse/SPARK-29810
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> The rdd retaggedInput should be persisted in ml.tree.impl.RandomForest.run(), 
> because it will be used by more than one action.
> {code:scala}
>   def run(
>   input: RDD[LabeledPoint],
>   strategy: OldStrategy,
>   numTrees: Int,
>   featureSubsetStrategy: String,
>   seed: Long,
>   instr: Option[Instrumentation],
>   prune: Boolean = true, // exposed for testing only, real trees are 
> always pruned
>   parentUID: Option[String] = None): Array[DecisionTreeModel] = {
> val timer = new TimeTracker()
> timer.start("total")
> timer.start("init")
> val retaggedInput = input.retag(classOf[LabeledPoint]) // it needs to be 
> persisted
> {code}
> This issue is reported by our tool CacheCheck, which is used to dynamically 
> detect persist()/unpersist() API misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29828) Missing persist on ratings in ml.recommendation.ALS.train

2019-11-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29828.
--
Resolution: Duplicate

> Missing persist on ratings in ml.recommendation.ALS.train
> -
>
> Key: SPARK-29828
> URL: https://issues.apache.org/jira/browse/SPARK-29828
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> There is a ratings.isEmpty() call at the beginning of 
> ml.recommendation.ALS.train(). Actually, isEmpty() contains an action:
> {code:scala}
>   def isEmpty(): Boolean = withScope {
> partitions.length == 0 || take(1).length == 0
>   }
> {code}
> So the rdd ratings will be used by multiple actions; it should be persisted, and 
> unpersisted after its child rdd has been persisted.
> {code:scala}
>   def train[ID: ClassTag]( // scalastyle:ignore
>   ratings: RDD[Rating[ID]],
> ...
> require(!ratings.isEmpty(), s"No ratings available from $ratings") // 
> first use ratings
> ...
> val blockRatings = partitionRatings(ratings, userPart, itemPart)
>   .persist(intermediateRDDStorageLevel)
> val (userInBlocks, userOutBlocks) =
>   makeBlocks("user", blockRatings, userPart, itemPart, 
> intermediateRDDStorageLevel)
> userOutBlocks.count()// materialize blockRatings and user blocks
> // ratings should be unpersisted here
> {code}
> This issue is reported by our tool CacheCheck, which is used to dynamically 
> detect persist()/unpersist() API misuses.
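A minimal self-contained illustration of the point about isEmpty() (toy ratings and 
the object name are assumptions, not the ALS code): isEmpty() already runs a job, so 
an RDD that is first validated with isEmpty() and then consumed again is computed 
twice unless it is cached.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object IsEmptyThenReuse {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("isempty-demo").master("local[*]").getOrCreate()
    val ratings = spark.sparkContext
      .parallelize(Seq((1, 10, 4.0), (1, 11, 3.5), (2, 10, 5.0)))
      .persist(StorageLevel.MEMORY_AND_DISK)   // cached before the first job touches it

    require(!ratings.isEmpty(), "No ratings available")   // first job (take(1) under the hood)
    val numUsers = ratings.map(_._1).distinct().count()   // second job reuses the cache
    println(s"numUsers=$numUsers")

    ratings.unpersist()
    spark.stop()
  }
}
{code}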



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29824) Missing persist on trainDataset in ml.classification.GBTClassifier.train()

2019-11-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29824.
--
Resolution: Duplicate

> Missing persist on trainDataset in ml.classification.GBTClassifier.train()
> --
>
> Key: SPARK-29824
> URL: https://issues.apache.org/jira/browse/SPARK-29824
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> The rdd trainDataset in ml.classification.GBTClassifier.train() is used by the 
> action first() and by other actions in GradientBoostedTrees.run/runWithValidation, 
> but it is not persisted, which will cause recomputation of trainDataset.
> {code:scala}
>   override protected def train(
>   dataset: Dataset[_]): GBTClassificationModel = instrumented { instr =>
> val categoricalFeatures: Map[Int, Int] =
>   MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
>     ...
> val numFeatures = trainDataset.first().features.size // first use 
> trainDataset
>     ...
>     // trainDataset will be used by other actions in run methods.
> val (baseLearners, learnerWeights) = if (withValidation) {
>   GradientBoostedTrees.runWithValidation(trainDataset, validationDataset, 
> boostingStrategy,
> $(seed), $(featureSubsetStrategy))
> } else {
>   GradientBoostedTrees.run(trainDataset, boostingStrategy, $(seed), 
> $(featureSubsetStrategy))
> }
> {code}
> This issue is reported by our tool CacheCheck, which is used to dynamically 
> detect persist()/unpersist() API misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29815) Missing persist in ml.tuning.CrossValidator.fit()

2019-11-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29815.
--
Resolution: Duplicate

> Missing persist in ml.tuning.CrossValidator.fit()
> -
>
> Key: SPARK-29815
> URL: https://issues.apache.org/jira/browse/SPARK-29815
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> dataset.toDF.rdd in ml.tuning.CrossValidator.fit(dataset: Dataset[_]) is split 
> into two rdds per fold: training and validation. Several actions are run on 
> these rdds, but dataset.toDF.rdd is not persisted, which will cause 
> recomputation.
> {code:scala}
> // Compute metrics for each model over each split
> val splits = MLUtils.kFold(dataset.toDF.rdd, $(numFolds), $(seed)) // 
> dataset.toDF.rdd should be persisted
> val metrics = splits.zipWithIndex.map { case ((training, validation), 
> splitIndex) =>
>   val trainingDataset = sparkSession.createDataFrame(training, 
> schema).cache()
>   val validationDataset = sparkSession.createDataFrame(validation, 
> schema).cache()
> {code}
> This issue is reported by our tool CacheCheck, which is used to dynamically 
> detect persist()/unpersist() API misuses.
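A small runnable sketch of the reuse pattern described above (the object name and toy 
RDD are assumptions): every fold produced by MLUtils.kFold() re-reads the parent RDD, 
so caching the parent once avoids recomputing its lineage for each fold.

{code:scala}
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object KFoldCaching {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kfold-demo").master("local[*]").getOrCreate()
    val base = spark.sparkContext.parallelize(1 to 10000).persist(StorageLevel.MEMORY_AND_DISK)

    // Each (training, validation) pair is derived from base, so both sides hit the cache.
    val folds = MLUtils.kFold(base, 3, 42)
    folds.zipWithIndex.foreach { case ((training, validation), i) =>
      println(s"fold $i: train=${training.count()} validation=${validation.count()}")
    }

    base.unpersist()
    spark.stop()
  }
}
{code}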



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29816) Missing persist in mllib.evaluation.BinaryClassificationMetrics.recallByThreshold()

2019-11-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29816.
--
Resolution: Duplicate

> Missing persist in 
> mllib.evaluation.BinaryClassificationMetrics.recallByThreshold()
> ---
>
> Key: SPARK-29816
> URL: https://issues.apache.org/jira/browse/SPARK-29816
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Minor
>
> The rdd produced by scoreAndLabels.combineByKey is used by two actions, 
> sortByKey and count(), so it needs to be persisted.
> {code:scala}
> val counts = scoreAndLabels.combineByKey(
>   createCombiner = (label: Double) => new BinaryLabelCounter(0L, 0L) += 
> label,
>   mergeValue = (c: BinaryLabelCounter, label: Double) => c += label,
>   mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 
> += c2
> ).sortByKey(ascending = false) // first use
> val binnedCounts =
>   // Only down-sample if bins is > 0
>   if (numBins == 0) {
> // Use original directly
> counts
>   } else {
> val countsSize = counts.count() //second use
> {code}
> This issue is reported by our tool CacheCheck, which is used to dynamically 
> detect persist()/unpersist() API misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29814) Missing persist on sources in mllib.feature.PCA

2019-11-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29814.
--
Resolution: Duplicate

> Missing persist on sources in mllib.feature.PCA
> ---
>
> Key: SPARK-29814
> URL: https://issues.apache.org/jira/browse/SPARK-29814
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> The rdd sources is used in more than one action: first() and the actions inside 
> computePrincipalComponentsAndExplainedVariance(), so it needs to be persisted.
> {code:scala}
>   def fit(sources: RDD[Vector]): PCAModel = {
> // first use rdd sources on action first()
> val numFeatures = sources.first().size
> require(k <= numFeatures,
>   s"source vector size $numFeatures must be no less than k=$k")
> require(PCAUtil.memoryCost(k, numFeatures) < Int.MaxValue,
>   "The param k and numFeatures is too large for SVD computation. " +
>   "Try reducing the parameter k for PCA, or reduce the input feature " +
>   "vector dimension to make this tractable.")
> val mat = new RowMatrix(sources)
> // second use rdd sources
> val (pc, explainedVariance) = 
> mat.computePrincipalComponentsAndExplainedVariance(k)
> {code}
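A self-contained illustration of the same point (toy vectors and the object name are 
assumptions, not the PCA code itself): sources feeds both first() and the jobs 
launched inside computePrincipalComponentsAndExplainedVariance(), so it is cached 
before the first action.

{code:scala}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PcaCaching {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pca-demo").master("local[*]").getOrCreate()
    val sources = spark.sparkContext
      .parallelize(Seq(
        Vectors.dense(1.0, 0.0, 7.0),
        Vectors.dense(2.0, 5.0, 1.0),
        Vectors.dense(4.0, 3.0, 2.0)))
      .persist(StorageLevel.MEMORY_AND_DISK)

    val numFeatures = sources.first().size                    // first action
    val (pc, explainedVariance) = new RowMatrix(sources)
      .computePrincipalComponentsAndExplainedVariance(2)      // more jobs on the same rdd
    println(s"numFeatures=$numFeatures explained=$explainedVariance")

    sources.unpersist()
    spark.stop()
  }
}
{code}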



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29817) Missing persist on docs in mllib.clustering.LDAOptimizer.initialize

2019-11-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29817.
--
Resolution: Duplicate

We should not be making 10+ JIRAs here. The changes in question are small and 
closely related and probably need to be considered together, or else we may 
resolve them differently by different people at different times. I'm going to 
close them in favor of the parent. We can consider breaking out some of them 
later.

> Missing persist on docs in mllib.clustering.LDAOptimizer.initialize
> ---
>
> Key: SPARK-29817
> URL: https://issues.apache.org/jira/browse/SPARK-29817
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> The rdd docs in mllib.clustering.LDAOptimizer is used in two actions: 
> verticesTMP.reduceByKey, and docs.take(1). It should be persisted.
> {code:scala}
>   override private[clustering] def initialize(
>   docs: RDD[(Long, Vector)],
>   lda: LDA): EMLDAOptimizer = {
>   ...
> val edges: RDD[Edge[TokenCount]] = docs.flatMap { case (docID: Long, 
> termCounts: Vector) =>
>   // Add edges for terms with non-zero counts.
>   termCounts.asBreeze.activeIterator.filter(_._2 != 0.0).map { case 
> (term, cnt) =>
> Edge(docID, term2index(term), cnt)
>   }
> }
> // Create vertices.
> // Initially, we use random soft assignments of tokens to topics (random 
> gamma).
> val docTermVertices: RDD[(VertexId, TopicCounts)] = {
>   val verticesTMP: RDD[(VertexId, TopicCounts)] =
> edges.mapPartitionsWithIndex { case (partIndex, partEdges) =>
>   val random = new Random(partIndex + randomSeed)
>   partEdges.flatMap { edge =>
> val gamma = normalize(BDV.fill[Double](k)(random.nextDouble()), 
> 1.0)
> val sum = gamma * edge.attr
> Seq((edge.srcId, sum), (edge.dstId, sum))
>   }
> }
>   verticesTMP.reduceByKey(_ + _) // RDD dependency: verticesTMP - edges - 
> docs. First use docs
> }
> // Partition such that edges are grouped by document
> this.graph = Graph(docTermVertices, 
> edges).partitionBy(PartitionStrategy.EdgePartition1D)
> this.k = k
> this.vocabSize = docs.take(1).head._2.size // Second use docs
> {code}
> This issue is reported by our tool CacheCheck, which is used to dynamically 
> detect persist()/unpersist() API misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29601) JDBC ODBC Tab Statement column should be provided ellipsis for bigger SQL statement

2019-11-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29601.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26364
[https://github.com/apache/spark/pull/26364]

> JDBC ODBC Tab Statement column should be provided ellipsis for bigger SQL 
> statement
> ---
>
> Key: SPARK-29601
> URL: https://issues.apache.org/jira/browse/SPARK-29601
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: pavithra ramachandran
>Priority: Minor
> Fix For: 3.0.0
>
>
> The Statement column in the JDBC/ODBC tab shows the whole SQL statement, so the 
> page size increases.
> Suppose a user submits TPC-DS queries; the page then displays the whole query 
> under the Statement column, and the user experience is not good.
> Expected:
> It should display an ellipsis (...), and clicking the statement should expand it 
> to display the whole SQL statement.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29601) JDBC ODBC Tab Statement column should be provided ellipsis for bigger SQL statement

2019-11-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-29601:


Assignee: pavithra ramachandran

> JDBC ODBC Tab Statement column should be provided ellipsis for bigger SQL 
> statement
> ---
>
> Key: SPARK-29601
> URL: https://issues.apache.org/jira/browse/SPARK-29601
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: pavithra ramachandran
>Priority: Minor
>
> The Statement column in the JDBC/ODBC tab shows the whole SQL statement, so the 
> page size increases.
> Suppose a user submits TPC-DS queries; the page then displays the whole query 
> under the Statement column, and the user experience is not good.
> Expected:
> It should display an ellipsis (...), and clicking the statement should expand it 
> to display the whole SQL statement.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29601) JDBC ODBC Tab Statement column should be provided ellipsis for bigger SQL statement

2019-11-10 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-29601:
-
Priority: Minor  (was: Major)

> JDBC ODBC Tab Statement column should be provided ellipsis for bigger SQL 
> statement
> ---
>
> Key: SPARK-29601
> URL: https://issues.apache.org/jira/browse/SPARK-29601
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> The Statement column in the JDBC/ODBC tab shows the whole SQL statement, so the 
> page size increases.
> Suppose a user submits TPC-DS queries; the page then displays the whole query 
> under the Statement column, and the user experience is not good.
> Expected:
> It should display an ellipsis (...), and clicking the statement should expand it 
> to display the whole SQL statement.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29820) Use GitHub Action Cache for `./.m2/repository`

2019-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29820.
---
Fix Version/s: 3.0.0
   2.4.5
   Resolution: Fixed

Issue resolved by pull request 26456
[https://github.com/apache/spark/pull/26456]

> Use GitHub Action Cache for `./.m2/repository`
> --
>
> Key: SPARK-29820
> URL: https://issues.apache.org/jira/browse/SPARK-29820
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.5, 3.0.0
>
>
> To reduce the Maven download flakiness, we had better enable caching for the 
> local Maven repository.
> - https://github.com/apache/spark/runs/295869450
> {code}
> [ERROR] Failed to execute goal on project spark-streaming-kafka-0-10_2.12: 
> Could not resolve dependencies for project 
> org.apache.spark:spark-streaming-kafka-0-10_2.12:jar:spark-367716: Could not 
> transfer artifact 
> com.fasterxml.jackson.datatype:jackson-datatype-jdk8:jar:2.10.0 from/to 
> central (https://repo.maven.apache.org/maven2): Failed to transfer file 
> https://repo.maven.apache.org/maven2/com/fasterxml/jackson/datatype/jackson-datatype-jdk8/2.10.0/jackson-datatype-jdk8-2.10.0.jar
>  with status code 503 -> [Help 1]
> 13
> [ERROR] 
> 14
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> 15
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> 16
> [ERROR] 
> 17
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> 18
> [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
> 19
> [ERROR] 
> 20
> [ERROR] After correcting the problems, you can resume the build with the 
> command
> 21
> [ERROR]   mvn  -rf :spark-streaming-kafka-0-10_2.12
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29820) Use GitHub Action Cache for `./.m2/repository`

2019-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29820:
-

Assignee: Dongjoon Hyun

> Use GitHub Action Cache for `./.m2/repository`
> --
>
> Key: SPARK-29820
> URL: https://issues.apache.org/jira/browse/SPARK-29820
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> To reduce the Maven download flakiness, we had better enable caching for the 
> local Maven repository.
> - https://github.com/apache/spark/runs/295869450
> {code}
> [ERROR] Failed to execute goal on project spark-streaming-kafka-0-10_2.12: 
> Could not resolve dependencies for project 
> org.apache.spark:spark-streaming-kafka-0-10_2.12:jar:spark-367716: Could not 
> transfer artifact 
> com.fasterxml.jackson.datatype:jackson-datatype-jdk8:jar:2.10.0 from/to 
> central (https://repo.maven.apache.org/maven2): Failed to transfer file 
> https://repo.maven.apache.org/maven2/com/fasterxml/jackson/datatype/jackson-datatype-jdk8/2.10.0/jackson-datatype-jdk8-2.10.0.jar
>  with status code 503 -> [Help 1]
> 13
> [ERROR] 
> 14
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> 15
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> 16
> [ERROR] 
> 17
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> 18
> [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
> 19
> [ERROR] 
> 20
> [ERROR] After correcting the problems, you can resume the build with the 
> command
> 21
> [ERROR]   mvn  -rf :spark-streaming-kafka-0-10_2.12
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29408) Support interval literal with negative sign `-`

2019-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29408.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26438
[https://github.com/apache/spark/pull/26438]

> Support interval literal with negative sign `-`
> ---
>
> Key: SPARK-29408
> URL: https://issues.apache.org/jira/browse/SPARK-29408
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> For example:
> {code}
> maxim=# select -interval '1 day -1 hour';
>  ?column?
> ---
>  -1 days +01:00:00
> (1 row)
> maxim=# select - interval '1-2' AS "negative year-month";
>  negative year-month 
> -
>  -1 years -2 mons
> (1 row)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29408) Support interval literal with negative sign `-`

2019-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29408:
-

Assignee: Maxim Gekk

> Support interval literal with negative sign `-`
> ---
>
> Key: SPARK-29408
> URL: https://issues.apache.org/jira/browse/SPARK-29408
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> For example:
> {code}
> maxim=# select -interval '1 day -1 hour';
>  ?column?
> ---
>  -1 days +01:00:00
> (1 row)
> maxim=# select - interval '1-2' AS "negative year-month";
>  negative year-month 
> -
>  -1 years -2 mons
> (1 row)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29528) Upgrade scala-maven-plugin to 4.3.0 for Scala 2.13.1

2019-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29528.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26457
[https://github.com/apache/spark/pull/26457]

> Upgrade scala-maven-plugin to 4.3.0 for Scala 2.13.1
> 
>
> Key: SPARK-29528
> URL: https://issues.apache.org/jira/browse/SPARK-29528
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> Scala 2.13.1 seems to break binary compatibility.
> We need to upgrade `scala-maven-plugin` to bring in the following fixes for 
> the latest Scala 2.13.1:
> - https://github.com/davidB/scala-maven-plugin/issues/363
> - https://github.com/sbt/zinc/issues/698
> We need `scala-maven-plugin` 4.3.0 because `4.2.4` throws error on Windows.
> - https://github.com/davidB/scala-maven-plugin/issues/370



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29819) Introduce an enum for interval units

2019-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29819.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26455
[https://github.com/apache/spark/pull/26455]

> Introduce an enum for interval units
> 
>
> Key: SPARK-29819
> URL: https://issues.apache.org/jira/browse/SPARK-29819
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Add an enum for interval units. This will allow type checking of inputs and 
> avoid typos in interval unit names.
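A minimal sketch of what such an enumeration could look like; the identifiers below 
are illustrative assumptions and not necessarily the names used in the actual change.

{code:scala}
object IntervalUnit extends Enumeration {
  type IntervalUnit = Value
  val YEAR, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND = Value

  // Resolving a user-supplied unit name through the enum fails fast on typos
  // instead of silently accepting an unknown unit.
  def fromString(name: String): IntervalUnit =
    values.find(_.toString.equalsIgnoreCase(name)).getOrElse(
      throw new IllegalArgumentException(s"Unknown interval unit: $name"))
}
{code}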



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29819) Introduce an enum for interval units

2019-11-10 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29819:
-

Assignee: Maxim Gekk

> Introduce an enum for interval units
> 
>
> Key: SPARK-29819
> URL: https://issues.apache.org/jira/browse/SPARK-29819
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Add an enum for interval units. This will allow type checking of inputs and 
> avoid typos in interval unit names.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29828) Missing persist on ratings in ml.recommendation.ALS.train

2019-11-10 Thread Dong Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Wang updated SPARK-29828:
--
Description: 
There is a ratings.isEmpty() call at the beginning of 
ml.recommendation.ALS.train(). Actually, isEmpty() contains an action:
{code:scala}
  def isEmpty(): Boolean = withScope {
partitions.length == 0 || take(1).length == 0
  }
{code}

So the rdd ratings will be used by multiple actions; it should be persisted, and 
unpersisted after its child rdd has been persisted.
{code:scala}
  def train[ID: ClassTag]( // scalastyle:ignore
  ratings: RDD[Rating[ID]],
...
require(!ratings.isEmpty(), s"No ratings available from $ratings") // first 
use ratings
...
val blockRatings = partitionRatings(ratings, userPart, itemPart)
  .persist(intermediateRDDStorageLevel)
val (userInBlocks, userOutBlocks) =
  makeBlocks("user", blockRatings, userPart, itemPart, 
intermediateRDDStorageLevel)
userOutBlocks.count()// materialize blockRatings and user blocks
// ratings should be unpersisted here
{code}

This issue is reported by our tool CacheCheck, which is used to dynamically 
detect persist()/unpersist() API misuses.

  was:
Two missing persist issues in ml.recommendation.ALS.train().

1. There is a ratings.isEmpty() at the beginning of the method. Actually, 
isEmpty() has an action operator.
{code:scala}
  def isEmpty(): Boolean = withScope {
partitions.length == 0 || take(1).length == 0
  }
{code}

So rdd ratings will be used by multi actions, it should be persisted, and 
unpersisted after its child rdd has been persisted.
{code:scala}
  def train[ID: ClassTag]( // scalastyle:ignore
  ratings: RDD[Rating[ID]],
...
require(!ratings.isEmpty(), s"No ratings available from $ratings") // first 
use ratings
...
val blockRatings = partitionRatings(ratings, userPart, itemPart)
  .persist(intermediateRDDStorageLevel)
val (userInBlocks, userOutBlocks) =
  makeBlocks("user", blockRatings, userPart, itemPart, 
intermediateRDDStorageLevel)
userOutBlocks.count()// materialize blockRatings and user blocks
// ratings should be unpersisted here
{code}

2. 
This issue is reported by our tool CacheCheck, which is used to dynamically 
detecting persist()/unpersist() api misuses.


> Missing persist on ratings in ml.recommendation.ALS.train
> -
>
> Key: SPARK-29828
> URL: https://issues.apache.org/jira/browse/SPARK-29828
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> There is a ratings.isEmpty() call at the beginning of 
> ml.recommendation.ALS.train(). Actually, isEmpty() contains an action:
> {code:scala}
>   def isEmpty(): Boolean = withScope {
> partitions.length == 0 || take(1).length == 0
>   }
> {code}
> So the rdd ratings will be used by multiple actions; it should be persisted, and 
> unpersisted after its child rdd has been persisted.
> {code:scala}
>   def train[ID: ClassTag]( // scalastyle:ignore
>   ratings: RDD[Rating[ID]],
> ...
> require(!ratings.isEmpty(), s"No ratings available from $ratings") // 
> first use ratings
> ...
> val blockRatings = partitionRatings(ratings, userPart, itemPart)
>   .persist(intermediateRDDStorageLevel)
> val (userInBlocks, userOutBlocks) =
>   makeBlocks("user", blockRatings, userPart, itemPart, 
> intermediateRDDStorageLevel)
> userOutBlocks.count()// materialize blockRatings and user blocks
> // ratings should be unpersisted here
> {code}
> This issue is reported by our tool CacheCheck, which is used to dynamically 
> detect persist()/unpersist() API misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29828) Missing persist on ratings in ml.recommendation.ALS.train

2019-11-10 Thread Dong Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Wang updated SPARK-29828:
--
Summary: Missing persist on ratings in ml.recommendation.ALS.train  (was: 
Missing persist in ml.recommendation.ALS.train)

> Missing persist on ratings in ml.recommendation.ALS.train
> -
>
> Key: SPARK-29828
> URL: https://issues.apache.org/jira/browse/SPARK-29828
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> Two missing persist issues in ml.recommendation.ALS.train().
> 1. There is a ratings.isEmpty() at the beginning of the method. Actually, 
> isEmpty() has an action operator.
> {code:scala}
>   def isEmpty(): Boolean = withScope {
> partitions.length == 0 || take(1).length == 0
>   }
> {code}
> So rdd ratings will be used by multi actions, it should be persisted, and 
> unpersisted after its child rdd has been persisted.
> {code:scala}
>   def train[ID: ClassTag]( // scalastyle:ignore
>   ratings: RDD[Rating[ID]],
> ...
> require(!ratings.isEmpty(), s"No ratings available from $ratings") // 
> first use ratings
> ...
> val blockRatings = partitionRatings(ratings, userPart, itemPart)
>   .persist(intermediateRDDStorageLevel)
> val (userInBlocks, userOutBlocks) =
>   makeBlocks("user", blockRatings, userPart, itemPart, 
> intermediateRDDStorageLevel)
> userOutBlocks.count()// materialize blockRatings and user blocks
> // ratings should be unpersisted here
> {code}
> 2. 
> This issue is reported by our tool CacheCheck, which dynamically detects 
> persist()/unpersist() API misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29828) Missing persist in ml.recommendation.ALS.train

2019-11-10 Thread Dong Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Wang updated SPARK-29828:
--
Description: 
Two missing persist issues in ml.recommendation.ALS.train().

1. There is a ratings.isEmpty() at the beginning of the method. Actually, 
isEmpty() triggers an action:
{code:scala}
  def isEmpty(): Boolean = withScope {
partitions.length == 0 || take(1).length == 0
  }
{code}

So the rdd ratings will be used by multiple actions; it should be persisted, and 
unpersisted after its child rdd has been persisted.
{code:scala}
  def train[ID: ClassTag]( // scalastyle:ignore
  ratings: RDD[Rating[ID]],
...
require(!ratings.isEmpty(), s"No ratings available from $ratings") // first 
use ratings
...
val blockRatings = partitionRatings(ratings, userPart, itemPart)
  .persist(intermediateRDDStorageLevel)
val (userInBlocks, userOutBlocks) =
  makeBlocks("user", blockRatings, userPart, itemPart, 
intermediateRDDStorageLevel)
userOutBlocks.count()// materialize blockRatings and user blocks
// ratings should be unpersisted here
{code}

2. 
This issue is reported by our tool CacheCheck, which dynamically detects 
persist()/unpersist() API misuses.

  was:
There is a ratings.isEmpty() in ml.recommendation.ALS.train(). Actually, 
isEmpty() triggers an action:

{code:scala}
  def isEmpty(): Boolean = withScope {
partitions.length == 0 || take(1).length == 0
  }
{code}

So the rdd ratings will be used by multiple actions; it should be persisted, and 
unpersisted after its child rdd has been persisted.

{code:scala}
  def train[ID: ClassTag]( // scalastyle:ignore
  ratings: RDD[Rating[ID]],
...
require(!ratings.isEmpty(), s"No ratings available from $ratings") // first 
use ratings
...
val blockRatings = partitionRatings(ratings, userPart, itemPart)
  .persist(intermediateRDDStorageLevel)
val (userInBlocks, userOutBlocks) =
  makeBlocks("user", blockRatings, userPart, itemPart, 
intermediateRDDStorageLevel)
userOutBlocks.count()// materialize blockRatings and user blocks
// ratings should be unpersisted here
{code}

This issue is reported by our tool CacheCheck, which dynamically detects 
persist()/unpersist() API misuses.


> Missing persist in ml.recommendation.ALS.train
> --
>
> Key: SPARK-29828
> URL: https://issues.apache.org/jira/browse/SPARK-29828
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> Two missing persist issues in ml.recommendation.ALS.train().
> 1. There is a ratings.isEmpty() at the beginning of the method. Actually, 
> isEmpty() triggers an action:
> {code:scala}
>   def isEmpty(): Boolean = withScope {
> partitions.length == 0 || take(1).length == 0
>   }
> {code}
> So the rdd ratings will be used by multiple actions; it should be persisted, and 
> unpersisted after its child rdd has been persisted.
> {code:scala}
>   def train[ID: ClassTag]( // scalastyle:ignore
>   ratings: RDD[Rating[ID]],
> ...
> require(!ratings.isEmpty(), s"No ratings available from $ratings") // 
> first use ratings
> ...
> val blockRatings = partitionRatings(ratings, userPart, itemPart)
>   .persist(intermediateRDDStorageLevel)
> val (userInBlocks, userOutBlocks) =
>   makeBlocks("user", blockRatings, userPart, itemPart, 
> intermediateRDDStorageLevel)
> userOutBlocks.count()// materialize blockRatings and user blocks
> // ratings should be unpersisted here
> {code}
> 2. 
> This issue is reported by our tool CacheCheck, which dynamically detects 
> persist()/unpersist() API misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29828) Missing persist in ml.recommendation.ALS.train

2019-11-10 Thread Dong Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dong Wang updated SPARK-29828:
--
Summary: Missing persist in ml.recommendation.ALS.train  (was: Missing 
persist on ratings on ratings in ml.recommendation.ALS.train)

> Missing persist in ml.recommendation.ALS.train
> --
>
> Key: SPARK-29828
> URL: https://issues.apache.org/jira/browse/SPARK-29828
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> There is a ratings.isEmpty() in ml.recommendation.ALS.train(). Actually, 
> isEmpty() triggers an action:
> {code:scala}
>   def isEmpty(): Boolean = withScope {
> partitions.length == 0 || take(1).length == 0
>   }
> {code}
> So the rdd ratings will be used by multiple actions; it should be persisted, and 
> unpersisted after its child rdd has been persisted.
> {code:scala}
>   def train[ID: ClassTag]( // scalastyle:ignore
>   ratings: RDD[Rating[ID]],
> ...
> require(!ratings.isEmpty(), s"No ratings available from $ratings") // 
> first use ratings
> ...
> val blockRatings = partitionRatings(ratings, userPart, itemPart)
>   .persist(intermediateRDDStorageLevel)
> val (userInBlocks, userOutBlocks) =
>   makeBlocks("user", blockRatings, userPart, itemPart, 
> intermediateRDDStorageLevel)
> userOutBlocks.count()// materialize blockRatings and user blocks
> // ratings should be unpersisted here
> {code}
> This issue is reported by our tool CacheCheck, which dynamically detects 
> persist()/unpersist() API misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29828) Missing persist on ratings on ratings in ml.recommendation.ALS.train

2019-11-10 Thread Dong Wang (Jira)
Dong Wang created SPARK-29828:
-

 Summary: Missing persist on ratings on ratings in 
ml.recommendation.ALS.train
 Key: SPARK-29828
 URL: https://issues.apache.org/jira/browse/SPARK-29828
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.4.3
Reporter: Dong Wang


There is a ratings.isEmpty() in ml.recommendation.ALS.train(). Actually, 
isEmpty() triggers an action:

{code:scala}
  def isEmpty(): Boolean = withScope {
partitions.length == 0 || take(1).length == 0
  }
{code}

So the rdd ratings will be used by multiple actions; it should be persisted, and 
unpersisted after its child rdd has been persisted.

{code:scala}
  def train[ID: ClassTag]( // scalastyle:ignore
  ratings: RDD[Rating[ID]],
...
require(!ratings.isEmpty(), s"No ratings available from $ratings") // first 
use ratings
...
val blockRatings = partitionRatings(ratings, userPart, itemPart)
  .persist(intermediateRDDStorageLevel)
val (userInBlocks, userOutBlocks) =
  makeBlocks("user", blockRatings, userPart, itemPart, 
intermediateRDDStorageLevel)
userOutBlocks.count()// materialize blockRatings and user blocks
// ratings should be unpersisted here
{code}

This issue is reported by our tool CacheCheck, which dynamically detects 
persist()/unpersist() API misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29827) Wrong persist strategy in mllib.clustering.BisectingKMeans.run

2019-11-10 Thread Dong Wang (Jira)
Dong Wang created SPARK-29827:
-

 Summary: Wrong persist strategy in 
mllib.clustering.BisectingKMeans.run
 Key: SPARK-29827
 URL: https://issues.apache.org/jira/browse/SPARK-29827
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.4.3
Reporter: Dong Wang


There are three persist misuses in mllib.clustering.BisectingKMeans.run.
First, the rdd input should be persisted, because it is used by the action first() 
and by further actions later in the method.
Second, the rdd assignments should be persisted: it is passed to summarize() more 
than once, and summarize() runs an action on it.
Third, persisting the rdd norms is unnecessary, because it is just an intermediate 
rdd; once its child rdd assignments is persisted, norms does not need to be persisted.

{code:scala}
  private[spark] def run(
  input: RDD[Vector],
  instr: Option[Instrumentation]): BisectingKMeansModel = {
if (input.getStorageLevel == StorageLevel.NONE) {
  logWarning(s"The input RDD ${input.id} is not directly cached, which may 
hurt performance if"
+ " its parent RDDs are also not cached.")
}
// Needs to persist input
val d = input.map(_.size).first() 
logInfo(s"Feature dimension: $d.")
val dMeasure: DistanceMeasure = 
DistanceMeasure.decodeFromString(this.distanceMeasure)
// Compute and cache vector norms for fast distance computation.
val norms = input.map(v => Vectors.norm(v, 
2.0)).persist(StorageLevel.MEMORY_AND_DISK)  // Unnecessary persist
val vectors = input.zip(norms).map { case (x, norm) => new 
VectorWithNorm(x, norm) }
var assignments = vectors.map(v => (ROOT_INDEX, v))  // Needs to persist
var activeClusters = summarize(d, assignments, dMeasure)
{code}
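
A minimal spark-shell style sketch of the suggested persist placement (illustrative only: it assumes sc is an available SparkContext and uses toy stand-in RDDs rather than the actual BisectingKMeans internals):

{code:scala}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.storage.StorageLevel

// Persist the source: it is hit by first() here and by later actions.
val input = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)))
  .persist(StorageLevel.MEMORY_AND_DISK)
val d = input.map(_.size).first()

// norms is consumed exactly once to build the next rdd, so it is not persisted.
val norms = input.map(v => Vectors.norm(v, 2.0))

// assignments is reused by repeated summarize-style actions, so persist it.
val assignments = input.zip(norms).map { case (v, norm) => (0, (v, norm)) }
  .persist(StorageLevel.MEMORY_AND_DISK)
assignments.count()
assignments.map(_._2._2).sum()

input.unpersist()
{code}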

This issue is reported by our tool CacheCheck, which dynamically detects 
persist()/unpersist() API misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29826) Missing persist on data in mllib.feature.ChiSqSelector.fit

2019-11-10 Thread Dong Wang (Jira)
Dong Wang created SPARK-29826:
-

 Summary: Missing persist on data in mllib.feature.ChiSqSelector.fit
 Key: SPARK-29826
 URL: https://issues.apache.org/jira/browse/SPARK-29826
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 2.4.3
Reporter: Dong Wang


The rdd data in mllib.feature.ChiSqSelector.fit() is used by an action inside 
Statistics.chiSqTest(data) and by further actions later in the method, but it is 
never persisted.

{code:scala} 
  def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
val chiSqTestResult = Statistics.chiSqTest(data).zipWithIndex
val features = selectorType match {
  case ChiSqSelector.NumTopFeatures =>
chiSqTestResult
  .sortBy { case (res, _) => res.pValue }
  .take(numTopFeatures)
{code}
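
A minimal spark-shell style sketch of the suggested fix (illustrative only: it assumes sc is an available SparkContext and uses a tiny toy dataset, not the ChiSqSelector internals): persist data before the action hidden inside Statistics.chiSqTest() and before any later reuse.

{code:scala}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.storage.StorageLevel

// Persist data up front; chiSqTest runs an action on it internally.
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.0))
)).persist(StorageLevel.MEMORY_AND_DISK)

val chiSqTestResult = Statistics.chiSqTest(data).zipWithIndex  // first action on data
val n = data.count()                                           // later reuse hits the cache
data.unpersist()
{code}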

This issue is reported by our tool CacheCheck, which dynamically detects 
persist()/unpersist() API misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29825) Add join conditions in join-related tests of SQLQueryTestSuite

2019-11-10 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-29825:


 Summary: Add join conditions in join-related tests of 
SQLQueryTestSuite
 Key: SPARK-29825
 URL: https://issues.apache.org/jira/browse/SPARK-29825
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29824) Missing persist on trainDataset in ml.classification.GBTClassifier.train()

2019-11-10 Thread Dong Wang (Jira)
Dong Wang created SPARK-29824:
-

 Summary: Missing persist on trainDataset in 
ml.classification.GBTClassifier.train()
 Key: SPARK-29824
 URL: https://issues.apache.org/jira/browse/SPARK-29824
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.4.3
Reporter: Dong Wang


The rdd trainDataset in ml.classification.GBTClassifier.train() is used by a 
first action and then by further actions inside 
GradientBoostedTrees.run/runWithValidation, but it is not persisted, so 
trainDataset is recomputed for each of those actions.

{code:scala}
  override protected def train(
  dataset: Dataset[_]): GBTClassificationModel = instrumented { instr =>
val categoricalFeatures: Map[Int, Int] =
  MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
    ...
val numFeatures = trainDataset.first().features.size // first use 
trainDataset
    ...
    // trainDataset will be used by other actions in run methods.
val (baseLearners, learnerWeights) = if (withValidation) {
  GradientBoostedTrees.runWithValidation(trainDataset, validationDataset, 
boostingStrategy,
$(seed), $(featureSubsetStrategy))
} else {
  GradientBoostedTrees.run(trainDataset, boostingStrategy, $(seed), 
$(featureSubsetStrategy))
}
{code}
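
A minimal spark-shell style sketch of the pattern (illustrative only: it assumes sc is an available SparkContext; multiPassRoutine below is a hypothetical stand-in for GradientBoostedTrees.run/runWithValidation, not the real API):

{code:scala}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical stand-in for GradientBoostedTrees.run: several passes over the data.
def multiPassRoutine(train: RDD[LabeledPoint]): Long =
  (1 to 3).map(_ => train.count()).sum

// Persist trainDataset before the first action and the multi-pass routine.
val trainDataset = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0)),
  LabeledPoint(1.0, Vectors.dense(2.0))
)).persist(StorageLevel.MEMORY_AND_DISK)

val numFeatures = trainDataset.first().features.size   // first action
multiPassRoutine(trainDataset)                         // later actions reuse the cache
trainDataset.unpersist()
{code}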

This issue is reported by our tool CacheCheck, which dynamically detects 
persist()/unpersist() API misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29823) Wrong persist strategy in mllib.clustering.KMeans.run()

2019-11-10 Thread Dong Wang (Jira)
Dong Wang created SPARK-29823:
-

 Summary: Wrong persist strategy in mllib.clustering.KMeans.run()
 Key: SPARK-29823
 URL: https://issues.apache.org/jira/browse/SPARK-29823
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.4.3
Reporter: Dong Wang


In mllib.clustering.KMeans.run(), the rdd norms is persisted. But it only has a 
single child rdd, zippedData, so that persist is unnecessary. On the other hand, 
the child rdd zippedData is used multiple times in runAlgorithm, so 
zippedData is the one that should be persisted.

{code:scala}
  private[spark] def run(
  data: RDD[Vector],
  instr: Option[Instrumentation]): KMeansModel = {
if (data.getStorageLevel == StorageLevel.NONE) {
  logWarning("The input data is not directly cached, which may hurt 
performance if its"
+ " parent RDDs are also uncached.")
}
// Compute squared norms and cache them.
val norms = data.map(Vectors.norm(_, 2.0))
norms.persist() // Unnecessary persist. Only used to generate zippedData.
val zippedData = data.zip(norms).map { case (v, norm) =>
  new VectorWithNorm(v, norm)
} // needs to persist
val model = runAlgorithm(zippedData, instr)
norms.unpersist() // Change to zippedData.unpersist()
{code}
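
A minimal spark-shell style sketch of the suggested change (illustrative only: it assumes sc is an available SparkContext and uses toy vectors, not the actual KMeans internals): drop the persist on norms, persist zippedData instead, and unpersist zippedData once the iterative work finishes.

{code:scala}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.storage.StorageLevel

val data = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)))
val norms = data.map(Vectors.norm(_, 2.0))          // single use: no persist needed
val zippedData = data.zip(norms)
  .persist(StorageLevel.MEMORY_AND_DISK)            // reused across iterations, so persist here
val passes = (1 to 5).map(_ => zippedData.count())  // stand-in for runAlgorithm's repeated actions
zippedData.unpersist()                              // instead of norms.unpersist()
{code}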

This issue is reported by our tool CacheCheck, which dynamically detects 
persist()/unpersist() API misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29822) Cast error when there are spaces between signs and values

2019-11-10 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971059#comment-16971059
 ] 

Kent Yao commented on SPARK-29822:
--

[https://github.com/apache/spark/pull/26449]

> Cast error when there are spaces between signs and values
> -
>
> Key: SPARK-29822
> URL: https://issues.apache.org/jira/browse/SPARK-29822
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Priority: Major
>
> With the latest string-to-literal optimization, some interval strings cannot 
> be cast when there are spaces between the sign and the unit value.
> How to reproduce:
> {code:java}
> select cast(v as interval) from values ('+ 1 second') t(v);
> select cast(v as interval) from values ('- 1 second') t(v);
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29822) Cast error when there are spaces between signs and values

2019-11-10 Thread Kent Yao (Jira)
Kent Yao created SPARK-29822:


 Summary: Cast error when there are spaces between signs and values
 Key: SPARK-29822
 URL: https://issues.apache.org/jira/browse/SPARK-29822
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kent Yao


With the latest string-to-literal optimization, some interval strings cannot 
be cast when there are spaces between the sign and the unit value.

How to reproduce:
{code:java}
select cast(v as interval) from values ('+ 1 second') t(v);
select cast(v as interval) from values ('- 1 second') t(v);
{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org