[jira] [Commented] (SPARK-16203) regexp_extract to return an ArrayType(StringType())

2018-07-21 Thread Nick Nicolini (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551885#comment-16551885
 ] 

Nick Nicolini commented on SPARK-16203:
---

[~srowen] [~hvanhovell] I want to re-open this discussion. I've recently hit 
many cases of regexp parsing where we need to match a pattern that can repeat 
an arbitrary number of times; for example, a text block that looks something 
like:

 
{code:java}
AAA:WORDS|
BBB:TEXT|
MSG:ASDF|
MSG:QWER|
...
MSG:ZXCV|{code}
Here I need to pull out every value between "MSG:" and "|", which can occur 
anywhere from 1 to n times per record. I cannot reliably use the method shown 
above, and while I can write a UDF to handle this, it would be great if it were 
supported natively in Spark.

Perhaps we can implement something like "regexp_extract_all" as 
[presto|https://prestodb.io/docs/current/functions/regexp.html] and 
[pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html]
 have?
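
For reference, a minimal sketch of such a UDF in Spark's Scala API (my own illustration; the name regexpExtractAll is hypothetical, not a built-in): it returns every capture of group 1 as an array, which is roughly what a native regexp_extract_all would provide.
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, udf}

val spark = SparkSession.builder().master("local[*]").appName("regexp-extract-all-sketch").getOrCreate()
import spark.implicits._

// Hypothetical UDF: return all captures of group 1 for a given pattern.
val regexpExtractAll = udf { (s: String, pattern: String) =>
  if (s == null) null else pattern.r.findAllMatchIn(s).map(_.group(1)).toSeq
}

val df = Seq("AAA:WORDS|BBB:TEXT|MSG:ASDF|MSG:QWER|MSG:ZXCV|").toDF("line")

// Pull out every value between "MSG:" and "|" as an ArrayType(StringType) column.
df.select(regexpExtractAll($"line", lit("MSG:([^|]+)\\|")).as("msgs")).show(false)
{code}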

> regexp_extract to return an ArrayType(StringType())
> ---
>
> Key: SPARK-16203
> URL: https://issues.apache.org/jira/browse/SPARK-16203
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Max Moroz
>Priority: Minor
>
> regexp_extract only returns a single matched group. If (as is often the case, 
> e.g., in web log parsing) we need to parse the entire line and get all the 
> groups, we'll need to call it as many times as there are groups.
> Syntactically, that is only a minor annoyance.
> But unless I misunderstand something, it would be very inefficient. (How 
> would Spark know not to do multiple pattern-matching operations when only one 
> is needed? Or does the optimizer actually check whether the patterns are 
> identical and, if they are, avoid the repeated regex matching?)
> Would it be possible to have it return an array when the index is not 
> specified (defaulting to None)?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16203) regexp_extract to return an ArrayType(StringType())

2018-07-21 Thread Nick Nicolini (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551885#comment-16551885
 ] 

Nick Nicolini edited comment on SPARK-16203 at 7/22/18 3:21 AM:


[~srowen] [~hvanhovell] I want to re-open this discussion. I've recently hit 
many cases of regexp parsing where we need to match a pattern that can repeat 
an arbitrary number of times; for example, a text block that looks something 
like:

 
{code:java}
AAA:WORDS|
BBB:TEXT|
MSG:ASDF|
MSG:QWER|
...
MSG:ZXCV|{code}
Here I need to pull out every value between "MSG:" and "|", which can occur 
anywhere from 1 to n times per record. I cannot reliably use the method shown 
above, and while I can write a UDF to handle this, it would be great if it were 
supported natively in Spark.

Perhaps we can implement something like "regexp_extract_all" as 
[Presto|https://prestodb.io/docs/current/functions/regexp.html] and 
[Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html]
 have?


was (Author: nnicolini):
[~srowen] [~hvanhovell] I want to re-open this discussion. I've recently hit 
many cases of regexp parsing where we need to match a pattern that can repeat 
an arbitrary number of times; for example, a text block that looks something 
like:

 
{code:java}
AAA:WORDS|
BBB:TEXT|
MSG:ASDF|
MSG:QWER|
...
MSG:ZXCV|{code}
Here I need to pull out every value between "MSG:" and "|", which can occur 
anywhere from 1 to n times per record. I cannot reliably use the method shown 
above, and while I can write a UDF to handle this, it would be great if it were 
supported natively in Spark.

Perhaps we can implement something like "regexp_extract_all" as 
[presto|https://prestodb.io/docs/current/functions/regexp.html] and 
[pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html]
 have?

> regexp_extract to return an ArrayType(StringType())
> ---
>
> Key: SPARK-16203
> URL: https://issues.apache.org/jira/browse/SPARK-16203
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Max Moroz
>Priority: Minor
>
> regexp_extract only returns a single matched group. If (as is often the case, 
> e.g., in web log parsing) we need to parse the entire line and get all the 
> groups, we'll need to call it as many times as there are groups.
> Syntactically, that is only a minor annoyance.
> But unless I misunderstand something, it would be very inefficient. (How 
> would Spark know not to do multiple pattern-matching operations when only one 
> is needed? Or does the optimizer actually check whether the patterns are 
> identical and, if they are, avoid the repeated regex matching?)
> Would it be possible to have it return an array when the index is not 
> specified (defaulting to None)?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16203) regexp_extract to return an ArrayType(StringType())

2018-07-21 Thread Nick Nicolini (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551885#comment-16551885
 ] 

Nick Nicolini edited comment on SPARK-16203 at 7/22/18 3:21 AM:


[~srowen] [~hvanhovell] I want to re-open this discussion. I've recently hit 
many cases of regexp parsing where we need to match a pattern that can repeat 
an arbitrary number of times; for example, a text block that looks something 
like:
{code:java}
AAA:WORDS|
BBB:TEXT|
MSG:ASDF|
MSG:QWER|
...
MSG:ZXCV|{code}
Here I need to pull out every value between "MSG:" and "|", which can occur 
anywhere from 1 to n times per record. I cannot reliably use the method shown 
above, and while I can write a UDF to handle this, it would be great if it were 
supported natively in Spark.

Perhaps we can implement something like "regexp_extract_all" as 
[Presto|https://prestodb.io/docs/current/functions/regexp.html] and 
[Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html]
 have?


was (Author: nnicolini):
[~srowen] [~hvanhovell] I want to re-open this discussion. I've recently hit 
many cases of regexp parsing where we need to match a pattern that can repeat 
an arbitrary number of times; for example, a text block that looks something 
like:

 
{code:java}
AAA:WORDS|
BBB:TEXT|
MSG:ASDF|
MSG:QWER|
...
MSG:ZXCV|{code}
Here I need to pull out every value between "MSG:" and "|", which can occur 
anywhere from 1 to n times per record. I cannot reliably use the method shown 
above, and while I can write a UDF to handle this, it would be great if it were 
supported natively in Spark.

Perhaps we can implement something like "regexp_extract_all" as 
[Presto|https://prestodb.io/docs/current/functions/regexp.html] and 
[Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html]
 have?

> regexp_extract to return an ArrayType(StringType())
> ---
>
> Key: SPARK-16203
> URL: https://issues.apache.org/jira/browse/SPARK-16203
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Max Moroz
>Priority: Minor
>
> regexp_extract only returns a single matched group. If (as is often the case, 
> e.g., in web log parsing) we need to parse the entire line and get all the 
> groups, we'll need to call it as many times as there are groups.
> Syntactically, that is only a minor annoyance.
> But unless I misunderstand something, it would be very inefficient. (How 
> would Spark know not to do multiple pattern-matching operations when only one 
> is needed? Or does the optimizer actually check whether the patterns are 
> identical and, if they are, avoid the repeated regex matching?)
> Would it be possible to have it return an array when the index is not 
> specified (defaulting to None)?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24864) Cannot resolve auto-generated column ordinals in a hive view

2018-07-21 Thread Dilip Biswal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551859#comment-16551859
 ] 

Dilip Biswal commented on SPARK-24864:
--

I agree with [~srowen]

> Cannot resolve auto-generated column ordinals in a hive view
> 
>
> Key: SPARK-24864
> URL: https://issues.apache.org/jira/browse/SPARK-24864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Abhishek Madav
>Priority: Major
>
> A Spark job reading from a Hive view fails with an AnalysisException when 
> resolving auto-generated column ordinals.
> *Exception*:
> {code:java}
> scala> spark.sql("Select * from vsrc1new").show
> org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given 
> input columns: [id, upper(name)]; line 1 pos 24;
> 'Project [*]
> +- 'SubqueryAlias vsrc1new, `default`.`vsrc1new`
>    +- 'Project [id#634, 'vsrc1new._c1 AS uname#633]
>   +- SubqueryAlias vsrc1new
>  +- Project [id#634, upper(name#635) AS upper(name)#636]
>     +- MetastoreRelation default, src1
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
> {code}
> *Steps to reproduce:*
> 1: Create a simple table, say src1
> {code:java}
> CREATE TABLE `src1`(`id` int,  `name` string) ROW FORMAT DELIMITED FIELDS 
> TERMINATED BY ','
> {code}
> 2: Create a view, say with name vsrc1new
> {code:java}
> CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, 
> upper(name) FROM src1) vsrc1new;
> {code}
> 3. Selecting data from this view in hive-cli/beeline doesn't cause any error.
> 4. Creating a dataframe using:
> {code:java}
> spark.sql("Select * from vsrc1new").show //throws error
> {code}
> The auto-generated column names for the view are not resolved. Am I possibly 
> missing some Spark SQL configuration here? I tried the repro case against 
> Spark 1.6 and that worked fine. Any inputs are appreciated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24864) Cannot resolve auto-generated column ordinals in a hive view

2018-07-21 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551798#comment-16551798
 ] 

Sean Owen commented on SPARK-24864:
---

No compatibility is promised between 1.x and 2.x. I don't think this behavior 
was guaranteed to begin with. You should always specify aliases directly if you 
depend on their value.
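
For example, a sketch of that workaround (assuming the src1 table from the repro below and an active SparkSession named spark): give the expression an explicit alias instead of relying on the auto-generated _c1 name.
{code:java}
// Recreate the view with an explicit alias rather than referencing `_c1`.
spark.sql("DROP VIEW IF EXISTS vsrc1new")
spark.sql("CREATE VIEW vsrc1new AS SELECT id, upper(name) AS uname FROM src1")

// This now resolves in Spark as well as in the Hive CLI / Beeline.
spark.sql("SELECT * FROM vsrc1new").show()
{code}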

> Cannot resolve auto-generated column ordinals in a hive view
> 
>
> Key: SPARK-24864
> URL: https://issues.apache.org/jira/browse/SPARK-24864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Abhishek Madav
>Priority: Major
>
> A Spark job reading from a Hive view fails with an AnalysisException when 
> resolving auto-generated column ordinals.
> *Exception*:
> {code:java}
> scala> spark.sql("Select * from vsrc1new").show
> org.apache.spark.sql.AnalysisException: cannot resolve '`vsrc1new._c1`' given 
> input columns: [id, upper(name)]; line 1 pos 24;
> 'Project [*]
> +- 'SubqueryAlias vsrc1new, `default`.`vsrc1new`
>    +- 'Project [id#634, 'vsrc1new._c1 AS uname#633]
>   +- SubqueryAlias vsrc1new
>  +- Project [id#634, upper(name#635) AS upper(name)#636]
>     +- MetastoreRelation default, src1
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
> {code}
> *Steps to reproduce:*
> 1: Create a simple table, say src1
> {code:java}
> CREATE TABLE `src1`(`id` int,  `name` string) ROW FORMAT DELIMITED FIELDS 
> TERMINATED BY ','
> {code}
> 2: Create a view, say with name vsrc1new
> {code:java}
> CREATE VIEW vsrc1new AS SELECT id, `_c1` AS uname FROM (SELECT id, 
> upper(name) FROM src1) vsrc1new;
> {code}
> 3. Selecting data from this view in hive-cli/beeline doesn't cause any error.
> 4. Creating a dataframe using:
> {code:java}
> spark.sql("Select * from vsrc1new").show //throws error
> {code}
> The auto-generated column names for the view are not resolved. Am I possibly 
> missing some Spark SQL configuration here? I tried the repro case against 
> Spark 1.6 and that worked fine. Any inputs are appreciated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24869) SaveIntoDataSourceCommand's input Dataset does not use Cached Data

2018-07-21 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551779#comment-16551779
 ] 

Takeshi Yamamuro commented on SPARK-24869:
--

Actually, isn't the cache correctly used in this case? 
https://github.com/apache/spark/compare/master...maropu:SPARK-24869
I'm not sure about the case you pointed out in this ticket...

> SaveIntoDataSourceCommand's input Dataset does not use Cached Data
> --
>
> Key: SPARK-24869
> URL: https://issues.apache.org/jira/browse/SPARK-24869
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Priority: Major
>
> {code}
> withTable("t") {
>   withTempPath { path =>
> var numTotalCachedHit = 0
> val listener = new QueryExecutionListener {
>   override def onFailure(f: String, qe: QueryExecution, e: 
> Exception):Unit = {}
>   override def onSuccess(funcName: String, qe: QueryExecution, 
> duration: Long): Unit = {
> qe.withCachedData match {
>   case c: SaveIntoDataSourceCommand
>   if c.query.isInstanceOf[InMemoryRelation] =>
> numTotalCachedHit += 1
>   case _ =>
> println(qe.withCachedData)
> }
>   }
> }
> spark.listenerManager.register(listener)
> val udf1 = udf({ (x: Int, y: Int) => x + y })
> val df = spark.range(0, 3).toDF("a")
>   .withColumn("b", udf1(col("a"), lit(10)))
> df.write.mode(SaveMode.Overwrite).jdbc(url1, "TEST.DROPTEST", 
> properties)
> assert(numTotalCachedHit == 1, "expected to be cached in jdbc")
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24779) Add sequence / map_concat / map_from_entries / an option in months_between UDF to disable rounding-off

2018-07-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24779:


Assignee: (was: Apache Spark)

> Add sequence / map_concat  / map_from_entries / an option in months_between 
> UDF to disable rounding-off
> ---
>
> Key: SPARK-24779
> URL: https://issues.apache.org/jira/browse/SPARK-24779
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add R versions of 
>  * sequence -SPARK-23927-
>  * map_concat   -SPARK-23936-
>  * map_from_entries   SPARK-23934
>  * an option in months_between UDF to disable rounding-off  -SPARK-23902-



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24779) Add sequence / map_concat / map_from_entries / an option in months_between UDF to disable rounding-off

2018-07-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551737#comment-16551737
 ] 

Apache Spark commented on SPARK-24779:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/21835

> Add sequence / map_concat  / map_from_entries / an option in months_between 
> UDF to disable rounding-off
> ---
>
> Key: SPARK-24779
> URL: https://issues.apache.org/jira/browse/SPARK-24779
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add R versions of 
>  * sequence -SPARK-23927-
>  * map_concat   -SPARK-23936-
>  * map_from_entries   SPARK-23934
>  * an option in months_between UDF to disable rounding-off  -SPARK-23902-



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24779) Add sequence / map_concat / map_from_entries / an option in months_between UDF to disable rounding-off

2018-07-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24779:


Assignee: Apache Spark

> Add sequence / map_concat  / map_from_entries / an option in months_between 
> UDF to disable rounding-off
> ---
>
> Key: SPARK-24779
> URL: https://issues.apache.org/jira/browse/SPARK-24779
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Major
>
> Add R versions of 
>  * sequence -SPARK-23927-
>  * map_concat   -SPARK-23936-
>  * map_from_entries   SPARK-23934
>  * an option in months_between UDF to disable rounding-off  -SPARK-23902-



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24340) Clean up non-shuffle disk block manager files following executor death

2018-07-21 Thread Jiang Xingbo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiang Xingbo resolved SPARK-24340.
--
Resolution: Fixed

> Clean up non-shuffle disk block manager files following executor death
> --
>
> Key: SPARK-24340
> URL: https://issues.apache.org/jira/browse/SPARK-24340
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> Currently we only clean up local folders when an application is removed; we 
> don't clean up non-shuffle files such as temp shuffle blocks, cached 
> RDD/broadcast blocks, and spill files, which can cause disk space leaks when 
> executors periodically die and are replaced.
> To avoid this source of disk space leaks, we can clean up an executor's disk 
> store files, except for shuffle index and data files, when the executor 
> finishes.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24340) Clean up non-shuffle disk block manager files following executor death

2018-07-21 Thread Jiang Xingbo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551714#comment-16551714
 ] 

Jiang Xingbo commented on SPARK-24340:
--

Thanks~

> Clean up non-shuffle disk block manager files following executor death
> --
>
> Key: SPARK-24340
> URL: https://issues.apache.org/jira/browse/SPARK-24340
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> Currently we only clean up local folders when an application is removed; we 
> don't clean up non-shuffle files such as temp shuffle blocks, cached 
> RDD/broadcast blocks, and spill files, which can cause disk space leaks when 
> executors periodically die and are replaced.
> To avoid this source of disk space leaks, we can clean up an executor's disk 
> store files, except for shuffle index and data files, when the executor 
> finishes.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24340) Clean up non-shuffle disk block manager files following executor death

2018-07-21 Thread Li Yuanjian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551687#comment-16551687
 ] 

Li Yuanjian commented on SPARK-24340:
-

cc [~jiangxb1987] I think this was resolved by your PR 21390; should we change 
the status?

> Clean up non-shuffle disk block manager files following executor death
> --
>
> Key: SPARK-24340
> URL: https://issues.apache.org/jira/browse/SPARK-24340
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> Currently we only clean up local folders when an application is removed; we 
> don't clean up non-shuffle files such as temp shuffle blocks, cached 
> RDD/broadcast blocks, and spill files, which can cause disk space leaks when 
> executors periodically die and are replaced.
> To avoid this source of disk space leaks, we can clean up an executor's disk 
> store files, except for shuffle index and data files, when the executor 
> finishes.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23231) Add doc for string indexer ordering to user guide (also to RFormula guide)

2018-07-21 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-23231:
-

Assignee: zhengruifeng

> Add doc for string indexer ordering to user guide (also to RFormula guide)
> --
>
> Key: SPARK-23231
> URL: https://issues.apache.org/jira/browse/SPARK-23231
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Nick Pentreath
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.4.0
>
>
> SPARK-20619 and SPARK-20899 added an ordering parameter to {{StringIndexer}} 
> and is also used internally in {{RFormula}}. Update the user guide for this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23231) Add doc for string indexer ordering to user guide (also to RFormula guide)

2018-07-21 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23231.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21792
[https://github.com/apache/spark/pull/21792]

> Add doc for string indexer ordering to user guide (also to RFormula guide)
> --
>
> Key: SPARK-23231
> URL: https://issues.apache.org/jira/browse/SPARK-23231
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Nick Pentreath
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.4.0
>
>
> SPARK-20619 and SPARK-20899 added an ordering parameter to {{StringIndexer}} 
> and is also used internally in {{RFormula}}. Update the user guide for this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24875) MulticlassMetrics should offer a more efficient way to compute count by label

2018-07-21 Thread Antoine Galataud (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551640#comment-16551640
 ] 

Antoine Galataud commented on SPARK-24875:
--

True, I was proposing this not as a replacement but as an option (e.g. 
setUseApproxStats on MulticlassMetrics) that wouldn’t be the default. 
Correctness is key, but having an approximate result is better than no result 
at all.
However, there may be better solutions than using countByValueApprox. Open to 
suggestions!
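
For context, a small sketch of what the approximate path could look like on the caller's side (my own example, not the MulticlassMetrics internals): countByValueApprox returns a PartialResult, and getFinalValue() blocks for at most the given timeout.
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("approx-label-counts").getOrCreate()
val sc = spark.sparkContext

// (prediction, label) pairs, as MulticlassMetrics consumes them.
val predictionAndLabels = sc.parallelize(Seq((1.0, 1.0), (0.0, 1.0), (1.0, 0.0), (0.0, 0.0)))

// Exact count by label: what MulticlassMetrics does today.
val exactCounts = predictionAndLabels.values.countByValue()

// Approximate count by label: waits at most 10 seconds, 95% confidence.
val approxCounts = predictionAndLabels.values
  .countByValueApprox(timeout = 10000L, confidence = 0.95)
  .getFinalValue()

println(exactCounts)
println(approxCounts)
{code}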

> MulticlassMetrics should offer a more efficient way to compute count by label
> -
>
> Key: SPARK-24875
> URL: https://issues.apache.org/jira/browse/SPARK-24875
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.1
>Reporter: Antoine Galataud
>Priority: Minor
>
> Currently _MulticlassMetrics_ calls _countByValue_() to get count by 
> class/label
> {code:java}
> private lazy val labelCountByClass: Map[Double, Long] = 
> predictionAndLabels.values.countByValue()
> {code}
> If input _RDD[(Double, Double)]_ is huge (which can be the case with a large 
> test dataset), it will lead to poor execution performance.
> One option could be to allow using _countByValueApprox_ (could require adding 
> an extra configuration param for MulticlassMetrics).
> Note: since there is no equivalent of _MulticlassMetrics_ in the new ML 
> library, I don't know how this could be ported there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23128) A new approach to do adaptive execution in Spark SQL

2018-07-21 Thread Li Yuanjian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551638#comment-16551638
 ] 

Li Yuanjian commented on SPARK-23128:
-

[~tgraves] Thanks for your comment. As far as I know, [~carsonwang] is still 
working on this and porting the patch to Spark 2.3. The patch has also been 
used by several teams in their internal production environments, so we hope it 
can be reviewed soon.

> A new approach to do adaptive execution in Spark SQL
> 
>
> Key: SPARK-23128
> URL: https://issues.apache.org/jira/browse/SPARK-23128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Carson Wang
>Priority: Major
> Attachments: AdaptiveExecutioninBaidu.pdf
>
>
> SPARK-9850 proposed the basic idea of adaptive execution in Spark. In 
> DAGScheduler, a new API is added to support submitting a single map stage.  
> The current implementation of adaptive execution in Spark SQL supports 
> changing the reducer number at runtime. An Exchange coordinator is used to 
> determine the number of post-shuffle partitions for a stage that needs to 
> fetch shuffle data from one or multiple stages. The current implementation 
> adds ExchangeCoordinator while we are adding Exchanges. However there are 
> some limitations. First, it may cause additional shuffles that may decrease 
> the performance. We can see this from EnsureRequirements rule when it adds 
> ExchangeCoordinator.  Secondly, it is not a good idea to add 
> ExchangeCoordinators while we are adding Exchanges because we don’t have a 
> global picture of all shuffle dependencies of a post-shuffle stage. I.e. for 
> 3 tables’ join in a single stage, the same ExchangeCoordinator should be used 
> in three Exchanges, but currently two separate ExchangeCoordinators will be 
> added. Thirdly, with the current framework it is not easy to implement other 
> features in adaptive execution flexibly like changing the execution plan and 
> handling skewed join at runtime.
> We'd like to introduce a new way to do adaptive execution in Spark SQL and 
> address the limitations. The idea is described at 
> [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24873) increase switch to shielding frequent interaction reports with yarn

2018-07-21 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24873.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21784
[https://github.com/apache/spark/pull/21784]

> increase switch to shielding frequent interaction reports with yarn
> ---
>
> Key: SPARK-24873
> URL: https://issues.apache.org/jira/browse/SPARK-24873
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, YARN
>Affects Versions: 2.4.0
>Reporter: JieFang.He
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: pic.jpg
>
>
> There are too many frequent interaction reports when I use the spark-shell 
> command, which interferes with my input, so I think we need to add a switch 
> to suppress these frequent interaction reports with YARN.
>  
> !pic.jpg!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24873) increase switch to shielding frequent interaction reports with yarn

2018-07-21 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24873:


Assignee: Yuming Wang

> increase switch to shielding frequent interaction reports with yarn
> ---
>
> Key: SPARK-24873
> URL: https://issues.apache.org/jira/browse/SPARK-24873
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, YARN
>Affects Versions: 2.4.0
>Reporter: JieFang.He
>Assignee: Yuming Wang
>Priority: Major
> Attachments: pic.jpg
>
>
> There are too many frequent interaction reports when I use the spark-shell 
> command, which interferes with my input, so I think we need to add a switch 
> to suppress these frequent interaction reports with YARN.
>  
> !pic.jpg!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24862) Spark Encoder is not consistent to scala case class semantic for multiple argument lists

2018-07-21 Thread Antonio Murgia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551582#comment-16551582
 ] 

Antonio Murgia commented on SPARK-24862:


We can check whether {{y}} is also synthesized as a field, and if it is we can 
access it through reflection. About the inconsistency, I actually don’t know; 
maybe you are right and it may cause issues. If that is the case, we might 
throw an exception earlier (when generating the encoder) instead of throwing it 
when the first action is called. I am just sketching out ideas anyway.
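
A small sketch of the mismatch under discussion (my own illustration, plain Scala): only the first argument list of a case class contributes to the Product and to the synthesized fields, so {{y}} is generally not recoverable by reflection.
{code:java}
case class Multi(x: String)(y: Int)

object MultiArityCheck extends App {
  val m = Multi("a")(1)

  // Only the first parameter list is part of the product.
  println(m.productArity) // 1
  println(m.x)            // a

  // y is not stored as a field (unless the class body uses it),
  // so reflection cannot find it here.
  println(m.getClass.getDeclaredFields.map(_.getName).mkString(", ")) // x
}
{code}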

> Spark Encoder is not consistent to scala case class semantic for multiple 
> argument lists
> 
>
> Key: SPARK-24862
> URL: https://issues.apache.org/jira/browse/SPARK-24862
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Antonio Murgia
>Priority: Major
>
> The Spark Encoder is not consistent with Scala case class semantics for 
> multiple argument lists.
> For example if I create a case class with multiple constructor argument lists:
> {code:java}
> case class Multi(x: String)(y: Int){code}
> Scala creates a product with arity 1, while if I apply 
> {code:java}
> Encoders.product[Multi].schema.printTreeString{code}
> I get
> {code:java}
> root
> |-- x: string (nullable = true)
> |-- y: integer (nullable = false){code}
> That is not consistent and leads to:
> {code:java}
> Error while encoding: java.lang.RuntimeException: Couldn't find y on class 
> it.enel.next.platform.service.events.common.massive.immutable.Multi
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).x, true) AS x#0
> assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).y AS y#1
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> Couldn't find y on class 
> it.enel.next.platform.service.events.common.massive.immutable.Multi
> staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
> fromString, assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).x, true) AS x#0
> assertnotnull(assertnotnull(input[0, 
> it.enel.next.platform.service.events.common.massive.immutable.Multi, 
> true])).y AS y#1
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
> at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464)
> at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:464)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:296)
> at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:464)
> at 
> it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply$mcV$sp(ParquetQueueSuite.scala:48)
> at 
> it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46)
> at 
> it.enel.next.platform.service.events.common.massive.immutable.ParquetQueueSuite$$anonfun$1.apply(ParquetQueueSuite.scala:46)
> at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> at org.scalatest.Transformer.apply(Transformer.scala:22)
> at org.scalatest.Transformer.apply(Transformer.scala:20)
> at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1682)
> at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196)
> at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1685)
> at 
> org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1679)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1692)
> at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
> at org.scalatest.FlatSpecLike$class.runTest(FlatSpecLike.scala:1692)
> at org.scalatest.FlatSpec.runTest(FlatSpec.scala:1685)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750)
> at 
> org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1750)
> at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
> at 
> 

[jira] [Commented] (SPARK-22814) JDBC support date/timestamp type as partitionColumn

2018-07-21 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551580#comment-16551580
 ] 

Apache Spark commented on SPARK-22814:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/21834

> JDBC support date/timestamp type as partitionColumn
> ---
>
> Key: SPARK-22814
> URL: https://issues.apache.org/jira/browse/SPARK-22814
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2, 2.2.1
>Reporter: Yuechen Chen
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In spark, you can partition MySQL queries by partitionColumn.
> val df = (spark.read.jdbc(url=jdbcUrl,
> table="employees",
> columnName="emp_no",
> lowerBound=1L,
> upperBound=10L,
> numPartitions=100,
> connectionProperties=connectionProperties))
> display(df)
> But partitionColumn must be a numeric column from the table.
> However, there are lots of tables that have no primary key but do have 
> date/timestamp indexes.
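
For reference, a sketch of one workaround available today (my example, not from the ticket): the jdbc overload that takes an array of predicates lets you partition on a date/timestamp column, at the cost of spelling out the ranges yourself. The connection details below are placeholders.
{code:java}
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("jdbc-date-partitions").getOrCreate()

// Placeholder connection details; replace with a real JDBC endpoint and credentials.
val jdbcUrl = "jdbc:mysql://localhost:3306/employees"
val connectionProperties = new Properties()
connectionProperties.put("user", "user")
connectionProperties.put("password", "password")

// One partition per predicate; each string becomes a WHERE clause for that partition.
val predicates = Array(
  "hire_date >= '1990-01-01' AND hire_date < '1995-01-01'",
  "hire_date >= '1995-01-01' AND hire_date < '2000-01-01'",
  "hire_date >= '2000-01-01'"
)

val df = spark.read.jdbc(jdbcUrl, "employees", predicates, connectionProperties)
println(df.rdd.getNumPartitions) // 3
{code}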



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22814) JDBC support date/timestamp type as partitionColumn

2018-07-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22814:


Assignee: (was: Apache Spark)

> JDBC support date/timestamp type as partitionColumn
> ---
>
> Key: SPARK-22814
> URL: https://issues.apache.org/jira/browse/SPARK-22814
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2, 2.2.1
>Reporter: Yuechen Chen
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In spark, you can partition MySQL queries by partitionColumn.
> val df = (spark.read.jdbc(url=jdbcUrl,
> table="employees",
> columnName="emp_no",
> lowerBound=1L,
> upperBound=10L,
> numPartitions=100,
> connectionProperties=connectionProperties))
> display(df)
> But partitionColumn must be a numeric column from the table.
> However, there are lots of tables that have no primary key but do have 
> date/timestamp indexes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22814) JDBC support date/timestamp type as partitionColumn

2018-07-21 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22814:


Assignee: Apache Spark

> JDBC support date/timestamp type as partitionColumn
> ---
>
> Key: SPARK-22814
> URL: https://issues.apache.org/jira/browse/SPARK-22814
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2, 2.2.1
>Reporter: Yuechen Chen
>Assignee: Apache Spark
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> In spark, you can partition MySQL queries by partitionColumn.
> val df = (spark.read.jdbc(url=jdbcUrl,
> table="employees",
> columnName="emp_no",
> lowerBound=1L,
> upperBound=10L,
> numPartitions=100,
> connectionProperties=connectionProperties))
> display(df)
> But partitionColumn must be a numeric column from the table.
> However, there are lots of tables that have no primary key but do have 
> date/timestamp indexes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21097) Dynamic allocation will preserve cached data

2018-07-21 Thread James (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551572#comment-16551572
 ] 

James commented on SPARK-21097:
---

Hi [~bradkaiser]

 

I see that you consider memory space when preserving the cache. Have you also 
considered load balance?

For example, if you want to move cached data from executor C to executors A 
and B, but A has a higher CPU load than B, should we prefer B as the candidate 
when there will be future computation on the cached data from C?

 

Thanks

> Dynamic allocation will preserve cached data
> 
>
> Key: SPARK-21097
> URL: https://issues.apache.org/jira/browse/SPARK-21097
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Scheduler, Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Brad
>Priority: Major
> Attachments: Preserving Cached Data with Dynamic Allocation.pdf
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config. Now when an executor reaches its configured idle timeout, 
> instead of just killing it on the spot, we will stop sending it new tasks, 
> replicate all of its rdd blocks onto other executors, and then kill it. If 
> there is an issue while we replicate the data, like an error, it takes too 
> long, or there isn't enough space, then we will fall back to the original 
> behavior and drop the data and kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also, since it will be completely opt-in, it is 
> unlikely to cause problems for other use cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org