[jira] [Created] (SPARK-21053) Number overflow on agg function of Dataframe

2017-06-10 Thread DUC LIEM NGUYEN (JIRA)
DUC LIEM NGUYEN created SPARK-21053:
---

 Summary: Number overflow on agg function of Dataframe
 Key: SPARK-21053
 URL: https://issues.apache.org/jira/browse/SPARK-21053
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
 Environment: Databricks Community version
Reporter: DUC LIEM NGUYEN


Using the average aggregation function on a large data set returns NaN 
instead of the desired numerical value, even though the values range between 0 and 1.
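
A minimal spark-shell sketch of the kind of aggregation being described (the data and 
column name below are illustrative assumptions, not taken from the report):

{code}
import org.apache.spark.sql.functions.avg
import spark.implicits._

// Illustrative data only: every value lies in [0, 1]; the report is that on a large
// data set the aggregated average comes back as NaN instead of a value in that range.
val df = (1 to 1000000).map(i => (i % 100) / 100.0).toDF("score")
df.agg(avg("score")).show()
{code}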







[jira] [Commented] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER

2017-06-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045783#comment-16045783
 ] 

Apache Spark commented on SPARK-20427:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/18266

> Issue with Spark interpreting Oracle datatype NUMBER
> 
>
> Key: SPARK-20427
> URL: https://issues.apache.org/jira/browse/SPARK-20427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Alexander Andrushenko
>
> In Oracle there is a data type NUMBER. When defining a field of type 
> NUMBER in a table, the field has two components, precision and scale.
> For example, NUMBER(p,s) has precision p and scale s. 
> Precision can range from 1 to 38.
> Scale can range from -84 to 127.
> When reading such a field, Spark can create numbers with precision exceeding 
> 38. In our case it created fields with precision 44,
> calculated as the sum of the precision (in our case 34 digits) and the scale (10):
> "...java.lang.IllegalArgumentException: requirement failed: Decimal precision 
> 44 exceeds max precision 38...".
> The result was that a data frame could be read from a table in one schema but 
> could not be inserted into the identical table in another schema.
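
A hedged sketch of the JDBC round trip being described (the connection string, credentials, 
and table names are placeholders, not taken from the report):

{code}
import java.util.Properties

val props = new Properties()
props.setProperty("user", "user")          // placeholder credentials
props.setProperty("password", "password")

val url = "jdbc:oracle:thin:@//dbhost:1521/service"   // placeholder connection string

// Reading an Oracle table with a NUMBER(p, s) column; per this report, Spark can map it to
// a DecimalType whose precision ends up as p + s, which can exceed the maximum of 38.
val df = spark.read.jdbc(url, "SCHEMA_A.SOME_TABLE", props)

// Inserting into the identical table in another schema is where the reporter hit
// "Decimal precision 44 exceeds max precision 38".
df.write.mode("append").jdbc(url, "SCHEMA_B.SOME_TABLE", props)
{code}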






[jira] [Assigned] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER

2017-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20427:


Assignee: Apache Spark

> Issue with Spark interpreting Oracle datatype NUMBER
> 
>
> Key: SPARK-20427
> URL: https://issues.apache.org/jira/browse/SPARK-20427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Alexander Andrushenko
>Assignee: Apache Spark
>
> In Oracle there is a data type NUMBER. When defining a field of type 
> NUMBER in a table, the field has two components, precision and scale.
> For example, NUMBER(p,s) has precision p and scale s. 
> Precision can range from 1 to 38.
> Scale can range from -84 to 127.
> When reading such a field, Spark can create numbers with precision exceeding 
> 38. In our case it created fields with precision 44,
> calculated as the sum of the precision (in our case 34 digits) and the scale (10):
> "...java.lang.IllegalArgumentException: requirement failed: Decimal precision 
> 44 exceeds max precision 38...".
> The result was that a data frame could be read from a table in one schema but 
> could not be inserted into the identical table in another schema.






[jira] [Assigned] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER

2017-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20427:


Assignee: (was: Apache Spark)

> Issue with Spark interpreting Oracle datatype NUMBER
> 
>
> Key: SPARK-20427
> URL: https://issues.apache.org/jira/browse/SPARK-20427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Alexander Andrushenko
>
> In Oracle there is a data type NUMBER. When defining a field of type 
> NUMBER in a table, the field has two components, precision and scale.
> For example, NUMBER(p,s) has precision p and scale s. 
> Precision can range from 1 to 38.
> Scale can range from -84 to 127.
> When reading such a field, Spark can create numbers with precision exceeding 
> 38. In our case it created fields with precision 44,
> calculated as the sum of the precision (in our case 34 digits) and the scale (10):
> "...java.lang.IllegalArgumentException: requirement failed: Decimal precision 
> 44 exceeds max precision 38...".
> The result was that a data frame could be read from a table in one schema but 
> could not be inserted into the identical table in another schema.






[jira] [Commented] (SPARK-21043) Add unionByName API to Dataset

2017-06-10 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045773#comment-16045773
 ] 

Takeshi Yamamuro commented on SPARK-21043:
--

Thank you for pinging me! Yeah, I'll give it a try.

> Add unionByName API to Dataset
> --
>
> Key: SPARK-21043
> URL: https://issues.apache.org/jira/browse/SPARK-21043
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>
> It would be useful to add unionByName which resolves columns by name, in 
> addition to the existing union (which resolves by position).
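
A short spark-shell sketch of the difference (the DataFrames are made up, and unionByName 
does not exist yet at this point, so the last line only emulates name-based resolution):

{code}
import spark.implicits._

val df1 = Seq((1, 2)).toDF("a", "b")
val df2 = Seq((30, 40)).toDF("b", "a")

// Existing behavior: union resolves columns by position,
// so df2's "b" values end up under df1's "a" column.
df1.union(df2).show()

// Name-based resolution, emulated today by reordering the columns explicitly;
// the proposed unionByName would do this automatically.
df1.union(df2.select("a", "b")).show()
{code}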






[jira] [Commented] (SPARK-21052) Add hash map metrics to join

2017-06-10 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045769#comment-16045769
 ] 

Liang-Chi Hsieh commented on SPARK-21052:
-

I'll submit a PR for this soon.

> Add hash map metrics to join
> 
>
> Key: SPARK-21052
> URL: https://issues.apache.org/jira/browse/SPARK-21052
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> We should add an avg hash map probe metric to the join operator and report it in the UI.






[jira] [Assigned] (SPARK-21051) Add hash map metrics to aggregate

2017-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21051:


Assignee: Apache Spark

> Add hash map metrics to aggregate
> -
>
> Key: SPARK-21051
> URL: https://issues.apache.org/jira/browse/SPARK-21051
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> We should add an avg hash map probe metric to the aggregate operator and report 
> it in the UI.






[jira] [Updated] (SPARK-21052) Add hash map metrics to join

2017-06-10 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-21052:

Description: We should add an avg hash map probe metric to the join operator and 
report it in the UI.

> Add hash map metrics to join
> 
>
> Key: SPARK-21052
> URL: https://issues.apache.org/jira/browse/SPARK-21052
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> We should add an avg hash map probe metric to the join operator and report it in the UI.






[jira] [Assigned] (SPARK-21051) Add hash map metrics to aggregate

2017-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21051:


Assignee: (was: Apache Spark)

> Add hash map metrics to aggregate
> -
>
> Key: SPARK-21051
> URL: https://issues.apache.org/jira/browse/SPARK-21051
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> We should add an avg hash map probe metric to the aggregate operator and report 
> it in the UI.






[jira] [Commented] (SPARK-21051) Add hash map metrics to aggregate

2017-06-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045768#comment-16045768
 ] 

Apache Spark commented on SPARK-21051:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/18258

> Add hash map metrics to aggregate
> -
>
> Key: SPARK-21051
> URL: https://issues.apache.org/jira/browse/SPARK-21051
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> We should add an avg hash map probe metric to the aggregate operator and report 
> it in the UI.






[jira] [Updated] (SPARK-21051) Add hash map metrics to aggregate

2017-06-10 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-21051:

Description: 
We should add an avg hash map probe metric to the aggregate operator and report it 
in the UI.


> Add hash map metrics to aggregate
> -
>
> Key: SPARK-21051
> URL: https://issues.apache.org/jira/browse/SPARK-21051
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> We should add an avg hash map probe metric to the aggregate operator and report 
> it in the UI.






[jira] [Created] (SPARK-21052) Add hash map metrics to join

2017-06-10 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-21052:
---

 Summary: Add hash map metrics to join
 Key: SPARK-21052
 URL: https://issues.apache.org/jira/browse/SPARK-21052
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.2.0
Reporter: Liang-Chi Hsieh









[jira] [Created] (SPARK-21051) Add hash map metrics to aggregate

2017-06-10 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-21051:
---

 Summary: Add hash map metrics to aggregate
 Key: SPARK-21051
 URL: https://issues.apache.org/jira/browse/SPARK-21051
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.2.0
Reporter: Liang-Chi Hsieh









[jira] [Assigned] (SPARK-21050) ml word2vec write has overflow issue in calculating numPartitions

2017-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21050:


Assignee: Apache Spark  (was: Joseph K. Bradley)

> ml word2vec write has overflow issue in calculating numPartitions
> -
>
> Key: SPARK-21050
> URL: https://issues.apache.org/jira/browse/SPARK-21050
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> The method calculateNumberOfPartitions() uses Int, not Long (unlike the MLlib 
> version), so it is very easy to have an overflow when calculating the number 
> of partitions for ML persistence.
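
A small sketch of the Int-vs-Long arithmetic issue being described (the sizes are made-up 
numbers and this is not the actual Word2Vec code, just the overflow pattern):

{code}
// With Int arithmetic an intermediate product can exceed Int.MaxValue (~2.1e9) and wrap
// around, which is how a partition-count calculation can silently go wrong.
val numWords = 10000000          // assumed vocabulary size
val vectorSize = 1000            // assumed vector dimension

val bytesAsInt  = numWords * vectorSize * 4          // overflows: wraps to a meaningless value
val bytesAsLong = numWords.toLong * vectorSize * 4L  // correct with Long arithmetic

println(s"Int: $bytesAsInt, Long: $bytesAsLong")
{code}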






[jira] [Assigned] (SPARK-21050) ml word2vec write has overflow issue in calculating numPartitions

2017-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21050:


Assignee: Joseph K. Bradley  (was: Apache Spark)

> ml word2vec write has overflow issue in calculating numPartitions
> -
>
> Key: SPARK-21050
> URL: https://issues.apache.org/jira/browse/SPARK-21050
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> The method calculateNumberOfPartitions() uses Int, not Long (unlike the MLlib 
> version), so it is very easy to have an overflow when calculating the number 
> of partitions for ML persistence.






[jira] [Commented] (SPARK-21050) ml word2vec write has overflow issue in calculating numPartitions

2017-06-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045763#comment-16045763
 ] 

Apache Spark commented on SPARK-21050:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/18265

> ml word2vec write has overflow issue in calculating numPartitions
> -
>
> Key: SPARK-21050
> URL: https://issues.apache.org/jira/browse/SPARK-21050
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> The method calculateNumberOfPartitions() uses Int, not Long (unlike the MLlib 
> version), so it is very easy to have an overflow when calculating the number 
> of partitions for ML persistence.






[jira] [Created] (SPARK-21050) ml word2vec write has overflow issue in calculating numPartitions

2017-06-10 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-21050:
-

 Summary: ml word2vec write has overflow issue in calculating 
numPartitions
 Key: SPARK-21050
 URL: https://issues.apache.org/jira/browse/SPARK-21050
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.2.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley


The method calculateNumberOfPartitions() uses Int, not Long (unlike the MLlib 
version), so it is very easy to have an overflow when calculating the number of 
partitions for ML persistence.






[jira] [Commented] (SPARK-20877) Shorten test sets to run on CRAN

2017-06-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045732#comment-16045732
 ] 

Apache Spark commented on SPARK-20877:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/18264

> Shorten test sets to run on CRAN
> 
>
> Key: SPARK-20877
> URL: https://issues.apache.org/jira/browse/SPARK-20877
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.2.0
>
>







[jira] [Updated] (SPARK-20877) Shorten test sets to run on CRAN

2017-06-10 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-20877:
-
Summary: Shorten test sets to run on CRAN  (was: Investigate if tests will 
time out on CRAN)

> Shorten test sets to run on CRAN
> 
>
> Key: SPARK-20877
> URL: https://issues.apache.org/jira/browse/SPARK-20877
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.2.0
>
>







[jira] [Closed] (SPARK-21044) Add `RemoveInvalidRange` optimizer

2017-06-10 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-21044.
-
Resolution: Invalid

> Add `RemoveInvalidRange` optimizer
> --
>
> Key: SPARK-21044
> URL: https://issues.apache.org/jira/browse/SPARK-21044
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>
> This issue aims to add an optimizer rule that removes an invalid `Range` operator from 
> the beginning. There are two cases of invalidity:
> 1. The `start` and `end` values are equal.
> 2. The sign of `step` does not match the direction from `start` to `end`. For this 
> case, SPARK-21041 is reported as a bug, too.
> *BEFORE*
> {code}
> scala> spark.range(0,10,-1).explain
> == Physical Plan ==
> *Range (0, 10, step=-1, splits=8)
> scala> spark.range(0,0,-1).explain
> == Physical Plan ==
> *Range (0, 0, step=-1, splits=8)
> scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 
> 2, 1).collect
> res2: Array[Long] = Array(9223372036854775804, 9223372036854775805, 
> 9223372036854775806)
> {code}
> *AFTER*
> {code}
> scala> spark.range(0,10,-1).explain
> == Physical Plan ==
> LocalTableScan , [id#0L]
> scala> spark.range(0,0,-1).explain
> == Physical Plan ==
> LocalTableScan , [id#4L]
> scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 
> 2, 1).collect
> res2: Array[Long] = Array()
> {code}






[jira] [Updated] (SPARK-17642) Support DESC FORMATTED TABLE COLUMN command to show column-level statistics

2017-06-10 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-17642:
-
Summary: Support DESC FORMATTED TABLE COLUMN command to show column-level 
statistics  (was: support DESC FORMATTED TABLE COLUMN command to show 
column-level statistics)

> Support DESC FORMATTED TABLE COLUMN command to show column-level statistics
> ---
>
> Key: SPARK-17642
> URL: https://issues.apache.org/jira/browse/SPARK-17642
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> Support the DESC (EXTENDED | FORMATTED)? TABLE COLUMN command.
> Support the DESC FORMATTED TABLE COLUMN command to show column-level statistics.
> We should resolve this JIRA after column-level statistics are supported.
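
For illustration, the proposed command shape invoked from Scala (table and column names 
are placeholders; the syntax is not available until this sub-task is implemented):

{code}
// Proposed in this ticket: per-column statistics via DESC FORMATTED <table> <column>.
spark.sql("DESC FORMATTED some_table some_column").show(truncate = false)
{code}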






[jira] [Assigned] (SPARK-21039) Use treeAggregate instead of aggregate in DataFrame.stat.bloomFilter

2017-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21039:


Assignee: (was: Apache Spark)

> Use treeAggregate instead of aggregate in DataFrame.stat.bloomFilter
> 
>
> Key: SPARK-21039
> URL: https://issues.apache.org/jira/browse/SPARK-21039
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Lovasoa
>
> Currently, DataFrame.stat.bloomFilter uses RDD.aggregate, which means that 
> the bloom filters received for each partition of data are merged in the 
> driver. The cost of this operation can be very high if the bloom filters are 
> large. It would be nice if it used RDD.treeAggregate instead, in order to 
> parallelize the operation of merging the bloom filters.
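
A small spark-shell sketch of the general difference between the two aggregation strategies 
(the Long sum below is only a stand-in for the bloom filter merge, to show where combining 
happens):

{code}
val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 200)

// aggregate: every partition's partial result is shipped to the driver and merged there.
val mergedOnDriver = rdd.aggregate(0L)((acc, x) => acc + x, _ + _)

// treeAggregate: partial results are first merged in a tree of intermediate tasks
// (default depth 2), so the driver only has to merge a handful of results.
val mergedInTree = rdd.treeAggregate(0L)((acc, x) => acc + x, _ + _, depth = 2)
{code}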






[jira] [Assigned] (SPARK-21039) Use treeAggregate instead of aggregate in DataFrame.stat.bloomFilter

2017-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21039:


Assignee: Apache Spark

> Use treeAggregate instead of aggregate in DataFrame.stat.bloomFilter
> 
>
> Key: SPARK-21039
> URL: https://issues.apache.org/jira/browse/SPARK-21039
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Lovasoa
>Assignee: Apache Spark
>
> Currently, DataFrame.stat.bloomFilter uses RDD.aggregate, which means that 
> the bloom filters received for each partition of data are merged in the 
> driver. The cost of this operation can be very high if the bloom filters are 
> large. It would be nice if it used RDD.treeAggregate instead, in order to 
> parallelize the operation of merging the bloom filters.






[jira] [Commented] (SPARK-21039) Use treeAggregate instead of aggregate in DataFrame.stat.bloomFilter

2017-06-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045650#comment-16045650
 ] 

Apache Spark commented on SPARK-21039:
--

User 'rishabhbhardwaj' has created a pull request for this issue:
https://github.com/apache/spark/pull/18263

> Use treeAggregate instead of aggregate in DataFrame.stat.bloomFilter
> 
>
> Key: SPARK-21039
> URL: https://issues.apache.org/jira/browse/SPARK-21039
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Lovasoa
>
> Currently, DataFrame.stat.bloomFilter uses RDD.aggregate, which means that 
> the bloom filters received for each partition of data are merged in the 
> driver. The cost of this operation can be very high if the bloom filters are 
> large. It would be nice if it used RDD.treeAggregate instead, in order to 
> parallelize the operation of merging the bloom filters.






[jira] [Commented] (SPARK-21043) Add unionByName API to Dataset

2017-06-10 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045630#comment-16045630
 ] 

Xiao Li commented on SPARK-21043:
-

[~maropu] Do you want to give it a try?

> Add unionByName API to Dataset
> --
>
> Key: SPARK-21043
> URL: https://issues.apache.org/jira/browse/SPARK-21043
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>
> It would be useful to add unionByName which resolves columns by name, in 
> addition to the existing union (which resolves by position).






[jira] [Commented] (SPARK-21045) Spark executor blocked instead of throwing exception because exception occur when python worker send exception info to Java Gateway

2017-06-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045604#comment-16045604
 ] 

Apache Spark commented on SPARK-21045:
--

User 'dataknocker' has created a pull request for this issue:
https://github.com/apache/spark/pull/18262

> Spark executor blocked instead of throwing exception because exception occur 
> when python worker send exception info to Java Gateway
> ---
>
> Key: SPARK-21045
> URL: https://issues.apache.org/jira/browse/SPARK-21045
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1, 2.0.2, 2.1.1
>Reporter: Joshuawangzj
>
> My PySpark program always blocks in a production YARN cluster. I ran jstack 
> and found:
> {code}
> "Executor task launch worker for task 0" #60 daemon prio=5 os_prio=31 
> tid=0x7fb2f44e3000 nid=0xa003 runnable [0x000123b4a000]
>java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:170)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> - locked <0x0007acab1c98> (a java.io.BufferedInputStream)
> at java.io.DataInputStream.readInt(DataInputStream.java:387)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:190)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234)
> at 
> org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> It is blocking on a socket read. I viewed the log on the blocking executor and found 
> this error:
> {code}
> Traceback (most recent call last):
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 178, in 
> main
> write_with_length(traceback.format_exc().encode("utf-8"), outfile)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 618: 
> ordinal not in range(128)
> {code}
> Finally I found the problem:
> {code:title=worker.py|borderStyle=solid}
> # 178 line in spark 2.1.1
> except Exception:
> try:
> write_int(SpecialLengths.PYTHON_EXCEPTION_THROWN, outfile)
> write_with_length(traceback.format_exc().encode("utf-8"), outfile)
> except IOError:
> # JVM close the socket
> pass
> except Exception:
> # Write the error to stderr if it happened while serializing
> print("PySpark worker failed with exception:", file=sys.stderr)
> print(traceback.format_exc(), file=sys.stderr)
> {code}
> When write_with_length(traceback.format_exc().encode("utf-8"), outfile) raises an 
> exception such as UnicodeDecodeError, the Python worker can't send the trace 
> info, but once the PythonRDD gets PYTHON_EXCEPTION_THROWN, it expects to read the 
> trace info length next. So it blocks.
> {code:title=PythonRDD.scala|borderStyle=solid}
> # 190 line in spark 2.1.1
> case SpecialLengths.PYTHON_EXCEPTION_THROWN =>
>  // Signals that an exception has been thrown in python
>  val exLength = stream.readInt()  // It is possible to be blocked
> {code}
> {color:red}
> We can trigger the bug with a simple program:
> {color}
> {code:title=test.py|borderStyle=solid}
> spark = SparkSession.builder.master('local').getOrCreate()
> rdd = spark.sparkContext.parallelize(['中']).map(lambda x: 
> x.encode("utf8"))
> rdd.collect()
> {code}






[jira] [Commented] (SPARK-21045) Spark executor blocked instead of throwing exception because exception occur when python worker send exception info to Java Gateway

2017-06-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045592#comment-16045592
 ] 

Apache Spark commented on SPARK-21045:
--

User 'dataknocker' has created a pull request for this issue:
https://github.com/apache/spark/pull/18261

> Spark executor blocked instead of throwing exception because exception occur 
> when python worker send exception info to Java Gateway
> ---
>
> Key: SPARK-21045
> URL: https://issues.apache.org/jira/browse/SPARK-21045
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1, 2.0.2, 2.1.1
>Reporter: Joshuawangzj
>
> My PySpark program always blocks in a production YARN cluster. I ran jstack 
> and found:
> {code}
> "Executor task launch worker for task 0" #60 daemon prio=5 os_prio=31 
> tid=0x7fb2f44e3000 nid=0xa003 runnable [0x000123b4a000]
>java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:170)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> - locked <0x0007acab1c98> (a java.io.BufferedInputStream)
> at java.io.DataInputStream.readInt(DataInputStream.java:387)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:190)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234)
> at 
> org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> It is blocking on a socket read. I viewed the log on the blocking executor and found 
> this error:
> {code}
> Traceback (most recent call last):
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 178, in 
> main
> write_with_length(traceback.format_exc().encode("utf-8"), outfile)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 618: 
> ordinal not in range(128)
> {code}
> Finally I found the problem:
> {code:title=worker.py|borderStyle=solid}
> # 178 line in spark 2.1.1
> except Exception:
> try:
> write_int(SpecialLengths.PYTHON_EXCEPTION_THROWN, outfile)
> write_with_length(traceback.format_exc().encode("utf-8"), outfile)
> except IOError:
> # JVM close the socket
> pass
> except Exception:
> # Write the error to stderr if it happened while serializing
> print("PySpark worker failed with exception:", file=sys.stderr)
> print(traceback.format_exc(), file=sys.stderr)
> {code}
> When write_with_length(traceback.format_exc().encode("utf-8"), outfile) raises an 
> exception such as UnicodeDecodeError, the Python worker can't send the trace 
> info, but once the PythonRDD gets PYTHON_EXCEPTION_THROWN, it expects to read the 
> trace info length next. So it blocks.
> {code:title=PythonRDD.scala|borderStyle=solid}
> # 190 line in spark 2.1.1
> case SpecialLengths.PYTHON_EXCEPTION_THROWN =>
>  // Signals that an exception has been thrown in python
>  val exLength = stream.readInt()  // It is possible to be blocked
> {code}
> {color:red}
> We can trigger the bug with a simple program:
> {color}
> {code:title=test.py|borderStyle=solid}
> spark = SparkSession.builder.master('local').getOrCreate()
> rdd = spark.sparkContext.parallelize(['中']).map(lambda x: 
> x.encode("utf8"))
> rdd.collect()
> {code}






[jira] [Assigned] (SPARK-21045) Spark executor blocked instead of throwing exception because exception occur when python worker send exception info to Java Gateway

2017-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21045:


Assignee: Apache Spark

> Spark executor blocked instead of throwing exception because exception occur 
> when python worker send exception info to Java Gateway
> ---
>
> Key: SPARK-21045
> URL: https://issues.apache.org/jira/browse/SPARK-21045
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1, 2.0.2, 2.1.1
>Reporter: Joshuawangzj
>Assignee: Apache Spark
>
> My PySpark program always blocks in a production YARN cluster. I ran jstack 
> and found:
> {code}
> "Executor task launch worker for task 0" #60 daemon prio=5 os_prio=31 
> tid=0x7fb2f44e3000 nid=0xa003 runnable [0x000123b4a000]
>java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:170)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> - locked <0x0007acab1c98> (a java.io.BufferedInputStream)
> at java.io.DataInputStream.readInt(DataInputStream.java:387)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:190)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234)
> at 
> org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> It is blocking on a socket read. I viewed the log on the blocking executor and found 
> this error:
> {code}
> Traceback (most recent call last):
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 178, in 
> main
> write_with_length(traceback.format_exc().encode("utf-8"), outfile)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 618: 
> ordinal not in range(128)
> {code}
> Finally I found the problem:
> {code:title=worker.py|borderStyle=solid}
> # 178 line in spark 2.1.1
> except Exception:
> try:
> write_int(SpecialLengths.PYTHON_EXCEPTION_THROWN, outfile)
> write_with_length(traceback.format_exc().encode("utf-8"), outfile)
> except IOError:
> # JVM close the socket
> pass
> except Exception:
> # Write the error to stderr if it happened while serializing
> print("PySpark worker failed with exception:", file=sys.stderr)
> print(traceback.format_exc(), file=sys.stderr)
> {code}
> When write_with_length(traceback.format_exc().encode("utf-8"), outfile) raises an 
> exception such as UnicodeDecodeError, the Python worker can't send the trace 
> info, but once the PythonRDD gets PYTHON_EXCEPTION_THROWN, it expects to read the 
> trace info length next. So it blocks.
> {code:title=PythonRDD.scala|borderStyle=solid}
> # 190 line in spark 2.1.1
> case SpecialLengths.PYTHON_EXCEPTION_THROWN =>
>  // Signals that an exception has been thrown in python
>  val exLength = stream.readInt()  // It is possible to be blocked
> {code}
> {color:red}
> We can trigger the bug with a simple program:
> {color}
> {code:title=test.py|borderStyle=solid}
> spark = SparkSession.builder.master('local').getOrCreate()
> rdd = spark.sparkContext.parallelize(['中']).map(lambda x: 
> x.encode("utf8"))
> rdd.collect()
> {code}






[jira] [Assigned] (SPARK-21045) Spark executor blocked instead of throwing exception because exception occur when python worker send exception info to Java Gateway

2017-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21045:


Assignee: (was: Apache Spark)

> Spark executor blocked instead of throwing exception because exception occur 
> when python worker send exception info to Java Gateway
> ---
>
> Key: SPARK-21045
> URL: https://issues.apache.org/jira/browse/SPARK-21045
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1, 2.0.2, 2.1.1
>Reporter: Joshuawangzj
>
> My PySpark program always blocks in a production YARN cluster. I ran jstack 
> and found:
> {code}
> "Executor task launch worker for task 0" #60 daemon prio=5 os_prio=31 
> tid=0x7fb2f44e3000 nid=0xa003 runnable [0x000123b4a000]
>java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:170)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> - locked <0x0007acab1c98> (a java.io.BufferedInputStream)
> at java.io.DataInputStream.readInt(DataInputStream.java:387)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:190)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234)
> at 
> org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> It is blocking on a socket read. I viewed the log on the blocking executor and found 
> this error:
> {code}
> Traceback (most recent call last):
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 178, in 
> main
> write_with_length(traceback.format_exc().encode("utf-8"), outfile)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 618: 
> ordinal not in range(128)
> {code}
> Finally I found the problem:
> {code:title=worker.py|borderStyle=solid}
> # 178 line in spark 2.1.1
> except Exception:
> try:
> write_int(SpecialLengths.PYTHON_EXCEPTION_THROWN, outfile)
> write_with_length(traceback.format_exc().encode("utf-8"), outfile)
> except IOError:
> # JVM close the socket
> pass
> except Exception:
> # Write the error to stderr if it happened while serializing
> print("PySpark worker failed with exception:", file=sys.stderr)
> print(traceback.format_exc(), file=sys.stderr)
> {code}
> When write_with_length(traceback.format_exc().encode("utf-8"), outfile) raises an 
> exception such as UnicodeDecodeError, the Python worker can't send the trace 
> info, but once the PythonRDD gets PYTHON_EXCEPTION_THROWN, it expects to read the 
> trace info length next. So it blocks.
> {code:title=PythonRDD.scala|borderStyle=solid}
> # 190 line in spark 2.1.1
> case SpecialLengths.PYTHON_EXCEPTION_THROWN =>
>  // Signals that an exception has been thrown in python
>  val exLength = stream.readInt()  // It is possible to be blocked
> {code}
> {color:red}
> We can trigger the bug with a simple program:
> {color}
> {code:title=test.py|borderStyle=solid}
> spark = SparkSession.builder.master('local').getOrCreate()
> rdd = spark.sparkContext.parallelize(['中']).map(lambda x: 
> x.encode("utf8"))
> rdd.collect()
> {code}






[jira] [Updated] (SPARK-21045) Spark executor blocked instead of throwing exception because exception occur when python worker send exception info to Java Gateway

2017-06-10 Thread Joshuawangzj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshuawangzj updated SPARK-21045:
-
Summary: Spark executor blocked instead of throwing exception because 
exception occur when python worker send exception info to Java Gateway  (was: 
Spark executor is blocked instead of throwing exception because exception occur 
when python worker send exception trace stack info to Java Gateway)

> Spark executor blocked instead of throwing exception because exception occur 
> when python worker send exception info to Java Gateway
> ---
>
> Key: SPARK-21045
> URL: https://issues.apache.org/jira/browse/SPARK-21045
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1, 2.0.2, 2.1.1
>Reporter: Joshuawangzj
>
> My PySpark program always blocks in a production YARN cluster. I ran jstack 
> and found:
> {code}
> "Executor task launch worker for task 0" #60 daemon prio=5 os_prio=31 
> tid=0x7fb2f44e3000 nid=0xa003 runnable [0x000123b4a000]
>java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:170)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> - locked <0x0007acab1c98> (a java.io.BufferedInputStream)
> at java.io.DataInputStream.readInt(DataInputStream.java:387)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:190)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234)
> at 
> org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> It is blocking on a socket read. I viewed the log on the blocking executor and found 
> this error:
> {code}
> Traceback (most recent call last):
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 178, in 
> main
> write_with_length(traceback.format_exc().encode("utf-8"), outfile)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 618: 
> ordinal not in range(128)
> {code}
> Finally I found the problem:
> {code:title=worker.py|borderStyle=solid}
> # 178 line in spark 2.1.1
> except Exception:
> try:
> write_int(SpecialLengths.PYTHON_EXCEPTION_THROWN, outfile)
> write_with_length(traceback.format_exc().encode("utf-8"), outfile)
> except IOError:
> # JVM close the socket
> pass
> except Exception:
> # Write the error to stderr if it happened while serializing
> print("PySpark worker failed with exception:", file=sys.stderr)
> print(traceback.format_exc(), file=sys.stderr)
> {code}
> When write_with_length(traceback.format_exc().encode("utf-8"), outfile) raises an 
> exception such as UnicodeDecodeError, the Python worker can't send the trace 
> info, but once the PythonRDD gets PYTHON_EXCEPTION_THROWN, it expects to read the 
> trace info length next. So it blocks.
> {code:title=PythonRDD.scala|borderStyle=solid}
> # 190 line in spark 2.1.1
> case SpecialLengths.PYTHON_EXCEPTION_THROWN =>
>  // Signals that an exception has been thrown in python
>  val exLength = stream.readInt()  // It is possible to be blocked
> {code}
> {color:red}
> We can trigger the bug with a simple program:
> {color}
> {code:title=test.py|borderStyle=solid}
> spark = SparkSession.builder.master('local').getOrCreate()
> rdd = spark.sparkContext.parallelize(['中']).map(lambda x: 
> x.encode("utf8"))
> rdd.collect()
> {code}




[jira] [Closed] (SPARK-20684) expose createGlobalTempView and dropGlobalTempView in SparkR

2017-06-10 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-20684.
-
Resolution: Later

> expose createGlobalTempView and dropGlobalTempView in SparkR
> 
>
> Key: SPARK-20684
> URL: https://issues.apache.org/jira/browse/SPARK-20684
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> This is a useful API that is not exposed in SparkR. It will help with moving 
> data between languages within a single Spark application.
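
For context, a sketch of the existing Scala-side API this ticket asks to expose in SparkR 
(the view name and data are illustrative):

{code}
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

// Global temp views live in the system-preserved database `global_temp` and are visible
// across SparkSessions in the same application, which is what lets data move between
// language frontends.
df.createGlobalTempView("shared_df")
spark.sql("SELECT * FROM global_temp.shared_df").show()

// The counterpart this ticket also asks for in SparkR:
spark.catalog.dropGlobalTempView("shared_df")
{code}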






[jira] [Commented] (SPARK-21048) Add an option --merged-properties-file to distinguish the configuration loading behavior

2017-06-10 Thread Lantao Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045539#comment-16045539
 ] 

Lantao Jin commented on SPARK-21048:


Thank you for pointing that out about JIRA. A confusing name can be explained in the 
documentation. That is better than missing the default configuration when using the 
\-\-properties-file option. That is the key point I am trying to fix. Any ideas?

> Add an option --merged-properties-file to distinguish the configuration 
> loading behavior
> 
>
> Key: SPARK-21048
> URL: https://issues.apache.org/jira/browse/SPARK-21048
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The problem description is the same as 
> [SPARK-21023|https://issues.apache.org/jira/browse/SPARK-21023], but the purpose 
> differs from that ticket: it is not to make sure the default properties file is 
> always loaded, but simply to offer another option so users can choose what they want.
> {quote}
> {{\-\-properties-file}}: a user-specified properties file which replaces the 
> default properties file; deprecated.
> {{\-\-replaced-properties-file}}: a new option equivalent to 
> {{\-\-properties-file}} but with a clearer name.
> {{\-\-merged-properties-file}}: a user-specified properties file which is 
> merged with the default properties file.
> {quote}






[jira] [Commented] (SPARK-21049) why do we need computeGramianMatrix when computing SVD

2017-06-10 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045536#comment-16045536
 ] 

Vincent commented on SPARK-21049:
-

[~srowen] Thanks, that's right. But we quite often find that the matrix is 
not skinny, and it spends a lot of time computing the Gramian matrix. 
Actually, we found that in such cases, if we compute the SVD on the original 
matrix, we get at least a 5x speedup. So I wonder whether it's possible 
to add an option here to give the user the choice of going with the 
Gramian or the original matrix. After all, users know their data better. What 
do you think?

> why do we need computeGramianMatrix when computing SVD
> --
>
> Key: SPARK-21049
> URL: https://issues.apache.org/jira/browse/SPARK-21049
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.1
>Reporter: Vincent
>
> computeSVD computes the SVD of a matrix A by first computing A^T*A and then doing SVD 
> on the Gramian matrix. We found that the Gramian matrix computation is the hot 
> spot of the overall SVD computation, but, per my understanding, we can simply 
> do SVD on the original matrix. The singular vectors of the Gramian matrix 
> should be the same as the right singular vectors of the original matrix A, 
> while each singular value of the Gramian matrix is the square of the corresponding 
> singular value of the original matrix. Why do we do SVD on the Gramian matrix then?
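
For reference, the standard linear-algebra identity behind this (not from the ticket 
itself): writing the SVD of A as A = U Σ V^T,

{code}
A^\top A = (U \Sigma V^\top)^\top (U \Sigma V^\top) = V \Sigma^\top U^\top U \Sigma V^\top = V (\Sigma^\top \Sigma) V^\top
{code}

so the eigenvectors of the Gramian A^T*A are the right singular vectors of A, and its 
eigenvalues are the squares of A's singular values.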






[jira] [Commented] (SPARK-21048) Add an option --merged-properties-file to distinguish the configuration loading behavior

2017-06-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045533#comment-16045533
 ] 

Sean Owen commented on SPARK-21048:
---

I think this is confusing relative to any value it adds, and I don't think this 
should be done.
(You shouldn't open new JIRAs for the same issue, as it forks the discussion.)

> Add an option --merged-properties-file to distinguish the configuration 
> loading behavior
> 
>
> Key: SPARK-21048
> URL: https://issues.apache.org/jira/browse/SPARK-21048
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The problem description is the same as 
> [SPARK-21023|https://issues.apache.org/jira/browse/SPARK-21023], but the purpose 
> differs from that ticket: it is not to make sure the default properties file is 
> always loaded, but simply to offer another option so users can choose what they want.
> {quote}
> {{\-\-properties-file}}: a user-specified properties file which replaces the 
> default properties file; deprecated.
> {{\-\-replaced-properties-file}}: a new option equivalent to 
> {{\-\-properties-file}} but with a clearer name.
> {{\-\-merged-properties-file}}: a user-specified properties file which is 
> merged with the default properties file.
> {quote}






[jira] [Resolved] (SPARK-21049) why do we need computeGramianMatrix when computing SVD

2017-06-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21049.
---
Resolution: Invalid

Questions should go to the mailing list. 
Consider what "just" computing the SVD of the original matrix entails, when 
it's a huge distributed matrix. Assuming the matrix is huge but skinny, the 
Gramian is small and can be handled in-core.

> why do we need computeGramianMatrix when computing SVD
> --
>
> Key: SPARK-21049
> URL: https://issues.apache.org/jira/browse/SPARK-21049
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.1
>Reporter: Vincent
>
> computeSVD computes the SVD of a matrix A by first computing A^T*A and then doing SVD 
> on the Gramian matrix. We found that the Gramian matrix computation is the hot 
> spot of the overall SVD computation, but, per my understanding, we can simply 
> do SVD on the original matrix. The singular vectors of the Gramian matrix 
> should be the same as the right singular vectors of the original matrix A, 
> while each singular value of the Gramian matrix is the square of the corresponding 
> singular value of the original matrix. Why do we do SVD on the Gramian matrix then?






[jira] [Commented] (SPARK-21001) Staging folders from Hive table are not being cleared.

2017-06-10 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045531#comment-16045531
 ] 

Hyukjin Kwon commented on SPARK-21001:
--

Does this still exist in 2.1.0?

> Staging folders from Hive table are not being cleared.
> --
>
> Key: SPARK-21001
> URL: https://issues.apache.org/jira/browse/SPARK-21001
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Ajay Cherukuri
>
> Staging folders that are created as part of loading data into a Hive 
> table with a Spark job are not cleared.
> Staging folders remain in the Hive external table folders even after the Spark 
> job is completed.
> This is the same issue mentioned in the 
> ticket: https://issues.apache.org/jira/browse/SPARK-18372
> That ticket says the issue was resolved in 1.6.4, but I found that it still 
> exists in 2.0.2.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21049) why do we need computeGramianMatrix when computing SVD

2017-06-10 Thread Vincent (JIRA)
Vincent created SPARK-21049:
---

 Summary: why do we need computeGramianMatrix when computing SVD
 Key: SPARK-21049
 URL: https://issues.apache.org/jira/browse/SPARK-21049
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.1.1
Reporter: Vincent


computeSVD computes the SVD of a matrix A by first computing A^T*A and then 
running SVD on that Gramian matrix. We found that the Gramian matrix computation 
is the hot spot of the overall SVD computation, but, per my understanding, we 
could simply run SVD on the original matrix. The singular vectors of the Gramian 
matrix should be the same as the right singular vectors of the original matrix 
A, while its singular values are the squares of those of the original matrix. 
Why do we run SVD on the Gramian matrix then?
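
For reference, the standard linear-algebra identity behind the Gramian route (not Spark-specific), which also shows why the singular values come out squared:

{noformat}
A = U \Sigma V^T  \;\Rightarrow\;  A^T A = V \Sigma^T U^T U \Sigma V^T = V \Sigma^2 V^T
{noformat}

So the eigenvectors of A^T A are exactly the right singular vectors V of A, and its eigenvalues are \sigma_i^2, the squares of A's singular values; taking square roots recovers the singular values of A.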



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21048) Add an option --merged-properties-file to distinguish the configuration loading behavior

2017-06-10 Thread Lantao Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045502#comment-16045502
 ] 

Lantao Jin commented on SPARK-21048:


Will push a PR soon after a short discussion.

> Add an option --merged-properties-file to distinguish the configuration 
> loading behavior
> 
>
> Key: SPARK-21048
> URL: https://issues.apache.org/jira/browse/SPARK-21048
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The problem description is the same as 
> [SPARK-21023|https://issues.apache.org/jira/browse/SPARK-21023], but the goal 
> differs from that ticket: the purpose is not to make sure the default 
> properties file is always loaded, but simply to offer another option that lets 
> users choose what they want.
> {quote}
> {{\-\-properties-file}}: a user-specified properties file that replaces the 
> default properties file (to be deprecated).
> {{\-\-replaced-properties-file}}: a new option equivalent to 
> {{\-\-properties-file}} but with a friendlier name.
> {{\-\-merged-properties-file}}: a user-specified properties file that is 
> merged with the default properties file.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-10 Thread Lantao Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin closed SPARK-21023.
--
Resolution: Not A Problem

Closing as Not A Problem; for the new solution, see 
https://issues.apache.org/jira/browse/SPARK-21048

> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> For example:
> Current implement
> ||Property name||Value in default||Value in user-specified||Finally value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|N/A|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|"foo"|"foo"|
> |spark.E|"foo"|N/A|N/A|
> |spark.F|"foo"|N/A|N/A|
> Expected right implement
> ||Property name||Value in default||Value in user-specified||Finally value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|"foo"|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|N/A|"foo"|
> |spark.E|"foo"|N/A|"foo"|
> |spark.F|"foo"|N/A|"foo"|
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21048) Add an option --merged-properties-file to distinguish the configuration loading behavior

2017-06-10 Thread Lantao Jin (JIRA)
Lantao Jin created SPARK-21048:
--

 Summary: Add an option --merged-properties-file to distinguish the 
configuration loading behavior
 Key: SPARK-21048
 URL: https://issues.apache.org/jira/browse/SPARK-21048
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 2.1.1
Reporter: Lantao Jin
Priority: Minor


The problem description is the same as 
[SPARK-21023|https://issues.apache.org/jira/browse/SPARK-21023], but the goal 
differs from that ticket: the purpose is not to make sure the default properties 
file is always loaded, but simply to offer another option that lets users choose 
what they want.
{quote}
{{\-\-properties-file}}: a user-specified properties file that replaces the 
default properties file (to be deprecated).
{{\-\-replaced-properties-file}}: a new option equivalent to 
{{\-\-properties-file}} but with a friendlier name.
{{\-\-merged-properties-file}}: a user-specified properties file that is merged 
with the default properties file.
{quote}
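
A minimal, self-contained sketch of the merge semantics the proposed {{\-\-merged-properties-file}} is meant to provide, written against plain {{java.util.Properties}} rather than Spark's internal loader; the file paths and the {{load}} helper are illustrative only:

{code:title=MergeSemanticsSketch.scala|borderStyle=solid}
import java.io.FileInputStream
import java.util.Properties
import scala.collection.JavaConverters._

// Illustrative helper: read one properties file into an immutable Map.
def load(path: String): Map[String, String] = {
  val props = new Properties()
  val in = new FileInputStream(path)
  try props.load(in) finally in.close()
  props.asScala.toMap
}

// Merge semantics: start from the cluster-wide defaults and let the
// user-specified file override only the keys it actually sets.
val defaults  = load("/etc/spark/spark-defaults.conf")  // maintained by the infrastructure team
val userProps = load("/home/me/my-job.conf")            // supplied by the application developer
val effective = defaults ++ userProps                   // user values win; unset keys keep their defaults
{code}

This matches the "Expected" table in SPARK-21023: keys absent from the user file fall back to the defaults instead of disappearing.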



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21015) Check field name is not null and empty in GenericRowWithSchema

2017-06-10 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21015.
--
Resolution: Invalid

I am resolving this per 
https://github.com/apache/spark/pull/18236#issuecomment-307560317

Please reopen this if I misunderstood.

> Check field name is not null and empty in GenericRowWithSchema
> --
>
> Key: SPARK-21015
> URL: https://issues.apache.org/jira/browse/SPARK-21015
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: darion yaphet
>Priority: Minor
>
> When we get a field index from a row with a schema, we should make sure the 
> field name is neither null nor empty.
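
A minimal sketch (not Spark's actual implementation) of the kind of guard being requested, assuming the public {{Row.fieldIndex}} API; the helper name is illustrative:

{code:title=SafeFieldIndexSketch.scala|borderStyle=solid}
import org.apache.spark.sql.Row

// Reject null/empty names up front instead of failing later with a less
// helpful error from the schema lookup.
def safeFieldIndex(row: Row, name: String): Int = {
  require(name != null && name.nonEmpty, "field name must be neither null nor empty")
  row.fieldIndex(name)  // still throws if the row has no schema or the field is missing
}
{code}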



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-10 Thread Lantao Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045495#comment-16045495
 ] 

Lantao Jin commented on SPARK-21023:


[~cloud_fan], I see what you mean now. Maybe adding a 
{{\-\-merged-properties-file}} option and explaining it in the documentation is 
good enough for this case. Let's not spend effort on making sure the default 
properties file is always loaded; just make sure Spark users know what they are 
doing.

In the documentation we can explain the different options:
{quote}
{{\-\-properties-file}}: a user-specified properties file that replaces the 
default properties file.
{{\-\-merged-properties-file}}: a user-specified properties file that is merged 
with the default properties file.
{quote}

I think I should close this JIRA, since the original purpose (making sure the 
default properties file is loaded) is not an issue. I will file a new one to 
implement the new feature.

> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> For example:
> Current implement
> ||Property name||Value in default||Value in user-specified||Finally value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|N/A|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|"foo"|"foo"|
> |spark.E|"foo"|N/A|N/A|
> |spark.F|"foo"|N/A|N/A|
> Expected right implement
> ||Property name||Value in default||Value in user-specified||Finally value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|"foo"|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|N/A|"foo"|
> |spark.E|"foo"|N/A|"foo"|
> |spark.F|"foo"|N/A|"foo"|
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-10 Thread Lantao Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-21023:
---
Description: 
The default properties file {{spark-defaults.conf}} shouldn't be ignore to load 
even though the submit arg {{--properties-file}} is set. The reasons are very 
easy to see:
* Infrastructure team need continually update the {{spark-defaults.conf}} when 
they want set something as default for entire cluster as a tuning purpose.
* Application developer only want to override the parameters they really want 
rather than others they even doesn't know (Set by infrastructure team).
* The purpose of using {{\-\-properties-file}} from most of application 
developers is to avoid setting dozens of {{--conf k=v}}. But if 
{{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.

For example:
Current implement
||Property name||Value in default||Value in user-specified||Finally value||
|spark.A|"foo"|"bar"|"bar"|
|spark.B|"foo"|N/A|N/A|
|spark.C|N/A|"bar"|"bar"|
|spark.D|"foo"|"foo"|"foo"|
|spark.E|"foo"|N/A|N/A|
|spark.F|"foo"|N/A|N/A|

Expected right implement
||Property name||Value in default||Value in user-specified||Finally value||
|spark.A|"foo"|"bar"|"bar"|
|spark.B|"foo"|N/A|"foo"|
|spark.C|N/A|"bar"|"bar"|
|spark.D|"foo"|N/A|"foo"|
|spark.E|"foo"|N/A|"foo"|
|spark.F|"foo"|N/A|"foo"|

I can offer a patch to fix it if you think it make sense.

  was:
The default properties file {{spark-defaults.conf}} shouldn't be ignore to load 
even though the submit arg {{--properties-file}} is set. The reasons are very 
easy to see:
* Infrastructure team need continually update the {{spark-defaults.conf}} when 
they want set something as default for entire cluster as a tuning purpose.
* Application developer only want to override the parameters they really want 
rather than others they even doesn't know (Set by infrastructure team).
* The purpose of using {{\-\-properties-file}} from most of application 
developers is to avoid setting dozens of {{--conf k=v}}. But if 
{{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.

For example:
Current implement
||Property name||Value in default||Value in user-special||Finally value||
|spark.A|"foo"|"bar"|"bar"|
|spark.B|"foo"|N/A|N/A|
|spark.C|N/A|"bar"|"bar"|
|spark.D|"foo"|"foo"|"foo"|
|spark.E|"foo"|N/A|N/A|
|spark.F|"foo"|N/A|N/A|

Expected right implement
||Property name||Value in default||Value in user-special||Finally value||
|spark.A|"foo"|"bar"|"bar"|
|spark.B|"foo"|N/A|"foo"|
|spark.C|N/A|"bar"|"bar"|
|spark.D|"foo"|N/A|"foo"|
|spark.E|"foo"|N/A|"foo"|
|spark.F|"foo"|N/A|"foo"|

I can offer a patch to fix it if you think it make sense.


> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> For example:
> Current implement
> ||Property name||Value in default||Value in user-specified||Finally value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|N/A|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|"foo"|"foo"|
> |spark.E|"foo"|N/A|N/A|
> |spark.F|"foo"|N/A|N/A|
> Expected right implement
> ||Property name||Value in default||Value in user-specified||Finally value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|"foo"|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|N/A|"foo"|
> |spark.E|"foo"|N/A|"foo"|
> |spark.F|"foo"|N/A|"foo"|
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-10 Thread Lantao Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045483#comment-16045483
 ] 

Lantao Jin edited comment on SPARK-21023 at 6/10/17 10:15 AM:
--

[~cloud_fan] Having both {{\-\-properties-file}} and 
{{\-\-extra-properties-file}} could confuse users. Actually, it already confuses 
me: what would {{\-\-extra-properties-file}} be used for? [~vanzin]'s suggestion 
is to not change the existing behavior, and based on that suggestion I propose 
adding an environment variable {{SPARK_CONF_REPLACE_ALLOWED}}.


was (Author: cltlfcjin):
[~cloud_fan] {{--properties-file}} and {{--extra-properties-file}} both exist 
could confuse the user. Actually, it already confuse me. What is the 
{{--extra-properties-file}} use for? [~vanzin]'s suggestion is do not change 
existing behavior and based on this suggestion I propose to add an environment 
variable {{SPARK_CONF_REPLACE_ALLOWED}}.

> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> For example:
> Current implement
> ||Property name||Value in default||Value in user-special||Finally value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|N/A|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|"foo"|"foo"|
> |spark.E|"foo"|N/A|N/A|
> |spark.F|"foo"|N/A|N/A|
> Expected right implement
> ||Property name||Value in default||Value in user-special||Finally value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|"foo"|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|N/A|"foo"|
> |spark.E|"foo"|N/A|"foo"|
> |spark.F|"foo"|N/A|"foo"|
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-10 Thread Lantao Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045483#comment-16045483
 ] 

Lantao Jin commented on SPARK-21023:


[~cloud_fan] Having both {{--properties-file}} and {{--extra-properties-file}} 
could confuse users. Actually, it already confuses me: what would 
{{--extra-properties-file}} be used for? [~vanzin]'s suggestion is to not change 
the existing behavior, and based on that suggestion I propose adding an 
environment variable {{SPARK_CONF_REPLACE_ALLOWED}}.

> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> For example:
> Current implement
> ||Property name||Value in default||Value in user-special||Finally value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|N/A|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|"foo"|"foo"|
> |spark.E|"foo"|N/A|N/A|
> |spark.F|"foo"|N/A|N/A|
> Expected right implement
> ||Property name||Value in default||Value in user-special||Finally value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|"foo"|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|N/A|"foo"|
> |spark.E|"foo"|N/A|"foo"|
> |spark.F|"foo"|N/A|"foo"|
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21047) Add test cases for nested array

2017-06-10 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-21047:
-
Summary: Add test cases for nested array  (was: Add a test case for nested 
array)

> Add test cases for nested array
> ---
>
> Key: SPARK-21047
> URL: https://issues.apache.org/jira/browse/SPARK-21047
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>
> Currently {{ColumnarBatchSuite}} has only very simple test cases for arrays. 
> This JIRA will add test cases for nested arrays in {{ColumnVector}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21047) Add a test case for nested array

2017-06-10 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-21047:


 Summary: Add a test case for nested array
 Key: SPARK-21047
 URL: https://issues.apache.org/jira/browse/SPARK-21047
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki


Currently {{ColumnarBatchSuite}} has only very simple test cases for arrays. 
This JIRA will add test cases for nested arrays in {{ColumnVector}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-10 Thread Lantao Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-21023:
---
Description: 
The default properties file {{spark-defaults.conf}} shouldn't be ignore to load 
even though the submit arg {{--properties-file}} is set. The reasons are very 
easy to see:
* Infrastructure team need continually update the {{spark-defaults.conf}} when 
they want set something as default for entire cluster as a tuning purpose.
* Application developer only want to override the parameters they really want 
rather than others they even doesn't know (Set by infrastructure team).
* The purpose of using {{\-\-properties-file}} from most of application 
developers is to avoid setting dozens of {{--conf k=v}}. But if 
{{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.

For example:
Current implement
||Property name||Value in default||Value in user-special||Finally value||
|spark.A|"foo"|"bar"|"bar"|
|spark.B|"foo"|N/A|N/A|
|spark.C|N/A|"bar"|"bar"|
|spark.D|"foo"|"foo"|"foo"|
|spark.E|"foo"|N/A|N/A|
|spark.F|"foo"|N/A|N/A|

Expected right implement
||Property name||Value in default||Value in user-special||Finally value||
|spark.A|"foo"|"bar"|"bar"|
|spark.B|"foo"|N/A|"foo"|
|spark.C|N/A|"bar"|"bar"|
|spark.D|"foo"|N/A|"foo"|
|spark.E|"foo"|N/A|"foo"|
|spark.F|"foo"|N/A|"foo"|

I can offer a patch to fix it if you think it make sense.

  was:
The default properties file {{spark-defaults.conf}} shouldn't be ignore to load 
even though the submit arg {{--properties-file}} is set. The reasons are very 
easy to see:
* Infrastructure team need continually update the {{spark-defaults.conf}} when 
they want set something as default for entire cluster as a tuning purpose.
* Application developer only want to override the parameters they really want 
rather than others they even doesn't know (Set by infrastructure team).
* The purpose of using {{\-\-properties-file}} from most of application 
developers is to avoid setting dozens of {{--conf k=v}}. But if 
{{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.

For example:
Current implement
||Property name||Value in default||Value in user-special||Finally value||
|spark.A|"foo"|"bar"|"bar"|
|spark.B|"foo"|N/A|N/A|
|spark.C|N/A|"bar"|"bar"|
|spark.D|"foo"|"foo"|"foo"|
|spark.E|"foo"|N/A|N/A|
|spark.F|"foo"|N/A|N/A|

Expected right implement
||Property name||Value in default||Value in user-special||Finally value||
|spark.A|"foo"|"bar"|"bar"|
|spark.B|"foo"|N/A|"foo"|
|spark.C|N/A|"bar"|"bar"|
|spark.D|"foo"|"foo"|"foo"|
|spark.E|"foo"|"foo"|"foo"|
|spark.F|"foo"|"foo"|"foo"|

I can offer a patch to fix it if you think it make sense.


> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> For example:
> Current implement
> ||Property name||Value in default||Value in user-special||Finally value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|N/A|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|"foo"|"foo"|
> |spark.E|"foo"|N/A|N/A|
> |spark.F|"foo"|N/A|N/A|
> Expected right implement
> ||Property name||Value in default||Value in user-special||Finally value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|"foo"|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|N/A|"foo"|
> |spark.E|"foo"|N/A|"foo"|
> |spark.F|"foo"|N/A|"foo"|
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21037) ignoreNulls does not working properly with window functions

2017-06-10 Thread Stanislav Chernichkin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045479#comment-16045479
 ] 

Stanislav Chernichkin commented on SPARK-21037:
---

To be more precise, the problem is not related to the ignoreNulls property. It 
arises when orderBy is used without specifying window boundaries. In that case 
the boundaries are set to UNBOUNDED PRECEDING - CURRENT ROW and all aggregation 
functions behave accordingly. The problem does not arise when orderBy is not 
used. This behavior is not documented and is unintuitive: popular databases do 
not require specifying window boundaries to apply an aggregation function to the 
whole group (it is applied to the whole group by default), and they do not 
adjust the default window frame depending on the presence of ordering.
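
For what it's worth, a sketch of how to get the result the reporter expected by making the frame explicit (plain Dataset API; an existing {{spark}} session, e.g. from spark-shell, is assumed):

{code:title=ExplicitFrameSketch.scala|borderStyle=solid}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.first
import spark.implicits._

val df = spark.sql(
  "select 0 as key, null as value, 0 as order union select 0 as key, 'value' as value, 1 as order")

// With orderBy alone the frame ends at the current row, so the first row only
// sees its own null. An explicit whole-partition frame looks at every row.
val wholePartition = Window
  .partitionBy($"key")
  .orderBy($"order")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.select($"*",
  first($"value", ignoreNulls = true).over(wholePartition).as("first_value"))
  .show()
{code}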

> ignoreNulls does not working properly with window functions
> ---
>
> Key: SPARK-21037
> URL: https://issues.apache.org/jira/browse/SPARK-21037
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Stanislav Chernichkin
>
> Following code  reproduces issue:
> spark
>   .sql("select 0 as key, null as value, 0 as order union select 0 as key, 
> 'value' as value, 1 as order")
>   .select($"*", first($"value", 
> true).over(partitionBy($"key").orderBy("order")).as("first_value"))
>   .show()
> Since the documentation claims that the {{first}} function will return the 
> first non-null result, I expect to get: 
> |key|value|order|first_value|
> |  0| null|0|   value|
> |  0|value|1|  value|
> But actual result is: 
> |key|value|order|first_value|
> |  0| null|0|   null|
> |  0|value|1|  value|



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21046) simplify the array offset and length in ColumnVector

2017-06-10 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-21046:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-20960

> simplify the array offset and length in ColumnVector
> 
>
> Key: SPARK-21046
> URL: https://issues.apache.org/jira/browse/SPARK-21046
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21046) simplify the array offset and length in ColumnVector

2017-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21046:


Assignee: Apache Spark  (was: Wenchen Fan)

> simplify the array offset and length in ColumnVector
> 
>
> Key: SPARK-21046
> URL: https://issues.apache.org/jira/browse/SPARK-21046
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21046) simplify the array offset and length in ColumnVector

2017-06-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21046:


Assignee: Wenchen Fan  (was: Apache Spark)

> simplify the array offset and length in ColumnVector
> 
>
> Key: SPARK-21046
> URL: https://issues.apache.org/jira/browse/SPARK-21046
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21046) simplify the array offset and length in ColumnVector

2017-06-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045477#comment-16045477
 ] 

Apache Spark commented on SPARK-21046:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/18260

> simplify the array offset and length in ColumnVector
> 
>
> Key: SPARK-21046
> URL: https://issues.apache.org/jira/browse/SPARK-21046
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21037) ignoreNulls does not working properly with window functions

2017-06-10 Thread Stanislav Chernichkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislav Chernichkin updated SPARK-21037:
--
Description: 
The following code reproduces the issue:

spark
  .sql("select 0 as key, null as value, 0 as order union select 0 as key, 
'value' as value, 1 as order")
  .select($"*", first($"value", 
true).over(partitionBy($"key").orderBy("order")).as("first_value"))
  .show()

Since the documentation claims that the {{first}} function will return the 
first non-null result, I expect to get: 

|key|value|order|first_value|
|  0| null|0|   value|
|  0|value|1|  value|

But actual result is: 

|key|value|order|first_value|
|  0| null|0|   null|
|  0|value|1|  value|


  was:
Following code  reproduces issue:

spark
  .sql("select 0 as key, null as value, 0 as order union select 0 as key, 
'value' as value, 1 as order")
  .select($"*", first($"value", 
true).over(partitionBy($"key").orderBy("order")).as("first_value"))
  .show()

Since documentation climes than {{first}} function will return first non-null 
result I except to have: 

|key|value|order|first_value|
+---+-+-+---+
|  0| null|0|   value|
|  0|value|1|  value|
+---+-+-+---+
But actual result is: 
+---+-+-+---+
|key|value|order|first_value|
+---+-+-+---+
|  0| null|0|   null|
|  0|value|1|  value|
+---+-+-+---+



> ignoreNulls does not working properly with window functions
> ---
>
> Key: SPARK-21037
> URL: https://issues.apache.org/jira/browse/SPARK-21037
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Stanislav Chernichkin
>
> Following code  reproduces issue:
> spark
>   .sql("select 0 as key, null as value, 0 as order union select 0 as key, 
> 'value' as value, 1 as order")
>   .select($"*", first($"value", 
> true).over(partitionBy($"key").orderBy("order")).as("first_value"))
>   .show()
> Since the documentation claims that the {{first}} function will return the 
> first non-null result, I expect to get: 
> |key|value|order|first_value|
> |  0| null|0|   value|
> |  0|value|1|  value|
> But actual result is: 
> |key|value|order|first_value|
> |  0| null|0|   null|
> |  0|value|1|  value|



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21046) simplify the array offset and length in ColumnVector

2017-06-10 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-21046:
---

 Summary: simplify the array offset and length in ColumnVector
 Key: SPARK-21046
 URL: https://issues.apache.org/jira/browse/SPARK-21046
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21037) ignoreNulls does not working properly with window functions

2017-06-10 Thread Stanislav Chernichkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislav Chernichkin updated SPARK-21037:
--
Description: 
Following code  reproduces issue:

spark
  .sql("select 0 as key, null as value, 0 as order union select 0 as key, 
'value' as value, 1 as order")
  .select($"*", first($"value", 
true).over(partitionBy($"key").orderBy("order")).as("first_value"))
  .show()

Since documentation climes than {{first}} function will return first non-null 
result I except to have: 

|key|value|order|first_value|
+---+-+-+---+
|  0| null|0|   value|
|  0|value|1|  value|
+---+-+-+---+
But actual result is: 
+---+-+-+---+
|key|value|order|first_value|
+---+-+-+---+
|  0| null|0|   null|
|  0|value|1|  value|
+---+-+-+---+


  was:
Following code  reproduces issue:

spark
  .sql("select 0 as key, null as value, 0 as order union select 0 as key, 
'value' as value, 1 as order")
  .select($"*", first($"value", 
true).over(partitionBy($"key").orderBy("order")).as("first_value"))
  .show()

Since documentation climes than {{first}} function will return first non-null 
result I except to have: 
+---+-+-+---+
|key|value|order|first_value|
+---+-+-+---+
|  0| null|0|   value|
|  0|value|1|  value|
+---+-+-+---+
But actual result is: 
+---+-+-+---+
|key|value|order|first_value|
+---+-+-+---+
|  0| null|0|   null|
|  0|value|1|  value|
+---+-+-+---+



> ignoreNulls does not working properly with window functions
> ---
>
> Key: SPARK-21037
> URL: https://issues.apache.org/jira/browse/SPARK-21037
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Stanislav Chernichkin
>
> Following code  reproduces issue:
> spark
>   .sql("select 0 as key, null as value, 0 as order union select 0 as key, 
> 'value' as value, 1 as order")
>   .select($"*", first($"value", 
> true).over(partitionBy($"key").orderBy("order")).as("first_value"))
>   .show()
> Since documentation climes than {{first}} function will return first non-null 
> result I except to have: 
> |key|value|order|first_value|
> +---+-+-+---+
> |  0| null|0|   value|
> |  0|value|1|  value|
> +---+-+-+---+
> But actual result is: 
> +---+-+-+---+
> |key|value|order|first_value|
> +---+-+-+---+
> |  0| null|0|   null|
> |  0|value|1|  value|
> +---+-+-+---+



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-10 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045472#comment-16045472
 ] 

Wenchen Fan commented on SPARK-21023:
-

can't we just introduce something like `--extra-properties-file` for this new 
feature?

> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> For example:
> Current implement
> ||Property name||Value in default||Value in user-special||Finally value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|N/A|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|"foo"|"foo"|
> |spark.E|"foo"|N/A|N/A|
> |spark.F|"foo"|N/A|N/A|
> Expected right implement
> ||Property name||Value in default||Value in user-special||Finally value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|"foo"|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|"foo"|"foo"|
> |spark.E|"foo"|"foo"|"foo"|
> |spark.F|"foo"|"foo"|"foo"|
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21006) Create rpcEnv and run later needs shutdown and awaitTermination

2017-06-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045468#comment-16045468
 ] 

Apache Spark commented on SPARK-21006:
--

User '10110346' has created a pull request for this issue:
https://github.com/apache/spark/pull/18259

> Create rpcEnv and run later needs shutdown and awaitTermination
> ---
>
> Key: SPARK-21006
> URL: https://issues.apache.org/jira/browse/SPARK-21006
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.1.1
>Reporter: wangjiaochun
>Assignee: wangjiaochun
>Priority: Minor
> Fix For: 2.3.0
>
>
> test("port conflict") {
> val anotherEnv = createRpcEnv(new SparkConf(), "remote", env.address.port)
> assert(anotherEnv.address.port != env.address.port)
>   }
> should be shutdown and awaitTermination in RpcEnvSuit.scala
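
A minimal sketch of the proposed cleanup, assuming the {{createRpcEnv}} helper and the {{env}} fixture that already exist in {{RpcEnvSuite}}:

{code:title=PortConflictTestSketch.scala|borderStyle=solid}
test("port conflict") {
  val anotherEnv = createRpcEnv(new SparkConf(), "remote", env.address.port)
  try {
    assert(anotherEnv.address.port != env.address.port)
  } finally {
    // Proposed addition: release the second RpcEnv before leaving the test.
    anotherEnv.shutdown()
    anotherEnv.awaitTermination()
  }
}
{code}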



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20752) Build-in SQL Function Support - SQRT

2017-06-10 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045466#comment-16045466
 ] 

Kazuaki Ishizaki commented on SPARK-20752:
--

ping [~smilegator]

> Build-in SQL Function Support - SQRT
> 
>
> Key: SPARK-20752
> URL: https://issues.apache.org/jira/browse/SPARK-20752
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>  Labels: starter
>
> {noformat}
> SQRT()
> {noformat}
> Returns the square root of its argument, i.e. Power(x, 0.5).
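
A quick sanity check of the intended semantics (assuming an existing {{spark}} session); {{SQRT}} and {{POWER}} are both callable from Spark SQL:

{code:title=SqrtCheck.scala|borderStyle=solid}
// SQRT(x) should equal POWER(x, 0.5): both columns below evaluate to 4.0.
spark.sql("SELECT SQRT(16.0) AS via_sqrt, POWER(16.0, 0.5) AS via_power").show()
{code}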



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21045) Spark executor is blocked instead of throwing exception because exception occur when python worker send exception trace stack info to Java Gateway

2017-06-10 Thread Joshuawangzj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshuawangzj updated SPARK-21045:
-
Description: 
My pyspark program always blocks in our production YARN cluster. I ran jstack 
and found:

{code}
"Executor task launch worker for task 0" #60 daemon prio=5 os_prio=31 
tid=0x7fb2f44e3000 nid=0xa003 runnable [0x000123b4a000]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
- locked <0x0007acab1c98> (a java.io.BufferedInputStream)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:190)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

It is blocked in a socket read. I viewed the log on the blocked executor and 
found this error:

{code}
Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 178, in main
write_with_length(traceback.format_exc().encode("utf-8"), outfile)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 618: 
ordinal not in range(128)
{code}

Finally I found the problem:

{code:title=worker.py|borderStyle=solid}
# 178 line in spark 2.1.1
    except Exception:
        try:
            write_int(SpecialLengths.PYTHON_EXCEPTION_THROWN, outfile)
            write_with_length(traceback.format_exc().encode("utf-8"), outfile)
        except IOError:
            # JVM close the socket
            pass
        except Exception:
            # Write the error to stderr if it happened while serializing
            print("PySpark worker failed with exception:", file=sys.stderr)
            print(traceback.format_exc(), file=sys.stderr)
{code}

When write_with_length(traceback.format_exc().encode("utf-8"), outfile) raises 
an exception such as UnicodeDecodeError, the Python worker cannot send the 
traceback, but once PythonRDD has read PYTHON_EXCEPTION_THROWN it expects to 
read the traceback length next. So it blocks.

{code:title=PythonRDD.scala|borderStyle=solid}
# 190 line in spark 2.1.1
case SpecialLengths.PYTHON_EXCEPTION_THROWN =>
 // Signals that an exception has been thrown in python
 val exLength = stream.readInt()  // It is possible to be blocked
{code}

{color:red}
We can trigger the bug with a simple program:
{color}
{code:title=test.py|borderStyle=solid}
spark = SparkSession.builder.master('local').getOrCreate()
rdd = spark.sparkContext.parallelize(['中']).map(lambda x: x.encode("utf8"))
rdd.collect()
{code}

  was:
My pyspark program is always blocking in product yarn cluster. Then I jstack 
and found :

{code}
"Executor task launch worker for task 0" #60 daemon prio=5 os_prio=31 
tid=0x7fb2f44e3000 nid=0xa003 runnable [0x000123b4a000]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
- locked <0x0007acab1c98> (a java.io.BufferedInputStream)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:190)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at 

[jira] [Updated] (SPARK-21045) Spark executor is blocked instead of throwing exception because exception occur when python worker send exception trace stack info to Java Gateway

2017-06-10 Thread Joshuawangzj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshuawangzj updated SPARK-21045:
-
Description: 
My pyspark program is always blocking in product yarn cluster. Then I jstack 
and found :

{code}
"Executor task launch worker for task 0" #60 daemon prio=5 os_prio=31 
tid=0x7fb2f44e3000 nid=0xa003 runnable [0x000123b4a000]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
- locked <0x0007acab1c98> (a java.io.BufferedInputStream)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:190)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

It is blocking in socket read.  I view the log on blocking executor and found 
error:

{code}
Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 178, in main
write_with_length(traceback.format_exc().encode("utf-8"), outfile)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 618: 
ordinal not in range(128)
{code}

Finally I found the problem:

{code:title=worker.py|borderStyle=solid}
# 178 line in spark 2.1.1
    except Exception:
        try:
            write_int(SpecialLengths.PYTHON_EXCEPTION_THROWN, outfile)
            write_with_length(traceback.format_exc().encode("utf-8"), outfile)
        except IOError:
            # JVM close the socket
            pass
        except Exception:
            # Write the error to stderr if it happened while serializing
            print("PySpark worker failed with exception:", file=sys.stderr)
            print(traceback.format_exc(), file=sys.stderr)
{code}

when write_with_length(traceback.format_exc().encode("utf-8"), outfile) occur 
exception like UnicodeDecodeError, the python worker can't send the trace info, 
but when the PythonRDD get PYTHON_EXCEPTION_THROWN, It should read the trace 
info length next. So it is blocking.

{code:title=PythonRDD.scala|borderStyle=solid}
# 190 line in spark 2.1.1
case SpecialLengths.PYTHON_EXCEPTION_THROWN =>
 // Signals that an exception has been thrown in python
 val exLength = stream.readInt()  // It is possible to be blocked
{code}

We can triggle the bug use simple program:

{code title=test.py|borderStyle=solid}
spark = SparkSession.builder.master('local').getOrCreate()
rdd = spark.sparkContext.parallelize(['中']).map(lambda x: x.encode("utf8"))
rdd.collect()
{code}



  was:
My pyspark program is always blocking in product yarn cluster. Then I jstack 
and found :

{code}
"Executor task launch worker for task 0" #60 daemon prio=5 os_prio=31 
tid=0x7fb2f44e3000 nid=0xa003 runnable [0x000123b4a000]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
- locked <0x0007acab1c98> (a java.io.BufferedInputStream)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:190)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at 

[jira] [Updated] (SPARK-21045) Spark executor is blocked instead of throwing exception because exception occur when python worker send exception trace stack info to Java Gateway

2017-06-10 Thread Joshuawangzj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshuawangzj updated SPARK-21045:
-
Description: 
My pyspark program is always blocking in product yarn cluster. Then I jstack 
and found :

{code}
"Executor task launch worker for task 0" #60 daemon prio=5 os_prio=31 
tid=0x7fb2f44e3000 nid=0xa003 runnable [0x000123b4a000]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
- locked <0x0007acab1c98> (a java.io.BufferedInputStream)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:190)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

It is blocking in socket read.  I view the log on blocking executor and found 
error:

{code}
Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 178, in main
write_with_length(traceback.format_exc().encode("utf-8"), outfile)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 618: 
ordinal not in range(128)
{code}

Finally I found the problem:

{code:title=worker.py|borderStyle=solid}
# 178 line in spark 2.1.1
    except Exception:
        try:
            write_int(SpecialLengths.PYTHON_EXCEPTION_THROWN, outfile)
            write_with_length(traceback.format_exc().encode("utf-8"), outfile)
        except IOError:
            # JVM close the socket
            pass
        except Exception:
            # Write the error to stderr if it happened while serializing
            print("PySpark worker failed with exception:", file=sys.stderr)
            print(traceback.format_exc(), file=sys.stderr)
{code}

when write_with_length(traceback.format_exc().encode("utf-8"), outfile) occur 
exception like UnicodeDecodeError, the python worker can't send the trace info, 
but when the PythonRDD get PYTHON_EXCEPTION_THROWN, It should read the trace 
info length next. So it is blocking.

{code:title=PythonRDD.scala|borderStyle=solid}
# 190 line in spark 2.1.1
case SpecialLengths.PYTHON_EXCEPTION_THROWN =>
 // Signals that an exception has been thrown in python
 val exLength = stream.readInt()  // It is possible to be blocked
{code}

{color:red}
We can triggle the bug use simple program:
{color}

{code title=test.py|borderStyle=solid}
spark = SparkSession.builder.master('local').getOrCreate()
rdd = spark.sparkContext.parallelize(['中']).map(lambda x: x.encode("utf8"))
rdd.collect()
{code}



  was:
My pyspark program is always blocking in product yarn cluster. Then I jstack 
and found :

{code}
"Executor task launch worker for task 0" #60 daemon prio=5 os_prio=31 
tid=0x7fb2f44e3000 nid=0xa003 runnable [0x000123b4a000]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
- locked <0x0007acab1c98> (a java.io.BufferedInputStream)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:190)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at 

[jira] [Updated] (SPARK-21045) Spark executor is blocked instead of throwing exception because exception occur when python worker send exception trace stack info to Java Gateway

2017-06-10 Thread Joshuawangzj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshuawangzj updated SPARK-21045:
-
Description: 
My pyspark program is always blocking in product yarn cluster. Then I jstack 
and found :

{code}
"Executor task launch worker for task 0" #60 daemon prio=5 os_prio=31 
tid=0x7fb2f44e3000 nid=0xa003 runnable [0x000123b4a000]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
- locked <0x0007acab1c98> (a java.io.BufferedInputStream)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:190)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

It is blocking in socket read.  I view the log on blocking executor and found 
error:

{code}
Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 178, in main
write_with_length(traceback.format_exc().encode("utf-8"), outfile)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 618: 
ordinal not in range(128)
{code}

Finally I found the problem:

{code:title=worker.py|borderStyle=solid}
# line 178 in spark 2.1.1
except Exception:
    try:
        write_int(SpecialLengths.PYTHON_EXCEPTION_THROWN, outfile)
        write_with_length(traceback.format_exc().encode("utf-8"), outfile)
    except IOError:
        # JVM close the socket
        pass
    except Exception:
        # Write the error to stderr if it happened while serializing
        print("PySpark worker failed with exception:", file=sys.stderr)
        print(traceback.format_exc(), file=sys.stderr)
{code}

When write_with_length(traceback.format_exc().encode("utf-8"), outfile) itself raises an 
exception such as UnicodeDecodeError, the Python worker cannot send the traceback. But once 
PythonRDD receives PYTHON_EXCEPTION_THROWN, it expects to read the traceback length next, 
so the read blocks forever.

{code:title=PythonRDD.scala|borderStyle=solid}
// line 190 in spark 2.1.1
case SpecialLengths.PYTHON_EXCEPTION_THROWN =>
  // Signals that an exception has been thrown in python
  val exLength = stream.readInt()  // It is possible to be blocked
{code}

{color:red}
We can trigger the bug with a simple program:
{color}
{code:title=test.py|borderStyle=solid}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').getOrCreate()
rdd = spark.sparkContext.parallelize(['中']).map(lambda x: x.encode("utf8"))
rdd.collect()
{code}



  was:
My pyspark program is always blocking in product yarn cluster. Then I jstack 
and found :

{code}
"Executor task launch worker for task 0" #60 daemon prio=5 os_prio=31 
tid=0x7fb2f44e3000 nid=0xa003 runnable [0x000123b4a000]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
- locked <0x0007acab1c98> (a java.io.BufferedInputStream)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:190)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at 

[jira] [Updated] (SPARK-21045) Spark executor is blocked instead of throwing an exception because an exception occurs when the Python worker sends the exception trace stack info to the Java Gateway

2017-06-10 Thread Joshuawangzj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshuawangzj updated SPARK-21045:
-
Description: 
My pyspark program is always blocking in product yarn cluster. Then I jstack 
and found :

{code}
"Executor task launch worker for task 0" #60 daemon prio=5 os_prio=31 
tid=0x7fb2f44e3000 nid=0xa003 runnable [0x000123b4a000]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
- locked <0x0007acab1c98> (a java.io.BufferedInputStream)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:190)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

It is blocking in socket read.  I view the log on blocking executor and found 
error:

{code}
Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 178, in main
write_with_length(traceback.format_exc().encode("utf-8"), outfile)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 618: 
ordinal not in range(128)
{code}

Finally I found the problem:

{code:title=worker.py|borderStyle=solid}
# line 178 in spark 2.1.1
except Exception:
    try:
        write_int(SpecialLengths.PYTHON_EXCEPTION_THROWN, outfile)
        write_with_length(traceback.format_exc().encode("utf-8"), outfile)
    except IOError:
        # JVM close the socket
        pass
    except Exception:
        # Write the error to stderr if it happened while serializing
        print("PySpark worker failed with exception:", file=sys.stderr)
        print(traceback.format_exc(), file=sys.stderr)
{code}

When write_with_length(traceback.format_exc().encode("utf-8"), outfile) itself raises an 
exception such as UnicodeDecodeError, the Python worker cannot send the traceback. But once 
PythonRDD receives PYTHON_EXCEPTION_THROWN, it expects to read the traceback length next, 
so the read blocks forever.

{code:title=PythonRDD.scala|borderStyle=solid}
// line 190 in spark 2.1.1
case SpecialLengths.PYTHON_EXCEPTION_THROWN =>
  // Signals that an exception has been thrown in python
  val exLength = stream.readInt()  // It is possible to be blocked
{code}



  was:
My pyspark program is always blocking in product yarn cluster. Then I jstack 
and found :

{code}
"Executor task launch worker for task 0" #60 daemon prio=5 os_prio=31 
tid=0x7fb2f44e3000 nid=0xa003 runnable [0x000123b4a000]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
- locked <0x0007acab1c98> (a java.io.BufferedInputStream)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:190)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at 

[jira] [Created] (SPARK-21045) Spark executor is blocked instead of throwing an exception because an exception occurs when the Python worker sends the exception trace stack info to the Java Gateway

2017-06-10 Thread Joshuawangzj (JIRA)
Joshuawangzj created SPARK-21045:


 Summary: Spark executor is blocked instead of throwing an exception because an exception occurs when the Python worker sends the exception trace stack info to the Java Gateway
 Key: SPARK-21045
 URL: https://issues.apache.org/jira/browse/SPARK-21045
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.1, 2.0.2, 2.0.1
Reporter: Joshuawangzj


My pyspark program is always blocking in product yarn cluster. Then I jstack 
and found :

{code}
"Executor task launch worker for task 0" #60 daemon prio=5 os_prio=31 
tid=0x7fb2f44e3000 nid=0xa003 runnable [0x000123b4a000]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
- locked <0x0007acab1c98> (a java.io.BufferedInputStream)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:190)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

It is blocking in socket read.  I view the log on blocking executor and found 
error:

{code}
Traceback (most recent call last):
  File 
"/Users/wangzejie/software/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 178, in main
write_with_length(traceback.format_exc().encode("utf-8"), outfile)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 618: 
ordinal not in range(128)
{code}

Finally I found the problem:

{code:title=worker.py|borderStyle=solid}
# line 178 in spark 2.1.1
except Exception:
    try:
        write_int(SpecialLengths.PYTHON_EXCEPTION_THROWN, outfile)
        write_with_length(traceback.format_exc().encode("utf-8"), outfile)
    except IOError:
        # JVM close the socket
        pass
    except Exception:
        # Write the error to stderr if it happened while serializing
        print("PySpark worker failed with exception:", file=sys.stderr)
        print(traceback.format_exc(), file=sys.stderr)
{code}

When write_with_length(traceback.format_exc().encode("utf-8"), outfile) itself raises an 
exception such as UnicodeDecodeError, the Python worker cannot send the traceback. But once 
PythonRDD receives PYTHON_EXCEPTION_THROWN, it expects to read the traceback length next, 
so the read blocks forever.

{code:title=PythonRDD.scala|borderStyle=solid}
// line 190 in spark 2.1.1
case SpecialLengths.PYTHON_EXCEPTION_THROWN =>
  // Signals that an exception has been thrown in python
  val exLength = stream.readInt()  // It is possible to be blocked
{code}





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21023) Ignoring the default properties file is not a good choice from the perspective of the system

2017-06-10 Thread Lantao Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045452#comment-16045452
 ] 

Lantao Jin edited comment on SPARK-21023 at 6/10/17 8:18 AM:
-

If the current behavior should be kept, I can add an environment variable, e.g. 
{{SPARK_CONF_REPLACE_ALLOWED}}, with default value "true", and put it into the 
{{childEnv}} map in SparkSubmitCommandBuilder so that it is set at the very beginning.
{code}
  public SparkLauncher setConfReplaceBehavior(String allowed) {
    checkNotNull(allowed, "allowed");
    builder.childEnv.put(SPARK_CONF_REPLACE_ALLOWED, allowed);
    return this;
  }
{code}
Then we can export SPARK_CONF_REPLACE_ALLOWED=false in {{spark-env.sh}} to fix 
this case while keeping the current behavior by default. Generally, {{spark-env.sh}} 
is deployed by the infra team and protected by the Linux file permission mechanism.

Of course, a user can export it to any value before submitting, but then the user 
definitely knows what they want instead of getting the current unexpected result.
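
For illustration only, here is a hedged Python sketch (not the launcher's actual Java code; 
the function name is made up) of how such a toggle could choose between replacing and merging 
the two property files:

{code:title=conf_replace_toggle_sketch.py|borderStyle=solid}
import os

def resolve_properties(default_props, user_props):
    """Illustrative sketch: default_props and user_props are dicts parsed from
    spark-defaults.conf and the file passed via --properties-file."""
    if os.environ.get("SPARK_CONF_REPLACE_ALLOWED", "true").lower() == "true":
        # current behaviour: the user-specified file replaces the defaults entirely
        return dict(user_props)
    # proposed behaviour when the toggle is "false": the defaults stay loaded
    # and the user-specified file only overrides individual keys
    merged = dict(default_props)
    merged.update(user_props)
    return merged
{code}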


was (Author: cltlfcjin):
If the current behavior should be kept, I can add an environment variable, e.g. 
{{SPARK_CONF_REPLACE_ALLOWED}}, with default value "true", and put it into the 
{{childEnv}} map in SparkLauncher.class.
{code}
  public SparkLauncher setConfReplaceBehavior(String allowed) {
    checkNotNull(allowed, "allowed");
    builder.childEnv.put(SPARK_CONF_REPLACE_ALLOWED, allowed);
    return this;
  }
{code}
Then we can export SPARK_CONF_REPLACE_ALLOWED=false in {{spark-env.sh}} to fix 
this case while keeping the current behavior by default. Generally, {{spark-env.sh}} 
is deployed by the infra team and protected by the Linux file permission mechanism.

Of course, a user can export it to any value before submitting, but then the user 
definitely knows what they want instead of getting the current unexpected result.

> Ignoring the default properties file is not a good choice from the 
> perspective of the system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be skipped even 
> when the submit arg {{--properties-file}} is set. The reasons are easy to see:
> * The infrastructure team needs to continually update {{spark-defaults.conf}} 
> when they want to set cluster-wide defaults for tuning purposes.
> * Application developers only want to override the parameters they care about, 
> not others they may not even know about (set by the infrastructure team).
> * Most application developers use {{\-\-properties-file}} to avoid setting 
> dozens of {{--conf k=v}} options. But if {{spark-defaults.conf}} is ignored, 
> the resulting behaviour is unexpected.
> For example:
> Current implementation
> ||Property name||Value in defaults file||Value in user-specified file||Final value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|N/A|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|"foo"|"foo"|
> |spark.E|"foo"|N/A|N/A|
> |spark.F|"foo"|N/A|N/A|
> Expected implementation
> ||Property name||Value in defaults file||Value in user-specified file||Final value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|"foo"|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|"foo"|"foo"|
> |spark.E|"foo"|"foo"|"foo"|
> |spark.F|"foo"|"foo"|"foo"|
> I can offer a patch to fix it if you think it makes sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21023) Ignoring the default properties file is not a good choice from the perspective of the system

2017-06-10 Thread Lantao Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045452#comment-16045452
 ] 

Lantao Jin edited comment on SPARK-21023 at 6/10/17 8:16 AM:
-

If the current behavior should be kept, I can add an environment variable, e.g. 
{{SPARK_CONF_REPLACE_ALLOWED}}, with default value "true", and put it into the 
{{childEnv}} map in SparkLauncher.class.
{code}
  public SparkLauncher setConfReplaceBehavior(String allowed) {
    checkNotNull(allowed, "allowed");
    builder.childEnv.put(SPARK_CONF_REPLACE_ALLOWED, allowed);
    return this;
  }
{code}
Then we can export SPARK_CONF_REPLACE_ALLOWED=false in {{spark-env.sh}} to fix 
this case while keeping the current behavior by default. Generally, {{spark-env.sh}} 
is deployed by the infra team and protected by the Linux file permission mechanism.

Of course, a user can export it to any value before submitting, but then the user 
definitely knows what they want instead of getting the current unexpected result.


was (Author: cltlfcjin):
If the current behavior should be kept, I can add an environment variable, e.g. 
{{SPARK_CONF_REPLACE_ALLOWED}}, with default value "true", and put it into the 
{{childEnv}} map in AbstractCommandBuilder.class.
{code}
  static final String SPARK_CONF_REPLACE_ALLOWED = "SPARK_CONF_REPLACE_ALLOWED";
{code}
Then we can export SPARK_CONF_REPLACE_ALLOWED=false in {{spark-env.sh}} to fix 
this case while keeping the current behavior by default. Generally, {{spark-env.sh}} 
is deployed by the infra team and protected by the Linux file permission mechanism.

Of course, a user can export it to any value before submitting, but then the user 
definitely knows what they want instead of getting the current unexpected result.

> Ignoring the default properties file is not a good choice from the 
> perspective of the system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be skipped even 
> when the submit arg {{--properties-file}} is set. The reasons are easy to see:
> * The infrastructure team needs to continually update {{spark-defaults.conf}} 
> when they want to set cluster-wide defaults for tuning purposes.
> * Application developers only want to override the parameters they care about, 
> not others they may not even know about (set by the infrastructure team).
> * Most application developers use {{\-\-properties-file}} to avoid setting 
> dozens of {{--conf k=v}} options. But if {{spark-defaults.conf}} is ignored, 
> the resulting behaviour is unexpected.
> For example:
> Current implementation
> ||Property name||Value in defaults file||Value in user-specified file||Final value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|N/A|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|"foo"|"foo"|
> |spark.E|"foo"|N/A|N/A|
> |spark.F|"foo"|N/A|N/A|
> Expected implementation
> ||Property name||Value in defaults file||Value in user-specified file||Final value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|"foo"|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|"foo"|"foo"|
> |spark.E|"foo"|"foo"|"foo"|
> |spark.F|"foo"|"foo"|"foo"|
> I can offer a patch to fix it if you think it makes sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21023) Ignoring the default properties file is not a good choice from the perspective of the system

2017-06-10 Thread Lantao Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045452#comment-16045452
 ] 

Lantao Jin commented on SPARK-21023:


If the current behavior should be kept, I can add an environment variable, e.g. 
{{SPARK_CONF_REPLACE_ALLOWED}}, with default value "true", and put it into the 
{{childEnv}} map in AbstractCommandBuilder.class.
{code}
  static final String SPARK_CONF_REPLACE_ALLOWED = "SPARK_CONF_REPLACE_ALLOWED";
{code}
Then we can export SPARK_CONF_REPLACE_ALLOWED=false in {{spark-env.sh}} to fix 
this case while keeping the current behavior by default. Generally, {{spark-env.sh}} 
is deployed by the infra team and protected by the Linux file permission mechanism.

Of course, a user can export it to any value before submitting, but then the user 
definitely knows what they want instead of getting the current unexpected result.

> Ignoring the default properties file is not a good choice from the 
> perspective of the system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be skipped even 
> when the submit arg {{--properties-file}} is set. The reasons are easy to see:
> * The infrastructure team needs to continually update {{spark-defaults.conf}} 
> when they want to set cluster-wide defaults for tuning purposes.
> * Application developers only want to override the parameters they care about, 
> not others they may not even know about (set by the infrastructure team).
> * Most application developers use {{\-\-properties-file}} to avoid setting 
> dozens of {{--conf k=v}} options. But if {{spark-defaults.conf}} is ignored, 
> the resulting behaviour is unexpected.
> For example:
> Current implementation
> ||Property name||Value in defaults file||Value in user-specified file||Final value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|N/A|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|"foo"|"foo"|
> |spark.E|"foo"|N/A|N/A|
> |spark.F|"foo"|N/A|N/A|
> Expected implementation
> ||Property name||Value in defaults file||Value in user-specified file||Final value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|"foo"|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|"foo"|"foo"|
> |spark.E|"foo"|"foo"|"foo"|
> |spark.F|"foo"|"foo"|"foo"|
> I can offer a patch to fix it if you think it makes sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21023) Ignoring the default properties file is not a good choice from the perspective of the system

2017-06-10 Thread Lantao Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045441#comment-16045441
 ] 

Lantao Jin commented on SPARK-21023:


I have modified the description, adding two tables to illustrate why I consider it 
a bug. Escalating to the dev mailing list to discuss.

> Ignoring the default properties file is not a good choice from the 
> perspective of the system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be skipped even 
> when the submit arg {{--properties-file}} is set. The reasons are easy to see:
> * The infrastructure team needs to continually update {{spark-defaults.conf}} 
> when they want to set cluster-wide defaults for tuning purposes.
> * Application developers only want to override the parameters they care about, 
> not others they may not even know about (set by the infrastructure team).
> * Most application developers use {{\-\-properties-file}} to avoid setting 
> dozens of {{--conf k=v}} options. But if {{spark-defaults.conf}} is ignored, 
> the resulting behaviour is unexpected.
> For example:
> Current implementation
> ||Property name||Value in defaults file||Value in user-specified file||Final value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|N/A|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|"foo"|"foo"|
> |spark.E|"foo"|N/A|N/A|
> |spark.F|"foo"|N/A|N/A|
> Expected implementation
> ||Property name||Value in defaults file||Value in user-specified file||Final value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|"foo"|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|"foo"|"foo"|
> |spark.E|"foo"|"foo"|"foo"|
> |spark.F|"foo"|"foo"|"foo"|
> I can offer a patch to fix it if you think it makes sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21023) Ignoring the default properties file is not a good choice from the perspective of the system

2017-06-10 Thread Lantao Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-21023:
---
Description: 
The default properties file {{spark-defaults.conf}} shouldn't be skipped even when 
the submit arg {{--properties-file}} is set. The reasons are easy to see:
* The infrastructure team needs to continually update {{spark-defaults.conf}} when 
they want to set cluster-wide defaults for tuning purposes.
* Application developers only want to override the parameters they care about, not 
others they may not even know about (set by the infrastructure team).
* Most application developers use {{\-\-properties-file}} to avoid setting dozens 
of {{--conf k=v}} options. But if {{spark-defaults.conf}} is ignored, the resulting 
behaviour is unexpected.

For example:
Current implementation
||Property name||Value in defaults file||Value in user-specified file||Final value||
|spark.A|"foo"|"bar"|"bar"|
|spark.B|"foo"|N/A|N/A|
|spark.C|N/A|"bar"|"bar"|
|spark.D|"foo"|"foo"|"foo"|
|spark.E|"foo"|N/A|N/A|
|spark.F|"foo"|N/A|N/A|

Expected implementation
||Property name||Value in defaults file||Value in user-specified file||Final value||
|spark.A|"foo"|"bar"|"bar"|
|spark.B|"foo"|N/A|"foo"|
|spark.C|N/A|"bar"|"bar"|
|spark.D|"foo"|"foo"|"foo"|
|spark.E|"foo"|"foo"|"foo"|
|spark.F|"foo"|"foo"|"foo"|

I can offer a patch to fix it if you think it makes sense.
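
Read as pseudo-code, the expected behaviour in the second table is simply a per-key override 
of the loaded defaults; a minimal sketch follows (names are illustrative, not Spark's internals):

{code:title=expected_merge_sketch.py|borderStyle=solid}
# Sketch of the expected merge: spark-defaults.conf is always loaded, and the
# user's --properties-file only overrides the keys it actually sets.
defaults = {"spark.A": "foo", "spark.B": "foo", "spark.D": "foo",
            "spark.E": "foo", "spark.F": "foo"}
user_specified = {"spark.A": "bar", "spark.C": "bar", "spark.D": "foo"}

effective = dict(defaults)
effective.update(user_specified)
# effective now matches the expected table: spark.A=bar, spark.B=foo,
# spark.C=bar, spark.D=foo, spark.E=foo, spark.F=foo
print(effective)
{code}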

  was:
The default properties file {{spark-defaults.conf}} shouldn't be skipped even when 
the submit arg {{--properties-file}} is set. The reasons are easy to see:
* The infrastructure team needs to continually update {{spark-defaults.conf}} when 
they want to set cluster-wide defaults for tuning purposes.
* Application developers only want to override the parameters they care about, not 
others they may not even know about (set by the infrastructure team).
* Most application developers use {{\-\-properties-file}} to avoid setting dozens 
of {{--conf k=v}} options. But if {{spark-defaults.conf}} is ignored, the resulting 
behaviour is unexpected.

All this is caused by the code below:
{code}
  private Properties loadPropertiesFile() throws IOException {
    Properties props = new Properties();
    File propsFile;
    if (propertiesFile != null) {
      // the default conf file is not loaded when the app developer passes
      // --properties-file as a submit arg
      propsFile = new File(propertiesFile);
      checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", propertiesFile);
    } else {
      propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE);
    }

    // ...

    return props;
  }
{code}

I can offer a patch to fix it if you think it makes sense.


> Ignoring the default properties file is not a good choice from the 
> perspective of the system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be skipped even 
> when the submit arg {{--properties-file}} is set. The reasons are easy to see:
> * The infrastructure team needs to continually update {{spark-defaults.conf}} 
> when they want to set cluster-wide defaults for tuning purposes.
> * Application developers only want to override the parameters they care about, 
> not others they may not even know about (set by the infrastructure team).
> * Most application developers use {{\-\-properties-file}} to avoid setting 
> dozens of {{--conf k=v}} options. But if {{spark-defaults.conf}} is ignored, 
> the resulting behaviour is unexpected.
> For example:
> Current implementation
> ||Property name||Value in defaults file||Value in user-specified file||Final value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|N/A|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|"foo"|"foo"|
> |spark.E|"foo"|N/A|N/A|
> |spark.F|"foo"|N/A|N/A|
> Expected implementation
> ||Property name||Value in defaults file||Value in user-specified file||Final value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|"foo"|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|"foo"|"foo"|
> |spark.E|"foo"|"foo"|"foo"|
> |spark.F|"foo"|"foo"|"foo"|
> I can offer a patch to fix it if you think it makes sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org