[jira] [Created] (SPARK-21053) Number overflow on agg function of Dataframe
DUC LIEM NGUYEN created SPARK-21053:
-----------------------------------

             Summary: Number overflow on agg function of Dataframe
                 Key: SPARK-21053
                 URL: https://issues.apache.org/jira/browse/SPARK-21053
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.1.0
         Environment: Databricks Community version
            Reporter: DUC LIEM NGUYEN

Using the average aggregation function on a large data set returns NaN instead of the desired numerical value, even though the values range between 0 and 1.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
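A NaN from averaging values in [0, 1] typically points at IEEE-754 double overflow somewhere in the aggregation: once a partial sum overflows to infinity, any subtraction of infinities (as can happen when merging or rescaling partial aggregates) produces NaN. A minimal illustration of that failure mode in plain Python, not Spark's actual aggregation code:

```python
import math

big = 1e308              # near the double limit
s = big + big            # overflows to +inf
nan = s - s              # inf - inf is NaN by IEEE-754 rules

print(math.isinf(s))     # True
print(math.isnan(nan))   # True
```

This is why a result can silently degrade from a huge-but-finite value to NaN rather than raising an error.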
[jira] [Commented] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER
[ https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045783#comment-16045783 ]

Apache Spark commented on SPARK-20427:
--------------------------------------

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/18266

> Issue with Spark interpreting Oracle datatype NUMBER
> ----------------------------------------------------
>
> Key: SPARK-20427
> URL: https://issues.apache.org/jira/browse/SPARK-20427
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Alexander Andrushenko
>
> Oracle has a data type NUMBER. When defining a field of type NUMBER in a table, the field has two components, precision and scale.
> For example, NUMBER(p,s) has precision p and scale s.
> Precision can range from 1 to 38.
> Scale can range from -84 to 127.
> When reading such a field, Spark can create numbers with precision exceeding 38. In our case it created fields with precision 44,
> calculated as the sum of the precision (in our case 34 digits) and the scale (10):
> "...java.lang.IllegalArgumentException: requirement failed: Decimal precision 44 exceeds max precision 38...".
> As a result, a data frame read from a table in one schema could not be inserted into the identical table in another schema.
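The reported precision 44 comes from simple arithmetic on the Oracle metadata. A sketch of that calculation and the limit check (an illustration only; the helper name is made up and this is not Spark's JDBC dialect code):

```python
MAX_PRECISION = 38  # Spark SQL's DecimalType limit

def effective_precision(digits: int, scale: int) -> int:
    # A value with `digits` significant digits and a positive scale needs
    # digits + scale total digits once the fractional part is materialized.
    return digits + scale

p = effective_precision(34, 10)
print(p)                   # 44
print(p > MAX_PRECISION)   # True -> "Decimal precision 44 exceeds max precision 38"
```

Any mapping from Oracle NUMBER(p,s) must cap the derived precision at 38 (or widen to a different type) to avoid this failure.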
[jira] [Assigned] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER
[ https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20427:
------------------------------------

    Assignee: Apache Spark

> Issue with Spark interpreting Oracle datatype NUMBER
> Key: SPARK-20427
> URL: https://issues.apache.org/jira/browse/SPARK-20427
[jira] [Assigned] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER
[ https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-20427:
------------------------------------

    Assignee: (was: Apache Spark)

> Issue with Spark interpreting Oracle datatype NUMBER
> Key: SPARK-20427
> URL: https://issues.apache.org/jira/browse/SPARK-20427
[jira] [Commented] (SPARK-21043) Add unionByName API to Dataset
[ https://issues.apache.org/jira/browse/SPARK-21043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045773#comment-16045773 ]

Takeshi Yamamuro commented on SPARK-21043:
------------------------------------------

Thank you for pinging me! Yeah, I'll try.

> Add unionByName API to Dataset
> ------------------------------
>
> Key: SPARK-21043
> URL: https://issues.apache.org/jira/browse/SPARK-21043
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Reynold Xin
>
> It would be useful to add unionByName, which resolves columns by name, in addition to the existing union (which resolves by position).
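The proposed behavior can be sketched outside Spark: a by-name union reorders the second relation's columns to match the first's schema before concatenating rows. This is plain Python over lists, with an illustrative helper name, not the eventual Dataset API:

```python
def union_by_name(rows_a, cols_a, rows_b, cols_b):
    # Resolve columns of b by name against a's schema, not by position.
    if set(cols_a) != set(cols_b):
        raise ValueError("schemas must contain the same column names")
    index_b = [cols_b.index(c) for c in cols_a]
    reordered = [[row[i] for i in index_b] for row in rows_b]
    return rows_a + reordered

a = [[1, "x"]]   # columns: id, name
b = [["y", 2]]   # columns: name, id  (positions swapped)
print(union_by_name(a, ["id", "name"], b, ["name", "id"]))
# [[1, 'x'], [2, 'y']]
```

A positional union of the same inputs would silently produce the garbage row `['y', 2]` under the `(id, name)` schema, which is exactly what unionByName is meant to prevent.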
[jira] [Commented] (SPARK-21052) Add hash map metrics to join
[ https://issues.apache.org/jira/browse/SPARK-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045769#comment-16045769 ]

Liang-Chi Hsieh commented on SPARK-21052:
-----------------------------------------

I'll submit a PR for this soon.

> Add hash map metrics to join
> ----------------------------
>
> Key: SPARK-21052
> URL: https://issues.apache.org/jira/browse/SPARK-21052
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Liang-Chi Hsieh
>
> We should add avg hash map probe metric to join operator and report it on UI.
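The metric proposed here, average probes per lookup in an open-addressing hash map, can be sketched self-containedly (Spark's internal hash maps track this differently; the class below is purely illustrative):

```python
class ProbeCountingMap:
    """Linear-probing hash map that records how many probes each lookup takes."""
    def __init__(self, capacity=8):
        self.slots = [None] * capacity
        self.num_probes = 0
        self.num_lookups = 0

    def _find_slot(self, key):
        i = hash(key) % len(self.slots)
        self.num_lookups += 1
        while True:
            self.num_probes += 1
            if self.slots[i] is None or self.slots[i][0] == key:
                return i
            i = (i + 1) % len(self.slots)  # linear probing on collision

    def put(self, key, value):
        self.slots[self._find_slot(key)] = (key, value)

    def avg_probes(self):
        return self.num_probes / max(self.num_lookups, 1)

m = ProbeCountingMap()
for k in range(6):
    m.put(k * 8, k)   # all keys hash to slot 0, forcing collisions
print(m.avg_probes())  # 3.5: probes 1+2+3+4+5+6 over 6 lookups
```

A high average probe count on the join (or aggregate) operator would signal key skew or an undersized map, which is what surfacing it on the UI makes visible.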
[jira] [Assigned] (SPARK-21051) Add hash map metrics to aggregate
[ https://issues.apache.org/jira/browse/SPARK-21051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-21051:
------------------------------------

    Assignee: Apache Spark

> Add hash map metrics to aggregate
> ---------------------------------
>
> Key: SPARK-21051
> URL: https://issues.apache.org/jira/browse/SPARK-21051
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Liang-Chi Hsieh
> Assignee: Apache Spark
>
> We should add avg hash map probe metric to aggregate operator and report it on UI.
[jira] [Updated] (SPARK-21052) Add hash map metrics to join
[ https://issues.apache.org/jira/browse/SPARK-21052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liang-Chi Hsieh updated SPARK-21052:
------------------------------------

    Description: We should add avg hash map probe metric to join operator and report it on UI.

> Add hash map metrics to join
> Key: SPARK-21052
> URL: https://issues.apache.org/jira/browse/SPARK-21052
[jira] [Assigned] (SPARK-21051) Add hash map metrics to aggregate
[ https://issues.apache.org/jira/browse/SPARK-21051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-21051:
------------------------------------

    Assignee: (was: Apache Spark)

> Add hash map metrics to aggregate
> Key: SPARK-21051
> URL: https://issues.apache.org/jira/browse/SPARK-21051
[jira] [Commented] (SPARK-21051) Add hash map metrics to aggregate
[ https://issues.apache.org/jira/browse/SPARK-21051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045768#comment-16045768 ]

Apache Spark commented on SPARK-21051:
--------------------------------------

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/18258

> Add hash map metrics to aggregate
> Key: SPARK-21051
> URL: https://issues.apache.org/jira/browse/SPARK-21051
[jira] [Updated] (SPARK-21051) Add hash map metrics to aggregate
[ https://issues.apache.org/jira/browse/SPARK-21051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liang-Chi Hsieh updated SPARK-21051:
------------------------------------

    Description: We should add avg hash map probe metric to aggregate operator and report it on UI.

> Add hash map metrics to aggregate
> Key: SPARK-21051
> URL: https://issues.apache.org/jira/browse/SPARK-21051
[jira] [Created] (SPARK-21052) Add hash map metrics to join
Liang-Chi Hsieh created SPARK-21052:
-----------------------------------

             Summary: Add hash map metrics to join
                 Key: SPARK-21052
                 URL: https://issues.apache.org/jira/browse/SPARK-21052
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Liang-Chi Hsieh
[jira] [Created] (SPARK-21051) Add hash map metrics to aggregate
Liang-Chi Hsieh created SPARK-21051:
-----------------------------------

             Summary: Add hash map metrics to aggregate
                 Key: SPARK-21051
                 URL: https://issues.apache.org/jira/browse/SPARK-21051
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Liang-Chi Hsieh
[jira] [Assigned] (SPARK-21050) ml word2vec write has overflow issue in calculating numPartitions
[ https://issues.apache.org/jira/browse/SPARK-21050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-21050:
------------------------------------

    Assignee: Apache Spark (was: Joseph K. Bradley)

> ml word2vec write has overflow issue in calculating numPartitions
> -----------------------------------------------------------------
>
> Key: SPARK-21050
> URL: https://issues.apache.org/jira/browse/SPARK-21050
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.2.0
> Reporter: Joseph K. Bradley
> Assignee: Apache Spark
>
> The method calculateNumberOfPartitions() uses Int, not Long (unlike the MLlib version), so it is very easy to overflow when calculating the number of partitions for ML persistence.
[jira] [Assigned] (SPARK-21050) ml word2vec write has overflow issue in calculating numPartitions
[ https://issues.apache.org/jira/browse/SPARK-21050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-21050:
------------------------------------

    Assignee: Joseph K. Bradley (was: Apache Spark)

> ml word2vec write has overflow issue in calculating numPartitions
> Key: SPARK-21050
> URL: https://issues.apache.org/jira/browse/SPARK-21050
[jira] [Commented] (SPARK-21050) ml word2vec write has overflow issue in calculating numPartitions
[ https://issues.apache.org/jira/browse/SPARK-21050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045763#comment-16045763 ]

Apache Spark commented on SPARK-21050:
--------------------------------------

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/18265

> ml word2vec write has overflow issue in calculating numPartitions
> Key: SPARK-21050
> URL: https://issues.apache.org/jira/browse/SPARK-21050
[jira] [Created] (SPARK-21050) ml word2vec write has overflow issue in calculating numPartitions
Joseph K. Bradley created SPARK-21050:
-------------------------------------

             Summary: ml word2vec write has overflow issue in calculating numPartitions
                 Key: SPARK-21050
                 URL: https://issues.apache.org/jira/browse/SPARK-21050
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.2.0
            Reporter: Joseph K. Bradley
            Assignee: Joseph K. Bradley

The method calculateNumberOfPartitions() uses Int, not Long (unlike the MLlib version), so it is very easy to overflow when calculating the number of partitions for ML persistence.
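The overflow mode described, 32-bit Int arithmetic where the MLlib version uses Long, can be illustrated by emulating Java's int wraparound in Python. The size formula below is a simplified stand-in for calculateNumberOfPartitions(), not its actual code, and the numbers are made up:

```python
import ctypes

def java_int(x: int) -> int:
    """Wrap a Python int the way Java's 32-bit int would."""
    return ctypes.c_int32(x).value

vocab_size = 10_000_000
vector_size = 300
# Approximate bytes needed: one 4-byte float per vector component per word.
bytes_needed_long = vocab_size * vector_size * 4            # fine in 64-bit
bytes_needed_int = java_int(java_int(vocab_size * vector_size) * 4)

print(bytes_needed_long)  # 12000000000
print(bytes_needed_int)   # negative: the Int computation has overflowed
```

A negative (or wrapped-small) byte count then yields a nonsensical partition count, which is why the fix is to carry the intermediate product in a Long.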
[jira] [Commented] (SPARK-20877) Shorten test sets to run on CRAN
[ https://issues.apache.org/jira/browse/SPARK-20877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045732#comment-16045732 ]

Apache Spark commented on SPARK-20877:
--------------------------------------

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/18264

> Shorten test sets to run on CRAN
> --------------------------------
>
> Key: SPARK-20877
> URL: https://issues.apache.org/jira/browse/SPARK-20877
> Project: Spark
> Issue Type: Sub-task
> Components: SparkR
> Affects Versions: 2.2.0
> Reporter: Felix Cheung
> Assignee: Felix Cheung
> Fix For: 2.2.0
[jira] [Updated] (SPARK-20877) Shorten test sets to run on CRAN
[ https://issues.apache.org/jira/browse/SPARK-20877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Cheung updated SPARK-20877:
---------------------------------

    Summary: Shorten test sets to run on CRAN (was: Investigate if tests will time out on CRAN)

> Shorten test sets to run on CRAN
> Key: SPARK-20877
> URL: https://issues.apache.org/jira/browse/SPARK-20877
[jira] [Closed] (SPARK-21044) Add `RemoveInvalidRange` optimizer
[ https://issues.apache.org/jira/browse/SPARK-21044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun closed SPARK-21044.
---------------------------------

    Resolution: Invalid

> Add `RemoveInvalidRange` optimizer
> ----------------------------------
>
> Key: SPARK-21044
> URL: https://issues.apache.org/jira/browse/SPARK-21044
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Dongjoon Hyun
>
> This issue aims to add an optimizer rule that removes an invalid `Range` operator up front. There are two cases of invalidity:
> 1. The `start` and `end` values are equal.
> 2. The sign of `step` does not match the direction from `start` to `end`. For this case, SPARK-21041 is reported as a bug, too.
> *BEFORE*
> {code}
> scala> spark.range(0,10,-1).explain
> == Physical Plan ==
> *Range (0, 10, step=-1, splits=8)
> scala> spark.range(0,0,-1).explain
> == Physical Plan ==
> *Range (0, 0, step=-1, splits=8)
> scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect
> res2: Array[Long] = Array(9223372036854775804, 9223372036854775805, 9223372036854775806)
> {code}
> *AFTER*
> {code}
> scala> spark.range(0,10,-1).explain
> == Physical Plan ==
> LocalTableScan <empty>, [id#0L]
> scala> spark.range(0,0,-1).explain
> == Physical Plan ==
> LocalTableScan <empty>, [id#4L]
> scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 1).collect
> res2: Array[Long] = Array()
> {code}
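The two invalidity conditions above reduce to a small emptiness predicate (a sketch of the check the proposed rule would make, not actual Catalyst code; the function name is made up):

```python
def range_is_empty(start: int, end: int, step: int) -> bool:
    # Case 1: start == end always yields no rows.
    # Case 2: step moves away from end, so no value is ever produced.
    if step == 0:
        raise ValueError("step must be non-zero")
    if start == end:
        return True
    return (step > 0) != (end > start)

print(range_is_empty(0, 10, -1))  # True  -> replaceable by an empty scan
print(range_is_empty(0, 0, -1))   # True
print(range_is_empty(0, 10, 1))   # False -> a real Range
```

When the predicate holds, the optimizer can substitute an empty LocalTableScan with the same output attribute, which is exactly the BEFORE/AFTER difference shown in the plans.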
[jira] [Updated] (SPARK-17642) Support DESC FORMATTED TABLE COLUMN command to show column-level statistics
[ https://issues.apache.org/jira/browse/SPARK-17642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhenhua Wang updated SPARK-17642:
---------------------------------

    Summary: Support DESC FORMATTED TABLE COLUMN command to show column-level statistics (was: support DESC FORMATTED TABLE COLUMN command to show column-level statistics)

> Support DESC FORMATTED TABLE COLUMN command to show column-level statistics
> ---------------------------------------------------------------------------
>
> Key: SPARK-17642
> URL: https://issues.apache.org/jira/browse/SPARK-17642
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Zhenhua Wang
>
> Support the DESC (EXTENDED | FORMATTED)? TABLE COLUMN command.
> Support DESC FORMATTED TABLE COLUMN to show column-level statistics.
> We should resolve this JIRA after column-level statistics are supported.
[jira] [Assigned] (SPARK-21039) Use treeAggregate instead of aggregate in DataFrame.stat.bloomFilter
[ https://issues.apache.org/jira/browse/SPARK-21039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-21039:
------------------------------------

    Assignee: (was: Apache Spark)

> Use treeAggregate instead of aggregate in DataFrame.stat.bloomFilter
> --------------------------------------------------------------------
>
> Key: SPARK-21039
> URL: https://issues.apache.org/jira/browse/SPARK-21039
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.1.1
> Reporter: Lovasoa
>
> Currently, DataFrame.stat.bloomFilter uses RDD.aggregate, which means that the bloom filters received for each partition of data are merged in the driver. The cost of this operation can be very high if the bloom filters are large. It would be nice if it used RDD.treeAggregate instead, in order to parallelize the operation of merging the bloom filters.
[jira] [Assigned] (SPARK-21039) Use treeAggregate instead of aggregate in DataFrame.stat.bloomFilter
[ https://issues.apache.org/jira/browse/SPARK-21039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-21039:
------------------------------------

    Assignee: Apache Spark

> Use treeAggregate instead of aggregate in DataFrame.stat.bloomFilter
> Key: SPARK-21039
> URL: https://issues.apache.org/jira/browse/SPARK-21039
[jira] [Commented] (SPARK-21039) Use treeAggregate instead of aggregate in DataFrame.stat.bloomFilter
[ https://issues.apache.org/jira/browse/SPARK-21039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045650#comment-16045650 ]

Apache Spark commented on SPARK-21039:
--------------------------------------

User 'rishabhbhardwaj' has created a pull request for this issue:
https://github.com/apache/spark/pull/18263

> Use treeAggregate instead of aggregate in DataFrame.stat.bloomFilter
> Key: SPARK-21039
> URL: https://issues.apache.org/jira/browse/SPARK-21039
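The difference between the two aggregation shapes can be sketched in plain Python: a flat aggregate folds every partition's partial result into the driver sequentially, while a tree aggregate merges partials pairwise in rounds, so the driver only performs the final merge. Merging real bloom filters is a bitwise OR of their bit arrays; here each partial is a set standing in for a filter:

```python
from functools import reduce

partials = [{i} for i in range(8)]   # one "bloom filter" per partition

# Flat aggregate: the driver folds in all 8 partials one after another.
flat = reduce(lambda a, b: a | b, partials, set())

# Tree aggregate: merge pairwise per round; intermediate rounds can run
# on executors, so no single node merges all 8 partials in sequence.
level = partials
while len(level) > 1:
    level = [level[i] | level[i + 1] if i + 1 < len(level) else level[i]
             for i in range(0, len(level), 2)]
tree = level[0]

print(flat == tree == set(range(8)))  # True: same result, different merge shape
```

With N partitions and filters of B bytes, the driver receives N*B bytes under the flat scheme but only the final merged filters under the tree scheme, which is the cost this ticket targets.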
[jira] [Commented] (SPARK-21043) Add unionByName API to Dataset
[ https://issues.apache.org/jira/browse/SPARK-21043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045630#comment-16045630 ]

Xiao Li commented on SPARK-21043:
---------------------------------

[~maropu] Do you want to give it a try?

> Add unionByName API to Dataset
> Key: SPARK-21043
> URL: https://issues.apache.org/jira/browse/SPARK-21043
[jira] [Commented] (SPARK-21045) Spark executor blocked instead of throwing exception because exception occur when python worker send exception info to Java Gateway
[ https://issues.apache.org/jira/browse/SPARK-21045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045604#comment-16045604 ]

Apache Spark commented on SPARK-21045:
--------------------------------------

User 'dataknocker' has created a pull request for this issue:
https://github.com/apache/spark/pull/18262

> Spark executor blocked instead of throwing exception because exception occur when python worker send exception info to Java Gateway
> -----------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-21045
> URL: https://issues.apache.org/jira/browse/SPARK-21045
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.0.1, 2.0.2, 2.1.1
> Reporter: Joshuawangzj
>
> My PySpark program always blocks in our production YARN cluster. A jstack dump shows:
> {code}
> "Executor task launch worker for task 0" #60 daemon prio=5 os_prio=31 tid=0x7fb2f44e3000 nid=0xa003 runnable [0x000123b4a000]
>    java.lang.Thread.State: RUNNABLE
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>         at java.net.SocketInputStream.read(SocketInputStream.java:170)
>         at java.net.SocketInputStream.read(SocketInputStream.java:141)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>         - locked <0x0007acab1c98> (a java.io.BufferedInputStream)
>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>         at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:190)
>         at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
>         at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
>         at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>         at org.apache.spark.scheduler.Task.run(Task.scala:99)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> It is blocked in a socket read. The log on the blocked executor shows the error:
> {code}
> Traceback (most recent call last):
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 178, in main
>     write_with_length(traceback.format_exc().encode("utf-8"), outfile)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 618: ordinal not in range(128)
> {code}
> Finally I found the problem:
> {code:title=worker.py|borderStyle=solid}
> # line 178 in Spark 2.1.1
> except Exception:
>     try:
>         write_int(SpecialLengths.PYTHON_EXCEPTION_THROWN, outfile)
>         write_with_length(traceback.format_exc().encode("utf-8"), outfile)
>     except IOError:
>         # JVM close the socket
>         pass
>     except Exception:
>         # Write the error to stderr if it happened while serializing
>         print("PySpark worker failed with exception:", file=sys.stderr)
>         print(traceback.format_exc(), file=sys.stderr)
> {code}
> When write_with_length(traceback.format_exc().encode("utf-8"), outfile) raises an exception such as UnicodeDecodeError, the Python worker cannot send the traceback, but the JVM side, having already received PYTHON_EXCEPTION_THROWN, expects to read the traceback length next. So it blocks.
> {code:title=PythonRDD.scala|borderStyle=solid}
> // line 190 in Spark 2.1.1
> case SpecialLengths.PYTHON_EXCEPTION_THROWN =>
>   // Signals that an exception has been thrown in python
>   val exLength = stream.readInt()  // It is possible to be blocked
> {code}
> {color:red}
> We can trigger the bug with a simple program:
> {color}
> {code:title=test.py|borderStyle=solid}
> spark = SparkSession.builder.master('local').getOrCreate()
> rdd = spark.sparkContext.parallelize(['中']).map(lambda x: x.encode("utf8"))
> rdd.collect()
> {code}
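The deadlock pattern described is general to length-prefixed protocols: if the sender writes a type marker and then fails before writing the payload length, the reader blocks forever on readInt(). One defensive shape is to fully serialize the payload before touching the stream, so a failed encode can never leave a half-written frame (a sketch of the principle, not the actual PySpark patch; the function name is made up):

```python
import io
import struct

def write_error_frame(out, marker: int, text: str) -> None:
    # Encode first; only write once we know encoding succeeded.
    # errors="backslashreplace" guarantees the encode itself cannot raise.
    payload = text.encode("utf-8", errors="backslashreplace")
    out.write(struct.pack(">i", marker))       # type marker
    out.write(struct.pack(">i", len(payload))) # length prefix
    out.write(payload)

buf = io.BytesIO()
write_error_frame(buf, -5, "Traceback: 中 in position 618")
data = buf.getvalue()
marker, length = struct.unpack(">ii", data[:8])
print(marker, length == len(data) - 8)  # -5 True
```

Because marker, length, and payload are written together only after serialization succeeds, the reader either sees a complete frame or no frame at all, never a marker with no length behind it.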
[jira] [Commented] (SPARK-21045) Spark executor blocked instead of throwing exception because exception occur when python worker send exception info to Java Gateway
[ https://issues.apache.org/jira/browse/SPARK-21045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045592#comment-16045592 ] Apache Spark commented on SPARK-21045: -- User 'dataknocker' has created a pull request for this issue: https://github.com/apache/spark/pull/18261
> Spark executor blocked instead of throwing exception because exception occur
> when python worker send exception info to Java Gateway
> ---
>
> Key: SPARK-21045
> URL: https://issues.apache.org/jira/browse/SPARK-21045
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.0.1, 2.0.2, 2.1.1
> Reporter: Joshuawangzj
>
> My PySpark program always blocks in our production YARN cluster. I ran jstack and found:
> {code}
> "Executor task launch worker for task 0" #60 daemon prio=5 os_prio=31 tid=0x7fb2f44e3000 nid=0xa003 runnable [0x000123b4a000]
>    java.lang.Thread.State: RUNNABLE
>     at java.net.SocketInputStream.socketRead0(Native Method)
>     at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>     at java.net.SocketInputStream.read(SocketInputStream.java:170)
>     at java.net.SocketInputStream.read(SocketInputStream.java:141)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>     - locked <0x0007acab1c98> (a java.io.BufferedInputStream)
>     at java.io.DataInputStream.readInt(DataInputStream.java:387)
>     at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:190)
>     at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
>     at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
>     at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>     at org.apache.spark.scheduler.Task.run(Task.scala:99)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> It is blocked in a socket read. I checked the log on the blocked executor and found this error:
> {code}
> Traceback (most recent call last):
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 178, in main
>     write_with_length(traceback.format_exc().encode("utf-8"), outfile)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 618: ordinal not in range(128)
> {code}
> Finally I found the problem:
> {code:title=worker.py|borderStyle=solid}
> # line 178 in Spark 2.1.1
> except Exception:
>     try:
>         write_int(SpecialLengths.PYTHON_EXCEPTION_THROWN, outfile)
>         write_with_length(traceback.format_exc().encode("utf-8"), outfile)
>     except IOError:
>         # JVM closed the socket
>         pass
>     except Exception:
>         # Write the error to stderr if it happened while serializing
>         print("PySpark worker failed with exception:", file=sys.stderr)
>         print(traceback.format_exc(), file=sys.stderr)
> {code}
> When write_with_length(traceback.format_exc().encode("utf-8"), outfile) raises an exception such as UnicodeDecodeError, the Python worker cannot send the traceback. But once PythonRDD has read PYTHON_EXCEPTION_THROWN, it expects to read the traceback length next, so it blocks:
> {code:title=PythonRDD.scala|borderStyle=solid}
> // line 190 in Spark 2.1.1
> case SpecialLengths.PYTHON_EXCEPTION_THROWN =>
>   // Signals that an exception has been thrown in python
>   val exLength = stream.readInt() // It is possible to block here
> {code}
> {color:red}
> We can trigger the bug with a simple program:
> {color}
> {code:title=test.py|borderStyle=solid}
> spark = SparkSession.builder.master('local').getOrCreate()
> rdd = spark.sparkContext.parallelize(['中']).map(lambda x: x.encode("utf8"))
> rdd.collect()
> {code}
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
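The failure mode above is that the marker byte goes out but the length-prefixed traceback never follows, leaving the JVM blocked on readInt(). A worker-side hardening would be to guarantee that *some* length-prefixed payload always follows the marker. The sketch below is a hypothetical illustration of that idea, not the actual Spark patch; the marker value and helper names mirror PySpark's framing protocol but are assumptions here.

```python
import struct
import traceback

PYTHON_EXCEPTION_THROWN = -2  # assumed marker value, for illustration only


def write_int(value, outfile):
    # Big-endian 4-byte int, matching java.io.DataInputStream.readInt.
    outfile.write(struct.pack("!i", value))


def write_with_length(payload, outfile):
    # Length-prefixed frame: 4-byte length, then the payload bytes.
    write_int(len(payload), outfile)
    outfile.write(payload)


def report_exception(outfile):
    """Send the current exception to the JVM, never leaving it half-framed."""
    write_int(PYTHON_EXCEPTION_THROWN, outfile)
    try:
        # On Python 2 this is where the bug bit: format_exc() returned bytes,
        # and .encode("utf-8") implicitly *decoded* it as ASCII first,
        # raising UnicodeDecodeError on non-ASCII tracebacks.
        payload = traceback.format_exc().encode("utf-8")
    except Exception:
        # Fall back to a safe ASCII message instead of writing nothing,
        # which would leave the JVM blocked waiting for the length.
        payload = b"PySpark worker failed; traceback was not serializable"
    write_with_length(payload, outfile)
```

With this shape, even if serializing the traceback fails, the JVM still receives a well-formed length-prefixed frame and can raise an exception instead of hanging.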
[jira] [Assigned] (SPARK-21045) Spark executor blocked instead of throwing exception because exception occur when python worker send exception info to Java Gateway
[ https://issues.apache.org/jira/browse/SPARK-21045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21045: Assignee: Apache Spark
> Spark executor blocked instead of throwing exception because exception occur
> when python worker send exception info to Java Gateway
> ---
>
> Key: SPARK-21045
> URL: https://issues.apache.org/jira/browse/SPARK-21045
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.0.1, 2.0.2, 2.1.1
> Reporter: Joshuawangzj
> Assignee: Apache Spark
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21045) Spark executor blocked instead of throwing exception because exception occur when python worker send exception info to Java Gateway
[ https://issues.apache.org/jira/browse/SPARK-21045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21045: Assignee: (was: Apache Spark)
> Spark executor blocked instead of throwing exception because exception occur
> when python worker send exception info to Java Gateway
> ---
>
> Key: SPARK-21045
> URL: https://issues.apache.org/jira/browse/SPARK-21045
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.0.1, 2.0.2, 2.1.1
> Reporter: Joshuawangzj
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21045) Spark executor blocked instead of throwing exception because exception occur when python worker send exception info to Java Gateway
[ https://issues.apache.org/jira/browse/SPARK-21045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joshuawangzj updated SPARK-21045: - Summary: Spark executor blocked instead of throwing exception because exception occur when python worker send exception info to Java Gateway (was: Spark executor is blocked instead of throwing exception because exception occur when python worker send exception trace stack info to Java Gateway)
> Spark executor blocked instead of throwing exception because exception occur
> when python worker send exception info to Java Gateway
> ---
>
> Key: SPARK-21045
> URL: https://issues.apache.org/jira/browse/SPARK-21045
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.0.1, 2.0.2, 2.1.1
> Reporter: Joshuawangzj
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
[jira] [Closed] (SPARK-20684) expose createGlobalTempView and dropGlobalTempView in SparkR
[ https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-20684. - Resolution: Later
> expose createGlobalTempView and dropGlobalTempView in SparkR
>
> Key: SPARK-20684
> URL: https://issues.apache.org/jira/browse/SPARK-20684
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 2.2.0
> Reporter: Hossein Falaki
>
> This is a useful API that is not exposed in SparkR. It will help with moving data between languages in a single Spark application.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21048) Add an option --merged-properties-file to distinguish the configuration loading behavior
[ https://issues.apache.org/jira/browse/SPARK-21048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045539#comment-16045539 ] Lantao Jin commented on SPARK-21048: Thank you for pointing that out about JIRA. A confusing name can be explained in the documentation, which is better than losing the default configuration when using the \-\-properties-file option. That is the key point I am trying to fix. Any ideas?
> Add an option --merged-properties-file to distinguish the configuration
> loading behavior
>
> Key: SPARK-21048
> URL: https://issues.apache.org/jira/browse/SPARK-21048
> Project: Spark
> Issue Type: Improvement
> Components: Spark Submit
> Affects Versions: 2.1.1
> Reporter: Lantao Jin
> Priority: Minor
>
> The problem description is the same as [SPARK-21023|https://issues.apache.org/jira/browse/SPARK-21023], but the goal differs from that ticket: the purpose is not to make sure the default properties file is always loaded, but to offer another option so users can choose what they want.
> {quote}
> {{\-\-properties-file}} user-specified properties file that replaces the default properties file (deprecated).
> {{\-\-replaced-properties-file}} new option equivalent to {{\-\-properties-file}}, with a clearer name.
> {{\-\-merged-properties-file}} user-specified properties file that is merged with the default properties file.
> {quote}
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21049) why do we need computeGramianMatrix when computing SVD
[ https://issues.apache.org/jira/browse/SPARK-21049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045536#comment-16045536 ] Vincent commented on SPARK-21049: [~srowen] Thanks, that's right. But we quite often find that the matrix is not skinny, and computing the Gramian matrix takes most of the time. In such cases, computing the SVD on the original matrix directly gives us at least a 5x speedup. So I wonder whether it would be possible to add an option here to let the user choose between the Gramian and the original matrix. After all, users know their data best. What do you think?
> why do we need computeGramianMatrix when computing SVD
>
> Key: SPARK-21049
> URL: https://issues.apache.org/jira/browse/SPARK-21049
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Affects Versions: 2.1.1
> Reporter: Vincent
>
> computeSVD computes the SVD of a matrix A by first computing A^T*A and then taking the SVD of that Gramian matrix. We found that the Gramian computation is the hot spot of the overall SVD computation, but, per my understanding, we could simply take the SVD of the original matrix. The singular vectors of the Gramian matrix are the same as the right singular vectors of the original matrix A, while its singular values are the squares of the original matrix's singular values. Why do we take the SVD of the Gramian matrix, then?
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21048) Add an option --merged-properties-file to distinguish the configuration loading behavior
[ https://issues.apache.org/jira/browse/SPARK-21048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045533#comment-16045533 ] Sean Owen commented on SPARK-21048: --- I think this is confusing relative to any value it adds, and don't think this should be done. (You shouldn't open new JIRAs for the same issue as it forks the discussion)
> Add an option --merged-properties-file to distinguish the configuration
> loading behavior
>
> Key: SPARK-21048
> URL: https://issues.apache.org/jira/browse/SPARK-21048
> Project: Spark
> Issue Type: Improvement
> Components: Spark Submit
> Affects Versions: 2.1.1
> Reporter: Lantao Jin
> Priority: Minor
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21049) why do we need computeGramianMatrix when computing SVD
[ https://issues.apache.org/jira/browse/SPARK-21049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21049. --- Resolution: Invalid Questions should go to the mailing list. Consider what "just" computing the SVD of the original matrix entails, when it's a huge distributed matrix. Assuming the matrix is huge but skinny, the Gramian is small and can be handled in-core.
> why do we need computeGramianMatrix when computing SVD
>
> Key: SPARK-21049
> URL: https://issues.apache.org/jira/browse/SPARK-21049
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Affects Versions: 2.1.1
> Reporter: Vincent
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
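Both sides of this exchange can be checked numerically outside Spark. The sketch below uses plain NumPy (nothing Spark-specific) to show that for a tall-and-skinny matrix the Gramian is tiny and in-core, and that its eigenvalues are the *squares* of the original matrix's singular values, with eigenvectors equal to its right singular vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 5))  # tall and skinny: 1000 rows, 5 columns

# The Gramian A^T A is only 5x5, trivially handled in-core even when A
# itself is a huge distributed matrix.
G = A.T @ A
eigvals = np.linalg.eigvalsh(G)  # eigenvalues of the Gramian, ascending

# Singular values of the original matrix, descending.
sigma = np.linalg.svd(A, compute_uv=False)

# Eigenvalues of A^T A equal the squared singular values of A.
assert np.allclose(eigvals[::-1], sigma ** 2)
```

This is why the skinny case favors the Gramian route; when the matrix is wide as well as tall, the Gramian itself becomes large and the trade-off the commenter describes kicks in.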
[jira] [Commented] (SPARK-21001) Staging folders from Hive table are not being cleared.
[ https://issues.apache.org/jira/browse/SPARK-21001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045531#comment-16045531 ] Hyukjin Kwon commented on SPARK-21001: -- Does this still exist in 2.1.0?
> Staging folders from Hive table are not being cleared.
>
> Key: SPARK-21001
> URL: https://issues.apache.org/jira/browse/SPARK-21001
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2
> Reporter: Ajay Cherukuri
>
> Staging folders created while loading data into a Hive table with a Spark job are not cleared.
> They remain in the Hive external table folders even after the Spark job completes.
> This is the same issue mentioned in this ticket: https://issues.apache.org/jira/browse/SPARK-18372
> That ticket says the issue was resolved in 1.6.4, but I found that it still exists in 2.0.2.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21049) why do we need computeGramianMatrix when computing SVD
Vincent created SPARK-21049: --- Summary: why do we need computeGramianMatrix when computing SVD Key: SPARK-21049 URL: https://issues.apache.org/jira/browse/SPARK-21049 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 2.1.1 Reporter: Vincent
computeSVD computes the SVD of a matrix A by first computing A^T*A and then taking the SVD of that Gramian matrix. We found that the Gramian computation is the hot spot of the overall SVD computation, but, per my understanding, we could simply take the SVD of the original matrix. The singular vectors of the Gramian matrix are the same as the right singular vectors of the original matrix A, while its singular values are the squares of the original matrix's singular values. Why do we take the SVD of the Gramian matrix, then?
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21048) Add an option --merged-properties-file to distinguish the configuration loading behavior
[ https://issues.apache.org/jira/browse/SPARK-21048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045502#comment-16045502 ] Lantao Jin commented on SPARK-21048: Will push a PR soon after a short discussion.
> Add an option --merged-properties-file to distinguish the configuration
> loading behavior
>
> Key: SPARK-21048
> URL: https://issues.apache.org/jira/browse/SPARK-21048
> Project: Spark
> Issue Type: Improvement
> Components: Spark Submit
> Affects Versions: 2.1.1
> Reporter: Lantao Jin
> Priority: Minor
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin closed SPARK-21023. -- Resolution: Not A Problem Closed as Not A Problem; for the new approach see https://issues.apache.org/jira/browse/SPARK-21048
> Ignore to load default properties file is not a good choice from the
> perspective of system
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
> Issue Type: Improvement
> Components: Spark Submit
> Affects Versions: 2.1.1
> Reporter: Lantao Jin
> Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignored even when the submit argument {{--properties-file}} is set. The reasons are easy to see:
> * The infrastructure team continually updates {{spark-defaults.conf}} when they want to set cluster-wide defaults for tuning purposes.
> * Application developers only want to override the parameters they care about, not others they may not even know about (set by the infrastructure team).
> * For most application developers, the point of {{\-\-properties-file}} is to avoid setting dozens of {{--conf k=v}} pairs. But if {{spark-defaults.conf}} is ignored, the behaviour ends up being unexpected.
> For example:
> Current implementation
> ||Property name||Value in default||Value in user-specified||Final value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|N/A|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|"foo"|"foo"|
> |spark.E|"foo"|N/A|N/A|
> |spark.F|"foo"|N/A|N/A|
> Expected implementation
> ||Property name||Value in default||Value in user-specified||Final value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|"foo"|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|N/A|"foo"|
> |spark.E|"foo"|N/A|"foo"|
> |spark.F|"foo"|N/A|"foo"|
> I can offer a patch to fix it if you think it makes sense.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
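The "expected" behaviour in the tables above is a key-by-key overlay: user-specified values win on conflicts, but defaults the user did not touch survive. A minimal sketch of that loading rule (not Spark's actual SparkSubmit code; the function name is illustrative):

```python
def load_effective_conf(defaults, user_specified):
    """Merge semantics from the table: user values win, untouched defaults survive."""
    effective = dict(defaults)        # start from spark-defaults.conf
    effective.update(user_specified)  # overlay the user-specified properties file
    return effective


defaults = {"spark.A": "foo", "spark.B": "foo", "spark.D": "foo"}
user = {"spark.A": "bar", "spark.C": "bar"}

conf = load_effective_conf(defaults, user)
# spark.A is overridden, spark.B and spark.D survive, spark.C is added.
```

The behaviour the ticket complains about is equivalent to returning `dict(user_specified)` alone, which silently drops every default the user file does not mention.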
[jira] [Created] (SPARK-21048) Add an option --merged-properties-file to distinguish the configuration loading behavior
Lantao Jin created SPARK-21048: -- Summary: Add an option --merged-properties-file to distinguish the configuration loading behavior Key: SPARK-21048 URL: https://issues.apache.org/jira/browse/SPARK-21048 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 2.1.1 Reporter: Lantao Jin Priority: Minor
The problem description is the same as [SPARK-21023|https://issues.apache.org/jira/browse/SPARK-21023], but the goal differs from that ticket: the purpose is not to make sure the default properties file is always loaded, but to offer another option so users can choose what they want.
{quote}
{{\-\-properties-file}} user-specified properties file that replaces the default properties file (deprecated).
{{\-\-replaced-properties-file}} new option equivalent to {{\-\-properties-file}}, with a clearer name.
{{\-\-merged-properties-file}} user-specified properties file that is merged with the default properties file.
{quote}
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21015) Check field name is not null and empty in GenericRowWithSchema
[ https://issues.apache.org/jira/browse/SPARK-21015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21015. -- Resolution: Invalid I am resolving this per https://github.com/apache/spark/pull/18236#issuecomment-307560317 Please reopen this if I misunderstood.
> Check field name is not null and empty in GenericRowWithSchema
>
> Key: SPARK-21015
> URL: https://issues.apache.org/jira/browse/SPARK-21015
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.1.1
> Reporter: darion yaphet
> Priority: Minor
>
> When we get a field index from a row with a schema, we should make sure the field name is neither null nor empty.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
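The validation the ticket requests amounts to a guard before the index lookup. A small Python illustration of the idea (hypothetical names; GenericRowWithSchema itself is Scala, and this is not Spark's code):

```python
def field_index(field_names, name):
    """Look up a field's position, rejecting null/empty names with a clear error."""
    # Guard first, so callers get a precise message instead of a
    # confusing lookup failure on None or "".
    if name is None or name == "":
        raise ValueError("field name must be neither null nor empty")
    try:
        return field_names.index(name)
    except ValueError:
        raise ValueError(
            "field %r does not exist in schema %r" % (name, field_names)
        )


schema = ["id", "value"]
```

The design point is only that the precondition check happens before the lookup, so the error distinguishes "you passed no name" from "this name is not in the schema".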
[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045495#comment-16045495 ] Lantao Jin commented on SPARK-21023: [~cloud_fan], I see what you mean. Maybe adding {{\-\-merged-properties-file}} as an option and explaining it in the documentation is good enough for this case. Don't spend effort making sure the default properties file is always loaded; just make sure Spark users know what they are doing. In the documentation we can explain the different options:
{quote}
{{\-\-properties-file}} user-specified properties file that replaces the default properties file.
{{\-\-merged-properties-file}} user-specified properties file that is merged with the default properties file.
{quote}
I think I should close this JIRA, as the original purpose (making sure the default properties file is loaded) is not an issue. I will file a new one to implement the new feature.
> Ignore to load default properties file is not a good choice from the
> perspective of system
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
> Issue Type: Improvement
> Components: Spark Submit
> Affects Versions: 2.1.1
> Reporter: Lantao Jin
> Priority: Minor
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-21023: --- Description: The default properties file {{spark-defaults.conf}} shouldn't be ignored even when the submit argument {{--properties-file}} is set. The reasons are easy to see: * The infrastructure team needs to continually update {{spark-defaults.conf}} whenever they want to set a cluster-wide default for tuning purposes. * Application developers only want to override the parameters they actually care about, not others they may not even know about (set by the infrastructure team). * For most application developers, the purpose of {{\-\-properties-file}} is to avoid passing dozens of {{--conf k=v}} arguments. But if {{spark-defaults.conf}} is ignored, the final behaviour becomes unexpected. For example: Current implementation ||Property name||Value in default||Value in user-specified||Final value|| |spark.A|"foo"|"bar"|"bar"| |spark.B|"foo"|N/A|N/A| |spark.C|N/A|"bar"|"bar"| |spark.D|"foo"|"foo"|"foo"| |spark.E|"foo"|N/A|N/A| |spark.F|"foo"|N/A|N/A| Expected (correct) implementation ||Property name||Value in default||Value in user-specified||Final value|| |spark.A|"foo"|"bar"|"bar"| |spark.B|"foo"|N/A|"foo"| |spark.C|N/A|"bar"|"bar"| |spark.D|"foo"|N/A|"foo"| |spark.E|"foo"|N/A|"foo"| |spark.F|"foo"|N/A|"foo"| I can offer a patch to fix it if you think it makes sense. was: The default properties file {{spark-defaults.conf}} shouldn't be ignore to load even though the submit arg {{--properties-file}} is set. The reasons are very easy to see: * Infrastructure team need continually update the {{spark-defaults.conf}} when they want set something as default for entire cluster as a tuning purpose. * Application developer only want to override the parameters they really want rather than others they even doesn't know (Set by infrastructure team). 
* The purpose of using {{\-\-properties-file}} from most of application developers is to avoid setting dozens of {{--conf k=v}}. But if {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. For example: Current implement ||Property name||Value in default||Value in user-special||Finally value|| |spark.A|"foo"|"bar"|"bar"| |spark.B|"foo"|N/A|N/A| |spark.C|N/A|"bar"|"bar"| |spark.D|"foo"|"foo"|"foo"| |spark.E|"foo"|N/A|N/A| |spark.F|"foo"|N/A|N/A| Expected right implement ||Property name||Value in default||Value in user-special||Finally value|| |spark.A|"foo"|"bar"|"bar"| |spark.B|"foo"|N/A|"foo"| |spark.C|N/A|"bar"|"bar"| |spark.D|"foo"|N/A|"foo"| |spark.E|"foo"|N/A|"foo"| |spark.F|"foo"|N/A|"foo"| I can offer a patch to fix it if you think it make sense. > Ignore to load default properties file is not a good choice from the > perspective of system > -- > > Key: SPARK-21023 > URL: https://issues.apache.org/jira/browse/SPARK-21023 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Lantao Jin >Priority: Minor > > The default properties file {{spark-defaults.conf}} shouldn't be ignore to > load even though the submit arg {{--properties-file}} is set. The reasons are > very easy to see: > * Infrastructure team need continually update the {{spark-defaults.conf}} > when they want set something as default for entire cluster as a tuning > purpose. > * Application developer only want to override the parameters they really want > rather than others they even doesn't know (Set by infrastructure team). > * The purpose of using {{\-\-properties-file}} from most of application > developers is to avoid setting dozens of {{--conf k=v}}. But if > {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. 
> For example: > Current implement > ||Property name||Value in default||Value in user-specified||Finally value|| > |spark.A|"foo"|"bar"|"bar"| > |spark.B|"foo"|N/A|N/A| > |spark.C|N/A|"bar"|"bar"| > |spark.D|"foo"|"foo"|"foo"| > |spark.E|"foo"|N/A|N/A| > |spark.F|"foo"|N/A|N/A| > Expected right implement > ||Property name||Value in default||Value in user-specified||Finally value|| > |spark.A|"foo"|"bar"|"bar"| > |spark.B|"foo"|N/A|"foo"| > |spark.C|N/A|"bar"|"bar"| > |spark.D|"foo"|N/A|"foo"| > |spark.E|"foo"|N/A|"foo"| > |spark.F|"foo"|N/A|"foo"| > I can offer a patch to fix it if you think it make sense. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045483#comment-16045483 ] Lantao Jin edited comment on SPARK-21023 at 6/10/17 10:15 AM: -- [~cloud_fan] Having both {{\-\-properties-file}} and {{\-\-extra-properties-file}} could confuse users. Actually, it already confuses me: what is {{\-\-extra-properties-file}} used for? [~vanzin]'s suggestion is not to change the existing behavior, and based on that suggestion I propose adding an environment variable {{SPARK_CONF_REPLACE_ALLOWED}}. was (Author: cltlfcjin): [~cloud_fan] {{--properties-file}} and {{--extra-properties-file}} both exist could confuse the user. Actually, it already confuse me. What is the {{--extra-properties-file}} use for? [~vanzin]'s suggestion is do not change existing behavior and based on this suggestion I propose to add an environment variable {{SPARK_CONF_REPLACE_ALLOWED}}. > Ignore to load default properties file is not a good choice from the > perspective of system > -- > > Key: SPARK-21023 > URL: https://issues.apache.org/jira/browse/SPARK-21023 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Lantao Jin >Priority: Minor > > The default properties file {{spark-defaults.conf}} shouldn't be ignore to > load even though the submit arg {{--properties-file}} is set. The reasons are > very easy to see: > * Infrastructure team need continually update the {{spark-defaults.conf}} > when they want set something as default for entire cluster as a tuning > purpose. > * Application developer only want to override the parameters they really want > rather than others they even doesn't know (Set by infrastructure team). > * The purpose of using {{\-\-properties-file}} from most of application > developers is to avoid setting dozens of {{--conf k=v}}. But if > {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. 
> For example: > Current implement > ||Property name||Value in default||Value in user-special||Finally value|| > |spark.A|"foo"|"bar"|"bar"| > |spark.B|"foo"|N/A|N/A| > |spark.C|N/A|"bar"|"bar"| > |spark.D|"foo"|"foo"|"foo"| > |spark.E|"foo"|N/A|N/A| > |spark.F|"foo"|N/A|N/A| > Expected right implement > ||Property name||Value in default||Value in user-special||Finally value|| > |spark.A|"foo"|"bar"|"bar"| > |spark.B|"foo"|N/A|"foo"| > |spark.C|N/A|"bar"|"bar"| > |spark.D|"foo"|N/A|"foo"| > |spark.E|"foo"|N/A|"foo"| > |spark.F|"foo"|N/A|"foo"| > I can offer a patch to fix it if you think it make sense. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045483#comment-16045483 ] Lantao Jin commented on SPARK-21023: [~cloud_fan] Having both {{--properties-file}} and {{--extra-properties-file}} could confuse users. Actually, it already confuses me: what is {{--extra-properties-file}} used for? [~vanzin]'s suggestion is not to change the existing behavior, and based on that suggestion I propose adding an environment variable {{SPARK_CONF_REPLACE_ALLOWED}}. > Ignore to load default properties file is not a good choice from the > perspective of system > -- > > Key: SPARK-21023 > URL: https://issues.apache.org/jira/browse/SPARK-21023 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Lantao Jin >Priority: Minor > > The default properties file {{spark-defaults.conf}} shouldn't be ignore to > load even though the submit arg {{--properties-file}} is set. The reasons are > very easy to see: > * Infrastructure team need continually update the {{spark-defaults.conf}} > when they want set something as default for entire cluster as a tuning > purpose. > * Application developer only want to override the parameters they really want > rather than others they even doesn't know (Set by infrastructure team). > * The purpose of using {{\-\-properties-file}} from most of application > developers is to avoid setting dozens of {{--conf k=v}}. But if > {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. 
> For example: > Current implement > ||Property name||Value in default||Value in user-special||Finally value|| > |spark.A|"foo"|"bar"|"bar"| > |spark.B|"foo"|N/A|N/A| > |spark.C|N/A|"bar"|"bar"| > |spark.D|"foo"|"foo"|"foo"| > |spark.E|"foo"|N/A|N/A| > |spark.F|"foo"|N/A|N/A| > Expected right implement > ||Property name||Value in default||Value in user-special||Finally value|| > |spark.A|"foo"|"bar"|"bar"| > |spark.B|"foo"|N/A|"foo"| > |spark.C|N/A|"bar"|"bar"| > |spark.D|"foo"|N/A|"foo"| > |spark.E|"foo"|N/A|"foo"| > |spark.F|"foo"|N/A|"foo"| > I can offer a patch to fix it if you think it make sense. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21047) Add test cases for nested array
[ https://issues.apache.org/jira/browse/SPARK-21047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-21047: - Summary: Add test cases for nested array (was: Add a test case for nested array) > Add test cases for nested array > --- > > Key: SPARK-21047 > URL: https://issues.apache.org/jira/browse/SPARK-21047 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki > > Current {{ColumnarBatchSuite}} has very simple test cases for array. This > JIRA will add test cases for nested array in {{ColumnVector}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21047) Add a test case for nested array
Kazuaki Ishizaki created SPARK-21047: Summary: Add a test case for nested array Key: SPARK-21047 URL: https://issues.apache.org/jira/browse/SPARK-21047 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Kazuaki Ishizaki Current {{ColumnarBatchSuite}} has very simple test cases for array. This JIRA will add test cases for nested array in {{ColumnVector}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-21023: --- Description: The default properties file {{spark-defaults.conf}} shouldn't be ignore to load even though the submit arg {{--properties-file}} is set. The reasons are very easy to see: * Infrastructure team need continually update the {{spark-defaults.conf}} when they want set something as default for entire cluster as a tuning purpose. * Application developer only want to override the parameters they really want rather than others they even doesn't know (Set by infrastructure team). * The purpose of using {{\-\-properties-file}} from most of application developers is to avoid setting dozens of {{--conf k=v}}. But if {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. For example: Current implement ||Property name||Value in default||Value in user-special||Finally value|| |spark.A|"foo"|"bar"|"bar"| |spark.B|"foo"|N/A|N/A| |spark.C|N/A|"bar"|"bar"| |spark.D|"foo"|"foo"|"foo"| |spark.E|"foo"|N/A|N/A| |spark.F|"foo"|N/A|N/A| Expected right implement ||Property name||Value in default||Value in user-special||Finally value|| |spark.A|"foo"|"bar"|"bar"| |spark.B|"foo"|N/A|"foo"| |spark.C|N/A|"bar"|"bar"| |spark.D|"foo"|N/A|"foo"| |spark.E|"foo"|N/A|"foo"| |spark.F|"foo"|N/A|"foo"| I can offer a patch to fix it if you think it make sense. was: The default properties file {{spark-defaults.conf}} shouldn't be ignore to load even though the submit arg {{--properties-file}} is set. The reasons are very easy to see: * Infrastructure team need continually update the {{spark-defaults.conf}} when they want set something as default for entire cluster as a tuning purpose. * Application developer only want to override the parameters they really want rather than others they even doesn't know (Set by infrastructure team). 
* The purpose of using {{\-\-properties-file}} from most of application developers is to avoid setting dozens of {{--conf k=v}}. But if {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. For example: Current implement ||Property name||Value in default||Value in user-special||Finally value|| |spark.A|"foo"|"bar"|"bar"| |spark.B|"foo"|N/A|N/A| |spark.C|N/A|"bar"|"bar"| |spark.D|"foo"|"foo"|"foo"| |spark.E|"foo"|N/A|N/A| |spark.F|"foo"|N/A|N/A| Expected right implement ||Property name||Value in default||Value in user-special||Finally value|| |spark.A|"foo"|"bar"|"bar"| |spark.B|"foo"|N/A|"foo"| |spark.C|N/A|"bar"|"bar"| |spark.D|"foo"|"foo"|"foo"| |spark.E|"foo"|"foo"|"foo"| |spark.F|"foo"|"foo"|"foo"| I can offer a patch to fix it if you think it make sense. > Ignore to load default properties file is not a good choice from the > perspective of system > -- > > Key: SPARK-21023 > URL: https://issues.apache.org/jira/browse/SPARK-21023 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Lantao Jin >Priority: Minor > > The default properties file {{spark-defaults.conf}} shouldn't be ignore to > load even though the submit arg {{--properties-file}} is set. The reasons are > very easy to see: > * Infrastructure team need continually update the {{spark-defaults.conf}} > when they want set something as default for entire cluster as a tuning > purpose. > * Application developer only want to override the parameters they really want > rather than others they even doesn't know (Set by infrastructure team). > * The purpose of using {{\-\-properties-file}} from most of application > developers is to avoid setting dozens of {{--conf k=v}}. But if > {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. 
> For example: > Current implement > ||Property name||Value in default||Value in user-special||Finally value|| > |spark.A|"foo"|"bar"|"bar"| > |spark.B|"foo"|N/A|N/A| > |spark.C|N/A|"bar"|"bar"| > |spark.D|"foo"|"foo"|"foo"| > |spark.E|"foo"|N/A|N/A| > |spark.F|"foo"|N/A|N/A| > Expected right implement > ||Property name||Value in default||Value in user-special||Finally value|| > |spark.A|"foo"|"bar"|"bar"| > |spark.B|"foo"|N/A|"foo"| > |spark.C|N/A|"bar"|"bar"| > |spark.D|"foo"|N/A|"foo"| > |spark.E|"foo"|N/A|"foo"| > |spark.F|"foo"|N/A|"foo"| > I can offer a patch to fix it if you think it make sense. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21037) ignoreNulls does not working properly with window functions
[ https://issues.apache.org/jira/browse/SPARK-21037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045479#comment-16045479 ] Stanislav Chernichkin commented on SPARK-21037: --- To be more precise, the problem is not related to the ignoreNulls property. It arises when orderBy is used without specifying window boundaries. In that case the boundaries are set to UNBOUNDED PRECEDING - CURRENT ROW, and all aggregation functions behave accordingly. The problem does not arise when orderBy is not used. This behavior is undocumented and unintuitive: popular databases do not require specifying window boundaries to apply an aggregation function to the whole group (it is applied to the whole group by default), and they do not adjust the default window depending on the presence of ordering. > ignoreNulls does not working properly with window functions > --- > > Key: SPARK-21037 > URL: https://issues.apache.org/jira/browse/SPARK-21037 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.1.0, 2.1.1 >Reporter: Stanislav Chernichkin > > Following code reproduces issue: > spark > .sql("select 0 as key, null as value, 0 as order union select 0 as key, > 'value' as value, 1 as order") > .select($"*", first($"value", > true).over(partitionBy($"key").orderBy("order")).as("first_value")) > .show() > Since documentation climes than {{first}} function will return first non-null > result I except to have: > |key|value|order|first_value| > | 0| null|0| value| > | 0|value|1| value| > But actual result is: > |key|value|order|first_value| > | 0| null|0| null| > | 0|value|1| value| -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
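The default-frame behavior described in this comment can be modeled in plain Python (a simplified sketch, not Spark's implementation): with an ORDER BY and no explicit frame, each row's frame is UNBOUNDED PRECEDING .. CURRENT ROW, so first(value, ignoreNulls=True) only sees values up to and including the current row.

```python
# Simplified model of first(value, ignoreNulls=True) over a window with
# ORDER BY and the implicit frame UNBOUNDED PRECEDING .. CURRENT ROW.
def first_non_null_running(values):
    out, seen = [], None
    for v in values:
        if seen is None and v is not None:
            seen = v           # first non-null within the frame so far
        out.append(seen)       # the frame ends at the current row
    return out

# The reporter's partition, ordered by "order": the first row's frame
# contains only the null, hence the observed null in first_value.
rows = [None, "value"]
```

Under this model `first_non_null_running(rows)` yields `[None, "value"]`, matching the reported actual output; the expected output corresponds to a whole-partition frame, which in PySpark can be requested explicitly with `Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)`.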
[jira] [Updated] (SPARK-21046) simplify the array offset and length in ColumnVector
[ https://issues.apache.org/jira/browse/SPARK-21046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-21046: Issue Type: Sub-task (was: Improvement) Parent: SPARK-20960 > simplify the array offset and length in ColumnVector > > > Key: SPARK-21046 > URL: https://issues.apache.org/jira/browse/SPARK-21046 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21046) simplify the array offset and length in ColumnVector
[ https://issues.apache.org/jira/browse/SPARK-21046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21046: Assignee: Apache Spark (was: Wenchen Fan) > simplify the array offset and length in ColumnVector > > > Key: SPARK-21046 > URL: https://issues.apache.org/jira/browse/SPARK-21046 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21046) simplify the array offset and length in ColumnVector
[ https://issues.apache.org/jira/browse/SPARK-21046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21046: Assignee: Wenchen Fan (was: Apache Spark) > simplify the array offset and length in ColumnVector > > > Key: SPARK-21046 > URL: https://issues.apache.org/jira/browse/SPARK-21046 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21046) simplify the array offset and length in ColumnVector
[ https://issues.apache.org/jira/browse/SPARK-21046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045477#comment-16045477 ] Apache Spark commented on SPARK-21046: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/18260 > simplify the array offset and length in ColumnVector > > > Key: SPARK-21046 > URL: https://issues.apache.org/jira/browse/SPARK-21046 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21037) ignoreNulls does not working properly with window functions
[ https://issues.apache.org/jira/browse/SPARK-21037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislav Chernichkin updated SPARK-21037: -- Description: The following code reproduces the issue: spark .sql("select 0 as key, null as value, 0 as order union select 0 as key, 'value' as value, 1 as order") .select($"*", first($"value", true).over(partitionBy($"key").orderBy("order")).as("first_value")) .show() Since the documentation claims that the {{first}} function will return the first non-null result, I expect to have: |key|value|order|first_value| | 0| null|0| value| | 0|value|1| value| But the actual result is: |key|value|order|first_value| | 0| null|0| null| | 0|value|1| value| was: Following code reproduces issue: spark .sql("select 0 as key, null as value, 0 as order union select 0 as key, 'value' as value, 1 as order") .select($"*", first($"value", true).over(partitionBy($"key").orderBy("order")).as("first_value")) .show() Since documentation climes than {{first}} function will return first non-null result I except to have: |key|value|order|first_value| +---+-+-+---+ | 0| null|0| value| | 0|value|1| value| +---+-+-+---+ But actual result is: +---+-+-+---+ |key|value|order|first_value| +---+-+-+---+ | 0| null|0| null| | 0|value|1| value| +---+-+-+---+ > ignoreNulls does not working properly with window functions > --- > > Key: SPARK-21037 > URL: https://issues.apache.org/jira/browse/SPARK-21037 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.1.0, 2.1.1 >Reporter: Stanislav Chernichkin > > Following code reproduces issue: > spark > .sql("select 0 as key, null as value, 0 as order union select 0 as key, > 'value' as value, 1 as order") > .select($"*", first($"value", > true).over(partitionBy($"key").orderBy("order")).as("first_value")) > .show() > Since documentation climes than {{first}} function will return first non-null > result I except to have: > |key|value|order|first_value| > | 0| null|0| value| > | 0|value|1| value| > 
But actual result is: > |key|value|order|first_value| > | 0| null|0| null| > | 0|value|1| value| -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21046) simplify the array offset and length in ColumnVector
Wenchen Fan created SPARK-21046: --- Summary: simplify the array offset and length in ColumnVector Key: SPARK-21046 URL: https://issues.apache.org/jira/browse/SPARK-21046 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21037) ignoreNulls does not working properly with window functions
[ https://issues.apache.org/jira/browse/SPARK-21037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanislav Chernichkin updated SPARK-21037: -- Description: Following code reproduces issue: spark .sql("select 0 as key, null as value, 0 as order union select 0 as key, 'value' as value, 1 as order") .select($"*", first($"value", true).over(partitionBy($"key").orderBy("order")).as("first_value")) .show() Since documentation climes than {{first}} function will return first non-null result I except to have: |key|value|order|first_value| +---+-+-+---+ | 0| null|0| value| | 0|value|1| value| +---+-+-+---+ But actual result is: +---+-+-+---+ |key|value|order|first_value| +---+-+-+---+ | 0| null|0| null| | 0|value|1| value| +---+-+-+---+ was: Following code reproduces issue: spark .sql("select 0 as key, null as value, 0 as order union select 0 as key, 'value' as value, 1 as order") .select($"*", first($"value", true).over(partitionBy($"key").orderBy("order")).as("first_value")) .show() Since documentation climes than {{first}} function will return first non-null result I except to have: +---+-+-+---+ |key|value|order|first_value| +---+-+-+---+ | 0| null|0| value| | 0|value|1| value| +---+-+-+---+ But actual result is: +---+-+-+---+ |key|value|order|first_value| +---+-+-+---+ | 0| null|0| null| | 0|value|1| value| +---+-+-+---+ > ignoreNulls does not working properly with window functions > --- > > Key: SPARK-21037 > URL: https://issues.apache.org/jira/browse/SPARK-21037 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.1.0, 2.1.1 >Reporter: Stanislav Chernichkin > > Following code reproduces issue: > spark > .sql("select 0 as key, null as value, 0 as order union select 0 as key, > 'value' as value, 1 as order") > .select($"*", first($"value", > true).over(partitionBy($"key").orderBy("order")).as("first_value")) > .show() > Since documentation climes than {{first}} function will return first non-null > result I except to 
have: > |key|value|order|first_value| > +---+-+-+---+ > | 0| null|0| value| > | 0|value|1| value| > +---+-+-+---+ > But actual result is: > +---+-+-+---+ > |key|value|order|first_value| > +---+-+-+---+ > | 0| null|0| null| > | 0|value|1| value| > +---+-+-+---+ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045472#comment-16045472 ] Wenchen Fan commented on SPARK-21023: - can't we just introduce something like `--extra-properties-file` for this new feature? > Ignore to load default properties file is not a good choice from the > perspective of system > -- > > Key: SPARK-21023 > URL: https://issues.apache.org/jira/browse/SPARK-21023 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Lantao Jin >Priority: Minor > > The default properties file {{spark-defaults.conf}} shouldn't be ignore to > load even though the submit arg {{--properties-file}} is set. The reasons are > very easy to see: > * Infrastructure team need continually update the {{spark-defaults.conf}} > when they want set something as default for entire cluster as a tuning > purpose. > * Application developer only want to override the parameters they really want > rather than others they even doesn't know (Set by infrastructure team). > * The purpose of using {{\-\-properties-file}} from most of application > developers is to avoid setting dozens of {{--conf k=v}}. But if > {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. > For example: > Current implement > ||Property name||Value in default||Value in user-special||Finally value|| > |spark.A|"foo"|"bar"|"bar"| > |spark.B|"foo"|N/A|N/A| > |spark.C|N/A|"bar"|"bar"| > |spark.D|"foo"|"foo"|"foo"| > |spark.E|"foo"|N/A|N/A| > |spark.F|"foo"|N/A|N/A| > Expected right implement > ||Property name||Value in default||Value in user-special||Finally value|| > |spark.A|"foo"|"bar"|"bar"| > |spark.B|"foo"|N/A|"foo"| > |spark.C|N/A|"bar"|"bar"| > |spark.D|"foo"|"foo"|"foo"| > |spark.E|"foo"|"foo"|"foo"| > |spark.F|"foo"|"foo"|"foo"| > I can offer a patch to fix it if you think it make sense. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21006) Create rpcEnv and run later needs shutdown and awaitTermination
[ https://issues.apache.org/jira/browse/SPARK-21006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045468#comment-16045468 ] Apache Spark commented on SPARK-21006: -- User '10110346' has created a pull request for this issue: https://github.com/apache/spark/pull/18259 > Create rpcEnv and run later needs shutdown and awaitTermination > --- > > Key: SPARK-21006 > URL: https://issues.apache.org/jira/browse/SPARK-21006 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.1.1 >Reporter: wangjiaochun >Assignee: wangjiaochun >Priority: Minor > Fix For: 2.3.0 > > > test("port conflict") { > val anotherEnv = createRpcEnv(new SparkConf(), "remote", env.address.port) > assert(anotherEnv.address.port != env.address.port) > } > should be shutdown and awaitTermination in RpcEnvSuit.scala -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
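The fix this issue asks for is ordinary resource cleanup: the second RpcEnv created by the "port conflict" test should be shut down and awaited even if the assertion fails. A rough analogue in Python, using listening sockets as a stand-in for RpcEnv (the `create_env` helper is hypothetical, not Spark code):

```python
import socket

def create_env(port=0):
    # Hypothetical stand-in for createRpcEnv: binds a listening TCP socket
    # on an ephemeral port when port=0.
    s = socket.socket()
    s.bind(("127.0.0.1", port))
    s.listen(1)
    return s

def test_port_conflict():
    env = create_env()
    another_env = create_env()
    try:
        p1 = env.getsockname()[1]
        p2 = another_env.getsockname()[1]
        assert p2 != p1
        return p1, p2
    finally:
        # The point of the issue: release both envs even on assertion failure
        # (in the Scala test: anotherEnv.shutdown(); anotherEnv.awaitTermination()).
        another_env.close()
        env.close()
```

Without the `finally` block the second env would leak its port for the rest of the suite, which is exactly the leak the JIRA describes in RpcEnvSuite.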
[jira] [Commented] (SPARK-20752) Build-in SQL Function Support - SQRT
[ https://issues.apache.org/jira/browse/SPARK-20752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045466#comment-16045466 ] Kazuaki Ishizaki commented on SPARK-20752: -- ping [~smilegator] > Build-in SQL Function Support - SQRT > > > Key: SPARK-20752 > URL: https://issues.apache.org/jira/browse/SPARK-20752 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li > Labels: starter > > {noformat} > SQRT() > {noformat} > Returns Power(, 2) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
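The quoted return description "Returns Power(, 2)" appears garbled by extraction; mathematically, SQRT(x) is POWER(x, 0.5) for non-negative x, which the following sketch checks:

```python
import math

def sql_sqrt(x: float) -> float:
    # Equivalent of the SQL SQRT built-in for x >= 0: POWER(x, 0.5).
    return x ** 0.5
```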
[jira] [Updated] (SPARK-21045) Spark executor is blocked instead of throwing exception because exception occur when python worker send exception trace stack info to Java Gateway
[ https://issues.apache.org/jira/browse/SPARK-21045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joshuawangzj updated SPARK-21045: - Description: My pyspark program is always blocking in product yarn cluster. Then I jstack and found : {code} "Executor task launch worker for task 0" #60 daemon prio=5 os_prio=31 tid=0x7fb2f44e3000 nid=0xa003 runnable [0x000123b4a000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:170) at java.net.SocketInputStream.read(SocketInputStream.java:141) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read(BufferedInputStream.java:265) - locked <0x0007acab1c98> (a java.io.BufferedInputStream) at java.io.DataInputStream.readInt(DataInputStream.java:387) at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:190) at org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234) at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} It is blocking in socket read. 
I viewed the log on the blocked executor and found this error:
{code}
Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 178, in main
    write_with_length(traceback.format_exc().encode("utf-8"), outfile)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 618: ordinal not in range(128)
{code}
Finally I found the problem:
{code:title=worker.py|borderStyle=solid}
# line 178 in Spark 2.1.1
except Exception:
    try:
        write_int(SpecialLengths.PYTHON_EXCEPTION_THROWN, outfile)
        write_with_length(traceback.format_exc().encode("utf-8"), outfile)
    except IOError:
        # JVM closed the socket
        pass
    except Exception:
        # Write the error to stderr if it happened while serializing
        print("PySpark worker failed with exception:", file=sys.stderr)
        print(traceback.format_exc(), file=sys.stderr)
{code}
When write_with_length(traceback.format_exc().encode("utf-8"), outfile) itself raises an exception such as UnicodeDecodeError, the Python worker cannot send the trace info. But once PythonRDD has read PYTHON_EXCEPTION_THROWN, it expects to read the trace-info length next, so it blocks.
{code:title=PythonRDD.scala|borderStyle=solid}
// line 190 in Spark 2.1.1
case SpecialLengths.PYTHON_EXCEPTION_THROWN =>
  // Signals that an exception has been thrown in python
  val exLength = stream.readInt()  // It is possible to block here
{code}
{color:red} We can trigger the bug with a simple program: {color}
{code:title=test.py|borderStyle=solid}
spark = SparkSession.builder.master('local').getOrCreate()
rdd = spark.sparkContext.parallelize(['中']).map(lambda x: x.encode("utf8"))
rdd.collect()
{code}

was: My pyspark program is always blocking in product yarn cluster.
Then I jstack and found : {code} "Executor task launch worker for task 0" #60 daemon prio=5 os_prio=31 tid=0x7fb2f44e3000 nid=0xa003 runnable [0x000123b4a000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:170) at java.net.SocketInputStream.read(SocketInputStream.java:141) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read(BufferedInputStream.java:265) - locked <0x0007acab1c98> (a java.io.BufferedInputStream) at java.io.DataInputStream.readInt(DataInputStream.java:387) at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:190) at org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234) at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) at
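The protocol failure described above can be sketched outside Spark with an in-memory stream. This is a minimal illustration, not Spark code: the marker value and the helper names mimic pyspark.serializers, but the constant used here is chosen for illustration, and the real JVM side blocks on a socket rather than hitting EOF.

```python
import io
import struct

# Illustrative marker value; the real constant lives in
# pyspark.serializers.SpecialLengths.
PYTHON_EXCEPTION_THROWN = -2

def write_int(value, out):
    # Big-endian 4-byte int, matching java.io.DataInputStream.readInt
    out.write(struct.pack(">i", value))

def write_with_length(data, out):
    write_int(len(data), out)
    out.write(data)

# Worker side: the marker goes out, then building the traceback payload
# fails (mirroring the UnicodeDecodeError at worker.py line 178), so the
# length-prefixed payload is never written.
buf = io.BytesIO()
write_int(PYTHON_EXCEPTION_THROWN, buf)
try:
    raise UnicodeDecodeError("ascii", b"\xe4", 0, 1, "ordinal not in range(128)")
except UnicodeDecodeError:
    pass  # the worker gives up without sending the length

# Reader side (the JVM's PythonRDD, modeled in Python): it sees the
# marker, then expects a 4-byte length. On a real socket readInt() would
# block forever; on this in-memory stream we simply hit EOF.
stream = io.BytesIO(buf.getvalue())
marker = struct.unpack(">i", stream.read(4))[0]
assert marker == PYTHON_EXCEPTION_THROWN
remaining = stream.read(4)
print(len(remaining))  # 0 bytes arrived where a 4-byte length was expected
```

On a real socket the half-written frame manifests exactly as the jstack output above: the executor thread parked inside DataInputStream.readInt.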
[jira] [Created] (SPARK-21045) Spark executor is blocked instead of throwing exception because exception occur when python worker send exception trace stack info to Java Gateway
Joshuawangzj created SPARK-21045:
Summary: Spark executor is blocked instead of throwing exception because exception occur when python worker send exception trace stack info to Java Gateway
Key: SPARK-21045
URL: https://issues.apache.org/jira/browse/SPARK-21045
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.1.1, 2.0.2, 2.0.1
Reporter: Joshuawangzj

My PySpark program always blocks in our production YARN cluster. I ran jstack and found:
{code}
"Executor task launch worker for task 0" #60 daemon prio=5 os_prio=31 tid=0x7fb2f44e3000 nid=0xa003 runnable [0x000123b4a000]
   java.lang.Thread.State: RUNNABLE
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:170)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
	- locked <0x0007acab1c98> (a java.io.BufferedInputStream)
	at java.io.DataInputStream.readInt(DataInputStream.java:387)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:190)
	at org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{code}
It is blocked in a socket read.
I viewed the log on the blocked executor and found this error:
{code}
Traceback (most recent call last):
  File "/Users/wangzejie/software/spark/python/lib/pyspark.zip/pyspark/worker.py", line 178, in main
    write_with_length(traceback.format_exc().encode("utf-8"), outfile)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 618: ordinal not in range(128)
{code}
Finally I found the problem:
{code:title=worker.py|borderStyle=solid}
# line 178 in Spark 2.1.1
except Exception:
    try:
        write_int(SpecialLengths.PYTHON_EXCEPTION_THROWN, outfile)
        write_with_length(traceback.format_exc().encode("utf-8"), outfile)
    except IOError:
        # JVM closed the socket
        pass
    except Exception:
        # Write the error to stderr if it happened while serializing
        print("PySpark worker failed with exception:", file=sys.stderr)
        print(traceback.format_exc(), file=sys.stderr)
{code}
When write_with_length(traceback.format_exc().encode("utf-8"), outfile) itself raises an exception such as UnicodeDecodeError, the Python worker cannot send the trace info. But once PythonRDD has read PYTHON_EXCEPTION_THROWN, it expects to read the trace-info length next, so it blocks.
{code:title=PythonRDD.scala|borderStyle=solid}
// line 190 in Spark 2.1.1
case SpecialLengths.PYTHON_EXCEPTION_THROWN =>
  // Signals that an exception has been thrown in python
  val exLength = stream.readInt()  // It is possible to block here
{code}
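The byte 0xe4 in the log is the first UTF-8 byte of '中', which suggests the underlying mechanism: under Python 2, traceback.format_exc() returns a byte string, and calling .encode("utf-8") on a byte string first decodes it implicitly with the default 'ascii' codec. A Python 3 sketch that performs that implicit step explicitly (an illustration of the suspected cause, not code from worker.py):

```python
# The traceback text contains the raw UTF-8 bytes of '中'. Under Python 2,
# bytes.encode("utf-8") implicitly runs an ASCII decode first; we
# reproduce that hidden step explicitly here.
trace_bytes = "中".encode("utf-8")  # b'\xe4\xb8\xad' -- note 0xe4, as in the log

try:
    trace_bytes.decode("ascii")  # the implicit step Python 2 performs
    result = "no error"
except UnicodeDecodeError as exc:
    # Report which byte tripped the codec, matching the log's shape
    result = "{}: byte 0x{:x} at position {}".format(
        type(exc).__name__, trace_bytes[exc.start], exc.start)

print(result)  # UnicodeDecodeError: byte 0xe4 at position 0
```

So the very act of serializing a traceback that quotes non-ASCII data raises a second exception, and the worker dies between the marker and the payload.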
[jira] [Comment Edited] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045452#comment-16045452 ] Lantao Jin edited comment on SPARK-21023 at 6/10/17 8:18 AM:
-
If the current behavior should be kept, I can add an environment variable, e.g. {{SPARK_CONF_REPLACE_ALLOWED}}, with default value "true", set into the {{childEnv}} map in SparkSubmitCommandBuilder at the very beginning.
{code}
public SparkLauncher setConfReplaceBehavior(String allowed) {
  checkNotNull(allowed, "allowed");
  builder.childEnv.put(SPARK_CONF_REPLACE_ALLOWED, allowed);
  return this;
}
{code}
Then we can export SPARK_CONF_REPLACE_ALLOWED=false in {{spark-env.sh}} to fix this case while keeping the current behavior by default. Generally, the file {{spark-env.sh}} is deployed by the infra team and protected by the Linux file-permission mechanism. Of course, a user can export any value before submitting, but that means the user definitely knows what they want instead of getting the current unexpected result.

was (Author: cltlfcjin): If the current behavior should be kept, I can add an environment variable, e.g. {{SPARK_CONF_REPLACE_ALLOWED}}, with default value "true", set into the {{childEnv}} map in SparkLauncher.class.
{code}
public SparkLauncher setConfReplaceBehavior(String allowed) {
  checkNotNull(allowed, "allowed");
  builder.childEnv.put(SPARK_CONF_REPLACE_ALLOWED, allowed);
  return this;
}
{code}
Then we can export SPARK_CONF_REPLACE_ALLOWED=false in {{spark-env.sh}} to fix this case while keeping the current behavior by default. Generally, the file {{spark-env.sh}} is deployed by the infra team and protected by the Linux file-permission mechanism. Of course, a user can export any value before submitting, but that means the user definitely knows what they want instead of getting the current unexpected result.
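The proposed gate can be sketched outside Spark as a tiny config loader. This is plain Python and every name except the SPARK_CONF_REPLACE_ALLOWED variable introduced in the comment is hypothetical; it only shows the intended switch between "replace" and "merge" semantics.

```python
import os

def effective_conf(defaults, user_file, env=None):
    # Hypothetical helper mirroring the proposed gate: when the variable is
    # "true" (the default, preserving today's behavior), --properties-file
    # replaces spark-defaults.conf entirely; when "false", defaults survive
    # unless the user file overrides them.
    env = os.environ if env is None else env
    if env.get("SPARK_CONF_REPLACE_ALLOWED", "true") == "true":
        return dict(user_file)          # current behavior: full replacement
    merged = dict(defaults)
    merged.update(user_file)            # proposed behavior: overlay
    return merged

defaults = {"spark.B": "foo"}
user_file = {"spark.A": "bar"}
print(effective_conf(defaults, user_file, env={}))
print(effective_conf(defaults, user_file, env={"SPARK_CONF_REPLACE_ALLOWED": "false"}))
```

With the gate off, the cluster-wide default for spark.B survives; with it on (or unset), behavior is unchanged, which is what makes the switch safe to deploy.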
> Ignore to load default properties file is not a good choice from the
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
> Issue Type: Improvement
> Components: Spark Submit
> Affects Versions: 2.1.1
> Reporter: Lantao Jin
> Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignored
> even when the submit arg {{--properties-file}} is set. The reasons are
> easy to see:
> * The infrastructure team needs to continually update {{spark-defaults.conf}}
> when they want to set something as a cluster-wide default for tuning purposes.
> * Application developers only want to override the parameters they actually
> care about, not others they don't even know about (set by the infrastructure team).
> * The purpose of using {{\-\-properties-file}} for most application
> developers is to avoid setting dozens of {{--conf k=v}} pairs. But if
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected.
> For example:
> Current implementation
> ||Property name||Value in default||Value in user-specified||Final value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|N/A|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|"foo"|"foo"|
> |spark.E|"foo"|N/A|N/A|
> |spark.F|"foo"|N/A|N/A|
> Expected correct implementation
> ||Property name||Value in default||Value in user-specified||Final value||
> |spark.A|"foo"|"bar"|"bar"|
> |spark.B|"foo"|N/A|"foo"|
> |spark.C|N/A|"bar"|"bar"|
> |spark.D|"foo"|"foo"|"foo"|
> |spark.E|"foo"|"foo"|"foo"|
> |spark.F|"foo"|"foo"|"foo"|
> I can offer a patch to fix it if you think it makes sense.
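The two tables above can be reproduced with plain dicts (a sketch of the semantics only, not Spark code; N/A is modeled as an absent key):

```python
# Cluster-wide defaults (spark-defaults.conf) and a user --properties-file
defaults = {"spark.A": "foo", "spark.B": "foo", "spark.D": "foo",
            "spark.E": "foo", "spark.F": "foo"}
user_file = {"spark.A": "bar", "spark.C": "bar", "spark.D": "foo"}

# Current implementation: --properties-file replaces spark-defaults.conf
current = dict(user_file)

# Expected implementation: load defaults first, then overlay the user file
expected = {**defaults, **user_file}

print(current.get("spark.B"))  # None: the cluster-wide default is silently lost
print(expected["spark.B"])     # foo: default survives unless overridden
print(expected["spark.A"])     # bar: the user's value still wins on conflict
```

The key rows are spark.B, spark.E, and spark.F: under the current behavior they vanish entirely, while the overlay keeps them at their tuned defaults.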
[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045452#comment-16045452 ] Lantao Jin commented on SPARK-21023: If the current behavior must be kept, I can add an environment variable, e.g. {{SPARK_CONF_REPLACE_ALLOWED}}, with a default value of "true", and set it in the {{childEnv}} map in AbstractCommandBuilder.class:
{code}
static final String SPARK_CONF_REPLACE_ALLOWED = "SPARK_CONF_REPLACE_ALLOWED";
{code}
Then we can export SPARK_CONF_REPLACE_ALLOWED=false in {{spark-env.sh}} to fix this case while keeping the current behavior by default. Generally, {{spark-env.sh}} is deployed by the infrastructure team and protected by the Linux file permission mechanism. Of course, a user can export the variable to any value before submitting, but that means the user definitely knows what they want instead of getting the current unexpected result.
> Ignore to load default properties file is not a good choice from the perspective of system
[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045441#comment-16045441 ] Lantao Jin commented on SPARK-21023: I modified the description, adding two tables to illustrate why I consider this a bug. Escalating to the dev mailing list for discussion.
> Ignore to load default properties file is not a good choice from the perspective of system
[jira] [Updated] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-21023: --- Description: The default properties file {{spark-defaults.conf}} shouldn't be ignored even though the submit arg {{--properties-file}} is set. The reasons are easy to see:
* The infrastructure team needs to continually update {{spark-defaults.conf}} when they want to set cluster-wide defaults for tuning purposes.
* Application developers only want to override the parameters they care about, not others they may not even know about (set by the infrastructure team).
* The purpose of using {{\-\-properties-file}} for most application developers is to avoid setting dozens of {{--conf k=v}} pairs. But if {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected.
For example:
Current implementation
||Property name||Value in default||Value in user-specified||Final value||
|spark.A|"foo"|"bar"|"bar"|
|spark.B|"foo"|N/A|N/A|
|spark.C|N/A|"bar"|"bar"|
|spark.D|"foo"|"foo"|"foo"|
|spark.E|"foo"|N/A|N/A|
|spark.F|"foo"|N/A|N/A|
Expected implementation
||Property name||Value in default||Value in user-specified||Final value||
|spark.A|"foo"|"bar"|"bar"|
|spark.B|"foo"|N/A|"foo"|
|spark.C|N/A|"bar"|"bar"|
|spark.D|"foo"|"foo"|"foo"|
|spark.E|"foo"|"foo"|"foo"|
|spark.F|"foo"|"foo"|"foo"|
I can offer a patch to fix it if you think it makes sense.

was: The default properties file {{spark-defaults.conf}} shouldn't be ignored even though the submit arg {{--properties-file}} is set. The reasons are easy to see:
* The infrastructure team needs to continually update {{spark-defaults.conf}} when they want to set cluster-wide defaults for tuning purposes.
* Application developers only want to override the parameters they care about, not others they may not even know about (set by the infrastructure team).
* The purpose of using {{\-\-properties-file}} for most application developers is to avoid setting dozens of {{--conf k=v}} pairs. But if {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected. All this is caused by the code below:
{code}
private Properties loadPropertiesFile() throws IOException {
  Properties props = new Properties();
  File propsFile;
  if (propertiesFile != null) {
    // the default conf properties file is not loaded when the app developer
    // passes --properties-file as a submit arg
    propsFile = new File(propertiesFile);
    checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", propertiesFile);
  } else {
    propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE);
  }
  //...
  return props;
}
{code}
I can offer a patch to fix it if you think it makes sense.
> Ignore to load default properties file is not a good choice from the perspective of system
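To make the proposed opt-out concrete, here is a hedged, self-contained sketch of how a launcher-style builder might consult the proposed {{SPARK_CONF_REPLACE_ALLOWED}} variable, defaulting to "true" so current behavior is preserved. All class and method names here are illustrative, not actual Spark APIs:

```java
import java.util.HashMap;
import java.util.Map;

public class ConfReplaceSketch {
    // Proposed environment variable name from the comment above.
    static final String SPARK_CONF_REPLACE_ALLOWED = "SPARK_CONF_REPLACE_ALLOWED";

    // Stand-in for the launcher's childEnv map.
    private final Map<String, String> childEnv = new HashMap<>();

    // Mirrors the setConfReplaceBehavior setter sketched in the comment.
    public ConfReplaceSketch setConfReplaceBehavior(String allowed) {
        if (allowed == null) {
            throw new IllegalArgumentException("'allowed' must not be null");
        }
        childEnv.put(SPARK_CONF_REPLACE_ALLOWED, allowed);
        return this;
    }

    // Defaults to "true" so the current replace behavior is kept unless
    // someone exports SPARK_CONF_REPLACE_ALLOWED=false (e.g. in spark-env.sh).
    public boolean replaceAllowed() {
        return Boolean.parseBoolean(childEnv.getOrDefault(SPARK_CONF_REPLACE_ALLOWED, "true"));
    }

    public static void main(String[] args) {
        ConfReplaceSketch launcher = new ConfReplaceSketch();
        System.out.println(launcher.replaceAllowed()); // true by default
        launcher.setConfReplaceBehavior("false");
        System.out.println(launcher.replaceAllowed()); // false after opting out
    }
}
```

The key design point of the proposal is the default: leaving the flag unset keeps today's replace-the-defaults behavior, so only deployments that explicitly export the variable get the merge behavior.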