[jira] [Assigned] (SPARK-25418) The metadata of DataSource table should not include Hive-generated storage properties.
[ https://issues.apache.org/jira/browse/SPARK-25418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25418:

    Assignee: (was: Apache Spark)

> The metadata of DataSource table should not include Hive-generated storage properties.
> --
>
> Key: SPARK-25418
> URL: https://issues.apache.org/jira/browse/SPARK-25418
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Takuya Ueshin
> Priority: Major
>
> When Hive support is enabled, the Hive catalog puts extra storage properties into table metadata even for DataSource tables, but we should not have them.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25418) The metadata of DataSource table should not include Hive-generated storage properties.
[ https://issues.apache.org/jira/browse/SPARK-25418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-25418:

    Assignee: Apache Spark

> The metadata of DataSource table should not include Hive-generated storage properties.
> --
>
> Key: SPARK-25418
> URL: https://issues.apache.org/jira/browse/SPARK-25418
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Takuya Ueshin
> Assignee: Apache Spark
> Priority: Major
>
> When Hive support is enabled, the Hive catalog puts extra storage properties into table metadata even for DataSource tables, but we should not have them.
[jira] [Commented] (SPARK-25418) The metadata of DataSource table should not include Hive-generated storage properties.
[ https://issues.apache.org/jira/browse/SPARK-25418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613040#comment-16613040 ]

Apache Spark commented on SPARK-25418:

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22410

> The metadata of DataSource table should not include Hive-generated storage properties.
> --
>
> Key: SPARK-25418
> URL: https://issues.apache.org/jira/browse/SPARK-25418
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Takuya Ueshin
> Priority: Major
>
> When Hive support is enabled, the Hive catalog puts extra storage properties into table metadata even for DataSource tables, but we should not have them.
[jira] [Created] (SPARK-25418) The metadata of DataSource table should not include Hive-generated storage properties.
Takuya Ueshin created SPARK-25418:
-

Summary: The metadata of DataSource table should not include Hive-generated storage properties.
Key: SPARK-25418
URL: https://issues.apache.org/jira/browse/SPARK-25418
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.4.0
Reporter: Takuya Ueshin

When Hive support is enabled, the Hive catalog puts extra storage properties into table metadata even for DataSource tables, but we should not have them.
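[Editor's note] The fix direction described above can be sketched outside Spark: filter catalog-injected storage properties out of DataSource table metadata before returning it. This is a minimal illustration, not the merged patch; the set of property names below is an assumption (`serialization.format` is one property Hive is known to inject, the exact set may differ).

```python
# Hypothetical sketch: drop Hive-generated storage properties from the
# metadata of a DataSource table. The property set is an assumption.
HIVE_GENERATED_PROPERTIES = {"serialization.format"}

def strip_hive_generated(storage_properties):
    """Return a copy of the storage properties without Hive-generated entries."""
    return {k: v for k, v in storage_properties.items()
            if k not in HIVE_GENERATED_PROPERTIES}
```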
[jira] [Commented] (SPARK-25352) Perform ordered global limit when limit number is bigger than topKSortFallbackThreshold
[ https://issues.apache.org/jira/browse/SPARK-25352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613033#comment-16613033 ]

Apache Spark commented on SPARK-25352:

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/22409

> Perform ordered global limit when limit number is bigger than topKSortFallbackThreshold
> ---
>
> Key: SPARK-25352
> URL: https://issues.apache.org/jira/browse/SPARK-25352
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Liang-Chi Hsieh
> Assignee: Liang-Chi Hsieh
> Priority: Major
> Fix For: 2.4.0
>
> We have an optimization on global limit to evenly distribute limit rows across all partitions. This optimization doesn't work for ordered results.
> For a query ending with sort + limit, in most cases it is performed by `TakeOrderedAndProjectExec`.
> But if the limit number is bigger than `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD`, a global limit will be used. In that case, we need to do an ordered global limit.
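[Editor's note] The decision described in the issue can be sketched in miniature: below the threshold a heap-based top-K (the `TakeOrderedAndProjectExec` path) is enough; at or above it, the plan must fall back to a full sort followed by a global limit that still respects the ordering. The function and parameter names are illustrative, not Spark's.

```python
import heapq

def ordered_limit(rows, key, limit, top_k_sort_fallback_threshold=1000):
    """Sketch of the top-K vs. ordered-global-limit decision.

    Below the threshold, take the K smallest with a heap (top-K path);
    otherwise fully sort and then apply the limit (ordered global limit).
    """
    if limit < top_k_sort_fallback_threshold:
        return heapq.nsmallest(limit, rows, key=key)   # top-K path
    return sorted(rows, key=key)[:limit]               # ordered global limit
```

Either path must return the same rows; the issue is about making the fallback path preserve ordering rather than distributing the limit evenly across partitions.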
[jira] [Resolved] (SPARK-25295) Pod names conflicts in client mode, if previous submission was not a clean shutdown.
[ https://issues.apache.org/jira/browse/SPARK-25295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yinan Li resolved SPARK-25295.
--
    Resolution: Fixed
    Fix Version/s: 2.4.0

> Pod names conflicts in client mode, if previous submission was not a clean shutdown.
>
> Key: SPARK-25295
> URL: https://issues.apache.org/jira/browse/SPARK-25295
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 2.4.0
> Reporter: Prashant Sharma
> Priority: Major
> Fix For: 2.4.0
>
> If the previous job was killed somehow, e.g. by disconnecting the client, it leaves behind executor pods named spark-exec-#, which cause naming conflicts and failures for the next job submission.
>
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://:6443/api/v1/namespaces/default/pods. Message: pods "spark-exec-4" already exists. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=null, kind=pods, name=spark-exec-4, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=pods "spark-exec-4" already exists, metadata=ListMeta(resourceVersion=null, selfLink=null, additionalProperties={}), reason=AlreadyExists, status=Failure, additionalProperties={}).
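[Editor's note] One way to avoid this class of conflict (a sketch of the general idea, not the fix actually merged for this issue) is to include a per-submission unique suffix in executor pod names, so leftovers from an unclean shutdown can never collide with the next submission:

```python
import uuid

def executor_pod_name(app_name, executor_id, app_id=None):
    """Build an executor pod name with a unique per-submission component.

    app_id defaults to a fresh random hex string, so two submissions of the
    same application never produce colliding pod names.
    """
    app_id = app_id or uuid.uuid4().hex[:8]
    return "{}-{}-exec-{}".format(app_name, app_id, executor_id)
```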
[jira] [Updated] (SPARK-25415) Make plan change log in RuleExecutor configurable by SQLConf
[ https://issues.apache.org/jira/browse/SPARK-25415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-25415:

    Priority: Major (was: Minor)

> Make plan change log in RuleExecutor configurable by SQLConf
>
> Key: SPARK-25415
> URL: https://issues.apache.org/jira/browse/SPARK-25415
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Maryann Xue
> Priority: Major
> Fix For: 3.0.0
>
> In RuleExecutor, after applying a rule, if the plan has changed, the before and after plan will be logged using level "trace". At times, however, such information can be very helpful for debugging, so making the log level configurable in SQLConf would allow users to turn on the plan change log independently and save the trouble of tweaking log4j settings.
> Meanwhile, filtering plan change log for specific rules can also be very useful.
> So I propose adding two confs:
> 1. spark.sql.optimizer.planChangeLog.level - set a specific log level for logging plan changes after a rule is applied.
> 2. spark.sql.optimizer.planChangeLog.rules - enable plan change logging only for a set of specified rules, separated by commas.
[jira] [Resolved] (SPARK-25415) Make plan change log in RuleExecutor configurable by SQLConf
[ https://issues.apache.org/jira/browse/SPARK-25415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-25415.
-
    Resolution: Fixed
    Assignee: Maryann Xue
    Fix Version/s: 3.0.0

> Make plan change log in RuleExecutor configurable by SQLConf
>
> Key: SPARK-25415
> URL: https://issues.apache.org/jira/browse/SPARK-25415
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Maryann Xue
> Assignee: Maryann Xue
> Priority: Major
> Fix For: 3.0.0
>
> In RuleExecutor, after applying a rule, if the plan has changed, the before and after plan will be logged using level "trace". At times, however, such information can be very helpful for debugging, so making the log level configurable in SQLConf would allow users to turn on the plan change log independently and save the trouble of tweaking log4j settings.
> Meanwhile, filtering plan change log for specific rules can also be very useful.
> So I propose adding two confs:
> 1. spark.sql.optimizer.planChangeLog.level - set a specific log level for logging plan changes after a rule is applied.
> 2. spark.sql.optimizer.planChangeLog.rules - enable plan change logging only for a set of specified rules, separated by commas.
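[Editor's note] The two proposed confs can be sketched as one small helper: log a plan only when it actually changed, at the configured level, and only for the configured rules (comma-separated, empty meaning all rules). This illustrates the proposal's semantics, not the merged Scala implementation.

```python
import logging

def log_plan_change(logger, rule, before, after, level="TRACE", rules_filter=""):
    """Log a before/after plan pair per the proposed confs.

    level mirrors spark.sql.optimizer.planChangeLog.level; rules_filter
    mirrors spark.sql.optimizer.planChangeLog.rules (comma-separated).
    Returns True if a log record was emitted.
    """
    if before == after:
        return False  # rule did not change the plan, nothing to log
    wanted = {r.strip() for r in rules_filter.split(",") if r.strip()}
    if wanted and rule not in wanted:
        return False  # filtered out by planChangeLog.rules
    # "TRACE" is not a stock Python level; map it to a low numeric value
    numeric = 5 if level == "TRACE" else logging.getLevelName(level)
    logger.log(numeric, "Rule %s changed plan:\n%s\n=>\n%s", rule, before, after)
    return True
```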
[jira] [Commented] (SPARK-25397) SparkSession.conf fails when given default value with Python 3
[ https://issues.apache.org/jira/browse/SPARK-25397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612994#comment-16612994 ]

Hyukjin Kwon commented on SPARK-25397:

[~josephkb], do you want to backport this bit or just resolve this? Either way sounds okay to me.

> SparkSession.conf fails when given default value with Python 3
>
> Key: SPARK-25397
> URL: https://issues.apache.org/jira/browse/SPARK-25397
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Joseph K. Bradley
> Priority: Minor
>
> Spark 2.3.1 has a Python 3 incompatibility: requesting a conf value from SparkSession fails when you give a non-string default value. Reproduce via the SparkSession call:
> {{spark.conf.get("myConf", False)}}
> This gives the error:
> {code}
> >>> spark.conf.get("myConf", False)
> Traceback (most recent call last):
> File "", line 1, in
> File "/Users/josephkb/work/spark-bin/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/conf.py", line 51, in get
> self._checkType(default, "default")
> File "/Users/josephkb/work/spark-bin/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/conf.py", line 62, in _checkType
> if not isinstance(obj, str) and not isinstance(obj, unicode):
> *NameError: name 'unicode' is not defined*
> {code}
> The offending line in Spark in branch-2.3 is:
> https://github.com/apache/spark/blob/branch-2.3/python/pyspark/sql/conf.py
> which uses the name {{unicode}}, which is not available in Python 3.
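[Editor's note] A version-agnostic form of the offending check might look like the following. This is a sketch of the general Python-2/3 compatibility pattern, not necessarily the patch that was merged; `check_string_type` is a stand-in name for `_checkType`.

```python
# Sketch of a Python-2/3 compatible string type check.
try:
    string_types = (str, unicode)  # Python 2: both str and unicode count  # noqa: F821
except NameError:
    string_types = (str,)          # Python 3: the name 'unicode' is gone

def check_string_type(obj, identifier):
    """Raise TypeError unless obj is None or a string, naming the argument."""
    if obj is not None and not isinstance(obj, string_types):
        raise TypeError("expected %s %r to be a string" % (identifier, obj))
```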
[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4
[ https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612965#comment-16612965 ]

Hyukjin Kwon commented on SPARK-25378:

{quote}
If it is not public, why didn't we hide it in the first place?
{quote}
Because we already state that the package itself is not meant to be public:
- https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala#L21-L22
These modifiers were removed in SPARK-16813 for this reason.

> ArrayData.toArray(StringType) assume UTF8String in 2.4
>
> Key: SPARK-25378
> URL: https://issues.apache.org/jira/browse/SPARK-25378
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Xiangrui Meng
> Priority: Critical
>
> The following code works in 2.3.1 but failed in 2.4.0-SNAPSHOT:
> {code}
> import org.apache.spark.sql.catalyst.util._
> import org.apache.spark.sql.types.StringType
> ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType)
> res0: Array[String] = Array(a, b)
> {code}
> In 2.4.0-SNAPSHOT, the error is
> {code}
> java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.unsafe.types.UTF8String
> at org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75)
> at org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
> at org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
> at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178)
> ... 51 elided
> {code}
> cc: [~cloud_fan] [~yogeshg]
[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4
[ https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612947#comment-16612947 ]

Wenchen Fan commented on SPARK-25378:

[~viirya] Can you take a look and see how hard it is to fix?
After a quick look, I think this works in 2.3 if and only if the `GenericArrayData` is created with `Array[String]` (i.e. a malformed ArrayData) and we wrongly call the `toArray[String](StringType)` method.
A quick solution is to revert SPARK-23875 from 2.4, but then we sacrifice performance to retain a buggy but backward-compatible behavior. So we need to make a trade-off here.

> ArrayData.toArray(StringType) assume UTF8String in 2.4
>
> Key: SPARK-25378
> URL: https://issues.apache.org/jira/browse/SPARK-25378
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Xiangrui Meng
> Priority: Critical
>
> The following code works in 2.3.1 but failed in 2.4.0-SNAPSHOT:
> {code}
> import org.apache.spark.sql.catalyst.util._
> import org.apache.spark.sql.types.StringType
> ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType)
> res0: Array[String] = Array(a, b)
> {code}
> In 2.4.0-SNAPSHOT, the error is
> {code}
> java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.unsafe.types.UTF8String
> at org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75)
> at org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
> at org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
> at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178)
> ... 51 elided
> {code}
> cc: [~cloud_fan] [~yogeshg]
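[Editor's note] The failure mode Wenchen describes can be mimicked in miniature: a container whose typed accessor assumes the internal representation breaks when it was constructed with external values. The classes below are deliberately simplified analogues of the Catalyst ones, not their real APIs.

```python
class UTF8String:
    """Stand-in for Spark's internal string representation."""
    def __init__(self, data):
        self.data = data

class GenericArrayData:
    """Simplified analogue: the typed accessor assumes UTF8String values."""
    def __init__(self, values):
        self.values = values

    def get_utf8_string(self, i):
        value = self.values[i]
        if not isinstance(value, UTF8String):
            # analogous to the ClassCastException in the stack trace above
            raise TypeError("java.lang.String cannot be cast to UTF8String")
        return value
```

Constructed with plain `str` values (the "malformed ArrayData" case), the typed accessor fails; constructed with internal values, it succeeds.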
[jira] [Assigned] (SPARK-25357) Add metadata to SparkPlanInfo to dump more information like file path to event log
[ https://issues.apache.org/jira/browse/SPARK-25357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-25357:
---
    Assignee: Lantao Jin

> Add metadata to SparkPlanInfo to dump more information like file path to event log
>
> Key: SPARK-25357
> URL: https://issues.apache.org/jira/browse/SPARK-25357
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Lantao Jin
> Assignee: Lantao Jin
> Priority: Minor
> Fix For: 2.3.2, 2.4.0
>
> The field {{metadata}} was removed from {{SparkPlanInfo}} in SPARK-17701. Correspondingly, this field was also removed from the event {{SparkListenerSQLExecutionStart}} in the Spark event log. If we want to analyze the event log to get fields which are wider than 100 characters (e.g. the Location or ReadSchema of a FileScan), they are abbreviated in the {{simpleString}} of the SparkPlanInfo JSON or in the {{physicalPlanDescription}} JSON.
> Before 2.3, the fragment of SparkListenerSQLExecutionStart in the event log contained the metadata field:
> {quote}Location: InMemoryFileIndex[hdfs://hercules/sys/edw/prs_idm/idm_cbt_am_t/cbt/cbt_acct_prfl_info/snapshot/dt..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct"
> {quote}
> So I added this field back to the SparkPlanInfo class. Then it will log the metadata to the event log. Intact information in the event log is very useful for offline job analysis.
[jira] [Resolved] (SPARK-25402) Null handling in BooleanSimplification
[ https://issues.apache.org/jira/browse/SPARK-25402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-25402.
-
    Resolution: Fixed
    Fix Version/s: 2.4.0, 2.3.2

> Null handling in BooleanSimplification
>
> Key: SPARK-25402
> URL: https://issues.apache.org/jira/browse/SPARK-25402
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.2, 2.3.1
> Reporter: Xiao Li
> Assignee: Xiao Li
> Priority: Blocker
> Fix For: 2.3.2, 2.4.0
>
> SPARK-20350 introduced a null-handling bug in BooleanSimplification. For example, the following case returns a wrong answer.
> {code}
> val schema = StructType.fromDDL("a boolean, b int")
> val rows = Seq(Row(null, 1))
> val rdd = sparkContext.parallelize(rows)
> val df = spark.createDataFrame(rdd, schema)
> checkAnswer(df.where("(NOT a) OR a"), Seq.empty)
> {code}
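[Editor's note] Why `(NOT a) OR a` must not be simplified to `TRUE`: under SQL's three-valued logic a NULL operand propagates, so for `a = NULL` the predicate evaluates to NULL and the row is filtered out. A minimal model of the two operators:

```python
def sql_not(x):
    """SQL NOT: NULL stays NULL, otherwise ordinary negation."""
    return None if x is None else (not x)

def sql_or(x, y):
    """SQL OR: TRUE dominates; otherwise NULL propagates."""
    if x is True or y is True:
        return True
    if x is None or y is None:
        return None
    return False
```

For `a = NULL`, `sql_or(sql_not(None), None)` is NULL, so the WHERE clause drops the row; rewriting the predicate to TRUE would wrongly keep it, which is exactly the wrong answer in the test case above.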
[jira] [Resolved] (SPARK-25357) Add metadata to SparkPlanInfo to dump more information like file path to event log
[ https://issues.apache.org/jira/browse/SPARK-25357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-25357.
-
    Resolution: Fixed
    Fix Version/s: 2.3.2, 2.4.0

Issue resolved by pull request 22353
[https://github.com/apache/spark/pull/22353]

> Add metadata to SparkPlanInfo to dump more information like file path to event log
>
> Key: SPARK-25357
> URL: https://issues.apache.org/jira/browse/SPARK-25357
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Lantao Jin
> Priority: Minor
> Fix For: 2.4.0, 2.3.2
>
> The field {{metadata}} was removed from {{SparkPlanInfo}} in SPARK-17701. Correspondingly, this field was also removed from the event {{SparkListenerSQLExecutionStart}} in the Spark event log. If we want to analyze the event log to get fields which are wider than 100 characters (e.g. the Location or ReadSchema of a FileScan), they are abbreviated in the {{simpleString}} of the SparkPlanInfo JSON or in the {{physicalPlanDescription}} JSON.
> Before 2.3, the fragment of SparkListenerSQLExecutionStart in the event log contained the metadata field:
> {quote}Location: InMemoryFileIndex[hdfs://hercules/sys/edw/prs_idm/idm_cbt_am_t/cbt/cbt_acct_prfl_info/snapshot/dt..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct"
> {quote}
> So I added this field back to the SparkPlanInfo class. Then it will log the metadata to the event log. Intact information in the event log is very useful for offline job analysis.
[jira] [Resolved] (SPARK-25387) Malformed CSV causes NPE
[ https://issues.apache.org/jira/browse/SPARK-25387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-25387.
-
    Resolution: Fixed
    Fix Version/s: 2.4.0

Issue resolved by pull request 22374
[https://github.com/apache/spark/pull/22374]

> Malformed CSV causes NPE
>
> Key: SPARK-25387
> URL: https://issues.apache.org/jira/browse/SPARK-25387
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Maxim Gekk
> Assignee: Maxim Gekk
> Priority: Major
> Fix For: 2.4.0
>
> Loading a malformed CSV file or dataset can cause a NullPointerException, for example the code:
> {code:scala}
> val schema = StructType(StructField("a", IntegerType) :: Nil)
> val input = spark.createDataset(Seq("\u\u\u0001234"))
> spark.read.schema(schema).csv(input).collect()
> {code}
> crashes with the exception:
> {code:java}
> Caused by: java.lang.NullPointerException
> at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:219)
> at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:210)
> at org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523)
> at org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523)
> at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:68)
> {code}
> If a schema is not specified, the following exception is thrown:
> {code:java}
> java.lang.NullPointerException
> at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192)
> at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:192)
> at scala.collection.IndexedSeqOptimized$class.zipWithIndex(IndexedSeqOptimized.scala:99)
> at scala.collection.mutable.ArrayOps$ofRef.zipWithIndex(ArrayOps.scala:186)
> at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.makeSafeHeader(CSVDataSource.scala:109)
> at org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:247)
> {code}
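[Editor's note] The NPE arises because a parser that returns no tokens is not guarded before the result is used. A failure-safe wrapper in the spirit of Spark's CSV parse modes (PERMISSIVE / DROPMALFORMED / FAILFAST) might look like this sketch; the function and its `parse` callback are hypothetical, not Spark's FailureSafeParser API.

```python
def failure_safe_parse(lines, parse, num_fields, mode="PERMISSIVE"):
    """Parse lines, handling malformed input according to mode.

    parse(line) is expected to return a list of fields, or None / raise
    ValueError for malformed input.
    """
    rows = []
    for line in lines:
        try:
            row = parse(line)
            if row is None:
                raise ValueError("no tokens parsed from %r" % line)
            rows.append(row)
        except ValueError:
            if mode == "PERMISSIVE":
                rows.append([None] * num_fields)  # keep a row of nulls
            elif mode == "DROPMALFORMED":
                continue                          # silently drop the line
            else:  # FAILFAST
                raise
    return rows
```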
[jira] [Assigned] (SPARK-25387) Malformed CSV causes NPE
[ https://issues.apache.org/jira/browse/SPARK-25387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-25387: --- Assignee: Maxim Gekk > Malformed CSV causes NPE > > > Key: SPARK-25387 > URL: https://issues.apache.org/jira/browse/SPARK-25387 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 2.4.0 > > > Loading a malformed CSV files or a dataset can cause NullPointerException, > for example the code: > {code:scala} > val schema = StructType(StructField("a", IntegerType) :: Nil) > val input = spark.createDataset(Seq("\u\u\u0001234")) > spark.read.schema(schema).csv(input).collect() > {code} > crashes with the exception: > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:219) > at > org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:210) > at > org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523) > at > org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523) > at > org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:68) > {code} > If schema is not specified, the following exception is thrown: > {code:java} > java.lang.NullPointerException > at > scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192) > at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:192) > at > scala.collection.IndexedSeqOptimized$class.zipWithIndex(IndexedSeqOptimized.scala:99) > at > scala.collection.mutable.ArrayOps$ofRef.zipWithIndex(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.datasources.csv.CSVDataSource.makeSafeHeader(CSVDataSource.scala:109) > at > 
org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:247) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23820) Allow the long form of call sites to be recorded in the log
[ https://issues.apache.org/jira/browse/SPARK-23820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-23820.
-
    Resolution: Fixed
    Fix Version/s: 2.4.0

Issue resolved by pull request 22398
[https://github.com/apache/spark/pull/22398]

> Allow the long form of call sites to be recorded in the log
>
> Key: SPARK-23820
> URL: https://issues.apache.org/jira/browse/SPARK-23820
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: Michael Mior
> Assignee: Michael Mior
> Priority: Trivial
> Fix For: 2.4.0
>
> It would be nice if the long form of the call site information could be included in the log. An example of what I'm proposing is here:
> https://github.com/michaelmior/spark/commit/4b4076cfb1d51ceb20fd2b0a3b1b5be2aebb6416
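[Editor's note] The short vs. long call-site distinction can be sketched with Python's `traceback` module: the short form names only the immediate caller, while the long form records the whole stack. This function is an illustration of the idea, not Spark's `Utils.getCallSite`.

```python
import traceback

def call_site(long_form=False, max_depth=20):
    """Return the caller's location; the long form includes the whole stack."""
    stack = traceback.extract_stack()[:-1]  # drop this function's own frame
    frames = stack[-max_depth:] if long_form else stack[-1:]
    return "\n".join(
        "%s:%d in %s" % (f.filename, f.lineno, f.name) for f in frames
    )
```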
[jira] [Comment Edited] (SPARK-25291) Flakiness of tests in terms of executor memory (SecretsTestSuite)
[ https://issues.apache.org/jira/browse/SPARK-25291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612894#comment-16612894 ]

Stavros Kontopoulos edited comment on SPARK-25291 at 9/13/18 12:41 AM:

[~ifilonenko] I can have a look. A bit weird, but kind of expected, as these tests are the only ones that use the fabric8io client to connect to the running pod. There is no good way to do this right now.

was (Author: skonto):
[~ifilonenko] I can have a look; a bit weird.

> Flakiness of tests in terms of executor memory (SecretsTestSuite)
>
> Key: SPARK-25291
> URL: https://issues.apache.org/jira/browse/SPARK-25291
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 2.4.0
> Reporter: Ilan Filonenko
> Priority: Major
>
> SecretsTestSuite shows flakiness in terms of correct setting of executor memory:
> Run SparkPi with env and mount secrets. *** FAILED ***
> "[884]Mi" did not equal "[1408]Mi" (KubernetesSuite.scala:272)
> when run with default settings.
[jira] [Comment Edited] (SPARK-25291) Flakiness of tests in terms of executor memory (SecretsTestSuite)
[ https://issues.apache.org/jira/browse/SPARK-25291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612894#comment-16612894 ]

Stavros Kontopoulos edited comment on SPARK-25291 at 9/13/18 12:41 AM:

[~ifilonenko] I can have a look. A bit weird, but kind of expected, as these tests are the only ones that use the fabric8io client to connect to the running pod. There is no good way to do this right now. I will try to debug it.

was (Author: skonto):
[~ifilonenko] I can have a look. A bit weird, but kind of expected, as these tests are the only ones that use the fabric8io client to connect to the running pod. There is no good way to do this right now.

> Flakiness of tests in terms of executor memory (SecretsTestSuite)
>
> Key: SPARK-25291
> URL: https://issues.apache.org/jira/browse/SPARK-25291
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 2.4.0
> Reporter: Ilan Filonenko
> Priority: Major
>
> SecretsTestSuite shows flakiness in terms of correct setting of executor memory:
> Run SparkPi with env and mount secrets. *** FAILED ***
> "[884]Mi" did not equal "[1408]Mi" (KubernetesSuite.scala:272)
> when run with default settings.
[jira] [Commented] (SPARK-25291) Flakiness of tests in terms of executor memory (SecretsTestSuite)
[ https://issues.apache.org/jira/browse/SPARK-25291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612894#comment-16612894 ]

Stavros Kontopoulos commented on SPARK-25291:

[~ifilonenko] I can have a look; a bit weird.

> Flakiness of tests in terms of executor memory (SecretsTestSuite)
>
> Key: SPARK-25291
> URL: https://issues.apache.org/jira/browse/SPARK-25291
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 2.4.0
> Reporter: Ilan Filonenko
> Priority: Major
>
> SecretsTestSuite shows flakiness in terms of correct setting of executor memory:
> Run SparkPi with env and mount secrets. *** FAILED ***
> "[884]Mi" did not equal "[1408]Mi" (KubernetesSuite.scala:272)
> when run with default settings.
[jira] [Commented] (SPARK-23012) Support for predicate pushdown and partition pruning when left joining large Hive tables
[ https://issues.apache.org/jira/browse/SPARK-23012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612881#comment-16612881 ] Yuming Wang commented on SPARK-23012: - It seems the following PR resolves your issue: https://github.com/apache/spark/pull/20816 > Support for predicate pushdown and partition pruning when left joining large > Hive tables > > > Key: SPARK-23012 > URL: https://issues.apache.org/jira/browse/SPARK-23012 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.2.0 >Reporter: Rick Kramer >Priority: Major > > We have a hive view which left outer joins several large, partitioned orc > hive tables together on date. When the view is used in a hive query, hive > pushes date predicates down into the joins and prunes the partitions for all > tables. When I use this view from pyspark, the predicate is only used to > prune the left-most table and all partitions from the additional tables are > selected. > For example, consider two partitioned hive tables a & b joined in a view: > create table a ( >a_val string > ) > partitioned by (ds string) > stored as orc; > create table b ( >b_val string > ) > partitioned by (ds string) > stored as orc; > create view example_view as > select > a_val > , b_val > , ds > from a > left outer join b on b.ds = a.ds > Then in pyspark you might try to query from the view filtering on ds: > spark.table('example_view').filter(F.col('ds') == '2018-01-01') > If table a and b are large, this results in a plan that filters a on ds = > 2018-01-01, but selects scans all partitions of table b. > If the join in the view is changed to an inner join, the predicate gets > pushed down to a & b and the partitions are pruned as you'd expect. > In practice, the view is fairly complex and contains a lot of business logic > we'd prefer not to replicate in pyspark if we can avoid it. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
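The rewrite the linked PR enables can be sanity-checked in plain Python (hypothetical row dicts, not Spark code): for an equality predicate on the join key, filtering after a left outer join yields the same rows as filtering both inputs first, which is why the date predicate could safely be pushed into b and its partitions pruned.

```python
# Minimal sketch: a left outer join on a single key column, used to show
# that a filter on the join key commutes with the join.
def left_outer_join(a_rows, b_rows, key):
    out = []
    for a in a_rows:
        matches = [b for b in b_rows if b[key] == a[key]]
        if matches:
            for b in matches:
                # merge, keeping the key from the left side
                out.append({**a, **{k: v for k, v in b.items() if k != key}})
        else:
            out.append({**a, "b_val": None})  # null-extend unmatched rows
    return out

a = [{"ds": "2018-01-01", "a_val": "x"}, {"ds": "2018-01-02", "a_val": "y"}]
b = [{"ds": "2018-01-01", "b_val": "p"}, {"ds": "2018-01-02", "b_val": "q"}]

ds = "2018-01-01"
# Filter applied after the join (the plan Spark produced here):
after = [r for r in left_outer_join(a, b, "ds") if r["ds"] == ds]
# Filter pushed to both sides first (partitions of b pruned):
pushed = left_outer_join([r for r in a if r["ds"] == ds],
                         [r for r in b if r["ds"] == ds], "ds")
assert after == pushed
```

Any left row surviving the `ds` filter can only match right rows with that same `ds`, so dropping the other right-side partitions up front cannot change the result.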
[jira] [Assigned] (SPARK-25416) ArrayPosition function may return incorrect result when right expression is implicitly downcasted.
[ https://issues.apache.org/jira/browse/SPARK-25416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25416: Assignee: Apache Spark > ArrayPosition function may return incorrect result when right expression is > implicitly downcasted. > -- > > Key: SPARK-25416 > URL: https://issues.apache.org/jira/browse/SPARK-25416 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dilip Biswal >Assignee: Apache Spark >Priority: Major > > In ArrayPosition, we currently cast the right hand side expression to match > the element type of the left hand side Array. This may result in down casting > and may return wrong result or questionable result. > Example : > spark-sql> select array_position(array(1), 1.34); > 1 > spark-sql> select array_position(array(1), 'foo'); > null > We should safely coerce both left and right hand side expressions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25416) ArrayPosition function may return incorrect result when right expression is implicitly downcasted.
[ https://issues.apache.org/jira/browse/SPARK-25416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612817#comment-16612817 ] Apache Spark commented on SPARK-25416: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/22407 > ArrayPosition function may return incorrect result when right expression is > implicitly downcasted. > -- > > Key: SPARK-25416 > URL: https://issues.apache.org/jira/browse/SPARK-25416 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dilip Biswal >Priority: Major > > In ArrayPosition, we currently cast the right hand side expression to match > the element type of the left hand side Array. This may result in down casting > and may return wrong result or questionable result. > Example : > spark-sql> select array_position(array(1), 1.34); > 1 > spark-sql> select array_position(array(1), 'foo'); > null > We should safely coerce both left and right hand side expressions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25416) ArrayPosition function may return incorrect result when right expression is implicitly downcasted.
[ https://issues.apache.org/jira/browse/SPARK-25416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612818#comment-16612818 ] Apache Spark commented on SPARK-25416: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/22407 > ArrayPosition function may return incorrect result when right expression is > implicitly downcasted. > -- > > Key: SPARK-25416 > URL: https://issues.apache.org/jira/browse/SPARK-25416 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dilip Biswal >Priority: Major > > In ArrayPosition, we currently cast the right hand side expression to match > the element type of the left hand side Array. This may result in down casting > and may return wrong result or questionable result. > Example : > spark-sql> select array_position(array(1), 1.34); > 1 > spark-sql> select array_position(array(1), 'foo'); > null > We should safely coerce both left and right hand side expressions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25416) ArrayPosition function may return incorrect result when right expression is implicitly downcasted.
[ https://issues.apache.org/jira/browse/SPARK-25416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25416: Assignee: (was: Apache Spark) > ArrayPosition function may return incorrect result when right expression is > implicitly downcasted. > -- > > Key: SPARK-25416 > URL: https://issues.apache.org/jira/browse/SPARK-25416 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dilip Biswal >Priority: Major > > In ArrayPosition, we currently cast the right hand side expression to match > the element type of the left hand side Array. This may result in down casting > and may return wrong result or questionable result. > Example : > spark-sql> select array_position(array(1), 1.34); > 1 > spark-sql> select array_position(array(1), 'foo'); > null > We should safely coerce both left and right hand side expressions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25417) ArrayContains function may return incorrect result when right expression is implicitly down casted
[ https://issues.apache.org/jira/browse/SPARK-25417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25417: Assignee: Apache Spark > ArrayContains function may return incorrect result when right expression is > implicitly down casted > -- > > Key: SPARK-25417 > URL: https://issues.apache.org/jira/browse/SPARK-25417 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dilip Biswal >Assignee: Apache Spark >Priority: Major > > In ArrayContains, we currently cast the right hand side expression to match > the element type of the left hand side Array. This may result in down-casting > and return a wrong or questionable result. > Example : > {code:java} > spark-sql> select array_contains(array(1), 1.34); > true > {code} > > {code:java} > spark-sql> select array_contains(array(1), 'foo'); > null > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25417) ArrayContains function may return incorrect result when right expression is implicitly down casted
[ https://issues.apache.org/jira/browse/SPARK-25417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612816#comment-16612816 ] Apache Spark commented on SPARK-25417: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/22408 > ArrayContains function may return incorrect result when right expression is > implicitly down casted > -- > > Key: SPARK-25417 > URL: https://issues.apache.org/jira/browse/SPARK-25417 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dilip Biswal >Priority: Major > > In ArrayContains, we currently cast the right hand side expression to match > the element type of the left hand side Array. This may result in down-casting > and return a wrong or questionable result. > Example : > {code:java} > spark-sql> select array_contains(array(1), 1.34); > true > {code} > > {code:java} > spark-sql> select array_contains(array(1), 'foo'); > null > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25417) ArrayContains function may return incorrect result when right expression is implicitly down casted
[ https://issues.apache.org/jira/browse/SPARK-25417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25417: Assignee: (was: Apache Spark) > ArrayContains function may return incorrect result when right expression is > implicitly down casted > -- > > Key: SPARK-25417 > URL: https://issues.apache.org/jira/browse/SPARK-25417 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dilip Biswal >Priority: Major > > In ArrayContains, we currently cast the right hand side expression to match > the element type of the left hand side Array. This may result in down-casting > and return a wrong or questionable result. > Example : > {code:java} > spark-sql> select array_contains(array(1), 1.34); > true > {code} > > {code:java} > spark-sql> select array_contains(array(1), 'foo'); > null > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25417) ArrayContains function may return incorrect result when right expression is implicitly down casted
Dilip Biswal created SPARK-25417: Summary: ArrayContains function may return incorrect result when right expression is implicitly down casted Key: SPARK-25417 URL: https://issues.apache.org/jira/browse/SPARK-25417 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Dilip Biswal In ArrayContains, we currently cast the right hand side expression to match the element type of the left hand side Array. This may result in down-casting and return a wrong or questionable result. Example : {code:java} spark-sql> select array_contains(array(1), 1.34); true {code} {code:java} spark-sql> select array_contains(array(1), 'foo'); null {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
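The failure mode described above can be modeled in plain Python (a hypothetical sketch, not Spark's implementation): casting the right-hand side down to the array's element type drops the fractional part before the comparison ever happens.

```python
# Hypothetical model of the pre-fix ArrayContains behavior: the right-hand
# side is down-cast to the array's element type (int here) before comparing.
def array_contains_downcast(arr, value):
    cast_value = int(value)  # 1.34 -> 1: the fractional part is silently lost
    return cast_value in arr

# Mirrors `select array_contains(array(1), 1.34)` returning true,
# even though 1.34 is not an element of array(1).
assert array_contains_downcast([1], 1.34) is True
```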
[jira] [Commented] (SPARK-25415) Make plan change log in RuleExecutor configurable by SQLConf
[ https://issues.apache.org/jira/browse/SPARK-25415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612804#comment-16612804 ] Apache Spark commented on SPARK-25415: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/22407 > Make plan change log in RuleExecutor configurable by SQLConf > > > Key: SPARK-25415 > URL: https://issues.apache.org/jira/browse/SPARK-25415 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maryann Xue >Priority: Minor > > In RuleExecutor, after applying a rule, if the plan has changed, the before > and after plan will be logged using level "trace". At times, however, such > information can be very helpful for debugging, so making the log level > configurable in SQLConf would allow users to turn on the plan change log > independently and save the trouble of tweaking log4j settings. > Meanwhile, filtering plan change log for specific rules can also be very > useful. > So I propose adding two confs: > 1. spark.sql.optimizer.planChangeLog.level - set a specific log level for > logging plan changes after a rule is applied. > 2. spark.sql.optimizer.planChangeLog.rules - enable plan change logging only > for a set of specified rules, separated by commas. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25415) Make plan change log in RuleExecutor configurable by SQLConf
[ https://issues.apache.org/jira/browse/SPARK-25415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612803#comment-16612803 ] Apache Spark commented on SPARK-25415: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/22407 > Make plan change log in RuleExecutor configurable by SQLConf > > > Key: SPARK-25415 > URL: https://issues.apache.org/jira/browse/SPARK-25415 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maryann Xue >Priority: Minor > > In RuleExecutor, after applying a rule, if the plan has changed, the before > and after plan will be logged using level "trace". At times, however, such > information can be very helpful for debugging, so making the log level > configurable in SQLConf would allow users to turn on the plan change log > independently and save the trouble of tweaking log4j settings. > Meanwhile, filtering plan change log for specific rules can also be very > useful. > So I propose adding two confs: > 1. spark.sql.optimizer.planChangeLog.level - set a specific log level for > logging plan changes after a rule is applied. > 2. spark.sql.optimizer.planChangeLog.rules - enable plan change logging only > for a set of specified rules, separated by commas. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25415) Make plan change log in RuleExecutor configurable by SQLConf
[ https://issues.apache.org/jira/browse/SPARK-25415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25415: Assignee: (was: Apache Spark) > Make plan change log in RuleExecutor configurable by SQLConf > > > Key: SPARK-25415 > URL: https://issues.apache.org/jira/browse/SPARK-25415 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maryann Xue >Priority: Minor > > In RuleExecutor, after applying a rule, if the plan has changed, the before > and after plan will be logged using level "trace". At times, however, such > information can be very helpful for debugging, so making the log level > configurable in SQLConf would allow users to turn on the plan change log > independently and save the trouble of tweaking log4j settings. > Meanwhile, filtering plan change log for specific rules can also be very > useful. > So I propose adding two confs: > 1. spark.sql.optimizer.planChangeLog.level - set a specific log level for > logging plan changes after a rule is applied. > 2. spark.sql.optimizer.planChangeLog.rules - enable plan change logging only > for a set of specified rules, separated by commas. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25415) Make plan change log in RuleExecutor configurable by SQLConf
[ https://issues.apache.org/jira/browse/SPARK-25415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612766#comment-16612766 ] Apache Spark commented on SPARK-25415: -- User 'maryannxue' has created a pull request for this issue: https://github.com/apache/spark/pull/22406 > Make plan change log in RuleExecutor configurable by SQLConf > > > Key: SPARK-25415 > URL: https://issues.apache.org/jira/browse/SPARK-25415 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maryann Xue >Priority: Minor > > In RuleExecutor, after applying a rule, if the plan has changed, the before > and after plan will be logged using level "trace". At times, however, such > information can be very helpful for debugging, so making the log level > configurable in SQLConf would allow users to turn on the plan change log > independently and save the trouble of tweaking log4j settings. > Meanwhile, filtering plan change log for specific rules can also be very > useful. > So I propose adding two confs: > 1. spark.sql.optimizer.planChangeLog.level - set a specific log level for > logging plan changes after a rule is applied. > 2. spark.sql.optimizer.planChangeLog.rules - enable plan change logging only > for a set of specified rules, separated by commas. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25415) Make plan change log in RuleExecutor configurable by SQLConf
[ https://issues.apache.org/jira/browse/SPARK-25415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25415: Assignee: Apache Spark > Make plan change log in RuleExecutor configurable by SQLConf > > > Key: SPARK-25415 > URL: https://issues.apache.org/jira/browse/SPARK-25415 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maryann Xue >Assignee: Apache Spark >Priority: Minor > > In RuleExecutor, after applying a rule, if the plan has changed, the before > and after plan will be logged using level "trace". At times, however, such > information can be very helpful for debugging, so making the log level > configurable in SQLConf would allow users to turn on the plan change log > independently and save the trouble of tweaking log4j settings. > Meanwhile, filtering plan change log for specific rules can also be very > useful. > So I propose adding two confs: > 1. spark.sql.optimizer.planChangeLog.level - set a specific log level for > logging plan changes after a rule is applied. > 2. spark.sql.optimizer.planChangeLog.rules - enable plan change logging only > for a set of specified rules, separated by commas. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25416) ArrayPosition function may return incorrect result when right expression is implicitly downcasted.
[ https://issues.apache.org/jira/browse/SPARK-25416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dilip Biswal updated SPARK-25416: - Summary: ArrayPosition function may return incorrect result when right expression is implicitly downcasted. (was: ArrayPosition may return incorrect result when right expression is downcasted.) > ArrayPosition function may return incorrect result when right expression is > implicitly downcasted. > -- > > Key: SPARK-25416 > URL: https://issues.apache.org/jira/browse/SPARK-25416 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dilip Biswal >Priority: Major > > In ArrayPosition, we currently cast the right hand side expression to match > the element type of the left hand side Array. This may result in down casting > and may return wrong result or questionable result. > Example : > spark-sql> select array_position(array(1), 1.34); > 1 > spark-sql> select array_position(array(1), 'foo'); > null > We should safely coerce both left and right hand side expressions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25416) ArrayPosition may return incorrect result when right expression is downcasted.
Dilip Biswal created SPARK-25416: Summary: ArrayPosition may return incorrect result when right expression is downcasted. Key: SPARK-25416 URL: https://issues.apache.org/jira/browse/SPARK-25416 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Dilip Biswal In ArrayPosition, we currently cast the right hand side expression to match the element type of the left hand side Array. This may result in down casting and may return wrong result or questionable result. Example : spark-sql> select array_position(array(1), 1.34); 1 spark-sql> select array_position(array(1), 'foo'); null We should safely coerce both left and right hand side expressions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
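The two behaviors at stake can be contrasted in plain Python (a sketch, not Spark code): the current down-cast of the right-hand side versus the safer coercion of both sides to a common wider type that the ticket proposes.

```python
# Current behavior: down-cast the right-hand side to the element type.
def array_position_downcast(arr, value):
    cast_value = int(value)  # 1.34 -> 1: information lost before comparing
    for i, elem in enumerate(arr, start=1):  # array_position is 1-based
        if elem == cast_value:
            return i
    return 0  # Spark's array_position returns 0 when the value is not found

# Proposed direction: widen both sides to a common type instead.
def array_position_coerced(arr, value):
    for i, elem in enumerate(arr, start=1):
        if float(elem) == float(value):
            return i
    return 0

assert array_position_downcast([1], 1.34) == 1  # spurious match, as in the repro
assert array_position_coerced([1], 1.34) == 0   # no match, as expected
```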
[jira] [Created] (SPARK-25415) Make plan change log in RuleExecutor configurable by SQLConf
Maryann Xue created SPARK-25415: --- Summary: Make plan change log in RuleExecutor configurable by SQLConf Key: SPARK-25415 URL: https://issues.apache.org/jira/browse/SPARK-25415 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Maryann Xue In RuleExecutor, after applying a rule, if the plan has changed, the before and after plan will be logged using level "trace". At times, however, such information can be very helpful for debugging, so making the log level configurable in SQLConf would allow users to turn on the plan change log independently and save the trouble of tweaking log4j settings. Meanwhile, filtering plan change log for specific rules can also be very useful. So I propose adding two confs: 1. spark.sql.optimizer.planChangeLog.level - set a specific log level for logging plan changes after a rule is applied. 2. spark.sql.optimizer.planChangeLog.rules - enable plan change logging only for a set of specified rules, separated by commas. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
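If the proposal is adopted, usage might look like the following PySpark fragment. Note this is a sketch under assumptions: the conf names are only proposed in this ticket, the rule class names are illustrative, and `spark` is assumed to be an existing SparkSession.

```python
# Hypothetical usage of the two proposed confs (names not yet merged).
spark.conf.set("spark.sql.optimizer.planChangeLog.level", "WARN")
spark.conf.set("spark.sql.optimizer.planChangeLog.rules",
               "org.apache.spark.sql.catalyst.optimizer.BooleanSimplification,"
               "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate")
```

With a level above the logger threshold, before/after plans for just the listed rules would appear without touching log4j settings.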
[jira] [Commented] (SPARK-25295) Pod names conflicts in client mode, if previous submission was not a clean shutdown.
[ https://issues.apache.org/jira/browse/SPARK-25295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612662#comment-16612662 ] Apache Spark commented on SPARK-25295: -- User 'skonto' has created a pull request for this issue: https://github.com/apache/spark/pull/22405 > Pod names conflicts in client mode, if previous submission was not a clean > shutdown. > > > Key: SPARK-25295 > URL: https://issues.apache.org/jira/browse/SPARK-25295 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Prashant Sharma >Priority: Major > > If the previous job was killed somehow, e.g. by the client disconnecting, it > leaves behind executor pods named spark-exec-#, which cause naming > conflicts and failures for the next job submission. > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://:6443/api/v1/namespaces/default/pods. Message: pods > "spark-exec-4" already exists. Received status: Status(apiVersion=v1, > code=409, details=StatusDetails(causes=[], group=null, kind=pods, > name=spark-exec-4, retryAfterSeconds=null, uid=null, > additionalProperties={}), kind=Status, message=pods "spark-exec-4" already > exists, metadata=ListMeta(resourceVersion=null, selfLink=null, > additionalProperties={}), reason=AlreadyExists, status=Failure, > additionalProperties={}). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25295) Pod names conflicts in client mode, if previous submission was not a clean shutdown.
[ https://issues.apache.org/jira/browse/SPARK-25295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25295: Assignee: Apache Spark > Pod names conflicts in client mode, if previous submission was not a clean > shutdown. > > > Key: SPARK-25295 > URL: https://issues.apache.org/jira/browse/SPARK-25295 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Prashant Sharma >Assignee: Apache Spark >Priority: Major > > If the previous job was killed somehow, e.g. by the client disconnecting, it > leaves behind executor pods named spark-exec-#, which cause naming > conflicts and failures for the next job submission. > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://:6443/api/v1/namespaces/default/pods. Message: pods > "spark-exec-4" already exists. Received status: Status(apiVersion=v1, > code=409, details=StatusDetails(causes=[], group=null, kind=pods, > name=spark-exec-4, retryAfterSeconds=null, uid=null, > additionalProperties={}), kind=Status, message=pods "spark-exec-4" already > exists, metadata=ListMeta(resourceVersion=null, selfLink=null, > additionalProperties={}), reason=AlreadyExists, status=Failure, > additionalProperties={}). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25295) Pod names conflicts in client mode, if previous submission was not a clean shutdown.
[ https://issues.apache.org/jira/browse/SPARK-25295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25295: Assignee: (was: Apache Spark) > Pod names conflicts in client mode, if previous submission was not a clean > shutdown. > > > Key: SPARK-25295 > URL: https://issues.apache.org/jira/browse/SPARK-25295 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Prashant Sharma >Priority: Major > > If the previous job was killed somehow, e.g. by the client disconnecting, it > leaves behind executor pods named spark-exec-#, which cause naming > conflicts and failures for the next job submission. > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://:6443/api/v1/namespaces/default/pods. Message: pods > "spark-exec-4" already exists. Received status: Status(apiVersion=v1, > code=409, details=StatusDetails(causes=[], group=null, kind=pods, > name=spark-exec-4, retryAfterSeconds=null, uid=null, > additionalProperties={}), kind=Status, message=pods "spark-exec-4" already > exists, metadata=ListMeta(resourceVersion=null, selfLink=null, > additionalProperties={}), reason=AlreadyExists, status=Failure, > additionalProperties={}). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20350) Apply Complementation Laws during boolean expression simplification
[ https://issues.apache.org/jira/browse/SPARK-20350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-20350: -- Component/s: (was: Optimizer) SQL > Apply Complementation Laws during boolean expression simplification > --- > > Key: SPARK-20350 > URL: https://issues.apache.org/jira/browse/SPARK-20350 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Michael Styles >Assignee: Michael Styles >Priority: Major > Fix For: 2.2.0, 2.3.0 > > > Apply Complementation Laws during boolean expression simplification. > * A AND NOT(A) == FALSE > * A OR NOT(A) == TRUE -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-20799) Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL
[ https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-20799. - > Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL > - > > Key: SPARK-20799 > URL: https://issues.apache.org/jira/browse/SPARK-20799 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: Hadoop 2.8.0 binaries >Reporter: Jork Zijlstra >Priority: Minor > > We are getting the following exception: > {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. > It must be specified manually.{code} > Combining the following factors will cause it: > - Use S3 > - Use format ORC > - Don't apply partitioning on the data > - Embed AWS credentials in the path > The problem is in the PartitioningAwareFileIndex def allFiles() > {code} > leafDirToChildrenFiles.get(qualifiedPath) > .orElse { leafFiles.get(qualifiedPath).map(Array(_)) } > .getOrElse(Array.empty) > {code} > leafDirToChildrenFiles uses the path WITHOUT credentials as its key while the > qualifiedPath contains the path WITH credentials. > So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, so no > data is read and the schema cannot be defined. > Spark does output the S3xLoginHelper:90 - The Filesystem URI contains login > details. This is insecure and may be unsupported in future., but this should > not mean that it shouldn't work anymore. > Workaround: > Move the AWS credentials from the path to the SparkSession > {code} > SparkSession.builder > .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId}) > .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey}) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20799) Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL
[ https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-20799. --- Resolution: Won't Fix > Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL > - > > Key: SPARK-20799 > URL: https://issues.apache.org/jira/browse/SPARK-20799 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: Hadoop 2.8.0 binaries >Reporter: Jork Zijlstra >Priority: Minor > > We are getting the following exception: > {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. > It must be specified manually.{code} > Combining the following factors will cause it: > - Use S3 > - Use format ORC > - Don't apply partitioning on the data > - Embed AWS credentials in the path > The problem is in the PartitioningAwareFileIndex def allFiles() > {code} > leafDirToChildrenFiles.get(qualifiedPath) > .orElse { leafFiles.get(qualifiedPath).map(Array(_)) } > .getOrElse(Array.empty) > {code} > leafDirToChildrenFiles uses the path WITHOUT credentials as its key while the > qualifiedPath contains the path WITH credentials. > So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, so no > data is read and the schema cannot be defined. > Spark does output the S3xLoginHelper:90 - The Filesystem URI contains login > details. This is insecure and may be unsupported in future., but this should > not mean that it shouldn't work anymore. > Workaround: > Move the AWS credentials from the path to the SparkSession > {code} > SparkSession.builder > .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId}) > .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey}) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20799) Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL
[ https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612642#comment-16612642 ] Dongjoon Hyun commented on SPARK-20799: --- +1 for closing this as a WONTFIX. > Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL > - > > Key: SPARK-20799 > URL: https://issues.apache.org/jira/browse/SPARK-20799 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: Hadoop 2.8.0 binaries >Reporter: Jork Zijlstra >Priority: Minor > > We are getting the following exception: > {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. > It must be specified manually.{code} > Combining the following factors will cause it: > - Use S3 > - Use format ORC > - Don't apply partitioning on the data > - Embed AWS credentials in the path > The problem is in the PartitioningAwareFileIndex def allFiles() > {code} > leafDirToChildrenFiles.get(qualifiedPath) > .orElse { leafFiles.get(qualifiedPath).map(Array(_)) } > .getOrElse(Array.empty) > {code} > leafDirToChildrenFiles uses the path WITHOUT credentials as its key while the > qualifiedPath contains the path WITH credentials. > So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, so no > data is read and the schema cannot be defined. > Spark does output the S3xLoginHelper:90 - The Filesystem URI contains login > details. This is insecure and may be unsupported in future., but this should > not mean that it shouldn't work anymore. > Workaround: > Move the AWS credentials from the path to the SparkSession > {code} > SparkSession.builder > .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId}) > .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey}) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25402) Null handling in BooleanSimplification
[ https://issues.apache.org/jira/browse/SPARK-25402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612530#comment-16612530 ] Apache Spark commented on SPARK-25402: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/22403 > Null handling in BooleanSimplification > -- > > Key: SPARK-25402 > URL: https://issues.apache.org/jira/browse/SPARK-25402 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Blocker > > SPARK-20350 introduced a bug in BooleanSimplification's null handling. For > example, the following case returns a wrong answer. > {code} > val schema = StructType.fromDDL("a boolean, b int") > val rows = Seq(Row(null, 1)) > val rdd = sparkContext.parallelize(rows) > val df = spark.createDataFrame(rdd, schema) > checkAnswer(df.where("(NOT a) OR a"), Seq.empty) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
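Why the correct answer is empty: under SQL's three-valued logic, when `a` is NULL both `NOT a` and `(NOT a) OR a` evaluate to NULL, and a NULL filter condition does not keep the row. A minimal sketch of that logic, modelling NULL as `None` (an illustration only, not Spark's implementation):

{code}
// Three-valued logic: None models SQL NULL (unknown)
def not3(a: Option[Boolean]): Option[Boolean] = a.map(!_)

def or3(x: Option[Boolean], y: Option[Boolean]): Option[Boolean] = (x, y) match {
  case (Some(true), _) | (_, Some(true)) => Some(true)   // true OR anything = true
  case (Some(false), Some(false))        => Some(false)
  case _                                 => None         // unknown otherwise
}

val a: Option[Boolean] = None   // the row has a = NULL
or3(not3(a), a)                 // None: the filter must NOT keep the row
{code}

Simplifying `(NOT a) OR a` to `true` is therefore only valid when `a` is known to be non-nullable, which is exactly what the buggy rule ignored.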
[jira] [Commented] (SPARK-25402) Null handling in BooleanSimplification
[ https://issues.apache.org/jira/browse/SPARK-25402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612529#comment-16612529 ] Apache Spark commented on SPARK-25402: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/22403 > Null handling in BooleanSimplification > -- > > Key: SPARK-25402 > URL: https://issues.apache.org/jira/browse/SPARK-25402 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.1 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Blocker > > SPARK-20350 introduced a bug in BooleanSimplification's null handling. For > example, the following case returns a wrong answer. > {code} > val schema = StructType.fromDDL("a boolean, b int") > val rows = Seq(Row(null, 1)) > val rdd = sparkContext.parallelize(rows) > val df = spark.createDataFrame(rdd, schema) > checkAnswer(df.where("(NOT a) OR a"), Seq.empty) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25363) Schema pruning doesn't work if nested column is used in where clause
[ https://issues.apache.org/jira/browse/SPARK-25363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-25363. - Resolution: Fixed Fix Version/s: 2.4.0 3.0.0 Issue resolved by pull request 22357 [https://github.com/apache/spark/pull/22357] > Schema pruning doesn't work if nested column is used in where clause > > > Key: SPARK-25363 > URL: https://issues.apache.org/jira/browse/SPARK-25363 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 3.0.0, 2.4.0 > > > Schema pruning doesn't work if a nested column is used in the where clause. > For example, > {code} > sql("select name.first from contacts where name.first = 'David'") > == Physical Plan == > *(1) Project [name#19.first AS first#40] > +- *(1) Filter (isnotnull(name#19) && (name#19.first = David)) >+- *(1) FileScan parquet [name#19] Batched: false, Format: Parquet, > PartitionFilters: [], > PushedFilters: [IsNotNull(name)], ReadSchema: > struct<name:struct<...>> > {code} > In the above query plan, the scan node reads the entire schema of the `name` column. > This issue is reported by: > https://github.com/apache/spark/pull/21320#issuecomment-419290197 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
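For contrast, once pruning works the scan should only request the queried leaf field. A sketch with Spark's StructType API (the exact fields of the test `contacts` schema are not shown above, so only the selected field is spelled out):

{code}
import org.apache.spark.sql.types._

// The pruned read schema a fixed scan should request for
// `select name.first from contacts where name.first = 'David'`
val prunedReadSchema = StructType(Seq(
  StructField("name", StructType(Seq(
    StructField("first", StringType))))))

// Expected to render as struct<name:struct<first:string>>,
// instead of the full struct of all name sub-fields
{code}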
[jira] [Commented] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs
[ https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612436#comment-16612436 ] Joseph K. Bradley commented on SPARK-25321: --- You're right; these are breaking changes. If we're sticking with the rules, then we should revert these in branch-2.4, but we could keep them in master if the next release is 3.0. Is it easy to revert these PRs, or have they collected conflicts by now? > ML, Graph 2.4 QA: API: New Scala APIs, docs > --- > > Key: SPARK-25321 > URL: https://issues.apache.org/jira/browse/SPARK-25321 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 2.4.0 >Reporter: Weichen Xu >Assignee: Yanbo Liang >Priority: Blocker > > Audit new public Scala APIs added to MLlib & GraphX. Take note of: > * Protected/public classes or methods. If access can be more private, then > it should be. > * Also look for non-sealed traits. > * Documentation: Missing? Bad links or formatting? > *Make sure to check the object doc!* > As you find issues, please create JIRAs and link them to this issue. > For *user guide issues* link the new JIRAs to the relevant user guide QA issue -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25380) Generated plans occupy over 50% of Spark driver memory
[ https://issues.apache.org/jira/browse/SPARK-25380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612433#comment-16612433 ] Marcelo Vanzin commented on SPARK-25380: Yep. That's a 200MB plan description string... > Generated plans occupy over 50% of Spark driver memory > -- > > Key: SPARK-25380 > URL: https://issues.apache.org/jira/browse/SPARK-25380 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 > Environment: Spark 2.3.1 (AWS emr-5.16.0) > >Reporter: Michael Spector >Priority: Minor > Attachments: Screen Shot 2018-09-06 at 23.19.56.png, Screen Shot > 2018-09-12 at 8.20.05.png, heapdump_OOM.png > > > When debugging an OOM exception during long run of a Spark application (many > iterations of the same code) I've found that generated plans occupy most of > the driver memory. I'm not sure whether this is a memory leak or not, but it > would be helpful if old plans could be purged from memory anyways. > Attached are screenshots of OOM heap dump opened in JVisualVM. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4
[ https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612399#comment-16612399 ] Xiangrui Meng commented on SPARK-25378: --- Comments from [~vomjom] at https://github.com/tensorflow/ecosystem/pull/100: {quote} We currently only do releases along with TensorFlow releases, and the next one that'll include this is TF 1.12. {quote} This means Spark+TF users cannot migrate to Spark 2.4 until TF 1.12 is released. I think we need to decide based on the impact instead of just saying "this is not a public API". If it is not public, why didn't we hide it in the first place? And as [~cloud_fan] mentioned, it is hard to implement a data source without touching those "private" APIs. > ArrayData.toArray(StringType) assume UTF8String in 2.4 > -- > > Key: SPARK-25378 > URL: https://issues.apache.org/jira/browse/SPARK-25378 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Critical > > The following code works in 2.3.1 but fails in 2.4.0-SNAPSHOT: > {code} > import org.apache.spark.sql.catalyst.util._ > import org.apache.spark.sql.types.StringType > ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType) > res0: Array[String] = Array(a, b) > {code} > In 2.4.0-SNAPSHOT, the error is > {code}java.lang.ClassCastException: java.lang.String cannot be cast to > org.apache.spark.unsafe.types.UTF8String > at > org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75) > at > org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136) > at > org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136) > at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178) > ... 
51 elided > {code} > cc: [~cloud_fan] [~yogeshg] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
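As a hedged workaround sketch (not an official recommendation, since these are catalyst-internal classes and their behavior may change): in 2.4 the accessor casts elements to `UTF8String`, so storing `UTF8String` values instead of `java.lang.String` avoids the `ClassCastException`:

{code}
import org.apache.spark.sql.catalyst.util._
import org.apache.spark.sql.types.StringType
import org.apache.spark.unsafe.types.UTF8String

// Store UTF8String values so GenericArrayData.getUTF8String's cast succeeds
val data = ArrayData.toArrayData(Array("a", "b").map(UTF8String.fromString))
val out = data.toArray[UTF8String](StringType)  // elements as UTF8String, not String
{code}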
[jira] [Commented] (SPARK-25295) Pod names conflicts in client mode, if previous submission was not a clean shutdown.
[ https://issues.apache.org/jira/browse/SPARK-25295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612337#comment-16612337 ] Stavros Kontopoulos commented on SPARK-25295: - Guys, I started working on a short fix. > Pod names conflicts in client mode, if previous submission was not a clean > shutdown. > > > Key: SPARK-25295 > URL: https://issues.apache.org/jira/browse/SPARK-25295 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Prashant Sharma >Priority: Major > > If the previous job was killed somehow, e.g. by disconnecting the client, it > leaves behind executor pods named spark-exec-#, which cause naming > conflicts and failures for the next job submission. > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://:6443/api/v1/namespaces/default/pods. Message: pods > "spark-exec-4" already exists. Received status: Status(apiVersion=v1, > code=409, details=StatusDetails(causes=[], group=null, kind=pods, > name=spark-exec-4, retryAfterSeconds=null, uid=null, > additionalProperties={}), kind=Status, message=pods "spark-exec-4" already > exists, metadata=ListMeta(resourceVersion=null, selfLink=null, > additionalProperties={}), reason=AlreadyExists, status=Failure, > additionalProperties={}). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25352) Perform ordered global limit when limit number is bigger than topKSortFallbackThreshold
[ https://issues.apache.org/jira/browse/SPARK-25352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-25352: --- Assignee: Liang-Chi Hsieh > Perform ordered global limit when limit number is bigger than > topKSortFallbackThreshold > --- > > Key: SPARK-25352 > URL: https://issues.apache.org/jira/browse/SPARK-25352 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.4.0 > > > We have an optimization on global limit to evenly distribute limit rows across > all partitions. This optimization doesn't work for ordered results. > For a query ending with sort + limit, in most cases it is performed by > `TakeOrderedAndProjectExec`. > But if the limit number is bigger than `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD`, > global limit will be used. In that case, we need to do an ordered global limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25352) Perform ordered global limit when limit number is bigger than topKSortFallbackThreshold
[ https://issues.apache.org/jira/browse/SPARK-25352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25352. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22344 [https://github.com/apache/spark/pull/22344] > Perform ordered global limit when limit number is bigger than > topKSortFallbackThreshold > --- > > Key: SPARK-25352 > URL: https://issues.apache.org/jira/browse/SPARK-25352 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.4.0 > > > We have an optimization on global limit to evenly distribute limit rows across > all partitions. This optimization doesn't work for ordered results. > For a query ending with sort + limit, in most cases it is performed by > `TakeOrderedAndProjectExec`. > But if the limit number is bigger than `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD`, > global limit will be used. In that case, we need to do an ordered global limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
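The boundary between the two execution paths can be sketched as below. The threshold value is arbitrary, and the conf key is assumed to be the one backing `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD` (`spark.sql.execution.topKSortFallbackThreshold`):

{code}
// Hedged sketch: a limit below the threshold should plan as
// TakeOrderedAndProjectExec (top-K); a limit above it falls back to
// sort + global limit, which must still preserve the ordering.
spark.conf.set("spark.sql.execution.topKSortFallbackThreshold", "100")

spark.range(1000).orderBy("id").limit(10).explain()   // top-K path
spark.range(1000).orderBy("id").limit(500).explain()  // global-limit fallback path
{code}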
[jira] [Comment Edited] (SPARK-24627) [Spark2.3.0] After HDFS Token expire kinit not able to submit job using beeline
[ https://issues.apache.org/jira/browse/SPARK-24627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612085#comment-16612085 ] Ayush Anubhava edited comment on SPARK-24627 at 9/12/18 12:53 PM: -- Check the principal name given in spark-defaults.conf on the driver side. The principal name should include the realm so that, at renewal time, the HDFS delegation token can be issued to Spark was (Author: ayush007): Check the the principal name given in spark-default conf in driver side. > [Spark2.3.0] After HDFS Token expire kinit not able to submit job using > beeline > --- > > Key: SPARK-24627 > URL: https://issues.apache.org/jira/browse/SPARK-24627 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: OS: SUSE11 > Spark Version: 2.3.0 > Hadoop: 2.8.3 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > Steps: > beeline session was active. > 1. Launch spark-beeline > 2. create table alt_s1 (time timestamp, name string, isright boolean, > datetoday date, num binary, height double, score float, decimaler > decimal(10,0), id tinyint, age int, license bigint, length smallint) row > format delimited fields terminated by ','; > 3. load data local inpath '/opt/typeddata60.txt' into table alt_s1; > 4. show tables;( Table listed successfully ) > 5. 
select * from alt_s1; > Throws HDFS_DELEGATION_TOKEN Exception > 0: jdbc:hive2://10.18.18.214:23040/default> select * from alt_s1; > Error: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 1 in stage 22.0 failed 4 times, most recent failure: Lost task 1.3 in > stage 22.0 (TID 106, blr123110, executor 1): > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > token (HDFS_DELEGATION_TOKEN token 7 for spark) can't be found in cache > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1412) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) > at com.sun.proxy.$Proxy15.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:255) > at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1226) > at > org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1213) > at > org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1201) > at > org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:306) > at > org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:272) > at > org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:264) > at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1526) > at > 
org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304) > at > org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:109) > at > org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67) > at > org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:256) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > **Note: Even after kinit spark/hadoop token is not getting renewed.** > Now Launch spark sql session ( Select * from alt_s1 ) is successful. > 1. Launch spark-sql > 2.spark-sql> select * from
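A hedged sketch of the driver-side settings the comment refers to. The principal and keytab values are hypothetical placeholders, and the exact keys depend on the deployment; on YARN in Spark 2.3 these are the commonly used ones:

{code}
# spark-defaults.conf (hypothetical values; note the principal includes the realm)
spark.yarn.principal   spark/host.example.com@EXAMPLE.COM
spark.yarn.keytab      /etc/security/keytabs/spark.keytab
{code}

With the realm present, Spark can re-obtain HDFS delegation tokens when the old ones expire instead of failing with `SecretManager$InvalidToken`.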
[jira] [Commented] (SPARK-24627) [Spark2.3.0] After HDFS Token expire kinit not able to submit job using beeline
[ https://issues.apache.org/jira/browse/SPARK-24627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612085#comment-16612085 ] Ayush Anubhava commented on SPARK-24627: Check the principal name given in spark-defaults.conf on the driver side. > [Spark2.3.0] After HDFS Token expire kinit not able to submit job using > beeline > --- > > Key: SPARK-24627 > URL: https://issues.apache.org/jira/browse/SPARK-24627 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: OS: SUSE11 > Spark Version: 2.3.0 > Hadoop: 2.8.3 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > Steps: > beeline session was active. > 1. Launch spark-beeline > 2. create table alt_s1 (time timestamp, name string, isright boolean, > datetoday date, num binary, height double, score float, decimaler > decimal(10,0), id tinyint, age int, license bigint, length smallint) row > format delimited fields terminated by ','; > 3. load data local inpath '/opt/typeddata60.txt' into table alt_s1; > 4. show tables;( Table listed successfully ) > 5. 
select * from alt_s1; > Throws HDFS_DELEGATION_TOKEN Exception > 0: jdbc:hive2://10.18.18.214:23040/default> select * from alt_s1; > Error: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 1 in stage 22.0 failed 4 times, most recent failure: Lost task 1.3 in > stage 22.0 (TID 106, blr123110, executor 1): > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): > token (HDFS_DELEGATION_TOKEN token 7 for spark) can't be found in cache > at org.apache.hadoop.ipc.Client.call(Client.java:1475) > at org.apache.hadoop.ipc.Client.call(Client.java:1412) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) > at com.sun.proxy.$Proxy15.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:255) > at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source) > at > org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1226) > at > org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1213) > at > org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1201) > at > org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:306) > at > org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:272) > at > org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:264) > at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1526) > at > 
org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304) > at > org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:109) > at > org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67) > at > org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:256) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > **Note: Even after kinit spark/hadoop token is not getting renewed.** > Now Launch spark sql session ( Select * from alt_s1 ) is successful. > 1. Launch spark-sql > 2.spark-sql> select * from alt_s1; > 2018-06-22 14:24:04 INFO HiveMetaStore:746 - 0: get_table : db=test_one > tbl=alt_s1 > 2018-06-22 14:24:04 INFO audit:371 - ugi=spark/had...@hadoop.com > ip=unknown-ip-addr cmd=get_table : db=test_one tbl=alt_s1 > 2018-06-22 14:24:04 INFO
[jira] [Resolved] (SPARK-25371) Vector Assembler with no input columns leads to opaque error
[ https://issues.apache.org/jira/browse/SPARK-25371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25371. - Resolution: Fixed Assignee: Marco Gaido Fix Version/s: 2.4.0 2.3.2 > Vector Assembler with no input columns leads to opaque error > > > Key: SPARK-25371 > URL: https://issues.apache.org/jira/browse/SPARK-25371 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.3.0, 2.3.1 >Reporter: Victor Alor >Assignee: Marco Gaido >Priority: Trivial > Fix For: 2.3.2, 2.4.0 > > > When `VectorAssembler` is given an empty array as its input columns, it throws > an opaque error. In versions before 2.3, `VectorAssembler` simply > appends a column containing empty vectors. > > {code:java} > val inputCols = Array[String]() > val outputCol = "A" > val vectorAssembler = new VectorAssembler() > .setInputCols(inputCols) > .setOutputCol(outputCol) > vectorAssembler.transform(df) > {code} > In version 2.3 this throws the exception below > {code:java} > org.apache.spark.sql.AnalysisException: cannot resolve 'named_struct()' due > to data type mismatch: input to function named_struct requires at least one > argument;; > {code} > Whereas in versions before 2.3 it just adds a column containing an empty > vector. > I'm not certain if this is an intentional choice or an actual bug. If this is > a bug, the `VectorAssembler` should be modified to append an empty vector > column if it detects no inputCols. > > If it is a design decision it would be nice to throw a human-readable > exception explicitly stating that inputColumns must not be empty. The current > error is somewhat opaque. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results
[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612026#comment-16612026 ] Evelyn Bayes commented on SPARK-25150: -- Hey Peter, don't stress it. I'm new to the community as well but I've been busy, so all good :) > Joining DataFrames derived from the same source yields confusing/incorrect > results > -- > > Key: SPARK-25150 > URL: https://issues.apache.org/jira/browse/SPARK-25150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Nicholas Chammas >Priority: Major > Attachments: output-with-implicit-cross-join.txt, > output-without-implicit-cross-join.txt, persons.csv, states.csv, > zombie-analysis.py > > > I have two DataFrames, A and B. From B, I have derived two additional > DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very > confusing error: > {code:java} > Join condition is missing or trivial. > Either: use the CROSS JOIN syntax to allow cartesian products between these > relations, or: enable implicit cartesian products by setting the configuration > variable spark.sql.crossJoin.enabled=true; > {code} > Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, > Spark appears to give me incorrect answers. > I am not sure if I am missing something obvious, or if there is some kind of > bug here. The "join condition is missing" error is confusing and doesn't make > sense to me, and the seemingly incorrect output is concerning. > I've attached a reproduction, along with the output I'm seeing with and > without the implicit cross join enabled. > I realize the join I've written is not correct in the sense that it should be > a left outer join instead of an inner join (since some of the aggregates are > not available for all states), but that doesn't explain Spark's behavior. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20153) Support Multiple aws credentials in order to access multiple Hive on S3 table in spark application
[ https://issues.apache.org/jira/browse/SPARK-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612001#comment-16612001 ] Steve Loughran commented on SPARK-20153: bq. Amazon EMR does not currently support use of the Apache Hadoop S3A file system. The Amazon EMR team are free to copy and paste any parts of the ASF-licensed s3a code into their own closed-source connector to S3. The best thing you can do here is ask them to do so. BTW, the URL on S3A in EMR has changed; it's now a footnote in [https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html] > Support Multiple aws credentials in order to access multiple Hive on S3 table > in spark application > --- > > Key: SPARK-20153 > URL: https://issues.apache.org/jira/browse/SPARK-20153 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1, 2.1.0 >Reporter: Franck Tago >Priority: Minor > > I need to access multiple hive tables in my spark application where each hive > table is > 1- an external table with data sitting on S3 > 2- each table is owned by a different AWS user so I need to provide different > AWS credentials. > I am familiar with setting the aws credentials in the hadoop configuration > object but that does not really help me because I can only set one pair of > (fs.s3a.awsAccessKeyId, fs.s3a.awsSecretAccessKey) > From my research, there is no easy or elegant way to do this in Spark. > Why is that? > How do I address this use case? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
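One possible direction, as a hedged sketch: newer Hadoop releases (2.9+/3.x) support per-bucket S3A configuration, which lets different buckets carry different credentials without any Spark-side changes. This assumes the S3A connector (not S3N); the bucket names and key values below are hypothetical placeholders:

{code}
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: per-bucket S3A credentials via Hadoop's
// fs.s3a.bucket.<bucketname>.* options (Hadoop 2.9+/3.x), set through
// Spark's spark.hadoop.* passthrough.
val spark = SparkSession.builder()
  .config("spark.hadoop.fs.s3a.bucket.bucket-a.access.key", "ACCESS_KEY_A")
  .config("spark.hadoop.fs.s3a.bucket.bucket-a.secret.key", "SECRET_KEY_A")
  .config("spark.hadoop.fs.s3a.bucket.bucket-b.access.key", "ACCESS_KEY_B")
  .config("spark.hadoop.fs.s3a.bucket.bucket-b.secret.key", "SECRET_KEY_B")
  .getOrCreate()
{code}

Reads from s3a://bucket-a/... and s3a://bucket-b/... would then each use their own credential pair.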
[jira] [Commented] (SPARK-20799) Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL
[ https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611995#comment-16611995 ] Steve Loughran commented on SPARK-20799: Update: Hadoop 3.3+ will remove all support for user:secret in S3A URIs because it's impossible to keep those secrets out of logs, and logs get everywhere. No plans to backport that, though HADOOP-15747 will, giving people the specific Hadoop version where this dangerous feature gets pulled. I propose we close this as a WONTFIX. > Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL > - > > Key: SPARK-20799 > URL: https://issues.apache.org/jira/browse/SPARK-20799 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: Hadoop 2.8.0 binaries >Reporter: Jork Zijlstra >Priority: Minor > > We are getting the following exception: > {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. > It must be specified manually.{code} > Combining the following factors will cause it: > - Use S3 > - Use format ORC > - Don't apply a partitioning on the data > - Embed AWS credentials in the path > The problem is in the PartitioningAwareFileIndex def allFiles() > {code} > leafDirToChildrenFiles.get(qualifiedPath) > .orElse { leafFiles.get(qualifiedPath).map(Array(_)) } > .getOrElse(Array.empty) > {code} > leafDirToChildrenFiles uses the path WITHOUT credentials as its key while the > qualifiedPath contains the path WITH credentials. > So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, so no > data is read and the schema cannot be defined. > Spark does output the S3xLoginHelper:90 warning "The Filesystem URI contains login > details. This is insecure and may be unsupported in future.", but this should > not mean that it shouldn't work anymore. 
> Workaround: > Move the AWS credentials from the path to the SparkSession > {code} > SparkSession.builder > .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId}) > .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey}) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25414) The numInputRows metrics can be incorrect for streaming self-join
[ https://issues.apache.org/jira/browse/SPARK-25414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25414: Assignee: Wenchen Fan (was: Apache Spark) > The numInputRows metrics can be incorrect for streaming self-join > - > > Key: SPARK-25414 > URL: https://issues.apache.org/jira/browse/SPARK-25414 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25414) The numInputRows metrics can be incorrect for streaming self-join
[ https://issues.apache.org/jira/browse/SPARK-25414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25414: Assignee: Apache Spark (was: Wenchen Fan) > The numInputRows metrics can be incorrect for streaming self-join > - > > Key: SPARK-25414 > URL: https://issues.apache.org/jira/browse/SPARK-25414 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25414) The numInputRows metrics can be incorrect for streaming self-join
[ https://issues.apache.org/jira/browse/SPARK-25414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611932#comment-16611932 ] Apache Spark commented on SPARK-25414: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/22402 > The numInputRows metrics can be incorrect for streaming self-join > - > > Key: SPARK-25414 > URL: https://issues.apache.org/jira/browse/SPARK-25414 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25414) The numInputRows metrics can be incorrect for streaming self-join
Wenchen Fan created SPARK-25414: --- Summary: The numInputRows metrics can be incorrect for streaming self-join Key: SPARK-25414 URL: https://issues.apache.org/jira/browse/SPARK-25414 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.4.0 Reporter: Wenchen Fan Assignee: Wenchen Fan
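Outside Spark, the kind of inflation this ticket describes can be sketched with a hypothetical mock: one logical source feeds both legs of a self-join, and a metric incremented per leaf scan counts every input row twice. This is illustrative only, not Spark's actual metrics code.

```python
# One logical source; each leg of the self-join scans it and bumps the metric.
rows = [("a", 1), ("b", 2), ("c", 3)]

def scan(source, counter):
    # Simulates a leaf scan node that increments an input-rows counter.
    for r in source:
        counter[0] += 1
        yield r

counter = [0]
left = list(scan(rows, counter))    # first leg of the self-join
right = list(scan(rows, counter))   # second leg re-scans the same source
joined = [(l, r) for l in left for r in right if l[0] == r[0]]

# Only 3 rows ever entered the query, but the per-scan metric reports 6.
```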
[jira] [Updated] (SPARK-25413) [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done
[ https://issues.apache.org/jira/browse/SPARK-25413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandeep katta updated SPARK-25413: -- Attachment: decimalBoundaryDataHive.csv > [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done > -- > > Key: SPARK-25413 > URL: https://issues.apache.org/jira/browse/SPARK-25413 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 > Environment: Csv FIle content > > 1,23-07-2015,India,aaa1,phone197,ASD69643,1.23457E+16 > 2,24-07-2015,India,aaa2,phone756,ASD42892,1.23457E+16 > 3,25-07-2015,India,aaa3,phone1904,ASD37014,1.23457E+16 > 4,26-07-2015,India,aaa4,phone2435,ASD66902,1.23457E+16 > 5,27-07-2015,India,aaa5,phone2441,ASD90633,2.23457E+16 > 6,28-07-2015,India,aaa6,phone294,ASD59961,3.23457E+16 > 7,29-07-2015,India,aaa7,phone610,ASD14875,4.23457E+16 > 8,30-07-2015,India,aaa8,phone1848,ASD57308,5.23457E+16 > 9,18-07-2015,India,aaa9,phone706,ASD86717,6.23457E+16 > 10,19-07-2015,usa,aaa10,phone685,ASD30505,7.23457E+16 > 11,18-07-2015,china,aaa11,phone1554,ASD26101,8.23457E+16 >Reporter: sandeep katta >Priority: Blocker > Attachments: decimalBoundaryDataHive.csv > > > sql("create table if not exists hiveBigDecimal(ID Int, date Timestamp, > country String, name String, phonetype String, serialname String, salary > decimal(27, 10))row format delimited fields terminated by ','") > sql(s"LOAD DATA local inpath '$resourcesPath/decimalBoundaryDataHive.csv' > INTO table hiveBigDecimal") > sql("select avg(salary)+10 from hiveBigDecimal").show(false) > > Output with 2.3.1 > ++ > |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS > DECIMAL(32,14)))| > ++ > |37800224355780013.75982042536364| > ++ > OutPut with 2.3.2_RC5 > |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS > DECIMAL(32,14)))| > ++ > |37800224355780013.75982042536000 > | > + > *PS:If I revert SPARK-24957 then 2.3.1 and 2.3.2_rc5 output is same* > -- This message was 
[jira] [Updated] (SPARK-25413) [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done
[ https://issues.apache.org/jira/browse/SPARK-25413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandeep katta updated SPARK-25413: -- Attachment: (was: decimalBoundaryDataHive.csv) > [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done > -- > > Key: SPARK-25413 > URL: https://issues.apache.org/jira/browse/SPARK-25413 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 > Environment: Csv FIle content > > 1,23-07-2015,India,aaa1,phone197,ASD69643,1.23457E+16 > 2,24-07-2015,India,aaa2,phone756,ASD42892,1.23457E+16 > 3,25-07-2015,India,aaa3,phone1904,ASD37014,1.23457E+16 > 4,26-07-2015,India,aaa4,phone2435,ASD66902,1.23457E+16 > 5,27-07-2015,India,aaa5,phone2441,ASD90633,2.23457E+16 > 6,28-07-2015,India,aaa6,phone294,ASD59961,3.23457E+16 > 7,29-07-2015,India,aaa7,phone610,ASD14875,4.23457E+16 > 8,30-07-2015,India,aaa8,phone1848,ASD57308,5.23457E+16 > 9,18-07-2015,India,aaa9,phone706,ASD86717,6.23457E+16 > 10,19-07-2015,usa,aaa10,phone685,ASD30505,7.23457E+16 > 11,18-07-2015,china,aaa11,phone1554,ASD26101,8.23457E+16 >Reporter: sandeep katta >Priority: Blocker > > sql("create table if not exists hiveBigDecimal(ID Int, date Timestamp, > country String, name String, phonetype String, serialname String, salary > decimal(27, 10))row format delimited fields terminated by ','") > sql(s"LOAD DATA local inpath '$resourcesPath/decimalBoundaryDataHive.csv' > INTO table hiveBigDecimal") > sql("select avg(salary)+10 from hiveBigDecimal").show(false) > > Output with 2.3.1 > ++ > |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS > DECIMAL(32,14)))| > ++ > |37800224355780013.75982042536364| > ++ > OutPut with 2.3.2_RC5 > |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS > DECIMAL(32,14)))| > ++ > |37800224355780013.75982042536000 > | > + > *PS:If I revert SPARK-24957 then 2.3.1 and 2.3.2_rc5 output is same* > -- This message was sent by Atlassian JIRA (v7.6.3#76005) 
[jira] [Updated] (SPARK-25413) [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done
[ https://issues.apache.org/jira/browse/SPARK-25413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandeep katta updated SPARK-25413: -- Description: sql("create table if not exists hiveBigDecimal(ID Int, date Timestamp, country String, name String, phonetype String, serialname String, salary decimal(27, 10))row format delimited fields terminated by ','") sql(s"LOAD DATA local inpath '$resourcesPath/decimalBoundaryDataHive.csv' INTO table hiveBigDecimal") sql("select avg(salary)+10 from hiveBigDecimal").show(false) Output with 2.3.1 ++ |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS DECIMAL(32,14)))| ++ |37800224355780013.75982042536364| ++ OutPut with 2.3.2_RC5 |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS DECIMAL(32,14)))| ++ |37800224355780013.75982042536000 | + *PS:If I revert SPARK-24957 then 2.3.1 and 2.3.2_rc5 output is same* was: sql("create table if not exists hiveBigDecimal(ID Int, date Timestamp, country String, name String, phonetype String, serialname String, salary decimal(27, 10))row format delimited fields terminated by ','") sql(s"LOAD DATA local inpath '$resourcesPath/decimalBoundaryDataHive.csv' INTO table hiveBigDecimal") sql("select avg(salary)+10 from hiveBigDecimal").show(fals Output with 2.3.1 ++ |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS DECIMAL(32,14)))| ++ |37800224355780013.75982042536364 | ++ OutPut with 2.3.2_RC5 |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS DECIMAL(32,14)))| ++ |37800224355780013.75982042536000 | + *PS:If I revert SPARK-24957 then 2.3.1 and 2.3.2_rc5 output is same* > [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done > -- > > Key: SPARK-25413 > URL: https://issues.apache.org/jira/browse/SPARK-25413 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 > Environment: Csv FIle content > > 1,23-07-2015,India,aaa1,phone197,ASD69643,1.23457E+16 
> 2,24-07-2015,India,aaa2,phone756,ASD42892,1.23457E+16 > 3,25-07-2015,India,aaa3,phone1904,ASD37014,1.23457E+16 > 4,26-07-2015,India,aaa4,phone2435,ASD66902,1.23457E+16 > 5,27-07-2015,India,aaa5,phone2441,ASD90633,2.23457E+16 > 6,28-07-2015,India,aaa6,phone294,ASD59961,3.23457E+16 > 7,29-07-2015,India,aaa7,phone610,ASD14875,4.23457E+16 > 8,30-07-2015,India,aaa8,phone1848,ASD57308,5.23457E+16 > 9,18-07-2015,India,aaa9,phone706,ASD86717,6.23457E+16 > 10,19-07-2015,usa,aaa10,phone685,ASD30505,7.23457E+16 > 11,18-07-2015,china,aaa11,phone1554,ASD26101,8.23457E+16 >Reporter: sandeep katta >Priority: Blocker > Attachments: decimalBoundaryDataHive.csv > > > sql("create table if not exists hiveBigDecimal(ID Int, date Timestamp, > country String, name String, phonetype String, serialname String, salary > decimal(27, 10))row format delimited fields terminated by ','") > sql(s"LOAD DATA local inpath '$resourcesPath/decimalBoundaryDataHive.csv' > INTO table hiveBigDecimal") > sql("select avg(salary)+10 from hiveBigDecimal").show(false) > > Output with 2.3.1 > ++ > |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS > DECIMAL(32,14)))| > ++ > |37800224355780013.75982042536364| > ++ > OutPut with 2.3.2_RC5 > |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS > DECIMAL(32,14)))| > ++ > |37800224355780013.75982042536000
[jira] [Commented] (SPARK-25413) [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done
[ https://issues.apache.org/jira/browse/SPARK-25413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611759#comment-16611759 ] Ajith S commented on SPARK-25413: - Thank you for reporting the issue sandeep. I think the problem is with org.apache.spark.sql.catalyst.expressions.aggregate.AverageLike#sumDataType as it increases the precision unnecessarily. Adding a PR to fix this. Refer https://github.com/apache/spark/pull/22401 > [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done > -- > > Key: SPARK-25413 > URL: https://issues.apache.org/jira/browse/SPARK-25413 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 > Environment: Csv FIle content > > 1,23-07-2015,India,aaa1,phone197,ASD69643,1.23457E+16 > 2,24-07-2015,India,aaa2,phone756,ASD42892,1.23457E+16 > 3,25-07-2015,India,aaa3,phone1904,ASD37014,1.23457E+16 > 4,26-07-2015,India,aaa4,phone2435,ASD66902,1.23457E+16 > 5,27-07-2015,India,aaa5,phone2441,ASD90633,2.23457E+16 > 6,28-07-2015,India,aaa6,phone294,ASD59961,3.23457E+16 > 7,29-07-2015,India,aaa7,phone610,ASD14875,4.23457E+16 > 8,30-07-2015,India,aaa8,phone1848,ASD57308,5.23457E+16 > 9,18-07-2015,India,aaa9,phone706,ASD86717,6.23457E+16 > 10,19-07-2015,usa,aaa10,phone685,ASD30505,7.23457E+16 > 11,18-07-2015,china,aaa11,phone1554,ASD26101,8.23457E+16 >Reporter: sandeep katta >Priority: Blocker > Attachments: decimalBoundaryDataHive.csv > > > sql("create table if not exists hiveBigDecimal(ID Int, date Timestamp, > country String, name String, phonetype String, serialname String, salary > decimal(27, 10))row format delimited fields terminated by ','") > > sql(s"LOAD DATA local inpath '$resourcesPath/decimalBoundaryDataHive.csv' > INTO table hiveBigDecimal") > > sql("select avg(salary)+10 from hiveBigDecimal").show(fals > > Output with 2.3.1 > ++ > |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS > DECIMAL(32,14)))| > ++ > |37800224355780013.75982042536364 | > ++ > 
OutPut with 2.3.2_RC5 > |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS > DECIMAL(32,14)))| > ++ > |37800224355780013.75982042536000 > | > + > *PS:If I revert SPARK-24957 then 2.3.1 and 2.3.2_rc5 output is same* >
[jira] [Assigned] (SPARK-25413) [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done
[ https://issues.apache.org/jira/browse/SPARK-25413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25413: Assignee: (was: Apache Spark) > [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done > -- > > Key: SPARK-25413 > URL: https://issues.apache.org/jira/browse/SPARK-25413 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 > Environment: Csv FIle content > > 1,23-07-2015,India,aaa1,phone197,ASD69643,1.23457E+16 > 2,24-07-2015,India,aaa2,phone756,ASD42892,1.23457E+16 > 3,25-07-2015,India,aaa3,phone1904,ASD37014,1.23457E+16 > 4,26-07-2015,India,aaa4,phone2435,ASD66902,1.23457E+16 > 5,27-07-2015,India,aaa5,phone2441,ASD90633,2.23457E+16 > 6,28-07-2015,India,aaa6,phone294,ASD59961,3.23457E+16 > 7,29-07-2015,India,aaa7,phone610,ASD14875,4.23457E+16 > 8,30-07-2015,India,aaa8,phone1848,ASD57308,5.23457E+16 > 9,18-07-2015,India,aaa9,phone706,ASD86717,6.23457E+16 > 10,19-07-2015,usa,aaa10,phone685,ASD30505,7.23457E+16 > 11,18-07-2015,china,aaa11,phone1554,ASD26101,8.23457E+16 >Reporter: sandeep katta >Priority: Blocker > Attachments: decimalBoundaryDataHive.csv > > > sql("create table if not exists hiveBigDecimal(ID Int, date Timestamp, > country String, name String, phonetype String, serialname String, salary > decimal(27, 10))row format delimited fields terminated by ','") > > sql(s"LOAD DATA local inpath '$resourcesPath/decimalBoundaryDataHive.csv' > INTO table hiveBigDecimal") > > sql("select avg(salary)+10 from hiveBigDecimal").show(fals > > Output with 2.3.1 > ++ > |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS > DECIMAL(32,14)))| > ++ > |37800224355780013.75982042536364 | > ++ > OutPut with 2.3.2_RC5 > |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS > DECIMAL(32,14)))| > ++ > |37800224355780013.75982042536000 > | > + > *PS:If I revert SPARK-24957 then 2.3.1 and 2.3.2_rc5 output is same* > -- This message was sent by 
[jira] [Commented] (SPARK-25413) [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done
[ https://issues.apache.org/jira/browse/SPARK-25413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611758#comment-16611758 ] Apache Spark commented on SPARK-25413: -- User 'ajithme' has created a pull request for this issue: https://github.com/apache/spark/pull/22401 > [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done > -- > > Key: SPARK-25413 > URL: https://issues.apache.org/jira/browse/SPARK-25413 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 > Environment: Csv FIle content > > 1,23-07-2015,India,aaa1,phone197,ASD69643,1.23457E+16 > 2,24-07-2015,India,aaa2,phone756,ASD42892,1.23457E+16 > 3,25-07-2015,India,aaa3,phone1904,ASD37014,1.23457E+16 > 4,26-07-2015,India,aaa4,phone2435,ASD66902,1.23457E+16 > 5,27-07-2015,India,aaa5,phone2441,ASD90633,2.23457E+16 > 6,28-07-2015,India,aaa6,phone294,ASD59961,3.23457E+16 > 7,29-07-2015,India,aaa7,phone610,ASD14875,4.23457E+16 > 8,30-07-2015,India,aaa8,phone1848,ASD57308,5.23457E+16 > 9,18-07-2015,India,aaa9,phone706,ASD86717,6.23457E+16 > 10,19-07-2015,usa,aaa10,phone685,ASD30505,7.23457E+16 > 11,18-07-2015,china,aaa11,phone1554,ASD26101,8.23457E+16 >Reporter: sandeep katta >Priority: Blocker > Attachments: decimalBoundaryDataHive.csv > > > sql("create table if not exists hiveBigDecimal(ID Int, date Timestamp, > country String, name String, phonetype String, serialname String, salary > decimal(27, 10))row format delimited fields terminated by ','") > > sql(s"LOAD DATA local inpath '$resourcesPath/decimalBoundaryDataHive.csv' > INTO table hiveBigDecimal") > > sql("select avg(salary)+10 from hiveBigDecimal").show(fals > > Output with 2.3.1 > ++ > |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS > DECIMAL(32,14)))| > ++ > |37800224355780013.75982042536364 | > ++ > OutPut with 2.3.2_RC5 > |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS > DECIMAL(32,14)))| > ++ > |37800224355780013.75982042536000 > 
| > + > *PS:If I revert SPARK-24957 then 2.3.1 and 2.3.2_rc5 output is same* >
[jira] [Assigned] (SPARK-25413) [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done
[ https://issues.apache.org/jira/browse/SPARK-25413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25413: Assignee: Apache Spark > [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done > -- > > Key: SPARK-25413 > URL: https://issues.apache.org/jira/browse/SPARK-25413 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 > Environment: Csv FIle content > > 1,23-07-2015,India,aaa1,phone197,ASD69643,1.23457E+16 > 2,24-07-2015,India,aaa2,phone756,ASD42892,1.23457E+16 > 3,25-07-2015,India,aaa3,phone1904,ASD37014,1.23457E+16 > 4,26-07-2015,India,aaa4,phone2435,ASD66902,1.23457E+16 > 5,27-07-2015,India,aaa5,phone2441,ASD90633,2.23457E+16 > 6,28-07-2015,India,aaa6,phone294,ASD59961,3.23457E+16 > 7,29-07-2015,India,aaa7,phone610,ASD14875,4.23457E+16 > 8,30-07-2015,India,aaa8,phone1848,ASD57308,5.23457E+16 > 9,18-07-2015,India,aaa9,phone706,ASD86717,6.23457E+16 > 10,19-07-2015,usa,aaa10,phone685,ASD30505,7.23457E+16 > 11,18-07-2015,china,aaa11,phone1554,ASD26101,8.23457E+16 >Reporter: sandeep katta >Assignee: Apache Spark >Priority: Blocker > Attachments: decimalBoundaryDataHive.csv > > > sql("create table if not exists hiveBigDecimal(ID Int, date Timestamp, > country String, name String, phonetype String, serialname String, salary > decimal(27, 10))row format delimited fields terminated by ','") > > sql(s"LOAD DATA local inpath '$resourcesPath/decimalBoundaryDataHive.csv' > INTO table hiveBigDecimal") > > sql("select avg(salary)+10 from hiveBigDecimal").show(fals > > Output with 2.3.1 > ++ > |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS > DECIMAL(32,14)))| > ++ > |37800224355780013.75982042536364 | > ++ > OutPut with 2.3.2_RC5 > |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS > DECIMAL(32,14)))| > ++ > |37800224355780013.75982042536000 > | > + > *PS:If I revert SPARK-24957 then 2.3.1 and 2.3.2_rc5 output is same* > -- This 
[jira] [Updated] (SPARK-25413) [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done
[ https://issues.apache.org/jira/browse/SPARK-25413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandeep katta updated SPARK-25413: -- Priority: Blocker (was: Minor) > [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done > -- > > Key: SPARK-25413 > URL: https://issues.apache.org/jira/browse/SPARK-25413 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 > Environment: Csv FIle content > > 1,23-07-2015,India,aaa1,phone197,ASD69643,1.23457E+16 > 2,24-07-2015,India,aaa2,phone756,ASD42892,1.23457E+16 > 3,25-07-2015,India,aaa3,phone1904,ASD37014,1.23457E+16 > 4,26-07-2015,India,aaa4,phone2435,ASD66902,1.23457E+16 > 5,27-07-2015,India,aaa5,phone2441,ASD90633,2.23457E+16 > 6,28-07-2015,India,aaa6,phone294,ASD59961,3.23457E+16 > 7,29-07-2015,India,aaa7,phone610,ASD14875,4.23457E+16 > 8,30-07-2015,India,aaa8,phone1848,ASD57308,5.23457E+16 > 9,18-07-2015,India,aaa9,phone706,ASD86717,6.23457E+16 > 10,19-07-2015,usa,aaa10,phone685,ASD30505,7.23457E+16 > 11,18-07-2015,china,aaa11,phone1554,ASD26101,8.23457E+16 >Reporter: sandeep katta >Priority: Blocker > Attachments: decimalBoundaryDataHive.csv > > > sql("create table if not exists hiveBigDecimal(ID Int, date Timestamp, > country String, name String, phonetype String, serialname String, salary > decimal(27, 10))row format delimited fields terminated by ','") > > sql(s"LOAD DATA local inpath '$resourcesPath/decimalBoundaryDataHive.csv' > INTO table hiveBigDecimal") > > sql("select avg(salary)+10 from hiveBigDecimal").show(fals > > Output with 2.3.1 > ++ > |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS > DECIMAL(32,14)))| > ++ > |37800224355780013.75982042536364 | > ++ > OutPut with 2.3.2_RC5 > |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS > DECIMAL(32,14)))| > ++ > |37800224355780013.75982042536000 > | > + > *PS:If I revert SPARK-24957 then 2.3.1 and 2.3.2_rc5 output is same* > -- This message was sent 
[jira] [Updated] (SPARK-25413) [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done
[ https://issues.apache.org/jira/browse/SPARK-25413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandeep katta updated SPARK-25413: -- Attachment: decimalBoundaryDataHive.csv > [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done > -- > > Key: SPARK-25413 > URL: https://issues.apache.org/jira/browse/SPARK-25413 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 > Environment: Csv FIle content > > 1,23-07-2015,India,aaa1,phone197,ASD69643,1.23457E+16 > 2,24-07-2015,India,aaa2,phone756,ASD42892,1.23457E+16 > 3,25-07-2015,India,aaa3,phone1904,ASD37014,1.23457E+16 > 4,26-07-2015,India,aaa4,phone2435,ASD66902,1.23457E+16 > 5,27-07-2015,India,aaa5,phone2441,ASD90633,2.23457E+16 > 6,28-07-2015,India,aaa6,phone294,ASD59961,3.23457E+16 > 7,29-07-2015,India,aaa7,phone610,ASD14875,4.23457E+16 > 8,30-07-2015,India,aaa8,phone1848,ASD57308,5.23457E+16 > 9,18-07-2015,India,aaa9,phone706,ASD86717,6.23457E+16 > 10,19-07-2015,usa,aaa10,phone685,ASD30505,7.23457E+16 > 11,18-07-2015,china,aaa11,phone1554,ASD26101,8.23457E+16 >Reporter: sandeep katta >Priority: Minor > Attachments: decimalBoundaryDataHive.csv > > > sql("create table if not exists hiveBigDecimal(ID Int, date Timestamp, > country String, name String, phonetype String, serialname String, salary > decimal(27, 10))row format delimited fields terminated by ','") > > sql(s"LOAD DATA local inpath '$resourcesPath/decimalBoundaryDataHive.csv' > INTO table hiveBigDecimal") > > sql("select avg(salary)+10 from hiveBigDecimal").show(fals > > Output with 2.3.1 > ++ > |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS > DECIMAL(32,14)))| > ++ > |37800224355780013.75982042536364 | > ++ > OutPut with 2.3.2_RC5 > |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS > DECIMAL(32,14)))| > ++ > |37800224355780013.75982042536000 > | > + > *PS:If I revert SPARK-24957 then 2.3.1 and 2.3.2_rc5 output is same* > -- This message 
[jira] [Created] (SPARK-25413) [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done
sandeep katta created SPARK-25413: - Summary: [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done Key: SPARK-25413 URL: https://issues.apache.org/jira/browse/SPARK-25413 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.1 Environment: Csv FIle content 1,23-07-2015,India,aaa1,phone197,ASD69643,1.23457E+16 2,24-07-2015,India,aaa2,phone756,ASD42892,1.23457E+16 3,25-07-2015,India,aaa3,phone1904,ASD37014,1.23457E+16 4,26-07-2015,India,aaa4,phone2435,ASD66902,1.23457E+16 5,27-07-2015,India,aaa5,phone2441,ASD90633,2.23457E+16 6,28-07-2015,India,aaa6,phone294,ASD59961,3.23457E+16 7,29-07-2015,India,aaa7,phone610,ASD14875,4.23457E+16 8,30-07-2015,India,aaa8,phone1848,ASD57308,5.23457E+16 9,18-07-2015,India,aaa9,phone706,ASD86717,6.23457E+16 10,19-07-2015,usa,aaa10,phone685,ASD30505,7.23457E+16 11,18-07-2015,china,aaa11,phone1554,ASD26101,8.23457E+16 Reporter: sandeep katta sql("create table if not exists hiveBigDecimal(ID Int, date Timestamp, country String, name String, phonetype String, serialname String, salary decimal(27, 10))row format delimited fields terminated by ','") sql(s"LOAD DATA local inpath '$resourcesPath/decimalBoundaryDataHive.csv' INTO table hiveBigDecimal") sql("select avg(salary)+10 from hiveBigDecimal").show(fals Output with 2.3.1 ++ |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS DECIMAL(32,14)))| ++ |37800224355780013.75982042536364 | ++ OutPut with 2.3.2_RC5 |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS DECIMAL(32,14)))| ++ |37800224355780013.75982042536000 | + *PS:If I revert SPARK-24957 then 2.3.1 and 2.3.2_rc5 output is same* -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
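The tail-zeroing visible in the two outputs above (…536364 vs …536000) can be illustrated with plain decimal arithmetic: if an intermediate quotient is held at too few significant digits, widening it back to scale 14 afterwards only pads zeros. The numbers below are made up (the CSV values in the ticket are shown rounded), and the context-precision mechanism is Python's `decimal` module, not Spark's internal decimal path.

```python
from decimal import Decimal, getcontext, localcontext

getcontext().prec = 38  # plenty of significant digits, like DECIMAL(38, _)

vals = [Decimal("12345678986754321.1234567890")] * 3 + \
       [Decimal("22345678986754321.1234567890")]
total = sum(vals)

# Divide with enough precision, then present at scale 14.
exact = (total / 4).quantize(Decimal("1E-14"))

# Divide with too few significant digits for 17 integer digits + 14 decimals;
# quantizing back to scale 14 afterwards only pads trailing zeros.
with localcontext() as ctx:
    ctx.prec = 22
    lossy = total / 4
lossy = lossy.quantize(Decimal("1E-14"))

# exact keeps the true tail, lossy ends in zeros:
# 14845678986754321.12345678900000 vs 14845678986754321.12346000000000
```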
[jira] [Commented] (SPARK-21542) Helper functions for custom Python Persistence
[ https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611733#comment-16611733 ] Peter Knight commented on SPARK-21542: -- Thanks for the reply [~JohnHBauer]. Yes I am using @keyword_only decorator exactly like in the stack overflow example you cite. I'll be interested to see your code if you get it working. Thanks. > Helper functions for custom Python Persistence > -- > > Key: SPARK-21542 > URL: https://issues.apache.org/jira/browse/SPARK-21542 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Ajay Saini >Assignee: Ajay Saini >Priority: Major > Fix For: 2.3.0 > > > Currently, there is no way to easily persist Json-serializable parameters in > Python only. All parameters in Python are persisted by converting them to > Java objects and using the Java persistence implementation. In order to > facilitate the creation of custom Python-only pipeline stages, it would be > good to have a Python-only persistence framework so that these stages do not > need to be implemented in Scala for persistence. > This task involves: > - Adding implementations for DefaultParamsReadable, DefaultParamsWriteable, > DefaultParamsReader, and DefaultParamsWriter in pyspark.
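The @keyword_only pattern discussed in the comment can be sketched without pyspark. Below is a simplified, non-thread-safe stand-in for pyspark's `keyword_only` decorator (the real one lives in the `pyspark` package), paired with the `__init__`/`setParams` idiom custom Python-only stages typically use; the class and parameter names are illustrative.

```python
import functools

def keyword_only(func):
    """Simplified stand-in for pyspark's @keyword_only: rejects positional
    arguments and records the keyword arguments in self._input_kwargs."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        if args:
            raise TypeError("Method %s only takes keyword arguments." % func.__name__)
        self._input_kwargs = kwargs
        return func(self, **kwargs)
    return wrapper

class MyTokenizer:
    """Skeleton of a hypothetical custom Python-only pipeline stage."""

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        self._paramMap = {}
        self.setParams(**self._input_kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None):
        # Copy only the explicitly passed kwargs into the param map.
        self._paramMap.update(self._input_kwargs)
        return self

t = MyTokenizer(inputCol="text", outputCol="tokens")
# t._paramMap now holds {"inputCol": "text", "outputCol": "tokens"}
```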
[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen
[ https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611717#comment-16611717 ] Kazuaki Ishizaki commented on SPARK-20184: -- In {{branch-2.4}}, we still see the performance degradation compared to w/o codegen {code:java} OpenJDK 64-Bit Server VM 1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11 on Linux 4.4.0-66-generic Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz SPARK-20184: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative codegen = T 2915 / 3204 0.0 2915001883.0 1.0X codegen = F 1178 / 1368 0.0 1178020462.0 2.5X {code} > performance regression for complex/long sql when enable whole stage codegen > --- > > Key: SPARK-20184 > URL: https://issues.apache.org/jira/browse/SPARK-20184 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0, 2.1.0 >Reporter: Fei Wang >Priority: Major > > The performance of following SQL get much worse in spark 2.x in contrast > with codegen off. > SELECT >sum(COUNTER_57) > ,sum(COUNTER_71) > ,sum(COUNTER_3) > ,sum(COUNTER_70) > ,sum(COUNTER_66) > ,sum(COUNTER_75) > ,sum(COUNTER_69) > ,sum(COUNTER_55) > ,sum(COUNTER_63) > ,sum(COUNTER_68) > ,sum(COUNTER_56) > ,sum(COUNTER_37) > ,sum(COUNTER_51) > ,sum(COUNTER_42) > ,sum(COUNTER_43) > ,sum(COUNTER_1) > ,sum(COUNTER_76) > ,sum(COUNTER_54) > ,sum(COUNTER_44) > ,sum(COUNTER_46) > ,DIM_1 > ,DIM_2 > ,DIM_3 > FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100; > Num of rows of aggtable is about 3500. > whole stage codegen on(spark.sql.codegen.wholeStage = true):40s > whole stage codegen off(spark.sql.codegen.wholeStage = false):6s > After some analysis i think this is related to the huge java method(a java > method of thousand lines) which generated by codegen. > And If i config -XX:-DontCompileHugeMethods the performance get much > better(about 7s). 
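The `-XX:-DontCompileHugeMethods` workaround mentioned above disables HotSpot's bail-out on methods larger than the huge-method bytecode limit, so oversized generated methods still get JIT-compiled. A sketch of passing it to both driver and executor JVMs at submit time; the conf keys are standard Spark properties and the jar name is a placeholder.

```shell
# Let HotSpot JIT-compile the huge whole-stage-codegen methods;
# app.jar is a placeholder.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-XX:-DontCompileHugeMethods" \
  --conf "spark.executor.extraJavaOptions=-XX:-DontCompileHugeMethods" \
  app.jar
```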