[jira] [Updated] (SPARK-23705) dataframe.groupBy() may inadvertently receive sequence of non-distinct strings
[ https://issues.apache.org/jira/browse/SPARK-23705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-23705: - Target Version/s: (was: 2.3.0) Let's avoid setting a target version; that field is usually reserved for committers. > dataframe.groupBy() may inadvertently receive sequence of non-distinct strings > -- > > Key: SPARK-23705 > URL: https://issues.apache.org/jira/browse/SPARK-23705 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Khoa Tran >Priority: Minor > Labels: beginner, easyfix, features, newbie, starter > Original Estimate: 1h > Remaining Estimate: 1h > > {code:java} > // code placeholder > package org.apache.spark.sql > . > . > . > class Dataset[T] private[sql]( > . > . > . > def groupBy(col1: String, cols: String*): RelationalGroupedDataset = { > val colNames: Seq[String] = col1 +: cols > RelationalGroupedDataset( > toDF(), colNames.map(colName => resolve(colName)), > RelationalGroupedDataset.GroupByType) > } > {code} > should append a `.distinct` after `colNames` when used in `groupBy` > > Not sure if the community agrees with this or whether it's up to the users to perform > the distinct operation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23696) StructType.fromString swallows exceptions from DataType.fromJson
[ https://issues.apache.org/jira/browse/SPARK-23696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401479#comment-16401479 ] Hyukjin Kwon commented on SPARK-23696: -- Why don't we just directly use {{DataType.fromJson}} if you need to catch the exception? > StructType.fromString swallows exceptions from DataType.fromJson > > > Key: SPARK-23696 > URL: https://issues.apache.org/jira/browse/SPARK-23696 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: Simeon H.K. Fitch >Priority: Trivial > > `StructType.fromString` swallows exceptions from `DataType.fromJson`, > assuming they are an indication that `LegacyTypeStringParser.parse` > should be called instead. When that fails (because it throws an exception), > an error message is generated that does not reflect the true problem at hand, > effectively swallowing the exception from `DataType.fromJson`. This makes > debugging Parquet schema issues more difficult.
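The complaint above is a general anti-pattern: a fallback parser's failure message hides the original error. One common remedy is to chain the original exception when both parsers fail. A minimal Python sketch of that behavior (the function names are hypothetical; Spark's actual code is Scala):

```python
import json


def legacy_parse(text):
    # hypothetical stand-in for LegacyTypeStringParser.parse
    raise ValueError("legacy parser cannot handle this input")


def parse_schema(text):
    """Try the primary JSON parser; if both it and the legacy fallback
    fail, surface the original JSON error instead of swallowing it."""
    try:
        return json.loads(text)
    except ValueError as json_err:
        try:
            return legacy_parse(text)
        except ValueError:
            # chain the original cause so debugging sees the real problem
            raise ValueError(
                f"Failed to parse schema; original JSON error: {json_err}"
            ) from json_err
```

With this shape, a malformed Parquet schema string produces a message that still names the underlying `json.loads` failure rather than only the legacy parser's complaint.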
[jira] [Assigned] (SPARK-23162) PySpark ML LinearRegressionSummary missing r2adj
[ https://issues.apache.org/jira/browse/SPARK-23162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23162: Assignee: Apache Spark > PySpark ML LinearRegressionSummary missing r2adj > > > Key: SPARK-23162 > URL: https://issues.apache.org/jira/browse/SPARK-23162 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Assignee: Apache Spark >Priority: Minor > Labels: starter > > Missing the Python API for {{r2adj}} in {{LinearRegressionSummary}}
[jira] [Assigned] (SPARK-23162) PySpark ML LinearRegressionSummary missing r2adj
[ https://issues.apache.org/jira/browse/SPARK-23162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23162: Assignee: (was: Apache Spark) > PySpark ML LinearRegressionSummary missing r2adj > > > Key: SPARK-23162 > URL: https://issues.apache.org/jira/browse/SPARK-23162 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Priority: Minor > Labels: starter > > Missing the Python API for {{r2adj}} in {{LinearRegressionSummary}}
[jira] [Commented] (SPARK-23162) PySpark ML LinearRegressionSummary missing r2adj
[ https://issues.apache.org/jira/browse/SPARK-23162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401458#comment-16401458 ] Apache Spark commented on SPARK-23162: -- User 'kevinyu98' has created a pull request for this issue: https://github.com/apache/spark/pull/20842 > PySpark ML LinearRegressionSummary missing r2adj > > > Key: SPARK-23162 > URL: https://issues.apache.org/jira/browse/SPARK-23162 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Priority: Minor > Labels: starter > > Missing the Python API for {{r2adj}} in {{LinearRegressionSummary}}
[jira] [Assigned] (SPARK-23706) spark.conf.get(value, default=None) should produce None in PySpark
[ https://issues.apache.org/jira/browse/SPARK-23706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23706: Assignee: Apache Spark > spark.conf.get(value, default=None) should produce None in PySpark > -- > > Key: SPARK-23706 > URL: https://issues.apache.org/jira/browse/SPARK-23706 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > Scala: > {code} > scala> spark.conf.get("hey") > java.util.NoSuchElementException: hey > at > org.apache.spark.sql.internal.SQLConf$$anonfun$getConfString$2.apply(SQLConf.scala:1600) > at > org.apache.spark.sql.internal.SQLConf$$anonfun$getConfString$2.apply(SQLConf.scala:1600) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.sql.internal.SQLConf.getConfString(SQLConf.scala:1600) > at org.apache.spark.sql.RuntimeConfig.get(RuntimeConfig.scala:74) > ... 49 elided > scala> spark.conf.get("hey", null) > res1: String = null > scala> spark.conf.get("spark.sql.sources.partitionOverwriteMode", null) > res2: String = null > {code} > Python: > {code} > >>> spark.conf.get("hey") > ... > py4j.protocol.Py4JJavaError: An error occurred while calling o30.get. > : java.util.NoSuchElementException: hey > ... > >>> spark.conf.get("hey", None) > ... > py4j.protocol.Py4JJavaError: An error occurred while calling o30.get. > : java.util.NoSuchElementException: hey > ... > >>> spark.conf.get("spark.sql.sources.partitionOverwriteMode", None) > u'STATIC' > {code}
[jira] [Commented] (SPARK-23706) spark.conf.get(value, default=None) should produce None in PySpark
[ https://issues.apache.org/jira/browse/SPARK-23706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401451#comment-16401451 ] Apache Spark commented on SPARK-23706: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/20841 > spark.conf.get(value, default=None) should produce None in PySpark > -- > > Key: SPARK-23706 > URL: https://issues.apache.org/jira/browse/SPARK-23706 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Scala: > {code} > scala> spark.conf.get("hey") > java.util.NoSuchElementException: hey > at > org.apache.spark.sql.internal.SQLConf$$anonfun$getConfString$2.apply(SQLConf.scala:1600) > at > org.apache.spark.sql.internal.SQLConf$$anonfun$getConfString$2.apply(SQLConf.scala:1600) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.sql.internal.SQLConf.getConfString(SQLConf.scala:1600) > at org.apache.spark.sql.RuntimeConfig.get(RuntimeConfig.scala:74) > ... 49 elided > scala> spark.conf.get("hey", null) > res1: String = null > scala> spark.conf.get("spark.sql.sources.partitionOverwriteMode", null) > res2: String = null > {code} > Python: > {code} > >>> spark.conf.get("hey") > ... > py4j.protocol.Py4JJavaError: An error occurred while calling o30.get. > : java.util.NoSuchElementException: hey > ... > >>> spark.conf.get("hey", None) > ... > py4j.protocol.Py4JJavaError: An error occurred while calling o30.get. > : java.util.NoSuchElementException: hey > ... > >>> spark.conf.get("spark.sql.sources.partitionOverwriteMode", None) > u'STATIC' > {code}
[jira] [Assigned] (SPARK-23706) spark.conf.get(value, default=None) should produce None in PySpark
[ https://issues.apache.org/jira/browse/SPARK-23706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23706: Assignee: (was: Apache Spark) > spark.conf.get(value, default=None) should produce None in PySpark > -- > > Key: SPARK-23706 > URL: https://issues.apache.org/jira/browse/SPARK-23706 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Scala: > {code} > scala> spark.conf.get("hey") > java.util.NoSuchElementException: hey > at > org.apache.spark.sql.internal.SQLConf$$anonfun$getConfString$2.apply(SQLConf.scala:1600) > at > org.apache.spark.sql.internal.SQLConf$$anonfun$getConfString$2.apply(SQLConf.scala:1600) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.sql.internal.SQLConf.getConfString(SQLConf.scala:1600) > at org.apache.spark.sql.RuntimeConfig.get(RuntimeConfig.scala:74) > ... 49 elided > scala> spark.conf.get("hey", null) > res1: String = null > scala> spark.conf.get("spark.sql.sources.partitionOverwriteMode", null) > res2: String = null > {code} > Python: > {code} > >>> spark.conf.get("hey") > ... > py4j.protocol.Py4JJavaError: An error occurred while calling o30.get. > : java.util.NoSuchElementException: hey > ... > >>> spark.conf.get("hey", None) > ... > py4j.protocol.Py4JJavaError: An error occurred while calling o30.get. > : java.util.NoSuchElementException: hey > ... > >>> spark.conf.get("spark.sql.sources.partitionOverwriteMode", None) > u'STATIC' > {code}
[jira] [Created] (SPARK-23706) spark.conf.get(value, default=None) should produce None in PySpark
Hyukjin Kwon created SPARK-23706: Summary: spark.conf.get(value, default=None) should produce None in PySpark Key: SPARK-23706 URL: https://issues.apache.org/jira/browse/SPARK-23706 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 2.3.0 Reporter: Hyukjin Kwon Scala: {code} scala> spark.conf.get("hey") java.util.NoSuchElementException: hey at org.apache.spark.sql.internal.SQLConf$$anonfun$getConfString$2.apply(SQLConf.scala:1600) at org.apache.spark.sql.internal.SQLConf$$anonfun$getConfString$2.apply(SQLConf.scala:1600) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.internal.SQLConf.getConfString(SQLConf.scala:1600) at org.apache.spark.sql.RuntimeConfig.get(RuntimeConfig.scala:74) ... 49 elided scala> spark.conf.get("hey", null) res1: String = null scala> spark.conf.get("spark.sql.sources.partitionOverwriteMode", null) res2: String = null {code} Python: {code} >>> spark.conf.get("hey") ... py4j.protocol.Py4JJavaError: An error occurred while calling o30.get. : java.util.NoSuchElementException: hey ... >>> spark.conf.get("hey", None) ... py4j.protocol.Py4JJavaError: An error occurred while calling o30.get. : java.util.NoSuchElementException: hey ... >>> spark.conf.get("spark.sql.sources.partitionOverwriteMode", None) u'STATIC' {code}
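The behavior requested above needs `default=None` to mean "return None for a missing key" rather than "no default supplied". A common Python idiom for that distinction is a module-level sentinel. The sketch below is a toy `Conf` class illustrating the idiom, not PySpark's actual `RuntimeConfig` implementation:

```python
_NO_VALUE = object()  # sentinel: distinguishes "no default given" from default=None


class Conf:
    """Toy config store mimicking the desired spark.conf.get semantics."""

    def __init__(self, values):
        self._values = values

    def get(self, key, default=_NO_VALUE):
        if default is _NO_VALUE:
            # no default supplied: a missing key is an error
            if key not in self._values:
                raise KeyError(key)
            return self._values[key]
        # a default was supplied (possibly None): never raise
        return self._values.get(key, default)


conf = Conf({"spark.sql.sources.partitionOverwriteMode": "STATIC"})
print(conf.get("hey", None))  # None, instead of raising
```

With the sentinel, `conf.get("hey")` still raises for a missing key, matching the Scala behavior, while `conf.get("hey", None)` returns `None` as the ticket requests.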
[jira] [Created] (SPARK-23705) dataframe.groupBy() may inadvertently receive sequence of non-distinct strings
Khoa Tran created SPARK-23705: - Summary: dataframe.groupBy() may inadvertently receive sequence of non-distinct strings Key: SPARK-23705 URL: https://issues.apache.org/jira/browse/SPARK-23705 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Khoa Tran {code:java} // code placeholder package org.apache.spark.sql . . . class Dataset[T] private[sql]( . . . def groupBy(col1: String, cols: String*): RelationalGroupedDataset = { val colNames: Seq[String] = col1 +: cols RelationalGroupedDataset( toDF(), colNames.map(colName => resolve(colName)), RelationalGroupedDataset.GroupByType) } {code} should append a `.distinct` after `colNames` when used in `groupBy` Not sure if the community agrees with this or whether it's up to the users to perform the distinct operation
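The requested change amounts to deduplicating the column-name sequence while keeping first-occurrence order, which is what Scala's `Seq.distinct` does. A minimal Python sketch of that semantics (the helper name is hypothetical, for illustration only):

```python
def dedupe_preserve_order(cols):
    """Drop repeated column names, keeping the first occurrence of each,
    mirroring what appending `.distinct` to `colNames` would do."""
    seen = set()
    out = []
    for c in cols:
        if c not in seen:
            seen.add(c)
            out.append(c)
    return out


# groupBy("a", "b", "a") would then group by ["a", "b"]
print(dedupe_preserve_order(["a", "b", "a"]))  # ['a', 'b']
```

The order-preserving property matters: grouping columns appear in the output schema in the order given, so a plain set-based dedupe would not be a drop-in replacement.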
[jira] [Comment Edited] (SPARK-23685) Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)
[ https://issues.apache.org/jira/browse/SPARK-23685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401428#comment-16401428 ] sirisha edited comment on SPARK-23685 at 3/16/18 3:47 AM: -- [~apachespark] Can anyone please guide me on how to assign this story to myself? I do not see an option to assign it to myself. was (Author: sindiri): [~apachespark] Can anyone please guide me on how to assign this pull request to myself? I do not see an option to assign it to myself. > Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive > Offsets (i.e. Log Compaction) > - > > Key: SPARK-23685 > URL: https://issues.apache.org/jira/browse/SPARK-23685 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: sirisha >Priority: Major > > When Kafka does log compaction, offsets often end up with gaps, meaning the > next requested offset will frequently not be offset+1. The logic in > KafkaSourceRDD & CachedKafkaConsumer assumes that the next offset will always > be just an increment of 1. If not, it throws the exception below: > > "Cannot fetch records in [5589, 5693) (GroupId: XXX, TopicPartition:). > Some data may have been lost because they are not available in Kafka any > more; either the data was aged out by Kafka or the topic may have been > deleted before all the data in the topic was processed. If you don't want > your streaming query to fail on such cases, set the source option > "failOnDataLoss" to "false". " > > FYI: This bug is related to https://issues.apache.org/jira/browse/SPARK-17147 > >
[jira] [Commented] (SPARK-23685) Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)
[ https://issues.apache.org/jira/browse/SPARK-23685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401428#comment-16401428 ] sirisha commented on SPARK-23685: - [~apachespark] Can anyone please guide me on how to assign this pull request to myself? I do not see an option to assign it to myself. > Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive > Offsets (i.e. Log Compaction) > - > > Key: SPARK-23685 > URL: https://issues.apache.org/jira/browse/SPARK-23685 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: sirisha >Priority: Major > > When Kafka does log compaction, offsets often end up with gaps, meaning the > next requested offset will frequently not be offset+1. The logic in > KafkaSourceRDD & CachedKafkaConsumer assumes that the next offset will always > be just an increment of 1. If not, it throws the exception below: > > "Cannot fetch records in [5589, 5693) (GroupId: XXX, TopicPartition:). > Some data may have been lost because they are not available in Kafka any > more; either the data was aged out by Kafka or the topic may have been > deleted before all the data in the topic was processed. If you don't want > your streaming query to fail on such cases, set the source option > "failOnDataLoss" to "false". " > > FYI: This bug is related to https://issues.apache.org/jira/browse/SPARK-17147 > >
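A compaction-tolerant consumer loop advances to the offset each record actually carries instead of asserting `next == previous + 1`. The following is a plain-Python sketch of that logic over a list of `(offset, value)` pairs, not the real Kafka consumer API:

```python
def collect_range(records, start, end):
    """Collect records whose offsets fall in [start, end), tolerating
    gaps left by log compaction: track the offset each record actually
    has rather than requiring consecutive offsets."""
    out = []
    expected = start
    for offset, value in records:
        if offset < expected:
            continue  # already consumed, or a stale duplicate
        if offset >= end:
            break
        out.append((offset, value))
        expected = offset + 1  # next fetch starts after what we actually got
    return out


# offsets 2 and 4 were removed by compaction; no error is raised
print(collect_range([(0, "a"), (1, "b"), (3, "d"), (5, "f")], 0, 6))
```

The current `CachedKafkaConsumer` logic effectively raises at offset 3 in this example because it expected offset 2; the sketch shows the gap being skipped silently instead.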
[jira] [Commented] (SPARK-23673) PySpark dayofweek does not conform with ISO 8601
[ https://issues.apache.org/jira/browse/SPARK-23673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401392#comment-16401392 ] Kazuaki Ishizaki commented on SPARK-23673: -- In Spark, `dayofweek` comes from SQL. The result value is based on the [ODBC standard|https://mariadb.com/kb/en/library/dayofweek/]. Would it be better to prepare another function that is compatible with Pandas? cc [~ueshin] [~bryanc] > PySpark dayofweek does not conform with ISO 8601 > > > Key: SPARK-23673 > URL: https://issues.apache.org/jira/browse/SPARK-23673 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.3.0 >Reporter: Ivan SPM >Priority: Minor > > The new function dayofweek, which returns 1 for Sunday and 7 for Saturday, does > not conform to ISO 8601, which states that the first day of the week is > Monday, so 1 = Monday and 7 = Sunday. This behavior is also different from > Pandas, which uses 0 = Monday and 6 = Sunday, but pandas is at least > consistent with the ordering of ISO 8601. > [https://en.wikipedia.org/wiki/ISO_8601#Week_dates] > (Also reported by Antonio Pedro Vieira (aps.vieir...@gmail.com))
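The mapping between the ODBC convention (1 = Sunday … 7 = Saturday, as Spark's `dayofweek` returns) and ISO 8601 (1 = Monday … 7 = Sunday) is a fixed modular shift. A small sketch of the conversions a Pandas-compatible variant would need (helper names are hypothetical):

```python
def odbc_to_iso(dow):
    """Map ODBC day-of-week (1=Sunday..7=Saturday) to
    ISO 8601 (1=Monday..7=Sunday)."""
    return (dow + 5) % 7 + 1


def odbc_to_pandas(dow):
    """Map ODBC day-of-week to the pandas convention (0=Monday..6=Sunday)."""
    return odbc_to_iso(dow) - 1


print(odbc_to_iso(1))   # Sunday  -> 7
print(odbc_to_iso(2))   # Monday  -> 1
print(odbc_to_pandas(2))  # Monday -> 0
```

Because the conversion is a pure arithmetic shift, users who need ISO semantics today can apply it in a column expression without waiting for a new built-in.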
[jira] [Comment Edited] (SPARK-22390) Aggregate push down
[ https://issues.apache.org/jira/browse/SPARK-22390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401378#comment-16401378 ] Huaxin Gao edited comment on SPARK-22390 at 3/16/18 1:56 AM: - [~cloud_fan], I am working on Aggregate push down design doc and prototype. Could you please review the doc? Thanks a lot! [https://docs.google.com/document/d/1X3EVXjyMv76KuZfX_VjQFmXeAmW3xYHe3M8DlGkbKQ/edit|https://docs.google.com/document/d/1X3EVX-jyMv76KuZfX_VjQFmXeAmW3xYHe3M8DlGkbKQ/edit] was (Author: huaxingao): [~cloud_fan], I am working on Aggregate push down design doc and prototype. Could you please review the doc? Thanks a lot! [https://docs.google.com/document/d/1X3EVX-jyMv76KuZfX_VjQFmXeAmW3xYHe3M8DlGkbKQ/edit|http://example.com] > Aggregate push down > --- > > Key: SPARK-22390 > URL: https://issues.apache.org/jira/browse/SPARK-22390 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Priority: Major >
[jira] [Commented] (SPARK-22390) Aggregate push down
[ https://issues.apache.org/jira/browse/SPARK-22390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401378#comment-16401378 ] Huaxin Gao commented on SPARK-22390: [~cloud_fan], I am working on Aggregate push down design doc and prototype. Could you please review the doc? Thanks a lot! [https://docs.google.com/document/d/1X3EVX-jyMv76KuZfX_VjQFmXeAmW3xYHe3M8DlGkbKQ/edit|http://example.com] > Aggregate push down > --- > > Key: SPARK-22390 > URL: https://issues.apache.org/jira/browse/SPARK-22390 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Priority: Major >
[jira] [Resolved] (SPARK-23651) Add a check for host name
[ https://issues.apache.org/jira/browse/SPARK-23651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian resolved SPARK-23651. - Resolution: Fixed > Add a check for host name > -- > > Key: SPARK-23651 > URL: https://issues.apache.org/jira/browse/SPARK-23651 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: liuxian >Priority: Minor > > I encountered an error like this: > _org.apache.spark.SparkException: Invalid Spark URL: > spark://HeartbeatReceiver@ci_164:42849_ > _at > org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:66)_ > _at > org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:134)_ > _at > org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)_ > _at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)_ > _at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32)_ > _at org.apache.spark.executor.Executor.(Executor.scala:155)_ > _at > org.apache.spark.scheduler.local.LocalEndpoint.(LocalSchedulerBackend.scala:59)_ > _at > org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126)_ > _at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)_ > > I didn't know why this _URL_(spark://HeartbeatReceiver@ci_164:42849) is > invalid, so I think we should give a clearer error message for this case. >
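The URL above is likely rejected because `ci_164` contains an underscore, which RFC 952/1123 host names do not permit. A sketch of the kind of up-front hostname check that would let Spark report the actual problem (the function is hypothetical, not Spark's validator):

```python
import re

# One RFC 1123 host-name label: letters, digits, hyphens;
# no underscores, no leading or trailing hyphen, at most 63 chars.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def is_valid_hostname(host):
    """Check each dot-separated label of a host name; underscores
    (as in 'ci_164') make the name invalid, which is the root cause
    behind the 'Invalid Spark URL' error above."""
    if not host or len(host) > 253:
        return False
    return all(_LABEL.match(label) for label in host.split("."))


print(is_valid_hostname("ci_164"))   # False: underscore not allowed
print(is_valid_hostname("ci-164"))   # True
```

With such a check, the error could say "host name 'ci_164' contains an invalid character '_'" instead of a generic "Invalid Spark URL".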
[jira] [Created] (SPARK-23704) PySpark access of individual trees in random forest is slow
Julian King created SPARK-23704: --- Summary: PySpark access of individual trees in random forest is slow Key: SPARK-23704 URL: https://issues.apache.org/jira/browse/SPARK-23704 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.2.1 Environment: PySpark 2.2.1 / Windows 10 Reporter: Julian King Making predictions from a randomForestClassifier in PySpark is much faster than making predictions from an individual tree contained within its .trees attribute. In fact, the model.transform call without an action is more than 10x slower for an individual tree vs the model.transform call for the random forest model. See [https://stackoverflow.com/questions/49297470/slow-individual-tree-access-for-random-forest-in-pyspark] for an example with timing. Ideally: * Getting a prediction from a single tree should be comparable to or faster than getting predictions from the whole forest * Getting all the predictions from all the individual trees should be comparable in speed to getting the predictions from the random forest
[jira] [Assigned] (SPARK-23670) Memory leak of SparkPlanGraphWrapper in sparkUI
[ https://issues.apache.org/jira/browse/SPARK-23670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23670: -- Assignee: Myroslav Lisniak > Memory leak of SparkPlanGraphWrapper in sparkUI > --- > > Key: SPARK-23670 > URL: https://issues.apache.org/jira/browse/SPARK-23670 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Myroslav Lisniak >Assignee: Myroslav Lisniak >Priority: Major > Fix For: 2.3.1, 2.4.0 > > Attachments: heap.png > > > Memory leak on the driver for a long-running application. We have an > application using structured streaming that runs for 48 hours, but the driver fails > with out of memory after 25 hours. After investigating a heap dump, we found > that most of the memory was occupied by a large number of *SparkPlanGraphWrapper* > objects inside *InMemoryStore*. > The application was run with the option: > --driver-memory 4G
[jira] [Resolved] (SPARK-23670) Memory leak of SparkPlanGraphWrapper in sparkUI
[ https://issues.apache.org/jira/browse/SPARK-23670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23670. Resolution: Fixed Fix Version/s: 2.3.1 2.4.0 Issue resolved by pull request 20813 [https://github.com/apache/spark/pull/20813] > Memory leak of SparkPlanGraphWrapper in sparkUI > --- > > Key: SPARK-23670 > URL: https://issues.apache.org/jira/browse/SPARK-23670 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Myroslav Lisniak >Priority: Major > Fix For: 2.4.0, 2.3.1 > > Attachments: heap.png > > > Memory leak on the driver for a long-running application. We have an > application using structured streaming that runs for 48 hours, but the driver fails > with out of memory after 25 hours. After investigating a heap dump, we found > that most of the memory was occupied by a large number of *SparkPlanGraphWrapper* > objects inside *InMemoryStore*. > The application was run with the option: > --driver-memory 4G
[jira] [Assigned] (SPARK-23608) SHS needs synchronization between attachSparkUI and detachSparkUI functions
[ https://issues.apache.org/jira/browse/SPARK-23608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23608: -- Assignee: Ye Zhou > SHS needs synchronization between attachSparkUI and detachSparkUI functions > --- > > Key: SPARK-23608 > URL: https://issues.apache.org/jira/browse/SPARK-23608 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.0 >Reporter: Ye Zhou >Assignee: Ye Zhou >Priority: Minor > Fix For: 2.3.1, 2.4.0 > > > We continuously hit an issue with SHS after it has run for a while and served some > REST API calls. SHS suddenly shows an empty home page with 0 > applications. It is caused by the unexpected data returned from the REST call > "api/v1/applications?limit=8000". This REST call returns the home page HTML > instead of the list of application summaries. Some other REST calls which ask > for detailed application information also return the home page HTML. But > there are still some working REST calls. We have to restart SHS to solve the > issue. > We attached a remote debugger to the problematic process and checked the > attached jetty handlers tree in the web server. We found that the jetty > handler added by "attachHandler(ApiRootResource.getServletHandler(this))" is > missing from the tree, as are some other handlers. Without the root resource > servlet handler, SHS will not work correctly serving both UI and REST calls. > SHS will directly return the HistoryServerPage HTML to the user as it cannot find > handlers to handle the request. > Spark History Server has to attachSparkUI in order to serve user requests. > The application's SparkUI gets attached when the application details data > gets loaded into the Guava cache. While attaching the SparkUI, SHS will attach > all jetty handlers into the current web service. But when the data gets > cleared out from the Guava cache, SHS will detach all the application's SparkUI > jetty handlers. Because eviction in the Guava cache is asynchronous, the clear out > from the cache is not synchronized with loading into the cache. The actual clear out > in the Guava cache which triggers detachSparkUI might be detaching the handlers > while attachSparkUI is attaching jetty handlers. > After adding synchronization between attachSparkUI and detachSparkUI in > the history server, this issue never happens again.
[jira] [Resolved] (SPARK-23608) SHS needs synchronization between attachSparkUI and detachSparkUI functions
[ https://issues.apache.org/jira/browse/SPARK-23608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23608. Resolution: Fixed Fix Version/s: 2.3.1 2.4.0 Issue resolved by pull request 20744 [https://github.com/apache/spark/pull/20744] > SHS needs synchronization between attachSparkUI and detachSparkUI functions > --- > > Key: SPARK-23608 > URL: https://issues.apache.org/jira/browse/SPARK-23608 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.0 >Reporter: Ye Zhou >Priority: Minor > Fix For: 2.4.0, 2.3.1 > > > We continuously hit an issue with SHS after it has run for a while and served some > REST API calls. SHS suddenly shows an empty home page with 0 > applications. It is caused by the unexpected data returned from the REST call > "api/v1/applications?limit=8000". This REST call returns the home page HTML > instead of the list of application summaries. Some other REST calls which ask > for detailed application information also return the home page HTML. But > there are still some working REST calls. We have to restart SHS to solve the > issue. > We attached a remote debugger to the problematic process and checked the > attached jetty handlers tree in the web server. We found that the jetty > handler added by "attachHandler(ApiRootResource.getServletHandler(this))" is > missing from the tree, as are some other handlers. Without the root resource > servlet handler, SHS will not work correctly serving both UI and REST calls. > SHS will directly return the HistoryServerPage HTML to the user as it cannot find > handlers to handle the request. > Spark History Server has to attachSparkUI in order to serve user requests. > The application's SparkUI gets attached when the application details data > gets loaded into the Guava cache. While attaching the SparkUI, SHS will attach > all jetty handlers into the current web service. But when the data gets > cleared out from the Guava cache, SHS will detach all the application's SparkUI > jetty handlers. Because eviction in the Guava cache is asynchronous, the clear out > from the cache is not synchronized with loading into the cache. The actual clear out > in the Guava cache which triggers detachSparkUI might be detaching the handlers > while attachSparkUI is attaching jetty handlers. > After adding synchronization between attachSparkUI and detachSparkUI in > the history server, this issue never happens again.
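The race described in SPARK-23608 (an asynchronous cache eviction detaching handlers while an attach is mid-flight) is fixed by making attach and detach share one lock. A toy Python sketch of that shape (a model of the idea, not the actual SHS code):

```python
import threading


class HandlerRegistry:
    """Toy model of the SHS fix: attach and detach hold the same lock,
    so a cache-eviction-triggered detach can never interleave with an
    in-progress attach and leave the handler tree half-built."""

    def __init__(self):
        self._lock = threading.Lock()
        self._handlers = []

    def attach(self, handlers):
        with self._lock:
            # all handlers for one UI are added as a single atomic step
            self._handlers.extend(handlers)

    def detach(self, handlers):
        with self._lock:
            # removal is likewise atomic with respect to attach
            for h in handlers:
                if h in self._handlers:
                    self._handlers.remove(h)

    def snapshot(self):
        with self._lock:
            return list(self._handlers)
```

Without the shared lock, a detach running between two of attach's handler additions could leave the registry missing exactly the kind of root-resource handler the report describes.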
[jira] [Assigned] (SPARK-23671) SHS is ignoring number of replay threads
[ https://issues.apache.org/jira/browse/SPARK-23671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23671: -- Assignee: Marcelo Vanzin > SHS is ignoring number of replay threads > > > Key: SPARK-23671 > URL: https://issues.apache.org/jira/browse/SPARK-23671 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Critical > Fix For: 2.3.1, 2.4.0 > > > I mistakenly flipped a condition in a previous change and the SHS is now > basically doing single-threaded parsing of event logs.
[jira] [Resolved] (SPARK-23671) SHS is ignoring number of replay threads
[ https://issues.apache.org/jira/browse/SPARK-23671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23671. Resolution: Fixed Fix Version/s: 2.3.1 2.4.0 Issue resolved by pull request 20814 [https://github.com/apache/spark/pull/20814] > SHS is ignoring number of replay threads > > > Key: SPARK-23671 > URL: https://issues.apache.org/jira/browse/SPARK-23671 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Critical > Fix For: 2.4.0, 2.3.1 > > > I mistakenly flipped a condition in a previous change and the SHS is now > basically doing single-threaded parsing of event logs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8008) JDBC data source can overload the external database system due to high concurrency
[ https://issues.apache.org/jira/browse/SPARK-8008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401300#comment-16401300 ] Jo Desmet commented on SPARK-8008: -- Too bad this issue is not considered high priority. I regularly run into the problem of needing to process billions of records. The only way to handle this is to create a huge number of partitions and then throttle using spark.executor.cores; however, that setting effectively throttles my entire RDD, not just the portion that loads from the database. It would be hugely beneficial to be able to restrict not only the number of partitions but also the task concurrency at any point in my RDD. > JDBC data source can overload the external database system due to high > concurrency > -- > > Key: SPARK-8008 > URL: https://issues.apache.org/jira/browse/SPARK-8008 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Rene Treffer >Priority: Major > > Spark tries to load as many partitions as possible in parallel, which can in > turn overload the database even though it would be possible to load all > partitions at a lower concurrency. > It would be nice to either limit the maximum concurrency or at least warn > about this behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
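Spark offers no per-stage concurrency knob here, but the throttling the commenter asks for is essentially a semaphore around the database-facing work. A generic Python sketch of the pattern (MAX_DB_CONNECTIONS and load_partition are hypothetical names for illustration, not Spark settings or APIs):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_DB_CONNECTIONS = 4                 # hypothetical cap on concurrent reads
_db_gate = threading.BoundedSemaphore(MAX_DB_CONNECTIONS)

def load_partition(part_id):
    """Stand-in for one JDBC partition read. At most MAX_DB_CONNECTIONS
    of these touch the database at once, even though the pool runs many
    more tasks in parallel."""
    with _db_gate:
        return [part_id, part_id + 1]  # pretend these are fetched rows

# the pool models high task parallelism; the gate models the DB limit
with ThreadPoolExecutor(max_workers=32) as pool:
    rows = list(pool.map(load_partition, range(100)))
```

The point of the sketch is the separation the commenter wants: overall task parallelism (the pool) stays high, while only the database-facing section is throttled (the semaphore).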
[jira] [Created] (SPARK-23703) Collapse sequential watermarks
Jose Torres created SPARK-23703: --- Summary: Collapse sequential watermarks Key: SPARK-23703 URL: https://issues.apache.org/jira/browse/SPARK-23703 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 2.4.0 Reporter: Jose Torres When there are two sequential EventTimeWatermark nodes in a query plan, the topmost one overrides the column tracking metadata from its children, but leaves the nodes themselves untouched. When there is no intervening stateful operation to consume the watermark, we should remove the lower node entirely. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
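The proposed rewrite can be sketched as a small plan transform. This is a toy single-child tree, not Spark's LogicalPlan API; it drops the lower of two adjacent EventTimeWatermark nodes, assuming no intervening stateful operator:

```python
class PlanNode:
    """Minimal stand-in for a single-child logical plan node."""
    def __init__(self, name, child=None):
        self.name = name
        self.child = child

def collapse_watermarks(node):
    """Bottom-up rewrite: if two EventTimeWatermark nodes are adjacent,
    remove the lower one (toy version of the rule this issue proposes)."""
    if node is None:
        return None
    node.child = collapse_watermarks(node.child)
    if (node.name == "EventTimeWatermark"
            and node.child is not None
            and node.child.name == "EventTimeWatermark"):
        node.child = node.child.child      # splice out the lower watermark
    return node

plan = PlanNode("EventTimeWatermark",
                PlanNode("EventTimeWatermark",
                         PlanNode("StreamingRelation")))
plan = collapse_watermarks(plan)
```

Because the traversal is bottom-up, any run of sequential watermark nodes collapses to the topmost one, matching the intent that only the overriding node survives.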
[jira] [Assigned] (SPARK-23702) Forbid watermarks on both sides of a streaming aggregate
[ https://issues.apache.org/jira/browse/SPARK-23702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23702: Assignee: (was: Apache Spark) > Forbid watermarks on both sides of a streaming aggregate > > > Key: SPARK-23702 > URL: https://issues.apache.org/jira/browse/SPARK-23702 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23702) Forbid watermarks on both sides of a streaming aggregate
[ https://issues.apache.org/jira/browse/SPARK-23702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401295#comment-16401295 ] Apache Spark commented on SPARK-23702: -- User 'jose-torres' has created a pull request for this issue: https://github.com/apache/spark/pull/20840 > Forbid watermarks on both sides of a streaming aggregate > > > Key: SPARK-23702 > URL: https://issues.apache.org/jira/browse/SPARK-23702 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23702) Forbid watermarks on both sides of a streaming aggregate
[ https://issues.apache.org/jira/browse/SPARK-23702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23702: Assignee: Apache Spark > Forbid watermarks on both sides of a streaming aggregate > > > Key: SPARK-23702 > URL: https://issues.apache.org/jira/browse/SPARK-23702 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23658) InProcessAppHandle uses the wrong class in getLogger
[ https://issues.apache.org/jira/browse/SPARK-23658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23658. Resolution: Fixed Fix Version/s: 2.3.1 2.4.0 Issue resolved by pull request 20815 [https://github.com/apache/spark/pull/20815] > InProcessAppHandle uses the wrong class in getLogger > > > Key: SPARK-23658 > URL: https://issues.apache.org/jira/browse/SPARK-23658 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.0 >Reporter: Sahil Takiar >Assignee: Sahil Takiar >Priority: Minor > Fix For: 2.4.0, 2.3.1 > > > {{InProcessAppHandle}} uses {{ChildProcAppHandle}} as the class in > {{getLogger}}, it should just use its own name. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23658) InProcessAppHandle uses the wrong class in getLogger
[ https://issues.apache.org/jira/browse/SPARK-23658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23658: -- Assignee: Sahil Takiar > InProcessAppHandle uses the wrong class in getLogger > > > Key: SPARK-23658 > URL: https://issues.apache.org/jira/browse/SPARK-23658 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.0 >Reporter: Sahil Takiar >Assignee: Sahil Takiar >Priority: Minor > Fix For: 2.3.1, 2.4.0 > > > {{InProcessAppHandle}} uses {{ChildProcAppHandle}} as the class in > {{getLogger}}, it should just use its own name. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23702) Forbid watermarks on both sides of a streaming aggregate
Jose Torres created SPARK-23702: --- Summary: Forbid watermarks on both sides of a streaming aggregate Key: SPARK-23702 URL: https://issues.apache.org/jira/browse/SPARK-23702 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 2.4.0 Reporter: Jose Torres -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23701) Multiple sequential watermarks are not supported
Jose Torres created SPARK-23701: --- Summary: Multiple sequential watermarks are not supported Key: SPARK-23701 URL: https://issues.apache.org/jira/browse/SPARK-23701 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.4.0 Reporter: Jose Torres In 2.3, we allowed query plans with multiple watermarks to run in order to enable stream-stream joins. But we've only implemented the functionality for watermarks feeding into a join operator in parallel. It won't work currently (and would require in-depth changes) if the watermarks are sequential in the plan. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23700) Cleanup unused imports
Bryan Cutler created SPARK-23700: Summary: Cleanup unused imports Key: SPARK-23700 URL: https://issues.apache.org/jira/browse/SPARK-23700 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.4.0 Reporter: Bryan Cutler I've noticed a fair number of unused imports in PySpark; I'll take a look through and try to clean them up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
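A pass like this is easy to prototype with the standard ast module. The sketch below flags top-level imported names that are never referenced, a rough approximation of flake8's F401 check (it ignores corner cases such as __all__ re-exports, star imports, and string annotations):

```python
import ast

def unused_imports(source):
    """Return the sorted list of imported names never referenced in `source`.
    Rough sketch only: does not handle __all__, star imports, or annotations."""
    tree = ast.parse(source)
    imported, used = {}, set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds the top-level name `a`
                imported[(alias.asname or alias.name).split(".")[0]] = alias.name
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported[alias.asname or alias.name] = alias.name
        elif isinstance(node, ast.Name):
            used.add(node.id)
    return sorted(set(imported) - used)
```

For example, `unused_imports("import os\nimport sys\nprint(sys.argv)\n")` reports only `os`, since `sys` is used via `sys.argv`.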
[jira] [Commented] (SPARK-23698) Spark code contains numerous undefined names in Python 3
[ https://issues.apache.org/jira/browse/SPARK-23698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401252#comment-16401252 ] cclauss commented on SPARK-23698: - A PR to fix 17 of the 20 issues is at https://github.com/apache/spark/pull/20838/files > Spark code contains numerous undefined names in Python 3 > > > Key: SPARK-23698 > URL: https://issues.apache.org/jira/browse/SPARK-23698 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: cclauss >Priority: Minor > > flake8 testing of https://github.com/apache/spark on Python 3.6.3 > $ *flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics* > ./dev/merge_spark_pr.py:98:14: F821 undefined name 'raw_input' > result = raw_input("\n%s (y/n): " % prompt) > ^ > ./dev/merge_spark_pr.py:136:22: F821 undefined name 'raw_input' > primary_author = raw_input( > ^ > ./dev/merge_spark_pr.py:186:16: F821 undefined name 'raw_input' > pick_ref = raw_input("Enter a branch name [%s]: " % default_branch) > ^ > ./dev/merge_spark_pr.py:233:15: F821 undefined name 'raw_input' > jira_id = raw_input("Enter a JIRA id [%s]: " % default_jira_id) > ^ > ./dev/merge_spark_pr.py:278:20: F821 undefined name 'raw_input' > fix_versions = raw_input("Enter comma-separated fix version(s) [%s]: " % default_fix_versions) > ^ > ./dev/merge_spark_pr.py:317:28: F821 undefined name 'raw_input' > raw_assignee = raw_input( > ^ > ./dev/merge_spark_pr.py:430:14: F821 undefined name 'raw_input' > pr_num = raw_input("Which pull request would you like to merge? (e.g. 34): ") > ^ > ./dev/merge_spark_pr.py:442:18: F821 undefined name 'raw_input' > result = raw_input("Would you like to use the modified title? (y/n): ") > ^ > ./dev/merge_spark_pr.py:493:11: F821 undefined name 'raw_input' > while raw_input("\n%s (y/n): " % pick_prompt).lower() == "y": > ^ > ./dev/create-release/releaseutils.py:58:16: F821 undefined name 'raw_input' > response = raw_input("%s [y/n]: " % msg) > ^ > ./dev/create-release/releaseutils.py:152:38: F821 undefined name 'unicode' > author = unidecode.unidecode(unicode(author, "UTF-8")).strip() > ^ > ./python/setup.py:37:11: F821 undefined name '__version__' > VERSION = __version__ > ^ > ./python/pyspark/cloudpickle.py:275:18: F821 undefined name 'buffer' > dispatch[buffer] = save_buffer > ^ > ./python/pyspark/cloudpickle.py:807:18: F821 undefined name 'file' > dispatch[file] = save_file > ^ > ./python/pyspark/sql/conf.py:61:61: F821 undefined name 'unicode' > if not isinstance(obj, str) and not isinstance(obj, unicode): > ^ > ./python/pyspark/sql/streaming.py:25:21: F821 undefined name 'long' > intlike = (int, long) > ^ > ./python/pyspark/streaming/dstream.py:405:35: F821 undefined name 'long' > return self._sc._jvm.Time(long(timestamp * 1000)) > ^ > ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:21:10: F821 undefined name 'xrange' > for i in xrange(50): > ^ > ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:22:14: F821 undefined name 'xrange' > for j in xrange(5): > ^ > ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:23:18: F821 undefined name 'xrange' > for k in xrange(20022): > ^ > 20 F821 undefined name 'raw_input' > 20 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
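All of the F821 hits above are Python 2 builtins that no longer exist in Python 3: raw_input, unicode, long, xrange, buffer, and file. For scripts that must keep running on both interpreters, the usual fix is a guarded shim near the top of the file (a sketch; buffer and file need real rewrites rather than simple aliases):

```python
import sys

# Map the removed Python 2 builtins onto their Python 3 equivalents so the
# same names work under both interpreters.
if sys.version_info[0] >= 3:
    raw_input = input   # raw_input() became input()
    xrange = range      # xrange() became range()
    unicode = str       # unicode merged into str
    long = int          # long merged into int

total = sum(long(i) for i in xrange(5))   # runs on both 2 and 3 now
label = unicode("ok")
```

The alternative, taken by the linked PR for most files, is to rewrite the call sites to the Python 3 names directly when Python 2 support is not required.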
[jira] [Issue Comment Deleted] (SPARK-23698) Spark code contains numerous undefined names in Python 3
[ https://issues.apache.org/jira/browse/SPARK-23698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cclauss updated SPARK-23698: Comment: was deleted (was: A PR to fix for 17 of the 20 issues is at https://github.com/apache/spark/pull/20838/files) > Spark code contains numerous undefined names in Python 3 > > > Key: SPARK-23698 > URL: https://issues.apache.org/jira/browse/SPARK-23698 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: cclauss >Priority: Minor > > flake8 testing of https://github.com/apache/spark on Python 3.6.3 [...] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23699) PySpark should raise same Error when Arrow fallback is disabled
[ https://issues.apache.org/jira/browse/SPARK-23699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23699: Assignee: Apache Spark > PySpark should raise same Error when Arrow fallback is disabled > --- > > Key: SPARK-23699 > URL: https://issues.apache.org/jira/browse/SPARK-23699 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Assignee: Apache Spark >Priority: Minor > > When a schema or import error is encountered when using Arrow for > createDataFrame or toPandas and fallback is disabled, a RuntimeError is > raised. It would be better to raise the same type of error. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23699) PySpark should raise same Error when Arrow fallback is disabled
[ https://issues.apache.org/jira/browse/SPARK-23699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23699: Assignee: (was: Apache Spark) > PySpark should raise same Error when Arrow fallback is disabled > --- > > Key: SPARK-23699 > URL: https://issues.apache.org/jira/browse/SPARK-23699 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Priority: Minor > > When a schema or import error is encountered when using Arrow for > createDataFrame or toPandas and fallback is disabled, a RuntimeError is > raised. It would be better to raise the same type of error. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23699) PySpark should raise same Error when Arrow fallback is disabled
[ https://issues.apache.org/jira/browse/SPARK-23699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401250#comment-16401250 ] Apache Spark commented on SPARK-23699: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/20839 > PySpark should raise same Error when Arrow fallback is disabled > --- > > Key: SPARK-23699 > URL: https://issues.apache.org/jira/browse/SPARK-23699 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Bryan Cutler >Priority: Minor > > When a schema or import error is encountered when using Arrow for > createDataFrame or toPandas and fallback is disabled, a RuntimeError is > raised. It would be better to raise the same type of error. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23698) Spark code contains numerous undefined names in Python 3
[ https://issues.apache.org/jira/browse/SPARK-23698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23698: Assignee: Apache Spark > Spark code contains numerous undefined names in Python 3 > > > Key: SPARK-23698 > URL: https://issues.apache.org/jira/browse/SPARK-23698 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: cclauss >Assignee: Apache Spark >Priority: Minor > > flake8 testing of https://github.com/apache/spark on Python 3.6.3 [...] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23698) Spark code contains numerous undefined names in Python 3
[ https://issues.apache.org/jira/browse/SPARK-23698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23698: Assignee: (was: Apache Spark) > Spark code contains numerous undefined names in Python 3 > > > Key: SPARK-23698 > URL: https://issues.apache.org/jira/browse/SPARK-23698 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: cclauss >Priority: Minor > > flake8 testing of https://github.com/apache/spark on Python 3.6.3 [...] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23698) Spark code contains numerous undefined names in Python 3
[ https://issues.apache.org/jira/browse/SPARK-23698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401247#comment-16401247 ] Apache Spark commented on SPARK-23698: -- User 'cclauss' has created a pull request for this issue: https://github.com/apache/spark/pull/20838 > Spark code contains numerous undefined names in Python 3 > > > Key: SPARK-23698 > URL: https://issues.apache.org/jira/browse/SPARK-23698 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: cclauss >Priority: Minor > > flake8 testing of https://github.com/apache/spark on Python 3.6.3 [...] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23699) PySpark should raise same Error when Arrow fallback is disabled
Bryan Cutler created SPARK-23699: Summary: PySpark should raise same Error when Arrow fallback is disabled Key: SPARK-23699 URL: https://issues.apache.org/jira/browse/SPARK-23699 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 2.4.0 Reporter: Bryan Cutler When a schema or import error is encountered while using Arrow for createDataFrame or toPandas and fallback is disabled, a generic RuntimeError is raised. It would be better to re-raise an error of the same type as the original. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
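The improvement can be illustrated outside Spark. In this toy version (the function, its parameters, and the simulated failure are made up for illustration; PySpark's real code differs), the Arrow path's failure is re-raised with its original exception type instead of being wrapped in a generic RuntimeError:

```python
def to_pandas(df, use_arrow=True, fallback_enabled=False):
    """Toy illustration of preserving the original error type when the
    Arrow path fails and fallback is disabled."""
    def arrow_path(df):
        raise ImportError("pyarrow not installed")   # simulated failure
    try:
        return arrow_path(df)
    except Exception as e:
        if not fallback_enabled:
            # re-raise with the ORIGINAL type, so callers who catch
            # ImportError (or a schema error) still see it
            raise type(e)("conversion with Arrow failed: %s" % e)
        return list(df)                              # non-Arrow fallback
```

Callers catching a specific error class then behave the same whether or not the Arrow path was attempted, which is the consistency this issue asks for.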
[jira] [Commented] (SPARK-23632) sparkR.session() error with spark packages - JVM is not ready after 10 seconds
[ https://issues.apache.org/jira/browse/SPARK-23632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401210#comment-16401210 ] Jaehyeon Kim commented on SPARK-23632: -- I've looked into it further and found it'd be better to separate package download from session start when it takes long to download a package and its dependencies. A quick check shows the following doesn't throw an error, and the jars are downloaded to _~/.ivy2/jars_. {code:java} echo 'print("done")' > pkg_install.R /usr/local/spark-2.2.1/bin/spark-submit \ --master spark://master:7077 \ --packages org.apache.hadoop:hadoop-aws:2.8.2 \ pkg_install.R {code} Let me come back with a structured example. > sparkR.session() error with spark packages - JVM is not ready after 10 seconds > -- > > Key: SPARK-23632 > URL: https://issues.apache.org/jira/browse/SPARK-23632 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0, 2.2.1, 2.3.0 >Reporter: Jaehyeon Kim >Priority: Minor > > Hi > When I execute _sparkR.session()_ with _org.apache.hadoop:hadoop-aws:2.8.2_ > as follows, > {code:java} > library(SparkR, lib.loc=file.path(Sys.getenv('SPARK_HOME'),'R', 'lib')) > ext_opts <- '-Dhttp.proxyHost=10.74.1.25 -Dhttp.proxyPort=8080 > -Dhttps.proxyHost=10.74.1.25 -Dhttps.proxyPort=8080' > sparkR.session(master = "spark://master:7077", >appName = 'ml demo', >sparkConfig = list(spark.driver.memory = '2g'), >sparkPackages = 'org.apache.hadoop:hadoop-aws:2.8.2', >spark.driver.extraJavaOptions = ext_opts) > {code} > I see a *JVM is not ready after 10 seconds* error. Below are some of the log > messages. 
> {code:java} > Ivy Default Cache set to: /home/rstudio/.ivy2/cache > The jars for the packages stored in: /home/rstudio/.ivy2/jars > :: loading settings :: url = > jar:file:/usr/local/spark-2.2.1/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml > org.apache.hadoop#hadoop-aws added as a dependency > :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 > confs: [default] > found org.apache.hadoop#hadoop-aws;2.8.2 in central > ... > ... > found javax.servlet.jsp#jsp-api;2.1 in central > Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, : > JVM is not ready after 10 seconds > ... > ... > found joda-time#joda-time;2.9.4 in central > downloading > https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.8.2/hadoop-aws-2.8.2.jar > ... > ... > ... > xmlenc#xmlenc;0.52 from central in [default] > - > | |modules|| artifacts | > | conf | number| search|dwnlded|evicted|| number|dwnlded| > - > | default | 76 | 76 | 76 | 0 || 76 | 76 | > - > :: retrieving :: org.apache.spark#spark-submit-parent > confs: [default] > 76 artifacts copied, 0 already retrieved (27334kB/56ms) > {code} > It's fine if I re-execute it after the package and its dependencies are > downloaded. > I consider it's because of this part - > https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L181 > {code:java} > if (!file.exists(path)) { > stop("JVM is not ready after 10 seconds") > } > {code} > Just wonder if it may be possible to update so that a user can determine how > much to wait? > Thanks. > Regards > Jaehyeon -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
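The reporter's suggestion amounts to replacing the hard-coded 10-second check with a user-configurable deadline. A language-neutral sketch of that idea in Python (the names {{wait_for_file}}, {{timeout}}, and {{poll_interval}} are illustrative, not SparkR's API):

```python
import os
import time

def wait_for_file(path, timeout=10.0, poll_interval=0.1):
    """Poll for `path` until it exists or `timeout` seconds elapse.

    A configurable timeout lets callers on slow networks (e.g. while Ivy is
    still downloading --packages dependencies) wait longer than a fixed 10 s.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll_interval)
    # One final check so a file appearing right at the deadline still counts.
    return os.path.exists(path)
```

The existing SparkR behavior corresponds to calling this with a fixed default and raising "JVM is not ready" when it returns false; exposing the timeout as an option would address the reported case.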
[jira] [Created] (SPARK-23698) Spark code contains numerous undefined names in Python 3
cclauss created SPARK-23698: --- Summary: Spark code contains numerous undefined names in Python 3 Key: SPARK-23698 URL: https://issues.apache.org/jira/browse/SPARK-23698 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.0 Reporter: cclauss flake8 testing of https://github.com/apache/spark on Python 3.6.3 $ *flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics* ./dev/merge_spark_pr.py:98:14: F821 undefined name 'raw_input' result = raw_input("\n%s (y/n): " % prompt) ^ ./dev/merge_spark_pr.py:136:22: F821 undefined name 'raw_input' primary_author = raw_input( ^ ./dev/merge_spark_pr.py:186:16: F821 undefined name 'raw_input' pick_ref = raw_input("Enter a branch name [%s]: " % default_branch) ^ ./dev/merge_spark_pr.py:233:15: F821 undefined name 'raw_input' jira_id = raw_input("Enter a JIRA id [%s]: " % default_jira_id) ^ ./dev/merge_spark_pr.py:278:20: F821 undefined name 'raw_input' fix_versions = raw_input("Enter comma-separated fix version(s) [%s]: " % default_fix_versions) ^ ./dev/merge_spark_pr.py:317:28: F821 undefined name 'raw_input' raw_assignee = raw_input( ^ ./dev/merge_spark_pr.py:430:14: F821 undefined name 'raw_input' pr_num = raw_input("Which pull request would you like to merge? (e.g. 34): ") ^ ./dev/merge_spark_pr.py:442:18: F821 undefined name 'raw_input' result = raw_input("Would you like to use the modified title? 
(y/n): ") ^ ./dev/merge_spark_pr.py:493:11: F821 undefined name 'raw_input' while raw_input("\n%s (y/n): " % pick_prompt).lower() == "y": ^ ./dev/create-release/releaseutils.py:58:16: F821 undefined name 'raw_input' response = raw_input("%s [y/n]: " % msg) ^ ./dev/create-release/releaseutils.py:152:38: F821 undefined name 'unicode' author = unidecode.unidecode(unicode(author, "UTF-8")).strip() ^ ./python/setup.py:37:11: F821 undefined name '__version__' VERSION = __version__ ^ ./python/pyspark/cloudpickle.py:275:18: F821 undefined name 'buffer' dispatch[buffer] = save_buffer ^ ./python/pyspark/cloudpickle.py:807:18: F821 undefined name 'file' dispatch[file] = save_file ^ ./python/pyspark/sql/conf.py:61:61: F821 undefined name 'unicode' if not isinstance(obj, str) and not isinstance(obj, unicode): ^ ./python/pyspark/sql/streaming.py:25:21: F821 undefined name 'long' intlike = (int, long) ^ ./python/pyspark/streaming/dstream.py:405:35: F821 undefined name 'long' return self._sc._jvm.Time(long(timestamp * 1000)) ^ ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:21:10: F821 undefined name 'xrange' for i in xrange(50): ^ ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:22:14: F821 undefined name 'xrange' for j in xrange(5): ^ ./sql/hive/src/test/resources/data/scripts/dumpdata_script.py:23:18: F821 undefined name 'xrange' for k in xrange(20022): ^ 20F821 undefined name 'raw_input' 20 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23697) Accumulators of Spark 1.x no longer work with Spark 2.x
[ https://issues.apache.org/jira/browse/SPARK-23697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Zhemzhitsky updated SPARK-23697: --- Description: I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x failing with java.lang.AssertionError: assertion failed: copyAndReset must return a zero value copy It happens while serializing an accumulator [here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L165] {code:java} val copyAcc = copyAndReset() assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code} ... although copyAndReset returns zero-value copy for sure, just consider the accumulator below {code:java} val concatParam = new AccumulatorParam[jl.StringBuilder] { override def zero(initialValue: jl.StringBuilder): jl.StringBuilder = new jl.StringBuilder() override def addInPlace(r1: jl.StringBuilder, r2: jl.StringBuilder): jl.StringBuilder = r1.append(r2) }{code} So, Spark treats zero value as non-zero due to how [isZero|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L489] is implemented in LegacyAccumulatorWrapper. {code:java} override def isZero: Boolean = _value == param.zero(initialValue){code} All this means that the values to be accumulated must implement equals and hashCode, otherwise isZero is very likely to always return false. So I'm wondering whether the assertion {code:java} assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code} is really necessary and whether it can be safely removed from there? If not - is it ok to just override writeReplace for LegacyAccumulatorWrapper to prevent such failures? 
was: I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x failing with java.lang.AssertionError: assertion failed: copyAndReset must return a zero value copy It happens while serializing an accumulator [here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L165] {code:java} val copyAcc = copyAndReset() assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code} ... although copyAndReset returns zero-value copy for sure, just consider the accumulator below {code:java} val concatParam = new AccumulatorParam[jl.StringBuilder] { override def zero(initialValue: jl.StringBuilder): jl.StringBuilder = new jl.StringBuilder() override def addInPlace(r1: jl.StringBuilder, r2: jl.StringBuilder): jl.StringBuilder = r1.append(r2) }{code} So, Spark treats zero value as non-zero due to how [isZero|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L489] is implemented in LegacyAccumulatorWrapper. {code:java} override def isZero: Boolean = _value == param.zero(initialValue){code} All this means that the values to be accumulated must implement equals and hashCode, otherwise isZero is very likely to always return false. So I'm wondering whether the assertion {code:java} assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code} is really necessary and whether it can be safely removed from there? If not - is it ok to just override writeReplace for LegacyAccumulatorWrapperto prevent such failures? 
> Accumulators of Spark 1.x no longer work with Spark 2.x > --- > > Key: SPARK-23697 > URL: https://issues.apache.org/jira/browse/SPARK-23697 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1 > Environment: Spark 2.2.0 > Scala 2.11 >Reporter: Sergey Zhemzhitsky >Priority: Major > > I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x > failing with > java.lang.AssertionError: assertion failed: copyAndReset must return a zero > value copy > It happens while serializing an accumulator > [here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L165] > {code:java} > val copyAcc = copyAndReset() > assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code} > ... although copyAndReset returns zero-value copy for sure, just consider the > accumulator below > {code:java} > val concatParam = new AccumulatorParam[jl.StringBuilder] { > override def zero(initialValue: jl.StringBuilder): jl.StringBuilder = new > jl.StringBuilder() > override def addInPlace(r1: jl.StringBuilder, r2: jl.StringBuilder): > jl.StringBuilder = r1.append(r2) > }{code} > So, Spark treats zero value as non-zero due to how >
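The root cause — {{isZero}} comparing {{_value == param.zero(initialValue)}} only works when the accumulated type defines value-based equality — is easy to reproduce outside Spark. In Python, the analogue of a Java class without {{equals}}/{{hashCode}} is a class without {{__eq__}}/{{__hash__}} (equality falls back to identity), which mirrors {{java.lang.StringBuilder}} in the example above:

```python
class NoEq:
    """Like java.lang.StringBuilder: no value-based equality defined,
    so == falls back to object identity."""
    def __init__(self, text=""):
        self.text = text

class WithEq(NoEq):
    """Defines value equality, so two freshly built zero values compare equal."""
    def __eq__(self, other):
        return isinstance(other, NoEq) and self.text == other.text
    def __hash__(self):
        return hash(self.text)

# Two independently constructed "zero" values:
identity_based = NoEq() == NoEq()    # False: different objects
value_based = WithEq() == WithEq()   # True: same contents
```

The {{identity_based}} case is exactly why {{LegacyAccumulatorWrapper.isZero}} returns false for a freshly reset {{StringBuilder}} accumulator, tripping the {{copyAndReset}} assertion.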
[jira] [Created] (SPARK-23697) Accumulators of Spark 1.x no longer work with Spark 2.x
Sergey Zhemzhitsky created SPARK-23697: -- Summary: Accumulators of Spark 1.x no longer work with Spark 2.x Key: SPARK-23697 URL: https://issues.apache.org/jira/browse/SPARK-23697 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.1, 2.2.0 Environment: Spark 2.2.0 Scala 2.11 Reporter: Sergey Zhemzhitsky I've noticed that accumulators of Spark 1.x no longer work with Spark 2.x failing with java.lang.AssertionError: assertion failed: copyAndReset must return a zero value copy It happens while serializing an accumulator [here|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L165] {code:java} val copyAcc = copyAndReset() assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code} ... although copyAndReset returns zero-value copy for sure, just consider the accumulator below {code:java} val concatParam = new AccumulatorParam[jl.StringBuilder] { override def zero(initialValue: jl.StringBuilder): jl.StringBuilder = new jl.StringBuilder() override def addInPlace(r1: jl.StringBuilder, r2: jl.StringBuilder): jl.StringBuilder = r1.append(r2) }{code} So, Spark treats zero value as non-zero due to how [isZero|https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L489] is implemented in LegacyAccumulatorWrapper {code:java} override def isZero: Boolean = _value == param.zero(initialValue){code} All this means that the values to be accumulated must implement equals and hashCode, otherwise `isZero` is very likely to always return false. So I'm wondering whether the assertion {code:java} assert(copyAcc.isZero, "copyAndReset must return a zero value copy"){code} is really necessary and whether it can be safely removed from there. 
[jira] [Resolved] (SPARK-23684) mode append function not working
[ https://issues.apache.org/jira/browse/SPARK-23684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23684. -- Resolution: Duplicate > mode append function not working > - > > Key: SPARK-23684 > URL: https://issues.apache.org/jira/browse/SPARK-23684 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.2.0 >Reporter: Evan Zamir >Priority: Minor > > {{df.write.mode('append').jdbc(url, table, properties=\{"driver": > "org.postgresql.Driver"}) }} > produces the following error and does not write to existing table: > {{2018-03-14 11:00:08,332 root ERROR An error occurred while calling > o894.jdbc.}} > {{: scala.MatchError: null}} > {{ at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:62)}} > {{ at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472)}} > {{ at > org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)}} > {{ at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)}} > {{ at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)}} > {{ at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)}} > {{ at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)}} > {{ at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)}} > {{ at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)}} > {{ at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)}} > {{ at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)}} > {{ at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)}} > {{ at > 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)}} > {{ at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)}} > {{ at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)}} > {{ at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)}} > {{ at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:461)}} > {{ at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)}} > {{ at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)}} > {{ at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)}} > {{ at java.lang.reflect.Method.invoke(Method.java:498)}} > {{ at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)}} > {{ at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)}} > {{ at py4j.Gateway.invoke(Gateway.java:280)}} > {{ at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)}} > {{ at py4j.commands.CallCommand.execute(CallCommand.java:79)}} > {{ at py4j.GatewayConnection.run(GatewayConnection.java:214)}} > {{ at java.lang.Thread.run(Thread.java:745)}} > However, > {{df.write.jdbc(url, table, properties=\{"driver": > "org.postgresql.Driver"},mode='append')}} > does not produce an error and adds a row to an existing table. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
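A {{scala.MatchError: null}} like the one above typically means a pattern match over the save mode had no case for a null/unset value. The defensive alternative is a dispatch with an explicit error for unknown modes; a plain-Python sketch of that pattern (the {{save}} helper and its signature are hypothetical, not Spark's code):

```python
def save(table, rows, mode=None):
    """Dispatch on save mode; reject a missing or unknown mode explicitly
    instead of failing with an opaque match error."""
    handlers = {
        "append":    lambda: table.extend(rows),
        "overwrite": lambda: (table.clear(), table.extend(rows)),
    }
    if mode not in handlers:
        raise ValueError(f"unsupported save mode: {mode!r}")
    handlers[mode]()
    return table
```

Calling it without a mode then yields a readable {{ValueError}} naming the bad value, rather than the bare {{MatchError: null}} the reporter hit.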
[jira] [Commented] (SPARK-7131) Move tree,forest implementation from spark.mllib to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401071#comment-16401071 ] Joseph K. Bradley commented on SPARK-7131: -- CCing people watching this JIRA about https://github.com/apache/spark/pull/20786 In that PR, we want to make LeafNode and InternalNode into traits (not classes) in order to split Regression from Classification nodes (to have stronger typing). Will this break anyone's code outside of org.apache.spark.ml? I doubt it since the node constructors are still private, but I wanted to CC people. Thanks! > Move tree,forest implementation from spark.mllib to spark.ml > > > Key: SPARK-7131 > URL: https://issues.apache.org/jira/browse/SPARK-7131 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.4.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Major > Fix For: 1.5.0 > > Original Estimate: 168h > Remaining Estimate: 168h > > We want to change and improve the spark.ml API for trees and ensembles, but > we cannot change the old API in spark.mllib. To support the changes we want > to make, we should move the implementation from spark.mllib to spark.ml. We > will generalize and modify it, but will also ensure that we do not change the > behavior of the old API. > There are several steps to this: > 1. Copy the implementation over to spark.ml and change the spark.ml classes > to use that implementation, rather than calling the spark.mllib > implementation. The current spark.ml tests will ensure that the 2 > implementations learn exactly the same models. Note: This should include > performance testing to make sure the updated code does not have any > regressions. --> *UPDATE*: I have run tests using spark-perf, and there were > no regressions. > 2. Remove the spark.mllib implementation, and make the spark.mllib APIs > wrappers around the spark.ml implementation. 
The spark.ml tests will again > ensure that we do not change any behavior. > 3. Move the unit tests to spark.ml, and change the spark.mllib unit tests to > verify model equivalence. > This JIRA is now for step 1 only. Steps 2 and 3 will be in separate JIRAs. > After these updates, we can more safely generalize and improve the spark.ml > implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20169) Groupby Bug with Sparksql
[ https://issues.apache.org/jira/browse/SPARK-20169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401029#comment-16401029 ] Dylan Guedes commented on SPARK-20169: -- Hi, I also reproduced it in v2.3 and master. I think it is something related to the String type: if I cast the jr dataframe column to long, it works fine; however, if I cast it to String, the bug still happens. I don't know the catalyst codebase that well (never touched it actually), do you have a suggestion of where to start looking after I call _jdf? I don't know how to follow the trace after converting to the JVM. Thank you! > Groupby Bug with Sparksql > - > > Key: SPARK-20169 > URL: https://issues.apache.org/jira/browse/SPARK-20169 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Bin Wu >Priority: Major > > We found a potential bug in the Catalyst optimizer: it cannot correctly > process "groupBy". You can reproduce it with the following simple example: > = > from pyspark.sql.functions import * > #e=sc.parallelize([(1,2),(1,3),(1,4),(2,1),(3,1),(4,1)]).toDF(["src","dst"]) > e = spark.read.csv("graph.csv", header=True) > r = sc.parallelize([(1,),(2,),(3,),(4,)]).toDF(['src']) > r1 = e.join(r, 'src').groupBy('dst').count().withColumnRenamed('dst','src') > jr = e.join(r1, 'src') > jr.show() > r2 = jr.groupBy('dst').count() > r2.show() > = > FYI, "graph.csv" contains exactly the same data as the commented line. > You can find that jr is: > |src|dst|count| > | 3| 1|1| > | 1| 4|3| > | 1| 3|3| > | 1| 2|3| > | 4| 1|1| > | 2| 1|1| > But, after the last groupBy, the 3 rows with dst = 1 are not grouped together: > |dst|count| > | 1|1| > | 4|1| > | 3|1| > | 2|1| > | 1|1| > | 1|1| > If we build jr directly from raw data (commented line), this error does not > show up, so > we suspect that there is a bug in the Catalyst optimizer when multiple joins > and groupBy's > are being optimized. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
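For reference, the result the last groupBy should produce can be computed from the jr rows above with plain Python (a sketch of the expected grouping semantics only, independent of Spark's implementation):

```python
from collections import Counter

# The jr rows from the reproduction above, as (src, dst, count) tuples.
jr_rows = [
    (3, 1, 1),
    (1, 4, 3),
    (1, 3, 3),
    (1, 2, 3),
    (4, 1, 1),
    (2, 1, 1),
]

# groupBy('dst').count() counts *rows* per dst value, so the three
# rows with dst == 1 should collapse into a single group of size 3.
expected = Counter(dst for _src, dst, _cnt in jr_rows)

print(dict(expected))  # {1: 3, 4: 1, 3: 1, 2: 1}
```

The buggy output shows three separate rows for dst = 1, instead of the single (dst=1, count=3) group above.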
[jira] [Assigned] (SPARK-23686) Make better usage of org.apache.spark.ml.util.Instrumentation
[ https://issues.apache.org/jira/browse/SPARK-23686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23686: Assignee: (was: Apache Spark) > Make better usage of org.apache.spark.ml.util.Instrumentation > - > > Key: SPARK-23686 > URL: https://issues.apache.org/jira/browse/SPARK-23686 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Priority: Major > > This Jira is a bit high level and might require subtasks or other jiras for > more specific tasks. > I've noticed that we don't make the best usage of the instrumentation class. > Specifically sometimes we bypass the instrumentation class and use the > debugger instead. For example, > [https://github.com/apache/spark/blob/9b9827759af2ca3eea146a6032f9165f640ce152/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L143] > Also there are some things that might be useful to log in the instrumentation > class that we currently don't. For example: > number of training examples > mean/var of label (regression) > I know computing these things can be expensive in some cases, but especially > when this data is already available we can log it for free. For example, > Logistic Regression Summarizer computes some useful data including numRows > that we don't log. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23686) Make better usage of org.apache.spark.ml.util.Instrumentation
[ https://issues.apache.org/jira/browse/SPARK-23686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16400903#comment-16400903 ] Apache Spark commented on SPARK-23686: -- User 'MrBago' has created a pull request for this issue: https://github.com/apache/spark/pull/20837 > Make better usage of org.apache.spark.ml.util.Instrumentation > - > > Key: SPARK-23686 > URL: https://issues.apache.org/jira/browse/SPARK-23686 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Priority: Major > > This Jira is a bit high level and might require subtasks or other jiras for > more specific tasks. > I've noticed that we don't make the best usage of the instrumentation class. > Specifically sometimes we bypass the instrumentation class and use the > debugger instead. For example, > [https://github.com/apache/spark/blob/9b9827759af2ca3eea146a6032f9165f640ce152/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L143] > Also there are some things that might be useful to log in the instrumentation > class that we currently don't. For example: > number of training examples > mean/var of label (regression) > I know computing these things can be expensive in some cases, but especially > when this data is already available we can log it for free. For example, > Logistic Regression Summarizer computes some useful data including numRows > that we don't log. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23686) Make better usage of org.apache.spark.ml.util.Instrumentation
[ https://issues.apache.org/jira/browse/SPARK-23686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23686: Assignee: Apache Spark > Make better usage of org.apache.spark.ml.util.Instrumentation > - > > Key: SPARK-23686 > URL: https://issues.apache.org/jira/browse/SPARK-23686 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Assignee: Apache Spark >Priority: Major > > This Jira is a bit high level and might require subtasks or other jiras for > more specific tasks. > I've noticed that we don't make the best usage of the instrumentation class. > Specifically sometimes we bypass the instrumentation class and use the > debugger instead. For example, > [https://github.com/apache/spark/blob/9b9827759af2ca3eea146a6032f9165f640ce152/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L143] > Also there are some things that might be useful to log in the instrumentation > class that we currently don't. For example: > number of training examples > mean/var of label (regression) > I know computing these things can be expensive in some cases, but especially > when this data is already available we can log it for free. For example, > Logistic Regression Summarizer computes some useful data including numRows > that we don't log. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23695) Confusing error message for PySpark's Kinesis tests when its jar is missing but enabled
[ https://issues.apache.org/jira/browse/SPARK-23695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-23695. --- Resolution: Fixed Fix Version/s: 2.4.0 2.3.1 Issue resolved by pull request 20834 https://github.com/apache/spark/pull/20834 > Confusing error message for PySpark's Kinesis tests when its jar is missing > but enabled > --- > > Key: SPARK-23695 > URL: https://issues.apache.org/jira/browse/SPARK-23695 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Trivial > Fix For: 2.3.1, 2.4.0 > > > Currently if its jar is missing but the Kinesis tests are enabled: > {code} > ENABLE_KINESIS_TESTS=1 SPARK_TESTING=1 bin/pyspark pyspark.streaming.tests > Skipped test_flume_stream (enable by setting environment variable > ENABLE_FLUME_TESTS=1Skipped test_kafka_stream (enable by setting environment > variable ENABLE_KAFKA_0_8_TESTS=1Traceback (most recent call last): > File > "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", > line 174, in _run_module_as_main > "__main__", fname, loader, pkg_name) > File > "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", > line 72, in _run_code > exec code in run_globals > File "/.../spark/python/pyspark/streaming/tests.py", line 1572, in > % kinesis_asl_assembly_dir) + > NameError: name 'kinesis_asl_assembly_dir' is not defined > {code} > It shows a confusing error message. Seems a mistake. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
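The NameError in the traceback comes from the error-reporting path itself referencing an unbound name, which masks the intended message. A simplified, hypothetical Python sketch of that failure mode and its fix (the function and variable names below are illustrative, not the actual tests.py code):

```python
import glob

def find_kinesis_jar(search_dir):
    jars = glob.glob(search_dir + "/*.jar")
    if not jars:
        # BUG: 'jar_path' is not bound on this path; referencing it here
        # raises NameError instead of the intended, informative message.
        raise RuntimeError("Failed to find jar in %s" % jar_path)
    return jars[0]

def find_kinesis_jar_fixed(search_dir):
    jars = glob.glob(search_dir + "/*.jar")
    if not jars:
        # FIX: only reference names that are in scope on this path.
        raise RuntimeError("Failed to find jar in %s" % search_dir)
    return jars[0]
```

Calling the buggy version with a directory containing no jars raises NameError (the confusing message); the fixed version raises the intended RuntimeError.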
[jira] [Assigned] (SPARK-23695) Confusing error message for PySpark's Kinesis tests when its jar is missing but enabled
[ https://issues.apache.org/jira/browse/SPARK-23695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin reassigned SPARK-23695: - Assignee: Hyukjin Kwon > Confusing error message for PySpark's Kinesis tests when its jar is missing > but enabled > --- > > Key: SPARK-23695 > URL: https://issues.apache.org/jira/browse/SPARK-23695 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Trivial > > Currently if its jar is missing but the Kinesis tests are enabled: > {code} > ENABLE_KINESIS_TESTS=1 SPARK_TESTING=1 bin/pyspark pyspark.streaming.tests > Skipped test_flume_stream (enable by setting environment variable > ENABLE_FLUME_TESTS=1Skipped test_kafka_stream (enable by setting environment > variable ENABLE_KAFKA_0_8_TESTS=1Traceback (most recent call last): > File > "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", > line 174, in _run_module_as_main > "__main__", fname, loader, pkg_name) > File > "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", > line 72, in _run_code > exec code in run_globals > File "/.../spark/python/pyspark/streaming/tests.py", line 1572, in > % kinesis_asl_assembly_dir) + > NameError: name 'kinesis_asl_assembly_dir' is not defined > {code} > It shows a confusing error message. Seems a mistake. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4038) Outlier Detection Algorithm for MLlib
[ https://issues.apache.org/jira/browse/SPARK-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16400839#comment-16400839 ] Gustavo Orair commented on SPARK-4038: -- There is a paper that discusses multiple strategies for distance-based outlier detection: * [http://www2.cs.uh.edu/~ceick/7362/T1-9.pdf] It proposes an adaptive parallel algorithm that uses ranking strategies to select a pool of candidate points and searches locally for their neighbors before examining the data for outliers, which looks promising for distributed computation. > Outlier Detection Algorithm for MLlib > - > > Key: SPARK-4038 > URL: https://issues.apache.org/jira/browse/SPARK-4038 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Ashutosh Trivedi >Priority: Minor > > The aim of this JIRA is to discuss which parallel outlier detection > algorithms can be included in MLlib. > The one I am familiar with is Attribute Value Frequency (AVF). It > scales linearly with the number of data points and attributes, and relies on > a single data scan. It is not distance based and is well suited for categorical > data. In the original paper a parallel version is also given, which is not > complicated to implement. I am working on the implementation and will soon submit > the initial code for review. > Here is the link to the paper > http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4410382 > As pointed out by Xiangrui in the discussion > http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-td8880.html > there are other algorithms as well. Let's discuss which will be more > general and more easily parallelized. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
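For reviewers unfamiliar with AVF, the scoring idea is simple enough to sketch in a few lines of plain Python (a single-machine illustration of the algorithm's core, not the proposed MLlib implementation):

```python
from collections import Counter

def avf_scores(rows):
    """Attribute Value Frequency: score each row by the mean frequency of
    its attribute values; rows with rare values get low scores and are
    candidate outliers. One pass to build per-column frequency tables,
    one pass to score, so it is linear in rows x columns."""
    n_cols = len(rows[0])
    # One frequency table per column (categorical attributes).
    freq = [Counter(row[j] for row in rows) for j in range(n_cols)]
    return [sum(freq[j][row[j]] for j in range(n_cols)) / n_cols
            for row in rows]

rows = [
    ("red", "small"),
    ("red", "small"),
    ("red", "small"),
    ("blue", "large"),  # rare values in both columns -> lowest score
]
scores = avf_scores(rows)
outlier = rows[scores.index(min(scores))]
print(outlier)  # ('blue', 'large')
```

Because each pass is an independent map over the rows, the frequency counting and scoring both parallelize naturally, which is what makes the algorithm attractive for a distributed setting.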
[jira] [Commented] (SPARK-23684) mode append function not working
[ https://issues.apache.org/jira/browse/SPARK-23684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16400823#comment-16400823 ] Evan Zamir commented on SPARK-23684: Yes, you're right. Feel free to close this. > mode append function not working > - > > Key: SPARK-23684 > URL: https://issues.apache.org/jira/browse/SPARK-23684 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.2.0 >Reporter: Evan Zamir >Priority: Minor > > {{df.write.mode('append').jdbc(url, table, properties=\{"driver": > "org.postgresql.Driver"}) }} > produces the following error and does not write to existing table: > {{2018-03-14 11:00:08,332 root ERROR An error occurred while calling > o894.jdbc.}} > {{: scala.MatchError: null}} > {{ at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:62)}} > {{ at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472)}} > {{ at > org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)}} > {{ at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)}} > {{ at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)}} > {{ at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)}} > {{ at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)}} > {{ at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)}} > {{ at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)}} > {{ at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)}} > {{ at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)}} > {{ at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)}} > {{ at > 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)}} > {{ at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)}} > {{ at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)}} > {{ at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)}} > {{ at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:461)}} > {{ at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)}} > {{ at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)}} > {{ at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)}} > {{ at java.lang.reflect.Method.invoke(Method.java:498)}} > {{ at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)}} > {{ at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)}} > {{ at py4j.Gateway.invoke(Gateway.java:280)}} > {{ at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)}} > {{ at py4j.commands.CallCommand.execute(CallCommand.java:79)}} > {{ at py4j.GatewayConnection.run(GatewayConnection.java:214)}} > {{ at java.lang.Thread.run(Thread.java:745)}} > However, > {{df.write.jdbc(url, table, properties=\{"driver": > "org.postgresql.Driver"},mode='append')}} > does not produce an error and adds a row to an existing table. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23685) Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)
[ https://issues.apache.org/jira/browse/SPARK-23685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16400755#comment-16400755 ] Apache Spark commented on SPARK-23685: -- User 'sirishaSindri' has created a pull request for this issue: https://github.com/apache/spark/pull/20836 > Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive > Offsets (i.e. Log Compaction) > - > > Key: SPARK-23685 > URL: https://issues.apache.org/jira/browse/SPARK-23685 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: sirisha >Priority: Major > > When Kafka does log compaction, offsets often end up with gaps, meaning the > next requested offset will frequently not be offset+1. The logic in > KafkaSourceRDD & CachedKafkaConsumer assumes that the next offset will always > be just an increment of 1. If not, it throws the exception below: > > "Cannot fetch records in [5589, 5693) (GroupId: XXX, TopicPartition:). > Some data may have been lost because they are not available in Kafka any > more; either the data was aged out by Kafka or the topic may have been > deleted before all the data in the topic was processed. If you don't want > your streaming query to fail on such cases, set the source option > "failOnDataLoss" to "false". " > > FYI: This bug is related to https://issues.apache.org/jira/browse/SPARK-17147 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
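The offset-gap behavior the description refers to can be sketched in plain Python (hypothetical offsets for illustration; the actual fix lives in the Scala consumer, which would need to seek forward past gaps rather than fail):

```python
# After log compaction, superseded records are removed, leaving gaps in
# the partition's offset sequence.
available_offsets = [5589, 5591, 5592, 5595]  # 5590, 5593, 5594 compacted away

def next_offset_naive(last):
    # Assumes offsets are consecutive -- wrong on a compacted topic.
    return last + 1

def next_offset_tolerant(last, available):
    # Seek to the smallest available offset greater than the last one.
    return min((o for o in available if o > last), default=None)

print(next_offset_naive(5589))                         # 5590, which no longer exists
print(next_offset_tolerant(5589, available_offsets))   # 5591
```

The naive expectation (5590) is what trips the "Cannot fetch records" / data-loss error path, even though no data was actually lost.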
[jira] [Assigned] (SPARK-23685) Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)
[ https://issues.apache.org/jira/browse/SPARK-23685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23685: Assignee: Apache Spark > Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive > Offsets (i.e. Log Compaction) > - > > Key: SPARK-23685 > URL: https://issues.apache.org/jira/browse/SPARK-23685 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: sirisha >Assignee: Apache Spark >Priority: Major > > When Kafka does log compaction, offsets often end up with gaps, meaning the > next requested offset will frequently not be offset+1. The logic in > KafkaSourceRDD & CachedKafkaConsumer assumes that the next offset will always > be just an increment of 1. If not, it throws the exception below: > > "Cannot fetch records in [5589, 5693) (GroupId: XXX, TopicPartition:). > Some data may have been lost because they are not available in Kafka any > more; either the data was aged out by Kafka or the topic may have been > deleted before all the data in the topic was processed. If you don't want > your streaming query to fail on such cases, set the source option > "failOnDataLoss" to "false". " > > FYI: This bug is related to https://issues.apache.org/jira/browse/SPARK-17147 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23685) Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)
[ https://issues.apache.org/jira/browse/SPARK-23685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23685: Assignee: (was: Apache Spark) > Spark Structured Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive > Offsets (i.e. Log Compaction) > - > > Key: SPARK-23685 > URL: https://issues.apache.org/jira/browse/SPARK-23685 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: sirisha >Priority: Major > > When Kafka does log compaction, offsets often end up with gaps, meaning the > next requested offset will frequently not be offset+1. The logic in > KafkaSourceRDD & CachedKafkaConsumer assumes that the next offset will always > be just an increment of 1. If not, it throws the exception below: > > "Cannot fetch records in [5589, 5693) (GroupId: XXX, TopicPartition:). > Some data may have been lost because they are not available in Kafka any > more; either the data was aged out by Kafka or the topic may have been > deleted before all the data in the topic was processed. If you don't want > your streaming query to fail on such cases, set the source option > "failOnDataLoss" to "false". " > > FYI: This bug is related to https://issues.apache.org/jira/browse/SPARK-17147 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23696) StructType.fromString swallows exceptions from DataType.fromJson
Simeon H.K. Fitch created SPARK-23696: - Summary: StructType.fromString swallows exceptions from DataType.fromJson Key: SPARK-23696 URL: https://issues.apache.org/jira/browse/SPARK-23696 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.2.1 Reporter: Simeon H.K. Fitch `StructType.fromString` swallows exceptions from `DataType.fromJson`, assuming they are an indication that `LegacyTypeStringParser.parse` should be called instead. When that fails (because it throws an exception), an error message is generated that does not reflect the true problem at hand, effectively swallowing the exception from `DataType.fromJson`. This makes debugging Parquet schema issues more difficult. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
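A minimal Python sketch of the swallowing pattern, and a variant that preserves the root cause via exception chaining (illustrative only; the real code is Scala and parses Spark's JSON schema representation):

```python
import json

def parse_schema_swallowing(s):
    try:
        return json.loads(s)
    except ValueError:
        # The original parse error is discarded; the generic message
        # below is all the caller ever sees, even when the input was
        # "almost" valid JSON with a pinpointable error.
        raise ValueError("Failed to convert the JSON string %r to a data type." % s)

def parse_schema_chained(s):
    try:
        return json.loads(s)
    except ValueError as primary:
        # Chain the original exception so it survives in __cause__ and
        # the printed traceback, which is what makes schema problems
        # debuggable.
        raise ValueError("Failed to convert the JSON string %r to a data type." % s) from primary
```

With chaining, the traceback shows both the wrapper message and the underlying JSON error, instead of only the misleading fallback message.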
[jira] [Updated] (SPARK-23693) SQL function uuid()
[ https://issues.apache.org/jira/browse/SPARK-23693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arseniy Tashoyan updated SPARK-23693: - Description: Add function uuid() to org.apache.spark.sql.functions that returns [Universally Unique ID|https://en.wikipedia.org/wiki/Universally_unique_identifier]. Sometimes it is necessary to uniquely identify each row in a DataFrame. Currently the following ways are available: * monotonically_increasing_id() function * row_number() function over some window * convert the DataFrame to RDD and zipWithIndex() All these approaches do not work when appending this DataFrame to another DataFrame (union). Collisions may occur - two rows in different DataFrames may have the same ID. Re-generating IDs on the resulting DataFrame is not an option, because some data in some other system may already refer to old IDs. The proposed solution is to add new function: {code} def uuid(): Column{code} that returns String representation of UUID. UUID is represented as a 128-bit number (two long numbers). Such numbers are not supported in Scala or Java. In addition, some storage systems do not support 128-bit numbers (Parquet's largest numeric type is INT96). This is the reason for the uuid() function to return String. I already have a simple implementation based on [java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I can share it as a PR. was: Add function uuid() to org.apache.spark.sql.functions that returns [Universally Unique ID|https://en.wikipedia.org/wiki/Universally_unique_identifier]. Sometimes it is necessary to uniquely identify each row in a DataFrame. Currently the following ways are available: * monotonically_increasing_id() function * row_number() function over some window * convert the DataFrame to RDD and zipWithIndex() All these approaches do not work when appending this DataFrame to another DataFrame (union). Collisions may occur - two rows in different DataFrames may have the same ID. 
Re-generating IDs on the resulting DataFrame is not an option, because some data in some other system may already refer to old IDs. The proposed solution is to add new function: {code:scala} def uuid(): String{code} that returns String representation of UUID. UUID is represented as a 128-bit number (two long numbers). Such numbers are not supported in Scala or Java. In addition, some storage systems do not support 128-bit numbers (Parquet's largest numeric type is INT96). This is the reason for the uuid() function to return String. I already have a simple implementation based on [java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I can share it as a PR. > SQL function uuid() > --- > > Key: SPARK-23693 > URL: https://issues.apache.org/jira/browse/SPARK-23693 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Arseniy Tashoyan >Priority: Minor > > Add function uuid() to org.apache.spark.sql.functions that returns > [Universally Unique > ID|https://en.wikipedia.org/wiki/Universally_unique_identifier]. > Sometimes it is necessary to uniquely identify each row in a DataFrame. > Currently the following ways are available: > * monotonically_increasing_id() function > * row_number() function over some window > * convert the DataFrame to RDD and zipWithIndex() > All these approaches do not work when appending this DataFrame to another > DataFrame (union). Collisions may occur - two rows in different DataFrames > may have the same ID. Re-generating IDs on the resulting DataFrame is not an > option, because some data in some other system may already refer to old IDs. > The proposed solution is to add new function: > {code} > def uuid(): Column{code} > that returns String representation of UUID. > UUID is represented as a 128-bit number (two long numbers). Such numbers are > not supported in Scala or Java. 
In addition, some storage systems do not > support 128-bit numbers (Parquet's largest numeric type is INT96). This is > the reason for the uuid() function to return String. > I already have a simple implementation based on > [java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I > can share it as a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
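The collision the description warns about, and why uuid() avoids it, can be illustrated with Python's standard uuid module (a sketch of the motivation only; the proposed Spark function would generate UUIDs on the JVM):

```python
import uuid

# Positional id schemes restart per DataFrame, so two DataFrames built
# independently collide under union.
df_a_ids = list(range(3))  # monotonically-increasing-style ids
df_b_ids = list(range(3))  # same scheme on another DataFrame
print(set(df_a_ids) & set(df_b_ids))  # {0, 1, 2} -- union would collide

# Random (version 4) UUIDs are drawn from a 128-bit space, so collisions
# across unions are negligible. They are surfaced as strings because
# neither JVM primitives nor Parquet numeric types hold 128 bits.
uuid_a = [str(uuid.uuid4()) for _ in range(3)]
uuid_b = [str(uuid.uuid4()) for _ in range(3)]
print(set(uuid_a) & set(uuid_b))  # set() -- no collisions
```

Each canonical UUID string is 36 characters (32 hex digits plus 4 hyphens), which is the representation the proposed Column-returning function would produce.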
[jira] [Updated] (SPARK-23693) SQL function uuid()
[ https://issues.apache.org/jira/browse/SPARK-23693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arseniy Tashoyan updated SPARK-23693: - Description: Add function uuid() to org.apache.spark.sql.functions that returns [Universally Unique ID|https://en.wikipedia.org/wiki/Universally_unique_identifier]. Sometimes it is necessary to uniquely identify each row in a DataFrame. Currently the following ways are available: * monotonically_increasing_id() function * row_number() function over some window * convert the DataFrame to RDD and zipWithIndex() All these approaches do not work when appending this DataFrame to another DataFrame (union). Collisions may occur - two rows in different DataFrames may have the same ID. Re-generating IDs on the resulting DataFrame is not an option, because some data in some other system may already refer to old IDs. The proposed solution is to add new function: {code:scala} def uuid(): Column {code} that returns String representation of UUID. UUID is represented as a 128-bit number (two long numbers). Such numbers are not supported in Scala or Java. In addition, some storage systems do not support 128-bit numbers (Parquet's largest numeric type is INT96). This is the reason for the uuid() function to return String. I already have a simple implementation based on [java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I can share it as a PR. was: Add function uuid() to org.apache.spark.sql.functions that returns [Universally Unique ID|https://en.wikipedia.org/wiki/Universally_unique_identifier]. Sometimes it is necessary to uniquely identify each row in a DataFrame. Currently the following ways are available: * monotonically_increasing_id() function * row_number() function over some window * convert the DataFrame to RDD and zipWithIndex() All these approaches do not work when appending this DataFrame to another DataFrame (union). 
Collisions may occur - two rows in different DataFrames may have the same ID. Re-generating IDs on the resulting DataFrame is not an option, because some data in some other system may already refer to old IDs. The proposed solution is to add new function: {code} def uuid(): Column{code} that returns String representation of UUID. UUID is represented as a 128-bit number (two long numbers). Such numbers are not supported in Scala or Java. In addition, some storage systems do not support 128-bit numbers (Parquet's largest numeric type is INT96). This is the reason for the uuid() function to return String. I already have a simple implementation based on [java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I can share it as a PR. > SQL function uuid() > --- > > Key: SPARK-23693 > URL: https://issues.apache.org/jira/browse/SPARK-23693 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Arseniy Tashoyan >Priority: Minor > > Add function uuid() to org.apache.spark.sql.functions that returns > [Universally Unique > ID|https://en.wikipedia.org/wiki/Universally_unique_identifier]. > Sometimes it is necessary to uniquely identify each row in a DataFrame. > Currently the following ways are available: > * monotonically_increasing_id() function > * row_number() function over some window > * convert the DataFrame to RDD and zipWithIndex() > All these approaches do not work when appending this DataFrame to another > DataFrame (union). Collisions may occur - two rows in different DataFrames > may have the same ID. Re-generating IDs on the resulting DataFrame is not an > option, because some data in some other system may already refer to old IDs. > The proposed solution is to add new function: > {code:scala} > def uuid(): Column > {code} > that returns String representation of UUID. > UUID is represented as a 128-bit number (two long numbers). Such numbers are > not supported in Scala or Java. 
In addition, some storage systems do not > support 128-bit numbers (Parquet's largest numeric type is INT96). This is > the reason for the uuid() function to return String. > I already have a simple implementation based on > [java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I > can share it as a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23695) Confusing error message for PySpark's Kinesis tests when its jar is missing but enabled
[ https://issues.apache.org/jira/browse/SPARK-23695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16400325#comment-16400325 ] Apache Spark commented on SPARK-23695: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/20834 > Confusing error message for PySpark's Kinesis tests when its jar is missing > but enabled > --- > > Key: SPARK-23695 > URL: https://issues.apache.org/jira/browse/SPARK-23695 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Trivial > > Currently if its jar is missing but the Kinesis tests are enabled: > {code} > ENABLE_KINESIS_TESTS=1 SPARK_TESTING=1 bin/pyspark pyspark.streaming.tests > Skipped test_flume_stream (enable by setting environment variable > ENABLE_FLUME_TESTS=1Skipped test_kafka_stream (enable by setting environment > variable ENABLE_KAFKA_0_8_TESTS=1Traceback (most recent call last): > File > "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", > line 174, in _run_module_as_main > "__main__", fname, loader, pkg_name) > File > "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", > line 72, in _run_code > exec code in run_globals > File "/.../spark/python/pyspark/streaming/tests.py", line 1572, in > % kinesis_asl_assembly_dir) + > NameError: name 'kinesis_asl_assembly_dir' is not defined > {code} > It shows a confusing error message. Seems a mistake. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23695) Confusing error message for PySpark's Kinesis tests when its jar is missing but enabled
[ https://issues.apache.org/jira/browse/SPARK-23695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23695: Assignee: Apache Spark > Confusing error message for PySpark's Kinesis tests when its jar is missing > but enabled > --- > > Key: SPARK-23695 > URL: https://issues.apache.org/jira/browse/SPARK-23695 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Trivial > > Currently if its jar is missing but the Kinesis tests are enabled: > {code} > ENABLE_KINESIS_TESTS=1 SPARK_TESTING=1 bin/pyspark pyspark.streaming.tests > Skipped test_flume_stream (enable by setting environment variable > ENABLE_FLUME_TESTS=1Skipped test_kafka_stream (enable by setting environment > variable ENABLE_KAFKA_0_8_TESTS=1Traceback (most recent call last): > File > "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", > line 174, in _run_module_as_main > "__main__", fname, loader, pkg_name) > File > "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", > line 72, in _run_code > exec code in run_globals > File "/.../spark/python/pyspark/streaming/tests.py", line 1572, in > % kinesis_asl_assembly_dir) + > NameError: name 'kinesis_asl_assembly_dir' is not defined > {code} > It shows a confusing error message. Seems a mistake. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23695) Confusing error message for PySpark's Kinesis tests when its jar is missing but enabled
[ https://issues.apache.org/jira/browse/SPARK-23695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23695: Assignee: (was: Apache Spark) > Confusing error message for PySpark's Kinesis tests when its jar is missing > but enabled > --- > > Key: SPARK-23695 > URL: https://issues.apache.org/jira/browse/SPARK-23695 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Trivial > > Currently if its jar is missing but the Kinesis tests are enabled: > {code} > ENABLE_KINESIS_TESTS=1 SPARK_TESTING=1 bin/pyspark pyspark.streaming.tests > Skipped test_flume_stream (enable by setting environment variable > ENABLE_FLUME_TESTS=1Skipped test_kafka_stream (enable by setting environment > variable ENABLE_KAFKA_0_8_TESTS=1Traceback (most recent call last): > File > "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", > line 174, in _run_module_as_main > "__main__", fname, loader, pkg_name) > File > "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", > line 72, in _run_code > exec code in run_globals > File "/.../spark/python/pyspark/streaming/tests.py", line 1572, in > % kinesis_asl_assembly_dir) + > NameError: name 'kinesis_asl_assembly_dir' is not defined > {code} > It shows a confusing error message. Seems a mistake. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23695) Confusing error message for PySpark's Kinesis tests when its jar is missing but enabled
Hyukjin Kwon created SPARK-23695: Summary: Confusing error message for PySpark's Kinesis tests when its jar is missing but enabled Key: SPARK-23695 URL: https://issues.apache.org/jira/browse/SPARK-23695 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.3.0 Reporter: Hyukjin Kwon Currently if its jar is missing but the Kinesis tests are enabled: {code} ENABLE_KINESIS_TESTS=1 SPARK_TESTING=1 bin/pyspark pyspark.streaming.tests Skipped test_flume_stream (enable by setting environment variable ENABLE_FLUME_TESTS=1Skipped test_kafka_stream (enable by setting environment variable ENABLE_KAFKA_0_8_TESTS=1Traceback (most recent call last): File "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 174, in _run_module_as_main "__main__", fname, loader, pkg_name) File "/usr/local/Cellar/python/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals File "/.../spark/python/pyspark/streaming/tests.py", line 1572, in % kinesis_asl_assembly_dir) + NameError: name 'kinesis_asl_assembly_dir' is not defined {code} It shows a confusing error message. Seems a mistake. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
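The {{NameError}} above is the classic symptom of a name being referenced outside the scope where it was bound. A minimal Python sketch of that bug pattern and its fix (the helper and directory names here are illustrative, not the actual tests.py code):

```python
def search_kinesis_asl_assembly_jar():
    # Hypothetical stand-in for the real jar lookup: note the directory
    # name is a *local* variable here, and the lookup finds nothing.
    kinesis_asl_assembly_dir = "external/kinesis-asl-assembly"
    return None

def report_missing_jar_buggy():
    jar = search_kinesis_asl_assembly_jar()
    if jar is None:
        # Bug pattern: kinesis_asl_assembly_dir was local to the helper, so
        # formatting this message raises NameError and hides the real hint.
        raise RuntimeError("Failed to find Kinesis assembly jar in %s"
                           % kinesis_asl_assembly_dir)

def report_missing_jar_fixed():
    # Fix: make the directory name available in the scope where the
    # error message is rendered, so the intended hint is shown.
    kinesis_asl_assembly_dir = "external/kinesis-asl-assembly"
    raise RuntimeError("Failed to find Kinesis assembly jar in %s"
                       % kinesis_asl_assembly_dir)
```

The buggy variant raises {{NameError}} (the confusing message users see); the fixed variant raises the intended "assembly jar missing" error.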
[jira] [Created] (SPARK-23694) The staging directory should be under hive.exec.stagingdir if we set hive.exec.stagingdir outside the table directory
Yifeng Dong created SPARK-23694: --- Summary: The staging directory should be under hive.exec.stagingdir if we set hive.exec.stagingdir outside the table directory Key: SPARK-23694 URL: https://issues.apache.org/jira/browse/SPARK-23694 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Yifeng Dong When we set hive.exec.stagingdir to a path outside the table directory, for example /tmp/hive-staging, I think the staging directory should be created under /tmp/hive-staging, not under /tmp/ as a sibling like /tmp/hive-staging_xxx -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
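The requested behavior reduces to a small path-construction rule. A sketch of the two behaviors (the `.hive-staging` prefix and helper names are assumptions for illustration, not Spark's actual code):

```python
import os.path

def staging_dir_current(configured, suffix):
    # Behavior described above: the random suffix is appended to the
    # configured path itself, so /tmp/hive-staging produces a *sibling*
    # directory /tmp/hive-staging_xxx under /tmp.
    return configured + suffix

def staging_dir_proposed(configured, suffix):
    # Proposed behavior: create the temporary directory *inside* the
    # configured staging directory instead.
    return os.path.join(configured, ".hive-staging" + suffix)

print(staging_dir_current("/tmp/hive-staging", "_xxx"))   # /tmp/hive-staging_xxx
print(staging_dir_proposed("/tmp/hive-staging", "_xxx"))  # /tmp/hive-staging/.hive-staging_xxx
```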
[jira] [Updated] (SPARK-23693) SQL function uuid()
[ https://issues.apache.org/jira/browse/SPARK-23693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arseniy Tashoyan updated SPARK-23693: - Description: Add function uuid() to org.apache.spark.sql.functions that returns [Universally Unique ID|https://en.wikipedia.org/wiki/Universally_unique_identifier]. Sometimes it is necessary to uniquely identify each row in a DataFrame. Currently the following ways are available: * monotonically_increasing_id() function * row_number() function over some window * convert the DataFrame to RDD and zipWithIndex() All these approaches do not work when appending this DataFrame to another DataFrame (union). Collisions may occur - two rows in different DataFrames may have the same ID. Re-generating IDs on the resulting DataFrame is not an option, because some data in some other system may already refer to old IDs. The proposed solution is to add new function: {code:scala} def uuid(): String{code} that returns String representation of UUID. UUID is represented as a 128-bit number (two long numbers). Such numbers are not supported in Scala or Java. In addition, some storage systems do not support 128-bit numbers (Parquet's largest numeric type is INT96). This is the reason for the uuid() function to return String. I already have a simple implementation based on [java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I can share it as a PR. was: Add function uuid() to org.apache.spark.sql.functions that returns [Universally Unique ID|https://en.wikipedia.org/wiki/Universally_unique_identifier]. Sometimes it is necessary to uniquely identify each row in a DataFrame. Currently the following ways are available: * monotonically_increasing_id() function * row_number() function over some window * convert the DataFrame to RDD and zipWithIndex() All these approaches do not work when appending this DataFrame to another DataFrame (union). 
Collisions may occur - two rows in different DataFrames may have the same ID. Re-generating IDs on the resulting DataFrame is not an option, because some data in some other system may already refer to old IDs. The proposed solution is to add new function: {code:java} def uuid(): String{code} that returns String representation of UUID. UUID is represented as a 128-bit number (two long numbers). Such numbers are not supported in Scala or Java. In addition, some storage systems do not support 128-bit numbers (Parquet's largest numeric type is INT96). This is the reason for the uuid() function to return String. I already have a simple implementation based on [java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I can share it as a PR. > SQL function uuid() > --- > > Key: SPARK-23693 > URL: https://issues.apache.org/jira/browse/SPARK-23693 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Arseniy Tashoyan >Priority: Minor > > Add function uuid() to org.apache.spark.sql.functions that returns > [Universally Unique > ID|https://en.wikipedia.org/wiki/Universally_unique_identifier]. > Sometimes it is necessary to uniquely identify each row in a DataFrame. > Currently the following ways are available: > * monotonically_increasing_id() function > * row_number() function over some window > * convert the DataFrame to RDD and zipWithIndex() > All these approaches do not work when appending this DataFrame to another > DataFrame (union). Collisions may occur - two rows in different DataFrames > may have the same ID. Re-generating IDs on the resulting DataFrame is not an > option, because some data in some other system may already refer to old IDs. > The proposed solution is to add new function: > {code:scala} > def uuid(): String{code} > that returns String representation of UUID. > UUID is represented as a 128-bit number (two long numbers). Such numbers are > not supported in Scala or Java. 
In addition, some storage systems do not > support 128-bit numbers (Parquet's largest numeric type is INT96). This is > the reason for the uuid() function to return String. > I already have a simple implementation based on > [java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I > can share it as a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23693) SQL function uuid()
[ https://issues.apache.org/jira/browse/SPARK-23693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arseniy Tashoyan updated SPARK-23693: - Description: Add function uuid() to org.apache.spark.sql.functions that returns [Universally Unique ID|https://en.wikipedia.org/wiki/Universally_unique_identifier]. Sometimes it is necessary to uniquely identify each row in a DataFrame. Currently the following ways are available: * monotonically_increasing_id() function * row_number() function over some window * convert the DataFrame to RDD and zipWithIndex() All these approaches do not work when appending this DataFrame to another DataFrame (union). Collisions may occur - two rows in different DataFrames may have the same ID. Re-generating IDs on the resulting DataFrame is not an option, because some data in some other system may already refer to old IDs. The proposed solution is to add new function: {code:java} def uuid(): String{code} that returns String representation of UUID. UUID is represented as a 128-bit number (two long numbers). Such numbers are not supported in Scala or Java. In addition, some storage systems do not support 128-bit numbers (Parquet's largest numeric type is INT96). This is the reason for the uuid() function to return String. I already have a simple implementation based on [java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I can share it as a PR. was: Add function uuid() to org.apache.spark.sql.functions that returns [Universally Unique ID|https://en.wikipedia.org/wiki/Universally_unique_identifier]. Sometimes it is necessary to uniquely identify each row in a DataFrame. Currently the following ways are available: * monotonically_increasing_id() function * row_number() function over some window * convert the DataFrame to RDD and zipWithIndex() All these approaches do not work when appending this DataFrame to another DataFrame (union). 
Collisions may occur - two rows in different DataFrames may have the same ID. Re-generating IDs on the resulting DataFrame is not an option, because some data in some other system may already refer to old IDs. The proposed solution is to add new function: def uuid(): String that returns String representation of UUID. UUID is represented as a 128-bit number (two long numbers). Such numbers are not supported in Scala or Java. In addition, some storage systems do not support 128-bit numbers (Parquet's largest numeric type is INT96). This is the reason for the uuid() function to return String. I already have a simple implementation based on [java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I can share it as a PR. > SQL function uuid() > --- > > Key: SPARK-23693 > URL: https://issues.apache.org/jira/browse/SPARK-23693 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Arseniy Tashoyan >Priority: Minor > > Add function uuid() to org.apache.spark.sql.functions that returns > [Universally Unique > ID|https://en.wikipedia.org/wiki/Universally_unique_identifier]. > Sometimes it is necessary to uniquely identify each row in a DataFrame. > Currently the following ways are available: > * monotonically_increasing_id() function > * row_number() function over some window > * convert the DataFrame to RDD and zipWithIndex() > All these approaches do not work when appending this DataFrame to another > DataFrame (union). Collisions may occur - two rows in different DataFrames > may have the same ID. Re-generating IDs on the resulting DataFrame is not an > option, because some data in some other system may already refer to old IDs. > The proposed solution is to add new function: > {code:java} > def uuid(): String{code} > that returns String representation of UUID. > UUID is represented as a 128-bit number (two long numbers). Such numbers are > not supported in Scala or Java. 
In addition, some storage systems do not > support 128-bit numbers (Parquet's largest numeric type is INT96). This is > the reason for the uuid() function to return String. > I already have a simple implementation based on > [java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I > can share it as a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23693) SQL function uuid()
Arseniy Tashoyan created SPARK-23693: Summary: SQL function uuid() Key: SPARK-23693 URL: https://issues.apache.org/jira/browse/SPARK-23693 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0, 2.2.1 Reporter: Arseniy Tashoyan Add function uuid() to org.apache.spark.sql.functions that returns a [Universally Unique ID|https://en.wikipedia.org/wiki/Universally_unique_identifier]. Sometimes it is necessary to uniquely identify each row in a DataFrame. Currently the following ways are available: * monotonically_increasing_id() function * row_number() function over some window * convert the DataFrame to RDD and zipWithIndex() All these approaches do not work when appending this DataFrame to another DataFrame (union). Collisions may occur: two rows in different DataFrames may have the same ID. Re-generating IDs on the resulting DataFrame is not an option, because some data in some other system may already refer to the old IDs. The proposed solution is to add a new function: def uuid(): String that returns a String representation of a UUID. A UUID is represented as a 128-bit number (two long numbers). Such numbers are not supported in Scala or Java. In addition, some storage systems do not support 128-bit numbers (Parquet's largest numeric type is INT96). This is why the uuid() function returns String. I already have a simple implementation based on [java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I can share it as a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
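The collision argument above can be demonstrated outside Spark with plain Python (a sketch of the semantics only, not the proposed Catalyst implementation): per-DataFrame counters restart from zero, while random UUIDs stay unique across a union.

```python
import uuid

def tag_monotonic(rows):
    # Mimics monotonically_increasing_id(): IDs are assigned independently
    # per DataFrame, starting from 0.
    return [(i, r) for i, r in enumerate(rows)]

def tag_uuid(rows):
    # Mimics the proposed uuid(): a fresh random UUID string per row.
    return [(str(uuid.uuid4()), r) for r in rows]

df1, df2 = tag_monotonic(["a", "b"]), tag_monotonic(["c", "d"])
union = df1 + df2
# IDs collide: both inputs were assigned 0 and 1 independently.
collides = len({i for i, _ in union}) < len(union)

u1, u2 = tag_uuid(["a", "b"]), tag_uuid(["c", "d"])
u_union = u1 + u2
# UUIDs remain distinct across the union (with overwhelming probability).
unique = len({i for i, _ in u_union}) == len(u_union)
```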
[jira] [Commented] (SPARK-23683) FileCommitProtocol.instantiate to require 3-arg constructor for dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-23683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16400177#comment-16400177 ] Apache Spark commented on SPARK-23683: -- User 'steveloughran' has created a pull request for this issue: https://github.com/apache/spark/pull/20824 > FileCommitProtocol.instantiate to require 3-arg constructor for dynamic > partition overwrite > --- > > Key: SPARK-23683 > URL: https://issues.apache.org/jira/browse/SPARK-23683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Steve Loughran >Priority: Major > > with SPARK-20236 {{FileCommitProtocol.instantiate()}} looks for a three > argument constructor, passing in the {{dynamicPartitionOverwrite}} parameter. > If there is no such constructor, it falls back to the classic two-arg one. > When {{InsertIntoHadoopFsRelationCommand}} passes down that > {{dynamicPartitionOverwrite}} flag to {{FileCommitProtocol.instantiate()}}, > it _assumes_ that the instantiated protocol supports the specific > requirements of dynamic partition overwrite. It does not notice when this > does not hold, and so the output generated may be incorrect. > Proposed: when dynamicPartitionOverwrite == true, require the protocol > implementation to have a 3-arg constructor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23683) FileCommitProtocol.instantiate to require 3-arg constructor for dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-23683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23683: Assignee: Apache Spark > FileCommitProtocol.instantiate to require 3-arg constructor for dynamic > partition overwrite > --- > > Key: SPARK-23683 > URL: https://issues.apache.org/jira/browse/SPARK-23683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Steve Loughran >Assignee: Apache Spark >Priority: Major > > with SPARK-20236 {{FileCommitProtocol.instantiate()}} looks for a three > argument constructor, passing in the {{dynamicPartitionOverwrite}} parameter. > If there is no such constructor, it falls back to the classic two-arg one. > When {{InsertIntoHadoopFsRelationCommand}} passes down that > {{dynamicPartitionOverwrite}} flag to {{FileCommitProtocol.instantiate()}}, > it _assumes_ that the instantiated protocol supports the specific > requirements of dynamic partition overwrite. It does not notice when this > does not hold, and so the output generated may be incorrect. > Proposed: when dynamicPartitionOverwrite == true, require the protocol > implementation to have a 3-arg constructor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23683) FileCommitProtocol.instantiate to require 3-arg constructor for dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-23683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23683: Assignee: (was: Apache Spark) > FileCommitProtocol.instantiate to require 3-arg constructor for dynamic > partition overwrite > --- > > Key: SPARK-23683 > URL: https://issues.apache.org/jira/browse/SPARK-23683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Steve Loughran >Priority: Major > > with SPARK-20236 {{FileCommitProtocol.instantiate()}} looks for a three > argument constructor, passing in the {{dynamicPartitionOverwrite}} parameter. > If there is no such constructor, it falls back to the classic two-arg one. > When {{InsertIntoHadoopFsRelationCommand}} passes down that > {{dynamicPartitionOverwrite}} flag to {{FileCommitProtocol.instantiate()}}, > it _assumes_ that the instantiated protocol supports the specific > requirements of dynamic partition overwrite. It does not notice when this > does not hold, and so the output generated may be incorrect. > Proposed: when dynamicPartitionOverwrite == true, require the protocol > implementation to have a 3-arg constructor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
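The proposed fail-fast check can be sketched with reflection. This is a Python stand-in for the Scala constructor lookup in {{FileCommitProtocol.instantiate()}}; the class names and parameter counts are illustrative:

```python
import inspect

class TwoArgProtocol:
    # Classic protocol: two-argument constructor, no dynamic-overwrite support.
    def __init__(self, job_id, path):
        self.dynamic_partition_overwrite = False

class ThreeArgProtocol:
    # Protocol that accepts the dynamicPartitionOverwrite flag.
    def __init__(self, job_id, path, dynamic_partition_overwrite):
        self.dynamic_partition_overwrite = dynamic_partition_overwrite

def instantiate(cls, job_id, path, dynamic_partition_overwrite=False):
    # Look for the three-argument constructor (4 parameters incl. self).
    has_three_arg = len(inspect.signature(cls.__init__).parameters) == 4
    if has_three_arg:
        return cls(job_id, path, dynamic_partition_overwrite)
    if dynamic_partition_overwrite:
        # Proposed fix: instead of silently falling back to the two-argument
        # constructor and dropping the flag, fail fast.
        raise TypeError(cls.__name__ +
                        " does not support dynamic partition overwrite")
    return cls(job_id, path)
```

The key design point is that the fallback remains available when the flag is false, so existing two-argument committers keep working.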
[jira] [Commented] (SPARK-23692) Print metadata of files when infer schema failed
[ https://issues.apache.org/jira/browse/SPARK-23692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16400106#comment-16400106 ] Apache Spark commented on SPARK-23692: -- User 'caneGuy' has created a pull request for this issue: https://github.com/apache/spark/pull/20833 > Print metadata of files when infer schema failed > > > Key: SPARK-23692 > URL: https://issues.apache.org/jira/browse/SPARK-23692 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: zhoukang >Priority: Minor > > A trivial modify. > Currently, when we had no input files to infer schema,we will throw below > exception. > For some users it may be misleading.If we can print files' metadata it will > be more clearer. > {code:java} > Caused by: org.apache.spark.sql.AnalysisException: Unable to infer schema for > Parquet. It must be specified manually.; > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:188) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387) > at > org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152) > at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:441) > at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:425) > at > com.xiaomi.matrix.pipeline.jobspark.importer.MatrixAdEventDailyImportJob.(MatrixAdEventDailyImportJob.scala:18) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23692) Print metadata of files when infer schema failed
[ https://issues.apache.org/jira/browse/SPARK-23692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23692: Assignee: (was: Apache Spark) > Print metadata of files when infer schema failed > > > Key: SPARK-23692 > URL: https://issues.apache.org/jira/browse/SPARK-23692 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: zhoukang >Priority: Minor > > A trivial modify. > Currently, when we had no input files to infer schema,we will throw below > exception. > For some users it may be misleading.If we can print files' metadata it will > be more clearer. > {code:java} > Caused by: org.apache.spark.sql.AnalysisException: Unable to infer schema for > Parquet. It must be specified manually.; > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:188) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387) > at > org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152) > at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:441) > at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:425) > at > com.xiaomi.matrix.pipeline.jobspark.importer.MatrixAdEventDailyImportJob.(MatrixAdEventDailyImportJob.scala:18) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23692) Print metadata of files when infer schema failed
zhoukang created SPARK-23692: Summary: Print metadata of files when infer schema failed Key: SPARK-23692 URL: https://issues.apache.org/jira/browse/SPARK-23692 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: zhoukang A trivial modification. Currently, when there are no input files to infer a schema from, we throw the exception below. For some users it may be misleading. If we print the files' metadata, the error will be clearer. {code:java} Caused by: org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.; at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:188) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152) at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:441) at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:425) at com.xiaomi.matrix.pipeline.jobspark.importer.MatrixAdEventDailyImportJob.(MatrixAdEventDailyImportJob.scala:18) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23692) Print metadata of files when infer schema failed
[ https://issues.apache.org/jira/browse/SPARK-23692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23692: Assignee: Apache Spark > Print metadata of files when infer schema failed > > > Key: SPARK-23692 > URL: https://issues.apache.org/jira/browse/SPARK-23692 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: zhoukang >Assignee: Apache Spark >Priority: Minor > > A trivial modify. > Currently, when we had no input files to infer schema,we will throw below > exception. > For some users it may be misleading.If we can print files' metadata it will > be more clearer. > {code:java} > Caused by: org.apache.spark.sql.AnalysisException: Unable to infer schema for > Parquet. It must be specified manually.; > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:189) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:188) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387) > at > org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152) > at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:441) > at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:425) > at > com.xiaomi.matrix.pipeline.jobspark.importer.MatrixAdEventDailyImportJob.(MatrixAdEventDailyImportJob.scala:18) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
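One possible shape of the improved error message (illustrative only; the actual change is in the linked pull request): list the files that inference considered, so an empty or mistyped input path is immediately visible.

```python
def raise_infer_failure(file_format, files):
    # files: list of (path, size_in_bytes) pairs that schema inference saw.
    listing = "\n".join("  %s (%d bytes)" % (p, s) for p, s in files)
    raise ValueError(
        "Unable to infer schema for %s. It must be specified manually.\n"
        "Files considered:\n%s" % (file_format, listing or "  <none>"))
```

With an empty file list the message shows "<none>", making the "no input files" case obvious rather than misleading.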
[jira] [Commented] (SPARK-20536) Extend ColumnName to create StructFields with explicit nullable
[ https://issues.apache.org/jira/browse/SPARK-20536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16400021#comment-16400021 ] Apache Spark commented on SPARK-20536: -- User 'efimpoberezkin' has created a pull request for this issue: https://github.com/apache/spark/pull/20832 > Extend ColumnName to create StructFields with explicit nullable > --- > > Key: SPARK-20536 > URL: https://issues.apache.org/jira/browse/SPARK-20536 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Trivial > > {{ColumnName}} defines methods to create {{StructFields}}. > It'd be very user-friendly if there were methods to create {{StructFields}} > with explicit {{nullable}} property (currently implicitly {{true}}). > That could look as follows: > {code} > // E.g. def int: StructField = StructField(name, IntegerType) > def int(nullable: Boolean): StructField = StructField(name, IntegerType, > nullable) > // or (untested) > def int(nullable: Boolean): StructField = int.copy(nullable = nullable) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20536) Extend ColumnName to create StructFields with explicit nullable
[ https://issues.apache.org/jira/browse/SPARK-20536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20536: Assignee: (was: Apache Spark) > Extend ColumnName to create StructFields with explicit nullable > --- > > Key: SPARK-20536 > URL: https://issues.apache.org/jira/browse/SPARK-20536 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Trivial > > {{ColumnName}} defines methods to create {{StructFields}}. > It'd be very user-friendly if there were methods to create {{StructFields}} > with explicit {{nullable}} property (currently implicitly {{true}}). > That could look as follows: > {code} > // E.g. def int: StructField = StructField(name, IntegerType) > def int(nullable: Boolean): StructField = StructField(name, IntegerType, > nullable) > // or (untested) > def int(nullable: Boolean): StructField = int.copy(nullable = nullable) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20536) Extend ColumnName to create StructFields with explicit nullable
[ https://issues.apache.org/jira/browse/SPARK-20536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20536: Assignee: Apache Spark > Extend ColumnName to create StructFields with explicit nullable > --- > > Key: SPARK-20536 > URL: https://issues.apache.org/jira/browse/SPARK-20536 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Assignee: Apache Spark >Priority: Trivial > > {{ColumnName}} defines methods to create {{StructFields}}. > It'd be very user-friendly if there were methods to create {{StructFields}} > with explicit {{nullable}} property (currently implicitly {{true}}). > That could look as follows: > {code} > // E.g. def int: StructField = StructField(name, IntegerType) > def int(nullable: Boolean): StructField = StructField(name, IntegerType, > nullable) > // or (untested) > def int(nullable: Boolean): StructField = int.copy(nullable = nullable) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
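The `copy`-based variant suggested above maps naturally onto an immutable-record idiom. A Python dataclass analog of the proposed API shape (hypothetical; the real proposal is the Scala code quoted above):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class StructField:
    name: str
    data_type: str
    nullable: bool = True  # currently implied; the proposal makes it explicit

class ColumnName:
    def __init__(self, name):
        self.name = name

    def int(self, nullable=None):
        # Mirrors: def int: StructField = StructField(name, IntegerType)
        field = StructField(self.name, "integer")
        if nullable is None:
            return field
        # The "copy" overload: same field with explicit nullability.
        return replace(field, nullable=nullable)
```

The copy form avoids repeating the type mapping in every overload, which is the appeal of the `int.copy(nullable = nullable)` variant in the Scala sketch.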
[jira] [Assigned] (SPARK-23533) Add support for changing ContinuousDataReader's startOffset
[ https://issues.apache.org/jira/browse/SPARK-23533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-23533: Assignee: Li Yuanjian > Add support for changing ContinuousDataReader's startOffset > --- > > Key: SPARK-23533 > URL: https://issues.apache.org/jira/browse/SPARK-23533 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Li Yuanjian >Assignee: Li Yuanjian >Priority: Major > Fix For: 2.4.0 > > > As discussed in [https://github.com/apache/spark/pull/20675], we need to add a > new interface `ContinuousDataReaderFactory` to support setting the start offset in Continuous Processing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23533) Add support for changing ContinuousDataReader's startOffset
[ https://issues.apache.org/jira/browse/SPARK-23533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-23533. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20689 [https://github.com/apache/spark/pull/20689] > Add support for changing ContinuousDataReader's startOffset > --- > > Key: SPARK-23533 > URL: https://issues.apache.org/jira/browse/SPARK-23533 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Li Yuanjian >Assignee: Li Yuanjian >Priority: Major > Fix For: 2.4.0 > > > As discussed in [https://github.com/apache/spark/pull/20675], we need to add a > new interface `ContinuousDataReaderFactory` to support setting the start offset in Continuous Processing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23614) Union produces incorrect results when caching is used
[ https://issues.apache.org/jira/browse/SPARK-23614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-23614: Component/s: (was: Spark Core) SQL > Union produces incorrect results when caching is used > - > > Key: SPARK-23614 > URL: https://issues.apache.org/jira/browse/SPARK-23614 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Morten Hornbech >Priority: Major > > We just upgraded from 2.2 to 2.3 and our test suite caught this error: > {code:java} > case class TestData(x: Int, y: Int, z: Int) > val frame = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, > 6))).cache() > val group1 = frame.groupBy("x").agg(min(col("y")) as "value") > val group2 = frame.groupBy("x").agg(min(col("z")) as "value") > group1.union(group2).show() > // +---+-----+ > // | x|value| > // +---+-----+ > // | 1| 2| > // | 4| 5| > // | 1| 2| > // | 4| 5| > // +---+-----+ > group2.union(group1).show() > // +---+-----+ > // | x|value| > // +---+-----+ > // | 1| 3| > // | 4| 6| > // | 1| 3| > // | 4| 6| > // +---+-----+ > {code} > The error disappears if the first data frame is not cached or if the two > group-bys use separate copies. I'm not sure exactly what happens inside Spark, but errors that produce incorrect results rather than exceptions always concern me. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23614) Union produces incorrect results when caching is used
[ https://issues.apache.org/jira/browse/SPARK-23614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23614: Assignee: (was: Apache Spark) > Union produces incorrect results when caching is used > - > > Key: SPARK-23614 > URL: https://issues.apache.org/jira/browse/SPARK-23614 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Morten Hornbech >Priority: Major > > We just upgraded from 2.2 to 2.3 and our test suite caught this error: > {code:java} > case class TestData(x: Int, y: Int, z: Int) > val frame = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, > 6))).cache() > val group1 = frame.groupBy("x").agg(min(col("y")) as "value") > val group2 = frame.groupBy("x").agg(min(col("z")) as "value") > group1.union(group2).show() > // +---+-----+ > // | x|value| > // +---+-----+ > // | 1| 2| > // | 4| 5| > // | 1| 2| > // | 4| 5| > // +---+-----+ > group2.union(group1).show() > // +---+-----+ > // | x|value| > // +---+-----+ > // | 1| 3| > // | 4| 6| > // | 1| 3| > // | 4| 6| > // +---+-----+ > {code} > The error disappears if the first data frame is not cached or if the two > group-bys use separate copies. I'm not sure exactly what happens inside Spark, but errors that produce incorrect results rather than exceptions always concern me. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23614) Union produces incorrect results when caching is used
[ https://issues.apache.org/jira/browse/SPARK-23614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23614: Assignee: Apache Spark > Union produces incorrect results when caching is used > - > > Key: SPARK-23614 > URL: https://issues.apache.org/jira/browse/SPARK-23614 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Morten Hornbech >Assignee: Apache Spark >Priority: Major > > We just upgraded from 2.2 to 2.3 and our test suite caught this error: > {code:java} > case class TestData(x: Int, y: Int, z: Int) > val frame = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, > 6))).cache() > val group1 = frame.groupBy("x").agg(min(col("y")) as "value") > val group2 = frame.groupBy("x").agg(min(col("z")) as "value") > group1.union(group2).show() > // +---+-----+ > // | x|value| > // +---+-----+ > // | 1| 2| > // | 4| 5| > // | 1| 2| > // | 4| 5| > // +---+-----+ > group2.union(group1).show() > // +---+-----+ > // | x|value| > // +---+-----+ > // | 1| 3| > // | 4| 6| > // | 1| 3| > // | 4| 6| > // +---+-----+ > {code} > The error disappears if the first data frame is not cached or if the two > group-bys use separate copies. I'm not sure exactly what happens inside Spark, but errors that produce incorrect results rather than exceptions always concern me. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23614) Union produces incorrect results when caching is used
[ https://issues.apache.org/jira/browse/SPARK-23614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16399985#comment-16399985 ] Apache Spark commented on SPARK-23614: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/20831 > Union produces incorrect results when caching is used > - > > Key: SPARK-23614 > URL: https://issues.apache.org/jira/browse/SPARK-23614 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Morten Hornbech >Priority: Major > > We just upgraded from 2.2 to 2.3 and our test suite caught this error: > {code:java} > case class TestData(x: Int, y: Int, z: Int) > val frame = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, > 6))).cache() > val group1 = frame.groupBy("x").agg(min(col("y")) as "value") > val group2 = frame.groupBy("x").agg(min(col("z")) as "value") > group1.union(group2).show() > // +---+-----+ > // | x|value| > // +---+-----+ > // | 1| 2| > // | 4| 5| > // | 1| 2| > // | 4| 5| > // +---+-----+ > group2.union(group1).show() > // +---+-----+ > // | x|value| > // +---+-----+ > // | 1| 3| > // | 4| 6| > // | 1| 3| > // | 4| 6| > // +---+-----+ > {code} > The error disappears if the first data frame is not cached or if the two > group-bys use separate copies. I'm not sure exactly what happens inside Spark, but errors that produce incorrect results rather than exceptions always concern me. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
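Until a fix lands, the workaround the reporter mentions (separate copies per aggregation, no shared cached Dataset) might look like the sketch below; this is an illustrative stop-gap, not a verified fix, and the object/app names are invented:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, min}

case class TestData(x: Int, y: Int, z: Int)

object UnionCacheWorkaround {
  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-23614-workaround")
      .getOrCreate()
    import session.implicits._

    val data = Seq(TestData(1, 2, 3), TestData(4, 5, 6))

    // Build each aggregation from its own uncached Dataset so the planner
    // cannot substitute one cached plan for the other.
    val group1 = session.createDataset(data).groupBy("x").agg(min(col("y")) as "value")
    val group2 = session.createDataset(data).groupBy("x").agg(min(col("z")) as "value")

    group1.union(group2).show()
    session.stop()
  }
}
{code}

The trade-off is that the input is scanned once per aggregation instead of being read from the cache.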
[jira] [Commented] (SPARK-23677) Selecting columns from joined DataFrames with the same origin yields wrong results
[ https://issues.apache.org/jira/browse/SPARK-23677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16399969#comment-16399969 ] Takeshi Yamamuro commented on SPARK-23677: -- You mean this ticket? SPARK-14948. I think this is a well-known issue. > Selecting columns from joined DataFrames with the same origin yields wrong > results > -- > > Key: SPARK-23677 > URL: https://issues.apache.org/jira/browse/SPARK-23677 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Martin Mauch >Priority: Major > > When trying to join two DataFrames with the same origin DataFrame and later > selecting columns from the join, Spark can't distinguish between the columns > and gives a wrong (or at least very surprising) result. One can work around > this using expr. > Here is a minimal example: > > {code:java} > import spark.implicits._ > import org.apache.spark.sql.functions.expr > val edf = Seq((1), (2), (3), (4), (5)).toDF("num") > val big = edf.where(edf("num") > 2).alias("big") > val small = edf.where(edf("num") < 4).alias("small") > small.join(big, expr("big.num == (small.num + 1)")).select(small("num"), > big("num")).show() > // +---+---+ > // |num|num| > // +---+---+ > // | 2| 2| > // | 3| 3| > // +---+---+ > small.join(big, expr("big.num == (small.num + 1)")).select(expr("small.num"), > expr("big.num")).show() > // +---+---+ > // |num|num| > // +---+---+ > // | 2| 3| > // | 3| 4| > // +---+---+ > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org