[jira] [Created] (SPARK-25187) Revisit the life cycle of ReadSupport instances.
Wenchen Fan created SPARK-25187: --- Summary: Revisit the life cycle of ReadSupport instances. Key: SPARK-25187 URL: https://issues.apache.org/jira/browse/SPARK-25187 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0 Reporter: Wenchen Fan Currently the life cycle is bound to the batch/stream query. This fits streaming very well but may not be perfect for batch sources. We can also consider letting {{ReadSupport.newScanConfigBuilder}} take {{DataSourceOptions}} as a parameter, if we decide to change the life cycle. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
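The proposed signature change can be sketched with stand-in traits; the definitions below are illustrations only, not Spark's actual DSv2 interfaces, and {{ToyReadSupport}} is a hypothetical implementation:

```scala
// Hypothetical sketch of the proposal: if newScanConfigBuilder receives the
// options on every call, a ReadSupport instance no longer has to capture them
// at creation time, loosening its life cycle from the query's. These traits
// are stand-ins, not Spark's real DSv2 classes.
trait DataSourceOptions { def get(key: String): Option[String] }
trait ScanConfigBuilder { def build(): Map[String, String] }

trait ReadSupport {
  // Proposed: options are passed per scan instead of per ReadSupport instance.
  def newScanConfigBuilder(options: DataSourceOptions): ScanConfigBuilder
}

// A toy implementation showing the per-scan flow.
object ToyReadSupport extends ReadSupport {
  def newScanConfigBuilder(options: DataSourceOptions): ScanConfigBuilder =
    new ScanConfigBuilder {
      def build(): Map[String, String] =
        options.get("path").map(p => Map("path" -> p)).getOrElse(Map.empty)
    }
}
```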
[jira] [Created] (SPARK-25186) Stabilize Data Source V2 API
Wenchen Fan created SPARK-25186: --- Summary: Stabilize Data Source V2 API Key: SPARK-25186 URL: https://issues.apache.org/jira/browse/SPARK-25186 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan
[jira] [Commented] (SPARK-25132) Case-insensitive field resolution when reading from Parquet/ORC
[ https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588414#comment-16588414 ] Apache Spark commented on SPARK-25132: -- User 'seancxmao' has created a pull request for this issue: https://github.com/apache/spark/pull/22183 > Case-insensitive field resolution when reading from Parquet/ORC > --- > > Key: SPARK-25132 > URL: https://issues.apache.org/jira/browse/SPARK-25132 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.3.1 > Reporter: Chenxiao Mao > Assignee: Chenxiao Mao > Priority: Major > Fix For: 2.4.0 > > > Spark SQL returns NULL for a column whose Hive metastore schema and Parquet > schema are in different letter cases, regardless of whether spark.sql.caseSensitive > is set to true or false. > Here is a simple example to reproduce this issue: > scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1") > spark-sql> show create table t1; > CREATE TABLE `t1` (`id` BIGINT) > USING parquet > OPTIONS ( > `serialization.format` '1' > ) > spark-sql> CREATE TABLE `t2` (`ID` BIGINT) > > USING parquet > > LOCATION 'hdfs://localhost/user/hive/warehouse/t1'; > spark-sql> select * from t1; > 0 > 1 > 2 > 3 > 4 > spark-sql> select * from t2; > NULL > NULL > NULL > NULL > NULL
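The expected semantics can be modeled with a small resolver: given the file schema's field names and a requested column, match case-insensitively unless spark.sql.caseSensitive is true. This is an illustrative sketch of the desired behavior, not Spark's actual resolution code:

```scala
// Illustrative model of field resolution between a metastore schema and a
// Parquet file schema; not Spark's actual resolver. Returns the matching
// physical (file) field name, if any.
def resolveField(physicalFields: Seq[String],
                 requested: String,
                 caseSensitive: Boolean): Option[String] =
  if (caseSensitive) physicalFields.find(_ == requested)
  else physicalFields.find(_.equalsIgnoreCase(requested))
```

With the bug, the lookup effectively behaves as if caseSensitive were always true, so `t2`'s column `ID` never matches the file's `id` and NULLs are returned.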
[jira] [Resolved] (SPARK-25155) Streaming from storage doesn't work when no directories exist
[ https://issues.apache.org/jira/browse/SPARK-25155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gil Vernik resolved SPARK-25155. Resolution: Cannot Reproduce > Streaming from storage doesn't work when no directories exist > -- > > Key: SPARK-25155 > URL: https://issues.apache.org/jira/browse/SPARK-25155 > Project: Spark > Issue Type: Bug > Components: DStreams, Structured Streaming > Affects Versions: 2.3.1 > Reporter: Gil Vernik > Priority: Minor > > I have an issue related to the `org.apache.spark.streaming.dstream.FileInputDStream` > method `findNewFiles`. > Streaming for the given path is supposed to pick up new files only (based on the > previous run's timestamp). However, the code in Spark will first obtain > directories, and then find new files in each directory. Here is the relevant code: > {code} > val directoryFilter = new PathFilter { > override def accept(path: Path): Boolean = fs.getFileStatus(path).isDirectory > } > val directories = fs.globStatus(directoryPath, directoryFilter).map(_.getPath) > val newFiles = directories.flatMap(dir => fs.listStatus(dir, newFileFilter).map(_.getPath.toString)) > {code} > This is not optimal, as it always requires two accesses. In addition, it seems to be buggy. > I have an S3 bucket "mydata" with objects "a.csv" and "b.csv". I noticed that > fs.globStatus("s3a://mydata/", directoryFilter).map(_.getPath) returned 0 > directories, so "a.csv" and "b.csv" were not picked up by Spark. > I tried the path "s3a://mydata/*" and it didn't work either. > I experienced the same problematic behavior with the local file system when trying > to stream from "/Users/streaming/*". > I suggest changing the code in Spark so it first lists without > directoryFilter, which seems not needed at all. The code could be > {code} > val directoriesOrFiles = fs.globStatus(directoryPath).map(_.getPath) > {code} > The flow would be, for each entry in directoriesOrFiles: > * If a data object: Spark applies newFileFilter to the returned objects > * If a directory: the existing code performs an additional listing at > the directory level > This way it will pick up files both from the root of the path and from the > contents of directories
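The proposed single-listing flow can be sketched as below, using `java.nio.file` on the local filesystem so the example is self-contained; the real code uses Hadoop's `FileSystem` API, and `findNewFiles`/`modTimeThreshold` here are reimplemented stand-ins, not the actual FileInputDStream members:

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Sketch of the proposed flow: list once without a directory-only filter,
// then classify each entry. Files at the root of the path are picked up
// directly; directories get one additional listing, as in the existing code.
def findNewFiles(root: Path, modTimeThreshold: Long): Seq[String] = {
  val entries = Files.list(root).iterator().asScala.toSeq
  entries.flatMap { entry =>
    if (Files.isDirectory(entry)) {
      // Existing behavior: list inside each directory and apply the filter.
      Files.list(entry).iterator().asScala
        .filter(p => Files.getLastModifiedTime(p).toMillis >= modTimeThreshold)
        .map(_.toString).toSeq
    } else if (Files.getLastModifiedTime(entry).toMillis >= modTimeThreshold) {
      // New behavior: data objects at the root are no longer skipped.
      Seq(entry.toString)
    } else Seq.empty
  }
}
```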
[jira] [Commented] (SPARK-25155) Streaming from storage doesn't work when no directories exist
[ https://issues.apache.org/jira/browse/SPARK-25155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588412#comment-16588412 ] Gil Vernik commented on SPARK-25155: [~ste...@apache.org] thanks for the input. This one [https://github.com/apache/spark/pull/17745] is perfect! But you also created [https://github.com/apache/spark/pull/14731]; what is the difference between the two? [https://github.com/apache/spark/pull/17745] is so small and simple, and yet provides great benefit. [~dongjoon] Somehow I can't reproduce the issue I saw with the master branch. I need to run more experiments. Closing this issue for now.
[jira] [Updated] (SPARK-25159) json schema inference should only trigger one job
[ https://issues.apache.org/jira/browse/SPARK-25159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25159: Fix Version/s: 2.4.0 > json schema inference should only trigger one job > - > > Key: SPARK-25159 > URL: https://issues.apache.org/jira/browse/SPARK-25159 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.4.0 > Reporter: Wenchen Fan > Assignee: Wenchen Fan > Priority: Major > Fix For: 2.4.0
[jira] [Resolved] (SPARK-25159) json schema inference should only trigger one job
[ https://issues.apache.org/jira/browse/SPARK-25159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25159. - Resolution: Fixed Target Version/s: 2.4.0
[jira] [Resolved] (SPARK-25140) Add optional logging to UnsafeProjection.create when it falls back to interpreted mode
[ https://issues.apache.org/jira/browse/SPARK-25140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25140. - Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.4.0 > Add optional logging to UnsafeProjection.create when it falls back to > interpreted mode > -- > > Key: SPARK-25140 > URL: https://issues.apache.org/jira/browse/SPARK-25140 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.4.0 > Reporter: Kris Mok > Assignee: Takeshi Yamamuro > Priority: Minor > Fix For: 2.4.0 > > > SPARK-23711 implemented graceful handling that allows UnsafeProjection > to fall back to an interpreted mode when codegen fails. That makes Spark much > more usable even when codegen is unable to handle the given query. > But in its current form, the fallback handling can also be a mystery in terms > of performance cliffs. Users may be left wondering why a query runs fine with > some expressions, but with just one extra expression it becomes 2x, 3x (or more) > slower. > It'd be nice to have optional logging of the fallback behavior, so that users > who care about monitoring performance cliffs can opt in to logging > when a fallback to interpreter mode is taken, i.e. at > https://github.com/apache/spark/blob/a40ffc656d62372da85e0fa932b67207839e7fde/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala#L183
[jira] [Created] (SPARK-25185) CBO rowcount statistics doesn't work for partitioned parquet external table
Amit created SPARK-25185: Summary: CBO rowcount statistics doesn't work for partitioned parquet external table Key: SPARK-25185 URL: https://issues.apache.org/jira/browse/SPARK-25185 Project: Spark Issue Type: Sub-task Components: Spark Core, SQL Affects Versions: 2.3.0, 2.2.1 Environment: Created dummy partitioned data with a string partition column (col1=a and col1=b): added csv data -> read through spark -> created a partitioned external table -> msck repair table to load the partitions. Ran analyze on all columns, including the partition column. {code} println(spark.sql("select * from test_p where e='1a'").queryExecution.toStringWithStats) val op = spark.sql("select * from test_p where e='1a'").queryExecution.optimizedPlan // e is the partition column val stat = op.stats(spark.sessionState.conf) print(stat.rowCount) {code} Created the table the same way with parquet; the rowcount comes up correctly for csv, but for parquet it shows as None. Reporter: Amit
[jira] [Commented] (SPARK-24763) Remove redundant key data from value in streaming aggregation
[ https://issues.apache.org/jira/browse/SPARK-24763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588389#comment-16588389 ] Jungtaek Lim commented on SPARK-24763: -- [~tdas] Got it. Thanks for the input. > Remove redundant key data from value in streaming aggregation > - > > Key: SPARK-24763 > URL: https://issues.apache.org/jira/browse/SPARK-24763 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming > Affects Versions: 2.4.0 > Reporter: Jungtaek Lim > Assignee: Jungtaek Lim > Priority: Major > Fix For: 2.4.0, 3.0.0 > > > The key/value of state in streaming aggregation is formatted as below: > * key: UnsafeRow containing the group-by fields > * value: UnsafeRow containing the key fields plus additional fields for aggregation > results > so the key data is stored in both the key and the value. > This avoids projecting the row to a value while storing, and joining key > and value to restore the original row, to boost performance; but in a > simple benchmark I found it not much more helpful than "project and > join". (Will paste the test result in a comment.) > So I would propose a new option: remove redundant key data in stateful aggregation. > I'm avoiding modifying the default behavior of stateful aggregation, because > state values would not be compatible between the current behavior and the option enabled.
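The "project and join" alternative described above can be modeled with plain sequences in place of UnsafeRow; a conceptual sketch of the layout, not Spark's state store code:

```scala
// Conceptual model of the proposed state layout, using Seq[Any] in place of
// UnsafeRow. The stored value keeps only the aggregation fields; the original
// row is restored by concatenating key and value on read.

// Project the leading key fields out of the row before storing the value.
def projectValue(row: Seq[Any], numKeyFields: Int): Seq[Any] =
  row.drop(numKeyFields)

// Join key and value to restore the original row when reading state back.
def restoreRow(key: Seq[Any], value: Seq[Any]): Seq[Any] = key ++ value
```

The trade-off discussed in the ticket is exactly this pair of extra operations per store/read against the space saved by not duplicating the key fields in the value.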
[jira] [Assigned] (SPARK-25184) Flaky test: FlatMapGroupsWithState "streaming with processing time timeout"
[ https://issues.apache.org/jira/browse/SPARK-25184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25184: Assignee: (was: Apache Spark) > Flaky test: FlatMapGroupsWithState "streaming with processing time timeout" > --- > > Key: SPARK-25184 > URL: https://issues.apache.org/jira/browse/SPARK-25184 > Project: Spark > Issue Type: Test > Components: Structured Streaming >Affects Versions: 2.3.2 >Reporter: Tathagata Das >Priority: Minor > > {code} > Assert on query failed: Check total state rows = List(1), updated state rows > = List(2): Array() did not equal List(1) incorrect total rows, recent > progresses: > { > "id" : "3598002e-0120-4937-8a36-226e0af992b6", > "runId" : "e7efe911-72fb-48aa-ba35-775057eabe55", > "name" : null, > "timestamp" : "1970-01-01T00:00:12.000Z", > "batchId" : 3, > "numInputRows" : 0, > "durationMs" : { > "getEndOffset" : 0, > "setOffsetRange" : 0, > "triggerExecution" : 0 > }, > "stateOperators" : [ ], > "sources" : [ { > "description" : "MemoryStream[value#474622]", > "startOffset" : 2, > "endOffset" : 2, > "numInputRows" : 0 > } ], > "sink" : { > "description" : "MemorySink" > } > } > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > > org.apache.spark.sql.streaming.StateStoreMetricsTest$$anonfun$assertNumStateRows$1.apply(StateStoreMetricsTest.scala:55) > > org.apache.spark.sql.streaming.StateStoreMetricsTest$$anonfun$assertNumStateRows$1.apply(StateStoreMetricsTest.scala:33) > > org.apache.spark.sql.streaming.StreamTest$$anonfun$executeAction$1$11.apply$mcZ$sp(StreamTest.scala:657) > > org.apache.spark.sql.streaming.StreamTest$class.verify$1(StreamTest.scala:428) > > org.apache.spark.sql.streaming.StreamTest$class.executeAction$1(StreamTest.scala:657) > > 
org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:775) > > org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:762) > == Progress == > > StartStream(ProcessingTime(1000),org.apache.spark.sql.streaming.util.StreamManualClock@5a12704,Map(),null) >AddData to MemoryStream[value#474622]: a >AdvanceManualClock(1000) >CheckNewAnswer: [a,1] >AssertOnQuery(, Check total state rows = List(1), updated state > rows = List(1)) >AddData to MemoryStream[value#474622]: b >AdvanceManualClock(1000) >CheckNewAnswer: [b,1] >AssertOnQuery(, Check total state rows = List(2), updated state > rows = List(1)) >AddData to MemoryStream[value#474622]: b >AdvanceManualClock(1) >CheckNewAnswer: [a,-1],[b,2] >AssertOnQuery(, Check total state rows = List(1), updated state > rows = List(2)) >StopStream > > StartStream(ProcessingTime(1000),org.apache.spark.sql.streaming.util.StreamManualClock@5a12704,Map(),null) >AddData to MemoryStream[value#474622]: c >AdvanceManualClock(11000) >CheckNewAnswer: [b,-1],[c,1] > => AssertOnQuery(, Check total state rows = List(1), updated state > rows = List(2)) >AdvanceManualClock(12000) >AssertOnQuery(, ) >AssertOnQuery(, name) >CheckNewAnswer: [c,-1] >AssertOnQuery(, Check total state rows = List(0), updated state > rows = List(0)) > == Stream == > Output Mode: Update > Stream state: {MemoryStream[value#474622]: 3} > Thread state: alive > Thread stack trace: java.lang.Object.wait(Native Method) > org.apache.spark.util.ManualClock.waitTillTime(ManualClock.scala:61) > org.apache.spark.sql.streaming.util.StreamManualClock.waitTillTime(StreamManualClock.scala:34) > org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:65) > org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:166) > 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:293) > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:203) > == Sink == > 0: [a,1] > 1: [b,1] > 2: [a,-1] [b,2] > 3: [b,-1] [c,1] > == Plan == > == Parsed Logical Plan == > SerializeFromObject [staticinvoke(class > org.apache.spark.unsafe.types.UTF8String, StringType, fromString, > assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) > AS _1#474630, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, > StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, > true]))._2,
[jira] [Assigned] (SPARK-25184) Flaky test: FlatMapGroupsWithState "streaming with processing time timeout"
[ https://issues.apache.org/jira/browse/SPARK-25184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25184: Assignee: Apache Spark
[jira] [Commented] (SPARK-25184) Flaky test: FlatMapGroupsWithState "streaming with processing time timeout"
[ https://issues.apache.org/jira/browse/SPARK-25184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588385#comment-16588385 ] Apache Spark commented on SPARK-25184: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/22182
[jira] [Created] (SPARK-25184) Flaky test: FlatMapGroupsWithState "streaming with processing time timeout"
Tathagata Das created SPARK-25184: - Summary: Flaky test: FlatMapGroupsWithState "streaming with processing time timeout" Key: SPARK-25184 URL: https://issues.apache.org/jira/browse/SPARK-25184 Project: Spark Issue Type: Test Components: Structured Streaming Affects Versions: 2.3.2 Reporter: Tathagata Das
[jira] [Assigned] (SPARK-25163) Flaky test: o.a.s.util.collection.ExternalAppendOnlyMapSuite.spilling with compression
[ https://issues.apache.org/jira/browse/SPARK-25163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25163: Assignee: Apache Spark > Flaky test: o.a.s.util.collection.ExternalAppendOnlyMapSuite.spilling with > compression > -- > > Key: SPARK-25163 > URL: https://issues.apache.org/jira/browse/SPARK-25163 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark >Priority: Major > > I saw it fail multiple times on Jenkins: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/4813/testReport/junit/org.apache.spark.util.collection/ExternalAppendOnlyMapSuite/spilling_with_compression/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25163) Flaky test: o.a.s.util.collection.ExternalAppendOnlyMapSuite.spilling with compression
[ https://issues.apache.org/jira/browse/SPARK-25163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25163: Assignee: (was: Apache Spark) > Flaky test: o.a.s.util.collection.ExternalAppendOnlyMapSuite.spilling with > compression > -- > > Key: SPARK-25163 > URL: https://issues.apache.org/jira/browse/SPARK-25163 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Priority: Major > > I saw it failed multiple times on Jenkins: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/4813/testReport/junit/org.apache.spark.util.collection/ExternalAppendOnlyMapSuite/spilling_with_compression/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25163) Flaky test: o.a.s.util.collection.ExternalAppendOnlyMapSuite.spilling with compression
[ https://issues.apache.org/jira/browse/SPARK-25163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588296#comment-16588296 ] Apache Spark commented on SPARK-25163: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/22181 > Flaky test: o.a.s.util.collection.ExternalAppendOnlyMapSuite.spilling with > compression > -- > > Key: SPARK-25163 > URL: https://issues.apache.org/jira/browse/SPARK-25163 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Priority: Major > > I saw it failed multiple times on Jenkins: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/4813/testReport/junit/org.apache.spark.util.collection/ExternalAppendOnlyMapSuite/spilling_with_compression/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25174) ApplicationMaster suspends when unregistering itself from RM with extremely large diagnostic message
[ https://issues.apache.org/jira/browse/SPARK-25174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25174: Assignee: Apache Spark > ApplicationMaster suspends when unregistering itself from RM with extremely > large diagnostic message > -- > > Key: SPARK-25174 > URL: https://issues.apache.org/jira/browse/SPARK-25174 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.1 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > > We recently ran into SPARK-18016, which has been fixed in v2.3.0. This JIRA is > not about the issue in SPARK-18016 itself but about the side effect it brings. When > SPARK-18016 occurs, the ApplicationMaster fails to unregister itself because the > exception contains an extremely large error message. > {code:java} > ERROR yarn.ApplicationMaster: User class threw exception: > java.lang.RuntimeException: Error while decoding: > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.janino.JaninoRuntimeException: Constant pool has grown > past JVM limit of 0x > /* 001 */ public java.lang.Object generate(Object[] references) { > > /* 395656 */ mutableRow.update(0, value); > /* 395657 */ } > /* 395658 */ > /* 395659 */ return mutableRow; > /* 395660 */ } > /* 395661 */ } > {code} > The above codegen text is included in the final message the AM sends to wave > goodbye to the RM, and it ends up crashing the RM's ZKRMStateStore because > YARN-6125 does not cover message truncation for unregisterApplicationMaster. > We have also created a JIRA on the YARN side: > https://issues.apache.org/jira/browse/YARN-8691 > Although SPARK-18016 is already fixed, other uncaught exceptions may still > cause this problem. We should limit the size of the error message sent to the > RM while unregistering the AM.
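The mitigation proposed above (capping the size of the diagnostics string before handing it to the RM) can be sketched as follows. This is an illustrative sketch only: the 64 KB limit, the class name, and the method name are assumptions for demonstration, not Spark or YARN APIs, and not the values chosen in any actual patch.

```java
// Hypothetical helper: truncate an AM diagnostic message so an oversized
// codegen dump cannot break unregistration. MAX_DIAG_CHARS is an assumed limit.
class DiagnosticsTruncator {
    static final int MAX_DIAG_CHARS = 64 * 1024;

    static String truncate(String diagnostics) {
        if (diagnostics == null || diagnostics.length() <= MAX_DIAG_CHARS) {
            return diagnostics;  // small messages pass through unchanged
        }
        int dropped = diagnostics.length() - MAX_DIAG_CHARS;
        // Keep the head of the message and note how much was cut.
        return diagnostics.substring(0, MAX_DIAG_CHARS)
            + "... (" + dropped + " characters truncated)";
    }
}
```

The key property is that the returned string is bounded regardless of input size, so the unregister call can never carry an arbitrarily large payload to the RM's state store.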
[jira] [Commented] (SPARK-25174) ApplicationMaster suspends when unregistering itself from RM with extremely large diagnostic message
[ https://issues.apache.org/jira/browse/SPARK-25174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588283#comment-16588283 ] Apache Spark commented on SPARK-25174: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/22180 > ApplicationMaster suspends when unregistering itself from RM with extremely > large diagnostic message > -- > > Key: SPARK-25174 > URL: https://issues.apache.org/jira/browse/SPARK-25174 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.1 >Reporter: Kent Yao >Priority: Major > > We recently ran into SPARK-18016, which has been fixed in v2.3.0. This JIRA is > not about the issue in SPARK-18016 itself but about the side effect it brings. When > SPARK-18016 occurs, the ApplicationMaster fails to unregister itself because the > exception contains an extremely large error message. > {code:java} > ERROR yarn.ApplicationMaster: User class threw exception: > java.lang.RuntimeException: Error while decoding: > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.janino.JaninoRuntimeException: Constant pool has grown > past JVM limit of 0x > /* 001 */ public java.lang.Object generate(Object[] references) { > > /* 395656 */ mutableRow.update(0, value); > /* 395657 */ } > /* 395658 */ > /* 395659 */ return mutableRow; > /* 395660 */ } > /* 395661 */ } > {code} > The above codegen text is included in the final message the AM sends to wave > goodbye to the RM, and it ends up crashing the RM's ZKRMStateStore because > YARN-6125 does not cover message truncation for unregisterApplicationMaster. > We have also created a JIRA on the YARN side: > https://issues.apache.org/jira/browse/YARN-8691 > Although SPARK-18016 is already fixed, other uncaught exceptions may still > cause this problem. We should limit the size of the error message sent to the > RM while unregistering the AM.
[jira] [Assigned] (SPARK-25174) ApplicationMaster suspends when unregistering itself from RM with extremely large diagnostic message
[ https://issues.apache.org/jira/browse/SPARK-25174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25174: Assignee: (was: Apache Spark) > ApplicationMaster suspends when unregistering itself from RM with extremely > large diagnostic message > -- > > Key: SPARK-25174 > URL: https://issues.apache.org/jira/browse/SPARK-25174 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.1 >Reporter: Kent Yao >Priority: Major > > We recently ran into SPARK-18016, which has been fixed in v2.3.0. This JIRA is > not about the issue in SPARK-18016 itself but about the side effect it brings. When > SPARK-18016 occurs, the ApplicationMaster fails to unregister itself because the > exception contains an extremely large error message. > {code:java} > ERROR yarn.ApplicationMaster: User class threw exception: > java.lang.RuntimeException: Error while decoding: > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.janino.JaninoRuntimeException: Constant pool has grown > past JVM limit of 0x > /* 001 */ public java.lang.Object generate(Object[] references) { > > /* 395656 */ mutableRow.update(0, value); > /* 395657 */ } > /* 395658 */ > /* 395659 */ return mutableRow; > /* 395660 */ } > /* 395661 */ } > {code} > The above codegen text is included in the final message the AM sends to wave > goodbye to the RM, and it ends up crashing the RM's ZKRMStateStore because > YARN-6125 does not cover message truncation for unregisterApplicationMaster. > We have also created a JIRA on the YARN side: > https://issues.apache.org/jira/browse/YARN-8691 > Although SPARK-18016 is already fixed, other uncaught exceptions may still > cause this problem. We should limit the size of the error message sent to the > RM while unregistering the AM. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24763) Remove redundant key data from value in streaming aggregation
[ https://issues.apache.org/jira/browse/SPARK-24763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588234#comment-16588234 ] Tathagata Das commented on SPARK-24763: --- They will be. The merge script always puts the major version (i.e. 3.0.0) there. Those will be redirected to 2.4.0 as well when we make the 2.4.0 release. > Remove redundant key data from value in streaming aggregation > - > > Key: SPARK-24763 > URL: https://issues.apache.org/jira/browse/SPARK-24763 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 2.4.0, 3.0.0 > > > The key/value of state in streaming aggregation is formatted as below: > * key: UnsafeRow containing the group-by fields > * value: UnsafeRow containing the key fields and additional fields for the aggregation > results > so the data for the key is stored in both the key and the value. > This avoids projecting the row to a value while storing, and joining the key > and value to restore the original row, in order to boost performance; but in a > simple benchmark test I found it not much more helpful than "project and > join". (I will paste the test result in a comment.) > So I would propose a new option: remove redundant key data in stateful aggregation. > I am avoiding modifying the default behavior of stateful aggregation, because > the state value will not be compatible between the current format and the > option-enabled format. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23131) Stackoverflow using ML and Kryo serializer
[ https://issues.apache.org/jira/browse/SPARK-23131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588232#comment-16588232 ] Apache Spark commented on SPARK-23131: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/22179 > Stackoverflow using ML and Kryo serializer > -- > > Key: SPARK-23131 > URL: https://issues.apache.org/jira/browse/SPARK-23131 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.2.0 >Reporter: Peigen >Priority: Minor > > When trying to use GeneralizedLinearRegression model and set SparkConf to use > KryoSerializer(JavaSerializer is fine) > It causes StackOverflowException > {quote}Exception in thread "dispatcher-event-loop-34" > java.lang.StackOverflowError > at java.util.HashMap.hash(HashMap.java:338) > at java.util.HashMap.get(HashMap.java:556) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:61) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > {quote} > This is very likely to be > 
[https://github.com/EsotericSoftware/kryo/issues/341] > Upgrading Kryo to 4.0+ could probably fix this. > > Wish: upgrade the Kryo version in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25127) DataSourceV2: Remove SupportsPushDownCatalystFilters
[ https://issues.apache.org/jira/browse/SPARK-25127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reassigned SPARK-25127: --- Assignee: Reynold Xin > DataSourceV2: Remove SupportsPushDownCatalystFilters > > > Key: SPARK-25127 > URL: https://issues.apache.org/jira/browse/SPARK-25127 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.4.0 >Reporter: Ryan Blue >Assignee: Reynold Xin >Priority: Major > > Discussion about adding TableCatalog on the dev list focused around whether > Expression should be used in the public DataSourceV2 API, with > SupportsPushDownCatalystFilters as an example of where it is already exposed. > The early consensus is that Expression should not be exposed in the public > API. > From [~rxin]: > bq. I completely disagree with using Expression in critical public APIs that > we expect a lot of developers to use . . . If we are depending on Expressions > on the more common APIs in dsv2 already, we should revisit that. > The main use of this API is to pass Expression to FileFormat classes that > used Expression instead of Filter. External sources also use it for more > complex push-down, like {{to_date(ts) = '2018-05-13'}}, but those uses can be > done with Analyzer rules or when translating to Filters. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25183) Spark HiveServer2 registers shutdown hook with JVM, not ShutdownHookManager; race conditions can arise
Steve Loughran created SPARK-25183: -- Summary: Spark HiveServer2 registers shutdown hook with JVM, not ShutdownHookManager; race conditions can arise Key: SPARK-25183 URL: https://issues.apache.org/jira/browse/SPARK-25183 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Steve Loughran Spark's HiveServer2 registers a shutdown hook directly with the JVM via {{Runtime.addShutdownHook()}}, which can run in parallel with the Spark and Hadoop ShutdownHookManager sequences, which execute their shutdown work in an ordered sequence. This has some risks: * FS shutdown before the rename of logs completes, SPARK-6933 * Delays of renames on object stores may block the FS close operation, which, on clusters that time out shutdown hooks (HADOOP-12950) around FileSystem.closeAll(), can force a kill of that shutdown hook, among other problems. General outcome: logs aren't present. Proposed fix: * register the hook with {{org.apache.spark.util.ShutdownHookManager}} * HADOOP-15679 to make the shutdown wait time configurable, so O(data) renames don't trigger timeouts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
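The distinction the report draws between raw {{Runtime.addShutdownHook()}} (no ordering between hooks) and a ShutdownHookManager (hooks run in priority order) can be illustrated with a toy registry. This is an illustrative sketch, not the implementation of Spark's or Hadoop's ShutdownHookManager; the class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy ShutdownHookManager-style registry: hooks run in descending priority
// order from a single place, unlike independent Runtime shutdown hooks,
// which the JVM may run concurrently in any order.
class OrderedShutdownHooks {
    private static class Hook {
        final int priority;
        final Runnable body;
        Hook(int priority, Runnable body) { this.priority = priority; this.body = body; }
    }

    private final List<Hook> hooks = new ArrayList<>();

    synchronized void addShutdownHook(int priority, Runnable body) {
        hooks.add(new Hook(priority, body));
    }

    // Intended to be invoked once at JVM shutdown, e.g. from a single
    // Runtime shutdown hook that owns the whole sequence.
    synchronized void runAll() {
        hooks.sort(Comparator.comparingInt((Hook h) -> h.priority).reversed());
        for (Hook h : hooks) {
            try {
                h.body.run();
            } catch (RuntimeException e) {
                // A failing hook must not prevent later hooks (e.g. FS close) from running.
            }
        }
    }
}
```

With such a registry, "flush/rename logs" can be given a higher priority than "close filesystems", which is exactly the ordering the raw JVM hook cannot guarantee.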
[jira] [Commented] (SPARK-25164) Parquet reader builds entire list of columns once for each column
[ https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588168#comment-16588168 ] Bruce Robbins commented on SPARK-25164: --- [~viirya] Sure. I will try to get something up by tonight or tomorrow morning (Pacific Time). > Parquet reader builds entire list of columns once for each column > - > > Key: SPARK-25164 > URL: https://issues.apache.org/jira/browse/SPARK-25164 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Bruce Robbins >Priority: Minor > > {{VectorizedParquetRecordReader.initializeInternal}} loops through each > column, and for each column it calls > {noformat} > requestedSchema.getColumns().get(i) > {noformat} > However, {{MessageType.getColumns}} will build the entire column list from > getPaths(0). > {noformat} > public List<ColumnDescriptor> getColumns() { > List<String[]> paths = this.getPaths(0); > List<ColumnDescriptor> columns = new > ArrayList<ColumnDescriptor>(paths.size()); > for (String[] path : paths) { > // TODO: optimize this > > PrimitiveType primitiveType = getType(path).asPrimitiveType(); > columns.add(new ColumnDescriptor( > path, > primitiveType, > getMaxRepetitionLevel(path), > getMaxDefinitionLevel(path))); > } > return columns; > } > {noformat} > This means that for each parquet file, this routine indirectly iterates > colCount*colCount times. > This is actually not particularly noticeable unless you have: > - many parquet files > - many columns > To verify that this is an issue, I created a 1 million record parquet table > with 6000 columns of type double and 67 files (so initializeInternal is > called 67 times). I ran the following query: > {noformat} > sql("select * from 6000_1m_double where id1 = 1").collect > {noformat} > I used Spark from the master branch. I had 8 executor threads. The filter > returns only a few thousand records. The query ran (on average) for 6.4 > minutes. 
> Then I cached the column list at the top of {{initializeInternal}} as follows: > {noformat} > List<ColumnDescriptor> columnCache = requestedSchema.getColumns(); > {noformat} > Then I changed {{initializeInternal}} to use {{columnCache}} rather than > {{requestedSchema.getColumns()}}. > With the column cache variable, the same query runs in 5 minutes. So with my > simple query, you save 22% of the time by not rebuilding the column list for > each column. > You get additional savings with a paths cache variable, now saving 34% in > total on the above query. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
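The quadratic-versus-linear behavior described in the report can be shown with a self-contained toy model. The names below ({{getColumns}}, {{initWithoutCache}}, {{initWithCache}}) are illustrative stand-ins, not the Parquet API: the point is only that calling a list-rebuilding getter inside the per-column loop does O(n²) work, while hoisting it out does O(n).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the SPARK-25164 pattern: each getColumns() call rebuilds the
// full column list, so calling it once per column multiplies the cost by n.
class ColumnCacheDemo {
    static int buildCalls = 0;  // counts how many times the list is rebuilt

    static List<String> getColumns(int n) {
        buildCalls++;  // simulate MessageType.getColumns rebuilding from getPaths(0)
        List<String> cols = new ArrayList<>();
        for (int i = 0; i < n; i++) cols.add("col" + i);
        return cols;
    }

    // Anti-pattern: rebuild the list inside the loop (n rebuilds, n*n item work).
    static void initWithoutCache(int n) {
        for (int i = 0; i < n; i++) getColumns(n).get(i);
    }

    // Fix sketched in the report: fetch the list once, then index into it.
    static void initWithCache(int n) {
        List<String> columnCache = getColumns(n);  // single rebuild
        for (int i = 0; i < n; i++) columnCache.get(i);
    }
}
```

At 6000 columns the uncached variant performs 6000 rebuilds of a 6000-element list per file, which matches the colCount*colCount iteration count called out in the description.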
[jira] [Commented] (SPARK-25119) stages in wrong order within job page DAG chart
[ https://issues.apache.org/jira/browse/SPARK-25119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588160#comment-16588160 ] Yunjian Zhang commented on SPARK-25119: --- Created a PR as below: [https://github.com/apache/spark/pull/22177] > stages in wrong order within job page DAG chart > --- > > Key: SPARK-25119 > URL: https://issues.apache.org/jira/browse/SPARK-25119 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: Yunjian Zhang >Priority: Minor > Attachments: Screen Shot 2018-08-14 at 3.35.34 PM.png > > > Multiple stages for the same job are shown in the wrong order in the DAG > Visualization of the job page. > e.g. > stage27 stage19 stage20 stage24 stage21 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25164) Parquet reader builds entire list of columns once for each column
[ https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588159#comment-16588159 ] Liang-Chi Hsieh edited comment on SPARK-25164 at 8/21/18 11:30 PM: --- This looks easy and good to have. [~bersprockets] Do you want to submit a PR for this? was (Author: viirya): This is easy and looks good to have. [~bersprockets] Do you want to submit a PR for this? > Parquet reader builds entire list of columns once for each column > - > > Key: SPARK-25164 > URL: https://issues.apache.org/jira/browse/SPARK-25164 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Bruce Robbins >Priority: Minor > > {{VectorizedParquetRecordReader.initializeInternal}} loops through each > column, and for each column it calls > {noformat} > requestedSchema.getColumns().get(i) > {noformat} > However, {{MessageType.getColumns}} will build the entire column list from > getPaths(0). > {noformat} > public List getColumns() { > List paths = this.getPaths(0); > List columns = new > ArrayList(paths.size()); > for (String[] path : paths) { > // TODO: optimize this > > PrimitiveType primitiveType = getType(path).asPrimitiveType(); > columns.add(new ColumnDescriptor( > path, > primitiveType, > getMaxRepetitionLevel(path), > getMaxDefinitionLevel(path))); > } > return columns; > } > {noformat} > This means that for each parquet file, this routine indirectly iterates > colCount*colCount times. > This is actually not particularly noticeable unless you have: > - many parquet files > - many columns > To verify that this is an issue, I created a 1 million record parquet table > with 6000 columns of type double and 67 files (so initializeInternal is > called 67 times). I ran the following query: > {noformat} > sql("select * from 6000_1m_double where id1 = 1").collect > {noformat} > I used Spark from the master branch. I had 8 executor threads. The filter > returns only a few thousand records. 
The query ran (on average) for 6.4 > minutes. > Then I cached the column list at the top of {{initializeInternal}} as follows: > {noformat} > List<ColumnDescriptor> columnCache = requestedSchema.getColumns(); > {noformat} > Then I changed {{initializeInternal}} to use {{columnCache}} rather than > {{requestedSchema.getColumns()}}. > With the column cache variable, the same query runs in 5 minutes. So with my > simple query, you save 22% of the time by not rebuilding the column list for > each column. > You get additional savings with a paths cache variable, now saving 34% in > total on the above query. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25164) Parquet reader builds entire list of columns once for each column
[ https://issues.apache.org/jira/browse/SPARK-25164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588159#comment-16588159 ] Liang-Chi Hsieh commented on SPARK-25164: - This is easy and looks good to have. [~bersprockets] Do you want to submit a PR for this? > Parquet reader builds entire list of columns once for each column > - > > Key: SPARK-25164 > URL: https://issues.apache.org/jira/browse/SPARK-25164 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Bruce Robbins >Priority: Minor > > {{VectorizedParquetRecordReader.initializeInternal}} loops through each > column, and for each column it calls > {noformat} > requestedSchema.getColumns().get(i) > {noformat} > However, {{MessageType.getColumns}} will build the entire column list from > getPaths(0). > {noformat} > public List getColumns() { > List paths = this.getPaths(0); > List columns = new > ArrayList(paths.size()); > for (String[] path : paths) { > // TODO: optimize this > > PrimitiveType primitiveType = getType(path).asPrimitiveType(); > columns.add(new ColumnDescriptor( > path, > primitiveType, > getMaxRepetitionLevel(path), > getMaxDefinitionLevel(path))); > } > return columns; > } > {noformat} > This means that for each parquet file, this routine indirectly iterates > colCount*colCount times. > This is actually not particularly noticeable unless you have: > - many parquet files > - many columns > To verify that this is an issue, I created a 1 million record parquet table > with 6000 columns of type double and 67 files (so initializeInternal is > called 67 times). I ran the following query: > {noformat} > sql("select * from 6000_1m_double where id1 = 1").collect > {noformat} > I used Spark from the master branch. I had 8 executor threads. The filter > returns only a few thousand records. The query ran (on average) for 6.4 > minutes. 
> Then I cached the column list at the top of {{initializeInternal}} as follows: > {noformat} > List<ColumnDescriptor> columnCache = requestedSchema.getColumns(); > {noformat} > Then I changed {{initializeInternal}} to use {{columnCache}} rather than > {{requestedSchema.getColumns()}}. > With the column cache variable, the same query runs in 5 minutes. So with my > simple query, you save 22% of the time by not rebuilding the column list for > each column. > You get additional savings with a paths cache variable, now saving 34% in > total on the above query. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25181) Block Manager master and slave thread pools are unbounded
[ https://issues.apache.org/jira/browse/SPARK-25181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25181: Assignee: (was: Apache Spark) > Block Manager master and slave thread pools are unbounded > - > > Key: SPARK-25181 > URL: https://issues.apache.org/jira/browse/SPARK-25181 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, BlockManagerMasterEndpoint and BlockManagerSlaveEndpoint both have > thread pools with unbounded numbers of threads. In certain cases, this can > lead to driver OOM errors. We should add an upper bound on the number of > threads in these thread pools; this should not break any existing behavior > because they still have queues of size Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
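The fix described above (an upper bound on threads while keeping a queue of size Integer.MAX_VALUE) can be sketched with the standard {{ThreadPoolExecutor}}. This is a hedged sketch of the general technique, not the actual Spark patch; the factory name and any particular thread cap are illustrative assumptions.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Bounded pool with an effectively unbounded queue: once maxThreads workers
// exist, further submissions wait in the queue instead of spawning threads,
// which is what prevents the driver OOM described in the ticket.
class BoundedEndpointPool {
    static ThreadPoolExecutor create(int maxThreads) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            maxThreads, maxThreads,                        // fixed upper bound on threads
            60L, TimeUnit.SECONDS,                         // idle-thread keep-alive
            new LinkedBlockingQueue<Runnable>(Integer.MAX_VALUE));
        pool.allowCoreThreadTimeOut(true);                 // let idle threads exit entirely
        return pool;
    }
}
```

Note that with {{ThreadPoolExecutor}}, an unbounded queue means the maximum pool size is never exceeded, so existing callers see queuing latency at worst, never rejection — consistent with the ticket's claim that the change should not break existing behavior.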
[jira] [Assigned] (SPARK-25181) Block Manager master and slave thread pools are unbounded
[ https://issues.apache.org/jira/browse/SPARK-25181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25181: Assignee: Apache Spark > Block Manager master and slave thread pools are unbounded > - > > Key: SPARK-25181 > URL: https://issues.apache.org/jira/browse/SPARK-25181 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Mukul Murthy >Assignee: Apache Spark >Priority: Major > > Currently, BlockManagerMasterEndpoint and BlockManagerSlaveEndpoint both have > thread pools with unbounded numbers of threads. In certain cases, this can > lead to driver OOM errors. We should add an upper bound on the number of > threads in these thread pools; this should not break any existing behavior > because they still have queues of size Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25181) Block Manager master and slave thread pools are unbounded
[ https://issues.apache.org/jira/browse/SPARK-25181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588154#comment-16588154 ] Apache Spark commented on SPARK-25181: -- User 'mukulmurthy' has created a pull request for this issue: https://github.com/apache/spark/pull/22176 > Block Manager master and slave thread pools are unbounded > - > > Key: SPARK-25181 > URL: https://issues.apache.org/jira/browse/SPARK-25181 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, BlockManagerMasterEndpoint and BlockManagerSlaveEndpoint both have > thread pools with unbounded numbers of threads. In certain cases, this can > lead to driver OOM errors. We should add an upper bound on the number of > threads in these thread pools; this should not break any existing behavior > because they still have queues of size Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25182) Block Manager master and slave thread pools are unbounded
[ https://issues.apache.org/jira/browse/SPARK-25182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mukul Murthy resolved SPARK-25182. -- Resolution: Duplicate Target Version/s: (was: 2.4.0) > Block Manager master and slave thread pools are unbounded > - > > Key: SPARK-25182 > URL: https://issues.apache.org/jira/browse/SPARK-25182 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, BlockManagerMasterEndpoint and BlockManagerSlaveEndpoint both have > thread pools with unbounded numbers of threads. In certain cases, this can > lead to driver OOM errors. We should add an upper bound on the number of > threads in these thread pools; this should not break any existing behavior > because they still have queues of size Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25182) Block Manager master and slave thread pools are unbounded
Mukul Murthy created SPARK-25182: Summary: Block Manager master and slave thread pools are unbounded Key: SPARK-25182 URL: https://issues.apache.org/jira/browse/SPARK-25182 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: Mukul Murthy Currently, BlockManagerMasterEndpoint and BlockManagerSlaveEndpoint both have thread pools with unbounded numbers of threads. In certain cases, this can lead to driver OOM errors. We should add an upper bound on the number of threads in these thread pools; this should not break any existing behavior because they still have queues of size Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25181) Block Manager master and slave thread pools are unbounded
Mukul Murthy created SPARK-25181: Summary: Block Manager master and slave thread pools are unbounded Key: SPARK-25181 URL: https://issues.apache.org/jira/browse/SPARK-25181 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: Mukul Murthy Currently, BlockManagerMasterEndpoint and BlockManagerSlaveEndpoint both have thread pools with unbounded numbers of threads. In certain cases, this can lead to driver OOM errors. We should add an upper bound on the number of threads in these thread pools; this should not break any existing behavior because they still have queues of size Integer.MAX_VALUE. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE
[ https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588121#comment-16588121 ] Xiao Li commented on SPARK-25114: - Let us update the fix version after the fix of 2.2 is merged > RecordBinaryComparator may return wrong result when subtraction between two > words is divisible by Integer.MAX_VALUE > --- > > Key: SPARK-25114 > URL: https://issues.apache.org/jira/browse/SPARK-25114 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo >Priority: Blocker > Labels: correctness > Fix For: 2.3.2, 2.4.0 > > > It is possible for two objects to be unequal and yet we consider them as > equal within RecordBinaryComparator, if the long values are separated by > Int.MaxValue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE
[ https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-25114. - Resolution: Fixed Assignee: Jiang Xingbo Fix Version/s: 2.4.0 2.3.2 > RecordBinaryComparator may return wrong result when subtraction between two > words is divisible by Integer.MAX_VALUE > --- > > Key: SPARK-25114 > URL: https://issues.apache.org/jira/browse/SPARK-25114 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo >Priority: Blocker > Labels: correctness > Fix For: 2.3.2, 2.4.0 > > > It is possible for two objects to be unequal and yet we consider them as > equal within RecordBinaryComparator, if the long values are separated by > Int.MaxValue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
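The class of bug described here can be shown outside Spark (a generic sketch of int truncation in a comparator, not the exact RecordBinaryComparator code): truncating a 64-bit difference to an int drops the high 32 bits, so two words that differ by a multiple of 2^32 wrongly compare as equal. Long.compare avoids the overflow.

```java
public class CompareOverflow {
    public static void main(String[] args) {
        long a = 1L << 32; // 4294967296
        long b = 0L;

        // Buggy pattern: casting a 64-bit difference to int keeps only the
        // low 32 bits, which are all zero here.
        int buggy = (int) (a - b);
        int safe = Long.compare(a, b);

        System.out.println(buggy); // 0  -> a and b wrongly compare as equal
        System.out.println(safe);  // 1  -> a correctly compares greater
    }
}
```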
[jira] [Resolved] (SPARK-25095) Python support for BarrierTaskContext
[ https://issues.apache.org/jira/browse/SPARK-25095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-25095. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22085 [https://github.com/apache/spark/pull/22085] > Python support for BarrierTaskContext > - > > Key: SPARK-25095 > URL: https://issues.apache.org/jira/browse/SPARK-25095 > Project: Spark > Issue Type: Task > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo >Priority: Major > Fix For: 2.4.0 > > > Enable calling `BarrierTaskContext.barrier()` from the Python side. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25095) Python support for BarrierTaskContext
[ https://issues.apache.org/jira/browse/SPARK-25095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-25095: - Assignee: Jiang Xingbo > Python support for BarrierTaskContext > - > > Key: SPARK-25095 > URL: https://issues.apache.org/jira/browse/SPARK-25095 > Project: Spark > Issue Type: Task > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo >Priority: Major > Fix For: 2.4.0 > > > Enable calling `BarrierTaskContext.barrier()` from the Python side. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24564) Add test suite for RecordBinaryComparator
[ https://issues.apache.org/jira/browse/SPARK-24564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588114#comment-16588114 ] Apache Spark commented on SPARK-24564: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/22079 > Add test suite for RecordBinaryComparator > - > > Key: SPARK-24564 > URL: https://issues.apache.org/jira/browse/SPARK-24564 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo >Priority: Minor > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25114) RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE
[ https://issues.apache.org/jira/browse/SPARK-25114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588115#comment-16588115 ] Apache Spark commented on SPARK-25114: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/22079 > RecordBinaryComparator may return wrong result when subtraction between two > words is divisible by Integer.MAX_VALUE > --- > > Key: SPARK-25114 > URL: https://issues.apache.org/jira/browse/SPARK-25114 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo >Priority: Blocker > Labels: correctness > > It is possible for two objects to be unequal and yet we consider them as > equal within RecordBinaryComparator, if the long values are separated by > Int.MaxValue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25168) PlanTest.comparePlans may make a supplied resolved plan unresolved.
[ https://issues.apache.org/jira/browse/SPARK-25168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587956#comment-16587956 ] Dilip Biswal edited comment on SPARK-25168 at 8/21/18 10:47 PM: [~cloud_fan] OK.. Let me close this then since we are unable to do anything about this at present and since nothing is seriously impacted. Perhaps we can consider making normalizedPlan private so callers are not handed out unresolved plans ? was (Author: dkbiswal): [~cloud_fan] OK.. Let me close this then since we are unable to do anything about this at present and since nothing is seriously impacted. Perhaps we can consider making normalizedPlan private so callers are not handled out unresolved plans ? > PlanTest.comparePlans may make a supplied resolved plan unresolved. > --- > > Key: SPARK-25168 > URL: https://issues.apache.org/jira/browse/SPARK-25168 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Dilip Biswal >Priority: Minor > > Hello, > I came across this behaviour while writing a unit test for one of my PRs. We > use the utility class PlanTest for writing plan verification tests. > Currently when we call comparePlans and the resolved input plan contains a > Join operator, the plan becomes unresolved after the call to > `normalizePlan(normalizeExprIds(plan1))`. The reason for this is that, after > cleaning the expression ids, duplicateResolved returns false making the Join > operator unresolved. > {code} > def duplicateResolved: Boolean = > left.outputSet.intersect(right.outputSet).isEmpty > // Joins are only resolved if they don't introduce ambiguous expression ids. > // NaturalJoin should be ready for resolution only if everything else is > resolved here > lazy val resolvedExceptNatural: Boolean = { > childrenResolved && > expressions.forall(_.resolved) && > duplicateResolved && > condition.forall(_.dataType == BooleanType) > } > {code} > Please note that plan verification actually works fine. 
It just looked > awkward to compare two unresolved plans for equality. > I am opening this ticket to discuss whether this is an okay behaviour. If it is a > tolerable behaviour then we can close the ticket. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
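The quoted duplicateResolved check can be mimicked with plain sets (a toy sketch using integers as stand-ins for expression ids, not Catalyst's actual AttributeSet): once id normalization gives both join children an overlapping attribute id, the intersection is non-empty and the Join reports itself unresolved.

```java
import java.util.HashSet;
import java.util.Set;

public class DuplicateResolvedDemo {
    public static void main(String[] args) {
        // After normalizeExprIds, both join children expose attribute id 0.
        Set<Integer> leftOutput = Set.of(0, 1);
        Set<Integer> rightOutput = Set.of(0, 2);

        Set<Integer> overlap = new HashSet<>(leftOutput);
        overlap.retainAll(rightOutput);

        // Mirrors `left.outputSet.intersect(right.outputSet).isEmpty`.
        boolean duplicateResolved = overlap.isEmpty();
        System.out.println(duplicateResolved); // false -> Join becomes unresolved
    }
}
```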
[jira] [Commented] (SPARK-24763) Remove redundant key data from value in streaming aggregation
[ https://issues.apache.org/jira/browse/SPARK-24763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588101#comment-16588101 ] Jungtaek Lim commented on SPARK-24763: -- [~tdas] One question regarding fix version: I guess we still haven't cut the Spark 2.4 branch, so I expect everything on master will be included in Spark 2.4. Do I understand correctly? We have 4 issues with fix version marked as only 3.0.0, so I'm curious which version they'll be included in. https://issues.apache.org/jira/browse/SPARK-24763?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%203.0.0 > Remove redundant key data from value in streaming aggregation > - > > Key: SPARK-24763 > URL: https://issues.apache.org/jira/browse/SPARK-24763 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 2.4.0, 3.0.0 > > > Key/Value of state in streaming aggregation is formatted as below: > * key: UnsafeRow containing group-by fields > * value: UnsafeRow containing the key fields and other fields for aggregation > results > so the data for the key is stored in both the key and the value. > This is to avoid projecting the row to a value while storing, and joining key > and value to restore the original row, to boost performance; but while doing a > simple benchmark test, I found it not much more helpful than "project and > join". (will paste test result in comment) > So I would propose a new option: remove redundant key data in stateful aggregation. > I'm avoiding modifying the default behavior of stateful aggregation, because > the state value will not be compatible between the current behavior and the option being enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
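The "project and join" alternative described in the issue can be sketched with plain Java collections (a toy illustration, not Spark's UnsafeRow code): the state value stores only the aggregate fields, and the original row is restored by concatenating key and value on read.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class StateLayout {
    public static void main(String[] args) {
        // Redundant layout: the value repeats the key fields.
        Map<String, Object[]> redundant = new HashMap<>();
        redundant.put("user1", new Object[]{"user1", 42L}); // key stored twice

        // Proposed layout: the value holds only the aggregate; the full row
        // is restored by "project and join" -- concatenating key and value.
        Map<String, Object[]> compact = new HashMap<>();
        compact.put("user1", new Object[]{42L});

        Object[] restored = join("user1", compact.get("user1"));
        System.out.println(Arrays.toString(restored)); // [user1, 42]
    }

    static Object[] join(String key, Object[] value) {
        Object[] row = new Object[value.length + 1];
        row[0] = key;
        System.arraycopy(value, 0, row, 1, value.length);
        return row;
    }
}
```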
[jira] [Assigned] (SPARK-24441) Expose total estimated size of states in HDFSBackedStateStoreProvider
[ https://issues.apache.org/jira/browse/SPARK-24441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reassigned SPARK-24441: - Assignee: Jungtaek Lim > Expose total estimated size of states in HDFSBackedStateStoreProvider > - > > Key: SPARK-24441 > URL: https://issues.apache.org/jira/browse/SPARK-24441 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > While Spark exposes state metrics for a single state, Spark still doesn't > expose the overall memory usage of state (loadedMaps) in > HDFSBackedStateStoreProvider. > The rationale of the patch is that state backed by > HDFSBackedStateStoreProvider will consume more memory than the number we > can get from query status, due to caching multiple versions of states. The > memory footprint can be much larger than query status reports in situations > where the state store is getting a lot of updates: while shallow-copying a map > incurs small additional memory usage due to the size of map entries and > references, row objects will still be shared across the versions. If > there are lots of updates between batches, fewer row objects will be shared and > more row objects will exist in memory, consuming more memory than we > expect. > It would be better to expose it as well so that end users can determine the > actual memory usage for state. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24441) Expose total estimated size of states in HDFSBackedStateStoreProvider
[ https://issues.apache.org/jira/browse/SPARK-24441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-24441. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 21469 [https://github.com/apache/spark/pull/21469] > Expose total estimated size of states in HDFSBackedStateStoreProvider > - > > Key: SPARK-24441 > URL: https://issues.apache.org/jira/browse/SPARK-24441 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > While Spark exposes state metrics for a single state, Spark still doesn't > expose the overall memory usage of state (loadedMaps) in > HDFSBackedStateStoreProvider. > The rationale of the patch is that state backed by > HDFSBackedStateStoreProvider will consume more memory than the number we > can get from query status, due to caching multiple versions of states. The > memory footprint can be much larger than query status reports in situations > where the state store is getting a lot of updates: while shallow-copying a map > incurs small additional memory usage due to the size of map entries and > references, row objects will still be shared across the versions. If > there are lots of updates between batches, fewer row objects will be shared and > more row objects will exist in memory, consuming more memory than we > expect. > It would be better to expose it as well so that end users can determine the > actual memory usage for state. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24763) Remove redundant key data from value in streaming aggregation
[ https://issues.apache.org/jira/browse/SPARK-24763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reassigned SPARK-24763: - Assignee: Jungtaek Lim (was: Tathagata Das) > Remove redundant key data from value in streaming aggregation > - > > Key: SPARK-24763 > URL: https://issues.apache.org/jira/browse/SPARK-24763 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 2.4.0, 3.0.0 > > > Key/Value of state in streaming aggregation is formatted as below: > * key: UnsafeRow containing group-by fields > * value: UnsafeRow containing the key fields and other fields for aggregation > results > so the data for the key is stored in both the key and the value. > This is to avoid projecting the row to a value while storing, and joining key > and value to restore the original row, to boost performance; but while doing a > simple benchmark test, I found it not much more helpful than "project and > join". (will paste test result in comment) > So I would propose a new option: remove redundant key data in stateful aggregation. > I'm avoiding modifying the default behavior of stateful aggregation, because > the state value will not be compatible between the current behavior and the option being enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24763) Remove redundant key data from value in streaming aggregation
[ https://issues.apache.org/jira/browse/SPARK-24763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reassigned SPARK-24763: - Assignee: Tathagata Das > Remove redundant key data from value in streaming aggregation > - > > Key: SPARK-24763 > URL: https://issues.apache.org/jira/browse/SPARK-24763 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Assignee: Tathagata Das >Priority: Major > Fix For: 2.4.0, 3.0.0 > > > Key/Value of state in streaming aggregation is formatted as below: > * key: UnsafeRow containing group-by fields > * value: UnsafeRow containing the key fields and other fields for aggregation > results > so the data for the key is stored in both the key and the value. > This is to avoid projecting the row to a value while storing, and joining key > and value to restore the original row, to boost performance; but while doing a > simple benchmark test, I found it not much more helpful than "project and > join". (will paste test result in comment) > So I would propose a new option: remove redundant key data in stateful aggregation. > I'm avoiding modifying the default behavior of stateful aggregation, because > the state value will not be compatible between the current behavior and the option being enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24763) Remove redundant key data from value in streaming aggregation
[ https://issues.apache.org/jira/browse/SPARK-24763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-24763. --- Resolution: Done Fix Version/s: 3.0.0 2.4.0 > Remove redundant key data from value in streaming aggregation > - > > Key: SPARK-24763 > URL: https://issues.apache.org/jira/browse/SPARK-24763 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 2.4.0, 3.0.0 > > > Key/Value of state in streaming aggregation is formatted as below: > * key: UnsafeRow containing group-by fields > * value: UnsafeRow containing the key fields and other fields for aggregation > results > so the data for the key is stored in both the key and the value. > This is to avoid projecting the row to a value while storing, and joining key > and value to restore the original row, to boost performance; but while doing a > simple benchmark test, I found it not much more helpful than "project and > join". (will paste test result in comment) > So I would propose a new option: remove redundant key data in stateful aggregation. > I'm avoiding modifying the default behavior of stateful aggregation, because > the state value will not be compatible between the current behavior and the option being enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25149) Personalized PageRank raises an error if vertexIDs are > MaxInt
[ https://issues.apache.org/jira/browse/SPARK-25149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-25149. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22139 [https://github.com/apache/spark/pull/22139] > Personalized PageRank raises an error if vertexIDs are > MaxInt > --- > > Key: SPARK-25149 > URL: https://issues.apache.org/jira/browse/SPARK-25149 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.3.1 >Reporter: Bago Amirbekian >Assignee: Bago Amirbekian >Priority: Major > Fix For: 2.4.0 > > > Looking at the implementation, I think we don't actually need this check. The > sparse vectors used and returned by the method are indexed by the > _position_ of the vertexId in `sources`, not by the > vertex ID itself. We should remove this check and add a test to confirm the > implementation works with large vertex IDs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
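The position-based indexing the comment describes can be sketched as follows (a hypothetical illustration, not the GraphX implementation): vertex IDs larger than Int.MaxValue are harmless because the index into the sparse vector is the small integer position of the ID within `sources`, never the 64-bit ID itself.

```java
import java.util.HashMap;
import java.util.Map;

public class SourceIndex {
    public static void main(String[] args) {
        // Vertex IDs are longs and may exceed Integer.MAX_VALUE.
        long[] sources = {5L, 10_000_000_000L, 7L};

        // Map each vertex ID to its *position* in `sources`; sparse vectors
        // are indexed by this small int, not by the 64-bit ID.
        Map<Long, Integer> position = new HashMap<>();
        for (int i = 0; i < sources.length; i++) {
            position.put(sources[i], i);
        }

        System.out.println(position.get(10_000_000_000L)); // 1
    }
}
```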
[jira] [Assigned] (SPARK-25149) Personalized PageRank raises an error if vertexIDs are > MaxInt
[ https://issues.apache.org/jira/browse/SPARK-25149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-25149: - Assignee: Bago Amirbekian > Personalized PageRank raises an error if vertexIDs are > MaxInt > --- > > Key: SPARK-25149 > URL: https://issues.apache.org/jira/browse/SPARK-25149 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.3.1 >Reporter: Bago Amirbekian >Assignee: Bago Amirbekian >Priority: Major > > Looking at the implementation, I think we don't actually need this check. The > sparse vectors used and returned by the method are indexed by the > _position_ of the vertexId in `sources`, not by the > vertex ID itself. We should remove this check and add a test to confirm the > implementation works with large vertex IDs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25149) Personalized PageRank raises an error if vertexIDs are > MaxInt
[ https://issues.apache.org/jira/browse/SPARK-25149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-25149: -- Summary: Personalized PageRank raises an error if vertexIDs are > MaxInt (was: Personalized Page Rank raises an error if vertexIDs are > MaxInt) > Personalized PageRank raises an error if vertexIDs are > MaxInt > --- > > Key: SPARK-25149 > URL: https://issues.apache.org/jira/browse/SPARK-25149 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.3.1 >Reporter: Bago Amirbekian >Priority: Major > > Looking at the implementation, I think we don't actually need this check. The > sparse vectors used and returned by the method are indexed by the > _position_ of the vertexId in `sources`, not by the > vertex ID itself. We should remove this check and add a test to confirm the > implementation works with large vertex IDs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24307) Support sending messages over 2GB from memory
[ https://issues.apache.org/jira/browse/SPARK-24307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24307: Labels: release-notes (was: releasenotes) > Support sending messages over 2GB from memory > - > > Key: SPARK-24307 > URL: https://issues.apache.org/jira/browse/SPARK-24307 > Project: Spark > Issue Type: Sub-task > Components: Block Manager, Spark Core >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Major > Labels: release-notes > Fix For: 2.4.0 > > > Spark's networking layer supports sending messages backed by a {{FileRegion}} > or a {{ByteBuf}}. Sending large FileRegion's works, as netty supports large > FileRegions. However, {{ByteBuf}} is limited to 2GB. This is particularly > a problem for sending large datasets that are already in memory, eg. cached > RDD blocks. > eg. if you try to replicate a block stored in memory that is over 2 GB, you > will see an exception like: > {noformat} > 18/05/16 12:40:57 ERROR client.TransportClient: Failed to send RPC > 7420542363232096629 to xyz.com/172.31.113.213:44358: > io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: > readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= > writerIndex <= capacity(-1294617291)) > io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: > readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= > writerIndex <= capacity(-1294617291)) > at > io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730) > at > io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816) > at > 
io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723) > at > io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730) > at > io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081) > at > io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070) > at > io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463) > at > io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: > -1294617291 (expected: 0 <= readerIndex <= writerIndex <= > capacity(-1294617291)) > at io.netty.buffer.AbstractByteBuf.setIndex(AbstractByteBuf.java:129) > at > io.netty.buffer.CompositeByteBuf.setIndex(CompositeByteBuf.java:1688) > at io.netty.buffer.CompositeByteBuf.(CompositeByteBuf.java:110) > at io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:359) > at > org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:87) > at > org.apache.spark.storage.ByteBufferBlockData.toNetty(BlockManager.scala:95) > at > 
org.apache.spark.storage.BlockManagerManagedBuffer.convertToNetty(BlockManagerManagedBuffer.scala:52) > at > org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58) > at > org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33) > at > io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88) > ... 17 more > {noformat} > A simple solution to this is to create a "FileRegion" which is backed by a > {{ChunkedByteBuffer}} (spark's existing datastructure to support blocks > 2GB > in memory). > A drawback to this approach is that
[jira] [Updated] (SPARK-24307) Support sending messages over 2GB from memory
[ https://issues.apache.org/jira/browse/SPARK-24307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24307: Labels: releasenotes (was: ) > Support sending messages over 2GB from memory > - > > Key: SPARK-24307 > URL: https://issues.apache.org/jira/browse/SPARK-24307 > Project: Spark > Issue Type: Sub-task > Components: Block Manager, Spark Core >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Major > Labels: release-notes > Fix For: 2.4.0 > > > Spark's networking layer supports sending messages backed by a {{FileRegion}} > or a {{ByteBuf}}. Sending large FileRegion's works, as netty supports large > FileRegions. However, {{ByteBuf}} is limited to 2GB. This is particularly > a problem for sending large datasets that are already in memory, eg. cached > RDD blocks. > eg. if you try to replicate a block stored in memory that is over 2 GB, you > will see an exception like: > {noformat} > 18/05/16 12:40:57 ERROR client.TransportClient: Failed to send RPC > 7420542363232096629 to xyz.com/172.31.113.213:44358: > io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: > readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= > writerIndex <= capacity(-1294617291)) > io.netty.handler.codec.EncoderException: java.lang.IndexOutOfBoundsException: > readerIndex: 0, writerIndex: -1294617291 (expected: 0 <= readerIndex <= > writerIndex <= capacity(-1294617291)) > at > io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730) > at > io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:816) > at > io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:723) > at > 
io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:302) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738) > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730) > at > io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081) > at > io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128) > at > io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070) > at > io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463) > at > io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.IndexOutOfBoundsException: readerIndex: 0, writerIndex: > -1294617291 (expected: 0 <= readerIndex <= writerIndex <= > capacity(-1294617291)) > at io.netty.buffer.AbstractByteBuf.setIndex(AbstractByteBuf.java:129) > at > io.netty.buffer.CompositeByteBuf.setIndex(CompositeByteBuf.java:1688) > at io.netty.buffer.CompositeByteBuf.(CompositeByteBuf.java:110) > at io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:359) > at > org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:87) > at > org.apache.spark.storage.ByteBufferBlockData.toNetty(BlockManager.scala:95) > at > org.apache.spark.storage.BlockManagerManagedBuffer.convertToNetty(BlockManagerManagedBuffer.scala:52) > at 
> org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:58) > at > org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33) > at > io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88) > ... 17 more > {noformat} > A simple solution to this is to create a "FileRegion" which is backed by a > {{ChunkedByteBuffer}} (spark's existing datastructure to support blocks > 2GB > in memory). > A drawback to this approach is that blocks that
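The negative writerIndex in the stack traces above is the signature of 32-bit overflow: a byte count just over the 2 GB int limit wraps negative when cast to int. A minimal illustration (the ~3 GB size below is chosen to reproduce the exact value in the trace, not taken from the report):

```java
public class WriterIndexOverflow {
    public static void main(String[] args) {
        // A buffer of ~3 GB: its size fits in a long but not in an int.
        long totalBytes = 3_000_350_005L;

        // Casting to int wraps around, producing exactly the negative
        // writerIndex reported in the exception above.
        int wrapped = (int) totalBytes;
        System.out.println(wrapped); // -1294617291
    }
}
```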
[jira] [Commented] (SPARK-25050) Handle more than two types in avro union types when writing avro files
[ https://issues.apache.org/jira/browse/SPARK-25050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588068#comment-16588068 ] DB Tsai commented on SPARK-25050: - We handle more than two types in avro union types when reading into Spark, but not yet when writing out avro files with a specified schema. > Handle more than two types in avro union types when writing avro files > -- > > Key: SPARK-25050 > URL: https://issues.apache.org/jira/browse/SPARK-25050 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
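For reference, a sketch of the kind of Avro schema this issue is about: a union with more than two branches. The record and field names here are illustrative, not taken from the issue:

```json
{
  "type": "record",
  "name": "ExampleRecord",
  "fields": [
    {"name": "value", "type": ["null", "long", "string"]}
  ]
}
```

A two-branch union like `["null", "long"]` maps naturally to a nullable column; a three-plus-branch union has no single obvious Spark SQL type, which is what makes the write path harder than the read path.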
[jira] [Updated] (SPARK-25050) Handle more than two types in avro union types when writing avro files
[ https://issues.apache.org/jira/browse/SPARK-25050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-25050: Summary: Handle more than two types in avro union types when writing avro files (was: Handle more than two types in avro union types) > Handle more than two types in avro union types when writing avro files > -- > > Key: SPARK-25050 > URL: https://issues.apache.org/jira/browse/SPARK-25050 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: DB Tsai >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588053#comment-16588053 ] Steve Loughran commented on SPARK-6305: ---
1. Exclusion of log4j 1.x: you can only safely exclude it if your dependencies don't expect it on their classpath. Otherwise: stack traces. See HADOOP-12956 and HBASE-10092 for how those projects have moved/are slowly moving off commons-logging to SLF4J for their APIs, so you can swap in log4j 2.x. Once they are all on SLF4J, the code-level switch should be straightforward, leaving only the configuration problem (#4).
2. Use of low-level log4j in tests. Usually done either to look for log entries in the output, or to turn logging down. If you can move that, then it's possible, but it will be a one-way trip.
3. Spark's logging use of log4j: because Spark's s"strings" (string interpolation) already handle the closure problem, the key bits of the log4j APIs aimed at Java 8 aren't so compelling, but there are nice things in the new runtime, which is why the move to SLF4J has been going on everywhere. It's just a slow job, because it touches pretty much every single .java file in every single JAR which Spark imports.
4. "There are a few log4j.properties files that would have to be converted to the log4j2 format." This is actually one of the biggest barriers to switching to log4j 2 in production: all the log4j.properties files *everywhere* will need to move, and all cluster management tools which work with those files are going to need rework and re-releases, unless/until log4j has that promised support for the old files.
> Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build.
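To make the configuration-conversion point concrete, here is a hedged side-by-side sketch of a minimal console-logging setup in both formats. The key names for the log4j 2.x half follow the log4j2 properties syntax as I understand it and are illustrative, not a drop-in Spark configuration:

```properties
# log4j 1.x style (log4j.properties)
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# log4j 2.x style (log4j2.properties) -- same intent, entirely different key names
rootLogger.level = info
rootLogger.appenderRef.stdout.ref = console
appender.console.type = Console
appender.console.name = console
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

Since every key changes, nothing short of a rewrite (or an automatic converter) gets an existing fleet of log4j.properties files across.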
[jira] [Commented] (SPARK-25180) Spark standalone failure in Utils.doFetchFile() if nslookup of local hostname fails
[ https://issues.apache.org/jira/browse/SPARK-25180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588035#comment-16588035 ] Steve Loughran commented on SPARK-25180: FWIW, there was no in-progress data at the dest store, so either task commit process failed before the actual output committer's commitTask() method was called, or the abort operation cleaned up before that notification message failed to send. Either way: no in-flight data to clean up. > Spark standalone failure in Utils.doFetchFile() if nslookup of local hostname > fails > --- > > Key: SPARK-25180 > URL: https://issues.apache.org/jira/browse/SPARK-25180 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 > Environment: mac laptop running on a corporate guest wifi, presumably > a wifi with odd DNS settings. >Reporter: Steve Loughran >Priority: Minor > > trying to save work on spark standalone can fail if netty RPC cannot > determine the hostname. While that's a valid failure on a real cluster, in > standalone falling back to localhost rather than inferred "hw13176.lan" value > may be the better option. > note also, the abort* call failed; NPE. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25162) Kubernetes 'in-cluster' client mode and value of spark.driver.host
[ https://issues.apache.org/jira/browse/SPARK-25162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588014#comment-16588014 ] Yinan Li commented on SPARK-25162: -- We actually moved away from using the IP address of the driver pod to set {{spark.driver.host}}, to using a headless service to give the driver pod a FQDN name and set {{spark.driver.host}} to the FQDN name. Internally, we set {{spark.driver.bindAddress}} to the value of environment variable {{SPARK_DRIVER_BIND_ADDRESS}} which gets its value from the IP address of the pod using the downward API. We could do the same for {{spark.kubernetes.driver.pod.name}} as you suggested for in-cluster client mode. > Kubernetes 'in-cluster' client mode and value of spark.driver.host > -- > > Key: SPARK-25162 > URL: https://issues.apache.org/jira/browse/SPARK-25162 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 > Environment: A java program, deployed to kubernetes, that establishes > a Spark Context in client mode. > Not using spark-submit. > Kubernetes 1.10 > AWS EKS > > >Reporter: James Carter >Priority: Minor > > When creating Kubernetes scheduler 'in-cluster' using client mode, the value > for spark.driver.host can be derived from the IP address of the driver pod. > I observed that the value of _spark.driver.host_ defaulted to the value of > _spark.kubernetes.driver.pod.name_, which is not a valid hostname. This > caused the executors to fail to establish a connection back to the driver. > As a work around, in my configuration I pass the driver's pod name _and_ the > driver's ip address to ensure that executors can establish a connection with > the driver. > _spark.kubernetes.driver.pod.name_ := env.valueFrom.fieldRef.fieldPath: > metadata.name > _spark.driver.host_ := env.valueFrom.fieldRef.fieldPath: status.podIp > e.g. 
> Deployment: > {noformat} > env: > - name: DRIVER_POD_NAME > valueFrom: > fieldRef: > fieldPath: metadata.name > - name: DRIVER_POD_IP > valueFrom: > fieldRef: > fieldPath: status.podIP > {noformat} > > Application Properties: > {noformat} > config[spark.kubernetes.driver.pod.name]: ${DRIVER_POD_NAME} > config[spark.driver.host]: ${DRIVER_POD_IP} > {noformat} > > BasicExecutorFeatureStep.scala: > {code:java} > private val driverUrl = RpcEndpointAddress( > kubernetesConf.get("spark.driver.host"), > kubernetesConf.sparkConf.getInt("spark.driver.port", DEFAULT_DRIVER_PORT), > CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString > {code} > > Ideally only _spark.kubernetes.driver.pod.name_ would need be provided in > this deployment scenario. > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
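The headless-service approach Yinan Li describes can be sketched as a Service with no cluster IP selecting the driver pod. All names, labels, and the port below are illustrative assumptions, not Spark's actual generated resources:

```yaml
# Hypothetical headless Service giving the driver pod a stable DNS name.
apiVersion: v1
kind: Service
metadata:
  name: spark-driver-svc       # illustrative name
spec:
  clusterIP: None              # headless: DNS resolves directly to the pod IP
  selector:
    spark-role: driver         # must match the driver pod's labels
  ports:
    - name: driver-rpc
      port: 7078               # whatever spark.driver.port is set to
```

With such a service, spark.driver.host can be set to the service FQDN (e.g. spark-driver-svc.default.svc.cluster.local) while spark.driver.bindAddress stays the pod IP obtained via the downward API.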
[jira] [Commented] (SPARK-25180) Spark standalone failure in Utils.doFetchFile() if nslookup of local hostname fails
[ https://issues.apache.org/jira/browse/SPARK-25180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588012#comment-16588012 ] Steve Loughran commented on SPARK-25180: Netty converts the UnknownHostException into an IOE in the process of adding the hostname info, so it'd be hard for a caller to catch the exception and react meaningfully to it. > Spark standalone failure in Utils.doFetchFile() if nslookup of local hostname > fails > --- > > Key: SPARK-25180 > URL: https://issues.apache.org/jira/browse/SPARK-25180 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 > Environment: mac laptop running on a corporate guest wifi, presumably > a wifi with odd DNS settings. >Reporter: Steve Loughran >Priority: Minor > > trying to save work on spark standalone can fail if netty RPC cannot > determine the hostname. While that's a valid failure on a real cluster, in > standalone falling back to localhost rather than inferred "hw13176.lan" value > may be the better option. > note also, the abort* call failed; NPE. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
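The problem described above can be shown with plain Java (a minimal sketch, not Netty's actual code path; the class and host names are made up): once an UnknownHostException is rewrapped in a generic IOException, a caller can no longer catch the specific type and must inspect the cause chain instead.

```java
import java.io.IOException;
import java.net.UnknownHostException;

public class WrappedLookupFailure {

    // Simulates a layer that rewraps the lookup failure to add context,
    // hiding the specific exception type from the catch site.
    static void resolve(String host) throws IOException {
        try {
            throw new UnknownHostException(host); // simulated failed nslookup
        } catch (UnknownHostException e) {
            throw new IOException("Failed to connect to " + host, e);
        }
    }

    public static void main(String[] args) {
        try {
            resolve("hw13176.lan");
        } catch (UnknownHostException e) {
            // never reached: the specific type is hidden by the wrapper
            System.out.println("caught UnknownHostException directly");
        } catch (IOException e) {
            // the only recovery is walking the cause chain
            System.out.println(e.getCause() instanceof UnknownHostException);
        }
    }
}
```

This is why reacting meaningfully (e.g. falling back to localhost in standalone mode) would have to happen by inspecting `getCause()` rather than by exception type.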
[jira] [Updated] (SPARK-25149) Personalized Page Rank raises an error if vertexIDs are > MaxInt
[ https://issues.apache.org/jira/browse/SPARK-25149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-25149: -- Shepherd: Joseph K. Bradley > Personalized Page Rank raises an error if vertexIDs are > MaxInt > > > Key: SPARK-25149 > URL: https://issues.apache.org/jira/browse/SPARK-25149 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.3.1 >Reporter: Bago Amirbekian >Priority: Major > > Looking at the implementation, I think we don't actually need this check. The > sparse vectors used and returned by the method are indexed by the _position_ > of the vertexId in `sources`, not by the vertex ID itself. We should remove > this check and add a test to confirm the implementation works with large vertex IDs.
[jira] [Commented] (SPARK-25180) Spark standalone failure in Utils.doFetchFile() if nslookup of local hostname fails
[ https://issues.apache.org/jira/browse/SPARK-25180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588005#comment-16588005 ] Steve Loughran commented on SPARK-25180: Stack
{code}
scala> text("hello all!")
res10: String = " _ _ _ _ _ _ | |__ ___ | | | | ___ __ _ | | | | | | | '_ \ / _ \ | | | | / _ \ / _` | | | | | | | | | | | | __/ | | | | | (_) | | (_| | | | | | |_| |_| |_| \___| |_| |_| \___/ \__,_| |_| |_| (_) "

scala> val landsatCsvGZOnS3 = path("s3a://landsat-pds/scene_list.gz")
landsatCsvGZOnS3: org.apache.hadoop.fs.Path = s3a://landsat-pds/scene_list.gz

scala> val landsatCsvGZ = path("file:///tmp/scene_list.gz")
landsatCsvGZ: org.apache.hadoop.fs.Path = file:/tmp/scene_list.gz

scala> copyFile(landsatCsvGZOnS3, landsatCsvGZ, conf, false)
18/08/21 11:00:56 INFO FluentPropertyBeanIntrospector: Error when creating PropertyDescriptor for public final void org.apache.commons.configuration2.AbstractConfiguration.setProperty(java.lang.String,java.lang.Object)! Ignoring this property.
18/08/21 11:00:56 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
18/08/21 11:00:56 INFO MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
18/08/21 11:00:56 INFO MetricsSystemImpl: s3a-file-system metrics system started
18/08/21 11:00:59 INFO deprecation: fs.s3a.server-side-encryption-key is deprecated. Instead, use fs.s3a.server-side-encryption.key
18/08/21 11:00:59 INFO ObjectStoreOperations: Copying s3a://landsat-pds/scene_list.gz to file:/tmp/scene_list.gz (44534 KB)
18/08/21 11:04:36 INFO ObjectStoreOperations: Copy Duration = 216.47322015 seconds
18/08/21 11:04:36 INFO ObjectStoreOperations: Effective copy bandwidth = 205.72521612207376 KiB/s
res11: Boolean = true

scala> val csvSchema = LandsatIO.buildCsvSchema()
csvSchema: org.apache.spark.sql.types.StructType = StructType(StructField(entityId,StringType,true), StructField(acquisitionDate,StringType,true), StructField(cloudCover,StringType,true), StructField(processingLevel,StringType,true), StructField(path,StringType,true), StructField(row,StringType,true), StructField(min_lat,StringType,true), StructField(min_lon,StringType,true), StructField(max_lat,StringType,true), StructField(max_lon,StringType,true), StructField(download_url,StringType,true))

scala> val csvDataFrame = LandsatIO.addLandsatColumns(
     | spark.read.options(LandsatIO.CsvOptions).
     | schema(csvSchema).csv(landsatCsvGZ.toUri.toString))
18/08/21 11:34:24 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/Users/stevel/Projects/sparkwork/spark/spark-warehouse').
18/08/21 11:34:24 INFO SharedState: Warehouse path is 'file:/Users/stevel/Projects/sparkwork/spark/spark-warehouse'.
18/08/21 11:34:24 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
csvDataFrame: org.apache.spark.sql.DataFrame = [id: string, acquisitionDate: timestamp ... 12 more fields]

scala> val filteredCSV = csvDataFrame.
     | sample(false, 0.01d).
     | filter("cloudCover < 0").cache()
18/08/21 11:34:44 INFO FileSourceStrategy: Pruning directories with:
18/08/21 11:34:44 INFO FileSourceStrategy: Post-Scan Filters:
18/08/21 11:34:44 INFO FileSourceStrategy: Output Data Schema: struct
18/08/21 11:34:44 INFO FileSourceScanExec: Pushed Filters:
18/08/21 11:34:45 INFO CodeGenerator: Code generated in 239.359919 ms
18/08/21 11:34:45 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 347.6 KB, free 366.0 MB)
18/08/21 11:34:45 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 29.3 KB, free 365.9 MB)
18/08/21 11:34:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on hw13176.lan:52731 (size: 29.3 KB, free: 366.3 MB)
18/08/21 11:34:45 INFO SparkContext: Created broadcast 0 from cache at :46
18/08/21 11:34:45 INFO FileSourceScanExec: Planning scan with bin packing, max size: 6224701 bytes, open cost is considered as scanning 4194304 bytes.
filteredCSV: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: string, acquisitionDate: timestamp ... 12 more fields]

scala> ireland
:43: error: not found: value ireland
       ireland
       ^

scala> val ireland = new Path("s3a://hwdev-steve-ireland-new/")
ireland: org.apache.hadoop.fs.Path = s3a://hwdev-steve-ireland-new/

scala> val orcData = new Path(ireland, "orcData")
orcData: org.apache.hadoop.fs.Path = s3a://hwdev-steve-ireland-new/orcData

scala> logDuration("write to ORC S3A") {
     | filteredCSV.write.
     | partitionBy("year", "month").
     | mode(SaveMode.Append).format("orc").save(orcData.toString)
     | }
18/08/21 11:35:44 INFO ObjectStoreOperations: Starting write to ORC S3A
18/08/21 11:35:50 INFO
[jira] [Commented] (SPARK-25180) Spark standalone failure in Utils.doFetchFile() if nslookup of local hostname fails
[ https://issues.apache.org/jira/browse/SPARK-25180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16588001#comment-16588001 ] Steve Loughran commented on SPARK-25180: code snippet was some trivial CSV => ORC with both src and dest on s3a URLs {code} filteredCSV.write. partitionBy("year", "month"). mode(SaveMode.Append).format("orc").save(orcData.toString) {code} > Spark standalone failure in Utils.doFetchFile() if nslookup of local hostname > fails > --- > > Key: SPARK-25180 > URL: https://issues.apache.org/jira/browse/SPARK-25180 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 > Environment: mac laptop running on a corporate guest wifi, presumably > a wifi with odd DNS settings. >Reporter: Steve Loughran >Priority: Minor > > trying to save work on spark standalone can fail if netty RPC cannot > determine the hostname. While that's a valid failure on a real cluster, in > standalone falling back to localhost rather than inferred "hw13176.lan" value > may be the better option. > note also, the abort* call failed; NPE. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25180) Spark standalone failure in Utils.doFetchFile() if nslookup of local hostname fails
Steve Loughran created SPARK-25180: -- Summary: Spark standalone failure in Utils.doFetchFile() if nslookup of local hostname fails Key: SPARK-25180 URL: https://issues.apache.org/jira/browse/SPARK-25180 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Environment: mac laptop running on a corporate guest wifi, presumably a wifi with odd DNS settings. Reporter: Steve Loughran trying to save work on spark standalone can fail if netty RPC cannot determine the hostname. While that's a valid failure on a real cluster, in standalone falling back to localhost rather than inferred "hw13176.lan" value may be the better option. note also, the abort* call failed; NPE. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587962#comment-16587962 ] Sean Owen commented on SPARK-6305: -- It was a mess. Excluding dependencies is an ongoing issue because other dependencies change their dependencies. However, I think tools like the enforcer plugin can help detect this. I think the code has to use some specific logging framework to do its work, but it can be shaded to keep it out of the user classpath. This could be a good goal for Spark 3.0. However, I am also not sure how much trouble the existing log4j 1.x causes users. Yes, it does entail updating users' and tests' log configuration, which is why it might be something for 3.0. But sure, if there's no real downside and some upsides to doing it, I'm not against it. > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build.
[jira] [Commented] (SPARK-25168) PlanTest.comparePlans may make a supplied resolved plan unresolved.
[ https://issues.apache.org/jira/browse/SPARK-25168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587956#comment-16587956 ] Dilip Biswal commented on SPARK-25168: -- [~cloud_fan] OK.. Let me close this then, since we are unable to do anything about this at present and nothing is seriously impacted. Perhaps we can consider making normalizePlan private so callers are not handed out unresolved plans? > PlanTest.comparePlans may make a supplied resolved plan unresolved. > --- > > Key: SPARK-25168 > URL: https://issues.apache.org/jira/browse/SPARK-25168 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Dilip Biswal >Priority: Minor > > Hello,
> I came across this behaviour while writing a unit test for one of my PRs. We
> use the utility class PlanTest for writing plan verification tests.
> Currently, when we call comparePlans and the resolved input plan contains a
> Join operator, the plan becomes unresolved after the call to
> `normalizePlan(normalizeExprIds(plan1))`. The reason for this is that, after
> cleaning the expression ids, duplicateResolved returns false, making the Join
> operator unresolved.
> {code}
> def duplicateResolved: Boolean =
>   left.outputSet.intersect(right.outputSet).isEmpty
>
> // Joins are only resolved if they don't introduce ambiguous expression ids.
> // NaturalJoin should be ready for resolution only if everything else is resolved here
> lazy val resolvedExceptNatural: Boolean = {
>   childrenResolved &&
>     expressions.forall(_.resolved) &&
>     duplicateResolved &&
>     condition.forall(_.dataType == BooleanType)
> }
> {code}
> Please note that plan verification actually works fine. It just looked
> awkward to compare two unresolved plans for equality.
> I am opening this ticket to discuss whether this is acceptable behaviour. If it
> is a tolerable behaviour, then we can close the ticket.
[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public
[ https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587928#comment-16587928 ] Alexander commented on SPARK-7768: -- It's been a while since this had any activity. What is the difficulty level of getting UDTRegistration up to a level where it can become public? I noticed that you need to use Spark-internal types in org.apache.spark.unsafe, like UTF8String instead of String, etc., but this is a minor inconvenience compared to the annoyance of not being able to use even things like custom enum types (think [Enumeratum|https://github.com/lloydmeta/enumeratum], for instance) inside of records. This is probably the most misunderstood part of Spark in general. For instance, see vaquarkhan's [gigantic post|https://github.com/vaquarkhan/vk-wiki-notes/wiki/Apache-Spark-custom-Encoder-example] about wrangling with this issue. Other people have written custom encoders to solve particular use cases, e.g. [here|https://typelevel.org/frameless/Injection.html] and [here|https://github.com/gennady-lebedev/spark-enum-encoder]. There is a general cacophony of frustration and cluelessness, because you have to dig into ExpressionEncoder in order to understand why it's happening. Come on guys! Spark is probably the most awesome analytics engine in the world. Why can't we solve this problem together? > Make user-defined type (UDT) API public > --- > > Key: SPARK-7768 > URL: https://issues.apache.org/jira/browse/SPARK-7768 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Xiangrui Meng >Priority: Critical > > As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it > would be nice to make the UDT API public in 1.5.
[jira] [Commented] (SPARK-25178) Use dummy name for xxxHashMapGenerator key/value schema field
[ https://issues.apache.org/jira/browse/SPARK-25178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587897#comment-16587897 ] Kazuaki Ishizaki commented on SPARK-25178: -- [~rednaxelafx] Thank you for opening a JIRA entry :) [~smilegator] I can take this. > Use dummy name for xxxHashMapGenerator key/value schema field > - > > Key: SPARK-25178 > URL: https://issues.apache.org/jira/browse/SPARK-25178 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kris Mok >Priority: Minor > > Following SPARK-18952 and SPARK-22273, this ticket proposes to change the > generated field name of the keySchema / valueSchema to a dummy name instead > of using {{key.name}}. > In previous discussion from SPARK-18952's PR [1], it was already suggested > that the field names were being used, so it's not worth capturing the strings > as reference objects here. Josh suggested merging the original fix as-is due > to backportability / pickability concerns. Now that we're coming up to a new > release, this can be revisited. > [1]: https://github.com/apache/spark/pull/16361#issuecomment-270253719 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25179) Document the features that require Pyarrow 0.10
[ https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-25179: --- Assignee: Bryan Cutler > Document the features that require Pyarrow 0.10 > --- > > Key: SPARK-25179 > URL: https://issues.apache.org/jira/browse/SPARK-25179 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 2.4.0 > Environment: Document the features that require Pyarrow 0.10 . For > example, https://github.com/apache/spark/pull/20725 >Reporter: Xiao Li >Assignee: Bryan Cutler >Priority: Major > > binary type support requires pyarrow 0.10.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25179) Document the features that require Pyarrow 0.10
[ https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25179: Description: binary type support requires pyarrow 0.10.0. (was: binary type support requires pyarrow 0.10.0) > Document the features that require Pyarrow 0.10 > --- > > Key: SPARK-25179 > URL: https://issues.apache.org/jira/browse/SPARK-25179 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 2.4.0 > Environment: Document the features that require Pyarrow 0.10 . For > example, https://github.com/apache/spark/pull/20725 >Reporter: Xiao Li >Priority: Major > > binary type support requires pyarrow 0.10.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25179) Document the features that require Pyarrow 0.10
[ https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587884#comment-16587884 ] Xiao Li commented on SPARK-25179: - Thanks! It is not urgent as long as it is documented before we publish the doc for spark 2.4 > Document the features that require Pyarrow 0.10 > --- > > Key: SPARK-25179 > URL: https://issues.apache.org/jira/browse/SPARK-25179 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 2.4.0 > Environment: Document the features that require Pyarrow 0.10 . For > example, https://github.com/apache/spark/pull/20725 >Reporter: Xiao Li >Priority: Major > > binary type support requires pyarrow 0.10.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25179) Document the features that require Pyarrow 0.10
[ https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587882#comment-16587882 ] Bryan Cutler commented on SPARK-25179: -- I can work on this, probably can't get to it right away tho > Document the features that require Pyarrow 0.10 > --- > > Key: SPARK-25179 > URL: https://issues.apache.org/jira/browse/SPARK-25179 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 2.4.0 > Environment: Document the features that require Pyarrow 0.10 . For > example, https://github.com/apache/spark/pull/20725 >Reporter: Xiao Li >Priority: Major > > binary type support requires pyarrow 0.10.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25179) Document the features that require Pyarrow 0.10
[ https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-25179: - Description: binary type support requires pyarrow 0.10.0 > Document the features that require Pyarrow 0.10 > --- > > Key: SPARK-25179 > URL: https://issues.apache.org/jira/browse/SPARK-25179 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 2.4.0 > Environment: Document the features that require Pyarrow 0.10 . For > example, https://github.com/apache/spark/pull/20725 >Reporter: Xiao Li >Priority: Major > > binary type support requires pyarrow 0.10.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22779) ConfigEntry's default value should actually be a value
[ https://issues.apache.org/jira/browse/SPARK-22779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587869#comment-16587869 ] Apache Spark commented on SPARK-22779: -- User 'GregOwen' has created a pull request for this issue: https://github.com/apache/spark/pull/22174 > ConfigEntry's default value should actually be a value > -- > > Key: SPARK-22779 > URL: https://issues.apache.org/jira/browse/SPARK-22779 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Reynold Xin >Assignee: Marcelo Vanzin >Priority: Major > Fix For: 2.3.0 > > > ConfigEntry's config value right now shows a human readable message. In some > places in SQL we actually rely on default value for real to be setting the > values. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6236) Support caching blocks larger than 2G
[ https://issues.apache.org/jira/browse/SPARK-6236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-6236. - Resolution: Fixed Fix Version/s: 2.4.0 As mentioned above, this was covered elsewhere, parts of this even before spark 2.4. See parent SPARK-6235 for more details. > Support caching blocks larger than 2G > - > > Key: SPARK-6236 > URL: https://issues.apache.org/jira/browse/SPARK-6236 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Reporter: Reynold Xin >Priority: Major > Fix For: 2.4.0 > > > Due to the use java.nio.ByteBuffer, BlockManager does not support blocks > larger than 2G. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24961) sort operation causes out of memory
[ https://issues.apache.org/jira/browse/SPARK-24961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587850#comment-16587850 ] Markus Breuer commented on SPARK-24961: --- Some notes on your last comment: Reducing the number of partitions is not an option. When the number of partitions is reduced, the size per partition grows. A partition is a unit of work per task, so smaller partitions reduce a task's memory consumption. Spark has a default of 200 partitions, and it is a good idea to increase the number of partitions for very large data sets. Is this correct? Any idea why the proposal to modify {{spark.sql.execution.rangeExchange.sampleSizePerPartition}} has no effect? > sort operation causes out of memory > > > Key: SPARK-24961 > URL: https://issues.apache.org/jira/browse/SPARK-24961 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.1 > Environment: Java 1.8u144+ > Windows 10 > Spark 2.3.1 in local mode > -Xms4g -Xmx4g > optional: -XX:+UseParallelOldGC >Reporter: Markus Breuer >Priority: Major > > A sort operation on a large rdd - which does not fit in memory - causes an out of > memory exception. I made the effect reproducible with a sample; the sample > creates large objects of about 2 MB each. The OOM occurs when saving the result. I > tried several StorageLevels, but whenever memory is included (MEMORY_AND_DISK, > MEMORY_AND_DISK_SER, none) the application runs out of memory. Only DISK_ONLY > seems to work. > When replacing sort() with sortWithinPartitions(), no StorageLevel is required > and the application succeeds.
> {code:java} > package de.bytefusion.examples; > import breeze.storage.Storage; > import de.bytefusion.utils.Options; > import org.apache.hadoop.io.MapFile; > import org.apache.hadoop.io.SequenceFile; > import org.apache.hadoop.io.Text; > import org.apache.hadoop.mapred.SequenceFileOutputFormat; > import org.apache.spark.api.java.JavaRDD; > import org.apache.spark.api.java.JavaSparkContext; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.RowFactory; > import org.apache.spark.sql.SparkSession; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.StructType; > import org.apache.spark.storage.StorageLevel; > import scala.Tuple2; > import static org.apache.spark.sql.functions.*; > import java.util.ArrayList; > import java.util.List; > import java.util.UUID; > import java.util.stream.Collectors; > import java.util.stream.IntStream; > public class Example3 { > public static void main(String... args) { > // create spark session > SparkSession spark = SparkSession.builder() > .appName("example1") > .master("local[4]") > .config("spark.driver.maxResultSize","1g") > .config("spark.driver.memory","512m") > .config("spark.executor.memory","512m") > .config("spark.local.dir","d:/temp/spark-tmp") > .getOrCreate(); > JavaSparkContext sc = > JavaSparkContext.fromSparkContext(spark.sparkContext()); > // base to generate huge data > List<Integer> list = new ArrayList<>(); > for (int val = 1; val < 1; val++) { > list.add(Integer.valueOf(val)); > } > // create simple rdd of int > JavaRDD<Integer> rdd = sc.parallelize(list,200); > // use map to create large object per row > JavaRDD<Row> rowRDD = > rdd > .map(value -> > RowFactory.create(String.valueOf(value), > createLongText(UUID.randomUUID().toString(), 2 * 1024 * 1024))) > // no persist => out of memory exception on write() > // persist MEMORY_AND_DISK => out of memory exception > on write() > // persist MEMORY_AND_DISK_SER => out of memory > 
exception on write() > // persist(StorageLevel.DISK_ONLY()) > ; > StructType type = new StructType(); > type = type > .add("c1", DataTypes.StringType) > .add( "c2", DataTypes.StringType ); > Dataset<Row> df = spark.createDataFrame(rowRDD, type); > // works > df.show(); > df = df > .sort(col("c1").asc() ) > ; > df.explain(); > // takes a lot of time but works > df.show(); > // OutOfMemoryError: java heap space > df > .write() > .mode("overwrite") > .csv("d:/temp/my.csv"); > // OutOfMemoryError: java heap space > df > .toJavaRDD() > .mapToPair(row -> new Tuple2(new Text(row.getString(0)), new > Text(
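The partition-count discussion in the comment above can be made concrete with a small, Spark-free sizing sketch; the 128 MB per-partition target below is an illustrative assumption, not a value taken from this thread:

```java
// Back-of-the-envelope partition sizing: choose enough partitions that
// each task handles at most `targetPartitionBytes` of input, since
// smaller partitions keep per-task memory consumption down. The 128 MB
// target is an assumed example value, not a Spark default.
public class PartitionSizing {
    static int suggestedPartitions(long totalBytes, long targetPartitionBytes) {
        // Round up so no partition exceeds the target size.
        return (int) Math.max(1L, (totalBytes + targetPartitionBytes - 1) / targetPartitionBytes);
    }

    public static void main(String[] args) {
        long total = 100L * 1024 * 1024 * 1024; // e.g. 100 GB of input data
        long target = 128L * 1024 * 1024;       // ~128 MB per partition
        System.out.println(suggestedPartitions(total, target)); // prints 800
    }
}
```

A number derived this way could replace the hard-coded 200 in the sample's sc.parallelize(list, 200) call or feed a repartition() call.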
[jira] [Commented] (SPARK-25179) Document the features that require Pyarrow 0.10
[ https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587847#comment-16587847 ] Xiao Li commented on SPARK-25179: - cc [~bryanc] > Document the features that require Pyarrow 0.10 > --- > > Key: SPARK-25179 > URL: https://issues.apache.org/jira/browse/SPARK-25179 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 2.4.0 > Environment: Document the features that require Pyarrow 0.10 . For > example, https://github.com/apache/spark/pull/20725 >Reporter: Xiao Li >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25179) Document the features that require Pyarrow 0.10
[ https://issues.apache.org/jira/browse/SPARK-25179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-25179: Issue Type: Documentation (was: Improvement) > Document the features that require Pyarrow 0.10 > --- > > Key: SPARK-25179 > URL: https://issues.apache.org/jira/browse/SPARK-25179 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 2.4.0 > Environment: Document the features that require Pyarrow 0.10 . For > example, https://github.com/apache/spark/pull/20725 >Reporter: Xiao Li >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25179) Document the features that require Pyarrow 0.10
Xiao Li created SPARK-25179: --- Summary: Document the features that require Pyarrow 0.10 Key: SPARK-25179 URL: https://issues.apache.org/jira/browse/SPARK-25179 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.4.0 Environment: Document the features that require Pyarrow 0.10 . For example, https://github.com/apache/spark/pull/20725 Reporter: Xiao Li -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25178) Use dummy name for xxxHashMapGenerator key/value schema field
[ https://issues.apache.org/jira/browse/SPARK-25178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587841#comment-16587841 ] Xiao Li commented on SPARK-25178: - cc [~kiszk] Do you have the bandwidth to take this? > Use dummy name for xxxHashMapGenerator key/value schema field > - > > Key: SPARK-25178 > URL: https://issues.apache.org/jira/browse/SPARK-25178 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kris Mok >Priority: Minor > > Following SPARK-18952 and SPARK-22273, this ticket proposes to change the > generated field name of the keySchema / valueSchema to a dummy name instead > of using {{key.name}}. > In previous discussion from SPARK-18952's PR [1], it was already suggested > that the field names were being used, so it's not worth capturing the strings > as reference objects here. Josh suggested merging the original fix as-is due > to backportability / pickability concerns. Now that we're coming up to a new > release, this can be revisited. > [1]: https://github.com/apache/spark/pull/16361#issuecomment-270253719 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24296) Support replicating blocks larger than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-24296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-24296. Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21451 [https://github.com/apache/spark/pull/21451] > Support replicating blocks larger than 2 GB > --- > > Key: SPARK-24296 > URL: https://issues.apache.org/jira/browse/SPARK-24296 > Project: Spark > Issue Type: Sub-task > Components: Block Manager, Spark Core >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Major > Fix For: 2.4.0 > > > Replicating blocks sends the entire block data in one frame. This results in > a failure on the receiving end for blocks larger than 2GB. > We should change block replication to send the block data as a stream when > the block is large (building on the network changes in SPARK-6237). This can > use the conf spark.maxRemoteBlockSizeFetchToMem to decide when to replicate > as a stream, the same as we do for fetching shuffle blocks and fetching > remote RDD blocks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
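The stream-vs-frame decision described in the ticket above can be sketched in a few lines of plain Java; the 200 MB threshold and the exact comparison are illustrative assumptions, not values taken from Spark's implementation:

```java
// Sketch of the decision the ticket describes: blocks at or above a
// configured threshold (cf. spark.maxRemoteBlockSizeFetchToMem) are
// moved as a stream; smaller blocks fit in one in-memory frame.
// The 200 MB threshold below is an assumed example value.
public class ReplicationMode {
    static boolean replicateAsStream(long blockBytes, long maxInMemoryBytes) {
        return blockBytes >= maxInMemoryBytes;
    }

    public static void main(String[] args) {
        long threshold = 200L * 1024 * 1024; // assumed 200 MB threshold
        System.out.println(replicateAsStream(3L * 1024 * 1024 * 1024, threshold)); // prints true
        System.out.println(replicateAsStream(1024L, threshold));                   // prints false
    }
}
```

Reusing one threshold for replication, shuffle fetches, and remote RDD reads, as the ticket suggests, keeps the memory ceiling for any single remote block transfer in one place.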
[jira] [Assigned] (SPARK-24296) Support replicating blocks larger than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-24296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-24296: -- Assignee: Imran Rashid > Support replicating blocks larger than 2 GB > --- > > Key: SPARK-24296 > URL: https://issues.apache.org/jira/browse/SPARK-24296 > Project: Spark > Issue Type: Sub-task > Components: Block Manager, Spark Core >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Major > > Replicating blocks sends the entire block data in one frame. This results in > a failure on the receiving end for blocks larger than 2GB. > We should change block replication to send the block data as a stream when > the block is large (building on the network changes in SPARK-6237). This can > use the conf spark.maxRemoteBlockSizeFetchToMem to decide when to replicate > as a stream, the same as we do for fetching shuffle blocks and fetching > remote RDD blocks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25178) Use dummy name for xxxHashMapGenerator key/value schema field
[ https://issues.apache.org/jira/browse/SPARK-25178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587820#comment-16587820 ] Kris Mok commented on SPARK-25178: -- FYI: I've just come across this behavior and raised a ticket so that we don't forget about it, but I'm not working on this one (yet). > Use dummy name for xxxHashMapGenerator key/value schema field > - > > Key: SPARK-25178 > URL: https://issues.apache.org/jira/browse/SPARK-25178 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kris Mok >Priority: Minor > > Following SPARK-18952 and SPARK-22273, this ticket proposes to change the > generated field name of the keySchema / valueSchema to a dummy name instead > of using {{key.name}}. > In previous discussion from SPARK-18952's PR [1], it was already suggested > that the field names were being used, so it's not worth capturing the strings > as reference objects here. Josh suggested merging the original fix as-is due > to backportability / pickability concerns. Now that we're coming up to a new > release, this can be revisited. > [1]: https://github.com/apache/spark/pull/16361#issuecomment-270253719 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25178) Use dummy name for xxxHashMapGenerator key/value schema field
Kris Mok created SPARK-25178: Summary: Use dummy name for xxxHashMapGenerator key/value schema field Key: SPARK-25178 URL: https://issues.apache.org/jira/browse/SPARK-25178 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Kris Mok Following SPARK-18952 and SPARK-22273, this ticket proposes to change the generated field name of the keySchema / valueSchema to a dummy name instead of using {{key.name}}. In previous discussion from SPARK-18952's PR [1], it was already suggested that the field names were being used, so it's not worth capturing the strings as reference objects here. Josh suggested merging the original fix as-is due to backportability / pickability concerns. Now that we're coming up to a new release, this can be revisited. [1]: https://github.com/apache/spark/pull/16361#issuecomment-270253719 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587814#comment-16587814 ] Chris Martin commented on SPARK-6305: - [~srowen] I've taken a look at this and I'd like to sync up to see your perspective (given that you already looked at this). From what I can see: 1) As you say there's a load of pom manipulation to do in order to exclude log4j1 and related components from dependencies. 2) There are a bunch of unit tests that rely on hooking into log4j internals in order to capture log output and make assertions against them. 3) The org.apache.spark.internal.Logging trait has some fairly low level log4j logic. 4) There are a few log4j.properties files that would have to be converted to the log4j2 format. Of these 1) is a pain but it's fairly mechanical and I would hope we could write some sort of automated check to tell us if log4j1 is still being brought in. 2) is fairly easily solvable; I have some code to do this. 3) worries me as this class is doing some fairly hairy stuff and I'm not sure of the use cases - it would be good to have a chat about this. 4) is simple enough as far as Spark goes, but the fly in the ointment is that all existing user log configuration would need to be changed. Thoughts? Chris > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24335) Dataset.map schema not applied in some cases
[ https://issues.apache.org/jira/browse/SPARK-24335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587785#comment-16587785 ] Apache Spark commented on SPARK-24335: -- User 'redsanket' has created a pull request for this issue: https://github.com/apache/spark/pull/22173 > Dataset.map schema not applied in some cases > > > Key: SPARK-24335 > URL: https://issues.apache.org/jira/browse/SPARK-24335 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0, 2.3.0 >Reporter: Robert Reid >Priority: Major > > In the following code an {color:#808080}UnsupportedOperationException{color} > is thrown in the filter() call just after the Dateset.map() call unless > withWatermark() is added between them. The error reports > `{color:#808080}fieldIndex on a Row without schema is undefined{color}`. I > expect the map() method to have applied the schema and for it to be > accessible in filter(). Without the extra withWatermark() call my debugger > reports that the `row` objects in the filter lambda are `GenericRow`. With > the watermark call it reports that they are `GenericRowWithSchema`. > I should add that I'm new to working with Structured Streaming. So if I'm > overlooking some implied dependency please fill me in. > I'm encountering this in new code for a new production job. The presented > code is distilled down to demonstrate the problem. While the problem can be > worked around simply by adding withWatermark() I'm concerned that this will > leave the code in a fragile state. With this simplified code if this error > occurs again it will be easy to identify what change led to the error. But > in the code I'm writing, with this functionality delegated to other classes, > it is (and has been) very challenging to identify the cause. 
> > {code:java} > public static void main(String[] args) { > SparkSession sparkSession = > SparkSession.builder().master("local").getOrCreate(); > sparkSession.conf().set( > "spark.sql.streaming.checkpointLocation", > "hdfs://localhost:9000/search_relevance/checkpoint" // for spark > 2.3 > // "spark.sql.streaming.checkpointLocation", "tmp/checkpoint" // > for spark 2.1 > ); > StructType inSchema = DataTypes.createStructType( > new StructField[] { > DataTypes.createStructField("id", DataTypes.StringType > , false), > DataTypes.createStructField("ts", DataTypes.TimestampType > , false), > DataTypes.createStructField("f1", DataTypes.LongType > , true) > } > ); > Dataset rawSet = sparkSession.sqlContext().readStream() > .format("rate") > .option("rowsPerSecond", 1) > .load() > .map( (MapFunction) raw -> { > Object[] fields = new Object[3]; > fields[0] = "id1"; > fields[1] = raw.getAs("timestamp"); > fields[2] = raw.getAs("value"); > return RowFactory.create(fields); > }, > RowEncoder.apply(inSchema) > ) > // If withWatermark() is included above the filter() line then > this works. Without it we get: > //Caused by: java.lang.UnsupportedOperationException: > fieldIndex on a Row without schema is undefined. > // at the row.getAs() call. > // .withWatermark("ts", "10 seconds") // <-- This is required > for row.getAs("f1") to work ??? > .filter((FilterFunction) row -> !row.getAs("f1").equals(0L)) > .withWatermark("ts", "10 seconds") > ; > StreamingQuery streamingQuery = rawSet > .select("*") > .writeStream() > .format("console") > .outputMode("append") > .start(); > try { > streamingQuery.awaitTermination(30_000); > } catch (StreamingQueryException e) { > System.out.println("Caught exception at 'awaitTermination':"); > e.printStackTrace(); > } > }{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0
[ https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587770#comment-16587770 ] Bryan Cutler commented on SPARK-23874: -- Yes, I still would recommend users upgrade pyarrow if possible. There were a large number of fixes and improvements made since 0.8.0. > Upgrade apache/arrow to 0.10.0 > -- > > Key: SPARK-23874 > URL: https://issues.apache.org/jira/browse/SPARK-23874 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Bryan Cutler >Priority: Major > Fix For: 2.4.0 > > > Version 0.10.0 will allow for the following improvements and bug fixes: > * Allow for adding BinaryType support ARROW-2141 > * Bug fix related to array serialization ARROW-1973 > * Python2 str will be made into an Arrow string instead of bytes ARROW-2101 > * Python bytearrays are supported as input to pyarrow ARROW-2141 > * Java has common interface for reset to cleanup complex vectors in Spark > ArrowWriter ARROW-1962 > * Cleanup pyarrow type equality checks ARROW-2423 > * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, > ARROW-2645 > * Improved low level handling of messages for RecordBatch ARROW-2704 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25173) fail to pass kerberos authentification on executers
[ https://issues.apache.org/jira/browse/SPARK-25173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-25173. Resolution: Duplicate > fail to pass kerberos authentification on executers > --- > > Key: SPARK-25173 > URL: https://issues.apache.org/jira/browse/SPARK-25173 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.6.0 >Reporter: sun zhiwei >Priority: Major > > Hi All, > for some reasons,I must use jdbc to connect hive server2 on > executors,but when i test my program,it get the following log: > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212) > at > org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94) > at > org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271) > at > org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37) > at > org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52) > at > org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920) > at > org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49) > at > org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:203) > at org.apache.hive.jdbc.HiveConnection.(HiveConnection.java:168) > at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) > at java.sql.DriverManager.getConnection(DriverManager.java:571) > at java.sql.DriverManager.getConnection(DriverManager.java:187) > at > 
com.hypers.bigdata.util.SparkHandler$.handleTable2(SparkHandler.scala:204) > at > com.hypers.bigdata.spark.AutoSpark$$anonfun$setupSsc$1$$anonfun$apply$2.apply(AutoSpark.scala:196) > at > com.hypers.bigdata.spark.AutoSpark$$anonfun$setupSsc$1$$anonfun$apply$2.apply(AutoSpark.scala:107) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at > com.hypers.bigdata.spark.AutoSpark$$anonfun$setupSsc$1.apply(AutoSpark.scala:107) > at > com.hypers.bigdata.spark.AutoSpark$$anonfun$setupSsc$1.apply(AutoSpark.scala:106) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1888) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1888) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos tgt) > at > sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147) > at > sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:121) > at > sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187) > at > sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:223) > at > sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212) > at > 
sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179) > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:193) > ... 31 more > > I think it might be because the executors cannot pass kerberos > authentication, so here's my question: > how can I make the executors pass kerberos authentication? > Any suggestions would be appreciated, thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands,
[jira] [Resolved] (SPARK-25172) kererbos issue when use jdbc to connect hive server2 on executors
[ https://issues.apache.org/jira/browse/SPARK-25172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-25172. Resolution: Invalid Please use the mailing list for questions: http://spark.apache.org/community.html > kererbos issue when use jdbc to connect hive server2 on executors > - > > Key: SPARK-25172 > URL: https://issues.apache.org/jira/browse/SPARK-25172 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.6.0 >Reporter: sun zhiwei >Priority: Major > > Hi All, > for some reasons,I must use jdbc to connect hive server2 on > executors,but when I test my program,it get the following logs: > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212) > at > org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94) > at > org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271) > at > org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37) > at > org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52) > at > org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920) > at > org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49) > at > org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:203) > at org.apache.hive.jdbc.HiveConnection.(HiveConnection.java:168) > at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) > at 
java.sql.DriverManager.getConnection(DriverManager.java:571) > at java.sql.DriverManager.getConnection(DriverManager.java:187) > at > com.hypers.bigdata.util.SparkHandler$.handleTable2(SparkHandler.scala:204) > at > com.hypers.bigdata.spark.AutoSpark$$anonfun$setupSsc$1$$anonfun$apply$2.apply(AutoSpark.scala:196) > at > com.hypers.bigdata.spark.AutoSpark$$anonfun$setupSsc$1$$anonfun$apply$2.apply(AutoSpark.scala:107) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at > com.hypers.bigdata.spark.AutoSpark$$anonfun$setupSsc$1.apply(AutoSpark.scala:107) > at > com.hypers.bigdata.spark.AutoSpark$$anonfun$setupSsc$1.apply(AutoSpark.scala:106) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1888) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1888) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos tgt) > at > sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147) > at > sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:121) > at > sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187) > at > 
sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:223) > at > sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212) > at > sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179) > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:193) > ... 31 more > > I think it might be because the executors cannot pass kerberos > authentication, so here's my question: > how can I make the executors pass kerberos authentication? > Any suggestions would be appreciated, thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0
[ https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587702#comment-16587702 ] Xiao Li commented on SPARK-23874: - Great! Then, users can keep using pyarrow 0.8 and get the fixes we need in Spark; otherwise, they have to upgrade numpy. > Upgrade apache/arrow to 0.10.0 > -- > > Key: SPARK-23874 > URL: https://issues.apache.org/jira/browse/SPARK-23874 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Bryan Cutler >Priority: Major > Fix For: 2.4.0 > > > Version 0.10.0 will allow for the following improvements and bug fixes: > * Allow for adding BinaryType support ARROW-2141 > * Bug fix related to array serialization ARROW-1973 > * Python2 str will be made into an Arrow string instead of bytes ARROW-2101 > * Python bytearrays are supported as input to pyarrow ARROW-2141 > * Java has common interface for reset to cleanup complex vectors in Spark > ArrowWriter ARROW-1962 > * Cleanup pyarrow type equality checks ARROW-2423 > * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, > ARROW-2645 > * Improved low level handling of messages for RecordBatch ARROW-2704 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25161) Fix several bugs in failure handling of barrier execution mode
[ https://issues.apache.org/jira/browse/SPARK-25161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-25161: - Assignee: Jiang Xingbo > Fix several bugs in failure handling of barrier execution mode > -- > > Key: SPARK-25161 > URL: https://issues.apache.org/jira/browse/SPARK-25161 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo >Priority: Major > Fix For: 2.4.0 > > > Fix several bugs in failure handling of barrier execution mode: > * Mark TaskSet for a barrier stage as zombie when a task attempt fails; > * Multiple barrier task failures from a single barrier stage should not > trigger multiple stage retries; > * Barrier task failure from a previous failed stage attempt should not > trigger stage retry; > * Fail the job when a task from a barrier ResultStage failed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25161) Fix several bugs in failure handling of barrier execution mode
[ https://issues.apache.org/jira/browse/SPARK-25161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-25161. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22158 [https://github.com/apache/spark/pull/22158] > Fix several bugs in failure handling of barrier execution mode > -- > > Key: SPARK-25161 > URL: https://issues.apache.org/jira/browse/SPARK-25161 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo >Priority: Major > Fix For: 2.4.0 > > > Fix several bugs in failure handling of barrier execution mode: > * Mark TaskSet for a barrier stage as zombie when a task attempt fails; > * Multiple barrier task failures from a single barrier stage should not > trigger multiple stage retries; > * Barrier task failure from a previous failed stage attempt should not > trigger stage retry; > * Fail the job when a task from a barrier ResultStage failed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25177) When dataframe decimal type column having scale higher than 6, 0 values are shown in scientific notation
[ https://issues.apache.org/jira/browse/SPARK-25177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25177: Assignee: Apache Spark > When dataframe decimal type column having scale higher than 6, 0 values are > shown in scientific notation > > > Key: SPARK-25177 > URL: https://issues.apache.org/jira/browse/SPARK-25177 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Vinod KC >Assignee: Apache Spark >Priority: Minor > > If scale of decimal type is > 6 , 0 value will be shown in scientific > notation and hence, when the dataframe output is saved to external database, > it fails due to scientific notation on "0" values. > Eg: In Spark > -- > spark.sql("create table test (a decimal(10,7), b decimal(10,6), c > decimal(10,8))") > spark.sql("insert into test values(0, 0,0)") > spark.sql("insert into test values(1, 1, 1)") > spark.table("test").show() > | a | b | c | > | 0E-7 |0.00| 0E-8 |//If scale > 6, zero is displayed in > scientific notation| > |1.000|1.00|1.| > > Eg: In Postgress > -- > CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8)); > INSERT INTO Testdec VALUES (0,0,0); > INSERT INTO Testdec VALUES (1,1,1); > select * from Testdec; > Result: > a | b | c > ---++--- > 0.000 | 0.00 | 0. > 1.000 | 1.00 | 1. > We can make spark SQL result consistent with other Databases like Postgresql > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25177) When dataframe decimal type column having scale higher than 6, 0 values are shown in scientific notation
[ https://issues.apache.org/jira/browse/SPARK-25177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25177: Assignee: (was: Apache Spark) > When dataframe decimal type column having scale higher than 6, 0 values are > shown in scientific notation > > > Key: SPARK-25177 > URL: https://issues.apache.org/jira/browse/SPARK-25177 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Vinod KC >Priority: Minor > > If scale of decimal type is > 6 , 0 value will be shown in scientific > notation and hence, when the dataframe output is saved to external database, > it fails due to scientific notation on "0" values. > Eg: In Spark > -- > spark.sql("create table test (a decimal(10,7), b decimal(10,6), c > decimal(10,8))") > spark.sql("insert into test values(0, 0,0)") > spark.sql("insert into test values(1, 1, 1)") > spark.table("test").show() > | a | b | c | > | 0E-7 |0.00| 0E-8 |//If scale > 6, zero is displayed in > scientific notation| > |1.000|1.00|1.| > > Eg: In Postgress > -- > CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8)); > INSERT INTO Testdec VALUES (0,0,0); > INSERT INTO Testdec VALUES (1,1,1); > select * from Testdec; > Result: > a | b | c > ---++--- > 0.000 | 0.00 | 0. > 1.000 | 1.00 | 1. > We can make spark SQL result consistent with other Databases like Postgresql > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25177) When dataframe decimal type column having scale higher than 6, 0 values are shown in scientific notation
[ https://issues.apache.org/jira/browse/SPARK-25177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587578#comment-16587578 ] Apache Spark commented on SPARK-25177: -- User 'vinodkc' has created a pull request for this issue: https://github.com/apache/spark/pull/22171 > When dataframe decimal type column having scale higher than 6, 0 values are > shown in scientific notation > > > Key: SPARK-25177 > URL: https://issues.apache.org/jira/browse/SPARK-25177 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Vinod KC >Priority: Minor > > If scale of decimal type is > 6 , 0 value will be shown in scientific > notation and hence, when the dataframe output is saved to external database, > it fails due to scientific notation on "0" values. > Eg: In Spark > -- > spark.sql("create table test (a decimal(10,7), b decimal(10,6), c > decimal(10,8))") > spark.sql("insert into test values(0, 0,0)") > spark.sql("insert into test values(1, 1, 1)") > spark.table("test").show() > | a | b | c | > | 0E-7 |0.00| 0E-8 |//If scale > 6, zero is displayed in > scientific notation| > |1.000|1.00|1.| > > Eg: In Postgress > -- > CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8)); > INSERT INTO Testdec VALUES (0,0,0); > INSERT INTO Testdec VALUES (1,1,1); > select * from Testdec; > Result: > a | b | c > ---++--- > 0.000 | 0.00 | 0. > 1.000 | 1.00 | 1. > We can make spark SQL result consistent with other Databases like Postgresql > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25177) When dataframe decimal type column having scale higher than 6, 0 values are shown in scientific notation
[ https://issues.apache.org/jira/browse/SPARK-25177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod KC updated SPARK-25177: - Description: If scale of decimal type is > 6 , 0 value will be shown in scientific notation and hence, when the dataframe output is saved to external database, it fails due to scientific notation on "0" values. Eg: In Spark -- spark.sql("create table test (a decimal(10,7), b decimal(10,6), c decimal(10,8))") spark.sql("insert into test values(0, 0,0)") spark.sql("insert into test values(1, 1, 1)") spark.table("test").show() | a | b | c | | 0E-7 |0.00| 0E-8 |//If scale > 6, zero is displayed in scientific notation| |1.000|1.00|1.| Eg: In Postgress -- CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8)); INSERT INTO Testdec VALUES (0,0,0); INSERT INTO Testdec VALUES (1,1,1); select * from Testdec; Result: a | b | c ---++--- 0.000 | 0.00 | 0. 1.000 | 1.00 | 1. We can make spark SQL result consistent with other Databases like Postgresql was: If scale of decimal type is > 6 , 0 value will be shown in scientific notation and hence, when the dataframe output is saved to external database, it fails due to scientific notation on "0" values. Eg: In Spark -- spark.sql("create table test (a decimal(10,7), b decimal(10,6), c decimal(10,8))") spark.sql("insert into test values(0, 0,0)") spark.sql("insert into test values(1, 1, 1)") spark.table("test").show() | a | b | c | | 0E-7 |0.00| 0E-8 |//If scale > 6, zero is displayed in scientific notation| |1.000|1.00|1.| Eg: In Postgress -- CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8)); INSERT INTO Testdec VALUES (0,0,0); INSERT INTO Testdec VALUES (1,1,1); select * from Testdec; Result: a | . b | c ---+--+- 0.000 | 0.00 | 0. 1.000 | 1.00 | 1. 
We can make spark SQL result consistent with other Databases like Postgresql > When dataframe decimal type column having scale higher than 6, 0 values are > shown in scientific notation > > > Key: SPARK-25177 > URL: https://issues.apache.org/jira/browse/SPARK-25177 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Vinod KC >Priority: Minor > > If scale of decimal type is > 6 , 0 value will be shown in scientific > notation and hence, when the dataframe output is saved to external database, > it fails due to scientific notation on "0" values. > Eg: In Spark > -- > spark.sql("create table test (a decimal(10,7), b decimal(10,6), c > decimal(10,8))") > spark.sql("insert into test values(0, 0,0)") > spark.sql("insert into test values(1, 1, 1)") > spark.table("test").show() > | a | b | c | > | 0E-7 |0.00| 0E-8 |//If scale > 6, zero is displayed in > scientific notation| > |1.000|1.00|1.| > > Eg: In Postgress > -- > CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8)); > INSERT INTO Testdec VALUES (0,0,0); > INSERT INTO Testdec VALUES (1,1,1); > select * from Testdec; > Result: > a | b | c > ---++--- > 0.000 | 0.00 | 0. > 1.000 | 1.00 | 1. > We can make spark SQL result consistent with other Databases like Postgresql > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25177) When dataframe decimal type column having scale higher than 6, 0 values are shown in scientific notation
[ https://issues.apache.org/jira/browse/SPARK-25177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod KC updated SPARK-25177: - Description: If scale of decimal type is > 6 , 0 value will be shown in scientific notation and hence, when the dataframe output is saved to external database, it fails due to scientific notation on "0" values. Eg: In Spark -- spark.sql("create table test (a decimal(10,7), b decimal(10,6), c decimal(10,8))") spark.sql("insert into test values(0, 0,0)") spark.sql("insert into test values(1, 1, 1)") spark.table("test").show() | a | b | c | | 0E-7 |0.00| 0E-8 |//If scale > 6, zero is displayed in scientific notation| |1.000|1.00|1.| Eg: In Postgress -- CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8)); INSERT INTO Testdec VALUES (0,0,0); INSERT INTO Testdec VALUES (1,1,1); select * from Testdec; Result: a | . b | c ---++- 0.000 | 0.00 | 0. 1.000 | 1.00 | 1. We can make spark SQL result consistent with other Databases like Postgresql was: If scale of decimal type is > 6 , 0 value will be shown in scientific notation and hence, when the dataframe output is saved to external database, it fails due to scientific notation on "0" values. Eg: In Spark -- spark.sql("create table test (a decimal(10,7), b decimal(10,6), c decimal(10,8))") spark.sql("insert into test values(0, 0,0)") spark.sql("insert into test values(1, 1, 1)") spark.table("test").show() +-+--++ | a | b | c | +-+---+---+ | 0E-7 |0.00 | 0E-8 | //If scale > 6, zero is displayed in scientific notation |1.000|1.00|1.| +-++--+ Eg: In Postgress -- CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8)); INSERT INTO Testdec VALUES (0,0,0); INSERT INTO Testdec VALUES (1,1,1); select * from Testdec; Result: a | . b | c ---+---+-- 0.000 | 0.00 | 0. 1.000 | 1.00 | 1. 
We can make spark SQL result consistent with other Databases like Postgresql > When dataframe decimal type column having scale higher than 6, 0 values are > shown in scientific notation > > > Key: SPARK-25177 > URL: https://issues.apache.org/jira/browse/SPARK-25177 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Vinod KC >Priority: Minor > > If scale of decimal type is > 6 , 0 value will be shown in scientific > notation and hence, when the dataframe output is saved to external database, > it fails due to scientific notation on "0" values. > Eg: In Spark > -- > spark.sql("create table test (a decimal(10,7), b decimal(10,6), c > decimal(10,8))") > spark.sql("insert into test values(0, 0,0)") > spark.sql("insert into test values(1, 1, 1)") > spark.table("test").show() > | a | b | c | > > | 0E-7 |0.00| 0E-8 |//If scale > 6, zero is displayed in > scientific notation| > |1.000|1.00|1.| > > Eg: In Postgress > -- > CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8)); > INSERT INTO Testdec VALUES (0,0,0); > INSERT INTO Testdec VALUES (1,1,1); > select * from Testdec; > Result: > a | . b | c > ---++- > 0.000 | 0.00 | 0. > 1.000 | 1.00 | 1. > We can make spark SQL result consistent with other Databases like Postgresql > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25177) When dataframe decimal type column having scale higher than 6, 0 values are shown in scientific notation
[ https://issues.apache.org/jira/browse/SPARK-25177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod KC updated SPARK-25177: - Description: If scale of decimal type is > 6 , 0 value will be shown in scientific notation and hence, when the dataframe output is saved to external database, it fails due to scientific notation on "0" values. Eg: In Spark -- spark.sql("create table test (a decimal(10,7), b decimal(10,6), c decimal(10,8))") spark.sql("insert into test values(0, 0,0)") spark.sql("insert into test values(1, 1, 1)") spark.table("test").show() | a | b | c | | 0E-7 |0.00| 0E-8 |//If scale > 6, zero is displayed in scientific notation| |1.000|1.00|1.| Eg: In Postgress -- CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8)); INSERT INTO Testdec VALUES (0,0,0); INSERT INTO Testdec VALUES (1,1,1); select * from Testdec; Result: a | . b | c ---+--+- 0.000 | 0.00 | 0. 1.000 | 1.00 | 1. We can make spark SQL result consistent with other Databases like Postgresql was: If scale of decimal type is > 6 , 0 value will be shown in scientific notation and hence, when the dataframe output is saved to external database, it fails due to scientific notation on "0" values. Eg: In Spark -- spark.sql("create table test (a decimal(10,7), b decimal(10,6), c decimal(10,8))") spark.sql("insert into test values(0, 0,0)") spark.sql("insert into test values(1, 1, 1)") spark.table("test").show() | a | b | c | | 0E-7 |0.00| 0E-8 |//If scale > 6, zero is displayed in scientific notation| |1.000|1.00|1.| Eg: In Postgress -- CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8)); INSERT INTO Testdec VALUES (0,0,0); INSERT INTO Testdec VALUES (1,1,1); select * from Testdec; Result: a | . b | c ---++- 0.000 | 0.00 | 0. 1.000 | 1.00 | 1. 
We can make spark SQL result consistent with other Databases like Postgresql > When dataframe decimal type column having scale higher than 6, 0 values are > shown in scientific notation > > > Key: SPARK-25177 > URL: https://issues.apache.org/jira/browse/SPARK-25177 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Vinod KC >Priority: Minor > > If scale of decimal type is > 6 , 0 value will be shown in scientific > notation and hence, when the dataframe output is saved to external database, > it fails due to scientific notation on "0" values. > Eg: In Spark > -- > spark.sql("create table test (a decimal(10,7), b decimal(10,6), c > decimal(10,8))") > spark.sql("insert into test values(0, 0,0)") > spark.sql("insert into test values(1, 1, 1)") > spark.table("test").show() > | a | b | c | > | 0E-7 |0.00| 0E-8 |//If scale > 6, zero is displayed in > scientific notation| > |1.000|1.00|1.| > > Eg: In Postgress > -- > CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8)); > INSERT INTO Testdec VALUES (0,0,0); > INSERT INTO Testdec VALUES (1,1,1); > select * from Testdec; > Result: > a | . b | c > ---+--+- > 0.000 | 0.00 | 0. > 1.000 | 1.00 | 1. > We can make spark SQL result consistent with other Databases like Postgresql > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25177) When dataframe decimal type column having scale higher than 6, 0 values are shown in scientific notation
[ https://issues.apache.org/jira/browse/SPARK-25177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod KC updated SPARK-25177:
-----------------------------
Summary: When dataframe decimal type column having scale higher than 6, 0 values are shown in scientific notation (was: When dataframe decimal type column having a scale higher than 6, 0 values are shown in scientific notation)

> When dataframe decimal type column having scale higher than 6, 0 values are
> shown in scientific notation
> ---------------------------------------------------------------------------
>
> Key: SPARK-25177
> URL: https://issues.apache.org/jira/browse/SPARK-25177
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Vinod KC
> Priority: Minor
>
> If the scale of a decimal type is > 6, a 0 value is shown in scientific
> notation; hence, when the dataframe output is saved to an external database,
> the save fails due to the scientific notation on "0" values.
> Eg: In Spark
> --
> spark.sql("create table test (a decimal(10,7), b decimal(10,6), c decimal(10,8))")
> spark.sql("insert into test values(0, 0, 0)")
> spark.sql("insert into test values(1, 1, 1)")
> spark.table("test").show()
> +---------+--------+----------+
> |        a|       b|         c|
> +---------+--------+----------+
> |     0E-7|0.000000|      0E-8|   // If scale > 6, zero is displayed in scientific notation
> |1.0000000|1.000000|1.00000000|
> +---------+--------+----------+
> Eg: In Postgres
> --
> CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8));
> INSERT INTO Testdec VALUES (0,0,0);
> INSERT INTO Testdec VALUES (1,1,1);
> select * from Testdec;
> Result:
>      a     |    b     |     c
> -----------+----------+------------
>  0.0000000 | 0.000000 | 0.00000000
>  1.0000000 | 1.000000 | 1.00000000
> We can make the Spark SQL result consistent with other databases like PostgreSQL.
[jira] [Created] (SPARK-25177) When dataframe decimal type column having a scale higher than 6, 0 values are shown in scientific notation
Vinod KC created SPARK-25177:
-----------------------------
Summary: When dataframe decimal type column having a scale higher than 6, 0 values are shown in scientific notation
Key: SPARK-25177
URL: https://issues.apache.org/jira/browse/SPARK-25177
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.4.0
Reporter: Vinod KC

If the scale of a decimal type is > 6, a 0 value is shown in scientific notation; hence, when the dataframe output is saved to an external database, the save fails due to the scientific notation on "0" values.

Eg: In Spark
--
spark.sql("create table test (a decimal(10,7), b decimal(10,6), c decimal(10,8))")
spark.sql("insert into test values(0, 0, 0)")
spark.sql("insert into test values(1, 1, 1)")
spark.table("test").show()
+---------+--------+----------+
|        a|       b|         c|
+---------+--------+----------+
|     0E-7|0.000000|      0E-8|   // If scale > 6, zero is displayed in scientific notation
|1.0000000|1.000000|1.00000000|
+---------+--------+----------+

Eg: In Postgres
--
CREATE TABLE Testdec (a DECIMAL(10,7), b DECIMAL(10,6), c DECIMAL(10,8));
INSERT INTO Testdec VALUES (0,0,0);
INSERT INTO Testdec VALUES (1,1,1);
select * from Testdec;
Result:
     a     |    b     |     c
-----------+----------+------------
 0.0000000 | 0.000000 | 0.00000000
 1.0000000 | 1.000000 | 1.00000000

We can make the Spark SQL result consistent with other databases like PostgreSQL.
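For context, the rendering described above matches the default java.math.BigDecimal.toString() behavior, which switches to scientific notation once the adjusted exponent drops below -6; Spark's Decimal values are presumably rendered through it (an assumption, not stated in the ticket). A minimal, self-contained sketch of that behavior:

```java
import java.math.BigDecimal;

public class DecimalNotationDemo {
    public static void main(String[] args) {
        // Zero with scale 7: adjusted exponent is -7 (< -6), so toString()
        // produces scientific notation, mirroring the "0E-7" seen in show().
        BigDecimal zeroScale7 = new BigDecimal("0").setScale(7);
        System.out.println(zeroScale7);                  // 0E-7
        System.out.println(zeroScale7.toPlainString());  // 0.0000000

        // Zero with scale 6: adjusted exponent is exactly -6, so plain
        // notation is kept, matching the "0.000000" column.
        BigDecimal zeroScale6 = new BigDecimal("0").setScale(6);
        System.out.println(zeroScale6);                  // 0.000000
    }
}
```

Rendering via toPlainString() (or any formatter with an explicit scale) avoids the scientific form, which suggests one possible direction for the fix.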
[jira] [Updated] (SPARK-25176) Kryo fails to serialize a parametrised type hierarchy
[ https://issues.apache.org/jira/browse/SPARK-25176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikhail Pryakhin updated SPARK-25176: - Description: I'm using the latest spark version spark-core_2.11:2.3.1 which transitively depends on com.esotericsoftware:kryo-shaded:3.0.3 via the com.twitter:chill_2.11:0.8.0 dependency. This exact version of kryo serializer contains an issue [1,2] which results in throwing ClassCastExceptions when serialising parameterised type hierarchy. This issue has been fixed in kryo version 4.0.0 [3]. It would be great to have this update in Spark as well. Could you please upgrade the version of com.twitter:chill_2.11 dependency from 0.8.0 up to 0.9.2? You can find a simple test to reproduce the issue [4]. [1] https://github.com/EsotericSoftware/kryo/issues/384 [2] https://github.com/EsotericSoftware/kryo/issues/377 [3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0 [4] https://github.com/mpryahin/kryo-parametrized-type-inheritance was: I'm using the latest spark version spark-core_2.11:2.3.1 which transitively depends on com.esotericsoftware:kryo-shaded:3.0.3 via the com.twitter:chill_2.11:0.8.0 dependency. This exact version of kryo serializer contains an issue [1,2] which results in throwing ClassCastExceptions when serialising parameterised type hierarchy. This issue was fixed in kryo version 4.0.0 [3]. It would be great to have this update in Spark as well. Could you please upgrade the version of com.twitter:chill_2.11 dependency from 0.8.0 up to 0.9.2? You can find a simple test to reproduce the issue [4]. 
[1] https://github.com/EsotericSoftware/kryo/issues/384 [2] https://github.com/EsotericSoftware/kryo/issues/377 [3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0 [4] https://github.com/mpryahin/kryo-parametrized-type-inheritance > Kryo fails to serialize a parametrised type hierarchy > - > > Key: SPARK-25176 > URL: https://issues.apache.org/jira/browse/SPARK-25176 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.1 >Reporter: Mikhail Pryakhin >Priority: Critical > > I'm using the latest spark version spark-core_2.11:2.3.1 which > transitively depends on com.esotericsoftware:kryo-shaded:3.0.3 via the > com.twitter:chill_2.11:0.8.0 dependency. This exact version of kryo > serializer contains an issue [1,2] which results in throwing > ClassCastExceptions when serialising parameterised type hierarchy. > This issue has been fixed in kryo version 4.0.0 [3]. It would be great to > have this update in Spark as well. Could you please upgrade the version of > com.twitter:chill_2.11 dependency from 0.8.0 up to 0.9.2? > You can find a simple test to reproduce the issue [4]. > [1] https://github.com/EsotericSoftware/kryo/issues/384 > [2] https://github.com/EsotericSoftware/kryo/issues/377 > [3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0 > [4] https://github.com/mpryahin/kryo-parametrized-type-inheritance -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25176) Kryo fails to serialize a parametrised type hierarchy
[ https://issues.apache.org/jira/browse/SPARK-25176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikhail Pryakhin updated SPARK-25176: - Description: I'm using the latest spark version spark-core_2.11:2.3.1 which transitively depends on com.esotericsoftware:kryo-shaded:3.0.3 via the com.twitter:chill_2.11:0.8.0 dependency. This exact version of kryo serializer contains an issue [1,2] which results in throwing ClassCastExceptions when serialising parameterised type hierarchy. This issue was fixed in kryo version 4.0.0 [3]. It would be great to have this update in Spark as well. Could you please upgrade the version of com.twitter:chill_2.11 dependency from 0.8.0 up to 0.9.2? You can find a simple test to reproduce the issue [4]. [1] https://github.com/EsotericSoftware/kryo/issues/384 [2] https://github.com/EsotericSoftware/kryo/issues/377 [3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0 [4] https://github.com/mpryahin/kryo-parametrized-type-inheritance was: I'm using the latest spark version spark-core_2.11:2.3.1 which transitively depends on com.esotericsoftware:kryo-shaded:3.0.3 via the com.twitter:chill_2.11:0.8.0 dependency. This exact version of kryo serializer contains an issue [1,2] which results in throwing ClassCastExceptions when serialising parameterised type hierarchy. This issue was fixed in kryo version 4.0.0 [3]. It would be great to have this update in Spark as well. Could you please upgrade the version of com.twitter:chill_2.11 dependency from 0.8.0 up to 0.9.2? You can find a simple test to reproduce the issue [4]. 
The following command runs the test against kryo version 3.0.3:
./gradlew run -PkryoVersion=3.0.3
The following command runs the test against kryo version 4.0.0:
./gradlew run -PkryoVersion=4.0.0

[1] https://github.com/EsotericSoftware/kryo/issues/384
[2] https://github.com/EsotericSoftware/kryo/issues/377
[3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0
[4] https://github.com/mpryahin/kryo-parametrized-type-inheritance

> Kryo fails to serialize a parametrised type hierarchy
> -----------------------------------------------------
>
> Key: SPARK-25176
> URL: https://issues.apache.org/jira/browse/SPARK-25176
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.2, 2.3.1
> Reporter: Mikhail Pryakhin
> Priority: Critical
>
> I'm using the latest spark version spark-core_2.11:2.3.1, which transitively
> depends on com.esotericsoftware:kryo-shaded:3.0.3 via the
> com.twitter:chill_2.11:0.8.0 dependency. This exact version of the kryo
> serializer contains an issue [1,2] which results in ClassCastExceptions being
> thrown when serialising a parameterised type hierarchy.
> This issue was fixed in kryo version 4.0.0 [3]. It would be great to have
> this update in Spark as well. Could you please upgrade the version of the
> com.twitter:chill_2.11 dependency from 0.8.0 up to 0.9.2?
> You can find a simple test to reproduce the issue [4].
> [1] https://github.com/EsotericSoftware/kryo/issues/384
> [2] https://github.com/EsotericSoftware/kryo/issues/377
> [3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0
> [4] https://github.com/mpryahin/kryo-parametrized-type-inheritance
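Until the pinned chill version is bumped inside Spark itself, one possible user-side workaround is to force the newer artifacts in the consuming application's own Gradle build. This is an untested sketch: 0.9.2 is the chill version requested above, while the matching kryo-shaded version is an assumption and would need to be checked for binary compatibility with the Spark version in use.

```groovy
// Hypothetical workaround: override the transitive chill/kryo versions
// pulled in by spark-core_2.11. Versions are assumptions, not verified.
configurations.all {
    resolutionStrategy {
        force 'com.twitter:chill_2.11:0.9.2'
        force 'com.esotericsoftware:kryo-shaded:4.0.2'
    }
}
```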