[jira] [Commented] (SPARK-19288) Failure (at test_sparkSQL.R#1300): date functions on a DataFrame in R/run-tests.sh
[ https://issues.apache.org/jira/browse/SPARK-19288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929438#comment-15929438 ] Hyukjin Kwon commented on SPARK-19288: -- FWIW, it has been fine for me: Mac OS 10.12.3 & KTS & R version 3.2.3. It has also been fine on Windows via AppVeyor so far. > Failure (at test_sparkSQL.R#1300): date functions on a DataFrame in > R/run-tests.sh > -- > > Key: SPARK-19288 > URL: https://issues.apache.org/jira/browse/SPARK-19288 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL, Tests >Affects Versions: 2.0.1 > Environment: Ubuntu 16.04, X86_64, ppc64le >Reporter: Nirman Narang > > Full log here. > {code:title=R/run-tests.sh|borderStyle=solid} > Loading required package: methods > Attaching package: 'SparkR' > The following object is masked from 'package:testthat': > describe > The following objects are masked from 'package:stats': > cov, filter, lag, na.omit, predict, sd, var, window > The following objects are masked from 'package:base': > as.data.frame, colnames, colnames<-, drop, intersect, rank, rbind, > sample, subset, summary, transform, union > functions on binary files : Spark package found in SPARK_HOME: > /var/lib/jenkins/workspace/Sparkv2.0.1/spark > > binary functions : ... > broadcast variables : .. > functions in client.R : . > test functions in sparkR.R : .Re-using existing Spark Context. Call > sparkR.session.stop() or restart R to create a new Spark Context > ... > include R packages : Spark package found in SPARK_HOME: > /var/lib/jenkins/workspace/Sparkv2.0.1/spark > JVM API : .. > MLlib functions : Spark package found in SPARK_HOME: > /var/lib/jenkins/workspace/Sparkv2.0.1/spark > .SLF4J: Failed to load class > "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. 
> .Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: > Compression: SNAPPY > Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: > Parquet block size to 134217728 > Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: > Parquet page size to 1048576 > Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: > Parquet dictionary page size to 1048576 > Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: > Dictionary is on > Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: > Validation is off > Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: > Writer version is: PARQUET_1_0 > Jan 19, 2017 5:40:54 PM INFO: > org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem > columnStore to file. allocated memory: 65,622 > Jan 19, 2017 5:40:54 PM INFO: > org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 70B for [label] > BINARY: 1 values, 21B raw, 23B comp, 1 pages, encodings: [PLAIN, BIT_PACKED, > RLE] > Jan 19, 2017 5:40:54 PM INFO: > org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 87B for [terms, > list, element, list, element] BINARY: 2 values, 42B raw, 43B comp, 1 pages, > encodings: [PLAIN, RLE] > Jan 19, 2017 5:40:54 PM INFO: > org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 30B for > [hasIntercept] BOOLEAN: 1 values, 1B raw, 3B comp, 1 pages, encodings: > [PLAIN, BIT_PACKED] > Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: > Compression: SNAPPY > Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: > Parquet block size to 134217728 > Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: > Parquet page size to 1048576 > Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: > Parquet dictionary page size to 1048576 > Jan 19, 2017 5:40:55 PM INFO: 
org.apache.parquet.hadoop.ParquetOutputFormat: > Dictionary is on > Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: > Validation is off > Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: > Writer version is: PARQUET_1_0 > Jan 19, 2017 5:40:55 PM INFO: > org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem > columnStore to file. allocated memory: 49 > Jan 19, 2017 5:40:55 PM INFO: > org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 90B for [labels, > list, element] BINARY: 3 values, 50B raw, 50B comp, 1 pages, encodings: > [PLAIN, RLE] > Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: > Compression: SNAPPY > Jan 19, 2017 5:40:55 PM INFO:
[jira] [Commented] (SPARK-19288) Failure (at test_sparkSQL.R#1300): date functions on a DataFrame in R/run-tests.sh
[ https://issues.apache.org/jira/browse/SPARK-19288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929433#comment-15929433 ] Miao Wang commented on SPARK-19288: --- I think it only happens in local builds. I once had a similar issue caused by building Hive. If you comment out this test, the run will pass. I don't think Jenkins will suffer from this issue. > Failure (at test_sparkSQL.R#1300): date functions on a DataFrame in > R/run-tests.sh
[jira] [Comment Edited] (SPARK-19827) spark.ml R API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929429#comment-15929429 ] Miao Wang edited comment on SPARK-19827 at 3/17/17 5:15 AM: Please hold on. We need to add the wrapper to ML instead of MLlib. The ML wrapper is not merged yet. Please wait until https://github.com/apache/spark/pull/15770 is merged before submitting a PR. cc [~felixcheung] > spark.ml R API for PIC > -- > > Key: SPARK-19827 > URL: https://issues.apache.org/jira/browse/SPARK-19827 > Project: Spark > Issue Type: Sub-task > Components: ML, SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19827) spark.ml R API for PIC
[ https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929429#comment-15929429 ] Miao Wang commented on SPARK-19827: --- Please hold on. We need to add the wrapper to ML instead of MLlib. The ML wrapper is not merged yet. Please wait until https://github.com/apache/spark/pull/15770 is merged before submitting a PR. > spark.ml R API for PIC
[jira] [Commented] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using
[ https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929422#comment-15929422 ] Kay Ousterhout commented on SPARK-19990: Thanks [~windpiger]! > Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create > temporary view using > -- > > Key: SPARK-19990 > URL: https://issues.apache.org/jira/browse/SPARK-19990 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.2.0 >Reporter: Kay Ousterhout > > This test seems to be failing consistently on all of the maven builds: > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite_name=create+temporary+view+using > and is possibly caused by SPARK-19763. > Here's a stack trace for the failure: > java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative > path in absolute URI: > jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv > at org.apache.hadoop.fs.Path.initialize(Path.java:206) > at org.apache.hadoop.fs.Path.(Path.java:172) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:344) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:343) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:343) > at > org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:91) > at > 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62) > at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62) > at > org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38$$anonfun$apply$mcV$sp$8.apply$mcV$sp(DDLSuite.scala:705) > at > org.apache.spark.sql.test.SQLTestUtils$class.withView(SQLTestUtils.scala:186) > at > org.apache.spark.sql.execution.command.DDLSuite.withView(DDLSuite.scala:171) > at > org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply$mcV$sp(DDLSuite.scala:704) > at > org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701) > at > org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701) > at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > at 
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > at > org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(HiveDDLSuite.scala:41) > at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) > at > org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.runTest(HiveDDLSuite.scala:41) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > at >
[jira] [Comment Edited] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using
[ https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929409#comment-15929409 ] Song Jun edited comment on SPARK-19990 at 3/17/17 4:36 AM: --- The root cause is that [the CSV file path in this test case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703] is "jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv", which fails when passed to new Path() ([new Path in DataSource.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344]); cars.csv is stored in the core module's resources. After we merged HiveDDLSuite and DDLSuite in SPARK-19235 (https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9), testing the hive module also runs DDLSuite from the core module, which is why we get the illegal 'jar:file:/xxx' path above. It is not related to SPARK-19763. I will fix this by adding a new test directory under sql/ that contains the test files and pointing the test case at that path. Thanks!
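The failure Song Jun describes can be reproduced outside Spark with a few lines of plain Java. Hadoop's Path splits a location string at the first colon, so for a jar: URL the scheme becomes "jar" and everything after the colon becomes the URI path component; java.net.URI then rejects any non-empty path that carries a scheme but does not start with '/'. A minimal sketch (the jar path below is made up, standing in for the real test jar):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class JarUriDemo {
    public static void main(String[] args) {
        // Hypothetical classpath resource inside a test jar, mirroring the
        // shape of the path in the stack trace.
        String location = "jar:file:/tmp/spark-sql-tests.jar!/test-data/cars.csv";

        // Hadoop's Path.initialize() splits off the scheme at the first ':' ...
        int colon = location.indexOf(':');
        String scheme = location.substring(0, colon);   // "jar"
        String path = location.substring(colon + 1);    // "file:/tmp/...!/..."

        // ... and hands the remainder to java.net.URI as the path component.
        // That fails because the path "file:/tmp/..." does not begin with '/',
        // so URI treats it as a relative path inside an absolute URI.
        try {
            new URI(scheme, null, path, null, null);
            System.out.println("parsed OK");
        } catch (URISyntaxException e) {
            System.out.println(e.getReason()); // prints "Relative path in absolute URI"
        }
    }
}
```

This is also why moving the test data out of the jar into a plain directory under sql/, as Song Jun proposes, sidesteps the problem: a filesystem path has no embedded second scheme for Path to mis-split.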
[jira] [Comment Edited] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using
[ https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929409#comment-15929409 ] Song Jun edited comment on SPARK-19990 at 3/17/17 4:35 AM: --- The root cause is that [the CSV file path in this test case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703] is "jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv", which fails when passed to new Path() ([new Path in DataSource.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344]); cars.csv is stored in the core module's resources. After we merged HiveDDLSuite and DDLSuite (https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9), testing the hive module also runs DDLSuite from the core module, which is why we get the illegal 'jar:file:/xxx' path above. It is not related to SPARK-19763. I will fix this by adding a new test directory under sql/ that contains the test files and pointing the test case at that path. Thanks!
[jira] [Commented] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using
[ https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929409#comment-15929409 ] Song Jun commented on SPARK-19990: -- The root cause is that [the CSV file path in this test case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703] is "jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv", which fails when passed to new Path() ([new Path in DataSource.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344]); cars.csv is stored in the core module's resources. After we merged HiveDDLSuite and DDLSuite (https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9), testing the hive module also runs DDLSuite from the core module, which is why we get the illegal 'jar:file:/xxx' path above. > Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create > temporary view using
[jira] [Comment Edited] (SPARK-19984) ERROR codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'
[ https://issues.apache.org/jira/browse/SPARK-19984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929399#comment-15929399 ] Kazuaki Ishizaki edited comment on SPARK-19984 at 3/17/17 4:23 AM: --- This problem occurs because Spark generates a {{.compare(UTF8String)}} method call for the {{long}} primitive type; it should not be generated. I am investigating from the log why this occurs. Could you post the code, even though it may not always reproduce the problem? > ERROR codegen.CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java' > - > > Key: SPARK-19984 > URL: https://issues.apache.org/jira/browse/SPARK-19984 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Andrey Yakovenko > > I have had this error a few times on my local Hadoop 2.7.3 + Spark 2.1.0 > environment. It is not a permanent error; the next time I run the job it can > disappear. Unfortunately I don't know how to reproduce the issue. As you can > see from the log, my logic is pretty complicated. 
> Here is a part of log i've got (container_1489514660953_0015_01_01) > {code} > 17/03/16 11:07:04 ERROR codegen.CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 151, Column 29: A method named "compare" is not declared in any enclosing > class nor any supertype, nor through a static import > /* 001 */ public Object generate(Object[] references) { > /* 002 */ return new GeneratedIterator(references); > /* 003 */ } > /* 004 */ > /* 005 */ final class GeneratedIterator extends > org.apache.spark.sql.execution.BufferedRowIterator { > /* 006 */ private Object[] references; > /* 007 */ private scala.collection.Iterator[] inputs; > /* 008 */ private boolean agg_initAgg; > /* 009 */ private boolean agg_bufIsNull; > /* 010 */ private long agg_bufValue; > /* 011 */ private boolean agg_initAgg1; > /* 012 */ private boolean agg_bufIsNull1; > /* 013 */ private long agg_bufValue1; > /* 014 */ private scala.collection.Iterator smj_leftInput; > /* 015 */ private scala.collection.Iterator smj_rightInput; > /* 016 */ private InternalRow smj_leftRow; > /* 017 */ private InternalRow smj_rightRow; > /* 018 */ private UTF8String smj_value2; > /* 019 */ private java.util.ArrayList smj_matches; > /* 020 */ private UTF8String smj_value3; > /* 021 */ private UTF8String smj_value4; > /* 022 */ private org.apache.spark.sql.execution.metric.SQLMetric > smj_numOutputRows; > /* 023 */ private UnsafeRow smj_result; > /* 024 */ private > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder smj_holder; > /* 025 */ private > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter > smj_rowWriter; > /* 026 */ private org.apache.spark.sql.execution.metric.SQLMetric > agg_numOutputRows; > /* 027 */ private org.apache.spark.sql.execution.metric.SQLMetric > agg_aggTime; > /* 028 */ private UnsafeRow agg_result; > /* 029 */ private > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder; > /* 030 */ 
private > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter > agg_rowWriter; > /* 031 */ private org.apache.spark.sql.execution.metric.SQLMetric > agg_numOutputRows1; > /* 032 */ private org.apache.spark.sql.execution.metric.SQLMetric > agg_aggTime1; > /* 033 */ private UnsafeRow agg_result1; > /* 034 */ private > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder1; > /* 035 */ private > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter > agg_rowWriter1; > /* 036 */ > /* 037 */ public GeneratedIterator(Object[] references) { > /* 038 */ this.references = references; > /* 039 */ } > /* 040 */ > /* 041 */ public void init(int index, scala.collection.Iterator[] inputs) { > /* 042 */ partitionIndex = index; > /* 043 */ this.inputs = inputs; > /* 044 */ wholestagecodegen_init_0(); > /* 045 */ wholestagecodegen_init_1(); > /* 046 */ > /* 047 */ } > /* 048 */ > /* 049 */ private void wholestagecodegen_init_0() { > /* 050 */ agg_initAgg = false; > /* 051 */ > /* 052 */ agg_initAgg1 = false; > /* 053 */ > /* 054 */ smj_leftInput = inputs[0]; > /* 055 */ smj_rightInput = inputs[1]; > /* 056 */ > /* 057 */ smj_rightRow = null; > /* 058 */ > /* 059 */ smj_matches = new java.util.ArrayList(); > /* 060 */ > /* 061 */
[jira] [Commented] (SPARK-19984) ERROR codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'
[ https://issues.apache.org/jira/browse/SPARK-19984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929399#comment-15929399 ] Kazuaki Ishizaki commented on SPARK-19984: -- This problem occurs because Spark generates a {{.compare()}} method call for the {{long}} primitive type; it should not be generated. I am investigating from the log why this occurs. Could you post the code, even though it may not always reproduce this problem? > ERROR codegen.CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java' > - > > Key: SPARK-19984 > URL: https://issues.apache.org/jira/browse/SPARK-19984 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Andrey Yakovenko > > I have hit this error a few times on my local Hadoop 2.7.3 + Spark 2.1.0 environment. > This is not a permanent error; the next time I run the job it may disappear. > Unfortunately, I don't know how to reproduce the issue. As you can see from > the log, my logic is pretty complicated. 
> Here is a part of log i've got (container_1489514660953_0015_01_01) > {code} > 17/03/16 11:07:04 ERROR codegen.CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 151, Column 29: A method named "compare" is not declared in any enclosing > class nor any supertype, nor through a static import > /* 001 */ public Object generate(Object[] references) { > /* 002 */ return new GeneratedIterator(references); > /* 003 */ } > /* 004 */ > /* 005 */ final class GeneratedIterator extends > org.apache.spark.sql.execution.BufferedRowIterator { > /* 006 */ private Object[] references; > /* 007 */ private scala.collection.Iterator[] inputs; > /* 008 */ private boolean agg_initAgg; > /* 009 */ private boolean agg_bufIsNull; > /* 010 */ private long agg_bufValue; > /* 011 */ private boolean agg_initAgg1; > /* 012 */ private boolean agg_bufIsNull1; > /* 013 */ private long agg_bufValue1; > /* 014 */ private scala.collection.Iterator smj_leftInput; > /* 015 */ private scala.collection.Iterator smj_rightInput; > /* 016 */ private InternalRow smj_leftRow; > /* 017 */ private InternalRow smj_rightRow; > /* 018 */ private UTF8String smj_value2; > /* 019 */ private java.util.ArrayList smj_matches; > /* 020 */ private UTF8String smj_value3; > /* 021 */ private UTF8String smj_value4; > /* 022 */ private org.apache.spark.sql.execution.metric.SQLMetric > smj_numOutputRows; > /* 023 */ private UnsafeRow smj_result; > /* 024 */ private > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder smj_holder; > /* 025 */ private > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter > smj_rowWriter; > /* 026 */ private org.apache.spark.sql.execution.metric.SQLMetric > agg_numOutputRows; > /* 027 */ private org.apache.spark.sql.execution.metric.SQLMetric > agg_aggTime; > /* 028 */ private UnsafeRow agg_result; > /* 029 */ private > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder; > /* 030 */ 
private > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter > agg_rowWriter; > /* 031 */ private org.apache.spark.sql.execution.metric.SQLMetric > agg_numOutputRows1; > /* 032 */ private org.apache.spark.sql.execution.metric.SQLMetric > agg_aggTime1; > /* 033 */ private UnsafeRow agg_result1; > /* 034 */ private > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder1; > /* 035 */ private > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter > agg_rowWriter1; > /* 036 */ > /* 037 */ public GeneratedIterator(Object[] references) { > /* 038 */ this.references = references; > /* 039 */ } > /* 040 */ > /* 041 */ public void init(int index, scala.collection.Iterator[] inputs) { > /* 042 */ partitionIndex = index; > /* 043 */ this.inputs = inputs; > /* 044 */ wholestagecodegen_init_0(); > /* 045 */ wholestagecodegen_init_1(); > /* 046 */ > /* 047 */ } > /* 048 */ > /* 049 */ private void wholestagecodegen_init_0() { > /* 050 */ agg_initAgg = false; > /* 051 */ > /* 052 */ agg_initAgg1 = false; > /* 053 */ > /* 054 */ smj_leftInput = inputs[0]; > /* 055 */ smj_rightInput = inputs[1]; > /* 056 */ > /* 057 */ smj_rightRow = null; > /* 058 */ > /* 059 */ smj_matches = new java.util.ArrayList(); > /* 060 */ > /* 061 */ this.smj_numOutputRows = > (org.apache.spark.sql.execution.metric.SQLMetric) references[0]; > /* 062 */ smj_result = new UnsafeRow(2); > /* 063 */ this.smj_holder = new > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(smj_result, > 64); > /* 064 */ this.smj_rowWriter = new >
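The compile error in this generated code can be illustrated in miniature: a {{long}} primitive has no instance methods, so a generated call such as {{value.compare(other)}} on a {{long}} column cannot resolve, which is exactly what Janino reports. A hedged sketch (class and method names are illustrative, not Spark's actual codegen output) of what valid generated comparisons look like for primitive vs. object columns:

```java
// Sketch of the failure mode: primitives have no instance methods, so
// generated code must emit a static helper (or relational operators)
// for a long column instead of something like "smj_value.compare(...)".
class PrimitiveCompareSketch {
    // What working generated code can emit for a long column:
    static int compareLongs(long a, long b) {
        return Long.compare(a, b); // valid: static helper on the boxed type
    }

    // For an object column an instance method is fine (Spark's UTF8String
    // has compare(); plain String.compareTo stands in for it here):
    static int compareStrings(String a, String b) {
        return a.compareTo(b);
    }

    public static void main(String[] args) {
        long x = 3L;
        // x.compare(5L);  // does not compile: long has no methods
        System.out.println(Long.signum(compareLongs(x, 5L))); // prints -1
    }
}
```

This is why the comment above says the {{.compare(UTF8String)}} call should never be generated for a {{long}} buffer in the first place.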
[jira] [Created] (SPARK-19991) FileSegmentManagedBuffer performance improvement.
Guoqiang Li created SPARK-19991: --- Summary: FileSegmentManagedBuffer performance improvement. Key: SPARK-19991 URL: https://issues.apache.org/jira/browse/SPARK-19991 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 2.1.0, 2.0.2 Reporter: Guoqiang Li Priority: Minor When we do not set the value of the configuration items {{spark.storage.memoryMapThreshold}} and {{spark.shuffle.io.lazyFD}}, each call to the {{FileSegmentManagedBuffer.nioByteBuffer}} or {{FileSegmentManagedBuffer.createInputStream}} method creates a NoSuchElementException instance. This is a relatively time-consuming operation. The shuffle-server thread's stack: {noformat} "shuffle-server-2-42" #335 daemon prio=5 os_prio=0 tid=0x7f71e4507800 nid=0x28d12 runnable [0x7f71af93e000] java.lang.Thread.State: RUNNABLE at java.lang.Throwable.fillInStackTrace(Native Method) at java.lang.Throwable.fillInStackTrace(Throwable.java:783) - locked <0x0007a930f080> (a java.util.NoSuchElementException) at java.lang.Throwable.<init>(Throwable.java:265) at java.lang.Exception.<init>(Exception.java:66) at java.lang.RuntimeException.<init>(RuntimeException.java:62) at java.util.NoSuchElementException.<init>(NoSuchElementException.java:57) at org.apache.spark.network.yarn.util.HadoopConfigProvider.get(HadoopConfigProvider.java:38) at org.apache.spark.network.util.ConfigProvider.get(ConfigProvider.java:31) at org.apache.spark.network.util.ConfigProvider.getBoolean(ConfigProvider.java:50) at org.apache.spark.network.util.TransportConf.lazyFileDescriptor(TransportConf.java:157) at org.apache.spark.network.buffer.FileSegmentManagedBuffer.convertToNetty(FileSegmentManagedBuffer.java:132) at org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:54) at org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33) at org.spark_project.io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88) at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:735) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:820) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:728) at org.spark_project.io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:284) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:806) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:818) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:799) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:835) at org.spark_project.io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1017) at org.spark_project.io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:256) at org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:194) at org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:135) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at 
org.spark_project.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346) at org.spark_project.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at
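The stack trace above shows that every shuffle fetch builds a {{NoSuchElementException}} (paying for {{fillInStackTrace}}) just to discover that a configuration key is unset. A hedged sketch of that pattern and of the natural fix, resolving the flag once instead of per call; the class names and the defaulting logic here are illustrative, not Spark's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

// Illustrative sketch: a ConfigProvider-style lookup that throws on a
// missing key makes every read of an unset key pay for exception
// construction (fillInStackTrace is expensive).
class ConfigSketch {
    private final Map<String, String> values = new HashMap<>();

    String get(String key) {
        String v = values.get(key);
        if (v == null) throw new NoSuchElementException(key); // built per call
        return v;
    }

    boolean getBoolean(String key, boolean defaultValue) {
        try {
            return Boolean.parseBoolean(get(key));
        } catch (NoSuchElementException e) {
            return defaultValue; // an exception was constructed just to default
        }
    }
}

// Resolving the flag once in the constructor avoids the per-call cost
// that the reported stack trace shows on the shuffle-server thread.
class TransportSketch {
    private final boolean lazyFD; // cached once, not re-read per buffer

    TransportSketch(ConfigSketch conf) {
        this.lazyFD = conf.getBoolean("spark.shuffle.io.lazyFD", true);
    }

    boolean lazyFileDescriptor() {
        return lazyFD;
    }
}
```

Hoisting the lookup means the exception is constructed at most once per transport, rather than once per {{nioByteBuffer}}/{{createInputStream}} call on the hot path.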
[jira] [Updated] (SPARK-19991) FileSegmentManagedBuffer performance improvement.
[ https://issues.apache.org/jira/browse/SPARK-19991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-19991: Description: When we do not set the value of the configuration items {{spark.storage.memoryMapThreshold}} and {{spark.shuffle.io.lazyFD}}, each call to the {{FileSegmentManagedBuffer.nioByteBuffer}} or {{FileSegmentManagedBuffer.createInputStream}} method creates a NoSuchElementException instance. This is a relatively time-consuming operation. The shuffle-server thread's stack: {noformat} "shuffle-server-2-42" #335 daemon prio=5 os_prio=0 tid=0x7f71e4507800 nid=0x28d12 runnable [0x7f71af93e000] java.lang.Thread.State: RUNNABLE at java.lang.Throwable.fillInStackTrace(Native Method) at java.lang.Throwable.fillInStackTrace(Throwable.java:783) - locked <0x0007a930f080> (a java.util.NoSuchElementException) at java.lang.Throwable.<init>(Throwable.java:265) at java.lang.Exception.<init>(Exception.java:66) at java.lang.RuntimeException.<init>(RuntimeException.java:62) at java.util.NoSuchElementException.<init>(NoSuchElementException.java:57) at org.apache.spark.network.yarn.util.HadoopConfigProvider.get(HadoopConfigProvider.java:38) at org.apache.spark.network.util.ConfigProvider.get(ConfigProvider.java:31) at org.apache.spark.network.util.ConfigProvider.getBoolean(ConfigProvider.java:50) at org.apache.spark.network.util.TransportConf.lazyFileDescriptor(TransportConf.java:157) at org.apache.spark.network.buffer.FileSegmentManagedBuffer.convertToNetty(FileSegmentManagedBuffer.java:132) at org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:54) at org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33) at org.spark_project.io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743) at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:735) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:820) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:728) at org.spark_project.io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:284) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:806) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:818) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:799) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:835) at org.spark_project.io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1017) at org.spark_project.io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:256) at org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:194) at org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:135) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at org.spark_project.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346) at org.spark_project.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367) at org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353) at
[jira] [Updated] (SPARK-19736) refreshByPath should clear all cached plans with the specified path
[ https://issues.apache.org/jira/browse/SPARK-19736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-19736: Fix Version/s: 2.1.1 > refreshByPath should clear all cached plans with the specified path > --- > > Key: SPARK-19736 > URL: https://issues.apache.org/jira/browse/SPARK-19736 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.1.1, 2.2.0 > > > Catalog.refreshByPath can refresh the cache entry and the associated metadata > for all DataFrames (if any) that contain the given data source path. > However, CacheManager.invalidateCachedPath doesn't clear all cached plans > with the specified path. It causes some strange behaviors reported in > SPARK-15678. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
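The required behavior can be sketched with a toy cache (the class and method names here are illustrative, not Spark's CacheManager): invalidation by path must drop every cached plan whose source paths include the given path, not only an exact-match entry.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of path-based cache invalidation; names are illustrative.
class PlanCacheSketch {
    // cached "plan" name -> set of data source paths the plan reads
    private final Map<String, Set<String>> cached = new HashMap<>();

    void cache(String plan, Set<String> paths) {
        cached.put(plan, new HashSet<>(paths));
    }

    // Correct behavior: remove *all* plans that reference the path,
    // not just a plan keyed exactly by it.
    void invalidateByPath(String path) {
        cached.entrySet().removeIf(e -> e.getValue().contains(path));
    }

    int size() {
        return cached.size();
    }
}
```

The bug shape described above corresponds to an invalidation that only removes one matching entry, leaving other cached plans over the same path stale.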
[jira] [Updated] (SPARK-19093) Cached tables are not used in SubqueryExpression
[ https://issues.apache.org/jira/browse/SPARK-19093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-19093: Fix Version/s: 2.1.1 > Cached tables are not used in SubqueryExpression > > > Key: SPARK-19093 > URL: https://issues.apache.org/jira/browse/SPARK-19093 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Josh Rosen >Assignee: Dilip Biswal > Fix For: 2.1.1, 2.2.0 > > > See reproduction at > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1903098128019500/2699761537338853/1395282846718893/latest.html > Consider the following: > {code} > Seq(("a", "b"), ("c", "d")) > .toDS > .write > .parquet("/tmp/rows") > val df = spark.read.parquet("/tmp/rows") > df.cache() > df.count() > df.createOrReplaceTempView("rows") > spark.sql(""" > select * from rows cross join rows > """).explain(true) > spark.sql(""" > select * from rows where not exists (select * from rows) > """).explain(true) > {code} > In both plans, I'd expect that both sides of the joins would read from the > cached table for both the cross join and anti join, but the left anti join > produces the following plan which only reads the left side from cache and > reads the right side via a regular non-cached scan: > {code} > == Parsed Logical Plan == > 'Project [*] > +- 'Filter NOT exists#3994 >: +- 'Project [*] >: +- 'UnresolvedRelation `rows` >+- 'UnresolvedRelation `rows` > == Analyzed Logical Plan == > _1: string, _2: string > Project [_1#3775, _2#3776] > +- Filter NOT predicate-subquery#3994 [] >: +- Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002] >: +- Project [_1#3775, _2#3776] >:+- SubqueryAlias rows >: +- Relation[_1#3775,_2#3776] parquet >+- SubqueryAlias rows > +- Relation[_1#3775,_2#3776] parquet > == Optimized Logical Plan == > Join LeftAnti > :- InMemoryRelation [_1#3775, _2#3776], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > : +- 
*FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct<_1:string,_2:string> > +- Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002] >+- Relation[_1#3775,_2#3776] parquet > == Physical Plan == > BroadcastNestedLoopJoin BuildRight, LeftAnti > :- InMemoryTableScan [_1#3775, _2#3776] > : +- InMemoryRelation [_1#3775, _2#3776], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > : +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: > Parquet, Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct<_1:string,_2:string> > +- BroadcastExchange IdentityBroadcastMode >+- *Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002] > +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct<_1:string,_2:string> > {code}
[jira] [Updated] (SPARK-18549) Failed to Uncache a View that References a Dropped Table.
[ https://issues.apache.org/jira/browse/SPARK-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-18549: Fix Version/s: 2.1.1 > Failed to Uncache a View that References a Dropped Table. > - > > Key: SPARK-18549 > URL: https://issues.apache.org/jira/browse/SPARK-18549 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Xiao Li >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.1.1, 2.2.0 > > > {code} > spark.range(1, 10).toDF("id1").write.format("json").saveAsTable("jt1") > spark.range(1, 10).toDF("id2").write.format("json").saveAsTable("jt2") > sql("CREATE VIEW testView AS SELECT * FROM jt1 JOIN jt2 ON id1 == id2") > // Cache is empty at the beginning > assert(spark.sharedState.cacheManager.isEmpty) > sql("CACHE TABLE testView") > assert(spark.catalog.isCached("testView")) > // Cache is not empty > assert(!spark.sharedState.cacheManager.isEmpty) > {code} > {code} > // drop a table referenced by a cached view > sql("DROP TABLE jt1") > -- So far everything is fine > // Failed to uncache the view > val e = intercept[AnalysisException] { > sql("UNCACHE TABLE testView") > }.getMessage > assert(e.contains("Table or view not found: `default`.`jt1`")) > // We are unable to drop it from the cache > assert(!spark.sharedState.cacheManager.isEmpty) > {code}
[jira] [Updated] (SPARK-19765) UNCACHE TABLE should also un-cache all cached plans that refer to this table
[ https://issues.apache.org/jira/browse/SPARK-19765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-19765: Fix Version/s: 2.1.1 > UNCACHE TABLE should also un-cache all cached plans that refer to this table > > > Key: SPARK-19765 > URL: https://issues.apache.org/jira/browse/SPARK-19765 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Labels: release_notes > Fix For: 2.1.1, 2.2.0 > > > DropTableCommand, TruncateTableCommand, AlterTableRenameCommand, > UncacheTableCommand, RefreshTable and InsertIntoHiveTable will un-cache all > the cached plans that refer to this table
[jira] [Commented] (SPARK-19975) Add map_keys and map_values functions to Python
[ https://issues.apache.org/jira/browse/SPARK-19975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929342#comment-15929342 ] Yong Tang commented on SPARK-19975: --- Created a PR for that: https://github.com/apache/spark/pull/17328 Please take a look. > Add map_keys and map_values functions to Python > - > > Key: SPARK-19975 > URL: https://issues.apache.org/jira/browse/SPARK-19975 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Maciej Bryński > > We have `map_keys` and `map_values` functions in SQL. > There are no equivalent Python functions for them.
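For reference, the SQL {{map_keys}} and {{map_values}} functions simply return a map column's keys and values as arrays. Their semantics can be illustrated with a plain Java Map (this is an illustration of the behavior, not the PySpark implementation, which is added in the linked PR):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of what SQL's map_keys/map_values return.
class MapFunctionsSketch {
    static <K, V> List<K> mapKeys(Map<K, V> m) {
        return new ArrayList<>(m.keySet());
    }

    static <K, V> List<V> mapValues(Map<K, V> m) {
        return new ArrayList<>(m.values());
    }

    public static void main(String[] args) {
        Map<String, Integer> m = new LinkedHashMap<>();
        m.put("a", 1);
        m.put("b", 2);
        System.out.println(mapKeys(m));   // prints [a, b]
        System.out.println(mapValues(m)); // prints [1, 2]
    }
}
```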
[jira] [Updated] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using
[ https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-19990: --- Description: This test seems to be failing consistently on all of the maven builds: https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite_name=create+temporary+view+using and is possibly caused by SPARK-19763. Here's a stack trace for the failure: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv at org.apache.hadoop.fs.Path.initialize(Path.java:206) at org.apache.hadoop.fs.Path.(Path.java:172) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:344) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:343) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at scala.collection.immutable.List.flatMap(List.scala:344) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:343) at org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:91) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) at org.apache.spark.sql.Dataset.(Dataset.scala:183) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) at 
org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617) at org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62) at org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62) at org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38$$anonfun$apply$mcV$sp$8.apply$mcV$sp(DDLSuite.scala:705) at org.apache.spark.sql.test.SQLTestUtils$class.withView(SQLTestUtils.scala:186) at org.apache.spark.sql.execution.command.DDLSuite.withView(DDLSuite.scala:171) at org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply$mcV$sp(DDLSuite.scala:704) at org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701) at org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(HiveDDLSuite.scala:41) at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) at org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.runTest(HiveDDLSuite.scala:41) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:381) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
[jira] [Commented] (SPARK-19988) Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by Hive
[ https://issues.apache.org/jira/browse/SPARK-19988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929334#comment-15929334 ] Kay Ousterhout commented on SPARK-19988: With some help from [~joshrosen] I spent some time digging into this and found: (1) if you look at the failures, they're all from the maven build. In fact, 100% of the maven builds shown there fail (and none of the SBT ones). This is weird because this is also failing on the PR builder, which uses SBT. (2) The maven build failures are all accompanied by 3 other tests; the group of 4 tests seems to consistently fail together. 3 tests fail with errors similar to this one (saying that some database does not exist). The 4th test, org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using, fails with a more real error. I filed SPARK-19990 for that issue. (3) A commit right around the time the tests started failing: https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9#diff-b7094baa12601424a5d19cb930e3402fR46 added code to remove all of the databases after each test. I wonder if that's somehow getting run concurrently or asynchronously in the maven build (after the HiveCataloguedDDLSuite fails), which is why the error in the DDLSuite causes the other tests to fail saying that a database can't be found. I have extremely limited knowledge of both (a) how the maven tests are executed and (b) the SQL code so it's possible these are totally unrelated issues. None of this explains why the test is failing in the PR builder, where the failures have been isolated to this test. 
> Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column > written by Hive > > > Key: SPARK-19988 > URL: https://issues.apache.org/jira/browse/SPARK-19988 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 2.2.0 >Reporter: Imran Rashid > Labels: flaky-test > Attachments: trimmed-unit-test.log > > > "OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by > Hive" fails a lot -- right now, I see about a 50% pass rate in the last 3 > days here: > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.orc.OrcSourceSuite_name=SPARK-19459%2FSPARK-18220%3A+read+char%2Fvarchar+column+written+by+Hive > eg. > https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74683/testReport/junit/org.apache.spark.sql.hive.orc/OrcSourceSuite/SPARK_19459_SPARK_18220__read_char_varchar_column_written_by_Hive/ > {noformat} > sbt.ForkMain$ForkError: > org.apache.spark.sql.execution.QueryExecutionException: FAILED: > SemanticException [Error 10072]: Database does not exist: db2 > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:637) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:621) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:288) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:621) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:611) > at > org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply$mcV$sp(OrcSourceSuite.scala:160) > at > 
org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155) > at > org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155) > ... > {noformat}
[jira] [Created] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using
Kay Ousterhout created SPARK-19990: -- Summary: Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using Key: SPARK-19990 URL: https://issues.apache.org/jira/browse/SPARK-19990 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 2.2.0 Reporter: Kay Ousterhout This test seems to be failing consistently on all of the maven builds: https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite_name=create+temporary+view+using and is possibly caused by SPARK-19763.
[jira] [Resolved] (SPARK-19987) Pass all filters into FileIndex
[ https://issues.apache.org/jira/browse/SPARK-19987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-19987. - Resolution: Fixed Fix Version/s: 2.2.0 > Pass all filters into FileIndex > --- > > Key: SPARK-19987 > URL: https://issues.apache.org/jira/browse/SPARK-19987 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.2.0 > > > This is a tiny teeny refactoring to pass data filters also to the FileIndex, > so FileIndex can have a more global view on predicates.
[jira] [Commented] (SPARK-19982) JavaDatasetSuite.testJavaBeanEncoder sometimes fails with "Unable to generate an encoder for inner class"
[ https://issues.apache.org/jira/browse/SPARK-19982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929296#comment-15929296 ] Wenchen Fan commented on SPARK-19982: - yea makes sense, the test harness should hold the {{this}} reference. I have no idea why this can go wrong, maybe we should just move the test class to top-level so it's not an inner class anymore. > JavaDatasetSuite.testJavaBeanEncoder sometimes fails with "Unable to generate > an encoder for inner class" > - > > Key: SPARK-19982 > URL: https://issues.apache.org/jira/browse/SPARK-19982 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.0 >Reporter: Jose Soltren > Labels: flaky-test > > JavaDatasetSuite.testJavaBeanEncoder fails sporadically with the error below: > Unable to generate an encoder for inner class > `test.org.apache.spark.sql.JavaDatasetSuite$SimpleJavaBean` without access to > the scope that this class was defined in. Try moving this class out of its > parent class. > From https://spark-tests.appspot.com/test-logs/35475788 > [~vanzin] looked into this back in October and reported: > I ran this test in a loop (both alone and with the rest of the spark-sql > tests) and never got a failure. I even used the same JDK as Jenkins > (1.7.0_51). > Also looked at the code and nothing seems wrong. The errors is when an entry > with the parent class name is missing from the map kept in OuterScopes.scala, > but the test populates that map in its first line. So it doesn't look like a > race nor some issue with weak references (the map uses weak values). > public void testJavaBeanEncoder() { > OuterScopes.addOuterScope(this); -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19982) JavaDatasetSuite.testJavaBeanEncoder sometimes fails with "Unable to generate an encoder for inner class"
[ https://issues.apache.org/jira/browse/SPARK-19982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929275#comment-15929275 ] Michael Armbrust commented on SPARK-19982: -- I'm not sure if changing weak to strong references will change [anything|http://stackoverflow.com/questions/299659/what-is-the-difference-between-a-soft-reference-and-a-weak-reference-in-java]. It seems like there must be another handle to {{this}} since the test harness is actively executing it. So either way it shouldn't be available for garbage collection, or am I missing something? > JavaDatasetSuite.testJavaBeanEncoder sometimes fails with "Unable to generate > an encoder for inner class" > - > > Key: SPARK-19982 > URL: https://issues.apache.org/jira/browse/SPARK-19982 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.0 >Reporter: Jose Soltren > Labels: flaky-test > > JavaDatasetSuite.testJavaBeanEncoder fails sporadically with the error below: > Unable to generate an encoder for inner class > `test.org.apache.spark.sql.JavaDatasetSuite$SimpleJavaBean` without access to > the scope that this class was defined in. Try moving this class out of its > parent class. > From https://spark-tests.appspot.com/test-logs/35475788 > [~vanzin] looked into this back in October and reported: > I ran this test in a loop (both alone and with the rest of the spark-sql > tests) and never got a failure. I even used the same JDK as Jenkins > (1.7.0_51). > Also looked at the code and nothing seems wrong. The errors is when an entry > with the parent class name is missing from the map kept in OuterScopes.scala, > but the test populates that map in its first line. So it doesn't look like a > race nor some issue with weak references (the map uses weak values). 
> public void testJavaBeanEncoder() { > OuterScopes.addOuterScope(this);
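Michael's reachability argument and Wenchen's GC theory can be illustrated outside the JVM. Below is a minimal Python sketch of a weak-value map like the one kept in OuterScopes.scala (the class and key names here are made up for illustration, and Python's `WeakValueDictionary` is only an analogue of the JVM map, not Spark's code): while any strong reference to the registered object exists, the entry survives collection; only after the last strong reference is dropped can it vanish.

```python
import gc
import weakref

class OuterScope:
    """Illustrative stand-in for a test-suite instance registered as an
    outer scope; not Spark's actual class."""

# A map with weakly referenced values, analogous to the one kept in
# OuterScopes.scala.
scopes = weakref.WeakValueDictionary()

suite = OuterScope()
scopes["JavaDatasetSuite"] = suite

# While a strong reference ("suite") exists, a GC cannot clear the entry.
gc.collect()
present_while_referenced = "JavaDatasetSuite" in scopes

# Once the last strong reference is dropped, the entry disappears.
del suite
gc.collect()
present_after_drop = "JavaDatasetSuite" in scopes
```

This supports Michael's reading: as long as the test harness holds `this`, the map entry cannot be collected, so a weak-vs-soft distinction alone would not explain the flaky failure.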
[jira] [Commented] (SPARK-18789) Save Data frame with Null column-- exception
[ https://issues.apache.org/jira/browse/SPARK-18789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929272#comment-15929272 ] Hyukjin Kwon commented on SPARK-18789: -- Do you mind if I ask a simple code for this? Pseudocode is fine. (I am just trying to verify this) > Save Data frame with Null column-- exception > > > Key: SPARK-18789 > URL: https://issues.apache.org/jira/browse/SPARK-18789 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.2 >Reporter: Harish > > I am trying to save a DF to HDFS which is having 1 column is NULL(no data). > col1 col2 col3 > a 1 null > b 1 null > c1null > d 1 null > code : df.write.format("orc").save(path, mode='overwrite') > Error: > java.lang.IllegalArgumentException: Error: type expected at the position 49 > of 'string:string:string:double:string:double:string:null' but 'null' is > found. > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765) > at > org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104) > at > org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182) > at > org.apache.spark.sql.hive.orc.OrcOutputWriter.(OrcFileFormat.scala:225) > at > org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94) > at > org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247) > at > 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 16/12/08 19:41:49 ERROR TaskSetManager: Task 17 in stage 512.0 failed 4 > times; aborting job > 16/12/08 19:41:49 ERROR InsertIntoHadoopFsRelationCommand: Aborting job. > org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in > stage 512.0 failed 4 times, most recent failure: Lost task 17.3 in stage > 512.0 (TID 37290, 10.63.136.108): java.lang.IllegalArgumentException: Error: > type expected at the position 49 of > 'string:string:string:double:string:double:string:null' but 'null' is found. 
> at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765) > at > org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104) > at > org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182) > at > org.apache.spark.sql.hive.orc.OrcOutputWriter.(OrcFileFormat.scala:225) > at > org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94) > at > org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at >
[jira] [Commented] (SPARK-19969) Doc and examples for Imputer
[ https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929271#comment-15929271 ] yuhao yang commented on SPARK-19969: It looks like JIRA has stopped auto-linking PRs. https://github.com/apache/spark/pull/17324 > Doc and examples for Imputer > > > Key: SPARK-19969 > URL: https://issues.apache.org/jira/browse/SPARK-19969 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 2.2.0 >Reporter: Nick Pentreath >
[jira] [Commented] (SPARK-18789) Save Data frame with Null column-- exception
[ https://issues.apache.org/jira/browse/SPARK-18789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929259#comment-15929259 ] Harish commented on SPARK-18789: When you create the DF dynamically, without knowing the column types, you can't define the schema up front. In my case I don't know the type of one column. When the column type is not defined and the entire column is None, I get this error message. I hope that is clear. > Save Data frame with Null column-- exception > > > Key: SPARK-18789 > URL: https://issues.apache.org/jira/browse/SPARK-18789 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.2 >Reporter: Harish > > I am trying to save a DF to HDFS which is having 1 column is NULL(no data). > col1 col2 col3 > a 1 null > b 1 null > c1null > d 1 null > code : df.write.format("orc").save(path, mode='overwrite') > Error: > java.lang.IllegalArgumentException: Error: type expected at the position 49 > of 'string:string:string:double:string:double:string:null' but 'null' is > found.
> at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765) > at > org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104) > at > org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182) > at > org.apache.spark.sql.hive.orc.OrcOutputWriter.(OrcFileFormat.scala:225) > at > org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94) > at > org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 16/12/08 19:41:49 ERROR TaskSetManager: Task 17 in stage 512.0 failed 4 > times; aborting job > 16/12/08 19:41:49 ERROR 
InsertIntoHadoopFsRelationCommand: Aborting job. > org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in > stage 512.0 failed 4 times, most recent failure: Lost task 17.3 in stage > 512.0 (TID 37290, 10.63.136.108): java.lang.IllegalArgumentException: Error: > type expected at the position 49 of > 'string:string:string:double:string:double:string:null' but 'null' is found. > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765) > at > org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104) > at > org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182) > at > org.apache.spark.sql.hive.orc.OrcOutputWriter.(OrcFileFormat.scala:225) > at > org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94) > at > org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247) > at >
[jira] [Updated] (SPARK-19964) Flaky test: SparkSubmitSuite fails due to Timeout
[ https://issues.apache.org/jira/browse/SPARK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-19964: --- Summary: Flaky test: SparkSubmitSuite fails due to Timeout (was: SparkSubmitSuite fails due to Timeout) > Flaky test: SparkSubmitSuite fails due to Timeout > - > > Key: SPARK-19964 > URL: https://issues.apache.org/jira/browse/SPARK-19964 > Project: Spark > Issue Type: Bug > Components: Deploy, Tests >Affects Versions: 2.2.0 >Reporter: Eren Avsarogullari > Labels: flaky-test > Attachments: SparkSubmitSuite_Stacktrace > > > The following test case failed due to a TestFailedDueToTimeoutException > *Test Suite:* SparkSubmitSuite > *Test Case:* includes jars passed in through --packages > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74413/testReport/ > *Stacktrace is also attached.*
[jira] [Comment Edited] (SPARK-19964) Flaky test: SparkSubmitSuite fails due to Timeout
[ https://issues.apache.org/jira/browse/SPARK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929257#comment-15929257 ] Kay Ousterhout edited comment on SPARK-19964 at 3/17/17 12:54 AM: -- [~srowen] it looks like this is failing periodically in master: https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.deploy.SparkSubmitSuite_name=includes+jars+passed+in+through+--jars (I added "flaky" to the name, which I suspect is the source of the confusion) was (Author: kayousterhout): [~srowen] it looks like this is failing periodically in master: https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.deploy.SparkSubmitSuite_name=includes+jars+passed+in+through+--jars > Flaky test: SparkSubmitSuite fails due to Timeout > - > > Key: SPARK-19964 > URL: https://issues.apache.org/jira/browse/SPARK-19964 > Project: Spark > Issue Type: Bug > Components: Deploy, Tests >Affects Versions: 2.2.0 >Reporter: Eren Avsarogullari > Labels: flaky-test > Attachments: SparkSubmitSuite_Stacktrace > > > The following test case failed due to a TestFailedDueToTimeoutException > *Test Suite:* SparkSubmitSuite > *Test Case:* includes jars passed in through --packages > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74413/testReport/ > *Stacktrace is also attached.*
[jira] [Commented] (SPARK-19964) SparkSubmitSuite fails due to Timeout
[ https://issues.apache.org/jira/browse/SPARK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929257#comment-15929257 ] Kay Ousterhout commented on SPARK-19964: [~srowen] it looks like this is failing periodically in master: https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.deploy.SparkSubmitSuite_name=includes+jars+passed+in+through+--jars > SparkSubmitSuite fails due to Timeout > - > > Key: SPARK-19964 > URL: https://issues.apache.org/jira/browse/SPARK-19964 > Project: Spark > Issue Type: Bug > Components: Deploy, Tests >Affects Versions: 2.2.0 >Reporter: Eren Avsarogullari > Labels: flaky-test > Attachments: SparkSubmitSuite_Stacktrace > > > The following test case failed due to a TestFailedDueToTimeoutException > *Test Suite:* SparkSubmitSuite > *Test Case:* includes jars passed in through --packages > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74413/testReport/ > *Stacktrace is also attached.*
[jira] [Commented] (SPARK-19982) JavaDatasetSuite.testJavaBeanEncoder sometimes fails with "Unable to generate an encoder for inner class"
[ https://issues.apache.org/jira/browse/SPARK-19982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929250#comment-15929250 ] Wenchen Fan commented on SPARK-19982: - I think this is caused by weak references, a GC may happen right after users call `OuterScopes.addOuterScope(this);` Shall we use soft reference? cc [~marmbrus] > JavaDatasetSuite.testJavaBeanEncoder sometimes fails with "Unable to generate > an encoder for inner class" > - > > Key: SPARK-19982 > URL: https://issues.apache.org/jira/browse/SPARK-19982 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.0 >Reporter: Jose Soltren > Labels: flaky-test > > JavaDatasetSuite.testJavaBeanEncoder fails sporadically with the error below: > Unable to generate an encoder for inner class > `test.org.apache.spark.sql.JavaDatasetSuite$SimpleJavaBean` without access to > the scope that this class was defined in. Try moving this class out of its > parent class. > From https://spark-tests.appspot.com/test-logs/35475788 > [~vanzin] looked into this back in October and reported: > I ran this test in a loop (both alone and with the rest of the spark-sql > tests) and never got a failure. I even used the same JDK as Jenkins > (1.7.0_51). > Also looked at the code and nothing seems wrong. The errors is when an entry > with the parent class name is missing from the map kept in OuterScopes.scala, > but the test populates that map in its first line. So it doesn't look like a > race nor some issue with weak references (the map uses weak values). > public void testJavaBeanEncoder() { > OuterScopes.addOuterScope(this); -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19635) Feature parity for Chi-square hypothesis testing in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-19635. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17110 [https://github.com/apache/spark/pull/17110] > Feature parity for Chi-square hypothesis testing in MLlib > - > > Key: SPARK-19635 > URL: https://issues.apache.org/jira/browse/SPARK-19635 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Joseph K. Bradley > Fix For: 2.2.0 > > > This ticket tracks porting the functionality of > spark.mllib.Statistics.chiSqTest over to spark.ml. > Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#
[jira] [Updated] (SPARK-19988) Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by Hive
[ https://issues.apache.org/jira/browse/SPARK-19988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-19988: --- Component/s: SQL > Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column > written by Hive > > > Key: SPARK-19988 > URL: https://issues.apache.org/jira/browse/SPARK-19988 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 2.2.0 >Reporter: Imran Rashid > Labels: flaky-test > Attachments: trimmed-unit-test.log > > > "OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by > Hive" fails a lot -- right now, I see about a 50% pass rate in the last 3 > days here: > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.orc.OrcSourceSuite_name=SPARK-19459%2FSPARK-18220%3A+read+char%2Fvarchar+column+written+by+Hive > eg. > https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74683/testReport/junit/org.apache.spark.sql.hive.orc/OrcSourceSuite/SPARK_19459_SPARK_18220__read_char_varchar_column_written_by_Hive/ > {noformat} > sbt.ForkMain$ForkError: > org.apache.spark.sql.execution.QueryExecutionException: FAILED: > SemanticException [Error 10072]: Database does not exist: db2 > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:637) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:621) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:288) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:621) > at > 
org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:611) > at > org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply$mcV$sp(OrcSourceSuite.scala:160) > at > org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155) > at > org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155) > ... > {noformat}
[jira] [Created] (SPARK-19989) Flaky Test: org.apache.spark.sql.kafka010.KafkaSourceStressSuite
Kay Ousterhout created SPARK-19989: -- Summary: Flaky Test: org.apache.spark.sql.kafka010.KafkaSourceStressSuite Key: SPARK-19989 URL: https://issues.apache.org/jira/browse/SPARK-19989 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 2.2.0 Reporter: Kay Ousterhout Priority: Minor This test failed recently here: https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74683/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceStressSuite/stress_test_with_multiple_topics_and_partitions/ And based on Josh's dashboard (https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaSourceStressSuite_name=stress+test+with+multiple+topics+and+partitions), seems to fail a few times every month. Here's the full error from the most recent failure: Error Message org.scalatest.exceptions.TestFailedException: Error adding data: replication factor: 1 larger than available brokers: 0 kafka.admin.AdminUtils$.assignReplicasToBrokers(AdminUtils.scala:117) kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:403) org.apache.spark.sql.kafka010.KafkaTestUtils.createTopic(KafkaTestUtils.scala:173) org.apache.spark.sql.kafka010.KafkaSourceStressSuite$$anonfun$16$$anonfun$apply$mcV$sp$17$$anonfun$37.apply(KafkaSourceSuite.scala:903) org.apache.spark.sql.kafka010.KafkaSourceStressSuite$$anonfun$16$$anonfun$apply$mcV$sp$17$$anonfun$37.apply(KafkaSourceSuite.scala:901) org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData$$anonfun$addData$1.apply(KafkaSourceSuite.scala:93) org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData$$anonfun$addData$1.apply(KafkaSourceSuite.scala:92) scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316) org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData.addData(KafkaSourceSuite.scala:92) org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:494) == Progress ==AssertOnQuery(, )CheckAnswer: StopStream 
StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@5d888be0,Map()) AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), data = Range(0, 1, 2, 3, 4, 5, 6, 7, 8), message = )CheckAnswer: [1],[2],[3],[4],[5],[6],[7],[8],[9]StopStream StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@1be724ee,Map()) AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), data = Range(9, 10, 11, 12, 13, 14), message = )CheckAnswer: [1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12],[13],[14],[15]StopStream AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), data = Range(), message = ) => AddKafkaData(topics = Set(stress4, stress6, stress2, stress1, stress5, stress3), data = Range(15), message = Add topic stress7) AddKafkaData(topics = Set(stress4, stress6, stress2, stress1, stress5, stress3), data = Range(16, 17, 18, 19, 20, 21, 22), message = Add partition) AddKafkaData(topics = Set(stress4, stress6, stress2, stress1, stress5, stress3), data = Range(23, 24), message = Add partition)AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(), message = Add topic stress9)AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(25, 26, 27, 28, 29, 30, 31, 32, 33), message = )AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(), message = )AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(), message = )AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(34, 35, 36, 37, 38, 39), message = )AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(40, 41, 42, 43), message = )AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(44), message = Add partition)AddKafkaData(topics 
= Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(45, 46, 47, 48, 49, 50, 51, 52), message = Add partition)AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(53, 54, 55), message = )AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(56, 57, 58, 59, 60, 61, 62, 63), message = Add partition)AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(64, 65, 66, 67, 68, 69, 70), message = ) StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@65068637,Map()) AddKafkaData(topics = Set(stress4, stress6, stress2,
[jira] [Commented] (SPARK-19988) Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by Hive
[ https://issues.apache.org/jira/browse/SPARK-19988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929107#comment-15929107 ] Herman van Hovell commented on SPARK-19988: --- It is probably some other test changing the current database to {{db2}}. This is super annoying to debug, and the only solution I see is that we fix the database names in the test. > Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column > written by Hive > > > Key: SPARK-19988 > URL: https://issues.apache.org/jira/browse/SPARK-19988 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.2.0 >Reporter: Imran Rashid > Labels: flaky-test > Attachments: trimmed-unit-test.log > > > "OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by > Hive" fails a lot -- right now, I see about a 50% pass rate in the last 3 > days here: > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.orc.OrcSourceSuite_name=SPARK-19459%2FSPARK-18220%3A+read+char%2Fvarchar+column+written+by+Hive > eg. 
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74683/testReport/junit/org.apache.spark.sql.hive.orc/OrcSourceSuite/SPARK_19459_SPARK_18220__read_char_varchar_column_written_by_Hive/ > {noformat} > sbt.ForkMain$ForkError: > org.apache.spark.sql.execution.QueryExecutionException: FAILED: > SemanticException [Error 10072]: Database does not exist: db2 > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:637) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:621) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:288) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:621) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:611) > at > org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply$mcV$sp(OrcSourceSuite.scala:160) > at > org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155) > at > org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155) > ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
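Herman's suggested fix, giving each suite its own database name so that no other test sharing the metastore can collide on a shared name like {{db2}}, can be sketched as follows. The helper below is hypothetical, written in Python purely for illustration (Spark's suites are Scala), and `suite_db_name` is not a real Spark test utility:

```python
import uuid

def suite_db_name(suite_name):
    # Hypothetical helper: derive a database name unique to this suite
    # and this run, so another test that creates or drops databases
    # cannot clash with it the way a shared name like "db2" can.
    return "{}_{}".format(suite_name.lower(), uuid.uuid4().hex[:8])

name = suite_db_name("OrcSourceSuite")  # e.g. "orcsourcesuite_3fa1b2c4"
```

A fixed, suite-prefixed name (without the random suffix) would also work, as long as every suite sticks to its own prefix.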
[jira] [Commented] (SPARK-18789) Save Data frame with Null column-- exception
[ https://issues.apache.org/jira/browse/SPARK-18789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929084#comment-15929084 ] Hyukjin Kwon commented on SPARK-18789: -- It seems it fails during schema inference.
{code}
>>> data = [
...     ["a", 1, None],
...     ["b", 1, None],
...     ["c", 1, None],
...     ["d", 1, None],
... ]
>>> df = spark.createDataFrame(data)
Traceback (most recent call last):
  File "", line 1, in
  File ".../spark/python/pyspark/sql/session.py", line 526, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File ".../spark/python/pyspark/sql/session.py", line 390, in _createFromLocal
    struct = self._inferSchemaFromList(data)
  File ".../spark/python/pyspark/sql/session.py", line 324, in _inferSchemaFromList
    raise ValueError("Some of types cannot be determined after inferring")
ValueError: Some of types cannot be determined after inferring
{code}
That's why I specified the schema. Did I maybe misunderstand your comment? > Save Data frame with Null column-- exception > > > Key: SPARK-18789 > URL: https://issues.apache.org/jira/browse/SPARK-18789 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.2 >Reporter: Harish > > I am trying to save a DF to HDFS which is having 1 column is NULL(no data). > col1 col2 col3 > a 1 null > b 1 null > c 1 null > d 1 null > code : df.write.format("orc").save(path, mode='overwrite') > Error: > java.lang.IllegalArgumentException: Error: type expected at the position 49 > of 'string:string:string:double:string:double:string:null' but 'null' is > found.
> at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765) > at > org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104) > at > org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182) > at > org.apache.spark.sql.hive.orc.OrcOutputWriter.(OrcFileFormat.scala:225) > at > org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94) > at > org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 16/12/08 19:41:49 ERROR TaskSetManager: Task 17 in stage 512.0 failed 4 > times; aborting job > 16/12/08 19:41:49 ERROR 
InsertIntoHadoopFsRelationCommand: Aborting job. > org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in > stage 512.0 failed 4 times, most recent failure: Lost task 17.3 in stage > 512.0 (TID 37290, 10.63.136.108): java.lang.IllegalArgumentException: Error: > type expected at the position 49 of > 'string:string:string:double:string:double:string:null' but 'null' is found. > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765) > at >
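The stack trace above fails inside Hive's {{TypeInfoUtils}} while parsing the colon-separated column-type string, because the all-null column was recorded with type {{null}}, which is not a valid Hive type name. The toy parser below (plain Python, not Hive's actual code) mimics that failure mode:

```python
# Toy mimic of parsing a Hive column-type string such as
# 'string:string:double'. Every colon-separated token must be a known
# type name; an undetermined column surfaces as 'null' and is rejected,
# matching the shape of the error above. Illustration only.
KNOWN_TYPES = {"string", "int", "bigint", "double", "boolean"}

def parse_type_string(type_string):
    position = 0
    tokens = type_string.split(":")
    for token in tokens:
        if token not in KNOWN_TYPES:
            raise ValueError(
                "type expected at the position %d of '%s' but '%s' is found"
                % (position, type_string, token))
        position += len(token) + 1  # account for the ':' separator
    return tokens

# A fully typed schema parses fine.
assert parse_type_string("string:string:double") == ["string", "string", "double"]

# A column whose type was never determined fails, as in the report above.
try:
    parse_type_string("string:int:null")
except ValueError as error:
    print(error)
```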
[jira] [Updated] (SPARK-19988) Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by Hive
[ https://issues.apache.org/jira/browse/SPARK-19988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-19988: - Attachment: trimmed-unit-test.log Attaching a trimmed version of the unit-test.log file, though nothing looks particularly notable in it to me. also I tried running this test on my laptop in a loop, and it passed 100 times in a row. > Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column > written by Hive > > > Key: SPARK-19988 > URL: https://issues.apache.org/jira/browse/SPARK-19988 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.2.0 >Reporter: Imran Rashid > Labels: flaky-test > Attachments: trimmed-unit-test.log > > > "OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by > Hive" fails a lot -- right now, I see about a 50% pass rate in the last 3 > days here: > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.orc.OrcSourceSuite_name=SPARK-19459%2FSPARK-18220%3A+read+char%2Fvarchar+column+written+by+Hive > eg. 
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74683/testReport/junit/org.apache.spark.sql.hive.orc/OrcSourceSuite/SPARK_19459_SPARK_18220__read_char_varchar_column_written_by_Hive/ > {noformat} > sbt.ForkMain$ForkError: > org.apache.spark.sql.execution.QueryExecutionException: FAILED: > SemanticException [Error 10072]: Database does not exist: db2 > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:637) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:621) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:288) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:621) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:611) > at > org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply$mcV$sp(OrcSourceSuite.scala:160) > at > org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155) > at > org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155) > ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19988) Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by Hive
Imran Rashid created SPARK-19988: Summary: Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by Hive Key: SPARK-19988 URL: https://issues.apache.org/jira/browse/SPARK-19988 Project: Spark Issue Type: Test Components: Tests Affects Versions: 2.2.0 Reporter: Imran Rashid "OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by Hive" fails a lot -- right now, I see about a 50% pass rate in the last 3 days here: https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.orc.OrcSourceSuite_name=SPARK-19459%2FSPARK-18220%3A+read+char%2Fvarchar+column+written+by+Hive eg. https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74683/testReport/junit/org.apache.spark.sql.hive.orc/OrcSourceSuite/SPARK_19459_SPARK_18220__read_char_varchar_column_written_by_Hive/ {noformat} sbt.ForkMain$ForkError: org.apache.spark.sql.execution.QueryExecutionException: FAILED: SemanticException [Error 10072]: Database does not exist: db2 at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:637) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:621) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:288) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271) at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:621) at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:611) at org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply$mcV$sp(OrcSourceSuite.scala:160) at org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155) at 
org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155) ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19987) Pass all filters into FileIndex
[ https://issues.apache.org/jira/browse/SPARK-19987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-19987: Description: This is a tiny teeny refactoring to pass data filters also to the FileIndex, so FileIndex can have a more global view on predicates. was:This is a tiny teeny refactoring to pass data filters also to the FileIndex. > Pass all filters into FileIndex > --- > > Key: SPARK-19987 > URL: https://issues.apache.org/jira/browse/SPARK-19987 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > > This is a tiny teeny refactoring to pass data filters also to the FileIndex, > so FileIndex can have a more global view on predicates. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19982) JavaDatasetSuite.testJavaBeanEncoder sometimes fails with "Unable to generate an encoder for inner class"
[ https://issues.apache.org/jira/browse/SPARK-19982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-19982: - Labels: flaky-test (was: ) > JavaDatasetSuite.testJavaBeanEncoder sometimes fails with "Unable to generate > an encoder for inner class" > - > > Key: SPARK-19982 > URL: https://issues.apache.org/jira/browse/SPARK-19982 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.0 >Reporter: Jose Soltren > Labels: flaky-test > > JavaDatasetSuite.testJavaBeanEncoder fails sporadically with the error below: > Unable to generate an encoder for inner class > `test.org.apache.spark.sql.JavaDatasetSuite$SimpleJavaBean` without access to > the scope that this class was defined in. Try moving this class out of its > parent class. > From https://spark-tests.appspot.com/test-logs/35475788 > [~vanzin] looked into this back in October and reported: > I ran this test in a loop (both alone and with the rest of the spark-sql > tests) and never got a failure. I even used the same JDK as Jenkins > (1.7.0_51). > Also looked at the code and nothing seems wrong. The error occurs when an entry > with the parent class name is missing from the map kept in OuterScopes.scala, > but the test populates that map in its first line. So it doesn't look like a > race nor some issue with weak references (the map uses weak values). > public void testJavaBeanEncoder() { > OuterScopes.addOuterScope(this); -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
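The note that the OuterScopes map "uses weak values" refers to a map whose entries disappear once nothing else strongly references the value. The plain-Python sketch below (using {{weakref.WeakValueDictionary}} as an analogue — not Spark's Scala code) shows that behavior, and why it was ruled out here as long as the outer instance stays strongly reachable:

```python
import gc
import weakref

class Outer:
    """Hypothetical stand-in for an outer-class instance registered
    in a weak-values map like OuterScopes."""
    pass

scopes = weakref.WeakValueDictionary()

outer = Outer()
scopes["Outer"] = outer
# While a strong reference ('outer') exists, the entry is present.
assert "Outer" in scopes

# Drop the only strong reference; the weak-values map loses the entry.
del outer
gc.collect()
assert "Outer" not in scopes
```

In the failing test the suite instance ({{this}}) is strongly reachable for the whole test, so the entry should not vanish this way — which is why the flakiness remained unexplained.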
[jira] [Created] (SPARK-19987) Pass all filters into FileIndex
Reynold Xin created SPARK-19987: --- Summary: Pass all filters into FileIndex Key: SPARK-19987 URL: https://issues.apache.org/jira/browse/SPARK-19987 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Reynold Xin Assignee: Reynold Xin This is a tiny teeny refactoring to pass data filters also to the FileIndex. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928928#comment-15928928 ] Drew Robb commented on SPARK-12664: --- This feature is also very important to me. I'm considering working on it myself and will post here if I begin that. > Expose raw prediction scores in MultilayerPerceptronClassificationModel > --- > > Key: SPARK-12664 > URL: https://issues.apache.org/jira/browse/SPARK-12664 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Robert Dodier >Assignee: Yanbo Liang > > In > org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, > there isn't any way to get raw prediction scores; only an integer output > (from 0 to #classes - 1) is available via the `predict` method. > `mplModel.predict` is called within the class to get the raw score, but > `mlpModel` is private so that isn't available to outside callers. > The raw score is useful when the user wants to interpret the classifier > output as a probability. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18789) Save Data frame with Null column-- exception
[ https://issues.apache.org/jira/browse/SPARK-18789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928926#comment-15928926 ] Harish commented on SPARK-18789: In your example you define the schema first and then load the data, which works. Try creating the DF without defining the schema (column types). > Save Data frame with Null column-- exception > > > Key: SPARK-18789 > URL: https://issues.apache.org/jira/browse/SPARK-18789 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.2 >Reporter: Harish > > I am trying to save a DF to HDFS which is having 1 column is NULL(no data). > col1 col2 col3 > a 1 null > b 1 null > c 1 null > d 1 null > code : df.write.format("orc").save(path, mode='overwrite') > Error: > java.lang.IllegalArgumentException: Error: type expected at the position 49 > of 'string:string:string:double:string:double:string:null' but 'null' is > found. > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765) > at > org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104) > at > org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182) > at > org.apache.spark.sql.hive.orc.OrcOutputWriter.(OrcFileFormat.scala:225) > at > org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94) > at > org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131) > at >
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 16/12/08 19:41:49 ERROR TaskSetManager: Task 17 in stage 512.0 failed 4 > times; aborting job > 16/12/08 19:41:49 ERROR InsertIntoHadoopFsRelationCommand: Aborting job. > org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in > stage 512.0 failed 4 times, most recent failure: Lost task 17.3 in stage > 512.0 (TID 37290, 10.63.136.108): java.lang.IllegalArgumentException: Error: > type expected at the position 49 of > 'string:string:string:double:string:double:string:null' but 'null' is found. 
> at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765) > at > org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104) > at > org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182) > at > org.apache.spark.sql.hive.orc.OrcOutputWriter.(OrcFileFormat.scala:225) > at > org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94) > at > org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
[jira] [Updated] (SPARK-19985) Some ML Models error when copy or do not set parent
[ https://issues.apache.org/jira/browse/SPARK-19985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-19985: - Description: Some ML Models fail when copied due to not having a default constructor and implementing {{copy}} with {{defaultCopy}}. Other cases do not properly set the parent when the model is copied. These models were missing the normal check that tests for these in the test suites. Models with issues are: * RFormulaModel * MultilayerPerceptronClassificationModel * BucketedRandomProjectionLSHModel * MinHashLSH was:Some ML Models fail when copied due to not having a default constructor and implementing {{copy}} with {{defaultCopy}}. Other cases do not properly set the parent when the model is copied. These models were missing the normal check that tests for these in the test suites. > Some ML Models error when copy or do not set parent > --- > > Key: SPARK-19985 > URL: https://issues.apache.org/jira/browse/SPARK-19985 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Bryan Cutler > > Some ML Models fail when copied due to not having a default constructor and > implementing {{copy}} with {{defaultCopy}}. Other cases do not properly set > the parent when the model is copied. These models were missing the normal > check that tests for these in the test suites. > Models with issues are: > * RFormulaModel > * MultilayerPerceptronClassificationModel > * BucketedRandomProjectionLSHModel > * MinHashLSH -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19986) Make pyspark.streaming.tests.CheckpointTests more stable
Shixiong Zhu created SPARK-19986: Summary: Make pyspark.streaming.tests.CheckpointTests more stable Key: SPARK-19986 URL: https://issues.apache.org/jira/browse/SPARK-19986 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 2.1.0 Reporter: Shixiong Zhu Sometimes, CheckpointTests will hang because the streaming jobs are too slow and cannot catch up. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
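One common way to make such tests more stable is to replace fixed sleeps with a bounded polling wait, so a slow streaming job gets more time without letting a broken one hang the build. The helper below is a plain-Python sketch of that idea (hypothetical names; not the actual PySpark test code):

```python
import time

def wait_for(condition, timeout=30.0, interval=0.1):
    """Poll `condition` until it returns a truthy value or the deadline
    passes; fail with a clear error instead of hanging indefinitely."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(interval)
    raise AssertionError("condition not met within %.1fs" % timeout)

# Example: wait until a (simulated) streaming job has produced 3 batches.
processed = []

def batch_arrived():
    processed.append(len(processed))  # pretend another batch completed
    return len(processed) >= 3

assert wait_for(batch_arrived, timeout=5.0, interval=0.01)
```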
[jira] [Commented] (SPARK-19985) Some ML Models error when copy or do not set parent
[ https://issues.apache.org/jira/browse/SPARK-19985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928924#comment-15928924 ] Bryan Cutler commented on SPARK-19985: -- I'll fix this > Some ML Models error when copy or do not set parent > --- > > Key: SPARK-19985 > URL: https://issues.apache.org/jira/browse/SPARK-19985 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Bryan Cutler > > Some ML Models fail when copied due to not having a default constructor and > implementing {{copy}} with {{defaultCopy}}. Other cases do not properly set > the parent when the model is copied. These models were missing the normal > check that tests for these in the test suites. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19985) Some ML Models error when copy or do not set parent
Bryan Cutler created SPARK-19985: Summary: Some ML Models error when copy or do not set parent Key: SPARK-19985 URL: https://issues.apache.org/jira/browse/SPARK-19985 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.2.0 Reporter: Bryan Cutler Some ML Models fail when copied due to not having a default constructor and implementing {{copy}} with {{defaultCopy}}. Other cases do not properly set the parent when the model is copied. These models were missing the normal check that tests for these in the test suites. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19984) ERROR codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'
Andrey Yakovenko created SPARK-19984: Summary: ERROR codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java' Key: SPARK-19984 URL: https://issues.apache.org/jira/browse/SPARK-19984 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 2.1.0 Reporter: Andrey Yakovenko I have hit this error a few times on my local Hadoop 2.7.3 + Spark 2.1.0 environment. It is not a permanent error; on the next run it can disappear. Unfortunately I don't know how to reproduce the issue. As you can see from the log, my logic is pretty complicated. Here is a part of the log I've got (container_1489514660953_0015_01_01) {code} 17/03/16 11:07:04 ERROR codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 151, Column 29: A method named "compare" is not declared in any enclosing class nor any supertype, nor through a static import /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIterator(references); /* 003 */ } /* 004 */ /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator { /* 006 */ private Object[] references; /* 007 */ private scala.collection.Iterator[] inputs; /* 008 */ private boolean agg_initAgg; /* 009 */ private boolean agg_bufIsNull; /* 010 */ private long agg_bufValue; /* 011 */ private boolean agg_initAgg1; /* 012 */ private boolean agg_bufIsNull1; /* 013 */ private long agg_bufValue1; /* 014 */ private scala.collection.Iterator smj_leftInput; /* 015 */ private scala.collection.Iterator smj_rightInput; /* 016 */ private InternalRow smj_leftRow; /* 017 */ private InternalRow smj_rightRow; /* 018 */ private UTF8String smj_value2; /* 019 */ private java.util.ArrayList smj_matches; /* 020 */ private UTF8String smj_value3; /* 021 */ private UTF8String smj_value4; /* 022 */ private org.apache.spark.sql.execution.metric.SQLMetric smj_numOutputRows; /* 023 */ private UnsafeRow
smj_result; /* 024 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder smj_holder; /* 025 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter smj_rowWriter; /* 026 */ private org.apache.spark.sql.execution.metric.SQLMetric agg_numOutputRows; /* 027 */ private org.apache.spark.sql.execution.metric.SQLMetric agg_aggTime; /* 028 */ private UnsafeRow agg_result; /* 029 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder; /* 030 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter agg_rowWriter; /* 031 */ private org.apache.spark.sql.execution.metric.SQLMetric agg_numOutputRows1; /* 032 */ private org.apache.spark.sql.execution.metric.SQLMetric agg_aggTime1; /* 033 */ private UnsafeRow agg_result1; /* 034 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder1; /* 035 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter agg_rowWriter1; /* 036 */ /* 037 */ public GeneratedIterator(Object[] references) { /* 038 */ this.references = references; /* 039 */ } /* 040 */ /* 041 */ public void init(int index, scala.collection.Iterator[] inputs) { /* 042 */ partitionIndex = index; /* 043 */ this.inputs = inputs; /* 044 */ wholestagecodegen_init_0(); /* 045 */ wholestagecodegen_init_1(); /* 046 */ /* 047 */ } /* 048 */ /* 049 */ private void wholestagecodegen_init_0() { /* 050 */ agg_initAgg = false; /* 051 */ /* 052 */ agg_initAgg1 = false; /* 053 */ /* 054 */ smj_leftInput = inputs[0]; /* 055 */ smj_rightInput = inputs[1]; /* 056 */ /* 057 */ smj_rightRow = null; /* 058 */ /* 059 */ smj_matches = new java.util.ArrayList(); /* 060 */ /* 061 */ this.smj_numOutputRows = (org.apache.spark.sql.execution.metric.SQLMetric) references[0]; /* 062 */ smj_result = new UnsafeRow(2); /* 063 */ this.smj_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(smj_result, 64); /* 064 */ this.smj_rowWriter = new 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(smj_holder, 2); /* 065 */ this.agg_numOutputRows = (org.apache.spark.sql.execution.metric.SQLMetric) references[1]; /* 066 */ this.agg_aggTime = (org.apache.spark.sql.execution.metric.SQLMetric) references[2]; /* 067 */ agg_result = new UnsafeRow(1); /* 068 */ this.agg_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(agg_result, 0); /* 069 */ this.agg_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(agg_holder, 1); /* 070 */ this.agg_numOutputRows1 = (org.apache.spark.sql.execution.metric.SQLMetric) references[3]; /* 071 */ this.agg_aggTime1 =
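Whole-stage codegen compiles generated Java with Janino at query time, so a bad generated method (like the missing {{compare}} above) surfaces only at that point. The snippet below is a toy Python analogue — using Python's built-in {{compile}} as a stand-in for Janino, with a fallback path — to illustrate the guarded-compilation idea; it is not Spark's implementation, and whether a given Spark version falls back or fails the query depends on the version and settings:

```python
def run_with_codegen_fallback(source, fallback):
    """Try to compile and run generated source; if the compiler rejects
    it, take a slower fallback path instead of failing outright.
    Toy analogue only -- not Spark's actual codegen machinery."""
    try:
        code = compile(source, "generated.py", "exec")
    except SyntaxError as error:
        print("failed to compile generated code, falling back:", error.msg)
        return fallback()
    namespace = {}
    exec(code, namespace)
    return namespace["result"]

good_source = "result = sum(range(10))"
bad_source = "result = sum(range(10)"  # broken generated code

assert run_with_codegen_fallback(good_source, lambda: -1) == 45
assert run_with_codegen_fallback(bad_source, lambda: -1) == -1
```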
[jira] [Commented] (SPARK-19969) Doc and examples for Imputer
[ https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928854#comment-15928854 ] Nick Pentreath commented on SPARK-19969: Ok - I can help on it but probably only some time next week. > Doc and examples for Imputer > > > Key: SPARK-19969 > URL: https://issues.apache.org/jira/browse/SPARK-19969 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 2.2.0 >Reporter: Nick Pentreath > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19899) FPGrowth input column naming
[ https://issues.apache.org/jira/browse/SPARK-19899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928827#comment-15928827 ] Maciej Szymkiewicz commented on SPARK-19899: [~mlnick] For some reason SparkQA recognized the PR but it is not reflected on JIRA :/ So manually: https://github.com/apache/spark/pull/17321 > FPGrowth input column naming > > > Key: SPARK-19899 > URL: https://issues.apache.org/jira/browse/SPARK-19899 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Maciej Szymkiewicz > > Current implementation extends {{HasFeaturesCol}}. Personally I find it > rather unfortunate. Up to this moment we used consistent conventions - if we > mix-in {{HasFeaturesCol}} the {{featuresCol}} should be {{VectorUDT}}. > Using the same {{Param}} for an {{array<string>}} (and possibly for > {{array<array<string>>}} once {{PrefixSpan}} is ported to {{ml}}) will be > confusing for the users. > I would like to suggest adding a new {{trait}} (let's say > {{HasTransactionsCol}}) to clearly indicate that the input type differs from > the other {{Estimators}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19713) saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-19713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928802#comment-15928802 ] Hyukjin Kwon commented on SPARK-19713: -- > please suggest what you think the title should be Describe the problem you met in one line. The title, {{saveAsTable}}, is not helpful. Imagine you manage JIRAs and see a JIRA titled, for example, {{cache}}. Likewise, saveAsTable what? The title is obviously incomplete and does not describe the problem. Check other JIRAs https://issues.apache.org/jira/browse/SPARK-19713?jql=text%20~%20%22saveAsTable%22 > saveAsTable > --- > > Key: SPARK-19713 > URL: https://issues.apache.org/jira/browse/SPARK-19713 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Balaram R Gadiraju > > Hi, > I just observed that when we use dataframe.saveAsTable("table") -- in > old versions > and dataframe.write.saveAsTable("table") -- in the newer versions > When using the method “df3.saveAsTable("brokentable")” in > Scala code, this creates a folder in HDFS but doesn’t tell the Hive metastore > that it plans to create the table. So if anything goes wrong in between, the > folder still exists and Hive is not aware of the folder creation. This will > block the users from creating the table “brokentable” as the folder already > exists; we can remove the folder using “hadoop fs –rmr > /data/hive/databases/testdb.db/brokentable”. So below is the workaround > which will enable you to continue the development work. > Current Code: > val df3 = sqlContext.sql("select * from testtable") > df3.saveAsTable("brokentable") > THE WORKAROUND: > Registering the DataFrame as a temp table and then using a SQL command to load the > data will resolve the issue.
EX: > sqlContext.sql("select * from testtable").registerTempTable("df3") > sqlContext.sql("CREATE TABLE brokentable AS SELECT * FROM df3")
[jira] [Commented] (SPARK-18789) Save Data frame with Null column-- exception
[ https://issues.apache.org/jira/browse/SPARK-18789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928780#comment-15928780 ] Hyukjin Kwon commented on SPARK-18789: -- Hm, do you mind if I ask a reproducer? {code} from pyspark.sql import Row from pyspark.sql.types import * data = [ ["a", 1, None], ["b", 1, None], ["c", 1, None], ["d", 1, None], ] schema = StructType([ StructField("col1", StringType(), True), StructField("col2", IntegerType(), True), StructField("col3", StringType(), True)]) df = spark.createDataFrame(data, schema) df.write.format("orc").save("hdfs://localhost:9000/tmp/squares", mode='overwrite') spark.read.orc("hdfs://localhost:9000/tmp/squares").show() {code} produces {code} ++++ |col1|col2|col3| ++++ | a| 1|null| | b| 1|null| | c| 1|null| | d| 1|null| ++++ {code} This seems working fine. > Save Data frame with Null column-- exception > > > Key: SPARK-18789 > URL: https://issues.apache.org/jira/browse/SPARK-18789 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.2 >Reporter: Harish > > I am trying to save a DF to HDFS which is having 1 column is NULL(no data). > col1 col2 col3 > a 1 null > b 1 null > c1null > d 1 null > code : df.write.format("orc").save(path, mode='overwrite') > Error: > java.lang.IllegalArgumentException: Error: type expected at the position 49 > of 'string:string:string:double:string:double:string:null' but 'null' is > found. 
> at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765) > at > org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104) > at > org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182) > at > org.apache.spark.sql.hive.orc.OrcOutputWriter.(OrcFileFormat.scala:225) > at > org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94) > at > org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 16/12/08 19:41:49 ERROR TaskSetManager: Task 17 in stage 512.0 failed 4 > times; aborting job > 16/12/08 19:41:49 ERROR 
InsertIntoHadoopFsRelationCommand: Aborting job. > org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in > stage 512.0 failed 4 times, most recent failure: Lost task 17.3 in stage > 512.0 (TID 37290, 10.63.136.108): java.lang.IllegalArgumentException: Error: > type expected at the position 49 of > 'string:string:string:double:string:double:string:null' but 'null' is found. > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765) > at > org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104) > at > org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182) > at >
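Editor's note: the stack trace above fails inside Hive's TypeInfoUtils while parsing the table's type string, because a DataFrame column of NullType serializes its Hive type as the literal token "null". A hypothetical pure-Python sketch of that validation step (illustrative only, not Hive's actual parser) reproduces the "position 49" error from the report:

```python
# Minimal illustration of Hive-style type-string validation (hypothetical,
# not the real TypeInfoUtils parser): each colon-separated token must be a
# known primitive type name, and "null" is not one of them.
KNOWN_PRIMITIVES = {
    "string", "int", "bigint", "double", "float", "boolean",
    "tinyint", "smallint", "binary", "date", "timestamp", "decimal",
}

def validate_type_string(type_string):
    """Return the parsed type list, or raise at the offending position."""
    pos = 0
    types = []
    for token in type_string.split(":"):
        if token not in KNOWN_PRIMITIVES:
            raise ValueError(
                "Error: type expected at the position %d of '%s' but '%s' is found."
                % (pos, type_string, token))
        types.append(token)
        pos += len(token) + 1  # +1 for the ':' separator
    return types
```

In practice the usual workaround is to cast the all-null column to a concrete type before writing ORC, e.g. something like `df.withColumn("col3", df["col3"].cast("string"))` in PySpark (column name hypothetical, taken from the example above).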
[jira] [Comment Edited] (SPARK-19713) saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-19713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928778#comment-15928778 ] Eric Maynard edited comment on SPARK-19713 at 3/16/17 8:08 PM: --- Not really relevant here, but to address: >1. Hive will not be able to create the table as the folder already exists You absolutely can construct a Hive external table on top of an existing folder. >2. Hive cannot drop the table because the spark has not updated HiveMetaStore The canonical solution to this is to run `MSCK REPAIR TABLE myTable;` in Hive. was (Author: emaynard1121): Not really relevant here, but to address: >2. Hive cannot drop the table because the spark has not updated HiveMetaStore The canonical solution to this is to run `MSCK REPAIR TABLE myTable;` in Hive. > saveAsTable > --- > > Key: SPARK-19713 > URL: https://issues.apache.org/jira/browse/SPARK-19713 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Balaram R Gadiraju > > Hi, > I just observed that when we use dataframe.saveAsTable("table") -- In > oldversions > and dataframe.write.saveAsTable("table") -- in the newer versions > When using the method “df3.saveAsTable("brokentable")” in > scale code. This creates a folder in hdfs and doesn’t update hive-metastore > that it plans to create the table. So if anything goes wrong in between the > folder still exists and hive is not aware of the folder creation. This will > block the users from creating the table “brokentable” as the folder already > exists, we can remove the folder using “hadoop fs –rmr > /data/hive/databases/testdb.db/brokentable”. So below is the workaround > which will enable to you to continue the development work. > Current Code: > val df3 = sqlContext.sql("select * fromtesttable") > df3.saveAsTable("brokentable") > THE WORKAROUND: > By registering the DataFrame as table and then using sql command to load the > data will resolve the issue. 
EX: > sqlContext.sql("select * from testtable").registerTempTable("df3") > sqlContext.sql("CREATE TABLE brokentable AS SELECT * FROM df3")
[jira] [Commented] (SPARK-19713) saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-19713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928778#comment-15928778 ] Eric Maynard commented on SPARK-19713: -- Not really relevant here, but to address: >2. Hive cannot drop the table because the spark has not updated HiveMetaStore The canonical solution to this is to run `MSCK REPAIR TABLE myTable;` in Hive. > saveAsTable > --- > > Key: SPARK-19713 > URL: https://issues.apache.org/jira/browse/SPARK-19713 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Balaram R Gadiraju > > Hi, > I just observed that when we use dataframe.saveAsTable("table") -- In > oldversions > and dataframe.write.saveAsTable("table") -- in the newer versions > When using the method “df3.saveAsTable("brokentable")” in > scale code. This creates a folder in hdfs and doesn’t update hive-metastore > that it plans to create the table. So if anything goes wrong in between the > folder still exists and hive is not aware of the folder creation. This will > block the users from creating the table “brokentable” as the folder already > exists, we can remove the folder using “hadoop fs –rmr > /data/hive/databases/testdb.db/brokentable”. So below is the workaround > which will enable to you to continue the development work. > Current Code: > val df3 = sqlContext.sql("select * fromtesttable") > df3.saveAsTable("brokentable") > THE WORKAROUND: > By registering the DataFrame as table and then using sql command to load the > data will resolve the issue. EX: > val df3 = sqlContext.sql("select * from testtable").registerTempTable("df3") > sqlContext.sql("CREATE TABLE brokentable AS SELECT * FROM df3") -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19721) Good error message for version mismatch in log files
[ https://issues.apache.org/jira/browse/SPARK-19721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-19721: Assignee: Liwei Lin > Good error message for version mismatch in log files > > > Key: SPARK-19721 > URL: https://issues.apache.org/jira/browse/SPARK-19721 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Michael Armbrust >Assignee: Liwei Lin >Priority: Blocker > Fix For: 2.2.0 > > > There are several places where we write out version identifiers in various > logs for structured streaming (usually {{v1}}). However, in the places where > we check for this, we throw a confusing error message. Instead, we should do > the following: > - Find all of the places where we do this kind of check. > - for {{vN}} where {{n>1}} say "UnsupportedLogFormat: The file {{path}} was > produced by a newer version of Spark and cannot be read by this version. > Please upgrade" > - for anything else throw an error saying the file is malformed. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19721) Good error message for version mismatch in log files
[ https://issues.apache.org/jira/browse/SPARK-19721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19721: - Fix Version/s: 2.2.0 > Good error message for version mismatch in log files > > > Key: SPARK-19721 > URL: https://issues.apache.org/jira/browse/SPARK-19721 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Michael Armbrust >Priority: Blocker > Fix For: 2.2.0 > > > There are several places where we write out version identifiers in various > logs for structured streaming (usually {{v1}}). However, in the places where > we check for this, we throw a confusing error message. Instead, we should do > the following: > - Find all of the places where we do this kind of check. > - for {{vN}} where {{n>1}} say "UnsupportedLogFormat: The file {{path}} was > produced by a newer version of Spark and cannot be read by this version. > Please upgrade" > - for anything else throw an error saying the file is malformed. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
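Editor's note: the behavior proposed in this ticket can be sketched in a few lines. This is an illustration of the check described above, not Spark's actual implementation (the real code lives in the structured-streaming log classes):

```python
import re

SUPPORTED_VERSION = 1  # the log format version this reader understands

def check_log_version(header, path):
    """Validate a structured-streaming log header like 'v1'.

    Sketch of the error handling proposed in SPARK-19721: a higher version
    gets a clear "please upgrade" message, anything else is malformed.
    """
    m = re.fullmatch(r"v(\d+)", header)
    if m is None:
        raise ValueError("The file %s is malformed: expected a version "
                         "header like 'v1', got %r." % (path, header))
    version = int(m.group(1))
    if version > SUPPORTED_VERSION:
        raise ValueError("UnsupportedLogFormat: The file %s was produced by "
                         "a newer version of Spark and cannot be read by "
                         "this version. Please upgrade." % path)
    return version
```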
[jira] [Commented] (SPARK-19713) saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-19713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928737#comment-15928737 ] Balaram R Gadiraju commented on SPARK-19713: @Hyukjin Kwon : please suggest what you think the title should be > saveAsTable > --- > > Key: SPARK-19713 > URL: https://issues.apache.org/jira/browse/SPARK-19713 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Balaram R Gadiraju > > Hi, > I just observed that when we use dataframe.saveAsTable("table") -- In > oldversions > and dataframe.write.saveAsTable("table") -- in the newer versions > When using the method “df3.saveAsTable("brokentable")” in > scale code. This creates a folder in hdfs and doesn’t update hive-metastore > that it plans to create the table. So if anything goes wrong in between the > folder still exists and hive is not aware of the folder creation. This will > block the users from creating the table “brokentable” as the folder already > exists, we can remove the folder using “hadoop fs –rmr > /data/hive/databases/testdb.db/brokentable”. So below is the workaround > which will enable to you to continue the development work. > Current Code: > val df3 = sqlContext.sql("select * fromtesttable") > df3.saveAsTable("brokentable") > THE WORKAROUND: > By registering the DataFrame as table and then using sql command to load the > data will resolve the issue. EX: > val df3 = sqlContext.sql("select * from testtable").registerTempTable("df3") > sqlContext.sql("CREATE TABLE brokentable AS SELECT * FROM df3") -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18789) Save Data frame with Null column-- exception
[ https://issues.apache.org/jira/browse/SPARK-18789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928738#comment-15928738 ] Hyukjin Kwon commented on SPARK-18789: -- Doh, I am sorry. Let me try to test again and will open. I thought the script describes how to reproduce. Thanks for pointing this out. Let me try this in the current master soon. > Save Data frame with Null column-- exception > > > Key: SPARK-18789 > URL: https://issues.apache.org/jira/browse/SPARK-18789 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.2 >Reporter: Harish > > I am trying to save a DF to HDFS which is having 1 column is NULL(no data). > col1 col2 col3 > a 1 null > b 1 null > c1null > d 1 null > code : df.write.format("orc").save(path, mode='overwrite') > Error: > java.lang.IllegalArgumentException: Error: type expected at the position 49 > of 'string:string:string:double:string:double:string:null' but 'null' is > found. > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765) > at > org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104) > at > org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182) > at > org.apache.spark.sql.hive.orc.OrcOutputWriter.(OrcFileFormat.scala:225) > at > org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94) > at > org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131) > at > 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 16/12/08 19:41:49 ERROR TaskSetManager: Task 17 in stage 512.0 failed 4 > times; aborting job > 16/12/08 19:41:49 ERROR InsertIntoHadoopFsRelationCommand: Aborting job. > org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in > stage 512.0 failed 4 times, most recent failure: Lost task 17.3 in stage > 512.0 (TID 37290, 10.63.136.108): java.lang.IllegalArgumentException: Error: > type expected at the position 49 of > 'string:string:string:double:string:double:string:null' but 'null' is found. 
> at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765) > at > org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104) > at > org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182) > at > org.apache.spark.sql.hive.orc.OrcOutputWriter.(OrcFileFormat.scala:225) > at > org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94) > at > org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247) > at >
[jira] [Commented] (SPARK-19713) saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-19713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928734#comment-15928734 ] Balaram R Gadiraju commented on SPARK-19713: The issue is not only in spark, because when the folder is created and spark ends with error. we are not able to create or drop table even in hive as hive needs to create the folder in order to create the table. 1. Hive will not be able to create the table as the folder already exists. 2. Hive cannot drop the table because the spark has not updated HiveMetaStore (there is not table in hive to drop) This causes the folder to be locked until you run "hdfs dfs -rm -r /data/hive/databases/testdb.db/brokentable" Does everyone think this is not a issue ? > saveAsTable > --- > > Key: SPARK-19713 > URL: https://issues.apache.org/jira/browse/SPARK-19713 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Balaram R Gadiraju > > Hi, > I just observed that when we use dataframe.saveAsTable("table") -- In > oldversions > and dataframe.write.saveAsTable("table") -- in the newer versions > When using the method “df3.saveAsTable("brokentable")” in > scale code. This creates a folder in hdfs and doesn’t update hive-metastore > that it plans to create the table. So if anything goes wrong in between the > folder still exists and hive is not aware of the folder creation. This will > block the users from creating the table “brokentable” as the folder already > exists, we can remove the folder using “hadoop fs –rmr > /data/hive/databases/testdb.db/brokentable”. So below is the workaround > which will enable to you to continue the development work. > Current Code: > val df3 = sqlContext.sql("select * fromtesttable") > df3.saveAsTable("brokentable") > THE WORKAROUND: > By registering the DataFrame as table and then using sql command to load the > data will resolve the issue. 
EX: > sqlContext.sql("select * from testtable").registerTempTable("df3") > sqlContext.sql("CREATE TABLE brokentable AS SELECT * FROM df3")
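Editor's note: the stuck state described in this thread (HDFS folder present, metastore unaware of it) can at least be detected up front. A rough, hypothetical sketch of such a pre-flight decision, with a plain dict standing in for the Hive metastore and a set of paths standing in for HDFS (illustrative only, not Spark or Hive code):

```python
def plan_create_table(table, existing_dirs, metastore):
    """Decide how to proceed when creating `table`.

    `existing_dirs` is a set of warehouse directory names; `metastore` is a
    stand-in dict {table_name: metadata}. Returns "exists" when the table is
    properly registered, "cleanup-then-create" when a failed saveAsTable left
    an orphan folder (mirrors running `hdfs dfs -rm -r <path>` by hand), and
    "create" otherwise.
    """
    if table in metastore:
        return "exists"
    if table in existing_dirs:
        # Folder present but the metastore has no record of it: the
        # blocked state described in this thread.
        return "cleanup-then-create"
    return "create"
```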
[jira] [Commented] (SPARK-19982) JavaDatasetSuite.testJavaBeanEncoder sometimes fails with "Unable to generate an encoder for inner class"
[ https://issues.apache.org/jira/browse/SPARK-19982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928660#comment-15928660 ] Jose Soltren commented on SPARK-19982: -- [~cloud_fan] added this test as some work related to SPARK-11954. Wenchen, do you have any thoughts as to why this might be failing intermittently? Likely not, but I wonder if this has anything to do with outerScopes being a lazy val in object OuterScopes. Then, possibly, in very rare instances, Analyzer.scala:ResolveNewInstance could hit the (outer == null) branch and throw this exception. We run all the Spark unit tests about a dozen times a night and this has failed on average twice a month since last May (which is as far back as my data goes). > JavaDatasetSuite.testJavaBeanEncoder sometimes fails with "Unable to generate > an encoder for inner class" > - > > Key: SPARK-19982 > URL: https://issues.apache.org/jira/browse/SPARK-19982 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.0 >Reporter: Jose Soltren > > JavaDatasetSuite.testJavaBeanEncoder fails sporadically with the error below: > Unable to generate an encoder for inner class > `test.org.apache.spark.sql.JavaDatasetSuite$SimpleJavaBean` without access to > the scope that this class was defined in. Try moving this class out of its > parent class. > From https://spark-tests.appspot.com/test-logs/35475788 > [~vanzin] looked into this back in October and reported: > I ran this test in a loop (both alone and with the rest of the spark-sql > tests) and never got a failure. I even used the same JDK as Jenkins > (1.7.0_51). > Also looked at the code and nothing seems wrong. The errors is when an entry > with the parent class name is missing from the map kept in OuterScopes.scala, > but the test populates that map in its first line. So it doesn't look like a > race nor some issue with weak references (the map uses weak values). 
> public void testJavaBeanEncoder() { > OuterScopes.addOuterScope(this);
[jira] [Commented] (SPARK-19965) DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output
[ https://issues.apache.org/jira/browse/SPARK-19965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928657#comment-15928657 ] Shixiong Zhu commented on SPARK-19965: -- [~lwlin] I think we can just ignore “_spark_metadata” in InMemoryFileIndex. Could you try it? > DataFrame batch reader may fail to infer partitions when reading > FileStreamSink's output > > > Key: SPARK-19965 > URL: https://issues.apache.org/jira/browse/SPARK-19965 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu > > Reproducer > {code} > test("partitioned writing and batch reading with 'basePath'") { > val inputData = MemoryStream[Int] > val ds = inputData.toDS() > val outputDir = Utils.createTempDir(namePrefix = > "stream.output").getCanonicalPath > val checkpointDir = Utils.createTempDir(namePrefix = > "stream.checkpoint").getCanonicalPath > var query: StreamingQuery = null > try { > query = > ds.map(i => (i, i * 1000)) > .toDF("id", "value") > .writeStream > .partitionBy("id") > .option("checkpointLocation", checkpointDir) > .format("parquet") > .start(outputDir) > inputData.addData(1, 2, 3) > failAfter(streamingTimeout) { > query.processAllAvailable() > } > spark.read.option("basePath", outputDir).parquet(outputDir + > "/*").show() > } finally { > if (query != null) { > query.stop() > } > } > } > {code} > Stack trace > {code} > [info] - partitioned writing and batch reading with 'basePath' *** FAILED *** > (3 seconds, 928 milliseconds) > [info] java.lang.AssertionError: assertion failed: Conflicting directory > structures detected. Suspicious paths: > [info]***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637 > [info] > ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637/_spark_metadata > [info] > [info] If provided paths are partition directories, please set "basePath" in > the options of the data source to specify the root directory of the table. 
If > there are multiple root directories, please load them separately and then > union them. > [info] at scala.Predef$.assert(Predef.scala:170) > [info] at > org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:133) > [info] at > org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:98) > [info] at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:156) > [info] at > org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:54) > [info] at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:55) > [info] at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:133) > [info] at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361) > [info] at > org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:160) > [info] at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:536) > [info] at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:520) > [info] at > org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply$mcV$sp(FileStreamSinkSuite.scala:292) > [info] at > org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268) > [info] at > org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
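Editor's note: the suggested fix of skipping the sink's metadata directory during file listing can be sketched as follows; this is an illustration of the idea, not Spark's InMemoryFileIndex code:

```python
METADATA_DIR = "_spark_metadata"

def list_data_files(paths):
    """Drop FileStreamSink metadata paths before partition inference."""
    return [p for p in paths
            if METADATA_DIR not in p.split("/")]

def infer_partition_columns(paths):
    """Collect partition column names from 'key=value' path segments.

    With the metadata directory filtered out, the `_spark_metadata` entry no
    longer looks like a conflicting second root directory.
    """
    columns = set()
    for p in list_data_files(paths):
        for segment in p.split("/"):
            if "=" in segment:
                columns.add(segment.split("=", 1)[0])
    return sorted(columns)
```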
[jira] [Created] (SPARK-19983) Getting ValidationFailureSemanticException on 'INSERT OVERWRITE'
Rajkumar created SPARK-19983: Summary: Getting ValidationFailureSemanticException on 'INSERT OVERWRITE' Key: SPARK-19983 URL: https://issues.apache.org/jira/browse/SPARK-19983 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Rajkumar Priority: Blocker Hi, I am creating a DataFrame and registering it as a temp table using df.createOrReplaceTempView('mytable'). After that I try to write the content from 'mytable' into a Hive table (which has a partition) using the following query: insert overwrite table myhivedb.myhivetable partition(testdate) // (1) Note here: I have a partition named 'testdate' select Field1, Field2, ... TestDate // (2) Note here: I have a field named 'TestDate' from mytable; Both (1) and (2) have the same name. When I execute this query, I am getting the following error: Exception in thread "main" org.apache.hadoop.hive.ql.metadata.Table$ValidationFailureSemanticException: Partition spec {testdate=, TestDate=2013-01-01} Looks like I am getting this error because of the same field names, i.e. testdate (the partition in Hive) and TestDate (the field in temp table 'mytable'). Whereas if my field name 'TestDate' is different, the query executes successfully. Example: insert overwrite table myhivedb.myhivetable partition(testdate) select Field1, Field2, ... myDate // Note here: The field name is 'myDate' and not 'TestDate' from mytable
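Editor's note: since Hive compares column names case-insensitively, a select column that matches a partition column in everything but case ends up contributing a second spelling to the partition spec, as in the ValidationFailureSemanticException above. A hypothetical pre-flight check (not part of Spark or Hive) that flags such collisions:

```python
def check_partition_columns(select_columns, partition_columns):
    """Flag select columns that collide with partition columns only by case.

    Hive treats column names case-insensitively, so selecting 'TestDate'
    while partitioning by 'testdate' yields a partition spec containing both
    spellings. Columns that match a partition column exactly are fine; only
    case-mismatched spellings are returned.
    """
    partitions = {c.lower() for c in partition_columns}
    return [c for c in select_columns
            if c.lower() in partitions and c not in partition_columns]
```

Renaming the flagged columns to the partition column's exact spelling (or to an unrelated name, as in the 'myDate' example above) avoids the exception.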
[jira] [Comment Edited] (SPARK-12261) pyspark crash for large dataset
[ https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927746#comment-15927746 ] Tomas Pranckevicius edited comment on SPARK-12261 at 3/16/17 6:39 PM: -- I am looking as well to the solution of this pyspark crash for the large data set issue on windows. I have read several posts and spent few days on this problem. I am happy to see that there is a solution mentioned by Shea Parkes and I am trying to get it working by changing rdd.py, but it still does not provide the positive outcome. Could please write more details on the change that has to be done in the proposed bandaid of exhausting the iterator at the end of takeUpToNumLeft() ? {code} def takeUpToNumLeft(): iterator = iter(iterator) taken = 0 while taken < left: yield next(iterator) taken += 1 {code} was (Author: tomas pranckevicius): I am looking as well to the solution of this pyspark crash for the large data set issue on windows. I have read several posts and spent few days on this problem. I am happy to see that there is a solution mention by Shea Parkes and I am trying to get it working by changing rdd.py, but it still does not provide the positive outcome. Could please write more details on the change that has to be done in the proposed bandaid of exhausting the iterator at the end of takeUpToNumLeft() by changing rdd.py file? 
{code} def takeUpToNumLeft(): iterator = iter(iterator) taken = 0 while taken < left: yield next(iterator) taken += 1 {code} > pyspark crash for large dataset > --- > > Key: SPARK-12261 > URL: https://issues.apache.org/jira/browse/SPARK-12261 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 > Environment: windows >Reporter: zihao > > I tried to import a local text(over 100mb) file via textFile in pyspark, when > i ran data.take(), it failed and gave error messages including: > 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; > aborting job > Traceback (most recent call last): > File "E:/spark_python/test3.py", line 9, in > lines.take(5) > File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, > in take > res = self.context.runJob(self, takeUpToNumLeft, p) > File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line > 916, in runJob > port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, > partitions) > File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in > __call__ > answer, self.gateway_client, self.target_id, self.name) > File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line > 36, in deco > return f(*a, **kw) > File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in > get_return_value > format(target_id, ".", name), value) > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 > (TID 0, localhost): java.net.SocketException: Connection reset by peer: > socket write error > Then i ran the same code for a small text file, this time .take() worked fine. > How can i solve this problem? 
[jira] [Comment Edited] (SPARK-12261) pyspark crash for large dataset
[ https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927746#comment-15927746 ] Tomas Pranckevicius edited comment on SPARK-12261 at 3/16/17 6:40 PM: -- I am looking as well to the solution of this pyspark crash for the large data set issue on windows. I have read several posts and spent few days on this problem. I am happy to see that there is a solution mentioned by Shea Parkes and I am trying to get it working by changing rdd.py, but it still does not provide the positive outcome. Could you please write more details on the change that has to be done in the proposed bandaid of exhausting the iterator at the end of takeUpToNumLeft() ? {code} def takeUpToNumLeft(): iterator = iter(iterator) taken = 0 while taken < left: yield next(iterator) taken += 1 {code} was (Author: tomas pranckevicius): I am looking as well to the solution of this pyspark crash for the large data set issue on windows. I have read several posts and spent few days on this problem. I am happy to see that there is a solution mentioned by Shea Parkes and I am trying to get it working by changing rdd.py, but it still does not provide the positive outcome. Could please write more details on the change that has to be done in the proposed bandaid of exhausting the iterator at the end of takeUpToNumLeft() ? 
{code}
def takeUpToNumLeft(iterator):
    iterator = iter(iterator)
    taken = 0
    while taken < left:  # 'left' is defined in the enclosing take()
        yield next(iterator)
        taken += 1
{code}
> pyspark crash for large dataset
> ---
>
> Key: SPARK-12261
> URL: https://issues.apache.org/jira/browse/SPARK-12261
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.5.2
> Environment: windows
> Reporter: zihao
>
> I tried to import a local text (over 100mb) file via textFile in pyspark; when I ran data.take(), it failed and gave error messages including:
> 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
> Traceback (most recent call last):
> File "E:/spark_python/test3.py", line 9, in
> lines.take(5)
> File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, in take
> res = self.context.runJob(self, takeUpToNumLeft, p)
> File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 916, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
> File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in __call__
> answer, self.gateway_client, self.target_id, self.name)
> File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 36, in deco
> return f(*a, **kw)
> File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in get_return_value
> format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketException: Connection reset by peer: socket write error
> Then I ran the same code for a small text file, and this time .take() worked fine.
> How can I solve this problem?
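A minimal, self-contained Python sketch of the "exhaust the iterator" bandaid discussed in this thread. This is not the actual pyspark rdd.py patch; the function name, the explicit `left` parameter, and the drain loop are illustrative assumptions about the suggested fix: after yielding the first `left` items, keep consuming the rest of the partition so the producer side is fully drained instead of being cut off mid-write (the apparent cause of the "socket write error" on Windows).

```python
def take_up_to_num_left(iterator, left):
    """Yield up to `left` items, then drain the rest of the iterator.

    Hypothetical sketch of the bandaid -- not pyspark's actual code.
    """
    iterator = iter(iterator)
    taken = 0
    while taken < left:
        try:
            yield next(iterator)
        except StopIteration:
            return  # partition had fewer than `left` items
        taken += 1
    # Bandaid: consume whatever remains so the upstream writer is not
    # interrupted by an early generator exit.
    for _ in iterator:
        pass
```

Whether draining is acceptable depends on partition size: it trades the early-exit optimization of take() for not slamming the connection shut, which is why it is described as a bandaid rather than a fix.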
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19945) Add test suite for SessionCatalog with HiveExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-19945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-19945: --- Assignee: Song Jun
> Add test suite for SessionCatalog with HiveExternalCatalog
> --
>
> Key: SPARK-19945
> URL: https://issues.apache.org/jira/browse/SPARK-19945
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Song Jun
> Assignee: Song Jun
> Fix For: 2.2.0
>
> Currently SessionCatalogSuite covers only InMemoryCatalog; there is no suite for HiveExternalCatalog.
> Some DDL functions are not suitable to test in ExternalCatalogSuite because their logic is not fully implemented in ExternalCatalog; those DDL functions are fully implemented in SessionCatalog, so it is better to test them in SessionCatalogSuite.
> So we should add a test suite for SessionCatalog with HiveExternalCatalog
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19945) Add test suite for SessionCatalog with HiveExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-19945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-19945. - Resolution: Fixed Fix Version/s: 2.2.0
> Add test suite for SessionCatalog with HiveExternalCatalog
> --
>
> Key: SPARK-19945
> URL: https://issues.apache.org/jira/browse/SPARK-19945
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Song Jun
> Assignee: Song Jun
> Fix For: 2.2.0
>
> Currently SessionCatalogSuite covers only InMemoryCatalog; there is no suite for HiveExternalCatalog.
> Some DDL functions are not suitable to test in ExternalCatalogSuite because their logic is not fully implemented in ExternalCatalog; those DDL functions are fully implemented in SessionCatalog, so it is better to test them in SessionCatalogSuite.
> So we should add a test suite for SessionCatalog with HiveExternalCatalog
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
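The idea behind SPARK-19945 is a common testing pattern: write the assertions once against the shared catalog interface and run them against each backend implementation. A toy pure-Python illustration of that pattern follows; the class and function names here are hypothetical stand-ins, not Spark's actual API.

```python
# One shared suite of assertions, run against several catalog backends.
class InMemoryCatalog:
    """Minimal toy backend keeping tables in a dict."""
    def __init__(self):
        self._tables = {}

    def create_table(self, name):
        self._tables[name] = {}

    def table_exists(self, name):
        return name in self._tables

class HiveLikeCatalog(InMemoryCatalog):
    """Stand-in for an external backend exposing the same interface."""
    pass

def run_session_catalog_suite(catalog):
    """Shared assertions that any conforming backend must satisfy."""
    catalog.create_table("t1")
    assert catalog.table_exists("t1")
    assert not catalog.table_exists("missing")
    return True

# The same suite runs unchanged against every backend.
results = [run_session_catalog_suite(c()) for c in (InMemoryCatalog, HiveLikeCatalog)]
```

The payoff is exactly what the issue describes: DDL behavior that only makes sense at the SessionCatalog level gets exercised against both catalogs without duplicating the test bodies.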
[jira] [Created] (SPARK-19982) JavaDatasetSuite.testJavaBeanEncoder sometimes fails with "Unable to generate an encoder for inner class"
Jose Soltren created SPARK-19982: Summary: JavaDatasetSuite.testJavaBeanEncoder sometimes fails with "Unable to generate an encoder for inner class" Key: SPARK-19982 URL: https://issues.apache.org/jira/browse/SPARK-19982 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.1.0 Reporter: Jose Soltren
JavaDatasetSuite.testJavaBeanEncoder fails sporadically with the error below:
Unable to generate an encoder for inner class `test.org.apache.spark.sql.JavaDatasetSuite$SimpleJavaBean` without access to the scope that this class was defined in. Try moving this class out of its parent class.
From https://spark-tests.appspot.com/test-logs/35475788
[~vanzin] looked into this back in October and reported: I ran this test in a loop (both alone and with the rest of the spark-sql tests) and never got a failure. I even used the same JDK as Jenkins (1.7.0_51). Also looked at the code and nothing seems wrong. The error occurs when an entry with the parent class name is missing from the map kept in OuterScopes.scala, but the test populates that map in its first line. So it doesn't look like a race nor some issue with weak references (the map uses weak values).
public void testJavaBeanEncoder() { OuterScopes.addOuterScope(this);
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
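As background for the "weak values" remark: in a map whose values are held weakly, an entry vanishes as soon as the last strong reference to the value object is gone. That is the failure mode being ruled out above. The following pure-Python analogy uses the stdlib's `weakref.WeakValueDictionary`; it is only an analogy for the Guava weak-values map in OuterScopes.scala, not Spark's actual code.

```python
import gc
import weakref

class OuterScope:
    """Toy stand-in for an outer-class instance registered in a scope map."""
    pass

# Registry whose values are held weakly, analogous to a Guava weakValues() map.
registry = weakref.WeakValueDictionary()

scope = OuterScope()
registry["test.JavaDatasetSuite"] = scope
present_while_referenced = "test.JavaDatasetSuite" in registry  # True

# Once the last strong reference goes away, the entry silently disappears.
del scope
gc.collect()
present_after_collect = "test.JavaDatasetSuite" in registry  # False
```

In the JIRA's scenario this cannot explain the failure, because the test suite instance (the map's value) is strongly referenced for the duration of the test.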
[jira] [Comment Edited] (SPARK-19969) Doc and examples for Imputer
[ https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928549#comment-15928549 ] yuhao yang edited comment on SPARK-19969 at 3/16/17 6:10 PM: - Not really. But I can start on it now if needed. was (Author: yuhaoyan): Not really. But I can start on it now. > Doc and examples for Imputer > > > Key: SPARK-19969 > URL: https://issues.apache.org/jira/browse/SPARK-19969 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 2.2.0 >Reporter: Nick Pentreath > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19969) Doc and examples for Imputer
[ https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928549#comment-15928549 ] yuhao yang commented on SPARK-19969: Not really. But I can start on it now. > Doc and examples for Imputer > > > Key: SPARK-19969 > URL: https://issues.apache.org/jira/browse/SPARK-19969 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 2.2.0 >Reporter: Nick Pentreath > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19979) [MLLIB] Multiple Estimators/Pipelines In CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-19979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928545#comment-15928545 ] Nick Pentreath commented on SPARK-19979: I wonder if this fits in as a sort of sub-task of SPARK-19071? cc [~bryanc] as it relates to your work on SPARK-19357.
> [MLLIB] Multiple Estimators/Pipelines In CrossValidator
> ---
>
> Key: SPARK-19979
> URL: https://issues.apache.org/jira/browse/SPARK-19979
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 2.1.0
> Reporter: David Leifker
>
> Update CrossValidator and TrainValidationSplit to be able to accept multiple pipelines and grid parameters for testing different algorithms and/or being able to better control tuning combinations. Maintains a backwards-compatible API and reads legacy serialized objects.
> The same could be done using an external iterative approach: build different pipelines, throw each into a CrossValidator, take the best model from each of those CrossValidators, and finally pick the best from those. This is the initial approach I explored. It resulted in a lot of boilerplate code that felt like it shouldn't need to exist if the API simply allowed for arrays of estimators and their parameters.
> A couple of advantages of this implementation come from keeping the functional interface to the CrossValidator.
> 1. The caching of the folds is better utilized. An external iterative approach creates a new set of k folds for each CrossValidator fit, and the folds are discarded after each CrossValidator run. In this implementation a single set of k folds is created and cached for all of the pipelines.
> 2. A potential advantage of this implementation is future parallelization of the pipelines within the CrossValidator. It is of course possible to handle the parallelization outside of the CrossValidator here too; however, I believe there is already work in progress to parallelize the grid parameters, and that could be extended to multiple pipelines.
> Both of those behind-the-scenes optimizations are possible because providing the CrossValidator with the data and the complete set of pipelines/estimators to evaluate up front allows one to abstract away the implementation.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
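The fold-caching argument in the proposal can be sketched in a few lines of plain Python: build the k folds once, then score every candidate estimator on the same cached folds instead of re-splitting per candidate. This is a toy model, not Spark's CrossValidator API; the "estimators" here are simply callables `(train, test) -> score` standing in for fitting and evaluating a pipeline.

```python
import random

def k_folds(data, k, seed=42):
    """Shuffle once and split the data into k folds."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def cross_validate(estimators, data, k=3):
    """Score every candidate on the SAME cached folds; return (best, score)."""
    folds = k_folds(data, k)  # built once, shared by all candidates
    best, best_score = None, float("-inf")
    for est in estimators:
        scores = []
        for i in range(k):
            # Fold i is held out; the rest form the training set.
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            scores.append(est(train, folds[i]))
        mean = sum(scores) / k
        if mean > best_score:
            best, best_score = est, mean
    return best, best_score
```

With the external iterative approach, `k_folds` would run once per candidate; passing all candidates in up front is what makes the single shared split (and any future cross-candidate parallelism) possible.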
[jira] [Commented] (SPARK-19969) Doc and examples for Imputer
[ https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928537#comment-15928537 ] Nick Pentreath commented on SPARK-19969: No haven't done the doc or examples - I seem to recall you had already done some work on that? > Doc and examples for Imputer > > > Key: SPARK-19969 > URL: https://issues.apache.org/jira/browse/SPARK-19969 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 2.2.0 >Reporter: Nick Pentreath > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12261) pyspark crash for large dataset
[ https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927746#comment-15927746 ] Tomas Pranckevicius edited comment on SPARK-12261 at 3/16/17 5:40 PM:
--
I am also looking for a solution to this pyspark crash with large data sets on Windows. I have read several posts and spent a few days on this problem. I am happy to see that there is a solution mentioned by Shea Parkes, and I am trying to get it working by changing rdd.py, but it still does not produce a positive outcome. Could you please give more details on the change that has to be done in the proposed bandaid of exhausting the iterator at the end of takeUpToNumLeft() by changing the rdd.py file?
{code}
def takeUpToNumLeft(iterator):
    iterator = iter(iterator)
    taken = 0
    while taken < left:  # 'left' is defined in the enclosing take()
        yield next(iterator)
        taken += 1
{code}
was (Author: tomas pranckevicius): I am looking as well to the solution of this pyspark crash for the large data set issue on windows. I have read several posts and spent few days on this problem. I am happy to see that there is a solution mention by Shea Parkes and I am trying to get it working by changing rdd.py, but it still does not provide the positive outcome. Could please write more details on the change that has to be done in the proposed bandaid of exhausting the iterator at the end of takeUpToNumLeft() by changing rdd.py file?
{code}
def takeUpToNumLeft(iterator):
    iterator = iter(iterator)
    taken = 0
    while taken < left:  # 'left' is defined in the enclosing take()
        yield next(iterator)
        taken += 1
{code}
> pyspark crash for large dataset
> ---
>
> Key: SPARK-12261
> URL: https://issues.apache.org/jira/browse/SPARK-12261
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.5.2
> Environment: windows
> Reporter: zihao
>
> I tried to import a local text (over 100mb) file via textFile in pyspark; when I ran data.take(), it failed and gave error messages including:
> 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
> Traceback (most recent call last):
> File "E:/spark_python/test3.py", line 9, in
> lines.take(5)
> File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, in take
> res = self.context.runJob(self, takeUpToNumLeft, p)
> File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 916, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
> File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in __call__
> answer, self.gateway_client, self.target_id, self.name)
> File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 36, in deco
> return f(*a, **kw)
> File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in get_return_value
> format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketException: Connection reset by peer: socket write error
> Then I ran the same code for a small text file, and this time .take() worked fine.
> How can I solve this problem?
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12261) pyspark crash for large dataset
[ https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927746#comment-15927746 ] Tomas Pranckevicius edited comment on SPARK-12261 at 3/16/17 5:39 PM:
--
I am also looking for a solution to this pyspark crash with large data sets on Windows. I have read several posts and spent a few days on this problem. I am happy to see that there is a solution mentioned by Shea Parkes, and I am trying to get it working by changing rdd.py, but it still does not produce a positive outcome. Could you please give more details on the change that has to be done in the proposed bandaid of exhausting the iterator at the end of takeUpToNumLeft() by changing the rdd.py file?
{code}
def takeUpToNumLeft(iterator):
    iterator = iter(iterator)
    taken = 0
    while taken < left:  # 'left' is defined in the enclosing take()
        yield next(iterator)
        taken += 1
{code}
was (Author: tomas pranckevicius): I am looking as well to the solution of this pyspark crash for the large data set issue on windows. I have read several posts and spent few days on this problem. I am happy to see that there is a solution mention by Shea Parkes and I am trying to get it working by changing rdd.py, but it still does not provide the positive outcome. Could please write more details on the change that has to be done in the proposed bandaid of exhausting the iterator at the end of takeUpToNumLeft() by changing rdd.py file?
{code}
def takeUpToNumLeft(iterator):
    iterator = iter(iterator)
    taken = 0
    while taken < left:  # 'left' is defined in the enclosing take()
        yield next(iterator)
        taken += 1
{code}
> pyspark crash for large dataset
> ---
>
> Key: SPARK-12261
> URL: https://issues.apache.org/jira/browse/SPARK-12261
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.5.2
> Environment: windows
> Reporter: zihao
>
> I tried to import a local text (over 100mb) file via textFile in pyspark; when I ran data.take(), it failed and gave error messages including:
> 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
> Traceback (most recent call last):
> File "E:/spark_python/test3.py", line 9, in
> lines.take(5)
> File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, in take
> res = self.context.runJob(self, takeUpToNumLeft, p)
> File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 916, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
> File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in __call__
> answer, self.gateway_client, self.target_id, self.name)
> File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 36, in deco
> return f(*a, **kw)
> File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in get_return_value
> format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketException: Connection reset by peer: socket write error
> Then I ran the same code for a small text file, and this time .take() worked fine.
> How can I solve this problem?
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19329) after alter a datasource table's location to a not exist location and then insert data throw Exception
[ https://issues.apache.org/jira/browse/SPARK-19329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-19329: Fix Version/s: 2.1.1 > after alter a datasource table's location to a not exist location and then > insert data throw Exception > -- > > Key: SPARK-19329 > URL: https://issues.apache.org/jira/browse/SPARK-19329 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Song Jun >Assignee: Song Jun > Fix For: 2.1.1, 2.2.0 > > > spark.sql("create table t(a string, b int) using parquet") > spark.sql(s"alter table t set location '$notexistedlocation'") > spark.sql("insert into table t select 'c', 1") > this will throw an exception: > com.google.common.util.concurrent.UncheckedExecutionException: > org.apache.spark.sql.AnalysisException: Path does not exist: > $notexistedlocation; > at > com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4814) > at > com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4830) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:122) > at > org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:69) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:456) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:465) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:463) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:463) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:453) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) > at scala.collection.immutable.List.foreach(List.scala:381) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19969) Doc and examples for Imputer
[ https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928443#comment-15928443 ] yuhao yang commented on SPARK-19969: Thanks for the great help with Imputer, [~mlnick] Have you started on this? > Doc and examples for Imputer > > > Key: SPARK-19969 > URL: https://issues.apache.org/jira/browse/SPARK-19969 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 2.2.0 >Reporter: Nick Pentreath > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14438) Cross-publish Breeze for Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928356#comment-15928356 ] Kirill chebba Chebunin commented on SPARK-14438: 2.12 support was added for version 0.13 in https://github.com/scalanlp/breeze/issues/604 > Cross-publish Breeze for Scala 2.12 > --- > > Key: SPARK-14438 > URL: https://issues.apache.org/jira/browse/SPARK-14438 > Project: Spark > Issue Type: Sub-task > Components: Build, Project Infra >Reporter: Josh Rosen > > Spark relies on Breeze (https://github.com/scalanlp/breeze), so we'll need to > cross-publish that for Scala 2.12. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19981) Sort-Merge join inserts shuffles when joining dataframes with aliased columns
Allen George created SPARK-19981: Summary: Sort-Merge join inserts shuffles when joining dataframes with aliased columns Key: SPARK-19981 URL: https://issues.apache.org/jira/browse/SPARK-19981 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Reporter: Allen George
Performing a sort-merge join with two dataframes - each of which has the join column aliased - causes Spark to insert an unnecessary shuffle. Consider the scala test code below, which should be equivalent to the following SQL.
{code:SQL}
SELECT *
FROM (SELECT number AS aliased from df1) t1
LEFT JOIN (SELECT number AS aliased from df2) t2
ON t1.aliased = t2.aliased
{code}
{code:scala}
private case class OneItem(number: Long)
private case class TwoItem(number: Long, value: String)

test("join with aliases should not trigger shuffle") {
  val df1 = sqlContext.createDataFrame(
    Seq(
      OneItem(0),
      OneItem(2),
      OneItem(4)
    )
  )
  val partitionedDf1 = df1.repartition(10, col("number"))
  partitionedDf1.createOrReplaceTempView("df1")
  partitionedDf1.cache()
  partitionedDf1.count()

  val df2 = sqlContext.createDataFrame(
    Seq(
      TwoItem(0, "zero"),
      TwoItem(2, "two"),
      TwoItem(4, "four")
    )
  )
  val partitionedDf2 = df2.repartition(10, col("number"))
  partitionedDf2.createOrReplaceTempView("df2")
  partitionedDf2.cache()
  partitionedDf2.count()

  val fromDf1 = sqlContext.sql("SELECT number from df1")
  val fromDf2 = sqlContext.sql("SELECT number from df2")
  val aliasedDf1 = fromDf1.select(col(fromDf1.columns.head) as "aliased")
  val aliasedDf2 = fromDf2.select(col(fromDf2.columns.head) as "aliased")

  aliasedDf1.join(aliasedDf2, Seq("aliased"), "left_outer")
}
{code}
Both the SQL and the Scala code generate a query-plan where an extra exchange is inserted before performing the sort-merge join. This exchange changes the partitioning from {{HashPartitioning("number", 10)}} for each frame being joined into {{HashPartitioning("aliased", 5)}}.
I would have expected that, since it's a simple column aliasing and both frames have exactly the same partitioning, the initial frames' partitioning would be preserved and no extra exchange inserted.
{noformat}
*Project [args=[aliased#267L]][outPart=PartitioningCollection(5, hashpartitioning(aliased#267L, 5)%NONNULL,hashpartitioning(aliased#270L, 5)%NONNULL)][outOrder=List(aliased#267L ASC%NONNULL)][output=List(aliased#267:bigint%NONNULL)]
+- *SortMergeJoin [args=[aliased#267L], [aliased#270L], Inner][outPart=PartitioningCollection(5, hashpartitioning(aliased#267L, 5)%NONNULL,hashpartitioning(aliased#270L, 5)%NONNULL)][outOrder=List(aliased#267L ASC%NONNULL)][output=ArrayBuffer(aliased#267:bigint%NONNULL, aliased#270:bigint%NONNULL)]
   :- *Sort [args=[aliased#267L ASC], false, 0][outPart=HashPartitioning(5, aliased#267:bigint%NONNULL)][outOrder=List(aliased#267L ASC%NONNULL)][output=ArrayBuffer(aliased#267:bigint%NONNULL)]
   :  +- Exchange [args=hashpartitioning(aliased#267L, 5)%NONNULL][outPart=HashPartitioning(5, aliased#267:bigint%NONNULL)][outOrder=List()][output=ArrayBuffer(aliased#267:bigint%NONNULL)]
   :     +- *Project [args=[number#198L AS aliased#267L]][outPart=HashPartitioning(10, number#198:bigint%NONNULL)][outOrder=List()][output=ArrayBuffer(aliased#267:bigint%NONNULL)]
   :        +- InMemoryTableScan [args=[number#198L]][outPart=HashPartitioning(10, number#198:bigint%NONNULL)][outOrder=List()][output=ArrayBuffer(number#198:bigint%NONNULL)]
   :           +- InMemoryRelation [number#198L], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas), false[Statistics(24,false)][output=List(number#198:bigint%NONNULL)]
   :              +- Exchange [args=hashpartitioning(number#198L, 10)%NONNULL][outPart=HashPartitioning(10, number#198:bigint%NONNULL)][outOrder=List()][output=List(number#198:bigint%NONNULL)]
   :                 +- LocalTableScan [args=[number#198L]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(number#198:bigint%NONNULL)]
   +- *Sort [args=[aliased#270L ASC], false, 0][outPart=HashPartitioning(5, aliased#270:bigint%NONNULL)][outOrder=List(aliased#270L ASC%NONNULL)][output=ArrayBuffer(aliased#270:bigint%NONNULL)]
      +- Exchange [args=hashpartitioning(aliased#270L, 5)%NONNULL][outPart=HashPartitioning(5, aliased#270:bigint%NONNULL)][outOrder=List()][output=ArrayBuffer(aliased#270:bigint%NONNULL)]
         +- *Project [args=[number#223L AS aliased#270L]][outPart=HashPartitioning(10, number#223:bigint%NONNULL)][outOrder=List()][output=ArrayBuffer(aliased#270:bigint%NONNULL)]
            +- InMemoryTableScan [args=[number#223L]][outPart=HashPartitioning(10, number#223:bigint%NONNULL)][outOrder=List()][output=ArrayBuffer(number#223:bigint%NONNULL)]
               +- InMemoryRelation [number#223L, value#224], true, 1, StorageLevel(disk, memory,
[jira] [Updated] (SPARK-19980) Basic Dataset transformation on POJOs does not preserve nulls.
[ https://issues.apache.org/jira/browse/SPARK-19980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michel Lemay updated SPARK-19980: - Description: Applying an identity map transformation on a statically typed Dataset with a POJO produces an unexpected result. Given POJOs: {code} public class Stuff implements Serializable { private String name; public void setName(String name) { this.name = name; } public String getName() { return name; } } public class Outer implements Serializable { private String name; private Stuff stuff; public void setName(String name) { this.name = name; } public String getName() { return name; } public void setStuff(Stuff stuff) { this.stuff = stuff; } public Stuff getStuff() { return stuff; } } {code} Produces the result: {code} scala> val encoder = Encoders.bean(classOf[Outer]) encoder: org.apache.spark.sql.Encoder[pojos.Outer] = class[name[0]: string, stuff[0]: struct] scala> val schema = encoder.schema schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(stuff,StructType(StructField(name,StringType,true)),true)) scala> schema.printTreeString root |-- name: string (nullable = true) |-- stuff: struct (nullable = true) ||-- name: string (nullable = true) scala> val df = spark.read.schema(schema).json("stuff.json").as[Outer](encoder) df: org.apache.spark.sql.Dataset[pojos.Outer] = [name: string, stuff: struct] scala> df.show() ++-+ |name|stuff| ++-+ | v1| null| ++-+ scala> df.map(x => x)(encoder).show() ++--+ |name| stuff| ++--+ | v1|[null]| ++--+ {code} After identity transformation, `stuff` becomes an object with null values inside it instead of staying null itself. 
Doing the same with case classes preserves the nulls: {code} scala> case class ScalaStuff(name: String) defined class ScalaStuff scala> case class ScalaOuter(name: String, stuff: ScalaStuff) defined class ScalaOuter scala> val encoder2 = Encoders.product[ScalaOuter] encoder2: org.apache.spark.sql.Encoder[ScalaOuter] = class[name[0]: string, stuff[0]: struct] scala> val schema2 = encoder2.schema schema2: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(stuff,StructType(StructField(name,StringType,true)),true)) scala> schema2.printTreeString root |-- name: string (nullable = true) |-- stuff: struct (nullable = true) ||-- name: string (nullable = true) scala> scala> val df2 = spark.read.schema(schema2).json("stuff.json").as[ScalaOuter] df2: org.apache.spark.sql.Dataset[ScalaOuter] = [name: string, stuff: struct] scala> df2.show() ++-+ |name|stuff| ++-+ | v1| null| ++-+ scala> df2.map(x => x).show() ++-+ |name|stuff| ++-+ | v1| null| ++-+ {code} stuff.json: {code} {"name":"v1", "stuff":null } {code} was: Applying an identity map transformation on a statically typed Dataset with a POJO produces an unexpected result. 
Given POJOs: {code} public class Stuff implements Serializable { private String name; public void setName(String name) { this.name = name; } public String getName() { return name; } } public class Outer implements Serializable { private String name; private Stuff stuff; public void setName(String name) { this.name = name; } public String getName() { return name; } public void setStuff(Stuff stuff) { this.stuff = stuff; } public Stuff getStuff() { return stuff; } } {code} And test code: {code} val encoder = Encoders.bean(classOf[Outer]) val schema = encoder.schema schema.printTreeString val df = spark.read.schema(schema).json("stuff.json").as[Outer](encoder) df.show() df.map(x => x)(encoder).show() {code} Produces the result: {code} scala> val encoder = Encoders.bean(classOf[Outer]) encoder: org.apache.spark.sql.Encoder[pojos.Outer] = class[name[0]: string, stuff[0]: struct] scala> val schema = encoder.schema schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(stuff,StructType(StructField(name,StringType,true)),true)) scala> schema.printTreeString root |-- name: string (nullable = true) |-- stuff: struct (nullable = true) ||-- name: string (nullable = true) scala> val df = spark.read.schema(schema).json("stuff.json").as[Outer](encoder) df: org.apache.spark.sql.Dataset[pojos.Outer] = [name: string, stuff: struct] scala> df.show() ++-+ |name|stuff| ++-+ | v1| null| ++-+ scala> df.map(x => x)(encoder).show() ++--+ |name| stuff| ++--+ | v1|[null]| ++--+ {code} After identity transformation, `stuff` becomes an object with null values inside it instead of staying null itself. Doing the same with case classes preserves the nulls: {code} scala> case class ScalaStuff(name: String) defined class ScalaStuff
[jira] [Updated] (SPARK-19980) Basic Dataset transformation on POJOs does not preserve nulls.
[ https://issues.apache.org/jira/browse/SPARK-19980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michel Lemay updated SPARK-19980: - Description: Applying an identity map transformation on a statically typed Dataset with a POJO produces an unexpected result. Given POJOs: {code} public class Stuff implements Serializable { private String name; public void setName(String name) { this.name = name; } public String getName() { return name; } } public class Outer implements Serializable { private String name; private Stuff stuff; public void setName(String name) { this.name = name; } public String getName() { return name; } public void setStuff(Stuff stuff) { this.stuff = stuff; } public Stuff getStuff() { return stuff; } } {code} And test code: {code} val encoder = Encoders.bean(classOf[Outer]) val schema = encoder.schema schema.printTreeString val df = spark.read.schema(schema).json("stuff.json").as[Outer](encoder) df.show() df.map(x => x)(encoder).show() {code} Produces the result: {code} scala> val encoder = Encoders.bean(classOf[Outer]) encoder: org.apache.spark.sql.Encoder[pojos.Outer] = class[name[0]: string, stuff[0]: struct] scala> val schema = encoder.schema schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(stuff,StructType(StructField(name,StringType,true)),true)) scala> schema.printTreeString root |-- name: string (nullable = true) |-- stuff: struct (nullable = true) ||-- name: string (nullable = true) scala> val df = spark.read.schema(schema).json("stuff.json").as[Outer](encoder) df: org.apache.spark.sql.Dataset[pojos.Outer] = [name: string, stuff: struct] scala> df.show() ++-+ |name|stuff| ++-+ | v1| null| ++-+ scala> df.map(x => x)(encoder).show() ++--+ |name| stuff| ++--+ | v1|[null]| ++--+ {code} After identity transformation, `stuff` becomes an object with null values inside it instead of staying null itself. 
Doing the same with case classes preserves the nulls: {code} scala> case class ScalaStuff(name: String) defined class ScalaStuff scala> case class ScalaOuter(name: String, stuff: ScalaStuff) defined class ScalaOuter scala> val encoder2 = Encoders.product[ScalaOuter] encoder2: org.apache.spark.sql.Encoder[ScalaOuter] = class[name[0]: string, stuff[0]: struct] scala> val schema2 = encoder2.schema schema2: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(stuff,StructType(StructField(name,StringType,true)),true)) scala> schema2.printTreeString root |-- name: string (nullable = true) |-- stuff: struct (nullable = true) ||-- name: string (nullable = true) scala> scala> val df2 = spark.read.schema(schema2).json("stuff.json").as[ScalaOuter] df2: org.apache.spark.sql.Dataset[ScalaOuter] = [name: string, stuff: struct] scala> df2.show() ++-+ |name|stuff| ++-+ | v1| null| ++-+ scala> df2.map(x => x).show() ++-+ |name|stuff| ++-+ | v1| null| ++-+ {code} stuff.json: {code} {"name":"v1", "stuff":null } {code} was: Applying an identity map transformation on a statically typed Dataset with a POJO produces an unexpected result. 
Given POJOs: {code} public class Stuff implements Serializable { private String name; public void setName(String name) { this.name = name; } public String getName() { return name; } } public class Outer implements Serializable { private String name; private Stuff stuff; public void setName(String name) { this.name = name; } public String getName() { return name; } public void setStuff(Stuff stuff) { this.stuff = stuff; } public Stuff getStuff() { return stuff; } } {code} And test code: {code} val encoder = Encoders.bean(classOf[Outer]) val schema = encoder.schema schema.printTreeString val df = spark.read.schema(schema).json("d:\\stuff.json").as[Outer](encoder) df.show() df.map(x => x)(encoder).show() {code} Produces the result: {code} scala> val encoder = Encoders.bean(classOf[Outer]) encoder: org.apache.spark.sql.Encoder[pojos.Outer] = class[name[0]: string, stuff[0]: struct] scala> val schema = encoder.schema schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(stuff,StructType(StructField(name,StringType,true)),true)) scala> schema.printTreeString root |-- name: string (nullable = true) |-- stuff: struct (nullable = true) ||-- name: string (nullable = true) scala> val df = spark.read.schema(schema).json("stuff.json").as[Outer](encoder) df: org.apache.spark.sql.Dataset[pojos.Outer] = [name: string, stuff: struct] scala> df.show() ++-+ |name|stuff| ++-+ | v1| null| ++-+ scala> df.map(x => x)(encoder).show() ++--+ |name| stuff| ++--+ | v1|[null]| ++--+ {code}
[jira] [Created] (SPARK-19980) Basic Dataset transformation on POJOs does not preserve nulls.
Michel Lemay created SPARK-19980: Summary: Basic Dataset transformation on POJOs does not preserve nulls. Key: SPARK-19980 URL: https://issues.apache.org/jira/browse/SPARK-19980 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Michel Lemay Applying an identity map transformation on a statically typed Dataset with a POJO produces an unexpected result. Given POJOs: {code} public class Stuff implements Serializable { private String name; public void setName(String name) { this.name = name; } public String getName() { return name; } } public class Outer implements Serializable { private String name; private Stuff stuff; public void setName(String name) { this.name = name; } public String getName() { return name; } public void setStuff(Stuff stuff) { this.stuff = stuff; } public Stuff getStuff() { return stuff; } } {code} And test code: {code} val encoder = Encoders.bean(classOf[Outer]) val schema = encoder.schema schema.printTreeString val df = spark.read.schema(schema).json("d:\\stuff.json").as[Outer](encoder) df.show() df.map(x => x)(encoder).show() {code} Produces the result: {code} scala> val encoder = Encoders.bean(classOf[Outer]) encoder: org.apache.spark.sql.Encoder[pojos.Outer] = class[name[0]: string, stuff[0]: struct] scala> val schema = encoder.schema schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(stuff,StructType(StructField(name,StringType,true)),true)) scala> schema.printTreeString root |-- name: string (nullable = true) |-- stuff: struct (nullable = true) ||-- name: string (nullable = true) scala> val df = spark.read.schema(schema).json("stuff.json").as[Outer](encoder) df: org.apache.spark.sql.Dataset[pojos.Outer] = [name: string, stuff: struct] scala> df.show() ++-+ |name|stuff| ++-+ | v1| null| ++-+ scala> df.map(x => x)(encoder).show() ++--+ |name| stuff| ++--+ | v1|[null]| ++--+ {code} After identity transformation, `stuff` becomes an object with null values 
inside it instead of staying null itself. Doing the same with case classes preserves the nulls: {code} scala> case class ScalaStuff(name: String) defined class ScalaStuff scala> case class ScalaOuter(name: String, stuff: ScalaStuff) defined class ScalaOuter scala> val encoder2 = Encoders.product[ScalaOuter] encoder2: org.apache.spark.sql.Encoder[ScalaOuter] = class[name[0]: string, stuff[0]: struct] scala> val schema2 = encoder2.schema schema2: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(stuff,StructType(StructField(name,StringType,true)),true)) scala> schema2.printTreeString root |-- name: string (nullable = true) |-- stuff: struct (nullable = true) ||-- name: string (nullable = true) scala> scala> val df2 = spark.read.schema(schema2).json("stuff.json").as[ScalaOuter] df2: org.apache.spark.sql.Dataset[ScalaOuter] = [name: string, stuff: struct] scala> df2.show() ++-+ |name|stuff| ++-+ | v1| null| ++-+ scala> df2.map(x => x).show() ++-+ |name|stuff| ++-+ | v1| null| ++-+ {code} stuff.json: {code} {"name":"v1", "stuff":null } {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
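The failure mode described above can be sketched in plain Python. This is an illustrative model, not Spark's actual encoder code: a bean-style decoder that rebuilds the nested object field-by-field never asks whether the struct itself was null, which is how `null` becomes `[null]`; a null-aware decoder (the case-class/product-encoder behavior) checks first.

```python
# Hypothetical decoders operating on plain dicts standing in for rows.
# "Outer" has a nested "stuff" struct, as in the POJOs above.

def decode_naive(row):
    # Rebuilds the nested struct field-by-field without checking whether
    # the struct itself is null -- mirrors the reported bean-encoder result.
    stuff = row.get("stuff") or {}
    return {"name": row["name"], "stuff": {"name": stuff.get("name")}}

def decode_null_aware(row):
    # Checks the struct for null before rebuilding it -- mirrors the
    # case-class behavior that preserves nulls.
    stuff = row.get("stuff")
    return {"name": row["name"],
            "stuff": None if stuff is None else {"name": stuff.get("name")}}

row = {"name": "v1", "stuff": None}
print(decode_naive(row))       # stuff becomes {'name': None}, i.e. [null]
print(decode_null_aware(row))  # stuff stays None, i.e. null
```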
[jira] [Commented] (SPARK-18789) Save Data frame with Null column-- exception
[ https://issues.apache.org/jira/browse/SPARK-18789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928213#comment-15928213 ] Eugen Prokhorenko commented on SPARK-18789: --- Just wanted to mention that initial problem involves saving null values (the python script above doesn't have null columns in the df). > Save Data frame with Null column-- exception > > > Key: SPARK-18789 > URL: https://issues.apache.org/jira/browse/SPARK-18789 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.2 >Reporter: Harish > > I am trying to save a DF to HDFS which is having 1 column is NULL(no data). > col1 col2 col3 > a 1 null > b 1 null > c 1 null > d 1 null > code : df.write.format("orc").save(path, mode='overwrite') > Error: > java.lang.IllegalArgumentException: Error: type expected at the position 49 > of 'string:string:string:double:string:double:string:null' but 'null' is > found. > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765) > at > org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104) > at > org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182) > at > org.apache.spark.sql.hive.orc.OrcOutputWriter.(OrcFileFormat.scala:225) > at > org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94) > at > org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131) > at > 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 16/12/08 19:41:49 ERROR TaskSetManager: Task 17 in stage 512.0 failed 4 > times; aborting job > 16/12/08 19:41:49 ERROR InsertIntoHadoopFsRelationCommand: Aborting job. > org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in > stage 512.0 failed 4 times, most recent failure: Lost task 17.3 in stage > 512.0 (TID 37290, 10.63.136.108): java.lang.IllegalArgumentException: Error: > type expected at the position 49 of > 'string:string:string:double:string:double:string:null' but 'null' is found. 
> at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765) > at > org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104) > at > org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182) > at > org.apache.spark.sql.hive.orc.OrcOutputWriter.(OrcFileFormat.scala:225) > at > org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94) > at > org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247) > at >
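The error above comes from the Hive type string containing the token `null` for the all-null column. A minimal sketch of why, using hypothetical inference code (not Spark/Hive internals): a column whose every value is null gives type inference nothing to work with, so the placeholder `null` ends up in the colon-separated type string that `TypeInfoParser` then rejects.

```python
# Hypothetical type inference over in-memory columns, mimicking the shape
# of the failing 'string:...:null' type string in the stack trace above.

def infer_type(values):
    # Take the type of the first non-null value; an all-null column has
    # no concrete type, so fall back to the problematic "null" token.
    for v in values:
        if v is not None:
            return {str: "string", int: "bigint", float: "double"}[type(v)]
    return "null"

columns = {"col1": ["a", "b", "c", "d"],
           "col2": [1, 1, 1, 1],
           "col3": [None, None, None, None]}  # all-null column, as in the report

type_string = ":".join(infer_type(v) for v in columns.values())
print(type_string)  # string:bigint:null -- 'null' is what the ORC writer trips on
```

A commonly suggested workaround (an assumption here, not stated in this thread) is to cast the empty column to an explicit type before writing, e.g. `df.withColumn("col3", df["col3"].cast("string"))` in PySpark, so a concrete type replaces `null` in the schema.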
[jira] [Comment Edited] (SPARK-19977) Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from 5 ms to 30 seconds in Spark Streaming application
[ https://issues.apache.org/jira/browse/SPARK-19977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928196#comment-15928196 ] Ray Qiu edited comment on SPARK-19977 at 3/16/17 2:45 PM: -- One thing to add is that the same application will not have this issue when there is only one stream, even this one stream has a much higher load (10x). The issue seems to have something to do with multiple streams. was (Author: rayqiu): One thing to add is that the same application will not have this issue when there is only one stream, even this one stream has a much higher load. The issue seems to have something to do with multiple streams. > Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from > 5 ms to 30 seconds in Spark Streaming application > -- > > Key: SPARK-19977 > URL: https://issues.apache.org/jira/browse/SPARK-19977 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Ray Qiu > > Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from > 5 ms to 30+ seconds in a Spark Streaming application, where multiple Kafka > direct streams are processed. These kafka streams are processed separately > (not combined via union). > It causes the task processing time to increase greatly and eventually stops > working. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15040) PySpark impl for ml.feature.Imputer
[ https://issues.apache.org/jira/browse/SPARK-15040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928180#comment-15928180 ] Nick Pentreath commented on SPARK-15040: Sorry, I did not see your comment - I opened a [PR|https://github.com/apache/spark/pull/17316] already. > PySpark impl for ml.feature.Imputer > --- > > Key: SPARK-15040 > URL: https://issues.apache.org/jira/browse/SPARK-15040 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: yuhao yang >Priority: Minor > > PySpark impl for ml.feature.Imputer. > This needs to wait until PR for SPARK-13568 gets merged. > https://github.com/apache/spark/pull/11601 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19977) Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from 5 ms to 30 seconds in Spark Streaming application
[ https://issues.apache.org/jira/browse/SPARK-19977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928196#comment-15928196 ] Ray Qiu edited comment on SPARK-19977 at 3/16/17 2:45 PM: -- One thing to add is that the same application will not have this issue when there is only one stream, even this one stream has a much higher load. The issue seems to have something to do with multiple streams. was (Author: rayqiu): One thing to add is that the same application will not have this issue when there is only one stream, even this one stream has a much high load. The issue seems to have something to do with multiple streams. > Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from > 5 ms to 30 seconds in Spark Streaming application > -- > > Key: SPARK-19977 > URL: https://issues.apache.org/jira/browse/SPARK-19977 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Ray Qiu > > Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from > 5 ms to 30+ seconds in a Spark Streaming application, where multiple Kafka > direct streams are processed. These kafka streams are processed separately > (not combined via union). > It causes the task processing time to increase greatly and eventually stops > working. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19977) Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from 5 ms to 30 seconds in Spark Streaming application
[ https://issues.apache.org/jira/browse/SPARK-19977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928196#comment-15928196 ] Ray Qiu commented on SPARK-19977: - One thing to add is that the same application will not have this issue when there is only one stream, even this one stream has a much high load. The issue seems to have something to do with multiple streams. > Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from > 5 ms to 30 seconds in Spark Streaming application > -- > > Key: SPARK-19977 > URL: https://issues.apache.org/jira/browse/SPARK-19977 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Ray Qiu > > Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from > 5 ms to 30+ seconds in a Spark Streaming application, where multiple Kafka > direct streams are processed. These kafka streams are processed separately > (not combined via union). > It causes the task processing time to increase greatly and eventually stops > working. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19899) FPGrowth input column naming
[ https://issues.apache.org/jira/browse/SPARK-19899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928193#comment-15928193 ] Nick Pentreath commented on SPARK-19899: +1 on {{itemsCol}} - feel free to send a PR :) > FPGrowth input column naming > > > Key: SPARK-19899 > URL: https://issues.apache.org/jira/browse/SPARK-19899 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Maciej Szymkiewicz > > Current implementation extends {{HasFeaturesCol}}. Personally I find it > rather unfortunate. Up to this moment we used consistent conventions - if we > mix-in {{HasFeaturesCol}} the {{featuresCol}} should be {{VectorUDT}}. > Using the same {{Param}} for an {{array<string>}} (and possibly for > {{array<array<string>>}} once {{PrefixSpan}} is ported to {{ml}}) will be > confusing for the users. > I would like to suggest adding new {{trait}} (let's say > {{HasTransactionsCol}}) to clearly indicate that the input type differs for > the other {{Estimators}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19979) [MLLIB] Multiple Estimators/Pipelines In CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-19979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928195#comment-15928195 ] David Leifker commented on SPARK-19979: --- My apologies, I got a little ahead of this with a proposed PR here https://github.com/apache/spark/pull/17306 > [MLLIB] Multiple Estimators/Pipelines In CrossValidator > --- > > Key: SPARK-19979 > URL: https://issues.apache.org/jira/browse/SPARK-19979 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: David Leifker > > Update CrossValidator and TrainValidationSplit to be able to accept multiple > pipelines and grid parameters for testing different algorithms and/or being > able to better control tuning combinations. Maintains backwards compatible > API and reads legacy serialized objects. > The same could be done using an external iterative approach. Build different > pipelines, throwing each into a CrossValidator, and then taking the best > model from each of those CrossValidators. Then finally picking the best from > those. This is the initial approach I explored. It resulted in a lot of > boilerplate code that felt like it shouldn't need to exist if the API simply > allowed for arrays of estimators and their parameters. > A couple advantages to this implementation to consider come from keeping the > functional interface to the CrossValidator. > 1. The caching of the folds is better utilized. An external iterative > approach creates a new set of k folds for each CrossValidator fit and the > folds are discarded after each CrossValidator run. In this implementation a > single set of k folds is created and cached for all of the pipelines. > 2. A potential advantage of using this implementation is for future > parallelization of the pipelines within the CrossValidator. 
It is of course > possible to handle the parallelization outside of the CrossValidator here > too, however I believe there is already work in progress to parallelize the > grid parameters and that could be extended to multiple pipelines. > Both of those behind-the-scene optimizations are possible because of > providing the CrossValidator with the data and the complete set of > pipelines/estimators to evaluate up front allowing one to abstract away the > implementation. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19979) [MLLIB] Multiple Estimators/Pipelines In CrossValidator
David Leifker created SPARK-19979: - Summary: [MLLIB] Multiple Estimators/Pipelines In CrossValidator Key: SPARK-19979 URL: https://issues.apache.org/jira/browse/SPARK-19979 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 2.1.0 Reporter: David Leifker Update CrossValidator and TrainValidationSplit to be able to accept multiple pipelines and grid parameters for testing different algorithms and/or being able to better control tuning combinations. Maintains backwards compatible API and reads legacy serialized objects. The same could be done using an external iterative approach. Build different pipelines, throwing each into a CrossValidator, and then taking the best model from each of those CrossValidators. Then finally picking the best from those. This is the initial approach I explored. It resulted in a lot of boilerplate code that felt like it shouldn't need to exist if the API simply allowed for arrays of estimators and their parameters. A couple advantages to this implementation to consider come from keeping the functional interface to the CrossValidator. 1. The caching of the folds is better utilized. An external iterative approach creates a new set of k folds for each CrossValidator fit and the folds are discarded after each CrossValidator run. In this implementation a single set of k folds is created and cached for all of the pipelines. 2. A potential advantage of using this implementation is for future parallelization of the pipelines within the CrossValidator. It is of course possible to handle the parallelization outside of the CrossValidator here too, however I believe there is already work in progress to parallelize the grid parameters and that could be extended to multiple pipelines. Both of those behind-the-scene optimizations are possible because of providing the CrossValidator with the data and the complete set of pipelines/estimators to evaluate up front allowing one to abstract away the implementation. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
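The fold-caching advantage claimed in point 1 can be sketched in plain Python. This is an illustrative model, not the MLlib API: handing all candidate pipelines to one cross-validator lets it build the k folds once, whereas an external loop rebuilds them for every run. Names (`make_folds`, `cross_validate`) are hypothetical.

```python
import random

def make_folds(data, k):
    # Expensive in real Spark (shuffle + cache); invocation-counted here
    # so the two strategies can be compared.
    make_folds.calls += 1
    shuffled = list(data)
    random.Random(0).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]
make_folds.calls = 0

def cross_validate(pipelines, data, k=3):
    folds = make_folds(data, k)  # built once, shared by every pipeline
    return {name: sum(fit(fold) for fold in folds) / k
            for name, fit in pipelines.items()}

data = list(range(30))
# Stand-in "pipelines": each maps a fold to a score.
pipelines = {"lr": lambda fold: len(fold), "rf": lambda fold: len(fold) * 2}

cross_validate(pipelines, data)
print(make_folds.calls)  # 1 -- one fold construction served both pipelines

# The external iterative alternative rebuilds the folds per pipeline:
make_folds.calls = 0
for name, fit in pipelines.items():
    cross_validate({name: fit}, data)
print(make_folds.calls)  # 2 -- one fold construction per cross-validator run
```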
[jira] [Comment Edited] (SPARK-12261) pyspark crash for large dataset
[ https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928179#comment-15928179 ] Shea Parkes edited comment on SPARK-12261 at 3/16/17 2:38 PM: -- I simply added the following to the end: {code} for _ in iterator: pass {code} This will run through the rest of iterator (until the StopIteration exception like normal). Depending how you're making pyspark importable, you might need to make this change inside a zipped copy of pyspark as well (e.g. in the binary distributions downloadable from Spark's home page). was (Author: shea.parkes): I simply added the following to the end: {code} for _ in iterator: pass {code} This will run through the rest of iterator (until the StopIteration exception like normal). Depending how you're making pyspark importable, you might need to make this change inside a zipped copy of pyspark as well (e.g. in the binary distributions available). > pyspark crash for large dataset > --- > > Key: SPARK-12261 > URL: https://issues.apache.org/jira/browse/SPARK-12261 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 > Environment: windows >Reporter: zihao > > I tried to import a local text(over 100mb) file via textFile in pyspark, when > i ran data.take(), it failed and gave error messages including: > 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; > aborting job > Traceback (most recent call last): > File "E:/spark_python/test3.py", line 9, in > lines.take(5) > File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, > in take > res = self.context.runJob(self, takeUpToNumLeft, p) > File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line > 916, in runJob > port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, > partitions) > File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in > __call__ > answer, self.gateway_client, self.target_id, self.name) > File 
"D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line > 36, in deco > return f(*a, **kw) > File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in > get_return_value > format(target_id, ".", name), value) > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 > (TID 0, localhost): java.net.SocketException: Connection reset by peer: > socket write error > Then i ran the same code for a small text file, this time .take() worked fine. > How can i solve this problem? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
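The one-line fix quoted in the comment above (`for _ in iterator: pass`) can be shown end to end in a runnable sketch. The names here (`take_up_to`, `producer`) are illustrative, not PySpark's actual worker code: after taking only the first N items from a partition iterator, the remainder is drained so the producer on the other end of the socket is not cut off mid-write, which is the reported "Connection reset by peer" scenario.

```python
def take_up_to(iterator, n):
    taken = []
    for item in iterator:
        taken.append(item)
        if len(taken) >= n:
            break
    # The fix: run through the rest of the iterator (until its natural
    # StopIteration) instead of abandoning the stream early.
    for _ in iterator:
        pass
    return taken

consumed = []
def producer():
    # Stands in for reads off the worker's socket.
    for i in range(10):
        consumed.append(i)
        yield i

result = take_up_to(producer(), 5)
print(result)          # [0, 1, 2, 3, 4]
print(len(consumed))   # 10 -- the stream was fully drained
```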
[jira] [Comment Edited] (SPARK-12261) pyspark crash for large dataset
[ https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928179#comment-15928179 ] Shea Parkes edited comment on SPARK-12261 at 3/16/17 2:38 PM: -- I simply added the following to the end: {code} for _ in iterator: pass {code} This will run through the rest of iterator (until the StopIteration exception like normal). Depending how you're making pyspark importable, you might need to make this change inside a zipped copy of pyspark as well (e.g. in the binary distributions available). was (Author: shea.parkes): I simply added the following to the end: for _ in iterator: pass This will run through the rest of iterator (until the StopIteration exception like normal). Depending how you're making pyspark importable, you might need to make this change inside a zipped copy of pyspark as well (e.g. in the binary distributions available). > pyspark crash for large dataset > --- > > Key: SPARK-12261 > URL: https://issues.apache.org/jira/browse/SPARK-12261 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 > Environment: windows >Reporter: zihao > > I tried to import a local text(over 100mb) file via textFile in pyspark, when > i ran data.take(), it failed and gave error messages including: > 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; > aborting job > Traceback (most recent call last): > File "E:/spark_python/test3.py", line 9, in > lines.take(5) > File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, > in take > res = self.context.runJob(self, takeUpToNumLeft, p) > File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line > 916, in runJob > port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, > partitions) > File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in > __call__ > answer, self.gateway_client, self.target_id, self.name) > File 
"D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line > 36, in deco > return f(*a, **kw) > File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in > get_return_value > format(target_id, ".", name), value) > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 > (TID 0, localhost): java.net.SocketException: Connection reset by peer: > socket write error > Then i ran the same code for a small text file, this time .take() worked fine. > How can i solve this problem? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12261) pyspark crash for large dataset
[ https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928179#comment-15928179 ] Shea Parkes commented on SPARK-12261: - I simply added the following to the end: for _ in iterator: pass This will run through the rest of iterator (until the StopIteration exception like normal). Depending how you're making pyspark importable, you might need to make this change inside a zipped copy of pyspark as well (e.g. in the binary distributions available). > pyspark crash for large dataset > --- > > Key: SPARK-12261 > URL: https://issues.apache.org/jira/browse/SPARK-12261 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 > Environment: windows >Reporter: zihao > > I tried to import a local text(over 100mb) file via textFile in pyspark, when > i ran data.take(), it failed and gave error messages including: > 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; > aborting job > Traceback (most recent call last): > File "E:/spark_python/test3.py", line 9, in > lines.take(5) > File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, > in take > res = self.context.runJob(self, takeUpToNumLeft, p) > File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line > 916, in runJob > port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, > partitions) > File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in > __call__ > answer, self.gateway_client, self.target_id, self.name) > File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line > 36, in deco > return f(*a, **kw) > File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in > get_return_value > format(target_id, ".", name), value) > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 > (TID 0, localhost): java.net.SocketException: Connection reset by peer: > socket write error > Then i ran the same code for a small text file, this time .take() worked fine. > How can i solve this problem? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10764) Add optional caching to Pipelines
[ https://issues.apache.org/jira/browse/SPARK-10764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928169#comment-15928169 ] Sachin Tyagi commented on SPARK-10764: -- Hi, I want to take a stab at it. Here's how I am trying to approach it. A PipelineStage can be marked to persist its output DataFrame by calling persist(storageLevel, columnExprs). This results in two cases: * For Transformers -- their output DF should be marked to persist. * For Estimators -- the output of their models should be marked to persist. For example, {code:title=Example.scala|borderStyle=solid} val tokenizer = ... // CountVectorizer estimator's Model should persist its output DF (and only those columns passed in args to persist) so that the LDA iterations can run on the smaller persisted dataframe. val countVectorizer = new CountVectorizer() .setInputCol("words") .setOutputCol("features") .setVocabSize(1000) .persist(StorageLevel.MEMORY_AND_DISK, "doc_id", "features") val lda = ... val pipelineModel = new Pipeline().setStages(Array(tokenizer, countVectorizer, lda)) {code} Also, there should be a way to use the fitted pipeline to transform an already persisted dataframe if that dataframe was persisted as part of some stage during fit(). Else we end up doing unnecessary work in some cases (tokenizing and count-vectorizing the input dataframe again to get topic distributions, in the above example). Instead, in such a case, only the necessary stages should be invoked to transform. {code:title=Continue.scala|borderStyle=solid} // The pipeline model should be able to identify whether the passed DF was persisted as part of some stage and then run only the needed stages. In this case, the model should run only the LDA stage. pipelineModel.transform(countVectorizer.getCacheDF()) // This should run all stages pipelineModel.transform(unpersistedDF) {code} In my mind, this can be achieved by modifying the PipelineStage, Pipeline and PipelineModel classes.
Specifically, their transform and transformSchema methods. And obviously, by creating the appropriate persist() method(s) on PipelineStage. Please let me know your comments on this approach, specifically if you see any issues or things that need to be taken care of. I can submit a PR soon to see how it looks. > Add optional caching to Pipelines > - > > Key: SPARK-10764 > URL: https://issues.apache.org/jira/browse/SPARK-10764 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > We need to explore how to cache DataFrames during the execution of Pipelines. > It's a hard problem in general to handle automatically or manually, so we > should start with some design discussions about: > * How to control it manually > * Whether & how to handle it automatically > * API changes needed for each -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
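The two behaviors proposed above — per-stage caching, and resuming a pipeline from an already-cached intermediate — can be sketched outside Spark with a toy pipeline. All class and method names below are hypothetical illustrations of the idea, not Spark ML APIs:

```python
class Stage:
    def __init__(self, fn, cache=False):
        self.fn = fn
        self.cache = cache        # analogous to the proposed persist(...)
        self.cached_output = None

    def transform(self, data):
        out = self.fn(data)
        if self.cache:
            self.cached_output = out
        return out

class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, data):
        # If `data` is the cached output of some stage, run only the
        # stages after it instead of redoing the earlier work.
        start = 0
        for i, stage in enumerate(self.stages):
            if stage.cached_output is data:
                start = i + 1
        for stage in self.stages[start:]:
            data = stage.transform(data)
        return data

tokenize = Stage(lambda s: s.split())
count = Stage(lambda toks: {t: toks.count(t) for t in toks}, cache=True)
top = Stage(lambda counts: max(counts, key=counts.get))

pipe = Pipeline([tokenize, count, top])
print(pipe.transform("a b a c a"))          # 'a' -- runs all stages
# Re-running from the cached intermediate skips tokenize and count:
print(pipe.transform(count.cached_output))  # 'a' -- runs only the last stage
```

This mirrors the comment's second code block: passing `countVectorizer.getCacheDF()` should make the pipeline run only the LDA stage.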
[jira] [Commented] (SPARK-19977) Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from 5 ms to 30 seconds in Spark Streaming application
[ https://issues.apache.org/jira/browse/SPARK-19977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928164#comment-15928164 ] Ray Qiu commented on SPARK-19977: - Not really. Many of the batches are empty RDDs, and the scheduler delay is still in the range of seconds. This only happens after a few hours of running the application. Everything works initially. > Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from > 5 ms to 30 seconds in Spark Streaming application > -- > > Key: SPARK-19977 > URL: https://issues.apache.org/jira/browse/SPARK-19977 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Ray Qiu > > Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from > 5 ms to 30+ seconds in a Spark Streaming application, where multiple Kafka > direct streams are processed. These Kafka streams are processed separately > (not combined via union). > It causes the task processing time to increase greatly and eventually stops > working. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19977) Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from 5 ms to 30 seconds in Spark Streaming application
[ https://issues.apache.org/jira/browse/SPARK-19977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928164#comment-15928164 ] Ray Qiu edited comment on SPARK-19977 at 3/16/17 2:28 PM: -- Not really. Many of the batches are empty RDDs, and the scheduler delay is still in the range of seconds. This only happens after a few hours of running the application. Everything works initially. was (Author: rayqiu): Not really. Many of the batches are empty RDDs, and the scheduler delay still in the range of seconds. This only happen after a few hours of running the application. Everything works initially. > Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from > 5 ms to 30 seconds in Spark Streaming application > -- > > Key: SPARK-19977 > URL: https://issues.apache.org/jira/browse/SPARK-19977 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Ray Qiu > > Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from > 5 ms to 30+ seconds in a Spark Streaming application, where multiple Kafka > direct streams are processed. These kafka streams are processed separately > (not combined via union). > It causes the task processing time to increase greatly and eventually stops > working. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19946) DebugFilesystem.assertNoOpenStreams should report the open streams to help debugging
[ https://issues.apache.org/jira/browse/SPARK-19946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-19946. --- Resolution: Fixed Assignee: Bogdan Raducanu Fix Version/s: 2.2.0 > DebugFilesystem.assertNoOpenStreams should report the open streams to help > debugging > > > Key: SPARK-19946 > URL: https://issues.apache.org/jira/browse/SPARK-19946 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.1.0 >Reporter: Bogdan Raducanu >Assignee: Bogdan Raducanu >Priority: Minor > Fix For: 2.2.0 > > > In DebugFilesystem.assertNoOpenStreams if there are open streams an exception > is thrown showing the number of open streams. This doesn't help much to debug > where the open streams were leaked. > The exception should also report where the stream was leaked. This can be > done through a cause exception. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
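The improvement resolved above — reporting *where* a leaked stream was opened, not just how many are open — can be sketched in Python by capturing the call stack at open time and attaching it to the assertion failure. This is a generic illustration of the technique; the real DebugFilesystem is Scala and reports the open site through a cause exception:

```python
import traceback

open_streams = {}  # stream name -> stack captured at open time

def debug_open(name):
    # Record where the stream was opened, playing the role of the
    # "cause" that the Scala fix attaches to the exception.
    open_streams[name] = "".join(traceback.format_stack())
    return name

def close_stream(name):
    open_streams.pop(name, None)

def assert_no_open_streams():
    if open_streams:
        name, stack = next(iter(open_streams.items()))
        # Report the count AND the open site of one leaked stream.
        raise AssertionError(
            f"{len(open_streams)} stream(s) leaked; "
            f"'{name}' was opened at:\n{stack}"
        )

debug_open("part-0001")
debug_open("part-0002")
close_stream("part-0001")
leak_report = ""
try:
    assert_no_open_streams()
except AssertionError as exc:
    leak_report = str(exc)
print(leak_report)  # names the leaked stream and shows its open-site stack
```

With only the count in the message (the behavior before the fix), a test failure gives no lead on which code path forgot to close its stream.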
[jira] [Commented] (SPARK-19962) add DictVectorizor for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928123#comment-15928123 ] yu peng commented on SPARK-19962: - yeah, exactly.. i would love to use FeatureHasher when i have a lot of features :) and DictVectorizer is kind of nice to keep all the mapping for me so i can play with my classifier/regressor weights with meaningful explain :) i would like to contribute a pr if you guys think it's worth it.. > add DictVectorizor for DataFrame > > > Key: SPARK-19962 > URL: https://issues.apache.org/jira/browse/SPARK-19962 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.1.0 >Reporter: yu peng > Labels: features > > it's really useful to have something like > sklearn.feature_extraction.DictVectorizor > Since out features lives in json/data frame like format and > classifier/regressors only take vector input. so there is a gap between them. > something like > ``` > df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn', > hobbies=['sing', 'dance']),Row(age=3, gender='female', country='us', > hobbies=['sing']), ]) > df.show() > |age|gender|country|hobbies| > |1|male|cn|[sing, dance]| > |3|female|us|[sing]| > import DictVectorizor > vec = DictVectorizor() > matrix = vec.fit_transform(df) > matrix.show() > |features| > |[1, 0, 1, 0, 1, 1, 1]| > |[3, 1, 0, 1, 0, 1, 1]| > vec.show() > |feature_name| feature_dimension| > |age|0| > |gender=female|1| > |gender=male|2| > |country=us|3| > |country=cn|4| > |hobbies=sing|5| > |hobbies=dance|6| > ``` -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
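The requested behavior resembles scikit-learn's DictVectorizer: numeric values keep one column, while categorical and list-valued features expand into `feature=value` one-hot columns with a retained name-to-index mapping. A minimal pure-Python sketch of that mapping (an illustration of the idea, not the proposed Spark API):

```python
class DictVectorizer:
    """Map dict-like rows to dense vectors; categoricals one-hot encoded."""

    def fit(self, rows):
        names = []
        for row in rows:
            for key, value in row.items():
                if isinstance(value, (int, float)):
                    cands = [key]                          # numeric column
                elif isinstance(value, list):
                    cands = [f"{key}={v}" for v in value]  # multi-valued
                else:
                    cands = [f"{key}={value}"]             # categorical
                for c in cands:
                    if c not in names:
                        names.append(c)
        # Kept mapping lets you interpret model weights by feature name.
        self.index = {name: i for i, name in enumerate(names)}
        return self

    def transform(self, rows):
        out = []
        for row in rows:
            vec = [0.0] * len(self.index)
            for key, value in row.items():
                if isinstance(value, (int, float)):
                    vec[self.index[key]] = float(value)
                elif isinstance(value, list):
                    for v in value:
                        vec[self.index[f"{key}={v}"]] = 1.0
                else:
                    vec[self.index[f"{key}={value}"]] = 1.0
            out.append(vec)
        return out

rows = [
    {"age": 1, "gender": "male", "hobbies": ["sing", "dance"]},
    {"age": 3, "gender": "female", "hobbies": ["sing"]},
]
vec = DictVectorizer().fit(rows)
matrix = vec.transform(rows)
print(vec.index)  # e.g. {'age': 0, 'gender=male': 1, ...}
print(matrix)
```

As the comment notes, this keeps every column's meaning recoverable, unlike a FeatureHasher, which trades the mapping away for fixed dimensionality.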
[jira] [Updated] (SPARK-19932) Disallow a case that might cause OOM for steaming deduplication
[ https://issues.apache.org/jira/browse/SPARK-19932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-19932: -- Summary: Disallow a case that might cause OOM for steaming deduplication (was: Disallow a case that might case OOM for steaming deduplication) > Disallow a case that might cause OOM for steaming deduplication > --- > > Key: SPARK-19932 > URL: https://issues.apache.org/jira/browse/SPARK-19932 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Liwei Lin > > {code} > spark >.readStream // schema: (word, eventTime), like ("a", 10), > ("a", 11), ("b", 12) ... >... >.withWatermark("eventTime", "10 seconds") >.dropDuplicates("word") // note: "eventTime" is not part of the key > columns >... > {code} > As shown above, right now if watermark is specified for a streaming > dropDuplicates query, but not specified as the key columns, then we'll still > get the correct answer, but the state just keeps growing and will never get > cleaned up. > The reason is, the watermark attribute is not part of the key of the state > store in this case. We're not saving event time information in the state > store. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
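The state-growth problem described above can be mimicked with a toy dedup loop: when the watermark column is not part of the dedup key, the stored state carries no event time, so nothing can ever be evicted; keying on (word, eventTime) lets the watermark purge old entries. A hypothetical Python sketch, not Spark's state store implementation:

```python
def dedup_keyed_on_word(events):
    """Key = word only: state has no event time attached, so a
    watermark can never clean it up -- it grows without bound."""
    seen = set()
    out = []
    for word, event_time in events:
        if word not in seen:
            seen.add(word)
            out.append((word, event_time))
    return out, seen

def dedup_keyed_with_time(events, watermark_delay):
    """Key = (word, event_time): entries older than the watermark
    can be evicted, bounding the state."""
    seen = {}
    out = []
    max_time = 0
    for word, event_time in events:
        max_time = max(max_time, event_time)
        watermark = max_time - watermark_delay
        # Evict state that has fallen behind the watermark.
        seen = {k: t for k, t in seen.items() if t >= watermark}
        if (word, event_time) not in seen:
            seen[(word, event_time)] = event_time
            out.append((word, event_time))
    return out, seen

# Schema (word, eventTime), as in the issue: ("a", 10), ("a", 11), ...
events = [("a", 10), ("a", 11), ("b", 12), ("a", 30)]
_, unbounded = dedup_keyed_on_word(events)
_, bounded = dedup_keyed_with_time(events, watermark_delay=10)
print(len(unbounded))  # one entry per distinct word, kept forever
print(len(bounded))    # only entries newer than the watermark remain
```

In the first variant the answer is still correct, exactly as the issue says, but `seen` only ever grows — which is why Spark chose to disallow the configuration rather than silently accumulate state.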
[jira] [Resolved] (SPARK-6678) select count(DISTINCT C_UID) from parquetdir may be can optimize
[ https://issues.apache.org/jira/browse/SPARK-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-6678. - Resolution: Not A Problem I am resolving this as this code path has been radically changed. I still would like to help if you face this issue. Please reopen this if you meet this issue. Let's verify this together. > select count(DISTINCT C_UID) from parquetdir may be can optimize > > > Key: SPARK-6678 > URL: https://issues.apache.org/jira/browse/SPARK-6678 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 >Reporter: Littlestar >Priority: Minor > > 2.2T parquet files(5000 files total, 100 billion records, 2 billion unique > C_UID). > I run the following sql, may be RDD.collect is very slow > select count(DISTINCT C_UID) from parquetdir > select count(DISTINCT C_UID) from parquetdir > collect at SparkPlan.scala:83 +details > org.apache.spark.rdd.RDD.collect(RDD.scala:813) > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:83) > org.apache.spark.sql.DataFrame.collect(DataFrame.scala:815) > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:178) > org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231) > org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218) > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > java.lang.reflect.Method.invoke(Method.java:606) > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79) > org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37) > org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64) > 
java.security.AccessController.doPrivileged(Native Method) > javax.security.auth.Subject.doAs(Subject.java:415) > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) > org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493) > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60) > com.sun.proxy.$Proxy23.executeStatementAsync(Unknown Source) > org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19932) Disallow a case that might case OOM for steaming deduplication
[ https://issues.apache.org/jira/browse/SPARK-19932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liwei Lin updated SPARK-19932: -- Summary: Disallow a case that might case OOM for steaming deduplication (was: Also save event time into StateStore for certain cases) > Disallow a case that might case OOM for steaming deduplication > -- > > Key: SPARK-19932 > URL: https://issues.apache.org/jira/browse/SPARK-19932 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Liwei Lin > > {code} > spark >.readStream // schema: (word, eventTime), like ("a", 10), > ("a", 11), ("b", 12) ... >... >.withWatermark("eventTime", "10 seconds") >.dropDuplicates("word") // note: "eventTime" is not part of the key > columns >... > {code} > As shown above, right now if watermark is specified for a streaming > dropDuplicates query, but not specified as the key columns, then we'll still > get the correct answer, but the state just keeps growing and will never get > cleaned up. > The reason is, the watermark attribute is not part of the key of the state > store in this case. We're not saving event time information in the state > store. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18579) spark-csv strips whitespace (pyspark)
[ https://issues.apache.org/jira/browse/SPARK-18579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928065#comment-15928065 ] Hyukjin Kwon commented on SPARK-18579: -- I submitted a PR for this, https://github.com/apache/spark/pull/17310, but it seems the link is not being added automatically for now. > spark-csv strips whitespace (pyspark) > -- > > Key: SPARK-18579 > URL: https://issues.apache.org/jira/browse/SPARK-18579 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.2 >Reporter: Adrian Bridgett >Priority: Minor > > ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace are supported on the CSV > reader (and default to false). > However these are not supported options on the CSV writer, and so the library > defaults take place, which strips the whitespace. > I think it would make the most sense if the writer semantics matched the > reader (and did not alter the data); however, this would be a change in > behaviour. In any case it'd be great to have the _option_ to strip or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
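The asymmetry the issue describes — the reader exposes whitespace options while the writer silently strips — can be demonstrated with a round-trip sketch using Python's stdlib csv module. The `strip_on_write` flag here is a hypothetical stand-in for the writer option the issue asks for, not a real spark-csv option:

```python
import csv
import io

def write_csv(rows, strip_on_write=False):
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        if strip_on_write:
            # This is the behavior the reporter objects to:
            # the writer alters the data it was given.
            row = [cell.strip() for cell in row]
        writer.writerow(row)
    return buf.getvalue()

rows = [["  padded  ", "plain"]]
kept = write_csv(rows)                         # whitespace preserved
stripped = write_csv(rows, strip_on_write=True)

# Python's csv.reader, like Spark's reader with the ignore*WhiteSpace
# options left at false, does not strip on read:
print(next(csv.reader(io.StringIO(kept))))      # ['  padded  ', 'plain']
print(next(csv.reader(io.StringIO(stripped))))  # ['padded', 'plain']
```

The round trip shows why matching writer and reader semantics matters: only the first variant gives back exactly the data that was written.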