[jira] [Commented] (SPARK-19288) Failure (at test_sparkSQL.R#1300): date functions on a DataFrame in R/run-tests.sh

2017-03-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929438#comment-15929438
 ] 

Hyukjin Kwon commented on SPARK-19288:
--

FWIW, it has been fine for me on Mac OS 10.12.3 & KTS & R version 3.2.3.
It has also been fine on Windows via AppVeyor so far.

> Failure (at test_sparkSQL.R#1300): date functions on a DataFrame in 
> R/run-tests.sh
> --
>
> Key: SPARK-19288
> URL: https://issues.apache.org/jira/browse/SPARK-19288
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL, Tests
>Affects Versions: 2.0.1
> Environment: Ubuntu 16.04, X86_64, ppc64le
>Reporter: Nirman Narang
>
> Full log here.
> {code:title=R/run-tests.sh|borderStyle=solid}
> Loading required package: methods
> Attaching package: 'SparkR'
> The following object is masked from 'package:testthat':
> describe
> The following objects are masked from 'package:stats':
> cov, filter, lag, na.omit, predict, sd, var, window
> The following objects are masked from 'package:base':
> as.data.frame, colnames, colnames<-, drop, intersect, rank, rbind,
> sample, subset, summary, transform, union
> functions on binary files : Spark package found in SPARK_HOME: 
> /var/lib/jenkins/workspace/Sparkv2.0.1/spark
> 
> binary functions : ...
> broadcast variables : ..
> functions in client.R : .
> test functions in sparkR.R : .Re-using existing Spark Context. Call 
> sparkR.session.stop() or restart R to create a new Spark Context
> ...
> include R packages : Spark package found in SPARK_HOME: 
> /var/lib/jenkins/workspace/Sparkv2.0.1/spark
> JVM API : ..
> MLlib functions : Spark package found in SPARK_HOME: 
> /var/lib/jenkins/workspace/Sparkv2.0.1/spark
> .SLF4J: Failed to load class 
> "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> .Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet dictionary page size to 1048576
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Dictionary is on
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Validation is off
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Writer version is: PARQUET_1_0
> Jan 19, 2017 5:40:54 PM INFO: 
> org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem 
> columnStore to file. allocated memory: 65,622
> Jan 19, 2017 5:40:54 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 70B for [label] 
> BINARY: 1 values, 21B raw, 23B comp, 1 pages, encodings: [PLAIN, BIT_PACKED, 
> RLE]
> Jan 19, 2017 5:40:54 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 87B for [terms, 
> list, element, list, element] BINARY: 2 values, 42B raw, 43B comp, 1 pages, 
> encodings: [PLAIN, RLE]
> Jan 19, 2017 5:40:54 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 30B for 
> [hasIntercept] BOOLEAN: 1 values, 1B raw, 3B comp, 1 pages, encodings: 
> [PLAIN, BIT_PACKED]
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet dictionary page size to 1048576
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Dictionary is on
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Validation is off
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Writer version is: PARQUET_1_0
> Jan 19, 2017 5:40:55 PM INFO: 
> org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem 
> columnStore to file. allocated memory: 49
> Jan 19, 2017 5:40:55 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 90B for [labels, 
> list, element] BINARY: 3 values, 50B raw, 50B comp, 1 pages, encodings: 
> [PLAIN, RLE]
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Jan 19, 2017 5:40:55 PM INFO: 

[jira] [Commented] (SPARK-19288) Failure (at test_sparkSQL.R#1300): date functions on a DataFrame in R/run-tests.sh

2017-03-16 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929433#comment-15929433
 ] 

Miao Wang commented on SPARK-19288:
---

I think it only happens in local builds. I once had another similar issue caused by 
building Hive. If you comment out this test, the suite will pass. I don't think 
Jenkins will hit this issue.

> Failure (at test_sparkSQL.R#1300): date functions on a DataFrame in 
> R/run-tests.sh
> --
>
> Key: SPARK-19288
> URL: https://issues.apache.org/jira/browse/SPARK-19288
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL, Tests
>Affects Versions: 2.0.1
> Environment: Ubuntu 16.04, X86_64, ppc64le
>Reporter: Nirman Narang
>
> Full log here.
> {code:title=R/run-tests.sh|borderStyle=solid}
> Loading required package: methods
> Attaching package: 'SparkR'
> The following object is masked from 'package:testthat':
> describe
> The following objects are masked from 'package:stats':
> cov, filter, lag, na.omit, predict, sd, var, window
> The following objects are masked from 'package:base':
> as.data.frame, colnames, colnames<-, drop, intersect, rank, rbind,
> sample, subset, summary, transform, union
> functions on binary files : Spark package found in SPARK_HOME: 
> /var/lib/jenkins/workspace/Sparkv2.0.1/spark
> 
> binary functions : ...
> broadcast variables : ..
> functions in client.R : .
> test functions in sparkR.R : .Re-using existing Spark Context. Call 
> sparkR.session.stop() or restart R to create a new Spark Context
> ...
> include R packages : Spark package found in SPARK_HOME: 
> /var/lib/jenkins/workspace/Sparkv2.0.1/spark
> JVM API : ..
> MLlib functions : Spark package found in SPARK_HOME: 
> /var/lib/jenkins/workspace/Sparkv2.0.1/spark
> .SLF4J: Failed to load class 
> "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> .Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet dictionary page size to 1048576
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Dictionary is on
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Validation is off
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Writer version is: PARQUET_1_0
> Jan 19, 2017 5:40:54 PM INFO: 
> org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem 
> columnStore to file. allocated memory: 65,622
> Jan 19, 2017 5:40:54 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 70B for [label] 
> BINARY: 1 values, 21B raw, 23B comp, 1 pages, encodings: [PLAIN, BIT_PACKED, 
> RLE]
> Jan 19, 2017 5:40:54 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 87B for [terms, 
> list, element, list, element] BINARY: 2 values, 42B raw, 43B comp, 1 pages, 
> encodings: [PLAIN, RLE]
> Jan 19, 2017 5:40:54 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 30B for 
> [hasIntercept] BOOLEAN: 1 values, 1B raw, 3B comp, 1 pages, encodings: 
> [PLAIN, BIT_PACKED]
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet dictionary page size to 1048576
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Dictionary is on
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Validation is off
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Writer version is: PARQUET_1_0
> Jan 19, 2017 5:40:55 PM INFO: 
> org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem 
> columnStore to file. allocated memory: 49
> Jan 19, 2017 5:40:55 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 90B for [labels, 
> list, element] BINARY: 3 values, 50B raw, 50B comp, 1 pages, encodings: 
> [PLAIN, RLE]
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: 

[jira] [Comment Edited] (SPARK-19827) spark.ml R API for PIC

2017-03-16 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929429#comment-15929429
 ] 

Miao Wang edited comment on SPARK-19827 at 3/17/17 5:15 AM:


Please hold on. We need to add the wrapper to ML instead of MLlib, and the ML wrapper 
is not merged yet.
Please wait until https://github.com/apache/spark/pull/15770 is merged before 
submitting a PR.
cc [~felixcheung]


was (Author: wm624):
Please hold on. We need to add wrapper to ML instead of MLLIB. The ML wrapper 
is not merged yet.
Please wait for https://github.com/apache/spark/pull/15770 merged before 
submitting PR.

> spark.ml R API for PIC
> --
>
> Key: SPARK-19827
> URL: https://issues.apache.org/jira/browse/SPARK-19827
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>







[jira] [Commented] (SPARK-19827) spark.ml R API for PIC

2017-03-16 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929429#comment-15929429
 ] 

Miao Wang commented on SPARK-19827:
---

Please hold on. We need to add the wrapper to ML instead of MLlib, and the ML wrapper 
is not merged yet.
Please wait until https://github.com/apache/spark/pull/15770 is merged before 
submitting a PR.

> spark.ml R API for PIC
> --
>
> Key: SPARK-19827
> URL: https://issues.apache.org/jira/browse/SPARK-19827
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>







[jira] [Commented] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using

2017-03-16 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929422#comment-15929422
 ] 

Kay Ousterhout commented on SPARK-19990:


Thanks [~windpiger]!

> Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create 
> temporary view using
> --
>
> Key: SPARK-19990
> URL: https://issues.apache.org/jira/browse/SPARK-19990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Kay Ousterhout
>
> This test seems to be failing consistently on all of the maven builds: 
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite_name=create+temporary+view+using
>  and is possibly caused by SPARK-19763.
> Here's a stack trace for the failure: 
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: 
> jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv
>   at org.apache.hadoop.fs.Path.initialize(Path.java:206)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:172)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:343)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:343)
>   at 
> org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:91)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62)
>   at 
> org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38$$anonfun$apply$mcV$sp$8.apply$mcV$sp(DDLSuite.scala:705)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withView(SQLTestUtils.scala:186)
>   at 
> org.apache.spark.sql.execution.command.DDLSuite.withView(DDLSuite.scala:171)
>   at 
> org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply$mcV$sp(DDLSuite.scala:704)
>   at 
> org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701)
>   at 
> org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(HiveDDLSuite.scala:41)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
>   at 
> org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.runTest(HiveDDLSuite.scala:41)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> 

[jira] [Comment Edited] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using

2017-03-16 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929409#comment-15929409
 ] 

Song Jun edited comment on SPARK-19990 at 3/17/17 4:36 AM:
---

The root cause is that [the CSV file path in this test case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703] is 
"jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv", 
which fails when it is passed to new Path() ([new Path() in DataSource.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344]).

cars.csv is stored in the core module's resources.

After merging HiveDDLSuite and DDLSuite in SPARK-19235 
(https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9), 
running the hive module tests also runs DDLSuite from the core module, which is how 
we end up with the illegal 'jar:file:/xxx' path above.

It is not related to SPARK-19763.

I will fix this by adding a new test directory under sql/ that contains the test 
files, and making the test case use that path.

Thanks!
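
For illustration, a minimal sketch of the failing call (the jar path below is shortened and hypothetical); it assumes only that the test file path ends up as a jar: URI:

{code}
import org.apache.hadoop.fs.Path

// Hadoop's Path takes "jar" as the URI scheme and treats the remainder
// ("file:/...jar!/test-data/cars.csv") as the path component; since that
// component does not start with '/', URI construction fails.
val csv = "jar:file:/tmp/spark-sql-tests.jar!/test-data/cars.csv"
try {
  new Path(csv) // the same call as DataSource.scala#L344
} catch {
  case e: IllegalArgumentException =>
    println(e.getMessage) // java.net.URISyntaxException: Relative path in absolute URI: ...
}
{code}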



was (Author: windpiger):
the root cause is [the csvfile path in this test 
case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703]
 is 
"jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv",
 which will failed when new Path() [new Path in datasource.scala 
|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344]

and the cars.csv are stored in module core's resources.

after we merge the HiveDDLSuit and DDLSuit 
https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9,if
 we test module hive, we will run the DDLSuit in the core module, and this will 
cause that we get the illegal path like 'jar:file:/xxx' above.

it is not related with SPARK-19763

I will fix this by providing a new test dir which contain the test files in 
sql/ , and the test case use this file path.

thanks~


> Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create 
> temporary view using
> --
>
> Key: SPARK-19990
> URL: https://issues.apache.org/jira/browse/SPARK-19990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Kay Ousterhout
>
> This test seems to be failing consistently on all of the maven builds: 
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite_name=create+temporary+view+using
>  and is possibly caused by SPARK-19763.
> Here's a stack trace for the failure: 
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: 
> jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv
>   at org.apache.hadoop.fs.Path.initialize(Path.java:206)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:172)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:343)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:343)
>   at 
> org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:91)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
>   at 
> 

[jira] [Comment Edited] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using

2017-03-16 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929409#comment-15929409
 ] 

Song Jun edited comment on SPARK-19990 at 3/17/17 4:35 AM:
---

The root cause is that [the CSV file path in this test case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703] is 
"jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv", 
which fails when it is passed to new Path() ([new Path() in DataSource.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344]).

cars.csv is stored in the core module's resources.

After merging HiveDDLSuite and DDLSuite 
(https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9), 
running the hive module tests also runs DDLSuite from the core module, which is how 
we end up with the illegal 'jar:file:/xxx' path above.

It is not related to SPARK-19763.

I will fix this by adding a new test directory under sql/ that contains the test 
files, and making the test case use that path.

Thanks!



was (Author: windpiger):
the root cause is [the csvfile path in this test 
case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703]
 is 
"jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv",
 which will failed when new Path() [new Path in datasource.scala 
|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344]

and the cars.csv are stored in module core's resources.

after we merge the HiveDDLSuit and DDLSuit 
https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9,if
 we test module hive, we will run the DDLSuit in the core module, and this will 
cause that we get the illegal path like 'jar:file:/xxx' above.


> Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create 
> temporary view using
> --
>
> Key: SPARK-19990
> URL: https://issues.apache.org/jira/browse/SPARK-19990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Kay Ousterhout
>
> This test seems to be failing consistently on all of the maven builds: 
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite_name=create+temporary+view+using
>  and is possibly caused by SPARK-19763.
> Here's a stack trace for the failure: 
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: 
> jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv
>   at org.apache.hadoop.fs.Path.initialize(Path.java:206)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:172)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:343)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:343)
>   at 
> org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:91)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62)
>   at 
> 

[jira] [Commented] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using

2017-03-16 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929409#comment-15929409
 ] 

Song Jun commented on SPARK-19990:
--

The root cause is that [the CSV file path in this test case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703] is 
"jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv", 
which fails when it is passed to new Path() ([new Path() in DataSource.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344]).

cars.csv is stored in the core module's resources.

After merging HiveDDLSuite and DDLSuite 
(https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9), 
running the hive module tests also runs DDLSuite from the core module, which is how 
we end up with the illegal 'jar:file:/xxx' path above.


> Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create 
> temporary view using
> --
>
> Key: SPARK-19990
> URL: https://issues.apache.org/jira/browse/SPARK-19990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Kay Ousterhout
>
> This test seems to be failing consistently on all of the maven builds: 
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite_name=create+temporary+view+using
>  and is possibly caused by SPARK-19763.
> Here's a stack trace for the failure: 
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: 
> jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv
>   at org.apache.hadoop.fs.Path.initialize(Path.java:206)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:172)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:343)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:343)
>   at 
> org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:91)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62)
>   at 
> org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38$$anonfun$apply$mcV$sp$8.apply$mcV$sp(DDLSuite.scala:705)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withView(SQLTestUtils.scala:186)
>   at 
> org.apache.spark.sql.execution.command.DDLSuite.withView(DDLSuite.scala:171)
>   at 
> org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply$mcV$sp(DDLSuite.scala:704)
>   at 
> org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701)
>   at 
> org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> 

[jira] [Comment Edited] (SPARK-19984) ERROR codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'

2017-03-16 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929399#comment-15929399
 ] 

Kazuaki Ishizaki edited comment on SPARK-19984 at 3/17/17 4:23 AM:
---

This problem occurs because Spark generates a {{.compare(UTF8String)}} method call for the 
{{long}} primitive type; it should not be generated. I am investigating from the log 
why this happens.
Could you post your code, even though it may not always reproduce this problem?


was (Author: kiszk):
This problem occurs since Spark generates {.compare()} method for {long} 
primitive type. It should not be generated. I am investigating why it occurs 
from the log.
Can you post the code while it may not always reproduce this problem?

> ERROR codegen.CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java'
> -
>
> Key: SPARK-19984
> URL: https://issues.apache.org/jira/browse/SPARK-19984
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Andrey Yakovenko
>
> I have hit this error a few times on my local Hadoop 2.7.3 + Spark 2.1.0 environment. 
> It is not a permanent error; the next time I run the job it can disappear. 
> Unfortunately I don't know how to reproduce the issue. As you can see from 
> the log, my logic is pretty complicated.
> Here is a part of log i've got (container_1489514660953_0015_01_01)
> {code}
> 17/03/16 11:07:04 ERROR codegen.CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 151, Column 29: A method named "compare" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ final class GeneratedIterator extends 
> org.apache.spark.sql.execution.BufferedRowIterator {
> /* 006 */   private Object[] references;
> /* 007 */   private scala.collection.Iterator[] inputs;
> /* 008 */   private boolean agg_initAgg;
> /* 009 */   private boolean agg_bufIsNull;
> /* 010 */   private long agg_bufValue;
> /* 011 */   private boolean agg_initAgg1;
> /* 012 */   private boolean agg_bufIsNull1;
> /* 013 */   private long agg_bufValue1;
> /* 014 */   private scala.collection.Iterator smj_leftInput;
> /* 015 */   private scala.collection.Iterator smj_rightInput;
> /* 016 */   private InternalRow smj_leftRow;
> /* 017 */   private InternalRow smj_rightRow;
> /* 018 */   private UTF8String smj_value2;
> /* 019 */   private java.util.ArrayList smj_matches;
> /* 020 */   private UTF8String smj_value3;
> /* 021 */   private UTF8String smj_value4;
> /* 022 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> smj_numOutputRows;
> /* 023 */   private UnsafeRow smj_result;
> /* 024 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder smj_holder;
> /* 025 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> smj_rowWriter;
> /* 026 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_numOutputRows;
> /* 027 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_aggTime;
> /* 028 */   private UnsafeRow agg_result;
> /* 029 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder;
> /* 030 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> agg_rowWriter;
> /* 031 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_numOutputRows1;
> /* 032 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_aggTime1;
> /* 033 */   private UnsafeRow agg_result1;
> /* 034 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder1;
> /* 035 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> agg_rowWriter1;
> /* 036 */
> /* 037 */   public GeneratedIterator(Object[] references) {
> /* 038 */ this.references = references;
> /* 039 */   }
> /* 040 */
> /* 041 */   public void init(int index, scala.collection.Iterator[] inputs) {
> /* 042 */ partitionIndex = index;
> /* 043 */ this.inputs = inputs;
> /* 044 */ wholestagecodegen_init_0();
> /* 045 */ wholestagecodegen_init_1();
> /* 046 */
> /* 047 */   }
> /* 048 */
> /* 049 */   private void wholestagecodegen_init_0() {
> /* 050 */ agg_initAgg = false;
> /* 051 */
> /* 052 */ agg_initAgg1 = false;
> /* 053 */
> /* 054 */ smj_leftInput = inputs[0];
> /* 055 */ smj_rightInput = inputs[1];
> /* 056 */
> /* 057 */ smj_rightRow = null;
> /* 058 */
> /* 059 */ smj_matches = new java.util.ArrayList();
> /* 060 */
> /* 061 */  

[jira] [Commented] (SPARK-19984) ERROR codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'

2017-03-16 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929399#comment-15929399
 ] 

Kazuaki Ishizaki commented on SPARK-19984:
--

This problem occurs because Spark generates a {{.compare()}} method call for the {{long}} 
primitive type; it should not be generated. I am investigating from the log why this 
happens.
Could you post your code, even though it may not always reproduce this problem?

> ERROR codegen.CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java'
> -
>
> Key: SPARK-19984
> URL: https://issues.apache.org/jira/browse/SPARK-19984
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Andrey Yakovenko
>
> I have hit this error a few times on my local Hadoop 2.7.3 + Spark 2.1.0 environment. 
> It is not a permanent error; the next time I run the job it can disappear. 
> Unfortunately I don't know how to reproduce the issue. As you can see from 
> the log, my logic is pretty complicated.
> Here is a part of log i've got (container_1489514660953_0015_01_01)
> {code}
> 17/03/16 11:07:04 ERROR codegen.CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 151, Column 29: A method named "compare" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ final class GeneratedIterator extends 
> org.apache.spark.sql.execution.BufferedRowIterator {
> /* 006 */   private Object[] references;
> /* 007 */   private scala.collection.Iterator[] inputs;
> /* 008 */   private boolean agg_initAgg;
> /* 009 */   private boolean agg_bufIsNull;
> /* 010 */   private long agg_bufValue;
> /* 011 */   private boolean agg_initAgg1;
> /* 012 */   private boolean agg_bufIsNull1;
> /* 013 */   private long agg_bufValue1;
> /* 014 */   private scala.collection.Iterator smj_leftInput;
> /* 015 */   private scala.collection.Iterator smj_rightInput;
> /* 016 */   private InternalRow smj_leftRow;
> /* 017 */   private InternalRow smj_rightRow;
> /* 018 */   private UTF8String smj_value2;
> /* 019 */   private java.util.ArrayList smj_matches;
> /* 020 */   private UTF8String smj_value3;
> /* 021 */   private UTF8String smj_value4;
> /* 022 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> smj_numOutputRows;
> /* 023 */   private UnsafeRow smj_result;
> /* 024 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder smj_holder;
> /* 025 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> smj_rowWriter;
> /* 026 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_numOutputRows;
> /* 027 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_aggTime;
> /* 028 */   private UnsafeRow agg_result;
> /* 029 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder;
> /* 030 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> agg_rowWriter;
> /* 031 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_numOutputRows1;
> /* 032 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_aggTime1;
> /* 033 */   private UnsafeRow agg_result1;
> /* 034 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder1;
> /* 035 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> agg_rowWriter1;
> /* 036 */
> /* 037 */   public GeneratedIterator(Object[] references) {
> /* 038 */ this.references = references;
> /* 039 */   }
> /* 040 */
> /* 041 */   public void init(int index, scala.collection.Iterator[] inputs) {
> /* 042 */ partitionIndex = index;
> /* 043 */ this.inputs = inputs;
> /* 044 */ wholestagecodegen_init_0();
> /* 045 */ wholestagecodegen_init_1();
> /* 046 */
> /* 047 */   }
> /* 048 */
> /* 049 */   private void wholestagecodegen_init_0() {
> /* 050 */ agg_initAgg = false;
> /* 051 */
> /* 052 */ agg_initAgg1 = false;
> /* 053 */
> /* 054 */ smj_leftInput = inputs[0];
> /* 055 */ smj_rightInput = inputs[1];
> /* 056 */
> /* 057 */ smj_rightRow = null;
> /* 058 */
> /* 059 */ smj_matches = new java.util.ArrayList();
> /* 060 */
> /* 061 */ this.smj_numOutputRows = 
> (org.apache.spark.sql.execution.metric.SQLMetric) references[0];
> /* 062 */ smj_result = new UnsafeRow(2);
> /* 063 */ this.smj_holder = new 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(smj_result, 
> 64);
> /* 064 */ this.smj_rowWriter = new 
> 

[jira] [Created] (SPARK-19991) FileSegmentManagedBuffer performance improvement.

2017-03-16 Thread Guoqiang Li (JIRA)
Guoqiang Li created SPARK-19991:
---

 Summary: FileSegmentManagedBuffer performance improvement.
 Key: SPARK-19991
 URL: https://issues.apache.org/jira/browse/SPARK-19991
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 2.1.0, 2.0.2
Reporter: Guoqiang Li
Priority: Minor


When the values of the configuration items 
{{spark.storage.memoryMapThreshold}} and {{spark.shuffle.io.lazyFD}} are not set, 
each call to the FileSegmentManagedBuffer.nioByteBuffer or 
FileSegmentManagedBuffer.createInputStream method creates a 
NoSuchElementException instance, which is a relatively time-consuming operation.
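
For illustration, here is a minimal sketch of the pattern visible in the stack trace below; the class and method names are hypothetical stand-ins rather than Spark's actual code:

{code}
import java.util.NoSuchElementException

// A config provider whose get() throws when a key is absent, and a getBoolean()
// that falls back to the default by catching that exception. If the key is never
// set, every lookup constructs a NoSuchElementException, and filling in its stack
// trace is the expensive part on a hot path such as serving shuffle fetches.
class ExampleConfigProvider(values: Map[String, String]) {
  def get(name: String): String =
    values.getOrElse(name, throw new NoSuchElementException(name)) // stack trace filled in here

  def getBoolean(name: String, default: Boolean): Boolean =
    try get(name).toBoolean
    catch { case _: NoSuchElementException => default } // default reached only via the exception
}

// With the keys above unset, every call pays for a fresh exception:
val conf = new ExampleConfigProvider(Map.empty)
conf.getBoolean("spark.shuffle.io.lazyFD", true)
{code}

One obvious mitigation, and presumably the point of this issue, is to resolve such flags once and reuse the value rather than re-resolving them for every buffer.
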
The shuffle-server thread's stack:

{noformat}
"shuffle-server-2-42" #335 daemon prio=5 os_prio=0 tid=0x7f71e4507800 
nid=0x28d12 runnable [0x7f71af93e000]
   java.lang.Thread.State: RUNNABLE
at java.lang.Throwable.fillInStackTrace(Native Method)
at java.lang.Throwable.fillInStackTrace(Throwable.java:783)
- locked <0x0007a930f080> (a java.util.NoSuchElementException)
at java.lang.Throwable.<init>(Throwable.java:265)
at java.lang.Exception.<init>(Exception.java:66)
at java.lang.RuntimeException.<init>(RuntimeException.java:62)
at 
java.util.NoSuchElementException.<init>(NoSuchElementException.java:57)
at 
org.apache.spark.network.yarn.util.HadoopConfigProvider.get(HadoopConfigProvider.java:38)
at 
org.apache.spark.network.util.ConfigProvider.get(ConfigProvider.java:31)
at 
org.apache.spark.network.util.ConfigProvider.getBoolean(ConfigProvider.java:50)
at 
org.apache.spark.network.util.TransportConf.lazyFileDescriptor(TransportConf.java:157)
at 
org.apache.spark.network.buffer.FileSegmentManagedBuffer.convertToNetty(FileSegmentManagedBuffer.java:132)
at 
org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:54)
at 
org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
at 
org.spark_project.io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:735)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:820)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:728)
at 
org.spark_project.io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:284)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:806)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:818)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:799)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:835)
at 
org.spark_project.io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1017)
at 
org.spark_project.io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:256)
at 
org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:194)
at 
org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:135)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at 
org.spark_project.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
at 
org.spark_project.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at 

[jira] [Updated] (SPARK-19991) FileSegmentManagedBuffer performance improvement.

2017-03-16 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-19991:

Description: 
When the values of the configuration items 
{{spark.storage.memoryMapThreshold}} and {{spark.shuffle.io.lazyFD}} are not set, 
each call to the FileSegmentManagedBuffer.nioByteBuffer or 
FileSegmentManagedBuffer.createInputStream method creates a 
NoSuchElementException instance, which is a relatively time-consuming operation.
The shuffle-server thread's stack:

{noformat}
"shuffle-server-2-42" #335 daemon prio=5 os_prio=0 tid=0x7f71e4507800 
nid=0x28d12 runnable [0x7f71af93e000]
   java.lang.Thread.State: RUNNABLE
at java.lang.Throwable.fillInStackTrace(Native Method)
at java.lang.Throwable.fillInStackTrace(Throwable.java:783)
- locked <0x0007a930f080> (a java.util.NoSuchElementException)
at java.lang.Throwable.<init>(Throwable.java:265)
at java.lang.Exception.<init>(Exception.java:66)
at java.lang.RuntimeException.<init>(RuntimeException.java:62)
at 
java.util.NoSuchElementException.<init>(NoSuchElementException.java:57)
at 
org.apache.spark.network.yarn.util.HadoopConfigProvider.get(HadoopConfigProvider.java:38)
at 
org.apache.spark.network.util.ConfigProvider.get(ConfigProvider.java:31)
at 
org.apache.spark.network.util.ConfigProvider.getBoolean(ConfigProvider.java:50)
at 
org.apache.spark.network.util.TransportConf.lazyFileDescriptor(TransportConf.java:157)
at 
org.apache.spark.network.buffer.FileSegmentManagedBuffer.convertToNetty(FileSegmentManagedBuffer.java:132)
at 
org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:54)
at 
org.apache.spark.network.protocol.MessageEncoder.encode(MessageEncoder.java:33)
at 
org.spark_project.io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:88)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:735)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:820)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:728)
at 
org.spark_project.io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:284)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:806)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:818)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:799)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:835)
at 
org.spark_project.io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1017)
at 
org.spark_project.io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:256)
at 
org.apache.spark.network.server.TransportRequestHandler.respond(TransportRequestHandler.java:194)
at 
org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:135)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:105)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at 
org.spark_project.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
at 
org.spark_project.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at 
org.spark_project.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at 

[jira] [Updated] (SPARK-19736) refreshByPath should clear all cached plans with the specified path

2017-03-16 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-19736:

Fix Version/s: 2.1.1

> refreshByPath should clear all cached plans with the specified path
> ---
>
> Key: SPARK-19736
> URL: https://issues.apache.org/jira/browse/SPARK-19736
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.1, 2.2.0
>
>
> Catalog.refreshByPath can refresh the cache entry and the associated metadata 
> for all DataFrames (if any) that contain the given data source path. 
> However, CacheManager.invalidateCachedPath doesn't clear all cached plans 
> with the specified path, which causes some of the strange behaviors reported in 
> SPARK-15678.
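
As a rough illustration of the intended behavior (the path below is hypothetical, and a live {{spark}} session is assumed):

{code}
// Cache a DataFrame that is backed by a file path.
val df = spark.read.parquet("/tmp/rows")
df.cache()
df.count()

// ... the files under /tmp/rows are rewritten by another job ...

// This should invalidate every cached plan that reads the path,
// not just a single cache entry.
spark.catalog.refreshByPath("/tmp/rows")
{code}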






[jira] [Updated] (SPARK-19093) Cached tables are not used in SubqueryExpression

2017-03-16 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-19093:

Fix Version/s: 2.1.1

> Cached tables are not used in SubqueryExpression
> 
>
> Key: SPARK-19093
> URL: https://issues.apache.org/jira/browse/SPARK-19093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Josh Rosen
>Assignee: Dilip Biswal
> Fix For: 2.1.1, 2.2.0
>
>
> See reproduction at 
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1903098128019500/2699761537338853/1395282846718893/latest.html
> Consider the following:
> {code}
> Seq(("a", "b"), ("c", "d"))
>   .toDS
>   .write
>   .parquet("/tmp/rows")
> val df = spark.read.parquet("/tmp/rows")
> df.cache()
> df.count()
> df.createOrReplaceTempView("rows")
> spark.sql("""
>   select * from rows cross join rows
> """).explain(true)
> spark.sql("""
>   select * from rows where not exists (select * from rows)
> """).explain(true)
> {code}
> In both plans, I'd expect that both sides of the joins would read from the 
> cached table for both the cross join and anti join, but the left anti join 
> produces the following plan which only reads the left side from cache and 
> reads the right side via a regular non-cached scan:
> {code}
> == Parsed Logical Plan ==
> 'Project [*]
> +- 'Filter NOT exists#3994
>:  +- 'Project [*]
>: +- 'UnresolvedRelation `rows`
>+- 'UnresolvedRelation `rows`
> == Analyzed Logical Plan ==
> _1: string, _2: string
> Project [_1#3775, _2#3776]
> +- Filter NOT predicate-subquery#3994 []
>:  +- Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
>: +- Project [_1#3775, _2#3776]
>:+- SubqueryAlias rows
>:   +- Relation[_1#3775,_2#3776] parquet
>+- SubqueryAlias rows
>   +- Relation[_1#3775,_2#3776] parquet
> == Optimized Logical Plan ==
> Join LeftAnti
> :- InMemoryRelation [_1#3775, _2#3776], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> : +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
> +- Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
>+- Relation[_1#3775,_2#3776] parquet
> == Physical Plan ==
> BroadcastNestedLoopJoin BuildRight, LeftAnti
> :- InMemoryTableScan [_1#3775, _2#3776]
> : +- InMemoryRelation [_1#3775, _2#3776], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> :   +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: 
> Parquet, Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
> +- BroadcastExchange IdentityBroadcastMode
>+- *Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
>   +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
> {code}






[jira] [Updated] (SPARK-18549) Failed to Uncache a View that References a Dropped Table.

2017-03-16 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-18549:

Fix Version/s: 2.1.1

> Failed to Uncache a View that References a Dropped Table.
> -
>
> Key: SPARK-18549
> URL: https://issues.apache.org/jira/browse/SPARK-18549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Xiao Li
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.1.1, 2.2.0
>
>
> {code}
>   spark.range(1, 10).toDF("id1").write.format("json").saveAsTable("jt1")
>   spark.range(1, 10).toDF("id2").write.format("json").saveAsTable("jt2")
>   sql("CREATE VIEW testView AS SELECT * FROM jt1 JOIN jt2 ON id1 == id2")
>   // Cache is empty at the beginning
>   assert(spark.sharedState.cacheManager.isEmpty)
>   sql("CACHE TABLE testView")
>   assert(spark.catalog.isCached("testView"))
>   // Cache is not empty
>   assert(!spark.sharedState.cacheManager.isEmpty)
> {code}
> {code}
>   // drop a table referenced by a cached view
>   sql("DROP TABLE jt1")
> // So far everything is fine
>   // Failed to uncache the view
>   val e = intercept[AnalysisException] {
> sql("UNCACHE TABLE testView")
>   }.getMessage
>   assert(e.contains("Table or view not found: `default`.`jt1`"))
>   // We are unable to drop it from the cache
>   assert(!spark.sharedState.cacheManager.isEmpty)
> {code}






[jira] [Updated] (SPARK-19765) UNCACHE TABLE should also un-cache all cached plans that refer to this table

2017-03-16 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-19765:

Fix Version/s: 2.1.1

> UNCACHE TABLE should also un-cache all cached plans that refer to this table
> 
>
> Key: SPARK-19765
> URL: https://issues.apache.org/jira/browse/SPARK-19765
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>  Labels: release_notes
> Fix For: 2.1.1, 2.2.0
>
>
> DropTableCommand, TruncateTableCommand, AlterTableRenameCommand, 
> UncacheTableCommand, RefreshTable and InsertIntoHiveTable will un-cache all 
> the cached plans that refer to this table
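For illustration, a minimal PySpark sketch of the intended behaviour (run in a
PySpark shell; the table and view names {{src_tbl}} and {{v}} are made up, and the
comments describe the post-fix behaviour, not the actual patch):

{code}
# Illustration only: dropping a table should also evict cached plans that
# refer to it.
spark.range(10).write.saveAsTable("src_tbl")
spark.sql("CREATE VIEW v AS SELECT id FROM src_tbl")

spark.sql("CACHE TABLE v")
assert spark.catalog.isCached("v")        # the cached plan for v refers to src_tbl

spark.sql("DROP TABLE src_tbl")           # with this change, dropping src_tbl
                                          # should also remove the cached plan for v
{code}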






[jira] [Commented] (SPARK-19975) Add map_keys and map_values functions to Python

2017-03-16 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929342#comment-15929342
 ] 

Yong Tang commented on SPARK-19975:
---

Created a PR for that:
https://github.com/apache/spark/pull/17328

Please take a look.

> Add map_keys and map_values functions  to Python 
> -
>
> Key: SPARK-19975
> URL: https://issues.apache.org/jira/browse/SPARK-19975
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Maciej Bryński
>
> We have `map_keys` and `map_values` functions in SQL.
> There are no equivalent Python functions for them.
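Until such wrappers land, the SQL functions can already be reached from PySpark
through {{expr()}} / {{selectExpr()}}. A minimal sketch (the column name {{m}} and the
one-row DataFrame are made up for illustration):

{code}
from pyspark.sql.functions import create_map, expr, lit

# Build a one-row DataFrame with a map column, then call the existing SQL
# functions map_keys/map_values via expr(), since no Python wrappers exist yet.
df = spark.range(1).select(create_map(lit("a"), lit(1), lit("b"), lit(2)).alias("m"))
df.select(expr("map_keys(m)").alias("keys"),
          expr("map_values(m)").alias("values")).show()
{code}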






[jira] [Updated] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using

2017-03-16 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19990:
---
Description: 
This test seems to be failing consistently on all of the maven builds: 
https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite_name=create+temporary+view+using
 and is possibly caused by SPARK-19763.

Here's a stack trace for the failure: 

java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path 
in absolute URI: 
jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv
  at org.apache.hadoop.fs.Path.initialize(Path.java:206)
  at org.apache.hadoop.fs.Path.(Path.java:172)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:344)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:343)
  at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:344)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:343)
  at 
org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:91)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
  at org.apache.spark.sql.Dataset.(Dataset.scala:183)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
  at 
org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62)
  at 
org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62)
  at 
org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38$$anonfun$apply$mcV$sp$8.apply$mcV$sp(DDLSuite.scala:705)
  at 
org.apache.spark.sql.test.SQLTestUtils$class.withView(SQLTestUtils.scala:186)
  at 
org.apache.spark.sql.execution.command.DDLSuite.withView(DDLSuite.scala:171)
  at 
org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply$mcV$sp(DDLSuite.scala:704)
  at 
org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701)
  at 
org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701)
  at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
  at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
  at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
  at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
  at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
  at 
org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(HiveDDLSuite.scala:41)
  at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
  at 
org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.runTest(HiveDDLSuite.scala:41)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
  at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
  at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
  at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
  at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
  at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
  at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
  

[jira] [Commented] (SPARK-19988) Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by Hive

2017-03-16 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929334#comment-15929334
 ] 

Kay Ousterhout commented on SPARK-19988:


With some help from [~joshrosen] I spent some time digging into this and found:

(1) if you look at the failures, they're all from the maven build.  In fact, 
100% of the maven builds shown there fail (and none of the SBT ones).  This is 
weird because this is also failing on the PR builder, which uses SBT. 

(2) The maven build failures are all accompanied by 3 other tests; the group of 
4 tests seems to consistently fail together.  3 tests fail with errors similar 
to this one (saying that some database does not exist).  The 4th test, 
org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary 
view using, fails with a more substantive error.  I filed SPARK-19990 for that issue.

(3) A commit right around the time the tests started failing: 
https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9#diff-b7094baa12601424a5d19cb930e3402fR46
 added code to remove all of the databases after each test.  I wonder if that's 
somehow getting run concurrently or asynchronously in the maven build (after 
the HiveCatalogedDDLSuite fails), which is why the error in the DDLSuite 
causes the other tests to fail saying that a database can't be found.  I have 
extremely limited knowledge of both (a) how the maven tests are executed and 
(b) the SQL code so it's possible these are totally unrelated issues.

None of this explains why the test is failing in the PR builder, where the 
failures have been isolated to this test.

> Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column 
> written by Hive
> 
>
> Key: SPARK-19988
> URL: https://issues.apache.org/jira/browse/SPARK-19988
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Imran Rashid
>  Labels: flaky-test
> Attachments: trimmed-unit-test.log
>
>
> "OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by 
> Hive" fails a lot -- right now, I see about a 50% pass rate in the last 3 
> days here:
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.orc.OrcSourceSuite_name=SPARK-19459%2FSPARK-18220%3A+read+char%2Fvarchar+column+written+by+Hive
> eg. 
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74683/testReport/junit/org.apache.spark.sql.hive.orc/OrcSourceSuite/SPARK_19459_SPARK_18220__read_char_varchar_column_written_by_Hive/
> {noformat}
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
> SemanticException [Error 10072]: Database does not exist: db2
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:637)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:621)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:288)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:621)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:611)
>   at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply$mcV$sp(OrcSourceSuite.scala:160)
>   at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
>   at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
> ...
> {noformat}






[jira] [Created] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using

2017-03-16 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-19990:
--

 Summary: Flaky test: 
org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary 
view using
 Key: SPARK-19990
 URL: https://issues.apache.org/jira/browse/SPARK-19990
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 2.2.0
Reporter: Kay Ousterhout


This test seems to be failing consistently on all of the maven builds: 
https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite_name=create+temporary+view+using
 and is possibly caused by SPARK-19763.






[jira] [Resolved] (SPARK-19987) Pass all filters into FileIndex

2017-03-16 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19987.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Pass all filters into FileIndex
> ---
>
> Key: SPARK-19987
> URL: https://issues.apache.org/jira/browse/SPARK-19987
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.2.0
>
>
> This is a tiny teeny refactoring to pass data filters also to the FileIndex, 
> so FileIndex can have a more global view on predicates.






[jira] [Commented] (SPARK-19982) JavaDatasetSuite.testJavaBeanEncoder sometimes fails with "Unable to generate an encoder for inner class"

2017-03-16 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929296#comment-15929296
 ] 

Wenchen Fan commented on SPARK-19982:
-

Yeah, that makes sense; the test harness should hold the {{this}} reference. I have no 
idea why this can go wrong. Maybe we should just move the test class to the 
top level so it's not an inner class anymore.

> JavaDatasetSuite.testJavaBeanEncoder sometimes fails with "Unable to generate 
> an encoder for inner class"
> -
>
> Key: SPARK-19982
> URL: https://issues.apache.org/jira/browse/SPARK-19982
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Jose Soltren
>  Labels: flaky-test
>
> JavaDatasetSuite.testJavaBeanEncoder fails sporadically with the error below:
> Unable to generate an encoder for inner class 
> `test.org.apache.spark.sql.JavaDatasetSuite$SimpleJavaBean` without access to 
> the scope that this class was defined in. Try moving this class out of its 
> parent class.
> From https://spark-tests.appspot.com/test-logs/35475788
> [~vanzin] looked into this back in October and reported:
> I ran this test in a loop (both alone and with the rest of the spark-sql 
> tests) and never got a failure. I even used the same JDK as Jenkins 
> (1.7.0_51).
> Also looked at the code and nothing seems wrong. The error occurs when an entry 
> with the parent class name is missing from the map kept in OuterScopes.scala, 
> but the test populates that map in its first line. So it doesn't look like a 
> race nor some issue with weak references (the map uses weak values).
>   public void testJavaBeanEncoder() {
> OuterScopes.addOuterScope(this);






[jira] [Commented] (SPARK-19982) JavaDatasetSuite.testJavaBeanEncoder sometimes fails with "Unable to generate an encoder for inner class"

2017-03-16 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929275#comment-15929275
 ] 

Michael Armbrust commented on SPARK-19982:
--

I'm not sure if changing weak to strong references will change 
[anything|http://stackoverflow.com/questions/299659/what-is-the-difference-between-a-soft-reference-and-a-weak-reference-in-java].
  It seems like there must be another handle to {{this}} since the test harness 
is actively executing it.  So either way it shouldn't be available for garbage 
collection, or am I missing something?

> JavaDatasetSuite.testJavaBeanEncoder sometimes fails with "Unable to generate 
> an encoder for inner class"
> -
>
> Key: SPARK-19982
> URL: https://issues.apache.org/jira/browse/SPARK-19982
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Jose Soltren
>  Labels: flaky-test
>
> JavaDatasetSuite.testJavaBeanEncoder fails sporadically with the error below:
> Unable to generate an encoder for inner class 
> `test.org.apache.spark.sql.JavaDatasetSuite$SimpleJavaBean` without access to 
> the scope that this class was defined in. Try moving this class out of its 
> parent class.
> From https://spark-tests.appspot.com/test-logs/35475788
> [~vanzin] looked into this back in October and reported:
> I ran this test in a loop (both alone and with the rest of the spark-sql 
> tests) and never got a failure. I even used the same JDK as Jenkins 
> (1.7.0_51).
> Also looked at the code and nothing seems wrong. The error occurs when an entry 
> with the parent class name is missing from the map kept in OuterScopes.scala, 
> but the test populates that map in its first line. So it doesn't look like a 
> race nor some issue with weak references (the map uses weak values).
>   public void testJavaBeanEncoder() {
> OuterScopes.addOuterScope(this);






[jira] [Commented] (SPARK-18789) Save Data frame with Null column-- exception

2017-03-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929272#comment-15929272
 ] 

Hyukjin Kwon commented on SPARK-18789:
--

Do you mind if I ask for a simple code example of this? Pseudocode is fine. (I am just 
trying to verify this.)

> Save Data frame with Null column-- exception
> 
>
> Key: SPARK-18789
> URL: https://issues.apache.org/jira/browse/SPARK-18789
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
>Reporter: Harish
>
> I am trying to save a DF to HDFS in which one column is NULL (no data).
> col1 col2 col3
> a    1    null
> b    1    null
> c    1    null
> d    1    null
> code :  df.write.format("orc").save(path, mode='overwrite')
> Error:
>   java.lang.IllegalArgumentException: Error: type expected at the position 49 
> of 'string:string:string:double:string:double:string:null' but 'null' is 
> found.
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104)
>   at 
> org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182)
>   at 
> org.apache.spark.sql.hive.orc.OrcOutputWriter.(OrcFileFormat.scala:225)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94)
>   at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/12/08 19:41:49 ERROR TaskSetManager: Task 17 in stage 512.0 failed 4 
> times; aborting job
> 16/12/08 19:41:49 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in 
> stage 512.0 failed 4 times, most recent failure: Lost task 17.3 in stage 
> 512.0 (TID 37290, 10.63.136.108): java.lang.IllegalArgumentException: Error: 
> type expected at the position 49 of 
> 'string:string:string:double:string:double:string:null' but 'null' is found.
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104)
>   at 
> org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182)
>   at 
> org.apache.spark.sql.hive.orc.OrcOutputWriter.(OrcFileFormat.scala:225)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94)
>   at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> 

[jira] [Commented] (SPARK-19969) Doc and examples for Imputer

2017-03-16 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929271#comment-15929271
 ] 

yuhao yang commented on SPARK-19969:


Looks like JIRA has stopped auto-linking PRs.
https://github.com/apache/spark/pull/17324

> Doc and examples for Imputer
> 
>
> Key: SPARK-19969
> URL: https://issues.apache.org/jira/browse/SPARK-19969
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>







[jira] [Commented] (SPARK-18789) Save Data frame with Null column-- exception

2017-03-16 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929259#comment-15929259
 ] 

Harish commented on SPARK-18789:


When you create the DF dynamically, without knowing the type of the column, you 
can't define the schema. In my case I do not know the type of the column in 
advance. When you don't define the column type and the entire column is None, I 
get this error message. I hope I am clear.
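A possible workaround sketch in PySpark (the column names match the example
above; the output path and the choice of string for the all-None column are
made up for illustration):

{code}
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Give the all-None column an explicit type so inference never produces a null
# type and the ORC writer receives a concrete schema.
schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", IntegerType(), True),
    StructField("col3", StringType(), True),   # entirely None in the data
])
data = [("a", 1, None), ("b", 1, None), ("c", 1, None), ("d", 1, None)]
df = spark.createDataFrame(data, schema)
df.write.format("orc").mode("overwrite").save("/tmp/null_col_orc")
{code}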

> Save Data frame with Null column-- exception
> 
>
> Key: SPARK-18789
> URL: https://issues.apache.org/jira/browse/SPARK-18789
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
>Reporter: Harish
>
> I am trying to save a DF to HDFS in which one column is NULL (no data).
> col1 col2 col3
> a    1    null
> b    1    null
> c    1    null
> d    1    null
> code :  df.write.format("orc").save(path, mode='overwrite')
> Error:
>   java.lang.IllegalArgumentException: Error: type expected at the position 49 
> of 'string:string:string:double:string:double:string:null' but 'null' is 
> found.
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104)
>   at 
> org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182)
>   at 
> org.apache.spark.sql.hive.orc.OrcOutputWriter.(OrcFileFormat.scala:225)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94)
>   at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/12/08 19:41:49 ERROR TaskSetManager: Task 17 in stage 512.0 failed 4 
> times; aborting job
> 16/12/08 19:41:49 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in 
> stage 512.0 failed 4 times, most recent failure: Lost task 17.3 in stage 
> 512.0 (TID 37290, 10.63.136.108): java.lang.IllegalArgumentException: Error: 
> type expected at the position 49 of 
> 'string:string:string:double:string:double:string:null' but 'null' is found.
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104)
>   at 
> org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182)
>   at 
> org.apache.spark.sql.hive.orc.OrcOutputWriter.(OrcFileFormat.scala:225)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94)
>   at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
>   at 
> 

[jira] [Updated] (SPARK-19964) Flaky test: SparkSubmitSuite fails due to Timeout

2017-03-16 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19964:
---
Summary: Flaky test: SparkSubmitSuite fails due to Timeout  (was: 
SparkSubmitSuite fails due to Timeout)

> Flaky test: SparkSubmitSuite fails due to Timeout
> -
>
> Key: SPARK-19964
> URL: https://issues.apache.org/jira/browse/SPARK-19964
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Tests
>Affects Versions: 2.2.0
>Reporter: Eren Avsarogullari
>  Labels: flaky-test
> Attachments: SparkSubmitSuite_Stacktrace
>
>
> The following test case has been failed due to TestFailedDueToTimeoutException
> *Test Suite:* SparkSubmitSuite
> *Test Case:* includes jars passed in through --packages
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74413/testReport/
> *Stacktrace is also attached.*






[jira] [Comment Edited] (SPARK-19964) Flaky test: SparkSubmitSuite fails due to Timeout

2017-03-16 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929257#comment-15929257
 ] 

Kay Ousterhout edited comment on SPARK-19964 at 3/17/17 12:54 AM:
--

[~srowen] it looks like this is failing periodically in master: 
https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.deploy.SparkSubmitSuite_name=includes+jars+passed+in+through+--jars
 (I added "Flaky" to the name, which I suspect is the source of the confusion)



was (Author: kayousterhout):
[~srowen] it looks like this is failing periodically in master: 
https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.deploy.SparkSubmitSuite_name=includes+jars+passed+in+through+--jars


> Flaky test: SparkSubmitSuite fails due to Timeout
> -
>
> Key: SPARK-19964
> URL: https://issues.apache.org/jira/browse/SPARK-19964
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Tests
>Affects Versions: 2.2.0
>Reporter: Eren Avsarogullari
>  Labels: flaky-test
> Attachments: SparkSubmitSuite_Stacktrace
>
>
> The following test case has been failed due to TestFailedDueToTimeoutException
> *Test Suite:* SparkSubmitSuite
> *Test Case:* includes jars passed in through --packages
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74413/testReport/
> *Stacktrace is also attached.*






[jira] [Commented] (SPARK-19964) SparkSubmitSuite fails due to Timeout

2017-03-16 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929257#comment-15929257
 ] 

Kay Ousterhout commented on SPARK-19964:


[~srowen] it looks like this is failing periodically in master: 
https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.deploy.SparkSubmitSuite_name=includes+jars+passed+in+through+--jars


> SparkSubmitSuite fails due to Timeout
> -
>
> Key: SPARK-19964
> URL: https://issues.apache.org/jira/browse/SPARK-19964
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Tests
>Affects Versions: 2.2.0
>Reporter: Eren Avsarogullari
>  Labels: flaky-test
> Attachments: SparkSubmitSuite_Stacktrace
>
>
> The following test case has been failed due to TestFailedDueToTimeoutException
> *Test Suite:* SparkSubmitSuite
> *Test Case:* includes jars passed in through --packages
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74413/testReport/
> *Stacktrace is also attached.*






[jira] [Commented] (SPARK-19982) JavaDatasetSuite.testJavaBeanEncoder sometimes fails with "Unable to generate an encoder for inner class"

2017-03-16 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929250#comment-15929250
 ] 

Wenchen Fan commented on SPARK-19982:
-

I think this is caused by weak references: a GC may happen right after users 
call `OuterScopes.addOuterScope(this);`. Shall we use a soft reference instead? cc 
[~marmbrus]
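The hypothesised failure mode, illustrated with plain Python weak references
(this is not the JVM code, just the general mechanism of a weak-valued map
losing its entry once the only strong reference is gone):

{code}
import gc
import weakref

class Outer(object):
    pass

# Analogous to a map with weak values, as in OuterScopes.
registry = weakref.WeakValueDictionary()

def register():
    o = Outer()
    registry["outer"] = o
    # the only strong reference to o is dropped when this function returns

register()
gc.collect()
print("outer" in registry)   # False: the entry vanished after collection
{code}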

> JavaDatasetSuite.testJavaBeanEncoder sometimes fails with "Unable to generate 
> an encoder for inner class"
> -
>
> Key: SPARK-19982
> URL: https://issues.apache.org/jira/browse/SPARK-19982
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Jose Soltren
>  Labels: flaky-test
>
> JavaDatasetSuite.testJavaBeanEncoder fails sporadically with the error below:
> Unable to generate an encoder for inner class 
> `test.org.apache.spark.sql.JavaDatasetSuite$SimpleJavaBean` without access to 
> the scope that this class was defined in. Try moving this class out of its 
> parent class.
> From https://spark-tests.appspot.com/test-logs/35475788
> [~vanzin] looked into this back in October and reported:
> I ran this test in a loop (both alone and with the rest of the spark-sql 
> tests) and never got a failure. I even used the same JDK as Jenkins 
> (1.7.0_51).
> Also looked at the code and nothing seems wrong. The error occurs when an entry 
> with the parent class name is missing from the map kept in OuterScopes.scala, 
> but the test populates that map in its first line. So it doesn't look like a 
> race nor some issue with weak references (the map uses weak values).
>   public void testJavaBeanEncoder() {
> OuterScopes.addOuterScope(this);






[jira] [Resolved] (SPARK-19635) Feature parity for Chi-square hypothesis testing in MLlib

2017-03-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-19635.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17110
[https://github.com/apache/spark/pull/17110]

> Feature parity for Chi-square hypothesis testing in MLlib
> -
>
> Key: SPARK-19635
> URL: https://issues.apache.org/jira/browse/SPARK-19635
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Joseph K. Bradley
> Fix For: 2.2.0
>
>
> This ticket tracks porting the functionality of 
> spark.mllib.Statistics.chiSqTest over to spark.ml.
> Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#
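For reference, a minimal PySpark sketch of the RDD-based API being ported (the
new spark.ml counterpart added by the linked PR is not shown, to avoid guessing
its exact signature; the observed counts are made up):

{code}
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.stat import Statistics

# Goodness-of-fit chi-square test of an observed frequency vector against a
# uniform expected distribution (the default when no expected vector is given).
observed = Vectors.dense([4.0, 6.0, 5.0])
result = Statistics.chiSqTest(observed)
print(result.statistic, result.pValue, result.degreesOfFreedom)
{code}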






[jira] [Updated] (SPARK-19988) Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by Hive

2017-03-16 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-19988:
---
Component/s: SQL

> Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column 
> written by Hive
> 
>
> Key: SPARK-19988
> URL: https://issues.apache.org/jira/browse/SPARK-19988
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Imran Rashid
>  Labels: flaky-test
> Attachments: trimmed-unit-test.log
>
>
> "OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by 
> Hive" fails a lot -- right now, I see about a 50% pass rate in the last 3 
> days here:
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.orc.OrcSourceSuite_name=SPARK-19459%2FSPARK-18220%3A+read+char%2Fvarchar+column+written+by+Hive
> eg. 
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74683/testReport/junit/org.apache.spark.sql.hive.orc/OrcSourceSuite/SPARK_19459_SPARK_18220__read_char_varchar_column_written_by_Hive/
> {noformat}
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
> SemanticException [Error 10072]: Database does not exist: db2
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:637)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:621)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:288)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:621)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:611)
>   at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply$mcV$sp(OrcSourceSuite.scala:160)
>   at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
>   at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
> ...
> {noformat}






[jira] [Created] (SPARK-19989) Flaky Test: org.apache.spark.sql.kafka010.KafkaSourceStressSuite

2017-03-16 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-19989:
--

 Summary: Flaky Test: 
org.apache.spark.sql.kafka010.KafkaSourceStressSuite
 Key: SPARK-19989
 URL: https://issues.apache.org/jira/browse/SPARK-19989
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 2.2.0
Reporter: Kay Ousterhout
Priority: Minor


This test failed recently here: 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74683/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceStressSuite/stress_test_with_multiple_topics_and_partitions/

And based on Josh's dashboard 
(https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaSourceStressSuite_name=stress+test+with+multiple+topics+and+partitions),
 it seems to fail a few times every month.  Here's the full error from the most 
recent failure:

Error Message

org.scalatest.exceptions.TestFailedException:  Error adding data: replication 
factor: 1 larger than available brokers: 0 
kafka.admin.AdminUtils$.assignReplicasToBrokers(AdminUtils.scala:117)  
kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:403)  
org.apache.spark.sql.kafka010.KafkaTestUtils.createTopic(KafkaTestUtils.scala:173)
  
org.apache.spark.sql.kafka010.KafkaSourceStressSuite$$anonfun$16$$anonfun$apply$mcV$sp$17$$anonfun$37.apply(KafkaSourceSuite.scala:903)
  
org.apache.spark.sql.kafka010.KafkaSourceStressSuite$$anonfun$16$$anonfun$apply$mcV$sp$17$$anonfun$37.apply(KafkaSourceSuite.scala:901)
  
org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData$$anonfun$addData$1.apply(KafkaSourceSuite.scala:93)
  
org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData$$anonfun$addData$1.apply(KafkaSourceSuite.scala:92)
  scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316)  
org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData.addData(KafkaSourceSuite.scala:92)
  
org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:494)
   == Progress ==AssertOnQuery(, )CheckAnswer: 
StopStream
StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@5d888be0,Map()) 
   AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), data 
= Range(0, 1, 2, 3, 4, 5, 6, 7, 8), message = )CheckAnswer: 
[1],[2],[3],[4],[5],[6],[7],[8],[9]StopStream
StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@1be724ee,Map()) 
   AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), data 
= Range(9, 10, 11, 12, 13, 14), message = )CheckAnswer: 
[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12],[13],[14],[15]StopStream 
   AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), data 
= Range(), message = ) => AddKafkaData(topics = Set(stress4, stress6, stress2, 
stress1, stress5, stress3), data = Range(15), message = Add topic stress7)
AddKafkaData(topics = Set(stress4, stress6, stress2, stress1, stress5, 
stress3), data = Range(16, 17, 18, 19, 20, 21, 22), message = Add partition)
AddKafkaData(topics = Set(stress4, stress6, stress2, stress1, stress5, 
stress3), data = Range(23, 24), message = Add partition)AddKafkaData(topics 
= Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = 
Range(), message = Add topic stress9)AddKafkaData(topics = Set(stress4, 
stress6, stress2, stress8, stress1, stress5, stress3), data = Range(25, 26, 27, 
28, 29, 30, 31, 32, 33), message = )AddKafkaData(topics = Set(stress4, 
stress6, stress2, stress8, stress1, stress5, stress3), data = Range(), message 
= )AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, 
stress5, stress3), data = Range(), message = )AddKafkaData(topics = 
Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = 
Range(34, 35, 36, 37, 38, 39), message = )AddKafkaData(topics = 
Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = 
Range(40, 41, 42, 43), message = )AddKafkaData(topics = Set(stress4, 
stress6, stress2, stress8, stress1, stress5, stress3), data = Range(44), 
message = Add partition)AddKafkaData(topics = Set(stress4, stress6, 
stress2, stress8, stress1, stress5, stress3), data = Range(45, 46, 47, 48, 49, 
50, 51, 52), message = Add partition)AddKafkaData(topics = Set(stress4, 
stress6, stress2, stress8, stress1, stress5, stress3), data = Range(53, 54, 
55), message = )AddKafkaData(topics = Set(stress4, stress6, stress2, 
stress8, stress1, stress5, stress3), data = Range(56, 57, 58, 59, 60, 61, 62, 
63), message = Add partition)AddKafkaData(topics = Set(stress4, stress6, 
stress2, stress8, stress1, stress5, stress3), data = Range(64, 65, 66, 67, 68, 
69, 70), message = )
StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@65068637,Map()) 
   AddKafkaData(topics = Set(stress4, stress6, stress2, 

[jira] [Commented] (SPARK-19988) Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by Hive

2017-03-16 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929107#comment-15929107
 ] 

Herman van Hovell commented on SPARK-19988:
---

It is probably some other test changing the current database to {{db2}}. This 
is super annoying to debug, and the only solution I see is that we fix the 
database names in the test.
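One way to make that concrete (a rough PySpark-flavoured sketch of the idea,
not the actual Scala test code): give the suite its own uniquely named database
so no other test can collide with it or switch away from it:

{code}
import uuid

# Use a database name no other suite can plausibly create or switch to.
db = "orcsourcesuite_" + uuid.uuid4().hex
spark.sql("CREATE DATABASE {}".format(db))
try:
    spark.sql("USE {}".format(db))
    # ... run the char/varchar round-trip against tables in this database ...
finally:
    spark.sql("USE default")
    spark.sql("DROP DATABASE {} CASCADE".format(db))
{code}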

> Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column 
> written by Hive
> 
>
> Key: SPARK-19988
> URL: https://issues.apache.org/jira/browse/SPARK-19988
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Imran Rashid
>  Labels: flaky-test
> Attachments: trimmed-unit-test.log
>
>
> "OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by 
> Hive" fails a lot -- right now, I see about a 50% pass rate in the last 3 
> days here:
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.orc.OrcSourceSuite_name=SPARK-19459%2FSPARK-18220%3A+read+char%2Fvarchar+column+written+by+Hive
> eg. 
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74683/testReport/junit/org.apache.spark.sql.hive.orc/OrcSourceSuite/SPARK_19459_SPARK_18220__read_char_varchar_column_written_by_Hive/
> {noformat}
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
> SemanticException [Error 10072]: Database does not exist: db2
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:637)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:621)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:288)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:621)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:611)
>   at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply$mcV$sp(OrcSourceSuite.scala:160)
>   at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
>   at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
> ...
> {noformat}






[jira] [Commented] (SPARK-18789) Save Data frame with Null column-- exception

2017-03-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929084#comment-15929084
 ] 

Hyukjin Kwon commented on SPARK-18789:
--

It seems it fails during schema inference.

{code}
>>> data = [
... ["a", 1, None],
... ["b", 1, None],
... ["c", 1, None],
... ["d", 1, None],
... ]
>>> df = spark.createDataFrame(data)
Traceback (most recent call last):
  File "", line 1, in 
  File ".../spark/python/pyspark/sql/session.py", line 526, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File ".../spark/python/pyspark/sql/session.py", line 390, in _createFromLocal
struct = self._inferSchemaFromList(data)
  File ".../spark/python/pyspark/sql/session.py", line 324, in 
_inferSchemaFromList
raise ValueError("Some of types cannot be determined after inferring")
ValueError: Some of types cannot be determined after inferring
{code}

That's why I specified the schema. Did I maybe misunderstand your comment?

> Save Data frame with Null column-- exception
> 
>
> Key: SPARK-18789
> URL: https://issues.apache.org/jira/browse/SPARK-18789
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
>Reporter: Harish
>
> I am trying to save a DF to HDFS in which one column is NULL (no data).
> col1 col2 col3
> a    1    null
> b    1    null
> c    1    null
> d    1    null
> code :  df.write.format("orc").save(path, mode='overwrite')
> Error:
>   java.lang.IllegalArgumentException: Error: type expected at the position 49 
> of 'string:string:string:double:string:double:string:null' but 'null' is 
> found.
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104)
>   at 
> org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182)
>   at 
> org.apache.spark.sql.hive.orc.OrcOutputWriter.(OrcFileFormat.scala:225)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94)
>   at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/12/08 19:41:49 ERROR TaskSetManager: Task 17 in stage 512.0 failed 4 
> times; aborting job
> 16/12/08 19:41:49 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in 
> stage 512.0 failed 4 times, most recent failure: Lost task 17.3 in stage 
> 512.0 (TID 37290, 10.63.136.108): java.lang.IllegalArgumentException: Error: 
> type expected at the position 49 of 
> 'string:string:string:double:string:double:string:null' but 'null' is found.
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765)
>   at 
> 

[jira] [Updated] (SPARK-19988) Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by Hive

2017-03-16 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-19988:
-
Attachment: trimmed-unit-test.log

Attaching a trimmed version of the unit-test.log file, though nothing looks 
particularly notable in it to me.

Also, I tried running this test on my laptop in a loop, and it passed 100 times 
in a row.

> Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column 
> written by Hive
> 
>
> Key: SPARK-19988
> URL: https://issues.apache.org/jira/browse/SPARK-19988
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Imran Rashid
>  Labels: flaky-test
> Attachments: trimmed-unit-test.log
>
>
> "OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by 
> Hive" fails a lot -- right now, I see about a 50% pass rate in the last 3 
> days here:
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.orc.OrcSourceSuite_name=SPARK-19459%2FSPARK-18220%3A+read+char%2Fvarchar+column+written+by+Hive
> eg. 
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74683/testReport/junit/org.apache.spark.sql.hive.orc/OrcSourceSuite/SPARK_19459_SPARK_18220__read_char_varchar_column_written_by_Hive/
> {noformat}
> sbt.ForkMain$ForkError: 
> org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
> SemanticException [Error 10072]: Database does not exist: db2
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:637)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:621)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:288)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:621)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:611)
>   at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply$mcV$sp(OrcSourceSuite.scala:160)
>   at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
>   at 
> org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
> ...
> {noformat}






[jira] [Created] (SPARK-19988) Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by Hive

2017-03-16 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-19988:


 Summary: Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read 
char/varchar column written by Hive
 Key: SPARK-19988
 URL: https://issues.apache.org/jira/browse/SPARK-19988
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 2.2.0
Reporter: Imran Rashid


"OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by 
Hive" fails a lot -- right now, I see about a 50% pass rate in the last 3 days 
here:

https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.orc.OrcSourceSuite_name=SPARK-19459%2FSPARK-18220%3A+read+char%2Fvarchar+column+written+by+Hive

eg. 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74683/testReport/junit/org.apache.spark.sql.hive.orc/OrcSourceSuite/SPARK_19459_SPARK_18220__read_char_varchar_column_written_by_Hive/

{noformat}
sbt.ForkMain$ForkError: org.apache.spark.sql.execution.QueryExecutionException: 
FAILED: SemanticException [Error 10072]: Database does not exist: db2
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:637)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:621)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:288)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:621)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:611)
at 
org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply$mcV$sp(OrcSourceSuite.scala:160)
at 
org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
at 
org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
...
{noformat}






[jira] [Updated] (SPARK-19987) Pass all filters into FileIndex

2017-03-16 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-19987:

Description: 
This is a tiny teeny refactoring to pass data filters also to the FileIndex, so 
FileIndex can have a more global view on predicates.


  was:This is a tiny teeny refactoring to pass data filters also to the 
FileIndex.


> Pass all filters into FileIndex
> ---
>
> Key: SPARK-19987
> URL: https://issues.apache.org/jira/browse/SPARK-19987
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This is a tiny teeny refactoring to pass data filters also to the FileIndex, 
> so FileIndex can have a more global view on predicates.






[jira] [Updated] (SPARK-19982) JavaDatasetSuite.testJavaBeanEncoder sometimes fails with "Unable to generate an encoder for inner class"

2017-03-16 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-19982:
-
Labels: flaky-test  (was: )

> JavaDatasetSuite.testJavaBeanEncoder sometimes fails with "Unable to generate 
> an encoder for inner class"
> -
>
> Key: SPARK-19982
> URL: https://issues.apache.org/jira/browse/SPARK-19982
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Jose Soltren
>  Labels: flaky-test
>
> JavaDatasetSuite.testJavaBeanEncoder fails sporadically with the error below:
> Unable to generate an encoder for inner class 
> `test.org.apache.spark.sql.JavaDatasetSuite$SimpleJavaBean` without access to 
> the scope that this class was defined in. Try moving this class out of its 
> parent class.
> From https://spark-tests.appspot.com/test-logs/35475788
> [~vanzin] looked into this back in October and reported:
> I ran this test in a loop (both alone and with the rest of the spark-sql 
> tests) and never got a failure. I even used the same JDK as Jenkins 
> (1.7.0_51).
> Also looked at the code and nothing seems wrong. The error occurs when an entry 
> with the parent class name is missing from the map kept in OuterScopes.scala, 
> but the test populates that map in its first line. So it doesn't look like a 
> race nor some issue with weak references (the map uses weak values).
>   public void testJavaBeanEncoder() {
> OuterScopes.addOuterScope(this);



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19987) Pass all filters into FileIndex

2017-03-16 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-19987:
---

 Summary: Pass all filters into FileIndex
 Key: SPARK-19987
 URL: https://issues.apache.org/jira/browse/SPARK-19987
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Reynold Xin
Assignee: Reynold Xin


This is a tiny teeny refactoring to pass data filters also to the FileIndex.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel

2017-03-16 Thread Drew Robb (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928928#comment-15928928
 ] 

Drew Robb commented on SPARK-12664:
---

This feature is also very important to me. I'm considering working on it myself 
and will post here if I begin that.

> Expose raw prediction scores in MultilayerPerceptronClassificationModel
> ---
>
> Key: SPARK-12664
> URL: https://issues.apache.org/jira/browse/SPARK-12664
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Robert Dodier
>Assignee: Yanbo Liang
>
> In 
> org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, 
> there isn't any way to get raw prediction scores; only an integer output 
> (from 0 to #classes - 1) is available via the `predict` method. 
> `mlpModel.predict` is called within the class to get the raw score, but 
> `mlpModel` is private so that isn't available to outside callers.
> The raw score is useful when the user wants to interpret the classifier 
> output as a probability. 
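
For comparison, a sketch of the kind of output being asked for, shown with LogisticRegression, which already exposes raw scores and probabilities; getting the same columns from the MLP model is exactly what this ticket requests (the toy data below is invented):

{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object RawScoresExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("raw-scores-example").getOrCreate()
    import spark.implicits._

    val training = Seq(
      (0.0, Vectors.dense(0.0, 1.1)),
      (1.0, Vectors.dense(2.0, 1.0)),
      (0.0, Vectors.dense(0.1, 1.2)),
      (1.0, Vectors.dense(2.2, 0.9))
    ).toDF("label", "features")

    val model = new LogisticRegression().fit(training)

    // rawPrediction and probability are the columns MLP users currently cannot get.
    model.transform(training)
      .select("features", "rawPrediction", "probability", "prediction")
      .show(truncate = false)
  }
}
{code}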



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18789) Save Data frame with Null column-- exception

2017-03-16 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928926#comment-15928926
 ] 

Harish commented on SPARK-18789:


In your example you are defining the schema first and then loading the data, 
which works. Try creating the DF without defining the schema (column types).
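
For what it's worth, here is a sketch of one way a column with no declared type (NullType) can arise on the Scala side, which may reproduce the same ORC error; the original report is PySpark, so this is an approximation rather than the exact repro, and the output path is just an example:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object NullColumnRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("null-column-repro").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 1), ("c", 1), ("d", 1))
      .toDF("col1", "col2")
      .withColumn("col3", lit(null)) // lit(null) carries NullType, i.e. no column type

    // Writing a NullType column through the Hive ORC serde is where the
    // reported stack trace points.
    df.write.format("orc").mode("overwrite").save("/tmp/null-column-repro")
  }
}
{code}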

> Save Data frame with Null column-- exception
> 
>
> Key: SPARK-18789
> URL: https://issues.apache.org/jira/browse/SPARK-18789
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
>Reporter: Harish
>
> I am trying to save a DF to HDFS which is having 1 column is NULL(no data).
> col1 col2 col3
> a   1 null
> b   1 null
> c1null
> d   1 null
> code :  df.write.format("orc").save(path, mode='overwrite')
> Error:
>   java.lang.IllegalArgumentException: Error: type expected at the position 49 
> of 'string:string:string:double:string:double:string:null' but 'null' is 
> found.
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104)
>   at 
> org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182)
>   at 
> org.apache.spark.sql.hive.orc.OrcOutputWriter.(OrcFileFormat.scala:225)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94)
>   at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/12/08 19:41:49 ERROR TaskSetManager: Task 17 in stage 512.0 failed 4 
> times; aborting job
> 16/12/08 19:41:49 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in 
> stage 512.0 failed 4 times, most recent failure: Lost task 17.3 in stage 
> 512.0 (TID 37290, 10.63.136.108): java.lang.IllegalArgumentException: Error: 
> type expected at the position 49 of 
> 'string:string:string:double:string:double:string:null' but 'null' is found.
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104)
>   at 
> org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182)
>   at 
> org.apache.spark.sql.hive.orc.OrcOutputWriter.(OrcFileFormat.scala:225)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94)
>   at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)

[jira] [Updated] (SPARK-19985) Some ML Models error when copy or do not set parent

2017-03-16 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-19985:
-
Description: 
Some ML Models fail when copied due to not having a default constructor and 
implementing {{copy}} with {{defaultCopy}}.  Other cases do not properly set 
the parent when the model is copied.  These models were missing the normal 
check that tests for these in the test suites.

Models with issues are:

* RFormulaModel
* MultilayerPerceptronClassificationModel
* BucketedRandomProjectionLSHModel
* MinHashLSH

  was:Some ML Models fail when copied due to not having a default constructor 
and implementing {{copy}} with {{defaultCopy}}.  Other cases do not properly 
set the parent when the model is copied.  These models were missing the normal 
check that tests for these in the test suites.
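
A sketch of the copy pattern such fixes usually need (ToyModel and its members are invented for illustration; it is not one of the models listed above):

{code}
import org.apache.spark.ml.Model
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

// A model whose constructor takes an extra argument cannot rely on the
// reflective defaultCopy; copy() has to rebuild the instance explicitly and
// keep the parent Estimator reference.
class ToyModel(override val uid: String, val scale: Double) extends Model[ToyModel] {

  def this(scale: Double) = this(Identifiable.randomUID("toyModel"), scale)

  override def copy(extra: ParamMap): ToyModel = {
    val copied = new ToyModel(uid, scale)       // rebuild with the same arguments
    copyValues(copied, extra).setParent(parent) // forgetting setParent is the other bug class here
  }

  override def transform(dataset: Dataset[_]): DataFrame = dataset.toDF()

  override def transformSchema(schema: StructType): StructType = schema
}
{code}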


> Some ML Models error when copy or do not set parent
> ---
>
> Key: SPARK-19985
> URL: https://issues.apache.org/jira/browse/SPARK-19985
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Bryan Cutler
>
> Some ML Models fail when copied due to not having a default constructor and 
> implementing {{copy}} with {{defaultCopy}}.  Other cases do not properly set 
> the parent when the model is copied.  These models were missing the normal 
> check that tests for these in the test suites.
> Models with issues are:
> * RFormulaModel
> * MultilayerPerceptronClassificationModel
> * BucketedRandomProjectionLSHModel
> * MinHashLSH



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19986) Make pyspark.streaming.tests.CheckpointTests more stable

2017-03-16 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-19986:


 Summary: Make pyspark.streaming.tests.CheckpointTests more stable
 Key: SPARK-19986
 URL: https://issues.apache.org/jira/browse/SPARK-19986
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 2.1.0
Reporter: Shixiong Zhu


Sometimes, CheckpointTests will hang because the streaming jobs are too slow 
and cannot catch up.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19985) Some ML Models error when copy or do not set parent

2017-03-16 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928924#comment-15928924
 ] 

Bryan Cutler commented on SPARK-19985:
--

I'll fix this

> Some ML Models error when copy or do not set parent
> ---
>
> Key: SPARK-19985
> URL: https://issues.apache.org/jira/browse/SPARK-19985
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Bryan Cutler
>
> Some ML Models fail when copied due to not having a default constructor and 
> implementing {{copy}} with {{defaultCopy}}.  Other cases do not properly set 
> the parent when the model is copied.  These models were missing the normal 
> check that tests for these in the test suites.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19985) Some ML Models error when copy or do not set parent

2017-03-16 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-19985:


 Summary: Some ML Models error when copy or do not set parent
 Key: SPARK-19985
 URL: https://issues.apache.org/jira/browse/SPARK-19985
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.2.0
Reporter: Bryan Cutler


Some ML Models fail when copied due to not having a default constructor and 
implementing {{copy}} with {{defaultCopy}}.  Other cases do not properly set 
the parent when the model is copied.  These models were missing the normal 
check that tests for these in the test suites.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19984) ERROR codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'

2017-03-16 Thread Andrey Yakovenko (JIRA)
Andrey Yakovenko created SPARK-19984:


 Summary: ERROR codegen.CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java'
 Key: SPARK-19984
 URL: https://issues.apache.org/jira/browse/SPARK-19984
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 2.1.0
Reporter: Andrey Yakovenko


I have hit this error a few times on my local Hadoop 2.7.3 + Spark 2.1.0 environment. 
It is not a permanent error; the next time I run the job it may disappear. 
Unfortunately I don't know how to reproduce the issue. As you can see from the log, 
my logic is pretty complicated.
Here is a part of the log I've got (container_1489514660953_0015_01_01):
{code}
17/03/16 11:07:04 ERROR codegen.CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
151, Column 29: A method named "compare" is not declared in any enclosing class 
nor any supertype, nor through a static import
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
/* 004 */
/* 005 */ final class GeneratedIterator extends 
org.apache.spark.sql.execution.BufferedRowIterator {
/* 006 */   private Object[] references;
/* 007 */   private scala.collection.Iterator[] inputs;
/* 008 */   private boolean agg_initAgg;
/* 009 */   private boolean agg_bufIsNull;
/* 010 */   private long agg_bufValue;
/* 011 */   private boolean agg_initAgg1;
/* 012 */   private boolean agg_bufIsNull1;
/* 013 */   private long agg_bufValue1;
/* 014 */   private scala.collection.Iterator smj_leftInput;
/* 015 */   private scala.collection.Iterator smj_rightInput;
/* 016 */   private InternalRow smj_leftRow;
/* 017 */   private InternalRow smj_rightRow;
/* 018 */   private UTF8String smj_value2;
/* 019 */   private java.util.ArrayList smj_matches;
/* 020 */   private UTF8String smj_value3;
/* 021 */   private UTF8String smj_value4;
/* 022 */   private org.apache.spark.sql.execution.metric.SQLMetric 
smj_numOutputRows;
/* 023 */   private UnsafeRow smj_result;
/* 024 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder smj_holder;
/* 025 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter smj_rowWriter;
/* 026 */   private org.apache.spark.sql.execution.metric.SQLMetric 
agg_numOutputRows;
/* 027 */   private org.apache.spark.sql.execution.metric.SQLMetric agg_aggTime;
/* 028 */   private UnsafeRow agg_result;
/* 029 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder;
/* 030 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter agg_rowWriter;
/* 031 */   private org.apache.spark.sql.execution.metric.SQLMetric 
agg_numOutputRows1;
/* 032 */   private org.apache.spark.sql.execution.metric.SQLMetric 
agg_aggTime1;
/* 033 */   private UnsafeRow agg_result1;
/* 034 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder1;
/* 035 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
agg_rowWriter1;
/* 036 */
/* 037 */   public GeneratedIterator(Object[] references) {
/* 038 */ this.references = references;
/* 039 */   }
/* 040 */
/* 041 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 042 */ partitionIndex = index;
/* 043 */ this.inputs = inputs;
/* 044 */ wholestagecodegen_init_0();
/* 045 */ wholestagecodegen_init_1();
/* 046 */
/* 047 */   }
/* 048 */
/* 049 */   private void wholestagecodegen_init_0() {
/* 050 */ agg_initAgg = false;
/* 051 */
/* 052 */ agg_initAgg1 = false;
/* 053 */
/* 054 */ smj_leftInput = inputs[0];
/* 055 */ smj_rightInput = inputs[1];
/* 056 */
/* 057 */ smj_rightRow = null;
/* 058 */
/* 059 */ smj_matches = new java.util.ArrayList();
/* 060 */
/* 061 */ this.smj_numOutputRows = 
(org.apache.spark.sql.execution.metric.SQLMetric) references[0];
/* 062 */ smj_result = new UnsafeRow(2);
/* 063 */ this.smj_holder = new 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(smj_result, 64);
/* 064 */ this.smj_rowWriter = new 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(smj_holder, 
2);
/* 065 */ this.agg_numOutputRows = 
(org.apache.spark.sql.execution.metric.SQLMetric) references[1];
/* 066 */ this.agg_aggTime = 
(org.apache.spark.sql.execution.metric.SQLMetric) references[2];
/* 067 */ agg_result = new UnsafeRow(1);
/* 068 */ this.agg_holder = new 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(agg_result, 0);
/* 069 */ this.agg_rowWriter = new 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(agg_holder, 
1);
/* 070 */ this.agg_numOutputRows1 = 
(org.apache.spark.sql.execution.metric.SQLMetric) references[3];
/* 071 */ this.agg_aggTime1 = 

[jira] [Commented] (SPARK-19969) Doc and examples for Imputer

2017-03-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928854#comment-15928854
 ] 

Nick Pentreath commented on SPARK-19969:


Ok - I can help on it but probably only some time next week.

> Doc and examples for Imputer
> 
>
> Key: SPARK-19969
> URL: https://issues.apache.org/jira/browse/SPARK-19969
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19899) FPGrowth input column naming

2017-03-16 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928827#comment-15928827
 ] 

Maciej Szymkiewicz commented on SPARK-19899:


[~mlnick] For some reason SparkQA recognized the PR but it is not reflected on 
JIRA :/ So manually: https://github.com/apache/spark/pull/17321

> FPGrowth input column naming
> 
>
> Key: SPARK-19899
> URL: https://issues.apache.org/jira/browse/SPARK-19899
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>
> Current implementation extends {{HasFeaturesCol}}. Personally I find it 
> rather unfortunate. Up to this moment we used consistent conventions - if we 
> mix-in  {{HasFeaturesCol}} the {{featuresCol}} should be {{VectorUDT}}. 
> Using the same {{Param}} for an {{array}} (and possibly for 
> {{array}} once {{PrefixSpan}} is ported to {{ml}}) will be 
> confusing for the users.
> I would like to suggest adding a new {{trait}} (let's say 
> {{HasTransactionsCol}}) to clearly indicate that the input type differs from 
> the other {{Estimators}}.
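
A sketch of what the suggested trait could look like (the trait name comes from the ticket; the Param plumbing below is an assumption, not an implemented API):

{code}
import org.apache.spark.ml.param.{Param, Params}

trait HasTransactionsCol extends Params {
  final val transactionsCol: Param[String] = new Param[String](
    this, "transactionsCol", "input column holding the transactions (an array of items)")

  final def getTransactionsCol: String = $(transactionsCol)
}
{code}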



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19713) saveAsTable

2017-03-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928802#comment-15928802
 ] 

Hyukjin Kwon commented on SPARK-19713:
--

 > please suggest what you think the title should be

Describe the problem you met in one line. The title, {{saveAsTable}}, is not 
helpful. Imagine you manage JIRAs and see a JIRA titled, for example, {{cache}}. 
Likewise, saveAsTable ... what? The title is obviously not complete and does 
not describe the problem.

Check other JIRAs 
https://issues.apache.org/jira/browse/SPARK-19713?jql=text%20~%20%22saveAsTable%22


> saveAsTable
> ---
>
> Key: SPARK-19713
> URL: https://issues.apache.org/jira/browse/SPARK-19713
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Balaram R Gadiraju
>
> Hi,
> I just observed that when we use dataframe.saveAsTable("table") -- In 
> oldversions
> and dataframe.write.saveAsTable("table") -- in the newer versions
> When using the method “df3.saveAsTable("brokentable")” in 
> Scala code, this creates a folder in HDFS but doesn’t update the Hive 
> metastore that it plans to create the table. So if anything goes wrong in 
> between, the folder still exists and Hive is not aware of the folder 
> creation. This will block users from creating the table “brokentable” as the 
> folder already exists; we can remove the folder using “hadoop fs -rmr 
> /data/hive/databases/testdb.db/brokentable”. Below is a workaround which 
> will enable you to continue the development work.
> Current code:
> val df3 = sqlContext.sql("select * from testtable")
> df3.saveAsTable("brokentable")
> THE WORKAROUND:
> By registering the DataFrame as table and then using sql command to load the 
> data will resolve the issue. EX:
> val df3 = sqlContext.sql("select * from testtable").registerTempTable("df3")
> sqlContext.sql("CREATE TABLE brokentable AS SELECT * FROM df3")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18789) Save Data frame with Null column-- exception

2017-03-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928780#comment-15928780
 ] 

Hyukjin Kwon commented on SPARK-18789:
--

Hm, do you mind if I ask a reproducer?

{code}
from pyspark.sql import Row
from pyspark.sql.types import *

data = [
["a", 1, None],
["b", 1, None],
["c", 1, None],
["d", 1, None],
]

schema = StructType([
   StructField("col1", StringType(), True),
   StructField("col2", IntegerType(), True),
   StructField("col3", StringType(), True)])

df = spark.createDataFrame(data, schema)
df.write.format("orc").save("hdfs://localhost:9000/tmp/squares", 
mode='overwrite')
spark.read.orc("hdfs://localhost:9000/tmp/squares").show()
{code}

produces

{code}
++++
|col1|col2|col3|
++++
|   a|   1|null|
|   b|   1|null|
|   c|   1|null|
|   d|   1|null|
++++
{code}

This seems working fine.

> Save Data frame with Null column-- exception
> 
>
> Key: SPARK-18789
> URL: https://issues.apache.org/jira/browse/SPARK-18789
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
>Reporter: Harish
>
> I am trying to save a DF to HDFS which is having 1 column is NULL(no data).
> col1 col2 col3
> a   1 null
> b   1 null
> c1null
> d   1 null
> code :  df.write.format("orc").save(path, mode='overwrite')
> Error:
>   java.lang.IllegalArgumentException: Error: type expected at the position 49 
> of 'string:string:string:double:string:double:string:null' but 'null' is 
> found.
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104)
>   at 
> org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182)
>   at 
> org.apache.spark.sql.hive.orc.OrcOutputWriter.(OrcFileFormat.scala:225)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94)
>   at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/12/08 19:41:49 ERROR TaskSetManager: Task 17 in stage 512.0 failed 4 
> times; aborting job
> 16/12/08 19:41:49 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in 
> stage 512.0 failed 4 times, most recent failure: Lost task 17.3 in stage 
> 512.0 (TID 37290, 10.63.136.108): java.lang.IllegalArgumentException: Error: 
> type expected at the position 49 of 
> 'string:string:string:double:string:double:string:null' but 'null' is found.
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104)
>   at 
> org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182)
>   at 
> 

[jira] [Comment Edited] (SPARK-19713) saveAsTable

2017-03-16 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928778#comment-15928778
 ] 

Eric Maynard edited comment on SPARK-19713 at 3/16/17 8:08 PM:
---

Not really relevant here, but to address:

>1. Hive will not be able to create the table as the folder already exists
You absolutely can construct a Hive external table on top of an existing folder.

>2. Hive cannot drop the table because the spark has not updated HiveMetaStore
The canonical solution to this is to run  `MSCK REPAIR TABLE myTable;` in Hive. 
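
A sketch of the two recovery paths mentioned above, assuming a Hive-enabled sqlContext and reusing the folder and table names from the report; the column list is a placeholder:

{code}
// 1) Point an external table at the folder that was left behind.
sqlContext.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS testdb.brokentable (col1 STRING, col2 INT)
    |LOCATION '/data/hive/databases/testdb.db/brokentable'""".stripMargin)

// 2) For a partitioned table whose partition directories exist on disk but are
//    missing from the metastore, ask Hive to re-register them.
sqlContext.sql("MSCK REPAIR TABLE testdb.brokentable")
{code}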


was (Author: emaynard1121):
Not really relevant here, but to address:
>2. Hive cannot drop the table because the spark has not updated HiveMetaStore
The canonical solution to this is to run  `MSCK REPAIR TABLE myTable;` in Hive. 

> saveAsTable
> ---
>
> Key: SPARK-19713
> URL: https://issues.apache.org/jira/browse/SPARK-19713
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Balaram R Gadiraju
>
> Hi,
> I just observed that when we use dataframe.saveAsTable("table") -- In 
> oldversions
> and dataframe.write.saveAsTable("table") -- in the newer versions
> When using the method “df3.saveAsTable("brokentable")” in 
> Scala code, this creates a folder in HDFS but doesn’t update the Hive 
> metastore that it plans to create the table. So if anything goes wrong in 
> between, the folder still exists and Hive is not aware of the folder 
> creation. This will block users from creating the table “brokentable” as the 
> folder already exists; we can remove the folder using “hadoop fs -rmr 
> /data/hive/databases/testdb.db/brokentable”. Below is a workaround which 
> will enable you to continue the development work.
> Current code:
> val df3 = sqlContext.sql("select * from testtable")
> df3.saveAsTable("brokentable")
> THE WORKAROUND:
> By registering the DataFrame as table and then using sql command to load the 
> data will resolve the issue. EX:
> val df3 = sqlContext.sql("select * from testtable").registerTempTable("df3")
> sqlContext.sql("CREATE TABLE brokentable AS SELECT * FROM df3")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19713) saveAsTable

2017-03-16 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928778#comment-15928778
 ] 

Eric Maynard commented on SPARK-19713:
--

Not really relevant here, but to address:
>2. Hive cannot drop the table because the spark has not updated HiveMetaStore
The canonical solution to this is to run  `MSCK REPAIR TABLE myTable;` in Hive. 

> saveAsTable
> ---
>
> Key: SPARK-19713
> URL: https://issues.apache.org/jira/browse/SPARK-19713
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Balaram R Gadiraju
>
> Hi,
> I just observed that when we use dataframe.saveAsTable("table") -- In 
> oldversions
> and dataframe.write.saveAsTable("table") -- in the newer versions
> When using the method “df3.saveAsTable("brokentable")” in 
> Scala code, this creates a folder in HDFS but doesn’t update the Hive 
> metastore that it plans to create the table. So if anything goes wrong in 
> between, the folder still exists and Hive is not aware of the folder 
> creation. This will block users from creating the table “brokentable” as the 
> folder already exists; we can remove the folder using “hadoop fs -rmr 
> /data/hive/databases/testdb.db/brokentable”. Below is a workaround which 
> will enable you to continue the development work.
> Current code:
> val df3 = sqlContext.sql("select * from testtable")
> df3.saveAsTable("brokentable")
> THE WORKAROUND:
> By registering the DataFrame as table and then using sql command to load the 
> data will resolve the issue. EX:
> val df3 = sqlContext.sql("select * from testtable").registerTempTable("df3")
> sqlContext.sql("CREATE TABLE brokentable AS SELECT * FROM df3")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19721) Good error message for version mismatch in log files

2017-03-16 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-19721:


Assignee: Liwei Lin

> Good error message for version mismatch in log files
> 
>
> Key: SPARK-19721
> URL: https://issues.apache.org/jira/browse/SPARK-19721
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Michael Armbrust
>Assignee: Liwei Lin
>Priority: Blocker
> Fix For: 2.2.0
>
>
> There are several places where we write out version identifiers in various 
> logs for structured streaming (usually {{v1}}).  However, in the places where 
> we check for this, we throw a confusing error message.  Instead, we should do 
> the following:
>  - Find all of the places where we do this kind of check.
>  - for {{vN}} where {{n>1}} say "UnsupportedLogFormat: The file {{path}} was 
> produced by a newer version of Spark and cannot be read by this version.  
> Please upgrade"
>  - for anything else throw an error saying the file is malformed.
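
A rough sketch of the kind of check the description asks for (the object and method names, exception type, and exact message text are assumptions, not the merged code):

{code}
import scala.util.Try

object LogVersionCheck {
  def parseVersion(line: String, path: String, maxSupportedVersion: Int): Int = {
    val parsed =
      if (line.startsWith("v")) Try(line.stripPrefix("v").toInt).toOption else None
    parsed match {
      case Some(v) if v > 0 && v <= maxSupportedVersion =>
        v
      case Some(v) if v > maxSupportedVersion =>
        throw new IllegalStateException(
          s"UnsupportedLogFormat: The file $path was produced by a newer version " +
            "of Spark and cannot be read by this version. Please upgrade.")
      case _ =>
        throw new IllegalStateException(s"The log file $path is malformed (found '$line').")
    }
  }
}
{code}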



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19721) Good error message for version mismatch in log files

2017-03-16 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19721:
-
Fix Version/s: 2.2.0

> Good error message for version mismatch in log files
> 
>
> Key: SPARK-19721
> URL: https://issues.apache.org/jira/browse/SPARK-19721
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Michael Armbrust
>Priority: Blocker
> Fix For: 2.2.0
>
>
> There are several places where we write out version identifiers in various 
> logs for structured streaming (usually {{v1}}).  However, in the places where 
> we check for this, we throw a confusing error message.  Instead, we should do 
> the following:
>  - Find all of the places where we do this kind of check.
>  - for {{vN}} where {{n>1}} say "UnsupportedLogFormat: The file {{path}} was 
> produced by a newer version of Spark and cannot be read by this version.  
> Please upgrade"
>  - for anything else throw an error saying the file is malformed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19713) saveAsTable

2017-03-16 Thread Balaram R Gadiraju (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928737#comment-15928737
 ] 

Balaram R Gadiraju commented on SPARK-19713:


@Hyukjin Kwon : please suggest what you think the title should be

> saveAsTable
> ---
>
> Key: SPARK-19713
> URL: https://issues.apache.org/jira/browse/SPARK-19713
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Balaram R Gadiraju
>
> Hi,
> I just observed that when we use dataframe.saveAsTable("table") -- In 
> oldversions
> and dataframe.write.saveAsTable("table") -- in the newer versions
> When using the method “df3.saveAsTable("brokentable")” in 
> Scala code, this creates a folder in HDFS but doesn’t update the Hive 
> metastore that it plans to create the table. So if anything goes wrong in 
> between, the folder still exists and Hive is not aware of the folder 
> creation. This will block users from creating the table “brokentable” as the 
> folder already exists; we can remove the folder using “hadoop fs -rmr 
> /data/hive/databases/testdb.db/brokentable”. Below is a workaround which 
> will enable you to continue the development work.
> Current code:
> val df3 = sqlContext.sql("select * from testtable")
> df3.saveAsTable("brokentable")
> THE WORKAROUND:
> By registering the DataFrame as table and then using sql command to load the 
> data will resolve the issue. EX:
> val df3 = sqlContext.sql("select * from testtable").registerTempTable("df3")
> sqlContext.sql("CREATE TABLE brokentable AS SELECT * FROM df3")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18789) Save Data frame with Null column-- exception

2017-03-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928738#comment-15928738
 ] 

Hyukjin Kwon commented on SPARK-18789:
--

Doh, I am sorry. Let me try to test again and will follow up here. I thought the 
script above described how to reproduce it. Thanks for pointing this out. Let me 
try this on the current master soon.


> Save Data frame with Null column-- exception
> 
>
> Key: SPARK-18789
> URL: https://issues.apache.org/jira/browse/SPARK-18789
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
>Reporter: Harish
>
> I am trying to save a DF to HDFS which is having 1 column is NULL(no data).
> col1 col2 col3
> a   1 null
> b   1 null
> c1null
> d   1 null
> code :  df.write.format("orc").save(path, mode='overwrite')
> Error:
>   java.lang.IllegalArgumentException: Error: type expected at the position 49 
> of 'string:string:string:double:string:double:string:null' but 'null' is 
> found.
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104)
>   at 
> org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182)
>   at 
> org.apache.spark.sql.hive.orc.OrcOutputWriter.(OrcFileFormat.scala:225)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94)
>   at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/12/08 19:41:49 ERROR TaskSetManager: Task 17 in stage 512.0 failed 4 
> times; aborting job
> 16/12/08 19:41:49 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in 
> stage 512.0 failed 4 times, most recent failure: Lost task 17.3 in stage 
> 512.0 (TID 37290, 10.63.136.108): java.lang.IllegalArgumentException: Error: 
> type expected at the position 49 of 
> 'string:string:string:double:string:double:string:null' but 'null' is found.
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104)
>   at 
> org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182)
>   at 
> org.apache.spark.sql.hive.orc.OrcOutputWriter.(OrcFileFormat.scala:225)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94)
>   at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
>   at 
> 

[jira] [Commented] (SPARK-19713) saveAsTable

2017-03-16 Thread Balaram R Gadiraju (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928734#comment-15928734
 ] 

Balaram R Gadiraju commented on SPARK-19713:


The issue is not only in Spark: when the folder is created and Spark ends with an 
error, we are not able to create or drop the table even in Hive, as Hive needs to 
create the folder in order to create the table.

1. Hive will not be able to create the table as the folder already exists.
2. Hive cannot drop the table because Spark has not updated the Hive metastore 
(there is no table in Hive to drop).

This causes the folder to be locked until you run "hdfs dfs -rm -r 
/data/hive/databases/testdb.db/brokentable".

Does everyone think this is not an issue?

> saveAsTable
> ---
>
> Key: SPARK-19713
> URL: https://issues.apache.org/jira/browse/SPARK-19713
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Balaram R Gadiraju
>
> Hi,
> I just observed that when we use dataframe.saveAsTable("table") -- In 
> oldversions
> and dataframe.write.saveAsTable("table") -- in the newer versions
> When using the method “df3.saveAsTable("brokentable")” in 
> Scala code, this creates a folder in HDFS but doesn’t update the Hive 
> metastore that it plans to create the table. So if anything goes wrong in 
> between, the folder still exists and Hive is not aware of the folder 
> creation. This will block users from creating the table “brokentable” as the 
> folder already exists; we can remove the folder using “hadoop fs -rmr 
> /data/hive/databases/testdb.db/brokentable”. Below is a workaround which 
> will enable you to continue the development work.
> Current code:
> val df3 = sqlContext.sql("select * from testtable")
> df3.saveAsTable("brokentable")
> THE WORKAROUND:
> By registering the DataFrame as table and then using sql command to load the 
> data will resolve the issue. EX:
> val df3 = sqlContext.sql("select * from testtable").registerTempTable("df3")
> sqlContext.sql("CREATE TABLE brokentable AS SELECT * FROM df3")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19982) JavaDatasetSuite.testJavaBeanEncoder sometimes fails with "Unable to generate an encoder for inner class"

2017-03-16 Thread Jose Soltren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928660#comment-15928660
 ] 

Jose Soltren commented on SPARK-19982:
--

[~cloud_fan] added this test as some work related to SPARK-11954. Wenchen, do 
you have any thoughts as to why this might be failing intermittently?

Likely not, but I wonder if this has anything to do with outerScopes being a 
lazy val in object OuterScopes. Then, possibly, in very rare instances, 
Analyzer.scala:ResolveNewInstance could hit the (outer == null) branch and 
throw this exception.

We run all the Spark unit tests about a dozen times a night and this has failed 
on average twice a month since last May (which is as far back as my data goes).

> JavaDatasetSuite.testJavaBeanEncoder sometimes fails with "Unable to generate 
> an encoder for inner class"
> -
>
> Key: SPARK-19982
> URL: https://issues.apache.org/jira/browse/SPARK-19982
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Jose Soltren
>
> JavaDatasetSuite.testJavaBeanEncoder fails sporadically with the error below:
> Unable to generate an encoder for inner class 
> `test.org.apache.spark.sql.JavaDatasetSuite$SimpleJavaBean` without access to 
> the scope that this class was defined in. Try moving this class out of its 
> parent class.
> From https://spark-tests.appspot.com/test-logs/35475788
> [~vanzin] looked into this back in October and reported:
> I ran this test in a loop (both alone and with the rest of the spark-sql 
> tests) and never got a failure. I even used the same JDK as Jenkins 
> (1.7.0_51).
> Also looked at the code and nothing seems wrong. The error occurs when an entry 
> with the parent class name is missing from the map kept in OuterScopes.scala, 
> but the test populates that map in its first line. So it doesn't look like a 
> race nor some issue with weak references (the map uses weak values).
>   public void testJavaBeanEncoder() {
> OuterScopes.addOuterScope(this);



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19965) DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output

2017-03-16 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928657#comment-15928657
 ] 

Shixiong Zhu commented on SPARK-19965:
--

[~lwlin] I think we can just ignore “_spark_metadata” in  InMemoryFileIndex. 
Could you try it?
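
A minimal sketch of that filtering idea (the object and method names are assumptions, not the actual InMemoryFileIndex internals):

{code}
import org.apache.hadoop.fs.Path

object StreamMetadataFilter {
  private val MetadataDirName = "_spark_metadata"

  // Keep everything except FileStreamSink's metadata directory, so it never
  // reaches partition inference.
  def keep(path: Path): Boolean = path.getName != MetadataDirName
}

// Usage idea: leafFiles.filter(f => StreamMetadataFilter.keep(f.getPath))
{code}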

> DataFrame batch reader may fail to infer partitions when reading 
> FileStreamSink's output
> 
>
> Key: SPARK-19965
> URL: https://issues.apache.org/jira/browse/SPARK-19965
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>
> Reproducer
> {code}
>   test("partitioned writing and batch reading with 'basePath'") {
> val inputData = MemoryStream[Int]
> val ds = inputData.toDS()
> val outputDir = Utils.createTempDir(namePrefix = 
> "stream.output").getCanonicalPath
> val checkpointDir = Utils.createTempDir(namePrefix = 
> "stream.checkpoint").getCanonicalPath
> var query: StreamingQuery = null
> try {
>   query =
> ds.map(i => (i, i * 1000))
>   .toDF("id", "value")
>   .writeStream
>   .partitionBy("id")
>   .option("checkpointLocation", checkpointDir)
>   .format("parquet")
>   .start(outputDir)
>   inputData.addData(1, 2, 3)
>   failAfter(streamingTimeout) {
> query.processAllAvailable()
>   }
>   spark.read.option("basePath", outputDir).parquet(outputDir + 
> "/*").show()
> } finally {
>   if (query != null) {
> query.stop()
>   }
> }
>   }
> {code}
> Stack trace
> {code}
> [info] - partitioned writing and batch reading with 'basePath' *** FAILED *** 
> (3 seconds, 928 milliseconds)
> [info]   java.lang.AssertionError: assertion failed: Conflicting directory 
> structures detected. Suspicious paths:
> [info]***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637
> [info]
> ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637/_spark_metadata
> [info] 
> [info] If provided paths are partition directories, please set "basePath" in 
> the options of the data source to specify the root directory of the table. If 
> there are multiple root directories, please load them separately and then 
> union them.
> [info]   at scala.Predef$.assert(Predef.scala:170)
> [info]   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:133)
> [info]   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:98)
> [info]   at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:156)
> [info]   at 
> org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:54)
> [info]   at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:55)
> [info]   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:133)
> [info]   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
> [info]   at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:160)
> [info]   at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:536)
> [info]   at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:520)
> [info]   at 
> org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply$mcV$sp(FileStreamSinkSuite.scala:292)
> [info]   at 
> org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268)
> [info]   at 
> org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19983) Getting ValidationFailureSemanticException on 'INSERT OVEWRITE'

2017-03-16 Thread Rajkumar (JIRA)
Rajkumar created SPARK-19983:


 Summary: Getting ValidationFailureSemanticException on 'INSERT 
OVEWRITE'
 Key: SPARK-19983
 URL: https://issues.apache.org/jira/browse/SPARK-19983
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Rajkumar
Priority: Blocker


Hi, I am creating a DataFrame and registering it as a temp table using 
df.createOrReplaceTempView('mytable'). After that I try to write the content from 
'mytable' into a Hive table (it has a partition) using the following query:

insert overwrite table
   myhivedb.myhivetable
partition(testdate) // ( 1) : Note here : I have a partition named 'testdate'
select
  Field1, 
  Field2,
  ...
  TestDate //(2) : Note here : I have a field named 'TestDate' ; Both (1) & (2) 
have the same name
from
  mytable

when I execute this query, I am getting the following error

Exception in thread "main" 
org.apache.hadoop.hive.ql.metadata.Table$ValidationFailureSemanticException: 
Partition spec {testdate=, TestDate=2013-01-01}

It looks like I am getting this error because of the matching field names, i.e. 
testdate (the partition in Hive) and TestDate (the field in the temp table 'mytable').

Whereas if my field name is something other than 'TestDate', the query executes 
successfully. Example:

insert overwrite table
   myhivedb.myhivetable
partition(testdate) 
select
  Field1, 
  Field2,
  ...
  myDate //Note here : The field name is 'myDate' & not 'TestDate'
from
  mytable
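
An untested sketch of one workaround to try: alias the projected column so its case matches the partition column exactly (table and field names are taken from the report above; whether this avoids the exception has not been verified here):

{code}
spark.sql(
  """INSERT OVERWRITE TABLE myhivedb.myhivetable PARTITION (testdate)
    |SELECT Field1, Field2, TestDate AS testdate
    |FROM mytable""".stripMargin)
{code}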





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12261) pyspark crash for large dataset

2017-03-16 Thread Tomas Pranckevicius (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927746#comment-15927746
 ] 

Tomas Pranckevicius edited comment on SPARK-12261 at 3/16/17 6:39 PM:
--

I am looking as well to the solution of this pyspark crash for the large data 
set issue on windows. I have read several posts and spent few days on this 
problem. I am happy to see that there is a solution mentioned by Shea Parkes 
and I am trying to get it working by changing rdd.py, but it still does not 
provide the positive outcome. Could please write more details on the change 
that has to be done in the proposed bandaid of exhausting the iterator at the 
end of takeUpToNumLeft() ?
{code}
def takeUpToNumLeft():
    iterator = iter(iterator)
    taken = 0
    while taken < left:
        yield next(iterator)
        taken += 1
{code}


was (Author: tomas pranckevicius):
I am looking as well to the solution of this pyspark crash for the large data 
set issue on windows. I have read several posts and spent few days on this 
problem. I am happy to see that there is a solution mention by Shea Parkes and 
I am trying to get it working by changing rdd.py, but it still does not provide 
the positive outcome. Could please write more details on the change that has to 
be done in the proposed bandaid of exhausting the iterator at the end of 
takeUpToNumLeft() by changing rdd.py file?
{code}
def takeUpToNumLeft():
    iterator = iter(iterator)
    taken = 0
    while taken < left:
        yield next(iterator)
        taken += 1
{code}

> pyspark crash for large dataset
> ---
>
> Key: SPARK-12261
> URL: https://issues.apache.org/jira/browse/SPARK-12261
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
> Environment: windows
>Reporter: zihao
>
> I tried to import a local text(over 100mb) file via textFile in pyspark, when 
> i ran data.take(), it failed and gave error messages including:
> 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; 
> aborting job
> Traceback (most recent call last):
>   File "E:/spark_python/test3.py", line 9, in 
> lines.take(5)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, 
> in take
> res = self.context.runJob(self, takeUpToNumLeft, p)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 
> 916, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
>   File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in 
> __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 
> 36, in deco
> return f(*a, **kw)
>   File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in 
> get_return_value
> format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.net.SocketException: Connection reset by peer: 
> socket write error
> Then i ran the same code for a small text file, this time .take() worked fine.
> How can i solve this problem?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12261) pyspark crash for large dataset

2017-03-16 Thread Tomas Pranckevicius (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927746#comment-15927746
 ] 

Tomas Pranckevicius edited comment on SPARK-12261 at 3/16/17 6:40 PM:
--

I am looking as well to the solution of this pyspark crash for the large data 
set issue on windows. I have read several posts and spent few days on this 
problem. I am happy to see that there is a solution mentioned by Shea Parkes 
and I am trying to get it working by changing rdd.py, but it still does not 
provide the positive outcome. Could you please write more details on the change 
that has to be done in the proposed bandaid of exhausting the iterator at the 
end of takeUpToNumLeft() ?
{code}
def takeUpToNumLeft():
    iterator = iter(iterator)
    taken = 0
    while taken < left:
        yield next(iterator)
        taken += 1
{code}


was (Author: tomas pranckevicius):
I am looking as well to the solution of this pyspark crash for the large data 
set issue on windows. I have read several posts and spent few days on this 
problem. I am happy to see that there is a solution mentioned by Shea Parkes 
and I am trying to get it working by changing rdd.py, but it still does not 
provide the positive outcome. Could please write more details on the change 
that has to be done in the proposed bandaid of exhausting the iterator at the 
end of takeUpToNumLeft() ?
{code}
def takeUpToNumLeft():
    iterator = iter(iterator)
    taken = 0
    while taken < left:
        yield next(iterator)
        taken += 1
{code}

> pyspark crash for large dataset
> ---
>
> Key: SPARK-12261
> URL: https://issues.apache.org/jira/browse/SPARK-12261
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
> Environment: windows
>Reporter: zihao
>
> I tried to import a local text(over 100mb) file via textFile in pyspark, when 
> i ran data.take(), it failed and gave error messages including:
> 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; 
> aborting job
> Traceback (most recent call last):
>   File "E:/spark_python/test3.py", line 9, in 
> lines.take(5)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, 
> in take
> res = self.context.runJob(self, takeUpToNumLeft, p)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 
> 916, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
>   File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in 
> __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 
> 36, in deco
> return f(*a, **kw)
>   File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in 
> get_return_value
> format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.net.SocketException: Connection reset by peer: 
> socket write error
> Then i ran the same code for a small text file, this time .take() worked fine.
> How can i solve this problem?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19945) Add test suite for SessionCatalog with HiveExternalCatalog

2017-03-16 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-19945:
---

Assignee: Song Jun

> Add test suite for SessionCatalog with HiveExternalCatalog
> --
>
> Key: SPARK-19945
> URL: https://issues.apache.org/jira/browse/SPARK-19945
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Assignee: Song Jun
> Fix For: 2.2.0
>
>
> Currently SessionCatalogSuite only covers InMemoryCatalog; there is no suite 
> for HiveExternalCatalog.
> In addition, some DDL functions are not well suited to testing in 
> ExternalCatalogSuite, because part of their logic is not fully implemented in 
> ExternalCatalog; they are fully implemented in SessionCatalog, so it is 
> better to test them in SessionCatalogSuite.
> So we should add a test suite for SessionCatalog with HiveExternalCatalog.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19945) Add test suite for SessionCatalog with HiveExternalCatalog

2017-03-16 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19945.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Add test suite for SessionCatalog with HiveExternalCatalog
> --
>
> Key: SPARK-19945
> URL: https://issues.apache.org/jira/browse/SPARK-19945
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Assignee: Song Jun
> Fix For: 2.2.0
>
>
> Currently SessionCatalogSuite only covers InMemoryCatalog; there is no suite 
> for HiveExternalCatalog.
> In addition, some DDL functions are not well suited to testing in 
> ExternalCatalogSuite, because part of their logic is not fully implemented in 
> ExternalCatalog; they are fully implemented in SessionCatalog, so it is 
> better to test them in SessionCatalogSuite.
> So we should add a test suite for SessionCatalog with HiveExternalCatalog.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19982) JavaDatasetSuite.testJavaBeanEncoder sometimes fails with "Unable to generate an encoder for inner class"

2017-03-16 Thread Jose Soltren (JIRA)
Jose Soltren created SPARK-19982:


 Summary: JavaDatasetSuite.testJavaBeanEncoder sometimes fails with 
"Unable to generate an encoder for inner class"
 Key: SPARK-19982
 URL: https://issues.apache.org/jira/browse/SPARK-19982
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.1.0
Reporter: Jose Soltren


JavaDatasetSuite.testJavaBeanEncoder fails sporadically with the error below:

Unable to generate an encoder for inner class 
`test.org.apache.spark.sql.JavaDatasetSuite$SimpleJavaBean` without access to 
the scope that this class was defined in. Try moving this class out of its 
parent class.

From https://spark-tests.appspot.com/test-logs/35475788

[~vanzin] looked into this back in October and reported:

I ran this test in a loop (both alone and with the rest of the spark-sql tests) 
and never got a failure. I even used the same JDK as Jenkins (1.7.0_51).
Also looked at the code and nothing seems wrong. The error occurs when an entry 
with the parent class name is missing from the map kept in OuterScopes.scala, 
but the test populates that map in its first line. So it doesn't look like a 
race or an issue with weak references (the map uses weak values).

  public void testJavaBeanEncoder() {
    OuterScopes.addOuterScope(this);



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19969) Doc and examples for Imputer

2017-03-16 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928549#comment-15928549
 ] 

yuhao yang edited comment on SPARK-19969 at 3/16/17 6:10 PM:
-

Not really. But I can start on it now if needed.


was (Author: yuhaoyan):
Not really. But I can start on it now.

> Doc and examples for Imputer
> 
>
> Key: SPARK-19969
> URL: https://issues.apache.org/jira/browse/SPARK-19969
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19969) Doc and examples for Imputer

2017-03-16 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928549#comment-15928549
 ] 

yuhao yang commented on SPARK-19969:


Not really. But I can start on it now.

> Doc and examples for Imputer
> 
>
> Key: SPARK-19969
> URL: https://issues.apache.org/jira/browse/SPARK-19969
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19979) [MLLIB] Multiple Estimators/Pipelines In CrossValidator

2017-03-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928545#comment-15928545
 ] 

Nick Pentreath commented on SPARK-19979:


I wonder if this fits in as a sort of sub-task of SPARK-19071?

cc [~bryanc] as it relates to your work on SPARK-19357.

> [MLLIB] Multiple Estimators/Pipelines In CrossValidator
> ---
>
> Key: SPARK-19979
> URL: https://issues.apache.org/jira/browse/SPARK-19979
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: David Leifker
>
> Update CrossValidator and TrainValidationSplit to be able to accept multiple 
> pipelines and grid parameters for testing different algorithms and/or being 
> able to better control tuning combinations. Maintains backwards compatible 
> API and reads legacy serialized objects.
> The same could be done using an external iterative approach. Build different 
> pipelines, throwing each into a CrossValidator, and then taking the best 
> model from each of those CrossValidators. Then finally picking the best from 
> those. This is the initial approach I explored. It resulted in a lot of 
> boiler plate code that felt like it shouldn't need to exist if the api simply 
> allowed for arrays of estimators and their parameters.
> A couple advantages to this implementation to consider come from keeping the 
> functional interface to the CrossValidator.
> 1. The caching of the folds is better utilized. An external iterative 
> approach creates a new set of k folds for each CrossValidator fit and the 
> folds are discarded after each CrossValidator run. In this implementation a 
> single set of k folds is created and cached for all of the pipelines.
> 2. A potential advantage of using this implementation is for future 
> parallelization of the pipelines within the CrossValidator. It is of course 
> possible to handle the parallelization outside of the CrossValidator here 
> too, however I believe there is already work in progress to parallelize the 
> grid parameters and that could be extended to multiple pipelines.
> Both of those behind-the-scene optimizations are possible because of 
> providing the CrossValidator with the data and the complete set of 
> pipelines/estimators to evaluate up front allowing one to abstract away the 
> implementation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19969) Doc and examples for Imputer

2017-03-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928537#comment-15928537
 ] 

Nick Pentreath commented on SPARK-19969:


No, I haven't done the doc or examples - I seem to recall you had already done 
some work on that?

> Doc and examples for Imputer
> 
>
> Key: SPARK-19969
> URL: https://issues.apache.org/jira/browse/SPARK-19969
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12261) pyspark crash for large dataset

2017-03-16 Thread Tomas Pranckevicius (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927746#comment-15927746
 ] 

Tomas Pranckevicius edited comment on SPARK-12261 at 3/16/17 5:40 PM:
--

I am also looking for a solution to this PySpark crash with large data sets on 
Windows. I have read several posts and spent a few days on this problem. I am 
happy to see that there is a solution mentioned by Shea Parkes and I am trying 
to get it working by changing rdd.py, but it still does not produce a positive 
outcome. Could you please write more details on the change that has to be done 
in the proposed bandaid of exhausting the iterator at the end of 
takeUpToNumLeft() in rdd.py?
{code}
def takeUpToNumLeft():
    iterator = iter(iterator)
    taken = 0
    while taken < left:
        yield next(iterator)
        taken += 1
{code}


was (Author: tomas pranckevicius):
I am looking as well to the solution of this pyspark crash for the large data 
set issue on windows. I have read several posts and spent few days on this 
problem. I am happy to see that there is a solution mention by Shea Parkes and 
I am trying to get it working by changing rdd.py, but it still does not provide 
the positive outcome. Could please write more details on the change that has to 
be done in the proposed bandaid of exhausting the iterator at the end of 
takeUpToNumLeft() by changing rdd.py file?
{code} 
def takeUpToNumLeft():
iterator = iter(iterator)
taken = 0
while taken < left:
yield next(iterator)
taken += 1
{code}

> pyspark crash for large dataset
> ---
>
> Key: SPARK-12261
> URL: https://issues.apache.org/jira/browse/SPARK-12261
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
> Environment: windows
>Reporter: zihao
>
> I tried to import a local text(over 100mb) file via textFile in pyspark, when 
> i ran data.take(), it failed and gave error messages including:
> 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; 
> aborting job
> Traceback (most recent call last):
>   File "E:/spark_python/test3.py", line 9, in 
> lines.take(5)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, 
> in take
> res = self.context.runJob(self, takeUpToNumLeft, p)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 
> 916, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
>   File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in 
> __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 
> 36, in deco
> return f(*a, **kw)
>   File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in 
> get_return_value
> format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.net.SocketException: Connection reset by peer: 
> socket write error
> Then i ran the same code for a small text file, this time .take() worked fine.
> How can i solve this problem?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12261) pyspark crash for large dataset

2017-03-16 Thread Tomas Pranckevicius (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927746#comment-15927746
 ] 

Tomas Pranckevicius edited comment on SPARK-12261 at 3/16/17 5:40 PM:
--

I am also looking for a solution to this PySpark crash with large data sets on 
Windows. I have read several posts and spent a few days on this problem. I am 
happy to see that there is a solution mentioned by Shea Parkes and I am trying 
to get it working by changing rdd.py, but it still does not produce a positive 
outcome. Could you please write more details on the change that has to be done 
in the proposed bandaid of exhausting the iterator at the end of 
takeUpToNumLeft() in rdd.py?
{code}
def takeUpToNumLeft():
    iterator = iter(iterator)
    taken = 0
    while taken < left:
        yield next(iterator)
        taken += 1
{code}


was (Author: tomas pranckevicius):
I am looking as well to the solution of this pyspark crash for the large data 
set issue on windows. I have read several posts and spent few days on this 
problem. I am happy to see that there is a solution mention by Shea Parkes and 
I am trying to get it working by changing rdd.py, but it still does not provide 
the positive outcome. Could please write more details on the change that has to 
be done in the proposed bandaid of exhausting the iterator at the end of 
takeUpToNumLeft() by changing rdd.py file?
{code} 
def takeUpToNumLeft():
iterator = iter(iterator)
taken = 0
while taken < left:
yield next(iterator)
taken += 1
{code}

> pyspark crash for large dataset
> ---
>
> Key: SPARK-12261
> URL: https://issues.apache.org/jira/browse/SPARK-12261
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
> Environment: windows
>Reporter: zihao
>
> I tried to import a local text(over 100mb) file via textFile in pyspark, when 
> i ran data.take(), it failed and gave error messages including:
> 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; 
> aborting job
> Traceback (most recent call last):
>   File "E:/spark_python/test3.py", line 9, in 
> lines.take(5)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, 
> in take
> res = self.context.runJob(self, takeUpToNumLeft, p)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 
> 916, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
>   File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in 
> __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 
> 36, in deco
> return f(*a, **kw)
>   File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in 
> get_return_value
> format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.net.SocketException: Connection reset by peer: 
> socket write error
> Then i ran the same code for a small text file, this time .take() worked fine.
> How can i solve this problem?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12261) pyspark crash for large dataset

2017-03-16 Thread Tomas Pranckevicius (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927746#comment-15927746
 ] 

Tomas Pranckevicius edited comment on SPARK-12261 at 3/16/17 5:39 PM:
--

I am also looking for a solution to this PySpark crash with large data sets on 
Windows. I have read several posts and spent a few days on this problem. I am 
happy to see that there is a solution mentioned by Shea Parkes and I am trying 
to get it working by changing rdd.py, but it still does not produce a positive 
outcome. Could you please write more details on the change that has to be done 
in the proposed bandaid of exhausting the iterator at the end of 
takeUpToNumLeft() in rdd.py?
{code}
def takeUpToNumLeft():
    iterator = iter(iterator)
    taken = 0
    while taken < left:
        yield next(iterator)
        taken += 1
{code}


was (Author: tomas pranckevicius):
I am looking as well to the solution of this pyspark crash for the large data 
set issue on windows. I have read several posts and spent few days on this 
problem. I am happy to see that there is a solution mention by Shea Parkes and 
I am trying to get it working by changing rdd.py, but it still does not provide 
the positive outcome. Could please write more details on the change that has to 
be done in the proposed bandaid of exhausting the iterator at the end of 
takeUpToNumLeft() by changing rdd.py file?
 def takeUpToNumLeft():
iterator = iter(iterator)
taken = 0
while taken < left:
yield next(iterator)
taken += 1

> pyspark crash for large dataset
> ---
>
> Key: SPARK-12261
> URL: https://issues.apache.org/jira/browse/SPARK-12261
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
> Environment: windows
>Reporter: zihao
>
> I tried to import a local text(over 100mb) file via textFile in pyspark, when 
> i ran data.take(), it failed and gave error messages including:
> 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; 
> aborting job
> Traceback (most recent call last):
>   File "E:/spark_python/test3.py", line 9, in 
> lines.take(5)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, 
> in take
> res = self.context.runJob(self, takeUpToNumLeft, p)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 
> 916, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
>   File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in 
> __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 
> 36, in deco
> return f(*a, **kw)
>   File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in 
> get_return_value
> format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.net.SocketException: Connection reset by peer: 
> socket write error
> Then i ran the same code for a small text file, this time .take() worked fine.
> How can i solve this problem?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19329) after alter a datasource table's location to a not exist location and then insert data throw Exception

2017-03-16 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-19329:

Fix Version/s: 2.1.1

> after alter a datasource table's location to a not exist location and then 
> insert data throw Exception
> --
>
> Key: SPARK-19329
> URL: https://issues.apache.org/jira/browse/SPARK-19329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Song Jun
>Assignee: Song Jun
> Fix For: 2.1.1, 2.2.0
>
>
> spark.sql("create table t(a string, b int) using parquet")
> spark.sql(s"alter table t set location '$notexistedlocation'")
> spark.sql("insert into table t select 'c', 1")
> this will throw an exception:
> com.google.common.util.concurrent.UncheckedExecutionException: 
> org.apache.spark.sql.AnalysisException: Path does not exist: 
> $notexistedlocation;
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4814)
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4830)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:122)
>   at 
> org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:69)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:456)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:465)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:463)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:463)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:453)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
>   at scala.collection.immutable.List.foreach(List.scala:381)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19969) Doc and examples for Imputer

2017-03-16 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928443#comment-15928443
 ] 

yuhao yang commented on SPARK-19969:


Thanks for the great help with Imputer, [~mlnick] Have you started on this?

> Doc and examples for Imputer
> 
>
> Key: SPARK-19969
> URL: https://issues.apache.org/jira/browse/SPARK-19969
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14438) Cross-publish Breeze for Scala 2.12

2017-03-16 Thread Kirill chebba Chebunin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928356#comment-15928356
 ] 

Kirill chebba Chebunin commented on SPARK-14438:


2.12 support was added for version 0.13 in 
https://github.com/scalanlp/breeze/issues/604

> Cross-publish Breeze for Scala 2.12
> ---
>
> Key: SPARK-14438
> URL: https://issues.apache.org/jira/browse/SPARK-14438
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>
> Spark relies on Breeze (https://github.com/scalanlp/breeze), so we'll need to 
> cross-publish that for Scala 2.12.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19981) Sort-Merge join inserts shuffles when joining dataframes with aliased columns

2017-03-16 Thread Allen George (JIRA)
Allen George created SPARK-19981:


 Summary: Sort-Merge join inserts shuffles when joining dataframes 
with aliased columns
 Key: SPARK-19981
 URL: https://issues.apache.org/jira/browse/SPARK-19981
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2
Reporter: Allen George


Performing a sort-merge join with two dataframes - each of which has the join 
column aliased - causes Spark to insert an unnecessary shuffle.

Consider the scala test code below, which should be equivalent to the following 
SQL.

{code:SQL}
SELECT * FROM
  (SELECT number AS aliased from df1) t1
LEFT JOIN
  (SELECT number AS aliased from df2) t2
ON t1.aliased = t2.aliased
{code}

{code:scala}
private case class OneItem(number: Long)
private case class TwoItem(number: Long, value: String)
test("join with aliases should not trigger shuffle") {
  val df1 = sqlContext.createDataFrame(
Seq(
  OneItem(0),
  OneItem(2),
  OneItem(4)
)
  )
  val partitionedDf1 = df1.repartition(10, col("number"))
  partitionedDf1.createOrReplaceTempView("df1")
  partitionedDf1.cache()
  partitionedDf1.count()
  
  val df2 = sqlContext.createDataFrame(
Seq(
  TwoItem(0, "zero"),
  TwoItem(2, "two"),
  TwoItem(4, "four")
)
  )
  val partitionedDf2 = df2.repartition(10, col("number"))
  partitionedDf2.createOrReplaceTempView("df2")
  partitionedDf2.cache()
  partitionedDf2.count()
  
  val fromDf1 = sqlContext.sql("SELECT number from df1")
  val fromDf2 = sqlContext.sql("SELECT number from df2")

  val aliasedDf1 = fromDf1.select(col(fromDf1.columns.head) as "aliased")
  val aliasedDf2 = fromDf2.select(col(fromDf2.columns.head) as "aliased")

  aliasedDf1.join(aliasedDf2, Seq("aliased"), "left_outer")
}
{code}

Both the SQL and the Scala code generate a query plan where an extra exchange 
is inserted before performing the sort-merge join. This exchange changes the 
partitioning from {{HashPartitioning("number", 10)}} for each frame being 
joined into {{HashPartitioning("aliased", 5)}}. I would have expected that, 
since it's a simple column aliasing and both frames have exactly the same 
partitioning, the join would reuse the partitioning of the initial frames 
instead of inserting an exchange.

{noformat} 
*Project [args=[aliased#267L]][outPart=PartitioningCollection(5, 
hashpartitioning(aliased#267L, 5)%NONNULL,hashpartitioning(aliased#270L, 
5)%NONNULL)][outOrder=List(aliased#267L 
ASC%NONNULL)][output=List(aliased#267:bigint%NONNULL)]
+- *SortMergeJoin [args=[aliased#267L], [aliased#270L], 
Inner][outPart=PartitioningCollection(5, hashpartitioning(aliased#267L, 
5)%NONNULL,hashpartitioning(aliased#270L, 
5)%NONNULL)][outOrder=List(aliased#267L 
ASC%NONNULL)][output=ArrayBuffer(aliased#267:bigint%NONNULL, 
aliased#270:bigint%NONNULL)]
   :- *Sort [args=[aliased#267L ASC], false, 0][outPart=HashPartitioning(5, 
aliased#267:bigint%NONNULL)][outOrder=List(aliased#267L 
ASC%NONNULL)][output=ArrayBuffer(aliased#267:bigint%NONNULL)]
   :  +- Exchange [args=hashpartitioning(aliased#267L, 
5)%NONNULL][outPart=HashPartitioning(5, 
aliased#267:bigint%NONNULL)][outOrder=List()][output=ArrayBuffer(aliased#267:bigint%NONNULL)]
   : +- *Project [args=[number#198L AS 
aliased#267L]][outPart=HashPartitioning(10, 
number#198:bigint%NONNULL)][outOrder=List()][output=ArrayBuffer(aliased#267:bigint%NONNULL)]
   :+- InMemoryTableScan 
[args=[number#198L]][outPart=HashPartitioning(10, 
number#198:bigint%NONNULL)][outOrder=List()][output=ArrayBuffer(number#198:bigint%NONNULL)]
   :   :  +- InMemoryRelation [number#198L], true, 1, 
StorageLevel(disk, memory, deserialized, 1 replicas), 
false[Statistics(24,false)][output=List(number#198:bigint%NONNULL)]
   :   : :  +- Exchange [args=hashpartitioning(number#198L, 
10)%NONNULL][outPart=HashPartitioning(10, 
number#198:bigint%NONNULL)][outOrder=List()][output=List(number#198:bigint%NONNULL)]
   :   : : +- LocalTableScan 
[args=[number#198L]][outPart=UnknownPartitioning(0)][outOrder=List()][output=List(number#198:bigint%NONNULL)]
   +- *Sort [args=[aliased#270L ASC], false, 0][outPart=HashPartitioning(5, 
aliased#270:bigint%NONNULL)][outOrder=List(aliased#270L 
ASC%NONNULL)][output=ArrayBuffer(aliased#270:bigint%NONNULL)]
  +- Exchange [args=hashpartitioning(aliased#270L, 
5)%NONNULL][outPart=HashPartitioning(5, 
aliased#270:bigint%NONNULL)][outOrder=List()][output=ArrayBuffer(aliased#270:bigint%NONNULL)]
 +- *Project [args=[number#223L AS 
aliased#270L]][outPart=HashPartitioning(10, 
number#223:bigint%NONNULL)][outOrder=List()][output=ArrayBuffer(aliased#270:bigint%NONNULL)]
+- InMemoryTableScan 
[args=[number#223L]][outPart=HashPartitioning(10, 
number#223:bigint%NONNULL)][outOrder=List()][output=ArrayBuffer(number#223:bigint%NONNULL)]
   :  +- InMemoryRelation [number#223L, value#224], true, 1, 
StorageLevel(disk, memory, 

[jira] [Updated] (SPARK-19980) Basic Dataset transformation on POJOs does not preserves nulls.

2017-03-16 Thread Michel Lemay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michel Lemay updated SPARK-19980:
-
Description: 
Applying an identity map transformation on a statically typed Dataset with a 
POJO produces an unexpected result.

Given POJOs:
{code}
public class Stuff implements Serializable {
private String name;
public void setName(String name) { this.name = name; }
public String getName() { return name; }
}

public class Outer implements Serializable {
private String name;
private Stuff stuff;
public void setName(String name) { this.name = name; }
public String getName() { return name; }
public void setStuff(Stuff stuff) { this.stuff = stuff; }
public Stuff getStuff() { return stuff; }
}
{code}

Produces the result:

{code}
scala> val encoder = Encoders.bean(classOf[Outer])
encoder: org.apache.spark.sql.Encoder[pojos.Outer] = class[name[0]: string, 
stuff[0]: struct]

scala> val schema = encoder.schema
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(name,StringType,true), 
StructField(stuff,StructType(StructField(name,StringType,true)),true))

scala> schema.printTreeString
root
 |-- name: string (nullable = true)
 |-- stuff: struct (nullable = true)
 ||-- name: string (nullable = true)


scala> val df = spark.read.schema(schema).json("stuff.json").as[Outer](encoder)
df: org.apache.spark.sql.Dataset[pojos.Outer] = [name: string, stuff: 
struct]

scala> df.show()
+----+-----+
|name|stuff|
+----+-----+
|  v1| null|
+----+-----+

scala> df.map(x => x)(encoder).show()
+----+------+
|name| stuff|
+----+------+
|  v1|[null]|
+----+------+
{code}

After identity transformation, `stuff` becomes an object with null values 
inside it instead of staying null itself.

Doing the same with case classes preserves the nulls:
{code}
scala> case class ScalaStuff(name: String)
defined class ScalaStuff

scala> case class ScalaOuter(name: String, stuff: ScalaStuff)
defined class ScalaOuter

scala> val encoder2 = Encoders.product[ScalaOuter]
encoder2: org.apache.spark.sql.Encoder[ScalaOuter] = class[name[0]: string, 
stuff[0]: struct]

scala> val schema2 = encoder2.schema
schema2: org.apache.spark.sql.types.StructType = 
StructType(StructField(name,StringType,true), 
StructField(stuff,StructType(StructField(name,StringType,true)),true))

scala> schema2.printTreeString
root
 |-- name: string (nullable = true)
 |-- stuff: struct (nullable = true)
 ||-- name: string (nullable = true)


scala>

scala> val df2 = spark.read.schema(schema2).json("stuff.json").as[ScalaOuter]
df2: org.apache.spark.sql.Dataset[ScalaOuter] = [name: string, stuff: 
struct]

scala> df2.show()
+----+-----+
|name|stuff|
+----+-----+
|  v1| null|
+----+-----+


scala> df2.map(x => x).show()
+----+-----+
|name|stuff|
+----+-----+
|  v1| null|
+----+-----+
{code}

stuff.json:
{code}
{"name":"v1", "stuff":null }
{code}


  was:
Applying an identity map transformation on a statically typed Dataset with a 
POJO produces an unexpected result.

Given POJOs:
{code}
public class Stuff implements Serializable {
private String name;
public void setName(String name) { this.name = name; }
public String getName() { return name; }
}

public class Outer implements Serializable {
private String name;
private Stuff stuff;
public void setName(String name) { this.name = name; }
public String getName() { return name; }
public void setStuff(Stuff stuff) { this.stuff = stuff; }
public Stuff getStuff() { return stuff; }
}
{code}

And test code:
{code}
val encoder = Encoders.bean(classOf[Outer])
val schema = encoder.schema
schema.printTreeString

val df = spark.read.schema(schema).json("stuff.json").as[Outer](encoder)
df.show()
df.map(x => x)(encoder).show()
{code}

Produces the result:

{code}
scala> val encoder = Encoders.bean(classOf[Outer])
encoder: org.apache.spark.sql.Encoder[pojos.Outer] = class[name[0]: string, 
stuff[0]: struct]

scala> val schema = encoder.schema
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(name,StringType,true), 
StructField(stuff,StructType(StructField(name,StringType,true)),true))

scala> schema.printTreeString
root
 |-- name: string (nullable = true)
 |-- stuff: struct (nullable = true)
 ||-- name: string (nullable = true)


scala> val df = spark.read.schema(schema).json("stuff.json").as[Outer](encoder)
df: org.apache.spark.sql.Dataset[pojos.Outer] = [name: string, stuff: 
struct]

scala> df.show()
+----+-----+
|name|stuff|
+----+-----+
|  v1| null|
+----+-----+

scala> df.map(x => x)(encoder).show()
+----+------+
|name| stuff|
+----+------+
|  v1|[null]|
+----+------+
{code}

After identity transformation, `stuff` becomes an object with null values 
inside it instead of staying null itself.

Doing the same with case classes preserves the nulls:
{code}
scala> case class ScalaStuff(name: String)
defined class ScalaStuff


[jira] [Updated] (SPARK-19980) Basic Dataset transformation on POJOs does not preserves nulls.

2017-03-16 Thread Michel Lemay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michel Lemay updated SPARK-19980:
-
Description: 
Applying an identity map transformation on a statically typed Dataset with a 
POJO produces an unexpected result.

Given POJOs:
{code}
public class Stuff implements Serializable {
private String name;
public void setName(String name) { this.name = name; }
public String getName() { return name; }
}

public class Outer implements Serializable {
private String name;
private Stuff stuff;
public void setName(String name) { this.name = name; }
public String getName() { return name; }
public void setStuff(Stuff stuff) { this.stuff = stuff; }
public Stuff getStuff() { return stuff; }
}
{code}

And test code:
{code}
val encoder = Encoders.bean(classOf[Outer])
val schema = encoder.schema
schema.printTreeString

val df = spark.read.schema(schema).json("stuff.json").as[Outer](encoder)
df.show()
df.map(x => x)(encoder).show()
{code}

Produces the result:

{code}
scala> val encoder = Encoders.bean(classOf[Outer])
encoder: org.apache.spark.sql.Encoder[pojos.Outer] = class[name[0]: string, 
stuff[0]: struct]

scala> val schema = encoder.schema
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(name,StringType,true), 
StructField(stuff,StructType(StructField(name,StringType,true)),true))

scala> schema.printTreeString
root
 |-- name: string (nullable = true)
 |-- stuff: struct (nullable = true)
 ||-- name: string (nullable = true)


scala> val df = spark.read.schema(schema).json("stuff.json").as[Outer](encoder)
df: org.apache.spark.sql.Dataset[pojos.Outer] = [name: string, stuff: 
struct]

scala> df.show()
+----+-----+
|name|stuff|
+----+-----+
|  v1| null|
+----+-----+

scala> df.map(x => x)(encoder).show()
+----+------+
|name| stuff|
+----+------+
|  v1|[null]|
+----+------+
{code}

After identity transformation, `stuff` becomes an object with null values 
inside it instead of staying null itself.

Doing the same with case classes preserves the nulls:
{code}
scala> case class ScalaStuff(name: String)
defined class ScalaStuff

scala> case class ScalaOuter(name: String, stuff: ScalaStuff)
defined class ScalaOuter

scala> val encoder2 = Encoders.product[ScalaOuter]
encoder2: org.apache.spark.sql.Encoder[ScalaOuter] = class[name[0]: string, 
stuff[0]: struct]

scala> val schema2 = encoder2.schema
schema2: org.apache.spark.sql.types.StructType = 
StructType(StructField(name,StringType,true), 
StructField(stuff,StructType(StructField(name,StringType,true)),true))

scala> schema2.printTreeString
root
 |-- name: string (nullable = true)
 |-- stuff: struct (nullable = true)
 ||-- name: string (nullable = true)


scala>

scala> val df2 = spark.read.schema(schema2).json("stuff.json").as[ScalaOuter]
df2: org.apache.spark.sql.Dataset[ScalaOuter] = [name: string, stuff: 
struct]

scala> df2.show()
+----+-----+
|name|stuff|
+----+-----+
|  v1| null|
+----+-----+


scala> df2.map(x => x).show()
+----+-----+
|name|stuff|
+----+-----+
|  v1| null|
+----+-----+
{code}

stuff.json:
{code}
{"name":"v1", "stuff":null }
{code}


  was:
Applying an identity map transformation on a statically typed Dataset with a 
POJO produces an unexpected result.

Given POJOs:
{code}
public class Stuff implements Serializable {
private String name;
public void setName(String name) { this.name = name; }
public String getName() { return name; }
}

public class Outer implements Serializable {
private String name;
private Stuff stuff;
public void setName(String name) { this.name = name; }
public String getName() { return name; }
public void setStuff(Stuff stuff) { this.stuff = stuff; }
public Stuff getStuff() { return stuff; }
}
{code}

And test code:
{code}
val encoder = Encoders.bean(classOf[Outer])
val schema = encoder.schema
schema.printTreeString

val df = spark.read.schema(schema).json("d:\\stuff.json").as[Outer](encoder)
df.show()
df.map(x => x)(encoder).show()
{code}

Produces the result:

{code}
scala> val encoder = Encoders.bean(classOf[Outer])
encoder: org.apache.spark.sql.Encoder[pojos.Outer] = class[name[0]: string, 
stuff[0]: struct]

scala> val schema = encoder.schema
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(name,StringType,true), 
StructField(stuff,StructType(StructField(name,StringType,true)),true))

scala> schema.printTreeString
root
 |-- name: string (nullable = true)
 |-- stuff: struct (nullable = true)
 ||-- name: string (nullable = true)


scala> val df = spark.read.schema(schema).json("stuff.json").as[Outer](encoder)
df: org.apache.spark.sql.Dataset[pojos.Outer] = [name: string, stuff: 
struct]

scala> df.show()
+----+-----+
|name|stuff|
+----+-----+
|  v1| null|
+----+-----+

scala> df.map(x => x)(encoder).show()
+----+------+
|name| stuff|
+----+------+
|  v1|[null]|
+----+------+
{code}


[jira] [Created] (SPARK-19980) Basic Dataset transformation on POJOs does not preserves nulls.

2017-03-16 Thread Michel Lemay (JIRA)
Michel Lemay created SPARK-19980:


 Summary: Basic Dataset transformation on POJOs does not preserves 
nulls.
 Key: SPARK-19980
 URL: https://issues.apache.org/jira/browse/SPARK-19980
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Michel Lemay


Applying an identity map transformation on a statically typed Dataset with a 
POJO produces an unexpected result.

Given POJOs:
{code}
public class Stuff implements Serializable {
private String name;
public void setName(String name) { this.name = name; }
public String getName() { return name; }
}

public class Outer implements Serializable {
private String name;
private Stuff stuff;
public void setName(String name) { this.name = name; }
public String getName() { return name; }
public void setStuff(Stuff stuff) { this.stuff = stuff; }
public Stuff getStuff() { return stuff; }
}
{code}

And test code:
{code}
val encoder = Encoders.bean(classOf[Outer])
val schema = encoder.schema
schema.printTreeString

val df = spark.read.schema(schema).json("d:\\stuff.json").as[Outer](encoder)
df.show()
df.map(x => x)(encoder).show()
{code}

Produces the result:

{code}
scala> val encoder = Encoders.bean(classOf[Outer])
encoder: org.apache.spark.sql.Encoder[pojos.Outer] = class[name[0]: string, 
stuff[0]: struct]

scala> val schema = encoder.schema
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(name,StringType,true), 
StructField(stuff,StructType(StructField(name,StringType,true)),true))

scala> schema.printTreeString
root
 |-- name: string (nullable = true)
 |-- stuff: struct (nullable = true)
 ||-- name: string (nullable = true)


scala> val df = spark.read.schema(schema).json("stuff.json").as[Outer](encoder)
df: org.apache.spark.sql.Dataset[pojos.Outer] = [name: string, stuff: 
struct]

scala> df.show()
+----+-----+
|name|stuff|
+----+-----+
|  v1| null|
+----+-----+

scala> df.map(x => x)(encoder).show()
+----+------+
|name| stuff|
+----+------+
|  v1|[null]|
+----+------+
{code}

After identity transformation, `stuff` becomes an object with null values 
inside it instead of staying null itself.

Doing the same with case classes preserves the nulls:
{code}
scala> case class ScalaStuff(name: String)
defined class ScalaStuff

scala> case class ScalaOuter(name: String, stuff: ScalaStuff)
defined class ScalaOuter

scala> val encoder2 = Encoders.product[ScalaOuter]
encoder2: org.apache.spark.sql.Encoder[ScalaOuter] = class[name[0]: string, 
stuff[0]: struct]

scala> val schema2 = encoder2.schema
schema2: org.apache.spark.sql.types.StructType = 
StructType(StructField(name,StringType,true), 
StructField(stuff,StructType(StructField(name,StringType,true)),true))

scala> schema2.printTreeString
root
 |-- name: string (nullable = true)
 |-- stuff: struct (nullable = true)
 ||-- name: string (nullable = true)


scala>

scala> val df2 = spark.read.schema(schema2).json("stuff.json").as[ScalaOuter]
df2: org.apache.spark.sql.Dataset[ScalaOuter] = [name: string, stuff: 
struct]

scala> df2.show()
+----+-----+
|name|stuff|
+----+-----+
|  v1| null|
+----+-----+


scala> df2.map(x => x).show()
+----+-----+
|name|stuff|
+----+-----+
|  v1| null|
+----+-----+
{code}

stuff.json:
{code}
{"name":"v1", "stuff":null }
{code}




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18789) Save Data frame with Null column-- exception

2017-03-16 Thread Eugen Prokhorenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928213#comment-15928213
 ] 

Eugen Prokhorenko commented on SPARK-18789:
---

Just wanted to mention that the initial problem involves saving null values 
(the Python script above doesn't have null columns in the df).
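
To make that concrete, here is a minimal PySpark sketch of the reported setup: 
a column that is entirely null is inferred as NullType, and per the stack trace 
quoted below the ORC writer then rejects the schema string. The output path is 
illustrative only.
{code}
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

df = (spark.createDataFrame([("a", 1), ("b", 1), ("c", 1), ("d", 1)],
                            ["col1", "col2"])
           .withColumn("col3", lit(None)))   # col3 is inferred as NullType

df.printSchema()   # ... col3: null (nullable = true)

# per the report below, on 2.0.2 this fails with "type expected ... but
# 'null' is found" raised by the Hive ORC serde
df.write.format("orc").mode("overwrite").save("/tmp/spark18789_repro")
{code}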

> Save Data frame with Null column-- exception
> 
>
> Key: SPARK-18789
> URL: https://issues.apache.org/jira/browse/SPARK-18789
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
>Reporter: Harish
>
> I am trying to save a DF to HDFS which has one column that is entirely NULL 
> (no data).
> col1 col2 col3
> a    1    null
> b    1    null
> c    1    null
> d    1    null
> code :  df.write.format("orc").save(path, mode='overwrite')
> Error:
>   java.lang.IllegalArgumentException: Error: type expected at the position 49 
> of 'string:string:string:double:string:double:string:null' but 'null' is 
> found.
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104)
>   at 
> org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182)
>   at 
> org.apache.spark.sql.hive.orc.OrcOutputWriter.(OrcFileFormat.scala:225)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94)
>   at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/12/08 19:41:49 ERROR TaskSetManager: Task 17 in stage 512.0 failed 4 
> times; aborting job
> 16/12/08 19:41:49 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 17 in 
> stage 512.0 failed 4 times, most recent failure: Lost task 17.3 in stage 
> 512.0 (TID 37290, 10.63.136.108): java.lang.IllegalArgumentException: Error: 
> type expected at the position 49 of 
> 'string:string:string:double:string:double:string:null' but 'null' is found.
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:348)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:392)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:765)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde.initialize(OrcSerde.java:104)
>   at 
> org.apache.spark.sql.hive.orc.OrcSerializer.(OrcFileFormat.scala:182)
>   at 
> org.apache.spark.sql.hive.orc.OrcOutputWriter.(OrcFileFormat.scala:225)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:94)
>   at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
>   at 
> 

[jira] [Comment Edited] (SPARK-19977) Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from 5 ms to 30 seconds in Spark Streaming application

2017-03-16 Thread Ray Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928196#comment-15928196
 ] 

Ray Qiu edited comment on SPARK-19977 at 3/16/17 2:45 PM:
--

One thing to add is that the same application will not have this issue when 
there is only one stream, even when this one stream has a much higher load 
(10x).  The issue seems to have something to do with multiple streams.


was (Author: rayqiu):
One thing to add is that the same application will not have this issue when 
there is only one stream, even this one stream has a much higher load.  The 
issue seems to have something to do with multiple streams.

> Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from 
> 5 ms to 30 seconds in Spark Streaming application
> --
>
> Key: SPARK-19977
> URL: https://issues.apache.org/jira/browse/SPARK-19977
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Ray Qiu
>
> Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from 
> 5 ms to 30+ seconds in a Spark Streaming application, where multiple Kafka 
> direct streams are processed.  These kafka streams are processed separately 
> (not combined via union).  
> It causes the task processing time to increase greatly and eventually stops 
> working.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15040) PySpark impl for ml.feature.Imputer

2017-03-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928180#comment-15928180
 ] 

Nick Pentreath commented on SPARK-15040:


Sorry, I did not see your comment - I opened a 
[PR|https://github.com/apache/spark/pull/17316] already.
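
For context, a rough sketch of the usage such a PySpark Imputer would enable, 
with parameter names assumed from the Scala Imputer added under SPARK-13568 and 
an active SparkSession {{spark}}; this is illustrative, not necessarily the 
final merged API:
{code}
from pyspark.ml.feature import Imputer   # available once the PySpark impl lands

df = spark.createDataFrame(
    [(1.0, float("nan")), (2.0, 3.0), (float("nan"), 4.0)],
    ["a", "b"])

imputer = Imputer(strategy="mean",                 # or "median"
                  inputCols=["a", "b"],
                  outputCols=["a_out", "b_out"])

model = imputer.fit(df)          # computes a surrogate value per input column
model.transform(df).show()       # NaN values replaced by the surrogates
{code}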

> PySpark impl for ml.feature.Imputer
> ---
>
> Key: SPARK-15040
> URL: https://issues.apache.org/jira/browse/SPARK-15040
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> PySpark impl for ml.feature.Imputer.
> This need to wait until PR for SPARK-13568 gets merged.
> https://github.com/apache/spark/pull/11601



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19977) Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from 5 ms to 30 seconds in Spark Streaming application

2017-03-16 Thread Ray Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928196#comment-15928196
 ] 

Ray Qiu edited comment on SPARK-19977 at 3/16/17 2:45 PM:
--

One thing to add is that the same application will not have this issue when 
there is only one stream, even when this one stream has a much higher load.  
The issue seems to have something to do with multiple streams.


was (Author: rayqiu):
One thing to add is that the same application will not have this issue when 
there is only one stream, even this one stream has a much high load.  The issue 
seems to have something to do with multiple streams.

> Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from 
> 5 ms to 30 seconds in Spark Streaming application
> --
>
> Key: SPARK-19977
> URL: https://issues.apache.org/jira/browse/SPARK-19977
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Ray Qiu
>
> Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from 
> 5 ms to 30+ seconds in a Spark Streaming application, where multiple Kafka 
> direct streams are processed.  These kafka streams are processed separately 
> (not combined via union).  
> It causes the task processing time to increase greatly and eventually stops 
> working.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19977) Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from 5 ms to 30 seconds in Spark Streaming application

2017-03-16 Thread Ray Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928196#comment-15928196
 ] 

Ray Qiu commented on SPARK-19977:
-

One thing to add is that the same application will not have this issue when 
there is only one stream, even when this one stream has a much higher load.  
The issue seems to have something to do with multiple streams.

> Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from 
> 5 ms to 30 seconds in Spark Streaming application
> --
>
> Key: SPARK-19977
> URL: https://issues.apache.org/jira/browse/SPARK-19977
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Ray Qiu
>
> Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from 
> 5 ms to 30+ seconds in a Spark Streaming application, where multiple Kafka 
> direct streams are processed.  These kafka streams are processed separately 
> (not combined via union).  
> It causes the task processing time to increase greatly and eventually stops 
> working.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19899) FPGrowth input column naming

2017-03-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928193#comment-15928193
 ] 

Nick Pentreath commented on SPARK-19899:


+1 on {{itemsCol}} - feel free to send a PR :)

> FPGrowth input column naming
> 
>
> Key: SPARK-19899
> URL: https://issues.apache.org/jira/browse/SPARK-19899
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>
> Current implementation extends {{HasFeaturesCol}}. Personally I find it 
> rather unfortunate. Up to this moment we used consistent conventions - if we 
> mix-in  {{HasFeaturesCol}} the {{featuresCol}} should be {{VectorUDT}}. 
> Using the same {{Param}} for an {{array}} (and possibly for 
> {{array}} once {{PrefixSpan}} is ported to {{ml}}) will be 
> confusing for the users.
> I would like to suggest adding a new {{trait}} (let's say 
> {{HasTransactionsCol}}) to clearly indicate that the input type differs from 
> the other {{Estimators}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19979) [MLLIB] Multiple Estimators/Pipelines In CrossValidator

2017-03-16 Thread David Leifker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928195#comment-15928195
 ] 

David Leifker commented on SPARK-19979:
---

My apologies, I got a little ahead of this with a proposed PR here 
https://github.com/apache/spark/pull/17306

> [MLLIB] Multiple Estimators/Pipelines In CrossValidator
> ---
>
> Key: SPARK-19979
> URL: https://issues.apache.org/jira/browse/SPARK-19979
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: David Leifker
>
> Update CrossValidator and TrainValidationSplit to be able to accept multiple 
> pipelines and grid parameters for testing different algorithms and/or being 
> able to better control tuning combinations. Maintains backwards compatible 
> API and reads legacy serialized objects.
> The same could be done using an external iterative approach: build different 
> pipelines, feed each into its own CrossValidator, take the best model from 
> each of those CrossValidators, and finally pick the best among those. This is 
> the initial approach I explored. It resulted in a lot of 
> boilerplate code that felt like it shouldn't need to exist if the API simply 
> allowed for arrays of estimators and their parameters.
> A couple of advantages of this implementation come from keeping the 
> functional interface of the CrossValidator.
> 1. The caching of the folds is better utilized. An external iterative 
> approach creates a new set of k folds for each CrossValidator fit and the 
> folds are discarded after each CrossValidator run. In this implementation a 
> single set of k folds is created and cached for all of the pipelines.
> 2. A potential advantage of this implementation is future 
> parallelization of the pipelines within the CrossValidator. It is of course 
> possible to handle the parallelization outside of the CrossValidator here 
> too; however, I believe there is already work in progress to parallelize the 
> grid parameters, and that could be extended to multiple pipelines.
> Both of these behind-the-scenes optimizations are possible because the 
> CrossValidator is given the data and the complete set of 
> pipelines/estimators to evaluate up front, allowing the implementation to be 
> abstracted away.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19979) [MLLIB] Multiple Estimators/Pipelines In CrossValidator

2017-03-16 Thread David Leifker (JIRA)
David Leifker created SPARK-19979:
-

 Summary: [MLLIB] Multiple Estimators/Pipelines In CrossValidator
 Key: SPARK-19979
 URL: https://issues.apache.org/jira/browse/SPARK-19979
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.1.0
Reporter: David Leifker


Update CrossValidator and TrainValidationSplit to be able to accept multiple 
pipelines and grid parameters for testing different algorithms and/or being 
able to better control tuning combinations. Maintains backwards compatible API 
and reads legacy serialized objects.

The same could be done using an external iterative approach: build different 
pipelines, feed each into its own CrossValidator, take the best model 
from each of those CrossValidators, and finally pick the best among those. 
This is the initial approach I explored. It resulted in a lot of boilerplate 
code that felt like it shouldn't need to exist if the API simply allowed for 
arrays of estimators and their parameters.

A couple of advantages of this implementation come from keeping the 
functional interface of the CrossValidator.

1. The caching of the folds is better utilized. An external iterative approach 
creates a new set of k folds for each CrossValidator fit and the folds are 
discarded after each CrossValidator run. In this implementation a single set of 
k folds is created and cached for all of the pipelines.

2. A potential advantage of this implementation is future 
parallelization of the pipelines within the CrossValidator. It is of course 
possible to handle the parallelization outside of the CrossValidator here too; 
however, I believe there is already work in progress to parallelize the grid 
parameters, and that could be extended to multiple pipelines.

Both of these behind-the-scenes optimizations are possible because the 
CrossValidator is given the data and the complete set of pipelines/estimators 
to evaluate up front, allowing the implementation to be abstracted away.
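
For context, a rough sketch of the external iterative approach described above, 
using only the existing API; the candidate pipelines, their parameter grids, the 
evaluator and the training data ({{pipelineA}}, {{pipelineB}}, {{gridA}}, {{gridB}}, 
{{evaluator}}, {{training}}) are assumed to be defined elsewhere:

{code:title=ExternalLoopSketch.scala|borderStyle=solid}
// Rough sketch of the external iterative approach: one CrossValidator per
// candidate pipeline, then pick the best model manually. Illustrative only.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel}

val candidates: Seq[(Pipeline, Array[ParamMap])] =
  Seq((pipelineA, gridA), (pipelineB, gridB))

val fitted: Seq[CrossValidatorModel] = candidates.map { case (pipeline, grid) =>
  new CrossValidator()
    .setEstimator(pipeline)
    .setEstimatorParamMaps(grid)
    .setEvaluator(evaluator)
    .setNumFolds(3)
    .fit(training)   // each .fit builds its own set of k folds, discarded afterwards
}

// Pick the overall best by the evaluator's metric
// (a larger-is-better metric is assumed here; see Evaluator.isLargerBetter).
val best = fitted.maxBy(_.avgMetrics.max)
{code}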



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12261) pyspark crash for large dataset

2017-03-16 Thread Shea Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928179#comment-15928179
 ] 

Shea Parkes edited comment on SPARK-12261 at 3/16/17 2:38 PM:
--

I simply added the following to the end:

{code}
for _ in iterator:
    pass
{code}

This will run through the rest of the iterator (until StopIteration is raised, 
as normal).

Depending how you're making pyspark importable, you might need to make this 
change inside a zipped copy of pyspark as well (e.g. in the binary 
distributions downloadable from Spark's home page).


was (Author: shea.parkes):
I simply added the following to the end:

{code}
for _ in iterator:
    pass
{code}

This will run through the rest of the iterator (until StopIteration is raised, 
as normal).

Depending how you're making pyspark importable, you might need to make this 
change inside a zipped copy of pyspark as well (e.g. in the binary 
distributions available).

> pyspark crash for large dataset
> ---
>
> Key: SPARK-12261
> URL: https://issues.apache.org/jira/browse/SPARK-12261
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
> Environment: windows
>Reporter: zihao
>
> I tried to import a local text file (over 100 MB) via textFile in PySpark. When 
> I ran data.take(), it failed and gave error messages including:
> 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; 
> aborting job
> Traceback (most recent call last):
>   File "E:/spark_python/test3.py", line 9, in 
> lines.take(5)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, 
> in take
> res = self.context.runJob(self, takeUpToNumLeft, p)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 
> 916, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
>   File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in 
> __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 
> 36, in deco
> return f(*a, **kw)
>   File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in 
> get_return_value
> format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.net.SocketException: Connection reset by peer: 
> socket write error
> Then I ran the same code for a small text file, and this time .take() worked fine.
> How can I solve this problem?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12261) pyspark crash for large dataset

2017-03-16 Thread Shea Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928179#comment-15928179
 ] 

Shea Parkes edited comment on SPARK-12261 at 3/16/17 2:38 PM:
--

I simply added the following to the end:

{code}
for _ in iterator:
    pass
{code}

This will run through the rest of the iterator (until StopIteration is raised, 
as normal).

Depending how you're making pyspark importable, you might need to make this 
change inside a zipped copy of pyspark as well (e.g. in the binary 
distributions available).


was (Author: shea.parkes):
I simply added the following to the end:

for _ in iterator:
    pass

This will run through the rest of the iterator (until StopIteration is raised, 
as normal).

Depending how you're making pyspark importable, you might need to make this 
change inside a zipped copy of pyspark as well (e.g. in the binary 
distributions available).

> pyspark crash for large dataset
> ---
>
> Key: SPARK-12261
> URL: https://issues.apache.org/jira/browse/SPARK-12261
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
> Environment: windows
>Reporter: zihao
>
> I tried to import a local text file (over 100 MB) via textFile in PySpark. When 
> I ran data.take(), it failed and gave error messages including:
> 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; 
> aborting job
> Traceback (most recent call last):
>   File "E:/spark_python/test3.py", line 9, in 
> lines.take(5)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, 
> in take
> res = self.context.runJob(self, takeUpToNumLeft, p)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 
> 916, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
>   File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in 
> __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 
> 36, in deco
> return f(*a, **kw)
>   File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in 
> get_return_value
> format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.net.SocketException: Connection reset by peer: 
> socket write error
> Then I ran the same code for a small text file, and this time .take() worked fine.
> How can I solve this problem?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12261) pyspark crash for large dataset

2017-03-16 Thread Shea Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928179#comment-15928179
 ] 

Shea Parkes commented on SPARK-12261:
-

I simply added the following to the end:

for _ in iterator:
    pass

This will run through the rest of the iterator (until StopIteration is raised, 
as normal).

Depending how you're making pyspark importable, you might need to make this 
change inside a zipped copy of pyspark as well (e.g. in the binary 
distributions available).

> pyspark crash for large dataset
> ---
>
> Key: SPARK-12261
> URL: https://issues.apache.org/jira/browse/SPARK-12261
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
> Environment: windows
>Reporter: zihao
>
> I tried to import a local text file (over 100 MB) via textFile in PySpark. When 
> I ran data.take(), it failed and gave error messages including:
> 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; 
> aborting job
> Traceback (most recent call last):
>   File "E:/spark_python/test3.py", line 9, in 
> lines.take(5)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, 
> in take
> res = self.context.runJob(self, takeUpToNumLeft, p)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 
> 916, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
>   File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in 
> __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 
> 36, in deco
> return f(*a, **kw)
>   File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in 
> get_return_value
> format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.net.SocketException: Connection reset by peer: 
> socket write error
> Then I ran the same code for a small text file, and this time .take() worked fine.
> How can I solve this problem?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10764) Add optional caching to Pipelines

2017-03-16 Thread Sachin Tyagi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928169#comment-15928169
 ] 

Sachin Tyagi commented on SPARK-10764:
--

Hi, I want to take a stab at it. Here's how I am trying to approach it.

A PipelineStage can be marked to persist its output DataFrame by calling 
persist(storageLevel, columnExprs) on it. This results in two cases:
* For Transformers -- their output DF should be marked to persist.
* For Estimators -- the output of their models should be marked to persist.

For example,
{code:title=Example.scala|borderStyle=solid}
val tokenizer = ...
// CountVectorizer estimator's Model should persist its output DF (and only 
those columns passed in args to persist) so that the LDA iterations can run on 
the smaller persisted dataframe.
val countVectorizer = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(1000)
  .persist(StorageLevel.MEMORY_AND_DISK, "doc_id", "features")
val lda = ...
val pipelineModel = new Pipeline()
  .setStages(Array(tokenizer, countVectorizer, lda))
  .fit(inputDF)  // fit on the input DataFrame (here called inputDF) to get a PipelineModel for the transforms below
{code}

Also, there should be a way to use the fitted pipeline to transform an already 
persisted dataframe if that dataframe was persisted as part of some stage 
during fit(). Otherwise we end up doing unnecessary work in some cases (tokenizing 
and count-vectorizing the input dataframe again just to get topic distributions, in 
the above example). 

Instead in such a case, only the necessary stages should be invoked to 
transform.

{code:title=Continue.scala|borderStyle=solid}
// The pipeline model should be able to identify whether the passed DF was 
persisted as part of some stage and then run only needed stages. In this case, 
the model should run only the LDA stage.
pipelineModel.transform(countVectorizer.getCacheDF())

// This should run all stages
pipelineModel.transform(unpersistedDF)
{code}

In my mind, this can be achieved by modifying the PipelineStage, Pipeline and 
PipelineModel classes, specifically their transform and transformSchema 
methods, and by creating the appropriate persist() method(s) on 
PipelineStage.
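
To make the idea concrete, here is a minimal sketch of what such a persist(...) 
hook on a stage might look like; this is not existing Spark API, and all names 
below are assumptions for illustration only:

{code:title=PersistHookSketch.scala|borderStyle=solid}
// Minimal sketch of the proposed persist(...) hook; not existing Spark API.
// Trait, field and method names are illustrative assumptions.
import org.apache.spark.ml.PipelineStage
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

trait CachedOutput { self: PipelineStage =>
  private var cacheLevel: Option[StorageLevel] = None
  private var cacheCols: Seq[String] = Nil

  // Mark this stage so that its output (optionally projected to cacheCols) is persisted.
  def persist(level: StorageLevel, cols: String*): this.type = {
    cacheLevel = Some(level)
    cacheCols = cols
    this
  }

  // Would be called by Pipeline/PipelineModel after this stage produces its output.
  protected def maybeCache(df: DataFrame): DataFrame = cacheLevel match {
    case Some(level) =>
      val projected =
        if (cacheCols.nonEmpty) df.select(cacheCols.head, cacheCols.tail: _*) else df
      projected.persist(level)
    case None => df
  }
}
{code}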

Please let me know your comments on this approach, specifically if you see any 
issues or things that need to be taken care of. I can submit a PR soon so we can 
see how it looks.

> Add optional caching to Pipelines
> -
>
> Key: SPARK-10764
> URL: https://issues.apache.org/jira/browse/SPARK-10764
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> We need to explore how to cache DataFrames during the execution of Pipelines. 
>  It's a hard problem in general to handle automatically or manually, so we 
> should start with some design discussions about:
> * How to control it manually
> * Whether & how to handle it automatically
> * API changes needed for each



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19977) Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from 5 ms to 30 seconds in Spark Streaming application

2017-03-16 Thread Ray Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928164#comment-15928164
 ] 

Ray Qiu commented on SPARK-19977:
-

Not really.  Many of the batches are empty RDDs, and the scheduler delay is still 
in the range of seconds.  This only happens after a few hours of running the 
application.  Everything works initially.

> Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from 
> 5 ms to 30 seconds in Spark Streaming application
> --
>
> Key: SPARK-19977
> URL: https://issues.apache.org/jira/browse/SPARK-19977
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Ray Qiu
>
> Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from 
> 5 ms to 30+ seconds in a Spark Streaming application, where multiple Kafka 
> direct streams are processed.  These Kafka streams are processed separately 
> (not combined via union).  
> This causes the task processing time to increase greatly, until eventually the 
> application stops working.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19977) Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from 5 ms to 30 seconds in Spark Streaming application

2017-03-16 Thread Ray Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928164#comment-15928164
 ] 

Ray Qiu edited comment on SPARK-19977 at 3/16/17 2:28 PM:
--

Not really.  Many of the batches are empty RDDs, and the scheduler delay is 
still in the range of seconds.  This only happens after a few hours of running 
the application.  Everything works initially.


was (Author: rayqiu):
Not really.  Many of the batches are empty RDDs, and the scheduler delay still 
in the range of seconds.  This only happen after a few hours of running the 
application.  Everything works initially.

> Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from 
> 5 ms to 30 seconds in Spark Streaming application
> --
>
> Key: SPARK-19977
> URL: https://issues.apache.org/jira/browse/SPARK-19977
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Ray Qiu
>
> Scheduler Delay (in UI Advanced Metrics) for a task gradually increases from 
> 5 ms to 30+ seconds in a Spark Streaming application, where multiple Kafka 
> direct streams are processed.  These Kafka streams are processed separately 
> (not combined via union).  
> This causes the task processing time to increase greatly, until eventually the 
> application stops working.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19946) DebugFilesystem.assertNoOpenStreams should report the open streams to help debugging

2017-03-16 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-19946.
---
   Resolution: Fixed
 Assignee: Bogdan Raducanu
Fix Version/s: 2.2.0

> DebugFilesystem.assertNoOpenStreams should report the open streams to help 
> debugging
> 
>
> Key: SPARK-19946
> URL: https://issues.apache.org/jira/browse/SPARK-19946
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Bogdan Raducanu
>Assignee: Bogdan Raducanu
>Priority: Minor
> Fix For: 2.2.0
>
>
> In DebugFilesystem.assertNoOpenStreams, if there are open streams, an exception 
> is thrown showing the number of open streams. This doesn't help much in debugging 
> where the open streams were leaked.
> The exception should also report where the stream was leaked. This can be 
> done through a cause exception.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19962) add DictVectorizor for DataFrame

2017-03-16 Thread yu peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928123#comment-15928123
 ] 

yu peng commented on SPARK-19962:
-

Yeah, exactly. I would love to use FeatureHasher when I have a lot of features 
:) and DictVectorizer is nice because it keeps all the mappings for me, so I can 
inspect my classifier/regressor weights with meaningful explanations :)

I would like to contribute a PR if you think it's worth it.

> add DictVectorizor for DataFrame
> 
>
> Key: SPARK-19962
> URL: https://issues.apache.org/jira/browse/SPARK-19962
> Project: Spark
>  Issue Type: Wish
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: yu peng
>  Labels: features
>
> It would be really useful to have something like 
> sklearn.feature_extraction.DictVectorizer.
> Since our features live in a JSON/DataFrame-like format and 
> classifiers/regressors only take vector input, there is a gap between them.
> something like 
> ```
> df = sqlCtx.createDataFrame([Row(age=1, gender='male', country='cn', 
> hobbies=['sing', 'dance']),Row(age=3, gender='female', country='us',  
> hobbies=['sing']), ])
> df.show()
> |age|gender|country|hobbies|
> |1|male|cn|[sing, dance]|
> |3|female|us|[sing]|
> import DictVectorizor
> vec = DictVectorizor()
> matrix = vec.fit_transform(df)
> matrix.show()
> |features|
> |[1, 0, 1, 0, 1, 1, 1]|
> |[3, 1, 0, 1, 0, 1, 1]|
> vec.show()
> |feature_name| feature_dimension|
> |age|0|
> |gender=female|1|
> |gender=male|2|
> |country=us|3|
> |country=cn|4|
> |hobbies=sing|5|
> |hobbies=dance|6|
> ```
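
As a rough illustration of the gap, part of this can be approximated today from 
existing ML transformers; the sketch below is not the proposed DictVectorizer, it 
only mirrors the example columns above and assumes {{df}} is that example DataFrame:

{code:title=ExistingTransformersSketch.scala|borderStyle=solid}
// Sketch only: categorical one-hot + array column + numeric column -> one vector.
// Unlike a DictVectorizer, this does not keep a single flat
// feature-name -> dimension mapping.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{CountVectorizer, OneHotEncoder, StringIndexer, VectorAssembler}

val genderIdx  = new StringIndexer().setInputCol("gender").setOutputCol("genderIdx")
val genderVec  = new OneHotEncoder().setInputCol("genderIdx").setOutputCol("genderVec")
val countryIdx = new StringIndexer().setInputCol("country").setOutputCol("countryIdx")
val countryVec = new OneHotEncoder().setInputCol("countryIdx").setOutputCol("countryVec")
val hobbiesVec = new CountVectorizer().setInputCol("hobbies").setOutputCol("hobbiesVec")

val assembler = new VectorAssembler()
  .setInputCols(Array("age", "genderVec", "countryVec", "hobbiesVec"))
  .setOutputCol("features")

val pipeline = new Pipeline()
  .setStages(Array(genderIdx, genderVec, countryIdx, countryVec, hobbiesVec, assembler))
val features = pipeline.fit(df).transform(df)   // df: the example DataFrame above
{code}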



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19932) Disallow a case that might cause OOM for steaming deduplication

2017-03-16 Thread Liwei Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liwei Lin updated SPARK-19932:
--
Summary: Disallow a case that might cause OOM for steaming deduplication  
(was: Disallow a case that might case OOM for steaming deduplication)

> Disallow a case that might cause OOM for steaming deduplication
> ---
>
> Key: SPARK-19932
> URL: https://issues.apache.org/jira/browse/SPARK-19932
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Liwei Lin
>
> {code}
> spark
>.readStream // schema: (word, eventTime), like ("a", 10), 
> ("a", 11), ("b", 12) ...
>...
>.withWatermark("eventTime", "10 seconds")
>.dropDuplicates("word") // note: "eventTime" is not part of the key 
> columns
>...
> {code}
> As shown above, right now if a watermark is specified for a streaming 
> dropDuplicates query but the watermark column is not among the key columns, we still 
> get the correct answer, but the state just keeps growing and will never get 
> cleaned up.
> The reason is, the watermark attribute is not part of the key of the state 
> store in this case. We're not saving event time information in the state 
> store.
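
For reference, a minimal sketch (mirroring the snippet above) of the usage that 
keeps the state bounded, by making the event-time column part of the 
deduplication keys:

{code}
spark
   .readStream                            // schema: (word, eventTime), as above
   ...
   .withWatermark("eventTime", "10 seconds")
   .dropDuplicates("word", "eventTime")   // "eventTime" is now part of the key columns,
                                          // so the watermark can clean up old state
   ...
{code}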



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6678) select count(DISTINCT C_UID) from parquetdir may be can optimize

2017-03-16 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-6678.
-
Resolution: Not A Problem

I am resolving this because this code path has changed substantially. I would still 
like to help if you face this issue.

Please reopen this if you hit it again, and let's verify it together.

> select count(DISTINCT C_UID) from parquetdir may be can optimize
> 
>
> Key: SPARK-6678
> URL: https://issues.apache.org/jira/browse/SPARK-6678
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Littlestar
>Priority: Minor
>
> 2.2 TB of Parquet files (5000 files total, 100 billion records, 2 billion unique 
> C_UID).
> I ran the following SQL; maybe RDD.collect is very slow: 
> select count(DISTINCT C_UID) from parquetdir
> select count(DISTINCT C_UID) from parquetdir
> collect at SparkPlan.scala:83 +details
> org.apache.spark.rdd.RDD.collect(RDD.scala:813)
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:83)
> org.apache.spark.sql.DataFrame.collect(DataFrame.scala:815)
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:178)
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> java.lang.reflect.Method.invoke(Method.java:606)
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
> java.security.AccessController.doPrivileged(Native Method)
> javax.security.auth.Subject.doAs(Subject.java:415)
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
> com.sun.proxy.$Proxy23.executeStatementAsync(Unknown Source)
> org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19932) Disallow a case that might case OOM for steaming deduplication

2017-03-16 Thread Liwei Lin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liwei Lin updated SPARK-19932:
--
Summary: Disallow a case that might case OOM for steaming deduplication  
(was: Also save event time into StateStore for certain cases)

> Disallow a case that might case OOM for steaming deduplication
> --
>
> Key: SPARK-19932
> URL: https://issues.apache.org/jira/browse/SPARK-19932
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Liwei Lin
>
> {code}
> spark
>.readStream // schema: (word, eventTime), like ("a", 10), 
> ("a", 11), ("b", 12) ...
>...
>.withWatermark("eventTime", "10 seconds")
>.dropDuplicates("word") // note: "eventTime" is not part of the key 
> columns
>...
> {code}
> As shown above, right now if a watermark is specified for a streaming 
> dropDuplicates query but the watermark column is not among the key columns, we still 
> get the correct answer, but the state just keeps growing and will never get 
> cleaned up.
> The reason is, the watermark attribute is not part of the key of the state 
> store in this case. We're not saving event time information in the state 
> store.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18579) spark-csv strips whitespace (pyspark)

2017-03-16 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928065#comment-15928065
 ] 

Hyukjin Kwon commented on SPARK-18579:
--

I submitted a PR for this, https://github.com/apache/spark/pull/17310, but it 
seems the link is not being added automatically for now.

> spark-csv strips whitespace (pyspark) 
> --
>
> Key: SPARK-18579
> URL: https://issues.apache.org/jira/browse/SPARK-18579
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.2
>Reporter: Adrian Bridgett
>Priority: Minor
>
> ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace are supported on the CSV 
> reader (and default to false).
> However, these options are not supported on the CSV writer, so the library 
> defaults take effect, which strips the whitespace.
> I think it would make the most sense if the writer semantics matched the 
> reader (and did not alter the data); however, this would be a change in 
> behaviour.  In any case it'd be great to have the _option_ to strip or not.
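
For reference, a minimal sketch of the reader-side options mentioned above; the 
paths are hypothetical, and (as reported) there is no writer-side equivalent in 
this version:

{code}
// Reader side: whitespace handling is configurable (both options default to false).
val df = spark.read
  .option("header", "true")
  .option("ignoreLeadingWhiteSpace", "false")
  .option("ignoreTrailingWhiteSpace", "false")
  .csv("/path/to/input.csv")          // hypothetical path

// Writer side (as reported): no equivalent options, so the library defaults apply.
df.write.csv("/path/to/output.csv")   // hypothetical path
{code}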



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


