[jira] [Assigned] (SPARK-18788) Add getNumPartitions() to SparkR

2017-01-20 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-18788:


Assignee: Felix Cheung

> Add getNumPartitions() to SparkR
> 
>
> Key: SPARK-18788
> URL: https://issues.apache.org/jira/browse/SPARK-18788
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Raela Wang
>Assignee: Felix Cheung
>Priority: Minor
>
> It would be really convenient to have getNumPartitions() in SparkR, as it was in
> the RDD API.
> rdd <- SparkR:::toRDD(df)
> SparkR:::getNumPartitions(rdd)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18788) Add getNumPartitions() to SparkR

2017-01-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18788:


Assignee: (was: Apache Spark)

> Add getNumPartitions() to SparkR
> 
>
> Key: SPARK-18788
> URL: https://issues.apache.org/jira/browse/SPARK-18788
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Raela Wang
>Priority: Minor
>
> It would be really convenient to have getNumPartitions() in SparkR, as it was in
> the RDD API.
> rdd <- SparkR:::toRDD(df)
> SparkR:::getNumPartitions(rdd)






[jira] [Assigned] (SPARK-18788) Add getNumPartitions() to SparkR

2017-01-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18788:


Assignee: Apache Spark

> Add getNumPartitions() to SparkR
> 
>
> Key: SPARK-18788
> URL: https://issues.apache.org/jira/browse/SPARK-18788
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Raela Wang
>Assignee: Apache Spark
>Priority: Minor
>
> It would be really convenient to have getNumPartitions() in SparkR, as it was in
> the RDD API.
> rdd <- SparkR:::toRDD(df)
> SparkR:::getNumPartitions(rdd)






[jira] [Commented] (SPARK-18788) Add getNumPartitions() to SparkR

2017-01-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832873#comment-15832873
 ] 

Apache Spark commented on SPARK-18788:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/16668

> Add getNumPartitions() to SparkR
> 
>
> Key: SPARK-18788
> URL: https://issues.apache.org/jira/browse/SPARK-18788
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Raela Wang
>Priority: Minor
>
> It would be really convenient to have getNumPartitions() in SparkR, as it was in
> the RDD API.
> rdd <- SparkR:::toRDD(df)
> SparkR:::getNumPartitions(rdd)






[jira] [Commented] (SPARK-19288) Failure (at test_sparkSQL.R#1300): date functions on a DataFrame in R/run-tests.sh

2017-01-20 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832837#comment-15832837
 ] 

Felix Cheung commented on SPARK-19288:
--

Hmm, that's odd. What system and R version are you on?
I'm wondering if this is related to the time zone.

> Failure (at test_sparkSQL.R#1300): date functions on a DataFrame in 
> R/run-tests.sh
> --
>
> Key: SPARK-19288
> URL: https://issues.apache.org/jira/browse/SPARK-19288
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL, Tests
>Affects Versions: 2.0.1
> Environment: Ubuntu 16.04, X86_64, ppc64le
>Reporter: Nirman Narang
>
> Full log here.
> {code:title=R/run-tests.sh|borderStyle=solid}
> Loading required package: methods
> Attaching package: 'SparkR'
> The following object is masked from 'package:testthat':
> describe
> The following objects are masked from 'package:stats':
> cov, filter, lag, na.omit, predict, sd, var, window
> The following objects are masked from 'package:base':
> as.data.frame, colnames, colnames<-, drop, intersect, rank, rbind,
> sample, subset, summary, transform, union
> functions on binary files : Spark package found in SPARK_HOME: 
> /var/lib/jenkins/workspace/Sparkv2.0.1/spark
> 
> binary functions : ...
> broadcast variables : ..
> functions in client.R : .
> test functions in sparkR.R : .Re-using existing Spark Context. Call 
> sparkR.session.stop() or restart R to create a new Spark Context
> ...
> include R packages : Spark package found in SPARK_HOME: 
> /var/lib/jenkins/workspace/Sparkv2.0.1/spark
> JVM API : ..
> MLlib functions : Spark package found in SPARK_HOME: 
> /var/lib/jenkins/workspace/Sparkv2.0.1/spark
> .SLF4J: Failed to load class 
> "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> .Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet dictionary page size to 1048576
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Dictionary is on
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Validation is off
> Jan 19, 2017 5:40:53 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Writer version is: PARQUET_1_0
> Jan 19, 2017 5:40:54 PM INFO: 
> org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem 
> columnStore to file. allocated memory: 65,622
> Jan 19, 2017 5:40:54 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 70B for [label] 
> BINARY: 1 values, 21B raw, 23B comp, 1 pages, encodings: [PLAIN, BIT_PACKED, 
> RLE]
> Jan 19, 2017 5:40:54 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 87B for [terms, 
> list, element, list, element] BINARY: 2 values, 42B raw, 43B comp, 1 pages, 
> encodings: [PLAIN, RLE]
> Jan 19, 2017 5:40:54 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 30B for 
> [hasIntercept] BOOLEAN: 1 values, 1B raw, 3B comp, 1 pages, encodings: 
> [PLAIN, BIT_PACKED]
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet dictionary page size to 1048576
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Dictionary is on
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Validation is off
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Writer version is: PARQUET_1_0
> Jan 19, 2017 5:40:55 PM INFO: 
> org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem 
> columnStore to file. allocated memory: 49
> Jan 19, 2017 5:40:55 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 90B for [labels, 
> list, element] BINARY: 3 values, 50B raw, 50B comp, 1 pages, encodings: 
> [PLAIN, RLE]
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> Jan 19, 2017 5:40:55 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> 

[jira] [Resolved] (SPARK-19305) partitioned table should always put partition columns at the end of table schema

2017-01-20 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19305.
-
Resolution: Fixed

Issue resolved by pull request 16655
[https://github.com/apache/spark/pull/16655]

> partitioned table should always put partition columns at the end of table 
> schema
> 
>
> Key: SPARK-19305
> URL: https://issues.apache.org/jira/browse/SPARK-19305
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.0
>
>
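The quoted description is empty, but the behavior named in the title can be illustrated. A minimal Scala sketch, assuming a local SparkSession and illustrative table/column names (not the patch itself):

{code}
import org.apache.spark.sql.SparkSession

// Sketch only: shows where partition columns should land in the table schema.
val spark = SparkSession.builder()
  .appName("partition-column-order")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq((2017, "a", 1.0), (2016, "b", 2.0)).toDF("year", "name", "score")
df.write.partitionBy("year").saveAsTable("events")

// With the fix, the partition column ("year") is expected to appear at the
// end of the table schema, regardless of its position in the original DataFrame.
spark.table("events").printSchema()
{code}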







[jira] [Resolved] (SPARK-14536) NPE in JDBCRDD when array column contains nulls (postgresql)

2017-01-20 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-14536.
-
   Resolution: Fixed
 Assignee: Suresh Thalamati
Fix Version/s: 2.2.0

> NPE in JDBCRDD when array column contains nulls (postgresql)
> 
>
> Key: SPARK-14536
> URL: https://issues.apache.org/jira/browse/SPARK-14536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Jeremy Smith
>Assignee: Suresh Thalamati
>  Labels: NullPointerException
> Fix For: 2.2.0
>
>
> At 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L453
>  it is assumed that the JDBC driver will definitely return a non-null `Array` 
> object from the call to `getArray`, and that in the event of a null array it 
> will return a non-null `Array` object with a null underlying array. But as 
> you can see here 
> https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/jdbc/PgResultSet.java#L387
>  that isn't the case, at least for PostgreSQL.  This causes a 
> `NullPointerException` whenever an array column contains null values. It 
> seems like the PostgreSQL JDBC driver is probably doing the wrong thing, but 
> even so there should be a null check in JDBCRDD.  I'm happy to submit a PR if 
> that would be helpful.
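A minimal sketch of the kind of null check suggested above; the helper name and element handling are illustrative, not Spark's actual JDBCRDD code:

{code}
import java.sql.ResultSet

// Guard against drivers (e.g. the PostgreSQL JDBC driver) returning null from
// getArray() when the column value is SQL NULL, instead of a wrapper object.
def readStringArray(rs: ResultSet, pos: Int): Array[String] = {
  val sqlArray = rs.getArray(pos)
  if (sqlArray == null) {
    null
  } else {
    sqlArray.getArray.asInstanceOf[Array[AnyRef]]
      .map(v => if (v == null) null else v.toString)
  }
}
{code}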






[jira] [Resolved] (SPARK-16101) Refactoring CSV data source to be consistent with JSON data source

2017-01-20 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-16101.
-
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 2.2.0

> Refactoring CSV data source to be consistent with JSON data source
> --
>
> Key: SPARK-16101
> URL: https://issues.apache.org/jira/browse/SPARK-16101
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.2.0
>
>
> Currently, the CSV data source has a rather different structure from the JSON
> data source, even though they could be quite similar.
> It would be great if they shared a similar structure so that common
> issues could be resolved together.






[jira] [Created] (SPARK-19321) Support Hive 2.x's metastore

2017-01-20 Thread Yin Huai (JIRA)
Yin Huai created SPARK-19321:


 Summary: Support Hive 2.x's metastore
 Key: SPARK-19321
 URL: https://issues.apache.org/jira/browse/SPARK-19321
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai


It would be good to make Spark work with Hive 2.x metastores.

We need to add the required shim classes in 
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala, 
make IsolatedClientLoader recognize the new metastore versions 
(https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala), 
and finally add tests in 
https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala.
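Once such shims exist, users would opt in through the existing metastore configuration keys. A hedged sketch (the version string below is only an example and is not supported until the shims land):

{code}
import org.apache.spark.sql.SparkSession

// Sketch: point Spark at a Hive 2.x metastore via the existing config keys.
// "2.1.0" is an assumed, illustrative version that would require the new shims.
val spark = SparkSession.builder()
  .appName("hive-2x-metastore")
  .config("spark.sql.hive.metastore.version", "2.1.0")
  .config("spark.sql.hive.metastore.jars", "maven")
  .enableHiveSupport()
  .getOrCreate()
{code}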






[jira] [Resolved] (SPARK-19267) Fix a race condition when stopping StateStore

2017-01-20 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-19267.
---
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.1.1

Issue resolved by pull request 16627
[https://github.com/apache/spark/pull/16627]

> Fix a race condition when stopping StateStore
> -
>
> Key: SPARK-19267
> URL: https://issues.apache.org/jira/browse/SPARK-19267
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.1.1, 3.0.0
>
>







[jira] [Created] (SPARK-19320) Allow guaranteed amount of GPU to be used when launching jobs

2017-01-20 Thread Timothy Chen (JIRA)
Timothy Chen created SPARK-19320:


 Summary: Allow guaranteed amount of GPU to be used when launching 
jobs
 Key: SPARK-19320
 URL: https://issues.apache.org/jira/browse/SPARK-19320
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen


Currently the only configuration for using GPUs with Mesos sets the 
maximum number of GPUs a job will take from an offer, but it doesn't guarantee 
exactly how many the job will get.

We should have a configuration that sets a guaranteed amount.
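For context, a hedged sketch of the configuration surface: the max setting exists today, while the guaranteed amount would need a new key (the key name below is purely hypothetical):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.mesos.gpus.max", "4")          // existing: upper bound taken from a Mesos offer
  .set("spark.mesos.gpus.guaranteed", "2")   // hypothetical key for the proposed guaranteed amount
{code}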






[jira] [Commented] (SPARK-19316) Spark event logs are huge compared to 1.5.2

2017-01-20 Thread Jisoo Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832710#comment-15832710
 ] 

Jisoo Kim commented on SPARK-19316:
---

I suspect this is due to the "SparkListenerTaskEnd" events having too many 
entries under the "Accumulables" field (from TaskMetrics).

> Spark event logs are huge compared to 1.5.2
> ---
>
> Key: SPARK-19316
> URL: https://issues.apache.org/jira/browse/SPARK-19316
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Jisoo Kim
>
> I have a Spark application with many tasks (more than 40k). The event logs
> for such an application used to be around 2 GB when I was using a Spark 1.5.2
> standalone cluster. Now that I am using Spark 2.0 with Mesos, the size of the
> event log for a comparable application has drastically increased from 2 GB to
> 60 GB with a similar number of tasks. This is affecting the Spark History
> Server, since it has trouble reading such a huge event log. I wonder whether
> this increase in event log size is expected in Spark 2.0.






[jira] [Assigned] (SPARK-18750) Spark should be able to control the number of executors and should not throw stack overflow

2017-01-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18750:


Assignee: (was: Apache Spark)

> Spark should be able to control the number of executors and should not throw 
> stack overflow
> --
>
> Key: SPARK-18750
> URL: https://issues.apache.org/jira/browse/SPARK-18750
> Project: Spark
>  Issue Type: Bug
>Reporter: Neerja Khattar
>
> When running SQL queries on large datasets, the job fails with a stack overflow
> warning, and the logs show it requesting lots of executors.
> It looks like there is no limit on the number of executors, not even an
> upper bound based on the available YARN resources.
> {noformat}
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n5.svr.us.jpmchase.net:8041 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n8.svr.us.jpmchase.net:8041 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n2.svr.us.jpmchase.net:8041 
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Driver requested a total number of 
> 32770 executor(s). 
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Will request 24576 executor 
> containers, each with 1 cores and 6758 MB memory including 614 MB overhead 
> 16/11/29 15:49:11 INFO yarn.YarnAllocator: Driver requested a total number of 
> 52902 executor(s). 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n5.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n8.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n2.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Driver requested a total number of 
> 32770 executor(s).
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Will request 24576 executor 
> containers, each with 1 cores and 6758 MB memory including 614 MB overhead
> 16/11/29 15:49:11 INFO yarn.YarnAllocator: Driver requested a total number of 
> 52902 executor(s).
> 16/11/29 15:49:11 WARN yarn.ApplicationMaster: Reporter thread fails 1 
> time(s) in a row.
> java.lang.StackOverflowError
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:57)
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:36)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>   at 
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>   at 
> scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
>   at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
>   at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.TraversableLike$class.$plus$plus(TraversableLike.scala:156)
>   at 
> scala.collection.AbstractTraversable.$plus$plus(Traversable.scala:105)
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:60)
>   at scala.collection.immutable.Map$Map4.updated(Map.scala:172)
>   at scala.collection.immutable.Map$Map4.$plus(Map.scala:173)
>   at scala.collection.immutable.Map$Map4.$plus(Map.scala:158)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> 

[jira] [Assigned] (SPARK-18750) Spark should be able to control the number of executors and should not throw stack overflow

2017-01-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18750:


Assignee: Apache Spark

> Spark should be able to control the number of executors and should not throw 
> stack overflow
> --
>
> Key: SPARK-18750
> URL: https://issues.apache.org/jira/browse/SPARK-18750
> Project: Spark
>  Issue Type: Bug
>Reporter: Neerja Khattar
>Assignee: Apache Spark
>
> When running SQL queries on large datasets, the job fails with a stack overflow
> warning, and the logs show it requesting lots of executors.
> It looks like there is no limit on the number of executors, not even an
> upper bound based on the available YARN resources.
> {noformat}
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n5.svr.us.jpmchase.net:8041 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n8.svr.us.jpmchase.net:8041 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n2.svr.us.jpmchase.net:8041 
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Driver requested a total number of 
> 32770 executor(s). 
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Will request 24576 executor 
> containers, each with 1 cores and 6758 MB memory including 614 MB overhead 
> 16/11/29 15:49:11 INFO yarn.YarnAllocator: Driver requested a total number of 
> 52902 executor(s). 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n5.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n8.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n2.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Driver requested a total number of 
> 32770 executor(s).
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Will request 24576 executor 
> containers, each with 1 cores and 6758 MB memory including 614 MB overhead
> 16/11/29 15:49:11 INFO yarn.YarnAllocator: Driver requested a total number of 
> 52902 executor(s).
> 16/11/29 15:49:11 WARN yarn.ApplicationMaster: Reporter thread fails 1 
> time(s) in a row.
> java.lang.StackOverflowError
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:57)
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:36)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>   at 
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>   at 
> scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
>   at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
>   at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.TraversableLike$class.$plus$plus(TraversableLike.scala:156)
>   at 
> scala.collection.AbstractTraversable.$plus$plus(Traversable.scala:105)
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:60)
>   at scala.collection.immutable.Map$Map4.updated(Map.scala:172)
>   at scala.collection.immutable.Map$Map4.$plus(Map.scala:173)
>   at scala.collection.immutable.Map$Map4.$plus(Map.scala:158)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> 

[jira] [Commented] (SPARK-18750) Spark should be able to control the number of executors and should not throw stack overflow

2017-01-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832703#comment-15832703
 ] 

Apache Spark commented on SPARK-18750:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/16667

> Spark should be able to control the number of executors and should not throw 
> stack overflow
> --
>
> Key: SPARK-18750
> URL: https://issues.apache.org/jira/browse/SPARK-18750
> Project: Spark
>  Issue Type: Bug
>Reporter: Neerja Khattar
>
> When running SQL queries on large datasets, the job fails with a stack overflow
> warning, and the logs show it requesting lots of executors.
> It looks like there is no limit on the number of executors, not even an
> upper bound based on the available YARN resources.
> {noformat}
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n5.svr.us.jpmchase.net:8041 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n8.svr.us.jpmchase.net:8041 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n2.svr.us.jpmchase.net:8041 
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Driver requested a total number of 
> 32770 executor(s). 
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Will request 24576 executor 
> containers, each with 1 cores and 6758 MB memory including 614 MB overhead 
> 16/11/29 15:49:11 INFO yarn.YarnAllocator: Driver requested a total number of 
> 52902 executor(s). 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n5.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n8.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n2.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Driver requested a total number of 
> 32770 executor(s).
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Will request 24576 executor 
> containers, each with 1 cores and 6758 MB memory including 614 MB overhead
> 16/11/29 15:49:11 INFO yarn.YarnAllocator: Driver requested a total number of 
> 52902 executor(s).
> 16/11/29 15:49:11 WARN yarn.ApplicationMaster: Reporter thread fails 1 
> time(s) in a row.
> java.lang.StackOverflowError
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:57)
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:36)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>   at 
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>   at 
> scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
>   at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
>   at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.TraversableLike$class.$plus$plus(TraversableLike.scala:156)
>   at 
> scala.collection.AbstractTraversable.$plus$plus(Traversable.scala:105)
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:60)
>   at scala.collection.immutable.Map$Map4.updated(Map.scala:172)
>   at scala.collection.immutable.Map$Map4.$plus(Map.scala:173)
>   at scala.collection.immutable.Map$Map4.$plus(Map.scala:158)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>  

[jira] [Assigned] (SPARK-19319) SparkR KMeans summary returns an error when the cluster size doesn't equal k

2017-01-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19319:


Assignee: (was: Apache Spark)

> SparkR KMeans summary returns an error when the cluster size doesn't equal k
> 
>
> Key: SPARK-19319
> URL: https://issues.apache.org/jira/browse/SPARK-19319
> Project: Spark
>  Issue Type: Bug
>Reporter: Miao Wang
>
> When KMeans is run with initMode = "random" and certain random seeds, it is
> possible that the actual cluster size doesn't equal the configured `k`.
> In this case, summary(model) returns an error because the number of columns of
> the coefficient matrix doesn't equal k.






[jira] [Commented] (SPARK-19319) SparkR KMeans summary returns an error when the cluster size doesn't equal k

2017-01-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832698#comment-15832698
 ] 

Apache Spark commented on SPARK-19319:
--

User 'wangmiao1981' has created a pull request for this issue:
https://github.com/apache/spark/pull/1

> SparkR KMeans summary returns an error when the cluster size doesn't equal k
> 
>
> Key: SPARK-19319
> URL: https://issues.apache.org/jira/browse/SPARK-19319
> Project: Spark
>  Issue Type: Bug
>Reporter: Miao Wang
>
> When KMeans is run with initMode = "random" and certain random seeds, it is
> possible that the actual cluster size doesn't equal the configured `k`.
> In this case, summary(model) returns an error because the number of columns of
> the coefficient matrix doesn't equal k.






[jira] [Assigned] (SPARK-19319) SparkR KMeans summary returns an error when the cluster size doesn't equal k

2017-01-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19319:


Assignee: Apache Spark

> SparkR KMeans summary returns an error when the cluster size doesn't equal k
> 
>
> Key: SPARK-19319
> URL: https://issues.apache.org/jira/browse/SPARK-19319
> Project: Spark
>  Issue Type: Bug
>Reporter: Miao Wang
>Assignee: Apache Spark
>
> When KMeans is run with initMode = "random" and certain random seeds, it is
> possible that the actual cluster size doesn't equal the configured `k`.
> In this case, summary(model) returns an error because the number of columns of
> the coefficient matrix doesn't equal k.






[jira] [Created] (SPARK-19319) SparkR KMeans summary returns an error when the cluster size doesn't equal k

2017-01-20 Thread Miao Wang (JIRA)
Miao Wang created SPARK-19319:
-

 Summary: SparkR KMeans summary returns an error when the cluster size 
doesn't equal k
 Key: SPARK-19319
 URL: https://issues.apache.org/jira/browse/SPARK-19319
 Project: Spark
  Issue Type: Bug
Reporter: Miao Wang


When KMeans is run with initMode = "random" and certain random seeds, it is possible that the 
actual cluster size doesn't equal the configured `k`.

In this case, summary(model) returns an error because the number of columns of the 
coefficient matrix doesn't equal k.






[jira] [Issue Comment Deleted] (SPARK-16599) java.util.NoSuchElementException: None.get at at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)

2017-01-20 Thread Drew Robb (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Robb updated SPARK-16599:
--
Comment: was deleted

(was: I encountered an identical exception when using a singleton Spark 
session. For me, I was able to resolve the issue by ensuring that every object 
using the singleton Spark session did an `import spark.implicits._`, even if that 
particular import was not necessary for compilation.)
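A small sketch of the singleton-session pattern described in the (now deleted) comment; object and method names are illustrative:

{code}
import org.apache.spark.sql.SparkSession

object Sessions {
  // Single shared session for the whole application.
  lazy val spark: SparkSession = SparkSession.builder()
    .appName("singleton-session")
    .getOrCreate()
}

object SomeJob {
  def run(): Unit = {
    val spark = Sessions.spark
    import spark.implicits._          // the import the commenter found necessary
    val ds = Seq(1, 2, 3).toDS()
    println(ds.count())
  }
}
{code}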

> java.util.NoSuchElementException: None.get  at at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
> --
>
> Key: SPARK-16599
> URL: https://issues.apache.org/jira/browse/SPARK-16599
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: centos 6.7   spark 2.0
>Reporter: binde
>
> run a spark job with spark 2.0, error message
> Job aborted due to stage failure: Task 0 in stage 821.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 821.0 (TID 1480, e103): 
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)






[jira] [Commented] (SPARK-18750) Spark should be able to control the number of executors and should not throw stack overflow

2017-01-20 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832684#comment-15832684
 ] 

Marcelo Vanzin commented on SPARK-18750:


Yay, I can reproduce it with a unit test against 
{{LocalityPreferredContainerPlacementStrategy}}.

> Spark should be able to control the number of executors and should not throw 
> stack overflow
> --
>
> Key: SPARK-18750
> URL: https://issues.apache.org/jira/browse/SPARK-18750
> Project: Spark
>  Issue Type: Bug
>Reporter: Neerja Khattar
>
> When running SQL queries on large datasets, the job fails with a stack overflow
> warning, and the logs show it requesting lots of executors.
> It looks like there is no limit on the number of executors, not even an
> upper bound based on the available YARN resources.
> {noformat}
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n5.svr.us.jpmchase.net:8041 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n8.svr.us.jpmchase.net:8041 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n2.svr.us.jpmchase.net:8041 
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Driver requested a total number of 
> 32770 executor(s). 
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Will request 24576 executor 
> containers, each with 1 cores and 6758 MB memory including 614 MB overhead 
> 16/11/29 15:49:11 INFO yarn.YarnAllocator: Driver requested a total number of 
> 52902 executor(s). 
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n5.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n8.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
> bdtcstr61n2.svr.us.jpmchase.net:8041
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Driver requested a total number of 
> 32770 executor(s).
> 16/11/29 15:47:47 INFO yarn.YarnAllocator: Will request 24576 executor 
> containers, each with 1 cores and 6758 MB memory including 614 MB overhead
> 16/11/29 15:49:11 INFO yarn.YarnAllocator: Driver requested a total number of 
> 52902 executor(s).
> 16/11/29 15:49:11 WARN yarn.ApplicationMaster: Reporter thread fails 1 
> time(s) in a row.
> java.lang.StackOverflowError
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:57)
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:36)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>   at 
> scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
>   at 
> scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
>   at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
>   at 
> scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.TraversableLike$class.$plus$plus(TraversableLike.scala:156)
>   at 
> scala.collection.AbstractTraversable.$plus$plus(Traversable.scala:105)
>   at scala.collection.immutable.HashMap.$plus(HashMap.scala:60)
>   at scala.collection.immutable.Map$Map4.updated(Map.scala:172)
>   at scala.collection.immutable.Map$Map4.$plus(Map.scala:173)
>   at scala.collection.immutable.Map$Map4.$plus(Map.scala:158)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28)
>   at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24)
>   at 
> scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   at 
> scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
>   

[jira] [Commented] (SPARK-19289) UnCache Dataset using Name

2017-01-20 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832663#comment-15832663
 ] 

Xiao Li commented on SPARK-19289:
-

Basically, you are creating a view for that DataFrame, and the view name is the 
identifier. Then you cache it. Is that what you want?
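A short sketch of that view-based approach using the public catalog API, assuming an existing SparkSession named `spark` (names are illustrative):

{code}
val df = spark.range(1, 1000).toDF("value")
df.createOrReplaceTempView("myDataset")

spark.catalog.cacheTable("myDataset")    // cache by name
// ... reuse it anywhere via spark.table("myDataset") ...
spark.catalog.uncacheTable("myDataset")  // uncache by name
{code}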

> UnCache Dataset using Name
> --
>
> Key: SPARK-19289
> URL: https://issues.apache.org/jira/browse/SPARK-19289
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Kaushal Prajapati
>Priority: Minor
>  Labels: features
>
> We can cache and uncache any table using its name in Spark SQL.
> {code}
> df.createTempView("myTable")
> sqlContext.cacheTable("myTable")
> sqlContext.uncacheTable("myTable")
> {code}
> Likewise, it would be very useful to have some kind of unique naming for
> Datasets and an abstraction like the one we have for tables.
> {code}
> scala> val df = sc.range(1,1000).toDF
> df: org.apache.spark.sql.DataFrame = [value: bigint]
> scala> df.setName("MyDataset")
> res0: df.type = MyDataset
> scala> df.cache
> res1: df.type = MyDataset
> sqlContext.getDataSet("MyDataset")
> sqlContext.uncacheDataSet("MyDataset")
> {code}






[jira] [Commented] (SPARK-19318) Docker test case failure: `SPARK-16625: General data types to be mapped to Oracle`

2017-01-20 Thread Suresh Thalamati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832659#comment-15832659
 ] 

Suresh Thalamati commented on SPARK-19318:
--

I am looking into this test failure. 

> Docker test case failure: `SPARK-16625: General data types to be mapped to 
> Oracle`
> --
>
> Key: SPARK-19318
> URL: https://issues.apache.org/jira/browse/SPARK-19318
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> = FINISHED o.a.s.sql.jdbc.OracleIntegrationSuite: 'SPARK-16625: General 
> data types to be mapped to Oracle' =
> - SPARK-16625: General data types to be mapped to Oracle *** FAILED ***
>   types.apply(9).equals("class java.sql.Date") was false 
> (OracleIntegrationSuite.scala:136)






[jira] [Commented] (SPARK-19300) Executor is waiting for lock

2017-01-20 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832658#comment-15832658
 ] 

Shixiong Zhu commented on SPARK-19300:
--

Could you provide the full thread dump? It looks like there is some issue with 
fetching shuffle blocks.

> Executor is waiting for lock
> 
>
> Key: SPARK-19300
> URL: https://issues.apache.org/jira/browse/SPARK-19300
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>Priority: Critical
>
> I can see that all the threads in the executor are waiting for a lock.
> {code}
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:313)
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396)
> org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryRelation.scala:138)
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
> org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
> org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:342)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:293)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:331)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:295)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:331)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:295)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:331)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:295)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:331)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:295)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:331)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:295)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> org.apache.spark.scheduler.Task.run(Task.scala:99)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:330)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> {code}




[jira] [Updated] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"

2017-01-20 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-18589:
--
Fix Version/s: 2.2.0
   2.1.1

> persist() resolves "java.lang.RuntimeException: Invalid PythonUDF 
> (...), requires attributes from more than one child"
> --
>
> Key: SPARK-18589
> URL: https://issues.apache.org/jira/browse/SPARK-18589
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 2.1.1, 2.2.0
>
>
> Smells like another optimizer bug that's similar to SPARK-17100 and 
> SPARK-18254. I'm seeing this on 2.0.2 and on master at commit 
> {{fb07bbe575aabe68422fd3a31865101fb7fa1722}}.
> I don't have a minimal repro for this yet, but the error I'm seeing is:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling o247.count.
> : java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires 
> attributes from more than one child.
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149)
> at scala.collection.immutable.Stream.foreach(Stream.scala:594)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> 

[jira] [Resolved] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"

2017-01-20 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-18589.
---
Resolution: Fixed

> persist() resolves "java.lang.RuntimeException: Invalid PythonUDF 
> (...), requires attributes from more than one child"
> --
>
> Key: SPARK-18589
> URL: https://issues.apache.org/jira/browse/SPARK-18589
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.1.0
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Assignee: Davies Liu
>Priority: Critical
>
> Smells like another optimizer bug that's similar to SPARK-17100 and 
> SPARK-18254. I'm seeing this on 2.0.2 and on master at commit 
> {{fb07bbe575aabe68422fd3a31865101fb7fa1722}}.
> I don't have a minimal repro for this yet, but the error I'm seeing is:
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling o247.count.
> : java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires 
> attributes from more than one child.
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149)
> at scala.collection.immutable.Stream.foreach(Stream.scala:594)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113)
> at 
> org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> 

[jira] [Created] (SPARK-19318) Docker test case failure: `SPARK-16625: General data types to be mapped to Oracle`

2017-01-20 Thread Xiao Li (JIRA)
Xiao Li created SPARK-19318:
---

 Summary: Docker test case failure: `SPARK-16625: General data 
types to be mapped to Oracle`
 Key: SPARK-19318
 URL: https://issues.apache.org/jira/browse/SPARK-19318
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Xiao Li


= FINISHED o.a.s.sql.jdbc.OracleIntegrationSuite: 'SPARK-16625: General 
data types to be mapped to Oracle' =

- SPARK-16625: General data types to be mapped to Oracle *** FAILED ***
  types.apply(9).equals("class java.sql.Date") was false 
(OracleIntegrationSuite.scala:136)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17890) scala.ScalaReflectionException

2017-01-20 Thread Dave DeCaprio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832620#comment-15832620
 ] 

Dave DeCaprio edited comment on SPARK-17890 at 1/20/17 11:46 PM:
-

I'm running into this as well. Naively changing the line mentioned above to use 
that mirror didn't work.

I run into this even when I run locally. However, if I run the application 
directly from the command line (not via spark-submit) it works fine.
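
For what it's worth, a minimal Scala sketch of one way to sidestep the implicit 
Seq\[Foo\] encoder derivation by passing an explicit Kryo encoder. This is an 
untested idea for this particular failure, not a confirmed fix, and it assumes 
the shell's default imports:

{code}
import org.apache.spark.sql.Encoders

case class Foo(value: String)

val df = sc.parallelize(List(1, 2, 3)).toDF

// Pass an explicit Kryo encoder instead of relying on the implicit
// product-encoder derivation that triggers the reflection lookup.
val ds = df.map(_ => Seq.empty[Foo])(Encoders.kryo[Seq[Foo]])
ds.count()
{code}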


was (Author: davedecaprio):
I'm running into this also.  Naively changing the above line to use the above 
mirror didn't work.

> scala.ScalaReflectionException
> --
>
> Key: SPARK-17890
> URL: https://issues.apache.org/jira/browse/SPARK-17890
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: x86_64 GNU/Linux
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
>Reporter: Khalid Reid
>Priority: Minor
>  Labels: newbie
>
> Hello,
> I am seeing an error message in spark-shell when I map a DataFrame to a 
> Seq\[Foo\].  However, things work fine when I use flatMap.  
> {noformat}
> scala> case class Foo(value:String)
> defined class Foo
> scala> val df = sc.parallelize(List(1,2,3)).toDF
> df: org.apache.spark.sql.DataFrame = [value: int]
> scala> df.map{x => Seq.empty[Foo]}
> scala.ScalaReflectionException: object $line14.$read not found.
>   at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:162)
>   at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:22)
>   at $typecreator1$1.apply(:29)
>   at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
>   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
>   at 
> org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
>   at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
>   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
>   at 
> org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
>   ... 48 elided
> scala> df.flatMap{_ => Seq.empty[Foo]} //flatMap works
> res2: org.apache.spark.sql.Dataset[Foo] = [value: string]
> {noformat}
> I am seeing the same error reported 
> [here|https://issues.apache.org/jira/browse/SPARK-8465?jql=text%20~%20%22scala.ScalaReflectionException%22]
>  when I use spark-submit.
> I am new to Spark but I don't expect this to throw an exception.
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19317) UnsupportedOperationException: empty.reduceLeft in LinearSeqOptimized

2017-01-20 Thread Barry Becker (JIRA)
Barry Becker created SPARK-19317:


 Summary: UnsupportedOperationException: empty.reduceLeft in 
LinearSeqOptimized
 Key: SPARK-19317
 URL: https://issues.apache.org/jira/browse/SPARK-19317
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Barry Becker


I wish I had a simpler reproducible case to give, but I got the exception below 
while selecting null values in one of the columns of a DataFrame.
My client code that failed was 
df.filter(filterExp).count()
where the filter expression was something like someColumn.isNull.
There were 412 nulls out of 716,000 total rows for the column being filtered.
It's odd because I have a different, smaller dataset where I did the same thing 
on a column with 100 nulls out of 800 and did not get the error.
The exception seems to indicate that Spark is trying to do reduceLeft on an 
empty list.
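
For concreteness, a minimal Scala sketch of the kind of client code described 
above, with hypothetical paths and column names; the trace below goes through 
InMemoryTableScanExec, which suggests the DataFrame was cached. This is not a 
confirmed reproduction:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("isnull-filter-sketch").getOrCreate()

// Hypothetical input; the real dataset had ~716,000 rows with 412 nulls
// in the filtered column.
val df = spark.read.parquet("/path/to/data").cache()
df.count()  // materialize the cache

val filterExp = col("someColumn").isNull  // placeholder column name
df.filter(filterExp).count()
{code}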

{code}
java.lang.UnsupportedOperationException: empty.reduceLeft
scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:137)
 scala.collection.immutable.List.reduceLeft(List.scala:84) 
scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:208) 
scala.collection.AbstractTraversable.reduce(Traversable.scala:104) 
org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:90)
 
org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:54)
 scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) 
org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:61)
 
org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$1.applyOrElse(InMemoryTableScanExec.scala:54)
 scala.PartialFunction$Lifted.apply(PartialFunction.scala:223) 
scala.PartialFunction$Lifted.apply(PartialFunction.scala:219) 
org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:95)
 
org.apache.spark.sql.execution.columnar.InMemoryTableScanExec$$anonfun$2.apply(InMemoryTableScanExec.scala:94)
 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
 scala.collection.immutable.List.foreach(List.scala:381) 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) 
scala.collection.immutable.List.flatMap(List.scala:344) 
org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.(InMemoryTableScanExec.scala:94)
 
org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$6.apply(SparkStrategies.scala:306)
 
org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$$anonfun$6.apply(SparkStrategies.scala:306)
 
org.apache.spark.sql.execution.SparkPlanner.pruneFilterProject(SparkPlanner.scala:96)
 
org.apache.spark.sql.execution.SparkStrategies$InMemoryScans$.apply(SparkStrategies.scala:302)
 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
 scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) 
scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) 
scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
 scala.collection.Iterator$class.foreach(Iterator.scala:893) 
scala.collection.AbstractIterator.foreach(Iterator.scala:1336) 
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) 
scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
 scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) 
scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) 
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
 
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
 

[jira] [Commented] (SPARK-17890) scala.ScalaReflectionException

2017-01-20 Thread Dave DeCaprio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832620#comment-15832620
 ] 

Dave DeCaprio commented on SPARK-17890:
---

I'm running into this as well. Naively changing the line mentioned above to use 
that mirror didn't work.

> scala.ScalaReflectionException
> --
>
> Key: SPARK-17890
> URL: https://issues.apache.org/jira/browse/SPARK-17890
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: x86_64 GNU/Linux
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
>Reporter: Khalid Reid
>Priority: Minor
>  Labels: newbie
>
> Hello,
> I am seeing an error message in spark-shell when I map a DataFrame to a 
> Seq\[Foo\].  However, things work fine when I use flatMap.  
> {noformat}
> scala> case class Foo(value:String)
> defined class Foo
> scala> val df = sc.parallelize(List(1,2,3)).toDF
> df: org.apache.spark.sql.DataFrame = [value: int]
> scala> df.map{x => Seq.empty[Foo]}
> scala.ScalaReflectionException: object $line14.$read not found.
>   at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:162)
>   at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:22)
>   at $typecreator1$1.apply(:29)
>   at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
>   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
>   at 
> org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125)
>   at 
> scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
>   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
>   at 
> org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125)
>   ... 48 elided
> scala> df.flatMap{_ => Seq.empty[Foo]} //flatMap works
> res2: org.apache.spark.sql.Dataset[Foo] = [value: string]
> {noformat}
> I am seeing the same error reported 
> [here|https://issues.apache.org/jira/browse/SPARK-8465?jql=text%20~%20%22scala.ScalaReflectionException%22]
>  when I use spark-submit.
> I am new to Spark but I don't expect this to throw an exception.
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13478) Fetching delegation tokens for Hive fails when using proxy users

2017-01-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13478:


Assignee: Apache Spark  (was: Marcelo Vanzin)

> Fetching delegation tokens for Hive fails when using proxy users
> 
>
> Key: SPARK-13478
> URL: https://issues.apache.org/jira/browse/SPARK-13478
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 2.0.0
>
>
> If you use spark-submit's proxy user support, the code that fetches 
> delegation tokens for the Hive Metastore fails. It seems like the Hive 
> library tries to connect to the Metastore as the proxy user, and it doesn't 
> have a Kerberos TGT for that user, so it fails.
> I don't know whether the same issue exists in the HBase code, but I'll make a 
> similar change so that both behave similarly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13478) Fetching delegation tokens for Hive fails when using proxy users

2017-01-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832617#comment-15832617
 ] 

Apache Spark commented on SPARK-13478:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/16665

> Fetching delegation tokens for Hive fails when using proxy users
> 
>
> Key: SPARK-13478
> URL: https://issues.apache.org/jira/browse/SPARK-13478
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.0.0
>
>
> If you use spark-submit's proxy user support, the code that fetches 
> delegation tokens for the Hive Metastore fails. It seems like the Hive 
> library tries to connect to the Metastore as the proxy user, and it doesn't 
> have a Kerberos TGT for that user, so it fails.
> I don't know whether the same issue exists in the HBase code, but I'll make a 
> similar change so that both behave similarly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13478) Fetching delegation tokens for Hive fails when using proxy users

2017-01-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13478:


Assignee: Marcelo Vanzin  (was: Apache Spark)

> Fetching delegation tokens for Hive fails when using proxy users
> 
>
> Key: SPARK-13478
> URL: https://issues.apache.org/jira/browse/SPARK-13478
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.0.0
>
>
> If you use spark-submit's proxy user support, the code that fetches 
> delegation tokens for the Hive Metastore fails. It seems like the Hive 
> library tries to connect to the Metastore as the proxy user, and it doesn't 
> have a Kerberos TGT for that user, so it fails.
> I don't know whether the same issue exists in the HBase code, but I'll make a 
> similar change so that both behave similarly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2017-01-20 Thread Jonathan Alvarado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832598#comment-15832598
 ] 

Jonathan Alvarado commented on SPARK-12837:
---

I am seeing this issue on EMR 5.2.0 with Spark 2.0.2 and can reproduce locally 
with Spark 2.0.2 following Jason Herman's steps above.

> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.0.0
>
>
> Executing a SQL statement with a large number of partitions requires a large 
> amount of driver memory, even when there are no requests to collect data back 
> to the driver.
> Here are the steps to reproduce the issue.
> 1. Start spark shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
>  "toto2" ) // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}
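
For reference, a minimal Scala sketch of how the limit mentioned in the error 
can be raised when building the session (the value here is arbitrary; the point 
of this issue is that the driver should not need that memory in the first place):

{code}
import org.apache.spark.sql.SparkSession

// Raise spark.driver.maxResultSize (default 1g); "0" disables the limit.
val spark = SparkSession.builder()
  .appName("max-result-size-sketch")
  .config("spark.driver.maxResultSize", "2g")
  .getOrCreate()
{code}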



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16599) java.util.NoSuchElementException: None.get at at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)

2017-01-20 Thread Drew Robb (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832569#comment-15832569
 ] 

Drew Robb commented on SPARK-16599:
---

I encountered an identical exception when using a singleton Spark session. For 
me, I was able to resolve the issue by ensuring that every object that used the 
singleton Spark session did an `import spark.implicits._`, even if that 
particular import was not necessary for compilation.
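
A minimal Scala sketch of the pattern I mean, with hypothetical object names; 
whether the extra import is what actually fixes the exception is only my 
observation from our code, not something verified in Spark itself:

{code}
import org.apache.spark.sql.SparkSession

object Sessions {
  // Application-wide singleton session (hypothetical setup).
  lazy val spark: SparkSession =
    SparkSession.builder().appName("singleton-session-sketch").getOrCreate()
}

object SomeJob {
  def run(): Long = {
    val spark = Sessions.spark
    // Import the implicits tied to this particular session instance,
    // even though nothing below needs them to compile.
    import spark.implicits._

    spark.range(0, 100).filter("id % 2 = 0").count()
  }
}
{code}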

> java.util.NoSuchElementException: None.get  at at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
> --
>
> Key: SPARK-16599
> URL: https://issues.apache.org/jira/browse/SPARK-16599
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: centos 6.7   spark 2.0
>Reporter: binde
>
> Running a Spark job with Spark 2.0 produces the following error message:
> Job aborted due to stage failure: Task 0 in stage 821.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 821.0 (TID 1480, e103): 
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18859) Catalyst codegen does not mark column as nullable when it should. Causes NPE

2017-01-20 Thread Erik LaBianca (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832516#comment-15832516
 ] 

Erik LaBianca edited comment on SPARK-18859 at 1/20/17 10:33 PM:
-

Not quite a repro, but here's explain output.

{noformat}
val df = spark.read
  .format("jdbc")
  .options(DataSources.PostgresOptions + (
  "dbtable" -> s"""(select
 profiles_contact_points.id,
 remote_id
  |from profiles_contact_points
  |  left join profiles_contacts_connectors
  |on profiles_contact_points.contact_id = 
profiles_contacts_connectors.contact_id
  | where profiles_contact_points.user_id = 1
) t""".stripMargin
))
  .load

df.printSchema()
df.explain(true)
df.map(_ != null).count()
{noformat}

Results in the following.
{noformat}
df: org.apache.spark.sql.DataFrame = [id: bigint, remote_id: string]
root
 |-- id: long (nullable = false)
 |-- remote_id: string (nullable = false)
== Parsed Logical Plan ==
Relation[id#2018L,remote_id#2019] JDBCRelation((select
 profiles_contact_points.id,
 remote_id
from profiles_contact_points
  left join profiles_contacts_connectors
on profiles_contact_points.contact_id = 
profiles_contacts_connectors.contact_id
 where profiles_contact_points.user_id = 1
) t)
== Analyzed Logical Plan ==
id: bigint, remote_id: string
Relation[id#2018L,remote_id#2019] JDBCRelation((select
 profiles_contact_points.id,
 remote_id
from profiles_contact_points
  left join profiles_contacts_connectors
on profiles_contact_points.contact_id = 
profiles_contacts_connectors.contact_id
 where profiles_contact_points.user_id = 1
) t)
== Optimized Logical Plan ==
Relation[id#2018L,remote_id#2019] JDBCRelation((select
 profiles_contact_points.id,
 remote_id
from profiles_contact_points
  left join profiles_contacts_connectors
on profiles_contact_points.contact_id = 
profiles_contacts_connectors.contact_id
 where profiles_contact_points.user_id = 1
) t)
== Physical Plan ==
*Scan JDBCRelation((select
 profiles_contact_points.id,
 remote_id
from profiles_contact_points
  left join profiles_contacts_connectors
on profiles_contact_points.contact_id = 
profiles_contacts_connectors.contact_id
 where profiles_contact_points.user_id = 1
) t) [id#2018L,remote_id#2019] ReadSchema: struct<id:bigint,remote_id:string>
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 238.0 failed 4 times, most recent failure: Lost task 0.3 in stage 238.0 
(TID 55547, ip-x.ec2.internal): java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
  at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at scala.Option.foreach(Option.scala:257)
  at 

[jira] [Comment Edited] (SPARK-18859) Catalyst codegen does not mark column as nullable when it should. Causes NPE

2017-01-20 Thread Erik LaBianca (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832516#comment-15832516
 ] 

Erik LaBianca edited comment on SPARK-18859 at 1/20/17 10:32 PM:
-

Not quite a repro, but here's explain output.

{noformat}
val df = spark.read
  .format("jdbc")
  .options(DataSources.TelepathPlayDb.PostgresOptions + (
  "dbtable" -> s"""(select
 profiles_contact_points.id,
 remote_id
  |from profiles_contact_points
  |  left join profiles_contacts_connectors
  |on profiles_contact_points.contact_id = 
profiles_contacts_connectors.contact_id
  | where profiles_contact_points.user_id = 1
) t""".stripMargin
))
  .load
  //.as[ContactRemoteValue]
df.printSchema()
df.explain(true)
df.map(_ != null).count()
{noformat}

Results in the following.
{noformat}
df: org.apache.spark.sql.DataFrame = [id: bigint, remote_id: string]
root
 |-- id: long (nullable = false)
 |-- remote_id: string (nullable = false)
== Parsed Logical Plan ==
Relation[id#2018L,remote_id#2019] JDBCRelation((select
 profiles_contact_points.id,
 remote_id
from profiles_contact_points
  left join profiles_contacts_connectors
on profiles_contact_points.contact_id = 
profiles_contacts_connectors.contact_id
 where profiles_contact_points.user_id = 1
) t)
== Analyzed Logical Plan ==
id: bigint, remote_id: string
Relation[id#2018L,remote_id#2019] JDBCRelation((select
 profiles_contact_points.id,
 remote_id
from profiles_contact_points
  left join profiles_contacts_connectors
on profiles_contact_points.contact_id = 
profiles_contacts_connectors.contact_id
 where profiles_contact_points.user_id = 1
) t)
== Optimized Logical Plan ==
Relation[id#2018L,remote_id#2019] JDBCRelation((select
 profiles_contact_points.id,
 remote_id
from profiles_contact_points
  left join profiles_contacts_connectors
on profiles_contact_points.contact_id = 
profiles_contacts_connectors.contact_id
 where profiles_contact_points.user_id = 1
) t)
== Physical Plan ==
*Scan JDBCRelation((select
 profiles_contact_points.id,
 remote_id
from profiles_contact_points
  left join profiles_contacts_connectors
on profiles_contact_points.contact_id = 
profiles_contacts_connectors.contact_id
 where profiles_contact_points.user_id = 1
) t) [id#2018L,remote_id#2019] ReadSchema: struct<id:bigint,remote_id:string>
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 238.0 failed 4 times, most recent failure: Lost task 0.3 in stage 238.0 
(TID 55547, ip-x.ec2.internal): java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
  at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at scala.Option.foreach(Option.scala:257)
  at 

[jira] [Comment Edited] (SPARK-18859) Catalyst codegen does not mark column as nullable when it should. Causes NPE

2017-01-20 Thread Erik LaBianca (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832516#comment-15832516
 ] 

Erik LaBianca edited comment on SPARK-18859 at 1/20/17 10:32 PM:
-

Not quite a repro, but here's explain output.

{noformat}
val df = spark.read
  .format("jdbc")
  .options(DataSources.PostgresOptions + (
  "dbtable" -> s"""(select
 profiles_contact_points.id,
 remote_id
  |from profiles_contact_points
  |  left join profiles_contacts_connectors
  |on profiles_contact_points.contact_id = 
profiles_contacts_connectors.contact_id
  | where profiles_contact_points.user_id = 1
) t""".stripMargin
))
  .load
  //.as[ContactRemoteValue]
df.printSchema()
df.explain(true)
df.map(_ != null).count()
{noformat}

Results in the following.
{noformat}
df: org.apache.spark.sql.DataFrame = [id: bigint, remote_id: string]
root
 |-- id: long (nullable = false)
 |-- remote_id: string (nullable = false)
== Parsed Logical Plan ==
Relation[id#2018L,remote_id#2019] JDBCRelation((select
 profiles_contact_points.id,
 remote_id
from profiles_contact_points
  left join profiles_contacts_connectors
on profiles_contact_points.contact_id = 
profiles_contacts_connectors.contact_id
 where profiles_contact_points.user_id = 1
) t)
== Analyzed Logical Plan ==
id: bigint, remote_id: string
Relation[id#2018L,remote_id#2019] JDBCRelation((select
 profiles_contact_points.id,
 remote_id
from profiles_contact_points
  left join profiles_contacts_connectors
on profiles_contact_points.contact_id = 
profiles_contacts_connectors.contact_id
 where profiles_contact_points.user_id = 1
) t)
== Optimized Logical Plan ==
Relation[id#2018L,remote_id#2019] JDBCRelation((select
 profiles_contact_points.id,
 remote_id
from profiles_contact_points
  left join profiles_contacts_connectors
on profiles_contact_points.contact_id = 
profiles_contacts_connectors.contact_id
 where profiles_contact_points.user_id = 1
) t)
== Physical Plan ==
*Scan JDBCRelation((select
 profiles_contact_points.id,
 remote_id
from profiles_contact_points
  left join profiles_contacts_connectors
on profiles_contact_points.contact_id = 
profiles_contacts_connectors.contact_id
 where profiles_contact_points.user_id = 1
) t) [id#2018L,remote_id#2019] ReadSchema: struct<id:bigint,remote_id:string>
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 238.0 failed 4 times, most recent failure: Lost task 0.3 in stage 238.0 
(TID 55547, ip-x.ec2.internal): java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
  at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at scala.Option.foreach(Option.scala:257)
  at 

[jira] [Commented] (SPARK-18859) Catalyst codegen does not mark column as nullable when it should. Causes NPE

2017-01-20 Thread Erik LaBianca (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832516#comment-15832516
 ] 

Erik LaBianca commented on SPARK-18859:
---

Not quite a repro, but here's explain output.

{noformat}
val df = spark.read
  .format("jdbc")
  .options(DataSources.TelepathPlayDb.PostgresOptions + (
  "dbtable" -> s"""(select
 profiles_contact_points.id,
 remote_id
  |from profiles_contact_points
  |  left join profiles_contacts_connectors
  |on profiles_contact_points.contact_id = 
profiles_contacts_connectors.contact_id
  | where profiles_contact_points.user_id = 1
) t""".stripMargin
))
  .load
  //.as[ContactRemoteValue]
df.printSchema()
df.explain(true)
df.map(_ != null).count()
{noformat}

Results in the following.
{noformat}
df: org.apache.spark.sql.DataFrame = [id: bigint, remote_id: string]
root
 |-- id: long (nullable = false)
 |-- remote_id: string (nullable = false)
== Parsed Logical Plan ==
Relation[id#2018L,remote_id#2019] JDBCRelation((select
 profiles_contact_points.id,
 remote_id
from profiles_contact_points
  left join profiles_contacts_connectors
on profiles_contact_points.contact_id = 
profiles_contacts_connectors.contact_id
 where profiles_contact_points.user_id = 1
) t)
== Analyzed Logical Plan ==
id: bigint, remote_id: string
Relation[id#2018L,remote_id#2019] JDBCRelation((select
 profiles_contact_points.id,
 remote_id
from profiles_contact_points
  left join profiles_contacts_connectors
on profiles_contact_points.contact_id = 
profiles_contacts_connectors.contact_id
 where profiles_contact_points.user_id = 1
) t)
== Optimized Logical Plan ==
Relation[id#2018L,remote_id#2019] JDBCRelation((select
 profiles_contact_points.id,
 remote_id
from profiles_contact_points
  left join profiles_contacts_connectors
on profiles_contact_points.contact_id = 
profiles_contacts_connectors.contact_id
 where profiles_contact_points.user_id = 1
) t)
== Physical Plan ==
*Scan JDBCRelation((select
 profiles_contact_points.id,
 remote_id
from profiles_contact_points
  left join profiles_contacts_connectors
on profiles_contact_points.contact_id = 
profiles_contacts_connectors.contact_id
 where profiles_contact_points.user_id = 225
) t) [id#2018L,remote_id#2019] ReadSchema: struct<id:bigint,remote_id:string>
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 238.0 failed 4 times, most recent failure: Lost task 0.3 in stage 238.0 
(TID 55547, ip-x.ec2.internal): java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
  at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at scala.Option.foreach(Option.scala:257)
  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)

[jira] [Comment Edited] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2017-01-20 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832487#comment-15832487
 ] 

Yun Ni edited comment on SPARK-18392 at 1/20/17 10:29 PM:
--

Yes, comparing whether the hash signatures are equal is faster than computing 
the distance between the key and every instance.

But you are right, it would be better to compare against a smaller number of 
records. I will think about this more and see if there is a better solution.


was (Author: yunn):
Yes, comparing if the hash signature equals is faster than computing the 
distance between the key and every instance.

> LSH API, algorithm, and documentation follow-ups
> 
>
> Key: SPARK-18392
> URL: https://issues.apache.org/jira/browse/SPARK-18392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA summarizes discussions from the initial LSH PR 
> [https://github.com/apache/spark/pull/15148] as well as the follow-up for 
> hash distance [https://github.com/apache/spark/pull/15800].  This will be 
> broken into subtasks:
> * API changes (targeted for 2.1)
> * algorithmic fixes (targeted for 2.1)
> * documentation improvements (ideally 2.1, but could slip)
> The major issues we have mentioned are as follows:
> * OR vs AND amplification
> ** Need to make API flexible enough to support both types of amplification in 
> the future
> ** Need to clarify which we support, including in each model function 
> (transform, similarity join, neighbors)
> * Need to clarify which algorithms we have implemented, improve docs and 
> references, and fix the algorithms if needed.
> These major issues are broken down into detailed issues below.
> h3. LSH abstraction
> * Rename {{outputDim}} to something indicative of OR-amplification.
> ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used 
> in the future for AND amplification (Thanks [~mlnick]!)
> * transform
> ** Update output schema to {{Array of Vector}} instead of {{Vector}}.  This 
> is the "raw" output of all hash functions, i.e., with no aggregation for 
> amplification.
> ** Clarify meaning of output in terms of multiple hash functions and 
> amplification.
> ** Note: We will _not_ worry about users using this output for dimensionality 
> reduction; if anything, that use case can be explained in the User Guide.
> * Documentation
> ** Clarify terminology used everywhere
> *** hash function {{h_i}}: basic hash function without amplification
> *** hash value {{h_i(key)}}: output of a hash function
> *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with 
> AND-amplification using K base hash functions
> *** compound hash function value {{g(key)}}: vector-valued output
> *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with 
> OR-amplification using L compound hash functions
> *** hash table value {{H(key)}}: output of array of vectors
> *** This terminology is largely pulled from Wang et al.'s survey and the 
> multi-probe LSH paper.
> ** Link clearly to documentation (Wikipedia or papers) which matches our 
> terminology and what we implemented
> h3. RandomProjection (or P-Stable Distributions)
> * Rename {{RandomProjection}}
> ** Options include: {{ScalarRandomProjectionLSH}}, 
> {{BucketedRandomProjectionLSH}}, {{PStableLSH}}
> * API privacy
> ** Make randUnitVectors private
> * hashFunction
> ** Currently, this uses OR-amplification for single probing, as we intended.
> ** It does *not* do multiple probing, at least not in the sense of the 
> original MPLSH paper.  We should fix that or at least document its behavior.
> * Documentation
> ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
> ** Also link to the multi-probe LSH paper since that explains this method 
> very clearly.
> ** Clarify hash function and distance metric
> h3. MinHash
> * Rename {{MinHash}} -> {{MinHashLSH}}
> * API privacy
> ** Make randCoefficients, numEntries private
> * hashDistance (used in approxNearestNeighbors)
> ** Update to use average of indicators of hash collisions [SPARK-18334]
> ** See [Wikipedia | 
> https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a 
> reference
> h3. All references
> I'm just listing references I looked at.
> Papers
> * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf]
> * [https://people.csail.mit.edu/indyk/p117-andoni.pdf]
> * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf]
> * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe 
> LSH paper
> Wikipedia
> * 
> [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search]
> * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2017-01-20 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832487#comment-15832487
 ] 

Yun Ni commented on SPARK-18392:


Yes, comparing whether the hash signatures are equal is faster than computing 
the distance between the key and every instance.

> LSH API, algorithm, and documentation follow-ups
> 
>
> Key: SPARK-18392
> URL: https://issues.apache.org/jira/browse/SPARK-18392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA summarizes discussions from the initial LSH PR 
> [https://github.com/apache/spark/pull/15148] as well as the follow-up for 
> hash distance [https://github.com/apache/spark/pull/15800].  This will be 
> broken into subtasks:
> * API changes (targeted for 2.1)
> * algorithmic fixes (targeted for 2.1)
> * documentation improvements (ideally 2.1, but could slip)
> The major issues we have mentioned are as follows:
> * OR vs AND amplification
> ** Need to make API flexible enough to support both types of amplification in 
> the future
> ** Need to clarify which we support, including in each model function 
> (transform, similarity join, neighbors)
> * Need to clarify which algorithms we have implemented, improve docs and 
> references, and fix the algorithms if needed.
> These major issues are broken down into detailed issues below.
> h3. LSH abstraction
> * Rename {{outputDim}} to something indicative of OR-amplification.
> ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used 
> in the future for AND amplification (Thanks [~mlnick]!)
> * transform
> ** Update output schema to {{Array of Vector}} instead of {{Vector}}.  This 
> is the "raw" output of all hash functions, i.e., with no aggregation for 
> amplification.
> ** Clarify meaning of output in terms of multiple hash functions and 
> amplification.
> ** Note: We will _not_ worry about users using this output for dimensionality 
> reduction; if anything, that use case can be explained in the User Guide.
> * Documentation
> ** Clarify terminology used everywhere
> *** hash function {{h_i}}: basic hash function without amplification
> *** hash value {{h_i(key)}}: output of a hash function
> *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with 
> AND-amplification using K base hash functions
> *** compound hash function value {{g(key)}}: vector-valued output
> *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with 
> OR-amplification using L compound hash functions
> *** hash table value {{H(key)}}: output of array of vectors
> *** This terminology is largely pulled from Wang et al.'s survey and the 
> multi-probe LSH paper.
> ** Link clearly to documentation (Wikipedia or papers) which matches our 
> terminology and what we implemented
> h3. RandomProjection (or P-Stable Distributions)
> * Rename {{RandomProjection}}
> ** Options include: {{ScalarRandomProjectionLSH}}, 
> {{BucketedRandomProjectionLSH}}, {{PStableLSH}}
> * API privacy
> ** Make randUnitVectors private
> * hashFunction
> ** Currently, this uses OR-amplification for single probing, as we intended.
> ** It does *not* do multiple probing, at least not in the sense of the 
> original MPLSH paper.  We should fix that or at least document its behavior.
> * Documentation
> ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
> ** Also link to the multi-probe LSH paper since that explains this method 
> very clearly.
> ** Clarify hash function and distance metric
> h3. MinHash
> * Rename {{MinHash}} -> {{MinHashLSH}}
> * API privacy
> ** Make randCoefficients, numEntries private
> * hashDistance (used in approxNearestNeighbors)
> ** Update to use average of indicators of hash collisions [SPARK-18334]
> ** See [Wikipedia | 
> https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a 
> reference
> h3. All references
> I'm just listing references I looked at.
> Papers
> * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf]
> * [https://people.csail.mit.edu/indyk/p117-andoni.pdf]
> * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf]
> * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe 
> LSH paper
> Wikipedia
> * 
> [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search]
> * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19314) Do not allow sort before aggregation in Structured Streaming plan

2017-01-20 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-19314.
---
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.0.3
   2.1.1

Issue resolved by pull request 16662
[https://github.com/apache/spark/pull/16662]

> Do not allow sort before aggregation in Structured Streaming plan
> -
>
> Key: SPARK-19314
> URL: https://issues.apache.org/jira/browse/SPARK-19314
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Blocker
> Fix For: 2.1.1, 2.0.3, 3.0.0
>
>
> Sort in a streaming plan should be allowed only after an aggregation in 
> complete mode. Currently it is incorrectly allowed when present anywhere in 
> the plan, which gives unpredictable, potentially incorrect results.
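
For illustration, a minimal Scala sketch of the only arrangement this change is 
meant to allow: a sort applied after an aggregation, with the query running in 
complete output mode (source, sink, and names are hypothetical and follow the 
usual socket word-count style):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("streaming-sort-sketch").getOrCreate()
import spark.implicits._

// Hypothetical streaming source.
val lines = spark.readStream.format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .as[String]

// Sorting is acceptable here only because it follows an aggregation and the
// query runs in complete output mode.
val counts = lines.groupBy("value").count().sort($"count".desc)

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
{code}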



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19111) S3 Mesos history upload fails silently if too large

2017-01-20 Thread Jisoo Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832473#comment-15832473
 ] 

Jisoo Kim commented on SPARK-19111:
---

Related to https://issues.apache.org/jira/browse/SPARK-19316.

> S3 Mesos history upload fails silently if too large
> ---
>
> Key: SPARK-19111
> URL: https://issues.apache.org/jira/browse/SPARK-19111
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> {code}
> 2017-01-06T21:32:32,928 INFO [main] org.apache.spark.ui.SparkUI - Stopped 
> Spark web UI at http://REDACTED:4041
> 2017-01-06T21:32:32,938 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.jvmGCTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBlocksFetched
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSerializationTime
> 2017-01-06T21:32:32,939 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(
> 364,WrappedArray())
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSize
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.peakExecutionMemory
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.fetchWaitTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.memoryBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.diskBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.recordsRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorDeserializeTime
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: output/bytes
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorRunTime
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBlocksFetched
> 2017-01-06T21:32:32,943 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1387.inprogress' 
> closed. Now beginning upload
> 2017-01-06T21:32:32,963 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(905,WrappedArray())
> 2017-01-06T21:32:32,973 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(519,WrappedArray())
> 2017-01-06T21:32:32,988 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(596,WrappedArray())
> {code}
> Running Spark on Mesos, some large jobs fail to upload to the history server 
> storage!
> A successful sequence of events in the log that yields an upload is as 
> follows:
> {code}
> 2017-01-06T19:14:32,925 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
> writing to tempfile '/mnt/tmp/hadoop/output-2516573909248961808.tmp'
> 2017-01-06T21:59:14,789 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
> closed. Now beginning upload
> 2017-01-06T21:59:44,679 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 

[jira] [Commented] (SPARK-19316) Spark event logs are huge compared to 1.5.2

2017-01-20 Thread Jisoo Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832470#comment-15832470
 ] 

Jisoo Kim commented on SPARK-19316:
---

Related to https://issues.apache.org/jira/browse/SPARK-19111

> Spark event logs are huge compared to 1.5.2
> ---
>
> Key: SPARK-19316
> URL: https://issues.apache.org/jira/browse/SPARK-19316
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Jisoo Kim
>
> I have a Spark application with many tasks (more than 40k). The event logs 
> for such an application used to be around 2g when I was using a Spark 1.5.2 
> standalone cluster. Now that I am using Spark 2.0 with Mesos, the size of the 
> event log for such an application has drastically increased from 2g to 60g 
> with a similar number of tasks. This is affecting the Spark History Server, 
> since it is having trouble reading such a huge event log. I wonder whether 
> this increase in event log size is expected in Spark 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19296) Awkward changes for JdbcUtils.saveTable in Spark 2.1.0

2017-01-20 Thread Paul Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15831033#comment-15831033
 ] 

Paul Wu edited comment on SPARK-19296 at 1/20/17 9:52 PM:
--

We found this util very useful in general (much, much better than primitive 
JDBC) and have been using it since 1.5.x... we didn't realize it is internal. It 
would be a big regret for us not to be able to use it, but it has become a pain 
for us now. 

For code quality purposes at least, I suggest refactoring the code to eliminate 
the duplicated args. 


was (Author: zwu@gmail.com):
We found this util very useful in general (much, much better than primitive 
JDBC) and have been using it since 1.3.x... we didn't realize it is internal. It 
would be a big regret for us not to be able to use it, but it has become a pain 
for us now. 

For code quality purposes at least, I suggest refactoring the code to eliminate 
the duplicated args. 

> Awkward changes for JdbcUtils.saveTable in Spark 2.1.0
> --
>
> Key: SPARK-19296
> URL: https://issues.apache.org/jira/browse/SPARK-19296
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Paul Wu
>Priority: Minor
>
> The change from JdbcUtils.saveTable(DataFrame, String, String, Property) to 
> JdbcUtils.saveTable(DataFrame, String, String, JDBCOptions) is not only 
> incompatible with previous versions (so previous Java code won't 
> compile), but also introduces an awkward code change: one has to specify url and 
> table twice, like this:
> JDBCOptions jdbcOptions = new JDBCOptions(url, table, map);
> JdbcUtils.saveTable(ds, url, table, jdbcOptions);
> Why does one have to supply the same things, url and table, twice? (If you don't 
> specify them in both places, an exception is thrown.)
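
A minimal sketch of the caller-side wrapper being suggested, using only the 
(internal) signatures quoted above; the helper name and the extra options are 
hypothetical:

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.datasources.jdbc.{JDBCOptions, JdbcUtils}

// Hypothetical convenience wrapper: build JDBCOptions once so callers do not
// repeat url and table when invoking the 2.1.0 saveTable signature.
object JdbcSaveHelper {
  def saveTable(df: DataFrame, url: String, table: String,
                extraOptions: Map[String, String] = Map.empty): Unit = {
    val options = new JDBCOptions(url, table, extraOptions)
    JdbcUtils.saveTable(df, url, table, options)
  }
}
{code}

This keeps the duplicated arguments in one place until the API itself is refactored.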



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2017-01-20 Thread David S (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832464#comment-15832464
 ] 

David S commented on SPARK-18392:
-

Hi Yun, and thanks for the answer. My question now is: is there any 
performance advantage to computing the nearest neighbors with LSH compared to 
the normal way (without LSH)? I see that in both cases it is necessary to 
compare an instance with all instances in the dataset. 

> LSH API, algorithm, and documentation follow-ups
> 
>
> Key: SPARK-18392
> URL: https://issues.apache.org/jira/browse/SPARK-18392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA summarizes discussions from the initial LSH PR 
> [https://github.com/apache/spark/pull/15148] as well as the follow-up for 
> hash distance [https://github.com/apache/spark/pull/15800].  This will be 
> broken into subtasks:
> * API changes (targeted for 2.1)
> * algorithmic fixes (targeted for 2.1)
> * documentation improvements (ideally 2.1, but could slip)
> The major issues we have mentioned are as follows:
> * OR vs AND amplification
> ** Need to make API flexible enough to support both types of amplification in 
> the future
> ** Need to clarify which we support, including in each model function 
> (transform, similarity join, neighbors)
> * Need to clarify which algorithms we have implemented, improve docs and 
> references, and fix the algorithms if needed.
> These major issues are broken down into detailed issues below.
> h3. LSH abstraction
> * Rename {{outputDim}} to something indicative of OR-amplification.
> ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used 
> in the future for AND amplification (Thanks [~mlnick]!)
> * transform
> ** Update output schema to {{Array of Vector}} instead of {{Vector}}.  This 
> is the "raw" output of all hash functions, i.e., with no aggregation for 
> amplification.
> ** Clarify meaning of output in terms of multiple hash functions and 
> amplification.
> ** Note: We will _not_ worry about users using this output for dimensionality 
> reduction; if anything, that use case can be explained in the User Guide.
> * Documentation
> ** Clarify terminology used everywhere
> *** hash function {{h_i}}: basic hash function without amplification
> *** hash value {{h_i(key)}}: output of a hash function
> *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with 
> AND-amplification using K base hash functions
> *** compound hash function value {{g(key)}}: vector-valued output
> *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with 
> OR-amplification using L compound hash functions
> *** hash table value {{H(key)}}: output of array of vectors
> *** This terminology is largely pulled from Wang et al.'s survey and the 
> multi-probe LSH paper.
> ** Link clearly to documentation (Wikipedia or papers) which matches our 
> terminology and what we implemented
> h3. RandomProjection (or P-Stable Distributions)
> * Rename {{RandomProjection}}
> ** Options include: {{ScalarRandomProjectionLSH}}, 
> {{BucketedRandomProjectionLSH}}, {{PStableLSH}}
> * API privacy
> ** Make randUnitVectors private
> * hashFunction
> ** Currently, this uses OR-amplification for single probing, as we intended.
> ** It does *not* do multiple probing, at least not in the sense of the 
> original MPLSH paper.  We should fix that or at least document its behavior.
> * Documentation
> ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
> ** Also link to the multi-probe LSH paper since that explains this method 
> very clearly.
> ** Clarify hash function and distance metric
> h3. MinHash
> * Rename {{MinHash}} -> {{MinHashLSH}}
> * API privacy
> ** Make randCoefficients, numEntries private
> * hashDistance (used in approxNearestNeighbors)
> ** Update to use average of indicators of hash collisions [SPARK-18334]
> ** See [Wikipedia | 
> https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a 
> reference
> h3. All references
> I'm just listing references I looked at.
> Papers
> * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf]
> * [https://people.csail.mit.edu/indyk/p117-andoni.pdf]
> * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf]
> * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe 
> LSH paper
> Wikipedia
> * 
> [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search]
> * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]
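
For context, a minimal usage sketch of the OR-amplified API discussed above, 
assuming the proposed renames ({{BucketedRandomProjectionLSH}}, {{numHashTables}}) 
and a DataFrame {{df}} with a 2-dimensional "features" vector column (both 
assumptions for illustration):

{code}
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors

val lsh = new BucketedRandomProjectionLSH()
  .setBucketLength(2.0)      // width of the p-stable projection buckets
  .setNumHashTables(3)       // L hash tables, i.e. OR-amplification
  .setInputCol("features")
  .setOutputCol("hashes")    // "raw" output: one hash value per table (Array of Vector)

val model = lsh.fit(df)
val key = Vectors.dense(1.0, 0.0)
// Approximate top-2 nearest neighbors of `key` under Euclidean distance.
model.approxNearestNeighbors(df, key, 2).show()
{code}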



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-19111) S3 Mesos history upload fails silently if too large

2017-01-20 Thread Jisoo Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832458#comment-15832458
 ] 

Jisoo Kim commented on SPARK-19111:
---

Thanks [~ste...@apache.org] for the information; using S3a helped with uploading 
the event logs (I used your PR https://github.com/apache/spark/pull/12004 to 
use S3a). It turned out that the size of the event log was dramatically larger 
than it used to be with Spark 1.5.2 (~30x). 
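
For readers following along, pointing the event log directory at an s3a:// URI is 
a configuration change; a minimal sketch (the bucket path below is hypothetical, 
and the s3a client jars must be on the classpath):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "s3a://my-bucket/eventLogs")  // hypothetical bucket/path
{code}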

> S3 Mesos history upload fails silently if too large
> ---
>
> Key: SPARK-19111
> URL: https://issues.apache.org/jira/browse/SPARK-19111
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> {code}
> 2017-01-06T21:32:32,928 INFO [main] org.apache.spark.ui.SparkUI - Stopped 
> Spark web UI at http://REDACTED:4041
> 2017-01-06T21:32:32,938 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.jvmGCTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBlocksFetched
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSerializationTime
> 2017-01-06T21:32:32,939 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(
> 364,WrappedArray())
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSize
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.peakExecutionMemory
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.fetchWaitTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.memoryBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.diskBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.recordsRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorDeserializeTime
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: output/bytes
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorRunTime
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBlocksFetched
> 2017-01-06T21:32:32,943 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1387.inprogress' 
> closed. Now beginning upload
> 2017-01-06T21:32:32,963 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(905,WrappedArray())
> 2017-01-06T21:32:32,973 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(519,WrappedArray())
> 2017-01-06T21:32:32,988 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(596,WrappedArray())
> {code}
> Running spark on mesos, some large jobs fail to upload to the history server 
> storage!
> A successful sequence of events in the log that yield an upload are as 
> follows:
> {code}
> 2017-01-06T19:14:32,925 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
> writing to tempfile '/mnt/tmp/hadoop/output-2516573909248961808.tmp'
> 2017-01-06T21:59:14,789 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 

[jira] [Created] (SPARK-19316) Spark event logs are huge compared to 1.5.2

2017-01-20 Thread Jisoo Kim (JIRA)
Jisoo Kim created SPARK-19316:
-

 Summary: Spark event logs are huge compared to 1.5.2
 Key: SPARK-19316
 URL: https://issues.apache.org/jira/browse/SPARK-19316
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Jisoo Kim


I have a Spark application with many tasks (more than 40k). The event logs for 
such an application used to be around 2g when I was using a Spark 1.5.2 standalone 
cluster. Now that I am using Spark 2.0 with Mesos, the size of the event log for 
such an application has drastically increased from 2g to 60g with a similar number 
of tasks. This is affecting the Spark History Server, since it is having trouble 
reading such a huge event log. I wonder whether this increase in the size of an 
event log is expected in Spark 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18750) spark should be able to control the number of executor and should not throw stack overslow

2017-01-20 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-18750:
---
Description: 
When running SQL queries on large datasets, the job fails with a stack overflow 
warning, and the log shows it requesting a very large number of executors.

It looks like there is no limit on the number of executors, not even an upper 
bound based on the YARN resources available.
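
Not part of the original report, but a possible mitigation while this is 
investigated is to cap dynamic allocation explicitly; a minimal sketch using 
standard Spark configuration (the cap value is arbitrary). The original log 
excerpt follows below.

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.maxExecutors", "200")  // arbitrary upper bound on requested executors
{code}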

{noformat}
16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
bdtcstr61n5.svr.us.jpmchase.net:8041 
16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
bdtcstr61n8.svr.us.jpmchase.net:8041 
16/11/29 15:47:47 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
bdtcstr61n2.svr.us.jpmchase.net:8041 
16/11/29 15:47:47 INFO yarn.YarnAllocator: Driver requested a total number of 
32770 executor(s). 
16/11/29 15:47:47 INFO yarn.YarnAllocator: Will request 24576 executor 
containers, each with 1 cores and 6758 MB memory including 614 MB overhead 
16/11/29 15:49:11 INFO yarn.YarnAllocator: Driver requested a total number of 
52902 executor(s). 
16/11/29 15:49:11 WARN yarn.ApplicationMaster: Reporter thread fails 1 time(s) 
in a row.
java.lang.StackOverflowError
at scala.collection.immutable.HashMap.$plus(HashMap.scala:57)
at scala.collection.immutable.HashMap.$plus(HashMap.scala:36)
at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28)
at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24)
at 
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at 
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at 
scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
at 
scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
at 
scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:24)
at 
scala.collection.TraversableLike$class.$plus$plus(TraversableLike.scala:156)
at 
scala.collection.AbstractTraversable.$plus$plus(Traversable.scala:105)
at scala.collection.immutable.HashMap.$plus(HashMap.scala:60)
at scala.collection.immutable.Map$Map4.updated(Map.scala:172)
at scala.collection.immutable.Map$Map4.$plus(Map.scala:173)
at scala.collection.immutable.Map$Map4.$plus(Map.scala:158)
at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:28)
at scala.collection.mutable.MapBuilder.$plus$eq(MapBuilder.scala:24)
at 
scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 
scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at 

[jira] [Comment Edited] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2017-01-20 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832404#comment-15832404
 ] 

Yun Ni edited comment on SPARK-18392 at 1/20/17 9:15 PM:
-

Hi David,

Thanks for the question. I did not group the records by their hash signatures 
for 2 reasons:
(1) A large number of instances could have the same hash signature. In that 
case, the group-by operations could cause shuffle spill. See 
http://stackoverflow.com/questions/25136064/spark-groupby-outofmemory-woes
(2) Since Spark Datasets are lazily evaluated, tagging every instance with a hash 
signature and comparing it with every instance will run in the same stage. (Even 
more steps may run in the same stage.) Thus it should not hurt the running time.

If you see any performance issues, please let us know.


was (Author: yunn):
Hi David,

Thanks for the question. I did not group the records by their hash signatures 
for 2 reasons:
(1) A large number of instances could have the same hash signature. In that 
case, the group-by operations could cause OOM or spill. See 
http://stackoverflow.com/questions/25136064/spark-groupby-outofmemory-woes
(2) Since Spark Datasets are lazily evaluated, tagging every instance with a hash 
signature and comparing it with every instance will run in the same stage. (Even 
more steps may run in the same stage.) Thus it should not hurt the running time.

If you see any performance issues, please let us know.

> LSH API, algorithm, and documentation follow-ups
> 
>
> Key: SPARK-18392
> URL: https://issues.apache.org/jira/browse/SPARK-18392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA summarizes discussions from the initial LSH PR 
> [https://github.com/apache/spark/pull/15148] as well as the follow-up for 
> hash distance [https://github.com/apache/spark/pull/15800].  This will be 
> broken into subtasks:
> * API changes (targeted for 2.1)
> * algorithmic fixes (targeted for 2.1)
> * documentation improvements (ideally 2.1, but could slip)
> The major issues we have mentioned are as follows:
> * OR vs AND amplification
> ** Need to make API flexible enough to support both types of amplification in 
> the future
> ** Need to clarify which we support, including in each model function 
> (transform, similarity join, neighbors)
> * Need to clarify which algorithms we have implemented, improve docs and 
> references, and fix the algorithms if needed.
> These major issues are broken down into detailed issues below.
> h3. LSH abstraction
> * Rename {{outputDim}} to something indicative of OR-amplification.
> ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used 
> in the future for AND amplification (Thanks [~mlnick]!)
> * transform
> ** Update output schema to {{Array of Vector}} instead of {{Vector}}.  This 
> is the "raw" output of all hash functions, i.e., with no aggregation for 
> amplification.
> ** Clarify meaning of output in terms of multiple hash functions and 
> amplification.
> ** Note: We will _not_ worry about users using this output for dimensionality 
> reduction; if anything, that use case can be explained in the User Guide.
> * Documentation
> ** Clarify terminology used everywhere
> *** hash function {{h_i}}: basic hash function without amplification
> *** hash value {{h_i(key)}}: output of a hash function
> *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with 
> AND-amplification using K base hash functions
> *** compound hash function value {{g(key)}}: vector-valued output
> *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with 
> OR-amplification using L compound hash functions
> *** hash table value {{H(key)}}: output of array of vectors
> *** This terminology is largely pulled from Wang et al.'s survey and the 
> multi-probe LSH paper.
> ** Link clearly to documentation (Wikipedia or papers) which matches our 
> terminology and what we implemented
> h3. RandomProjection (or P-Stable Distributions)
> * Rename {{RandomProjection}}
> ** Options include: {{ScalarRandomProjectionLSH}}, 
> {{BucketedRandomProjectionLSH}}, {{PStableLSH}}
> * API privacy
> ** Make randUnitVectors private
> * hashFunction
> ** Currently, this uses OR-amplification for single probing, as we intended.
> ** It does *not* do multiple probing, at least not in the sense of the 
> original MPLSH paper.  We should fix that or at least document its behavior.
> * Documentation
> ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
> ** Also link to the multi-probe LSH paper since that explains this method 
> very clearly.
> ** Clarify hash function and distance metric
> h3. MinHash
> * Rename {{MinHash}} -> {{MinHashLSH}}
> * API privacy
> ** Make randCoefficients, numEntries 

[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2017-01-20 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832404#comment-15832404
 ] 

Yun Ni commented on SPARK-18392:


Hi David,

Thanks for the question. I did not group the records by their hash signatures 
for 2 reasons:
(1) A large number of instances could have the same hash signature. In that 
case, the group-by operations could cause OOM or spill. See 
http://stackoverflow.com/questions/25136064/spark-groupby-outofmemory-woes
(2) Since Spark Datasets are lazily evaluated, tagging every instance with a hash 
signature and comparing it with every instance will run in the same stage. (Even 
more steps may run in the same stage.) Thus it should not hurt the running time.

If you see any performance issues, please let us know.

> LSH API, algorithm, and documentation follow-ups
> 
>
> Key: SPARK-18392
> URL: https://issues.apache.org/jira/browse/SPARK-18392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA summarizes discussions from the initial LSH PR 
> [https://github.com/apache/spark/pull/15148] as well as the follow-up for 
> hash distance [https://github.com/apache/spark/pull/15800].  This will be 
> broken into subtasks:
> * API changes (targeted for 2.1)
> * algorithmic fixes (targeted for 2.1)
> * documentation improvements (ideally 2.1, but could slip)
> The major issues we have mentioned are as follows:
> * OR vs AND amplification
> ** Need to make API flexible enough to support both types of amplification in 
> the future
> ** Need to clarify which we support, including in each model function 
> (transform, similarity join, neighbors)
> * Need to clarify which algorithms we have implemented, improve docs and 
> references, and fix the algorithms if needed.
> These major issues are broken down into detailed issues below.
> h3. LSH abstraction
> * Rename {{outputDim}} to something indicative of OR-amplification.
> ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used 
> in the future for AND amplification (Thanks [~mlnick]!)
> * transform
> ** Update output schema to {{Array of Vector}} instead of {{Vector}}.  This 
> is the "raw" output of all hash functions, i.e., with no aggregation for 
> amplification.
> ** Clarify meaning of output in terms of multiple hash functions and 
> amplification.
> ** Note: We will _not_ worry about users using this output for dimensionality 
> reduction; if anything, that use case can be explained in the User Guide.
> * Documentation
> ** Clarify terminology used everywhere
> *** hash function {{h_i}}: basic hash function without amplification
> *** hash value {{h_i(key)}}: output of a hash function
> *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with 
> AND-amplification using K base hash functions
> *** compound hash function value {{g(key)}}: vector-valued output
> *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with 
> OR-amplification using L compound hash functions
> *** hash table value {{H(key)}}: output of array of vectors
> *** This terminology is largely pulled from Wang et al.'s survey and the 
> multi-probe LSH paper.
> ** Link clearly to documentation (Wikipedia or papers) which matches our 
> terminology and what we implemented
> h3. RandomProjection (or P-Stable Distributions)
> * Rename {{RandomProjection}}
> ** Options include: {{ScalarRandomProjectionLSH}}, 
> {{BucketedRandomProjectionLSH}}, {{PStableLSH}}
> * API privacy
> ** Make randUnitVectors private
> * hashFunction
> ** Currently, this uses OR-amplification for single probing, as we intended.
> ** It does *not* do multiple probing, at least not in the sense of the 
> original MPLSH paper.  We should fix that or at least document its behavior.
> * Documentation
> ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
> ** Also link to the multi-probe LSH paper since that explains this method 
> very clearly.
> ** Clarify hash function and distance metric
> h3. MinHash
> * Rename {{MinHash}} -> {{MinHashLSH}}
> * API privacy
> ** Make randCoefficients, numEntries private
> * hashDistance (used in approxNearestNeighbors)
> ** Update to use average of indicators of hash collisions [SPARK-18334]
> ** See [Wikipedia | 
> https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a 
> reference
> h3. All references
> I'm just listing references I looked at.
> Papers
> * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf]
> * [https://people.csail.mit.edu/indyk/p117-andoni.pdf]
> * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf]
> * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe 
> LSH paper
> Wikipedia
> * 
> 

[jira] [Created] (SPARK-19315) StructType should support nested lookup; throws IllegalArgumentException

2017-01-20 Thread Vinay varma (JIRA)
Vinay varma created SPARK-19315:
---

 Summary: StructType should support nested lookup; throws 
IllegalArgumentException
 Key: SPARK-19315
 URL: https://issues.apache.org/jira/browse/SPARK-19315
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2, 1.6.1
Reporter: Vinay varma
Priority: Minor


Datasets support class composition, and the .joinWith operation on Datasets also 
results in a composed type. However, StructType throws IllegalArgumentException 
for a nested lookup. Since many validations check the schema, this limits us to 
using flattened Datasets only (e.g. org.apache.spark.ml.feature.StringIndexer).

Is there any reason for not supporting such operations?

From an initial check, it looks like adding support for such lookups would break 
the existing contract at:
org.apache.spark.sql.types.StructType
 def fieldIndex(name: String): Int

Example code demonstrating the failure:

import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

case class A(id: Int, name: String)

case class B(id: Int, location: String)

class TestCompositionStruct extends FunSuite {
  val spark = SparkSession.builder()
    .appName("TestCompositionStruct")
    .master("local[4]")
    .getOrCreate()

  import spark.implicits._

  val adf = spark.createDataFrame(List(A(1, "X"), A(2, "Y"))).as[A]
  val bdf = spark.createDataFrame(List(B(1, "X_loc"), B(2, "Y_loc"))).as[B]

  test("supportNestedDataset") {
    val jdf = adf.joinWith(bdf, adf("id") === bdf("id"))
      .withColumnRenamed("_1", "a")
      .withColumnRenamed("_2", "b")
      .as[(A, B)]
    // Nested lookup works through select ...
    assert(jdf.select("a.id").count() > 0)
    // ... but StructType does not support the nested name and throws.
    intercept[IllegalArgumentException](jdf.schema("a.id"))
  }
}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18120) QueryExecutionListener method doesnt' get executed for DataFrameWriter methods

2017-01-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18120:


Assignee: (was: Apache Spark)

> QueryExecutionListener method doesnt' get executed for DataFrameWriter methods
> --
>
> Key: SPARK-18120
> URL: https://issues.apache.org/jira/browse/SPARK-18120
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Salil Surendran
>
> QueryExecutionListener is a class that has methods named onSuccess() and 
> onFailure() that get called when a query is executed. Each of those methods 
> takes a QueryExecution object as a parameter, which can be used for metrics 
> analysis. The listener gets called for several of the Dataset methods like take, 
> head, first, collect etc., but doesn't get called for any of the DataFrameWriter 
> methods like saveAsTable, save etc.
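
For reference, a minimal sketch of registering such a listener (assuming a 
SparkSession named {{spark}}; the output path is hypothetical). With the behaviour 
described here, the callbacks fire for Dataset actions but not for DataFrameWriter 
calls such as save():

{code}
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

spark.listenerManager.register(new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"query '$funcName' succeeded in ${durationNs / 1e6} ms")

  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"query '$funcName' failed: ${exception.getMessage}")
})

spark.range(10).collect()                                       // triggers onSuccess
spark.range(10).write.mode("overwrite").parquet("/tmp/qel-out") // not reported, per this issue
{code}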



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18120) QueryExecutionListener method doesnt' get executed for DataFrameWriter methods

2017-01-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18120:


Assignee: Apache Spark

> QueryExecutionListener method doesnt' get executed for DataFrameWriter methods
> --
>
> Key: SPARK-18120
> URL: https://issues.apache.org/jira/browse/SPARK-18120
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Salil Surendran
>Assignee: Apache Spark
>
> QueryExecutionListener is a class that has methods named onSuccess() and 
> onFailure() that get called when a query is executed. Each of those methods 
> takes a QueryExecution object as a parameter, which can be used for metrics 
> analysis. The listener gets called for several of the Dataset methods like take, 
> head, first, collect etc., but doesn't get called for any of the DataFrameWriter 
> methods like saveAsTable, save etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18120) QueryExecutionListener method doesnt' get executed for DataFrameWriter methods

2017-01-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832362#comment-15832362
 ] 

Apache Spark commented on SPARK-18120:
--

User 'salilsurendran' has created a pull request for this issue:
https://github.com/apache/spark/pull/16664

> QueryExecutionListener method doesnt' get executed for DataFrameWriter methods
> --
>
> Key: SPARK-18120
> URL: https://issues.apache.org/jira/browse/SPARK-18120
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Salil Surendran
>
> QueryExecutionListener is a class that has methods named onSuccess() and 
> onFailure() that get called when a query is executed. Each of those methods 
> takes a QueryExecution object as a parameter, which can be used for metrics 
> analysis. The listener gets called for several of the Dataset methods like take, 
> head, first, collect etc., but doesn't get called for any of the DataFrameWriter 
> methods like saveAsTable, save etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18823) Assignation by column name variable not available or bug?

2017-01-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18823:


Assignee: (was: Apache Spark)

> Assignation by column name variable not available or bug?
> -
>
> Key: SPARK-18823
> URL: https://issues.apache.org/jira/browse/SPARK-18823
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.0.2
> Environment: RStudio Server in EC2 Instances (EMR Service of AWS) Emr 
> 4. Or databricks (community.cloud.databricks.com) .
>Reporter: Vicente Masip
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I really don't know if this is a bug or whether it can be done with some function:
> Sometimes it is very important to assign something to a column whose name has to 
> be accessed through a variable. Outside of SparkR, I have always done this with 
> double brackets, like this:
> # df could be the faithful dataset as a normal data frame or a data.table.
> # accessing by variable name:
> myname = "waiting"
> df[[myname]] <- c(1:nrow(df))
> # or even by column number
> df[[2]] <- df$eruptions
> The error is not caused by the right-hand side of the "<-" assignment operator. 
> The problem is that I can't assign to a column name using a variable or a 
> column number as I do in these examples outside of Spark. It doesn't matter 
> whether I am modifying or creating a column; same problem.
> I have also tried this, with no results:
> val df2 = withColumn(df,"tmp", df$eruptions)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18823) Assignation by column name variable not available or bug?

2017-01-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18823:


Assignee: Apache Spark

> Assignation by column name variable not available or bug?
> -
>
> Key: SPARK-18823
> URL: https://issues.apache.org/jira/browse/SPARK-18823
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.0.2
> Environment: RStudio Server in EC2 Instances (EMR Service of AWS) Emr 
> 4. Or databricks (community.cloud.databricks.com) .
>Reporter: Vicente Masip
>Assignee: Apache Spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I really don't know if this is a bug or whether it can be done with some function:
> Sometimes it is very important to assign something to a column whose name has to 
> be accessed through a variable. Outside of SparkR, I have always done this with 
> double brackets, like this:
> # df could be the faithful dataset as a normal data frame or a data.table.
> # accessing by variable name:
> myname = "waiting"
> df[[myname]] <- c(1:nrow(df))
> # or even by column number
> df[[2]] <- df$eruptions
> The error is not caused by the right-hand side of the "<-" assignment operator. 
> The problem is that I can't assign to a column name using a variable or a 
> column number as I do in these examples outside of Spark. It doesn't matter 
> whether I am modifying or creating a column; same problem.
> I have also tried this, with no results:
> val df2 = withColumn(df,"tmp", df$eruptions)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18823) Assignation by column name variable not available or bug?

2017-01-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832324#comment-15832324
 ] 

Apache Spark commented on SPARK-18823:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/16663

> Assignation by column name variable not available or bug?
> -
>
> Key: SPARK-18823
> URL: https://issues.apache.org/jira/browse/SPARK-18823
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 2.0.2
> Environment: RStudio Server in EC2 Instances (EMR Service of AWS) Emr 
> 4. Or databricks (community.cloud.databricks.com) .
>Reporter: Vicente Masip
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I really don't know if this is a bug or whether it can be done with some function:
> Sometimes it is very important to assign something to a column whose name has to 
> be accessed through a variable. Outside of SparkR, I have always done this with 
> double brackets, like this:
> # df could be the faithful dataset as a normal data frame or a data.table.
> # accessing by variable name:
> myname = "waiting"
> df[[myname]] <- c(1:nrow(df))
> # or even by column number
> df[[2]] <- df$eruptions
> The error is not caused by the right-hand side of the "<-" assignment operator. 
> The problem is that I can't assign to a column name using a variable or a 
> column number as I do in these examples outside of Spark. It doesn't matter 
> whether I am modifying or creating a column; same problem.
> I have also tried this, with no results:
> val df2 = withColumn(df,"tmp", df$eruptions)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19314) Do not allow sort before aggregation in Structured Streaming plan

2017-01-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832299#comment-15832299
 ] 

Apache Spark commented on SPARK-19314:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/16662

> Do not allow sort before aggregation in Structured Streaming plan
> -
>
> Key: SPARK-19314
> URL: https://issues.apache.org/jira/browse/SPARK-19314
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Blocker
>
> Sort in a streaming plan should be allowed only after an aggregation in 
> complete mode. Currently it is incorrectly allowed when present anywhere in 
> the plan, which gives unpredictable, potentially incorrect results.
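
A minimal sketch of the intended rule, assuming a streaming DataFrame {{streamDf}} 
with a {{word}} column (names are illustrative): sorting is only meaningful on the 
complete-mode output of an aggregation.

{code}
import org.apache.spark.sql.functions.desc

// Allowed: sort applied to the result of an aggregation, emitted in complete mode.
val counts = streamDf.groupBy("word").count().orderBy(desc("count"))
counts.writeStream.outputMode("complete").format("console").start()

// Should be rejected by the analyzer: sorting the raw stream before any aggregation.
// streamDf.orderBy("word").writeStream.outputMode("append").format("console").start()
{code}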



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19314) Do not allow sort before aggregation in Structured Streaming plan

2017-01-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19314:


Assignee: Apache Spark  (was: Tathagata Das)

> Do not allow sort before aggregation in Structured Streaming plan
> -
>
> Key: SPARK-19314
> URL: https://issues.apache.org/jira/browse/SPARK-19314
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Blocker
>
> Sort in a streaming plan should be allowed only after an aggregation in 
> complete mode. Currently it is incorrectly allowed when present anywhere in 
> the plan, which gives unpredictable, potentially incorrect results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19314) Do not allow sort before aggregation in Structured Streaming plan

2017-01-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19314:


Assignee: Tathagata Das  (was: Apache Spark)

> Do not allow sort before aggregation in Structured Streaming plan
> -
>
> Key: SPARK-19314
> URL: https://issues.apache.org/jira/browse/SPARK-19314
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Blocker
>
> Sort in a streaming plan should be allowed only after an aggregation in 
> complete mode. Currently it is incorrectly allowed when present anywhere in 
> the plan, which gives unpredictable, potentially incorrect results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19314) Do not allow sort before aggregation in Structured Streaming plan

2017-01-20 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-19314:
-

 Summary: Do not allow sort before aggregation in Structured 
Streaming plan
 Key: SPARK-19314
 URL: https://issues.apache.org/jira/browse/SPARK-19314
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0, 2.0.2
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Blocker


Sort in a streaming plan should be allowed only after an aggregation in complete 
mode. Currently it is incorrectly allowed when present anywhere in the plan, which 
gives unpredictable, potentially incorrect results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19313) GaussianMixture throws cryptic error when number of features is too high

2017-01-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19313:


Assignee: (was: Apache Spark)

> GaussianMixture throws cryptic error when number of features is too high
> 
>
> Key: SPARK-19313
> URL: https://issues.apache.org/jira/browse/SPARK-19313
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Priority: Minor
>
> The following fails
> {code}
> val df = Seq(
>   Vectors.sparse(46400, Array(0, 4), Array(3.0, 8.0)),
>   Vectors.sparse(46400, Array(1, 5), Array(4.0, 9.0)))
>   .map(Tuple1.apply).toDF("features")
> val gm = new GaussianMixture()
> gm.fit(df)
> {code}
> It fails because GMMs allocate an array of size {{numFeatures * numFeatures}} 
> and in this case we'll get integer overflow. We should limit the number of 
> features appropriately.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19313) GaussianMixture throws cryptic error when number of features is too high

2017-01-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19313:


Assignee: Apache Spark

> GaussianMixture throws cryptic error when number of features is too high
> 
>
> Key: SPARK-19313
> URL: https://issues.apache.org/jira/browse/SPARK-19313
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Assignee: Apache Spark
>Priority: Minor
>
> The following fails
> {code}
> val df = Seq(
>   Vectors.sparse(46400, Array(0, 4), Array(3.0, 8.0)),
>   Vectors.sparse(46400, Array(1, 5), Array(4.0, 9.0)))
>   .map(Tuple1.apply).toDF("features")
> val gm = new GaussianMixture()
> gm.fit(df)
> {code}
> It fails because GMMs allocate an array of size {{numFeatures * numFeatures}} 
> and in this case we'll get integer overflow. We should limit the number of 
> features appropriately.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19313) GaussianMixture throws cryptic error when number of features is too high

2017-01-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832290#comment-15832290
 ] 

Apache Spark commented on SPARK-19313:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/16661

> GaussianMixture throws cryptic error when number of features is too high
> 
>
> Key: SPARK-19313
> URL: https://issues.apache.org/jira/browse/SPARK-19313
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Priority: Minor
>
> The following fails
> {code}
> val df = Seq(
>   Vectors.sparse(46400, Array(0, 4), Array(3.0, 8.0)),
>   Vectors.sparse(46400, Array(1, 5), Array(4.0, 9.0)))
>   .map(Tuple1.apply).toDF("features")
> val gm = new GaussianMixture()
> gm.fit(df)
> {code}
> It fails because GMMs allocate an array of size {{numFeatures * numFeatures}} 
> and in this case we'll get integer overflow. We should limit the number of 
> features appropriately.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19313) GaussianMixture throws cryptic error when number of features is too high

2017-01-20 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-19313:


 Summary: GaussianMixture throws cryptic error when number of 
features is too high
 Key: SPARK-19313
 URL: https://issues.apache.org/jira/browse/SPARK-19313
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Reporter: Seth Hendrickson
Priority: Minor


The following fails

{code}
val df = Seq(
  Vectors.sparse(46400, Array(0, 4), Array(3.0, 8.0)),
  Vectors.sparse(46400, Array(1, 5), Array(4.0, 9.0)))
  .map(Tuple1.apply).toDF("features")
val gm = new GaussianMixture()
gm.fit(df)
{code}

It fails because GMMs allocate an array of size {{numFeatures * numFeatures}} 
and in this case we'll get integer overflow. We should limit the number of 
features appropriately.
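
To make the overflow concrete, a quick check in plain Scala (not part of the report):

{code}
// The covariance buffer needs numFeatures * numFeatures entries.
val numFeatures = 46400
println(numFeatures * numFeatures)        // -2142007296: wraps around as a 32-bit Int
println(numFeatures.toLong * numFeatures) // 2152960000, which exceeds Int.MaxValue (2147483647)
{code}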



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18496) java.lang.AssertionError: assertion failed

2017-01-20 Thread John Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832214#comment-15832214
 ] 

John Myers commented on SPARK-18496:


Upgraded to 2.1.0 using java8, same problem exists, cannot reproduce on demand.

> java.lang.AssertionError: assertion failed
> --
>
> Key: SPARK-18496
> URL: https://issues.apache.org/jira/browse/SPARK-18496
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.2
> Environment: 2.0.2 snapshot
>Reporter: Harish
>
> I am getting this error when i store the estimates from Julia output to a DF 
> and then i do df.cache()
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 177.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
> 177.0 (TID 9722, 10.63.136.108): java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:156)
>   at 
> org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
>   at 
> org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
>   at 
> org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:356)
>   at 
> org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1890)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1903)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1916)
>   at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:441)
>   at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
>   at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   

[jira] [Updated] (SPARK-19299) Nulls in non nullable columns cause data corruption in parquet

2017-01-20 Thread Franklyn Dsouza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franklyn Dsouza updated SPARK-19299:

Summary: Nulls in non nullable columns cause data corruption in parquet  
(was: Nulls in non nullable columns causes data corruption in parquet)

> Nulls in non nullable columns cause data corruption in parquet
> --
>
> Key: SPARK-19299
> URL: https://issues.apache.org/jira/browse/SPARK-19299
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.6.0, 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Franklyn Dsouza
>Priority: Critical
>
> The problem we're seeing is that if a null occurs in a non-nullable field and 
> is written down to parquet, the resulting file gets corrupted and cannot be 
> read back correctly.
> One way this can occur is if a long value in Python overflows the SQL 
> LongType; this results in a null value inside the dataframe.
> We're also seeing that the behaviour differs depending on whether or not 
> the vectorized reader is enabled.
> Here's an example in PySpark
> {code}
> from datetime import datetime
> from pyspark.sql import types
> data = [
>   (1, 6),
>   (2, 7),
>   (3, 2 ** 64), # value overflows sql LongType
>   (4, 8),
>   (5, 9)
> ]
> schema = types.StructType([
>   types.StructField("index", types.LongType(), False),
>   types.StructField("long", types.LongType(), False),
> ])
> df = sqlCtx.createDataFrame(data, schema)
> df.collect()
> df.write.parquet("corrupt_parquet")
> df_parquet = sqlCtx.read.parquet("corrupt_parquet/*.parquet")
> df_parquet.collect()
> {code}
> with the vectorized reader enabled this causes
> {code}
> In [2]: df.collect()
> Out[2]:
> [Row(index=1, long=6),
>  Row(index=2, long=7),
>  Row(index=3, long=None),
>  Row(index=4, long=8),
>  Row(index=5, long=9)]
> In [3]: df_parquet.collect()
> Out[3]:
> [Row(index=1, long=6),
>  Row(index=2, long=7),
>  Row(index=3, long=8),
>  Row(index=4, long=9),
>  Row(index=5, long=5)]
> {code}
> As you can see, reading the data back from disk causes values to get shifted 
> up and across columns.
> With the vectorized reader disabled, we are completely unable to read the file.
> {code}
> Py4JJavaError: An error occurred while calling o143.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 
> (TID 3, localhost): org.apache.parquet.io.ParquetDecodingException: Can not 
> read value at 4 in block 0 in file 
> file:/Users/franklyndsouza/dev/starscream/corrupt/part-r-0-4fa5aee8-2138-4e0c-b6d8-22a418d90fd3.snappy.parquet
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can't read value 
> in column [long] INT64 at value 5 out of 5, 5 out of 5 in currentPage. 
> repetition level: 0, definition level: 0
>   at 
> 

[jira] [Updated] (SPARK-19299) Nulls in non nullable columns causes data corruption in parquet

2017-01-20 Thread Franklyn Dsouza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franklyn Dsouza updated SPARK-19299:

Summary: Nulls in non nullable columns causes data corruption in parquet  
(was: Nulls in non nullable columns cause data corruption in parquet)

> Nulls in non nullable columns causes data corruption in parquet
> ---
>
> Key: SPARK-19299
> URL: https://issues.apache.org/jira/browse/SPARK-19299
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.6.0, 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Franklyn Dsouza
>Priority: Critical
>
> The problem we're seeing is that if a null occurs in a non-nullable field and 
> is written down to parquet, the resulting file gets corrupted and cannot be 
> read back correctly.
> One way this can occur is if a long value in Python overflows the SQL 
> LongType, which results in a null value inside the dataframe.
> We're also seeing that the behaviour is different depending on whether or not 
> the vectorized reader is enabled.
> Here's an example in PySpark:
> {code}
> from datetime import datetime
> from pyspark.sql import types
> data = [
>   (1, 6),
>   (2, 7),
>   (3, 2 ** 64), # value overflows sql LongType
>   (4, 8),
>   (5, 9)
> ]
> schema = types.StructType([
>   types.StructField("index", types.LongType(), False),
>   types.StructField("long", types.LongType(), False),
> ])
> df = sqlCtx.createDataFrame(data, schema)
> df.collect()
> df.write.parquet("corrupt_parquet")
> df_parquet = sqlCtx.read.parquet("corrupt_parquet/*.parquet")
> df_parquet.collect()
> {code}
> with the vectorized reader enabled this causes
> {code}
> In [2]: df.collect()
> Out[2]:
> [Row(index=1, long=6),
>  Row(index=2, long=7),
>  Row(index=3, long=None),
>  Row(index=4, long=8),
>  Row(index=5, long=9)]
> In [3]: df_parquet.collect()
> Out[3]:
> [Row(index=1, long=6),
>  Row(index=2, long=7),
>  Row(index=3, long=8),
>  Row(index=4, long=9),
>  Row(index=5, long=5)]
> {code}
> As you can see, reading the data back from disk causes data to get shifted up 
> and across columns.
> With the vectorized reader disabled, we are completely unable to read the file.
> {code}
> Py4JJavaError: An error occurred while calling o143.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 
> (TID 3, localhost): org.apache.parquet.io.ParquetDecodingException: Can not 
> read value at 4 in block 0 in file 
> file:/Users/franklyndsouza/dev/starscream/corrupt/part-r-0-4fa5aee8-2138-4e0c-b6d8-22a418d90fd3.snappy.parquet
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can't read value 
> in column [long] INT64 at value 5 out of 5, 5 out of 5 in currentPage. 
> repetition level: 0, definition level: 0
>   at 
> 

[jira] [Updated] (SPARK-19299) Nulls in non nullable columns causes data corruption in parquet

2017-01-20 Thread Franklyn Dsouza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franklyn Dsouza updated SPARK-19299:

Description: 
The problem we're seeing is that if a null occurs in a non-nullable field and 
is written down to parquet, the resulting file gets corrupted and cannot be 
read back correctly.

One way this can occur is if a long value in Python overflows the SQL 
LongType, which results in a null value inside the dataframe.

We're also seeing that the behaviour is different depending on whether or not 
the vectorized reader is enabled.

Here's an example in PySpark:

{code}
from datetime import datetime
from pyspark.sql import types

data = [
  (1, 6),
  (2, 7),
  (3, 2 ** 64), # value overflows sql LongType
  (4, 8),
  (5, 9)
]

schema = types.StructType([
  types.StructField("index", types.LongType(), False),
  types.StructField("long", types.LongType(), False),
])

df = sqlCtx.createDataFrame(data, schema)

df.collect()

df.write.parquet("corrupt_parquet")

df_parquet = sqlCtx.read.parquet("corrupt_parquet/*.parquet")

df_parquet.collect()
{code}

with the vectorized reader enabled this causes

{code}
In [2]: df.collect()
Out[2]:
[Row(index=1, long=6),
 Row(index=2, long=7),
 Row(index=3, long=None),
 Row(index=4, long=8),
 Row(index=5, long=9)]

In [3]: df_parquet.collect()
Out[3]:
[Row(index=1, long=6),
 Row(index=2, long=7),
 Row(index=3, long=8),
 Row(index=4, long=9),
 Row(index=5, long=5)]
{code}

As you can see, reading the data back from disk causes data to get shifted up 
and across columns.

With the vectorized reader disabled, we are completely unable to read the file.

{code}
Py4JJavaError: An error occurred while calling o143.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 
3, localhost): org.apache.parquet.io.ParquetDecodingException: Can not read 
value at 4 in block 0 in file 
file:/Users/franklyndsouza/dev/starscream/corrupt/part-r-0-4fa5aee8-2138-4e0c-b6d8-22a418d90fd3.snappy.parquet
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can't read value in 
column [long] INT64 at value 5 out of 5, 5 out of 5 in currentPage. repetition 
level: 0, definition level: 0
at 
org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:462)
at 
org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:364)
at 
org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 19 more
Caused by: org.apache.parquet.io.ParquetDecodingException: could not read long
at 
org.apache.parquet.column.values.plain.PlainValuesReader$LongPlainValuesReader.readLong(PlainValuesReader.java:131)
at 
org.apache.parquet.column.impl.ColumnReaderImpl$2$4.read(ColumnReaderImpl.java:258)
at 

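Until Spark itself rejects such values, a minimal client-side guard is possible. The sketch below assumes the same data, schema and sqlCtx as in the reproduction above; the helper name check_long_bounds is made up for illustration and is not part of any Spark API.

{code}
# Hypothetical pre-flight check, not part of Spark: reject Python ints that
# cannot be represented as a SQL LongType (signed 64-bit) instead of letting
# them silently become null in a non-nullable column.
LONG_MIN, LONG_MAX = -(2 ** 63), 2 ** 63 - 1

def check_long_bounds(rows, long_column_indexes):
    """Raise ValueError if any listed column holds an int outside the int64 range."""
    for row_number, row in enumerate(rows):
        for index in long_column_indexes:
            value = row[index]
            if value is not None and not (LONG_MIN <= value <= LONG_MAX):
                raise ValueError(
                    "row %d, column %d: %r does not fit in a LongType"
                    % (row_number, index, value))

check_long_bounds(data, long_column_indexes=[0, 1])  # raises on (3, 2 ** 64)
df = sqlCtx.createDataFrame(data, schema)
{code}
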
[jira] [Updated] (SPARK-19299) Nulls in non nullable columns causes data corruption in parquet

2017-01-20 Thread Franklyn Dsouza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franklyn Dsouza updated SPARK-19299:

Description: 
The problem we're seeing is that if a null occurs in a non-nullable field and is 
written down to parquet, the resulting file gets corrupted and cannot be read 
back correctly.

One way this can occur is when a long value in Python is too big to fit 
into a SQL LongType; it gets cast to null.

We're also seeing that the behaviour is different depending on whether or not 
the vectorized reader is enabled.

Here's an example in PySpark:

{code}
from datetime import datetime
from pyspark.sql import types

data = [
  (1, 6),
  (2, 7),
  (3, 2 ** 64), # value overflows sql LongType
  (4, 8),
  (5, 9)
]

schema = types.StructType([
  types.StructField("index", types.LongType(), False),
  types.StructField("long", types.LongType(), False),
])

df = sqlCtx.createDataFrame(data, schema)

df.collect()

df.write.parquet("corrupt_parquet")

df_parquet = sqlCtx.read.parquet("corrupt_parquet/*.parquet")

df_parquet.collect()
{code}

with the vectorized reader enabled this causes

{code}
In [2]: df.collect()
Out[2]:
[Row(index=1, long=6),
 Row(index=2, long=7),
 Row(index=3, long=None),
 Row(index=4, long=8),
 Row(index=5, long=9)]

In [3]: df_parquet.collect()
Out[3]:
[Row(index=1, long=6),
 Row(index=2, long=7),
 Row(index=3, long=8),
 Row(index=4, long=9),
 Row(index=5, long=5)]
{code}

As you can see, reading the data back from disk causes data to get shifted up 
and across columns.

With the vectorized reader disabled, we are completely unable to read the file.

{code}
Py4JJavaError: An error occurred while calling o143.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 
3, localhost): org.apache.parquet.io.ParquetDecodingException: Can not read 
value at 4 in block 0 in file 
file:/Users/franklyndsouza/dev/starscream/corrupt/part-r-0-4fa5aee8-2138-4e0c-b6d8-22a418d90fd3.snappy.parquet
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can't read value in 
column [long] INT64 at value 5 out of 5, 5 out of 5 in currentPage. repetition 
level: 0, definition level: 0
at 
org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:462)
at 
org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:364)
at 
org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 19 more
Caused by: org.apache.parquet.io.ParquetDecodingException: could not read long
at 
org.apache.parquet.column.values.plain.PlainValuesReader$LongPlainValuesReader.readLong(PlainValuesReader.java:131)
at 
org.apache.parquet.column.impl.ColumnReaderImpl$2$4.read(ColumnReaderImpl.java:258)
at 

[jira] [Updated] (SPARK-19299) Nulls in non nullable columns causes data corruption in parquet

2017-01-20 Thread Franklyn Dsouza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franklyn Dsouza updated SPARK-19299:

Description: 
The problem we're seeing is that if a null occurs in a non-nullable field and 
is written down to parquet, the resulting file gets corrupted and cannot be 
read back correctly.

One way this can occur is when a long value in Python is too big to fit 
into a SQL LongType; it gets cast to null.

We're also seeing that the behaviour is different depending on whether or not 
the vectorized reader is enabled.

Here's an example in PySpark:

{code}
from datetime import datetime
from pyspark.sql import types

data = [
  (1, 6),
  (2, 7),
  (3, 2 ** 64), # value overflows sql LongType
  (4, 8),
  (5, 9)
]

schema = types.StructType([
  types.StructField("index", types.LongType(), False),
  types.StructField("long", types.LongType(), False),
])

df = sqlCtx.createDataFrame(data, schema)

df.collect()

df.write.parquet("corrupt_parquet")

df_parquet = sqlCtx.read.parquet("corrupt_parquet/*.parquet")

df_parquet.collect()
{code}

with the vectorized reader enabled this causes

{code}
In [2]: df.collect()
Out[2]:
[Row(index=1, long=6),
 Row(index=2, long=7),
 Row(index=3, long=None),
 Row(index=4, long=8),
 Row(index=5, long=9)]

In [3]: df_parquet.collect()
Out[3]:
[Row(index=1, long=6),
 Row(index=2, long=7),
 Row(index=3, long=8),
 Row(index=4, long=9),
 Row(index=5, long=5)]
{code}

As you can see, reading the data back from disk causes data to get shifted up 
and across columns.

With the vectorized reader disabled, we are completely unable to read the file.

{code}
Py4JJavaError: An error occurred while calling o143.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 
3, localhost): org.apache.parquet.io.ParquetDecodingException: Can not read 
value at 4 in block 0 in file 
file:/Users/franklyndsouza/dev/starscream/corrupt/part-r-0-4fa5aee8-2138-4e0c-b6d8-22a418d90fd3.snappy.parquet
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can't read value in 
column [long] INT64 at value 5 out of 5, 5 out of 5 in currentPage. repetition 
level: 0, definition level: 0
at 
org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:462)
at 
org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:364)
at 
org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209)
... 19 more
Caused by: org.apache.parquet.io.ParquetDecodingException: could not read long
at 
org.apache.parquet.column.values.plain.PlainValuesReader$LongPlainValuesReader.readLong(PlainValuesReader.java:131)
at 
org.apache.parquet.column.impl.ColumnReaderImpl$2$4.read(ColumnReaderImpl.java:258)
at 

[jira] [Comment Edited] (SPARK-19299) Nulls in non nullable columns causes data corruption in parquet

2017-01-20 Thread Jason White (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832195#comment-15832195
 ] 

Jason White edited comment on SPARK-19299 at 1/20/17 6:14 PM:
--

These seem like two or three separate issues.
- Python long doesn't fit into Scala long, resulting in Null even in a 
non-nullable field
- Python datetime near the epoch doesn't fit ( ? ) into a Scala timestamp, 
resulting in Null even in a non-nullable field
- DataFrame serialization to disk assumes, but doesn't verify, no nulls exist 
in non-nullable fields. This assumption is obviously fragile. (Also tested with 
JSON serialization - the column simply disappears in the JSON output for that 
row)


was (Author: jason.white):
These seem like two or three separate issues.
- Python long doesn't fit into Scala long, resulting in Null even in a 
non-nullable field
- Python datetime near the epoch doesn't fit ( ? ) into a Scala timestamp, 
resulting in Null even in a non-nullable field
- DataFrame serialization to disk doesn't assumes, but doesn't verify, no nulls 
exist in non-nullable fields. This assumption is obviously fragile. (Also 
tested with JSON serialization - the column simply disappears in the JSON 
output for that row)

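A minimal sketch of the kind of verification described in the comment above, assuming plain PySpark 2.x DataFrame APIs; the helper name assert_no_nulls_in_non_nullable is made up here, and this is an application-level workaround rather than a proposed change to Spark's writer.

{code}
from pyspark.sql import functions as F

def assert_no_nulls_in_non_nullable(df):
    """Fail fast if a column declared non-nullable actually contains nulls."""
    non_nullable = [f.name for f in df.schema.fields if not f.nullable]
    if not non_nullable:
        return
    # One pass over the data: count nulls per non-nullable column.
    counts = df.select(
        [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in non_nullable]
    ).collect()[0]
    bad = {c: counts[c] for c in non_nullable if counts[c]}
    if bad:
        raise ValueError("nulls found in non-nullable columns: %s" % bad)

assert_no_nulls_in_non_nullable(df)  # call this before df.write.parquet(...)
{code}
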
> Nulls in non nullable columns causes data corruption in parquet
> ---
>
> Key: SPARK-19299
> URL: https://issues.apache.org/jira/browse/SPARK-19299
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.6.0, 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Franklyn Dsouza
>Priority: Critical
>
> The problem we're seeing is that if a null occurs in a non-nullable field and 
> is written down to parquet, the resulting file gets corrupted and cannot be 
> read back correctly.
> One way this can occur is when a long value in Python is too big to fit 
> into a SQL LongType; it gets cast to null.
> We're also seeing that the behaviour is different depending on whether or not 
> the vectorized reader is enabled.
> Here's an example in PySpark:
> {code}
> from datetime import datetime
> from pyspark.sql import types
> data = [
>   (1, 6),
>   (2, 7),
>   (3, 2 ** 64), # value overflows sql LongType
>   (4, 8),
>   (5, 9)
> ]
> schema = types.StructType([
>   types.StructField("index", types.LongType(), False),
>   types.StructField("long", types.LongType(), False),
> ])
> df = sqlCtx.createDataFrame(data, schema)
> df.collect()
> df.write.parquet("corrupt_parquet")
> df_parquet = sqlCtx.read.parquet("corrupt_parquet/*.parquet")
> df_parquet.collect()
> {code}
> with the vectorized reader enabled this causes
> {code}
> In [2]: df.collect()
> Out[2]:
> [Row(index=1, long=6),
>  Row(index=2, long=7),
>  Row(index=3, long=None),
>  Row(index=4, long=8),
>  Row(index=5, long=9)]
> In [3]: df_parquet.collect()
> Out[3]:
> [Row(index=1, long=6),
>  Row(index=2, long=7),
>  Row(index=3, long=8),
>  Row(index=4, long=9),
>  Row(index=5, long=5)]
> {code}
> As you can see, reading the data back from disk causes data to get shifted up 
> and across columns.
> With the vectorized reader disabled, we are completely unable to read the file.
> {code}
> Py4JJavaError: An error occurred while calling o143.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 
> (TID 3, localhost): org.apache.parquet.io.ParquetDecodingException: Can not 
> read value at 4 in block 0 in file 
> file:/Users/franklyndsouza/dev/starscream/corrupt/part-r-0-4fa5aee8-2138-4e0c-b6d8-22a418d90fd3.snappy.parquet
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> 

[jira] [Updated] (SPARK-19299) Nulls in non nullable columns causes data corruption in parquet

2017-01-20 Thread Franklyn Dsouza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franklyn Dsouza updated SPARK-19299:

Priority: Critical  (was: Major)

> Nulls in non nullable columns causes data corruption in parquet
> ---
>
> Key: SPARK-19299
> URL: https://issues.apache.org/jira/browse/SPARK-19299
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.6.0, 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Franklyn Dsouza
>Priority: Critical
>
> The problem we're seeing is that if a null occurs in a non-nullable field and 
> is written down to parquet, the resulting file gets corrupted and cannot be 
> read back correctly.
> One way this can occur is when a long value in Python is too big to fit 
> into a SQL LongType; it gets cast to null.
> We're also seeing that the behaviour is different depending on whether or not 
> the vectorized reader is enabled.
> Here's an example in PySpark:
> {code}
> from datetime import datetime
> from pyspark.sql import types
> data = [
>   (1, 6),
>   (2, 7),
>   (3, 2 ** 64), # value overflows sql LongType
>   (4, 8),
>   (5, 9)
> ]
> schema = types.StructType([
>   types.StructField("index", types.LongType(), False),
>   types.StructField("long", types.LongType(), False),
> ])
> df = sqlCtx.createDataFrame(data, schema)
> df.collect()
> df.write.parquet("corrupt_parquet")
> df_parquet = sqlCtx.read.parquet("corrupt_parquet/*.parquet")
> df_parquet.collect()
> {code}
> with the vectorized reader enabled this causes
> {code}
> In [2]: df.collect()
> Out[2]:
> [Row(index=1, long=6),
>  Row(index=2, long=7),
>  Row(index=3, long=None),
>  Row(index=4, long=8),
>  Row(index=5, long=9)]
> In [3]: df_parquet.collect()
> Out[3]:
> [Row(index=1, long=6),
>  Row(index=2, long=7),
>  Row(index=3, long=8),
>  Row(index=4, long=9),
>  Row(index=5, long=5)]
> {code}
> As you can see, reading the data back from disk causes data to get shifted up 
> and across columns.
> With the vectorized reader disabled, we are completely unable to read the file.
> {code}
> Py4JJavaError: An error occurred while calling o143.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 
> (TID 3, localhost): org.apache.parquet.io.ParquetDecodingException: Can not 
> read value at 4 in block 0 in file 
> file:/Users/franklyndsouza/dev/starscream/corrupt/part-r-0-4fa5aee8-2138-4e0c-b6d8-22a418d90fd3.snappy.parquet
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can't read value 
> in column [long] INT64 at value 5 out of 5, 5 out of 5 in currentPage. 
> repetition level: 0, definition level: 0
>   at 
> org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:462)
>   at 
> 

[jira] [Comment Edited] (SPARK-19299) Nulls in non nullable columns causes data corruption in parquet

2017-01-20 Thread Jason White (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832195#comment-15832195
 ] 

Jason White edited comment on SPARK-19299 at 1/20/17 6:09 PM:
--

These seem like two or three separate issues.
- Python long doesn't fit into Scala long, resulting in Null even in a 
non-nullable field
- Python datetime near the epoch doesn't fit ( ? ) into a Scala timestamp, 
resulting in Null even in a non-nullable field
- DataFrame serialization to disk assumes, but doesn't verify, no nulls 
exist in non-nullable fields. This assumption is obviously fragile. (Also 
tested with JSON serialization - the column simply disappears in the JSON 
output for that row)


was (Author: jason.white):
These seem like two or three separate issues.
- Python long doesn't fit into Scala long, resulting in Null even in a 
non-nullable field
- Python datetime near the epoch doesn't fit (?) into a Scala timestamp, 
resulting in Null even in a non-nullable field
- DataFrame serialization to disk doesn't assumes, but doesn't verify, no nulls 
exist in non-nullable fields. This assumption is obviously fragile. (Also 
tested with JSON serialization - the column simply disappears in the JSON 
output for that row)

> Nulls in non nullable columns causes data corruption in parquet
> ---
>
> Key: SPARK-19299
> URL: https://issues.apache.org/jira/browse/SPARK-19299
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.6.0, 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Franklyn Dsouza
>
> The problem we're seeing is that if a null occurs in a non-nullable field and 
> is written down to parquet, the resulting file gets corrupted and cannot be 
> read back correctly.
> One way this can occur is when a long value in Python is too big to fit 
> into a SQL LongType; it gets cast to null.
> We're also seeing that the behaviour is different depending on whether or not 
> the vectorized reader is enabled.
> Here's an example in PySpark:
> {code}
> from datetime import datetime
> from pyspark.sql import types
> data = [
>   (1, 6),
>   (2, 7),
>   (3, 2 ** 64), # value overflows sql LongType
>   (4, 8),
>   (5, 9)
> ]
> schema = types.StructType([
>   types.StructField("index", types.LongType(), False),
>   types.StructField("long", types.LongType(), False),
> ])
> df = sqlCtx.createDataFrame(data, schema)
> df.collect()
> df.write.parquet("corrupt_parquet")
> df_parquet = sqlCtx.read.parquet("corrupt_parquet/*.parquet")
> df_parquet.collect()
> {code}
> with the vectorized reader enabled this causes
> {code}
> In [2]: df.collect()
> Out[2]:
> [Row(index=1, long=6),
>  Row(index=2, long=7),
>  Row(index=3, long=None),
>  Row(index=4, long=8),
>  Row(index=5, long=9)]
> In [3]: df_parquet.collect()
> Out[3]:
> [Row(index=1, long=6),
>  Row(index=2, long=7),
>  Row(index=3, long=8),
>  Row(index=4, long=9),
>  Row(index=5, long=5)]
> {code}
> As you can see, reading the data back from disk causes data to get shifted up 
> and across columns.
> With the vectorized reader disabled, we are completely unable to read the file.
> {code}
> Py4JJavaError: An error occurred while calling o143.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 
> (TID 3, localhost): org.apache.parquet.io.ParquetDecodingException: Can not 
> read value at 4 in block 0 in file 
> file:/Users/franklyndsouza/dev/starscream/corrupt/part-r-0-4fa5aee8-2138-4e0c-b6d8-22a418d90fd3.snappy.parquet
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> 

[jira] [Commented] (SPARK-19299) Nulls in non nullable columns causes data corruption in parquet

2017-01-20 Thread Jason White (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832195#comment-15832195
 ] 

Jason White commented on SPARK-19299:
-

These seem like two or three separate issues.
- Python long doesn't fit into Scala long, resulting in Null even in a 
non-nullable field
- Python datetime near the epoch doesn't fit (?) into a Scala timestamp, 
resulting in Null even in a non-nullable field
- DataFrame serialization to disk assumes, but doesn't verify, no nulls 
exist in non-nullable fields. This assumption is obviously fragile. (Also 
tested with JSON serialization - the column simply disappears in the JSON 
output for that row)

> Nulls in non nullable columns causes data corruption in parquet
> ---
>
> Key: SPARK-19299
> URL: https://issues.apache.org/jira/browse/SPARK-19299
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.6.0, 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Franklyn Dsouza
>
> The problem we're seeing is that if a null occurs in a non-nullable field and 
> is written down to parquet, the resulting file gets corrupted and cannot be 
> read back correctly.
> One way this can occur is when a long value in Python is too big to fit 
> into a SQL LongType; it gets cast to null.
> We're also seeing that the behaviour is different depending on whether or not 
> the vectorized reader is enabled.
> Here's an example in PySpark:
> {code}
> from datetime import datetime
> from pyspark.sql import types
> data = [
>   (1, 6),
>   (2, 7),
>   (3, 2 ** 64), # value overflows sql LongType
>   (4, 8),
>   (5, 9)
> ]
> schema = types.StructType([
>   types.StructField("index", types.LongType(), False),
>   types.StructField("long", types.LongType(), False),
> ])
> df = sqlCtx.createDataFrame(data, schema)
> df.collect()
> df.write.parquet("corrupt_parquet")
> df_parquet = sqlCtx.read.parquet("corrupt_parquet/*.parquet")
> df_parquet.collect()
> {code}
> with the vectorized reader enabled this causes
> {code}
> In [2]: df.collect()
> Out[2]:
> [Row(index=1, long=6),
>  Row(index=2, long=7),
>  Row(index=3, long=None),
>  Row(index=4, long=8),
>  Row(index=5, long=9)]
> In [3]: df_parquet.collect()
> Out[3]:
> [Row(index=1, long=6),
>  Row(index=2, long=7),
>  Row(index=3, long=8),
>  Row(index=4, long=9),
>  Row(index=5, long=5)]
> {code}
> As you can see, reading the data back from disk causes data to get shifted up 
> and across columns.
> With the vectorized reader disabled, we are completely unable to read the file.
> {code}
> Py4JJavaError: An error occurred while calling o143.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 
> (TID 3, localhost): org.apache.parquet.io.ParquetDecodingException: Can not 
> read value at 4 in block 0 in file 
> file:/Users/franklyndsouza/dev/starscream/corrupt/part-r-0-4fa5aee8-2138-4e0c-b6d8-22a418d90fd3.snappy.parquet
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> 

[jira] [Comment Edited] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2017-01-20 Thread David S (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832157#comment-15832157
 ] 

David S edited comment on SPARK-18392 at 1/20/17 6:02 PM:
--

Hi, I have a question about the approx Nearest Neighbor implementation. I 
understand that you first search the dataset for hash signatures which are 
equal to the key's hash signature, but this implies comparing against all 
instances. Why not first separate the hashed dataset into buckets, and then 
compare the hash key against the buckets rather than against each instance? I 
hope you understand my question, because my English is not very good, sorry.



was (Author: david s):
Hi, I have a question about the approx Nearest Neighbor implementation. I 
understand that you first search the hash signatures in the dataset wich are 
equal to the key hash signature, but this implicate the comparision with all 
instances. Why not first have  separate the hash signature dataset in the 
buckets,  and in this way compare the hash key against the buckets and not 
against each instance?. I hope you understand my question, because my english 
not is  very well, sorry.


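For illustration only, a toy sketch in plain Python of the bucketing idea raised above, with dictionaries standing in for whatever the MLlib implementation actually does; the names build_buckets and candidate_rows are invented. Rows are grouped by hash value once, and a query is then compared only against rows that share at least one hash value with it.

{code}
from collections import defaultdict

def build_buckets(signatures):
    """signatures: {row_id: list of hash values}. Returns hash value -> set of row ids."""
    buckets = defaultdict(set)
    for row_id, signature in signatures.items():
        for h in signature:
            buckets[h].add(row_id)
    return buckets

def candidate_rows(buckets, query_signature):
    """Rows sharing at least one hash value with the query (OR-amplification style)."""
    candidates = set()
    for h in query_signature:
        candidates |= buckets.get(h, set())
    return candidates

# Toy usage with made-up signatures.
signatures = {"a": [1, 7], "b": [2, 7], "c": [3, 9]}
buckets = build_buckets(signatures)
print(candidate_rows(buckets, [7, 4]))  # {'a', 'b'}: only these need a full distance check
{code}
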
> LSH API, algorithm, and documentation follow-ups
> 
>
> Key: SPARK-18392
> URL: https://issues.apache.org/jira/browse/SPARK-18392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA summarizes discussions from the initial LSH PR 
> [https://github.com/apache/spark/pull/15148] as well as the follow-up for 
> hash distance [https://github.com/apache/spark/pull/15800].  This will be 
> broken into subtasks:
> * API changes (targeted for 2.1)
> * algorithmic fixes (targeted for 2.1)
> * documentation improvements (ideally 2.1, but could slip)
> The major issues we have mentioned are as follows:
> * OR vs AND amplification
> ** Need to make API flexible enough to support both types of amplification in 
> the future
> ** Need to clarify which we support, including in each model function 
> (transform, similarity join, neighbors)
> * Need to clarify which algorithms we have implemented, improve docs and 
> references, and fix the algorithms if needed.
> These major issues are broken down into detailed issues below.
> h3. LSH abstraction
> * Rename {{outputDim}} to something indicative of OR-amplification.
> ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used 
> in the future for AND amplification (Thanks [~mlnick]!)
> * transform
> ** Update output schema to {{Array of Vector}} instead of {{Vector}}.  This 
> is the "raw" output of all hash functions, i.e., with no aggregation for 
> amplification.
> ** Clarify meaning of output in terms of multiple hash functions and 
> amplification.
> ** Note: We will _not_ worry about users using this output for dimensionality 
> reduction; if anything, that use case can be explained in the User Guide.
> * Documentation
> ** Clarify terminology used everywhere
> *** hash function {{h_i}}: basic hash function without amplification
> *** hash value {{h_i(key)}}: output of a hash function
> *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with 
> AND-amplification using K base hash functions
> *** compound hash function value {{g(key)}}: vector-valued output
> *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with 
> OR-amplification using L compound hash functions
> *** hash table value {{H(key)}}: output of array of vectors
> *** This terminology is largely pulled from Wang et al.'s survey and the 
> multi-probe LSH paper.
> ** Link clearly to documentation (Wikipedia or papers) which matches our 
> terminology and what we implemented
> h3. RandomProjection (or P-Stable Distributions)
> * Rename {{RandomProjection}}
> ** Options include: {{ScalarRandomProjectionLSH}}, 
> {{BucketedRandomProjectionLSH}}, {{PStableLSH}}
> * API privacy
> ** Make randUnitVectors private
> * hashFunction
> ** Currently, this uses OR-amplification for single probing, as we intended.
> ** It does *not* do multiple probing, at least not in the sense of the 
> original MPLSH paper.  We should fix that or at least document its behavior.
> * Documentation
> ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
> ** Also link to the multi-probe LSH paper since that explains this method 
> very clearly.
> ** Clarify hash function and distance metric
> h3. MinHash
> * Rename {{MinHash}} -> {{MinHashLSH}}
> * API privacy
> ** Make randCoefficients, numEntries private
> * hashDistance (used in approxNearestNeighbors)
> ** Update to use average of indicators of hash collisions [SPARK-18334]
> ** See [Wikipedia | 
> https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a 
> reference
> 

[jira] [Commented] (SPARK-17248) Add native Scala enum support to Dataset Encoders

2017-01-20 Thread Leif Warner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832163#comment-15832163
 ] 

Leif Warner commented on SPARK-17248:
-

Much more efficient encodings than strings are possible with enums. That's a 
big reason I was looking to encode things as enums (or the Scala sum type 
equivalent, a sealed abstract class with a few case objects that extend it).
Consider:
{code}
object Bool extends Enumeration {
  val True, False = Value
}
{code}

It'd be a lot more efficient to encode a 1 or 0 instead of the strings "True" 
or "False" every time.

> Add native Scala enum support to Dataset Encoders
> -
>
> Key: SPARK-17248
> URL: https://issues.apache.org/jira/browse/SPARK-17248
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Silvio Fiorito
>
> Enable support for Scala enums in Encoders. Ideally, users should be able to 
> use enums as part of case classes automatically.
> Currently, this code...
> {code}
> object MyEnum extends Enumeration {
>   type MyEnum = Value
>   val EnumVal1, EnumVal2 = Value
> }
> case class MyClass(col: MyEnum.MyEnum)
> val data = Seq(MyClass(MyEnum.EnumVal1), MyClass(MyEnum.EnumVal2)).toDS()
> {code}
> ...results in this stacktrace:
> {code}
> java.lang.UnsupportedOperationException: No Encoder found for MyEnum.MyEnum
> - field (class: "scala.Enumeration.Value", name: "col")
> - root class: 
> "line550c9f34c5144aa1a1e76bcac863244717.$read.$iwC.$iwC.$iwC.$iwC.MyClass"
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:598)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:592)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:583)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:583)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:425)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:61)
>   at org.apache.spark.sql.Encoders$.product(Encoders.scala:274)
>   at 
> org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:47)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2017-01-20 Thread David S (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832157#comment-15832157
 ] 

David S commented on SPARK-18392:
-

Hi, I have a question about the approx Nearest Neighbor implementation. I 
understand that you first search the dataset for hash signatures which are 
equal to the key's hash signature, but this implies comparing against all 
instances. Why not first separate the hashed dataset into buckets, and then 
compare the hash key against the buckets rather than against each instance? I 
hope you understand my question, because my English is not very good, sorry.


> LSH API, algorithm, and documentation follow-ups
> 
>
> Key: SPARK-18392
> URL: https://issues.apache.org/jira/browse/SPARK-18392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA summarizes discussions from the initial LSH PR 
> [https://github.com/apache/spark/pull/15148] as well as the follow-up for 
> hash distance [https://github.com/apache/spark/pull/15800].  This will be 
> broken into subtasks:
> * API changes (targeted for 2.1)
> * algorithmic fixes (targeted for 2.1)
> * documentation improvements (ideally 2.1, but could slip)
> The major issues we have mentioned are as follows:
> * OR vs AND amplification
> ** Need to make API flexible enough to support both types of amplification in 
> the future
> ** Need to clarify which we support, including in each model function 
> (transform, similarity join, neighbors)
> * Need to clarify which algorithms we have implemented, improve docs and 
> references, and fix the algorithms if needed.
> These major issues are broken down into detailed issues below.
> h3. LSH abstraction
> * Rename {{outputDim}} to something indicative of OR-amplification.
> ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used 
> in the future for AND amplification (Thanks [~mlnick]!)
> * transform
> ** Update output schema to {{Array of Vector}} instead of {{Vector}}.  This 
> is the "raw" output of all hash functions, i.e., with no aggregation for 
> amplification.
> ** Clarify meaning of output in terms of multiple hash functions and 
> amplification.
> ** Note: We will _not_ worry about users using this output for dimensionality 
> reduction; if anything, that use case can be explained in the User Guide.
> * Documentation
> ** Clarify terminology used everywhere
> *** hash function {{h_i}}: basic hash function without amplification
> *** hash value {{h_i(key)}}: output of a hash function
> *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with 
> AND-amplification using K base hash functions
> *** compound hash function value {{g(key)}}: vector-valued output
> *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with 
> OR-amplification using L compound hash functions
> *** hash table value {{H(key)}}: output of array of vectors
> *** This terminology is largely pulled from Wang et al.'s survey and the 
> multi-probe LSH paper.
> ** Link clearly to documentation (Wikipedia or papers) which matches our 
> terminology and what we implemented
> h3. RandomProjection (or P-Stable Distributions)
> * Rename {{RandomProjection}}
> ** Options include: {{ScalarRandomProjectionLSH}}, 
> {{BucketedRandomProjectionLSH}}, {{PStableLSH}}
> * API privacy
> ** Make randUnitVectors private
> * hashFunction
> ** Currently, this uses OR-amplification for single probing, as we intended.
> ** It does *not* do multiple probing, at least not in the sense of the 
> original MPLSH paper.  We should fix that or at least document its behavior.
> * Documentation
> ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
> ** Also link to the multi-probe LSH paper since that explains this method 
> very clearly.
> ** Clarify hash function and distance metric
> h3. MinHash
> * Rename {{MinHash}} -> {{MinHashLSH}}
> * API privacy
> ** Make randCoefficients, numEntries private
> * hashDistance (used in approxNearestNeighbors)
> ** Update to use average of indicators of hash collisions [SPARK-18334]
> ** See [Wikipedia | 
> https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a 
> reference
> h3. All references
> I'm just listing references I looked at.
> Papers
> * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf]
> * [https://people.csail.mit.edu/indyk/p117-andoni.pdf]
> * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf]
> * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe 
> LSH paper
> Wikipedia
> * 
> [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search]
> * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]



--
This message 

[jira] [Commented] (SPARK-19162) UserDefinedFunction constructor should verify that func is callable

2017-01-20 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832127#comment-15832127
 ] 

Ryan Blue commented on SPARK-19162:
---

[~rxin], I think this one is ready for a final review and commit, too. Thank 
you!

> UserDefinedFunction constructor should verify that func is callable
> ---
>
> Key: SPARK-19162
> URL: https://issues.apache.org/jira/browse/SPARK-19162
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0
>Reporter: Maciej Szymkiewicz
>
> Current state
> Right now `UserDefinedFunctions` don't perform any input type validation. It 
> will accept non-callable objects just to fail with hard to understand 
> traceback:
> {code}
> In [1]: from pyspark.sql.functions import udf
> In [2]: df = spark.range(0, 1)
> In [3]: f = udf(None)
> In [4]: df.select(f()).first()
> 17/01/07 19:30:50 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
> ...
> Py4JJavaError: An error occurred while calling o51.collectToPython.
> ...
> TypeError: 'NoneType' object is not callable
> ...
> {code}
> Proposed
> Apply basic validation for {{func}} argument:
> {code}
> In [7]: udf(None)
> ---
> TypeError Traceback (most recent call last)
>  in ()
> > 1 udf(None)
> ...
> TypeError: func should be a callable object (a function or an instance of a 
> class with __call__). Got 
> {code}
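
A minimal sketch of the proposed check; the helper name _validate_func and the exact wording are illustrative, not the actual patch:

{code}
def _validate_func(func):
    """Reject non-callable objects up front instead of failing later inside the executor."""
    if not callable(func):
        raise TypeError(
            "func should be a callable object (a function or an instance of a "
            "class with __call__). Got %s" % type(func))

_validate_func(len)   # ok: built-in functions are callable
_validate_func(None)  # raises TypeError with a readable message
{code}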



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19160) Decorator for UDF creation.

2017-01-20 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832124#comment-15832124
 ] 

Ryan Blue edited comment on SPARK-19160 at 1/20/17 5:14 PM:


[~rxin], I think this one is ready to be merged. Who is a good person to do a 
final review and commit for python?


was (Author: rdblue):
@rxin, I think this one is ready to be merged. Who is a good person to do a 
final review and commit for python?

> Decorator for UDF creation.
> ---
>
> Key: SPARK-19160
> URL: https://issues.apache.org/jira/browse/SPARK-19160
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0
>Reporter: Maciej Szymkiewicz
>
> Right now there are a few ways we can create UDF:
> - With standalone function:
> {code}
> def _add_one(x):
>     """Adds one"""
>     if x is not None:
>         return x + 1
> add_one = udf(_add_one, IntegerType())
> {code}
> This allows for full control flow, including exception handling, but 
> duplicates variables.
> 
> - With `lambda` expression:
> {code}
> add_one = udf(lambda x: x + 1 if x is not None else None, IntegerType())
> {code}
> No variable duplication but only pure expressions.
> - Using nested functions with immediate call:
> {code}
> def add_one(c):
>     def add_one_(x):
>         if x is not None:
>             return x + 1
>     return udf(add_one_, IntegerType())(c)
> {code}
> Quite verbose but enables full control flow and clearly indicates expected 
> number of arguments.
> 
> - Using `udf` functions as a decorator:
> {code}
> @udf
> def add_one(x):
>     """Adds one"""
>     if x is not None:
>         return x + 1
> {code}
> Possible but only with default `returnType` (or curried `@partial(udf, 
> returnType=IntegerType())`).
> 
> Proposed
> Add `udf` decorator which can be used as follows:
> {code}
> from pyspark.sql.decorators import udf
> @udf(IntegerType())
> def add_one(x):
>     """Adds one"""
>     if x is not None:
>         return x + 1
> {code}
> or 
> {code}
> @udf()
> def strip(x):
>     """Strips String"""
>     if x is not None:
>         return x.strip()
> {code}
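
A rough sketch of how such a decorator could be layered on top of the existing udf function; the name typed_udf is made up here and this is not the implementation that was merged:

{code}
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType

def typed_udf(return_type=StringType()):
    """Hypothetical decorator factory: wrap a plain Python function as a Spark UDF."""
    def decorator(func):
        return udf(func, return_type)
    return decorator

@typed_udf(IntegerType())
def add_one(x):
    """Adds one"""
    if x is not None:
        return x + 1

# add_one is now a UDF and can be used in DataFrame expressions,
# e.g. df.select(add_one(df["x"]))
{code}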



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19160) Decorator for UDF creation.

2017-01-20 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832124#comment-15832124
 ] 

Ryan Blue commented on SPARK-19160:
---

@rxin, I think this one is ready to be merged. Who is a good person to do a 
final review and commit for python?

> Decorator for UDF creation.
> ---
>
> Key: SPARK-19160
> URL: https://issues.apache.org/jira/browse/SPARK-19160
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0
>Reporter: Maciej Szymkiewicz
>
> Right now there are a few ways we can create UDF:
> - With standalone function:
> {code}
> def _add_one(x):
>     """Adds one"""
>     if x is not None:
>         return x + 1
> add_one = udf(_add_one, IntegerType())
> {code}
> This allows for full control flow, including exception handling, but 
> duplicates variables.
> 
> - With a `lambda` expression:
> {code}
> add_one = udf(lambda x: x + 1 if x is not None else None, IntegerType())
> {code}
> No variable duplication, but only pure expressions.
> - Using nested functions with an immediate call:
> {code}
> def add_one(c):
>     def add_one_(x):
>         if x is not None:
>             return x + 1
>     return udf(add_one_, IntegerType())(c)
> {code}
> Quite verbose, but enables full control flow and clearly indicates the expected 
> number of arguments.
> 
> - Using the `udf` function as a decorator:
> {code}
> @udf
> def add_one(x):
>     """Adds one"""
>     if x is not None:
>         return x + 1
> {code}
> Possible, but only with the default `returnType` (or curried `@partial(udf, 
> returnType=IntegerType())`).
> 
> Proposed:
> Add a `udf` decorator which can be used as follows:
> {code}
> from pyspark.sql.decorators import udf
> @udf(IntegerType())
> def add_one(x):
>     """Adds one"""
>     if x is not None:
>         return x + 1
> {code}
> or 
> {code}
> @udf()
> def strip(x):
>     """Strips String"""
>     if x is not None:
>         return x.strip()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19069) Expose task 'status' and 'duration' in spark history server REST API.

2017-01-20 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-19069:
-
Assignee: Parag Chaudhari

> Expose task 'status' and 'duration' in spark history server REST API.
> -
>
> Key: SPARK-19069
> URL: https://issues.apache.org/jira/browse/SPARK-19069
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Parag Chaudhari
>Assignee: Parag Chaudhari
> Fix For: 2.2.0
>
> Attachments: screenshot-1.png
>
>
> Although the Spark history server UI shows the task ‘status’ and ‘duration’ 
> fields, it does not expose these fields in the REST API response. For Spark 
> history server API users, it is not possible to determine task status and 
> duration. The Spark history server has access to task status and duration from 
> the event log, but it does not expose them in the API. This patch proposes to 
> expose the task ‘status’ and ‘duration’ fields in the Spark history server 
> REST API.
> e.g. Spark history server UI: PFA
> e.g. Spark history server REST API response with no ‘status’ and ‘duration’:
> {noformat}
> {
>   "taskId" : 7,
>   "index" : 0,
>   "attempt" : 0,
>   "launchTime" : "2017-01-02T17:32:43.037GMT",
>   "executorId" : "2",
>   "host" : "ip-10-171-154-17.ec2.internal",
>   "taskLocality" : "NODE_LOCAL",
>   "speculative" : false,
>   "accumulatorUpdates" : [ ],
>   "taskMetrics" : {
>     "executorDeserializeTime" : 138,
>     "executorRunTime" : 10524,
>     "resultSize" : 2078,
>     "jvmGcTime" : 240,
>     "resultSerializationTime" : 0,
>     "memoryBytesSpilled" : 0,
>     "diskBytesSpilled" : 0,
>     "inputMetrics" : {
>       "bytesRead" : 0,
>       "recordsRead" : 0
>     },
>     "outputMetrics" : {
>       "bytesWritten" : 7474953,
>       "recordsWritten" : 287254
>     },
>     "shuffleReadMetrics" : {
>       "remoteBlocksFetched" : 4,
>       "localBlocksFetched" : 3,
>       "fetchWaitTime" : 203,
>       "remoteBytesRead" : 4740801,
>       "localBytesRead" : 2011044,
>       "recordsRead" : 134
>     },
>     "shuffleWriteMetrics" : {
>       "bytesWritten" : 0,
>       "writeTime" : 0,
>       "recordsWritten" : 0
>     }
>   }
> }
> {noformat}
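
A hedged sketch of how an API user could read the proposed fields once they are exposed. The history server host/port, application ID, and stage IDs below are placeholders rather than values from this issue, and the endpoint path follows the documented history server REST layout:

{code}
import requests

BASE = "http://localhost:18080/api/v1"          # hypothetical history server
APP_ID = "app-20170102173243-0000"              # hypothetical application id
STAGE_ID, STAGE_ATTEMPT = 3, 0                  # hypothetical stage

url = "{}/applications/{}/stages/{}/{}/taskList".format(
    BASE, APP_ID, STAGE_ID, STAGE_ATTEMPT)
tasks = requests.get(url).json()

for task in tasks:
    # 'status' and 'duration' are the fields this issue adds to the response.
    print(task["taskId"], task.get("status"), task.get("duration"))
{code}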



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19069) Expose task 'status' and 'duration' in spark history server REST API.

2017-01-20 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-19069.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16473
[https://github.com/apache/spark/pull/16473]

> Expose task 'status' and 'duration' in spark history server REST API.
> -
>
> Key: SPARK-19069
> URL: https://issues.apache.org/jira/browse/SPARK-19069
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Parag Chaudhari
> Fix For: 2.2.0
>
> Attachments: screenshot-1.png
>
>
> Although the Spark history server UI shows the task ‘status’ and ‘duration’ 
> fields, it does not expose these fields in the REST API response. For Spark 
> history server API users, it is not possible to determine task status and 
> duration. The Spark history server has access to task status and duration from 
> the event log, but it does not expose them in the API. This patch proposes to 
> expose the task ‘status’ and ‘duration’ fields in the Spark history server 
> REST API.
> e.g. Spark history server UI: PFA
> e.g. Spark history server REST API response with no ‘status’ and ‘duration’:
> {noformat}
> {
>   "taskId" : 7,
>   "index" : 0,
>   "attempt" : 0,
>   "launchTime" : "2017-01-02T17:32:43.037GMT",
>   "executorId" : "2",
>   "host" : "ip-10-171-154-17.ec2.internal",
>   "taskLocality" : "NODE_LOCAL",
>   "speculative" : false,
>   "accumulatorUpdates" : [ ],
>   "taskMetrics" : {
>     "executorDeserializeTime" : 138,
>     "executorRunTime" : 10524,
>     "resultSize" : 2078,
>     "jvmGcTime" : 240,
>     "resultSerializationTime" : 0,
>     "memoryBytesSpilled" : 0,
>     "diskBytesSpilled" : 0,
>     "inputMetrics" : {
>       "bytesRead" : 0,
>       "recordsRead" : 0
>     },
>     "outputMetrics" : {
>       "bytesWritten" : 7474953,
>       "recordsWritten" : 287254
>     },
>     "shuffleReadMetrics" : {
>       "remoteBlocksFetched" : 4,
>       "localBlocksFetched" : 3,
>       "fetchWaitTime" : 203,
>       "remoteBytesRead" : 4740801,
>       "localBytesRead" : 2011044,
>       "recordsRead" : 134
>     },
>     "shuffleWriteMetrics" : {
>       "bytesWritten" : 0,
>       "writeTime" : 0,
>       "recordsWritten" : 0
>     }
>   }
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19312) Spark gives wrong error message when it fails to create a file due to the HDFS quota limit.

2017-01-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832072#comment-15832072
 ] 

Sean Owen commented on SPARK-19312:
---

Hive on Spark is part of Hive, not Spark.

> Spark gives wrong error message when it fails to create a file due to the HDFS 
> quota limit.
> ---
>
> Key: SPARK-19312
> URL: https://issues.apache.org/jira/browse/SPARK-19312
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: CDH 5.8
>Reporter: Rivkin Andrey
>Priority: Minor
>  Labels: hdfs, quota, spark,
>
> If we set a quota on the user space and then try to create a table through Hive 
> on Spark that needs more space than is available, Spark fails with: 
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException):
>  failed to create file 
> /user//hive_db/.hive-staging_hive_/_task_tmp.-ext-10003/_tmp.30_0 
> for DFSClient_NONMAPREDUCE_-27052423_230 for client 192.168.x.x because 
> current leaseholder is trying to recreate file.
> If we change the Hive execution engine to MR and execute the same command - 
> create table - we get:
> Caused by: org.apache.hadoop.hdfs.protocol.DSQuotaExceededException: The 
> DiskSpace quota of /user/ is exceeded: quota = 10737418240 B = 10 GB but 
> diskspace consumed = 11098812438 B = 10.34 GB
> After increasing the quota, Hive on Spark works. 
> The problem is the log message, which is inaccurate and not helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19299) Nulls in non-nullable columns cause data corruption in parquet

2017-01-20 Thread Franklyn Dsouza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832048#comment-15832048
 ] 

Franklyn Dsouza commented on SPARK-19299:
-

These issues are also very likely reproducible in Scala if a null is ever 
written to a non-nullable field.

> Nulls in non-nullable columns cause data corruption in parquet
> ---
>
> Key: SPARK-19299
> URL: https://issues.apache.org/jira/browse/SPARK-19299
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.6.0, 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Franklyn Dsouza
>
> The problem we're seeing is that if a null occurs in a non-nullable field and 
> is written to parquet, the resulting file gets corrupted and cannot be 
> read back correctly.
> One way this can occur is when a long value in Python is too big to fit 
> into a sql LongType; it gets cast to null. 
> We're also seeing that the behaviour is different depending on whether or not 
> the vectorized reader is enabled.
> Here's an example in PySpark
> {code}
> from datetime import datetime
> from pyspark.sql import types
> data = [
>   (1, 6),
>   (2, 7),
>   (3, 2 ** 64), # value overflows sql LongType
>   (4, 8),
>   (5, 9)
> ]
> schema = types.StructType([
>   types.StructField("index", types.LongType(), False),
>   types.StructField("long", types.LongType(), False),
> ])
> df = sqlCtx.createDataFrame(data, schema)
> df.collect()
> df.write.parquet("corrupt_parquet")
> df_parquet = sqlCtx.read.parquet("corrupt_parquet/*.parquet")
> df_parquet.collect()
> {code}
> with the vectorized reader enabled this causes
> {code}
> In [2]: df.collect()
> Out[2]:
> [Row(index=1, long=6),
>  Row(index=2, long=7),
>  Row(index=3, long=None),
>  Row(index=4, long=8),
>  Row(index=5, long=9)]
> In [3]: df_parquet.collect()
> Out[3]:
> [Row(index=1, long=6),
>  Row(index=2, long=7),
>  Row(index=3, long=8),
>  Row(index=4, long=9),
>  Row(index=5, long=5)]
> {code}
> As you can see, reading the data back from disk causes data to get shifted up 
> and between columns.
> With the vectorized reader disabled, we are completely unable to read the file.
> {code}
> Py4JJavaError: An error occurred while calling o143.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 
> (TID 3, localhost): org.apache.parquet.io.ParquetDecodingException: Can not 
> read value at 4 in block 0 in file 
> file:/Users/franklyndsouza/dev/starscream/corrupt/part-r-0-4fa5aee8-2138-4e0c-b6d8-22a418d90fd3.snappy.parquet
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can't read value 
> in column [long] INT64 at value 5 out of 5, 5 out of 5 in currentPage. 
> repetition level: 0, definition level: 0
>   at 
> org.apache.parquet.column.impl.ColumnReaderImpl.readValue(ColumnReaderImpl.java:462)

[jira] [Commented] (SPARK-19282) RandomForestRegressionModel should expose getMaxDepth

2017-01-20 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832022#comment-15832022
 ] 

Xin Ren commented on SPARK-19282:
-

Thank you, Nick. 
I'll give it a try and fix it. :)

> RandomForestRegressionModel should expose getMaxDepth
> -
>
> Key: SPARK-19282
> URL: https://issues.apache.org/jira/browse/SPARK-19282
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Nick Lothian
>Priority: Minor
>
> Currently it isn't clear how to get the max depth of a 
> RandomForestRegressionModel (e.g., after doing a grid search).
> It is possible to call
> {{regressor._java_obj.getMaxDepth()}} 
> but most other decision trees allow
> {{regressor.getMaxDepth()}} 
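
A minimal PySpark sketch of the workaround described above, assuming an existing SparkSession named `spark`; the tiny training DataFrame is made up purely for illustration:

{code}
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import RandomForestRegressor

# Illustrative toy dataset: (label, features).
df = spark.createDataFrame(
    [(1.0, Vectors.dense(0.0)),
     (2.0, Vectors.dense(1.0)),
     (3.0, Vectors.dense(2.0))],
    ["label", "features"])

model = RandomForestRegressor(maxDepth=3).fit(df)

# Workaround today: reach into the wrapped JVM model.
print(model._java_obj.getMaxDepth())

# What this issue asks for: model.getMaxDepth() on the Python model itself.
{code}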



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19299) Nulls in non-nullable columns cause data corruption in parquet

2017-01-20 Thread Jason White (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832006#comment-15832006
 ] 

Jason White commented on SPARK-19299:
-

Also seeing this same behaviour in Spark 2.0.1 when creating a DataFrame with a 
timestamp at or near the epoch.

My computer is in Eastern time, so 1969-12-31T19:00:00-0500 is unix timestamp 0.

{code}
>>> from datetime import datetime
>>> dt = datetime(1969, 12, 31, 19, 0, 0)

>>> from pyspark.sql import SQLContext
>>> sql = SQLContext(sc)

>>> from pyspark.sql.types import StructType, StructField, TimestampType
>>> schema = StructType([StructField('ts', TimestampType(), False)])
>>> df = sql.createDataFrame([(dt,)], schema)

>>> df.schema
StructType(List(StructField(ts,TimestampType,false)))

>>> df.collect()
[Row(ts=None)]
{code}

Weirdly, this continues on for over half an hour after the epoch:
{code}
>>> dt = datetime(1969, 12, 31, 19, 35, 47)
>>> df = sql.createDataFrame([(dt,)], schema)
>>> df.collect()
[Row(ts=None)]
>>> dt = datetime(1969, 12, 31, 19, 35, 48)
>>> df = sql.createDataFrame([(dt,)], schema)
>>> df.collect()
[Row(ts=datetime.datetime(1969, 12, 31, 19, 35, 48))]
{code}

> Nulls in non-nullable columns cause data corruption in parquet
> ---
>
> Key: SPARK-19299
> URL: https://issues.apache.org/jira/browse/SPARK-19299
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.6.0, 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Franklyn Dsouza
>
> The problem we're seeing is that if a null occurs in a non-nullable field and 
> is written to parquet, the resulting file gets corrupted and cannot be 
> read back correctly.
> One way this can occur is when a long value in Python is too big to fit 
> into a sql LongType; it gets cast to null. 
> We're also seeing that the behaviour is different depending on whether or not 
> the vectorized reader is enabled.
> Here's an example in PySpark
> {code}
> from datetime import datetime
> from pyspark.sql import types
> data = [
>   (1, 6),
>   (2, 7),
>   (3, 2 ** 64), # value overflows sql LongType
>   (4, 8),
>   (5, 9)
> ]
> schema = types.StructType([
>   types.StructField("index", types.LongType(), False),
>   types.StructField("long", types.LongType(), False),
> ])
> df = sqlCtx.createDataFrame(data, schema)
> df.collect()
> df.write.parquet("corrupt_parquet")
> df_parquet = sqlCtx.read.parquet("corrupt_parquet/*.parquet")
> df_parquet.collect()
> {code}
> with the vectorized reader enabled this causes
> {code}
> In [2]: df.collect()
> Out[2]:
> [Row(index=1, long=6),
>  Row(index=2, long=7),
>  Row(index=3, long=None),
>  Row(index=4, long=8),
>  Row(index=5, long=9)]
> In [3]: df_parquet.collect()
> Out[3]:
> [Row(index=1, long=6),
>  Row(index=2, long=7),
>  Row(index=3, long=8),
>  Row(index=4, long=9),
>  Row(index=5, long=5)]
> {code}
> As you can see, reading the data back from disk causes data to get shifted up 
> and between columns.
> With the vectorized reader disabled, we are completely unable to read the file.
> {code}
> Py4JJavaError: An error occurred while calling o143.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 
> (TID 3, localhost): org.apache.parquet.io.ParquetDecodingException: Can not 
> read value at 4 in block 0 in file 
> file:/Users/franklyndsouza/dev/starscream/corrupt/part-r-0-4fa5aee8-2138-4e0c-b6d8-22a418d90fd3.snappy.parquet
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> 

[jira] [Commented] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable

2017-01-20 Thread Junfeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15831964#comment-15831964
 ] 

Junfeng commented on SPARK-17602:
-

[~davies] the trouble is really that the Python worker reuse mode does not work for 
many cases. For example, static variables cause trouble across tasks. Many of our 
users do not enable the reuse mode, which leads to this issue.

> PySpark - Performance Optimization Large Size of Broadcast Variable
> ---
>
> Key: SPARK-17602
> URL: https://issues.apache.org/jira/browse/SPARK-17602
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.2, 2.0.0
> Environment: Linux
>Reporter: Xiao Ming Bao
> Attachments: PySpark – Performance Optimization for Large Size of 
> Broadcast variable.pdf
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Problem: currently, on the executor side, the broadcast variable is written to 
> disk as a file, and each Python worker process reads the broadcast from local 
> disk and de-serializes it into a Python object before executing a task. When 
> the size of the broadcast variable is large, the read/de-serialization takes a 
> lot of time. And when the Python worker is NOT reused and the number of tasks 
> is large, performance is very bad, since a Python worker needs to 
> read/de-serialize for each task. 
> Brief of the solution:
>  transfer the broadcast variable to the daemon Python process via a file (or 
> socket/mmap) and deserialize it into an object in the daemon Python process. 
> After a worker Python process is forked by the daemon Python process, the 
> worker automatically has the deserialized object and can use it directly, 
> thanks to the copy-on-write memory behavior of fork on Linux.
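
A standalone, hedged sketch of the fork/copy-on-write idea described above (plain Python, not Spark internals): the parent process deserializes the large object once, and forked children use it without re-reading or re-deserializing it. The pickle file path and the object itself are illustrative only:

{code}
import os
import pickle

# "Daemon" (parent) process: receive the broadcast as a file and deserialize once.
big_object = {i: i * i for i in range(1000000)}   # stand-in for a large broadcast
with open("/tmp/broadcast.pkl", "wb") as f:
    pickle.dump(big_object, f)

with open("/tmp/broadcast.pkl", "rb") as f:
    shared = pickle.load(f)                       # done once, in the parent

# "Worker" (child) processes: fork inherits the deserialized object via
# copy-on-write pages, so no per-task read/deserialize is needed.
for task_id in range(4):
    pid = os.fork()
    if pid == 0:
        print(task_id, shared[task_id])
        os._exit(0)

for _ in range(4):
    os.wait()
{code}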



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19311) UDFs disregard UDT type hierarchy

2017-01-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19311:


Assignee: Apache Spark

> UDFs disregard UDT type hierarchy
> -
>
> Key: SPARK-19311
> URL: https://issues.apache.org/jira/browse/SPARK-19311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Gregor Moehler
>Assignee: Apache Spark
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When you define UDTs based on hierarchical traits, UDFs disregard the type 
> hierarchy:
> E.g. I have 2 UDTs based on 2 hierarchical traits. I then define 2 UDFs: the 
> first one returns the derived type, the second takes the base type. This 
> results in an error, although I believe it should be feasible:
> {quote}
> (...)cannot resolve 'UDF(UDF(22))' due to data type mismatch: argument 1 
> requires exampleBaseType type, however, 'UDF(22)' is of exampleFirstSubType 
> type.
> {quote}
> The reason is that DataType defines
> {quote}
> override private[sql] def acceptsType(dataType: DataType) =
> this.getClass == dataType.getClass
> {quote}
> However I believe it should be:
> {quote}
> override private[sql] def acceptsType(dataType: DataType) = dataType match \{
> case other: UserDefinedType[_] =>
>   this.getClass == other.getClass || 
> this.userClass.isAssignableFrom(other.userClass)
> case _ => false
>   \}
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19311) UDFs disregard UDT type hierarchy

2017-01-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19311:


Assignee: (was: Apache Spark)

> UDFs disregard UDT type hierarchy
> -
>
> Key: SPARK-19311
> URL: https://issues.apache.org/jira/browse/SPARK-19311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Gregor Moehler
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When you define UDTs based on hierarchical traits, UDFs disregard the type 
> hierarchy:
> E.g. I have 2 UDTs based on 2 hierarchical traits. I then define 2 UDFs: the 
> first one returns the derived type, the second takes the base type. This 
> results in an error, although I believe it should be feasible:
> {quote}
> (...)cannot resolve 'UDF(UDF(22))' due to data type mismatch: argument 1 
> requires exampleBaseType type, however, 'UDF(22)' is of exampleFirstSubType 
> type.
> {quote}
> The reason is that DataType defines
> {quote}
> override private[sql] def acceptsType(dataType: DataType) =
> this.getClass == dataType.getClass
> {quote}
> However I believe it should be:
> {quote}
> override private[sql] def acceptsType(dataType: DataType) = dataType match \{
> case other: UserDefinedType[_] =>
>   this.getClass == other.getClass || 
> this.userClass.isAssignableFrom(other.userClass)
> case _ => false
>   \}
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19311) UDFs disregard UDT type hierarchy

2017-01-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15831938#comment-15831938
 ] 

Apache Spark commented on SPARK-19311:
--

User 'gmoehler' has created a pull request for this issue:
https://github.com/apache/spark/pull/16660

> UDFs disregard UDT type hierarchy
> -
>
> Key: SPARK-19311
> URL: https://issues.apache.org/jira/browse/SPARK-19311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Gregor Moehler
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When you define UDTs based on hierarchical traits, UDFs disregard the type 
> hierarchy:
> E.g. I have 2 UDTs based on 2 hierarchical traits. I then define 2 UDFs: the 
> first one returns the derived type, the second takes the base type. This 
> results in an error, although I believe it should be feasible:
> {quote}
> (...)cannot resolve 'UDF(UDF(22))' due to data type mismatch: argument 1 
> requires exampleBaseType type, however, 'UDF(22)' is of exampleFirstSubType 
> type.
> {quote}
> The reason is that DataType defines
> {quote}
> override private[sql] def acceptsType(dataType: DataType) =
> this.getClass == dataType.getClass
> {quote}
> However I believe it should be:
> {quote}
> override private[sql] def acceptsType(dataType: DataType) = dataType match \{
> case other: UserDefinedType[_] =>
>   this.getClass == other.getClass || 
> this.userClass.isAssignableFrom(other.userClass)
> case _ => false
>   \}
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19311) UDFs disregard UDT type hierarchy

2017-01-20 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15831888#comment-15831888
 ] 

Liang-Chi Hsieh edited comment on SPARK-19311 at 1/20/17 3:06 PM:
--

[~gmoehler] I think you already have the fix. Can you submit the PR directly? 
Thanks.


was (Author: viirya):
[~Gregor Moehler] I think you already have the fix. Can you submit the PR 
directly? Thanks.

> UDFs disregard UDT type hierarchy
> -
>
> Key: SPARK-19311
> URL: https://issues.apache.org/jira/browse/SPARK-19311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Gregor Moehler
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When you define UDTs based on hierarchical traits, UDFs disregard the type 
> hierarchy:
> E.g. I have 2 UDTs based on 2 hierarchical traits. I then define 2 UDFs: the 
> first one returns the derived type, the second takes the base type. This 
> results in an error, although I believe it should be feasible:
> {quote}
> (...)cannot resolve 'UDF(UDF(22))' due to data type mismatch: argument 1 
> requires exampleBaseType type, however, 'UDF(22)' is of exampleFirstSubType 
> type.
> {quote}
> The reason is that DataType defines
> {quote}
> override private[sql] def acceptsType(dataType: DataType) =
> this.getClass == dataType.getClass
> {quote}
> However I believe it should be:
> {quote}
> override private[sql] def acceptsType(dataType: DataType) = dataType match \{
> case other: UserDefinedType[_] =>
>   this.getClass == other.getClass || 
> this.userClass.isAssignableFrom(other.userClass)
> case _ => false
>   \}
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19311) UDFs disregard UDT type hierarchy

2017-01-20 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15831888#comment-15831888
 ] 

Liang-Chi Hsieh commented on SPARK-19311:
-

[~Gregor Moehler] I think you already have the fix. Can you submit the PR 
directly? Thanks.

> UDFs disregard UDT type hierarchy
> -
>
> Key: SPARK-19311
> URL: https://issues.apache.org/jira/browse/SPARK-19311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Gregor Moehler
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When you define UDTs based on hierarchical traits, UDFs disregard the type 
> hierarchy:
> E.g. I have 2 UDTs based on 2 hierarchical traits. I then define 2 UDFs: the 
> first one returns the derived type, the second takes the base type. This 
> results in an error, although I believe it should be feasible:
> {quote}
> (...)cannot resolve 'UDF(UDF(22))' due to data type mismatch: argument 1 
> requires exampleBaseType type, however, 'UDF(22)' is of exampleFirstSubType 
> type.
> {quote}
> The reason is that DataType defines
> {quote}
> override private[sql] def acceptsType(dataType: DataType) =
> this.getClass == dataType.getClass
> {quote}
> However I believe it should be:
> {quote}
> override private[sql] def acceptsType(dataType: DataType) = dataType match \{
> case other: UserDefinedType[_] =>
>   this.getClass == other.getClass || 
> this.userClass.isAssignableFrom(other.userClass)
> case _ => false
>   \}
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19312) Spark gives wrong error message when it fails to create a file due to the HDFS quota limit.

2017-01-20 Thread Rivkin Andrey (JIRA)
Rivkin Andrey created SPARK-19312:
-

 Summary: Spark gives wrong error message when it fails to create a 
file due to the HDFS quota limit.
 Key: SPARK-19312
 URL: https://issues.apache.org/jira/browse/SPARK-19312
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
 Environment: CDH 5.8
Reporter: Rivkin Andrey
Priority: Minor


If we set a quota on the user space and then try to create a table through Hive on 
Spark that needs more space than is available, Spark fails with: 

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
org.apache.hadoop.hive.ql.metadata.HiveException: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException):
 failed to create file 
/user//hive_db/.hive-staging_hive_/_task_tmp.-ext-10003/_tmp.30_0 
for DFSClient_NONMAPREDUCE_-27052423_230 for client 192.168.x.x because current 
leaseholder is trying to recreate file.


If we change the Hive execution engine to MR and execute the same command - 
create table - we get:

Caused by: org.apache.hadoop.hdfs.protocol.DSQuotaExceededException: The 
DiskSpace quota of /user/ is exceeded: quota = 10737418240 B = 10 GB but 
diskspace consumed = 11098812438 B = 10.34 GB

After increasing the quota, Hive on Spark works. 
The problem is the log message, which is inaccurate and not helpful.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16683) Group by does not work after multiple joins of the same dataframe

2017-01-20 Thread Andrew Ray (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15831878#comment-15831878
 ] 

Andrew Ray commented on SPARK-16683:


I'm working on a solution for this

> Group by does not work after multiple joins of the same dataframe
> -
>
> Key: SPARK-16683
> URL: https://issues.apache.org/jira/browse/SPARK-16683
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
> Environment: local and yarn
>Reporter: Witold Jędrzejewski
> Attachments: code_2.0.txt, Duplicates Problem Presentation.json
>
>
> When I join a dataframe, group by a field from it, then join it again by a 
> different field and group by a field from it, the second aggregation does not 
> trigger.
> A minimal example showing the problem is attached as text to paste into 
> spark-shell (code_2.0.txt).
> The detailed description, minimal example, workaround and possible cause 
> are in the attachment, in the form of a Zeppelin notebook.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19155) MLlib GeneralizedLinearRegression family and link should be case insensitive

2017-01-20 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-19155:
---

Assignee: Yanbo Liang

> MLlib GeneralizedLinearRegression family and link should be case insensitive 
> --
>
> Key: SPARK-19155
> URL: https://issues.apache.org/jira/browse/SPARK-19155
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> ML {{GeneralizedLinearRegression}} only supports lowercase input for 
> {{family}} and {{link}} currently. For example, the following code will throw 
> an exception:
> {code}
> val trainer = new GeneralizedLinearRegression().setFamily("Gamma")
> {code}
> However, R glm only accepts families: {{gaussian, binomial, poisson, Gamma}}. 
> We should make {{family}} and {{link}} case insensitive. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19155) MLlib GeneralizedLinearRegression family and link should be case insensitive

2017-01-20 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-19155:

Description: 
ML {{GeneralizedLinearRegression}} should support both uppercase and lowercase. 
For example, the following code will currently throw an exception:
{code}
val trainer = new GeneralizedLinearRegression().setFamily("Gamma")
{code}
Actually, R glm only accepts families: {{gaussian, binomial, poisson, Gamma}}. 
We should make {{family}} and {{link}} case insensitive. 

  was:
ML {{GeneralizedLinearRegression}} should support both uppercase and lowercase. 
For example, the following code will currently throw an exception:
{code}
val trainer = new GeneralizedLinearRegression().setFamily("Gamma")
{code}
We should make string params case insensitive.


> MLlib GeneralizedLinearRegression family and link should be case insensitive 
> --
>
> Key: SPARK-19155
> URL: https://issues.apache.org/jira/browse/SPARK-19155
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>
> ML {{GeneralizedLinearRegression}} should support both uppercase and 
> lowercase. For example, the following code will currently throw an exception:
> {code}
> val trainer = new GeneralizedLinearRegression().setFamily("Gamma")
> {code}
> Actually, R glm only accepts families: {{gaussian, binomial, poisson, 
> Gamma}}. We should make {{family}} and {{link}} case insensitive. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19155) MLlib GeneralizedLinearRegression family and link should be case insensitive

2017-01-20 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-19155:

Description: 
ML {{GeneralizedLinearRegression}} only supports lowercase input for {{family}} 
and {{link}} currently. For example, the following code will throw an exception:
{code}
val trainer = new GeneralizedLinearRegression().setFamily("Gamma")
{code}
However, R glm only accepts families: {{gaussian, binomial, poisson, Gamma}}. 
We should make {{family}} and {{link}} case insensitive. 

  was:
ML {{GeneralizedLinearRegression}} should support both uppercase and lowercase. 
For example, the following code will currently throw an exception:
{code}
val trainer = new GeneralizedLinearRegression().setFamily("Gamma")
{code}
Actually, R glm only accepts families: {{gaussian, binomial, poisson, Gamma}}. 
We should make {{family}} and {{link}} case insensitive. 


> MLlib GeneralizedLinearRegression family and link should be case insensitive 
> --
>
> Key: SPARK-19155
> URL: https://issues.apache.org/jira/browse/SPARK-19155
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>
> ML {{GeneralizedLinearRegression}} only supports lowercase input for 
> {{family}} and {{link}} currently. For example, the following code will throw 
> an exception:
> {code}
> val trainer = new GeneralizedLinearRegression().setFamily("Gamma")
> {code}
> However, R glm only accepts families: {{gaussian, binomial, poisson, Gamma}}. 
> We should make {{family}} and {{link}} case insensitive. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19155) MLlib GeneralizedLinearRegression family and link should be case insensitive

2017-01-20 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-19155:

    Summary: MLlib GeneralizedLinearRegression family and link should be case 
insensitive   (was: ML GLR string params should support both uppercase and 
lowercase)

> MLlib GeneralizedLinearRegression family and link should be case insensitive 
> --
>
> Key: SPARK-19155
> URL: https://issues.apache.org/jira/browse/SPARK-19155
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>
> ML {{GeneralizedLinearRegression}} should support both uppercase and 
> lowercase. For example, the following code will currently throw an exception:
> {code}
> val trainer = new GeneralizedLinearRegression().setFamily("Gamma")
> {code}
> We should make string params case insensitive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


