Possible SPIP to improve matrix and vector column type support

2018-04-11 Thread Leif Walsh
Hi all, I’ve been playing around with the Vector and Matrix UDTs in pyspark.ml and I’ve found myself wanting more. One minor issue is that with Arrow serialization enabled, these types don’t serialize properly in Python UDF calls or in toPandas. There’s a natural representation for
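
A minimal sketch of the kind of conversion in question (assuming the Spark 2.3 config name spark.sql.execution.arrow.enabled and a DataFrame with a pyspark.ml Vector column; the exact fallback-vs-error behavior varies by version):

    # Illustrative only: VectorUDT columns and the Arrow path for toPandas().
    from pyspark.sql import SparkSession
    from pyspark.ml.linalg import Vectors

    spark = (SparkSession.builder
             .appName("vector-arrow-demo")
             .config("spark.sql.execution.arrow.enabled", "true")
             .getOrCreate())

    df = spark.createDataFrame(
        [(1, Vectors.dense([0.1, 0.2, 0.3])),
         (2, Vectors.sparse(3, [0, 2], [1.0, 3.0]))],
        ["id", "features"])

    # The VectorUDT column is not natively supported by the Arrow serializer,
    # so this conversion does not get the fast Arrow-based path.
    pdf = df.toPandas()
    print(pdf.dtypes)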

Re: Maintenance releases for SPARK-23852?

2018-04-11 Thread Dongjoon Hyun
Great. If we can upgrade the Parquet dependency from 1.8.2 to 1.8.3 in Apache Spark 2.3.1, let's upgrade the ORC dependency from 1.4.1 to 1.4.3 together. Currently, that patch is only merged into the master branch. ORC 1.4.1 has the following issue: https://issues.apache.org/jira/browse/SPARK-23340

Re: Maintenance releases for SPARK-23852?

2018-04-11 Thread Reynold Xin
Seems like this would make sense... we usually make maintenance releases for bug fixes after a month anyway. On Wed, Apr 11, 2018 at 12:52 PM, Henry Robinson wrote: > > > On 11 April 2018 at 12:47, Ryan Blue wrote: > >> I think a 1.8.3 Parquet

Re: Maintenance releases for SPARK-23852?

2018-04-11 Thread Henry Robinson
On 11 April 2018 at 12:47, Ryan Blue wrote: > I think a 1.8.3 Parquet release makes sense for the 2.3.x releases of > Spark. > > To be clear though, this only affects Spark when reading data written by > Impala, right? Or does Parquet CPP also produce data like this? >

Re: Maintenance releases for SPARK-23852?

2018-04-11 Thread Ryan Blue
I think a 1.8.3 Parquet release makes sense for the 2.3.x releases of Spark. To be clear though, this only affects Spark when reading data written by Impala, right? Or does Parquet CPP also produce data like this? On Wed, Apr 11, 2018 at 12:35 PM, Henry Robinson wrote: > Hi

Maintenance releases for SPARK-23852?

2018-04-11 Thread Henry Robinson
Hi all - SPARK-23852 (where a query can silently return wrong results due to a predicate pushdown bug in Parquet) is a fairly bad bug. In other projects I've been involved with, we've released maintenance releases for bugs of this severity. Since Spark 2.4.0 is probably a while away, I wanted
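
A hedged sketch of one possible mitigation until a patched Parquet release is picked up: disabling Parquet predicate pushdown via spark.sql.parquet.filterPushdown, at the cost of scan performance (the path and column name below are placeholders; check the JIRA before relying on this):

    # Possible mitigation sketch: keep Spark from pushing predicates into
    # parquet-mr until a release containing the fix is available.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pushdown-workaround").getOrCreate()

    # Filters are then evaluated by Spark after the scan rather than inside
    # the Parquet reader. Slower, but avoids the buggy pushdown code path.
    spark.conf.set("spark.sql.parquet.filterPushdown", "false")

    df = spark.read.parquet("/path/to/impala_written_table")
    df.filter(df["col"].isNotNull()).show()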

[Structured Streaming] File source, Parquet format: use of the mergeSchema option.

2018-04-11 Thread Gerard Maas
Hi, I'm looking into the Parquet format support for the File source in Structured Streaming. The docs mention the use of the option 'mergeSchema' to merge the schemas of the part files found.[1] What would be the practical use of that in a streaming context? In its batch counterpart,
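
A minimal sketch of the option in question on the streaming file source (the schema and path below are invented for illustration; note the file source requires an explicit schema up front, which is part of why the usefulness of mergeSchema here is unclear):

    # Sketch of the Structured Streaming file source with mergeSchema.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, LongType, StringType

    spark = SparkSession.builder.appName("streaming-mergeschema").getOrCreate()

    schema = StructType([
        StructField("id", LongType()),
        StructField("name", StringType()),
    ])

    stream_df = (spark.readStream
                 .schema(schema)                 # mandatory for file sources
                 .option("mergeSchema", "true")  # the option from [1]
                 .parquet("/data/incoming/parquet"))

    query = (stream_df.writeStream
             .format("console")
             .outputMode("append")
             .start())
    query.awaitTermination()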

Re: cache OS memory and spark usage of it

2018-04-11 Thread Jose Raul Perez Rodriguez
That was helpful. So, the OS needs to feel some pressure from applications requesting memory before it frees some of the memory cache? Under exactly which circumstances does the OS free that memory to give it to the applications requesting it? I mean, if the total memory is 16GB and 10GB are used for OS cache,
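
A small illustration (Linux-specific, not Spark code) of the distinction at play: page cache counts as "used" memory but is reclaimable, and MemAvailable in /proc/meminfo is the kernel's estimate of what applications can still allocate without swapping:

    # Linux-only illustration: the page cache shows up as "used" memory, but
    # the kernel reclaims it when applications allocate.
    def meminfo():
        info = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, rest = line.split(":", 1)
                info[key] = int(rest.split()[0])  # values are in kB
        return info

    m = meminfo()
    for key in ("MemTotal", "MemFree", "Cached", "MemAvailable"):
        print("%-12s %6d MB" % (key, m.get(key, 0) // 1024))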