Parquet read performance for different schemas

2019-09-19 Thread Tomas Bartalos
Hello, I have 2 Parquet datasets (each containing 1 file):
- parquet-wide - schema has 25 top-level cols + 1 array
- parquet-narrow - schema has 3 top-level cols
Both files hold the same data for the shared columns. When I read from parquet-wide, Spark reports read 52.6 KB; from parquet-narrow only 2.6
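One likely factor is that a Parquet scan pays footer and column-chunk metadata costs that grow with the total number of columns in the file, even when column pruning means only a few columns are decoded. Below is a rough local reproduction sketched in Scala; the paths, column names, and row count are made up for illustration, not the poster's actual data:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("parquet-width").getOrCreate()
import spark.implicits._

// Three columns of "real" data, shared by both layouts (hypothetical values).
val base = spark.range(100000).select($"id", ($"id" % 100).as("a"), ($"id" * 2).as("b"))

// parquet-narrow: just the 3 columns, written as a single file.
base.coalesce(1).write.mode("overwrite").parquet("/tmp/parquet-narrow")

// parquet-wide: the same 3 columns plus 22 extra columns and an array column.
val wide = (1 to 22)
  .foldLeft(base)((df, i) => df.withColumn(s"extra_$i", lit(i)))
  .withColumn("arr", array(lit(1), lit(2)))
wide.coalesce(1).write.mode("overwrite").parquet("/tmp/parquet-wide")

// Select the same 3 columns from each and compare the input size
// reported for the scan in the Spark UI / SQL metrics.
spark.read.parquet("/tmp/parquet-narrow").select("id", "a", "b").agg(sum("b")).show()
spark.read.parquet("/tmp/parquet-wide").select("id", "a", "b").agg(sum("b")).show()

If pruning works, both scans decode only id, a, and b, so a size gap of the kind reported would mostly reflect the wide file's larger footer and column metadata rather than extra data pages.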

Incorrect results in left_outer join in DSv2 implementation with filter pushdown - spark 2.3.2

2019-09-19 Thread Shubham Chaurasia
Hi,

Consider the following statements:

1)
scala> val df = spark.read.format("com.shubham.MyDataSource").load
scala> df.show
+---+---+
|  i|  j|
+---+---+
|  0|  0|
|  1| -1|
|  2| -2|
|  3| -3|
|  4| -4|
+---+---+

2)
scala> val df1 = df.filter("i < 3")
scala> df1.show
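For reference, in the Spark 2.3.x DataSourceV2 API a source opts into filter pushdown by having its DataSourceReader implement SupportsPushDownFilters. The following is a minimal, self-contained sketch of such a source producing the i/j rows shown above; it is an illustrative reconstruction, not the actual com.shubham.MyDataSource from the report:

import java.util.{List => JList}
import scala.collection.JavaConverters._
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.{Filter, LessThan}
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory, DataSourceReader, SupportsPushDownFilters}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

class MyDataSource extends DataSourceV2 with ReadSupport {
  override def createReader(options: DataSourceOptions): DataSourceReader = new MyReader
}

class MyReader extends DataSourceReader with SupportsPushDownFilters {
  private var pushed: Array[Filter] = Array.empty

  override def readSchema(): StructType =
    StructType(Seq(StructField("i", IntegerType), StructField("j", IntegerType)))

  // Keep every filter and tell Spark nothing is left to re-evaluate.
  // If the source's own filtering ever disagrees with this promise,
  // results silently become incorrect.
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    pushed = filters
    Array.empty
  }

  override def pushedFilters(): Array[Filter] = pushed

  override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
    Seq[DataReaderFactory[Row]](new MyReaderFactory(pushed)).asJava
}

class MyReaderFactory(filters: Array[Filter]) extends DataReaderFactory[Row] {
  override def createDataReader(): DataReader[Row] = new MyDataReader(filters)
}

class MyDataReader(filters: Array[Filter]) extends DataReader[Row] {
  // Rows (0,0), (1,-1), ..., (4,-4), filtered by whatever was pushed down.
  private val rows = (0 to 4).map(i => Row(i, -i)).filter { r =>
    filters.forall {
      case LessThan("i", v: Int) => r.getInt(0) < v
      case _                     => true // unhandled filters pass everything through
    }
  }.iterator

  override def next(): Boolean = rows.hasNext
  override def get(): Row = rows.next()
  override def close(): Unit = ()
}

With this, df.filter("i < 3") surfaces the LessThan filter under PushedFilters in the physical plan. Note that pushFilters mutates reader state, so filters pushed for one scan can leak into another if a reader is reused; whether that is the cause of the reported left_outer join problem would have to be confirmed in the thread itself.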

Re: Thoughts on Spark 3 release, or a preview release

2019-09-19 Thread Mats Rydberg
Hello all, we are Martin and Mats from Neo4j, and we're working on the Spark Graph SPIP (https://issues.apache.org/jira/browse/SPARK-25994). We are also +1 for a Spark 3.0 preview release and for setting a timeline for the actual release. The SPIP was accepted at the beginning of this year and we've