Re: A couple of questions about pyarrow.parquet

2019-05-23 Thread Uwe L. Korn
Hello Ted,

regarding predicate pushdown in Python, have a look at my unfinished PR at
https://github.com/apache/arrow/pull/2623. It was stalled because we were
missing a native filter implementation in Arrow. The requirements for that have
now been implemented, so we could probably reactivate the PR.
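
In the meantime, partition-level filtering already works through the "filters"
argument of ParquetDataset. A rough, untested sketch (the dataset path and the
partition column "day" are made up for illustration; this prunes whole
partition directories, it does not filter individual records):

import pyarrow.parquet as pq

# Filters are given in disjunctive normal form: a flat list of tuples is
# AND-ed; a list of such lists is OR-ed. Only partition columns are pruned.
dataset = pq.ParquetDataset(
    "/data/events",  # hypothetical dataset root
    filters=[("day", ">=", "2019-05-01"), ("day", "<", "2019-06-01")],
)
table = dataset.read()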

Uwe



Re: A couple of questions about pyarrow.parquet

2019-05-17 Thread Ted Gooch
Thanks Micah and Wes.

Definitely interested in the *Predicate Pushdown* and *Schema inference,
schema-on-read, and schema normalization* sections.



Re: A couple of questions about pyarrow.parquet

2019-05-17 Thread Wes McKinney
Please see also

https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=drivesdk

and the prior mailing list discussion. I will comment in more detail on the
other items later.



Re: A couple of questions about pyarrow.parquet

2019-05-17 Thread Micah Kornfield
I can't help on the first question.

Regarding push-down predicates, there is an open JIRA [1] to do just that.

[1] https://issues.apache.org/jira/browse/PARQUET-473




A couple of questions about pyarrow.parquet

2019-05-17 Thread Ted Gooch
Hi,

I've been doing some work trying to get the parquet read path going for the
python iceberg library. I have two questions that I couldn't figure out, and
was hoping I could get some guidance from the list here.

First, I'd like to create a ParquetSchema->IcebergSchema converter, but it
appears that only limited information is available in the ColumnSchema objects
passed back to the Python client [2]:


<ParquetColumnSchema>
  name: key
  path: m.map.key
  max_definition_level: 2
  max_repetition_level: 1
  physical_type: BYTE_ARRAY
  logical_type: UTF8
<ParquetColumnSchema>
  name: key
  path: m.map.value.map.key
  max_definition_level: 4
  max_repetition_level: 2
  physical_type: BYTE_ARRAY
  logical_type: UTF8
<ParquetColumnSchema>
  name: value
  path: m.map.value.map.value
  max_definition_level: 5
  max_repetition_level: 2
  physical_type: BYTE_ARRAY
  logical_type: UTF8


where physical_type and logical_type are both strings [1]. The Arrow schema I
can get from *to_arrow_schema* looks to be more expressive (although maybe I
just don't understand the Parquet format well enough):

m: struct<map: list<struct<key: string, value: struct<map: list<struct<key: string, value: string> not null>>> not null>>
  child 0, map: list<struct<key: string, value: struct<map: list<struct<key: string, value: string> not null>>> not null>
  child 0, map: struct<key: string, value: struct<map: list<struct<key: string, value: string> not null>>>
  child 0, key: string
  child 1, value: struct<map: list<struct<key: string, value: string> not null>>
  child 0, map: list<struct<key: string, value: string> not null>
  child 0, map: struct<key: string, value: string>
  child 0, key: string
  child 1, value: string


It seems like I can infer the info from the name/path, but is there a more
direct way of getting the detailed parquet schema information?
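
For illustration, one way to get at the nesting without parsing the path
strings is to walk the Arrow schema returned by to_arrow_schema. A rough,
untested sketch (the file name nested_map.parquet is made up; it only handles
struct, list and primitive types):

import pyarrow as pa
import pyarrow.parquet as pq

pf = pq.ParquetFile("nested_map.parquet")
arrow_schema = pf.schema.to_arrow_schema()

def describe(name, typ, indent=0):
    # Recursively print the nested structure of an Arrow type.
    pad = " " * indent
    if pa.types.is_struct(typ):
        print(pad + name + ": struct")
        for i in range(typ.num_fields):
            child = typ.field(i)
            describe(child.name, child.type, indent + 2)
    elif pa.types.is_list(typ):
        print(pad + name + ": list")
        describe("item", typ.value_type, indent + 2)
    else:
        print(pad + name + ": " + str(typ))  # leaf type, e.g. string

for field in arrow_schema:
    describe(field.name, field.type)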

Second question: is there a way to push record-level filtering into the
parquet reader, so that the reader only reads in values that match a given
predicate expression? Predicate expressions would be simple field-to-literal
comparisons (>, >=, ==, <=, <, !=, is null, is not null) connected with
logical operators (AND, OR, NOT).

I've seen that after reading in the data I can use the filtering language in
Gandiva [3] to get filtered record batches, but was looking for somewhere
lower in the stack if possible.
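
For reference, the Gandiva route from [3] looks roughly like this. An untested
sketch; the column name "a" and the literal are made up, and note that it
filters a record batch that has already been read, rather than pushing the
predicate into the reader:

import pyarrow as pa
import pyarrow.gandiva as gandiva

table = pa.table({"a": pa.array([float(i) for i in range(10000)])})
batch = table.to_batches()[0]

# Build the predicate a < 1000.0 as a Gandiva expression tree.
builder = gandiva.TreeExprBuilder()
node_a = builder.make_field(table.schema.field("a"))
literal = builder.make_literal(1000.0, pa.float64())
cond = builder.make_function("less_than", [node_a, literal], pa.bool_())
condition = builder.make_condition(cond)

# Evaluate the filter; the result is a selection vector of matching row indices.
filt = gandiva.make_filter(table.schema, condition)
selection = filt.evaluate(batch, pa.default_memory_pool())
filtered = batch.take(selection.to_array())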



[1]
https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L661-L667
[2] Spark/Hive table DDL for this parquet file looks like:
CREATE TABLE `iceberg`.`nested_map` (
  m map<string, map<string, string>>)
[3]
https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_gandiva.py#L86-L100