[jira] [Updated] (ARROW-8208) [PYTHON] Row Group Filtering With ParquetDataset

2020-04-14 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-8208:
-
Labels: dataset  (was: dataset dataset-parquet-read)

> [PYTHON] Row Group Filtering With ParquetDataset
> 
>
> Key: ARROW-8208
> URL: https://issues.apache.org/jira/browse/ARROW-8208
> Project: Apache Arrow
> Issue Type: New Feature
> Reporter: Christophe Clienti
> Priority: Major
> Labels: dataset
>
> Hello,
> I tried to use row group filtering at the file level with an instance of 
> ParquetDataset, without success.
> I've tested the workaround proposed here:
>  [https://github.com/pandas-dev/pandas/issues/26551#issuecomment-497039883]
> But I wonder whether it can work on a single file, as I get an exception with 
> the following code:
> {code:python}
> from pyarrow.parquet import ParquetDataset
>
> ParquetDataset('data.parquet',
>                filters=[('ticker', '=', 'AAPL')]).read().to_pandas()
> {code}
> {noformat}
> AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
> {noformat}
> I read the documentation, and the filtering seems to work only on partitioned 
> datasets. I also found some information in the following JIRA ticket: 
> ARROW-1796
> So I'm not sure whether a ParquetDataset can use row group statistics to skip 
> specific row groups in a file (whether or not the file is part of a dataset).
> As mentioned in ARROW-1796, I tried with fastparquet, and after fixing a bug 
> (statistics.min instead of statistics.min_value), I was able to apply the 
> row group filtering.
> Today, with pyarrow, I'm forced to filter the row groups manually in each 
> file, which prevents me from using the ParquetDataset partition filtering 
> functionality.
> Row groups are really useful because they avoid filling the filesystem with 
> small files...
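
For readers unfamiliar with the idea behind the request, here is a minimal sketch of how per-row-group min/max statistics can be used to skip row groups for an equality filter such as ('ticker', '=', 'AAPL'). It is a conceptual model in plain Python, not pyarrow's or fastparquet's implementation: the RowGroupStats class, the may_contain and select_row_groups helpers, and the sample ticker values are all hypothetical names and data introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for one column's min/max statistics in one row group."""
    min: Optional[str]
    max: Optional[str]


def may_contain(stats: RowGroupStats, value: str) -> bool:
    """Return False only when the statistics prove the value is absent."""
    if stats.min is None or stats.max is None:
        # No statistics recorded: the row group cannot be skipped safely.
        return True
    return stats.min <= value <= stats.max


def select_row_groups(all_stats: List[RowGroupStats], value: str) -> List[int]:
    """Indices of the row groups that must be read for an equality filter."""
    return [i for i, s in enumerate(all_stats) if may_contain(s, value)]


# Made-up statistics for the 'ticker' column: three row groups of a file
# sorted by ticker, plus one row group written without statistics.
ticker_stats = [
    RowGroupStats("AAPL", "AMZN"),
    RowGroupStats("BA", "GOOG"),
    RowGroupStats("IBM", "MSFT"),
    RowGroupStats(None, None),
]

selected = select_row_groups(ticker_stats, "AAPL")
print(selected)
```

Statistics only bound the values in a row group, so the selected row groups can still contain other tickers; a row-level filter is still needed after reading them.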



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8208) [PYTHON] Row Group Filtering With ParquetDataset

2020-03-25 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-8208:
-
Description: 
Hello,

I tried to use the row_group filtering at the file level with an instance of 
ParquetDataset without success.

I've tested the workaround proposed here:
 [https://github.com/pandas-dev/pandas/issues/26551#issuecomment-497039883]

But I wonder if it can work on a file as I get an exception with the following 
code:
{code:python}
ParquetDataset('data.parquet',
   filters=[('ticker', '=', 'AAPL')]).read().to_pandas()
{code}
{noformat}
AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
{noformat}
I read the documentation, and the filtering seems to work only on partitioned 
dataset. Moreover I read some information in the following JIRA ticket: 
ARROW-1796

So I'm not sure that a ParquetDataset can use row_group statistics to filter 
specific row_group in a file (in a dataset or not)?

As mentioned in ARROW-1796, I tried with fastparquet, and after fixing a bug 
(statistics.min instead of statistics.min_value), I was able to apply the 
row_group filtering.

Today I'm forced with pyarrow to filter manually the row_groups in each file, 
which prevents me to use the ParquetDataset partition filtering functionality.

The row groups are really useful because it prevents to fill the filesystem 
with small files...

  was:
Hello,

I tried to use the row_group filtering at the file level with an instance of 
ParquetDataset without success.

I've tested the workaround proposed here:
 [https://github.com/pandas-dev/pandas/issues/26551#issuecomment-497039883]

But I wonder if it can work on a file as I get an exception with the following 
code:
{code:python}
ParquetDataset('data.parquet',
   filters=[('ticker', '=', 'AAPL')]).read().to_pandas()
{code}
{noformat}
AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
{noformat}
I read the documentation, and the filtering seems to work only on partitioned 
dataset. Moreover I read some information in the following JIRA ticket:
 https://issues.apache.org/jira/browse/ARROW-1796

So I'm not sure that a ParquetDataset can use row_group statistics to filter 
specific row_group in a file (in a dataset or not)?

As mentioned in ARROW-1796, I tried with fastparquet, and after fixing a bug 
(statistics.min instead of statistics.min_value), I was able to apply the 
row_group filtering.

Today I'm forced with pyarrow to filter manually the row_groups in each file, 
which prevents me to use the ParquetDataset partition filtering functionality.

The row groups are really useful because it prevents to fill the filesystem 
with small files...







[jira] [Updated] (ARROW-8208) [PYTHON] Row Group Filtering With ParquetDataset

2020-03-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-8208:

Labels: dataset dataset-parquet-read  (was: )






[jira] [Updated] (ARROW-8208) [PYTHON] Row Group Filtering With ParquetDataset

2020-03-25 Thread Christophe Clienti (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christophe Clienti updated ARROW-8208:
--
Description: 
Hello,

I tried to use the row_group filtering at the file level with an instance of 
ParquetDataset without success.

I've tested the workaround proposed here:
 [https://github.com/pandas-dev/pandas/issues/26551#issuecomment-497039883]

But I wonder if it can work on a file as I get an exception with the following 
code:
{code:python}
ParquetDataset('data.parquet',
   filters=[('ticker', '=', 'AAPL')]).read().to_pandas()
{code}
{noformat}
AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
{noformat}
I read the documentation, and the filtering seems to work only on partitioned 
dataset. Moreover I read some information in the following JIRA ticket:
 https://issues.apache.org/jira/browse/ARROW-1796

So I'm not sure that a ParquetDataset can use row_group statistics to filter 
specific row_group in a file (in a dataset or not)?

As mentioned in ARROW-1796, I tried with fastparquet, and after fixing a bug 
(statistics.min instead of statistics.min_value), I was able to apply the 
row_group filtering.

Today I'm forced with pyarrow to filter manually the row_groups in each file, 
which prevents me to use the ParquetDataset partition filtering functionality.

The row groups are really useful because it prevents to fill the filesystem 
with small files...

  was:
Hello,

I tried to use the row_group filtering at the file level with an instance of 
ParquetDataset without success.

I've tested the workaround proposed here:
 [https://github.com/pandas-dev/pandas/issues/26551#issuecomment-497039883]

But I wonder if it can work on a file as I get an exception with the following 
code:
{code:python}
ParquetDataset('data.parquet',
   filters=[('ticker', '=', 'AAPL')]).read().to_pandas()
{code}
{noformat}
AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
{noformat}
I read the documentation, and the filtering seems to work only on partitioned 
dataset. Moreover I read some information in the following JIRA ticket:
 https://issues.apache.org/jira/browse/ARROW-1796

So I'm not sure that a ParquetDataset can use row_group statistics to filter 
specific row_group in a file in a dataset or not?

As mentioned in ARROW-1796, I tried with fastparquet, and after fixing a bug 
(statistics.min instead of statistics.min_value), I was able to apply the 
row_group filtering.

Today I'm forced with pyarrow to filter manually the row_groups in each file, 
which prevents me to use the ParquetDataset partition filtering functionality.

The row groups are really useful because it prevents to fill the filesystem 
with small files...







[jira] [Updated] (ARROW-8208) [PYTHON] Row Group Filtering With ParquetDataset

2020-03-25 Thread Christophe Clienti (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christophe Clienti updated ARROW-8208:
--
Description: 
Hello,

I tried to use the row_group filtering at the file level with an instance of 
ParquetDataset without success.

I've tested the workaround proposed here:
 [https://github.com/pandas-dev/pandas/issues/26551#issuecomment-497039883]

But I wonder if it can work on a file as I get an exception with the following 
code:
{code:python}
ParquetDataset('data.parquet',
   filters=[('ticker', '=', 'AAPL')]).read().to_pandas()
{code}
{noformat}
AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
{noformat}
I read the documentation, and the filtering seems to work only on partitioned 
dataset. Moreover I read some information in the following JIRA ticket:
 https://issues.apache.org/jira/browse/ARROW-1796

So I'm not sure that a ParquetDataset can use row_group statistics to filter 
specific row_group in a file in a dataset or not?

As mentioned in ARROW-1796, I tried with fastparquet, and after fixing a bug 
(statistics.min instead of statistics.min_value), I was able to apply the 
row_group filtering.

Today I'm forced with pyarrow to filter manually the row_groups in each file, 
which prevents me to use the ParquetDataset partition filtering functionality.

The row groups are really useful because it prevents to fill the filesystem 
with small files...

  was:
Hello,

I tried to use the row_group filtering at the file level with an instance of 
ParquetDataset without success.

I've tested the workaround propose here:
 [https://github.com/pandas-dev/pandas/issues/26551#issuecomment-497039883]

But I wonder if it can work on a file as I get an exception with the following 
code:
{code:python}
ParquetDataset('data.parquet',
   filters=[('ticker', '=', 'AAPL')]).read().to_pandas()
{code}
{noformat}
AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
{noformat}
I read the documentation, and the filtering seems to work only on partitioned 
dataset. Moreover I read some information in the following JIRA ticket:
 https://issues.apache.org/jira/browse/ARROW-1796

So I'm not sure that a ParquetDataset can use row_group statistics to filter 
specific row_group in a file in a dataset or not?

As mentioned in ARROW-1796, I tried with fastparquet, and after fixing a bug 
(statistics.min instead of statistics.min_value), I was able to apply the 
row_group filtering.

Today I'm forced with pyarrow to filter manually the row_groups in each file, 
which prevents me to use the ParquetDataset partition filtering functionality.

The row groups are really useful because it prevents to fill the filesystem 
with small files...







[jira] [Updated] (ARROW-8208) [PYTHON] Row Group Filtering With ParquetDataset

2020-03-25 Thread Christophe Clienti (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christophe Clienti updated ARROW-8208:
--
Summary: [PYTHON] Row Group Filtering With ParquetDataset  (was: [PYTHON] 
RowGroup filtering with ParquetDataset)



