Re: [VOTE] Migration of parquet-cpp issues to Arrow's issue tracker

2024-05-29 Thread Wes McKinney
+1 (binding for Arrow and Parquet)

On Wed, May 29, 2024 at 12:13 PM Raúl Cumplido 
wrote:

> +1 (binding for Arrow)
>
> On Wed, 29 May 2024, 18:15, Andy Grove  wrote:
>
> > +1 (binding for Arrow).
> >
> > Thanks,
> >
> > Andy.
> >
> > On Wed, May 29, 2024 at 9:48 AM Alenka Frim  > .invalid>
> > wrote:
> >
> > > +1 (non-binding)
> > >
> > > Thank you Rok!
> > >
> > > On Wed, May 29, 2024 at 4:57 PM Gang Wu  wrote:
> > >
> > > > +1 (binding for Parquet)
> > > >
> > > > Thanks!
> > > > Gang
> > > >
> > > > On Wed, May 29, 2024 at 10:47 PM Fokko Driesprong 
> > > > wrote:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > On Wed, 29 May 2024 at 16:46, Felipe Oliveira Carvalho <
> > > > > felipe...@gmail.com> wrote:
> > > > >
> > > > > > +1 (non-binding)
> > > > > >
> > > > > > On Wed, 29 May 2024 at 11:30 Micah Kornfield <
> > emkornfi...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > +1 (non-binding for Parquet, Binding for Arrow if that makes a
> > > > > > difference)
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Wed, May 29, 2024 at 7:15 AM Rok Mihevc <
> rok.mih...@gmail.com
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > # sending this to both dev@arrow and dev@parquet
> > > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > Following the ML discussion [1] I would like to propose a
> vote
> > > for
> > > > > > > > parquet-cpp issues to be moved from Parquet Jira [2] to
> Arrow's
> > > > issue
> > > > > > > > tracker [3].
> > > > > > > >
> > > > > > > > [1]
> > > > https://lists.apache.org/thread/zklp0lwcbcsdzgxoxy6wqjwrvt6y4s9p
> > > > > > > > [2] https://issues.apache.org/jira/projects/PARQUET/issues/
> > > > > > > > [3] https://github.com/apache/arrow/issues/
> > > > > > > >
> > > > > > > > The vote will be open for at least 72 hours.
> > > > > > > >
> > > > > > > > [ ] +1 Migrate parquet-cpp issues
> > > > > > > > [ ] +0
> > > > > > > > [ ] -1 Do not migrate parquet-cpp issues because...
> > > > > > > >
> > > > > > > >
> > > > > > > > Rok
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: Interest in Parquet V3

2024-05-15 Thread Wes McKinney
hi all,

Just to add some of my perspective: I would like to write up some
longer-form thoughts, since I've been collaborating and talking with the
Nimble and Lance folks -- and as a result I know a lot about the details of
Nimble, BtrBlocks, and also the recent Bullion research format from
UMD/ByteDance -- and I've been consulting/advising on some of the research
work that's been referenced.

Firstly, I 100% agree that documenting implementation support, details,
and cross-compatibility is essential. It would have been better for Parquet
to have had integration tests between Impala and parquet-mr from day one,
but this never happened, and so there was some initial impedance mismatch
between the two halves of the early Parquet community. When I started
working on Parquet in 2015, the motivation was mainly to fill the urgent
need to be able to read these files from C++ for use in Python (and
eventually R and other C++-consuming languages).

As for the issues in Parquet:

- The all-or-nothing footer decoding for datasets with large schemas or
many row groups has always been problematic (I've been asked to present
quantitative evidence to support this "problematic" statement, so I will
try to produce some!). So I think any work that does not make it much
cheaper to read a single column from a single row group is very nearly dead
on arrival (a small illustration follows below). I am not sure how you
fully make this problem go away in generality without doing away with
Thrift at the footer level, but at that point you are making such a
disruptive change that you might as well try to fix some other problems
too. If you go down that rabbit hole, you have created a new file format
that is no longer Parquet, and so calling it ParquetV3 is probably
misleading.
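
To make the point concrete, here is a minimal pyarrow sketch (the file name
"example.parquet" and the column name "c1" are made-up placeholders): even a
single-column, single-row-group read has to deserialize the entire Thrift
footer first.

    import pyarrow.parquet as pq

    # Reading the metadata alone parses the whole footer, whose size scales
    # roughly with (number of columns) x (number of row groups).
    md = pq.read_metadata("example.parquet")
    print(md.num_row_groups, md.num_columns)

    # Opening the file also parses the whole footer up front, even though
    # the actual data read below touches one column of one row group.
    pf = pq.ParquetFile("example.parquet")
    table = pf.read_row_group(0, columns=["c1"])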

- Parquet's data page format has worked well over time, but aside from
fixing the metadata overhead issue, the data page itself needs to be
extensible. There is DATA_PAGE_V2, but structurally it is the same as
DATA_PAGE{_V1} with the repetition and definition levels kept outside of
the compressed portion. You can think of Parquet's data page structure as
one possible choice of options in a general-purpose nested encoding scheme
(most implementations do dictionary+RLE and fall back on plain encoding
when the dictionary exceeds a certain size; see the sketch below). We could
create a DATA_PAGE_V3 that allows for a whole alternative -- and even
pluggable -- encoding scheme, without changing the metadata, and this would
be valuable to the Parquet community, even if most mainstream Parquet users
(e.g. Spark) opt not to use it for a period of some years for compatibility
reasons.
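
For reference, today's knobs look roughly like this in pyarrow (a hedged
sketch only; parameter availability depends on the pyarrow version, and the
file name and table contents are made up):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"c0": ["a", "b", "a"], "c1": [1, 2, 3]})
    pq.write_table(
        table,
        "example.parquet",
        data_page_version="2.0",                # opt into DATA_PAGE_V2 pages
        use_dictionary=True,                    # dictionary+RLE encoding...
        dictionary_pagesize_limit=1024 * 1024,  # ...with fallback to plain above this size
    )

Nothing here changes the fundamental page structure; it only selects among
the existing options, which is the limitation described above.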

- Another problem that I haven't seen mentioned (but maybe I just missed
it) is that Parquet is very painful to decode on accelerators like GPUs.
RAPIDS has created a CUDA implementation of Parquet decoding (including
decoding the Thrift data page headers on the GPU), but there are two
primary problems: 1) there is metadata within the ColumnChunk in the row
group that is necessary for control flow on the host side, and 2) there are
not sufficient memory preallocation hints -- i.e. how much memory you need
to allocate to fully decode a data page (see the sketch below). This is
also discussed in
https://github.com/facebookincubator/nimble/discussions/50
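
As a rough illustration of what is (and is not) available before touching
the page data, here is a small pyarrow sketch (file name assumed): the
footer only exposes chunk-level totals, while the per-page sizes a decoder
would want for preallocation live in the page headers themselves.

    import pyarrow.parquet as pq

    md = pq.read_metadata("example.parquet")
    for rg in range(md.num_row_groups):
        for col in range(md.num_columns):
            cc = md.row_group(rg).column(col)
            # Chunk-level totals only; per-page uncompressed sizes sit in the
            # Thrift page headers interleaved with the data itself.
            print(cc.path_in_schema, cc.total_compressed_size,
                  cc.total_uncompressed_size)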

Personally, I struggle to see how the metadata issues are fixable -- at
least in a satisfactory fashion where we could get behind calling something
ParquetV3 when it would basically be a new file format masquerading as a
major version of an existing file format. It also adds a lot of
implementation complexity for anyone setting out to support "Parquet".

I think there is significant value in developing + researching accelerated
"codecs" (basically, new data page formats -- think about how h.264 and
h.265 have superseded MPEG-2 in video encoding) and finding a way to
incorporate them into Parquet, e.g. with a new DATA_PAGE_V3 page type or
similar. It would be ideal for Parquet and its implementations to continue
to improve.

That said, it's unclear that Parquet as a file container for encoded data
can be evolved to satisfactorily resolve all of the above issues, and I
don't think it needs to. It seems inevitable that we will end up with new
file containers and implementations, but the ideal scenario would be to
develop reusable "codec" libraries (like the nested encoding scheme in
Nimble or in BtrBlocks -- they're very similar) and then use them in
multiple places.

Anyway, it's good to see many opinions on this and I look forward to
continued dialogue.

Thanks
Wes

On Wed, May 15, 2024 at 7:56 AM Steve Loughran 
wrote:

> On Tue, 14 May 2024 at 17:48, Julien Le Dem  wrote:
>
> > +1 on Micah starting a doc and following up by commenting in it.
> >
>
> +maybe some conf call where people of interest can talk about it.
>
>
>
> >
> > @Raphael, Wish Maple: agreed that changing the metadata representation is
> > less important. Most engines can externalize and index metadata in some
> > way.
>
>
> wo

[jira] [Assigned] (PARQUET-2110) Fix Typos in LogicalTypes.md

2022-01-19 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-2110:
-

Assignee: jincongho

> Fix Typos in LogicalTypes.md
> 
>
> Key: PARQUET-2110
> URL: https://issues.apache.org/jira/browse/PARQUET-2110
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: jincongho
>Assignee: jincongho
>Priority: Trivial
>
> interpertations -> interpretations
> regadless -> regardless
> unambigously -> unambiguously



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (PARQUET-2110) Fix Typos in LogicalTypes.md

2022-01-19 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-2110:
--
Fix Version/s: format-2.10.0

> Fix Typos in LogicalTypes.md
> 
>
> Key: PARQUET-2110
> URL: https://issues.apache.org/jira/browse/PARQUET-2110
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: jincongho
>Assignee: jincongho
>Priority: Trivial
> Fix For: format-2.10.0
>
>
> interpertations -> interpretations
> regadless -> regardless
> unambigously -> unambiguously



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (PARQUET-2110) Fix Typos in LogicalTypes.md

2022-01-19 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-2110.
---
Resolution: Fixed

Resolved in PR https://github.com/apache/parquet-format/pull/181

> Fix Typos in LogicalTypes.md
> 
>
> Key: PARQUET-2110
> URL: https://issues.apache.org/jira/browse/PARQUET-2110
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: jincongho
>Priority: Trivial
>
> interpertations -> interpretations
> regadless -> regardless
> unambigously -> unambiguously



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Parquet quarterly board report is due today (April 16)

2021-04-16 Thread Wes McKinney
I'm not sure if someone has started working on a draft, but it would
be good to put something together in time for the board deadline.


Re: [VOTE] Release Apache Parquet Format 2.9.0 RC0

2021-04-08 Thread Wes McKinney
hi Gabor — I think you may need to be a PMC member? I'm not sure though.

+1 (binding), verified signature and checksum on the artifact

On Wed, Apr 7, 2021 at 10:19 AM Gabor Szadovszky  wrote:
>
> I've updated the KEYS file with your public key in the release repo (
> downloads.apache.org is updated already). Please keep in mind that you will
> still need write access to the release repo to finalize the release after
> the vote passes. Guys, any idea how to request write access to a repo?
>
> Verified checksum and signature; unit tests pass; parquet-mr builds with
> the new RC.
> +1(binding)
>
>
>
>
> On Wed, Apr 7, 2021 at 4:51 PM Antoine Pitrou  wrote:
>
> >
> > Ok, I've tried multiple variations and I still can't commit to the
> > release repository.
> >
> > May I ask you to commit the following patch:
> > https://gist.github.com/pitrou/0f9f1ffe280cfb48ea9427ebec19b65e
> >
> > You can check that the key block matches the one I added in the dev
> > repo.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Wed, 7 Apr 2021 16:35:16 +0200
> > Gabor Szadovszky
> > 
> > wrote:
> > > I don't have too much experience in svn. I usually follow the commands
> > > listed in the how to release doc and it works for me. (Don't remember if
> > > I've had to do some initial steps.) As a committer you should have write
> > > access to all the repositories of the Parquet community.
> > >
> > > On Wed, Apr 7, 2021 at 4:18 PM Antoine Pitrou 
> > wrote:
> > >
> > > >
> > > > Ah!  It seems I can't push to that repo:
> > > >
> > > > Sending         KEYS
> > > > Transmitting file data .
> > > > svn: E195023: Commit failed (details follow):
> > > > svn: E195023: Changing file '/home/antoine/apache/parquet-release/KEYS' is
> > > > forbidden by the server
> > > > svn: E175013: Access to
> > > > '/repos/dist/!svn/txr/46918-13e8/release/parquet/KEYS' forbidden
> > > >
> > > >
> > > > The URL I used for checkout is
> > > > https://apit...@dist.apache.org/repos/dist/release/parquet
> > > > Should I use another one?
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > >
> > > > On Wed, 7 Apr 2021 16:00:26 +0200
> > > > Gabor Szadovszky
> > > > 
> > > > wrote:
> > > > > Sorry, I've missed you updated the dev repo. The downloads page
> > mirrors
> > > > the
> > > > > release repo. Yet another place (besides the parquet-format and
> > > > parquet-mr
> > > > > repos) where we store a KEYS file for whatever reason. Please update
> > the
> > > > > one in the release repo.
> > > > >
> > > > > On Wed, Apr 7, 2021 at 3:47 PM Gabor Szadovszky <
> > > > > gabor.szadovs...@cloudera.com> wrote:
> > > > >
> > > > > > I guess it only requires some time to sync. Last time the release
> > > > tarball
> > > > > > required ~1hour to sync.
> > > > > >
> > > > > > On Wed, Apr 7, 2021 at 3:42 PM Antoine Pitrou 
> >
> > > > wrote:
> > > > > >
> > > > > >>
> > > > > >> Hi Gabor,
> > > > > >>
> > > > > >> Ok, I updated the KEYS file in the Parquet SVN repository.
> > > > > >> The changes do appear in
> > > > > >> https://dist.apache.org/repos/dist/dev/parquet/KEYS -- but not in
> > > > > >> https://downloads.apache.org/parquet/KEYS .  Is there any
> > additional
> > > > > >> step I should perform?
> > > > > >>
> > > > > >> Regards
> > > > > >>
> > > > > >> Antoine.
> > > > > >>
> > > > > >>
> > > > > >> On Wed, 7 Apr 2021 15:19:24 +0200
> > > > > >> Gabor Szadovszky  wrote:
> > > > > >>
> > > > > >> > Hi Antoine,
> > > > > >> >
> > > > > >> > Thanks for initiating this release! You need to update the
> > listed
> > > > KEYS
> > > > > >> file
> > > > > >> > with your public key otherwise we cannot validate the
> > signature.
> > > > (To do
> > > > > >> > that you need to update the releases svn repo. See details in
> > the
> > > > how to
> > > > > >> > release doc about the publishing.)
> > > > > >> >
> > > > > >> > Regards,
> > > > > >> > Gabor
> > > > > >> >
> > > > > >> > On Wed, Apr 7, 2021 at 3:10 PM Antoine Pitrou <
> > anto...@python.org>
> > > >
> > > > > >> wrote:
> > > > > >> >
> > > > > >> > >
> > > > > >> > > Hi everyone,
> > > > > >> > >
> > > > > >> > > I propose the following RC to be released as official Apache
> > > > Parquet
> > > > > >> > > Format 2.9.0 release.
> > > > > >> > >
> > > > > >> > > The commit id is b4f0c0a643a6ec1a7def37115dd6967ba9346df7
> > > > > >> > > * This corresponds to the tag: apache-parquet-format-2.9.0-rc0
> > > > > >> > > *
> > > > > >> > >
> > > > > >>
> > > >
> > https://github.com/apache/parquet-format/tree/b4f0c0a643a6ec1a7def37115dd6967ba9346df7
> > > >
> > > > > >> > >
> > > > > >> > > The release tarball, signature, and checksums are here:
> > > > > >> > > *
> > > > > >> > >
> > > > > >>
> > > >
> > https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-format-2.9.0-rc0/
> > > >
> > > > > >> > >
> > > > > >> > > You can find the KEYS file here:
> > > > > >> > > * https://downloads.apache.org/parquet/KEYS
> > > > > >> > >
> > > > > >> > > Binary artifacts are staged in Nexus here:
> > > > > >> >

[jira] [Commented] (PARQUET-1345) [C++] It is possible to overflow a TMemoryBuffer when serializing the file metadata

2020-10-01 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205494#comment-17205494
 ] 

Wes McKinney commented on PARQUET-1345:
---

Can you make a repro? Seems like something we should see if we can fix

> [C++] It is possible to overflow a TMemoryBuffer when serializing the file 
> metadata
> ---
>
> Key: PARQUET-1345
> URL: https://issues.apache.org/jira/browse/PARQUET-1345
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>
> I'm not sure if this is fixable, but see issue reported to Arrow:
> https://github.com/apache/arrow/issues/2077



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1878) [C++] lz4 codec is not compatible with Hadoop Lz4Codec

2020-09-22 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1878:
-

Assignee: Patrick Pai

> [C++] lz4 codec is not compatible with Hadoop Lz4Codec
> --
>
> Key: PARQUET-1878
> URL: https://issues.apache.org/jira/browse/PARQUET-1878
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Steve M. Kim
>Assignee: Patrick Pai
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> As described in HADOOP-12990, the Hadoop {{Lz4Codec}} uses the lz4 block 
> format, and it prepends 8 extra bytes before the compressed data. I believe 
> that lz4 implementation in parquet-cpp also uses the lz4 block format, but it 
> does not prepend these 8 extra bytes.
>  
> Using Java parquet-mr, I wrote a Parquet file with lz4 compression:
> {code:java}
> $ parquet-tools meta 
> /tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f
> file:
> file:/tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f
> creator: parquet-mr version 1.10.1 (build 
> a89df8f9932b6ef6633d06069e50c9b7970bebd1)file schema:
> 
> c1:  REQUIRED INT64 R:0 D:0
> c0:  REQUIRED BINARY R:0 D:0
> v0:  REQUIRED INT64 R:0 D:0row group 1: RC:5007 TS:28028 OFFSET:4
> 
> c1:   INT64 LZ4 DO:0 FPO:4 SZ:24797/25694/1.04 VC:5007 
> ENC:DELTA_BINARY_PACKED ST:[min: 1566330126476659000, max: 
> 1571211622650188000, num_nulls: 0]
> c0:   BINARY LZ4 DO:0 FPO:24801 SZ:279/260/0.93 VC:5007 
> ENC:PLAIN,RLE_DICTIONARY ST:[min: 
> 0x7471732F62656566616C6F2F746F6D6163636F2D66782D6D6B74646174612D6C69766573747265616D,
>  max: 
> 0x7471732F62656566616C6F2F746F6D6163636F2D66782D6D6B74646174612D6C69766573747265616D,
>  num_nulls: 0]
> v0:   INT64 LZ4 DO:0 FPO:25080 SZ:1348/2074/1.54 VC:5007 
> ENC:PLAIN,RLE_DICTIONARY ST:[min: 0, max: 9, num_nulls: 0] {code}
> When I attempted to read this file with parquet-cpp, I got the following 
> error:
> {code:java}
> >>> import pyarrow.parquet as pq
> >>> pq.read_table('/tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 1536, in read_table
> return pf.read(columns=columns, use_threads=use_threads,
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 1260, in read
> table = piece.read(columns=columns, use_threads=use_threads,
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 707, in read
> table = reader.read(**options)
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 336, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1130, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> OSError: IOError: Corrupt Lz4 compressed data. {code}
>  
> [https://github.com/apache/arrow/issues/3491] reported incompatibility in the 
> other direction, using Spark (which uses the Hadoop lz4 codec) to read a 
> parquet file that was written with parquet-cpp.
>  
> Given that the Hadoop lz4 codec has long been in use, and users have 
> accumulated Parquet files that were written with this implementation, I 
> propose changing parquet-cpp to match the Hadoop implementation.
>  
> See also:
>  * 
> https://issues.apache.org/jira/browse/PARQUET-1241?focusedCommentId=16574328#comment-16574328
>  * 
> https://issues.apache.org/jira/browse/PARQUET-1241?focusedCommentId=16585288#comment-16585288



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1878) [C++] lz4 codec is not compatible with Hadoop Lz4Codec

2020-09-22 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1878.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 7789
[https://github.com/apache/arrow/pull/7789]

> [C++] lz4 codec is not compatible with Hadoop Lz4Codec
> --
>
> Key: PARQUET-1878
> URL: https://issues.apache.org/jira/browse/PARQUET-1878
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Steve M. Kim
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 9h 10m
>  Remaining Estimate: 0h
>
> As described in HADOOP-12990, the Hadoop {{Lz4Codec}} uses the lz4 block 
> format, and it prepends 8 extra bytes before the compressed data. I believe 
> that lz4 implementation in parquet-cpp also uses the lz4 block format, but it 
> does not prepend these 8 extra bytes.
>  
> Using Java parquet-mr, I wrote a Parquet file with lz4 compression:
> {code:java}
> $ parquet-tools meta 
> /tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f
> file:
> file:/tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f
> creator: parquet-mr version 1.10.1 (build 
> a89df8f9932b6ef6633d06069e50c9b7970bebd1)file schema:
> 
> c1:  REQUIRED INT64 R:0 D:0
> c0:  REQUIRED BINARY R:0 D:0
> v0:  REQUIRED INT64 R:0 D:0row group 1: RC:5007 TS:28028 OFFSET:4
> 
> c1:   INT64 LZ4 DO:0 FPO:4 SZ:24797/25694/1.04 VC:5007 
> ENC:DELTA_BINARY_PACKED ST:[min: 1566330126476659000, max: 
> 1571211622650188000, num_nulls: 0]
> c0:   BINARY LZ4 DO:0 FPO:24801 SZ:279/260/0.93 VC:5007 
> ENC:PLAIN,RLE_DICTIONARY ST:[min: 
> 0x7471732F62656566616C6F2F746F6D6163636F2D66782D6D6B74646174612D6C69766573747265616D,
>  max: 
> 0x7471732F62656566616C6F2F746F6D6163636F2D66782D6D6B74646174612D6C69766573747265616D,
>  num_nulls: 0]
> v0:   INT64 LZ4 DO:0 FPO:25080 SZ:1348/2074/1.54 VC:5007 
> ENC:PLAIN,RLE_DICTIONARY ST:[min: 0, max: 9, num_nulls: 0] {code}
> When I attempted to read this file with parquet-cpp, I got the following 
> error:
> {code:java}
> >>> import pyarrow.parquet as pq
> >>> pq.read_table('/tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 1536, in read_table
> return pf.read(columns=columns, use_threads=use_threads,
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 1260, in read
> table = piece.read(columns=columns, use_threads=use_threads,
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 707, in read
> table = reader.read(**options)
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 336, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1130, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> OSError: IOError: Corrupt Lz4 compressed data. {code}
>  
> [https://github.com/apache/arrow/issues/3491] reported incompatibility in the 
> other direction, using Spark (which uses the Hadoop lz4 codec) to read a 
> parquet file that was written with parquet-cpp.
>  
> Given that the Hadoop lz4 codec has long been in use, and users have 
> accumulated Parquet files that were written with this implementation, I 
> propose changing parquet-cpp to match the Hadoop implementation.
>  
> See also:
>  * 
> https://issues.apache.org/jira/browse/PARQUET-1241?focusedCommentId=16574328#comment-16574328
>  * 
> https://issues.apache.org/jira/browse/PARQUET-1241?focusedCommentId=16585288#comment-16585288



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1904) [C++] Export file_offset in RowGroupMetaData

2020-08-27 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186124#comment-17186124
 ] 

Wes McKinney commented on PARQUET-1904:
---

Done. I also made you an administrator so you can do this in the future

> [C++] Export file_offset in RowGroupMetaData
> 
>
> Key: PARQUET-1904
> URL: https://issues.apache.org/jira/browse/PARQUET-1904
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Simon Bertron
>Assignee: Simon Bertron
>Priority: Trivial
>  Labels: parquet, pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> In the C++ row group metadata object, the offset of the row group in the file 
> is stored, but not exposed to users. RowGroupMetaDataImpl has a field 
> file_offset and a method file_offset() that exposes it. But RowGroupMetaData 
> does not have a file_offset() method. This seems odd, most other fields in 
> RowGroupMetaDataImpl are exposed by RowGroupMetaData.
>  
> This issue is similar to ARROW-3590, but that issue seems pretty stale and is 
> requesting a python feature. I think this issue is more focused and detailed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1904) [C++] Export file_offset in RowGroupMetaData

2020-08-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1904:
--
Fix Version/s: cpp-1.6.0

> [C++] Export file_offset in RowGroupMetaData
> 
>
> Key: PARQUET-1904
> URL: https://issues.apache.org/jira/browse/PARQUET-1904
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Simon Bertron
>Assignee: Simon Bertron
>Priority: Trivial
>  Labels: parquet, pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> In the C++ row group metadata object, the offset of the row group in the file 
> is stored, but not exposed to users. RowGroupMetaDataImpl has a field 
> file_offset and a method file_offset() that exposes it. But RowGroupMetaData 
> does not have a file_offset() method. This seems odd, most other fields in 
> RowGroupMetaDataImpl are exposed by RowGroupMetaData.
>  
> This issue is similar to ARROW-3590, but that issue seems pretty stale and is 
> requesting a python feature. I think this issue is more focused and detailed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1904) [C++] Export file_offset in RowGroupMetaData

2020-08-27 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1904:
-

Assignee: Simon Bertron

> [C++] Export file_offset in RowGroupMetaData
> 
>
> Key: PARQUET-1904
> URL: https://issues.apache.org/jira/browse/PARQUET-1904
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Simon Bertron
>Assignee: Simon Bertron
>Priority: Trivial
>  Labels: parquet, pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> In the C++ row group metadata object, the offset of the row group in the file 
> is stored, but not exposed to users. RowGroupMetaDataImpl has a field 
> file_offset and a method file_offset() that exposes it. But RowGroupMetaData 
> does not have a file_offset() method. This seems odd, most other fields in 
> RowGroupMetaDataImpl are exposed by RowGroupMetaData.
>  
> This issue is similar to ARROW-3590, but that issue seems pretty stale and is 
> requesting a python feature. I think this issue is more focused and detailed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1845) [C++] Int96 memory images in test cases assume only little-endian

2020-08-03 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1845.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 6981
[https://github.com/apache/arrow/pull/6981]

> [C++] Int96 memory images in test cases assume only little-endian
> -
>
> Key: PARQUET-1845
> URL: https://issues.apache.org/jira/browse/PARQUET-1845
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Int96 is used as a pair of uint_64 and uint_32. Both elements can be handled 
> using a native endian for effectiveness.
> Int96 memory images in parquet-internal-tests assume only little-endian.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1845) [C++] Int96 memory images in test cases assume only little-endian

2020-08-03 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1845:
-

Assignee: Kazuaki Ishizaki

> [C++] Int96 memory images in test cases assume only little-endian
> -
>
> Key: PARQUET-1845
> URL: https://issues.apache.org/jira/browse/PARQUET-1845
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Int96 is used as a pair of uint_64 and uint_32. Both elements can be handled 
> using a native endian for effectiveness.
> Int96 memory images in parquet-internal-tests assume only little-endian.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1882) [C++] Writing an all-null column and then reading it with buffered_stream aborts the process

2020-07-12 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1882.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 7718
[https://github.com/apache/arrow/pull/7718]

> [C++] Writing an all-null column and then reading it with buffered_stream 
> aborts the process
> 
>
> Key: PARQUET-1882
> URL: https://issues.apache.org/jira/browse/PARQUET-1882
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
> Environment: Windows 10 64-bit, MSVC
>Reporter: Eric Gorelik
>Assignee: Micah Kornfield
>Priority: Critical
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> When writing a column unbuffered that contains only nulls, a 0-byte 
> dictionary page gets written. When then reading the resulting file with 
> buffered_stream enabled, the column reader gets the length of the page (which 
> is 0), and then tries to read that many bytes from the underlying input 
> stream.
> parquet/column_reader.cc, SerializedPageReader::NextPage
>  
> {code:java}
> int compressed_len = current_page_header_.compressed_page_size;
> int uncompressed_len = current_page_header_.uncompressed_page_size;
> // Read the compressed data page.
> std::shared_ptr page_buffer;
> PARQUET_THROW_NOT_OK(stream_->Read(compressed_len, &page_buffer));{code}
>  
> BufferedInputStream::Read, however, has an assertion that the bytes to read 
> is strictly positive, so the assertion fails and aborts the process.
> arrow/io/buffered.cc, BufferedInputStream::Impl
>  
> {code:java}
> Status Read(int64_t nbytes, int64_t* bytes_read, void* out) {
>   ARROW_CHECK_GT(nbytes, 0);
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1882) [C++] Writing an all-null column and then reading it with buffered_stream aborts the process

2020-07-12 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1882:
--
Summary: [C++] Writing an all-null column and then reading it with 
buffered_stream aborts the process  (was: Writing an all-null column and then 
reading it with buffered_stream aborts the process)

> [C++] Writing an all-null column and then reading it with buffered_stream 
> aborts the process
> 
>
> Key: PARQUET-1882
> URL: https://issues.apache.org/jira/browse/PARQUET-1882
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
> Environment: Windows 10 64-bit, MSVC
>Reporter: Eric Gorelik
>Assignee: Micah Kornfield
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> When writing a column unbuffered that contains only nulls, a 0-byte 
> dictionary page gets written. When then reading the resulting file with 
> buffered_stream enabled, the column reader gets the length of the page (which 
> is 0), and then tries to read that many bytes from the underlying input 
> stream.
> parquet/column_reader.cc, SerializedPageReader::NextPage
>  
> {code:java}
> int compressed_len = current_page_header_.compressed_page_size;
> int uncompressed_len = current_page_header_.uncompressed_page_size;
> // Read the compressed data page.
> std::shared_ptr page_buffer;
> PARQUET_THROW_NOT_OK(stream_->Read(compressed_len, &page_buffer));{code}
>  
> BufferedInputStream::Read, however, has an assertion that the bytes to read 
> is strictly positive, so the assertion fails and aborts the process.
> arrow/io/buffered.cc, BufferedInputStream::Impl
>  
> {code:java}
> Status Read(int64_t nbytes, int64_t* bytes_read, void* out) {
>   ARROW_CHECK_GT(nbytes, 0);
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1839) [C++] values_read not updated in ReadBatchSpaced

2020-07-12 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1839.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 7717
[https://github.com/apache/arrow/pull/7717]

> [C++] values_read not updated in ReadBatchSpaced 
> -
>
> Key: PARQUET-1839
> URL: https://issues.apache.org/jira/browse/PARQUET-1839
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Nileema Shingte
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> values_read is not updated in some cases in the 
> `TypedColumnReaderImpl::ReadBatchSpaced` API
> we probably need to add 
> {code:java}
> *values_read = total_values;{code}
> After 
> [https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_reader.cc#L906]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1839) [C++] values_read not updated in ReadBatchSpaced

2020-07-12 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1839:
--
Summary: [C++] values_read not updated in ReadBatchSpaced   (was: 
values_read not updated in ReadBatchSpaced )

> [C++] values_read not updated in ReadBatchSpaced 
> -
>
> Key: PARQUET-1839
> URL: https://issues.apache.org/jira/browse/PARQUET-1839
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Nileema Shingte
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> values_read is not updated in some cases in the 
> `TypedColumnReaderImpl::ReadBatchSpaced` API
> we probably need to add 
> {code:java}
> *values_read = total_values;{code}
> After 
> [https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_reader.cc#L906]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1882) Writing an all-null column and then reading it with buffered_stream aborts the process

2020-07-09 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154724#comment-17154724
 ] 

Wes McKinney commented on PARQUET-1882:
---

Can you provide a reproducible code example?

> Writing an all-null column and then reading it with buffered_stream aborts 
> the process
> --
>
> Key: PARQUET-1882
> URL: https://issues.apache.org/jira/browse/PARQUET-1882
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
> Environment: Windows 10 64-bit, MSVC
>Reporter: Eric Gorelik
>Priority: Critical
>
> When writing a column unbuffered that contains only nulls, a 0-byte 
> dictionary page gets written. When then reading the resulting file with 
> buffered_stream enabled, the column reader gets the length of the page (which 
> is 0), and then tries to read that many bytes from the underlying input 
> stream.
> parquet/column_reader.cc, SerializedPageReader::NextPage
>  
> {code:java}
> int compressed_len = current_page_header_.compressed_page_size;
> int uncompressed_len = current_page_header_.uncompressed_page_size;
> // Read the compressed data page.
> std::shared_ptr page_buffer;
> PARQUET_THROW_NOT_OK(stream_->Read(compressed_len, &page_buffer));{code}
>  
> BufferedInputStream::Read, however, has an assertion that the bytes to read 
> is strictly positive, so the assertion fails and aborts the process.
> arrow/io/buffered.cc, BufferedInputStream::Impl
>  
> {code:java}
> Status Read(int64_t nbytes, int64_t* bytes_read, void* out) {
>   ARROW_CHECK_GT(nbytes, 0);
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-07-06 Thread Wes McKinney
On Mon, Jul 6, 2020 at 11:08 AM Antoine Pitrou  wrote:
>
>
> On 06/07/2020 at 17:57, Steve Kim wrote:
> > The Parquet format specification is ambiguous about the exact details of
> > LZ4 compression. However, the *de facto* reference implementation in Java
> > (parquet-mr) uses the Hadoop LZ4 codec.
> >
> > I think that it is important for Parquet c++ to have compatibility and
> > feature parity with parquet-mr when possible. I prefer to change the
> > LZ4 implementation in Parquet c++ to match the Hadoop LZ4 implementation
> > that is used by parquet-mr (
> > https://issues.apache.org/jira/browse/PARQUET-1878). I think that this
> > change will be quick and easy. I have an intern under my supervision who is
> > available to work on it full time, starting immediately. Please let me know
> > if we ought to proceed.
>
> Would that keep compatibility with existing files produces by Parquet C++?

Given that LZ4 has been constantly broken in C++ (first using the raw
format, then the block format -- still incompatible, apparently), I
think we would recommend that, in the rare event that people have
LZ4-compressed files (likely not very common; FWIW, Snappy is what's used
mostly), they rewrite their files with a different codec using
e.g. pyarrow 0.17.1 (a small sketch follows below).
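
For anyone who needs to do that, a minimal pyarrow sketch (the file names
are placeholders, and it assumes a pyarrow build that can still read the
original file):

    import pyarrow.parquet as pq

    # Read the LZ4-compressed file and rewrite it with a widely supported codec.
    table = pq.read_table("written_with_lz4.parquet")
    pq.write_table(table, "rewritten.parquet", compression="snappy")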

> Regards
>
> Antoine.


Re: [DISCUSS] Ongoing LZ4 problems with Parquet files

2020-06-29 Thread Wes McKinney
On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou  wrote:
>
>
> > On 25/06/2020 at 00:02, Wes McKinney wrote:
> > hi folks,
> >
> > (cross-posting to dev@arrow and dev@parquet since there are
> > stakeholders in both places)
> >
> > It seems there are still problems at least with the C++ implementation
> > of LZ4 compression in Parquet files
> >
> > https://issues.apache.org/jira/browse/PARQUET-1241
> > https://issues.apache.org/jira/browse/PARQUET-1878
>
> I don't have any particular opinion on how to solve the LZ4 issue, but
> I'd like to mention that LZ4 and ZStandard are the two most efficient
> compression algorithms available, and they span different parts of the
> speed/compression spectrum, so it would be a pity to disable one of them.

It's true; however, I think it's worse to write LZ4-compressed files
that cannot be read by other Parquet implementations (if that's what's
happening, as I understand it). If we are indeed shipping something
broken, then we should either fix it or disable it until it can be
fixed.

> Regards
>
> Antoine.


[DISCUSS] Ongoing LZ4 problems with Parquet files

2020-06-24 Thread Wes McKinney
hi folks,

(cross-posting to dev@arrow and dev@parquet since there are
stakeholders in both places)

It seems there are still problems at least with the C++ implementation
of LZ4 compression in Parquet files

https://issues.apache.org/jira/browse/PARQUET-1241
https://issues.apache.org/jira/browse/PARQUET-1878

If these problems cannot be resolved, I am going to recommend that we
disable use of LZ4 in the Parquet C++ library until these things can
be properly tested and validated across different implementations.
Thoughts? We're within weeks of the next Apache Arrow release so if
we're going to disable LZ4-for-Parquet it needs to happen soon.

Thanks
Wes


[jira] [Commented] (PARQUET-1878) [C++] lz4 codec is not compatible with Hadoop Lz4Codec

2020-06-18 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17139954#comment-17139954
 ] 

Wes McKinney commented on PARQUET-1878:
---

[~chairmank] can you also send an e-mail to dev@parquet.apache.org about this? 
We've been going around in circles on this LZ4 stuff and I think it's time that 
we fix this up once and for all across the implementations

cc [~apitrou] [~fsaintjacques] [~uwe] 

> [C++] lz4 codec is not compatible with Hadoop Lz4Codec
> --
>
> Key: PARQUET-1878
> URL: https://issues.apache.org/jira/browse/PARQUET-1878
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Steve M. Kim
>Priority: Major
>
> As described in HADOOP-12990, the Hadoop {{Lz4Codec}} uses the lz4 block 
> format, and it prepends 8 extra bytes before the compressed data. I believe 
> that lz4 implementation in parquet-cpp also uses the lz4 block format, but it 
> does not prepend these 8 extra bytes.
>  
> Using Java parquet-mr, I wrote a Parquet file with lz4 compression:
> {code:java}
> $ parquet-tools meta 
> /tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f
> file:
> file:/tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f
> creator: parquet-mr version 1.10.1 (build 
> a89df8f9932b6ef6633d06069e50c9b7970bebd1)file schema:
> 
> c1:  REQUIRED INT64 R:0 D:0
> c0:  REQUIRED BINARY R:0 D:0
> v0:  REQUIRED INT64 R:0 D:0row group 1: RC:5007 TS:28028 OFFSET:4
> 
> c1:   INT64 LZ4 DO:0 FPO:4 SZ:24797/25694/1.04 VC:5007 
> ENC:DELTA_BINARY_PACKED ST:[min: 1566330126476659000, max: 
> 1571211622650188000, num_nulls: 0]
> c0:   BINARY LZ4 DO:0 FPO:24801 SZ:279/260/0.93 VC:5007 
> ENC:PLAIN,RLE_DICTIONARY ST:[min: 
> 0x7471732F62656566616C6F2F746F6D6163636F2D66782D6D6B74646174612D6C69766573747265616D,
>  max: 
> 0x7471732F62656566616C6F2F746F6D6163636F2D66782D6D6B74646174612D6C69766573747265616D,
>  num_nulls: 0]
> v0:   INT64 LZ4 DO:0 FPO:25080 SZ:1348/2074/1.54 VC:5007 
> ENC:PLAIN,RLE_DICTIONARY ST:[min: 0, max: 9, num_nulls: 0] {code}
> When I attempted to read this file with parquet-cpp, I got the following 
> error:
> {code:java}
> >>> import pyarrow.parquet as pq
> >>> pq.read_table('/tmp/f4a1c7f57cb1c98c2b9da3b25b16d027df5d2f1cf55adb79374c154fbd79011f')
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 1536, in read_table
> return pf.read(columns=columns, use_threads=use_threads,
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 1260, in read
> table = piece.read(columns=columns, use_threads=use_threads,
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 707, in read
> table = reader.read(**options)
>   File 
> "/home/skim/miniconda3/envs/arrow/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 336, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1130, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> OSError: IOError: Corrupt Lz4 compressed data. {code}
>  
> [https://github.com/apache/arrow/issues/3491] reported incompatibility in the 
> other direction, using Spark (which uses the Hadoop lz4 codec) to read a 
> parquet file that was written with parquet-cpp.
>  
> Given that the Hadoop lz4 codec has long been in use, and users have 
> accumulated Parquet files that were written with this implementation, I 
> propose changing parquet-cpp to match the Hadoop implementation.
>  
> See also:
>  * 
> https://issues.apache.org/jira/browse/PARQUET-1241?focusedCommentId=16574328#comment-16574328
>  * 
> https://issues.apache.org/jira/browse/PARQUET-1241?focusedCommentId=16585288#comment-16585288



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1241) [C++] Use LZ4 frame format

2020-06-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1241:
--
Fix Version/s: cpp-1.6.0

> [C++] Use LZ4 frame format
> --
>
> Key: PARQUET-1241
> URL: https://issues.apache.org/jira/browse/PARQUET-1241
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp, parquet-format
>Reporter: Lawrence Chan
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> The parquet-format spec doesn't currently specify whether lz4-compressed data 
> should be framed or not. We should choose one and make it explicit in the 
> spec, as they are not inter-operable. After some discussions with others [1], 
> we think it would be beneficial to use the framed format, which adds a small 
> header in exchange for more self-contained decompression as well as a richer 
> feature set (checksums, parallel decompression, etc).
> The current arrow implementation compresses using the lz4 block format, and 
> this would need to be updated when we add the spec clarification.
> If backwards compatibility is a concern, I would suggest adding an additional 
> LZ4_FRAMED compression type, but that may be more noise than anything.
> [1] https://github.com/dask/fastparquet/issues/314



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1877) [C++] Reconcile container size with string size for memory issues

2020-06-17 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1877.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 7465
[https://github.com/apache/arrow/pull/7465]

> [C++] Reconcile container size with string size for memory issues
> -
>
> Key: PARQUET-1877
> URL: https://issues.apache.org/jira/browse/PARQUET-1877
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Right now the size can cause allocations an order of magnitude larger than 
> string size limits.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1859) [C++] Require error message when using ParquetException::EofException

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1859:
-

Assignee: (was: Wes McKinney)

> [C++] Require error message when using ParquetException::EofException
> -
>
> Key: PARQUET-1859
> URL: https://issues.apache.org/jira/browse/PARQUET-1859
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>    Reporter: Wes McKinney
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> "Unexpected end of stream" (the defaults) gives no clue where the failure 
> occurred



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1385) [C++] bloom_filter-test is very slow under valgrind

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1385:
-

Assignee: (was: Wes McKinney)

> [C++] bloom_filter-test is very slow under valgrind
> ---
>
> Key: PARQUET-1385
> URL: https://issues.apache.org/jira/browse/PARQUET-1385
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>    Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>
> This test takes ~5 minutes to run under valgrind in Travis CI
> {code}
> 1: [==] Running 6 tests from 6 test cases.
> 1: [--] Global test environment set-up.
> 1: [--] 1 test from Murmur3Test
> 1: [ RUN  ] Murmur3Test.TestBloomFilter
> 1: [   OK ] Murmur3Test.TestBloomFilter (19 ms)
> 1: [--] 1 test from Murmur3Test (34 ms total)
> 1: 
> 1: [--] 1 test from ConstructorTest
> 1: [ RUN  ] ConstructorTest.TestBloomFilter
> 1: [   OK ] ConstructorTest.TestBloomFilter (101 ms)
> 1: [--] 1 test from ConstructorTest (101 ms total)
> 1: 
> 1: [--] 1 test from BasicTest
> 1: [ RUN  ] BasicTest.TestBloomFilter
> 1: [   OK ] BasicTest.TestBloomFilter (49 ms)
> 1: [--] 1 test from BasicTest (49 ms total)
> 1: 
> 1: [--] 1 test from FPPTest
> 1: [ RUN  ] FPPTest.TestBloomFilter
> 1: [   OK ] FPPTest.TestBloomFilter (308731 ms)
> 1: [--] 1 test from FPPTest (308741 ms total)
> 1: 
> 1: [--] 1 test from CompatibilityTest
> 1: [ RUN  ] CompatibilityTest.TestBloomFilter
> 1: [   OK ] CompatibilityTest.TestBloomFilter (62 ms)
> 1: [--] 1 test from CompatibilityTest (62 ms total)
> 1: 
> 1: [--] 1 test from OptimalValueTest
> 1: [ RUN  ] OptimalValueTest.TestBloomFilter
> 1: [   OK ] OptimalValueTest.TestBloomFilter (27 ms)
> 1: [--] 1 test from OptimalValueTest (27 ms total)
> 1: 
> 1: [--] Global test environment tear-down
> 1: [==] 6 tests from 6 test cases ran. (309081 ms total)
> 1: [  PASSED  ] 6 tests.
> {code}
> Either we should change the FPPTest parameters to be faster, or we should not 
> run that test when using valgrind



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1352) [CPP] Trying to write an arrow table with structs to a parquet file

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1352:
-

Assignee: (was: Wes McKinney)

> [CPP] Trying to write an arrow table with structs to a parquet file
> ---
>
> Key: PARQUET-1352
> URL: https://issues.apache.org/jira/browse/PARQUET-1352
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.4.0
>Reporter: Dragan Markovic
>Priority: Major
>
> Relevant issue:[https://github.com/apache/arrow/issues/2287]
>  
> I'm creating a struct with the following schema in arrow: 
> https://pastebin.com/Cc8nreBP
>  
> When I try to convert that table to a .parquet file, the file gets created 
> with a valid schema (the one I posted above) and then throws this exception: 
> "lemented: Level generation for Struct not supported yet".
>  
> Here's the code: [https://ideone.com/DJkKUF]
>  
> Is there any way to write arrow table of structs to a .parquet file in cpp? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1385) [C++] bloom_filter-test is very slow under valgrind

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1385:
-

Assignee: (was: Wes McKinney)

> [C++] bloom_filter-test is very slow under valgrind
> ---
>
> Key: PARQUET-1385
> URL: https://issues.apache.org/jira/browse/PARQUET-1385
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>    Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>
> This test takes ~5 minutes to run under valgrind in Travis CI
> {code}
> 1: [==] Running 6 tests from 6 test cases.
> 1: [--] Global test environment set-up.
> 1: [--] 1 test from Murmur3Test
> 1: [ RUN  ] Murmur3Test.TestBloomFilter
> 1: [   OK ] Murmur3Test.TestBloomFilter (19 ms)
> 1: [--] 1 test from Murmur3Test (34 ms total)
> 1: 
> 1: [--] 1 test from ConstructorTest
> 1: [ RUN  ] ConstructorTest.TestBloomFilter
> 1: [   OK ] ConstructorTest.TestBloomFilter (101 ms)
> 1: [--] 1 test from ConstructorTest (101 ms total)
> 1: 
> 1: [--] 1 test from BasicTest
> 1: [ RUN  ] BasicTest.TestBloomFilter
> 1: [   OK ] BasicTest.TestBloomFilter (49 ms)
> 1: [--] 1 test from BasicTest (49 ms total)
> 1: 
> 1: [--] 1 test from FPPTest
> 1: [ RUN  ] FPPTest.TestBloomFilter
> 1: [   OK ] FPPTest.TestBloomFilter (308731 ms)
> 1: [--] 1 test from FPPTest (308741 ms total)
> 1: 
> 1: [--] 1 test from CompatibilityTest
> 1: [ RUN  ] CompatibilityTest.TestBloomFilter
> 1: [   OK ] CompatibilityTest.TestBloomFilter (62 ms)
> 1: [--] 1 test from CompatibilityTest (62 ms total)
> 1: 
> 1: [--] 1 test from OptimalValueTest
> 1: [ RUN  ] OptimalValueTest.TestBloomFilter
> 1: [   OK ] OptimalValueTest.TestBloomFilter (27 ms)
> 1: [--] 1 test from OptimalValueTest (27 ms total)
> 1: 
> 1: [--] Global test environment tear-down
> 1: [==] 6 tests from 6 test cases ran. (309081 ms total)
> 1: [  PASSED  ] 6 tests.
> {code}
> Either we should change the FPPTest parameters to be faster, or we should not 
> run that test when using valgrind



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-838) [CPP] Unable to read files written by parquet-cpp from parquet-tools

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-838:


Assignee: (was: Wes McKinney)

> [CPP] Unable to read files written by parquet-cpp from parquet-tools
> 
>
> Key: PARQUET-838
> URL: https://issues.apache.org/jira/browse/PARQUET-838
> Project: Parquet
>  Issue Type: Bug
>Reporter: Deepak Majeti
>Priority: Major
> Attachments: parquet_cpp_example.parquet
>
>
> I could not read files written by parquet-cpp from parquet-tools and Hive.
> Setting field ids in the schema metadata seems to be the problem. We should 
> make setting the field_id optional.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1385) [C++] bloom_filter-test is very slow under valgrind

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1385:
-

Assignee: Wes McKinney

> [C++] bloom_filter-test is very slow under valgrind
> ---
>
> Key: PARQUET-1385
> URL: https://issues.apache.org/jira/browse/PARQUET-1385
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>    Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>
> This test takes ~5 minutes to run under valgrind in Travis CI
> {code}
> 1: [==] Running 6 tests from 6 test cases.
> 1: [--] Global test environment set-up.
> 1: [--] 1 test from Murmur3Test
> 1: [ RUN  ] Murmur3Test.TestBloomFilter
> 1: [   OK ] Murmur3Test.TestBloomFilter (19 ms)
> 1: [--] 1 test from Murmur3Test (34 ms total)
> 1: 
> 1: [--] 1 test from ConstructorTest
> 1: [ RUN  ] ConstructorTest.TestBloomFilter
> 1: [   OK ] ConstructorTest.TestBloomFilter (101 ms)
> 1: [--] 1 test from ConstructorTest (101 ms total)
> 1: 
> 1: [--] 1 test from BasicTest
> 1: [ RUN  ] BasicTest.TestBloomFilter
> 1: [   OK ] BasicTest.TestBloomFilter (49 ms)
> 1: [--] 1 test from BasicTest (49 ms total)
> 1: 
> 1: [--] 1 test from FPPTest
> 1: [ RUN  ] FPPTest.TestBloomFilter
> 1: [   OK ] FPPTest.TestBloomFilter (308731 ms)
> 1: [--] 1 test from FPPTest (308741 ms total)
> 1: 
> 1: [--] 1 test from CompatibilityTest
> 1: [ RUN  ] CompatibilityTest.TestBloomFilter
> 1: [   OK ] CompatibilityTest.TestBloomFilter (62 ms)
> 1: [--] 1 test from CompatibilityTest (62 ms total)
> 1: 
> 1: [--] 1 test from OptimalValueTest
> 1: [ RUN  ] OptimalValueTest.TestBloomFilter
> 1: [   OK ] OptimalValueTest.TestBloomFilter (27 ms)
> 1: [--] 1 test from OptimalValueTest (27 ms total)
> 1: 
> 1: [--] Global test environment tear-down
> 1: [==] 6 tests from 6 test cases ran. (309081 ms total)
> 1: [  PASSED  ] 6 tests.
> {code}
> Either we should change the FPPTest parameters to be faster, or we should not 
> run that test when using valgrind



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-443) Schema resolution: map encoding

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-443:


Assignee: (was: Wes McKinney)

> Schema resolution: map encoding
> ---
>
> Key: PARQUET-443
> URL: https://issues.apache.org/jira/browse/PARQUET-443
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>    Reporter: Wes McKinney
>Priority: Major
>
> Related: PARQUET-441 and PARQUET-442



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-441) Schema resolution: one, two, and three-level array encoding

2020-06-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-441:


Assignee: (was: Wes McKinney)

> Schema resolution: one, two, and three-level array encoding
> ---
>
> Key: PARQUET-441
> URL: https://issues.apache.org/jira/browse/PARQUET-441
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>    Reporter: Wes McKinney
>Priority: Major
>
> While the Parquet spec recommends the "three-level" array encoding, two other 
> styles are possible in the wild, see for example:
> https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/exec/hdfs-parquet-scanner.cc#L1986



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1869) [C++] Large decimal values don't roundtrip correctly

2020-06-02 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17123699#comment-17123699
 ] 

Wes McKinney commented on PARQUET-1869:
---

I'm pretty sure this is a problem with conversion from Arrow format to the 
Parquet fixed-size-binary storage representation, so we might move this issue to 
the ARROW issue tracker. Either way, we should definitely try to fix this before 
the next major Arrow release.

> [C++] Large decimal values don't roundtrip correctly
> 
>
> Key: PARQUET-1869
> URL: https://issues.apache.org/jira/browse/PARQUET-1869
> Project: Parquet
>  Issue Type: Test
>  Components: parquet-cpp
>Reporter: Joris Van den Bossche
>Priority: Major
>
> Reproducer with python:
> {code}
> import decimal
> import pyarrow as pa
> import pyarrow.parquet as pq
> arr = pa.array([decimal.Decimal('9223372036854775808'), 
> decimal.Decimal('1.111')])
> print(arr)
> pq.write_table(pa.table({'a': arr}), "test_decimal.parquet") 
> result = pq.read_table("test_decimal.parquet")
> print(result.column('a'))
> {code}
> gives
> {code}
> # before writing
> 
> [
>   9223372036854775808.000,
>   1.111
> ]
> # after reading
> 
> [
>   [
>     -221360928884514619.392,
>     1.111
>   ]
> ]
> {code}
> I tried reading the file with a different parquet implementation (fastparquet 
> python package), and that gives the same values on read, so the issue might 
> rather be on the write side.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1855) [C++] Improve documentation on MetaData ownership

2020-05-24 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1855:
-

Assignee: Francois Saint-Jacques

> [C++] Improve documentation on MetaData ownership
> -
>
> Key: PARQUET-1855
> URL: https://issues.apache.org/jira/browse/PARQUET-1855
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> I had to look at the implementation to understand what the lifetime 
> relationships are for the following objects:
> * FileMetaData
> * RowGroupMetaData
> * ColumnChunkMetaData
> From what I gather, a reference to the top-level FileMetaData must be held 
> for the lifetime of any of the child objects (RowGroupMetaData and 
> ColumnChunkMetaData). It is unclear whether the original buffer from which the 
> metadata was deserialized must be held for the lifetime of the FileMetaData 
> object; I suspect it does not need to be kept.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1855) [C++] Improve documentation on MetaData ownership

2020-05-24 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1855.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 7244
[https://github.com/apache/arrow/pull/7244]

> [C++] Improve documentation on MetaData ownership
> -
>
> Key: PARQUET-1855
> URL: https://issues.apache.org/jira/browse/PARQUET-1855
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> I had to look at the implementation to understand what the lifetime 
> relationships are for the following objects:
> * FileMetaData
> * RowGroupMetaData
> * ColumnChunkMetaData
> From what I gather, a reference to the top-level FileMetaData must be held 
> for the lifetime of any of the child objects (RowGroupMetaData and 
> ColumnChunkMetaData). It is unclear whether the original buffer from which the 
> metadata was deserialized must be held for the lifetime of the FileMetaData 
> object; I suspect it does not need to be kept.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: definition written before repetition?

2020-05-22 Thread Wes McKinney
Sorry, I'm wrong -- C++ is doing it correctly, I was looking at the
wrong code. False alarm!

https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L685

I was shocked that such a blatant correctness issue might have existed,
but since people have been able to read nested data files with Spark
and other systems, everything is fine in C++.

On Fri, May 22, 2020 at 12:53 PM Wes McKinney  wrote:
>
> If that's the case (and according to the Format documentation it is)
> then we are doing it incorrectly in C++. How depressing
>
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1097
>
> This is unfortunately what happens when you don't have more rigorous
> integration tests.
>
>
> On Fri, May 22, 2020 at 3:14 AM Gabor Szadovszky  wrote:
> >
> > Hi ZJ,
> >
> > parquet-mr clearly writes repetition levels and definition levels according
> > to the specification. See the following code references.
> > For V1 pages:
> > https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterV1.java#L60
> > For V2 pages:
> > https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L655
> > https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L221-L225
> >
> > Regards,
> > Gabor
> >
> >
> > On Fri, May 22, 2020 at 6:35 AM Zhuo Jia Dai  wrote:
> >
> > > I raised this issue  https://github.com/JuliaIO/Parquet.jl/issues/60
> > >
> > > where the official parquet documentation states that repetition levels are
> > > written before definition levels. However, in the Julia Parquet package the
> > > parquet implementation reads definition before the repetition levels, and
> > > the author insists he is right but did not provide further evidence.
> > >
> > > I wanted to double-check this with the parquet dev community: is it true
> > > that definition levels need to be written before repetition levels? If so,
> > > then the parquet documentation is wrong and I am happy to PR a fix.
> > >
> > > Regards
> > > --
> > > ZJ
> > >
> > > zhuojia@gmail.com
> > >
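
To make the ordering concrete: the format specification puts repetition levels
before definition levels in a V1 data page body. The sketch below is purely
conceptual (it assumes the level runs and values have already been encoded
elsewhere; it is not a real encoder and skips the RLE/bit-packing a real writer
performs):

{code}
def assemble_v1_data_page_body(encoded_rep_levels: bytes,
                               encoded_def_levels: bytes,
                               encoded_values: bytes) -> bytes:
    """Conceptual layout of a V1 data page body per the format spec.

    Repetition levels come first, then definition levels, then the encoded
    values. The inputs are assumed to be already-encoded byte strings.
    """
    return encoded_rep_levels + encoded_def_levels + encoded_values
{code}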


Re: definition written before repetition?

2020-05-22 Thread Wes McKinney
If that's the case (and according to the Format documentation it is)
then we are doing it incorrectly in C++. How depressing

https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1097

This is unfortunately what happens when you don't have more rigorous
integration tests.


On Fri, May 22, 2020 at 3:14 AM Gabor Szadovszky  wrote:
>
> Hi ZJ,
>
> parquet-mr clearly writes repetition levels and definition levels according
> to the specification. See the following code references.
> For V1 pages:
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterV1.java#L60
> For V2 pages:
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L655
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L221-L225
>
> Regards,
> Gabor
>
>
> On Fri, May 22, 2020 at 6:35 AM Zhuo Jia Dai  wrote:
>
> > I raised this issue  https://github.com/JuliaIO/Parquet.jl/issues/60
> >
> > where the official parquet documentation states that repetition levels are
> > written before definition levels. However, in the Julia Parquet package the
> > parquet implementation reads definition before the repetition levels, and
> > the author insists he is right but did not provide further evidence.
> >
> > I wanted to double-check this with the parquet dev community: is it true
> > that definition levels need to be written before repetition levels? If so,
> > then the parquet documentation is wrong and I am happy to PR a fix.
> >
> > Regards
> > --
> > ZJ
> >
> > zhuojia@gmail.com
> >


[jira] [Resolved] (PARQUET-1861) [Documentation][C++] Explain ReaderProperters.buffer_stream*

2020-05-21 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1861.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 7221
[https://github.com/apache/arrow/pull/7221]

> [Documentation][C++] Explain ReaderProperters.buffer_stream*
> 
>
> Key: PARQUET-1861
> URL: https://issues.apache.org/jira/browse/PARQUET-1861
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)
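
For readers wondering how the buffered-stream option surfaces outside of C++:
a minimal sketch, assuming a local file named data.parquet and assuming (my
reading of the bindings, not something stated in this issue) that pyarrow's
buffer_size argument maps onto the buffered-stream ReaderProperties:

{code}
import pyarrow.parquet as pq

# buffer_size > 0 requests buffered stream reads of each column chunk instead
# of loading whole chunks at once; 0 (the default) disables buffering.
table = pq.read_table("data.parquet", buffer_size=64 * 1024)
print(table.num_rows)
{code}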


[ANNOUNCE] New Parquet committers: Micah Kornfield and Antoine Pitrou

2020-05-21 Thread Wes McKinney
On behalf of the Parquet PMC, I'm pleased to announce that Micah and
Antoine have been invited to be Parquet committers and they have both
accepted. Welcome, and thank you for your contributions!


[jira] [Resolved] (PARQUET-1865) [C++] Failure from C++17 feature used in parquet/encoding_benchmark.cc

2020-05-20 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1865.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 7237
[https://github.com/apache/arrow/pull/7237]

> [C++] Failure from C++17 feature used in parquet/encoding_benchmark.cc
> --
>
> Key: PARQUET-1865
> URL: https://issues.apache.org/jira/browse/PARQUET-1865
> Project: Parquet
>  Issue Type: Bug
>    Reporter: Wes McKinney
>    Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> {code}
> ir/encoding_benchmark.cc.o -c ../src/parquet/encoding_benchmark.cc
> ../src/parquet/encoding_benchmark.cc:242:53: error: static_assert with no 
> message is a C++17 extension [-Werror,-Wc++17-extensions]
>   static_assert(sizeof(CType) == sizeof(*raw_values));
> ^
> , ""
> ../src/parquet/encoding_benchmark.cc:286:53: error: static_assert with no 
> message is a C++17 extension [-Werror,-Wc++17-extensions]
>   static_assert(sizeof(CType) == sizeof(*raw_values));
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1865) [C++] Failure from C++17 feature used in parquet/encoding_benchmark.cc

2020-05-20 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1865:
-

Assignee: Wes McKinney

> [C++] Failure from C++17 feature used in parquet/encoding_benchmark.cc
> --
>
> Key: PARQUET-1865
> URL: https://issues.apache.org/jira/browse/PARQUET-1865
> Project: Parquet
>  Issue Type: Bug
>    Reporter: Wes McKinney
>    Assignee: Wes McKinney
>Priority: Major
>
> {code}
> ir/encoding_benchmark.cc.o -c ../src/parquet/encoding_benchmark.cc
> ../src/parquet/encoding_benchmark.cc:242:53: error: static_assert with no 
> message is a C++17 extension [-Werror,-Wc++17-extensions]
>   static_assert(sizeof(CType) == sizeof(*raw_values));
> ^
> , ""
> ../src/parquet/encoding_benchmark.cc:286:53: error: static_assert with no 
> message is a C++17 extension [-Werror,-Wc++17-extensions]
>   static_assert(sizeof(CType) == sizeof(*raw_values));
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1865) [C++] Failure from C++17 feature used in parquet/encoding_benchmark.cc

2020-05-20 Thread Wes McKinney (Jira)
Wes McKinney created PARQUET-1865:
-

 Summary: [C++] Failure from C++17 feature used in 
parquet/encoding_benchmark.cc
 Key: PARQUET-1865
 URL: https://issues.apache.org/jira/browse/PARQUET-1865
 Project: Parquet
  Issue Type: Bug
Reporter: Wes McKinney


{code}
ir/encoding_benchmark.cc.o -c ../src/parquet/encoding_benchmark.cc
../src/parquet/encoding_benchmark.cc:242:53: error: static_assert with no 
message is a C++17 extension [-Werror,-Wc++17-extensions]
  static_assert(sizeof(CType) == sizeof(*raw_values));
^
, ""
../src/parquet/encoding_benchmark.cc:286:53: error: static_assert with no 
message is a C++17 extension [-Werror,-Wc++17-extensions]
  static_assert(sizeof(CType) == sizeof(*raw_values));
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Parquet - 41

2020-05-14 Thread Wes McKinney
OK -- the comments about Bloom filters make me concerned that the scope of
what you are working on for PARQUET-1404 is expanding without the past issues
being resolved, so we could end up with a large patch that may not be mergeable
without a lot of additional work. It would be best to break the work up into
smaller patches if possible and work to get them merged into the project.

On Wed, May 13, 2020, 10:43 PM Lekshmi Narayanan, Arun Balajiee <
arl...@pitt.edu> wrote:

> Firstly, thanks for adding me.
>
> Yes, I want to do this in relation to PARQUET-1404. I completed the read
> and write index API, but at the moment, to make it approvable as a PR, I
> have to remove all the other file changes and address your comments as
> well. I can come back to those when I complete my thesis defense at my
> school, if that is okay.
>
> Regards
> Arun Balajiee
>
>
> Regards,
>
> Arun Balajiee
>
> 
> From: Wes McKinney 
> Sent: Wednesday, May 13, 2020 10:27:29 PM
> To: Parquet Dev 
> Subject: Re: Parquet - 41
>
> I just added Arun as a contributor.
>
> @Arun -- are you planning to do this in relation to PARQUET-1404?
> Where does that project stand?
>
> On Wed, May 13, 2020 at 9:22 PM Junjie Chen 
> wrote:
> >
> > You need a committer to add you as a contributor to the project. I'm not
> a
> > committer yet...  @Gabor, could you please help to assign this?
> >
> > On Thu, May 14, 2020 at 7:28 AM Lekshmi Narayanan, Arun Balajiee <
> > arl...@pitt.edu> wrote:
> >
> > > Hi
> > >
> > > Just wanted to re-confirm that I am working on the C++ implementation
> of
> > > Bloom Filters in Arrow. I don't have the access level to complete this.
> > > Could you assign me to this ticket?
> > >
> > >
> https://issues.apache.org/jira/browse/PARQUET-1327
> > > username: encodedgeek
> > >
> > > Regards
> > > 
> > > From: Lekshmi Narayanan, Arun Balajiee 
> > > Sent: 21 April 2020 06:57
> > > To: dev@parquet.apache.org 
> > > Subject: Re: Parquet - 41
> > >
> > > Yes. I would like to contribute to bloom filters in Arrow
> > >
> > > I also wanted to check, would it be a good idea to add Bloom filters in
> > > Column Indices ( PARQUET-1404<
> > >
> https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-1404?filter=allopenissues
> >
> > > )
> > >
> > > Regards
> > > Arun Balajiee
> > >
> > > 
> > > From: Junjie Chen 
> > > Sent: 20 April 2020 22:20
> > > To: dev@parquet.apache.org 
> > > Subject: Re: Parquet - 41
> > >
> > > As far as I know, not implemented yet. The thrift is up to date
> > > now,
> > > would you like to contribute?
> > >
> > > Things we need are:
> > > 1. xxhash c++ implementation
> > > 2. reader and writer for the bloom filter
> > > 3. filtering logic for row group
> > >
> > > Implementing the reader would be a good start.
> > >
> > > On Tue, Apr 21, 2020 at 8:52 AM  wrote:
> > >
> > > > Hi
> > > >
> > > > Is the  C++ version of bloom filter implemented in Arrow Parquet C++?
> > > >
> > > >
> > >
> https://issues.apache.org/jira/browse/PARQUET-41
> > > > [PARQUET-41] Add bloom filters to parquet statistics - ASF JIRA<
> > > >
> > >
> https://issues.apache.org/jira/browse/PARQUET-41
> > > >
> > > > For row groups with no dictionary, we could still produce a bloom
> filter.
> > > > This could be very useful in filtering entire row groups. Pull
> request:
> > > >
> > >
> https://github.com/
> > > ...
> > > > issues.apache.org
> > > > Regards
> > > >
> > >
> > >
> > > --
> > > Best Regards
> > >
> >
> >
> > --
> > Best Regards
>


Re: Parquet - 41

2020-05-13 Thread Wes McKinney
I just added Arun as a contributor.

@Arun -- are you planning to do this in relation to PARQUET-1404?
Where does that project stand?

On Wed, May 13, 2020 at 9:22 PM Junjie Chen  wrote:
>
> You need a committer to add you as a contributor to the project. I'm not a
> committer yet...  @Gabor, could you please help to assign this?
>
> On Thu, May 14, 2020 at 7:28 AM Lekshmi Narayanan, Arun Balajiee <
> arl...@pitt.edu> wrote:
>
> > Hi
> >
> > Just wanted to re-confirm that I am working on the C++ implementation of
> > Bloom Filters in Arrow. I don't have the access level to complete this.
> > Could you assign me to this ticket?
> >
> > https://issues.apache.org/jira/browse/PARQUET-1327
> > username: encodedgeek
> >
> > Regards
> > 
> > From: Lekshmi Narayanan, Arun Balajiee 
> > Sent: 21 April 2020 06:57
> > To: dev@parquet.apache.org 
> > Subject: Re: Parquet - 41
> >
> > Yes. I would like to contribute to bloom filters in Arrow
> >
> > I also wanted to check, would it be a good idea to add Bloom filters in
> > Column Indices ( PARQUET-1404<
> > https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-1404?filter=allopenissues>
> > )
> >
> > Regards
> > Arun Balajiee
> >
> > 
> > From: Junjie Chen 
> > Sent: 20 April 2020 22:20
> > To: dev@parquet.apache.org 
> > Subject: Re: Parquet - 41
> >
> > As far as I know, not implemented yet. The thrift is up to date now,
> > would you like to contribute?
> >
> > Things we need are:
> > 1. xxhash c++ implementation
> > 2. reader and writer for the bloom filter
> > 3. filtering logic for row group
> >
> > Implementing the reader would be a good start.
> >
> > On Tue, Apr 21, 2020 at 8:52 AM  wrote:
> >
> > > Hi
> > >
> > > Is the  C++ version of bloom filter implemented in Arrow Parquet C++?
> > >
> > >
> > https://issues.apache.org/jira/browse/PARQUET-41
> > > [PARQUET-41] Add bloom filters to parquet statistics - ASF JIRA<
> > >
> > https://issues.apache.org/jira/browse/PARQUET-41
> > >
> > > For row groups with no dictionary, we could still produce a bloom filter.
> > > This could be very useful in filtering entire row groups. Pull request:
> > >
> > https://github.com/
> > ...
> > > issues.apache.org
> > > Regards
> > >
> >
> >
> > --
> > Best Regards
> >
>
>
> --
> Best Regards
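
For context on item 3 above ("filtering logic for row group"), here is a
conceptual sketch of how a Bloom filter lets a reader skip row groups. It is
deliberately simplified: the Parquet design behind PARQUET-41/PARQUET-1327 uses
xxhash and a split-block layout, whereas this toy uses hashlib and plain double
hashing, so treat it only as an illustration of the idea:

{code}
import hashlib

class ToyBloomFilter:
    """Toy Bloom filter; NOT Parquet's split-block Bloom filter."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value: bytes):
        # Double hashing: derive k bit positions from two digests.
        h1 = int.from_bytes(hashlib.sha256(value).digest()[:8], "little")
        h2 = int.from_bytes(hashlib.md5(value).digest()[:8], "little")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, value: bytes) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))

# One filter per column chunk per row group; a reader can skip the whole row
# group when the filter rules the predicate value out (no false negatives).
bf = ToyBloomFilter()
for v in (b"alice", b"bob"):
    bf.add(v)

assert bf.might_contain(b"alice")
if not bf.might_contain(b"carol"):
    print("row group can be skipped for the predicate col == 'carol'")
{code}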


Re: [Python][Documentation] Add column limit recommendations Parquet page

2020-05-09 Thread Wes McKinney
hi Maarten,

I added dev@parquet.apache.org to this (if you are not subscribed to
this list you may want to)

I made a quick notebook to help illustrate:

https://gist.github.com/wesm/cabf684db3ce8fdd6df27cf782f7226e

Summary:

* Files with 1000+ columns can see the metadata-to-data ratio exceed
10% (in the example I made it's 15-20%).
* The time to deserialize whole files starts to balloon superlinearly as
files become extremely wide

On Sat, May 9, 2020 at 4:28 PM Maarten Ballintijn  wrote:
>
> Wes,
>
> "Users would be well advised to not write columns with large numbers (> 1000) 
> of columns"
> You've mentioned this before, and as this is in my experience not an uncommon 
> use case, can you maybe expand a bit on the following related questions?
> (Use cases include daily or minute data for a few tens of thousands of items 
> like stocks or other financial instruments, IoT sensors, etc.)
>
> Parquet Standard - Do you think the issue is intrinsic to the Parquet standard? 
> The ability to read a subset of the columns and/or row groups, and compact 
> storage through the use of RLE, categoricals etc., all seem to point to the 
> format being well suited for these use cases

Parquet files by design are pretty heavy on metadata -- which is fine
when the number of columns is small. When files have many columns, the
costs associated with dealing with the file metadata really add up
because the ratio of metadata to data in the file becomes skewed.
Also, the common FileMetaData must be entirely parsed even when you
only want to read one column.

> Parquet-C++ implementation - Is the issue with the current Parquet-C++ 
> implementation, or with any of the dependencies? Is it something which could be 
> fixed? Would a specialized implementation help? Is the problem related to 
> going from Parquet -> Arrow -> Python/Pandas? E.g. would a Parquet -> numpy 
> reader work better?

No, it's not an issue specific to the C++ implementation.

> Alternatives - What would you recommend as a superior solution? Store this 
> data tall i.s.o wide? Use another storage format?

It really depends on your particular use case. You can try other
solutions (e.g. Arrow IPC / Feather files, or row-oriented data
formats) and see what works best

> Appreciate your (and others) insights.
>
> Cheers, Maarten.
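
For anyone who wants to reproduce the metadata-to-data observation locally,
here is a minimal sketch (the 2,000-column table and file name are invented for
illustration; only pyarrow and the standard library are assumed):

{code}
import os

import pyarrow as pa
import pyarrow.parquet as pq

# Build a deliberately wide table: 2,000 float columns, 10 rows each.
ncols = 2000
table = pa.table({f"c{i}": [float(i)] * 10 for i in range(ncols)})
pq.write_table(table, "wide.parquet")

md = pq.ParquetFile("wide.parquet").metadata
meta_size = md.serialized_size            # Thrift-encoded footer, in bytes
file_size = os.path.getsize("wide.parquet")

print(f"{md.num_columns} columns, {md.num_row_groups} row group(s)")
print(f"footer: {meta_size} bytes ({meta_size / file_size:.1%} of the file)")
{code}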


[jira] [Updated] (PARQUET-1861) [Documentation][C++] Explain ReaderProperters.buffer_stream*

2020-05-08 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1861:
--
Summary: [Documentation][C++] Explain ReaderProperters.buffer_stream*  
(was: [Documentation] Explain ReaderProperters.buffer_stream*)

> [Documentation][C++] Explain ReaderProperters.buffer_stream*
> 
>
> Key: PARQUET-1861
> URL: https://issues.apache.org/jira/browse/PARQUET-1861
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1857) [C++][Parquet] ParquetFileReader unable to read files with more than 32767 row groups

2020-05-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1857.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 7108
[https://github.com/apache/arrow/pull/7108]

> [C++][Parquet] ParquetFileReader unable to read files with more than 32767 
> row groups
> -
>
> Key: PARQUET-1857
> URL: https://issues.apache.org/jira/browse/PARQUET-1857
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Novice
>    Assignee: Wes McKinney
>Priority: Critical
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
> Attachments: test.parquet.tgz, test_2.parquet.tgz
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> I am using Rust to write a Parquet file and read it from Python.
> When write_batch with 1 batch size, reading the Parquet file from Python 
> gives the error below:
> ```
> >>> pd.read_parquet("some.parquet", engine="pyarrow")
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home//.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  return impl.read(path, columns=columns, **kwargs)
>  File "/home//.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 125, in read
>  path, columns=columns, **kwargs
>  File 
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1537, in read_table
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1262, in read
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 707, in read
>  table = reader.read(**options)
>  File 
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 337, in read
>  use_threads=use_threads)
>  File "pyarrow/_parquet.pyx", line 1130, in 
> pyarrow._parquet.ParquetReader.read_all
>  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
>  OSError: Unexpected end of stream
> ```
> Also, when using batch size 1 and then reading from Python, there is an error too: 
> ```
> >>> pd.read_parquet("some.parquet", engine="pyarrow")
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  return impl.read(path, columns=columns, **kwargs)
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 125, in read
>  path, columns=columns, **kwargs
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1537, in read_table
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1262, in read
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 707, in read
>  table = reader.read(**options)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 337, in read
>  use_threads=use_threads)
>  File "pyarrow/_parquet.pyx", line 1130, in 
> pyarrow._parquet.ParquetReader.read_all
>  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
>  OSError: The file only has 0 columns, requested metadata for column: 6
> ```
> Using batch size 1000 is fine.
> Note that my data has 450047 rows. Schema:
> ```
> message schema
> { REQUIRED INT32 a; REQUIRED INT32 b; REQUIRED INT32 c; REQUIRED INT64 d; 
> REQUIRED INT32 e; REQUIRED BYTE_ARRAY f (UTF8); REQUIRED BOOLEAN g; }
> ```
>  
> EDIT: as I add more rows (an estimated 80 million), using batch size 1000 does 
> not work either:
> ```
> >>> df = pd.read_parquet("data/ping_pong.parquet", engine="pyarrow")
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  return impl.read(path, columns=columns, **kwargs)
>  File "/home/.local/lib/python3.7/site
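
A quick way to check how many row groups a suspect file actually contains,
assuming a local copy of the attached test.parquet (32767 is INT16_MAX, which
is presumably why files beyond that threshold were unreadable before this fix):

{code}
import pyarrow.parquet as pq

md = pq.ParquetFile("test.parquet").metadata
print("row groups:", md.num_row_groups)   # > 32767 triggered the reported bug
print("rows:      ", md.num_rows)
print("columns:   ", md.num_columns)
{code}
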

[jira] [Commented] (PARQUET-1858) [Python] [Rust] Parquet read file fails with batch size 1_000_000 and 41 row groups

2020-05-06 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100758#comment-17100758
 ] 

Wes McKinney commented on PARQUET-1858:
---

Yes it looks like the file written by Rust is malformed. That two independent 
implementations fail is good evidence of that. 

> [Python] [Rust] Parquet read file fails with batch size 1_000_000 and 41 row 
> groups
> ---
>
> Key: PARQUET-1858
> URL: https://issues.apache.org/jira/browse/PARQUET-1858
> Project: Parquet
>  Issue Type: Bug
>Reporter: Novice
>Priority: Major
> Attachments: test_2.parquet.tgz
>
>
> Here is the error I got:
> Pyarrow:
> ```
> >>> df = pd.read_parquet("test.parquet", engine="pyarrow")
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  return impl.read(path, columns=columns, **kwargs)
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 125, in read
>  path, columns=columns, **kwargs
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1281, in read_table
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1137, in read
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 605, in read
>  table = reader.read(**options)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 253, in read
>  use_threads=use_threads)
>  File "pyarrow/_parquet.pyx", line 1136, in 
> pyarrow._parquet.ParquetReader.read_all
>  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
>  OSError: Unexpected end of stream
> ```
> fastparquet:
> ```
>  >>> df = pd.read_parquet("test.parquet", engine="fastparquet")
>  
> /home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/encoding.py:222:
>  NumbaDeprecationWarning: The 'numba.jitclass' decorator has moved to 
> 'numba.experimental.jitclass' to better reflect the experimental nature of 
> the functionality. Please update your imports to accommodate this change and 
> see 
> [http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#change-of-jitclass-location]
>  for the time frame.
>  Numpy8 = numba.jitclass(spec8)(NumpyIO)
>  
> /home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/encoding.py:224:
>  NumbaDeprecationWarning: The 'numba.jitclass' decorator has moved to 
> 'numba.experimental.jitclass' to better reflect the experimental nature of 
> the functionality. Please update your imports to accommodate this change and 
> see 
> [http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#change-of-jitclass-location]
>  for the time frame.
>  Numpy32 = numba.jitclass(spec32)(NumpyIO)
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  return impl.read(path, columns=columns, **kwargs)
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 201, in read
>  return parquet_file.to_pandas(columns=columns, **kwargs)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/api.py", 
> line 399, in to_pandas
>  index=index, assign=parts)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/api.py", 
> line 228, in read_row_group
>  scheme=self.file_scheme)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/core.py", 
> line 354, in read_row_group
>  cats, selfmade, assign=assign)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/core.py", 
> line 331, in read_row_group_arrays
>  catdef=out.get(name+'-catdef', None))
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/core.py", 
> line 245, in read_col
>  skip_nulls, selfmade=selfmade)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/core.py", 
> line 99, in read_data_page
>  raw_bytes = _read

[jira] [Assigned] (PARQUET-1859) [C++] Require error message when using ParquetException::EofException

2020-05-05 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1859:
-

Assignee: Wes McKinney

> [C++] Require error message when using ParquetException::EofException
> -
>
> Key: PARQUET-1859
> URL: https://issues.apache.org/jira/browse/PARQUET-1859
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>    Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> "Unexpected end of stream" (the defaults) gives no clue where the failure 
> occurred



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1858) [Python] [Rust] Parquet read file fails with batch size 1_000_000 and 41 row groups

2020-05-05 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100260#comment-17100260
 ] 

Wes McKinney commented on PARQUET-1858:
---

The PLAIN encoding for the boolean type is possibly malformed. I opened 
PARQUET-1859 about providing better error messages, but here is what the 
failure is

{code}
$ python test.py 
Traceback (most recent call last):
  File "test.py", line 7, in 
pq.read_table(path)
  File "/home/wesm/code/arrow/python/pyarrow/parquet.py", line 1539, in 
read_table
use_pandas_metadata=use_pandas_metadata)
  File "/home/wesm/code/arrow/python/pyarrow/parquet.py", line 1264, in read
use_pandas_metadata=use_pandas_metadata)
  File "/home/wesm/code/arrow/python/pyarrow/parquet.py", line 707, in read
table = reader.read(**options)
  File "/home/wesm/code/arrow/python/pyarrow/parquet.py", line 337, in read
use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 1130, in 
pyarrow._parquet.ParquetReader.read_all
check_status(self.reader.get()
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
raise IOError(message)
OSError: Unexpected end of stream: Failed to decode 100 bits for boolean 
PLAIN encoding only decoded 2048
In ../src/parquet/arrow/reader.cc, line 844, code: final_status
{code}

Can this file be read by the Java library?

> [Python] [Rust] Parquet read file fails with batch size 1_000_000 and 41 row 
> groups
> ---
>
> Key: PARQUET-1858
> URL: https://issues.apache.org/jira/browse/PARQUET-1858
> Project: Parquet
>  Issue Type: Bug
>Reporter: Novice
>Priority: Major
> Attachments: test_2.parquet.tgz
>
>
> Here is the error I got:
> Pyarrow:
> ```
> >>> df = pd.read_parquet("test.parquet", engine="pyarrow")
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  return impl.read(path, columns=columns, **kwargs)
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 125, in read
>  path, columns=columns, **kwargs
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1281, in read_table
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1137, in read
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 605, in read
>  table = reader.read(**options)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 253, in read
>  use_threads=use_threads)
>  File "pyarrow/_parquet.pyx", line 1136, in 
> pyarrow._parquet.ParquetReader.read_all
>  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
>  OSError: Unexpected end of stream
> ```
> fastparquet:
> ```
>  >>> df = pd.read_parquet("test.parquet", engine="fastparquet")
>  
> /home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/encoding.py:222:
>  NumbaDeprecationWarning: The 'numba.jitclass' decorator has moved to 
> 'numba.experimental.jitclass' to better reflect the experimental nature of 
> the functionality. Please update your imports to accommodate this change and 
> see 
> [http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#change-of-jitclass-location]
>  for the time frame.
>  Numpy8 = numba.jitclass(spec8)(NumpyIO)
>  
> /home/miniconda3/envs/ds/lib/python3.7/site-packages/fastparquet/encoding.py:224:
>  NumbaDeprecationWarning: The 'numba.jitclass' decorator has moved to 
> 'numba.experimental.jitclass' to better reflect the experimental nature of 
> the functionality. Please update your imports to accommodate this change and 
> see 
> [http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#change-of-jitclass-location]
>  for the time frame.
>  Numpy32 = numba.jitclass(spec32)(NumpyIO)
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  return impl.read(path, columns=columns, **kwargs)
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py

[jira] [Created] (PARQUET-1859) [C++] Require error message when using ParquetException::EofException

2020-05-05 Thread Wes McKinney (Jira)
Wes McKinney created PARQUET-1859:
-

 Summary: [C++] Require error message when using 
ParquetException::EofException
 Key: PARQUET-1859
 URL: https://issues.apache.org/jira/browse/PARQUET-1859
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Wes McKinney
 Fix For: cpp-1.6.0


"Unexpected end of stream" (the defaults) gives no clue where the failure 
occurred



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1857) [C++][Parquet] ParquetFileReader unable to read files with more than 32767 row groups

2020-05-05 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100084#comment-17100084
 ] 

Wes McKinney commented on PARQUET-1857:
---

I put up a PR for the first problem you reported. If there are failures with < 
32768 row groups, can you open a new JIRA and post the file, since that 
will have to be investigated separately?

> [C++][Parquet] ParquetFileReader unable to read files with more than 32767 
> row groups
> -
>
> Key: PARQUET-1857
> URL: https://issues.apache.org/jira/browse/PARQUET-1857
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Novice
>Assignee: Wes McKinney
>Priority: Critical
>  Labels: pull-request-available
> Attachments: test.parquet.tgz, test_2.parquet.tgz
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I am using Rust to write a Parquet file and read it from Python.
> When write_batch with 1 batch size, reading the Parquet file from Python 
> gives the error below:
> ```
> >>> pd.read_parquet("some.parquet", engine="pyarrow")
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home//.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  return impl.read(path, columns=columns, **kwargs)
>  File "/home//.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 125, in read
>  path, columns=columns, **kwargs
>  File 
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1537, in read_table
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1262, in read
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 707, in read
>  table = reader.read(**options)
>  File 
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 337, in read
>  use_threads=use_threads)
>  File "pyarrow/_parquet.pyx", line 1130, in 
> pyarrow._parquet.ParquetReader.read_all
>  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
>  OSError: Unexpected end of stream
> ```
> Also, when using batch size 1 and then reading from Python, there is an error too: 
> ```
> >>> pd.read_parquet("some.parquet", engine="pyarrow")
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  return impl.read(path, columns=columns, **kwargs)
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 125, in read
>  path, columns=columns, **kwargs
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1537, in read_table
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1262, in read
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 707, in read
>  table = reader.read(**options)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 337, in read
>  use_threads=use_threads)
>  File "pyarrow/_parquet.pyx", line 1130, in 
> pyarrow._parquet.ParquetReader.read_all
>  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
>  OSError: The file only has 0 columns, requested metadata for column: 6
> ```
> Using batch size 1000 is fine.
> Note that my data has 450047 rows. Schema:
> ```
> message schema
> { REQUIRED INT32 a; REQUIRED INT32 b; REQUIRED INT32 c; REQUIRED INT64 d; 
> REQUIRED INT32 e; REQUIRED BYTE_ARRAY f (UTF8); REQUIRED BOOLEAN g; }
> ```
>  
> EDIT: as I add more rows (an estimated 80 million), using batch size 1000 does 
> not work either:
> ```
> >>> df = pd.read_parquet("data/ping_pong.parquet", engine="pyarrow")
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  return impl.read(path

[jira] [Moved] (PARQUET-1857) [C++][Parquet] ParquetFileReader unable to read files with more than 32767 row groups

2020-05-05 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney moved ARROW-8677 to PARQUET-1857:
--

  Component/s: (was: Rust)
   (was: Python)
   parquet-cpp
  Key: PARQUET-1857  (was: ARROW-8677)
Affects Version/s: (was: 0.17.0)
 Workflow: patch-available, re-open possible  (was: jira)
  Environment: (was: Linux debian
)
  Project: Parquet  (was: Apache Arrow)

> [C++][Parquet] ParquetFileReader unable to read files with more than 32767 
> row groups
> -
>
> Key: PARQUET-1857
> URL: https://issues.apache.org/jira/browse/PARQUET-1857
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Novice
>    Assignee: Wes McKinney
>Priority: Critical
> Attachments: test.parquet.tgz
>
>
> I am using Rust to write a Parquet file and read it from Python.
> When write_batch with 1 batch size, reading the Parquet file from Python 
> gives the error below:
> ```
> >>> pd.read_parquet("some.parquet", engine="pyarrow")
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home//.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  return impl.read(path, columns=columns, **kwargs)
>  File "/home//.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 125, in read
>  path, columns=columns, **kwargs
>  File 
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1537, in read_table
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1262, in read
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 707, in read
>  table = reader.read(**options)
>  File 
> "/home//miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 337, in read
>  use_threads=use_threads)
>  File "pyarrow/_parquet.pyx", line 1130, in 
> pyarrow._parquet.ParquetReader.read_all
>  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
>  OSError: Unexpected end of stream
> ```
> Also, when using batch size 1 and then reading from Python, there is an error too: 
> ```
> >>> pd.read_parquet("some.parquet", engine="pyarrow")
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  return impl.read(path, columns=columns, **kwargs)
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 125, in read
>  path, columns=columns, **kwargs
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1537, in read_table
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 1262, in read
>  use_pandas_metadata=use_pandas_metadata)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 707, in read
>  table = reader.read(**options)
>  File 
> "/home/miniconda3/envs/ds/lib/python3.7/site-packages/pyarrow/parquet.py", 
> line 337, in read
>  use_threads=use_threads)
>  File "pyarrow/_parquet.pyx", line 1130, in 
> pyarrow._parquet.ParquetReader.read_all
>  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
>  OSError: The file only has 0 columns, requested metadata for column: 6
> ```
> Using batch size 1000 is fine.
> Note that my data has 450047 rows. Schema:
> ```
> message schema
> { REQUIRED INT32 a; REQUIRED INT32 b; REQUIRED INT32 c; REQUIRED INT64 d; 
> REQUIRED INT32 e; REQUIRED BYTE_ARRAY f (UTF8); REQUIRED BOOLEAN g; }
> ```
>  
> EDIT: as I add more rows (an estimated 80 million), using batch size 1000 does 
> not work either:
> ```
> >>> df = pd.read_parquet("data/ping_pong.parquet", engine="pyarrow")
>  Traceback (most recent call last):
>  File "", line 1, in 
>  File "/home/.local/lib/python3.7/site-packages/pandas/io/parquet.py", line 
> 296, in read_parquet
>  retur

[jira] [Created] (PARQUET-1856) [C++] Test suite assumes that Snappy support is built

2020-05-04 Thread Wes McKinney (Jira)
Wes McKinney created PARQUET-1856:
-

 Summary: [C++] Test suite assumes that Snappy support is built
 Key: PARQUET-1856
 URL: https://issues.apache.org/jira/browse/PARQUET-1856
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cpp
Reporter: Wes McKinney
 Fix For: cpp-1.6.0


The test suite fails if {{-DARROW_WITH_SNAPPY=OFF}}

{code}
[--] 1 test from TestStatisticsSortOrder/0, where TypeParam = 
parquet::PhysicalType<(parquet::Type::type)1>
[ RUN  ] TestStatisticsSortOrder/0.MinMax
unknown file: Failure
C++ exception with description "NotImplemented: Snappy codec support not built" 
thrown in the test body.
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1820) [C++] Use a column filter hint to inform read prefetching in Arrow reads

2020-05-01 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1820.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 6744
[https://github.com/apache/arrow/pull/6744]

> [C++] Use a column filter hint to inform read prefetching in Arrow reads
> 
>
> Key: PARQUET-1820
> URL: https://issues.apache.org/jira/browse/PARQUET-1820
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 10h 20m
>  Remaining Estimate: 0h
>
> As a follow up to PARQUET-1698 and ARROW-7995, we should use the I/O 
> coalescing facility (where available and enabled), in combination with a 
> column filter hint, to compute and prefetch the exact byte ranges we will be 
> reading (using the metadata). This should further improve performance on 
> remote object stores like Amazon S3. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
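
To see the column filter hint from the reader's side: as I understand the linked
work, projecting columns up front is what gives the Arrow layer enough
information to compute and coalesce the exact byte ranges it will read. A
minimal sketch (the file name and column names are invented; the same call works
against S3-backed filesystems, which is where the coalescing pays off most):

{code}
import pyarrow.parquet as pq

# Reading only two columns lets the reader prefetch just those column chunks'
# byte ranges instead of pulling the whole row group.
table = pq.read_table("data.parquet", columns=["a", "b"])
print(table.schema)
{code}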


[jira] [Assigned] (PARQUET-1820) [C++] Use a column filter hint to inform read prefetching in Arrow reads

2020-05-01 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1820:
-

Assignee: David Li

> [C++] Use a column filter hint to inform read prefetching in Arrow reads
> 
>
> Key: PARQUET-1820
> URL: https://issues.apache.org/jira/browse/PARQUET-1820
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10h 20m
>  Remaining Estimate: 0h
>
> As a follow up to PARQUET-1698 and ARROW-7995, we should use the I/O 
> coalescing facility (where available and enabled), in combination with a 
> column filter hint, to compute and prefetch the exact byte ranges we will be 
> reading (using the metadata). This should further improve performance on 
> remote object stores like Amazon S3. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1820) [C++] Use a column filter hint to inform read prefetching in Arrow reads

2020-05-01 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1820:
--
Summary: [C++] Use a column filter hint to inform read prefetching in Arrow 
reads  (was: [C++] Use a column filter hint to inform read prefetching)

> [C++] Use a column filter hint to inform read prefetching in Arrow reads
> 
>
> Key: PARQUET-1820
> URL: https://issues.apache.org/jira/browse/PARQUET-1820
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: David Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10h 20m
>  Remaining Estimate: 0h
>
> As a follow up to PARQUET-1698 and ARROW-7995, we should use the I/O 
> coalescing facility (where available and enabled), in combination with a 
> column filter hint, to compute and prefetch the exact byte ranges we will be 
> reading (using the metadata). This should further improve performance on 
> remote object stores like Amazon S3. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1404) [C++] Add index pages to the format to support efficient page skipping to parquet-cpp

2020-04-23 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090617#comment-17090617
 ] 

Wes McKinney commented on PARQUET-1404:
---

Do you want to keep the discussion in one place, i.e. on the mailing list?

> [C++] Add index pages to the format to support efficient page skipping to 
> parquet-cpp
> -
>
> Key: PARQUET-1404
> URL: https://issues.apache.org/jira/browse/PARQUET-1404
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Renato Javier Marroquín Mogrovejo
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Once PARQUET-922 is completed we can port such implementation to parquet-cpp 
> as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1327) [C++] Bloom filter read/write implementation

2020-04-23 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1327:
--
Summary: [C++] Bloom filter read/write implementation  (was: [C++]Bloom 
filter read/write implementation)

> [C++] Bloom filter read/write implementation
> 
>
> Key: PARQUET-1327
> URL: https://issues.apache.org/jira/browse/PARQUET-1327
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Junjie Chen
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Filtering GitBox e-mails out of dev@?

2020-04-21 Thread Wes McKinney
hi,

Would someone please take a look at this?

Thanks

On Mon, Apr 20, 2020 at 8:08 AM Wes McKinney  wrote:
>
> Infra made some changes to ensure that GitHub notifications are
> archived, but that has resulted in new e-mails being sent to dev@
>
> In Arrow, we didn't want these so we have
>
> * https://issues.apache.org/jira/browse/INFRA-20149
> * https://issues.apache.org/jira/browse/ARROW-8520
> * Final solution:
> https://github.com/apache/arrow/commit/aa55967e6b9cf6fc8b4d2f6ac9ec75f8c28c80f5
>
> You may want to implement the same thing for apache/parquet-mr
>
> - Wes


Filtering GitBox e-mails out of dev@?

2020-04-20 Thread Wes McKinney
Infra made some changes to ensure that GitHub notifications are
archived, but that has resulted in new e-mails being sent to dev@

In Arrow, we didn't want these so we have

* https://issues.apache.org/jira/browse/INFRA-20149
* https://issues.apache.org/jira/browse/ARROW-8520
* Final solution:
https://github.com/apache/arrow/commit/aa55967e6b9cf6fc8b4d2f6ac9ec75f8c28c80f5

You may want to implement the same thing for apache/parquet-mr

- Wes


[jira] [Updated] (PARQUET-1828) [C++] Add a SSE2 path for the ByteStreamSplit encoder implementation

2020-04-19 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1828:
--
Summary: [C++] Add a SSE2 path for the ByteStreamSplit encoder 
implementation  (was: Add a SSE2 path for the ByteStreamSplit encoder 
implementation)

> [C++] Add a SSE2 path for the ByteStreamSplit encoder implementation
> 
>
> Key: PARQUET-1828
> URL: https://issues.apache.org/jira/browse/PARQUET-1828
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Martin Radev
>Assignee: Martin Radev
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> The encode path for the byte stream split encoding can have better 
> performance if SSE2 intrinsics are used.
> The decode path already uses sse2 intrinsics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1846) [C++] Remove deprecated IO classes and related functions

2020-04-19 Thread Wes McKinney (Jira)
Wes McKinney created PARQUET-1846:
-

 Summary: [C++] Remove deprecated IO classes and related functions
 Key: PARQUET-1846
 URL: https://issues.apache.org/jira/browse/PARQUET-1846
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Wes McKinney
 Fix For: cpp-1.6.0


These were added almost a year ago, so there has been ample time for users to 
migrate



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1835) [C++] Fix crashes on invalid input (OSS-Fuzz)

2020-04-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1835.
---
Resolution: Fixed

Issue resolved by pull request 6848
[https://github.com/apache/arrow/pull/6848]

> [C++] Fix crashes on invalid input (OSS-Fuzz)
> -
>
> Key: PARQUET-1835
> URL: https://issues.apache.org/jira/browse/PARQUET-1835
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Critical
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Fix more issues found by OSS-Fuzz.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1834) Add Apache 2.0 license to README.md files in parquet-testing

2020-04-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1834:
--
Fix Version/s: cpp-1.6.0

> Add Apache 2.0 license to README.md files in parquet-testing
> 
>
> Key: PARQUET-1834
> URL: https://issues.apache.org/jira/browse/PARQUET-1834
> Project: Parquet
>  Issue Type: Task
>Reporter: Maya Anderson
>Assignee: Maya Anderson
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> parquet-testing files can be used for interop tests in parquet-mr. 
> However, if it is added as a submodule, then the 3 README.md files fail the 
> license check and hence fail build of parquet-mr.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1834) Add Apache 2.0 license to README.md files in parquet-testing

2020-04-06 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1834.
---
Resolution: Fixed

Resolved by PR 
https://github.com/apache/parquet-testing/commit/bcd9ebcf9204a346df47204fe21b85c8d0498816

> Add Apache 2.0 license to README.md files in parquet-testing
> 
>
> Key: PARQUET-1834
> URL: https://issues.apache.org/jira/browse/PARQUET-1834
> Project: Parquet
>  Issue Type: Task
>Reporter: Maya Anderson
>Assignee: Maya Anderson
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> parquet-testing files can be used for interop tests in parquet-mr. 
> However, if it is added as a submodule, then the 3 README.md files fail the 
> license check and hence fail build of parquet-mr.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Arrow 1404: Adding index for Page-level Skipping

2020-04-02 Thread Wes McKinney
I just left comments on the PR. The new APIs (their semantics and what
should be passed as arguments) are still not adequately documented (in
other words, I wouldn't know how to use them just from reading the
header file), so I think we should focus on that for the moment. In
fairness, documentation for other functions in these headers is poor,
but they also have the semantics of "read all data in the file from
start to finish". These new APIs appear to do something different, so
we need to write that down in detail in Doxygen-style comments

On Thu, Apr 2, 2020 at 2:23 AM Lekshmi Narayanan, Arun Balajiee
 wrote:
>
> Hi
> Would my pull request be useful for the discussion from here?
> https://github.com/apache/arrow/pull/6807
>
> Regards,
> Arun Balajiee
>
> From: Wes McKinney<mailto:wesmck...@gmail.com>
> Sent: Tuesday, February 18, 2020 3:34 AM
> To: Parquet Dev<mailto:dev@parquet.apache.org>
> Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> Shein<mailto:sh...@microfocus.com>
> Subject: Re: Arrow 1404: Adding index for Page-level Skipping
>
> That's helpful, but I think it would be a good idea to have enough
> information in the header files to determine what the new APIs do
> without reading example code.
>
> On Mon, Feb 17, 2020 at 10:59 AM Lekshmi Narayanan, Arun Balajiee
>  wrote:
> >
> > I also made changes in the low-level-api folder, couldn’t capture in that 
> > link I think
> > https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fa2un%2Farrow%2Fblob%2FPARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp%2Fcpp%2Fexamples%2Fparquet%2Flow-level-api%2Freader-writer-with-index.cc&data=02%7C01%7CARL122%40pitt.edu%7C9ce829844ee2476da66b08d7b44d598f%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C637176116627309524&sdata=T%2Fo7CdxHvvN11Eox9JR6mKAWx75s1aGJUqONVBjVK08%3D&reserved=0
> >
> > Regards,
> > Arun Balajiee
> >
> > 
> > From: Wes McKinney 
> > Sent: Monday, February 17, 2020 8:11:09 AM
> > To: Parquet Dev 
> > Cc: Deepak Majeti ; Anatoli Shein 
> > 
> > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> >
> > hi Arun,
> >
> > By "public APIs" I was referring to changes in the public header
> > files. I see there are some changes to parquet/file_reader.h and
> > metadata.h
> >
> > https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Farrow%2Fcompare%2Fmaster...a2un%3APARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp&data=02%7C01%7CARL122%40pitt.edu%7C9ce829844ee2476da66b08d7b44d598f%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C637176116627309524&sdata=lBWBkrHBuqWjCzQ5t5JLUAw6NfIHbVFGC990L%2BDjGoA%3D&reserved=0
> >
> > Can you add some Doxygen comments to the new APIs that explain how
> > these APIs are to be used (and what the parameters mean)? The hope
> > would be that a user could make use of the column index functionality
> > by reading the .h files only.
> >
> > Thanks
> > Wes
> >
> > On Fri, Feb 14, 2020 at 2:57 PM Lekshmi Narayanan, Arun Balajiee
> >  wrote:
> > >
> > > Hi
> > > I have made my changes for api here, does it look good and is this what 
> > > you were seeking from me? The writer- api is still in the works and I 
> > > need to make the reader more generic to support all class data types.
> > >
> > > https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fa2un%2Farrow%2Fblob%2FPARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp%2Fcpp%2Fexamples%2Fparquet%2Flow-level-api%2Freader-writer-with-index.cc&data=02%7C01%7CARL122%40pitt.edu%7C9ce829844ee2476da66b08d7b44d598f%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C637176116627309524&sdata=T%2Fo7CdxHvvN11Eox9JR6mKAWx75s1aGJUqONVBjVK08%3D&reserved=0
> > >
> > >
> > > Regards,
> > > Arun Balajiee
> > >
> > > From: Wes McKinney<mailto:wesmck...@gmail.com>
> > > Sent: Tuesday, February 4, 2020 11:24 PM
> > > To: Parquet Dev<mailto:dev@parquet.apache.org>
> > > Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> > > Shein<mailto:sh...@microfocus.com>
> > > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> > >
> > > hi Arun,
> > >
> > > We can keep the discussion going on here and on GitHub when you have a
> > > pull request to discuss. There are a number of different people who
> > > can give advice.

[jira] [Resolved] (PARQUET-1829) [C++] Fix crashes on invalid input (OSS-Fuzz)

2020-03-26 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1829.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 6728
[https://github.com/apache/arrow/pull/6728]

> [C++] Fix crashes on invalid input (OSS-Fuzz)
> -
>
> Key: PARQUET-1829
> URL: https://issues.apache.org/jira/browse/PARQUET-1829
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> There are remaining issues open in OSS-Fuzz. We should fix most of them 
> (except some out-of-memory conditions which may not easily be fixable).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-458) [C++] Implement support for DataPageV2

2020-03-25 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-458.
--
Resolution: Fixed

Issue resolved by pull request 6481
[https://github.com/apache/arrow/pull/6481]

> [C++] Implement support for DataPageV2
> --
>
> Key: PARQUET-458
> URL: https://issues.apache.org/jira/browse/PARQUET-458
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>    Reporter: Wes McKinney
>Assignee: Hatem Helal
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1786) [C++] Use simd to improve BYTE_STREAM_SPLIT decoding performance

2020-03-24 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066085#comment-17066085
 ] 

Wes McKinney commented on PARQUET-1786:
---

Please leave resolved issues in "Resolved" state otherwise they will not show 
up in changelogs

> [C++] Use simd to improve BYTE_STREAM_SPLIT decoding performance
> 
>
> Key: PARQUET-1786
> URL: https://issues.apache.org/jira/browse/PARQUET-1786
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Martin Radev
>Assignee: Martin Radev
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 7h 10m
>  Remaining Estimate: 0h
>
> BYTE_STREAM_SPLIT essentially does a scatter/gather operation in the 
> encode/decoder paths. Unfortunately, it is not as fast as memcpy when the 
> data is cached. That can be improved through using simd intrinsics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1823) [C++] Invalid RowGroup returned when reading with parquet::arrow::FileReader->RowGroup(i)->Column(j)

2020-03-20 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1823.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 6674
[https://github.com/apache/arrow/pull/6674]

> [C++] Invalid RowGroup returned when reading with 
> parquet::arrow::FileReader->RowGroup(i)->Column(j)
> 
>
> Key: PARQUET-1823
> URL: https://issues.apache.org/jira/browse/PARQUET-1823
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Critical
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Originally reported as ARROW-8138



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1819) [C++] Fix crashes on corrupt IPC input (OSS-Fuzz)

2020-03-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1819.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 6659
[https://github.com/apache/arrow/pull/6659]

> [C++] Fix crashes on corrupt IPC input (OSS-Fuzz)
> -
>
> Key: PARQUET-1819
> URL: https://issues.apache.org/jira/browse/PARQUET-1819
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1814) [C++] TestInt96ParquetIO failure on Windows

2020-03-13 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1814:
--
Fix Version/s: cpp-1.6.0

> [C++] TestInt96ParquetIO failure on Windows
> ---
>
> Key: PARQUET-1814
> URL: https://issues.apache.org/jira/browse/PARQUET-1814
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> {code}
> [ RUN  ] TestInt96ParquetIO.ReadIntoTimestamp
> C:/t/arrow/cpp/src/arrow/testing/gtest_util.cc(77): error: Failed
> @@ -0, +0 @@
> -1970-01-01 00:00:00.145738543
> +1970-01-02 11:35:00.145738543
> C:/t/arrow/cpp/src/parquet/arrow/arrow_reader_writer_test.cc(1034): error: 
> Expected: this->ReadAndCheckSingleColumnFile(*values) doesn't generate new 
> fatal failures in the current thread.
>   Actual: it does.
> [  FAILED  ] TestInt96ParquetIO.ReadIntoTimestamp (47 ms)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1813) [C++] Remove logging statement in unit test

2020-03-12 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1813:
--
Summary: [C++] Remove logging statement in unit test  (was: [C++] Weird 
error output in tests)

> [C++] Remove logging statement in unit test
> ---
>
> Key: PARQUET-1813
> URL: https://issues.apache.org/jira/browse/PARQUET-1813
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Assignee: Wes McKinney
>Priority: Major
>
> It doesn't appear to fail the test, but I still get this weird output on 
> Windows:
> {code}
> [ RUN  ] TestConvertArrowSchema.ParquetMaps
> C:/t/arrow/cpp/src/parquet/arrow/arrow_schema_test.cc:989: my_map: 
> map not null
> [   OK ] TestConvertArrowSchema.ParquetMaps (0 ms)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1813) [C++] Weird error output in tests

2020-03-12 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1813:
-

Assignee: Wes McKinney

> [C++] Weird error output in tests
> -
>
> Key: PARQUET-1813
> URL: https://issues.apache.org/jira/browse/PARQUET-1813
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Assignee: Wes McKinney
>Priority: Major
>
> It doesn't appear to fail the test, but I still get this weird output on 
> Windows:
> {code}
> [ RUN  ] TestConvertArrowSchema.ParquetMaps
> C:/t/arrow/cpp/src/parquet/arrow/arrow_schema_test.cc:989: my_map: 
> map not null
> [   OK ] TestConvertArrowSchema.ParquetMaps (0 ms)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1813) [C++] Weird error output in tests

2020-03-12 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17058286#comment-17058286
 ] 

Wes McKinney commented on PARQUET-1813:
---

I missed the debug output in my code review 
https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/arrow_schema_test.cc#L989.
 Will fix

> [C++] Weird error output in tests
> -
>
> Key: PARQUET-1813
> URL: https://issues.apache.org/jira/browse/PARQUET-1813
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Priority: Major
>
> It doesn't appear to fail the test, but I still get this weird output on 
> Windows:
> {code}
> [ RUN  ] TestConvertArrowSchema.ParquetMaps
> C:/t/arrow/cpp/src/parquet/arrow/arrow_schema_test.cc:989: my_map: 
> map not null
> [   OK ] TestConvertArrowSchema.ParquetMaps (0 ms)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1663) [C++] Provide API to check the presence of complex data types

2020-03-10 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1663.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 5490
[https://github.com/apache/arrow/pull/5490]

> [C++] Provide API to check the presence of complex data types
> -
>
> Key: PARQUET-1663
> URL: https://issues.apache.org/jira/browse/PARQUET-1663
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Zherui Cao
>Assignee: Zherui Cao
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> we need functions like
> hasMapType()
> hasArrayType()



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [C++] Adding RLEEncoder/Decoder to parquet API column writer/reader API

2020-03-09 Thread Wes McKinney
hi Micah,

This sounds like a good idea to me. Not sure why we didn't think of it
before, but better to do this late than never.

- Wes

On Sat, Mar 7, 2020 at 4:54 PM Micah Kornfield  wrote:
>
> The current API for writing repetition and definition levels takes arrays
> of int16_t values for the levels.  This seems inefficient when there are
> "runs" of the same level.  When there are runs, writers that don't already
> have the data in array form, need to explode it out to an array which
> ultimately gets run-length encoded again via RLEEncoder [1]. A similar
> inefficiency exists when reading.
>
> I was wondering if the community has considered adding APIs to the
> column_writer/reader [2][3] that expose the encoder/decoder directly or a
> facade around them?
>
> At least for encoding it seems like an extra wrapper for RLEEncoder would
> be needed since the required memory size wouldn't be known up front.
>
> Thanks,
> Micah
>
> [1]
> https://github.com/apache/arrow/blob/8f61c7997e55612940bff4be7b043f0ee61bf238/cpp/src/arrow/util/rle_encoding.h
> [2]
> https://github.com/apache/arrow/blob/8f61c7997e55612940bff4be7b043f0ee61bf238/cpp/src/parquet/column_writer.h#L158
> [3]
> https://github.com/apache/arrow/blob/8f61c7997e55612940bff4be7b043f0ee61bf238/cpp/src/parquet/column_reader.h#L151
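
To make the inefficiency described above concrete, here is a minimal sketch in
Python/NumPy (illustrative only, not parquet-cpp code; the run layout is
invented for the example): a writer that already knows its definition levels as
runs has to flatten them into an int16 array for a WriteBatch-style call, and
the encoder then run-length encodes that same array right back into runs.

{code:python}
import numpy as np

# Definition levels known to the writer as (level, run_length) pairs,
# e.g. long runs of non-null values interrupted by a few nulls.
runs = [(1, 100_000), (0, 3), (1, 250_000)]

# Current API shape: explode the runs into a flat int16 array ...
def_levels = np.concatenate(
    [np.full(length, level, dtype=np.int16) for level, length in runs]
)

# ... and hand all 350,003 values to the column writer, which immediately
# RLE-encodes them back into (level, run_length) form. An API accepting the
# runs (or an encoder facade) would skip the intermediate array entirely.
print(def_levels.shape, def_levels.dtype)
{code}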


[jira] [Commented] (PARQUET-1300) [C++] Parquet modular encryption

2020-03-06 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053455#comment-17053455
 ] 

Wes McKinney commented on PARQUET-1300:
---

Anyone interested in looking at packaging issues for encryption? I don't think 
it's being shipped in Arrow packages yet

> [C++] Parquet modular encryption
> 
>
> Key: PARQUET-1300
> URL: https://issues.apache.org/jira/browse/PARQUET-1300
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Gidon Gershinsky
>Assignee: Deepak Majeti
>Priority: Major
>  Labels: pull-request-available
> Attachments: column_reader.cc, column_writer.cc, file_reader.cc, 
> file_writer.cc, thrift.h
>
>  Time Spent: 34h
>  Remaining Estimate: 0h
>
> CPP version of a mechanism for modular encryption and decryption of Parquet 
> files. Allows to keep the data fully encrypted in the storage, while enabling 
> a client to extract a required subset (footer, column(s), pages) and to 
> authenticate / decrypt the extracted data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1810) [C++] Fix undefined behaviour on invalid enum values (OSS-Fuzz)

2020-03-05 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1810.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 6537
[https://github.com/apache/arrow/pull/6537]

> [C++] Fix undefined behaviour on invalid enum values (OSS-Fuzz)
> ---
>
> Key: PARQUET-1810
> URL: https://issues.apache.org/jira/browse/PARQUET-1810
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1780) [C++] Set ColumnMetadata.encoding_stats field

2020-03-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1780.
---
Resolution: Fixed

Issue resolved by pull request 6370
[https://github.com/apache/arrow/pull/6370]

> [C++] Set ColumnMetadata.encoding_stats field
> -
>
> Key: PARQUET-1780
> URL: https://issues.apache.org/jira/browse/PARQUET-1780
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>    Reporter: Wes McKinney
>Assignee: Gamage Omega Ishendra
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> This metadata field is not set in the C++ library. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1780) [C++] Set ColumnMetadata.encoding_stats field

2020-03-02 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1780:
-

Assignee: Gamage Omega Ishendra

> [C++] Set ColumnMetadata.encoding_stats field
> -
>
> Key: PARQUET-1780
> URL: https://issues.apache.org/jira/browse/PARQUET-1780
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>    Reporter: Wes McKinney
>Assignee: Gamage Omega Ishendra
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> This metadata field is not set in the C++ library. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Arrow 1404: Adding index for Page-level Skipping

2020-02-26 Thread Wes McKinney
I don't think so.

On Mon, Feb 24, 2020 at 5:14 PM Lekshmi Narayanan, Arun Balajiee
 wrote:
>
> Will using a DataPage V2 or DataPage V1 cause any difference for this ticket?
>
> Regards,
> Arun Balajiee
>
> ________
> From: Wes McKinney 
> Sent: Friday, February 21, 2020 3:06:58 AM
> To: Parquet Dev 
> Cc: Deepak Majeti ; Anatoli Shein 
> 
> Subject: Re: Arrow 1404: Adding index for Page-level Skipping
>
> The data page statistics aren't currently being used during the "scan
> to Arrow" procedure. That's likely to change at some point since the
> Arrow Datasets project will provide a higher level API to indicate
> filter predicates
>
> On Thu, Feb 20, 2020 at 3:25 PM Lekshmi Narayanan, Arun Balajiee
>  wrote:
> >
> > Thanks Wes. I got it now. I am working on that. But I have a general 
> > question though, were page indices  which store min/max values implemented 
> > in arrow parquet ( not referring to column indices or offset indices, just 
> > page indices)
> >
> > Regards,
> > Arun Balajiee
> >
> > From: Wes McKinney<mailto:wesmck...@gmail.com>
> > Sent: Tuesday, February 18, 2020 3:34 AM
> > To: Parquet Dev<mailto:dev@parquet.apache.org>
> > Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> > Shein<mailto:sh...@microfocus.com>
> > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> >
> > That's helpful, but I think it would be a good idea to have enough
> > information in the header files to determine what the new APIs do
> > without reading example code.
> >
> > On Mon, Feb 17, 2020 at 10:59 AM Lekshmi Narayanan, Arun Balajiee
> >  wrote:
> > >
> > > I also made changes in the low-level-api folder, couldn’t capture in that 
> > > link I think
> > > https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fa2un%2Farrow%2Fblob%2FPARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp%2Fcpp%2Fexamples%2Fparquet%2Flow-level-api%2Freader-writer-with-index.cc&data=02%7C01%7CARL122%40pitt.edu%7C6925e0cbd68348c5df6f08d7b6a520a3%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C637178692661797398&sdata=slrrTS3YTiloexbzqsZ6GTy72Ok%2FimFBb%2F8%2Fl2fNDlM%3D&reserved=0
> > >
> > > Regards,
> > > Arun Balajiee
> > >
> > > 
> > > From: Wes McKinney 
> > > Sent: Monday, February 17, 2020 8:11:09 AM
> > > To: Parquet Dev 
> > > Cc: Deepak Majeti ; Anatoli Shein 
> > > 
> > > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> > >
> > > hi Arun,
> > >
> > > By "public APIs" I was referring to changes in the public header
> > > files. I see there are some changes to parquet/file_reader.h and
> > > metadata.h
> > >
> > > https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Farrow%2Fcompare%2Fmaster...a2un%3APARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp&data=02%7C01%7CARL122%40pitt.edu%7C6925e0cbd68348c5df6f08d7b6a520a3%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C637178692661802376&sdata=DmAizgy3EKwENlRFBfxgvNAXE2Pq%2FctKlZaymn5dUxY%3D&reserved=0
> > >
> > > Can you add some Doxygen comments to the new APIs that explain how
> > > these APIs are to be used (and what the parameters mean)? The hope
> > > would be that a user could make use of the column index functionality
> > > by reading the .h files only.
> > >
> > > Thanks
> > > Wes
> > >
> > > On Fri, Feb 14, 2020 at 2:57 PM Lekshmi Narayanan, Arun Balajiee
> > >  wrote:
> > > >
> > > > Hi
> > > > I have made my changes for api here, does it look good and is this what 
> > > > you were seeking from me? The writer- api is still in the works and I 
> > > > need to make the reader more generic to support all class data types.
> > > >
> > > > https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fa2un%2Farrow%2Fblob%2FPARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp%2Fcpp%2Fexamples%2Fparquet%2Flow-level-api%2Freader-writer-with-index.cc&data=02%7C01%7CARL122%40pitt.edu%7C6925e0cbd68348c5df6f08d7b6a520a3%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C637178692661802376&sdata=hqOXB0h%2FI%2FhLgD6FDFFjw2RH4xAKxqWTPjM7rMJ8llw%3D&reserved=0
> > > >
> > > >
> > >

Re: Arrow 1404: Adding index for Page-level Skipping

2020-02-21 Thread Wes McKinney
The data page statistics aren't currently being used during the "scan
to Arrow" procedure. That's likely to change at some point since the
Arrow Datasets project will provide a higher level API to indicate
filter predicates

On Thu, Feb 20, 2020 at 3:25 PM Lekshmi Narayanan, Arun Balajiee
 wrote:
>
> Thanks Wes. I got it now. I am working on that. But I have a general question 
> though, were page indices  which store min/max values implemented in arrow 
> parquet ( not referring to column indices or offset indices, just page 
> indices)
>
> Regards,
> Arun Balajiee
>
> From: Wes McKinney<mailto:wesmck...@gmail.com>
> Sent: Tuesday, February 18, 2020 3:34 AM
> To: Parquet Dev<mailto:dev@parquet.apache.org>
> Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> Shein<mailto:sh...@microfocus.com>
> Subject: Re: Arrow 1404: Adding index for Page-level Skipping
>
> That's helpful, but I think it would be a good idea to have enough
> information in the header files to determine what the new APIs do
> without reading example code.
>
> On Mon, Feb 17, 2020 at 10:59 AM Lekshmi Narayanan, Arun Balajiee
>  wrote:
> >
> > I also made changes in the low-level-api folder, couldn’t capture in that 
> > link I think
> > https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fa2un%2Farrow%2Fblob%2FPARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp%2Fcpp%2Fexamples%2Fparquet%2Flow-level-api%2Freader-writer-with-index.cc&data=02%7C01%7CARL122%40pitt.edu%7C9ce829844ee2476da66b08d7b44d598f%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C637176116627309524&sdata=T%2Fo7CdxHvvN11Eox9JR6mKAWx75s1aGJUqONVBjVK08%3D&reserved=0
> >
> > Regards,
> > Arun Balajiee
> >
> > 
> > From: Wes McKinney 
> > Sent: Monday, February 17, 2020 8:11:09 AM
> > To: Parquet Dev 
> > Cc: Deepak Majeti ; Anatoli Shein 
> > 
> > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> >
> > hi Arun,
> >
> > By "public APIs" I was referring to changes in the public header
> > files. I see there are some changes to parquet/file_reader.h and
> > metadata.h
> >
> > https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Farrow%2Fcompare%2Fmaster...a2un%3APARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp&data=02%7C01%7CARL122%40pitt.edu%7C9ce829844ee2476da66b08d7b44d598f%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C637176116627309524&sdata=lBWBkrHBuqWjCzQ5t5JLUAw6NfIHbVFGC990L%2BDjGoA%3D&reserved=0
> >
> > Can you add some Doxygen comments to the new APIs that explain how
> > these APIs are to be used (and what the parameters mean)? The hope
> > would be that a user could make use of the column index functionality
> > by reading the .h files only.
> >
> > Thanks
> > Wes
> >
> > On Fri, Feb 14, 2020 at 2:57 PM Lekshmi Narayanan, Arun Balajiee
> >  wrote:
> > >
> > > Hi
> > > I have made my changes for api here, does it look good and is this what 
> > > you were seeking from me? The writer- api is still in the works and I 
> > > need to make the reader more generic to support all class data types.
> > >
> > > https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fa2un%2Farrow%2Fblob%2FPARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp%2Fcpp%2Fexamples%2Fparquet%2Flow-level-api%2Freader-writer-with-index.cc&data=02%7C01%7CARL122%40pitt.edu%7C9ce829844ee2476da66b08d7b44d598f%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C637176116627309524&sdata=T%2Fo7CdxHvvN11Eox9JR6mKAWx75s1aGJUqONVBjVK08%3D&reserved=0
> > >
> > >
> > > Regards,
> > > Arun Balajiee
> > >
> > > From: Wes McKinney<mailto:wesmck...@gmail.com>
> > > Sent: Tuesday, February 4, 2020 11:24 PM
> > > To: Parquet Dev<mailto:dev@parquet.apache.org>
> > > Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> > > Shein<mailto:sh...@microfocus.com>
> > > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> > >
> > > hi Arun,
> > >
> > > We can keep the discussion going on here and on GitHub when you have a
> > > pull request to discuss. There are a number of different people who
> > > can give advice.
> > >
> > > Thanks
> > >
> > > On Tue, Feb 4, 2020 at 10:11 PM Lekshmi Narayanan, Arun Balajiee
> >

[jira] [Updated] (PARQUET-1797) [C++] Fix fuzzing errors

2020-02-18 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1797:
--
Summary: [C++] Fix fuzzing errors  (was: Fix fuzzing errors)

> [C++] Fix fuzzing errors
> 
>
> Key: PARQUET-1797
> URL: https://issues.apache.org/jira/browse/PARQUET-1797
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Francois Saint-Jacques
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Arrow 1404: Adding index for Page-level Skipping

2020-02-18 Thread Wes McKinney
That's helpful, but I think it would be a good idea to have enough
information in the header files to determine what the new APIs do
without reading example code.

On Mon, Feb 17, 2020 at 10:59 AM Lekshmi Narayanan, Arun Balajiee
 wrote:
>
> I also made changes in the low-level-api folder, couldn’t capture in that 
> link I think
> https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc
>
> Regards,
> Arun Balajiee
>
> ________
> From: Wes McKinney 
> Sent: Monday, February 17, 2020 8:11:09 AM
> To: Parquet Dev 
> Cc: Deepak Majeti ; Anatoli Shein 
> 
> Subject: Re: Arrow 1404: Adding index for Page-level Skipping
>
> hi Arun,
>
> By "public APIs" I was referring to changes in the public header
> files. I see there are some changes to parquet/file_reader.h and
> metadata.h
>
> https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Farrow%2Fcompare%2Fmaster...a2un%3APARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp&data=02%7C01%7CARL122%40pitt.edu%7C22c38deb3167458e1a7108d7b3aaf442%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C637175419145204988&sdata=DiHACPq1Ovrn0J3xOHSmLxfm6Akka%2B%2FgMt8tWglSCfs%3D&reserved=0
>
> Can you add some Doxygen comments to the new APIs that explain how
> these APIs are to be used (and what the parameters mean)? The hope
> would be that a user could make use of the column index functionality
> by reading the .h files only.
>
> Thanks
> Wes
>
> On Fri, Feb 14, 2020 at 2:57 PM Lekshmi Narayanan, Arun Balajiee
>  wrote:
> >
> > Hi
> > I have made my changes for api here, does it look good and is this what you 
> > were seeking from me? The writer- api is still in the works and I need to 
> > make the reader more generic to support all class data types.
> >
> > https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fa2un%2Farrow%2Fblob%2FPARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp%2Fcpp%2Fexamples%2Fparquet%2Flow-level-api%2Freader-writer-with-index.cc&data=02%7C01%7CARL122%40pitt.edu%7C22c38deb3167458e1a7108d7b3aaf442%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C637175419145204988&sdata=ui7ptlMyyUdlKKVdORLvjKCXidQ4yOIQqTqLFIyOVGY%3D&reserved=0
> >
> >
> > Regards,
> > Arun Balajiee
> >
> > From: Wes McKinney<mailto:wesmck...@gmail.com>
> > Sent: Tuesday, February 4, 2020 11:24 PM
> > To: Parquet Dev<mailto:dev@parquet.apache.org>
> > Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> > Shein<mailto:sh...@microfocus.com>
> > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> >
> > hi Arun,
> >
> > We can keep the discussion going on here and on GitHub when you have a
> > pull request to discuss. There are a number of different people who
> > can give advice.
> >
> > Thanks
> >
> > On Tue, Feb 4, 2020 at 10:11 PM Lekshmi Narayanan, Arun Balajiee
> >  wrote:
> > >
> > > Actually I made some changes after the date on the pull request ( even in 
> > > this year), which are not getting reflected on this compare link
> > >
> > > Regards,
> > > Arun Balajiee
> > >
> > > From: Wes McKinney<mailto:wesmck...@gmail.com>
> > > Sent: Tuesday, February 4, 2020 6:43 PM
> > > To: Parquet Dev<mailto:dev@parquet.apache.org>
> > > Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> > > Shein<mailto:sh...@microfocus.com>
> > > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> > >
> > > Here's a compare link in case others want to have a look
> > >
> > > https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Farrow%2Fcompare%2Fmaster...a2un%3APARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp&data=02%7C01%7CARL122%40pitt.edu%7C22c38deb3167458e1a7108d7b3aaf442%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C637175419145204988&sdata=DiHACPq1Ovrn0J3xOHSmLxfm6Akka%2B%2FgMt8tWglSCfs%3D&reserved=0
> > >
> > > On Tue, Feb 4, 2020 at 5:41 PM Wes McKinney  wrote:
> > > >
> > > > hi Arun,
> > > >
> > > > I took a brief look at your branch. One thing that is missing is the
> > > > proposed public APIs that use the index pages -- that would be very
> > > > helpful for this discussion.

Re: Arrow 1404: Adding index for Page-level Skipping

2020-02-17 Thread Wes McKinney
hi Arun,

By "public APIs" I was referring to changes in the public header
files. I see there are some changes to parquet/file_reader.h and
metadata.h

https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp

Can you add some Doxygen comments to the new APIs that explain how
these APIs are to be used (and what the parameters mean)? The hope
would be that a user could make use of the column index functionality
by reading the .h files only.

Thanks
Wes

On Fri, Feb 14, 2020 at 2:57 PM Lekshmi Narayanan, Arun Balajiee
 wrote:
>
> Hi
> I have made my changes for api here, does it look good and is this what you 
> were seeking from me? The writer- api is still in the works and I need to 
> make the reader more generic to support all class data types.
>
> https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc
>
>
> Regards,
> Arun Balajiee
>
> From: Wes McKinney<mailto:wesmck...@gmail.com>
> Sent: Tuesday, February 4, 2020 11:24 PM
> To: Parquet Dev<mailto:dev@parquet.apache.org>
> Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> Shein<mailto:sh...@microfocus.com>
> Subject: Re: Arrow 1404: Adding index for Page-level Skipping
>
> hi Arun,
>
> We can keep the discussion going on here and on GitHub when you have a
> pull request to discuss. There are a number of different people who
> can give advice.
>
> Thanks
>
> On Tue, Feb 4, 2020 at 10:11 PM Lekshmi Narayanan, Arun Balajiee
>  wrote:
> >
> > Actually I made some changes after the date on the pull request ( even in 
> > this year), which are not getting reflected on this compare link
> >
> > Regards,
> > Arun Balajiee
> >
> > From: Wes McKinney<mailto:wesmck...@gmail.com>
> > Sent: Tuesday, February 4, 2020 6:43 PM
> > To: Parquet Dev<mailto:dev@parquet.apache.org>
> > Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> > Shein<mailto:sh...@microfocus.com>
> > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> >
> > Here's a compare link in case others want to have a look
> >
> > https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Farrow%2Fcompare%2Fmaster...a2un%3APARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp&data=02%7C01%7CARL122%40pitt.edu%7C81d483c7190248e9b6d908d7a9f35550%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C637164734890368849&sdata=uN6KpqxuoRrTuhoysKHkN8N9XVF8dMQTa2BfBupVCpE%3D&reserved=0
> >
> > On Tue, Feb 4, 2020 at 5:41 PM Wes McKinney  wrote:
> > >
> > > hi Arun,
> > >
> > > I took a brief look at your branch. One thing that is missing is the
> > > proposed public APIs that use the index pages -- that would be very
> > > helpful for this discussion.
> > >
> > > I don't think we have any code for doing random access of a particular
> > > data page in a column chunk, so having as an initial matter would also
> > > be helpful.
> > >
> > > - Wes
> > >
> > > On Tue, Feb 4, 2020 at 2:28 PM Lekshmi Narayanan, Arun Balajiee
> > >  wrote:
> > > >
> > > > Hi Parquet dev
> > > >
> > > > Deepak Majeti was my dev lead during my summer internship, from when I 
> > > > am trying to add a few changes in the Arrow Parquet Project for the 
> > > > ticket below
> > > >
> > > > https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FPARQUET-1404&data=02%7C01%7CARL122%40pitt.edu%7C81d483c7190248e9b6d908d7a9f35550%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C637164734890368849&sdata=6ae98Gu1roe4pGw5moc8D4nwdKNNJ4HC058Ktdo8%2F8I%3D&reserved=0
> > > >  (Assigned to Deepak)
> > > >
> > > > With this regard, I am making a few changes to 
> > > > src/parquet/file_reader.cc ( in a fork on my repository)
> > > >
> > > > https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fa2un%2Farrow%2Ftree%2FPARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp%2Fcpp&data=02%7C01%7CARL122%40pitt.edu%7C81d483c7190248e9b6d908d7a9f35550%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C637164734890378845&sdata=gefWxwn8DMq7LnCLQZLpWmml%2FeNcy2XvDR2iL%2BfteKw%3D&reserved=0
> > > >
> > > > I am stuck at trying to read a particular row using the index that I 
> > > > get in the page_location array struct of offset index. Could you help 
> > > > me with this ? and if there have been discussions on the forums for 
> > > > this as well, could you direct me to that link?
> > > >
> > > > Regards,
> > > > Arun Balajiee
> > > >
> >
>


[jira] [Created] (PARQUET-1798) [C++] Review logic around automatic assignment of field_id's

2020-02-14 Thread Wes McKinney (Jira)
Wes McKinney created PARQUET-1798:
-

 Summary: [C++] Review logic around automatic assignment of 
field_id's
 Key: PARQUET-1798
 URL: https://issues.apache.org/jira/browse/PARQUET-1798
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Wes McKinney
 Fix For: cpp-1.6.0


At schema deserialization (from Thrift) time, we are assigning a default 
field_id to the Schema node based on a depth-first ordering of nodes. This 
means that a round trip (load, then write) will cause field_id's to be written 
that weren't there before. I'm not sure this is the desired behavior.

We should examine this in more detail and possibly change it. See also 
discussion in ARROW-7080 https://github.com/apache/arrow/pull/6408
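
A conceptual sketch of the current behavior (plain Python over a dict-based
schema tree invented for the example; this is not the parquet-cpp code): every
node without a field_id receives one according to its position in a depth-first
walk, so a schema that arrived without field_id's leaves the round trip with
them.

{code:python}
from itertools import count

def assign_default_field_ids(node, counter=None):
    """Give every node a default field_id following depth-first order."""
    counter = count() if counter is None else counter
    node.setdefault("field_id", next(counter))
    for child in node.get("children", []):
        assign_default_field_ids(child, counter)

schema = {"name": "root", "children": [
    {"name": "a", "children": []},
    {"name": "b", "children": [{"name": "b.item", "children": []}]},
]}
assign_default_field_ids(schema)
# ids 0..3 now exist even though the "file" had none, so writing the schema
# back out persists field_id's that were not there before.
{code}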



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1788) [C++] ColumnWriter has undefined behavior when writing arrow chunks

2020-02-10 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1788.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 6378
[https://github.com/apache/arrow/pull/6378]

> [C++] ColumnWriter has undefined behavior when writing arrow chunks
> ---
>
> Key: PARQUET-1788
> URL: https://issues.apache.org/jira/browse/PARQUET-1788
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> We blindly add offset to def_level and rep_level inside chunking callbacks; 
> when these are nullptrs (I believe this occurs if the schema is flat) we 
> still apply the offset, which triggers UBSan.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Arrow 1404: Adding index for Page-level Skipping

2020-02-04 Thread Wes McKinney
hi Arun,

We can keep the discussion going on here and on GitHub when you have a
pull request to discuss. There are a number of different people who
can give advice.

Thanks

On Tue, Feb 4, 2020 at 10:11 PM Lekshmi Narayanan, Arun Balajiee
 wrote:
>
> Actually I made some changes after the date on the pull request ( even in 
> this year), which are not getting reflected on this compare link
>
> Regards,
> Arun Balajiee
>
> From: Wes McKinney<mailto:wesmck...@gmail.com>
> Sent: Tuesday, February 4, 2020 6:43 PM
> To: Parquet Dev<mailto:dev@parquet.apache.org>
> Cc: Deepak Majeti<mailto:deepak.maj...@microfocus.com>; Anatoli 
> Shein<mailto:sh...@microfocus.com>
> Subject: Re: Arrow 1404: Adding index for Page-level Skipping
>
> Here's a compare link in case others want to have a look
>
> https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Farrow%2Fcompare%2Fmaster...a2un%3APARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp&data=02%7C01%7CARL122%40pitt.edu%7Cae7f0408b49c4ab408d7a9cbfbd5%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C637164565879592140&sdata=uGV8GSSL1e9CmaxKfkkStdcgQHf0RxLizO72NRKRrrg%3D&reserved=0
>
> On Tue, Feb 4, 2020 at 5:41 PM Wes McKinney  wrote:
> >
> > hi Arun,
> >
> > I took a brief look at your branch. One thing that is missing is the
> > proposed public APIs that use the index pages -- that would be very
> > helpful for this discussion.
> >
> > I don't think we have any code for doing random access of a particular
> > data page in a column chunk, so having as an initial matter would also
> > be helpful.
> >
> > - Wes
> >
> > On Tue, Feb 4, 2020 at 2:28 PM Lekshmi Narayanan, Arun Balajiee
> >  wrote:
> > >
> > > Hi Parquet dev
> > >
> > > Deepak Majeti was my dev lead during my summer internship, from when I am 
> > > trying to add a few changes in the Arrow Parquet Project for the ticket 
> > > below
> > >
> > > https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FPARQUET-1404&data=02%7C01%7CARL122%40pitt.edu%7Cae7f0408b49c4ab408d7a9cbfbd5%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C637164565879592140&sdata=aGvdRxYzQdWAo%2FC8ADw6Br5WDMxiVaeBXO7QuSYK8TU%3D&reserved=0
> > >  (Assigned to Deepak)
> > >
> > > With this regard, I am making a few changes to src/parquet/file_reader.cc 
> > > ( in a fork on my repository)
> > >
> > > https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fa2un%2Farrow%2Ftree%2FPARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp%2Fcpp&data=02%7C01%7CARL122%40pitt.edu%7Cae7f0408b49c4ab408d7a9cbfbd5%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C637164565879592140&sdata=cNkK9cL7v6bqI6%2FM50SyLDs%2BPQ0IVmYvvc9MnYD9WgA%3D&reserved=0
> > >
> > > I am stuck at trying to read a particular row using the index that I get 
> > > in the page_location array struct of offset index. Could you help me with 
> > > this ? and if there have been discussions on the forums for this as well, 
> > > could you direct me to that link?
> > >
> > > Regards,
> > > Arun Balajiee
> > >
>


[jira] [Resolved] (PARQUET-1716) [C++] Add support for BYTE_STREAM_SPLIT encoding

2020-02-04 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1716.
---
Fix Version/s: cpp-1.6.0
   Resolution: Fixed

Issue resolved by pull request 6005
[https://github.com/apache/arrow/pull/6005]

> [C++] Add support for BYTE_STREAM_SPLIT encoding
> 
>
> Key: PARQUET-1716
> URL: https://issues.apache.org/jira/browse/PARQUET-1716
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Martin Radev
>Assignee: Martin Radev
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>   Original Estimate: 72h
>  Time Spent: 14h
>  Remaining Estimate: 58h
>
> *From the Parquet issue ( https://issues.apache.org/jira/browse/PARQUET-1622 
> ):*
> Apache Parquet does not have any encodings suitable for FP data and the 
> available text compressors (zstd, gzip, etc) do not handle FP data very well.
> It is possible to apply a simple data transformation named "stream 
> splitting". Such could be "byte stream splitting" which creates K streams of 
> length N where K is the number of bytes in the data type (4 for floats, 8 for 
> doubles) and N is the number of elements in the sequence.
> The transformed data compresses significantly better on average than the 
> original data and for some cases there is a performance improvement in 
> compression and decompression speed.
> You can read a more detailed report here:
>  [https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
> *Apache Arrow can benefit from the reduced requirements for storing FP 
> parquet column data and improvements in decompression speed.*
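
For readers following the thread, a small illustration of the transformation
described above (a Python/NumPy sketch, not the parquet-cpp implementation):
the bytes of each float are scattered into K separate streams, which groups
similar bytes (signs, exponents) together and tends to help the downstream
compressor.

{code:python}
import numpy as np

values = np.array([1.5, 2.25, -3.0, 1e-6], dtype=np.float32)  # K = 4 bytes each

# Encode: view the N*K raw bytes as an (N, K) matrix and transpose it, giving
# K streams of length N (stream 0 holds the first byte of every value, etc.).
raw = np.frombuffer(values.tobytes(), dtype=np.uint8).reshape(-1, 4)
streams = raw.T.copy()

# Decode: transpose back and reinterpret the bytes as float32.
decoded = np.frombuffer(streams.T.tobytes(), dtype=np.float32)
assert np.array_equal(decoded, values)
{code}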



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1716) [C++] Add support for BYTE_STREAM_SPLIT encoding

2020-02-04 Thread Wes McKinney (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1716:
-

Assignee: Martin Radev

> [C++] Add support for BYTE_STREAM_SPLIT encoding
> 
>
> Key: PARQUET-1716
> URL: https://issues.apache.org/jira/browse/PARQUET-1716
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Martin Radev
>Assignee: Martin Radev
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 72h
>  Time Spent: 13h 50m
>  Remaining Estimate: 58h 10m
>
> *From the Parquet issue ( https://issues.apache.org/jira/browse/PARQUET-1622 
> ):*
> Apache Parquet does not have any encodings suitable for FP data and the 
> available text compressors (zstd, gzip, etc) do not handle FP data very well.
> It is possible to apply a simple data transformation named "stream 
> splitting". Such could be "byte stream splitting" which creates K streams of 
> length N where K is the number of bytes in the data type (4 for floats, 8 for 
> doubles) and N is the number of elements in the sequence.
> The transformed data compresses significantly better on average than the 
> original data and for some cases there is a performance improvement in 
> compression and decompression speed.
> You can read a more detailed report here:
>  [https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
> *Apache Arrow can benefit from the reduced requirements for storing FP 
> parquet column data and improvements in decompression speed.*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Arrow 1404: Adding index for Page-level Skipping

2020-02-04 Thread Wes McKinney
hi Arun,

I took a brief look at your branch. One thing that is missing is the
proposed public APIs that use the index pages -- that would be very
helpful for this discussion.

I don't think we have any code for doing random access of a particular
data page in a column chunk, so having that as an initial matter would also
be helpful.

- Wes

On Tue, Feb 4, 2020 at 2:28 PM Lekshmi Narayanan, Arun Balajiee
 wrote:
>
> Hi Parquet dev
>
> Deepak Majeti was my dev lead during my summer internship, from when I am 
> trying to add a few changes in the Arrow Parquet Project for the ticket below
>
> https://issues.apache.org/jira/browse/PARQUET-1404 (Assigned to Deepak)
>
> With this regard, I am making a few changes to src/parquet/file_reader.cc ( 
> in a fork on my repository)
>
> https://github.com/a2un/arrow/tree/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp
>
> I am stuck at trying to read a particular row using the index that I get in 
> the page_location array struct of offset index. Could you help me with this ? 
> and if there have been discussions on the forums for this as well, could you 
> direct me to that link?
>
> Regards,
> Arun Balajiee
>
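
For the row-lookup question above: the OffsetIndex in parquet-format stores one
PageLocation (offset, compressed_page_size, first_row_index) per data page, so
locating the page that contains a given row within a row group is a binary
search over first_row_index. A rough sketch in plain Python over already-parsed
values (names mirror the Thrift structs, but this is not parquet-cpp code):

{code:python}
from bisect import bisect_right
from collections import namedtuple

# Mirrors parquet-format's PageLocation struct, assumed already read from Thrift.
PageLocation = namedtuple(
    "PageLocation", "offset compressed_page_size first_row_index")

def page_for_row(page_locations, row_in_row_group):
    """Return the PageLocation of the data page containing the given row."""
    firsts = [p.first_row_index for p in page_locations]
    i = bisect_right(firsts, row_in_row_group) - 1
    if i < 0:
        raise IndexError("row index precedes the first page")
    return page_locations[i]

locations = [
    PageLocation(offset=4, compressed_page_size=1200, first_row_index=0),
    PageLocation(offset=1204, compressed_page_size=1100, first_row_index=5000),
    PageLocation(offset=2304, compressed_page_size=900, first_row_index=10000),
]
print(page_for_row(locations, 7321))  # -> the page whose first_row_index is 5000
{code}

The returned entry's offset and compressed_page_size then say exactly which
bytes of the column chunk to fetch, which is the random-access page read that,
as noted above, is not yet available in parquet-cpp.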


Re: Arrow 1404: Adding index for Page-level Skipping

2020-02-04 Thread Wes McKinney
Here's a compare link in case others want to have a look

https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp

On Tue, Feb 4, 2020 at 5:41 PM Wes McKinney  wrote:
>
> hi Arun,
>
> I took a brief look at your branch. One thing that is missing is the
> proposed public APIs that use the index pages -- that would be very
> helpful for this discussion.
>
> I don't think we have any code for doing random access of a particular
> data page in a column chunk, so having as an initial matter would also
> be helpful.
>
> - Wes
>
> On Tue, Feb 4, 2020 at 2:28 PM Lekshmi Narayanan, Arun Balajiee
>  wrote:
> >
> > Hi Parquet dev
> >
> > Deepak Majeti was my dev lead during my summer internship, from when I am 
> > trying to add a few changes in the Arrow Parquet Project for the ticket 
> > below
> >
> > https://issues.apache.org/jira/browse/PARQUET-1404 (Assigned to Deepak)
> >
> > With this regard, I am making a few changes to src/parquet/file_reader.cc ( 
> > in a fork on my repository)
> >
> > https://github.com/a2un/arrow/tree/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp
> >
> > I am stuck at trying to read a particular row using the index that I get in 
> > the page_location array struct of offset index. Could you help me with this 
> > ? and if there have been discussions on the forums for this as well, could 
> > you direct me to that link?
> >
> > Regards,
> > Arun Balajiee
> >


[jira] [Commented] (PARQUET-1783) [C++] Parquet statistics wrong for dictionary type

2020-02-04 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030199#comment-17030199
 ] 

Wes McKinney commented on PARQUET-1783:
---

I suppose it's good at least that the min/max are not "incorrect" when used for 
predicate pushdown, but yes this should be fixed. 

> [C++] Parquet statistics wrong for dictionary type
> --
>
> Key: PARQUET-1783
> URL: https://issues.apache.org/jira/browse/PARQUET-1783
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.6.0
>Reporter: Florian Jetter
>Priority: Major
>
> h3. Observed behaviour
> Statistics for categorical data are equivalent for all row groups and refer 
> to the entire {{CategoricalDtype}} instead of the data included in the row 
> group.
> h3. Expected behaviour
> The row group statistics should only include data which is part of the actual 
> row group, not the entire {{CategoricalDtype}}
> h3. Minimal example
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> test_df = pd.DataFrame({"categorical": pd.Categorical(["1", "42"])})
> table = pa.Table.from_pandas(test_df)
> pq.write_table(
> table,
> "test_parquet",
> chunk_size=1,
> )
> test_parquet = pq.ParquetFile("test_parquet")
> test_parquet.metadata.row_group(0).column(0).statistics
> {code}
> {code:java}
> Out[1]:
> 
>   has_min_max: True
>   min: 1
>   max: 42
>   null_count: 0
>   distinct_count: 0
>   num_values: 1
>   physical_type: BYTE_ARRAY
>   logical_type: String
>   converted_type (legacy): UTF8
> {code}
> Expected would be
> {{min:1}} {{max:1}} instead of {{max: 42}} for the first row group
>  
> Tested with 
>  pandas==1.0.0
>  pyarrow==bd08d0ecbe355b9e0de7d07e8b9ff6ccdb150e73 (current master / 
> essentially 0.16.0)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1783) [C++] Parquet statistics wrong for dictionary type

2020-02-04 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030200#comment-17030200
 ] 

Wes McKinney commented on PARQUET-1783:
---

Do we need to create a corresponding Arrow issue so this does not pass out of 
mind?



Re: Contributing for the first time in parquet

2020-01-28 Thread Wes McKinney
Which part of the Parquet project are you interested in contributing
to? You can find the project's issues under the PARQUET project on
http://issues.apache.org/jira.

On Tue, Jan 28, 2020 at 12:34 PM Pritam Pathak  wrote:
>
> Hi team,
> I'm currently working at Microsoft, and our team works a lot with Parquet on
> a day-to-day basis, so I was thinking of learning by contributing.
> Could you folks please point me to issues I can pick up at this initial
> stage so as to familiarize myself with the repository?
>
> Thanks,
> Pritam Pathak,
> Software Development Engineer,
> Microsoft India


Re: Following up on Parquet file format issue posted on StackOverflow.

2020-01-28 Thread Wes McKinney
hi Raphael,

yes, to elaborate on my comment from SO -- Parquet files apply a
"layered" encoding and compression strategy that works really well on
datasets with a lot of repeated values. This can yield substantially
better compression ratios than naive compression (simply compressing
bytes using a compressor like Snappy, ZLIB, or LZ4).
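
To make that concrete, here is a small sketch (file names and data invented) comparing a heavily repeated string column with a mostly unique one written under the same settings; the repeated column is collapsed by dictionary and run-length encoding before Snappy ever sees the bytes, while the unique column gets little help from either.

{code:python}
import os
import uuid
import pyarrow as pa
import pyarrow.parquet as pq

n = 100_000
repeated = pa.table({"col": ["status_ok"] * n})                    # one distinct value
unique = pa.table({"col": [str(uuid.uuid4()) for _ in range(n)]})  # ~all distinct

pq.write_table(repeated, "repeated.parquet", compression="snappy")
pq.write_table(unique, "unique.parquet", compression="snappy")

print("repeated:", os.path.getsize("repeated.parquet"), "bytes")
print("unique:  ", os.path.getsize("unique.parquet"), "bytes")
# The repeated column typically ends up orders of magnitude smaller,
# because its dictionary has a single entry and the indices run-length
# encode to almost nothing.
{code}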

Where Parquet may perform less well is in cases where the values are
mostly unique. In the Python case you showed, the problem is made
worse by the fact that the unique strings have to be converted into
PyString objects using the C API PyString_FromStringAndSize (pickle
has to do this, too, but there's more decoding / decompression effort
that has to be done first when reading the Parquet file).

Parquet has a lot of benefits over Python pickle, not the least that
it can be read by many different systems and can be processed in
chunks (versus an all-or-nothing file load)
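
For instance, a minimal sketch of chunked processing with pyarrow, assuming a hypothetical data.parquet written with several row groups: the footer alone describes how the file is laid out, and each row group can be materialized and processed on its own.

{code:python}
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")  # hypothetical multi-row-group file

# The footer metadata describes the layout without touching any data pages.
print(pf.metadata.num_row_groups, "row groups,", pf.metadata.num_rows, "rows")

# Process one row group at a time; peak memory is bounded by the largest chunk.
for i in range(pf.metadata.num_row_groups):
    table = pf.read_row_group(i)   # a pyarrow.Table for just this chunk
    print(f"row group {i}: {table.num_rows} rows")
{code}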

Others may have other comments. Hope this helps

- Wes

On Tue, Jan 28, 2020 at 3:27 PM Attie, Raphael (GSFC-671.0)[GEORGE
MASON UNIVERSITY]  wrote:
>
> Dear Wes,
>
> I am responding to your offer to discuss my post on Stackoverflow at: 
> https://stackoverflow.com/questions/59432045/pandas-dataframe-slower-to-read-from-parquet-than-from-pickle-file?noredirect=1#comment105050134_59432045
>
> You have explained in the comment section that the kind of dataset I was 
> manipulating is not ideal for Parquet. I would be happy to know more about 
> this.
>
> Here is more context:
> I am working at NASA Goddard Space Flight Center on data from the Solar 
> Dynamics Observatory. It sends 1.5 TB of data per day of observations of the 
> Sun. The dataset I am currently working with, and described in my SO post, is
> a subset of the metadata associated with those observations.
>
> I got interested in using Parquet after hitting an error with HDF5: my larger
> dataset simply resulted in an error, whereas a smaller version of it had no
> problem. Parquet worked regardless of the size of the dataset. Also, I am
> going to use dask and dask with GPUs (from Nvidia RAPIDS), which I believe
> support Parquet. This is what got me interested in this format.
>
> Thanks
>
> Raphael Attie
>
>
> - - - - - - - - - -- - - - -- - - - -- - - - -- - - - -- -
> Dr. Raphael Attié (GSFC-6710)
> NASA / Goddard Space Flight Center
> George Mason University
> Office (NASA GSFC, room 041): 301-286-0360
> Cell: 301-631-4954
> Email (1): 
> raphael.at...@nasa.gov
> Email (2): rat...@gmu.edu
>

