[jira] [Resolved] (PARQUET-1402) [C++] incorrect calculation column start offset for files created by parquet-mr 1.8.1

2019-05-21 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1402.
---
   Resolution: Fixed
Fix Version/s: cpp-1.6.0

Issue resolved by pull request 4359
[https://github.com/apache/arrow/pull/4359]

> [C++] incorrect calculation column start offset for files created by 
> parquet-mr 1.8.1
> -
>
> Key: PARQUET-1402
> URL: https://issues.apache.org/jira/browse/PARQUET-1402
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Renat Valiullin
>Assignee: Renat Valiullin
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
> Attachments: test.parquet
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> parquet-mr (at least version 1.8.1-fast-201712141648170019-ab0622b)
> writes dictionary_page_offset == 0 to the ColumnChunk's metadata when it is 
> (supposed to be?) equal to data_page_offset.
> The calculation of col_start in std::unique_ptr GetColumnPageReader(int i) 
> works incorrectly in this case.
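
To make the failure mode above concrete, here is a minimal sketch of the kind of guard the description calls for. It is not the code from pull request 4359. The struct ColumnChunkOffsets and the function ColumnStartOffset are hypothetical names used only for illustration, and the sketch assumes that a real dictionary page always sits before the data pages in the file.

```cpp
#include <cstdint>

// Hypothetical stand-in for the offsets stored in a ColumnChunk's metadata;
// this is not the parquet-cpp metadata class.
struct ColumnChunkOffsets {
  bool has_dictionary_page;
  int64_t dictionary_page_offset;
  int64_t data_page_offset;
};

// Returns the file offset at which the column chunk's pages start.
// Assumption: a dictionary page, when really present, precedes the data
// pages, so a dictionary offset of 0 (or one that is not below the data
// page offset) is ignored and the data page offset is used instead.
inline int64_t ColumnStartOffset(const ColumnChunkOffsets& c) {
  int64_t col_start = c.data_page_offset;
  if (c.has_dictionary_page && c.dictionary_page_offset > 0 &&
      c.dictionary_page_offset < col_start) {
    col_start = c.dictionary_page_offset;
  }
  return col_start;
}
```

With this guard, metadata carrying dictionary_page_offset == 0 falls back to data_page_offset rather than making the reader start at offset 0 of the file.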



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1404) [C++] Add index pages to the format to support efficient page skipping to parquet-cpp

2019-05-21 Thread Deepak Majeti (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti reassigned PARQUET-1404:
--

Assignee: Deepak Majeti

> [C++] Add index pages to the format to support efficient page skipping to 
> parquet-cpp
> -
>
> Key: PARQUET-1404
> URL: https://issues.apache.org/jira/browse/PARQUET-1404
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Renato Javier Marroquín Mogrovejo
>Assignee: Deepak Majeti
>Priority: Major
>
> Once PARQUET-922 is completed, we can port that implementation to 
> parquet-cpp as well.





Re: Writing INT96 timestamp in parquet from either avro/protobuf records

2019-05-21 Thread Zoltan Ivanfi
Hi Ying,

I'm not familiar with Presto, so I can't answer this question for sure, but
your assumption sounds logical to me.

Br,

Zoltan

On Mon, May 20, 2019 at 6:44 PM ying wrote:

> Hi Zoltan:
>
> Thanks a lot for the information — wasn’t aware of these Jira efforts
> before.
>
> I assume once it’s in Hive, similar support will propagate to the original
> Presto parquet reader which uses Hive?
>
> Thanks.
>
> -
> Ying
>
> On Fri, May 17, 2019 at 4:00 AM Zoltan Ivanfi wrote:
>
> > Hi Ying,
> >
> > Int64 timestamp support is already in the works for Hive, but merging it
> > into the codebase is blocked on the release of parquet-mr 1.11.0 at this
> > moment. Here are the JIRAs you can track:
> >
> > - HIVE-21215: Read Parquet INT64 timestamp
> > - HIVE-21216: Write Parquet INT64 timestamp
> >
> > There is an ongoing effort for Spark as well:
> >
> > - SPARK-26797: Start using the new logical types API of Parquet 1.11.0
> > instead of the deprecated one
> >
> > Br,
> >
> > Zoltan
> >
> > On Thu, May 16, 2019 at 6:03 PM ying wrote:
> >
> > > Hi Julien:
> > >
> > > Parquet appears to recommend using int64 to represent timestamps
> > > (through the *timestamp-millis* and *timestamp-micros* logical types).
> > >
> > > However, in our use cases we are using Hive/Presto to load Parquet files,
> > > and we found that only the int96 format is supported for timestamps
> > > (see the related Hive JIRAs below). Specifically, although Hive
> > > supports different formats for timestamps, when *loading from Parquet*
> > > only int96 is supported as a timestamp.
> > >
> > > https://issues.apache.org/jira/browse/HIVE-15079
> > >
> > > https://issues.apache.org/jira/browse/HIVE-13435
> > >
> > > https://issues.apache.org/jira/browse/HIVE-3844
> > >
> > >
> > > Just to confirm: are the above known issues to the Parquet community? And
> > > are you aware of past/future efforts to add support for loading Parquet
> > > *int64* as timestamp in Hive?
> > >
> > >
> > > Thanks.
> > >
> > > -
> > > Ying
> > >
> > > On 2019/05/10 18:03:10, Julien Le Dem wrote:
> > > > > Hi Arup,
> > > > > You are correct, you would have to use the lower level APIs or contribute
> > > > > the int96 support to either protobuf or avro integrations.
> > > > > However we are recommending users to migrate away from the int96 type so I
> > > > > would not recommend adding that support.
> > > > > https://issues.apache.org/jira/browse/PARQUET-323
> > > > > Maybe check how the tools you use to query that data interpret int96 and
> > > > > int64, you might have a better solution moving to the new type and it being
> > > > > compatible.
> > > > >
> > > > > On Fri, May 3, 2019 at 11:34 AM Arup Malakar wrote:
> > > > >
> > > > > > Following up on the thread, my current understanding is that INT96 is not a
> > > > > > native type in either of protobuf/avro, so the corresponding high level
> > > > > > parquet writers don’t support that. But `INT96` is supported by low level
> > > > > > parquet writer apis. I was able to generate parquet files with INT96 using
> > > > > > examples from:
> > > > > >
> > > > > > https://stackoverflow.com/questions/54657496/how-to-write-timestamp-logical-type-int96-to-parquet-using-parquetwriter
> > > > > >
> > > > > > Arup
> > > > > >
> > > > > > On Wed, May 1, 2019 at 7:32 PM Arup Malakar wrote:
> > > > > >
> > > > > > > Hi parquet-dev,
> > > > > > >
> > > > > > > We have existing parquet files which were generated from json using hive,
> > > > > > > where timestamps live as INT96. We are changing the pipeline where we are
> > > > > > > planning to use flink to generate parquet files from protobuf (or avro)
> > > > > > > using flink's StreamingFileSink. But from my research I am unable to find a
> > > > > > > way to write INT96 columns in the parquet either from avro or protobuf. We
> > > > > > > would like to keep the same datatype on disk for historical and new data so
> > > > > > > would like to stick to INT96, any suggestion how to achieve that?
> > > > > > >
> > > > > > > --
> > > > > > > Arup Malakar
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Arup Malakar
> > > > > >
> > > >
> > >
> >
>
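
For readers weighing the int96-to-int64 migration discussed in this thread, here is a rough sketch of how a 12-byte INT96 timestamp maps onto an int64 count of microseconds since the Unix epoch. It assumes the commonly used INT96 layout (8 little-endian bytes of nanoseconds within the day, followed by a 4-byte little-endian Julian day number). The function name Int96ToEpochMicros and the constants are hypothetical, not part of parquet-cpp or parquet-mr, and sub-microsecond precision is simply truncated.

```cpp
#include <cstdint>

// Julian day number of the Unix epoch (1970-01-01).
constexpr int64_t kJulianEpochDay = 2440588;
constexpr int64_t kMicrosPerDay = 86400LL * 1000 * 1000;

// Converts a 12-byte INT96 timestamp to microseconds since the Unix epoch.
// Assumed layout: bytes[0..7] hold the nanoseconds within the day and
// bytes[8..11] hold the Julian day number, both little-endian.
// The bytes are decoded one at a time, so host endianness does not matter.
inline int64_t Int96ToEpochMicros(const uint8_t* bytes) {
  uint64_t nanos_of_day = 0;
  for (int i = 7; i >= 0; --i) {
    nanos_of_day = (nanos_of_day << 8) | bytes[i];
  }
  uint32_t julian_day = 0;
  for (int i = 11; i >= 8; --i) {
    julian_day = (julian_day << 8) | bytes[i];
  }
  const int64_t days_since_epoch =
      static_cast<int64_t>(julian_day) - kJulianEpochDay;
  // Integer division drops any sub-microsecond remainder.
  return days_since_epoch * kMicrosPerDay +
         static_cast<int64_t>(nanos_of_day / 1000);
}
```

Whether a given query engine reads the resulting int64 column as a timestamp still depends on the logical type annotation and on the engine's own support, as the replies above point out.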


[jira] [Resolved] (PARQUET-1583) [C++] Remove parquet::Vector class

2019-05-21 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved PARQUET-1583.
--
Resolution: Fixed

Issue resolved by pull request 4354
[https://github.com/apache/arrow/pull/4354]

> [C++] Remove parquet::Vector class
> --
>
> Key: PARQUET-1583
> URL: https://issues.apache.org/jira/browse/PARQUET-1583
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I'm not sure this code is needed anymore; it was added during the early 
> days of the project in 2016.


