[ANNOUNCE] Announcing Apache ORC 1.9.3

2024-03-20 Thread Gang Wu
Hi All. We are happy to announce the availability of Apache ORC 1.9.3! https://orc.apache.org/news/2024/03/20/ORC-1.9.3/ 1.9.3 is a maintenance release containing important fixes. It's available in Apache Downloads and Maven Central. https://downloads.apache.org/orc/orc-1.9.3/

[ANNOUNCE] Announcing ORC availability in Conan

2024-03-11 Thread Gang Wu
Hi All. We are happy to announce the availability of the Apache ORC C++ library in Conan Center, the home of the popular C++ package manager: https://conan.io/center/recipes/orc Currently we have added 2.0.0, 1.9.2, 1.8.6 and 1.7.10. You may find the recipe in the official conan

[ANNOUNCE] Announcing Apache ORC 1.8.5

2023-09-05 Thread Gang Wu
Hi All. We are happy to announce the availability of Apache ORC 1.8.5! https://orc.apache.org/news/2023/09/05/ORC-1.8.5/ 1.8.5 is a maintenance release containing important fixes. It's available in Apache Downloads and Maven Central. https://downloads.apache.org/orc/orc-1.8.5/

[ANNOUNCE] Announcing Apache ORC 1.7.9

2023-05-07 Thread Gang Wu
Hi All. We are happy to announce the availability of Apache ORC 1.7.9! https://orc.apache.org/news/2023/05/07/ORC-1.7.9/ 1.7.9 is a maintenance release containing important fixes. It's available in Apache Downloads and Maven Central. https://downloads.apache.org/orc/orc-1.7.9/

Re: Writing in c++ and data persistence

2023-03-29 Thread Gang Wu
er to the java meta tool. > Otherwise, it defaults to the end of the file. > > >> >> I would need to dig deeper in to how the writeMetadata() and >> writeFileFooter() are different between Java / C++ in order to understand >> what is going on. This Java WriterImpl.java code cau

Re: Writing in c++ and data persistence

2023-03-28 Thread Gang Wu
y writes the offsets to the > "_flush_length" file, though. I would also be interested in seeing when and > how the preliminary footers are written, too. > > Thanks, > Hinko > > From: Gang Wu > Sent: Tuesday, March 28, 2023

Re: Writing in c++ and data persistence

2023-03-28 Thread Gang Wu
Hi Hinko, Please see my inline answers below: On Tue, Mar 28, 2023 at 3:16 PM Hinko Kocevar wrote: > Hi, > > I have a couple of questions about the persistence and consistency of the > data when written to the file. In my use case I generally expect that the > data rate is high enough such

Re: [ANNOUNCE] Welcome Xin Zhang as an ORC committer!

2023-02-10 Thread Gang Wu
Welcome Xin! Best, Gang On Sat, Feb 11, 2023 at 1:07 PM William H. wrote: > The Apache ORC PMC recently added Xin Zhang > (https://github.com/coderex2522) as a committer. > > Please join me in welcoming Xin Zhang to the ORC community! > > Bests, > William, Chair >

Re: Question about Sargs and row index

2023-02-06 Thread Gang Wu
[(byte > offset of chunk1, decompressed size, # of values), (byte offset of > chunk2, decompressed size, # of values)]. Is that correct? > > On Tue, Feb 7, 2023 at 1:04 PM Gang Wu wrote: > > > > Not exactly. It starts with the byte offset of the compression chunk and > appe

Re: Question about Sargs and row index

2023-02-06 Thread Gang Wu
Hi Gang, > > Thanks for your reply. > A follow up question on Row Index, what is the exact meaning of > 'position' in RowIndexEntry? Is it the byte offset of the starting > position of the first compression chunk of that row group? > > On Thu, Feb 2, 2023 at 4:40 PM Gang Wu

Re: Question about Sargs and row index

2023-02-02 Thread Gang Wu
ulting ColumnVectorBatch > produced by C++ reader with PPD? > > On Thu, Jan 19, 2023 at 5:46 PM Xinyu Z wrote: > > > > Hi Gang, > > > > Thanks for your reply! It helps. > > > > Xinyu > > > > On Wed, Jan 18, 2023 at 10:42 AM Gang Wu wrote: > >

Re: Question about Sargs and row index

2023-01-17 Thread Gang Wu
Hi Xinyu, The C++ library does not provide lazy materialization. The Java library supports row-level filtering; please check it if interested: https://issues.apache.org/jira/browse/ORC-577 With regard to the IO magnification introduced by PPD, I think we have discussed this earlier and there is

Re: [C++] Setting rowIndexStride to a small value increases query time

2022-09-05 Thread Gang Wu
ession block will > be read and decompressed only once right? > > On Mon, Sep 5, 2022 at 11:31 PM Gang Wu wrote: > > > > Hi Xinyu, > > > > When the row group stride is set to 100, we end up with many row groups > and each contributes a protobuf object in the

Re: [C++] Setting rowIndexStride to a small value increases query time

2022-09-05 Thread Gang Wu
Hi Xinyu, When the row group stride is set to 100, we end up with many row groups and each contributes a protobuf object in the stripe index. That's why you see the most expensive function is loadStripeIndex(). I need to say that smaller row groups may not help reduce the I/Os since the
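The effect described in this message can be estimated with simple arithmetic. The following is a minimal sketch, not code from the ORC library: the helper names are hypothetical, and it assumes one row index entry per row group for each column, with every row group holding rowIndexStride rows except possibly the last one in the stripe.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an ORC API): number of row groups in one stripe.
// Assumes every row group holds `rowIndexStride` rows except possibly the last.
inline uint64_t rowGroupsPerStripe(uint64_t rowsInStripe, uint64_t rowIndexStride) {
  assert(rowIndexStride > 0 && "this sketch does not model a disabled row index");
  return (rowsInStripe + rowIndexStride - 1) / rowIndexStride;  // ceiling division
}

// Hypothetical helper: row index entries to parse for one stripe,
// assuming one entry per row group for each of `columns` columns.
inline uint64_t rowIndexEntries(uint64_t rowsInStripe, uint64_t rowIndexStride,
                                uint64_t columns) {
  return rowGroupsPerStripe(rowsInStripe, rowIndexStride) * columns;
}
```

For an illustrative stripe of 1,000,000 rows, rowGroupsPerStripe(1000000, 100) gives 10000 row groups, while a stride of 10000 gives 100, so the smaller stride means a hundred times more index entries to load per column under these assumptions.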

Re: Write intermediate footers in C++

2021-07-07 Thread Gang Wu
Hi David, Unfortunately the C++ writer does not support it yet. It is pretty straightforward to implement. Would you like to contribute? Best, Gang On Wed, Jul 7, 2021 at 4:39 PM David Justen wrote: > Hey everyone, > > I am currently working on a project using ORC's C++ library. To reduce >

Re: About ORC API Library

2019-03-20 Thread Gang Wu
Hi Lei, Unfortunately we don't have a Go binding for the ORC writer. Would it be possible for you to use the cgo package in Go to call the C++ API in your application? Thanks, Gang On Tue, Mar 19, 2019 at 10:44 PM yanglei wrote: > Dear Team > > > > I am working on a project using golang to

Re: Row index questions

2019-03-11 Thread Gang Wu
illion rows. But still one row group per stripe. > > -- Korru > > > On Mar 11, 2019 6:00 PM, Gang Wu wrote: > > The default number of rows in a row group is 10000 which you can get from > Reader::getRowIndexStride(). You probably need to create more rows of data to > verify t

Re: Row index questions

2019-03-11 Thread Gang Wu
The default number of rows in a row group is 10000 which you can get from Reader::getRowIndexStride(). You probably need to create more rows of data to verify this. Thanks Gang On Mon, Mar 11, 2019 at 1:19 PM Korry Douglas wrote: > I’m making progress on predicate pushdown using the C++ ORC api. > >

Re: Question about using indexes/statistics

2019-03-06 Thread Gang Wu
The following function returns the stripe-level & row-group-level statistics of the stripe specified by input. *ORC_UNIQUE_PTR<StripeStatistics> Reader::getStripeStatistics(uint64_t stripeIndex) const;* You need to call *StripeStatistics::getColumnStatistics* to get stripe-level stats and

Re: How to handle ColumnStatistics in C++

2019-03-01 Thread Gang Wu
Yes, you are right. This interface returns column statistics of all columns, and their types can be found via the type from the file footer. On Fri, Mar 1, 2019 at 10:04 AM Korry Douglas wrote: > I think I’ve figured this out - I have to look at the column type and then > infer which of the

Re: access entire column in ORC files

2019-01-25 Thread Gang Wu
Unfortunately we don't have an API to return a row of data. You have to extract each column from the batches. For seekToRow(uint64_t rowNumber), you can jump to the row specified by rowNumber and then use rowReader->next() to get the batch. It is pretty straightforward. You can actually create

Re: access entire column in ORC files

2019-01-20 Thread Gang Wu
To read the desired type of each column, you just need to cast the base orc::ColumnVectorBatch, which you get from rowReader->next(), to its desired type. You can dynamic_cast to orc::LongVectorBatch for int64 and orc::StringVectorBatch for char *, check the API here:
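The casting pattern described in this message can be sketched without linking against ORC. The types below are simplified stand-ins in a hypothetical sketch namespace, not the real orc:: classes (the real batches store their values in ORC's own buffer types rather than std::vector), and sumLongColumn is a made-up helper; only the dynamic_cast step mirrors what the message describes.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

namespace sketch {  // stand-ins for illustration only, NOT the real orc:: classes

struct ColumnVectorBatch {
  virtual ~ColumnVectorBatch() = default;  // polymorphic base, so dynamic_cast works
  uint64_t numElements = 0;
};

struct LongVectorBatch : ColumnVectorBatch {
  std::vector<int64_t> data;  // assumption: std::vector replaces the library buffer
};

struct StringVectorBatch : ColumnVectorBatch {
  std::vector<std::string> data;  // assumption: simplified string storage
};

// Hypothetical helper: sums an integer column.
// Returns false when the batch is not a LongVectorBatch.
inline bool sumLongColumn(const ColumnVectorBatch& batch, int64_t& sum) {
  const auto* longs = dynamic_cast<const LongVectorBatch*>(&batch);
  if (longs == nullptr) {
    return false;  // wrong column type: the cast failed
  }
  assert(longs->data.size() >= longs->numElements);
  sum = 0;
  for (uint64_t i = 0; i < longs->numElements; ++i) {
    sum += longs->data[i];
  }
  return true;
}

}  // namespace sketch
```

Checking the result of dynamic_cast against nullptr before use keeps a mismatch between the expected and actual column type from turning into undefined behavior.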

Re: extract ORC contents without printing into strings first

2018-11-15 Thread Gang Wu
Yes, you can find the example in https://orc.apache.org/docs/core-cpp.html Calling orc::RowReader::next() will return the orc::ColumnVectorBatch data which has a specific batch for each type. All the public APIs available to you are here:

Re: a question about retrieve column names from ORC files

2018-11-13 Thread Gang Wu
Hi Zhiyuan, Yes, you can see the following example which prints the names of the top level fields.
    orc::Reader * reader = ...
    const orc::Type& type = reader->getType();
    for (uint64_t i = 0; i != type.getSubtypeCount(); ++i) {
      std::cout << type.getFieldName(i) << std::endl;
    }
Best, Gang On

Re: Newbie question on update a row

2018-06-14 Thread Gang Wu
Hi Freddy, The simple answer is NO. IIUC, ORC is designed to work on HDFS, which is append-only. Therefore an ORC file is immutable once the writing is done. Hive provides an advanced feature for update and delete with ORC: https://orc.apache.org/docs/acid.html. Not sure if you are looking for