It is important to use the RowReaderOptions::include method since that is what controls whether the bytes are read and decompressed or not.
.. Owen

On Jan 20, 2019, at 9:52 AM, Gang Wu <ust...@gmail.com> wrote:

To read the desired type of each column, you just need to cast the base orc::ColumnVectorBatch, which you get from rowReader->next(), to its desired type. You can dynamic_cast to orc::LongVectorBatch for int64 and orc::StringVectorBatch for char *, check the API here:
https://github.com/apache/orc/blob/4e7d9c2e126cebd075f51b9d6ab2c30f4c8943c0/c%2B%2B/include/orc/Vector.hh#L41

Gang

On Sun, Jan 20, 2019 at 9:36 AM Zhiyuan Dong <zhiyuan.d...@gmail.com> wrote:

> Hi Owen,
>
> Let me follow the github example link you provided.
>
> Appreciate the prompt response. Many thanks!
>
> Best,
>
> Zhiyuan
>
> On Sun, Jan 20, 2019 at 11:09 AM Owen O'Malley <owen.omal...@gmail.com>
> wrote:
>
>> Yes, ORC files are set up so that reading individual columns is much
>> faster (and reads less data) than reading the entire row.
>>
>> You need to call RowReaderOptions::include or includeType depending on
>> whether you want to select by name or id.
>>
>> Look at the tool code for file contents about how to do this.
>>
>>
>> https://github.com/apache/orc/blob/4e7d9c2e126cebd075f51b9d6ab2c30f4c8943c0/tools/src/FileContents.cc#L77
>>
>> .. Owen
>>
>> On Sun, Jan 20, 2019 at 7:16 AM Zhiyuan Dong <zhiyuan.d...@gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> I am working in marketing research field, and find that at times I need
>>> to extract contents of ORC files into analytical packages like R, Julia,
>>> etc, without using tools like JDBC, etc ( which offers ability to access
>>> ORC files )
>>>
>>> I have been using C++ to access ORC file contents, following examples
>>> provided in the ORC file C++ distribution example, e.g. meta info,
>>> contents, etc. My datasets are basic 2d tables, with rows and columns, each
>>> column has very basic data types : int64, or double. I have found the ORC
>>> file C++ access APIs very helpful and handy!
>>>
>>> Since R or Julia has column major storage format in their matrix, and I
>>> would like to extract the contents of ORC files column by column. In the
>>> example that gets the file contents made available on the ORC file C++
>>> official website, the C++ code reads the entire ORC file contents by
>>> batches, and within each batch, it reads the contents row by row, creating
>>> a string version of the data, JSON like.
>>>
>>> My question is : ( since I don't know how ORC file structure details ),
>>> Can the user read ORC file contents column by column using the C++ APIs you
>>> guys published ? is there speed advantage of doing this ( as opposed to
>>> read in batches, and within each batch parse contents row by row ).
>>>
>>> if possible : Is there an example that I can follow to read contents
>>> column by column?
>>>
>>> Is it possible that the example C++ codes can give a (char*) type
>>> pointer to the user , each time it reads a row element within a column, so
>>> that users can read that into desired data type, e.g. int64, double, etc,
>>> directly without building the JSON like text output rows ? Or there are
>>> even more there already to read a ORC file column directly into a in-memory
>>> T* that stores the data with corresponding data type, e.g. int64, double,
>>> etc. ?
>>>
>>> Many many thanks!
>>>
>>> Best,
>>>
>>> Zhiyuan
>>>
>>
>
>
> --
> Zhiyuan Dong, Ph.D.
>