Re: List of Additions to Parquet 2

2016-06-16 Thread Wes McKinney
To add one bit of context, we're looking at the handling of integers other than INT32 and INT64 from the perspective of Apache Arrow. It seems that in Parquet 1 files, you may not be able to recover the original integer types from the file alone. The question is, should we put this metadata in

Re: Parquet-cpp

2016-01-15 Thread Wes McKinney
y (Vertica) <stephen.walkaus...@hpe.com> > Sent: Thursday, January 14, 2016 3:23 PM > To: Sandryhaila, Aliaksei; dev@parquet.apache.org; Majeti, Deepak; > non...@gmail.com; Wes McKinney > Subject: Re: Parquet-cpp > > Yes, thanks for the introduction Julien. > > Nong and We

Re: Parquet-cpp

2016-02-05 Thread Wes McKinney
Wes On Wed, Jan 27, 2016 at 10:22 PM, Wes McKinney <w...@cloudera.com> wrote: > Yeah, if the Apache build queue is clogged up with other projects' builds, > and you have a green build on your personal repo, I suggest posting that on > the PR and the reviewer can accept the patch after che

Re: parquet-cpp patch queue status 2/10

2016-02-10 Thread Wes McKinney
o merge 1) very shortly but I'm going to be pretty busy in the > coming > week getting ready for Spark Summit. I probably won't have too much time to > look at these until after but feel free to review and merge the patches. I > can look > after if that'd be helpful. > > On Wed

parquet-cpp first 0.1 release planning & timeline

2016-02-13 Thread Wes McKinney
Dear friends, I made a pass through the JIRAs and feature roadmap and listed out essential tasks for reaching a milestone that would merit a versioned code release, see: https://docs.google.com/document/d/1WyquzupLc3UkErO2OhqLJNQ9a84Cccc8LVUSuLQz39o/edit# I will be pressing for all of this to

Organizing functional components and a bottom-up testing plan for parquet-cpp

2016-01-29 Thread Wes McKinney
hi folks, Since there are so many moving pieces in creating a full-featured Parquet reader-writer, I propose we start working out a plan to create test fixtures and tools to enable us to develop faster. Specifically, we need to achieve maximum decoupling between functional components. Every

Re: Parquet-cpp

2016-01-26 Thread Wes McKinney
builds are executing again. On Tue, Jan 26, 2016 at 4:00 PM, Wes McKinney <w...@cloudera.com> wrote: > There are 3 more patches outstanding that are causing blockage (418, 433, > and 451/453), so I think if we get them merged today or tomorrow then we > should be able to proceed wit

Re: Organizing functional components and a bottom-up testing plan for parquet-cpp

2016-01-31 Thread Wes McKinney
sure everything is on the same page. > CC'ing some folks who should probably chime in. > > > On Fri, Jan 29, 2016 at 10:21 AM, Wes McKinney <w...@cloudera.com> wrote: > > > hi folks, > > > > Since there's so many moving pieces with creating a full-featured Parquet >

Re: Organizing functional components and a bottom-up testing plan for parquet-cpp

2016-01-31 Thread Wes McKinney
great doc Wes. Could add me as a commenter? > > On Sun, Jan 31, 2016 at 12:11 PM, Wes McKinney <w...@cloudera.com> wrote: > > > Dear all, > > > > I created a publicly available document where we can organize the > > parquet-cpp roadmap and outstanding JIRAs

Re: parquet-cpp first 0.1 release planning & timeline

2016-02-29 Thread Wes McKinney
Feb 26, 2016 at 8:33 AM, Wes McKinney <w...@cloudera.com> wrote: > >> If someone could kindly merge this patch (PARQUET-494): >> >> https://github.com/apache/parquet-cpp/pull/64 >> >> we'll then be able to close out the remaining JIRAs and hopefully tag >

Re: parquet-cpp first 0.1 release planning & timeline

2016-02-23 Thread Wes McKinney
unless there are any other major functional requirements that we are missing? - Wes On Sat, Feb 20, 2016 at 8:40 AM, Wes McKinney <w...@cloudera.com> wrote: > I'll be available most of today and tomorrow as needed for code > reviews. We need to try to get the outstanding patch que

Re: parquet-cpp first 0.1 release planning & timeline

2016-02-26 Thread Wes McKinney
It should > be a straightforward extension to FLBA with the additional requirement of > swapping the bytes. > > On 02/23/2016 11:35 AM, Wes McKinney wrote: > It looks like we should be able to clear our current patch queue today which > puts us in very good shape for the 0.1 release. >

Re: Parquet-cpp

2016-01-25 Thread Wes McKinney
his point, the patches need to be reviewed and approved by >>> Parquet committers in order to be committed to master. >>> >>> Unfortunately, there is not much activity on this side of the project. >>> The lack of response from current committers is holding us b

Re: Parquet-cpp

2016-01-26 Thread Wes McKinney
>>>> >>>>>> I don't think any of those are that hard to demonstrate, but I'd be >>>>>> uncomfortable not validating committers like we normally do. >>>>>> Especially in this situation, where I could easily see the amount of >>>

Re: parquet-cpp first 0.1 release planning & timeline

2016-02-19 Thread Wes McKinney
this time next week, and spend a few more days on code tidying, adding some example scripts and general use / hardening before we cut the release. Sound good? Thanks Wes On Tue, Feb 16, 2016 at 9:46 AM, Wes McKinney <w...@cloudera.com> wrote: > Thanks all. > > I'm gonna try to

parquet-cpp patch queue status 2/10

2016-02-10 Thread Wes McKinney
hello all, We're close to being back in the patch queue red zone. Let me try to make sense of what needs to be reviewed and merged and in what order (from merge first to merge last): 1. PARQUET-167: Nong needs to sign off and merge. https://github.com/apache/parquet-cpp/pull/30 2. PARQUET-505: If

Re: Parquet use-case

2016-03-12 Thread Wes McKinney
hi Andrew, I can make some specific comments about parquet-cpp. Note that it's still very much at an alpha stage of development, so you may need to submit some patches for your needs, but such is the price of progress, right? =) On the bright side, there's a number of us here on the list who need

Re: Retrieving the full/expanded name of a column in parquet-cpp

2016-03-19 Thread Wes McKinney
hi Uwe, Thanks for bringing this up -- I haven't done any work with nested data yet so I didn't add any helper functions like you're describing yet! I had several thoughts about this while working on the schema tree building code in parquet/schema. One solution is that you can add a parent()

parquet-cpp 0.1 release candidate

2016-03-04 Thread Wes McKinney
Dear all, Since the JIRA burndown started to stabilize, I tagged the first 0.1 release candidate of parquet-cpp: https://github.com/apache/parquet-cpp/archive/release-0.1-rc0.tar.gz SHA1: 9aefdbd6c14adc141d8cd7a1af681ef9e1c9e8f4 Thank you everyone for the patches and advice/support on this

Re: Parquet-cpp dependency on C++11

2016-03-07 Thread Wes McKinney
hello, responses inline On Mon, Mar 7, 2016 at 8:22 AM, Aliaksei Sandryhaila wrote: > Hi Wes and Julien, > > At this point, parquet-cpp is heavily reliant on C++11 features and > semantics. Believe it or not :), there are plenty of companies still > running older versions

Re: Parquet sync up

2016-04-21 Thread Wes McKinney
I'm sorry that I'm not able to join either due to international travel (also due to European time zone), but my interests are much in line with Uwe's and I look forward to continuing to work together with him and Deepak and Aliaksei on parquet-cpp. We should engage in a conversation on the ML

Re: Parquet sync uo

2016-05-11 Thread Wes McKinney
I'm sorry I wasn't able to join today again (traveling). We could choose an early Pacific time to make the meeting accessible to both Asia and Europe -- I would suggest 8 or 9 AM Pacific

Re: Parquet sync uo

2016-05-15 Thread Wes McKinney
Thanks Julien -- is it possible to arrange for some advance notice of the date and time of the sync up (or a shared google calendar perhaps)? On Thu, May 12, 2016 at 5:33 PM, Julien Le Dem wrote: > The next sync up will be around Strata London early June, where I'll happen >

Re: C++: API Documentation Style/Tool

2016-04-20 Thread Wes McKinney
I am fine with Doxygen style comments. I will make an effort to adopt this style as well (especially when we set up auto-generated HTML API documentation pages). - Wes On Wed, Apr 20, 2016 at 8:37 AM, Uwe Korn wrote: > Hello, > > I would start to make some API documentation

Re: Parquet Vectorized Read hackathon

2016-07-28 Thread Wes McKinney
>>>> there around 10am >>>>>> There will be people to open the door earlier. >>>>>> >>>>>> Agenda/things that have been mentioned on the thread: >>>>>> - Parquet <-> Arrow >>>>>> - Parquet-c

Re: Parquet for High Energy Physics

2016-08-03 Thread Wes McKinney
hi Jim Cool to hear about this use case. My gut feeling is that we should not expand the scope of the parquet-cpp library itself too much beyond the computational details of constructing the encoded streams / metadata and writing to a file stream or decoding a file into the raw values stored in

Re: Parquet Vectorized Read hackathon

2016-07-05 Thread Wes McKinney
I may be available a good portion of the 14th (I will be on the road) and will try to participate remotely (can we set up a slack or something?). I am most interested in strategies / algorithms around batch conversion of nested data to and from Apache Arrow data structures. On Sat, Jul 2, 2016 at

Re: Parquet Vectorized Read hackathon

2016-07-08 Thread Wes McKinney
Do we yet have a Slack / IRC for Parquet? I will be joining remotely throughout the day. Anyone who is interested in algorithms for Arrow nested data <-> Parquet disassembly/reassembly, we should start a shared Google document to detail algorithms and various test cases we'll need to address in

Re: Next parquet sync

2017-01-31 Thread Wes McKinney
, too >> >> > On 30.01.2017 at 04:15, Wes McKinney <wesmck...@gmail.com> wrote: >> > >> > Does Monday 2/6 work? We could also do this coming Friday 2/3 >> > >> >> On Sat, Jan 28, 2017 at 1:30 AM, Julien Le Dem <jul...@ledem.net> >

Re: Unable to compile thrift

2017-02-08 Thread Wes McKinney
hi Pradeep -- you can use Thrift 0.7 or higher (the instructions say "0.7+", perhaps we should call this out more explicitly). I recommend building Thrift 0.9.3 or 0.10 -- let us know if you have issues with these Thanks Wes On Wed, Feb 8, 2017 at 2:19 PM, Pradeep Gollakota

Re: Making parquet-cpp release?

2017-01-24 Thread Wes McKinney
project. > I’m not against starting at 0.5 but we should try not to convey to much > meaning in the version number related to the progress/increase in features. > > Julien > >> On Jan 24, 2017, at 6:42 AM, Wes McKinney <wesmck...@gmail.com> wrote: >> >> h

Re: Next parquet sync

2017-01-26 Thread Wes McKinney
This falls during Spark Summit East -- not sure if anyone else has a conflict with this On Thu, Jan 26, 2017 at 7:02 PM, Julien Le Dem wrote: > Next parquet sync will happen Thursday February 9th at 10am PT on google > hangout >

Re: [PARQUET-CPP] Build failure on master

2017-02-22 Thread Wes McKinney
e:106: recipe for target > 'release/parquet-dump-schema' failed > make[2]: *** [release/parquet-dump-schema] Error 1 > CMakeFiles/Makefile2:716: recipe for target > 'tools/CMakeFiles/parquet-dump-schema.dir/all' failed > make[1]: *** [tools/CMakeFiles/parquet-dump-schema.dir/all] Error

Re: [PARQUET-CPP] Build failure on master

2017-02-22 Thread Wes McKinney
>> >> I actually had 2 versions of boost, I built 1.54 once for a different >> project (which I do not need anymore), I got rid of it and now it picks up >> 1.58 but still gives the same error. Will upload a complete shell >> transcript. >> >> Regards, >> Keit

Re: [VOTE] Release Apache Parquet C++ 1.0.0 RC3

2017-02-24 Thread Wes McKinney
@Uwe, I suggest we prefix the RC directory names with apache-parquet-cpp- in https://dist.apache.org/repos/dist/dev/parquet/ to help disambiguate the RCs of the different subcomponents. On Ubuntu 14.04: - Debug build and ran tests with valgrind --tool=memcheck with gcc 4.8.5 - Release build

Re: Day of Sync-up

2017-02-25 Thread Wes McKinney
Moving this to the Parquet mailing list. Other days of the week work OK for me generally. On Fri, Feb 24, 2017 at 5:48 PM, Julien Le Dem wrote: > Currently the Parquet sync-up is scheduled on Thursday 10 am PT every other > week. > Marcel mentioned that another day (same time)

[DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

2017-02-25 Thread Wes McKinney
Dear Apache Kudu and Apache Impala (incubating) communities, (I'm not sure the best way to have a cross-list discussion, so I apologize if this does not work well) On the recent Apache Parquet sync call, we discussed C++ code sharing between the codebases in Apache Arrow and Apache Parquet, and

Re: [PARQUET-CPP] Build failure on master

2017-02-22 Thread Wes McKinney
value collect2: error: ld returned 1 exit status Patch forthcoming On Wed, Feb 22, 2017 at 1:29 PM, Keith Chapman <keithgchap...@gmail.com> wrote: > Hi Wes, > > No I don't have SNAPPY_HOME set. Yes this seems similar to 885 > > On Feb 22, 2017 10:25 AM, "Wes McKinney"

Re: [PARQUET-CPP] Build failure on master

2017-02-22 Thread Wes McKinney
st::match_results<__gnu_cxx::__normal_iterator const*, std::string>, > std::allocator<boost::sub_match<__gnu_cxx::__normal_iterator std::string> > > > const&)' > ../release/libparquet.a(metadata.cc.o): In function `perl_matcher': > . > And a lot more >

Re: [PARQUET-CPP] Build failure on master

2017-02-22 Thread Wes McKinney
n.com > > On Wed, Feb 22, 2017 at 10:30 AM, Wes McKinney <wesmck...@gmail.com> wrote: >> >> I'm able to reproduce the issue on Ubuntu 14.04 >> >> Linking CXX shared library debug/libparquet.so >> /usr/bin/ld: /usr/lib/libsnappy.a(snappy.o): relocation R_X8

Re: [VOTE] Release Apache Parquet C++ 1.0.0 RC1

2017-02-19 Thread Wes McKinney
+1 (binding) - Verified signature - Built with -DPARQUET_ARROW=on and ran unit tests - Wes On Sun, Feb 19, 2017 at 1:01 PM, Uwe L. Korn wrote: > Small amendment to the previous mail: > > The vote will be open for the next ~72 hours ending at 18:45 CET, > February 22, 2017. >

Making parquet-cpp release?

2017-01-24 Thread Wes McKinney
hi folks, Since Uwe has set up the release-making bits recently, and the API is reasonably stable after the refactor to depend on libarrow, I propose we go ahead and make a first official parquet-cpp source release. I propose that we call this release 0.5.0 instead of 0.1.0 to reflect the

Re: [VOTE] Release Apache Parquet C++ 1.0.0 RC3

2017-02-27 Thread Wes McKinney
’t run tests. > > rb > > On Sun, Feb 26, 2017 at 11:10 AM, Wes McKinney wesmck...@gmail.com wrote: > > hi Deepak, > > Thank you very much for catching this. > > It appears that Travis CI silently upgraded our build image to Xcode > 7.3 last fall — we should have pegge

Re: [VOTE] Release Apache Parquet C++ 1.0.0 RC3

2017-02-26 Thread Wes McKinney
on Ubuntu 16.04 with GCC 4.9.4 > > +1 (non-binding) > > Thanks, Uwe. > > > On Fri, Feb 24, 2017 at 5:21 PM, Wes McKinney <wesmck...@gmail.com> wrote: > > > @Uwe, I suggest we prefix the RC directory names with apache-parquet-cpp- > > in > > >

Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

2017-02-26 Thread Wes McKinney
e opinions of others, and possible next steps. Thanks Wes On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org> wrote: > Thanks for bringing this up, Wes. > > On 25 February 2017 at 14:18, Wes McKinney <wesmck...@gmail.com> wrote: > >> Dear Apache Kudu an

Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

2017-02-27 Thread Wes McKinney
sharing code, we should figure out how exactly we'll manage the >> cases where we want to make some change in a common library that breaks an >> API used by other projects, given there's no way to make an atomic commit >> across many repositories. One option is that each "user"

Re: [VOTE] Release Apache Parquet C++ 1.0.0 RC3

2017-02-26 Thread Wes McKinney
.9.4 and 5.4.0 on OSX. > Looks like the option '-stdlib=libc++' works only with Clang. > > On Sun, Feb 26, 2017 at 9:34 AM, Wes McKinney <wesmck...@gmail.com> wrote: > >> @Deepak: which version of XCode is the clang 3.6.0 from? I'd like to look >> into it >> >

Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

2017-02-26 Thread Wes McKinney
ome (most) of it be added to APR <https://apr.apache.org/>? > > On Sun, Feb 26, 2017 at 8:12 PM, Wes McKinney <wesmck...@gmail.com> wrote: > >> hi Henry, >> >> Thank you for these comments. >> >> I think having a kind of "Apache Commons for [Modern] C++"

Re: Arrow cpp travis-ci build broken

2016-09-06 Thread Wes McKinney
hi Julien, I'm very sorry about the inconvenience with this and the delay in getting it sorted out. I will triage this evening by disabling the Parquet tests in Arrow until we get the current problems under control. When we re-enable the Parquet tests in Travis CI I agree we should pin the

Re: Arrow-Parquet integration location (Was: Arrow cpp travis-ci build broken)

2016-09-06 Thread Wes McKinney
he other way around. >> Arrow provides the API and each storage layer (Parquet, Kudu, Cassandra, >> ...) provides a way to produce Arrow Record Batches. >> thoughts? >> >>> On Tue, Sep 6, 2016 at 3:37 PM, Wes McKinney <wesmck...@gmail.com> wrote: >>>

Re: Python Parquet package

2016-09-21 Thread Wes McKinney
I don't agree with this approach right now. Here are my reasons: 1. The Parquet Python integration will need to depend both on PyArrow and the Arrow C++ libraries, so these libraries would generally need to be developed together 2. PyArrow would need to define and maintain a C++ or Cython API so

Re: Python Parquet package

2016-09-21 Thread Wes McKinney
for me. I will then continue to implement the missing > interfaces for Parquet in pyarrow.parquet. > > @wesm Can you take care that we easily depend on a pinned version of > parquet-cpp in pyarrow’s travis builds? > > Uwe > >> On 21.09.2016 at 20:07, Wes McKinney

Re: Next sync

2016-09-23 Thread Wes McKinney
+1 On Thu, Sep 22, 2016 at 8:18 PM, Julien Le Dem wrote: > The sync next week collides with strata Conf in NY. > I propose to move it to the following week. > > > -- > Julien

Re: Parquet Format Change for Statistics Ordering

2016-08-26 Thread Wes McKinney
The type of comparison used here strikes me as dependent on the ConvertedType of the column. Adding explicit signed/unsigned min/max of course gives you both options after the fact. So another option is (if I'm understanding correctly) to change parquet-mr's BYTE_ARRAY comparison used for UTF8

Re: Travis CI has been failing

2016-08-29 Thread Wes McKinney
Since googlecode project hosting seems to have completely shut down (they had claimed that these downloads would be available the "rest of 2016"), you can use the download links from GitHub: https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.bz2 cf

Re: Preparing for parquet-cpp 0.1

2016-11-08 Thread Wes McKinney
I think we are ready to make a release once PARQUET-702 is merged. Is there any more licensing / NOTICE review work to do? On Fri, Nov 4, 2016 at 10:29 AM, Deepak Majeti wrote: > I would like to get PARQUET-764 and PARQUET-702 into the release as > well. Both of them

Re: Parquet sync-up starting now

2016-10-28 Thread Wes McKinney
Same. Thanks On Fri, Oct 28, 2016 at 2:36 PM, Deepak Majeti wrote: > Julien, > > Can you please add me to the calendar invite for the sync-up meetings ? > Thanks. > > On Thu, Oct 27, 2016 at 2:33 PM, Julien Le Dem wrote: >> Attendees/Agenda >> Julien

Re: [Draft report] Apache Parquet

2016-10-13 Thread Wes McKinney
gt;>> The parquet-cpp repo has reached a stable state and should release soon. >>> Integration with arrow-cpp is now in the parquet-cpp repo. >>> >>> ## Health report: >>> The PMC and committer list are growing. Discussion is happening on the >>

Re: parquet-cpp problem with Thrift library

2017-01-11 Thread Wes McKinney
hi James, You have to pass "-fPIC" in your $CXXFLAGS when you are building Thrift. See how we have things set up in our external project https://github.com/apache/parquet-cpp/blob/master/cmake_modules/ThirdpartyToolchain.cmake#L54 As an example for Thrift 0.9.3 (which uses CMake now instead of

Re: [PARQUET_CPP] Reading consecutive columns is inefficient

2016-12-20 Thread Wes McKinney
hi Keith, It seems perfectly reasonable to add configurable read buffering, or an option to buffer the entire row group if your environment permits it. Can you create a JIRA about this? We would welcome contributions around IO tuning for different hardware / network environments. Note that in

IO and memory management in parquet-cpp and Arrow (C++)

2016-12-23 Thread Wes McKinney
hi folks, Spurred by the discussion and bugfix for PARQUET-799, I'd like to do something about the IO interfaces that we currently have implemented in parquet-cpp. For C++ at least, the Parquet project is not an ideal place to be maintaining cross-platform IO and memory management. There are

parquet-cpp 1.0.0 binaries available from conda-forge

2017-03-25 Thread Wes McKinney
These are available now (Thanks Uwe!): conda install parquet-cpp -c conda-forge Support for date and time types is incomplete in the Arrow adapter -- after Arrow 0.3 comes out we'll want to push for a 1.1.0 release including more complete support. If anyone reading has the skills and time to

Re: Failing C Parquet Writer

2017-03-16 Thread Wes McKinney
/master/be/src/exec/hdfs-parquet-scanner.h#L78 On Thu, Mar 16, 2017 at 3:51 PM, Wes McKinney <wesmck...@gmail.com> wrote: > hi Grant, > > The value [1, 2, 3] is only 1 value, not 3. The "Number of rows" > passed to the row group is with respect to top level records, *not

Re: Failing C Parquet Writer

2017-03-16 Thread Wes McKinney
hi Grant, The value [1, 2, 3] is only 1 value, not 3. The "Number of rows" passed to the row group is with respect to top level records, *not* counting repeated fields. From https://blog.twitter.com/2013/dremel-made-simple-with-parquet, I believe the correct data to write is: rep level | def
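
The row-counting rule in this reply can be sketched in Python. This is only an illustration of Dremel-style repetition levels for a single repeated field, not parquet-cpp code; the function name count_rows and the sample level lists are assumptions made for the example. A repetition level of 0 marks the start of a new top-level record, so one record holding the list [1, 2, 3] contributes three leaf values but only one row.

```python
def count_rows(rep_levels):
    """Count top-level records: each repetition level of 0 starts a new one."""
    return sum(1 for level in rep_levels if level == 0)


# One record whose repeated field holds [1, 2, 3]:
# the first element starts the record (0), the rest continue the list (1).
single_record = [0, 1, 1]

# Two records, [1, 2, 3] and [4, 5].
two_records = [0, 1, 1, 0, 1]

print(count_rows(single_record))  # 1
print(count_rows(two_records))    # 2
```

Under this reading, the number of rows declared for the row group should match the count of zeros in the repetition levels, not the number of leaf values written.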

Re: [VOTE] Release Apache Parquet C++ 1.0.0 RC5

2017-03-14 Thread Wes McKinney
hing >> looks good. >> >> On Mon, Mar 13, 2017 at 3:10 PM, Wes McKinney <wesmck...@gmail.com> wrote: >> >> > This is in the README >> > >> > "The test suite relies on an environment variable PARQUET_TEST_DATA >> > pointing to the

Re: Marking categorical data in Parquet schemas

2017-04-06 Thread Wes McKinney
hi Uwe, Thanks for bringing this up. I have a somewhat different opinion, which is that I don't think categorical metadata belongs _formally_ in the Parquet format. The reason is that database systems generally address storage of categorical data using fact and dimension tables -- if you store

Re: Day of Sync-up

2017-03-08 Thread Wes McKinney
usy on Mondays and Tuesdays, the rest of the week is fine by me. >> >> >> >> >> >> Zoltan >> >> >> >> >> >> On Mon, Feb 27, 2017 at 8:28 AM Uwe L. Korn <uw...@xhochy.com> >> wrote: >> >> >> >> >> >>

Re: Failing C++ Parquet Writer

2017-03-13 Thread Wes McKinney
6c9afc/src/parquet/column/writer.cc#L337 > Now the Parquet Writer destructor tries to write close the file and > encounters https://github.com/apache/parquet-cpp/blob/5e59bc5c6491a7505 > 585c08fd62aa52f9a6c9afc/src/parquet/column/writer.cc#L159 > > > On Mon, Mar 1

Re: [VOTE] Release Apache Parquet C++ 1.0.0 RC5

2017-03-13 Thread Wes McKinney
> I’m doing wrong, my vote is +0. > > rb > > > On Mon, Mar 13, 2017 at 2:32 PM, Ryan Blue <rb...@netflix.com> wrote: > >> Will do, sorry for the delay. >> >> On Mon, Mar 13, 2017 at 2:31 PM, Wes McKinney <wesmck...@gmail.com> wrote: >> >>

Re: [VOTE] Release Apache Parquet C++ 1.0.0 RC5

2017-03-13 Thread Wes McKinney
hi Uwe, Thank you for making the release candidate. I have * Built and run the unit tests (Ubuntu 14.04, gcc 4.8.5) * Verified the MD5 signature * Verified the GPG signature My vote: +1 (binding) @Ryan or @Julien, since we're running a bit short on the voting window would you mind taking a

Re: Failing C++ Parquet Writer

2017-03-13 Thread Wes McKinney
hi Grant, the exception is coming from if (num_rows_ != expected_rows_) { throw ParquetException( "Less than the number of expected rows written in" " the current column chunk"); }

Re: Failing C++ Parquet Writer

2017-03-13 Thread Wes McKinney
See https://issues.apache.org/jira/browse/PARQUET-914 On Mon, Mar 13, 2017 at 6:01 PM, Wes McKinney <wesmck...@gmail.com> wrote: > hi Grant, > > the exception is coming from > > if (num_rows_ != expected_rows_) { > throw ParquetException( > "Less

Re: [PARQUET-CPP] Does Parquet cpp have a record reader interface?

2017-04-05 Thread Wes McKinney
hi Keith -- we have focused so far on columnar reads (i.e. Arrow) vs. row/record reads. We would welcome contributions to add a record reader interface Thanks Wes On Tue, Apr 4, 2017 at 8:21 PM, Keith Chapman wrote: > Hi, > > I'm trying to read a parquet file which has

Re: [Proposal] Merge parquet-mr and parquet-format repos

2017-08-02 Thread Wes McKinney
+1. In doing so we may want to rename the repository to apache/parquet to reflect the expanded scope. We could also discuss merging in the C++ implementation, though the main reservation I would have would be version numbers as we will likely be releasing parquet-cpp more frequently than

Re: [Proposal] Merge parquet-mr and parquet-format repos

2017-08-03 Thread Wes McKinney
the >> > Java >> > and C++ interoperability. Currently, Java treats parquet files written by >> > C++ differently. >> > >> > On Wed, Aug 2, 2017 at 7:59 PM, Wes McKinney <wesmck...@gmail.com> >> wrote: >> > >> > > +1. In do

Re: parqet-cpp

2017-08-15 Thread Wes McKinney
hi Joerg, Our developer community did not author that tutorial -- I recommend following the documentation in the Arrow and Parquet codebases; if the documentation is inaccurate or incomplete, we should work together to improve it. It looks like you may have -DPARQUET_BOOST_USE_SHARED=OFF set

Re: Jira access

2017-07-14 Thread Wes McKinney
hi Anna -- I just added you to the Contributor list (your apache.org login), so you should be able to assign issues now. - Wes On Fri, Jul 14, 2017 at 6:36 PM, Anna Szonyi wrote: > Hi, > > Could I get access to the parquet project Jira? I'd like to assign a few > newbie

Re: metadata reading

2017-07-25 Thread Wes McKinney
hi Mike, You can use import pyarrow.parquet as pq pf = pq.ParquetFile(path) pf.metadata or pf.schema This does not read the whole file, only the metadata. Note that we have a function write_metadata: https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L777 It would be nice

Re: subselecting rows

2017-07-25 Thread Wes McKinney
This is not easy to do right now while the file is being read (rather, ex-post), but you are welcome to look at extending the Parquet read API to support selecting a particular row subset. - Wes On Tue, Jul 25, 2017 at 4:10 PM, Katelman, Michael wrote: >

Re: metadata reading

2017-07-26 Thread Wes McKinney
adata had through pq.ParquetFile(path).metadata (or > .schema) include user metadata? I only see num rows, num row groups, column > names and types. Maybe I'm not looking in the right place. > > -Mike > > -Original Message- > From: Wes McKinney [mailto:wesmck...@gmail.com]

Re: metadata reading

2017-07-26 Thread Wes McKinney
metadata. I should be able to add it myself as well. > > -Mike > > -Original Message- > From: Wes McKinney [mailto:wesmck...@gmail.com] > Sent: Wednesday, July 26, 2017 9:34 > To: dev@parquet.apache.org > Subject: Re: metadata reading > > The Arrow use

Re: Converting signed physical types to unsigned logical types

2017-07-26 Thread Wes McKinney
s on more datatypes? Would adding more compressible physical_types > even be useful? > > Felipe > > > > On Wed, Jul 26, 2017 at 1:31 PM, Wes McKinney <wesmck...@gmail.com> wrote: > >> We are using std::copy to cast the values on the write side (from >>

Re: Converting signed physical types to unsigned logical types

2017-07-26 Thread Wes McKinney
hi Felipe, In C++ it is the equivalent of uint64_t val = ...; int64_t encoded_val = *reinterpret_cast<int64_t*>(&val); So no alteration of the bit pattern - Wes On Wed, Jul 26, 2017 at 12:18 PM, Felipe Aramburu wrote: >
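
The same idea can be sketched in Python with the standard struct module. This is an illustration only, not the parquet-cpp code path; the helper names are made up for the example. The unsigned value is packed into 8 bytes and those same bytes are read back as a signed 64-bit integer, so the bit pattern is untouched and only the interpretation changes.

```python
import struct


def reinterpret_u64_as_i64(val):
    """Read the 8 bytes of an unsigned 64-bit value as a signed 64-bit value."""
    return struct.unpack("<q", struct.pack("<Q", val))[0]


def reinterpret_i64_as_u64(val):
    """Inverse direction: read the 8 bytes of a signed value as unsigned."""
    return struct.unpack("<Q", struct.pack("<q", val))[0]


largest = 2**64 - 1                        # all 64 bits set
encoded = reinterpret_u64_as_i64(largest)
print(encoded)                             # -1
print(reinterpret_i64_as_u64(encoded) == largest)  # True
```

Values below 2**63 come back unchanged; larger ones appear negative in the signed physical type, and a reader that knows the column is logically unsigned can apply the inverse step.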

Fwd: Github's disappearing mirrors

2017-04-28 Thread Wes McKinney
-- Forwarded message -- From: Chris Lambertus Date: Fri, Apr 28, 2017 at 3:22 PM Subject: Github's disappearing mirrors To: committers Hello committers, We have received quite a few reports of github mirrors gone missing. We’ve tracked

Re: writing tables with dictionary arrays

2017-07-30 Thread Wes McKinney
hi Mike No, it's a TODO: https://issues.apache.org/jira/browse/PARQUET-929 - Wes On Sun, Jul 30, 2017 at 11:00 AM, Katelman, Michael wrote: > Hi, > > I was trying to write out a really long column of strings where it makes > sense to use a dictionary

Re: why line by line

2017-08-07 Thread Wes McKinney
hi Joerg, It sounds like you are referring to the record-based writer API that's found in parquet-mr, which was originally designed for use in Hadoop MapReduce (if I understand correctly). There is no requirement to write Parquet files in this fashion. The Parquet C++ writer and reader API

Re: [VOTE] Release Apache Parquet C++ 1.1.0 RC0

2017-05-17 Thread Wes McKinney
+1 (binding) * Verified signature * Build from minimal env and run unit tests on Linux (Ubuntu 14.04), built against Arrow 0.4.0 RC0 and ran Python unit tests * Built RC with Visual Studio 2015 against Apache Arrow 0.4.0 rc0, built Python extension and ran unit tests. The Visual Studio build is

Re: representing NA values

2017-05-15 Thread Wes McKinney
hi Mike, I think you want to use WriteBatch on TypedColumnWriter: https://github.com/apache/parquet-cpp/blob/master/src/parquet/column/writer.h#L166 For a flat table with an optional repetition type, the definition levels are a sequence of 1's and 0's, where 1 is for non-null values. The array
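
A rough Python sketch of the definition levels described here, for a flat optional column. It assumes that the values handed to the writer are the non-null entries only, packed densely; the function name to_definition_levels is hypothetical and this is not the parquet-cpp API.

```python
def to_definition_levels(column):
    """Split a column containing None into definition levels and dense values.

    A level of 1 means the slot holds a value, 0 means it is null.
    """
    def_levels = [0 if v is None else 1 for v in column]
    values = [v for v in column if v is not None]
    return def_levels, values


def_levels, values = to_definition_levels([10, None, 30, None])
print(def_levels)  # [1, 0, 1, 0]
print(values)      # [10, 30]
```

The length of def_levels equals the number of rows, while values is shorter whenever the column contains nulls.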

Re: parquet::SetArrayBit

2017-06-26 Thread Wes McKinney
Seems that function could use some documentation. It is not intended to be able to clear bits, but rather to set a bit to 1 only if is_set is true. Another way would be if (is_set) { bits[i / 8] |= 1 << (i % 8); } In theory the branch-free version may be faster, but I have not run any
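
The two variants mentioned in this reply can be compared with a small Python sketch. The branching form follows the snippet quoted above; the branch-free form is an assumption about what such a function might look like, since the message does not quote the parquet-cpp source. Neither version can clear a bit that is already set.

```python
def set_array_bit_branching(bits, i, is_set):
    """Set bit i to 1 only when is_set is true."""
    if is_set:
        bits[i // 8] |= 1 << (i % 8)


def set_array_bit_branch_free(bits, i, is_set):
    """Same effect without a branch: OR in either the bit or zero."""
    bits[i // 8] |= int(is_set) << (i % 8)


a = bytearray(2)
b = bytearray(2)
for setter, bits in ((set_array_bit_branching, a), (set_array_bit_branch_free, b)):
    setter(bits, 3, True)
    setter(bits, 9, True)
    setter(bits, 3, False)  # does not clear bit 3

print(a == b)               # True
print(list(a))              # [8, 2]
```

Whether the branch-free form is actually faster would have to be measured, as the reply itself notes.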

Re: [VOTE] Release Apache Parquet C++ 1.1.0 RC0

2017-05-18 Thread Wes McKinney
sted using > `./dev/release/verify-release-candidate 1.1.0 0` on macOS > > I had to clean up an old version of arrow in /usr/local : > > /usr/local//lib/pkgconfig/arrow.pc > > /usr/local/include/arrow > > > > On Wed, May 17, 2017 at 10:03 PM, Wes McKinney <wesmck.

Re: [RESULT][VOTE] Release Apache Parquet C++ 1.1.0 RC2

2017-05-22 Thread Wes McKinney
st on macos > > > > On Fri, May 19, 2017 at 9:15 AM, Ryan Blue <rb...@netflix.com.invalid> > > wrote: > > > > > +1 (binding) > > > > > > * Checked signatures, checksums > > > * Built on Ubuntu 16.04 LTS > > > * Ran unit tests

Re: [VOTE] Release Apache Parquet C++ 1.1.0 RC2

2017-05-19 Thread Wes McKinney
+1 (binding) * Verified the blocker PARQUET-995 has been fixed * Ran unit tests on Linux + Arrow/Python integration * Ran unit tests on Windows/Visual Studio 2015 On Thu, May 18, 2017 at 4:09 PM, Uwe L. Korn wrote: > +1 (binding) > > Build tests on Linux and macOS and verified

Re: Writing numpy arrays on disk using pyarrow-parquet

2017-06-07 Thread Wes McKinney
hi Vaishal, I already replied to you about this on the mailing list on June 1, can you reply to that thread? I see that you opened ARROW-1097 about the tensor issue. If you could add a standalone reproduction of the problem that would help us debug it and fix faster Thanks Wes On Wed, Jun 7,

Re: Store numpy arrays in parquet format

2017-06-01 Thread Wes McKinney
hi Vaishal, You can certainly use NumPy arrays to create Parquet files, but you will have to do a bit of work to adapt the NumPy arrays to Parquet's (and Arrow's) columnar data model. pandas DataFrame contains NumPy arrays internally. import pyarrow as pa import pyarrow.parquet as pq import

Re: Documentation on Relationship between Logical and Physical Types

2017-06-08 Thread Wes McKinney
hi Felipe, Yes, that's right. For primitive types it is typical for the LogicalType to be not set in the Thrift metadata. The particular integer logical types were added relatively late to the Parquet format and are not used in all implementations (for example, some databases like Hive and Impala

Re: Parquet-cpp build on GCC4.8?

2017-06-09 Thread Wes McKinney
hi Young, It looks like your Boost was compiled with a different version of gcc. If you're targeting gcc 4.8 you need to compile all the dependencies with the same compiler, otherwise you will have a conflict with the libstdc++ ABI. Redhat provides the devtoolset which helps with deploying on a

Re: Next Parquet C++ release

2017-05-04 Thread Wes McKinney
I would like to have MSVC / Windows support in 1.1.0, I will add a blocker. If it isn't done by next week sometime we can move forward with the RC. On Wed, May 3, 2017 at 2:21 PM Uwe L. Korn wrote: > Hello Parquet devs, > > as Apache Arrow 1.1.0 comes close to a release, it is

Re: unit tests fail on master and apache-parquet-cpp-1.1.0 ?

2017-06-06 Thread Wes McKinney
I opened https://issues.apache.org/jira/browse/PARQUET-1021 about adding a more helpful failure message, though you would need to rerun with ctest -VV in order to see any error output (unless you run the unit test executables directly) On Tue, Jun 6, 2017 at 2:22 PM, Artem

Parquet C++ 1.3.0 release

2017-09-19 Thread Wes McKinney
hi all, We have one last patch PARQUET-1037 pending, but the only other thing in progress is support for Arrow decimal read/write. While I would like to see decimal support go into 1.3.0, there are a number of open questions being discussed on the Arrow mailing list:

Re: Key-Value metadata at RowGroup level in C++ parquet library

2017-09-20 Thread Wes McKinney
hi Rahul, the key value metadata is only supported at the file/schema level and at the column chunk (i.e. each column in a row group) level: https://github.com/apache/parquet-cpp/blob/master/src/parquet/parquet.thrift#L530 We should add an accessor for the column chunk key-value metadata to
