Hi James,

I have added some comments inline.

On Tue, Oct 11, 2016 at 5:38 PM, James Pirz <james.p...@gmail.com> wrote:

> Hello,
>
> I am using C++ and I need to convert a corpus of JSON documents, whose
> schema is not fixed/known in advance, into Parquet format for efficient
> processing/storage. I have gone through a number of examples and test-cases
> to get an idea about the best way to do it, however I am still confused. I
> believe I need to use ParquetWriter and ParquetReader and I am basically
> trying to understand:
> 1- Is it really a requirement to use Avro, Thrift or Protobuf for this
> purpose (all the examples seem to use one of them) ? I know the schema info
> needs to be stored in the footer of Parquet files, but does it mean that I
> need to know the schema ahead of time and do I have to use one of those 3
> to store an in-memory representation of my objects or Can I directly feed
> the parsed JSON docs into a ParquetWriter ? (Using Avro, Thrift or Protobuf
> creates extra dependency which I am really trying to avoid).

Parquet is a structured columnar file format, so you will need to specify
the schema in advance. You can feed typed data directly into the C++
ParquetWriter. In your case, you will first need to extract the data and
its type information from the documents using a JSON parser.

> 2- Almost all the examples I found are described in Java. I am using C++
> and I am really looking for an example in that context. I have looked at
> a couple of test-cases under parquet-cpp repo, however I am just wondering
> if a succinct example is available in C++ to get an idea for such a
> conversion.

We are in the process of providing a detailed example of how to write and
read Parquet files using C++. The Jira tracking this is
https://issues.apache.org/jira/browse/PARQUET-702
I expect this Jira to be completed very soon. For now, the test cases are
your best source.

> Any hint or suggestion would be highly appreciated.
>
> Thnx.
> James

--
regards,
Deepak Majeti
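
For the "extract data and type information" step mentioned above, a minimal
sketch could look like the following. It uses only the C++17 standard library
and does not call parquet-cpp at all. JsonScalar, JsonRecord, ColumnType,
ColumnInfo, TypeOf, Merge and InferSchema are hypothetical names made up for
this illustration; the enum values are merely named after Parquet physical
types. It assumes your JSON parser has already produced flat records of scalar
values (nested objects and arrays are left out), and the widening rules
(integer plus double becomes DOUBLE, any other mix falls back to BYTE_ARRAY)
are one possible policy, not something Parquet prescribes.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for one scalar produced by your JSON parser.
using JsonScalar = std::variant<bool, std::int64_t, double, std::string>;
// Hypothetical flat record: field name -> scalar value.
using JsonRecord = std::map<std::string, JsonScalar>;

// Local enum, named after Parquet physical types (not the parquet-cpp enum).
enum class ColumnType { BOOLEAN, INT64, DOUBLE, BYTE_ARRAY };

struct ColumnInfo {
  ColumnType type;   // widest type observed so far for this field
  std::size_t seen;  // number of records that contained this field
};

// Map the active alternative of the variant to a column type.
ColumnType TypeOf(const JsonScalar& value) {
  switch (value.index()) {
    case 0: return ColumnType::BOOLEAN;
    case 1: return ColumnType::INT64;
    case 2: return ColumnType::DOUBLE;
    default: return ColumnType::BYTE_ARRAY;
  }
}

// Widen two observed types into a single column type (assumed policy).
ColumnType Merge(ColumnType a, ColumnType b) {
  if (a == b) return a;
  const bool a_numeric = a == ColumnType::INT64 || a == ColumnType::DOUBLE;
  const bool b_numeric = b == ColumnType::INT64 || b == ColumnType::DOUBLE;
  if (a_numeric && b_numeric) return ColumnType::DOUBLE;
  return ColumnType::BYTE_ARRAY;  // fall back to storing the text form
}

// First pass over the corpus: collect every field name and its merged type.
std::map<std::string, ColumnInfo> InferSchema(
    const std::vector<JsonRecord>& records) {
  std::map<std::string, ColumnInfo> schema;
  for (const auto& record : records) {
    for (const auto& [name, value] : record) {
      const ColumnType observed = TypeOf(value);
      auto it = schema.find(name);
      if (it == schema.end()) {
        schema.emplace(name, ColumnInfo{observed, 1});
      } else {
        it->second.type = Merge(it->second.type, observed);
        ++it->second.seen;
      }
    }
  }
  return schema;
}
```

A field whose `seen` count is smaller than the number of records is missing
from some documents, so it would have to be declared optional in the Parquet
schema. Once the schema is fixed, a second pass over the documents can feed
the typed values to the writer column by column.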