[jira] [Commented] (PARQUET-1475) DirectCodecFactory's ParquetCompressionCodecException drops a passed in cause in one constructor

2018-12-11 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718426#comment-16718426 ] ASF GitHub Bot commented on PARQUET-1475: jacques-n opened a new pull request #564:

[jira] [Updated] (PARQUET-1475) DirectCodecFactory's ParquetCompressionCodecException drops a passed in cause in one constructor

2018-12-11 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated PARQUET-1475: Labels: pull-request-available (was: ) > DirectCodecFactory's

Arrow read write support on Java

2018-12-11 Thread Yurui Zhou
Hello, I just learned that Arrow now provides a native reader/writer implementation in C++ that lets users read a Parquet file directly into an Arrow buffer and write a Parquet file from an Arrow buffer. Is there any plan to add the same support on the Java side? I found an implementation

[jira] [Created] (PARQUET-1475) DirectCodecFactory's ParquetCompressionCodecException drops a passed in cause in one constructor

2018-12-11 Thread Jacques Nadeau (JIRA)
Jacques Nadeau created PARQUET-1475: Summary: DirectCodecFactory's ParquetCompressionCodecException drops a passed in cause in one constructor Key: PARQUET-1475 URL:
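
A minimal sketch of the kind of bug the issue title describes, assuming an exception class with a (message, cause) constructor. The class names below are hypothetical and this is not the parquet-mr source; which constructor of ParquetCompressionCodecException is affected is not stated in this listing.

```java
import java.io.IOException;

// Hypothetical example class: accepts a cause but never forwards it.
class DroppingCauseException extends RuntimeException {
    DroppingCauseException(String message, Throwable cause) {
        super(message); // the cause is silently discarded here
    }
}

// Hypothetical fixed variant: forwards the cause to the superclass.
class KeepingCauseException extends RuntimeException {
    KeepingCauseException(String message, Throwable cause) {
        super(message, cause);
    }
}

class CauseDemo {
    public static void main(String[] args) {
        Throwable root = new IOException("disk full");
        // The dropped cause is lost, so the stack trace has no "Caused by" section.
        System.out.println(new DroppingCauseException("compress failed", root).getCause()); // null
        // The forwarded cause is preserved and can be inspected by callers.
        System.out.println(new KeepingCauseException("compress failed", root).getCause()); // java.io.IOException: disk full
    }
}
```

The practical effect of a dropped cause is that the original failure never shows up in logs, which makes the wrapped error much harder to diagnose.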

Re: [Discuss] Code of conduct

2018-12-11 Thread Ryan Blue
+1 On Tue, Dec 11, 2018 at 4:14 PM Julien Le Dem wrote: > strangely enough I was unaware of the apache CoC which has been around for > a while. > How about we add a CODE_OF_CONDUCT.md at the root of the repo pointing to > the apache CoC? > It seems to be the place people would look at first. >

Re: [Discuss] Code of conduct

2018-12-11 Thread Julien Le Dem
Strangely enough, I was unaware of the Apache CoC, which has been around for a while. How about we add a CODE_OF_CONDUCT.md at the root of the repo pointing to the Apache CoC? It seems to be the place people would look first. On Sun, Dec 9, 2018 at 8:54 PM Uwe L. Korn wrote: > Hello Julien, >

Re: Regarding Apache Parquet Project

2018-12-11 Thread Arjit Yadav
Thank you guys for the resources. Arjit Yadav On Tue, Dec 11, 2018 at 3:23 AM Nandor Kollar wrote: > Hi Arjit, > > I'd also recommend you to have a look at Parquet website: > https://parquet.apache.org/ > > You can find a couple of old, but

Re: parquet-arrow estimate file size

2018-12-11 Thread Jiayuan Chen
So it seems there is no way to implement such a mechanism using the low-level API? I tried to dump the arrow::Buffer after each row group is completed, but the cut is not clean: pages starting from the second row group become unreadable (the schema is correct, though). If this

RE: parquet-arrow estimate file size

2018-12-11 Thread Lee, David
In my experience and experiments, it is really hard to approximate target sizes. A single Parquet file with a single row group could be 20% larger than a Parquet file with 20 row groups, because if you have a lot of rows with a lot of data variety you can lose dictionary encoding options. I
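
The row-group effect described above can be illustrated with a toy cost model. This is only a sketch under stated assumptions, not Parquet's actual encoder: a column chunk is treated as dictionary-encoded only if its dictionary fits under a size limit, and otherwise every value is stored plain. All names and the limit parameter are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

class DictionarySizeSketch {
    // Estimated bytes for one chunk [from, to) under the toy model.
    static long estimateChunkBytes(long[] values, int from, int to,
                                   int valueBytes, long maxDictBytes) {
        Set<Long> distinct = new HashSet<>();
        for (int i = from; i < to; i++) {
            distinct.add(values[i]);
        }
        long n = to - from;
        long dictBytes = (long) distinct.size() * valueBytes;
        if (dictBytes > maxDictBytes) {
            return n * valueBytes; // dictionary too large: fall back to plain
        }
        // Bits per index: ceil(log2(distinct)), with a minimum of 1 bit.
        int bits = Math.max(1, 32 - Integer.numberOfLeadingZeros(distinct.size() - 1));
        return dictBytes + (n * bits + 7) / 8;
    }

    // Splits the values into equal row groups and sums the chunk estimates.
    static long estimateFileBytes(long[] values, int rowGroups,
                                  int valueBytes, long maxDictBytes) {
        long total = 0;
        int per = values.length / rowGroups;
        for (int g = 0; g < rowGroups; g++) {
            int from = g * per;
            int to = (g == rowGroups - 1) ? values.length : from + per;
            total += estimateChunkBytes(values, from, to, valueBytes, maxDictBytes);
        }
        return total;
    }

    public static void main(String[] args) {
        // Clustered data: many distinct values overall, few within each slice.
        long[] values = new long[20_000];
        for (int i = 0; i < values.length; i++) {
            values[i] = i / 10;
        }
        System.out.println("1 row group:   " + estimateFileBytes(values, 1, 8, 1024));
        System.out.println("20 row groups: " + estimateFileBytes(values, 20, 8, 1024));
    }
}
```

With clustered data like this, the single row group exceeds the toy dictionary limit and falls back to plain storage, while each of the 20 smaller row groups keeps a compact dictionary. The size of the gap depends entirely on the data and on the assumed limit, so it should not be read as a prediction of real file sizes.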

[jira] [Updated] (PARQUET-1474) Less verbose and lower level logging for missing column/offset indexes

2018-12-11 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated PARQUET-1474: Labels: pull-request-available (was: ) > Less verbose and lower level logging for

[jira] [Commented] (PARQUET-1470) Inputstream leakage in ParquetFileWriter.appendFile

2018-12-11 Thread Wes McKinney (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717363#comment-16717363 ] Wes McKinney commented on PARQUET-1470: [~ArnaudL] we all propose changes to the Parquet

[jira] [Commented] (PARQUET-1474) Less verbose and lower level logging for missing column/offset indexes

2018-12-11 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717355#comment-16717355 ] ASF GitHub Bot commented on PARQUET-1474: gszadovszky opened a new pull request #563:

Re: parquet-arrow estimate file size

2018-12-11 Thread Wes McKinney
hi Hatem -- the arrow::FileWriter class doesn't provide any way for you to control or examine the size of files as they are being written. Ideally we would develop an interface to write a sequence of arrow::RecordBatch objects that would automatically move on to a new file once a certain

[jira] [Created] (PARQUET-1474) Less verbose and lower level logging for missing column/offset indexes

2018-12-11 Thread Gabor Szadovszky (JIRA)
Gabor Szadovszky created PARQUET-1474: Summary: Less verbose and lower level logging for missing column/offset indexes Key: PARQUET-1474 URL: https://issues.apache.org/jira/browse/PARQUET-1474

[jira] [Commented] (PARQUET-1470) Inputstream leakage in ParquetFileWriter.appendFile

2018-12-11 Thread Arnaud Linz (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717272#comment-16717272 ] Arnaud Linz commented on PARQUET-1470: I tried, but I'm not a regular committer and don't have

[jira] [Updated] (PARQUET-1472) Dictionary filter fails on FIXED_LEN_BYTE_ARRAY

2018-12-11 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/PARQUET-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated PARQUET-1472: Labels: pull-request-available (was: ) > Dictionary filter fails on

Re: Regarding Apache Parquet Project

2018-12-11 Thread Nandor Kollar
Hi Arjit, I'd also recommend having a look at the Parquet website: https://parquet.apache.org/ You can find a couple of old but great presentations there; I recommend watching them to understand the basics (although Parquet has improved over the years with additional features, the basics

Re: parquet-arrow estimate file size

2018-12-11 Thread Hatem Helal
If I've understood the problem correctly, I think you could use the parquet::arrow::FileWriter: https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.h#L128 The basic pattern is to use an object to manage the FileWriter lifetime, call the WriteTable method for each row group,

Re: Regarding Apache Parquet Project

2018-12-11 Thread Hatem Helal
Hi Arjit, I'm new around here too, but I'm interested to hear what the others on this list have to say. For C++ development, I'd recommend reading through the examples: https://github.com/apache/arrow/tree/master/cpp/examples/parquet and the command-line tools: