[jira] [Commented] (PARQUET-1475) DirectCodecFactory's ParquetCompressionCodecException drops a passed in cause in one constructor

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718426#comment-16718426
 ] 

ASF GitHub Bot commented on PARQUET-1475:
-

jacques-n opened a new pull request #564: PARQUET-1475: Fix lack of cause 
propagation in DirectCodecFactory.ParquetCompressionCodecException.
URL: https://github.com/apache/parquet-mr/pull/564


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> DirectCodecFactory's ParquetCompressionCodecException drops a passed in cause 
> in one constructor
> 
>
> Key: PARQUET-1475
> URL: https://issues.apache.org/jira/browse/PARQUET-1475
> Project: Parquet
>  Issue Type: Bug
>Reporter: Jacques Nadeau
>Priority: Major
>  Labels: pull-request-available
>
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/DirectCodecFactory.java#L521]
>  
> Cause is not actually passed to super.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1475) DirectCodecFactory's ParquetCompressionCodecException drops a passed in cause in one constructor

2018-12-11 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1475:

Labels: pull-request-available  (was: )

> DirectCodecFactory's ParquetCompressionCodecException drops a passed in cause 
> in one constructor
> 
>
> Key: PARQUET-1475
> URL: https://issues.apache.org/jira/browse/PARQUET-1475
> Project: Parquet
>  Issue Type: Bug
>Reporter: Jacques Nadeau
>Priority: Major
>  Labels: pull-request-available
>
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/DirectCodecFactory.java#L521]
>  
> Cause is not actually passed to super.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Arrow read write support on Java

2018-12-11 Thread Yurui Zhou
Hello

I just learned that Arrow now provides a native reader/writer implementation in 
C++ that allows users to read Parquet files directly into Arrow buffers and to 
write Parquet files from Arrow buffers.

I am wondering: is there any plan to provide the same support on the Java side?

I found an implementation in the Dremio codebase that provides the Arrow 
support mentioned above:
https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet

Does the Parquet or Arrow community have any plan to integrate this into the 
Parquet codebase, or to implement a new version from scratch?

Thanks
Yurui



[jira] [Created] (PARQUET-1475) DirectCodecFactory's ParquetCompressionCodecException drops a passed in cause in one constructor

2018-12-11 Thread Jacques Nadeau (JIRA)
Jacques Nadeau created PARQUET-1475:
---

 Summary: DirectCodecFactory's ParquetCompressionCodecException 
drops a passed in cause in one constructor
 Key: PARQUET-1475
 URL: https://issues.apache.org/jira/browse/PARQUET-1475
 Project: Parquet
  Issue Type: Bug
Reporter: Jacques Nadeau


[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/DirectCodecFactory.java#L521]

 

Cause is not actually passed to super.
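
For illustration, a minimal sketch of the kind of fix the pull request proposes. The standalone class below is an assumption for the example (the real class is nested inside DirectCodecFactory); only the idea of forwarding the cause to super matters:

    // Hypothetical standalone sketch, not the actual parquet-mr class.
    public class ParquetCompressionCodecException extends RuntimeException {

      public ParquetCompressionCodecException(String message) {
        super(message);
      }

      public ParquetCompressionCodecException(Throwable cause) {
        super(cause);
      }

      public ParquetCompressionCodecException(String message, Throwable cause) {
        // The reported bug is equivalent to calling super(message) here, which
        // silently drops the cause. Forwarding both arguments keeps the root
        // cause and its stack trace attached to the thrown exception.
        super(message, cause);
      }
    }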



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [Discuss] Code of conduct

2018-12-11 Thread Ryan Blue
+1

On Tue, Dec 11, 2018 at 4:14 PM Julien Le Dem
 wrote:

> Strangely enough, I was unaware of the Apache CoC, which has been around for
> a while.
> How about we add a CODE_OF_CONDUCT.md at the root of the repo pointing to
> the Apache CoC?
> It seems to be the place people would look at first.
>
> On Sun, Dec 9, 2018 at 8:54 PM Uwe L. Korn  wrote:
>
> > Hello Julien,
> >
> > As per ASF guideline
> > https://www.apache.org/foundation/policies/conduct.html applies also to
> > the Apache Parquet channels. Would that be sufficient for you?
> >
> > Cheers
> > Uwe
> >
> > On Sat, Dec 8, 2018, at 2:14 AM, Julien Le Dem wrote:
> > > We currently don’t have an explicit code of conduct. We’ve always
> > > encouraged respectful discussions and as far as I know all discussions
> > have
> > > been that way.
> > > However, I don’t think we should wait for an incident to create the
> need
> > > for an explicit code of conduct. I suggest we adopt the contributor
> > > covenant as it is well aligned with our values as far as I am
> concerned.
> > > I also think that explicitly adopting it will encourage others to do
> the
> > > same in the open source community.
> > > Best
> > > Julien
> >
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: [Discuss] Code of conduct

2018-12-11 Thread Julien Le Dem
Strangely enough, I was unaware of the Apache CoC, which has been around for
a while.
How about we add a CODE_OF_CONDUCT.md at the root of the repo pointing to
the Apache CoC?
It seems to be the place people would look at first.

On Sun, Dec 9, 2018 at 8:54 PM Uwe L. Korn  wrote:

> Hello Julien,
>
> As per ASF guideline
> https://www.apache.org/foundation/policies/conduct.html applies also to
> the Apache Parquet channels. Would that be sufficient for you?
>
> Cheers
> Uwe
>
> On Sat, Dec 8, 2018, at 2:14 AM, Julien Le Dem wrote:
> > We currently don’t have an explicit code of conduct. We’ve always
> > encouraged respectful discussions and as far as I know all discussions
> have
> > been that way.
> > However, I don’t think we should wait for an incident to create the need
> > for an explicit code of conduct. I suggest we adopt the contributor
> > covenant as it is well aligned with our values as far as I am concerned.
> > I also think that explicitly adopting it will encourage others to do the
> > same in the open source community.
> > Best
> > Julien
>


Re: Regarding Apache Parquet Project

2018-12-11 Thread Arjit Yadav
Thank you guys for the resources.

Arjit Yadav
Phone: +91-9503372431
Email: arjit32...@gmail.com



On Tue, Dec 11, 2018 at 3:23 AM Nandor Kollar 
wrote:

> Hi Arjit,
>
> I'd also recommend having a look at the Parquet website:
> https://parquet.apache.org/
>
> You can find a couple of old but great presentations there; I recommend
> watching those to understand the basics (although Parquet has gained
> additional features over the years, the basics didn't change, and you can
> understand them from these presentations). You can also find the links to
> the Git repositories there, and I'd recommend having a look at those as
> well.
>
> If you're interested in the latest ongoing development efforts, have a look
> at the Jira: https://issues.apache.org/jira/projects/PARQUET/ and have a
> look at the open pull requests attached to these Jiras.
>
> Regards,
> Nandor
>
> On Tue, Dec 11, 2018 at 9:41 AM Hatem Helal 
> wrote:
>
> > Hi Arjit,
> >
> > I'm new around here too but interested to hear what the others on this
> > list have to say.  For C++ development, I'd recommend reading through the
> > examples:
> >
> > https://github.com/apache/arrow/tree/master/cpp/examples/parquet
> >
> > and the command-line tools:
> >
> > https://github.com/apache/arrow/tree/master/cpp/tools/parquet
> >
> > Both were helpful for getting up to speed on the main APIs.  I use an IDE
> > (Xcode but doesn't matter which) to debug and step through the code and
> try
> > to understand the internal dependencies.  The setup for Xcode was a bit
> > manual but let me know if there is interest and I can investigate
> > automation so that I can share it with others.
> >
> > Hope this helps,
> >
> > Hatem
> >
> > On 12/11/18, 5:39 AM, "Arjit Yadav"  wrote:
> >
> > Hi all,
> >
> > I am new to this project. While I have used parquet in the past, I want to
> > know how it works internally and look up relevant documentation and code
> > in order to start contributing to the project.
> >
> > Please let me know any available resources in this regard.
> >
> > Regards,
> > Arjit Yadav
> >
> >
> >
>


Re: parquet-arrow estimate file size

2018-12-11 Thread Jiayuan Chen
So it seems there is no way to implement such a mechanism using the low-level 
API? I tried to dump the arrow::Buffer after each row group is completed, but 
it does not look like a clean cut: pages starting from the second row group 
became unreadable (the schema is correct, though).

If this solution does not exist, I will go back to the high-level API that 
uses an in-memory Arrow table.




On Tue, Dec 11, 2018 at 8:17 AM Lee, David  wrote:

> In my experience and experiments it is really hard to approximate target
> sizes. A single parquet file with a single row group could be 20% larger
> than a parquet file with 20 row groups, because if you have a lot of rows
> with a lot of data variety you can lose dictionary encoding options. I
> predetermine my row group sizes by creating them as files and then writing
> them to a single parquet file.
>
> A better approach would probably be to write the row group to a single
> file and once the size exceeds your target size, remove the last row group
> written and start a new file with it, but I don't think there is a method
> to remove a row group right now.
>
> Another option would be to write the row group out as a file object in
> memory to predetermine its size before adding it as a row group in a
> parquet file.
>
>
> -Original Message-
> From: Wes McKinney 
> Sent: Tuesday, December 11, 2018 7:16 AM
> To: Parquet Dev 
> Subject: Re: parquet-arrow estimate file size
>
> hi Hatem -- the arrow::FileWriter class doesn't provide any way for you to
> control or examine the size of files as they are being written.
> Ideally we would develop an interface to write a sequence of
> arrow::RecordBatch objects that would automatically move on to a new file
> once a certain approximate target size has been reached in an existing
> file. There's a number of moving parts that would need to be created to
> make this possible.
>
> - Wes
> On Tue, Dec 11, 2018 at 2:54 AM Hatem Helal 
> wrote:
> >
> > I think if I've understood the problem correctly, you could use the
> > parquet::arrow::FileWriter
> >
> > https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.h#L128
> >
> > The basic pattern is to use an object to manage the FileWriter lifetime,
> call the WriteTable method for each row group, and close it when you are
> done.  My understanding is that each call to WriteTable will append a new
> row group which should allow you to incrementally write an out-of-memory
> dataset.  I realize now that I haven't tested this myself so it would be
> good to double-check this with someone more experienced with the
> parquet-cpp APIs.
> >
> > On 12/11/18, 12:54 AM, "Jiayuan Chen"  wrote:
> >
> > Thanks for the suggestion, will do.
> >
> > Since such high-level API is not yet implemented in the parquet-cpp
> > project, I have to fall back to the newly introduced low-level API,
> > which calculates the Parquet file size as data is added to the column
> > writers. I have another question on that part:
> >
> > Is there any sample code & advice that I can follow to be able to
> > stream the Parquet file on a per-rowgroup basis? In other words, to
> > restrict memory usage but still create a big enough Parquet file, I
> > would like to create relatively small rowgroups in memory using
> > InMemoryOutputStream(), and dump the buffer contents to my external
> > stream after completing each row group, until a big file with several
> > rowgroups is finished. However, my attempt to manipulate the underlying
> > arrow::Buffer has failed, in that the pages starting from the second
> > rowgroup are unreadable.
> >
> > Thanks!
> >
> > On Mon, Dec 10, 2018 at 3:53 PM Wes McKinney 
> wrote:
> >
> > > hi Jiayuan,
> > >
> > > To your question
> > >
> > > > Would this be in the roadmap?
> > >
> > > I doubt there would be any objections to adding this feature to the
> > > Arrow writer API -- please feel free to open a JIRA issue to
> describe
> > > how the API might work in C++. Note there is no formal roadmap in
> this
> > > project.
> > >
> > > - Wes
> > > On Mon, Dec 10, 2018 at 5:31 PM Jiayuan Chen 
> wrote:
> > > >
> > > > Thanks for the Python solution. However, is there a solution in
> C++ that
> > > I
> > > > can create such Parquet file with only in-memory buffer, using
> > > parquet-cpp
> > > > library?
> > > >
> > > > On Mon, Dec 10, 2018 at 3:23 PM Lee, David <
> david@blackrock.com>
> > > wrote:
> > > >
> > > > > Resending.. Somehow I lost some line feeds in the previous
> reply..
> > > > >
> 

RE: parquet-arrow estimate file size

2018-12-11 Thread Lee, David
In my experience and experiments it is really hard to approximate target sizes. 
A single parquet file with a single row group could be 20% larger than a 
parquet file with 20 row groups, because if you have a lot of rows with a lot 
of data variety you can lose dictionary encoding options. I predetermine my row 
group sizes by creating them as files and then writing them to a single parquet 
file.

A better approach would probably be to write the row group to a single file and 
once the size exceeds your target size, remove the last row group written and 
start a new file with it, but I don't think there is a method to remove a row 
group right now.

Another option would be to write the row group out as a file object in memory 
to predetermine its size before adding it as a row group in a parquet file.
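
This thread is about parquet-cpp, but for comparison, the same "start a new file once the target size is exceeded" idea can be sketched on the Java side, assuming parquet-mr's ParquetWriter#getDataSize() is available to approximate the written-plus-buffered size. The writer class, file naming, and the 128 MB target below are illustrative assumptions, not something taken from this thread:

    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.example.ExampleParquetWriter;
    import org.apache.parquet.schema.MessageType;

    public class SizeTargetedWriter {
      private static final long TARGET_SIZE = 128L * 1024 * 1024; // ~128 MiB per file

      public static void write(Iterable<Group> records, MessageType schema) throws Exception {
        int fileNo = 0;
        ParquetWriter<Group> writer = open(schema, fileNo);
        for (Group record : records) {
          writer.write(record);
          // getDataSize() is only an approximation: it includes data still buffered
          // for the current row group, so final file sizes will differ somewhat.
          if (writer.getDataSize() >= TARGET_SIZE) {
            writer.close();
            writer = open(schema, ++fileNo);
          }
        }
        writer.close();
      }

      private static ParquetWriter<Group> open(MessageType schema, int fileNo) throws Exception {
        // ExampleParquetWriter writes Group records; a real application would plug in
        // its own WriteSupport / object model.
        return ExampleParquetWriter.builder(new Path(String.format("part-%04d.parquet", fileNo)))
            .withType(schema)
            .build();
      }
    }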


-Original Message-
From: Wes McKinney  
Sent: Tuesday, December 11, 2018 7:16 AM
To: Parquet Dev 
Subject: Re: parquet-arrow estimate file size


hi Hatem -- the arrow::FileWriter class doesn't provide any way for you to 
control or examine the size of files as they are being written.
Ideally we would develop an interface to write a sequence of arrow::RecordBatch 
objects that would automatically move on to a new file once a certain 
approximate target size has been reached in an existing file. There's a number 
of moving parts that would need to be created to make this possible.

- Wes
On Tue, Dec 11, 2018 at 2:54 AM Hatem Helal  wrote:
>
> I think if I've understood the problem correctly, you could use the 
> parquet::arrow::FileWriter
>
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.h#L128
>
> The basic pattern is to use an object to manage the FileWriter lifetime, call 
> the WriteTable method for each row group, and close it when you are done.  My 
> understanding is that each call to WriteTable will append a new row group 
> which should allow you to incrementally write an out-of-memory dataset.  I 
> realize now that I haven't tested this myself so it would be good to 
> double-check this with someone more experienced with the parquet-cpp APIs.
>
> On 12/11/18, 12:54 AM, "Jiayuan Chen"  wrote:
>
> Thanks for the suggestion, will do.
>
> Since such high-level API is not yet implemented in the parquet-cpp
> project, I have to fall back to the newly introduced low-level API, which
> calculates the Parquet file size as data is added to the column writers. I
> have another question on that part:
>
> Is there any sample code & advice that I can follow to be able to stream
> the Parquet file on a per-rowgroup basis? In other words, to restrict
> memory usage but still create a big enough Parquet file, I would like to
> create relatively small rowgroups in memory using InMemoryOutputStream(),
> and dump the buffer contents to my external stream after completing each
> row group, until a big file with several rowgroups is finished. However, my
> attempt to manipulate the underlying arrow::Buffer has failed, in that the
> pages starting from the second rowgroup are unreadable.
>
> Thanks!
>
> On Mon, Dec 10, 2018 at 3:53 PM Wes McKinney  wrote:
>
> > hi Jiayuan,
> >
> > To your question
> >
> > > Would this be in the roadmap?
> >
> > I doubt there would be any objections to adding this feature to the
> > Arrow writer API -- please feel free to open a JIRA issue to describe
> > how the API might work in C++. Note there is no formal roadmap in this
> > project.
> >
> > - Wes
> > On Mon, Dec 10, 2018 at 5:31 PM Jiayuan Chen  wrote:
> > >
> > > Thanks for the Python solution. However, is there a solution in C++ 
> that
> > I
> > > can create such Parquet file with only in-memory buffer, using
> > parquet-cpp
> > > library?
> > >
> > > On Mon, Dec 10, 2018 at 3:23 PM Lee, David 
> > wrote:
> > >
> > > > Resending.. Somehow I lost some line feeds in the previous reply..
> > > >
> > > > import os
> > > > import pyarrow.parquet as pq
> > > > import glob as glob
> > > >
> > > > max_target_size = 134217728
> > > > target_size = max_target_size * .95
> > > > # Directory where parquet files are saved
> > > > working_directory = '/tmp/test'
> > > > files_dict = dict()
> > > > files = glob.glob(os.path.join(working_directory, "*.parquet"))
> > > > files.sort()
> > > > for file in files:
> > > > files_dict[file] = os.path.getsize(file)
> > > > print("Merging parquet files")
> > > > temp_file = os.path.join(working_directory, "temp.parquet")
> > > > file_no = 0
> > > > for 

[jira] [Updated] (PARQUET-1474) Less verbose and lower level logging for missing column/offset indexes

2018-12-11 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1474:

Labels: pull-request-available  (was: )

> Less verbose and lower level logging for missing column/offset indexes
> --
>
> Key: PARQUET-1474
> URL: https://issues.apache.org/jira/browse/PARQUET-1474
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
>
> Currently, an exception stacktrace is logged at WARN level if an offset index 
> is missing. A WARN-level log also happens if a column index that is required 
> for column-index-based filtering is missing. Both cases are perfectly valid 
> scenarios if the file was written by older libraries (where no column/offset 
> indexes are written at all) or if the sorting order is undefined for the 
> related column type (e.g. INT96).
> These logs shall be kept at INFO level and no stacktrace shall be provided.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1470) Inputstream leakage in ParquetFileWriter.appendFile

2018-12-11 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16717363#comment-16717363
 ] 

Wes McKinney commented on PARQUET-1470:
---

[~ArnaudL] we all propose changes to the Parquet repositories using pull 
requests, not by pushing directly to the repo. Another committer will merge 
your patch if it is accepted.

> Inputstream leakage in ParquetFileWriter.appendFile
> ---
>
> Key: PARQUET-1470
> URL: https://issues.apache.org/jira/browse/PARQUET-1470
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Arnaud Linz
>Priority: Major
>
> Current implementation of ParquetFileWriter.appendFile is:
>  
> {{public void appendFile(InputFile file) throws IOException {}}
> {{    ParquetFileReader.open(file).appendTo(this);}}
> {{ }}}
> This method never closes the InputStream created when the file is opened in 
> the ParquetFileReader constructor.
> This leads, for instance, to TooManyFilesOpened exceptions when large merges 
> are made with the parquet tools.
> Something like
> {{ try (ParquetFileReader reader = ParquetFileReader.open(file)) {}}
> {{    reader.appendTo(this);}}
> {{ }}}
> would be cleaner.
>  
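
For readability, a sketch of the suggested cleanup as it might sit inside ParquetFileWriter, relying on ParquetFileReader implementing Closeable so that try-with-resources closes the underlying input stream even when appendTo throws:

    public void appendFile(InputFile file) throws IOException {
      // try-with-resources closes the reader (and the input stream it opened)
      // whether or not appendTo(...) succeeds.
      try (ParquetFileReader reader = ParquetFileReader.open(file)) {
        reader.appendTo(this);
      }
    }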



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1474) Less verbose and lower level logging for missing column/offset indexes

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16717355#comment-16717355
 ] 

ASF GitHub Bot commented on PARQUET-1474:
-

gszadovszky opened a new pull request #563: PARQUET-1474: Less verbose and 
lower level logging for missing column/offset indexes
URL: https://github.com/apache/parquet-mr/pull/563
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Less verbose and lower level logging for missing column/offset indexes
> --
>
> Key: PARQUET-1474
> URL: https://issues.apache.org/jira/browse/PARQUET-1474
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
>
> Currently, an exception stacktrace is logged at WARN level if an offset index 
> is missing. A WARN-level log also happens if a column index that is required 
> for column-index-based filtering is missing. Both cases are perfectly valid 
> scenarios if the file was written by older libraries (where no column/offset 
> indexes are written at all) or if the sorting order is undefined for the 
> related column type (e.g. INT96).
> These logs shall be kept at INFO level and no stacktrace shall be provided.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: parquet-arrow estimate file size

2018-12-11 Thread Wes McKinney
hi Hatem -- the arrow::FileWriter class doesn't provide any way for
you to control or examine the size of files as they are being written.
Ideally we would develop an interface to write a sequence of
arrow::RecordBatch objects that would automatically move on to a new
file once a certain approximate target size has been reached in an
existing file. There's a number of moving parts that would need to be
created to make this possible.

- Wes
On Tue, Dec 11, 2018 at 2:54 AM Hatem Helal  wrote:
>
> I think if I've understood the problem correctly, you could use the 
> parquet::arrow::FileWriter
>
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.h#L128
>
> The basic pattern is to use an object to manage the FileWriter lifetime, call 
> the WriteTable method for each row group, and close it when you are done.  My 
> understanding is that each call to WriteTable will append a new row group 
> which should allow you to incrementally write an out-of-memory dataset.  I 
> realize now that I haven't tested this myself so it would be good to 
> double-check this with someone more experienced with the parquet-cpp APIs.
>
> On 12/11/18, 12:54 AM, "Jiayuan Chen"  wrote:
>
> Thanks for the suggestion, will do.
>
> Since such high-level API is not yet implemented in the parquet-cpp
> project, I have to fall back to the newly introduced low-level API, which
> calculates the Parquet file size as data is added to the column writers. I
> have another question on that part:
>
> Is there any sample code & advice that I can follow to be able to stream
> the Parquet file on a per-rowgroup basis? In other words, to restrict
> memory usage but still create a big enough Parquet file, I would like to
> create relatively small rowgroups in memory using InMemoryOutputStream(),
> and dump the buffer contents to my external stream after completing each
> row group, until a big file with several rowgroups is finished. However, my
> attempt to manipulate the underlying arrow::Buffer has failed, in that the
> pages starting from the second rowgroup are unreadable.
>
> Thanks!
>
> On Mon, Dec 10, 2018 at 3:53 PM Wes McKinney  wrote:
>
> > hi Jiayuan,
> >
> > To your question
> >
> > > Would this be in the roadmap?
> >
> > I doubt there would be any objections to adding this feature to the
> > Arrow writer API -- please feel free to open a JIRA issue to describe
> > how the API might work in C++. Note there is no formal roadmap in this
> > project.
> >
> > - Wes
> > On Mon, Dec 10, 2018 at 5:31 PM Jiayuan Chen  wrote:
> > >
> > > Thanks for the Python solution. However, is there a solution in C++ 
> that
> > I
> > > can create such Parquet file with only in-memory buffer, using
> > parquet-cpp
> > > library?
> > >
> > > On Mon, Dec 10, 2018 at 3:23 PM Lee, David 
> > wrote:
> > >
> > > > Resending.. Somehow I lost some line feeds in the previous reply..
> > > >
> > > > import os
> > > > import pyarrow.parquet as pq
> > > > import glob as glob
> > > >
> > > > max_target_size = 134217728
> > > > target_size = max_target_size * .95
> > > > # Directory where parquet files are saved
> > > > working_directory = '/tmp/test'
> > > > files_dict = dict()
> > > > files = glob.glob(os.path.join(working_directory, "*.parquet"))
> > > > files.sort()
> > > > for file in files:
> > > > files_dict[file] = os.path.getsize(file)
> > > > print("Merging parquet files")
> > > > temp_file = os.path.join(working_directory, "temp.parquet")
> > > > file_no = 0
> > > > for file in files:
> > > > if file in files_dict:
> > > > file_no = file_no + 1
> > > > file_name = os.path.join(working_directory,
> > str(file_no).zfill(4)
> > > > + ".parquet")
> > > > print("Saving to parquet file " + file_name)
> > > > # Just rename file if the file size is in target range
> > > > if files_dict[file] > target_size:
> > > > del files_dict[file]
> > > > os.rename(file, file_name)
> > > > continue
> > > > merge_list = list()
> > > > file_size = 0
> > > > # Find files to merge together which add up to less than 128
> > megs
> > > > for k, v in files_dict.items():
> > > > if file_size + v <= max_target_size:
> > > > print("Adding file " + k + " to merge list")
> > > > merge_list.append(k)
> > > > file_size = file_size + v
> > > > # Just rename file if there is only one file to merge
> > > > if len(merge_list) == 1:
> > > > del files_dict[merge_list[0]]
> > > > os.rename(merge_list[0], file_name)
> > > >

[jira] [Created] (PARQUET-1474) Less verbose and lower level logging for missing column/offset indexes

2018-12-11 Thread Gabor Szadovszky (JIRA)
Gabor Szadovszky created PARQUET-1474:
-

 Summary: Less verbose and lower level logging for missing 
column/offset indexes
 Key: PARQUET-1474
 URL: https://issues.apache.org/jira/browse/PARQUET-1474
 Project: Parquet
  Issue Type: Improvement
Reporter: Gabor Szadovszky


Currently, an exception stacktrace is logged at WARN level if an offset index is 
missing. A WARN-level log also happens if a column index that is required for 
column-index-based filtering is missing. Both cases are perfectly valid 
scenarios if the file was written by older libraries (where no column/offset 
indexes are written at all) or if the sorting order is undefined for the 
related column type (e.g. INT96).
These logs shall be kept at INFO level and no stacktrace shall be provided.
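
As a rough illustration of the proposed change (SLF4J-style; the class, method, and message text below are invented for the sketch and are not the actual parquet-mr code):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    class IndexLoggingSketch {
      private static final Logger LOG = LoggerFactory.getLogger(IndexLoggingSketch.class);

      void onMissingOffsetIndex(String columnPath, Exception cause) {
        // Current behaviour (roughly): WARN plus the full stacktrace.
        // LOG.warn("Unable to read offset index for column " + columnPath, cause);

        // Proposed behaviour: INFO level with no stacktrace, since files written by
        // older libraries or columns with an undefined sort order (e.g. INT96)
        // legitimately have no column/offset indexes.
        LOG.info("Offset index is missing for column {}; filtering falls back to reading the whole row group",
            columnPath);
      }
    }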



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1470) Inputstream leakage in ParquetFileWriter.appendFile

2018-12-11 Thread Arnaud Linz (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16717272#comment-16717272
 ] 

Arnaud Linz commented on PARQUET-1470:
--

I tried, but I'm not a regular committer and don't have push access to the repo. 
It would be quicker if someone else takes care of it.

> Inputstream leakage in ParquetFileWriter.appendFile
> ---
>
> Key: PARQUET-1470
> URL: https://issues.apache.org/jira/browse/PARQUET-1470
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.0
>Reporter: Arnaud Linz
>Priority: Major
>
> Current implementation of ParquetFileWriter.appendFile is:
>  
> {{public void appendFile(InputFile file) throws IOException {}}
> {{    ParquetFileReader.open(file).appendTo(this);}}
> {{ }}}
> This method never closes the InputStream created when the file is opened in 
> the ParquetFileReader constructor.
> This leads, for instance, to TooManyFilesOpened exceptions when large merges 
> are made with the parquet tools.
> Something like
> {{ try (ParquetFileReader reader = ParquetFileReader.open(file)) {}}
> {{    reader.appendTo(this);}}
> {{ }}}
> would be cleaner.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-1472) Dictionary filter fails on FIXED_LEN_BYTE_ARRAY

2018-12-11 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1472:

Labels: pull-request-available  (was: )

> Dictionary filter fails on FIXED_LEN_BYTE_ARRAY
> ---
>
> Key: PARQUET-1472
> URL: https://issues.apache.org/jira/browse/PARQUET-1472
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
>
> DictionaryFilter does not handle FIXED_LEN_BYTE_ARRAY. Moreover, 
> [DictionaryFilter.expandFilter(ColumnChunkMetaData)|https://github.com/apache/parquet-mr/blob/dc61e510126aaa1a95a46fe39bf1529f394147e9/parquet-hadoop/src/main/java/org/apache/parquet/filter2/dictionarylevel/DictionaryFilter.java#L78]
>  returns an empty map instead of null, therefore the row group might be 
> dropped because the value appears not to be in the dictionary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Regarding Apache Parquet Project

2018-12-11 Thread Nandor Kollar
Hi Arjit,

I'd also recommend having a look at the Parquet website:
https://parquet.apache.org/

You can find a couple of old but great presentations there; I recommend
watching those to understand the basics (although Parquet has gained
additional features over the years, the basics didn't change, and you can
understand them from these presentations). You can also find the links to
the Git repositories there, and I'd recommend having a look at those as
well.

If you're interested in the latest ongoing development efforts, have a look
at the Jira: https://issues.apache.org/jira/projects/PARQUET/ and have a
look at the open pull requests attached to these Jiras.

Regards,
Nandor

On Tue, Dec 11, 2018 at 9:41 AM Hatem Helal 
wrote:

> Hi Arjit,
>
> I'm new around here too but interested to hear what the others on this
> list have to say.  For C++ development, I'd recommend reading through the
> examples:
>
> https://github.com/apache/arrow/tree/master/cpp/examples/parquet
>
> and the command-line tools:
>
> https://github.com/apache/arrow/tree/master/cpp/tools/parquet
>
> Both were helpful for getting up to speed on the main APIs.  I use an IDE
> (Xcode but doesn't matter which) to debug and step through the code and try
> to understand the internal dependencies.  The setup for Xcode was a bit
> manual but let me know if there is interest and I can investigate
> automation so that I can share it with others.
>
> Hope this helps,
>
> Hatem
>
> On 12/11/18, 5:39 AM, "Arjit Yadav"  wrote:
>
> Hi all,
>
> I am new to this project. While I have used parquet in the past, I want to
> know how it works internally and look up relevant documentation and code
> in order to start contributing to the project.
>
> Please let me know any available resources in this regard.
>
> Regards,
> Arjit Yadav
>
>
>


Re: parquet-arrow estimate file size

2018-12-11 Thread Hatem Helal
I think if I've understood the problem correctly, you could use the 
parquet::arrow::FileWriter

https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.h#L128

The basic pattern is to use an object to manage the FileWriter lifetime, call 
the WriteTable method for each row group, and close it when you are done.  My 
understanding is that each call to WriteTable will append a new row group which 
should allow you to incrementally write an out-of-memory dataset.  I realize 
now that I haven't tested this myself so it would be good to double-check this 
with someone more experienced with the parquet-cpp APIs.

On 12/11/18, 12:54 AM, "Jiayuan Chen"  wrote:

Thanks for the suggestion, will do.

Since such high-level API is not yet implemented in the parquet-cpp
project, I have to fall back to the newly introduced low-level API, which
calculates the Parquet file size as data is added to the column writers. I
have another question on that part:

Is there any sample code & advice that I can follow to be able to stream
the Parquet file on a per-rowgroup basis? In other words, to restrict
memory usage but still create a big enough Parquet file, I would like to
create relatively small rowgroups in memory using InMemoryOutputStream(),
and dump the buffer contents to my external stream after completing each
row group, until a big file with several rowgroups is finished. However, my
attempt to manipulate the underlying arrow::Buffer has failed, in that the
pages starting from the second rowgroup are unreadable.

Thanks!

On Mon, Dec 10, 2018 at 3:53 PM Wes McKinney  wrote:

> hi Jiayuan,
>
> To your question
>
> > Would this be in the roadmap?
>
> I doubt there would be any objections to adding this feature to the
> Arrow writer API -- please feel free to open a JIRA issue to describe
> how the API might work in C++. Note there is no formal roadmap in this
> project.
>
> - Wes
> On Mon, Dec 10, 2018 at 5:31 PM Jiayuan Chen  wrote:
> >
> > Thanks for the Python solution. However, is there a solution in C++ that
> I
> > can create such Parquet file with only in-memory buffer, using
> parquet-cpp
> > library?
> >
> > On Mon, Dec 10, 2018 at 3:23 PM Lee, David 
> wrote:
> >
> > > Resending.. Somehow I lost some line feeds in the previous reply..
> > >
> > > import os
> > > import pyarrow.parquet as pq
> > > import glob as glob
> > >
> > > max_target_size = 134217728
> > > target_size = max_target_size * .95
> > > # Directory where parquet files are saved
> > > working_directory = '/tmp/test'
> > > files_dict = dict()
> > > files = glob.glob(os.path.join(working_directory, "*.parquet"))
> > > files.sort()
> > > for file in files:
> > > files_dict[file] = os.path.getsize(file)
> > > print("Merging parquet files")
> > > temp_file = os.path.join(working_directory, "temp.parquet")
> > > file_no = 0
> > > for file in files:
> > > if file in files_dict:
> > > file_no = file_no + 1
> > > file_name = os.path.join(working_directory,
> str(file_no).zfill(4)
> > > + ".parquet")
> > > print("Saving to parquet file " + file_name)
> > > # Just rename file if the file size is in target range
> > > if files_dict[file] > target_size:
> > > del files_dict[file]
> > > os.rename(file, file_name)
> > > continue
> > > merge_list = list()
> > > file_size = 0
> > > # Find files to merge together which add up to less than 128
> megs
> > > for k, v in files_dict.items():
> > > if file_size + v <= max_target_size:
> > > print("Adding file " + k + " to merge list")
> > > merge_list.append(k)
> > > file_size = file_size + v
> > > # Just rename file if there is only one file to merge
> > > if len(merge_list) == 1:
> > > del files_dict[merge_list[0]]
> > > os.rename(merge_list[0], file_name)
> > > continue
> > > # Merge smaller files into one large file. Read row groups 
from
> > > each file and add them to the new file.
> > > schema = pq.read_schema(file)
> > > print("Saving to new parquet file")
> > > writer = pq.ParquetWriter(temp_file, schema=schema,
> > > use_dictionary=True, compression='snappy')
> > > for merge in merge_list:
> > > parquet_file = pq.ParquetFile(merge)
> > > print("Writing " + merge + " to new parquet file")
> > > for i in range(parquet_file.num_row_groups):
> > > 

Re: Regarding Apache Parquet Project

2018-12-11 Thread Hatem Helal
Hi Arjit,

I'm new around here too but interested to hear what the others on this list 
have to say.  For C++ development, I'd recommend reading through the examples:

https://github.com/apache/arrow/tree/master/cpp/examples/parquet

and the command-line tools:

https://github.com/apache/arrow/tree/master/cpp/tools/parquet

Both were helpful for getting up to speed on the main APIs.  I use an IDE 
(Xcode but doesn't matter which) to debug and step through the code and try to 
understand the internal dependencies.  The setup for Xcode was a bit manual but 
let me know if there is interest and I can investigate automation so that I can 
share it with others.

Hope this helps,

Hatem

On 12/11/18, 5:39 AM, "Arjit Yadav"  wrote:

Hi all,

I am new to this project. While I have used parquet in the past, I want to
know how it works internally and look up relevant documentation and code
in order to start contributing to the project.

Please let me know any available resources in this regard.

Regards,
Arjit Yadav