[GitHub] [parquet-mr] gszadovszky commented on pull request #1014: PARQUET-2227: Refactor several file rewriters to use a new unified ParquetRewriter implementation

2023-01-15 Thread GitBox


gszadovszky commented on PR #1014:
URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1383152341

   I agree that merging the key-value metadata is not an easy question. Let's 
discuss it separately as it is not related to this PR.
   
   I also agree to store the current writer (parquet-mr) in `created_by` in 
case of rewriting. It is not easy to decide what the proper solution would be 
anyway. `created_by` is usually used for handling potentially erroneous writes. 
Let's say there was an issue in parquet-mr version 1.2.3 that wrote a 
specific integer encoding incorrectly (not according to the spec). What if we 
rewrite the file but do not re-encode the pages? Can we still handle the 
original issue? What if the rewriter re-encodes the related pages?
   Let's store the original writer in `original.created.by` for now. Let's 
discuss this topic separately; however, I am not sure we can find a proper 
solution.
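
   To make the `original.created.by` idea concrete, here is a minimal sketch 
of how a rewriter might carry the original writer string over into the 
key-value metadata. It uses plain `java.util` maps rather than the parquet-mr 
API; the class name `RewriteMetadataSketch`, the method `withOriginalWriter`, 
and the choice to keep an already present value are assumptions made for 
illustration, not what the PR implements.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; not part of parquet-mr.
class RewriteMetadataSketch {
    // Key name taken from the comment above.
    static final String ORIGINAL_CREATED_BY_KEY = "original.created.by";

    /**
     * Returns a copy of the input file's key-value metadata with the original
     * writer string recorded under ORIGINAL_CREATED_BY_KEY.
     */
    static Map<String, String> withOriginalWriter(
            Map<String, String> keyValueMetadata, String originalCreatedBy) {
        Map<String, String> result = new HashMap<>(keyValueMetadata);
        if (originalCreatedBy != null && !originalCreatedBy.isEmpty()) {
            // Assumption of this sketch: if the file was already rewritten once,
            // keep the earliest recorded writer instead of overwriting it.
            result.putIfAbsent(ORIGINAL_CREATED_BY_KEY, originalCreatedBy);
        }
        return result;
    }
}
```

   The rewriter would then put its own identifier in `created_by` and write 
the returned map as the footer's key-value metadata. Whether a reader can 
still work around the original writer's bug from this information alone is 
exactly the open question raised above.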


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-mr] gszadovszky commented on pull request #1014: PARQUET-2227: Refactor several file rewriters to use a new unified ParquetRewriter implementation

2023-01-14 Thread GitBox


gszadovszky commented on PR #1014:
URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1382840526

   > I am afraid some implementations may drop characters after `'\n'` when 
displaying the string content. Let me do some investigation.
   
   I do not have a strong preference for `'\n'`; the only requirement is a 
character that probably won't be used by any system writing Parquet files.
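
   As a rough illustration of why the choice of character matters, the sketch 
below assumes the character is meant as a delimiter between several writer 
strings combined into one value (this message does not spell that out). The 
class name `CreatedByJoinSketch` and the method `joinCreatedBy` are 
hypothetical, and `'\n'` is only the candidate under discussion.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; not part of parquet-mr.
class CreatedByJoinSketch {
    // Candidate delimiter from the discussion; still open to change.
    static final String SEPARATOR = "\n";

    /** Joins distinct, non-empty writer strings in first-seen order. */
    static String joinCreatedBy(List<String> createdByValues) {
        Set<String> distinct = new LinkedHashSet<>();
        for (String value : createdByValues) {
            if (value == null || value.isEmpty()) {
                continue;
            }
            // If a writer string already contains the delimiter, the joined
            // value could not be split back unambiguously.
            if (value.contains(SEPARATOR)) {
                throw new IllegalArgumentException(
                        "writer string contains the delimiter");
            }
            distinct.add(value);
        }
        return String.join(SEPARATOR, distinct);
    }
}
```

   The check inside the loop is the reason for wanting a character that no 
writer is likely to emit: any writer string containing it would make the 
combined value ambiguous.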

