Re: parquet using encoding other than UTF-8
I am not the producer of the data, so I cannot control its encoding. I receive a ByteBuffer and its encoding. I can decode the data with the given encoding and convert it to UTF-8 before storing it with Parquet, but I was hoping to avoid that overhead if possible.

Regards
Manik Singla
+91-9996008893
+91-9665639677

"Life doesn't consist in holding good cards but playing those you hold well."

On Wed, Feb 6, 2019 at 11:37 AM Uwe L. Korn wrote:
> Hello Manik,
>
> this is not possible at the moment. [...]
Re: parquet using encoding other than UTF-8
Hello Manik,

this is not possible at the moment. As Parquet is a portable on-disk format, we focus on having a single representation for each data type. Readers and writers therefore only need to implement these representations, which keeps their implementation simpler. Especially as you are the producer but not the consumer, even adding a new type would not solve your problem. You really can only use a new logical type once it has been implemented in all the readers and your consumers have all updated to these reader versions.

As Unicode, and thus UTF-8, supports all characters one can think of, you should always be able to convert strings to it. Given that Parquet files encode and compress the data afterwards anyway, the conversion adds a bit of CPU overhead but should not make a difference to the size and form of the data actually stored in the files. I would also guess that the UTF-16->UTF-8 conversion costs less CPU than the Parquet compression process.

Did this help you, or is there any reason why you really cannot convert your data to UTF-8?

Uwe

On Wed, Feb 6, 2019, at 6:19 AM, Manik Singla wrote:
> Hi
>
> I am new to Parquet. I am trying to save UTF-16 or some other encoding than UTF-8.
> [...]
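The conversion discussed in this thread (a ByteBuffer plus a known source encoding, re-encoded as UTF-8 before it is handed to a Parquet writer) can be sketched with the JDK alone. This is only a minimal sketch: the class name `Utf8Transcoder` and the method `toUtf8` are hypothetical names chosen for illustration, and the step that actually writes the result to Parquet is deliberately left out.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any Parquet library.
final class Utf8Transcoder {

    /**
     * Re-encodes the remaining bytes of input from sourceCharset to UTF-8.
     * A duplicate of the buffer is decoded, so the caller's position is untouched.
     */
    static ByteBuffer toUtf8(ByteBuffer input, Charset sourceCharset) {
        if (StandardCharsets.UTF_8.equals(sourceCharset)) {
            // Already UTF-8: nothing to transcode, share the same bytes.
            return input.duplicate();
        }
        // Charset.decode substitutes the replacement character for malformed input
        // instead of throwing; use sourceCharset.newDecoder() with
        // CodingErrorAction.REPORT if invalid input should fail loudly.
        return StandardCharsets.UTF_8.encode(sourceCharset.decode(input.duplicate()));
    }

    public static void main(String[] args) {
        ByteBuffer utf16 = ByteBuffer.wrap("h\u00e9llo".getBytes(StandardCharsets.UTF_16LE));
        ByteBuffer utf8 = toUtf8(utf16, StandardCharsets.UTF_16LE);
        System.out.println(utf8.remaining() + " UTF-8 bytes");
    }
}
```

The short-circuit for input that is already UTF-8 means the extra decode/encode pass is only paid for data that actually arrives in another encoding.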
parquet using encoding other than UTF-8
Hi

I am new to Parquet. I am trying to save strings in UTF-16 or some encoding other than UTF-8.
I am also trying to pass an encoding hint when saving a ByteBuffer.

I can't find a way to use anything other than UTF-8.
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md says we can extend primitive types to solve such cases.

I should also mention that I am only the producer of the Parquet file, not the consumer.

Could you point me to examples I can look into, or tell me what the right approach would be?

Regards
Manik Singla
[jira] [Commented] (PARQUET-1523) [C++] Vectorize comparator interface
[ https://issues.apache.org/jira/browse/PARQUET-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761079#comment-16761079 ]

Deepak Majeti commented on PARQUET-1523:

Sure!

> [C++] Vectorize comparator interface
>
> Key: PARQUET-1523
> URL: https://issues.apache.org/jira/browse/PARQUET-1523
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Wes McKinney
> Assignee: Deepak Majeti
> Priority: Major
> Fix For: cpp-1.6.0
>
> The {{parquet::Comparator}} interface yields scalar virtual calls on the innermost loop. In addition to removing the usage of {{PARQUET_TEMPLATE_EXPORT}} as with other recent patches, I propose to refactor to a vector-based comparison to update the minimum and maximum elements in a single virtual call
> cc [~mdeepak] [~xhochy]

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
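The idea in the issue description (one interface call per batch of values instead of one per value) can be illustrated independently of the C++ code base. The sketch below is written in Java purely for illustration; the names `ScalarComparator`, `BatchComparator` and `SignedInt64Comparator` are hypothetical and do not correspond to the actual `parquet::Comparator` API or to the change that was eventually made.

```java
// Illustrative sketch only; all names here are hypothetical.
final class MinMaxSketch {

    /** Scalar style: the caller dispatches through the interface once per value. */
    interface ScalarComparator {
        boolean lessThan(long a, long b);
    }

    /** Batch style: a single interface call processes the whole batch. */
    interface BatchComparator {
        /** Returns {min, max} over values[0..length). */
        long[] minMax(long[] values, int length);
    }

    static final class SignedInt64Comparator implements BatchComparator {
        @Override
        public long[] minMax(long[] values, int length) {
            if (length <= 0 || length > values.length) {
                throw new IllegalArgumentException("length must be in 1..values.length");
            }
            long min = values[0];
            long max = values[0];
            // The tight loop lives inside the implementation, so it contains
            // no interface dispatch and is easier for a compiler to optimize.
            for (int i = 1; i < length; i++) {
                long v = values[i];
                if (v < min) min = v;
                if (v > max) max = v;
            }
            return new long[] {min, max};
        }
    }

    public static void main(String[] args) {
        long[] batch = {3, -1, 7, 2};
        long[] mm = new SignedInt64Comparator().minMax(batch, batch.length);
        System.out.println("min=" + mm[0] + " max=" + mm[1]);
    }
}
```

A statistics collector built this way would merge each batch's result into its running minimum and maximum, so the per-value cost no longer includes a dispatch through the comparator interface.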
[jira] [Resolved] (PARQUET-1525) [C++] remove dependency on getopt in parquet tools
[ https://issues.apache.org/jira/browse/PARQUET-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved PARQUET-1525.

Resolution: Fixed

Issue resolved by pull request 3545
[https://github.com/apache/arrow/pull/3545]

> [C++] remove dependency on getopt in parquet tools
>
> Key: PARQUET-1525
> URL: https://issues.apache.org/jira/browse/PARQUET-1525
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Renat Valiullin
> Assignee: Renat Valiullin
> Priority: Major
> Labels: pull-request-available
> Fix For: cpp-1.6.0
>
> Time Spent: 1.5h
> Remaining Estimate: 0h
[jira] [Moved] (PARQUET-1525) [C++] remove dependency on getopt in parquet tools
[ https://issues.apache.org/jira/browse/PARQUET-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney moved ARROW-4456 to PARQUET-1525:

Fix Version/s: cpp-1.6.0 (was: 0.13.0)
Component/s: parquet-cpp (was: C++)
Workflow: patch-available, re-open possible (was: jira)
Key: PARQUET-1525 (was: ARROW-4456)
Project: Parquet (was: Apache Arrow)

> [C++] remove dependency on getopt in parquet tools
>
> Key: PARQUET-1525
> URL: https://issues.apache.org/jira/browse/PARQUET-1525
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Renat Valiullin
> Assignee: Renat Valiullin
> Priority: Major
> Labels: pull-request-available
> Fix For: cpp-1.6.0
>
> Time Spent: 1.5h
> Remaining Estimate: 0h
[jira] [Updated] (PARQUET-1497) [Java] Building on OSX fails with OpenJDK 11
[ https://issues.apache.org/jira/browse/PARQUET-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated PARQUET-1497:

Labels: pull-request-available (was: )

> [Java] Building on OSX fails with OpenJDK 11
>
> Key: PARQUET-1497
> URL: https://issues.apache.org/jira/browse/PARQUET-1497
> Project: Parquet
> Issue Type: Bug
> Components: parquet-thrift
> Affects Versions: 1.10.0
> Reporter: Uwe L. Korn
> Priority: Major
> Labels: pull-request-available
>
> When trying to build with OpenJDK 11, I get errors due to the Generated annotation not being resolved:
> {code:java}
> [INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ parquet-format-structures ---
> [INFO] Changes detected - recompiling the module!
> [INFO] Compiling 51 source files to /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/classes
> [INFO] -
> [WARNING] COMPILATION WARNING :
> [INFO] -
> [WARNING] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/src/main/java/org/apache/parquet/format/event/Consumers.java: /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/src/main/java/org/apache/parquet/format/event/Consumers.java uses or overrides a deprecated API.
> [WARNING] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/src/main/java/org/apache/parquet/format/event/Consumers.java: Recompile with -Xlint:deprecation for details.
> [INFO] 2 warnings
> [INFO] -
> [INFO] -
> [ERROR] COMPILATION ERROR :
> [INFO] -
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/NanoSeconds.java:[32,24] package javax.annotation does not exist
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/NanoSeconds.java:[37,2] cannot find symbol
>   symbol: class Generated
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/StringType.java:[32,24] package javax.annotation does not exist
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/StringType.java:[40,2] cannot find symbol
>   symbol: class Generated
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/DataPageHeaderV2.java:[32,24] package javax.annotation does not exist
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/DataPageHeaderV2.java:[43,2] cannot find symbol
>   symbol: class Generated
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/Statistics.java:[32,24] package javax.annotation does not exist
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/Statistics.java:[41,2] cannot find symbol
>   symbol: class Generated
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/SortingColumn.java:[32,24] package javax.annotation does not exist
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/SortingColumn.java:[40,2] cannot find symbol
>   symbol: class Generated
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/TimestampType.java:[32,24] package javax.annotation does not exist
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/TimestampType.java:[42,2] cannot find symbol
>   symbol: class Generated
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/TimeUnit.java:[32,24] package javax.annotation does not exist
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/MilliSeconds.java:[32,24] package javax.annotation does not exist
> [ERROR]
[jira] [Commented] (PARQUET-1497) [Java] Building on OSX fails with OpenJDK 11
[ https://issues.apache.org/jira/browse/PARQUET-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760921#comment-16760921 ]

ASF GitHub Bot commented on PARQUET-1497:

gszadovszky commented on pull request #604: PARQUET-1497: Add javax.annotation-api dependency for JDK >= 9
URL: https://github.com/apache/parquet-mr/pull/604

> [Java] Building on OSX fails with OpenJDK 11
>
> Key: PARQUET-1497
> URL: https://issues.apache.org/jira/browse/PARQUET-1497
> Project: Parquet
> Issue Type: Bug
> Components: parquet-thrift
> Affects Versions: 1.10.0
> Reporter: Uwe L. Korn
> Priority: Major
> Labels: pull-request-available
>
> When trying to build with OpenJDK 11, I get errors due to the Generated annotation not being resolved:
> [...]
[jira] [Assigned] (PARQUET-1521) [C++] Do not use "extern template class" with parquet::ColumnWriter
[ https://issues.apache.org/jira/browse/PARQUET-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe L. Korn reassigned PARQUET-1521:

Assignee: Wes McKinney

> [C++] Do not use "extern template class" with parquet::ColumnWriter
>
> Key: PARQUET-1521
> URL: https://issues.apache.org/jira/browse/PARQUET-1521
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Priority: Major
> Labels: pull-request-available
> Fix For: cpp-1.6.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> As continued cleaning, similar to parquet::TypedColumnReader I will do similar refactoring for parquet::TypedColumnWriter, leaving the current public API for writing columns unchanged
[jira] [Resolved] (PARQUET-1521) [C++] Do not use "extern template class" with parquet::ColumnWriter
[ https://issues.apache.org/jira/browse/PARQUET-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe L. Korn resolved PARQUET-1521.

Resolution: Fixed

Issue resolved by pull request 3551
[https://github.com/apache/arrow/pull/3551]

> [C++] Do not use "extern template class" with parquet::ColumnWriter
>
> Key: PARQUET-1521
> URL: https://issues.apache.org/jira/browse/PARQUET-1521
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Wes McKinney
> Priority: Major
> Labels: pull-request-available
> Fix For: cpp-1.6.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> As continued cleaning, similar to parquet::TypedColumnReader I will do similar refactoring for parquet::TypedColumnWriter, leaving the current public API for writing columns unchanged
[jira] [Commented] (PARQUET-1523) [C++] Vectorize comparator interface
[ https://issues.apache.org/jira/browse/PARQUET-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760747#comment-16760747 ]

Deepak Majeti commented on PARQUET-1523:

I can work on this. I need to make some changes to the statistics API as well.

> [C++] Vectorize comparator interface
>
> Key: PARQUET-1523
> URL: https://issues.apache.org/jira/browse/PARQUET-1523
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Wes McKinney
> Priority: Major
> Fix For: cpp-1.6.0
>
> The {{parquet::Comparator}} interface yields scalar virtual calls on the innermost loop. In addition to removing the usage of {{PARQUET_TEMPLATE_EXPORT}} as with other recent patches, I propose to refactor to a vector-based comparison to update the minimum and maximum elements in a single virtual call
> cc [~mdeepak] [~xhochy]
[jira] [Commented] (PARQUET-1524) [C++] remove dependency on getopt in parquet tools
[ https://issues.apache.org/jira/browse/PARQUET-1524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760555#comment-16760555 ]

Micah Kornfield commented on PARQUET-1524:

Created to replace ARROW-4456

> [C++] remove dependency on getopt in parquet tools
>
> Key: PARQUET-1524
> URL: https://issues.apache.org/jira/browse/PARQUET-1524
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Reporter: Micah Kornfield
> Priority: Major
>
> This will allow parquet tools to be built and used on windows.
[jira] [Created] (PARQUET-1524) [C++] remove dependency on getopt in parquet tools
Micah Kornfield created PARQUET-1524:

Summary: [C++] remove dependency on getopt in parquet tools
Key: PARQUET-1524
URL: https://issues.apache.org/jira/browse/PARQUET-1524
Project: Parquet
Issue Type: Bug
Components: parquet-cpp
Reporter: Micah Kornfield

This will allow parquet tools to be built and used on windows.