Re: parquet using encoding other than UTF-8

2019-02-05 Thread Manik Singla
I am not the producer of the data, so I cannot control the encoding. I receive
a ByteBuffer together with its encoding.
I can decode the data with the given encoding and convert it to UTF-8 for
storing with Parquet.
I was hoping to avoid that overhead if possible.

Regards
Manik Singla
+91-9996008893
+91-9665639677

"Life doesn't consist in holding good cards but playing those you hold
well."


On Wed, Feb 6, 2019 at 11:37 AM Uwe L. Korn wrote:

> Hello Manik,
>
> this is not possible at the moment. As Parquet is a portable on-disk
> format, we focus on having a single representation for each data type.
> Readers and writers then only have to support these representations, which
> keeps their implementation simpler. Especially as you are the producer but
> not the consumer, even adding a new type would not solve your problem. You
> can really only use a new logical type once it has been implemented in all
> the readers and your consumers have all updated to these reader versions.
>
> As Unicode, and thus UTF-8, supports all characters one can think of, you
> should always be able to convert strings to it. Given that Parquet files
> encode and compress the data anyway afterwards, the conversion adds a bit of
> CPU overhead but should not make a difference in the size and form of the
> data actually stored in the files. Also, I guess that the UTF-16->UTF-8
> conversion costs less CPU than the Parquet compression process.
>
> Did this help you or is there any reason why you really cannot convert
> your data to UTF-8?
>
> Uwe


Re: parquet using encoding other than UTF-8

2019-02-05 Thread Uwe L. Korn
Hello Manik,

this is not possible at the moment. As Parquet is a portable on-disk format, we
focus on having a single representation for each data type. Readers and writers
then only have to support these representations, which keeps their
implementation simpler. Especially as you are the producer but not the
consumer, even adding a new type would not solve your problem. You can really
only use a new logical type once it has been implemented in all the readers and
your consumers have all updated to these reader versions.

As Unicode, and thus UTF-8, supports all characters one can think of, you
should always be able to convert strings to it. Given that Parquet files encode
and compress the data anyway afterwards, the conversion adds a bit of CPU
overhead but should not make a difference in the size and form of the data
actually stored in the files. Also, I guess that the UTF-16->UTF-8 conversion
costs less CPU than the Parquet compression process.

Did this help you or is there any reason why you really cannot convert your 
data to UTF-8?

Uwe
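
For illustration, the conversion discussed in this thread could look roughly
like the following. This is only a minimal sketch using the JDK's
java.nio.charset classes; the class name Utf8Transcoder and the method toUtf8
are hypothetical and not part of any Parquet API. It assumes the producer hands
over a ByteBuffer plus a charset name, and it rejects malformed input instead
of silently replacing characters.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of parquet-mr.
final class Utf8Transcoder {

    /**
     * Decodes {@code input} with the named charset and re-encodes it as UTF-8.
     * The caller's buffer position is left untouched because a duplicate is read.
     */
    public static ByteBuffer toUtf8(ByteBuffer input, String charsetName) {
        Charset source = Charset.forName(charsetName);
        try {
            // Strict decoding: fail on bytes that are invalid in the source charset.
            CharBuffer chars = source.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(input.duplicate());
            return StandardCharsets.UTF_8.encode(chars);
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(
                    "Input is not valid " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // "Gruesse, Parquet" with u-umlaut and sharp s, written as escapes
        // so the source file itself stays ASCII.
        String text = "Gr\u00fc\u00dfe, Parquet";
        ByteBuffer utf16 = ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_16LE));
        ByteBuffer utf8 = toUtf8(utf16, "UTF-16LE");
        System.out.println("UTF-16LE bytes: " + utf16.remaining());
        System.out.println("UTF-8 bytes:    " + utf8.remaining());
    }
}
```

The resulting UTF-8 buffer is what would then be passed to the Parquet writer
for a string column; how that hand-off looks depends on the writer API in use.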

On Wed, Feb 6, 2019, at 6:19 AM, Manik Singla wrote:
> Hi
> 
> I am new to Parquet. I am trying to save strings in UTF-16 or some encoding
> other than UTF-8.
> I am also trying to use an encoding hint when saving a ByteBuffer.
>
> I cannot find a way to use anything other than UTF-8.
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md says
> we can extend primitive types to solve such cases.
>
> Another thing I want to mention is that I am only the producer of the Parquet
> file, not the consumer.
>
> Could you point me to examples I can look into, or tell me what the right
> way would be?
> 
> 
> Regards
> Manik Singla
> +91-9996008893
> +91-9665639677
> 
> "Life doesn't consist in holding good cards but playing those you hold
> well."


parquet using encoding other than UTF-8

2019-02-05 Thread Manik Singla
Hi

I am new to Parquet. I am trying to save strings in UTF-16 or some encoding
other than UTF-8.
I am also trying to use an encoding hint when saving a ByteBuffer.

I cannot find a way to use anything other than UTF-8.
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md says
we can extend primitive types to solve such cases.

Another thing I want to mention is that I am only the producer of the Parquet
file, not the consumer.

Could you point me to examples I can look into, or tell me what the right
way would be?


Regards
Manik Singla
+91-9996008893
+91-9665639677

"Life doesn't consist in holding good cards but playing those you hold
well."


[jira] [Commented] (PARQUET-1523) [C++] Vectorize comparator interface

2019-02-05 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761079#comment-16761079
 ] 

Deepak Majeti commented on PARQUET-1523:


Sure!

> [C++] Vectorize comparator interface
> 
>
> Key: PARQUET-1523
> URL: https://issues.apache.org/jira/browse/PARQUET-1523
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Deepak Majeti
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> The {{parquet::Comparator}} interface yields scalar virtual calls on the 
> innermost loop. In addition to removing the usage of 
> {{PARQUET_TEMPLATE_EXPORT}} as with other recent patches, I propose to 
> refactor to a vector-based comparison to update the minimum and maximum 
> elements in a single virtual call
> cc [~mdeepak] [~xhochy]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PARQUET-1525) [C++] remove dependency on getopt in parquet tools

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1525.
---
Resolution: Fixed

Issue resolved by pull request 3545
[https://github.com/apache/arrow/pull/3545]

> [C++] remove dependency on getopt in parquet tools
> --
>
> Key: PARQUET-1525
> URL: https://issues.apache.org/jira/browse/PARQUET-1525
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Renat Valiullin
>Assignee: Renat Valiullin
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>






[jira] [Moved] (PARQUET-1525) [C++] remove dependency on getopt in parquet tools

2019-02-05 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney moved ARROW-4456 to PARQUET-1525:
--

Fix Version/s: (was: 0.13.0)
   cpp-1.6.0
  Component/s: (was: C++)
   parquet-cpp
 Workflow: patch-available, re-open possible  (was: jira)
  Key: PARQUET-1525  (was: ARROW-4456)
  Project: Parquet  (was: Apache Arrow)







[jira] [Updated] (PARQUET-1497) [Java] Building on OSX fails with OpenJDK 11

2019-02-05 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-1497:

Labels: pull-request-available  (was: )

> [Java] Building on OSX fails with OpenJDK 11
> 
>
> Key: PARQUET-1497
> URL: https://issues.apache.org/jira/browse/PARQUET-1497
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-thrift
>Affects Versions: 1.10.0
>Reporter: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
>
> When trying to build with OpenJDK 11, I get errors due to the Generated 
> annotation not being resolved:
> {code:java}
> [INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ parquet-format-structures ---
> [INFO] Changes detected - recompiling the module!
> [INFO] Compiling 51 source files to /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/classes
> [INFO] -
> [WARNING] COMPILATION WARNING :
> [INFO] -
> [WARNING] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/src/main/java/org/apache/parquet/format/event/Consumers.java: /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/src/main/java/org/apache/parquet/format/event/Consumers.java uses or overrides a deprecated API.
> [WARNING] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/src/main/java/org/apache/parquet/format/event/Consumers.java: Recompile with -Xlint:deprecation for details.
> [INFO] 2 warnings
> [INFO] -
> [INFO] -
> [ERROR] COMPILATION ERROR :
> [INFO] -
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/NanoSeconds.java:[32,24] package javax.annotation does not exist
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/NanoSeconds.java:[37,2] cannot find symbol
>   symbol: class Generated
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/StringType.java:[32,24] package javax.annotation does not exist
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/StringType.java:[40,2] cannot find symbol
>   symbol: class Generated
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/DataPageHeaderV2.java:[32,24] package javax.annotation does not exist
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/DataPageHeaderV2.java:[43,2] cannot find symbol
>   symbol: class Generated
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/Statistics.java:[32,24] package javax.annotation does not exist
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/Statistics.java:[41,2] cannot find symbol
>   symbol: class Generated
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/SortingColumn.java:[32,24] package javax.annotation does not exist
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/SortingColumn.java:[40,2] cannot find symbol
>   symbol: class Generated
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/TimestampType.java:[32,24] package javax.annotation does not exist
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/TimestampType.java:[42,2] cannot find symbol
>   symbol: class Generated
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/TimeUnit.java:[32,24] package javax.annotation does not exist
> [ERROR] /Users/uwe/tmp/apache-parquet-1.11.0/parquet-format-structures/target/generated-sources/thrift/org/apache/parquet/format/MilliSeconds.java:[32,24] package javax.annotation does not exist
> [ERROR] 
> 

[jira] [Commented] (PARQUET-1497) [Java] Building on OSX fails with OpenJDK 11

2019-02-05 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760921#comment-16760921
 ] 

ASF GitHub Bot commented on PARQUET-1497:
-

gszadovszky commented on pull request #604: PARQUET-1497: Add 
javax.annotation-api dependency for JDK >= 9
URL: https://github.com/apache/parquet-mr/pull/604
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (PARQUET-1521) [C++] Do not use "extern template class" with parquet::ColumnWriter

2019-02-05 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned PARQUET-1521:


Assignee: Wes McKinney

> [C++] Do not use "extern template class" with parquet::ColumnWriter
> ---
>
> Key: PARQUET-1521
> URL: https://issues.apache.org/jira/browse/PARQUET-1521
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> As continued cleaning, similar to parquet::TypedColumnReader I will do 
> similar refactoring for parquet::TypedColumnWriter, leaving the current 
> public API for writing columns unchanged





[jira] [Resolved] (PARQUET-1521) [C++] Do not use "extern template class" with parquet::ColumnWriter

2019-02-05 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved PARQUET-1521.
--
Resolution: Fixed

Issue resolved by pull request 3551
[https://github.com/apache/arrow/pull/3551]

> [C++] Do not use "extern template class" with parquet::ColumnWriter
> ---
>
> Key: PARQUET-1521
> URL: https://issues.apache.org/jira/browse/PARQUET-1521
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-1.6.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> As continued cleaning, similar to parquet::TypedColumnReader I will do 
> similar refactoring for parquet::TypedColumnWriter, leaving the current 
> public API for writing columns unchanged





[jira] [Commented] (PARQUET-1523) [C++] Vectorize comparator interface

2019-02-05 Thread Deepak Majeti (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760747#comment-16760747
 ] 

Deepak Majeti commented on PARQUET-1523:


I can work on this. I need to make some changes to the statistics API as well.

> [C++] Vectorize comparator interface
> 
>
> Key: PARQUET-1523
> URL: https://issues.apache.org/jira/browse/PARQUET-1523
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Major
> Fix For: cpp-1.6.0
>
>
> The {{parquet::Comparator}} interface yields scalar virtual calls on the 
> innermost loop. In addition to removing the usage of 
> {{PARQUET_TEMPLATE_EXPORT}} as with other recent patches, I propose to 
> refactor to a vector-based comparison to update the minimum and maximum 
> elements in a single virtual call
> cc [~mdeepak] [~xhochy]





[jira] [Commented] (PARQUET-1524) [C++] remove dependency on getopt in parquet tools

2019-02-05 Thread Micah Kornfield (JIRA)


[ 
https://issues.apache.org/jira/browse/PARQUET-1524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760555#comment-16760555
 ] 

Micah Kornfield commented on PARQUET-1524:
--

Created to replace ARROW-4456

> [C++] remove dependency on getopt in parquet tools
> --
>
> Key: PARQUET-1524
> URL: https://issues.apache.org/jira/browse/PARQUET-1524
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Micah Kornfield
>Priority: Major
>
> This will allow parquet tools to be built and used on windows.





[jira] [Created] (PARQUET-1524) [C++] remove dependency on getopt in parquet tools

2019-02-05 Thread Micah Kornfield (JIRA)
Micah Kornfield created PARQUET-1524:


 Summary: [C++] remove dependency on getopt in parquet tools
 Key: PARQUET-1524
 URL: https://issues.apache.org/jira/browse/PARQUET-1524
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cpp
Reporter: Micah Kornfield


This will allow parquet tools to be built and used on windows.


