[jira] [Commented] (ORC-22) Make buffer block size configurable

2015-07-25 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-22?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641424#comment-14641424
 ] 

Lefty Leverenz commented on ORC-22:
---

Should this be documented?

 Make buffer block size configurable
 ---

 Key: ORC-22
 URL: https://issues.apache.org/jira/browse/ORC-22
 Project: Orc
  Issue Type: Task
Reporter: Aliaksei Sandryhaila
Assignee: Aliaksei Sandryhaila

 The current implementation of the seekable file input stream reads files in 
 256K chunks by default. This parameter should be configurable via 
 ReaderOptions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ORC-52) Add support for mapreduce InputFormat and OutputFormat

2016-06-02 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-52?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15313507#comment-15313507
 ] 

Lefty Leverenz commented on ORC-52:
---

Doc note:  This creates four configuration parameters in OrcConf.java and 
removes one parameter, so they should be documented.  Some general 
documentation would also be good.

* orc.mapred.input.schema
* orc.mapred.map.output.key.schema
* orc.mapred.map.output.value.schema
* orc.mapred.output.schema
* (_removed:_  orc.schema)
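For the documentation, here is a hedged sketch of how these parameters might appear in a Hadoop job configuration file. The schema values below are invented placeholders for illustration, not values taken from this issue; consult OrcConf.java for the actual semantics and defaults:

```xml
<!-- Illustrative only: the struct schemas are placeholder examples. -->
<property>
  <name>orc.mapred.input.schema</name>
  <value>struct&lt;name:string,age:int&gt;</value>
</property>
<property>
  <name>orc.mapred.map.output.key.schema</name>
  <value>string</value>
</property>
<property>
  <name>orc.mapred.map.output.value.schema</name>
  <value>struct&lt;name:string,age:int&gt;</value>
</property>
<property>
  <name>orc.mapred.output.schema</name>
  <value>struct&lt;name:string,age:int&gt;</value>
</property>
```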

How do we want to track doc issues -- with TODOC labels like the Hive 
project?

> Add support for mapreduce InputFormat and OutputFormat
> --
>
> Key: ORC-52
> URL: https://issues.apache.org/jira/browse/ORC-52
> Project: Orc
>  Issue Type: Improvement
>  Components: Java, MapReduce
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Fix For: 1.1.0
>
>
> We have the mapred InputFormat and OutputFormat, but we need one for the 
> newer API.





[jira] [Commented] (ORC-65) Write new documentation for site

2016-06-11 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-65?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326079#comment-15326079
 ] 

Lefty Leverenz commented on ORC-65:
---

Sorry I missed the boat.  I'll review this soonish and open a new jira for typo 
fixes.  (Only found one so far.)

> Write new documentation for site
> 
>
> Key: ORC-65
> URL: https://issues.apache.org/jira/browse/ORC-65
> Project: Orc
>  Issue Type: Improvement
>  Components: site
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Fix For: 1.2.0
>
>
> We need to add additional documentation for the site to discuss how to build, 
> use the APIs, and how to use the tools.





[jira] [Commented] (ORC-46) Reserve CompressionKind values for LZ4 and ZSTD

2016-04-08 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-46?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15233099#comment-15233099
 ] 

Lefty Leverenz commented on ORC-46:
---

Doc note:  Eventually the documentation should be updated.

Now it just says:  "If the ORC file writer selects a generic compression codec 
(zlib or snappy), every part of the ORC file except for the Postscript is 
compressed with that codec."  And the Hive ORC documentation says:  "The codec 
can be Snappy, Zlib, or none."

LZO should also be added.

* [ORC docs -- Compression | http://orc.apache.org/docs/compression.html]
* [Hive ORC doc -- Compression | 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-Compression]

> Reserve CompressionKind values for LZ4 and ZSTD
> ---
>
> Key: ORC-46
> URL: https://issues.apache.org/jira/browse/ORC-46
> Project: Orc
>  Issue Type: Improvement
>Reporter: David Phillips
>Assignee: David Phillips
> Fix For: 1.1.0
>
>
> We are going to start using ORC with LZ4 and ZSTD compression types in Presto 
> (which has its own reader and soon its own writer). Registering these as 
> official CompressionKind values now will avoid future compatibility issues.





[jira] [Commented] (ORC-84) Create a separate java tool module

2016-07-30 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-84?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400538#comment-15400538
 ] 

Lefty Leverenz commented on ORC-84:
---

Does this need to be documented in the wiki?

> Create a separate java tool module
> --
>
> Key: ORC-84
> URL: https://issues.apache.org/jira/browse/ORC-84
> Project: Orc
>  Issue Type: Improvement
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Fix For: 1.2.0
>
>
>  I think it would be nice if we made a tool jar to make it easier to invoke 
> the equivalent of orcfiledump from Hive. In particular, I propose that we 
> create a java/tools directory and move FileDump and JsonFileDump into it. We 
> can also make an orc-tools.jar that is an uber jar and can be invoked easily 
> from the command line.





[jira] [Commented] (ORC-77) Support Snappy, LZO, and LZ4 from aircompressor.

2016-08-09 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-77?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414709#comment-15414709
 ] 

Lefty Leverenz commented on ORC-77:
---

Any doc needed?

> Support Snappy, LZO, and LZ4 from aircompressor.
> 
>
> Key: ORC-77
> URL: https://issues.apache.org/jira/browse/ORC-77
> Project: Orc
>  Issue Type: Bug
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Fix For: 1.2.0
>
>
> It would be great to support LZO, Snappy, and LZ4 using 
> https://github.com/airlift/aircompressor .





[jira] [Commented] (ORC-81) Add support for lzo and lz4 to c++ reader

2016-08-09 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414710#comment-15414710
 ] 

Lefty Leverenz commented on ORC-81:
---

Any doc needed?

> Add support for lzo and lz4 to c++ reader
> -
>
> Key: ORC-81
> URL: https://issues.apache.org/jira/browse/ORC-81
> Project: Orc
>  Issue Type: Bug
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Fix For: 1.2.0
>
>
> Since ORC-77 is adding support for LZO and LZ4 to the Java reader and writer, 
> we need to add matching support for the C++ reader.
> I propose that we port the aircompressor Java LzoDecompressor to C++ so that 
> we end up with an Apache licensed C++ lzo decompressor.





[jira] [Commented] (ORC-88) Add a raw metadata switch to orc-metadata

2016-08-08 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-88?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15411433#comment-15411433
 ] 

Lefty Leverenz commented on ORC-88:
---

Does this need to be documented?

Perhaps it could go here:

* [File Tail | https://orc.apache.org/docs/file-tail.html]

> Add a raw metadata switch to orc-metadata
> -
>
> Key: ORC-88
> URL: https://issues.apache.org/jira/browse/ORC-88
> Project: Orc
>  Issue Type: Bug
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Fix For: 1.2.0
>
>
> It would be useful for orc-metadata to be able to print the raw protobuf 
> structures from the ORC file, especially for cases where some damage has 
> happened to the file.





[jira] [Commented] (ORC-150) Add tool to convert from JSON to ORC

2017-02-27 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15887216#comment-15887216
 ] 

Lefty Leverenz commented on ORC-150:


Should this be documented in the wiki?

* [ORC Tools | http://orc.apache.org/docs/tools.html]

> Add tool to convert from JSON to ORC
> 
>
> Key: ORC-150
> URL: https://issues.apache.org/jira/browse/ORC-150
> Project: Orc
>  Issue Type: New Feature
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Fix For: 1.4.0
>
>
> We should have a tool to convert from JSON data to ORC. Something like
> % java -jar orc-tools-*uber.jar create "struct" *.json





[jira] [Commented] (ORC-54) Evolve schemas based on field name rather than index

2016-08-17 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15425919#comment-15425919
 ] 

Lefty Leverenz commented on ORC-54:
---

At least the new ORC configuration parameter (*orc.tolerate.missing.schema*) 
needs to be documented.  But what about the field name functionality?

> Evolve schemas based on field name rather than index
> 
>
> Key: ORC-54
> URL: https://issues.apache.org/jira/browse/ORC-54
> Project: Orc
>  Issue Type: Improvement
>Reporter: Mark Wagner
>Assignee: Mark Wagner
> Fix For: 1.2.0
>
>
> Schema evolution as it stands today allows adding fields to the end of 
> schemas or removing them from the end. However, because it is based on the 
> index of the column, you can only ever add or remove -- not both.
> ORC files have the full schema information of their contents, so there's 
> actually enough metadata to support changing columns anywhere in the schema.





[jira] [Commented] (ORC-54) Evolve schemas based on field name rather than index

2016-08-17 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15425915#comment-15425915
 ] 

Lefty Leverenz commented on ORC-54:
---

Should this be documented in the wiki?

> Evolve schemas based on field name rather than index
> 
>
> Key: ORC-54
> URL: https://issues.apache.org/jira/browse/ORC-54
> Project: Orc
>  Issue Type: Improvement
>Reporter: Mark Wagner
>Assignee: Mark Wagner
> Fix For: 1.2.0
>
>
> Schema evolution as it stands today allows adding fields to the end of 
> schemas or removing them from the end. However, because it is based on the 
> index of the column, you can only ever add or remove -- not both.
> ORC files have the full schema information of their contents, so there's 
> actually enough metadata to support changing columns anywhere in the schema.





[jira] [Commented] (ORC-120) Create a backwards compatibility mode of ignoring names for evolution

2016-12-18 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15759654#comment-15759654
 ] 

Lefty Leverenz commented on ORC-120:


The new configuration parameter in OrcConf.java is 
*orc.force.positional.evolution*.
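A hedged example of what enabling the backward-compatibility mode might look like in a Hadoop configuration file, assuming the parameter is read through the usual Configuration mechanism:

```xml
<property>
  <name>orc.force.positional.evolution</name>
  <!-- true forces positional matching, ignoring field names -->
  <value>true</value>
</property>
```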

> Create a backwards compatibility mode of ignoring names for evolution
> -
>
> Key: ORC-120
> URL: https://issues.apache.org/jira/browse/ORC-120
> Project: Orc
>  Issue Type: Task
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Fix For: 1.3.0
>
>
> ORC's schema evolution uses the column names when they are available. Hive 
> 2.1 uses a positional schema, so ORC should support a backward compatibility 
> mode for Hive users during the transition.





[jira] [Commented] (ORC-120) Create a backwards compatibility mode of ignoring names for evolution

2016-12-18 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15759643#comment-15759643
 ] 

Lefty Leverenz commented on ORC-120:


This will need documentation, as discussed in some Dec. 12 messages on the 
dev@orc email thread, archived here:

* 
https://mail-archives.apache.org/mod_mbox/orc-dev/201612.mbox/%3ced4820a4-8ee1-462a-8658-551d54323...@iq80.com%3e

It's strange that the messages don't appear here in the JIRA.

Synopsis:

Dain Sundstrom -- Is "ORC's schema evolution uses the column names when they 
are available" documented somewhere?

Owen O'Malley -- No, unfortunately, but it needs to be. The basic rules from 
SchemaEvolution.java look like:

structs (including the top row):
   if field names are available (post HIVE-4243), use name matching
   otherwise use positional matching

lists, maps, unions:
  children must match

Many primitives can convert to each other, but this list needs to be cleaned up:
  boolean, byte, short, int, long, float, double, decimal -> boolean, byte,
short, int, long, float, double, decimal, string, char, varchar, timestamp
  string, char, varchar -> all
  timestamp -> boolean, byte, short, int, long, float, double, decimal,
string, char, varchar, date
  date -> string, char, varchar, timestamp
  binary -> string, char, varchar, date
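The struct rule above can be illustrated with a minimal, self-contained sketch. This is not ORC's actual SchemaEvolution code, just the name-versus-positional matching idea for top-level struct fields:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the struct-matching rule: match reader fields to writer fields
// by name when field names are available, otherwise positionally.
public class StructMatch {
    // Returns, for each reader field, the index of the matching writer
    // field, or -1 when no writer field matches (e.g. an added column).
    static int[] matchFields(List<String> writerFields,
                             List<String> readerFields,
                             boolean hasNames) {
        int[] mapping = new int[readerFields.size()];
        for (int r = 0; r < readerFields.size(); ++r) {
            if (hasNames) {
                // name matching (post HIVE-4243)
                mapping[r] = writerFields.indexOf(readerFields.get(r));
            } else {
                // positional matching
                mapping[r] = r < writerFields.size() ? r : -1;
            }
        }
        return mapping;
    }

    public static void main(String[] args) {
        List<String> writer = Arrays.asList("a", "b", "c");
        List<String> reader = Arrays.asList("c", "a", "d");
        // by name: c -> 2, a -> 0, d (new column) -> -1
        System.out.println(Arrays.toString(matchFields(writer, reader, true)));
        // positionally: 0, 1, 2 regardless of names
        System.out.println(Arrays.toString(matchFields(writer, reader, false)));
    }
}
```

With name matching, a reader column that does not exist in the file maps to nothing (and would read as nulls); with positional matching, the names are ignored entirely, which is the behavior ORC-120 makes available as an option.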

Dain Sundstrom -- So, rename column is not expected to work anymore?

Owen O'Malley -- ORC-120 will add an option to force positional mapping.

Dain Sundstrom -- Oh, I see this is an ORC feature like the Parquet schema 
evolution stuff.  We implemented support for ordering by the top level struct 
names in Presto a while back.

> Create a backwards compatibility mode of ignoring names for evolution
> -
>
> Key: ORC-120
> URL: https://issues.apache.org/jira/browse/ORC-120
> Project: Orc
>  Issue Type: Task
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Fix For: 1.3.0
>
>
> ORC's schema evolution uses the column names when they are available. Hive 
> 2.1 uses a positional schema, so ORC should support a backward compatibility 
> mode for Hive users during the transition.





[jira] [Commented] (ORC-150) Add tool to convert from JSON to ORC

2017-03-07 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899261#comment-15899261
 ] 

Lefty Leverenz commented on ORC-150:


Thanks Owen, here's the link:

* [Java ORC Tools | https://orc.apache.org/docs/tools.html#java-orc-tools]

By the way, the formatting in paragraph 2 is broken (subcommands).

> Add tool to convert from JSON to ORC
> 
>
> Key: ORC-150
> URL: https://issues.apache.org/jira/browse/ORC-150
> Project: ORC
>  Issue Type: New Feature
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
> Fix For: 1.4.0
>
>
> We should have a tool to convert from JSON data to ORC. Something like
> % java -jar orc-tools-*uber.jar create "struct" *.json





[jira] [Commented] (ORC-219) Boolean and timestamp converter for CSV

2017-08-08 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16119103#comment-16119103
 ] 

Lefty Leverenz commented on ORC-219:


Okay, thanks Seshu.

> Boolean and timestamp converter for CSV
> ---
>
> Key: ORC-219
> URL: https://issues.apache.org/jira/browse/ORC-219
> Project: ORC
>  Issue Type: Improvement
>  Components: Java, tools
>Affects Versions: 1.5.0
>Reporter: Seshu Pasam
>Assignee: Seshu Pasam
> Fix For: 1.5.0
>
>
> Similar to the JSON converter, the CSV converter should support boolean and 
> timestamp.





[jira] [Commented] (ORC-238) Make OrcFile's "enforceBufferSize" user-settable.

2017-09-26 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181910#comment-16181910
 ] 

Lefty Leverenz commented on ORC-238:


Doc note:  The configuration parameters *orc.buffer.size.enforce* and 
*hive.exec.orc.buffer.size.enforce* should be documented in the ORC wiki, with 
version information.

But apparently we never got around to documenting the "orc.x" parameters that 
correspond to "hive.exec.orc.x" parameters.  Anyway, here are the doc links:

* [ORC Documentation | https://orc.apache.org/docs/] 
** [Hive Configuration | https://orc.apache.org/docs/hive-config.html]

> Make OrcFile's "enforceBufferSize" user-settable.
> -
>
> Key: ORC-238
> URL: https://issues.apache.org/jira/browse/ORC-238
> Project: ORC
>  Issue Type: Improvement
>Affects Versions: 1.4.0, 1.5.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Fix For: 1.5.0
>
> Attachments: ORC-238.1.patch
>
>
> Compression buffer-sizes in OrcFile are computed at runtime, except when 
> {{enforceBufferSize}} is set. The only snag here is that this flag can't be 
> set by the user.
> When runtime-computed buffer-sizes are not optimal (for some reason), the 
> user has no way to work around it by setting a custom value.
> I have a patch that we use at Yahoo.





[jira] [Commented] (ORC-238) Make OrcFile's "enforceBufferSize" user-settable.

2017-09-30 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187283#comment-16187283
 ] 

Lefty Leverenz commented on ORC-238:


Mithun, this is most embarrassing ... I haven't updated any ORC documentation 
yet, despite many good intentions, so I don't know how it's done.  But take a 
look at ORC-65 for an example.



> Make OrcFile's "enforceBufferSize" user-settable.
> -
>
> Key: ORC-238
> URL: https://issues.apache.org/jira/browse/ORC-238
> Project: ORC
>  Issue Type: Improvement
>Affects Versions: 1.4.0, 1.5.0
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
> Fix For: 1.5.0
>
> Attachments: ORC-238.1.patch
>
>
> Compression buffer-sizes in OrcFile are computed at runtime, except when 
> {{enforceBufferSize}} is set. The only snag here is that this flag can't be 
> set by the user.
> When runtime-computed buffer-sizes are not optimal (for some reason), the 
> user has no way to work around it by setting a custom value.
> I have a patch that we use at Yahoo.





[jira] [Commented] (ORC-228) Make MemoryManagerImpl.ROWS_BETWEEN_CHECKS configurable

2017-08-20 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16134314#comment-16134314
 ] 

Lefty Leverenz commented on ORC-228:


Does the new config (orc.rows.between.memory.checks) need to be documented if 
it's mainly for testing?

> Make MemoryManagerImpl.ROWS_BETWEEN_CHECKS configurable
> ---
>
> Key: ORC-228
> URL: https://issues.apache.org/jira/browse/ORC-228
> Project: ORC
>  Issue Type: Improvement
>Reporter: Eugene Koifman
>Assignee: Eugene Koifman
> Fix For: 1.5.0
>
>
> currently addedRow() looks like
> {noformat}
> public void addedRow(int rows) throws IOException {
> rowsAddedSinceCheck += rows;
> if (rowsAddedSinceCheck >= ROWS_BETWEEN_CHECKS) {
>   notifyWriters();
> }
>   }
> {noformat}
> it would be convenient for testing to set ROWS_BETWEEN_CHECKS to a low value 
> so that we can generate multiple stripes with very little data.
> Currently the only way to do this is to create a new MemoryManager that 
> overrides this method and install it via OrcFile.WriterOptions but this only 
> works when you have control over creating the Writer.
> For example 
> _org.apache.hadoop.hive.ql.io.orc.TestOrcRawRecordMerger.testRecordReaderNewBaseAndDelta()_
> There is no way to do this via some set of config params to make Hive query 
> for example, create multiple stripes with little data.
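As a self-contained illustration of why a low check interval helps testing, here is a sketch (not ORC's actual MemoryManagerImpl) where ROWS_BETWEEN_CHECKS is a constructor argument, so a small value forces frequent writer notifications and thus many small stripes:

```java
// Sketch of the addedRow() check with a configurable interval.
public class MemoryCheckSketch {
    private final int rowsBetweenChecks;
    private long rowsAddedSinceCheck = 0;
    private int notifications = 0;

    MemoryCheckSketch(int rowsBetweenChecks) {
        this.rowsBetweenChecks = rowsBetweenChecks;
    }

    void addedRow(int rows) {
        rowsAddedSinceCheck += rows;
        if (rowsAddedSinceCheck >= rowsBetweenChecks) {
            notifications++;          // stands in for notifyWriters()
            rowsAddedSinceCheck = 0;  // reset after the check fires
        }
    }

    int notifications() { return notifications; }

    public static void main(String[] args) {
        // A low interval (as orc.rows.between.memory.checks would allow)
        // makes the check fire often even with very little data.
        MemoryCheckSketch m = new MemoryCheckSketch(10);
        for (int i = 0; i < 100; ++i) {
            m.addedRow(1);
        }
        System.out.println(m.notifications()); // 10
    }
}
```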





[jira] [Commented] (ORC-207) [C++] Enable users the ability to provide their own thirdparty libraries

2017-11-12 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16248988#comment-16248988
 ] 

Lefty Leverenz commented on ORC-207:


Owen's comment on ORC-238 tells how ORC documentation is done:

* [doc comment on ORC-238 | 
https://issues.apache.org/jira/browse/ORC-238?focusedCommentId=16192026&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16192026]

> [C++] Enable users the ability to provide their own thirdparty libraries
> 
>
> Key: ORC-207
> URL: https://issues.apache.org/jira/browse/ORC-207
> Project: ORC
>  Issue Type: Improvement
>Reporter: Deepak Majeti
>Assignee: Deepak Majeti
>
> Compression libraries are currently part of the main source. There is no 
> option to use user-provided libraries. The scope of this JIRA is to enable 
> this option.





[jira] [Commented] (ORC-341) Add ability to use timestamp in UTC rather than local timezone.

2018-05-13 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16473682#comment-16473682
 ] 

Lefty Leverenz commented on ORC-341:


Has this been documented in the wiki?

> Add ability to use timestamp in UTC rather than local timezone.
> ---
>
> Key: ORC-341
> URL: https://issues.apache.org/jira/browse/ORC-341
> Project: ORC
>  Issue Type: Improvement
>Reporter: Jesus Camacho Rodriguez
>Assignee: Jesus Camacho Rodriguez
>Priority: Major
> Fix For: 1.5.0
>
>
> Currently, time zone is hardcoded as the system default time zone and ORC 
> applies displacement between timestamp values read/written based on time zone.
> This issue aims at adding the option to pass the time zone as a parameter to 
> the reader/writer.





[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-06-13 Thread Lefty Leverenz (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511849#comment-16511849
 ] 

Lefty Leverenz commented on ORC-343:


Should this be documented in the wiki?

> Enable C++ writer to support RleV2
> --
>
> Key: ORC-343
> URL: https://issues.apache.org/jira/browse/ORC-343
> Project: ORC
>  Issue Type: New Feature
>  Components: C++
>Reporter: Yurui Zhou
>Priority: Major
>
> Currently only the Java implementation supports the RleV2 encoder; the C++ 
> implementation only supports RleV2 decoding. 
> This issue aims to enable the C++ writer to support RleV2 encoding.





[jira] [Commented] (ORC-343) Enable C++ writer to support RleV2

2018-06-14 Thread Lefty Leverenz (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513346#comment-16513346
 ] 

Lefty Leverenz commented on ORC-343:


Okay, thanks.

> Enable C++ writer to support RleV2
> --
>
> Key: ORC-343
> URL: https://issues.apache.org/jira/browse/ORC-343
> Project: ORC
>  Issue Type: New Feature
>  Components: C++
>Reporter: Yurui Zhou
>Priority: Major
>
> Currently only the Java implementation supports the RleV2 encoder; the C++ 
> implementation only supports RleV2 decoding. 
> This issue aims to enable the C++ writer to support RleV2 encoding.





[jira] [Commented] (ORC-290) [C++] Update Readme to include C++ writer info

2018-01-25 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340347#comment-16340347
 ] 

Lefty Leverenz commented on ORC-290:


Alas, the patch has a typo (missing "d"):  "and will be extened in the future."

> [C++] Update Readme to include C++ writer info
> --
>
> Key: ORC-290
> URL: https://issues.apache.org/jira/browse/ORC-290
> Project: ORC
>  Issue Type: Improvement
>Reporter: Xiening Dai
>Assignee: Xiening Dai
>Priority: Major
> Fix For: 1.5.0
>
>
> Update Readme.md to include the new information about writer.





[jira] [Commented] (ORC-256) Add unmasked ranges option for redact mask

2018-01-01 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307710#comment-16307710
 ] 

Lefty Leverenz commented on ORC-256:


Should this be documented in the wiki?

> Add unmasked ranges option for redact mask
> --
>
> Key: ORC-256
> URL: https://issues.apache.org/jira/browse/ORC-256
> Project: ORC
>  Issue Type: Sub-task
>Reporter: Owen O'Malley
>Assignee: Sandeep More
> Fix For: 1.5.0
>
>
> It would be good to extend the Redact DataMask so that you could leave 
> certain ranges of strings unmasked.





[jira] [Commented] (ORC-264) column name matching while schema evolution should be case unaware.

2018-01-01 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307711#comment-16307711
 ] 

Lefty Leverenz commented on ORC-264:


Doc note:  *orc.schema.evolution.case.sensitive* should be documented in the 
wiki.

> column name matching while schema evolution should be case unaware. 
> 
>
> Key: ORC-264
> URL: https://issues.apache.org/jira/browse/ORC-264
> Project: ORC
>  Issue Type: Bug
>  Components: evolution
>Reporter: piyush mukati
>Assignee: piyush mukati
>Priority: Minor
> Fix For: 1.5.0
>
>
> Column name matching during schema evolution should be case-insensitive, or 
> at least controllable with a config. The reader schema passed by Hive is all 
> lowercase, so if a column name stored in the file contains any uppercase 
> characters, null values are returned for those columns even when the data is 
> present in the file.





[jira] [Commented] (ORC-179) ORC C++ Writer Umbrella

2018-04-05 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16427955#comment-16427955
 ] 

Lefty Leverenz commented on ORC-179:


What documentation does this need?

A search for "doc" in all the subtasks only finds two:  ORC-178 and ORC-276.

> ORC C++ Writer Umbrella
> ---
>
> Key: ORC-179
> URL: https://issues.apache.org/jira/browse/ORC-179
> Project: ORC
>  Issue Type: Task
>  Components: C++
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
> Fix For: 1.5.0
>
>






[jira] [Commented] (ORC-322) c++ writer should not adjust gmtOffset when writing timestamps

2018-04-05 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16427978#comment-16427978
 ] 

Lefty Leverenz commented on ORC-322:


On March 26 @stiga-huang wrote: "LGTM. We just need to document that users 
should pass in UTC timestamps when using the c++ writer."

So does this need to be documented along with umbrella ORC-179?

> c++ writer should not adjust gmtOffset when writing timestamps
> --
>
> Key: ORC-322
> URL: https://issues.apache.org/jira/browse/ORC-322
> Project: ORC
>  Issue Type: Bug
>  Components: C++
>Reporter: Quanlong Huang
>Assignee: Gang Wu
>Priority: Major
> Fix For: 1.5.0
>
>
> The c++ TimestampColumnWriter will adjust timestamp with gmtOffset:
> {code:java}
>   void TimestampColumnWriter::add(ColumnVectorBatch& rowBatch,
>   uint64_t offset,
>   uint64_t numValues) {
> ..
> int64_t *secs = tsBatch->data.data() + offset;
> int64_t *nanos = tsBatch->nanoseconds.data() + offset;
> ..
> bool hasNull = false;
> for (uint64_t i = 0; i < numValues; ++i) {
>   if (notNull == nullptr || notNull[i]) {
> // TimestampVectorBatch already stores data in UTC
> int64_t millsUTC = secs[i] * 1000 + nanos[i] / 100;
> tsStats->increase(1);
> tsStats->update(millsUTC);
> secs[i] -= timezone.getVariant(secs[i]).gmtOffset;<-- should not 
> adjust with gmtOffset here
> secs[i] -= timezone.getEpoch();
> nanos[i] = formatNano(nanos[i]);
>   } else if (!hasNull) {
> hasNull = true;
>   }
> }
> tsStats->setHasNull(hasNull);
> secRleEncoder->add(secs, numValues, notNull);
> nanoRleEncoder->add(nanos, numValues, notNull);
>   }
> {code}
> The java reader doesn't adjust this:
> {code:java}
>   public void writeBatch(ColumnVector vector, int offset,
>  int length) throws IOException {
> super.writeBatch(vector, offset, length);
> TimestampColumnVector vec = (TimestampColumnVector) vector;
> if (vector.isRepeating) {
>   ..
> } else {
>   for (int i = 0; i < length; ++i) {
> if (vec.noNulls || !vec.isNull[i + offset]) {
>   // ignore the bottom three digits from the vec.time field
>   final long secs = vec.time[i + offset] / MILLIS_PER_SECOND;
>   final int newNanos = vec.nanos[i + offset];
>   // set the millis based on the top three digits of the nanos
>   long millis = secs * MILLIS_PER_SECOND + newNanos / 1_000_000;
>   if (millis < 0 && newNanos > 999_999) {
> millis -= MILLIS_PER_SECOND;
>   }
>   long utc = SerializationUtils.convertToUtc(localTimezone, millis);
>   seconds.write(secs - baseEpochSecsLocalTz);   <-- only adjust with 
> ORC epoch
>   nanos.write(formatNanos(newNanos));
>   indexStatistics.updateTimestamp(utc);
>   if (createBloomFilter) {
> if (bloomFilter != null) {
>   bloomFilter.addLong(millis);
> }
> bloomFilterUtf8.addLong(utc);
>   }
> }
>   }
> }
>   }
> {code}
> This is a follow-up of ORC-320. I think there's a wrong assumption in the C++ 
> code that timestamps given to the writer's TimestampVectorBatch are equal to 
> timestamps read back from the reader's TimestampVectorBatch (see the code in 
> TestWriter::writeTimestamp). Note that there is gmtOffset processing logic in 
> the C++ TimestampColumnPrinter.
> To reveal the bug:
> {code:bash}
> $ cat /etc/timezone 
> America/Los_Angeles
> $ cat ts.csv# prepare a csv file with two timestamps
> 1520762399
> 1520762400
> $ date -d @1520762399
> Sun Mar 11 01:59:59 PST 2018
> $ date -d @1520762400
> Sun Mar 11 03:00:00 PDT 2018
> $ build/tools/src/csv-import "struct" ts.csv ts.orc# 
> generate orc file by c++ writer
> [2018-03-14 21:41:58] Start importing Orc file...
> [2018-03-14 21:41:58] Finish importing Orc file.
> [2018-03-14 21:41:58] Total writer elasped time: 0.000475s.
> [2018-03-14 21:41:58] Total writer CPU time: 0.000474s.
> $ build/tools/src/orc-contents ts.orc # read by c++ reader, results 
> are not we written
> {"col0": "2018-03-11 10:59:59.0"}
> {"col0": "2018-03-11 10:00:00.0"}
> $ java -jar java/tools/target/orc-tools-1.5.0-SNAPSHOT-uber.jar data ts.orc   
>  # read by java reader, results are the same as c++ reader
> Processing data file ts.orc [length: 257]
> {"col0":"2018-03-11 10:59:59.0"}
> {"col0":"2018-03-11 10:00:00.0"}
> {code}
> Since there are no tools that import timestamps as long values, we can use 
> Java code to generate a correct ORC file:
> {code:java}
> public class TimestampWriter {
>   public static void 

[jira] [Commented] (ORC-322) c++ writer should not adjust gmtOffset when writing timestamps

2018-04-09 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431668#comment-16431668
 ] 

Lefty Leverenz commented on ORC-322:


[~wgtmac], for examples see ORC-65 or any of the most recent changes to files 
in orc/site/_docs. Also see Owen's comment on ORC-238.  Markdown is used for 
formatting.
 * [orc/site/_docs |https://github.com/apache/orc/tree/master/site/_docs]
 * [orc/site/README.md 
|https://github.com/apache/orc/blob/master/site/README.md]
 * Owen's doc comment

> c++ writer should not adjust gmtOffset when writing timestamps
> --
>
> Key: ORC-322
> URL: https://issues.apache.org/jira/browse/ORC-322
> Project: ORC
>  Issue Type: Bug
>  Components: C++
>Reporter: Quanlong Huang
>Assignee: Gang Wu
>Priority: Major
> Fix For: 1.5.0
>
>
> The c++ TimestampColumnWriter will adjust timestamp with gmtOffset:
> {code:java}
>   void TimestampColumnWriter::add(ColumnVectorBatch& rowBatch,
>   uint64_t offset,
>   uint64_t numValues) {
> ..
> int64_t *secs = tsBatch->data.data() + offset;
> int64_t *nanos = tsBatch->nanoseconds.data() + offset;
> ..
> bool hasNull = false;
> for (uint64_t i = 0; i < numValues; ++i) {
>   if (notNull == nullptr || notNull[i]) {
> // TimestampVectorBatch already stores data in UTC
> int64_t millsUTC = secs[i] * 1000 + nanos[i] / 1000000;
> tsStats->increase(1);
> tsStats->update(millsUTC);
> secs[i] -= timezone.getVariant(secs[i]).gmtOffset;  // <-- should not adjust with gmtOffset here
> secs[i] -= timezone.getEpoch();
> nanos[i] = formatNano(nanos[i]);
>   } else if (!hasNull) {
> hasNull = true;
>   }
> }
> tsStats->setHasNull(hasNull);
> secRleEncoder->add(secs, numValues, notNull);
> nanoRleEncoder->add(nanos, numValues, notNull);
>   }
> {code}
> The java reader doesn't adjust this:
> {code:java}
>   public void writeBatch(ColumnVector vector, int offset,
>  int length) throws IOException {
> super.writeBatch(vector, offset, length);
> TimestampColumnVector vec = (TimestampColumnVector) vector;
> if (vector.isRepeating) {
>   ..
> } else {
>   for (int i = 0; i < length; ++i) {
> if (vec.noNulls || !vec.isNull[i + offset]) {
>   // ignore the bottom three digits from the vec.time field
>   final long secs = vec.time[i + offset] / MILLIS_PER_SECOND;
>   final int newNanos = vec.nanos[i + offset];
>   // set the millis based on the top three digits of the nanos
>   long millis = secs * MILLIS_PER_SECOND + newNanos / 1_000_000;
>   if (millis < 0 && newNanos > 999_999) {
> millis -= MILLIS_PER_SECOND;
>   }
>   long utc = SerializationUtils.convertToUtc(localTimezone, millis);
> seconds.write(secs - baseEpochSecsLocalTz);  // <-- only adjusts by the ORC epoch
>   nanos.write(formatNanos(newNanos));
>   indexStatistics.updateTimestamp(utc);
>   if (createBloomFilter) {
> if (bloomFilter != null) {
>   bloomFilter.addLong(millis);
> }
> bloomFilterUtf8.addLong(utc);
>   }
> }
>   }
> }
>   }
> {code}
> This is a follow-up to ORC-320. I think there's a wrong assumption in the C++ 
> code that timestamps given to the writer's TimestampVectorBatch are equal to 
> the timestamps read back from the reader's TimestampVectorBatch (see the code 
> in TestWriter::writeTimestamp). Note that there is gmtOffset processing 
> logic in the C++ TimestampColumnPrinter.
> To reveal the bug:
> {code:bash}
> $ cat /etc/timezone 
> America/Los_Angeles
> $ cat ts.csv    # prepare a csv file with two timestamps
> 1520762399
> 1520762400
> $ date -d @1520762399
> Sun Mar 11 01:59:59 PST 2018
> $ date -d @1520762400
> Sun Mar 11 03:00:00 PDT 2018
> $ build/tools/src/csv-import "struct<col0:timestamp>" ts.csv ts.orc    # generate orc file with the c++ writer
> [2018-03-14 21:41:58] Start importing Orc file...
> [2018-03-14 21:41:58] Finish importing Orc file.
> [2018-03-14 21:41:58] Total writer elasped time: 0.000475s.
> [2018-03-14 21:41:58] Total writer CPU time: 0.000474s.
> $ build/tools/src/orc-contents ts.orc    # read by c++ reader; the results are not what we wrote
> {"col0": "2018-03-11 10:59:59.0"}
> {"col0": "2018-03-11 10:00:00.0"}
> $ java -jar java/tools/target/orc-tools-1.5.0-SNAPSHOT-uber.jar data ts.orc    # read by java reader; results are the same as the c++ reader
> Processing data file ts.orc [length: 257]
> {"col0":"2018-03-11 10:59:59.0"}
> {"col0":"2018-03-11 10:00:00.0"}
> {code}
> Since there're no tools importing timestamps as long, we can use Java code to 
> generate a correct orc file.
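For what it's worth, the hour-shifted values in the transcript above can be reproduced with plain `java.time`, outside ORC entirely. This is only a sketch of the offset arithmetic the report describes, not ORC code:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class TimestampSkewDemo {
    public static void main(String[] args) {
        ZoneId la = ZoneId.of("America/Los_Angeles");
        long written = 1520762399L;  // 2018-03-11 01:59:59 PST, as in the transcript

        // gmtOffset at write time is -28800 seconds (PST, UTC-8)
        int gmtOffset = la.getRules().getOffset(Instant.ofEpochSecond(written))
                          .getTotalSeconds();

        // the writer's "secs[i] -= gmtOffset" pushes an already-UTC value 8h forward
        long stored = written - gmtOffset;  // 1520791199 = 2018-03-11 17:59:59 UTC

        // the printer treats the stored value as UTC and renders it in local time,
        // which is now PDT (UTC-7) because the shifted instant crossed the DST change
        ZonedDateTime printed = Instant.ofEpochSecond(stored).atZone(la);
        System.out.println(printed.toLocalDateTime());  // 2018-03-11T10:59:59
    }
}
```

The spurious subtraction lands the instant on the other side of the DST transition, which is why the printed local time is off by nine hours rather than the raw eight-hour offset.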

[jira] [Commented] (ORC-328) Add option to allow custom timestamp format for CSV to ORC conversion

2018-04-01 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421590#comment-16421590
 ] 

Lefty Leverenz commented on ORC-328:


[~owen.omalley], this should be documented in the wiki.

I think it belongs here:
 * [https://orc.apache.org/docs/tools.html#java-convert]

but if so, a few other command options also need to be documented.

> Add option to allow custom timestamp format for CSV to ORC conversion
> -
>
> Key: ORC-328
> URL: https://issues.apache.org/jira/browse/ORC-328
> Project: ORC
>  Issue Type: Bug
>Reporter: Bill Warshaw
>Assignee: Bill Warshaw
>Priority: Major
> Fix For: 1.5.0
>
>
> We should support specifying a custom timestamp format on the command line.
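The excerpt does not name the actual command-line option, so the sketch below only illustrates the kind of `java.time` pattern parsing such an option implies; the `dd/MM/yyyy HH:mm:ss` pattern is a hypothetical user-supplied format, not one taken from the tool:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class CustomTimestampParse {
    public static void main(String[] args) {
        // hypothetical user-supplied pattern; the real option name and default
        // format are defined by the convert tool itself
        DateTimeFormatter custom = DateTimeFormatter.ofPattern("dd/MM/yyyy HH:mm:ss");
        LocalDateTime ts = LocalDateTime.parse("11/03/2018 01:59:59", custom);
        System.out.println(ts);  // 2018-03-11T01:59:59
    }
}
```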



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ORC-321) Add a pretty print option to the JSON schema finder.

2018-03-21 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/ORC-321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408993#comment-16408993
 ] 

Lefty Leverenz commented on ORC-321:


The -p option needs to be documented in the ORC wiki (with version information):
 * [Java JSON Schema |https://orc.apache.org/docs/tools.html#java-json-schema]

> Add a pretty print option to the JSON schema finder.
> 
>
> Key: ORC-321
> URL: https://issues.apache.org/jira/browse/ORC-321
> Project: ORC
>  Issue Type: Bug
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>Priority: Major
> Fix For: 1.5.0
>
>
> If the schema is complicated, it is really hard to understand without 
> pretty-printing it.





[jira] [Commented] (ORC-397) ORC should allow selectively disabling dictionary-encoding on specified columns

2018-09-28 Thread Lefty Leverenz (JIRA)


[ 
https://issues.apache.org/jira/browse/ORC-397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632524#comment-16632524
 ] 

Lefty Leverenz commented on ORC-397:


Has orc.column.encoding.direct been documented in the wiki?

> ORC should allow selectively disabling dictionary-encoding on specified 
> columns
> ---
>
> Key: ORC-397
> URL: https://issues.apache.org/jira/browse/ORC-397
> Project: ORC
>  Issue Type: New Feature
>Reporter: Mithun Radhakrishnan
>Assignee: Mithun Radhakrishnan
>Priority: Major
> Fix For: 1.6.0, 1.5.3
>
> Attachments: HIVE-18608.1-branch-2.2.patch, 
> HIVE-18608.2-branch-2.2.patch
>
>
> Just as ORC allows the choice of columns to enable bloom-filters on, it would 
> be nice to have a way to specify which columns {{DICTIONARY_V2}} encoding 
> should be disabled on.
> Currently, the choice of dictionary-encoding depends on the results of 
> sampling the first row-stride within a stripe. If the user knows that a 
> column's cardinality is bound to prevent an effective dictionary, she might 
> choose to simply disable it on just that column, and avoid the cost of 
> sampling in the first row-stride.
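The sampling trade-off described above can be sketched with a toy heuristic; this is not ORC's actual implementation, and the 0.8 threshold is only assumed to mirror the common default for the dictionary key threshold setting:

```java
import java.util.Arrays;
import java.util.HashSet;

public class DictionaryCheck {
    // Toy version of the row-stride sampling decision: keep dictionary
    // encoding only while the distinct/total ratio stays at or below the
    // threshold. High-cardinality columns fail this check, which is why
    // a user who already knows the cardinality may want to skip sampling
    // and force direct encoding on that column.
    static boolean useDictionary(String[] sample, double threshold) {
        int distinct = new HashSet<>(Arrays.asList(sample)).size();
        return (double) distinct / sample.length <= threshold;
    }

    public static void main(String[] args) {
        System.out.println(useDictionary(new String[]{"a", "a", "b", "a"}, 0.8));  // true: ratio 0.5
        System.out.println(useDictionary(new String[]{"a", "b", "c", "d"}, 0.8));  // false: ratio 1.0
    }
}
```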


