[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...

2018-04-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/orc/pull/247


---


[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...

2018-04-17 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/247#discussion_r182139793
  
--- Diff: site/specification/ORCv2.md ---
@@ -0,0 +1,1032 @@
+---
+layout: page
+title: Evolving Draft for ORC Specification v2
+---
+
+This specification is rapidly evolving and should only be used for
+developers on the project.
+
+# TO DO items
+
+The list of things that we plan to change:
+
+* Create a decimal representation with fixed scale using rle.
+* Create a better float/double encoding that splits mantissa and
+  exponent.
+* Create a dictionary encoding for float, double, and decimal.
+* Create RLEv3:
+   * 64 and 128 bit variants
--- End diff --

Just to clarify the answer: "variant" means "one of the instances of 
something that varies." The unfortunately similar-looking "varint" is an 
unofficial contraction of "variable-length integer."
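
For readers unfamiliar with the term, here is a minimal sketch of an unsigned base 128 varint, the kind of "variable-length integer" referred to above. It is illustrative only and is not taken from the ORC code base; the function names `encode_varint` and `decode_varint` are made up for this example.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base 128 varint (low 7 bits first)."""
    if value < 0:
        raise ValueError("this sketch handles unsigned values only")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # take the lowest 7 bits
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode a single base 128 varint from the start of data."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:          # high bit clear marks the final byte
            return result
        shift += 7
    raise ValueError("truncated varint")
```

Because Python integers are unbounded, the same sketch covers values wider than 64 bits; a Java or C++ version would need a fixed-width type.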


---


[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...

2018-04-17 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/247#discussion_r182138208
  
--- Diff: site/specification/ORCv2.md ---
@@ -0,0 +1,1032 @@
+---
+layout: page
+title: Evolving Draft for ORC Specification v2
+---
+
+This specification is rapidly evolving and should only be used for
+developers on the project.
+
+# TO DO items
+
+The list of things that we plan to change:
+
+* Create a decimal representation with fixed scale using rle.
+* Create a better float/double encoding that splits mantissa and
+  exponent.
+* Create a dictionary encoding for float, double, and decimal.
+* Create RLEv3:
+   * 64 and 128 bit variants
--- End diff --

No, I mean that the new RLEv3 needs to support both 64-bit and 128-bit 
integers. The specification could of course just define the encoding using 128 
bits, but the Java and C++ implementations would likely be split.


---


[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...

2018-04-17 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/247#discussion_r182137684
  
--- Diff: site/specification/ORCv2.md ---
@@ -0,0 +1,1032 @@
+---
+layout: page
+title: Evolving Draft for ORC Specification v2
+---
+
+This specification is rapidly evolving and should only be used for
+developers on the project.
+
+# TO DO items
+
+The list of things that we plan to change:
+
+* Create a decimal representation with fixed scale using rle.
+* Create a better float/double encoding that splits mantissa and
+  exponent.
+* Create a dictionary encoding for float, double, and decimal.
+* Create RLEv3:
--- End diff --

I have some thoughts, but haven't done any work on it.


---


[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...

2018-04-17 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/247#discussion_r182137566
  
--- Diff: site/specification/ORCv2.md ---
@@ -0,0 +1,1032 @@
+---
+layout: page
+title: Evolving Draft for ORC Specification v2
+---
+
+This specification is rapidly evolving and should only be used for
+developers on the project.
+
+# TO DO items
--- End diff --

This is absolutely a work in progress. I'd love to see your proposal.


---


[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...

2018-04-17 Thread wgtmac
Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/247#discussion_r181962407
  
--- Diff: site/specification/ORCv2.md ---
@@ -0,0 +1,1032 @@
+---
+layout: page
+title: Evolving Draft for ORC Specification v2
+---
+
+This specification is rapidly evolving and should only be used for
+developers on the project.
+
+# TO DO items
+
+The list of things that we plan to change:
+
+* Create a decimal representation with fixed scale using rle.
+* Create a better float/double encoding that splits mantissa and
+  exponent.
+* Create a dictionary encoding for float, double, and decimal.
+* Create RLEv3:
+   * 64 and 128 bit variants
--- End diff --

I think you meant base 128 and 256 varints.


---


[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...

2018-04-15 Thread wgtmac
Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/247#discussion_r181610811
  
--- Diff: site/specification/ORCv2.md ---
@@ -0,0 +1,1032 @@
+---
+layout: page
+title: Evolving Draft for ORC Specification v2
+---
+
+This specification is rapidly evolving and should only be used for
+developers on the project.
+
+# TO DO items
+
+The list of things that we plan to change:
+
+* Create a decimal representation with fixed scale using rle.
+* Create a better float/double encoding that splits mantissa and
+  exponent.
+* Create a dictionary encoding for float, double, and decimal.
+* Create RLEv3:
--- End diff --

Is there anyone working on it already?


---


[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...

2018-04-13 Thread xndai
Github user xndai commented on a diff in the pull request:

https://github.com/apache/orc/pull/247#discussion_r181530787
  
--- Diff: site/specification/ORCv2.md ---
@@ -0,0 +1,1032 @@
+---
+layout: page
+title: Evolving Draft for ORC Specification v2
+---
+
+This specification is rapidly evolving and should only be used for
+developers on the project.
+
+# TO DO items
--- End diff --

Is this the final list for v2, or are we still working on it? I have one 
proposal to add to ORC v2, which I call a "clustered index". Basically, 
the writer can specify a sorting property on one or more columns, and we then 
create an index section in the ORC file whose keys are the column value(s) and 
whose values are row numbers. To reduce the size of the index, each row group 
has one entry in the clustered index. This enables a new range-scan pattern in 
which the reader provides upper and lower bounds on the column value(s).

I can write up a detailed proposal for this.
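
As a rough illustration of the idea above, and not a proposed format, the sketch below builds one index entry per row group over data sorted on a single integer column and uses it to narrow a range scan. All names (`IndexEntry`, `build_index`, `row_range_for`) and the single-column, in-memory layout are assumptions made for this example.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    first_key: int  # smallest key in the row group (data is sorted by key)
    first_row: int  # row number at which the row group starts


def build_index(sorted_keys, rows_per_group):
    """Create one entry per row group from keys already sorted ascending."""
    return [IndexEntry(sorted_keys[i], i)
            for i in range(0, len(sorted_keys), rows_per_group)]


def row_range_for(index, total_rows, lower, upper):
    """Return a half-open (start_row, end_row) range that covers every row
    whose key lies in [lower, upper]. The range is conservative: it may
    include extra rows, which the reader still has to filter."""
    firsts = [entry.first_key for entry in index]
    # The group just before the first group starting at or after `lower`
    # may still hold matching keys, so step back by one group.
    start_group = max(bisect_left(firsts, lower) - 1, 0)
    # Groups whose first key is greater than `upper` cannot match.
    end_group = bisect_right(firsts, upper)
    start_row = index[start_group].first_row if index else 0
    end_row = index[end_group].first_row if end_group < len(index) else total_rows
    return start_row, end_row
```

With keys `[1, 3, 5, ..., 19]` and four rows per group, a scan for keys between 10 and 12 would only need to read rows 4 through 7.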


---


[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...

2018-04-13 Thread omalley
Github user omalley commented on a diff in the pull request:

https://github.com/apache/orc/pull/247#discussion_r181437863
  
--- Diff: site/specification/ORCv2.md ---
@@ -0,0 +1,1032 @@
+---
+layout: page
+title: Evolving Draft for ORC Specification v2
+---
+
+This specification is rapidly evolving and should only be used for
+developers on the project.
+
+# TO DO items
+
+The list of things that we plan to change:
+
+* Create a decimal representation with fixed scale using rle.
+* Create a better float/double encoding that splits mantissa and
+  exponent.
+* Create a dictionary encoding for float, double, and decimal.
+* Create RLEv3:
+   * 64 and 128 bit variants
+   * Zero suppression
+   * Evaluate the rle subformats
+* Group stripe data into stripelets to enable Async IO for reads.
+* Reorder stripe data into (stripe metadata, index, dictionary, data)
+* Stop sorting dictionaries and record the sort order separately in the index.
+* Remove use of RLEv1 and RLEv2.
+* Remove non-utf8 bloom filter.
+* Use numeric value for decimal bloom filter.
--- End diff --

Agreed


---


[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...

2018-04-12 Thread wgtmac
Github user wgtmac commented on a diff in the pull request:

https://github.com/apache/orc/pull/247#discussion_r181239251
  
--- Diff: site/specification/ORCv2.md ---
@@ -0,0 +1,1032 @@
+---
+layout: page
+title: Evolving Draft for ORC Specification v2
+---
+
+This specification is rapidly evolving and should only be used for
+developers on the project.
+
+# TO DO items
+
+The list of things that we plan to change:
+
+* Create a decimal representation with fixed scale using rle.
+* Create a better float/double encoding that splits mantissa and
+  exponent.
+* Create a dictionary encoding for float, double, and decimal.
+* Create RLEv3:
+   * 64 and 128 bit variants
+   * Zero suppression
+   * Evaluate the rle subformats
+* Group stripe data into stripelets to enable Async IO for reads.
+* Reorder stripe data into (stripe metadata, index, dictionary, data)
+* Stop sorting dictionaries and record the sort order separately in the index.
+* Remove use of RLEv1 and RLEv2.
+* Remove non-utf8 bloom filter.
+* Use numeric value for decimal bloom filter.
--- End diff --

We may also use the numeric value for decimal column statistics.


---


[GitHub] orc pull request #247: ORC-339. Reorganize the ORC file format specification...

2018-04-12 Thread omalley
GitHub user omalley opened a pull request:

https://github.com/apache/orc/pull/247

ORC-339. Reorganize the ORC file format specification.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/omalley/orc orc-339

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/orc/pull/247.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #247


commit 5c56d74d948a73f5c456e0e80ff0622505d6c1cf
Author: Owen O'Malley 
Date:   2018-04-12T22:03:00Z

ORC-339. Reorganize the ORC file format specification.




---