Repository: arrow-site
Updated Branches:
  refs/heads/asf-site 24caf72d8 -> 9217514cf


Update metadata docs


Project: http://git-wip-us.apache.org/repos/asf/arrow-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow-site/commit/9217514c
Tree: http://git-wip-us.apache.org/repos/asf/arrow-site/tree/9217514c
Diff: http://git-wip-us.apache.org/repos/asf/arrow-site/diff/9217514c

Branch: refs/heads/asf-site
Commit: 9217514cf652f55ba585108cb4dffc89d11589bb
Parents: 24caf72
Author: Korn, Uwe <uwe.k...@blue-yonder.com>
Authored: Sun Feb 4 13:09:55 2018 +0100
Committer: Korn, Uwe <uwe.k...@blue-yonder.com>
Committed: Sun Feb 4 13:09:55 2018 +0100

----------------------------------------------------------------------
 docs/ipc.html           | 47 ++++++++++++++++++++++++++++++++++++++++++--
 docs/memory_layout.html | 11 +++++------
 docs/metadata.html      |  3 ++-
 3 files changed, 52 insertions(+), 9 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/arrow-site/blob/9217514c/docs/ipc.html
----------------------------------------------------------------------
diff --git a/docs/ipc.html b/docs/ipc.html
index c480fea..6d96632 100644
--- a/docs/ipc.html
+++ b/docs/ipc.html
@@ -145,7 +145,7 @@
 
 <ul>
   <li>A length prefix indicating the metadata size</li>
-  <li>The message metadata as a <a href="https://github.com/google]/flatbuffers">Flatbuffer</a></li>
+  <li>The message metadata as a <a href="https://github.com/google/flatbuffers">Flatbuffer</a></li>
   <li>Padding bytes to an 8-byte boundary</li>
   <li>The message body, which must be a multiple of 8 bytes</li>
 </ul>
@@ -190,7 +190,9 @@ flatbuffer union), and the size of the message body:</p>
 of encapsulated messages, each of which follows the format above. The schema
 comes first in the stream, and it is the same for all of the record batches
 that follow. If any fields in the schema are dictionary-encoded, one or more
-<code class="highlighter-rouge">DictionaryBatch</code> messages will follow the schema.</p>
+<code class="highlighter-rouge">DictionaryBatch</code> messages will be included. <code class="highlighter-rouge">DictionaryBatch</code> and
+<code class="highlighter-rouge">RecordBatch</code> messages may be interleaved, but before any dictionary key is used
+in a <code class="highlighter-rouge">RecordBatch</code> it should be defined in a <code class="highlighter-rouge">DictionaryBatch</code>.</p>
 
 <div class="highlighter-rouge"><pre class="highlight"><code>&lt;SCHEMA&gt;
 &lt;DICTIONARY 0&gt;
@@ -198,6 +200,10 @@ that follow. If any fields in the schema are dictionary-encoded, one or more
 &lt;DICTIONARY k - 1&gt;
 &lt;RECORD BATCH 0&gt;
 ...
+&lt;DICTIONARY x DELTA&gt;
+...
+&lt;DICTIONARY y DELTA&gt;
+...
 &lt;RECORD BATCH n - 1&gt;
 &lt;EOS [optional]: int32&gt;
 </code></pre>
@@ -232,6 +238,10 @@ footer.</p>
 </code></pre>
 </div>
 
+<p>In the file format, there is no requirement that dictionary keys should be
+defined in a <code class="highlighter-rouge">DictionaryBatch</code> before they are used in a <code class="highlighter-rouge">RecordBatch</code>, as long
+as the keys are defined somewhere in the file.</p>
+
 <h3 id="recordbatch-body-structure">RecordBatch body structure</h3>
 
 <p>The <code class="highlighter-rouge">RecordBatch</code> metadata contains a depth-first (pre-order) flattened set of
@@ -305,6 +315,7 @@ the dictionaries can be properly interpreted.</p>
 <div class="highlighter-rouge"><pre class="highlight"><code>table DictionaryBatch {
   id: long;
   data: RecordBatch;
+  isDelta: boolean = false;
 }
 </code></pre>
 </div>
@@ -314,6 +325,38 @@ in the schema, so that dictionaries can even be used for multiple fields. See
 the <a href="https://github.com/apache/arrow/blob/master/format/Layout.md">Physical Layout</a> document for more about the semantics of
 dictionary-encoded data.</p>
 
+<p>The dictionary <code class="highlighter-rouge">isDelta</code> flag allows dictionary batches to be modified
+mid-stream.  A dictionary batch with <code class="highlighter-rouge">isDelta</code> set indicates that its vector
+should be concatenated with those of any previous batches with the same <code class="highlighter-rouge">id</code>. A
+stream which encodes one column, the list of strings
+<code class="highlighter-rouge">["A", "B", "C", "B", "D", "C", "E", "A"]</code>, with a delta dictionary batch could
+take the form:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code>&lt;SCHEMA&gt;
+&lt;DICTIONARY 0&gt;
+(0) "A"
+(1) "B"
+(2) "C"
+
+&lt;RECORD BATCH 0&gt;
+0
+1
+2
+1
+
+&lt;DICTIONARY 0 DELTA&gt;
+(3) "D"
+(4) "E"
+
+&lt;RECORD BATCH 1&gt;
+3
+2
+4
+0
+EOS
+</code></pre>
+</div>
+
 <h3 id="tensor-multi-dimensional-array-message-format">Tensor (Multi-dimensional Array) Message Format</h3>
 
 <p>The <code class="highlighter-rouge">Tensor</code> message types provides a way to write a multidimensional array of

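The delta-dictionary example added to docs/ipc.html above can be replayed with a small sketch. This is not the Arrow API: the tuple-based message model and the `decode_stream` helper are hypothetical names invented for illustration, and the sketch assumes a single dictionary-encoded column. It shows a delta batch being concatenated onto the earlier dictionary with the same id, and a key being rejected if it is used before it is defined.

```python
from typing import Dict, List, Tuple

# Hypothetical message model, for illustration only (not the Arrow API):
#   ("dict",  dictionary_id, values,  is_delta)
#   ("batch", dictionary_id, indices)
Message = Tuple


def decode_stream(messages: List[Message]) -> List[str]:
    """Replay a stream in order and return the decoded column values."""
    dictionaries: Dict[int, List[str]] = {}
    decoded: List[str] = []
    for msg in messages:
        if msg[0] == "dict":
            _, dict_id, values, is_delta = msg
            if is_delta:
                # A delta batch is concatenated onto the existing dictionary.
                dictionaries[dict_id] = dictionaries.get(dict_id, []) + list(values)
            else:
                dictionaries[dict_id] = list(values)
        elif msg[0] == "batch":
            _, dict_id, indices = msg
            dictionary = dictionaries.get(dict_id, [])
            for index in indices:
                # In the stream format a key must be defined before it is used.
                if index < 0 or index >= len(dictionary):
                    raise ValueError(f"dictionary key {index} used before definition")
                decoded.append(dictionary[index])
        else:
            raise ValueError(f"unknown message kind: {msg[0]!r}")
    return decoded


# The stream from the documentation example above.
stream = [
    ("dict", 0, ["A", "B", "C"], False),
    ("batch", 0, [0, 1, 2, 1]),
    ("dict", 0, ["D", "E"], True),
    ("batch", 0, [3, 2, 4, 0]),
]
print(decode_stream(stream))  # prints ['A', 'B', 'C', 'B', 'D', 'C', 'E', 'A']
```

A reader of the file format would need a different strategy, since the text above says keys there only have to be defined somewhere in the file.
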
http://git-wip-us.apache.org/repos/asf/arrow-site/blob/9217514c/docs/memory_layout.html
----------------------------------------------------------------------
diff --git a/docs/memory_layout.html b/docs/memory_layout.html
index 16a43ea..0eb8d03 100644
--- a/docs/memory_layout.html
+++ b/docs/memory_layout.html
@@ -161,9 +161,8 @@ from <code class="highlighter-rouge">List&lt;V&gt;</code> iff U and V are differ
 or a fully-specified nested type. When we say slot we mean a relative type
 value, not necessarily any physical storage region.</li>
   <li>Logical type: A data type that is implemented using some relative (physical)
-type. For example, a Decimal value stored in 16 bytes could be stored in a
-primitive array with slot size 16 bytes. Similarly, strings can be stored as
-<code class="highlighter-rouge">List&lt;1-byte&gt;</code>.</li>
+type. For example, Decimal values are stored as 16 bytes in a fixed byte
+size array. Similarly, strings can be stored as <code class="highlighter-rouge">List&lt;1-byte&gt;</code>.</li>
   <li>Parent and child arrays: names to express relationships between physical
 value arrays in a nested type structure. For example, a <code class="highlighter-rouge">List&lt;T&gt;</code>-type parent
 array has a T-type array as its child (see more on lists below).</li>
@@ -752,9 +751,9 @@ the the types array indicates that a slot contains a different type at the index
 <h2 id="dictionary-encoding">Dictionary encoding</h2>
 
 <p>When a field is dictionary encoded, the values are represented by an array of Int32 representing the index of the value in the dictionary.
-The Dictionary is received as a DictionaryBatch whose id is referenced by a dictionary attribute defined in the metadata (<a href="https://github.com/apache/arrow/blob/master/format/Message.fbs">Message.fbs</a>) in the Field table.
-The dictionary has the same layout as the type of the field would dictate. Each entry in the dictionary can be accessed by its index in the DictionaryBatch.
-When a Schema references a Dictionary id, it must send a DictionaryBatch for this id before any RecordBatch.</p>
+The Dictionary is received as one or more DictionaryBatches with the id referenced by a dictionary attribute defined in the metadata (<a href="https://github.com/apache/arrow/blob/master/format/Message.fbs">Message.fbs</a>) in the Field table.
+The dictionary has the same layout as the type of the field would dictate. Each entry in the dictionary can be accessed by its index in the DictionaryBatches.
+When a Schema references a Dictionary id, it must send at least one DictionaryBatch for this id.</p>
 
 <p>As an example, you could have the following data:</p>
 <div class="highlighter-rouge"><pre class="highlight"><code>type: List&lt;String&gt;

http://git-wip-us.apache.org/repos/asf/arrow-site/blob/9217514c/docs/metadata.html
----------------------------------------------------------------------
diff --git a/docs/metadata.html b/docs/metadata.html
index 9e25689..9b12883 100644
--- a/docs/metadata.html
+++ b/docs/metadata.html
@@ -530,7 +530,8 @@ logical type, which have no children) and 3 buffers:</p>
 
 <h3 id="decimal">Decimal</h3>
 
-<p>TBD</p>
+<p>Decimals are represented as a 2’s complement 128-bit (16 byte) signed integer
+in little-endian byte order.</p>
 
 <h3 id="timestamp">Timestamp</h3>
 

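The Decimal wording added to docs/metadata.html above can be illustrated with a short sketch. The helper names `encode_decimal128` and `decode_decimal128` are hypothetical, and the sketch assumes the stored integer is the unscaled value, with any precision or scale handled elsewhere; the diff itself only states the 16-byte two's complement little-endian layout.

```python
def encode_decimal128(unscaled: int) -> bytes:
    """Pack an integer as a 128-bit two's complement little-endian value.

    Assumption: `unscaled` is the decimal's unscaled integer value.
    Raises OverflowError if it does not fit in 128 signed bits.
    """
    return unscaled.to_bytes(16, byteorder="little", signed=True)


def decode_decimal128(raw: bytes) -> int:
    """Unpack 16 little-endian bytes into a signed integer."""
    if len(raw) != 16:
        raise ValueError("a 128-bit decimal occupies exactly 16 bytes")
    return int.from_bytes(raw, byteorder="little", signed=True)


# 12345 is 0x3039, so the least significant byte 0x39 comes first.
print(encode_decimal128(12345).hex())
# -1 in two's complement is all bits set.
print(encode_decimal128(-1).hex())
```

Round-tripping through both helpers returns the original integer for any value in the signed 128-bit range.
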