[GitHub] drill issue #866: DRILL-5657: Implement size-aware result set loader

2017-08-17 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/866
  
Closing as this PR is now superseded by #914.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #866: DRILL-5657: Implement size-aware result set loader

2017-07-25 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/866
  
Let's defer this one so we can focus on the lower layer: the column 
accessors for maps and lists (DRILL-5688). Once that PR is done, we'll come 
back and update this one with those revisions. Please continue to get familiar 
with the concepts here. However, the details will change a bit to allow support 
for repeated maps and lists.




[GitHub] drill issue #866: Drill 5657: Implement size-aware result set loader

2017-07-03 Thread paul-rogers
Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/866
  
This commit provides another two levels of foundation for size-aware vector 
writers in the Drill record readers.

Much of the material below appears in Javadoc throughout the code. But, it 
is gathered here for quick reference to speed the code review.

The PR is broken into commits by layer of function. It may be easier to 
review each commit one by one rather than looking at the whole mess in one big diff.

## Column Accessors

A recent extension to Drill's set of test tools created a "row set" 
abstraction to allow us to create, and verify, record batches with very few 
lines of code. Part of this work involved creating a set of "column accessors" 
in the vector subsystem. Column readers provide a uniform API to obtain data 
from columns (vectors), while column writers provide a uniform writing 
interface.

DRILL-5211 discusses a set of changes to limit value vectors to 16 MB in 
size (to avoid memory fragmentation due to Drill's two memory allocators.) The 
column accessors have proven to be so useful that they will be the basis for 
the new, size-aware writers used by Drill's record readers.

Changes include:

* Implement fill-empties logic for vectors that do not provide it.
* Use the new size-aware methods, throwing vector overflow exceptions which 
can now occur.
* Some fiddling to handle the non-standard names of vector functions.
* Modify strings to use a default type of byte[], but offer a String 
version for convenience.
* Add “finish batch” logic to handle values omitted at the end of a 
batch. (This is a bug in some existing record readers.)
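
As a rough illustration of the fill-empties and "finish batch" items above, here is a minimal sketch. It assumes a simplified variable-width vector made of a data buffer plus an offsets array. The names `SketchVarCharVector` and `FillEmptiesWriter` are hypothetical and are not the classes in this PR; overflow handling is deliberately left out.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a variable-width vector; not Drill's actual vector class. */
class SketchVarCharVector {
    final byte[] data;
    final int[] offsets; // row i occupies data[offsets[i] .. offsets[i + 1])

    SketchVarCharVector(int maxRows, int maxBytes) {
        data = new byte[maxBytes];
        offsets = new int[maxRows + 1];
    }
}

/** Hypothetical writer that back-fills offsets for rows the client skipped. */
class FillEmptiesWriter {
    private final SketchVarCharVector vector;
    private int lastWriteIndex = -1; // last row whose end offset is valid

    FillEmptiesWriter(SketchVarCharVector vector) {
        this.vector = vector;
    }

    /** Default form: raw bytes. No capacity check; overflow is out of scope here. */
    void setBytes(int rowIndex, byte[] value) {
        fillEmpties(rowIndex);
        int start = vector.offsets[rowIndex];
        System.arraycopy(value, 0, vector.data, start, value.length);
        vector.offsets[rowIndex + 1] = start + value.length;
        lastWriteIndex = rowIndex;
    }

    /** Convenience form layered on top of the byte[] version. */
    void setString(int rowIndex, String value) {
        setBytes(rowIndex, value.getBytes(StandardCharsets.UTF_8));
    }

    /** "Finish batch": fill rows omitted at the end of the batch. */
    void finishBatch(int rowCount) {
        fillEmpties(rowCount);
    }

    /** Gives every unwritten row before upToRow a zero-length value. */
    private void fillEmpties(int upToRow) {
        for (int row = lastWriteIndex + 1; row < upToRow; row++) {
            vector.offsets[row + 1] = vector.offsets[row];
        }
        if (upToRow - 1 > lastWriteIndex) {
            lastWriteIndex = upToRow - 1;
        }
    }
}
```

Without the back-fill, the end offsets of skipped rows would keep whatever value the buffer last held, which is the kind of bug the last bullet refers to.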

## Result Set Loader

The second layer of this commit is the new “result set loader.” This 
abstraction is an evolution of the “Mutator” class in the scan batch, when 
used with the existing column writers (which some readers use and others do 
not.)

A result set loader loads a set of tuples (AKA records, rows) from any 
source (such as a record reader) into a set of record batches. The loader:

* Divides records into batches based on a maximum row count or a maximum 
vector size, whichever occurs first. (Later revisions may limit overall batch 
size.)
* Tracks the start and end of each batch.
* Tracks the start and end of each row.
* Provides column loaders to write each column value.
* Handles overflow when a vector becomes full, but the client still must 
finish writing the current row.
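
The batching rule in the first bullet can be sketched as follows. This is a simplified, single-column model under stated assumptions: `BatchingLoaderSketch` and its methods are hypothetical names, "vector size" is approximated by the UTF-8 byte count of the values, and a row that would exceed a limit simply starts the next batch.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical single-column loader that splits rows into batches. */
class BatchingLoaderSketch {
    private final int maxRows;
    private final int maxVectorBytes;
    private final List<List<String>> completed = new ArrayList<>();
    private List<String> current = new ArrayList<>();
    private int currentBytes;

    BatchingLoaderSketch(int maxRows, int maxVectorBytes) {
        this.maxRows = maxRows;
        this.maxVectorBytes = maxVectorBytes;
    }

    /** Writes one row, closing the current batch first if either limit is hit. */
    void writeRow(String value) {
        int size = value.getBytes(StandardCharsets.UTF_8).length;
        boolean rowLimitHit = current.size() >= maxRows;
        boolean sizeLimitHit = currentBytes + size > maxVectorBytes;
        // An empty batch always accepts the row, even an oversized one,
        // so that the loader can make progress.
        if (!current.isEmpty() && (rowLimitHit || sizeLimitHit)) {
            closeBatch();
        }
        current.add(value);
        currentBytes += size;
    }

    /** Ends the result set and returns every batch produced. */
    List<List<String>> finish() {
        if (!current.isEmpty()) {
            closeBatch();
        }
        return completed;
    }

    private void closeBatch() {
        completed.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
    }
}
```

With a limit of 3 rows and 10 bytes, the values `aaaa`, `bbbb`, `cccc`, `d`, `e`, `f`, `g` end up in three batches: the first closes on the size limit and the second on the row limit.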

The original Mutator class divided up responsibilities:

* The Mutator handled the entire record batch.
* An optional VectorContainerWriter wrote each record.

The result set loader follows this same general pattern.

* The result set loader handles the entire record batch (really, a series 
of batches that make up the entire result set: hence the name.)
* The TupleLoader class provides per-tuple services which mostly consist 
of access to the column loaders.
* A tuple schema defines the schema for the result set (see below.)

To hide this complexity from the client, a ResultSetLoader interface 
defines the public API. Then, a ResultSetLoaderImpl class implements the 
interface with all the gory details. Separate classes handle each column, the 
result set schema, and so on.

This class is pretty complex, with a state machine per batch and per 
column, so take your time reviewing it.

## Column Loaders

The column writers are low-level classes that interface between a consumer 
and a value vector. To create the tuple loader we need a higher-level 
abstraction: the column loader. (Note that there is no equivalent for reading 
columns at this time: generated code does the reading in its own special way 
for each operator.)

Column loaders have a number of responsibilities:

* Single class used for all data types. No more casting.
* Transparently handle vector overflow and rollover.
* Provide generic (Object-based) setters, most useful for testing.

Because this commit seeks to prove the concept, the column loader supports 
a subset of types. Adding the other types is simply a matter of copy & paste, 
and will be done once things settle down. For now, the focus is on int and 
Varchar types (though the generic version supports all types.)

To handle vector overflow, each “set” method:

* Tries to write the value into the current vector (using a column writer).
* If overflow occurs, tells the listener (the row set mutator) to create a 
new vector.
* Tries the write a second time using the new vector.
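
The three steps above can be sketched roughly as shown here. All names (`SketchOverflowException`, `SketchOverflowListener`, `SketchIntWriter`, `RetryingColumnLoader`) are hypothetical stand-ins rather than the classes in this PR, and the "vector" is a plain int array with a fixed capacity.

```java
/** Hypothetical checked exception signalling that the vector is full. */
class SketchOverflowException extends Exception {
    SketchOverflowException(String message) {
        super(message);
    }
}

/** Hypothetical callback; the owner of the batch is expected to supply a new vector. */
interface SketchOverflowListener {
    void overflowed();
}

/** Hypothetical low-level writer over a fixed-capacity int "vector". */
class SketchIntWriter {
    private final int capacity;
    private int[] vector;
    private int count;

    SketchIntWriter(int capacity) {
        this.capacity = capacity;
        this.vector = new int[capacity];
    }

    void setInt(int value) throws SketchOverflowException {
        if (count >= capacity) {
            throw new SketchOverflowException("vector full");
        }
        vector[count++] = value;
    }

    /** Swaps in an empty vector. A real loader would first hand off the full one. */
    void rollover() {
        vector = new int[capacity];
        count = 0;
    }

    int valueCount() {
        return count;
    }

    int get(int index) {
        return vector[index];
    }
}

/** Hypothetical column loader: try, notify on overflow, then retry once. */
class RetryingColumnLoader {
    private final SketchIntWriter writer;
    private final SketchOverflowListener listener;

    RetryingColumnLoader(SketchIntWriter writer, SketchOverflowListener listener) {
        this.writer = writer;
        this.listener = listener;
    }

    void setInt(int value) {
        try {
            writer.setInt(value);   // step 1: write into the current vector
            return;
        } catch (SketchOverflowException e) {
            listener.overflowed();  // step 2: ask for a new vector
        }
        try {
            writer.setInt(value);   // step 3: retry against the new vector
        } catch (SketchOverflowException e) {
            throw new IllegalStateException("Write failed even after rollover", e);
        }
    }
}
```

In this sketch the listener is what calls `rollover()`, so the client sees a plain `setInt` call that never exposes the overflow.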

The set of column writers must be synchronized (not in a multi-thread 
sense) on the current row position. As in the row set test utilities, a 
WriterIndex performs this task. (In fact, this is derived from the