[jira] [Commented] (ARROW-3164) [Java] Port Row Set abstraction from Drill to Arrow

2018-09-03 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602498#comment-16602498
 ] 

Paul Rogers commented on ARROW-3164:


Hi [~wesmckinn], I reworded the passage to say, "Arrow's vector code started 
with similar code in Apache Drill." The key point is that the vector data 
structures, and hence the read and write challenges, are similar.

If the goal of Arrow is to provide a toolkit for creating databases, then 
row-to-column "rotation" is a key ingredient, as is solid memory management. As 
Drill has found, some operations in a DB are inherently row-based (because 
databases are designed to deal with rows/objects and their attributes). I'm 
sure that Dremio (which started with Apache Drill code) has wrestled with 
similar issues.
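
To make the "rotation" idea concrete, here is a minimal sketch that uses plain 
Java arrays rather than Arrow vectors. The class and method names 
(RowRotation, toColumns, toRows) are hypothetical, made up for illustration; 
they are not part of Drill or Arrow.

```java
/** Hypothetical illustration only; not Drill's or Arrow's API. */
final class RowRotation {
    private RowRotation() {}

    /** Rotates row-major data (one array per row) into column-major data. */
    static Object[][] toColumns(Object[][] rows, int columnCount) {
        Object[][] columns = new Object[columnCount][rows.length];
        for (int r = 0; r < rows.length; r++) {
            if (rows[r].length != columnCount) {
                throw new IllegalArgumentException(
                    "row " + r + " has " + rows[r].length
                        + " values, expected " + columnCount);
            }
            for (int c = 0; c < columnCount; c++) {
                columns[c][r] = rows[r][c];
            }
        }
        return columns;
    }

    /** Rotates column-major data back into rows; columns must be equal length. */
    static Object[][] toRows(Object[][] columns) {
        int rowCount = columns.length == 0 ? 0 : columns[0].length;
        Object[][] rows = new Object[rowCount][columns.length];
        for (int c = 0; c < columns.length; c++) {
            if (columns[c].length != rowCount) {
                throw new IllegalArgumentException(
                    "column " + c + " has " + columns[c].length
                        + " values, expected " + rowCount);
            }
            for (int r = 0; r < rowCount; r++) {
                rows[r][c] = columns[c][r];
            }
        }
        return rows;
    }
}
```

A real implementation would write into typed, off-heap vectors instead of 
Object arrays, which is where the memory management concerns come in.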

Thanks for the heads-up on the evolved project goals. I'm currently in 
"drinking from the firehose" mode in catching up on the great progress that has 
been made, such as the continued evolution of the metadata structures from what 
I saw six months ago.

All that said, it seems reasonable that row-based reading and writing are 
essential, though we'll want to work out the right set of details for the Arrow 
context.

For example, one topic sure to come up is the existing Arrow (and Drill) 
"complex readers and writers." For now, let's simply acknowledge that these 
abstractions exist, and that they were one of the inspirations for the Row Set 
abstractions.

> [Java] Port Row Set abstraction from Drill to Arrow
> ---
>
> Key: ARROW-3164
> URL: https://issues.apache.org/jira/browse/ARROW-3164
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Java
> Reporter: Paul Rogers
> Priority: Major
>
> Arrow is a great way to exchange data between systems. Somewhere in the 
> process, however, data must be loaded into, and read out of, the Arrow vectors.
> Arrow's vector code started with similar code in Apache Drill. The Drill 
> project created a "Row Set" abstraction that:
>  * Provides a simple way to define the schema for a set of batches.
>  * Loads data into vectors from row-oriented inputs.
>  * Reads data out of vectors in row-oriented output.
>  * Controls memory consumed by the record batch when loading data into 
> vectors.
>  * Ensures maximum usage of the allocated vector space when loading data into 
> vectors.
>  * Optionally handles projection when reading data from an input file into a 
> set of vectors.
>  * Optionally handles data conversion from input to vector formats.
> This mechanism is handy for any Java developer who produces or consumes Arrow 
> vectors.
> Detailed information is available in [this 
> wiki|https://github.com/paul-rogers/arrow/wiki], including a more detailed 
> description of the motivation for this project, and an analysis of the work 
> required to do the Drill-to-Arrow port.
> The code is in Java simply because Drill is written in Java. The same 
> mechanisms can be ported to other languages if useful. Those ports would be 
> separate future projects.
> The code will be placed in a new Java module which can be imported by 
> projects that wish to use the code. Changes may be needed to expose items 
> from the {{vector}} module; we'll tackle those issues if/when they occur.
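
As a rough illustration of the "controls memory consumed by the record batch" 
item in the description above, the sketch below closes a batch once adding the 
next value would exceed a byte budget. It is not the Drill Row Set code: the 
class name BoundedBatchWriter, its methods, and the use of UTF-8 length as the 
size measure are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not Drill's or Arrow's API. */
final class BoundedBatchWriter {
    private final int byteLimit;
    private final List<List<String>> completed = new ArrayList<>();
    private List<String> current = new ArrayList<>();
    private int currentBytes = 0;

    BoundedBatchWriter(int byteLimit) {
        if (byteLimit <= 0) {
            throw new IllegalArgumentException("byteLimit must be positive");
        }
        this.byteLimit = byteLimit;
    }

    /** Adds one value, starting a new batch first if the budget would overflow. */
    void write(String value) {
        int size = value.getBytes(StandardCharsets.UTF_8).length;
        // A single oversized value still gets a batch of its own.
        if (!current.isEmpty() && currentBytes + size > byteLimit) {
            closeBatch();
        }
        current.add(value);
        currentBytes += size;
    }

    /** Closes the in-progress batch (if any) and returns all batches. */
    List<List<String>> finish() {
        closeBatch();
        return completed;
    }

    private void closeBatch() {
        if (!current.isEmpty()) {
            completed.add(current);
            current = new ArrayList<>();
            currentBytes = 0;
        }
    }
}
```

With a limit of 10 bytes, writing "aaaa", "bbbb", "cccc", "dd" yields two 
batches: ["aaaa", "bbbb"] and ["cccc", "dd"].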



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3164) [Java] Port Row Set abstraction from Drill to Arrow

2018-09-03 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602492#comment-16602492
 ] 

Wes McKinney commented on ARROW-3164:

Sounds like a useful initiative. We've already developed some rows-to-columns 
functionality in C++, and it would be great to expand beyond what we have now, 
particularly around creating neatly-sized record batches. It would be useful to 
be able to quickly convert to Protobuf or Avro-encoded row data, and back. 

One minor point though:

> Arrow evolved from Apache Drill. 

This isn't quite accurate. Java code from Apache Drill formed the basis for the 
initial Java codebase in Apache Arrow. I wouldn't say that the project evolved 
from Apache Drill itself. The project was created by a confluence of open 
source projects wishing to define an open standard for in-memory columnar data 
as its first project, with the broader goal of creating reusable libraries for 
creating database-like systems (we have been calling it "the deconstructed 
database"). It so happened that Drill's ValueVectors were already very 
close to the fully-shredded columnar model that the community desired, and 
provided a good starting point. The scope of the project has evolved 
significantly in the meantime.

> [Java] Port Row Set abstraction from Drill to Arrow
> ---
>
> Key: ARROW-3164
> URL: https://issues.apache.org/jira/browse/ARROW-3164
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Java
> Reporter: Paul Rogers
> Priority: Major
>
> Arrow is a great way to exchange data between systems. Somewhere in the 
> process, however, data must be loaded into, and read out of, the Arrow vectors.
> Arrow evolved from Apache Drill. The Drill project created a "Row Set" 
> abstraction that:
> * Provides a simple way to define the schema for a set of batches.
> * Loads data into vectors from row-oriented inputs.
> * Reads data out of vectors in row-oriented output.
> * Controls memory consumed by the record batch when loading data into vectors.
> * Ensures maximum usage of the allocated vector space when loading data into 
> vectors.
> * Optionally handles projection when reading data from an input file into a 
> set of vectors.
> * Optionally handles data conversion from input to vector formats.
> This mechanism is handy for any Java developer who produces or consumes Arrow 
> vectors.
> Detailed information is available in [this 
> wiki|https://github.com/paul-rogers/arrow/wiki], including a more detailed 
> description of the motivation for this project, and an analysis of the work 
> required to do the Drill-to-Arrow port.
> The code is in Java simply because Drill is written in Java. The same 
> mechanisms can be ported to other languages if useful. Those ports would be 
> separate future projects.
> The code will be placed in a new Java module which can be imported by 
> projects that wish to use the code. Changes may be needed to expose items 
> from the {{vector}} module; we'll tackle those issues if/when they occur.


