[jira] [Commented] (ARROW-3267) [Python] Create empty table from schema

2018-09-19 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620933#comment-16620933
 ] 

Paul Rogers commented on ARROW-3267:


Yes, that's where Drill started also, and it is what step 2 in the previous note 
does.

I suspect you'll find that, once you have a function, you'll want an easy way 
to create the schema (step 1).

Then, unless a mechanism already exists, if you watch allocation logging, 
you'll see vector doublings you can avoid. So you'll soon want to optimize 
allocation performance by providing size hints. The size hints can be a 
separate bundle of data, or can be part of the schema passed to the empty_table 
function. (You might want to have an allocate_table function that creates the 
table and allocates vectors.)
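
To make the cost of doublings concrete, here is a minimal sketch in plain Python. It only models a buffer that doubles when full; the function name count_doublings and the 4096-value starting capacity are assumptions for illustration, not Arrow or Drill defaults.

```python
def count_doublings(values_written, initial_capacity):
    """Model a buffer that doubles when full; return (doublings, final capacity)."""
    capacity = max(1, initial_capacity)
    doublings = 0
    while capacity < values_written:
        capacity *= 2          # each doubling implies a reallocation and a copy
        doublings += 1
    return doublings, capacity


rows = 100_000
# Hypothetical default starting capacity, with no size hint available.
no_hint = count_doublings(rows, initial_capacity=4096)
# With a size hint equal to the expected row count, no growth is needed.
with_hint = count_doublings(rows, initial_capacity=rows)
```

With these numbers the unhinted buffer doubles five times and ends up larger than needed, while the hinted buffer is allocated once at the requested size.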

Sounds like you've not hit these issues yet; but keep this in mind if/when you 
do.
 

> [Python] Create empty table from schema
> ---
>
> Key: ARROW-3267
> URL: https://issues.apache.org/jira/browse/ARROW-3267
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Fix For: 0.11.0
>
>
> When one knows the expected schema for a data pipeline's input data but has 
> no input data, it is necessary to construct an empty table as a sentinel 
> value to pass through.
> This is a small but often useful convenience function.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3267) [Python] Create empty table from schema

2018-09-19 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620873#comment-16620873
 ] 

Paul Rogers commented on ARROW-3267:


FWIW, ARROW-3164 describes a port of a "row set mechanism" from Apache Drill 
that does exactly this. There are three relevant components:

1. A fluent schema builder to define the schema.
2. The schema definition itself, which includes both scalar and "complex" types.
3. A "row set" (vector batch) builder to build vectors from schema.

Drill found that it was helpful to have additional metadata in the schema, such 
as expected width for VARCHAR columns, expected cardinality for arrays, and 
expected types for unions.

The row set builder could then optionally allocate vector buffers at the 
approximate desired size, which avoided the need to double vectors repeatedly 
as they are written.
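
As a rough illustration of the idea (not the Drill or Arrow API), the sketch below shows a fluent schema builder that carries a width hint for VARCHAR columns, plus a helper that turns the hints into per-column data-buffer size estimates. All names here (ColumnHint, SchemaBuilder, estimate_data_bytes, the type-name strings, and the default width of 50) are hypothetical, and offset and validity buffers are ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnHint:
    name: str
    type_name: str                         # e.g. "int64", "varchar"
    expected_width: Optional[int] = None   # average bytes per value (VARCHAR only)


class SchemaBuilder:
    """Fluent builder: each add() returns the builder so calls can be chained."""

    def __init__(self):
        self._columns = []

    def add(self, name, type_name, expected_width=None):
        self._columns.append(ColumnHint(name, type_name, expected_width))
        return self

    def build(self):
        return tuple(self._columns)


# Byte widths of the fixed-width types used in this sketch.
FIXED_WIDTHS = {"int32": 4, "int64": 8, "float64": 8}


def estimate_data_bytes(schema, row_count, default_varchar_width=50):
    """Estimate the data-buffer size per column for the expected row count."""
    sizes = {}
    for col in schema:
        if col.type_name == "varchar":
            width = col.expected_width or default_varchar_width
        else:
            width = FIXED_WIDTHS[col.type_name]
        sizes[col.name] = width * row_count
    return sizes


schema = (
    SchemaBuilder()
    .add("id", "int64")
    .add("name", "varchar", expected_width=20)
    .build()
)
sizes = estimate_data_bytes(schema, row_count=1000)
```

A builder of this shape could hand the estimates to whatever allocates the vectors, so each buffer starts near its final size.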

The rest of the mechanism provides a means to write to, or read from vectors, 
which is beyond the scope of this particular ticket.

This ticket talks about Python, so the Java row set code is not directly 
applicable. Still, feel free to borrow ideas. Also, perhaps we can coordinate to 
establish a common approach across languages.



[jira] [Commented] (ARROW-3164) [Java] Port Row Set abstraction from Drill to Arrow

2018-09-03 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602498#comment-16602498
 ] 

Paul Rogers commented on ARROW-3164:


Hi [~wesmckinn], I reworded the passage to say, "Arrow's vector code started 
with similar code in Apache Drill." The key point is that the vector data 
structures, and hence the read and write challenges, are similar.

If the goal of Arrow is to provide a toolkit for creating databases, then 
row-to-column "rotation" is a key ingredient, as is solid memory management. As 
Drill has found, some operations in a DB are inherently row-based (because 
databases are designed to deal with rows/objects and their attributes.) I'm 
sure that Dremio (which started with Apache Drill code) has wrestled with 
similar issues.
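
For readers unfamiliar with the term, the following is a minimal plain-Python sketch of row-to-column "rotation". Lists stand in for vectors and None stands in for a null slot; the function name rows_to_columns is made up for this example and is not part of Arrow or Drill.

```python
def rows_to_columns(rows, column_names):
    """Rotate a list of row dicts into one list ("vector") per column."""
    columns = {name: [] for name in column_names}
    for row in rows:
        for name in column_names:
            # A missing attribute becomes None, standing in for a null slot.
            columns[name].append(row.get(name))
    return columns


rows = [
    {"id": 1, "name": "ann"},
    {"id": 2},                    # no "name" attribute on this row
    {"id": 3, "name": "cy"},
]
columns = rows_to_columns(rows, ["id", "name"])
```

Every column ends up with the same length, which is the invariant a real vector writer also has to maintain when a row omits a value.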

Thanks for the heads-up on the evolved project goals. I'm currently in 
"drinking from the firehose" mode in catching up on the great progress that has 
been made, such as the continued evolution of the metadata structures from what 
I saw six months ago.

All that said, it seems reasonable that row-based reading and writing is 
essential; though we'll want to work out the right set of details for the Arrow 
context.

For example, one topic sure to come up is the existing Arrow (and Drill) 
"complex readers and writers." For now, let's simply acknowledge that these 
abstractions exist and were one of the inspirations for the Row Set 
abstractions.

> [Java] Port Row Set abstraction from Drill to Arrow
> ---
>
> Key: ARROW-3164
> URL: https://issues.apache.org/jira/browse/ARROW-3164
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Java
> Reporter: Paul Rogers
> Priority: Major
>
> Arrow is a great way to exchange data between systems. Somewhere in the 
> process, however, data must be loaded into, and read out of, the Arrow vectors.
> Arrow's vector code started with similar code in Apache Drill. The Drill 
> project created a "Row Set" abstraction that:
>  * Provides a simple way to define the schema for a set of batches.
>  * Loads data into vectors from row-oriented inputs.
>  * Reads data out of vectors in row-oriented output.
>  * Controls memory consumed by the record batch when loading data into 
> vectors.
>  * Ensures maximum usage of the allocated vector space when loading data into 
> vectors.
>  * Optionally handles projection when reading data from an input file into a 
> set of vectors.
>  * Optionally handles data conversion from input to vector formats.
> This mechanism is handy for any Java developer who produces or consumes Arrow 
> vectors.
> Detailed information is available in [this 
> wiki|https://github.com/paul-rogers/arrow/wiki], including a more detailed 
> description of the motivation for this project, and an analysis of the work 
> required to do the Drill-to-Arrow port.
> The code is in Java simply because Drill is written in Java. The same 
> mechanisms can be ported to other languages if useful. Those ports would be 
> separate future projects.
> The code will be placed in a new Java module which can be imported by 
> projects that wish to use the code. Changes may be needed to expose items 
> from the {{vector}} module; we'll tackle those issues if/when they occur.





[jira] [Updated] (ARROW-3164) [Java] Port Row Set abstraction from Drill to Arrow

2018-09-03 Thread Paul Rogers (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated ARROW-3164:
---
Description: 
Arrow is a great way to exchange data between systems. Somewhere in the 
process, however, data must be loaded into, and read out of, the Arrow vectors.

Arrow's vector code started with similar code in Apache Drill. The Drill 
project created a "Row Set" abstraction that:
 * Provides a simple way to define the schema for a set of batches.
 * Loads data into vectors from row-oriented inputs.
 * Reads data out of vectors in row-oriented output.
 * Controls memory consumed by the record batch when loading data into vectors.
 * Ensures maximum usage of the allocated vector space when loading data into 
vectors.
 * Optionally handles projection when reading data from an input file into a 
set of vectors.
 * Optionally handles data conversion from input to vector formats.

This mechanism is handy for any Java developer who produces or consumes Arrow 
vectors.

Detailed information is available in [this 
wiki|https://github.com/paul-rogers/arrow/wiki], including a more detailed 
description of the motivation for this project, and an analysis of the work 
required to do the Drill-to-Arrow port.

The code is in Java simply because Drill is written in Java. The same 
mechanisms can be ported to other languages if useful. Those ports would be 
separate future projects.

The code will be placed in a new Java module which can be imported by projects 
that wish to use the code. Changes may be needed to expose items from the 
{{vector}} module; we'll tackle those issues if/when they occur.

  was:
Arrow is a great way to exchange data between systems. Somewhere in the 
process, however, data must be loaded into, and read out of, the Arrow vectors.

Arrow evolved from Apache Drill. The Drill project created a "Row Set" 
abstraction that:

* Provides a simple way to define the schema for a set of batches.
* Loads data into vectors from row-oriented inputs.
* Reads data out of vectors in row-oriented output.
* Controls memory consumed by the record batch when loading data into vectors.
* Ensures maximum usage of the allocated vector space when loading data into 
vectors.
* Optionally handles projection when reading data from an input file into a set 
of vectors.
* Optionally handles data conversion from input to vector formats.

This mechanism is handy for any Java developer who produces or consumes Arrow 
vectors.

Detailed information is available in [this 
wiki|https://github.com/paul-rogers/arrow/wiki], including a more detailed 
description of the motivation for this project, and an analysis of the work 
required to do the Drill-to-Arrow port.

The code is in Java simply because Drill is written in Java. The same 
mechanisms can be ported to other languages if useful. Those ports would be 
separate future projects.

The code will be placed in a new Java module which can be imported by projects 
that wish to use the code. Changes may be needed to expose items from the 
{{vector}} module; we'll tackle those issues if/when they occur.



[jira] [Created] (ARROW-3164) Port Row Set abstraction from Drill to Arrow

2018-09-03 Thread Paul Rogers (JIRA)
Paul Rogers created ARROW-3164:
--

 Summary: Port Row Set abstraction from Drill to Arrow
 Key: ARROW-3164
 URL: https://issues.apache.org/jira/browse/ARROW-3164
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Paul Rogers


Arrow is a great way to exchange data between systems. Somewhere in the 
process, however, data must be loaded into, and read out of, the Arrow vectors.

Arrow evolved from Apache Drill. The Drill project created a "Row Set" 
abstraction that:

* Provides a simple way to define the schema for a set of batches.
* Loads data into vectors from row-oriented inputs.
* Reads data out of vectors in row-oriented output.
* Controls memory consumed by the record batch when loading data into vectors.
* Ensures maximum usage of the allocated vector space when loading data into 
vectors.
* Optionally handles projection when reading data from an input file into a set 
of vectors.
* Optionally handles data conversion from input to vector formats.

This mechanism is handy for any Java developer who produces or consumes Arrow 
vectors.

Detailed information is available in [this 
wiki|https://github.com/paul-rogers/arrow/wiki], including a more detailed 
description of the motivation for this project, and an analysis of the work 
required to do the Drill-to-Arrow port.

The code is in Java simply because Drill is written in Java. The same 
mechanisms can be ported to other languages if useful. Those ports would be 
separate future projects.

The code will be placed in a new Java module which can be imported by projects 
that wish to use the code. Changes may be needed to expose items from the 
{{vector}} module; we'll tackle those issues if/when they occur.





[jira] [Commented] (ARROW-3151) [C++] Create Protocol Buffers interface for iterating over the semantic "rows" of a record batch, and accessing the rows using the protobuf API

2018-08-30 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597920#comment-16597920
 ] 

Paul Rogers commented on ARROW-3151:


Let's see if we can coordinate on this. I'm starting work on a proposal for a 
"RowSet" interface to be ported over from Drill that provides a simple 
row-based API to read from, and write to, vectors. On the write side, the 
mechanism also enforces memory limits, which is the key reason Drill created 
the "RowSet" abstraction.
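
One way to picture the memory-limit behavior is the plain-Python sketch below: a writer that closes the current batch and starts a new one when the next row would exceed a byte budget. This is an illustration under stated assumptions, not the Drill RowSet code; the class name BoundedBatchWriter, its methods, and the UTF-8 length used as a size estimate are all hypothetical.

```python
class BoundedBatchWriter:
    """Accumulate rows into column lists, one batch per byte budget."""

    def __init__(self, column_names, max_batch_bytes):
        self.column_names = list(column_names)
        self.max_batch_bytes = max_batch_bytes
        self.completed = []          # finished batches, oldest first
        self._start_batch()

    def _start_batch(self):
        self._columns = {name: [] for name in self.column_names}
        self._rows = 0
        self._bytes = 0

    def write_row(self, row):
        # Approximate the row size as the UTF-8 length of each value's text.
        row_bytes = sum(
            len(str(row[name]).encode("utf-8")) for name in self.column_names
        )
        # Close the current batch first if this row would overflow the budget.
        # A single oversized row is still written, alone in its own batch.
        if self._rows and self._bytes + row_bytes > self.max_batch_bytes:
            self.flush()
        for name in self.column_names:
            self._columns[name].append(row[name])
        self._rows += 1
        self._bytes += row_bytes

    def flush(self):
        """Move the current batch to `completed` if it holds any rows."""
        if self._rows:
            self.completed.append(self._columns)
            self._start_batch()


writer = BoundedBatchWriter(["k", "v"], max_batch_bytes=10)
for _ in range(5):
    writer.write_row({"k": "a", "v": "bcd"})   # about 4 bytes per row
writer.flush()
batch_row_counts = [len(batch["k"]) for batch in writer.completed]
```

Here five 4-byte rows against a 10-byte budget come out as batches of two, two, and one rows.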

Given that this project will need a way to assemble a row from a bundle of 
vectors, the "columnar-to-row" mechanism of RowSet might be a way to populate 
the row buffer.

On the other hand, the RowSet code from Drill is in Java, while this ticket is 
C++. Still, it might make sense to port the mechanism to C++ so it can be used 
in multiple contexts.

Are there any background docs I could read to better understand the project 
context and determine whether the above makes sense here? 
Thanks.

> [C++] Create Protocol Buffers interface for iterating over the semantic 
> "rows" of a record batch, and accessing the rows using the protobuf API
> ---
>
> Key: ARROW-3151
> URL: https://issues.apache.org/jira/browse/ARROW-3151
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Priority: Major
>
> The desired workflow:
> * User writes a .proto file describing the structure of a "row" as a Message
> * Given the generated pb.h bindings, an Arrow user can iterate over an 
> {{arrow::RecordBatch}}, each iteration populating an instance of the Row 
> message
> * The values of the row can then be accessed via the standard Protobuf APIs
> A corresponding interface could be developed to write a RecordBatch using 
> protobufs as input, but that could be its own project
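
The iteration pattern in the workflow above can be sketched in plain Python, even though the ticket itself targets C++. In this sketch a dataclass named Row stands in for a generated Protobuf message and a dict of lists stands in for a record batch; both, along with the function iter_rows, are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """Stand-in for a generated Protobuf message type (hypothetical)."""
    id: int = 0
    name: str = ""


def iter_rows(columns, row):
    """Yield `row` once per record, refilled from the column lists each time."""
    names = list(columns)
    length = len(columns[names[0]]) if names else 0
    for i in range(length):
        for name in names:
            setattr(row, name, columns[name][i])
        # The same object is reused, so callers must copy what they keep.
        yield row


batch = {"id": [1, 2, 3], "name": ["ann", "bo", "cy"]}
seen = [(r.id, r.name) for r in iter_rows(batch, Row())]
```

Reusing one row object per iteration mirrors the "each iteration populating an instance of the Row message" wording and avoids allocating a message per record.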


