[jira] [Commented] (ARROW-3267) [Python] Create empty table from schema
[ https://issues.apache.org/jira/browse/ARROW-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620933#comment-16620933 ] Paul Rogers commented on ARROW-3267:

Yes, that's where Drill started also, and it is what step 2 in the previous note does. I suspect that, once you have such a function, you'll want an easy way to create the schema (step 1). Then, unless a mechanism already exists, watching allocation logging will show you vector doublings you can avoid, so you'll soon want to optimize allocation performance by providing size hints. The size hints can be a separate bundle of data, or can be part of the schema passed to the empty_table function. (You might also want an allocate_table function that creates the table and allocates its vectors.) It sounds like you haven't hit these issues yet, but keep them in mind if/when you do.

> [Python] Create empty table from schema
> ---
> Key: ARROW-3267
> URL: https://issues.apache.org/jira/browse/ARROW-3267
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Fix For: 0.11.0
>
> When one knows the expected schema for its input data but has no input data for a data pipeline, it is necessary to construct an empty table as a sentinel value to pass through.
> This is a small but often useful convenience function.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3267) [Python] Create empty table from schema
[ https://issues.apache.org/jira/browse/ARROW-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620873#comment-16620873 ] Paul Rogers commented on ARROW-3267:

FWIW, ARROW-3164 describes a port of a "row set mechanism" from Apache Drill that does exactly this. There are three relevant components:
1. A fluent schema builder to define the schema.
2. The schema definition itself, which includes both scalar and "complex" types.
3. A "row set" (vector batch) builder to build vectors from the schema.

Drill found it helpful to have additional metadata in the schema, such as the expected width for VARCHAR columns, the expected cardinality for arrays, and the expected types for unions. The row set builder could then optionally allocate vector buffers at approximately the desired size, which avoided the need to double vectors repeatedly as they are written. The rest of the mechanism provides a means to write to, or read from, vectors, which is beyond the scope of this particular ticket.

This ticket is about Python, so the Java row set code is not directly applicable. Still, feel free to borrow ideas. Also, perhaps we can coordinate to establish a common approach across languages.
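The first component above, a fluent schema builder carrying size hints, could be sketched roughly as follows. This is a hypothetical illustration in plain Python; the class and parameter names (SchemaBuilder, width, cardinality) are invented here, not Drill's actual API:

```python
class Column:
    """Schema entry carrying the optional allocation hints described above."""
    def __init__(self, name, typ, width=None, cardinality=None):
        self.name, self.type = name, typ
        self.width = width              # expected byte width for VARCHAR-like columns
        self.cardinality = cardinality  # expected element count for array columns

class SchemaBuilder:
    """Fluent builder: each add() returns self so calls can be chained."""
    def __init__(self):
        self.columns = []

    def add(self, name, typ, **hints):
        self.columns.append(Column(name, typ, **hints))
        return self

    def build(self):
        return self.columns

# A row-set builder could read these hints to pre-size vector buffers.
schema = (SchemaBuilder()
          .add("id", "int64")
          .add("name", "varchar", width=20)  # hint: ~20 bytes per value
          .build())
```

The design point is that the hints live alongside the type in the schema, so the batch builder can size buffers up front instead of doubling them as data arrives.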
[jira] [Commented] (ARROW-3164) [Java] Port Row Set abstraction from Drill to Arrow
[ https://issues.apache.org/jira/browse/ARROW-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602498#comment-16602498 ] Paul Rogers commented on ARROW-3164:

Hi [~wesmckinn], I reworded the passage to say, "Arrow's vector code started with similar code in Apache Drill." The key point is that the vector data structures, and hence the read and write challenges, are similar. If the goal of Arrow is to provide a toolkit for creating databases, then row-to-column "rotation" is a key ingredient, as is solid memory management. As Drill has found, some operations in a DB are inherently row-based (because databases are designed to deal with rows/objects and their attributes). I'm sure that Dremio (which started with Apache Drill code) has wrestled with similar issues.

Thanks for the heads-up on the evolved project goals. I'm currently in "drinking from the firehose" mode, catching up on the great progress that has been made, such as the continued evolution of the metadata structures since what I saw six months ago. All that said, it seems reasonable that row-based reading and writing is essential, though we'll want to work out the right set of details for the Arrow context. For example, one topic sure to come up is the existing Arrow (and Drill) "complex readers and writers." For now, let's simply acknowledge that these abstractions exist and were one of the inspirations for the Row Set abstractions.

> [Java] Port Row Set abstraction from Drill to Arrow
> ---
> Key: ARROW-3164
> URL: https://issues.apache.org/jira/browse/ARROW-3164
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Java
> Reporter: Paul Rogers
> Priority: Major
>
> Arrow is a great way to exchange data between systems. Somewhere in the process, however, data must be loaded into, and read out of, the Arrow vectors. Arrow's vector code started with similar code in Apache Drill.
> The Drill project created a "Row Set" abstraction that:
> * Provides a simple way to define the schema for a set of batches.
> * Loads data into vectors from row-oriented inputs.
> * Reads data out of vectors in row-oriented output.
> * Controls memory consumed by the record batch when loading data into vectors.
> * Ensures maximum usage of the allocated vector space when loading data into vectors.
> * Optionally handles projection when reading data from an input file into a set of vectors.
> * Optionally handles data conversion from input to vector formats.
> This mechanism is handy for any Java developer who produces or consumes Arrow vectors.
> Detailed information is available in [this wiki|https://github.com/paul-rogers/arrow/wiki], including a more detailed description of the motivation for this project and an analysis of the work required to do the Drill-to-Arrow port.
> The code is in Java simply because Drill is written in Java. The same mechanisms can be ported to other languages if useful. Those ports would be separate future projects.
> The code will be placed in a new Java module which can be imported by projects that wish to use the code. Changes may be needed to expose items from the {{vector}} module; we'll tackle those issues if/when they occur.
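The core "loads data into vectors from row-oriented inputs" step is a row-to-column rotation. A toy illustration in plain Python (function and variable names invented here; real row-set code works against typed Arrow vectors and enforces memory limits, which this sketch omits):

```python
def rows_to_columns(names, rows):
    """Rotate row-oriented input into one buffer per column."""
    columns = {name: [] for name in names}
    for row in rows:
        # Distribute each row's values across the per-column buffers.
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns

batch = rows_to_columns(["id", "name"], [(1, "a"), (2, "b")])
```

In the real abstraction this loop is where the memory-control bullets apply: the writer checks buffer capacity per value and closes out the batch when a size limit is reached, rather than growing buffers without bound.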
[jira] [Updated] (ARROW-3164) [Java] Port Row Set abstraction from Drill to Arrow
[ https://issues.apache.org/jira/browse/ARROW-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers updated ARROW-3164:
---
Description:
Arrow is a great way to exchange data between systems. Somewhere in the process, however, data must be loaded into, and read out of, the Arrow vectors. Arrow's vector code started with similar code in Apache Drill. The Drill project created a "Row Set" abstraction that:
* Provides a simple way to define the schema for a set of batches.
* Loads data into vectors from row-oriented inputs.
* Reads data out of vectors in row-oriented output.
* Controls memory consumed by the record batch when loading data into vectors.
* Ensures maximum usage of the allocated vector space when loading data into vectors.
* Optionally handles projection when reading data from an input file into a set of vectors.
* Optionally handles data conversion from input to vector formats.
This mechanism is handy for any Java developer who produces or consumes Arrow vectors. Detailed information is available in [this wiki|https://github.com/paul-rogers/arrow/wiki], including a more detailed description of the motivation for this project and an analysis of the work required to do the Drill-to-Arrow port. The code is in Java simply because Drill is written in Java. The same mechanisms can be ported to other languages if useful. Those ports would be separate future projects. The code will be placed in a new Java module which can be imported by projects that wish to use the code. Changes may be needed to expose items from the {{vector}} module; we'll tackle those issues if/when they occur.

was: the same description, except the third sentence read "Arrow evolved from Apache Drill."
[jira] [Created] (ARROW-3164) Port Row Set abstraction from Drill to Arrow
Paul Rogers created ARROW-3164:
---
Summary: Port Row Set abstraction from Drill to Arrow
Key: ARROW-3164
URL: https://issues.apache.org/jira/browse/ARROW-3164
Project: Apache Arrow
Issue Type: Improvement
Components: Java
Reporter: Paul Rogers

Arrow is a great way to exchange data between systems. Somewhere in the process, however, data must be loaded into, and read out of, the Arrow vectors. Arrow evolved from Apache Drill. The Drill project created a "Row Set" abstraction that:
* Provides a simple way to define the schema for a set of batches.
* Loads data into vectors from row-oriented inputs.
* Reads data out of vectors in row-oriented output.
* Controls memory consumed by the record batch when loading data into vectors.
* Ensures maximum usage of the allocated vector space when loading data into vectors.
* Optionally handles projection when reading data from an input file into a set of vectors.
* Optionally handles data conversion from input to vector formats.
This mechanism is handy for any Java developer who produces or consumes Arrow vectors. Detailed information is available in [this wiki|https://github.com/paul-rogers/arrow/wiki], including a more detailed description of the motivation for this project and an analysis of the work required to do the Drill-to-Arrow port. The code is in Java simply because Drill is written in Java. The same mechanisms can be ported to other languages if useful. Those ports would be separate future projects. The code will be placed in a new Java module which can be imported by projects that wish to use the code. Changes may be needed to expose items from the {{vector}} module; we'll tackle those issues if/when they occur.
[jira] [Commented] (ARROW-3151) [C++] Create Protocol Buffers interface for iterating over the semantic "rows" of a record batch, and accessing the rows using the protobuf API
[ https://issues.apache.org/jira/browse/ARROW-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597920#comment-16597920 ] Paul Rogers commented on ARROW-3151:

Let's see if we can coordinate on this. I'm starting work on a proposal for a "RowSet" interface to be ported over from Drill that provides a simple row-based API to read from, and write to, vectors. On the write side, the mechanism also enforces memory limits, which is the key reason Drill created the "RowSet" abstraction. Given that this project will need a way to assemble a row from a bundle of vectors, the "columnar-to-row" mechanism of RowSet might be a way to populate the row buffer. On the other hand, the RowSet code from Drill is in Java and this project is C++. Still, it might make sense to port the mechanism to C++ so it can be used in multiple contexts. Are there any background docs I could read to get a better understanding of the project context, to determine whether the above makes sense here? Thanks.

> [C++] Create Protocol Buffers interface for iterating over the semantic "rows" of a record batch, and accessing the rows using the protobuf API
> ---
> Key: ARROW-3151
> URL: https://issues.apache.org/jira/browse/ARROW-3151
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Priority: Major
>
> The desired workflow:
> * User writes a .proto file describing the structure of a "row" as a Message
> * Given the generated pb.h bindings, an Arrow user can iterate over an {{arrow::RecordBatch}}, each iteration populating an instance of the Row message
> * The values of the row can then be accessed via the standard Protobuf APIs
> A corresponding interface could be developed to write a RecordBatch using protobufs as input, but that could be its own project
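The columnar-to-row workflow described above can be sketched in plain Python, with a namedtuple standing in for the generated protobuf Row message (names here are invented for illustration; the actual proposal is C++ against arrow::RecordBatch):

```python
from collections import namedtuple

def iter_rows(columns):
    """Iterate the semantic rows of column-oriented data.

    Each iteration populates one row object, analogous to filling a
    protobuf Message instance per row of a record batch.
    """
    Row = namedtuple("Row", columns.keys())
    for values in zip(*columns.values()):
        yield Row(*values)

rows = list(iter_rows({"id": [1, 2], "name": ["a", "b"]}))
```

Fields are then accessed by name on each row object, much as the ticket describes accessing values through the standard protobuf accessors.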