[jira] [Commented] (ARROW-3267) [Python] Create empty table from schema
[ https://issues.apache.org/jira/browse/ARROW-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16621664#comment-16621664 ]

Uwe L. Korn commented on ARROW-3267:

[~Paul.Rogers] We already have the necessary builder infrastructure; this function is mainly meant to provide something to pass around when there is no data. Also, the {{Table}} instance is not meant to be modified, i.e. it will stay empty all along the pipeline.

> [Python] Create empty table from schema
> ---
>
> Key: ARROW-3267
> URL: https://issues.apache.org/jira/browse/ARROW-3267
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Fix For: 0.11.0
>
>
> When one knows the expected schema for its input data but has no input data
> for a data pipeline, it is necessary to construct an empty table as a
> sentinel value to pass through.
> This is a small but often useful convenience function.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3267) [Python] Create empty table from schema
[ https://issues.apache.org/jira/browse/ARROW-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620933#comment-16620933 ]

Paul Rogers commented on ARROW-3267:

Yes, that's where Drill started too, and it is what step 2 in the previous note does. I suspect you'll find that, once you have a function, you'll want an easy way to create the schema (step 1). Then, unless a mechanism already exists, if you watch allocation logging, you'll see vector doublings that you could avoid. So you will soon want to optimize allocation performance by providing size hints. The size hints can be a separate bunch of data, or can be part of the schema passed to the empty_table function. (You might want to have an allocate_table function that creates the table and allocates vectors.)

Sounds like you have not hit these issues yet, but keep this in mind if/when you do.
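To make the point about vector doublings concrete, here is a minimal sketch in plain Python. It does not use any Arrow or Drill API; the function name count_doublings and the starting capacity of 16 are assumptions chosen only for illustration. It counts how often a buffer that grows by doubling has to be reallocated, with and without a size hint.

```python
def count_doublings(num_values: int, initial_capacity: int) -> int:
    """Return how many times a doubling buffer must grow to hold num_values."""
    if initial_capacity < 1:
        raise ValueError("initial_capacity must be at least 1")
    capacity = initial_capacity
    doublings = 0
    # Each pass models one reallocation that doubles the buffer capacity.
    while capacity < num_values:
        capacity *= 2
        doublings += 1
    return doublings


# Without a hint the buffer starts small (16 slots is an arbitrary assumption).
no_hint = count_doublings(100_000, initial_capacity=16)
# With a size hint equal to the expected row count, no growth is needed.
with_hint = count_doublings(100_000, initial_capacity=100_000)
```

Under these assumptions the unhinted buffer is reallocated repeatedly before it can hold all values, while the hinted one is allocated once.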
[jira] [Commented] (ARROW-3267) [Python] Create empty table from schema
[ https://issues.apache.org/jira/browse/ARROW-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620908#comment-16620908 ]

Wes McKinney commented on ARROW-3267:

[~Paul.Rogers] I think this is a bit different. The scope of this issue is to have a function

{code}
table = pyarrow.empty_table(schema)
{code}

So the task is to construct the C++ data structure with a bunch of 0-length columns.
[jira] [Commented] (ARROW-3267) [Python] Create empty table from schema
[ https://issues.apache.org/jira/browse/ARROW-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620873#comment-16620873 ]

Paul Rogers commented on ARROW-3267:

FWIW, ARROW-3164 describes a port of a "row set mechanism" from Apache Drill that does exactly this. There are three relevant components:

1. A fluent schema builder to define the schema.
2. The schema definition itself, which includes both scalar and "complex" types.
3. A "row set" (vector batch) builder to build vectors from the schema.

Drill found that it was helpful to have additional metadata in the schema, such as expected width for VARCHAR columns, expected cardinality for arrays, and expected types for unions. The row set builder could then optionally allocate vector buffers at the approximate desired size, which avoided the need to double vectors repeatedly as they are written.

The rest of the mechanism provides a means to write to, or read from, vectors, which is beyond the scope of this particular ticket.

This ticket talks about Python, so the Java row set code is not directly applicable. Still, feel free to borrow ideas. Also, perhaps we can coordinate to establish a common approach across languages.
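The three components listed above might look roughly like this in plain Python. This is only a sketch: ColumnSpec, SchemaBuilder, allocate_row_set, the type names, and the default width of 8 bytes are all hypothetical names and values invented for illustration, and none of them are part of Drill or Arrow.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ColumnSpec:
    """Schema entry (component 2) with an optional size hint as extra metadata."""
    name: str
    type_name: str
    expected_width: Optional[int] = None  # expected bytes per value, if known


class SchemaBuilder:
    """Fluent schema builder (component 1)."""

    def __init__(self) -> None:
        self._columns: List[ColumnSpec] = []

    def add(self, name: str, type_name: str,
            expected_width: Optional[int] = None) -> "SchemaBuilder":
        self._columns.append(ColumnSpec(name, type_name, expected_width))
        return self  # returning self is what allows the calls to be chained

    def build(self) -> List[ColumnSpec]:
        return list(self._columns)


def allocate_row_set(schema: List[ColumnSpec], expected_rows: int,
                     default_width: int = 8) -> Dict[str, bytearray]:
    """Row set builder (component 3): pre-size one buffer per column."""
    buffers: Dict[str, bytearray] = {}
    for col in schema:
        width = col.expected_width if col.expected_width is not None else default_width
        # Allocate once at the hinted size instead of growing by doubling.
        buffers[col.name] = bytearray(width * expected_rows)
    return buffers


schema = (SchemaBuilder()
          .add("id", "INT64", expected_width=8)
          .add("name", "VARCHAR", expected_width=20)
          .build())
buffers = allocate_row_set(schema, expected_rows=1000)
```

Keeping the hints in the schema entries means the allocation step needs no second source of metadata, which is one of the two options mentioned for passing size hints.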