[jira] [Commented] (ARROW-3267) [Python] Create empty table from schema

2018-09-20 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16621664#comment-16621664
 ] 

Uwe L. Korn commented on ARROW-3267:


[~Paul.Rogers] We already have the necessary builder infrastructure; this 
function mainly exists so that there is something to pass around when there is 
no data. Also, the {{Table}} instance is not meant to be modified, i.e. it will 
stay empty all along the pipeline.

> [Python] Create empty table from schema
> ---
>
> Key: ARROW-3267
> URL: https://issues.apache.org/jira/browse/ARROW-3267
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Fix For: 0.11.0
>
>
> When one knows the expected schema for the input data of a data pipeline but 
> has no input data, it is necessary to construct an empty table as a 
> sentinel value to pass through.
> This is a small but often useful convenience function.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3267) [Python] Create empty table from schema

2018-09-19 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620933#comment-16620933
 ] 

Paul Rogers commented on ARROW-3267:


Yes, that's where Drill started also, and it is what step 2 in the previous 
note does.

I suspect you'll find that, once you have a function, you'll want an easy way 
to create the schema (step 1).

Then, unless a mechanism already exists, if you watch allocation logging, 
you'll see vector doublings you can avoid. So, you'll soon want to optimize 
allocation performance by providing size hints. The size hints can be a 
separate bunch of data, or can be part of the schema passed to the empty_table 
function. (You might want to have an allocate_table function that creates the 
table and allocates vectors.)

Sounds like you've not hit these issues yet; but keep this in mind if/when you 
do.
 



[jira] [Commented] (ARROW-3267) [Python] Create empty table from schema

2018-09-19 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620908#comment-16620908
 ] 

Wes McKinney commented on ARROW-3267:

[~Paul.Rogers] I think this is a bit different. The scope of this issue is to 
have a function

{code}
table = pyarrow.empty_table(schema)
{code}

So the task is to construct the C++ data structure with a bunch of 0-length 
columns. 



[jira] [Commented] (ARROW-3267) [Python] Create empty table from schema

2018-09-19 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620873#comment-16620873
 ] 

Paul Rogers commented on ARROW-3267:


FWIW, ARROW-3164 describes a port of a "row set mechanism" from Apache Drill 
that does exactly this. There are three relevant components:

1. A fluent schema builder to define the schema.
2. The schema definition itself, which includes both scalar and "complex" types.
3. A "row set" (vector batch) builder to build vectors from schema.

Drill found that it was helpful to have additional metadata in the schema, such 
as expected width for VARCHAR columns, expected cardinality for arrays, and 
expected types for unions.

The row set builder could then optionally allocate vector buffers at the 
approximate desired size, which avoided the need to double vectors repeatedly 
as they are written.
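
To make the doubling cost concrete, here is a toy sketch in plain Python. It is not Drill or Arrow code: {{GrowableVector}} and {{fill}} are hypothetical names, and the vector tracks only its capacity, doubling it whenever it is full. It compares a vector that starts at capacity 1 with one pre-sized from an expected row count.

```python
from dataclasses import dataclass


@dataclass
class GrowableVector:
    """Toy stand-in for a value vector: tracks capacity, not actual data."""
    capacity: int = 1
    length: int = 0
    doublings: int = 0

    def append(self, _value):
        # Grow by doubling when the vector is full, as described above.
        if self.length == self.capacity:
            self.capacity *= 2
            self.doublings += 1
        self.length += 1


def fill(vector, n):
    """Append n dummy values and return the vector."""
    for i in range(n):
        vector.append(i)
    return vector


expected_rows = 1000  # the size hint, e.g. carried alongside the schema

no_hint = fill(GrowableVector(), expected_rows)
hinted = fill(GrowableVector(capacity=expected_rows), expected_rows)

print("without hint:", no_hint.doublings, "doublings")
print("with hint:   ", hinted.doublings, "doublings")
```

In a real vector each doubling also means a reallocation and a copy of the existing values, which is the cost the size hints avoid.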

The rest of the mechanism provides a means to write to, or read from vectors, 
which is beyond the scope of this particular ticket.

This ticket talks about Python, so the Java row set code is not directly 
applicable. Still, feel free to borrow ideas. Also, perhaps we can coordinate 
to establish a common approach across languages.
