[jira] [Commented] (KUDU-1276) Add a vectorized read/write interface for pandas DataFrame objects

2020-06-01 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121070#comment-17121070
 ] 

Wes McKinney commented on KUDU-1276:


I'll try to help with this patch. I've been pretty busy with a few things 
lately, but I may be able to work a bit on the patch this week. 

> Add a vectorized read/write interface for pandas DataFrame objects
> --
>
> Key: KUDU-1276
> URL: https://issues.apache.org/jira/browse/KUDU-1276
> Project: Kudu
> Issue Type: New Feature
> Components: client, python
> Reporter: Wes McKinney
> Assignee: Jordan Birdsell
> Priority: Major
>
> A pandas read/write interface would make Kudu significantly easier to use for 
> average Python data users.
> The layering is as follows:
> - Writer: "Vectorized" insert that accepts a C/C++ array of values plus an 
> array (either bits or bytes) indicating nullness for nullable slots.
> - Reader: Converts a row batch to NumPy arrays with a missing-data 
> representation suitable for use in pandas. Ideally it should not create more 
> than one PyString object for each observed string value. Binary can be 
> encoded as a UTF-8 string, while Timestamp will need to be converted to 
> nanoseconds for pandas.
> This would also give a very performant and relatively GIL-free data ingest 
> path to Kudu (and Kudu consumers like Impala) without a great deal of 
> Python+Cython coding.
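
The "array of values plus an array of bits indicating nullness" layout from the Writer bullet can be sketched in plain Python. This is only an illustration of the data layout, not Kudu client code: the helper names (pack_validity, is_valid, unpack) are hypothetical, the values are assumed to be 64-bit integers, and the bit convention (1 means the slot holds a value, least significant bit first) is an assumption borrowed from Arrow-style validity bitmaps.

```python
from array import array


def pack_validity(values):
    """Split a list that may contain None into (data, bitmap).

    data   -- array of int64; null slots hold a placeholder 0
    bitmap -- bytes; bit i is 1 when slot i holds a real value
              (assumed convention: least significant bit first)
    """
    data = array("q", (0 if v is None else v for v in values))
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return data, bytes(bitmap)


def is_valid(bitmap, i):
    """Return True when slot i is marked as non-null."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


def unpack(data, bitmap):
    """Rebuild the original list, restoring None for null slots."""
    return [data[i] if is_valid(bitmap, i) else None for i in range(len(data))]


data, bitmap = pack_validity([10, None, 30, None, 50])
restored = unpack(data, bitmap)
```

A vectorized insert would hand buffers like data and bitmap to native code in one call instead of crossing the Python/C boundary once per cell.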



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KUDU-1276) Add a vectorized read/write interface for pandas DataFrame objects

2020-06-01 Thread Grant Henke (Jira)


[ 
https://issues.apache.org/jira/browse/KUDU-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121069#comment-17121069
 ] 

Grant Henke commented on KUDU-1276:
---

Kudu recently added support for a columnar, Arrow-compatible format. There is 
a WIP patch here to update the Python client to use it:
https://gerrit.cloudera.org/#/c/15661/



[jira] [Commented] (KUDU-1276) Add a vectorized read/write interface for pandas DataFrame objects

2018-10-04 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16638292#comment-16638292
 ] 

Wes McKinney commented on KUDU-1276:


Ha, I have a completely laughable amount of spare bandwidth right now =) I'm 
working on growing my team at Ursa Labs and have Kudu-Arrow-pandas improvements 
on at least the 12-18 month horizon. In the meantime, if someone could add 
Apache Arrow to the Kudu build toolchain, that would make things easier for us 
to tackle this when we get to it (if no one beats us there).



[jira] [Commented] (KUDU-1276) Add a vectorized read/write interface for pandas DataFrame objects

2018-10-04 Thread Jordan Birdsell (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16638230#comment-16638230
 ] 

Jordan Birdsell commented on KUDU-1276:
---

Yeah, this approach was not ideal; it was meant to be a quick stab at 
getting the functionality in. [~wesmckinn] if you're interested in helping out 
here it would be much appreciated. I've not had a lot of time in recent months 
for this work.



[jira] [Commented] (KUDU-1276) Add a vectorized read/write interface for pandas DataFrame objects

2018-10-04 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16638033#comment-16638033
 ] 

Wes McKinney commented on KUDU-1276:


I just took a look at the patch. The approach is not very efficient. The 
optimal path for an application like Kudu to go to pandas is via Arrow. I would 
suggest writing a C++ converter to yield an Arrow (columnar) record batch, wrap 
that in a {{pyarrow.RecordBatch}}, then call {{to_pandas()}} (cf 
http://wesmckinney.com/blog/high-perf-arrow-to-pandas/)
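
To make the row-to-columnar step concrete, here is a minimal pure-Python sketch of the conversion such a converter would perform. It is not the Kudu or Arrow API: rows_to_columns and its parameters are hypothetical, plain lists stand in for real columnar buffers, and the timestamp column is assumed to hold microseconds since the epoch, which are scaled to the nanoseconds pandas uses. It also shows the "one string object per observed value" idea from the issue description by caching decoded strings.

```python
def rows_to_columns(rows, names, binary_cols=(), micros_cols=()):
    """Transpose an iterable of row tuples into a dict of column lists.

    binary_cols -- column names whose bytes values are decoded as UTF-8,
                   reusing one str object per distinct value
    micros_cols -- column names holding microsecond timestamps (assumed),
                   converted to nanoseconds
    None marks a null cell and is passed through unchanged.
    """
    columns = {name: [] for name in names}
    decoded = {}  # bytes -> str cache, so each distinct value is decoded once
    for row in rows:
        for name, value in zip(names, row):
            if value is not None:
                if name in binary_cols:
                    text = decoded.get(value)
                    if text is None:
                        text = decoded[value] = value.decode("utf-8")
                    value = text
                elif name in micros_cols:
                    value = value * 1000  # microseconds -> nanoseconds
            columns[name].append(value)
    return columns


rows = [
    (1, b"alpha", 1_500_000),
    (2, None, None),
    (3, b"alpha", 2_000_000),
]
cols = rows_to_columns(
    rows, ("id", "label", "ts"), binary_cols={"label"}, micros_cols={"ts"}
)
```

A real implementation would do this loop in C++ over the scanner's row batch and fill Arrow buffers directly, so that the Python side only wraps the finished batch.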



[jira] [Commented] (KUDU-1276) Add a vectorized read/write interface for pandas DataFrame objects

2018-10-03 Thread Adar Dembo (JIRA)


[ 
https://issues.apache.org/jira/browse/KUDU-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637767#comment-16637767
 ] 

Adar Dembo commented on KUDU-1276:
--

[~jtbirdsell] [your recent pandas patch|https://gerrit.cloudera.org/c/10809/] 
added support for reading from Kudu into pandas, but not in the way described 
by Wes in this issue's description. I'm not familiar with pandas: is there more 
work to be done here on the reading side? Or is it done, albeit in a different 
way than Wes suggested?



[jira] [Commented] (KUDU-1276) Add a vectorized read/write interface for pandas DataFrame objects

2016-09-15 Thread Jordan Birdsell (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15493534#comment-15493534
 ] 

Jordan Birdsell commented on KUDU-1276:
---

[~wesmckinn], have you had a chance to do anything on this? If not, mind if I 
take a stab?

Also, dumb question: is there any reason not to take the approach Spark took 
for the reader and use the from_records method?

> Add a vectorized read/write interface for pandas DataFrame objects
> --
>
> Key: KUDU-1276
> URL: https://issues.apache.org/jira/browse/KUDU-1276
> Project: Kudu
> Issue Type: New Feature
> Components: client
> Reporter: Wes McKinney
> Assignee: Wes McKinney