[jira] [Comment Edited] (ARROW-4032) [Python] New pyarrow.Table.from_pylist() function

2018-12-14 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721986#comment-16721986
 ] 

David Lee edited comment on ARROW-4032 at 12/15/18 3:58 AM:


Ended up just writing from_pylist() and to_pylist().. They run much faster than 
going through pandas..
{code:java}
import pyarrow as pa

def from_pylist(pylist, schema, safe=True):
    # Build one Arrow array per schema field; missing keys become nulls.
    arrow_columns = list()
    for column in schema.names:
        arrow_columns.append(
            pa.array([v[column] if column in v else None for v in pylist],
                     safe=safe,
                     type=schema.types[schema.get_field_index(column)]))
    arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    return arrow_table

def to_pylist(arrow_table):
    # to_pydict() returns plain Python values, so the round trip yields
    # native types rather than Arrow scalars.
    od = arrow_table.to_pydict()
    pylist = list()
    columns = arrow_table.schema.names
    rows = len(arrow_table[columns[0]])
    for row in range(rows):
        pylist.append({key: od[key][row] for key in columns})
    return pylist
{code}
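For reference, a minimal round-trip sketch assuming the two helpers above are in scope (the schema and rows here are made-up stand-ins, not from the issue):
{code:python}
import pyarrow as pa

schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('age', pa.int16())
])
rows = [
    {"name": "Tom", "age": 10},
    {"name": "Mark"}  # missing keys come back as nulls
]

table = from_pylist(rows, schema)   # list of dicts -> pyarrow.Table
assert table.num_rows == 2
roundtrip = to_pylist(table)        # pyarrow.Table -> list of dicts
assert roundtrip[0]["name"] == "Tom"
assert roundtrip[1]["age"] is None
{code}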


was (Author: davlee1...@yahoo.com):
Ended up just writing from_pylist() and to_pylist().. They run much faster than 
going through pandas..
{code:java}
def from_pylist(pylist, schema, safe=True):
    arrow_columns = list()
    for column in schema.names:
        arrow_columns.append(
            pa.array([v[column] if column in v else None for v in pylist],
                     safe=safe,
                     type=schema.types[schema.get_field_index(column)]))
    arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    return arrow_table

def to_pylist(arrow_table):
    od = pyarrow.Table.to_pydict(arrow_table)
    pylist = list()
    columns = list(arrow_table.keys())
    rows = len(arrow_table[columns[0]])
    for row in range(rows):
        pylist.append({key: arrow_table[key][row] for key in columns})
    return pylist
{code}

> [Python] New pyarrow.Table.from_pylist() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists, and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s):
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
>
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour,
>                  today.minute, today.second,
>                  today.microsecond - today.microsecond % 1000)
>
> test_list = [
>     {"name": "Tom", "age": 10},
>     {"name": "Mark", "age": 5, "city": "San Francisco"},
>     {"name": "Pam", "age": 7, "birthday": today}
> ]
>
> def from_pylist(pylist, schema=None, columns=None, safe=True):
>     arrow_columns = list()
>     if schema:
>         columns = schema.names
>     if not columns:
>         return
>     for column in columns:
>         arrow_columns.append(
>             pa.array([v[column] if column in v else None for v in pylist],
>                      safe=safe))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     if schema:
>         arrow_table = arrow_table.cast(schema, safe=safe)
>     return arrow_table
>
> test = from_pylist(test_list,
>                    columns=['name', 'age', 'city', 'birthday', 'dummy'])
>
> test_schema = pa.schema([
>     pa.field('name', pa.string()),
>     pa.field('age', pa.int16()),
>     pa.field('city', pa.string()),
>     pa.field('birthday', pa.timestamp('ms'))
> ])
>
> test2 = from_pylist(test_list, schema=test_schema)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4032) [Python] New pyarrow.Table.from_pylist() function

2018-12-14 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721986#comment-16721986
 ] 

David Lee commented on ARROW-4032:
--

Ended up just writing from_pylist() and to_pylist().. They run much faster than 
going through pandas..
{code:java}
def from_pylist(pylist, schema, safe=True):
    arrow_columns = list()
    for column in schema.names:
        arrow_columns.append(
            pa.array([v[column] if column in v else None for v in pylist],
                     safe=safe,
                     type=schema.types[schema.get_field_index(column)]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    return arrow_table

def to_pylist(arrow_table):
    od = pyarrow.Table.to_pydict(arrow_table)
    pylist = list()
    columns = list(arrow_table.keys())
    rows = len(arrow_table[columns[0]])
    for row in range(rows):
        pylist.append({key: arrow_table[key][row] for key in columns})
    return pylist
{code}

> [Python] New pyarrow.Table.from_pylist() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists, and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s):
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
>
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour,
>                  today.minute, today.second,
>                  today.microsecond - today.microsecond % 1000)
>
> test_list = [
>     {"name": "Tom", "age": 10},
>     {"name": "Mark", "age": 5, "city": "San Francisco"},
>     {"name": "Pam", "age": 7, "birthday": today}
> ]
>
> def from_pylist(pylist, schema=None, columns=None, safe=True):
>     arrow_columns = list()
>     if schema:
>         columns = schema.names
>     if not columns:
>         return
>     for column in columns:
>         arrow_columns.append(
>             pa.array([v[column] if column in v else None for v in pylist],
>                      safe=safe))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     if schema:
>         arrow_table = arrow_table.cast(schema, safe=safe)
>     return arrow_table
>
> test = from_pylist(test_list,
>                    columns=['name', 'age', 'city', 'birthday', 'dummy'])
>
> test_schema = pa.schema([
>     pa.field('name', pa.string()),
>     pa.field('age', pa.int16()),
>     pa.field('city', pa.string()),
>     pa.field('birthday', pa.timestamp('ms'))
> ])
>
> test2 = from_pylist(test_list, schema=test_schema)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-4032) [Python] New pyarrow.Table.from_pylist() function

2018-12-14 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721986#comment-16721986
 ] 

David Lee edited comment on ARROW-4032 at 12/15/18 3:53 AM:


Ended up just writing from_pylist() and to_pylist().. They run much faster than 
going through pandas..
{code:java}
def from_pylist(pylist, schema, safe=True):
    arrow_columns = list()
    for column in schema.names:
        arrow_columns.append(
            pa.array([v[column] if column in v else None for v in pylist],
                     safe=safe,
                     type=schema.types[schema.get_field_index(column)]))
    arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    return arrow_table

def to_pylist(arrow_table):
    od = pyarrow.Table.to_pydict(arrow_table)
    pylist = list()
    columns = list(arrow_table.keys())
    rows = len(arrow_table[columns[0]])
    for row in range(rows):
        pylist.append({key: arrow_table[key][row] for key in columns})
    return pylist
{code}


was (Author: davlee1...@yahoo.com):
Ended up just writing from_pylist() and to_pylist().. They run much faster than 
going through pandas..
{code:java}
def from_pylist(pylist, schema, safe=True):
    arrow_columns = list()
    for column in schema.names:
        arrow_columns.append(
            pa.array([v[column] if column in v else None for v in pylist],
                     safe=safe,
                     type=schema.types[schema.get_field_index(column)]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    return arrow_table

def to_pylist(arrow_table):
    od = pyarrow.Table.to_pydict(arrow_table)
    pylist = list()
    columns = list(arrow_table.keys())
    rows = len(arrow_table[columns[0]])
    for row in range(rows):
        pylist.append({key: arrow_table[key][row] for key in columns})
    return pylist
{code}

> [Python] New pyarrow.Table.from_pylist() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists, and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s):
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
>
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour,
>                  today.minute, today.second,
>                  today.microsecond - today.microsecond % 1000)
>
> test_list = [
>     {"name": "Tom", "age": 10},
>     {"name": "Mark", "age": 5, "city": "San Francisco"},
>     {"name": "Pam", "age": 7, "birthday": today}
> ]
>
> def from_pylist(pylist, schema=None, columns=None, safe=True):
>     arrow_columns = list()
>     if schema:
>         columns = schema.names
>     if not columns:
>         return
>     for column in columns:
>         arrow_columns.append(
>             pa.array([v[column] if column in v else None for v in pylist],
>                      safe=safe))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     if schema:
>         arrow_table = arrow_table.cast(schema, safe=safe)
>     return arrow_table
>
> test = from_pylist(test_list,
>                    columns=['name', 'age', 'city', 'birthday', 'dummy'])
>
> test_schema = pa.schema([
>     pa.field('name', pa.string()),
>     pa.field('age', pa.int16()),
>     pa.field('city', pa.string()),
>     pa.field('birthday', pa.timestamp('ms'))
> ])
>
> test2 = from_pylist(test_list, schema=test_schema)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4033) [C++] thirdparty/download_dependencies.sh uses tools or options not available in older Linuxes

2018-12-14 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721984#comment-16721984
 ] 

Francois Saint-Jacques commented on ARROW-4033:
---

I suppose that realpath is used to get the absolute path so that the 
subsequent exports are independent of the relative path. It could be replaced 
by `readlink -f`, which is also part of coreutils and has been available longer.

> [C++] thirdparty/download_dependencies.sh uses tools or options not available 
> in older Linuxes
> --
>
> Key: ARROW-4033
> URL: https://issues.apache.org/jira/browse/ARROW-4033
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> I found I had to install the {{realpath}} apt package on Ubuntu 14.04. Also 
> {{wget 1.15}} does not have the {{--show-progress}} option



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2026) [Python] Cast all timestamp resolutions to INT96 use_deprecated_int96_timestamps=True

2018-12-14 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721955#comment-16721955
 ] 

Francois Saint-Jacques commented on ARROW-2026:
---

{code:java}
file: file:/home/fsaintjacques/src/arrow/python/test_file.parquet 
creator: parquet-cpp version 1.5.1-SNAPSHOT 

file schema: schema 
--
last_updated: OPTIONAL INT96 R:0 D:1

row group 1: RC:1 TS:58 OFFSET:4 

last_updated: INT96 SNAPPY DO:4 FPO:32 SZ:58/54/0.93 VC:1 
ENC:PLAIN_DICTIONARY,PLAIN,R
{code}

> [Python] Cast all timestamp resolutions to INT96 
> use_deprecated_int96_timestamps=True
> -
>
> Key: ARROW-2026
> URL: https://issues.apache.org/jira/browse/ARROW-2026
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: OS: Mac OS X 10.13.2
> Python: 3.6.4
> PyArrow: 0.8.0
>Reporter: Diego Argueta
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: c++, parquet, pull-request-available, redshift, 
> timestamps
> Fix For: 0.12.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> When writing to a Parquet file, if `use_deprecated_int96_timestamps` is True, 
> timestamps are only written as 96-bit integers if the timestamp has 
> nanosecond resolution. This is a problem because Amazon Redshift timestamps 
> only have microsecond resolution but require them to be stored in 96-bit 
> format in Parquet files.
> I'd expect the use_deprecated_int96_timestamps flag to cause _all_ timestamps 
> to be written as 96 bits, regardless of resolution. If this is a deliberate 
> design decision, it'd be immensely helpful if it were explicitly documented 
> as part of the argument.
>  
> To reproduce:
>  
> 1. Create a table with a timestamp having microsecond or millisecond 
> resolution, and save it to a Parquet file. Be sure to set 
> `use_deprecated_int96_timestamps` to True.
>  
> {code:java}
> import datetime
> import pyarrow
> from pyarrow import parquet
>
> schema = pyarrow.schema([
>     pyarrow.field('last_updated', pyarrow.timestamp('us')),
> ])
> data = [
>     pyarrow.array([datetime.datetime.now()], pyarrow.timestamp('us')),
> ]
> table = pyarrow.Table.from_arrays(data, ['last_updated'])
>
> with open('test_file.parquet', 'wb') as fdesc:
>     parquet.write_table(table, fdesc,
>                         use_deprecated_int96_timestamps=True)
> {code}
>  
> 2. Inspect the file. I used parquet-tools:
>  
> {noformat}
> dak@tux ~ $ parquet-tools meta test_file.parquet
> file:         file:/Users/dak/test_file.parquet
> creator:      parquet-cpp version 1.3.2-SNAPSHOT
> file schema:  schema
> 
> last_updated: OPTIONAL INT64 O:TIMESTAMP_MICROS R:0 D:1
> row group 1:  RC:1 TS:76 OFFSET:4
> 
> last_updated:  INT64 SNAPPY DO:4 FPO:28 SZ:76/72/0.95 VC:1 
> ENC:PLAIN,PLAIN_DICTIONARY,RLE{noformat}
>  
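One possible stop-gap under the current behaviour described above is to declare (or cast) the column as nanosecond resolution before writing, so that the INT96 path is taken. A minimal sketch of that workaround (the file name is made up for illustration):
{code:python}
import datetime
import pyarrow
from pyarrow import parquet

# Same data as the reproduction above, but declared with nanosecond
# resolution so the INT96 branch is taken.
data = [
    pyarrow.array([datetime.datetime.now()], pyarrow.timestamp('ns')),
]
table = pyarrow.Table.from_arrays(data, ['last_updated'])

with open('test_file_int96.parquet', 'wb') as fdesc:
    parquet.write_table(table, fdesc,
                        use_deprecated_int96_timestamps=True)
{code}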



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3829) [Python] Support protocols to extract Arrow objects from third-party classes

2018-12-14 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3829:

Fix Version/s: (was: 0.12.0)
   0.13.0

> [Python] Support protocols to extract Arrow objects from third-party classes
> 
>
> Key: ARROW-3829
> URL: https://issues.apache.org/jira/browse/ARROW-3829
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.13.0
>
>
> In the style of NumPy's {{__array__}}, we should be able to ask inputs to 
> {{pa.array}}, {{pa.Table.from_X}}, ... whether they can convert themselves to 
> Arrow objects. This would allow, for example, objects that hold an Arrow 
> object internally to expose it directly instead of going through a 
> conversion path.
> My current use case involves Pandas {{ExtensionArray}} instances that 
> internally have Arrow objects and should be reused when we pass the whole 
> {{DataFrame}} to {{pa.Table.from_pandas}}.
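For illustration, a sketch of what such a protocol could look like; the {{__arrow_array__}} method name and the dispatch helper are assumptions made for the sake of the example, not an agreed API:
{code:python}
import pyarrow as pa

class MyExtensionColumn:
    """Toy container that already holds an Arrow array internally."""
    def __init__(self, arrow_array):
        self._data = arrow_array

    # Hypothetical protocol method, analogous to NumPy's __array__.
    def __arrow_array__(self, type=None):
        return self._data

def as_arrow(obj, type=None):
    # What pa.array() could do internally if the protocol existed:
    # reuse the wrapped Arrow data instead of converting element by element.
    if hasattr(obj, "__arrow_array__"):
        return obj.__arrow_array__(type=type)
    return pa.array(obj, type=type)

col = MyExtensionColumn(pa.array([1, 2, 3]))
assert as_arrow(col) is col._data  # no conversion path taken
{code}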



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3803) [C++/Python] Split C++ and Python unit test Travis CI jobs, run all C++ tests (including Gandiva) together

2018-12-14 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721920#comment-16721920
 ] 

Wes McKinney commented on ARROW-3803:
-

[~pitrou] unless you are already far into this, since I've been working a lot 
on the build scripts and CMake stuff this week, I can go ahead and take this

> [C++/Python] Split C++ and Python unit test Travis CI jobs, run all C++ tests 
> (including Gandiva) together
> --
>
> Key: ARROW-3803
> URL: https://issues.apache.org/jira/browse/ARROW-3803
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> Our main C++/Python job is bumping up against the 50 minute limit lately, so 
> it is time to do a little bit of reorganization
> I suggest that we do this:
> * Build and test all C++ code including Gandiva in a single job on Linux and 
> macOS
> * Run Python unit tests (but not the C++ tests) on Linux and macOS in a 
> separate job
> Code coverage will need to get uploaded in the Linux jobs for both of these, 
> so a little bit of surgery is required



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3829) [Python] Support protocols to extract Arrow objects from third-party classes

2018-12-14 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721917#comment-16721917
 ] 

Wes McKinney commented on ARROW-3829:
-

I'm moving this to 0.13. If you submit an alpha / experimental version of this 
for 0.12, please go ahead =) 

> [Python] Support protocols to extract Arrow objects from third-party classes
> 
>
> Key: ARROW-3829
> URL: https://issues.apache.org/jira/browse/ARROW-3829
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.13.0
>
>
> In the style of NumPy's {{__array__}}, we should be able to ask inputs to 
> {{pa.array}}, {{pa.Table.from_X}}, ... whether they can convert themselves to 
> Arrow objects. This would allow, for example, objects that hold an Arrow 
> object internally to expose it directly instead of going through a 
> conversion path.
> My current use case involves Pandas {{ExtensionArray}} instances that 
> internally have Arrow objects and should be reused when we pass the whole 
> {{DataFrame}} to {{pa.Table.from_pandas}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3971) [Python] Remove APIs deprecated in 0.11 and prior

2018-12-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3971:
--
Labels: pull-request-available  (was: )

> [Python] Remove APIs deprecated in 0.11 and prior
> -
>
> Key: ARROW-3971
> URL: https://issues.apache.org/jira/browse/ARROW-3971
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-3971) [Python] Remove APIs deprecated in 0.11 and prior

2018-12-14 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-3971:
---

Assignee: Wes McKinney

> [Python] Remove APIs deprecated in 0.11 and prior
> -
>
> Key: ARROW-3971
> URL: https://issues.apache.org/jira/browse/ARROW-3971
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4006) Add CODE_OF_CONDUCT.md

2018-12-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-4006:
--
Labels: pull-request-available  (was: )

> Add CODE_OF_CONDUCT.md
> --
>
> Key: ARROW-4006
> URL: https://issues.apache.org/jira/browse/ARROW-4006
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> The Apache Software Foundation has a code of conduct that applies to its 
> projects
> https://www.apache.org/foundation/policies/conduct.html
> We should add a document to the root of the git repository to direct 
> interested individuals to the CoC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-4006) Add CODE_OF_CONDUCT.md

2018-12-14 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-4006:
---

Assignee: Wes McKinney

> Add CODE_OF_CONDUCT.md
> --
>
> Key: ARROW-4006
> URL: https://issues.apache.org/jira/browse/ARROW-4006
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> The Apache Software Foundation has a code of conduct that applies to its 
> projects
> https://www.apache.org/foundation/policies/conduct.html
> We should add a document to the root of the git repository to direct 
> interested individuals to the CoC.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3058) [Python] Feather reads fail with unintuitive error when conversion from pandas yields ChunkedArray

2018-12-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3058:
--
Labels: pull-request-available  (was: )

> [Python] Feather reads fail with unintuitive error when conversion from 
> pandas yields ChunkedArray
> --
>
> Key: ARROW-3058
> URL: https://issues.apache.org/jira/browse/ARROW-3058
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> See report in 
> https://github.com/wesm/feather/issues/321#issuecomment-412884084
> Individual string columns with more than 2GB are currently unsupported in the 
> Feather format 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-1807) [JAVA] Reduce Heap Usage (Phase 3): consolidate buffers

2018-12-14 Thread Siddharth Teotia (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Teotia resolved ARROW-1807.
-
Resolution: Fixed

> [JAVA] Reduce Heap Usage (Phase 3): consolidate buffers
> ---
>
> Key: ARROW-1807
> URL: https://issues.apache.org/jira/browse/ARROW-1807
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Siddharth Teotia
>Assignee: Pindikura Ravindra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Consolidate buffers for reducing the volume of objects and heap usage
> <validity + data> => single buffer for fixed width
> <validity + offsets> => single buffer for var width, list vector



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4034) [Ruby] Interface for FileOutputStream doesn't respect append=True

2018-12-14 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4034:

Summary: [Ruby] Interface for FileOutputStream doesn't respect append=True  
(was: red-arrow interface for FileOutputStream doesn't respect append=True)

> [Ruby] Interface for FileOutputStream doesn't respect append=True
> -
>
> Key: ARROW-4034
> URL: https://issues.apache.org/jira/browse/ARROW-4034
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Ruby
> Environment: macOS High Sierra version 10.13.4; ruby 2.4.1; gtk-doc, 
> gobject-introspection, boost, Arrow C++ & Parquet C++, Arrow GLib all 
> installed via homebrew
>Reporter: Ian Murray
>Priority: Blocker
>
> It seems that the PR (#1978) that resolved Issue #2018 has not cascaded down 
> through the existing ruby interface.
> I've been experimenting with variations of the `write-file.rb` examples, but 
> passing in the append flag as true 
> (`Arrow::FileOutputStream.open("/tmp/file.arrow", true)`) still results in 
> overwriting the file, and trying the newer interface using truncate and 
> append flags throws `ArgumentError: wrong number of arguments (3 for 2)`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4025) [Python] TensorFlow/PyTorch arrow ThreadPool workarounds not working in some settings

2018-12-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-4025:
--
Labels: pull-request-available  (was: )

> [Python] TensorFlow/PyTorch arrow ThreadPool workarounds not working in some 
> settings
> -
>
> Key: ARROW-4025
> URL: https://issues.apache.org/jira/browse/ARROW-4025
> Project: Apache Arrow
>  Issue Type: Improvement
>Affects Versions: 0.11.1
>Reporter: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
>
> See the bug report in [https://github.com/ray-project/ray/issues/3520]
> I wonder if we can revisit this issue and try to get rid of the workarounds 
> we tried to deploy in the past.
> See also the discussion in [https://github.com/apache/arrow/pull/2096]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4034) red-arrow interface for FileOutputStream doesn't respect append=True

2018-12-14 Thread Ian Murray (JIRA)
Ian Murray created ARROW-4034:
-

 Summary: red-arrow interface for FileOutputStream doesn't respect 
append=True
 Key: ARROW-4034
 URL: https://issues.apache.org/jira/browse/ARROW-4034
 Project: Apache Arrow
  Issue Type: Bug
  Components: Ruby
 Environment: macOS High Sierra version 10.13.4; ruby 2.4.1; gtk-doc, 
gobject-introspection, boost, Arrow C++ & Parquet C++, Arrow GLib all installed 
via homebrew
Reporter: Ian Murray


It seems that the PR (#1978) that resolved Issue #2018 has not cascaded down 
through the existing ruby interface.

I've been experimenting with variations of the `write-file.rb` examples, but 
passing in the append flag as true 
(`Arrow::FileOutputStream.open("/tmp/file.arrow", true)`) still results in 
overwriting the file, and trying the newer interface using truncate and append 
flags throws `ArgumentError: wrong number of arguments (3 for 2)`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4033) [C++] thirdparty/download_dependencies.sh uses tools or options not available in older Linuxes

2018-12-14 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721862#comment-16721862
 ] 

Wes McKinney commented on ARROW-4033:
-

I partially addressed this in https://github.com/apache/arrow/pull/3174. 
[~fsaintjacques] can you use an alternative to {{realpath}}?

> [C++] thirdparty/download_dependencies.sh uses tools or options not available 
> in older Linuxes
> --
>
> Key: ARROW-4033
> URL: https://issues.apache.org/jira/browse/ARROW-4033
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> I found I had to install the {{realpath}} apt package on Ubuntu 14.04. Also 
> {{wget 1.15}} does not have the {{--show-progress}} option



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3230) [Python] Missing comparisons on ChunkedArray, Table

2018-12-14 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3230:

Fix Version/s: (was: 0.12.0)
   0.13.0

> [Python] Missing comparisons on ChunkedArray, Table
> ---
>
> Key: ARROW-3230
> URL: https://issues.apache.org/jira/browse/ARROW-3230
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.10.0
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.13.0
>
>
> Table and ChunkedArray equality are not implemented, meaning they fall back 
> on identity. Instead they should invoke equals(), as on Column.
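To make the current behaviour concrete, a small sketch (results may differ by pyarrow version; the point is that {{==}} should delegate to {{equals()}}):
{code:python}
import pyarrow as pa

a = pa.Table.from_arrays([pa.array([1, 2, 3])], ['x'])
b = pa.Table.from_arrays([pa.array([1, 2, 3])], ['x'])

# Structural comparison already works through the explicit method...
assert a.equals(b)

# ...but == falls back to object identity, so two equal tables compare
# unequal. The proposal is for __eq__ to call equals(), as Column does.
print(a == b)   # False with the identity fallback; True once __eq__ delegates
{code}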



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4029) [C++] Define and document naming convention for internal / private header files not to be installed

2018-12-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-4029:
--
Labels: pull-request-available  (was: )

> [C++] Define and document naming convention for internal / private header 
> files not to be installed
> ---
>
> Key: ARROW-4029
> URL: https://issues.apache.org/jira/browse/ARROW-4029
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> The purpose of this is so that a common {{ARROW_INSTALL_PUBLIC_HEADERS}} can 
> recognize and exclude any file that is non-public from installation.
> see discussion on https://github.com/apache/arrow/pull/3172



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-4029) [C++] Define and document naming convention for internal / private header files not to be installed

2018-12-14 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-4029:
---

Assignee: Wes McKinney

> [C++] Define and document naming convention for internal / private header 
> files not to be installed
> ---
>
> Key: ARROW-4029
> URL: https://issues.apache.org/jira/browse/ARROW-4029
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> The purpose of this is so that a common {{ARROW_INSTALL_PUBLIC_HEADERS}} can 
> recognize and exclude any file that is non-public from installation.
> see discussion on https://github.com/apache/arrow/pull/3172



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2475) [Format] Confusing array length description

2018-12-14 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2475.
-
Resolution: Fixed
  Assignee: Uwe L. Korn  (was: Wes McKinney)

This was addressed already since the docs were merged

> [Format] Confusing array length description
> ---
>
> Key: ARROW-2475
> URL: https://issues.apache.org/jira/browse/ARROW-2475
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Krisztian Szucs
>Assignee: Uwe L. Korn
>Priority: Trivial
> Fix For: 0.12.0
>
>
> "To encourage developers to compose smaller arrays (each of which contains 
> contiguous memory in its leaf nodes) to create larger array structures 
> possibly exceeding 2^31 - 1 elements, as opposed to allocating very large 
> contiguous memory blocks."
> I think it could use a little more verbose explanation: `to compose smaller 
> arrays to create larger array structures`



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-3974) [C++] Combine field_builders_ and children_ members in array/builder.h

2018-12-14 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3974.
-
Resolution: Fixed

Done in 
https://github.com/apache/arrow/commit/73f94c93d7eee25a43415dfa7a806b887942abd1

> [C++] Combine field_builders_ and children_ members in array/builder.h
> --
>
> Key: ARROW-3974
> URL: https://issues.apache.org/jira/browse/ARROW-3974
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> The intent of {{children_}} was to use these in nested type builders. But 
> {{StructBuilder}} has its own differently-named child builders member 
> {{field_builders_}}. This bit of cruft should be cleaned up



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3449) [C++] Support CMake 3.2 for "out of the box" builds

2018-12-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3449:
--
Labels: pull-request-available  (was: )

> [C++] Support CMake 3.2 for "out of the box" builds
> ---
>
> Key: ARROW-3449
> URL: https://issues.apache.org/jira/browse/ARROW-3449
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>
> As reported in the 0.11.0 RC1 release vote, some of our dependencies (like 
> gbenchmark) do not build out of the box with CMake 3.2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4033) [C++] thirdparty/download_dependencies.sh uses tools or options not available in older Linuxes

2018-12-14 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-4033:
---

 Summary: [C++] thirdparty/download_dependencies.sh uses tools or 
options not available in older Linuxes
 Key: ARROW-4033
 URL: https://issues.apache.org/jira/browse/ARROW-4033
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.12.0


I found I had to install the {{realpath}} apt package on Ubuntu 14.04. Also 
{{wget 1.15}} does not have the {{--show-progress}} option



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-3984) [C++] Exit with error if user hits zstd ExternalProject path

2018-12-14 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-3984:
---

Assignee: Wes McKinney

> [C++] Exit with error if user hits zstd ExternalProject path
> 
>
> Key: ARROW-3984
> URL: https://issues.apache.org/jira/browse/ARROW-3984
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> We should check the CMake version and exit with a more informative error if 
> {{ARROW_WITH_ZSTD}} is on, but the CMake version is too old



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-3449) [C++] Support CMake 3.2 for "out of the box" builds

2018-12-14 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-3449:
---

Assignee: Wes McKinney  (was: Francois Saint-Jacques)

> [C++] Support CMake 3.2 for "out of the box" builds
> ---
>
> Key: ARROW-3449
> URL: https://issues.apache.org/jira/browse/ARROW-3449
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> As reported in the 0.11.0 RC1 release vote, some of our dependencies (like 
> gbenchmark) do not build out of the box with CMake 3.2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3449) [C++] Support CMake 3.2 for "out of the box" builds

2018-12-14 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721812#comment-16721812
 ] 

Wes McKinney commented on ARROW-3449:
-

There's a bunch of CMake-related issues. Since I'm already digging around in 
these files I'll take care of this

> [C++] Support CMake 3.2 for "out of the box" builds
> ---
>
> Key: ARROW-3449
> URL: https://issues.apache.org/jira/browse/ARROW-3449
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> As reported in the 0.11.0 RC1 release vote, some of our dependencies (like 
> gbenchmark) do not build out of the box with CMake 3.2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-3762) [C++] Parquet arrow::Table reads error when overflowing capacity of BinaryArray

2018-12-14 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3762.
-
Resolution: Fixed

Issue resolved by pull request 3171
[https://github.com/apache/arrow/pull/3171]

> [C++] Parquet arrow::Table reads error when overflowing capacity of 
> BinaryArray
> ---
>
> Key: ARROW-3762
> URL: https://issues.apache.org/jira/browse/ARROW-3762
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Chris Ellison
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError 
> due to it not creating chunked arrays. Reading each row group individually 
> and then concatenating the tables works, however.
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> x = pa.array(list('1' * 2**30))
> demo = 'demo.parquet'
>
> def scenario():
>     t = pa.Table.from_arrays([x], ['x'])
>     writer = pq.ParquetWriter(demo, t.schema)
>     for i in range(2):
>         writer.write_table(t)
>     writer.close()
>
>     pf = pq.ParquetFile(demo)
>
>     # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot
>     # contain more than 2147483646 bytes, have 2147483647
>     t2 = pf.read()
>
>     # Works, but note, there are 32 row groups, not 2 as suggested by:
>     # https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
>     tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
>     t3 = pa.concat_tables(tables)
>
> scenario()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pylist() function

2018-12-14 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee updated ARROW-4032:
-
Summary: [Python] New pyarrow.Table.from_pylist() function  (was: [Python] 
New pyarrow.Table.from_pydict() function)

> [Python] New pyarrow.Table.from_pylist() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists, and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s):
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
>
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour,
>                  today.minute, today.second,
>                  today.microsecond - today.microsecond % 1000)
>
> test_list = [
>     {"name": "Tom", "age": 10},
>     {"name": "Mark", "age": 5, "city": "San Francisco"},
>     {"name": "Pam", "age": 7, "birthday": today}
> ]
>
> def from_pylist(pylist, schema=None, columns=None, safe=True):
>     arrow_columns = list()
>     if schema:
>         columns = schema.names
>     if not columns:
>         return
>     for column in columns:
>         arrow_columns.append(
>             pa.array([v[column] if column in v else None for v in pylist],
>                      safe=safe))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     if schema:
>         arrow_table = arrow_table.cast(schema, safe=safe)
>     return arrow_table
>
> test = from_pylist(test_list,
>                    columns=['name', 'age', 'city', 'birthday', 'dummy'])
>
> test_schema = pa.schema([
>     pa.field('name', pa.string()),
>     pa.field('age', pa.int16()),
>     pa.field('city', pa.string()),
>     pa.field('birthday', pa.timestamp('ms'))
> ])
>
> test2 = from_pylist(test_list, schema=test_schema)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function

2018-12-14 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee updated ARROW-4032:
-
Description: 
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists, and there are inherent 
problems using Pandas with NULL support for Int(s) and Boolean(s):

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

# convert microseconds to milliseconds. More support for MS in parquet.
today = datetime.now()
today = datetime(today.year, today.month, today.day, today.hour,
                 today.minute, today.second,
                 today.microsecond - today.microsecond % 1000)

test_list = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": today}
]

def from_pylist(pylist, schema=None, columns=None, safe=True):
    arrow_columns = list()
    if schema:
        columns = schema.names
    if not columns:
        return
    for column in columns:
        arrow_columns.append(
            pa.array([v[column] if column in v else None for v in pylist],
                     safe=safe))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    if schema:
        arrow_table = arrow_table.cast(schema, safe=safe)
    return arrow_table

test = from_pylist(test_list,
                   columns=['name', 'age', 'city', 'birthday', 'dummy'])

test_schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('age', pa.int16()),
    pa.field('city', pa.string()),
    pa.field('birthday', pa.timestamp('ms'))
])

test2 = from_pylist(test_list, schema=test_schema)
{code}

  was:
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists, and there are inherent 
problems using Pandas with NULL support for Int(s) and Boolean(s):

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

# convert microseconds to milliseconds. More support for MS in parquet.
today = datetime.now()
today = datetime(today.year, today.month, today.day, today.hour,
                 today.minute, today.second,
                 today.microsecond - today.microsecond % 1000)

test_list = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": today}
]

def from_pydict(pylist, schema=None, columns=None, safe=True):
    arrow_columns = list()
    if schema:
        columns = schema.names
    if not columns:
        return
    for column in columns:
        arrow_columns.append(
            pa.array([v[column] if column in v else None for v in pylist],
                     safe=safe))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    if schema:
        arrow_table = arrow_table.cast(schema, safe=safe)
    return arrow_table

test = from_pydict(test_list,
                   columns=['name', 'age', 'city', 'birthday', 'dummy'])

test_schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('age', pa.int16()),
    pa.field('city', pa.string()),
    pa.field('birthday', pa.timestamp('ms'))
])

test2 = from_pydict(test_list, schema=test_schema)
{code}


> [Python] New pyarrow.Table.from_pydict() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists, and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s):
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour, 
> today.minute, today.second, today.microsecond - today.microsecond % 1000)
> test_list = [
> {"name": "Tom", "age": 10},
> {"name": "Mark", "age": 5, "city": "San Francisco"},
> {"name": "Pam", "age": 7, "birthday": today}
> ]
> def from_pylist(pylist, schema=None, columns=None, safe=True):
> arrow_columns = list()
> if schema:
> columns = schema.names
> if not columns:
> return
> for column in columns:
> arrow_columns.append(pa.array([v[column] if column in v else None for 
> v in pylist], safe=safe))
> arrow_table = 

[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function

2018-12-14 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee updated ARROW-4032:
-
Description: 
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists, and there are inherent 
problems using Pandas with NULL support for Int(s) and Boolean(s):

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

# convert microseconds to milliseconds. More support for MS in parquet.
today = datetime.now()
today = datetime(today.year, today.month, today.day, today.hour,
                 today.minute, today.second,
                 today.microsecond - today.microsecond % 1000)

test_list = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": today}
]

def from_pydict(pylist, schema=None, columns=None, safe=True):
    arrow_columns = list()
    if schema:
        columns = schema.names
    if not columns:
        return
    for column in columns:
        arrow_columns.append(
            pa.array([v[column] if column in v else None for v in pylist],
                     safe=safe))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    if schema:
        arrow_table = arrow_table.cast(schema, safe=safe)
    return arrow_table

test = from_pydict(test_list,
                   columns=['name', 'age', 'city', 'birthday', 'dummy'])

test_schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('age', pa.int16()),
    pa.field('city', pa.string()),
    pa.field('birthday', pa.timestamp('ms'))
])

test2 = from_pydict(test_list, schema=test_schema)
{code}

  was:
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists, and there are inherent 
problems using Pandas with NULL support for Int(s) and Boolean(s):

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

# convert microseconds to milliseconds. More support for MS in parquet.
today = datetime.now()
today = datetime(today.year, today.month, today.day, today.hour,
                 today.minute, today.second,
                 today.microsecond - today.microsecond % 1000)

pylist = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": today}
]

def from_pydict(pylist, schema=None, columns=None, safe=True):
    arrow_columns = list()
    if schema:
        columns = schema.names
    if not columns:
        return
    for column in columns:
        arrow_columns.append(
            pa.array([v[column] if column in v else None for v in pylist],
                     safe=safe))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    if schema:
        arrow_table = arrow_table.cast(schema, safe=safe)
    return arrow_table

test = from_pydict(pylist,
                   columns=['name', 'age', 'city', 'birthday', 'dummy'])

test_schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('age', pa.int16()),
    pa.field('city', pa.string()),
    pa.field('birthday', pa.timestamp('ms'))
])

test2 = from_pydict(pylist, schema=test_schema)
{code}


> [Python] New pyarrow.Table.from_pydict() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists, and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s):
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour, 
> today.minute, today.second, today.microsecond - today.microsecond % 1000)
> test_list = [
> {"name": "Tom", "age": 10},
> {"name": "Mark", "age": 5, "city": "San Francisco"},
> {"name": "Pam", "age": 7, "birthday": today}
> ]
> def from_pydict(pylist, schema=None, columns=None, safe=True):
> arrow_columns = list()
> if schema:
> columns = schema.names
> if not columns:
> return
> for column in columns:
> arrow_columns.append(pa.array([v[column] if column in v else None for 
> v in pylist], safe=safe))
> arrow_table = 

[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function

2018-12-14 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee updated ARROW-4032:
-
Description: 
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists, and there are inherent 
problems using Pandas with NULL support for Int(s) and Boolean(s):

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

# convert microseconds to milliseconds. More support for MS in parquet.
today = datetime.now()
today = datetime(today.year, today.month, today.day, today.hour,
                 today.minute, today.second,
                 today.microsecond - today.microsecond % 1000)

pylist = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": today}
]

def from_pydict(pylist, schema=None, columns=None, safe=True):
    arrow_columns = list()
    if schema:
        columns = schema.names
    if not columns:
        return
    for column in columns:
        arrow_columns.append(
            pa.array([v[column] if column in v else None for v in pylist],
                     safe=safe))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    if schema:
        arrow_table = arrow_table.cast(schema, safe=safe)
    return arrow_table

test = from_pydict(pylist,
                   columns=['name', 'age', 'city', 'birthday', 'dummy'])

test_schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('age', pa.int16()),
    pa.field('city', pa.string()),
    pa.field('birthday', pa.timestamp('ms'))
])

test2 = from_pydict(pylist, schema=test_schema)
{code}

  was:
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists, and there are inherent 
problems using Pandas with NULL support for Int(s) and Boolean(s):

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

# convert microseconds to milliseconds. More support for MS in parquet.
today = datetime.now()
today = datetime(today.year, today.month, today.day, today.hour,
                 today.minute, today.second,
                 today.microsecond - today.microsecond % 1000)

pylist = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": today}
]

def from_pydict(pylist, schema=None, columns=None, safe=True):
    arrow_columns = list()
    if schema:
        columns = schema.names
    if not columns:
        return
    for column in columns:
        arrow_columns.append(
            pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    if schema:
        arrow_table = arrow_table.cast(schema, safe=safe)
    return arrow_table

test = from_pydict(pylist,
                   columns=['name', 'age', 'city', 'birthday', 'dummy'])

test_schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('age', pa.int16()),
    pa.field('city', pa.string()),
    pa.field('birthday', pa.timestamp('ms'))
])

test2 = from_pydict(pylist, schema=test_schema)
{code}


> [Python] New pyarrow.Table.from_pydict() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists, and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s):
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour, 
> today.minute, today.second, today.microsecond - today.microsecond % 1000)
> pylist = [
> {"name": "Tom", "age": 10},
> {"name": "Mark", "age": 5, "city": "San Francisco"},
> {"name": "Pam", "age": 7, "birthday": today}
> ]
> def from_pydict(pylist, schema=None, columns=None, safe=True):
> arrow_columns = list()
> if schema:
> columns = schema.names
> if not columns:
> return
> for column in columns:
> arrow_columns.append(pa.array([v[column] if column in v else None for 
> v in pylist], safe=safe))
> arrow_table = pa.Table.from_arrays(arrow_columns, columns)
> if 

[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function

2018-12-14 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee updated ARROW-4032:
-
Description: 
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists, and there are inherent 
problems using Pandas with NULL support for Int(s) and Boolean(s):

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

# convert microseconds to milliseconds. More support for MS in parquet.
today = datetime.now()
today = datetime(today.year, today.month, today.day, today.hour,
                 today.minute, today.second,
                 today.microsecond - today.microsecond % 1000)

pylist = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": today}
]

def from_pydict(pylist, schema=None, columns=None, safe=True):
    arrow_columns = list()
    if schema:
        columns = schema.names
    if not columns:
        return
    for column in columns:
        arrow_columns.append(
            pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    if schema:
        arrow_table = arrow_table.cast(schema, safe=safe)
    return arrow_table

test = from_pydict(pylist,
                   columns=['name', 'age', 'city', 'birthday', 'dummy'])

test_schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('age', pa.int16()),
    pa.field('city', pa.string()),
    pa.field('birthday', pa.timestamp('ms'))
])

test2 = from_pydict(pylist, schema=test_schema)
{code}

  was:
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists and there are inherent problems 
using Pandas with NULL support for Int(s) and Boolean(s)

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

# convert microseconds to milliseconds. More support for MS in parquet.
today = datetime.now()
today = datetime(today.year, today.month, today.day, today.hour, today.minute, 
today.second, today.microsecond - today.microsecond % 1000)

pylist = [
{"name": "Tom", "age": 10},
{"name": "Mark", "age": 5, "city": "San Francisco"},
{"name": "Pam", "age": 7, "birthday": today}
]

def from_pydict(pylist, schema=None, columns=None, safe=True):
    arrow_columns = list()
    if schema:
        columns = schema.names
    if not columns:
        return
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    if schema:
        arrow_table = arrow_table.cast(schema, safe=safe)
    return arrow_table

test = from_pydict(pylist, columns=['name', 'age', 'city', 'birthday', 'dummy'])

test_schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('age', pa.int16()),
    pa.field('city', pa.string()),
    pa.field('birthday', pa.timestamp('ms'))
])

test2 = from_pydict(pylist, schema=test_schema)
{code}


> [Python] New pyarrow.Table.from_pydict() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour, 
> today.minute, today.second, today.microsecond - today.microsecond % 1000)
> pylist = [
> {"name": "Tom", "age": 10},
> {"name": "Mark", "age": 5, "city": "San Francisco"},
> {"name": "Pam", "age": 7, "birthday": today}
> ]
> def from_pydict(pylist, schema=None, columns=None, safe=True):
>     arrow_columns = list()
>     if schema:
>         columns = schema.names
>     if not columns:
>         return
>     for column in columns:
>         arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     if schema:
>         arrow_table = arrow_table.cast(schema, safe=safe)
> return 

[jira] [Commented] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function

2018-12-14 Thread David Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721745#comment-16721745
 ] 

David Lee commented on ARROW-4032:
--

Updated the sample code to include the schema and safe options.

Passing in a schema will allow conversions from microseconds to milliseconds.
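
As a rough illustration of that last point (a minimal sketch that is not part of the ticket, assuming current pyarrow casting behavior and a made-up column name), a target schema drives the resolution change through a cast:
{code:java}
import datetime
import pyarrow as pa

# Data arrives with microsecond precision...
us_array = pa.array([datetime.datetime.now()], type=pa.timestamp('us'))
table = pa.Table.from_arrays([us_array], ['last_updated'])

# ...and a schema declaring millisecond precision triggers the conversion.
ms_schema = pa.schema([pa.field('last_updated', pa.timestamp('ms'))])
table_ms = table.cast(ms_schema, safe=False)  # safe=False permits the lossy us -> ms truncation
{code}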

> [Python] New pyarrow.Table.from_pydict() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour, 
> today.minute, today.second, today.microsecond - today.microsecond % 1000)
> pylist = [
> {"name": "Tom", "age": 10},
> {"name": "Mark", "age": 5, "city": "San Francisco"},
> {"name": "Pam", "age": 7, "birthday": today}
> ]
> def from_pydict(pylist, schema=None, columns=None, safe=True):
>     arrow_columns = list()
>     if schema:
>         columns = schema.names
>     if not columns:
>         return
>     for column in columns:
>         arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     if schema:
>         arrow_table = arrow_table.cast(schema, safe=safe)
>     return arrow_table
> test = from_pydict(pylist, columns=['name', 'age', 'city', 'birthday', 'dummy'])
> test_schema = pa.schema([
>     pa.field('name', pa.string()),
>     pa.field('age', pa.int16()),
>     pa.field('city', pa.string()),
>     pa.field('birthday', pa.timestamp('ms'))
> ])
> test2 = from_pydict(pylist, schema=test_schema)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function

2018-12-14 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee updated ARROW-4032:
-
Description: 
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists and there are inherent problems 
using Pandas with NULL support for Int(s) and Boolean(s)

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

# convert microseconds to milliseconds. More support for MS in parquet.
today = datetime.now()
today = datetime(today.year, today.month, today.day, today.hour, today.minute, 
today.second, today.microsecond - today.microsecond % 1000)

pylist = [
{"name": "Tom", "age": 10},
{"name": "Mark", "age": 5, "city": "San Francisco"},
{"name": "Pam", "age": 7, "birthday": today}
]

def from_pydict(pylist, schema=None, columns=None, safe=True):
    arrow_columns = list()
    if schema:
        columns = schema.names
    if not columns:
        return
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    if schema:
        arrow_table = arrow_table.cast(schema, safe=safe)
    return arrow_table

test = from_pydict(pylist, columns=['name', 'age', 'city', 'birthday', 'dummy'])

test_schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('age', pa.int16()),
    pa.field('city', pa.string()),
    pa.field('birthday', pa.timestamp('ms'))
])

test2 = from_pydict(pylist, schema=test_schema)
{code}

  was:
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists and there are inherent problems 
using Pandas with NULL support for Int(s) and Boolean(s)

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

test_list = [
{"name": "Tom", "age": 10},
{"name": "Mark", "age": 5, "city": "San Francisco"},
{"name": "Pam", "age": 7, "birthday": datetime.now()}
]

def from_pydict(pylist, columns):
    arrow_columns = list()
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    return arrow_table

test = from_pydict(test_list, ['name' , 'age', 'city', 'birthday', 'dummy'])

{code}
Additional work would be needed to pass in a schema object if you want to 
refine data types further. I think the existing schema-handling code from 
from_pandas() would work for that.


> [Python] New pyarrow.Table.from_pydict() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
> # convert microseconds to milliseconds. More support for MS in parquet.
> today = datetime.now()
> today = datetime(today.year, today.month, today.day, today.hour, 
> today.minute, today.second, today.microsecond - today.microsecond % 1000)
> pylist = [
> {"name": "Tom", "age": 10},
> {"name": "Mark", "age": 5, "city": "San Francisco"},
> {"name": "Pam", "age": 7, "birthday": today}
> ]
> def from_pydict(pylist, schema=None, columns=None, safe=True):
>     arrow_columns = list()
>     if schema:
>         columns = schema.names
>     if not columns:
>         return
>     for column in columns:
>         arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     if schema:
>         arrow_table = arrow_table.cast(schema, safe=safe)
>     return arrow_table
> test = from_pydict(pylist, columns=['name', 'age', 'city', 'birthday', 'dummy'])
> test_schema = pa.schema([
>     pa.field('name', pa.string()),
>     pa.field('age', pa.int16()),
>     pa.field('city', pa.string()),
>     pa.field('birthday', pa.timestamp('ms'))
> ])
> test2 = from_pydict(pylist, schema=test_schema)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function

2018-12-14 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee updated ARROW-4032:
-
Description: 
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists and there are inherent problems 
using Pandas with NULL support for Int(s) and Boolean(s)

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

test_list = [
{"name": "Tom", "age": 10},
{"name": "Mark", "age": 5, "city": "San Francisco"},
{"name": "Pam", "age": 7, "birthday": datetime.now()}
]

def from_pydict(pylist, columns):
    arrow_columns = list()
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    return arrow_table

test = from_pydict(test_list, ['name' , 'age', 'city', 'birthday', 'dummy'])

{code}
 

  was:
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists and there are inherent problems 
using Pandas with NULL support for Int(s) and Boolean(s)

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

pylist = [
{"name": "Tom", "age": 10},
{"name": "Mark", "age": 5, "city": "San Francisco"},
{"name": "Pam", "age": 7, "birthday": datetime.now()}
]

def from_pydict(pylist, columns):
    arrow_columns = list()
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    return arrow_table

test = from_pydict(pylist, ['name' , 'age', 'city', 'birthday', 'dummy'])

{code}
 


> [Python] New pyarrow.Table.from_pydict() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
> test_list = [
> {"name": "Tom", "age": 10},
> {"name": "Mark", "age": 5, "city": "San Francisco"},
> {"name": "Pam", "age": 7, "birthday": datetime.now()}
> ]
> def from_pydict(pylist, columns):
>     arrow_columns = list()
>     for column in columns:
>         arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     return arrow_table
> test = from_pydict(test_list, ['name' , 'age', 'city', 'birthday', 'dummy'])
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function

2018-12-14 Thread David Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Lee updated ARROW-4032:
-
Description: 
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists and there are inherent problems 
using Pandas with NULL support for Int(s) and Boolean(s)

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

test_list = [
{"name": "Tom", "age": 10},
{"name": "Mark", "age": 5, "city": "San Francisco"},
{"name": "Pam", "age": 7, "birthday": datetime.now()}
]

def from_pydict(pylist, columns):
    arrow_columns = list()
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    return arrow_table

test = from_pydict(test_list, ['name' , 'age', 'city', 'birthday', 'dummy'])

{code}
Additional work would be needed to pass in a schema object if you want to 
refine data types further. I think the existing schema-handling code from 
from_pandas() would work for that; see the sketch below.
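
For illustration only, a minimal sketch (not part of the ticket) of the kind of schema-driven refinement meant here, assuming the Table.cast() API and reusing field names from the example above:
{code:java}
import pyarrow as pa

# Columns built without a schema get inferred types (e.g. int64 for 'age').
arrays = [pa.array(['Tom', 'Mark']), pa.array([10, 5])]
table = pa.Table.from_arrays(arrays, ['name', 'age'])

# Casting to an explicit schema afterwards refines the types, which is the effect described above.
refined = table.cast(pa.schema([
    pa.field('name', pa.string()),
    pa.field('age', pa.int16()),
]), safe=True)
{code}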

  was:
Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists and there are inherent problems 
using Pandas with NULL support for Int(s) and Boolean(s)

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

test_list = [
{"name": "Tom", "age": 10},
{"name": "Mark", "age": 5, "city": "San Francisco"},
{"name": "Pam", "age": 7, "birthday": datetime.now()}
]

def from_pydict(pylist, columns):
    arrow_columns = list()
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    return arrow_table

test = from_pydict(test_list, ['name' , 'age', 'city', 'birthday', 'dummy'])

{code}
 


> [Python] New pyarrow.Table.from_pydict() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
> test_list = [
> {"name": "Tom", "age": 10},
> {"name": "Mark", "age": 5, "city": "San Francisco"},
> {"name": "Pam", "age": 7, "birthday": datetime.now()}
> ]
> def from_pydict(pylist, columns):
>     arrow_columns = list()
>     for column in columns:
>         arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     return arrow_table
> test = from_pydict(test_list, ['name' , 'age', 'city', 'birthday', 'dummy'])
> {code}
> Additional work would be needed to pass in a schema object if you want to 
> refine data types further. I think the existing schema-handling code from 
> from_pandas() would work for that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function

2018-12-14 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721727#comment-16721727
 ] 

Wes McKinney commented on ARROW-4032:
-

You can do {{pa.array(pylist)}} already. So if we had a function to convert 
StructArray to Table then this would mostly do what you're describing. This was 
partly the intent of ARROW-40
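
For illustration only (this helper does not exist in pyarrow), a rough sketch of the StructArray-to-Table conversion described above, assuming current pyarrow behavior where pa.array() infers a struct type from a list of dicts, StructType is iterable, and StructArray.flatten() returns the child arrays:
{code:java}
import pyarrow as pa

pylist = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5},
]

# pa.array() on a list of dicts yields a StructArray with one child array per key.
struct_array = pa.array(pylist)

# Splitting the struct children back out gives the columns of a Table.
names = [field.name for field in struct_array.type]
table = pa.Table.from_arrays(struct_array.flatten(), names)
{code}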

> [Python] New pyarrow.Table.from_pydict() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: David Lee
>Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent 
> problems using Pandas with NULL support for Int(s) and Boolean(s)
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> {{NaN}}, Integer {{NA}} values and {{NA}} type promotions:
> Sample python code on how this would work.
>  
> {code:java}
> import pyarrow as pa
> from datetime import datetime
> pylist = [
> {"name": "Tom", "age": 10},
> {"name": "Mark", "age": 5, "city": "San Francisco"},
> {"name": "Pam", "age": 7, "birthday": datetime.now()}
> ]
> def from_pydict(pylist, columns):
>     arrow_columns = list()
>     for column in columns:
>         arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     return arrow_table
> test = from_pydict(pylist, ['name' , 'age', 'city', 'birthday', 'dummy'])
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function

2018-12-14 Thread David Lee (JIRA)
David Lee created ARROW-4032:


 Summary: [Python] New pyarrow.Table.from_pydict() function
 Key: ARROW-4032
 URL: https://issues.apache.org/jira/browse/ARROW-4032
 Project: Apache Arrow
  Issue Type: Task
  Components: Python
Reporter: David Lee


Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists and there are inherent problems 
using Pandas with NULL support for Int(s) and Boolean(s)

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

{{NaN}}, Integer {{NA}} values and {{NA}} type promotions:

Sample python code on how this would work.

 
{code:java}
import pyarrow as pa
from datetime import datetime

pylist = [
{"name": "Tom", "age": 10},
{"name": "Mark", "age": 5, "city": "San Francisco"},
{"name": "Pam", "age": 7, "birthday": datetime.now()}
]

def from_pydict(pylist, columns):
    arrow_columns = list()
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    return arrow_table

test = from_pydict(pylist, ['name' , 'age', 'city', 'birthday', 'dummy'])

{code}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4031) [C++] Refactor ArrayBuilder bitmap logic into TypedBufferBuilder

2018-12-14 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721715#comment-16721715
 ] 

Wes McKinney commented on ARROW-4031:
-

Makes sense. If you do start working on this you may want to hold off until 
https://github.com/apache/arrow/pull/3171 is merged (since a bunch of the code 
is moved around)

> [C++] Refactor ArrayBuilder bitmap logic into TypedBufferBuilder
> --
>
> Key: ARROW-4031
> URL: https://issues.apache.org/jira/browse/ARROW-4031
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Benjamin Kietzman
>Priority: Minor
>
> It would be useful to have a specialization of TypedBufferBuilder to simplify 
> building buffers of bits. This could then be utilized by ArrayBuilder (for 
> the null bitmap) and BooleanBuilder (for values)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-3184) [C++] Add modular build targets, "all" target, and require explicit target when invoking make or ninja

2018-12-14 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3184.
-
Resolution: Fixed

Issue resolved by pull request 3172
[https://github.com/apache/arrow/pull/3172]

> [C++] Add modular build targets, "all" target, and require explicit target 
> when invoking make or ninja
> --
>
> Key: ARROW-3184
> URL: https://issues.apache.org/jira/browse/ARROW-3184
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> This will make it easier to build and install only part of the project



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4031) [C++] Refactor ArrayBuilder bitmap logic into TypedBufferBuilder

2018-12-14 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-4031:


 Summary: [C++] Refactor ArrayBuilder bitmap logic into 
TypedBufferBuilder
 Key: ARROW-4031
 URL: https://issues.apache.org/jira/browse/ARROW-4031
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Benjamin Kietzman


It would be useful to have a specialization of TypedBufferBuilder to simplify 
building buffers of bits. This could then be utilized by ArrayBuilder (for the 
null bitmap) and BooleanBuilder (for values)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-3994) [C++] Remove ARROW_GANDIVA_BUILD_TESTS option

2018-12-14 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3994.
-
Resolution: Fixed

Resolved in 
https://github.com/apache/arrow/commit/804502f941f808583e9f7043e203533de738d577

> [C++] Remove ARROW_GANDIVA_BUILD_TESTS option
> -
>
> Key: ARROW-3994
> URL: https://issues.apache.org/jira/browse/ARROW-3994
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Gandiva
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> This is not needed now that both libraries and tests are tied to the same 
> "gandiva" build target and label. So {{ninja gandiva && ctest -L gandiva}} 
> will build only the relevant targets
> Follow up to ARROW-3988



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4030) [CI] Use travis_terminate to halt builds when a step fails

2018-12-14 Thread Francois Saint-Jacques (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721676#comment-16721676
 ] 

Francois Saint-Jacques commented on ARROW-4030:
---

It's apparently worse than this. According to comments on the linked issue, 
Travis will only mark your build as failed if the last script returns 
non-zero (essentially behaving like a pipe). I'd recommend moving to the Rust 
technique: https://github.com/rust-lang/rust/pull/12513/files

> [CI] Use travis_terminate to halt builds when a step fails
> --
>
> Key: ARROW-4030
> URL: https://issues.apache.org/jira/browse/ARROW-4030
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> I noticed that Travis CI will soldier onward if a step in its {{script:}} 
> block fails. This wastes build time when there is an error somewhere early on 
> in the testing process
> For example, in the main C++ build, if {{travis_script_cpp.sh}} fails, then 
> the subsequent steps will continue.
> It seems the way to deal with this is to add {{|| travis_terminate 1}} to 
> lines that can fail
> see
> https://medium.com/@manjula.cse/how-to-stop-the-execution-of-travis-pipeline-if-script-exits-with-an-error-f0e5a43206bf
> I also found this discussion
> https://github.com/travis-ci/travis-ci/issues/1066



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4030) [CI] Use travis_terminate to halt builds when a step fails

2018-12-14 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-4030:
---

 Summary: [CI] Use travis_terminate to halt builds when a step fails
 Key: ARROW-4030
 URL: https://issues.apache.org/jira/browse/ARROW-4030
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Wes McKinney
 Fix For: 0.12.0


I noticed that Travis CI will soldier onward if a step in its {{script:}} block 
fails. This wastes build time when there is an error somewhere early on in the 
testing process

For example, in the main C++ build, if {{travis_script_cpp.sh}} fails, then the 
subsequent steps will continue.

It seems the way to deal with this is to add {{|| travis_terminate 1}} to lines 
that can fail

see

https://medium.com/@manjula.cse/how-to-stop-the-execution-of-travis-pipeline-if-script-exits-with-an-error-f0e5a43206bf

I also found this discussion

https://github.com/travis-ci/travis-ci/issues/1066



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4028) [Rust] Merge parquet-rs codebase

2018-12-14 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4028:

Fix Version/s: 0.12.0

> [Rust] Merge parquet-rs codebase
> 
>
> Key: ARROW-4028
> URL: https://issues.apache.org/jira/browse/ARROW-4028
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 0.12.0
>
>
> Initial donation of [parquet-rs|https://github.com/sunchao/parquet-rs], an 
> Apache Parquet implementation in Rust. This is subject to ASF IP clearance. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-4007) [Java][Plasma] Plasma JNI tests failing

2018-12-14 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4007:

Fix Version/s: (was: 0.12.0)
   0.13.0

> [Java][Plasma] Plasma JNI tests failing
> ---
>
> Key: ARROW-4007
> URL: https://issues.apache.org/jira/browse/ARROW-4007
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Wes McKinney
>Priority: Critical
> Fix For: 0.13.0
>
>
> see https://travis-ci.org/apache/arrow/jobs/466819720
> {code}
> [INFO] Total time: 10.633 s
> [INFO] Finished at: 2018-12-12T03:56:33Z
> [INFO] Final Memory: 39M/426M
> [INFO] 
> 
>   linux-vdso.so.1 =>  (0x7ffcff172000)
>   librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7f99ecd9e000)
>   libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x7f99ecb85000)
>   libboost_system.so.1.54.0 => 
> /usr/lib/x86_64-linux-gnu/libboost_system.so.1.54.0 (0x7f99ec981000)
>   libboost_filesystem.so.1.54.0 => 
> /usr/lib/x86_64-linux-gnu/libboost_filesystem.so.1.54.0 (0x7f99ec76b000)
>   libboost_regex.so.1.54.0 => 
> /usr/lib/x86_64-linux-gnu/libboost_regex.so.1.54.0 (0x7f99ec464000)
>   libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 
> (0x7f99ec246000)
>   libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 
> (0x7f99ebf3)
>   libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f99ebc2a000)
>   libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 
> (0x7f99eba12000)
>   libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f99eb649000)
>   libicuuc.so.52 => /usr/lib/x86_64-linux-gnu/libicuuc.so.52 
> (0x7f99eb2d)
>   libicui18n.so.52 => /usr/lib/x86_64-linux-gnu/libicui18n.so.52 
> (0x7f99eaec9000)
>   /lib64/ld-linux-x86-64.so.2 (0x7f99ecfa6000)
>   libicudata.so.52 => /usr/lib/x86_64-linux-gnu/libicudata.so.52 
> (0x7f99e965c000)
>   libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7f99e9458000)
> /home/travis/build/apache/arrow/cpp/src/plasma/store.cc:985: Allowing the 
> Plasma store to use up to 0.01GB of memory.
> /home/travis/build/apache/arrow/cpp/src/plasma/store.cc:1015: Starting object 
> store with directory /dev/shm and huge page support disabled
> Start process 317574433 OK, cmd = 
> [/home/travis/build/apache/arrow/cpp-install/bin/plasma_store_server  -s  
> /tmp/store89237  -m  1000]
> Start object store success
> Start test.
> Plasma java client put test success.
> Plasma java client get single object test success.
> Plasma java client get multi-object test success.
> ObjectId [B@34c45dca error at PlasmaClient put
> java.lang.Exception: An object with this ID already exists in the plasma 
> store.
>   at org.apache.arrow.plasma.PlasmaClientJNI.create(Native Method)
>   at org.apache.arrow.plasma.PlasmaClient.put(PlasmaClient.java:51)
>   at 
> org.apache.arrow.plasma.PlasmaClientTest.doTest(PlasmaClientTest.java:145)
>   at 
> org.apache.arrow.plasma.PlasmaClientTest.main(PlasmaClientTest.java:220)
> Plasma java client put same object twice exception test success.
> Plasma java client hash test success.
> Plasma java client contains test success.
> Plasma java client metadata get test success.
> Plasma java client delete test success.
> Kill plasma store process forcely
> All test success.
> ~/build/apache/arrow
> {code}
> I didn't see any related code changes



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4029) [C++] Define and document naming convention for internal / private header files not to be installed

2018-12-14 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-4029:
---

 Summary: [C++] Define and document naming convention for internal 
/ private header files not to be installed
 Key: ARROW-4029
 URL: https://issues.apache.org/jira/browse/ARROW-4029
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.12.0


The purpose of this is so that a common {{ARROW_INSTALL_PUBLIC_HEADERS}} can 
recognize and exclude any file that is non-public from installation.

see discussion on https://github.com/apache/arrow/pull/3172



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3449) [C++] Support CMake 3.2 for "out of the box" builds

2018-12-14 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721533#comment-16721533
 ] 

Wes McKinney commented on ARROW-3449:
-

Per discussion on https://github.com/apache/arrow/pull/3172, I changed the 
default for {{ARROW_GANDIVA_JAVA}} to {{OFF}}. If that's the only component 
that requires a newer CMake, we might punt on dealing with the UseJava.cmake 
issue and simply ask that people install the newest CMake if they are building 
that part of the project (which already has a steep list of dependencies, so 
CMake on top of that is not much more to install).

> [C++] Support CMake 3.2 for "out of the box" builds
> ---
>
> Key: ARROW-3449
> URL: https://issues.apache.org/jira/browse/ARROW-3449
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Francois Saint-Jacques
>Priority: Major
> Fix For: 0.12.0
>
>
> As reported in the 0.11.0 RC1 release vote, some of our dependencies (like 
> gbenchmark) do not build out of the box with CMake 3.2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2026) [Python] Cast all timestamp resolutions to INT96 use_deprecated_int96_timestamps=True

2018-12-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2026:
--
Labels: c++ parquet pull-request-available redshift timestamps  (was: c++ 
parquet redshift timestamps)

> [Python] Cast all timestamp resolutions to INT96 
> use_deprecated_int96_timestamps=True
> -
>
> Key: ARROW-2026
> URL: https://issues.apache.org/jira/browse/ARROW-2026
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: OS: Mac OS X 10.13.2
> Python: 3.6.4
> PyArrow: 0.8.0
>Reporter: Diego Argueta
>Assignee: Francois Saint-Jacques
>Priority: Major
>  Labels: c++, parquet, pull-request-available, redshift, 
> timestamps
> Fix For: 0.12.0
>
>
> When writing to a Parquet file, if `use_deprecated_int96_timestamps` is True, 
> timestamps are only written as 96-bit integers if the timestamp has 
> nanosecond resolution. This is a problem because Amazon Redshift timestamps 
> only have microsecond resolution but require them to be stored in 96-bit 
> format in Parquet files.
> I'd expect the use_deprecated_int96_timestamps flag to cause _all_ timestamps 
> to be written as 96 bits, regardless of resolution. If this is a deliberate 
> design decision, it'd be immensely helpful if it were explicitly documented 
> as part of the argument.
>  
> To reproduce:
>  
> 1. Create a table with a timestamp having microsecond or millisecond 
> resolution, and save it to a Parquet file. Be sure to set 
> `use_deprecated_int96_timestamps` to True.
>  
> {code:java}
> import datetime
> import pyarrow
> from pyarrow import parquet
> schema = pyarrow.schema([
>     pyarrow.field('last_updated', pyarrow.timestamp('us')),
> ])
> data = [
>     pyarrow.array([datetime.datetime.now()], pyarrow.timestamp('us')),
> ]
> table = pyarrow.Table.from_arrays(data, ['last_updated'])
> with open('test_file.parquet', 'wb') as fdesc:
>     parquet.write_table(table, fdesc,
>                         use_deprecated_int96_timestamps=True)
> {code}
>  
> 2. Inspect the file. I used parquet-tools:
>  
> {noformat}
> dak@tux ~ $ parquet-tools meta test_file.parquet
> file:         file:/Users/dak/test_file.parquet
> creator:      parquet-cpp version 1.3.2-SNAPSHOT
> file schema:  schema
> 
> last_updated: OPTIONAL INT64 O:TIMESTAMP_MICROS R:0 D:1
> row group 1:  RC:1 TS:76 OFFSET:4
> 
> last_updated:  INT64 SNAPPY DO:4 FPO:28 SZ:76/72/0.95 VC:1 
> ENC:PLAIN,PLAIN_DICTIONARY,RLE{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4009) [CI] Run Valgrind and C++ code coverage in different bulds

2018-12-14 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721524#comment-16721524
 ] 

Wes McKinney commented on ARROW-4009:
-

I agree that valgrind does provide useful insights. It can find things (like 
memory leaks) that ASAN does not. It does require an up-to-date clang, indeed.

Luckily, because of the way we manage memory, leaks in the C++ code are rare, 
which has been nice.



> [CI] Run Valgrind and C++ code coverage in different bulds
> --
>
> Key: ARROW-4009
> URL: https://issues.apache.org/jira/browse/ARROW-4009
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Affects Versions: 0.11.1
>Reporter: Antoine Pitrou
>Priority: Major
>
> Currently, we run Valgrind on a coverage-enabled C++ build on Travis-CI. This 
> means the slowness of Valgrind acts as a multiplier of the overhead of 
> outputting coverage information using the instrumentation added by the 
> compiler.
> Instead we should probably emit C++ (and Python) coverage information in a 
> different Travis-CI build without Valgrind enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-4015) [Plasma] remove legacy interfaces for plasma manager

2018-12-14 Thread Philipp Moritz (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philipp Moritz resolved ARROW-4015.
---
   Resolution: Fixed
Fix Version/s: 0.12.0

Issue resolved by pull request 3167
[https://github.com/apache/arrow/pull/3167]

> [Plasma] remove legacy interfaces for plasma manager
> 
>
> Key: ARROW-4015
> URL: https://issues.apache.org/jira/browse/ARROW-4015
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++)
>Reporter: Zhijun Fu
>Assignee: Zhijun Fu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/arrow/issues/3154]
> In legacy ray, interacting with remote plasma stores is done via the plasma 
> manager, which is part of ray, and plasma has a few interfaces to support it 
> - namely Fetch() and Wait().
> The legacy ray code has now been removed, and the new raylet uses the object 
> manager to interface with remote machines, so these legacy plasma interfaces 
> are no longer used. I think we could remove them to clean up the code and 
> avoid confusion.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4009) [CI] Run Valgrind and C++ code coverage in different bulds

2018-12-14 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721301#comment-16721301
 ] 

Antoine Pitrou commented on ARROW-4009:
---

(Valgrind, while slow, finds useful insight)

> [CI] Run Valgrind and C++ code coverage in different bulds
> --
>
> Key: ARROW-4009
> URL: https://issues.apache.org/jira/browse/ARROW-4009
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Affects Versions: 0.11.1
>Reporter: Antoine Pitrou
>Priority: Major
>
> Currently, we run Valgrind on a coverage-enabled C++ build on Travis-CI. This 
> means the slowness of Valgrind acts as a multiplier of the overhead of 
> outputting coverage information using the instrumentation added by the 
> compiler.
> Instead we should probably emit C++ (and Python) coverage information in a 
> different Travis-CI build without Valgrind enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4009) [CI] Run Valgrind and C++ code coverage in different bulds

2018-12-14 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721300#comment-16721300
 ] 

Antoine Pitrou commented on ARROW-4009:
---

I'm not up-to-date on ASAN. Does it require a recent clang for useful results?

> [CI] Run Valgrind and C++ code coverage in different bulds
> --
>
> Key: ARROW-4009
> URL: https://issues.apache.org/jira/browse/ARROW-4009
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Affects Versions: 0.11.1
>Reporter: Antoine Pitrou
>Priority: Major
>
> Currently, we run Valgrind on a coverage-enabled C++ build on Travis-CI. This 
> means the slowness of Valgrind acts as a multiplier of the overhead of 
> outputting coverage information using the instrumentation added by the 
> compiler.
> Instead we should probably emit C++ (and Python) coverage information in a 
> different Travis-CI build without Valgrind enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-3979) [Gandiva] fix all valgrind reported errors

2018-12-14 Thread shyam narayan singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shyam narayan singh reassigned ARROW-3979:
--

Assignee: shyam narayan singh  (was: Pindikura Ravindra)

> [Gandiva] fix all valgrind reported errors
> --
>
> Key: ARROW-3979
> URL: https://issues.apache.org/jira/browse/ARROW-3979
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Gandiva
>Reporter: Pindikura Ravindra
>Assignee: shyam narayan singh
>Priority: Major
>
> Travis reports lots of valgrind errors when running gandiva tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3701) [Gandiva] Add support for decimal operations

2018-12-14 Thread Pindikura Ravindra (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721208#comment-16721208
 ] 

Pindikura Ravindra commented on ARROW-3701:
---

As part of my PR, I'm adding more benchmarks to gandiva/benchmarks.cc - this'll 
exercise both the arrow-decimal code and gandiva-decimal code.

 

> [Gandiva] Add support for decimal operations
> 
>
> Key: ARROW-3701
> URL: https://issues.apache.org/jira/browse/ARROW-3701
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Gandiva
>Reporter: Pindikura Ravindra
>Assignee: Pindikura Ravindra
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> To begin with, will add support for 128-bit decimals. There are two parts :
>  # llvm_generator needs to understand decimal types (value, precision, scale)
>  # code decimal operations : add/subtract/multiply/divide/mod/..
>  ** This will be c++ code that can be pre-compiled to emit IR code



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)