[jira] [Comment Edited] (ARROW-4032) [Python] New pyarrow.Table.from_pylist() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721986#comment-16721986 ] David Lee edited comment on ARROW-4032 at 12/15/18 3:58 AM: Ended up just writing from_pylist() and to_pylist().. They run much faster than going through pandas.. {code:java} def from_pylist(pylist, schema, safe=True): arrow_columns = list() for column in schema.names: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe, type=schema.types[schema.get_field_index(column)])) arrow_table = pa.Table.from_arrays(arrow_columns, schema.names) return arrow_table def to_pylist(arrow_table): od = pyarrow.Table.to_pydict(arrow_table) pylist = list() columns = arrow_table.schema.names rows = len(arrow_table[columns[0]]) for row in range(rows): pylist.append({key: arrow_table[key][row] for key in columns}) return pylist {code} was (Author: davlee1...@yahoo.com): Ended up just writing from_pylist() and to_pylist().. They run much faster than going through pandas.. {code:java} def from_pylist(pylist, schema, safe=True): arrow_columns = list() for column in schema.names: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe, type=schema.types[schema.get_field_index(column)])) arrow_table = pa.Table.from_arrays(arrow_columns, schema.names) return arrow_table def to_pylist(arrow_table): od = pyarrow.Table.to_pydict(arrow_table) pylist = list() columns = list(arrow_table.keys()) rows = len(arrow_table[columns[0]]) for row in range(rows): pylist.append({key: arrow_table[key][row] for key in columns}) return pylist {code} > [Python] New pyarrow.Table.from_pylist() function > - > > Key: ARROW-4032 > URL: https://issues.apache.org/jira/browse/ARROW-4032 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: David Lee >Priority: Minor > > Here's a proposal to create a pyarrow.Table.from_pydict() function. 
> Right now only pyarrow.Table.from_pandas() exists and there are inherent > problems using Pandas with NULL support for Int(s) and Boolean(s) > [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] > {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: > Sample Python code on how this would work. > > {code:java} > import pyarrow as pa > from datetime import datetime > # convert microseconds to milliseconds. More support for MS in parquet. > today = datetime.now() > today = datetime(today.year, today.month, today.day, today.hour, > today.minute, today.second, today.microsecond - today.microsecond % 1000) > test_list = [ > {"name": "Tom", "age": 10}, > {"name": "Mark", "age": 5, "city": "San Francisco"}, > {"name": "Pam", "age": 7, "birthday": today} > ] > def from_pylist(pylist, schema=None, columns=None, safe=True): > arrow_columns = list() > if schema: > columns = schema.names > if not columns: > return > for column in columns: > arrow_columns.append(pa.array([v[column] if column in v else None for > v in pylist], safe=safe)) > arrow_table = pa.Table.from_arrays(arrow_columns, columns) > if schema: > arrow_table = arrow_table.cast(schema, safe=safe) > return arrow_table > test = from_pylist(test_list, columns=['name' , 'age', 'city', 'birthday', > 'dummy']) > test_schema = pa.schema([ > pa.field('name', pa.string()), > pa.field('age', pa.int16()), > pa.field('city', pa.string()), > pa.field('birthday', pa.timestamp('ms')) > ]) > test2 = from_pylist(test_list, schema=test_schema) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
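The proposed from_pylist() builds each column by probing every record dict for the key and substituting None when the key is absent. That core step can be exercised without pyarrow at all; `columns_from_pylist` below is a hypothetical stand-alone helper for illustration, not part of the pyarrow API:

```python
# Stand-alone sketch of the column-extraction step in the proposed
# from_pylist(): for each column name, collect the value from every
# record dict, filling missing keys with None (a null in Arrow terms).
def columns_from_pylist(pylist, column_names):
    return {name: [row.get(name) for row in pylist] for name in column_names}

records = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
]
cols = columns_from_pylist(records, ["name", "age", "city"])
# cols["city"] is [None, "San Francisco"]: the first record lacks the key.
```

In the real proposal each of these value lists would then be handed to pa.array() and the results assembled with pa.Table.from_arrays().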
[jira] [Commented] (ARROW-4032) [Python] New pyarrow.Table.from_pylist() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721986#comment-16721986 ] David Lee commented on ARROW-4032: -- Ended up just writing from_pylist() and to_pylist().. They run much faster than going through pandas.. {code:java} def from_pylist(pylist, schema, safe=True): arrow_columns = list() for column in schema.names: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe, type=schema.types[schema.get_field_index(column)])) arrow_table = pa.Table.from_arrays(arrow_columns, columns) return arrow_table def to_pylist(arrow_table): od = pyarrow.Table.to_pydict(arrow_table) pylist = list() columns = list(arrow_table.keys()) rows = len(arrow_table[columns[0]]) for row in range(rows): pylist.append({key: arrow_table[key][row] for key in columns}) return pylist {code} > [Python] New pyarrow.Table.from_pylist() function > - > > Key: ARROW-4032 > URL: https://issues.apache.org/jira/browse/ARROW-4032 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: David Lee >Priority: Minor > > Here's a proposal to create a pyarrow.Table.from_pydict() function. > Right now only pyarrow.Table.from_pandas() exists and there are inherent > problems using Pandas with NULL support for Int(s) and Boolean(s) > [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] > {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: > Sample Python code on how this would work. > > {code:java} > import pyarrow as pa > from datetime import datetime > # convert microseconds to milliseconds. More support for MS in parquet. 
> today = datetime.now() > today = datetime(today.year, today.month, today.day, today.hour, > today.minute, today.second, today.microsecond - today.microsecond % 1000) > test_list = [ > {"name": "Tom", "age": 10}, > {"name": "Mark", "age": 5, "city": "San Francisco"}, > {"name": "Pam", "age": 7, "birthday": today} > ] > def from_pylist(pylist, schema=None, columns=None, safe=True): > arrow_columns = list() > if schema: > columns = schema.names > if not columns: > return > for column in columns: > arrow_columns.append(pa.array([v[column] if column in v else None for > v in pylist], safe=safe)) > arrow_table = pa.Table.from_arrays(arrow_columns, columns) > if schema: > arrow_table = arrow_table.cast(schema, safe=safe) > return arrow_table > test = from_pylist(test_list, columns=['name' , 'age', 'city', 'birthday', > 'dummy']) > test_schema = pa.schema([ > pa.field('name', pa.string()), > pa.field('age', pa.int16()), > pa.field('city', pa.string()), > pa.field('birthday', pa.timestamp('ms')) > ]) > test2 = from_pylist(test_list, schema=test_schema) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
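The sample above clamps `datetime.now()` to millisecond precision (since the schema targets `timestamp('ms')`) by subtracting `microsecond % 1000`. That arithmetic in isolation, standard library only:

```python
from datetime import datetime

# Truncate a datetime's microseconds down to whole milliseconds, as the
# proposal does before building a timestamp('ms') column.
def truncate_to_ms(dt):
    return dt.replace(microsecond=dt.microsecond - dt.microsecond % 1000)

t = datetime(2018, 12, 15, 3, 58, 0, 123456)
assert truncate_to_ms(t).microsecond == 123000
```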
[jira] [Comment Edited] (ARROW-4032) [Python] New pyarrow.Table.from_pylist() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721986#comment-16721986 ] David Lee edited comment on ARROW-4032 at 12/15/18 3:53 AM: Ended up just writing from_pylist() and to_pylist().. They run much faster than going through pandas.. {code:java} def from_pylist(pylist, schema, safe=True): arrow_columns = list() for column in schema.names: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe, type=schema.types[schema.get_field_index(column)])) arrow_table = pa.Table.from_arrays(arrow_columns, schema.names) return arrow_table def to_pylist(arrow_table): od = pyarrow.Table.to_pydict(arrow_table) pylist = list() columns = list(arrow_table.keys()) rows = len(arrow_table[columns[0]]) for row in range(rows): pylist.append({key: arrow_table[key][row] for key in columns}) return pylist {code} was (Author: davlee1...@yahoo.com): Ended up just writing from_pylist() and to_pylist().. They run much faster than going through pandas.. {code:java} def from_pylist(pylist, schema, safe=True): arrow_columns = list() for column in schema.names: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe, type=schema.types[schema.get_field_index(column)])) arrow_table = pa.Table.from_arrays(arrow_columns, columns) return arrow_table def to_pylist(arrow_table): od = pyarrow.Table.to_pydict(arrow_table) pylist = list() columns = list(arrow_table.keys()) rows = len(arrow_table[columns[0]]) for row in range(rows): pylist.append({key: arrow_table[key][row] for key in columns}) return pylist {code} > [Python] New pyarrow.Table.from_pylist() function > - > > Key: ARROW-4032 > URL: https://issues.apache.org/jira/browse/ARROW-4032 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: David Lee >Priority: Minor > > Here's a proposal to create a pyarrow.Table.from_pydict() function. 
> Right now only pyarrow.Table.from_pandas() exists and there are inherent > problems using Pandas with NULL support for Int(s) and Boolean(s) > [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] > {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: > Sample Python code on how this would work. > > {code:java} > import pyarrow as pa > from datetime import datetime > # convert microseconds to milliseconds. More support for MS in parquet. > today = datetime.now() > today = datetime(today.year, today.month, today.day, today.hour, > today.minute, today.second, today.microsecond - today.microsecond % 1000) > test_list = [ > {"name": "Tom", "age": 10}, > {"name": "Mark", "age": 5, "city": "San Francisco"}, > {"name": "Pam", "age": 7, "birthday": today} > ] > def from_pylist(pylist, schema=None, columns=None, safe=True): > arrow_columns = list() > if schema: > columns = schema.names > if not columns: > return > for column in columns: > arrow_columns.append(pa.array([v[column] if column in v else None for > v in pylist], safe=safe)) > arrow_table = pa.Table.from_arrays(arrow_columns, columns) > if schema: > arrow_table = arrow_table.cast(schema, safe=safe) > return arrow_table > test = from_pylist(test_list, columns=['name' , 'age', 'city', 'birthday', > 'dummy']) > test_schema = pa.schema([ > pa.field('name', pa.string()), > pa.field('age', pa.int16()), > pa.field('city', pa.string()), > pa.field('birthday', pa.timestamp('ms')) > ]) > test2 = from_pylist(test_list, schema=test_schema) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4033) [C++] thirdparty/download_dependencies.sh uses tools or options not available in older Linuxes
[ https://issues.apache.org/jira/browse/ARROW-4033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721984#comment-16721984 ] Francois Saint-Jacques commented on ARROW-4033: --- I suppose that realpath is used to get the absolute path so that the subsequent exports are independent of the relative path. It could be replaced by `readlink -f`, which is also part of coreutils and has been available longer. > [C++] thirdparty/download_dependencies.sh uses tools or options not available > in older Linuxes > -- > > Key: ARROW-4033 > URL: https://issues.apache.org/jira/browse/ARROW-4033 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.12.0 > > > I found I had to install the {{realpath}} apt package on Ubuntu 14.04. Also > {{wget 1.15}} does not have the {{--show-progress}} option -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2026) [Python] Cast all timestamp resolutions to INT96 use_deprecated_int96_timestamps=True
[ https://issues.apache.org/jira/browse/ARROW-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721955#comment-16721955 ] Francois Saint-Jacques commented on ARROW-2026: --- {code:java} file: file:/home/fsaintjacques/src/arrow/python/test_file.parquet creator: parquet-cpp version 1.5.1-SNAPSHOT file schema: schema -- last_updated: OPTIONAL INT96 R:0 D:1 row group 1: RC:1 TS:58 OFFSET:4 last_updated: INT96 SNAPPY DO:4 FPO:32 SZ:58/54/0.93 VC:1 ENC:PLAIN_DICTIONARY,PLAIN,R {code} > [Python] Cast all timestamp resolutions to INT96 > use_deprecated_int96_timestamps=True > - > > Key: ARROW-2026 > URL: https://issues.apache.org/jira/browse/ARROW-2026 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 > Environment: OS: Mac OS X 10.13.2 > Python: 3.6.4 > PyArrow: 0.8.0 >Reporter: Diego Argueta >Assignee: Francois Saint-Jacques >Priority: Major > Labels: c++, parquet, pull-request-available, redshift, > timestamps > Fix For: 0.12.0 > > Time Spent: 3h 10m > Remaining Estimate: 0h > > When writing to a Parquet file, if `use_deprecated_int96_timestamps` is True, > timestamps are only written as 96-bit integers if the timestamp has > nanosecond resolution. This is a problem because Amazon Redshift timestamps > only have microsecond resolution but require them to be stored in 96-bit > format in Parquet files. > I'd expect the use_deprecated_int96_timestamps flag to cause _all_ timestamps > to be written as 96 bits, regardless of resolution. If this is a deliberate > design decision, it'd be immensely helpful if it were explicitly documented > as part of the argument. > > To reproduce: > > 1. Create a table with a timestamp having microsecond or millisecond > resolution, and save it to a Parquet file. Be sure to set > `use_deprecated_int96_timestamps` to True. 
> > {code:java} > import datetime > import pyarrow > from pyarrow import parquet > schema = pyarrow.schema([ > pyarrow.field('last_updated', pyarrow.timestamp('us')), > ]) > data = [ > pyarrow.array([datetime.datetime.now()], pyarrow.timestamp('us')), > ] > table = pyarrow.Table.from_arrays(data, ['last_updated']) > with open('test_file.parquet', 'wb') as fdesc: > parquet.write_table(table, fdesc, > use_deprecated_int96_timestamps=True) > {code} > > 2. Inspect the file. I used parquet-tools: > > {noformat} > dak@tux ~ $ parquet-tools meta test_file.parquet > file: file:/Users/dak/test_file.parquet > creator: parquet-cpp version 1.3.2-SNAPSHOT > file schema: schema > > last_updated: OPTIONAL INT64 O:TIMESTAMP_MICROS R:0 D:1 > row group 1: RC:1 TS:76 OFFSET:4 > > last_updated: INT64 SNAPPY DO:4 FPO:28 SZ:76/72/0.95 VC:1 > ENC:PLAIN,PLAIN_DICTIONARY,RLE{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3829) [Python] Support protocols to extract Arrow objects from third-party classes
[ https://issues.apache.org/jira/browse/ARROW-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3829: Fix Version/s: (was: 0.12.0) 0.13.0 > [Python] Support protocols to extract Arrow objects from third-party classes > > > Key: ARROW-3829 > URL: https://issues.apache.org/jira/browse/ARROW-3829 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Fix For: 0.13.0 > > > In the style of NumPy's {{__array__}}, we should be able to ask inputs to > {{pa.array}}, {{pa.Table.from_X}}, ... whether they can convert themselves to > Arrow objects. This would allow for example to turn objects that hold an > Arrow object internally to expose them directly instead of going a conversion > path. > My current use case involves Pandas {{ExtensionArray}} instances that > internally have Arrow objects and should be reused when we pass the whole > {{DataFrame}} to {{pa.Table.from_pandas}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
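In the style of NumPy's {{__array__}}, the idea is a dunder hook that conversion entry points probe before falling back to generic element-by-element conversion. A minimal pure-Python sketch, assuming a hypothetical hook name `__arrow_array__` and using a plain list as a stand-in for a real Arrow array:

```python
# Hypothetical protocol sketch: a pa.array-style factory checks for a
# conversion hook on its input before doing element-by-element work.
class ExtensionColumn:
    """Third-party container that already holds Arrow-ready data."""
    def __init__(self, data):
        self._data = data

    def __arrow_array__(self):      # hypothetical hook name
        return list(self._data)     # would return the wrapped Arrow array

def convert(obj):
    hook = getattr(obj, "__arrow_array__", None)
    if hook is not None:
        return hook()               # reuse the object's own representation
    return [x for x in obj]         # fallback: generic conversion path

assert convert(ExtensionColumn([1, 2, 3])) == [1, 2, 3]
```

A pandas ExtensionArray wrapping an Arrow object, as in the use case above, would implement the hook to hand back its internal array directly instead of round-tripping through Python objects.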
[jira] [Commented] (ARROW-3803) [C++/Python] Split C++ and Python unit test Travis CI jobs, run all C++ tests (including Gandiva) together
[ https://issues.apache.org/jira/browse/ARROW-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721920#comment-16721920 ] Wes McKinney commented on ARROW-3803: - [~pitrou] unless you are already far into this, since I've been working a lot on the build scripts and CMake stuff this week, I can go ahead and take this > [C++/Python] Split C++ and Python unit test Travis CI jobs, run all C++ tests > (including Gandiva) together > -- > > Key: ARROW-3803 > URL: https://issues.apache.org/jira/browse/ARROW-3803 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Wes McKinney >Priority: Major > Fix For: 0.12.0 > > > Our main C++/Python job is bumping up against the 50 minute limit lately, so > it is time to do a little bit of reorganization > I suggest that we do this: > * Build and test all C++ code including Gandiva in a single job on Linux and > macOS > * Run Python unit tests (but not the C++ tests) on Linux and macOS in a > separate job > Code coverage will need to get uploaded in the Linux jobs for both of these, > so a little bit of surgery is required -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3829) [Python] Support protocols to extract Arrow objects from third-party classes
[ https://issues.apache.org/jira/browse/ARROW-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721917#comment-16721917 ] Wes McKinney commented on ARROW-3829: - I'm moving this to 0.13. If you submit an alpha / experimental version of this for 0.12, please go ahead =) > [Python] Support protocols to extract Arrow objects from third-party classes > > > Key: ARROW-3829 > URL: https://issues.apache.org/jira/browse/ARROW-3829 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Fix For: 0.13.0 > > > In the style of NumPy's {{__array__}}, we should be able to ask inputs to > {{pa.array}}, {{pa.Table.from_X}}, ... whether they can convert themselves to > Arrow objects. This would allow for example to turn objects that hold an > Arrow object internally to expose them directly instead of going a conversion > path. > My current use case involves Pandas {{ExtensionArray}} instances that > internally have Arrow objects and should be reused when we pass the whole > {{DataFrame}} to {{pa.Table.from_pandas}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3971) [Python] Remove APIs deprecated in 0.11 and prior
[ https://issues.apache.org/jira/browse/ARROW-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3971: -- Labels: pull-request-available (was: ) > [Python] Remove APIs deprecated in 0.11 and prior > - > > Key: ARROW-3971 > URL: https://issues.apache.org/jira/browse/ARROW-3971 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-3971) [Python] Remove APIs deprecated in 0.11 and prior
[ https://issues.apache.org/jira/browse/ARROW-3971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-3971: --- Assignee: Wes McKinney > [Python] Remove APIs deprecated in 0.11 and prior > - > > Key: ARROW-3971 > URL: https://issues.apache.org/jira/browse/ARROW-3971 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 0.12.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4006) Add CODE_OF_CONDUCT.md
[ https://issues.apache.org/jira/browse/ARROW-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-4006: -- Labels: pull-request-available (was: ) > Add CODE_OF_CONDUCT.md > -- > > Key: ARROW-4006 > URL: https://issues.apache.org/jira/browse/ARROW-4006 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > > The Apache Software Foundation has a code of conduct that applies to its > projects > https://www.apache.org/foundation/policies/conduct.html > We should add a document to the root of the git repository to direct > interested individuals to the CoC. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-4006) Add CODE_OF_CONDUCT.md
[ https://issues.apache.org/jira/browse/ARROW-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-4006: --- Assignee: Wes McKinney > Add CODE_OF_CONDUCT.md > -- > > Key: ARROW-4006 > URL: https://issues.apache.org/jira/browse/ARROW-4006 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 0.12.0 > > > The Apache Software Foundation has a code of conduct that applies to its > projects > https://www.apache.org/foundation/policies/conduct.html > We should add a document to the root of the git repository to direct > interested individuals to the CoC. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3058) [Python] Feather reads fail with unintuitive error when conversion from pandas yields ChunkedArray
[ https://issues.apache.org/jira/browse/ARROW-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3058: -- Labels: pull-request-available (was: ) > [Python] Feather reads fail with unintuitive error when conversion from > pandas yields ChunkedArray > -- > > Key: ARROW-3058 > URL: https://issues.apache.org/jira/browse/ARROW-3058 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > > See report in > https://github.com/wesm/feather/issues/321#issuecomment-412884084 > Individual string columns with more than 2GB are currently unsupported in the > Feather format -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-1807) [JAVA] Reduce Heap Usage (Phase 3): consolidate buffers
[ https://issues.apache.org/jira/browse/ARROW-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Teotia resolved ARROW-1807. - Resolution: Fixed > [JAVA] Reduce Heap Usage (Phase 3): consolidate buffers > --- > > Key: ARROW-1807 > URL: https://issues.apache.org/jira/browse/ARROW-1807 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Siddharth Teotia >Assignee: Pindikura Ravindra >Priority: Major > Labels: pull-request-available > Fix For: 0.13.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Consolidate buffers for reducing the volume of objects and heap usage > => single buffer for fixed width > < validity + offsets> = single buffer for var width, list vector -- This message was sent by Atlassian JIRA (v7.6.3#76005)
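The consolidation above merges the per-purpose buffers (data, validity, offsets) of a vector. A toy Python illustration of the var-width layout being consolidated, showing only the layout, not the Java heap implementation:

```python
# Toy Arrow-style var-width layout: values in one contiguous data buffer
# addressed by an offsets list, with nulls tracked by a validity list.
def encode(strings):
    data, offsets, validity = bytearray(), [0], []
    for s in strings:
        validity.append(s is not None)
        if s is not None:
            data.extend(s.encode())
        offsets.append(len(data))   # null slots repeat the previous offset
    return bytes(data), offsets, validity

def get(buffers, i):
    data, offsets, validity = buffers
    if not validity[i]:
        return None
    return data[offsets[i]:offsets[i + 1]].decode()

bufs = encode(["ab", None, "c"])
# get(bufs, 1) is None; get(bufs, 2) == "c"
```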
[jira] [Updated] (ARROW-4034) [Ruby] Interface for FileOutputStream doesn't respect append=True
[ https://issues.apache.org/jira/browse/ARROW-4034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4034: Summary: [Ruby] Interface for FileOutputStream doesn't respect append=True (was: red-arrow interface for FileOutputStream doesn't respect append=True) > [Ruby] Interface for FileOutputStream doesn't respect append=True > - > > Key: ARROW-4034 > URL: https://issues.apache.org/jira/browse/ARROW-4034 > Project: Apache Arrow > Issue Type: Bug > Components: Ruby > Environment: macOS High Sierra version 10.13.4; ruby 2.4.1; gtk-doc, > gobject-introspection, boost, Arrow C++ & Parquet C++, Arrow GLib all > installed via homebrew >Reporter: Ian Murray >Priority: Blocker > > It seems that the PR (#1978) that resolved Issue #2018 has not cascaded down > through the existing ruby interface. > I've been experimenting with variations of the `write-file.rb` examples, but > passing in the append flag as true > (`Arrow::FileOutputStream.open("/tmp/file.arrow", true)`) still results in > overwriting the file, and trying the newer interface using truncate and > append flags throws `ArgumentError: wrong number of arguments (3 for 2)`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4025) [Python] TensorFlow/PyTorch arrow ThreadPool workarounds not working in some settings
[ https://issues.apache.org/jira/browse/ARROW-4025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-4025: -- Labels: pull-request-available (was: ) > [Python] TensorFlow/PyTorch arrow ThreadPool workarounds not working in some > settings > - > > Key: ARROW-4025 > URL: https://issues.apache.org/jira/browse/ARROW-4025 > Project: Apache Arrow > Issue Type: Improvement >Affects Versions: 0.11.1 >Reporter: Philipp Moritz >Priority: Major > Labels: pull-request-available > > See the bug report in [https://github.com/ray-project/ray/issues/3520] > I wonder if we can revisit this issue and try to get rid of the workarounds > we tried to deploy in the past. > See also the discussion in [https://github.com/apache/arrow/pull/2096] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4034) red-arrow interface for FileOutputStream doesn't respect append=True
Ian Murray created ARROW-4034: - Summary: red-arrow interface for FileOutputStream doesn't respect append=True Key: ARROW-4034 URL: https://issues.apache.org/jira/browse/ARROW-4034 Project: Apache Arrow Issue Type: Bug Components: Ruby Environment: macOS High Sierra version 10.13.4; ruby 2.4.1; gtk-doc, gobject-introspection, boost, Arrow C++ & Parquet C++, Arrow GLib all installed via homebrew Reporter: Ian Murray It seems that the PR (#1978) that resolved Issue #2018 has not cascaded down through the existing ruby interface. I've been experimenting with variations of the `write-file.rb` examples, but passing in the append flag as true (`Arrow::FileOutputStream.open("/tmp/file.arrow", true)`) still results in overwriting the file, and trying the newer interface using truncate and append flags throws `ArgumentError: wrong number of arguments (3 for 2)`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4033) [C++] thirdparty/download_dependencies.sh uses tools or options not available in older Linuxes
[ https://issues.apache.org/jira/browse/ARROW-4033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721862#comment-16721862 ] Wes McKinney commented on ARROW-4033: - I partially addressed this in https://github.com/apache/arrow/pull/3174. [~fsaintjacques] can you use an alternative to {{realpath}}? > [C++] thirdparty/download_dependencies.sh uses tools or options not available > in older Linuxes > -- > > Key: ARROW-4033 > URL: https://issues.apache.org/jira/browse/ARROW-4033 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.12.0 > > > I found I had to install the {{realpath}} apt package on Ubuntu 14.04. Also > {{wget 1.15}} does not have the {{--show-progress}} option -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3230) [Python] Missing comparisons on ChunkedArray, Table
[ https://issues.apache.org/jira/browse/ARROW-3230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3230: Fix Version/s: (was: 0.12.0) 0.13.0 > [Python] Missing comparisons on ChunkedArray, Table > --- > > Key: ARROW-3230 > URL: https://issues.apache.org/jira/browse/ARROW-3230 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.10.0 >Reporter: Antoine Pitrou >Priority: Major > Fix For: 0.13.0 > > > Table and ChunkedArray equality are not implemented, meaning they fall back > on identity. Instead they should invoke equals(), as on Column. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
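The fix the issue asks for is mechanical: define `__eq__` to delegate to the structural `equals()` rather than inheriting identity comparison. An illustrative pure-Python sketch (not pyarrow's actual classes):

```python
# Sketch of equality delegating to equals() instead of object identity.
class ChunkedValues:
    def __init__(self, chunks):
        self.chunks = [list(c) for c in chunks]

    def equals(self, other):
        # Structural comparison: same flattened contents,
        # regardless of how they are split into chunks.
        if not isinstance(other, ChunkedValues):
            return False
        flat = lambda cv: [x for chunk in cv.chunks for x in chunk]
        return flat(self) == flat(other)

    def __eq__(self, other):
        return self.equals(other)   # no fallback to identity

    def __hash__(self):
        return hash(tuple(x for chunk in self.chunks for x in chunk))

a = ChunkedValues([[1, 2], [3]])
b = ChunkedValues([[1, 2, 3]])
assert a == b   # equal contents, distinct objects
```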
[jira] [Updated] (ARROW-4029) [C++] Define and document naming convention for internal / private header files not to be installed
[ https://issues.apache.org/jira/browse/ARROW-4029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-4029: -- Labels: pull-request-available (was: ) > [C++] Define and document naming convention for internal / private header > files not to be installed > --- > > Key: ARROW-4029 > URL: https://issues.apache.org/jira/browse/ARROW-4029 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > > The purpose of this is so that a common {{ARROW_INSTALL_PUBLIC_HEADERS}} can > recognize and exclude any file that is non-public from installation. > see discussion on https://github.com/apache/arrow/pull/3172 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-4029) [C++] Define and document naming convention for internal / private header files not to be installed
[ https://issues.apache.org/jira/browse/ARROW-4029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-4029: --- Assignee: Wes McKinney > [C++] Define and document naming convention for internal / private header > files not to be installed > --- > > Key: ARROW-4029 > URL: https://issues.apache.org/jira/browse/ARROW-4029 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 0.12.0 > > > The purpose of this is so that a common {{ARROW_INSTALL_PUBLIC_HEADERS}} can > recognize and exclude any file that is non-public from installation. > see discussion on https://github.com/apache/arrow/pull/3172 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2475) [Format] Confusing array length description
[ https://issues.apache.org/jira/browse/ARROW-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-2475. - Resolution: Fixed Assignee: Uwe L. Korn (was: Wes McKinney) This was addressed already since the docs were merged > [Format] Confusing array length description > --- > > Key: ARROW-2475 > URL: https://issues.apache.org/jira/browse/ARROW-2475 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation >Reporter: Krisztian Szucs >Assignee: Uwe L. Korn >Priority: Trivial > Fix For: 0.12.0 > > > "To encourage developers to compose smaller arrays (each of which contains > contiguous memory in its leaf nodes) to create larger array structures > possibly exceeding 2^31 - 1 elements, as opposed to allocating very large > contiguous memory blocks." > I think it could use a little more verbose explanation: `to compose smaller > arrays to create larger array structures` -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-3974) [C++] Combine field_builders_ and children_ members in array/builder.h
[ https://issues.apache.org/jira/browse/ARROW-3974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3974. - Resolution: Fixed Done in https://github.com/apache/arrow/commit/73f94c93d7eee25a43415dfa7a806b887942abd1 > [C++] Combine field_builders_ and children_ members in array/builder.h > -- > > Key: ARROW-3974 > URL: https://issues.apache.org/jira/browse/ARROW-3974 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 0.12.0 > > > The intent of {{children_}} was to use these in nested type builders. But > {{StructBuilder}} has its own differently-named child builders member > {{field_builders_}}. This bit of cruft should be cleaned up -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3449) [C++] Support CMake 3.2 for "out of the box" builds
[ https://issues.apache.org/jira/browse/ARROW-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3449: -- Labels: pull-request-available (was: ) > [C++] Support CMake 3.2 for "out of the box" builds > --- > > Key: ARROW-3449 > URL: https://issues.apache.org/jira/browse/ARROW-3449 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > > As reported in the 0.11.0 RC1 release vote, some of our dependencies (like > gbenchmark) do not build out of the box with CMake 3.2 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4033) [C++] thirdparty/download_dependencies.sh uses tools or options not available in older Linuxes
Wes McKinney created ARROW-4033: --- Summary: [C++] thirdparty/download_dependencies.sh uses tools or options not available in older Linuxes Key: ARROW-4033 URL: https://issues.apache.org/jira/browse/ARROW-4033 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 0.12.0 I found I had to install the {{realpath}} apt package on Ubuntu 14.04. Also {{wget 1.15}} does not have the {{--show-progress}} option -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-3984) [C++] Exit with error if user hits zstd ExternalProject path
[ https://issues.apache.org/jira/browse/ARROW-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-3984: --- Assignee: Wes McKinney > [C++] Exit with error if user hits zstd ExternalProject path > > > Key: ARROW-3984 > URL: https://issues.apache.org/jira/browse/ARROW-3984 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 0.12.0 > > > We should check the CMake version and exit with a more informative error if > {{ARROW_WITH_ZSTD}} is on, but the CMake version is too old -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-3449) [C++] Support CMake 3.2 for "out of the box" builds
[ https://issues.apache.org/jira/browse/ARROW-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-3449: --- Assignee: Wes McKinney (was: Francois Saint-Jacques) > [C++] Support CMake 3.2 for "out of the box" builds > --- > > Key: ARROW-3449 > URL: https://issues.apache.org/jira/browse/ARROW-3449 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 0.12.0 > > > As reported in the 0.11.0 RC1 release vote, some of our dependencies (like > gbenchmark) do not build out of the box with CMake 3.2 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3449) [C++] Support CMake 3.2 for "out of the box" builds
[ https://issues.apache.org/jira/browse/ARROW-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721812#comment-16721812 ] Wes McKinney commented on ARROW-3449: - There's a bunch of CMake-related issues. Since I'm already digging around in these files I'll take care of this > [C++] Support CMake 3.2 for "out of the box" builds > --- > > Key: ARROW-3449 > URL: https://issues.apache.org/jira/browse/ARROW-3449 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 0.12.0 > > > As reported in the 0.11.0 RC1 release vote, some of our dependencies (like > gbenchmark) do not build out of the box with CMake 3.2 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-3762) [C++] Parquet arrow::Table reads error when overflowing capacity of BinaryArray
[ https://issues.apache.org/jira/browse/ARROW-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3762. - Resolution: Fixed Issue resolved by pull request 3171 [https://github.com/apache/arrow/pull/3171] > [C++] Parquet arrow::Table reads error when overflowing capacity of > BinaryArray > --- > > Key: ARROW-3762 > URL: https://issues.apache.org/jira/browse/ARROW-3762 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Chris Ellison >Assignee: Wes McKinney >Priority: Major > Labels: parquet, pull-request-available > Fix For: 0.12.0 > > Time Spent: 4h 10m > Remaining Estimate: 0h > > When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError > due to it not creating chunked arrays. Reading each row group individually > and then concatenating the tables works, however. > > {code:java} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > x = pa.array(list('1' * 2**30)) > demo = 'demo.parquet' > def scenario(): > t = pa.Table.from_arrays([x], ['x']) > writer = pq.ParquetWriter(demo, t.schema) > for i in range(2): > writer.write_table(t) > writer.close() > pf = pq.ParquetFile(demo) > # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot > contain more than 2147483646 bytes, have 2147483647 > t2 = pf.read() > # Works, but note, there are 32 row groups, not 2 as suggested by: > # > https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing > tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)] > t3 = pa.concat_tables(tables) > scenario() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
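The workaround in the report above — reading each row group separately and concatenating the tables — works because each chunk's contiguous buffer stays under the 2 GiB BinaryArray cap. The chunking idea can be sketched without pyarrow at all (a hypothetical helper for illustration, not part of the Arrow API):

```python
def chunk_under_limit(values, byte_limit):
    """Group string values into chunks whose total encoded size stays within byte_limit.

    Mirrors, in miniature, why chunked arrays sidestep the 2 GiB
    BinaryArray cap: no single chunk's buffer exceeds the limit.
    """
    chunks, current, size = [], [], 0
    for v in values:
        nbytes = len(v.encode("utf-8"))
        # Start a new chunk once adding this value would cross the limit.
        if current and size + nbytes > byte_limit:
            chunks.append(current)
            current, size = [], 0
        current.append(v)
        size += nbytes
    return chunks + [current] if current else chunks
```

With a 4-byte limit, `chunk_under_limit(["aa", "bb", "cc"], 4)` yields two chunks, each at or under the cap — the same shape of result that reading row groups one at a time produces for the 2 GiB case.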
[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pylist() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee updated ARROW-4032: - Summary: [Python] New pyarrow.Table.from_pylist() function (was: [Python] New pyarrow.Table.from_pydict() function) > [Python] New pyarrow.Table.from_pylist() function > - > > Key: ARROW-4032 > URL: https://issues.apache.org/jira/browse/ARROW-4032 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: David Lee >Priority: Minor > > Here's a proposal to create a pyarrow.Table.from_pydict() function. > Right now only pyarrow.Table.from_pandas() exists and there are inherent > problems with Pandas' NULL support for Int(s) and Boolean(s) > [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] > {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: > Sample Python code showing how this would work. > 
> today = datetime.now() > today = datetime(today.year, today.month, today.day, today.hour, > today.minute, today.second, today.microsecond - today.microsecond % 1000) > test_list = [ > {"name": "Tom", "age": 10}, > {"name": "Mark", "age": 5, "city": "San Francisco"}, > {"name": "Pam", "age": 7, "birthday": today} > ] > def from_pylist(pylist, schema=None, columns=None, safe=True): > arrow_columns = list() > if schema: > columns = schema.names > if not columns: > return > for column in columns: > arrow_columns.append(pa.array([v[column] if column in v else None for > v in pylist], safe=safe)) > arrow_table = pa.Table.from_arrays(arrow_columns, columns) > if schema: > arrow_table = arrow_table.cast(schema, safe=safe) > return arrow_table > test = from_pylist(test_list, columns=['name' , 'age', 'city', 'birthday', > 'dummy']) > test_schema = pa.schema([ > pa.field('name', pa.string()), > pa.field('age', pa.int16()), > pa.field('city', pa.string()), > pa.field('birthday', pa.timestamp('ms')) > ]) > test2 = from_pylist(test_list, schema=test_schema) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
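The core of the proposed from_pylist() is the per-column comprehension `[v[column] if column in v else None for v in pylist]`. That row-to-column pivot can be sketched as a dependency-free helper (hypothetical name; in the proposal each resulting list is then passed to pa.array(), optionally with a type and the safe flag):

```python
def columns_from_rows(rows, column_names):
    """Pivot a list of row dicts into per-column value lists.

    Missing keys become None, which is what lets Arrow represent true
    nulls instead of relying on pandas-style NaN type promotion.
    """
    return {name: [row.get(name) for row in rows] for name in column_names}
```

Applied to the sample data above, the "city" column comes out as `[None, "San Francisco", None]` — real nulls, with no forced float conversion of the integer "age" column.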
[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee updated ARROW-4032: - Description: Here's a proposal to create a pyarrow.Table.from_pydict() function. Right now only pyarrow.Table.from_pandas() exist and there are inherit problems using Pandas with NULL support for Int(s) and Boolean(s) [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: Sample python code on how this would work. {code:java} import pyarrow as pa from datetime import datetime # convert microseconds to milliseconds. More support for MS in parquet. today = datetime.now() today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000) test_list = [ {"name": "Tom", "age": 10}, {"name": "Mark", "age": 5, "city": "San Francisco"}, {"name": "Pam", "age": 7, "birthday": today} ] def from_pylist(pylist, schema=None, columns=None, safe=True): arrow_columns = list() if schema: columns = schema.names if not columns: return for column in columns: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe)) arrow_table = pa.Table.from_arrays(arrow_columns, columns) if schema: arrow_table = arrow_table.cast(schema, safe=safe) return arrow_table test = from_pylist(test_list, columns=['name' , 'age', 'city', 'birthday', 'dummy']) test_schema = pa.schema([ pa.field('name', pa.string()), pa.field('age', pa.int16()), pa.field('city', pa.string()), pa.field('birthday', pa.timestamp('ms')) ]) test2 = from_pylist(test_list, schema=test_schema) {code} was: Here's a proposal to create a pyarrow.Table.from_pydict() function. 
Right now only pyarrow.Table.from_pandas() exist and there are inherit problems using Pandas with NULL support for Int(s) and Boolean(s) [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: Sample python code on how this would work. {code:java} import pyarrow as pa from datetime import datetime # convert microseconds to milliseconds. More support for MS in parquet. today = datetime.now() today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000) test_list = [ {"name": "Tom", "age": 10}, {"name": "Mark", "age": 5, "city": "San Francisco"}, {"name": "Pam", "age": 7, "birthday": today} ] def from_pydict(pylist, schema=None, columns=None, safe=True): arrow_columns = list() if schema: columns = schema.names if not columns: return for column in columns: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe)) arrow_table = pa.Table.from_arrays(arrow_columns, columns) if schema: arrow_table = arrow_table.cast(schema, safe=safe) return arrow_table test = from_pydict(test_list, columns=['name' , 'age', 'city', 'birthday', 'dummy']) test_schema = pa.schema([ pa.field('name', pa.string()), pa.field('age', pa.int16()), pa.field('city', pa.string()), pa.field('birthday', pa.timestamp('ms')) ]) test2 = from_pydict(test_list, schema=test_schema) {code} > [Python] New pyarrow.Table.from_pydict() function > - > > Key: ARROW-4032 > URL: https://issues.apache.org/jira/browse/ARROW-4032 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: David Lee >Priority: Minor > > Here's a proposal to create a pyarrow.Table.from_pydict() function. 
> Right now only pyarrow.Table.from_pandas() exist and there are inherit > problems using Pandas with NULL support for Int(s) and Boolean(s) > [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] > {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: > Sample python code on how this would work. > > {code:java} > import pyarrow as pa > from datetime import datetime > # convert microseconds to milliseconds. More support for MS in parquet. > today = datetime.now() > today = datetime(today.year, today.month, today.day, today.hour, > today.minute, today.second, today.microsecond - today.microsecond % 1000) > test_list = [ > {"name": "Tom", "age": 10}, > {"name": "Mark", "age": 5, "city": "San Francisco"}, > {"name": "Pam", "age": 7, "birthday": today} > ] > def from_pylist(pylist, schema=None, columns=None, safe=True): > arrow_columns = list() > if schema: > columns = schema.names > if not columns: > return > for column in columns: > arrow_columns.append(pa.array([v[column] if column in v else None for > v in pylist], safe=safe)) > arrow_table =
[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee updated ARROW-4032: - Description: Here's a proposal to create a pyarrow.Table.from_pydict() function. Right now only pyarrow.Table.from_pandas() exist and there are inherit problems using Pandas with NULL support for Int(s) and Boolean(s) [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: Sample python code on how this would work. {code:java} import pyarrow as pa from datetime import datetime # convert microseconds to milliseconds. More support for MS in parquet. today = datetime.now() today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000) test_list = [ {"name": "Tom", "age": 10}, {"name": "Mark", "age": 5, "city": "San Francisco"}, {"name": "Pam", "age": 7, "birthday": today} ] def from_pydict(pylist, schema=None, columns=None, safe=True): arrow_columns = list() if schema: columns = schema.names if not columns: return for column in columns: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe)) arrow_table = pa.Table.from_arrays(arrow_columns, columns) if schema: arrow_table = arrow_table.cast(schema, safe=safe) return arrow_table test = from_pydict(test_list, columns=['name' , 'age', 'city', 'birthday', 'dummy']) test_schema = pa.schema([ pa.field('name', pa.string()), pa.field('age', pa.int16()), pa.field('city', pa.string()), pa.field('birthday', pa.timestamp('ms')) ]) test2 = from_pydict(test_list, schema=test_schema) {code} was: Here's a proposal to create a pyarrow.Table.from_pydict() function. 
Right now only pyarrow.Table.from_pandas() exist and there are inherit problems using Pandas with NULL support for Int(s) and Boolean(s) [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: Sample python code on how this would work. {code:java} import pyarrow as pa from datetime import datetime # convert microseconds to milliseconds. More support for MS in parquet. today = datetime.now() today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000) pylist = [ {"name": "Tom", "age": 10}, {"name": "Mark", "age": 5, "city": "San Francisco"}, {"name": "Pam", "age": 7, "birthday": today} ] def from_pydict(pylist, schema=None, columns=None, safe=True): arrow_columns = list() if schema: columns = schema.names if not columns: return for column in columns: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe)) arrow_table = pa.Table.from_arrays(arrow_columns, columns) if schema: arrow_table = arrow_table.cast(schema, safe=safe) return arrow_table test = from_pydict(pylist, columns=['name' , 'age', 'city', 'birthday', 'dummy']) test_schema = pa.schema([ pa.field('name', pa.string()), pa.field('age', pa.int16()), pa.field('city', pa.string()), pa.field('birthday', pa.timestamp('ms')) ]) test2 = from_pydict(pylist, schema=test_schema) {code} > [Python] New pyarrow.Table.from_pydict() function > - > > Key: ARROW-4032 > URL: https://issues.apache.org/jira/browse/ARROW-4032 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: David Lee >Priority: Minor > > Here's a proposal to create a pyarrow.Table.from_pydict() function. 
> Right now only pyarrow.Table.from_pandas() exist and there are inherit > problems using Pandas with NULL support for Int(s) and Boolean(s) > [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] > {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: > Sample python code on how this would work. > > {code:java} > import pyarrow as pa > from datetime import datetime > # convert microseconds to milliseconds. More support for MS in parquet. > today = datetime.now() > today = datetime(today.year, today.month, today.day, today.hour, > today.minute, today.second, today.microsecond - today.microsecond % 1000) > test_list = [ > {"name": "Tom", "age": 10}, > {"name": "Mark", "age": 5, "city": "San Francisco"}, > {"name": "Pam", "age": 7, "birthday": today} > ] > def from_pydict(pylist, schema=None, columns=None, safe=True): > arrow_columns = list() > if schema: > columns = schema.names > if not columns: > return > for column in columns: > arrow_columns.append(pa.array([v[column] if column in v else None for > v in pylist], safe=safe)) > arrow_table =
[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee updated ARROW-4032: - Description: Here's a proposal to create a pyarrow.Table.from_pydict() function. Right now only pyarrow.Table.from_pandas() exist and there are inherit problems using Pandas with NULL support for Int(s) and Boolean(s) [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: Sample python code on how this would work. {code:java} import pyarrow as pa from datetime import datetime # convert microseconds to milliseconds. More support for MS in parquet. today = datetime.now() today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000) pylist = [ {"name": "Tom", "age": 10}, {"name": "Mark", "age": 5, "city": "San Francisco"}, {"name": "Pam", "age": 7, "birthday": today} ] def from_pydict(pylist, schema=None, columns=None, safe=True): arrow_columns = list() if schema: columns = schema.names if not columns: return for column in columns: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe)) arrow_table = pa.Table.from_arrays(arrow_columns, columns) if schema: arrow_table = arrow_table.cast(schema, safe=safe) return arrow_table test = from_pydict(pylist, columns=['name' , 'age', 'city', 'birthday', 'dummy']) test_schema = pa.schema([ pa.field('name', pa.string()), pa.field('age', pa.int16()), pa.field('city', pa.string()), pa.field('birthday', pa.timestamp('ms')) ]) test2 = from_pydict(pylist, schema=test_schema) {code} was: Here's a proposal to create a pyarrow.Table.from_pydict() function. 
Right now only pyarrow.Table.from_pandas() exist and there are inherit problems using Pandas with NULL support for Int(s) and Boolean(s) [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: Sample python code on how this would work. {code:java} import pyarrow as pa from datetime import datetime # convert microseconds to milliseconds. More support for MS in parquet. today = datetime.now() today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000) pylist = [ {"name": "Tom", "age": 10}, {"name": "Mark", "age": 5, "city": "San Francisco"}, {"name": "Pam", "age": 7, "birthday": today} ] def from_pydict(pylist, schema=None, columns=None, safe=True): arrow_columns = list() if schema: columns = schema.names if not columns: return for column in columns: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist])) arrow_table = pa.Table.from_arrays(arrow_columns, columns) if schema: arrow_table = arrow_table.cast(schema, safe=safe) return arrow_table test = from_pydict(pylist, columns=['name' , 'age', 'city', 'birthday', 'dummy']) test_schema = pa.schema([ pa.field('name', pa.string()), pa.field('age', pa.int16()), pa.field('city', pa.string()), pa.field('birthday', pa.timestamp('ms')) ]) test2 = from_pydict(pylist, schema=test_schema) {code} > [Python] New pyarrow.Table.from_pydict() function > - > > Key: ARROW-4032 > URL: https://issues.apache.org/jira/browse/ARROW-4032 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: David Lee >Priority: Minor > > Here's a proposal to create a pyarrow.Table.from_pydict() function. 
> Right now only pyarrow.Table.from_pandas() exist and there are inherit > problems using Pandas with NULL support for Int(s) and Boolean(s) > [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] > {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: > Sample python code on how this would work. > > {code:java} > import pyarrow as pa > from datetime import datetime > # convert microseconds to milliseconds. More support for MS in parquet. > today = datetime.now() > today = datetime(today.year, today.month, today.day, today.hour, > today.minute, today.second, today.microsecond - today.microsecond % 1000) > pylist = [ > {"name": "Tom", "age": 10}, > {"name": "Mark", "age": 5, "city": "San Francisco"}, > {"name": "Pam", "age": 7, "birthday": today} > ] > def from_pydict(pylist, schema=None, columns=None, safe=True): > arrow_columns = list() > if schema: > columns = schema.names > if not columns: > return > for column in columns: > arrow_columns.append(pa.array([v[column] if column in v else None for > v in pylist], safe=safe)) > arrow_table = pa.Table.from_arrays(arrow_columns, columns) > if
[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee updated ARROW-4032: - Description: Here's a proposal to create a pyarrow.Table.from_pydict() function. Right now only pyarrow.Table.from_pandas() exist and there are inherit problems using Pandas with NULL support for Int(s) and Boolean(s) [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: Sample python code on how this would work. {code:java} import pyarrow as pa from datetime import datetime # convert microseconds to milliseconds. More support for MS in parquet. today = datetime.now() today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000) pylist = [ {"name": "Tom", "age": 10}, {"name": "Mark", "age": 5, "city": "San Francisco"}, {"name": "Pam", "age": 7, "birthday": today} ] def from_pydict(pylist, schema=None, columns=None, safe=True): arrow_columns = list() if schema: columns = schema.names if not columns: return for column in columns: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist])) arrow_table = pa.Table.from_arrays(arrow_columns, columns) if schema: arrow_table = arrow_table.cast(schema, safe=safe) return arrow_table test = from_pydict(pylist, columns=['name' , 'age', 'city', 'birthday', 'dummy']) test_schema = pa.schema([ pa.field('name', pa.string()), pa.field('age', pa.int16()), pa.field('city', pa.string()), pa.field('birthday', pa.timestamp('ms')) ]) test2 = from_pydict(pylist, schema=test_schema) {code} was: Here's a proposal to create a pyarrow.Table.from_pydict() function. 
Right now only pyarrow.Table.from_pandas() exist and there are inherit problems using Pandas with NULL support for Int(s) and Boolean(s) [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: Sample python code on how this would work. {code:java} import pyarrow as pa from datetime import datetime # convert microseconds to milliseconds. More support for MS in parquet. today = datetime.now() today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000) pylist = [ {"name": "Tom", "age": 10}, {"name": "Mark", "age": 5, "city": "San Francisco"}, {"name": "Pam", "age": 7, "birthday": today} ] def from_pydict(pylist, schema=None, columns=None, safe=True): arrow_columns = list() if schema: columns = schema.names if not columns: return for column in columns: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist])) arrow_table = pa.Table.from_arrays(arrow_columns, columns) if schema: arrow_table = arrow_table.cast(schema, safe=safe) return arrow_table test = from_pydict(pylist, columns=['name' , 'age', 'city', 'birthday', 'dummy']) test_schema = pa.schema([ pa.field('name', pa.string()), pa.field('age', pa.int16()), pa.field('city', pa.string()), pa.field('birthday', pa.timestamp('ms')) ]) test2 = from_pydict(pylist, schema=test_schema) {code} > [Python] New pyarrow.Table.from_pydict() function > - > > Key: ARROW-4032 > URL: https://issues.apache.org/jira/browse/ARROW-4032 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: David Lee >Priority: Minor > > Here's a proposal to create a pyarrow.Table.from_pydict() function. 
> Right now only pyarrow.Table.from_pandas() exist and there are inherit > problems using Pandas with NULL support for Int(s) and Boolean(s) > [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] > {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: > Sample python code on how this would work. > > {code:java} > import pyarrow as pa > from datetime import datetime > # convert microseconds to milliseconds. More support for MS in parquet. > today = datetime.now() > today = datetime(today.year, today.month, today.day, today.hour, > today.minute, today.second, today.microsecond - today.microsecond % 1000) > pylist = [ > {"name": "Tom", "age": 10}, > {"name": "Mark", "age": 5, "city": "San Francisco"}, > {"name": "Pam", "age": 7, "birthday": today} > ] > def from_pydict(pylist, schema=None, columns=None, safe=True): > arrow_columns = list() > if schema: > columns = schema.names > if not columns: > return > for column in columns: > arrow_columns.append(pa.array([v[column] if column in v else None for > v in pylist])) > arrow_table = pa.Table.from_arrays(arrow_columns, columns) > if schema: > arrow_table = arrow_table.cast(schema, safe=safe) > return
[jira] [Commented] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721745#comment-16721745 ] David Lee commented on ARROW-4032: -- Updated the sample code to include Schema and Safe options.. Passing in a schema will allow conversions from microseconds to milliseconds. > [Python] New pyarrow.Table.from_pydict() function > - > > Key: ARROW-4032 > URL: https://issues.apache.org/jira/browse/ARROW-4032 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: David Lee >Priority: Minor > > Here's a proposal to create a pyarrow.Table.from_pydict() function. > Right now only pyarrow.Table.from_pandas() exist and there are inherit > problems using Pandas with NULL support for Int(s) and Boolean(s) > [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] > {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: > Sample python code on how this would work. > > {code:java} > import pyarrow as pa > from datetime import datetime > # convert microseconds to milliseconds. More support for MS in parquet. 
> today = datetime.now() > today = datetime(today.year, today.month, today.day, today.hour, > today.minute, today.second, today.microsecond - today.microsecond % 1000) > pylist = [ > {"name": "Tom", "age": 10}, > {"name": "Mark", "age": 5, "city": "San Francisco"}, > {"name": "Pam", "age": 7, "birthday": today} > ] > def from_pydict(pylist, schema=None, columns=None, safe=True): > arrow_columns = list() > if schema: > columns = schema.names > if not columns: > return > for column in columns: > arrow_columns.append(pa.array([v[column] if column in v else None for v in > pylist])) > arrow_table = pa.Table.from_arrays(arrow_columns, columns) > if schema: > arrow_table = arrow_table.cast(schema, safe=safe) > return arrow_table > test = from_pydict(pylist, columns=['name' , 'age', 'city', 'birthday', > 'dummy']) > test_schema = pa.schema([ > pa.field('name', pa.string()), > pa.field('age', pa.int16()), > pa.field('city', pa.string()), > pa.field('birthday', pa.timestamp('ms')) > ]) > test2 = from_pydict(pylist, schema=test_schema) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
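The microsecond-to-millisecond conversion in the sample code (`today.microsecond - today.microsecond % 1000`) zeroes the sub-millisecond digits so the value fits pa.timestamp('ms') without a lossy cast. As a small standalone helper (hypothetical name, for illustration):

```python
from datetime import datetime

def truncate_to_millis(dt):
    """Drop sub-millisecond precision: e.g. 123456 us becomes 123000 us."""
    return dt.replace(microsecond=dt.microsecond - dt.microsecond % 1000)
```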
[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee updated ARROW-4032: - Description: Here's a proposal to create a pyarrow.Table.from_pydict() function. Right now only pyarrow.Table.from_pandas() exist and there are inherit problems using Pandas with NULL support for Int(s) and Boolean(s) [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: Sample python code on how this would work. {code:java} import pyarrow as pa from datetime import datetime # convert microseconds to milliseconds. More support for MS in parquet. today = datetime.now() today = datetime(today.year, today.month, today.day, today.hour, today.minute, today.second, today.microsecond - today.microsecond % 1000) pylist = [ {"name": "Tom", "age": 10}, {"name": "Mark", "age": 5, "city": "San Francisco"}, {"name": "Pam", "age": 7, "birthday": today} ] def from_pydict(pylist, schema=None, columns=None, safe=True): arrow_columns = list() if schema: columns = schema.names if not columns: return for column in columns: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist])) arrow_table = pa.Table.from_arrays(arrow_columns, columns) if schema: arrow_table = arrow_table.cast(schema, safe=safe) return arrow_table test = from_pydict(pylist, columns=['name' , 'age', 'city', 'birthday', 'dummy']) test_schema = pa.schema([ pa.field('name', pa.string()), pa.field('age', pa.int16()), pa.field('city', pa.string()), pa.field('birthday', pa.timestamp('ms')) ]) test2 = from_pydict(pylist, schema=test_schema) {code} was: Here's a proposal to create a pyarrow.Table.from_pydict() function. 
Right now only pyarrow.Table.from_pandas() exist and there are inherit problems using Pandas with NULL support for Int(s) and Boolean(s) [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: Sample python code on how this would work. {code:java} import pyarrow as pa from datetime import datetime test_list = [ {"name": "Tom", "age": 10}, {"name": "Mark", "age": 5, "city": "San Francisco"}, {"name": "Pam", "age": 7, "birthday": datetime.now()} ] def from_pydict(pylist, columns): arrow_columns = list() for column in columns: arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist])) arrow_table = pa.Table.from_arrays(arrow_columns, columns) return arrow_table test = from_pydict(test_list, ['name' , 'age', 'city', 'birthday', 'dummy']) {code} Additional work would be needed to pass in a schema object if you want to refine data types further. I think the existing code from from_pandas() to do that would work. > [Python] New pyarrow.Table.from_pydict() function > - > > Key: ARROW-4032 > URL: https://issues.apache.org/jira/browse/ARROW-4032 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: David Lee >Priority: Minor > > Here's a proposal to create a pyarrow.Table.from_pydict() function. > Right now only pyarrow.Table.from_pandas() exist and there are inherit > problems using Pandas with NULL support for Int(s) and Boolean(s) > [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] > {{NaN}}, Integer {{NA}} values and {{NA}} type promotions: > Sample python code on how this would work. > > {code:java} > import pyarrow as pa > from datetime import datetime > # convert microseconds to milliseconds. More support for MS in parquet. 
> today = datetime.now() > today = datetime(today.year, today.month, today.day, today.hour, > today.minute, today.second, today.microsecond - today.microsecond % 1000) > pylist = [ > {"name": "Tom", "age": 10}, > {"name": "Mark", "age": 5, "city": "San Francisco"}, > {"name": "Pam", "age": 7, "birthday": today} > ] > def from_pydict(pylist, schema=None, columns=None, safe=True): > arrow_columns = list() > if schema: > columns = schema.names > if not columns: > return > for column in columns: > arrow_columns.append(pa.array([v[column] if column in v else None for v in > pylist])) > arrow_table = pa.Table.from_arrays(arrow_columns, columns) > if schema: > arrow_table = arrow_table.cast(schema, safe=safe) > return arrow_table > test = from_pydict(pylist, columns=['name' , 'age', 'city', 'birthday', > 'dummy']) > test_schema = pa.schema([ > pa.field('name', pa.string()), > pa.field('age', pa.int16()), > pa.field('city', pa.string()), > pa.field('birthday', pa.timestamp('ms')) > ]) > test2 = from_pydict(pylist, schema=test_schema) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee updated ARROW-4032:
-
Description:
Here's a proposal to create a pyarrow.Table.from_pydict() function. Right now only pyarrow.Table.from_pandas() exists, and there are inherent problems using Pandas with NULL support for integers and booleans [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] ({{NaN}}, Integer {{NA}} values and {{NA}} type promotions). Sample Python code showing how this would work:
{code:java}
import pyarrow as pa
from datetime import datetime

test_list = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": datetime.now()}
]

def from_pydict(pylist, columns):
    arrow_columns = list()
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    return arrow_table

test = from_pydict(test_list, ['name', 'age', 'city', 'birthday', 'dummy'])
{code}
was:
Here's a proposal to create a pyarrow.Table.from_pydict() function. Right now only pyarrow.Table.from_pandas() exists, and there are inherent problems using Pandas with NULL support for integers and booleans [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] ({{NaN}}, Integer {{NA}} values and {{NA}} type promotions). Sample Python code showing how this would work:
{code:java}
import pyarrow as pa
from datetime import datetime

pylist = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": datetime.now()}
]

def from_pydict(pylist, columns):
    arrow_columns = list()
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    return arrow_table

test = from_pydict(pylist, ['name', 'age', 'city', 'birthday', 'dummy'])
{code}
> [Python] New pyarrow.Table.from_pydict() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
> Issue Type: Task
> Components: Python
> Reporter: David Lee
> Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent problems using Pandas with NULL support for integers and booleans
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> ({{NaN}}, Integer {{NA}} values and {{NA}} type promotions).
> Sample Python code showing how this would work:
>
> {code:java}
> import pyarrow as pa
> from datetime import datetime
> test_list = [
>     {"name": "Tom", "age": 10},
>     {"name": "Mark", "age": 5, "city": "San Francisco"},
>     {"name": "Pam", "age": 7, "birthday": datetime.now()}
> ]
> def from_pydict(pylist, columns):
>     arrow_columns = list()
>     for column in columns:
>         arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     return arrow_table
> test = from_pydict(test_list, ['name', 'age', 'city', 'birthday', 'dummy'])
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lee updated ARROW-4032:
-
Description:
Here's a proposal to create a pyarrow.Table.from_pydict() function. Right now only pyarrow.Table.from_pandas() exists, and there are inherent problems using Pandas with NULL support for integers and booleans [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] ({{NaN}}, Integer {{NA}} values and {{NA}} type promotions). Sample Python code showing how this would work:
{code:java}
import pyarrow as pa
from datetime import datetime

test_list = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": datetime.now()}
]

def from_pydict(pylist, columns):
    arrow_columns = list()
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    return arrow_table

test = from_pydict(test_list, ['name', 'age', 'city', 'birthday', 'dummy'])
{code}
Additional work would be needed to pass in a schema object if you want to refine data types further. I think the existing from_pandas() code for that would work.
was:
Here's a proposal to create a pyarrow.Table.from_pydict() function. Right now only pyarrow.Table.from_pandas() exists, and there are inherent problems using Pandas with NULL support for integers and booleans [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] ({{NaN}}, Integer {{NA}} values and {{NA}} type promotions). Sample Python code showing how this would work:
{code:java}
import pyarrow as pa
from datetime import datetime

test_list = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": datetime.now()}
]

def from_pydict(pylist, columns):
    arrow_columns = list()
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    return arrow_table

test = from_pydict(test_list, ['name', 'age', 'city', 'birthday', 'dummy'])
{code}
> [Python] New pyarrow.Table.from_pydict() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
> Issue Type: Task
> Components: Python
> Reporter: David Lee
> Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent problems using Pandas with NULL support for integers and booleans
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> ({{NaN}}, Integer {{NA}} values and {{NA}} type promotions).
> Sample Python code showing how this would work:
>
> {code:java}
> import pyarrow as pa
> from datetime import datetime
> test_list = [
>     {"name": "Tom", "age": 10},
>     {"name": "Mark", "age": 5, "city": "San Francisco"},
>     {"name": "Pam", "age": 7, "birthday": datetime.now()}
> ]
> def from_pydict(pylist, columns):
>     arrow_columns = list()
>     for column in columns:
>         arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     return arrow_table
> test = from_pydict(test_list, ['name', 'age', 'city', 'birthday', 'dummy'])
> {code}
> Additional work would be needed to pass in a schema object if you want to refine data types further. I think the existing from_pandas() code for that would work.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function
[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721727#comment-16721727 ] Wes McKinney commented on ARROW-4032:
-
You can already do {{pa.array(pylist)}}. So if we had a function to convert a StructArray to a Table, this would mostly do what you're describing. This was partly the intent of ARROW-40
> [Python] New pyarrow.Table.from_pydict() function
> -
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
> Issue Type: Task
> Components: Python
> Reporter: David Lee
> Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pydict() function.
> Right now only pyarrow.Table.from_pandas() exists and there are inherent problems using Pandas with NULL support for integers and booleans
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> ({{NaN}}, Integer {{NA}} values and {{NA}} type promotions).
> Sample Python code showing how this would work:
>
> {code:java}
> import pyarrow as pa
> from datetime import datetime
> pylist = [
>     {"name": "Tom", "age": 10},
>     {"name": "Mark", "age": 5, "city": "San Francisco"},
>     {"name": "Pam", "age": 7, "birthday": datetime.now()}
> ]
> def from_pydict(pylist, columns):
>     arrow_columns = list()
>     for column in columns:
>         arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     return arrow_table
> test = from_pydict(pylist, ['name', 'age', 'city', 'birthday', 'dummy'])
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
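Wes's point is that {{pa.array(pylist)}} already yields a StructArray (an "array of structs"), so the missing piece is only exposing its child arrays as table columns (a "struct of arrays"). A hypothetical pure-Python analogy of that conversion, with the field set inferred from the data the way pa.array() infers one struct type for all records:

```python
# Convert an "array of structs" (list of dicts) into a "struct of arrays"
# (dict of equal-length lists), analogous to StructArray -> Table.
def structs_to_columns(records):
    # Infer the union of field names, preserving first-seen order,
    # the way pa.array() infers a single struct type for all records.
    fields = []
    for rec in records:
        for key in rec:
            if key not in fields:
                fields.append(key)
    return {f: [rec.get(f) for rec in records] for f in fields}

records = [{"a": 1}, {"a": 2, "b": "x"}]
print(structs_to_columns(records))  # {'a': [1, 2], 'b': [None, 'x']}
```

In pyarrow the equivalent would flatten the StructArray's children and pass them to pa.Table.from_arrays().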
[jira] [Created] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function
David Lee created ARROW-4032:
Summary: [Python] New pyarrow.Table.from_pydict() function
Key: ARROW-4032
URL: https://issues.apache.org/jira/browse/ARROW-4032
Project: Apache Arrow
Issue Type: Task
Components: Python
Reporter: David Lee

Here's a proposal to create a pyarrow.Table.from_pydict() function. Right now only pyarrow.Table.from_pandas() exists, and there are inherent problems using Pandas with NULL support for integers and booleans [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html] ({{NaN}}, Integer {{NA}} values and {{NA}} type promotions). Sample Python code showing how this would work:
{code:java}
import pyarrow as pa
from datetime import datetime

pylist = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": datetime.now()}
]

def from_pydict(pylist, columns):
    arrow_columns = list()
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    return arrow_table

test = from_pydict(pylist, ['name', 'age', 'city', 'birthday', 'dummy'])
{code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4031) [C++] Refactor ArrayBuilder bitmap logic into TypedBufferBuilder
[ https://issues.apache.org/jira/browse/ARROW-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721715#comment-16721715 ] Wes McKinney commented on ARROW-4031: - Makes sense. If you do start working on this you may want to hold off until https://github.com/apache/arrow/pull/3171 is merged (since a bunch of the code is moved around) > [C++] Refactor ArrayBuilder bitmap logic into TypedBufferBuilder > -- > > Key: ARROW-4031 > URL: https://issues.apache.org/jira/browse/ARROW-4031 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Benjamin Kietzman >Priority: Minor > > It would be useful to have a specialization of TypedBufferBuilder to simplify > building buffers of bits. This could then be utilized by ArrayBuilder (for > the null bitmap) and BooleanBuilder (for values) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
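ARROW-4031 concerns C++ internals, but the bitmap logic being factored into TypedBufferBuilder is easy to sketch: Arrow validity bitmaps pack one bit per value, least-significant bit first within each byte. A hedged pure-Python illustration of that packing (not the C++ TypedBufferBuilder API, just the bit layout it would manage):

```python
# Pack a sequence of validity flags into an Arrow-style bitmap:
# one bit per value, least-significant bit first within each byte.
def pack_bitmap(flags):
    buf = bytearray((len(flags) + 7) // 8)  # round up to whole bytes
    for i, valid in enumerate(flags):
        if valid:
            buf[i // 8] |= 1 << (i % 8)
    return bytes(buf)

# Values 0, 2 and 3 valid -> bits 0, 2, 3 set -> 0b00001101 == 0x0d.
print(pack_bitmap([True, False, True, True]))  # b'\r'
```

The same packing would serve both the null bitmap in ArrayBuilder and the value buffer in BooleanBuilder, which is why factoring it out makes sense.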
[jira] [Resolved] (ARROW-3184) [C++] Add modular build targets, "all" target, and require explicit target when invoking make or ninja
[ https://issues.apache.org/jira/browse/ARROW-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3184. - Resolution: Fixed Issue resolved by pull request 3172 [https://github.com/apache/arrow/pull/3172] > [C++] Add modular build targets, "all" target, and require explicit target > when invoking make or ninja > -- > > Key: ARROW-3184 > URL: https://issues.apache.org/jira/browse/ARROW-3184 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.12.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > This will make it easier to build and install only part of the project -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4031) [C++] Refactor ArrayBuilder bitmap logic into TypedBufferBuilder
Benjamin Kietzman created ARROW-4031: Summary: [C++] Refactor ArrayBuilder bitmap logic into TypedBufferBuilder Key: ARROW-4031 URL: https://issues.apache.org/jira/browse/ARROW-4031 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Benjamin Kietzman It would be useful to have a specialization of TypedBufferBuilder to simplify building buffers of bits. This could then be utilized by ArrayBuilder (for the null bitmap) and BooleanBuilder (for values) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-3994) [C++] Remove ARROW_GANDIVA_BUILD_TESTS option
[ https://issues.apache.org/jira/browse/ARROW-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3994. - Resolution: Fixed Resolved in https://github.com/apache/arrow/commit/804502f941f808583e9f7043e203533de738d577 > [C++] Remove ARROW_GANDIVA_BUILD_TESTS option > - > > Key: ARROW-3994 > URL: https://issues.apache.org/jira/browse/ARROW-3994 > Project: Apache Arrow > Issue Type: Improvement > Components: Gandiva >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 0.12.0 > > > This is not needed now that both libraries and tests are tied to the same > "gandiva" build target and label. So {{ninja gandiva && ctest -L gandiva}} > will build only the relevant targets > Follow up to ARROW-3988 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4030) [CI] Use travis_terminate to halt builds when a step fails
[ https://issues.apache.org/jira/browse/ARROW-4030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721676#comment-16721676 ] Francois Saint-Jacques commented on ARROW-4030:
---
It's apparently worse than this. According to comments on the linked issue, Travis will mark your build as failed only if the last script returned non-zero (essentially behaving like a pipe). I'd recommend moving to the Rust technique: https://github.com/rust-lang/rust/pull/12513/files
> [CI] Use travis_terminate to halt builds when a step fails
> --
>
> Key: ARROW-4030
> URL: https://issues.apache.org/jira/browse/ARROW-4030
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Continuous Integration
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 0.12.0
>
> I noticed that Travis CI will soldier onward if a step in its {{script:}} block fails. This wastes build time when there is an error somewhere early on in the testing process
> For example, in the main C++ build, if {{travis_script_cpp.sh}} fails, then the subsequent steps will continue.
> It seems the way to deal with this is to add {{|| travis_terminate 1}} to lines that can fail
> see https://medium.com/@manjula.cse/how-to-stop-the-execution-of-travis-pipeline-if-script-exits-with-an-error-f0e5a43206bf
> I also found this discussion https://github.com/travis-ci/travis-ci/issues/1066
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4030) [CI] Use travis_terminate to halt builds when a step fails
Wes McKinney created ARROW-4030: --- Summary: [CI] Use travis_terminate to halt builds when a step fails Key: ARROW-4030 URL: https://issues.apache.org/jira/browse/ARROW-4030 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Wes McKinney Fix For: 0.12.0 I noticed that Travis CI will soldier onward if a step in its {{script:}} block fails. This wastes build time when there is an error somewhere early on in the testing process For example, in the main C++ build, if {{travis_script_cpp.sh}} fails, then the subsequent steps will continue. It seems the way to deal with this is to add {{|| travis_terminate 1}} to lines that can fail see https://medium.com/@manjula.cse/how-to-stop-the-execution-of-travis-pipeline-if-script-exits-with-an-error-f0e5a43206bf I also found this discussion https://github.com/travis-ci/travis-ci/issues/1066 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4028) [Rust] Merge parquet-rs codebase
[ https://issues.apache.org/jira/browse/ARROW-4028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4028: Fix Version/s: 0.12.0 > [Rust] Merge parquet-rs codebase > > > Key: ARROW-4028 > URL: https://issues.apache.org/jira/browse/ARROW-4028 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 0.12.0 > > > Initial donation of [parquet-rs|https://github.com/sunchao/parquet-rs], an > Apache Parquet implementation in Rust. This subjects to ASF IP clearance. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4007) [Java][Plasma] Plasma JNI tests failing
[ https://issues.apache.org/jira/browse/ARROW-4007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4007: Fix Version/s: (was: 0.12.0) 0.13.0 > [Java][Plasma] Plasma JNI tests failing > --- > > Key: ARROW-4007 > URL: https://issues.apache.org/jira/browse/ARROW-4007 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Wes McKinney >Priority: Critical > Fix For: 0.13.0 > > > see https://travis-ci.org/apache/arrow/jobs/466819720 > {code} > [INFO] Total time: 10.633 s > [INFO] Finished at: 2018-12-12T03:56:33Z > [INFO] Final Memory: 39M/426M > [INFO] > > linux-vdso.so.1 => (0x7ffcff172000) > librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7f99ecd9e000) > libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x7f99ecb85000) > libboost_system.so.1.54.0 => > /usr/lib/x86_64-linux-gnu/libboost_system.so.1.54.0 (0x7f99ec981000) > libboost_filesystem.so.1.54.0 => > /usr/lib/x86_64-linux-gnu/libboost_filesystem.so.1.54.0 (0x7f99ec76b000) > libboost_regex.so.1.54.0 => > /usr/lib/x86_64-linux-gnu/libboost_regex.so.1.54.0 (0x7f99ec464000) > libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 > (0x7f99ec246000) > libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 > (0x7f99ebf3) > libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f99ebc2a000) > libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 > (0x7f99eba12000) > libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f99eb649000) > libicuuc.so.52 => /usr/lib/x86_64-linux-gnu/libicuuc.so.52 > (0x7f99eb2d) > libicui18n.so.52 => /usr/lib/x86_64-linux-gnu/libicui18n.so.52 > (0x7f99eaec9000) > /lib64/ld-linux-x86-64.so.2 (0x7f99ecfa6000) > libicudata.so.52 => /usr/lib/x86_64-linux-gnu/libicudata.so.52 > (0x7f99e965c000) > libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x7f99e9458000) > /home/travis/build/apache/arrow/cpp/src/plasma/store.cc:985: Allowing the > Plasma store to use up to 0.01GB of memory. 
> /home/travis/build/apache/arrow/cpp/src/plasma/store.cc:1015: Starting object > store with directory /dev/shm and huge page support disabled > Start process 317574433 OK, cmd = > [/home/travis/build/apache/arrow/cpp-install/bin/plasma_store_server -s > /tmp/store89237 -m 1000] > Start object store success > Start test. > Plasma java client put test success. > Plasma java client get single object test success. > Plasma java client get multi-object test success. > ObjectId [B@34c45dca error at PlasmaClient put > java.lang.Exception: An object with this ID already exists in the plasma > store. > at org.apache.arrow.plasma.PlasmaClientJNI.create(Native Method) > at org.apache.arrow.plasma.PlasmaClient.put(PlasmaClient.java:51) > at > org.apache.arrow.plasma.PlasmaClientTest.doTest(PlasmaClientTest.java:145) > at > org.apache.arrow.plasma.PlasmaClientTest.main(PlasmaClientTest.java:220) > Plasma java client put same object twice exception test success. > Plasma java client hash test success. > Plasma java client contains test success. > Plasma java client metadata get test success. > Plasma java client delete test success. > Kill plasma store process forcely > All test success. > ~/build/apache/arrow > {code} > I didn't see any related code changes -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4029) [C++] Define and document naming convention for internal / private header files not to be installed
Wes McKinney created ARROW-4029: --- Summary: [C++] Define and document naming convention for internal / private header files not to be installed Key: ARROW-4029 URL: https://issues.apache.org/jira/browse/ARROW-4029 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 0.12.0 The purpose of this is so that a common {{ARROW_INSTALL_PUBLIC_HEADERS}} can recognize and exclude any file that is non-public from installation. see discussion on https://github.com/apache/arrow/pull/3172 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3449) [C++] Support CMake 3.2 for "out of the box" builds
[ https://issues.apache.org/jira/browse/ARROW-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721533#comment-16721533 ] Wes McKinney commented on ARROW-3449: - Per discussion on https://github.com/apache/arrow/pull/3172, I changed the default for {{ARROW_GANDIVA_JAVA}} to {{OFF}}. If that's the only component that requires a newer CMake, we might simply punt on dealing with the UseJava.cmake issue and simply ask that people install the newest CMake if they are building that part of the project (which already has a steep list of dependencies, so CMake on top of that is not much more to install) > [C++] Support CMake 3.2 for "out of the box" builds > --- > > Key: ARROW-3449 > URL: https://issues.apache.org/jira/browse/ARROW-3449 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Francois Saint-Jacques >Priority: Major > Fix For: 0.12.0 > > > As reported in the 0.11.0 RC1 release vote, some of our dependencies (like > gbenchmark) do not build out of the box with CMake 3.2 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2026) [Python] Cast all timestamp resolutions to INT96 use_deprecated_int96_timestamps=True
[ https://issues.apache.org/jira/browse/ARROW-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2026: -- Labels: c++ parquet pull-request-available redshift timestamps (was: c++ parquet redshift timestamps) > [Python] Cast all timestamp resolutions to INT96 > use_deprecated_int96_timestamps=True > - > > Key: ARROW-2026 > URL: https://issues.apache.org/jira/browse/ARROW-2026 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 > Environment: OS: Mac OS X 10.13.2 > Python: 3.6.4 > PyArrow: 0.8.0 >Reporter: Diego Argueta >Assignee: Francois Saint-Jacques >Priority: Major > Labels: c++, parquet, pull-request-available, redshift, > timestamps > Fix For: 0.12.0 > > > When writing to a Parquet file, if `use_deprecated_int96_timestamps` is True, > timestamps are only written as 96-bit integers if the timestamp has > nanosecond resolution. This is a problem because Amazon Redshift timestamps > only have microsecond resolution but require them to be stored in 96-bit > format in Parquet files. > I'd expect the use_deprecated_int96_timestamps flag to cause _all_ timestamps > to be written as 96 bits, regardless of resolution. If this is a deliberate > design decision, it'd be immensely helpful if it were explicitly documented > as part of the argument. > > To reproduce: > > 1. Create a table with a timestamp having microsecond or millisecond > resolution, and save it to a Parquet file. Be sure to set > `use_deprecated_int96_timestamps` to True. 
> > {code:java} > import datetime > import pyarrow > from pyarrow import parquet > schema = pyarrow.schema([ > pyarrow.field('last_updated', pyarrow.timestamp('us')), > ]) > data = [ > pyarrow.array([datetime.datetime.now()], pyarrow.timestamp('us')), > ] > table = pyarrow.Table.from_arrays(data, ['last_updated']) > with open('test_file.parquet', 'wb') as fdesc: > parquet.write_table(table, fdesc, > use_deprecated_int96_timestamps=True) > {code} > > 2. Inspect the file. I used parquet-tools: > > {noformat} > dak@tux ~ $ parquet-tools meta test_file.parquet > file: file:/Users/dak/test_file.parquet > creator: parquet-cpp version 1.3.2-SNAPSHOT > file schema: schema > > last_updated: OPTIONAL INT64 O:TIMESTAMP_MICROS R:0 D:1 > row group 1: RC:1 TS:76 OFFSET:4 > > last_updated: INT64 SNAPPY DO:4 FPO:28 SZ:76/72/0.95 VC:1 > ENC:PLAIN,PLAIN_DICTIONARY,RLE{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
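The ARROW-2026 mismatch comes down to resolution coercion: a microsecond timestamp must be truncated (or down-cast by the writer) before it fits a coarser unit. A small pure-Python sketch of the millisecond truncation used elsewhere in this thread (the `microsecond - microsecond % 1000` trick); it illustrates only the arithmetic, not parquet-cpp's INT96 encoding:

```python
from datetime import datetime

# Truncate a datetime to millisecond resolution by zeroing the
# sub-millisecond part of the microsecond field.
def truncate_to_ms(dt):
    return dt.replace(microsecond=dt.microsecond - dt.microsecond % 1000)

ts = datetime(2018, 12, 15, 3, 58, 1, 123456)
print(truncate_to_ms(ts))  # 2018-12-15 03:58:01.123000
```

Truncating up front is lossy but explicit; the request in this issue is instead that the writer honor use_deprecated_int96_timestamps for every resolution rather than only nanoseconds.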
[jira] [Commented] (ARROW-4009) [CI] Run Valgrind and C++ code coverage in different builds
[ https://issues.apache.org/jira/browse/ARROW-4009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721524#comment-16721524 ] Wes McKinney commented on ARROW-4009:
-
I agree that Valgrind provides useful insights. It can find things (like memory leaks) that ASAN does not. ASAN does require an up-to-date clang, indeed.
Luckily, because of the way we manage memory, leaks in the C++ code are rare, which has been nice.
> [CI] Run Valgrind and C++ code coverage in different builds
> --
>
> Key: ARROW-4009
> URL: https://issues.apache.org/jira/browse/ARROW-4009
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Continuous Integration
> Affects Versions: 0.11.1
> Reporter: Antoine Pitrou
> Priority: Major
>
> Currently, we run Valgrind on a coverage-enabled C++ build on Travis-CI. This means the slowness of Valgrind acts as a multiplier of the overhead of outputting coverage information using the instrumentation added by the compiler.
> Instead we should probably emit C++ (and Python) coverage information in a different Travis-CI build without Valgrind enabled.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-4015) [Plasma] remove legacy interfaces for plasma manager
[ https://issues.apache.org/jira/browse/ARROW-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philipp Moritz resolved ARROW-4015. --- Resolution: Fixed Fix Version/s: 0.12.0 Issue resolved by pull request 3167 [https://github.com/apache/arrow/pull/3167] > [Plasma] remove legacy interfaces for plasma manager > > > Key: ARROW-4015 > URL: https://issues.apache.org/jira/browse/ARROW-4015 > Project: Apache Arrow > Issue Type: Improvement > Components: Plasma (C++) >Reporter: Zhijun Fu >Assignee: Zhijun Fu >Priority: Minor > Labels: pull-request-available > Fix For: 0.12.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > [https://github.com/apache/arrow/issues/3154] > In legacy ray, interacting with remote plasma stores is done via plasma > manager, which is part of ray, and plasma has a few interfaces to support it > - namely Fetch() and Wait(). > Currently the legacy ray code has already been removed, and the new raylet > uses object manager to interface with remote machine, and these legacy plasma > interfaces are no longer used. I think we could remove these legacy > interfaces to cleanup code and avoid confusion. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4009) [CI] Run Valgrind and C++ code coverage in different builds
[ https://issues.apache.org/jira/browse/ARROW-4009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721301#comment-16721301 ] Antoine Pitrou commented on ARROW-4009:
---
(Valgrind, while slow, finds useful insights)
> [CI] Run Valgrind and C++ code coverage in different builds
> --
>
> Key: ARROW-4009
> URL: https://issues.apache.org/jira/browse/ARROW-4009
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Continuous Integration
> Affects Versions: 0.11.1
> Reporter: Antoine Pitrou
> Priority: Major
>
> Currently, we run Valgrind on a coverage-enabled C++ build on Travis-CI. This means the slowness of Valgrind acts as a multiplier of the overhead of outputting coverage information using the instrumentation added by the compiler.
> Instead we should probably emit C++ (and Python) coverage information in a different Travis-CI build without Valgrind enabled.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4009) [CI] Run Valgrind and C++ code coverage in different builds
[ https://issues.apache.org/jira/browse/ARROW-4009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721300#comment-16721300 ] Antoine Pitrou commented on ARROW-4009:
---
I'm not up-to-date on ASAN. Does it require a recent clang for useful results?
> [CI] Run Valgrind and C++ code coverage in different builds
> --
>
> Key: ARROW-4009
> URL: https://issues.apache.org/jira/browse/ARROW-4009
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Continuous Integration
> Affects Versions: 0.11.1
> Reporter: Antoine Pitrou
> Priority: Major
>
> Currently, we run Valgrind on a coverage-enabled C++ build on Travis-CI. This means the slowness of Valgrind acts as a multiplier of the overhead of outputting coverage information using the instrumentation added by the compiler.
> Instead we should probably emit C++ (and Python) coverage information in a different Travis-CI build without Valgrind enabled.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-3979) [Gandiva] fix all valgrind reported errors
[ https://issues.apache.org/jira/browse/ARROW-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shyam narayan singh reassigned ARROW-3979: -- Assignee: shyam narayan singh (was: Pindikura Ravindra) > [Gandiva] fix all valgrind reported errors > -- > > Key: ARROW-3979 > URL: https://issues.apache.org/jira/browse/ARROW-3979 > Project: Apache Arrow > Issue Type: Bug > Components: Gandiva >Reporter: Pindikura Ravindra >Assignee: shyam narayan singh >Priority: Major > > Travis reports lots of valgrind errors when running gandiva tests. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3701) [Gandiva] Add support for decimal operations
[ https://issues.apache.org/jira/browse/ARROW-3701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721208#comment-16721208 ] Pindikura Ravindra commented on ARROW-3701: --- As part of my PR, I'm adding more benchmarks to gandiva/benchmarks.cc - this'll exercise both the arrow-decimal code and gandiva-decimal code. > [Gandiva] Add support for decimal operations > > > Key: ARROW-3701 > URL: https://issues.apache.org/jira/browse/ARROW-3701 > Project: Apache Arrow > Issue Type: Task > Components: Gandiva >Reporter: Pindikura Ravindra >Assignee: Pindikura Ravindra >Priority: Major > Labels: pull-request-available > Time Spent: 6h 50m > Remaining Estimate: 0h > > To begin with, will add support for 128-bit decimals. There are two parts : > # llvm_generator needs to understand decimal types (value, precision, scale) > # code decimal operations : add/subtract/multiply/divide/mod/.. > ** This will be c++ code that can be pre-compiled to emit IR code -- This message was sent by Atlassian JIRA (v7.6.3#76005)