[jira] [Assigned] (ARROW-6111) [Java] Support LargeVarChar and LargeBinary types and add integration test with C++
[ https://issues.apache.org/jira/browse/ARROW-6111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ji Liu reassigned ARROW-6111: - Assignee: Micah Kornfield (was: Ji Liu) > [Java] Support LargeVarChar and LargeBinary types and add integration test > with C++ > --- > > Key: ARROW-6111 > URL: https://issues.apache.org/jira/browse/ARROW-6111 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Blocker > Fix For: 0.15.0 > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6211) [Java] Remove dependency on RangeEqualsVisitor from ValueVector interface
[ https://issues.apache.org/jira/browse/ARROW-6211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6211: -- Labels: pull-request-available (was: ) > [Java] Remove dependency on RangeEqualsVisitor from ValueVector interface > - > > Key: ARROW-6211 > URL: https://issues.apache.org/jira/browse/ARROW-6211 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Pindikura Ravindra >Assignee: Ji Liu >Priority: Major > Labels: pull-request-available > > This is a follow-up from [https://github.com/apache/arrow/pull/4933] > > public interface VectorVisitor<OUT, IN> \{..} > > In ValueVector: > public OUT accept(VectorVisitor<OUT, IN> visitor, IN value) throws EX; > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6245) [Java] Provide an interface for numeric vectors
[ https://issues.apache.org/jira/browse/ARROW-6245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6245: -- Labels: pull-request-available (was: ) > [Java] Provide an interface for numeric vectors > --- > > Key: ARROW-6245 > URL: https://issues.apache.org/jira/browse/ARROW-6245 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > > We want to provide an interface for all vectors with numeric types (small > int, float4, float8, etc.). This interface will make it convenient to > implement many operations on a vector, such as average, sum, and variance. > With this interface, client code is greatly simplified, with many > branches/switches removed. > > The design is similar to BaseIntVector (the interface for all integer > vectors). We provide three methods for setting and getting numeric values: > setWithPossibleRounding > setSafeWithPossibleRounding > getValueAsDouble -- This message was sent by Atlassian JIRA (v7.6.14#76016)
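The simplification described above can be sketched in a few lines. This is an illustrative Python analogue of the proposed Java interface, not the Arrow API; the class names are made up, and get_value_as_double mirrors the proposed getValueAsDouble:

```python
from abc import ABC, abstractmethod


class NumericVector(ABC):
    """Common interface: every numeric vector exposes elements as doubles."""

    @abstractmethod
    def get_value_as_double(self, index: int) -> float: ...

    @abstractmethod
    def __len__(self) -> int: ...


class IntVector(NumericVector):
    def __init__(self, values):
        self._values = list(values)

    def get_value_as_double(self, index):
        return float(self._values[index])

    def __len__(self):
        return len(self._values)


class Float8Vector(NumericVector):
    def __init__(self, values):
        self._values = list(values)

    def get_value_as_double(self, index):
        return self._values[index]

    def __len__(self):
        return len(self._values)


def mean(vector: NumericVector) -> float:
    # Generic aggregation: no switch over concrete vector classes needed.
    total = sum(vector.get_value_as_double(i) for i in range(len(vector)))
    return total / len(vector)
```

Without the shared interface, mean would need a branch per concrete vector type; with it, one implementation serves them all.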
[jira] [Created] (ARROW-6248) [Python] Use FileNotFoundError in HadoopFileSystem.open() in Python 3
Alexander Schepanovski created ARROW-6248: - Summary: [Python] Use FileNotFoundError in HadoopFileSystem.open() in Python 3 Key: ARROW-6248 URL: https://issues.apache.org/jira/browse/ARROW-6248 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.14.1 Reporter: Alexander Schepanovski When a file is absent, pyarrow throws {code:python} ArrowIOError('HDFS file does not exist: ...') {code} which inherits from {{IOError}} and {{pyarrow.lib.ArrowException}}. It would be better if that were {{FileNotFoundError}}, the subclass of {{IOError}} intended for this particular purpose. Also, the {{.errno}} property is empty (it should be 2), so one needs to match on the error message to check for this particular error. *P.S.* There is no {{FileNotFoundError}} in Python 2, but the {{.errno}} property is available there. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
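The request is backward compatible, which a short snippet can verify: in Python 3, {{IOError}} is an alias of {{OSError}}, {{FileNotFoundError}} is its subclass, and it carries {{errno}} 2 ({{ENOENT}}):

```python
import errno

# IOError is an alias of OSError in Python 3, and FileNotFoundError
# subclasses OSError, so code that catches IOError would keep working.
assert issubclass(FileNotFoundError, IOError)

try:
    open("/nonexistent/path/for/demo")  # path assumed not to exist
except FileNotFoundError as exc:
    # .errno is set to 2 (ENOENT), so callers need not match on the message.
    assert exc.errno == errno.ENOENT
```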
[jira] [Updated] (ARROW-6248) [Python] Use FileNotFoundError in HadoopFileSystem.open() in Python 3
[ https://issues.apache.org/jira/browse/ARROW-6248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Schepanovski updated ARROW-6248: -- Description: When file is absent pyarrow throws {code:python} ArrowIOError('HDFS file does not exist: ...') {code} which inherits from {{IOError}} and {{pyarrow.lib.ArrowException}}, it would be better if that was {{FileNotFoundError}} a subclass of {{IOError}} for this particular purpose. Also, {{.errno}} property is empty (should be 2) so one needs to match by error message to check for particular error. *P.S.* There is no {{FileNotFoundError}} in Python 2, but there is {{.errno}} property there. was: When file is absent pyarrow throws {code:python} ArrowIOError('HDFS file does not exist: ...') {code} which inherits from {{IOError}} and {{pyarrow.lib.ArrowException}}, it would be better if that was {{FileNotFoundError}} a subclass of {{IOError}} for this particular purpose. Also, {{.errno}} property is empty (should be 2) so one needs to match by error message to check for particular error. *P.S.* There is no {{FileNotFoundError}} in Python 2, but there is `.errno` property there. > [Python] Use FileNotFoundError in HadoopFileSystem.open() in Python 3 > -- > > Key: ARROW-6248 > URL: https://issues.apache.org/jira/browse/ARROW-6248 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.14.1 >Reporter: Alexander Schepanovski >Priority: Minor > > When file is absent pyarrow throws > {code:python} > ArrowIOError('HDFS file does not exist: ...') > {code} > which inherits from {{IOError}} and {{pyarrow.lib.ArrowException}}, it would > be better if that was {{FileNotFoundError}} a subclass of {{IOError}} for > this particular purpose. Also, {{.errno}} property is empty (should be 2) so > one needs to match by error message to check for particular error. > *P.S.* There is no {{FileNotFoundError}} in Python 2, but there is > {{.errno}} property there. 
[jira] [Updated] (ARROW-4111) [Python] Create time types from Python sequences of integers
[ https://issues.apache.org/jira/browse/ARROW-4111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lidavidm updated ARROW-4111: Labels: beginner (was: ) > [Python] Create time types from Python sequences of integers > > > Key: ARROW-4111 > URL: https://issues.apache.org/jira/browse/ARROW-4111 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: beginner > Fix For: 1.0.0 > > > This works for dates, but not times: > {code} > > traceback > > > > def test_to_pandas_deduplicate_date_time(): > nunique = 100 > repeats = 10 > > unique_values = list(range(nunique)) > > cases = [ > # array type, to_pandas options > ('date32', {'date_as_object': True}), > ('date64', {'date_as_object': True}), > ('time32[ms]', {}), > ('time64[us]', {}) > ] > > for array_type, pandas_options in cases: > > arr = pa.array(unique_values * repeats, type=array_type) > pyarrow/tests/test_convert_pandas.py:2392: > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > pyarrow/array.pxi:175: in pyarrow.lib.array > return _sequence_to_array(obj, mask, size, type, pool, from_pandas) > pyarrow/array.pxi:36: in pyarrow.lib._sequence_to_array > check_status(ConvertPySequence(sequence, mask, options, &out)) > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > > raise ArrowInvalid(message) > E pyarrow.lib.ArrowInvalid: ../src/arrow/python/python_to_arrow.cc:1012 : > ../src/arrow/python/iterators.h:70 : Could not convert 0 with type int: > converting to time32 > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
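Until the integer path is implemented, one possible workaround is to convert the integers to {{datetime.time}} objects first (assuming the sequence converter handles {{datetime.time}}). A minimal stdlib sketch, assuming the integers count milliseconds since midnight as in {{time32[ms]}}; the helper name is made up:

```python
from datetime import time


def ms_to_time(ms: int) -> time:
    """Convert milliseconds since midnight (the time32[ms] representation)
    into a datetime.time. Illustrative helper, not part of pyarrow."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return time(hours, minutes, seconds, millis * 1000)
```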
[jira] [Updated] (ARROW-5912) [Python] conversion from datetime objects with mixed timezones should normalize to UTC
[ https://issues.apache.org/jira/browse/ARROW-5912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lidavidm updated ARROW-5912: Labels: beginner (was: ) > [Python] conversion from datetime objects with mixed timezones should > normalize to UTC > -- > > Key: ARROW-5912 > URL: https://issues.apache.org/jira/browse/ARROW-5912 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Priority: Major > Labels: beginner > Fix For: 1.0.0 > > > Currently, when having objects with mixed timezones, they are each separately > interpreted as their local time: > {code:python} > >>> ts_pd_paris = pd.Timestamp("1970-01-01 01:00", tz="Europe/Paris") > >>> ts_pd_paris > Timestamp('1970-01-01 01:00:00+0100', tz='Europe/Paris') > >>> ts_pd_helsinki = pd.Timestamp("1970-01-01 02:00", tz="Europe/Helsinki") > >>> ts_pd_helsinki > Timestamp('1970-01-01 02:00:00+0200', tz='Europe/Helsinki') > >>> a = pa.array([ts_pd_paris, ts_pd_helsinki]) > >>> > >>> > >>> a > > [ > 1970-01-01 01:00:00.00, > 1970-01-01 02:00:00.00 > ] > >>> a.type > TimestampType(timestamp[us]) > {code} > So both times are actually about the same moment in time (the same value in > UTC; in pandas their stored {{value}} is also the same), but once converted > to pyarrow, they are both tz-naive but no longer the same time. That seems > rather unexpected and a source for bugs. > I think a better option would be to normalize to UTC, and result in a > tz-aware TimestampArray with UTC as timezone. > That is also the behaviour of pandas if you force the conversion to result in > datetimes (by default pandas will keep them as object array preserving the > different timezones). -- This message was sent by Atlassian JIRA (v7.6.14#76016)
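The behaviour and the proposed fix can be illustrated with the standard library alone (fixed UTC offsets standing in for the Paris and Helsinki zones):

```python
from datetime import datetime, timedelta, timezone

# Two wall-clock readings of the same instant in different zones.
paris = datetime(1970, 1, 1, 1, 0, tzinfo=timezone(timedelta(hours=1)))
helsinki = datetime(1970, 1, 1, 2, 0, tzinfo=timezone(timedelta(hours=2)))
assert paris == helsinki  # same moment in time

# Dropping the tzinfo -- effectively what the conversion does today --
# turns equal instants into unequal naive times.
assert paris.replace(tzinfo=None) != helsinki.replace(tzinfo=None)

# Normalizing to UTC -- the proposed behaviour -- preserves the instant.
assert paris.astimezone(timezone.utc) == helsinki.astimezone(timezone.utc)
assert paris.astimezone(timezone.utc) == datetime(
    1970, 1, 1, 0, 0, tzinfo=timezone.utc
)
```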
[jira] [Updated] (ARROW-1984) [Java] NullableDateMilliVector.getObject() should return a LocalDate, not a LocalDateTime
[ https://issues.apache.org/jira/browse/ARROW-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lidavidm updated ARROW-1984: Labels: beginner (was: ) > [Java] NullableDateMilliVector.getObject() should return a LocalDate, not a > LocalDateTime > - > > Key: ARROW-1984 > URL: https://issues.apache.org/jira/browse/ARROW-1984 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Vanco Buca >Priority: Minor > Labels: beginner > > NullableDateMilliVector.getObject() today returns a LocalDateTime. However, > this vector is used to store date information, and thus, getObject() should > return a LocalDate. > Please note: there already exists a vector that returns LocalDateTime -- > the NullableTimestampMilliVector. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-5722) [Rust] Implement std::fmt::Debug for ListArray, BinaryArray and StructArray
[ https://issues.apache.org/jira/browse/ARROW-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lidavidm updated ARROW-5722: Labels: beginner (was: ) > [Rust] Implement std::fmt::Debug for ListArray, BinaryArray and StructArray > --- > > Key: ARROW-5722 > URL: https://issues.apache.org/jira/browse/ARROW-5722 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Chao Sun >Priority: Major > Labels: beginner > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-5374) [Python] pa.read_record_batch() doesn't work
[ https://issues.apache.org/jira/browse/ARROW-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lidavidm updated ARROW-5374: Labels: begin (was: ) > [Python] pa.read_record_batch() doesn't work > > > Key: ARROW-5374 > URL: https://issues.apache.org/jira/browse/ARROW-5374 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Antoine Pitrou >Priority: Major > Labels: begin > > {code:python} > >>> batch = pa.RecordBatch.from_arrays([pa.array([b"foo"], type=pa.utf8())], > >>> names=['strs']) > >>> > >>> stream = pa.BufferOutputStream() > >>> writer = pa.RecordBatchStreamWriter(stream, batch.schema) > >>> writer.write_batch(batch) > >>> > >>> > >>> writer.close() > >>> > >>> > >>> buf = stream.getvalue() > >>> > >>> > >>> pa.read_record_batch(buf, batch.schema) > >>> > >>> > Traceback (most recent call last): > File "", line 1, in > pa.read_record_batch(buf, batch.schema) > File "pyarrow/ipc.pxi", line 583, in pyarrow.lib.read_record_batch > check_status(ReadRecordBatch(deref(message.message.get()), > File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status > raise ArrowIOError(message) > ArrowIOError: Expected IPC message of type schema got record batch > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-5374) [Python] pa.read_record_batch() doesn't work
[ https://issues.apache.org/jira/browse/ARROW-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lidavidm updated ARROW-5374: Labels: beginner (was: begin) > [Python] pa.read_record_batch() doesn't work > > > Key: ARROW-5374 > URL: https://issues.apache.org/jira/browse/ARROW-5374 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Antoine Pitrou >Priority: Major > Labels: beginner > > {code:python} > >>> batch = pa.RecordBatch.from_arrays([pa.array([b"foo"], type=pa.utf8())], > >>> names=['strs']) > >>> > >>> stream = pa.BufferOutputStream() > >>> writer = pa.RecordBatchStreamWriter(stream, batch.schema) > >>> writer.write_batch(batch) > >>> > >>> > >>> writer.close() > >>> > >>> > >>> buf = stream.getvalue() > >>> > >>> > >>> pa.read_record_batch(buf, batch.schema) > >>> > >>> > Traceback (most recent call last): > File "", line 1, in > pa.read_record_batch(buf, batch.schema) > File "pyarrow/ipc.pxi", line 583, in pyarrow.lib.read_record_batch > check_status(ReadRecordBatch(deref(message.message.get()), > File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status > raise ArrowIOError(message) > ArrowIOError: Expected IPC message of type schema got record batch > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-3552) [Python] Implement pa.RecordBatch.serialize_to to write single message to an OutputStream
[ https://issues.apache.org/jira/browse/ARROW-3552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lidavidm updated ARROW-3552: Labels: beginner (was: ) > [Python] Implement pa.RecordBatch.serialize_to to write single message to an > OutputStream > - > > Key: ARROW-3552 > URL: https://issues.apache.org/jira/browse/ARROW-3552 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: beginner > > {{RecordBatch.serialize}} writes in memory. This would help with shared > memory worksflows. See also pyarrow.ipc.write_tensor -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-5248) [Python] support dateutil timezones
[ https://issues.apache.org/jira/browse/ARROW-5248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lidavidm updated ARROW-5248: Labels: beginner (was: ) > [Python] support dateutil timezones > --- > > Key: ARROW-5248 > URL: https://issues.apache.org/jira/browse/ARROW-5248 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Priority: Minor > Labels: beginner > > The {{dateutil}} packages also provides a set of timezone objects > (https://dateutil.readthedocs.io/en/stable/tz.html) in addition to {{pytz}}. > In pyarrow, we only support pytz timezones (and the stdlib datetime.timezone > fixed offset): > {code} > In [2]: import dateutil.tz > > > In [3]: import pyarrow as pa > > > In [5]: pa.timestamp('us', dateutil.tz.gettz('Europe/Brussels')) > > > ... > ~/miniconda3/envs/dev37/lib/python3.7/site-packages/pyarrow/types.pxi in > pyarrow.lib.tzinfo_to_string() > ValueError: Unable to convert timezone > `tzfile('/usr/share/zoneinfo/Europe/Brussels')` to string > {code} > But pandas also supports dateutil timezones. As a consequence, when having a > pandas DataFrame that uses a dateutil timezone, you get an error when > converting to an arrow table. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-3776) [Rust] Mark methods that do not perform bounds checking as unsafe
[ https://issues.apache.org/jira/browse/ARROW-3776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lidavidm updated ARROW-3776: Labels: beginner (was: ) > [Rust] Mark methods that do not perform bounds checking as unsafe > - > > Key: ARROW-3776 > URL: https://issues.apache.org/jira/browse/ARROW-3776 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Paddy Horan >Priority: Minor > Labels: beginner > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-2619) [Rust] Move JSON serde code to separate file/module
[ https://issues.apache.org/jira/browse/ARROW-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lidavidm updated ARROW-2619: Labels: beginner (was: ) > [Rust] Move JSON serde code to separate file/module > --- > > Key: ARROW-2619 > URL: https://issues.apache.org/jira/browse/ARROW-2619 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Andy Grove >Priority: Minor > Labels: beginner > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-4176) [C++/Python] Human readable arrow schema comparison
[ https://issues.apache.org/jira/browse/ARROW-4176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lidavidm updated ARROW-4176: Labels: beginner (was: ) > [C++/Python] Human readable arrow schema comparison > --- > > Key: ARROW-4176 > URL: https://issues.apache.org/jira/browse/ARROW-4176 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Florian Jetter >Priority: Minor > Labels: beginner > > When working with arrow schemas it would be helpful to have a human readable > representation of the diff between two schemas. > This could be either exposed as a function returning a string/diff object or > via a function raising an Exception with this information. > For instance: > {code} > schema_diff = get_schema_diff(schema1, schema2) > expected_diff = """ > - col_changed: int8 > + col_changed: double > + col_additional: int8 > """ > assert schema_diff == expected_diff > {code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
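A rough sketch of what such a helper could look like, built on {{difflib}} from the standard library. {{get_schema_diff}} and the list-of-(name, type) schema representation are assumptions for illustration; a real implementation would walk Arrow schema objects:

```python
import difflib


def get_schema_diff(schema1, schema2):
    """Return removed (-) and added (+) 'name: type' lines between schemas.

    Schemas are modeled as lists of (field_name, type_string) pairs here;
    this is an illustration, not the pyarrow API.
    """
    def render(schema):
        return ["{}: {}".format(name, typ) for name, typ in schema]

    return "\n".join(
        line
        for line in difflib.ndiff(render(schema1), render(schema2))
        if line.startswith(("- ", "+ "))  # keep only changed fields
    )


schema_diff = get_schema_diff(
    [("col_same", "int8"), ("col_changed", "int8")],
    [("col_same", "int8"), ("col_changed", "double"), ("col_additional", "int8")],
)
```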
[jira] [Resolved] (ARROW-6240) [Ruby] Arrow::Decimal128Array returns BigDecimal
[ https://issues.apache.org/jira/browse/ARROW-6240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yosuke Shiro resolved ARROW-6240. - Resolution: Fixed Fix Version/s: 0.15.0 Issue resolved by pull request 5089 [https://github.com/apache/arrow/pull/5089] > [Ruby] Arrow::Decimal128Array returns BigDecimal > > > Key: ARROW-6240 > URL: https://issues.apache.org/jira/browse/ARROW-6240 > Project: Apache Arrow > Issue Type: Improvement > Components: Ruby >Reporter: Sutou Kouhei >Assignee: Sutou Kouhei >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6249) [Java] Remove useless class ByteArrayWrapper
Ji Liu created ARROW-6249: - Summary: [Java] Remove useless class ByteArrayWrapper Key: ARROW-6249 URL: https://issues.apache.org/jira/browse/ARROW-6249 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Ji Liu Assignee: Ji Liu This class was introduced in the encoding code to compare byte[] values for equality. Since value/vector equality is now checked via the visitor API added in ARROW-6022 instead of by comparing {{getObject}} results, this class is no longer needed. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6249) [Java] Remove useless class ByteArrayWrapper
[ https://issues.apache.org/jira/browse/ARROW-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6249: -- Labels: pull-request-available (was: ) > [Java] Remove useless class ByteArrayWrapper > > > Key: ARROW-6249 > URL: https://issues.apache.org/jira/browse/ARROW-6249 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Minor > Labels: pull-request-available > > This class was introduced into encoding part to compare byte[] values equals. > Since now we compare value/vector equals by new added visitor API by > ARROW-6022 instead of comparing {{getObject}}, this class is no use anymore. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6250) [Java] Implement ApproxEqualsVisitor comparing approx for floating point
Ji Liu created ARROW-6250: - Summary: [Java] Implement ApproxEqualsVisitor comparing approx for floating point Key: ARROW-6250 URL: https://issues.apache.org/jira/browse/ARROW-6250 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Ji Liu Assignee: Ji Liu We have already implemented {{RangeEqualsVisitor}}/{{VectorEqualsVisitor}} for comparing ranges/vectors, and ARROW-6211 was created to make {{ValueVector}} work with a generic visitor. We should also implement {{ApproxEqualsVisitor}} to compare floating-point values approximately, just like the C++ implementation does: [https://github.com/apache/arrow/blob/master/cpp/src/arrow/compare.cc] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
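The comparison semantics being proposed can be sketched in Python (the function name and default tolerances are illustrative, not the Java API): element-wise comparison with absolute and relative tolerance, where a null slot only equals another null slot:

```python
import math


def approx_equals(left, right, atol=1e-5, rtol=1e-5):
    """Approximate element-wise equality for floating-point vectors.

    None models a null slot; this is an illustrative sketch only.
    """
    if len(left) != len(right):
        return False
    for a, b in zip(left, right):
        if (a is None) != (b is None):
            return False          # null in only one of the two slots
        if a is None:
            continue              # null in both slots counts as equal
        if not math.isclose(a, b, rel_tol=rtol, abs_tol=atol):
            return False
    return True
```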
[jira] [Created] (ARROW-6251) [Developer] Add PR merge tool to apache/arrow-site
Wes McKinney created ARROW-6251: --- Summary: [Developer] Add PR merge tool to apache/arrow-site Key: ARROW-6251 URL: https://issues.apache.org/jira/browse/ARROW-6251 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Wes McKinney Fix For: 0.15.0 This will help with creating clean patches and also keeping JIRA clean -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-6246) [Website] Add link to R documentation site
[ https://issues.apache.org/jira/browse/ARROW-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-6246. - Resolution: Fixed Fix Version/s: 0.15.0 https://github.com/apache/arrow-site/commit/41d02ac5e96fafd3dc7663d5214cdc7cd0dedb26 > [Website] Add link to R documentation site > -- > > Key: ARROW-6246 > URL: https://issues.apache.org/jira/browse/ARROW-6246 > Project: Apache Arrow > Issue Type: Improvement > Components: Website >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Minor > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > ARROW-6139 added the R documentation at /docs/r/, but we still need to link > to it from the website header. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6250) [Java] Implement ApproxEqualsVisitor comparing approx for floating point
[ https://issues.apache.org/jira/browse/ARROW-6250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908116#comment-16908116 ] Ji Liu commented on ARROW-6250: --- cc [~pravindra] > [Java] Implement ApproxEqualsVisitor comparing approx for floating point > > > Key: ARROW-6250 > URL: https://issues.apache.org/jira/browse/ARROW-6250 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Critical > > Currently we already implemented {{RangeEqualsVisitor/VectorEqualsVisitor}} > for comparing range/vector. > And ARROW-6211 is created to make {{ValueVector}} work with generic visitor. > We should also implement {{ApproxEqualsVisitor}} to compare floating point > just like cpp does > [https://github.com/apache/arrow/blob/master/cpp/src/arrow/compare.cc] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5610) [Python] Define extension type API in Python to "receive" or "send" a foreign extension type
[ https://issues.apache.org/jira/browse/ARROW-5610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908131#comment-16908131 ] Joris Van den Bossche commented on ARROW-5610: -- OK, I am making some progress on this (initially I was disregarding the parametrized type case, so we indeed need C++ <-> Python interaction). I have basic roundtripping with a parametrized type working. E.g. in Python an implementor can do: {code:python} class PeriodType(pa.GenericExtensionType): def __init__(self, freq): # attributes need to be set before calling the super init (as that calls serialize) self.freq = freq pa.lib.GenericExtensionType.__init__(self, pa.int64(), 'pandas.period') def __arrow_ext_serialize__(self): return "freq={}".format(self.freq).encode() @classmethod def __arrow_ext_deserialize__(cls, storage_type, serialized): serialized = serialized.decode() assert serialized.startswith("freq=") freq = serialized.split('=')[1] return PeriodType(freq) period_type = PeriodType('D') pa.lib.register(period_type) {code} and that can roundtrip IPC with the "pandas.period" extension name (so not a generic "arrow.py_extension"). I based the above interface (the {{__arrow_ext_serialize__}} and {{__arrow_ext_deserialize__}} methods to implement) on the existing {{PyExtensionType}} that Antoine implemented. {quote}> I assume the generic ExtensionType would have a Python "vtable" for Python subclasses to implement the C++ methods {quote} Currently I have based myself on the existing {{PyExtensionType}}, copying its approach of storing a weakref to an instance and the class of the Python subclass the user defines. That seems to work, but I am not familiar enough with this to judge whether the vtable approach (as used in PyFlightServer) would be better. {quote}> The registration method would need to support parameterized types as well (i.e. registering multiple instances of the same type with different parameters). {quote} Is that needed? 
My current idea is that you would register a certain type once (with _some_ parametrization, so you don't have to register each possible parametrization). Because we register in C++ based on the name, so otherwise the name would need to include the parameter. Actually, writing this down now, that could also be an option (currently I use the serialized metadata for storing the parametrization). Other questions I still need to answer: - What to do with registration and unregistration? It would be nice if a user didn't need to register a type manually (in python that could be done with a metaclass to register the subclass on definition, but not sure that is possible in cython) Also for unregistering, since that is needed to avoid segfaults on shutdown, we probably need to keep a python side registry of the C++-registered types to ensure we unregister them on shutdown. - Do we want to keep the current {{PyExtensionType}} based on pickle? I think the main advantage compared to the new implementation is that when reading an IPC message, the type does not need to be registered to be recognized (for the unpickling, it is enough that the module is importable, but does not need to be imported manually by the user). But on the other hand it gives two largely overlapping alternatives. I will try to clean up and push to a draft PR, which will be easier to get an idea. > [Python] Define extension type API in Python to "receive" or "send" a foreign > extension type > > > Key: ARROW-5610 > URL: https://issues.apache.org/jira/browse/ARROW-5610 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 1.0.0 > > > In work in ARROW-840, a static {{arrow.py_extension_type}} name is used. > There will be cases where an extension type is coming from another > programming language (e.g. 
Java), so it would be useful to be able to "plug > in" a Python extension type subclass that will be used to deserialize the > extension type coming over the wire. This has some different API requirements > since the serialized representation of the type will not have knowledge of > Python pickling, etc. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3
[ https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908138#comment-16908138 ] Wes McKinney commented on ARROW-6058: - So far we don't have a minimal reproduction of the issue so it's very hard for other developers in this project to help. Since you are encountering the problem, you are the best positioned to reproduce the issue or determine the root cause. > [Python][Parquet] Failure when reading Parquet file from S3 > > > Key: ARROW-6058 > URL: https://issues.apache.org/jira/browse/ARROW-6058 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.14.1 >Reporter: Siddharth >Priority: Major > Labels: parquet > > I am reading parquet data from S3 and get ArrowIOError error. > Size of the data: 32 part files 90 MB each (3GB approx) > Number of records: Approx 100M > Code Snippet: > {code:java} > from s3fs import S3FileSystem > import pyarrow.parquet as pq > s3 = S3FileSystem() > dataset = pq.ParquetDataset("s3://location", filesystem=s3) > df = dataset.read_pandas().to_pandas() > {code} > Stack Trace: > {code:java} > df = dataset.read_pandas().to_pandas() > File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1113, in read_pandas > return self.read(use_pandas_metadata=True, **kwargs) > File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1085, in read > use_pandas_metadata=use_pandas_metadata) > File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, > in read > table = reader.read(**options) > File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, > in read > use_threads=use_threads) > File "pyarrow/_parquet.pyx", line 1086, in > pyarrow._parquet.ParquetReader.read_all > File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status > pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) > than expected (263929) > {code} > > *Note: Same code works on relatively 
smaller dataset (approx < 50M records)* > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-5374) [Python] pa.read_record_batch() doesn't work
[ https://issues.apache.org/jira/browse/ARROW-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5374: Fix Version/s: 0.15.0 > [Python] pa.read_record_batch() doesn't work > > > Key: ARROW-5374 > URL: https://issues.apache.org/jira/browse/ARROW-5374 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Antoine Pitrou >Priority: Major > Labels: beginner > Fix For: 0.15.0 > > > {code:python} > >>> batch = pa.RecordBatch.from_arrays([pa.array([b"foo"], type=pa.utf8())], > >>> names=['strs']) > >>> > >>> stream = pa.BufferOutputStream() > >>> writer = pa.RecordBatchStreamWriter(stream, batch.schema) > >>> writer.write_batch(batch) > >>> > >>> > >>> writer.close() > >>> > >>> > >>> buf = stream.getvalue() > >>> > >>> > >>> pa.read_record_batch(buf, batch.schema) > >>> > >>> > Traceback (most recent call last): > File "", line 1, in > pa.read_record_batch(buf, batch.schema) > File "pyarrow/ipc.pxi", line 583, in pyarrow.lib.read_record_batch > check_status(ReadRecordBatch(deref(message.message.get()), > File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status > raise ArrowIOError(message) > ArrowIOError: Expected IPC message of type schema got record batch > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6248) [Python] Use FileNotFoundError in HadoopFileSystem.open() in Python 3
[ https://issues.apache.org/jira/browse/ARROW-6248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908165#comment-16908165 ] Wes McKinney commented on ARROW-6248: - Seems reasonable. Would you like to submit a PR? > [Python] Use FileNotFoundError in HadoopFileSystem.open() in Python 3 > -- > > Key: ARROW-6248 > URL: https://issues.apache.org/jira/browse/ARROW-6248 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.14.1 >Reporter: Alexander Schepanovski >Priority: Minor > > When file is absent pyarrow throws > {code:python} > ArrowIOError('HDFS file does not exist: ...') > {code} > which inherits from {{IOError}} and {{pyarrow.lib.ArrowException}}, it would > be better if that was {{FileNotFoundError}} a subclass of {{IOError}} for > this particular purpose. Also, {{.errno}} property is empty (should be 2) so > one needs to match by error message to check for particular error. > *P.S.* There is no {{FileNotFoundError}} in Python 2, but there is > {{.errno}} property there. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-5374) [Python] Misleading error message when calling pyarrow.read_record_batch on a complete IPC stream
[ https://issues.apache.org/jira/browse/ARROW-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5374: Summary: [Python] Misleading error message when calling pyarrow.read_record_batch on a complete IPC stream (was: [Python] pa.read_record_batch() doesn't work) > [Python] Misleading error message when calling pyarrow.read_record_batch on a > complete IPC stream > - > > Key: ARROW-5374 > URL: https://issues.apache.org/jira/browse/ARROW-5374 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Antoine Pitrou >Priority: Major > Labels: beginner > Fix For: 0.15.0 > > > {code:python} > >>> batch = pa.RecordBatch.from_arrays([pa.array([b"foo"], type=pa.utf8())], > >>> names=['strs']) > >>> > >>> stream = pa.BufferOutputStream() > >>> writer = pa.RecordBatchStreamWriter(stream, batch.schema) > >>> writer.write_batch(batch) > >>> > >>> > >>> writer.close() > >>> > >>> > >>> buf = stream.getvalue() > >>> > >>> > >>> pa.read_record_batch(buf, batch.schema) > >>> > >>> > Traceback (most recent call last): > File "", line 1, in > pa.read_record_batch(buf, batch.schema) > File "pyarrow/ipc.pxi", line 583, in pyarrow.lib.read_record_batch > check_status(ReadRecordBatch(deref(message.message.get()), > File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status > raise ArrowIOError(message) > ArrowIOError: Expected IPC message of type schema got record batch > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5374) [Python] Misleading error message when calling pyarrow.read_record_batch on a complete IPC stream
[ https://issues.apache.org/jira/browse/ARROW-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908168#comment-16908168 ] Wes McKinney commented on ARROW-5374: - I updated the issue title so it does not mislead contributors > [Python] Misleading error message when calling pyarrow.read_record_batch on a > complete IPC stream > - > > Key: ARROW-5374 > URL: https://issues.apache.org/jira/browse/ARROW-5374 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Antoine Pitrou >Priority: Major > Labels: beginner > Fix For: 0.15.0 > > > {code:python} > >>> batch = pa.RecordBatch.from_arrays([pa.array([b"foo"], type=pa.utf8())], > >>> names=['strs']) > >>> > >>> stream = pa.BufferOutputStream() > >>> writer = pa.RecordBatchStreamWriter(stream, batch.schema) > >>> writer.write_batch(batch) > >>> > >>> > >>> writer.close() > >>> > >>> > >>> buf = stream.getvalue() > >>> > >>> > >>> pa.read_record_batch(buf, batch.schema) > >>> > >>> > Traceback (most recent call last): > File "", line 1, in > pa.read_record_batch(buf, batch.schema) > File "pyarrow/ipc.pxi", line 583, in pyarrow.lib.read_record_batch > check_status(ReadRecordBatch(deref(message.message.get()), > File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status > raise ArrowIOError(message) > ArrowIOError: Expected IPC message of type schema got record batch > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-5610) [Python] Define extension type API in Python to "receive" or "send" a foreign extension type
[ https://issues.apache.org/jira/browse/ARROW-5610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5610: -- Labels: pull-request-available (was: ) > [Python] Define extension type API in Python to "receive" or "send" a foreign > extension type > > > Key: ARROW-5610 > URL: https://issues.apache.org/jira/browse/ARROW-5610 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > In work in ARROW-840, a static {{arrow.py_extension_type}} name is used. > There will be cases where an extension type is coming from another > programming language (e.g. Java), so it would be useful to be able to "plug > in" a Python extension type subclass that will be used to deserialize the > extension type coming over the wire. This has some different API requirements > since the serialized representation of the type will not have knowledge of > Python pickling, etc. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6252) [Python] Add pyarrow.Array.diff_contents method
Wes McKinney created ARROW-6252: --- Summary: [Python] Add pyarrow.Array.diff_contents method Key: ARROW-6252 URL: https://issues.apache.org/jira/browse/ARROW-6252 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Fix For: 0.15.0 This would expose the Array diffing functionality in Python to make it easier to see why arrays are unequal -- This message was sent by Atlassian JIRA (v7.6.14#76016)
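For readers unfamiliar with the C++ diffing output, the general idea — reporting which elements appear in only one of two unequal arrays — can be approximated with the standard library. This `difflib` sketch is only an analogy, not pyarrow's actual output format:

```python
import difflib

left = ["foo", "bar", "baz"]
right = ["foo", "qux", "baz"]

# '-' marks elements present only in the first sequence, '+' elements
# present only in the second; unchanged elements keep a two-space prefix.
# pyarrow's diff output format will differ from difflib's.
diff_lines = [line for line in difflib.ndiff(left, right)
              if not line.startswith("?")]
```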
[jira] [Updated] (ARROW-6230) [R] Reading in Parquet files are 20x slower than reading fst files in R
[ https://issues.apache.org/jira/browse/ARROW-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-6230: --- Summary: [R] Reading in Parquet files are 20x slower than reading fst files in R (was: [R] Reading in parquent files are 20x slower than reading fst files in R) > [R] Reading in Parquet files are 20x slower than reading fst files in R > --- > > Key: ARROW-6230 > URL: https://issues.apache.org/jira/browse/ARROW-6230 > Project: Apache Arrow > Issue Type: Improvement > Components: R > Environment: Windows 10 Pro and Ubuntu >Reporter: Zhuo Jia Dai >Priority: Major > Fix For: 0.14.1 > > Attachments: image-2019-08-14-10-04-56-834.png > > > *Problem* > Loading any of the data I mentioned below is 20x slower than the fst format > in R. > > *How to get the data* > [https://loanperformancedata.fanniemae.com/lppub/index.html] > Register and download any of these. I can't provide the data to you, and I > think it's best you register. > > !image-2019-08-14-10-04-56-834.png! > > *Code* > ```r > path = "data/Performance_2016Q4.txt" > library(data.table) > library(arrow) > a = data.table::fread(path, header = FALSE) > fst::write_fst(a, "data/a.fst") > arrow::write_parquet(a, "data/a.parquet") > rm(a); gc() > # read in test > system.time(a <- fst::read_fst("data/a.fst")) # 4.61 seconds > rm(a); gc() > # read in test > system.time(a <- arrow::read_parquet("data/a.parquet")) # 99.19 seconds > ``` -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6230) [R] Reading in Parquet files are 20x slower than reading fst files in R
[ https://issues.apache.org/jira/browse/ARROW-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-6230: --- Affects Version/s: 0.14.0 > [R] Reading in Parquet files are 20x slower than reading fst files in R > --- > > Key: ARROW-6230 > URL: https://issues.apache.org/jira/browse/ARROW-6230 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 0.14.0 > Environment: Windows 10 Pro and Ubuntu >Reporter: Zhuo Jia Dai >Priority: Major > Labels: parquet > Attachments: image-2019-08-14-10-04-56-834.png > > > *Problem* > Loading any of the data I mentioned below is 20x slower than the fst format > in R. > > *How to get the data* > [https://loanperformancedata.fanniemae.com/lppub/index.html] > Register and download any of these. I can't provide the data to you, and I > think it's best you register. > > !image-2019-08-14-10-04-56-834.png! > > *Code* > ```r > path = "data/Performance_2016Q4.txt" > library(data.table) > library(arrow) > a = data.table::fread(path, header = FALSE) > fst::write_fst(a, "data/a.fst") > arrow::write_parquet(a, "data/a.parquet") > rm(a); gc() > #read in test > system.time(a <- fst::read_fst("data/a.fst")) # 4.61 seconds > rm(a); gc() > read in test > system.time(a <- arrow::read_parquet("data/a.parquet") # 99.19 seconds > ``` -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6230) [R] Reading in Parquet files are 20x slower than reading fst files in R
[ https://issues.apache.org/jira/browse/ARROW-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-6230: --- Labels: paragraph (was: ) > [R] Reading in Parquet files are 20x slower than reading fst files in R > --- > > Key: ARROW-6230 > URL: https://issues.apache.org/jira/browse/ARROW-6230 > Project: Apache Arrow > Issue Type: Improvement > Components: R > Environment: Windows 10 Pro and Ubuntu >Reporter: Zhuo Jia Dai >Priority: Major > Labels: paragraph > Fix For: 0.14.1 > > Attachments: image-2019-08-14-10-04-56-834.png > > > *Problem* > Loading any of the data I mentioned below is 20x slower than the fst format > in R. > > *How to get the data* > [https://loanperformancedata.fanniemae.com/lppub/index.html] > Register and download any of these. I can't provide the data to you, and I > think it's best you register. > > !image-2019-08-14-10-04-56-834.png! > > *Code* > ```r > path = "data/Performance_2016Q4.txt" > library(data.table) > library(arrow) > a = data.table::fread(path, header = FALSE) > fst::write_fst(a, "data/a.fst") > arrow::write_parquet(a, "data/a.parquet") > rm(a); gc() > #read in test > system.time(a <- fst::read_fst("data/a.fst")) # 4.61 seconds > rm(a); gc() > read in test > system.time(a <- arrow::read_parquet("data/a.parquet") # 99.19 seconds > ``` -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6230) [R] Reading in Parquet files are 20x slower than reading fst files in R
[ https://issues.apache.org/jira/browse/ARROW-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-6230: --- Labels: parquet (was: paragraph) > [R] Reading in Parquet files are 20x slower than reading fst files in R > --- > > Key: ARROW-6230 > URL: https://issues.apache.org/jira/browse/ARROW-6230 > Project: Apache Arrow > Issue Type: Improvement > Components: R > Environment: Windows 10 Pro and Ubuntu >Reporter: Zhuo Jia Dai >Priority: Major > Labels: parquet > Fix For: 0.14.1 > > Attachments: image-2019-08-14-10-04-56-834.png > > > *Problem* > Loading any of the data I mentioned below is 20x slower than the fst format > in R. > > *How to get the data* > [https://loanperformancedata.fanniemae.com/lppub/index.html] > Register and download any of these. I can't provide the data to you, and I > think it's best you register. > > !image-2019-08-14-10-04-56-834.png! > > *Code* > ```r > path = "data/Performance_2016Q4.txt" > library(data.table) > library(arrow) > a = data.table::fread(path, header = FALSE) > fst::write_fst(a, "data/a.fst") > arrow::write_parquet(a, "data/a.parquet") > rm(a); gc() > #read in test > system.time(a <- fst::read_fst("data/a.fst")) # 4.61 seconds > rm(a); gc() > read in test > system.time(a <- arrow::read_parquet("data/a.parquet") # 99.19 seconds > ``` -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6230) [R] Reading in Parquet files are 20x slower than reading fst files in R
[ https://issues.apache.org/jira/browse/ARROW-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-6230: --- Fix Version/s: (was: 0.14.1) > [R] Reading in Parquet files are 20x slower than reading fst files in R > --- > > Key: ARROW-6230 > URL: https://issues.apache.org/jira/browse/ARROW-6230 > Project: Apache Arrow > Issue Type: Improvement > Components: R > Environment: Windows 10 Pro and Ubuntu >Reporter: Zhuo Jia Dai >Priority: Major > Labels: parquet > Attachments: image-2019-08-14-10-04-56-834.png > > > *Problem* > Loading any of the data I mentioned below is 20x slower than the fst format > in R. > > *How to get the data* > [https://loanperformancedata.fanniemae.com/lppub/index.html] > Register and download any of these. I can't provide the data to you, and I > think it's best you register. > > !image-2019-08-14-10-04-56-834.png! > > *Code* > ```r > path = "data/Performance_2016Q4.txt" > library(data.table) > library(arrow) > a = data.table::fread(path, header = FALSE) > fst::write_fst(a, "data/a.fst") > arrow::write_parquet(a, "data/a.parquet") > rm(a); gc() > #read in test > system.time(a <- fst::read_fst("data/a.fst")) # 4.61 seconds > rm(a); gc() > read in test > system.time(a <- arrow::read_parquet("data/a.parquet") # 99.19 seconds > ``` -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-6230) [R] Reading in Parquet files are 20x slower than reading fst files in R
[ https://issues.apache.org/jira/browse/ARROW-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-6230. - Resolution: Cannot Reproduce Assignee: Wes McKinney Fix Version/s: 0.15.0 Resolving for 0.15.0. If after 0.15.0 comes out there are performance or memory use problems please reopen this issue or open a new issue. Thanks! > [R] Reading in Parquet files are 20x slower than reading fst files in R > --- > > Key: ARROW-6230 > URL: https://issues.apache.org/jira/browse/ARROW-6230 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 0.14.0 > Environment: Windows 10 Pro and Ubuntu >Reporter: Zhuo Jia Dai >Assignee: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.15.0 > > Attachments: image-2019-08-14-10-04-56-834.png > > > *Problem* > Loading any of the data I mentioned below is 20x slower than the fst format > in R. > > *How to get the data* > [https://loanperformancedata.fanniemae.com/lppub/index.html] > Register and download any of these. I can't provide the data to you, and I > think it's best you register. > > !image-2019-08-14-10-04-56-834.png! > > *Code* > ```r > path = "data/Performance_2016Q4.txt" > library(data.table) > library(arrow) > a = data.table::fread(path, header = FALSE) > fst::write_fst(a, "data/a.fst") > arrow::write_parquet(a, "data/a.parquet") > rm(a); gc() > #read in test > system.time(a <- fst::read_fst("data/a.fst")) # 4.61 seconds > rm(a); gc() > read in test > system.time(a <- arrow::read_parquet("data/a.parquet") # 99.19 seconds > ``` -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Reopened] (ARROW-6230) [R] Reading in Parquet files are 20x slower than reading fst files in R
[ https://issues.apache.org/jira/browse/ARROW-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reopened ARROW-6230: - > [R] Reading in Parquet files are 20x slower than reading fst files in R > --- > > Key: ARROW-6230 > URL: https://issues.apache.org/jira/browse/ARROW-6230 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 0.14.0 > Environment: Windows 10 Pro and Ubuntu >Reporter: Zhuo Jia Dai >Assignee: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.15.0 > > Attachments: image-2019-08-14-10-04-56-834.png > > > *Problem* > Loading any of the data I mentioned below is 20x slower than the fst format > in R. > > *How to get the data* > [https://loanperformancedata.fanniemae.com/lppub/index.html] > Register and download any of these. I can't provide the data to you, and I > think it's best you register. > > !image-2019-08-14-10-04-56-834.png! > > *Code* > ```r > path = "data/Performance_2016Q4.txt" > library(data.table) > library(arrow) > a = data.table::fread(path, header = FALSE) > fst::write_fst(a, "data/a.fst") > arrow::write_parquet(a, "data/a.parquet") > rm(a); gc() > #read in test > system.time(a <- fst::read_fst("data/a.fst")) # 4.61 seconds > rm(a); gc() > read in test > system.time(a <- arrow::read_parquet("data/a.parquet") # 99.19 seconds > ``` -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-6230) [R] Reading in Parquet files are 20x slower than reading fst files in R
[ https://issues.apache.org/jira/browse/ARROW-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-6230. - Resolution: Fixed > [R] Reading in Parquet files are 20x slower than reading fst files in R > --- > > Key: ARROW-6230 > URL: https://issues.apache.org/jira/browse/ARROW-6230 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 0.14.0 > Environment: Windows 10 Pro and Ubuntu >Reporter: Zhuo Jia Dai >Assignee: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.15.0 > > Attachments: image-2019-08-14-10-04-56-834.png > > > *Problem* > Loading any of the data I mentioned below is 20x slower than the fst format > in R. > > *How to get the data* > [https://loanperformancedata.fanniemae.com/lppub/index.html] > Register and download any of these. I can't provide the data to you, and I > think it's best you register. > > !image-2019-08-14-10-04-56-834.png! > > *Code* > ```r > path = "data/Performance_2016Q4.txt" > library(data.table) > library(arrow) > a = data.table::fread(path, header = FALSE) > fst::write_fst(a, "data/a.fst") > arrow::write_parquet(a, "data/a.parquet") > rm(a); gc() > #read in test > system.time(a <- fst::read_fst("data/a.fst")) # 4.61 seconds > rm(a); gc() > read in test > system.time(a <- arrow::read_parquet("data/a.parquet") # 99.19 seconds > ``` -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-6180) [C++] Create InputStream that is an isolated reader of a segment of a RandomAccessFile
[ https://issues.apache.org/jira/browse/ARROW-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-6180. - Resolution: Fixed Issue resolved by pull request 5085 [https://github.com/apache/arrow/pull/5085] > [C++] Create InputStream that is an isolated reader of a segment of a > RandomAccessFile > -- > > Key: ARROW-6180 > URL: https://issues.apache.org/jira/browse/ARROW-6180 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > If different threads wants to do buffered reads over different portions of a > file (and they are unable to create their own separate file handles), they > may clobber each other. I would propose creating an object that keeps the > RandomAccessFile internally and implements the InputStream API in a way that > is safe from other threads changing the file position -- This message was sent by Atlassian JIRA (v7.6.14#76016)
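The idea behind this issue can be illustrated outside C++: a reader addressing an absolute segment of a shared file handle without touching any shared position. A minimal Python sketch using POSIX `os.pread` (the class and method names are illustrative, not the Arrow API):

```python
import os
import tempfile

class SegmentReader:
    """Sketch of an InputStream over [offset, offset + length) of a shared
    file descriptor. os.pread reads at an absolute offset, so concurrent
    readers never clobber a shared file position."""

    def __init__(self, fd, offset, length):
        self.fd = fd
        self.offset = offset
        self.remaining = length

    def read(self, nbytes):
        # Never read past the end of this reader's segment.
        nbytes = min(nbytes, self.remaining)
        data = os.pread(self.fd, nbytes, self.offset)
        self.offset += len(data)
        self.remaining -= len(data)
        return data

with tempfile.TemporaryFile() as f:
    f.write(b"0123456789")
    f.flush()
    # Two isolated readers over disjoint segments of the same descriptor.
    r1 = SegmentReader(f.fileno(), 2, 4)
    r2 = SegmentReader(f.fileno(), 6, 4)
    part1 = r1.read(4)
    part2 = r2.read(4)
```

Each reader tracks its own offset, so interleaving reads from `r1` and `r2` across threads cannot corrupt either stream.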
[jira] [Created] (ARROW-6253) [Python] Expose "enable_buffered_stream" option from parquet::ReaderProperties in pyarrow.parquet.read_table
Wes McKinney created ARROW-6253: --- Summary: [Python] Expose "enable_buffered_stream" option from parquet::ReaderProperties in pyarrow.parquet.read_table Key: ARROW-6253 URL: https://issues.apache.org/jira/browse/ARROW-6253 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Fix For: 0.15.0 See also PARQUET-1370 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6254) [Rust][Parquet] Parquet dependency fails to compile
Dongha Lee created ARROW-6254: - Summary: [Rust][Parquet] Parquet dependency fails to compile Key: ARROW-6254 URL: https://issues.apache.org/jira/browse/ARROW-6254 Project: Apache Arrow Issue Type: Bug Components: Rust Affects Versions: 0.14.1 Reporter: Dongha Lee Hi, I set up a blank rust project, added the dependency `parquet = "0.14.1"` and ran `cargo build`. But unfortunately, it failed with a large error message. We used rust nightly: `cargo 1.38.0-nightly` and `rustc 1.38.0-nightly`. It failed both on arch and ubuntu. I tried to build directly in `.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-0.14.1` but it failed. I cloned the arrow repository and tried to build in the directory `rust/parquet` and it succeeded. But as soon as I moved rust/parquet to some other location, the build failed. So my guess is that the failure has something to do with the dependent module `rust/arrow`. Is this a known issue? I couldn't find any ticket for that. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
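The path-plus-version pattern Paddy Horan suggests below can be sketched as a Cargo manifest entry. Treat this as an illustrative snippet, not the real parquet `Cargo.toml`:

```toml
# Hypothetical dependency entry for parquet's Cargo.toml: local workspace
# builds resolve `arrow` via `path`; the published crate on crates.io
# falls back to the release matching `version`.
[dependencies]
arrow = { path = "../arrow", version = "0.14.1" }
```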
[jira] [Commented] (ARROW-6254) [Rust][Parquet] Parquet dependency fails to compile
[ https://issues.apache.org/jira/browse/ARROW-6254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908424#comment-16908424 ] Paddy Horan commented on ARROW-6254: This is not a known error; we use relative paths within the workspace for development, like [here]([https://github.com/apache/arrow/blob/master/rust/parquet/Cargo.toml#L43]). I guess when publishing to crates.io we need to publish arrow, then parquet, then datafusion, and update the Cargo.toml for parquet and datafusion before we publish. If you change parquet's Cargo.toml to: arrow = "0.14.1" does it compile when moved as you described above? > [Rust][Parquet] Parquet dependency fails to compile > --- > > Key: ARROW-6254 > URL: https://issues.apache.org/jira/browse/ARROW-6254 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 0.14.1 >Reporter: Dongha Lee >Priority: Major > > Hi, > I set up a blank rust project, added dependency `parquet = "0.14.1"` and ran > `cargo build`. But unfortunately, it with a large error message. > Use used rust nightly: `cargo 1.38.0-nightly` and `rustc 1.38.0-nightly`. It > failed both on arch and ubuntu. > I tried to build directly in > `.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-0.14.1` but it > failed. > I cloned arrow repository and tried to build in the directory `rust/parquet` > and it succeeded. But as soon I moved the rust/parquet to some other > location, the build failed. So my guess is that the failure has to do > something with dependent modules `rust/arrow`. > Is this a known issue? I couldn't find any ticket for that. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6254) [Rust][Parquet] Parquet dependency fails to compile
[ https://issues.apache.org/jira/browse/ARROW-6254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908433#comment-16908433 ] Dongha Lee commented on ARROW-6254: --- As far as I remember, it didn't work. But I can double check it later. And in `rust/parquet` I couldn't find any line `extern crate arrow`. I am a rust newbie, but I guess it's always using the local dependencies. > [Rust][Parquet] Parquet dependency fails to compile > --- > > Key: ARROW-6254 > URL: https://issues.apache.org/jira/browse/ARROW-6254 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 0.14.1 >Reporter: Dongha Lee >Priority: Major > > Hi, > I set up a blank rust project, added dependency `parquet = "0.14.1"` and ran > `cargo build`. But unfortunately, it with a large error message. > Use used rust nightly: `cargo 1.38.0-nightly` and `rustc 1.38.0-nightly`. It > failed both on arch and ubuntu. > I tried to build directly in > `.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-0.14.1` but it > failed. > I cloned arrow repository and tried to build in the directory `rust/parquet` > and it succeeded. But as soon I moved the rust/parquet to some other > location, the build failed. So my guess is that the failure has to do > something with dependent modules `rust/arrow`. > Is this a known issue? I couldn't find any ticket for that. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6255) [Rust] [Parquet] Cannot use any published parquet crate due to parquet-format breaking change
Andy Grove created ARROW-6255: - Summary: [Rust] [Parquet] Cannot use any published parquet crate due to parquet-format breaking change Key: ARROW-6255 URL: https://issues.apache.org/jira/browse/ARROW-6255 Project: Apache Arrow Issue Type: Bug Components: Rust Affects Versions: 0.14.1, 0.14.0, 0.13.0, 0.12.1, 0.12.0 Reporter: Andy Grove Fix For: 0.15.0 As a user who wants to use the Rust version of Arrow, I am unable to use any of the previously published versions due to the recent breaking change in parquet-format 2.5.0. To reproduce, simply create an empty Rust project using "cargo init example --bin", add a dependency on "parquet-0.14.1" and attempt to build the project. {code:java} Compiling parquet v0.13.0 error[E0599]: no variant or associated item named `BOOLEAN` found for type `parquet_format::parquet_format::Type` in the current scope --> /Users/agrove/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-0.13.0/src/basic.rs:408:28 | 408 | parquet::Type::BOOLEAN => Type::BOOLEAN, | ^^^ variant or associated item not found in `parquet_format::parquet_format::Type`{code} This bug has already been fixed in master, but there is no usable published crate. We could consider publishing a 0.14.2 to resolve this or just wait until the 0.15.0 release. We could also consider using this Jira to at least document a workaround, if one exists (maybe Cargo provides a mechanism for overriding transitive dependencies?). -- This message was sent by Atlassian JIRA (v7.6.14#76016)
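On the workaround question above: Cargo does provide a mechanism for overriding transitive dependencies, the `[patch]` section of the downstream project's manifest. A hedged sketch — the repository URL and revision are placeholders, not verified coordinates, and the patched source must still satisfy the version requirement declared by the parquet crate:

```toml
# Hypothetical workaround in the downstream project's Cargo.toml:
# replace the crates.io source of the transitive parquet-format
# dependency with a git checkout compatible with the published
# parquet crate (e.g. a revision from before the 2.5.0 break).
[patch.crates-io]
parquet-format = { git = "https://github.com/sunchao/parquet-format-rs", rev = "..." }
```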
[jira] [Created] (ARROW-6256) [Rust] parquet-format should be released by Apache process
Andy Grove created ARROW-6256: - Summary: [Rust] parquet-format should be released by Apache process Key: ARROW-6256 URL: https://issues.apache.org/jira/browse/ARROW-6256 Project: Apache Arrow Issue Type: Improvement Components: Rust Affects Versions: 0.14.1 Reporter: Andy Grove Fix For: 0.15.0 The Arrow parquet crate depends on the parquet-format crate. Parquet-format 2.5.0 was recently released and has breaking changes compared to 2.4.0. This means that previously published Arrow Parquet/DataFusion crates are now unusable out the box (see https://issues.apache.org/jira/browse/ARROW-6255). We should bring parquet-format into an Apache release process to avoid this type of issue in the future. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6257) [C++] Add fnmatch compatible globbing function
[ https://issues.apache.org/jira/browse/ARROW-6257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Kietzman updated ARROW-6257: - Description: This will be useful for the filesystems module and in datasource discovery, which uses it. Behavior should be compatible with http://pubs.opengroup.org/onlinepubs/95399/functions/fnmatch.html was:This will be useful for the filesystems module and in datasource discovery, which uses it > [C++] Add fnmatch compatible globbing function > -- > > Key: ARROW-6257 > URL: https://issues.apache.org/jira/browse/ARROW-6257 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Benjamin Kietzman >Assignee: Benjamin Kietzman >Priority: Major > > This will be useful for the filesystems module and in datasource discovery, > which uses it. > Behavior should be compatible with > http://pubs.opengroup.org/onlinepubs/95399/functions/fnmatch.html -- This message was sent by Atlassian JIRA (v7.6.14#76016)
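Python's standard-library `fnmatch` module implements the same core pattern syntax as fnmatch(3) — `*`, `?`, and `[...]` character classes — which makes it a convenient reference point for the expected matching behavior (note it does not implement flags such as FNM_PATHNAME, so `*` also matches `/`):

```python
from fnmatch import fnmatchcase

# Case-sensitive glob matching, as POSIX fnmatch does by default.
names = ["a.parquet", "b.csv", "part-01.parquet"]
matches = [name for name in names if fnmatchcase(name, "*.parquet")]
```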
[jira] [Created] (ARROW-6257) [C++] Add fnmatch compatible globbing function
Benjamin Kietzman created ARROW-6257: Summary: [C++] Add fnmatch compatible globbing function Key: ARROW-6257 URL: https://issues.apache.org/jira/browse/ARROW-6257 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Benjamin Kietzman Assignee: Benjamin Kietzman This will be useful for the filesystems module and in datasource discovery, which uses it -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-6139) [Documentation][R] Build R docs (pkgdown) site and add to arrow-site
[ https://issues.apache.org/jira/browse/ARROW-6139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-6139. Resolution: Fixed Fix Version/s: 0.15.0 > [Documentation][R] Build R docs (pkgdown) site and add to arrow-site > > > Key: ARROW-6139 > URL: https://issues.apache.org/jira/browse/ARROW-6139 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, R, Website >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > Now that the R package is up on CRAN, we should publish the documentation > site. We should get this up before we publish the blog post (ARROW-6041) so > that we can link to it in the post. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6139) [Documentation][R] Build R docs (pkgdown) site and add to arrow-site
[ https://issues.apache.org/jira/browse/ARROW-6139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-6139: --- Labels: (was: pull-request-available) > [Documentation][R] Build R docs (pkgdown) site and add to arrow-site > > > Key: ARROW-6139 > URL: https://issues.apache.org/jira/browse/ARROW-6139 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, R, Website >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Fix For: 0.15.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > Now that the R package is up on CRAN, we should publish the documentation > site. We should get this up before we publish the blog post (ARROW-6041) so > that we can link to it in the post. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6151) [R] See if possible to generate r/inst/NOTICE.txt rather than duplicate information
[ https://issues.apache.org/jira/browse/ARROW-6151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908474#comment-16908474 ] Neal Richardson commented on ARROW-6151: Any further thoughts [~wesmckinn] or can we close this? > [R] See if possible to generate r/inst/NOTICE.txt rather than duplicate > information > --- > > Key: ARROW-6151 > URL: https://issues.apache.org/jira/browse/ARROW-6151 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Wes McKinney >Priority: Major > > I noticed this file -- I am concerned about its maintainability. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-4316) Reusing arrow.so for both Python and R
[ https://issues.apache.org/jira/browse/ARROW-4316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-4316. Resolution: Duplicate https://issues.apache.org/jira/browse/ARROW-5956 looks to be a more contemporary request of the same thing, so closing in favor of that one. > Reusing arrow.so for both Python and R > -- > > Key: ARROW-4316 > URL: https://issues.apache.org/jira/browse/ARROW-4316 > Project: Apache Arrow > Issue Type: Bug > Components: Python, R >Affects Versions: 0.12.0 > Environment: Ubuntu 16.04, R 3.4.4, pyarrow 0.12, cmake 3.12 >Reporter: Jeffrey Wong >Priority: Major > > My team uses both pyarrow and R arrow, we'd like both libraries to link to > the same arrow.so file for consistency. pyarrow ships both arrow.so and > parquet.so, if I can reuse those .so's to link R that would guarantee > consistency. > Under arrow v0.11.1 I was able to link R against libarrow.so found under > pyarrow by passing LIB_DIR to the R [configure > file|https://github.com/apache/arrow/blob/master/r/configure]. However, in > v0.12.0 I am no longer able to do that. Here is a reproducible example on > Ubuntu 16.04 which produces the error: > > {code:java} > sh: line 1: 5404 Segmentation fault (core dumped) '/usr/lib/R/bin/R' > --no-save --slave 2>&1 < '/tmp/RtmpyOuz4g/file14716feda8fc' > *** caught segfault *** > address 0x7f160f026250, cause 'invalid permissions' > An irrecoverable exception occurred. R is aborting now ... 
> {code} > > Reproducible example: > {code:java} > # get the parquet headers which are not shipped with pyarrow > > tee /etc/apt/sources.list.d/apache-arrow.list < deb [arch=amd64] https://dl.bintray.com/apache/arrow/$(lsb_release --id > --short | tr 'A-Z' 'a-z')/ $(lsb_release --codename --short) main > deb-src [] https://dl.bintray.com/apache/arrow/$(lsb_release --id --short | > tr 'A-Z' 'a-z')/ $(lsb_release --codename --short) main > APT_LINE > apt-get update > mkdir /tmp/arrow_headers; cd /tmp/arrow_headers > apt-get download --allow-unauthenticated libparquet-dev > ar -x libparquet-dev_0.12.0-1_amd64.deb > tar -xJvf data.tar.xz > > #get pyarrow v0.12 > > pip3 install pyarrow --upgrade > #figure out where pyarrow is > PY_ARROW_PATH=$(python3 -c "import pyarrow, os; > print(os.path.dirname(pyarrow.__file__))") > PY_ARROW_VERSION=$(python3 -c "import pyarrow; print(pyarrow.__version__)") > PYTHON_LIBDIR=$(python3 -c "import sysconfig; > print(sysconfig.get_config_var('LIBDIR'))") > > # pyarrow doesn't ship parquet headers. Copy the ones from apt into the > pyarrow dir > mkdir $PY_ARROW_PATH/include/parquet > cp -r /tmp/arrow_headers/usr/include/parquet/* > $PY_ARROW_PATH/include/parquet/ > > #install R arrow > echo "export > LD_LIBRARY_PATH=\"\${LD_LIBRARY_PATH}:${PYTHON_LIBDIR}:${PY_ARROW_PATH}\"" | > tee -a /usr/lib/R/etc/ldpaths > git clone https://github.com/apache/arrow.git /tmp/arrow > cd /tmp/arrow/r > git checkout "apache-arrow-${PY_ARROW_VERSION}" > sed -i "/Depends: R/c\Depends: R (>= 3.4)" DESCRIPTION > sed -i "s/PKG_CXXFLAGS=/PKG_CXXFLAGS=-D_GLIBCXX_USE_CXX11_ABI=0 /g" > src/Makevars.in > R CMD INSTALL ./ --configure-vars="INCLUDE_DIR=$PY_ARROW_PATH/include > LIB_DIR=$PY_ARROW_PATH" {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6182) [R] Add note to README about r-arrow conda installation
[ https://issues.apache.org/jira/browse/ARROW-6182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-6182: --- Summary: [R] Add note to README about r-arrow conda installation (was: [R] Package fails to load with error `CXXABI_1.3.11' not found ) > [R] Add note to README about r-arrow conda installation > > > Key: ARROW-6182 > URL: https://issues.apache.org/jira/browse/ARROW-6182 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.14.1 > Environment: Ubuntu 16.04.6 >Reporter: Ian Cook >Priority: Major > > I'm able to successfully install the C++ and Python libraries from > conda-forge, then successfully install the R package from CRAN if I use > {{--no-test-load}}. But after installation, the R package fails to load > because {{dyn.load("arrow.so")}} fails. It throws this error when loading: > {code:java} > unable to load shared object '~/R/arrow/libs/arrow.so': > /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `CXXABI_1.3.11' not found > (required by ~/.conda/envs/python3.6/lib/libarrow.so.14) > {code} > Do the Arrow C++ libraries actually require GCC 7.1.0 / CXXABI_1.3.11? If > not, what might explain this error message? Thanks. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
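To answer the question at the end of that report: `CXXABI_1.3.11` is the libstdc++ symbol version node first shipped with GCC 7.1, so the conda-forge `libarrow.so.14` was built against a newer libstdc++ than the system one on Ubuntu 16.04. The sketch below encodes that mapping; the version values are transcribed from memory of the GNU libstdc++ ABI policy table and should be verified against that table before relying on them.

```python
# Sketch: map a CXXABI version node (as seen in "version `CXXABI_X' not
# found" errors) to the GCC release whose libstdc++ first provides it.
# Assumption: values transcribed from the GNU libstdc++ ABI history table.
CXXABI_TO_GCC = {
    "CXXABI_1.3.9": "5.1.0",
    "CXXABI_1.3.10": "6.1.0",
    "CXXABI_1.3.11": "7.1.0",
}

def min_gcc_for(cxxabi_node: str) -> str:
    """Return the first GCC release whose libstdc++ exports this node."""
    return CXXABI_TO_GCC[cxxabi_node]

# The node from the error message above:
print(min_gcc_for("CXXABI_1.3.11"))  # -> 7.1.0
```

So the failure is not about what Arrow requires at the source level; the conda-forge binary simply expects a libstdc++ at least as new as GCC 7.1's, while `/usr/lib/x86_64-linux-gnu/libstdc++.so.6` on Ubuntu 16.04 predates it. Running `strings` on the system libstdc++ and grepping for `CXXABI` shows which nodes it actually provides; using the conda-provided toolchain and R (hence the r-arrow README note) sidesteps the mismatch.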
[jira] [Assigned] (ARROW-6170) [R] "docker-compose build r" is slow
[ https://issues.apache.org/jira/browse/ARROW-6170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-6170: -- Assignee: Neal Richardson > [R] "docker-compose build r" is slow > > > Key: ARROW-6170 > URL: https://issues.apache.org/jira/browse/ARROW-6170 > Project: Apache Arrow > Issue Type: Bug > Components: Developer Tools, R >Reporter: Antoine Pitrou >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Time Spent: 2h > Remaining Estimate: 0h > > Apparently it installs and compiles all packages in single-thread mode. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (ARROW-6170) [R] "docker-compose build r" is slow
[ https://issues.apache.org/jira/browse/ARROW-6170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-6170: -- Assignee: Antoine Pitrou (was: Neal Richardson) > [R] "docker-compose build r" is slow > > > Key: ARROW-6170 > URL: https://issues.apache.org/jira/browse/ARROW-6170 > Project: Apache Arrow > Issue Type: Bug > Components: Developer Tools, R >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Time Spent: 2h > Remaining Estimate: 0h > > Apparently it installs and compiles all packages in single-thread mode. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (ARROW-5134) [R][CI] Run nightly tests against multiple R versions
[ https://issues.apache.org/jira/browse/ARROW-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-5134: -- Assignee: Neal Richardson > [R][CI] Run nightly tests against multiple R versions > - > > Key: ARROW-5134 > URL: https://issues.apache.org/jira/browse/ARROW-5134 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Krisztian Szucs >Assignee: Neal Richardson >Priority: Major > Fix For: 1.0.0 > > > This requires fixing the docker-compose build of R, which is failing > currently: > https://travis-ci.org/kszucs/crossbow/builds/508343597 > Reproducible locally with the commands: > {code} > docker-compose build cpp > docker-compose build r > docker-compose run r > {code} > Then introduce an {{R_VERSION}} build argument to the dockerfile, similarly > to how > the python docker-compose defines and uses {{PYTHON_VERSION}}, see: > - https://github.com/apache/arrow/blob/master/python/Dockerfile#L21 > - https://github.com/apache/arrow/blob/master/docker-compose.yml#L247-L259 > Then add it to the nightly builds, similarly to python: > - https://github.com/apache/arrow/blob/master/dev/tasks/tests.yml#L29-L31 > - https://github.com/apache/arrow/blob/master/dev/tasks/tests.yml#L153-L184 > There is already a {{docker-r}} definition; the only difference is to export > an > {{R_VERSION}} environment variable. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5956) [R] Ability for R to link to C++ libraries from pyarrow Wheel
[ https://issues.apache.org/jira/browse/ARROW-5956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908480#comment-16908480 ] Neal Richardson commented on ARROW-5956: [~jeffreyw] could you try setting {{R_LD_LIBRARY_PATH}} instead of {{LD_LIBRARY_PATH}}? [https://github.com/apache/arrow/blob/master/r/README.Rmd#L132] (For context, see discussion starting here: [https://github.com/apache/arrow/pull/5036#issuecomment-519703937]) > [R] Ability for R to link to C++ libraries from pyarrow Wheel > - > > Key: ARROW-5956 > URL: https://issues.apache.org/jira/browse/ARROW-5956 > Project: Apache Arrow > Issue Type: New Feature > Components: R > Environment: Ubuntu 16.04, R 3.4.4, python 3.6.5 >Reporter: Jeffrey Wong >Priority: Major > > I have installed pyarrow 0.14.0 and want to be able to also use R arrow. In > my work I use rpy2 a lot to exchange python data structures with R data > structures, so I would like R arrow to link against the exact same .so files > found in pyarrow. > > > When I pass in include_dir and lib_dir to R's configure, pointing to > pyarrow's include and pyarrow's root directories, I am able to compile R's > arrow.so file. 
However, I am unable to load it in an R session, getting the > error: > > {code:java} > > dyn.load('arrow.so') > Error in dyn.load("arrow.so") : > unable to load shared object '/tmp/arrow2/r/src/arrow.so': > /tmp/arrow2/r/src/arrow.so: undefined symbol: > _ZNK5arrow11StructArray14GetFieldByNameERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE{code} > > > Steps to reproduce: > > Install pyarrow, which also ships libarrow.so and libparquet.so > > {code:java} > pip3 install pyarrow --upgrade --user > PY_ARROW_PATH=$(python3 -c "import pyarrow, os; > print(os.path.dirname(pyarrow.__file__))") > PY_ARROW_VERSION=$(python3 -c "import pyarrow; print(pyarrow.__version__)") > ln -s $PY_ARROW_PATH/libarrow.so.14 $PY_ARROW_PATH/libarrow.so > ln -s $PY_ARROW_PATH/libparquet.so.14 $PY_ARROW_PATH/libparquet.so > {code} > > > Add to LD_LIBRARY_PATH > > {code:java} > sudo tee -a /usr/lib/R/etc/ldpaths <<LINES > LD_LIBRARY_PATH="\${LD_LIBRARY_PATH}:$PY_ARROW_PATH" > export LD_LIBRARY_PATH > LINES > sudo tee -a /usr/lib/rstudio-server/bin/r-ldpath <<LINES > LD_LIBRARY_PATH="\${LD_LIBRARY_PATH}:$PY_ARROW_PATH" > export LD_LIBRARY_PATH > LINES > export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:$PY_ARROW_PATH" > {code} > > > Install r arrow from source > {code:java} > git clone https://github.com/apache/arrow.git /tmp/arrow2 > cd /tmp/arrow2/r > git checkout tags/apache-arrow-0.14.0 > R CMD INSTALL ./ --configure-vars="INCLUDE_DIR=$PY_ARROW_PATH/include > LIB_DIR=$PY_ARROW_PATH"{code} > > I have noticed that the R package for arrow no longer has an RcppExports, but > instead an arrowExports. Could it be that the lack of RcppExports has made it > difficult to find GetFieldByName? -- This message was sent by Atlassian JIRA (v7.6.14#76016)
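The mangled name in that `dyn.load()` error is itself a clue: the `__cxx11` component comes from the inline namespace `std::__cxx11`, which marks symbols compiled against the new (GCC 5+) `std::string` ABI. The earlier ARROW-4316 recipe compiled the R bindings with `-D_GLIBCXX_USE_CXX11_ABI=0`; if the R-side `arrow.so` and the pyarrow-shipped `libarrow.so` disagree on that flag, each references `std::string` symbols the other does not export. A small sketch of that heuristic (the helper is hypothetical, not part of any Arrow API):

```python
# Sketch: classify a GCC-mangled symbol by std::string ABI.
# "__cxx11" in the mangling (from the inline namespace std::__cxx11)
# marks the new ABI selected by -D_GLIBCXX_USE_CXX11_ABI=1 (GCC 5+).
def uses_cxx11_string_abi(mangled: str) -> bool:
    return "__cxx11" in mangled

# The undefined symbol from the error above:
sym = ("_ZNK5arrow11StructArray14GetFieldByNameERKNSt7__cxx1112"
       "basic_stringIcSt11char_traitsIcESaIcEEE")
print(uses_cxx11_string_abi(sym))  # -> True
```

To check in practice, `nm -D <lib> | c++filt | grep GetFieldByName` on both shared objects shows which ABI each side exports; building the R package with the same `_GLIBCXX_USE_CXX11_ABI` setting as the wheel was built with is the usual fix for this class of undefined-symbol error.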
[jira] [Created] (ARROW-6258) [R] Add macOS build scripts
Neal Richardson created ARROW-6258: -- Summary: [R] Add macOS build scripts Key: ARROW-6258 URL: https://issues.apache.org/jira/browse/ARROW-6258 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Neal Richardson Assignee: Neal Richardson CRAN builds binary packages for Windows and macOS. It generally does this by building on its servers and bundling all dependencies in the R package. This has been accomplished by having separate processes for building and hosting system dependencies, and then downloading and bundling those with scripts that get executed at install time (and then create the binary package as a side effect). ARROW-3758 added the Windows PKGBUILD and related packaging scripts and ran them on our Appveyor. This ticket is to do the same for the macOS scripts. The purpose of these tickets is to bring the whole build pipeline under our version control and CI so that we can address any C++ build and dependency changes as they arise and not be surprised when it comes time to cut a release. A side benefit is that they also enable us to offer a nightly binary package repository with minimal additional effort. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-6186) [Packaging][C++] Plasma headers not included for ubuntu-xenial libplasma-dev debian package
[ https://issues.apache.org/jira/browse/ARROW-6186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sutou Kouhei resolved ARROW-6186. - Resolution: Fixed Fix Version/s: 0.15.0 Issue resolved by pull request 5050 [https://github.com/apache/arrow/pull/5050] > [Packaging][C++] Plasma headers not included for ubuntu-xenial libplasma-dev > debian package > --- > > Key: ARROW-6186 > URL: https://issues.apache.org/jira/browse/ARROW-6186 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Plasma, Packaging >Affects Versions: 0.14.1 >Reporter: Wannes G >Assignee: Sutou Kouhei >Priority: Major > Labels: debian, packaging, pull-request-available > Fix For: 0.15.0 > > Time Spent: 10m > Remaining Estimate: 0h > > See > [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian.ubuntu-xenial/libplasma-dev.install] > Issue is still present on latest master branch, the debian install script is > correct: > [https://github.com/kou/arrow/blob/master/dev/tasks/linux-packages/debian/libplasma-dev.install] > The first line is missing from the ubuntu install script causing no headers > to be installed when apt-get is used to install libplasma-dev. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6259) [C++][CI] Flatbuffers-related failures in CI on macOS
Wes McKinney created ARROW-6259: --- Summary: [C++][CI] Flatbuffers-related failures in CI on macOS Key: ARROW-6259 URL: https://issues.apache.org/jira/browse/ARROW-6259 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Wes McKinney Fix For: 0.15.0 This seemingly has just started happening randomly today https://travis-ci.org/apache/arrow/jobs/572381802#L2864 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-6204) [GLib] Add garrow_array_is_in_chunked_array()
[ https://issues.apache.org/jira/browse/ARROW-6204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sutou Kouhei resolved ARROW-6204. - Resolution: Fixed Fix Version/s: 0.15.0 Issue resolved by pull request 5086 [https://github.com/apache/arrow/pull/5086] > [GLib] Add garrow_array_is_in_chunked_array() > - > > Key: ARROW-6204 > URL: https://issues.apache.org/jira/browse/ARROW-6204 > Project: Apache Arrow > Issue Type: New Feature > Components: GLib >Reporter: Yosuke Shiro >Assignee: Yosuke Shiro >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > This is follow-up of > [https://github.com/apache/arrow/pull/5047#issuecomment-520103706]. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6259) [C++][CI] Flatbuffers-related failures in CI on macOS
[ https://issues.apache.org/jira/browse/ARROW-6259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908490#comment-16908490 ] Wes McKinney commented on ARROW-6259: - Comparing * failure https://api.travis-ci.org/v3/job/572381802/log.txt * success (1 commit prior) https://api.travis-ci.org/v3/job/572286191/log.txt it appears that the conda toolchain upgraded from clang 4.0.1 to clang 8.0.0 > [C++][CI] Flatbuffers-related failures in CI on macOS > - > > Key: ARROW-6259 > URL: https://issues.apache.org/jira/browse/ARROW-6259 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Priority: Blocker > Fix For: 0.15.0 > > > This seemingly has just started happening randomly today > https://travis-ci.org/apache/arrow/jobs/572381802#L2864 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6259) [C++][CI] Flatbuffers-related failures in CI on macOS
[ https://issues.apache.org/jira/browse/ARROW-6259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908493#comment-16908493 ] Wes McKinney commented on ARROW-6259: - conda-forge confirms the compiler switch occurred this afternoon https://gitter.im/conda-forge/conda-forge.github.io?at=5d55d1e0beba830fff9ce0b3 probably we'll have to suppress the compiler warning > [C++][CI] Flatbuffers-related failures in CI on macOS > - > > Key: ARROW-6259 > URL: https://issues.apache.org/jira/browse/ARROW-6259 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Priority: Blocker > Fix For: 0.15.0 > > > This seemingly has just started happening randomly today > https://travis-ci.org/apache/arrow/jobs/572381802#L2864 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (ARROW-6259) [C++][CI] Flatbuffers-related failures in CI on macOS
[ https://issues.apache.org/jira/browse/ARROW-6259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-6259: --- Assignee: Wes McKinney > [C++][CI] Flatbuffers-related failures in CI on macOS > - > > Key: ARROW-6259 > URL: https://issues.apache.org/jira/browse/ARROW-6259 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Blocker > Fix For: 0.15.0 > > > This seemingly has just started happening randomly today > https://travis-ci.org/apache/arrow/jobs/572381802#L2864 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6258) [R] Add macOS build scripts
[ https://issues.apache.org/jira/browse/ARROW-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6258: -- Labels: pull-request-available (was: ) > [R] Add macOS build scripts > --- > > Key: ARROW-6258 > URL: https://issues.apache.org/jira/browse/ARROW-6258 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > > CRAN builds binary packages for Windows and macOS. It generally does this by > building on its servers and bundling all dependencies in the R package. This > has been accomplished by having separate processes for building and hosting > system dependencies, and then downloading and bundling those with scripts > that get executed at install time (and then create the binary package as a > side effect). > ARROW-3758 added the Windows PKGBUILD and related packaging scripts and ran > them on our Appveyor. This ticket is to do the same for the macOS scripts. > The purpose of these tickets is to bring the whole build pipeline under our > version control and CI so that we can address any C++ build and dependency > changes as they arise and not be surprised when it comes time to cut a > release. A side benefit is that they also enable us to offer a nightly binary > package repository with minimal additional effort. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6259) [C++][CI] Flatbuffers-related failures in CI on macOS
[ https://issues.apache.org/jira/browse/ARROW-6259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908497#comment-16908497 ] Wes McKinney commented on ARROW-6259: - Reported upstream to Flatbuffers https://github.com/google/flatbuffers/issues/5482 > [C++][CI] Flatbuffers-related failures in CI on macOS > - > > Key: ARROW-6259 > URL: https://issues.apache.org/jira/browse/ARROW-6259 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Blocker > Fix For: 0.15.0 > > > This seemingly has just started happening randomly today > https://travis-ci.org/apache/arrow/jobs/572381802#L2864 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6260) [Website] Use deploy key on Travis to build and push to asf-site
Neal Richardson created ARROW-6260: -- Summary: [Website] Use deploy key on Travis to build and push to asf-site Key: ARROW-6260 URL: https://issues.apache.org/jira/browse/ARROW-6260 Project: Apache Arrow Issue Type: Improvement Components: Website Reporter: Neal Richardson Assignee: Neal Richardson ARROW-4473 added CI/CD for the website, but there was some discomfort about having a committer provide a GitHub personal access token to do the pushing of the built site to the asf-site branch. Investigate using GitHub Deploy Keys instead, which are scoped to a single repository, not all public repositories that a user has access to. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6259) [C++][CI] Flatbuffers-related failures in CI on macOS
[ https://issues.apache.org/jira/browse/ARROW-6259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6259: -- Labels: pull-request-available (was: ) > [C++][CI] Flatbuffers-related failures in CI on macOS > - > > Key: ARROW-6259 > URL: https://issues.apache.org/jira/browse/ARROW-6259 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Blocker > Labels: pull-request-available > Fix For: 0.15.0 > > > This seemingly has just started happening randomly today > https://travis-ci.org/apache/arrow/jobs/572381802#L2864 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-6170) [R] "docker-compose build r" is slow
[ https://issues.apache.org/jira/browse/ARROW-6170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sutou Kouhei resolved ARROW-6170. - Resolution: Fixed Fix Version/s: 0.15.0 Issue resolved by pull request 5039 [https://github.com/apache/arrow/pull/5039] > [R] "docker-compose build r" is slow > > > Key: ARROW-6170 > URL: https://issues.apache.org/jira/browse/ARROW-6170 > Project: Apache Arrow > Issue Type: Bug > Components: Developer Tools, R >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > Apparently it installs and compiles all packages in single-thread mode. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6119) [Python] PyArrow import fails on Windows Python 3.7
[ https://issues.apache.org/jira/browse/ARROW-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-6119: Fix Version/s: 0.15.0 > [Python] PyArrow import fails on Windows Python 3.7 > --- > > Key: ARROW-6119 > URL: https://issues.apache.org/jira/browse/ARROW-6119 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.14.0 > Environment: Windows, Python 3.7 >Reporter: Paul Suganthan >Priority: Major > Fix For: 0.15.0 > > > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in <module> > > from pyarrow.lib import cpu_count, set_cpu_count > ImportError: DLL load failed: The specified procedure could not be found. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5980) [Python] Missing libarrow.so and libarrow_python.so in wheel file
[ https://issues.apache.org/jira/browse/ARROW-5980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908539#comment-16908539 ] Wes McKinney commented on ARROW-5980: - setuptools does not understand symlinks during the wheel build -- previously the shared libraries were being duplicated inside the wheel instead of symlinked. If you can resolve the issue without duplicating the shared libraries, please submit a PR > [Python] Missing libarrow.so and libarrow_python.so in wheel file > - > > Key: ARROW-5980 > URL: https://issues.apache.org/jira/browse/ARROW-5980 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.14.0 >Reporter: Haowei Yu >Priority: Major > Labels: wheel > > I have installed pyarrow 0.14.0, but it seems that by default you did not > provide symlinks for libarrow.so and libarrow_python.so. Only .so files with a > version suffix are provided. Hence, I cannot use the output of > pyarrow.get_libraries() and pyarrow.get_library_dirs() to build my link > option. > If you provide symlinks, I can pass the following to the linker to specify the > library to link. e.g. g++ -L/ -larrow -larrow_python > However, right now, the ld output complains about not being able to find -larrow and > -larrow_python -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-5980) [Python] Missing libarrow.so and libarrow_python.so in wheel file
[ https://issues.apache.org/jira/browse/ARROW-5980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908540#comment-16908540 ] Wes McKinney commented on ARROW-5980: - In the meantime I would suggest developing against the conda packages, which don't have this issue > [Python] Missing libarrow.so and libarrow_python.so in wheel file > - > > Key: ARROW-5980 > URL: https://issues.apache.org/jira/browse/ARROW-5980 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.14.0 >Reporter: Haowei Yu >Priority: Major > Labels: wheel > > I have installed pyarrow 0.14.0, but it seems that by default you did not > provide symlinks for libarrow.so and libarrow_python.so. Only .so files with a > version suffix are provided. Hence, I cannot use the output of > pyarrow.get_libraries() and pyarrow.get_library_dirs() to build my link > option. > If you provide symlinks, I can pass the following to the linker to specify the > library to link. e.g. g++ -L/ -larrow -larrow_python > However, right now, the ld output complains about not being able to find -larrow and > -larrow_python -- This message was sent by Atlassian JIRA (v7.6.14#76016)
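Until the wheel ships unversioned names, the manual `ln -s` workaround from the ARROW-5956 report can be automated. A minimal sketch (the helper name is made up; it just recreates `libX.so -> libX.so.N` links next to the versioned files, and the `pyarrow.get_library_dirs()` usage at the end assumes pyarrow is installed):

```python
import os
import re

# Sketch: create unversioned "libfoo.so" symlinks next to versioned
# "libfoo.so.N" files, mimicking `ln -s libarrow.so.14 libarrow.so`.
# Assumes a single numeric suffix, as in the 0.14 wheels; hypothetical
# helper, not a pyarrow API (pyarrow later grew similar functionality).
def create_unversioned_symlinks(lib_dir: str) -> list:
    created = []
    for name in os.listdir(lib_dir):
        m = re.match(r"(lib\w+\.so)\.\d+$", name)
        if m:
            link = os.path.join(lib_dir, m.group(1))
            if not os.path.exists(link):
                os.symlink(name, link)  # relative target, like `ln -s`
                created.append(link)
    return sorted(created)

# Typical use against an installed wheel (assumption: pyarrow importable):
#   import pyarrow
#   for d in pyarrow.get_library_dirs():
#       create_unversioned_symlinks(d)
```

With the links in place, `g++ -L<libdir> -larrow -larrow_python` can resolve the libraries, since the linker's `-l` lookup only searches the unversioned `lib*.so` names.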
[jira] [Assigned] (ARROW-3243) [C++] Upgrade jemalloc to version 5
[ https://issues.apache.org/jira/browse/ARROW-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-3243: --- Assignee: Antoine Pitrou > [C++] Upgrade jemalloc to version 5 > --- > > Key: ARROW-3243 > URL: https://issues.apache.org/jira/browse/ARROW-3243 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Philipp Moritz >Assignee: Antoine Pitrou >Priority: Major > > Is it possible/feasible to upgrade jemalloc to version 5 and assume that > version? I'm asking because I've been working towards replacing dlmalloc in > plasma with jemalloc, which makes some of the code much nicer and removes > some of the issues we had with dlmalloc, but it requires jemalloc APIs that > are only available starting from jemalloc version 5, in particular, I'm using > the extent_hooks_t capability. > For now I can submit a patch that uses a different version of jemalloc in > plasma and then we can figure out how to deal with it (maybe there is a way > to make it work with older versions). What are your thoughts? -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-3243) [C++] Upgrade jemalloc to version 5
[ https://issues.apache.org/jira/browse/ARROW-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908542#comment-16908542 ] Wes McKinney commented on ARROW-3243: - See https://github.com/apache/arrow/commit/f913d8f0adff71c288a10f6c1b0ad2d1ab3e9e32 > [C++] Upgrade jemalloc to version 5 > --- > > Key: ARROW-3243 > URL: https://issues.apache.org/jira/browse/ARROW-3243 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Philipp Moritz >Priority: Major > > Is it possible/feasible to upgrade jemalloc to version 5 and assume that > version? I'm asking because I've been working towards replacing dlmalloc in > plasma with jemalloc, which makes some of the code much nicer and removes > some of the issues we had with dlmalloc, but it requires jemalloc APIs that > are only available starting from jemalloc version 5, in particular, I'm using > the extent_hooks_t capability. > For now I can submit a patch that uses a different version of jemalloc in > plasma and then we can figure out how to deal with it (maybe there is a way > to make it work with older versions). What are your thoughts? -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-3243) [C++] Upgrade jemalloc to version 5
[ https://issues.apache.org/jira/browse/ARROW-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-3243. - Resolution: Fixed We're using jemalloc 5.2.0 now > [C++] Upgrade jemalloc to version 5 > --- > > Key: ARROW-3243 > URL: https://issues.apache.org/jira/browse/ARROW-3243 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Philipp Moritz >Priority: Major > > Is it possible/feasible to upgrade jemalloc to version 5 and assume that > version? I'm asking because I've been working towards replacing dlmalloc in > plasma with jemalloc, which makes some of the code much nicer and removes > some of the issues we had with dlmalloc, but it requires jemalloc APIs that > are only available starting from jemalloc version 5, in particular, I'm using > the extent_hooks_t capability. > For now I can submit a patch that uses a different version of jemalloc in > plasma and then we can figure out how to deal with it (maybe there is a way > to make it work with older versions). What are your thoughts? -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6015) [Python] pyarrow: `DLL load failed` when importing on windows
[ https://issues.apache.org/jira/browse/ARROW-6015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-6015: Fix Version/s: 0.15.0 > [Python] pyarrow: `DLL load failed` when importing on windows > -- > > Key: ARROW-6015 > URL: https://issues.apache.org/jira/browse/ARROW-6015 > Project: Apache Arrow > Issue Type: Improvement >Affects Versions: 0.14.1 >Reporter: Ruslan Kuprieiev >Priority: Major > Fix For: 0.15.0 > > > When installing pyarrow 0.14.1 on windows 10 x64 with python 3.7, you get: > >>> import pyarrow > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "C:\Python37\lib\site-packages\pyarrow\__init__.py", line 49, in <module> > > from pyarrow.lib import cpu_count, set_cpu_count > ImportError: DLL load failed: The specified module could not be found. > On 0.14.0 everything works fine. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Closed] (ARROW-4844) Static libarrow is missing vendored libdouble-conversion
[ https://issues.apache.org/jira/browse/ARROW-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-4844. --- Resolution: Not A Problem Assignee: (was: Uwe L. Korn) If I'm not mistaken this issue is not causing problems anymore > Static libarrow is missing vendored libdouble-conversion > > > Key: ARROW-4844 > URL: https://issues.apache.org/jira/browse/ARROW-4844 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.12.1 >Reporter: Jeroen >Priority: Major > > When trying to statically link the R bindings to libarrow.a, I get linking > errors which suggest that libdouble-conversion.a was not properly embedded in > libarrow.a. This problem happens on both MacOS and Windows. > Here is the arrow build log: > https://ci.appveyor.com/project/jeroen/rtools-packages/builds/23015303/job/mtgl6rvfde502iu7 > {code} > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(cast.cc.obj):(.text+0x1c77c): > undefined reference to > `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, > int*) const' > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x5fda): > undefined reference to > `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, > int*) const' > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6097): > undefined reference to > `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, > int*) const' > 
C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6589): > undefined reference to > `double_conversion::StringToDoubleConverter::StringToFloat(char const*, int, > int*) const' > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6647): > undefined reference to > `double_conversion::StringToDoubleConverter::StringToFloat(char const*, int, > int*) const' > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-4844) Static libarrow is missing vendored libdouble-conversion
[ https://issues.apache.org/jira/browse/ARROW-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908545#comment-16908545 ] Jeroen commented on ARROW-4844: --- I'm working around it now by linking an external libdouble-conversion rather than the vendored one. > Static libarrow is missing vendored libdouble-conversion > > > Key: ARROW-4844 > URL: https://issues.apache.org/jira/browse/ARROW-4844 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.12.1 >Reporter: Jeroen >Priority: Major > > When trying to statically link the R bindings to libarrow.a, I get linking > errors which suggest that libdouble-conversion.a was not properly embedded in > libarrow.a. This problem happens on both MacOS and Windows. > Here is the arrow build log: > https://ci.appveyor.com/project/jeroen/rtools-packages/builds/23015303/job/mtgl6rvfde502iu7 > {code} > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(cast.cc.obj):(.text+0x1c77c): > undefined reference to > `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, > int*) const' > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x5fda): > undefined reference to > `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, > int*) const' > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6097): > undefined reference to > `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, > int*) const' > 
C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6589): > undefined reference to > `double_conversion::StringToDoubleConverter::StringToFloat(char const*, int, > int*) const' > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6647): > undefined reference to > `double_conversion::StringToDoubleConverter::StringToFloat(char const*, int, > int*) const' > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-4844) Static libarrow is missing vendored libdouble-conversion
[ https://issues.apache.org/jira/browse/ARROW-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908546#comment-16908546 ] Wes McKinney commented on ARROW-4844: - If you're statically linking that's the correct approach for right now. Shipping a complete vendored library toolchain is probably a fairly extensive project, so a volunteer is free to take up that work in the future. > Static libarrow is missing vendored libdouble-conversion > > > Key: ARROW-4844 > URL: https://issues.apache.org/jira/browse/ARROW-4844 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.12.1 >Reporter: Jeroen >Priority: Major > > When trying to statically link the R bindings to libarrow.a, I get linking > errors which suggest that libdouble-conversion.a was not properly embedded in > libarrow.a. This problem happens on both MacOS and Windows. > Here is the arrow build log: > https://ci.appveyor.com/project/jeroen/rtools-packages/builds/23015303/job/mtgl6rvfde502iu7 > {code} > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(cast.cc.obj):(.text+0x1c77c): > undefined reference to > `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, > int*) const' > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x5fda): > undefined reference to > `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, > int*) const' > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6097): > undefined reference to > 
`double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, > int*) const' > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6589): > undefined reference to > `double_conversion::StringToDoubleConverter::StringToFloat(char const*, int, > int*) const' > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6647): > undefined reference to > `double_conversion::StringToDoubleConverter::StringToFloat(char const*, int, > int*) const' > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6261) [C++] Install any bundled components and add installed CMake or pkgconfig configuration to enable downstream linkers to utilize bundled libraries when statically linking
Wes McKinney created ARROW-6261: --- Summary: [C++] Install any bundled components and add installed CMake or pkgconfig configuration to enable downstream linkers to utilize bundled libraries when statically linking Key: ARROW-6261 URL: https://issues.apache.org/jira/browse/ARROW-6261 Project: Apache Arrow Issue Type: New Feature Components: C++ Reporter: Wes McKinney The objective of this change would be to make it easier for toolchain builders to ship bundled thirdparty libraries together with the Arrow libraries in case there is a particular library version that is only used when linking with {{libarrow.a}}. In theory configuration could be added to arrowTargets.cmake (or pkgconfig) to simplify static linking -- This message was sent by Atlassian JIRA (v7.6.14#76016)
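To make the goal concrete: once bundled static libraries are installed with a pkg-config entry, a downstream build could collect the full static link line from `pkg-config --libs --static arrow`. The helper and sample output below are illustrative only (there is no such `arrow.pc` content guaranteed today); the parsing itself uses only the standard library.

```python
import shlex

def parse_static_libs(pkgconfig_output: str):
    """Split pkg-config --libs --static style output into search dirs and libraries."""
    libdirs, libs = [], []
    for tok in shlex.split(pkgconfig_output):
        if tok.startswith("-L"):
            libdirs.append(tok[2:])
        elif tok.startswith("-l"):
            libs.append(tok[2:])
    return libdirs, libs

# Hypothetical output of: pkg-config --libs --static arrow
sample = "-L/usr/local/lib -larrow -ldouble-conversion -lthrift"
dirs, libs = parse_static_libs(sample)
print(dirs)  # ['/usr/local/lib']
print(libs)  # ['arrow', 'double-conversion', 'thrift']
```

A downstream Makefile or R `PKG_LIBS` line could then be assembled from `dirs` and `libs`, which is exactly the information static linkers are missing today.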
[jira] [Commented] (ARROW-4844) Static libarrow is missing vendored libdouble-conversion
[ https://issues.apache.org/jira/browse/ARROW-4844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908555#comment-16908555 ] Wes McKinney commented on ARROW-4844: - I opened https://issues.apache.org/jira/browse/ARROW-6261 to be the umbrella issue for the project > Static libarrow is missing vendored libdouble-conversion > > > Key: ARROW-4844 > URL: https://issues.apache.org/jira/browse/ARROW-4844 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.12.1 >Reporter: Jeroen >Priority: Major > > When trying to statically link the R bindings to libarrow.a, I get linking > errors which suggest that libdouble-conversion.a was not properly embedded in > libarrow.a. This problem happens on both MacOS and Windows. > Here is the arrow build log: > https://ci.appveyor.com/project/jeroen/rtools-packages/builds/23015303/job/mtgl6rvfde502iu7 > {code} > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(cast.cc.obj):(.text+0x1c77c): > undefined reference to > `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, > int*) const' > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x5fda): > undefined reference to > `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, > int*) const' > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6097): > undefined reference to > `double_conversion::StringToDoubleConverter::StringToDouble(char const*, int, > int*) const' > 
C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6589): > undefined reference to > `double_conversion::StringToDoubleConverter::StringToFloat(char const*, int, > int*) const' > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: > > C:/rtools40/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/8.3.0/../../../../lib/libarrow.a(converter.cc.obj):(.text+0x6647): > undefined reference to > `double_conversion::StringToDoubleConverter::StringToFloat(char const*, int, > int*) const' > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-6259) [C++][CI] Flatbuffers-related failures in CI on macOS
[ https://issues.apache.org/jira/browse/ARROW-6259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-6259. - Resolution: Fixed Issue resolved by pull request 5096 [https://github.com/apache/arrow/pull/5096] > [C++][CI] Flatbuffers-related failures in CI on macOS > - > > Key: ARROW-6259 > URL: https://issues.apache.org/jira/browse/ARROW-6259 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Blocker > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 40m > Remaining Estimate: 0h > > This seemingly has just started happening randomly today > https://travis-ci.org/apache/arrow/jobs/572381802#L2864 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6262) [Developer] Show JIRA issue before merging
Sutou Kouhei created ARROW-6262: --- Summary: [Developer] Show JIRA issue before merging Key: ARROW-6262 URL: https://issues.apache.org/jira/browse/ARROW-6262 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Sutou Kouhei Assignee: Sutou Kouhei It's useful to confirm whether the associated JIRA issue is correct. We didn't notice a wrongly associated JIRA issue until after merging the pull request https://github.com/apache/arrow/pull/5050 . -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6262) [Developer] Show JIRA issue before merging
[ https://issues.apache.org/jira/browse/ARROW-6262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-6262: -- Labels: pull-request-available (was: ) > [Developer] Show JIRA issue before merging > -- > > Key: ARROW-6262 > URL: https://issues.apache.org/jira/browse/ARROW-6262 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Sutou Kouhei >Assignee: Sutou Kouhei >Priority: Minor > Labels: pull-request-available > > It's useful to confirm whether the associated JIRA issue is correct. > > We didn't notice a wrongly associated JIRA issue until after merging the pull request > https://github.com/apache/arrow/pull/5050 . -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-6130) [Release] Use 0.15.0 as the next release
[ https://issues.apache.org/jira/browse/ARROW-6130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-6130. - Resolution: Fixed Issue resolved by pull request 5007 [https://github.com/apache/arrow/pull/5007] > [Release] Use 0.15.0 as the next release > > > Key: ARROW-6130 > URL: https://issues.apache.org/jira/browse/ARROW-6130 > Project: Apache Arrow > Issue Type: Improvement > Components: Packaging >Reporter: Sutou Kouhei >Assignee: Sutou Kouhei >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3
[ https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908622#comment-16908622 ] Wong Chung Hoi commented on ARROW-6058: --- Hi all, below is a simple piece of code to reproduce the issue using: s3fs==0.3.3 pyarrow==0.14.1 pandas==0.24.0 The file generated is roughly 170MB ``` import pandas as pd >>> import numpy as np >>> pd.DataFrame(np.random.randint(0, 1, (1000, 10)), columns=[str(i) >>> for i in range(10)]).to_parquet('s3://path/to/file.snappy.parquet') >>> pd.read_parquet('s3://path/to/file.snappy.parquet') ``` ``` Traceback (most recent call last): File "", line 1, in File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py", line 282, in read_parquet return impl.read(path, columns=columns, **kwargs) File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py", line 129, in read **kwargs).to_pandas() File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py", line 1216, in read_table use_pandas_metadata=use_pandas_metadata) File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py", line 216, in read use_threads=use_threads) File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (304272) than expected (979599) ``` > [Python][Parquet] Failure when reading Parquet file from S3 > > > Key: ARROW-6058 > URL: https://issues.apache.org/jira/browse/ARROW-6058 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.14.1 >Reporter: Siddharth >Priority: Major > Labels: parquet > > I am reading parquet data from S3 and get ArrowIOError error. 
> Size of the data: 32 part files 90 MB each (3GB approx) > Number of records: Approx 100M > Code Snippet: > {code:java} > from s3fs import S3FileSystem > import pyarrow.parquet as pq > s3 = S3FileSystem() > dataset = pq.ParquetDataset("s3://location", filesystem=s3) > df = dataset.read_pandas().to_pandas() > {code} > Stack Trace: > {code:java} > df = dataset.read_pandas().to_pandas() > File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1113, in read_pandas > return self.read(use_pandas_metadata=True, **kwargs) > File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1085, in read > use_pandas_metadata=use_pandas_metadata) > File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, > in read > table = reader.read(**options) > File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, > in read > use_threads=use_threads) > File "pyarrow/_parquet.pyx", line 1086, in > pyarrow._parquet.ParquetReader.read_all > File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status > pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) > than expected (263929) > {code} > > *Note: Same code works on relatively smaller dataset (approx < 50M records)* > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Comment Edited] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3
[ https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908622#comment-16908622 ] Wong Chung Hoi edited comment on ARROW-6058 at 8/16/19 2:03 AM: Hi all, below is a simple piece of code to reproduce the issue using: {code:java} s3fs==0.3.3 pyarrow==0.14.1 pandas==0.24.0 {code} The file generated is roughly 170MB {code:java} import pandas as pd >>> import numpy as np >>> pd.DataFrame(np.random.randint(0, 1, (1000, 10)), columns=[str(i) >>> for i in range(10)]).to_parquet('s3://path/to/file.snappy.parquet') >>> pd.read_parquet('s3://path/to/file.snappy.parquet') {code} {code:java} Traceback (most recent call last): File "", line 1, in File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py", line 282, in read_parquet return impl.read(path, columns=columns, **kwargs) File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py", line 129, in read **kwargs).to_pandas() File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py", line 1216, in read_table use_pandas_metadata=use_pandas_metadata) File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py", line 216, in read use_threads=use_threads) File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (304272) than expected (979599){code} was (Author: hoi): Hi all, below is a simple piece of code to reproduce the issue using: s3fs==0.3.3 pyarrow==0.14.1 pandas==0.24.0 The file generated is roughly 170MB ``` import pandas as pd >>> import numpy as np >>> pd.DataFrame(np.random.randint(0, 1, (1000, 10)), columns=[str(i) >>> for i in range(10)]).to_parquet('s3://path/to/file.snappy.parquet') >>> pd.read_parquet('s3://path/to/file.snappy.parquet') ``` ``` Traceback (most 
recent call last): File "", line 1, in File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py", line 282, in read_parquet return impl.read(path, columns=columns, **kwargs) File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py", line 129, in read **kwargs).to_pandas() File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py", line 1216, in read_table use_pandas_metadata=use_pandas_metadata) File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py", line 216, in read use_threads=use_threads) File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (304272) than expected (979599) ``` > [Python][Parquet] Failure when reading Parquet file from S3 > > > Key: ARROW-6058 > URL: https://issues.apache.org/jira/browse/ARROW-6058 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.14.1 >Reporter: Siddharth >Priority: Major > Labels: parquet > > I am reading parquet data from S3 and get ArrowIOError error. 
> Size of the data: 32 part files 90 MB each (3GB approx) > Number of records: Approx 100M > Code Snippet: > {code:java} > from s3fs import S3FileSystem > import pyarrow.parquet as pq > s3 = S3FileSystem() > dataset = pq.ParquetDataset("s3://location", filesystem=s3) > df = dataset.read_pandas().to_pandas() > {code} > Stack Trace: > {code:java} > df = dataset.read_pandas().to_pandas() > File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1113, in read_pandas > return self.read(use_pandas_metadata=True, **kwargs) > File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1085, in read > use_pandas_metadata=use_pandas_metadata) > File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, > in read > table = reader.read(**options) > File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, > in read > use_threads=use_threads) > File "pyarrow/_parquet.pyx", line 1086, in > pyarrow._parquet.ParquetReader.read_all > File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status > pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) > than expected (263929) > {code} > > *Note: Same code works on relatively smaller dataset (approx < 50M records)* > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-6249) [Java] Remove useless class ByteArrayWrapper
[ https://issues.apache.org/jira/browse/ARROW-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-6249. Resolution: Fixed Fix Version/s: 0.15.0 Issue resolved by pull request 5093 [https://github.com/apache/arrow/pull/5093] > [Java] Remove useless class ByteArrayWrapper > > > Key: ARROW-6249 > URL: https://issues.apache.org/jira/browse/ARROW-6249 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Minor > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 20m > Remaining Estimate: 0h > > This class was introduced into encoding part to compare byte[] values equals. > Since now we compare value/vector equals by new added visitor API by > ARROW-6022 instead of comparing {{getObject}}, this class is no use anymore. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-6038) [Python] pyarrow.Table.from_batches produces corrupted table if any of the batches were empty
[ https://issues.apache.org/jira/browse/ARROW-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-6038. - Resolution: Fixed Fix Version/s: 0.15.0 Issue resolved by pull request 4983 [https://github.com/apache/arrow/pull/4983] > [Python] pyarrow.Table.from_batches produces corrupted table if any of the > batches were empty > - > > Key: ARROW-6038 > URL: https://issues.apache.org/jira/browse/ARROW-6038 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.13.0, 0.14.0, 0.14.1 >Reporter: Piotr Bajger >Priority: Minor > Labels: pull-request-available, windows > Fix For: 0.15.0 > > Attachments: segfault_ex.py > > Time Spent: 40m > Remaining Estimate: 0h > > When creating a Table from a list/iterator of batches which contains an > "empty" RecordBatch a Table is produced but attempts to run any pyarrow > built-in functions (such as unique()) occasionally result in a Segfault. > The MWE is attached: [^segfault_ex.py] > # The segfaults happen randomly, around 30% of the time. > # Commenting out line 10 in the MWE results in no segfaults. > # The segfault is triggered using the unique() function, but I doubt the > behaviour is specific to that function, from what I gather the problem lies > in Table creation. > I'm on Windows 10, using Python 3.6 and pyarrow 0.14.0 installed through pip > (problem also occurs with 0.13.0 from conda-forge). -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6038) [Python] pyarrow.Table.from_batches produces corrupted table if any of the batches were empty
[ https://issues.apache.org/jira/browse/ARROW-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-6038: Component/s: Python C++ > [Python] pyarrow.Table.from_batches produces corrupted table if any of the > batches were empty > - > > Key: ARROW-6038 > URL: https://issues.apache.org/jira/browse/ARROW-6038 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.13.0, 0.14.0, 0.14.1 >Reporter: Piotr Bajger >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available, windows > Fix For: 0.15.0 > > Attachments: segfault_ex.py > > Time Spent: 50m > Remaining Estimate: 0h > > When creating a Table from a list/iterator of batches which contains an > "empty" RecordBatch a Table is produced but attempts to run any pyarrow > built-in functions (such as unique()) occasionally result in a Segfault. > The MWE is attached: [^segfault_ex.py] > # The segfaults happen randomly, around 30% of the time. > # Commenting out line 10 in the MWE results in no segfaults. > # The segfault is triggered using the unique() function, but I doubt the > behaviour is specific to that function, from what I gather the problem lies > in Table creation. > I'm on Windows 10, using Python 3.6 and pyarrow 0.14.0 installed through pip > (problem also occurs with 0.13.0 from conda-forge). -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (ARROW-6038) [Python] pyarrow.Table.from_batches produces corrupted table if any of the batches were empty
[ https://issues.apache.org/jira/browse/ARROW-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-6038: --- Assignee: Antoine Pitrou > [Python] pyarrow.Table.from_batches produces corrupted table if any of the > batches were empty > - > > Key: ARROW-6038 > URL: https://issues.apache.org/jira/browse/ARROW-6038 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.13.0, 0.14.0, 0.14.1 >Reporter: Piotr Bajger >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available, windows > Fix For: 0.15.0 > > Attachments: segfault_ex.py > > Time Spent: 50m > Remaining Estimate: 0h > > When creating a Table from a list/iterator of batches which contains an > "empty" RecordBatch a Table is produced but attempts to run any pyarrow > built-in functions (such as unique()) occasionally result in a Segfault. > The MWE is attached: [^segfault_ex.py] > # The segfaults happen randomly, around 30% of the time. > # Commenting out line 10 in the MWE results in no segfaults. > # The segfault is triggered using the unique() function, but I doubt the > behaviour is specific to that function, from what I gather the problem lies > in Table creation. > I'm on Windows 10, using Python 3.6 and pyarrow 0.14.0 installed through pip > (problem also occurs with 0.13.0 from conda-forge). -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6263) [Python] RecordBatch.from_arrays does not check array types against a passed schema
Wes McKinney created ARROW-6263: --- Summary: [Python] RecordBatch.from_arrays does not check array types against a passed schema Key: ARROW-6263 URL: https://issues.apache.org/jira/browse/ARROW-6263 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Wes McKinney Fix For: 0.15.0 Example came from ARROW-6038
{code}
In [4]: pa.RecordBatch.from_arrays([pa.array([])], schema)
Out[4]:
In [5]: rb = pa.RecordBatch.from_arrays([pa.array([])], schema)
In [6]: rb
Out[6]:
In [7]: rb.schema
Out[7]: col: string
In [8]: rb[0]
Out[8]: 0 nulls
{code}
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
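The missing check amounts to comparing each array's type with the corresponding schema field before accepting the batch. A minimal sketch of that validation, using plain-Python stand-ins for Arrow objects (this is not the pyarrow API; `pa.array([])` produces a null-typed array, which is what slips past the current code):

```python
def validate_against_schema(arrays, schema):
    """Raise if the arrays' types don't line up with the schema.

    `arrays` is a list of (type_name, values) pairs and `schema` a list of
    (field_name, type_name) pairs -- simple stand-ins for Arrow objects.
    """
    if len(arrays) != len(schema):
        raise ValueError("column count mismatch")
    for (arr_type, _), (name, field_type) in zip(arrays, schema):
        if arr_type != field_type:
            raise ValueError(
                f"column {name!r}: expected {field_type}, got {arr_type}")

schema = [("col", "string")]
validate_against_schema([("string", ["a", "b"])], schema)  # passes silently
try:
    # An empty array defaults to the null type, not the field's string type.
    validate_against_schema([("null", [])], schema)
except ValueError as exc:
    print(exc)  # column 'col': expected string, got null
```

Applied inside `from_arrays`, a check like this would have rejected the null-typed chunk instead of producing the corrupted table seen in ARROW-6038.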
[jira] [Resolved] (ARROW-6219) [Java] Add API for JDBC adapter that can convert less than the full result set at a time.
[ https://issues.apache.org/jira/browse/ARROW-6219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-6219. Resolution: Fixed Fix Version/s: 0.15.0 Issue resolved by pull request 5075 [https://github.com/apache/arrow/pull/5075] > [Java] Add API for JDBC adapter that can convert less than the full result > set at a time. > - > > Key: ARROW-6219 > URL: https://issues.apache.org/jira/browse/ARROW-6219 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Micah Kornfield >Assignee: Ji Liu >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 9.5h > Remaining Estimate: 0h > > We should make the number of rows per batch configurable and either let clients > iterate or provide an iterator API. Otherwise, for large result sets we might > run out of memory. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
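The batching pattern this issue asks for can be sketched independently of JDBC: pull a fixed number of rows at a time and yield each chunk, so only one batch is ever resident. The helper below is an illustrative standard-library sketch, not the adapter's actual Java API.

```python
from itertools import islice

def iter_batches(rows, batch_size):
    """Yield lists of up to batch_size rows from any row iterator."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# A large "result set" consumed in bounded chunks:
batches = list(iter_batches(range(10), 4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

In the JDBC adapter the same shape applies: each chunk of rows becomes one `VectorSchemaRoot`/record batch, and the client either iterates or receives an iterator, instead of materializing the whole result set.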
[jira] [Commented] (ARROW-6038) [Python] pyarrow.Table.from_batches produces corrupted table if any of the batches were empty
[ https://issues.apache.org/jira/browse/ARROW-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908642#comment-16908642 ] Wes McKinney commented on ARROW-6038: - I confirmed that the MWE is behaving properly now {code} $ python ~/Downloads/segfault_ex.py Creating table Traceback (most recent call last): File "/home/wesm/Downloads/segfault_ex.py", line 11, in pa.RecordBatch.from_arrays([pa.array(["C", "C", "C"])], schema), File "pyarrow/table.pxi", line 1117, in pyarrow.lib.Table.from_batches return pyarrow_wrap_table(c_table) File "pyarrow/public-api.pxi", line 316, in pyarrow.lib.pyarrow_wrap_table check_status(ctable.get().Validate()) File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status raise ArrowInvalid(message) pyarrow.lib.ArrowInvalid: Column 0: In chunk 1 expected type string but saw null {code} This is still weird and dangerous though: {code} In [4]: pa.RecordBatch.from_arrays([pa.array([])], schema) Out[4]: In [5]: rb = pa.RecordBatch.from_arrays([pa.array([])], schema) In [6]: rb Out[6]: In [7]: rb.schema Out[7]: col: string In [8]: rb[0] Out[8]: 0 nulls {code} I opened ARROW-6263 > [Python] pyarrow.Table.from_batches produces corrupted table if any of the > batches were empty > - > > Key: ARROW-6038 > URL: https://issues.apache.org/jira/browse/ARROW-6038 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.13.0, 0.14.0, 0.14.1 >Reporter: Piotr Bajger >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available, windows > Fix For: 0.15.0 > > Attachments: segfault_ex.py > > Time Spent: 50m > Remaining Estimate: 0h > > When creating a Table from a list/iterator of batches which contains an > "empty" RecordBatch a Table is produced but attempts to run any pyarrow > built-in functions (such as unique()) occasionally result in a Segfault. > The MWE is attached: [^segfault_ex.py] > # The segfaults happen randomly, around 30% of the time. 
> # Commenting out line 10 in the MWE results in no segfaults. > # The segfault is triggered using the unique() function, but I doubt the > behaviour is specific to that function, from what I gather the problem lies > in Table creation. > I'm on Windows 10, using Python 3.6 and pyarrow 0.14.0 installed through pip > (problem also occurs with 0.13.0 from conda-forge). -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3
[ https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-6058: Fix Version/s: 0.15.0 > [Python][Parquet] Failure when reading Parquet file from S3 > > > Key: ARROW-6058 > URL: https://issues.apache.org/jira/browse/ARROW-6058 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.14.1 >Reporter: Siddharth >Priority: Major > Labels: parquet > Fix For: 0.15.0 > > > I am reading parquet data from S3 and get ArrowIOError error. > Size of the data: 32 part files 90 MB each (3GB approx) > Number of records: Approx 100M > Code Snippet: > {code:java} > from s3fs import S3FileSystem > import pyarrow.parquet as pq > s3 = S3FileSystem() > dataset = pq.ParquetDataset("s3://location", filesystem=s3) > df = dataset.read_pandas().to_pandas() > {code} > Stack Trace: > {code:java} > df = dataset.read_pandas().to_pandas() > File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1113, in read_pandas > return self.read(use_pandas_metadata=True, **kwargs) > File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line > 1085, in read > use_pandas_metadata=use_pandas_metadata) > File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, > in read > table = reader.read(**options) > File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, > in read > use_threads=use_threads) > File "pyarrow/_parquet.pyx", line 1086, in > pyarrow._parquet.ParquetReader.read_all > File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status > pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) > than expected (263929) > {code} > > *Note: Same code works on relatively smaller dataset (approx < 50M records)* > > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Comment Edited] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3
[ https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908622#comment-16908622 ] Wong Chung Hoi edited comment on ARROW-6058 at 8/16/19 2:39 AM: Hi all, below is a simple piece of code to reproduce the issue using: {code:java} s3fs==0.3.3 pyarrow==0.14.1 pandas==0.24.0 {code} The file generated is roughly 170MB {code:java} import pandas as pd import numpy as np pd.DataFrame(np.random.randint(0, 1, (1000, 10)), columns=[str(i) for i in range(10)]).to_parquet('s3://path/to/file.snappy.parquet') pd.read_parquet('s3://path/to/file.snappy.parquet') {code} {code:java} Traceback (most recent call last): File "", line 1, in File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py", line 282, in read_parquet return impl.read(path, columns=columns, **kwargs) File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py", line 129, in read **kwargs).to_pandas() File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py", line 1216, in read_table use_pandas_metadata=use_pandas_metadata) File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py", line 216, in read use_threads=use_threads) File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (304272) than expected (979599){code} was (Author: hoi): Hi all, below is a simple piece of code to reproduce the issue using: {code:java} s3fs==0.3.3 pyarrow==0.14.1 pandas==0.24.0 {code} The file generated is roughly 170MB {code:java} import pandas as pd >>> import numpy as np >>> pd.DataFrame(np.random.randint(0, 1, (1000, 10)), columns=[str(i) >>> for i in range(10)]).to_parquet('s3://path/to/file.snappy.parquet') >>> pd.read_parquet('s3://path/to/file.snappy.parquet') {code} 
{code:java} Traceback (most recent call last): File "", line 1, in File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py", line 282, in read_parquet return impl.read(path, columns=columns, **kwargs) File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pandas/io/parquet.py", line 129, in read **kwargs).to_pandas() File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py", line 1216, in read_table use_pandas_metadata=use_pandas_metadata) File "/Users/hoi/.pyenv/versions/3.7.2/lib/python3.7/site-packages/pyarrow/parquet.py", line 216, in read use_threads=use_threads) File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (304272) than expected (979599){code} > [Python][Parquet] Failure when reading Parquet file from S3 > > > Key: ARROW-6058 > URL: https://issues.apache.org/jira/browse/ARROW-6058 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 0.14.1 >Reporter: Siddharth >Priority: Major > Labels: parquet > Fix For: 0.15.0 > > > I am reading parquet data from S3 and get ArrowIOError error. 
> Size of the data: 32 part files, 90 MB each (3 GB approx)
> Number of records: approx 100M
> Code Snippet:
> {code:java}
> from s3fs import S3FileSystem
> import pyarrow.parquet as pq
> s3 = S3FileSystem()
> dataset = pq.ParquetDataset("s3://location", filesystem=s3)
> df = dataset.read_pandas().to_pandas()
> {code}
> Stack Trace:
> {code:java}
> df = dataset.read_pandas().to_pandas()
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1113, in read_pandas
> return self.read(use_pandas_metadata=True, **kwargs)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 1085, in read
> use_pandas_metadata=use_pandas_metadata)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, in read
> table = reader.read(**options)
> File "/root/.local/lib/python3.6/site-packages/pyarrow/parquet.py", line 216, in read
> use_threads=use_threads)
> File "pyarrow/_parquet.pyx", line 1086, in pyarrow._parquet.ParquetReader.read_all
> File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: Unexpected end of stream: Page was smaller (197092) than expected (263929)
> {code}
>
> *Note: Same code works on relatively smaller dataset (approx < 50M records)*
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6058) [Python][Parquet] Failure when reading Parquet file from S3
[ https://issues.apache.org/jira/browse/ARROW-6058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16908648#comment-16908648 ] Wes McKinney commented on ARROW-6058: - Thank you, that's great! I added to the 0.15.0 milestone. I've been working a lot on Parquet stuff lately so if no one looks at it first I'll try to look before the release horizon closes
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-6212) [Java] Support vector rank operation
[ https://issues.apache.org/jira/browse/ARROW-6212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-6212. Resolution: Fixed Fix Version/s: 0.15.0 Issue resolved by pull request 5066 [https://github.com/apache/arrow/pull/5066]
> [Java] Support vector rank operation
>
> Key: ARROW-6212
> URL: https://issues.apache.org/jira/browse/ARROW-6212
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Java
> Reporter: Liya Fan
> Assignee: Liya Fan
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.0
>
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> Given an unsorted vector, we want to get the index of the i-th smallest element in the vector. This function is supported by the rank operation.
> We provide an implementation that gets the index with the desired rank without sorting the vector (the vector is left intact); the implementation takes O(n) time, where n is the vector length.
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
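Rank operations like the one above are typically built on a selection algorithm. As a language-neutral illustration (a Python sketch of quickselect, not the actual Arrow Java implementation; the name `rank_index` is my own), one can partition an auxiliary index array rather than the data itself, so the vector is left intact, in expected O(n) time:

```python
import random


def rank_index(values, k):
    """Return the index into `values` of the k-th smallest element (0-based).

    Only the auxiliary index array is permuted; `values` is never modified.
    """
    idx = list(range(len(values)))
    lo, hi = 0, len(idx) - 1
    while lo < hi:
        # Hoare-style partition around a randomly chosen pivot value.
        pivot = values[idx[random.randint(lo, hi)]]
        i, j = lo, hi
        while i <= j:
            while values[idx[i]] < pivot:
                i += 1
            while values[idx[j]] > pivot:
                j -= 1
            if i <= j:
                idx[i], idx[j] = idx[j], idx[i]
                i += 1
                j -= 1
        # Narrow the search to the side containing rank k.
        if k <= j:
            hi = j
        elif k >= i:
            lo = i
        else:
            break  # k lies in the middle block of pivot-equal elements
    return idx[k]
```

For example, `rank_index([5, 1, 4, 9], 0)` returns 1, the position of the smallest value, and the input list is unchanged afterwards.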
[jira] [Resolved] (ARROW-6262) [Developer] Show JIRA issue before merging
[ https://issues.apache.org/jira/browse/ARROW-6262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-6262. - Resolution: Fixed Fix Version/s: 0.15.0 Issue resolved by pull request 5097 [https://github.com/apache/arrow/pull/5097]
> [Developer] Show JIRA issue before merging
>
> Key: ARROW-6262
> URL: https://issues.apache.org/jira/browse/ARROW-6262
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Developer Tools
> Reporter: Sutou Kouhei
> Assignee: Sutou Kouhei
> Priority: Minor
> Labels: pull-request-available
> Fix For: 0.15.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> It's useful to confirm whether the associated JIRA issue is right or not.
> We failed to notice a wrongly associated JIRA issue before merging the pull request https://github.com/apache/arrow/pull/5050 .
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (ARROW-5952) [Python] Segfault when reading empty table with category as pandas dataframe
[ https://issues.apache.org/jira/browse/ARROW-5952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-5952: --- Assignee: Joris Van den Bossche
> [Python] Segfault when reading empty table with category as pandas dataframe
>
> Key: ARROW-5952
> URL: https://issues.apache.org/jira/browse/ARROW-5952
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.14.0, 0.14.1
> Environment: Linux 3.10.0-327.36.3.el7.x86_64
> Python 3.6.8
> Pandas 0.24.2
> Pyarrow 0.14.0
> Reporter: Daniel Nugent
> Assignee: Joris Van den Bossche
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.15.0
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> I have two short sample programs which demonstrate the issue:
> {code:java}
> import pyarrow as pa
> import pandas as pd
> empty = pd.DataFrame({'foo':[]},dtype='category')
> table = pa.Table.from_pandas(empty)
> outfile = pa.output_stream('bar')
> writer = pa.RecordBatchFileWriter(outfile,table.schema)
> writer.write(table)
> writer.close()
> {code}
> {code:java}
> import pyarrow as pa
> pa.ipc.open_file('bar').read_pandas()
> Segmentation fault
> {code}
> My apologies if this was already reported elsewhere, I searched but could not find an issue which seemed to refer to the same behavior.
-- This message was sent by Atlassian JIRA (v7.6.14#76016)