[jira] [Created] (ARROW-18360) [Python] Incorrectly passing schema=None to do_put crashes
Bryan Cutler created ARROW-18360: Summary: [Python] Incorrectly passing schema=None to do_put crashes Key: ARROW-18360 URL: https://issues.apache.org/jira/browse/ARROW-18360 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 9.0.0 Reporter: Bryan Cutler

In pyarrow.flight, passing an incorrect value of None for schema in do_put will lead to a core dump. In pyarrow 9.0.0, running the command leads to an immediate segmentation fault:

{code}
In [3]: writer, reader = client.do_put(flight.FlightDescriptor.for_command(cmd), schema=None)
Segmentation fault (core dumped)
{code}

In pyarrow 7.0.0, the kernel crashes after attempting to access the writer, with the following check failure and backtrace:

{code}
In [38]: client = flight.FlightClient('grpc+tls://localhost:9643', disable_server_verification=True)

In [39]: writer, reader = client.do_put(flight.FlightDescriptor.for_command(cmd), None)

In [40]: writer.
/home/conda/feedstock_root/build_artifacts/arrow-cpp-ext_1644752264449/work/cpp/src/arrow/flight/client.cc:736: Check failed: (batch_writer_) != (nullptr)
miniconda3/envs/dev/lib/python3.10/site-packages/pyarrow/../../../libarrow.so.700(+0x66288c)[0x7f0feeae088c]
miniconda3/envs/dev/lib/python3.10/site-packages/pyarrow/../../../libarrow.so.700(_ZN5arrow4util8ArrowLogD1Ev+0x101)[0x7f0feeae0c91]
miniconda3/envs/dev/lib/python3.10/site-packages/pyarrow/../../../libarrow_flight.so.700(+0x7c1e1)[0x7f0fa9e331e1]
miniconda3/envs/dev/lib/python3.10/site-packages/pyarrow/lib.cpython-310-x86_64-linux-gnu.so(+0x17cf1a)[0x7f0fefe7ff1a]
miniconda3/envs/dev/bin/python(_PyObject_GenericGetAttrWithDict+0x4f3)[0x559a7cb8da03]
miniconda3/envs/dev/bin/python(+0x144814)[0x559a7cb8f814]
miniconda3/envs/dev/bin/python(+0x1445bf)[0x559a7cb8f5bf]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x30c)[0x559a7cb7ebcc]
miniconda3/envs/dev/bin/python(+0x1516ac)[0x559a7cb9c6ac]
miniconda3/envs/dev/bin/python(PyObject_Call+0xb8)[0x559a7cb9d348]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x2b05)[0x559a7cb813c5]
miniconda3/envs/dev/bin/python(_PyFunction_Vectorcall+0x6f)[0x559a7cb8f3cf]
miniconda3/envs/dev/bin/python(+0x1ead44)[0x559a7cc35d44]
miniconda3/envs/dev/bin/python(+0x220397)[0x559a7cc6b397]
miniconda3/envs/dev/bin/python(PyObject_Call+0xb8)[0x559a7cb9d348]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x2b05)[0x559a7cb813c5]
miniconda3/envs/dev/bin/python(_PyFunction_Vectorcall+0x6f)[0x559a7cb8f3cf]
miniconda3/envs/dev/bin/python(PyObject_Call+0xb8)[0x559a7cb9d348]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x2b05)[0x559a7cb813c5]
miniconda3/envs/dev/bin/python(+0x1516ac)[0x559a7cb9c6ac]
miniconda3/envs/dev/bin/python(PyObject_Call+0xb8)[0x559a7cb9d348]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x2b05)[0x559a7cb813c5]
miniconda3/envs/dev/bin/python(+0x151ef3)[0x559a7cb9cef3]
miniconda3/envs/dev/bin/python(+0x1ead44)[0x559a7cc35d44]
miniconda3/envs/dev/bin/python(+0x220397)[0x559a7cc6b397]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x1311)[0x559a7cb7fbd1]
miniconda3/envs/dev/bin/python(_PyFunction_Vectorcall+0x6f)[0x559a7cb8f3cf]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x30c)[0x559a7cb7ebcc]
miniconda3/envs/dev/bin/python(_PyFunction_Vectorcall+0x6f)[0x559a7cb8f3cf]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x2b05)[0x559a7cb813c5]
miniconda3/envs/dev/bin/python(_PyFunction_Vectorcall+0x6f)[0x559a7cb8f3cf]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x66f)[0x559a7cb7ef2f]
miniconda3/envs/dev/bin/python(+0x14fc9d)[0x559a7cb9ac9d]
miniconda3/envs/dev/bin/python(_PyObject_GenericGetAttrWithDict+0x4f3)[0x559a7cb8da03]
miniconda3/envs/dev/bin/python(PyObject_GetAttr+0x44)[0x559a7cb8c494]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x4d8f)[0x559a7cb8364f]
miniconda3/envs/dev/bin/python(+0x14fc9d)[0x559a7cb9ac9d]
miniconda3/envs/dev/bin/python(+0x1416f5)[0x559a7cb8c6f5]
miniconda3/envs/dev/bin/python(PyObject_GetAttr+0x52)[0x559a7cb8c4a2]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x4d8f)[0x559a7cb8364f]
miniconda3/envs/dev/bin/python(+0x14fc9d)[0x559a7cb9ac9d]
miniconda3/envs/dev/bin/python(_PyObject_GenericGetAttrWithDict+0x4f3)[0x559a7cb8da03]
miniconda3/envs/dev/bin/python(PyObject_GetAttr+0x44)[0x559a7cb8c494]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x4d8f)[0x559a7cb8364f]
miniconda3/envs/dev/bin/python(+0x15a178)[0x559a7cba5178]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x9ca)[0x559a7cb7f28a]
miniconda3/envs/dev/bin/python(+0x15a178)[0x559a7cba5178]
miniconda3/envs/dev/bin/python(+0x1602d9)[0x559a7cbab2d9]
miniconda3/envs/dev/bin/python(+0x19d5f5)[0x559a7cbe85f5]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x30c)[0x559a7cb7ebcc]
miniconda3/envs/dev/bin/python(+0x15a178)[0x559a7cba5178]
{code}
[jira] [Created] (ARROW-15831) [Java] Upgrade Flight dependencies
Bryan Cutler created ARROW-15831: Summary: [Java] Upgrade Flight dependencies Key: ARROW-15831 URL: https://issues.apache.org/jira/browse/ARROW-15831 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Bryan Cutler Assignee: Bryan Cutler Upgrade grpc, netty and protobuf dependencies for Flight -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15746) [Java] Add arrow-flight pom to list of artifacts to deploy
Bryan Cutler created ARROW-15746: Summary: [Java] Add arrow-flight pom to list of artifacts to deploy Key: ARROW-15746 URL: https://issues.apache.org/jira/browse/ARROW-15746 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Bryan Cutler Assignee: Bryan Cutler The arrow-flight pom is currently not being deployed, see https://lists.apache.org/thread/fbrgvf30os5h4ox7fk4txrlgdp1g5g4g -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15722) [Java] Improve error message for ListVector with wrong number of children
Bryan Cutler created ARROW-15722: Summary: [Java] Improve error message for ListVector with wrong number of children Key: ARROW-15722 URL: https://issues.apache.org/jira/browse/ARROW-15722 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Bryan Cutler Assignee: Bryan Cutler If a ListVector is made without any children, the error message will say "Lists have only one child. Found: []". The wording could be improved a little to let the user know what went wrong. -- This message was sent by Atlassian Jira (v8.20.1#820001)
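The kind of improved check described above can be sketched as follows. This is a hypothetical Python model of the Java validation, not the actual ListVector code; the function name and message text are illustrative:

```python
def validate_list_children(children):
    """Sketch of a friendlier check: a list vector must have exactly
    one child (the data vector holding the list elements)."""
    if len(children) != 1:
        raise ValueError(
            "A ListVector must have exactly one child vector holding its "
            f"elements, but {len(children)} were found: {children!r}. "
            "Did you forget to initialize the data vector?"
        )


validate_list_children(["element-vector"])  # exactly one child: passes silently
```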
[jira] [Created] (ARROW-14198) [Java] Upgrade Netty and gRPC dependencies
Bryan Cutler created ARROW-14198: Summary: [Java] Upgrade Netty and gRPC dependencies Key: ARROW-14198 URL: https://issues.apache.org/jira/browse/ARROW-14198 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Bryan Cutler Assignee: Bryan Cutler Current versions in use are quite old and subject to vulnerabilities. See https://www.cvedetails.com/cve/CVE-2021-21409/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13872) [Java] ExtensionTypeVector does not work with RangeEqualsVisitor
Bryan Cutler created ARROW-13872: Summary: [Java] ExtensionTypeVector does not work with RangeEqualsVisitor Key: ARROW-13872 URL: https://issues.apache.org/jira/browse/ARROW-13872 Project: Apache Arrow Issue Type: Bug Components: Java Affects Versions: 5.0.0 Reporter: Bryan Cutler Assignee: Bryan Cutler When using an ExtensionTypeVector with a RangeEqualsVisitor to compare against another extension type vector, the comparison fails: in vector.accept() the extension type defers to the underlyingVector, but the same is not done for the vector initially set in the RangeEqualsVisitor, so it either fails due to different types or attempts to cast the extension vector to the underlying vector type. -- This message was sent by Atlassian Jira (v8.3.4#803005)
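The asymmetry described above (one side unwrapped, the other not) can be modeled in a few lines. This is a toy Python sketch of the fix idea, not the Java visitor code; `ExtensionVector`, `unwrap`, and `range_equals` are hypothetical names:

```python
class ExtensionVector:
    """Toy stand-in: an extension vector wrapping an underlying storage vector."""
    def __init__(self, underlying):
        self.underlying = underlying


def unwrap(vector):
    # Defer to the underlying storage vector on *both* sides of the
    # comparison -- the bug was that only one side was unwrapped.
    while isinstance(vector, ExtensionVector):
        vector = vector.underlying
    return vector


def range_equals(left, right):
    return unwrap(left) == unwrap(right)
```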
[jira] [Created] (ARROW-13076) [Java] Enable ExtensionType to use StructVector and UnionVector for underlying storage
Bryan Cutler created ARROW-13076: Summary: [Java] Enable ExtensionType to use StructVector and UnionVector for underlying storage Key: ARROW-13076 URL: https://issues.apache.org/jira/browse/ARROW-13076 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Bryan Cutler Assignee: Bryan Cutler Currently, an ExtensionTypeVector has a type constraint requiring the underlying storage to extend BaseValueVector. StructVector, UnionVector and DenseUnionVector do not extend this base class. After ARROW-13044, Union vectors will extend the ValueVector interface, and the extension vector type constraint could be relaxed to this interface to allow the above vector types to be used. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-13044) [Java] Union vectors should extend BaseValueVector
Bryan Cutler created ARROW-13044: Summary: [Java] Union vectors should extend BaseValueVector Key: ARROW-13044 URL: https://issues.apache.org/jira/browse/ARROW-13044 Project: Apache Arrow Issue Type: Improvement Components: Java Affects Versions: 4.0.1 Reporter: Bryan Cutler Assignee: Bryan Cutler I was going to try using a DenseUnionVector as the underlying vector of an extension type but it's not currently possible because ExtensionTypeVector has a type constraint for the underlying storage to extend BaseValueVector and the union vectors do not extend this class. It should be possible for UnionVector and DenseUnionVector to extend AbstractContainerVector, which is a subclass of BaseValueVector. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11739) [Java] Add API for getBufferSize() with density to BaseVariableWidthVector
Bryan Cutler created ARROW-11739: Summary: [Java] Add API for getBufferSize() with density to BaseVariableWidthVector Key: ARROW-11739 URL: https://issues.apache.org/jira/browse/ARROW-11739 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Bryan Cutler -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11382) [Java] NullVector field name can't be set
Bryan Cutler created ARROW-11382: Summary: [Java] NullVector field name can't be set Key: ARROW-11382 URL: https://issues.apache.org/jira/browse/ARROW-11382 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Bryan Cutler Currently, the Java NullVector has a default Field name fixed to DATA_VECTOR_NAME, which is "$data$". The user should be able to change this, probably via an alternate constructor that accepts a name. -- This message was sent by Atlassian Jira (v8.3.4#803005)
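The proposed API shape can be sketched in a few lines. This is hypothetical Python pseudocode for the Java change, not actual Arrow code; the constructor signature is an assumption:

```python
class NullVector:
    """Sketch of the proposal: keep "$data$" as the default field name,
    but let callers override it via an alternate constructor argument."""
    DATA_VECTOR_NAME = "$data$"

    def __init__(self, name=None):
        # Fall back to the existing default when no name is given,
        # preserving backward compatibility.
        self.name = name if name is not None else self.DATA_VECTOR_NAME
```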
[jira] [Created] (ARROW-10512) [Python] Arrow to Pandas conversion promotes child array to float for NULL values
Bryan Cutler created ARROW-10512: Summary: [Python] Arrow to Pandas conversion promotes child array to float for NULL values Key: ARROW-10512 URL: https://issues.apache.org/jira/browse/ARROW-10512 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Bryan Cutler When converting a nested Arrow array to Pandas, if a child array is an integer type with NULL values, it gets promoted to floating point and NULL values are replaced with NaNs. Since the Pandas conversion for these types results in Python objects, it is not necessary to use NaN, and `None` values could be inserted instead. This is the case for ListType, MapType, StructType, etc.

{code}
In [4]: s = pd.Series([[1, 2, 3], [4, 5, None]])

In [5]: arr = pa.Array.from_pandas(s)

In [6]: arr.type
Out[6]: ListType(list)

In [7]: arr.to_pandas()
Out[7]:
0    [1.0, 2.0, 3.0]
1    [4.0, 5.0, nan]
dtype: object
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
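The proposed behavior, replacing NaN placeholders with None in object-dtype results, can be sketched in plain Python. The `nan_to_none` helper below is hypothetical, not part of pyarrow:

```python
import math


def nan_to_none(values):
    """Replace float NaN placeholders with None, as proposed for
    object-dtype results of nested-type conversion."""
    return [None if isinstance(v, float) and math.isnan(v) else v
            for v in values]


print(nan_to_none([4.0, 5.0, float("nan")]))  # [4.0, 5.0, None]
```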
[jira] [Created] (ARROW-10457) [CI] Fix Spark branch-3.0 integration tests
Bryan Cutler created ARROW-10457: Summary: [CI] Fix Spark branch-3.0 integration tests Key: ARROW-10457 URL: https://issues.apache.org/jira/browse/ARROW-10457 Project: Apache Arrow Issue Type: Improvement Reporter: Bryan Cutler The Spark branch-3.0 is currently failing because this branch has not been updated or patched to use the latest Arrow Java, see https://github.com/ursa-labs/crossbow/actions?query=branch:actions-681-github-test-conda-python-3.7-spark-branch-3.0. The Spark branch-3.0 has already been released and only able to get bug fixes. Instead of patching the Spark build, we should not try to rebuild Spark with the latest Arrow Java, and instead only test against the latest pyarrow. This should work, but might also need a minor Python patch. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10260) [Python] Missing MapType to Pandas dtype
Bryan Cutler created ARROW-10260: Summary: [Python] Missing MapType to Pandas dtype Key: ARROW-10260 URL: https://issues.apache.org/jira/browse/ARROW-10260 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Bryan Cutler The Map type conversion to Pandas done in ARROW-10151 forgot to add a dtype mapping for {{to_pandas_dtype()}}

{code:java}
In [2]: d = pa.map_(pa.int64(), pa.float64())

In [3]: d.to_pandas_dtype()
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
----> 1 d.to_pandas_dtype()

~/miniconda2/envs/pyarrow-test/lib/python3.7/site-packages/pyarrow/types.pxi in pyarrow.lib.DataType.to_pandas_dtype()

NotImplementedError: map
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
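The fix amounts to giving nested types an entry in the dtype lookup. A hedged sketch of that idea in plain Python (the table and function below are illustrative, not pyarrow's internal mapping): nested types such as map have no native pandas dtype, so they fall back to `object` rather than raising.

```python
# Hypothetical lookup table; the names are illustrative only.
_ARROW_TO_PANDAS_DTYPE = {
    "int64": "int64",
    "double": "float64",
    "bool": "bool",
}


def to_pandas_dtype(arrow_type_name):
    """Return the pandas dtype name for an Arrow type name, falling back
    to 'object' for nested types (map, list, struct) instead of raising
    NotImplementedError."""
    return _ARROW_TO_PANDAS_DTYPE.get(arrow_type_name, "object")
```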
[jira] [Created] (ARROW-10151) [Python] Add support MapArray to_pandas conversion
Bryan Cutler created ARROW-10151: Summary: [Python] Add support for MapArray to_pandas conversion Key: ARROW-10151 URL: https://issues.apache.org/jira/browse/ARROW-10151 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Bryan Cutler Fix For: 2.0.0 MapArray does not currently support to_pandas conversion and raises a {{Status::NotImplemented("No known equivalent Pandas block for Arrow data of type ")}} Conversion from Pandas seems to work, but we should verify there are tests in place. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9750) [Doc][Python] Add usage of py.Array scalar operations behavior
Bryan Cutler created ARROW-9750: --- Summary: [Doc][Python] Add usage of py.Array scalar operations behavior Key: ARROW-9750 URL: https://issues.apache.org/jira/browse/ARROW-9750 Project: Apache Arrow Issue Type: Improvement Components: Documentation, Python Affects Versions: 1.0.0 Reporter: Bryan Cutler Recent changes in 1.0.0 affected the way pyarrow.Array scalars handle operations such as equality. For example, an equality check compares object identity and returns False regardless of the value. Since this could be confusing to the user, there should be some documentation on this behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005)
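The confusing behavior can be illustrated with a toy model. This is not pyarrow's actual Scalar class, just a minimal Python stand-in showing why identity-based equality surprises users and why converting to a plain Python value first is the workaround:

```python
class Scalar:
    """Toy model of the behavior described above: no __eq__ defined, so
    == falls back to object identity and is False for distinct wrappers."""
    def __init__(self, value):
        self._value = value

    def as_py(self):
        # Unwrap to a plain Python value before comparing.
        return self._value


a, b = Scalar(7), Scalar(7)
print(a == b)                   # False: identity comparison of two objects
print(a.as_py() == b.as_py())   # True: compares the unwrapped values
```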
[jira] [Created] (ARROW-9576) [Doc] Fix error in code example for extension types
Bryan Cutler created ARROW-9576: --- Summary: [Doc] Fix error in code example for extension types Key: ARROW-9576 URL: https://issues.apache.org/jira/browse/ARROW-9576 Project: Apache Arrow Issue Type: Improvement Components: Documentation Affects Versions: 1.0.0 Reporter: Bryan Cutler Assignee: Bryan Cutler There is an error in the example code using an undefined variable `arr` instead of `self` here https://arrow.apache.org/docs/python/extending_types.html#conversion-to-pandas -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9545) [Java] Add forward compatibility checks for unrecognized future MetadataVersion
Bryan Cutler created ARROW-9545: --- Summary: [Java] Add forward compatibility checks for unrecognized future MetadataVersion Key: ARROW-9545 URL: https://issues.apache.org/jira/browse/ARROW-9545 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Bryan Cutler Assignee: Wes McKinney Fix For: 1.0.0 In theory we should have no need of these checks, but they provide a safeguard should it become necessary, some years in the future, to increment the MetadataVersion. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9357) [Java] Document how to set netty/unsafe allocators
Bryan Cutler created ARROW-9357: --- Summary: [Java] Document how to set netty/unsafe allocators Key: ARROW-9357 URL: https://issues.apache.org/jira/browse/ARROW-9357 Project: Apache Arrow Issue Type: Task Components: Documentation Reporter: Bryan Cutler There are now two allocators available, one based on netty and one using unsafe APIs. We should provide end-user documentation on which one is the default and how to set and use each one. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9356) [Java] Remove Netty dependency from arrow-vector
Bryan Cutler created ARROW-9356: --- Summary: [Java] Remove Netty dependency from arrow-vector Key: ARROW-9356 URL: https://issues.apache.org/jira/browse/ARROW-9356 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Bryan Cutler Fix For: 1.0.0 Cleanup remaining usage of Netty from arrow-vector and remove as a dependency. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6111) [Java] Support LargeVarChar and LargeBinary types and add integration test with C++
[ https://issues.apache.org/jira/browse/ARROW-6111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-6111. - Resolution: Fixed Issue resolved by pull request 6425 [https://github.com/apache/arrow/pull/6425] > [Java] Support LargeVarChar and LargeBinary types and add integration test > with C++ > --- > > Key: ARROW-6111 > URL: https://issues.apache.org/jira/browse/ARROW-6111 > Project: Apache Arrow > Issue Type: New Feature > Components: Integration, Java >Reporter: Micah Kornfield >Assignee: Liya Fan >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 8h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8731) Error when using toPandas with PyArrow
[ https://issues.apache.org/jira/browse/ARROW-8731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101844#comment-17101844 ] Bryan Cutler commented on ARROW-8731: - [~are...@wayfair.com] you should be able to use a newer version of pyarrow with pyspark 2.4.4 by following the instructions here https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#compatibiliy-setting-for-pyarrow--0150-and-spark-23x-24x > Error when using toPandas with PyArrow > -- > > Key: ARROW-8731 > URL: https://issues.apache.org/jira/browse/ARROW-8731 > Project: Apache Arrow > Issue Type: Bug > Environment: Python Environment on the worker and driver > - jupyter==1.0.0 > - pandas==1.0.3 > - pyarrow==0.14.0 > - pyspark==2.4.0 > - py4j==0.10.7 > - pyarrow==0.14.0 >Reporter: Andrew Redd >Priority: Blocker > > I'm getting the following error when calling toPandas on a spark dataframe. I > imagine my pyspark and pyarrow versions are clashing somehow but I haven't > found this same issue by anyone else online > * This is a blocker to our use of pyarrow on a project > > {code:java} > --- > TypeError Traceback (most recent call last) > in > > 1 df.limit(100).toPandas() > /venv/lib/python3.6/site-packages/pyspark/sql/dataframe.py in toPandas(self) >2119 _check_dataframe_localize_timestamps >2120 import pyarrow > -> 2121 batches = self._collectAsArrow() >2122 if len(batches) > 0: >2123 table = pyarrow.Table.from_batches(batches) > /venv/lib/python3.6/site-packages/pyspark/sql/dataframe.py in > _collectAsArrow(self) >2177 with SCCallSiteSync(self._sc) as css: >2178 sock_info = self._jdf.collectAsArrowToPython() > -> 2179 return list(_load_from_socket(sock_info, > ArrowStreamSerializer())) >2180 >2181 > ## > /venv/lib/python3.6/site-packages/pyspark/rdd.py in > _load_from_socket(sock_info, serializer) > 142 > 143 def _load_from_socket(sock_info, serializer): > --> 144 (sockfile, sock) = local_connect_and_auth(*sock_info) > 145 # The RDD materialization time is 
unpredicable, if we set a > timeout for socket reading > 146 # operation, it will very possibly fail. See SPARK-18281. > TypeError: local_connect_and_auth() takes 2 positional arguments but 3 were > given > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
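Per the compatibility instructions linked in the comment above, older Spark (2.3.x/2.4.x) can work with pyarrow >= 0.15.0 by restoring the legacy Arrow IPC format through an environment variable. The snippet below shows the documented flag; in practice it must be set for both the driver and the executors (e.g. in `conf/spark-env.sh` or via `spark.executorEnv`), which this minimal sketch does not cover:

```python
import os

# Documented workaround for pyarrow >= 0.15.0 with Spark 2.3.x/2.4.x:
# pyarrow 0.15.0 changed the Arrow IPC stream format, and this flag
# tells pyarrow to keep emitting the pre-0.15 format that older
# Spark versions expect.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
```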
[jira] [Resolved] (ARROW-7610) [Java] Finish support for 64 bit int allocations
[ https://issues.apache.org/jira/browse/ARROW-7610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-7610. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 6323 [https://github.com/apache/arrow/pull/6323] > [Java] Finish support for 64 bit int allocations > - > > Key: ARROW-7610 > URL: https://issues.apache.org/jira/browse/ARROW-7610 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Micah Kornfield >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 8.5h > Remaining Estimate: 0h > > 1. Add an allocator capable of allocating larger than 2GB of data. > 2. Do an end-to-end round trip on a larger vector/record batch size. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8386) [Python] pyarrow.jvm raises error for empty Arrays
[ https://issues.apache.org/jira/browse/ARROW-8386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-8386. - Fix Version/s: 0.17.0 Resolution: Fixed Issue resolved by pull request 6889 [https://github.com/apache/arrow/pull/6889] > [Python] pyarrow.jvm raises error for empty Arrays > -- > > Key: ARROW-8386 > URL: https://issues.apache.org/jira/browse/ARROW-8386 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.16.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > Fix For: 0.17.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > In the pyarrow.jvm module, when there is an empty array in Java, trying to > create it in python raises a ValueError. This is because for an empty array, > Java returns an empty list of buffers, then pyarrow.jvm attempts to create > the array with pa.Array.from_buffers with an empty list. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8386) [Python] pyarrow.jvm raises error for empty Arrays
Bryan Cutler created ARROW-8386: --- Summary: [Python] pyarrow.jvm raises error for empty Arrays Key: ARROW-8386 URL: https://issues.apache.org/jira/browse/ARROW-8386 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.16.0 Reporter: Bryan Cutler Assignee: Bryan Cutler In the pyarrow.jvm module, when there is an empty array in Java, trying to create it in python raises a ValueError. This is because for an empty array, Java returns an empty list of buffers, then pyarrow.jvm attempts to create the array with pa.Array.from_buffers with an empty list. -- This message was sent by Atlassian Jira (v8.3.4#803005)
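One way to handle the empty-buffer case is to pad the buffer list before calling `pa.Array.from_buffers`. The helper below is a hypothetical sketch of that idea in plain Python (not pyarrow.jvm's actual fix, and whether `None` placeholder buffers are acceptable depends on the array type):

```python
def pad_buffers(buffers, expected_count):
    """Sketch of the fix idea: when Java hands back no buffers for an
    empty array, pad the list so Array.from_buffers still receives the
    number of buffers the type expects (hypothetical helper)."""
    buffers = list(buffers)
    return buffers + [None] * (expected_count - len(buffers))


print(pad_buffers([], 2))  # [None, None]
```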
[jira] [Commented] (ARROW-5649) [Integration][C++][Java] Create round trip integration test for extension types
[ https://issues.apache.org/jira/browse/ARROW-5649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17052379#comment-17052379 ] Bryan Cutler commented on ARROW-5649: - [~npr] I think in the scope of our current integration tests, yes this is effectively the same. It would be nice to test the addition steps of creating the extension type array across implementations and verifying it works as expected. Not sure how that would be done in our existing integration testing framework though. > [Integration][C++][Java] Create round trip integration test for extension > types > --- > > Key: ARROW-5649 > URL: https://issues.apache.org/jira/browse/ARROW-5649 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Integration, Java >Affects Versions: 0.16.0 >Reporter: Micah Kornfield >Priority: Major > Fix For: 1.0.0 > > > With Java and C++ code merged we should verify round-trip of the type. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7966) [Integration][Flight][C++] Client should verify each batch independently
[ https://issues.apache.org/jira/browse/ARROW-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated ARROW-7966: Component/s: Integration FlightRPC > [Integration][Flight][C++] Client should verify each batch independently > > > Key: ARROW-7966 > URL: https://issues.apache.org/jira/browse/ARROW-7966 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, FlightRPC, Integration >Reporter: Bryan Cutler >Priority: Major > > Currently the C++ Flight test client in {{test_integration_client.cc}} reads > all batches from JSON into a Table, reads all batches in the flight stream > from the server into a Table, then compares the Tables for equality. This is > potentially a problem because a record batch might have specific information > that is then lost in the conversion to a Table. For example, if the server > sends empty batches, the resulting Table would not be different from one with > no empty batches. > Instead, the client should check each record batch from the JSON file against > each record batch from the server independently. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7966) [Integration][Flight][C++] Client should verify each batch independently
Bryan Cutler created ARROW-7966: --- Summary: [Integration][Flight][C++] Client should verify each batch independently Key: ARROW-7966 URL: https://issues.apache.org/jira/browse/ARROW-7966 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Bryan Cutler Currently the C++ Flight test client in {{test_integration_client.cc}} reads all batches from JSON into a Table, reads all batches in the flight stream from the server into a Table, then compares the Tables for equality. This is potentially a problem because a record batch might have specific information that is then lost in the conversion to a Table. For example, if the server sends empty batches, the resulting Table would not be different from one with no empty batches. Instead, the client should check each record batch from the JSON file against each record batch from the server independently. -- This message was sent by Atlassian Jira (v8.3.4#803005)
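The batch-by-batch check described above can be sketched as follows. Batches are modeled here as simple tuples rather than real Arrow record batches, and `batches_match` is a hypothetical name; the point is that pairwise comparison catches a dropped or extra empty batch that table-level equality would hide:

```python
def batches_match(json_batches, stream_batches):
    """Compare record batches pairwise instead of concatenating both
    sides into tables, so empty batches are not silently ignored.
    Batches are modeled as (schema, num_rows, columns) tuples."""
    if len(json_batches) != len(stream_batches):
        return False
    return all(a == b for a, b in zip(json_batches, stream_batches))


golden = [("s", 2, ((1, 2),)), ("s", 0, ((),))]  # includes an empty batch
received = [("s", 2, ((1, 2),))]                 # server dropped the empty one
print(batches_match(golden, received))  # False, though table equality would pass
```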
[jira] [Created] (ARROW-7933) [Java][Flight][Tests] Add roundtrip tests for Java Flight Test Client
Bryan Cutler created ARROW-7933: --- Summary: [Java][Flight][Tests] Add roundtrip tests for Java Flight Test Client Key: ARROW-7933 URL: https://issues.apache.org/jira/browse/ARROW-7933 Project: Apache Arrow Issue Type: Improvement Components: FlightRPC, Java Reporter: Bryan Cutler There should be some built-in roundtrip tests for Java Flight IntegrationTestClient -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7899) [Integration][Java] null type integration test
[ https://issues.apache.org/jira/browse/ARROW-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-7899. - Resolution: Fixed Issue resolved by pull request 6476 [https://github.com/apache/arrow/pull/6476] > [Integration][Java] null type integration test > -- > > Key: ARROW-7899 > URL: https://issues.apache.org/jira/browse/ARROW-7899 > Project: Apache Arrow > Issue Type: Bug > Components: Integration, Java >Reporter: Neal Richardson >Assignee: Bryan Cutler >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > From [https://github.com/apache/arrow/pull/6368] > > h3. *[lidavidm|https://github.com/lidavidm]* commented [2 days > ago|https://github.com/apache/arrow/pull/6368#issuecomment-587593218] > |[https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java#L208] > > If I'm not mistaken, this means that we only compare the data fully if > there's actual data in both JSON and in the Arrow file? > Though the Flight test is also potentially wrong: > > [https://github.com/apache/arrow/blob/master/java/flight/flight-core/src/main/java/org/apache/arrow/flight/example/integration/IntegrationTestClient.java#L166-L173] > It only compares the last batch sent over the wire.| -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-7899) [Integration][Java] null type integration test
[ https://issues.apache.org/jira/browse/ARROW-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned ARROW-7899: --- Assignee: Bryan Cutler > [Integration][Java] null type integration test > -- > > Key: ARROW-7899 > URL: https://issues.apache.org/jira/browse/ARROW-7899 > Project: Apache Arrow > Issue Type: Bug > Components: Integration, Java >Reporter: Neal Richardson >Assignee: Bryan Cutler >Priority: Blocker > Fix For: 1.0.0 > > > From [https://github.com/apache/arrow/pull/6368] > > h3. *[lidavidm|https://github.com/lidavidm]* commented [2 days > ago|https://github.com/apache/arrow/pull/6368#issuecomment-587593218] > |[https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java#L208] > > If I'm not mistaken, this means that we only compare the data fully if > there's actual data in both JSON and in the Arrow file? > Though the Flight test is also potentially wrong: > > [https://github.com/apache/arrow/blob/master/java/flight/flight-core/src/main/java/org/apache/arrow/flight/example/integration/IntegrationTestClient.java#L166-L173] > It only compares the last batch sent over the wire.| -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7899) [Integration][Java] null type integration test
[ https://issues.apache.org/jira/browse/ARROW-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041330#comment-17041330 ] Bryan Cutler commented on ARROW-7899: - I can look into this > [Integration][Java] null type integration test > -- > > Key: ARROW-7899 > URL: https://issues.apache.org/jira/browse/ARROW-7899 > Project: Apache Arrow > Issue Type: Bug > Components: Integration, Java >Reporter: Neal Richardson >Priority: Blocker > Fix For: 1.0.0 > > > From [https://github.com/apache/arrow/pull/6368] > > h3. *[lidavidm|https://github.com/lidavidm]* commented [2 days > ago|https://github.com/apache/arrow/pull/6368#issuecomment-587593218] > |[https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java#L208] > > If I'm not mistaken, this means that we only compare the data fully if > there's actual data in both JSON and in the Arrow file? > Though the Flight test is also potentially wrong: > > [https://github.com/apache/arrow/blob/master/java/flight/flight-core/src/main/java/org/apache/arrow/flight/example/integration/IntegrationTestClient.java#L166-L173] > It only compares the last batch sent over the wire.| -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7899) [Integration][Java] null type integration test
[ https://issues.apache.org/jira/browse/ARROW-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated ARROW-7899: Description: From [https://github.com/apache/arrow/pull/6368] h3. *[lidavidm|https://github.com/lidavidm]* commented [2 days ago|https://github.com/apache/arrow/pull/6368#issuecomment-587593218] |[https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java#L208] If I'm not mistaken, this means that we only compare the data fully if there's actual data in both JSON and in the Arrow file? Though the Flight test is also potentially wrong: [https://github.com/apache/arrow/blob/master/java/flight/flight-core/src/main/java/org/apache/arrow/flight/example/integration/IntegrationTestClient.java#L166-L173] It only compares the last batch sent over the wire.| > [Integration][Java] null type integration test > -- > > Key: ARROW-7899 > URL: https://issues.apache.org/jira/browse/ARROW-7899 > Project: Apache Arrow > Issue Type: Bug > Components: Integration, Java >Reporter: Neal Richardson >Priority: Blocker > Fix For: 1.0.0 > > > From [https://github.com/apache/arrow/pull/6368] > > h3. *[lidavidm|https://github.com/lidavidm]* commented [2 days > ago|https://github.com/apache/arrow/pull/6368#issuecomment-587593218] > |[https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java#L208] > > If I'm not mistaken, this means that we only compare the data fully if > there's actual data in both JSON and in the Arrow file? > Though the Flight test is also potentially wrong: > > [https://github.com/apache/arrow/blob/master/java/flight/flight-core/src/main/java/org/apache/arrow/flight/example/integration/IntegrationTestClient.java#L166-L173] > It only compares the last batch sent over the wire.| -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7467) [Java] ComplexCopier does incorrect copy for Map nullable info
[ https://issues.apache.org/jira/browse/ARROW-7467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-7467. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 6094 [https://github.com/apache/arrow/pull/6094] > [Java] ComplexCopier does incorrect copy for Map nullable info > -- > > Key: ARROW-7467 > URL: https://issues.apache.org/jira/browse/ARROW-7467 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 2.5h > Remaining Estimate: 0h > > The {{MapVector}} and its 'value' vector are nullable, and its > {{structVector}} and 'key' vector are non-nullable. > However, the {{MapVector}} generated by ComplexCopier has all nullable fields > which is not correct. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7405) [Java] ListVector isEmpty API is incorrect
[ https://issues.apache.org/jira/browse/ARROW-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated ARROW-7405: Priority: Minor (was: Major) > [Java] ListVector isEmpty API is incorrect > -- > > Key: ARROW-7405 > URL: https://issues.apache.org/jira/browse/ARROW-7405 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 3h 20m > Remaining Estimate: 0h > > Currently the {{isEmpty}} API always returns false in > {{BaseRepeatedValueVector}}, and its subclass {{ListVector}} does not > override this method. > This leads to incorrect results; for example, a {{ListVector}} with data > [1,2], null, [], [5,6] should get [false, false, true, false] with this API, > but now it would return [false, false, false, false]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7405) [Java] ListVector isEmpty API is incorrect
[ https://issues.apache.org/jira/browse/ARROW-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-7405. - Resolution: Fixed Resolved from https://github.com/apache/arrow/pull/6044 > [Java] ListVector isEmpty API is incorrect > -- > > Key: ARROW-7405 > URL: https://issues.apache.org/jira/browse/ARROW-7405 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 3h 10m > Remaining Estimate: 0h > > Currently the {{isEmpty}} API always returns false in > {{BaseRepeatedValueVector}}, and its subclass {{ListVector}} does not > override this method. > This leads to incorrect results; for example, a {{ListVector}} with data > [1,2], null, [], [5,6] should get [false, false, true, false] with this API, > but now it would return [false, false, false, false]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7405) [Java] ListVector isEmpty API is incorrect
[ https://issues.apache.org/jira/browse/ARROW-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated ARROW-7405: Fix Version/s: 1.0.0 > [Java] ListVector isEmpty API is incorrect > -- > > Key: ARROW-7405 > URL: https://issues.apache.org/jira/browse/ARROW-7405 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 3h 10m > Remaining Estimate: 0h > > Currently the {{isEmpty}} API always returns false in > {{BaseRepeatedValueVector}}, and its subclass {{ListVector}} does not > override this method. > This leads to incorrect results; for example, a {{ListVector}} with data > [1,2], null, [], [5,6] should get [false, false, true, false] with this API, > but now it would return [false, false, false, false]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
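The intended semantics can be sketched in plain Python from a list vector's offsets and validity buffer (a hypothetical helper for illustration, not the Arrow Java code): a null list is not "empty"; only a valid zero-length list is.

```python
def list_is_empty(offsets, validity):
    """offsets: len(lists)+1 positions into the flat child values;
    validity[i] is False for a null list."""
    return [
        validity[i] and offsets[i] == offsets[i + 1]
        for i in range(len(offsets) - 1)
    ]

# [1,2], null, [], [5,6] laid out against a flat child array [1,2,5,6]
offsets = [0, 2, 2, 2, 4]
validity = [True, False, True, True]
print(list_is_empty(offsets, validity))  # [False, False, True, False]
```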
[jira] [Resolved] (ARROW-7770) [Release] Archery does not use correct integration test args
[ https://issues.apache.org/jira/browse/ARROW-7770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-7770. - Resolution: Duplicate > [Release] Archery does not use correct integration test args > > > Key: ARROW-7770 > URL: https://issues.apache.org/jira/browse/ARROW-7770 > Project: Apache Arrow > Issue Type: Bug > Components: Archery >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > > When using release verification script and selecting integration tests, > Archery ignores selected tests and runs all tests. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7770) [Release] Archery does not use correct integration test args
Bryan Cutler created ARROW-7770: --- Summary: [Release] Archery does not use correct integration test args Key: ARROW-7770 URL: https://issues.apache.org/jira/browse/ARROW-7770 Project: Apache Arrow Issue Type: Bug Components: Archery Reporter: Bryan Cutler Assignee: Bryan Cutler When using release verification script and selecting integration tests, Archery ignores selected tests and runs all tests. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7723) [Python] StructArray timestamp type with timezone to_pandas convert error
[ https://issues.apache.org/jira/browse/ARROW-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026977#comment-17026977 ] Bryan Cutler commented on ARROW-7723: - Thanks for the explaination [~wesm] , makes sense > [Python] StructArray timestamp type with timezone to_pandas convert error > -- > > Key: ARROW-7723 > URL: https://issues.apache.org/jira/browse/ARROW-7723 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Bryan Cutler >Assignee: Wes McKinney >Priority: Blocker > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > When a {{StructArray}} has a child that is a timestamp with a timezone, the > {{to_pandas}} conversion outputs an int64 instead of a timestamp > {code:java} > In [1]: import pyarrow as pa >...: import pandas as pd >...: arr = pa.array([{'start': pd.Timestamp.now(), 'end': > pd.Timestamp.now()}]) >...: > > In [2]: arr.to_pandas() > > Out[2]: > 0{'end': 2020-01-29 11:38:02.792681, 'start': 2... > dtype: object > In [3]: ts = pd.Timestamp.now() > > In [4]: arr2 = pa.array([ts], type=pa.timestamp('us', tz='America/New_York')) > > In [5]: arr2.to_pandas() > > Out[5]: > 0 2020-01-29 06:38:47.848944-05:00 > dtype: datetime64[ns, America/New_York] > In [6]: arr = pa.StructArray.from_arrays([arr2, arr2], ['start', 'stop']) > > In [7]: arr.to_pandas() > > Out[7]: > 0{'start': 1580297927848944000, 'stop': 1580297... > dtype: object > {code} > from https://github.com/apache/arrow/pull/6312 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7723) [Python] StructArray timestamp type with timezone to_pandas convert error
[ https://issues.apache.org/jira/browse/ARROW-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated ARROW-7723: Priority: Blocker (was: Major) > [Python] StructArray timestamp type with timezone to_pandas convert error > -- > > Key: ARROW-7723 > URL: https://issues.apache.org/jira/browse/ARROW-7723 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Bryan Cutler >Priority: Blocker > > When a {{StructArray}} has a child that is a timestamp with a timezone, the > {{to_pandas}} conversion outputs an int64 instead of a timestamp > {code:java} > In [1]: import pyarrow as pa >...: import pandas as pd >...: arr = pa.array([{'start': pd.Timestamp.now(), 'end': > pd.Timestamp.now()}]) >...: > > In [2]: arr.to_pandas() > > Out[2]: > 0{'end': 2020-01-29 11:38:02.792681, 'start': 2... > dtype: object > In [3]: ts = pd.Timestamp.now() > > In [4]: arr2 = pa.array([ts], type=pa.timestamp('us', tz='America/New_York')) > > In [5]: arr2.to_pandas() > > Out[5]: > 0 2020-01-29 06:38:47.848944-05:00 > dtype: datetime64[ns, America/New_York] > In [6]: arr = pa.StructArray.from_arrays([arr2, arr2], ['start', 'stop']) > > In [7]: arr.to_pandas() > > Out[7]: > 0{'start': 1580297927848944000, 'stop': 1580297... > dtype: object > {code} > from https://github.com/apache/arrow/pull/6312 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7723) [Python] StructArray timestamp type with timezone to_pandas convert error
[ https://issues.apache.org/jira/browse/ARROW-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026233#comment-17026233 ] Bryan Cutler commented on ARROW-7723: - This is a regression, so marking as blocker for now. > [Python] StructArray timestamp type with timezone to_pandas convert error > -- > > Key: ARROW-7723 > URL: https://issues.apache.org/jira/browse/ARROW-7723 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Bryan Cutler >Priority: Blocker > Fix For: 0.16.0 > > > When a {{StructArray}} has a child that is a timestamp with a timezone, the > {{to_pandas}} conversion outputs an int64 instead of a timestamp > {code:java} > In [1]: import pyarrow as pa >...: import pandas as pd >...: arr = pa.array([{'start': pd.Timestamp.now(), 'end': > pd.Timestamp.now()}]) >...: > > In [2]: arr.to_pandas() > > Out[2]: > 0{'end': 2020-01-29 11:38:02.792681, 'start': 2... > dtype: object > In [3]: ts = pd.Timestamp.now() > > In [4]: arr2 = pa.array([ts], type=pa.timestamp('us', tz='America/New_York')) > > In [5]: arr2.to_pandas() > > Out[5]: > 0 2020-01-29 06:38:47.848944-05:00 > dtype: datetime64[ns, America/New_York] > In [6]: arr = pa.StructArray.from_arrays([arr2, arr2], ['start', 'stop']) > > In [7]: arr.to_pandas() > > Out[7]: > 0{'start': 1580297927848944000, 'stop': 1580297... > dtype: object > {code} > from https://github.com/apache/arrow/pull/6312 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7723) [Python] StructArray timestamp type with timezone to_pandas convert error
[ https://issues.apache.org/jira/browse/ARROW-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated ARROW-7723: Fix Version/s: 0.16.0 > [Python] StructArray timestamp type with timezone to_pandas convert error > -- > > Key: ARROW-7723 > URL: https://issues.apache.org/jira/browse/ARROW-7723 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Bryan Cutler >Priority: Blocker > Fix For: 0.16.0 > > > When a {{StructArray}} has a child that is a timestamp with a timezone, the > {{to_pandas}} conversion outputs an int64 instead of a timestamp > {code:java} > In [1]: import pyarrow as pa >...: import pandas as pd >...: arr = pa.array([{'start': pd.Timestamp.now(), 'end': > pd.Timestamp.now()}]) >...: > > In [2]: arr.to_pandas() > > Out[2]: > 0{'end': 2020-01-29 11:38:02.792681, 'start': 2... > dtype: object > In [3]: ts = pd.Timestamp.now() > > In [4]: arr2 = pa.array([ts], type=pa.timestamp('us', tz='America/New_York')) > > In [5]: arr2.to_pandas() > > Out[5]: > 0 2020-01-29 06:38:47.848944-05:00 > dtype: datetime64[ns, America/New_York] > In [6]: arr = pa.StructArray.from_arrays([arr2, arr2], ['start', 'stop']) > > In [7]: arr.to_pandas() > > Out[7]: > 0{'start': 1580297927848944000, 'stop': 1580297... > dtype: object > {code} > from https://github.com/apache/arrow/pull/6312 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7723) [Python] StructArray timestamp type with timezone to_pandas convert error
Bryan Cutler created ARROW-7723: --- Summary: [Python] StructArray timestamp type with timezone to_pandas convert error Key: ARROW-7723 URL: https://issues.apache.org/jira/browse/ARROW-7723 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Bryan Cutler When a {{StructArray}} has a child that is a timestamp with a timezone, the {{to_pandas}} conversion outputs an int64 instead of a timestamp {code:java} In [1]: import pyarrow as pa ...: import pandas as pd ...: arr = pa.array([{'start': pd.Timestamp.now(), 'end': pd.Timestamp.now()}]) ...: In [2]: arr.to_pandas() Out[2]: 0{'end': 2020-01-29 11:38:02.792681, 'start': 2... dtype: object In [3]: ts = pd.Timestamp.now() In [4]: arr2 = pa.array([ts], type=pa.timestamp('us', tz='America/New_York')) In [5]: arr2.to_pandas() Out[5]: 0 2020-01-29 06:38:47.848944-05:00 dtype: datetime64[ns, America/New_York] In [6]: arr = pa.StructArray.from_arrays([arr2, arr2], ['start', 'stop']) In [7]: arr.to_pandas() Out[7]: 0{'start': 1580297927848944000, 'stop': 1580297... dtype: object {code} from https://github.com/apache/arrow/pull/6312 -- This message was sent by Atlassian Jira (v8.3.4#803005)
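The int64 leaked in Out[7] is the same instant as Out[5], just surfaced as raw nanoseconds since the epoch; a stdlib-only conversion confirms it (06:38:47.848944-05:00 in America/New_York is 11:38:47.848944 UTC):

```python
from datetime import datetime, timedelta, timezone

# Convert the raw int64 from Out[7] by hand: split into whole seconds and
# a nanosecond remainder, then build a tz-aware datetime.
ns = 1580297927848944000
secs, ns_rem = divmod(ns, 1_000_000_000)
dt = datetime.fromtimestamp(secs, tz=timezone.utc) + timedelta(
    microseconds=ns_rem // 1000)
print(dt.isoformat())  # 2020-01-29T11:38:47.848944+00:00
```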
[jira] [Commented] (ARROW-7709) [Python] Conversion from Table Column to Pandas loses name for Timestamps
[ https://issues.apache.org/jira/browse/ARROW-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025493#comment-17025493 ] Bryan Cutler commented on ARROW-7709: - From [https://github.com/apache/arrow/pull/6294#issuecomment-579468239], Joris knows where this issue is and said he could fix this soon. > [Python] Conversion from Table Column to Pandas loses name for Timestamps > - > > Key: ARROW-7709 > URL: https://issues.apache.org/jira/browse/ARROW-7709 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Bryan Cutler >Priority: Major > > When converting a Table timestamp column to Pandas, the name of the column is > lost in the resulting series. > {code:java} > In [23]: a1 = pa.array([pd.Timestamp.now()]) > > In [24]: a2 = pa.array([1]) > > In [25]: t = pa.Table.from_arrays([a1, a2], ['ts', 'a']) > > In [26]: for c in t: > ...: print(c.to_pandas()) > ...: > > 0 2020-01-28 13:17:26.738708 > dtype: datetime64[ns] > 01 > Name: a, dtype: int64 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7693) [CI] Fix test-conda-python-3.7-spark-master nightly errors
[ https://issues.apache.org/jira/browse/ARROW-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-7693. - Fix Version/s: 0.16.0 Resolution: Fixed Issue resolved by pull request 6294 [https://github.com/apache/arrow/pull/6294] > [CI] Fix test-conda-python-3.7-spark-master nightly errors > -- > > Key: ARROW-7693 > URL: https://issues.apache.org/jira/browse/ARROW-7693 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Spark master renamed some tests, need to update -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7709) [Python] Conversion from Table Column to Pandas loses name for Timestamps
Bryan Cutler created ARROW-7709: --- Summary: [Python] Conversion from Table Column to Pandas loses name for Timestamps Key: ARROW-7709 URL: https://issues.apache.org/jira/browse/ARROW-7709 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Bryan Cutler When converting a Table timestamp column to Pandas, the name of the column is lost in the resulting series. {code:java} In [23]: a1 = pa.array([pd.Timestamp.now()]) In [24]: a2 = pa.array([1]) In [25]: t = pa.Table.from_arrays([a1, a2], ['ts', 'a']) In [26]: for c in t: ...: print(c.to_pandas()) ...: 0 2020-01-28 13:17:26.738708 dtype: datetime64[ns] 01 Name: a, dtype: int64 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7693) [CI] Fix test-conda-python-3.7-spark-master nightly errors
Bryan Cutler created ARROW-7693: --- Summary: [CI] Fix test-conda-python-3.7-spark-master nightly errors Key: ARROW-7693 URL: https://issues.apache.org/jira/browse/ARROW-7693 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration Reporter: Bryan Cutler Assignee: Bryan Cutler Spark master renamed some tests, need to update -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7596) [Python] Only apply zero-copy DataFrame block optimizations when split_blocks=True
[ https://issues.apache.org/jira/browse/ARROW-7596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023156#comment-17023156 ] Bryan Cutler commented on ARROW-7596: - Linking some discussion on the mailing list regarding pyspark behavior with this option: [https://www.mail-archive.com/dev@arrow.apache.org/msg17008.html] {noformat} Joris Van den Bossche Fri, 24 Jan 2020 02:11:05 -0800 Hi Bryan, For the case that the column is no timestamp and was not modified: I don't think it will take copies of the full dataframe by assigning columns in a loop like that. But it is still doing work (it will copy data for that column into the array holding those data for 2D blocks), and which can easily be avoided I think by only assigning back when the column was actually modified (eg by moving the is_datetime64tz_dtype inline in the loop iterating through all columns, so you can only write back if actually having tz-aware data).Further, even if you do the above to avoid writing back to the dataframe when not needed, I am not sure you should directly try to use the new zero-copy feature of the Table.to_pandas conversion (with split_blocks=True). It depends very much on what further happens with the converted dataframe. Once you do some operations in pandas, those splitted blocks will get combined (resulting in a memory copy then), and it also means you can't modify the dataframe (if this dataframe is used in python UDFs, it might limit what can be done in those UDFs. Just guessing here, I don't know the pyspark code well enough). Joris On Thu, 23 Jan 2020 at 21:03, Bryan Cutler wrote: > Thanks for investigating this and the quick fix Joris and Wes! I just have > a couple questions about the behavior observed here. The pyspark code > assigns either the same series back to the pandas.DataFrame or makes some > modifications if it is a timestamp. 
In the case there are no timestamps, is > this potentially making extra copies or will it be unable to take advantage > of new zero-copy features in pyarrow? For the case of having timestamp > columns that need to be modified, is there a more efficient way to create a > new dataframe with only copies of the modified series? Thanks! > > Bryan {noformat} > [Python] Only apply zero-copy DataFrame block optimizations when > split_blocks=True > -- > > Key: ARROW-7596 > URL: https://issues.apache.org/jira/browse/ARROW-7596 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Blocker > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Follow up to ARROW-3789 since there is downstream code that assumes that the > DataFrame produced always has all mutable blocks -- This message was sent by Atlassian Jira (v8.3.4#803005)
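Joris's suggestion above—write a column back only when it was actually modified—can be sketched with a dict standing in for the DataFrame (hypothetical names; the real code lives in pyspark's serializers):

```python
def localize_timestamp_columns(columns, is_tz_timestamp, localize):
    """columns: dict of name -> column data. Writes back only the columns
    the predicate flags, so unmodified columns keep whatever zero-copy
    backing they have. Returns the number of write-backs."""
    writes = 0
    for name, col in columns.items():
        if is_tz_timestamp(col):          # inline check, per the suggestion
            columns[name] = localize(col)
            writes += 1
    return writes

cols = {"a": [1, 2], "ts": ["2020-01-24T10:00:00+00:00"]}
n = localize_timestamp_columns(
    cols,
    is_tz_timestamp=lambda c: all(isinstance(v, str) and "+" in v for v in c),
    localize=lambda c: [v.replace("+00:00", "") for v in c])
print(n)  # 1 -- only the timestamp column was touched
```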
[jira] [Resolved] (ARROW-7472) [Java] Fix some incorrect behavior in UnionListWriter
[ https://issues.apache.org/jira/browse/ARROW-7472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-7472. - Fix Version/s: 0.16.0 Resolution: Fixed Issue resolved by pull request 6101 [https://github.com/apache/arrow/pull/6101] > [Java] Fix some incorrect behavior in UnionListWriter > - > > Key: ARROW-7472 > URL: https://issues.apache.org/jira/browse/ARROW-7472 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Major > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > Currently the {{UnionListWriter/UnionFixedSizeListWriter}} {{getField/close}} > APIs seems incorrect. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-4856) [Integration] The spark integration test exceeds the maximum time limit on travis
[ https://issues.apache.org/jira/browse/ARROW-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-4856. - Resolution: Fixed Closing as the Spark testing has been passing > [Integration] The spark integration test exceeds the maximum time limit on > travis > - > > Key: ARROW-4856 > URL: https://issues.apache.org/jira/browse/ARROW-4856 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Krisztian Szucs >Priority: Major > > See build https://travis-ci.org/kszucs/crossbow/builds/505179756 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-7502) [Integration] Remove Spark Integration patch that not needed anymore
[ https://issues.apache.org/jira/browse/ARROW-7502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-7502. - Fix Version/s: 0.16.0 Resolution: Fixed Issue resolved by pull request 6129 [https://github.com/apache/arrow/pull/6129] > [Integration] Remove Spark Integration patch that not needed anymore > > > Key: ARROW-7502 > URL: https://issues.apache.org/jira/browse/ARROW-7502 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Trivial > Labels: pull-request-available > Fix For: 0.16.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Apache Spark master has been updated to work with Arrow 0.15.1 after the > binary protocol change and patching Spark master is no longer necessary to > build with current Arrow, so the previous patch can be removed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7502) [Integration] Remove Spark Integration patch that not needed anymore
Bryan Cutler created ARROW-7502: --- Summary: [Integration] Remove Spark Integration patch that not needed anymore Key: ARROW-7502 URL: https://issues.apache.org/jira/browse/ARROW-7502 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration Reporter: Bryan Cutler Assignee: Bryan Cutler Apache Spark master has been updated to work with Arrow 0.15.1 after the binary protocol change and patching Spark master is no longer necessary to build with current Arrow, so the previous patch can be removed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-4856) [Integration] The spark integration test exceeds the maximum time limit on travis
[ https://issues.apache.org/jira/browse/ARROW-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009109#comment-17009109 ] Bryan Cutler commented on ARROW-4856: - I believe this is resolved, is that correct [~kszucs] ? > [Integration] The spark integration test exceeds the maximum time limit on > travis > - > > Key: ARROW-4856 > URL: https://issues.apache.org/jira/browse/ARROW-4856 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Krisztian Szucs >Priority: Major > > See build https://travis-ci.org/kszucs/crossbow/builds/505179756 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-2524) [Java] [TEST] Run spark integration tests regularly
[ https://issues.apache.org/jira/browse/ARROW-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-2524. - Resolution: Fixed Closing this as Spark integration tests are being run nightly > [Java] [TEST] Run spark integration tests regularly > --- > > Key: ARROW-2524 > URL: https://issues.apache.org/jira/browse/ARROW-2524 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Krisztian Szucs >Priority: Major > > For example nightly builds, along with dask and hdfs tests, see > https://github.com/apache/arrow/pull/1890 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7223) [Java] Provide default setting of io.netty.tryReflectionSetAccessible=true
[ https://issues.apache.org/jira/browse/ARROW-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987157#comment-16987157 ] Bryan Cutler commented on ARROW-7223: - Thanks [~lidavidm] , it might be the case (which seems likely) that there is not much we can do about this. At the very least, it would be good to have a record of this info for consumers of Arrow Java that also might encounter the issue. > [Java] Provide default setting of io.netty.tryReflectionSetAccessible=true > -- > > Key: ARROW-7223 > URL: https://issues.apache.org/jira/browse/ARROW-7223 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Bryan Cutler >Priority: Major > > After ARROW-3191, consumers of Arrow Java with a JDK 9 and above are required > to set the JVM property "io.netty.tryReflectionSetAccessible=true" at > startup, each time Arrow code is run, as documented at > https://github.com/apache/arrow/tree/master/java#java-properties. Not doing > this will result in the error "java.lang.UnsupportedOperationException: > sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available", > making Arrow unusable out-of-the-box. > This proposes to automatically set the property if not already set in the > following steps: > 1) check to see if the property io.netty.tryReflectionSetAccessible has been > set > 2) if not set, automatically set to "true" > 3) else if set to "false", catch the Netty error and prepend the error > message with the suggested setting of "true" -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7223) [Java] Provide default setting of io.netty.tryReflectionSetAccessible=true
Bryan Cutler created ARROW-7223: --- Summary: [Java] Provide default setting of io.netty.tryReflectionSetAccessible=true Key: ARROW-7223 URL: https://issues.apache.org/jira/browse/ARROW-7223 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Bryan Cutler After ARROW-3191, consumers of Arrow Java with a JDK 9 and above are required to set the JVM property "io.netty.tryReflectionSetAccessible=true" at startup, each time Arrow code is run, as documented at https://github.com/apache/arrow/tree/master/java#java-properties. Not doing this will result in the error "java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available", making Arrow unusable out-of-the-box. This proposes to automatically set the property if not already set in the following steps: 1) check to see if the property io.netty.tryReflectionSetAccessible has been set 2) if not set, automatically set to "true" 3) else if set to "false", catch the Netty error and prepend the error message with the suggested setting of "true" -- This message was sent by Atlassian Jira (v8.3.4#803005)
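The three proposed steps can be sketched in plain Python, with a dict standing in for the JVM system properties (the real change would use System.getProperty/System.setProperty in Arrow Java):

```python
NETTY_KEY = "io.netty.tryReflectionSetAccessible"

def ensure_netty_property(props):
    """Step 1: check whether the property is set. Step 2: if not, default
    it to 'true'. Step 3: if explicitly 'false', report it so the caller
    can prepend a hint to Netty's error message."""
    if NETTY_KEY not in props:
        props[NETTY_KEY] = "true"
        return "defaulted"
    if props[NETTY_KEY] == "false":
        return "user-disabled"
    return "already-set"

props = {}
print(ensure_netty_property(props))   # defaulted
print(props[NETTY_KEY])               # true
```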
[jira] [Commented] (ARROW-4890) [Python] Spark+Arrow Grouped pandas UDAF - read length must be positive or -1
[ https://issues.apache.org/jira/browse/ARROW-4890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976844#comment-16976844 ] Bryan Cutler commented on ARROW-4890: - Sorry, I'm not sure of any documentation with the limits. It would be great to get that down somewhere and there should be a better error message for this, but maybe it should be done on the Spark side. > [Python] Spark+Arrow Grouped pandas UDAF - read length must be positive or -1 > - > > Key: ARROW-4890 > URL: https://issues.apache.org/jira/browse/ARROW-4890 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 > Environment: Cloudera cdh5.13.3 > Cloudera Spark 2.3.0.cloudera3 >Reporter: Abdeali Kothari >Priority: Major > Attachments: Task retry fails.png, image-2019-07-04-12-03-57-002.png > > > Creating this in Arrow project as the traceback seems to suggest this is an > issue in Arrow. > Continuation from the conversation on the > https://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3CCAK7Z5T_mChuqhFDAF2U68dO=p_1nst5ajjcrg0mexo5kby9...@mail.gmail.com%3E > When I run a GROUPED_MAP UDF in Spark using PySpark, I run into the error: > {noformat} > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/serializers.py", > line 279, in load_stream > for batch in reader: > File "pyarrow/ipc.pxi", line 265, in __iter__ > File "pyarrow/ipc.pxi", line 281, in > pyarrow.lib._RecordBatchReader.read_next_batch > File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status > pyarrow.lib.ArrowIOError: read length must be positive or -1 > {noformat} > as my dataset size starts increasing that I want to group on. Here is a > reproducible code snippet where I can reproduce this. > Note: My actual dataset is much larger and has many more unique IDs and is a > valid usecase where I cannot simplify this groupby in any way. 
I have > stripped out all the logic to make this example as simple as I could. > {code:java} > import os > os.environ['PYSPARK_SUBMIT_ARGS'] = '--executor-memory 9G pyspark-shell' > import findspark > findspark.init() > import pyspark > from pyspark.sql import functions as F, types as T > import pandas as pd > spark = pyspark.sql.SparkSession.builder.getOrCreate() > pdf1 = pd.DataFrame( > [[1234567, 0.0, "abcdefghij", "2000-01-01T00:00:00.000Z"]], > columns=['df1_c1', 'df1_c2', 'df1_c3', 'df1_c4'] > ) > df1 = spark.createDataFrame(pd.concat([pdf1 for i in > range(429)]).reset_index()).drop('index') > pdf2 = pd.DataFrame( > [[1234567, 0.0, "abcdefghijklmno", "2000-01-01", "abcdefghijklmno", > "abcdefghijklmno"]], > columns=['df2_c1', 'df2_c2', 'df2_c3', 'df2_c4', 'df2_c5', 'df2_c6'] > ) > df2 = spark.createDataFrame(pd.concat([pdf2 for i in > range(48993)]).reset_index()).drop('index') > df3 = df1.join(df2, df1['df1_c1'] == df2['df2_c1'], how='inner') > def myudf(df): > return df > df4 = df3 > udf = F.pandas_udf(df4.schema, F.PandasUDFType.GROUPED_MAP)(myudf) > df5 = df4.groupBy('df1_c1').apply(udf) > print('df5.count()', df5.count()) > # df5.write.parquet('/tmp/temp.parquet', mode='overwrite') > {code} > I have tried running this on Amazon EMR with Spark 2.3.1 and 20GB RAM per > executor too. -- This message was sent by Atlassian Jira (v8.3.4#803005)
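One reading consistent with this thread (an assumption, not confirmed in the report) is that a single group serializes past 2 GiB, so its size no longer fits a signed 32-bit message-length field and comes back negative—exactly what the "read length must be positive or -1" check rejects:

```python
import struct

# A ~3 GiB payload length round-tripped through a signed 32-bit field
# wraps negative, tripping the positive-or-minus-one check.
size = 3 * 1024**3
wrapped, = struct.unpack('<i', struct.pack('<I', size & 0xFFFFFFFF))
print(wrapped)  # -1073741824
```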
[jira] [Resolved] (ARROW-6820) [C++] [Doc] [Format] Map specification and implementation inconsistent
[ https://issues.apache.org/jira/browse/ARROW-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-6820. - Resolution: Fixed Issue resolved by pull request 5821 [https://github.com/apache/arrow/pull/5821] > [C++] [Doc] [Format] Map specification and implementation inconsistent > -- > > Key: ARROW-6820 > URL: https://issues.apache.org/jira/browse/ARROW-6820 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation, Format >Reporter: Antoine Pitrou >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > In https://arrow.apache.org/docs/format/Layout.html#map-type, the map type is > specified as having a child field "pairs", itself with children "keys" and > "items". > In https://github.com/apache/arrow/blob/master/format/Schema.fbs#L60, the map > type is specified as having a child field "entry", itself with children "key" > and "value". > In the C++ implementation, a map type has a child field "entries", itself > with children "key" and "value". > In the Java implementation, a map vector also has a child field "entries", > itself with children "key" and "value" (by default). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6820) [C++] [Doc] [Format] Map specification and implementation inconsistent
[ https://issues.apache.org/jira/browse/ARROW-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned ARROW-6820: --- Assignee: Bryan Cutler > [C++] [Doc] [Format] Map specification and implementation inconsistent > -- > > Key: ARROW-6820 > URL: https://issues.apache.org/jira/browse/ARROW-6820 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation, Format >Reporter: Antoine Pitrou >Assignee: Bryan Cutler >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > In https://arrow.apache.org/docs/format/Layout.html#map-type, the map type is > specified as having a child field "pairs", itself with children "keys" and > "items". > In https://github.com/apache/arrow/blob/master/format/Schema.fbs#L60, the map > type is specified as having a child field "entry", itself with children "key" > and "value". > In the C++ implementation, a map type has a child field "entries", itself > with children "key" and "value". > In the Java implementation, a map vector also has a child field "entries", > itself with children "key" and "value" (by default). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7173) Add test to verify Map field names can be arbitrary
Bryan Cutler created ARROW-7173: --- Summary: Add test to verify Map field names can be arbitrary Key: ARROW-7173 URL: https://issues.apache.org/jira/browse/ARROW-7173 Project: Apache Arrow Issue Type: Test Components: Integration Reporter: Bryan Cutler A Map has child fields and the format spec only recommends that they be named "entries", "key", and "value" but could be named anything. Currently, integration tests for Map arrays verify the exchanged schema is equal, so the child fields are always named the same. There should be tests that use different names to verify implementations can accept this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6930) [Java] Create utility class for populating vector values used for test purpose only
[ https://issues.apache.org/jira/browse/ARROW-6930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-6930. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 5693 [https://github.com/apache/arrow/pull/5693] > [Java] Create utility class for populating vector values used for test > purpose only > --- > > Key: ARROW-6930 > URL: https://issues.apache.org/jira/browse/ARROW-6930 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Micah Kornfield >Assignee: Ji Liu >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 11h 20m > Remaining Estimate: 0h > > There is a lot of verbosity in the construction of Arrays for testing > purposes (multiple lines of setSafe(...) or set(...)). > We should start adding a utility class to make test setup clearer and more > concise; note this class should be located in arrow-vector test package and > could be used in other module's testing by adding dependency: > {{<dependency>}} > {{<groupId>org.apache.arrow</groupId>}} > {{<artifactId>arrow-vector</artifactId>}} > {{<version>${project.version}</version>}} > {{<classifier>tests</classifier>}} > {{<type>test-jar</type>}} > {{<scope>test</scope>}} > {{</dependency>}} > Usage would be something like: > {quote}try (IntVector vector = new IntVector("vector", allocator)) { > ValueVectorPopulator.setVector(vector, 1, 2, null, 4, 5); > output = doSomethingWith(input); > assertThat(output).isEqualTo(expected); > } > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6820) [C++] [Doc] [Format] Map specification and implementation inconsistent
[ https://issues.apache.org/jira/browse/ARROW-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972738#comment-16972738 ] Bryan Cutler commented on ARROW-6820: - I don't think that either C++ or Java require the Map Fields to have certain names, but the integration test framework does check that the names of all child fields are equal. So to resolve this how about I change the name from "entries" to "entry" in C++ and Java to be consistent with the format spec? > [C++] [Doc] [Format] Map specification and implementation inconsistent > -- > > Key: ARROW-6820 > URL: https://issues.apache.org/jira/browse/ARROW-6820 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation, Format >Reporter: Antoine Pitrou >Priority: Blocker > Fix For: 1.0.0 > > > In https://arrow.apache.org/docs/format/Layout.html#map-type, the map type is > specified as having a child field "pairs", itself with children "keys" and > "items". > In https://github.com/apache/arrow/blob/master/format/Schema.fbs#L60, the map > type is specified as having a child field "entry", itself with children "key" > and "value". > In the C++ implementation, a map type has a child field "entries", itself > with children "key" and "value". > In the Java implementation, a map vector also has a child field "entries", > itself with children "key" and "value" (by default). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6820) [C++] [Doc] [Format] Map specification and implementation inconsistent
[ https://issues.apache.org/jira/browse/ARROW-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967123#comment-16967123 ] Bryan Cutler commented on ARROW-6820: - I don't have a strong preference for specific naming, but we should try to be consistent. In C++ it is very confusing because many APIs use "key" and "item": when MapArray is viewed as a list of structs, the term "value" would mean an element in the struct array. Also, there could be conflicts because "value" is already used in List APIs. I think we should stick with the terminology from Schema.fbs where map type is specified as having a child field "entry", itself with children "key" and "value". In C++ we could work around the API by overriding and renaming, e.g. {code}std::shared_ptr<Array> MapArray::list_values() { return ListArray::values(); }{code} > [C++] [Doc] [Format] Map specification and implementation inconsistent > -- > > Key: ARROW-6820 > URL: https://issues.apache.org/jira/browse/ARROW-6820 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation, Format >Reporter: Antoine Pitrou >Priority: Blocker > Fix For: 1.0.0 > > > In https://arrow.apache.org/docs/format/Layout.html#map-type, the map type is > specified as having a child field "pairs", itself with children "keys" and > "items". > In https://github.com/apache/arrow/blob/master/format/Schema.fbs#L60, the map > type is specified as having a child field "entry", itself with children "key" > and "value". > In the C++ implementation, a map type has a child field "entries", itself > with children "key" and "value". -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-3850) [Python] Support MapType and StructType for enhanced PySpark integration
[ https://issues.apache.org/jira/browse/ARROW-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952988#comment-16952988 ] Bryan Cutler commented on ARROW-3850: - I made ARROW-6904 to add MapArray to Arrow Python; once that is done it can be implemented in PySpark, and we can close this once it passes the Spark integration tests. Nested structs require some other issues to be worked out, and there are other JIRAs for that. > [Python] Support MapType and StructType for enhanced PySpark integration > > > Key: ARROW-3850 > URL: https://issues.apache.org/jira/browse/ARROW-3850 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Affects Versions: 0.11.1 >Reporter: Florian Wilhelm >Priority: Major > Fix For: 1.0.0 > > > It would be great to support MapType and (nested) StructType in Arrow so that > PySpark can make use of it. > > Quite often, as in my use case, complex types are also stored in Hive table > cells. Currently it's not possible to use the new > {{[pandas_udf|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=explode#pyspark.sql.functions.pandas_udf]}} > decorator which internally uses Arrow to generate a UDF for columns with > complex types. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6904) [Python] Implement MapArray and MapType
[ https://issues.apache.org/jira/browse/ARROW-6904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952955#comment-16952955 ] Bryan Cutler commented on ARROW-6904: - I can work on this > [Python] Implement MapArray and MapType > --- > > Key: ARROW-6904 > URL: https://issues.apache.org/jira/browse/ARROW-6904 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Fix For: 1.0.0 > > > Map arrays are already added to C++, need to expose them in the Python API > also -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6904) [Python] Implement MapArray and MapType
Bryan Cutler created ARROW-6904: --- Summary: [Python] Implement MapArray and MapType Key: ARROW-6904 URL: https://issues.apache.org/jira/browse/ARROW-6904 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Bryan Cutler Assignee: Bryan Cutler Fix For: 1.0.0 Map arrays are already added to C++, need to expose them in the Python API also -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-6429) [CI][Crossbow] Nightly spark integration job fails
[ https://issues.apache.org/jira/browse/ARROW-6429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-6429. - Resolution: Fixed > [CI][Crossbow] Nightly spark integration job fails > -- > > Key: ARROW-6429 > URL: https://issues.apache.org/jira/browse/ARROW-6429 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Neal Richardson >Assignee: Wes McKinney >Priority: Blocker > Labels: nightly, pull-request-available > Fix For: 1.0.0 > > Time Spent: 4h 40m > Remaining Estimate: 0h > > See https://circleci.com/gh/ursa-labs/crossbow/2310. Either fix, skip job and > create followup Jira to unskip, or delete job. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6429) [CI][Crossbow] Nightly spark integration job fails
[ https://issues.apache.org/jira/browse/ARROW-6429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946068#comment-16946068 ] Bryan Cutler commented on ARROW-6429: - Tests are passing since ARROW-6686 was merged, I'll resolve this now > [CI][Crossbow] Nightly spark integration job fails > -- > > Key: ARROW-6429 > URL: https://issues.apache.org/jira/browse/ARROW-6429 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Neal Richardson >Assignee: Wes McKinney >Priority: Blocker > Labels: nightly, pull-request-available > Fix For: 1.0.0 > > Time Spent: 4h 40m > Remaining Estimate: 0h > > See https://circleci.com/gh/ursa-labs/crossbow/2310. Either fix, skip job and > create followup Jira to unskip, or delete job. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6790) [Release] Automatically disable integration test cases in release verification
Bryan Cutler created ARROW-6790: --- Summary: [Release] Automatically disable integration test cases in release verification Key: ARROW-6790 URL: https://issues.apache.org/jira/browse/ARROW-6790 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Bryan Cutler Assignee: Bryan Cutler If dev/release/verify-release-candidate.sh is run with selective testing and includes integration tests, the selected implementations should be the only ones enabled when running the integration test portion. For example: TEST_DEFAULT=0 \ TEST_CPP=1 \ TEST_JAVA=1 \ TEST_INTEGRATION=1 \ dev/release/verify-release-candidate.sh source 0.15.0 2 Should run integration only for C++ and Java -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-3850) [Python] Support MapType and StructType for enhanced PySpark integration
[ https://issues.apache.org/jira/browse/ARROW-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937159#comment-16937159 ] Bryan Cutler commented on ARROW-3850: - Now that SPARK-23836 is merged, a scalar Pandas UDF can return a StructType that will accept a pandas.DataFrame. By nested structs, I mean a column of StructType that has a child that is a StructType. Spark does not currently support this as an input column or return type from Pandas UDFs. > [Python] Support MapType and StructType for enhanced PySpark integration > > > Key: ARROW-3850 > URL: https://issues.apache.org/jira/browse/ARROW-3850 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Affects Versions: 0.11.1 >Reporter: Florian Wilhelm >Priority: Major > Fix For: 1.0.0 > > > It would be great to support MapType and (nested) StructType in Arrow so that > PySpark can make use of it. > > Quite often, as in my use case, complex types are also stored in Hive table > cells. Currently it's not possible to use the new > {{[pandas_udf|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=explode#pyspark.sql.functions.pandas_udf]}} > decorator which internally uses Arrow to generate a UDF for columns with > complex types. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6429) [CI][Crossbow] Nightly spark integration job fails
[ https://issues.apache.org/jira/browse/ARROW-6429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16935210#comment-16935210 ] Bryan Cutler commented on ARROW-6429: - I believe I need to add a patch so Spark can compile with Arrow Java. I'm working on this now. > [CI][Crossbow] Nightly spark integration job fails > -- > > Key: ARROW-6429 > URL: https://issues.apache.org/jira/browse/ARROW-6429 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Neal Richardson >Assignee: Wes McKinney >Priority: Blocker > Labels: nightly, pull-request-available > Fix For: 0.15.0 > > Time Spent: 50m > Remaining Estimate: 0h > > See https://circleci.com/gh/ursa-labs/crossbow/2310. Either fix, skip job and > create followup Jira to unskip, or delete job. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6652) [Python] to_pandas conversion removes timezone from type
[ https://issues.apache.org/jira/browse/ARROW-6652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16935089#comment-16935089 ] Bryan Cutler commented on ARROW-6652: - [~wesm] or [~apitrou] would you be able to take a look at this? > [Python] to_pandas conversion removes timezone from type > > > Key: ARROW-6652 > URL: https://issues.apache.org/jira/browse/ARROW-6652 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Bryan Cutler >Priority: Critical > Fix For: 0.15.0 > > > Calling {{to_pandas}} on a {{pyarrow.Array}} with a timezone aware timestamp > type removes the timezone in the resulting {{pandas.Series}}. > {code} > >>> import pyarrow as pa > >>> a = pa.array([1], type=pa.timestamp('us', tz='America/Los_Angeles')) > >>> a.to_pandas() > 0 1970-01-01 00:00:00.000001 > dtype: datetime64[ns] > {code} > Previous behavior from 0.14.1 of converting a {{pyarrow.Column}} > {{to_pandas}} retained the timezone. > {code} > In [4]: import pyarrow as pa >...: a = pa.array([1], type=pa.timestamp('us', tz='America/Los_Angeles')) >...: c = pa.Column.from_array('ts', a) > In [5]: c.to_pandas() > > Out[5]: > 0 1969-12-31 16:00:00.000001-08:00 > Name: ts, dtype: datetime64[ns, America/Los_Angeles] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-6429) [CI][Crossbow] Nightly spark integration job fails
[ https://issues.apache.org/jira/browse/ARROW-6429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933714#comment-16933714 ] Bryan Cutler edited comment on ARROW-6429 at 9/21/19 4:59 PM: -- [~wesm] the issue with the timestamp test failures looks to be because calling {{to_pandas}} on a pyarrow ChunkedArray with a tz aware timestamp type removes the tz from the resulting dtype. The behavior before was a pyarrow Column keeps the tz but the pyarrow Array removes it when converting to a numpy array. With Arrow 0.14.1 {code:java} In [4]: import pyarrow as pa ...: a = pa.array([1], type=pa.timestamp('us', tz='America/Los_Angeles')) ...: c = pa.Column.from_array('ts', a) In [5]: c.to_pandas() Out[5]: 0 1969-12-31 16:00:00.000001-08:00 Name: ts, dtype: datetime64[ns, America/Los_Angeles] In [6]: a.to_pandas() Out[6]: array(['1970-01-01T00:00:00.000001'], dtype='datetime64[us]') {code} With current master {code:java} >>> import pyarrow as pa >>> a = pa.array([1], type=pa.timestamp('us', tz='America/Los_Angeles')) >>> a.to_pandas() 0 1970-01-01 00:00:00.000001 dtype: datetime64[ns] {code} After manually adding the timezone back in the series dtype (and fixing the Java compilation), all tests pass and the spark integration run finished. I wasn't able to look into why the timezone is being removed though. Should I open up a jira for this? edit: I made ARROW-6652 since it is not just a Spark issue was (Author: bryanc): [~wesm] the issue with the timestamp test failures looks to be because calling {{to_pandas}} on a pyarrow ChunkedArray with a tz aware timestamp type removes the tz from the resulting dtype. The behavior before was a pyarrow Column keeps the tz but the pyarrow Array removes it when converting to a numpy array. 
With Arrow 0.14.1 {code} In [4]: import pyarrow as pa ...: a = pa.array([1], type=pa.timestamp('us', tz='America/Los_Angeles')) ...: c = pa.Column.from_array('ts', a) In [5]: c.to_pandas() Out[5]: 0 1969-12-31 16:00:00.000001-08:00 Name: ts, dtype: datetime64[ns, America/Los_Angeles] In [6]: a.to_pandas() Out[6]: array(['1970-01-01T00:00:00.000001'], dtype='datetime64[us]') {code} With current master {code} >>> import pyarrow as pa >>> a = pa.array([1], type=pa.timestamp('us', tz='America/Los_Angeles')) >>> a.to_pandas() 0 1970-01-01 00:00:00.000001 dtype: datetime64[ns] {code} After manually adding the timezone back in the series dtype (and fixing the Java compilation), all tests pass and the spark integration run finished. I wasn't able to look into why the timezone is being removed though. Should I open up a jira for this? > [CI][Crossbow] Nightly spark integration job fails > -- > > Key: ARROW-6429 > URL: https://issues.apache.org/jira/browse/ARROW-6429 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Neal Richardson >Assignee: Wes McKinney >Priority: Blocker > Labels: nightly, pull-request-available > Fix For: 0.15.0 > > Time Spent: 50m > Remaining Estimate: 0h > > See https://circleci.com/gh/ursa-labs/crossbow/2310. Either fix, skip job and > create followup Jira to unskip, or delete job. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6652) [Python] to_pandas conversion removes timezone from type
Bryan Cutler created ARROW-6652: --- Summary: [Python] to_pandas conversion removes timezone from type Key: ARROW-6652 URL: https://issues.apache.org/jira/browse/ARROW-6652 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Bryan Cutler Fix For: 0.15.0 Calling {{to_pandas}} on a {{pyarrow.Array}} with a timezone aware timestamp type removes the timezone in the resulting {{pandas.Series}}. {code} >>> import pyarrow as pa >>> a = pa.array([1], type=pa.timestamp('us', tz='America/Los_Angeles')) >>> a.to_pandas() 0 1970-01-01 00:00:00.000001 dtype: datetime64[ns] {code} Previous behavior from 0.14.1 of converting a {{pyarrow.Column}} {{to_pandas}} retained the timezone. {code} In [4]: import pyarrow as pa ...: a = pa.array([1], type=pa.timestamp('us', tz='America/Los_Angeles')) ...: c = pa.Column.from_array('ts', a) In [5]: c.to_pandas() Out[5]: 0 1969-12-31 16:00:00.000001-08:00 Name: ts, dtype: datetime64[ns, America/Los_Angeles] {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6429) [CI][Crossbow] Nightly spark integration job fails
[ https://issues.apache.org/jira/browse/ARROW-6429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933714#comment-16933714 ] Bryan Cutler commented on ARROW-6429: - [~wesm] the issue with the timestamp test failures looks to be because calling {{to_pandas}} on a pyarrow ChunkedArray with a tz aware timestamp type removes the tz from the resulting dtype. The behavior before was a pyarrow Column keeps the tz but the pyarrow Array removes it when converting to a numpy array. With Arrow 0.14.1 {code} In [4]: import pyarrow as pa ...: a = pa.array([1], type=pa.timestamp('us', tz='America/Los_Angeles')) ...: c = pa.Column.from_array('ts', a) In [5]: c.to_pandas() Out[5]: 0 1969-12-31 16:00:00.000001-08:00 Name: ts, dtype: datetime64[ns, America/Los_Angeles] In [6]: a.to_pandas() Out[6]: array(['1970-01-01T00:00:00.000001'], dtype='datetime64[us]') {code} With current master {code} >>> import pyarrow as pa >>> a = pa.array([1], type=pa.timestamp('us', tz='America/Los_Angeles')) >>> a.to_pandas() 0 1970-01-01 00:00:00.000001 dtype: datetime64[ns] {code} After manually adding the timezone back in the series dtype (and fixing the Java compilation), all tests pass and the spark integration run finished. I wasn't able to look into why the timezone is being removed though. Should I open up a jira for this? > [CI][Crossbow] Nightly spark integration job fails > -- > > Key: ARROW-6429 > URL: https://issues.apache.org/jira/browse/ARROW-6429 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Neal Richardson >Assignee: Wes McKinney >Priority: Blocker > Labels: nightly, pull-request-available > Fix For: 0.15.0 > > Time Spent: 50m > Remaining Estimate: 0h > > See https://circleci.com/gh/ursa-labs/crossbow/2310. Either fix, skip job and > create followup Jira to unskip, or delete job. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6429) [CI][Crossbow] Nightly spark integration job fails
[ https://issues.apache.org/jira/browse/ARROW-6429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929909#comment-16929909 ] Bryan Cutler commented on ARROW-6429: - After ARROW-6557 there seems to be another issue with timestamps [https://github.com/apache/arrow/pull/5373#issuecomment-531264154]. I'll look into this soon. > [CI][Crossbow] Nightly spark integration job fails > -- > > Key: ARROW-6429 > URL: https://issues.apache.org/jira/browse/ARROW-6429 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Neal Richardson >Assignee: Wes McKinney >Priority: Blocker > Labels: nightly, pull-request-available > Fix For: 0.15.0 > > Time Spent: 50m > Remaining Estimate: 0h > > See https://circleci.com/gh/ursa-labs/crossbow/2310. Either fix, skip job and > create followup Jira to unskip, or delete job. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Resolved] (ARROW-6534) [Java] Fix typos and spelling
[ https://issues.apache.org/jira/browse/ARROW-6534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-6534. - Resolution: Fixed Issue resolved by pull request 5359 [https://github.com/apache/arrow/pull/5359] > [Java] Fix typos and spelling > - > > Key: ARROW-6534 > URL: https://issues.apache.org/jira/browse/ARROW-6534 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Fix typos and spelling, mostly in docs and tests. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6534) [Java] Fix typos and spelling
Bryan Cutler created ARROW-6534: --- Summary: [Java] Fix typos and spelling Key: ARROW-6534 URL: https://issues.apache.org/jira/browse/ARROW-6534 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Bryan Cutler Assignee: Bryan Cutler Fix For: 0.15.0 Fix typos and spelling, mostly in docs and tests. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-6429) [CI][Crossbow] Nightly spark integration job fails
[ https://issues.apache.org/jira/browse/ARROW-6429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927871#comment-16927871 ] Bryan Cutler commented on ARROW-6429: - The failure seems to be caused by the removal of pyarrow.Column in favor of pyarrow.ChunkedArray. Spark iterates over columns of a pyarrow.Table, calls {{to_pandas()}} on each column, and assumes the result is a pd.Series. If the column is actually a pyarrow.ChunkedArray, then {{to_pandas()}} can return a numpy.array. [~wesmckinn] [~pitrou] I know in the pydoc it says the returned value can either be a pandas.Series or numpy.array, but is there any way to ensure it is the former, or is that the job of the caller? > [CI][Crossbow] Nightly spark integration job fails > -- > > Key: ARROW-6429 > URL: https://issues.apache.org/jira/browse/ARROW-6429 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Neal Richardson >Assignee: Bryan Cutler >Priority: Blocker > Labels: nightly > Fix For: 0.15.0 > > > See https://circleci.com/gh/ursa-labs/crossbow/2310. Either fix, skip job and > create followup Jira to unskip, or delete job. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6519) [Java] Use IPC continuation token to mark EOS
Bryan Cutler created ARROW-6519: --- Summary: [Java] Use IPC continuation token to mark EOS Key: ARROW-6519 URL: https://issues.apache.org/jira/browse/ARROW-6519 Project: Apache Arrow Issue Type: Sub-task Components: Java Reporter: Bryan Cutler Assignee: Bryan Cutler Fix For: 0.15.0 For an Arrow stream in non-legacy mode, the EOS identifier should be \{0xFFFFFFFF, 0x00000000}. This way, all bytes sent by the writer can be read. -- This message was sent by Atlassian Jira (v8.3.2#803003)
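The stream-level encoding being referred to can be sketched in a few bytes: the non-legacy end-of-stream marker is the continuation token followed by a zero metadata length, both 32-bit little-endian.

```python
import struct

# EOS for a non-legacy Arrow IPC stream: 0xFFFFFFFF continuation token
# followed by a 0 metadata length, each packed little-endian.
eos = struct.pack('<I', 0xFFFFFFFF) + struct.pack('<I', 0)
print(eos.hex())  # ffffffff00000000
```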
[jira] [Commented] (ARROW-6429) [CI][Crossbow] Nightly spark integration job fails
[ https://issues.apache.org/jira/browse/ARROW-6429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926208#comment-16926208 ] Bryan Cutler commented on ARROW-6429: - I will take a look, but it might be a few days until I can get to it. > [CI][Crossbow] Nightly spark integration job fails > -- > > Key: ARROW-6429 > URL: https://issues.apache.org/jira/browse/ARROW-6429 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration >Reporter: Neal Richardson >Assignee: Bryan Cutler >Priority: Blocker > Labels: nightly > Fix For: 0.15.0 > > > See https://circleci.com/gh/ursa-labs/crossbow/2310. Either fix, skip job and > create followup Jira to unskip, or delete job. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (ARROW-6474) [Python] Provide mechanism for python to write out old format
[ https://issues.apache.org/jira/browse/ARROW-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926207#comment-16926207 ] Bryan Cutler commented on ARROW-6474: - I think this is more for the case that a user is stuck with an already released Spark version, e.g. <= 2.4.4, and ends up installing pyarrow >= 0.15.0. The pyarrow writers will use the new format by default, which the Arrow Java version in Spark will be unable to handle since it's using 0.14.1. There is no way for the user to set the option in the pyarrow writer either, so they would have to downgrade pyarrow. I think it's fair to say they need to stick with pyarrow 0.14.1, but an env variable would give them a way to use the latest release. > [Python] Provide mechanism for python to write out old format > - > > Key: ARROW-6474 > URL: https://issues.apache.org/jira/browse/ARROW-6474 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: Micah Kornfield >Priority: Blocker > Fix For: 0.15.0 > > > I think this needs to be an environment variable, so it can be made to work > with old version of the Java library pyspark integration. > > [~bryanc] can you check if this captures the requirements? -- This message was sent by Atlassian Jira (v8.3.2#803003)
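The escape hatch that eventually shipped was an environment variable checked by pyarrow; the sketch below assumes the name that appeared with the 0.15.0 release (it must be set in the Python workers' environment before any stream is written, e.g. via spark-env.sh on Spark <= 2.4.x clusters):

```python
import os

# Opt pyarrow >= 0.15.0 back into the pre-0.15 IPC format so that an
# older Arrow Java (e.g. 0.14.1 bundled in Spark) can read its streams.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
```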
[jira] [Assigned] (ARROW-6461) [Java] EchoServer can close socket before client has finished reading
[ https://issues.apache.org/jira/browse/ARROW-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned ARROW-6461: --- Assignee: Bryan Cutler > [Java] EchoServer can close socket before client has finished reading > - > > Key: ARROW-6461 > URL: https://issues.apache.org/jira/browse/ARROW-6461 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 50m > Remaining Estimate: 0h > > When the EchoServer finishes running the client connection, the socket is > closed immediately. This causes a race condition and the client will fail > with a > {noformat} > SocketException: connection reset {noformat} > if it has not read all of the echoed batches. > This was consistently happening with the fix for ARROW-6315 -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Resolved] (ARROW-6461) [Java] EchoServer can close socket before client has finished reading
[ https://issues.apache.org/jira/browse/ARROW-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-6461. - Resolution: Fixed Issue resolved by pull request 5288 [https://github.com/apache/arrow/pull/5288] > [Java] EchoServer can close socket before client has finished reading > - > > Key: ARROW-6461 > URL: https://issues.apache.org/jira/browse/ARROW-6461 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Bryan Cutler >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 40m > Remaining Estimate: 0h > > When the EchoServer finishes running the client connection, the socket is > closed immediately. This causes a race condition and the client will fail > with a > {noformat} > SocketException: connection reset {noformat} > if it has not read all of the echoed batches. > This was consistently happening with the fix for ARROW-6315 -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6461) [Java] EchoServer can close socket before client has finished reading
Bryan Cutler created ARROW-6461: --- Summary: [Java] EchoServer can close socket before client has finished reading Key: ARROW-6461 URL: https://issues.apache.org/jira/browse/ARROW-6461 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Bryan Cutler Fix For: 0.15.0 When the EchoServer finishes running the client connection, the socket is closed immediately. This causes a race condition and the client will fail with a {noformat} SocketException: connection reset {noformat} if it has not read all of the echoed batches. This was consistently happening with the fix for ARROW-6315 -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Resolved] (ARROW-6202) [Java] Exception in thread "main" org.apache.arrow.memory.OutOfMemoryException: Unable to allocate buffer of size 4 due to memory limit. Current allocation: 2147483646
[ https://issues.apache.org/jira/browse/ARROW-6202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-6202. - Fix Version/s: 0.15.0 Resolution: Fixed Issue resolved by pull request 5134 [https://github.com/apache/arrow/pull/5134] > [Java] Exception in thread "main" > org.apache.arrow.memory.OutOfMemoryException: Unable to allocate buffer of > size 4 due to memory limit. Current allocation: 2147483646 > --- > > Key: ARROW-6202 > URL: https://issues.apache.org/jira/browse/ARROW-6202 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Affects Versions: 0.14.1 >Reporter: Jim Northrup >Assignee: Micah Kornfield >Priority: Major > Labels: jdbc, pull-request-available > Fix For: 0.15.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > jdbc query results exceed native heap when using generous -Xmx settings. > for roughly 800 megabytes of csv/flatfile resultset, arrow is unable to house > the contents in RAM long enough to persist to disk, without explicit > knowledge beyond unit test sample code. > source: > https://github.com/jnorthrup/jdbc2json/blob/master/src/main/java/com/fnreport/QueryToFeather.kt#L83 > {code:java} > Exception in thread "main" org.apache.arrow.memory.OutOfMemoryException: > Unable to allocate buffer of size 4 due to memory limit. 
Current allocation: > 2147483646 > at > org.apache.arrow.memory.BaseAllocator.buffer(BaseAllocator.java:307) > at > org.apache.arrow.memory.BaseAllocator.buffer(BaseAllocator.java:277) > at > org.apache.arrow.adapter.jdbc.JdbcToArrowUtils.updateVector(JdbcToArrowUtils.java:610) > at > org.apache.arrow.adapter.jdbc.JdbcToArrowUtils.jdbcToFieldVector(JdbcToArrowUtils.java:462) > at > org.apache.arrow.adapter.jdbc.JdbcToArrowUtils.jdbcToArrowVectors(JdbcToArrowUtils.java:396) > at > org.apache.arrow.adapter.jdbc.JdbcToArrow.sqlToArrow(JdbcToArrow.java:225) > at > org.apache.arrow.adapter.jdbc.JdbcToArrow.sqlToArrow(JdbcToArrow.java:187) > at > org.apache.arrow.adapter.jdbc.JdbcToArrow.sqlToArrow(JdbcToArrow.java:156) > at com.fnreport.QueryToFeather$Companion.go(QueryToFeather.kt:83) > at > com.fnreport.QueryToFeather$Companion$main$1.invokeSuspend(QueryToFeather.kt:95) > at > kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33) > at kotlinx.coroutines.DispatchedTask.run(Dispatched.kt:241) > at > kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:270) > at kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:79) > at > kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:54) > at kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source) > at > kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:36) > at kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source) > at com.fnreport.QueryToFeather$Companion.main(QueryToFeather.kt:93) > at com.fnreport.QueryToFeather.main(QueryToFeather.kt) > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Resolved] (ARROW-6011) [Python] Data incomplete when using pyarrow in pyspark in python 3.x
[ https://issues.apache.org/jira/browse/ARROW-6011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-6011.
- Resolution: Cannot Reproduce

I could not reproduce. We can continue the discussion in SPARK-28482 and reopen if we find an issue in Arrow.

> [Python] Data incomplete when using pyarrow in pyspark in python 3.x
>
> Key: ARROW-6011
> URL: https://issues.apache.org/jira/browse/ARROW-6011
> Project: Apache Arrow
> Issue Type: Bug
> Components: Java, Python
> Affects Versions: 0.10.0, 0.14.0
> Environment: centos 7.4, pyarrow 0.10.0/0.14.0, python 2.7/3.5/3.6
> Reporter: jiangyu
> Priority: Major
> Attachments: image-2019-07-23-16-06-49-889.png, py3.6.png, test.csv, test.py, worker.png
>
> Hi,
>
> Since Spark 2.3.x, pandas udf has been introduced as the default ser/des method. However, an issue arises with Python >= 3.5.
> We use pandas udf to process batches of data, but we found the data is incomplete in Python 3.x. At first I thought the processing logic might be wrong, so I reduced the code to a very simple case, and it had the same problem. After investigating for a week, I found it is related to pyarrow.
>
> *Reproduce procedure:*
> 1. Prepare data
> The data has seven columns, a, b, c, d, e, f and g; the data type is Integer.
> a,b,c,d,e,f,g
> 1,2,3,4,5,6,7
> 1,2,3,4,5,6,7
> 1,2,3,4,5,6,7
> 1,2,3,4,5,6,7
> Produce 100,000 rows, name the file test.csv, upload it to HDFS, then load it and repartition it to 1 partition.
>
> {code:java}
> df=spark.read.format('csv').option("header","true").load('/test.csv')
> df=df.select(*(col(c).cast("int").alias(c) for c in df.columns))
> df=df.repartition(1)
> spark_context = SparkContext.getOrCreate()
> {code}
>
> 2. Register the pandas udf
> {code:java}
> def add_func(a,b,c,d,e,f,g):
>     print('iterator one time')
>     return a
> add = pandas_udf(add_func, returnType=IntegerType())
> df_result=df.select(add(col("a"),col("b"),col("c"),col("d"),col("e"),col("f"),col("g")))
> {code}
>
> 3. Apply the pandas udf
> {code:java}
> def trigger_func(iterator):
>     yield iterator
> df_result.rdd.foreachPartition(trigger_func)
> {code}
>
> 4. Execute it in pyspark (local or yarn)
> Run it with conf spark.sql.execution.arrow.maxRecordsPerBatch=10. As mentioned before, the total row number is 100, so it should print "iterator one time" 10 times.
> (1) Python 2.7 env:
> {code:java}
> PYSPARK_PYTHON=/usr/lib/conda/envs/py2.7/bin/python pyspark --conf spark.sql.execution.arrow.maxRecordsPerBatch=10 --conf spark.executor.pyspark.memory=2g --conf spark.sql.execution.arrow.enabled=true --executor-cores 1
> {code}
> !image-2019-07-23-16-06-49-889.png!
> The result is right: 10 prints.
>
> (2) Python 3.5 or 3.6 env:
> {code:java}
> PYSPARK_PYTHON=/usr/lib/conda/envs/python3.6/bin/python pyspark --conf spark.sql.execution.arrow.maxRecordsPerBatch=10 --conf spark.executor.pyspark.memory=2g --conf spark.sql.execution.arrow.enabled=true --executor-cores
> {code}
> !py3.6.png!
> The data is incomplete. The exception is printed by Spark code that we added; I will explain it later.
>
> h3. *Investigation*
> The "process done" message is added in worker.py.
> !worker.png!
> In order to get the exception, change the Spark code under core/src/main/scala/org/apache/spark/util/Utils.scala and add this code to print the exception.
>
> {code:java}
> @@ -1362,6 +1362,8 @@ private[spark] object Utils extends Logging {
>  case t: Throwable =>
>  // Purposefully not using NonFatal, because even fatal exceptions
>  // we don't want to have our finallyBlock suppress
> + logInfo(t.getLocalizedMessage)
> + t.printStackTrace()
>  originalThrowable = t
>  throw originalThrowable
>  } finally {
> {code}
>
> It seems pyspark gets the data from the JVM, but pyarrow receives it incomplete. The pyarrow side thinks the data is finished and shuts down the socket. At the same time, the JVM side is still writing to the same socket, and gets a socket-close exception.
> The pyarrow part is in ipc.pxi:
>
> {code:java}
> cdef class _RecordBatchReader:
>     cdef:
>         shared_ptr[CRecordBatchReader] reader
>         shared_ptr[InputStream] in_stream
>
>     cdef readonly:
>         Schema schema
>
>     def __cinit__(self):
>         pass
>
>     def _open(self, source):
>         get_input_stream(source, &self.in_stream)
>         with nogil:
>             check_status(CRecordBatchStreamReader.Open(
>                 self.in_stream.get(), &self.reader))
>         self.schema = pyarrow_wrap_schema(self.reader.get().schema())
>
>     def __iter__(self):
>         while True:
>             yield self.read_next_batch()
>
>     def get_next_batch(self):
>         import warnings
>         warnings.warn('Please use read_next_batch instead of '
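The expected print count in step 4 of the report above is simple arithmetic: one `pandas_udf` invocation per Arrow record batch, and one batch per `maxRecordsPerBatch` rows in each partition. A small sketch of that arithmetic (my own helper, not Spark or pyarrow code):

```python
import math

def expected_batches(total_rows, max_records_per_batch, partitions=1):
    """How many Arrow record batches (and hence pandas_udf invocations)
    one pass over the data should produce."""
    rows_per_partition = math.ceil(total_rows / partitions)
    return partitions * math.ceil(rows_per_partition / max_records_per_batch)

# Either reading of the report's numbers gives the same expectation:
# 100 rows / batch size 10, or 100,000 rows / batch size 10,000 -> 10 prints.
```

If fewer than the expected number of prints appear, batches were dropped somewhere between the JVM writer and the pyarrow reader, which is exactly what the investigation section goes on to trace.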
[jira] [Updated] (ARROW-6301) [Python] atexit: pyarrow.lib.ArrowKeyError: 'No type extension with name arrow.py_extension_type found'
[ https://issues.apache.org/jira/browse/ARROW-6301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated ARROW-6301.
Summary: [Python] atexit: pyarrow.lib.ArrowKeyError: 'No type extension with name arrow.py_extension_type found' (was: atexit: pyarrow.lib.ArrowKeyError: 'No type extension with name arrow.py_extension_type found')

> [Python] atexit: pyarrow.lib.ArrowKeyError: 'No type extension with name arrow.py_extension_type found'
> ---
>
> Key: ARROW-6301
> URL: https://issues.apache.org/jira/browse/ARROW-6301
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.14.1
> Environment: linux, virtualenv, uwsgi, cpython 2.7
> Reporter: David Alphus
> Priority: Minor
>
> On interrupt, I am frequently seeing the atexit function failing in pyarrow 0.14.1.
> {code:java}
> ^CSIGINT/SIGQUIT received...killing workers...
> killing the spooler with pid 22640
> Error in atexit._run_exitfuncs:
> Traceback (most recent call last):
>   File "/home/alpha/.virtualenvs/wsgi/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
>     func(*targs, **kargs)
>   File "pyarrow/types.pxi", line 1860, in pyarrow.lib._unregister_py_extension_type
>     check_status(UnregisterPyExtensionType())
>   File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
>     raise ArrowKeyError(message)
> ArrowKeyError: 'No type extension with name arrow.py_extension_type found'
> Error in sys.exitfunc:
> Traceback (most recent call last):
>   File "/home/alpha/.virtualenvs/wsgi/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
>     func(*targs, **kargs)
>   File "pyarrow/types.pxi", line 1860, in pyarrow.lib._unregister_py_extension_type
>   File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
> pyarrow.lib.ArrowKeyError: 'No type extension with name arrow.py_extension_type found'
> spooler (pid: 22640) annihilated
> worker 1 buried after 1 seconds
> goodbye to uWSGI.
> {code}
-- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Resolved] (ARROW-6210) [Java] remove equals API from ValueVector
[ https://issues.apache.org/jira/browse/ARROW-6210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-6210. - Resolution: Fixed Fix Version/s: 0.15.0 Issue resolved by pull request 5065 [https://github.com/apache/arrow/pull/5065] > [Java] remove equals API from ValueVector > - > > Key: ARROW-6210 > URL: https://issues.apache.org/jira/browse/ARROW-6210 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Pindikura Ravindra >Assignee: Ji Liu >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 7.5h > Remaining Estimate: 0h > > This is a follow-up from [https://github.com/apache/arrow/pull/4933] > The callers should be fixed to use the RangeEquals API instead. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6211) [Java] Remove dependency on RangeEqualsVisitor from ValueVector interface
[ https://issues.apache.org/jira/browse/ARROW-6211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906415#comment-16906415 ] Bryan Cutler commented on ARROW-6211: - This sounds good to me then, I agree it would be useful to have a generic visitor API > [Java] Remove dependency on RangeEqualsVisitor from ValueVector interface > - > > Key: ARROW-6211 > URL: https://issues.apache.org/jira/browse/ARROW-6211 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Pindikura Ravindra >Assignee: Ji Liu >Priority: Major > > This is a follow-up from [https://github.com/apache/arrow/pull/4933] > > public interface VectorVisitor<OUT, IN> {..} > > In ValueVector : > public <OUT, IN> OUT accept(VectorVisitor<OUT, IN> visitor, IN value) throws EX; > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
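The `VectorVisitor`/`accept` signature quoted in the issue is the classic double-dispatch visitor: the vector picks the `visit_*` method for its own concrete type, so new operations (equality, hashing, printing, ...) can be added without touching the vector classes. A minimal Python rendering of the same shape, with hypothetical vector classes for illustration:

```python
class VectorVisitor:
    """Base visitor; subclasses implement one operation over all vector types."""
    def visit_int(self, vector, value): raise NotImplementedError
    def visit_zero(self, vector, value): raise NotImplementedError

class IntVector:
    def __init__(self, data):
        self.data = data
    def accept(self, visitor, value):
        # Double dispatch: the vector selects the visit method
        # for its own concrete type.
        return visitor.visit_int(self, value)

class ZeroVector:
    def accept(self, visitor, value):
        return visitor.visit_zero(self, value)

class ContainsVisitor(VectorVisitor):
    """One concrete operation; others plug in without changing the vectors."""
    def visit_int(self, vector, value):
        return value in vector.data
    def visit_zero(self, vector, value):
        return False  # a zero-length vector contains nothing

result = IntVector([1, 2, 3]).accept(ContainsVisitor(), 2)
```

This is why decoupling `ValueVector` from `RangeEqualsVisitor` specifically matters: with a generic `accept`, a range-equality check becomes just one more visitor among many.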
[jira] [Resolved] (ARROW-6215) [Java] RangeEqualVisitor does not properly compare ZeroVector
[ https://issues.apache.org/jira/browse/ARROW-6215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-6215. - Resolution: Fixed Fix Version/s: 0.15.0 Issue resolved by pull request 5070 [https://github.com/apache/arrow/pull/5070] > [Java] RangeEqualVisitor does not properly compare ZeroVector > - > > Key: ARROW-6215 > URL: https://issues.apache.org/jira/browse/ARROW-6215 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 10m > Remaining Estimate: 0h > > ZeroVector.accept and RangeEqualVisitor always return True no matter what > type of other vector is compared -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6215) [Java] RangeEqualVisitor does not properly compare ZeroVector
Bryan Cutler created ARROW-6215: --- Summary: [Java] RangeEqualVisitor does not properly compare ZeroVector Key: ARROW-6215 URL: https://issues.apache.org/jira/browse/ARROW-6215 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Bryan Cutler Assignee: Bryan Cutler ZeroVector.accept and RangeEqualVisitor always return True no matter what type of other vector is compared -- This message was sent by Atlassian JIRA (v7.6.14#76016)
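The bug described above, a comparison that returns True regardless of the other vector's type, is what you get when a range-equals visitor skips the type check before comparing values. A sketch of the guarded comparison with plain Python stand-ins (not the Java classes):

```python
class IntVec:
    def __init__(self, data):
        self.data = data

class ZeroVec:
    """Stand-in for ZeroVector: always empty."""
    data = []

def range_equals(left, right, start, length):
    """Compare `length` elements beginning at `start`.

    Vectors of different concrete types are never range-equal;
    checking this first is the guard the fix adds."""
    if type(left) is not type(right):
        return False
    return left.data[start:start + length] == right.data[start:start + length]

same = range_equals(IntVec([1, 2, 3]), IntVec([0, 2, 3]), 1, 2)   # values match
cross = range_equals(IntVec([1, 2]), ZeroVec(), 0, 2)             # types differ
```

Without the `type(...)` guard, comparing an `IntVec` against a `ZeroVec` would fall through to comparing empty slices and could spuriously report equality, which is the shape of the reported bug.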
[jira] [Resolved] (ARROW-6209) [Java] Extract set null method to the base class for fixed width vectors
[ https://issues.apache.org/jira/browse/ARROW-6209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-6209. - Resolution: Fixed Fix Version/s: 0.15.0 Issue resolved by pull request 5064 [https://github.com/apache/arrow/pull/5064] > [Java] Extract set null method to the base class for fixed width vectors > > > Key: ARROW-6209 > URL: https://issues.apache.org/jira/browse/ARROW-6209 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Currently, each fixed width vector has the setNull method. All these > implementations are identical, so we move them to the base class. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
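The refactor above is the standard "pull up method" move: when every subclass carries an identical implementation, it belongs on the shared base class. Sketched in Python with hypothetical fixed-width vector classes:

```python
class BaseFixedWidthVector:
    def __init__(self, n):
        self.validity = [True] * n

    def set_null(self, index):
        # One shared implementation instead of an identical copy
        # in every fixed-width subclass.
        self.validity[index] = False

class IntVector(BaseFixedWidthVector):
    pass

class Float8Vector(BaseFixedWidthVector):
    pass

v = IntVector(3)
v.set_null(1)
```

Both subclasses now inherit `set_null` unchanged, and any future fix to null handling lands in exactly one place.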
[jira] [Commented] (ARROW-6211) [Java] Remove dependency on RangeEqualsVisitor from ValueVector interface
[ https://issues.apache.org/jira/browse/ARROW-6211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905457#comment-16905457 ] Bryan Cutler commented on ARROW-6211: - So will this allow for other types of visitors besides a RangeEqualsVisitor? > [Java] Remove dependency on RangeEqualsVisitor from ValueVector interface > - > > Key: ARROW-6211 > URL: https://issues.apache.org/jira/browse/ARROW-6211 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Pindikura Ravindra >Assignee: Ji Liu >Priority: Major > > This is a follow-up from [https://github.com/apache/arrow/pull/4933] > > public interface VectorVisitor<OUT, IN> {..} > > In ValueVector : > public <OUT, IN> OUT accept(VectorVisitor<OUT, IN> visitor, IN value) throws EX; > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-5579) [Java] shade flatbuffer dependency
[ https://issues.apache.org/jira/browse/ARROW-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-5579. - Resolution: Fixed Fix Version/s: (was: 1.0.0) 0.15.0 Issue resolved by pull request 4701 [https://github.com/apache/arrow/pull/4701] > [Java] shade flatbuffer dependency > -- > > Key: ARROW-5579 > URL: https://issues.apache.org/jira/browse/ARROW-5579 > Project: Apache Arrow > Issue Type: Task > Components: Java >Reporter: Pindikura Ravindra >Assignee: Ji Liu >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > Time Spent: 20h 40m > Remaining Estimate: 0h > > Reported in a [github issue|https://github.com/apache/arrow/issues/4489] > > After some [discussion|https://github.com/google/flatbuffers/issues/5368] with the Flatbuffers maintainer, it appears that FB-generated code is not guaranteed to be compatible with _any other_ version of the runtime library other than the exact same version of the flatc used to compile it. > This makes depending on flatbuffers in a library (like arrow) quite risky, as if an app depends on any other version of FB, either directly or transitively, it's likely the versions will clash at some point and you'll see undefined behaviour at runtime. > Shading the dependency looks to me the best way to avoid this. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
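Shading here means bundling the exact flatbuffers runtime into the Arrow jar and rewriting its package name so it cannot clash with whatever flatbuffers version an application pulls in. A representative maven-shade-plugin relocation stanza, illustrative only and not necessarily the coordinates or shaded package used in PR 4701:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <artifactSet>
      <includes>
        <!-- Bundle the exact runtime version the generated code was built against -->
        <include>com.google.flatbuffers:flatbuffers-java</include>
      </includes>
    </artifactSet>
    <relocations>
      <relocation>
        <!-- Rewrite the package so an app's own flatbuffers cannot collide -->
        <pattern>com.google.flatbuffers</pattern>
        <shadedPattern>org.apache.arrow.shaded.com.google.flatbuffers</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```

After relocation, Arrow's bytecode references the shaded package, so the "exact flatc version" constraint described in the issue is satisfied regardless of the application's classpath.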
[jira] [Assigned] (ARROW-1184) [Java] Dictionary.equals is not working correctly
[ https://issues.apache.org/jira/browse/ARROW-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned ARROW-1184: --- Assignee: Ji Liu > [Java] Dictionary.equals is not working correctly > - > > Key: ARROW-1184 > URL: https://issues.apache.org/jira/browse/ARROW-1184 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Bryan Cutler >Assignee: Ji Liu >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 7.5h > Remaining Estimate: 0h > > The {{Dictionary.equals}} method does not return True when the dictionaries > are equal. This is because {{equals}} is not implemented for FieldVector and > so that comparison defaults to comparing the two objects only and not the > vector data. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Resolved] (ARROW-1184) [Java] Dictionary.equals is not working correctly
[ https://issues.apache.org/jira/browse/ARROW-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-1184. - Resolution: Fixed Issue resolved by pull request 4843 [https://github.com/apache/arrow/pull/4843] > [Java] Dictionary.equals is not working correctly > - > > Key: ARROW-1184 > URL: https://issues.apache.org/jira/browse/ARROW-1184 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Bryan Cutler >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 7h 10m > Remaining Estimate: 0h > > The {{Dictionary.equals}} method does not return True when the dictionaries > are equal. This is because {{equals}} is not implemented for FieldVector and > so that comparison defaults to comparing the two objects only and not the > vector data. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
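The root cause reads the same in any object-oriented language: without an overridden equals, comparison falls back to object identity, so two dictionaries holding equal data still compare unequal. A minimal Python illustration with stand-in classes (not the Java ones):

```python
class FieldVector:
    # No __eq__ defined: like Java's default Object.equals, Python
    # falls back to identity comparison for this class.
    def __init__(self, data):
        self.data = data

class Dictionary:
    def __init__(self, vector, dict_id):
        self.vector = vector
        self.dict_id = dict_id

    def __eq__(self, other):
        # Comparing self.vector == other.vector would only check identity,
        # which is the shape of the bug; compare the vector data instead.
        return (self.dict_id == other.dict_id
                and self.vector.data == other.vector.data)

a = Dictionary(FieldVector(["x", "y"]), 1)
b = Dictionary(FieldVector(["x", "y"]), 1)
```

Here `a.vector == b.vector` is False (distinct objects, identity comparison), but `a == b` is True because the fixed `__eq__` compares the underlying data, which is the behavior the issue asks for.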
[jira] [Resolved] (ARROW-5911) [Java] Make ListVector and MapVector create reader lazily
[ https://issues.apache.org/jira/browse/ARROW-5911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-5911. - Resolution: Fixed Fix Version/s: 0.14.1 Issue resolved by pull request 4854 [https://github.com/apache/arrow/pull/4854] > [Java] Make ListVector and MapVector create reader lazily > - > > Key: ARROW-5911 > URL: https://issues.apache.org/jira/browse/ARROW-5911 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Minor > Labels: pull-request-available > Fix For: 0.14.1 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > The current implementation creates the reader eagerly, which may waste resources and time. This issue changes the behavior to create the reader lazily. > This is a follow-up issue for ARROW-5897. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
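Lazy creation of this sort is just deferring construction to first access, so vectors whose reader is never used pay nothing. A generic Python sketch of the pattern (illustrative names, not the Java implementation):

```python
class ListVector:
    def __init__(self):
        self._reader = None
        self.reader_builds = 0  # instrumentation for this sketch only

    @property
    def reader(self):
        # Built on first access, then cached; construction cost is
        # skipped entirely if the reader is never requested.
        if self._reader is None:
            self.reader_builds += 1
            self._reader = object()  # stand-in for the real reader object
        return self._reader

v = ListVector()
before = v.reader_builds       # nothing built at construction time
r1 = v.reader                  # first access builds the reader
r2 = v.reader                  # second access reuses the cached one
```

The same object is returned on every access, so callers see no behavioral difference; only the construction timing changes.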
[jira] [Resolved] (ARROW-5435) [Java] add test for IntervalYearVector#getAsStringBuilder
[ https://issues.apache.org/jira/browse/ARROW-5435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved ARROW-5435. - Resolution: Fixed Fix Version/s: 0.14.0 Issue resolved by pull request 4407 [https://github.com/apache/arrow/pull/4407] > [Java] add test for IntervalYearVector#getAsStringBuilder > - > > Key: ARROW-5435 > URL: https://issues.apache.org/jira/browse/ARROW-5435 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Minor > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 4h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5762) [Integration][JS] Integration Tests for Map Type
[ https://issues.apache.org/jira/browse/ARROW-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated ARROW-5762: Summary: [Integration][JS] Integration Tests for Map Type (was: [Integration][JS] Integration Tests for MapType) > [Integration][JS] Integration Tests for Map Type > > > Key: ARROW-5762 > URL: https://issues.apache.org/jira/browse/ARROW-5762 > Project: Apache Arrow > Issue Type: Improvement > Components: Integration, JavaScript >Reporter: Bryan Cutler >Priority: Major > > ARROW-1279 enabled integration tests for MapType between Java and C++, but > JavaScript had to be disabled for the map case due to an error. Once this is > fixed, {{generate_map_case}} could be moved under {{generate_nested_case}} > with the other nested types. -- This message was sent by Atlassian JIRA (v7.6.3#76005)