[jira] [Created] (ARROW-18360) [Python] Incorrectly passing schema=None to do_put crashes

2022-11-17 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-18360:


 Summary: [Python] Incorrectly passing schema=None to do_put crashes
 Key: ARROW-18360
 URL: https://issues.apache.org/jira/browse/ARROW-18360
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 9.0.0
Reporter: Bryan Cutler


In pyarrow.flight, passing an invalid value of None for schema to do_put leads to a core dump.

In pyarrow 9.0.0, entering the command leads to a segmentation fault:

{code}
In [3]: writer, reader = 
client.do_put(flight.FlightDescriptor.for_command(cmd), schema=None)
Segmentation fault (core dumped)
{code}

In pyarrow 7.0.0, the kernel crashes after attempting to access the writer, with the following output:
{code}
In [38]: client = flight.FlightClient('grpc+tls://localhost:9643', 
disable_server_verification=True)

In [39]: writer, reader = 
client.do_put(flight.FlightDescriptor.for_command(cmd), None)

In [40]: 
In [40]: writer.
/home/conda/feedstock_root/build_artifacts/arrow-cpp-ext_1644752264449/work/cpp/src/arrow/flight/client.cc:736:  Check failed: (batch_writer_) != (nullptr)
miniconda3/envs/dev/lib/python3.10/site-packages/pyarrow/../../../libarrow.so.700(+0x66288c)[0x7f0feeae088c]
miniconda3/envs/dev/lib/python3.10/site-packages/pyarrow/../../../libarrow.so.700(_ZN5arrow4util8ArrowLogD1Ev+0x101)[0x7f0feeae0c91]
miniconda3/envs/dev/lib/python3.10/site-packages/pyarrow/../../../libarrow_flight.so.700(+0x7c1e1)[0x7f0fa9e331e1]
miniconda3/envs/dev/lib/python3.10/site-packages/pyarrow/lib.cpython-310-x86_64-linux-gnu.so(+0x17cf1a)[0x7f0fefe7ff1a]
miniconda3/envs/dev/bin/python(_PyObject_GenericGetAttrWithDict+0x4f3)[0x559a7cb8da03]
miniconda3/envs/dev/bin/python(+0x144814)[0x559a7cb8f814]
miniconda3/envs/dev/bin/python(+0x1445bf)[0x559a7cb8f5bf]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x30c)[0x559a7cb7ebcc]
miniconda3/envs/dev/bin/python(+0x1516ac)[0x559a7cb9c6ac]
miniconda3/envs/dev/bin/python(PyObject_Call+0xb8)[0x559a7cb9d348]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x2b05)[0x559a7cb813c5]
miniconda3/envs/dev/bin/python(_PyFunction_Vectorcall+0x6f)[0x559a7cb8f3cf]
miniconda3/envs/dev/bin/python(+0x1ead44)[0x559a7cc35d44]
miniconda3/envs/dev/bin/python(+0x220397)[0x559a7cc6b397]
miniconda3/envs/dev/bin/python(PyObject_Call+0xb8)[0x559a7cb9d348]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x2b05)[0x559a7cb813c5]
miniconda3/envs/dev/bin/python(_PyFunction_Vectorcall+0x6f)[0x559a7cb8f3cf]
miniconda3/envs/dev/bin/python(PyObject_Call+0xb8)[0x559a7cb9d348]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x2b05)[0x559a7cb813c5]
miniconda3/envs/dev/bin/python(+0x1516ac)[0x559a7cb9c6ac]
miniconda3/envs/dev/bin/python(PyObject_Call+0xb8)[0x559a7cb9d348]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x2b05)[0x559a7cb813c5]
miniconda3/envs/dev/bin/python(+0x151ef3)[0x559a7cb9cef3]
miniconda3/envs/dev/bin/python(+0x1ead44)[0x559a7cc35d44]
miniconda3/envs/dev/bin/python(+0x220397)[0x559a7cc6b397]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x1311)[0x559a7cb7fbd1]
miniconda3/envs/dev/bin/python(_PyFunction_Vectorcall+0x6f)[0x559a7cb8f3cf]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x30c)[0x559a7cb7ebcc]
miniconda3/envs/dev/bin/python(_PyFunction_Vectorcall+0x6f)[0x559a7cb8f3cf]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x2b05)[0x559a7cb813c5]
miniconda3/envs/dev/bin/python(_PyFunction_Vectorcall+0x6f)[0x559a7cb8f3cf]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x66f)[0x559a7cb7ef2f]
miniconda3/envs/dev/bin/python(+0x14fc9d)[0x559a7cb9ac9d]
miniconda3/envs/dev/bin/python(_PyObject_GenericGetAttrWithDict+0x4f3)[0x559a7cb8da03]
miniconda3/envs/dev/bin/python(PyObject_GetAttr+0x44)[0x559a7cb8c494]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x4d8f)[0x559a7cb8364f]
miniconda3/envs/dev/bin/python(+0x14fc9d)[0x559a7cb9ac9d]
miniconda3/envs/dev/bin/python(+0x1416f5)[0x559a7cb8c6f5]
miniconda3/envs/dev/bin/python(PyObject_GetAttr+0x52)[0x559a7cb8c4a2]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x4d8f)[0x559a7cb8364f]
miniconda3/envs/dev/bin/python(+0x14fc9d)[0x559a7cb9ac9d]
miniconda3/envs/dev/bin/python(_PyObject_GenericGetAttrWithDict+0x4f3)[0x559a7cb8da03]
miniconda3/envs/dev/bin/python(PyObject_GetAttr+0x44)[0x559a7cb8c494]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x4d8f)[0x559a7cb8364f]
miniconda3/envs/dev/bin/python(+0x15a178)[0x559a7cba5178]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x9ca)[0x559a7cb7f28a]
miniconda3/envs/dev/bin/python(+0x15a178)[0x559a7cba5178]
miniconda3/envs/dev/bin/python(+0x1602d9)[0x559a7cbab2d9]
miniconda3/envs/dev/bin/python(+0x19d5f5)[0x559a7cbe85f5]
miniconda3/envs/dev/bin/python(_PyEval_EvalFrameDefault+0x30c)[0x559a7cb7ebcc]
miniconda3/envs/dev/bin/python(+0x15a178)[0x559a7cba5178]
{code}

[jira] [Created] (ARROW-15831) [Java] Upgrade Flight dependencies

2022-03-02 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-15831:


 Summary: [Java] Upgrade Flight dependencies
 Key: ARROW-15831
 URL: https://issues.apache.org/jira/browse/ARROW-15831
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Bryan Cutler
Assignee: Bryan Cutler


Upgrade the gRPC, Netty and protobuf dependencies for Flight.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15746) [Java] Add arrow-flight pom to list of artifacts to deploy

2022-02-21 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-15746:


 Summary: [Java] Add arrow-flight pom to list of artifacts to deploy
 Key: ARROW-15746
 URL: https://issues.apache.org/jira/browse/ARROW-15746
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Bryan Cutler
Assignee: Bryan Cutler


The arrow-flight pom is currently not being deployed, see 
https://lists.apache.org/thread/fbrgvf30os5h4ox7fk4txrlgdp1g5g4g





[jira] [Created] (ARROW-15722) [Java] Improve error message for ListVector with wrong number of children

2022-02-17 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-15722:


 Summary: [Java] Improve error message for ListVector with wrong 
number of children
 Key: ARROW-15722
 URL: https://issues.apache.org/jira/browse/ARROW-15722
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Bryan Cutler
Assignee: Bryan Cutler


If a ListVector is made without any children, the error message will say "Lists 
have only one child. Found: []".

The wording could be improved a little to let the user know what went wrong.





[jira] [Created] (ARROW-14198) [Java] Upgrade Netty and gRPC dependencies

2021-10-01 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-14198:


 Summary: [Java] Upgrade Netty and gRPC dependencies
 Key: ARROW-14198
 URL: https://issues.apache.org/jira/browse/ARROW-14198
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Bryan Cutler
Assignee: Bryan Cutler


Current versions in use are quite old and subject to vulnerabilities.

See https://www.cvedetails.com/cve/CVE-2021-21409/





[jira] [Created] (ARROW-13872) [Java] ExtensionTypeVector does not work with RangeEqualsVisitor

2021-09-02 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-13872:


 Summary: [Java] ExtensionTypeVector does not work with 
RangeEqualsVisitor
 Key: ARROW-13872
 URL: https://issues.apache.org/jira/browse/ARROW-13872
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Affects Versions: 5.0.0
Reporter: Bryan Cutler
Assignee: Bryan Cutler


When using an ExtensionTypeVector with a RangeEqualsVisitor to compare against 
another extension type vector, the comparison fails. In vector.accept(), the 
extension type defers to its underlying vector, but the same is not done for the 
vector initially set in the RangeEqualsVisitor, so the comparison either fails 
due to mismatched types or attempts to cast the extension vector to the 
underlying vector type.





[jira] [Created] (ARROW-13076) [Java] Enable ExtensionType to use StructVector and UnionVector for underlying storage

2021-06-14 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-13076:


 Summary: [Java] Enable ExtensionType to use StructVector and 
UnionVector for underlying storage
 Key: ARROW-13076
 URL: https://issues.apache.org/jira/browse/ARROW-13076
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Bryan Cutler
Assignee: Bryan Cutler


Currently, ExtensionTypeVector has a type constraint requiring the underlying 
storage to extend BaseValueVector. StructVector, UnionVector and 
DenseUnionVector do not extend this base class.

After ARROW-13044, union vectors will implement the ValueVector interface, and the 
extension vector type constraint could be relaxed to this interface to allow the 
above vector types to be used.





[jira] [Created] (ARROW-13044) [Java] Union vectors should extend BaseValueVector

2021-06-10 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-13044:


 Summary: [Java] Union vectors should extend BaseValueVector
 Key: ARROW-13044
 URL: https://issues.apache.org/jira/browse/ARROW-13044
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Affects Versions: 4.0.1
Reporter: Bryan Cutler
Assignee: Bryan Cutler


I was going to try using a DenseUnionVector as the underlying vector of an 
extension type, but it's not currently possible: ExtensionTypeVector requires 
the underlying storage to extend BaseValueVector, and the union vectors do not 
extend this class.

It should be possible for UnionVector and DenseUnionVector to extend 
AbstractContainerVector, which is a subclass of BaseValueVector.





[jira] [Created] (ARROW-11739) [Java] Add API for getBufferSize() with density to BaseVariableWidthVector

2021-02-22 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-11739:


 Summary: [Java] Add API for getBufferSize() with density to 
BaseVariableWidthVector
 Key: ARROW-11739
 URL: https://issues.apache.org/jira/browse/ARROW-11739
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Bryan Cutler








[jira] [Created] (ARROW-11382) [Java] NullVector field name can't be set

2021-01-25 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-11382:


 Summary: [Java] NullVector field name can't be set
 Key: ARROW-11382
 URL: https://issues.apache.org/jira/browse/ARROW-11382
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Bryan Cutler


Currently, the Java NullVector has a default Field name fixed to 
DATA_VECTOR_NAME, which is "$data$". Users should be able to change this, most 
likely via an alternate constructor that accepts a name.





[jira] [Created] (ARROW-10512) [Python] Arrow to Pandas conversion promotes child array to float for NULL values

2020-11-06 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-10512:


 Summary: [Python] Arrow to Pandas conversion promotes child array 
to float for NULL values
 Key: ARROW-10512
 URL: https://issues.apache.org/jira/browse/ARROW-10512
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Bryan Cutler


When converting a nested Arrow array to Pandas, if a child array is an integer 
type with NULL values, it gets promoted to floating point and the NULL values are 
replaced with NaNs. Since the Pandas conversion for these types results in 
Python objects, using NaN is not necessary; `None` values could be inserted 
instead. This applies to ListType, MapType, StructType, etc.

{code}
In [4]: s = pd.Series([[1, 2, 3], [4, 5, None]])

In [5]: arr = pa.Array.from_pandas(s)

In [6]: arr.type
Out[6]: ListType(list)

In [7]: arr.to_pandas()
Out[7]: 
0    [1.0, 2.0, 3.0]
1    [4.0, 5.0, nan]
dtype: object
{code}





[jira] [Created] (ARROW-10457) [CI] Fix Spark branch-3.0 integration tests

2020-11-01 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-10457:


 Summary: [CI] Fix Spark branch-3.0 integration tests
 Key: ARROW-10457
 URL: https://issues.apache.org/jira/browse/ARROW-10457
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Bryan Cutler


The Spark branch-3.0 build is currently failing because that branch has not been 
updated or patched to use the latest Arrow Java, see 
https://github.com/ursa-labs/crossbow/actions?query=branch:actions-681-github-test-conda-python-3.7-spark-branch-3.0.
Since Spark branch-3.0 has already been released and can only receive bug fixes, 
we should not try to rebuild Spark with the latest Arrow Java; instead, we 
should only test against the latest pyarrow. This should work, but might also 
need a minor Python patch.





[jira] [Created] (ARROW-10260) [Python] Missing MapType to Pandas dtype

2020-10-09 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-10260:


 Summary: [Python] Missing MapType to Pandas dtype
 Key: ARROW-10260
 URL: https://issues.apache.org/jira/browse/ARROW-10260
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Bryan Cutler


The Map type conversion to Pandas added in ARROW-10151 did not add a dtype 
mapping for {{to_pandas_dtype()}}.

 
{code}
In [2]: d = pa.map_(pa.int64(), pa.float64())

In [3]: d.to_pandas_dtype()
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
----> 1 d.to_pandas_dtype()

~/miniconda2/envs/pyarrow-test/lib/python3.7/site-packages/pyarrow/types.pxi in pyarrow.lib.DataType.to_pandas_dtype()

NotImplementedError: map
{code}





[jira] [Created] (ARROW-10151) [Python] Add support MapArray to_pandas conversion

2020-10-01 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-10151:


 Summary: [Python] Add support MapArray to_pandas conversion
 Key: ARROW-10151
 URL: https://issues.apache.org/jira/browse/ARROW-10151
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Bryan Cutler
 Fix For: 2.0.0


MapArray does not currently support to_pandas conversion and raises a 
{{Status::NotImplemented("No known equivalent Pandas block for Arrow data of 
type ")}}.

Conversion from Pandas seems to work, but we should verify that tests are in 
place.





[jira] [Created] (ARROW-9750) [Doc][Python] Add usage of py.Array scalar operations behavior

2020-08-14 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-9750:
---

 Summary: [Doc][Python] Add usage of py.Array scalar operations 
behavior
 Key: ARROW-9750
 URL: https://issues.apache.org/jira/browse/ARROW-9750
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Python
Affects Versions: 1.0.0
Reporter: Bryan Cutler


Recent changes in 1.0.0 affected the way pyarrow.Array scalars handle 
operations such as equality. For example, an equality check compares object 
equivalence and returns False regardless of the value. Since this could be 
confusing to users, this behavior should be documented.





[jira] [Created] (ARROW-9576) [Doc] Fix error in code example for extension types

2020-07-27 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-9576:
---

 Summary: [Doc] Fix error in code example for extension types
 Key: ARROW-9576
 URL: https://issues.apache.org/jira/browse/ARROW-9576
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.0.0
Reporter: Bryan Cutler
Assignee: Bryan Cutler


There is an error in the example code using an undefined variable `arr` instead 
of `self` here 
https://arrow.apache.org/docs/python/extending_types.html#conversion-to-pandas





[jira] [Created] (ARROW-9545) [Java] Add forward compatibility checks for unrecognized future MetadataVersion

2020-07-23 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-9545:
---

 Summary: [Java] Add forward compatibility checks for unrecognized 
future MetadataVersion
 Key: ARROW-9545
 URL: https://issues.apache.org/jira/browse/ARROW-9545
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Bryan Cutler
Assignee: Wes McKinney
 Fix For: 1.0.0


In theory these checks should not be needed, but they provide a safeguard 
should it become necessary to increment the MetadataVersion some years in the 
future.





[jira] [Created] (ARROW-9357) [Java] Document how to set netty/unsafe allocators

2020-07-07 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-9357:
---

 Summary: [Java] Document how to set netty/unsafe allocators
 Key: ARROW-9357
 URL: https://issues.apache.org/jira/browse/ARROW-9357
 Project: Apache Arrow
  Issue Type: Task
  Components: Documentation
Reporter: Bryan Cutler


There are now two allocators available, one based on Netty and one using unsafe 
APIs. We should provide end-user documentation on which one is the default and 
how to set and use each one.





[jira] [Created] (ARROW-9356) [Java] Remove Netty dependency from arrow-vector

2020-07-07 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-9356:
---

 Summary: [Java] Remove Netty dependency from arrow-vector 
 Key: ARROW-9356
 URL: https://issues.apache.org/jira/browse/ARROW-9356
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Bryan Cutler
 Fix For: 1.0.0


Cleanup remaining usage of Netty from arrow-vector and remove as a dependency.





[jira] [Resolved] (ARROW-6111) [Java] Support LargeVarChar and LargeBinary types and add integration test with C++

2020-05-27 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-6111.
-
Resolution: Fixed

Issue resolved by pull request 6425
[https://github.com/apache/arrow/pull/6425]

> [Java] Support LargeVarChar and LargeBinary types and add integration test 
> with C++
> ---
>
> Key: ARROW-6111
> URL: https://issues.apache.org/jira/browse/ARROW-6111
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Integration, Java
>Reporter: Micah Kornfield
>Assignee: Liya Fan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-8731) Error when using toPandas with PyArrow

2020-05-07 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101844#comment-17101844
 ] 

Bryan Cutler commented on ARROW-8731:
-

[~are...@wayfair.com] you should be able to use a newer version of pyarrow with 
pyspark 2.4.4 by following the instructions here 
https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#compatibiliy-setting-for-pyarrow--0150-and-spark-23x-24x

> Error when using toPandas with PyArrow
> --
>
> Key: ARROW-8731
> URL: https://issues.apache.org/jira/browse/ARROW-8731
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: Python Environment on the worker and driver
> - jupyter==1.0.0
> - pandas==1.0.3
> - pyarrow==0.14.0
> - pyspark==2.4.0
> - py4j==0.10.7
> - pyarrow==0.14.0
>Reporter: Andrew Redd
>Priority: Blocker
>
> I'm getting the following error when calling toPandas on a spark dataframe. I 
> imagine my pyspark and pyarrow versions are clashing somehow but I haven't 
> found this same issue by anyone else online
>  * This is a blocker to our use of pyarrow on a project
>  
> {code:java}
> ---
> TypeError Traceback (most recent call last)
>  in 
> > 1 df.limit(100).toPandas()
> /venv/lib/python3.6/site-packages/pyspark/sql/dataframe.py in toPandas(self)
>2119 _check_dataframe_localize_timestamps
>2120 import pyarrow
> -> 2121 batches = self._collectAsArrow()
>2122 if len(batches) > 0:
>2123 table = pyarrow.Table.from_batches(batches)
> /venv/lib/python3.6/site-packages/pyspark/sql/dataframe.py in 
> _collectAsArrow(self)
>2177 with SCCallSiteSync(self._sc) as css:
>2178 sock_info = self._jdf.collectAsArrowToPython()
> -> 2179 return list(_load_from_socket(sock_info, 
> ArrowStreamSerializer()))
>2180 
>2181 
> ##
> /venv/lib/python3.6/site-packages/pyspark/rdd.py in 
> _load_from_socket(sock_info, serializer)
> 142 
> 143 def _load_from_socket(sock_info, serializer):
> --> 144 (sockfile, sock) = local_connect_and_auth(*sock_info)
> 145 # The RDD materialization time is unpredicable, if we set a 
> timeout for socket reading
> 146 # operation, it will very possibly fail. See SPARK-18281.
> TypeError: local_connect_and_auth() takes 2 positional arguments but 3 were 
> given
> {code}





[jira] [Resolved] (ARROW-7610) [Java] Finish support for 64 bit int allocations

2020-04-27 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-7610.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 6323
[https://github.com/apache/arrow/pull/6323]

> [Java] Finish support for 64 bit int allocations 
> -
>
> Key: ARROW-7610
> URL: https://issues.apache.org/jira/browse/ARROW-7610
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 8.5h
>  Remaining Estimate: 0h
>
> 1.  Add an allocator capable of allocating larger than 2GB of data.
> 2.  Do an end-to-end round trip on a larger vector/record batch size.





[jira] [Resolved] (ARROW-8386) [Python] pyarrow.jvm raises error for empty Arrays

2020-04-13 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-8386.
-
Fix Version/s: 0.17.0
   Resolution: Fixed

Issue resolved by pull request 6889
[https://github.com/apache/arrow/pull/6889]

> [Python] pyarrow.jvm raises error for empty Arrays
> --
>
> Key: ARROW-8386
> URL: https://issues.apache.org/jira/browse/ARROW-8386
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.16.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.17.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> In the pyarrow.jvm module, when there is an empty array in Java, trying to 
> create it in python raises a ValueError. This is because for an empty array, 
> Java returns an empty list of buffers, then pyarrow.jvm attempts to create 
> the array with pa.Array.from_buffers with an empty list.





[jira] [Created] (ARROW-8386) [Python] pyarrow.jvm raises error for empty Arrays

2020-04-09 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-8386:
---

 Summary: [Python] pyarrow.jvm raises error for empty Arrays
 Key: ARROW-8386
 URL: https://issues.apache.org/jira/browse/ARROW-8386
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.16.0
Reporter: Bryan Cutler
Assignee: Bryan Cutler


In the pyarrow.jvm module, when there is an empty array in Java, trying to 
create it in Python raises a ValueError. This is because, for an empty array, 
Java returns an empty list of buffers, and pyarrow.jvm then attempts to create 
the array with pa.Array.from_buffers and an empty list.





[jira] [Commented] (ARROW-5649) [Integration][C++][Java] Create round trip integration test for extension types

2020-03-05 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17052379#comment-17052379
 ] 

Bryan Cutler commented on ARROW-5649:
-

[~npr] I think in the scope of our current integration tests, yes, this is 
effectively the same. It would be nice to test the additional steps of creating 
the extension type array across implementations and verifying it works as 
expected. I'm not sure how that would be done in our existing integration 
testing framework, though.

> [Integration][C++][Java] Create round trip integration test for extension 
> types
> ---
>
> Key: ARROW-5649
> URL: https://issues.apache.org/jira/browse/ARROW-5649
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Integration, Java
>Affects Versions: 0.16.0
>Reporter: Micah Kornfield
>Priority: Major
> Fix For: 1.0.0
>
>
> With Java and C++ code merged we should verify round-trip of the type.





[jira] [Updated] (ARROW-7966) [Integration][Flight][C++] Client should verify each batch independently

2020-02-28 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated ARROW-7966:

Component/s: Integration
 FlightRPC

> [Integration][Flight][C++] Client should verify each batch independently
> 
>
> Key: ARROW-7966
> URL: https://issues.apache.org/jira/browse/ARROW-7966
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, FlightRPC, Integration
>Reporter: Bryan Cutler
>Priority: Major
>
> Currently the C++ Flight test client in {{test_integration_client.cc}} reads 
> all batches from JSON into a Table, reads all batches in the flight stream 
> from the server into a Table, then compares the Tables for equality.  This is 
> potentially a problem because a record batch might have specific information 
> that is then lost in the conversion to a Table. For example, if the server 
> sends empty batches, the resulting Table would not be different from one with 
> no empty batches.
> Instead, the client should check each record batch from the JSON file against 
> each record batch from the server independently. 





[jira] [Created] (ARROW-7966) [Integration][Flight][C++] Client should verify each batch independently

2020-02-28 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-7966:
---

 Summary: [Integration][Flight][C++] Client should verify each 
batch independently
 Key: ARROW-7966
 URL: https://issues.apache.org/jira/browse/ARROW-7966
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Bryan Cutler


Currently the C++ Flight test client in {{test_integration_client.cc}} reads 
all batches from JSON into a Table, reads all batches in the flight stream from 
the server into a Table, then compares the Tables for equality.  This is 
potentially a problem because a record batch might have specific information 
that is then lost in the conversion to a Table. For example, if the server 
sends empty batches, the resulting Table would not be different from one with 
no empty batches.

Instead, the client should check each record batch from the JSON file against 
each record batch from the server independently. 





[jira] [Created] (ARROW-7933) [Java][Flight][Tests] Add roundtrip tests for Java Flight Test Client

2020-02-24 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-7933:
---

 Summary: [Java][Flight][Tests] Add roundtrip tests for Java Flight 
Test Client
 Key: ARROW-7933
 URL: https://issues.apache.org/jira/browse/ARROW-7933
 Project: Apache Arrow
  Issue Type: Improvement
  Components: FlightRPC, Java
Reporter: Bryan Cutler


There should be some built-in roundtrip tests for the Java Flight 
IntegrationTestClient.





[jira] [Resolved] (ARROW-7899) [Integration][Java] null type integration test

2020-02-24 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-7899.
-
Resolution: Fixed

Issue resolved by pull request 6476
[https://github.com/apache/arrow/pull/6476]

> [Integration][Java] null type integration test
> --
>
> Key: ARROW-7899
> URL: https://issues.apache.org/jira/browse/ARROW-7899
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Integration, Java
>Reporter: Neal Richardson
>Assignee: Bryan Cutler
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> From [https://github.com/apache/arrow/pull/6368]
>  
> h3.  *[lidavidm|https://github.com/lidavidm]*  commented [2 days 
> ago|https://github.com/apache/arrow/pull/6368#issuecomment-587593218]
> |[https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java#L208]
>  
> If I'm not mistaken, this means that we only compare the data fully if 
> there's actual data in both JSON and in the Arrow file?
> Though the Flight test is also potentially wrong:
>  
> [https://github.com/apache/arrow/blob/master/java/flight/flight-core/src/main/java/org/apache/arrow/flight/example/integration/IntegrationTestClient.java#L166-L173]
> It only compares the last batch sent over the wire.|





[jira] [Assigned] (ARROW-7899) [Integration][Java] null type integration test

2020-02-20 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned ARROW-7899:
---

Assignee: Bryan Cutler

> [Integration][Java] null type integration test
> --
>
> Key: ARROW-7899
> URL: https://issues.apache.org/jira/browse/ARROW-7899
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Integration, Java
>Reporter: Neal Richardson
>Assignee: Bryan Cutler
>Priority: Blocker
> Fix For: 1.0.0
>
>
> From [https://github.com/apache/arrow/pull/6368]
>  
> h3.  *[lidavidm|https://github.com/lidavidm]*  commented [2 days 
> ago|https://github.com/apache/arrow/pull/6368#issuecomment-587593218]
> |[https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java#L208]
>  
> If I'm not mistaken, this means that we only compare the data fully if 
> there's actual data in both JSON and in the Arrow file?
> Though the Flight test is also potentially wrong:
>  
> [https://github.com/apache/arrow/blob/master/java/flight/flight-core/src/main/java/org/apache/arrow/flight/example/integration/IntegrationTestClient.java#L166-L173]
> It only compares the last batch sent over the wire.|





[jira] [Commented] (ARROW-7899) [Integration][Java] null type integration test

2020-02-20 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041330#comment-17041330
 ] 

Bryan Cutler commented on ARROW-7899:
-

I can look into this

> [Integration][Java] null type integration test
> --
>
> Key: ARROW-7899
> URL: https://issues.apache.org/jira/browse/ARROW-7899
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Integration, Java
>Reporter: Neal Richardson
>Priority: Blocker
> Fix For: 1.0.0
>
>
> From [https://github.com/apache/arrow/pull/6368]
>  
> h3.  *[lidavidm|https://github.com/lidavidm]*  commented [2 days 
> ago|https://github.com/apache/arrow/pull/6368#issuecomment-587593218]
> |[https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java#L208]
>  
> If I'm not mistaken, this means that we only compare the data fully if 
> there's actual data in both JSON and in the Arrow file?
> Though the Flight test is also potentially wrong:
>  
> [https://github.com/apache/arrow/blob/master/java/flight/flight-core/src/main/java/org/apache/arrow/flight/example/integration/IntegrationTestClient.java#L166-L173]
> It only compares the last batch sent over the wire.|





[jira] [Updated] (ARROW-7899) [Integration][Java] null type integration test

2020-02-20 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated ARROW-7899:

Description: 
From [https://github.com/apache/arrow/pull/6368]

 
h3.  *[lidavidm|https://github.com/lidavidm]*  commented [2 days 
ago|https://github.com/apache/arrow/pull/6368#issuecomment-587593218]
|[https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java#L208]
 
If I'm not mistaken, this means that we only compare the data fully if there's 
actual data in both JSON and in the Arrow file?
Though the Flight test is also potentially wrong:
 
[https://github.com/apache/arrow/blob/master/java/flight/flight-core/src/main/java/org/apache/arrow/flight/example/integration/IntegrationTestClient.java#L166-L173]
It only compares the last batch sent over the wire.|

> [Integration][Java] null type integration test
> --
>
> Key: ARROW-7899
> URL: https://issues.apache.org/jira/browse/ARROW-7899
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Integration, Java
>Reporter: Neal Richardson
>Priority: Blocker
> Fix For: 1.0.0
>
>
> From [https://github.com/apache/arrow/pull/6368]
>  
> h3.  *[lidavidm|https://github.com/lidavidm]*  commented [2 days 
> ago|https://github.com/apache/arrow/pull/6368#issuecomment-587593218]
> |[https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java#L208]
>  
> If I'm not mistaken, this means that we only compare the data fully if 
> there's actual data in both JSON and in the Arrow file?
> Though the Flight test is also potentially wrong:
>  
> [https://github.com/apache/arrow/blob/master/java/flight/flight-core/src/main/java/org/apache/arrow/flight/example/integration/IntegrationTestClient.java#L166-L173]
> It only compares the last batch sent over the wire.|





[jira] [Resolved] (ARROW-7467) [Java] ComplexCopier does incorrect copy for Map nullable info

2020-02-05 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-7467.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 6094
[https://github.com/apache/arrow/pull/6094]

> [Java] ComplexCopier does incorrect copy for Map nullable info
> --
>
> Key: ARROW-7467
> URL: https://issues.apache.org/jira/browse/ARROW-7467
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> The {{MapVector}} and its 'value' vector are nullable, and its 
> {{structVector}} and 'key' vector are non-nullable.
> However, the {{MapVector}} generated by ComplexCopier has all nullable 
> fields, which is not correct.





[jira] [Updated] (ARROW-7405) [Java] ListVector isEmpty API is incorrect

2020-02-05 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated ARROW-7405:

Priority: Minor  (was: Major)

> [Java] ListVector isEmpty API is incorrect
> --
>
> Key: ARROW-7405
> URL: https://issues.apache.org/jira/browse/ARROW-7405
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
>  Currently the {{isEmpty}} API always returns false in 
> {{BaseRepeatedValueVector}}, and its subclass {{ListVector}} does not 
> override this method.
> This leads to incorrect results; for example, a {{ListVector}} with data 
> [1,2], null, [], [5,6] should yield [false, false, true, false] from this API, 
> but it currently returns [false, false, false, false].
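The expected semantics can be sketched in plain Python (not the Arrow Java API; list vectors are modeled Arrow-style with an offsets array plus a validity mask). An entry is "empty" only when it is a valid, zero-length list; a null entry is not reported as empty, matching the [false, false, true, false] expectation in the report:

```python
def is_empty(offsets, validity, index):
    if not validity[index]:  # null entry: not an empty list
        return False
    # length of entry i is offsets[i+1] - offsets[i]
    return offsets[index + 1] - offsets[index] == 0

# Data [1,2], null, [], [5,6] -> flattened values [1,2,5,6]
offsets = [0, 2, 2, 2, 4]
validity = [True, False, True, True]

print([is_empty(offsets, validity, i) for i in range(4)])
# -> [False, False, True, False]
```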





[jira] [Resolved] (ARROW-7405) [Java] ListVector isEmpty API is incorrect

2020-02-05 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-7405.
-
Resolution: Fixed

Resolved from https://github.com/apache/arrow/pull/6044

> [Java] ListVector isEmpty API is incorrect
> --
>
> Key: ARROW-7405
> URL: https://issues.apache.org/jira/browse/ARROW-7405
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
>  Currently the {{isEmpty}} API always returns false in 
> {{BaseRepeatedValueVector}}, and its subclass {{ListVector}} does not 
> override this method.
> This leads to incorrect results; for example, a {{ListVector}} with data 
> [1,2], null, [], [5,6] should yield [false, false, true, false] from this API, 
> but it currently returns [false, false, false, false].





[jira] [Updated] (ARROW-7405) [Java] ListVector isEmpty API is incorrect

2020-02-05 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated ARROW-7405:

Fix Version/s: 1.0.0

> [Java] ListVector isEmpty API is incorrect
> --
>
> Key: ARROW-7405
> URL: https://issues.apache.org/jira/browse/ARROW-7405
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
>  Currently the {{isEmpty}} API always returns false in 
> {{BaseRepeatedValueVector}}, and its subclass {{ListVector}} does not 
> override this method.
> This leads to incorrect results; for example, a {{ListVector}} with data 
> [1,2], null, [], [5,6] should yield [false, false, true, false] from this API, 
> but it currently returns [false, false, false, false].





[jira] [Resolved] (ARROW-7770) [Release] Archery does not use correct integration test args

2020-02-04 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-7770.
-
Resolution: Duplicate

> [Release] Archery does not use correct integration test args
> 
>
> Key: ARROW-7770
> URL: https://issues.apache.org/jira/browse/ARROW-7770
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Archery
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> When using the release verification script and selecting integration tests, 
> Archery ignores the selected tests and runs all of them.





[jira] [Created] (ARROW-7770) [Release] Archery does not use correct integration test args

2020-02-04 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-7770:
---

 Summary: [Release] Archery does not use correct integration test 
args
 Key: ARROW-7770
 URL: https://issues.apache.org/jira/browse/ARROW-7770
 Project: Apache Arrow
  Issue Type: Bug
  Components: Archery
Reporter: Bryan Cutler
Assignee: Bryan Cutler


When using the release verification script and selecting integration tests, Archery 
ignores the selected tests and runs all of them.





[jira] [Commented] (ARROW-7723) [Python] StructArray timestamp type with timezone to_pandas convert error

2020-01-30 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026977#comment-17026977
 ] 

Bryan Cutler commented on ARROW-7723:
-

Thanks for the explanation [~wesm], makes sense

> [Python] StructArray  timestamp type with timezone to_pandas convert error
> --
>
> Key: ARROW-7723
> URL: https://issues.apache.org/jira/browse/ARROW-7723
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Bryan Cutler
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> When a {{StructArray}} has a child that is a timestamp with a timezone, the 
> {{to_pandas}} conversion outputs an int64 instead of a timestamp
> {code:java}
> In [1]: import pyarrow as pa
>    ...: import pandas as pd
>    ...: arr = pa.array([{'start': pd.Timestamp.now(), 'end': pd.Timestamp.now()}])
>
> In [2]: arr.to_pandas()
> Out[2]:
> 0    {'end': 2020-01-29 11:38:02.792681, 'start': 2...
> dtype: object
>
> In [3]: ts = pd.Timestamp.now()
>
> In [4]: arr2 = pa.array([ts], type=pa.timestamp('us', tz='America/New_York'))
>
> In [5]: arr2.to_pandas()
> Out[5]:
> 0   2020-01-29 06:38:47.848944-05:00
> dtype: datetime64[ns, America/New_York]
>
> In [6]: arr = pa.StructArray.from_arrays([arr2, arr2], ['start', 'stop'])
>
> In [7]: arr.to_pandas()
> Out[7]:
> 0    {'start': 1580297927848944000, 'stop': 1580297...
> dtype: object
> {code}
> from https://github.com/apache/arrow/pull/6312





[jira] [Updated] (ARROW-7723) [Python] StructArray timestamp type with timezone to_pandas convert error

2020-01-29 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated ARROW-7723:

Priority: Blocker  (was: Major)

> [Python] StructArray  timestamp type with timezone to_pandas convert error
> --
>
> Key: ARROW-7723
> URL: https://issues.apache.org/jira/browse/ARROW-7723
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Bryan Cutler
>Priority: Blocker
>
> When a {{StructArray}} has a child that is a timestamp with a timezone, the 
> {{to_pandas}} conversion outputs an int64 instead of a timestamp
> {code:java}
> In [1]: import pyarrow as pa
>    ...: import pandas as pd
>    ...: arr = pa.array([{'start': pd.Timestamp.now(), 'end': pd.Timestamp.now()}])
>
> In [2]: arr.to_pandas()
> Out[2]:
> 0    {'end': 2020-01-29 11:38:02.792681, 'start': 2...
> dtype: object
>
> In [3]: ts = pd.Timestamp.now()
>
> In [4]: arr2 = pa.array([ts], type=pa.timestamp('us', tz='America/New_York'))
>
> In [5]: arr2.to_pandas()
> Out[5]:
> 0   2020-01-29 06:38:47.848944-05:00
> dtype: datetime64[ns, America/New_York]
>
> In [6]: arr = pa.StructArray.from_arrays([arr2, arr2], ['start', 'stop'])
>
> In [7]: arr.to_pandas()
> Out[7]:
> 0    {'start': 1580297927848944000, 'stop': 1580297...
> dtype: object
> {code}
> from https://github.com/apache/arrow/pull/6312





[jira] [Commented] (ARROW-7723) [Python] StructArray timestamp type with timezone to_pandas convert error

2020-01-29 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026233#comment-17026233
 ] 

Bryan Cutler commented on ARROW-7723:
-

This is a regression, so marking as blocker for now.

> [Python] StructArray  timestamp type with timezone to_pandas convert error
> --
>
> Key: ARROW-7723
> URL: https://issues.apache.org/jira/browse/ARROW-7723
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Bryan Cutler
>Priority: Blocker
> Fix For: 0.16.0
>
>
> When a {{StructArray}} has a child that is a timestamp with a timezone, the 
> {{to_pandas}} conversion outputs an int64 instead of a timestamp
> {code:java}
> In [1]: import pyarrow as pa
>    ...: import pandas as pd
>    ...: arr = pa.array([{'start': pd.Timestamp.now(), 'end': pd.Timestamp.now()}])
>
> In [2]: arr.to_pandas()
> Out[2]:
> 0    {'end': 2020-01-29 11:38:02.792681, 'start': 2...
> dtype: object
>
> In [3]: ts = pd.Timestamp.now()
>
> In [4]: arr2 = pa.array([ts], type=pa.timestamp('us', tz='America/New_York'))
>
> In [5]: arr2.to_pandas()
> Out[5]:
> 0   2020-01-29 06:38:47.848944-05:00
> dtype: datetime64[ns, America/New_York]
>
> In [6]: arr = pa.StructArray.from_arrays([arr2, arr2], ['start', 'stop'])
>
> In [7]: arr.to_pandas()
> Out[7]:
> 0    {'start': 1580297927848944000, 'stop': 1580297...
> dtype: object
> {code}
> from https://github.com/apache/arrow/pull/6312





[jira] [Updated] (ARROW-7723) [Python] StructArray timestamp type with timezone to_pandas convert error

2020-01-29 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated ARROW-7723:

Fix Version/s: 0.16.0

> [Python] StructArray  timestamp type with timezone to_pandas convert error
> --
>
> Key: ARROW-7723
> URL: https://issues.apache.org/jira/browse/ARROW-7723
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Bryan Cutler
>Priority: Blocker
> Fix For: 0.16.0
>
>
> When a {{StructArray}} has a child that is a timestamp with a timezone, the 
> {{to_pandas}} conversion outputs an int64 instead of a timestamp
> {code:java}
> In [1]: import pyarrow as pa
>    ...: import pandas as pd
>    ...: arr = pa.array([{'start': pd.Timestamp.now(), 'end': pd.Timestamp.now()}])
>
> In [2]: arr.to_pandas()
> Out[2]:
> 0    {'end': 2020-01-29 11:38:02.792681, 'start': 2...
> dtype: object
>
> In [3]: ts = pd.Timestamp.now()
>
> In [4]: arr2 = pa.array([ts], type=pa.timestamp('us', tz='America/New_York'))
>
> In [5]: arr2.to_pandas()
> Out[5]:
> 0   2020-01-29 06:38:47.848944-05:00
> dtype: datetime64[ns, America/New_York]
>
> In [6]: arr = pa.StructArray.from_arrays([arr2, arr2], ['start', 'stop'])
>
> In [7]: arr.to_pandas()
> Out[7]:
> 0    {'start': 1580297927848944000, 'stop': 1580297...
> dtype: object
> {code}
> from https://github.com/apache/arrow/pull/6312





[jira] [Created] (ARROW-7723) [Python] StructArray timestamp type with timezone to_pandas convert error

2020-01-29 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-7723:
---

 Summary: [Python] StructArray  timestamp type with timezone 
to_pandas convert error
 Key: ARROW-7723
 URL: https://issues.apache.org/jira/browse/ARROW-7723
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Bryan Cutler


When a {{StructArray}} has a child that is a timestamp with a timezone, the 
{{to_pandas}} conversion outputs an int64 instead of a timestamp
{code:java}
In [1]: import pyarrow as pa
   ...: import pandas as pd
   ...: arr = pa.array([{'start': pd.Timestamp.now(), 'end': pd.Timestamp.now()}])

In [2]: arr.to_pandas()
Out[2]:
0    {'end': 2020-01-29 11:38:02.792681, 'start': 2...
dtype: object

In [3]: ts = pd.Timestamp.now()

In [4]: arr2 = pa.array([ts], type=pa.timestamp('us', tz='America/New_York'))

In [5]: arr2.to_pandas()
Out[5]:
0   2020-01-29 06:38:47.848944-05:00
dtype: datetime64[ns, America/New_York]

In [6]: arr = pa.StructArray.from_arrays([arr2, arr2], ['start', 'stop'])

In [7]: arr.to_pandas()
Out[7]:
0    {'start': 1580297927848944000, 'stop': 1580297...
dtype: object
{code}
from https://github.com/apache/arrow/pull/6312





[jira] [Commented] (ARROW-7709) [Python] Conversion from Table Column to Pandas loses name for Timestamps

2020-01-28 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025493#comment-17025493
 ] 

Bryan Cutler commented on ARROW-7709:
-

From [https://github.com/apache/arrow/pull/6294#issuecomment-579468239], Joris 
knows where this issue is and said he could fix it soon.

> [Python] Conversion from Table Column to Pandas loses name for Timestamps
> -
>
> Key: ARROW-7709
> URL: https://issues.apache.org/jira/browse/ARROW-7709
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Bryan Cutler
>Priority: Major
>
> When converting a Table timestamp column to Pandas, the name of the column is 
> lost in the resulting series.
> {code:java}
> In [23]: a1 = pa.array([pd.Timestamp.now()])
>
> In [24]: a2 = pa.array([1])
>
> In [25]: t = pa.Table.from_arrays([a1, a2], ['ts', 'a'])
>
> In [26]: for c in t:
>     ...:     print(c.to_pandas())
>     ...:
>
> 0   2020-01-28 13:17:26.738708
> dtype: datetime64[ns]
> 0    1
> Name: a, dtype: int64
> {code}





[jira] [Resolved] (ARROW-7693) [CI] Fix test-conda-python-3.7-spark-master nightly errors

2020-01-28 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-7693.
-
Fix Version/s: 0.16.0
   Resolution: Fixed

Issue resolved by pull request 6294
[https://github.com/apache/arrow/pull/6294]

> [CI] Fix test-conda-python-3.7-spark-master nightly errors
> --
>
> Key: ARROW-7693
> URL: https://issues.apache.org/jira/browse/ARROW-7693
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Spark master renamed some tests; we need to update them.





[jira] [Created] (ARROW-7709) [Python] Conversion from Table Column to Pandas loses name for Timestamps

2020-01-28 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-7709:
---

 Summary: [Python] Conversion from Table Column to Pandas loses 
name for Timestamps
 Key: ARROW-7709
 URL: https://issues.apache.org/jira/browse/ARROW-7709
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Bryan Cutler


When converting a Table timestamp column to Pandas, the name of the column is 
lost in the resulting series.
{code:java}
In [23]: a1 = pa.array([pd.Timestamp.now()])

In [24]: a2 = pa.array([1])

In [25]: t = pa.Table.from_arrays([a1, a2], ['ts', 'a'])

In [26]: for c in t:
    ...:     print(c.to_pandas())
    ...:

0   2020-01-28 13:17:26.738708
dtype: datetime64[ns]
0    1
Name: a, dtype: int64
{code}





[jira] [Created] (ARROW-7693) [CI] Fix test-conda-python-3.7-spark-master nightly errors

2020-01-27 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-7693:
---

 Summary: [CI] Fix test-conda-python-3.7-spark-master nightly errors
 Key: ARROW-7693
 URL: https://issues.apache.org/jira/browse/ARROW-7693
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration
Reporter: Bryan Cutler
Assignee: Bryan Cutler


Spark master renamed some tests; we need to update them.





[jira] [Commented] (ARROW-7596) [Python] Only apply zero-copy DataFrame block optimizations when split_blocks=True

2020-01-24 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023156#comment-17023156
 ] 

Bryan Cutler commented on ARROW-7596:
-

Linking some discussion on the mailing list regarding pyspark behavior with 
this option: [https://www.mail-archive.com/dev@arrow.apache.org/msg17008.html]
{noformat}
Joris Van den Bossche
Fri, 24 Jan 2020 02:11:05 -0800

Hi Bryan,

For the case that the column is not a timestamp and was not modified: I don't
think it will take copies of the full dataframe by assigning columns in a
loop like that. But it is still doing work (it will copy data for that
column into the array holding those data for 2D blocks), which can
easily be avoided by only assigning back when the column was
actually modified (e.g. by moving the is_datetime64tz_dtype check inline in the
loop iterating through all columns, so you only write back if you actually
have tz-aware data).

Further, even if you do the above to avoid writing back to the dataframe
when not needed, I am not sure you should directly try to use the new
zero-copy feature of the Table.to_pandas conversion (with
split_blocks=True). It depends very much on what happens next with the
converted dataframe. Once you do some operations in pandas, those split
blocks will get combined (resulting in a memory copy), and it also
means you can't modify the dataframe (if this dataframe is used in Python
UDFs, it might limit what can be done in those UDFs; just guessing here, I
don't know the pyspark code well enough).

Joris


On Thu, 23 Jan 2020 at 21:03, Bryan Cutler  wrote:

> Thanks for investigating this and the quick fix Joris and Wes!  I just have
> a couple questions about the behavior observed here.  The pyspark code
> assigns either the same series back to the pandas.DataFrame or makes some
> modifications if it is a timestamp. In the case there are no timestamps, is
> this potentially making extra copies or will it be unable to take advantage
> of new zero-copy features in pyarrow? For the case of having timestamp
> columns that need to be modified, is there a more efficient way to create a
> new dataframe with only copies of the modified series?  Thanks!
>
> Bryan {noformat}
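The "write back only when modified" pattern Joris describes can be sketched in plain Python (hypothetical stand-ins, not pyspark's actual code: a dict of column lists stands in for a pandas.DataFrame, and `needs_tz_fix`/`localize` stand in for the is_datetime64tz_dtype check and timestamp conversion):

```python
def needs_tz_fix(name):
    # Hypothetical predicate standing in for is_datetime64tz_dtype.
    return name.startswith("ts_")

def localize(values):
    # Hypothetical conversion standing in for the timestamp rewrite.
    return [v + "+00:00" for v in values]

def fix_frame(frame):
    # Check inline, per column, and write back only the columns that were
    # actually modified, leaving the untouched ones (and any zero-copy
    # blocks backing them) alone. Returns the number of write-backs.
    writes = 0
    for name, col in frame.items():
        if needs_tz_fix(name):
            frame[name] = localize(col)
            writes += 1
    return writes

frame = {"a": [1, 2], "ts_start": ["2020-01-24"], "b": [3]}
assert fix_frame(frame) == 1  # only the timestamp column was touched
assert frame["ts_start"] == ["2020-01-24+00:00"]
assert frame["a"] == [1, 2]
```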
 

> [Python] Only apply zero-copy DataFrame block optimizations when 
> split_blocks=True
> --
>
> Key: ARROW-7596
> URL: https://issues.apache.org/jira/browse/ARROW-7596
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Follow up to ARROW-3789 since there is downstream code that assumes that the 
> DataFrame produced always has all mutable blocks





[jira] [Resolved] (ARROW-7472) [Java] Fix some incorrect behavior in UnionListWriter

2020-01-16 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-7472.
-
Fix Version/s: 0.16.0
   Resolution: Fixed

Issue resolved by pull request 6101
[https://github.com/apache/arrow/pull/6101]

> [Java] Fix some incorrect behavior in UnionListWriter
> -
>
> Key: ARROW-7472
> URL: https://issues.apache.org/jira/browse/ARROW-7472
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Currently the {{UnionListWriter/UnionFixedSizeListWriter}} {{getField/close}} 
> APIs seem incorrect.





[jira] [Resolved] (ARROW-4856) [Integration] The spark integration test exceeds the maximum time limit on travis

2020-01-07 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-4856.
-
Resolution: Fixed

Closing, as the Spark integration testing has been passing

> [Integration] The spark integration test exceeds the maximum time limit on 
> travis
> -
>
> Key: ARROW-4856
> URL: https://issues.apache.org/jira/browse/ARROW-4856
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Priority: Major
>
> See build https://travis-ci.org/kszucs/crossbow/builds/505179756





[jira] [Resolved] (ARROW-7502) [Integration] Remove Spark Integration patch that is not needed anymore

2020-01-07 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-7502.
-
Fix Version/s: 0.16.0
   Resolution: Fixed

Issue resolved by pull request 6129
[https://github.com/apache/arrow/pull/6129]

> [Integration] Remove Spark Integration patch that is not needed anymore
> 
>
> Key: ARROW-7502
> URL: https://issues.apache.org/jira/browse/ARROW-7502
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.16.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Apache Spark master has been updated to work with Arrow 0.15.1 after the 
> binary protocol change, and patching Spark master is no longer necessary to 
> build with current Arrow, so the previous patch can be removed.





[jira] [Created] (ARROW-7502) [Integration] Remove Spark Integration patch that is not needed anymore

2020-01-06 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-7502:
---

 Summary: [Integration] Remove Spark Integration patch that is not 
needed anymore
 Key: ARROW-7502
 URL: https://issues.apache.org/jira/browse/ARROW-7502
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration
Reporter: Bryan Cutler
Assignee: Bryan Cutler


Apache Spark master has been updated to work with Arrow 0.15.1 after the binary 
protocol change, and patching Spark master is no longer necessary to build with 
current Arrow, so the previous patch can be removed.





[jira] [Commented] (ARROW-4856) [Integration] The spark integration test exceeds the maximum time limit on travis

2020-01-06 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009109#comment-17009109
 ] 

Bryan Cutler commented on ARROW-4856:
-

I believe this is resolved, is that correct [~kszucs] ?

> [Integration] The spark integration test exceeds the maximum time limit on 
> travis
> -
>
> Key: ARROW-4856
> URL: https://issues.apache.org/jira/browse/ARROW-4856
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Krisztian Szucs
>Priority: Major
>
> See build https://travis-ci.org/kszucs/crossbow/builds/505179756





[jira] [Resolved] (ARROW-2524) [Java] [TEST] Run spark integration tests regularly

2020-01-06 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-2524.
-
Resolution: Fixed

Closing this as Spark integration tests are being run nightly

> [Java] [TEST] Run spark integration tests regularly
> ---
>
> Key: ARROW-2524
> URL: https://issues.apache.org/jira/browse/ARROW-2524
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Krisztian Szucs
>Priority: Major
>
> For example nightly builds, along with dask and hdfs tests, see 
> https://github.com/apache/arrow/pull/1890





[jira] [Commented] (ARROW-7223) [Java] Provide default setting of io.netty.tryReflectionSetAccessible=true

2019-12-03 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987157#comment-16987157
 ] 

Bryan Cutler commented on ARROW-7223:
-

Thanks [~lidavidm] , it might be the case (which seems likely) that there is 
not much we can do about this. At the very least, it would be good to have a 
record of this info for consumers of Arrow Java that also might encounter the 
issue.

> [Java] Provide default setting of io.netty.tryReflectionSetAccessible=true
> --
>
> Key: ARROW-7223
> URL: https://issues.apache.org/jira/browse/ARROW-7223
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Bryan Cutler
>Priority: Major
>
> After ARROW-3191, consumers of Arrow Java with a JDK 9 and above are required 
> to set the JVM property "io.netty.tryReflectionSetAccessible=true" at 
> startup, each time Arrow code is run, as documented at 
> https://github.com/apache/arrow/tree/master/java#java-properties. Not doing 
> this will result in the error "java.lang.UnsupportedOperationException: 
> sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available", 
> making Arrow unusable out-of-the-box.
> This proposes to automatically set the property if not already set in the 
> following steps:
> 1) check to see if the property io.netty.tryReflectionSetAccessible has been 
> set
> 2) if not set, automatically set to "true"
> 3) else if set to "false", catch the Netty error and prepend the error 
> message with the suggested setting of "true"





[jira] [Created] (ARROW-7223) [Java] Provide default setting of io.netty.tryReflectionSetAccessible=true

2019-11-20 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-7223:
---

 Summary: [Java] Provide default setting of 
io.netty.tryReflectionSetAccessible=true
 Key: ARROW-7223
 URL: https://issues.apache.org/jira/browse/ARROW-7223
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Bryan Cutler


After ARROW-3191, consumers of Arrow Java with a JDK 9 and above are required 
to set the JVM property "io.netty.tryReflectionSetAccessible=true" at startup, 
each time Arrow code is run, as documented at 
https://github.com/apache/arrow/tree/master/java#java-properties. Not doing 
this will result in the error "java.lang.UnsupportedOperationException: 
sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available", making 
Arrow unusable out-of-the-box.

This proposes to automatically set the property if not already set in the 
following steps:

1) check to see if the property io.netty.tryReflectionSetAccessible has been set
2) if not set, automatically set to "true"
3) else if set to "false", catch the Netty error and prepend the error message 
with the suggested setting of "true"
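The three steps above can be sketched as decision logic. This is a language-agnostic illustration in Python, not the actual Arrow Java change (which would use System.getProperty/System.setProperty); a dict stands in for the JVM system properties, and the exact hint wording is an assumption:

```python
KEY = "io.netty.tryReflectionSetAccessible"

def apply_default(props):
    # Steps 1-2: if the user never set the property, default it to "true";
    # an explicit user choice is never overridden.
    if KEY not in props:
        props[KEY] = "true"
    return props[KEY]

def decorate_error(props, netty_message):
    # Step 3: if the user forced "false", prepend the suggested setting to
    # the Netty error message rather than silently changing their choice.
    if props.get(KEY) == "false":
        return f"Try setting -D{KEY}=true. {netty_message}"
    return netty_message

props = {}
assert apply_default(props) == "true"   # unset -> defaulted
props = {KEY: "false"}
assert apply_default(props) == "false"  # explicit choice preserved
assert decorate_error(props, "sun.misc.Unsafe not available").startswith("Try setting")
```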





[jira] [Commented] (ARROW-4890) [Python] Spark+Arrow Grouped pandas UDAF - read length must be positive or -1

2019-11-18 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976844#comment-16976844
 ] 

Bryan Cutler commented on ARROW-4890:
-

Sorry, I'm not aware of any documentation of the limits. It would be great to 
get that written down somewhere, and there should be a better error message for 
this, but maybe that should be done on the Spark side.

> [Python] Spark+Arrow Grouped pandas UDAF - read length must be positive or -1
> -
>
> Key: ARROW-4890
> URL: https://issues.apache.org/jira/browse/ARROW-4890
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Cloudera cdh5.13.3
> Cloudera Spark 2.3.0.cloudera3
>Reporter: Abdeali Kothari
>Priority: Major
> Attachments: Task retry fails.png, image-2019-07-04-12-03-57-002.png
>
>
> Creating this in Arrow project as the traceback seems to suggest this is an 
> issue in Arrow.
>  Continuation from the conversation on the 
> https://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3CCAK7Z5T_mChuqhFDAF2U68dO=p_1nst5ajjcrg0mexo5kby9...@mail.gmail.com%3E
> When I run a GROUPED_MAP UDF in Spark using PySpark, I run into the error:
> {noformat}
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/serializers.py",
>  line 279, in load_stream
> for batch in reader:
>   File "pyarrow/ipc.pxi", line 265, in __iter__
>   File "pyarrow/ipc.pxi", line 281, in 
> pyarrow.lib._RecordBatchReader.read_next_batch
>   File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: read length must be positive or -1
> {noformat}
> as the size of the dataset I want to group on increases. Here is a code 
> snippet that reproduces the issue.
>  Note: My actual dataset is much larger, has many more unique IDs, and is a 
> valid use case where I cannot simplify this groupby in any way. I have 
> stripped out all the logic to make this example as simple as I could.
> {code:java}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--executor-memory 9G pyspark-shell'
> import findspark
> findspark.init()
> import pyspark
> from pyspark.sql import functions as F, types as T
> import pandas as pd
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> pdf1 = pd.DataFrame(
>   [[1234567, 0.0, "abcdefghij", "2000-01-01T00:00:00.000Z"]],
>   columns=['df1_c1', 'df1_c2', 'df1_c3', 'df1_c4']
> )
> df1 = spark.createDataFrame(pd.concat([pdf1 for i in 
> range(429)]).reset_index()).drop('index')
> pdf2 = pd.DataFrame(
>   [[1234567, 0.0, "abcdefghijklmno", "2000-01-01", "abcdefghijklmno", 
> "abcdefghijklmno"]],
>   columns=['df2_c1', 'df2_c2', 'df2_c3', 'df2_c4', 'df2_c5', 'df2_c6']
> )
> df2 = spark.createDataFrame(pd.concat([pdf2 for i in 
> range(48993)]).reset_index()).drop('index')
> df3 = df1.join(df2, df1['df1_c1'] == df2['df2_c1'], how='inner')
> def myudf(df):
> return df
> df4 = df3
> udf = F.pandas_udf(df4.schema, F.PandasUDFType.GROUPED_MAP)(myudf)
> df5 = df4.groupBy('df1_c1').apply(udf)
> print('df5.count()', df5.count())
> # df5.write.parquet('/tmp/temp.parquet', mode='overwrite')
> {code}
> I have tried running this on Amazon EMR with Spark 2.3.1 and 20GB RAM per 
> executor too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6820) [C++] [Doc] [Format] Map specification and implementation inconsistent

2019-11-14 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-6820.
-
Resolution: Fixed

Issue resolved by pull request 5821
[https://github.com/apache/arrow/pull/5821]

> [C++] [Doc] [Format] Map specification and implementation inconsistent
> --
>
> Key: ARROW-6820
> URL: https://issues.apache.org/jira/browse/ARROW-6820
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation, Format
>Reporter: Antoine Pitrou
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> In https://arrow.apache.org/docs/format/Layout.html#map-type, the map type is 
> specified as having a child field "pairs", itself with children "keys" and 
> "items".
> In https://github.com/apache/arrow/blob/master/format/Schema.fbs#L60, the map 
> type is specified as having a child field "entry", itself with children "key" 
> and "value".
> In the C++ implementation, a map type has a child field "entries", itself 
> with children "key" and "value".
> In the Java implementation, a map vector also has a child field "entries", 
> itself with children "key" and "value" (by default).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6820) [C++] [Doc] [Format] Map specification and implementation inconsistent

2019-11-14 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned ARROW-6820:
---

Assignee: Bryan Cutler

> [C++] [Doc] [Format] Map specification and implementation inconsistent
> --
>
> Key: ARROW-6820
> URL: https://issues.apache.org/jira/browse/ARROW-6820
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation, Format
>Reporter: Antoine Pitrou
>Assignee: Bryan Cutler
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> In https://arrow.apache.org/docs/format/Layout.html#map-type, the map type is 
> specified as having a child field "pairs", itself with children "keys" and 
> "items".
> In https://github.com/apache/arrow/blob/master/format/Schema.fbs#L60, the map 
> type is specified as having a child field "entry", itself with children "key" 
> and "value".
> In the C++ implementation, a map type has a child field "entries", itself 
> with children "key" and "value".
> In the Java implementation, a map vector also has a child field "entries", 
> itself with children "key" and "value" (by default).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7173) Add test to verify Map field names can be arbitrary

2019-11-14 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-7173:
---

 Summary: Add test to verify Map field names can be arbitrary
 Key: ARROW-7173
 URL: https://issues.apache.org/jira/browse/ARROW-7173
 Project: Apache Arrow
  Issue Type: Test
  Components: Integration
Reporter: Bryan Cutler


A Map has child fields, and the format spec only recommends that they be named 
"entries", "key", and "value"; they could be named anything. Currently, 
integration tests for Map arrays verify the exchanged schema is equal, so the 
child fields are always named the same. There should be tests that use 
different names to verify implementations can accept this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6930) [Java] Create utility class for populating vector values used for test purpose only

2019-11-14 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-6930.
-
Fix Version/s: 1.0.0
   Resolution: Fixed

Issue resolved by pull request 5693
[https://github.com/apache/arrow/pull/5693]

> [Java] Create utility class for populating vector values used for test 
> purpose only
> ---
>
> Key: ARROW-6930
> URL: https://issues.apache.org/jira/browse/ARROW-6930
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Micah Kornfield
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 11h 20m
>  Remaining Estimate: 0h
>
> There is a lot of verbosity in the construction of Arrays for testing 
> purposes (multiple lines of setSafe(...) or set(...)).
> We should start adding a utility class to make test setup clearer and more 
> concise. Note this class should be located in the arrow-vector test package 
> and could be used in other modules' testing by adding the dependency:
> {code:xml}
> <dependency>
>   <groupId>org.apache.arrow</groupId>
>   <artifactId>arrow-vector</artifactId>
>   <version>${project.version}</version>
>   <classifier>tests</classifier>
>   <type>test-jar</type>
>   <scope>test</scope>
> </dependency>
> {code}
> Usage would be something like:
> {code:java}
> try (IntVector vector = new IntVector("vector", allocator)) {
>   ValueVectorPopulator.setVector(vector, 1, 2, null, 4, 5);
>   output = doSomethingWith(input);
>   assertThat(output).isEqualTo(expected);
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6820) [C++] [Doc] [Format] Map specification and implementation inconsistent

2019-11-12 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972738#comment-16972738
 ] 

Bryan Cutler commented on ARROW-6820:
-

I don't think that either C++ or Java require the Map Fields to have certain 
names, but the integration test framework does check that the names of all 
child fields are equal. So to resolve this how about I change the name from 
"entries" to "entry" in C++ and Java to be consistent with the format spec?

> [C++] [Doc] [Format] Map specification and implementation inconsistent
> --
>
> Key: ARROW-6820
> URL: https://issues.apache.org/jira/browse/ARROW-6820
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation, Format
>Reporter: Antoine Pitrou
>Priority: Blocker
> Fix For: 1.0.0
>
>
> In https://arrow.apache.org/docs/format/Layout.html#map-type, the map type is 
> specified as having a child field "pairs", itself with children "keys" and 
> "items".
> In https://github.com/apache/arrow/blob/master/format/Schema.fbs#L60, the map 
> type is specified as having a child field "entry", itself with children "key" 
> and "value".
> In the C++ implementation, a map type has a child field "entries", itself 
> with children "key" and "value".
> In the Java implementation, a map vector also has a child field "entries", 
> itself with children "key" and "value" (by default).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6820) [C++] [Doc] [Format] Map specification and implementation inconsistent

2019-11-04 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967123#comment-16967123
 ] 

Bryan Cutler commented on ARROW-6820:
-

I don't have a strong preference for specific naming, but we should try to be 
consistent. In C++ it is confusing because many APIs use "key" and "item": when 
MapArray is viewed as a list of structs, the term "value" would mean an element 
in the struct array, and "value" is already used in the List APIs, so there 
could be conflicts. I think we should stick with the terminology from 
Schema.fbs, where the map type is specified as having a child field "entry", 
itself with children "key" and "value". In C++ we could work around the API by 
overriding and renaming, e.g. {code}std::shared_ptr<Array> 
MapArray::list_values() { return ListArray::values(); }{code}

> [C++] [Doc] [Format] Map specification and implementation inconsistent
> --
>
> Key: ARROW-6820
> URL: https://issues.apache.org/jira/browse/ARROW-6820
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation, Format
>Reporter: Antoine Pitrou
>Priority: Blocker
> Fix For: 1.0.0
>
>
> In https://arrow.apache.org/docs/format/Layout.html#map-type, the map type is 
> specified as having a child field "pairs", itself with children "keys" and 
> "items".
> In https://github.com/apache/arrow/blob/master/format/Schema.fbs#L60, the map 
> type is specified as having a child field "entry", itself with children "key" 
> and "value".
> In the C++ implementation, a map type has a child field "entries", itself 
> with children "key" and "value".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-3850) [Python] Support MapType and StructType for enhanced PySpark integration

2019-10-16 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952988#comment-16952988
 ] 

Bryan Cutler commented on ARROW-3850:
-

I made ARROW-6904 to add MapArray to Arrow Python, once that is done it can be 
implemented in PySpark and we can close this once it passes the Spark 
integration tests. Nested structs require some other issues to be worked out, 
and there are other JIRAs for that.

> [Python] Support MapType and StructType for enhanced PySpark integration
> 
>
> Key: ARROW-3850
> URL: https://issues.apache.org/jira/browse/ARROW-3850
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: Florian Wilhelm
>Priority: Major
> Fix For: 1.0.0
>
>
> It would be great to support MapType and (nested) StructType in Arrow so that 
> PySpark can make use of it.
>  
>  Quite often, as in my use case, complex types are also saved in Hive table 
> cells. Currently it's not possible to use the new 
> {{[pandas_udf|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=explode#pyspark.sql.functions.pandas_udf]}}
>  decorator which internally uses Arrow to generate a UDF for columns with 
> complex types.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6904) [Python] Implement MapArray and MapType

2019-10-16 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952955#comment-16952955
 ] 

Bryan Cutler commented on ARROW-6904:
-

I can work on this

> [Python] Implement MapArray and MapType
> ---
>
> Key: ARROW-6904
> URL: https://issues.apache.org/jira/browse/ARROW-6904
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 1.0.0
>
>
> Map arrays have already been added to C++; they need to be exposed in the 
> Python API as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6904) [Python] Implement MapArray and MapType

2019-10-16 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-6904:
---

 Summary: [Python] Implement MapArray and MapType
 Key: ARROW-6904
 URL: https://issues.apache.org/jira/browse/ARROW-6904
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Bryan Cutler
Assignee: Bryan Cutler
 Fix For: 1.0.0


Map arrays have already been added to C++; they need to be exposed in the Python API as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-6429) [CI][Crossbow] Nightly spark integration job fails

2019-10-07 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-6429.
-
Resolution: Fixed

> [CI][Crossbow] Nightly spark integration job fails
> --
>
> Key: ARROW-6429
> URL: https://issues.apache.org/jira/browse/ARROW-6429
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: nightly, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> See https://circleci.com/gh/ursa-labs/crossbow/2310. Either fix, skip job and 
> create followup Jira to unskip, or delete job.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6429) [CI][Crossbow] Nightly spark integration job fails

2019-10-07 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946068#comment-16946068
 ] 

Bryan Cutler commented on ARROW-6429:
-

Tests are passing since ARROW-6686 was merged, so I'll resolve this now.

> [CI][Crossbow] Nightly spark integration job fails
> --
>
> Key: ARROW-6429
> URL: https://issues.apache.org/jira/browse/ARROW-6429
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: nightly, pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> See https://circleci.com/gh/ursa-labs/crossbow/2310. Either fix, skip job and 
> create followup Jira to unskip, or delete job.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6790) [Release] Automatically disable integration test cases in release verification

2019-10-03 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-6790:
---

 Summary: [Release] Automatically disable integration test cases in 
release verification
 Key: ARROW-6790
 URL: https://issues.apache.org/jira/browse/ARROW-6790
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Bryan Cutler
Assignee: Bryan Cutler


If dev/release/verify-release-candidate.sh is run with selective testing and 
includes integration tests, the selected implementations should be the only 
ones enabled when running the integration test portion. For example:

{code}
TEST_DEFAULT=0 \
TEST_CPP=1 \
TEST_JAVA=1 \
TEST_INTEGRATION=1 \
dev/release/verify-release-candidate.sh source 0.15.0 2
{code}

This should run integration tests only for C++ and Java.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-3850) [Python] Support MapType and StructType for enhanced PySpark integration

2019-09-24 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937159#comment-16937159
 ] 

Bryan Cutler commented on ARROW-3850:
-

Now that SPARK-23836 is merged, a scalar Pandas UDF can return a StructType 
that will accept a pandas.DataFrame. By nested structs, I mean a column of 
StructType that has a child that is itself a StructType. Spark does not 
currently support this as an input column or return type for Pandas UDFs.

> [Python] Support MapType and StructType for enhanced PySpark integration
> 
>
> Key: ARROW-3850
> URL: https://issues.apache.org/jira/browse/ARROW-3850
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: Florian Wilhelm
>Priority: Major
> Fix For: 1.0.0
>
>
> It would be great to support MapType and (nested) StructType in Arrow so that 
> PySpark can make use of it.
>  
>  Quite often, as in my use case, complex types are also saved in Hive table 
> cells. Currently it's not possible to use the new 
> {{[pandas_udf|https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=explode#pyspark.sql.functions.pandas_udf]}}
>  decorator which internally uses Arrow to generate a UDF for columns with 
> complex types.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6429) [CI][Crossbow] Nightly spark integration job fails

2019-09-21 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16935210#comment-16935210
 ] 

Bryan Cutler commented on ARROW-6429:
-

I believe I need to add a patch so Spark can compile with Arrow Java. I'm 
working on this now.

> [CI][Crossbow] Nightly spark integration job fails
> --
>
> Key: ARROW-6429
> URL: https://issues.apache.org/jira/browse/ARROW-6429
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: nightly, pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> See https://circleci.com/gh/ursa-labs/crossbow/2310. Either fix, skip job and 
> create followup Jira to unskip, or delete job.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6652) [Python] to_pandas conversion removes timezone from type

2019-09-21 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16935089#comment-16935089
 ] 

Bryan Cutler commented on ARROW-6652:
-

[~wesm] or [~apitrou]  would you be able to take a look at this?

> [Python] to_pandas conversion removes timezone from type
> 
>
> Key: ARROW-6652
> URL: https://issues.apache.org/jira/browse/ARROW-6652
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Bryan Cutler
>Priority: Critical
> Fix For: 0.15.0
>
>
> Calling {{to_pandas}} on a {{pyarrow.Array}} with a timezone-aware timestamp 
> type removes the timezone in the resulting {{pandas.Series}}.
> {code}
> >>> import pyarrow as pa
> >>> a = pa.array([1], type=pa.timestamp('us', tz='America/Los_Angeles'))
> >>> a.to_pandas()
> 0   1970-01-01 00:00:00.01
> dtype: datetime64[ns]
> {code}
> Previous behavior from 0.14.1 of converting a {{pyarrow.Column}} 
> {{to_pandas}} retained the timezone.
> {code}
> In [4]: import pyarrow as pa 
>...: a = pa.array([1], type=pa.timestamp('us', tz='America/Los_Angeles'))  
>...: c = pa.Column.from_array('ts', a) 
> In [5]: c.to_pandas() 
>
> Out[5]: 
> 0   1969-12-31 16:00:00.01-08:00
> Name: ts, dtype: datetime64[ns, America/Los_Angeles]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6429) [CI][Crossbow] Nightly spark integration job fails

2019-09-21 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933714#comment-16933714
 ] 

Bryan Cutler edited comment on ARROW-6429 at 9/21/19 4:59 PM:
--

[~wesm] the issue with the timestamp test failures looks to be caused by 
calling {{to_pandas}} on a pyarrow ChunkedArray with a tz-aware timestamp type, 
which removes the tz from the resulting dtype. The previous behavior was that a 
pyarrow Column kept the tz, while the pyarrow Array removed it when converting 
to a numpy array.

With Arrow 0.14.1
{code:java}
In [4]: import pyarrow as pa 
   ...: a = pa.array([1], type=pa.timestamp('us', tz='America/Los_Angeles'))  
   ...: c = pa.Column.from_array('ts', a) 

In [5]: c.to_pandas()   
 
Out[5]: 
0   1969-12-31 16:00:00.01-08:00
Name: ts, dtype: datetime64[ns, America/Los_Angeles]

In [6]: a.to_pandas()   
 
Out[6]: array(['1970-01-01T00:00:00.01'], dtype='datetime64[us]')
{code}
With current master
{code:java}
>>> import pyarrow as pa
>>> a = pa.array([1], type=pa.timestamp('us', tz='America/Los_Angeles'))
>>> a.to_pandas()
0   1970-01-01 00:00:00.01
dtype: datetime64[ns]
{code}
After manually adding the timezone back in the series dtype (and fixing the 
Java compilation), all tests pass and the spark integration run finished. I 
wasn't able to look into why the timezone is being removed though. Should I 
open up a jira for this?

edit: I made ARROW-6652 since it is not just a Spark issue


was (Author: bryanc):
[~wesm] the issue with the timestamp test failures looks to be caused by 
calling {{to_pandas}} on a pyarrow ChunkedArray with a tz-aware timestamp type, 
which removes the tz from the resulting dtype. The previous behavior was that a 
pyarrow Column kept the tz, while the pyarrow Array removed it when converting 
to a numpy array.

With Arrow 0.14.1
{code}
In [4]: import pyarrow as pa 
   ...: a = pa.array([1], type=pa.timestamp('us', tz='America/Los_Angeles'))  
   ...: c = pa.Column.from_array('ts', a) 

In [5]: c.to_pandas()   
 
Out[5]: 
0   1969-12-31 16:00:00.01-08:00
Name: ts, dtype: datetime64[ns, America/Los_Angeles]

In [6]: a.to_pandas()   
 
Out[6]: array(['1970-01-01T00:00:00.01'], dtype='datetime64[us]')
{code}

With current master
{code}
>>> import pyarrow as pa
>>> a = pa.array([1], type=pa.timestamp('us', tz='America/Los_Angeles'))
>>> a.to_pandas()
0   1970-01-01 00:00:00.01
dtype: datetime64[ns]
{code}

After manually adding the timezone back in the series dtype (and fixing the 
Java compilation), all tests pass and the spark integration run finished. I 
wasn't able to look into why the timezone is being removed though. Should I 
open up a jira for this?


> [CI][Crossbow] Nightly spark integration job fails
> --
>
> Key: ARROW-6429
> URL: https://issues.apache.org/jira/browse/ARROW-6429
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: nightly, pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> See https://circleci.com/gh/ursa-labs/crossbow/2310. Either fix, skip job and 
> create followup Jira to unskip, or delete job.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6652) [Python] to_pandas conversion removes timezone from type

2019-09-21 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-6652:
---

 Summary: [Python] to_pandas conversion removes timezone from type
 Key: ARROW-6652
 URL: https://issues.apache.org/jira/browse/ARROW-6652
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Bryan Cutler
 Fix For: 0.15.0


Calling {{to_pandas}} on a {{pyarrow.Array}} with a timezone-aware timestamp 
type removes the timezone in the resulting {{pandas.Series}}.

{code}
>>> import pyarrow as pa
>>> a = pa.array([1], type=pa.timestamp('us', tz='America/Los_Angeles'))
>>> a.to_pandas()
0   1970-01-01 00:00:00.01
dtype: datetime64[ns]
{code}

Previous behavior from 0.14.1 of converting a {{pyarrow.Column}} {{to_pandas}} 
retained the timezone.
{code}
In [4]: import pyarrow as pa 
   ...: a = pa.array([1], type=pa.timestamp('us', tz='America/Los_Angeles'))  
   ...: c = pa.Column.from_array('ts', a) 

In [5]: c.to_pandas()   
 
Out[5]: 
0   1969-12-31 16:00:00.01-08:00
Name: ts, dtype: datetime64[ns, America/Los_Angeles]
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6429) [CI][Crossbow] Nightly spark integration job fails

2019-09-19 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933714#comment-16933714
 ] 

Bryan Cutler commented on ARROW-6429:
-

[~wesm] the issue with the timestamp test failures looks to be caused by 
calling {{to_pandas}} on a pyarrow ChunkedArray with a tz-aware timestamp type, 
which removes the tz from the resulting dtype. The previous behavior was that a 
pyarrow Column kept the tz, while the pyarrow Array removed it when converting 
to a numpy array.

With Arrow 0.14.1
{code}
In [4]: import pyarrow as pa 
   ...: a = pa.array([1], type=pa.timestamp('us', tz='America/Los_Angeles'))  
   ...: c = pa.Column.from_array('ts', a) 

In [5]: c.to_pandas()   
 
Out[5]: 
0   1969-12-31 16:00:00.01-08:00
Name: ts, dtype: datetime64[ns, America/Los_Angeles]

In [6]: a.to_pandas()   
 
Out[6]: array(['1970-01-01T00:00:00.01'], dtype='datetime64[us]')
{code}

With current master
{code}
>>> import pyarrow as pa
>>> a = pa.array([1], type=pa.timestamp('us', tz='America/Los_Angeles'))
>>> a.to_pandas()
0   1970-01-01 00:00:00.01
dtype: datetime64[ns]
{code}

After manually adding the timezone back in the series dtype (and fixing the 
Java compilation), all tests pass and the spark integration run finished. I 
wasn't able to look into why the timezone is being removed though. Should I 
open up a jira for this?
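The manual workaround described above can be sketched with pandas alone: reconstruct the naive UTC values that {{to_pandas}} currently returns, then localize and convert (a stopgap illustration of re-attaching the timezone, not the eventual pyarrow fix):

```python
import pandas as pd

# Naive wall-clock values, mimicking what to_pandas() returns on master
s = pd.Series(pd.to_datetime([1], unit="us"))

# Re-attach the timezone: the stored values are UTC, so localize then convert
fixed = s.dt.tz_localize("UTC").dt.tz_convert("America/Los_Angeles")
# fixed.dtype is now datetime64[ns, America/Los_Angeles]
```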


> [CI][Crossbow] Nightly spark integration job fails
> --
>
> Key: ARROW-6429
> URL: https://issues.apache.org/jira/browse/ARROW-6429
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: nightly, pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> See https://circleci.com/gh/ursa-labs/crossbow/2310. Either fix, skip job and 
> create followup Jira to unskip, or delete job.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6429) [CI][Crossbow] Nightly spark integration job fails

2019-09-15 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929909#comment-16929909
 ] 

Bryan Cutler commented on ARROW-6429:
-

After ARROW-6557 there seems to be another issue with timestamps: 
[https://github.com/apache/arrow/pull/5373#issuecomment-531264154]. I'll look 
into this soon.

> [CI][Crossbow] Nightly spark integration job fails
> --
>
> Key: ARROW-6429
> URL: https://issues.apache.org/jira/browse/ARROW-6429
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: nightly, pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> See https://circleci.com/gh/ursa-labs/crossbow/2310. Either fix, skip job and 
> create followup Jira to unskip, or delete job.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (ARROW-6534) [Java] Fix typos and spelling

2019-09-11 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-6534.
-
Resolution: Fixed

Issue resolved by pull request 5359
[https://github.com/apache/arrow/pull/5359]

> [Java] Fix typos and spelling
> -
>
> Key: ARROW-6534
> URL: https://issues.apache.org/jira/browse/ARROW-6534
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Fix typos and spelling, mostly in docs and tests.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6534) [Java] Fix typos and spelling

2019-09-11 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-6534:
---

 Summary: [Java] Fix typos and spelling
 Key: ARROW-6534
 URL: https://issues.apache.org/jira/browse/ARROW-6534
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Bryan Cutler
Assignee: Bryan Cutler
 Fix For: 0.15.0


Fix typos and spelling, mostly in docs and tests.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6429) [CI][Crossbow] Nightly spark integration job fails

2019-09-11 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927871#comment-16927871
 ] 

Bryan Cutler commented on ARROW-6429:
-

The failure seems to be caused from the removal of pyarrow.Column in favor of 
pyarrow.ChunkedArray. Spark iterates over columns of a pyarrow.Table, calls 
{{to_pandas()}} on each column, and assumes the result is a pd.Series. If the 
column is actually a pyarrow.ChunkedArray, then {{to_pandas()}} may return a 
numpy.array instead. [~wesmckinn] [~pitrou] I know the pydoc says the returned 
value can be either a pandas.Series or a numpy.array, but is there any way to 
ensure it is the former, or is that the job of the caller?
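One caller-side guard (a hypothetical helper for illustration, not an existing Spark or pyarrow API) is to normalize whatever {{to_pandas()}} returns into a pd.Series:

```python
import numpy as np
import pandas as pd

def column_to_series(result, name):
    # to_pandas() may hand back a pandas.Series or a numpy.ndarray;
    # normalize to a named Series either way.
    if isinstance(result, pd.Series):
        return result.rename(name)
    return pd.Series(result, name=name)
```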

> [CI][Crossbow] Nightly spark integration job fails
> --
>
> Key: ARROW-6429
> URL: https://issues.apache.org/jira/browse/ARROW-6429
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Assignee: Bryan Cutler
>Priority: Blocker
>  Labels: nightly
> Fix For: 0.15.0
>
>
> See https://circleci.com/gh/ursa-labs/crossbow/2310. Either fix, skip job and 
> create followup Jira to unskip, or delete job.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6519) [Java] Use IPC continuation token to mark EOS

2019-09-10 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-6519:
---

 Summary: [Java] Use IPC continuation token to mark EOS
 Key: ARROW-6519
 URL: https://issues.apache.org/jira/browse/ARROW-6519
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Bryan Cutler
Assignee: Bryan Cutler
 Fix For: 0.15.0


For an Arrow stream in non-legacy mode, the EOS identifier should be 
\{0xFFFFFFFF, 0x00000000}. This way, all bytes sent by the writer can be read.
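The continuation-token framing can be illustrated with plain byte packing. A 
sketch of what the stream-level EOS marker looks like on the wire, assuming 
little-endian framing with a 0xFFFFFFFF continuation indicator followed by a 
zero metadata length:

```python
# Sketch (assumption): EOS = continuation indicator + zero metadata length.
import struct

eos = struct.pack("<I", 0xFFFFFFFF) + struct.pack("<i", 0)
assert eos == b"\xff\xff\xff\xff\x00\x00\x00\x00"
assert len(eos) == 8
```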





[jira] [Commented] (ARROW-6429) [CI][Crossbow] Nightly spark integration job fails

2019-09-09 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926208#comment-16926208
 ] 

Bryan Cutler commented on ARROW-6429:
-

I will take a look, but it might be a few days until I can get to it.

> [CI][Crossbow] Nightly spark integration job fails
> --
>
> Key: ARROW-6429
> URL: https://issues.apache.org/jira/browse/ARROW-6429
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Neal Richardson
>Assignee: Bryan Cutler
>Priority: Blocker
>  Labels: nightly
> Fix For: 0.15.0
>
>
> See https://circleci.com/gh/ursa-labs/crossbow/2310. Either fix, skip job and 
> create followup Jira to unskip, or delete job.





[jira] [Commented] (ARROW-6474) [Python] Provide mechanism for python to write out old format

2019-09-09 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926207#comment-16926207
 ] 

Bryan Cutler commented on ARROW-6474:
-

I think this is more for the case where a user is stuck with an already released 
Spark version, e.g. <= 2.4.4, and ends up installing pyarrow >= 0.15.0.  The 
pyarrow writers will use the new format by default, which the Arrow Java 
version in Spark will be unable to handle since it is using 0.14.1. There is no 
way for the user to set the option in the pyarrow writer either, so they would 
have to downgrade pyarrow. I think it's fair to say they need to stick with 
pyarrow 0.14.1, but an env variable would give them a way to use the latest 
release.
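A sketch of how such an environment-variable switch could be consulted by a 
writer. The variable name ARROW_PRE_0_15_IPC_FORMAT mirrors what Spark later 
documented for this scenario, but treat both the name and the exact semantics 
here as assumptions:

```python
# Hypothetical sketch: a writer checks an env variable to pick the legacy
# IPC format. The variable name and semantics are assumptions.
import os

def use_legacy_ipc_format() -> bool:
    # Any non-empty value other than "0" enables the pre-0.15 format.
    return os.environ.get("ARROW_PRE_0_15_IPC_FORMAT", "0") not in ("", "0")

os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
assert use_legacy_ipc_format()
```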

> [Python] Provide mechanism for python to write out old format
> -
>
> Key: ARROW-6474
> URL: https://issues.apache.org/jira/browse/ARROW-6474
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Micah Kornfield
>Priority: Blocker
> Fix For: 0.15.0
>
>
> I think this needs to be an environment variable, so it can be made to work 
> with old version of the Java library pyspark integration.
>  
>  [~bryanc] can you check if this captures the requirements?





[jira] [Assigned] (ARROW-6461) [Java] EchoServer can close socket before client has finished reading

2019-09-05 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned ARROW-6461:
---

Assignee: Bryan Cutler

> [Java] EchoServer can close socket before client has finished reading
> -
>
> Key: ARROW-6461
> URL: https://issues.apache.org/jira/browse/ARROW-6461
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When the EchoServer finishes running the client connection, the socket is 
> closed immediately. This causes a race condition and the client will fail 
> with a
> {noformat}
>  SocketException: connection reset {noformat}
> if it has not read all of the echoed batches.
> This was consistently happening with the fix for ARROW-6315





[jira] [Resolved] (ARROW-6461) [Java] EchoServer can close socket before client has finished reading

2019-09-05 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-6461.
-
Resolution: Fixed

Issue resolved by pull request 5288
[https://github.com/apache/arrow/pull/5288]

> [Java] EchoServer can close socket before client has finished reading
> -
>
> Key: ARROW-6461
> URL: https://issues.apache.org/jira/browse/ARROW-6461
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> When the EchoServer finishes running the client connection, the socket is 
> closed immediately. This causes a race condition and the client will fail 
> with a
> {noformat}
>  SocketException: connection reset {noformat}
> if it has not read all of the echoed batches.
> This was consistently happening with the fix for ARROW-6315





[jira] [Created] (ARROW-6461) [Java] EchoServer can close socket before client has finished reading

2019-09-04 Thread Bryan Cutler (Jira)
Bryan Cutler created ARROW-6461:
---

 Summary: [Java] EchoServer can close socket before client has 
finished reading
 Key: ARROW-6461
 URL: https://issues.apache.org/jira/browse/ARROW-6461
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Bryan Cutler
 Fix For: 0.15.0


When the EchoServer finishes running the client connection, the socket is 
closed immediately. This causes a race condition and the client will fail with a
{noformat}
 SocketException: connection reset {noformat}
if it has not read all of the echoed batches.

This was consistently happening with the fix for ARROW-6315
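The race can be illustrated outside of Arrow with plain sockets: the fix 
pattern is for the server to signal EOF and wait for the client to finish 
reading before tearing the connection down. A minimal Python sketch, not the 
EchoServer code itself:

```python
# Sketch: an echo server that shuts down its write side and waits for the
# client's close, instead of closing the socket abruptly after writing.
import socket
import threading

def echo_server(listener: socket.socket) -> None:
    conn, _ = listener.accept()
    data = conn.recv(1024)
    conn.sendall(data)
    conn.shutdown(socket.SHUT_WR)   # signal EOF instead of closing abruptly
    conn.recv(1024)                 # block until the client closes its side
    conn.close()

listener = socket.create_server(("127.0.0.1", 0))
port = listener.getsockname()[1]
threading.Thread(target=echo_server, args=(listener,), daemon=True).start()

client = socket.create_connection(("127.0.0.1", port))
client.sendall(b"batch")
echoed = client.recv(1024)          # client reads everything before close
client.close()
assert echoed == b"batch"
```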





[jira] [Resolved] (ARROW-6202) [Java] Exception in thread "main" org.apache.arrow.memory.OutOfMemoryException: Unable to allocate buffer of size 4 due to memory limit. Current allocation: 2147483646

2019-08-23 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-6202.
-
Fix Version/s: 0.15.0
   Resolution: Fixed

Issue resolved by pull request 5134
[https://github.com/apache/arrow/pull/5134]

> [Java] Exception in thread "main" 
> org.apache.arrow.memory.OutOfMemoryException: Unable to allocate buffer of 
> size 4 due to memory limit. Current allocation: 2147483646
> ---
>
> Key: ARROW-6202
> URL: https://issues.apache.org/jira/browse/ARROW-6202
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 0.14.1
>Reporter: Jim Northrup
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: jdbc, pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> jdbc query results exceed the native heap when using generous -Xmx settings. 
> For roughly 800 megabytes of csv/flatfile resultset, arrow is unable to hold 
> the contents in RAM long enough to persist them to disk, without explicit 
> knowledge beyond the unit test sample code.
> source:
> https://github.com/jnorthrup/jdbc2json/blob/master/src/main/java/com/fnreport/QueryToFeather.kt#L83
> {code:java}
> Exception in thread "main" org.apache.arrow.memory.OutOfMemoryException: 
> Unable to allocate buffer of size 4 due to memory limit. Current allocation: 
> 2147483646
> at 
> org.apache.arrow.memory.BaseAllocator.buffer(BaseAllocator.java:307)
> at 
> org.apache.arrow.memory.BaseAllocator.buffer(BaseAllocator.java:277)
> at 
> org.apache.arrow.adapter.jdbc.JdbcToArrowUtils.updateVector(JdbcToArrowUtils.java:610)
> at 
> org.apache.arrow.adapter.jdbc.JdbcToArrowUtils.jdbcToFieldVector(JdbcToArrowUtils.java:462)
> at 
> org.apache.arrow.adapter.jdbc.JdbcToArrowUtils.jdbcToArrowVectors(JdbcToArrowUtils.java:396)
> at 
> org.apache.arrow.adapter.jdbc.JdbcToArrow.sqlToArrow(JdbcToArrow.java:225)
> at 
> org.apache.arrow.adapter.jdbc.JdbcToArrow.sqlToArrow(JdbcToArrow.java:187)
> at 
> org.apache.arrow.adapter.jdbc.JdbcToArrow.sqlToArrow(JdbcToArrow.java:156)
> at com.fnreport.QueryToFeather$Companion.go(QueryToFeather.kt:83)
> at 
> com.fnreport.QueryToFeather$Companion$main$1.invokeSuspend(QueryToFeather.kt:95)
> at 
> kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
> at kotlinx.coroutines.DispatchedTask.run(Dispatched.kt:241)
> at 
> kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:270)
> at kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:79)
> at 
> kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:54)
> at kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source)
> at 
> kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:36)
> at kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source)
> at com.fnreport.QueryToFeather$Companion.main(QueryToFeather.kt:93)
> at com.fnreport.QueryToFeather.main(QueryToFeather.kt)
> {code}





[jira] [Resolved] (ARROW-6011) [Python] Data incomplete when using pyarrow in pyspark in python 3.x

2019-08-21 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-6011.
-
Resolution: Cannot Reproduce

I could not reproduce this. We can continue the discussion in SPARK-28482 and 
reopen if we find an issue in Arrow.

> [Python] Data incomplete when using pyarrow in pyspark in python 3.x
> 
>
> Key: ARROW-6011
> URL: https://issues.apache.org/jira/browse/ARROW-6011
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java, Python
>Affects Versions: 0.10.0, 0.14.0
> Environment: centos 7.4  pyarrow 0.10.0  0.14.0   python 2.7  3.5 
> 3.6
>Reporter: jiangyu
>Priority: Major
> Attachments: image-2019-07-23-16-06-49-889.png, py3.6.png, test.csv, 
> test.py, worker.png
>
>
> Hi,
>   
>  Since Spark 2.3.x, pandas udf has been introduced as the default ser/des 
> method. However, an issue arises with Python >= 3.5.x.
>  We use pandas udf to process batches of data, but we find the data is 
> incomplete in Python 3.x. At first I thought the processing logic might be 
> wrong, so I changed the code to a very simple one and it had the same 
> problem. After investigating for a week, I found it is related to pyarrow.
>   
>  *Reproduce procedure:*
> 1. prepare data
>  The data has seven columns, a, b, c, d, e, f and g; the data type is Integer
>  a,b,c,d,e,f,g
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>   produce 100,000 rows, name the file test.csv, and upload it to hdfs; then 
> load it and repartition it to 1 partition.
>   
> {code:java}
> df=spark.read.format('csv').option("header","true").load('/test.csv')
> df=df.select(*(col(c).cast("int").alias(c) for c in df.columns))
> df=df.repartition(1)
> spark_context = SparkContext.getOrCreate() {code}
>  
>  2.register pandas udf
>   
> {code:java}
> def add_func(a,b,c,d,e,f,g):
> print('iterator one time')
> return a
> add = pandas_udf(add_func, returnType=IntegerType())
> df_result=df.select(add(col("a"),col("b"),col("c"),col("d"),col("e"),col("f"),col("g"))){code}
>  
>  3.apply pandas udf
>   
> {code:java}
> def trigger_func(iterator):
>       yield iterator
> df_result.rdd.foreachPartition(trigger_func){code}
>  
>  4.execute it in pyspark (local or yarn)
>  run it with conf spark.sql.execution.arrow.maxRecordsPerBatch=10. As 
> mentioned before, the total row number is 100,000, so it should print "iterator 
> one time" 10 times.
>  (1)Python 2.7 envs:
>   
> {code:java}
> PYSPARK_PYTHON=/usr/lib/conda/envs/py2.7/bin/python pyspark --conf 
> spark.sql.execution.arrow.maxRecordsPerBatch=10 --conf 
> spark.executor.pyspark.memory=2g --conf 
> spark.sql.execution.arrow.enabled=true --executor-cores 1{code}
>  
>  !image-2019-07-23-16-06-49-889.png!  
>  The result is right, 10 times of print.
>  
>  
> (2)Python 3.5 or 3.6 envs:
> {code:java}
> PYSPARK_PYTHON=/usr/lib/conda/envs/python3.6/bin/python pyspark --conf 
> spark.sql.execution.arrow.maxRecordsPerBatch=10 --conf 
> spark.executor.pyspark.memory=2g --conf 
> spark.sql.execution.arrow.enabled=true --executor-cores 1{code}
>  
> !py3.6.png!
> The data is incomplete. The exception is printed by Spark; we added that 
> logging, and I will explain it later.
>   
>   
> h3. *Investigation*
> The “process done” is added in the worker.py.
>  !worker.png!   
>  In order to get the exception, change the Spark code under 
> core/src/main/scala/org/apache/spark/util/Utils.scala, adding this code to 
> print the exception:
>   
>  
> {code:java}
> @@ -1362,6 +1362,8 @@ private[spark] object Utils extends Logging {
>  case t: Throwable =>
>  // Purposefully not using NonFatal, because even fatal exceptions
>  // we don't want to have our finallyBlock suppress
> + logInfo(t.getLocalizedMessage)
> + t.printStackTrace()
>  originalThrowable = t
>  throw originalThrowable
>  } finally {{code}
>  
>  
>  It seems that pyspark gets the data from the JVM, but pyarrow receives the 
> data incomplete. The pyarrow side thinks the data is finished and shuts down 
> the socket. At the same time, the JVM side still writes to the same socket, 
> but gets a socket close exception.
>  The pyarrow part is in ipc.pxi:
>   
> {code:java}
cdef class _RecordBatchReader:
    cdef:
        shared_ptr[CRecordBatchReader] reader
        shared_ptr[InputStream] in_stream

    cdef readonly:
        Schema schema

    def __cinit__(self):
        pass

    def _open(self, source):
        get_input_stream(source, &self.in_stream)
        with nogil:
            check_status(CRecordBatchStreamReader.Open(
                self.in_stream.get(), &self.reader))
        self.schema = pyarrow_wrap_schema(self.reader.get().schema())

    def __iter__(self):
        while True:
            yield self.read_next_batch()

    def get_next_batch(self):
        import warnings
        warnings.warn('Please use read_next_batch instead of '

[jira] [Updated] (ARROW-6301) [Python] atexit: pyarrow.lib.ArrowKeyError: 'No type extension with name arrow.py_extension_type found'

2019-08-20 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated ARROW-6301:

Summary: [Python] atexit: pyarrow.lib.ArrowKeyError: 'No type extension 
with name arrow.py_extension_type found'  (was: atexit: 
pyarrow.lib.ArrowKeyError: 'No type extension with name arrow.py_extension_type 
found')

> [Python] atexit: pyarrow.lib.ArrowKeyError: 'No type extension with name 
> arrow.py_extension_type found'
> ---
>
> Key: ARROW-6301
> URL: https://issues.apache.org/jira/browse/ARROW-6301
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
> Environment: linux, virtualenv, uwsgi, cpython 2.7
>Reporter: David Alphus
>Priority: Minor
>
> On interrupt, I am frequently seeing the atexit function failing in pyarrow 
> 0.14.1.
> {code:java}
>  ^CSIGINT/SIGQUIT received...killing workers... 
> killing the spooler with pid 22640 
> Error in atexit._run_exitfuncs: 
> Traceback (most recent call last): 
>   File "/home/alpha/.virtualenvs/wsgi/lib/python2.7/atexit.py", line 24, in 
> _run_exitfuncs 
>     func(*targs, **kargs) 
>   File "pyarrow/types.pxi", line 1860, in 
> pyarrow.lib._unregister_py_extension_type 
>     check_status(UnregisterPyExtensionType()) 
>   File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status 
>     raise ArrowKeyError(message) 
> ArrowKeyError: 'No type extension with name arrow.py_extension_type found' 
> Error in sys.exitfunc: 
> Traceback (most recent call last): 
>   File "/home/alpha/.virtualenvs/wsgi/lib/python2.7/atexit.py", line 24, in 
> _run_exitfuncs 
>     func(*targs, **kargs) 
>   File "pyarrow/types.pxi", line 1860, in 
> pyarrow.lib._unregister_py_extension_type 
>   File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status 
> pyarrow.lib.ArrowKeyError: 'No type extension with name 
> arrow.py_extension_type found' 
> spooler (pid: 22640) annihilated 
> worker 1 buried after 1 seconds 
> goodbye to uWSGI.{code}





[jira] [Resolved] (ARROW-6210) [Java] remove equals API from ValueVector

2019-08-14 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-6210.
-
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5065
[https://github.com/apache/arrow/pull/5065]

> [Java] remove equals API from ValueVector
> -
>
> Key: ARROW-6210
> URL: https://issues.apache.org/jira/browse/ARROW-6210
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Pindikura Ravindra
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> This is a follow-up from [https://github.com/apache/arrow/pull/4933]
> The callers should be fixed to use the RangeEquals API instead.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6211) [Java] Remove dependency on RangeEqualsVisitor from ValueVector interface

2019-08-13 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906415#comment-16906415
 ] 

Bryan Cutler commented on ARROW-6211:
-

This sounds good to me, then; I agree it would be useful to have a generic 
visitor API.
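The generic visitor shape under discussion can be sketched in Python for 
illustration; the names IntVector, RangeEqualsVisitor, and accept mirror the 
quoted Java snippet but are assumptions here, not the actual Arrow Java API:

```python
# Sketch of a generic visitor API: the vector only knows how to dispatch to
# a visitor, so new visitor types can be added without changing the vector.
class IntVector:
    def __init__(self, values):
        self.values = values

    def accept(self, visitor, value):
        # Double dispatch: the vector picks the visit method for its type.
        return visitor.visit_int(self, value)

class RangeEqualsVisitor:
    """One concrete visitor; others can be added without touching IntVector."""
    def __init__(self, other):
        self.other = other

    def visit_int(self, vector, _value):
        return vector.values == self.other.values

left, right = IntVector([1, 2, 3]), IntVector([1, 2, 3])
assert left.accept(RangeEqualsVisitor(right), None)
```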

> [Java] Remove dependency on RangeEqualsVisitor from ValueVector interface
> -
>
> Key: ARROW-6211
> URL: https://issues.apache.org/jira/browse/ARROW-6211
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Pindikura Ravindra
>Assignee: Ji Liu
>Priority: Major
>
> This is a follow-up from [https://github.com/apache/arrow/pull/4933]
>  
> public interface VectorVisitor<OUT, IN> \{..}
>  
> In ValueVector : 
> public <OUT, IN, EX extends Exception> OUT accept(VectorVisitor<OUT, IN> 
> visitor, IN value) throws EX;
>  





[jira] [Resolved] (ARROW-6215) [Java] RangeEqualVisitor does not properly compare ZeroVector

2019-08-12 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-6215.
-
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5070
[https://github.com/apache/arrow/pull/5070]

> [Java] RangeEqualVisitor does not properly compare ZeroVector
> -
>
> Key: ARROW-6215
> URL: https://issues.apache.org/jira/browse/ARROW-6215
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ZeroVector.accept with RangeEqualsVisitor always returns true, no matter what 
> type of vector it is compared against





[jira] [Created] (ARROW-6215) [Java] RangeEqualVisitor does not properly compare ZeroVector

2019-08-12 Thread Bryan Cutler (JIRA)
Bryan Cutler created ARROW-6215:
---

 Summary: [Java] RangeEqualVisitor does not properly compare 
ZeroVector
 Key: ARROW-6215
 URL: https://issues.apache.org/jira/browse/ARROW-6215
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Bryan Cutler
Assignee: Bryan Cutler


ZeroVector.accept with RangeEqualsVisitor always returns true, no matter what 
type of vector it is compared against





[jira] [Resolved] (ARROW-6209) [Java] Extract set null method to the base class for fixed width vectors

2019-08-12 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-6209.
-
   Resolution: Fixed
Fix Version/s: 0.15.0

Issue resolved by pull request 5064
[https://github.com/apache/arrow/pull/5064]

> [Java] Extract set null method to the base class for fixed width vectors
> 
>
> Key: ARROW-6209
> URL: https://issues.apache.org/jira/browse/ARROW-6209
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, each fixed width vector has the setNull method. All these 
> implementations are identical, so we move them to the base class. 





[jira] [Commented] (ARROW-6211) [Java] Remove dependency on RangeEqualsVisitor from ValueVector interface

2019-08-12 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905457#comment-16905457
 ] 

Bryan Cutler commented on ARROW-6211:
-

So will this allow for other types of visitors besides a RangeEqualsVisitor?

> [Java] Remove dependency on RangeEqualsVisitor from ValueVector interface
> -
>
> Key: ARROW-6211
> URL: https://issues.apache.org/jira/browse/ARROW-6211
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Pindikura Ravindra
>Assignee: Ji Liu
>Priority: Major
>
> This is a follow-up from [https://github.com/apache/arrow/pull/4933]
>  
> public interface VectorVisitor<OUT, IN> \{..}
>  
> In ValueVector : 
> public <OUT, IN, EX extends Exception> OUT accept(VectorVisitor<OUT, IN> 
> visitor, IN value) throws EX;
>  





[jira] [Resolved] (ARROW-5579) [Java] shade flatbuffer dependency

2019-08-08 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-5579.
-
   Resolution: Fixed
Fix Version/s: 0.15.0  (was: 1.0.0)

Issue resolved by pull request 4701
[https://github.com/apache/arrow/pull/4701]

> [Java] shade flatbuffer dependency
> --
>
> Key: ARROW-5579
> URL: https://issues.apache.org/jira/browse/ARROW-5579
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Pindikura Ravindra
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>  Time Spent: 20h 40m
>  Remaining Estimate: 0h
>
> Reported in a [github issue|https://github.com/apache/arrow/issues/4489] 
>  
> After some [discussion|https://github.com/google/flatbuffers/issues/5368] 
> with the Flatbuffers maintainer, it appears that FB generated code is not 
> guaranteed to be compatible with _any other_ version of the runtime library 
> other than the exact same version of the flatc used to compile it.
> This makes depending on flatbuffers in a library (like arrow) quite risky, as 
> if an app depends on any other version of FB, either directly or 
> transitively, it's likely the versions will clash at some point and you'll 
> see undefined behaviour at runtime.
> Shading the dependency looks to me the best way to avoid this.





[jira] [Assigned] (ARROW-1184) [Java] Dictionary.equals is not working correctly

2019-07-25 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned ARROW-1184:
---

Assignee: Ji Liu

> [Java] Dictionary.equals is not working correctly
> -
>
> Key: ARROW-1184
> URL: https://issues.apache.org/jira/browse/ARROW-1184
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Bryan Cutler
>Assignee: Ji Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> The {{Dictionary.equals}} method does not return True when the dictionaries 
> are equal.  This is because {{equals}} is not implemented for FieldVector and 
> so that comparison defaults to comparing the two objects only and not the 
> vector data.





[jira] [Resolved] (ARROW-1184) [Java] Dictionary.equals is not working correctly

2019-07-25 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-1184.
-
Resolution: Fixed

Issue resolved by pull request 4843
[https://github.com/apache/arrow/pull/4843]

> [Java] Dictionary.equals is not working correctly
> -
>
> Key: ARROW-1184
> URL: https://issues.apache.org/jira/browse/ARROW-1184
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 7h 10m
>  Remaining Estimate: 0h
>
> The {{Dictionary.equals}} method does not return True when the dictionaries 
> are equal.  This is because {{equals}} is not implemented for FieldVector and 
> so that comparison defaults to comparing the two objects only and not the 
> vector data.





[jira] [Resolved] (ARROW-5911) [Java] Make ListVector and MapVector create reader lazily

2019-07-16 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-5911.
-
   Resolution: Fixed
Fix Version/s: 0.14.1

Issue resolved by pull request 4854
[https://github.com/apache/arrow/pull/4854]

> [Java] Make ListVector and MapVector create reader lazily
> -
>
> Key: ARROW-5911
> URL: https://issues.apache.org/jira/browse/ARROW-5911
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.1
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The current implementation creates the reader eagerly, which may waste 
> resources and time. This issue changes the behavior to create the reader 
> lazily.
> This is a follow-up issue for ARROW-5897.





[jira] [Resolved] (ARROW-5435) [Java] add test for IntervalYearVector#getAsStringBuilder

2019-06-27 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-5435.
-
   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4407
[https://github.com/apache/arrow/pull/4407]

> [Java] add test for IntervalYearVector#getAsStringBuilder
> -
>
> Key: ARROW-5435
> URL: https://issues.apache.org/jira/browse/ARROW-5435
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5762) [Integration][JS] Integration Tests for Map Type

2019-06-27 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated ARROW-5762:

Summary: [Integration][JS] Integration Tests for Map Type  (was: 
[Integration][JS] Integration Tests for MapType)

> [Integration][JS] Integration Tests for Map Type
> 
>
> Key: ARROW-5762
> URL: https://issues.apache.org/jira/browse/ARROW-5762
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Integration, JavaScript
>Reporter: Bryan Cutler
>Priority: Major
>
> ARROW-1279 enabled integration tests for MapType between Java and C++, but 
> JavaScript had to be disabled for the map case due to an error.  Once this is 
> fixed, {{generate_map_case}} could be moved under {{generate_nested_case}} 
> with the other nested types.




