[jira] [Created] (ARROW-17135) [C++] Reduce code size in arrow/compute/kernels/scalar_compare.cc

2022-07-19 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-17135:


 Summary: [C++] Reduce code size in 
arrow/compute/kernels/scalar_compare.cc
 Key: ARROW-17135
 URL: https://issues.apache.org/jira/browse/ARROW-17135
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney


I had noticed the large symbol sizes in scalar_compare.cc when looking at the 
shared library. I made a quick hack on the plane to try to reduce the code size.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17134) [C++(?)/Python] pyarrow.compute.replace_with_mask does not replace null when providing an array mask

2022-07-19 Thread Matthew Roeschke (Jira)
Matthew Roeschke created ARROW-17134:


 Summary: [C++(?)/Python] pyarrow.compute.replace_with_mask does 
not replace null when providing an array mask
 Key: ARROW-17134
 URL: https://issues.apache.org/jira/browse/ARROW-17134
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Affects Versions: 8.0.0
Reporter: Matthew Roeschke


 
{code:java}
In [1]: import pyarrow as pa

In [2]: arr1 = pa.array([1, 0, 1, None, None])

In [3]: arr2 = pa.array([None, None, 1, 0, 1])

In [4]: pa.compute.replace_with_mask(arr1, [False, False, False, True, True], arr2)

Out[4]:

[
  1,
  0,
  1,
  null, # I would expect 0
  null  # I would expect 1
]

In [5]: pa.__version__
Out[5]: '8.0.0'{code}
 

I have noticed this behavior with integer, floating-point, boolean, and temporal 
types.
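For reference, a minimal pure-Python sketch of the element-wise semantics the report expects (the function name and implementation here are illustrative, not pyarrow's actual code):

```python
def replace_with_mask_expected(values, mask, replacements):
    # Element-wise semantics assumed by the report: where mask[i] is True,
    # take replacements[i]; otherwise keep values[i] (even if it is None).
    return [r if m else v for v, m, r in zip(values, mask, replacements)]

arr1 = [1, 0, 1, None, None]
mask = [False, False, False, True, True]
arr2 = [None, None, 1, 0, 1]

# The report expects the two trailing nulls to be replaced by 0 and 1.
assert replace_with_mask_expected(arr1, mask, arr2) == [1, 0, 1, 0, 1]
```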

 





[jira] [Created] (ARROW-17133) pqarrow: PlainFixedLenByteArrayEncoder behaves differently from DictFixedLenByteArrayEncoder with null values where schema has Nullable: false

2022-07-19 Thread Phillip LeBlanc (Jira)
Phillip LeBlanc created ARROW-17133:
---

 Summary: pqarrow: PlainFixedLenByteArrayEncoder behaves 
differently from DictFixedLenByteArrayEncoder with null values where schema has 
Nullable: false
 Key: ARROW-17133
 URL: https://issues.apache.org/jira/browse/ARROW-17133
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go, Parquet
Affects Versions: 8.0.0
Reporter: Phillip LeBlanc


I have created a small repro to illustrate this bug: 
https://gist.github.com/phillipleblanc/5e3e2d0e6914d276cf9fd79e019581de

When writing a Decimal128 array to a Parquet file the pqarrow package will 
prefer to use DictFixedLenByteArrayEncoder. If the size of the array goes over 
some threshold, it will switch to using PlainFixedLenByteArrayEncoder.

The DictFixedLenByteArrayEncoder tolerates null values in a Decimal128 array 
with the arrow schema set to Nullable: false, however the 
PlainFixedLenByteArrayEncoder will not tolerate null values and will panic.

Having null values in an array marked as non-nullable is an issue in the user 
code; however, it was surprising that my buggy code worked sometimes and failed 
other times. I would expect the PlainFixedLen encoder to handle nulls the same 
way as the DictFixedLen encoder, or for the DictFixedLen encoder to panic as 
well.

An observation: most other array types handle nulls when writing to Parquet even 
with the schema marked as non-nullable; this was the first instance I found in 
the pqarrow package where marking the Arrow schema as Nullable was necessary to 
write arrays containing null values. Again, it is debatable whether this is 
desirable.





[jira] [Created] (ARROW-17132) [R] Mutate in compare_dplyr_binding returns wrong type

2022-07-19 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-17132:
--

 Summary: [R] Mutate in compare_dplyr_binding returns wrong type
 Key: ARROW-17132
 URL: https://issues.apache.org/jira/browse/ARROW-17132
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Rok Mihevc


The following:
{code:r}
df <- tibble::tibble(
  time = as.POSIXct(seq(as.Date("1999-12-31", tz = "UTC"), 
as.Date("2001-01-01", tz = "UTC"), by = "day"))
)

compare_dplyr_binding(
  .input %>%
mutate(x = yday(time)) %>%
collect(),
  df
)
{code}

Fails with:

{code:bash}
Failure (test-dplyr-funcs-datetime.R:574:3): extract wday from timestamp
`object` (`actual`) not equal to `expected` (`expected`).

`attr(actual$time, 'tzone')` is a character vector ('UTC')
`attr(expected$time, 'tzone')` is absent
Backtrace:
 1. arrow:::compare_dplyr_binding(...)
  at test-dplyr-funcs-datetime.R:574:2
 2. arrow:::expect_equal(via_batch, expected, ...)
  at tests/testthat/helper-expectation.R:115:4
 3. testthat::expect_equal(...)
  at tests/testthat/helper-expectation.R:42:4

Failure (test-dplyr-funcs-datetime.R:574:3): extract wday from timestamp
`object` (`actual`) not equal to `expected` (`expected`).

`attr(actual$time, 'tzone')` is a character vector ('UTC')
`attr(expected$time, 'tzone')` is absent
Backtrace:
 1. arrow:::compare_dplyr_binding(...)
  at test-dplyr-funcs-datetime.R:574:2
 2. arrow:::expect_equal(via_table, expected, ...)
  at tests/testthat/helper-expectation.R:129:4
 3. testthat::expect_equal(...)
  at tests/testthat/helper-expectation.R:42:4
{code}

This also happens for qday and probably other functions where input is temporal 
and output is numeric.





[jira] [Created] (ARROW-17131) [Python] add a field() method to StructType that returns the user a field

2022-07-19 Thread Anja Boskovic (Jira)
Anja Boskovic created ARROW-17131:
-

 Summary: [Python]  add a field() method to StructType that returns 
the user a field
 Key: ARROW-17131
 URL: https://issues.apache.org/jira/browse/ARROW-17131
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Anja Boskovic
Assignee: Anja Boskovic


Joris suggested here: "we could also add a {{field()}} method that returns you a 
field? (that is more discoverable than {{[]}}, and would be consistent with 
a Schema and with StructArray (to get the child array for that field))".

 

Completion of this issue would also mean updating the example in the API docs 
for {{StructType}} to mention {{field()}}.
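A rough pure-Python mock of the proposed usage (the class and API shape here are assumptions for illustration; the real accessor would live in pyarrow and return a pyarrow.Field):

```python
# Hypothetical mock of the proposed StructType.field() accessor.
class MockStructType:
    def __init__(self, fields):
        self._fields = dict(fields)

    def field(self, name):
        # More discoverable than indexing with [], and consistent with
        # Schema.field() and StructArray.field().
        return self._fields[name]

st = MockStructType([("x", "int64"), ("y", "string")])
assert st.field("y") == "string"
```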





[GitHub] [arrow-julia] palday opened a new issue, #328: `fromarrow` dispatch for `::Type{Union{Missing, T}}` is broken when `T` is parametric

2022-07-19 Thread GitBox


palday opened a new issue, #328:
URL: https://github.com/apache/arrow-julia/issues/328

   xref: https://github.com/JuliaCloud/AWSS3.jl/issues/263
   
   The fix seems to be changing `::Type{Union{Missing, T}}` to 
`::Type{<:Union{Missing, T}}`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (ARROW-17130) Enable multiple character delimiters in read_csv

2022-07-19 Thread Jack Howard (Jira)
Jack Howard created ARROW-17130:
---

 Summary: Enable multiple character delimiters in read_csv
 Key: ARROW-17130
 URL: https://issues.apache.org/jira/browse/ARROW-17130
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Format
Affects Versions: 8.0.1
Reporter: Jack Howard


read_csv ParseOptions allows only a single-character delimiter. Single-character 
delimiters are highly susceptible to the candidate value occurring within the 
data to be loaded, negating their ability to serve as a delimiter.

If a two-character delimiter is supplied, the current single-character limit 
produces the error "only single character unicode strings can be converted to 
Py_UCS4, got length 2".
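Until multi-character delimiters are supported, one common workaround is to rewrite the delimiter to a single unused character before parsing; a minimal sketch (the sentinel character is an assumption and must not occur in the data):

```python
def normalize_delimiter(text, multi_delim, sentinel="\x1f"):
    # Rewrite a multi-character delimiter to a single-character sentinel
    # (here the ASCII Unit Separator) so that a single-character CSV parser
    # can consume the result. Assumes the sentinel never appears in the data.
    if sentinel in text:
        raise ValueError("sentinel occurs in the data; choose another")
    return text.replace(multi_delim, sentinel)

raw = "a||b||c\n1||2||3\n"
assert normalize_delimiter(raw, "||") == "a\x1fb\x1fc\n1\x1f2\x1f3\n"
```

The normalized text can then be fed to any CSV reader configured with the sentinel as its single-character delimiter.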





[jira] [Created] (ARROW-17129) [C++][Compute] Improve memory efficiency in Grouper

2022-07-19 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-17129:


 Summary: [C++][Compute] Improve memory efficiency in Grouper
 Key: ARROW-17129
 URL: https://issues.apache.org/jira/browse/ARROW-17129
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Wes McKinney


There are APIs in arrow::compute::Grouper (GetUniques, Consume) which could be 
refactored to write into preallocated memory, or otherwise offer a mode that 
performs less mandatory allocation. We can investigate at some point.





[jira] [Created] (ARROW-17128) [C++] Sporadic DCHECK failure in arrow-dataset-scanner-test (2)

2022-07-19 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-17128:
--

 Summary: [C++] Sporadic DCHECK failure in 
arrow-dataset-scanner-test (2)
 Key: ARROW-17128
 URL: https://issues.apache.org/jira/browse/ARROW-17128
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


Just got this sporadic assertion error:
{code}
[ RUN  ] 
TestScannerThreading/TestScanner.CountRowsWithMetadata/3Threaded2d16b1024r
/home/antoine/arrow/dev/cpp/src/arrow/util/future.cc:331:  Check failed: 
!IsFutureFinished(state_) Future already marked finished
{code}

Stack trace:
{code}
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x74a24859 in __GI_abort () at abort.c:79
#2  0x756f635c in arrow::util::CerrLog::~CerrLog (this=0x5586b330, 
__in_chrg=) at 
/home/antoine/arrow/dev/cpp/src/arrow/util/logging.cc:72
#3  0x756f6378 in arrow::util::CerrLog::~CerrLog (this=0x5586b330, 
__in_chrg=) at 
/home/antoine/arrow/dev/cpp/src/arrow/util/logging.cc:74
#4  0x756f66dd in arrow::util::ArrowLog::~ArrowLog 
(this=0x7fffebffd970, __in_chrg=) at 
/home/antoine/arrow/dev/cpp/src/arrow/util/logging.cc:250
#5  0x756c7af1 in arrow::ConcreteFutureImpl::DoMarkFinishedOrFailed 
(this=0x5585e910, state=arrow::FutureState::SUCCESS)
at /home/antoine/arrow/dev/cpp/src/arrow/util/future.cc:331
#6  0x756c70e7 in arrow::ConcreteFutureImpl::DoMarkFinished 
(this=0x5585e910) at 
/home/antoine/arrow/dev/cpp/src/arrow/util/future.cc:232
#7  0x756c8288 in arrow::FutureImpl::MarkFinished (this=0x5585e910) 
at /home/antoine/arrow/dev/cpp/src/arrow/util/future.cc:409
#8  0x7564e4f7 in arrow::Future::DoMarkFinished 
(this=0x55896bf0, res=...) at 
/home/antoine/arrow/dev/cpp/src/arrow/util/future.h:725
#9  0x7564c198 in 
arrow::Future::MarkFinished (this=0x55896bf0, s=...)
at /home/antoine/arrow/dev/cpp/src/arrow/util/future.h:476
#10 0x7599d045 in arrow::compute::(anonymous 
namespace)::ScalarAggregateNode::Finish (this=0x55896b60)
at /home/antoine/arrow/dev/cpp/src/arrow/compute/exec/aggregate_node.cc:255
#11 0x7599c422 in arrow::compute::(anonymous 
namespace)::ScalarAggregateNode::InputReceived (this=0x55896b60, 
input=0x559077c0, batch=...)
at /home/antoine/arrow/dev/cpp/src/arrow/compute/exec/aggregate_node.cc:176
#12 0x759c8567 in operator() (__closure=0x7fffebffdd40) at 
/home/antoine/arrow/dev/cpp/src/arrow/compute/exec/exec_plan.cc:531
#13 0x759c873a in 
arrow::compute::MapNode::SubmitTask(std::function
 (arrow::compute::ExecBatch)>, arrow::compute::ExecBatch) (this=0x559077c0, 
map_fn=..., batch=...) at 
/home/antoine/arrow/dev/cpp/src/arrow/compute/exec/exec_plan.cc:535
#14 0x75a97524 in arrow::compute::(anonymous 
namespace)::ProjectNode::InputReceived (this=0x559077c0, 
input=0x55913150, batch=...)
at /home/antoine/arrow/dev/cpp/src/arrow/compute/exec/project_node.cc:111
#15 0x75aa3da2 in operator() (__closure=0x7fffa000aba0) at 
/home/antoine/arrow/dev/cpp/src/arrow/compute/exec/source_node.cc:119
#16 0x75aaa56f in std::__invoke_impl::&)>::&>(std::__invoke_other,
 struct {...} &) (__f=...)
at 
/home/antoine/miniconda3/envs/pyarrow/x86_64-conda-linux-gnu/include/c++/10.3.0/bits/invoke.h:60
#17 0x75aa943c in std::__invoke_r::&)>::&>(struct
 {...} &) (__fn=...) at 
/home/antoine/miniconda3/envs/pyarrow/x86_64-conda-linux-gnu/include/c++/10.3.0/bits/invoke.h:115
#18 0x75aa79ca in std::_Function_handler::&)>::
 >::_M_invoke(const std::_Any_data &) (__functor=...)
at 
/home/antoine/miniconda3/envs/pyarrow/x86_64-conda-linux-gnu/include/c++/10.3.0/bits/std_function.h:292
#19 0x759ca6b5 in std::function::operator()() const 
(this=0x5593e700)
at 
/home/antoine/miniconda3/envs/pyarrow/x86_64-conda-linux-gnu/include/c++/10.3.0/bits/std_function.h:622
#20 0x759df33b in 
arrow::detail::ContinueFuture::operator()&, , 
arrow::Status, arrow::Future 
>(arrow::Future, std::function&) 
const (this=0x5593e6f8, next=..., f=...) at 
/home/antoine/arrow/dev/cpp/src/arrow/util/future.h:150
#21 0x759df19f in std::__invoke_impl&, 
std::function&>(std::__invoke_other, 
arrow::detail::ContinueFuture&, arrow::Future&, 
std::function&) (__f=...)
at 
/home/antoine/miniconda3/envs/pyarrow/x86_64-conda-linux-gnu/include/c++/10.3.0/bits/invoke.h:60
#22 0x759def33 in std::__invoke&, std::function&>(arrow::detail::ContinueFuture&, arrow::Future&, 
std::function&) (__fn=...)
at 
/home/antoine/miniconda3/envs/pyarrow/x86_64-conda-linux-gnu/include/c++/10.3.0/bits/invoke.h:95
#23 0x759deb86 in std::_Bind, std::function)>::__call(std::tuple<>&&, std::_Index_tuple<0ul, 1ul>) 
(this=0x5593e6f8, __args=...) at 

[jira] [Created] (ARROW-17127) [C++] Sporadic crash in arrow-dataset-scanner-test (1)

2022-07-19 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-17127:
--

 Summary: [C++] Sporadic crash in arrow-dataset-scanner-test (1)
 Key: ARROW-17127
 URL: https://issues.apache.org/jira/browse/ARROW-17127
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou
 Fix For: 9.0.0


See GDB backtrace at 
https://gist.github.com/pitrou/ef47ab902cbbba80440ee0375a1d7ed3





[jira] [Created] (ARROW-17126) [C++] Remove FutureWaiter

2022-07-19 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-17126:
--

 Summary: [C++] Remove FutureWaiter
 Key: ARROW-17126
 URL: https://issues.apache.org/jira/browse/ARROW-17126
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Antoine Pitrou


Remove {{FutureWaiter}} and its dependent APIs ({{FutureIterator}}, 
{{WaitForAll}}, {{WaitForAny}}).

Removing {{FutureWaiter}} would significantly simplify the {{Future}} 
implementation, making it also more maintainable and potentially faster.





[jira] [Created] (ARROW-17125) Unable to install pyarrow on Debian 10 (i686)

2022-07-19 Thread Rustam Guliev (Jira)
Rustam Guliev created ARROW-17125:
-

 Summary: Unable to install pyarrow on Debian 10 (i686)
 Key: ARROW-17125
 URL: https://issues.apache.org/jira/browse/ARROW-17125
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 8.0.1, 7.0.1
 Environment: Debian GNU/Linux 10 (buster)
Python 3.9.7
pip 22.1.2 
cmake 3.22.5


$ lscpu
Architecture:        i686
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       45 bits physical, 48 bits virtual
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           4
Vendor ID:           GenuineIntel
CPU family:          6
Model:               45
Model name:          Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
Stepping:            7
CPU MHz:             1995.000
BogoMIPS:            3990.00
Hypervisor vendor:   VMware
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            20480K
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush mmx fxsr sse sse2 ss nx rdtscp lm constant_tsc 
arch_perfmon xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 cx16 
sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm 
pti ssbd ibrs ibpb stibp tsc_adjust arat md_clear flush_l1d arch_capabilities  
Reporter: Rustam Guliev


Hi,

I am not able to install pyarrow on Debian 10. First, the installation (via 
`pip` or `poetry install`) fails with the following:

 
{code:java}
  EnvCommandError  Command 
['/home/rustam/.cache/pypoetry/virtualenvs/spectra-annotator-Vr_f9e53-py3.9/bin/pip',
 'install', '--no-deps', 
'file:///home/rustam/.cache/pypoetry/artifacts/b2/96/6a/2a784854a355f986090eafd225285e4a1c6167b5a6adc6c859d785a095/pyarrow-7.0.0.tar.gz']
 errored with the following return code 1, and output:
  Processing 
/home/rustam/.cache/pypoetry/artifacts/b2/96/6a/2a784854a355f986090eafd225285e4a1c6167b5a6adc6c859d785a095/pyarrow-7.0.0.tar.gz
    Installing build dependencies: started
    Installing build dependencies: finished with status 'done'
    Getting requirements to build wheel: started
    Getting requirements to build wheel: finished with status 'done'
    Preparing metadata (pyproject.toml): started
    Preparing metadata (pyproject.toml): finished with status 'done'
  Building wheels for collected packages: pyarrow
    Building wheel for pyarrow (pyproject.toml): started
    Building wheel for pyarrow (pyproject.toml): finished with status 'error'
    error: subprocess-exited-with-error    × Building wheel for pyarrow 
(pyproject.toml) did not run successfully.
    │ exit code: 1
    ╰─> [261 lines of output]
        running bdist_wheel
        running build
        running build_py
        running egg_info
        writing pyarrow.egg-info/PKG-INFO
        writing dependency_links to pyarrow.egg-info/dependency_links.txt
        writing entry points to pyarrow.egg-info/entry_points.txt
        writing requirements to pyarrow.egg-info/requires.txt
        writing top-level names to pyarrow.egg-info/top_level.txt
        listing git files failed - pretending there aren't any
        reading manifest file 'pyarrow.egg-info/SOURCES.txt'
        reading manifest template 'MANIFEST.in'
        warning: no files found matching '../LICENSE.txt'
        warning: no files found matching '../NOTICE.txt'
        warning: no previously-included files matching '*.so' found anywhere in 
distribution
        warning: no previously-included files matching '*.pyc' found anywhere 
in distribution
        warning: no previously-included files matching '*~' found anywhere in 
distribution
        warning: no previously-included files matching '#*' found anywhere in 
distribution
        warning: no previously-included files matching '.git*' found anywhere 
in distribution
        warning: no previously-included files matching '.DS_Store' found 
anywhere in distribution
        no previously-included directories found matching '.asv'
        
/tmp/pip-build-env-umvxn44o/overlay/lib/python3.9/site-packages/setuptools/command/build_py.py:153:
 SetuptoolsDeprecationWarning:     Installing 'pyarrow.includes' as data is 
deprecated, please list it in `packages`.
            !!
            
            # Package would be ignored #
            
            Python recognizes 'pyarrow.includes' as an importable package,
            but it is not listed in the `packages` configuration of setuptools. 
           'pyarrow.includes' has been automatically added to the distribution 
only
            because it may contain data files, but this behavior is likely to 
change
            in future versions of setuptools (and therefore is considered 
deprecated).            Please 

[jira] [Created] (ARROW-17124) [C++] Data race between future signalling and destruction

2022-07-19 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-17124:
--

 Summary: [C++] Data race between future signalling and destruction
 Key: ARROW-17124
 URL: https://issues.apache.org/jira/browse/ARROW-17124
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Antoine Pitrou


I just ran into this sporadic Thread Sanitizer error:
{code}
WARNING: ThreadSanitizer: data race (pid=636020)
  Write of size 8 at 0x7b2c17d0 by main thread:
#0 pthread_cond_destroy 
../../../../libsanitizer/tsan/tsan_interceptors_posix.cpp:1208 
(libtsan.so.0+0x31c14)
#1 arrow::ConcreteFutureImpl::~ConcreteFutureImpl() 
/home/antoine/arrow/dev/cpp/src/arrow/util/future.cc:211 
(libarrow.so.900+0xa70b62)
#2 arrow::ConcreteFutureImpl::~ConcreteFutureImpl() 
/home/antoine/arrow/dev/cpp/src/arrow/util/future.cc:211 
(libarrow.so.900+0xa70ba0)
#3 std::default_delete::operator()(arrow::FutureImpl*) 
const 
/home/antoine/miniconda3/envs/pyarrow/x86_64-conda-linux-gnu/include/c++/10.3.0/bits/unique_ptr.h:85
 (arrow-dataset-file-test+0x584a1)
#4 std::_Sp_counted_deleter, std::allocator, 
(__gnu_cxx::_Lock_policy)2>::_M_dispose() 
/home/antoine/miniconda3/envs/pyarrow/x86_64-conda-linux-gnu/include/c++/10.3.0/bits/shared_ptr_base.h:474
 (arrow-dataset-file-test+0xa9638)
#5 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()  
(libarrow.so.900+0x2e1158)
#6 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() 
 (libarrow.so.900+0x2dc6ed)
#7 std::__shared_ptr::~__shared_ptr()  (libarrow.so.900+0x978fee)
#8 std::shared_ptr::~shared_ptr()  
(libarrow.so.900+0x97901c)
#9 arrow::Future::~Future()  
(libarrow.so.900+0x97904a)
#10 ~ExecPlanImpl 
/home/antoine/arrow/dev/cpp/src/arrow/compute/exec/exec_plan.cc:52 
(libarrow.so.900+0xe8160b)
#11 ~ExecPlanImpl 
/home/antoine/arrow/dev/cpp/src/arrow/compute/exec/exec_plan.cc:58 
(libarrow.so.900+0xe8166e)
#12 _M_dispose 
/home/antoine/miniconda3/envs/pyarrow/x86_64-conda-linux-gnu/include/c++/10.3.0/bits/shared_ptr_base.h:380
 (libarrow.so.900+0xea6c2a)
#13 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release()  
(libarrow_dataset.so.900+0x7bd10)
#14 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() 
/home/antoine/miniconda3/envs/pyarrow/x86_64-conda-linux-gnu/include/c++/10.3.0/bits/shared_ptr_base.h:733
 (libarrow_dataset.so.900+0x77ad9)
#15 std::__shared_ptr::~__shared_ptr() 
/home/antoine/miniconda3/envs/pyarrow/x86_64-conda-linux-gnu/include/c++/10.3.0/bits/shared_ptr_base.h:1183
 (libarrow_dataset.so.900+0xd3dfc)
#16 std::shared_ptr::~shared_ptr() 
/home/antoine/miniconda3/envs/pyarrow/x86_64-conda-linux-gnu/include/c++/10.3.0/bits/shared_ptr.h:121
 (libarrow_dataset.so.900+0xd3e2a)
#17 
arrow::dataset::FileSystemDataset::Write(arrow::dataset::FileSystemDatasetWriteOptions
 const&, std::shared_ptr) 
/home/antoine/arrow/dev/cpp/src/arrow/dataset/file_base.cc:398 
(libarrow_dataset.so.900+0xd49ca)
#18 arrow::dataset::TestFileSystemDataset_WriteProjected_Test::TestBody() 
/home/antoine/arrow/dev/cpp/src/arrow/dataset/file_test.cc:330 
(arrow-dataset-file-test+0x2e382)
#19 void 
testing::internal::HandleExceptionsInMethodIfSupported(testing::Test*, void (testing::Test::*)(), char const*)  
(libgtest.so.1.11.0+0x5bd3d)

  Previous read of size 8 at 0x7b2c17d0 by thread T3:
#0 pthread_cond_broadcast 
../../../../libsanitizer/tsan/tsan_interceptors_posix.cpp:1201 
(libtsan.so.0+0x31b51)
#1 arrow::ConcreteFutureImpl::DoMarkFinishedOrFailed(arrow::FutureState) 
/home/antoine/arrow/dev/cpp/src/arrow/util/future.cc:343 
(libarrow.so.900+0xa6bee0)
#2 arrow::ConcreteFutureImpl::DoMarkFinished() 
/home/antoine/arrow/dev/cpp/src/arrow/util/future.cc:232 
(libarrow.so.900+0xa6b0f4)
#3 arrow::FutureImpl::MarkFinished() 
/home/antoine/arrow/dev/cpp/src/arrow/util/future.cc:409 
(libarrow.so.900+0xa6c83f)
#4 
arrow::Future::DoMarkFinished(arrow::Result)
 /home/antoine/arrow/dev/cpp/src/arrow/util/future.h:725 
(libarrow.so.900+0x9cbf81)
#5 void 
arrow::Future::MarkFinished(arrow::Status) /home/antoine/arrow/dev/cpp/src/arrow/util/future.h:476 
(libarrow.so.900+0x9c921c)
#6 operator() 
/home/antoine/arrow/dev/cpp/src/arrow/compute/exec/exec_plan.cc:192 
(libarrow.so.900+0xe82ee6)
#7 operator() /home/antoine/arrow/dev/cpp/src/arrow/util/future.h:522 
(libarrow.so.900+0xea70a3)
{code}

I think the fix is simply to signal the condition variable with the mutex 
locked (which might be a bit worse performance-wise).
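The proposed fix pattern can be sketched in Python's threading terms (an analogy, not the Arrow C++ code): notify the condition variable while still holding the mutex, so a waiter cannot wake, observe the finished state, and destroy the condition variable between the unlock and the broadcast.

```python
import threading

class FinishableState:
    """Toy analog of a future's finished flag guarded by a condition variable."""

    def __init__(self):
        self._cv = threading.Condition()
        self._finished = False

    def mark_finished(self):
        # Proposed pattern: signal while the lock is held, so the waiter
        # cannot race ahead and tear down the condition variable.
        with self._cv:
            self._finished = True
            self._cv.notify_all()

    def wait_finished(self):
        with self._cv:
            while not self._finished:
                self._cv.wait()

state = FinishableState()
t = threading.Thread(target=state.mark_finished)
t.start()
state.wait_finished()
t.join()
assert state._finished
```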





[jira] [Created] (ARROW-17123) [JS] Unable to open reader on .arrow file after fetch: Uncaught (in promise) Error: Expected to read 1329865020 metadata bytes, but only read 1123.

2022-07-19 Thread Benoit Cantin (Jira)
Benoit Cantin created ARROW-17123:
-

 Summary: [JS] Unable to open reader on .arrow file after fetch: 
Uncaught (in promise) Error: Expected to read 1329865020 metadata bytes, but 
only read 1123.
 Key: ARROW-17123
 URL: https://issues.apache.org/jira/browse/ARROW-17123
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Affects Versions: 8.0.1
Reporter: Benoit Cantin


I created a file in raw arrow format with the script given in the PyArrow 
cookbook here: 
[https://arrow.apache.org/cookbook/py/io.html#saving-arrow-arrays-to-disk]

 

In a Node.js application, this file can be read as follows:

 
{code:java}
const r = await RecordBatchReader.from(fs.createReadStream(filePath));
await r.open();
for (let i = 0; i < r.numRecordBatches; i++) {
    const rb = await r.readRecordBatch(i);
    if (rb !== null) {
        console.log(rb.numRows);
    }
}
{code}
However, this method loads the whole file into memory (is that a bug?), which is 
not scalable.

 

To solve this scalability issue, I tried to load the data with fetch as 
described in the 
[README.md|https://github.com/apache/arrow/tree/master/js#load-data-with-fetch]. 
Both:

 
{code:java}
import { tableFromIPC } from "apache-arrow";

const table = await tableFromIPC(fetch(filePath));
console.table([...table]);{code}
and
{code:java}
const r = await RecordBatchReader.from(await fetch(filePath));
await r.open(); {code}
fail with error:

Uncaught (in promise) Error: Expected to read 1329865020 metadata bytes, but 
only read 1123.

 





[jira] [Created] (ARROW-17122) [Python] Cleanup after moving Python related code into pyarrow

2022-07-19 Thread Alenka Frim (Jira)
Alenka Frim created ARROW-17122:
---

 Summary: [Python] Cleanup after moving Python related code into 
pyarrow
 Key: ARROW-17122
 URL: https://issues.apache.org/jira/browse/ARROW-17122
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Alenka Frim
Assignee: Alenka Frim
 Fix For: 10.0.0


This is an umbrella issue for follow-up work that needs to be done after 
https://issues.apache.org/jira/browse/ARROW-16340 is resolved.





[jira] [Created] (ARROW-17121) [Gandiva][C++] Adding mask function

2022-07-19 Thread Palak Pariawala (Jira)
Palak Pariawala created ARROW-17121:
---

 Summary: [Gandiva][C++] Adding mask function
 Key: ARROW-17121
 URL: https://issues.apache.org/jira/browse/ARROW-17121
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++ - Gandiva
Reporter: Palak Pariawala
Assignee: Palak Pariawala


Add {{mask(str inp)}} / {{mask(str inp, str uc-mask, str lc-mask, str num-mask)}} 
functions to Gandiva.

By default, upper-case letters are masked as 'X', lower-case letters as 'x', and 
digits as 'n'. Custom masking characters can be specified via the parameters.
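The described default behavior, sketched in Python for reference (Gandiva's actual implementation would be generated C++/LLVM; the function and parameter names here are illustrative):

```python
def mask(inp, uc_mask="X", lc_mask="x", num_mask="n"):
    # Default masking from the description: upper-case letters -> 'X',
    # lower-case letters -> 'x', digits -> 'n'; other characters unchanged.
    out = []
    for ch in inp:
        if ch.isupper():
            out.append(uc_mask)
        elif ch.islower():
            out.append(lc_mask)
        elif ch.isdigit():
            out.append(num_mask)
        else:
            out.append(ch)
    return "".join(out)

assert mask("AbC-123") == "XxX-nnn"
assert mask("AbC-123", "U", "l", "#") == "UlU-###"
```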


