[jira] [Created] (ARROW-13546) [Python] Breaking API change in FSSpecHandler, requires metadata argument

2021-08-04 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-13546:


 Summary: [Python] Breaking API change in FSSpecHandler, requires 
metadata argument
 Key: ARROW-13546
 URL: https://issues.apache.org/jira/browse/ARROW-13546
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Maarten Breddels


[https://github.com/apache/arrow/pull/10295] introduced a required metadata 
argument to FSSpecHandler.open_output_stream, which is a breaking API change.
We noticed this in our CI/test suite at [https://github.com/vaexio/vaex/pull/1490]:
{code:java}
def create():
>   return fs.open_output_stream(path)
E   TypeError: open_output_stream() missing 1 required positional argument: 'metadata'
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10959) [C++] Add scalar string join kernel

2020-12-18 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-10959:


 Summary: [C++] Add scalar string join kernel
 Key: ARROW-10959
 URL: https://issues.apache.org/jira/browse/ARROW-10959
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Python
Reporter: Maarten Breddels


Similar to Python's str.join
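
For reference, the Python semantics the kernel would mirror, applied element-wise to a list<string> array:
{code:python}
>>> "-".join(["a", "b", "c"])
'a-b-c'
{code}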



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10799) [C++] Take on string chunked arrays slow and fails

2020-12-03 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-10799:


 Summary: [C++] Take on string chunked arrays slow and fails
 Key: ARROW-10799
 URL: https://issues.apache.org/jira/browse/ARROW-10799
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Maarten Breddels


 
{code:java}
import pyarrow as pa
a = pa.array(['a'] * 2**26)
c = pa.chunked_array([a] * 2*18)
c.take([0, 1])
{code}
Gives
{noformat}

ArrowInvalid                              Traceback (most recent call last)
<ipython-input> in <module>
----> 1 c.take([0, 1])

~/github/apache/arrow/python/pyarrow/table.pxi in 
pyarrow.lib.ChunkedArray.take()

~/github/apache/arrow/python/pyarrow/compute.py in take(data, indices, 
boundscheck, memory_pool)
421 """
422 options = TakeOptions(boundscheck=boundscheck)
--> 423 return call_function('take', [data, indices], options, memory_pool)
424 
425 

~/github/apache/arrow/python/pyarrow/_compute.pyx in 
pyarrow._compute.call_function()

~/github/apache/arrow/python/pyarrow/_compute.pyx in 
pyarrow._compute.Function.call()

~/github/apache/arrow/python/pyarrow/error.pxi in 
pyarrow.lib.pyarrow_internal_check_status()

~/github/apache/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: offset overflow while concatenating arrays
{noformat}
 

PS: I did not check master, but this was on 3.0.0.dev238+gb0bc9f8d.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10739) [Python] Pickling a sliced array serializes all the buffers

2020-11-25 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-10739:


 Summary: [Python] Pickling a sliced array serializes all the 
buffers
 Key: ARROW-10739
 URL: https://issues.apache.org/jira/browse/ARROW-10739
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Maarten Breddels


If a large array is sliced and pickled, it seems the full buffer is 
serialized. This leads to excessive memory usage and data transfer when using 
multiprocessing or dask.
{code:java}
>>> import pyarrow as pa
>>> ar = pa.array(['foo'] * 100_000)
>>> ar.nbytes
700004
>>> import pickle
>>> len(pickle.dumps(ar.slice(10, 1)))
700165

# NumPy, for instance, only serializes the sliced data:
>>> import numpy as np
>>> ar_np = np.array(ar)
>>> ar_np
array(['foo', 'foo', 'foo', ..., 'foo', 'foo', 'foo'], dtype=object)
>>> len(pickle.dumps(ar_np[10:11]))
165
{code}
I think this makes sense if you know Arrow internals, but it is unexpected as a user.

Is there a workaround for this? For instance, copying an Arrow array to get rid of 
the offset and trimming the buffers?
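
A sketch of one possible workaround, assuming pa.concat_arrays produces compacted buffers for its output:
{code:python}
import pickle
import pyarrow as pa

ar = pa.array(['foo'] * 100_000)
sliced = ar.slice(10, 1)
compact = pa.concat_arrays([sliced])  # copies just the sliced values into fresh buffers
len(pickle.dumps(compact))            # now roughly proportional to the slice
{code}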



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10736) [Python] feather/arrow row splitting and counting (Dataset API)

2020-11-25 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-10736:


 Summary: [Python] feather/arrow row splitting and counting 
(Dataset API)
 Key: ARROW-10736
 URL: https://issues.apache.org/jira/browse/ARROW-10736
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Maarten Breddels


For parquet files using the Dataset API, we have the option to access the row 
groups and count the total number of rows within each. I don't see a way to get 
the number of rows from a dataset of feather/arrow IPC files: a scan without any 
columns does not seem possible, nor is there any method to get the row count.
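
For comparison, a sketch of what works for parquet today (assuming ParquetFileFragment.row_groups exposes num_rows):
{code:python}
import pyarrow.dataset as ds

dataset = ds.dataset('/data/taxi_parquet/', format='parquet')
# Count rows without reading any column data, via row-group metadata.
total_rows = sum(
    row_group.num_rows
    for fragment in dataset.get_fragments()
    for row_group in fragment.row_groups
)
{code}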

Also, if a file consists of chunked arrays, it is exposed as 1 fragment, and it 
is not possible to read only a portion of a file fragment (row slicing), similar 
to how one can work with ParquetFileFragment.split_by_row_group.

I don't know of any other way within Apache Arrow to work with feather/arrow 
IPC files and read only portions of them (e.g. a particular column for rows i to 
j).

Are these features possible some other way? Or are they already planned, or 
possibly difficult to implement?

cheers,

Maarten



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10709) [Python] Difficult to make an efficient zero-copy file reader in Python

2020-11-24 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-10709:


 Summary: [Python] Difficult to make an efficient zero-copy file 
reader in Python
 Key: ARROW-10709
 URL: https://issues.apache.org/jira/browse/ARROW-10709
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Maarten Breddels


There is an option to do efficient data transport using file.read_buffer(), 
with zero memory copies (benchmarking has confirmed this, very nice!).

However, file.read_buffer(), when backed by a Python object (via PythonFile), 
will call PythonFile.read() via PyReadableFile::Read. A 'normal' file.read(), 
which does copy memory, also calls the PythonFile.read() method, but only 
allows a bytes object (PyBytes_Check is used in PyReadableFile::Read).
This makes it hard to create one file object in Python land that supports a normal 
.read() (and thus needs to return a bytes object) and also supports a 
zero-copy route where .read() can return a memory view.
Possibly the strict PyBytes_Check can be lifted by also trying 
PyObject_GetBuffer.
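
As an illustration of the difficulty, a hypothetical file-like object that would satisfy the zero-copy route but is currently rejected by the PyBytes_Check:
{code:python}
class BufferBackedFile:
    """Hypothetical file-like object over an existing buffer."""

    def __init__(self, buf):
        self._view = memoryview(buf)
        self._pos = 0

    def read(self, nbytes=-1):
        if nbytes < 0:
            nbytes = len(self._view) - self._pos
        chunk = self._view[self._pos:self._pos + nbytes]  # zero-copy view
        self._pos += len(chunk)
        return chunk  # a memoryview, not bytes, so PyBytes_Check rejects it
{code}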



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10557) [C++] Add scalar string slicing/substring kernel

2020-11-11 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-10557:


 Summary: [C++] Add scalar string slicing/substring kernel 
 Key: ARROW-10557
 URL: https://issues.apache.org/jira/browse/ARROW-10557
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Maarten Breddels
Assignee: Maarten Breddels


This should implement slicing the scalar string values of string arrays with 
Python semantics, with start, stop, and step arguments. This may seem similar to 
lists or binary arrays, but string length semantics enter into this kernel: the 
length of a string does not need to equal its number of bytes, nor its number of 
codepoints (accents etc. should be skipped).
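
For reference, the Python semantics in question (positions count characters, not bytes):
{code:python}
s = "café"
s[1:3]            # 'af': slicing counts characters
len(s)            # 4 characters
len(s.encode())   # 5 bytes: 'é' takes two bytes in utf8
{code}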

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10556) [C++] Caching pre computed data based on FunctionOptions in the kernel state

2020-11-11 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-10556:


 Summary: [C++] Caching pre computed data based on FunctionOptions 
in the kernel state
 Key: ARROW-10556
 URL: https://issues.apache.org/jira/browse/ARROW-10556
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Maarten Breddels


See discussion here:

[https://github.com/apache/arrow/pull/8621#issuecomment-724796243]

 

A kernel might need to pre-compute something based on the function options 
passed. Since the kernel-to-FunctionOptions mapping is not 1-to-1, it does not 
make sense to store this in the function options object.

Currently, match_substring calculates a `prefix_table` on each Exec call. In 
trim ([https://github.com/apache/arrow/pull/8621]) we compute a vector on 
each Exec call. This should be done only once and cached in the kernel state 
instead.
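
As a language-neutral sketch of the idea (Python, purely illustrative): derive the expensive structure from the options once and reuse it across executions.
{code:python}
from functools import lru_cache

@lru_cache(maxsize=None)
def prefix_table(pattern: str):
    # KMP-style failure function: a stand-in for the table that
    # match_substring currently rebuilds on every Exec call. In Arrow
    # this would live in the kernel state, keyed by the function options.
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return tuple(table)
{code}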

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10541) [C++] Add re2 library to core arrow / ARROW_WITH_RE2

2020-11-10 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-10541:


 Summary: [C++] Add re2 library to core arrow / ARROW_WITH_RE2
 Key: ARROW-10541
 URL: https://issues.apache.org/jira/browse/ARROW-10541
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Maarten Breddels


For https://issues.apache.org/jira/browse/ARROW-10195 we need re2 linked 
into the core arrow library, as discussed in:
[https://github.com/apache/arrow/pull/8459#pullrequestreview-508337720]

It might be good to put this under an ARROW_WITH_RE2 CMake option, perhaps 
defaulting to ON when ARROW_COMPUTE=ON?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10306) [C++] Add string replacement kernel

2020-10-14 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-10306:


 Summary: [C++] Add string replacement kernel 
 Key: ARROW-10306
 URL: https://issues.apache.org/jira/browse/ARROW-10306
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Maarten Breddels
Assignee: Maarten Breddels


Similar to 
[https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html],
 with a plain variant and, optionally, an RE2-based regex version.
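
The pandas semantics to mirror, for reference:
{code:python}
import pandas as pd

s = pd.Series(["foo bar", "bar"])
s.str.replace("bar", "baz", regex=False)   # plain variant
s.str.replace(r"b.r", "baz", regex=True)   # regex variant
{code}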



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10209) [Python] support positional arguments for options in compute wrapper

2020-10-07 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-10209:


 Summary: [Python] support positional arguments for options in 
compute wrapper
 Key: ARROW-10209
 URL: https://issues.apache.org/jira/browse/ARROW-10209
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Maarten Breddels


As mentioned here:

[https://github.com/apache/arrow/pull/8271#discussion_r500897047]

we cannot support
{code:java}
pc.split_pattern(arr, "---")
{code}
where the second argument is a positional argument backed by the FunctionOptions class.

I think it makes sense for a small subset of functions (like this one) to support 
non-keyword arguments.
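
A rough sketch of what the wrapper could do (hypothetical; assumes SplitPatternOptions accepts the pattern and max_splits this way):
{code:python}
import pyarrow.compute as pc

def split_pattern(strings, pattern, *, max_splits=None):
    # Lift the positional `pattern` into the options object.
    options = pc.SplitPatternOptions(pattern=pattern, max_splits=max_splits)
    return pc.call_function("split_pattern", [strings], options)
{code}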



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10208) [C++] comparing list arrays with nulls fails in test framework

2020-10-07 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-10208:


 Summary: [C++] comparing list arrays with nulls fails in test 
framework
 Key: ARROW-10208
 URL: https://issues.apache.org/jira/browse/ARROW-10208
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Maarten Breddels


I am not sure if this is a test-framework issue or valid behavior, but I ran 
into it while writing a test in [https://github.com/apache/arrow/pull/8271].

The following test fails:
{code:java}
this->CheckUnary("split_pattern", R"(["foo bar", "foo", null])", 
list(this->type()),  //  R"([["foo", "bar"], ["foo"], null])", 
);
{code}
with the following output
{code:java}
Failed:
Got: 
  [
[
  [
"foo",
"bar"
  ]
],
[
  [
"foo"
  ],
  null
]
  ]
Expected: 
  [
[
  [
"foo",
"bar"
  ]
],
[
  [
"foo"
  ],
  null
]
  ]
{code}
While the printed outputs are identical, the arrays are reported as unequal.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10207) [C++] Unary kernels that result in a list have no preallocated offset buffer

2020-10-07 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-10207:


 Summary: [C++] Unary kernels that result in a list have no 
preallocated offset buffer
 Key: ARROW-10207
 URL: https://issues.apache.org/jira/browse/ARROW-10207
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Maarten Breddels


I noticed in

[https://github.com/apache/arrow/pull/8271]

that a string->list[string] kernel does not have its offsets preallocated in 
the output. I believe there is a preference for not doing allocations in 
kernels, so that allocation can be optimized at a higher level. I think that can 
also be done in this case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-10195) [C++] Add string struct extract kernel using re2

2020-10-06 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-10195:


 Summary: [C++] Add string struct extract kernel using re2
 Key: ARROW-10195
 URL: https://issues.apache.org/jira/browse/ARROW-10195
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Maarten Breddels
Assignee: Maarten Breddels


Similar to pandas' str.extract: a way to convert a string array to a struct of 
strings using the re2 regex library (with named capture groups).
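
The pandas behavior to mirror, for reference (named groups become fields/columns):
{code:python}
import pandas as pd

s = pd.Series(["2020-10-06"])
s.str.extract(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
#    year month day
# 0  2020    10  06
{code}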



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9991) [C++] split kernels for strings/binary

2020-09-14 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-9991:
---

 Summary: [C++] split kernels for strings/binary
 Key: ARROW-9991
 URL: https://issues.apache.org/jira/browse/ARROW-9991
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Maarten Breddels
Assignee: Maarten Breddels


Similar to Python str.split and bytes.split, we'd like to have a way to convert 
str into list[str] (and similarly for bytes).

When a separator is given, the algorithms for both types are the same. 
Python, however, overloads split: when given no separator, the algorithm 
splits on runs of whitespace (unicode for str, ascii for bytes).

I'd rather not see heavily overloaded kernels, e.g.:
 # binary_split (takes a string/binary separator and a maxsplit argument; no special utf8 version needed)
 # utf8_split_whitespace (similar to Python's version when given no separator)
 # ascii_split_whitespace (same, but considering ascii whitespace only)
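
For reference, the Python overloading in question:
{code:python}
>>> "a  b\tc".split()      # no separator: split on runs of whitespace
['a', 'b', 'c']
>>> "a  b".split(" ")      # explicit separator: empty strings are kept
['a', '', 'b']
{code}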



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9471) [C++] Scan Dataset in reverse

2020-07-14 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-9471:
---

 Summary: [C++] Scan Dataset in reverse
 Key: ARROW-9471
 URL: https://issues.apache.org/jira/browse/ARROW-9471
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Maarten Breddels


If a dataset does not fit into the OS cache, it can be beneficial to alternate 
between normal and reverse 'scanning'. Even if 90% of a set of files fits 
into the cache, scanning the same set twice in the same order will not make use 
of the OS cache. On the other hand, if the second scan goes in reverse order, 
90% will still be in the OS cache. We use this trick in vaex, and I'd like to 
support it for parquet reading as well. (Is there a proper name/term for this?)

Note that since you don't want to reverse at the byte level, you may want to 
reverse the order of traversing fragments, or fragments and row groups. Chunks 
that are too small (e.g. pages) could decrease performance, because most read 
paths implement read-ahead optimization (not the reverse). Doing this at the 
fragment level might be enough, as sketched below.
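
A minimal sketch of the idea at the fragment level (hypothetical helper; `process` stands in for whatever consumes the data):
{code:python}
def scan_passes(dataset, n_passes, process):
    # Alternate direction between passes so the tail of the previous pass
    # is still hot in the OS page cache.
    fragments = list(dataset.get_fragments())
    for pass_number in range(n_passes):
        ordered = fragments if pass_number % 2 == 0 else list(reversed(fragments))
        for fragment in ordered:
            process(fragment.to_table())
{code}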



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9458) [Python] Dataset singlethreaded only

2020-07-14 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-9458:
---

 Summary: [Python] Dataset singlethreaded only
 Key: ARROW-9458
 URL: https://issues.apache.org/jira/browse/ARROW-9458
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Maarten Breddels


I'm not sure if this is a misunderstanding, a compilation issue (flags?), or an 
issue in the C++ layer.

I have 1000 parquet files with a total of 1 billion rows (1 million rows per 
file, ~20 columns). I wanted to see if I could go through all rows of 1 or 2 
columns efficiently (the vaex use case).

 
{code:java}
import pyarrow.parquet
import pyarrow as pa
import pyarrow.dataset as ds
import glob
ds = pa.dataset.dataset(glob.glob('/data/taxi_parquet/data_*.parquet'))
scanned = 0
for scan_task in ds.scan(batch_size=1_000_000, columns=['passenger_count'],
                         use_threads=True):
    for record_batch in scan_task.execute():
        scanned += record_batch.num_rows
scanned
{code}
This only seems to use 1 cpu.

Using a threadpool from Python:
{code:java}
# %%timeit
import concurrent.futures
pool = concurrent.futures.ThreadPoolExecutor()
ds = pa.dataset.dataset(glob.glob('/data/taxi_parquet/data_*.parquet'))
def process(scan_task):
    scan_count = 0
    for record_batch in scan_task.execute():
        scan_count += len(record_batch)
    return scan_count
sum(pool.map(process, ds.scan(batch_size=1_000_000,
                              columns=['passenger_count'], use_threads=False)))
{code}
This gives me similar performance; again, only 100% CPU usage (= 1 core).

py-spy (a profiler for Python) shows no GIL contention, so this might be 
something at the C++ layer.

Am I 'holding it wrong', or could this be a bug? Note that IO speed is not a 
problem on this system (it actually all comes from the OS cache; no disk reads 
observed).

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9456) [Python] Dataset segfault when not importing pyarrow.parquet

2020-07-14 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-9456:
---

 Summary: [Python] Dataset segfault when not importing 
pyarrow.parquet 
 Key: ARROW-9456
 URL: https://issues.apache.org/jira/browse/ARROW-9456
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Maarten Breddels


To reproduce:
{code:python}
# import pyarrow.parquet  # if we skip this...
import pyarrow as pa
import pyarrow.dataset as ds
import glob
ds = pa.dataset.dataset('/data/taxi_parquet/data_0.parquet')
ds.to_table()  # this will crash
{code}
{noformat}
$ python pyarrow/crash.py
terminate called after throwing an instance of 'parquet::ParquetException'
  what(): The file only has 19 columns, requested metadata for column: 1049198736
[1] 1559395 abort (core dumped)  python pyarrow/crash.py
{noformat}
When the import is there, it works fine.
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9403) [Python] add .tolist as alias of to_pylist

2020-07-10 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-9403:
---

 Summary: [Python] add .tolist as alias of to_pylist
 Key: ARROW-9403
 URL: https://issues.apache.org/jira/browse/ARROW-9403
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Maarten Breddels
Assignee: Maarten Breddels


As discussed on the mailing list, it helps when writing library-agnostic code 
(NumPy/pyarrow) if arrays support .tolist() as an alias of .to_pylist().
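
A small illustration of the library-agnostic use case:
{code:python}
import numpy as np

def first_three(values):
    # Works for NumPy arrays today; with a .tolist() alias it would
    # work unchanged for pyarrow arrays as well.
    return values.tolist()[:3]

first_three(np.array([1, 2, 3, 4]))  # [1, 2, 3]
{code}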



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9268) [C++] Add is{alnum,alpha,...} kernels for strings

2020-06-29 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-9268:
---

 Summary: [C++] Add is{alnum,alpha,...} kernels for strings
 Key: ARROW-9268
 URL: https://issues.apache.org/jira/browse/ARROW-9268
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Maarten Breddels
Assignee: Maarten Breddels


A good list of kernels to have would be str->bool kernels, similar to

[https://docs.python.org/3/library/stdtypes.html#str.isalnum] and friends.

I think all but `isidentifier` make sense to have. The semantics of the Python 
functions seem quite reasonable to adopt in Arrow, but maybe others can give 
feedback on whether this is a complete/reasonable list.

I am not sure if we need more (or fewer) functions, or whether we want more 
atomic functions, e.g. a test for membership in a Unicode category.
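
The Python semantics are not always obvious, e.g.:
{code:python}
>>> "abc123".isalnum()
True
>>> "³".isdigit(), "³".isdecimal()   # superscript three is a digit, not a decimal
(True, False)
{code}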



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9133) [C++] Add utf8_upper and utf8_lower

2020-06-15 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-9133:
---

 Summary: [C++] Add utf8_upper and utf8_lower
 Key: ARROW-9133
 URL: https://issues.apache.org/jira/browse/ARROW-9133
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Maarten Breddels


This is the equivalent of https://issues.apache.org/jira/browse/ARROW-9100 for 
utf8. This will be a good test for unilib vs utf8proc, performance- and API-wise.

Also, since Unicode strings can grow and shrink under case mapping, this is a 
good point to think about a strategy for memory allocation.

How much can a 'string' (or byte sequence) actually grow in length?

Item 5.18 mentions that a string can expand by a factor of 3, by which they 
seem to mean 3 codepoints. This can be validated with Python:
{code:python}
for i in range(0x100, 0x110000):
    codepoint = chr(i)
    try:
        bytes_before = codepoint.encode()
    except UnicodeEncodeError:
        continue
    bytes_after = codepoint.upper().encode()
    if len(bytes_before) != len(bytes_after):
        print(i, hex(i), codepoint, codepoint.upper(), len(bytes_before),
              len(bytes_after))

912 0x390 ΐ Ϊ́ 2 6
...
{code}
showing that a two-byte codepoint can expand to 3 two-byte codepoints (2 bytes 
=> 6 bytes). The character Ϊ́ has no single precomposed capital form, so it is 
composed of a single base character and two combining characters. However, there 
are more situations, explained in 
[https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt].

This growth factor of 3 is used in CPython 
[https://github.com/python/cpython/blob/25f38d7044a3a47465edd851c4e04f337b2c4b9b/Objects/unicodeobject.c#L10058],
 which is an easy solution that avoids having to grow the buffer dynamically.

However, growing 3x in size seems at odds with the APIs of both utf8proc:

[https://github.com/JuliaStrings/utf8proc/blob/08fa0698639f15d07b12c0065a4494f2d504/utf8proc.c#L375]

and unilib:

[https://github.com/ufal/unilib/blob/d8276e70b7c11c677897f71030de7258cbb1f99e/unilib/unicode.h#L79]

which can only return a single 32-bit value (thus 1 codepoint, encoding 1 
character). Both libraries seem to ignore the special cases of case mapping 
(neither uses/downloads SpecialCasing.txt).

This means that if Arrow wants to support the same features as Python regarding 
upper- and lowercasing (i.e. really implementing the Unicode standard), neither 
library is sufficient.

There are more edge cases/irregularities, but I propose starting with versions 
of utf8_lower and utf8_upper that ignore the special cases.

 

PS:

Another interesting finding is that although uppercasing can increase the byte 
length by a factor of 3, lowercasing a utf8 string will only increase the byte 
length by a factor of 3/2 at most.
{code:python}
for i in range(0x100, 0x110000):
    codepoint = chr(i)
    try:
        bytes_before = codepoint.encode()
    except UnicodeEncodeError:
        continue
    bytes_after = codepoint.lower().encode()
    if len(bytes_before) != len(bytes_after):
        print(i, hex(i), codepoint, codepoint.lower(), len(bytes_before),
              len(bytes_after))

304 0x130 İ i̇ 2 3
570 0x23a Ⱥ ⱥ 2 3
574 0x23e Ⱦ ⱦ 2 3
7838 0x1e9e ẞ ß 3 2
8486 0x2126 Ω ω 3 2
8490 0x212a K k 3 1
8491 0x212b Å å 3 2
11362 0x2c62 Ɫ ɫ 3 2
11364 0x2c64 Ɽ ɽ 3 2
11373 0x2c6d Ɑ ɑ 3 2
11374 0x2c6e Ɱ ɱ 3 2
11375 0x2c6f Ɐ ɐ 3 2
11376 0x2c70 Ɒ ɒ 3 2
11390 0x2c7e Ȿ ȿ 3 2
11391 0x2c7f Ɀ ɀ 3 2
42893 0xa78d Ɥ ɥ 3 2
42922 0xa7aa Ɦ ɦ 3 2
42923 0xa7ab Ɜ ɜ 3 2
42924 0xa7ac Ɡ ɡ 3 2
42925 0xa7ad Ɬ ɬ 3 2
42926 0xa7ae Ɪ ɪ 3 2
42928 0xa7b0 Ʞ ʞ 3 2
42929 0xa7b1 Ʇ ʇ 3 2
42930 0xa7b2 Ʝ ʝ 3 2
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9131) [C++] Faster ascii_lower and ascii_upper

2020-06-15 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-9131:
---

 Summary: [C++] Faster ascii_lower and ascii_upper
 Key: ARROW-9131
 URL: https://issues.apache.org/jira/browse/ARROW-9131
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Maarten Breddels


The current version uses a lookup table for the case conversion. Judging from 
[http://quick-bench.com/JaDErmVCY23Z1tu6YZns_KBt0qU], a range check plus +/-32 
is ~5x faster (4.6x for clang 9, 6.4x for GCC 9.2).
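
A sketch of the faster approach in Python terms (the C++ kernel would do the equivalent per byte):
{code:python}
def ascii_upper_byte(c: int) -> int:
    # Range check plus +/-32 instead of a 256-entry lookup table:
    # ASCII lowercase letters sit exactly 32 above their uppercase forms.
    return c - 32 if 0x61 <= c <= 0x7a else c

assert bytes(map(ascii_upper_byte, b"hello, World!")) == b"HELLO, WORLD!"
{code}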

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9100) Add ascii_lower kernel

2020-06-11 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-9100:
---

 Summary: Add ascii_lower kernel
 Key: ARROW-9100
 URL: https://issues.apache.org/jira/browse/ARROW-9100
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Maarten Breddels






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8990) [C++] Benchmark hash table against thirdparty options, possibly vendor a thirdparty hash table library

2020-06-01 Thread Maarten Breddels (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121205#comment-17121205
 ] 

Maarten Breddels commented on ARROW-8990:
-

FYI, I've been using that library and 
[https://github.com/skarupke/flat_hash_map] in Vaex. After some benchmarking I 
settled on the tsl one, but my research/benchmarks weren't very thorough, 
because the idea was that I could easily switch if needed. Since the 
performance was great, I never looked back, so I'd be interested in 
the benchmark results.

By the same author, the [https://github.com/Tessil/hat-trie] library can also 
be very interesting to take a look at.

> [C++] Benchmark hash table against thirdparty options, possibly vendor a 
> thirdparty hash table library
> --
>
> Key: ARROW-8990
> URL: https://issues.apache.org/jira/browse/ARROW-8990
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>
> While we have our own hash table implementation, it would be worthwhile to 
> set up some benchmarks so that we can compare against std::unordered_map and 
> some other thirdparty libraries for hash tables to know whether we should 
> possibly use a thirdparty library. See e.g.
> https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html
> Libraries to consider: 
> * https://github.com/sparsehash/sparsehash



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library

2020-05-28 Thread Maarten Breddels (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118713#comment-17118713
 ] 

Maarten Breddels commented on ARROW-8961:
-

FWIW, in Vaex I've relied on [https://github.com/ufal/unilib], which is a very 
minimal/barebones library. I have no strong opinions about this though (unless 
benchmarks tell me otherwise).

> [C++] Vendor utf8proc library
> -
>
> Key: ARROW-8961
> URL: https://issues.apache.org/jira/browse/ARROW-8961
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 1.0.0
>
>
> This is a minimal MIT-licensed library for UTF-8 data processing originally 
> developed for use in Julia
> https://github.com/JuliaStrings/utf8proc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-555) [C++] String algorithm library for StringArray/BinaryArray

2020-05-22 Thread Maarten Breddels (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114105#comment-17114105
 ] 

Maarten Breddels commented on ARROW-555:


Sounds good. I think it would help me a lot to see a str->scalar and a str->str 
(and possibly a str->[str, str]) example. They can be trivial, like always 
returning ["a", "b"], but with that I can probably get up to speed very quickly, 
if it's not too much to ask.

> [C++] String algorithm library for StringArray/BinaryArray
> --
>
> Key: ARROW-555
> URL: https://issues.apache.org/jira/browse/ARROW-555
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
>
> This is a parent JIRA for starting a module for processing strings in-memory 
> arranged in Arrow format. This will include using the re2 C++ regular 
> expression library and other standard string manipulations (such as those 
> found on Python's string objects)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-8865) Windows distribution for 0.17.1 seems broken (conda only)

2020-05-20 Thread Maarten Breddels (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112105#comment-17112105
 ] 

Maarten Breddels commented on ARROW-8865:
-

Thanks Joris. We got CI working by installing from PyPI in the meantime. Feel 
free to close this if you don't think it belongs here.

> Windows distribution for 0.17.1 seems broken (conda only)
> -
>
> Key: ARROW-8865
> URL: https://issues.apache.org/jira/browse/ARROW-8865
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1
>Reporter: Maarten Breddels
>Priority: Major
>
> We just started seeing issues with importing pyarrow on our CI:
> [https://github.com/vaexio/vaex/pull/749/checks?check_run_id=689857401]
> Long logs, the issue appears here:
> > import pyarrow._parquet as _parquet 
> [2541|https://github.com/vaexio/vaex/pull/749/checks?check_run_id=689857401#step:15:2541]E
>  ImportError: DLL load failed: The specified procedure could not be found.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8865) Windows distribution for 0.17.1 seems broken (conda only)

2020-05-19 Thread Maarten Breddels (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maarten Breddels updated ARROW-8865:

Summary: Windows distribution for 0.17.1 seems broken (conda only)  (was: 
windows distribution for 0.17.1 seems broken (conda only?))

> Windows distribution for 0.17.1 seems broken (conda only)
> -
>
> Key: ARROW-8865
> URL: https://issues.apache.org/jira/browse/ARROW-8865
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.17.1
>Reporter: Maarten Breddels
>Priority: Major
>
> We just started seeing issues with importing pyarrow on our CI:
> [https://github.com/vaexio/vaex/pull/749/checks?check_run_id=689857401]
> Long logs, the issue appears here:
> > import pyarrow._parquet as _parquet 
> [2541|https://github.com/vaexio/vaex/pull/749/checks?check_run_id=689857401#step:15:2541]E
>  ImportError: DLL load failed: The specified procedure could not be found.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8865) windows distribution for 0.17.1 seems broken (conda only?

2020-05-19 Thread Maarten Breddels (Jira)
Maarten Breddels created ARROW-8865:
---

 Summary: windows distribution for 0.17.1 seems broken (conda only?
 Key: ARROW-8865
 URL: https://issues.apache.org/jira/browse/ARROW-8865
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.17.1
Reporter: Maarten Breddels


We just started seeing issues with importing pyarrow on our CI:

[https://github.com/vaexio/vaex/pull/749/checks?check_run_id=689857401]

Long logs; the issue appears at line 
[2541|https://github.com/vaexio/vaex/pull/749/checks?check_run_id=689857401#step:15:2541]:

> import pyarrow._parquet as _parquet
E ImportError: DLL load failed: The specified procedure could not be found.
 
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-555) [C++] String algorithm library for StringArray/BinaryArray

2020-05-11 Thread Maarten Breddels (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104666#comment-17104666
 ] 

Maarten Breddels commented on ARROW-555:


Something to consider (or should I move this discussion to the list?) is the 
support of ASCII vs utf8. I noticed the Gandiva code assumed ASCII (at least 
not utf8), while Arrow assumes strings are utf8 only. Having written the vaex 
string code, I'm pretty sure ASCII will be much faster (you know the 
byte length of a string in advance). Is there interest in supporting more than 
utf8, e.g. ASCII, or utf16/32? Or should it be utf8 only?

> [C++] String algorithm library for StringArray/BinaryArray
> --
>
> Key: ARROW-555
> URL: https://issues.apache.org/jira/browse/ARROW-555
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
>
> This is a parent JIRA for starting a module for processing strings in-memory 
> arranged in Arrow format. This will include using the re2 C++ regular 
> expression library and other standard string manipulations (such as those 
> found on Python's string objects)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-555) [C++] String algorithm library for StringArray/BinaryArray

2020-05-11 Thread Maarten Breddels (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104525#comment-17104525
 ] 

Maarten Breddels commented on ARROW-555:


I am likely to be able to start working on strings in Arrow this month, so I 
think the timing is good. Some pointers/examples to get me started would be 
great.

> [C++] String algorithm library for StringArray/BinaryArray
> --
>
> Key: ARROW-555
> URL: https://issues.apache.org/jira/browse/ARROW-555
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
>
> This is a parent JIRA for starting a module for processing strings in-memory 
> arranged in Arrow format. This will include using the re2 C++ regular 
> expression library and other standard string manipulations (such as those 
> found on Python's string objects)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-555) [C++] String algorithm library for StringArray/BinaryArray

2020-03-04 Thread Maarten Breddels (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051644#comment-17051644
 ] 

Maarten Breddels commented on ARROW-555:


What are the limitations, and are they documented somewhere? It might be good to 
keep those in mind.

> [C++] String algorithm library for StringArray/BinaryArray
> --
>
> Key: ARROW-555
> URL: https://issues.apache.org/jira/browse/ARROW-555
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
>
> This is a parent JIRA for starting a module for processing strings in-memory 
> arranged in Arrow format. This will include using the re2 C++ regular 
> expression library and other standard string manipulations (such as those 
> found on Python's string objects)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-555) [C++] String algorithm library for StringArray/BinaryArray

2020-03-04 Thread Maarten Breddels (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051264#comment-17051264
 ] 

Maarten Breddels commented on ARROW-555:


Related: https://issues.apache.org/jira/browse/ARROW-7083

I will probably start working on this a few weeks from now. My initial 
intention is to separate the algorithms as much as possible, so they can be 
added both to Gandiva and as 'bare' kernels with a minimal amount of 
refactoring.

[~wesm]: what's your reason for choosing re2? Gandiva and vaex both use pcre, 
but I have no strong preference (other than being a bit familiar with pcre).

 

> [C++] String algorithm library for StringArray/BinaryArray
> --
>
> Key: ARROW-555
> URL: https://issues.apache.org/jira/browse/ARROW-555
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
>
> This is a parent JIRA for starting a module for processing strings in-memory 
> arranged in Arrow format. This will include using the re2 C++ regular 
> expression library and other standard string manipulations (such as those 
> found on Python's string objects)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7396) [Format] Register media types (MIME types) for Apache Arrow formats to IANA

2019-12-17 Thread Maarten Breddels (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16998111#comment-16998111
 ] 

Maarten Breddels commented on ARROW-7396:
-

According to [https://en.wikipedia.org/wiki/Media_type]
_(about using x. or x-):_ _Media types in this tree cannot be registered. 
According to RFC 6838 (published in January 2013), any use of types in the 
unregistered tree is strongly discouraged. In addition, subtypes prefixed with 
{{x-}} or {{X-}} are no longer considered to be members of this tree._

This refers to [https://tools.ietf.org/html/rfc6838]

It seems to me that registering with a vnd prefix is more likely to be accepted 
at [https://www.iana.org/form/media-types]:
 * application/vnd.apache.arrow.file
 * application/vnd.apache.arrow.stream

Possibly with an optional parameter for a version?

I have to serialize Apache Arrow tables in JSON files and want to store the 
MIME type with them, hence my interest.

> [Format] Register media types (MIME types) for Apache Arrow formats to IANA
> ---
>
> Key: ARROW-7396
> URL: https://issues.apache.org/jira/browse/ARROW-7396
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Kouhei Sutou
>Priority: Major
>
> See "MIME types" thread for details: 
> https://lists.apache.org/thread.html/b15726d0c0da2223ba1b45a226ef86263f688b20532a30535cd5e267%40%3Cdev.arrow.apache.org%3E
> Summary:
>   * If we don't register our media types for Apache Arrow formats (IPC File 
> Format and IPC Streaming Format) to IANA, we should use "x-" prefix such as 
> "application/x-apache-arrow-file".
>   * It may be better that we reuse the same manner as Apache Thrift. Apache 
> Thrift registers their media types as "application/vnd.apache.thrift.XXX". If 
> we use the same manner as Apache Thrift, we will use 
> "application/vnd.apache.arrow.file" or something.
> TODO:
>   * Decide which media types should we register. (Do we need vote?)
>   * Register our media types to IANA.
>   ** Media types page: 
> https://www.iana.org/assignments/media-types/media-types.xhtml
>   ** Application form for new media types: 
> https://www.iana.org/form/media-types



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-4810) [Format][C++] Add "LargeList" type with 64-bit offsets

2019-03-08 Thread Maarten Breddels (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16788160#comment-16788160
 ] 

Maarten Breddels commented on ARROW-4810:
-

I see the BinaryArray/StringArray classes have a similar implementation; does 
the same hold for them, i.e. create Large(Binary/String)Array classes?

> [Format][C++] Add "LargeList" type with 64-bit offsets
> --
>
> Key: ARROW-4810
> URL: https://issues.apache.org/jira/browse/ARROW-4810
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Format
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> Mentioned in https://github.com/apache/arrow/issues/3845



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4810) [Format][C++] Add "LargeList" type with 64-bit offsets

2019-03-08 Thread Maarten Breddels (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16788150#comment-16788150
 ] 

Maarten Breddels commented on ARROW-4810:
-

> Having arrays with > 2GB elements or binary arrays with > 2GB of data would 
> be considered an anti-pattern in the context of database systems, regardless 
> of whether the offsets are 32- or 64-bit. So in light of these it doesn't 
> make sense to have the default type be 64-bit capable if this capability is 
> seldom used

I agree it's not the best idea, but people will find a reason to do it, and 
since there will not be a straightforward workaround, it may spin off another 
'standard' :)

But since allowing/supporting it would solve both issues (>2GB elements, and 
less code complexity), I thought I would mention that as well.

 

As for the implementation, are you thinking of a new class (apart from 
ListArray), or does it seem feasible to parameterize the type of the value_offsets?

> [Format][C++] Add "LargeList" type with 64-bit offsets
> --
>
> Key: ARROW-4810
> URL: https://issues.apache.org/jira/browse/ARROW-4810
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Format
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.14.0
>
>
> Mentioned in https://github.com/apache/arrow/issues/3845



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3685) [Python] Use fixed size binary for NumPy fixed-size string dtypes

2018-11-01 Thread Maarten Breddels (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16671614#comment-16671614
 ] 

Maarten Breddels commented on ARROW-3685:
-

I tried to make a PR, but it opens a whole can of worms, so maybe this part 
should stay vaex-specific, or perhaps go into the docs.

> [Python] Use fixed size binary for NumPy fixed-size string dtypes
> -
>
> Key: ARROW-3685
> URL: https://issues.apache.org/jira/browse/ARROW-3685
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: Maarten Breddels
>Priority: Major
>
> I'm working on getting support for arrow in vaex (out of core dataframe 
> library for Python) in this PR:
> [https://github.com/maartenbreddels/vaex/pull/116]
> And I fixed length binary arrays for numpy (say dtype='S42') will be 
> converted to a non-fixed length array. Trying to convert that back to numpy 
> will fail, since there is no such conversion.
> It makes more sense to convert dtype='S42', to an arrow array with 
> pyarrow.binary(42) type. As I do in:
> https://github.com/maartenbreddels/vaex/blob/4b4facb64fea9f83593ce0f0b82fc26ddf96b506/packages/vaex-arrow/vaex_arrow/convert.py#L4



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3686) Support for masked arrays in to/from numpy

2018-11-01 Thread Maarten Breddels (JIRA)
Maarten Breddels created ARROW-3686:
---

 Summary: Support for masked arrays in to/from numpy
 Key: ARROW-3686
 URL: https://issues.apache.org/jira/browse/ARROW-3686
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.11.1
Reporter: Maarten Breddels


Again, in this PR for vaex ([https://github.com/maartenbreddels/vaex/pull/116]) 
I support masked arrays; it would be nice if this went into pyarrow. If the 
approach looks good, I could do a PR.
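
For illustration, the manual route (assuming pa.array accepts a mask argument, which the conversion would presumably automate for np.ma arrays):
{code:python}
import numpy as np
import pyarrow as pa

masked = np.ma.MaskedArray([1.0, 2.0, 3.0], mask=[False, True, False])
arr = pa.array(masked.data, mask=masked.mask)  # mask becomes the validity bitmap
arr.to_pylist()  # [1.0, None, 3.0]
{code}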



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3685) [Python] Use fixed size binary for NumPy fixed-size string dtypes

2018-11-01 Thread Maarten Breddels (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16671515#comment-16671515
 ] 

Maarten Breddels commented on ARROW-3685:
-

Would you say this needs a change in to_pandas_dtype, or should it be an 
exception for numpy?

> [Python] Use fixed size binary for NumPy fixed-size string dtypes
> -
>
> Key: ARROW-3685
> URL: https://issues.apache.org/jira/browse/ARROW-3685
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: Maarten Breddels
>Priority: Major
>
> I'm working on getting support for arrow in vaex (out of core dataframe 
> library for Python) in this PR:
> [https://github.com/maartenbreddels/vaex/pull/116]
> And I fixed length binary arrays for numpy (say dtype='S42') will be 
> converted to a non-fixed length array. Trying to convert that back to numpy 
> will fail, since there is no such conversion.
> It makes more sense to convert dtype='S42', to an arrow array with 
> pyarrow.binary(42) type. As I do in:
> https://github.com/maartenbreddels/vaex/blob/4b4facb64fea9f83593ce0f0b82fc26ddf96b506/packages/vaex-arrow/vaex_arrow/convert.py#L4



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3685) Better roundtrip between numpy and arrow binary array

2018-11-01 Thread Maarten Breddels (JIRA)
Maarten Breddels created ARROW-3685:
---

 Summary: Better roundtrip between numpy and arrow binary array
 Key: ARROW-3685
 URL: https://issues.apache.org/jira/browse/ARROW-3685
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.11.1
Reporter: Maarten Breddels


I'm working on getting support for arrow in vaex (out of core dataframe library 
for Python) in this PR:
[https://github.com/maartenbreddels/vaex/pull/116]
I noticed that fixed-length binary arrays from numpy (say dtype='S42') will be 
converted to a non-fixed-length array. Trying to convert that back to numpy will 
fail, since there is no such conversion.

It makes more sense to convert dtype='S42' to an arrow array with 
pyarrow.binary(42) type, as I do in:
https://github.com/maartenbreddels/vaex/blob/4b4facb64fea9f83593ce0f0b82fc26ddf96b506/packages/vaex-arrow/vaex_arrow/convert.py#L4
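
The proposed conversion, sketched by hand (pa.binary(n) denotes the fixed-size binary type):
{code:python}
import numpy as np
import pyarrow as pa

np_arr = np.array([b'foo', b'bar'], dtype='S3')
# Proposed: dtype='S3' maps to pa.binary(3) so the round trip is lossless.
arr = pa.array(np_arr.tolist(), type=pa.binary(3))
arr.type  # fixed_size_binary[3]
{code}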



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3669) pyarrow swallows big endian arrow without converting or error msg

2018-11-01 Thread Maarten Breddels (JIRA)
Maarten Breddels created ARROW-3669:
---

 Summary: pyarrow swallows big endian arrow without converting or 
error msg
 Key: ARROW-3669
 URL: https://issues.apache.org/jira/browse/ARROW-3669
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.11.1
Reporter: Maarten Breddels
 Attachments: Screen Shot 2018-11-01 at 09.10.48.png

I've been playing around getting vaex to support arrow, and it's been going 
really well, except for some corner cases.

I expect

 
{code:java}
import numpy as np
import pyarrow as pa
np_array = np.arange(10, dtype='>f8')
pa.array(np_array)

{code}
to give an error or show proper values; instead I get:

!Screen Shot 2018-11-01 at 09.10.48.png!

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)