[jira] [Created] (ARROW-13546) [Python] Breaking API change in FSSpecHandler, requires metadata argument
Maarten Breddels created ARROW-13546: Summary: [Python] Breaking API change in FSSpecHandler, requires metadata argument Key: ARROW-13546 URL: https://issues.apache.org/jira/browse/ARROW-13546 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Maarten Breddels

[https://github.com/apache/arrow/pull/10295] introduced a required metadata argument to FSSpecHandler.open_output_stream. We noticed this in our CI/test suite at [https://github.com/vaexio/vaex/pull/1490]:

{code:java}
    def create():
>       return fs.open_output_stream(path)
E       TypeError: open_output_stream() missing 1 required positional argument: 'metadata'
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
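A backward-compatible alternative would have been to give the new argument a default value. A minimal sketch of that pattern (toy classes, not the actual pyarrow code):

```python
class FSSpecHandler:
    """Toy stand-in for pyarrow's fsspec handler, illustrating the signature only."""

    def __init__(self, fs):
        self.fs = fs

    def open_output_stream(self, path, metadata=None):
        # defaulting the new argument keeps pre-existing call sites working
        return self.fs.open(path, mode="wb")


class DummyFS:
    """Hypothetical filesystem that just records the call."""

    def open(self, path, mode):
        return (path, mode)


handler = FSSpecHandler(DummyFS())
# the old single-argument call style still works
assert handler.open_output_stream("out.parquet") == ("out.parquet", "wb")
```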
[jira] [Created] (ARROW-10959) [C++] Add scalar string join kernel
Maarten Breddels created ARROW-10959: Summary: [C++] Add scalar string join kernel Key: ARROW-10959 URL: https://issues.apache.org/jira/browse/ARROW-10959 Project: Apache Arrow Issue Type: New Feature Components: C++, Python Reporter: Maarten Breddels Similar to Python's str.join -- This message was sent by Atlassian Jira (v8.3.4#803005)
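For reference, the Python semantics being mirrored (a kernel would additionally have to decide how nulls propagate):

```python
# scalar join: separator between the elements of one list
parts = ["2020", "12", "23"]
assert "-".join(parts) == "2020-12-23"

# applied element-wise over a list-of-strings column, the way a scalar kernel would
column = [["a", "b"], ["c"], []]
assert ["/".join(row) for row in column] == ["a/b", "c", ""]
```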
[jira] [Created] (ARROW-10799) [C++] Take on string chunked arrays slow and fails
Maarten Breddels created ARROW-10799: Summary: [C++] Take on string chunked arrays slow and fails Key: ARROW-10799 URL: https://issues.apache.org/jira/browse/ARROW-10799 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Maarten Breddels

{code:java}
import pyarrow as pa
a = pa.array(['a'] * 2**26)
c = pa.chunked_array([a] * 2*18)
c.take([0, 1])
{code}

Gives:

{noformat}
ArrowInvalid                              Traceback (most recent call last)
in
----> 1 c.take([0, 1])

~/github/apache/arrow/python/pyarrow/table.pxi in pyarrow.lib.ChunkedArray.take()
~/github/apache/arrow/python/pyarrow/compute.py in take(data, indices, boundscheck, memory_pool)
    421     """
    422     options = TakeOptions(boundscheck=boundscheck)
--> 423     return call_function('take', [data, indices], options, memory_pool)
    424
    425
~/github/apache/arrow/python/pyarrow/_compute.pyx in pyarrow._compute.call_function()
~/github/apache/arrow/python/pyarrow/_compute.pyx in pyarrow._compute.Function.call()
~/github/apache/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
~/github/apache/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: offset overflow while concatenating arrays
{noformat}

PS: did not check master, but this was on 3.0.0.dev238+gb0bc9f8d

-- This message was sent by Atlassian Jira (v8.3.4#803005)
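The sizes involved explain the error: string arrays use int32 offsets, and the concatenation that take apparently performs (per the error message) exceeds that range. A rough check, assuming one byte per string as in the reproduction:

```python
chunk_len = 2 ** 26      # strings per chunk, 1 byte each
n_chunks = 2 * 18        # as written in the reproduction above
total_bytes = chunk_len * n_chunks

# the concatenated string data is larger than the maximum int32 offset
assert total_bytes == 2_415_919_104
assert total_bytes > 2**31 - 1
```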
[jira] [Created] (ARROW-10739) [Python] Pickling a sliced array serializes all the buffers
Maarten Breddels created ARROW-10739: Summary: [Python] Pickling a sliced array serializes all the buffers Key: ARROW-10739 URL: https://issues.apache.org/jira/browse/ARROW-10739 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Maarten Breddels

If a large array is sliced and pickled, it seems the full buffer is serialized. This leads to excessive memory usage and data transfer when using multiprocessing or dask.

{code:java}
>>> import pyarrow as pa
>>> ar = pa.array(['foo'] * 100_000)
>>> ar.nbytes
74
>>> import pickle
>>> len(pickle.dumps(ar.slice(10, 1)))
700165
{code}

NumPy, for instance, only serializes the sliced data:

{code:java}
>>> import numpy as np
>>> ar_np = np.array(ar)
>>> ar_np
array(['foo', 'foo', 'foo', ..., 'foo', 'foo', 'foo'], dtype=object)
>>> import pickle
>>> len(pickle.dumps(ar_np[10:11]))
165
{code}

I think this makes sense if you know Arrow, but it is kind of unexpected as a user. Is there a workaround for this? For instance, copying an Arrow array to get rid of the offset and trimming the buffers?

-- This message was sent by Atlassian Jira (v8.3.4#803005)
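The behaviour can be mimicked in plain Python: a view object that keeps a reference to the full buffer pickles large, while trimming the buffer first pickles small. This is a toy model of the situation, not pyarrow internals:

```python
import pickle


class View:
    """Toy stand-in for a sliced Arrow array: an (offset, length) window into a shared buffer."""

    def __init__(self, buffer, offset, length):
        self.buffer, self.offset, self.length = buffer, offset, length


buf = b"foo" * 100_000
sliced = View(buf, 30, 3)
# pickling the view drags the whole underlying buffer along
assert len(pickle.dumps(sliced)) > 300_000

# "trimming" first: copy only the sliced bytes and reset the offset
trimmed = View(buf[sliced.offset:sliced.offset + sliced.length], 0, sliced.length)
assert len(pickle.dumps(trimmed)) < 200
```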
[jira] [Created] (ARROW-10736) [Python] feather/arrow row splitting and counting (Dataset API)
Maarten Breddels created ARROW-10736: Summary: [Python] feather/arrow row splitting and counting (Dataset API) Key: ARROW-10736 URL: https://issues.apache.org/jira/browse/ARROW-10736 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Reporter: Maarten Breddels

For parquet files using the Dataset API, we have the option to access the row groups and count the total number of rows within each. I don't see a way to get the number of rows from a dataset with feather/arrow IPC files: a scan without any columns does not seem possible, nor is there any method to get the row count. Also, if a file consists of chunked arrays, it is exposed as one fragment, and it is not possible to read only a portion of a file fragment (row slicing), similar to how one can work with ParquetFileFragment.split_by_row_group. I don't know of any other way within Apache Arrow to work with feather/arrow IPC files and read only portions of them (e.g. a particular column for rows i to j). Are these features possible some other way, or are they already planned, or perhaps difficult to implement? cheers, Maarten

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10709) [Python] Difficult to make an efficient zero-copy file reader in Python
Maarten Breddels created ARROW-10709: Summary: [Python] Difficult to make an efficient zero-copy file reader in Python Key: ARROW-10709 URL: https://issues.apache.org/jira/browse/ARROW-10709 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Maarten Breddels

There is an option to do efficient data transport using file.read_buffer() with zero memory copies (benchmarks have confirmed this, very nice!). However, file.read_buffer(), when backed by a Python object (via PythonFile), will call PythonFile.read() via PyReadableFile::Read. A 'normal' file.read(), which does copy memory, also calls the PythonFile.read() method, but only allows a bytes object (PyBytes_Check is used in PyReadableFile::Read). This makes it hard to create one file object in Python land that supports a normal .read() (and thus needs to return a bytes object) and also supports a zero-copy route where .read() can return a memory view. Possibly the strict PyBytes_Check can be lifted by also trying PyObject_GetBuffer.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
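The tension can be shown in plain Python: a reader whose read() returns memoryview slices avoids copies, but any consumer that insists on a bytes object (as the PyBytes_Check path does) rejects the result. An illustrative sketch, not the pyarrow classes:

```python
class ZeroCopyReader:
    """File-like object backed by a buffer; read() returns zero-copy memoryviews."""

    def __init__(self, data):
        self._mv = memoryview(data)
        self._pos = 0

    def read(self, n):
        chunk = self._mv[self._pos:self._pos + n]  # a view, no copy is made
        self._pos += len(chunk)
        return chunk


reader = ZeroCopyReader(b"abcdef")
chunk = reader.read(3)
assert bytes(chunk) == b"abc"
assert isinstance(chunk, memoryview)   # fine for a zero-copy route
assert not isinstance(chunk, bytes)    # ...but fails a strict bytes-only check
```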
[jira] [Created] (ARROW-10557) [C++] Add scalar string slicing/substring kernel
Maarten Breddels created ARROW-10557: Summary: [C++] Add scalar string slicing/substring kernel Key: ARROW-10557 URL: https://issues.apache.org/jira/browse/ARROW-10557 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Maarten Breddels Assignee: Maarten Breddels

This should implement slicing the scalar string values of string arrays with Python semantics, with start, stop, and step arguments. This may seem similar to slicing lists or binary arrays, but string length semantics enter into this kernel: the length does not need to equal the number of bytes, nor the number of codepoints (accents etc. should be skipped).

-- This message was sent by Atlassian Jira (v8.3.4#803005)
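The semantics to match are Python's, where slicing counts characters rather than bytes:

```python
s = "café"                  # 4 characters, but 5 utf-8 bytes ('é' is 2 bytes)
assert s[1:3] == "af"       # slicing counts characters, not bytes
assert s[::-1] == "éfac"    # the step argument is part of the semantics
assert len(s) == 4
assert len(s.encode()) == 5
```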
[jira] [Created] (ARROW-10556) [C++] Caching pre computed data based on FunctionOptions in the kernel state
Maarten Breddels created ARROW-10556: Summary: [C++] Caching pre-computed data based on FunctionOptions in the kernel state Key: ARROW-10556 URL: https://issues.apache.org/jira/browse/ARROW-10556 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Maarten Breddels

See discussion here: [https://github.com/apache/arrow/pull/8621#issuecomment-724796243]

A kernel might need to pre-compute something based on the function options passed. Since the Kernel-FunctionOptions mapping is not 1-to-1, it does not make sense to store this in the function options object. Currently, match_substring calculates a `prefix_table` on each Exec call, and in trim ([https://github.com/apache/arrow/pull/8621]) we compute a vector on each Exec call. This should be done only once and cached in the kernel state instead.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
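The intended pattern, sketched in Python (the real kernel state is a C++ structure; `prefix_table` follows the match_substring discussion, and the KMP failure function here stands in for whatever precomputation the kernel needs):

```python
class MatchSubstringState:
    """Per-kernel state: data derived from the options is computed once, not per batch."""

    def __init__(self, pattern):
        self.pattern = pattern
        self.prefix_table = self._build_prefix_table(pattern)  # cached at init time

    @staticmethod
    def _build_prefix_table(pattern):
        # KMP failure function: the kind of option-derived table a kernel would reuse
        table = [0] * len(pattern)
        k = 0
        for i in range(1, len(pattern)):
            while k and pattern[i] != pattern[k]:
                k = table[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            table[i] = k
        return table

    def exec(self, value):
        # every Exec call reuses the cached table instead of rebuilding it
        return self.pattern in value


state = MatchSubstringState("abab")
assert state.prefix_table == [0, 0, 1, 2]
assert state.exec("xxabab") and not state.exec("xx")
```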
[jira] [Created] (ARROW-10541) [C++] Add re2 library to core arrow / ARROW_WITH_RE2
Maarten Breddels created ARROW-10541: Summary: [C++] Add re2 library to core arrow / ARROW_WITH_RE2 Key: ARROW-10541 URL: https://issues.apache.org/jira/browse/ARROW-10541 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Maarten Breddels

For https://issues.apache.org/jira/browse/ARROW-10195 we need re2 linked into the core arrow library, as discussed: [https://github.com/apache/arrow/pull/8459#pullrequestreview-508337720] This might be good to put under an ARROW_WITH_RE2 CMake option, maybe defaulting to ON when ARROW_COMPUTE=ON?

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10306) [C++] Add string replacement kernel
Maarten Breddels created ARROW-10306: Summary: [C++] Add string replacement kernel Key: ARROW-10306 URL: https://issues.apache.org/jira/browse/ARROW-10306 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Maarten Breddels Assignee: Maarten Breddels

Similar to [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.replace.html], with a plain variant and optionally an RE2 version.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
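The Python reference behaviour for the two variants (`re` stands in here for what an RE2-backed kernel would do):

```python
import re

s = "foo bar foo"
assert s.replace("foo", "baz") == "baz bar baz"      # plain literal variant
assert re.sub(r"f.o", "baz", s) == "baz bar baz"     # regex variant
assert s.replace("foo", "baz", 1) == "baz bar foo"   # a max-replacements argument
```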
[jira] [Created] (ARROW-10209) [Python] support positional arguments for options in compute wrapper
Maarten Breddels created ARROW-10209: Summary: [Python] support positional arguments for options in compute wrapper Key: ARROW-10209 URL: https://issues.apache.org/jira/browse/ARROW-10209 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Maarten Breddels

As mentioned here: [https://github.com/apache/arrow/pull/8271#discussion_r500897047], we cannot support

{code:java}
pc.split_pattern(arr, "---")
{code}

where the second argument is a positional argument of the FunctionOptions class. I think it makes sense for a small subset (like this function) to support non-keyword arguments.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
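One possible shape for such a wrapper (a hypothetical helper, not the pyarrow implementation): positional arguments are zipped onto the option class's field names before dispatch.

```python
def wrap_with_options(func, options_cls, option_fields):
    """Build a wrapper that also accepts option values positionally."""
    def wrapper(arr, *args, **kwargs):
        options = options_cls(**dict(zip(option_fields, args)), **kwargs)
        return func(arr, options)
    return wrapper


# toy stand-ins to exercise the wrapper
class SplitPatternOptions:
    def __init__(self, pattern):
        self.pattern = pattern


def _split_pattern(arr, options):
    return [s.split(options.pattern) for s in arr]


split_pattern = wrap_with_options(_split_pattern, SplitPatternOptions, ["pattern"])
# the positional call style from the issue now works
assert split_pattern(["a---b"], "---") == [["a", "b"]]
# ...and so does the keyword form
assert split_pattern(["a---b"], pattern="---") == [["a", "b"]]
```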
[jira] [Created] (ARROW-10208) [C++] comparing list arrays with nulls fails in test framework
Maarten Breddels created ARROW-10208: Summary: [C++] comparing list arrays with nulls fails in test framework Key: ARROW-10208 URL: https://issues.apache.org/jira/browse/ARROW-10208 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Maarten Breddels

I am not sure if this is a test-framework issue or valid behavior, but when writing a test in [https://github.com/apache/arrow/pull/8271] the following fails:

{code:java}
this->CheckUnary("split_pattern", R"(["foo bar", "foo", null])", list(this->type()),
                 R"([["foo", "bar"], ["foo"], null])");
{code}

with the following output:

{code:java}
Failed:
Got:
[
  [
    [
      "foo",
      "bar"
    ]
  ],
  [
    [
      "foo"
    ],
    null
  ]
]
Expected:
[
  [
    [
      "foo",
      "bar"
    ]
  ],
  [
    [
      "foo"
    ],
    null
  ]
]
{code}

While the outputs print the same, the arrays are seen as unequal.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10207) [C++] Unary kernels that result in a list have no preallocated offset buffer
Maarten Breddels created ARROW-10207: Summary: [C++] Unary kernels that result in a list have no preallocated offset buffer Key: ARROW-10207 URL: https://issues.apache.org/jira/browse/ARROW-10207 Project: Apache Arrow Issue Type: Improvement Reporter: Maarten Breddels

I noticed in [https://github.com/apache/arrow/pull/8271] that a string->list[string] kernel does not have the offsets preallocated in the output. I believe there is a preference for not doing allocations in kernels, so that this can be optimized at a higher level; I think that can also be done in this case.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-10195) [C++] Add string struct extract kernel using re2
Maarten Breddels created ARROW-10195: Summary: [C++] Add string struct extract kernel using re2 Key: ARROW-10195 URL: https://issues.apache.org/jira/browse/ARROW-10195 Project: Apache Arrow Issue Type: New Feature Reporter: Maarten Breddels Assignee: Maarten Breddels Similar to Pandas' str.extract a way to convert a string to a struct of strings using the re2 regex library (when having named captured groups). -- This message was sent by Atlassian Jira (v8.3.4#803005)
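The Python-level behaviour this mirrors, with `re` standing in for RE2's named captures (each named group would become one field of the output struct):

```python
import re

m = re.match(r"(?P<letter>[ab])(?P<digit>\d+)", "b12")
# one struct field per named capture group
assert m.groupdict() == {"letter": "b", "digit": "12"}
```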
[jira] [Created] (ARROW-9991) [C++] split kernels for strings/binary
Maarten Breddels created ARROW-9991: --- Summary: [C++] split kernels for strings/binary Key: ARROW-9991 URL: https://issues.apache.org/jira/browse/ARROW-9991 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Maarten Breddels Assignee: Maarten Breddels

Similar to Python's str.split and bytes.split, we'd like to have a way to convert str into list[str] (and similarly for bytes). When a separator is given, the algorithms for both types are the same. Python, however, overloads split: when given no separator, the algorithm splits on any run of whitespace (unicode for str, ascii for bytes). I'd rather not see too many overloaded kernels, e.g.:
# binary_split (takes a string/binary separator and a maxsplit arg, no special utf8 version needed)
# utf8_split_whitespace (similar to Python's version given no separator)
# asi

-- This message was sent by Atlassian Jira (v8.3.4#803005)
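The two Python behaviours being separated into distinct kernels:

```python
# no separator: any run of whitespace splits, leading/trailing runs are dropped
assert "  a \t b\n".split() == ["a", "b"]

# explicit separator: every occurrence splits, empty strings are kept
assert "a--b".split("-") == ["a", "", "b"]

# maxsplit limits the number of splits
assert "a-b-c".split("-", 1) == ["a", "b-c"]
```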
[jira] [Created] (ARROW-9471) [C++] Scan Dataset in reverse
Maarten Breddels created ARROW-9471: --- Summary: [C++] Scan Dataset in reverse Key: ARROW-9471 URL: https://issues.apache.org/jira/browse/ARROW-9471 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Maarten Breddels

If a dataset does not fit into the OS cache, it can be beneficial to alternate between normal and reverse 'scanning'. Even if 90% of a set of files fits into the cache, scanning the same set twice will not make use of the OS cache; on the other hand, if the second scan goes in reverse order, 90% will still be in the OS cache. We use this trick in vaex, and I'd like to support it for parquet reading as well. (Is there a proper name/term for this?) Note that since you don't want to reverse at the byte level, you may want to reverse the order of traversing the fragments, or the fragments and row groups. Too-small chunks (e.g. pages) could lead to a performance decrease, because most read algorithms implement read-ahead optimization (not the reverse). I think doing this at the fragment level might be enough.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9458) [Python] Dataset singlethreaded only
Maarten Breddels created ARROW-9458: --- Summary: [Python] Dataset singlethreaded only Key: ARROW-9458 URL: https://issues.apache.org/jira/browse/ARROW-9458 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Maarten Breddels

I'm not sure if this is a misunderstanding, a compilation issue (flags?), or an issue in the C++ layer. I have 1000 parquet files with a total of 1 billion rows (1 million rows each, ~20 columns). I wanted to see if I could go through all rows of 1 or 2 columns efficiently (the vaex use case).

{code:java}
import pyarrow.parquet
import pyarrow as pa
import pyarrow.dataset as ds
import glob

ds = pa.dataset.dataset(glob.glob('/data/taxi_parquet/data_*.parquet'))
scanned = 0
for scan_task in ds.scan(batch_size=1_000_000, columns=['passenger_count'], use_threads=True):
    for record_batch in scan_task.execute():
        scanned += record_batch.num_rows
scanned
{code}

This only seems to use 1 CPU. Using a thread pool from Python:

{code:java}
# %%timeit
import concurrent.futures
pool = concurrent.futures.ThreadPoolExecutor()
ds = pa.dataset.dataset(glob.glob('/data/taxi_parquet/data_*.parquet'))

def process(scan_task):
    scan_count = 0
    for record_batch in scan_task.execute():
        scan_count += len(record_batch)
    return scan_count

sum(pool.map(process, ds.scan(batch_size=1_000_000, columns=['passenger_count'], use_threads=False)))
{code}

gives me similar performance: again, only 100% CPU usage (= 1 core/cpu). py-spy (a profiler for Python) shows no GIL, so this might be something at the C++ layer. Am I 'holding it wrong' or could this be a bug? Note that IO speed is not a problem on this system (it actually all comes from the OS cache, no disk reads observed).

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9456) [Python] Dataset segfault when not importing pyarrow.parquet
Maarten Breddels created ARROW-9456: --- Summary: [Python] Dataset segfault when not importing pyarrow.parquet Key: ARROW-9456 URL: https://issues.apache.org/jira/browse/ARROW-9456 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Maarten Breddels

To reproduce:

{code:java}
# import pyarrow.parquet  # if we skip this...
import pyarrow as pa
import pyarrow.dataset as ds
import glob
ds = pa.dataset.dataset('/data/taxi_parquet/data_0.parquet')
ds.to_table()  # this will crash
{code}

{noformat}
$ python pyarrow/crash.py dev
terminate called after throwing an instance of 'parquet::ParquetException'
  what():  The file only has 19 columns, requested metadata for column: 1049198736
[1]    1559395 abort (core dumped)  python pyarrow/crash.py
{noformat}

When the import is there, it works fine.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9403) [Python] add .tolist as alias of to_pylist
Maarten Breddels created ARROW-9403: --- Summary: [Python] add .tolist as alias of to_pylist Key: ARROW-9403 URL: https://issues.apache.org/jira/browse/ARROW-9403 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Maarten Breddels Assignee: Maarten Breddels

As discussed on the mailing list, it helps to write library-agnostic code (NumPy/pyarrow) if arrays support .tolist() as an alias of .to_pylist().

-- This message was sent by Atlassian Jira (v8.3.4#803005)
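The alias is trivial to express; a sketch of the pattern with a toy class (not the actual pyarrow implementation):

```python
class ArrowLikeArray:
    """Toy array exposing to_pylist() plus a NumPy-style tolist() alias."""

    def __init__(self, values):
        self._values = list(values)

    def to_pylist(self):
        return list(self._values)

    tolist = to_pylist  # alias: library-agnostic code can now call .tolist()


arr = ArrowLikeArray([1, 2, 3])
assert arr.tolist() == arr.to_pylist() == [1, 2, 3]
```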
[jira] [Created] (ARROW-9268) [C++] Add is{alnum,alpha,...} kernels for strings
Maarten Breddels created ARROW-9268: --- Summary: [C++] Add is{alnum,alpha,...} kernels for strings Key: ARROW-9268 URL: https://issues.apache.org/jira/browse/ARROW-9268 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Maarten Breddels Assignee: Maarten Breddels

A good list of kernels to have would be str->bool kernels, similar to [https://docs.python.org/3/library/stdtypes.html#str.isalnum] and friends. I think all but `isidentifier` make sense to have. The semantics of the Python functions seem quite reasonable for Arrow, but maybe others can provide feedback on whether this is a complete/reasonable list. I am not sure if we need more (or fewer) functions, or whether we want more atomic functions, e.g. a test for membership in Unicode categories.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
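The Python semantics in question, which are Unicode-aware rather than ASCII-only:

```python
assert "abc123".isalnum()
assert not "abc 123".isalnum()   # the space fails the predicate

# the predicates distinguish Unicode categories, not just ASCII classes:
assert "Ⅻ".isnumeric()           # Roman numeral twelve is numeric...
assert not "Ⅻ".isdigit()         # ...but not a digit
```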
[jira] [Created] (ARROW-9133) [C++] Add utf8_upper and utf8_lower
Maarten Breddels created ARROW-9133: --- Summary: [C++] Add utf8_upper and utf8_lower Key: ARROW-9133 URL: https://issues.apache.org/jira/browse/ARROW-9133 Project: Apache Arrow Issue Type: Improvement Reporter: Maarten Breddels

This is the equivalent of https://issues.apache.org/jira/browse/ARROW-9100 for utf8. This will be a good test for unilib vs utf8proc, performance- and API-wise. Also, since Unicode strings can grow and shrink when changing case, this is a good place to start thinking about a strategy for memory allocation. How much can a 'string' (or byte sequence) actually grow? Section 5.18 of the Unicode standard mentions that a string can expand by a factor of 3, by which they seem to mean 3 codepoints. This can be validated with Python:

{code:python}
for i in range(0x100, 0x110000):
    codepoint = chr(i)
    try:
        bytes_before = codepoint.encode()
    except UnicodeEncodeError:
        continue
    bytes_after = codepoint.upper().encode()
    if len(bytes_before) != len(bytes_after):
        print(i, hex(i), codepoint, codepoint.upper(), len(bytes_before), len(bytes_after))

912 0x390 ΐ Ϊ́ 2 6
...
{code}

showing that a two-byte codepoint can expand to 3 two-byte codepoints (2 bytes => 6 bytes). The character Ϊ́ has no single precomposed capital character, so it is composed of a single base character and two combining characters. However, there are different situations, explained in [https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt]. This increase by a factor of 3 is used in CPython [https://github.com/python/cpython/blob/25f38d7044a3a47465edd851c4e04f337b2c4b9b/Objects/unicodeobject.c#L10058], which is an easy solution that avoids having to grow the buffer dynamically.
However, growing 3x in size seems at odds with the APIs of both utf8proc ([https://github.com/JuliaStrings/utf8proc/blob/08fa0698639f15d07b12c0065a4494f2d504/utf8proc.c#L375]) and unilib ([https://github.com/ufal/unilib/blob/d8276e70b7c11c677897f71030de7258cbb1f99e/unilib/unicode.h#L79]), which can only return a single 32-bit value (thus 1 codepoint, encoding 1 character). Both libraries seem to ignore the special cases of case mapping (neither uses/downloads SpecialCasing.txt). This means that if Arrow wants to support the same features as Python regarding upper and lower casing (which means really implementing Unicode), neither library is sufficient. There are more edge cases/irregularities, but I propose to start with versions of utf8_lower and utf8_upper that ignore the special cases.

PS: Another interesting finding is that although upper casing can increase the buffer length by a factor of 3, lower casing a utf8 string will only increase the byte length by a factor of 3/2 at maximum.

{code:python}
for i in range(0x100, 0x110000):
    codepoint = chr(i)
    try:
        bytes_before = codepoint.encode()
    except UnicodeEncodeError:
        continue
    bytes_after = codepoint.lower().encode()
    if len(bytes_before) != len(bytes_after):
        print(i, hex(i), codepoint, codepoint.lower(), len(bytes_before), len(bytes_after))

304 0x130 İ i̇ 2 3
570 0x23a Ⱥ ⱥ 2 3
574 0x23e Ⱦ ⱦ 2 3
7838 0x1e9e ẞ ß 3 2
8486 0x2126 Ω ω 3 2
8490 0x212a K k 3 1
8491 0x212b Å å 3 2
11362 0x2c62 Ɫ ɫ 3 2
11364 0x2c64 Ɽ ɽ 3 2
11373 0x2c6d Ɑ ɑ 3 2
11374 0x2c6e Ɱ ɱ 3 2
11375 0x2c6f Ɐ ɐ 3 2
11376 0x2c70 Ɒ ɒ 3 2
11390 0x2c7e Ȿ ȿ 3 2
11391 0x2c7f Ɀ ɀ 3 2
42893 0xa78d Ɥ ɥ 3 2
42922 0xa7aa Ɦ ɦ 3 2
42923 0xa7ab Ɜ ɜ 3 2
42924 0xa7ac Ɡ ɡ 3 2
42925 0xa7ad Ɬ ɬ 3 2
42926 0xa7ae Ɪ ɪ 3 2
42928 0xa7b0 Ʞ ʞ 3 2
42929 0xa7b1 Ʇ ʇ 3 2
42930 0xa7b2 Ʝ ʝ 3 2
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9131) [C++] Faster ascii_lower and ascii_upper
Maarten Breddels created ARROW-9131: --- Summary: [C++] Faster ascii_lower and ascii_upper Key: ARROW-9131 URL: https://issues.apache.org/jira/browse/ARROW-9131 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Maarten Breddels

The current version uses a lookup table for the case conversion. According to [http://quick-bench.com/JaDErmVCY23Z1tu6YZns_KBt0qU], a boolean range check and +/-32 is ~5x faster (4.6x for clang 9, 6.4x for GCC 9.2).

-- This message was sent by Atlassian Jira (v8.3.4#803005)
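The faster variant, transcribed to Python for illustration (the benchmark compares C++ codegen, so relative timings will differ here):

```python
def ascii_lower(data: bytes) -> bytes:
    # range check plus +32 instead of a 256-entry lookup table;
    # 0x41..0x5A is 'A'..'Z', and lowercase letters sit exactly 32 codes higher
    return bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in data)


assert ascii_lower(b"Hello, WORLD!") == b"hello, world!"
```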
[jira] [Created] (ARROW-9100) Add ascii_lower kernel
Maarten Breddels created ARROW-9100: --- Summary: Add ascii_lower kernel Key: ARROW-9100 URL: https://issues.apache.org/jira/browse/ARROW-9100 Project: Apache Arrow Issue Type: Task Components: C++ Reporter: Maarten Breddels -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8990) [C++] Benchmark hash table against thirdparty options, possibly vendor a thirdparty hash table library
[ https://issues.apache.org/jira/browse/ARROW-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17121205#comment-17121205 ] Maarten Breddels commented on ARROW-8990: - FYI, I've been using that library and [https://github.com/skarupke/flat_hash_map] for Vaex. After some benchmarking I settled on the tsl one, but my research/benchmark wasn't very thorough, because the idea was that I could easily switch if needed. But because the performance was great, I never looked back, so I'd be interested in the benchmark results. By the same author, the [https://github.com/Tessil/hat-trie] library can also be very interesting to take a look at.
> [C++] Benchmark hash table against thirdparty options, possibly vendor a
> thirdparty hash table library
> --
>
> Key: ARROW-8990
> URL: https://issues.apache.org/jira/browse/ARROW-8990
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Wes McKinney
> Priority: Major
>
> While we have our own hash table implementation, it would be worthwhile to
> set up some benchmarks so that we can compare against std::unordered_map and
> some other thirdparty libraries for hash tables to know whether we should
> possibly use a thirdparty library. See e.g.
> https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html
> Libraries to consider:
> * https://github.com/sparsehash/sparsehash
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8961) [C++] Vendor utf8proc library
[ https://issues.apache.org/jira/browse/ARROW-8961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118713#comment-17118713 ] Maarten Breddels commented on ARROW-8961: - FWIW, in Vaex I've relied on [https://github.com/ufal/unilib], which is a very minimal/barebones library. I have no strong opinions about this, though (unless benchmarks tell me otherwise).
> [C++] Vendor utf8proc library
> -
>
> Key: ARROW-8961
> URL: https://issues.apache.org/jira/browse/ARROW-8961
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 1.0.0
>
> This is a minimal MIT-licensed library for UTF-8 data processing originally
> developed for use in Julia
> https://github.com/JuliaStrings/utf8proc
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-555) [C++] String algorithm library for StringArray/BinaryArray
[ https://issues.apache.org/jira/browse/ARROW-555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17114105#comment-17114105 ] Maarten Breddels commented on ARROW-555: Sounds good. I think it would help me a lot to see str->scalar and str->str (and possibly a str->[str, str]) example. They can be trivial, like always return ["a", "b"], but with that, I can probably get up to speed very quickly, if it's not too much to ask. > [C++] String algorithm library for StringArray/BinaryArray > -- > > Key: ARROW-555 > URL: https://issues.apache.org/jira/browse/ARROW-555 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: Analytics > > This is a parent JIRA for starting a module for processing strings in-memory > arranged in Arrow format. This will include using the re2 C++ regular > expression library and other standard string manipulations (such as those > found on Python's string objects) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8865) Windows distribution for 0.17.1 seems broken (conda only)
[ https://issues.apache.org/jira/browse/ARROW-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112105#comment-17112105 ] Maarten Breddels commented on ARROW-8865: - Thanks Joris, we got CI working by installing from PyPI in the meantime. Feel free to close this if you don't think it belongs here.
> Windows distribution for 0.17.1 seems broken (conda only)
> -
>
> Key: ARROW-8865
> URL: https://issues.apache.org/jira/browse/ARROW-8865
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.17.1
> Reporter: Maarten Breddels
> Priority: Major
>
> We just started seeing issues with importing pyarrow on our CI:
> [https://github.com/vaexio/vaex/pull/749/checks?check_run_id=689857401]
> Long logs, the issue appears here:
>
> import pyarrow._parquet as _parquet
> [2541|https://github.com/vaexio/vaex/pull/749/checks?check_run_id=689857401#step:15:2541]E
> ImportError: DLL load failed: The specified procedure could not be found.
>
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8865) Windows distribution for 0.17.1 seems broken (conda only)
[ https://issues.apache.org/jira/browse/ARROW-8865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maarten Breddels updated ARROW-8865: Summary: Windows distribution for 0.17.1 seems broken (conda only) (was: windows distribution for 0.17.1 seems broken (conda only?) > Windows distribution for 0.17.1 seems broken (conda only) > - > > Key: ARROW-8865 > URL: https://issues.apache.org/jira/browse/ARROW-8865 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.17.1 >Reporter: Maarten Breddels >Priority: Major > > We just started seeing issues with importing pyarrow on our CI: > [https://github.com/vaexio/vaex/pull/749/checks?check_run_id=689857401] > Long logs, the issue appears here: > > import pyarrow._parquet as _parquet > [2541|https://github.com/vaexio/vaex/pull/749/checks?check_run_id=689857401#step:15:2541]E > ImportError: DLL load failed: The specified procedure could not be found. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8865) windows distribution for 0.17.1 seems broken (conda only?
Maarten Breddels created ARROW-8865: --- Summary: windows distribution for 0.17.1 seems broken (conda only? Key: ARROW-8865 URL: https://issues.apache.org/jira/browse/ARROW-8865 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.17.1 Reporter: Maarten Breddels

We just started seeing issues with importing pyarrow on our CI: [https://github.com/vaexio/vaex/pull/749/checks?check_run_id=689857401] Long logs, the issue appears here:

{noformat}
> import pyarrow._parquet as _parquet
[2541|https://github.com/vaexio/vaex/pull/749/checks?check_run_id=689857401#step:15:2541] E ImportError: DLL load failed: The specified procedure could not be found.
{noformat}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-555) [C++] String algorithm library for StringArray/BinaryArray
[ https://issues.apache.org/jira/browse/ARROW-555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104666#comment-17104666 ] Maarten Breddels commented on ARROW-555: Something to consider (or should I move this discussion to the list?), is the support of ASCII vs utf8. I noticed the Gandiva code assumed ASCII (at least not utf8), while Arrow assumes strings are utf8 only. Having written the vaex string code, I'm pretty sure ASCII will be much faster though (you know the byte length of a string in advance). Is there interest in supporting more than utf8, ASCII for instance, or utf16/32? Or should it be utf8 only? > [C++] String algorithm library for StringArray/BinaryArray > -- > > Key: ARROW-555 > URL: https://issues.apache.org/jira/browse/ARROW-555 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: Analytics > > This is a parent JIRA for starting a module for processing strings in-memory > arranged in Arrow format. This will include using the re2 C++ regular > expression library and other standard string manipulations (such as those > found on Python's string objects) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-555) [C++] String algorithm library for StringArray/BinaryArray
[ https://issues.apache.org/jira/browse/ARROW-555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104525#comment-17104525 ] Maarten Breddels commented on ARROW-555: I am likely to be able to start working on strings in Arrow this month, so I think the timing is good. Some pointers/examples to get me started would be great. > [C++] String algorithm library for StringArray/BinaryArray > -- > > Key: ARROW-555 > URL: https://issues.apache.org/jira/browse/ARROW-555 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: Analytics > > This is a parent JIRA for starting a module for processing strings in-memory > arranged in Arrow format. This will include using the re2 C++ regular > expression library and other standard string manipulations (such as those > found on Python's string objects) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-555) [C++] String algorithm library for StringArray/BinaryArray
[ https://issues.apache.org/jira/browse/ARROW-555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051644#comment-17051644 ] Maarten Breddels commented on ARROW-555: What are the limitations, and are they documented somewhere? It might be good to keep those in mind. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-555) [C++] String algorithm library for StringArray/BinaryArray
[ https://issues.apache.org/jira/browse/ARROW-555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051264#comment-17051264 ] Maarten Breddels commented on ARROW-555: Related: https://issues.apache.org/jira/browse/ARROW-7083 I will probably start working on this a few weeks from now. My initial intention would be to separate the algorithms as much as possible, so that they could be added both to Gandiva and as 'bare' kernels with a minimal amount of refactoring. [~wesm]: what's your reason for choosing re2? Gandiva and vaex both use pcre, but I have no strong preference (except being a bit familiar with pcre). -- This message was sent by Atlassian Jira (v8.3.4#803005)
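Whichever regex engine is picked, the kernel shape is the same; here is a hypothetical element-wise "match" kernel sketched with Python's stdlib `re` as a stand-in for the C++ engine (the function name and null handling are illustrative assumptions, not an existing API):

```python
import re

def match_substring_regex(values, pattern):
    """Hypothetical element-wise kernel: for each string (or None),
    return whether the regex matches anywhere, propagating nulls."""
    prog = re.compile(pattern)  # compile once, apply per element
    return [None if v is None else bool(prog.search(v)) for v in values]

print(match_substring_regex(["foo", "bar", None, "foobar"], "foo"))
# prints [True, False, None, True]
```

Compiling the pattern once and streaming it over the array is the part that matters for a vectorized kernel, regardless of whether re2 or pcre sits underneath.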
[jira] [Commented] (ARROW-7396) [Format] Register media types (MIME types) for Apache Arrow formats to IANA
[ https://issues.apache.org/jira/browse/ARROW-7396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16998111#comment-16998111 ] Maarten Breddels commented on ARROW-7396: - According to [https://en.wikipedia.org/wiki/Media_type] _(about using x. or x-):_ _Media types in this tree cannot be registered. According to RFC 6838 (published in January 2013), any use of types in the unregistered tree is strongly discouraged. In addition, subtypes prefixed with {{x-}} or {{X-}} are no longer considered to be members of this tree._ This refers to [https://tools.ietf.org/html/rfc6838] It seems to me that registering with a vnd. prefix is more likely to be accepted at [https://www.iana.org/form/media-types] * application/vnd.apache.arrow.file * application/vnd.apache.arrow.stream Possibly with an optional parameter for a version? I have to serialize Apache Arrow Tables in JSON files and want to store the MIME type with it, hence my interest. > [Format] Register media types (MIME types) for Apache Arrow formats to IANA > --- > > Key: ARROW-7396 > URL: https://issues.apache.org/jira/browse/ARROW-7396 > Project: Apache Arrow > Issue Type: Improvement > Components: Format >Reporter: Kouhei Sutou >Priority: Major > > See "MIME types" thread for details: > https://lists.apache.org/thread.html/b15726d0c0da2223ba1b45a226ef86263f688b20532a30535cd5e267%40%3Cdev.arrow.apache.org%3E > Summary: > * If we don't register our media types for Apache Arrow formats (IPC File > Format and IPC Streaming Format) to IANA, we should use "x-" prefix such as > "application/x-apache-arrow-file". > * It may be better that we reuse the same manner as Apache Thrift. Apache > Thrift registers their media types as "application/vnd.apache.thrift.XXX". If > we use the same manner as Apache Thrift, we will use > "application/vnd.apache.arrow.file" or something. > TODO: > * Decide which media types should we register. (Do we need vote?) > * Register our media types to IANA. 
> ** Media types page: > https://www.iana.org/assignments/media-types/media-types.xhtml > ** Application form for new media types: > https://www.iana.org/form/media-types -- This message was sent by Atlassian Jira (v8.3.4#803005)
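A sketch of how the proposed media types might be used in practice, e.g. when labeling Arrow payloads for HTTP or storage. Note the vnd. values and the file extensions below are assumptions from this thread, not (yet) IANA-registered types:

```python
# Proposed (hypothetical, unregistered) media types from this discussion.
ARROW_MEDIA_TYPES = {
    ".arrow": "application/vnd.apache.arrow.file",     # IPC file format
    ".arrows": "application/vnd.apache.arrow.stream",  # IPC streaming format
}

def content_type_for(filename):
    """Pick a Content-Type value for an Arrow payload by extension."""
    for ext, media_type in ARROW_MEDIA_TYPES.items():
        if filename.endswith(ext):
            return media_type
    return "application/octet-stream"  # generic fallback

assert content_type_for("table.arrow") == "application/vnd.apache.arrow.file"
```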
[jira] [Commented] (ARROW-4810) [Format][C++] Add "LargeList" type with 64-bit offsets
[ https://issues.apache.org/jira/browse/ARROW-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16788160#comment-16788160 ] Maarten Breddels commented on ARROW-4810: - I see the BinaryArray/StringArray classes have a similar implementation; the same holds for those: create a Large(Binary/String)Array? > [Format][C++] Add "LargeList" type with 64-bit offsets > -- > > Key: ARROW-4810 > URL: https://issues.apache.org/jira/browse/ARROW-4810 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Format >Reporter: Wes McKinney >Priority: Major > Fix For: 0.14.0 > > > Mentioned in https://github.com/apache/arrow/issues/3845 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4810) [Format][C++] Add "LargeList" type with 64-bit offsets
[ https://issues.apache.org/jira/browse/ARROW-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16788150#comment-16788150 ] Maarten Breddels commented on ARROW-4810: - > Having arrays with > 2GB elements or binary arrays with > 2GB of data would > be considered an anti-pattern in the context of database systems, regardless > of whether the offsets are 32- or 64-bit. So in light of these it doesn't > make sense to have the default type be 64-bit capable if this capability is > seldom used I agree it's not the best idea, but people will find a reason to do it, and since there will be no straightforward workaround, it may spin off another 'standard' :) But since allowing/supporting it would solve both issues (>2GB elements, and less code complexity), I thought I would mention that as well. As for the implementation, are you thinking of a new class (apart from ListArray), or does it seem feasible to include a type parameter for the value_offsets? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
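The 2GB limit under discussion follows directly from the offsets representation; a plain-Python sketch (illustrative, element sizes are made up) of why 32-bit offsets cap variable-length data:

```python
# Variable-length types (List, Binary, String) store cumulative offsets;
# with int32 offsets the final offset must fit in 2**31 - 1, capping the
# total child data at ~2 GiB. 64-bit ("Large") offsets lift that cap.
INT32_MAX = 2**31 - 1

element_size = 2**20  # hypothetical: 1 MiB per element
n_elements = 3000     # ~3 GiB of data in total
offsets = [i * element_size for i in range(n_elements + 1)]

assert offsets[-1] > INT32_MAX   # would overflow int32 offsets
assert offsets[-1] < 2**63 - 1   # comfortably fits int64 offsets
```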
[jira] [Commented] (ARROW-3685) [Python] Use fixed size binary for NumPy fixed-size string dtypes
[ https://issues.apache.org/jira/browse/ARROW-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16671614#comment-16671614 ] Maarten Breddels commented on ARROW-3685: - I tried to make a PR, but it's opening a whole can of worms, so maybe this part should be vaex specific, or maybe go into the docs. > [Python] Use fixed size binary for NumPy fixed-size string dtypes > - > > Key: ARROW-3685 > URL: https://issues.apache.org/jira/browse/ARROW-3685 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.11.1 >Reporter: Maarten Breddels >Priority: Major > > I'm working on getting support for arrow in vaex (out of core dataframe > library for Python) in this PR: > [https://github.com/maartenbreddels/vaex/pull/116] > And I fixed length binary arrays for numpy (say dtype='S42') will be > converted to a non-fixed length array. Trying to convert that back to numpy > will fail, since there is no such conversion. > It makes more sense to convert dtype='S42', to an arrow array with > pyarrow.binary(42) type. As I do in: > https://github.com/maartenbreddels/vaex/blob/4b4facb64fea9f83593ce0f0b82fc26ddf96b506/packages/vaex-arrow/vaex_arrow/convert.py#L4 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3686) Support for masked arrays in to/from numpy
Maarten Breddels created ARROW-3686: --- Summary: Support for masked arrays in to/from numpy Key: ARROW-3686 URL: https://issues.apache.org/jira/browse/ARROW-3686 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.11.1 Reporter: Maarten Breddels Again, in this PR for vaex: [https://github.com/maartenbreddels/vaex/pull/116] I support masked arrays; it would be nice if this went into pyarrow. If this approach looks good, I could do a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
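The conversion being requested maps naturally onto Arrow's values-plus-validity model; a stdlib sketch of the idea (the function name is illustrative, not the pyarrow API):

```python
def masked_to_arrow_like(data, mask):
    """Split (data, mask) into (values, validity).
    In numpy.ma convention mask=True means MISSING; Arrow's validity
    bitmap uses True for VALID, so the mask must be inverted."""
    validity = [not m for m in mask]
    values = [d if v else None for d, v in zip(data, validity)]
    return values, validity

values, validity = masked_to_arrow_like([1, 2, 3], [False, True, False])
assert values == [1, None, 3]
assert validity == [True, False, True]
```

The inverted-mask convention is the main pitfall when wiring masked arrays into an Arrow builder.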
[jira] [Commented] (ARROW-3685) [Python] Use fixed size binary for NumPy fixed-size string dtypes
[ https://issues.apache.org/jira/browse/ARROW-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16671515#comment-16671515 ] Maarten Breddels commented on ARROW-3685: - Would you say this needs a change in to_pandas_dtype, or should it be an exception for numpy? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3685) Better roundtrip between numpy and arrow binary array
Maarten Breddels created ARROW-3685: --- Summary: Better roundtrip between numpy and arrow binary array Key: ARROW-3685 URL: https://issues.apache.org/jira/browse/ARROW-3685 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.11.1 Reporter: Maarten Breddels I'm working on getting support for arrow in vaex (out of core dataframe library for Python) in this PR: [https://github.com/maartenbreddels/vaex/pull/116] Fixed-length binary arrays from numpy (say dtype='S42') will be converted to a non-fixed-length array. Trying to convert that back to numpy will fail, since there is no such conversion. It makes more sense to convert dtype='S42' to an arrow array with pyarrow.binary(42) type, as I do in: https://github.com/maartenbreddels/vaex/blob/4b4facb64fea9f83593ce0f0b82fc26ddf96b506/packages/vaex-arrow/vaex_arrow/convert.py#L4 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
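The reason a fixed-size binary type roundtrips losslessly with numpy's 'S&lt;width&gt;' dtype can be shown in a plain-Python sketch of the layout (no numpy or pyarrow here; purely illustrative): with a shared width there is no offsets buffer, every element sits at a computable position in one contiguous buffer, exactly like numpy's fixed-width string data.

```python
# Fixed-size binary layout: the i-th element occupies bytes
# [i*width, (i+1)*width) of one contiguous buffer, so no offsets are
# needed and the numpy 'S<width>' buffer can be reused as-is.
width = 3
values = [b"foo", b"bar", b"baz"]
buf = b"".join(values)  # contiguous, like numpy dtype='S3' data

def get(buf, width, i):
    """Slice the i-th fixed-width element back out of the buffer."""
    return buf[i * width:(i + 1) * width]

assert [get(buf, width, i) for i in range(3)] == values
```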
[jira] [Created] (ARROW-3669) pyarrow swallows big-endian arrays without converting or error msg
Maarten Breddels created ARROW-3669: --- Summary: pyarrow swallows big-endian arrays without converting or error msg Key: ARROW-3669 URL: https://issues.apache.org/jira/browse/ARROW-3669 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.11.1 Reporter: Maarten Breddels Attachments: Screen Shot 2018-11-01 at 09.10.48.png I've been playing around getting vaex to support arrow, and it's been going really well, except for some corner cases. I expect {code:java} import numpy as np import pyarrow as pa np_array = np.arange(10, dtype='>f8') pa.array(np_array) {code} to give an error, or show proper values; instead I get: !Screen Shot 2018-11-01 at 09.10.48.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
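The garbage values in the screenshot are what you get when big-endian bytes are reinterpreted as little-endian without a swap; a stdlib sketch of the failure mode (illustrative, independent of pyarrow):

```python
import struct

x = 1.5
be = struct.pack(">d", x)           # big-endian bytes, as in dtype='>f8'
as_le = struct.unpack("<d", be)[0]  # wrongly reinterpreted as little-endian

assert as_le != x                        # silent corruption, no error raised
assert struct.unpack(">d", be)[0] == x   # honoring byte order recovers x
```

This is why either an explicit byteswap on ingest or a clear error is preferable to silently accepting the buffer.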