Re: zero-copy Take?

2023-03-28 Thread John Muehlhausen
pecific case. > > Best, > > Will > > On Tue, Mar 28, 2023 at 10:14 AM John Muehlhausen wrote: > > > Is there a way to pass a RecordBatch (or a batch wrapped as a Table) to > > Take and get back a Table composed of in-place (zero copy) slices of the > > input?

zero-copy Take?

2023-03-28 Thread John Muehlhausen
Is there a way to pass a RecordBatch (or a batch wrapped as a Table) to Take and get back a Table composed of in-place (zero copy) slices of the input? I suppose this is not too hard to code, just wondered if there is already a utility. Result Take(const Datum& values, const Datum& indices,

[Java] VectorSchemaRoot? batches->table

2022-12-12 Thread John Muehlhausen
Hello, pyarrow.Table from_batches(batches, Schema schema=None) Construct a Table from a sequence or iterator of Arrow RecordBatches. What is the equivalent of this in Java? What is the relationship between VectorSchemaRoot, Table and RecordBatch in Java? It all seems a bit different...

Re: Array::GetValue ?

2022-11-30 Thread John Muehlhausen
ytes(int64_t index) > > > > > > > > > I think this would be problematic for Boolean? > > > > > > On Tue, Nov 15, 2022 at 11:01 AM John Muehlhausen wrote: > > > > > >> If that covers primitive and binary(string) types, that would wor

Re: Array::GetValue ?

2022-11-15 Thread John Muehlhausen
If that covers primitive and binary(string) types, that would work for me. On Tue, Nov 15, 2022 at 13:50 Antoine Pitrou wrote: > > Then perhaps we can define a method: > > std::string_view FlatArray::GetValueBytes(int64_t index) > > ? > > > On 15/11/2022 at 19:3

Re: Array::GetValue ?

2022-11-15 Thread John Muehlhausen
r place for this method if there is > > consensus on adding it. > > > > Cheers, > > Micah > > > > [1] > > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/array_base.h#L219 > > > > On Mon, Nov 14, 2022 at 11:46 AM John Muehlha

Array::GetValue ?

2022-11-14 Thread John Muehlhausen
There exists: const uint8_t* BaseBinaryArray::GetValue(int64_t i, offset_type* out_length) const What about adding: const uint8_t* Array::GetValue(int64_t i, offset_type* out_length) const This would allow GetValue to get the untyped bytes/length of any value? E.g. out_length would be set to

C# and -1 null_count

2022-10-20 Thread John Muehlhausen
if (fieldNullCount < 0) { throw new InvalidDataException("Null count length must be >= 0"); // TODO:Localize exception message } Above from Ipc/ArrowReaderImplementation.cs. pyarrow is fine with -1, probably due to the following. It would be

Re: compressed feather v2 "slicing from the middle"

2022-09-22 Thread John Muehlhausen
building /// messages using the encapsulated IPC message, padding bytes may be written /// after a buffer, but such padding bytes do not need to be accounted for in /// the size here. length: long; } On Thu, Sep 22, 2022 at 9:10 AM John Muehlhausen wrote: > Regarding tab=feather.read_table(fn

Re: compressed feather v2 "slicing from the middle"

2022-09-22 Thread John Muehlhausen
e positions of the messages are declared in the file's footer's > "record_batches". > > [1] https://github.com/apache/arrow/blob/master/format/Message.fbs#L87 > > Best, > Jorge > > > On Thu, Sep 22, 2022 at 3:01 AM John Muehlhausen

Re: compressed feather v2 "slicing from the middle"

2022-09-21 Thread John Muehlhausen
On Wed, Sep 21, 2022 at 7:49 PM John Muehlhausen wrote: > The following seems like good news... like I should be able to decompress > just one column of a RecordBatch in the middle of a compressed feather v2 > file. Is there a Python API for this kind of access? C++? > > /// Provi

Re: compressed feather v2 "slicing from the middle"

2022-09-21 Thread John Muehlhausen
/// compression does not yield appreciable savings. BUFFER } On Wed, Sep 21, 2022 at 7:03 PM John Muehlhausen wrote: > ``Internal structure supports random access and slicing from the middle. > This also means that you can read a large file chunk by chunk without > having to pull the wh

compressed feather v2 "slicing from the middle"

2022-09-21 Thread John Muehlhausen
``Internal structure supports random access and slicing from the middle. This also means that you can read a large file chunk by chunk without having to pull the whole thing into memory.'' https://ursalabs.org/blog/2020-feather-v2/ For a compressed v2 file, can I decompress just one column of a

std::string_view?

2022-07-12 Thread John Muehlhausen
error: invalid operands to binary expression ('nonstd::sv_lite::basic_string_view >' and 'basic_string_view') This from val == "str"sv Is there a way to access a util::string_view as a std::string_view other than re-building a std::string_view from data()/size() ? -John

Re: StreamDecoder zero-copy (?) for pre-framed contiguous Messages

2022-07-01 Thread John Muehlhausen
& options, io::InputStream* stream); On Fri, Jul 1, 2022 at 3:18 PM John Muehlhausen wrote: > If I call `Consume(std::shared_ptr buffer)` and it is already > pre-framed to contain (e.g.) an entire RecordBatch Message and nothing > else, will it use this Buffer in zero-copy mode when c

StreamDecoder zero-copy (?) for pre-framed contiguous Messages

2022-07-01 Thread John Muehlhausen
If I call `Consume(std::shared_ptr buffer)` and it is already pre-framed to contain (e.g.) an entire RecordBatch Message and nothing else, will it use this Buffer in zero-copy mode when calling my Listener::OnRecordBatchDecoded() implementation? I.e. will data in that RecordBatch refer directly

Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

2022-06-14 Thread John Muehlhausen
l on Linux, and/or interception/auditing > of system pool" on Tue, 14 Jun 2022 09:06:51 -0500, > John Muehlhausen wrote: > > > Hello, > > > > This comment is regarding installation with `apt` on ubuntu 18.04 ... > > `libarrow-dev/bionic,now 8.0.0-1 a

Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

2022-06-14 Thread John Muehlhausen
oc -fno-builtin-__libc_memalign -fno-builtin-__posix_memalign -fno-builtin-operator_new -fno-builtin-operator_delete" cmake --preset ninja-debug-minimal -DARROW_JEMALLOC=OFF -DARROW_MIMALLOC=OFF -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_INSTALL_PREFIX=/usr/local .. On Tue, Jun 14, 2022 at 12:36 PM John Muehl

Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

2022-06-14 Thread John Muehlhausen
My best guess at this moment is that the Arrow lib I'm using was built with a compiler that had something like __builtin_posix_memalign in effect ?? I say this because deploying __builtin_malloc has the same deleterious effect on my own .so On Tue, Jun 14, 2022 at 10:53 AM John Muehlhausen

Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

2022-06-14 Thread John Muehlhausen
uses the system allocator for all non-buffer allocations. > > So, for example, when reading in a large IPC file, the majority of the > > data will be allocated by Arrow's memory pool. However, the schema, > > and the wrapper array object itself will be allocated by the system > >

Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

2022-06-14 Thread John Muehlhausen
I take that back... the preload is not intercepting memory_pool.cc -> SystemAllocator -> AllocateAligned -> posix_memalign (if indeed this is the system allocator path), although it is intercepting posix_memalign from a different .so On Tue, Jun 14, 2022 at 10:27 AM John Muehlhaus

Re: Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

2022-06-14 Thread John Muehlhausen
22 at 9:06 AM John Muehlhausen wrote: > Hello, > > This comment is regarding installation with `apt` on ubuntu 18.04 ... > `libarrow-dev/bionic,now 8.0.0-1 amd64` > > I'm a bit confused about the memory pool situation: > > * I run with `ARROW_DEFAULT_MEMORY_PO

Custom default C++ memory pool on Linux, and/or interception/auditing of system pool

2022-06-14 Thread John Muehlhausen
Hello, This comment is regarding installation with `apt` on ubuntu 18.04 ... `libarrow-dev/bionic,now 8.0.0-1 amd64` I'm a bit confused about the memory pool situation: * I run with `ARROW_DEFAULT_MEMORY_POOL=system` and check that `arrow::default_memory_pool()->backend_name() ==

Create large IPC format record batch(es) in-place without copy or prior data analysis

2021-10-20 Thread John Muehlhausen
Motivation: We have memory-mappable Arrow IPC files with N batches where column(s) are sorted to support binary search. Because log2(n) < log2(n/2)+log2(n/2) and binary search is required on each batch, we prefer the batches to be as large as possible to reduce total search time... perhaps
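The search-cost inequality in this snippet (log2(n) < log2(n/2) + log2(n/2)) can be checked with quick arithmetic. This is a plain Python sketch of the cost model, not Arrow code; `binary_search_cost` is an illustrative name.

```python
import math

def binary_search_cost(batch_sizes):
    # Approximate comparison count: one binary search per batch.
    return sum(math.ceil(math.log2(n)) for n in batch_sizes)

n = 1 << 20  # about a million sorted rows
one_big = binary_search_cost([n])            # one batch of n rows: 20
two_half = binary_search_cost([n // 2] * 2)  # two batches of n/2 rows: 38
assert one_big < two_half
```

Splitting the same rows across more batches strictly increases the total number of comparisons, which is why larger batches are preferred here.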

Re: pyarrow vc++ redistributable?

2020-10-06 Thread John Muehlhausen
to build. > > This is one of many reasons we recommend using conda to organizations > because things like the VS runtime are automatically handled. I'm not > sure if there's a way to equivalently handle this with pip > > On Tue, Oct 6, 2020 at 9:16 AM John Muehlhausen wrote:

pyarrow vc++ redistributable?

2020-10-06 Thread John Muehlhausen
"pip install pyarrow If you encounter any importing issues of the pip wheels on Windows, you may need to install the Visual C++ Redistributable for Visual Studio 2015." http://arrow.apache.org/docs/python/install.html Just now wading into the use of pyarrow on Windows. Users are confused and

Re: [DISCUSS] Format additions for encoding/compression

2020-01-24 Thread John Muehlhausen
> > Cheers, > Micah > > > [1] > > https://15721.courses.cs.cmu.edu/spring2018/papers/22-vectorization2/p31-feng.pdf > [2] https://github.com/apache/arrow/pull/4815 > [3] > > https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#extens

Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread John Muehlhausen
new datatypes there is no separate flag to check? On Thu, Jan 23, 2020 at 1:09 PM Wes McKinney wrote: > On Thu, Jan 23, 2020 at 12:42 PM John Muehlhausen wrote: > > > > Again, I know very little about Parquet, so your patience is appreciated. > > > > At the moment I

Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread John Muehlhausen
compression algorithm where the columnar engine can > benefit from it [1] than marginally improving a file-system-os > specific feature. > > François > > [1] Section 4.3 http://db.csail.mit.edu/pubs/abadi-column-stores.pdf > > > > > On Thu, Jan 23, 2020 at 12:43 PM

Re: [DISCUSS] Format additions for encoding/compression

2020-01-23 Thread John Muehlhausen
, Jan 23, 2020 at 11:23 AM Antoine Pitrou wrote: > > > On 23/01/2020 at 18:16, John Muehlhausen wrote: > > Perhaps related to this thread, are there any current or proposed tools to > > transform columns for fixed-length data types according to a "shuffle?" > >

Re: [DISCUSS] Format additions for encoding/compression (Was: [Discuss] Format additions to Arrow for sparse data and data integrity)

2020-01-23 Thread John Muehlhausen
Perhaps related to this thread, are there any current or proposed tools to transform columns for fixed-length data types according to a "shuffle?" For precedent see the implementation of the shuffle filter in hdf5.

predict whether pa.array() will produce ChunkedArray

2019-12-03 Thread John Muehlhausen
Given input data and a type, how do we predict whether array() will produce ChunkedArray? I figure the formula involves: - the length of input - the type, and max length (to be conservative) for variable length types - some constant(s) that Arrow knows internally... that may change in the future?

Re: [Discuss] Streaming: Differentiate between length of RecordBatch and utilized portion-- common use-case?

2019-10-18 Thread John Muehlhausen
> does modify headers) and then "touch" up the metadata for later analysis, > so it conforms to the specification (and standard libraries can be used). > > [1] https://github.com/apache/arrow/blob/master/format/Message.fbs#L49 > [2] https://github.com/apache/arrow/blob/master/format/

Re: [Discuss] Streaming: Differentiate between length of RecordBatch and utilized portion-- common use-case?

2019-10-17 Thread John Muehlhausen
> > the counter-examples of concrete harm? > > > I'm not sure there is anything obviously wrong, however changes to > semantics are always dangerous. One blemish on the current proposal is > one can't determine easily if a mismatch in row-length is a programming > err

Re: [Discuss] Streaming: Differentiate between length of RecordBatch and utilized portion-- common use-case?

2019-10-16 Thread John Muehlhausen
"that's where the danger lies" What danger? I have no idea what the specific danger is, assuming that all reference implementations have test cases that hedge around this. I contend that it can only be useful and will never be harmful. What are the counter-examples of concrete harm?

Re: [Discuss] Streaming: Differentiate between length of RecordBatch and utilized portion-- common use-case?

2019-10-16 Thread John Muehlhausen
we can go back to how the user ignores the empty/undefined array portions without knowing whether they exist. -John On Wed, Oct 16, 2019 at 10:45 AM Wes McKinney wrote: > On Wed, Oct 16, 2019 at 10:17 AM John Muehlhausen wrote: > > > > "pyarrow is intended as a developer-fac

Re: [Discuss] Streaming: Differentiate between length of RecordBatch and utilized portion-- common use-case?

2019-10-15 Thread John Muehlhausen
fashion and therefore has some unused array elements. The change itself seems relatively simple. What negative consequences do we anticipate, if any? Thanks, -John On Fri, Jul 5, 2019 at 10:42 AM John Muehlhausen wrote: > This seems to help... still testing it though. > > Status GetF

Re: Looking ahead to 1.0

2019-10-15 Thread John Muehlhausen
ARROW-6837 (which, er, includes ARROW-6836) and ARROW-5916 have PRs. Would appreciate some feedback. I will finish the Python part of 6837 when I know I'm on the right track. Thanks, John On Thu, Oct 10, 2019 at 9:54 AM John Muehlhausen wrote: > The format change is ARROW-6836 ...

build-support/update-flatbuffers.sh usage

2019-10-14 Thread John Muehlhausen
I'm missing something about this script. FORMAT_DIR=$CWD/../.. How can any of the fbs files be in ../../ when they are in format/ ?

Re: Looking ahead to 1.0

2019-10-10 Thread John Muehlhausen
h integration tests to prove it. The issues you listed > sound more like C++ library changes to me? > > If you want to propose Format-related changes, that would need to > happen right away otherwise the ship will sail on that. > > - Wes > > On Wed, Oct 9, 2019 at 9:08 PM John M

[jira] [Created] (ARROW-6840) [C++/Python] retrieve fd of open memory mapped file and Open() memory mapped file by fd

2019-10-09 Thread John Muehlhausen (Jira)
John Muehlhausen created ARROW-6840: --- Summary: [C++/Python] retrieve fd of open memory mapped file and Open() memory mapped file by fd Key: ARROW-6840 URL: https://issues.apache.org/jira/browse/ARROW-6840

[jira] [Created] (ARROW-6839) [Java] access File Footer custom_metadata

2019-10-09 Thread John Muehlhausen (Jira)
John Muehlhausen created ARROW-6839: --- Summary: [Java] access File Footer custom_metadata Key: ARROW-6839 URL: https://issues.apache.org/jira/browse/ARROW-6839 Project: Apache Arrow Issue

[jira] [Created] (ARROW-6838) [JS] access File Footer custom_metadata

2019-10-09 Thread John Muehlhausen (Jira)
John Muehlhausen created ARROW-6838: --- Summary: [JS] access File Footer custom_metadata Key: ARROW-6838 URL: https://issues.apache.org/jira/browse/ARROW-6838 Project: Apache Arrow Issue

[jira] [Created] (ARROW-6837) [C++/Python] access File Footer custom_metadata

2019-10-09 Thread John Muehlhausen (Jira)
John Muehlhausen created ARROW-6837: --- Summary: [C++/Python] access File Footer custom_metadata Key: ARROW-6837 URL: https://issues.apache.org/jira/browse/ARROW-6837 Project: Apache Arrow

[jira] [Created] (ARROW-6836) [Format] add a custom_metadata:[KeyValue] field to the Footer table in File.fbs

2019-10-09 Thread John Muehlhausen (Jira)
John Muehlhausen created ARROW-6836: --- Summary: [Format] add a custom_metadata:[KeyValue] field to the Footer table in File.fbs Key: ARROW-6836 URL: https://issues.apache.org/jira/browse/ARROW-6836

Re: uncertain about JIRA issue granularity

2019-10-03 Thread John Muehlhausen
I thought I should open all of the issues for tracking even if I don't implement all of them right away? On Thu, Oct 3, 2019 at 5:46 PM Antoine Pitrou wrote: > > On 04/10/2019 at 00:18, John Muehlhausen wrote: > > I need to create two (or more) issues for > > custom_

Re: arrow::io::MemoryMappedFile from fd rather than path

2019-10-03 Thread John Muehlhausen
Pitrou wrote: > > On 03/10/2019 at 23:21, John Muehlhausen wrote: > > > > Would we just make a variant of Open() that takes a fd rather than a > path? > > That sounds like a good idea. Would you like to open a JIRA and a PR? > > > Would this API have any anal

uncertain about JIRA issue granularity

2019-10-03 Thread John Muehlhausen
I need to create two (or more) issues for custom_metadata in Footer ... https://lists.apache.org/thread.html/c3b3d1456b7062a435f6795c0308ccb7c8fe55c818cfed2cf55f76c5@%3Cdev.arrow.apache.org%3E and memory map based on fd ...

arrow::io::MemoryMappedFile from fd rather than path

2019-10-03 Thread John Muehlhausen
I have a situation where multiple processes need to access a memory mapped file. However, between the time the first process maps the file and the time a subsequent process in the group maps the file, the file may have been removed from the filesystem. (I.e. has no "path") Coordinating the
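The situation described here (the file may be unlinked before a second process maps it) hinges on a POSIX property: an open file descriptor keeps the file alive after `unlink`, so mapping by fd still works. A minimal stdlib-only sketch, assuming a POSIX system (no Arrow involved):

```python
import mmap
import os
import tempfile

# Create a file and keep only the descriptor.
fd, path = tempfile.mkstemp()
os.write(fd, b"arrow data")

os.unlink(path)        # the file no longer has a path on the filesystem...
mm = mmap.mmap(fd, 0)  # ...but the open fd can still be mapped
data = bytes(mm[:10])

mm.close()
os.close(fd)
assert data == b"arrow data"
```

This is the rationale for an `Open()` variant taking a fd rather than a path: the fd can be passed between processes (e.g. over a Unix socket) even when the path is gone.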

[jira] [Created] (ARROW-5916) [C++] Allow RecordBatch.length to be less than array lengths

2019-07-11 Thread John Muehlhausen (JIRA)
John Muehlhausen created ARROW-5916: --- Summary: [C++] Allow RecordBatch.length to be less than array lengths Key: ARROW-5916 URL: https://issues.apache.org/jira/browse/ARROW-5916 Project: Apache

flatbuffers vectors and --gen-object-api

2019-07-05 Thread John Muehlhausen
It seems as if Arrow expects for some vectors to be empty rather than null. (Examples: Footer.dictionaries, Field.children) Anyone using --gen-object-api with flatc will get code that writes null when (e.g.) _o->children.size() is zero in CreateField(). I may be missing something but I don't

Re: [Discuss] Streaming: Differentiate between length of RecordBatch and utilized portion-- common use-case?

2019-07-05 Thread John Muehlhausen
kely malformed"); } const flatbuf::FieldNode* node = nodes->Get(field_index); *//out->length = node->length();* *out->length = metadata_->length();* out->null_count = node->null_count(); out->offset = 0; return Status::OK(); } On Fri, Jul

Re: [Discuss] Streaming: Differentiate between length of RecordBatch and utilized portion-- common use-case?

2019-07-05 Thread John Muehlhausen
So far it seems as if pyarrow is completely ignoring the RecordBatch.length field. More info to follow... On Tue, Jul 2, 2019 at 3:02 PM John Muehlhausen wrote: > Crikey! I'll do some testing around that and suggest some test cases to > ensure it continues to work, assuming that i

Re: [Discuss] Streaming: Differentiate between length of RecordBatch and utilized portion-- common use-case?

2019-07-02 Thread John Muehlhausen
Crikey! I'll do some testing around that and suggest some test cases to ensure it continues to work, assuming that it does. -John On Tue, Jul 2, 2019 at 2:41 PM Wes McKinney wrote: > Thanks for the attachment, it's helpful. > > On Tue, Jul 2, 2019 at 1:40 PM John Muehlhaus

Re: [Discuss] Streaming: Differentiate between length of RecordBatch and utilized portion-- common use-case?

2019-07-02 Thread John Muehlhausen
Attachments referred to in previous two messages: https://www.dropbox.com/sh/6ycfuivrx70q2jx/AAAt-RDaZWmQ2VqlM-0s6TqWa?dl=0 On Tue, Jul 2, 2019 at 1:14 PM John Muehlhausen wrote: > Thanks, Wes, for the thoughtful reply. I really appreciate the > engagement. In order to clarify things a

Re: [Discuss] Streaming: Differentiate between length of RecordBatch and utilized portion-- common use-case?

2019-07-02 Thread John Muehlhausen
nd, length 1 RecordBatches that don't result in a stream that is computationally efficient. On the other hand, adding artificial latency by accumulating events before "freezing" a larger batch and only then making it available to computation. -John On Tue, Jul 2, 2019 at 12:21 PM We

[Discuss] Streaming: Differentiate between length of RecordBatch and utilized portion-- common use-case?

2019-07-02 Thread John Muehlhausen
During my time building financial analytics and trading systems (23 years!), both the "batch processing" and "stream processing" paradigms have been extensively used by myself and by colleagues. Unfortunately, the tools used in these paradigms have not successfully overlapped. For example, an

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

2019-06-30 Thread John Muehlhausen
If there is going to be a breaking change to the IPC format, I'd appreciate some discussion about an idea I had for RecordBatch metadata. I previously promised to create a discussion thread with an initial write-up but have not yet done so. I will try to do this tomorrow. (The basic idea is to

Re: Propose custom_metadata for Footer

2019-06-11 Thread John Muehlhausen
> > > > Note here are the other places where we have such fields: > > > > > > * Field > > > * Schema > > > * Message > > > > > > An alternative solution would be to handle such metadata in a separate > > > file, but I see the

Propose custom_metadata for Footer

2019-05-29 Thread John Muehlhausen
Original write of File: Schema: custom_metadata: {"value":1} Message Message Footer Schema: custom_metadata: {"value":1} Process appends messages (new data in bold): Schema: custom_metadata: {"value":1} Message Message *Message* *Footer* * Schema: custom_metadata: {"value":2}* Re-writing

[jira] [Created] (ARROW-5439) [Java] Utilize stream EOS in File format

2019-05-29 Thread John Muehlhausen (JIRA)
John Muehlhausen created ARROW-5439: --- Summary: [Java] Utilize stream EOS in File format Key: ARROW-5439 URL: https://issues.apache.org/jira/browse/ARROW-5439 Project: Apache Arrow Issue

[jira] [Created] (ARROW-5438) [JS] Utilize stream EOS in File format

2019-05-29 Thread John Muehlhausen (JIRA)
John Muehlhausen created ARROW-5438: --- Summary: [JS] Utilize stream EOS in File format Key: ARROW-5438 URL: https://issues.apache.org/jira/browse/ARROW-5438 Project: Apache Arrow Issue Type

Re: Should EOS be mandatory for IPC File format?

2019-05-24 Thread John Muehlhausen
ybe we can just sort out C++ > for now > > On Wed, May 22, 2019 at 3:03 PM John Muehlhausen wrote: > > > > I added this to https://github.com/apache/arrow/pull/4372 and am hoping > CI > > will test it for me. Do Java/JS require separate JIRA entries? > > > &

Re: memory mapped IPC File of RecordBatches?

2019-05-24 Thread John Muehlhausen
> platforms > > On Wed, May 22, 2019 at 11:02 PM John Muehlhausen wrote: > > > > Well, it works fine on Linux... and the Linux mmap man page seems to > > indicate you are right about MAP_PRIVATE: > > > > "It is unspecified whether changes made to the file after

[Python] Any reason to exclude __lt__ from ArrayValue ?

2019-05-24 Thread John Muehlhausen
We have __eq__ leaning on as_py() already ... any reason not to have __lt__ ? This makes it possible to use bisect to find slices in ordered data without a __getitem__ wrapper: 1176.0 key=pa.array(['AAPL']) 110.0 print(bisect.bisect_left(batch[3],key[0])) 64.0
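The bisect technique in this mail only needs elements to be ordered (hence `__lt__`). A minimal stdlib sketch with a plain sorted list standing in for the sorted pyarrow column (the data is illustrative):

```python
import bisect

# Stand-in for a sorted string column in a record batch.
symbols = ["AAPL", "AAPL", "AAPL", "GOOG", "MSFT", "MSFT"]

key = "AAPL"
lo = bisect.bisect_left(symbols, key)   # index of the first matching row
hi = bisect.bisect_right(symbols, key)  # one past the last matching row
assert (lo, hi) == (0, 3)               # symbols[0:3] is the AAPL slice
```

With `__lt__` defined on array values, the same calls work directly on a pyarrow column without materializing it through a `__getitem__` wrapper.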

Re: Python development setup and LLVM 7 / Gandiva

2019-05-23 Thread John Muehlhausen
ob/master/ci/conda_env_cpp.yml#L31 > > On Thu, May 23, 2019 at 12:53 PM John Muehlhausen wrote: > > > > The pyarrow-dev conda environment does not include llvm 7, which appears > to > > be a requirement for Gandiva. > > > > So I'm just trying to figure out a pain-free

Re: Python development setup and LLVM 7 / Gandiva

2019-05-23 Thread John Muehlhausen
n.rst > > Let us know if that does not work. > > - Wes > > On Wed, May 22, 2019 at 11:02 AM John Muehlhausen wrote: > > > > Set up pyarrow-dev conda environment as at > > https://arrow.apache.org/docs/developers/python.html > > > > Got the following e

Re: memory mapped IPC File of RecordBatches?

2019-05-22 Thread John Muehlhausen
es it work as expected on MacOS. Still odd that the changes are only sometimes visible ... but I guess that is compatible with it being "unspecified." -John On Wed, May 22, 2019 at 8:56 PM John Muehlhausen wrote: > I'll mess with this on various platforms and report back. Thanks

Re: memory mapped IPC File of RecordBatches?

2019-05-22 Thread John Muehlhausen
ld1 > 0 1.0 > 1 NaN > > Now ran dd to overwrite the file contents > > In [14]: batch.to_pandas() > Out[14]: > field1 > 0 NaN > 1 -245785081.0 > > On Wed, May 22, 2019 at 8:34 PM John Muehlhausen wrote: > > > > I don't think that

Re: memory mapped IPC File of RecordBatches?

2019-05-22 Thread John Muehlhausen
(new test attached) On Wed, May 22, 2019 at 8:09 PM John Muehlhausen wrote: > I don't think that is it. I changed my mmap to MAP_PRIVATE in the first > raw mmap test and the dd changes are still visible. I also changed to > storing the stream format instead of the file format and got

memory mapped IPC File of RecordBatches?

2019-05-22 Thread John Muehlhausen
Is there an example somewhere of referring to the RecordBatch data in a memory-mapped IPC File in a zero-copy manner? I tried to do this in Python and must be doing something wrong. (I don't really care whether the example is Python or C++) In the attached test, when I get to the first prompt

Re: Should EOS be mandatory for IPC File format?

2019-05-22 Thread John Muehlhausen
or/ipc/ArrowFileWriter.java#L67 > > On Wed, May 22, 2019 at 12:24 PM John Muehlhausen wrote: > > > > https://github.com/apache/arrow/pull/4372 > > > > First contribution attempt... sorry in advance if I'm not coloring inside > > the lines! > > >

[jira] [Created] (ARROW-5395) Utilize stream EOS in File format

2019-05-22 Thread John Muehlhausen (JIRA)
John Muehlhausen created ARROW-5395: --- Summary: Utilize stream EOS in File format Key: ARROW-5395 URL: https://issues.apache.org/jira/browse/ARROW-5395 Project: Apache Arrow Issue Type

Python development setup and LLVM 7 / Gandiva

2019-05-22 Thread John Muehlhausen
Set up pyarrow-dev conda environment as at https://arrow.apache.org/docs/developers/python.html Got the following error. I will disable Gandiva for now but I'd like to get it back at some point. I'm on Mac OS 10.13.6. CMake Error at cmake_modules/FindLLVM.cmake:33 (find_package): Could not

Re: Should EOS be mandatory for IPC File format?

2019-05-22 Thread John Muehlhausen
"wrong". > > On Wed, May 22, 2019 at 8:37 AM John Muehlhausen wrote: > > > I believe the change involves updating the File format notes as above, as > > well as something like the following. The format also mentions "there is > > no requiremen

Re: Should EOS be mandatory for IPC File format?

2019-05-22 Thread John Muehlhausen
s like a reasonable change. Is there any reason that we shouldn't > always append EOS? > > On Tuesday, May 21, 2019, John Muehlhausen wrote: > > > Wes, > > > > Check out reader.cpp. It seg faults when it gets to the next > > message-that-is-not-a-message... it is a foo

Re: Should EOS be mandatory for IPC File format?

2019-05-21 Thread John Muehlhausen
> > https://github.com/apache/arrow/blob/6f80ea4928f0d26ca175002f2e9f511962c8b012/cpp/src/arrow/ipc/message.cc#L281 > > If the end of the byte stream is reached, or EOS (0) is encountered, > then the stream reader stops iteration. > > - Wes > > On Tue, May 21, 2019 at 4:34 PM J

Should EOS be mandatory for IPC File format?

2019-05-21 Thread John Muehlhausen
https://arrow.apache.org/docs/format/IPC.html#file-format If this stream marker is optional in the file format, doesn't this prevent someone from reading the file without being able to seek() it, e.g. if it is "piped in" to a program? Or otherwise they'll have to stream in the entire thing

Re: Pyarrow filter/sort/bsearch

2019-05-13 Thread John Muehlhausen
19 at 8:36 AM Wes McKinney > > wrote: > > > > > > > hi John -- I'd recommend implementing these capabilities as Kernel > > > > functions under cpp/src/arrow/compute, then they can be exposed in > > > > Python easily. > > >

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-13 Thread John Muehlhausen
rties and start designing a proposal (which may > or may not include spec additions). > > Regards > > Antoine. > > > On 13/05/2019 at 15:38, John Muehlhausen wrote: > > Micah, yes, it all works at the moment. How have we staked out that it > > will always wor

Pyarrow filter/sort/bsearch

2019-05-13 Thread John Muehlhausen
Does pyarrow currently support filter/sort/search without conversion to pandas? I don’t see anything but want to be sure. Sorry if I overlooked it. Specific needs: 1- filter an arrow record batch and sort the results into a new batch 2- find slice locations for a sorted batch using binary

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-13 Thread John Muehlhausen
nges to the binary protocol for this use case; if others > > have opinions I'll let them speak for themselves. > > > > - Wes > > > > On Mon, May 13, 2019 at 7:50 AM John Muehlhausen wrote: > > > > > > Any thoughts on a RecordBatch distinguishi

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-13 Thread John Muehlhausen
o that readers know to call "Slice" on the blocks to obtain > only the written-so-far portion. I'm not likely to be in favor of > making changes to the binary protocol for this use case; if others > have opinions I'll let them speak for themselves. > > - Wes > > On

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-13 Thread John Muehlhausen
Any thoughts on a RecordBatch distinguishing size from capacity? (To borrow std::vector terminology) Thanks, John On Thu, May 9, 2019 at 2:46 PM John Muehlhausen wrote: > Wes et al, I think my core proposal is that Message.fbs:RecordBatch split > the "length" parameter into
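The size-vs-capacity distinction borrowed from std::vector can be sketched as follows. This is an illustration of the proposed concept only, not the Arrow RecordBatch API; the class and method names are hypothetical:

```python
class PreallocatedBatch:
    """Fixed-capacity buffer with a separate logical length."""

    def __init__(self, capacity):
        self.capacity = capacity            # slots allocated up front
        self.length = 0                     # rows actually written so far
        self.values = [None] * capacity     # pre-allocated storage

    def append(self, value):
        if self.length >= self.capacity:
            raise ValueError("batch is full")
        self.values[self.length] = value
        self.length += 1

    def valid_slice(self):
        # Readers should see only the utilized portion.
        return self.values[: self.length]

batch = PreallocatedBatch(capacity=4)
batch.append(1.5)
batch.append(2.5)
assert batch.length == 2 and batch.capacity == 4
assert batch.valid_slice() == [1.5, 2.5]
```

The point of the proposal is that a writer can fill the pre-allocated region incrementally while readers consult `length` to ignore the unwritten tail.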

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-09 Thread John Muehlhausen
e case of the file format, while the file is locked, a new RecordBatch would overwrite the previous file Footer and a new Footer would be written. In order to be able to delete or archive old data multiple files could be strung together in a logical series. -John On Tue, May 7, 2019 at 2:39 PM We

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-07 Thread John Muehlhausen
f you'd like to experiment with creating an API for pre-allocating > > fixed-size Arrow protocol blocks and then mutating the data and > > metadata on disk in-place, please be our guest. We don't have the > > tools developed yet to do this for you > >

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-07 Thread John Muehlhausen
ure how to better make my case -John On Tue, May 7, 2019 at 11:02 AM Wes McKinney wrote: > hi John, > > On Tue, May 7, 2019 at 10:53 AM John Muehlhausen wrote: > > > > Wes et al, I completed a preliminary study of populating a Feather file > > incrementally. Some not

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-07 Thread John Muehlhausen
king the project, IMHO that is a dark path > that leads nowhere good. We have a large community here and we accept > pull requests -- I think the challenge is going to be defining the use > case to suitable clarity that a general purpose solution can be > developed. > > - W

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-06 Thread John Muehlhausen
361143586fd77ca08e15e96e2f2a5250ff084b462d6@%3Cdev.arrow.apache.org%3E > > > > > > > > On Mon, May 6, 2019 at 10:39 AM John Muehlhausen wrote: > > > > Wes, > > > > I’m not afraid of writing my own C++ code to deal with all of this on the > >

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-06 Thread John Muehlhausen
d or two separate processes active simultaneously) you'll > > need to build up your own data structures to help with this. > > > > On Mon, May 6, 2019 at 6:28 PM John Muehlhausen wrote: > > > > > Hello, > > > > > > Glad to learn of this project— g

Re: Stored state of incremental writes to fixed size Arrow buffer?

2019-05-06 Thread John Muehlhausen
the > specific pattern you're trying to undertake for building. > > If you're trying to go across independent processes (whether the same > process restarted or two separate processes active simultaneously) you'll > need to build up your own data structures to help with this. >

Stored state of incremental writes to fixed size Arrow buffer?

2019-05-06 Thread John Muehlhausen
Hello, Glad to learn of this project— good work! If I allocate a single chunk of memory and start building Arrow format within it, does this chunk save any state regarding my progress? For example, suppose I allocate a column for floating point (fixed width) and a column for string (variable