Re: Arrow IPC files - number of rows

2022-10-27 Thread Wes McKinney
I think there is an open Jira from several years ago about adding an
optional number-of-rows field to the file footer so that it can be
precomputed and stored rather than requiring the application to look
at all the batch metadata to compute it when needed. This seems like a
harmless addition that should not cause any backward/forward
compatibility issues.
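
In the meantime, two hedged ways to get the row count from Python today (the
path "data.arrow" is a placeholder; count_rows here goes through the dataset
API that Weston mentions below):

```
import pyarrow as pa
import pyarrow.dataset as ds

# Option 1: the dataset API's count_rows.
n_rows = ds.dataset("data.arrow", format="ipc").count_rows()

# Option 2: open the IPC file and sum per-batch row counts from the footer's
# record batch entries. Note that get_batch() also maps the batch body, so
# this is not purely metadata-driven.
with pa.OSFile("data.arrow", "rb") as source:
    reader = pa.ipc.open_file(source)
    n_rows = sum(reader.get_batch(i).num_rows
                 for i in range(reader.num_record_batches))
```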

On Thu, Oct 20, 2022 at 11:30 PM Weston Pace  wrote:
>
> It's not in the footer metadata, but each record batch should have its
> own metadata, and the batch's metadata should contain the # of rows.  So
> you should be able to do it without reading any data.  In pyarrow,
> this *should* be what count_rows is doing but it has been a while
> since I've really dove into that code and I may be remembering
> incorrectly.
>
> Can you use a MessageReader[1]?  I have not used it myself.  I don't
> actually know if it will read the buffer data as well or just the
> metadata.
>
> [1] 
> https://arrow.apache.org/docs/python/generated/pyarrow.ipc.MessageReader.html#pyarrow.ipc.MessageReader
>
> On Thu, Oct 20, 2022 at 9:14 AM Quentin Lhoest  wrote:
> >
> > Hi everyone ! I was wondering:
> > What is the most efficient way to know the number of rows in a dataset of 
> > Arrow IPC files ?
> >
> > I expected each file to have the number of rows as metadata in the footer, 
> > but it doesn’t seem to be the case. Therefore I need to call count_rows() 
> > which is less efficient than reading metadata.
> >
> > Maybe the number of rows can be written as custom_metadata in the footer, 
> > but the writing/reading custom_metadata functions don’t seem to be exposed 
> > in python - if I’m not mistaken.
> >
> > Thanks in advance :)
> >
> > --
> > Quentin


Re: guidance on extension types

2022-09-20 Thread Wes McKinney
hi Chang,

There are a few rough edges here that you've run into:

* It looks like Array.to_numpy does not "automatically lower" to the
storage type when trying to convert to NumPy format. In the absence of
some other conversion rule, converting to the storage type seems like
a reasonable alternative to failing. Can you open a Jira issue about
this? This could probably be fixed easily in time for the 10.0.0
release, much more easily than the next issue. (A sketch of the
storage-type workaround follows after these notes.)

* On the query, it looks like the filter portion at least is being
handled by Arrow/Acero — the syntax / UX relating to nested types here
is relatively unexplored compared to non-nested types. Here, comparing
the label type (itself a list of dictionary-encoded strings) to a
string seems invalid; you would probably need to check for inclusion
of the string in the label list-of-strings. I do not know what the
syntax for this would be with DuckDB (to check for inclusion of a
string in a list of strings), but in principle this is something that
should be able to be made to work with some effort.
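
A minimal, self-contained sketch of that storage-type workaround, using a
simplified flat version of Chang's Label type (the class name and example
values are made up; PyExtensionType was the Python-level extension mechanism
in pyarrow at the time of this thread):

```
import pyarrow as pa

class LabelType(pa.PyExtensionType):
    def __init__(self):
        # Simplified: dictionary-encoded strings as the storage type.
        super().__init__(pa.dictionary(pa.int32(), pa.string()))

    def __reduce__(self):
        return LabelType, ()

storage = pa.array(["person", "car", "person"]).dictionary_encode()
labels = pa.ExtensionArray.from_storage(LabelType(), storage)

# Dropping down to the storage array sidesteps the missing
# extension-type -> NumPy/pandas conversion path.
print(labels.storage.to_pylist())
```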

- Wes

On Sun, Sep 18, 2022 at 8:23 PM Chang She  wrote:
>
> Hey y'all, thanks in advance for the discussion.
>
> I'm creating Arrow extensions for computer vision and I'm running into issues 
> in two scenarios. I couldn't find the answers in the archive so I thought I'd 
> post here.
>
> Example:
> I make an extension type called "Label" that has storage type 
> "dictionary". This is an object detection dataset so each row 
> > represents an image and has multiple detected objects that need to be 
> labeled. So there's a "name" column that is "list":
>
> Example table schema:
> image_id: int
> uri: string
> label: list   # list>  storage type
>
>
> Problems:
> 1. `to_numpy` does not seem to work with a nested column. e.g., if I try to 
> call `to_numpy` on the `label` column, then I get "Not implemented type for 
> Arrow list to pandas: extension>"
> 2. If I'm querying this dataset using duckdb, running "select * from dataset 
> where label='person'" results in: "Function 'equal' has no kernel matching 
> input types (extension>, string)"
>
> Am I missing an alternate path to make this work with extension types?
> Does implementing this in Arrow consist of checking if something is an 
> extension type and if so, use the storage type instead? Is this something 
> that's already on the roadmap at all?
>
> Thanks!
>
> Chang She


Re: [cpp] Alignment and Padding

2022-08-05 Thread Wes McKinney
The alignment and padding recommendations from the columnar format
document only apply to memory buffers allocated to represent Arrow
vectors/arrays, not for memory allocation for other data structures /
classes / objects more generally in any of the libraries.

The rest of the C++ codebase delegates most memory allocation for C++
objects to the language defaults / STL defaults — so if the STL by
default (e.g. std::vector, std::shared_ptr and APIs like make_shared)
is in conflict with -qopt-assume-safe-padding then you have your
answer. We do have an STL-compatible allocator wrapper that could be
used to make all memory allocations aligned and padded, but it would
be a hardship to refactor the codebase to use it consistently in all
places where memory is allocated by the STL.

On Fri, Aug 5, 2022 at 12:42 AM James  wrote:
>
> Perhaps the thing I’m misunderstanding is that the compiler flag in question 
> only pertains to data loaded in an AVX register?
>
> On Thu, Aug 4, 2022 at 9:14 PM James  wrote:
>>
>> In the columnar format doc, it is noted that buffers ought to be allocated 
>> such that they're 64 byte aligned and padded. It is also noted that this 
>> allows the use of compiler options such as -qopt-assume-safe-padding. 
>> However, my understanding of -qopt-assume-safe-padding is that all heap 
>> allocations need to be 64 byte aligned and padded. Is this consistent with 
>> the use of std::make_shared?  Perhaps I'm missing something, but it looks 
>> like there are many heap allocations throughout the cpp code that are not 64 
>> byte aligned and padded. Is it the intention that only the columnar buffer 
>> data be aligned and padded? If yes, then does the use of std::make_shared 
>> make use of -qopt-assume-safe-padding not possible?


Re: [C++] Scalar cast of MAP and LIST to STRING

2022-07-27 Thread Wes McKinney
hi Louis -- it's hard to say. Since this project is developed by
volunteers, the issue would generally be completed by someone who
needs the feature. If you would like to try to submit a PR for this,
it may not be a large amount of work to implement.

- Wes

On Wed, Jul 27, 2022 at 7:47 AM Louis C  wrote:
>
> Hello,
>
> Thanks for your answer and support. Do you have any rough idea when this will 
> be available ?
>
> Thanks
> Louis
> 
> De : David Li 
> Envoyé : mardi 26 juillet 2022 17:22
> À : dl 
> Objet : Re: [C++] Scalar cast of MAP and LIST to STRING
>
> Hi Louis,
>
> It just happens they aren't implemented, I don't think there's any particular 
> reason. I've filed ARROW-17214 [1] to track this.
>
> [1]: https://issues.apache.org/jira/browse/ARROW-17214
>
> -David
>
> On Tue, Jul 26, 2022, at 05:19, Louis C wrote:
>
> Hello,
>
> I use scalar casts to the STRING type as a way to easily represent the more 
> complex Arrow types such as STRUCT or UNION. For these types, the Scalar cast 
> (CastTo method) is correctly implemented, giving results such as: "union{1: 
> bool = false}"
> However, when casting LIST, LARGE_LIST or MAP types, it gives a NotImplemented 
> error.
> Is there a particular reason why this is the case?
> Is it planned to make this work?
>
> Best regards
> Louis C
>
>


Re: Arrow compute/dataset design doc missing

2022-07-06 Thread Wes McKinney
I definitely think that adding runtime SIMD dispatching for arithmetic
functions (and using xsimd to write generic kernels that can be
cross-compiled for different SIMD targets, e.g. AVX2, AVX-512, NEON) is a
good idea, and hopefully it will be pretty low-hanging fruit (~a day or
two of work) to make applications faster across the board.
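
As a practical aside for the benchmark discussion below, recent pyarrow builds
can report which SIMD level they were compiled with and what the CPU supports;
a minimal check (attribute names as exposed by pyarrow's runtime_info, treat
as illustrative):

```
import pyarrow as pa

info = pa.runtime_info()
print(info.simd_level)           # SIMD level this build was compiled to use
print(info.detected_simd_level)  # SIMD level detected on the current CPU
```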

On Thu, Jun 9, 2022 at 1:45 PM Sasha Krassovsky
 wrote:
>
> To add on to Weston’s response, the only SIMD that will ever be generated for 
> the kernels by compilers at the moment is SSE4.2; they will not generate 
> AVX2, as we have not set up the compiler flags to do that. Also, the way the 
> code is written doesn’t seem super easy for the compiler to vectorize, and 
> needs a bit of massaging. I do plan to tackle this at some point after the 
> per-kernel overhead work that Weston mentioned is complete.
>
> Sasha Krassovsky
>
> > On Jun 9, 2022, at 8:42 AM, Shawn Yang  wrote:
> >
> > I see, thanks. I'll do more tests and dive into more arrow compute code.
> >
> > Sent from my iPhone
> >
> >> On Jun 9, 2022, at 5:30 PM, Weston Pace  wrote:
> >>
> >> 
> >>>
> >>> Hi, do you guys know which functions support vectorized SIMD in arrow 
> >>> compute?
> >>
> >> I don't know that anyone has done a fully systematic analysis of which
> >> kernels support and do not support SIMD at the moment.  The kernels
> >> are still in flux.  There is an active effort to reduce overhead[1]
> >> which is the top priority as this could possibly have more impact on
> >> performance than SIMD when running expressions involving multiple
> >> kernels across multiple threads.
> >>
> >>> I only found very little functions support vectorized SIMD:
> >>> ● bloom filter: avx2
> >>> ● key compare: avx2
> >>> ● key hash: avx2
> >>> ● key map: avx2
> >>>
> >>> Does scalar operation support vectorized SIMD?
> >>
> >> A lack of explicit vectorization instructions does not mean a lack of
> >> SIMD support.  For many kernels we expect modern compilers to be smart
> >> enough to automatically implement vectorization as long as the data is
> >> provided in a vectorized fashion (e.g. columnar) and the kernel is
> >> simple enough.  For more complex kernels there are options such as
> >> xsimd but this hasn't yet been very thoroughly explored.  At the
> >> moment I'm not aware of anyone writing explicitly vectorized kernels
> >> as this tends to be rather hardware specific and have a small return
> >> on investment.  Instead, we benchmark regularly and have
> >> micro-optimized certain critical sections (e.g. some of the hash
> >> stuff).
> >>
> >>> I tested with numpy and found arrow is ten times slower:
> >>
> >> That result you posted appears to be 3.5x slower.  You might want to
> >> double check and ensure that Arrow was compiled with the appropriate
> >> architecture (the cmake files are generally good at figuring this out)
> >> but I wouldn't be too surprised if this was the case.  Some of this
> >> might be unavoidable.  For example, does numpy support null values (I
> >> don't know for sure but I seem to recall it does not)?  Some of this
> >> might be an inefficiency or overhead problem in Arrow-C++.  It is
> >> possible that the add kernel is not being vectorized correctly by the
> >> compiler but I don't think those numbers alone are enough proof of
> >> that.
> >>
> >> Performance can be quite tricky.  It is important for us but Arrow's
> >> compute functionality is still relatively new compared to numpy and
> >> work on performance is balanced with work on features.
> >>
> >> [1] https://lists.apache.org/thread/rh10ykcolt0gxydhgv4vxk2m7ktwx5mh
> >>
> >>> On Wed, Jun 8, 2022 at 11:08 PM Shawn Yang  
> >>> wrote:
> >>>
> >>> Hi, do you guys know which functions support vectorized SIMD in arrow 
> >>> compute? After a quick look as arrow compute cpp code, I only found very 
> >>> little functions support vectorized SIMD:
> >>> ● bloom filter: avx2
> >>> ● key compare: avx2
> >>> ● key hash: avx2
> >>> ● key map: avx2
> >>>
> >>> Does scalar operation support vectorized SIMD?
> >>>
> >>> I tested with numpy and found arrow is ten times slower:
> >>>
> >>> def test_multiply(rows=500):
> >>>   a = pa.array(list(range(rows, 0, -1)))
> >>>   b = pa.array(list(range(rows)))
> >>>   import pyarrow.compute as pc
> >>>
> >>>   print("arrow multiply took", timeit.timeit(
> >>>   lambda: pc.multiply(a, b), number=3))
> >>>   a = np.array(list(range(rows, 0, -1)))
> >>>   b = np.array(list(range(rows)))
> >>>   print("numpy multiply took", timeit.timeit(
> >>>   lambda: a * b, number=3))
> >>>   # arrow multiply took 0.1482605700015
> >>>   # numpy multiply took 0.0404705130071
> >>>
> >>>
>  On Wed, May 25, 2022 at 10:09 PM Shawn Yang  
>  wrote:
> 
>  I see, the key for multiple loop is to ensure the data can be hold in l2 
>  cache, so that later
>  calculation can process this batch without reading from the main memory, 
>  and we can record the exec stats for every batch , and do better local 

Re: [Python] Pyarrow Computation inplace?

2022-06-01 Thread Wes McKinney
> we may need to consider c++ implementations to get full benefit.

The way that you are executing your expression is not the way that we
intend users to write performant code.

The intended way is to build an expression and then execute the
expression — the current expression evaluator is not very efficient
(it does not reuse temporary allocations yet), but that is where the
optimization work should happen rather than in custom C++ code.

I'm struggling to find documentation for this right now (expressions
aren't discussed in https://arrow.apache.org/docs/python/compute.html)
but there are many examples in the unit tests:

https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_compute.py
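
For Cedric's range-partition example below, a sketch of what the
expression-based route could look like via the dataset API (the path and
output column names are placeholders; expression capabilities vary across
pyarrow versions):

```
import pyarrow.dataset as ds

dataset = ds.dataset("/path/to/file", format="parquet")

# The expression is only evaluated when to_table() runs, so the engine
# controls how intermediates are materialized rather than Python-level code.
lt_bil = (ds.field("username") < "Bil").cast("int8")

table = dataset.to_table(columns={
    "username": ds.field("username"),
    "lt_Bil": lt_bil,
})
```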

On Tue, May 31, 2022 at 11:25 PM Cedric Yau  wrote:
>
> I have code like below to range partition a file.  It looks like each of the 
> pc.less, pc.cast, and pc.add allocates new arrays.  So the code appears to be 
> spending more time performing memory allocations than it is doing the 
> comparisons.  The performance is still pretty good (and faster than the 
> alternatives), but it does make me think as we start shifting more 
> calculations to arrow, we may need to consider c++ implementations to get 
> full benefit.
>
> Thanks,
> Cedric
>
> import pyarrow.parquet as pq
> import pyarrow.compute as pc
>
> t = pq.read_table('/path/to/file', columns=['username'])
> ta = t.column('username')
>
> output_file = pc.cast( pc.less(ta,'Bil'), 'int8')
> output_file = pc.add(output_file, pc.cast( pc.less(ta,'Cou'), 'int8'))
> output_file = pc.add(output_file, pc.cast( pc.less(ta,'Eve'), 'int8'))
> output_file = pc.add(output_file, pc.cast( pc.less(ta,'Ish'), 'int8'))
> output_file = pc.add(output_file, pc.cast( pc.less(ta,'Kib'), 'int8'))
> output_file = pc.add(output_file, pc.cast( pc.less(ta,'Mat'), 'int8'))
> output_file = pc.add(output_file, pc.cast( pc.less(ta,'Pat'), 'int8'))
> output_file = pc.add(output_file, pc.cast( pc.less(ta,'Sco'), 'int8'))
> output_file = pc.add(output_file, pc.cast( pc.less(ta,'Tok'), 'int8'))
> output_file = pc.coalesce(output_file,9)
>
> On Tue, May 31, 2022 at 2:25 PM Weston Pace  wrote:
>>
>> I'd be more interested in some kind of buffer / array pool plus the
>> ability to specify an output buffer for a kernel function.  I think it
>> would achieve the same goal (avoiding allocation) with more
>> flexibility (e.g. you wouldn't have to overwrite your input buffer).
>>
>> At the moment though I wonder if this is a concern.  Jemalloc should
>> do some level of memory reuse.  Is there a specific performance issue
>> you are encountering?
>>
>> On Tue, May 31, 2022 at 11:45 AM Wes McKinney  wrote:
>> >
>> > *In principle*, it would be possible to provide mutable output buffers
>> > for a kernel's execution, so that input and output buffers could be
>> > the same (essentially exposing the lower-level kernel execution
>> > interface that underlies arrow::compute::CallFunction). But this would
>> > be a fair amount of development work to achieve. If there are others
>> > interested in exploring an implementation, we could create a Jira
>> > issue.
>> >
>> > On Sun, May 29, 2022 at 3:04 PM Micah Kornfield  
>> > wrote:
>> > >
>> > > I think even in cython this might be difficult as Array data structures 
>> > > are generally considered immutable, so this is inherently unsafe and 
>> > > requires care.
>> > >
>> > > On Sun, May 29, 2022 at 11:21 AM Cedric Yau  wrote:
>> > >>
>> > >> Suppose I have an array with 1MM integers and I add 1 to them with 
>> > >> pyarrow.compute.add.  It looks like a new array is assigned.
>> > >>
>> > >> Is there a way to do this inplace?  It looks like a new array is 
>> > >> allocated.  Would cython be required at this point?
>> > >>
>> > >> ```
>> > >> import pyarrow as pa
>> > >> import pyarrow.compute as pc
>> > >>
>> > >> a = pa.array(range(100))
>> > >> print(id(a))
>> > >> a = pc.add(a,1)
>> > >> print(id(a))
>> > >>
>> > >> # output
>> > >> # 139634974909024
>> > >> # 139633492705920
>> > >> ```
>> > >>
>> > >> Thanks,
>> > >> Cedric
>
>
>


Re: [Python] Pyarrow Computation inplace?

2022-05-31 Thread Wes McKinney
*In principle*, it would be possible to provide mutable output buffers
for a kernel's execution, so that input and output buffers could be
the same (essentially exposing the lower-level kernel execution
interface that underlies arrow::compute::CallFunction). But this would
be a fair amount of development work to achieve. If there are others
interested in exploring an implementation, we could create a Jira
issue.

On Sun, May 29, 2022 at 3:04 PM Micah Kornfield  wrote:
>
> I think even in cython this might be difficult as Array data structures are 
> generally considered immutable, so this is inherently unsafe and requires 
> care.
>
> On Sun, May 29, 2022 at 11:21 AM Cedric Yau  wrote:
>>
>> Suppose I have an array with 1MM integers and I add 1 to them with 
>> pyarrow.compute.add.  It looks like a new array is assigned.
>>
>> Is there a way to do this inplace?  It looks like a new array is allocated.  
>> Would cython be required at this point?
>>
>> ```
>> import pyarrow as pa
>> import pyarrow.compute as pc
>>
>> a = pa.array(range(100))
>> print(id(a))
>> a = pc.add(a,1)
>> print(id(a))
>>
>> # output
>> # 139634974909024
>> # 139633492705920
>> ```
>>
>> Thanks,
>> Cedric


June 23 virtual conference to highlight work in the Arrow ecosystem

2022-05-13 Thread Wes McKinney
hi all,

My employer (Voltron Data) is organizing a free virtual conference on
June 23 to highlight development work and usage of Apache Arrow — you
can register for this or apply to give a talk here:

https://thedatathread.com/

We are especially interested in hearing from users (as opposed to only
project developers/contributors!) about how they are using Arrow in
their downstream applications. If you would be interested in speaking
(talks will be pre-recorded, so you don't need to be available on June
23), please apply to give a short talk (~15 min) on the website!

Thanks,
Wes


Re: Unsubscribe

2022-03-28 Thread Wes McKinney
To unsubscribe, you can e-mail user-unsubscr...@arrow.apache.org

On Mon, Mar 28, 2022 at 5:28 AM Alina Valea  wrote:
>
>


Re: [Python] Performance rapidly decreases when reading nested structs

2022-03-25 Thread Wes McKinney
hi Partha — in the examples you gave:

* Simple struct: 2 string fields, 3 primitive/numeric fields
* Complex struct: 9 string fields, 4 primitive/numeric fields

I would guess that the larger number of binary/string fields and
overall size of the schema (5 vs 13 fields) is influencing the
decoding time more than the nesting level. That said, the more
deeply-nested case has not been optimized as much as the shallow/flat
case.

On Mon, Mar 14, 2022 at 9:59 AM Partha Dutta  wrote:
>
> I've been trying to understand some slowness in my application. I am reading 
> data from Azure ADLS using fsspec, and I am finding that reading columns that 
> have nested structs is much slower.
>
> The file is about 1GB in size, and I am reading a single row group from the 
> file (approximately 453,000 records)
>
> I tried with different column types, and these are the execution times that I 
> observed to read a single row group, and a single column:
>
> timestamp column: 0.468 seconds
> simple struct (no nesting, 5 fields): 0.672 seconds
> nested struct (3 levels of nesting): 4.12 seconds
>
> This is the parquet definition of the simple struct:
> optional group field_id=-1 device {
> optional binary field_id=676 typeIDService (String);
> optional binary field_id=677 typeID (String);
> optional int32 field_id=678 screenWidth;
> optional int32 field_id=679 screenHeight;
> optional int32 field_id=680 colorDepth;
>   }
>
> And this is the nested struct:
> optional group field_id=-1 web {
> optional group field_id=-1 webPageDetails {
>   optional binary field_id=59 name (String);
>   optional binary field_id=60 server (String);
>   optional binary field_id=61 URL (String);
>   optional boolean field_id=62 isErrorPage;
>   optional boolean field_id=63 isHomePage;
>   optional binary field_id=64 siteSection (String);
>   optional group field_id=-1 pageViews {
> optional double field_id=66 value;
>   }
> }
> optional group field_id=-1 webReferrer {
>   optional binary field_id=67 type (String);
>   optional binary field_id=68 URL (String);
> }
> optional group field_id=-1 webInteraction {
>   optional binary field_id=69 type (String);
>   optional binary field_id=70 name (String);
>   optional group field_id=-1 linkClicks {
> optional double field_id=73 value;
>   }
>   optional binary field_id=72 URL (String);
> }
>   }
>
> I am curious as to why the performance is so slow.
> --
> Partha Dutta
> partha.du...@gmail.com


Re: [C++][parquet] Multiple build issues from Parquet, Arrow libraries

2022-03-24 Thread Wes McKinney
hi David,

I looked at your SO question, and if you are statically linking, you need
to include all of the transitive dependencies (like Thrift) when linking.
We've tried to make this convenient but creating a "bundle" of the
dependencies as described here:

https://arrow.apache.org/docs/developers/cpp/building.html#statically-linking

If you also link this when linking arrow_static.lib and parquet_static.lib
with your application, it should link properly.

Thanks,
Wes

On Mon, Mar 7, 2022 at 12:18 PM David Griffin 
wrote:

> Hi Shawn,
>
> Thank you for taking the time to reply to my email.  I appreciate the
> links, but the Apache Arrow docs were what I referenced in my first
> attempts.  I then used this question
> 
> on Stack Overflow, as the author gave significantly more detail as to how
> they went about making their build; things like, after using cmake, you
> need to build the INSTALL.vcproj in Visual Studio to get header files and
> .lib's installed to your local directory.  I had no idea this was necessary
> as it was never mentioned on the Apache website and this is the first time
> I have ever needed to build a C++ library. Though the Apache website caters
> to many different types of users using many different environments, it was
> certainly not written for anyone who just has a background in coding but
> not development.
>
> The Python documentation for pyarrow I have found to be perfectly
> serviceable.  I prefer Python as my language of choice (versus mostly R,
> Matlab, Fortran, or Visual Basic) and using the pyarrow package in the past
> is how I know I would like to save my data in the Parquet format. I
> absolutely would use Python for this project, but I need to read a file,
> organize the data, then save as Parquet. My preliminary checks showed that
> C++ was about 5x faster than Python (admittedly, this was for writing a CSV
> file not Parquet, but my only assumption can be that Parquet would be
> similar). I have hundreds of terabytes of data to parse, so that "5x
> faster" is a deal breaker.
>
> Thank you again,
> David
>
> On Sun, Mar 6, 2022 at 11:48 PM Shawn Zeng  wrote:
>
>> I guess you can refer to the docs at
>> https://arrow.apache.org/docs/cpp/build_system.html and
>> https://arrow.apache.org/docs/cpp/parquet.html.
>>
>> When I first used Arrow, I also felt the C++ doc is not as
>> comprehensive as the Python one. You can also use pyarrow since it is just
>> a wrapper on the C++ implementation. It is much simpler and the performance
>> is almost the same.
>>
>> On Fri, Mar 4, 2022 at 4:54 PM David Griffin 
>> wrote:
>>
>>> I hope this email finds you well and that this is the proper address for
>>> my question. First off, let me say I am not a programmer by training and
>>> have very little experience in C++. However, it is the best option for what
>>> I'm doing, which is ultimately writing data in Parquet format. I can save
>>> my data as CSV right now no problem, but I cannot get even the simplest
>>> example code for arrow and/or parquet to work for me.  And unfortunately, I
>>> have no idea what I'm doing when it comes to troubleshooting a manual build
>>> of libraries in C++.  I've tried asking Stack Overflow and Reddit, but so
>>> far, I haven't gotten any responses.  I've read as much as I could find on
>>> those sites and the general internet at large, but I still can't figure out
>>> how to properly access Apache libraries in my own C++ project.
>>>
>>> I'd be happy to go through everything I've done and the issues I am
>>> running into, but I was wondering first if there was any location online
>>> that walks completely through the set up of the Apache Arrow and Apache
>>> Parquet libraries (ideally in Windows 10, but I can make an Ubuntu
>>> partition if necessary) to see if there is something I missed when I did it
>>> myself? Even just a nudge in the right direction would be appreciated.
>>>
>>> Thank you for taking the time to read my email and awesome job on this
>>> technology.
>>>
>>> -David
>>>
>>> --
>>>
>>>
>
> --
>
>


Re: [Python][Parquet]pq.ParquetFile.read faster than pq.read_table?

2022-03-08 Thread Wes McKinney
Since this isn't the first time this specific issue has happened in a
major release, is there a way that a test or benchmark regression
check could be introduced to prevent this category of problem in the
future?

On Thu, Feb 24, 2022 at 9:48 PM Weston Pace  wrote:
>
> Thanks for reporting this.  It seems a regression crept into 7.0.0
> that accidentally disabled parallel column decoding when
> pyarrow.parquet.read_table is called with a single file.  I have filed
> [1] and should have a fix for it before the next release.  As a
> workaround you can use the datasets API directly, this is already what
> pyarrow.parquet.read_table is using under the hood when
> use_legacy_dataset=False.  Or you can continue using
> use_legacy_dataset=True.
>
> import pyarrow.dataset as ds
> table = ds.dataset('file.parquet', format='parquet').to_table()
>
> [1] https://issues.apache.org/jira/browse/ARROW-15784
>
> On Wed, Feb 23, 2022 at 10:59 PM Shawn Zeng  wrote:
> >
> > I am using a public benchmark. The origin file is 
> > https://homepages.cwi.nl/~boncz/PublicBIbenchmark/Generico/Generico_1.csv.bz2
> >  . I used pyarrow version 7.0.0 and pq.write_table api to write the csv 
> > file as a parquet file, with compression=snappy and use_dictionary=true. 
> > The data has ~20M rows and 43 columns. So there is only one row group with 
> > row_group_size=64M as default. The OS is Ubuntu 20.04 and the file is on 
> > local disk.
> >
> > Weston Pace  于2022年2月24日周四 16:45写道:
> >>
> >> That doesn't really solve it but just confirms that the problem is the 
> >> newer datasets logic.  I need more information to really know what is 
> >> going on as this still seems like a problem.
> >>
> >> How many row groups and how many columns does your file have?  Or do you 
> >> have a sample parquet file that shows this issue?
> >>
> >> On Wed, Feb 23, 2022, 10:34 PM Shawn Zeng  wrote:
> >>>
> >>> use_legacy_dataset=True fixes the problem. Could you explain a little 
> >>> about the reason? Thanks!
> >>>
> >>> Weston Pace  于2022年2月24日周四 13:44写道:
> 
>  What version of pyarrow are you using?  What's your OS?  Is the file on 
>  a local disk or S3?  How many row groups are in your file?
> 
>  A difference of that much is not expected.  However, they do use 
>  different infrastructure under the hood.  Do you also get the faster 
>  performance with pq.read_table(use_legacy_dataset=True) as well.
> 
>  On Wed, Feb 23, 2022, 7:07 PM Shawn Zeng  wrote:
> >
> > Hi all, I found that for the same parquet file, using 
> > pq.ParquetFile(file_name).read() takes 6s while 
> > pq.read_table(file_name) takes 17s. How do those two apis differ? I 
> > thought they use the same internals but it seems not. The parquet file 
> > is 865MB, snappy compression and enable dictionary. All other settings 
> > are default, writing with pyarrow.


Re: [Python] weird memory usage issue when reading a parquet file.

2022-01-03 Thread Wes McKinney
By default we use jemalloc as our memory allocator which empirically has
been seen to yield better application performance. jemalloc does not
release memory to the operating system right away; this can be altered by
using a different default allocator (for example, the system allocator may
return memory to the OS right away):

https://arrow.apache.org/docs/cpp/memory.html#overriding-the-default-memory-pool

I expect that the reason that psutil-reported allocated memory is higher in
the last case is because some temporary allocations made during the
filtering process are raising the "high water mark". I believe you can see what
is reported as the peak memory allocation by looking at
pyarrow.default_memory_pool().max_memory()
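
A short sketch of how to inspect these numbers and switch allocators from
Python (treat as illustrative; exact attribute availability depends on the
pyarrow version):

```
import pyarrow as pa

pool = pa.default_memory_pool()
print(pool.backend_name)   # typically "jemalloc" on Linux builds
print(pool.max_memory())   # high-water mark of bytes allocated from this pool

# To try the system allocator instead, set ARROW_DEFAULT_MEMORY_POOL=system
# in the environment before importing pyarrow, or pass pa.system_memory_pool()
# to APIs that accept an explicit memory_pool argument.
```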

On Mon, Dec 20, 2021 at 5:10 AM Yp Xie  wrote:

> Hi guys,
>
> I'm getting this weird memory usage info when I tried to start using
> pyarrow to read a parquet file.
>
> I wrote a simple script to show how much memory is consumed after each
> step.
> the result is illustrated in the table:
>
> row number pa.total_allocated_bytes memory usage by psutil
> without filters 5131100 177M 323M
> with field filter 57340 2041K 323M
> with column pruning 5131100 48M 154M
> with both field filter and column pruning 57340 567K 204M
>
> The weird part is: the total memory usage when I apply both field filter
> and column pruning is *larger* than with only column pruning applied.
>
> I don't know how that happened, do you guys know the reason for this?
>
> thanks.
>
> env info:
>
> platform: Linux-5.4.0-91-generic-x86_64-with-glibc2.10
> distro info: ('Ubuntu', '20.04', 'focal')
> pyarrow: 6.0.1
>
>
> script code:
>
> import pyarrow as pa
> import psutil
> import os
> import pyarrow.dataset as ds
>
> pid = os.getpid()
>
> def show_mem(action: str) -> None:
> mem = psutil.Process(pid).memory_info()[0] >> 20
> print(f"*** memory usage after {action} **")
> print(f"*   {mem}M*")
> print(f"**")
>
> dataset = ds.dataset("tmp/uber.parquet", format="parquet")
> show_mem("read dataset")
> projection = {
> "Dispatching_base_num": ds.field("Dispatching_base_num")
> }
> filter = ds.field("locationID") == 100
> table = dataset.to_table(
> filter=filter,
> columns=projection
> )
> print(f"table row number: {table.num_rows}")
> print(f"total bytes: {pa.total_allocated_bytes() >> 10}K")
> show_mem("dataset.to_table")
>


Re: [Python] Cannot read parquet files from windows shared drive with pyarrow

2021-11-14 Thread Wes McKinney
It sounds like the operation is blocking on the initial metadata
inspection, which is well known to be a problem when the dataset has a
large number of files. We should have some better-documented ways
around this — I don't think it's possible to pass an explicit Arrow
schema to parquet.read_table, but that is something we should look at.
Another option would be to make the simplifying assumption that the schemas
are all the same and so to use a single file to collect the schema and
then validate that each file has the same schema when you actually
read it, rather than inspecting and validating all 10,000 files before
beginning to read them.

I see this discussion of common metadata files in

https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files

But I don't see anything in the documentation about how to use them
for better performance for a many-file dataset. We should also fix
that.
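
In the meantime, a sketch of the "collect the schema from one file and reuse
it" idea via the dataset API (paths are placeholders; this avoids per-file
schema inspection but not the directory listing itself):

```
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Take the schema from a single representative file...
schema = pq.read_schema("//server/share/dataset/part-0.parquet")

# ...and pass it to the dataset so discovery does not need to inspect every
# file's footer before the query runs.
dataset = ds.dataset("//server/share/dataset/", format="parquet", schema=schema)
```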

On Sat, Nov 13, 2021 at 6:27 AM Farhad Taebi
 wrote:
>
> Thanks for the investigation. Your test code works and so does any other 
> where a single parquet file is targeted.
> My dataset is partitioned into 10,000 files, and that seems to be the problem, 
> even if I use a filter that targets only one partition. If I use a small 
> number of partitions, it works.
> It looks like pyarrow tries to locate all file paths in the dataset before 
> running the query, even if only one needs to be known, and since the network 
> drive is slow, it just waits for the response. Wouldn't it be better if a 
> metadata file were created along with the partitions, so the needed 
> paths could be read quickly instead of asking the OS every time?
> I don't know if my thoughts are correct though.
>
> Cheers


Re: Build PyArrow with GPU support

2021-11-10 Thread Wes McKinney
hi Wenlei,

We haven't made any releases with the CUDA capabilities enabled, but
we have tested it periodically so turning it on should be a matter of
setting the flags when building the C++ and Python libraries. It would
be nice to be able to release this functionality in a separate wheel
but there would be some engineering to do to make that possible.

The compute functions only work with CPU buffers for now.

Thanks,
Wes

On Fri, Nov 5, 2021 at 11:03 AM Wenlei Xie  wrote:
>
> Hello Apache Arrow team,
>
> I am trying to use Arrow on GPU device on PyArrow (6.0 installed via pip). 
> But it errored with
>
> > ModuleNotFoundError: No module named 'pyarrow._cuda'
>
> I assume the pip package doesn't enable CUDA feature at build time. I am 
> wondering how to build pyarrow with GPU support from source? -- I only found 
> Building Arrow C++ guide: 
> https://arrow.apache.org/docs/developers/cpp/building.html
>
> Also wondering if Arrow Compute functions are supported on GPU-buffered Array?
>
> Thanks!
>
>


Re: [python] Duplication of data in 'ARROW:schema' metadata?

2021-10-12 Thread Wes McKinney
hi Vasilis,

The Arrow schema is used to restore metadata (like timestamp time
zones) and reconstruct other Arrow types which might otherwise be lost
in the roundtrip (like returning data as dictionary-encoded if it was
written originally that way). This can be disabled disabling the
store_schema option in ArrowWriterProperties

You are right that schema metadata is being duplicated both in the
ARROW:schema and in the Parquet schema-level metadata — I believe this
is a bug and we should fix it either by not storing the Arrow metadata
in the Parquet metadata (only storing the metadata in ARROW:schema) or
dropping the metadata from ARROW:schema and using that only for
restoring data types and type metadata.

https://issues.apache.org/jira/browse/ARROW-14303

Thanks,
Wes

On Tue, Oct 12, 2021 at 4:40 AM Vasilis Themelis  wrote:
>
> Hi,
>
> It looks like pyarrow adds some metadata under 'ARROW:schema' that duplicates 
> the rest of the key-value metadata in the resulting parquet file:
>
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> import base64 as b64
>
> df = pd.DataFrame({'one': [-1, 2], 'two': ['foo', 'bar']})
> table = pa.Table.from_pandas(df)
> pq.write_table(table, "example.parquet")
> metadata = pq.read_metadata("example.parquet").metadata
> print(" All metadata ")
> print(metadata)
> print("")
> print(" ARROW:schema ")
> print(metadata[b'ARROW:schema'])
> print("")
> print(" b64 decoded ")
> print(b64.b64decode(metadata[b'ARROW:schema']))
>
> The above should show the duplication between "All metadata" and "b64 
> decoded" ARROW:schema.
>
> What is the reason for this? Is there a good use for ARROW:schema?
>
> I have used other libraries to write parquet files without an issue and none 
> of them adds the 'ARROW:schema' metadata. I had no issues with reading their 
> output files with pyarrow or similar. As an example, here is the result of 
> writing the same dataframe into parquet using fastparquet:
>
> from fastparquet import write
> write("example-fq.parquet", df)
> print(pq.read_metadata("example-fq.parquet").metadata)
>
> Also, given that this duplication can significantly increase the size of the 
> file when there is a large amount of metadata stored, would it be possible to 
> optionally disable writing 'ARROW:schema' if the output files are still 
> functional?
>
> Vasilis Themelis


Re: [Python] Accessing child array of ListArray

2021-08-27 Thread Wes McKinney
I think it's the ListArray.values property

https://github.com/apache/arrow/blob/master/python/pyarrow/array.pxi#L1673
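
A small illustration of .values versus flatten() (the commentary on when they
differ is a general note, not specific to this thread):

```
import pyarrow as pa

arr = pa.ListArray.from_arrays([0, 2, 2, 4], [1, 2, 3, 4])
print(arr.values)     # the raw child array: [1, 2, 3, 4]
print(arr.flatten())  # logically flattened values; this can differ from .values
                      # when the list array is sliced (non-zero offset) or has
                      # nulls whose backing slots still hold data
```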

On Fri, Aug 27, 2021 at 2:06 PM Micah Kornfield  wrote:
>
> I was looking through the APIs and I couldn't find an explicit method to 
> access the underlying array (is there one)?  It seems like flatten() might do 
> what we want, but I'm not sure if there are any gotchas involved with that?
>
> Thanks,
> Micah


Re: [Python Arrow Flight] Question related with python flight server performance

2021-08-20 Thread Wes McKinney
Flight is ultimately limited by the network bandwidth of the
underlying transport. If you have 40GbE, then you may be able to get
2+GB/second with Flight for uncompressed data. If you enable LZ4 or
ZSTD compression in the server and the data compresses well, you may
be able to exceed in effect the bandwidth of your underlying
transport.

In the article that you referenced, the demonstration shows that
Flight itself does not introduce substantial overhead — there are data
protocols that top out at 200-500MB/s (or less!) no matter how fast
the network is because of data serialization overhead. With Flight
there is very little serialization overhead.

On Fri, Aug 20, 2021 at 5:23 PM Abe Hsu 許 育銘 (abehsu)  wrote:
>
> Hi Micah :
>
> I tested the bandwidth between the US and Asia servers today. The bandwidth is around 
> 50-70 Mbit/s.
>
> 175MB (non compress)
> US <-> Asia will take around 15s
>
> 1.78GB (non compress)
> US <-> Asia will take around 170~180s
>
> 175MB (compress)
> US <-> Asia will take around 4~8s
>
> 1.78GB (compress)
> US <-> Asia will take around 50~70s
>
>
> Do you think that is reasonable?
> If I want to reach 2-3GB/s of throughput, how much bandwidth do you think 
> we would need to have?
>
> Many Thanks,
> Abe
>
>
> On 2021/08/19 18:07:57, Micah Kornfield  wrote:
> > I don't think the performance numbers accounted for high latency links.
> > What bandwidth link do you have between the two servers?  You might try
> > using compression in Arrow.
> >
> > On Thu, Aug 19, 2021 at 10:40 AM Abe Hsu 許 育銘 (abehsu) 
> > wrote:
> >
> > > Micron Confidential
> > >
> > > Hi team:
> > >
> > > I am Abe from Taiwan. This is my first time sending mail to the Apache
> > > community; if I do something wrong, please correct me. I am investigating
> > > using Arrow Flight as a data exchange protocol. I am using Python to
> > > establish a Flight Server, and the performance is a little below my
> > > expectation, so I would like to ask for some suggestions from the team.
> > > I set up the Flight Server in the US, and my Python client code runs in
> > > Asia (e.g. Taiwan).
> > >
> > > I find that if I want to transfer 178MB of data with 1001730 rows from US
> > > to Asia, it takes around 10s. I expected it to take less than 1s. Is
> > > there anything I am missing?
> > >
> > > time python client.py get -c ‘get’
> > >
> > > RangeIndex: 1001731 entries, 0 to 1001730
> > > Data columns (total 16 columns):
> > >  #   Column             Non-Null Count    Dtype
> > > ---  ------             --------------    -------
> > >  0   cmte_id            1001731 non-null  object
> > >  1   cand_id            1001731 non-null  object
> > >  2   cand_nm            1001731 non-null  object
> > >  3   contbr_nm          1001731 non-null  object
> > >  4   contbr_city        1001712 non-null  object
> > >  5   contbr_st          1001727 non-null  object
> > >  6   contbr_zip         1001731 non-null  int64
> > >  7   contbr_employer    988002 non-null   object
> > >  8   contbr_occupation  993301 non-null   object
> > >  9   contb_receipt_amt  1001731 non-null  float64
> > >  10  contb_receipt_dt   1001731 non-null  object
> > >  11  receipt_desc       14166 non-null    object
> > >  12  memo_cd            92482 non-null    object
> > >  13  memo_text          97770 non-null    object
> > >  14  form_tp            1001731 non-null  object
> > >  15  file_num           1001731 non-null  int64
> > > dtypes: float64(1), int64(2), object(13)
> > > memory usage: 122.3+ MB
> > >
> > > real  0m10.405s
> > > user  0m0.297s
> > > sys   0m0.996s
> > >
> > > The reason I have this expectation is that I looked at these articles:
> > >
> > > · https://www.dremio.com/is-time-to-replace-odbc-jdbc
> > > With an average batch size (256K records), the performance of Flight
> > > exceeded 20 Gb/s for a single stream running on a single core.
> > >
> > > · https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/
> > > As far as absolute speed, in our C++ data throughput benchmarks, we are
> > > seeing end-to-end TCP throughput in excess of 2-3GB/s on localhost without
> > > TLS enabled. This benchmark shows a transfer of ~12 gigabytes of data in
> > > about 4 seconds:
> > >
> > > Many Thanks,
> > > Abe
> > >
> > > Micron Confidential
> > >
> >


Re: [Python] Support for MapType with Fields

2021-08-17 Thread Wes McKinney
Seems reasonable. Do you want to open a Jira issue and/or make a PR for this?
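
For context, a quick sketch of what the Python API accepts today versus what
is being asked for (the Field-based call is the requested feature, not
something that works as of this thread):

```
import pyarrow as pa

# What works today: key and item types given as plain DataTypes.
m = pa.map_(pa.string(), pa.int32())

# What the thread asks for (not supported at the time of this thread):
# pa.map_(pa.field("key", pa.string(), nullable=False),
#         pa.field("value", pa.int32()))
```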

On Tue, Aug 17, 2021 at 2:51 AM Jason Reid  wrote:
>
> Currently, the api for instantiating MapType only accepts DataType for the 
> key_type and item_type arguments.  Is it possible to create a MapType by passing 
> Field types instead?  Looking for something similar to what is possible for 
> ListType in python or what seems to be possible from the cpp implementation 
> of MapType (untested).
>
> Thanks for any insights,
> Jason


Re: How to create an Apache Arrow table with ragged/jagged columns

2021-08-04 Thread Wes McKinney
hi Jie — could you clarify what you mean? Arrow has a variable-size
list (array) type [1]

[1]: 
https://arrow.apache.org/docs/format/Columnar.html#variable-size-list-layout
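
As a concrete example, ragged data maps directly onto that variable-size list
type (a minimal sketch):

```
import pyarrow as pa

# Rows of different lengths become a single variable-size list column.
ragged = pa.array([[1, 2, 3], [], [4], [5, 6]])
table = pa.table({"values": ragged})
print(table.schema)   # values: list<item: int64>
```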

On Wed, Aug 4, 2021 at 9:54 AM Jie Ye  wrote:
>
> Hello,
> Does Apache Arrow support creating a table with ragged/jagged columns? 
> If it does, how do I create one?
> Are there any materials about that?
>
>
> Thanks


Re: State of plasma

2021-07-27 Thread Wes McKinney
Plasma was used by Ray in production as is prior to the fork which
took place last year. We have no concrete plans to remove it at this
time, but if you run into a bug (bugs have seemed to creep up mostly
in stress-testing scenarios, and there are some unfixed bug reports in
the issue tracker), you would be on your own right now. It's
conceivable that its development might be sponsored by some
corporation in the future, and I still think that doing shared memory
IPC in this fashion is a good idea.

On Tue, Jul 27, 2021 at 3:46 PM Simon Fischer  wrote:
>
> Thanks a lot for clearifying, Wes!
>
> You said earlier that plasma is effectively deprecated for lack of
> maintenance, and you also say it works well. Will it be removed from
> Arrow? Can it be considered stable (in terms of "production ready
> runtime stability", not API)?
>
> Thanks again
> Simon
>
> Am 27.07.2021 um 21:55 schrieb Wes McKinney:
> > On Tue, Jul 27, 2021 at 1:29 PM Simon Fischer  
> > wrote:
> >> Hi all,
> >>
> >> could someone (Wes?) elaborate on that a bit more? Because the Ray project 
> >> still lists Arrow Plasma as part of their infrastructure, e.g. here:
> > I'm pretty certain this is simply a case of outdated documentation --
> > you could file an issue with Ray and ask them to fix it.
> >
> >> Also, from my understanding plasma is C++ and the client exists for C++, 
> >> and plasma is (still) part of Arrow. But the only documentation on how to 
> >> use plasma with Arrow is for Python [2]. There are some docs for plasma C++ 
> >> here [3], but they do not seem to cover the interaction between Arrow and 
> >> plasma. Can you point me somewhere where I can find something in that 
> >> regard?
> >>
> > At this point reading the C++ headers would be the way to go. Plasma
> > is one aspect of building a distributed data caching solution, so it's
> > quite low level but it works well.
> >
> >> Thirdly out of curiosity: are there any data about performance 
> >> (throughput, latency) of plasma (both inter and intra process)?
> > I don't recall anyone having done too many systematic performance
> > studies / benchmarks.
> >
> >> Thanks a lot and best regards
> >> Simon
> >>
> >>
> >> [1] https://docs.ray.io/en/master/serialization.html
> >> [2] https://arrow.apache.org/docs/python/plasma.html
> >> [3] 
> >> https://github.com/apache/arrow/blob/master/cpp/apidoc/tutorials/plasma.md
> >>
> >> Am 04.05.2021 um 17:12 schrieb Wes McKinney:
> >>
> >> hi Jacopo — absent developer-maintainers, it is de facto deprecated
> >> since the previous developer-maintainers (who work on the Ray project)
> >> have forked away and abandoned it. If someone wants to resume
> >> development and maintenance of the codebase (or fund work on it,
> >> please contact me if you want to fund it!), that would be great.
> >>
> >> On Tue, May 4, 2021 at 10:07 AM Jacopo Gobbi  wrote:
> >>
> >> Hi everybody,
> >>
> >> My name is Jacopo, I am a software engineer at Orchest (orchest.io). We
> >> are currently making use of the plasma store to do in-memory data
> >> passing, and use pyarrow for some serialization.
> >> Some months ago, in the dev mailing list, there have been talks of the
> >> plasma store getting deprecated. I see that Arrow 4.0.0 has been
> >> released and that the plasma-store is still there, I could not find
> >> exhaustive information in the docs and the README about the current
> >> state of the plasma-store and what to expect in the future.
> >>
> >> Could someone provide some insight as to what the current plans are for
> >> plasma in pyarrow?
> >>
> >> Regards,
> >> Jacopo
> >>
> >>
> >> --
> >> Please note: name change! In future, please use simon.fisc...@ipp.mpg.de.
> >>
> >> Simon Fischer
> >>
> >> Developer - CoDaC
> >> Department Operation
> >>
> >> Max Planck Institute for Plasma Physics
> >> Wendelsteinstrasse 1
> >> 17491 Greifswald, Germany
> >>
> >> Phone: +49(0)3834 88 1215
>
>
>
>


Re: State of plasma

2021-07-27 Thread Wes McKinney
On Tue, Jul 27, 2021 at 1:29 PM Simon Fischer  wrote:
>
> Hi all,
>
> could someone (Wes?) elaborate on that a bit more? Because the Ray project 
> still lists Arrow Plasma as part of their infrastructure, e.g. here:

I'm pretty certain this is simply a case of outdated documentation --
you could file an issue with Ray and ask them to fix it.

> Also, from my understanding plasma is C++ and the client exists for C++, and 
> plasma is (still) part of Arrow. But the only documentation on how to use 
> plasma with Arrow is for Python [2]. There are some docs for plasma C++ here 
> [3], but they do not seem to cover the interaction between Arrow and plasma. 
> Can you point me somewhere where I can find something in that regard?
>

At this point reading the C++ headers would be the way to go. Plasma
is one aspect of building a distributed data caching solution, so it's
quite low level but it works well.

> Thirdly out of curiosity: are there any data about performance (throughput, 
> latency) of plasma (both inter and intra process)?

I don't recall anyone having done too many systematic performance
studies / benchmarks.

> Thanks a lot and best regards
> Simon
>
>
> [1] https://docs.ray.io/en/master/serialization.html
> [2] https://arrow.apache.org/docs/python/plasma.html
> [3] https://github.com/apache/arrow/blob/master/cpp/apidoc/tutorials/plasma.md
>
> Am 04.05.2021 um 17:12 schrieb Wes McKinney:
>
> hi Jacopo — absent developer-maintainers, it is de facto deprecated
> since the previous developer-maintainers (who work on the Ray project)
> have forked away and abandoned it. If someone wants to resume
> development and maintenance of the codebase (or fund work on it,
> please contact me if you want to fund it!), that would be great.
>
> On Tue, May 4, 2021 at 10:07 AM Jacopo Gobbi  wrote:
>
> Hi everybody,
>
> My name is Jacopo, I am a software engineer at Orchest (orchest.io). We
> are currently making use of the plasma store to do in-memory data
> passing, and use pyarrow for some serialization.
> Some months ago, in the dev mailing list, there have been talks of the
> plasma store getting deprecated. I see that Arrow 4.0.0 has been
> released and that the plasma-store is still there, I could not find
> exhaustive information in the docs and the README about the current
> state of the plasma-store and what to expect in the future.
>
> Could someone provide some insight as to what the current plans are for
> plasma in pyarrow?
>
> Regards,
> Jacopo
>
>
> --
> Please note: name change! In future, please use simon.fisc...@ipp.mpg.de.
>
> Simon Fischer
>
> Developer - CoDaC
> Department Operation
>
> Max Planck Institute for Plasma Physics
> Wendelsteinstrasse 1
> 17491 Greifswald, Germany
>
> Phone: +49(0)3834 88 1215


Re: [PyArrow] DictionaryArray isDelta Support

2021-07-23 Thread Wes McKinney
If I'm interpreting you correctly, the issue is that every dictionary
must be a prefix of a common dictionary for the delta logic to work.
So if the first batch has

"a", "b"

then in the next batch

"a", "b", "c" is OK and will emit a delta
"b", "a", "c" is not and will trigger this error

If we wanted to allow for deltas coming from unordered dictionaries as
an option, that could be implemented in theory, but it is not super
trivial.
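
A sketch of a delta that should be accepted, adapting Sam's snippet below to
the stream format with prefix-extended dictionaries (behaviour may vary by
pyarrow version):

```
import pandas as pd
import pyarrow as pa

schema = pa.schema([("foo", pa.dictionary(pa.int16(), pa.string()))])
options = pa.ipc.IpcWriteOptions(emit_dictionary_deltas=True)

# The second dictionary ("a", "b", "c") extends the first ("a", "b"),
# so the stream writer can emit a delta instead of a replacement.
b1 = pa.RecordBatch.from_pandas(
    pd.DataFrame({"foo": pd.Categorical(["a"], categories=["a", "b"])}),
    schema=schema)
b2 = pa.RecordBatch.from_pandas(
    pd.DataFrame({"foo": pd.Categorical(["c"], categories=["a", "b", "c"])}),
    schema=schema)

sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, schema, options=options) as writer:
    writer.write(b1)
    writer.write(b2)
```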

On Fri, Jul 23, 2021 at 9:25 AM Sam Davis  wrote:
>
> For reference, I think this check in the C++ code triggers regardless of 
> whether the delta option is turned on:
>
> https://github.com/apache/arrow/blob/e0401123736c85283e527797a113a3c38c0915f2/cpp/src/arrow/ipc/writer.cc#L1066
> 
> From: Sam Davis 
> Sent: 23 July 2021 14:43
> To: user@arrow.apache.org 
> Subject: Re: [PyArrow] DictionaryArray isDelta Support
>
> Yes, I know this, as quoted in the spec. What I am wondering is: for the file 
> format, how can I write deltas out using PyArrow?
>
> The previous example was a trivial version of reality.
>
> More concretely, say I want to write 100e6 rows out in multiple RecordBatches 
> to a non-streaming file format using PyArrow. I do not want to do a complete 
> pass ahead of time to compute the full set of strings for the relevant 
> columns and would therefore like to dump out deltas when new strings appear 
> in a given column. Is this possible?
>
> In the example code, ideally this would "just" add on the delta containing the 
> dictionary difference between it and the previous batches. I'm happy as a user to 
> maintain the full set of categories seen thus far and tell PyArrow what the 
> delta is if necessary.
> 
> From: Wes McKinney 
> Sent: 23 July 2021 14:36
> To: user@arrow.apache.org 
> Subject: Re: [PyArrow] DictionaryArray isDelta Support
>
> Dictionary replacements aren't supported in the file format, only
> deltas. Your use case is a replacement, not a delta. You could use the
> stream format instead.
>
> On Fri, Jul 23, 2021 at 8:32 AM Sam Davis  wrote:
> >
> > Hey Wes,
> >
> > Thanks, I had not spotted this before! It doesn't seem to change the 
> > behaviour with `pa.ipc.new_file` however. Maybe I'm using it incorrectly?
> >
> > ```
> > import pandas as pd
> > import pyarrow as pa
> >
> > print(pa.__version__)
> >
> > schema = pa.schema([
> > ("foo", pa.dictionary(pa.int16(), pa.string()))
> > ])
> >
> > pd1 = pd.DataFrame({"foo": pd.Categorical([""], categories=["a"*i for i 
> > in range(64)])})
> > b1 = pa.RecordBatch.from_pandas(pd1, schema=schema)
> >
> > pd2 = pd.DataFrame({"foo": pd.Categorical([""], categories=["b"*i for i 
> > in range(64)])})
> > b2 = pa.RecordBatch.from_pandas(pd2, schema=schema)
> >
> > options = pa.ipc.IpcWriteOptions(emit_dictionary_deltas=True)
> >
> > with pa.ipc.new_file("/tmp/sdavis_tmp.arrow", schema=b1.schema, 
> > options=options) as writer:
> > writer.write(b1)
> > writer.write(b2)
> > ```
> >
> > Version printed: 4.0.1
> >
> > Sam
> > 
> > From: Wes McKinney 
> > Sent: 23 July 2021 14:24
> > To: user@arrow.apache.org 
> > Subject: Re: [PyArrow] DictionaryArray isDelta Support
> >
> > hi Sam
> >
> > On Fri, Jul 23, 2021 at 8:15 AM Sam Davis  
> > wrote:
> > >
> > > Hi,
> > >
> > > We want to write out RecordBatches of data, where one or more columns in 
> > > a batch has a `pa.string()` column encoded as a `pa.dictionary(pa.intX(), 
> > > pa.string())` as the column only contains a handful of unique values.
> > >
> > > However, PyArrow seems to lack support for writing these batches out to 
> > > either the streaming or (non-streaming) file format.
> > >
> > > When attempting to write two distinct batches the following error message 
> > > is triggered:
> > >
> > > > ArrowInvalid: Dictionary replacement detected when writing IPC file 
> > > > format. Arrow IPC files only support a single dictionary for a given 
> > > > field across all batches.
> > >
> > > I believe this message is false and that support is possible based on 
> > > reading the spec:
> > >
> > > > Dictionaries are written in the stream and file formats as a sequence 
> > > > of record batches...
> >

Re: [PyArrow] DictionaryArray isDelta Support

2021-07-23 Thread Wes McKinney
Dictionary replacements aren't supported in the file format, only
deltas. Your use case is a replacement, not a delta. You could use the
stream format instead.

On Fri, Jul 23, 2021 at 8:32 AM Sam Davis  wrote:
>
> Hey Wes,
>
> Thanks, I had not spotted this before! It doesn't seem to change the 
> behaviour with `pa.ipc.new_file` however. Maybe I'm using it incorrectly?
>
> ```
> import pandas as pd
> import pyarrow as pa
>
> print(pa.__version__)
>
> schema = pa.schema([
> ("foo", pa.dictionary(pa.int16(), pa.string()))
> ])
>
> pd1 = pd.DataFrame({"foo": pd.Categorical([""], categories=["a"*i for i 
> in range(64)])})
> b1 = pa.RecordBatch.from_pandas(pd1, schema=schema)
>
> pd2 = pd.DataFrame({"foo": pd.Categorical([""], categories=["b"*i for i 
> in range(64)])})
> b2 = pa.RecordBatch.from_pandas(pd2, schema=schema)
>
> options = pa.ipc.IpcWriteOptions(emit_dictionary_deltas=True)
>
> with pa.ipc.new_file("/tmp/sdavis_tmp.arrow", schema=b1.schema, 
> options=options) as writer:
> writer.write(b1)
> writer.write(b2)
> ```
>
> Version printed: 4.0.1
>
> Sam
> 
> From: Wes McKinney 
> Sent: 23 July 2021 14:24
> To: user@arrow.apache.org 
> Subject: Re: [PyArrow] DictionaryArray isDelta Support
>
> hi Sam
>
> On Fri, Jul 23, 2021 at 8:15 AM Sam Davis  wrote:
> >
> > Hi,
> >
> > We want to write out RecordBatches of data, where one or more columns in a 
> > batch has a `pa.string()` column encoded as a `pa.dictionary(pa.intX(), 
> > pa.string())` as the column only contains a handful of unique values.
> >
> > However, PyArrow seems to lack support for writing these batches out to 
> > either the streaming or (non-streaming) file format.
> >
> > When attempting to write two distinct batches the following error message 
> > is triggered:
> >
> > > ArrowInvalid: Dictionary replacement detected when writing IPC file 
> > > format. Arrow IPC files only support a single dictionary for a given 
> > > field across all batches.
> >
> > I believe this message is false and that support is possible based on 
> > reading the spec:
> >
> > > Dictionaries are written in the stream and file formats as a sequence of 
> > > record batches...
> > > ...
> > > The dictionary isDelta flag allows existing dictionaries to be expanded 
> > > for future record batch materializations. A dictionary batch with isDelta 
> > > set indicates that its vector should be concatenated with those of any 
> > > previous batches with the same id. In a stream which encodes one column, 
> > > the list of strings ["A", "B", "C", "B", "D", "C", "E", "A"], with a 
> > > delta dictionary batch could take the form:
> >
> > ```
> > <SCHEMA>
> > <DICTIONARY 0>
> > (0) "A"
> > (1) "B"
> > (2) "C"
> >
> > <RECORD BATCH 0>
> > 0
> > 1
> > 2
> > 1
> >
> > <DICTIONARY 0 DELTA>
> > (3) "D"
> > (4) "E"
> >
> > <RECORD BATCH 1>
> > 3
> > 2
> > 4
> > 0
> > EOS
> > ```
> >
> > > Alternatively, if isDelta is set to false, then the dictionary replaces 
> > > the existing dictionary for the same ID. Using the same example as above, 
> > > an alternate encoding could be:
> >
> > ```
> > <SCHEMA>
> > <DICTIONARY 0>
> > (0) "A"
> > (1) "B"
> > (2) "C"
> >
> > <RECORD BATCH 0>
> > 0
> > 1
> > 2
> > 1
> >
> > <DICTIONARY 0>
> > (0) "A"
> > (1) "C"
> > (2) "D"
> > (3) "E"
> >
> > <RECORD BATCH 1>
> > 2
> > 1
> > 3
> > 0
> > EOS
> > ```
> >
> > It also specifies in the IPC File Format (non-streaming) section:
> >
> > > In the file format, there is no requirement that dictionary keys should 
> > > be defined in a DictionaryBatch before they are used in a RecordBatch, as 
> > > long as the keys are defined somewhere in the file. Further more, it is 
> > > invalid to have more than one non-delta dictionary batch per dictionary 
> > > ID (i.e. dictionary replacement is not supported). Delta dictionaries are 
> > > applied in the order they appear in the file footer.
> >
> > So for the non-streaming format multiple non-delta dictionaries are not 
> > supported but one non-delta followed by delta dictionaries should be.
>

Re: [Python/C++] Streaming Format to IPC File Format Conversion

2021-07-16 Thread Wes McKinney
hi Micah — makes sense. I agree that starting down the path of "table
management" in Arrow is probably too much scope creep since the
requirements (e.g. schema evolution) can vary so much, but I see the
value in providing a way to do random access into a stream-file after
writing it without having to rewrite the file into the file format
(which may be tricky given possible issues with dictionary deltas)
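
For anyone who can afford the rewrite, a minimal sketch (mine, with illustrative file names) of converting a stream-format file into the random-access file format — the pass this thread is trying to avoid for very large outputs:

```
import pyarrow as pa

# Read the stream file batch by batch and re-write it with the file writer.
with pa.OSFile("results.arrows", "rb") as source:
    reader = pa.ipc.open_stream(source)
    with pa.OSFile("results.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, reader.schema) as writer:
            for batch in reader:
                writer.write(batch)
```

Note this inherits the file-format restriction on dictionary replacements mentioned above, so streams that replace dictionaries cannot be converted this way.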

On Wed, Jul 14, 2021 at 10:58 PM Micah Kornfield  wrote:
>
> I think if we tried to tack this on, I think it might be worth trying to go 
> through the design effort to see if something is possible without external 
> files.  The stream format also allows more flexibility around dictionaries 
> then the file format does, so there is a possibility of impedance mismatch.
>
> Before we went with our own specification for external metadata it seems that 
> looking at integration with something like Iceberg might make sense.
>
> My understanding is that  external metadata files are on the path to 
> deprecation or at least no recommended in parquet [1].
>
> [1] 
> https://lists.apache.org/thread.html/r9897237ce76287e66109994320d876d32e11db6acc32490b99a41842%40%3Cdev.parquet.apache.org%3E
>
> On Wed, Jul 14, 2021 at 4:53 PM Wes McKinney  wrote:
>>
>> On Wed, Jul 14, 2021 at 5:40 PM Aldrin  wrote:
>> >
>> > Forgive me if I am misunderstanding the context, but my initial impression 
>> > would be that this is solved at a higher layer than the file format. While 
>> > some approaches make sense at
>> > the file format level, that approach may not be the best. I suspect that 
>> > book-keeping for this type of conversion would be affected by batching 
>> > granularity (can you group multiple
>> > streamed batches) and what type of process/job it is (is the job at the 
>> > level of like a bash script? is the job at the level of a copy task?).
>> >
>> > Some questions and thoughts below:
>> >
>> >
>> >> One thing that occurs to me is whether we could enable the file
>> >> footer metadata to live in a "sidecar" file to support this use case.
>> >
>> >
>> > This sounds like a good, simple approach that could serve as a default. 
>> > But I feel like this is essentially the same as maintaining an independent 
>> > metadata file, that could be described
>> > in a cookbook or something. Seems odd to me, personally, to include it in 
>> > the format definition.
>>
>> The problem with this is that it is not compliant with our
>> specification ([1]), so applications would not be able to hope for any
>> interoperability. Parquet provides for file footer metadata living
>> separate from the row groups (akin to our record batches), and this is
>> formalized in the format ([2]). None of the Arrow projects have any
>> mechanism to deal with the Footer independently — to do something with
>> that metadata that is not in the project specification is not
>> something we could support and provide backward/forward
>> compatibilities for.
>>
>> [1]: https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
>> [2]: 
>> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L787
>>
>> >
>> >> 3. Doing a pass over the record batches to gather the information 
>> >> required to generate the footer data.
>> >
>> >
>> > Could you maintain footer data incrementally and always write to the same 
>> > spot whenever some number of batches are written to the destination?
>> >
>> >
>> >> 2. Writing batches out as they appear.
>> >
>> >
>> > Might batches be received out of order? Is this long running job streaming 
>> > over a network connection? Might the source be distributed/striped over 
>> > multiple sources/locations?
>> >
>> >
>> >> a use case where there is a long running job producing results as it goes 
>> >> that may die and therefore must be restarted
>> >
>> >
>> > Would the long running job only be handling independent streams, 
>> > concurrently? e.g. is it an asynchronous job that handles a single logical 
>> > stream, or does it manage a pool of stream
>> > for concurrent requesting processes?
>> >
>> > Aldrin Montana
>> > Computer Science PhD Student
>> > UC Santa Cruz
>> >
>> >
>> > On Wed, Jul 14, 2021 at 2:23 PM Wes McKinney  wrote:
>> >>
>> >> hi Sam — it's an interesting proposition. Other file

Re: [Python/C++] Streaming Format to IPC File Format Conversion

2021-07-14 Thread Wes McKinney
On Wed, Jul 14, 2021 at 5:40 PM Aldrin  wrote:
>
> Forgive me if I am misunderstanding the context, but my initial impression 
> would be that this is solved at a higher layer than the file format. While 
> some approaches make sense at
> the file format level, that approach may not be the best. I suspect that 
> book-keeping for this type of conversion would be affected by batching 
> granularity (can you group multiple
> streamed batches) and what type of process/job it is (is the job at the level 
> of like a bash script? is the job at the level of a copy task?).
>
> Some questions and thoughts below:
>
>
>> One thing that occurs to me is whether we could enable the file
>> footer metadata to live in a "sidecar" file to support this use case.
>
>
> This sounds like a good, simple approach that could serve as a default. But I 
> feel like this is essentially the same as maintaining an independent metadata 
> file, that could be described
> in a cookbook or something. Seems odd to me, personally, to include it in the 
> format definition.

The problem with this is that it is not compliant with our
specification ([1]), so applications would not be able to hope for any
interoperability. Parquet provides for file footer metadata living
separate from the row groups (akin to our record batches), and this is
formalized in the format ([2]). None of the Arrow projects have any
mechanism to deal with the Footer independently — to do something with
that metadata that is not in the project specification is not
something we could support and provide backward/forward
compatibilities for.

[1]: https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
[2]: 
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L787

>
>> 3. Doing a pass over the record batches to gather the information required 
>> to generate the footer data.
>
>
> Could you maintain footer data incrementally and always write to the same 
> spot whenever some number of batches are written to the destination?
>
>
>> 2. Writing batches out as they appear.
>
>
> Might batches be received out of order? Is this long running job streaming 
> over a network connection? Might the source be distributed/striped over 
> multiple sources/locations?
>
>
>> a use case where there is a long running job producing results as it goes 
>> that may die and therefore must be restarted
>
>
> Would the long running job only be handling independent streams, 
> concurrently? e.g. is it an asynchronous job that handles a single logical 
> stream, or does it manage a pool of stream
> for concurrent requesting processes?
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>
>
> On Wed, Jul 14, 2021 at 2:23 PM Wes McKinney  wrote:
>>
>> hi Sam — it's an interesting proposition. Other file formats like
>> Parquet don't make "resuming" particularly easy, either. The magic
>> number at the beginning of an Arrow file means that it's a lot more
>> expensive to turn a stream file into an Arrow-file-file — if we'd
>> thought about this use case, we might have chosen to only put the
>> magic number at the end of the file.
>>
>> It's also not possible to put the file metadata "outside" the stream
>> file. One thing that occurs to me is whether we could enable the file
>> footer metadata to live in a "sidecar" file to support this use case.
>> To enable this, we would have to add a new optional field to Footer in
>> File.fbs that indicates the file path that the Footer references. This
>> would be null when the footer is part of the same file where the data
>> lives. A function could be implemented to produce this "sidecar index"
>> file from a stream file.
>>
>> Not sure on others' thoughts about this.
>>
>> Thanks,
>> Wes
>>
>>
>> On Wed, Jul 14, 2021 at 5:39 AM Sam Davis  wrote:
>> >
>> > Hi,
>> >
>> > I'm interested in a use case where there is a long running job producing 
>> > results as it goes that may die and therefore must be restarted, making 
>> > sure to continue from the last known-good point.
>> >
>> > For this use case, it seems best to use the "IPC Streaming Format" and 
>> > write out the batches as they are generated.
>> >
>> > However, once the job is finished it would also be beneficial to have 
>> > random access into the file. It seems like this is possible by:
>> >
>> > Manually creating a file with the correct magic number/padding bytes and 
>> > then seq'ing past them.
>> > Writing batches

Re: [Python/C++] Streaming Format to IPC File Format Conversion

2021-07-14 Thread Wes McKinney
hi Sam — it's an interesting proposition. Other file formats like
Parquet don't make "resuming" particularly easy, either. The magic
number at the beginning of an Arrow file means that it's a lot more
expensive to turn a stream file into an Arrow-file-file — if we'd
thought about this use case, we might have chosen to only put the
magic number at the end of the file.

It's also not possible to put the file metadata "outside" the stream
file. One thing that occurs to me is whether we could enable the file
footer metadata to live in a "sidecar" file to support this use case.
To enable this, we would have to add a new optional field to Footer in
File.fbs that indicates the file path that the Footer references. This
would be null when the footer is part of the same file where the data
lives. A function could be implemented to produce this "sidecar index"
file from a stream file.

Not sure on others' thoughts about this.

Thanks,
Wes


On Wed, Jul 14, 2021 at 5:39 AM Sam Davis  wrote:
>
> Hi,
>
> I'm interested in a use case where there is a long running job producing 
> results as it goes that may die and therefore must be restarted, making sure 
> to continue from the last known-good point.
>
> For this use case, it seems best to use the "IPC Streaming Format" and write 
> out the batches as they are generated.
>
> However, once the job is finished it would also be beneficial to have random 
> access into the file. It seems like this is possible by:
>
> Manually creating a file with the correct magic number/padding bytes and then 
> seq'ing past them.
> Writing batches out as they appear.
> Doing a pass over the record batches to gather the information required to 
> generate the footer data.
>
>
> Whilst this seems possible, it doesn't seem like it is a use case that has 
> come up before. However, this does surprise me because adding index 
> information to a "completed" file seems like a genuinely useful thing to want 
> to do.
>
> Has anyone encountered something similar before?
>
> Is there an easier way to achieve this? i.e. does this functionality, or 
> parts of, exist in another language that I can bind to in Python?
>
> Best,
>
> Sam
>
>
> IMPORTANT NOTICE: The information transmitted is intended only for the person 
> or entity to which it is addressed and may contain confidential and/or 
> privileged material. Any review, re-transmission, dissemination or other use 
> of, or taking of any action in reliance upon, this information by persons or 
> entities other than the intended recipient is prohibited. If you received 
> this in error, please contact the sender and delete the material from any 
> computer. Although we routinely screen for viruses, addressees should check 
> this e-mail and any attachment for viruses. We make no warranty as to absence 
> of viruses in this e-mail or any attachments.


Re: [Python] pyarrow.read_feather use_threads option not respected?

2021-07-12 Thread Wes McKinney
hi Burke — to remove yourself, you have to e-mail

user-unsubscr...@arrow.apache.org

On Mon, Jul 12, 2021 at 3:11 PM Burke Kaltenberger
 wrote:
>
> Please take me off the mailing list
>
> On Mon, Jul 12, 2021 at 1:08 PM Wes McKinney  wrote:
>>
>> hi Arun — the `use_threads` argument here only toggles whether
>> multiple threads are used in the conversion from the Arrow/Feather
>> representation to pandas. Since you elected to use compression,
>> multiple threads are used when decompressing the data, and this can
>> only be changed by setting the number of threads globally in the
>> pyarrow library [1]
>>
>> This seems a bit misleading to me, so it would be good to open a Jira
>> issue to clarify in the documentation what "use_threads" does
>>
>> [1]: 
>> http://arrow.apache.org/docs/python/generated/pyarrow.set_cpu_count.html#pyarrow.set_cpu_count
>>
>> On Mon, Jul 12, 2021 at 3:00 PM Arun Joseph  wrote:
>> >
>> > I'm running the following:
>> >
>> > Python 3.7.4 (default, Aug 13 2019, 20:35:49)
>> > [GCC 7.3.0] :: Anaconda, Inc. on linux
>> > Type "help", "copyright", "credits" or "license" for more information.
>> > >>> import pyarrow
>> > >>> pyarrow.__version__
>> > '4.0.1'
>> >
>> > from pyarrow import feather
>> >
>> > feather.write_feather(df, dest=file_path, compression='zstd', 
>> > compression_level=19)
>> > file_path=f'{valid_file_path}'
>> > feather.read_feather(file_path, use_threads=False)
>> >
>> > It seems like the use_threads argument does not alter the number of 
>> > threads launched. I've tested with both use_threads=True and 
>> > use_threads=False. Am I misunderstanding what use_threads actually means? 
>> > It seems like it launches ~12 threads.
>> >
>> > Could this be related to the compression strategy of the file itself?
>> >
>> > Thank You,
>> > Arun Joseph
>> >
>
>
>
> --
> First Talent Search & Placement
> Burke Kaltenberger | Founder
> 408.458.0071


Re: [Python] pyarrow.read_feather use_threads option not respected?

2021-07-12 Thread Wes McKinney
hi Arun — the `use_threads` argument here only toggles whether
multiple threads are used in the conversion from the Arrow/Feather
representation to pandas. Since you elected to use compression,
multiple threads are used when decompressing the data, and this can
only be changed by setting the number of threads globally in the
pyarrow library [1]

This seems a bit misleading to me, so it would be good to open a Jira
issue to clarify in the documentation what "use_threads" does

[1]: 
http://arrow.apache.org/docs/python/generated/pyarrow.set_cpu_count.html#pyarrow.set_cpu_count
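
A short sketch of that workaround (file name illustrative): cap the global thread pool, which is what the decompression step uses, in addition to passing use_threads=False for the pandas conversion:

```
import pyarrow as pa
from pyarrow import feather

pa.set_cpu_count(1)  # limits the global pool used for zstd decompression
df = feather.read_feather("data.feather", use_threads=False)  # single-threaded pandas conversion
```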

On Mon, Jul 12, 2021 at 3:00 PM Arun Joseph  wrote:
>
> I'm running the following:
>
> Python 3.7.4 (default, Aug 13 2019, 20:35:49)
> [GCC 7.3.0] :: Anaconda, Inc. on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pyarrow
> >>> pyarrow.__version__
> '4.0.1'
>
> from pyarrow import feather
>
> feather.write_feather(df, dest=file_path, compression='zstd', 
> compression_level=19)
> file_path=f'{valid_file_path}'
> feather.read_feather(file_path, use_threads=False)
>
> It seems like the use_threads argument does not alter the number of threads 
> launched. I've tested with both use_threads=True and use_threads=False. Am I 
> misunderstanding what use_threads actually means? It seems like it launches 
> ~12 threads.
>
> Could this be related to the compression strategy of the file itself?
>
> Thank You,
> Arun Joseph
>


Re: [python] [iter_batches] Is there any value to an iterator based parquet reader in python?

2021-07-06 Thread Wes McKinney
I left a comment in Jira, but I agree that having a faster method to
"box" Arrow array values as Python objects would be useful in a lot of
places. Then these common C++ code paths could be used to "tupleize"
record batches reasonably efficiently
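
As a rough illustration (mine, not the C++ fast path discussed here), "tupleizing" a record batch can already be done through the Python-level conversion; the faster boxing would essentially replace the to_pydict() step:

```
import pyarrow as pa

def iter_rows(batch: pa.RecordBatch):
    columns = batch.to_pydict()        # column name -> list of Python objects
    yield from zip(*columns.values())  # one tuple per row

batch = pa.record_batch([pa.array([1, 2, 3]), pa.array(["x", "y", "z"])],
                        names=["a", "b"])
for row in iter_rows(batch):
    print(row)  # (1, 'x'), (2, 'y'), (3, 'z')
```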

On Tue, Jul 6, 2021 at 3:08 PM Alessandro Molina
 wrote:
>
> I guess that doing it at the Parquet reader level might allow the 
> implementation to better leverage row groups, without the need to keep in 
> memory the whole Table when you are iterating over data. While the current 
> jira issue seems to suggest the implementation for Table once it's already 
> fully available.
>
> On Tue, Jul 6, 2021 at 8:48 AM Joris Van den Bossche 
>  wrote:
>>
>> There is a recent JIRA where a row-wise iterator was discussed: 
>> https://issues.apache.org/jira/browse/ARROW-12970.
>>
>> This should not be too hard to add (although there is a linked JIRA about 
>> improving the performance of the pyarrow -> python objects conversion, which 
>> might require some more engineering work to do), but of course what's 
>> proposed in the JIRA is starting from a materialized record batch (so 
>> similarly as the gist here, but I think this is good enough?).
>>
>> On Tue, 6 Jul 2021 at 05:03, Micah Kornfield  wrote:
>>>
>>> I think this type of thing does make sense, at some point people like to be 
>>> be able see their data in rows.
>>>
>>> It probably pays to have this conversation on dev@.  Doing this in a 
>>> performant way might take some engineering work, but having a quick 
>>> solution like the one described above might make sense.
>>>
>>> -Micah
>>>
>>> On Sun, Jun 27, 2021 at 6:23 AM Grant Williams  
>>> wrote:

 Hello,

 I've found myself wondering if there is a use case for using the 
 iter_batches method in python as an iterator in a similar style to a 
 server-side cursor in Postgres. Right now you can use an iterator of 
 record batches, but I wondered if having some sort of python native 
 iterator might be worth it? Maybe a .to_pyiter() method that converts it 
 to a lazy & batched iterator of native python objects?

 Here is some example code that shows a similar result.

 from itertools import chain
 from typing import Any, Iterator, Tuple

 def iter_parquet(parquet_file, columns=None, batch_size=1_000) -> Iterator[Tuple[Any, ...]]:

     record_batches = parquet_file.iter_batches(batch_size=batch_size, columns=columns)

     # convert from columnar format of pyarrow arrays to a row format
     # of python objects (yields tuples)
     yield from chain.from_iterable(
         zip(*map(lambda col: col.to_pylist(), batch.columns))
         for batch in record_batches)

 (or a gist if you prefer: 
 https://gist.github.com/grantmwilliams/143fd60b3891959a733d0ce5e195f71d)

 I realize arrow is a columnar format, but I wonder if having the buffered 
 row reading as a lazy iterator is a common enough use case with parquet + 
 object storage being so common as a database alternative.

 Thanks,
 Grant

 --
 Grant Williams
 Machine Learning Engineer
 https://github.com/grantmwilliams/


Re: LLVM Question

2021-06-24 Thread Wes McKinney
clang-format is only used in development for formatting the codebase (and
checking that it's been formatted). Clang tools (format and tidy) aren't
necessary for building the library (Gandiva) that requires LLVM

On Thu, Jun 24, 2021 at 9:43 AM Weber, Eugene  wrote:

> Hi,
>
>
> I'm running Centos 7. In the Arrow C++ build documentation it states that
> you are using LLVM 8. I can install llvm-toolset-7.0 which contains
> llvm/clang version 7.0.1. To get a more recent version of llvm on Centos 7
> it would need to be built from source. I tried to do this a few times
> following various online "recipes", but failed. Not the clearest/simplest
> process I've seen.  I decided to just give version 7.0.1 a try.
>
>
> Attached is my build script and build log. It appears to build
> successfully, but it does give these messages:
>
> -- clang-tidy not found
>
> -- clang-format not found
>
> -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN)
>
>
> I tried adding the following to my build script:
>
> export CLANG_TIDY_BIN=/opt/rh/llvm-toolset-7/root/usr/bin
>
> export CLANG_FORMAT_BIN=/opt/rh/llvm-toolset-7.0/root/usr/bin
>
> export PATH=/opt/rh/llvm-toolset-7/root/usr/bin:$PATH
>
> But the messages remain.
>
> When I run unit tests, all 71 tests pass. So at this point there doesn't
> appear to be any issue using llvm/clang 7.0.1.
>
> Are the "not found" messages OK?
> Is it OK to use llvm/clang 7.0.1?
>
> Thanks,
>
> Gene
>
>
>
>
>


Re: Passing back and forth from Python and C++ with Pyarrow C++ extension and pybind11 (#10488)

2021-06-10 Thread Wes McKinney
hi Daniel — is pip-installed pyarrow your only version of Arrow or do
you have the C++ library installed separately?

here's the C++ libraries that I have installed bundled with pyarrow

-rwx--x--x   1 wesm  staff  26097016 Jun 10 13:28 libarrow.400.dylib
-rwx--x--x   1 wesm  staff   1685632 Jun 10 13:28 libarrow_dataset.400.dylib
-rwx--x--x   1 wesm  staff   8280960 Jun 10 13:28 libarrow_flight.400.dylib
-rwx--x--x   1 wesm  staff   1482232 Jun 10 13:28 libarrow_python.400.dylib
-rwx--x--x   1 wesm  staff    117256 Jun 10 13:28 libarrow_python_flight.400.dylib
-rwx--x--x   1 wesm  staff   3135080 Jun 10 13:28 libparquet.400.dylib
-rwx--x--x   1 wesm  staff    237240 Jun 10 13:28 libplasma.400.dylib

That libarrow_python* library is the one you need. If you are
depending on a separately-installed C++ library, you need to build
with -DARROW_PYTHON=ON

On Thu, Jun 10, 2021 at 12:25 AM Daniel Foreman  wrote:
>
> Hello,
>
> This is a continuation of issue #10488 on the Apache Arrow Github.  Thank you 
> for your quick response!
>
> I have added
>
> target_link_libraries(helperfuncs PRIVATE arrow_python_shared)
>
> to my cmakelists file, but it appears that cmake is unable to find this 
> library, while it can find both the arrow_shared and arrow_static libraries.  
> The documentation doesn't appear to mention arrow_python_shared either.  Am I 
> missing this library?  I downloaded pyarrow using python pip install into a 
> python virtual-environment.  Does apache arrow need to be cloned from github 
> to get this library?
>
> I can't seem to find libraries under the names arrow_shared or arrow_static 
> in my pyarrow module directory, and it seems these are an umbrella name for a 
> variety of libraries.  Is this the case?  If not, where should I look to find 
> them?


Re: Specific key order in dictionary_encode

2021-06-09 Thread Wes McKinney
There's already an (old) issue for this

https://issues.apache.org/jira/browse/ARROW-1574

Here's another related issue

https://issues.apache.org/jira/browse/ARROW-4097

I agree both of these would be useful
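
Until something along those lines lands, a hedged workaround sketch (mine, not a public kernel for this): compute the indices against a fixed, ordered dictionary with index_in and assemble the DictionaryArray by hand:

```
import pyarrow as pa
import pyarrow.compute as pc

values = pa.array(["b", "a", "c", "a"])
dictionary = pa.array(["a", "b", "c"])  # the desired keys, in the desired order

indices = pc.index_in(values, value_set=dictionary)  # null where a value is missing
encoded = pa.DictionaryArray.from_arrays(indices, dictionary)
```

Values absent from the supplied dictionary come back as nulls rather than raising, so validate the result if that matters for your use case.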

On Wed, Jun 9, 2021 at 6:10 AM Joris Van den Bossche
 wrote:
>
> Hi Albert,
>
> That's a good question. AFAIK this is currently not possible (or at
> least it's not directly integrated in the public C++/Python kernel),
> but this sounds like a useful functionality to add (eg also to cover
> pandas functionality to convert to categorical dtype with given
> categories). Would you like to open a JIRA for it?
>
> Joris
>
> On Wed, 9 Jun 2021 at 09:58, Albert Villanova del Moral
>  wrote:
> >
> > Hello,
> >
> > I would like to know if it is possible to pass the keys (and in a specific 
> > order) to dictionary_encode, so that the resulting dictionary contains 
> > these keys in that order.
> >
> > Thanks.
> > Regards,
> > Albert.
> >
> > --
> > Albert Villanova del Moral
> > Machine Learning Engineer @Hugging Face


Re: [Python] Read dictionary values of DictionaryArray without reading the whole file

2021-06-03 Thread Wes McKinney
It isn't possible with the current API, but all of the library
machinery exists for you to be able to obtain this without
extraordinary pain (speaking as one of the people who participated in
the direct-read/write of arrow::DictionaryArray implementation). You
would need to do some work on the C++ library to externalize just the
dictionary data page.
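
Not the metadata-only read asked for, but a partial mitigation sketch (column name illustrative): read only the categorical column and pull the dictionary off a decoded chunk:

```
import pyarrow.parquet as pq

table = pq.read_table("data.parquet", columns=["category_col"],
                      read_dictionary=["category_col"])
categories = table.column("category_col").chunk(0).dictionary
```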

On Thu, Jun 3, 2021 at 2:55 PM Juan Galvez  wrote:
>
> Hello,
>
> I have a large parquet file written by pandas with categorical columns (which 
> are read into Arrow as DictionaryArray). I want to get the value of the 
> categories in Python (called "dictionary" values in Arrow) without having to 
> read any other data from the file into memory other than metadata. Is this 
> possible?
>
> Thank you,
> -Juan
>


Re: Filtering list/map arrays

2021-05-21 Thread Wes McKinney
I think we would want to implement a scalar "list_isin" function as a
core C++ function, so the type signature looks like this:

(Array<List<T>>, Scalar<T>) -> Array<Boolean>

I couldn't find an issue like this with a quick Jira search so I created

https://issues.apache.org/jira/browse/ARROW-12849
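
In the meantime, a rough Python-side workaround sketch (mine, ignoring null lists for brevity): flatten the list column, run is_in over the child values, and fold the per-element booleans back per list using the offsets:

```
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([[1, 2], [3], [4, 1]], type=pa.list_(pa.int64()))
target = pa.array([1], type=pa.int64())

flat_matches = pc.is_in(arr.flatten(), value_set=target)  # one bool per child value

offsets = arr.offsets.to_pylist()  # [0, 2, 3, 5]
flags = flat_matches.to_pylist()
contains = pa.array([any(flags[offsets[i]:offsets[i + 1]]) for i in range(len(arr))])
# contains: [True, False, True]
```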

On Fri, May 21, 2021 at 8:06 AM Elad Rosenheim  wrote:
>
> Hi!
>
> One of the gaps I currently have in Funnel Rocket 
> (https://github.com/DynamicYieldProjects/funnel-rocket) is supporting nested 
> columns, as in: given a Parquet file with a column of type List(int64), be 
> able to find rows where the list holds a specific int element.
>
> Right now, the need is fortunately limited to lists of primitives (mostly 
> int) and maps of string->string, rather than any arbitrary complexity.
>
> Currently, I load Parquet files via pyarrow, then call to_pandas() and run 
> multiple filters on the DataFrame.
>
> After reading Uwe's blog post 
> (https://uwekorn.com/2018/08/03/use-numba-to-work-with-apache-arrow-in-pure-python.html)
>  and looking at the Fletcher project (https://github.com/xhochy/fletcher), 
> seems the "proper" way to do it would be:
>
> * Write an ExtensionDType/ExtensionArray can wrap an arrow ChunkedArray made 
> of ListArrays. Not even sure what the operator should be for lookup in a list 
> - should I treat a list_series==123 as "for each list in this series, look 
> for the element 123 in it?".
>
>  * Potentially use a @jitclass for more performant lookup, as Uwe has 
> outlined.
>
> * For now, for any abstract method I'm not sure what to do with - start with 
> raising an exception, then run some unit tests based on my project's needs, 
> and see that they pass :-/
>
> * When calling Table.to_pandas(), supply a type mapper argument to map the 
> specific supported types to the appropriate extension class.
>
> * If it seems to work, figure out if I've missed something important in the 
> concrete classes :-/
>
> Am I getting this right, more or less?
>
> Thanks a lot,
> Elad


Re: why I can't use `arrow::py::import_pyarrow()' in c++

2021-05-18 Thread Wes McKinney
You have to initialize the Python interpreter when using it embedded in a
C++ application.

On Tue, May 18, 2021 at 8:36 AM auderson  wrote:

> Thanks for your reply!
>
> After I linked arrow_python_shared, calling import_pyarrow now caused a
> SEGFAULT.
>
> I've posted my code on stackoverflow c++ - arrow::py::import_pyarrow()
> cause a SEGMENTATION FAULT - Stack Overflow
> 
>
> I'm pretty new to C++, and maybe there are protocols I did't follow?
> -- Original --
> *From:* "user" ;
> *Date:* Tue, May 18, 2021 09:15 PM
> *To:* "user";
> *Subject:* Re: why I can't use `arrow::py::import_pyarrow()' in c++
>
> You have to link libarrow_python (arrow_python_shared) where this symbol
> is found.
>
> On Tue, May 18, 2021 at 3:09 AM auderson  wrote:
>
>> Hi,
>>
>> I'm using arrow-cpp 4.0.0 installed from conda. I also have pyarrow 4.0.0
>> installed.
>>
>> In my test.cpp file, import_pyarrow() will throw an error: undefined
>> reference to `arrow::py::import_pyarrow()'
>>
>> #mini_example.cpp
>>
>> #include "mini_example.h"
>> #include 
>>
>> int main() {
>> arrow::py::import_pyarrow();
>> }
>>
>>  #mini_example.h
>>
>>
>> #include 
>>
>> int main() {
>> arrow::py::import_pyarrow();
>> }
>>
>>
>> #CMakeLists.txt
>>
>> cmake_minimum_required(VERSION 3.10.0)
>> project(TEST)
>> set(CMAKE_CXX_STANDARD 17)
>>
>> list(APPEND CMAKE_PREFIX_PATH "/home/auderson/miniconda3/lib/cmake/arrow")
>> find_package(Arrow REQUIRED)
>>
>>
>> include_directories(.
>> /home/auderson/miniconda3/include
>> /home/auderson/miniconda3/include/python3.8)
>>
>> add_executable(TEST
>> mini_example.cpp
>> )
>>
>> target_link_libraries(${PROJECT_NAME} PRIVATE 
>> /home/auderson/miniconda3/lib/libpython3.8.so)
>> target_link_libraries(${PROJECT_NAME} PRIVATE arrow_shared)
>>
>> The full output:
>>
>> CMakeFiles/TEST.dir/mini_example.cpp.o: In function `main':
>> /tmp/tmp.vmULkzpuYF/mini_example.cpp:9: undefined reference to 
>> `arrow::py::import_pyarrow()'
>> collect2: error: ld returned 1 exit status
>>
>> But other functionalities works fine, like create an Array then build a 
>> Table from them. In fact I'm just stuck at the last step where I have to 
>> wrap the table so that they can be passed to cython.
>>
>> Can you find where I'm doing wrong? Thanks!
>>
>> Auderson
>>
>>
>>


Re: why I can't use `arrow::py::import_pyarrow()' in c++

2021-05-18 Thread Wes McKinney
You have to link libarrow_python (arrow_python_shared) where this symbol is
found.

On Tue, May 18, 2021 at 3:09 AM auderson  wrote:

> Hi,
>
> I'm using arrow-cpp 4.0.0 installed from conda. I also have pyarrow 4.0.0
> installed.
>
> In my test.cpp file, import_pyarrow() will throw an error: undefined
> reference to `arrow::py::import_pyarrow()'
>
> #mini_example.cpp
>
> #include "mini_example.h"
> #include 
>
> int main() {
> arrow::py::import_pyarrow();
> }
>
>  #mini_example.h
>
>
> #include 
>
> int main() {
> arrow::py::import_pyarrow();
> }
>
>
> #CMakeLists.txt
>
> cmake_minimum_required(VERSION 3.10.0)
> project(TEST)
> set(CMAKE_CXX_STANDARD 17)
>
> list(APPEND CMAKE_PREFIX_PATH "/home/auderson/miniconda3/lib/cmake/arrow")
> find_package(Arrow REQUIRED)
>
>
> include_directories(.
> /home/auderson/miniconda3/include
> /home/auderson/miniconda3/include/python3.8)
>
> add_executable(TEST
> mini_example.cpp
> )
>
> target_link_libraries(${PROJECT_NAME} PRIVATE 
> /home/auderson/miniconda3/lib/libpython3.8.so)
> target_link_libraries(${PROJECT_NAME} PRIVATE arrow_shared)
>
> The full output:
>
> CMakeFiles/TEST.dir/mini_example.cpp.o: In function `main':
> /tmp/tmp.vmULkzpuYF/mini_example.cpp:9: undefined reference to 
> `arrow::py::import_pyarrow()'
> collect2: error: ld returned 1 exit status
>
> But other functionalities works fine, like create an Array then build a Table 
> from them. In fact I'm just stuck at the last step where I have to wrap the 
> table so that they can be passed to cython.
>
> Can you find where I'm doing wrong? Thanks!
>
> Auderson
>
>
>


Re: Boost

2021-05-04 Thread Wes McKinney
That doesn't sound great. Can you open a Jira about this? Offline /
air-gapped installations from source are a use case we'd like to
accommodate within reason, so if it's something we can fix easily,
let's try

On Mon, May 3, 2021 at 11:23 PM Matt Youill  wrote:
>
> Using ARROW_BOOST_URL works. Subsequently tried using local copies for all 
> arrow dependencies, and all seems good... except for ORC. Lots of things 
> break if I try ARROW_ORC_URL pointing at a local tarball. It's only a "nice 
> to have" at the moment, so have just turned off (-DARROW_ORC=OFF)
>
> On 4/5/21 7:59 am, Neal Richardson wrote:
>
> That's right, I noticed that 1.71.0 wasn't on sourceforge when we were moving 
> off of bintray URLs--that's why we bumped up to 1.75 at that time (also 
> included in arrow 4.0).
>
> Neal
>
> On Mon, May 3, 2021 at 2:34 PM Niranda Perera  
> wrote:
>>
>> @Neal It seems like there sourceforge artifacts are missing for boost 
>> 1.71.0. But I tried 1.72.0 and it worked.
>> export 
>> ARROW_BOOST_URL="https://sourceforge.net/projects/boost/files/boost/1.72.0/boost_1_72_0.tar.gz/download"
>>
>> From Cylon project's POV we are in the process of upgrading to arrow 4.0 
>> from 2.0 ATM.
>>
>> Best
>>
>>
>> On Mon, May 3, 2021 at 11:06 AM Neal Richardson 
>>  wrote:
>>>
>>> Bintray was shut down Saturday (May 1). The surest fix for this is to 
>>> upgrade to the latest (4.0.0) release of arrow, which does not have any 
>>> references to bintray. If you need to be on an older version, you can 
>>> download the correct boost source tarball (they're hosting on sourceforge 
>>> now, I believe) and set the environment variable ARROW_BOOST_URL to point 
>>> to the local file.
>>>
>>> Neal
>>>
>>> On Mon, May 3, 2021 at 8:01 AM Niranda Perera  
>>> wrote:
>>>>
>>>> I also noticed the same error since yesterday (Sunday).
>>>>
>>>> On Mon, May 3, 2021 at 10:44 AM Wes McKinney  wrote:
>>>>>
>>>>> Can you create a Jira issue and provide more information about exactly
>>>>> what's going wrong?
>>>>>
>>>>> On Mon, May 3, 2021 at 2:10 AM Matt Youill  
>>>>> wrote:
>>>>> >
>>>>> > Hi,
>>>>> >
>>>>> > I started getting a strange build failure for thrift this afternoon (no
>>>>> > install target). Digging into it, it looks like boost is the culprit.
>>>>> > Two of the three mirrors for boost are failing and the last doesn't look
>>>>> > like what it should. Namely, these fail to download:
>>>>> >
>>>>> > https://dl.bintray.com/ursalabs/arrow-boost/boost_1_71_0.tar.gz
>>>>> >
>>>>> > https://dl.bintray.com/boostorg/release/1.71.0/source/boost_1_71_0.tar.gz
>>>>> >
>>>>> > And this one (that successfully downloads) doesn't look right...
>>>>> >
>>>>> > https://github.com/boostorg/boost/archive/boost-1.71.0.tar.gz
>>>>> >
>>>>> > AFAICT the arrow build expects it to contain headers, but headers = 
>>>>> > false.
>>>>> >
>>>>> > Matt
>>>>> >
>>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Niranda Perera
>>>> https://niranda.dev/
>>>> @n1r44
>>>>
>>
>>
>> --
>> Niranda Perera
>> https://niranda.dev/
>> @n1r44
>>


Re: State of plasma

2021-05-04 Thread Wes McKinney
hi Jacopo — absent developer-maintainers, it is de facto deprecated
since the previous developer-maintainers (who work on the Ray project)
have forked away and abandoned it. If someone wants to resume
development and maintenance of the codebase (or fund work on it,
please contact me if you want to fund it!), that would be great.

On Tue, May 4, 2021 at 10:07 AM Jacopo Gobbi  wrote:
>
> Hi everybody,
>
> My name is Jacopo, I am a software engineer at Orchest (orchest.io). We
> are currently making use of the plasma store to do in-memory data
> passing, and use pyarrow for some serialization.
> Some months ago, in the dev mailing list, there have been talks of the
> plasma store getting deprecated. I see that Arrow 4.0.0 has been
> released and that the plasma-store is still there, I could not find
> exhaustive information in the docs and the README about the current
> state of the plasma-store and what to expect in the future.
>
> Could someone provide some insight as to what the current plans are for
> plasma in pyarrow?
>
> Regards,
> Jacopo


Re: Boost

2021-05-03 Thread Wes McKinney
Can you create a Jira issue and provide more information about exactly
what's going wrong?

On Mon, May 3, 2021 at 2:10 AM Matt Youill  wrote:
>
> Hi,
>
> I started getting a strange build failure for thrift this afternoon (no
> install target). Digging into it, it looks like boost is the culprit.
> Two of the three mirrors for boost are failing and the last doesn't look
> like what it should. Namely, these fail to download:
>
> https://dl.bintray.com/ursalabs/arrow-boost/boost_1_71_0.tar.gz
>
> https://dl.bintray.com/boostorg/release/1.71.0/source/boost_1_71_0.tar.gz
>
> And this one (that successfully downloads) doesn't look right...
>
> https://github.com/boostorg/boost/archive/boost-1.71.0.tar.gz
>
> AFAICT the arrow build expects it to contain headers, but headers = false.
>
> Matt
>
>


Re: [Announce][Rust] JIRA Issues migrated to github issues

2021-04-26 Thread Wes McKinney
Thanks for doing this Andrew.

On Mon, Apr 26, 2021 at 8:38 AM Andrew Lamb  wrote:

> I have migrated over all JIRA issues that were marked as "Rust" or
> "Rust-DataFusion" to new issues in the https://github.com/apache/arrow-rs
> and https://github.com/apache/arrow-datafusion repos respectively.
>
> There are now no open JIRA issues [1] for the Rust implementation.
>
> My script moved the titles, descriptions, tags, and comments but didn't
> attempt to port the formatting, so some of the github issues are kind of
> messy. I also did not try to map jira usernames to github usernames.
>
> Please feel free to move / correct / close any others I have missed.
>
> If anyone has thoughts about how to prevent people from using JIRA
> accidentally to file issues (rather than github) please let us know.
>
> Andrew
>
>
> JQL: project = ARROW AND component in ("Rust - DataFusion", Rust, "Rust -
> Ballista") and status NOT IN (RESOLVED, CLOSED) ORDER BY created ASC
>
> [1]:
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20component%20in%20(%22Rust%20-%20DataFusion%22%2C%20Rust%2C%20%22Rust%20-%20Ballista%22)%20and%20status%20NOT%20IN%20(RESOLVED%2C%20CLOSED)%20ORDER%20BY%20created%20ASC
>


Re: [Python] pyarrow.gandiva unavailable on Ubuntu?

2021-04-13 Thread Wes McKinney
It looks to me like you have the wheel installed, not the conda package.
Can you reproduce this on Ubuntu from a fresh conda environment?

On Tue, Apr 13, 2021 at 3:16 PM Xander Dunn  wrote:

> Typo. The issue remains present. From my Ubuntu machine just now:
> ```
> $ python
> >>> import pyarrow as pa
> >>> print(pa.__file__)
> /home/xander/anaconda3/envs/plutus_model/lib/python3.7/site-packages/pyarrow/__init__.py
> >>> import pyarrow.plasma
> >>> import pyarrow.gandiva
> Traceback (most recent call last):
>   File "", line 1, in 
> ModuleNotFoundError: No module named 'pyarrow.gandiva'
> ```
>
> The .py I'm executing on both machines is identical. Works on mac. Not
> found on Ubuntu.
>
>
> On Tue, Apr 13, 2021 at 1:01 PM, Micah Kornfield 
> wrote:
>
>> Hi Xander,
>> Was there autocorrect on this e-mail?  the second example shows "gondiva"
>> not "gandiva"
>>
>> -Micah
>>
>> On Tue, Apr 13, 2021 at 12:59 PM Xander Dunn  wrote:
>>
>>> On my local macOS 11.2.3:
>>> ```
>>> $ python --version
>>> Python 3.7.10
>>> $ pip --version
>>> pip 21.0.1 from
>>> /usr/local/anaconda3/envs/my_model/lib/python3.7/site-packages/pip (python
>>> 3.7)
>>> $ pip list | grep pyarrow
>>> pyarrow3.0.0
>>> $ which python
>>> /usr/local/anaconda3/envs/my_model/bin/python
>>> $ python
>>> >>> import pyarrow as pa
>>> >>> print(pa.__file__)
>>> /usr/local/anaconda3/envs/my_model/lib/python3.7/site-packages/pyarrow/__init__.py
>>> >>> import pyarrow.plasma
>>> >>> import pyarrow.gandiva as ga
>>> >>> print(ga.__file__)
>>> /usr/local/anaconda3/envs/my_model/lib/python3.7/site-packages/pyarrow/gandiva.cpython-37m-darwin.so
>>> ```
>>>
>>> On my Ubuntu 14.04 instance:
>>> ```
>>> $ python --version
>>> Python 3.7.10
>>> $ pip --version
>>> pip 21.0.1 from
>>> /home/xander/anaconda3/envs/my_model/lib/python3.7/site-packages/pip
>>> (python 3.7)
>>> $ pip list | grep pyarrow
>>> pyarrow3.0.0
>>> $ which python
>>> /home/xander/anaconda3/envs/my_model/bin/python
>>> $ python
>>> >>> import pyarrow as pa
>>> >>> print(pa.__file__)
>>> /
>>> home/xander/anaconda3/envs/my_model/lib/python3.7/site-packages/pyarrow/__init__.py
>>> >>> import pyarrow.plasma
>>> >>> import pyarrow.gondiva
>>> Traceback (most recent call last):
>>>   File "", line 1, in 
>>> ModuleNotFoundError: No module named 'pyarrow.gondiva'
>>> ```
>>> You can see that pyarrow.gondiva is found on mac but not on Ubuntu. Same
>>> Python version. Same pyarrow version. I installed both of them with `conda
>>> install -c conda-forge pyarrow==3.0.0`.
>>>
>>> On Mac, I see the expected Cython file and library:
>>> ```
>>> $ l
>>> /usr/local/anaconda3/envs/my_model/lib/python3.7/site-packages/pyarrow/ |
>>> grep gandiva
>>> -rwxrwxr-x   2 xander  staff   221K Apr  1 12:44
>>> gandiva.cpython-37m-darwin.so
>>> -rw-rw-r--   2 xander  staff17K Jan 18 14:00 gandiva.pyx
>>> ```
>>>
>>> On Ubuntu, I see only the Cython file:
>>> ```
>>> $ l ~/anaconda3/envs/my_model/lib/python3.7/site-packages/pyarrow/ |
>>> grep gandiva
>>> -rw-rw-r-- 1 xander xander  17K Apr 13 12:28 gandiva.pyx
>>> ```
>>>
>>> Is this expected? Should I be able to import pyarrow.gandiva on Ubuntu?
>>> Everything is run on Ubuntu so if I make use of pyarrow.gandiva I'll need
>>> to figure out how to call it.
>>>
>>> It's mentioned here that it was removed from Python wheels but should
>>> still be available in the conda install:
>>> https://issues.apache.org/jira/browse/ARROW-10154. I'm not finding it
>>> in my Ubuntu conda install.
>>>
>>> Thanks,
>>> Xander
>>>
>>
>


Re: [Python] Run multiple pc.compute functions on chunks in single pass

2021-04-07 Thread Wes McKinney
We are working on implementing a streaming aggregation to be available in
Python but it probably won’t be available until the 5.0 release. I am not
sure solving this problem efficiently is possible at 100GB scale with the
tools currently available in pyarrow.
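
For what it's worth, a rough sketch (mine; the dataset path, column names, and the use of pandas for the per-chunk step are all illustrative) of computing these decomposable aggregates chunk by chunk today, at the cost of doing the combine step yourself:

```
import pandas as pd
import pyarrow.dataset as ds

dataset = ds.dataset("data/", format="parquet")

partials = []
for batch in dataset.to_batches(columns=["a", "b", "c", "x"]):
    df = batch.to_pandas()
    partials.append(df.groupby(["a", "b", "c"])["x"].agg(["sum", "count", "min", "max"]))

# sums and counts add up; mins/maxes combine with min/max
combined = pd.concat(partials).groupby(level=["a", "b", "c"]).agg(
    {"sum": "sum", "count": "sum", "min": "min", "max": "max"})
```

This only works for aggregates that decompose this way (a mean can be derived from sum/count, but e.g. an exact median cannot).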

On Wed, Apr 7, 2021 at 12:41 PM Suresh V  wrote:

> Hi .. I am trying to compute aggregates on large datasets (100GB) stored
> in parquet format. Current approach is to use scan/fragement to load chunks
> iteratively into memory and would like to run the equivalent of following
> on each chunk using pc.compute functions
>
> df.groupby(['a', 'b', 'c']).agg(['sum', 'count', 'min', 'max'])
>
> My understanding is that pc.compute needs to scan the entire array for
> each of the functions. Please let me know if that is not the case and how
> to optimize it.
>
> Thanks
>


Re: OSerror:Unable to load libjvm

2021-03-22 Thread Wes McKinney
I have never developed or tested the HDFS integration on Windows (we don't
test it in our CI either), so we would need to see if there is someone
reading who has used it successfully to try to help, or a developer who
wants to dig in and try to get it to work themselves (fixing anything that
pops up along the way).

On Mon, Mar 22, 2021 at 7:10 AM 황세규  wrote:

> Hello dear.
>
> My name is Joseph Hwang. I am a developer in South Korea.
>
> I try to develop hadoop file system client application with pyarrow 3 on
> windows 10. First, my development environment are like below,
>
>
>
> OS : Windows 10
>
> Language : Anaconda 2020.11
>
> IDE : eclipse
>
>
>
> And my environment variables are
>
>
>
> JAVA_HOME : C:\Program Files\Java\jdk-11.0.10
>
> HADOOP_HOME : C:\hadoop-3.3.0
>
> ARROW_LIBHDFS_DIR : C:\hadoop-3.3.0\lib\native
>
> CLASSPATH = 'hdfs classpath --glob'
>
>
>
> These are my short python codes with pyarrow
>
>
>
> from pyarrow import fs
>
> hdfs = fs.HadoopFileSystem('localhost', port=9000)
>
>
>
> But I can not connect to my hadoop file system. The brought error is
>
>
>
> hdfs = fs.HadoopFileSystem('localhost', port=9000)
>
>   File "pyarrow\_hdfs.pyx", line 83, in
> pyarrow._hdfs.HadoopFileSystem.__init__
>
>   File "pyarrow\error.pxi", line 122, in
> pyarrow.lib.pyarrow_internal_check_status
>
>   File "pyarrow\error.pxi", line 99, in pyarrow.lib.check_status
>
> OSError: Unable to load libjvm:
>
>
>
> I think my codes have some problems with java configuration but I have no
> idea how to correct this error.
>
> Kindly inform me of your advise to correct this error. Thank you for
> reading my e-mail.
>
> Best regards.
>


Re: Are Arrow, Flight and Plasma suitable for my use case?

2021-03-19 Thread Wes McKinney
The topic of putting tensors in an Arrow record batch column has come
up many times, the problem is only waiting for a champion to propose a
solution and implement it (particularly in the C++ side, it would be
pretty straightforward to implement this as an extension type on top
of binary arrays). If someone would like to fund this work, feel free
to get in touch with me offline.
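
In the absence of such an extension type, one toy sketch (mine; the shape and field name are illustrative) is to pack one fixed-shape float32 array per row into a FixedSizeList column and record the shape in the field metadata:

```
import numpy as np
import pyarrow as pa

shape = (4, 4)  # stand-in for e.g. (4000, 4000)
images = [np.random.rand(*shape).astype(np.float32) for _ in range(3)]

flat = pa.array(np.concatenate([a.ravel() for a in images]), type=pa.float32())
col = pa.FixedSizeListArray.from_arrays(flat, shape[0] * shape[1])

field = pa.field("image", col.type, metadata={"shape": "4,4"})
table = pa.table([col], schema=pa.schema([field]))
```

IPC/Feather buffer compression (e.g. zstd) then applies to this column like any other.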

On Fri, Mar 19, 2021 at 3:20 AM Fernando Herrera
 wrote:
>
> Hi Matias,
>
> If you are going to do tensor operations, then you could use the Arrow tensor
> representation.
>
> https://arrow.apache.org/docs/python/generated/pyarrow.Tensor.html
>
> However, I don't think the data stored in the tensor will be compressed. It 
> will be
> orderly stored so you can share the tensors with other processes.
>
> I hope that helps
> Fernando
>
> On Fri, Mar 19, 2021 at 8:52 AM Matias Guijarro  
> wrote:
>>
>> Hi !
>>
>> I recently learned about Apache Arrow, and as a preliminary study I would
>> like to know if it can be a good choice for my use case, or if I have to
>> look
>> for another technology (or to craft something specific on my own !).
>>
>> I could not really find answers to my questions in the FAQ or reading
>> articles and blogs, but I may have missed some information so I apologize
>> in advance if my questions have already been answered.
>>
>> Arrow is all about storing columnar data. What can be the content of the
>> elements in a column ?
>>
>> In my case, I have scalar values (numbers), 1D arrays and 2D arrays.
>> The 2D arrays can be quite big (4000x4000 float 32 for example).
>> So, we could imagine long tables, hundred thousands of lines, containing
>> a mix of those data types.
>>
>> I wonder if Arrow stays efficient for such kind of data ? In particular,
>> rows of 2D data arrays in a column may be difficult to handle with the
>> same level of optimization ? (just guessing)
>>
>> Is there some compression in Arrow ? I am thinking about blosc kind of
>> compression (like in the dead "bcolz" project - by the way someone already
>> wondered about Arrow + Blosc: https://github.com/Blosc/bcolz/issues/300)
>>
>> Another use case I have, is to be able for multiple processes on the same
>> computer to access the Arrow in-memory store ; it seems to me Plasma
>> does this job but I wonder about the trade-offs ?
>>
>> Thanks in advance for your advices - any help would be highly appreciated !
>>
>> Cheers,
>> Matias.
>>
>>
>>
>>
>>
>>


Re: [Python] Why is the access to values in ChunkedArray O(n_chunks) ?

2021-03-16 Thread Wes McKinney
Is there a Jira tracking this performance improvement? At minimum
getting to O(log k) indexing time where k is the number of chunks
would be a good goal
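
To make the proposed structure concrete, a small illustration (mine) of a prefix-sum index over chunk lengths plus binary search, which gives the O(log k) lookup discussed here:

```
import bisect
import pyarrow as pa

chunked = pa.chunked_array([[1, 2, 3], [4, 5], [6, 7, 8, 9]])

# cumulative end offsets of each chunk: [3, 5, 9]
ends = []
total = 0
for chunk in chunked.chunks:
    total += len(chunk)
    ends.append(total)

def value_at(i):
    c = bisect.bisect_right(ends, i)   # which chunk holds index i
    start = ends[c - 1] if c else 0
    return chunked.chunk(c)[i - start]

assert value_at(6).as_py() == 7
```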

On Mon, Mar 15, 2021 at 8:05 PM Micah Kornfield  wrote:
>
> One more micro optimization would be to use interpolation search instead of 
> binary search (haven't checked if this is what the compute code does)
>
> On Monday, March 15, 2021, Weston Pace  wrote:
>>
>> Since most of the time is probably spent loading each length from RAM
>> (the lengths aren't contiguous and given the size of the chunks they
>> are probably pretty far apart) you can even get significant speedup
>> just by using a contiguous array of lengths in your index.  Note, this
>> can be combined with Antoine's suggestion.  I did a quick
>> micro-benchmark and by the time I got to 32 chunks it was already 5x
>> slower.  With a contiguous array of lengths it was still within 20% of
>> the original time.
>>
>> On Mon, Mar 15, 2021 at 8:14 AM Antoine Pitrou  wrote:
>> >
>> > On Mon, 15 Mar 2021 11:02:09 -0700
>> > Micah Kornfield  wrote:
>> > > >
>> > > > Do you know if the iteration is done on the python side, or on the C++
>> > > > side ?
>> > >
>> > > It appears to be in cython [1] which looking at the definition, I would
>> > > expect to compile down to pretty straightforward C code.
>> > >
>> > > May I create a post on JIRA about adding an indexing structure for
>> > > > ChunkedArray ?
>> > >
>> > > Yes, please do, I'm sure others will have thoughts on the best way to
>> > > incorporate this (it might also be a good first contribution to the 
>> > > project
>> > > if you are interested).  I think also providing some context from this
>> > > thread on the relative slowness would be good (the bottleneck still might
>> > > be something else, that others more familiar with the code could point 
>> > > to).
>> >
>> > Just for the record, there is already something like this in the
>> > compute layer.  The general idea can probably be reused.
>> >
>> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_sort.cc#L94
>> >
>> > Regards
>> >
>> > Antoine.
>> >
>> >


Re: Implicit cast in PyArrow JSON schema inference (i.e. Integer -> String)

2021-03-15 Thread Wes McKinney
You can pass an explicit schema in the ParseOptions, but I don't know
if it will see "string" in the schema and promote integers (if not,
could you open a Jira about this?). Otherwise I'm not sure that
automatic "loose" type inference is a good default behavior (though
possibly something that could be opted into).
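
For reference, a sketch of the explicit-schema route (file name illustrative). Whether the reader promotes the integer row to string under this schema is exactly the open question above, so treat this as something to verify rather than settled behavior:

```
import pyarrow as pa
from pyarrow import json

opts = json.ParseOptions(explicit_schema=pa.schema([("col1", pa.string())]))
table = json.read_json("input.json", parse_options=opts)
```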

On Mon, Mar 15, 2021 at 4:04 PM Pavol Knapek  wrote:
>
> Hi guys,
>
> I'm trying to use the `pyarrow.json.read_json('input.json')` command - to 
> load a JSON file, infer the schema, and return a new `pyarrow.Table` instance.
>
> So, given an input:
> {"col1": "1"}
> {"col1": 1}
>
> I'd expect the output `pyarrow.Table` to have a schema {col1: string}, with 
> an implicit cast of Integer(s) to String(s).
>
> (As it gets inferred in a similar way i.e. by Apache Spark)
>
> But instead, an exception gets raised:
> ArrowInvalid: JSON parse error: Column(/col1) changed from string to number 
> in row 1
>
> Is there some way to let the infer-process know it can safely cast all types 
> to a super-type, if possible (i.e. Integer -> String, Object -> String, 
> Anything -> String, ...)?
>
> Thanks
>
> Best
> --
> Pavol Knapek
> mobile CA: +1 604 314 6164
> mobile CZ: +420 774 293 243
> mobile SK: +421 917 557 263
> e-mail: knapek.pa...@gmail.com
> http://linkedin.com/in/pavolknapek


Re: pyarrow.lib.ArrowInvalid: CSV parser got out of sync with chunker

2021-03-07 Thread Wes McKinney
If you are able to provide a file that reproduces the error, that
would also be very helpful (and we can open a Jira issue to track the
problem)
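
If Micah's guess below about embedded newlines is right, the relevant knob is ParseOptions.newlines_in_values; a hedged sketch (no guarantee it addresses this particular file):

```
from pyarrow import csv

reader = csv.open_csv(
    "inspect.csv",
    parse_options=csv.ParseOptions(newlines_in_values=True))
```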

On Fri, Mar 5, 2021 at 10:19 PM Micah Kornfield  wrote:
>
> Hi Ruben,
> I'm not an expert here, but is it possible the CSV has newlines inside quotes 
> or some oddity?  There are a lot of configuration options for Read CSV and 
> you might want to validate that the defaults are at the most conservative 
> settings.
>
> -Micah
>
> On Fri, Mar 5, 2021 at 12:40 PM Ruben Laguna  wrote:
>>
>> Hi,
>>
>> I'm getting "CSV parser got out of sync with chunker", any idea on how to 
>> troubleshoot this?
>> If I feed the original file it fails after 1477218 rows
>> if I remove the first line after the header then it fails after 2919443 rows
>> if I remove the first 2 lines after the header  then it fails after 55339 
>> rows
>> if I remove the first 3 lines after the header then it fails after 8200437 
>> rows
>> if I remove the first 4 line after the header then if fails after 1866573 
>> rows
>> To me it doesn't make sense, the failure shows at different, seemly random 
>> places.
>>
>> What can be causing this?  source code below->
>>
>>
>>
>> Traceback (most recent call last):
>>   File "pa_inspect.py", line 15, in 
>> for b in reader:
>>   File "pyarrow/ipc.pxi", line 497, in __iter__
>>   File "pyarrow/ipc.pxi", line 531, in 
>> pyarrow.lib.RecordBatchReader.read_next_batch
>>   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
>> pyarrow.lib.ArrowInvalid: CSV parser got out of sync with chunker
>> in
>>
>>
>> import pyarrow as pa
>> from pyarrow import csv
>> import pyarrow.parquet as pq
>>
>> # 
>> http://arrow.apache.org/docs/python/generated/pyarrow.csv.open_csv.html#pyarrow.csv.open_csv
>> # 
>> http://arrow.apache.org/docs/python/generated/pyarrow.csv.CSVStreamingReader.html
>> reader = csv.open_csv('inspect.csv')
>>
>>
>> # ParquetWriter : 
>> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html
>> # RecordBat
>> # 
>> http://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
>> crow = 0
>> with pq.ParquetWriter('inspect.parquet', reader.schema) as writer:
>> for b in reader:
>> print(b.num_rows,b.num_columns)
>> crow = crow + b.num_rows
>> print(crow)
>> writer.write_table(pa.Table.from_batches([b]))
>>
>> --
>> /Rubén


Re: [C++] - How to extract indices of nested MapArray

2021-03-03 Thread Wes McKinney
I think C++14 is fine for optional dependencies and shouldn't block
any development work right now. Note that we should be able to upgrade
to require a minimum of C++14 as soon as April or May of this year
since we will stop having to support one of the last gcc < 5
toolchains (for R 3.5 IIUC)

On Wed, Mar 3, 2021 at 5:41 PM Yeshwanth Sriram  wrote:
>
> Hi Micah,
>
> Thank you for the detailed response. Apologize for not responding earlier.
>
> a.) Looked at the latencies with and without filtering based on just foreach 
> and the latency is dominated by the parquet/write operation. So I’m going to 
> go with what I have which already provides substantial improvement for my use 
> case.
>
> b.) Would like to contribute for implement ANY over booleans in Arrow/compute 
> kernel. Waiting for permission to come through.
>
> I’m also interested in contributing to Azure/ADLS filesystem but the library 
> I was looking at is c++14 here https://github.com/Azure/azure-sdk-for-cpp . 
> Is c++14 no-go as a dependency in Arrow (even conditional ?)
>
> Thank you
> Yesh
>
> On Feb 28, 2021, at 2:09 PM, Micah Kornfield  wrote:
>
> Hi  Yeshwanth,
> I think you can do the first part of the filtering using the Equals kernel 
> and IsIn kernel on the child arrays of the Map.  I took a quick look but I 
> don't think that there is anything implemented that would allow you to map 
> the resulting bitmaps to the parent lists. It seems that we would want to add 
> an "Any" function for List that returns a Bool array if any of the 
> elements are true. There is already one for flat Boolean Arrays [1] but I 
> don't think that is useful here.
>
> So I think the logic that you would ultimately want in pseudo-code:
>
> children_bitmap = Equals(map.key, "some string") && IsIn(map.struct.id, 
> [“aaa”, “bee”, “see”])
> list = MakeList(map.offsets, children_bitmap)
> final_selection = Any(list)
>
> Is the new Kernel something you would be interested in contributing?
>
> -Micah
>
> [1] https://github.com/apache/arrow/pull/8294
>
> On Sun, Feb 28, 2021 at 9:05 AM Yeshwanth Sriram  
> wrote:
>>
>> Using C++//Arrow to filter out large parquet files and I’m able to do this 
>> successfully. The current poc implementation is based on nested for/loops 
>> which I would like to avoid this and instead use built-in filter/take 
>> functions or some recommendations  to extract (take functions ?) arrays of 
>> indices or booleans to filter out rows.
>>
>> The input (data) array/column type is MapArray[key:String, 
>> value:StructArray[id:String, …]]
>>
>> The input filter is a {filter_key: “some string”, filter_ids: [“aaa”, “bee”, 
>> “see”, ..] }
>>   - Where filter_key, and filter_ids is to match contents of input MapArray
>>
>> The output I’m looking for is either array of booleans or indices of input 
>> array that match the input filer.
>>
>> Thank you
>
>


Re: Regarding Flight multiple endpoints

2021-03-02 Thread Wes McKinney
The answers to these questions depend greatly on the particular
implementation of Flight — they're left to the developer creating the
Flight implementation to decide what makes sense. Simple
implementations may just have one endpoint for a client to consume (a
point-to-point transfer) whereas others may involve multiple
endpoints.

On Mon, Mar 1, 2021 at 10:30 PM Priyanshu Mittal  wrote:
>
> Hi,
>
> Suppose we have multiple streams of data, then should we create multiple 
> flights for these streams or should we push all the streams in the same 
> flight?
> And also when should we use multiple endpoints like we should upload a single 
> stream to multiple endpoints or different endpoints are used for different 
> streams?
> I was trying to get what is the right way to use the arrow flight?
>
> Furthermore, I didn't find any practical implementation of it.
> So can anyone please provide the sample implementation or documentation for 
> this?
>
> Thanks,
> Priyanshu


Re: Python Plasma Store Best Practices

2021-03-02 Thread Wes McKinney
Also to be clear, if someone wants to maintain it, they are more than
welcome to do so.

On Tue, Mar 2, 2021 at 11:49 AM Sam Shleifer  wrote:

> Thanks, had no idea!
>
>
> On Tue, Mar 02, 2021 at 12:00 PM, Micah Kornfield 
> wrote:
>
>> Hi Sam,
>> I think the lack of responses might be because Plasma is not being
>> actively maintained.  The original authors have forked it into the Ray
>> project.
>>
>> I'm sorry I don't have the expertise to answer your questions.
>>
>> -Micah
>>
>> On Mon, Mar 1, 2021 at 6:48 PM Sam Shleifer  wrote:
>>
>>> Partial answers are super helpful!
>>> I'm happy to break this up if it's too much for 1 question @moderators
>>> Sam
>>>
>>>
>>>
>>> On Sat, Feb 27, 2021 at 1:27 PM, Sam Shleifer 
>>> wrote:
>>>
 Hi!
 I am trying to use plasma store to reduce the memory usage of a pytorch
 dataset/dataloader combination, and had 4  questions. I don’t think any of
 them require pytorch knowledge. If you prefer to comment inline there is a
 quip with identical content and prettier formatting here
 https://quip.com/3mwGAJ9KR2HT

 *1)* My script starts the plasma-store from python with 200 GB:

 nbytes = (1024 ** 3) * 200
 _server = subprocess.Popen(["plasma_store", "-m", str(nbytes), "-s", path])
 where nbytes is chosen arbitrarily. From my experiments it seems that
 one should start the store as large as possible within the limits of
 dev/shm . I wanted to verify whether this is actually the best practice (it
 would be hard for my app to know the storage needs up front) and also
 whether there is an automated way to figure out how much storage to
 allocate.

 *2)* Does plasma store support simultaneous reads? My code, which has
 multiple clients all asking for the 6 arrays from the plasma-store
 thousands of times, was segfaulting with different errors, e.g.
 Check failed: RemoveFromClientObjectIds(object_id, entry, client) == 1
 until I added a lock around my client.get

 if self.use_lock:  # Fix segfault
     with FileLock("/tmp/plasma_lock"):
         ret = self.client.get(self.object_id)
 else:
     ret = self.client.get(self.object_id)

 which fixes it.

 Here is a full traceback of the failure without the lock
 https://gist.github.com/sshleifer/75145ba828fcb4e998d5e34c46ce13fc
 Is this expected behavior?

 *3)* Is there a simple way to add many objects to the plasma store at
 once? Right now, we are considering changing,

 oid = client.put(array)
 to
 oids = [client.put(x) for x in array]

 so that we can fetch one entry at a time. but the writes are much
 slower.

 * 3a) Is there a lower level interface for bulk writes?
 * 3b) Or is it recommended to chunk the array and have different python
 processes write simultaneously to make this faster?

 *4)* Is there a way to save/load the contents of the plasma-store to
 disk without loading everything into memory and then saving it to some
 other format?

 Replication

 Setup instructions for fairseq+replicating the segfault:
 https://gist.github.com/sshleifer/bd6982b3f632f1d4bcefc9feceb30b1a
 My code is here: https://github.com/pytorch/fairseq/pull/3287

 Thanks!
 Sam

>>>
>


Re: pyarrow: write table where columns share the same dictionary

2021-02-25 Thread Wes McKinney
I'm not sure if it's possible at the moment, but it SHOULD be made
possible. See ARROW-5340

On Thu, Feb 25, 2021 at 10:36 AM Joris Peeters
 wrote:
>
> Hello,
>
> I have a pandas DataFrame with many string columns (>30,000), and they share 
> a low-cardinality set of values (e.g. size 100). I'd like to convert this to 
> an Arrow table of dictionary encoded columns (let's say int16 for the index 
> cols), but with just one shared dictionary of strings.
> This is to avoid ending up with >30,000 tiny dictionaries on the wire, which 
> doesn't even load in e.g. Java (due to a stackoverflow error).
>
> Despite my efforts, I haven't really been able to achieve this with the 
> public API's I could find. Does anyone have an idea? I'm using pyarrow 3.0.0.
>
> For a mickey mouse example, I'm looking at e.g.
>
> df = pd.DataFrame({'a': ['foo', None, 'bar'], 'b': [None, 'quux', 'foo']})
>
> and would like a Table with dictionary-encoded columns a and b, both 
> nullable, that both refer to the same dictionary with id=0 (or whatever id) 
> containing ['foo', 'bar', 'quux'].
>
> Thanks,
> -Joris.
>
>
>
>
>
>
>
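
Until ARROW-5340 lands, one hedged workaround sketch is to dictionary-encode the columns yourself against a single dictionary array. Note that whether the IPC writer actually deduplicates the dictionary on the wire is exactly what the Jira tracks, so this is not guaranteed to produce one shared dictionary in the stream:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'a': ['foo', None, 'bar'], 'b': [None, 'quux', 'foo']})

# one shared dictionary of values, indices built per column
dictionary = pa.array(['foo', 'bar', 'quux'])
lookup = {v: i for i, v in enumerate(dictionary.to_pylist())}

def encode(values):
    indices = pa.array([lookup.get(v) for v in values], type=pa.int16())
    return pa.DictionaryArray.from_arrays(indices, dictionary)

table = pa.table({name: encode(df[name]) for name in df.columns})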


Re: [c++] Help with serializing and IPC with dictionary arrays

2021-02-18 Thread Wes McKinney
I believe you have to extend the ipc::MessageReader interface, have you
looked at the details in

https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/client.cc#L425

? (there is analogous code handling the Put side in server.cc) The idea is
that you feed the stream of IPC messages and the dictionary
accounting/record batch reconstruction is handled internally.

On Thu, Feb 18, 2021 at 12:14 PM Dawson D'Almeida <
dawson.dalme...@snowflake.com> wrote:

> Hi Wes,
>
> We have our own implementation of something like Flight for flexibility of
> use.
>
> The main thing that I am trying to figure out is how to get the dictionary
> record batches properly deserialized on the server side. On the client
> side, I can deserialize them properly using the dictionarymemo directly
> from the record batch we create, but on the other side I do not have access
> to the same dictionarymemo. How is this passed in Flight? I have been
> trying to find this in the source code but haven't yet.
>
> Thanks,
> Dawson
>
> On Fri, Feb 12, 2021 at 3:34 PM Wes McKinney  wrote:
>
>> hi Dawson — you need to follow the IPC stream protocol, e.g. what
>> RecordBatchStreamWriter or RecordBatchStreamReader are doing
>> internally. Is there a reason you cannot use these interfaces
>> (particularly their internal bits, which are also used to implement
>> Flight where messages are split across different elements of a gRPC
>> stream)?
>>
>> I'm not sure that I would advise you to deal with dictionary
>> disassembly and reconstruction on your own unless it's your only
>> option. That said if you look in the unit test suite you should be
>> able to find examples of where DictionaryBatch IPC messages are
>> reconstructed manually, and then used to reconstitute a RecordBatch
>> IPC message using the arrow::ipc::ReadRecordBatch API. We can try to
>> help you look in the right place, let us know.
>>
>> Thanks,
>> Wes
>>
>> On Fri, Feb 12, 2021 at 2:58 PM Dawson D'Almeida
>>  wrote:
>> >
>> > I am trying to create a record batch containing any number of
>> dictionary and/or normal arrow arrays, serialize the record batch into
>> bytes (a normal std::string), and send it via grpc to another server
>> process. On that end we receive the arrow bytes and deserialize using the
>> bytes and the schema.
>> >
>> > Is there a standard way to serialize/deserialize these dictionary
>> arrays? It seems like all of the info is packaged correctly into the record
>> batch.
>> >
>> > I've looked through a lot of the c++ apache arrow source and test code
>> but I can't find how to approach our use case.
>> >
>> > The current failure is:
>> > Field with memory address 140283497044320 not found
>> > from the returns status from arrow::ipc::ReadRecordBatch
>> >
>> > Thanks,
>> > --
>> > Dawson d'Almeida
>> > Software Engineer
>> >
>> > MOBILE  +1 360 499 1852
>> > EMAIL  dawson.dalme...@snowflake.com
>> >
>> >
>> > Snowflake Inc.
>> > 227 Bellevue Way NE
>> > Bellevue, WA, 98004
>>
>
>
> --
> Dawson d'Almeida
> Software Engineer
>
> MOBILE  +1 360 499 1852
> EMAIL  dawson.dalme...@snowflake.com 
>
>
> Snowflake Inc.
> 227 Bellevue Way NE
> Bellevue, WA, 98004
>
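
The key point above, that the dictionaries travel inside the IPC stream itself, can be seen in a few lines of pyarrow (shown in Python for brevity; the C++ RecordBatchStreamWriter/Reader pair behaves the same way):

import pyarrow as pa

batch = pa.record_batch([pa.array(["a", "b", "a"]).dictionary_encode()],
                        names=["label"])

# writer side: the stream format interleaves schema, dictionary batches and record batches
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
buf = sink.getvalue()

# reader side: the reader rebuilds the dictionaries from the stream itself,
# so no DictionaryMemo has to be shipped out of band
reader = pa.ipc.open_stream(buf)
roundtripped = reader.read_all()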


Re: [Python] Saving ChunkedArray to disk and reading with flight

2021-02-18 Thread Wes McKinney
On the "This is slower and less memory efficient than `memmap` by about
15%." -- if you can show us more precisely what code you have written that
will help us advise you. In principle if you are using pyarrow.memory_map
the performance / memory use shouldn't be significantly different

On Wed, Feb 17, 2021 at 9:57 PM Micah Kornfield 
wrote:

> Hi Sam,
> Could you elaborate on what advantages you were hoping to benefit from
> Arrow?  It seems like the process you describe is probably close to optimal
> (I have limited knowledge of np.memmap). And there could be alternative
> suggestions based on the exact shape of your data and how you want to
> process it.  I added some more comments inline below.
>
> The current solution is to flatten the array, keep a list of the
>> lengths/offsets, store the flattened array in  `np.memmap`, then have each
>> process slice into the memmap at the right index.
>> It seems that with arrow, we can at least delete the list of
>> lengths/offsets.
>
> In Arrow it seems like the natural fit here is to use a ListArray wrapped
> around the numpy arrays. This would add back in the indices/offsets.
>
> padding each entry in the list to a fixed length, and saving pa.Table to
>> pa.NativeFile. Each process reads its own pa.Table. This is slower and
>> less memory efficient than `memmap` by about 15%.
>
> How are you reading back the file?  Are you using MemoryMappedFile [1]?
>
> 1) Are there any examples online that do this sort of operation? I can't
>> find how to save chunked array to disk, or a python Flight example after a
>> few googles.
>
> ChunkedArrays aren't a first-class citizen in the Arrow File Format
> specification.  Working through tables that get converted to RecordBatches
> when saving is all that is supported.
>
>
> 2) Is it unreasonable to think this will use less memory than np.memmap?
>
> I'm not familiar with np.memmap, so I can't really say.
>
>
> [1] https://arrow.apache.org/docs/python/generated/pyarrow
>
>
>
> On Wed, Feb 17, 2021 at 7:11 PM Sam Shleifer  wrote:
>
>> *My goal*
>> I have a list of numpy arrays of uneven length. From the docs, I guess
>> the right format for this is ChunkedArray
>> I want to save my list to disk in one process, and then start many new
>> processes (a pytorch dataloader) that are able to read chunks from the file
>> with low memory overhead.
>> The current solution is to flatten the array, keep a list of the
>> lengths/offsets, store the flattened array in  `np.memmap`, then have each
>> process slice into the memmap at the right index.
>> It seems that with arrow, we can at least delete the list of
>> lengths/offsets.
>>
>> *What I have tried:*
>> padding each entry in the list to a fixed length, and saving pa.Table to
>> pa.NativeFile. Each process reads its own pa.Table. This is slower and
>> less memory efficient than `memmap` by about 15%.
>>
>> *My questions:*
>> 1) Are there any examples online that do this sort of operation? I can't
>> find how to save chunked array to disk, or a python Flight example after a
>> few googles.
>> 2) Is it unreasonable to think this will use less memory than np.memmap?
>>
>> Thanks in advance!
>> Sam
>>
>>
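
A hedged sketch of the combination suggested above: store the ragged arrays as one list<float64> column in an IPC file written once, then memory-map it from each worker (file name and sizes are made up):

import numpy as np
import pyarrow as pa

# uneven numpy arrays stored as a single list<float64> column
chunks = [np.random.randn(n) for n in (3, 7, 5)]
table = pa.table({"values": pa.array([c.tolist() for c in chunks],
                                     type=pa.list_(pa.float64()))})

# write once with the IPC file format
with pa.OSFile("ragged.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# each worker memory-maps the file; batches reference the mapped pages
# instead of being copied onto the heap
source = pa.memory_map("ragged.arrow", "r")
reader = pa.ipc.open_file(source)
first_row = reader.get_batch(0).column(0)[0]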


Re: [c++] Help with serializing and IPC with dictionary arrays

2021-02-12 Thread Wes McKinney
hi Dawson — you need to follow the IPC stream protocol, e.g. what
RecordBatchStreamWriter or RecordBatchStreamReader are doing
internally. Is there a reason you cannot use these interfaces
(particularly their internal bits, which are also used to implement
Flight where messages are split across different elements of a gRPC
stream)?

I'm not sure that I would advise you to deal with dictionary
disassembly and reconstruction on your own unless it's your only
option. That said if you look in the unit test suite you should be
able to find examples of where DictionaryBatch IPC messages are
reconstructed manually, and then used to reconstitute a RecordBatch
IPC message using the arrow::ipc::ReadRecordBatch API. We can try to
help you look in the right place, let us know.

Thanks,
Wes

On Fri, Feb 12, 2021 at 2:58 PM Dawson D'Almeida
 wrote:
>
> I am trying to create a record batch containing any number of dictionary 
> and/or normal arrow arrays, serialize the record batch into bytes (a normal 
> std::string), and send it via grpc to another server process. On that end we 
> receive the arrow bytes and deserialize using the bytes and the schema.
>
> Is there a standard way to serialize/deserialize these dictionary arrays? It 
> seems like all of the info is packaged correctly into the record batch.
>
> I've looked through a lot of the c++ apache arrow source and test code but I 
> can't find how to approach our use case.
>
> The current failure is:
> Field with memory address 140283497044320 not found
> from the returns status from arrow::ipc::ReadRecordBatch
>
> Thanks,
> --
> Dawson d'Almeida
> Software Engineer
>
> MOBILE  +1 360 499 1852
> EMAIL  dawson.dalme...@snowflake.com
>
>
> Snowflake Inc.
> 227 Bellevue Way NE
> Bellevue, WA, 98004


Re: pyarrow.lib.ArrowInvalid: Can only convert 1-dimensional array values

2021-02-11 Thread Wes McKinney
We simply don't have conversions nor type inference implemented for
inner elements with dimension greater than 1. You're welcome to
propose this as a new feature / enhancement by opening a Jira issue.

On Thu, Feb 11, 2021 at 10:53 AM Bhavitvya Malik
 wrote:
>
> Sure, here it is:
>
> data = np.zeros((10,8), dtype=np.uint8)
> out = pa.array(list(data))
> out.type  # ListType(list<item: uint8>)
>
> data = np.zeros((3,4,6), dtype=np.uint8)
> out = pa.array(list(data))  # Throws error ArrowInvalid: Can only convert 
> 1-dimensional array values
>
> Even though it's working on 2D numpy arrays perfectly, it doesn't work on 
> N-Dimensional numpy arrays (where N > 2). Why is it so?
>
>
> On Thu, 11 Feb 2021 at 21:18, Wes McKinney  wrote:
>>
>> Can you provide more detail about what you are trying? You've showed
>> some exception here but haven't showed the exact code that results in
>> those exceptions
>>
>> On Thu, Feb 11, 2021 at 4:34 AM Bhavitvya Malik
>>  wrote:
>> >
>> > Hi,
>> > It's a follow up question for #9462. Rewriting the issue here:
>> >
>> >> I came to know that pyarrow has this limitation of not storing 
>> >> N-dimensional array. After looking into this issue, I decided to 
>> >> represent a N-dimensional array as a list of arrays i.e.
>> >> data = np.zeros((5, 3), dtype=np.uint8)
>> >> data = list(data)
>> >> inorder to preserve the dtype but when it comes to typecasting and 
>> >> writing it into array (from list) pyarrow.array(data, type=type) it gives 
>> >> the following error:
>> >> pyarrow.lib.ArrowInvalid: Could not convert [0 0 0] with type 
>> >> numpy.ndarray: tried to convert to int
>> >> Is there any way to avoid this issue? I just want to preserve the dtype 
>> >> from numpy array before converting it to list so that while writing it to 
>> >> pyarrow array format I can recognise its dtype and subsequently write it 
>> >> in that numpy dtype format.
>> >
>> >
>> > I tried it with a 3D numpy array and it gave me this error even though 
>> > it's working fine with 2D numpy arrays. Can you please look into this?
>> >
>> > pyarrow.lib.ArrowInvalid: Can only convert 1-dimensional array values
>> >
>> >
>> > My current pyarrow version is 2.0.0 and i tried it with pyarrow==3.0.0 too
>> >
>> >
>> >
>> > Thanks,
>> > Bhavitvya
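
A small workaround sketch until such a feature exists: go through nested Python lists so that every inner value pyarrow sees is a scalar, and pin the dtype with an explicit Arrow type:

import numpy as np
import pyarrow as pa

data = np.zeros((3, 4, 6), dtype=np.uint8)

# nested lists of scalars convert fine; the explicit type preserves uint8
out = pa.array(data.tolist(), type=pa.list_(pa.list_(pa.uint8())))

# round trip back to numpy with the original shape and dtype
restored = np.array(out.to_pylist(), dtype=np.uint8).reshape(data.shape)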


Re: pyarrow.lib.ArrowInvalid: Can only convert 1-dimensional array values

2021-02-11 Thread Wes McKinney
Can you provide more detail about what you are trying? You've showed
some exception here but haven't showed the exact code that results in
those exceptions

On Thu, Feb 11, 2021 at 4:34 AM Bhavitvya Malik
 wrote:
>
> Hi,
> It's a follow up question for #9462. Rewriting the issue here:
>
>> I came to know that pyarrow has this limitation of not storing N-dimensional 
>> array. After looking into this issue, I decided to represent a N-dimensional 
>> array as a list of arrays i.e.
>> data = np.zeros((5, 3), dtype=np.uint8)
>> data = list(data)
>> inorder to preserve the dtype but when it comes to typecasting and writing 
>> it into array (from list) pyarrow.array(data, type=type) it gives the 
>> following error:
>> pyarrow.lib.ArrowInvalid: Could not convert [0 0 0] with type numpy.ndarray: 
>> tried to convert to int
>> Is there any way to avoid this issue? I just want to preserve the dtype from 
>> numpy array before converting it to list so that while writing it to pyarrow 
>> array format I can recognise its dtype and subsequently write it in that 
>> numpy dtype format.
>
>
> I tried it with a 3D numpy array and it gave me this error even though it's 
> working fine with 2D numpy arrays. Can you please look into this?
>
> pyarrow.lib.ArrowInvalid: Can only convert 1-dimensional array values
>
>
> My current pyarrow version is 2.0.0 and i tried it with pyarrow==3.0.0 too
>
>
>
> Thanks,
> Bhavitvya


Re: pyarrow==3.0.0 installation issue

2021-02-06 Thread Wes McKinney
I think you need to upgrade your setuptools so that you can use
manylinux2010/2014 packages.

On Sat, Feb 6, 2021 at 9:23 AM Tanveer Ahmad - EWI  wrote:
>
> Hi,
>
>
> I am able to install pyarrow 1.0.0 and 2.0.0 through pip command but same 
> command fails for 3.0.0.
>
> I have reinstalled re packages but still this issue exist.
>
>
> tahmad@TUD256255:~$ pip3 install pyarrow==1.0.0
> Collecting pyarrow==1.0.0
>   Downloading 
> https://files.pythonhosted.org/packages/a2/7b/66b099f91911bf3660f70ab353f7592dd59e8d6c58c7f0582b6d8884464a/pyarrow-1.0.0-cp36-cp36m-manylinux1_x86_64.whl
>  (16.6MB)
> 100% || 16.6MB 93kB/s
> Collecting numpy>=1.14 (from pyarrow==1.0.0)
>   Using cached 
> https://files.pythonhosted.org/packages/45/b2/6c7545bb7a38754d63048c7696804a0d947328125d81bf12beaa692c3ae3/numpy-1.19.5-cp36-cp36m-manylinux1_x86_64.whl
> Installing collected packages: numpy, pyarrow
> Successfully installed numpy-1.19.5 pyarrow-1.0.0
> tahmad@TUD256255:~$ pip3 install pyarrow==2.0.0
> Collecting pyarrow==2.0.0
>   Downloading 
> https://files.pythonhosted.org/packages/f9/a0/f2941d8274435f403698aee63da0d171552a9acb348d37c7e7ff25f1ae1f/pyarrow-2.0.0-cp36-cp36m-manylinux1_x86_64.whl
>  (16.9MB)
> 100% || 16.9MB 88kB/s
> Collecting numpy>=1.14 (from pyarrow==2.0.0)
>   Using cached 
> https://files.pythonhosted.org/packages/45/b2/6c7545bb7a38754d63048c7696804a0d947328125d81bf12beaa692c3ae3/numpy-1.19.5-cp36-cp36m-manylinux1_x86_64.whl
> Installing collected packages: numpy, pyarrow
> Successfully installed numpy-1.19.5 pyarrow-2.0.0
> tahmad@TUD256255:~$ pip3 install pyarrow==3.0.0
> Collecting pyarrow==3.0.0
>   Using cached 
> https://files.pythonhosted.org/packages/62/d3/a482d8a4039bf931ed6388308f0cc0541d0cab46f0bbff7c897a74f1c576/pyarrow-3.0.0.tar.gz
> Collecting numpy>=1.16.6 (from pyarrow==3.0.0)
>   Using cached 
> https://files.pythonhosted.org/packages/45/b2/6c7545bb7a38754d63048c7696804a0d947328125d81bf12beaa692c3ae3/numpy-1.19.5-cp36-cp36m-manylinux1_x86_64.whl
> Building wheels for collected packages: pyarrow
>   Running setup.py bdist_wheel for pyarrow ... error
>   Complete output from command /usr/bin/python3 -u -c "import setuptools, 
> tokenize;__file__='/tmp/pip-build-nxybi474/pyarrow/setup.py';f=getattr(tokenize,
>  'open', open)(__file__);code=f.read().replace('\r\n', 
> '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d 
> /tmp/tmp_uoobnqdpip-wheel- --python-tag cp36:
>   running bdist_wheel
>   running build
>   running build_py
>   creating build
>   creating build/lib.linux-x86_64-3.6
>   creating build/lib.linux-x86_64-3.6/pyarrow
>   copying pyarrow/serialization.py -> build/lib.linux-x86_64-3.6/pyarrow
>   copying pyarrow/types.py -> build/lib.linux-x86_64-3.6/pyarrow
>   copying pyarrow/__init__.py -> build/lib.linux-x86_64-3.6/pyarrow
>   copying pyarrow/plasma.py -> build/lib.linux-x86_64-3.6/pyarrow
>   copying pyarrow/json.py -> build/lib.linux-x86_64-3.6/pyarrow
>   copying pyarrow/csv.py -> build/lib.linux-x86_64-3.6/pyarrow
>   copying pyarrow/benchmark.py -> build/lib.linux-x86_64-3.6/pyarrow
>   copying pyarrow/parquet.py -> build/lib.linux-x86_64-3.6/pyarrow
>   copying pyarrow/jvm.py -> build/lib.linux-x86_64-3.6/pyarrow
>   copying pyarrow/orc.py -> build/lib.linux-x86_64-3.6/pyarrow
>   copying pyarrow/util.py -> build/lib.linux-x86_64-3.6/pyarrow
>   copying pyarrow/compat.py -> build/lib.linux-x86_64-3.6/pyarrow
>   copying pyarrow/ipc.py -> build/lib.linux-x86_64-3.6/pyarrow
>   copying pyarrow/hdfs.py -> build/lib.linux-x86_64-3.6/pyarrow
>   copying pyarrow/filesystem.py -> build/lib.linux-x86_64-3.6/pyarrow
>   copying pyarrow/cuda.py -> build/lib.linux-x86_64-3.6/pyarrow
>   copying pyarrow/fs.py -> build/lib.linux-x86_64-3.6/pyarrow
>   copying pyarrow/cffi.py -> build/lib.linux-x86_64-3.6/pyarrow
>   copying pyarrow/_generated_version.py -> build/lib.linux-x86_64-3.6/pyarrow
>   copying pyarrow/flight.py -> build/lib.linux-x86_64-3.6/pyarrow
>   copying pyarrow/dataset.py -> build/lib.linux-x86_64-3.6/pyarrow
>   copying pyarrow/compute.py -> build/lib.linux-x86_64-3.6/pyarrow
>   copying pyarrow/pandas_compat.py -> build/lib.linux-x86_64-3.6/pyarrow
>   copying pyarrow/feather.py -> build/lib.linux-x86_64-3.6/pyarrow
>   creating build/lib.linux-x86_64-3.6/pyarrow/tests
>   copying pyarrow/tests/test_scalars.py -> 
> build/lib.linux-x86_64-3.6/pyarrow/tests
>   copying pyarrow/tests/test_table.py -> 
> build/lib.linux-x86_64-3.6/pyarrow/tests
>   copying pyarrow/tests/test_tensor.py -> 
> build/lib.linux-x86_64-3.6/pyarrow/tests
>   copying pyarrow/tests/conftest.py -> 
> build/lib.linux-x86_64-3.6/pyarrow/tests
>   copying pyarrow/tests/test_gandiva.py -> 
> build/lib.linux-x86_64-3.6/pyarrow/tests
>   copying pyarrow/tests/__init__.py -> 
> build/lib.linux-x86_64-3.6/pyarrow/tests
>   copying pyarrow/tests/test_filesystem.py -> 
> build/lib.linux-x86_64-3.6/pyarrow/tests
> 

Re: [Python] Trying to store nested dict in Table gives unexpected behavior

2021-01-31 Thread Wes McKinney
hi Partha,

I believe you have mixed up struct and map types. When you pass a
list-of-pydicts to Arrow, it infers a struct type for the dicts by
default, which means that all of the observed keys will be represented
in every entry (with null values if they are not present), so here
it's something like list<struct<CORE: struct<'0': struct<status: string>, '1': ..., '2': ...>>>.

If you want a map type (where each dict has different entries), you
have to write down the map type you want explicitly and pass that when
constructing the Arrow array object. What you want is
list<map<string, map<string, struct<status: string>>>> (I think)

- Wes


On Fri, Jan 29, 2021 at 9:23 AM PARTHA DUTTA  wrote:
>
> I may be doing something wrong here, so any help would be greatly 
> appreciated. I am trying to store a nested python dict into an Arrow table, 
> and I am getting some unexpected results. This is sample code:
>
> import copy
> import pyarrow as pa
> import random
>
> def test_it():
> arr = []
> for f in range(5):
> num_maps = random.randrange(4) + 1
> print("Number of maps = {}".format(num_maps))
> mdict = {}
> mdict["CORE"] = {}
> for r in range(num_maps):
> mdict["CORE"][str(r)] = {"status": "realized"}
> arr.append(copy.deepcopy(mdict))
> tbl = pa.Table.from_pydict({"_map": arr})
> print(tbl.to_pydict())
>
> test_it()
>
>
> This is the output of the code:
>
> Number of maps = 1
> Number of maps = 1
> Number of maps = 2
> Number of maps = 3
> Number of maps = 2
> {'_map': [{'CORE': {'0': {'status': 'realized'}, '1': None, '2': None}}, 
> {'CORE': {'0': {'status': 'realized'}, '1': None, '2': None}}, {'CORE': {'0': 
> {'status': 'realized'}, '1': {'status': 'realized'}, '2': None}}, {'CORE': 
> {'0': {'status': 'realized'}, '1': {'status': 'realized'}, '2': {'status': 
> 'realized'}}}, {'CORE': {'0': {'status': 'realized'}, '1': {'status': 
> 'realized'}, '2': None}}]}
>
> It seems that when the table is created, it is filling in empty dict values 
> such that the number of elements is completely equal. This is not what I 
> wanted, and I am wondering if this is a feature, or am I missing something 
> such that my intended output would not contain "null" vales.
>
> Thanks,
> Partha
> --
> Partha Dutta
> partha.du...@gmail.com
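
A hedged sketch of the explicit map-type route for the example above; map values are passed as lists of (key, value) pairs, and only the entries a row really has are stored:

import pyarrow as pa

arr = [{"CORE": {"0": {"status": "realized"}, "1": {"status": "realized"}}},
       {"CORE": {"0": {"status": "realized"}}}]

map_type = pa.map_(pa.string(),
                   pa.map_(pa.string(), pa.struct([("status", pa.string())])))

# each dict level becomes a list of (key, value) pairs for the map type
rows = [[(core_key, list(inner.items())) for core_key, inner in d.items()]
        for d in arr]

tbl = pa.Table.from_arrays([pa.array(rows, type=map_type)], names=["_map"])
print(tbl.to_pydict())  # entries come back as (key, value) pairs, without padded null keys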


Re: [Python] HDFS write fails when size of file is higher than 6gb

2021-01-26 Thread Wes McKinney
It appears that writes over 2GB are implemented incorrectly.

https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/hdfs.cc#L277

the tSize type in libhdfs is an int32_t. So that static cast is truncating data

https://issues.apache.org/jira/browse/ARROW-11391

I would recommend breaking the work into smaller pieces as a workaround

On Tue, Jan 26, 2021 at 1:45 AM Сергей Красовский  wrote:
>
> Hello Arrow team,
>
> I have an issue with writing files with size > 6143mb to HDFS. Exception is:
>
>> Traceback (most recent call last):
>>   File "exp.py", line 22, in 
>> output_stream.write(open(source, "rb").read())
>>   File "pyarrow/io.pxi", line 283, in pyarrow.lib.NativeFile.write
>>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
>> OSError: HDFS Write failed, errno: 22 (Invalid argument)
>
>
> The code below works for files with size <= 6143mb.
>
> Hadoop version: 3.1.1.3.1.4.0-315
> Python version: 3.6.10
> Pyarrow version: 2.0.0
> System: Ubuntu 16.04.7 LTS
>
> I am trying to understand what happens under the hood of
> pyarrow.lib.NativeFile.write. Is there any limitation on the pyarrow side, an
> incompatibility with the Hadoop version, or some settings issue on my side?
>
> If you have any input I would highly appreciate it.
>
> The python script to upload a file:
>
>> import os
>> import pyarrow as pa
>>
>> os.environ["JAVA_HOME"]=""
>> os.environ['ARROW_LIBHDFS_DIR'] = "/libhdfs.so"
>>
>> connected = pa.hdfs.connect(host="",port=8020)
>>
>> destination = "hdfs://:8020/user/tmp/6144m.txt"
>> source = "/tmp/6144m.txt"
>>
>> with connected.open(destination, "wb") as output_stream:
>> output_stream.write(open(source, "rb").read())
>>
>> connected.close()
>
>
> How to create a 6gb file:
>
>> truncate -s 6144M 6144m.txt
>
>
> Thanks a lot,
> Sergey
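
Until the fix lands, a hedged sketch of the chunked-write workaround (host name and paths are placeholders), keeping each write() call well below the 2 GB limit:

import pyarrow as pa

CHUNK = 64 * 1024 * 1024  # stay far below the int32 tSize limit

fs = pa.hdfs.connect(host="namenode", port=8020)
with fs.open("/user/tmp/6144m.txt", "wb") as out, open("/tmp/6144m.txt", "rb") as src:
    while True:
        block = src.read(CHUNK)
        if not block:
            break
        out.write(block)
fs.close()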


Re: [python] not an arrow file

2021-01-25 Thread Wes McKinney
hi Ken -- it looks like you aren't calling "writer->Close()" after
writing the last record batch. I think that will fix the issue

On Mon, Jan 25, 2021 at 12:48 PM Teh, Kenneth M.  wrote:
>
> Hi Wes,
>
> My C++ code is attached. I tried to also read it from C++ by opening the disk 
> file as a MemoryMappedFile and get the same error when I make a 
> RecordBatchReader on the mmap'ed file, ie, "not an Arrow file".
>
> There must be some magical sequence of writes needed to make the file kosher.
>
> Thanks for helping.
>
> Ken
>
> p.s. I read your blog about relocating to Nashville.  Was my stomping grounds 
> back in the 80s. Memories.
>
> 
> From: Wes McKinney 
> Sent: Sunday, January 24, 2021 11:41 AM
> To: user@arrow.apache.org 
> Subject: Re: [python] not an arrow file
>
> Can you show your C++ code?
>
> On Sun, Jan 24, 2021 at 8:10 AM Teh, Kenneth M.  wrote:
>
> Just started with arrow...
>
> I wrote a record batch to a file using ipc::MakeFileWriter to create a writer 
> and writer->WriteRecordBatch in a C++ program and tried to read it in python 
> with:
>
> [] import pyarrow as pa
> [] reader = pa.ipc.open_file("myfile")
>
>
> It raises the ArrowInvalid with the message "not an arrow file".
>
> If I write it out as a Table in feather format, I can read it in python. But 
> I want to write large files on the order of 100GB or more and then read them 
> back into python as pandas dataframes or something similar.
>
> So, I switched to using an ipc writer.
>
> Can something point me in the right direction?  Thanks.
>
> Ken
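
The same requirement holds in Python, shown here as a short sketch: the file footer is only written when the writer is closed, and open_file() needs that footer:

import pyarrow as pa

batch = pa.record_batch([pa.array([1, 2, 3])], names=["x"])

with pa.OSFile("myfile", "wb") as sink:
    writer = pa.ipc.new_file(sink, batch.schema)
    writer.write_batch(batch)
    writer.close()  # writes the footer; without it, open_file() raises "not an Arrow file"

reader = pa.ipc.open_file("myfile")
table = reader.read_all()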


Re: [C++]How to use BufferedInputStream in parquet::arrow::OpenFile()?

2021-01-24 Thread Wes McKinney
Note that the buffered stream option in parquet::ReaderProperties is
what you want here

https://github.com/apache/arrow/blob/master/cpp/src/parquet/properties.h#L74

On Fri, Jan 22, 2021 at 10:34 PM Micah Kornfield  wrote:
>
> InputStream and RandomAccessFile are inherently different types that support 
> different operations.  RandomAccessFile can emulate and InputStream but not 
> vice-versa.  Parquet files have a footer that requires reading first, so an 
> InputStream cannot be used.
>
> On Wed, Dec 16, 2020 at 6:30 PM annsshadow  wrote:
>>
>> Hi~all
>>
>> I try to use BufferedInputStream to reduce the overhead of some small read 
>> from the network.
>>
>> The pseudo codes are below:
>>
>>
>> ```
>>
>> //get the buffered input stream
>>
>> auto buffered_result = arrow::io::BufferedInputStream::Create()
>>
>> _buffered_infile = buffered_result.ValueOrDie();
>>
>>
>> // follow the example codes, I want to open a parquet file like that
>>
>> // but it meets a compiler error: could not convert from
>> 'std::shared_ptr<arrow::io::BufferedInputStream>' to
>> 'std::shared_ptr<arrow::io::RandomAccessFile>'
>>
>> PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(_buffered_infile, 
>> arrow::default_memory_pool(), &_reader_parquet));
>>
>>
>> //the declaration of OpenFile
>>
>> //Status OpenFile(std::shared_ptr<::arrow::io::RandomAccessFile> file, 
>> MemoryPool* pool,
>>
>> std::unique_ptr<FileReader>* reader)
>>
>> ```
>>
>>
>> How can I use it correctly?
>>
>> Thanks all~
>>
>>
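
For reference, the same knob is reachable from Python as well; a hedged sketch (file name is a placeholder), where the buffer_size argument maps onto the buffered-stream option in parquet::ReaderProperties:

import pyarrow.parquet as pq

# buffer_size > 0 turns on buffered reads of the column chunks
pf = pq.ParquetFile("example.parquet", buffer_size=64 * 1024)
table = pf.read()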


Re: [python] not an arrow file

2021-01-24 Thread Wes McKinney
Can you show your C++ code?

On Sun, Jan 24, 2021 at 8:10 AM Teh, Kenneth M.  wrote:

> Just started with arrow...
>
> I wrote a record batch to a file using ipc::MakeFileWriter to create a
> writer and writer->WriteRecordBatch in a C++ program and tried to read it
> in python with:
>
> [] import pyarrow as pa
> [] reader = pa.ipc.open_file("myfile")
>
>
> It raises the ArrowInvalid with the message "not an arrow file".
>
> If I write it out as a Table in feather format, I can read it in python.
> But I want to write large files on the order of 100GB or more and then read
> them back into python as pandas dataframes or something similar.
>
> So, I switched to using an ipc writer.
>
> Can something point me in the right direction?  Thanks.
>
> Ken
>


Re: How to make a parquet dataset from an input file through Random access

2021-01-22 Thread Wes McKinney
If you have used *StreamWriter to create a buffer that you're sending
to another process, if you want to get it back into a record batch or
table, you need to read it with
pyarrow.ipc.open_stream(...).read_all() and then you can concatenate
the resulting tables with pyarrow.concat_tables (or use
Table.from_batches if you have a sequence of record batches)

Hope this helps

On Thu, Jan 21, 2021 at 6:19 AM Jonathan MERCIER
 wrote:
>
> Same question, but simpler to understand.
>
> I am using pyarrow and working with pieces of data per process (multi-process
> as a workaround for the GIL limitation). What is the correct way to handle this task?
>
> 1. Each parallel process creates a list of records, stores them
> in a record batch and returns this batch.
>
> 2. Each parallel process creates an output buffer and a writer, creates a
> list of records, stores them in a record batch and writes this record
> batch to the stream writer. The process returns the corresponding buffer?
>
> With answer (1) I see how to merge all of those batches, but with
> solution (2) how do I merge all the buffers into one once each process has
> returned its buffer?
>
>
>
> Thanks
>
>
> --
> Jonathan MERCIER
>
> Researcher computational biology
>
> PhD, Jonathan MERCIER
>
> Centre National de Recherche en Génomique Humaine (CNRGH)
>
> Bioinformatics (LBI)
>
> 2, rue Gaston Crémieux
>
> 91057 Evry Cedex
>
> Tel :(33) 1 60 87 34 88
>
> Email :jonathan.merc...@cnrgh.fr 
>
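
A compact sketch of option (2) as described above: each worker returns an IPC stream buffer, and the parent turns each buffer back into a table and concatenates them (the worker payload here is made up):

import pyarrow as pa

def worker(values):
    # each parallel worker serializes its record batch into an IPC stream buffer
    batch = pa.record_batch([pa.array(values)], names=["value"])
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    return sink.getvalue()

# back in the parent process: buffer -> table, then concatenate
buffers = [worker([1, 2]), worker([3, 4, 5])]
tables = [pa.ipc.open_stream(buf).read_all() for buf in buffers]
merged = pa.concat_tables(tables)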


Re: Building PyArrow wheel for AWS Graviton2 (ARMv8) - Gist

2021-01-21 Thread Wes McKinney
hi Elad -- I think we'd be interested in having portable aarch64
wheels for Linux, see
https://issues.apache.org/jira/browse/ARROW-10349. Would you be able
to post this information on there to help with the process?

Thanks
Wes

On Thu, Jan 21, 2021 at 3:21 AM Elad Rosenheim  wrote:
>
> Hi,
>
> AWS are touting ~20% lower cost and (potentially) increased performance for 
> their own ARM-based line (cg6 instance family) compared to current-gen 
> instances (c5 family).
>
> Turns out that compiling a Python wheel for PyArrow on Amazon Linux 2 + 
> Graviton is pretty straightforward, here's a gist I've created: 
> https://gist.github.com/eladroz/b9437249a76de2b394d54e646d53ec5e
>
> My own (very non-comprehensive) test showed essentially similar performance 
> to current-gen Intel, but the price difference alone is interesting and in 
> big data workloads can be significant.
>
> Hope that helps anyone down the line; I'll probably try to run more 
> benchmarks.
>
> Elad Rosenheim


Re: compute::Take & ChunkedArrays

2021-01-17 Thread Wes McKinney
On Sun, Jan 17, 2021 at 8:59 AM Niranda Perera  wrote:
>
> Hi Wes,
>
> Thanks. On the top of my head, that was a similar algorithm I had in mind as 
> well.
> Is this the JIRA you were referring to? [1]
> I see that there are some improvements that have been done here [2].
>
> I guess bug reports like this [3] are also related to the same scenario.
>
> Is there anyone working on this?

If open Jira issues are not assigned to anyone you can assume that no
one is working on them.

>
> Best
>
> [1] https://issues.apache.org/jira/browse/ARROW-5454
> [2] https://github.com/apache/arrow/pull/8823
> [3] https://issues.apache.org/jira/browse/ARROW-10799
>
> On Fri, Jan 15, 2021 at 10:38 AM Wes McKinney  wrote:
>>
>> You can do that, but note that the implementation is currently not
>> efficient, see
>>
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_selection.cc#L1909
>>
>> Rather than pre-concatenating the chunks (which can easily fail) and
>> then invoking Take on the resulting concatenated Array, it would be
>> better to do a O(N log K) take on the chunks directly, where N is the
>> number of take indices and K is the number of chunks.
>>
>> For example, if you have chunks of size
>>
>> 10
>> 50
>> 100
>> 20
>>
>> then the algorithm computes the following offset table:
>>
>> 0
>> 10
>> 60
>> 160
>> 180
>>
>> Indices relative to the whole ChunkedArray are translated to (chunk
>> number, intrachunk index), for example:
>>
>> take with [5, 40, 100, 170] is translated by doing binary searches in
>> the offset table to:
>>
>> (chunk=0, relative_index=5)
>> (1, 30)
>> (2, 40)
>> (3, 10)
>>
>> Consecutive indices from the same chunk are batched together and then
>> Take is invoked on the respective chunk (with boundschecking disabled)
>> to select a chunk for the resulting output ChunkedArray.
>>
>> Might be helpful to copy this to the appropriate Jira (I'm sure there
>> is one already) to assist the person who implements this.
>>
>> Thanks,
>> Wes
>>
>> On Mon, Jan 11, 2021 at 10:01 AM Niranda Perera
>>  wrote:
>> >
>> > Hi all,
>> >
>> > I was wondering how the Take API works with ChunkedArrays?
>> > ex: If we have a ChunkedArray[100] with Array1[50] and Array2[50]
>> > so, if I want an element from each array, can I pass something like [10, 
>> > 60] as the indices?
>> >
>> > --
>> > Niranda Perera
>> > @n1r44
>> > +1 812 558 8884 / +94 71 554 8430
>> > https://www.linkedin.com/in/niranda
>
>
>
> --
> Niranda Perera
> @n1r44
> +1 812 558 8884 / +94 71 554 8430
> https://www.linkedin.com/in/niranda


Re: unsubscribe

2021-01-15 Thread Wes McKinney
You have to e-mail user-unsubscr...@arrow.apache.org

On Fri, Jan 15, 2021 at 11:09 AM Sisneros, Dominic E (FAA)
 wrote:
>
>
>
> Dominic Sisneros
> FAA, WSA Engineering Services, AJW-2W13B
> Office: 801-320-2377
> Cell: 801-558-1966
>
> -Original Message-
> From: Wes McKinney 
> Sent: Friday, January 15, 2021 8:38 AM
> To: user@arrow.apache.org
> Subject: Re: compute::Take & ChunkedArrays
>
> You can do that, but note that the implementation is currently not efficient, 
> see
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_selection.cc#L1909
>
> Rather than pre-concatenating the chunks (which can easily fail) and then 
> invoking Take on the resulting concatenated Array, it would be better to do a 
> O(N log K) take on the chunks directly, where N is the number of take indices 
> and K is the number of chunks.
>
> For example, if you have chunks of size
>
> 10
> 50
> 100
> 20
>
> then the algorithm computes the following offset table:
>
> 0
> 10
> 60
> 160
> 180
>
> Indices relative to the whole ChunkedArray are translated to (chunk number, 
> intrachunk index), for example:
>
> take with [5, 40, 100, 170] is translated by doing binary searches in the 
> offset table to:
>
> (chunk=0, relative_index=5)
> (1, 30)
> (2, 40)
> (3, 10)
>
> Consecutive indices from the same chunk are batched together and then Take is 
> invoked on the respective chunk (with boundschecking disabled) to select a 
> chunk for the resulting output ChunkedArray.
>
> Might be helpful to copy this to the appropriate Jira (I'm sure there is one 
> already) to assist the person who implements this.
>
> Thanks,
> Wes
>
> On Mon, Jan 11, 2021 at 10:01 AM Niranda Perera  
> wrote:
> >
> > Hi all,
> >
> > I was wondering how the Take API works with ChunkedArrays?
> > ex: If we have a ChunkedArray[100] with Array1[50] and Array2[50] so,
> > if I want an element from each array, can I pass something like [10, 60] as 
> > the indices?
> >
> > --
> > Niranda Perera
> > @n1r44
> > +1 812 558 8884 / +94 71 554 8430
> > https://www.linkedin.com/in/niranda


Re: compute::Take & ChunkedArrays

2021-01-15 Thread Wes McKinney
You can do that, but note that the implementation is currently not
efficient, see

https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_selection.cc#L1909

Rather than pre-concatenating the chunks (which can easily fail) and
then invoking Take on the resulting concatenated Array, it would be
better to do a O(N log K) take on the chunks directly, where N is the
number of take indices and K is the number of chunks.

For example, if you have chunks of size

10
50
100
20

then the algorithm computes the following offset table:

0
10
60
160
180

Indices relative to the whole ChunkedArray are translated to (chunk
number, intrachunk index), for example:

take with [5, 40, 100, 170] is translated by doing binary searches in
the offset table to:

(chunk=0, relative_index=5)
(1, 30)
(2, 40)
(3, 10)

Consecutive indices from the same chunk are batched together and then
Take is invoked on the respective chunk (with boundschecking disabled)
to select a chunk for the resulting output ChunkedArray.

Might be helpful to copy this to the appropriate Jira (I'm sure there
is one already) to assist the person who implements this.

Thanks,
Wes

On Mon, Jan 11, 2021 at 10:01 AM Niranda Perera
 wrote:
>
> Hi all,
>
> I was wondering how the Take API works with ChunkedArrays?
> ex: If we have a ChunkedArray[100] with Array1[50] and Array2[50]
> so, if I want an element from each array, can I pass something like [10, 60] 
> as the indices?
>
> --
> Niranda Perera
> @n1r44
> +1 812 558 8884 / +94 71 554 8430
> https://www.linkedin.com/in/niranda
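
A plain-Python sketch of that algorithm (not the eventual C++ implementation), mainly to make the offset-table and batching steps concrete:

import bisect
import numpy as np
import pyarrow as pa

def chunked_take(chunked, indices):
    # offset table: cumulative chunk lengths, e.g. [0, 10, 60, 160, 180]
    offsets = np.cumsum([0] + [len(c) for c in chunked.chunks])
    out, run_chunk, run = [], None, []
    for i in indices:
        # binary search translates a global index to (chunk number, intra-chunk index)
        chunk_no = int(bisect.bisect_right(offsets, i) - 1)
        if chunk_no != run_chunk and run:
            out.append(chunked.chunk(run_chunk).take(pa.array(run)))
            run = []
        run_chunk = chunk_no
        run.append(int(i - offsets[chunk_no]))
    if run:
        out.append(chunked.chunk(run_chunk).take(pa.array(run)))
    return pa.chunked_array(out)

Consecutive indices that land in the same chunk are batched into a single take call on that chunk, matching the description above.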


Re: unsubscribe

2021-01-14 Thread Wes McKinney
hi Sam -- you have to e-mail user-unsubscr...@arrow.apache.org

On Thu, Jan 14, 2021 at 1:35 AM Sam Wright  wrote:
>
>


Re: Plasma store implementation status across client libraries

2021-01-04 Thread Wes McKinney
hi Burke -- you have to e-mail user-unsubscr...@arrow.apache.org

On Mon, Jan 4, 2021 at 1:28 PM Burke Kaltenberger
 wrote:
>
> please remove me from your email list. Thank you
>
> On Mon, Jan 4, 2021 at 10:15 AM Neal Richardson  
> wrote:
>>
>> I believe Plasma only has Python bindings. FWIW it has not seen active 
>> development in quite a while.
>>
>> Neal
>>
>> On Mon, Jan 4, 2021 at 8:58 AM Chris Nuernberger  
>> wrote:
>>>
>>> Yes that makes sense.  I guess you also need something to broker shared 
>>> memory filenames/ids.  The database isn't in-memory, however, although I 
>>> know what you mean.  One huge advantage of mmap is you can have much larger 
>>> than memory storage act like in-memory storage; so the plasma store can be 
>>> roughly the size of your disk and larger than your RAM, but your program, unless
>>> it attempts to verbatim copy a column, wouldn't know any better.
>>>
>>> Numerical larger-than-memory-but-in-memory redis indeed; that is an 
>>> interesting way to think of it.
>>>
>>> On Mon, Jan 4, 2021 at 9:45 AM Thomas Browne  wrote:

 Interesting and agreed. I guess this is a big advantage of the "on the wire" 
 unserialised format - just read it in and it's already native. I'll go 
 this way possibly.

 However I also note the beginnings of more advanced functionality in the 
 Plasma store, for example, notification API on buffer seal (ie when 
 something changes, all clients can be notified).

 https://arrow.apache.org/docs/python/generated/pyarrow.plasma.PlasmaClient.html#pyarrow.plasma.PlasmaClient.subscribe

 I'm assuming the plasma store will add functionality over time, and if 
 this is the case, having all client libraries implement it means I can 
 almost have a redis-like column-store specialising in numerical 
 computation (which would be awesome), and for which i don't need to write 
 my own functionality for each client library.

 A numerical in-memory database, if you will.

 On 04/01/2021 15:55, Chris Nuernberger wrote:

 Julia, Python, and R all have some support for mmap operations.

 On Mon, Jan 4, 2021 at 8:55 AM Chris Nuernberger  
 wrote:
>
> Could simply saving the arrow file in streaming mode to shared memory and 
> then mmap-ing the result in each language solve your problem ?  Plasma 
> seems to me to be a layer on top of basic mmap operations; as long as you 
> have shared memory and mmap then you can have multiple processes talking 
> to the same logical block of memory.
>
> On Mon, Jan 4, 2021 at 8:27 AM Thomas Browne  wrote:
>>
>> I am hoping to use the Apache Arrow project for cross-language numerical
>> computation, and for that the shared-memory idea is very powerful. Am I
>> correct that the Plasma Store is the enabling technology for this,
>> especially for soft real-time computation (ie not moving to parquet or
>> any file-based sharing system)?
>>
>> Is that the case? And if so, then I'm wondering which client libraries,
>> other than Python (and I assume C[++]), implement the Plasma Store. This
>> table doesn't feature a row for Plasma:
>>
>> https://arrow.apache.org/docs/status.html
>>
>> and I can't seem to find any reference to the Plasma store in the Julia,
>> R, or Javascript libraries.
>>
>> https://arrow.apache.org/docs/r/
>>
>> https://arrow.apache.org/docs/js/
>>
>> https://arrow.juliadata.org/stable/
>>
>>
>> Thank you,
>>
>> Thomas
>>
>>
>
>
> --
> First Talent Search & Placement
> Burke Kaltenberger | Founder
> 408.458.0071


Re: Sharing a C++-level pointer between Python and R

2021-01-03 Thread Wes McKinney
I didn't see any Jira issues about this so I opened
https://issues.apache.org/jira/browse/ARROW-11120

On Sat, Jan 2, 2021 at 8:36 PM Wes McKinney  wrote:
>
> Hi Thomas — no worries. Plasma is for brokering shared memory between 
> processes. Here we are talking about passing data structures between the 
> Python and R interpreter running in the same process.
>
> On Sat, Jan 2, 2021 at 6:46 PM Thomas Browne  wrote:
>>
>> Apols new user here - I thought this kind of use case is what the plasma
>> store was for?
>>
>> On 03/01/2021 00:20, Wes McKinney wrote:
>> > We can go R to/from Python from an R perspective with reticulate
>> >
>> > https://arrow.apache.org/docs/r/articles/python.html
>> >
>> > There are unit tests attesting to this working:
>> >
>> > https://github.com/apache/arrow/blob/master/r/tests/testthat/test-python.R
>> >
>> > However, I don't know of anyone trying to get interop from a Python
>> > perspective with rpy2; there may be a small amount of plumbing needed
>> > to get it working, others may know more.
>> >
>> > On Sat, Jan 2, 2021 at 5:20 PM Laurent Gautier  wrote:
>> >> Hi,
>> >>
>> >> I am looking at sharing a pointer between Python and R. For example 
>> >> create an Arrow object with Python, perform initial filtering, and then 
>> >> pass a shared pointer to R through rpy2 (meaning that an R6 object is 
>> >> created from this pointer and R package arrow).
>> >>
>> >> I found in the source a file that suggest thats Python-to-R is either 
>> >> planned or may be even already functional: 
>> >> https://github.com/apache/arrow/blob/master/r/src/py-to-r.cpp
>> >> However, I did not find documentation about it.
>> >>
>> >> Would anyone here know more about this?
>> >>
>> >> Best,
>> >>
>> >> Laurent
>> >>
>> >>


Re: Sharing a C++-level pointer between Python and R

2021-01-02 Thread Wes McKinney
Hi Thomas — no worries. Plasma is for brokering shared memory between
processes. Here we are talking about passing data structures between the
Python and R interpreter running in the same process.

On Sat, Jan 2, 2021 at 6:46 PM Thomas Browne  wrote:

> Apols new user here - I thought this kind of use case is what the plasma
> store was for?
>
> On 03/01/2021 00:20, Wes McKinney wrote:
> > We can go R to/from Python from an R perspective with reticulate
> >
> > https://arrow.apache.org/docs/r/articles/python.html
> >
> > There are unit tests attesting to this working:
> >
> >
> https://github.com/apache/arrow/blob/master/r/tests/testthat/test-python.R
> >
> > However, I don't know of anyone trying to get interop from a Python
> > perspective with rpy2; there may be a small amount of plumbing needed
> > to get it working, others may know more.
> >
> > On Sat, Jan 2, 2021 at 5:20 PM Laurent Gautier 
> wrote:
> >> Hi,
> >>
> >> I am looking at sharing a pointer between Python and R. For example
> create an Arrow object with Python, perform initial filtering, and then
> pass a shared pointer to R through rpy2 (meaning that an R6 object is
> created from this pointer and R package arrow).
> >>
> >> I found in the source a file that suggest thats Python-to-R is either
> planned or may be even already functional:
> https://github.com/apache/arrow/blob/master/r/src/py-to-r.cpp
> >> However, I did not find documentation about it.
> >>
> >> Would anyone here know more about this?
> >>
> >> Best,
> >>
> >> Laurent
> >>
> >>
>


Re: Sharing a C++-level pointer between Python and R

2021-01-02 Thread Wes McKinney
We can go R to/from Python from an R perspective with reticulate

https://arrow.apache.org/docs/r/articles/python.html

There are unit tests attesting to this working:

https://github.com/apache/arrow/blob/master/r/tests/testthat/test-python.R

However, I don't know of anyone trying to get interop from a Python
perspective with rpy2; there may be a small amount of plumbing needed
to get it working, others may know more.

On Sat, Jan 2, 2021 at 5:20 PM Laurent Gautier  wrote:
>
> Hi,
>
> I am looking at sharing a pointer between Python and R. For example create an 
> Arrow object with Python, perform initial filtering, and then pass a shared 
> pointer to R through rpy2 (meaning that an R6 object is created from this 
> pointer and R package arrow).
>
> I found in the source a file that suggest thats Python-to-R is either planned 
> or may be even already functional: 
> https://github.com/apache/arrow/blob/master/r/src/py-to-r.cpp
> However, I did not find documentation about it.
>
> Would anyone here know more about this?
>
> Best,
>
> Laurent
>
>


Re: Optimising pandas relational ops with pyarrow

2021-01-01 Thread Wes McKinney
Note that many of us think it's important to have canonical
implementations of important algorithms (aggregate / hash aggregate,
joins, sorts, etc.) in the Apache project and available to e.g.
pyarrow users, as opposed to having to direct them to a third party
project. I've been unable to do this work myself given my other
responsibilities, but I will be continuing to direct funding /
engineering time from my organization toward these goals. I hope that
others from the community can join in to help out to make the work go
faster.

On Fri, Jan 1, 2021 at 5:36 PM Ivan Petrov  wrote:
>
> Hi, thanks for the pointers. We tried cylondata already. We found it hard to
> build, there is some lack of tests for Java, and it seems sort and filter are not supported
> yet...
> We are short on time, which is why we can’t afford to build our own CI/CD for
> cylondata...
> Project looks very promising and for now it’s a huge technical risk for us.
>
>
> On Sat, 2 Jan 2021 at 00:25, Vibhatha Abeykoon  wrote:
>>
>> Checkout https://cylondata.org/.
>>
>> We have also worked on this problem in both sequential and distributed 
>> execution mode. An early DataFrame API is also available.
>>
>> [1]. https://cylondata.org/docs/python
>> [2]. https://cylondata.org/docs/python_api_docs
>>
>>
>> On Fri, Jan 1, 2021 at 2:07 PM Chris Nuernberger  
>> wrote:
>>>
>>> Ivan,
>>>
>>> The Clojure dataset abstraction does not copy the data, uses mmap, and is 
>>> generally extremely fast for aggregate group-by operations. Just FYI.
>>>
>>>
>>> On Fri, Jan 1, 2021 at 10:24 AM Ivan Petrov  wrote:

 Hi!
 I plan to:
 -  join
 - group by
 - filter
 data using pyarrow (new to it). The idea is to get better performance and 
 memory utilisation ( apache arrow columnar compression) compared to pandas.
 Seems like pyarrow has no support for joining two Tables / Datasets by key,
 so I have to fall back to pandas.
 I don’t really follow how pyarrow <-> pandas integration works. Will 
 pandas rely on apache arrow data structure? I’m fine with using only these 
 flat types for columns to avoid "corner cases"
 - string
 - int
 - long
 - decimal

 I have a feeling that pandas will copy all data from apache arrow and 
 double the size (according to the doc). Did I get it right?
 What is the right way to join, groupBy and filter several "Tables" / 
 "Datasets" utilizing pyarrow (underlying apache arrow) power?

 Thank you!
>>
>> --
>> Vibhatha Abeykoon


Re: Bug Report:grpc tls assertion failed when using pyarrow(2.0.0) flight client get data

2020-11-20 Thread Wes McKinney
Can you please also open a Jira issue?

On Fri, Nov 20, 2020 at 1:36 AM 梁彬彬  wrote:
>
> Hello,
> When I using pyarrow flight client get remote data, I encounter the 
> "assertion failed" problem:
>
> It seems the grpc bug. And it is similar with the previous problem: 
> https://issues.apache.org/jira/browse/ARROW-7689 .
> And the record of grpc is : https://github.com/grpc/grpc/issues/20311
>
> In my project, the version of pyarrow is 2.0.0, and the version of grpcio is 
> 1.33.2. Both are the latest version.
>
>
> When I change the version of pyarrow from 2.0.0 to 0.17.1, the problem 
> disappeared.
>
> I am confused why the latest version of pyarrow will cause the bug? And the 
> old version will not?
> Is this a bug?
>
> Thanks a lot.


Re: Modifying Arrow record batch

2020-11-17 Thread Wes McKinney
There are certain narrow cases where in-place modification of an IPC
message may be possible, but in general new messages would have to be
produced

On Tue, Nov 17, 2020 at 3:06 AM Saloni Udani  wrote:
>
> Hello,
> We have a use case where we use Arrow format for data transfer between 
> different data enrichment processors. So one processor prepares an Arrow 
> batch which is read and enriched by the other. Now as Arrays are immutable in 
> Arrow what is the suggested way to modify an Arrow record batch (adding new 
> column/adding new values to existing column/updating value of an existing 
> column)? Is there a way to do this with an existing Arrow record batch or it 
> has to be rewritten from scratch every time?
>
>
> Regards
> Saloni Udani
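
In practice "modifying" therefore means assembling a new batch from the old columns plus the changed or added ones; a minimal pyarrow sketch (column names are made up), where the existing columns are referenced rather than copied:

import pyarrow as pa

batch = pa.record_batch([pa.array([1, 2, 3])], names=["a"])

# enrichment step: reuse the existing column arrays, append a new column
enriched = pa.record_batch(batch.columns + [pa.array(["x", "y", "z"])],
                           names=batch.schema.names + ["b"])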


Re: pyarrow.Table equivalent to pandas.DataFrame.itertuples ?

2020-11-15 Thread Wes McKinney
We haven't implemented anything like this, but it wouldn't be a
stretch to implement an efficient "row accessor" class in Cython to do
this. Feel free to open some Jira issues about it

On Sun, Nov 15, 2020 at 4:45 PM Luke  wrote:
>
> I am reading in a parquet file and I want to loop over each of the rows and 
> right now I am converting to a pandas dataframe and then using 
> pandas.DataFrame.itertuples to access each row. Is there a way to do this all 
> in pyarrow  and not convert to pandas?  Just looking at ways to optimize.
>
> One thing I have tried is to convert the pyarrow table to a list of 
> dictionaries that are then looped over, but that is much slower in my case 
> than the conversion to pandas and using itertuples.  I was surprised it was 
> slower.
>
> What I mean by that is given pyarrow table t (read from 
> pyarrow.parquet.read_table)
>
> new_list = []
> for i in range(t.num_rows):
> new_dict = dict()
> for k in t.column_names:
> new_dict[k] = t[k][i].as_py()
> new_list.append(new_dict)
>
> now loop over new_list.
>
> The method pyarrow.Table.to_pydict() would be helpful to make the above more 
> concise but I need it oriented like pandas.DataFrame.to_dict('records').
>
> I get this might not be implemented yet, just asking in case I am missing how 
> to do this natively in arrow.
>
> Thanks,
> Luke
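
Until such a row accessor exists, a sketch that usually beats the cell-by-cell .as_py() loop is to pull each column down to Python once and zip the results into row tuples (the file name is a placeholder):

import pyarrow.parquet as pq

table = pq.read_table("example.parquet")

# one to_pylist() call per column, then zip the columns into row tuples
columns = [col.to_pylist() for col in table.columns]
for row in zip(*columns):
    pass  # row is a tuple in table.column_names order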


Re: [Flight] TiDB Flight connector

2020-11-15 Thread Wes McKinney
We are actively discussing "FlightSQL" to provide a standardized
Flight-based interface to a SQL database

https://lists.apache.org/thread.html/rc4717b78f09bbf7a69347b6c126849e17323c491338fc73457cf7558%40%3Cdev.arrow.apache.org%3E

I think it would make sense for you to use this rather than building
something custom -- that way once FlightSQL ships to production,
people will be able to access TiDB out of the box with a FlightSQL
client (from one of the Arrow open source packages)

On Sat, Nov 14, 2020 at 8:41 AM Michiel De Backker  wrote:
>
> Hi all,
>
> We're exploring adding an Arrow Flight connector to TiDB at 
> pingcap/tidb#21056. The idea is that it may live alongside the existing MYSQL 
> connector. It's quite early but I already wanted to reach out.
> One point I'd like some clarification on: Since TiDB is a SQL database, I'm 
> wondering if there are already standards for sending queries over the Flight 
> protocol?
>
> With kind regards,
> Backkem


Re: Tabular ID query (subframe selection based on an integer ID)

2020-11-12 Thread Wes McKinney
In my setup here I did:

import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc
import numpy as np

num_rows = 10_000_000
data = np.random.randn(num_rows)

df = pd.DataFrame({'data{}'.format(i): data
   for i in range(100)})

df['key'] = np.random.randint(0, 100, size=num_rows)

rb = pa.record_batch(df)
t = pa.table(df)

I found that the performance of filtering a record batch is very similar:

In [22]: timeit df[df.key == 5]
71.3 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [24]: %timeit rb.filter(pc.equal(rb[-1], 5))
75.8 ms ± 2.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Whereas the performance of filtering a table is absolutely abysmal (no
idea what's going on here)

In [23]: %timeit t.filter(pc.equal(t[-1], 5))
961 ms ± 3.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

A few obvious notes:

* Evidently, these code paths haven't been greatly optimized, so
someone ought to take a look at this
* Everything here is single-threaded in Arrow-land. The end-goal for
all of this is to parallelize everything (predicate evaluation,
filtering) on the CPU thread pool

On Wed, Nov 11, 2020 at 4:27 PM Vibhatha Abeykoon  wrote:
>
> Adding to the performance scenario, I also implemented some operators on top 
> of the Arrow compute API.
> I also observed similar performance when compared to Numpy and Pandas.
>
> But underneath Pandas what I observed was the usage of numpy ops,
>
> (https://github.com/pandas-dev/pandas/blob/44406a65848a820a3708eda092044796e8c11cb5/pandas/core/ops/array_ops.py#L195,
> https://github.com/pandas-dev/pandas/blob/44406a65848a820a3708eda092044796e8c11cb5/pandas/core/series.py#L4999)
>
> @Wes
>
> So this would mean that Pandas may have similar performance to Numpy in 
> filtering cases. Is this a correct assumption?
>
> But the filter compute function itself was very fast. Most time is spent on 
> creating the mask when there are multiple columns.
> For about 10M records I observed 1.5 ratio of execution time between 
> Arrow-compute based filtering method vs Pandas.
>
> The performance gap is it due to vectorization or some other factor?
>
>
> With Regards,
> Vibhatha Abeykoon
>
>
> On Wed, Nov 11, 2020 at 2:36 PM Jason Sachs  wrote:
>>
>> Ugh, let me reformat that since the PonyMail browser interface thinks ">>>" 
>> is a triply quoted message.
>>
>> <<< t = pa.Table.from_pandas(df0)
>> <<< t
>> pyarrow.Table
>> timestamp: int64
>> index: int32
>> value: int64
>> <<< import pyarrow.compute as pc
>> <<< def select_by_index(table, ival):
>>  value_index = table.column('index')
>>  index_type = value_index.type.to_pandas_dtype()
>>  mask = pc.equal(value_index, index_type(ival))
>>  return table.filter(mask)
>> <<< %timeit t2 = select_by_index(t, 515)
>> 2.58 ms ± 31.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>> <<< %timeit t2 = select_by_index(t, 3)
>> 8.6 ms ± 91.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>> <<< %timeit df0[df0['index'] == 515]
>> 1.59 ms ± 5.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>> <<< %timeit df0[df0['index'] == 3]
>> 10 ms ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>> <<< print("ALL:%d, 3:%d, 515:%d" % (len(df0),
>>  np.count_nonzero(df0['index'] == 3),
>>  np.count_nonzero(df0['index'] == 515)))
>> ALL:1225000, 3:20, 515:195
>> <<< df0.memory_usage()
>> Index128
>> timestamp980
>> index490
>> value980
>> dtype: int64
>>


Re: Tabular ID query (subframe selection based on an integer ID)

2020-11-11 Thread Wes McKinney
You should be able to use the kernels available in pyarrow.compute to
do this -- there might be a few that are missing, but if you can't
find what you need please open a Jira issue so it goes into the
backlog
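
Roughly, assuming the file has already been loaded into a pyarrow Table named
`table` (a minimal sketch):

```python
import pyarrow.compute as pc

mask = pc.equal(table.column('ID'), k)
subframe = table.filter(mask)   # call subframe.to_pandas() if you need a DataFrame
```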

On Wed, Nov 11, 2020 at 11:43 AM Jason Sachs  wrote:
>
> I do a lot of the following operation:
>
> subframe = df[df['ID'] == k]
>
> where df is a Pandas DataFrame with a small number of columns but a 
> moderately large number of rows (say 200K - 5M). The columns are usually 
> simple... for example's sake let's call them int64 TIMESTAMP, uint32 ID, 
> int64 VALUE.
>
> I am moving the source data to Parquet format. I don't really care whether I 
> do this in PyArrow or Pandas, but I need to perform these subframe selections 
> frequently and would like to speed them up. (The idea being, load the data 
> into memory once, and then expect to perform subframe selection anywhere from 
> 10 - 1000 times to extract appropriate data for further processing.)
>
> Is there a suggested method? Any ideas?
>
> I've tried
>
> subframe = df.query('ID == %d' % k)
>
> and flirted with the idea of using Gandiva as per 
> https://blog.christianperone.com/2020/01/gandiva-using-llvm-and-arrow-to-jit-and-evaluate-pandas-expressions/
>  but it looks a bit rough + I had to manually tweak the types of literal 
> constants to support something other than a float64.


Re: Arrow C++ API - memory management

2020-11-09 Thread Wes McKinney
The memory should automatically be freed by any object / shared_ptr /
unique_ptr destruction. On Linux we use a background jemalloc thread
by default so it may not be freed immediately but it should not be
held indefinitely. In any case if you can reproduce the issue
consistently we'd be glad to take a look, please open a Jira issue and
provide as much information as you can to make it easy for us to
reproduce

On Mon, Nov 9, 2020 at 9:41 AM Maciej Skrzypkowski
 wrote:
>
> OK, thanks for the answer.
>
> mArrowTable is "std::shared_ptr<arrow::Table> mArrowTable" so should be 
> managed properly by the shared pointer. I've narrowed down the problem to 
> code like this:
>
> void LoadCSVData::ReadArrowTableFromCSV( const std::string & filePath )
> {
> auto tableReader = CreateTableReader( filePath );
> //ReadArrowTableUsingReader( *tableReader );
> }
>
> std::shared_ptr<arrow::csv::TableReader> LoadCSVData::CreateTableReader( 
> const std::string & filePath )
> {
> arrow::MemoryPool* pool = arrow::default_memory_pool();
> auto tableReader = arrow::csv::TableReader::Make( pool, OpenCSVFile( 
> filePath ),
>   *PrepareReadOptions(), 
> *PrepareParseOptions(), *PrepareConvertOptions() );
> if ( !tableReader.ok() )
> {
> throw BadParametersException( std::string( "CSV file reader error: " 
> ) + tableReader.status().ToString() );
> }
> return *tableReader;
> }
>
> Still memory is getting filled while calling ReadArrowTableFromCSV many 
> times. Is the arrow's memory pool freed while destruction of TableReader? Or 
> should I free it explicitly?
>
>
> On 09.11.2020 15:01, Wes McKinney wrote:
>
> We'd prefer to answer questions on the mailing list or Jira (if
> something looks like a bug).
>
> There isn't enough detail on the SO question to understand what other
> things might be going on, but you are never destroying
> this->mArrowTable which is holding on to allocated memory. If the
> memory use keeps going up through repeated calls to the CSV reader
> that sounds like a possible leak, so we would need to see more
> details, including about your platform.
>
> On Mon, Nov 9, 2020 at 2:33 AM Maciej Skrzypkowski
>  wrote:
>
> Hi All!
>
> I don't understand memory management in C++ Arrow API. I have some
> memory leaks while using it. I've created Stackoverflow question, maybe
> someone would answer it:
> https://stackoverflow.com/questions/64742588/how-to-manage-memory-while-reading-csv-using-apache-arrow-c-api
> .
>
> Thanks,
> Maciej Skrzypkowski
>


Re: Arrow C++ API - memory management

2020-11-09 Thread Wes McKinney
We'd prefer to answer questions on the mailing list or Jira (if
something looks like a bug).

There isn't enough detail on the SO question to understand what other
things might be going on, but you are never destroying
this->mArrowTable which is holding on to allocated memory. If the
memory use keeps going up through repeated calls to the CSV reader
that sounds like a possible leak, so we would need to see more
details, including about your platform.

On Mon, Nov 9, 2020 at 2:33 AM Maciej Skrzypkowski
 wrote:
>
> Hi All!
>
> I don't understand memory management in C++ Arrow API. I have some
> memory leaks while using it. I've created Stackoverflow question, maybe
> someone would answer it:
> https://stackoverflow.com/questions/64742588/how-to-manage-memory-while-reading-csv-using-apache-arrow-c-api
> .
>
> Thanks,
> Maciej Skrzypkowski
>


Re: bug? pyarrow.Table.from_pydict does not handle binary type correctly with embedded 00 bytes?

2020-11-04 Thread Wes McKinney
Seems a bit buggy, can you open a Jira issue? Thanks
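
In the meantime, a possible workaround (assuming the truncation is happening in
the NumPy fixed-width bytes conversion path) is to go through Python bytes
objects with an explicit binary type, for example:

```python
import numpy as np
import pyarrow as pa

data = np.array([b'Foo!!', b'\x00Baz!', b'half\x00baked'], dtype='|S13')

# Python bytes objects are length-delimited, so embedded 0x00 bytes survive
arr = pa.array(data.tolist(), type=pa.binary())
t = pa.Table.from_pydict({'data': arr})
```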

On Wed, Nov 4, 2020 at 5:05 PM Jason Sachs  wrote:
>
> It looks like pyarrow.Table.from_pydict() cuts off binary data after an 
> embedded 00 byte. Is this a known bug?
>
> (py3) C:\>python
> Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: 
> Anaconda, Inc. on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pyarrow as pa
> >>>
> >>> data = np.array([b'', b'', b'', b'Foo!!', b'Bar!!',
> ..b'\x00Baz!', b'half\x00baked', b''], dtype='|S13')
> >>> t = pa.Table.from_pydict({'data':data})
> >>> t.to_pandas()
>data
> 0   b''
> 1   b''
> 2   b''
> 3  b'Foo!!'
> 4  b'Bar!!'
> 5   b''
> 6   b'half'
> 7   b''
> >>> import pandas as pd
> >>> pd.DataFrame(data)
>   0
> 0   b''
> 1   b''
> 2   b''
> 3  b'Foo!!'
> 4  b'Bar!!'
> 5   b'\x00Baz!'
> 6  b'half\x00baked'
> 7   b''
> >>>


Re: Compressing parquet metadata?

2020-11-04 Thread Wes McKinney
 You mean the key-value metadata at the schema/field-level? That can
be binary (it gets base64-encoded when written to Parquet)
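
The manual approach could look roughly like this (a sketch; `table` and
`app_meta` are placeholders for your table and your large JSON-able dict):

```python
import json
import zlib

import pyarrow.parquet as pq

packed = zlib.compress(json.dumps(app_meta).encode('utf-8'))
# note: replace_schema_metadata drops existing metadata; merge first if needed
table = table.replace_schema_metadata({b'app_meta_zlib': packed})
pq.write_table(table, 'out.parquet')

# reading it back
md = pq.read_schema('out.parquet').metadata
app_meta = json.loads(zlib.decompress(md[b'app_meta_zlib']))
```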

On Wed, Nov 4, 2020 at 10:22 AM Jason Sachs  wrote:
>
> OK. If I take the manual approach, do parquet / arrow care whether metadata 
> is binary or not?
>
> On 2020/11/04 14:16:37, Wes McKinney  wrote:
> > There is not to my knowledge.
> >
> > On Tue, Nov 3, 2020 at 5:55 PM Jason Sachs  wrote:
> > >
> > > Is there any built-in method to compress parquet metadata? From what I 
> > > can tell, the main table columns are compressed, but not the metadata.
> > >
> > > I have metadata which includes 100-200KB of text (JSON format) that is 
> > > easily compressible... is there any alternative to doing it myself?
> >


Re: Compressing parquet metadata?

2020-11-04 Thread Wes McKinney
There is not to my knowledge.

On Tue, Nov 3, 2020 at 5:55 PM Jason Sachs  wrote:
>
> Is there any built-in method to compress parquet metadata? From what I can 
> tell, the main table columns are compressed, but not the metadata.
>
> I have metadata which includes 100-200KB of text (JSON format) that is easily 
> compressible... is there any alternative to doing it myself?


Re: Pyarrow join/ merge capabilities

2020-11-02 Thread Wes McKinney
Right, it doesn't, but this work is in scope for the Apache project
and I expect that we will eventually have this functionality within
pyarrow.

On Mon, Nov 2, 2020 at 8:12 AM Niranda Perera  wrote:
>
> Hi,
>
> Arrow itself doesn't provide these capabilities AFAIK. But there are other 
> frameworks that do joins/ merges on arrow data structures such as BlazingSQL, 
> Cylon etc.
>
> Best
>
> On Sun, Nov 1, 2020 at 12:23 PM Harshit Gupta  wrote:
>>
>> Does pyarrow Allow for join or merge capabilities on the columns or indexes?
>
>
>
> --
> Niranda Perera
> @n1r44
> +1 812 558 8884 / +94 71 554 8430
> https://www.linkedin.com/in/niranda


Re: pyarrow C++ API

2020-10-28 Thread Wes McKinney
Good to hear. We have an RAII-type helper that we use in C++ to make
it easier to acquire and release the GIL in functions that need it

https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/common.h#L74

On Wed, Oct 28, 2020 at 2:08 PM James Thomas  wrote:
>
> Thanks, Wes. Wrapping my C++ function with
>
> PyGILState_STATE state = PyGILState_Ensure();
> ...
> PyGILState_Release(state);
>
> fixed the issue.
>
> On Wed, Oct 28, 2020 at 6:21 AM Wes McKinney  wrote:
>>
>> I haven't tried myself, but my guess is that your C++
>> function does not acquire the GIL. When ctypes invokes a native
>> function, it releases the GIL.
>>
>> On Wed, Oct 28, 2020 at 4:29 AM James Thomas  
>> wrote:
>> >
>> > Hi,
>> >
>> > I am trying to run the following simple example after pip installing 
>> > pandas and pyarrow:
>> >
>> > ---cube.cpp---
>> > #include 
>> > #include 
>> > #include 
>> >
>> > extern "C" void print_is_array(PyObject *);
>> >
>> > void print_is_array(PyObject *obj) {
>> >   arrow::py::import_pyarrow();
>> >   printf("is_array: %d\n", arrow::py::is_array(obj));
>> > }
>> >
>> > ---cube.py---
>> > import ctypes
>> > import pandas as pd
>> > import pyarrow as pa
>> >
>> > c_lib = ctypes.CDLL("./libcube.so")
>> > df = pd.DataFrame({"a": [1, 2, 3]})
>> > table = pa.Table.from_pandas(df)
>> > c_lib.print_is_array(ctypes.py_object(table))
>> >
>> > ---build.sh---
>> > #!/bin/bash
>> > python3 -c 'import pyarrow; pyarrow.create_library_symlinks()'
>> > INC=$(python3 -c 'import pyarrow; print(pyarrow.get_include())')
>> > LIB=$(python3 -c 'import pyarrow; print(pyarrow.get_library_dirs()[0])')
>> > g++ -I$INC -I/usr/include/python3.6m -fPIC cube.cpp -shared -o libcube.so 
>> > -L$LIB -larrow -larrow_python
>> >
>> > When I run build.sh and then do python3 cube.py, I am seeing a segfault at 
>> > the import_pyarrow() statement in cube.cpp. Am I doing something wrong 
>> > here?
>> >
>> > Thanks,
>> > James


Re: pyarrow C++ API

2020-10-28 Thread Wes McKinney
I haven't tried myself, but my guess is that your C++
function does not acquire the GIL. When ctypes invokes a native
function, it releases the GIL.

On Wed, Oct 28, 2020 at 4:29 AM James Thomas  wrote:
>
> Hi,
>
> I am trying to run the following simple example after pip installing pandas 
> and pyarrow:
>
> ---cube.cpp---
> #include 
> #include 
> #include 
>
> extern "C" void print_is_array(PyObject *);
>
> void print_is_array(PyObject *obj) {
>   arrow::py::import_pyarrow();
>   printf("is_array: %d\n", arrow::py::is_array(obj));
> }
>
> ---cube.py---
> import ctypes
> import pandas as pd
> import pyarrow as pa
>
> c_lib = ctypes.CDLL("./libcube.so")
> df = pd.DataFrame({"a": [1, 2, 3]})
> table = pa.Table.from_pandas(df)
> c_lib.print_is_array(ctypes.py_object(table))
>
> ---build.sh---
> #!/bin/bash
> python3 -c 'import pyarrow; pyarrow.create_library_symlinks()'
> INC=$(python3 -c 'import pyarrow; print(pyarrow.get_include())')
> LIB=$(python3 -c 'import pyarrow; print(pyarrow.get_library_dirs()[0])')
> g++ -I$INC -I/usr/include/python3.6m -fPIC cube.cpp -shared -o libcube.so 
> -L$LIB -larrow -larrow_python
>
> When I run build.sh and then do python3 cube.py, I am seeing a segfault at 
> the import_pyarrow() statement in cube.cpp. Am I doing something wrong here?
>
> Thanks,
> James


Re: Does Arrow Support Larger-than-Memory Handling?

2020-10-22 Thread Wes McKinney
Sure, anything is possible if you want to write the code to do it. You
could create a CompressedRecordBatch class where you only decompress a
field/column when you need it.

On Thu, Oct 22, 2020 at 4:05 PM Daniel Nugent  wrote:
>
> The biggest problem with mapped arrow data is that it's only possible with 
> uncompressed Feather files. Is there ever a possibility that compressed files 
> could be mappable (I know that you'd have to decompress a given RecordBatch 
> to actually work with it, but Feather files should be comprised of many 
> RecordBatches, right?)
>
> -Dan Nugent
>
>
> On Thu, Oct 22, 2020 at 4:49 PM Wes McKinney  wrote:
>>
>> I'm not sure where the conflict in what's written online is, but by
>> virtue of being designed such that data structures do not require
>> memory buffers to be RAM resident (i.e. can reference memory maps), we
>> are set up well to process larger-than-memory datasets. In C++ at
>> least we are putting the pieces in place to be able to do efficient
>> query execution on on-disk datasets, and it may already be possible in
>> Rust with DataFusion.
>>
>> On Thu, Oct 22, 2020 at 2:11 PM Chris Nuernberger  
>> wrote:
>> >
>> > There are ways to handle datasets larger than memory.  mmap'ing one or 
>> > more arrow files and going from there is a pathway forward here:
>> >
>> > https://techascent.com/blog/memory-mapping-arrow.html
>> >
>> > How this maps to other software ecosystems I don't know but many have mmap 
>> > support.
>> >
>> > On Thu, Oct 22, 2020 at 12:47 PM Jacek Pliszka  
>> > wrote:
>> >>
>> >> I believe it would be good if you define your use case.
>> >>
>> >> I do handle larger than memory datasets with pyarrow with the use of
>> >> dataset.scan but my use case is very specific as I am repartitioning
>> >> and cleaning a bit large datasets.
>> >>
>> >> BR,
>> >>
>> >> Jacek
>> >>
>> >> On Thu, Oct 22, 2020 at 20:39 Jacob Zelko  wrote:
>> >> >
>> >> > Hi all,
>> >> >
>> >> > Very basic question as I have seen conflicting sources. I come from the 
>> >> > Julia community and was wondering if Arrow can handle 
>> >> > larger-than-memory datasets? I saw this post by Wes McKinney here 
>> >> > discussing that the tooling is being laid down:
>> >> >
>> >> > Table columns in Arrow C++ can be chunked, so that appending to a table 
>> >> > is a zero copy operation, requiring no non-trivial computation or 
>> >> > memory allocation. By designing up front for streaming, chunked tables, 
>> >> > appending to existing in-memory tabler is computationally inexpensive 
>> >> > relative to pandas now. Designing for chunked or streaming data is also 
>> >> > essential for implementing out-of-core algorithms, so we are also 
>> >> > laying the foundation for processing larger-than-memory datasets.
>> >> >
>> >> > ~ Apache Arrow and the “10 Things I Hate About pandas”
>> >> >
>> >> > And then in the docs I saw this:
>> >> >
>> >> > The pyarrow.dataset module provides functionality to efficiently work 
>> >> > with tabular, potentially larger than memory and multi-file datasets:
>> >> >
>> >> > A unified interface for different sources: supporting different sources 
>> >> > and file formats (Parquet, Feather files) and different file systems 
>> >> > (local, cloud).
>> >> > Discovery of sources (crawling directories, handle directory-based 
>> >> > partitioned datasets, basic schema normalization, ..)
>> >> > Optimized reading with predicate pushdown (filtering rows), projection 
>> >> > (selecting columns), parallel reading or fine-grained managing of tasks.
>> >> >
>> >> > Currently, only Parquet and Feather / Arrow IPC files are supported. 
>> >> > The goal is to expand this in the future to other file formats and data 
>> >> > sources (e.g. database connections).
>> >> >
>> >> > ~ Tabular Datasets
>> >> >
>> >> > The article from Wes was from 2017 and the snippet on Tabular Datasets 
>> >> > is from the current documentation for pyarrow.
>> >> >
>> >> > Could anyone answer this question or at least clear up my confusion for 
>> >> > me? Thank you!
>> >> >
>> >> > --
>> >> > Jacob Zelko
>> >> > Georgia Institute of Technology - Biomedical Engineering B.S. '20
>> >> > Corning Community College - Engineering Science A.S. '17
>> >> > Cell Number: (607) 846-8947


Re: Does Arrow Support Larger-than-Memory Handling?

2020-10-22 Thread Wes McKinney
I'm not sure where the conflict in what's written online is, but by
virtue of being designed such that data structures do not require
memory buffers to be RAM resident (i.e. can reference memory maps), we
are set up well to process larger-than-memory datasets. In C++ at
least we are putting the pieces in place to be able to do efficient
query execution on on-disk datasets, and it may already be possible in
Rust with DataFusion.

On Thu, Oct 22, 2020 at 2:11 PM Chris Nuernberger  wrote:
>
> There are ways to handle datasets larger than memory.  mmap'ing one or more 
> arrow files and going from there is a pathway forward here:
>
> https://techascent.com/blog/memory-mapping-arrow.html
>
> How this maps to other software ecosystems I don't know but many have mmap 
> support.
>
> On Thu, Oct 22, 2020 at 12:47 PM Jacek Pliszka  
> wrote:
>>
>> I believe it would be good if you define your use case.
>>
>> I do handle larger than memory datasets with pyarrow with the use of
>> dataset.scan but my use case is very specific as I am repartitioning
>> and cleaning a bit large datasets.
>>
>> BR,
>>
>> Jacek
>>
>> On Thu, Oct 22, 2020 at 20:39 Jacob Zelko  wrote:
>> >
>> > Hi all,
>> >
>> > Very basic question as I have seen conflicting sources. I come from the 
>> > Julia community and was wondering if Arrow can handle larger-than-memory 
>> > datasets? I saw this post by Wes McKinney here discussing that the tooling 
>> > is being laid down:
>> >
>> > Table columns in Arrow C++ can be chunked, so that appending to a table is 
>> > a zero copy operation, requiring no non-trivial computation or memory 
>> > allocation. By designing up front for streaming, chunked tables, appending 
>> > to existing in-memory tabler is computationally inexpensive relative to 
>> > pandas now. Designing for chunked or streaming data is also essential for 
>> > implementing out-of-core algorithms, so we are also laying the foundation 
>> > for processing larger-than-memory datasets.
>> >
>> > ~ Apache Arrow and the “10 Things I Hate About pandas”
>> >
>> > And then in the docs I saw this:
>> >
>> > The pyarrow.dataset module provides functionality to efficiently work with 
>> > tabular, potentially larger than memory and multi-file datasets:
>> >
>> > A unified interface for different sources: supporting different sources 
>> > and file formats (Parquet, Feather files) and different file systems 
>> > (local, cloud).
>> > Discovery of sources (crawling directories, handle directory-based 
>> > partitioned datasets, basic schema normalization, ..)
>> > Optimized reading with predicate pushdown (filtering rows), projection 
>> > (selecting columns), parallel reading or fine-grained managing of tasks.
>> >
>> > Currently, only Parquet and Feather / Arrow IPC files are supported. The 
>> > goal is to expand this in the future to other file formats and data 
>> > sources (e.g. database connections).
>> >
>> > ~ Tabular Datasets
>> >
>> > The article from Wes was from 2017 and the snippet on Tabular Datasets is 
>> > from the current documentation for pyarrow.
>> >
>> > Could anyone answer this question or at least clear up my confusion for 
>> > me? Thank you!
>> >
>> > --
>> > Jacob Zelko
>> > Georgia Institute of Technology - Biomedical Engineering B.S. '20
>> > Corning Community College - Engineering Science A.S. '17
>> > Cell Number: (607) 846-8947


Re: Is there a `write_record_batch` method corresponding to `pa.ipc.read_record_batch`?

2020-10-22 Thread Wes McKinney
Use RecordBatch.serialize to do this.
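
Roughly (a minimal sketch):

```python
import pyarrow as pa

batch = pa.record_batch([pa.array([1, 2, 3])], names=['int_value'])

buf = batch.serialize()   # writes only the record batch message, no schema

# the reader supplies the schema it already knows
restored = pa.ipc.read_record_batch(buf, batch.schema)
```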

On Wed, Oct 21, 2020 at 11:18 PM Micah Kornfield 
wrote:

> Hi Shawn,
> This method exists in the C++ implementation [1], so it is likely
> reasonable to expose some form in python (I couldn't find it either in
> pyarrow).  This should be reasonable straight-forward (you could follow the
> path taken for read implementation) if it is something you wanted to
> contribute.
>
> Thanks,
> -Micah
>
> [1]
> https://github.com/apache/arrow/blob/3694794bdfd0677b95b8c95681e392512f1c9237/cpp/src/arrow/ipc/writer.h#L169
>
> On Mon, Oct 12, 2020 at 12:38 AM Shawn Yang 
> wrote:
>
>> I want to write a record batch as ipc message separately without writing
>> a schema. In my case, the schema is known to peers ahead of time. I noticed
>> arrow java already has this method
>> `org.apache.arrow.vector.ipc.message.MessageSerializer#serialize(org.apache.arrow.vector.ipc.WriteChannel,
>> org.apache.arrow.vector.ipc.message.ArrowRecordBatch)`
>>
>


Re: Arrow C Data Interface

2020-10-19 Thread Wes McKinney
hi Pasha,

Copying dev@.

You can see how DuckDB interacts with the pyarrow data structures by
the C interface here, maybe it's helpful

https://github.com/cwida/duckdb/blob/master/tools/pythonpkg/duckdb_python.cpp

We haven't defined a Python API (either C API level or Python API
level) so that objects can advertise that they support the Arrow C
interface -- it's a separate issue from the C interface itself (which
doesn't have anything specifically to do with Python), and I agree it
would probably be a good idea to have a standard way that we codify
and document.
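
For what it's worth, today this is done through pyarrow's private
`_export_to_c` / `_import_from_c` hooks, which is exactly the kind of thing a
standard protocol would replace. A sketch (these are underscore methods, so no
stability guarantees):

```python
import pyarrow as pa
from pyarrow.cffi import ffi

arr = pa.array([1, 2, 3])

# allocate the two structs from the C data interface spec
c_schema = ffi.new('struct ArrowSchema*')
c_array = ffi.new('struct ArrowArray*')
ptr_schema = int(ffi.cast('uintptr_t', c_schema))
ptr_array = int(ffi.cast('uintptr_t', c_array))

arr._export_to_c(ptr_array, ptr_schema)

# any consumer can now read the structs; here we just round-trip via pyarrow
roundtripped = pa.Array._import_from_c(ptr_array, ptr_schema)
```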

Thanks
Wes

On Mon, Oct 19, 2020 at 12:34 PM Pasha Stetsenko  wrote:
>
> Hi everybody,
>
> I've been reading http://arrow.apache.org/docs/format/CDataInterface.html, 
> which has been
> "... inspired by the Python buffer protocol", and i can't find any details on 
> how to connect this
> protocol with other libraries/applications.
>
> Here's what I mean: with the python buffer protocol, i can create a new type 
> and set its
> `tp_as_buffer` field to a `PyBufferProcs` structure. This way any other 
> library can call
> `PyObject_CheckBuffer()` on my object to check whether or not it supports the 
> buffer interface,
> and then `PyObject_GetBuffer()` to use that interface.
>
> I could not find the corresponding mechanisms in the Arrow C data interface. 
> For example, consider the "Exporting a simple int32 array" tutorial in the 
> article above. After creating
> `export_int32_type()`, `release_int32_type()`, `export_int32_array()`, 
> `release_int32_array()`
> -- how do i announce to the world that these functions are available? 
> Conversely, if i want to
> talk to an Arrow Table via this interface -- where do i find the endpoints 
> that return
> `ArrowSchema` and `ArrowArray` structures?
>
> (I understand that there is an additional, more complicated API for accessing 
> arrow objects http://arrow.apache.org/docs/python/extending.html, but this 
> seems to be a completely different
> API than what CDataInterface describes).


Re: write_feather, new_file, and compression

2020-10-08 Thread Wes McKinney
On Wed, Oct 7, 2020 at 9:33 PM Jonathan Yu  wrote:
>
> Hello there,
>
> I am using Arrow to store data on disk temporarily, so disk space is not a 
> problem (I understand that Parquet is preferable for more efficient disk 
> storage). It seems that Arrow's memory mapping/zero copy capabilities would 
> provide better performance given this use case.
>
> Here are my questions:
>
> 1. For new applications, should we prefer the pa.ipc.new_file interface over 
> write_feather? My understanding from reading [0] is that 
> pa.feather.write_feather is an API provided for backward compatibility, and 
> with compression disabled, it seems to produce files of the same size (the 
> files appear to be identical) as the RecordBatchFileWriter.
>

You can use either; neither API is deprecated nor planned to be.

> 2. Does compression affect the need to make copies? I imagine that 
> compressing the file means that the code to use the file cannot be zero-copy 
> anymore.
>

Right, when using compression by definition zero copy is not possible.

> 3. When using pandas to analyze the data, is there a way to load the data 
> using memory mapping, and if so, would this be expected to improve 
> deserialization performance and memory utilization if multiple processes are 
> reading the same table data simultaneously? Assume that I'm running on a 
> modern server-class SSD.
>

No, pandas doesn't support memory mapping.
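
What you can do is memory-map the uncompressed Arrow file and only pay the
copy when converting to pandas. A minimal sketch (the file name is made up,
and `table` stands for whatever you are writing out):

```python
import pyarrow as pa

with pa.OSFile('scratch.arrow', 'wb') as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# the Arrow buffers reference the mapped pages directly (zero copy)
source = pa.memory_map('scratch.arrow', 'r')
mapped = pa.ipc.open_file(source).read_all()

df = mapped.to_pandas()   # this step copies into pandas-owned memory
```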

> Thank you!
>
> Jonathan
>
> [0] https://arrow.apache.org/faq/#what-about-the-feather-file-format


Re: [C++] Flight performance

2020-09-30 Thread Wes McKinney
Since the TCP/networking stack is different between Linux and Windows,
I would certainly expect the performance to be *different* but I'm not
sure of the expected magnitude of the difference. I'm copying dev@ in
case someone knows more about gRPC and TCP performance on Windows.
There might be some default gRPC settings which should be set
differently on Windows to achieve better performance out of the box

On Wed, Sep 30, 2020 at 3:36 AM Louis C  wrote:
>
> Hello,
>
> I compiled the flight benchmarks on windows (using vcpkg and adding some 
> options to the port file because it is not done by default), version 1.0.0.
> While running it I only get about 700 MB/s max with the default parameters. 
> My CPU is i7-4790K, with 32Gb of Ram (ddr3) at 1600 Mhz.
> However I see that in the blog post concerning Flight, the performance which 
> is reached is about 3GB/s.
> Does such a difference in performance seems normal to you ?
> Example of output :
> >arrow-flight-benchmark.exe
> Using standalone server: false
> Server running with pid 27500
> Testing method: DoGet
> Server host: localhost
> Server port: 31337
> Server host: localhost
> Server port: 31337
> Batch size: 131040
> Batches read: 9768
> Bytes read: 128000
> Nanos: 2737466000
> Speed: 445.924 MB/s
> Throughput: 3568.26 batches/s
> Latency: 948.912 usec/batch
> I also tested on similar machines, with similar performances (I reach 1.2GB/s 
> max on more recent computers (windows) )
>
> Cheers,
> Louis C


Re: [Python] dataset filter performance and partitioning

2020-09-25 Thread Wes McKinney
hi Matt,

This is because `arrow::compute::IsIn` is being called on each batch
materialized by the datasets API so the internal kernel state is being
set up and torn down for every batch. This means that a large hash
table is being set up and torn down many times rather than only
created once as with pandas.

https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/filter.cc#L1223

This is definitely a performance problem that must be fixed at some point.

https://issues.apache.org/jira/browse/ARROW-10097
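
Until that is fixed, one way to sidestep the per-batch setup is to skip the
dataset-level isin filter and build the mask once on the materialized table.
A sketch (the path, column names and `large_key_set` are placeholders, and it
assumes a pyarrow where pc.is_in accepts a value_set argument):

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds

dataset = ds.dataset('path/to/data', format='parquet')

# read without the isin filter, so the hash table is built only once
table = dataset.to_table(columns=['id', 'value'])
mask = pc.is_in(table['id'], value_set=pa.array(large_key_set))
filtered = table.filter(mask)
```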

Thanks,
Wes

On Fri, Sep 25, 2020 at 10:49 AM Matthew Corley  wrote:
>
> I don't want to cloud this discussion, but I do want to mention that when I 
> tested isin filtering with a high cardinality (on the order of many millions) 
> set on a parquet dataset with a single rowgroup (so filtering should have to 
> happen after data load), it performed much worse in terms of runtime and peak 
> memory utilization than waiting to do the filtering after converting the 
> Table to a pandas DataFrame.  This surprised me given all the copying that 
> has to occur in the latter case.
>
> I don't have my exact experiment laying around anymore to give concrete 
> numbers, but it might be worth investigating while the isin filter code is 
> under consideration.
>
> On Fri, Sep 25, 2020 at 6:55 AM Josh Mayer  wrote:
>>
>> Thanks Joris, that info is very helpful. A few follow up questions, you 
>> mention that:
>>
>> > ... it actually needs to parse the statistics of all row groups of all 
>> > files to determine which can be skipped ...
>>
>> Is that something that is only done once (and perhaps stored inside a 
>> dataset object in some optimized form) or performed on every to_table call?
>>
>> In the case that I am creating a dataset from a common metadata file is it 
>> possible to attach manual partitioning information (using field expressions 
>> on to each file), similar to how it is done in the manual dataset creation 
>> case 
>> (https://arrow.apache.org/docs/python/dataset.html#manual-specification-of-the-dataset)?
>>
>> Josh
>>
>> On Fri, Sep 25, 2020 at 8:34 AM Joris Van den Bossche 
>>  wrote:
>>>
>>> Using a small toy example, the "isin" filter is indeed not working for 
>>> filtering row groups:
>>>
>>> >>> table = pa.table({"name": np.repeat(["a", "b", "c", "d"], 5), "value": 
>>> >>> np.arange(20)})
>>> >>> pq.write_table(table, "test_filter_string.parquet", row_group_size=5)
>>> >>> dataset = ds.dataset("test_filter_string.parquet")
>>> # get the single file fragment (dataset consists of one file)
>>> >>> fragment = list(dataset.get_fragments())[0]
>>> >>> fragment.ensure_complete_metadata()
>>>
>>> # check that we do have statistics for our row groups
>>> >>> fragment.row_groups[0].statistics
>>> {'name': {'min': 'a', 'max': 'a'}, 'value': {'min': 0, 'max': 4}}
>>>
>>> # I created the file such that there are 4 row groups (each with a unique 
>>> value in the name column)
>>> >>> fragment.split_by_row_group()
>>> [,
>>>  ,
>>>  ,
>>>  ]
>>>
>>> # simple equality filter works as expected -> only single row group left
>>> >>> filter = ds.field("name") == "a"
>>> >>> fragment.split_by_row_group(filter)
>>> []
>>>
>>> # isin filter does not work
>>> >>> filter = ds.field("name").isin(["a", "b"])
>>> >>> fragment.split_by_row_group(filter)
>>> [,
>>>  ,
>>>  ,
>>>  ]
>>>
>>> While filtering with "isin" on partition columns is working fine. I opened 
>>> https://issues.apache.org/jira/browse/ARROW-10091 to track this as a 
>>> possible enhancement.
>>> Now, to explain why for partitions this is an "easier" case: the partition 
>>> information gets translated into an equality expression, with your example 
>>> an expression like "name == 'a' ", while the statistics give a 
>>> bigger/lesser than expression, such as "(name > 'a') & (name < 'a')" (from 
>>> the min/max). So for the equality it is more trivial to compare this with 
>>> an "isin" expression like "name in ['a', 'b']" (for the min/max expression, 
>>> we would need to check the special case where min/max is equal).
>>>
>>> Joris
>>>
>>> On Fri, 25 Sep 2020 at 14:06, Joris Van den Bossche 
>>>  wrote:

 Hi Josh,

 Thanks for the question!

 In general, filtering on partition columns will be faster than filtering 
 on actual data columns using row group statistics. For partition-based 
 filtering, the scanner can skip full files based on the information from 
 the file path (in your example case, there are 11 files, for which it can 
 select 4 of them to actually read), while for row-group-based filtering, 
 it actually needs to parse the statistics of all row groups of all files 
 to determine which can be skipped, which is typically more information to 
 process compared to the file paths.

 That said, there are some oddities I noticed:

 - As I mentioned, I expect partition-based filtering to be faster, but not 
 that much faster (certainly in a case with a limited 

Re: Implementation independent __arrow_array__

2020-09-12 Thread Wes McKinney
Adding dev@

This is one purpose of the Arrow C data interface, which was developed
after the __arrow_array__ protocol, and worth investigating

https://github.com/apache/arrow/blob/master/docs/source/format/CDataInterface.rst
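
For context, the producer side of __arrow_array__ looks roughly like this,
which is exactly where the hard pyarrow dependency comes in (a minimal sketch):

```python
import pyarrow as pa

class MyColumn:
    def __init__(self, values):
        self._values = values

    def __arrow_array__(self, type=None):
        # the producer builds (and therefore depends on) a pyarrow array
        return pa.array(self._values, type=type)

arr = pa.array(MyColumn([1, 2, 3]))   # pa.array() invokes __arrow_array__
```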

On Sat, Sep 12, 2020 at 2:16 PM Marc Garcia  wrote:
>
> Hi there,
>
> I'm writing a document analyzing different options for a Python dataframe 
> exchange protocol. And I wanted to ask a question regarding the 
> __arrow_array__ protocol.
>
> I checked the code, and looks like the producer is expected to be sending an 
> Arrow array, and the consumer just receives it. This is the code I'm 
> checking, I guess it's the right one: 
> https://github.com/apache/arrow/blob/master/python/pyarrow/array.pxi#L110
>
> Compared to the array interface (the NumPy buffer protocol), it works a bit 
> differently. In the NumPy one, the producer exposes the pointer, the size... 
> So, the producer doesn't need to depend on NumPy or any other library, and 
> then the consumer can simply use `numpy.array(obj)` and generate the NumPy 
> array. Or if other implementations support the protocol (not sure if they 
> do), they could call something like `tensorflow.Tensor(obj)`, and NumPy would 
> not be used at all.
>
> Am I understanding correctly the `__arrow_array__` protocol? And if I am, is 
> there anything else similar to the NumPy protocol that can be used to 
> exchange data without relying on a particular implementation?
>
> Thanks in advance!


Re: initialization on first call to pyarrow.array()?

2020-08-28 Thread Wes McKinney
The Arrow C++ libraries do some other one-time static initialization
so we should find out if it's all due to importing pandas or something
else
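
A quick way to split the two costs apart (a diagnostic sketch, not a fix):

```python
import time

t0 = time.time()
import pandas  # noqa: F401  - pay the pandas import cost up front
t1 = time.time()

import numpy as np
import pyarrow as pa

pa.array(np.random.rand(10))   # triggers any remaining one-time Arrow setup
t2 = time.time()
print('pandas import: %.3fs, first pyarrow.array(): %.3fs' % (t1 - t0, t2 - t1))
```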

On Fri, Aug 28, 2020 at 1:48 AM Joris Van den Bossche
 wrote:
>
> Hi Max,
>
> I assume (part of) the slowdown comes from trying to import pandas. If I add 
> an "import pandas" to your script, the difference with the first run is much 
> smaller (although still a difference).
>
> Inside the array function, we are lazily importing pandas to check if the 
> input is a pandas object. I suppose that in theory, if the input is a numpy 
> array, we should also be able to avoid this pandas import (maybe switching 
> the order of some checks).
>
> Best,
> Joris
>
> On Fri, 28 Aug 2020 at 01:10, Max Grossman  wrote:
>>
>> Hi all,
>>
>> Say I've got a simple program like the following that converts a numpy array 
>> to a pyarrow array several times in a row, and times each of those 
>> conversions:
>>
>> import pyarrow
>> import numpy as np
>> import time
>>
>> arr = np.random.rand(1)
>>
>> t1 = time.time()
>> pyarrow.array(arr)
>> t2 = time.time()
>> pyarrow.array(arr)
>> t3 = time.time()
>> pyarrow.array(arr)
>> t4 = time.time()
>> pyarrow.array(arr)
>> t5 = time.time()
>>
>> I'm noticing that the first call to pyarrow.array() is taking ~0.3-0.5 s 
>> while the rest are nearly instantaneous (1e-05s).
>>
>> Does anyone know what might be causing this? My assumption is some one-time 
>> initialization of pyarrow on the first call to the library, in which case 
>> I'd like to see if there's some way to explicitly trigger that 
>> initialization earlier in the program. But also curious to hear if there is 
>> a different explanation.
>>
>> Right now I'm working around this by just calling pyarrow.array([]) at 
>> application start up -- I realize this doesn't actually eliminate the added 
>> time, but it does move it out of the critical section for any benchmarking 
>> runs.
>>
>> Thanks,
>>
>> Max


Re: pyarrow + pybind11 segfault under Linux

2020-08-23 Thread Wes McKinney
There are a lot of unchecked Statuses in your code. I would suggest
checking them all and additionally adding a (checked!) call to
Validate() or ValidateFull() to make sure that everything is well
formed (it seems like it should be, but this is a pre-requisite before
debugging further)

On Sun, Aug 23, 2020 at 1:27 AM Yue Ni  wrote:
>
> Hi there,
>
> I tried to create a Python binding our Apache Arrow C++ based program, and 
> used pybind11 and pyarrow wrapping code to do it. For some reason, the code 
> works on macOS however it causes segfault under Linux.
>
> I created a minimum test case to reproduce this behavior, is there anyone who 
> can help to take a look at what may go wrong here?
>
> Here is the C++ code for creating the binding (it simply generates a fixed 
> size array and puts it into record batch and then creates a table)
> ```
> pybind11::object generate(const int32_t count) {
>   shared_ptr<arrow::Array> array;
>   arrow::Int64Builder builder;
>   for (auto i = 0; i < count; i++) {
> auto _ = builder.Append(i);
>   }
>   auto _ = builder.Finish();
>   auto record_batch = RecordBatch::Make(
>   arrow::schema(vector<shared_ptr<Field>>{arrow::field("int_value", arrow::int64())}), 
> count, vector<shared_ptr<Array>>{array});
>   auto table = 
> arrow::Table::FromRecordBatches(vector<shared_ptr<RecordBatch>>{record_batch}).ValueOrDie();
>   auto result = arrow::py::import_pyarrow();
>   auto wrapped_table = pybind11::reinterpret_borrow<pybind11::object>(
>   pybind11::handle(arrow::py::wrap_table(table)));
>   return wrapped_table;
> }
> ```
>
> Here is the python code that uses the binding (it calls the binding to 
> generate a 100-length single column table, and print the number of rows and 
> table schema).
> ```
> table = binding.generate(100)
> >>> print(table.num_rows) # this works correctly
> 100
> >>> print(table.shape) # this works correctly
> (100, 1)
> >>> print(table.num_columns) # this works correctly
> 1
> >>> print(table.column_names) # this prints an empty list, which is 
> >>> incorrect, but the program still runs
> ['']
> >>> print(table.columns) # this causes the segfault
> Segmentation fault (core dumped)
> ```
>
> The same code works completely fine and correct on macOS (Apple clang 11, 
> Python 3.7.5, arrow 1.0.0 C++ lib, pyarrow 1.0.0), but it doesn't work on 
> Debian bullseye (gcc 10.2.0, Python 3.8.5, arrow 1.0.1 C++ lib, pyarrow 
> 1.0.1). I tried switching to some combinations of Python 3.7.5 and 
> arrow/pyarrow 1.0.0 as well, but none of them works for me.
>
> I got the core dump and use gdb for some simple debugging, and it seems the 
> segfault happened when pyarrow tried to call `pyarrow_wrap_data_type` when 
> doing `Field.init`.
>
> Here is the core dump:
> ```
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> Core was generated by `python3'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x7f604ea484cc in __pyx_f_7pyarrow_3lib_pyarrow_wrap_data_type () 
> from 
> /usr/local/lib/python3.8/dist-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so
> [Current thread is 1 (Thread 0x7f60550bc740 (LWP 2205))]
> (gdb) where
> #0  0x7f604ea484cc in __pyx_f_7pyarrow_3lib_pyarrow_wrap_data_type () 
> from 
> /usr/local/lib/python3.8/dist-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so
> #1  0x7f604eb05df0 in 
> __pyx_f_7pyarrow_3lib_5Field_init(__pyx_obj_7pyarrow_3lib_Field*, 
> std::shared_ptr const&) () from 
> /usr/local/lib/python3.8/dist-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so
> #2  0x7f604ea35d80 in __pyx_f_7pyarrow_3lib_pyarrow_wrap_field () from 
> /usr/local/lib/python3.8/dist-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so
> #3  0x7f604ea68c8f in __pyx_pw_7pyarrow_3lib_6Schema_28_field(_object*, 
> _object*) () from 
> /usr/local/lib/python3.8/dist-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so
> #4  0x7f604ea69595 in __Pyx_PyObject_CallOneArg(_object*, _object*) () 
> from 
> /usr/local/lib/python3.8/dist-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so
> #5  0x7f604ea755de in 
> __pyx_pw_7pyarrow_3lib_6Schema_7__getitem__(_object*, _object*) () from 
> /usr/local/lib/python3.8/dist-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so
> #6  0x7f604ea476f0 in __pyx_sq_item_7pyarrow_3lib_Schema(_object*, long) 
> () from 
> /usr/local/lib/python3.8/dist-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so
> #7  0x7f604ead554e in __pyx_pw_7pyarrow_3lib_5Table_55_column(_object*, 
> _object*) () from 
> /usr/local/lib/python3.8/dist-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so
> #8  0x7f604ea69595 in __Pyx_PyObject_CallOneArg(_object*, _object*) () 
> from 
> /usr/local/lib/python3.8/dist-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so
> #9  0x7f604ea8c8df in 
> __pyx_getprop_7pyarrow_3lib_5Table_columns(_object*, void*) () from 
> /usr/local/lib/python3.8/dist-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so
> #10 0x0051bafa in ?? ()
> #11 

Re: returning an arrow::Array from C++ to Python through pybind11

2020-08-15 Thread Wes McKinney
Are you using the arrow::py::wrap_array function? You can follow some
other successful pybind11 projects that use the pyarrow C/C++ API. You
have to also call the import_pyarrow() function

https://github.com/blue-yonder/turbodbc/blob/0369d1329a0ea39982a4d8d169b8dd3f473e6689/cpp/turbodbc_arrow/Library/src/arrow_result_set.cpp#L338

On Fri, Aug 14, 2020 at 4:24 PM Max Grossman  wrote:
>
> Hi all,
>
> I've written a C++ library that uses arrow for its in-memory data
> structures. I'd like to also add some Python APIs on top of this
> library using pybind11, so that I can grab pyarrow wrappers of my C++
> Arrow Arrays, convert them to numpy arrays, and then pass them in to
> scikit-learn (or other Python libraries) without copying data around.
>
> As an example, on the C++ side I've got a 1D vector class that wraps
> an arrow array and has a method to convert the arrow array into a
> pyarrow array:
>
> PyObject* get_local_pyarrow_array() {
> return 
> arrow::py::wrap_array(std::dynamic_pointer_cast<arrow::FixedSizeBinaryArray>(_arr));
> }
>
> I've got some pybind11 registration code that registers the class and
> that method:
>
> py::class_>(m, "ShmemML1DD")
> .def("get_local_pyarrow_array",
> ::get_local_pyarrow_array);
>
> And then I've got some Python code that calls this method (and which I
> hope gets a pyarrow array as the return value):
>
> arr = dist_arr.get_local_pyarrow_array()
>
> Note that these are arrays that I'm constructing in C++ code and want
> to expose to Python, so I don't already have a pre-existing pyarrow
> instance to use. I'm trying to create a new one around my C++ arrays,
> so that Python code can start manipulating those C++ arrays.
>
> When I build and run all this, I just get told "Unable to convert
> function return value to a Python type!":
>
> Traceback (most recent call last):
>   File "/global/homes/j/jmg3/shmem_ml/example/python_wrapper.py", line
> 15, in 
> random.rand(vec)
>   File "/global/homes/j/jmg3/shmem_ml/src/shmem_ml/random.py", line 8, in rand
> arr = dist_arr.get_local_pyarrow_array()
> TypeError: Unable to convert function return value to a Python type!
> The signature was
> (self: shmem_ml.core.ShmemML1DD) -> _object
> Traceback (most recent call last):
>   File "/global/homes/j/jmg3/shmem_ml/example/python_wrapper.py", line
> 15, in 
> random.rand(vec)
>   File "/global/homes/j/jmg3/shmem_ml/src/shmem_ml/random.py", line 8, in rand
> arr = dist_arr.get_local_pyarrow_array()
> TypeError: Unable to convert function return value to a Python type!
> The signature was
> (self: shmem_ml.core.ShmemML1DD) -> _object
>
> I'm new to pybind11, so I suspect this may not be a problem with my
> arrow usage as much as it is with my pybind11 usage. I wanted to ask
> if there's a better way to be doing this that's recommended for
> pyarrow applications. It seems there are cython examples in the docs,
> would the suggestion be to drop pybind11 and write a wrapper of my C++
> class in cython?
>
> Thanks for any suggestions,
>
> Max


Re: [C++] [Python] How to serialize and send C++ arrow types to a Python client to deserialize?

2020-08-12 Thread Wes McKinney
Note that we additionally have been planning to deprecate
pyarrow.serialize and pyarrow.deserialize. I expect they'll be
deprecated in the next major release (2.0.0)
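
For reference, the Python side of the IPC-stream approach Micah describes
below looks roughly like this (a sketch; the Redis transport itself is up to
you):

```python
import pyarrow as pa

table = pa.table({'a': [1, 2, 3]})

# serialize: write an IPC stream into an in-memory buffer
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
payload = sink.getvalue()   # pa.Buffer; these are the bytes to send

# deserialize (the C++ side can read the same bytes with
# arrow::ipc::RecordBatchStreamReader)
restored = pa.ipc.open_stream(payload).read_all()
```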

On Mon, Aug 10, 2020 at 1:00 PM Micah Kornfield  wrote:
>
> Hi Barbara
>>
>>  I also need this C++ client to serialize/deserialize the data in the same 
>> format that our existing Python client does with pyarrow, so that serialized 
>> data sent from the C++ client can be read from the Python client and vice 
>> versa.
>
> How are you serializing data in python? If you are using pyarrow.serialize 
> [1] then I would suggest moving to serializing an IPC stream in memory [2] 
> and using that.  As the docs in pyarrow.serialize mention it is experimental, 
> this means it has no guarantees for forward and backward compatibility.
>
> Moving to this method would allow you use the RecordBatchReader from C++ [3]
>
>> After a bit of reading and research, I suspect that I should be using the 
>> arrow::py library, but was hoping to get more guidance on this.
>
>
> It is not entirely clear without a code sample, but if you are using the C++ 
> python libraries then you need to ensure the the Python interpreter and arrow 
> python module are initialized [4].
>
> Hope this helps.
>
> -Micah
>
> [1] 
> https://arrow.apache.org/docs/python/generated/pyarrow.serialize.html#pyarrow.serialize
> [2] https://arrow.apache.org/docs/python/ipc.html#using-streams
> [3] 
> https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow17RecordBatchReaderE
> [4] 
> https://github.com/apache/arrow/blob/2bd2fc45c45cb0edc8800eb53721231b56a65113/cpp/src/arrow/python/util/test_main.cc#L29
>
> On Mon, Aug 10, 2020 at 10:32 AM Barbara BOYAJIAN 
>  wrote:
>>
>> Hello,
>>
>> I'm currently looking to use Arrow in the following use case. I am writing a 
>> C++ client, where I need to send serialized Arrow data to Redis, and 
>> deserialize Arrow data that is received from Redis. I'm using boost::asio to 
>> communicate with Redis, and am able to send/receive buffers via unix and tcp 
>> sockets. I also need this C++ client to serialize/deserialize the data in 
>> the same format that our existing Python client does with pyarrow, so that 
>> serialized data sent from the C++ client can be read from the Python client 
>> and vice versa.
>>
>> I wish to be able to apply the above use case to send/receive 
>> arrow::Tensors, arrow::Tables, and arrow::Arrays.
>>
>> After a bit of reading and research, I suspect that I should be using the 
>> arrow::py library, but was hoping to get more guidance on this.
>>
>> So far, I have created a C++ arrow::Table manually, wrapped it using 
>> arrow::py::wrap_table, and have tried to use arrow::SerializeObject(...) to 
>> serialize it. However, my approach is not working as the memory address for 
>> the variable that is meant to hold the serialized object is 0x0.
>>
>> Thank you very much in advance for your help.
>>
>>

