Changing CA configurations under Python

2021-03-19 Thread Daniel Nugent
At the moment, if you load up the pyarrow.fs module, it initializes the CA 
configuration from the ssl module.

I have a situation where I need to change that configuration at runtime, but it 
looks like it can only be set once, when the pyarrow.fs module is loaded. Is it 
possible to modify these values after the initialization has occurred? It’s 
fine that they are global; I just have to load a certificate file at a later 
point.
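
For illustration, a minimal sketch of the import-time capture described above. The 
assumption is that the defaults pyarrow.fs picks up are the ones the ssl module 
reports via ssl.get_default_verify_paths(); the exact mechanism (and whether it can 
be changed later) is what I'm asking about.

import ssl

# Defaults captured around import time; these are what I'd like to change later.
print(ssl.get_default_verify_paths())

import pyarrow.fs  # the CA configuration appears to be fixed from this point on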

Thanks,

-Dan Nugent


Re: Are the Parquet file statistics correct in the following example?

2021-02-15 Thread Daniel Nugent
Ok. Thanks for the suggestions. I'll see if I can use the finer grained
writing to handle this.

I filed ARROW-11634 a bit before you responded because it did seem like a
bug. Hope that's sufficient for tracking.
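
For reference, a rough sketch of the finer-grained writing route I'm going to try: 
write each logical chunk as its own row group so every row group carries a small, 
chunk-specific dictionary. This assumes pyarrow 3.0 and is only illustrative; it may 
still hit the dense fallback Micah mentions below.

import pyarrow as pa
import pyarrow.parquet as papq

# Each chunk gets its own (single-value) dictionary.
chunks = [pa.table({"col": pa.array(100 * ["A"]).dictionary_encode()}),
          pa.table({"col": pa.array(100 * ["B"]).dictionary_encode()})]

writer = papq.ParquetWriter('sample.parquet', chunks[0].schema)
for chunk in chunks:
    writer.write_table(chunk)  # 100 rows per call -> one row group per chunk
writer.close()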

-Dan Nugent


On Tue, Feb 16, 2021 at 1:00 AM Micah Kornfield 
wrote:

> Hi Dan,
> This seems suboptimal to me as well (and we should probably open a JIRA to
> track a solution).  I think the problematic code is [1] since we don't
> appear to update statistics for the actual indices but simply the overall
> dictionary (and of course there is that TODO)
>
> There are a couple of potential workarounds:
> 1. Try to make finer grained tables, with smaller dictionaries and use the
> fine grained writing API [2].  This still might not work (it could cause
> the fallback to dense if the object lifecycles aren't correct).
> 2.  Before writing, cast the column to dense (not dictionary encoded) (you
> might want to still iterate the table in chunks in this case to avoid
> excessive memory usage due to the loss of dictionary encoding compactness).
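
(For concreteness, a minimal sketch of workaround 2 against the toy example from the 
original message: cast the dictionary column to a plain string column before writing 
so the row-group statistics reflect the actual values. Assumes pyarrow 3.0; the cast 
behaviour on dictionary columns is worth double-checking.)

import pyarrow as pa
import pyarrow.parquet as papq

d = pa.DictionaryArray.from_arrays((100 * [0]) + (100 * [1]), ["A", "B"])
t = pa.table({"col": d})

# Decode the dictionary column to its value type before writing.
dense = t.set_column(0, "col", t.column("col").cast(pa.string()))
papq.write_table(dense, 'sample_dense.parquet', row_group_size=100)
# Expected: row group 0 has min == max == 'A', row group 1 has min == max == 'B'.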
>
> Hope this helps.
>
> -Micah
>
> [1]
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L1492
> [2]
> https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
>
> On Sat, Feb 13, 2021 at 12:28 AM Nugent, Daniel 
> wrote:
>
>> Pyarrow version is 3.0.0
>>
>>
>>
>> Naively, I would expect the max and min to not just reflect the max and
>> min value of the dictionary for each row group, but the max and min value
>> of the actual values in the rowgroup.
>>
>>
>>
>> I looked at the Parquet spec which seems to reflect this as it refers to
>> the statistics applying to the logical type of the column, but I may be
>> misunderstanding.
>>
>>
>>
>> This is just a toy example, of course. The real data I'm working with is
>> quite a bit larger and ordered on the column this applies to, so being able
>> to use the statistics for predicate pushdown would be ideal.
>>
>>
>>
>> If pyarrow.parquet.write_table is not the preferred way to write Parquet
>> files out from Arrow data and there is a more germane method, I'd
>> appreciate being elucidated. I'd also appreciate any workaround suggestions
>> for the time being.
>>
>>
>>
>> Thank you,
>>
>> -Dan Nugent
>>
>>
>>
>> >>> import pyarrow as pa
>> >>> import pyarrow.parquet as papq
>> >>> d = pa.DictionaryArray.from_arrays((100*[0]) + (100*[1]), ["A","B"])
>> >>> t = pa.table({"col": d})
>> >>> papq.write_table(t, 'sample.parquet', row_group_size=100)
>> >>> f = papq.ParquetFile('sample.parquet')
>> >>> (f.metadata.row_group(0).column(0).statistics.min,
>> f.metadata.row_group(0).column(0).statistics.max)
>> ('A', 'B')
>> >>> (f.metadata.row_group(1).column(0).statistics.min,
>> f.metadata.row_group(1).column(0).statistics.max)
>> ('A', 'B')
>> >>> f.read_row_groups([0]).column(0)
>> <pyarrow.lib.ChunkedArray object at 0x...>
>> [
>>   -- dictionary:
>>     [
>>       "A",
>>       "B"
>>     ]
>>   -- indices:
>>     [
>>       0,
>>       0,
>>       0,
>>       ...
>>       0,
>>       0
>>     ]
>> ]
>> >>> f.read_row_groups([1]).column(0)
>> <pyarrow.lib.ChunkedArray object at 0x...>
>> [
>>   -- dictionary:
>>     [
>>       "A",
>>       "B"
>>     ]
>>   -- indices:
>>     [
>>       1,
>>       1,
>>       1,
>>       ...
>>       1,
>>       1
>>     ]
>> ]
>>
>>


Re: Question the nature of the "Zero Copy" advantages of Apache Arrow

2021-01-26 Thread Daniel Nugent
Right, I recognize that using mmap directly isn't necessarily the most
straightforward, which is why I suggested using a RAM disk with
uncompressed arrow files. It saves the trouble of having to deal with
passing addresses around and puts a nice file system API on top of any
dataset operations that arrow already supports that you might want to do
(you can get a reasonable approach to appends using this, for example).

But if you've already got an on-disk, uncompressed arrow buffer that's
bigger than memory, the arrow api should take care of using the mmap system
calls to load it into memory (at least I think this is currently supported
for all the arrow libraries? You may have to double check. I know it's in
C/C++/Python for sure, probably Rust and I think R?).

Then you're only dealing with virtual allocations and you can load that
larger than memory file in as many analytics packages as you like and there
will only be one copy of any portion of that file in memory at any given
time.
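
A minimal sketch of that mapped-read path, assuming an uncompressed Arrow IPC file 
(the path here is hypothetical):

import pyarrow as pa

# Map the file; the resulting Table's buffers reference the mapped pages,
# so the OS pages data in on demand and the file can exceed RAM.
source = pa.memory_map('ticks.arrow', 'r')
table = pa.ipc.open_file(source).read_all()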

-Dan Nugent


On Tue, Jan 26, 2021 at 1:15 PM Thomas Browne  wrote:

> Yes I think the term "zero copy" was confusing to me. It doesn't quite do
> what it says on the tin since if I understand correctly the term still
> allows for an actual copy to still occur, it's just a direct binary copy
> without a [de]serialisation process.
>
> I hear you on plasma.
>
> On the issue of MAP_SHARED, got it, but that means I'm having to talk C
> from other languages.
>
> I think Jorge's answer (
> https://arrow.apache.org/docs/format/CDataInterface.html) is pretty good
> though. Good enough for me anyway. Thanks everyone.
> On 26/01/2021 18:09, Daniel Nugent wrote:
>
> I think you might be a bit confused about what zero copy means if that’s
> what you’re concerned about. If you have a bigger than memory file, then
> Plasma wasn’t going to help since its design always involved copying the
> arrow buffers to memory.
>
> If you have larger than memory arrow files in the first place, just open
> them using mmap (should be automatically done for non-compressed arrow
> files).
>
> --
> -Dan Nugent
> On Jan 26, 2021, 13:07 -0500, Thomas Browne 
> , wrote:
>
> don't I lose the benefit of mmapping huge files with a ramdisk? Cos the
> file has to now fit on my ramdisk.
>
> Personally working with financial tick data which can be enormous.
> On 26/01/2021 18:00, Daniel Nugent wrote:
>
> Is there a problem with just using a RAM disk as the method for sharing
> the arrow buffers? It just seems easier and less finicky than a separate
> API to program against.
>
> It also makes storing the data permanently a lot  more straightforward, I
> think.
>
> --
> -Dan Nugent
> On Jan 26, 2021, 12:47 -0500, Thomas Browne 
> , wrote:
>
> So one of the big advantages of Arrow is the common format in memory, on
> the wire, across languages.
>
> I get that this makes it very easy and fast to transfer data between
> nodes, and between languages, which will all share the in-memory format
> and therefore the (often expensive) serialisation step is removed.
>
> However, is it true that one of the core objectives of the project is
> also to allow shared memory objects across different languages on the
> same node? For example, a fast C-based ingest system constantly
> populates a pyarrow buffer, which can be read directly by any other
> application on that node, through pointer sharing?
>
> If this is a core objective, what is the canonical way for brokering the
> "pointers" to this data between languages? Is it the Plasma store? And
> if so, are there plans for Plasma to move be implemented in other client
> languages?
>
> In short. Is Plasma (or if not Plasma, the functionality it provides
> implemented some other way), a core objective of the project?
>
> Or instead is Flight supposed to be used between languages on the same
> node, and if so, does Flight provide true zero-copy (ie - the same
> buffer, not copying the buffer) if run between processes on the same node?
>
> Many thanks.
>
>


Re: Question the nature of the "Zero Copy" advantages of Apache Arrow

2021-01-26 Thread Daniel Nugent
I think you might be a bit confused about what zero copy means if that’s what 
you’re concerned about. If you have a bigger than memory file, then Plasma 
wasn’t going to help since its design always involved copying the arrow buffers 
to memory.

If you have larger than memory arrow files in the first place, just open them 
using mmap (should be automatically done for non-compressed arrow files).

--
-Dan Nugent
On Jan 26, 2021, 13:07 -0500, Thomas Browne , wrote:
> don't I lose the benefit of mmapping huge files with a ramdisk? Cos the file 
> has to now fit on my ramdisk.
>
> Personally working with financial tick data which can be enormous.
> On 26/01/2021 18:00, Daniel Nugent wrote:
> > Is there a problem with just using a RAM disk as the method for sharing the 
> > arrow buffers? It just seems easier and less finicky than a separate API to 
> > program against.
> >
> > It also makes storing the data permanently a lot  more straightforward, I 
> > think.
> >
> > --
> > -Dan Nugent
> > On Jan 26, 2021, 12:47 -0500, Thomas Browne , wrote:
> > > So one of the big advantages of Arrow is the common format in memory, on
> > > the wire, across languages.
> > >
> > > I get that this makes it very easy and fast to transfer data between
> > > nodes, and between languages, which will all share the in-memory format
> > > and therefore the (often expensive) serialisation step is removed.
> > >
> > > However, is it true that one of the core objectives of the project is
> > > also to allow shared memory objects across different languages on the
> > > same node? For example, a fast C-based ingest system constantly
> > > populates a pyarrow buffer, which can be read directly by any other
> > > application on that node, through pointer sharing?
> > >
> > > If this is a core objective, what is the canonical way for brokering the
> > > "pointers" to this data between languages? Is it the Plasma store? And
> > > if so, are there plans for Plasma to move be implemented in other client
> > > languages?
> > >
> > > In short. Is Plasma (or if not Plasma, the functionality it provides
> > > implemented some other way), a core objective of the project?
> > >
> > > Or instead is Flight supposed to be used between languages on the same
> > > node, and if so, does Flight provide true zero-copy (ie - the same
> > > buffer, not copying the buffer) if run between processes on the same node?
> > >
> > > Many thanks.


Re: Question the nature of the "Zero Copy" advantages of Apache Arrow

2021-01-26 Thread Daniel Nugent
Is there a problem with just using a RAM disk as the method for sharing the 
arrow buffers? It just seems easier and less finicky than a separate API to 
program against.

It also makes storing the data permanently a lot more straightforward, I think.

--
-Dan Nugent
On Jan 26, 2021, 12:47 -0500, Thomas Browne , wrote:
> So one of the big advantages of Arrow is the common format in memory, on
> the wire, across languages.
>
> I get that this makes it very easy and fast to transfer data between
> nodes, and between languages, which will all share the in-memory format
> and therefore the (often expensive) serialisation step is removed.
>
> However, is it true that one of the core objectives of the project is
> also to allow shared memory objects across different languages on the
> same node? For example, a fast C-based ingest system constantly
> populates a pyarrow buffer, which can be read directly by any other
> application on that node, through pointer sharing?
>
> If this is a core objective, what is the canonical way for brokering the
> "pointers" to this data between languages? Is it the Plasma store? And
> if so, are there plans for Plasma to move be implemented in other client
> languages?
>
> In short. Is Plasma (or if not Plasma, the functionality it provides
> implemented some other way), a core objective of the project?
>
> Or instead is Flight supposed to be used between languages on the same
> node, and if so, does Flight provide true zero-copy (ie - the same
> buffer, not copying the buffer) if run between processes on the same node?
>
> Many thanks.


Re: Does Arrow Support Larger-than-Memory Handling?

2020-10-22 Thread Daniel Nugent
The biggest problem with mapped arrow data is that it's only possible with
uncompressed Feather files. Is there ever a possibility that compressed
files could be mappable (I know that you'd have to decompress a given
RecordBatch to actually work with it, but Feather files should be composed
of many RecordBatches, right?)

-Dan Nugent


On Thu, Oct 22, 2020 at 4:49 PM Wes McKinney  wrote:

> I'm not sure where the conflict in what's written online is, but by
> virtue of being designed such that data structures do not require
> memory buffers to be RAM resident (i.e. can reference memory maps), we
> are set up well to process larger-than-memory datasets. In C++ at
> least we are putting the pieces in place to be able to do efficient
> query execution on on-disk datasets, and it may already be possible in
> Rust with DataFusion.
>
> On Thu, Oct 22, 2020 at 2:11 PM Chris Nuernberger 
> wrote:
> >
> > There are ways to handle datasets larger than memory.  mmap'ing one or
> more arrow files and going from there is a pathway forward here:
> >
> > https://techascent.com/blog/memory-mapping-arrow.html
> >
> > How this maps to other software ecosystems I don't know but many have
> mmap support.
> >
> > On Thu, Oct 22, 2020 at 12:47 PM Jacek Pliszka 
> wrote:
> >>
> >> I believe it would be good if you define your use case.
> >>
> >> I do handle larger than memory datasets with pyarrow with the use of
> >> dataset.scan but my use case is very specific as I am repartitioning
> >> and cleaning a bit large datasets.
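
(As a rough sketch of that scan-based pattern, using the pre-3.0 Dataset.scan() API; 
the path and format are assumptions:)

import pyarrow.dataset as ds

dataset = ds.dataset('data/', format='parquet')  # hypothetical partitioned dataset
for task in dataset.scan():          # ScanTasks stream the data in pieces
    for batch in task.execute():     # each piece is a RecordBatch
        ...                          # repartition / clean the batch, then write it out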
> >>
> >> BR,
> >>
> >> Jacek
> >>
> >> czw., 22 paź 2020 o 20:39 Jacob Zelko 
> napisał(a):
> >> >
> >> > Hi all,
> >> >
> >> > Very basic question as I have seen conflicting sources. I come from
> the Julia community and was wondering if Arrow can handle
> larger-than-memory datasets? I saw this post by Wes McKinney here
> discussing that the tooling is being laid down:
> >> >
> >> > Table columns in Arrow C++ can be chunked, so that appending to a
> table is a zero copy operation, requiring no non-trivial computation or
> memory allocation. By designing up front for streaming, chunked tables,
> appending to existing in-memory tables is computationally inexpensive
> relative to pandas now. Designing for chunked or streaming data is also
> essential for implementing out-of-core algorithms, so we are also laying
> the foundation for processing larger-than-memory datasets.
> >> >
> >> > ~ Apache Arrow and the “10 Things I Hate About pandas”
> >> >
> >> > And then in the docs I saw this:
> >> >
> >> > The pyarrow.dataset module provides functionality to efficiently work
> with tabular, potentially larger than memory and multi-file datasets:
> >> >
> >> > A unified interface for different sources: supporting different
> sources and file formats (Parquet, Feather files) and different file
> systems (local, cloud).
> >> > Discovery of sources (crawling directories, handle directory-based
> partitioned datasets, basic schema normalization, ..)
> >> > Optimized reading with predicate pushdown (filtering rows),
> projection (selecting columns), parallel reading or fine-grained managing
> of tasks.
> >> >
> >> > Currently, only Parquet and Feather / Arrow IPC files are supported.
> The goal is to expand this in the future to other file formats and data
> sources (e.g. database connections).
> >> >
> >> > ~ Tabular Datasets
> >> >
> >> > The article from Wes was from 2017 and the snippet on Tabular
> Datasets is from the current documentation for pyarrow.
> >> >
> >> > Could anyone answer this question or at least clear up my confusion
> for me? Thank you!
> >> >
> >> > --
> >> > Jacob Zelko
> >> > Georgia Institute of Technology - Biomedical Engineering B.S. '20
> >> > Corning Community College - Engineering Science A.S. '17
> >> > Cell Number: (607) 846-8947
>


Re:RE: [EXTERNAL] How to understand and use the zero-copy between two processor?

2020-06-04 Thread Daniel Nugent
Sorry, I don’t rightly know what that part means. You can definitely map Arrow 
IPC messages that are on disk into memory in a zero-copy way. It’s just the 
streaming part that I’m not sure about.

-Dan Nugent
On Jun 4, 2020, 08:27 -0400, yunfan , wrote:
> I just wonder what the "zero-copy" means in the arrow document.
> In my understanding, copying memory is also necessary for arrow streaming 
> messaging.
>
> https://arrow.apache.org/
> "It also provides computational libraries and zero-copy streaming messaging 
> and interprocess communication"
>
>
>
>
> -- Original --
> From: "Nugent, Daniel";
> Date: Thu, Jun 4, 2020 11:53 AM
> To: "user@arrow.apache.org";
> Subject: RE: [EXTERNAL] How to understand and use the zero-copy between two 
> processor?
>
> Hi,
>
> I'm not 100% sure I know exactly what you want to achieve here, 
> unfortunately. If the message buffers are being streamed to a shared memory 
> backed file, then you can't use shared memory to continuously read them 
> because the mmap facility provides fixed size shared memory. You could use an 
> out of band signal to indicate that you need to re-map the stream storage 
> file, I guess, but that's not really a stream. You *could* read from the 
> file, but that's going to necessarily copy from the file handle, same as a 
> pipe. If you want to use the plasma object store, that can simplify the 
> process of moving individual RecordBatches of a Table into shared memory to 
> be used between processes. Unfortunately, the plasma store does have the 
> limitation that it currently cannot "adopt" shared memory in any way, so one 
> initial copy into the store is necessary.
>
> To go back to the shared memory + OOB communication: That well may be 
> workable. The read cost for the shared memory backed mapped files will be 
> very low, so concatenating the RecordBatches back into a Table repeatedly may 
> not be a serious issue as long as there aren't *too* many RecordBatches to be 
> processed.
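
A minimal sketch of that shared-memory-plus-re-read pattern: a hypothetical IPC file 
kept under /dev/shm is re-mapped and its RecordBatches concatenated back into a Table.

import pyarrow as pa

source = pa.memory_map('/dev/shm/batches.arrow', 'r')  # hypothetical shared-memory-backed file
reader = pa.ipc.open_file(source)
# Rebuilding the Table is cheap because the mapped pages are already resident.
table = pa.Table.from_batches(
    [reader.get_batch(i) for i in range(reader.num_record_batches)]
)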
>
> Even given all of that, I don't know that Spark has yet implemented their 
> Dataframes as Arrow array backed objects. There cannot be *true* zero copy 
> until that is the case between the two systems.
>
> I hope that helps a little.
>
> -Dan Nugent
>
>
> From: yunfan 
> Sent: Wednesday, June 3, 2020 10:23 PM
> To: user 
> Subject: [EXTERNAL] How to understand and use the zero-copy between two 
> processor?
>
> In my understanding, I can write a file with shared memory and open this 
> shared-memory file in another process.
> But it can't be used in streaming mode. Is there any way to use zero-copy between 
> two processes?
> I find Spark also uses a pipe to transfer Arrow bytes between the Java and Python 
> processes.
>
>
>


Re: 'Plain' Dataset Python API doesn't memory map?

2020-05-03 Thread Daniel Nugent
Thanks Joris. That did the trick.

-Dan Nugent
On Apr 30, 2020, 10:01 -0400, Wes McKinney , wrote:
> For the record, as I've stated elsewhere I'm fairly sure, I don't
> agree with toggling memory mapping at the filesystem level. If a
> filesystem supports memory mapping, then a consumer of the filesystem
> should IMHO be able to request a memory map.
>
> On Thu, Apr 30, 2020 at 2:27 AM Joris Van den Bossche
>  wrote:
> >
> > Hi Dan,
> >
> > Currently, the memory mapping in the Datasets API is controlled by the 
> > filesystem. So to enable memory mapping for feather, you can do:
> >
> > import pyarrow.dataset as ds
> > from pyarrow.fs import LocalFileSystem
> >
> > fs = LocalFileSystem(use_mmap=True)
> > t = ds.dataset('demo', format='feather', filesystem=fs).to_table()
> >
> > Can you try if that is working for you?
> > We should better document this (and there is actually also some discussion 
> > about the best API for this, see 
> > https://issues.apache.org/jira/browse/ARROW-8156, 
> > https://issues.apache.org/jira/browse/ARROW-8307)
> >
> > Joris
> >
> > On Thu, 30 Apr 2020 at 01:58, Daniel Nugent  wrote:
> > >
> > > Hi,
> > >
> > > I'm trying to use the 0.17 dataset API to map in an arrow table in the 
> > > uncompressed feather format (ultimately hoping to work with data larger 
> > > than memory). It seems like it reads all the constituent files into 
> > > memory before creating the Arrow table object though.
> > >
> > > When I use the FeatherDataset API, it does appear to map the files,
> > > and the Table is created based off of the mapped data.
> > >
> > > Any hints at what I'm doing wrong? I didn't see any options relating to 
> > > memory mapping for the general datasets
> > >
> > > Here's the code for the plain dataset api call:
> > >
> > > from pyarrow.dataset import dataset as ds
> > > t = ds('demo', format='feather').to_table()
> > >
> > > Here's the code for reading using the FeatherDataset api:
> > >
> > > from pyarrow.feather import FeatherDataset as ds
> > > from pathlib import Path
> > > t = ds(list(Path('demo').iterdir())).read_table()
> > >
> > > Thanks!
> > >
> > > -Dan Nugent


'Plain' Dataset Python API doesn't memory map?

2020-04-29 Thread Daniel Nugent
Hi,

I'm trying to use the 0.17 dataset API to map in an arrow table in the
uncompressed feather format (ultimately hoping to work with data larger
than memory). It seems like it reads all the constituent files into memory
before creating the Arrow table object though.

When I use the FeatherDataset API, it does appear to map the files, and
the Table is created based off of the mapped data.

Any hints at what I'm doing wrong? I didn't see any options relating to
memory mapping for the general datasets

Here's the code for the plain dataset api call:

from pyarrow.dataset import dataset as ds
t = ds('demo', format='feather').to_table()

Here's the code for reading using the FeatherDataset api:

from pyarrow.feather import FeatherDataset as ds
from pathlib import Path
t = ds(list(Path('demo').iterdir())).read_table()

Thanks!

-Dan Nugent


Arrow Format vs Feather v2

2020-04-23 Thread Daniel Nugent
Was just reading the 0.17 release notes (congratulations to the maintainers, 
btw), and was wondering if there could be some clarification on the language 
about file formats.

The notes mention that the compression support available for Feather 2 will be 
formalized in the Arrow format at a later time.

Does that mean that they will be formalized for in-memory and on the wire Arrow 
messages? Or that there will be another, separate from Feather 2, on-disk 
representation for Arrow called “Arrow file format” or something along those 
lines?

Thanks,

-Dan Nugent


Re: Attn: Wes, Re: Masked Arrays

2020-03-30 Thread Daniel Nugent
Thanks! Since I'm just using it to jump to Arrow, I think I'll stick with
it.

Do you have any feelings about why Numpy's masked arrays didn't gain favor
when many data representation formats explicitly support nullity (including
Arrow)? Is it just that not carrying nulls in computations forward is
preferable (that is, early filtering/value filling was easier)?

-Dan Nugent


On Mon, Mar 30, 2020 at 11:40 AM Wes McKinney  wrote:

> On Mon, Mar 30, 2020 at 8:31 AM Daniel Nugent  wrote:
> >
> > Didn’t want to follow up on this on the Jira issue earlier since it's
> sort of tangential to that bug and more of a usage question. You said:
> >
> > > I wouldn't recommend building applications based on them nowadays
> since the level of support / compatibility in other projects is low.
> >
> > In my case, I am using them since it seemed like a straightforward
> representation of my data that has nulls, the format I’m converting from
> has zero cost numpy representations, and converting from an internal format
> into Arrow in memory structures appears zero cost (or close to it) as well.
> I guess I can just provide the mask as an explicit argument, but my
> original desire to use it came from being able to exploit
> numpy.ma.concatenate in a way that saved some complexity in implementation.
> >
> > Since Arrow itself supports masking values with a bitfield, is there
> something intrinsic to the notion of array masks that is not well
> supported? Or do you just mean the specific numpy MaskedArray class?
> >
>
> I mean just the numpy.ma module. Not many Python computing projects
> nowadays treat MaskedArray objects as first class citizens. Depending
> on what you need it may or may not be a problem. pyarrow supports
> ingesting from MaskedArray as a convenience, but it would not be
> common in my experience for a library's APIs to return MaskedArrays.
>
> > If this is too much of a numpy question rather than an arrow question,
> could you point me to where I can read up on masked array support or maybe
> what the right place to ask the numpy community about whether what I'm
> doing is appropriate or not.
> >
> > Thanks,
> >
> >
> > -Dan Nugent
>


Re: Attn: Wes, Re: Masked Arrays

2020-03-30 Thread Daniel Nugent
Shoot, sorry, there's a typo in there:

> converting from an internal format into Arrow in memory structures
appears zero cos

should be

> converting from numpy arrays into Arrow in memory structures appears zero
cost

-Dan Nugent


On Mon, Mar 30, 2020 at 9:31 AM Daniel Nugent  wrote:

> Didn’t want to follow up on this on the Jira issue earlier since it's sort
> of tangential to that bug and more of a usage question. You said:
>
> > I wouldn't recommend building applications based on them nowadays since
> the level of support / compatibility in other projects is low.
>
> In my case, I am using them since it seemed like a straightforward
> representation of my data that has nulls, the format I’m converting from
> has zero cost numpy representations, and converting from an internal format
> into Arrow in memory structures appears zero cost (or close to it) as well.
> I guess I can just provide the mask as an explicit argument, but my
> original desire to use it came from being able to exploit
> numpy.ma.concatenate in a way that saved some complexity in implementation.
>
> Since Arrow itself supports masking values with a bitfield, is there
> something intrinsic to the notion of array masks that is not well
> supported? Or do you just mean the specific numpy MaskedArray class?
>
> If this is too much of a numpy question rather than an arrow question,
> could you point me to where I can read up on masked array support or maybe
> what the right place to ask the numpy community about whether what I'm
> doing is appropriate or not.
>
> Thanks,
>
>
> -Dan Nugent
>


Attn: Wes, Re: Masked Arrays

2020-03-30 Thread Daniel Nugent
Didn’t want to follow up on this on the Jira issue earlier since it's sort of 
tangential to that bug and more of a usage question. You said:

> I wouldn't recommend building applications based on them nowadays since the 
> level of support / compatibility in other projects is low.

In my case, I am using them since it seemed like a straightforward 
representation of my data that has nulls, the format I’m converting from has 
zero cost numpy representations, and converting from an internal format into 
Arrow in memory structures appears zero cost (or close to it) as well. I guess 
I can just provide the mask as an explicit argument, but my original desire to 
use it came from being able to exploit numpy.ma.concatenate in a way that saved 
some complexity in implementation.
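
(For context, the explicit-mask route mentioned above would look roughly like this — 
a sketch only:)

import numpy as np
import pyarrow as pa

m = np.ma.masked_array([1.0, 2.0, 3.0, 4.0], mask=[False, True, False, False])

# pa.array treats mask=True as null, so this sidesteps MaskedArray support entirely.
arr = pa.array(m.data, mask=np.ma.getmaskarray(m))
# -> [1, null, 3, 4]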

Since Arrow itself supports masking values with a bitfield, is there something 
intrinsic to the notion of array masks that is not well supported? Or do you 
just mean the specific numpy MaskedArray class?

If this is too much of a numpy question rather than an arrow question, could 
you point me to where I can read up on masked array support or maybe what the 
right place to ask the numpy community about whether what I'm doing is 
appropriate or not.

Thanks,


-Dan Nugent


Why are dictionary indices signed?

2020-03-25 Thread Daniel Nugent
Can’t find much in the way of reasoning in the documentation for this. The closest I 
found are some issues about unsigned ints not being implemented a while back.

Has this become a deep dependency or is it still just an accident of 
implementation? It appears that only the signed indices are used in practice.

-Dan Nugent


Re: Access to parquet elements in Row via Rust API

2020-03-18 Thread Daniel Nugent
I believe the Rust API is still nascent. But you can get that by looking 
through the Metadata. It’s a bit nastily nested though. The type descent path 
is File > ParquetMetaData > RowGroupMetaData > ColumnChunkMetaData > 
ColumnDescriptor > 

You are currently required to track the mappings between index and column name 
yourself.

-Dan Nugent
On Mar 18, 2020, 17:49 -0400, Sebastian Fischmeister , 
wrote:
> Hi,
>
> I'm trying to write a simple Rust program that accesses a parquet file, looks 
> for some values, and prints them.
>
> I took the example from [1] and can pick up individual column values in each 
> row by directly addressing them (e.g., record.get_long(147)). However, 
> there are two problems with that:
>
> 1) It's unclear whether 147 is really the column I want to read.
> 2) If there's a change in the provided parquet file, all the absolute numbers 
> may change.
>
> I was hoping that there's a way to address an element through a hashmap as in 
> record.get("foo"). Does the Rust API currently support this?
>
> Thanks,
> Sebastian
>
> [1] https://docs.rs/parquet/0.16.0/parquet/file/index.html


Re: Question about memoryviews and array construction

2020-03-07 Thread Daniel Nugent
Great!

If you could provide a smidgen of guidance about where to start making this 
change, I would be happy to give it a shot.

Thanks,

-Dan Nugent
On Mar 7, 2020, 09:18 -0500, Wes McKinney , wrote:
> hi Dan,
>
> Yes, we should support constructing StringArray directly from
> memoryview as we do with bytes and unicode -- you're the first person
> to ask about this so far. I opened
> https://issues.apache.org/jira/browse/ARROW-8026. This should not be a
> huge amount of work so would be a good first contribution to the
> project
>
> Thanks
>
> Wes
>
> On Fri, Mar 6, 2020 at 8:29 PM Nugent, Daniel  wrote:
> >
> > Hi,
> >
> >
> >
> > I have a short program which I’m wondering about the sensibility of. Could 
> > anyone let me know if this is reasonable or not:
> >
> >
> >
> > >>> import pyarrow as pa, third_party_library
> > >>> memory_views = third_party_library.get_strings()
> > >>> memory_views
> > [<memory at 0x...>, <memory at 0x...>, <memory at 0x7f1745cc0a10>, <memory at 0x...>]
> > >>> pa.array(memory_views, pa.string())
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in <module>
> >   File "pyarrow/array.pxi", line 269, in pyarrow.lib.array
> >   File "pyarrow/array.pxi", line 38, in pyarrow.lib._sequence_to_array
> >   File "pyarrow/error.pxi", line 107, in pyarrow.lib.check_status
> > pyarrow.lib.ArrowTypeError: Expected a string or bytes object, got a 'memoryview' object
> > >>> pa.array(map(bytes, memory_views), pa.string())
> > <pyarrow.lib.StringArray object at 0x...>
> > [
> >   "this",
> >   "is",
> >   "a",
> >   "sample"
> > ]
> >
> >
> >
> > I have a big list of byte sequences being provided to me as memoryviews 
> > from a third party library. I’d like to create an Arrow StringArray from 
> > them as efficiently as possible. Having to map and consequently copy them 
> > through a bytes constructor seems not great (and the memoryview tobytes 
> > function appears to just call the bytes constructor, afaict).
> >
> >
> >
> > To me, it seemed like pa.array should be able to use the memoryview objects 
> > directly in order to construct the StringArray, but it seems like Arrow 
> > wants them copied into fresh byte objects first. I don’t know if I 
> > understand why and was ultimately wondering if it’s a reasonable thing to 
> > desire.
> >
> >
> >
> > Thanks in advance,
> >
> > -Dan Nugent
> >
> >
> >
> >


Mailing List Web Archives not updating

2020-02-24 Thread Daniel Nugent
Is anyone else experiencing this issue? Neither the dev nor the user mailing list 
archive has updated since Sunday, February 16.

I’m not sure exactly who to contact about this.

--
-Dan Nugent


Re: Reading large csv file with pyarrow

2020-02-18 Thread Daniel Nugent
Exposing streaming CSV reads would be useful for ETL processes, independent of 
the datasets API.
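
For what it's worth, later pyarrow releases added an incremental reader along these 
lines; the sketch below assumes pyarrow.csv.open_csv, which was not yet available at 
the time of this thread, and a hypothetical input file.

import pyarrow as pa
import pyarrow.csv as pacsv
import pyarrow.parquet as papq

reader = pacsv.open_csv('big.csv')                 # reads one block at a time
writer = papq.ParquetWriter('big.parquet', reader.schema)
for batch in reader:                               # RecordBatches arrive incrementally
    writer.write_table(pa.Table.from_batches([batch]))
writer.close()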
On Feb 18, 2020, 03:25 -0500, Wes McKinney , wrote:
> Yes, that looks right. There will need to be corresponding work in
> Python to make this available (probably through the datasets API)
>
> On Mon, Feb 17, 2020 at 12:35 PM Daniel Nugent  wrote:
> >
> > Arrow-3410 maybe?
> > On Feb 17, 2020, 07:47 -0500, Wes McKinney , wrote:
> >
> > I seem to recall discussions about 1 chunk-at-a-time reading of CSV
> > files. Such an API is not yet available in Python. This is also
> > required for the C++ Datasets API. If there are not one or more JIRA
> > issues about this I suggest that we open some to capture the use cases
> >
> > On Fri, Feb 14, 2020 at 3:16 PM filippo medri  
> > wrote:
> >
> >
> > Hi,
> > by experimenting with the arrow read_csv function to convert a csv file into 
> > parquet, I found that it reads all the data into memory.
> > On a side the ReadOptions class allows to specify a blocksize parameter to 
> > limit how much bytes to process at a time, but by looking at the memory 
> > usage my understanding is that the underlying Table is filled with all data.
> > Is there a way to at least specify a parameter to limit the read to a batch 
> > of rows? I see that I can skip rows from the beginning, but I am not 
> > finding a way to limit how many rows to read.
> > Which is the intended way to read a csv file that does not fit into memory?
> > Thanks in advance,
> > Filippo Medri


Re: Reading large csv file with pyarrow

2020-02-17 Thread Daniel Nugent
Arrow-3410 maybe?
On Feb 17, 2020, 07:47 -0500, Wes McKinney , wrote:
> I seem to recall discussions about 1 chunk-at-a-time reading of CSV
> files. Such an API is not yet available in Python. This is also
> required for the C++ Datasets API. If there are not one or more JIRA
> issues about this I suggest that we open some to capture the use cases
>
> On Fri, Feb 14, 2020 at 3:16 PM filippo medri  wrote:
> >
> > Hi,
> > by experimenting with the arrow read_csv function to convert a csv file into 
> > parquet, I found that it reads all the data into memory.
> > On a side the ReadOptions class allows to specify a blocksize parameter to 
> > limit how much bytes to process at a time, but by looking at the memory 
> > usage my understanding is that the underlying Table is filled with all data.
> > Is there a way to at least specify a parameter to limit the read to a batch 
> > of rows? I see that I can skip rows from the beginning, but I am not 
> > finding a way to limit how many rows to read.
> > Which is the intended way to read a csv file that does not fit into memory?
> > Thanks in advance,
> > Filippo Medri