Uniform types in Arrow table columns (pyarrow.array) and the case of python dictionaries

2018-01-21 Thread simba nyatsanga
Hi Everyone,

I've got two questions that I'd like help with:

1. Pandas and numpy arrays can handle multiple types in a sequence, e.g. a
float and a string, by using dtype=object. From what I gather, Arrow
arrays enforce a uniform type depending on the type of the first
encountered element in a sequence. This looks like a deliberate choice and
I'd like to get a better understanding of the reason for ensuring this
conformity. Does making the data structure's type deterministic allow for
efficient pointer arithmetic when reading contiguous blocks, and thus make
reads performant?

2. Pandas and numpy can also handle dictionary elements using
dtype=object, while pyarrow arrays can't. I'd like to understand the
reasoning behind the choice here as well.

Thanks again for taking my questions.

Kind Regards
Simba
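
The pointer-arithmetic idea in question 1 can be pictured with a small sketch. This is not how Arrow is implemented; it only uses the Python standard library's array and struct modules as stand-ins for a fixed-width, contiguous column, and the names column, read_at and mixed are made up for illustration.

```python
import struct
from array import array

# A uniformly typed column: every element is a 64-bit float,
# stored back to back in one contiguous buffer.
column = array("d", [1.5, 2.5, 3.5, 4.5])
buf = column.tobytes()
width = column.itemsize  # bytes per element


def read_at(index):
    """Read element `index` by computing its byte offset directly."""
    (value,) = struct.unpack_from("d", buf, index * width)
    return value


# Element 2 lives at byte offset 2 * width; no scan and no
# per-element type check is needed to find it.
third = read_at(2)

# A mixed-type Python list holds references to separately allocated
# objects, so each access follows a reference and inspects the type.
mixed = [1.5, "two", {"three": 3}]
kinds = [type(x).__name__ for x in mixed]
```

With a single known element width, locating any value is one multiplication; with dtype=object-style storage, the container only knows where the references are, not what they point to.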


Re: Exploring the possibility of creating a persistent cache by arrow/plasma

2018-01-21 Thread Philipp Moritz
Note that for the Python bindings, the reference counting is done
automatically, see

https://github.com/apache/arrow/blob/master/python/pyarrow/plasma.pyx#L182

which is e.g. used as the base object for numpy arrays whose memory is
backed by the object store.
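
As a rough picture of what "done automatically" can look like, here is a toy sketch in plain Python. It is not the pyarrow.plasma code linked above; ToyClient, ToyBuffer and ToyArray are hypothetical names, and the sketch only shows the general pattern of a base object that gives back its reference when the last array using it goes away.

```python
import gc
import weakref


class ToyClient:
    """Hypothetical stand-in for a store client that tracks reference counts."""

    def __init__(self):
        self.refcounts = {}

    def get(self, object_id, data):
        # Taking a buffer increments the count for that object.
        self.refcounts[object_id] = self.refcounts.get(object_id, 0) + 1
        return ToyBuffer(self, object_id, data)

    def release(self, object_id):
        self.refcounts[object_id] -= 1


class ToyBuffer:
    """Owns one reference and releases it when the buffer is collected."""

    def __init__(self, client, object_id, data):
        self.data = data
        # weakref.finalize calls client.release(object_id) once this
        # buffer object has been garbage collected.
        self._finalizer = weakref.finalize(self, client.release, object_id)


class ToyArray:
    """Plays the role of an array whose `base` keeps the buffer alive."""

    def __init__(self, base):
        self.base = base
        self.view = memoryview(base.data)


client = ToyClient()
arr = ToyArray(client.get("obj", b"abc"))
held = client.refcounts["obj"]      # count while the array is alive

del arr                             # drop the last user of the buffer
gc.collect()                        # make collection explicit for the sketch
released = client.refcounts["obj"]  # count after the buffer is gone
```

The caller never calls release directly; the release is tied to the lifetime of the base object.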

On Sun, Jan 21, 2018 at 4:21 PM, Robert Nishihara  wrote:

> Evicted objects are gone for good, although it would certainly be possible
> to add the ability to persist them to disk.
>
> The Plasma store does reference counting to figure out which clients are
> using which objects. Clients can "release" objects through the client API
> to decrement the reference count. The Plasma store also keeps track of when
> a client exits/dies and automatically gets rid of the reference counts for
> that client.
>
> On Sun, Jan 21, 2018 at 4:09 PM Mike Sam  wrote:
>
> > Great, thank you very much.
> >
> > > What happens to the evicted objects? Are they
> > > gone for good or are they persisted locally?
> > >
> > > Also, what defines "objects that are not currently in use by any client"?
> > > Reference counting?
> >
> >
> >
> > > On Sat, Jan 20, 2018 at 1:53 PM, Robert Nishihara <robertnishih...@gmail.com> wrote:
> > >
> > > > When Plasma is started up, you specify the total amount of memory it is
> > > > allowed to use (in bytes) with the -m flag.
> > > >
> > > > When a Plasma client attempts to create a new object and there is not
> > > > enough memory in the store, the store will evict a bunch of unused
> > > > objects to free up memory (objects that are not currently in use by any
> > > > client). This is done in a least-recently-used fashion as defined in the
> > > > eviction policy
> > > > https://github.com/apache/arrow/blob/master/cpp/src/plasma/eviction_policy.h.
> > > > In principle, this eviction policy could be made more configurable or a
> > > > different eviction policy could be plugged in, though we haven't
> > > > experimented with that much.
> > > >
> > > > If you want to manually delete an object from Plasma, that can be done
> > > > with the "Delete" command
> > > > https://github.com/apache/arrow/blob/d135974a0d3dd9a9fbbb10da4c5dbc65f9324234/cpp/src/plasma/client.h#L186,
> > > > which is part of the C++ Plasma client API but has not been exposed
> > > > through Python yet.
> > > >
> > > > For now, if you want to make sure that an object will not be evicted
> > > > (e.g., from the C++ Client API), you can call Get on the object ID and
> > > > then it will not be evicted before you call Release from the same client.
> > >
> > > On Fri, Jan 19, 2018 at 5:17 PM Mike Sam  wrote:
> > >
> > > > Thank you, Robert, for your answer.
> > > >
> > > > > Could you kindly elaborate further on number 1, as I am not
> > > > > familiar with the Plasma codebase yet?
> > > > > Are you saying persistence is available out of the box? If not, what
> > > > > specific things need to be added to the Plasma codebase to make this
> > > > > happen?
> > > >
> > > > Thank you,
> > > > Mike
> > > >
> > > >
> > > >
> > > > > On Thu, Jan 18, 2018 at 11:43 PM, Robert Nishihara <robertnishih...@gmail.com> wrote:
> > > > >
> > > > > > Hi Mike,
> > > > > >
> > > > > > 1. I think yes, though we'd need to turn off the automatic LRU
> > > > > > eviction that happens when the store fills up.
> > > > > >
> > > > > > 3. I think there are some edge cases and it depends what is in your
> > > > > > DataFrame, but at least if it consists of numerical data then the
> > > > > > two representations should use the same underlying data in shared
> > > > > > memory.
> > > > > >
> > > > > > On Thu, Jan 18, 2018 at 11:37 PM Mike Sam wrote:
> > > > > >
> > > > > > > I am interested in implementing an Arrow-based persisted cache
> > > > > > > store and I have a few related questions:
> > > > > > >
> > > > > > > 1. Is it possible just to use Plasma for this goal?
> > > > > > >    (My understanding is that it is not persistable.)
> > > > > > >    Else, what is the recommended way to do so?
> > > > > > > 2. Is feather the better file format for persistence to avoid
> > > > > > >    re-transcoding hot chunks?
> > > > > > > 3. When Pandas loads data from plasma/arrow, is it doubling the
> > > > > > >    memory usage? (One for the arrow representation, one for the
> > > > > > >    pandas representation.)
> > > > > >
> > > > > > --
> > > > > > Thanks,
> > > > > > Mike
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks,
> > > > Mike
> > > >
> > >
> >
> >
> >
> > --
> > Thanks,
> > Mike
> >
>


Re: Exploring the possibility of creating a persistent cache by arrow/plasma

2018-01-21 Thread Robert Nishihara
Evicted objects are gone for good, although it would certainly be possible
to add the ability to persist them to disk.

The Plasma store does reference counting to figure out which clients are
using which objects. Clients can "release" objects through the client API
to decrement the reference count. The Plasma store also keeps track of when
a client exits/dies and automatically gets rid of the reference counts for
that client.
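
To make the interaction between reference counts and LRU eviction concrete, here is a toy model in plain Python. It is only a sketch of the behaviour described in this thread, not the Plasma implementation: ToyStore and all of its method names are hypothetical, object sizes are just payload lengths, and capacity plays the role of the -m limit.

```python
from collections import OrderedDict


class ToyStore:
    """Toy memory-capped store with per-client reference counts and LRU eviction."""

    def __init__(self, capacity):
        self.capacity = capacity      # total bytes allowed
        self.used = 0
        self.objects = OrderedDict()  # object_id -> payload, least recent first
        self.refs = {}                # object_id -> {client: count}

    def _in_use(self, object_id):
        return any(n > 0 for n in self.refs.get(object_id, {}).values())

    def create(self, object_id, payload):
        # Evict unused objects, least recently used first, until it fits.
        for victim in list(self.objects):
            if self.used + len(payload) <= self.capacity:
                break
            if not self._in_use(victim):
                self.used -= len(self.objects.pop(victim))  # gone for good
                self.refs.pop(victim, None)
        if self.used + len(payload) > self.capacity:
            raise MemoryError("store is full and every object is in use")
        self.objects[object_id] = payload
        self.used += len(payload)

    def get(self, client, object_id):
        # Getting an object marks it recently used and pins it for the client.
        payload = self.objects[object_id]
        self.objects.move_to_end(object_id)
        counts = self.refs.setdefault(object_id, {})
        counts[client] = counts.get(client, 0) + 1
        return payload

    def release(self, client, object_id):
        self.refs[object_id][client] -= 1

    def disconnect(self, client):
        # A client that exits or dies loses all of its references.
        for counts in self.refs.values():
            counts.pop(client, None)


store = ToyStore(capacity=10)
store.create("a", b"12345")
store.create("b", b"12345")
store.get("client1", "a")        # "a" is now in use by client1
store.create("c", b"12345")      # store is full: unused "b" is evicted
after_first = sorted(store.objects)

store.release("client1", "a")    # "a" is no longer in use
store.create("d", b"12345")      # now "a" is the least recently used victim
after_second = sorted(store.objects)
```

In this model an object that a client has fetched with get stays in the store until that client releases it or disconnects, which mirrors the Get/Release advice earlier in the thread.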



Re: Exploring the possibility of creating a persistent cache by arrow/plasma

2018-01-21 Thread Mike Sam
Great, thank you very much.

What happens to the evicted objects? Are they
gone for good or are they persisted locally?

Also, what defines "objects that are not currently in use by any client"?
Reference counting?






-- 
Thanks,
Mike


[jira] [Created] (ARROW-2016) [Python] Fix up ASV benchmarking setup and document procedure for use

2018-01-21 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2016:
---

 Summary: [Python] Fix up ASV benchmarking setup and document procedure for use
 Key: ARROW-2016
 URL: https://issues.apache.org/jira/browse/ARROW-2016
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.9.0


We need to start writing more microbenchmarks


