[jira] [Created] (ARROW-2805) [Python] TensorFlow import workaround not working with tensorflow-gpu if CUDA is not installed

2018-07-06 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2805:
-

 Summary: [Python] TensorFlow import workaround not working with 
tensorflow-gpu if CUDA is not installed
 Key: ARROW-2805
 URL: https://issues.apache.org/jira/browse/ARROW-2805
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


TensorFlow version: 1.7 (GPU enabled but CUDA is not installed)

tensorflow-gpu was installed via pip install

```
import ray
  File "/home/eric/Desktop/ray-private/python/ray/__init__.py", line 28, in <module>
    import pyarrow  # noqa: F401
  File "/home/eric/Desktop/ray-private/python/ray/pyarrow_files/pyarrow/__init__.py", line 55, in <module>
    compat.import_tensorflow_extension()
  File "/home/eric/Desktop/ray-private/python/ray/pyarrow_files/pyarrow/compat.py", line 193, in import_tensorflow_extension
    ctypes.CDLL(ext)
  File "/usr/lib/python3.5/ctypes/__init__.py", line 347, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcublas.so.9.0: cannot open shared object file: No such file or directory
```
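
The error appears to come from the workaround preloading TensorFlow's shared libraries with ctypes; with tensorflow-gpu installed but no CUDA runtime, the dlopen of libcublas fails and the exception propagates. A minimal sketch of how the preloading loop could tolerate this case (illustrative only, not the actual compat.py code; the directory scan and function name are assumptions):

```python
import ctypes
import glob
import os

def import_tensorflow_extension(tf_lib_dir):
    # Sketch only: try to dlopen every TensorFlow shared library found, but
    # treat a failed load (e.g. libcublas.so.9.0 missing because CUDA is not
    # installed) as "nothing to preload" instead of raising.
    for ext in glob.glob(os.path.join(tf_lib_dir, "*.so")):
        try:
            ctypes.CDLL(ext)
        except OSError:
            # tensorflow-gpu without the CUDA runtime lands here; skip the
            # library and let TensorFlow report its own error on import.
            continue
```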



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2804) [Website] Link to Developer wiki (Confluence) from front page

2018-07-06 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2804:
---

 Summary: [Website] Link to Developer wiki (Confluence) from front 
page
 Key: ARROW-2804
 URL: https://issues.apache.org/jira/browse/ARROW-2804
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Wes McKinney
 Fix For: 0.10.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2803) [C++] Put hashing function into src/arrow/util

2018-07-06 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2803:
-

 Summary: [C++] Put hashing function into src/arrow/util
 Key: ARROW-2803
 URL: https://issues.apache.org/jira/browse/ARROW-2803
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


See [https://github.com/apache/arrow/pull/2220]

We should decide what our default go-to hash function should be (maybe 
murmur3?) and put it into src/arrow/util



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2802) [Docs] Move release management guide to project wiki

2018-07-06 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2802:
---

 Summary: [Docs] Move release management guide to project wiki
 Key: ARROW-2802
 URL: https://issues.apache.org/jira/browse/ARROW-2802
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Wiki
Reporter: Wes McKinney
 Fix For: 0.10.0


I have begun doing this here: 
https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide. I 
think we should remove RELEASE_MANAGEMENT.md and add a note to 
dev/release/README.md pointing to the Confluence page.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2801) [Python] Implement split_row_groups for ParquetDataset

2018-07-06 Thread Robert Gruener (JIRA)
Robert Gruener created ARROW-2801:
-

 Summary: [Python] Implement split_row_groups for ParquetDataset
 Key: ARROW-2801
 URL: https://issues.apache.org/jira/browse/ARROW-2801
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Robert Gruener


Currently the split_row_groups argument in ParquetDataset raises a 
NotImplementedError. An easy and efficient way to implement it is to use the 
summary metadata file instead of opening every file footer.
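
For illustration, a rough sketch of the summary-metadata approach (assuming a Spark/parquet-mr style `_metadata` file at the dataset root; the path and the exact pyarrow metadata accessors shown here are illustrative):

{code}
import pyarrow.parquet as pq

# Read the aggregated footer once instead of opening every data file.
meta = pq.read_metadata('/path/to/dataset/_metadata')

pieces = []
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    # Each row group records which data file it lives in on its column chunks,
    # so a per-row-group split of the dataset can be built from here.
    pieces.append((rg.column(0).file_path, i, rg.num_rows))

for file_path, row_group, num_rows in pieces:
    print(file_path, row_group, num_rows)
{code}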



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Intro to pandas + pyarrow integration?

2018-07-06 Thread Wes McKinney
In case it's interesting, I gave a talk a little over 3 years ago
about this theme ("we all have data frames, but they're all different
inside"): https://www.slideshare.net/wesm/dataframes-the-good-bad-and-ugly.
I mentioned the desire for an "Apache-licensed, community standard
C/C++ data frame that we can all use".

On Fri, Jul 6, 2018 at 1:53 PM, Alex Buchanan  wrote:
> Ok, interesting. Thanks Wes, that does make it clear.
>
>
> For other readers, this github issue is related: 
> https://github.com/apache/arrow/issues/2189#issuecomment-402874836
>
>
>
> On 7/6/18, 10:25 AM, "Wes McKinney"  wrote:
>
>>hi Alex,
>>
>>One of the goals of Apache Arrow is to define an open standard for
>>in-memory columnar data (which may be called "tables" or "data frames"
>>in some domains). Among other things, the Arrow columnar format is
>>optimized for memory efficiency and analytical processing performance
>>on very large (even larger-than-RAM) data sets.
>>
>>The way to think about it is that pandas has its own in-memory
>>representation for columnar data, but it is "proprietary" to pandas.
>>To make use of pandas's analytical facilities, you must convert data
>>to pandas's memory representation. As an example, pandas represents
>>strings as NumPy arrays of Python string objects, which is very
>>wasteful. Uwe Korn recently demonstrated an approach to using Arrow
>>inside pandas, but this would require a lot of work to port algorithms
>>to run against Arrow: https://github.com/xhochy/fletcher
>>
>>We are working to develop the standard data frame type operations as
>>reusable libraries within this project, and these will run natively
>>against the Arrow columnar format. This is a big project; we would
>>love to have you involved with the effort. One of the reasons I have
>>spent so much of my time the last few years on this project is that I
>>believe it is the best path to build a faster, more efficient
>>pandas-like library for data scientists.
>>
>>best,
>>Wes
>>
>>On Fri, Jul 6, 2018 at 1:05 PM, Alex Buchanan  wrote:
>>> Hello all.
>>>
>>> I'm confused about the current level of integration between pandas and 
>>> pyarrow. Am I correct in understanding that currently I'll need to convert 
>>> pyarrow Tables to pandas DataFrames in order to use most of the pandas 
>>> features?  By "pandas features" I mean every day slicing and dicing of 
>>> data: merge, filtering, melt, spread, etc.
>>>
>>> I have a dataframe which starts out from small files (< 1GB) and quickly 
>>> explodes into dozens of gigabytes of memory in a pandas DataFrame. I'm 
>>> interested in whether arrow can provide a better, optimized dataframe.
>>>
>>> Thanks.
>>>


Re: Intro to pandas + pyarrow integration?

2018-07-06 Thread Alex Buchanan
Ok, interesting. Thanks Wes, that does make it clear.


For other readers, this github issue is related: 
https://github.com/apache/arrow/issues/2189#issuecomment-402874836



On 7/6/18, 10:25 AM, "Wes McKinney"  wrote:

>hi Alex,
>
>One of the goals of Apache Arrow is to define an open standard for
>in-memory columnar data (which may be called "tables" or "data frames"
>in some domains). Among other things, the Arrow columnar format is
>optimized for memory efficiency and analytical processing performance
>on very large (even larger-than-RAM) data sets.
>
>The way to think about it is that pandas has its own in-memory
>representation for columnar data, but it is "proprietary" to pandas.
>To make use of pandas's analytical facilities, you must convert data
>to pandas's memory representation. As an example, pandas represents
>strings as NumPy arrays of Python string objects, which is very
>wasteful. Uwe Korn recently demonstrated an approach to using Arrow
>inside pandas, but this would require a lot of work to port algorithms
>to run against Arrow: https://github.com/xhochy/fletcher
>
>We are working to develop the standard data frame type operations as
>reusable libraries within this project, and these will run natively
>against the Arrow columnar format. This is a big project; we would
>love to have you involved with the effort. One of the reasons I have
>spent so much of my time the last few years on this project is that I
>believe it is the best path to build a faster, more efficient
>pandas-like library for data scientists.
>
>best,
>Wes
>
>On Fri, Jul 6, 2018 at 1:05 PM, Alex Buchanan  wrote:
>> Hello all.
>>
>> I'm confused about the current level of integration between pandas and 
>> pyarrow. Am I correct in understanding that currently I'll need to convert 
>> pyarrow Tables to pandas DataFrames in order to use most of the pandas 
>> features?  By "pandas features" I mean every day slicing and dicing of data: 
>> merge, filtering, melt, spread, etc.
>>
>> I have a dataframe which starts out from small files (< 1GB) and quickly 
>> explodes into dozens of gigabytes of memory in a pandas DataFrame. I'm 
>> interested in whether arrow can provide a better, optimized dataframe.
>>
>> Thanks.
>>


Re: Housing longer-term Arrow development, design, and roadmap documents

2018-07-06 Thread Wes McKinney
I've started building out some organization on the Arrow wiki landing
page. I think something we can do to help keep organized is to use a
combination of Component and Label tags in JIRA, then add JIRA filters
to pages related to each subproject. We can see how that goes

As an example, I just created a page to track work on Parquet support in Python:

https://cwiki.apache.org/confluence/display/ARROW/Python+Parquet+Format+Support

As we add more issue labels, they'll show up in the filter.

- Wes

On Fri, Jun 29, 2018 at 6:38 PM, Kouhei Sutou  wrote:
> Hi,
>
>> https://cwiki.apache.org/confluence/display/ARROW
>>
>> If any PMC members would like to be administrators of the space,
>> please let me know your Confluence username. You have to create a
>> separate account (it does not appear to be linked to JIRA accounts)
>
> Can you add me? I've created "kou" account on Confluence.
>
>
> Thanks,
> --
> kou
>
> In 
>   "Re: Housing longer-term Arrow development, design, and roadmap documents" 
> on Tue, 26 Jun 2018 11:27:50 -0400,
>   Wes McKinney  wrote:
>
>> GitHub wiki pages lack collaboration features like commenting. It will
>> be interesting to see what we can work up with JIRA integration, e.g.
>> burndown charts for release management.
>>
>> I asked INFRA to create a Confluence space for us so we can give it a
>> try to see if it works for us. Confluence seems to have gotten a lot
>> nicer since I last used it:
>>
>> https://cwiki.apache.org/confluence/display/ARROW
>>
>> If any PMC members would like to be administrators of the space,
>> please let me know your Confluence username. You have to create a
>> separate account (it does not appear to be linked to JIRA accounts)
>>
>> Thanks
>>
>> On Sun, Jun 24, 2018 at 1:14 PM, Uwe L. Korn  wrote:
>>> Hello,
>>>
>>> I would prefer Confluence over GitHub pages because I would hope that one 
>>> can integrate the ASF JIRA via widgets into the wiki pages. The vast amount 
>>> of issues should all be categorizable into some topic. Once these are 
>>> triaged, they should pop up in the respective wiki pages that could form a 
>>> roadmap. That way, newcomers should get a better start to find the things 
>>> to work on for a certain topic.
>>>
>>> Cheers
>>> Uwe
>>>
>>> On Sun, Jun 24, 2018, at 7:02 PM, Antoine Pitrou wrote:

 Hi Wes,

 I wonder if GitHub wiki pages would be an easier-to-approach alternative?

 Regards

 Antoine.


 On 24/06/2018 at 08:42, Wes McKinney wrote:
 > hi folks,
 >
 > Since the scope of Apache Arrow has grown significantly in the last
 > 2.5 years to encompass many programming languages and new areas of
 > functionality, I'd like to discuss how we could better accommodate
 > longer-term asynchronous discussions and stay organized about the
 > development roadmap.
 >
 > At any given time, there could be 10 or more initiatives ongoing, and
 > the number of concurrent initiatives is likely to continue increasing
 > over time as the community grows larger. Just off the top of my head
 > here's some stuff that's ongoing / up in the air:
 >
 > * Remaining columnar format design questions (interval types, unions, 
 > etc.)
 > * Arrow RPC client/server design (aka "Arrow Flight")
 > * Packaging / deployment / release management
 > * Rust language build out
 > * Go language build out
 > * Code generation / LLVM (Gandiva)
 > * ML/AI framework integration (e.g. with TensorFlow, PyTorch)
 > * Plasma roadmap
 > * Record data types (thread I just opened)
 >
 > With ~500 open issues on JIRA, I have found that newcomers feel a bit
 > overwhelmed when they're trying to find a part of the project to get
 > involved with. Eventually one must sink one's teeth into the JIRA
 > backlog, but I think it would be helpful to have some centralized
 > project organization and roadmap documents to help navigate all of the
 > efforts going on in the project.
 >
 > I don't think documents in the repository are a great solution for
 > this, as they don't facilitate discussions very easily --
 > documentation or Markdown documents (like the columnar format
 > specification) are good to write there when some decisions have been
 > made. Google Documents are great, but they are somewhat ephemeral.
 >
 > I would suggest using the ASF's Confluence wiki for these purposes.
 > The Confluence UI is a bit clunky like other Atlassian products, but
 > the wiki-style model (central landing page + links to subprojects) and
 > collaboration features (comments and discussions on pages) would give
 > us what we need. I suspect that it integrates with JIRA also, which
 > would help with cross-references to particular concrete JIRA items
 > related to subprojects. Here's an example of a Confluence landing page
 > for another ASF project:
 > 

Re: Intro to pandas + pyarrow integration?

2018-07-06 Thread Wes McKinney
hi Alex,

One of the goals of Apache Arrow is to define an open standard for
in-memory columnar data (which may be called "tables" or "data frames"
in some domains). Among other things, the Arrow columnar format is
optimized for memory efficiency and analytical processing performance
on very large (even larger-than-RAM) data sets.

The way to think about it is that pandas has its own in-memory
representation for columnar data, but it is "proprietary" to pandas.
To make use of pandas's analytical facilities, you must convert data
to pandas's memory representation. As an example, pandas represents
strings as NumPy arrays of Python string objects, which is very
wasteful. Uwe Korn recently demonstrated an approach to using Arrow
inside pandas, but this would require a lot of work to port algorithms
to run against Arrow: https://github.com/xhochy/fletcher

We are working to develop the standard data frame type operations as
reusable libraries within this project, and these will run natively
against the Arrow columnar format. This is a big project; we would
love to have you involved with the effort. One of the reasons I have
spent so much of my time the last few years on this project is that I
believe it is the best path to build a faster, more efficient
pandas-like library for data scientists.

best,
Wes

On Fri, Jul 6, 2018 at 1:05 PM, Alex Buchanan  wrote:
> Hello all.
>
> I'm confused about the current level of integration between pandas and 
> pyarrow. Am I correct in understanding that currently I'll need to convert 
> pyarrow Tables to pandas DataFrames in order to use most of the pandas 
> features?  By "pandas features" I mean every day slicing and dicing of data: 
> merge, filtering, melt, spread, etc.
>
> I have a dataframe which starts out from small files (< 1GB) and quickly 
> explodes into dozens of gigabytes of memory in a pandas DataFrame. I'm 
> interested in whether arrow can provide a better, optimized dataframe.
>
> Thanks.
>
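
For readers wanting a concrete picture of the conversion boundary described above, a minimal sketch (column names are made up; assumes pyarrow and pandas are installed):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': ['a', 'b', 'c']})

# Arrow holds the data in its own columnar format...
table = pa.Table.from_pandas(df)

# ...but pandas-style slicing and dicing still requires converting back
# into pandas's memory representation first.
df2 = table.to_pandas()
filtered = df2[df2['x'] > 1.0]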


Intro to pandas + pyarrow integration?

2018-07-06 Thread Alex Buchanan
Hello all.

I'm confused about the current level of integration between pandas and pyarrow. 
Am I correct in understanding that currently I'll need to convert pyarrow 
Tables to pandas DataFrames in order to use most of the pandas features?  By 
"pandas features" I mean every day slicing and dicing of data: merge, 
filtering, melt, spread, etc.

I have a dataframe which starts out from small files (< 1GB) and quickly 
explodes into dozens of gigabytes of memory in a pandas DataFrame. I'm 
interested in whether arrow can provide a better, optimized dataframe.

Thanks.



[jira] [Created] (ARROW-2800) [Python] Unavailable Parquet column statistics from Spark-generated file

2018-07-06 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2800:
---

 Summary: [Python] Unavailable Parquet column statistics from 
Spark-generated file
 Key: ARROW-2800
 URL: https://issues.apache.org/jira/browse/ARROW-2800
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.9.0
Reporter: Robert Gruener
 Fix For: 0.10.0


I have a dataset generated by Spark which shows it has statistics for the 
string column when using the Java parquet-mr code (shown by using 
`parquet-tools meta`); however, reading from pyarrow shows that the statistics 
for that column are not set. I should note the column only has a single value, 
though it still seems like a problem that pyarrow can't recognize it (it can 
recognize statistics set for the long and double types).

See https://github.com/apache/arrow/files/2161147/metadata.zip for file example.

Pyarrow Code To Check Statistics:

{code}
from pyarrow import parquet as pq

meta = pq.read_metadata('/tmp/metadata.parquet')
# No Statistics For String Column, prints false and statistics object is None
print(meta.row_group(0).column(1).is_stats_set)
{code}

Example parquet-meta output:

{code}
file schema: spark_schema 

int: REQUIRED INT64 R:0 D:0
string:  OPTIONAL BINARY O:UTF8 R:0 D:1
float:   REQUIRED DOUBLE R:0 D:0

row group 1: RC:8333 TS:76031 OFFSET:4 

int:  INT64 SNAPPY DO:0 FPO:4 SZ:7793/8181/1.05 VC:8333 
ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 100, num_nulls: 0]
string:   BINARY SNAPPY DO:0 FPO:7797 SZ:1146/1139/0.99 VC:8333 
ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: hello, max: hello, num_nulls: 4192]
float:DOUBLE SNAPPY DO:0 FPO:8943 SZ:66720/66711/1.00 VC:8333 
ENC:PLAIN,BIT_PACKED ST:[min: 0.0057611096964338415, max: 99.99811053829232, 
num_nulls: 0]
{code}

I realize the column only has a single value, though it still seems like pyarrow 
should be able to read the statistics that are set. I raised this here and not as 
a JIRA since I wanted to be sure this is actually an issue and that there wasn't 
a ticket already made (I couldn't find one, but I wanted to be sure). Either way, 
I would like to understand why this happens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[DRAFT] Arrow Board Report

2018-07-06 Thread Wes McKinney
## Description:

Apache Arrow is a cross-language development platform for in-memory data. It
specifies a standardized language-independent columnar memory format for flat
and hierarchical data, organized for efficient analytic operations on modern
hardware. It also provides computational libraries and zero-copy streaming
messaging and interprocess communication. Languages currently supported include
C, C++, Go, Java, JavaScript, Python, Ruby, and Rust.

## Issues:
- There are no issues requiring board attention at this time

## Activity:
- We have not released since March as we work to improve our release and build
automation. We plan to include binary artifacts in our next release vote, whereas
past releases have included only source artifacts.

## Health report:

The project's user and contributor base is growing rapidly. We are struggling a
bit with maintainer bandwidth. As an example, 2 committers have merged 84% of
patches (of nearly 2,000 in total) since the project's inception. We
are discussing ways to grow the maintainer base on the mailing list.

## PMC changes:

 - Currently 23 PMC members.
 - Siddharth Teotia was added to the PMC on Thu May 17 2018

## Committer base changes:

 - Currently 31 committers.
 - No new committers added in the last 3 months
 - Last committer addition was Antoine Pitrou at Tue Apr 03 2018

## Releases:

 - Last release was 0.9.0 on Mon Mar 19 2018

## JIRA activity:

 - 392 JIRA tickets created in the last 3 months
 - 303 JIRA tickets closed/resolved in the last 3 months


Using a shared filesystem abstract API in Arrow Python libraries [was Re: file-system specification]

2018-07-06 Thread Wes McKinney
hi Martin and Antoine,

I apologize I haven't been able to look at this in detail yet. I think
this is a valuable initiative; I created a wiki page so we can begin
to develop a plan to do the work

https://cwiki.apache.org/confluence/display/ARROW/Python+Filesystems+and+Filesystem+API

I added a JIRA filter and tagged a couple of filesystem-related
issues; there are more that should be added to the list. There's a lot
of other work related to filesystem implementations that we can help
organize and plan here.

As for refining the details of the API, we should work out the best
place to collect feedback and discuss. Martin, can you set up a
pull request with the entire patch so that many people can comment and
discuss?

NB: TensorFlow defines a filesystem abstraction, albeit in C++ with
SWIG bindings. We might also look there as a check on some of our
assumptions.

Thank you,
Wes

On Tue, May 15, 2018 at 7:47 AM, Antoine Pitrou  wrote:
>
> Hi Martin,
>
> On Wed, 9 May 2018 11:28:15 -0400
> Martin Durant  wrote:
>> I have sketched out a possible start of a python-wide file-system 
>> specification
>> https://github.com/martindurant/filesystem_spec
>>
>> This came about from my work in some other (remote) file-systems 
>> implementations for python, particularly in the context of Dask. Since arrow 
>> also cares about both local files and, for example, hdfs, I thought that 
>> people on this list may have comments and opinions about a possible standard 
>> that we ought to converge on. I do not think that my suggestions so far are 
>> necessarily right or even good in many cases, but I want to get the 
>> conversation going.
>
> Here are some comments:
>
> - API naming: you seem to favour re-using Unix command-line monickers in
>   some places, while using more regular verbs or names in other
>   places.  I think it should be consistent.  Since the Unix
>   command-line doesn't exactly cover the exposed functionality, and
>   since Unix tends to favour short cryptic names, I think it's better
>   to use Python-like naming (which is also more familiar to non-Unix
>   users). For example "move" or "rename" or "replace" instead of "mv",
>   etc.
>
> - **kwargs parameters: a couple APIs (`mkdir`, `put`...) allow passing
>   arbitrary parameters, which I assume are intended to be
>   backend-specific.  It makes it difficult to add other optional
>   parameters to those APIs in the future.  So I'd make the
>   backend-specific directives a single (optional) dict parameter rather
>   than a **kwargs.
>
> - `invalidate_cache` doesn't state whether it invalidates recursively
>   or not (recursively sounds better intuitively?).  Also, I think it
>   would be more flexible to take a list of paths rather than a single
>   path.
>
> - `du`: the effect of the `deep` parameter isn't obvious to me. I don't
>   know what it would mean *not* to recurse here: what is the size of a
>   directory if you don't recurse into it?
>
> - `glob` may need a formal definition (are trailing slashes
>   significant for directory or symlink resolution? this kind of thing),
>   though you may want to keep edge cases backend-specific.
>
> - are `head` and `tail` at all useful? They can be easily recreated
>   using a generic `open` facility.
>
> - `read_block` tries to do too much in a single API IMHO, and
>   using `open` directly is more flexible anyway.
>
> - if `touch` is intended to emulate the Unix API of the same name, the
>   docstring should state "Create empty file or update last modification
>   timestamp".
>
> - the information dicts returned by several APIs (`ls`, `info`)
>   need standardizing, at least for non backend-specific fields.
>
> - if the backend is a networked filesystem with non-trivial latency,
>   perhaps the operations would deserve being batched (operate on
>   several paths at once), though I will happily defer to your expertise
>   on the topic.
>
> Regards
>
> Antoine.
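
To make the naming and parameter suggestions above concrete, a minimal sketch of the kind of interface being discussed (all class and method names here are illustrative, not part of Martin's spec):

from abc import ABC, abstractmethod

class FileSystem(ABC):
    """Illustrative abstract filesystem base class."""

    @abstractmethod
    def move(self, src, dest):
        """Python-style name rather than the Unix 'mv'."""

    @abstractmethod
    def mkdir(self, path, create_parents=True, backend_options=None):
        """Backend-specific directives go in one optional dict instead of
        **kwargs, so new keyword arguments can be added later."""

    @abstractmethod
    def info(self, path):
        """Return a dict with standardized keys,
        e.g. {'name': ..., 'size': ..., 'type': ...}."""

    @abstractmethod
    def open(self, path, mode='rb'):
        """Generic open(); head/tail/read_block can be built on top of it."""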


Re: bug? pyarrow deserialize_components doesn't work in multiple processes

2018-07-06 Thread Wes McKinney
This seems possibly similar to the issue reported in
https://github.com/apache/arrow/issues/1946 -- we never found a
resolution. Could we open a JIRA to track the problem?

On Fri, Jul 6, 2018 at 3:57 AM, Josh Quigley
 wrote:
> That works. I've tried a bunch of debugging and workarounds; as far as I
> can tell this is just a problem with deserializing from components across
> multiple processes.
>
> On Fri., 6 Jul. 2018, 5:12 pm Robert Nishihara, 
> wrote:
>
>> Can you reproduce it without all of the multiprocessing code? E.g., just
>> call *pyarrow.serialize* in one interpreter. Then copy and paste the bytes
>> into another interpreter and call *pyarrow.deserialize *or
>> *pyarrow.deserialize_components*?
>> On Thu, Jul 5, 2018 at 9:48 PM Josh Quigley <
>> josh.quig...@lifetrading.com.au>
>> wrote:
>>
>> > Attachment inline:
>> >
>> > import pyarrow as pa
>> > import multiprocessing as mp
>> > import numpy as np
>> >
>> > def make_payload():
>> > """Common function - make data to send"""
>> > return ['message', 123, np.random.uniform(-100, 100, (4, 4))]
>> >
>> > def send_payload(payload, connection):
>> > """Common function - serialize & send data through a socket"""
>> > s = pa.serialize(payload)
>> > c = s.to_components()
>> >
>> > # Send
>> > data = c.pop('data')
>> > connection.send(c)
>> > for d in data:
>> > connection.send_bytes(d)
>> > connection.send_bytes(b'')
>> >
>> >
>> > def recv_payload(connection):
>> > """Common function - recv data through a socket & deserialize"""
>> > c = connection.recv()
>> > c['data'] = []
>> > while True:
>> > r = connection.recv_bytes()
>> > if len(r) == 0:
>> > break
>> > c['data'].append(pa.py_buffer(r))
>> >
>> > print('...deserialize')
>> > return pa.deserialize_components(c)
>> >
>> >
>> > def run_same_process():
>> > """Same process: Send data down a socket, then read data from the
>> > matching socket"""
>> > print('run_same_process')
>> > recv_conn,send_conn = mp.Pipe(duplex=False)
>> > payload = make_payload()
>> > print(payload)
>> > send_payload(payload, send_conn)
>> > payload2 = recv_payload(recv_conn)
>> > print(payload2)
>> >
>> >
>> > def receiver(recv_conn):
>> > """Separate process: runs in a different process, recv data &
>> > deserialize"""
>> > print('Receiver started')
>> > payload = recv_payload(recv_conn)
>> > print(payload)
>> >
>> >
>> > def run_separate_process():
>> > """Separate process: launch the child process, then send data"""
>> >
>> >
>> > print('run_separate_process')
>> > recv_conn,send_conn = mp.Pipe(duplex=False)
>> > process = mp.Process(target=receiver, args=(recv_conn,))
>> > process.start()
>> >
>> > payload = make_payload()
>> > print(payload)
>> > send_payload(payload, send_conn)
>> >
>> > process.join()
>> >
>> > if __name__ == '__main__':
>> > run_same_process()
>> > run_separate_process()
>> >
>> >
>> > On Fri, Jul 6, 2018 at 2:42 PM Josh Quigley <
>> > josh.quig...@lifetrading.com.au>
>> > wrote:
>> >
>> > > A reproducible program attached - it first runs serialize/deserialize
>> > from
>> > > the same process, then it does the same work using a separate process
>> for
>> > > the deserialize.
>> > >
>> > > The behaviour seen (after the same-process code executes happily) is
>> > > hanging / child-process crashing during the call to deserialize.
>> > >
>> > > Is this expected, and if not, is there a known workaround?
>> > >
>> > > Running Windows 10, conda distribution,  with package versions listed
>> > > below. I'll also see what happens if I run on *nix.
>> > >
>> > >   - arrow-cpp=0.9.0=py36_vc14_7
>> > >   - boost-cpp=1.66.0=vc14_1
>> > >   - bzip2=1.0.6=vc14_1
>> > >   - hdf5=1.10.2=vc14_0
>> > >   - lzo=2.10=vc14_0
>> > >   - parquet-cpp=1.4.0=vc14_0
>> > >   - snappy=1.1.7=vc14_1
>> > >   - zlib=1.2.11=vc14_0
>> > >   - blas=1.0=mkl
>> > >   - blosc=1.14.3=he51fdeb_0
>> > >   - cython=0.28.3=py36hfa6e2cd_0
>> > >   - icc_rt=2017.0.4=h97af966_0
>> > >   - intel-openmp=2018.0.3=0
>> > >   - numexpr=2.6.5=py36hcd2f87e_0
>> > >   - numpy=1.14.5=py36h9fa60d3_2
>> > >   - numpy-base=1.14.5=py36h5c71026_2
>> > >   - pandas=0.23.1=py36h830ac7b_0
>> > >   - pyarrow=0.9.0=py36hfe5e424_2
>> > >   - pytables=3.4.4=py36he6f6034_0
>> > >   - python=3.6.6=hea74fb7_0
>> > >   - vc=14=h0510ff6_3
>> > >   - vs2015_runtime=14.0.25123=3
>> > >
>> > >
>> >
>>


[jira] [Created] (ARROW-2799) Table.from_pandas silently truncates data, even when passed a schema

2018-07-06 Thread Dave Hirschfeld (JIRA)
Dave Hirschfeld created ARROW-2799:
--

 Summary: Table.from_pandas silently truncates data, even when 
passed a schema
 Key: ARROW-2799
 URL: https://issues.apache.org/jira/browse/ARROW-2799
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.9.0
Reporter: Dave Hirschfeld


Ported over from [https://github.com/apache/arrow/issues/2217]


```python
In [8]: import numpy as np
   ...: import pandas as pd
   ...: import pyarrow as arw

In [9]: df = pd.DataFrame({'A': list('abc'), 'B': np.arange(3)})
   ...: df
Out[9]:
   A  B
0  a  0
1  b  1
2  c  2

In [10]: schema = arw.schema([
...: arw.field('A', arw.string()),
...: arw.field('B', arw.int32()),
...: ])

In [11]: tbl = arw.Table.from_pandas(df, preserve_index=False, schema=schema)
...: tbl
Out[11]:
pyarrow.Table
A: string
B: int32
metadata

{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
b' "A", "field_name": "A", "pandas_type": "unicode", "numpy_type":'
b' "object", "metadata": null}, {"name": "B", "field_name": "B", "'
b'pandas_type": "int32", "numpy_type": "int32", "metadata": null}]'
b', "pandas_version": "0.23.1"}'}

In [12]: tbl.to_pandas().equals(df)
Out[12]: True
```
...so if the `schema` matches the pandas datatypes, all is well - we can 
round-trip the DataFrame.

Now, say we have some bad data such that column 'B' is now of type float64. The 
datatypes of the DataFrame don't match the explicitly supplied `schema` object, 
but rather than raising a `TypeError` the data is silently truncated, and the 
round-tripped DataFrame doesn't match our input DataFrame without even a 
warning being raised!
```python
In [13]: df['B'].iloc[0] = 1.23
...: df
Out[13]:
   A B
0  a  1.23
1  b  1.00
2  c  2.00

In [14]: # I would expect/want this to raise a TypeError since the schema 
doesn't match the pandas datatypes
...: tbl = arw.Table.from_pandas(df, preserve_index=False, schema=schema)
...: tbl
Out[14]:
pyarrow.Table
A: string
B: int32
metadata

{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
b' "A", "field_name": "A", "pandas_type": "unicode", "numpy_type":'
b' "object", "metadata": null}, {"name": "B", "field_name": "B", "'
b'pandas_type": "int32", "numpy_type": "float64", "metadata": null'
b'}], "pandas_version": "0.23.1"}'}

In [15]: tbl.to_pandas()  # <-- SILENT TRUNCATION!!!
Out[15]:
   A  B
0  a  1
1  b  1
2  c  2

```

To be clear, I would really like `Table.from_pandas` to raise a `TypeError` if 
the DataFrame types don't match an explicitly supplied schema, and I would hope 
the current behaviour is considered a bug.
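
Until the library enforces this, a user-side guard is possible. A minimal sketch (not pyarrow behaviour; `table_from_pandas_strict` is a made-up helper that leans on pyarrow's own type inference):

```python
import pyarrow as arw

def table_from_pandas_strict(df, schema, **kwargs):
    """Raise instead of silently casting when the DataFrame's inferred types
    don't match the supplied schema. Sketch only."""
    inferred = arw.Table.from_pandas(df, preserve_index=False).schema
    for field in schema:
        inferred_type = inferred.field_by_name(field.name).type
        if inferred_type != field.type:
            raise TypeError(
                "column {!r} is {} in the DataFrame but {} in the schema"
                .format(field.name, inferred_type, field.type))
    return arw.Table.from_pandas(df, preserve_index=False, schema=schema, **kwargs)
```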




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: bug? pyarrow deserialize_components doesn't work in multiple processes

2018-07-06 Thread Josh Quigley
That works. I've tried a bunch of debugging and workarounds; as far as I
can tell this is just a problem with deserializing from components across
multiple processes.

On Fri., 6 Jul. 2018, 5:12 pm Robert Nishihara, 
wrote:

> Can you reproduce it without all of the multiprocessing code? E.g., just
> call *pyarrow.serialize* in one interpreter. Then copy and paste the bytes
> into another interpreter and call *pyarrow.deserialize *or
> *pyarrow.deserialize_components*?
> On Thu, Jul 5, 2018 at 9:48 PM Josh Quigley <
> josh.quig...@lifetrading.com.au>
> wrote:
>
> > Attachment inline:
> >
> > import pyarrow as pa
> > import multiprocessing as mp
> > import numpy as np
> >
> > def make_payload():
> > """Common function - make data to send"""
> > return ['message', 123, np.random.uniform(-100, 100, (4, 4))]
> >
> > def send_payload(payload, connection):
> > """Common function - serialize & send data through a socket"""
> > s = pa.serialize(payload)
> > c = s.to_components()
> >
> > # Send
> > data = c.pop('data')
> > connection.send(c)
> > for d in data:
> > connection.send_bytes(d)
> > connection.send_bytes(b'')
> >
> >
> > def recv_payload(connection):
> > """Common function - recv data through a socket & deserialize"""
> > c = connection.recv()
> > c['data'] = []
> > while True:
> > r = connection.recv_bytes()
> > if len(r) == 0:
> > break
> > c['data'].append(pa.py_buffer(r))
> >
> > print('...deserialize')
> > return pa.deserialize_components(c)
> >
> >
> > def run_same_process():
> > """Same process: Send data down a socket, then read data from the
> > matching socket"""
> > print('run_same_process')
> > recv_conn,send_conn = mp.Pipe(duplex=False)
> > payload = make_payload()
> > print(payload)
> > send_payload(payload, send_conn)
> > payload2 = recv_payload(recv_conn)
> > print(payload2)
> >
> >
> > def receiver(recv_conn):
> > """Separate process: runs in a different process, recv data &
> > deserialize"""
> > print('Receiver started')
> > payload = recv_payload(recv_conn)
> > print(payload)
> >
> >
> > def run_separate_process():
> > """Separate process: launch the child process, then send data"""
> >
> >
> > print('run_separate_process')
> > recv_conn,send_conn = mp.Pipe(duplex=False)
> > process = mp.Process(target=receiver, args=(recv_conn,))
> > process.start()
> >
> > payload = make_payload()
> > print(payload)
> > send_payload(payload, send_conn)
> >
> > process.join()
> >
> > if __name__ == '__main__':
> > run_same_process()
> > run_separate_process()
> >
> >
> > On Fri, Jul 6, 2018 at 2:42 PM Josh Quigley <
> > josh.quig...@lifetrading.com.au>
> > wrote:
> >
> > > A reproducible program attached - it first runs serialize/deserialize
> > from
> > > the same process, then it does the same work using a separate process
> for
> > > the deserialize.
> > >
> > > The behaviour seen (after the same-process code executes happily) is
> > > hanging / child-process crashing during the call to deserialize.
> > >
> > > Is this expected, and if not, is there a known workaround?
> > >
> > > Running Windows 10, conda distribution,  with package versions listed
> > > below. I'll also see what happens if I run on *nix.
> > >
> > >   - arrow-cpp=0.9.0=py36_vc14_7
> > >   - boost-cpp=1.66.0=vc14_1
> > >   - bzip2=1.0.6=vc14_1
> > >   - hdf5=1.10.2=vc14_0
> > >   - lzo=2.10=vc14_0
> > >   - parquet-cpp=1.4.0=vc14_0
> > >   - snappy=1.1.7=vc14_1
> > >   - zlib=1.2.11=vc14_0
> > >   - blas=1.0=mkl
> > >   - blosc=1.14.3=he51fdeb_0
> > >   - cython=0.28.3=py36hfa6e2cd_0
> > >   - icc_rt=2017.0.4=h97af966_0
> > >   - intel-openmp=2018.0.3=0
> > >   - numexpr=2.6.5=py36hcd2f87e_0
> > >   - numpy=1.14.5=py36h9fa60d3_2
> > >   - numpy-base=1.14.5=py36h5c71026_2
> > >   - pandas=0.23.1=py36h830ac7b_0
> > >   - pyarrow=0.9.0=py36hfe5e424_2
> > >   - pytables=3.4.4=py36he6f6034_0
> > >   - python=3.6.6=hea74fb7_0
> > >   - vc=14=h0510ff6_3
> > >   - vs2015_runtime=14.0.25123=3
> > >
> > >
> >
>


[jira] [Created] (ARROW-2798) [Plasma] Use hashing function that takes into account all UniqueID bytes

2018-07-06 Thread Songqing Zhang (JIRA)
Songqing Zhang created ARROW-2798:
-

 Summary: [Plasma] Use hashing function that takes into account all 
UniqueID bytes
 Key: ARROW-2798
 URL: https://issues.apache.org/jira/browse/ARROW-2798
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Affects Versions: 0.9.0
Reporter: Songqing Zhang


The hashing of UniqueID in Plasma is currently too simple, which has caused a 
problem. In some cases (for example, in github/ray, a UniqueID is composed of a 
taskID and an index), the UniqueIDs may look like "00", 
"ff01", "fff02", and so on. The current hashing 
method only copies the first few bytes of a UniqueID, so most of the hashed IDs 
end up identical. When those hashed IDs are put into the Plasma store, lookups 
become very slow (the store keeps the IDs in an unordered_map, and when many 
keys hash to the same value it degrades to list-like behaviour).

In fact, the same change has already been merged into Ray; see 
[ray-project/ray#2174|https://github.com/ray-project/ray/pull/2174].

I have also compared the performance of the new hashing method with the 
original one by putting lots of objects continuously, and the new method does 
not appear to cost more time.
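
To illustrate the point (Python pseudocode for the idea only; the actual change is in the Plasma C++ sources), compare a prefix-copy hash with one that mixes every byte:

{code}
def prefix_hash(unique_id: bytes) -> int:
    # Roughly the old behaviour: only the first 8 bytes matter, so IDs that
    # share a common prefix (e.g. taskID + index) all collide.
    return int.from_bytes(unique_id[:8], 'little')

def full_hash(unique_id: bytes) -> int:
    # FNV-1a over the whole ID, used here purely as an example of a hash that
    # takes every byte into account; the PR may use a different function.
    h = 0xcbf29ce484222325
    for b in unique_id:
        h = ((h ^ b) * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    return h
{code}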



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: bug? pyarrow deserialize_components doesn't work in multiple processes

2018-07-06 Thread Robert Nishihara
Can you reproduce it without all of the multiprocessing code? E.g., just
call *pyarrow.serialize* in one interpreter. Then copy and paste the bytes
into another interpreter and call *pyarrow.deserialize* or
*pyarrow.deserialize_components*?
On Thu, Jul 5, 2018 at 9:48 PM Josh Quigley 
wrote:

> Attachment inline:
>
> import pyarrow as pa
> import multiprocessing as mp
> import numpy as np
>
> def make_payload():
> """Common function - make data to send"""
> return ['message', 123, np.random.uniform(-100, 100, (4, 4))]
>
> def send_payload(payload, connection):
> """Common function - serialize & send data through a socket"""
> s = pa.serialize(payload)
> c = s.to_components()
>
> # Send
> data = c.pop('data')
> connection.send(c)
> for d in data:
> connection.send_bytes(d)
> connection.send_bytes(b'')
>
>
> def recv_payload(connection):
> """Common function - recv data through a socket & deserialize"""
> c = connection.recv()
> c['data'] = []
> while True:
> r = connection.recv_bytes()
> if len(r) == 0:
> break
> c['data'].append(pa.py_buffer(r))
>
> print('...deserialize')
> return pa.deserialize_components(c)
>
>
> def run_same_process():
> """Same process: Send data down a socket, then read data from the
> matching socket"""
> print('run_same_process')
> recv_conn,send_conn = mp.Pipe(duplex=False)
> payload = make_payload()
> print(payload)
> send_payload(payload, send_conn)
> payload2 = recv_payload(recv_conn)
> print(payload2)
>
>
> def receiver(recv_conn):
> """Separate process: runs in a different process, recv data &
> deserialize"""
> print('Receiver started')
> payload = recv_payload(recv_conn)
> print(payload)
>
>
> def run_separate_process():
> """Separate process: launch the child process, then send data"""
>
>
> print('run_separate_process')
> recv_conn,send_conn = mp.Pipe(duplex=False)
> process = mp.Process(target=receiver, args=(recv_conn,))
> process.start()
>
> payload = make_payload()
> print(payload)
> send_payload(payload, send_conn)
>
> process.join()
>
> if __name__ == '__main__':
> run_same_process()
> run_separate_process()
>
>
> On Fri, Jul 6, 2018 at 2:42 PM Josh Quigley <
> josh.quig...@lifetrading.com.au>
> wrote:
>
> > A reproducible program attached - it first runs serialize/deserialize
> from
> > the same process, then it does the same work using a separate process for
> > the deserialize.
> >
> > The behaviour seen (after the same-process code executes happily) is
> > hanging / child-process crashing during the call to deserialize.
> >
> > Is this expected, and if not, is there a known workaround?
> >
> > Running Windows 10, conda distribution,  with package versions listed
> > below. I'll also see what happens if I run on *nix.
> >
> >   - arrow-cpp=0.9.0=py36_vc14_7
> >   - boost-cpp=1.66.0=vc14_1
> >   - bzip2=1.0.6=vc14_1
> >   - hdf5=1.10.2=vc14_0
> >   - lzo=2.10=vc14_0
> >   - parquet-cpp=1.4.0=vc14_0
> >   - snappy=1.1.7=vc14_1
> >   - zlib=1.2.11=vc14_0
> >   - blas=1.0=mkl
> >   - blosc=1.14.3=he51fdeb_0
> >   - cython=0.28.3=py36hfa6e2cd_0
> >   - icc_rt=2017.0.4=h97af966_0
> >   - intel-openmp=2018.0.3=0
> >   - numexpr=2.6.5=py36hcd2f87e_0
> >   - numpy=1.14.5=py36h9fa60d3_2
> >   - numpy-base=1.14.5=py36h5c71026_2
> >   - pandas=0.23.1=py36h830ac7b_0
> >   - pyarrow=0.9.0=py36hfe5e424_2
> >   - pytables=3.4.4=py36he6f6034_0
> >   - python=3.6.6=hea74fb7_0
> >   - vc=14=h0510ff6_3
> >   - vs2015_runtime=14.0.25123=3
> >
> >
>