[jira] [Created] (ARROW-2805) [Python] TensorFlow import workaround not working with tensorflow-gpu if CUDA is not installed
Philipp Moritz created ARROW-2805: - Summary: [Python] TensorFlow import workaround not working with tensorflow-gpu if CUDA is not installed Key: ARROW-2805 URL: https://issues.apache.org/jira/browse/ARROW-2805 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz

TensorFlow version: 1.7 (GPU enabled but CUDA is not installed); tensorflow-gpu was installed via pip.

```
import ray
  File "/home/eric/Desktop/ray-private/python/ray/__init__.py", line 28, in
    import pyarrow  # noqa: F401
  File "/home/eric/Desktop/ray-private/python/ray/pyarrow_files/pyarrow/__init__.py", line 55, in
    compat.import_tensorflow_extension()
  File "/home/eric/Desktop/ray-private/python/ray/pyarrow_files/pyarrow/compat.py", line 193, in import_tensorflow_extension
    ctypes.CDLL(ext)
  File "/usr/lib/python3.5/ctypes/__init__.py", line 347, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcublas.so.9.0: cannot open shared object file: No such file or directory
```

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
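To make the failure mode concrete: the workaround dlopens TensorFlow's extension libraries eagerly, so a missing transitive dependency (here libcublas) aborts the whole import. A minimal sketch of a tolerant loading pattern is below; `try_load_extensions` is a hypothetical helper for illustration, not pyarrow's actual `import_tensorflow_extension`.

```python
import ctypes

def try_load_extensions(candidates):
    """Attempt to dlopen each candidate shared library, skipping any that
    fail to load (e.g. CUDA-linked libraries on a machine without CUDA).
    Returns the list of libraries that loaded successfully."""
    loaded = []
    for path in candidates:
        try:
            loaded.append(ctypes.CDLL(path))
        except OSError:
            # A missing transitive dependency (such as libcublas.so.9.0)
            # raises OSError; skip the candidate instead of failing import.
            continue
    return loaded

# A library whose dependencies are absent is simply skipped:
libs = try_load_extensions(["libdoes_not_exist_cuda.so.9.0"])
print(len(libs))  # 0
```

The design question is whether swallowing `OSError` here can mask genuine loading problems; the sketch only shows the tolerant end of that trade-off.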
[jira] [Created] (ARROW-2804) [Website] Link to Developer wiki (Confluence) from front page
Wes McKinney created ARROW-2804: --- Summary: [Website] Link to Developer wiki (Confluence) from front page Key: ARROW-2804 URL: https://issues.apache.org/jira/browse/ARROW-2804 Project: Apache Arrow Issue Type: Improvement Components: Website Reporter: Wes McKinney Fix For: 0.10.0
[jira] [Created] (ARROW-2803) [C++] Put hashing function into src/arrow/util
Philipp Moritz created ARROW-2803: - Summary: [C++] Put hashing function into src/arrow/util Key: ARROW-2803 URL: https://issues.apache.org/jira/browse/ARROW-2803 Project: Apache Arrow Issue Type: Improvement Reporter: Philipp Moritz See [https://github.com/apache/arrow/pull/2220] We should decide what our default go-to hash function should be (maybe murmur3?) and put it into src/arrow/util
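For reference, MurmurHash3 (the candidate mentioned above) is short enough to sketch in full. An actual src/arrow/util implementation would be C++; this pure-Python transliteration of the x86 32-bit variant is for illustration only.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Pure-Python MurmurHash3 (x86, 32-bit variant), for illustration."""
    c1, c2 = 0xcc9e2d51, 0x1b873593
    h = seed & 0xffffffff
    nblocks = len(data) // 4
    # Body: process 4-byte blocks.
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & 0xffffffff
        k = ((k << 15) | (k >> 17)) & 0xffffffff   # rotl32(k, 15)
        k = (k * c2) & 0xffffffff
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xffffffff   # rotl32(h, 13)
        h = (h * 5 + 0xe6546b64) & 0xffffffff
    # Tail: up to 3 trailing bytes.
    tail = data[4 * nblocks:]
    k = 0
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xffffffff
        k = ((k << 15) | (k >> 17)) & 0xffffffff
        k = (k * c2) & 0xffffffff
        h ^= k
    # Finalization mix: force avalanche of the final bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85ebca6b) & 0xffffffff
    h ^= h >> 13
    h = (h * 0xc2b2ae35) & 0xffffffff
    h ^= h >> 16
    return h

print(hex(murmur3_32(b"hello")))
```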
[jira] [Created] (ARROW-2802) [Docs] Move release management guide to project wiki
Wes McKinney created ARROW-2802: --- Summary: [Docs] Move release management guide to project wiki Key: ARROW-2802 URL: https://issues.apache.org/jira/browse/ARROW-2802 Project: Apache Arrow Issue Type: Improvement Components: Wiki Reporter: Wes McKinney Fix For: 0.10.0 I have begun doing this here https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide. I think we should remove RELEASE_MANAGEMENT.md and add a note to dev/release/README.md to navigate to the Confluence page
[jira] [Created] (ARROW-2801) [Python] Implement split_row_groups for ParquetDataset
Robert Gruener created ARROW-2801: - Summary: [Python] Implement split_row_groups for ParquetDataset Key: ARROW-2801 URL: https://issues.apache.org/jira/browse/ARROW-2801 Project: Apache Arrow Issue Type: New Feature Components: Python Reporter: Robert Gruener Currently the split_row_groups argument in ParquetDataset raises a not-implemented error. An easy and efficient way to implement this is to use the summary metadata file instead of opening every file footer.
Re: Intro to pandas + pyarrow integration?
In case it's interesting, I gave a talk a little over 3 years ago about this theme ("we all have data frames, but they're all different inside"): https://www.slideshare.net/wesm/dataframes-the-good-bad-and-ugly. I mentioned the desire for a "Apache-licensed, community standard C/C++ data frame that we can all use". On Fri, Jul 6, 2018 at 1:53 PM, Alex Buchanan wrote: > Ok, interesting. Thanks Wes, that does make it clear. > > > For other readers, this github issue is related: > https://github.com/apache/arrow/issues/2189#issuecomment-402874836 > > > > On 7/6/18, 10:25 AM, "Wes McKinney" wrote: > >>hi Alex, >> >>One of the goals of Apache Arrow is to define an open standard for >>in-memory columnar data (which may be called "tables" or "data frames" >>in some domains). Among other things, the Arrow columnar format is >>optimized for memory efficiency and analytical processing performance >>on very large (even larger-than-RAM) data sets. >> >>The way to think about it is that pandas has its own in-memory >>representation for columnar data, but it is "proprietary" to pandas. >>To make use of pandas's analytical facilities, you must convert data >>to pandas's memory representation. As an example, pandas represents >>strings as NumPy arrays of Python string objects, which is very >>wasteful. Uwe Korn recently demonstrated an approach to using Arrow >>inside pandas, but this would require a lot of work to port algorithms >>to run against Arrow: https://github.com/xhochy/fletcher >> >>We are working to develop the standard data frame type operations as >>reusable libraries within this project, and these will run natively >>against the Arrow columnar format. This is a big project; we would >>love to have you involved with the effort. One of the reasons I have >>spent so much of my time the last few years on this project is that I >>believe it is the best path to build a faster, more efficient >>pandas-like library for data scientists. 
>> >>best, >>Wes >> >>On Fri, Jul 6, 2018 at 1:05 PM, Alex Buchanan wrote: >>> Hello all. >>> >>> I'm confused about the current level of integration between pandas and >>> pyarrow. Am I correct in understanding that currently I'll need to convert >>> pyarrow Tables to pandas DataFrames in order to use most of the pandas >>> features? By "pandas features" I mean every day slicing and dicing of >>> data: merge, filtering, melt, spread, etc. >>> >>> I have a dataframe which starts out from small files (< 1GB) and quickly >>> explodes into dozens of gigabytes of memory in a pandas DataFrame. I'm >>> interested in whether arrow can provide a better, optimized dataframe. >>> >>> Thanks. >>>
Re: Intro to pandas + pyarrow integration?
Ok, interesting. Thanks Wes, that does make it clear. For other readers, this github issue is related: https://github.com/apache/arrow/issues/2189#issuecomment-402874836 On 7/6/18, 10:25 AM, "Wes McKinney" wrote: >hi Alex, > >One of the goals of Apache Arrow is to define an open standard for >in-memory columnar data (which may be called "tables" or "data frames" >in some domains). Among other things, the Arrow columnar format is >optimized for memory efficiency and analytical processing performance >on very large (even larger-than-RAM) data sets. > >The way to think about it is that pandas has its own in-memory >representation for columnar data, but it is "proprietary" to pandas. >To make use of pandas's analytical facilities, you must convert data >to pandas's memory representation. As an example, pandas represents >strings as NumPy arrays of Python string objects, which is very >wasteful. Uwe Korn recently demonstrated an approach to using Arrow >inside pandas, but this would require a lot of work to port algorithms >to run against Arrow: https://github.com/xhochy/fletcher > >We are working to develop the standard data frame type operations as >reusable libraries within this project, and these will run natively >against the Arrow columnar format. This is a big project; we would >love to have you involved with the effort. One of the reasons I have >spent so much of my time the last few years on this project is that I >believe it is the best path to build a faster, more efficient >pandas-like library for data scientists. > >best, >Wes > >On Fri, Jul 6, 2018 at 1:05 PM, Alex Buchanan wrote: >> Hello all. >> >> I'm confused about the current level of integration between pandas and >> pyarrow. Am I correct in understanding that currently I'll need to convert >> pyarrow Tables to pandas DataFrames in order to use most of the pandas >> features? By "pandas features" I mean every day slicing and dicing of data: >> merge, filtering, melt, spread, etc. 
>> >> I have a dataframe which starts out from small files (< 1GB) and quickly >> explodes into dozens of gigabytes of memory in a pandas DataFrame. I'm >> interested in whether arrow can provide a better, optimized dataframe. >> >> Thanks. >>
Re: Housing longer-term Arrow development, design, and roadmap documents
I've started building out some organization on the Arrow wiki landing page. I think something we can do to help keep organized is to use a combination of Component and Label tags in JIRA, then add JIRA filters to pages related to each subproject. We can see how that goes. As an example, I just created a page to track work on Parquet support in Python: https://cwiki.apache.org/confluence/display/ARROW/Python+Parquet+Format+Support As we add more issue labels, they'll show up in the filter. - Wes On Fri, Jun 29, 2018 at 6:38 PM, Kouhei Sutou wrote: > Hi, > >> https://cwiki.apache.org/confluence/display/ARROW >> >> If any PMC members would like to be administrators of the space, >> please let me know your Confluence username. You have to create a >> separate account (it does not appear to be linked to JIRA accounts) > > Can you add me? I've created "kou" account on Confluence. > > > Thanks, > -- > kou > > In > "Re: Housing longer-term Arrow development, design, and roadmap documents" > on Tue, 26 Jun 2018 11:27:50 -0400, > Wes McKinney wrote: > >> GitHub wiki pages lack collaboration features like commenting. It will >> be interesting to see what we can work up with JIRA integration, e.g. >> burndown charts for release management. >> >> I asked INFRA to create a Confluence space for us so we can give it a >> try to see if it works for us. Confluence seems to have gotten a lot >> nicer since I last used it: >> >> https://cwiki.apache.org/confluence/display/ARROW >> >> If any PMC members would like to be administrators of the space, >> please let me know your Confluence username. You have to create a >> separate account (it does not appear to be linked to JIRA accounts) >> >> Thanks >> >> On Sun, Jun 24, 2018 at 1:14 PM, Uwe L. Korn wrote: >>> Hello, >>> >>> I would prefer Confluence over GitHub pages because I would hope that one >>> can integrate the ASF JIRA via widgets into the wiki pages. The vast amount >>> of issues should all be categorizable into some topic. 
Once these are >>> triaged, they should pop up in the respective wiki pages that could form a >>> roadmap. That way, newcomers should get a better start to find the things >>> to work on for a certain topic. >>> >>> Cheers >>> Uwe >>> >>> On Sun, Jun 24, 2018, at 7:02 PM, Antoine Pitrou wrote: Hi Wes, I wonder if GitHub wiki pages would be an easier-to-approach alternative? Regards Antoine. Le 24/06/2018 à 08:42, Wes McKinney a écrit : > hi folks, > > Since the scope of Apache Arrow has grown significantly in the last > 2.5 years to encompass many programming languages and new areas of > functionality, I'd like to discuss how we could better accommodate > longer-term asynchronous discussions and stay organized about the > development roadmap. > > At any given time, there could be 10 or more initiatives ongoing, and > the number of concurrent initiatives is likely to continue increasing > over time as the community grows larger. Just off the top of my head > here's some stuff that's ongoing / up in the air: > > * Remaining columnar format design questions (interval types, unions, > etc.) > * Arrow RPC client/server design (aka "Arrow Flight") > * Packaging / deployment / release management > * Rust language build out > * Go language build out > * Code generation / LLVM (Gandiva) > * ML/AI framework integration (e.g. with TensorFlow, PyTorch) > * Plasma roadmap > * Record data types (thread I just opened) > > With ~500 open issues on JIRA, I have found that newcomers feel a bit > overwhelmed when they're trying to find a part of the project to get > involved with. Eventually one must sink one's teeth into the JIRA > backlog, but I think it would be helpful to have some centralized > project organization and roadmap documents to help navigate all of the > efforts going on in the project. 
> > I don't think documents in the repository are a great solution for > this, as they don't facilitate discussions very easily -- > documentation or Markdown documents (like the columnar format > specification) are good to write there when some decisions have been > made. Google Documents are great, but they are somewhat ephemeral. > > I would suggest using the ASF's Confluence wiki for these purposes. > The Confluence UI is a bit clunky like other Atlassian products, but > the wiki-style model (central landing page + links to subprojects) and > collaboration features (comments and discussions on pages) would give > us what we need. I suspect that it integrates with JIRA also, which > would help with cross-references to particular concrete JIRA items > related to subprojects. Here's an example of a Confluence landing page > for another ASF project: >
Re: Intro to pandas + pyarrow integration?
hi Alex, One of the goals of Apache Arrow is to define an open standard for in-memory columnar data (which may be called "tables" or "data frames" in some domains). Among other things, the Arrow columnar format is optimized for memory efficiency and analytical processing performance on very large (even larger-than-RAM) data sets. The way to think about it is that pandas has its own in-memory representation for columnar data, but it is "proprietary" to pandas. To make use of pandas's analytical facilities, you must convert data to pandas's memory representation. As an example, pandas represents strings as NumPy arrays of Python string objects, which is very wasteful. Uwe Korn recently demonstrated an approach to using Arrow inside pandas, but this would require a lot of work to port algorithms to run against Arrow: https://github.com/xhochy/fletcher We are working to develop the standard data frame type operations as reusable libraries within this project, and these will run natively against the Arrow columnar format. This is a big project; we would love to have you involved with the effort. One of the reasons I have spent so much of my time the last few years on this project is that I believe it is the best path to build a faster, more efficient pandas-like library for data scientists. best, Wes On Fri, Jul 6, 2018 at 1:05 PM, Alex Buchanan wrote: > Hello all. > > I'm confused about the current level of integration between pandas and > pyarrow. Am I correct in understanding that currently I'll need to convert > pyarrow Tables to pandas DataFrames in order to use most of the pandas > features? By "pandas features" I mean every day slicing and dicing of data: > merge, filtering, melt, spread, etc. > > I have a dataframe which starts out from small files (< 1GB) and quickly > explodes into dozens of gigabytes of memory in a pandas DataFrame. I'm > interested in whether arrow can provide a better, optimized dataframe. > > Thanks. >
Intro to pandas + pyarrow integration?
Hello all. I'm confused about the current level of integration between pandas and pyarrow. Am I correct in understanding that currently I'll need to convert pyarrow Tables to pandas DataFrames in order to use most of the pandas features? By "pandas features" I mean every day slicing and dicing of data: merge, filtering, melt, spread, etc. I have a dataframe which starts out from small files (< 1GB) and quickly explodes into dozens of gigabytes of memory in a pandas DataFrame. I'm interested in whether arrow can provide a better, optimized dataframe. Thanks.
[jira] [Created] (ARROW-2800) [Python] Unavailable Parquet column statistics from Spark-generated file
Wes McKinney created ARROW-2800: --- Summary: [Python] Unavailable Parquet column statistics from Spark-generated file Key: ARROW-2800 URL: https://issues.apache.org/jira/browse/ARROW-2800 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.9.0 Reporter: Robert Gruener Fix For: 0.10.0

I have a dataset generated by Spark which shows it has statistics for the string column when using the Java parquet-mr code (shown by using `parquet-tools meta`), however reading from pyarrow shows that the statistics for that column are not set. I should note the column only has a single value, though it still seems like a problem that pyarrow can't recognize it (it can recognize statistics set for the long and double types). See https://github.com/apache/arrow/files/2161147/metadata.zip for an example file.

Pyarrow code to check statistics:

{code}
from pyarrow import parquet as pq

meta = pq.read_metadata('/tmp/metadata.parquet')
# No statistics for the string column: prints False and the statistics object is None
print(meta.row_group(0).column(1).is_stats_set)
{code}

Example parquet-tools meta output:

{code}
file schema: spark_schema
int:    REQUIRED INT64 R:0 D:0
string: OPTIONAL BINARY O:UTF8 R:0 D:1
float:  REQUIRED DOUBLE R:0 D:0

row group 1: RC:8333 TS:76031 OFFSET:4
int:    INT64 SNAPPY DO:0 FPO:4 SZ:7793/8181/1.05 VC:8333 ENC:PLAIN_DICTIONARY,BIT_PACKED ST:[min: 0, max: 100, num_nulls: 0]
string: BINARY SNAPPY DO:0 FPO:7797 SZ:1146/1139/0.99 VC:8333 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ST:[min: hello, max: hello, num_nulls: 4192]
float:  DOUBLE SNAPPY DO:0 FPO:8943 SZ:66720/66711/1.00 VC:8333 ENC:PLAIN,BIT_PACKED ST:[min: 0.0057611096964338415, max: 99.99811053829232, num_nulls: 0]
{code}

(This was originally filed as a GitHub issue to confirm it is actually a bug and that no JIRA ticket already existed; I couldn't find one but wanted to be sure.) Either way I would like to understand why this is.
[DRAFT] Arrow Board Report
## Description:

Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, Go, Java, JavaScript, Python, Ruby, and Rust.

## Issues:

- There are no issues requiring board attention at this time

## Activity:

- We have not released since March as we work to improve our release and build automation. We plan to include binary artifacts in our next release vote, where we have only had source artifacts in past releases.

## Health report:

The project's user and contributor base is growing rapidly. We are struggling a bit with maintainer bandwidth. As an example, 2 committers have merged 84% of patches (of which there have been nearly 2000) since the project's inception. We are discussing ways to grow the maintainer base on the mailing list.

## PMC changes:

- Currently 23 PMC members.
- Siddharth Teotia was added to the PMC on Thu May 17 2018

## Committer base changes:

- Currently 31 committers.
- No new committers added in the last 3 months
- Last committer addition was Antoine Pitrou at Tue Apr 03 2018

## Releases:

- Last release was 0.9.0 on Mon Mar 19 2018

## JIRA activity:

- 392 JIRA tickets created in the last 3 months
- 303 JIRA tickets closed/resolved in the last 3 months
Using a shared filesystem abstract API in Arrow Python libraries [was Re: file-system specification]
hi Martin and Antoine, I apologize I haven't been able to look at this in detail yet. I think this is a valuable initiative; I created a wiki page so we can begin to develop a plan to do the work https://cwiki.apache.org/confluence/display/ARROW/Python+Filesystems+and+Filesystem+API I added a JIRA filter and tagged a couple of filesystem-related issues; there are more that should be added to the list. There's a lot of other work related to filesystem implementations that we can help organize and plan here. As far as refining the details of the API, we should see what's the best place to collect feedback and discuss. Martin, can you set up a pull request with the entire patch so that many people can comment and discuss? NB: TensorFlow defines a filesystem abstraction, albeit in C++ with SWIG bindings. We might also look there as a check on some of our assumptions. Thank you, Wes On Tue, May 15, 2018 at 7:47 AM, Antoine Pitrou wrote: > > Hi Martin, > > On Wed, 9 May 2018 11:28:15 -0400 > Martin Durant wrote: >> I have sketched out a possible start of a python-wide file-system >> specification >> https://github.com/martindurant/filesystem_spec >> >> This came about from my work in some other (remote) file-systems >> implementations for python, particularly in the context of Dask. Since arrow >> also cares about both local files and, for example, hdfs, I thought that >> people on this list may have comments and opinions about a possible standard >> that we ought to converge on. I do not think that my suggestions so far are >> necessarily right or even good in many cases, but I want to get the >> conversation going. > > Here are some comments: > > - API naming: you seem to favour re-using Unix command-line monickers in > some places, while using more regular verbs or names in other > places. I think it should be consistent. 
Since the Unix > command-line doesn't exactly cover the exposed functionality, and > since Unix tends to favour short cryptic names, I think it's better > to use Python-like naming (which is also more familiar to non-Unix > users). For example "move" or "rename" or "replace" instead of "mv", > etc. > > - **kwargs parameters: a couple APIs (`mkdir`, `put`...) allow passing > arbitrary parameters, which I assume are intended to be > backend-specific. It makes it difficult to add other optional > parameters to those APIs in the future. So I'd make the > backend-specific directives a single (optional) dict parameter rather > than a **kwargs. > > - `invalidate_cache` doesn't state whether it invalidates recursively > or not (recursively sounds better intuitively?). Also, I think it > would be more flexible to take a list of paths rather than a single > path. > > - `du`: the effect of the `deep` parameter isn't obvious to me. I don't > know what it would mean *not* to recurse here: what is the size of a > directory if you don't recurse into it? > > - `glob` may need a formal definition (are trailing slashes > significant for directory or symlink resolution? this kind of thing), > though you may want to keep edge cases backend-specific. > > - are `head` and `tail` at all useful? They can be easily recreated > using a generic `open` facility. > > - `read_block` tries to do too much in a single API IMHO, and > using `open` directly is more flexible anyway. > > - if `touch` is intended to emulate the Unix API of the same name, the > docstring should state "Create empty file or update last modification > timestamp". > > - the information dicts returned by several APIs (`ls`, `info`) > need standardizing, at least for non backend-specific fields. > > - if the backend is a networked filesystem with non-trivial latency, > perhaps the operations would deserve being batched (operate on > several paths at once), though I will happily defer to your expertise > on the topic. 
> > Regards > > Antoine.
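A minimal sketch of the kind of Python-wide abstraction under discussion, incorporating two of Antoine's points: Python-style names ("move", "make_dir") rather than Unix monikers, and a single optional `options` dict instead of `**kwargs` for backend-specific directives. All class and method names here are illustrative, not part of any agreed specification.

```python
import os
import shutil
from abc import ABC, abstractmethod

class FileSystem(ABC):
    """Illustrative abstract filesystem API (not an agreed standard)."""

    @abstractmethod
    def ls(self, path):
        """Return a list of info dicts with standardized fields."""

    @abstractmethod
    def make_dir(self, path, options=None):
        """Create a directory; backend-specific directives go in `options`."""

    @abstractmethod
    def move(self, src, dest):
        """Rename/move a file ('move', not 'mv', per the naming feedback)."""

    @abstractmethod
    def open(self, path, mode="rb"):
        """Open a file; head/tail/read_block can be built on top of this."""

class LocalFileSystem(FileSystem):
    """Local backend; remote backends (HDFS, S3, ...) would subclass too."""

    def ls(self, path):
        # Standardized info fields: name, size, type.
        out = []
        for n in sorted(os.listdir(path)):
            full = os.path.join(path, n)
            out.append({"name": full,
                        "size": os.path.getsize(full),
                        "type": "directory" if os.path.isdir(full) else "file"})
        return out

    def make_dir(self, path, options=None):
        os.makedirs(path, exist_ok=True)

    def move(self, src, dest):
        shutil.move(src, dest)

    def open(self, path, mode="rb"):
        return open(path, mode)

# Usage sketch:
import tempfile
fs = LocalFileSystem()
root = tempfile.mkdtemp()
fs.make_dir(os.path.join(root, "sub"))
with fs.open(os.path.join(root, "sub", "a.txt"), "wb") as f:
    f.write(b"hello")
infos = fs.ls(os.path.join(root, "sub"))
print(infos[0]["type"], infos[0]["size"])  # file 5
```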
Re: bug? pyarrow deserialize_components doesn't work in multiple processes
This seems possibly similar to the issue reported in https://github.com/apache/arrow/issues/1946 -- we never found a resolution. Could we open a JIRA to track the problem? On Fri, Jul 6, 2018 at 3:57 AM, Josh Quigley wrote: > That works. I've tried a bunch of debugging and work arounds- as far as I > can tell this is just a problem with deserializr from components and > multiprocess. > > On Fri., 6 Jul. 2018, 5:12 pm Robert Nishihara, > wrote: > >> Can you reproduce it without all of the multiprocessing code? E.g., just >> call *pyarrow.serialize* in one interpreter. Then copy and paste the bytes >> into another interpreter and call *pyarrow.deserialize *or >> *pyarrow.deserialize_components*? >> On Thu, Jul 5, 2018 at 9:48 PM Josh Quigley < >> josh.quig...@lifetrading.com.au> >> wrote: >> >> > Attachment inline: >> > >> > import pyarrow as pa >> > import multiprocessing as mp >> > import numpy as np >> > >> > def make_payload(): >> > """Common function - make data to send""" >> > return ['message', 123, np.random.uniform(-100, 100, (4, 4))] >> > >> > def send_payload(payload, connection): >> > """Common function - serialize & send data through a socket""" >> > s = pa.serialize(payload) >> > c = s.to_components() >> > >> > # Send >> > data = c.pop('data') >> > connection.send(c) >> > for d in data: >> > connection.send_bytes(d) >> > connection.send_bytes(b'') >> > >> > >> > def recv_payload(connection): >> > """Common function - recv data through a socket & deserialize""" >> > c = connection.recv() >> > c['data'] = [] >> > while True: >> > r = connection.recv_bytes() >> > if len(r) == 0: >> > break >> > c['data'].append(pa.py_buffer(r)) >> > >> > print('...deserialize') >> > return pa.deserialize_components(c) >> > >> > >> > def run_same_process(): >> > """Same process: Send data down a socket, then read data from the >> > matching socket""" >> > print('run_same_process') >> > recv_conn,send_conn = mp.Pipe(duplex=False) >> > payload = make_payload() >> > 
print(payload) >> > send_payload(payload, send_conn) >> > payload2 = recv_payload(recv_conn) >> > print(payload2) >> > >> > >> > def receiver(recv_conn): >> > """Separate process: runs in a different process, recv data & >> > deserialize""" >> > print('Receiver started') >> > payload = recv_payload(recv_conn) >> > print(payload) >> > >> > >> > def run_separate_process(): >> > """Separate process: launch the child process, then send data""" >> > >> > >> > print('run_separate_process') >> > recv_conn,send_conn = mp.Pipe(duplex=False) >> > process = mp.Process(target=receiver, args=(recv_conn,)) >> > process.start() >> > >> > payload = make_payload() >> > print(payload) >> > send_payload(payload, send_conn) >> > >> > process.join() >> > >> > if __name__ == '__main__': >> > run_same_process() >> > run_separate_process() >> > >> > >> > On Fri, Jul 6, 2018 at 2:42 PM Josh Quigley < >> > josh.quig...@lifetrading.com.au> >> > wrote: >> > >> > > A reproducible program attached - it first runs serialize/deserialize >> > from >> > > the same process, then it does the same work using a separate process >> for >> > > the deserialize. >> > > >> > > The behaviour see is (after the same process code executes happily) is >> > > hanging / child-process crashing during the call to deserialize. >> > > >> > > Is this expected, and if not, is there a known workaround? >> > > >> > > Running Windows 10, conda distribution, with package versions listed >> > > below. I'll also see what happens if I run on *nix. 
>> > > >> > > - arrow-cpp=0.9.0=py36_vc14_7 >> > > - boost-cpp=1.66.0=vc14_1 >> > > - bzip2=1.0.6=vc14_1 >> > > - hdf5=1.10.2=vc14_0 >> > > - lzo=2.10=vc14_0 >> > > - parquet-cpp=1.4.0=vc14_0 >> > > - snappy=1.1.7=vc14_1 >> > > - zlib=1.2.11=vc14_0 >> > > - blas=1.0=mkl >> > > - blosc=1.14.3=he51fdeb_0 >> > > - cython=0.28.3=py36hfa6e2cd_0 >> > > - icc_rt=2017.0.4=h97af966_0 >> > > - intel-openmp=2018.0.3=0 >> > > - numexpr=2.6.5=py36hcd2f87e_0 >> > > - numpy=1.14.5=py36h9fa60d3_2 >> > > - numpy-base=1.14.5=py36h5c71026_2 >> > > - pandas=0.23.1=py36h830ac7b_0 >> > > - pyarrow=0.9.0=py36hfe5e424_2 >> > > - pytables=3.4.4=py36he6f6034_0 >> > > - python=3.6.6=hea74fb7_0 >> > > - vc=14=h0510ff6_3 >> > > - vs2015_runtime=14.0.25123=3 >> > > >> > > >> > >>
[jira] [Created] (ARROW-2799) Table.from_pandas silently truncates data, even when passed a schema
Dave Hirschfeld created ARROW-2799: -- Summary: Table.from_pandas silently truncates data, even when passed a schema Key: ARROW-2799 URL: https://issues.apache.org/jira/browse/ARROW-2799 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.9.0 Reporter: Dave Hirschfeld

Ported over from [https://github.com/apache/arrow/issues/2217]

```python
In [8]: import numpy as np
   ...: import pandas as pd
   ...: import pyarrow as arw

In [9]: df = pd.DataFrame({'A': list('abc'), 'B': np.arange(3)})
   ...: df
Out[9]:
   A  B
0  a  0
1  b  1
2  c  2

In [10]: schema = arw.schema([
    ...:     arw.field('A', arw.string()),
    ...:     arw.field('B', arw.int32()),
    ...: ])

In [11]: tbl = arw.Table.from_pandas(df, preserve_index=False, schema=schema)
    ...: tbl
Out[11]:
pyarrow.Table
A: string
B: int32
metadata
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
            b' "A", "field_name": "A", "pandas_type": "unicode", "numpy_type":'
            b' "object", "metadata": null}, {"name": "B", "field_name": "B", "'
            b'pandas_type": "int32", "numpy_type": "int32", "metadata": null}]'
            b', "pandas_version": "0.23.1"}'}

In [12]: tbl.to_pandas().equals(df)
Out[12]: True
```

...so if the `schema` matches the pandas datatypes all is well - we can roundtrip the DataFrame.

Now, say we have some bad data such that column 'B' is now of type float64. The datatypes of the DataFrame don't match the explicitly supplied `schema` object, but rather than raising a `TypeError` the data is silently truncated and the roundtrip DataFrame doesn't match our input DataFrame, without even a warning raised!

```python
In [13]: df['B'].iloc[0] = 1.23
    ...: df
Out[13]:
   A     B
0  a  1.23
1  b  1.00
2  c  2.00

In [14]: # I would expect/want this to raise a TypeError since the schema doesn't match the pandas datatypes
    ...: tbl = arw.Table.from_pandas(df, preserve_index=False, schema=schema)
    ...: tbl
Out[14]:
pyarrow.Table
A: string
B: int32
metadata
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name":'
            b' "A", "field_name": "A", "pandas_type": "unicode", "numpy_type":'
            b' "object", "metadata": null}, {"name": "B", "field_name": "B", "'
            b'pandas_type": "int32", "numpy_type": "float64", "metadata": null'
            b'}], "pandas_version": "0.23.1"}'}

In [15]: tbl.to_pandas()  # <-- SILENT TRUNCATION!!!
Out[15]:
   A  B
0  a  1
1  b  1
2  c  2
```

To be clear, I would really like `Table.from_pandas` to raise a `TypeError` if the DataFrame types don't match an explicitly supplied schema, and would hope this current behaviour would be considered a bug.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
Re: bug? pyarrow deserialize_components doesn't work in multiple processes
That works. I've tried a bunch of debugging and workarounds; as far as I can tell this is just a problem with deserialize-from-components and multiprocessing. On Fri., 6 Jul. 2018, 5:12 pm Robert Nishihara, wrote: > Can you reproduce it without all of the multiprocessing code? E.g., just > call *pyarrow.serialize* in one interpreter. Then copy and paste the bytes > into another interpreter and call *pyarrow.deserialize *or > *pyarrow.deserialize_components*? > On Thu, Jul 5, 2018 at 9:48 PM Josh Quigley < > josh.quig...@lifetrading.com.au> > wrote: > > > Attachment inline: > > > > import pyarrow as pa > > import multiprocessing as mp > > import numpy as np > > > > def make_payload(): > > """Common function - make data to send""" > > return ['message', 123, np.random.uniform(-100, 100, (4, 4))] > > > > def send_payload(payload, connection): > > """Common function - serialize & send data through a socket""" > > s = pa.serialize(payload) > > c = s.to_components() > > > > # Send > > data = c.pop('data') > > connection.send(c) > > for d in data: > > connection.send_bytes(d) > > connection.send_bytes(b'') > > > > > > def recv_payload(connection): > > """Common function - recv data through a socket & deserialize""" > > c = connection.recv() > > c['data'] = [] > > while True: > > r = connection.recv_bytes() > > if len(r) == 0: > > break > > c['data'].append(pa.py_buffer(r)) > > > > print('...deserialize') > > return pa.deserialize_components(c) > > > > > > def run_same_process(): > > """Same process: Send data down a socket, then read data from the > > matching socket""" > > print('run_same_process') > > recv_conn,send_conn = mp.Pipe(duplex=False) > > payload = make_payload() > > print(payload) > > send_payload(payload, send_conn) > > payload2 = recv_payload(recv_conn) > > print(payload2) > > > > > > def receiver(recv_conn): > > """Separate process: runs in a different process, recv data & > > deserialize""" > > print('Receiver started') > > payload = recv_payload(recv_conn) 
> > print(payload) > > > > > > def run_separate_process(): > > """Separate process: launch the child process, then send data""" > > > > > > print('run_separate_process') > > recv_conn,send_conn = mp.Pipe(duplex=False) > > process = mp.Process(target=receiver, args=(recv_conn,)) > > process.start() > > > > payload = make_payload() > > print(payload) > > send_payload(payload, send_conn) > > > > process.join() > > > > if __name__ == '__main__': > > run_same_process() > > run_separate_process() > > > > > > On Fri, Jul 6, 2018 at 2:42 PM Josh Quigley < > > josh.quig...@lifetrading.com.au> > > wrote: > > > > > A reproducible program attached - it first runs serialize/deserialize > > from > > > the same process, then it does the same work using a separate process > for > > > the deserialize. > > > > > > The behaviour see is (after the same process code executes happily) is > > > hanging / child-process crashing during the call to deserialize. > > > > > > Is this expected, and if not, is there a known workaround? > > > > > > Running Windows 10, conda distribution, with package versions listed > > > below. I'll also see what happens if I run on *nix. > > > > > > - arrow-cpp=0.9.0=py36_vc14_7 > > > - boost-cpp=1.66.0=vc14_1 > > > - bzip2=1.0.6=vc14_1 > > > - hdf5=1.10.2=vc14_0 > > > - lzo=2.10=vc14_0 > > > - parquet-cpp=1.4.0=vc14_0 > > > - snappy=1.1.7=vc14_1 > > > - zlib=1.2.11=vc14_0 > > > - blas=1.0=mkl > > > - blosc=1.14.3=he51fdeb_0 > > > - cython=0.28.3=py36hfa6e2cd_0 > > > - icc_rt=2017.0.4=h97af966_0 > > > - intel-openmp=2018.0.3=0 > > > - numexpr=2.6.5=py36hcd2f87e_0 > > > - numpy=1.14.5=py36h9fa60d3_2 > > > - numpy-base=1.14.5=py36h5c71026_2 > > > - pandas=0.23.1=py36h830ac7b_0 > > > - pyarrow=0.9.0=py36hfe5e424_2 > > > - pytables=3.4.4=py36he6f6034_0 > > > - python=3.6.6=hea74fb7_0 > > > - vc=14=h0510ff6_3 > > > - vs2015_runtime=14.0.25123=3 > > > > > > > > >
[jira] [Created] (ARROW-2798) [Plasma] Use hashing function that takes into account all UniqueID bytes
Songqing Zhang created ARROW-2798: - Summary: [Plasma] Use hashing function that takes into account all UniqueID bytes Key: ARROW-2798 URL: https://issues.apache.org/jira/browse/ARROW-2798 Project: Apache Arrow Issue Type: Improvement Components: Plasma (C++) Affects Versions: 0.9.0 Reporter: Songqing Zhang Currently, the hashing of UniqueID in Plasma is too simple, and this causes a problem. In some cases (for example in github/ray, where a UniqueID is composed of a task ID and an index), the UniqueIDs may look like "00", "ff01", "fff02", and so on. The current hashing method simply copies the first few bytes of a UniqueID, so most of the hashed IDs end up identical. When these hashed IDs are put into the Plasma store, lookups become very slow: the Plasma store keeps the IDs in an unordered_map, and when many keys hash to the same value the lookup degenerates into a linear scan, as in a list. In fact, the same fix has already been merged into Ray, see [ray-project/ray#2174|https://github.com/ray-project/ray/pull/2174]. I have also compared the performance of the new hashing method against the original one by putting lots of objects continuously; the new method does not appear to cost more time. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
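To illustrate why a hash that consumes every byte avoids the collision pathology described above, here is a minimal pure-Python sketch using FNV-1a. FNV-1a is chosen purely for illustration; it is an assumption of this sketch, not necessarily the function adopted by the Plasma fix (the related ARROW-2803 discussion mentions murmur3 as a candidate):

```python
def fnv1a_64(data: bytes) -> int:
    # FNV-1a mixes every input byte into the hash state, so IDs that
    # share a long common prefix still hash to different values.
    h = 0xcbf29ce484222325          # FNV-1a 64-bit offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF  # FNV prime, mod 2**64
    return h

# 20-byte IDs that differ only in their final byte: a worst case for a
# "copy the first few bytes" hash, which would map them all to one bucket.
ids = [bytes(19) + bytes([i]) for i in range(256)]
hashes = {fnv1a_64(uid) for uid in ids}
```

Because FNV-1a folds the final byte into the state before the last multiplication, all 256 of these IDs receive distinct hashes, whereas a prefix-copying hash would produce a single value for all of them.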
Re: bug? pyarrow deserialize_components doesn't work in multiple processes
Can you reproduce it without all of the multiprocessing code? E.g., just call *pyarrow.serialize* in one interpreter. Then copy and paste the bytes into another interpreter and call *pyarrow.deserialize* or *pyarrow.deserialize_components*?

On Thu, Jul 5, 2018 at 9:48 PM Josh Quigley wrote:
> Attachment inline:
>
> import pyarrow as pa
> import multiprocessing as mp
> import numpy as np
>
> def make_payload():
>     """Common function - make data to send"""
>     return ['message', 123, np.random.uniform(-100, 100, (4, 4))]
>
> def send_payload(payload, connection):
>     """Common function - serialize & send data through a socket"""
>     s = pa.serialize(payload)
>     c = s.to_components()
>
>     # Send
>     data = c.pop('data')
>     connection.send(c)
>     for d in data:
>         connection.send_bytes(d)
>     connection.send_bytes(b'')
>
> def recv_payload(connection):
>     """Common function - recv data through a socket & deserialize"""
>     c = connection.recv()
>     c['data'] = []
>     while True:
>         r = connection.recv_bytes()
>         if len(r) == 0:
>             break
>         c['data'].append(pa.py_buffer(r))
>
>     print('...deserialize')
>     return pa.deserialize_components(c)
>
> def run_same_process():
>     """Same process: Send data down a socket, then read data from the
>     matching socket"""
>     print('run_same_process')
>     recv_conn, send_conn = mp.Pipe(duplex=False)
>     payload = make_payload()
>     print(payload)
>     send_payload(payload, send_conn)
>     payload2 = recv_payload(recv_conn)
>     print(payload2)
>
> def receiver(recv_conn):
>     """Separate process: runs in a different process, recv data &
>     deserialize"""
>     print('Receiver started')
>     payload = recv_payload(recv_conn)
>     print(payload)
>
> def run_separate_process():
>     """Separate process: launch the child process, then send data"""
>     print('run_separate_process')
>     recv_conn, send_conn = mp.Pipe(duplex=False)
>     process = mp.Process(target=receiver, args=(recv_conn,))
>     process.start()
>
>     payload = make_payload()
>     print(payload)
>     send_payload(payload, send_conn)
>
>     process.join()
>
> if __name__ == '__main__':
>     run_same_process()
>     run_separate_process()
>
> On Fri, Jul 6, 2018 at 2:42 PM Josh Quigley <
> josh.quig...@lifetrading.com.au> wrote:
> >
> > A reproducible program attached - it first runs serialize/deserialize
> > from the same process, then it does the same work using a separate
> > process for the deserialize.
> >
> > The behaviour seen (after the same-process code executes happily) is
> > hanging / child-process crashing during the call to deserialize.
> >
> > Is this expected, and if not, is there a known workaround?
> >
> > Running Windows 10, conda distribution, with package versions listed
> > below. I'll also see what happens if I run on *nix.
> >
> > - arrow-cpp=0.9.0=py36_vc14_7
> > - boost-cpp=1.66.0=vc14_1
> > - bzip2=1.0.6=vc14_1
> > - hdf5=1.10.2=vc14_0
> > - lzo=2.10=vc14_0
> > - parquet-cpp=1.4.0=vc14_0
> > - snappy=1.1.7=vc14_1
> > - zlib=1.2.11=vc14_0
> > - blas=1.0=mkl
> > - blosc=1.14.3=he51fdeb_0
> > - cython=0.28.3=py36hfa6e2cd_0
> > - icc_rt=2017.0.4=h97af966_0
> > - intel-openmp=2018.0.3=0
> > - numexpr=2.6.5=py36hcd2f87e_0
> > - numpy=1.14.5=py36h9fa60d3_2
> > - numpy-base=1.14.5=py36h5c71026_2
> > - pandas=0.23.1=py36h830ac7b_0
> > - pyarrow=0.9.0=py36hfe5e424_2
> > - pytables=3.4.4=py36he6f6034_0
> > - python=3.6.6=hea74fb7_0
> > - vc=14=h0510ff6_3
> > - vs2015_runtime=14.0.25123=3
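The wire framing used in the attachment above — send the component metadata first, then each raw buffer via send_bytes, then an empty-bytes sentinel to mark the end — can be sketched without pyarrow, which helps separate the framing logic from the serialization itself. The function names and the sample metadata dict here are illustrative, not part of any pyarrow API:

```python
import multiprocessing as mp

def send_framed(connection, metadata, buffers):
    # Send the metadata object first (pickled by Connection.send),
    # then each raw buffer, then an empty-bytes sentinel so the
    # receiver knows the buffer stream has ended.
    connection.send(metadata)
    for buf in buffers:
        connection.send_bytes(buf)
    connection.send_bytes(b'')

def recv_framed(connection):
    # Mirror of send_framed: read metadata, then collect buffers
    # until the empty-bytes sentinel arrives.
    metadata = connection.recv()
    buffers = []
    while True:
        chunk = connection.recv_bytes()
        if len(chunk) == 0:
            break
        buffers.append(chunk)
    return metadata, buffers

# Demonstrate the round trip over a one-way pipe in a single process.
recv_conn, send_conn = mp.Pipe(duplex=False)
send_framed(send_conn, {'num_buffers': 2}, [b'abc', b'defg'])
meta, bufs = recv_framed(recv_conn)
```

With small payloads like these, sending before receiving in the same process is safe because the data fits in the OS pipe buffer; the original report's separate-process version avoids that limit entirely by reading concurrently in the child.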