[jira] [Created] (ARROW-2008) [Python] Type inference for int32 NumPy arrays as list returns int64 and then conversion fails
Wes McKinney created ARROW-2008:
---------------------------------

Summary: [Python] Type inference for int32 NumPy arrays as list returns int64 and then conversion fails
Key: ARROW-2008
URL: https://issues.apache.org/jira/browse/ARROW-2008
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Wes McKinney
Fix For: 0.9.0

See the report in https://github.com/apache/arrow/issues/1430. {{arrow::py::InferArrowType}} is called, which traverses the array as though it were any other Python sequence, and NumPy int32 scalars are not recognized as such.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
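The inference failure can be illustrated without Arrow at all: on Python 3, a NumPy int32 scalar is not an instance of the builtin `int`, so a generic `isinstance`-based sequence traversal misses it. A minimal sketch (assuming NumPy is installed; this is not the actual `InferArrowType` code):

```python
import numpy as np

# Iterating a NumPy array yields np.int32 scalars, which are their own
# type -- an isinstance(x, int) check during sequence traversal misses them.
arr = np.array([1, 2, 3], dtype=np.int32)
scalars = list(arr)  # a plain Python list of np.int32 scalars

print(type(scalars[0]))              # <class 'numpy.int32'>
print(isinstance(scalars[0], int))   # False on Python 3
```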
[jira] [Created] (ARROW-2007) [Python] Sequence converter for float32 not implemented
Wes McKinney created ARROW-2007:
---------------------------------

Summary: [Python] Sequence converter for float32 not implemented
Key: ARROW-2007
URL: https://issues.apache.org/jira/browse/ARROW-2007
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Wes McKinney
Fix For: 0.9.0

See the bug report in https://github.com/apache/arrow/issues/1431, for example:

{code:java}
import pyarrow as pa

l = [[1.2, 3.4], [9.0, 42.0]]
pa.array(l, type=pa.list_(pa.float32()))
{code}
[jira] [Created] (ARROW-2006) [C++] Add option to trim excess padding when writing IPC messages
Wes McKinney created ARROW-2006:
---------------------------------

Summary: [C++] Add option to trim excess padding when writing IPC messages
Key: ARROW-2006
URL: https://issues.apache.org/jira/browse/ARROW-2006
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Wes McKinney

This will help with situations like https://github.com/apache/arrow/issues/1467 where we don't really need the extra padding bytes.
Help triaging Arrow GitHub issues
hi folks,

We have 23 open issues on GitHub: https://github.com/apache/arrow/issues

While GitHub issues are helpful for capturing bug reports and lightweight interactions with the community, GitHub isn't a good long-term place to manage the project's development roadmap or priorities -- everything needs to end up on JIRA. I will try to close some of the issues myself and migrate lingering items to JIRA, but any help from others in the community would be very much appreciated.

Thanks,
Wes
[jira] [Created] (ARROW-2005) [Python] pyflakes warnings on Cython files not failing build
Wes McKinney created ARROW-2005:
---------------------------------

Summary: [Python] pyflakes warnings on Cython files not failing build
Key: ARROW-2005
URL: https://issues.apache.org/jira/browse/ARROW-2005
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Wes McKinney
Fix For: 0.9.0

I see the following flakes in master:

{code:java}
pyarrow/plasma.pyx:251:80: E501 line too long (82 > 79 characters)
pyarrow/plasma.pyx:305:80: E501 line too long (96 > 79 characters)
pyarrow/_orc.pyx:53:46: E127 continuation line over-indented for visual indent
pyarrow/_orc.pyx:72:49: E703 statement ends with a semicolon
pyarrow/_orc.pyx:75:52: E703 statement ends with a semicolon
pyarrow/_orc.pyx:88:80: E501 line too long (85 > 79 characters)
pyarrow/_orc.pyx:92:80: E501 line too long (94 > 79 characters)
pyarrow/_orc.pxd:32:80: E501 line too long (87 > 79 characters)
pyarrow/_orc.pxd:43:80: E501 line too long (90 > 79 characters)
9
{code}
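The fix amounts to making the build fail when flake8 reports findings like those above (the trailing "9" is flake8's `--count` total). The core of a check like E501 is simple; a hypothetical standalone sketch, not the flake8 implementation:

```python
def long_lines(source, limit=79):
    """Yield (line number, line length) for lines longer than `limit`,
    mirroring what pycodestyle reports as E501."""
    for lineno, line in enumerate(source.splitlines(), start=1):
        if len(line) > limit:
            yield lineno, len(line)

src = "short\n" + "x" * 82 + "\n"
print(list(long_lines(src)))  # [(2, 82)]
```

In CI this would look like running something such as `flake8 --count pyarrow/*.pyx pyarrow/*.pxd` and failing the build on a nonzero exit code.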
Re: Arrow-Parquet converters in Java
Hi Sidd,

Thanks for the information. This could be a very useful tool.

Li

On Wed, Jan 17, 2018 at 3:05 PM, Siddharth Teotia wrote:
> Hi Li,
>
> We do have support for Parquet <-> Arrow reader/writer in Dremio OSS.
> Please take a look here:
>
> https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet
> https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet/columnreaders/DeprecatedParquetVectorizedReader.java
>
> We are yet to discuss how to factor out some/all of this implementation
> from Dremio and contribute it back to Parquet and/or Arrow.
>
> Thanks,
> Sidd
>
> On Wed, Jan 17, 2018 at 10:14 AM, Li Jin wrote:
> > Hey folks,
> >
> > I know this is supported in C++, but is there a library to convert between
> > Arrow and Parquet? (i.e., read Parquet files in Arrow format, write Arrow
> > format to Parquet files.)
> >
> > Jacques and Sidd, does Dremio have some library to do this?
> >
> > Thanks,
> > Li
Re: Trying to build pyarrow for python 2.7
Hi Wes,

Great, thanks for the information.

On Tue, 16 Jan 2018 at 20:19 Wes McKinney wrote:
> hi Simba -- the PyPI / pip wheels will only be updated when there is a
> new release. We'll either make a 0.8.1 release or 0.9.0 sometime in
> February depending on how development is progressing.
>
> - Wes
>
> On Sun, Jan 14, 2018 at 9:19 AM, simba nyatsanga wrote:
> > Thanks a lot. I see that there's a PR that's been opened to resolve the
> > encoding issue - https://github.com/apache/arrow/pull/1476
> >
> > Do you think this PR (if merged) will also roll out as part of version
> > 0.9.0, or will I be able to pip install with the merge commit as soon as
> > it's merged?
> >
> > Kind Regards
> >
> > On Sun, 14 Jan 2018 at 15:50 Uwe L. Korn wrote:
> >> Nice to hear that it worked.
> >>
> >> Updating the docs should not be necessary; we should rather see that we
> >> soon get a 0.9.0 release out (but that will also take some more weeks).
> >>
> >> Uwe
> >>
> >> On Sun, Jan 14, 2018, at 2:42 PM, simba nyatsanga wrote:
> >> > Amazing, thanks Uwe!
> >> >
> >> > I was able to build pyarrow successfully for python 2.7 using your
> >> > workaround. I appreciate that you've got a possible solution for that too.
> >> >
> >> > Besides the PR getting reviewed by more experienced maintainers, I'm
> >> > thinking of pulling your branch and trying the build process from scratch.
> >> > Otherwise I was wondering if it's valuable, in the meantime, to update
> >> > the docs with your workaround?
> >> >
> >> > Kind Regards
> >> > Simba
> >> >
> >> > On Sun, 14 Jan 2018 at 15:17 Uwe L. Korn wrote:
> >> > > Hello Simba,
> >> > >
> >> > > it looks like you are running into
> >> > > https://issues.apache.org/jira/browse/ARROW-1856.
> >> > >
> >> > > To work around this issue, please "unset PARQUET_HOME" before you
> >> > > call the setup.py. Also set PKG_CONFIG_PATH; in your case this should be
> >> > > "export PKG_CONFIG_PATH=/Users/simba/anaconda/envs/pyarrow-dev/lib/pkgconfig".
> >> > > By doing this, you do the package discovery using pkg-config instead
> >> > > of the *_HOME variables. Currently this is the only path on which we
> >> > > can auto-detect the extension of the parquet shared library.
> >> > >
> >> > > Nevertheless, I will take a shot at fixing the issue as it seems
> >> > > that multiple users run into it.
> >> > >
> >> > > Uwe
> >> > >
> >> > > On Thu, Jan 11, 2018, at 11:42 PM, simba nyatsanga wrote:
> >> > > > Hi Wes,
> >> > > >
> >> > > > Apologies for the ambiguity there. To clarify, I used the conda
> >> > > > instructions only to create a conda environment. So I did this:
> >> > > >
> >> > > > conda create -y -q -n pyarrow-dev \
> >> > > >     python=2.7 numpy six setuptools cython pandas pytest \
> >> > > >     cmake flatbuffers rapidjson boost-cpp thrift-cpp snappy zlib \
> >> > > >     gflags brotli jemalloc lz4-c zstd -c conda-forge
> >> > > >
> >> > > > I followed the instructions closely and I've stumbled upon a
> >> > > > different error from the one I initially encountered. Now the issue
> >> > > > seems to be when I'm building the C++ libraries, i.e. running the
> >> > > > following steps:
> >> > > >
> >> > > > mkdir parquet-cpp/build
> >> > > > pushd parquet-cpp/build
> >> > > >
> >> > > > cmake -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \
> >> > > >       -DCMAKE_INSTALL_PREFIX=$PARQUET_HOME \
> >> > > >       -DPARQUET_BUILD_BENCHMARKS=off \
> >> > > >       -DPARQUET_BUILD_EXECUTABLES=off \
> >> > > >       -DPARQUET_BUILD_TESTS=off \
> >> > > >       ..
> >> > > >
> >> > > > make -j4
> >> > > > make install
> >> > > > popd
> >> > > >
> >> > > > The make install step generates *libparquet.1.3.2.dylib* as one of
> >> > > > the artefacts, as illustrated below:
> >> > > >
> >> > > > -- Install configuration: "RELEASE"
> >> > > > -- Installing: /Users/simba/anaconda/envs/pyarrow-dev/share/parquet-cpp/cmake/parquet-cppConfig.cmake
> >> > > > -- Installing: /Users/simba/anaconda/envs/pyarrow-dev/share/parquet-cpp/cmake/parquet-cppConfigVersion.cmake
> >> > > > -- Installing: /Users/simba/anaconda/envs/pyarrow-dev/lib/libparquet.1.3.2.dylib
> >> > > > -- Installing: /Users/simba/anaconda/envs/pyarrow-dev/lib/libparquet.1.dylib
> >> > > > -- Installing: /Users/simba/anaconda/envs/pyarrow-dev/lib/libparquet.dylib
> >> > > > -- Installing: /Users/simba/anaconda/envs/pyarrow-dev/lib/libparquet.a
> >> > > > -- Installing: /Users/simba/anaconda/envs/pyarrow-dev/include/parquet/column_reader.h
> >> > > > -- Installing: /Users/simba/anaconda/envs/pyarrow-dev/include/parquet/column_page.h
> >> > > > -- Installing: /Users/simba/anaconda/envs/pyarrow-dev/include/parquet/column_scanner.h
> >> > > > -- Installing:
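Condensing Uwe's workaround from this thread into one place (the pkg-config path is specific to Simba's conda environment, adjust for your own; the final pyarrow build command is shown commented out as it depends on your checkout):

```shell
# Work around ARROW-1856: let pkg-config, rather than the *_HOME
# variables, discover the parquet-cpp installation.
unset PARQUET_HOME
export PKG_CONFIG_PATH=/Users/simba/anaconda/envs/pyarrow-dev/lib/pkgconfig

echo "PKG_CONFIG_PATH=$PKG_CONFIG_PATH"
# then build pyarrow as usual, e.g.:
# cd arrow/python && python setup.py build_ext --with-parquet --inplace
```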
[jira] [Created] (ARROW-2004) [C++] Add shrink_to_fit option in BufferBuilder::Resize
Wes McKinney created ARROW-2004:
---------------------------------

Summary: [C++] Add shrink_to_fit option in BufferBuilder::Resize
Key: ARROW-2004
URL: https://issues.apache.org/jira/browse/ARROW-2004
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Wes McKinney
Fix For: 0.9.0

See discussion in https://github.com/apache/arrow/pull/1481#discussion_r162157558
Re: Arrow policy on rewriting git history?
Got it (I remember that discussion actually). The status quo is OK for us... longer term we'll switch to using releases.

On Wed, Jan 17, 2018 at 7:50 AM Wes McKinney wrote:
> We have been rebasing master after releases so that the release tag
> (and commits for the changelog, Java package metadata, etc.) appears
> in master. This only affects PRs merged while the release vote is
> open, but it's understandably not ideal.
>
> There was a prior mailing list thread where we discussed this. The
> alternative is to not merge PRs while a release vote is open, but this
> has the effect of artificially slowing down the development cadence.
>
> I would suggest we do a 0.8.1 bug fix release sometime in the next 2
> weeks with the goal of helping Ray get onto a tagged release, and
> establish some process to help us validate master before cutting a
> release candidate, to avoid having to cancel a release vote. We also
> need to be able to validate the Spark integration more easily (this is
> ongoing in https://github.com/apache/arrow/pull/1319 -- Bryan, do you
> have time to work on this?)
>
> thanks
> Wes
>
> On Wed, Jan 17, 2018 at 12:39 AM, Robert Nishihara wrote:
> > I've noticed that specific commits sometimes disappear from the master
> > branch. Is this an inevitable consequence of the way Arrow does releases?
> > Or would it be possible to avoid removing commits from the master branch?
> >
> > Of course, once we start using Arrow releases this won't be an issue. At
> > the moment we check out specific Arrow commits, and so there are a number
> > of commits in our history that no longer build because the corresponding
> > commits in Arrow have disappeared.
Arrow-Parquet converters in Java
Hey folks,

I know this is supported in C++, but is there a library to convert between Arrow and Parquet? (i.e., read Parquet files in Arrow format, write Arrow format to Parquet files.)

Jacques and Sidd, does Dremio have some library to do this?

Thanks,
Li
[jira] [Created] (ARROW-2003) [Python] Do not use deprecated kwarg in pandas.core.internals.make_block
Wes McKinney created ARROW-2003:
---------------------------------

Summary: [Python] Do not use deprecated kwarg in pandas.core.internals.make_block
Key: ARROW-2003
URL: https://issues.apache.org/jira/browse/ARROW-2003
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Wes McKinney
Fix For: 0.9.0

See the bug report in https://github.com/apache/arrow/issues/1484
Re: Arrow policy on rewriting git history?
We have been rebasing master after releases so that the release tag (and commits for the changelog, Java package metadata, etc.) appears in master. This only affects PRs merged while the release vote is open, but it's understandably not ideal.

There was a prior mailing list thread where we discussed this. The alternative is to not merge PRs while a release vote is open, but this has the effect of artificially slowing down the development cadence.

I would suggest we do a 0.8.1 bug fix release sometime in the next 2 weeks with the goal of helping Ray get onto a tagged release, and establish some process to help us validate master before cutting a release candidate, to avoid having to cancel a release vote. We also need to be able to validate the Spark integration more easily (this is ongoing in https://github.com/apache/arrow/pull/1319 -- Bryan, do you have time to work on this?)

thanks
Wes

On Wed, Jan 17, 2018 at 12:39 AM, Robert Nishihara wrote:
> I've noticed that specific commits sometimes disappear from the master
> branch. Is this an inevitable consequence of the way Arrow does releases?
> Or would it be possible to avoid removing commits from the master branch?
>
> Of course, once we start using Arrow releases this won't be an issue. At
> the moment we check out specific Arrow commits, and so there are a number
> of commits in our history that no longer build because the corresponding
> commits in Arrow have disappeared.
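For downstream projects pinned to specific commits, `git merge-base --is-ancestor` is a quick way to check whether a pinned commit is still reachable after a rebase of the branch. A self-contained sketch using a throwaway scratch repository (assumes git is on PATH; against a real checkout you would fetch first and test your pinned Arrow commit against origin/master):

```shell
# Build a scratch repo with two commits, then check that the first
# ("pinned") commit is still an ancestor of HEAD. After a history
# rewrite, the same check against the pinned commit would fail.
tmp=$(mktemp -d)
git -C "$tmp" init -q
git -C "$tmp" -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m "first"
pinned=$(git -C "$tmp" rev-parse HEAD)
git -C "$tmp" -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m "second"

if git -C "$tmp" merge-base --is-ancestor "$pinned" HEAD; then
    echo "pinned commit still reachable"
fi
```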