[jira] [Created] (ARROW-2008) [Python] Type inference for int32 NumPy arrays as list return int64 and then conversion fails

2018-01-17 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2008:
---

 Summary: [Python] Type inference for int32 NumPy arrays as 
list return int64 and then conversion fails
 Key: ARROW-2008
 URL: https://issues.apache.org/jira/browse/ARROW-2008
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.9.0


See report in [https://github.com/apache/arrow/issues/1430]

{{arrow::py::InferArrowType}} is called, which traverses the array as though it 
were any other Python sequence; NumPy int32 scalars are not recognized as such.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2007) [Python] Sequence converter for float32 not implemented

2018-01-17 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2007:
---

 Summary: [Python] Sequence converter for float32 not implemented
 Key: ARROW-2007
 URL: https://issues.apache.org/jira/browse/ARROW-2007
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.9.0


See bug report in [https://github.com/apache/arrow/issues/1431]; for example:
{code:python}
import pyarrow as pa

l = [[1.2, 3.4], [9.0, 42.0]]
pa.array(l, type=pa.list_(pa.float32()))
{code}





[jira] [Created] (ARROW-2006) [C++] Add option to trim excess padding when writing IPC messages

2018-01-17 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2006:
---

 Summary: [C++] Add option to trim excess padding when writing IPC 
messages
 Key: ARROW-2006
 URL: https://issues.apache.org/jira/browse/ARROW-2006
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


This will help with situations like 
[https://github.com/apache/arrow/issues/1467] where we don't really need the 
extra padding bytes
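For context, IPC buffers are padded to an alignment boundary (8 bytes required by the format, 64 recommended), so the trimmable excess is the gap between the padded and actual lengths. A quick sketch of the arithmetic:

```python
def padded_size(nbytes, alignment=64):
    """Round nbytes up to the next multiple of alignment."""
    return -(-nbytes // alignment) * alignment

# A 100-byte buffer padded to 64-byte alignment occupies 128 bytes,
# leaving 28 bytes of excess padding.
excess = padded_size(100) - 100
```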





Help triaging Arrow GitHub issues

2018-01-17 Thread Wes McKinney
hi folks,

We have 23 open issues on GitHub:

https://github.com/apache/arrow/issues

While having the GitHub issues is helpful for capturing bug reports
and lightweight interactions with the community, it isn't a good
long-term place to manage the project's development roadmap or
priorities -- everything needs to end up on JIRA.

I will try to close some of the issues myself and migrate lingering
items to JIRA, but any help from others in the community would be very
much appreciated.

Thanks,
Wes


[jira] [Created] (ARROW-2005) [Python] pyflakes warnings on Cython files not failing build

2018-01-17 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2005:
---

 Summary: [Python] pyflakes warnings on Cython files not failing 
build
 Key: ARROW-2005
 URL: https://issues.apache.org/jira/browse/ARROW-2005
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.9.0


I see the following flakes in master:
{code}
pyarrow/plasma.pyx:251:80: E501 line too long (82 > 79 characters)
pyarrow/plasma.pyx:305:80: E501 line too long (96 > 79 characters)
pyarrow/_orc.pyx:53:46: E127 continuation line over-indented for visual indent
pyarrow/_orc.pyx:72:49: E703 statement ends with a semicolon
pyarrow/_orc.pyx:75:52: E703 statement ends with a semicolon
pyarrow/_orc.pyx:88:80: E501 line too long (85 > 79 characters)
pyarrow/_orc.pyx:92:80: E501 line too long (94 > 79 characters)
pyarrow/_orc.pxd:32:80: E501 line too long (87 > 79 characters)
pyarrow/_orc.pxd:43:80: E501 line too long (90 > 79 characters)
9
{code}
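The gap here is that the warnings are reported but the build still passes. A minimal sketch (illustrative only, not Arrow's actual CI code) of the kind of check that would catch the E501 violations above:

```python
def find_long_lines(text, limit=79):
    """Return (line_number, length) for each line longer than limit,
    mirroring pycodestyle's E501 check."""
    return [(i + 1, len(line))
            for i, line in enumerate(text.splitlines())
            if len(line) > limit]

# A CI step would run a check like this (or flake8 itself) over the
# .pyx/.pxd files and exit non-zero when the result is non-empty.
violations = find_long_lines("short line\n" + "x" * 85)
```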





Re: Arrow-Parquet converters in Java

2018-01-17 Thread Li Jin
Hi Sidd,

Thanks for the information. This could be a very useful tool.

Li

On Wed, Jan 17, 2018 at 3:05 PM, Siddharth Teotia wrote:

> Hi Li,
>
> We do have support for Parquet <-> Arrow reader/writer in Dremio OSS.
> Please take a look here:
>
> https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet
> https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/exec/store/parquet/columnreaders/DeprecatedParquetVectorizedReader.java
>
> We have yet to discuss how to factor out some or all of this implementation
> from Dremio and contribute it back to Parquet and/or Arrow.
>
> Thanks,
> Sidd
>
>
> On Wed, Jan 17, 2018 at 10:14 AM, Li Jin  wrote:
>
> > Hey folks,
> >
> > I know this is supported in C++, but is there a library to convert
> between
> > Arrow and Parquet? (i.e., read Parquet files in Arrow format, write Arrow
> > format to Parquet files).
> >
> > Jacques and Sidd, does Dremio have some library to do this?
> >
> > Thanks,
> > Li
> >
>


Re: Trying to build pyarrow for python 2.7

2018-01-17 Thread simba nyatsanga
Hi Wes,

Great, thanks for the information.

On Tue, 16 Jan 2018 at 20:19 Wes McKinney  wrote:

> hi Simba -- the PyPI / pip wheels will only be updated when there is a
> new release. We'll either make a 0.8.1 release or 0.9.0 sometime in
> February depending on how development is progressing.
>
> - Wes
>
> On Sun, Jan 14, 2018 at 9:19 AM, simba nyatsanga 
> wrote:
> > Thanks a lot. I see that there's a PR that's been opened to resolve the
> > encoding issue - https://github.com/apache/arrow/pull/1476
> >
> > Do you think this PR (if merged) will also roll out as part of version
> > 0.9.0, or will I be able to pip install with the merge commit as soon as
> > it's merged?
> >
> > Kind Regards
> >
> > On Sun, 14 Jan 2018 at 15:50 Uwe L. Korn  wrote:
> >
> >> Nice to hear that it worked.
> >>
> >> Updating the docs should not be necessary, we should rather see that we
> >> soon get a 0.9.0 release out (but that will also take some more weeks)
> >>
> >> Uwe
> >>
> >> On Sun, Jan 14, 2018, at 2:42 PM, simba nyatsanga wrote:
> >> > Amazing, thanks Uwe!
> >> >
> >> > I was able to build pyarrow successfully for python 2.7 using your
> >> > workaround. I appreciate that you've got a possible solution for that too.
> >> >
> >> > Besides the PR getting reviewed by more experienced maintainers, I'm
> >> > thinking of pulling your branch and trying the build process from scratch.
> >> > Otherwise I was wondering if it's valuable, in the meantime, to update
> >> > the docs with your workaround?
> >> >
> >> > Kind Regards
> >> > Simba
> >> >
> >> > On Sun, 14 Jan 2018 at 15:17 Uwe L. Korn  wrote:
> >> >
> >> > > Hello Simba,
> >> > >
> >> > > it looks like you are running into
> >> > > https://issues.apache.org/jira/browse/ARROW-1856.
> >> > >
> >> > > To work around this issue, please "unset PARQUET_HOME" before you call
> >> > > the setup.py. Also set PKG_CONFIG_PATH; in your case this should be
> >> > > "export PKG_CONFIG_PATH=/Users/simba/anaconda/envs/pyarrow-dev/lib/pkgconfig".
> >> > > By doing this, you do the package discovery using pkg-config instead of
> >> > > the *_HOME variables. Currently this is the only path on which we can
> >> > > auto-detect the extension of the parquet shared library.
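The workaround above amounts to the following shell fragment (the prefix is the one from this thread; substitute your own environment's prefix):

```shell
# Switch parquet-cpp discovery from the *_HOME variables to pkg-config
# (workaround for ARROW-1856). The path below is the one from this thread.
unset PARQUET_HOME
export PKG_CONFIG_PATH=/Users/simba/anaconda/envs/pyarrow-dev/lib/pkgconfig
```

With these set, running setup.py locates libparquet via pkg-config, which is the only path that auto-detects the shared-library extension.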
> >> > >
> >> > > Nevertheless, I will take a shot at fixing the issue, as it seems that
> >> > > multiple users run into it.
> >> > >
> >> > > Uwe
> >> > >
> >> > > On Thu, Jan 11, 2018, at 11:42 PM, simba nyatsanga wrote:
> >> > > > Hi Wes,
> >> > > >
> >> > > > Apologies for the ambiguity there. To clarify, I used the conda
> >> > > > instructions only to create a conda environment. So I did this
> >> > > >
> >> > > > conda create -y -q -n pyarrow-dev \
> >> > > >   python=2.7 numpy six setuptools cython pandas pytest \
> >> > > >   cmake flatbuffers rapidjson boost-cpp thrift-cpp snappy zlib \
> >> > > >   gflags brotli jemalloc lz4-c zstd -c conda-forge
> >> > > >
> >> > > >
> >> > > > I followed the instructions closely and I've stumbled upon a different
> >> > > > error from the one I initially encountered. Now the issue seems to be
> >> > > > that when I'm building the Arrow C++ libraries, i.e., running the
> >> > > > following steps:
> >> > > >
> >> > > > mkdir parquet-cpp/build
> >> > > > pushd parquet-cpp/build
> >> > > >
> >> > > > cmake -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \
> >> > > >   -DCMAKE_INSTALL_PREFIX=$PARQUET_HOME \
> >> > > >   -DPARQUET_BUILD_BENCHMARKS=off \
> >> > > >   -DPARQUET_BUILD_EXECUTABLES=off \
> >> > > >   -DPARQUET_BUILD_TESTS=off \
> >> > > >   ..
> >> > > >
> >> > > > make -j4
> >> > > > make install
> >> > > > popd
> >> > > >
> >> > > >
> >> > > > The make install step generates *libparquet.1.3.2.dylib* as one of
> >> > > > the artefacts, as illustrated below:
> >> > > >
> >> > > > -- Install configuration: "RELEASE"
> >> > > > -- Installing: /Users/simba/anaconda/envs/pyarrow-dev/share/parquet-cpp/cmake/parquet-cppConfig.cmake
> >> > > > -- Installing: /Users/simba/anaconda/envs/pyarrow-dev/share/parquet-cpp/cmake/parquet-cppConfigVersion.cmake
> >> > > > -- Installing: /Users/simba/anaconda/envs/pyarrow-dev/lib/libparquet.1.3.2.dylib
> >> > > > -- Installing: /Users/simba/anaconda/envs/pyarrow-dev/lib/libparquet.1.dylib
> >> > > > -- Installing: /Users/simba/anaconda/envs/pyarrow-dev/lib/libparquet.dylib
> >> > > > -- Installing: /Users/simba/anaconda/envs/pyarrow-dev/lib/libparquet.a
> >> > > > -- Installing: /Users/simba/anaconda/envs/pyarrow-dev/include/parquet/column_reader.h
> >> > > > -- Installing: /Users/simba/anaconda/envs/pyarrow-dev/include/parquet/column_page.h
> >> > > > -- Installing: /Users/simba/anaconda/envs/pyarrow-dev/include/parquet/column_scanner.h
> >> > > > -- Installing:

[jira] [Created] (ARROW-2004) [C++] Add shrink_to_fit option in BufferBuilder::Resize

2018-01-17 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2004:
---

 Summary: [C++] Add shrink_to_fit option in BufferBuilder::Resize
 Key: ARROW-2004
 URL: https://issues.apache.org/jira/browse/ARROW-2004
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.9.0


See discussion in 
https://github.com/apache/arrow/pull/1481#discussion_r162157558
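The linked discussion is about when Resize should release memory. A toy Python model of the proposed semantics (names and defaults here are assumptions for illustration, not Arrow's actual C++ API):

```python
class BufferBuilder:
    """Toy model of a growable buffer illustrating a shrink_to_fit option."""

    def __init__(self):
        self.size = 0
        self.capacity = 0

    def resize(self, new_size, shrink_to_fit=True):
        # Grow when needed; shrink the allocation only when asked, so a
        # caller that temporarily resizes down can keep its capacity.
        if new_size > self.capacity or (shrink_to_fit and new_size < self.capacity):
            self.capacity = new_size  # stands in for a reallocation
        self.size = new_size

b = BufferBuilder()
b.resize(64)
b.resize(16, shrink_to_fit=False)  # size drops, capacity stays at 64
```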





Re: Arrow policy on rewriting git history?

2018-01-17 Thread Robert Nishihara
Got it (I remember that discussion, actually). The status quo is OK for us;
longer term we'll switch to using releases.

On Wed, Jan 17, 2018 at 7:50 AM Wes McKinney  wrote:

> We have been rebasing master after releases so that the release tag
> (and commits for the changelog, Java package metadata, etc.) appears
> in master. This only affects PRs merged while the release vote is
> open, but it's understandably not ideal.
>
> There was a prior mailing list thread where we discussed this. The
> alternative is to not merge PRs while a release vote is open, but this
> has the effect of artificially slowing down the development cadence.
>
> I would suggest we do a 0.8.1 bug fix release sometime in the next 2
> weeks with the goal of helping Ray get onto a tagged release, and
> establish some process to help us validate master before cutting a
> release candidate to avoid having to cancel a release vote. We also
> need to be able to validate the Spark integration more easily (this is
> ongoing in https://github.com/apache/arrow/pull/1319 -- Bryan do you
> have time to work on this?)
>
> thanks
> Wes
>
> On Wed, Jan 17, 2018 at 12:39 AM, Robert Nishihara
>  wrote:
> > I've noticed that specific commits sometimes disappear from the master
> > branch. Is this an inevitable consequence of the way Arrow does releases?
> > Or would it be possible to avoid removing commits from the master branch?
> >
> > Of course once we start using Arrow releases this won't be an issue. At
> the
> > moment we check out specific Arrow commits, and so there are a number of
> > commits in our history that no longer build because the corresponding
> > commits in Arrow have disappeared.
>
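The practical takeaway for downstream consumers: tags keep pointing at the same commit even if a branch is rewritten, so pinning to a release tag is stable where pinning to a raw SHA on master is not. A throwaway-repo sketch (the tag name just follows Arrow's release-tag convention for illustration):

```shell
# Tags are refs of their own: rebasing or rewriting a branch does not move
# them, so a tag checkout keeps working while a raw commit SHA may vanish.
git init -q history-demo && cd history-demo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "tagged release"
git tag apache-arrow-0.8.0
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "post-release work"
git checkout -q apache-arrow-0.8.0   # still resolves after any rewrite of master
```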


Arrow-Parquet converters in Java

2018-01-17 Thread Li Jin
Hey folks,

I know this is supported in C++, but is there a library to convert between
Arrow and Parquet? (i.e., read Parquet files in Arrow format, write Arrow
format to Parquet files).

Jacques and Sidd, does Dremio have some library to do this?

Thanks,
Li


[jira] [Created] (ARROW-2003) [Python] Do not use deprecated kwarg in pandas.core.internals.make_block

2018-01-17 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2003:
---

 Summary: [Python] Do not use deprecated kwarg in 
pandas.core.internals.make_block
 Key: ARROW-2003
 URL: https://issues.apache.org/jira/browse/ARROW-2003
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.9.0


see bug report in [https://github.com/apache/arrow/issues/1484]





Re: Arrow policy on rewriting git history?

2018-01-17 Thread Wes McKinney
We have been rebasing master after releases so that the release tag
(and commits for the changelog, Java package metadata, etc.) appears
in master. This only affects PRs merged while the release vote is
open, but it's understandably not ideal.

There was a prior mailing list thread where we discussed this. The
alternative is to not merge PRs while a release vote is open, but this
has the effect of artificially slowing down the development cadence.

I would suggest we do a 0.8.1 bug fix release sometime in the next 2
weeks with the goal of helping Ray get onto a tagged release, and
establish some process to help us validate master before cutting a
release candidate to avoid having to cancel a release vote. We also
need to be able to validate the Spark integration more easily (this is
ongoing in https://github.com/apache/arrow/pull/1319 -- Bryan do you
have time to work on this?)

thanks
Wes

On Wed, Jan 17, 2018 at 12:39 AM, Robert Nishihara
 wrote:
> I've noticed that specific commits sometimes disappear from the master
> branch. Is this an inevitable consequence of the way Arrow does releases?
> Or would it be possible to avoid removing commits from the master branch?
>
> Of course once we start using Arrow releases this won't be an issue. At the
> moment we check out specific Arrow commits, and so there are a number of
> commits in our history that no longer build because the corresponding
> commits in Arrow have disappeared.