[jira] [Created] (ARROW-2024) Remove global SerializationContext variables.

2018-01-23 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2024:
---

 Summary: Remove global SerializationContext variables.
 Key: ARROW-2024
 URL: https://issues.apache.org/jira/browse/ARROW-2024
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Robert Nishihara


We should get rid of the global variables 
_default_serialization_context and pandas_serialization_context 
and replace them with functions default_serialization_context() and 
pandas_serialization_context().

This will also make `import pyarrow` faster, since the contexts would no longer be constructed at import time.
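The refactor could look roughly like the following sketch (illustrative only, not the actual pyarrow code): the context is built lazily on first call instead of eagerly at import time.

```python
# Hedged sketch of the proposed change: replace an eagerly built
# module-level context with a function that constructs it lazily.
# A plain object() stands in for SerializationContext here.

_default_context = None  # no longer built at import time


def default_serialization_context():
    """Create the shared context on first use and reuse it afterwards."""
    global _default_context
    if _default_context is None:
        _default_context = object()  # would be SerializationContext()
    return _default_context
```

Callers that previously read the global would call the function instead; repeated calls return the same cached instance, so importing the module no longer pays the construction cost.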



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2023) [C++] Test opening IPC stream reader or file reader on an empty InputStream

2018-01-23 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2023:
---

 Summary: [C++] Test opening IPC stream reader or file reader on an 
empty InputStream
 Key: ARROW-2023
 URL: https://issues.apache.org/jira/browse/ARROW-2023
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.9.0


This was reported to segfault in ARROW-1589.
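The desired behavior can be sketched in pure Python (illustrative names, not the actual Arrow API): opening a reader on an empty input should fail with a clean error rather than crash.

```python
import io


def open_ipc_stream_reader(source):
    """Toy stand-in for an IPC stream reader factory: reject empty input."""
    data = source.read()
    if not data:
        # the behavior under test: a clean error instead of a segfault
        raise ValueError("empty input: no IPC stream header found")
    return data  # a real reader would parse the stream header here
```

A test along these lines would open the reader on an empty buffer and assert that a Python-level error is raised.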





Re: Filters on Arrow record batch

2018-01-23 Thread Wes McKinney
hi Animesh -- it does not yet, but the idea has come up on occasion.
You are welcome to propose additions to the format for including
statistics in a stream of record batch messages (these could possibly
be embedded in the main RecordBatch metadata or sent as a separate
message).

As an aside, I just opened
https://issues.apache.org/jira/browse/ARROW-2022 to think about the
idea of sending along arbitrary extra metadata with a record batch
message.

- Wes

On Sat, Jan 20, 2018 at 6:07 AM, Animesh Trivedi
 wrote:
> Hi all,
>
> Is it possible to have push-down filters on Arrow record batches while
> reading data in? Something like what Parquet has.
>
> Does Arrow maintain any per batch statistics?
>
> Thanks
> --
> Animesh
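In the meantime, a reader can keep its own per-batch statistics outside the format and prune batches before materializing them. A minimal pure-Python sketch of the idea (batches here are just lists of dicts, not real Arrow record batches):

```python
def batch_min_max(batch, column):
    """Min/max summary for one column of a batch."""
    values = [row[column] for row in batch]
    return min(values), max(values)


def scan_with_pruning(batches, column, lo, hi):
    """Yield only batches whose value range can intersect [lo, hi]."""
    for batch in batches:
        bmin, bmax = batch_min_max(batch, column)
        if bmax < lo or bmin > hi:
            continue  # whole batch skipped without touching its rows
        yield batch
```

This is the same trick Parquet row-group statistics enable; embedding such summaries in the stream itself is what the format addition would make possible.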


[jira] [Created] (ARROW-2022) [Format] Add custom metadata field specific to a RecordBatch message

2018-01-23 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2022:
---

 Summary: [Format] Add custom metadata field specific to a 
RecordBatch message
 Key: ARROW-2022
 URL: https://issues.apache.org/jira/browse/ARROW-2022
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Format
Reporter: Wes McKinney


While we can attach custom metadata at the schema and field level, we cannot send 
metadata at the record batch level. This could include things like statistics 
(although statistics aren't a great example, because they might be something we 
eventually want to standardize), but other things as well.

See message definitions in 
https://github.com/apache/arrow/blob/master/format/Message.fbs
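The shape of the addition might look like the following (a hypothetical Python-side mirror of the Flatbuffers change, not the actual Message.fbs definition): a RecordBatch message gains an optional key/value map, analogous to the custom_metadata that Schema and Field already carry.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class RecordBatchMessage:
    """Hypothetical message wrapper: batch payload plus per-batch metadata."""
    batch: Optional[object]                           # the batch payload
    custom_metadata: Dict[bytes, bytes] = field(default_factory=dict)


msg = RecordBatchMessage(
    batch=None,
    custom_metadata={b"producer": b"scanner-v2", b"row_count": b"4096"},
)
```

Keeping the map optional (empty by default) would leave existing producers and consumers unaffected.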





[jira] [Created] (ARROW-2021) Reduce Travis CI flakiness due to apt connectivity problems

2018-01-23 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2021:
---

 Summary: Reduce Travis CI flakiness due to apt connectivity 
problems
 Key: ARROW-2021
 URL: https://issues.apache.org/jira/browse/ARROW-2021
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Wes McKinney


We have been experiencing periodic apt flakiness in Travis CI. See discussion 
in https://github.com/apache/arrow/pull/1481#issuecomment-359993584





[jira] [Created] (ARROW-2020) pyarrow: Parquet segfaults if coercing ns timestamps and writing 96-bit timestamps

2018-01-23 Thread Yiannis Liodakis (JIRA)
Yiannis Liodakis created ARROW-2020:
---

 Summary: pyarrow: Parquet segfaults if coercing ns timestamps and 
writing 96-bit timestamps
 Key: ARROW-2020
 URL: https://issues.apache.org/jira/browse/ARROW-2020
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.8.0
 Environment: OS: Mac OS X 10.13.2
Python: 3.6.4
PyArrow: 0.8.0
Reporter: Yiannis Liodakis
 Attachments: crash-report.txt

If you try to write a PyArrow table containing nanosecond-resolution timestamps 
to Parquet using `coerce_timestamps` and 
`use_deprecated_int96_timestamps=True`, the Arrow library will segfault.

The crash doesn't happen if you don't coerce the timestamp resolution or if you 
don't use 96-bit timestamps.

*To Reproduce:*

{code:python}
import datetime

import pyarrow
from pyarrow import parquet

schema = pyarrow.schema([
    pyarrow.field('last_updated', pyarrow.timestamp('ns')),
])

data = [
    pyarrow.array([datetime.datetime.now()], pyarrow.timestamp('ns')),
]

table = pyarrow.Table.from_arrays(data, ['last_updated'])

with open('test_file.parquet', 'wb') as fdesc:
    parquet.write_table(table, fdesc,
                        coerce_timestamps='us',  # 'ms' works too
                        use_deprecated_int96_timestamps=True)
{code}

See attached file for the crash report.





[jira] [Created] (ARROW-2019) Control the memory allocated for inner vector in LIST

2018-01-23 Thread Siddharth Teotia (JIRA)
Siddharth Teotia created ARROW-2019:
---

 Summary: Control the memory allocated for inner vector in LIST
 Key: ARROW-2019
 URL: https://issues.apache.org/jira/browse/ARROW-2019
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Siddharth Teotia
Assignee: Siddharth Teotia


We have observed cases in our external sort code where the amount of memory 
actually allocated for a record batch sometimes turns out to be more than 
necessary and also more than what was reserved by the operator for special 
purposes. Thus queries fail with OOM.

The usual way to control the memory allocated by vector.allocateNew() is to call 
setInitialCapacity() first; the latter modifies the vector state variables that 
are then used to allocate memory. However, due to the multiplier of 5 used in 
ListVector, we end up asking for more memory than necessary. For example, for 
a value count of 4095, we asked for 128KB of memory for the offset buffer of a 
VarCharVector backing a field that was a list of varchars.

We computed ((4095 * 5) + 1) * 4 = 81,904 bytes (about 80KB), which was then 
rounded up to 128KB (power-of-two allocation).

We had earlier made changes to setInitialCapacity() of ListVector when we were 
facing problems with deeply nested lists and decided to use the multiplier only 
for the leaf scalar vector. 

It looks like there is a need for a specialized setInitialCapacity() for 
ListVector where the caller dictates the repeatedness.

Also, there is another bug in setInitialCapacity(): the allocation of the 
validity buffer doesn't obey the capacity specified in setInitialCapacity().
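The over-allocation described above can be reproduced with a quick calculation (4-byte offsets, the multiplier of 5, plus one trailing offset, then the allocator's round-up to a power of two):

```python
# Reproducing the report's arithmetic for a list-of-varchar offset buffer.
value_count = 4095
offsets_bytes = ((value_count * 5) + 1) * 4        # 81904 bytes, ~80 KB
rounded = 1 << (offsets_bytes - 1).bit_length()    # allocator rounds up
print(offsets_bytes, rounded)                      # 81904 131072 (128 KB)
```

Without the multiplier, the same buffer would need only (4095 + 1) * 4 = 16,384 bytes, which is exactly a power of two, so a repeatedness-aware setInitialCapacity() would cut this allocation by a factor of eight.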





[jira] [Created] (ARROW-2018) [C++] Build instruction on macOS and Homebrew is incomplete

2018-01-23 Thread yosuke shiro (JIRA)
yosuke shiro created ARROW-2018:
---

 Summary: [C++] Build instruction on macOS and Homebrew is 
incomplete
 Key: ARROW-2018
 URL: https://issues.apache.org/jira/browse/ARROW-2018
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.8.0
Reporter: yosuke shiro


I read [https://github.com/apache/arrow/blob/master/cpp/README.md]

I followed this instruction:
{quote}On OS X, you can use [Homebrew|https://brew.sh/]:

brew update && brew bundle --file=c_glib/Brewfile
{quote}
I got the following result:
{quote}% brew update && brew bundle --file=c_glib/Brewfile (git)-[master]
Updated 3 taps (caskroom/cask, caskroom/versions, homebrew/core).
==> Updated Formulae
git ✔ bit cryptopp envconsul fwup hugo just leptonica mlt pdnsrec wget
imagemagick@6 ✔ cocoapods dlib etcd gitlab-runner imagemagick khard libtomcrypt 
nss quicktype wtf
awscli conan dmd fn go imagesnap knot-resolver mariadb-connector-c orc-tools 
tomcat
bench cryfs dub folly godep jenkins kubernetes-helm micropython parallel vala
==> Tapping homebrew/bundle
Cloning into '/usr/local/Homebrew/Library/Taps/homebrew/homebrew-bundle'...

remote: Counting objects: 59, done.
remote: Compressing objects: 100% (53/53), done.
remote: Total 59 (delta 8), reused 13 (delta 3), pack-reused 0
Unpacking objects: 100% (59/59), done.
Tapped 0 formulae (130 files, 173.3KB)
Error: No Brewfile found
{quote}
 

To succeed, I needed the following steps that the README does not mention:
 * Clone the Apache Arrow repository.
 * Move to its top-level directory before running the command.


