Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-18 Thread Micah Kornfield
>
> * Should optional components be "opt in", "opt out", or a mix?
> Currently it's a mix, and that's confusing for people. I think we
> should make them all "opt in".

Agreed, they should all be opt in by default.  I think active developers are
quite adept at flipping the appropriate CMake flags.


> * Do we want to bring the out-of-the-box core build down to zero
> dependencies, including not depending on boost::filesystem and
> possibly checking in the compiled Flatbuffers files.

> While it may be
> slightly more maintenance work, I think the optics of a
> "dependency-free" core build would be beneficial and help the project
> marketing-wise.

I'm -0.5 on checking in generated artifacts, but this is mostly stylistic.
In the case of flatbuffers it seems like we might be able to get away with
vendoring, since it should mostly be header-only.

I would prefer to try to come up with more granular components and be
very conservative about what is "core".  I think it should be possible to have a
zero-dependency build with only MemoryPool, Buffers, Arrays and ArrayBuilders
in a core package [1].  This, combined with the discussion Antoine started on an
ABI-compatible C layer, would make basic interop within a process
reasonable.  Moving up the stack to IPC and files, there is probably a way
to package headers separately from implementations.  This would allow other
projects wishing to integrate with Arrow to bring their own implementations
without the baggage of boost::filesystem.  Would this leave anything besides
"flatbuffers" as a hard dependency to support IPC?

Thanks,
Micah


[1] It probably makes sense to go even further and separate out MemoryPool
and Buffer, so we can break the circular relationship between parquet and
arrow.

On Wed, Sep 18, 2019 at 8:03 AM Wes McKinney  wrote:

> To be clear I think we should make these changes right after 0.15.0 is
> released so we aren't playing whackamole with our packaging scripts.
> I'm happy to take the lead on the work...
>
> On Wed, Sep 18, 2019 at 9:54 AM Antoine Pitrou 
> wrote:
> >
> > On Wed, 18 Sep 2019 09:46:54 -0500
> > Wes McKinney  wrote:
> > > I think these are both interesting areas to explore further. I'd like
> > > to focus on the couple of immediate items I think we should address
> > >
> > > * Should optional components be "opt in", "opt out", or a mix?
> > > Currently it's a mix, and that's confusing for people. I think we
> > > should make them all "opt in".
> > > * Do we want to bring the out-of-the-box core build down to zero
> > > dependencies, including not depending on boost::filesystem and
> > > possibly checking in the compiled Flatbuffers files. While it may be
> > > slightly more maintenance work, I think the optics of a
> > > "dependency-free" core build would be beneficial and help the project
> > > marketing-wise.
> > >
> > > Both of these issues must be addressed whether we undertake a Bazel
> > > implementation or some other refactor of the C++ build system.
> >
> > I think checking in the Flatbuffers files (and also Protobuf and Thrift
> > where applicable :-)) would be fine.
> >
> > As for boost::filesystem, getting rid of it wouldn't be a huge task.
> > Still worth deciding whether we want to prioritize development time for
> > it, because it's not entirely trivial either.
> >
> > Regards
> >
> > Antoine.
> >
> >
>


Re: Timeline for 0.15.0 release

2019-09-18 Thread Micah Kornfield
>
> The process should be well documented at this point but there are a
> number of steps.

Is [1] the up-to-date documentation for the release?  Are there
instructions for adding the code signing key to SVN?

I will make a go of it.  I will try to mitigate any internet issues by
doing the process from a cloud instance (I assume that isn't a problem?).

Thanks,
Micah

[1]
https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide

On Wed, Sep 18, 2019 at 8:29 AM Wes McKinney  wrote:

> The process should be well documented at this point but there are a
> number of steps. Note that you need to add your code signing key to
> the KEYS file in SVN (that's not very hard to do). I think it's fine
> to hand off the process to others after the VOTE but it would be
> tricky to have multiple RMs involved with producing the source and
> binary artifacts for the vote
>
> On Tue, Sep 17, 2019 at 10:55 PM Micah Kornfield 
> wrote:
> >
> > SGTM, as well.
> >
> > I should have a little bit of time next week if I can help as RM, but I have
> > a couple of concerns:
> > 1.  In the past I've had trouble downloading and validating releases.  I'm a
> > bit worried that I might have similar problems doing the necessary uploads.
> > 2.  My internet connection will likely not be great; I don't know if this
> > would make it even less likely to be successful.
> >
> > Does it become problematic if somehow I would have to abandon the process
> > mid-release?  Is there anyone who could serve as a backup?  Are the steps
> > well documented?
> >
> > Thanks,
> > Micah
> >
> > On Tue, Sep 17, 2019 at 4:25 PM Neal Richardson <
> neal.p.richard...@gmail.com>
> > wrote:
> >
> > > Sounds good to me.
> > >
> > > Do we have a release manager yet? Any volunteers?
> > >
> > > Neal
> > >
> > > On Tue, Sep 17, 2019 at 4:06 PM Wes McKinney 
> wrote:
> > >
> > > > hi all,
> > > >
> > > > It looks like we're drawing close to be able to make the 0.15.0
> > > > release. I would suggest "pencils down" at the end of this week and
> > > > see if a release candidate can be produced next Monday September 23.
> > > > Any thoughts or objections?
> > > >
> > > > Thanks,
> > > > Wes
> > > >
> > > > On Wed, Sep 11, 2019 at 11:23 AM Wes McKinney 
> > > wrote:
> > > > >
> > > > > hi Eric -- yes, that's correct. I'm planning to amend the Format
> docs
> > > > > today regarding the EOS issue and also update the C++ library
> > > > >
> > > > > On Wed, Sep 11, 2019 at 11:21 AM Eric Erhardt
> > > > >  wrote:
> > > > > >
> > > > > > I assume the plan is to merge the ARROW-6313-flatbuffer-alignment
> > > > branch into master before the 0.15 release, correct?
> > > > > >
> > > > > > BTW - I believe the C# alignment changes are ready to be merged
> into
> > > > the alignment branch -  https://github.com/apache/arrow/pull/5280/
> > > > > >
> > > > > > Eric
> > > > > >
> > > > > > -Original Message-
> > > > > > From: Micah Kornfield 
> > > > > > Sent: Tuesday, September 10, 2019 10:24 PM
> > > > > > To: Wes McKinney 
> > > > > > Cc: dev ; niki.lj 
> > > > > > Subject: Re: Timeline for 0.15.0 release
> > > > > >
> > > > > > I should have a little more bandwidth to help with some of the
> > > > packaging starting tomorrow and going into the weekend.
> > > > > >
> > > > > > On Tuesday, September 10, 2019, Wes McKinney <
> wesmck...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > Hi folks,
> > > > > > >
> > > > > > > With the state of nightly packaging and integration builds
> things
> > > > > > > aren't looking too good for being in release readiness by the
> end
> > > of
> > > > > > > this week but maybe I'm wrong. I'm planning to be working to
> close
> > > as
> > > > > > > many issues as I can and also to help with the ongoing
> alignment
> > > > fixes.
> > > > > > >
> > > > > > > Wes
> > > > > > >
> > > > > > > On Thu, Sep 5, 2019, 11:07 PM Micah Kornfield <
> > > emkornfi...@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Just for reference [1] has a dashboard of the current issues:
> > > > > > >>
> > > > > > >>
> > > > > > >> https://cwiki.apache.org/confluence/display/ARROW/Arrow+0.15.0+Release
> > > > > > >>
> > > > > > >> On Thu, Sep 5, 2019 at 3:43 PM Wes McKinney <
> wesmck...@gmail.com>
> > > > wrote:
> > > > > > >>
> > > > > > >>> hi all,
> > > > > > >>>
> > > > > > >>> It doesn't seem like we're going to be in a position to
> release
> > > at
> > > > > > >>> the beginning of next week. I hope that one more week of
> work (or
> > > > > > >>> less) will be enough to get us there. Aside from merging the
> > > > > > >>> alignment 

Draft blog post for 0.15 release

2019-09-18 Thread Neal Richardson
Hi all,
In preparation for next week, I've started a release announcement blog
post here: https://github.com/apache/arrow-site/pull/27

Please fill in the parts you know best. Committers can just push edits
to my branch; also feel free to reply to this thread with content, or
email me directly, and I'll add it in for you.

Neal


[jira] [Created] (ARROW-6616) [Website] Release announcement blog post for 0.15

2019-09-18 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6616:
--

 Summary: [Website] Release announcement blog post for 0.15
 Key: ARROW-6616
 URL: https://issues.apache.org/jira/browse/ARROW-6616
 Project: Apache Arrow
  Issue Type: Task
  Components: Website
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 0.15.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6615) [C++] Add filtering option to fs::Selector

2019-09-18 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-6615:
-

 Summary: [C++] Add filtering option to fs::Selector
 Key: ARROW-6615
 URL: https://issues.apache.org/jira/browse/ARROW-6615
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Francois Saint-Jacques


It would be convenient if Selector could support file path filtering, either via a 
regex or globbing applied to the path.

This is semi-required for filtering files in Dataset to properly apply the file 
format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6614) [C++][Dataset] Implement FileSystemDataSourceDiscovery

2019-09-18 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-6614:
-

 Summary: [C++][Dataset] Implement FileSystemDataSourceDiscovery
 Key: ARROW-6614
 URL: https://issues.apache.org/jira/browse/ARROW-6614
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Francois Saint-Jacques


DataSourceDiscovery is what allows inferring a Schema and constructing a DataSource 
with a PartitionScheme.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6613) [C++] Remove dependency on boost::filesystem

2019-09-18 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-6613:
-

 Summary: [C++] Remove dependency on boost::filesystem
 Key: ARROW-6613
 URL: https://issues.apache.org/jira/browse/ARROW-6613
 Project: Apache Arrow
  Issue Type: Wish
  Components: C++
Reporter: Antoine Pitrou
 Fix For: 1.0.0


See ARROW-2196 for details.
boost::filesystem should not be required for base functionality at least 
(including filesystems, probably).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6612) [C++] Add ARROW_CSV CMake build flag

2019-09-18 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6612:
---

 Summary: [C++] Add ARROW_CSV CMake build flag
 Key: ARROW-6612
 URL: https://issues.apache.org/jira/browse/ARROW-6612
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


I think it would be better not to build this part of the project 
unconditionally



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6611) [C++] Make ARROW_JSON=OFF the default

2019-09-18 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6611:
---

 Summary: [C++] Make ARROW_JSON=OFF the default
 Key: ARROW-6611
 URL: https://issues.apache.org/jira/browse/ARROW-6611
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


The JSON-based functionality is only needed for 

* Integration tests
* Unit tests
* JSON scanning

If the user opts in to unit tests or integration tests, then we can flip it on, 
but I think that the user should opt in when building libarrow



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6610) [C++] Add ARROW_FILESYSTEM=ON/OFF CMake configuration flag

2019-09-18 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6610:
---

 Summary: [C++] Add ARROW_FILESYSTEM=ON/OFF CMake configuration flag
 Key: ARROW-6610
 URL: https://issues.apache.org/jira/browse/ARROW-6610
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


Building this code should not be required in order to take advantage of the 
columnar core (memory allocation, data structures, IPC)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6609) [C++] Add minimal build Dockerfile example

2019-09-18 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6609:
---

 Summary: [C++] Add minimal build Dockerfile example
 Key: ARROW-6609
 URL: https://issues.apache.org/jira/browse/ARROW-6609
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 0.15.0


This will also help developers test a minimal build configuration



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6608) [C++] Make default for ARROW_HDFS to be OFF

2019-09-18 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6608:
---

 Summary: [C++] Make default for ARROW_HDFS to be OFF
 Key: ARROW-6608
 URL: https://issues.apache.org/jira/browse/ARROW-6608
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 1.0.0


This is one optional usage of {{boost::filesystem}} that could be eliminated 
from the simple "core" build



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6607) Support for set/list columns in python

2019-09-18 Thread Giora Simchoni (Jira)
Giora Simchoni created ARROW-6607:
-

 Summary: Support for set/list columns in python
 Key: ARROW-6607
 URL: https://issues.apache.org/jira/browse/ARROW-6607
 Project: Apache Arrow
  Issue Type: Wish
  Components: Python
 Environment: python 3.6.7, pandas 0.24.2, pyarrow 0.14.1 on WSL in 
Windows 10
Reporter: Giora Simchoni


Hi,

Using python 3.6.7, pandas 0.24.2, pyarrow 0.14.1 on WSL in Windows 10...

```python
import pandas as pd

df = pd.DataFrame({'a': [1,2,3], 'b': [set([1,2]), set([2,3]), set([3,4,5])]})

df.to_feather('test.ft')
```

I get:

```
Traceback (most recent call last):
 File "", line 1, in 
 File "/home/gioras/.local/lib/python3.6/site-packages/pandas/core/frame.py", 
line 2131, in to_feather
 to_feather(self, fname)
 File 
"/home/gioras/.local/lib/python3.6/site-packages/pandas/io/feather_format.py", 
line 83, in to_feather
 feather.write_feather(df, path)
 File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
line 182, in write_feather
 writer.write(df)
 File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
line 93, in write
 table = Table.from_pandas(df, preserve_index=False)
 File "pyarrow/table.pxi", line 1174, in pyarrow.lib.Table.from_pandas
 File 
"/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
line 496, in dataframe_to_arrays
 for c, f in zip(columns_to_convert, convert_fields)]
 File 
"/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
 line 496, in <listcomp>
 for c, f in zip(columns_to_convert, convert_fields)]
 File 
"/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
line 487, in convert_column
 raise e
 File 
"/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
line 481, in convert_column
 result = pa.array(col, type=type_, from_pandas=True, safe=safe)
 File "pyarrow/array.pxi", line 191, in pyarrow.lib.array
 File "pyarrow/array.pxi", line 78, in pyarrow.lib._ndarray_to_array
 File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ('Could not convert {1, 2} with type set: did not 
recognize Python value type when inferring an Arrow data type', 'Conversion 
failed for column b with type object')
```

And obviously `df.drop('b', axis=1).to_feather('test.ft')` works.

Questions:
(1) Is it possible to support this kind of set/list column?
(2) Does anyone have an idea on how to deal with this? I *cannot* unnest these 
set/list columns, as this would explode the DataFrame. My only other idea is to 
convert the set `{1,2}` into a string `1,2` and parse it after reading the file, 
and hope it won't be slow.

 

Update:

With lists column the error is different:

```python
import pandas as pd

df = pd.DataFrame({'a': [1,2,3], 'b': [[1,2], [2,3], [3,4,5]]})

df.to_feather('test.ft')
```

```

Traceback (most recent call last):
 File "", line 1, in 
 File "/home/gioras/.local/lib/python3.6/site-packages/pandas/core/frame.py", 
line 2131, in to_feather
 to_feather(self, fname)
 File 
"/home/gioras/.local/lib/python3.6/site-packages/pandas/io/feather_format.py", 
line 83, in to_feather
 feather.write_feather(df, path)
 File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
line 182, in write_feather
 writer.write(df)
 File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", 
line 97, in write
 self.writer.write_array(name, col.data.chunk(0))
 File "pyarrow/feather.pxi", line 67, in pyarrow.lib.FeatherWriter.write_array
 File "pyarrow/error.pxi", line 93, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: list

```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[DISCUSS] C-level in-process array protocol

2019-09-18 Thread Antoine Pitrou


Hello,

One thing that was discussed in the sync call is the ability to easily
pass arrays at runtime between Arrow implementations or Arrow-supporting
libraries in the same process, without bearing the cost of linking to
e.g. the C++ Arrow library.

(for example: "Duckdb wants to provide an option to return Arrow data of
result sets, but they don't like having Arrow as a dependency")

One possibility would be to define a C-level protocol similar in spirit
to the Python buffer protocol, which some people may be familiar with (*).

The basic idea is to define a simple C struct, which is ABI-stable and
describes an Arrow array adequately.  The struct can be stack-allocated.
Its definition can also be copied into another project (or interfaced with
using a C FFI layer, depending on the language).

There is no formal proposal yet; this message is meant to stir the discussion.

Issues to work out:

* Memory lifetime issues: where Python simply associates the Py_buffer
with a PyObject owner (a garbage-collected Python object), we need
another means to control lifetime of pointed areas.  One simple
possibility is to include a destructor function pointer in the protocol
struct.

* Arrow type representation.  We probably need some kind of "format"
mini-language to represent Arrow types, so that a type can be described
using a `const char*`.  Ideally, primitive types at least should be
trivially parsable.  We may take inspiration from Python here (`struct`
module format characters, PEP 3118 format additions).

Example C struct definition (not a formal proposal!):

struct ArrowBuffer {
  void* data;
  int64_t nbytes;
  // Called by the consumer when it doesn't need the buffer anymore
  void (*release)(struct ArrowBuffer*);
  // Opaque user data (for e.g. the release callback)
  void* user_data;
};

struct ArrowArray {
  // Type description
  const char* format;
  // Data description
  int64_t length;
  int64_t null_count;
  int64_t n_buffers;
  // Note: these pointers are probably owned by the ArrowArray struct
  // and will be released and free()d by the release callback.
  struct ArrowBuffer* buffers;
  struct ArrowArray* dictionary;
  // Called by the consumer when it doesn't need the array anymore
  void (*release)(struct ArrowArray*);
  // Opaque user data (for e.g. the release callback)
  void* user_data;
};
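
For illustration, a consumer of such a struct might look roughly like the
following sketch (hypothetical, written against the strawman definitions
above); the idea is that the consumer invokes the release callback once it
no longer needs the data:

// Hypothetical consumer-side handling, assuming the structs sketched above.
void ConsumeArrowArray(struct ArrowArray* array) {
  // Read the data while the producer still owns the memory:
  // parse array->format, walk array->buffers[0 .. n_buffers-1], etc.

  // Hand ownership back to the producer.  After this call the consumer
  // must not dereference any pointer reachable through the struct.
  if (array->release != nullptr) {
    array->release(array);
    array->release = nullptr;  // convention: mark the struct as released
  }
}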

Thoughts?

(*) For the record, the reference for the Python buffer protocol:
https://docs.python.org/3/c-api/buffer.html#buffer-structure
and its C struct definition:
https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195

Regards

Antoine.


[jira] [Created] (ARROW-6606) [C++] Construct tree structure from std::vector

2019-09-18 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-6606:
-

 Summary: [C++] Construct tree structure from 
std::vector
 Key: ARROW-6606
 URL: https://issues.apache.org/jira/browse/ARROW-6606
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Francois Saint-Jacques


This will be used by FileSystemDataSource for pushdown predicate pruning of 
branches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6605) [C++] Add recursion depth control to fs::Selector

2019-09-18 Thread Francois Saint-Jacques (Jira)
Francois Saint-Jacques created ARROW-6605:
-

 Summary: [C++] Add recursion depth control to fs::Selector
 Key: ARROW-6605
 URL: https://issues.apache.org/jira/browse/ARROW-6605
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Francois Saint-Jacques


This is similar to the recursive option, but also controls the depth.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6604) [C++] Add support for nested types to MakeArrayFromScalar

2019-09-18 Thread Benjamin Kietzman (Jira)
Benjamin Kietzman created ARROW-6604:


 Summary: [C++] Add support for nested types to MakeArrayFromScalar
 Key: ARROW-6604
 URL: https://issues.apache.org/jira/browse/ARROW-6604
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman


At the same time move MakeArrayFromScalar and MakeArrayOfNull under 
src/arrow/array/




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6603) [C#] ArrayBuilder API to support writing nulls

2019-09-18 Thread Eric Erhardt (Jira)
Eric Erhardt created ARROW-6603:
---

 Summary: [C#] ArrayBuilder API to support writing nulls
 Key: ARROW-6603
 URL: https://issues.apache.org/jira/browse/ARROW-6603
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C#
Reporter: Eric Erhardt


There is currently no API in the PrimitiveArrayBuilder class to support writing 
nulls.  See this TODO - 
[https://github.com/apache/arrow/blob/1515fe10c039fb6685df2e282e2e888b773caa86/csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs#L101.]

 

Also see [https://github.com/apache/arrow/issues/5381].

 

We should add some APIs to support writing nulls.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Timeline for 0.15.0 release

2019-09-18 Thread Wes McKinney
The process should be well documented at this point but there are a
number of steps. Note that you need to add your code signing key to
the KEYS file in SVN (that's not very hard to do). I think it's fine
to hand off the process to others after the VOTE but it would be
tricky to have multiple RMs involved with producing the source and
binary artifacts for the vote

On Tue, Sep 17, 2019 at 10:55 PM Micah Kornfield  wrote:
>
> SGTM, as well.
>
> I should have a little bit of time next week if I can help as RM but I have
> a couple of concerns:
> 1.  In the past I've had trouble downloading and validating releases.  I'm a
> bit worried that I might have similar problems doing the necessary uploads.
> 2.  My internet connection will likely not be great; I don't know if this
> would make it even less likely to be successful.
>
> Does it become problematic if somehow I would have to abandon the process
> mid-release?  Is there anyone who could serve as a backup?  Are the steps
> well documented?
>
> Thanks,
> Micah
>
> On Tue, Sep 17, 2019 at 4:25 PM Neal Richardson 
> wrote:
>
> > Sounds good to me.
> >
> > Do we have a release manager yet? Any volunteers?
> >
> > Neal
> >
> > On Tue, Sep 17, 2019 at 4:06 PM Wes McKinney  wrote:
> >
> > > hi all,
> > >
> > > It looks like we're drawing close to be able to make the 0.15.0
> > > release. I would suggest "pencils down" at the end of this week and
> > > see if a release candidate can be produced next Monday September 23.
> > > Any thoughts or objections?
> > >
> > > Thanks,
> > > Wes
> > >
> > > On Wed, Sep 11, 2019 at 11:23 AM Wes McKinney 
> > wrote:
> > > >
> > > > hi Eric -- yes, that's correct. I'm planning to amend the Format docs
> > > > today regarding the EOS issue and also update the C++ library
> > > >
> > > > On Wed, Sep 11, 2019 at 11:21 AM Eric Erhardt
> > > >  wrote:
> > > > >
> > > > > I assume the plan is to merge the ARROW-6313-flatbuffer-alignment
> > > branch into master before the 0.15 release, correct?
> > > > >
> > > > > BTW - I believe the C# alignment changes are ready to be merged into
> > > the alignment branch -  https://github.com/apache/arrow/pull/5280/
> > > > >
> > > > > Eric
> > > > >
> > > > > -Original Message-
> > > > > From: Micah Kornfield 
> > > > > Sent: Tuesday, September 10, 2019 10:24 PM
> > > > > To: Wes McKinney 
> > > > > Cc: dev ; niki.lj 
> > > > > Subject: Re: Timeline for 0.15.0 release
> > > > >
> > > > > I should have a little more bandwidth to help with some of the
> > > packaging starting tomorrow and going into the weekend.
> > > > >
> > > > > On Tuesday, September 10, 2019, Wes McKinney 
> > > wrote:
> > > > >
> > > > > > Hi folks,
> > > > > >
> > > > > > With the state of nightly packaging and integration builds things
> > > > > > aren't looking too good for being in release readiness by the end
> > of
> > > > > > this week but maybe I'm wrong. I'm planning to be working to close
> > as
> > > > > > many issues as I can and also to help with the ongoing alignment
> > > fixes.
> > > > > >
> > > > > > Wes
> > > > > >
> > > > > > On Thu, Sep 5, 2019, 11:07 PM Micah Kornfield <
> > emkornfi...@gmail.com
> > > >
> > > > > > wrote:
> > > > > >
> > > > > >> Just for reference [1] has a dashboard of the current issues:
> > > > > >>
> > > > > >>
> > > > > >> https://cwiki.apache.org/confluence/display/ARROW/Arrow+0.15.0+Release
> > > > > >>
> > > > > >> On Thu, Sep 5, 2019 at 3:43 PM Wes McKinney 
> > > wrote:
> > > > > >>
> > > > > >>> hi all,
> > > > > >>>
> > > > > >>> It doesn't seem like we're going to be in a position to release
> > at
> > > > > >>> the beginning of next week. I hope that one more week of work (or
> > > > > >>> less) will be enough to get us there. Aside from merging the
> > > > > >>> alignment changes, we need to make sure that our packaging jobs
> > > > > >>> required for the release candidate are all working.
> > > > > >>>
> > > > > >>> If folks could remove issues from the 0.15.0 backlog that they
> > > don't
> > > > > >>> think they will finish by end of next week that would help focus
> > > > > >>> efforts (there are currently 78 issues in 0.15.0 still). I am
> > > > > >>> looking to tackle a few small features related to dictionaries
> > > while
> > > > > >>> the release window is still open.
> > > > > >>>
> > > > > >>> - Wes
> > > > > >>>
> > > > > >>> On Tue, Aug 27, 2019 at 3:48 PM Wes McKinney <
> > wesmck...@gmail.com>
> > > > > >>> wrote:
> > > > > >>> >
> > > > > >>> > hi,
> > > > > >>> >
> > > > > >>> > I think we should try to release the week of September 9, so
> > > > > >>> > development work should be 

[jira] [Created] (ARROW-6602) [Doc] Add feature / implementation matrix

2019-09-18 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-6602:
-

 Summary: [Doc] Add feature / implementation matrix
 Key: ARROW-6602
 URL: https://issues.apache.org/jira/browse/ARROW-6602
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Antoine Pitrou
 Fix For: 1.0.0


We have many different implementations and each implementation makes a 
different set of features available. It would be nice to have a top-level doc 
page making it clear which implementation supports what.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Arrow sync call September 19 at 12:00 US/Eastern, 16:00 UTC

2019-09-18 Thread Wes McKinney
I'm unable to join today but hope that participants can review the
active DISCUSS threads

On Tue, Sep 17, 2019 at 11:28 PM Neal Richardson
 wrote:
>
> Hi all,
> Belated reminder that the biweekly Arrow call is coming up in less than 12
> hours at https://meet.google.com/vtm-teks-phx. All are welcome to join.
> Notes will be sent out to the mailing list afterwards.
>
> Neal


Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-18 Thread Wes McKinney
To be clear I think we should make these changes right after 0.15.0 is
released so we aren't playing whackamole with our packaging scripts.
I'm happy to take the lead on the work...

On Wed, Sep 18, 2019 at 9:54 AM Antoine Pitrou  wrote:
>
> On Wed, 18 Sep 2019 09:46:54 -0500
> Wes McKinney  wrote:
> > I think these are both interesting areas to explore further. I'd like
> > to focus on the couple of immediate items I think we should address
> >
> > * Should optional components be "opt in", "opt out", or a mix?
> > Currently it's a mix, and that's confusing for people. I think we
> > should make them all "opt in".
> > * Do we want to bring the out-of-the-box core build down to zero
> > dependencies, including not depending on boost::filesystem and
> > possibly checking in the compiled Flatbuffers files. While it may be
> > slightly more maintenance work, I think the optics of a
> > "dependency-free" core build would be beneficial and help the project
> > marketing-wise.
> >
> > Both of these issues must be addressed whether we undertake a Bazel
> > implementation or some other refactor of the C++ build system.
>
> I think checking in the Flatbuffers files (and also Protobuf and Thrift
> where applicable :-)) would be fine.
>
> As for boost::filesystem, getting rid of it wouldn't be a huge task.
> Still worth deciding whether we want to prioritize development time for
> it, because it's not entirely trivial either.
>
> Regards
>
> Antoine.
>
>


[jira] [Created] (ARROW-6600) [Java] Implement dictionary-encoded subfields for Union type

2019-09-18 Thread Ji Liu (Jira)
Ji Liu created ARROW-6600:
-

 Summary: [Java] Implement dictionary-encoded subfields for Union 
type
 Key: ARROW-6600
 URL: https://issues.apache.org/jira/browse/ARROW-6600
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Implement dictionary-encoded subfields for {{Union}} type. Each child vector 
could be encodable or not.

 

Meanwhile, extract common logic into {{DictionaryEncoder}}, as well as refactor 
List subfield encoding to keep it consistent with {{Struct/Union}} type.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-18 Thread Wes McKinney
I think these are both interesting areas to explore further. I'd like
to focus on the couple of immediate items I think we should address

* Should optional components be "opt in", "opt out", or a mix?
Currently it's a mix, and that's confusing for people. I think we
should make them all "opt in".
* Do we want to bring the out-of-the-box core build down to zero
dependencies, including not depending on boost::filesystem and
possibly checking in the compiled Flatbuffers files. While it may be
slightly more maintenance work, I think the optics of a
"dependency-free" core build would be beneficial and help the project
marketing-wise.

Both of these issues must be addressed whether we undertake a Bazel
implementation or some other refactor of the C++ build system.

On Wed, Sep 18, 2019 at 2:48 AM Uwe L. Korn  wrote:
>
> Hello Micah,
>
> I don't think we have explored using bazel yet. I would see it as a possible 
> modular alternative but as you mention it will be a lot of work and we would 
> probably need a mentor who is familiar with bazel, otherwise we probably end 
> up spending too much time on this and get a non-typical bazel setup.
>
> Uwe
>
> On Wed, Sep 18, 2019, at 8:44 AM, Micah Kornfield wrote:
> > It has come up in the past, but I wonder if exploring Bazel as a build
> > system with its very explicit dependency graph might help (I'm not sure
> > if something similar is available in CMake).
> >
> > This is also a lot of work, but could also potentially benefit the
> > developer experience because we can make unit tests depend on individual
> > compilable units instead of all of libarrow.  There are trade-offs here as
> > well in terms of public API coverage.
> >
> > On Tue, Sep 17, 2019 at 11:14 PM Uwe L. Korn  wrote:
> >
> > > Hello,
> > >
> > > I can think of two other alternatives that make it more visible what Arrow
> > > core is and what are the optional components:
> > >
> > > * Error out when no component is selected instead of building just the
> > > core Arrow. Here we could add an explanative message that list all
> > > components and for each component 2-3 words what it does and what it
> > > requires. This would make the first-time experience much better.
> > > * Split the CMake project into several subprojects. By correctly
> > > structuring the CMakefiles, we should be able to separate out the Arrow
> > > components into separate CMake projects that can be built independently if
> > > needed while all using the same third-party toolchain. We would still have
> > > a top-level CMakeLists.txt that is invoked just like the current one but
> > > through having subprojects, you would not anymore be bound to use the
> > > single top-level one. This would also have some benefit for packagers that
> > > could separate out the build of individual Arrow modules. Furthermore, it
> > > would also make it easier for PoC/academic projects to just take the Arrow
> > > Core sources and drop it in as a CMake subproject; while this is not a 
> > > good
> > > solution for production-grade software, it is quite common practice to do
> > > this in research.
> > > I really like this approach and I think this is something we should have
> > > as a long-term target, I'm also happy to implement given the time but I
> > > think one CMake refactor per year is the maximum I can do and that was
> > > already eaten up by the dependency detection. Also, I'm unsure about how
> > > much this would block us at the moment vs the marketing benefit of having 
> > > a
> > > more modular Arrow; currently I'm leaning on the side that the
> > > marketing/adoption benefit would be much larger but we lack someone
> > > frustration-tolerant to do the refactoring.
> > >
> > > Uwe
> > >
> > > On Wed, Sep 18, 2019, at 12:18 AM, Wes McKinney wrote:
> > > > hi folks,
> > > >
> > > > Lately there seem to be more and more people suggesting that the
> > > > optional components in the Arrow C++ project are getting in the way of
> > > > using the "core" which implements the columnar format and IPC
> > > > protocol. I am not sure I agree with this argument, but in general I
> > > > think it would be a good idea to make all optional components in the
> > > > project "opt in" rather than "opt out"
> > > >
> > > > To demonstrate where things currently stand, I created a Dockerfile to
> > > > try to make the smallest possible and most dependency-free build
> > > >
> > > >
> > > https://github.com/wesm/arrow/tree/cpp-minimal-dockerfile/dev/cpp_minimal
> > > >
> > > > Here is the output of this build
> > > >
> > > > https://gist.github.com/wesm/02328fbb463033ed486721b8265f755f
> > > >
> > > > First, let's look at the CMake invocation
> > > >
> > > > cmake .. -DBOOST_SOURCE=BUNDLED \
> > > > -DARROW_BOOST_USE_SHARED=OFF \
> > > > -DARROW_COMPUTE=OFF \
> > > > -DARROW_DATASET=OFF \
> > > > -DARROW_JEMALLOC=OFF \
> > > > -DARROW_JSON=ON \
> > > > -DARROW_USE_GLOG=OFF \
> > > > -DARROW_WITH_BZ2=OFF \
> > > > -DARROW_WITH_ZLIB=OFF \
> > > > 

[jira] [Created] (ARROW-6598) [Java] Sort the code for ApproxEqualsVisitor

2019-09-18 Thread Liya Fan (Jira)
Liya Fan created ARROW-6598:
---

 Summary: [Java] Sort the code for ApproxEqualsVisitor
 Key: ARROW-6598
 URL: https://issues.apache.org/jira/browse/ARROW-6598
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


As a follow-up to ARROW-6458, we finalize the code for 
ApproxEqualsVisitor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[NIGHTLY] Arrow Build Report for Job nightly-2019-09-18-0

2019-09-18 Thread Crossbow


Arrow Build Report for Job nightly-2019-09-18-0

All tasks: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0

Failed Tasks:
- docker-cpp-fuzzit:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-circle-docker-cpp-fuzzit
- docker-docs:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-circle-docker-docs
- docker-spark-integration:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-circle-docker-spark-integration
- conda-linux-gcc-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-azure-conda-linux-gcc-py36

Succeeded Tasks:
- ubuntu-disco:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-azure-ubuntu-disco
- docker-iwyu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-circle-docker-iwyu
- docker-go:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-circle-docker-go
- debian-stretch:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-azure-debian-stretch
- ubuntu-xenial:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-azure-ubuntu-xenial
- conda-linux-gcc-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-azure-conda-linux-gcc-py27
- conda-win-vs2015-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-azure-conda-win-vs2015-py37
- docker-cpp-release:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-circle-docker-cpp-release
- conda-osx-clang-py37:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-azure-conda-osx-clang-py37
- wheel-manylinux2010-cp37m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-travis-wheel-manylinux2010-cp37m
- docker-dask-integration:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-circle-docker-dask-integration
- wheel-manylinux1-cp27m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-travis-wheel-manylinux1-cp27m
- wheel-manylinux1-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-travis-wheel-manylinux1-cp36m
- wheel-win-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-appveyor-wheel-win-cp35m
- docker-r:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-circle-docker-r
- docker-js:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-circle-docker-js
- docker-c_glib:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-circle-docker-c_glib
- docker-python-3.6-nopandas:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-circle-docker-python-3.6-nopandas
- centos-7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-azure-centos-7
- centos-6:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-azure-centos-6
- docker-python-2.7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-circle-docker-python-2.7
- wheel-manylinux2010-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-travis-wheel-manylinux2010-cp36m
- conda-osx-clang-py27:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-azure-conda-osx-clang-py27
- wheel-manylinux2010-cp35m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-travis-wheel-manylinux2010-cp35m
- docker-python-3.7:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-circle-docker-python-3.7
- wheel-win-cp36m:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-appveyor-wheel-win-cp36m
- docker-cpp-cmake32:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-circle-docker-cpp-cmake32
- docker-pandas-master:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-circle-docker-pandas-master
- wheel-manylinux2010-cp27mu:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-travis-wheel-manylinux2010-cp27mu
- gandiva-jar-trusty:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-travis-gandiva-jar-trusty
- conda-osx-clang-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-azure-conda-osx-clang-py36
- conda-win-vs2015-py36:
  URL: 
https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2019-09-18-0-azure-conda-win-vs2015-py36
- wheel-manylinux1-cp37m:
  URL: 

[jira] [Created] (ARROW-6597) [Python] Segfault in test_pandas with Python 2.7

2019-09-18 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-6597:
-

 Summary: [Python] Segfault in test_pandas with Python 2.7
 Key: ARROW-6597
 URL: https://issues.apache.org/jira/browse/ARROW-6597
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


I get a segfault in test_pandas with Python 2.7.

gdb stack trace (excerpt):
{code}
Thread 27 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffb7fff700 (LWP 17725)]
0x7fffcac1a9f9 in arrow::py::internal::PyDate_from_int (val=10957, 
unit=arrow::DateUnit::DAY, out=0x55e1b9b0) at 
../src/arrow/python/datetime.cc:229
229   *out = PyDate_FromDate(static_cast(year), 
static_cast(month),
(gdb) bt
#0  0x7fffcac1a9f9 in arrow::py::internal::PyDate_from_int (val=10957, 
unit=arrow::DateUnit::DAY, out=0x55e1b9b0) at 
../src/arrow/python/datetime.cc:229
#1  0x7fffcabaed34 in arrow::Status 
arrow::py::ConvertDates(arrow::py::PandasOptions const&, 
arrow::ChunkedArray const&, _object**)::{lambda(int, 
_object**)#1}::operator()(int, _object**) const (this=0x7fffb7ffde90, 
value=10957, out=0x55e1b9b0) at ../src/arrow/python/arrow_to_pandas.cc:657
#2  0x7fffcabaeb8c in arrow::Status 
arrow::py::ConvertAsPyObjects(arrow::py::PandasOptions const&, 
arrow::ChunkedArray const&, _object**)::{lambda(int, 
_object**)#1}&>(arrow::py::PandasOptions const&, arrow::ChunkedArray const&, 
arrow::Status 
arrow::py::ConvertDates(arrow::py::PandasOptions const&, 
arrow::ChunkedArray const&, _object**)::{lambda(int, _object**)#1}&, 
_object**)::{lambda(int const&, _object**)#1}::operator()(int const, _object**) 
const (this=0x7fffb7ffdd88, value=@0x7fffb7ffdcbc: 10957, 
out_values=0x55e1b9b0)
at ../src/arrow/python/arrow_to_pandas.cc:417
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: How can I help?

2019-09-18 Thread SemanticBeeng .
Hi Weston,

Documenting your use cases would be a great help, imo.
If you are open to it, I am interested in helping with that.
I am looking to build some advanced POC.

Please advise
Thanks
Nick
https://twitter.com/semanticbeeng



On Tue, Sep 17, 2019 at 6:10 PM Weston Platter 
wrote:

> Hey there,
>
> I’ve had huge success using Arrow in production at my last couple jobs,
> and wanted to ask how I can help and give back.
>
> On twitter, Wes mentioned that there’s some work to be done with python
> packaging and wheels (
> https://twitter.com/wesmckinn/status/1174071228253929472). I’ve got some
> free time in the next 2-3 weeks and would be happy to pitch in where it
> makes the most sense. Are there a few small tasks I could get started on?
>
> Weston
>


Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-18 Thread Uwe L. Korn
Hello Micah,

I don't think we have explored using Bazel yet. I would see it as a possible 
modular alternative, but as you mention, it will be a lot of work and we would 
probably need a mentor who is familiar with Bazel; otherwise we would probably end 
up spending too much time on this and get a non-typical Bazel setup.

Uwe

On Wed, Sep 18, 2019, at 8:44 AM, Micah Kornfield wrote:
> It has come up in the past, but I wonder if exploring Bazel as a build
> system with its very explicit dependency graph might help (I'm not sure
> if something similar is available in CMake).
> 
> This is also a lot of work, but could also potentially benefit the
> developer experience because we can make unit tests depend on individual
> compilable units instead of all of libarrow.  There are trade-offs here as
> well in terms of public API coverage.
> 
> On Tue, Sep 17, 2019 at 11:14 PM Uwe L. Korn  wrote:
> 
> > Hello,
> >
> > I can think of two other alternatives that make it more visible what Arrow
> > core is and what are the optional components:
> >
> > * Error out when no component is selected instead of building just the
> > core Arrow. Here we could add an explanative message that list all
> > components and for each component 2-3 words what it does and what it
> > requires. This would make the first-time experience much better.
> > * Split the CMake project into several subprojects. By correctly
> > structuring the CMakefiles, we should be able to separate out the Arrow
> > components into separate CMake projects that can be built independently if
> > needed while all using the same third-party toolchain. We would still have
> > a top-level CMakeLists.txt that is invoked just like the current one but
> > through having subprojects, you would not anymore be bound to use the
> > single top-level one. This would also have some benefit for packagers that
> > could separate out the build of individual Arrow modules. Furthermore, it
> > would also make it easier for PoC/academic projects to just take the Arrow
> > Core sources and drop it in as a CMake subproject; while this is not a good
> > solution for production-grade software, it is quite common practice to do
> > this in research.
> > I really like this approach and I think this is something we should have
> > as a long-term target, I'm also happy to implement given the time but I
> > think one CMake refactor per year is the maximum I can do and that was
> > already eaten up by the dependency detection. Also, I'm unsure about how
> > much this would block us at the moment vs the marketing benefit of having a
> > more modular Arrow; currently I'm leaning on the side that the
> > marketing/adoption benefit would be much larger but we lack someone
> > frustration-tolerant to do the refactoring.
> >
> > Uwe
> >
> > On Wed, Sep 18, 2019, at 12:18 AM, Wes McKinney wrote:
> > > hi folks,
> > >
> > > Lately there seem to be more and more people suggesting that the
> > > optional components in the Arrow C++ project are getting in the way of
> > > using the "core" which implements the columnar format and IPC
> > > protocol. I am not sure I agree with this argument, but in general I
> > > think it would be a good idea to make all optional components in the
> > > project "opt in" rather than "opt out"
> > >
> > > To demonstrate where things currently stand, I created a Dockerfile to
> > > try to make the smallest possible and most dependency-free build
> > >
> > >
> > https://github.com/wesm/arrow/tree/cpp-minimal-dockerfile/dev/cpp_minimal
> > >
> > > Here is the output of this build
> > >
> > > https://gist.github.com/wesm/02328fbb463033ed486721b8265f755f
> > >
> > > First, let's look at the CMake invocation
> > >
> > > cmake .. -DBOOST_SOURCE=BUNDLED \
> > > -DARROW_BOOST_USE_SHARED=OFF \
> > > -DARROW_COMPUTE=OFF \
> > > -DARROW_DATASET=OFF \
> > > -DARROW_JEMALLOC=OFF \
> > > -DARROW_JSON=ON \
> > > -DARROW_USE_GLOG=OFF \
> > > -DARROW_WITH_BZ2=OFF \
> > > -DARROW_WITH_ZLIB=OFF \
> > > -DARROW_WITH_ZSTD=OFF \
> > > -DARROW_WITH_LZ4=OFF \
> > > -DARROW_WITH_SNAPPY=OFF \
> > > -DARROW_WITH_BROTLI=OFF \
> > > -DARROW_BUILD_UTILITIES=OFF
> > >
> > > Aside from the issue of how to obtain and link Boost, here's a couple of
> > things:
> > >
> > > * COMPUTE and DATASET IMHO should be off by default
> > > * All compression libraries should be turned off
> > > * GLOG should be off by default
> > > * Utilities should be off (they are used for integration testing)
> > > * Jemalloc should probably be off, but we should make it clear that
> > > opting in will yield better performance
> > >
> > > I found that it wasn't possible to set ARROW_JSON=OFF without breaking
> > > the build. I opened ARROW-6590 to fix this
> > >
> > > Aside from potentially changing these defaults, there's some things in
> > > the build that we might want to turn into optional pieces:
> > >
> > > * We should see if we can make boost::filesystem not mandatory in the
> > > barebones build, if only to satisfy the 

Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-18 Thread Micah Kornfield
It has come up in the past, but I wonder if exploring Bazel as a build
system with its very explicit dependency graph might help (I'm not sure
if something similar is available in CMake).

This is also a lot of work, but could also potentially benefit the
developer experience because we can make unit tests depend on individual
compilable units instead of all of libarrow.  There are trade-offs here as
well in terms of public API coverage.

On Tue, Sep 17, 2019 at 11:14 PM Uwe L. Korn  wrote:

> Hello,
>
> I can think of two other alternatives that make it more visible what Arrow
> core is and what are the optional components:
>
> * Error out when no component is selected instead of building just the
> core Arrow. Here we could add an explanative message that list all
> components and for each component 2-3 words what it does and what it
> requires. This would make the first-time experience much better.
> * Split the CMake project into several subprojects. By correctly
> structuring the CMakefiles, we should be able to separate out the Arrow
> components into separate CMake projects that can be built independently if
> needed while all using the same third-party toolchain. We would still have
> a top-level CMakeLists.txt that is invoked just like the current one but
> through having subprojects, you would not anymore be bound to use the
> single top-level one. This would also have some benefit for packagers that
> could separate out the build of individual Arrow modules. Furthermore, it
> would also make it easier for PoC/academic projects to just take the Arrow
> Core sources and drop it in as a CMake subproject; while this is not a good
> solution for production-grade software, it is quite common practice to do
> this in research.
> I really like this approach and I think this is something we should have
> as a long-term target, I'm also happy to implement given the time but I
> think one CMake refactor per year is the maximum I can do and that was
> already eaten up by the dependency detection. Also, I'm unsure about how
> much this would block us at the moment vs the marketing benefit of having a
> more modular Arrow; currently I'm leaning on the side that the
> marketing/adoption benefit would be much larger but we lack someone
> frustration-tolerant to do the refactoring.
>
> Uwe
>
> On Wed, Sep 18, 2019, at 12:18 AM, Wes McKinney wrote:
> > hi folks,
> >
> > Lately there seem to be more and more people suggesting that the
> > optional components in the Arrow C++ project are getting in the way of
> > using the "core" which implements the columnar format and IPC
> > protocol. I am not sure I agree with this argument, but in general I
> > think it would be a good idea to make all optional components in the
> > project "opt in" rather than "opt out"
> >
> > To demonstrate where things currently stand, I created a Dockerfile to
> > try to make the smallest possible and most dependency-free build
> >
> >
> https://github.com/wesm/arrow/tree/cpp-minimal-dockerfile/dev/cpp_minimal
> >
> > Here is the output of this build
> >
> > https://gist.github.com/wesm/02328fbb463033ed486721b8265f755f
> >
> > First, let's look at the CMake invocation
> >
> > cmake .. -DBOOST_SOURCE=BUNDLED \
> > -DARROW_BOOST_USE_SHARED=OFF \
> > -DARROW_COMPUTE=OFF \
> > -DARROW_DATASET=OFF \
> > -DARROW_JEMALLOC=OFF \
> > -DARROW_JSON=ON \
> > -DARROW_USE_GLOG=OFF \
> > -DARROW_WITH_BZ2=OFF \
> > -DARROW_WITH_ZLIB=OFF \
> > -DARROW_WITH_ZSTD=OFF \
> > -DARROW_WITH_LZ4=OFF \
> > -DARROW_WITH_SNAPPY=OFF \
> > -DARROW_WITH_BROTLI=OFF \
> > -DARROW_BUILD_UTILITIES=OFF
> >
> > Aside from the issue of how to obtain and link Boost, here's a couple of
> things:
> >
> > * COMPUTE and DATASET IMHO should be off by default
> > * All compression libraries should be turned off
> > * GLOG should be off by default
> > * Utilities should be off (they are used for integration testing)
> > * Jemalloc should probably be off, but we should make it clear that
> > opting in will yield better performance
> >
> > I found that it wasn't possible to set ARROW_JSON=OFF without breaking
> > the build. I opened ARROW-6590 to fix this
> >
> > Aside from potentially changing these defaults, there's some things in
> > the build that we might want to turn into optional pieces:
> >
> > * We should see if we can make boost::filesystem not mandatory in the
> > barebones build, if only to satisfy the peanut gallery
> > * double-conversion is used in the CSV module. I think that
> > double-conversion_ep and the CSV module should both be made opt-in
> > * rapidjson_ep should be made optional. JSON support is only needed
> > for integration testing
> >
> > We could also discuss vendoring flatbuffers.h so that flatbuffers_ep
> > is not mandatory.
> >
> > In general, enabling optional components is primarily relevant for
> > packagers. If we implement these changes, a number of package build
> > scripts will have to change.
> >
> > Thanks,
> > Wes
> >
>


[jira] [Created] (ARROW-6596) Getting "Cannot call io___MemoryMappedFile__Open()" error while reading a parquet file

2019-09-18 Thread Addhyan (Jira)
Addhyan created ARROW-6596:
--

 Summary: Getting "Cannot call io___MemoryMappedFile__Open()" error 
while reading a parquet file
 Key: ARROW-6596
 URL: https://issues.apache.org/jira/browse/ARROW-6596
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 0.14.1
 Environment: ubuntu 18.04
Reporter: Addhyan
 Fix For: 0.14.1


I am using r/Dockerfile to get all the R dependencies and then working backwards to 
get arrow/r working on Linux (either Ubuntu or Debian), but it continuously gives 
me this error:

Error in io___MemoryMappedFile__Open(fs::path_abs(path), mode) : 

  Cannot call io___MemoryMappedFile__Open()

I have installed all the required cpp libraries as mentioned here: 
[https://arrow.apache.org/install/] under "Ubuntu 18.04 LTS or later".  I have 
also tried to use 
[cpp/Dockerfile|https://github.com/apache/arrow/blob/master/cpp/Dockerfile] and 
then followed backwards without any luck. The error is consistent and doesn't 
go away. 

I am trying to build a Docker image with a Dockerfile containing everything that 
Arrow needs (all the C++ libraries, etc.).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-18 Thread Uwe L. Korn
Hello,

I can think of two other alternatives that make it more visible what the Arrow core 
is and what the optional components are:

* Error out when no component is selected instead of building just the core 
Arrow. Here we could add an explanatory message that lists all components and, 
for each component, 2-3 words on what it does and what it requires. This would make 
the first-time experience much better.
* Split the CMake project into several subprojects. By correctly structuring 
the CMakefiles, we should be able to separate out the Arrow components into 
separate CMake projects that can be built independently if needed while all 
using the same third-party toolchain. We would still have a top-level 
CMakeLists.txt that is invoked just like the current one but through having 
subprojects, you would not anymore be bound to use the single top-level one. 
This would also have some benefit for packagers that could separate out the 
build of individual Arrow modules. Furthermore, it would also make it easier 
for PoC/academic projects to just take the Arrow Core sources and drop it in as 
a CMake subproject; while this is not a good solution for production-grade 
software, it is quite common practice to do this in research.
I really like this approach and I think this is something we should have as a 
long-term target. I'm also happy to implement it given the time, but I think one 
CMake refactor per year is the maximum I can do and that was already eaten up 
by the dependency detection. Also, I'm unsure about how much this would block 
us at the moment vs. the marketing benefit of having a more modular Arrow; 
currently I'm leaning towards the side that the marketing/adoption benefit would be 
much larger, but we lack someone frustration-tolerant enough to do the refactoring.

Uwe

On Wed, Sep 18, 2019, at 12:18 AM, Wes McKinney wrote:
> hi folks,
> 
> Lately there seem to be more and more people suggesting that the
> optional components in the Arrow C++ project are getting in the way of
> using the "core" which implements the columnar format and IPC
> protocol. I am not sure I agree with this argument, but in general I
> think it would be a good idea to make all optional components in the
> project "opt in" rather than "opt out"
> 
> To demonstrate where things currently stand, I created a Dockerfile to
> try to make the smallest possible and most dependency-free build
> 
> https://github.com/wesm/arrow/tree/cpp-minimal-dockerfile/dev/cpp_minimal
> 
> Here is the output of this build
> 
> https://gist.github.com/wesm/02328fbb463033ed486721b8265f755f
> 
> First, let's look at the CMake invocation
> 
> cmake .. -DBOOST_SOURCE=BUNDLED \
> -DARROW_BOOST_USE_SHARED=OFF \
> -DARROW_COMPUTE=OFF \
> -DARROW_DATASET=OFF \
> -DARROW_JEMALLOC=OFF \
> -DARROW_JSON=ON \
> -DARROW_USE_GLOG=OFF \
> -DARROW_WITH_BZ2=OFF \
> -DARROW_WITH_ZLIB=OFF \
> -DARROW_WITH_ZSTD=OFF \
> -DARROW_WITH_LZ4=OFF \
> -DARROW_WITH_SNAPPY=OFF \
> -DARROW_WITH_BROTLI=OFF \
> -DARROW_BUILD_UTILITIES=OFF
> 
> Aside from the issue of how to obtain and link Boost, here's a couple of 
> things:
> 
> * COMPUTE and DATASET IMHO should be off by default
> * All compression libraries should be turned off
> * GLOG should be off by default
> * Utilities should be off (they are used for integration testing)
> * Jemalloc should probably be off, but we should make it clear that
> opting in will yield better performance
> 
> I found that it wasn't possible to set ARROW_JSON=OFF without breaking
> the build. I opened ARROW-6590 to fix this
> 
> Aside from potentially changing these defaults, there's some things in
> the build that we might want to turn into optional pieces:
> 
> * We should see if we can make boost::filesystem not mandatory in the
> barebones build, if only to satisfy the peanut gallery
> * double-conversion is used in the CSV module. I think that
> double-conversion_ep and the CSV module should both be made opt-in
> * rapidjson_ep should be made optional. JSON support is only needed
> for integration testing
> 
> We could also discuss vendoring flatbuffers.h so that flatbuffers_ep
> is not mandatory.
> 
> In general, enabling optional components is primarily relevant for
> packagers. If we implement these changes, a number of package build
> scripts will have to change.
> 
> Thanks,
> Wes
>