Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-20 Thread Antoine Pitrou


Yes, I don't think we should go the full way of separating Arrow into
micro-components.  The IO and IPC layers aren't heavyweight.  We should
simply address the most often-quoted annoyances.

Regards

Antoine.


Le 20/09/2019 à 17:41, Wes McKinney a écrit :
> Implementing the format fully requires memory management and IO
> interfaces (i.e. arrow/io/{file.h, interfaces.h, memory.h}). So those
> parts are not separable.
> 
> On Fri, Sep 20, 2019 at 10:36 AM Neal Richardson
>  wrote:
>>
>> I wonder if having a core "format" C++ library, which the io, compute,
>> etc. library/libraries would depend on, is a natural step.
>> Particularly since we're coming up on 1.0 and the format is being
>> declared stable.
>>
>> Neal
>>
>> On Fri, Sep 20, 2019 at 8:28 AM Wes McKinney  wrote:
>>>
>>> We would have to be even more careful about managing symbol exports.
>>> Third party projects would need to link more libraries in their
>>> applications (not unlike the way that Boost works now -- I suppose
>>> that Boost is the closest analogue to what we're going for)
>>>
>>> On Fri, Sep 20, 2019 at 2:30 AM Micah Kornfield  
>>> wrote:
>>>>>
>>>>> We could indeed split up libarrow into more shared libraries. This
>>>>> would mean accepting a lot more maintenance effort though, on a team
>>>>> that is already overburdened. I'm not too keen on that in the short
>>>>> term.
>>>>
>>>>
>>>> Something for longer term to think about.  What are you seeing as the
>>>> added maintenance here?
>>>>
>>>>
>>>> On Thu, Sep 19, 2019 at 5:38 PM Wes McKinney  wrote:
>>>>>
>>>>> hi Micah,
>>>>>
>>>>>
>>>>> On Thu, Sep 19, 2019 at 12:41 AM Micah Kornfield
>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> * Should optional components be "opt in", "opt out", or a mix?
>>>>>>> Currently it's a mix, and that's confusing for people. I think we
>>>>>>> should make them all "opt in".
>>>>>>
>>>>>> Agreed they should all be opt in by default.  I think active developers
>>>>>> are quite adept at flipping the appropriate CMake flags.
>>>>>>
>>>>>
>>>>> Cool. I opened a tracking JIRA
>>>>> https://issues.apache.org/jira/browse/ARROW-6637 and attached many
>>>>> issues. Sorry for the new JIRA flood
>>>>>
>>>>>>
>>>>>>> * Do we want to bring the out-of-the-box core build down to zero
>>>>>>> dependencies, including not depending on boost::filesystem and
>>>>>>> possibly checking in the compiled Flatbuffers files. While it may be
>>>>>>> slightly more maintenance work, I think the optics of a
>>>>>>> "dependency-free" core build would be beneficial and help the project
>>>>>>> marketing-wise.
>>>>>>
>>>>>> I'm -.5 on checking in generated artifacts but this is mostly stylistic.
>>>>>> In the case of flatbuffers it seems like we might be able to get away with
>>>>>> vendoring since it should mostly be headers only.
>>>>>>
>>>>>> I would prefer to try to come up with more granular components and be
>>>>>> very conservative on what is "core".  I think it should be possible to have a
>>>>>> zero dependency build if only MemoryPool, Buffers, Arrays and ArrayBuilders
>>>>>> are in a core package [1].  This combined with the discussion Antoine started
>>>>>> on an ABI compatible C-layer would make basic inter-op within a process
>>>>>> reasonable.  Moving up the stack to IPC and files, there is probably a way
>>>>>> to package headers separately from implementations.  This would allow other
>>>>>> projects wishing to integrate with Arrow to bring their own implementations
>>>>>> without the baggage of boost::filesystem. Would this leave anything besides
>>>>>> "flatbuffers" as a hard dependency to support IPC?
>>>>>>
>>>>>
>>>>> We could indeed split up libarrow into more shared libraries. This
>>>>> would mean accepting a lot more maintenance effort though, on a team
>>>>> that is already overburdened. I'm not too keen on that in the short
>>>>> term.
>>>>>
>>>>>> Thanks,
>>>>>> Micah
>>>>>>
>>>>>>
>>>>>> [1] It probably makes sense to go even further and separate out MemoryPool
>>>>>> and Buffer, so we can break the circular relationship between parquet and
>>>>>> arrow.
>>>>>
>>>>> Don't think this is possible even then, particularly in light of my
>>>>> recent work reading and writing Arrow columnar data "closer to the
>>>>> metal"  inside Parquet, yielding beneficial performance improvements.
>>>>>
>>>>>>
>>>>>> On Wed, Sep 18, 2019 at 8:03 AM Wes McKinney  wrote:
>>>>>>
>>>>>>> To be clear I think we should make these changes right after 0.15.0 is
>>>>>>> released so we aren't playing whackamole with our packaging scripts.
>>>>>>> I'm happy to take the lead on the work...
>>>>>>>
>>>>>>> On Wed, Sep 18, 2019 at 9:54 AM Antoine Pitrou
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> On Wed, 18 Sep 2019 09:46:54 -0500
>>>>>>>> Wes McKinney  wrote:
>>>>>>>>> I think these are both interesting areas to explore further. I'd like
>>>>>>>>> to focus on the couple of immediate items I think we should address

Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-20 Thread Wes McKinney
Implementing the format fully requires memory management and IO
interfaces (i.e. arrow/io/{file.h, interfaces.h, memory.h}). So those
parts are not separable.

On Fri, Sep 20, 2019 at 10:36 AM Neal Richardson
 wrote:
>
> I wonder if having a core "format" C++ library, which the io, compute,
> etc. library/libraries would depend on, is a natural step.
> Particularly since we're coming up on 1.0 and the format is being
> declared stable.
>
> Neal
>
> On Fri, Sep 20, 2019 at 8:28 AM Wes McKinney  wrote:
> >
> > We would have to be even more careful about managing symbol exports.
> > Third party projects would need to link more libraries in their
> > applications (not unlike the way that Boost works now -- I suppose
> > that Boost is the closest analogue to what we're going for)
> >
> > On Fri, Sep 20, 2019 at 2:30 AM Micah Kornfield  
> > wrote:
> > >>
> > >> We could indeed split up libarrow into more shared libraries. This
> > >> would mean accepting a lot more maintenance effort though, on a team
> > >> that is already overburdened. I'm not too keen on that in the short
> > >> term.
> > >
> > >
> > > Something for longer term to think about.  What are you seeing as the 
> > > added maintenance here?
> > >
> > >
> > > On Thu, Sep 19, 2019 at 5:38 PM Wes McKinney  wrote:
> > >>
> > >> hi Micah,
> > >>
> > >>
> > >> On Thu, Sep 19, 2019 at 12:41 AM Micah Kornfield  
> > >> wrote:
> > >> >
> > >> > >
> > >> > > * Should optional components be "opt in", "opt out", or a mix?
> > >> > > Currently it's a mix, and that's confusing for people. I think we
> > >> > > should make them all "opt in".
> > >> >
> > >> > Agreed they should all be opt in by default.  I think active developers
> > >> > are quite adept at flipping the appropriate CMake flags.
> > >> >
> > >>
> > >> Cool. I opened a tracking JIRA
> > >> https://issues.apache.org/jira/browse/ARROW-6637 and attached many
> > >> issues. Sorry for the new JIRA flood
> > >>
> > >> >
> > >> > > * Do we want to bring the out-of-the-box core build down to zero
> > >> > > dependencies, including not depending on boost::filesystem and
> > >> > > possibly checking in the compiled Flatbuffers files. While it may be
> > >> > > slightly more maintenance work, I think the optics of a
> > >> > > "dependency-free" core build would be beneficial and help the project
> > >> > > marketing-wise.
> > >> >
> > >> > I'm -.5 on checking in generated artifacts but this is mostly stylistic.
> > >> > In the case of flatbuffers it seems like we might be able to get away with
> > >> > vendoring since it should mostly be headers only.
> > >> >
> > >> > I would prefer to try to come up with more granular components and be
> > >> > very conservative on what is "core".  I think it should be possible to have a
> > >> > zero dependency build if only MemoryPool, Buffers, Arrays and ArrayBuilders
> > >> > are in a core package [1].  This combined with the discussion Antoine started
> > >> > on an ABI compatible C-layer would make basic inter-op within a process
> > >> > reasonable.  Moving up the stack to IPC and files, there is probably a way
> > >> > to package headers separately from implementations.  This would allow other
> > >> > projects wishing to integrate with Arrow to bring their own implementations
> > >> > without the baggage of boost::filesystem. Would this leave anything besides
> > >> > "flatbuffers" as a hard dependency to support IPC?
> > >> >
> > >>
> > >> We could indeed split up libarrow into more shared libraries. This
> > >> would mean accepting a lot more maintenance effort though, on a team
> > >> that is already overburdened. I'm not too keen on that in the short
> > >> term.
> > >>
> > >> > Thanks,
> > >> > Micah
> > >> >
> > >> >
> > >> > [1] It probably makes sense to go even further and separate out 
> > >> > MemoryPool
> > >> > and Buffer, so we can break the circular relationship between parquet 
> > >> > and
> > >> > arrow.
> > >>
> > >> Don't think this is possible even then, particularly in light of my
> > >> recent work reading and writing Arrow columnar data "closer to the
> > >> metal"  inside Parquet, yielding beneficial performance improvements.
> > >>
> > >> >
> > >> > On Wed, Sep 18, 2019 at 8:03 AM Wes McKinney  
> > >> > wrote:
> > >> >
> > >> > > To be clear I think we should make these changes right after 0.15.0 
> > >> > > is
> > >> > > released so we aren't playing whackamole with our packaging scripts.
> > >> > > I'm happy to take the lead on the work...
> > >> > >
> > >> > > On Wed, Sep 18, 2019 at 9:54 AM Antoine Pitrou 
> > >> > > wrote:
> > >> > > >
> > >> > > > On Wed, 18 Sep 2019 09:46:54 -0500
> > >> > > > Wes McKinney  wrote:
> > >> > > > > I think these are both interesting areas to explore further. I'd 
> > >> > > > > like
> > >> > > > > to focus on the couple of immediate items I think we should 
> > >> > > > > address
> > >> > > > >
> > 

Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-20 Thread Neal Richardson
I wonder if having a core "format" C++ library, which the io, compute,
etc. library/libraries would depend on, is a natural step.
Particularly since we're coming up on 1.0 and the format is being
declared stable.

Neal

On Fri, Sep 20, 2019 at 8:28 AM Wes McKinney  wrote:
>
> We would have to be even more careful about managing symbol exports.
> Third party projects would need to link more libraries in their
> applications (not unlike the way that Boost works now -- I suppose
> that Boost is the closest analogue to what we're going for)
>
> On Fri, Sep 20, 2019 at 2:30 AM Micah Kornfield  wrote:
> >>
> >> We could indeed split up libarrow into more shared libraries. This
> >> would mean accepting a lot more maintenance effort though, on a team
> >> that is already overburdened. I'm not too keen on that in the short
> >> term.
> >
> >
> > Something for longer term to think about.  What are you seeing as the added 
> > maintenance here?
> >
> >
> > On Thu, Sep 19, 2019 at 5:38 PM Wes McKinney  wrote:
> >>
> >> hi Micah,
> >>
> >>
> >> On Thu, Sep 19, 2019 at 12:41 AM Micah Kornfield  
> >> wrote:
> >> >
> >> > >
> >> > > * Should optional components be "opt in", "opt out", or a mix?
> >> > > Currently it's a mix, and that's confusing for people. I think we
> >> > > should make them all "opt in".
> >> >
> >> > Agreed they should all be opt in by default.  I think active developers
> >> > are quite adept at flipping the appropriate CMake flags.
> >> >
> >>
> >> Cool. I opened a tracking JIRA
> >> https://issues.apache.org/jira/browse/ARROW-6637 and attached many
> >> issues. Sorry for the new JIRA flood
> >>
> >> >
> >> > > * Do we want to bring the out-of-the-box core build down to zero
> >> > > dependencies, including not depending on boost::filesystem and
> >> > > possibly checking in the compiled Flatbuffers files. While it may be
> >> > > slightly more maintenance work, I think the optics of a
> >> > > "dependency-free" core build would be beneficial and help the project
> >> > > marketing-wise.
> >> >
> >> > I'm -.5 on checking in generated artifacts but this is mostly stylistic.
> >> > In the case of flatbuffers it seems like we might be able to get away with
> >> > vendoring since it should mostly be headers only.
> >> >
> >> > I would prefer to try to come up with more granular components and be
> >> > very conservative on what is "core".  I think it should be possible to have a
> >> > zero dependency build if only MemoryPool, Buffers, Arrays and ArrayBuilders
> >> > are in a core package [1].  This combined with the discussion Antoine started
> >> > on an ABI compatible C-layer would make basic inter-op within a process
> >> > reasonable.  Moving up the stack to IPC and files, there is probably a way
> >> > to package headers separately from implementations.  This would allow other
> >> > projects wishing to integrate with Arrow to bring their own implementations
> >> > without the baggage of boost::filesystem. Would this leave anything besides
> >> > "flatbuffers" as a hard dependency to support IPC?
> >> >
> >>
> >> We could indeed split up libarrow into more shared libraries. This
> >> would mean accepting a lot more maintenance effort though, on a team
> >> that is already overburdened. I'm not too keen on that in the short
> >> term.
> >>
> >> > Thanks,
> >> > Micah
> >> >
> >> >
> >> > [1] It probably makes sense to go even further and separate out 
> >> > MemoryPool
> >> > and Buffer, so we can break the circular relationship between parquet and
> >> > arrow.
> >>
> >> Don't think this is possible even then, particularly in light of my
> >> recent work reading and writing Arrow columnar data "closer to the
> >> metal"  inside Parquet, yielding beneficial performance improvements.
> >>
> >> >
> >> > On Wed, Sep 18, 2019 at 8:03 AM Wes McKinney  wrote:
> >> >
> >> > > To be clear I think we should make these changes right after 0.15.0 is
> >> > > released so we aren't playing whackamole with our packaging scripts.
> >> > > I'm happy to take the lead on the work...
> >> > >
> >> > > On Wed, Sep 18, 2019 at 9:54 AM Antoine Pitrou 
> >> > > wrote:
> >> > > >
> >> > > > On Wed, 18 Sep 2019 09:46:54 -0500
> >> > > > Wes McKinney  wrote:
> >> > > > > I think these are both interesting areas to explore further. I'd 
> >> > > > > like
> >> > > > > to focus on the couple of immediate items I think we should address
> >> > > > >
> >> > > > > * Should optional components be "opt in", "opt out", or a mix?
> >> > > > > Currently it's a mix, and that's confusing for people. I think we
> >> > > > > should make them all "opt in".
> >> > > > > * Do we want to bring the out-of-the-box core build down to zero
> >> > > > > dependencies, including not depending on boost::filesystem and
> >> > > > > possibly checking in the compiled Flatbuffers files. While it may be
> >> > > > > slightly more maintenance work, I think the optics of a
> >> > > > > 

Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-20 Thread Wes McKinney
We would have to be even more careful about managing symbol exports.
Third party projects would need to link more libraries in their
applications (not unlike the way that Boost works now -- I suppose
that Boost is the closest analogue to what we're going for)
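
To illustrate the extra linking burden described above, here is a minimal
Python sketch. The library names (arrow_core, arrow_io, arrow_ipc) and their
dependency edges are hypothetical, chosen only for illustration; they are not
an actual or proposed split of libarrow. Under that assumption, a third-party
project would have to name every transitive library on its link line, with
dependents listed before the libraries they depend on:

```python
# Hypothetical split of libarrow into several libraries. The names and
# edges are illustrative assumptions, not a real or proposed layout.
SPLIT_LIBS = {
    "arrow_core": [],
    "arrow_io": ["arrow_core"],
    "arrow_ipc": ["arrow_io", "arrow_core"],
}


def link_line(lib):
    """Return -l flags for `lib` and its transitive dependencies.

    Dependents come before their dependencies, which is the order a
    static link usually expects.
    """
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in SPLIT_LIBS[name]:
            visit(dep)
        order.append(name)  # post-order: dependencies are appended first

    visit(lib)
    return ["-l" + name for name in reversed(order)]


if __name__ == "__main__":
    print(" ".join(link_line("arrow_ipc")))
```

With a single libarrow, the same project only needs -larrow.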

On Fri, Sep 20, 2019 at 2:30 AM Micah Kornfield  wrote:
>>
>> We could indeed split up libarrow into more shared libraries. This
>> would mean accepting a lot more maintenance effort though, on a team
>> that is already overburdened. I'm not too keen on that in the short
>> term.
>
>
> Something for longer term to think about.  What are you seeing as the added 
> maintenance here?
>
>
> On Thu, Sep 19, 2019 at 5:38 PM Wes McKinney  wrote:
>>
>> hi Micah,
>>
>>
>> On Thu, Sep 19, 2019 at 12:41 AM Micah Kornfield  
>> wrote:
>> >
>> > >
>> > > * Should optional components be "opt in", "opt out", or a mix?
>> > > Currently it's a mix, and that's confusing for people. I think we
>> > > should make them all "opt in".
>> >
>> > Agreed they should all be opt in by default.  I think active developers are
>> > quite adept at flipping the appropriate CMake flags.
>> >
>>
>> Cool. I opened a tracking JIRA
>> https://issues.apache.org/jira/browse/ARROW-6637 and attached many
>> issues. Sorry for the new JIRA flood
>>
>> >
>> > > * Do we want to bring the out-of-the-box core build down to zero
>> > > dependencies, including not depending on boost::filesystem and
>> > > possibly checking in the compiled Flatbuffers files. While it may be
>> > > slightly more maintenance work, I think the optics of a
>> > > "dependency-free" core build would be beneficial and help the project
>> > > marketing-wise.
>> >
>> > I'm -.5 on checking in generated artifacts but this is mostly stylistic.
>> > In the case of flatbuffers it seems like we might be able to get away with
>> > vendoring since it should mostly be headers only.
>> >
>> > I would prefer to try to come up with more granular components and be
>> > very conservative on what is "core".  I think it should be possible to have a
>> > zero dependency build if only MemoryPool, Buffers, Arrays and ArrayBuilders
>> > are in a core package [1].  This combined with the discussion Antoine started
>> > on an ABI compatible C-layer would make basic inter-op within a process
>> > reasonable.  Moving up the stack to IPC and files, there is probably a way
>> > to package headers separately from implementations.  This would allow other
>> > projects wishing to integrate with Arrow to bring their own implementations
>> > without the baggage of boost::filesystem. Would this leave anything besides
>> > "flatbuffers" as a hard dependency to support IPC?
>> >
>>
>> We could indeed split up libarrow into more shared libraries. This
>> would mean accepting a lot more maintenance effort though, on a team
>> that is already overburdened. I'm not too keen on that in the short
>> term.
>>
>> > Thanks,
>> > Micah
>> >
>> >
>> > [1] It probably makes sense to go even further and separate out MemoryPool
>> > and Buffer, so we can break the circular relationship between parquet and
>> > arrow.
>>
>> Don't think this is possible even then, particularly in light of my
>> recent work reading and writing Arrow columnar data "closer to the
>> metal"  inside Parquet, yielding beneficial performance improvements.
>>
>> >
>> > On Wed, Sep 18, 2019 at 8:03 AM Wes McKinney  wrote:
>> >
>> > > To be clear I think we should make these changes right after 0.15.0 is
>> > > released so we aren't playing whackamole with our packaging scripts.
>> > > I'm happy to take the lead on the work...
>> > >
>> > > On Wed, Sep 18, 2019 at 9:54 AM Antoine Pitrou 
>> > > wrote:
>> > > >
>> > > > On Wed, 18 Sep 2019 09:46:54 -0500
>> > > > Wes McKinney  wrote:
>> > > > > I think these are both interesting areas to explore further. I'd like
>> > > > > to focus on the couple of immediate items I think we should address
>> > > > >
>> > > > > * Should optional components be "opt in", "opt out", or a mix?
>> > > > > Currently it's a mix, and that's confusing for people. I think we
>> > > > > should make them all "opt in".
>> > > > > * Do we want to bring the out-of-the-box core build down to zero
>> > > > > dependencies, including not depending on boost::filesystem and
>> > > > > possibly checking in the compiled Flatbuffers files. While it may be
>> > > > > slightly more maintenance work, I think the optics of a
>> > > > > "dependency-free" core build would be beneficial and help the project
>> > > > > marketing-wise.
>> > > > >
>> > > > > Both of these issues must be addressed whether we undertake a Bazel
>> > > > > implementation or some other refactor of the C++ build system.
>> > > >
>> > > > I think checking in the Flatbuffers files (and also Protobuf and Thrift
>> > > > where applicable :-)) would be fine.
>> > > >
>> > > > As for boost::filesystem, getting rid of it wouldn't be a huge task.
>> > > > Still worth deciding whether we want to prioritize development time for
>> > > > it, 

Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-20 Thread Micah Kornfield
>
> We could indeed split up libarrow into more shared libraries. This
> would mean accepting a lot more maintenance effort though, on a team
> that is already overburdened. I'm not too keen on that in the short
> term.


Something for longer term to think about.  What are you seeing as the added
maintenance here?


On Thu, Sep 19, 2019 at 5:38 PM Wes McKinney  wrote:

> hi Micah,
>
>
> On Thu, Sep 19, 2019 at 12:41 AM Micah Kornfield 
> wrote:
> >
> > >
> > > * Should optional components be "opt in", "opt out", or a mix?
> > > Currently it's a mix, and that's confusing for people. I think we
> > > should make them all "opt in".
> >
> > Agreed they should all be opt in by default.  I think active developers are
> > quite adept at flipping the appropriate CMake flags.
> >
>
> Cool. I opened a tracking JIRA
> https://issues.apache.org/jira/browse/ARROW-6637 and attached many
> issues. Sorry for the new JIRA flood
>
> >
> > > * Do we want to bring the out-of-the-box core build down to zero
> > > dependencies, including not depending on boost::filesystem and
> > > possibly checking in the compiled Flatbuffers files. While it may be
> > > slightly more maintenance work, I think the optics of a
> > > "dependency-free" core build would be beneficial and help the project
> > > marketing-wise.
> >
> > I'm -.5 on checking in generated artifacts but this is mostly stylistic.
> > In the case of flatbuffers it seems like we might be able to get away with
> > vendoring since it should mostly be headers only.
> >
> > I would prefer to try to come up with more granular components and be
> > very conservative on what is "core".  I think it should be possible to have a
> > zero dependency build if only MemoryPool, Buffers, Arrays and ArrayBuilders
> > are in a core package [1].  This combined with the discussion Antoine started
> > on an ABI compatible C-layer would make basic inter-op within a process
> > reasonable.  Moving up the stack to IPC and files, there is probably a way
> > to package headers separately from implementations.  This would allow other
> > projects wishing to integrate with Arrow to bring their own implementations
> > without the baggage of boost::filesystem. Would this leave anything besides
> > "flatbuffers" as a hard dependency to support IPC?
> >
>
> We could indeed split up libarrow into more shared libraries. This
> would mean accepting a lot more maintenance effort though, on a team
> that is already overburdened. I'm not too keen on that in the short
> term.
>
> > Thanks,
> > Micah
> >
> >
> > [1] It probably makes sense to go even further and separate out
> MemoryPool
> > and Buffer, so we can break the circular relationship between parquet and
> > arrow.
>
> Don't think this is possible even then, particularly in light of my
> recent work reading and writing Arrow columnar data "closer to the
> metal"  inside Parquet, yielding beneficial performance improvements.
>
> >
> > On Wed, Sep 18, 2019 at 8:03 AM Wes McKinney 
> wrote:
> >
> > > To be clear I think we should make these changes right after 0.15.0 is
> > > released so we aren't playing whackamole with our packaging scripts.
> > > I'm happy to take the lead on the work...
> > >
> > > On Wed, Sep 18, 2019 at 9:54 AM Antoine Pitrou 
> > > wrote:
> > > >
> > > > On Wed, 18 Sep 2019 09:46:54 -0500
> > > > Wes McKinney  wrote:
> > > > > I think these are both interesting areas to explore further. I'd
> like
> > > > > to focus on the couple of immediate items I think we should address
> > > > >
> > > > > > * Should optional components be "opt in", "opt out", or a mix?
> > > > > > Currently it's a mix, and that's confusing for people. I think we
> > > > > > should make them all "opt in".
> > > > > > * Do we want to bring the out-of-the-box core build down to zero
> > > > > > dependencies, including not depending on boost::filesystem and
> > > > > > possibly checking in the compiled Flatbuffers files. While it may be
> > > > > slightly more maintenance work, I think the optics of a
> > > > > "dependency-free" core build would be beneficial and help the
> project
> > > > > marketing-wise.
> > > > >
> > > > > Both of these issues must be addressed whether we undertake a Bazel
> > > > > implementation or some other refactor of the C++ build system.
> > > >
> > > > I think checking in the Flatbuffers files (and also Protobuf and
> Thrift
> > > > where applicable :-)) would be fine.
> > > >
> > > > As for boost::filesystem, getting rid of it wouldn't be a huge task.
> > > > Still worth deciding whether we want to prioritize development time
> for
> > > > it, because it's not entirely trivial either.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > >
>


Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-19 Thread Wes McKinney
hi Micah,


On Thu, Sep 19, 2019 at 12:41 AM Micah Kornfield  wrote:
>
> >
> > * Should optional components be "opt in", "opt out", or a mix?
> > Currently it's a mix, and that's confusing for people. I think we
> > should make them all "opt in".
>
> Agreed they should all be opt in by default.  I think active developers are
> quite adept at flipping the appropriate CMake flags.
>

Cool. I opened a tracking JIRA
https://issues.apache.org/jira/browse/ARROW-6637 and attached many
issues. Sorry for the new JIRA flood

>
> > * Do we want to bring the out-of-the-box core build down to zero
> > dependencies, including not depending on boost::filesystem and
> > possibly checking in the compiled Flatbuffers files. While it may be
> > slightly more maintenance work, I think the optics of a
> > "dependency-free" core build would be beneficial and help the project
> > marketing-wise.
>
> I'm -.5 on checking in generated artifacts but this is mostly stylistic.
> In the case of flatbuffers it seems like we might be able to get away with
> vendoring since it should mostly be headers only.
>
> I would prefer to try to come up with more granular components and be
> very conservative on what is "core".  I think it should be possible to have a
> zero dependency build if only MemoryPool, Buffers, Arrays and ArrayBuilders
> are in a core package [1].  This combined with the discussion Antoine started
> on an ABI compatible C-layer would make basic inter-op within a process
> reasonable.  Moving up the stack to IPC and files, there is probably a way
> to package headers separately from implementations.  This would allow other
> projects wishing to integrate with Arrow to bring their own implementations
> without the baggage of boost::filesystem. Would this leave anything besides
> "flatbuffers" as a hard dependency to support IPC?
>

We could indeed split up libarrow into more shared libraries. This
would mean accepting a lot more maintenance effort though, on a team
that is already overburdened. I'm not too keen on that in the short
term.

> Thanks,
> Micah
>
>
> [1] It probably makes sense to go even further and separate out MemoryPool
> and Buffer, so we can break the circular relationship between parquet and
> arrow.

Don't think this is possible even then, particularly in light of my
recent work reading and writing Arrow columnar data "closer to the
metal"  inside Parquet, yielding beneficial performance improvements.

>
> On Wed, Sep 18, 2019 at 8:03 AM Wes McKinney  wrote:
>
> > To be clear I think we should make these changes right after 0.15.0 is
> > released so we aren't playing whackamole with our packaging scripts.
> > I'm happy to take the lead on the work...
> >
> > On Wed, Sep 18, 2019 at 9:54 AM Antoine Pitrou 
> > wrote:
> > >
> > > On Wed, 18 Sep 2019 09:46:54 -0500
> > > Wes McKinney  wrote:
> > > > I think these are both interesting areas to explore further. I'd like
> > > > to focus on the couple of immediate items I think we should address
> > > >
> > > > * Should optional components be "opt in", "opt out", or a mix?
> > > > Currently it's a mix, and that's confusing for people. I think we
> > > > should make them all "opt in".
> > > > * Do we want to bring the out-of-the-box core build down to zero
> > > > dependencies, including not depending on boost::filesystem and
> > > > possibly checking in the compiled Flatbuffers files. While it may be
> > > > slightly more maintenance work, I think the optics of a
> > > > "dependency-free" core build would be beneficial and help the project
> > > > marketing-wise.
> > > >
> > > > Both of these issues must be addressed whether we undertake a Bazel
> > > > implementation or some other refactor of the C++ build system.
> > >
> > > I think checking in the Flatbuffers files (and also Protobuf and Thrift
> > > where applicable :-)) would be fine.
> > >
> > > As for boost::filesystem, getting rid of it wouldn't be a huge task.
> > > Still worth deciding whether we want to prioritize development time for
> > > it, because it's not entirely trivial either.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> >


Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-18 Thread Micah Kornfield
>
> * Should optional components be "opt in", "opt out", or a mix?
> Currently it's a mix, and that's confusing for people. I think we
> should make them all "opt in".

Agreed they should all be opt in by default.  I think active developers are
quite adept at flipping the appropriate CMake flags.
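
As a rough illustration of what "opt in by default" would mean in practice,
here is a minimal Python sketch (not Arrow's actual build logic). The option
names follow the style of Arrow's CMake options, but the list, the all-OFF
defaults, and the helper cmake_args are assumptions made for this example:

```python
# Hypothetical model of an "opt in" build configuration: every optional
# component defaults to OFF and must be requested explicitly.
# The component list and the defaults are illustrative assumptions.
OPTIONAL_COMPONENTS = ("ARROW_PARQUET", "ARROW_FLIGHT", "ARROW_GANDIVA")


def cmake_args(enabled=()):
    """Return -D flags for a build where only `enabled` components are ON."""
    unknown = set(enabled) - set(OPTIONAL_COMPONENTS)
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    return [
        f"-D{name}={'ON' if name in enabled else 'OFF'}"
        for name in OPTIONAL_COMPONENTS
    ]


if __name__ == "__main__":
    # A developer who wants Parquet flips exactly one flag.
    print(" ".join(["cmake", ".."] + cmake_args(["ARROW_PARQUET"])))
```

The point of the sketch is that nothing is built unless it is named, so the
default invocation produces the smallest possible build.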


> * Do we want to bring the out-of-the-box core build down to zero
> dependencies, including not depending on boost::filesystem and
> possibly checking in the compiled Flatbuffers files. While it may be
> slightly more maintenance work, I think the optics of a
> "dependency-free" core build would be beneficial and help the project
> marketing-wise.

I'm -.5 on checking in generated artifacts but this is mostly stylistic.
In the case of flatbuffers it seems like we might be able to get away with
vendoring since it should mostly be headers only.

I would prefer to try to come up with more granular components and be
very conservative on what is "core".  I think it should be possible to have a
zero dependency build if only MemoryPool, Buffers, Arrays and ArrayBuilders
are in a core package [1].  This combined with the discussion Antoine started
on an ABI compatible C-layer would make basic inter-op within a process
reasonable.  Moving up the stack to IPC and files, there is probably a way
to package headers separately from implementations.  This would allow other
projects wishing to integrate with Arrow to bring their own implementations
without the baggage of boost::filesystem. Would this leave anything besides
"flatbuffers" as a hard dependency to support IPC?
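
One way to check a question like this is to write down the component graph
and compute each component's transitive third-party dependencies. The Python
sketch below does that for a made-up graph: the component names, their edges,
and the third-party assignments are all assumptions for illustration, so the
result does not answer the question for the real codebase.

```python
# Hypothetical component graph; names and edges are illustrative only.
COMPONENT_DEPS = {
    "core": [],            # MemoryPool, Buffers, Arrays, ArrayBuilders
    "ipc": ["core"],
    "filesystem": ["core"],
}

# Third-party libraries each component is assumed to need directly.
THIRD_PARTY = {
    "core": set(),
    "ipc": {"flatbuffers"},
    "filesystem": {"boost::filesystem"},
}


def third_party_closure(component):
    """Collect third-party dependencies of `component`, transitively."""
    seen, stack, deps = set(), [component], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        deps |= THIRD_PARTY[current]
        stack.extend(COMPONENT_DEPS[current])
    return deps


if __name__ == "__main__":
    for name in COMPONENT_DEPS:
        print(name, sorted(third_party_closure(name)))
```

A "zero dependency" core corresponds to the closure of the core component
being empty; any edge from core to a heavier component would show up here.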

Thanks,
Micah


[1] It probably makes sense to go even further and separate out MemoryPool
and Buffer, so we can break the circular relationship between parquet and
arrow.

On Wed, Sep 18, 2019 at 8:03 AM Wes McKinney  wrote:

> To be clear I think we should make these changes right after 0.15.0 is
> released so we aren't playing whackamole with our packaging scripts.
> I'm happy to take the lead on the work...
>
> On Wed, Sep 18, 2019 at 9:54 AM Antoine Pitrou 
> wrote:
> >
> > On Wed, 18 Sep 2019 09:46:54 -0500
> > Wes McKinney  wrote:
> > > I think these are both interesting areas to explore further. I'd like
> > > to focus on the couple of immediate items I think we should address
> > >
> > > * Should optional components be "opt in", "opt out", or a mix?
> > > Currently it's a mix, and that's confusing for people. I think we
> > > should make them all "opt in".
> > > * Do we want to bring the out-of-the-box core build down to zero
> > > dependencies, including not depending on boost::filesystem and
> > > possibly checking in the compiled Flatbuffers files. While it may be
> > > slightly more maintenance work, I think the optics of a
> > > "dependency-free" core build would be beneficial and help the project
> > > marketing-wise.
> > >
> > > Both of these issues must be addressed whether we undertake a Bazel
> > > implementation or some other refactor of the C++ build system.
> >
> > I think checking in the Flatbuffers files (and also Protobuf and Thrift
> > where applicable :-)) would be fine.
> >
> > As for boost::filesystem, getting rid of it wouldn't be a huge task.
> > Still worth deciding whether we want to prioritize development time for
> > it, because it's not entirely trivial either.
> >
> > Regards
> >
> > Antoine.
> >
> >
>


Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-18 Thread Wes McKinney
To be clear I think we should make these changes right after 0.15.0 is
released so we aren't playing whackamole with our packaging scripts.
I'm happy to take the lead on the work...

On Wed, Sep 18, 2019 at 9:54 AM Antoine Pitrou  wrote:
>
> On Wed, 18 Sep 2019 09:46:54 -0500
> Wes McKinney  wrote:
> > I think these are both interesting areas to explore further. I'd like
> > to focus on the couple of immediate items I think we should address
> >
> > * Should optional components be "opt in", "opt out", or a mix?
> > Currently it's a mix, and that's confusing for people. I think we
> > should make them all "opt in".
> > * Do we want to bring the out-of-the-box core build down to zero
> > dependencies, including not depending on boost::filesystem and
> > possibly checking in the compiled Flatbuffers files? While it may be
> > slightly more maintenance work, I think the optics of a
> > "dependency-free" core build would be beneficial and help the project
> > marketing-wise.
> >
> > Both of these issues must be addressed whether we undertake a Bazel
> > implementation or some other refactor of the C++ build system.
>
> I think checking in the Flatbuffers files (and also Protobuf and Thrift
> where applicable :-)) would be fine.
>
> As for boost::filesystem, getting rid of it wouldn't be a huge task.
> Still worth deciding whether we want to prioritize development time for
> it, because it's not entirely trivial either.
>
> Regards
>
> Antoine.
>
>


Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-18 Thread Wes McKinney
I think these are both interesting areas to explore further. I'd like
to focus on the couple of immediate items I think we should address

* Should optional components be "opt in", "opt out", or a mix?
Currently it's a mix, and that's confusing for people. I think we
should make them all "opt in".
* Do we want to bring the out-of-the-box core build down to zero
dependencies, including not depending on boost::filesystem and
possibly checking in the compiled Flatbuffers files? While it may be
slightly more maintenance work, I think the optics of a
"dependency-free" core build would be beneficial and help the project
marketing-wise.

Both of these issues must be addressed whether we undertake a Bazel
implementation or some other refactor of the C++ build system.
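
To make the "opt in" idea concrete, here is a minimal sketch of what a build invocation could look like if every optional component defaulted to OFF. It is an illustration only: arrow_opt_in_cmd is a hypothetical helper, and the mapping from a component name to an ARROW_<NAME> option is an assumption modelled on existing flags such as ARROW_COMPUTE and ARROW_DATASET. The function prints the command rather than running it.

```shell
#!/bin/sh
# Sketch only: prints (does not run) a cmake command for an "opt in" build.
# Assumes the proposed defaults, i.e. every optional component is OFF unless
# it is requested. arrow_opt_in_cmd is a hypothetical name, not part of Arrow.
arrow_opt_in_cmd() {
  cmd="cmake .."
  for component in "$@"; do
    # Upper-case the component name to form the ARROW_<NAME> option.
    name=$(printf '%s' "$component" | tr '[:lower:]' '[:upper:]')
    cmd="$cmd -DARROW_${name}=ON"
  done
  printf '%s\n' "$cmd"
}

# A bare core build needs no flags at all; extras are requested by name.
arrow_opt_in_cmd
arrow_opt_in_cmd compute dataset
```

With defaults like these, a packager who wants today's feature set would have to list each component explicitly, which is why package build scripts would need to change.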

On Wed, Sep 18, 2019 at 2:48 AM Uwe L. Korn  wrote:
>
> Hello Micah,
>
> I don't think we have explored using bazel yet. I would see it as a possible 
> modular alternative but as you mention it will be a lot of work and we would 
> probably need a mentor who is familiar with bazel, otherwise we probably end 
> up spending too much time on this and get a non-typical bazel setup.
>
> Uwe
>
> On Wed, Sep 18, 2019, at 8:44 AM, Micah Kornfield wrote:
> > It has come up in the past, but I wonder if exploring Bazel as a build
> > > system with its very explicit dependency graph might help (I'm not sure
> > if something similar is available in CMake).
> >
> > This is also a lot of work, but could also potentially benefit the
> > developer experience because we can make unit tests depend on individual
> > compilable units instead of all of libarrow.  There are trade-offs here as
> > well in terms of public API coverage.
> >
> > On Tue, Sep 17, 2019 at 11:14 PM Uwe L. Korn  wrote:
> >
> > > Hello,
> > >
> > > I can think of two other alternatives that make it more visible what Arrow
> > > core is and what are the optional components:
> > >
> > > * Error out when no component is selected instead of building just the
> > > > core Arrow. Here we could add an explanatory message that lists all
> > > > components and, for each component, 2-3 words on what it does and what it
> > > requires. This would make the first-time experience much better.
> > > * Split the CMake project into several subprojects. By correctly
> > > structuring the CMakefiles, we should be able to separate out the Arrow
> > > components into separate CMake projects that can be built independently if
> > > needed while all using the same third-party toolchain. We would still have
> > > a top-level CMakeLists.txt that is invoked just like the current one but
> > > > through having subprojects, you would no longer be bound to use the
> > > single top-level one. This would also have some benefit for packagers that
> > > could separate out the build of individual Arrow modules. Furthermore, it
> > > would also make it easier for PoC/academic projects to just take the Arrow
> > > > Core sources and drop it in as a CMake subproject; while this is not a
> > > > good solution for production-grade software, it is quite common practice
> > > > to do this in research.
> > > I really like this approach and I think this is something we should have
> > > as a long-term target, I'm also happy to implement given the time but I
> > > think one CMake refactor per year is the maximum I can do and that was
> > > already eaten up by the dependency detection. Also, I'm unsure about how
> > > > much this would block us at the moment vs the marketing benefit of having
> > > > a more modular Arrow; currently I'm leaning on the side that the
> > > marketing/adoption benefit would be much larger but we lack someone
> > > frustration-tolerant to do the refactoring.
> > >
> > > Uwe
> > >
> > > On Wed, Sep 18, 2019, at 12:18 AM, Wes McKinney wrote:
> > > > hi folks,
> > > >
> > > > Lately there seem to be more and more people suggesting that the
> > > > optional components in the Arrow C++ project are getting in the way of
> > > > using the "core" which implements the columnar format and IPC
> > > > protocol. I am not sure I agree with this argument, but in general I
> > > > think it would be a good idea to make all optional components in the
> > > > project "opt in" rather than "opt out"
> > > >
> > > > To demonstrate where things currently stand, I created a Dockerfile to
> > > > try to make the smallest possible and most dependency-free build
> > > >
> > > >
> > > > > https://github.com/wesm/arrow/tree/cpp-minimal-dockerfile/dev/cpp_minimal
> > > >
> > > > Here is the output of this build
> > > >
> > > > https://gist.github.com/wesm/02328fbb463033ed486721b8265f755f
> > > >
> > > > First, let's look at the CMake invocation
> > > >
> > > > cmake .. -DBOOST_SOURCE=BUNDLED \
> > > > -DARROW_BOOST_USE_SHARED=OFF \
> > > > -DARROW_COMPUTE=OFF \
> > > > -DARROW_DATASET=OFF \
> > > > -DARROW_JEMALLOC=OFF \
> > > > -DARROW_JSON=ON \
> > > > -DARROW_USE_GLOG=OFF \
> > > > -DARROW_WITH_BZ2=OFF \
> > > > -DARROW_WITH_ZLIB=OFF \
> > > > 

Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-18 Thread Uwe L. Korn
Hello Micah,

I don't think we have explored using bazel yet. I would see it as a possible 
modular alternative but as you mention it will be a lot of work and we would 
probably need a mentor who is familiar with bazel, otherwise we probably end up 
spending too much time on this and get a non-typical bazel setup.

Uwe

On Wed, Sep 18, 2019, at 8:44 AM, Micah Kornfield wrote:
> It has come up in the past, but I wonder if exploring Bazel as a build
> system with its very explicit dependency graph might help (I'm not sure
> if something similar is available in CMake).
> 
> This is also a lot of work, but could also potentially benefit the
> developer experience because we can make unit tests depend on individual
> compilable units instead of all of libarrow.  There are trade-offs here as
> well in terms of public API coverage.
> 
> On Tue, Sep 17, 2019 at 11:14 PM Uwe L. Korn  wrote:
> 
> > Hello,
> >
> > I can think of two other alternatives that make it more visible what Arrow
> > core is and what are the optional components:
> >
> > * Error out when no component is selected instead of building just the
> > > core Arrow. Here we could add an explanatory message that lists all
> > > components and, for each component, 2-3 words on what it does and what it
> > requires. This would make the first-time experience much better.
> > * Split the CMake project into several subprojects. By correctly
> > structuring the CMakefiles, we should be able to separate out the Arrow
> > components into separate CMake projects that can be built independently if
> > needed while all using the same third-party toolchain. We would still have
> > a top-level CMakeLists.txt that is invoked just like the current one but
> > > through having subprojects, you would no longer be bound to use the
> > single top-level one. This would also have some benefit for packagers that
> > could separate out the build of individual Arrow modules. Furthermore, it
> > would also make it easier for PoC/academic projects to just take the Arrow
> > Core sources and drop it in as a CMake subproject; while this is not a good
> > solution for production-grade software, it is quite common practice to do
> > this in research.
> > I really like this approach and I think this is something we should have
> > as a long-term target, I'm also happy to implement given the time but I
> > think one CMake refactor per year is the maximum I can do and that was
> > already eaten up by the dependency detection. Also, I'm unsure about how
> > much this would block us at the moment vs the marketing benefit of having a
> > more modular Arrow; currently I'm leaning on the side that the
> > marketing/adoption benefit would be much larger but we lack someone
> > frustration-tolerant to do the refactoring.
> >
> > Uwe
> >
> > On Wed, Sep 18, 2019, at 12:18 AM, Wes McKinney wrote:
> > > hi folks,
> > >
> > > Lately there seem to be more and more people suggesting that the
> > > optional components in the Arrow C++ project are getting in the way of
> > > using the "core" which implements the columnar format and IPC
> > > protocol. I am not sure I agree with this argument, but in general I
> > > think it would be a good idea to make all optional components in the
> > > project "opt in" rather than "opt out"
> > >
> > > To demonstrate where things currently stand, I created a Dockerfile to
> > > try to make the smallest possible and most dependency-free build
> > >
> > >
> > > > https://github.com/wesm/arrow/tree/cpp-minimal-dockerfile/dev/cpp_minimal
> > >
> > > Here is the output of this build
> > >
> > > https://gist.github.com/wesm/02328fbb463033ed486721b8265f755f
> > >
> > > First, let's look at the CMake invocation
> > >
> > > cmake .. -DBOOST_SOURCE=BUNDLED \
> > > -DARROW_BOOST_USE_SHARED=OFF \
> > > -DARROW_COMPUTE=OFF \
> > > -DARROW_DATASET=OFF \
> > > -DARROW_JEMALLOC=OFF \
> > > -DARROW_JSON=ON \
> > > -DARROW_USE_GLOG=OFF \
> > > -DARROW_WITH_BZ2=OFF \
> > > -DARROW_WITH_ZLIB=OFF \
> > > -DARROW_WITH_ZSTD=OFF \
> > > -DARROW_WITH_LZ4=OFF \
> > > -DARROW_WITH_SNAPPY=OFF \
> > > -DARROW_WITH_BROTLI=OFF \
> > > -DARROW_BUILD_UTILITIES=OFF
> > >
> > > > Aside from the issue of how to obtain and link Boost, here's a couple of
> > > > things:
> > >
> > > * COMPUTE and DATASET IMHO should be off by default
> > > * All compression libraries should be turned off
> > > * GLOG should be off by default
> > > * Utilities should be off (they are used for integration testing)
> > > * Jemalloc should probably be off, but we should make it clear that
> > > opting in will yield better performance
> > >
> > > I found that it wasn't possible to set ARROW_JSON=OFF without breaking
> > > the build. I opened ARROW-6590 to fix this
> > >
> > > Aside from potentially changing these defaults, there's some things in
> > > the build that we might want to turn into optional pieces:
> > >
> > > * We should see if we can make boost::filesystem not mandatory in the
> > > barebones build, if only to satisfy the 

Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-18 Thread Micah Kornfield
It has come up in the past, but I wonder if exploring Bazel as a build
system with its very explicit dependency graph might help (I'm not sure
if something similar is available in CMake).

This is also a lot of work, but could also potentially benefit the
developer experience because we can make unit tests depend on individual
compilable units instead of all of libarrow.  There are trade-offs here as
well in terms of public API coverage.

On Tue, Sep 17, 2019 at 11:14 PM Uwe L. Korn  wrote:

> Hello,
>
> I can think of two other alternatives that make it more visible what Arrow
> core is and what are the optional components:
>
> * Error out when no component is selected instead of building just the
> core Arrow. Here we could add an explanatory message that lists all
> components and, for each component, 2-3 words on what it does and what it
> requires. This would make the first-time experience much better.
> * Split the CMake project into several subprojects. By correctly
> structuring the CMakefiles, we should be able to separate out the Arrow
> components into separate CMake projects that can be built independently if
> needed while all using the same third-party toolchain. We would still have
> a top-level CMakeLists.txt that is invoked just like the current one but
> through having subprojects, you would no longer be bound to use the
> single top-level one. This would also have some benefit for packagers that
> could separate out the build of individual Arrow modules. Furthermore, it
> would also make it easier for PoC/academic projects to just take the Arrow
> Core sources and drop it in as a CMake subproject; while this is not a good
> solution for production-grade software, it is quite common practice to do
> this in research.
> I really like this approach and I think this is something we should have
> as a long-term target, I'm also happy to implement given the time but I
> think one CMake refactor per year is the maximum I can do and that was
> already eaten up by the dependency detection. Also, I'm unsure about how
> much this would block us at the moment vs the marketing benefit of having a
> more modular Arrow; currently I'm leaning on the side that the
> marketing/adoption benefit would be much larger but we lack someone
> frustration-tolerant to do the refactoring.
>
> Uwe
>
> On Wed, Sep 18, 2019, at 12:18 AM, Wes McKinney wrote:
> > hi folks,
> >
> > Lately there seem to be more and more people suggesting that the
> > optional components in the Arrow C++ project are getting in the way of
> > using the "core" which implements the columnar format and IPC
> > protocol. I am not sure I agree with this argument, but in general I
> > think it would be a good idea to make all optional components in the
> > project "opt in" rather than "opt out"
> >
> > To demonstrate where things currently stand, I created a Dockerfile to
> > try to make the smallest possible and most dependency-free build
> >
> >
> > https://github.com/wesm/arrow/tree/cpp-minimal-dockerfile/dev/cpp_minimal
> >
> > Here is the output of this build
> >
> > https://gist.github.com/wesm/02328fbb463033ed486721b8265f755f
> >
> > First, let's look at the CMake invocation
> >
> > cmake .. -DBOOST_SOURCE=BUNDLED \
> > -DARROW_BOOST_USE_SHARED=OFF \
> > -DARROW_COMPUTE=OFF \
> > -DARROW_DATASET=OFF \
> > -DARROW_JEMALLOC=OFF \
> > -DARROW_JSON=ON \
> > -DARROW_USE_GLOG=OFF \
> > -DARROW_WITH_BZ2=OFF \
> > -DARROW_WITH_ZLIB=OFF \
> > -DARROW_WITH_ZSTD=OFF \
> > -DARROW_WITH_LZ4=OFF \
> > -DARROW_WITH_SNAPPY=OFF \
> > -DARROW_WITH_BROTLI=OFF \
> > -DARROW_BUILD_UTILITIES=OFF
> >
> > Aside from the issue of how to obtain and link Boost, here's a couple of
> > things:
> >
> > * COMPUTE and DATASET IMHO should be off by default
> > * All compression libraries should be turned off
> > * GLOG should be off by default
> > * Utilities should be off (they are used for integration testing)
> > * Jemalloc should probably be off, but we should make it clear that
> > opting in will yield better performance
> >
> > I found that it wasn't possible to set ARROW_JSON=OFF without breaking
> > the build. I opened ARROW-6590 to fix this
> >
> > Aside from potentially changing these defaults, there's some things in
> > the build that we might want to turn into optional pieces:
> >
> > * We should see if we can make boost::filesystem not mandatory in the
> > barebones build, if only to satisfy the peanut gallery
> > * double-conversion is used in the CSV module. I think that
> > double-conversion_ep and the CSV module should both be made opt-in
> > * rapidjson_ep should be made optional. JSON support is only needed
> > for integration testing
> >
> > We could also discuss vendoring flatbuffers.h so that flatbuffers_ep
> > is not mandatory.
> >
> > In general, enabling optional components is primarily relevant for
> > packagers. If we implement these changes, a number of package build
> > scripts will have to change.
> >
> > Thanks,
> > Wes
> >
>


Re: [DISCUSS] Changing C++ build system default options to produce more barebones builds

2019-09-18 Thread Uwe L. Korn
Hello,

I can think of two other alternatives that make it more visible what Arrow core 
is and what are the optional components:

* Error out when no component is selected instead of building just the core 
Arrow. Here we could add an explanatory message that lists all components and,
for each component, 2-3 words on what it does and what it requires. This would make
the first-time experience much better.
* Split the CMake project into several subprojects. By correctly structuring 
the CMakefiles, we should be able to separate out the Arrow components into 
separate CMake projects that can be built independently if needed while all 
using the same third-party toolchain. We would still have a top-level 
CMakeLists.txt that is invoked just like the current one but through having 
subprojects, you would no longer be bound to use the single top-level one.
This would also have some benefit for packagers that could separate out the 
build of individual Arrow modules. Furthermore, it would also make it easier 
for PoC/academic projects to just take the Arrow Core sources and drop it in as 
a CMake subproject; while this is not a good solution for production-grade 
software, it is quite common practice to do this in research.
I really like this approach and I think this is something we should have as a
long-term target. I'm also happy to implement it given the time, but I think one
CMake refactor per year is the maximum I can do, and that was already eaten up
by the dependency detection. Also, I'm unsure about how much this would block
us at the moment vs the marketing benefit of having a more modular Arrow;
currently I'm leaning on the side that the marketing/adoption benefit would be
much larger, but we lack someone frustration-tolerant to do the refactoring.
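
The first alternative (erroring out when nothing is selected) could look roughly like the sketch below. check_components is a hypothetical helper, not an existing Arrow script, and the one-line descriptions are placeholders; only the double-conversion and rapidjson requirements are taken from this thread.

```shell
#!/bin/sh
# Sketch only: refuse to continue when no component was selected and print a
# short catalogue instead. check_components is a hypothetical name. The
# descriptions for COMPUTE and DATASET are illustrative assumptions.
check_components() {
  if [ "$#" -gt 0 ]; then
    # At least one component was requested, so configuration may proceed.
    return 0
  fi
  cat >&2 <<'EOF'
No Arrow component selected. Available components:
  CSV      CSV module; requires double-conversion
  JSON     JSON support; requires rapidjson
  COMPUTE  compute module (description assumed)
  DATASET  dataset module (description assumed)
EOF
  return 1
}

# Example use in a wrapper script: check_components "$@" || exit 1
```

In the real build the same check would presumably live in the top-level CMakeLists.txt rather than in a wrapper script; the shell form is used here only to keep the example short.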

Uwe

On Wed, Sep 18, 2019, at 12:18 AM, Wes McKinney wrote:
> hi folks,
> 
> Lately there seem to be more and more people suggesting that the
> optional components in the Arrow C++ project are getting in the way of
> using the "core" which implements the columnar format and IPC
> protocol. I am not sure I agree with this argument, but in general I
> think it would be a good idea to make all optional components in the
> project "opt in" rather than "opt out"
> 
> To demonstrate where things currently stand, I created a Dockerfile to
> try to make the smallest possible and most dependency-free build
> 
> https://github.com/wesm/arrow/tree/cpp-minimal-dockerfile/dev/cpp_minimal
> 
> Here is the output of this build
> 
> https://gist.github.com/wesm/02328fbb463033ed486721b8265f755f
> 
> First, let's look at the CMake invocation
> 
> cmake .. -DBOOST_SOURCE=BUNDLED \
> -DARROW_BOOST_USE_SHARED=OFF \
> -DARROW_COMPUTE=OFF \
> -DARROW_DATASET=OFF \
> -DARROW_JEMALLOC=OFF \
> -DARROW_JSON=ON \
> -DARROW_USE_GLOG=OFF \
> -DARROW_WITH_BZ2=OFF \
> -DARROW_WITH_ZLIB=OFF \
> -DARROW_WITH_ZSTD=OFF \
> -DARROW_WITH_LZ4=OFF \
> -DARROW_WITH_SNAPPY=OFF \
> -DARROW_WITH_BROTLI=OFF \
> -DARROW_BUILD_UTILITIES=OFF
> 
> Aside from the issue of how to obtain and link Boost, here's a couple of 
> things:
> 
> * COMPUTE and DATASET IMHO should be off by default
> * All compression libraries should be turned off
> * GLOG should be off by default
> * Utilities should be off (they are used for integration testing)
> * Jemalloc should probably be off, but we should make it clear that
> opting in will yield better performance
> 
> I found that it wasn't possible to set ARROW_JSON=OFF without breaking
> the build. I opened ARROW-6590 to fix this
> 
> Aside from potentially changing these defaults, there's some things in
> the build that we might want to turn into optional pieces:
> 
> * We should see if we can make boost::filesystem not mandatory in the
> barebones build, if only to satisfy the peanut gallery
> * double-conversion is used in the CSV module. I think that
> double-conversion_ep and the CSV module should both be made opt-in
> * rapidjson_ep should be made optional. JSON support is only needed
> for integration testing
> 
> We could also discuss vendoring flatbuffers.h so that flatbuffers_ep
> is not mandatory.
> 
> In general, enabling optional components is primarily relevant for
> packagers. If we implement these changes, a number of package build
> scripts will have to change.
> 
> Thanks,
> Wes
>