[jira] [Created] (ARROW-6474) Provide mechanism for python to write out old format

2019-09-05 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-6474:
--

 Summary: Provide mechanism for python to write out old format
 Key: ARROW-6474
 URL: https://issues.apache.org/jira/browse/ARROW-6474
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Micah Kornfield
 Fix For: 0.15.0


I think this needs to be an environment variable, so it can be made to work 
with older versions of the Java library used in the pyspark integration.
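A minimal sketch of how such an environment-variable switch could look on the Python side (the variable name `ARROW_PRE_0_15_IPC_FORMAT` and the helper function are illustrative assumptions, not an existing pyarrow API):

```python
import os

def use_legacy_ipc_format():
    # Hypothetical switch: write the pre-0.15 IPC format when the
    # environment variable is set, so old Java readers (e.g. the pyspark
    # integration) can still consume the stream.
    return os.environ.get("ARROW_PRE_0_15_IPC_FORMAT", "0") == "1"

os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
legacy = use_legacy_ipc_format()      # True while the variable is set
del os.environ["ARROW_PRE_0_15_IPC_FORMAT"]
modern = use_legacy_ipc_format()      # False once it is unset
```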

 

 [~bryanc] can you check if this captures the requirements?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6473) [Format] Clarify dictionary encoding edge cases

2019-09-05 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-6473:
--

 Summary: [Format] Clarify dictionary encoding edge cases
 Key: ARROW-6473
 URL: https://issues.apache.org/jira/browse/ARROW-6473
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, Format
Reporter: Micah Kornfield
Assignee: Micah Kornfield


Several recent threads on the mailing list:

1.  Edge case for all null columns and interleaved dictionaries

2. Semantics of non-delta dictionaries (and their relation to the file format).

3. Propose a forward-compatible enum so dictionaries can be represented as 
types other than a "flat" vector.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


Re: [DISCUSS] IPC buffer layout for Null type

2019-09-05 Thread Micah Kornfield
Hi Wes and others,
I don't have a good sense of where Null arrays get created in the existing
code base.

Also, do you think it is worth the effort to make this backwards compatible?
We could in theory tie the buffer count to having the continuation value
for alignment.

The one area where I'm slightly concerned is that we seem to have users in the
wild who depend on backwards compatibility, and I'm trying to better
understand the odds that we break them.

Thanks,
Micah

On Thu, Sep 5, 2019 at 7:25 AM Wes McKinney  wrote:

> hi folks,
>
> One of the as-yet-untested (in integration tests) parts of the
> columnar specification is the Null layout. In C++ we additionally
> implemented this by writing two length-0 "placeholder" buffers in the
> RecordBatch data header, but since the Null layout has no memory
> allocated nor any buffers in-memory it may be more proper to write no
> buffers (since the length of the Null layout is all you need to
> reconstruct it). There are 3 implementations of the placeholder
> version (C++, Go, JS, maybe also C#) but it never got implemented in
> Java. While technically this would break old serialized data, I would
> not expect this to be very frequently occurring in many of the
> currently-deployed Arrow applications
>
> Here is my C++ patch
>
> https://github.com/apache/arrow/pull/5287
>
> I'm not sure we need to formalize this with a vote but I'm interested
> in the community's feedback on how to proceed here.
>
> - Wes
>


Re: Timeline for 0.15.0 release

2019-09-05 Thread Micah Kornfield
Just for reference, [1] has a dashboard of the current issues:

https://cwiki.apache.org/confluence/display/ARROW/Arrow+0.15.0+Release

On Thu, Sep 5, 2019 at 3:43 PM Wes McKinney  wrote:

> hi all,
>
> It doesn't seem like we're going to be in a position to release at the
> beginning of next week. I hope that one more week of work (or less)
> will be enough to get us there. Aside from merging the alignment
> changes, we need to make sure that our packaging jobs required for the
> release candidate are all working.
>
> If folks could remove issues from the 0.15.0 backlog that they don't
> think they will finish by end of next week that would help focus
> efforts (there are currently 78 issues in 0.15.0 still). I am looking
> to tackle a few small features related to dictionaries while the
> release window is still open.
>
> - Wes
>
> On Tue, Aug 27, 2019 at 3:48 PM Wes McKinney  wrote:
> >
> > hi,
> >
> > I think we should try to release the week of September 9, so
> > development work should be completed by end of next week.
> >
> > Does that seem reasonable?
> >
> > I plan to get up a patch for the protocol alignment changes for C++ in
> > the next couple of days -- I think that getting the alignment work
> > done is the main barrier to releasing.
> >
> > Thanks
> > Wes
> >
> > On Mon, Aug 19, 2019 at 12:25 PM Ji Liu 
> wrote:
> > >
> > > Hi Wes, on the Java side, I can think of several bugs that need to be
> > > fixed or flagged.
> > >
> > > i. ARROW-6040: Dictionary entries are required in IPC streams even
> when empty[1]
> > > This one is under review now; however, through this PR we find that
> > > there seems to be a bug in Java's reading and writing of dictionaries
> > > in IPC which is inconsistent with the spec[2], since it assumes all
> > > dictionaries are at the start of the stream (see details in the PR
> > > comments; this fix may not catch up with version 0.15). @Micah Kornfield
> > >
> > > ii. ARROW-1875: Write 64-bit ints as strings in integration test JSON
> files[3]
> > > The Java-side code is already checked in; other implementations seem
> > > not to have done so yet.
> > >
> > > iii. ARROW-6202: OutOfMemory in JdbcAdapter[4]
> > > Caused by trying to load all records in one contiguous batch; fixed by
> > > providing an iterator API for incremental reading in ARROW-6219[5].
> > >
> > > Thanks,
> > > Ji Liu
> > >
> > > [1] https://github.com/apache/arrow/pull/4960
> > > [2] https://arrow.apache.org/docs/ipc.html
> > > [3] https://issues.apache.org/jira/browse/ARROW-1875
> > > [4] https://issues.apache.org/jira/browse/ARROW-6202
> > > [5] https://issues.apache.org/jira/browse/ARROW-6219
> > >
> > >
> > >
> > > --
> > > From:Wes McKinney 
> > > Send Time: 2019-08-19 (Monday) 23:03
> > > To:dev 
> > > Subject:Re: Timeline for 0.15.0 release
> > >
> > > I'm going to work on organizing the 0.15.0 backlog some this
> > > week, if anyone wants to help with grooming (particularly for
> > > languages other than C++/Python where I'm focusing) that would be
> > > helpful. There have been almost 500 JIRA issues opened since the
> > > 0.14.0 release, so we should make sure to check whether there are any
> > > regressions or other serious bugs that we should try to fix for
> > > 0.15.0.
> > >
> > > On Thu, Aug 15, 2019 at 6:23 PM Wes McKinney 
> wrote:
> > > >
> > > > The Windows wheel issue in 0.14.1 seems to be
> > > >
> > > > https://issues.apache.org/jira/browse/ARROW-6015
> > > >
> > > > I think the root cause could be the Windows changes in
> > > >
> > > >
> https://github.com/apache/arrow/commit/223ae744cc2a12c60cecb5db593263a03c13f85a
> > > >
> > > > I would be appreciative if a volunteer would look into what was wrong
> > > > with the 0.14.1 wheels on Windows. Otherwise 0.15.0 Windows wheels
> > > > will be broken, too
> > > >
> > > > The bad wheels can be found at
> > > >
> > > > https://bintray.com/apache/arrow/python#files/python%2F0.14.1
> > > >
> > > > On Thu, Aug 15, 2019 at 1:28 PM Antoine Pitrou 
> wrote:
> > > > >
> > > > > On Thu, 15 Aug 2019 11:17:07 -0700
> > > > > Micah Kornfield  wrote:
> > > > > > >
> > > > > > > In C++ they are
> > > > > > > independent, we could have 32-bit array lengths and
> variable-length
> > > > > > > types with 64-bit offsets if we wanted (we just wouldn't be
> able to
> > > > > > > have a List child with more than INT32_MAX elements).
> > > > > >
> > > > > > I think the point is we could do this in C++ but we don't.  I'm
> not sure we
> > > > > > would have introduced the "Large" types if we did.
> > > > >
> > > > > 64-bit offsets take twice as much space as 32-bit offsets, so if
> you're
> > > > > storing lots of small-ish lists or strings, 32-bit offsets are
> > > > > preferable.  So even with 64-bit array lengths from the start it
> would
> > > > > still be beneficial to have types with 32-bit offsets.
> > > > >
> > > > > > Going with the limited address space in Java and calling it a
> reference
> > > > > > implementation seems suboptimal. If a consumer uses a 

Re: [ANNOUNCE] New committers: Ben Kietzman, Kenta Murata, and Neal Richardson

2019-09-05 Thread Micah Kornfield
Congrats everyone.

On Thu, Sep 5, 2019 at 7:06 PM Ji Liu  wrote:

> Congratulations!
>
> Thanks,
> Ji Liu
>
>
> --
> From:Fan Liya 
> Send Time: 2019-09-06 (Friday) 09:28
> To:dev 
> Subject:Re: [ANNOUNCE] New committers: Ben Kietzman, Kenta Murata, and
> Neal Richardson
>
> Big congratulations to Ben, Kenta and Neal!
>
> Best,
> Liya Fan
>
> On Fri, Sep 6, 2019 at 5:33 AM Wes McKinney  wrote:
>
> > hi all,
> >
> > on behalf of the Arrow PMC, I'm pleased to announce that Ben, Kenta,
> > and Neal have accepted invitations to become Arrow committers. Welcome
> > and thank you for all your contributions!
> >
>


Re: New Users on JIRA

2019-09-05 Thread paddy horan
Thanks on both counts Wes!

From: Wes McKinney 
Sent: Thursday, September 5, 2019 10:52 PM
To: dev 
Subject: Re: New Users on JIRA

hi Paddy,

I keep all the e-mail in Gmail, it's easy to search there.

The Pony Mail interface works well too

https://lists.apache.org/list.html?dev@arrow.apache.org

To assign issues to new users

* Navigate to "JIRA Administration > Projects" in the top right
* Click on "Apache Arrow"
* Click "Users and Roles" on the left
* Click "Add users to role"
* Type the user's full name or username
* Make sure to select "Contributor"
* Click Add

I just took care of this one.

- Wes

On Thu, Sep 5, 2019 at 9:44 PM paddy horan  wrote:
>
> Hi All,
>
> I have the same issue again where there is a new user (hengruo) that needs 
> permissions changed so I can assign an issue.  I know that this was discussed 
> recently which leads me to another question.
>
> How do others find previous conversations in the mailing list archives?  I 
> find it pretty tedious to navigate the archive when looking for specific 
> threads.  Do others keep the mail in their e-mail clients for future 
> searching or is there some search functionality or tool I am missing?
>
> Thanks,
> Paddy


Re: New Users on JIRA

2019-09-05 Thread Wes McKinney
hi Paddy,

I keep all the e-mail in Gmail, it's easy to search there.

The Pony Mail interface works well too

https://lists.apache.org/list.html?dev@arrow.apache.org

To assign issues to new users

* Navigate to "JIRA Administration > Projects" in the top right
* Click on "Apache Arrow"
* Click "Users and Roles" on the left
* Click "Add users to role"
* Type the user's full name or username
* Make sure to select "Contributor"
* Click Add

I just took care of this one.

- Wes

On Thu, Sep 5, 2019 at 9:44 PM paddy horan  wrote:
>
> Hi All,
>
> I have the same issue again where there is a new user (hengruo) that needs 
> permissions changed so I can assign an issue.  I know that this was discussed 
> recently which leads me to another question.
>
> How do others find previous conversations in the mailing list archives?  I 
> find it pretty tedious to navigate the archive when looking for specific 
> threads.  Do others keep the mail in their e-mail clients for future 
> searching or is there some search functionality or tool I am missing?
>
> Thanks,
> Paddy


New Users on JIRA

2019-09-05 Thread paddy horan
Hi All,

I have the same issue again where there is a new user (hengruo) that needs 
permissions changed so I can assign an issue.  I know that this was discussed 
recently which leads me to another question.

How do others find previous conversations in the mailing list archives?  I find 
it pretty tedious to navigate the archive when looking for specific threads.  
Do others keep the mail in their e-mail clients for future searching or is 
there some search functionality or tool I am missing?

Thanks,
Paddy


Re: [ANNOUNCE] New committers: Ben Kietzman, Kenta Murata, and Neal Richardson

2019-09-05 Thread Ji Liu
Congratulations!

Thanks,
Ji Liu


--
From:Fan Liya 
Send Time: 2019-09-06 (Friday) 09:28
To:dev 
Subject:Re: [ANNOUNCE] New committers: Ben Kietzman, Kenta Murata, and Neal 
Richardson

Big congratulations to Ben, Kenta and Neal!

Best,
Liya Fan

On Fri, Sep 6, 2019 at 5:33 AM Wes McKinney  wrote:

> hi all,
>
> on behalf of the Arrow PMC, I'm pleased to announce that Ben, Kenta,
> and Neal have accepted invitations to become Arrow committers. Welcome
> and thank you for all your contributions!
>


Re: [ANNOUNCE] New committers: Ben Kietzman, Kenta Murata, and Neal Richardson

2019-09-05 Thread Fan Liya
Big congratulations to Ben, Kenta and Neal!

Best,
Liya Fan

On Fri, Sep 6, 2019 at 5:33 AM Wes McKinney  wrote:

> hi all,
>
> on behalf of the Arrow PMC, I'm pleased to announce that Ben, Kenta,
> and Neal have accepted invitations to become Arrow committers. Welcome
> and thank you for all your contributions!
>


[jira] [Created] (ARROW-6472) [Java] ValueVector#accept may have a potential cast exception

2019-09-05 Thread Ji Liu (Jira)
Ji Liu created ARROW-6472:
-

 Summary: [Java] ValueVector#accept may have a potential cast exception
 Key: ARROW-6472
 URL: https://issues.apache.org/jira/browse/ARROW-6472
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Per discussion 
[https://github.com/apache/arrow/pull/5195#issuecomment-528425302]

We may use the API this way:
{code:java}
RangeEqualsVisitor visitor = new RangeEqualsVisitor(vector1, vector2);
vector3.accept(visitor, range){code}
If vector1/vector2 are, say, {{StructVector}}s and vector3 is an {{IntVector}}, 
things can go bad: we'll use the {{compareBaseFixedWidthVectors()}} path and do 
wrong type-casts for vector1/vector2.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


Re: Timeline for 0.15.0 release

2019-09-05 Thread Wes McKinney
hi all,

It doesn't seem like we're going to be in a position to release at the
beginning of next week. I hope that one more week of work (or less)
will be enough to get us there. Aside from merging the alignment
changes, we need to make sure that our packaging jobs required for the
release candidate are all working.

If folks could remove issues from the 0.15.0 backlog that they don't
think they will finish by end of next week that would help focus
efforts (there are currently 78 issues in 0.15.0 still). I am looking
to tackle a few small features related to dictionaries while the
release window is still open.

- Wes

On Tue, Aug 27, 2019 at 3:48 PM Wes McKinney  wrote:
>
> hi,
>
> I think we should try to release the week of September 9, so
> development work should be completed by end of next week.
>
> Does that seem reasonable?
>
> I plan to get up a patch for the protocol alignment changes for C++ in
> the next couple of days -- I think that getting the alignment work
> done is the main barrier to releasing.
>
> Thanks
> Wes
>
> On Mon, Aug 19, 2019 at 12:25 PM Ji Liu  wrote:
> >
> > Hi Wes, on the Java side, I can think of several bugs that need to be
> > fixed or flagged.
> >
> > i. ARROW-6040: Dictionary entries are required in IPC streams even when 
> > empty[1]
> > This one is under review now; however, through this PR we find that there
> > seems to be a bug in Java's reading and writing of dictionaries in IPC
> > which is inconsistent with the spec[2], since it assumes all dictionaries
> > are at the start of the stream (see details in the PR comments; this fix
> > may not catch up with version 0.15). @Micah Kornfield
> >
> > ii. ARROW-1875: Write 64-bit ints as strings in integration test JSON 
> > files[3]
> > The Java-side code is already checked in; other implementations seem not
> > to have done so yet.
> >
> > iii. ARROW-6202: OutOfMemory in JdbcAdapter[4]
> > Caused by trying to load all records in one contiguous batch; fixed by
> > providing an iterator API for incremental reading in ARROW-6219[5].
> >
> > Thanks,
> > Ji Liu
> >
> > [1] https://github.com/apache/arrow/pull/4960
> > [2] https://arrow.apache.org/docs/ipc.html
> > [3] https://issues.apache.org/jira/browse/ARROW-1875
> > [4] https://issues.apache.org/jira/browse/ARROW-6202
> > [5] https://issues.apache.org/jira/browse/ARROW-6219
> >
> >
> >
> > --
> > From:Wes McKinney 
> > Send Time: 2019-08-19 (Monday) 23:03
> > To:dev 
> > Subject:Re: Timeline for 0.15.0 release
> >
> > I'm going to work on organizing the 0.15.0 backlog some this
> > week, if anyone wants to help with grooming (particularly for
> > languages other than C++/Python where I'm focusing) that would be
> > helpful. There have been almost 500 JIRA issues opened since the
> > 0.14.0 release, so we should make sure to check whether there are any
> > regressions or other serious bugs that we should try to fix for
> > 0.15.0.
> >
> > On Thu, Aug 15, 2019 at 6:23 PM Wes McKinney  wrote:
> > >
> > > The Windows wheel issue in 0.14.1 seems to be
> > >
> > > https://issues.apache.org/jira/browse/ARROW-6015
> > >
> > > I think the root cause could be the Windows changes in
> > >
> > > https://github.com/apache/arrow/commit/223ae744cc2a12c60cecb5db593263a03c13f85a
> > >
> > > I would be appreciative if a volunteer would look into what was wrong
> > > with the 0.14.1 wheels on Windows. Otherwise 0.15.0 Windows wheels
> > > will be broken, too
> > >
> > > The bad wheels can be found at
> > >
> > > https://bintray.com/apache/arrow/python#files/python%2F0.14.1
> > >
> > > On Thu, Aug 15, 2019 at 1:28 PM Antoine Pitrou  
> > > wrote:
> > > >
> > > > On Thu, 15 Aug 2019 11:17:07 -0700
> > > > Micah Kornfield  wrote:
> > > > > >
> > > > > > In C++ they are
> > > > > > independent, we could have 32-bit array lengths and variable-length
> > > > > > types with 64-bit offsets if we wanted (we just wouldn't be able to
> > > > > > have a List child with more than INT32_MAX elements).
> > > > >
> > > > > I think the point is we could do this in C++ but we don't.  I'm not 
> > > > > sure we
> > > > > would have introduced the "Large" types if we did.
> > > >
> > > > 64-bit offsets take twice as much space as 32-bit offsets, so if you're
> > > > storing lots of small-ish lists or strings, 32-bit offsets are
> > > > preferable.  So even with 64-bit array lengths from the start it would
> > > > still be beneficial to have types with 32-bit offsets.
> > > >
> > > > > Going with the limited address space in Java and calling it a 
> > > > > reference
> > > > > implementation seems suboptimal. If a consumer uses a "Large" type
> > > > > presumably it is because they need the ability to store more than 
> > > > > INT32_MAX
> > > > > child elements in a column, otherwise it is just wasting space [1].
> > > >
> > > > Probably. Though if the individual elements (lists or strings) are
> > > > large, not much space is wasted in proportion, so it may be simpler in
> > > > such a 
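The offset-width tradeoff Antoine describes is simple arithmetic: a variable-length layout stores num_values + 1 offsets, so the offsets buffer doubles in size when moving from 32-bit to 64-bit offsets, independent of how big the values themselves are. A quick sketch:

```python
def offsets_buffer_bytes(num_values, offset_width_bytes):
    # Arrow variable-length layouts (strings, lists) store one offset per
    # value plus a final end offset.
    return (num_values + 1) * offset_width_bytes

n = 1_000_000
small = offsets_buffer_bytes(n, 4)   # 32-bit offsets
large = offsets_buffer_bytes(n, 8)   # 64-bit offsets
extra = large - small                # space wasted if the 64-bit range is unused
```

For a million short strings the "Large" type spends roughly 4 MB of extra offsets for range it may never use, which is the space argument for keeping the 32-bit variants.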

Re: [PROPOSAL] Consolidate Arrow's CI configuration

2019-09-05 Thread Wes McKinney
hi Krisztian,

Anyone who's developing in the project can see that the Buildbot setup
is working well (at least for Linux builds) and giving much more
timely feedback, which has been very helpful.

I'm concerned about the "ursabot" approach for a few reasons:

* If we are to centralize our tooling for Arrow CI builds, why can we
not have the build tool itself under Arrow governance?
* The current "ursabot" tool has GPL dependencies. Can these be
factored out into plugins so that the tool itself is ASF-compatible?
* This is a bit nitpicky but the name "ursabot" bears the name mark of
an organization that funds developers in this project. I'm concerned
about this, as I would about a tool named "clouderabot", "dremiobot",
"databricksbot", "googlebot", "ibmbot" or anything like that. It's
different from using a tool developed by an unaffiliated third party

In any case, I think putting the build configurations for the current
Ursa Labs-managed build cluster in the Apache Arrow repository is a
good idea, but there are likely a number of issues that we need to
address to be able to contemplate having a hard dependency between the
CI that we depend on to merge patches and this tool.

- Wes

On Thu, Sep 5, 2019 at 8:17 AM Antoine Pitrou  wrote:
>
>
> On 05/09/2019 at 15:04, Krisztián Szűcs wrote:
> >>
> >> If going with buildbot, this means that the various build steps need to
> >> be generic like in Travis-CI (e.g. "install", "setup", "before-test",
> >> "test", "after-test"...) and their contents expressed outside of the
> >> buildmaster configuration per se.
> >>
> > This is partially resolved with the Builder abstraction, see an example
> > here [1]. We just need to add and reload these Builder configurations
> > dynamically on certain events, like when someone changes a builder
> > from a PR.
>
> This is inside the buildmaster process, right?  I don't understand how
> you plan to change those dynamically without affecting all concurrent
> builds.
>
> Regards
>
> Antoine.


[jira] [Created] (ARROW-6471) [Python] arrow_to_pandas.cc has separate code paths for populating list values into an object array

2019-09-05 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6471:
---

 Summary: [Python] arrow_to_pandas.cc has separate code paths for 
populating list values into an object array
 Key: ARROW-6471
 URL: https://issues.apache.org/jira/browse/ARROW-6471
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney


See patch for ARROW-6369 https://github.com/apache/arrow/pull/5301. There are 
two different code paths for writing list values into a {{PyObject**}} 
output buffer. This seems like it could be simplified.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6470) Segmentation fault when trying to serialize an empty RecordBatch

2019-09-05 Thread Wamsi Viswanath (Jira)
Wamsi Viswanath created ARROW-6470:
--

 Summary: Segmentation fault when trying to serialize an empty 
RecordBatch
 Key: ARROW-6470
 URL: https://issues.apache.org/jira/browse/ARROW-6470
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.13.0
Reporter: Wamsi Viswanath


Below is a simple reproducible example; please let me know if the behavior is 
expected:

 
{code:cpp}
int main() {
  std::shared_ptr<arrow::Schema> schema =
      arrow::schema({arrow::field("int_", arrow::int32(), false)});
  std::vector<std::shared_ptr<arrow::Array>> arrays = {};

  std::shared_ptr<arrow::RecordBatch> record_batch =
      arrow::RecordBatch::Make(schema, arrays[0]->length(), arrays);
  std::shared_ptr<arrow::Buffer> serialized_buffer;
  if (!arrow::ipc::SerializeRecordBatch(*record_batch,
                                        arrow::default_memory_pool(),
                                        &serialized_buffer)
           .ok()) {
    throw std::runtime_error("Error: Serializing Records.");
  }
}
{code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6469) PyArrow HDFS documentation does not mention HDFS short circuit reads

2019-09-05 Thread Paulo Roberto Cerioni (Jira)
Paulo Roberto Cerioni created ARROW-6469:


 Summary: PyArrow HDFS documentation does not mention HDFS short 
circuit reads
 Key: ARROW-6469
 URL: https://issues.apache.org/jira/browse/ARROW-6469
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Paulo Roberto Cerioni


Because PyArrow uses libhdfs underneath, it is expected that file reads from 
HDFS will make use of short-circuit reads.

However, the PyArrow documentation does not explain whether this feature is 
supported (and in what situations) and whether it works without any configuration.

For instance, I'm interested in the use case in which we use the short-circuit 
feature to read some of the columns of a Parquet file located in HDFS into a 
dataframe.

 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[ANNOUNCE] New committers: Ben Kietzman, Kenta Murata, and Neal Richardson

2019-09-05 Thread Wes McKinney
hi all,

on behalf of the Arrow PMC, I'm pleased to announce that Ben, Kenta,
and Neal have accepted invitations to become Arrow committers. Welcome
and thank you for all your contributions!


[jira] [Created] (ARROW-6468) [C++] Remove unused hashing routines

2019-09-05 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-6468:
-

 Summary: [C++] Remove unused hashing routines
 Key: ARROW-6468
 URL: https://issues.apache.org/jira/browse/ARROW-6468
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Antoine Pitrou


The adoption of xxh3 for hashing (in ARROW-6385) probably left around some 
specialized but unused hashing functions (e.g. CRC-based hashing, perhaps also 
murmurhash). We should probably remove them if no problem surfaces with xxh3.




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6467) [Website] Transition to new .asf.yaml machinery for website publishing

2019-09-05 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6467:
---

 Summary: [Website] Transition to new .asf.yaml machinery for 
website publishing
 Key: ARROW-6467
 URL: https://issues.apache.org/jira/browse/ARROW-6467
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Wes McKinney


The ASF is providing a new configuration option for website publishing

https://cwiki.apache.org/confluence/display/INFRA/.asf.yaml+features+for+git+repositories

This is timely since I've found deploys via the current mechanism to be slow 
of late

https://issues.apache.org/jira/browse/INFRA-18987



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6466) [Developer] Refactor integration/integration_test.py into a proper Python package

2019-09-05 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6466:
---

 Summary: [Developer] Refactor integration/integration_test.py into 
a proper Python package
 Key: ARROW-6466
 URL: https://issues.apache.org/jira/browse/ARROW-6466
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Wes McKinney


This could also facilitate writing unit tests for the integration tests.

Maybe this could be a part of archery?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[DISCUSS] IPC buffer layout for Null type

2019-09-05 Thread Wes McKinney
hi folks,

One of the as-yet-untested (in integration tests) parts of the
columnar specification is the Null layout. In C++ we additionally
implemented this by writing two length-0 "placeholder" buffers in the
RecordBatch data header, but since the Null layout has no memory
allocated nor any buffers in-memory it may be more proper to write no
buffers (since the length of the Null layout is all you need to
reconstruct it). There are 3 implementations of the placeholder
version (C++, Go, JS, maybe also C#) but it never got implemented in
Java. While technically this would break old serialized data, I would
not expect this to be very frequently occurring in many of the
currently-deployed Arrow applications

Here is my C++ patch

https://github.com/apache/arrow/pull/5287

I'm not sure we need to formalize this with a vote but I'm interested
in the community's feedback on how to proceed here.

- Wes
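Since the Null layout owns no memory, the length alone reconstructs it; a reader written defensively could accept both the old two-placeholder encoding and the proposed zero-buffer one. A simplified sketch (the plain-dict representation stands in for the real Flatbuffers metadata):

```python
def read_null_array(length, buffers):
    # Old writers (C++, Go, JS) emit two length-0 placeholder buffers for
    # the Null layout; the proposal is to emit none.  Either way, no bytes
    # are needed: the length fully describes the array.
    if any(len(b) != 0 for b in buffers):
        raise ValueError("Null layout must not carry non-empty buffers")
    return {"type": "null", "length": length, "null_count": length}

old_style = read_null_array(5, [b"", b""])  # placeholder-buffer encoding
new_style = read_null_array(5, [])          # proposed no-buffer encoding
```

A reader of this shape would preserve backwards compatibility with old serialized data while allowing the simpler zero-buffer form going forward.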


[jira] [Created] (ARROW-6465) [Python] Improve Windows build instructions

2019-09-05 Thread ARF (Jira)
ARF created ARROW-6465:
--

 Summary: [Python] Improve Windows build instructions
 Key: ARROW-6465
 URL: https://issues.apache.org/jira/browse/ARROW-6465
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: ARF
Assignee: ARF


The current instructions for building the pyarrow python extension are 
incomplete.

Problems include:
 * missing re2, llvm, clang prerequisites
 * missing info on which MSVC toolsets are supported
 * missing info on how the build commands map to different MSVC toolsets
 * missing warning about currently broken Windows build config

The linked PR amends the Python developer documentation with the above.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


Re: [PROPOSAL] Consolidate Arrow's CI configuration

2019-09-05 Thread Antoine Pitrou


On 05/09/2019 at 15:04, Krisztián Szűcs wrote:
>>
>> If going with buildbot, this means that the various build steps need to
>> be generic like in Travis-CI (e.g. "install", "setup", "before-test",
>> "test", "after-test"...) and their contents expressed outside of the
>> buildmaster configuration per se.
>>
> This is partially resolved with the Builder abstraction, see an example
> here [1]. We just need to add and reload these Builder configurations
> dynamically on certain events, like when someone changes a builder
> from a PR.

This is inside the buildmaster process, right?  I don't understand how
you plan to change those dynamically without affecting all concurrent
builds.

Regards

Antoine.


[jira] [Created] (ARROW-6464) [Java] Refactor FixedSizeListVector#splitAndTransfer with slice API

2019-09-05 Thread Ji Liu (Jira)
Ji Liu created ARROW-6464:
-

 Summary: [Java] Refactor FixedSizeListVector#splitAndTransfer with 
slice API
 Key: ARROW-6464
 URL: https://issues.apache.org/jira/browse/ARROW-6464
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Ji Liu
Assignee: Ji Liu


Currently {{FixedSizeListVector#splitAndTransfer}} actually uses 
{{copyValueSafe}}, which performs a memory copy; we should use the slice API 
instead.

Meanwhile, {{splitAndTransfer}} in all classes should perform its index checks 
at the beginning.
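The copy-versus-slice distinction at issue here is the usual zero-copy tradeoff; a Python memoryview illustrates the intent (this is only an analogy for the Java slice API, not Arrow code):

```python
data = bytearray(b"abcdef")

view = memoryview(data)[2:4]   # zero-copy slice: shares the buffer
snapshot = bytes(data[2:4])    # copy: separate allocation

data[2] = ord("X")             # mutate the shared buffer

shared = bytes(view)           # slice observes the mutation
independent = snapshot         # copy does not
```

The slice avoids the per-value allocation and copy, which is exactly what the proposed refactor of `splitAndTransfer` is after.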



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


Re: [PROPOSAL] Consolidate Arrow's CI configuration

2019-09-05 Thread Krisztián Szűcs
Hey Antoine,

On Thu, Sep 5, 2019 at 2:54 PM Antoine Pitrou  wrote:

>
> Le 05/09/2019 à 14:43, Uwe L. Korn a écrit :
> > Hello Krisztián,
> >
> >> Am 05.09.2019 um 14:22 schrieb Krisztián Szűcs <
> szucs.kriszt...@gmail.com>:
> >>
> >>> * The build configuration is automatically updated on a merge to
> master?
> >>>
> Not yet, but this can be automated too with buildbot itself.
> >
> > This is something I would actually like to have before getting rid of
> > the Travis jobs. Otherwise we would be constrained quite a bit in
> > development when master CI breaks because of an environment issue until
> > one of the few people who can update the config becomes available.
>
> I would go further and say that PRs and branches need to be able to run
> different build configurations.  We are moving too fast to afford an
> inflexible centralized configuration.
>
Agree. I haven't had time to work on it yet, although I have a couple of
solutions in mind. Once we decide to move on with this proposal we
can allocate time to resolve it.

>
> If going with buildbot, this means that the various build steps need to
> be generic like in Travis-CI (e.g. "install", "setup", "before-test",
> "test", "after-test"...) and their contents expressed outside of the
> buildmaster configuration per se.
>
This is partially resolved with the Builder abstraction, see an example
here [1]. We just need to add and reload these Builder configurations
dynamically on certain events, like when someone changes a builder
from a PR.

[1]:
https://github.com/apache/arrow/blob/305e7387d429f095019c74f17e0c9c7cb443bb70/ci/buildbot/arrow/builders.py#L366
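For readers unfamiliar with the Builder abstraction, a minimal Buildbot configuration fragment along these lines might look as follows. This is a hedged sketch using the stock buildbot.plugins API; the builder name, worker name, and script paths are placeholders, not Arrow's actual configuration.

```python
# Sketch only: a Buildbot Builder with Travis-like generic steps
# ("checkout", "install", "test"). Names and paths are placeholders.
from buildbot.plugins import steps, util

factory = util.BuildFactory()
factory.addStep(steps.Git(repourl="https://github.com/apache/arrow",
                          mode="incremental", name="checkout"))
factory.addStep(steps.ShellCommand(command=["./ci/install.sh"], name="install"))
factory.addStep(steps.ShellCommand(command=["./ci/test.sh"], name="test"))

# Reloading BuilderConfig objects like this one on certain events (e.g. a
# PR that edits a builder) is what "adding builders dynamically" means here.
builder = util.BuilderConfig(name="amd64-conda-python",
                             workernames=["docker-worker"],
                             factory=factory)
```

This is a configuration fragment: it only takes effect inside a running Buildbot master's `master.cfg`.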


>
> Regards
>
> Antoine.
>


Re: [PROPOSAL] Consolidate Arrow's CI configuration

2019-09-05 Thread Antoine Pitrou


Le 05/09/2019 à 14:43, Uwe L. Korn a écrit :
> Hello Krisztián,
> 
>> Am 05.09.2019 um 14:22 schrieb Krisztián Szűcs :
>>
>>> * The build configuration is automatically updated on a merge to master?
>>>
>> Not yet, but this can be automated too with buildbot itself.
> 
> This is something I would actually like to have before getting rid of the 
> Travis jobs. Otherwise we would be constrained quite a bit in development 
> when master CI breaks because of an environment issue until one of the few 
> people who can update the config becomes available.

I would go further and say that PRs and branches need to be able to run
different build configurations.  We are moving too fast to afford an
inflexible centralized configuration.

If going with buildbot, this means that the various build steps need to
be generic like in Travis-CI (e.g. "install", "setup", "before-test",
"test", "after-test"...) and their contents expressed outside of the
buildmaster configuration per se.

Regards

Antoine.


Re: [PROPOSAL] Consolidate Arrow's CI configuration

2019-09-05 Thread Uwe L. Korn
Hello Krisztián,

> Am 05.09.2019 um 14:22 schrieb Krisztián Szűcs :
> 
>> * The build configuration is automatically updated on a merge to master?
>> 
> Not yet, but this can be automated too with buildbot itself.

This is something I would actually like to have before getting rid of the 
Travis jobs. Otherwise we would be constrained quite a bit in development when 
master CI breaks because of an environment issue until one of the few people 
who can update the config becomes available.

Uwe 


> 
>> 
>> And then a not so simple one: What will happen to our current
>> docker-compose setup? From the PR it seems like we do similar things with
>> ursabot but not using the central docker-compose.yml?
>> 
> Currently we're using docker-compose to run one-off containers rather
> than long running, multi-container services (which docker-compose is
> designed for). Ursabot already supports the features we need from
> docker-compose, so it can effectively replace the docker-compose
> setup as well. We have low-level control over the docker API, so we
> are able to tailor it to our requirements.
> 
>> 
>> 
>> Cheers
>> Uwe
>> 
>>> Am 29.08.2019 um 14:19 schrieb Krisztián Szűcs <
>> szucs.kriszt...@gmail.com>:
>>> 
>>> Hi,
>>> 
>>> Arrow's current continuous integration setup utilizes multiple CI
>>> providers,
>>> tools, and scripts:
>>> 
>>> - Unit tests are running on Travis and Appveyor
>>> - Binary packaging builds are running on crossbow, an abstraction over
>>> multiple
>>>  CI providers driven through a GitHub repository
>>> - For local tests and tasks, there is a docker-compose setup, or of
>> course
>>> you
>>>  can maintain your own environment
>>> 
>>> This setup has run into some limitations:
>>> - It’s slow: the CI parallelism of Travis has degraded over the last
>>> couple of
>>>  months. Testing a PR takes more than an hour, which is a long time for
>>> both
>>>  the maintainers and the contributors, and it has a negative effect on
>>> the
>>>  development throughput.
>>> - Build configurations are not portable, they are tied to specific
>>> services.
>>>  You can’t just take a Travis script and run it somewhere else.
>>> - Because they’re not portable, build configurations are duplicated in
>>> several
>>>  places.
>>> - The Travis, Appveyor and crossbow builds are not reproducible locally,
>>> so
>>>  developing them requires the slow git push cycles.
>>> - Public CI has limited platform support, just for example ARM machines
>>> are
>>>  not available.
>>> - Public CI also has limited hardware support, no GPUs are available
>>> 
>>> Resolving all of the issues above is complicated, but is a must for the
>>> long
>>> term sustainability of Arrow.
>>> 
>>> For some time, we’ve been working on a tool called Ursabot[1], a library
>> on
>>> top
>>> of the CI framework Buildbot[2]. Buildbot is well maintained and widely
>>> used
>>> for complex projects, including CPython, Webkit, LLVM, MariaDB, etc.
>>> Buildbot
>>> is not another hosted CI service like Travis or Appveyor: it is an
>>> extensible
>>> framework to implement various automations like continuous integration
>>> tasks.
>>> 
>>> You’ve probably noticed additional “Ursabot” builds appearing on pull
>>> requests,
>>> in addition to the Travis and Appveyor builds. We’ve been testing the
>>> framework
>>> with a fully featured CI server at ci.ursalabs.org. This service runs
>> build
>>> configurations we can’t run on Travis, does it faster than Travis, and
>> has
>>> the
>>> GitHub comment bot integration for ad hoc build triggering.
>>> 
>>> While we’re not prepared to propose moving all CI to a self-hosted setup,
>>> our
>>> work has demonstrated the potential of using buildbot to resolve Arrow’s
>>> continuous integration challenges:
>>> - The docker-based builders are reusing the docker images, which
>> eliminate
>>>  slow dependency installation steps. Some builds on this setup, run on
>>>  Ursa Labs’s infrastructure, run 20 minutes faster than the comparable
>>>  Travis-CI jobs.
>>> - It’s scalable. We can deploy buildbot wherever and add more masters and
>>>  workers, which we can’t do with public CI.
>>> - It’s platform and CI-provider independent. Builds can be run on
>>> arbitrary
>>>  architectures, operating systems, and hardware: Python is the only
>>>  requirement. Additionally builds specified in buildbot/ursabot can be
>>> run
>>>  anywhere: not only on custom buildbot infrastructure but also on
>> Travis,
>>> or
>>>  even on your own machine.
>>> - It improves reproducibility and encourages consolidation of
>>> configuration.
>>>  You can run the exact job locally that ran on Travis, and you can even
>>> get
>>>  an interactive shell in the build so you can debug a test failure. And
>>>  because you can run the same job anywhere, we wouldn’t need to have
>>>  duplicated, Travis-specific or the docker-compose build configuration
>>> stored
>>>  separately.
>>> - It’s extensible. More exotic features like a comment bot, 

Re: [PROPOSAL] Consolidate Arrow's CI configuration

2019-09-05 Thread Krisztián Szűcs
Hey Uwe,

On Thu, Sep 5, 2019 at 1:49 PM Uwe L. Korn  wrote:

> Hello Krisztián,
>
> I like this proposal. CI coverage and response time is a crucial thing for
> the health of the project. In general I like the consolidation and local
> reproducibility of the builds. Some questions I wanted to ask to make sure
> I understand your proposal correctly (hopefully they all can be answered
> with a simple yes):
>
> * Windows builds will stay in Appveyor for now?
>
Yes. Afterwards I'd go with the following steps:
1. Port the AppVeyor configurations to buildbot and run them on
AppVeyor with `ursabot project build windows-builder-name`
2. Once we have windows workers, and they are reliable, we can
decommission the AppVeyor builds.

> * MacOS builds will stay in Travis?
>
Yes, same as above.

> * All other builds will be removed from Travis?

Not all of the Travis builds have been ported to buildbot yet, namely: c_glib,
ruby, and the format integration tests.
I suggest an incremental procedure: once a Travis build is ported to
buildbot, we can choose to still run it on Travis or to disable it. In
that case Travis would only be a hosting provider.

> * Machines are currently run and funded by UrsaLabs but others could also
> sponsor an instance that could be added to the setup?
>
Exactly, either in the cloud or on bare-metal machines; buildbot enables
us to scale our cluster pretty easily.

> * The build configuration is automatically updated on a merge to master?
>
Not yet, but this can be automated too with buildbot itself.

>
> And then a not so simple one: What will happen to our current
> docker-compose setup? From the PR it seems like we do similar things with
> ursabot but not using the central docker-compose.yml?
>
Currently we're using docker-compose to run one-off containers rather
than long running, multi-container services (which docker-compose is
designed for). Ursabot already supports the features we need from
docker-compose, so it can effectively replace the docker-compose
setup as well. We have low-level control over the docker API, so we
are able to tailor it to our requirements.

>
>
> Cheers
> Uwe
>
> > Am 29.08.2019 um 14:19 schrieb Krisztián Szűcs <
> szucs.kriszt...@gmail.com>:
> >
> > Hi,
> >
> > Arrow's current continuous integration setup utilizes multiple CI
> > providers,
> > tools, and scripts:
> >
> > - Unit tests are running on Travis and Appveyor
> > - Binary packaging builds are running on crossbow, an abstraction over
> > multiple
> >   CI providers driven through a GitHub repository
> > - For local tests and tasks, there is a docker-compose setup, or of
> course
> > you
> >   can maintain your own environment
> >
> > This setup has run into some limitations:
> > - It’s slow: the CI parallelism of Travis has degraded over the last
> > couple of
> >   months. Testing a PR takes more than an hour, which is a long time for
> > both
> >   the maintainers and the contributors, and it has a negative effect on
> > the
> >   development throughput.
> > - Build configurations are not portable, they are tied to specific
> > services.
> >   You can’t just take a Travis script and run it somewhere else.
> > - Because they’re not portable, build configurations are duplicated in
> > several
> >   places.
> > - The Travis, Appveyor and crossbow builds are not reproducible locally,
> > so
> >   developing them requires the slow git push cycles.
> > - Public CI has limited platform support, just for example ARM machines
> > are
> >   not available.
> > - Public CI also has limited hardware support, no GPUs are available
> >
> > Resolving all of the issues above is complicated, but is a must for the
> > long
> > term sustainability of Arrow.
> >
> > For some time, we’ve been working on a tool called Ursabot[1], a library
> on
> > top
> > of the CI framework Buildbot[2]. Buildbot is well maintained and widely
> > used
> > for complex projects, including CPython, Webkit, LLVM, MariaDB, etc.
> > Buildbot
> > is not another hosted CI service like Travis or Appveyor: it is an
> > extensible
> > framework to implement various automations like continuous integration
> > tasks.
> >
> > You’ve probably noticed additional “Ursabot” builds appearing on pull
> > requests,
> > in addition to the Travis and Appveyor builds. We’ve been testing the
> > framework
> > with a fully featured CI server at ci.ursalabs.org. This service runs
> build
> > configurations we can’t run on Travis, does it faster than Travis, and
> has
> > the
> > GitHub comment bot integration for ad hoc build triggering.
> >
> > While we’re not prepared to propose moving all CI to a self-hosted setup,
> > our
> > work has demonstrated the potential of using buildbot to resolve Arrow’s
> > continuous integration challenges:
> > - The docker-based builders are reusing the docker images, which
> eliminate
> >   slow dependency installation steps. Some builds on this setup, run on
> >   Ursa Labs’s infrastructure, run 20 minutes faster than 

Re: [Discuss][Java] Support conversions between delta vector and partial sum vector

2019-09-05 Thread Fan Liya
Hi Micah,

Thanks for your comments.

I am aware that you have invested lots of time and effort in reviewing the
algorithm-related code.
We really appreciate it. Thank you so much.

I agree with you that the plan document is a good idea.
In general, the algorithms are driven by applications, so it is difficult
to give a precise plan.
However, I am going to prepare a document about the
requirements/design/implementation of the algorithms.

Hope that will make discussions/code review more efficient.

Best,
Liya Fan

On Thu, Sep 5, 2019 at 11:46 AM Fan Liya  wrote:

> Hi Wes,
>
> Thanks a lot for the comments.
> You are right. This can be applied to data encoding/compression, and I
> think this is one of the building blocks for encoding/compression.
>
> In the short term, it will provide conversions between the two memory
> layouts of run length vectors.
> In the mid term, it can help to reduce the network traffic for
> varchar/varbinary vectors.
> In the long term, it will provide compression for more scenarios.
>
> The basic idea is based on the observation that a vector usually needs a
> smaller element width after converting to a delta vector.
>
> For example, for a varchar vector with a large number of elements, the
> offset buffer will use 4 bytes for each element.
> However, it is likely that the strings in the vector are not big (less
> than 65536 in length). So by converting the offset buffer to a delta
> vector, we can use an int vector with a 2-byte width.
>
> Best,
> Liya Fan
>
>
> On Thu, Sep 5, 2019 at 3:05 AM Wes McKinney  wrote:
>
>> hi,
>>
>> Having utility algorithms to perform data transformations seems fine
>> if there is a use for them and maintaining the code in the Arrow
>> libraries makes sense.
>>
>> I don't understand point #2 "We can transform them to delta vectors
>> before IPC". It sounds like you are proposing a data compression
>> technique. Should this be a part of the
>> sparseness/encoding/compression discussion?
>>
>> - Wes
>>
>> On Sun, Sep 1, 2019 at 10:14 PM Fan Liya  wrote:
>> >
>> > Dear all,
>> >
>> > We want to support a feature for conversions between delta vector and
>> > partial sum vector. Please give your valuable feedback.
>> >
>> > Best,
>> >
>> > Liya Fan
>> >
>> > What is a delta vector/partial sum vector?
>> >
>> > Given an integer vector a with length n, its partial sum vector is
>> another
>> > integer vector b with length n + 1, with values defined as:
>> >
>> > b(0) = initial sum
>> > b(i) = a(0) + a(1) + ... + a(i - 1), i = 1, 2, ..., n
>> >
>> > Given an integer vector a with length n + 1, its delta vector is another
>> > integer vector b with length n, with values defined as:
>> >
>> > b(i) = a(i + 1) - a(i), i = 0, 1, ..., n - 1
>> >
>> > In this issue, we provide utilities to convert between delta vector and
>> > partial sum vector. It is interesting to note that the two operations
>> > correspond to discrete integration and differentiation.
>> >
>> > These conversions have wide applications. For example,
>> >
>> > 1. The run-length vector proposed by Micah is based on the partial sum
>> > vector, while the deduplication functionality is based on the delta
>> > vector. This issue provides conversions between them.
>> > 2. The current VarCharVector/VarBinaryVector implementations are based
>> > on the partial sum vector. We can transform them to delta vectors
>> > before IPC, to reduce network traffic.
>> > 3. Converting to delta can be considered a form of data compression.
>> > The operation can be applied more than once, to further reduce the
>> > data volume.
>> > Points to discuss:
>> > Should the API be provided at the level of vector or ArrowBuf, or both?
>> > 1. If it is based on vector, there can be performance overhead due to
>> > virtual method calls.
>> > 2. If it is based on ArrowBuf, some underlying details (type width) are
>> > exposed to the end user, which is not compliant with the principle of
>> > encapsulation.
>>
>
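As a concrete illustration of the definitions quoted above, here is a plain-Python sketch (not the proposed Java API; all names are made up) converting a delta vector to its partial sum vector and back, the way a varchar offset buffer relates to the string lengths.

```python
# Illustrative sketch only, not Arrow code.

def to_partial_sum(delta, initial_sum=0):
    """b(0) = initial_sum; b(i) = a(0) + ... + a(i - 1), i = 1..n."""
    partial = [initial_sum]
    for d in delta:
        partial.append(partial[-1] + d)
    return partial

def to_delta(partial_sum):
    """b(i) = a(i + 1) - a(i), i = 0..n-1: the inverse of to_partial_sum."""
    return [partial_sum[i + 1] - partial_sum[i]
            for i in range(len(partial_sum) - 1)]

# String lengths [3, 1, 4, 1] give a varchar-style offset buffer;
# round-tripping through to_delta recovers the lengths.
offsets = to_partial_sum([3, 1, 4, 1])
print(offsets)            # [0, 3, 4, 8, 9]
print(to_delta(offsets))  # [3, 1, 4, 1]
```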


Re: [PROPOSAL] Consolidate Arrow's CI configuration

2019-09-05 Thread Uwe L. Korn
Hello Krisztián, 

I like this proposal. CI coverage and response time is a crucial thing for the 
health of the project. In general I like the consolidation and local 
reproducibility of the builds. Some questions I wanted to ask to make sure I 
understand your proposal correctly (hopefully they all can be answered with a 
simple yes):

* Windows builds will stay in Appveyor for now?
* MacOS builds will stay in Travis?
* All other builds will be removed from Travis?
* Machines are currently run and funded by UrsaLabs but others could also 
sponsor an instance that could be added to the setup?
* The build configuration is automatically updated on a merge to master?

And then a not so simple one: What will happen to our current docker-compose 
setup? From the PR it seems like we do similar things with ursabot but not 
using the central docker-compose.yml?


Cheers
Uwe

> Am 29.08.2019 um 14:19 schrieb Krisztián Szűcs :
> 
> Hi,
> 
> Arrow's current continuous integration setup utilizes multiple CI
> providers,
> tools, and scripts:
> 
> - Unit tests are running on Travis and Appveyor
> - Binary packaging builds are running on crossbow, an abstraction over
> multiple
>   CI providers driven through a GitHub repository
> - For local tests and tasks, there is a docker-compose setup, or of course
> you
>   can maintain your own environment
> 
> This setup has run into some limitations:
> - It’s slow: the CI parallelism of Travis has degraded over the last
> couple of
>   months. Testing a PR takes more than an hour, which is a long time for
> both
>   the maintainers and the contributors, and it has a negative effect on
> the
>   development throughput.
> - Build configurations are not portable, they are tied to specific
> services.
>   You can’t just take a Travis script and run it somewhere else.
> - Because they’re not portable, build configurations are duplicated in
> several
>   places.
> - The Travis, Appveyor and crossbow builds are not reproducible locally,
> so
>   developing them requires the slow git push cycles.
> - Public CI has limited platform support, just for example ARM machines
> are
>   not available.
> - Public CI also has limited hardware support, no GPUs are available
> 
> Resolving all of the issues above is complicated, but is a must for the
> long
> term sustainability of Arrow.
> 
> For some time, we’ve been working on a tool called Ursabot[1], a library on
> top
> of the CI framework Buildbot[2]. Buildbot is well maintained and widely
> used
> for complex projects, including CPython, Webkit, LLVM, MariaDB, etc.
> Buildbot
> is not another hosted CI service like Travis or Appveyor: it is an
> extensible
> framework to implement various automations like continuous integration
> tasks.
> 
> You’ve probably noticed additional “Ursabot” builds appearing on pull
> requests,
> in addition to the Travis and Appveyor builds. We’ve been testing the
> framework
> with a fully featured CI server at ci.ursalabs.org. This service runs build
> configurations we can’t run on Travis, does it faster than Travis, and has
> the
> GitHub comment bot integration for ad hoc build triggering.
> 
> While we’re not prepared to propose moving all CI to a self-hosted setup,
> our
> work has demonstrated the potential of using buildbot to resolve Arrow’s
> continuous integration challenges:
> - The docker-based builders are reusing the docker images, which eliminate
>   slow dependency installation steps. Some builds on this setup, run on
>   Ursa Labs’s infrastructure, run 20 minutes faster than the comparable
>   Travis-CI jobs.
> - It’s scalable. We can deploy buildbot wherever and add more masters and
>   workers, which we can’t do with public CI.
> - It’s platform and CI-provider independent. Builds can be run on
> arbitrary
>   architectures, operating systems, and hardware: Python is the only
>   requirement. Additionally builds specified in buildbot/ursabot can be
> run
>   anywhere: not only on custom buildbot infrastructure but also on Travis,
> or
>   even on your own machine.
> - It improves reproducibility and encourages consolidation of
> configuration.
>   You can run the exact job locally that ran on Travis, and you can even
> get
>   an interactive shell in the build so you can debug a test failure. And
>   because you can run the same job anywhere, we wouldn’t need to have
>   duplicated, Travis-specific or the docker-compose build configuration
> stored
>   separately.
> - It’s extensible. More exotic features like a comment bot, benchmark
>   database, benchmark dashboard, artifact store, integrating other systems
> are
>   easily implementable within the same system.
> 
> I’m proposing to donate the build configuration we’ve been iterating on in
> Ursabot to the Arrow codebase. Here [3] is a patch that adds the
> configuration.
> This will enable us to explore consolidating build configuration using the
> buildbot framework. A next step after to explore that would be to port a
> Travis
> 

[jira] [Created] (ARROW-6463) [C++][Python] Rename arrow::fs::Selector to FileSelector

2019-09-05 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-6463:
--

 Summary: [C++][Python] Rename arrow::fs::Selector to FileSelector
 Key: ARROW-6463
 URL: https://issues.apache.org/jira/browse/ARROW-6463
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs


In both the C++ implementation and the python binding.





[jira] [Created] (ARROW-6462) [C++] Can't build with bundled double-conversion on CentOS 6 x86_64

2019-09-05 Thread Sutou Kouhei (Jira)
Sutou Kouhei created ARROW-6462:
---

 Summary: [C++] Can't build with bundled double-conversion on 
CentOS 6 x86_64
 Key: ARROW-6462
 URL: https://issues.apache.org/jira/browse/ARROW-6462
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Sutou Kouhei
Assignee: Sutou Kouhei


https://travis-ci.org/ursa-labs/crossbow/builds/581001313#L8163

{noformat}
-- Installing: 
/root/rpmbuild/BUILD/apache-arrow-0.14.0.dev451/cpp/build/double-conversion_ep/src/double-conversion_ep/lib64/libdouble-conversion.a
...
make[2]: *** No rule to make target 
'double-conversion_ep/src/double-conversion_ep/lib/libdouble-conversion.a', 
needed by 'release/libarrow.so.15.0.0'.  Stop.
{noformat}


