Re: Human-readable version of Arrow Schema?

2020-01-08 Thread Kohei KaiGai
Hello,

pg2arrow [*1] has a '--dump' mode that prints out the schema definition of
the given Apache Arrow file.
Would this work for you?

$ ./pg2arrow --dump ~/hoge.arrow
[Footer]
{Footer: version=V4, schema={Schema: endianness=little,
fields=[{Field: name="id", nullable=true, type={Int32}, children=[],
custom_metadata=[]}, {Field: name="a", nullable=true, type={Float64},
children=[], custom_metadata=[]}, {Field: name="b", nullable=true,
type={Decimal: precision=11, scale=7}, children=[],
custom_metadata=[]}, {Field: name="c", nullable=true, type={Struct},
children=[{Field: name="x", nullable=true, type={Int32}, children=[],
custom_metadata=[]}, {Field: name="y", nullable=true, type={Float32},
children=[], custom_metadata=[]}, {Field: name="z", nullable=true,
type={Utf8}, children=[], custom_metadata=[]}], custom_metadata=[]},
{Field: name="d", nullable=true, type={Utf8},
dictionary={DictionaryEncoding: id=0, indexType={Int32},
isOrdered=false}, children=[], custom_metadata=[]}, {Field: name="e",
nullable=true, type={Timestamp: unit=us}, children=[],
custom_metadata=[]}, {Field: name="f", nullable=true, type={Utf8},
children=[], custom_metadata=[]}, {Field: name="random",
nullable=true, type={Float64}, children=[], custom_metadata=[]}],
custom_metadata=[{KeyValue: key="sql_command" value="SELECT *,random()
FROM t"}]}, dictionaries=[{Block: offset=920, metaDataLength=184
bodyLength=128}], recordBatches=[{Block: offset=1232,
metaDataLength=648 bodyLength=386112}]}
[Dictionary Batch 0]
{Block: offset=920, metaDataLength=184 bodyLength=128}
{Message: version=V4, body={DictionaryBatch: id=0, data={RecordBatch:
length=6, nodes=[{FieldNode: length=6, null_count=0}],
buffers=[{Buffer: offset=0, length=0}, {Buffer: offset=0, length=64},
{Buffer: offset=64, length=64}]}, isDelta=false}, bodyLength=128}
[Record Batch 0]
{Block: offset=1232, metaDataLength=648 bodyLength=386112}
{Message: version=V4, body={RecordBatch: length=3000,
nodes=[{FieldNode: length=3000, null_count=0}, {FieldNode:
length=3000, null_count=60}, {FieldNode: length=3000, null_count=62},
{FieldNode: length=3000, null_count=0}, {FieldNode: length=3000,
null_count=56}, {FieldNode: length=3000, null_count=66}, {FieldNode:
length=3000, null_count=0}, {FieldNode: length=3000, null_count=0},
{FieldNode: length=3000, null_count=64}, {FieldNode: length=3000,
null_count=0}, {FieldNode: length=3000, null_count=0}],
buffers=[{Buffer: offset=0, length=0}, {Buffer: offset=0,
length=12032}, {Buffer: offset=12032, length=384}, {Buffer:
offset=12416, length=24000}, {Buffer: offset=36416, length=384},
{Buffer: offset=36800, length=48000}, {Buffer: offset=84800,
length=0}, {Buffer: offset=84800, length=384}, {Buffer: offset=85184,
length=12032}, {Buffer: offset=97216, length=384}, {Buffer:
offset=97600, length=12032}, {Buffer: offset=109632, length=0},
{Buffer: offset=109632, length=12032}, {Buffer: offset=121664,
length=96000}, {Buffer: offset=217664, length=0}, {Buffer:
offset=217664, length=12032}, {Buffer: offset=229696, length=384},
{Buffer: offset=230080, length=24000}, {Buffer: offset=254080,
length=0}, {Buffer: offset=254080, length=12032}, {Buffer:
offset=266112, length=96000}, {Buffer: offset=362112, length=0},
{Buffer: offset=362112, length=24000}]}, bodyLength=386112}

[*1] https://heterodb.github.io/pg-strom/arrow_fdw/#using-pg2arrow

2019-12-07 (Sat) 6:26 Christian Hudon :
>
> Hi,
>
> For the uses I would like to make of Arrow, I would need a human-readable
> and -writable version of an Arrow Schema, that could be converted to and
> from the Arrow Schema C++ object. Going through the doc for 0.15.1, I don't
> see anything to that effect, with the closest being the ToString() method
> on DataType instances, but which is meant for debugging only. (I need an
> expression of an Arrow Schema that people can read, and that can live
> outside of the code for a particular operation.)
>
> Is a text representation of an Arrow Schema something that is being worked
> on now? If not, would you folks be interested in me putting up an initial
> proposal for discussion? Any design constraints I should pay attention to,
> then?
>
> Thanks,
>
>   Christian
> --
>
>
> │ Christian Hudon
>
> │ Applied Research Scientist
>
>Element AI, 6650 Saint-Urbain #500
>
>Montréal, QC, H2S 3G9, Canada
>Elementai.com



-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 


[jira] [Created] (ARROW-7525) [Python][CI] Build PyArrow on VS2019

2020-01-08 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-7525:
--

 Summary: [Python][CI] Build PyArrow on VS2019 
 Key: ARROW-7525
 URL: https://issues.apache.org/jira/browse/ARROW-7525
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Continuous Integration
Reporter: Krisztian Szucs


Enable ARROW_PARQUET cmake flag. Additional patching might be required, see 
https://github.com/microsoft/vcpkg/pull/8263/files



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7524) [C++][CI] Build parquet support in the VS2019 GitHub Actions job

2020-01-08 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-7524:
--

 Summary: [C++][CI] Build parquet support in the VS2019 GitHub 
Actions job
 Key: ARROW-7524
 URL: https://issues.apache.org/jira/browse/ARROW-7524
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Continuous Integration
Reporter: Krisztian Szucs


Enable ARROW_PARQUET cmake flag. Additional patching might be required, see 
https://github.com/microsoft/vcpkg/pull/8263/files



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Human-readable version of Arrow Schema?

2020-01-08 Thread Micah Kornfield
The C-interface representation is probably slightly less readable than the
JSON implementation, if I understand the flatbuffers-to-JSON conversion
properly.  But as Antoine pointed out, it depends on the use case.

FWIW, flatbuffers maintainers indicated forward/backward compatibility is
intended to be preserved in the JSON encoding as well.
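To give a rough sense of what a human-readable schema could look like, here is a hand-rolled sketch in Python. The field layout below loosely follows Arrow's integration-test JSON conventions and is purely illustrative; it is not the flatbuffers JSON encoding being discussed in this thread:

```python
import json

# Illustrative only: this mirrors the information carried by Schema.fbs
# (name, nullability, type, children), not the exact flatbuffers JSON
# encoding under discussion.
schema = {
    "fields": [
        {"name": "id", "nullable": True,
         "type": {"name": "int", "bitWidth": 32, "isSigned": True},
         "children": []},
        {"name": "b", "nullable": True,
         "type": {"name": "decimal", "precision": 11, "scale": 7},
         "children": []},
        {"name": "c", "nullable": True, "type": {"name": "struct"},
         "children": [
             {"name": "x", "nullable": True,
              "type": {"name": "int", "bitWidth": 32, "isSigned": True},
              "children": []}]},
    ]
}

text = json.dumps(schema, indent=2)  # human-readable and diffable
assert json.loads(text) == schema    # trivially round-trippable
```

Whatever the final encoding, the compatibility question in this thread is about the mapping between this text form and the binary Schema.fbs, not about JSON itself.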

On Sat, Jan 4, 2020 at 2:16 PM Jacques Nadeau  wrote:

> What do people think about using the C interface representation?
>
> On Sun, Dec 29, 2019 at 12:42 PM Micah Kornfield 
> wrote:
>
>> I opened https://github.com/google/flatbuffers/issues/5688 to try to get
>> some clarity.
>>
>> On Tue, Dec 24, 2019 at 12:13 PM Wes McKinney 
>> wrote:
>>
>> > On Tue, Dec 24, 2019 at 2:47 AM Micah Kornfield 
>> > wrote:
>> > >>
>> > >> If we were to make the same kinds of forward/backward compatibility
>> > >> guarantees as with Flatbuffers it could create a lot of work for
>> > >> maintainers.
>> > >
>> > > Does it pay to follow up with the flatbuffers project to understand if
>> > the forward/backward compatibility guarantees that flatbuffers provide
>> > extend to their JSON format?
>> >
>> > I spent a few minutes looking at the Flatbuffers codebase and
>> > documentation and did not find anything, so this seems like useful
>> > information to have regardless.
>> >
>> > >
>> > > On Sun, Dec 15, 2019 at 11:17 AM Wes McKinney 
>> > wrote:
>> > >>
>> > >> I'd be open to looking at a proposal for a human-readable text
>> > >> representation, but I'm definitely wary about making any kind of
>> > >> cross-version compatibility guarantees (beyond "we will do our
>> best").
>> > >> If we were to make the same kinds of forward/backward compatibility
>> > >> guarantees as with Flatbuffers it could create a lot of work for
>> > >> maintainers.
>> > >>
>> > >> On Thu, Dec 12, 2019 at 12:43 AM Micah Kornfield <
>> emkornfi...@gmail.com>
>> > wrote:
>> > >> >
>> > >> > >
>> > >> > > With these two together, it would seem not too difficult to
>> create
>> > a text
>> > >> > > representation for Arrow schemas that (at some point) has some
>> > >> > > compatibility guarantees, but maybe I'm missing something?
>> > >> >
>> > >> >
>> > >> > I think the main risk is if somehow flatbuffers JSON parsing
>> doesn't
>> > handle
>> > >> > backward compatible changes to the arrow schema message.  Given the
>> > way the
>> > >> > documentation is describing the JSON functionality I think this
>> would
>> > be
>> > >> > considered a bug.
>> > >> >
>> > >> > The one downside to calling the "schema" canonical is the
>> flatbuffers
>> > JSON
>> > >> > functionality only appears to be available in C++ and Java via JNI,
>> > so it
>> > >> > wouldn't have cross language support.  I think this issue is more
>> one
>> > of
>> > >> > semantics though (i.e. does the JSON description become part of the
>> > "Arrow
>> > >> > spec" or does it live as a C++/Python only feature).
>> > >> >
>> > >> > -Micah
>> > >> >
>> > >> >
>> > >> > On Tue, Dec 10, 2019 at 10:51 AM Christian Hudon <
>> > chr...@elementai.com>
>> > >> > wrote:
>> > >> >
>> > >> > > Micah: I didn't know that Flatbuffers supported serialization
>> > to/from JSON,
>> > >> > > thanks. That seems like a very good start, at least. I'll aim to
>> > create a
>> > >> > > draft pull request that at least wires everything up in Arrow so
>> we
>> > can
>> > >> > > load/save a Schema.fbs instance from/to JSON. At least it'll make
>> > it easier
>> > >> > > for me to see how Arrow schemas would look in JSON with that.
>> > >> > >
>> > >> > > Otherwise, I'm still gathering requirements internally here. For
>> > example,
>> > >> > > one thing that would be nice would be to be able to output a JSON
>> > Schema
>> > >> > > from at least a subset of the Arrow schema. (That way our users
>> > could start
>> > >> > > by passing around JSON with a given schema, and transition pieces
>> > of a
>> > >> > > workflow to Arrow as they're ready.) But that part can also be
>> done
>> > outside
>> > >> > > of the Arrow code, if deemed not relevant to have in the Arrow
>> > codebase
>> > >> > > itself.
>> > >> > >
>> > >> > > One core requirement for us, however, would be eventual
>> > compatibility
>> > >> > > between Arrow versions for a given text representation of a
>> schema.
>> > >> > > Meaning, if you have a text description of a given Arrow schema,
>> > you can
>> > >> > > load it into different versions of Arrow and it creates a valid
>> > Schema
>> > >> > > Flatbuffer description, that Arrow can use. Wes, were you
>> thinking
>> > of that,
>> > >> > > or of something else, when you wrote "only makes sense if it is
>> > offered
>> > >> > > without any backward/forward compatibility guarantees"?
>> > >> > >
>> > >> > > For now, for me, assuming the JSON serialization done by the
>> > Flatbuffer
>> > >> > > libraries is usable, it seems we have all the pieces to make this
>> > happen:
>> > >> > > 1) The binary Schema.fbs data structures has to be compatible
>> > between
>> > 

Re: [C++] "nonexistent" or "non-existent"

2020-01-08 Thread Micah Kornfield
We also have "KeyError", which I think might model the same concept?

On Mon, Jan 6, 2020 at 7:35 AM Wes McKinney  wrote:

> I agree using a different terminology than "nonexistent" like
> "NotFound" would be good. If we use "nonexistent" then the hyphen-free
> spelling seems preferred
>
> On Sun, Dec 29, 2019 at 2:23 PM Micah Kornfield 
> wrote:
> >
> > I'm not sure if all of the examples refer to the same thing, but "Not
> > Found" (from http 404 error) is the most common way of expressing at
> least
> > the first concept I think.
> >
> > On Sat, Dec 28, 2019 at 11:45 AM Neal Richardson <
> > neal.p.richard...@gmail.com> wrote:
> >
> > > IMO while "nonexistent" is the right word, neither are particularly
> > > readable or obvious in code. Is there a better word/phrase?
> > >
> > > On Fri, Dec 27, 2019 at 5:34 PM Sutou Kouhei 
> wrote:
> > >
> > > > Hi,
> > > >
> > > > I found that we use both "nonexistent" and "non-existent" in
> > > > our C++ code base. I think that we should use one of them
> > > > instead of mixing them.
> > > >
> > > > "nonexistent":
> > > >
> > > >   * Public API:
> > > > * cpp/src/plasma/:
> plasma::PlasmaErrorCode::PlasmaObjectNonexistent
> > > >
> > > > "non-existent":
> > > >
> > > >   * Public API:
> > > > * cpp/src/arrow/filesystem/: arrow::fs::FileType::NonExistent
> > > >   * Internal:
> > > > * cpp/src/arrow/util/io_util.h: allow_non_existent
> > > >   * Test:
> > > > * cpp/src/gandiva/tests/: non_existent_function
> > > >
> > > >
> > > > Which should we use?
> > > > (Personally, I prefer "nonexistent" to "non-existent".)
> > > >
> > > >
> > > > Thanks,
> > > > --
> > > > kou
> > > >
> > >
>


[jira] [Created] (ARROW-7522) Broken Record Batch returned from a function call

2020-01-08 Thread Chengxin Ma (Jira)
Chengxin Ma created ARROW-7522:
--

 Summary: Broken Record Batch returned from a function call
 Key: ARROW-7522
 URL: https://issues.apache.org/jira/browse/ARROW-7522
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, C++ - Plasma
Affects Versions: 0.15.1
 Environment: macOS
Reporter: Chengxin Ma


Scenario: retrieving Record Batch from Plasma with known Object ID.

The following code snippet works well:
{code:java}
int main(int argc, char **argv)
{
  plasma::ObjectID object_id =
      plasma::ObjectID::from_binary("0FF1CE00C0FFEE00BEEF");

  // Start up and connect a Plasma client.
  plasma::PlasmaClient client;
  ARROW_CHECK_OK(client.Connect("/tmp/store"));

  plasma::ObjectBuffer object_buffer;
  ARROW_CHECK_OK(client.Get(&object_id, 1, -1, &object_buffer));

  // Retrieve object data.
  auto buffer = object_buffer.data;

  arrow::io::BufferReader buffer_reader(buffer);
  std::shared_ptr<arrow::RecordBatchReader> record_batch_stream_reader;
  ARROW_CHECK_OK(arrow::ipc::RecordBatchStreamReader::Open(
      &buffer_reader, &record_batch_stream_reader));

  std::shared_ptr<arrow::RecordBatch> record_batch;
  arrow::Status status = record_batch_stream_reader->ReadNext(&record_batch);

  std::cout << "record_batch->column_name(0): "
            << record_batch->column_name(0) << std::endl;
  std::cout << "record_batch->num_columns(): "
            << record_batch->num_columns() << std::endl;
  std::cout << "record_batch->num_rows(): "
            << record_batch->num_rows() << std::endl;
  std::cout << "record_batch->column(0)->length(): "
            << record_batch->column(0)->length() << std::endl;
  std::cout << "record_batch->column(0)->ToString(): "
            << record_batch->column(0)->ToString() << std::endl;
}
{code}
{{record_batch->column(0)->ToString()}} incurs a segmentation fault when the 
Record Batch retrieval is wrapped in a function:
{code:java}
std::shared_ptr<arrow::RecordBatch> GetRecordBatchFromPlasma(
    plasma::ObjectID object_id)
{
  // Start up and connect a Plasma client.
  plasma::PlasmaClient client;
  ARROW_CHECK_OK(client.Connect("/tmp/store"));

  plasma::ObjectBuffer object_buffer;
  ARROW_CHECK_OK(client.Get(&object_id, 1, -1, &object_buffer));

  // Retrieve object data.
  auto buffer = object_buffer.data;

  arrow::io::BufferReader buffer_reader(buffer);
  std::shared_ptr<arrow::RecordBatchReader> record_batch_stream_reader;
  ARROW_CHECK_OK(arrow::ipc::RecordBatchStreamReader::Open(
      &buffer_reader, &record_batch_stream_reader));

  std::shared_ptr<arrow::RecordBatch> record_batch;
  arrow::Status status = record_batch_stream_reader->ReadNext(&record_batch);

  // Disconnect the client.
  ARROW_CHECK_OK(client.Disconnect());

  return record_batch;
}

int main(int argc, char **argv)
{
  plasma::ObjectID object_id =
      plasma::ObjectID::from_binary("0FF1CE00C0FFEE00BEEF");

  std::shared_ptr<arrow::RecordBatch> record_batch =
      GetRecordBatchFromPlasma(object_id);

  std::cout << "record_batch->column_name(0): "
            << record_batch->column_name(0) << std::endl;
  std::cout << "record_batch->num_columns(): "
            << record_batch->num_columns() << std::endl;
  std::cout << "record_batch->num_rows(): "
            << record_batch->num_rows() << std::endl;
  std::cout << "record_batch->column(0)->length(): "
            << record_batch->column(0)->length() << std::endl;
  std::cout << "record_batch->column(0)->ToString(): "
            << record_batch->column(0)->ToString() << std::endl;
}
{code}
The meta info of the Record Batch such as number of columns and rows is still 
available, but I can't see the content of the columns.

{{lldb}} says that the stop reason is {{EXC_BAD_ACCESS}}, so I think the Record 
Batch is destroyed after {{GetRecordBatchFromPlasma}} finishes. But why can I 
still see the meta info of this Record Batch?
 What is the proper way to get the Record Batch if we insist on using a function?
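The symptom is consistent with the zero-copy design of the IPC reader: the RecordBatch returned by ReadNext most likely still references the Plasma-mapped shared memory, which client.Disconnect() unmaps, so scalar metadata already copied into the RecordBatch object (row and column counts) survives while column data access faults. A stdlib-Python analogue of the same lifetime hazard (all names here are illustrative):

```python
import mmap

def get_view_unsafe():
    # Analogue of GetRecordBatchFromPlasma: map some memory, take a
    # zero-copy view into it, then try to tear the mapping down before
    # returning the view (like calling client.Disconnect()).
    m = mmap.mmap(-1, 16)          # anonymous 16-byte mapping
    m.write(b"record batch....")
    view = memoryview(m)           # zero-copy view into the mapping
    try:
        m.close()                  # Python refuses while views exist...
    except BufferError as err:
        return view, err           # ...C++ instead leaves the batch dangling
    return view, None

view, err = get_view_unsafe()
print(type(err).__name__)  # BufferError
```

Under that reading, the fix would be to keep the Plasma client connected (and the object reference held) for as long as the RecordBatch is in use, or to copy the batch out of shared memory before disconnecting.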



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7521) [Rust] Remove tuple on FixedSizeList datatype

2020-01-08 Thread Neville Dipale (Jira)
Neville Dipale created ARROW-7521:
-

 Summary: [Rust] Remove tuple on FixedSizeList datatype
 Key: ARROW-7521
 URL: https://issues.apache.org/jira/browse/ARROW-7521
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Neville Dipale


The FixedSizeList datatype takes a tuple of Box and length, but this 
could be simplified to take the two values without a tuple.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7520) Arrow / R - too many batches causes a crash

2020-01-08 Thread Christian (Jira)
Christian created ARROW-7520:


 Summary: Arrow / R - too many batches causes a crash
 Key: ARROW-7520
 URL: https://issues.apache.org/jira/browse/ARROW-7520
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 0.15.1
 Environment: - Session info ---
 setting  value
 version  R version 3.6.1 (2019-07-05)
 os       Windows 10 x64
 system   x86_64, mingw32
 ui       RStudio
 language (EN)
 collate  English_United States.1252
 ctype    English_United States.1252
 tz       America/New_York
 date     2020-01-08

- Packages ---
 ! package      * version     date       lib source
   acepack        1.4.1       2016-10-29 [1] CRAN (R 3.6.1)
   arrow        * 0.15.1.1    2019-11-05 [1] CRAN (R 3.6.2)
   askpass        1.1         2019-01-13 [1] CRAN (R 3.6.1)
   assertthat     0.2.1       2019-03-21 [1] CRAN (R 3.6.1)
   backports      1.1.5       2019-10-02 [1] CRAN (R 3.6.1)
   base64enc      0.1-3       2015-07-28 [1] CRAN (R 3.6.0)
   bit            1.1-14      2018-05-29 [1] CRAN (R 3.6.0)
   bit64          0.9-7       2017-05-08 [1] CRAN (R 3.6.0)
   blob           1.2.0       2019-07-09 [1] CRAN (R 3.6.1)
   callr          3.3.1       2019-07-18 [1] CRAN (R 3.6.1)
   cellranger     1.1.0       2016-07-27 [1] CRAN (R 3.6.1)
   checkmate      1.9.4       2019-07-04 [1] CRAN (R 3.6.1)
   cli            1.1.0       2019-03-19 [1] CRAN (R 3.6.1)
   cluster        2.1.0       2019-06-19 [2] CRAN (R 3.6.1)
   codetools      0.2-16      2018-12-24 [2] CRAN (R 3.6.1)
   colorspace     1.4-1       2019-03-18 [1] CRAN (R 3.6.1)
   commonmark     1.7         2018-12-01 [1] CRAN (R 3.6.1)
   crayon         1.3.4       2017-09-16 [1] CRAN (R 3.6.1)
   credentials    1.1         2019-03-12 [1] CRAN (R 3.6.2)
   curl         * 4.2         2019-09-24 [1] CRAN (R 3.6.1)
   data.table     1.12.2      2019-04-07 [1] CRAN (R 3.6.1)
   DBI          * 1.0.0       2018-05-02 [1] CRAN (R 3.6.1)
   desc           1.2.0       2018-05-01 [1] CRAN (R 3.6.1)
   devtools     * 2.2.0       2019-09-07 [1] CRAN (R 3.6.1)
   digest         0.6.23      2019-11-23 [1] CRAN (R 3.6.1)
   dplyr        * 0.8.3       2019-07-04 [1] CRAN (R 3.6.1)
   DT             0.9         2019-09-17 [1] CRAN (R 3.6.1)
   ellipsis       0.3.0       2019-09-20 [1] CRAN (R 3.6.1)
   evaluate       0.14        2019-05-28 [1] CRAN (R 3.6.1)
   foreign        0.8-71      2018-07-20 [2] CRAN (R 3.6.1)
   Formula      * 1.2-3       2018-05-03 [1] CRAN (R 3.6.0)
   fs             1.3.1       2019-05-06 [1] CRAN (R 3.6.1)
   fst          * 0.9.0       2019-04-09 [1] CRAN (R 3.6.1)
   future       * 1.15.0-9000 2019-11-19 [1] Github (HenrikBengtsson/future@bc241c7)
   ggplot2      * 3.2.1       2019-08-10 [1] CRAN (R 3.6.1)
   globals        0.12.4      2018-10-11 [1] CRAN (R 3.6.0)
   glue         * 1.3.1       2019-03-12 [1] CRAN (R 3.6.1)
   gridExtra      2.3         2017-09-09 [1] CRAN (R 3.6.1)
   gt           * 0.1.0       2019-11-27 [1] Github (rstudio/gt@284bbe5)
   gtable         0.3.0       2019-03-25 [1] CRAN (R 3.6.1)
   Hmisc        * 4.3-0       2019-11-07 [1] CRAN (R 3.6.1)
   htmlTable      1.13.2      2019-09-22 [1] CRAN (R 3.6.1)
 D htmltools      0.3.6.9004  2019-09-20 [1] Github (rstudio/htmltools@c49b29c)
   htmlwidgets    1.3         2018-09-30 [1] CRAN (R 3.6.1)

Re: Arrow / R - too many batches causes a crash

2020-01-08 Thread Wes McKinney
Can you please open a JIRA issue?

On Wed, Jan 8, 2020 at 12:37 PM Christian Klar 
wrote:

> Hi,
>
>
>
> At the bottom please find the session_info.
>
>
>
> When creating north of 200-300 batches, the writing to the arrow file
> crashes R – it doesn’t even show an error message. Rstudio just aborts.
>
>
>
> I have the feeling that maybe each batch becomes a stream and R has issues
> with the connections, but that’s a total guess.
>
>
>
> Any help would be appreciated.
>
>
>
> ##
>
>
>
> Here is the function. When running it with 3000 it crashes immediately.
>
>
>
> Before that I ran it with 100, and then increased it slowly, and then it
> randomly crashed again.
>
>
>
>
> write_arrow_custom(data.frame(A=c(1:10),B=c(1:10)),'C:/Temp/test.arrow',3000)
>
>
>
> write_arrow_custom <- function(df,targetarrow,nrbatches) {
>
>   ct <- nrbatches
>
>   idxs <- c(0:ct)/ct*nrow(df)
>
>   idxs <- round(idxs,0) %>% as.integer()
>
>   idxs[length(idxs)] <- nrow(df)
>
>   df_nav <- idxs %>% as.data.frame() %>% rename(colfrom=1) %>%
> mutate(colto=lead(colfrom)) %>% mutate(colfrom=colfrom+1) %>% filter(!
> is.na(colto)) %>% mutate(R=row_number())
>
>   stopifnot(df_nav %>% mutate(chk=colto-colfrom+1) %>% '$'('chk') %>%
> sum()==nrow(df))
>
>   table_df <- Table$create(name=rownames(df[1,]),df[1,])
>
>   writer <- RecordBatchFileWriter$create(targetarrow,table_df$schema)
>
>   df_nav %>% dlply(c('R'),function(df_nav){
>
> catl(glue('{df_nav$colfrom[1]}:{df_nav$colto[1]} / {df_nav$R[1]}...'))
>
> tmp <- df[df_nav$colfrom[1]:df_nav$colto[1],]
>
> writer$write_batch(record_batch(name = rownames(tmp), tmp))
>
> NULL
>
>   }) -> batch_lst
>
>   writer$close()
>
>   rm(batch_lst)
>
>   gc()
>
> }
>
>
>
>
>
> ##

[jira] [Created] (ARROW-7519) [Python] Build wheels, conda packages with PYARROW_WITH_DATASET=1

2020-01-08 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-7519:
---

 Summary: [Python] Build wheels, conda packages with 
PYARROW_WITH_DATASET=1
 Key: ARROW-7519
 URL: https://issues.apache.org/jira/browse/ARROW-7519
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.16.0


We should 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7518) [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages

2020-01-08 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-7518:
---

 Summary: [Python] Use PYARROW_WITH_HDFS when building wheels, 
conda packages
 Key: ARROW-7518
 URL: https://issues.apache.org/jira/browse/ARROW-7518
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.16.0


This new module is not enabled in the package builds



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Arrow / R - too many batches causes a crash

2020-01-08 Thread Christian Klar
Hi,

At the bottom please find the session_info.

When creating more than about 200-300 batches, writing to the Arrow file 
crashes R; it doesn't even show an error message. RStudio just aborts.

I have the feeling that maybe each batch becomes a stream and R has issues with 
the connections, but that’s a total guess.

Any help would be appreciated.

##

Here is the function. When running it with 3000 it crashes immediately.

Before that I ran it with 100, and then increased it slowly, and then it 
randomly crashed again.

write_arrow_custom(data.frame(A=c(1:10),B=c(1:10)),'C:/Temp/test.arrow',3000)

write_arrow_custom <- function(df,targetarrow,nrbatches) {
  ct <- nrbatches
  idxs <- c(0:ct)/ct*nrow(df)
  idxs <- round(idxs,0) %>% as.integer()
  idxs[length(idxs)] <- nrow(df)
  df_nav <- idxs %>% as.data.frame() %>% rename(colfrom=1) %>% 
mutate(colto=lead(colfrom)) %>% mutate(colfrom=colfrom+1) %>% 
filter(!is.na(colto)) %>% mutate(R=row_number())
  stopifnot(df_nav %>% mutate(chk=colto-colfrom+1) %>% '$'('chk') %>% 
sum()==nrow(df))
  table_df <- Table$create(name=rownames(df[1,]),df[1,])
  writer <- RecordBatchFileWriter$create(targetarrow,table_df$schema)
  df_nav %>% dlply(c('R'),function(df_nav){
catl(glue('{df_nav$colfrom[1]}:{df_nav$colto[1]} / {df_nav$R[1]}...'))
tmp <- df[df_nav$colfrom[1]:df_nav$colto[1],]
writer$write_batch(record_batch(name = rownames(tmp), tmp))
NULL
  }) -> batch_lst
  writer$close()
  rm(batch_lst)
  gc()
}
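The dplyr bookkeeping above just partitions the row range 1..nrow(df) into nrbatches contiguous slices. A stdlib-Python sketch of the same computation (batch_bounds is a hypothetical helper name; it uses the same round()-based cut points as the R code):

```python
def batch_bounds(nrows, nbatches):
    # 1-based inclusive (colfrom, colto) pairs, mirroring the R code:
    # cut points at round(i/nbatches * nrows), empty slices dropped.
    cuts = [round(i * nrows / nbatches) for i in range(nbatches + 1)]
    cuts[-1] = nrows  # idxs[length(idxs)] <- nrow(df)
    return [(a + 1, b) for a, b in zip(cuts, cuts[1:]) if b >= a + 1]

bounds = batch_bounds(10, 3)
print(bounds)  # [(1, 3), (4, 7), (8, 10)]
# The stopifnot() invariant: slice lengths sum back to nrows.
assert sum(b - a + 1 for a, b in bounds) == 10
```

Note that with nrbatches much larger than nrow(df), as in the crashing call (3000 batches over 10 rows), most slices are empty; the R code still issues one write_batch per non-empty slice, so the writer ends up producing many tiny record batches.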


##



- Session info 
---
setting  value
 version  R version 3.6.1 (2019-07-05)
os   Windows 10 x64
 system   x86_64, mingw32
 ui   RStudio
 language (EN)
 collate  English_United States.1252
 ctype    English_United States.1252
 tz   America/New_York
 date 2020-01-08

- Packages 
---
! package  * version date   lib source
   acepack1.4.1   2016-10-29 [1] CRAN (R 3.6.1)
   arrow        * 0.15.1.1    2019-11-05 [1] CRAN (R 3.6.2)
   askpass1.1 2019-01-13 [1] CRAN (R 3.6.1)
   assertthat 0.2.1   2019-03-21 [1] CRAN (R 3.6.1)
   backports  1.1.5   2019-10-02 [1] CRAN (R 3.6.1)
   base64enc  0.1-3   2015-07-28 [1] CRAN (R 3.6.0)
   bit1.1-14  2018-05-29 [1] CRAN (R 3.6.0)
   bit64  0.9-7   2017-05-08 [1] CRAN (R 3.6.0)
   blob   1.2.0   2019-07-09 [1] CRAN (R 3.6.1)
   callr  3.3.1   2019-07-18 [1] CRAN (R 3.6.1)
   cellranger 1.1.0   2016-07-27 [1] CRAN (R 3.6.1)
   checkmate  1.9.4   2019-07-04 [1] CRAN (R 3.6.1)
   cli1.1.0   2019-03-19 [1] CRAN (R 3.6.1)
   cluster2.1.0   2019-06-19 [2] CRAN (R 3.6.1)
   codetools  0.2-16  2018-12-24 [2] CRAN (R 3.6.1)
   colorspace 1.4-1   2019-03-18 [1] CRAN (R 3.6.1)
   commonmark 1.7 2018-12-01 [1] CRAN (R 3.6.1)
   crayon 1.3.4   2017-09-16 [1] CRAN (R 3.6.1)
   credentials1.1 2019-03-12 [1] CRAN (R 3.6.2)
   curl * 4.2 2019-09-24 [1] CRAN (R 3.6.1)
   data.table 1.12.2  2019-04-07 [1] CRAN (R 3.6.1)
   DBI  * 1.0.0   2018-05-02 [1] CRAN (R 3.6.1)
   desc   1.2.0   2018-05-01 [1] CRAN (R 3.6.1)
   devtools * 2.2.0   2019-09-07 [1] CRAN (R 3.6.1)
   digest 0.6.23  2019-11-23 [1] CRAN (R 3.6.1)
   dplyr* 0.8.3   2019-07-04 [1] CRAN (R 3.6.1)
   DT 0.9 2019-09-17 [1] CRAN (R 3.6.1)
   ellipsis   0.3.0   2019-09-20 [1] CRAN (R 3.6.1)
   evaluate       0.14        2019-05-28 [1] CRAN (R 3.6.1)
   foreign0.8-71  2018-07-20 [2] CRAN (R 3.6.1)
   Formula  * 1.2-3   2018-05-03 [1] CRAN (R 3.6.0)
   fs 1.3.1   2019-05-06 [1] CRAN (R 3.6.1)
   fst  * 0.9.0   2019-04-09 [1] CRAN (R 3.6.1)
   future   * 1.15.0-9000 2019-11-19 [1] Github 
(HenrikBengtsson/future@bc241c7)
   ggplot2  * 3.2.1   2019-08-10 [1] CRAN (R 3.6.1)
   globals0.12.4  2018-10-11 [1] CRAN (R 3.6.0)
   glue * 1.3.1   2019-03-12 [1] CRAN (R 3.6.1)
   gridExtra  2.3 2017-09-09 [1] CRAN (R 3.6.1)
   gt   * 0.1.0   2019-11-27 [1] Github (rstudio/gt@284bbe5)
   gtable 0.3.0   2019-03-25 [1] CRAN (R 3.6.1)
   Hmisc* 4.3-0   2019-11-07 [1] CRAN (R 3.6.1)
   htmlTable  1.13.2  2019-09-22 [1] CRAN (R 3.6.1)
 D htmltools  0.3.6.9004  2019-09-20 [1] Github (rstudio/htmltools@c49b29c)
   htmlwidgets1.3 2018-09-30 [1] CRAN (R 3.6.1)
   jsonlite * 1.6 2018-12-07 [1] CRAN (R 3.6.1)
   knitr          1.25        2019-09-18 [1] CRAN 

Re: [DRAFT] Apache Arrow Board Report January 2020

2020-01-08 Thread Wes McKinney
Not sure what happened there. The two words after "grow" can be removed

## Description:

The mission of Apache Arrow is the creation and maintenance of software related
to columnar in-memory processing and data interchange

## Issues:

There are no issues requiring board attention at this time.

## Membership Data:
Apache Arrow was founded 2016-01-19 (4 years ago)
There are currently 50 committers and 28 PMC members in this project.
The Committer-to-PMC ratio is roughly 7:4.

Community changes, past quarter:
- No new PMC members. Last addition was Micah Kornfield on 2019-08-21.
- Eric Erhardt was added as committer on 2019-10-18
- Joris Van den Bossche was added as committer on 2019-12-06

## Project Activity:

* We have completed our initial migration away from Travis CI for
  continuous integration and patch validation to use the new
  GitHub Actions (GHA) service. We are much happier with the
  compute resource allocation provided by GitHub but longer term
  we are concerned that the generous free allocation may not
  continue and would be interested to know what kinds of
  guarantees (if any) GitHub may make to the ASF regarding GHA.
* We are not out of the woods on CI/CD as there are features of Apache Arrow
  that we cannot test in GitHub Actions. We are still considering options for
  running these optional test workloads as well as other kinds of periodic
  workloads like benchmarking
* We hope to make a 1.0.0 release of the project in early 2020. We had thought
  that our next major release after 0.15.0 would be 1.0.0 but we have not yet
  completed some necessary work items that the community has agreed are
  essential to graduate to 1.0.0

Recent releases:
0.15.0 was released on 2019-10-05.
0.14.1 was released on 2019-07-21.
0.14.0 was released on 2019-07-04.

## Community Health:

The developer community is healthy and continues to grow.

On Wed, Jan 8, 2020 at 12:12 PM Todd Hendricks  wrote:
>
> Hi Wes,
>
> Looks like there is a cutoff sentence at the end of the Community Health
> section.
>
> On Wed, Jan 8, 2020 at 10:01 AM Wes McKinney  wrote:
>
> > Here is an updated draft. If there is no more feedback, this can be
> > submitted to the board
> >
> > ## Description:
> >
> > The mission of Apache Arrow is the creation and maintenance of software
> > related
> > to columnar in-memory processing and data interchange
> >
> > ## Issues:
> >
> > There are no issues requiring board attention at this time.
> >
> > ## Membership Data:
> > Apache Arrow was founded 2016-01-19 (4 years ago)
> > There are currently 50 committers and 28 PMC members in this project.
> > The Committer-to-PMC ratio is roughly 7:4.
> >
> > Community changes, past quarter:
> > - No new PMC members. Last addition was Micah Kornfield on 2019-08-21.
> > - Eric Erhardt was added as committer on 2019-10-18
> > - Joris Van den Bossche was added as committer on 2019-12-06
> >
> > ## Project Activity:
> >
> > * We have completed our initial migration away from Travis CI for
> >   continuous integration and patch validation to use the new
> >   GitHub Actions (GHA) service. We are much happier with the
> >   compute resource allocation provided by GitHub but longer term
> >   we are concerned that the generous free allocation may not
> >   continue and would be interested to know what kinds of
> >   guarantees (if any) GitHub may make to the ASF regarding GHA.
> > * We are not out of the woods on CI/CD as there are features of Apache
> > Arrow
> >   that we cannot test in GitHub Actions. We are still considering options
> > for
> >   running these optional test workloads as well as other kinds of periodic
> >   workloads like benchmarking
> > * We hope to make a 1.0.0 release of the project in early 2020. We had
> > thought
> >   that our next major release after 0.15.0 would be 1.0.0 but we have not
> > yet
> >   completed some necessary work items that the community has agreed are
> >   essential to graduate to 1.0.0
> >
> > Recent releases:
> > 0.15.0 was released on 2019-10-05.
> > 0.14.1 was released on 2019-07-21.
> > 0.14.0 was released on 2019-07-04.
> >
> > ## Community Health:
> >
> > The developer community is healthy and continues to grow.THe co
> >
> > On Mon, Jan 6, 2020 at 11:16 AM Antoine Pitrou  wrote:
> > >
> > >
> > > Perhaps also mention that we're dependent on enough capacity on GitHub
> > > Actions currently.  I'm not sure how long their generosity will last :-)
> > >
> > >
> > > Le 06/01/2020 à 18:14, Wes McKinney a écrit :
> > > > There is still the question of how to manage CI tasks (e.g.
> > > > GPU-enabled, ARM-enabled) that are unable to be run in GitHub Actions.
> > > > We should probably mention that we've migrated off Travis CI, though.
> > > >
> > > > On Mon, Jan 6, 2020 at 11:07 AM Antoine Pitrou 
> > wrote:
> > > >>
> > > >>
> > > >> Do we consider the CI issue solved?
> > > >>
> > > >>
> > > >> Le 06/01/2020 à 18:02, Wes McKinney a écrit :
> > > >>> Hi folks -- our quarterly ASF board report is due 

Re: [DRAFT] Apache Arrow Board Report January 2020

2020-01-08 Thread Todd Hendricks
Hi Wes,

Looks like there is a cutoff sentence at the end of the Community Health
section.

On Wed, Jan 8, 2020 at 10:01 AM Wes McKinney  wrote:

> Here is an updated draft. If there is no more feedback, this can be
> submitted to the board
>
> ## Description:
>
> The mission of Apache Arrow is the creation and maintenance of software
> related
> to columnar in-memory processing and data interchange
>
> ## Issues:
>
> There are no issues requiring board attention at this time.
>
> ## Membership Data:
> Apache Arrow was founded 2016-01-19 (4 years ago)
> There are currently 50 committers and 28 PMC members in this project.
> The Committer-to-PMC ratio is roughly 7:4.
>
> Community changes, past quarter:
> - No new PMC members. Last addition was Micah Kornfield on 2019-08-21.
> - Eric Erhardt was added as committer on 2019-10-18
> - Joris Van den Bossche was added as committer on 2019-12-06
>
> ## Project Activity:
>
> * We have completed our initial migration away from Travis CI for
>   continuous integration and patch validation to use the new
>   GitHub Actions (GHA) service. We are much happier with the
>   compute resource allocation provided by GitHub but longer term
>   we are concerned that the generous free allocation may not
>   continue and would be interested to know what kinds of
>   guarantees (if any) GitHub may make to the ASF regarding GHA.
> * We are not out of the woods on CI/CD as there are features of Apache
> Arrow
>   that we cannot test in GitHub Actions. We are still considering options
> for
>   running these optional test workloads as well as other kinds of periodic
>   workloads like benchmarking
> * We hope to make a 1.0.0 release of the project in early 2020. We had
> thought
>   that our next major release after 0.15.0 would be 1.0.0 but we have not
> yet
>   completed some necessary work items that the community has agreed are
>   essential to graduate to 1.0.0
>
> Recent releases:
> 0.15.0 was released on 2019-10-05.
> 0.14.1 was released on 2019-07-21.
> 0.14.0 was released on 2019-07-04.
>
> ## Community Health:
>
> The developer community is healthy and continues to grow.THe co
>
> On Mon, Jan 6, 2020 at 11:16 AM Antoine Pitrou  wrote:
> >
> >
> > Perhaps also mention that we're dependent on enough capacity on GitHub
> > Actions currently.  I'm not sure how long their generosity will last :-)
> >
> >
> > Le 06/01/2020 à 18:14, Wes McKinney a écrit :
> > > There is still the question of how to manage CI tasks (e.g.
> > > GPU-enabled, ARM-enabled) that are unable to be run in GitHub Actions.
> > > We should probably mention that we've migrated off Travis CI, though.
> > >
> > > On Mon, Jan 6, 2020 at 11:07 AM Antoine Pitrou 
> wrote:
> > >>
> > >>
> > >> Do we consider the CI issue solved?
> > >>
> > >>
> > >> Le 06/01/2020 à 18:02, Wes McKinney a écrit :
> > >>> Hi folks -- our quarterly ASF board report is due in 2 days. What
> > >>> items would we like to add in the below sections?
> > >>>
> > >>> ## Description:
> > >>>
> > >>> The mission of Apache Arrow is the creation and maintenance of
> software related
> > >>> to columnar in-memory processing and data interchange
> > >>>
> > >>> ## Issues:
> > >>>
> > >>> There are no issues requiring board attention at this time.
> > >>>
> > >>> ## Membership Data:
> > >>> Apache Arrow was founded 2016-01-19 (4 years ago)
> > >>> There are currently 50 committers and 28 PMC members in this project.
> > >>> The Committer-to-PMC ratio is roughly 7:4.
> > >>>
> > >>> Community changes, past quarter:
> > >>> - No new PMC members. Last addition was Micah Kornfield on
> 2019-08-21.
> > >>> - Eric Erhardt was added as committer on 2019-10-18
> > >>> - Joris Van den Bossche was added as committer on 2019-12-06
> > >>>
> > >>> ## Project Activity:
> > >>>
> > >>> NEED COMMUNITY INPUT
> > >>>
> > >>> Recent releases:
> > >>> 0.15.0 was released on 2019-10-05.
> > >>> 0.14.1 was released on 2019-07-21.
> > >>> 0.14.0 was released on 2019-07-04.
> > >>>
> > >>> ## Community Health:
> > >>>
> > >>> NEED COMMUNITY INPUT
> > >>>
>


Re: [DRAFT] Apache Arrow Board Report January 2020

2020-01-08 Thread Wes McKinney
Here is an updated draft. If there is no more feedback, this can be
submitted to the board

## Description:

The mission of Apache Arrow is the creation and maintenance of software related
to columnar in-memory processing and data interchange

## Issues:

There are no issues requiring board attention at this time.

## Membership Data:
Apache Arrow was founded 2016-01-19 (4 years ago)
There are currently 50 committers and 28 PMC members in this project.
The Committer-to-PMC ratio is roughly 7:4.

Community changes, past quarter:
- No new PMC members. Last addition was Micah Kornfield on 2019-08-21.
- Eric Erhardt was added as committer on 2019-10-18
- Joris Van den Bossche was added as committer on 2019-12-06

## Project Activity:

* We have completed our initial migration away from Travis CI for
  continuous integration and patch validation to use the new
  GitHub Actions (GHA) service. We are much happier with the
  compute resource allocation provided by GitHub but longer term
  we are concerned that the generous free allocation may not
  continue and would be interested to know what kinds of
  guarantees (if any) GitHub may make to the ASF regarding GHA.
* We are not out of the woods on CI/CD as there are features of Apache Arrow
  that we cannot test in GitHub Actions. We are still considering options for
  running these optional test workloads as well as other kinds of periodic
  workloads like benchmarking
* We hope to make a 1.0.0 release of the project in early 2020. We had thought
  that our next major release after 0.15.0 would be 1.0.0 but we have not yet
  completed some necessary work items that the community has agreed are
  essential to graduate to 1.0.0

Recent releases:
0.15.0 was released on 2019-10-05.
0.14.1 was released on 2019-07-21.
0.14.0 was released on 2019-07-04.

## Community Health:

The developer community is healthy and continues to grow.THe co

On Mon, Jan 6, 2020 at 11:16 AM Antoine Pitrou  wrote:
>
>
> Perhaps also mention that we're dependent on enough capacity on GitHub
> Actions currently.  I'm not sure how long their generosity will last :-)
>
>
> Le 06/01/2020 à 18:14, Wes McKinney a écrit :
> > There is still the question of how to manage CI tasks (e.g.
> > GPU-enabled, ARM-enabled) that are unable to be run in GitHub Actions.
> > We should probably mention that we've migrated off Travis CI, though.
> >
> > On Mon, Jan 6, 2020 at 11:07 AM Antoine Pitrou  wrote:
> >>
> >>
> >> Do we consider the CI issue solved?
> >>
> >>
> >> Le 06/01/2020 à 18:02, Wes McKinney a écrit :
> >>> Hi folks -- our quarterly ASF board report is due in 2 days. What
> >>> items would we like to add in the below sections?
> >>>
> >>> ## Description:
> >>>
> >>> The mission of Apache Arrow is the creation and maintenance of software 
> >>> related
> >>> to columnar in-memory processing and data interchange
> >>>
> >>> ## Issues:
> >>>
> >>> There are no issues requiring board attention at this time.
> >>>
> >>> ## Membership Data:
> >>> Apache Arrow was founded 2016-01-19 (4 years ago)
> >>> There are currently 50 committers and 28 PMC members in this project.
> >>> The Committer-to-PMC ratio is roughly 7:4.
> >>>
> >>> Community changes, past quarter:
> >>> - No new PMC members. Last addition was Micah Kornfield on 2019-08-21.
> >>> - Eric Erhardt was added as committer on 2019-10-18
> >>> - Joris Van den Bossche was added as committer on 2019-12-06
> >>>
> >>> ## Project Activity:
> >>>
> >>> NEED COMMUNITY INPUT
> >>>
> >>> Recent releases:
> >>> 0.15.0 was released on 2019-10-05.
> >>> 0.14.1 was released on 2019-07-21.
> >>> 0.14.0 was released on 2019-07-04.
> >>>
> >>> ## Community Health:
> >>>
> >>> NEED COMMUNITY INPUT
> >>>


[jira] [Created] (ARROW-7517) [C++] Builder does not honour dictionary type provided during initialization

2020-01-08 Thread Wamsi Viswanath (Jira)
Wamsi Viswanath created ARROW-7517:
--

 Summary: [C++] Builder does not honour dictionary type provided 
during initialization
 Key: ARROW-7517
 URL: https://issues.apache.org/jira/browse/ARROW-7517
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.15.0
Reporter: Wamsi Viswanath


Below is an example for reproducing the issue:

[https://gist.github.com/wamsiv/d48ec37a9a9b5f4d484de6ff86a3870d]

The builder automatically optimizes the dictionary type depending on the number 
of unique values provided, which results in a schema mismatch.
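The adaptive behavior can be illustrated with a language-neutral sketch. Note that the function name and thresholds below are illustrative only, not Arrow's actual C++ implementation: the point is that a builder which silently narrows the index type based on the observed number of unique values will produce an array whose type no longer matches the schema the caller declared.

```python
def smallest_index_type(num_unique):
    """Pick the narrowest signed integer index type that can hold
    num_unique dictionary entries (a sketch of the adaptive
    behavior described in the issue; thresholds are illustrative)."""
    if num_unique <= 127:
        return "int8"
    if num_unique <= 32767:
        return "int16"
    return "int32"

requested_index_type = "int32"          # what the caller's schema declares
values = ["a", "b", "a", "c", "b"]      # only 3 unique values
inferred = smallest_index_type(len(set(values)))

# The sketch "builder" emits int8 indices, so the resulting array's
# dictionary type no longer matches the declared schema.
print(inferred)                          # int8
print(inferred == requested_index_type)  # False
```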

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7516) [C#] .NET Benchmarks are broken

2020-01-08 Thread Eric Erhardt (Jira)
Eric Erhardt created ARROW-7516:
---

 Summary: [C#] .NET Benchmarks are broken
 Key: ARROW-7516
 URL: https://issues.apache.org/jira/browse/ARROW-7516
 Project: Apache Arrow
  Issue Type: Bug
  Components: C#
Reporter: Eric Erhardt


See [https://github.com/apache/arrow/pull/6030#issuecomment-571877721]

 

It looks like the issue is that in the Benchmarks, `Length` is specified as 
`1_000_000`, and there have only been ~730,000 days since `DateTime.Min`, so 
this line fails:

https://github.com/apache/arrow/blob/4634c89fc77f70fb5b5d035d6172263a4604da82/csharp/test/Apache.Arrow.Tests/TestData.cs#L130

A simple fix would be to cap what we pass into `AddDays` at some number like 
`100_000`.
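The proposed capping fix can be sketched with Python's `datetime`, whose representable range (years 1 through 9999) matches .NET's `DateTime`. The helper name `safe_add_days` is hypothetical; the cap value mirrors the `100_000` suggested above.

```python
from datetime import datetime, timedelta

def safe_add_days(base, days, cap=100_000):
    """Clamp the day offset before adding it, so large benchmark
    lengths cannot push the date outside the representable range."""
    return base + timedelta(days=min(days, cap))

# With Length = 1_000_000, a capped offset stays well inside the
# valid date range instead of risking an out-of-range result.
d = safe_add_days(datetime.min, 1_000_000)
print(d.year)   # 274 (datetime.min is year 1; 100_000 days is ~273 years)
```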





Re: Arrow sync call January 8 at 12:00 US/Eastern, 17:00 UTC

2020-01-08 Thread Neal Richardson
Attendees:
* Ben Kietzman
* Wes McKinney
* Prudhvi Porandla
* Neal Richardson
* François Saint-Jacques

Discussion:
* Blockers for 1.0 release, how to get them done, what is required
* 0.16 backlog triage

On Tue, Jan 7, 2020 at 9:01 AM Neal Richardson 
wrote:

> Hi all,
> Happy 2020! Reminder that our biweekly call is in 24 hours at
> https://meet.google.com/vtm-teks-phx. All are welcome to join. Notes will
> be sent out to the mailing list afterwards.
>
> Neal
>


[jira] [Created] (ARROW-7515) [C++] Rename nonexistent and non_existent to not_found

2020-01-08 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-7515:
---

 Summary: [C++] Rename nonexistent and non_existent to not_found
 Key: ARROW-7515
 URL: https://issues.apache.org/jira/browse/ARROW-7515
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Kenta Murata
Assignee: Kenta Murata





