Re: Human-readable version of Arrow Schema?
Hello,

pg2arrow [*1] has a '--dump' mode to print out the schema definition of a given Apache Arrow file. Does it make sense for you?

$ ./pg2arrow --dump ~/hoge.arrow
[Footer]
{Footer: version=V4,
 schema={Schema: endianness=little, fields=[
   {Field: name="id", nullable=true, type={Int32}, children=[], custom_metadata=[]},
   {Field: name="a", nullable=true, type={Float64}, children=[], custom_metadata=[]},
   {Field: name="b", nullable=true, type={Decimal: precision=11, scale=7}, children=[], custom_metadata=[]},
   {Field: name="c", nullable=true, type={Struct}, children=[
     {Field: name="x", nullable=true, type={Int32}, children=[], custom_metadata=[]},
     {Field: name="y", nullable=true, type={Float32}, children=[], custom_metadata=[]},
     {Field: name="z", nullable=true, type={Utf8}, children=[], custom_metadata=[]}],
    custom_metadata=[]},
   {Field: name="d", nullable=true, type={Utf8}, dictionary={DictionaryEncoding: id=0, indexType={Int32}, isOrdered=false}, children=[], custom_metadata=[]},
   {Field: name="e", nullable=true, type={Timestamp: unit=us}, children=[], custom_metadata=[]},
   {Field: name="f", nullable=true, type={Utf8}, children=[], custom_metadata=[]},
   {Field: name="random", nullable=true, type={Float64}, children=[], custom_metadata=[]}],
  custom_metadata=[{KeyValue: key="sql_command" value="SELECT *,random() FROM t"}]},
 dictionaries=[{Block: offset=920, metaDataLength=184 bodyLength=128}],
 recordBatches=[{Block: offset=1232, metaDataLength=648 bodyLength=386112}]}
[Dictionary Batch 0]
{Block: offset=920, metaDataLength=184 bodyLength=128}
{Message: version=V4, body={DictionaryBatch: id=0, data={RecordBatch: length=6,
 nodes=[{FieldNode: length=6, null_count=0}],
 buffers=[{Buffer: offset=0, length=0}, {Buffer: offset=0, length=64},
   {Buffer: offset=64, length=64}]}, isDelta=false}, bodyLength=128}
[Record Batch 0]
{Block: offset=1232, metaDataLength=648 bodyLength=386112}
{Message: version=V4, body={RecordBatch: length=3000, nodes=[
   {FieldNode: length=3000, null_count=0}, {FieldNode: length=3000, null_count=60},
   {FieldNode: length=3000, null_count=62}, {FieldNode: length=3000, null_count=0},
   {FieldNode: length=3000, null_count=56}, {FieldNode: length=3000, null_count=66},
   {FieldNode: length=3000, null_count=0}, {FieldNode: length=3000, null_count=0},
   {FieldNode: length=3000, null_count=64}, {FieldNode: length=3000, null_count=0},
   {FieldNode: length=3000, null_count=0}],
 buffers=[
   {Buffer: offset=0, length=0}, {Buffer: offset=0, length=12032},
   {Buffer: offset=12032, length=384}, {Buffer: offset=12416, length=24000},
   {Buffer: offset=36416, length=384}, {Buffer: offset=36800, length=48000},
   {Buffer: offset=84800, length=0}, {Buffer: offset=84800, length=384},
   {Buffer: offset=85184, length=12032}, {Buffer: offset=97216, length=384},
   {Buffer: offset=97600, length=12032}, {Buffer: offset=109632, length=0},
   {Buffer: offset=109632, length=12032}, {Buffer: offset=121664, length=96000},
   {Buffer: offset=217664, length=0}, {Buffer: offset=217664, length=12032},
   {Buffer: offset=229696, length=384}, {Buffer: offset=230080, length=24000},
   {Buffer: offset=254080, length=0}, {Buffer: offset=254080, length=12032},
   {Buffer: offset=266112, length=96000}, {Buffer: offset=362112, length=0},
   {Buffer: offset=362112, length=24000}]}, bodyLength=386112}

[*1] https://heterodb.github.io/pg-strom/arrow_fdw/#using-pg2arrow

On Sat, Dec 7, 2019 at 6:26, Christian Hudon wrote:
>
> Hi,
>
> For the uses I would like to make of Arrow, I would need a human-readable
> and -writable version of an Arrow Schema, that could be converted to and
> from the Arrow Schema C++ object. Going through the doc for 0.15.1, I don't
> see anything to that effect, with the closest being the ToString() method
> on DataType instances, but which is meant for debugging only. (I need an
> expression of an Arrow Schema that people can read, and that can live
> outside of the code for a particular operation.)
>
> Is a text representation of an Arrow Schema something that is being worked
> on now? If not, would you folks be interested in me putting up an initial
> proposal for discussion? Any design constraints I should pay attention to,
> then?
>
> Thanks,
>
> Christian
> --
> Christian Hudon
> Applied Research Scientist
> Element AI, 6650 Saint-Urbain #500
> Montréal, QC, H2S 3G9, Canada
> Elementai.com

--
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei
[jira] [Created] (ARROW-7525) [Python][CI] Build PyArrow on VS2019
Krisztian Szucs created ARROW-7525:
--
Summary: [Python][CI] Build PyArrow on VS2019
Key: ARROW-7525
URL: https://issues.apache.org/jira/browse/ARROW-7525
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Continuous Integration
Reporter: Krisztian Szucs

Enable ARROW_PARQUET cmake flag. Additional patching might be required, see
https://github.com/microsoft/vcpkg/pull/8263/files

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-7524) [C++][CI] Build parquet support in the VS2019 GitHub Actions job
Krisztian Szucs created ARROW-7524:
--
Summary: [C++][CI] Build parquet support in the VS2019 GitHub Actions job
Key: ARROW-7524
URL: https://issues.apache.org/jira/browse/ARROW-7524
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Continuous Integration
Reporter: Krisztian Szucs

Enable ARROW_PARQUET cmake flag. Additional patching might be required, see
https://github.com/microsoft/vcpkg/pull/8263/files

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
Re: Human-readable version of Arrow Schema?
The C-interface representation is probably slightly less readable than the
JSON implementation, if I understand the flatbuffer-to-JSON conversion
properly. But as Antoine pointed out, it depends on the use case. FWIW,
flatbuffers maintainers indicated forward/backward compatibility is intended
to be preserved in the JSON encoding as well.

On Sat, Jan 4, 2020 at 2:16 PM Jacques Nadeau wrote:

> What do people think about using the C interface representation?
>
> On Sun, Dec 29, 2019 at 12:42 PM Micah Kornfield wrote:
>
> > I opened https://github.com/google/flatbuffers/issues/5688 to try to get
> > some clarity.
> >
> > On Tue, Dec 24, 2019 at 12:13 PM Wes McKinney wrote:
> >
> > > On Tue, Dec 24, 2019 at 2:47 AM Micah Kornfield wrote:
> > >
> > > > > If we were to make the same kinds of forward/backward compatibility
> > > > > guarantees as with Flatbuffers it could create a lot of work for
> > > > > maintainers.
> > > >
> > > > Does it pay to follow up with the flatbuffer project to understand if
> > > > the forward/backward compatibility guarantees the flatbuffers provide
> > > > extend to their JSON format?
> > >
> > > I spent a few minutes looking at the Flatbuffers codebase and
> > > documentation and did not find anything, so this seems like useful
> > > information to have regardless.
> > >
> > > > On Sun, Dec 15, 2019 at 11:17 AM Wes McKinney wrote:
> > > >
> > > > > I'd be open to looking at a proposal for a human-readable text
> > > > > representation, but I'm definitely wary about making any kind of
> > > > > cross-version compatibility guarantees (beyond "we will do our
> > > > > best"). If we were to make the same kinds of forward/backward
> > > > > compatibility guarantees as with Flatbuffers it could create a
> > > > > lot of work for maintainers.
> > > > >
> > > > > On Thu, Dec 12, 2019 at 12:43 AM Micah Kornfield wrote:
> > > > >
> > > > > > > With these two together, it would seem not too difficult to
> > > > > > > create a text representation for Arrow schemas that (at some
> > > > > > > point) has some compatibility guarantees, but maybe I'm
> > > > > > > missing something?
> > > > > >
> > > > > > I think the main risk is if somehow flatbuffers JSON parsing
> > > > > > doesn't handle backward compatible changes to the arrow schema
> > > > > > message. Given the way the documentation is describing the JSON
> > > > > > functionality I think this would be considered a bug.
> > > > > >
> > > > > > The one downside to calling the "schema" canonical is the
> > > > > > flatbuffers JSON functionality only appears to be available in
> > > > > > C++ and Java via JNI, so it wouldn't have cross language
> > > > > > support. I think this issue is more one of semantics though
> > > > > > (i.e. does the JSON description become part of the "Arrow spec"
> > > > > > or does it live as a C++/Python only feature).
> > > > > >
> > > > > > -Micah
> > > > > >
> > > > > > On Tue, Dec 10, 2019 at 10:51 AM Christian Hudon wrote:
> > > > > >
> > > > > > > Micah: I didn't know that Flatbuffers supported serialization
> > > > > > > to/from JSON, thanks. That seems like a very good start, at
> > > > > > > least. I'll aim to create a draft pull request that at least
> > > > > > > wires everything up in Arrow so we can load/save a Schema.fbs
> > > > > > > instance from/to JSON. At least it'll make it easier for me
> > > > > > > to see how Arrow schemas would look in JSON with that.
> > > > > > >
> > > > > > > Otherwise, I'm still gathering requirements internally here.
> > > > > > > For example, one thing that would be nice would be to be able
> > > > > > > to output a JSON Schema from at least a subset of the Arrow
> > > > > > > schema. (That way our users could start by passing around
> > > > > > > JSON with a given schema, and transition pieces of a workflow
> > > > > > > to Arrow as they're ready.) But that part can also be done
> > > > > > > outside of the Arrow code, if deemed not relevant to have in
> > > > > > > the Arrow codebase itself.
> > > > > > >
> > > > > > > One core requirement for us, however, would be eventual
> > > > > > > compatibility between Arrow versions for a given text
> > > > > > > representation of a schema. Meaning, if you have a text
> > > > > > > description of a given Arrow schema, you can load it into
> > > > > > > different versions of Arrow and it creates a valid Schema
> > > > > > > Flatbuffer description, that Arrow can use. Wes, were you
> > > > > > > thinking of that, or of something else, when you wrote "only
> > > > > > > makes sense if it is offered without any backward/forward
> > > > > > > compatibility guarantees"?
> > > > > > >
> > > > > > > For now, for me, assuming the JSON serialization done by the
> > > > > > > Flatbuffer libraries is usable, it seems we have all the
> > > > > > > pieces to make this happen:
> > > > > > > 1) The binary Schema.fbs data structures has to be compatible
> > > > > > > between
Re: [C++] "nonexistent" or "non-existent"
I also think we have "KeyError" which I think might model the same concept? On Mon, Jan 6, 2020 at 7:35 AM Wes McKinney wrote: > I agree using a different terminology than "nonexistent" like > "NotFound" would be good. If we use "nonexistent" then the hyphen-free > spelling seems preferred > > On Sun, Dec 29, 2019 at 2:23 PM Micah Kornfield > wrote: > > > > I'm not sure if all of the examples refer to the same thing, but "Not > > Found" (from http 404 error) is the most common way of expressing at > least > > the first concept I think. > > > > On Sat, Dec 28, 2019 at 11:45 AM Neal Richardson < > > neal.p.richard...@gmail.com> wrote: > > > > > IMO while "nonexistent" is the right word, neither are particularly > > > readable or obvious in code. Is there a better word/phrase? > > > > > > On Fri, Dec 27, 2019 at 5:34 PM Sutou Kouhei > wrote: > > > > > > > Hi, > > > > > > > > I found that we use both "nonexistent" and "non-existent" in > > > > our C++ code base. I think that we should use one of them > > > > instead of mixing them. > > > > > > > > "nonexistent": > > > > > > > > * Public API: > > > > * cpp/src/plasma/: > plasma::PlasmaErrorCode::PlasmaObjectNonexistent > > > > > > > > "non-existent": > > > > > > > > * Public API: > > > > * cpp/src/arrow/filesystem/: arrow::fs::FileType::NonExistent > > > > * Internal: > > > > * cpp/src/arrow/util/io_util.h: allow_non_existent > > > > * Test: > > > > * cpp/src/gandiva/tests/: non_existent_function > > > > > > > > > > > > Which should we use? > > > > (Personally, I prefer "nonexistent" to "non-existent".) > > > > > > > > > > > > Thanks, > > > > -- > > > > kou > > > > > > > >
[jira] [Created] (ARROW-7522) Broken Record Batch returned from a function call
Chengxin Ma created ARROW-7522:
--
Summary: Broken Record Batch returned from a function call
Key: ARROW-7522
URL: https://issues.apache.org/jira/browse/ARROW-7522
Project: Apache Arrow
Issue Type: Bug
Components: C++, C++ - Plasma
Affects Versions: 0.15.1
Environment: macOS
Reporter: Chengxin Ma

Scenario: retrieving a Record Batch from Plasma with a known Object ID.

The following code snippet works well:

{code:java}
int main(int argc, char **argv) {
  plasma::ObjectID object_id =
      plasma::ObjectID::from_binary("0FF1CE00C0FFEE00BEEF");

  // Start up and connect a Plasma client.
  plasma::PlasmaClient client;
  ARROW_CHECK_OK(client.Connect("/tmp/store"));

  plasma::ObjectBuffer object_buffer;
  ARROW_CHECK_OK(client.Get(&object_id, 1, -1, &object_buffer));

  // Retrieve object data.
  auto buffer = object_buffer.data;
  arrow::io::BufferReader buffer_reader(buffer);
  std::shared_ptr<arrow::ipc::RecordBatchReader> record_batch_stream_reader;
  ARROW_CHECK_OK(arrow::ipc::RecordBatchStreamReader::Open(
      &buffer_reader, &record_batch_stream_reader));

  std::shared_ptr<arrow::RecordBatch> record_batch;
  arrow::Status status = record_batch_stream_reader->ReadNext(&record_batch);

  std::cout << "record_batch->column_name(0): "
            << record_batch->column_name(0) << std::endl;
  std::cout << "record_batch->num_columns(): "
            << record_batch->num_columns() << std::endl;
  std::cout << "record_batch->num_rows(): "
            << record_batch->num_rows() << std::endl;
  std::cout << "record_batch->column(0)->length(): "
            << record_batch->column(0)->length() << std::endl;
  std::cout << "record_batch->column(0)->ToString(): "
            << record_batch->column(0)->ToString() << std::endl;
}
{code}

{{record_batch->column(0)->ToString()}} would incur a segmentation fault if retrieving the Record Batch is wrapped in a function:

{code:java}
std::shared_ptr<arrow::RecordBatch> GetRecordBatchFromPlasma(
    plasma::ObjectID object_id) {
  // Start up and connect a Plasma client.
  plasma::PlasmaClient client;
  ARROW_CHECK_OK(client.Connect("/tmp/store"));

  plasma::ObjectBuffer object_buffer;
  ARROW_CHECK_OK(client.Get(&object_id, 1, -1, &object_buffer));

  // Retrieve object data.
  auto buffer = object_buffer.data;
  arrow::io::BufferReader buffer_reader(buffer);
  std::shared_ptr<arrow::ipc::RecordBatchReader> record_batch_stream_reader;
  ARROW_CHECK_OK(arrow::ipc::RecordBatchStreamReader::Open(
      &buffer_reader, &record_batch_stream_reader));

  std::shared_ptr<arrow::RecordBatch> record_batch;
  arrow::Status status = record_batch_stream_reader->ReadNext(&record_batch);

  // Disconnect the client.
  ARROW_CHECK_OK(client.Disconnect());
  return record_batch;
}

int main(int argc, char **argv) {
  plasma::ObjectID object_id =
      plasma::ObjectID::from_binary("0FF1CE00C0FFEE00BEEF");
  std::shared_ptr<arrow::RecordBatch> record_batch =
      GetRecordBatchFromPlasma(object_id);

  std::cout << "record_batch->column_name(0): "
            << record_batch->column_name(0) << std::endl;
  std::cout << "record_batch->num_columns(): "
            << record_batch->num_columns() << std::endl;
  std::cout << "record_batch->num_rows(): "
            << record_batch->num_rows() << std::endl;
  std::cout << "record_batch->column(0)->length(): "
            << record_batch->column(0)->length() << std::endl;
  std::cout << "record_batch->column(0)->ToString(): "
            << record_batch->column(0)->ToString() << std::endl;
}
{code}

The meta info of the Record Batch, such as the number of columns and rows, is still available, but I can't see the content of the columns. {{lldb}} says that the stop reason is {{EXC_BAD_ACCESS}}, so I think the Record Batch is destroyed after {{GetRecordBatchFromPlasma}} finishes. But why can I still see the meta info of this Record Batch? What is the proper way to get the Record Batch if we insist on using a function?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-7521) [Rust] Remove tuple on FixedSizeList datatype
Neville Dipale created ARROW-7521:
-
Summary: [Rust] Remove tuple on FixedSizeList datatype
Key: ARROW-7521
URL: https://issues.apache.org/jira/browse/ARROW-7521
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Reporter: Neville Dipale

The FixedSizeList datatype takes a tuple of Box<DataType> and length, but this could be simplified to take the two values without a tuple.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-7520) Arrow / R - too many batches causes a crash
Christian created ARROW-7520:
--
Summary: Arrow / R - too many batches causes a crash
Key: ARROW-7520
URL: https://issues.apache.org/jira/browse/ARROW-7520
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 0.15.1
Environment:
- Session info ---
 setting  value
 version  R version 3.6.1 (2019-07-05)
 os       Windows 10 x64
 system   x86_64, mingw32
 ui       RStudio
 language (EN)
 collate  English_United States.1252
 ctype    English_United States.1252
 tz       America/New_York
 date     2020-01-08
- Packages ---
 ! package     * version     date       lib source
   acepack       1.4.1       2016-10-29 [1] CRAN (R 3.6.1)
   arrow       * 0.15.1.1    2019-11-05 [1] CRAN (R 3.6.2)
   askpass       1.1         2019-01-13 [1] CRAN (R 3.6.1)
   assertthat    0.2.1       2019-03-21 [1] CRAN (R 3.6.1)
   backports     1.1.5       2019-10-02 [1] CRAN (R 3.6.1)
   base64enc     0.1-3       2015-07-28 [1] CRAN (R 3.6.0)
   bit           1.1-14      2018-05-29 [1] CRAN (R 3.6.0)
   bit64         0.9-7       2017-05-08 [1] CRAN (R 3.6.0)
   blob          1.2.0       2019-07-09 [1] CRAN (R 3.6.1)
   callr         3.3.1       2019-07-18 [1] CRAN (R 3.6.1)
   cellranger    1.1.0       2016-07-27 [1] CRAN (R 3.6.1)
   checkmate     1.9.4       2019-07-04 [1] CRAN (R 3.6.1)
   cli           1.1.0       2019-03-19 [1] CRAN (R 3.6.1)
   cluster       2.1.0       2019-06-19 [2] CRAN (R 3.6.1)
   codetools     0.2-16      2018-12-24 [2] CRAN (R 3.6.1)
   colorspace    1.4-1       2019-03-18 [1] CRAN (R 3.6.1)
   commonmark    1.7         2018-12-01 [1] CRAN (R 3.6.1)
   crayon        1.3.4       2017-09-16 [1] CRAN (R 3.6.1)
   credentials   1.1         2019-03-12 [1] CRAN (R 3.6.2)
   curl        * 4.2         2019-09-24 [1] CRAN (R 3.6.1)
   data.table    1.12.2      2019-04-07 [1] CRAN (R 3.6.1)
   DBI         * 1.0.0       2018-05-02 [1] CRAN (R 3.6.1)
   desc          1.2.0       2018-05-01 [1] CRAN (R 3.6.1)
   devtools    * 2.2.0       2019-09-07 [1] CRAN (R 3.6.1)
   digest        0.6.23      2019-11-23 [1] CRAN (R 3.6.1)
   dplyr       * 0.8.3       2019-07-04 [1] CRAN (R 3.6.1)
   DT            0.9         2019-09-17 [1] CRAN (R 3.6.1)
   ellipsis      0.3.0       2019-09-20 [1] CRAN (R 3.6.1)
   evaluate      0.14        2019-05-28 [1] CRAN (R 3.6.1)
   foreign       0.8-71      2018-07-20 [2] CRAN (R 3.6.1)
   Formula     * 1.2-3       2018-05-03 [1] CRAN (R 3.6.0)
   fs            1.3.1       2019-05-06 [1] CRAN (R 3.6.1)
   fst         * 0.9.0       2019-04-09 [1] CRAN (R 3.6.1)
   future      * 1.15.0-9000 2019-11-19 [1] Github (HenrikBengtsson/future@bc241c7)
   ggplot2     * 3.2.1       2019-08-10 [1] CRAN (R 3.6.1)
   globals       0.12.4      2018-10-11 [1] CRAN (R 3.6.0)
   glue        * 1.3.1       2019-03-12 [1] CRAN (R 3.6.1)
   gridExtra     2.3         2017-09-09 [1] CRAN (R 3.6.1)
   gt          * 0.1.0       2019-11-27 [1] Github (rstudio/gt@284bbe5)
   gtable        0.3.0       2019-03-25 [1] CRAN (R 3.6.1)
   Hmisc       * 4.3-0       2019-11-07 [1] CRAN (R 3.6.1)
   htmlTable     1.13.2      2019-09-22 [1] CRAN (R 3.6.1)
 D htmltools     0.3.6.9004  2019-09-20 [1] Github (rstudio/htmltools@c49b29c)
   htmlwidgets   1.3         2018-09-30 [1] CRAN (R 3.6.1)
Re: Arrow / R - too many batches causes a crash
Can you please open a JIRA issue?

On Wed, Jan 8, 2020 at 12:37 PM Christian Klar wrote:

> Hi,
>
> At the bottom please find the session_info.
>
> When creating north of 200-300 batches, the writing to the arrow file
> crashes R – it doesn't even show an error message. RStudio just aborts.
>
> I have the feeling that maybe each batch becomes a stream and R has issues
> with the connections, but that's a total guess.
>
> Any help would be appreciated.
>
> ##
>
> Here is the function. When running it with 3000 it crashes immediately.
>
> Before that I ran it with 100, and then increased it slowly, and then it
> randomly crashed again.
>
> write_arrow_custom(data.frame(A=c(1:10),B=c(1:10)),'C:/Temp/test.arrow',3000)
>
> write_arrow_custom <- function(df,targetarrow,nrbatches) {
>   ct <- nrbatches
>   idxs <- c(0:ct)/ct*nrow(df)
>   idxs <- round(idxs,0) %>% as.integer()
>   idxs[length(idxs)] <- nrow(df)
>   df_nav <- idxs %>% as.data.frame() %>% rename(colfrom=1) %>%
>     mutate(colto=lead(colfrom)) %>% mutate(colfrom=colfrom+1) %>%
>     filter(!is.na(colto)) %>% mutate(R=row_number())
>   stopifnot(df_nav %>% mutate(chk=colto-colfrom+1) %>% '$'('chk') %>% sum()==nrow(df))
>   table_df <- Table$create(name=rownames(df[1,]),df[1,])
>   writer <- RecordBatchFileWriter$create(targetarrow,table_df$schema)
>   df_nav %>% dlply(c('R'),function(df_nav){
>     catl(glue('{df_nav$colfrom[1]}:{df_nav$colto[1]} / {df_nav$R[1]}...'))
>     tmp <- df[df_nav$colfrom[1]:df_nav$colto[1],]
>     writer$write_batch(record_batch(name = rownames(tmp), tmp))
>     NULL
>   }) -> batch_lst
>   writer$close()
>   rm(batch_lst)
>   gc()
> }
>
> ##
>
> [...]
[jira] [Created] (ARROW-7519) [Python] Build wheels, conda packages with PYARROW_WITH_DATASET=1
Wes McKinney created ARROW-7519:
---
Summary: [Python] Build wheels, conda packages with PYARROW_WITH_DATASET=1
Key: ARROW-7519
URL: https://issues.apache.org/jira/browse/ARROW-7519
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Wes McKinney
Fix For: 0.16.0

We should

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (ARROW-7518) [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages
Wes McKinney created ARROW-7518:
---
Summary: [Python] Use PYARROW_WITH_HDFS when building wheels, conda packages
Key: ARROW-7518
URL: https://issues.apache.org/jira/browse/ARROW-7518
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Wes McKinney
Fix For: 0.16.0

This new module is not enabled in the package builds

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
Arrow / R - too many batches causes a crash
Hi,

At the bottom please find the session_info.

When creating north of 200-300 batches, the writing to the arrow file crashes R – it doesn't even show an error message. RStudio just aborts.

I have the feeling that maybe each batch becomes a stream and R has issues with the connections, but that's a total guess.

Any help would be appreciated.

##

Here is the function. When running it with 3000 it crashes immediately.

Before that I ran it with 100, and then increased it slowly, and then it randomly crashed again.

write_arrow_custom(data.frame(A=c(1:10),B=c(1:10)),'C:/Temp/test.arrow',3000)

write_arrow_custom <- function(df,targetarrow,nrbatches) {
  ct <- nrbatches
  idxs <- c(0:ct)/ct*nrow(df)
  idxs <- round(idxs,0) %>% as.integer()
  idxs[length(idxs)] <- nrow(df)
  df_nav <- idxs %>% as.data.frame() %>% rename(colfrom=1) %>%
    mutate(colto=lead(colfrom)) %>% mutate(colfrom=colfrom+1) %>%
    filter(!is.na(colto)) %>% mutate(R=row_number())
  stopifnot(df_nav %>% mutate(chk=colto-colfrom+1) %>% '$'('chk') %>% sum()==nrow(df))
  table_df <- Table$create(name=rownames(df[1,]),df[1,])
  writer <- RecordBatchFileWriter$create(targetarrow,table_df$schema)
  df_nav %>% dlply(c('R'),function(df_nav){
    catl(glue('{df_nav$colfrom[1]}:{df_nav$colto[1]} / {df_nav$R[1]}...'))
    tmp <- df[df_nav$colfrom[1]:df_nav$colto[1],]
    writer$write_batch(record_batch(name = rownames(tmp), tmp))
    NULL
  }) -> batch_lst
  writer$close()
  rm(batch_lst)
  gc()
}

##

- Session info ---
 setting  value
 version  R version 3.6.1 (2019-07-05)
 os       Windows 10 x64
 system   x86_64, mingw32
 ui       RStudio
 language (EN)
 collate  English_United States.1252
 ctype    English_United States.1252
 tz       America/New_York
 date     2020-01-08

- Packages ---
 ! package     * version     date       lib source
   acepack       1.4.1       2016-10-29 [1] CRAN (R 3.6.1)
   arrow       * 0.15.1.1    2019-11-05 [1] CRAN (R 3.6.2)
   askpass       1.1         2019-01-13 [1] CRAN (R 3.6.1)
   assertthat    0.2.1       2019-03-21 [1] CRAN (R 3.6.1)
   backports     1.1.5       2019-10-02 [1] CRAN (R 3.6.1)
   base64enc     0.1-3       2015-07-28 [1] CRAN (R 3.6.0)
   bit           1.1-14      2018-05-29 [1] CRAN (R 3.6.0)
   bit64         0.9-7       2017-05-08 [1] CRAN (R 3.6.0)
   blob          1.2.0       2019-07-09 [1] CRAN (R 3.6.1)
   callr         3.3.1       2019-07-18 [1] CRAN (R 3.6.1)
   cellranger    1.1.0       2016-07-27 [1] CRAN (R 3.6.1)
   checkmate     1.9.4       2019-07-04 [1] CRAN (R 3.6.1)
   cli           1.1.0       2019-03-19 [1] CRAN (R 3.6.1)
   cluster       2.1.0       2019-06-19 [2] CRAN (R 3.6.1)
   codetools     0.2-16      2018-12-24 [2] CRAN (R 3.6.1)
   colorspace    1.4-1       2019-03-18 [1] CRAN (R 3.6.1)
   commonmark    1.7         2018-12-01 [1] CRAN (R 3.6.1)
   crayon        1.3.4       2017-09-16 [1] CRAN (R 3.6.1)
   credentials   1.1         2019-03-12 [1] CRAN (R 3.6.2)
   curl        * 4.2         2019-09-24 [1] CRAN (R 3.6.1)
   data.table    1.12.2      2019-04-07 [1] CRAN (R 3.6.1)
   DBI         * 1.0.0       2018-05-02 [1] CRAN (R 3.6.1)
   desc          1.2.0       2018-05-01 [1] CRAN (R 3.6.1)
   devtools    * 2.2.0       2019-09-07 [1] CRAN (R 3.6.1)
   digest        0.6.23      2019-11-23 [1] CRAN (R 3.6.1)
   dplyr       * 0.8.3       2019-07-04 [1] CRAN (R 3.6.1)
   DT            0.9         2019-09-17 [1] CRAN (R 3.6.1)
   ellipsis      0.3.0       2019-09-20 [1] CRAN (R 3.6.1)
   evaluate      0.14        2019-05-28 [1] CRAN (R 3.6.1)
   foreign       0.8-71      2018-07-20 [2] CRAN (R 3.6.1)
   Formula     * 1.2-3       2018-05-03 [1] CRAN (R 3.6.0)
   fs            1.3.1       2019-05-06 [1] CRAN (R 3.6.1)
   fst         * 0.9.0       2019-04-09 [1] CRAN (R 3.6.1)
   future      * 1.15.0-9000 2019-11-19 [1] Github (HenrikBengtsson/future@bc241c7)
   ggplot2     * 3.2.1       2019-08-10 [1] CRAN (R 3.6.1)
   globals       0.12.4      2018-10-11 [1] CRAN (R 3.6.0)
   glue        * 1.3.1       2019-03-12 [1] CRAN (R 3.6.1)
   gridExtra     2.3         2017-09-09 [1] CRAN (R 3.6.1)
   gt          * 0.1.0       2019-11-27 [1] Github (rstudio/gt@284bbe5)
   gtable        0.3.0       2019-03-25 [1] CRAN (R 3.6.1)
   Hmisc       * 4.3-0       2019-11-07 [1] CRAN (R 3.6.1)
   htmlTable     1.13.2      2019-09-22 [1] CRAN (R 3.6.1)
 D htmltools     0.3.6.9004  2019-09-20 [1] Github (rstudio/htmltools@c49b29c)
   htmlwidgets   1.3         2018-09-30 [1] CRAN (R 3.6.1)
   jsonlite    * 1.6         2018-12-07 [1] CRAN (R 3.6.1)
   knitr         1.25        2019-09-18 [1] CRAN
Re: [DRAFT] Apache Arrow Board Report January 2020
Not sure what happened there. The two words after "grow" can be removed.

## Description:

The mission of Apache Arrow is the creation and maintenance of software related to columnar in-memory processing and data interchange

## Issues:

There are no issues requiring board attention at this time.

## Membership Data:

Apache Arrow was founded 2016-01-19 (4 years ago)
There are currently 50 committers and 28 PMC members in this project.
The Committer-to-PMC ratio is roughly 7:4.

Community changes, past quarter:
- No new PMC members. Last addition was Micah Kornfield on 2019-08-21.
- Eric Erhardt was added as committer on 2019-10-18
- Joris Van den Bossche was added as committer on 2019-12-06

## Project Activity:

* We have completed our initial migration away from Travis CI for continuous
  integration and patch validation to use the new GitHub Actions (GHA)
  service. We are much happier with the compute resource allocation provided
  by GitHub but longer term we are concerned that the generous free
  allocation may not continue and would be interested to know what kinds of
  guarantees (if any) GitHub may make to the ASF regarding GHA.
* We are not out of the woods on CI/CD as there are features of Apache Arrow
  that we cannot test in GitHub Actions. We are still considering options for
  running these optional test workloads as well as other kinds of periodic
  workloads like benchmarking
* We hope to make a 1.0.0 release of the project in early 2020. We had
  thought that our next major release after 0.15.0 would be 1.0.0 but we
  have not yet completed some necessary work items that the community has
  agreed are essential to graduate to 1.0.0

Recent releases:
0.15.0 was released on 2019-10-05.
0.14.1 was released on 2019-07-21.
0.14.0 was released on 2019-07-04.

## Community Health:

The developer community is healthy and continues to grow.
On Wed, Jan 8, 2020 at 12:12 PM Todd Hendricks wrote: > > Hi Wes, > > Looks like there is a cutoff sentence at the end of the Community Health > section. > > On Wed, Jan 8, 2020 at 10:01 AM Wes McKinney wrote: > > > Here is an updated draft. If there is no more feedback, this can be > > submitted to the board > > > > ## Description: > > > > The mission of Apache Arrow is the creation and maintenance of software > > related > > to columnar in-memory processing and data interchange > > > > ## Issues: > > > > There are no issues requiring board attention at this time. > > > > ## Membership Data: > > Apache Arrow was founded 2016-01-19 (4 years ago) > > There are currently 50 committers and 28 PMC members in this project. > > The Committer-to-PMC ratio is roughly 7:4. > > > > Community changes, past quarter: > > - No new PMC members. Last addition was Micah Kornfield on 2019-08-21. > > - Eric Erhardt was added as committer on 2019-10-18 > > - Joris Van den Bossche was added as committer on 2019-12-06 > > > > ## Project Activity: > > > > * We have completed our initial migration away from Travis CI for > > continuous integration and patch validation to use the new > > GitHub Actions (GHA) service. We are much happier with the > > compute resource allocation provided by GitHub but longer term > > we are concerned that the generous free allocation may not > > continue and would be interested to know what kinds of > > guarantees (if any) GitHub may make to the ASF regarding GHA. > > * We are not out of the woods on CI/CD as there are features of Apache > > Arrow > > that we cannot test in GitHub Actions. We are still considering options > > for > > running these optional test workloads as well as other kinds of periodic > > workloads like benchmarking > > * We hope to make a 1.0.0 release of the project in early 2020. 
We had > > thought > > that our next major release after 0.15.0 would be 1.0.0 but we have not > > yet > > completed some necessary work items that the community has agreed are > > essential to graduate to 1.0.0 > > > > Recent releases: > > 0.15.0 was released on 2019-10-05. > > 0.14.1 was released on 2019-07-21. > > 0.14.0 was released on 2019-07-04. > > > > ## Community Health: > > > > The developer community is healthy and continues to grow.THe co > > > > On Mon, Jan 6, 2020 at 11:16 AM Antoine Pitrou wrote: > > > > > > > > > Perhaps also mention that we're dependent on enough capacity on GitHub > > > Actions currently. I'm not sure how long their generosity will last :-) > > > > > > > > > Le 06/01/2020 à 18:14, Wes McKinney a écrit : > > > > There is still the question of how to manage CI tasks (e.g. > > > > GPU-enabled, ARM-enabled) that are unable to be run in GitHub Actions. > > > > We should probably mention that we've migrated off Travis CI, though. > > > > > > > > On Mon, Jan 6, 2020 at 11:07 AM Antoine Pitrou > > wrote: > > > >> > > > >> > > > >> Do we consider the CI issue solved? > > > >> > > > >> > > > >> Le 06/01/2020 à 18:02, Wes McKinney a écrit : > > > >>> Hi folks -- our quarterly ASF board report is due
Re: [DRAFT] Apache Arrow Board Report January 2020
Hi Wes,

Looks like there is a cutoff sentence at the end of the Community Health
section.

On Wed, Jan 8, 2020 at 10:01 AM Wes McKinney wrote:
> Here is an updated draft. If there is no more feedback, this can be
> submitted to the board
>
> ## Description:
>
> The mission of Apache Arrow is the creation and maintenance of software
> related to columnar in-memory processing and data interchange
>
> ## Issues:
>
> There are no issues requiring board attention at this time.
>
> ## Membership Data:
> Apache Arrow was founded 2016-01-19 (4 years ago)
> There are currently 50 committers and 28 PMC members in this project.
> The Committer-to-PMC ratio is roughly 7:4.
>
> Community changes, past quarter:
> - No new PMC members. Last addition was Micah Kornfield on 2019-08-21.
> - Eric Erhardt was added as committer on 2019-10-18
> - Joris Van den Bossche was added as committer on 2019-12-06
>
> ## Project Activity:
>
> * We have completed our initial migration away from Travis CI for
>   continuous integration and patch validation to use the new
>   GitHub Actions (GHA) service. We are much happier with the
>   compute resource allocation provided by GitHub but longer term
>   we are concerned that the generous free allocation may not
>   continue and would be interested to know what kinds of
>   guarantees (if any) GitHub may make to the ASF regarding GHA.
> * We are not out of the woods on CI/CD as there are features of Apache Arrow
>   that we cannot test in GitHub Actions. We are still considering options for
>   running these optional test workloads as well as other kinds of periodic
>   workloads like benchmarking
> * We hope to make a 1.0.0 release of the project in early 2020. We had
>   thought that our next major release after 0.15.0 would be 1.0.0 but we
>   have not yet completed some necessary work items that the community has
>   agreed are essential to graduate to 1.0.0
>
> Recent releases:
> 0.15.0 was released on 2019-10-05.
> 0.14.1 was released on 2019-07-21.
> 0.14.0 was released on 2019-07-04.
>
> ## Community Health:
>
> The developer community is healthy and continues to grow.THe co
>
> On Mon, Jan 6, 2020 at 11:16 AM Antoine Pitrou wrote:
> >
> > Perhaps also mention that we're dependent on enough capacity on GitHub
> > Actions currently. I'm not sure how long their generosity will last :-)
> >
> > On 06/01/2020 18:14, Wes McKinney wrote:
> > > There is still the question of how to manage CI tasks (e.g.
> > > GPU-enabled, ARM-enabled) that are unable to be run in GitHub Actions.
> > > We should probably mention that we've migrated off Travis CI, though.
> > >
> > > On Mon, Jan 6, 2020 at 11:07 AM Antoine Pitrou wrote:
> > >>
> > >> Do we consider the CI issue solved?
> > >>
> > >> On 06/01/2020 18:02, Wes McKinney wrote:
> > >>> Hi folks -- our quarterly ASF board report is due in 2 days. What
> > >>> items would we like to add in the below sections?
> > >>>
> > >>> ## Description:
> > >>>
> > >>> The mission of Apache Arrow is the creation and maintenance of
> > >>> software related to columnar in-memory processing and data interchange
> > >>>
> > >>> ## Issues:
> > >>>
> > >>> There are no issues requiring board attention at this time.
> > >>>
> > >>> ## Membership Data:
> > >>> Apache Arrow was founded 2016-01-19 (4 years ago)
> > >>> There are currently 50 committers and 28 PMC members in this project.
> > >>> The Committer-to-PMC ratio is roughly 7:4.
> > >>>
> > >>> Community changes, past quarter:
> > >>> - No new PMC members. Last addition was Micah Kornfield on 2019-08-21.
> > >>> - Eric Erhardt was added as committer on 2019-10-18
> > >>> - Joris Van den Bossche was added as committer on 2019-12-06
> > >>>
> > >>> ## Project Activity:
> > >>>
> > >>> NEED COMMUNITY INPUT
> > >>>
> > >>> Recent releases:
> > >>> 0.15.0 was released on 2019-10-05.
> > >>> 0.14.1 was released on 2019-07-21.
> > >>> 0.14.0 was released on 2019-07-04.
> > >>>
> > >>> ## Community Health:
> > >>>
> > >>> NEED COMMUNITY INPUT
Re: [DRAFT] Apache Arrow Board Report January 2020
Here is an updated draft. If there is no more feedback, this can be
submitted to the board

## Description:

The mission of Apache Arrow is the creation and maintenance of software
related to columnar in-memory processing and data interchange

## Issues:

There are no issues requiring board attention at this time.

## Membership Data:
Apache Arrow was founded 2016-01-19 (4 years ago)
There are currently 50 committers and 28 PMC members in this project.
The Committer-to-PMC ratio is roughly 7:4.

Community changes, past quarter:
- No new PMC members. Last addition was Micah Kornfield on 2019-08-21.
- Eric Erhardt was added as committer on 2019-10-18
- Joris Van den Bossche was added as committer on 2019-12-06

## Project Activity:

* We have completed our initial migration away from Travis CI for
  continuous integration and patch validation to use the new
  GitHub Actions (GHA) service. We are much happier with the
  compute resource allocation provided by GitHub but longer term
  we are concerned that the generous free allocation may not
  continue and would be interested to know what kinds of
  guarantees (if any) GitHub may make to the ASF regarding GHA.
* We are not out of the woods on CI/CD as there are features of Apache Arrow
  that we cannot test in GitHub Actions. We are still considering options for
  running these optional test workloads as well as other kinds of periodic
  workloads like benchmarking
* We hope to make a 1.0.0 release of the project in early 2020. We had
  thought that our next major release after 0.15.0 would be 1.0.0 but we
  have not yet completed some necessary work items that the community has
  agreed are essential to graduate to 1.0.0

Recent releases:
0.15.0 was released on 2019-10-05.
0.14.1 was released on 2019-07-21.
0.14.0 was released on 2019-07-04.
## Community Health:

The developer community is healthy and continues to grow.THe co

On Mon, Jan 6, 2020 at 11:16 AM Antoine Pitrou wrote:
>
> Perhaps also mention that we're dependent on enough capacity on GitHub
> Actions currently. I'm not sure how long their generosity will last :-)
>
> On 06/01/2020 18:14, Wes McKinney wrote:
> > There is still the question of how to manage CI tasks (e.g.
> > GPU-enabled, ARM-enabled) that are unable to be run in GitHub Actions.
> > We should probably mention that we've migrated off Travis CI, though.
> >
> > On Mon, Jan 6, 2020 at 11:07 AM Antoine Pitrou wrote:
> >>
> >> Do we consider the CI issue solved?
> >>
> >> On 06/01/2020 18:02, Wes McKinney wrote:
> >>> Hi folks -- our quarterly ASF board report is due in 2 days. What
> >>> items would we like to add in the below sections?
> >>>
> >>> ## Description:
> >>>
> >>> The mission of Apache Arrow is the creation and maintenance of software
> >>> related to columnar in-memory processing and data interchange
> >>>
> >>> ## Issues:
> >>>
> >>> There are no issues requiring board attention at this time.
> >>>
> >>> ## Membership Data:
> >>> Apache Arrow was founded 2016-01-19 (4 years ago)
> >>> There are currently 50 committers and 28 PMC members in this project.
> >>> The Committer-to-PMC ratio is roughly 7:4.
> >>>
> >>> Community changes, past quarter:
> >>> - No new PMC members. Last addition was Micah Kornfield on 2019-08-21.
> >>> - Eric Erhardt was added as committer on 2019-10-18
> >>> - Joris Van den Bossche was added as committer on 2019-12-06
> >>>
> >>> ## Project Activity:
> >>>
> >>> NEED COMMUNITY INPUT
> >>>
> >>> Recent releases:
> >>> 0.15.0 was released on 2019-10-05.
> >>> 0.14.1 was released on 2019-07-21.
> >>> 0.14.0 was released on 2019-07-04.
> >>>
> >>> ## Community Health:
> >>>
> >>> NEED COMMUNITY INPUT
[jira] [Created] (ARROW-7517) [C++] Builder does not honour dictionary type provided during initialization
Wamsi Viswanath created ARROW-7517:
--------------------------------------

             Summary: [C++] Builder does not honour dictionary type provided during initialization
                 Key: ARROW-7517
                 URL: https://issues.apache.org/jira/browse/ARROW-7517
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 0.15.0
          Components: C++
            Reporter: Wamsi Viswanath

Below is an example for reproducing the issue:

https://gist.github.com/wamsiv/d48ec37a9a9b5f4d484de6ff86a3870d

The builder automatically optimizes the dictionary type depending on the
number of unique values provided, which results in a schema mismatch.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
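The mismatch described in ARROW-7517 can be sketched conceptually in plain Python (a hypothetical model for illustration, not Arrow's actual C++ builder code; `narrowest_index_type` is an invented stand-in for the builder's optimization): a builder that always picks the smallest index width able to address the observed dictionary will disagree with a schema that declared a wider index type up front.

```python
def narrowest_index_type(num_unique):
    # Hypothetical model of the optimization described in the report:
    # choose the smallest signed integer width whose positive range can
    # address every dictionary entry.
    for bits in (8, 16, 32, 64):
        if num_unique <= 2 ** (bits - 1):
            return f"int{bits}"
    raise ValueError("dictionary too large")

declared = "int32"                  # index type requested at initialization
inferred = narrowest_index_type(6)  # only 6 unique values were appended
print(inferred)                     # "int8"
print(declared == inferred)         # False -> schema mismatch on the index type
```

Under this model, the resulting field reports narrower indices than the schema the caller constructed, which is exactly the kind of mismatch a downstream schema-equality check would reject.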
[jira] [Created] (ARROW-7516) [C#] .NET Benchmarks are broken
Eric Erhardt created ARROW-7516:
-----------------------------------

             Summary: [C#] .NET Benchmarks are broken
                 Key: ARROW-7516
                 URL: https://issues.apache.org/jira/browse/ARROW-7516
             Project: Apache Arrow
          Issue Type: Bug
          Components: C#
            Reporter: Eric Erhardt

See https://github.com/apache/arrow/pull/6030#issuecomment-571877721

It looks like the issue is that in the Benchmarks, `Length` is specified as
`1_000_000`, and there have only been ~730,000 days since `DateTime.Min`, so
this line fails:

https://github.com/apache/arrow/blob/4634c89fc77f70fb5b5d035d6172263a4604da82/csharp/test/Apache.Arrow.Tests/TestData.cs#L130

A simple fix would be to cap what we pass into `AddDays` at some number like
`100_000` or so.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
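The day arithmetic behind ARROW-7516 can be checked with a short stdlib sketch (Python here for illustration; the failing benchmark code itself is C#, in the `TestData.cs` line linked above): .NET's `DateTime.MinValue` is 0001-01-01, and far fewer than `1_000_000` days have elapsed from it to the date of this thread, so a day offset drawn from a `1_000_000`-wide range can step outside the representable span and make `AddDays` throw.

```python
from datetime import date

# .NET's DateTime.MinValue corresponds to 0001-01-01.
dotnet_min = date(1, 1, 1)

# Days elapsed from DateTime.MinValue to the date of this thread (2020-01-08).
elapsed = (date(2020, 1, 8) - dotnet_min).days
print(elapsed)              # roughly 737,000
print(elapsed < 1_000_000)  # True: stepping back up to 1_000_000 days from a
                            # current date can land before DateTime.MinValue
```

This is consistent with the report's "~730,000 days" figure and with the suggested fix of capping the argument to `AddDays` well below the elapsed-day count.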
Re: Arrow sync call January 8 at 12:00 US/Eastern, 17:00 UTC
Attendees:
* Ben Kietzman
* Wes McKinney
* Prudhvi Porandla
* Neal Richardson
* François Saint-Jacques

Discussion:
* Blockers for 1.0 release, how to get them done, what is required
* 0.16 backlog triage

On Tue, Jan 7, 2020 at 9:01 AM Neal Richardson wrote:
> Hi all,
> Happy 2020! Reminder that our biweekly call is in 24 hours at
> https://meet.google.com/vtm-teks-phx. All are welcome to join. Notes will
> be sent out to the mailing list afterwards.
>
> Neal
[jira] [Created] (ARROW-7515) [C++] Rename nonexistent and non_existent to not_found
Kenta Murata created ARROW-7515:
-----------------------------------

             Summary: [C++] Rename nonexistent and non_existent to not_found
                 Key: ARROW-7515
                 URL: https://issues.apache.org/jira/browse/ARROW-7515
             Project: Apache Arrow
          Issue Type: Task
          Components: C++
            Reporter: Kenta Murata
            Assignee: Kenta Murata

--
This message was sent by Atlassian Jira
(v8.3.4#803005)