[jira] [Commented] (ARROW-1808) [C++] Make RecordBatch interface virtual to permit record batches that lazy-materialize columns

2017-11-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261734#comment-16261734
 ] 

ASF GitHub Bot commented on ARROW-1808:
---

kou commented on issue #1337: ARROW-1808: [C++] Make RecordBatch, Table virtual 
interfaces for column access
URL: https://github.com/apache/arrow/pull/1337#issuecomment-346203089
 
 
   > It seems like respecting the schema is the right approach.
   
   I agree with you.
   
   > I could add boundschecking to `SimpleRecordBatch::column(i)` and return 
null if the index is out of bounds, would that help at all?
   
   I think it's better to do it in a higher layer such as the GLib bindings.
   We shouldn't do needless checks in the C++ layer, for simplicity and
   performance.
   
   I'll add bounds checking to the GLib bindings later. For now, I think it's
   better to always validate a newly created record batch. If we always
   validate, we can assume that all record batches have valid data.
   
   ```patch
   From d9260c09765b1cd337cda5a09497ee1b985ef623 Mon Sep 17 00:00:00 2001
   From: Kouhei Sutou 
   Date: Wed, 22 Nov 2017 09:09:30 +0900
   Subject: [PATCH] [GLib] Always validate on creating new record batch
   
   ---
c_glib/arrow-glib/record-batch.cpp | 13 ---
c_glib/arrow-glib/record-batch.h   |  3 ++-
c_glib/example/go/write-batch.go   | 10 +++--
c_glib/example/go/write-stream.go  | 10 +++--
c_glib/test/test-record-batch.rb   | 46 +-
5 files changed, 58 insertions(+), 24 deletions(-)
   
   diff --git a/c_glib/arrow-glib/record-batch.cpp b/c_glib/arrow-glib/record-batch.cpp
   index f23a0cf7..73de6eeb 100644
   --- a/c_glib/arrow-glib/record-batch.cpp
   +++ b/c_glib/arrow-glib/record-batch.cpp
   @@ -135,13 +135,15 @@ garrow_record_batch_class_init(GArrowRecordBatchClass *klass)
     * @schema: The schema of the record batch.
     * @n_rows: The number of the rows in the record batch.
     * @columns: (element-type GArrowArray): The columns in the record batch.
   + * @error: (nullable): Return location for a #GError or %NULL.
     *
   - * Returns: A newly created #GArrowRecordBatch.
   + * Returns: (nullable): A newly created #GArrowRecordBatch or %NULL on error.
     */
    GArrowRecordBatch *
    garrow_record_batch_new(GArrowSchema *schema,
                            guint32 n_rows,
   -                        GList *columns)
   +                        GList *columns,
   +                        GError **error)
    {
      std::vector<std::shared_ptr<arrow::Array>> arrow_columns;
      for (GList *node = columns; node; node = node->next) {
   @@ -152,7 +154,12 @@ garrow_record_batch_new(GArrowSchema *schema,
      auto arrow_record_batch =
        arrow::RecordBatch::Make(garrow_schema_get_raw(schema),
                                 n_rows, arrow_columns);
   -  return garrow_record_batch_new_raw(&arrow_record_batch);
   +  auto status = arrow_record_batch->Validate();
   +  if (garrow_error_check(error, status, "[record-batch][new]")) {
   +    return garrow_record_batch_new_raw(&arrow_record_batch);
   +  } else {
   +    return NULL;
   +  }
    }
    
    /**
   diff --git a/c_glib/arrow-glib/record-batch.h b/c_glib/arrow-glib/record-batch.h
   index 021f894f..823a42bb 100644
   --- a/c_glib/arrow-glib/record-batch.h
   +++ b/c_glib/arrow-glib/record-batch.h
   @@ -68,7 +68,8 @@ GType garrow_record_batch_get_type(void) G_GNUC_CONST;
    
    GArrowRecordBatch *garrow_record_batch_new(GArrowSchema *schema,
                                               guint32 n_rows,
   -                                           GList *columns);
   +                                           GList *columns,
   +                                           GError **error);
    
    gboolean garrow_record_batch_equal(GArrowRecordBatch *record_batch,
                                       GArrowRecordBatch *other_record_batch);
   diff --git a/c_glib/example/go/write-batch.go b/c_glib/example/go/write-batch.go
   index 9dbc3c00..f4d03ed9 100644
   --- a/c_glib/example/go/write-batch.go
   +++ b/c_glib/example/go/write-batch.go
   @@ -188,7 +188,10 @@ func main() {
BuildDoubleArray(),
}

   -recordBatch := arrow.NewRecordBatch(schema, 4, columns)
   +recordBatch, err := arrow.NewRecordBatch(schema, 4, columns)
   +if err != nil {
   +log.Fatalf("Failed to create record batch #1: %v", err)
   +}
_, err = writer.WriteRecordBatch(recordBatch)
if err != nil {
log.Fatalf("Failed to write record batch #1: %v", err)
   @@ -198,7 +201,10 @@ func main() {
for i, column := range columns {
slicedColumns[i] = column.Slice(1, 3)
}
   -recordBatch = arrow.NewRecordBatch(schema, 3, slicedColumns)
   +recordBatch, err = arrow.NewRecordBatch(schema, 3, 

[jira] [Commented] (ARROW-1808) [C++] Make RecordBatch interface virtual to permit record batches that lazy-materialize columns

2017-11-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261707#comment-16261707
 ] 

ASF GitHub Bot commented on ARROW-1808:
---

wesm commented on issue #1337: ARROW-1808: [C++] Make RecordBatch, Table 
virtual interfaces for column access
URL: https://github.com/apache/arrow/pull/1337#issuecomment-346200509
 
 
   Merging, since the Linux build now fails only when it reaches parquet-cpp. I
   will update the parquet-cpp patch and then merge that once its build passes.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Make RecordBatch interface virtual to permit record batches that 
> lazy-materialize columns
> ---
>
> Key: ARROW-1808
> URL: https://issues.apache.org/jira/browse/ARROW-1808
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> This should be looked at soon to prevent having to define a different virtual 
> interface for record batches. There are places where we are using the record 
> batch constructor directly, and in some third party code (like MapD), so this 
> might be good to get done for 0.8.0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1808) [C++] Make RecordBatch interface virtual to permit record batches that lazy-materialize columns

2017-11-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261710#comment-16261710
 ] 

ASF GitHub Bot commented on ARROW-1808:
---

wesm closed pull request #1337: ARROW-1808: [C++] Make RecordBatch, Table 
virtual interfaces for column access
URL: https://github.com/apache/arrow/pull/1337
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/.travis.yml b/.travis.yml
index 9c714a689..ddadf739a 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -55,6 +55,7 @@ matrix:
 - export ARROW_TRAVIS_VALGRIND=1
 - export ARROW_TRAVIS_PLASMA=1
 - export ARROW_TRAVIS_CLANG_FORMAT=1
+- export ARROW_BUILD_WARNING_LEVEL=CHECKIN
 - export CC="clang-4.0"
 - export CXX="clang++-4.0"
 - $TRAVIS_BUILD_DIR/ci/travis_install_clang_tools.sh
@@ -74,6 +75,7 @@ matrix:
 before_script:
 - export ARROW_TRAVIS_USE_TOOLCHAIN=1
 - export ARROW_TRAVIS_PLASMA=1
+- export ARROW_BUILD_WARNING_LEVEL=CHECKIN
 - travis_wait 50 $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh
 script:
 - $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh
diff --git a/c_glib/arrow-glib/record-batch.cpp b/c_glib/arrow-glib/record-batch.cpp
index f381af0a2..f23a0cf75 100644
--- a/c_glib/arrow-glib/record-batch.cpp
+++ b/c_glib/arrow-glib/record-batch.cpp
@@ -150,9 +150,8 @@ garrow_record_batch_new(GArrowSchema *schema,
   }
 
   auto arrow_record_batch =
-    std::make_shared<arrow::RecordBatch>(garrow_schema_get_raw(schema),
-                                         n_rows,
-                                         arrow_columns);
+    arrow::RecordBatch::Make(garrow_schema_get_raw(schema),
+                             n_rows, arrow_columns);
   return garrow_record_batch_new_raw(&arrow_record_batch);
 }
 
diff --git a/c_glib/arrow-glib/table.cpp b/c_glib/arrow-glib/table.cpp
index 779f2ef62..e086396f8 100644
--- a/c_glib/arrow-glib/table.cpp
+++ b/c_glib/arrow-glib/table.cpp
@@ -143,8 +143,7 @@ garrow_table_new(GArrowSchema *schema,
   }
 
   auto arrow_table =
-    std::make_shared<arrow::Table>(garrow_schema_get_raw(schema),
-                                   arrow_columns);
+    arrow::Table::Make(garrow_schema_get_raw(schema), arrow_columns);
   return garrow_table_new_raw(&arrow_table);
 }
 
diff --git a/c_glib/test/test-file-writer.rb b/c_glib/test/test-file-writer.rb
index 3de8e5cf3..67aed85f7 100644
--- a/c_glib/test/test-file-writer.rb
+++ b/c_glib/test/test-file-writer.rb
@@ -19,14 +19,18 @@ class TestFileWriter < Test::Unit::TestCase
   include Helper::Buildable
 
   def test_write_record_batch
+data = [true]
+field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new)
+schema = Arrow::Schema.new([field])
+
 tempfile = Tempfile.open("arrow-ipc-file-writer")
 output = Arrow::FileOutputStream.new(tempfile.path, false)
 begin
-  field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new)
-  schema = Arrow::Schema.new([field])
   file_writer = Arrow::RecordBatchFileWriter.new(output, schema)
   begin
-record_batch = Arrow::RecordBatch.new(schema, 0, [])
+record_batch = Arrow::RecordBatch.new(schema,
+  data.size,
+  [build_boolean_array(data)])
 file_writer.write_record_batch(record_batch)
   ensure
 file_writer.close
@@ -38,8 +42,12 @@ def test_write_record_batch
 input = Arrow::MemoryMappedInputStream.new(tempfile.path)
 begin
   file_reader = Arrow::RecordBatchFileReader.new(input)
-  assert_equal(["enabled"],
+  assert_equal([field.name],
file_reader.schema.fields.collect(&:name))
+  assert_equal(Arrow::RecordBatch.new(schema,
+  data.size,
+  [build_boolean_array(data)]),
+   file_reader.read_record_batch(0))
 ensure
   input.close
 end
diff --git a/c_glib/test/test-gio-input-stream.rb b/c_glib/test/test-gio-input-stream.rb
index a71a37043..2adf25b3a 100644
--- a/c_glib/test/test-gio-input-stream.rb
+++ b/c_glib/test/test-gio-input-stream.rb
@@ -16,15 +16,21 @@
 # under the License.
 
 class TestGIOInputStream < Test::Unit::TestCase
+  include Helper::Buildable
+
   def test_reader_backend
+data = [true]
+field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new)
+schema = Arrow::Schema.new([field])
+
 tempfile = Tempfile.open("arrow-gio-input-stream")
 output = Arrow::FileOutputStream.new(tempfile.path, false)
 begin
-  field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new)
-  schema = Arrow::Schema.new([field])
   file_writer = 

[jira] [Commented] (ARROW-1808) [C++] Make RecordBatch interface virtual to permit record batches that lazy-materialize columns

2017-11-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260841#comment-16260841
 ] 

ASF GitHub Bot commented on ARROW-1808:
---

wesm commented on issue #1337: ARROW-1808: [C++] Make RecordBatch, Table 
virtual interfaces for column access
URL: https://github.com/apache/arrow/pull/1337#issuecomment-346050608
 
 
   @kou it seems like there are two different issues here. 
   
   Here, a schema with 1 field was passed along with a list of 0 columns:
   
   ```diff
   -record_batch = Arrow::RecordBatch.new(schema, 0, [])
   +record_batch = Arrow::RecordBatch.new(schema,
   +  data.size,
   +  [build_boolean_array(data)])
   ```
   
   I believe this would result in segfaults even if the number of rows is
   non-zero. So having empty / length-0 record batches in the IPC writer code
   path is fine so long as the columns match the schema.
   
   The reason this bug was not caught before was that the
   `RecordBatch::columns_` member was being used to determine
   `RecordBatch::num_columns()`, whereas now we are using the schema. It seems
   like respecting the schema is the right approach. I could add bounds
   checking to `SimpleRecordBatch::column(i)` and return null if the index is
   out of bounds; would that help at all?
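   
   For reference, a minimal sketch of the kind of bounds check being discussed,
   written as a free function over the public Arrow C++ API (the helper name
   `ColumnOrNull` is illustrative and is not part of Arrow):
   
   ```cpp
   #include <memory>
   
   #include <arrow/api.h>
   
   // Sketch only: return nullptr for an out-of-range column index instead of
   // indexing past the end of the batch; the caller (e.g. the GLib bindings)
   // decides how to report the error.
   std::shared_ptr<arrow::Array> ColumnOrNull(const arrow::RecordBatch& batch,
                                              int i) {
     if (i < 0 || i >= batch.num_columns()) {
       return nullptr;
     }
     return batch.column(i);
   }
   ```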


[jira] [Commented] (ARROW-1808) [C++] Make RecordBatch interface virtual to permit record batches that lazy-materialize columns

2017-11-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260269#comment-16260269
 ] 

ASF GitHub Bot commented on ARROW-1808:
---

kou commented on issue #1337: ARROW-1808: [C++] Make RecordBatch, Table virtual 
interfaces for column access
URL: https://github.com/apache/arrow/pull/1337#issuecomment-345920440
 
 
   If we allow empty record batches (record batches with no columns), the
   following check is needed:
   
   ```diff
   diff --git a/cpp/src/arrow/ipc/writer.cc b/cpp/src/arrow/ipc/writer.cc
   index 3c1db061..6af9bc4b 100644
   --- a/cpp/src/arrow/ipc/writer.cc
   +++ b/cpp/src/arrow/ipc/writer.cc
   @@ -139,9 +139,11 @@ class RecordBatchSerializer : public ArrayVisitor {
  buffers_.clear();
}

   -// Perform depth-first traversal of the row-batch
   -for (int i = 0; i < batch.num_columns(); ++i) {
   -  RETURN_NOT_OK(VisitArray(*batch.column(i)));
   +if (batch.num_rows() > 0) {
   +  // Perform depth-first traversal of the row-batch
   +  for (int i = 0; i < batch.num_columns(); ++i) {
   +RETURN_NOT_OK(VisitArray(*batch.column(i)));
   +  }
}

    // The position for the start of a buffer relative to the passed frame of
   ```
   
   I'm OK with denying record batches that have no columns. It'll simplify our code.
   


[jira] [Commented] (ARROW-1808) [C++] Make RecordBatch interface virtual to permit record batches that lazy-materialize columns

2017-11-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260263#comment-16260263
 ] 

ASF GitHub Bot commented on ARROW-1808:
---

kou commented on issue #1337: ARROW-1808: [C++] Make RecordBatch, Table virtual 
interfaces for column access
URL: https://github.com/apache/arrow/pull/1337#issuecomment-345919956
 
 
   @wesm I confirmed it. The test creates a 0-row record batch with empty
   columns, which causes the segmentation fault. The following patch fixes this:
   
   ```diff
   diff --git a/c_glib/test/test-file-writer.rb b/c_glib/test/test-file-writer.rb
   index 3de8e5cf..67aed85f 100644
   --- a/c_glib/test/test-file-writer.rb
   +++ b/c_glib/test/test-file-writer.rb
   @@ -19,14 +19,18 @@ class TestFileWriter < Test::Unit::TestCase
  include Helper::Buildable

  def test_write_record_batch
   +data = [true]
   +field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new)
   +schema = Arrow::Schema.new([field])
   +
tempfile = Tempfile.open("arrow-ipc-file-writer")
output = Arrow::FileOutputStream.new(tempfile.path, false)
begin
   -  field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new)
   -  schema = Arrow::Schema.new([field])
  file_writer = Arrow::RecordBatchFileWriter.new(output, schema)
  begin
   -record_batch = Arrow::RecordBatch.new(schema, 0, [])
   +record_batch = Arrow::RecordBatch.new(schema,
   +  data.size,
   +  [build_boolean_array(data)])
file_writer.write_record_batch(record_batch)
  ensure
file_writer.close
   @@ -38,8 +42,12 @@ class TestFileWriter < Test::Unit::TestCase
input = Arrow::MemoryMappedInputStream.new(tempfile.path)
begin
  file_reader = Arrow::RecordBatchFileReader.new(input)
   -  assert_equal(["enabled"],
   +  assert_equal([field.name],
   file_reader.schema.fields.collect(&:name))
   +  assert_equal(Arrow::RecordBatch.new(schema,
   +  data.size,
   +  [build_boolean_array(data)]),
   +   file_reader.read_record_batch(0))
ensure
  input.close
end
   diff --git a/c_glib/test/test-gio-input-stream.rb b/c_glib/test/test-gio-input-stream.rb
   index a71a3704..2adf25b3 100644
   --- a/c_glib/test/test-gio-input-stream.rb
   +++ b/c_glib/test/test-gio-input-stream.rb
   @@ -16,15 +16,21 @@
# under the License.

class TestGIOInputStream < Test::Unit::TestCase
   +  include Helper::Buildable
   +
  def test_reader_backend
   +data = [true]
   +field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new)
   +schema = Arrow::Schema.new([field])
   +
tempfile = Tempfile.open("arrow-gio-input-stream")
output = Arrow::FileOutputStream.new(tempfile.path, false)
begin
   -  field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new)
   -  schema = Arrow::Schema.new([field])
  file_writer = Arrow::RecordBatchFileWriter.new(output, schema)
  begin
   -record_batch = Arrow::RecordBatch.new(schema, 0, [])
   +record_batch = Arrow::RecordBatch.new(schema,
   +  data.size,
   +  [build_boolean_array(data)])
file_writer.write_record_batch(record_batch)
  ensure
file_writer.close
   @@ -38,8 +44,12 @@ class TestGIOInputStream < Test::Unit::TestCase
input = Arrow::GIOInputStream.new(input_stream)
begin
  file_reader = Arrow::RecordBatchFileReader.new(input)
   -  assert_equal(["enabled"],
   +  assert_equal([field.name],
   file_reader.schema.fields.collect(&:name))
   +  assert_equal(Arrow::RecordBatch.new(schema,
   +  data.size,
   +  [build_boolean_array(data)]),
   +   file_reader.read_record_batch(0))
ensure
  input.close
end
   diff --git a/c_glib/test/test-gio-output-stream.rb b/c_glib/test/test-gio-output-stream.rb
   index adaa8c1b..c77598ed 100644
   --- a/c_glib/test/test-gio-output-stream.rb
   +++ b/c_glib/test/test-gio-output-stream.rb
   @@ -16,17 +16,23 @@
# under the License.

class TestGIOOutputStream < Test::Unit::TestCase
   +  include Helper::Buildable
   +
  def test_writer_backend
   +data = [true]
   +field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new)
   +schema = Arrow::Schema.new([field])
   +
tempfile = Tempfile.open("arrow-gio-output-stream")
file = Gio::File.new_for_path(tempfile.path)
output_stream = 

[jira] [Commented] (ARROW-1808) [C++] Make RecordBatch interface virtual to permit record batches that lazy-materialize columns

2017-11-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259850#comment-16259850
 ] 

ASF GitHub Bot commented on ARROW-1808:
---

wesm commented on issue #1337: ARROW-1808: [C++] Make RecordBatch, Table 
virtual interfaces for column access
URL: https://github.com/apache/arrow/pull/1337#issuecomment-345832486
 
 
   @kou I removed the DCHECKs from the constructor that I mentioned, in favor
   of validating in `RecordBatch::Validate`. The tests are segfaulting, though.
   I stepped into gdb to look at the failing test, and it looks like a record
   batch in the test suite might be malformed (the schema is larger than the
   actual number of columns).
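   
   For context, a minimal sketch of the kind of construction-time consistency
   check being discussed, written against the public API (the function name
   `ValidateColumns` is an assumption; this is not the actual
   `RecordBatch::Validate` implementation):
   
   ```cpp
   #include <cstdint>
   #include <memory>
   #include <string>
   #include <vector>
   
   #include <arrow/api.h>
   
   // Sketch only: check the supplied columns against the schema and row count
   // before a record batch is handed out.
   arrow::Status ValidateColumns(
       const std::shared_ptr<arrow::Schema>& schema, int64_t num_rows,
       const std::vector<std::shared_ptr<arrow::Array>>& columns) {
     if (static_cast<int>(columns.size()) != schema->num_fields()) {
       return arrow::Status::Invalid("Number of columns did not match schema");
     }
     for (size_t i = 0; i < columns.size(); ++i) {
       if (!columns[i]) {
         return arrow::Status::Invalid("Column " + std::to_string(i) + " was null");
       }
       if (columns[i]->length() != num_rows) {
         return arrow::Status::Invalid("Column " + std::to_string(i) +
                                       " length did not match num_rows");
       }
       if (!columns[i]->type()->Equals(schema->field(static_cast<int>(i))->type())) {
         return arrow::Status::Invalid("Column " + std::to_string(i) +
                                       " type did not match schema");
       }
     }
     return arrow::Status::OK();
   }
   ```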


[jira] [Commented] (ARROW-1808) [C++] Make RecordBatch interface virtual to permit record batches that lazy-materialize columns

2017-11-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259834#comment-16259834
 ] 

ASF GitHub Bot commented on ARROW-1808:
---

wesm commented on a change in pull request #1337: ARROW-1808: [C++] Make 
RecordBatch, Table virtual interfaces for column access
URL: https://github.com/apache/arrow/pull/1337#discussion_r152109752
 
 

 ##
 File path: cpp/src/arrow/record_batch.h
 ##
 @@ -0,0 +1,154 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#ifndef ARROW_RECORD_BATCH_H
+#define ARROW_RECORD_BATCH_H
+
+#include <cstdint>
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "arrow/array.h"
+#include "arrow/type.h"
+#include "arrow/util/macros.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+
+class KeyValueMetadata;
+class Status;
+
+/// \class RecordBatch
+/// \brief Collection of equal-length arrays matching a particular Schema
+///
+/// A record batch is a table-like data structure that is semantically a
+/// sequence of fields, each a contiguous Arrow array
+class ARROW_EXPORT RecordBatch {
+ public:
+  virtual ~RecordBatch() = default;
+
+  /// \param[in] schema The record batch schema
+  /// \param[in] num_rows length of fields in the record batch. Each array
+  /// should have the same length as num_rows
+  /// \param[in] columns the record batch fields as vector of arrays
+  static std::shared_ptr<RecordBatch> Make(
+      const std::shared_ptr<Schema>& schema, int64_t num_rows,
+      const std::vector<std::shared_ptr<Array>>& columns);
+
+  /// \brief Move-based constructor for a vector of Array instances
+  static std::shared_ptr<RecordBatch> Make(
+      const std::shared_ptr<Schema>& schema, int64_t num_rows,
+      std::vector<std::shared_ptr<Array>>&& columns);
+
+  /// \brief Construct record batch from vector of internal data structures
+  /// \since 0.5.0
+  ///
+  /// This class is only provided with an rvalue-reference for the input data,
+  /// and is intended for internal use, or advanced users.
+  ///
+  /// \param schema the record batch schema
+  /// \param num_rows the number of semantic rows in the record batch. This
+  /// should be equal to the length of each field
+  /// \param columns the data for the batch's columns
+  static std::shared_ptr<RecordBatch> Make(
+      const std::shared_ptr<Schema>& schema, int64_t num_rows,
+      std::vector<std::shared_ptr<ArrayData>>&& columns);
+
+  /// \brief Construct record batch by copying vector of array data
+  /// \since 0.5.0
+  static std::shared_ptr<RecordBatch> Make(
+      const std::shared_ptr<Schema>& schema, int64_t num_rows,
+      const std::vector<std::shared_ptr<ArrayData>>& columns);
+
+  /// \brief Determine if two record batches are exactly equal
+  /// \return true if batches are equal
+  bool Equals(const RecordBatch& other) const;
+
+  /// \brief Determine if two record batches are approximately equal
+  bool ApproxEquals(const RecordBatch& other) const;
+
+  /// \return the record batch's schema
+  std::shared_ptr<Schema> schema() const { return schema_; }
+
+  /// \brief Retrieve an array from the record batch
+  /// \param[in] i field index, does not boundscheck
+  /// \return an Array object
+  virtual std::shared_ptr<Array> column(int i) const = 0;
+
+  /// \brief Retrieve an array's internal data from the record batch
+  /// \param[in] i field index, does not boundscheck
+  /// \return an internal ArrayData object
+  virtual std::shared_ptr<ArrayData> column_data(int i) const = 0;
+
+  virtual std::shared_ptr<RecordBatch> ReplaceSchemaMetadata(
+      const std::shared_ptr<const KeyValueMetadata>& metadata) const = 0;
+
+  /// \brief Name of the i-th column
+  const std::string& column_name(int i) const;
 
 Review comment:
   The name is coming from the schema, so a copy is not strictly necessary.
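   
   A minimal sketch of what returning the name without a copy could look like,
   assuming the `schema_` member shown in the header excerpt above (illustrative
   only, not necessarily the final implementation):
   
   ```cpp
   // Sketch only: borrow the field name from the schema by const reference
   // instead of copying it.
   const std::string& RecordBatch::column_name(int i) const {
     return schema_->field(i)->name();
   }
   ```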


[jira] [Commented] (ARROW-1808) [C++] Make RecordBatch interface virtual to permit record batches that lazy-materialize columns

2017-11-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259622#comment-16259622
 ] 

ASF GitHub Bot commented on ARROW-1808:
---

wesm commented on issue #1337: ARROW-1808: [C++] Make RecordBatch, Table 
virtual interfaces for column access
URL: https://github.com/apache/arrow/pull/1337#issuecomment-345787659
 
 
   @kou I fixed the glib compilation, and I added DCHECKs to the record batch
   constructor to assert that the schema is the same size as the columns, but
   it seems this isn't being checked in the GLib bindings:
   
   ```
   TestFileWriter:
     test_write_record_batch:
       /home/wesm/code/arrow/cpp/src/arrow/record_batch.cc:39 Check failed: (static_cast<int>(columns.size())) == (schema->num_fields())
   ```


[jira] [Commented] (ARROW-1808) [C++] Make RecordBatch interface virtual to permit record batches that lazy-materialize columns

2017-11-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259426#comment-16259426
 ] 

ASF GitHub Bot commented on ARROW-1808:
---

wesm opened a new pull request #1337: ARROW-1808: [C++] Make RecordBatch, Table 
virtual interfaces for column access
URL: https://github.com/apache/arrow/pull/1337
 
 
   While this will cause some minor API breakage in parquet-cpp and some other 
downstream users, this is reasonably long overdue. It will permit 
implementations of the RecordBatch or Table interface that do lazy IO / data 
loading or lazy materialization of columns.
   
   I will write a patch to fix up parquet-cpp, and will look to see if glib is
   easy to fix. There's no good way to go about merging this patch since a
   green build is not possible; once we're happy with the patch, I can merge it
   and then work on getting a green build in parquet-cpp so we don't have a
   broken build there for too long.


[jira] [Commented] (ARROW-1808) [C++] Make RecordBatch interface virtual to permit record batches that lazy-materialize columns

2017-11-15 Thread Kouhei Sutou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16254767#comment-16254767
 ] 

Kouhei Sutou commented on ARROW-1808:
-

OK. I created ARROW-1824.

[jira] [Commented] (ARROW-1808) [C++] Make RecordBatch interface virtual to permit record batches that lazy-materialize columns

2017-11-15 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16253681#comment-16253681
 ] 

Wes McKinney commented on ARROW-1808:
-

Yes, that would be useful. We should create a separate JIRA about this

[jira] [Commented] (ARROW-1808) [C++] Make RecordBatch interface virtual to permit record batches that lazy-materialize columns

2017-11-14 Thread Kouhei Sutou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16253069#comment-16253069
 ] 

Kouhei Sutou commented on ARROW-1808:
-

Thanks for letting me know; that's very useful information.
I like your idea. The GLib bindings will follow soon after the API changes are
done.

I'm not sure whether the following is related to this topic, but I'll share it
anyway:

It may be useful if we could use the arrow::ipc modules, such as
RecordBatchStreamReader and RecordBatchStreamWriter, with RecordBatch data on
the GPU.


[jira] [Commented] (ARROW-1808) [C++] Make RecordBatch interface virtual to permit record batches that lazy-materialize columns

2017-11-14 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251633#comment-16251633
 ] 

Wes McKinney commented on ARROW-1808:
-

An alternative to changing the current {{arrow::RecordBatch}} implementation is
to create a new {{arrow::VirtualRecordBatch}} class as the base class, but then
we would need to change various APIs for classes that return
{{std::shared_ptr<RecordBatch>}} (like the IPC loaders).

[jira] [Commented] (ARROW-1808) [C++] Make RecordBatch interface virtual to permit record batches that lazy-materialize columns

2017-11-14 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16251621#comment-16251621
 ] 

Wes McKinney commented on ARROW-1808:
-

[~kou] I am going to start looking at this soon, since it may cause a little
bit of disruption in the glib bindings. This is a moderately disruptive API
change, but long-term it will be for the best. The idea is that the current
{{arrow::RecordBatch}} is a "simple in-memory record batch". But the
object-boxing requirements to produce a vector of
{{std::shared_ptr<arrow::Array>}} can be quite expensive for large record
batches.

Instead, we could have {{arrow::RecordBatch}} as an abstract interface with
virtual functions for column access, with the current incarnation of
RecordBatch as a subclass. We could then also create an
{{arrow::IpcRecordBatch}} that does late materialization of the
{{arrow::Array}} objects. So if you have 1000 columns, you do not pay the cost
of creating array objects for all of them if you only end up accessing a few
columns in some analytics algorithm.
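
A rough sketch of the shape being described, written against Arrow's public C++
API; the class names (VirtualBatch, LazyBatch) and the materialization details
are illustrative assumptions, not the interface Arrow ended up with:

```cpp
#include <memory>
#include <utility>
#include <vector>

#include <arrow/api.h>

// Sketch only: an abstract record batch with virtual column access, plus a
// lazily-materializing subclass. Thread safety is ignored for brevity.
class VirtualBatch {
 public:
  virtual ~VirtualBatch() = default;
  virtual int num_columns() const = 0;
  // Column access is virtual so subclasses can defer creating Array objects.
  virtual std::shared_ptr<arrow::Array> column(int i) const = 0;
};

class LazyBatch : public VirtualBatch {
 public:
  LazyBatch(std::shared_ptr<arrow::Schema> schema,
            std::vector<std::shared_ptr<arrow::ArrayData>> data)
      : schema_(std::move(schema)),
        data_(std::move(data)),
        boxed_(data_.size()) {}

  int num_columns() const override { return schema_->num_fields(); }

  // Box column i into an arrow::Array only the first time it is requested,
  // so touching a few out of 1000 columns stays cheap.
  std::shared_ptr<arrow::Array> column(int i) const override {
    if (!boxed_[i]) {
      boxed_[i] = arrow::MakeArray(data_[i]);
    }
    return boxed_[i];
  }

 private:
  std::shared_ptr<arrow::Schema> schema_;
  std::vector<std::shared_ptr<arrow::ArrayData>> data_;
  mutable std::vector<std::shared_ptr<arrow::Array>> boxed_;
};
```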
