[GitHub] [arrow] nealrichardson commented on pull request #10833: ARROW-13493 [C++] Anonymous structs in an anonymous union are a GNU extension

2021-07-30 Thread GitBox


nealrichardson commented on pull request #10833:
URL: https://github.com/apache/arrow/pull/10833#issuecomment-889950166


   I've reproduced the error on https://github.com/apache/arrow/pull/10839, so 
I'll merge that (assuming I haven't broken other things) and then you can 
rebase this, run the as-cran crossbow job, and we should see the build pass.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] bkietz commented on a change in pull request #10802: ARROW-1568: [C++] Implement Drop Null Kernel for Arrays

2021-07-30 Thread GitBox


bkietz commented on a change in pull request #10802:
URL: https://github.com/apache/arrow/pull/10802#discussion_r679994244



##
File path: cpp/src/arrow/compute/kernels/vector_selection.cc
##
@@ -2146,6 +2147,144 @@ class TakeMetaFunction : public MetaFunction {
   }
 };
 
+// ------------------------------------------------------------
+// DropNull Implementation
+
+Result<std::shared_ptr<Array>> GetNotNullIndices(
+    const std::shared_ptr<Array>& column, MemoryPool* memory_pool) {
+  std::shared_ptr<Array> indices;
+  arrow::NumericBuilder<Int32Type> builder(memory_pool);
+  builder.Reserve(column->length() - column->null_count());

Review comment:
   Every function which returns Status or Result<> will prompt a similar 
compiler error if not explicitly handled, since it could be ignoring (for 
example) an OOM Status 
https://github.com/apache/arrow/pull/10802/checks?check_run_id=3202879387#step:5:608
   ```suggestion
 RETURN_NOT_OK(builder.Reserve(column->length() - column->null_count()));
   ```








[GitHub] [arrow] bkietz commented on a change in pull request #10802: ARROW-1568: [C++] Implement Drop Null Kernel for Arrays

2021-07-30 Thread GitBox


bkietz commented on a change in pull request #10802:
URL: https://github.com/apache/arrow/pull/10802#discussion_r680002741



##
File path: cpp/src/arrow/compute/kernels/vector_selection.cc
##
@@ -2146,6 +2147,144 @@ class TakeMetaFunction : public MetaFunction {
   }
 };
 
+// ------------------------------------------------------------
+// DropNull Implementation
+
+Result<std::shared_ptr<Array>> GetNotNullIndices(
+    const std::shared_ptr<Array>& column, MemoryPool* memory_pool) {
+  std::shared_ptr<Array> indices;
+  arrow::NumericBuilder<Int32Type> builder(memory_pool);

Review comment:
   I think instead of explicitly building indices of valid bits, it'd be 
more compact and performant to reuse GetTakeIndices. You can zero-copy 
construct a BooleanArray which is true where `column` is valid and false where 
`column` is null, then pass it to GetTakeIndices to acquire `indices`








[GitHub] [arrow] nirandaperera commented on a change in pull request #10802: ARROW-1568: [C++] Implement Drop Null Kernel for Arrays

2021-07-30 Thread GitBox


nirandaperera commented on a change in pull request #10802:
URL: https://github.com/apache/arrow/pull/10802#discussion_r680002827



##
File path: cpp/src/arrow/compute/kernels/vector_selection.cc
##
@@ -2146,6 +2147,144 @@ class TakeMetaFunction : public MetaFunction {
   }
 };
 
+// ------------------------------------------------------------
+// DropNull Implementation
+
+Result<std::shared_ptr<Array>> GetNotNullIndices(
+    const std::shared_ptr<Array>& column, MemoryPool* memory_pool) {
+  std::shared_ptr<Array> indices;
+  arrow::NumericBuilder<Int32Type> builder(memory_pool);
+  builder.Reserve(column->length() - column->null_count());
+
+  std::vector<int32_t> values;
+  for (int64_t i = 0; i < column->length(); i++) {
+    if (column->IsValid(i)) {
+      builder.UnsafeAppend(static_cast<int32_t>(i));
+    }
+  }
+  RETURN_NOT_OK(builder.Finish(&indices));
+  return indices;
+}
+
+Result<std::shared_ptr<Array>> GetNotNullIndices(
+    const std::shared_ptr<ChunkedArray>& chunks, MemoryPool* memory_pool) {
+  std::shared_ptr<Array> indices;
+  arrow::NumericBuilder<Int32Type> builder(memory_pool);
+  builder.Reserve(chunks->length() - chunks->null_count());
+  int64_t relative_index = 0;
+  for (int64_t chunk_index = 0; chunk_index < chunks->num_chunks(); ++chunk_index) {
+    auto column_chunk = chunks->chunk(chunk_index);
+    for (int64_t col_index = 0; col_index < column_chunk->length(); col_index++) {
+      if (column_chunk->IsValid(col_index)) {
+        builder.UnsafeAppend(static_cast<int32_t>(relative_index + col_index));
+      }
+    }
+    relative_index += column_chunk->length();
+  }
+  RETURN_NOT_OK(builder.Finish(&indices));
+  return indices;
+}
+
+Result<std::shared_ptr<RecordBatch>> DropNullRecordBatch(const RecordBatch& batch,

Review comment:
   I'm a little confused on the semantics for `DropNull` in record batches. What would happen in the following case?
   
   | A | B| C|
   |---|--|--|
   | 1 | 4| null |
   | 2 | null | 6|
   | 3 | 5| null |
   

##
File path: cpp/src/arrow/compute/kernels/vector_selection.cc
##
@@ -2146,6 +2147,144 @@ class TakeMetaFunction : public MetaFunction {
   }
 };
 
+// ------------------------------------------------------------
+// DropNull Implementation
+
+Result<std::shared_ptr<Array>> GetNotNullIndices(
+    const std::shared_ptr<Array>& column, MemoryPool* memory_pool) {

Review comment:
   I feel like it would be nice to have the following signature
   ```c++
   Status GetNotNullIndices(const std::shared_ptr<Array>& column, MemoryPool* memory_pool,
                            std::shared_ptr<Array>* out_array);
   ```

##
File path: cpp/src/arrow/compute/kernels/vector_selection.cc
##
@@ -2146,6 +2147,144 @@ class TakeMetaFunction : public MetaFunction {
   }
 };
 
+// ------------------------------------------------------------
+// DropNull Implementation
+
+Result<std::shared_ptr<Array>> GetNotNullIndices(
+    const std::shared_ptr<Array>& column, MemoryPool* memory_pool) {
+  std::shared_ptr<Array> indices;
+  arrow::NumericBuilder<Int32Type> builder(memory_pool);
+  builder.Reserve(column->length() - column->null_count());
+
+  std::vector<int32_t> values;
+  for (int64_t i = 0; i < column->length(); i++) {
+    if (column->IsValid(i)) {
+      builder.UnsafeAppend(static_cast<int32_t>(i));
+    }
+  }
+  RETURN_NOT_OK(builder.Finish(&indices));
+  return indices;
+}
+
+Result<std::shared_ptr<Array>> GetNotNullIndices(
+    const std::shared_ptr<ChunkedArray>& chunks, MemoryPool* memory_pool) {
+  std::shared_ptr<Array> indices;
+  arrow::NumericBuilder<Int32Type> builder(memory_pool);
+  builder.Reserve(chunks->length() - chunks->null_count());
+  int64_t relative_index = 0;
+  for (int64_t chunk_index = 0; chunk_index < chunks->num_chunks(); ++chunk_index) {
+    auto column_chunk = chunks->chunk(chunk_index);
+    for (int64_t col_index = 0; col_index < column_chunk->length(); col_index++) {
+      if (column_chunk->IsValid(col_index)) {
+        builder.UnsafeAppend(static_cast<int32_t>(relative_index + col_index));
+      }
+    }
+    relative_index += column_chunk->length();
+  }
+  RETURN_NOT_OK(builder.Finish(&indices));
+  return indices;
+}
+
+Result<std::shared_ptr<RecordBatch>> DropNullRecordBatch(const RecordBatch& batch,

Review comment:
   I have a feeling that we need to make a union out of all the validity 
buffers and then call `Take` based on that?

##
File path: cpp/src/arrow/compute/kernels/vector_selection.cc
##
@@ -2146,6 +2147,144 @@ class TakeMetaFunction : public MetaFunction {
   }
 };
 
+// ------------------------------------------------------------
+// DropNull Implementation
+
+Result<std::shared_ptr<Array>> GetNotNullIndices(
+    const std::shared_ptr<Array>& column, MemoryPool* memory_pool) {
+  std::shared_ptr<Array> indices;
+  arrow::NumericBuilder<Int32Type> builder(memory_pool);
+  builder.Reserve(column->length() - column->null_count());
+
+  std::vector<int32_t> values;
+  for (int64_t i = 0; i < column->length(); i++) {
+    if (column->IsValid(i)) {
+      builder.UnsafeAppend(static_cast<int32_t>(i));
+    }
+  }
+  RETURN_NOT_OK(builder.Finish(&indices));
+  return indices;
+}
+
+Result<std::shared_ptr<Array>> GetNotNullIndices(
+    const std::shared_ptr<ChunkedArray>& chunks, MemoryPool* memory_pool) {
+  std::shared

[GitHub] [arrow] thisisnic commented on a change in pull request #10710: ARROW-11460: [R] Use system libraries if present on Linux

2021-07-30 Thread GitBox


thisisnic commented on a change in pull request #10710:
URL: https://github.com/apache/arrow/pull/10710#discussion_r680012443



##
File path: r/inst/build_arrow_static.sh
##
@@ -45,6 +45,19 @@ else
   ARROW_DEFAULT_PARAM="OFF"
 fi
 
+if [ "$R_STATIC_DEPENDENCY_SOURCE" = "AUTO" ]; then
+  pkg-config --version >/dev/null 2>&1
+  if [ $? -eq 0 ]; then
+ARROW_DEPENDENCY_SOURCE="AUTO"
+  else
+ARROW_DEPENDENCY_SOURCE="BUNDLED"
+echo " Warning: $R_STATIC_DEPENDENCY_SOURCE set to 'AUTO' but 
pkg-config not detected"
+echo " Using bundled dependencies instead"
+  fi

Review comment:
   So after that, I ran this:
   `ARROW_R_DEV=TRUE DEPENDENCY_SOURCE=AUTO docker run -it apache/arrow-dev:r-rhub-ubuntu-gcc-release-latest bash`
   
   but couldn't find that script in the `arrow` directory. Has something gone really wrong here?
   








[GitHub] [arrow] dongjoon-hyun commented on pull request #10838: ARROW-13506: [C++][Java] Upgrade ORC to 1.6.9

2021-07-30 Thread GitBox


dongjoon-hyun commented on pull request #10838:
URL: https://github.com/apache/arrow/pull/10838#issuecomment-889969658


   Also cc @sunchao .






[GitHub] [arrow] nealrichardson closed pull request #10839: ARROW-13510: [CI][R][C++] Add -Wall to fedora-clang-devel as-cran checks

2021-07-30 Thread GitBox


nealrichardson closed pull request #10839:
URL: https://github.com/apache/arrow/pull/10839


   






[GitHub] [arrow] nealrichardson commented on pull request #10833: ARROW-13493 [C++] Anonymous structs in an anonymous union are a GNU extension

2021-07-30 Thread GitBox


nealrichardson commented on pull request #10833:
URL: https://github.com/apache/arrow/pull/10833#issuecomment-889973030


   ok @nirandaperera, merged, please rebase and then do `@github-actions 
crossbow submit test-r-linux-as-cran`






[GitHub] [arrow] nirandaperera commented on pull request #10833: ARROW-13493 [C++] Anonymous structs in an anonymous union are a GNU extension

2021-07-30 Thread GitBox


nirandaperera commented on pull request #10833:
URL: https://github.com/apache/arrow/pull/10833#issuecomment-889974932


   @github-actions crossbow submit test-r-linux-as-cran






[GitHub] [arrow] github-actions[bot] commented on pull request #10833: ARROW-13493 [C++] Anonymous structs in an anonymous union are a GNU extension

2021-07-30 Thread GitBox


github-actions[bot] commented on pull request #10833:
URL: https://github.com/apache/arrow/pull/10833#issuecomment-889975532


   Revision: fe0b1ec8c425c963a7470ace9fd3af14674b37dd
   
   Submitted crossbow builds: [ursacomputing/crossbow @ 
actions-678](https://github.com/ursacomputing/crossbow/branches/all?query=actions-678)
   
   |Task|Status|
   |----|------|
   |test-r-linux-as-cran|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-678-github-test-r-linux-as-cran)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-678-github-test-r-linux-as-cran)|






[GitHub] [arrow] emkornfield commented on pull request #10627: PARQUET-490: [C++][Parquet] Basic support for encoding DELTA_BINARY_PACKED

2021-07-30 Thread GitBox


emkornfield commented on pull request #10627:
URL: https://github.com/apache/arrow/pull/10627#issuecomment-889976082


   sorry, I have had less time than I would have liked recently for Arrow reviews; will try to get to this soon.






[GitHub] [arrow] emkornfield commented on pull request #10716: ARROW-13330: [Go][Parquet] Add the rest of the Encoding package

2021-07-30 Thread GitBox


emkornfield commented on pull request #10716:
URL: https://github.com/apache/arrow/pull/10716#issuecomment-889976220


   sorry, I have had less time than I would have liked recently for Arrow reviews; will try to get to this soon.






[GitHub] [arrow] emkornfield commented on a change in pull request #10716: ARROW-13330: [Go][Parquet] Add the rest of the Encoding package

2021-07-30 Thread GitBox


emkornfield commented on a change in pull request #10716:
URL: https://github.com/apache/arrow/pull/10716#discussion_r680022978



##
File path: go/parquet/internal/encoding/encoding_benchmarks_test.go
##
@@ -0,0 +1,461 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package encoding_test
+
+import (
+   "fmt"
+   "math"
+   "testing"
+
+   "github.com/apache/arrow/go/arrow"
+   "github.com/apache/arrow/go/arrow/array"
+   "github.com/apache/arrow/go/arrow/memory"
+   "github.com/apache/arrow/go/parquet"
+   "github.com/apache/arrow/go/parquet/internal/encoding"
+   "github.com/apache/arrow/go/parquet/internal/hashing"
+   "github.com/apache/arrow/go/parquet/internal/testutils"
+   "github.com/apache/arrow/go/parquet/schema"
+)
+
+const (
+   MINSIZE = 1024
+   MAXSIZE = 65536
+)
+
+func BenchmarkPlainEncodingBoolean(b *testing.B) {
+   for sz := MINSIZE; sz < MAXSIZE+1; sz *= 2 {

Review comment:
   there isn't a built-in construct in Go benchmarks for adjusting batch size?








[GitHub] [arrow] emkornfield commented on a change in pull request #10716: ARROW-13330: [Go][Parquet] Add the rest of the Encoding package

2021-07-30 Thread GitBox


emkornfield commented on a change in pull request #10716:
URL: https://github.com/apache/arrow/pull/10716#discussion_r680026208



##
File path: go/parquet/internal/encoding/levels.go
##
@@ -0,0 +1,284 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package encoding
+
+import (
+   "bytes"
+   "encoding/binary"
+   "io"
+   "math/bits"
+
+   "github.com/JohnCGriffin/overflow"
+   "github.com/apache/arrow/go/arrow/bitutil"
+   "github.com/apache/arrow/go/parquet"
+   format "github.com/apache/arrow/go/parquet/internal/gen-go/parquet"
+   "github.com/apache/arrow/go/parquet/internal/utils"
+)
+
+// LevelEncoder is for handling the encoding of Definition and Repetition levels
+// to parquet files.
+type LevelEncoder struct {
+   bitWidth int
+   rleLen   int
+   encoding format.Encoding
+   rle  *utils.RleEncoder
+   bit  *utils.BitWriter
+}
+
+// LevelEncodingMaxBufferSize estimates the max number of bytes needed to encode data with the
+// specified encoding given the max level and number of buffered values provided.
+func LevelEncodingMaxBufferSize(encoding parquet.Encoding, maxLvl int16, nbuffered int) int {
+   bitWidth := bits.Len64(uint64(maxLvl))
+   nbytes := 0
+   switch encoding {
+   case parquet.Encodings.RLE:
+   nbytes = utils.MaxBufferSize(bitWidth, nbuffered) + utils.MinBufferSize(bitWidth)
+   case parquet.Encodings.BitPacked:
+   nbytes = int(bitutil.BytesForBits(int64(nbuffered * bitWidth)))
+   default:
+   panic("parquet: unknown encoding type for levels")
+   }
+   return nbytes
+}
+
+// Reset resets the encoder allowing it to be reused and updating the maxlevel to the new
+// specified value.
+func (l *LevelEncoder) Reset(maxLvl int16) {
+   l.bitWidth = bits.Len64(uint64(maxLvl))
+   switch l.encoding {
+   case format.Encoding_RLE:
+   l.rle.Clear()
+   l.rle.BitWidth = l.bitWidth
+   case format.Encoding_BIT_PACKED:
+   l.bit.Clear()
+   default:
+   panic("parquet: unknown encoding type")
+   }
+}
+
+// Init is called to set up the desired encoding type, max level and underlying writer for a
+// level encoder to control where the resulting encoded buffer will end up.
+func (l *LevelEncoder) Init(encoding parquet.Encoding, maxLvl int16, w io.WriterAt) {
+   l.bitWidth = bits.Len64(uint64(maxLvl))
+   l.encoding = format.Encoding(encoding)
+   switch l.encoding {
+   case format.Encoding_RLE:
+   l.rle = utils.NewRleEncoder(w, l.bitWidth)
+   case format.Encoding_BIT_PACKED:
+   l.bit = utils.NewBitWriter(w)
+   default:
+   panic("parquet: unknown encoding type for levels")
+   }
+}
+
+// EncodeNoFlush encodes the provided levels in the encoder, but doesn't flush
+// the buffer and return it yet, appending these encoded values. Returns the number

Review comment:
   does it ever return less than the values provided in lvls?  If so please 
document it (if not maybe still note that this is simply for the API consumer's 
convenience?)








[GitHub] [arrow] emkornfield commented on a change in pull request #10716: ARROW-13330: [Go][Parquet] Add the rest of the Encoding package

2021-07-30 Thread GitBox


emkornfield commented on a change in pull request #10716:
URL: https://github.com/apache/arrow/pull/10716#discussion_r680027195



##
File path: go/parquet/internal/encoding/levels.go
##
@@ -0,0 +1,284 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package encoding
+
+import (
+   "bytes"
+   "encoding/binary"
+   "io"
+   "math/bits"
+
+   "github.com/JohnCGriffin/overflow"
+   "github.com/apache/arrow/go/arrow/bitutil"
+   "github.com/apache/arrow/go/parquet"
+   format "github.com/apache/arrow/go/parquet/internal/gen-go/parquet"
+   "github.com/apache/arrow/go/parquet/internal/utils"
+)
+
+// LevelEncoder is for handling the encoding of Definition and Repetition levels
+// to parquet files.
+type LevelEncoder struct {
+   bitWidth int
+   rleLen   int
+   encoding format.Encoding
+   rle  *utils.RleEncoder
+   bit  *utils.BitWriter
+}
+
+// LevelEncodingMaxBufferSize estimates the max number of bytes needed to encode data with the
+// specified encoding given the max level and number of buffered values provided.
+func LevelEncodingMaxBufferSize(encoding parquet.Encoding, maxLvl int16, nbuffered int) int {
+   bitWidth := bits.Len64(uint64(maxLvl))
+   nbytes := 0
+   switch encoding {
+   case parquet.Encodings.RLE:
+   nbytes = utils.MaxBufferSize(bitWidth, nbuffered) + utils.MinBufferSize(bitWidth)
+   case parquet.Encodings.BitPacked:
+   nbytes = int(bitutil.BytesForBits(int64(nbuffered * bitWidth)))
+   default:
+   panic("parquet: unknown encoding type for levels")
+   }
+   return nbytes
+}
+
+// Reset resets the encoder allowing it to be reused and updating the maxlevel to the new
+// specified value.
+func (l *LevelEncoder) Reset(maxLvl int16) {
+   l.bitWidth = bits.Len64(uint64(maxLvl))
+   switch l.encoding {
+   case format.Encoding_RLE:
+   l.rle.Clear()
+   l.rle.BitWidth = l.bitWidth
+   case format.Encoding_BIT_PACKED:
+   l.bit.Clear()
+   default:
+   panic("parquet: unknown encoding type")
+   }
+}
+
+// Init is called to set up the desired encoding type, max level and underlying writer for a
+// level encoder to control where the resulting encoded buffer will end up.
+func (l *LevelEncoder) Init(encoding parquet.Encoding, maxLvl int16, w io.WriterAt) {
+   l.bitWidth = bits.Len64(uint64(maxLvl))
+   l.encoding = format.Encoding(encoding)
+   switch l.encoding {
+   case format.Encoding_RLE:
+   l.rle = utils.NewRleEncoder(w, l.bitWidth)
+   case format.Encoding_BIT_PACKED:
+   l.bit = utils.NewBitWriter(w)
+   default:
+   panic("parquet: unknown encoding type for levels")
+   }
+}
+
+// EncodeNoFlush encodes the provided levels in the encoder, but doesn't flush
+// the buffer and return it yet, appending these encoded values. Returns the number
+// of values encoded.
+func (l *LevelEncoder) EncodeNoFlush(lvls []int16) int {
+   nencoded := 0
+   if l.rle == nil && l.bit == nil {
+   panic("parquet: level encoders are not initialized")
+   }
+
+   switch l.encoding {
+   case format.Encoding_RLE:
+   for _, level := range lvls {
+   if !l.rle.Put(uint64(level)) {
+   break
+   }
+   nencoded++
+   }
+   default:
+   for _, level := range lvls {
+   if l.bit.WriteValue(uint64(level), uint(l.bitWidth)) != nil {
+   break
+   }
+   nencoded++
+   }
+   }
+   return nencoded
+}
+
+// Flush flushes out any encoded data to the underlying writer.
+func (l *LevelEncoder) Flush() {
+   if l.rle == nil && l.bit == nil {
+   panic("parquet: level encoders are not initialized")
+   }
+
+   switch l.encoding {
+   case format.Encoding_RLE:
+   l.rleLen = l.rle.Flush()
+   default:
+   l.bit.Flush(false)
+   }
+}
+
+// Encode enco

[GitHub] [arrow] emkornfield commented on a change in pull request #10716: ARROW-13330: [Go][Parquet] Add the rest of the Encoding package

2021-07-30 Thread GitBox


emkornfield commented on a change in pull request #10716:
URL: https://github.com/apache/arrow/pull/10716#discussion_r680027388



##
File path: go/parquet/internal/encoding/levels.go
##
@@ -0,0 +1,284 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package encoding
+
+import (
+   "bytes"
+   "encoding/binary"
+   "io"
+   "math/bits"
+
+   "github.com/JohnCGriffin/overflow"
+   "github.com/apache/arrow/go/arrow/bitutil"
+   "github.com/apache/arrow/go/parquet"
+   format "github.com/apache/arrow/go/parquet/internal/gen-go/parquet"
+   "github.com/apache/arrow/go/parquet/internal/utils"
+)
+
+// LevelEncoder is for handling the encoding of Definition and Repetition levels
+// to parquet files.
+type LevelEncoder struct {
+   bitWidth int
+   rleLen   int
+   encoding format.Encoding
+   rle  *utils.RleEncoder
+   bit  *utils.BitWriter
+}
+
+// LevelEncodingMaxBufferSize estimates the max number of bytes needed to encode data with the
+// specified encoding given the max level and number of buffered values provided.
+func LevelEncodingMaxBufferSize(encoding parquet.Encoding, maxLvl int16, nbuffered int) int {
+   bitWidth := bits.Len64(uint64(maxLvl))
+   nbytes := 0
+   switch encoding {
+   case parquet.Encodings.RLE:
+   nbytes = utils.MaxBufferSize(bitWidth, nbuffered) + utils.MinBufferSize(bitWidth)
+   case parquet.Encodings.BitPacked:
+   nbytes = int(bitutil.BytesForBits(int64(nbuffered * bitWidth)))
+   default:
+   panic("parquet: unknown encoding type for levels")
+   }
+   return nbytes
+}
+
+// Reset resets the encoder allowing it to be reused and updating the maxlevel to the new
+// specified value.
+func (l *LevelEncoder) Reset(maxLvl int16) {
+   l.bitWidth = bits.Len64(uint64(maxLvl))
+   switch l.encoding {
+   case format.Encoding_RLE:
+   l.rle.Clear()
+   l.rle.BitWidth = l.bitWidth
+   case format.Encoding_BIT_PACKED:
+   l.bit.Clear()
+   default:
+   panic("parquet: unknown encoding type")
+   }
+}
+
+// Init is called to set up the desired encoding type, max level and underlying writer for a
+// level encoder to control where the resulting encoded buffer will end up.
+func (l *LevelEncoder) Init(encoding parquet.Encoding, maxLvl int16, w io.WriterAt) {
+   l.bitWidth = bits.Len64(uint64(maxLvl))
+   l.encoding = format.Encoding(encoding)
+   switch l.encoding {
+   case format.Encoding_RLE:
+   l.rle = utils.NewRleEncoder(w, l.bitWidth)
+   case format.Encoding_BIT_PACKED:
+   l.bit = utils.NewBitWriter(w)
+   default:
+   panic("parquet: unknown encoding type for levels")
+   }
+}
+
+// EncodeNoFlush encodes the provided levels in the encoder, but doesn't flush
+// the buffer and return it yet, appending these encoded values. Returns the number
+// of values encoded.
+func (l *LevelEncoder) EncodeNoFlush(lvls []int16) int {
+   nencoded := 0
+   if l.rle == nil && l.bit == nil {
+   panic("parquet: level encoders are not initialized")
+   }
+
+   switch l.encoding {
+   case format.Encoding_RLE:
+   for _, level := range lvls {
+   if !l.rle.Put(uint64(level)) {
+   break
+   }
+   nencoded++
+   }
+   default:
+   for _, level := range lvls {
+   if l.bit.WriteValue(uint64(level), uint(l.bitWidth)) != nil {
+   break
+   }
+   nencoded++
+   }
+   }
+   return nencoded
+}
+
+// Flush flushes out any encoded data to the underlying writer.
+func (l *LevelEncoder) Flush() {
+   if l.rle == nil && l.bit == nil {
+   panic("parquet: level encoders are not initialized")
+   }
+
+   switch l.encoding {
+   case format.Encoding_RLE:
+   l.rleLen = l.rle.Flush()
+   default:
+   l.bit.Flush(false)
+   }
+}
+
+// Encode enco

[GitHub] [arrow] nealrichardson commented on a change in pull request #10710: ARROW-11460: [R] Use system libraries if present on Linux

2021-07-30 Thread GitBox


nealrichardson commented on a change in pull request #10710:
URL: https://github.com/apache/arrow/pull/10710#discussion_r680027557



##
File path: r/inst/build_arrow_static.sh
##
@@ -45,6 +45,19 @@ else
   ARROW_DEFAULT_PARAM="OFF"
 fi
 
+if [ "$R_STATIC_DEPENDENCY_SOURCE" = "AUTO" ]; then
+  pkg-config --version >/dev/null 2>&1
+  if [ $? -eq 0 ]; then
+ARROW_DEPENDENCY_SOURCE="AUTO"
+  else
+ARROW_DEPENDENCY_SOURCE="BUNDLED"
+echo " Warning: $R_STATIC_DEPENDENCY_SOURCE set to 'AUTO' but 
pkg-config not detected"
+echo " Using bundled dependencies instead"
+  fi

Review comment:
   I think you want `ARROW_R_DEV=TRUE docker run -it apache/arrow-dev:r-rhub-ubuntu-gcc-release-latest -e ARROW_DEPENDENCY_SOURCE=AUTO /bin/bash` (get the right variable name, and pass it into the runtime environment with -e)








[GitHub] [arrow] zeroshade commented on a change in pull request #10716: ARROW-13330: [Go][Parquet] Add the rest of the Encoding package

2021-07-30 Thread GitBox


zeroshade commented on a change in pull request #10716:
URL: https://github.com/apache/arrow/pull/10716#discussion_r680028297



##
File path: go/parquet/internal/encoding/encoding_benchmarks_test.go
##
@@ -0,0 +1,461 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package encoding_test
+
+import (
+   "fmt"
+   "math"
+   "testing"
+
+   "github.com/apache/arrow/go/arrow"
+   "github.com/apache/arrow/go/arrow/array"
+   "github.com/apache/arrow/go/arrow/memory"
+   "github.com/apache/arrow/go/parquet"
+   "github.com/apache/arrow/go/parquet/internal/encoding"
+   "github.com/apache/arrow/go/parquet/internal/hashing"
+   "github.com/apache/arrow/go/parquet/internal/testutils"
+   "github.com/apache/arrow/go/parquet/schema"
+)
+
+const (
+   MINSIZE = 1024
+   MAXSIZE = 65536
+)
+
+func BenchmarkPlainEncodingBoolean(b *testing.B) {
+   for sz := MINSIZE; sz < MAXSIZE+1; sz *= 2 {

Review comment:
   nope. `b.N` is the number of iterations to run under the benchmarking 
timer, so this little loop creates a separate benchmark for each of the 
batch sizes.








[GitHub] [arrow] emkornfield commented on a change in pull request #10716: ARROW-13330: [Go][Parquet] Add the rest of the Encoding package

2021-07-30 Thread GitBox


emkornfield commented on a change in pull request #10716:
URL: https://github.com/apache/arrow/pull/10716#discussion_r680028272



##
File path: go/parquet/internal/encoding/levels.go
##
@@ -0,0 +1,284 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package encoding
+
+import (
+   "bytes"
+   "encoding/binary"
+   "io"
+   "math/bits"
+
+   "github.com/JohnCGriffin/overflow"
+   "github.com/apache/arrow/go/arrow/bitutil"
+   "github.com/apache/arrow/go/parquet"
+   format "github.com/apache/arrow/go/parquet/internal/gen-go/parquet"
+   "github.com/apache/arrow/go/parquet/internal/utils"
+)
+
+// LevelEncoder is for handling the encoding of Definition and Repetition 
levels
+// to parquet files.
+type LevelEncoder struct {
+   bitWidth int
+   rleLen   int
+   encoding format.Encoding
+   rle  *utils.RleEncoder
+   bit  *utils.BitWriter
+}
+
+// LevelEncodingMaxBufferSize estimates the max number of bytes needed to 
encode data with the
+// specified encoding given the max level and number of buffered values 
provided.
+func LevelEncodingMaxBufferSize(encoding parquet.Encoding, maxLvl int16, 
nbuffered int) int {
+   bitWidth := bits.Len64(uint64(maxLvl))
+   nbytes := 0
+   switch encoding {
+   case parquet.Encodings.RLE:
+   nbytes = utils.MaxBufferSize(bitWidth, nbuffered) + 
utils.MinBufferSize(bitWidth)
+   case parquet.Encodings.BitPacked:
+   nbytes = int(bitutil.BytesForBits(int64(nbuffered * bitWidth)))
+   default:
+   panic("parquet: unknown encoding type for levels")
+   }
+   return nbytes
+}
+
+// Reset resets the encoder allowing it to be reused and updating the maxlevel 
to the new
+// specified value.
+func (l *LevelEncoder) Reset(maxLvl int16) {
+   l.bitWidth = bits.Len64(uint64(maxLvl))
+   switch l.encoding {
+   case format.Encoding_RLE:
+   l.rle.Clear()
+   l.rle.BitWidth = l.bitWidth
+   case format.Encoding_BIT_PACKED:
+   l.bit.Clear()
+   default:
+   panic("parquet: unknown encoding type")
+   }
+}
+
+// Init is called to set up the desired encoding type, max level and 
underlying writer for a
+// level encoder to control where the resulting encoded buffer will end up.
+func (l *LevelEncoder) Init(encoding parquet.Encoding, maxLvl int16, w 
io.WriterAt) {
+   l.bitWidth = bits.Len64(uint64(maxLvl))
+   l.encoding = format.Encoding(encoding)
+   switch l.encoding {
+   case format.Encoding_RLE:
+   l.rle = utils.NewRleEncoder(w, l.bitWidth)
+   case format.Encoding_BIT_PACKED:
+   l.bit = utils.NewBitWriter(w)
+   default:
+   panic("parquet: unknown encoding type for levels")
+   }
+}
+
+// EncodeNoFlush encodes the provided levels in the encoder, but doesn't flush
+// the buffer and return it yet, appending these encoded values. Returns the 
number
+// of values encoded.
+func (l *LevelEncoder) EncodeNoFlush(lvls []int16) int {
+   nencoded := 0
+   if l.rle == nil && l.bit == nil {
+   panic("parquet: level encoders are not initialized")
+   }
+
+   switch l.encoding {
+   case format.Encoding_RLE:
+   for _, level := range lvls {
+   if !l.rle.Put(uint64(level)) {
+   break
+   }
+   nencoded++
+   }
+   default:
+   for _, level := range lvls {
+   if l.bit.WriteValue(uint64(level), uint(l.bitWidth)) != 
nil {
+   break
+   }
+   nencoded++
+   }
+   }
+   return nencoded
+}
+
+// Flush flushes out any encoded data to the underlying writer.
+func (l *LevelEncoder) Flush() {
+   if l.rle == nil && l.bit == nil {
+   panic("parquet: level encoders are not initialized")
+   }
+
+   switch l.encoding {
+   case format.Encoding_RLE:
+   l.rleLen = l.rle.Flush()
+   default:
+   l.bit.Flush(false)
+   }
+}
+
+// Encode enco

[GitHub] [arrow] nealrichardson commented on a change in pull request #10710: ARROW-11460: [R] Use system libraries if present on Linux

2021-07-30 Thread GitBox


nealrichardson commented on a change in pull request #10710:
URL: https://github.com/apache/arrow/pull/10710#discussion_r680028253



##
File path: r/inst/build_arrow_static.sh
##
@@ -45,6 +45,19 @@ else
   ARROW_DEFAULT_PARAM="OFF"
 fi
 
+if [ "$R_STATIC_DEPENDENCY_SOURCE" = "AUTO" ]; then
+  pkg-config --version >/dev/null 2>&1
+  if [ $? -eq 0 ]; then
+ARROW_DEPENDENCY_SOURCE="AUTO"
+  else
+ARROW_DEPENDENCY_SOURCE="BUNDLED"
+echo " Warning: $R_STATIC_DEPENDENCY_SOURCE set to 'AUTO' but 
pkg-config not detected"
+echo " Using bundled dependencies instead"
+  fi

Review comment:
   oh you also need `-v $(pwd):/arrow` or something like that, that's what 
mounts the /arrow directory based on your local dir








[GitHub] [arrow] zeroshade commented on a change in pull request #10716: ARROW-13330: [Go][Parquet] Add the rest of the Encoding package

2021-07-30 Thread GitBox


zeroshade commented on a change in pull request #10716:
URL: https://github.com/apache/arrow/pull/10716#discussion_r680029295



##
File path: go/parquet/internal/encoding/levels.go
##
@@ -0,0 +1,284 @@
+// EncodeNoFlush encodes the provided levels in the encoder, but doesn't flush
+// the buffer and return it yet, appending these encoded values. Returns the 
number

Review comment:
   it *should* always return `len(lvls)`; if it returns fewer, that means it 
encountered an error/issue while encoding. I'll add that to the comment.








[GitHub] [arrow] nealrichardson opened a new pull request #10841: ARROW-13511: [CI][R] Fail in the docker build step if R deps don't install

2021-07-30 Thread GitBox


nealrichardson opened a new pull request #10841:
URL: https://github.com/apache/arrow/pull/10841


   






[GitHub] [arrow] nealrichardson commented on a change in pull request #10841: ARROW-13511: [CI][R] Fail in the docker build step if R deps don't install

2021-07-30 Thread GitBox


nealrichardson commented on a change in pull request #10841:
URL: https://github.com/apache/arrow/pull/10841#discussion_r680032585



##
File path: ci/scripts/r_deps.sh
##
@@ -25,7 +25,12 @@ source_dir=${1}/r
 pushd ${source_dir}
 
 # Install R package dependencies
-${R_BIN} -e "install.packages('remotes'); remotes::install_cran(c('glue', 
'rcmdcheck', 'sys'))"
-${R_BIN} -e "remotes::install_deps(dependencies = TRUE)"
+# install.packages() emits warnings if packages fail to install,
+# but we want to error/fail the build.
+# options(warn=2) turns warnings into errors
+${R_BIN} -e "options(warn=2); install.packages('remotes'); 
remotes::install_cran(c('glue', 'rcmdcheck', 'sys')); remotes::install_deps()"
+# Separately innstall the optional/test dependencies but don't error on them,

Review comment:
   ```suggestion
   # Separately install the optional/test dependencies but don't error on them,
   ```








[GitHub] [arrow] github-actions[bot] commented on pull request #10841: ARROW-13511: [CI][R] Fail in the docker build step if R deps don't install

2021-07-30 Thread GitBox


github-actions[bot] commented on pull request #10841:
URL: https://github.com/apache/arrow/pull/10841#issuecomment-889985671


   https://issues.apache.org/jira/browse/ARROW-13511






[GitHub] [arrow] nealrichardson commented on pull request #10841: ARROW-13511: [CI][R] Fail in the docker build step if R deps don't install

2021-07-30 Thread GitBox


nealrichardson commented on pull request #10841:
URL: https://github.com/apache/arrow/pull/10841#issuecomment-889985861


   @github-actions crossbow submit -g r






[GitHub] [arrow] github-actions[bot] commented on pull request #10841: ARROW-13511: [CI][R] Fail in the docker build step if R deps don't install

2021-07-30 Thread GitBox


github-actions[bot] commented on pull request #10841:
URL: https://github.com/apache/arrow/pull/10841#issuecomment-889986428


   Revision: 8c6073efab365dde7b6a373c0a13019992a3023a
   
   Submitted crossbow builds: [ursacomputing/crossbow @ 
actions-679](https://github.com/ursacomputing/crossbow/branches/all?query=actions-679)
   
   |Task|Status|
   |----|------|
   |conda-linux-gcc-py36-cpu-r40|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-679-azure-conda-linux-gcc-py36-cpu-r40)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-679-azure-conda-linux-gcc-py36-cpu-r40)|
   |conda-linux-gcc-py37-cpu-r41|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-679-azure-conda-linux-gcc-py37-cpu-r41)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-679-azure-conda-linux-gcc-py37-cpu-r41)|
   |conda-osx-clang-py36-r40|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-679-azure-conda-osx-clang-py36-r40)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-679-azure-conda-osx-clang-py36-r40)|
   |conda-osx-clang-py37-r41|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-679-azure-conda-osx-clang-py37-r41)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-679-azure-conda-osx-clang-py37-r41)|
   |conda-win-vs2017-py36-r40|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-679-azure-conda-win-vs2017-py36-r40)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-679-azure-conda-win-vs2017-py36-r40)|
   |conda-win-vs2017-py37-r41|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-679-azure-conda-win-vs2017-py37-r41)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-679-azure-conda-win-vs2017-py37-r41)|
   |homebrew-r-autobrew|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-679-github-homebrew-r-autobrew)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-679-github-homebrew-r-autobrew)|
   |test-r-devdocs|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-679-github-test-r-devdocs)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-679-github-test-r-devdocs)|
   |test-r-gcc-11|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-679-github-test-r-gcc-11)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-679-github-test-r-gcc-11)|
   |test-r-install-local|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-679-github-test-r-install-local)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-679-github-test-r-install-local)|
   |test-r-linux-as-cran|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-679-github-test-r-linux-as-cran)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-679-github-test-r-linux-as-cran)|
   |test-r-linux-rchk|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-679-github-test-r-linux-rchk)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-679-github-test-r-linux-rchk)|
   |test-r-linux-valgrind|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-679-azure-test-r-linux-valgrind)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-679-azure-test-r-linux-valgrind)|
   |test-r-minimal-build|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-679-azure-test-r-minimal-build)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-679-azure-test-r-minimal-build)|
   |test-r-rhub-ubuntu-gcc-release-latest|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-679-azure-test-r-rhub-ubuntu-gcc-release-latest)](https://dev.azure.com/ursacomputing/crossbow/_build/latest?definitionId=1&branchName=actions-679-azure-test-r-rhub-ubuntu-gcc-release-latest)|
   |test-r-rocker-r-base-latest|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-679-azure-test-r-rocker-r-base-latest)](https:/

[GitHub] [arrow] emkornfield commented on a change in pull request #10716: ARROW-13330: [Go][Parquet] Add the rest of the Encoding package

2021-07-30 Thread GitBox


emkornfield commented on a change in pull request #10716:
URL: https://github.com/apache/arrow/pull/10716#discussion_r680034150



##
File path: go/parquet/internal/encoding/levels.go
##
@@ -0,0 +1,284 @@
+// EncodeNoFlush encodes the provided levels in the encoder, but doesn't flush
+// the buffer and return it yet, appending these encoded values. Returns the 
number

Review comment:
   so it is up to users to check that? should it propagate an error instead?








[GitHub] [arrow] nealrichardson commented on pull request #10840: ARROW-13496: [CI][R] Repair r-sanitizer job

2021-07-30 Thread GitBox


nealrichardson commented on pull request #10840:
URL: https://github.com/apache/arrow/pull/10840#issuecomment-889987032


   As best I can tell, it's not that the job is timing out, it's that the job 
is running without printing anything to stdout for too long (this is what 
Travis CI would do). This is during the libarrow build step, which (by 
default/on cran) we keep quiet.
   
   Maybe we can fix this by setting ARROW_R_DEV and then LIBARROW_MINIMAL to 
(re)disable features that NOT_CRAN enables (which I think ARROW_R_DEV enables), 
or some other combination of env var propagation to make the C++ build verbose 
but otherwise be CRAN-like. Or, we could try porting it from azure to GHA and 
see if that works around the issue.






[GitHub] [arrow] zeroshade commented on a change in pull request #10716: ARROW-13330: [Go][Parquet] Add the rest of the Encoding package

2021-07-30 Thread GitBox


zeroshade commented on a change in pull request #10716:
URL: https://github.com/apache/arrow/pull/10716#discussion_r680037198



##
File path: go/parquet/internal/encoding/levels.go
##
@@ -0,0 +1,284 @@
+// EncodeNoFlush encodes the provided levels in the encoder, but doesn't flush
+// the buffer and return it yet, appending these encoded values. Returns the 
number

Review comment:
   currently this is an internal-only package, so outside consumers can't call 
this, and the column writers check the return value and propagate an error if 
it doesn't match. Alternately, I could modify the underlying encoders to have 
Put return an error instead of just a bool and then propagate that. I believe 
it currently returns true/false for success/failure out of convenience; I 
never got around to having it return a proper error. 
   
   I'll take a look at how big such a change would be.








[GitHub] [arrow-datafusion] jhorstmann commented on pull request #793: Produce correct answers for Group BY NULL (Option 1)

2021-07-30 Thread GitBox


jhorstmann commented on pull request #793:
URL: https://github.com/apache/arrow-datafusion/pull/793#issuecomment-889992552


   Looks good. I was trying to come up with an example where two distinct keys 
would end up with the same encoding, but could not find any, because all 
variable-length types include the length prefix. :rocket: 
   






[GitHub] [arrow] diegodfrf opened a new pull request #10842: ARROW-13469: [C++] Suppress -Wmissing-field-initializers in DayMilliseconds arrow/type.h

2021-07-30 Thread GitBox


diegodfrf opened a new pull request #10842:
URL: https://github.com/apache/arrow/pull/10842


   Set initial values to DayMilliseconds with 0






[GitHub] [arrow] github-actions[bot] commented on pull request #10842: ARROW-13469: [C++] Suppress -Wmissing-field-initializers in DayMilliseconds arrow/type.h

2021-07-30 Thread GitBox


github-actions[bot] commented on pull request #10842:
URL: https://github.com/apache/arrow/pull/10842#issuecomment-89320


   https://issues.apache.org/jira/browse/ARROW-13469






[GitHub] [arrow] thisisnic commented on pull request #10841: ARROW-13511: [CI][R] Fail in the docker build step if R deps don't install

2021-07-30 Thread GitBox


thisisnic commented on pull request #10841:
URL: https://github.com/apache/arrow/pull/10841#issuecomment-890009339


   I think building the things in `Suggests` has led to the build rebuilding 
the markdown content, which then causes it to fail due to other dependencies in 
the vignettes.






[GitHub] [arrow] zeroshade commented on a change in pull request #10716: ARROW-13330: [Go][Parquet] Add the rest of the Encoding package

2021-07-30 Thread GitBox


zeroshade commented on a change in pull request #10716:
URL: https://github.com/apache/arrow/pull/10716#discussion_r680074956



##
File path: go/parquet/internal/encoding/levels.go
##
@@ -0,0 +1,284 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package encoding
+
+import (
+   "bytes"
+   "encoding/binary"
+   "io"
+   "math/bits"
+
+   "github.com/JohnCGriffin/overflow"
+   "github.com/apache/arrow/go/arrow/bitutil"
+   "github.com/apache/arrow/go/parquet"
+   format "github.com/apache/arrow/go/parquet/internal/gen-go/parquet"
+   "github.com/apache/arrow/go/parquet/internal/utils"
+)
+
+// LevelEncoder is for handling the encoding of Definition and Repetition levels
+// to parquet files.
+type LevelEncoder struct {
+   bitWidth int
+   rleLen   int
+   encoding format.Encoding
+   rle  *utils.RleEncoder
+   bit  *utils.BitWriter
+}
+
+// LevelEncodingMaxBufferSize estimates the max number of bytes needed to encode data with the
+// specified encoding given the max level and number of buffered values provided.
+func LevelEncodingMaxBufferSize(encoding parquet.Encoding, maxLvl int16, nbuffered int) int {
+   bitWidth := bits.Len64(uint64(maxLvl))
+   nbytes := 0
+   switch encoding {
+   case parquet.Encodings.RLE:
+   nbytes = utils.MaxBufferSize(bitWidth, nbuffered) + utils.MinBufferSize(bitWidth)
+   case parquet.Encodings.BitPacked:
+   nbytes = int(bitutil.BytesForBits(int64(nbuffered * bitWidth)))
+   default:
+   panic("parquet: unknown encoding type for levels")
+   }
+   return nbytes
+}
+
+// Reset resets the encoder allowing it to be reused and updating the maxlevel to the new
+// specified value.
+func (l *LevelEncoder) Reset(maxLvl int16) {
+   l.bitWidth = bits.Len64(uint64(maxLvl))
+   switch l.encoding {
+   case format.Encoding_RLE:
+   l.rle.Clear()
+   l.rle.BitWidth = l.bitWidth
+   case format.Encoding_BIT_PACKED:
+   l.bit.Clear()
+   default:
+   panic("parquet: unknown encoding type")
+   }
+}
+
+// Init is called to set up the desired encoding type, max level and underlying writer for a
+// level encoder to control where the resulting encoded buffer will end up.
+func (l *LevelEncoder) Init(encoding parquet.Encoding, maxLvl int16, w io.WriterAt) {
+   l.bitWidth = bits.Len64(uint64(maxLvl))
+   l.encoding = format.Encoding(encoding)
+   switch l.encoding {
+   case format.Encoding_RLE:
+   l.rle = utils.NewRleEncoder(w, l.bitWidth)
+   case format.Encoding_BIT_PACKED:
+   l.bit = utils.NewBitWriter(w)
+   default:
+   panic("parquet: unknown encoding type for levels")
+   }
+}
+
+// EncodeNoFlush encodes the provided levels in the encoder, but doesn't flush
+// the buffer and return it yet, appending these encoded values. Returns the number

Review comment:
   updated this change to add error propagation to all the necessary spots 
(and all the subsequent calls and dependencies) so that consumers no longer 
have to rely on checking the number of values returned but can see easily if an 
error was returned by the encoders.
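   The bit-width computation used throughout the hunk above, 
`bits.Len64(uint64(maxLvl))`, can be checked with a small standalone sketch 
(standard library only; the Arrow-internal `utils` buffer-size helpers are not 
reproduced here):

```go
package main

import (
	"fmt"
	"math/bits"
)

// bitWidthForMaxLevel mirrors bits.Len64(uint64(maxLvl)) from the diff:
// the minimum number of bits needed to represent the largest level value.
func bitWidthForMaxLevel(maxLvl int16) int {
	return bits.Len64(uint64(maxLvl))
}

func main() {
	fmt.Println(bitWidthForMaxLevel(1)) // 1: a flat nullable column
	fmt.Println(bitWidthForMaxLevel(3)) // 2
	fmt.Println(bitWidthForMaxLevel(7)) // 3
}
```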

##
File path: go/parquet/internal/encoding/levels.go
##
@@ -0,0 +1,284 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package encoding
+
+impo

[GitHub] [arrow] zeroshade commented on a change in pull request #10716: ARROW-13330: [Go][Parquet] Add the rest of the Encoding package

2021-07-30 Thread GitBox


zeroshade commented on a change in pull request #10716:
URL: https://github.com/apache/arrow/pull/10716#discussion_r680075490



##
File path: go/parquet/internal/encoding/levels.go
##
@@ -0,0 +1,284 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package encoding
+
+import (
+   "bytes"
+   "encoding/binary"
+   "io"
+   "math/bits"
+
+   "github.com/JohnCGriffin/overflow"
+   "github.com/apache/arrow/go/arrow/bitutil"
+   "github.com/apache/arrow/go/parquet"
+   format "github.com/apache/arrow/go/parquet/internal/gen-go/parquet"
+   "github.com/apache/arrow/go/parquet/internal/utils"
+)
+
+// LevelEncoder is for handling the encoding of Definition and Repetition levels
+// to parquet files.
+type LevelEncoder struct {
+   bitWidth int
+   rleLen   int
+   encoding format.Encoding
+   rle  *utils.RleEncoder
+   bit  *utils.BitWriter
+}
+
+// LevelEncodingMaxBufferSize estimates the max number of bytes needed to encode data with the
+// specified encoding given the max level and number of buffered values provided.
+func LevelEncodingMaxBufferSize(encoding parquet.Encoding, maxLvl int16, nbuffered int) int {
+   bitWidth := bits.Len64(uint64(maxLvl))
+   nbytes := 0
+   switch encoding {
+   case parquet.Encodings.RLE:
+   nbytes = utils.MaxBufferSize(bitWidth, nbuffered) + utils.MinBufferSize(bitWidth)
+   case parquet.Encodings.BitPacked:
+   nbytes = int(bitutil.BytesForBits(int64(nbuffered * bitWidth)))
+   default:
+   panic("parquet: unknown encoding type for levels")
+   }
+   return nbytes
+}
+
+// Reset resets the encoder allowing it to be reused and updating the maxlevel to the new
+// specified value.
+func (l *LevelEncoder) Reset(maxLvl int16) {
+   l.bitWidth = bits.Len64(uint64(maxLvl))
+   switch l.encoding {
+   case format.Encoding_RLE:
+   l.rle.Clear()
+   l.rle.BitWidth = l.bitWidth
+   case format.Encoding_BIT_PACKED:
+   l.bit.Clear()
+   default:
+   panic("parquet: unknown encoding type")
+   }
+}
+
+// Init is called to set up the desired encoding type, max level and underlying writer for a
+// level encoder to control where the resulting encoded buffer will end up.
+func (l *LevelEncoder) Init(encoding parquet.Encoding, maxLvl int16, w io.WriterAt) {
+   l.bitWidth = bits.Len64(uint64(maxLvl))
+   l.encoding = format.Encoding(encoding)
+   switch l.encoding {
+   case format.Encoding_RLE:
+   l.rle = utils.NewRleEncoder(w, l.bitWidth)
+   case format.Encoding_BIT_PACKED:
+   l.bit = utils.NewBitWriter(w)
+   default:
+   panic("parquet: unknown encoding type for levels")
+   }
+}
+
+// EncodeNoFlush encodes the provided levels in the encoder, but doesn't flush
+// the buffer and return it yet, appending these encoded values. Returns the number
+// of values encoded.
+func (l *LevelEncoder) EncodeNoFlush(lvls []int16) int {
+   nencoded := 0
+   if l.rle == nil && l.bit == nil {
+   panic("parquet: level encoders are not initialized")
+   }
+
+   switch l.encoding {
+   case format.Encoding_RLE:
+   for _, level := range lvls {
+   if !l.rle.Put(uint64(level)) {
+   break
+   }
+   nencoded++
+   }
+   default:
+   for _, level := range lvls {
+   if l.bit.WriteValue(uint64(level), uint(l.bitWidth)) != nil {
+   break
+   }
+   nencoded++
+   }
+   }
+   return nencoded
+}
+
+// Flush flushes out any encoded data to the underlying writer.
+func (l *LevelEncoder) Flush() {
+   if l.rle == nil && l.bit == nil {
+   panic("parquet: level encoders are not initialized")
+   }
+
+   switch l.encoding {
+   case format.Encoding_RLE:
+   l.rleLen = l.rle.Flush()
+   default:
+   l.bit.Flush(false)
+   }
+}
+
+// Encode encode

[GitHub] [arrow] thisisnic edited a comment on pull request #10841: ARROW-13511: [CI][R] Fail in the docker build step if R deps don't install

2021-07-30 Thread GitBox


thisisnic edited a comment on pull request #10841:
URL: https://github.com/apache/arrow/pull/10841#issuecomment-890009339


   I think building the things in `Suggests` has led to the build rebuilding 
the markdown content, which then causes `test-r-without-arrow` to fail due to 
other dependencies in the vignettes.






[GitHub] [arrow] nealrichardson commented on pull request #10841: ARROW-13511: [CI][R] Fail in the docker build step if R deps don't install

2021-07-30 Thread GitBox


nealrichardson commented on pull request #10841:
URL: https://github.com/apache/arrow/pull/10841#issuecomment-890013534


   > I think building the things in `Suggests` has led to the build rebuilding 
the markdown content, which then causes `test-r-without-arrow` to fail due to 
other dependencies in the vignettes.
   
   Should be unrelated to this change, right?






[GitHub] [arrow] thisisnic commented on pull request #10841: ARROW-13511: [CI][R] Fail in the docker build step if R deps don't install

2021-07-30 Thread GitBox


thisisnic commented on pull request #10841:
URL: https://github.com/apache/arrow/pull/10841#issuecomment-890014347


   > > I think building the things in `Suggests` has led to the build 
rebuilding the markdown content, which then causes `test-r-without-arrow` to 
fail due to other dependencies in the vignettes.
   > 
   > Should be unrelated to this change, right?
   
   D'oh, of course
   






[GitHub] [arrow] dongjoon-hyun commented on pull request #10838: ARROW-13506: [C++][Java] Upgrade ORC to 1.6.9

2021-07-30 Thread GitBox


dongjoon-hyun commented on pull request #10838:
URL: https://github.com/apache/arrow/pull/10838#issuecomment-890016415


   Thank you for your approval, @kiszk .






[GitHub] [arrow] nealrichardson commented on pull request #10833: ARROW-13493 [C++] Anonymous structs in an anonymous union are a GNU extension

2021-07-30 Thread GitBox


nealrichardson commented on pull request #10833:
URL: https://github.com/apache/arrow/pull/10833#issuecomment-890020555


   Looks like you fixed one warning but there's still another: 
https://github.com/ursacomputing/crossbow/runs/3203443397?check_suite_focus=true#step:7:192






[GitHub] [arrow] dongjoon-hyun commented on pull request #10838: ARROW-13506: [C++][Java] Upgrade ORC to 1.6.9

2021-07-30 Thread GitBox


dongjoon-hyun commented on pull request #10838:
URL: https://github.com/apache/arrow/pull/10838#issuecomment-890020627


   Thank you for your review and approval, @sunchao !






[GitHub] [arrow] kiszk commented on pull request #10838: ARROW-13506: [C++][Java] Upgrade ORC to 1.6.9

2021-07-30 Thread GitBox


kiszk commented on pull request #10838:
URL: https://github.com/apache/arrow/pull/10838#issuecomment-890023623


   cc @emkornfield and @pitrou 






[GitHub] [arrow] rommelDB opened a new pull request #10843: ARROW-4700: [C++] Added support for decimal128 and decimal256 json converted

2021-07-30 Thread GitBox


rommelDB opened a new pull request #10843:
URL: https://github.com/apache/arrow/pull/10843


   Changes:
   - Added support for converting decimal128 and decimal256 from JSON.
   - Added unit tests for converting decimal128 and decimal256. (Note: this 
does not provide inference support.)






[GitHub] [arrow] github-actions[bot] commented on pull request #10843: ARROW-4700: [C++] Added support for decimal128 and decimal256 json converted

2021-07-30 Thread GitBox


github-actions[bot] commented on pull request #10843:
URL: https://github.com/apache/arrow/pull/10843#issuecomment-890023863


   https://issues.apache.org/jira/browse/ARROW-4700






[GitHub] [arrow] rommelDB commented on pull request #10843: ARROW-4700: [C++] Added support for decimal128 and decimal256 json converted

2021-07-30 Thread GitBox


rommelDB commented on pull request #10843:
URL: https://github.com/apache/arrow/pull/10843#issuecomment-890024085


   Please take a look, @bkietz 






[GitHub] [arrow] emkornfield commented on pull request #10838: ARROW-13506: [C++][Java] Upgrade ORC to 1.6.9

2021-07-30 Thread GitBox


emkornfield commented on pull request #10838:
URL: https://github.com/apache/arrow/pull/10838#issuecomment-890026317


   As long as CI is passing LGTM.






[GitHub] [arrow] domoritz closed pull request #10825: MINOR: [JS] Correct main package description in readme

2021-07-30 Thread GitBox


domoritz closed pull request #10825:
URL: https://github.com/apache/arrow/pull/10825


   






[GitHub] [arrow] domoritz closed pull request #10826: ARROW-12636: [JS] ESM Tree-Shaking produces broken code

2021-07-30 Thread GitBox


domoritz closed pull request #10826:
URL: https://github.com/apache/arrow/pull/10826


   






[GitHub] [arrow] dongjoon-hyun commented on pull request #10838: ARROW-13506: [C++][Java] Upgrade ORC to 1.6.9

2021-07-30 Thread GitBox


dongjoon-hyun commented on pull request #10838:
URL: https://github.com/apache/arrow/pull/10838#issuecomment-890027866


   Thank you, @emkornfield . Yes, all CIs passed.






[GitHub] [arrow-datafusion] alamb commented on pull request #793: Produce correct answers for Group BY NULL (Option 1)

2021-07-30 Thread GitBox


alamb commented on pull request #793:
URL: https://github.com/apache/arrow-datafusion/pull/793#issuecomment-890031194


   Thanks for the reviews @Dandandan  and @jhorstmann ! I plan to wait another 
day or two to see if anyone else has feedback on this approach, but what I am 
thinking of doing is merging this PR (after addressing comments) so at least we 
get correct answers and then work on the more sophisticated implementation in 
parallel. 
   
   > Looks good. I was trying to come up with an example where two distinct 
keys would end up as the same encoding but could not find any, because all 
variable-length types include a length prefix. 🚀
   
   Thanks for double checking -- this also worried me a lot so I am glad to 
hear someone else did a double check too. 
   
   I convinced myself that since each key had entries for the same columns in 
the same order, there was no way to concoct the same bytes with different 
column values






[GitHub] [arrow] sunchao closed pull request #10838: ARROW-13506: [C++][Java] Upgrade ORC to 1.6.9

2021-07-30 Thread GitBox


sunchao closed pull request #10838:
URL: https://github.com/apache/arrow/pull/10838


   






[GitHub] [arrow] nealrichardson commented on pull request #10841: ARROW-13511: [CI][R] Fail in the docker build step if R deps don't install

2021-07-30 Thread GitBox


nealrichardson commented on pull request #10841:
URL: https://github.com/apache/arrow/pull/10841#issuecomment-890037738


   > > > I think building the things in `Suggests` has led to the build 
rebuilding the markdown content, which then causes `test-r-without-arrow` to 
fail due to other dependencies in the vignettes.
   > > 
   > > 
   > > Should be unrelated to this change, right?
   > 
   > D'oh, of course
   
   I had originally applied `warn=2` to all dependency installation, which 
would have caused that build you noticed to fail earlier, but it failed in a 
few other places (test-r-versions failed because duckdb isn't available on R < 
3.6, etc.) and it seemed not worth trying to track down all of the quirks of 
the optional packages in all of our different builds.






[GitHub] [arrow] sunchao commented on pull request #10838: ARROW-13506: [C++][Java] Upgrade ORC to 1.6.9

2021-07-30 Thread GitBox


sunchao commented on pull request #10838:
URL: https://github.com/apache/arrow/pull/10838#issuecomment-890039107


   Merged! @emkornfield could you add @dongjoon-hyun to the Arrow contributor 
list and assign the JIRA to him? thanks!






[GitHub] [arrow] ElenaHenderson commented on a change in pull request #10667: ARROW-11691: [Developer][CI] Provide a consolidated .env file for benchmark-relevant environment variables

2021-07-30 Thread GitBox


ElenaHenderson commented on a change in pull request #10667:
URL: https://github.com/apache/arrow/pull/10667#discussion_r680105559



##
File path: dev/conbench_envs/README.md
##
@@ -0,0 +1,30 @@
+
+## Benchmark Builds Env
+This directory contains 
+- [benchmarks.env](../../dev/conbench_envs/benchmarks.env) - list of env vars 
used for building Arrow C++/Python/R/Java/JavaScript and running benchmarks 
using [conbench](https://ursalabs.org/blog/announcing-conbench/)
+- [utils.sh](../../dev/conbench_envs/utils.sh) - utils used by benchmark 
builds for creating conda env with Arrow C++/Python/R/Java/JavaScript built 
from source

Review comment:
   @bkietz Done. Please let me know if the README.md makes sense to you 
regarding how hooks are used by the private repo and how they can be used by 
others. Will you please also approve the CI workflow? Thank you!








[GitHub] [arrow] ElenaHenderson commented on a change in pull request #10667: ARROW-11691: [Developer][CI] Provide a consolidated .env file for benchmark-relevant environment variables

2021-07-30 Thread GitBox


ElenaHenderson commented on a change in pull request #10667:
URL: https://github.com/apache/arrow/pull/10667#discussion_r680105559



##
File path: dev/conbench_envs/README.md
##
@@ -0,0 +1,30 @@
+
+## Benchmark Builds Env
+This directory contains 
+- [benchmarks.env](../../dev/conbench_envs/benchmarks.env) - list of env vars 
used for building Arrow C++/Python/R/Java/JavaScript and running benchmarks 
using [conbench](https://ursalabs.org/blog/announcing-conbench/)
+- [utils.sh](../../dev/conbench_envs/utils.sh) - utils used by benchmark 
builds for creating conda env with Arrow C++/Python/R/Java/JavaScript built 
from source

Review comment:
   @bkietz Done. Please let me know if README.md makes sense to you about 
why and how hooks are used by Ursa's private repo and how they can be used by 
others. Will you please also approve CI workflow? Thank you!








[GitHub] [arrow] dongjoon-hyun commented on pull request #10838: ARROW-13506: [C++][Java] Upgrade ORC to 1.6.9

2021-07-30 Thread GitBox


dongjoon-hyun commented on pull request #10838:
URL: https://github.com/apache/arrow/pull/10838#issuecomment-890044598


   Thank you so much, @sunchao , @kiszk , @emkornfield !






[GitHub] [arrow-datafusion] andygrove merged pull request #789: Implement streaming versions of Dataframe.collect methods

2021-07-30 Thread GitBox


andygrove merged pull request #789:
URL: https://github.com/apache/arrow-datafusion/pull/789


   






[GitHub] [arrow-datafusion] andygrove closed issue #47: DataFrame.collect() should return async stream rather than a Vec

2021-07-30 Thread GitBox


andygrove closed issue #47:
URL: https://github.com/apache/arrow-datafusion/issues/47


   






[GitHub] [arrow-datafusion] alamb commented on pull request #786: Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy

2021-07-30 Thread GitBox


alamb commented on pull request #786:
URL: https://github.com/apache/arrow-datafusion/pull/786#issuecomment-890050571


   This one seems like an incremental change towards #790  so I am going to 
merge this in and rebase 790 so the changes are clearer.






[GitHub] [arrow-datafusion] alamb merged pull request #786: Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy

2021-07-30 Thread GitBox


alamb merged pull request #786:
URL: https://github.com/apache/arrow-datafusion/pull/786


   






[GitHub] [arrow-datafusion] alamb closed issue #376: Combine GroupByScalar and ScalarValue

2021-07-30 Thread GitBox


alamb closed issue #376:
URL: https://github.com/apache/arrow-datafusion/issues/376


   






[GitHub] [arrow-datafusion] alamb edited a comment on pull request #786: Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy

2021-07-30 Thread GitBox


alamb edited a comment on pull request #786:
URL: https://github.com/apache/arrow-datafusion/pull/786#issuecomment-890050571


   This one seems like an incremental change towards #790 so I am going to 
merge this in and rebase 790 so the changes in that PR are smaller.






[GitHub] [arrow-datafusion] alamb commented on pull request #793: Produce correct answers for Group BY NULL (Option 1)

2021-07-30 Thread GitBox


alamb commented on pull request #793:
URL: https://github.com/apache/arrow-datafusion/pull/793#issuecomment-890051505


   Rebased given that https://github.com/apache/arrow-datafusion/pull/786 is 
merged, so that this PR just shows the delta now






[GitHub] [arrow] bkietz commented on a change in pull request #10667: ARROW-11691: [Developer][CI] Provide a consolidated .env file for benchmark-relevant environment variables

2021-07-30 Thread GitBox


bkietz commented on a change in pull request #10667:
URL: https://github.com/apache/arrow/pull/10667#discussion_r679945382



##
File path: dev/conbench_envs/README.md
##
@@ -0,0 +1,209 @@
+
+# Benchmark Builds Env and Hooks
+This directory contains: 
+- [benchmarks.env](benchmarks.env) - list of env vars used for building Arrow 
C++/Python/R/Java/JavaScript and running benchmarks using 
[conbench](https://ursalabs.org/blog/announcing-conbench/).
+- [hooks.sh](hooks.sh) - hooks used by @ursabot benchmark builds that 
are triggered by `@ursabot please benchmark` PR comments. 
+
+## How to add or update Arrow build and run env vars used by `@ursabot` 
benchmark builds
+1. Create `apache/arrow` PR
+2. Update or add env var value in 
[benchmarks.env](../../dev/conbench_envs/benchmarks.env)
+3. Add `@ursabot please benchmark` comment to PR
+4. Once benchmark builds are done, benchmark results can be viewed via 
compare/runs links in the PR comment where
+- baseline = PR base HEAD commit with unaltered 
`/dev/conbench_envs/benchmarks.env`
+- contender = PR branch HEAD commit with overridden 
`/dev/conbench_envs/benchmarks.env`
+
+## Why do `@ursabot` benchmark builds need `hooks.sh`?
+`@ursabot` benchmark builds are maintained in Ursa's private repo.
+Benchmark builds use `hooks.sh` functions as hooks to create conda env with 
Arrow dependencies and build Arrow C++/Python/R/Java/JavaScript from source for 
a specific Arrow repo's commit.
+
+Defining hooks in Arrow repo allows benchmark builds for a specific Arrow 
commit to be always compatible with Arrow's files/scripts used for installing 
Arrow dependencies and building Arrow, assuming Arrow contributors will update 
`hooks.sh` when they make these changes to files/scripts used by functions in 
`hooks.sh`. 

Review comment:
   ```suggestion
   Defining hooks in Arrow repo allows benchmark builds for a specific commit 
to be
   compatible with the files/scripts *in that commit* which are used for 
installing Arrow
   dependencies and building Arrow. This allows Arrow contributors to assess the 
performance
   implications of different build options, dependency versions, etc. by updating
   `hooks.sh`.
   ```








[GitHub] [arrow] rommelDB edited a comment on pull request #10843: ARROW-4700: [C++] Added support for decimal128 and decimal256 json converted

2021-07-30 Thread GitBox


rommelDB edited a comment on pull request #10843:
URL: https://github.com/apache/arrow/pull/10843#issuecomment-890024085


   Please, take a look at it @bkietz 






[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #793: Produce correct answers for Group BY NULL (Option 1)

2021-07-30 Thread GitBox


Dandandan commented on a change in pull request #793:
URL: https://github.com/apache/arrow-datafusion/pull/793#discussion_r680125826



##
File path: datafusion/src/physical_plan/hash_aggregate.rs
##
@@ -508,7 +511,9 @@ fn dictionary_create_key_for_col(
 }
 
 /// Appends a sequence of [u8] bytes for the value in `col[row]` to
-/// `vec` to be used as a key into the hash map
+/// `vec` to be used as a key into the hash map.
+///
+/// NOTE: This functon does not check col.is_valid(). Caller must do so

Review comment:
   ```suggestion
   /// NOTE: This function does not check col.is_valid(). Caller must do so
   ```








[GitHub] [arrow] felipeblazing commented on a change in pull request #10843: ARROW-4700: [C++] Added support for decimal128 and decimal256 json converted

2021-07-30 Thread GitBox


felipeblazing commented on a change in pull request #10843:
URL: https://github.com/apache/arrow/pull/10843#discussion_r680124458



##
File path: cpp/src/arrow/json/converter_test.cc
##
@@ -96,5 +96,27 @@ TEST(ConverterTest, Timestamp) {
   AssertConvert(timestamp(TimeUnit::SECOND), src, src);
 }
 
+TEST(ConverterTest, Decimal128) {
+  std::string src = R"([
+  "02.00",
+  "30.00",
+  "22.00",
+"-121.00",
+null])";
+
+  AssertConvert(decimal128(38, 10), src, src);

Review comment:
   How does this verify that the value was converted to the correct value 
if we aren't passing in a decimal128 with the correct value specified?








[GitHub] [arrow-datafusion] andygrove commented on issue #771: Release a new version of DataFusion to crates.io

2021-07-30 Thread GitBox


andygrove commented on issue #771:
URL: 
https://github.com/apache/arrow-datafusion/issues/771#issuecomment-890059076


   Ballista is now at a point where it supports the TPC-H queries with
   reasonable performance and I am now working on improving the developer
   documentation and the user guide. I am keen on getting an official release
   soon so that we can reserve the crate names. Let me know how I can help
   with the release work. I generally only have time at weekends though.
   
   On Wed, Jul 28, 2021 at 10:43 AM QP Hou ***@***.***> wrote:
   
   > Quick update on my end, I have pushed the 4.0.0 tag with a reference to
   > 31dd3cd
   > 

   > to help with change log generation. Next up, I will look into subproject
   > changelog automation using PR labels while waiting for more feedbacks on
   > the release process proposed in #771 (comment)
   > 

   > .
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > 
,
   > or unsubscribe
   > 

   > .
   >
   






[GitHub] [arrow] felipeblazing commented on a change in pull request #10843: ARROW-4700: [C++] Added support for decimal128 and decimal256 json converted

2021-07-30 Thread GitBox


felipeblazing commented on a change in pull request #10843:
URL: https://github.com/apache/arrow/pull/10843#discussion_r680127350



##
File path: cpp/src/arrow/json/converter_test.cc
##
@@ -96,5 +96,27 @@ TEST(ConverterTest, Timestamp) {
   AssertConvert(timestamp(TimeUnit::SECOND), src, src);
 }
 
+TEST(ConverterTest, Decimal128) {
+  std::string src = R"([
+  "02.00",
+  "30.00",
+  "22.00",
+"-121.00",
+null])";
+
+  AssertConvert(decimal128(38, 10), src, src);

Review comment:
   Namely that I would expect something more like instantiating decimals of 
the correct precision with the values you specified there and then verifying 
that they are indeed the same values in the same order.








[GitHub] [arrow] felipeblazing commented on a change in pull request #10843: ARROW-4700: [C++] Added support for decimal128 and decimal256 json converted

2021-07-30 Thread GitBox


felipeblazing commented on a change in pull request #10843:
URL: https://github.com/apache/arrow/pull/10843#discussion_r680128199



##
File path: cpp/src/arrow/json/converter.cc
##
@@ -147,6 +149,41 @@ class NumericConverter : public PrimitiveConverter {
   const T& numeric_type_;
 };
 
+template <typename T>
+class DecimalConverter : public PrimitiveConverter {
+ public:
+  using value_type = typename TypeTraits<T>::BuilderType::ValueType;
+
+  DecimalConverter(MemoryPool* pool, const std::shared_ptr<DataType>& type)
+  : PrimitiveConverter(pool, type) {}
+
+  Status Convert(const std::shared_ptr<Array>& in, std::shared_ptr<Array>* 
out) override {
+if (in->type_id() == Type::NA) {
+  return MakeArrayOfNull(out_type_, in->length(), pool_).Value(out);
+}
+const auto& dict_array = GetDictionaryArray(in);
+
+using Builder = typename TypeTraits<T>::BuilderType;
+Builder builder(out_type_, pool_);
+RETURN_NOT_OK(builder.Resize(dict_array.indices()->length()));
+
+auto visit_valid = [&](string_view repr) {
+  ARROW_ASSIGN_OR_RAISE(value_type value,
+
TypeTraits<T>::BuilderType::ValueType::FromString(repr));
+  builder.UnsafeAppend(value);
+  return Status::OK();
+};
+
+auto visit_null = [&]() {

Review comment:
   Do we really want to capture everything by reference here and instead 
just pass in what we know we need?








[GitHub] [arrow] felipeblazing commented on a change in pull request #10843: ARROW-4700: [C++] Added support for decimal128 and decimal256 json converted

2021-07-30 Thread GitBox


felipeblazing commented on a change in pull request #10843:
URL: https://github.com/apache/arrow/pull/10843#discussion_r680128769



##
File path: cpp/src/arrow/json/converter.cc
##
@@ -147,6 +149,41 @@ class NumericConverter : public PrimitiveConverter {
   const T& numeric_type_;
 };
 
+template <typename T>
+class DecimalConverter : public PrimitiveConverter {
+ public:
+  using value_type = typename TypeTraits<T>::BuilderType::ValueType;
+
+  DecimalConverter(MemoryPool* pool, const std::shared_ptr<DataType>& type)
+  : PrimitiveConverter(pool, type) {}
+
+  Status Convert(const std::shared_ptr<Array>& in, std::shared_ptr<Array>* 
out) override {
+if (in->type_id() == Type::NA) {
+  return MakeArrayOfNull(out_type_, in->length(), pool_).Value(out);
+}
+const auto& dict_array = GetDictionaryArray(in);
+
+using Builder = typename TypeTraits<T>::BuilderType;
+Builder builder(out_type_, pool_);
+RETURN_NOT_OK(builder.Resize(dict_array.indices()->length()));
+
+auto visit_valid = [&](string_view repr) {

Review comment:
   why capture everything here as well?








[GitHub] [arrow] ElenaHenderson commented on a change in pull request #10667: ARROW-11691: [Developer][CI] Provide a consolidated .env file for benchmark-relevant environment variables

2021-07-30 Thread GitBox


ElenaHenderson commented on a change in pull request #10667:
URL: https://github.com/apache/arrow/pull/10667#discussion_r680129299



##
File path: dev/conbench_envs/README.md
##
@@ -0,0 +1,30 @@
+
+## Benchmark Builds Env
+This directory contains 
+- [benchmarks.env](../../dev/conbench_envs/benchmarks.env) - list of env vars 
used for building Arrow C++/Python/R/Java/JavaScript and running benchmarks 
using [conbench](https://ursalabs.org/blog/announcing-conbench/)
+- [utils.sh](../../dev/conbench_envs/utils.sh) - utils used by benchmark 
builds for creating conda env with Arrow C++/Python/R/Java/JavaScript built 
from source

Review comment:
   @bkietz Thank you so much for the review.
   
   I also added this comment. I hope it is ok.
   `- benchmark builds are run using `@ursabot please benchmark` PR comment to 
confirm that function definition updates do not break benchmark builds.`. 
   
   
https://github.com/apache/arrow/pull/10667/files#diff-04649ad45f96fd68437c83c30ad9dd25df8f6cf865d2a799b860eb7a159e11fcR47








[GitHub] [arrow] rommelDB commented on a change in pull request #10843: ARROW-4700: [C++] Added support for decimal128 and decimal256 json converted

2021-07-30 Thread GitBox


rommelDB commented on a change in pull request #10843:
URL: https://github.com/apache/arrow/pull/10843#discussion_r680131459



##
File path: cpp/src/arrow/json/converter_test.cc
##
@@ -96,5 +96,27 @@ TEST(ConverterTest, Timestamp) {
   AssertConvert(timestamp(TimeUnit::SECOND), src, src);
 }
 
+TEST(ConverterTest, Decimal128) {
+  std::string src = R"([
+  "02.00",
+  "30.00",
+  "22.00",
+"-121.00",
+null])";
+
+  AssertConvert(decimal128(38, 10), src, src);

Review comment:
   The test is confusing because we use the same string representation to 
produce the expected one. I think that rewriting this test should clear up 
such confusion.








[GitHub] [arrow] rommelDB commented on a change in pull request #10843: ARROW-4700: [C++] Added support for decimal128 and decimal256 json converted

2021-07-30 Thread GitBox


rommelDB commented on a change in pull request #10843:
URL: https://github.com/apache/arrow/pull/10843#discussion_r680139090



##
File path: cpp/src/arrow/json/converter.cc
##
@@ -147,6 +149,41 @@ class NumericConverter : public PrimitiveConverter {
   const T& numeric_type_;
 };
 
+template <typename T>
+class DecimalConverter : public PrimitiveConverter {
+ public:
+  using value_type = typename TypeTraits<T>::BuilderType::ValueType;
+
+  DecimalConverter(MemoryPool* pool, const std::shared_ptr<DataType>& type)
+  : PrimitiveConverter(pool, type) {}
+
+  Status Convert(const std::shared_ptr<Array>& in, std::shared_ptr<Array>* 
out) override {
+if (in->type_id() == Type::NA) {
+  return MakeArrayOfNull(out_type_, in->length(), pool_).Value(out);
+}
+const auto& dict_array = GetDictionaryArray(in);
+
+using Builder = typename TypeTraits<T>::BuilderType;
+Builder builder(out_type_, pool_);
+RETURN_NOT_OK(builder.Resize(dict_array.indices()->length()));
+
+auto visit_valid = [&](string_view repr) {
+  ARROW_ASSIGN_OR_RAISE(value_type value,
+
TypeTraits<T>::BuilderType::ValueType::FromString(repr));
+  builder.UnsafeAppend(value);
+  return Status::OK();
+};
+
+auto visit_null = [&]() {

Review comment:
   Makes sense. There we are not capturing a lot of variables, so it would be 
clearer if I just capture the builder by reference.








[GitHub] [arrow] rommelDB commented on a change in pull request #10843: ARROW-4700: [C++] Added support for decimal128 and decimal256 json converted

2021-07-30 Thread GitBox


rommelDB commented on a change in pull request #10843:
URL: https://github.com/apache/arrow/pull/10843#discussion_r680140494



##
File path: cpp/src/arrow/json/converter.cc
##
@@ -147,6 +149,41 @@ class NumericConverter : public PrimitiveConverter {
   const T& numeric_type_;
 };
 
+template <typename T>
+class DecimalConverter : public PrimitiveConverter {
+ public:
+  using value_type = typename TypeTraits<T>::BuilderType::ValueType;
+
+  DecimalConverter(MemoryPool* pool, const std::shared_ptr<DataType>& type)
+  : PrimitiveConverter(pool, type) {}
+
+  Status Convert(const std::shared_ptr<Array>& in, std::shared_ptr<Array>* 
out) override {
+if (in->type_id() == Type::NA) {
+  return MakeArrayOfNull(out_type_, in->length(), pool_).Value(out);
+}
+const auto& dict_array = GetDictionaryArray(in);
+
+using Builder = typename TypeTraits<T>::BuilderType;
+Builder builder(out_type_, pool_);
+RETURN_NOT_OK(builder.Resize(dict_array.indices()->length()));
+
+auto visit_valid = [&](string_view repr) {

Review comment:
   Sure, as the other visitor I'll make that explicit. Thanks.








[GitHub] [arrow] bkietz commented on pull request #10722: ARROW-13344: [R] Initial bindings for ExecPlan/ExecNode

2021-07-30 Thread GitBox


bkietz commented on pull request #10722:
URL: https://github.com/apache/arrow/pull/10722#issuecomment-890076863


   > -MakeScalarAggregateNode and MakeGroupByNode have quite different 
function signatures, which makes working with the API confusing
   > -GroupBy doesn't let you specify the names of the output columns
   
   ARROW-13482 (#10793) unifies the factories for scalar and grouped 
aggregation. I'll ensure that PR also provides configurable output column names 
for both
   
   > -Grouped aggregation functions all have to be invoked with a hash_ 
prefix to the name, which seems unnecessary because you can't call a 
non-hash-aggregation function in GroupBy and you can't call a hash_ function in 
ScalarAggregate
   
   ARROW-13451 would allow grouping of grouped and scalar aggregation kernels 
under the same function (rather than requiring distinct `sum` and `hash_sum` 
functions), which I think would address these
   






[GitHub] [arrow] nirandaperera commented on pull request #10833: ARROW-13493 [C++] Anonymous structs in an anonymous union are a GNU extension

2021-07-30 Thread GitBox


nirandaperera commented on pull request #10833:
URL: https://github.com/apache/arrow/pull/10833#issuecomment-890089863


   @github-actions crossbow submit test-r-linux-as-cran






[GitHub] [arrow] github-actions[bot] commented on pull request #10833: ARROW-13493 [C++] Anonymous structs in an anonymous union are a GNU extension

2021-07-30 Thread GitBox


github-actions[bot] commented on pull request #10833:
URL: https://github.com/apache/arrow/pull/10833#issuecomment-890090342


   Revision: 85fc16174a3acfecf24bd505aad5bbda69833111
   
   Submitted crossbow builds: [ursacomputing/crossbow @ 
actions-680](https://github.com/ursacomputing/crossbow/branches/all?query=actions-680)
   
   |Task|Status|
   ||--|
   |test-r-linux-as-cran|[![Github 
Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-680-github-test-r-linux-as-cran)](https://github.com/ursacomputing/crossbow/actions?query=branch:actions-680-github-test-r-linux-as-cran)|






[GitHub] [arrow-rs] alamb commented on issue #640: Release candidate verifying script seems work on macOS

2021-07-30 Thread GitBox


alamb commented on issue #640:
URL: https://github.com/apache/arrow-rs/issues/640#issuecomment-890099715


   @waynexia I also ran the verification script on macOS without issue; I think 
the comment is outdated.






[GitHub] [arrow] rommelDB commented on a change in pull request #10843: ARROW-4700: [C++] Added support for decimal128 and decimal256 json converted

2021-07-30 Thread GitBox


rommelDB commented on a change in pull request #10843:
URL: https://github.com/apache/arrow/pull/10843#discussion_r680131459



##
File path: cpp/src/arrow/json/converter_test.cc
##
@@ -96,5 +96,27 @@ TEST(ConverterTest, Timestamp) {
   AssertConvert(timestamp(TimeUnit::SECOND), src, src);
 }
 
+TEST(ConverterTest, Decimal128) {
+  std::string src = R"([
+  "02.00",
+  "30.00",
+  "22.00",
+"-121.00",
+null])";
+
+  AssertConvert(decimal128(38, 10), src, src);

Review comment:
   The test is confusing because we use the same string representation to 
produce the expected one. I think that rewriting this test should clear up 
such confusion.








[GitHub] [arrow] pi-guy-in-the-sky opened a new issue #10844: Getting an `std::istream` from an `arrow::io::InputStream`

2021-07-30 Thread GitBox


pi-guy-in-the-sky opened a new issue #10844:
URL: https://github.com/apache/arrow/issues/10844


   I'm currently working on a tool in C++ which needs to fetch data from an S3 
bucket. Calling `arrow::fs::S3FileSystem::OpenInputStream` allows me to get an 
`arrow::io::InputStream` that can read from the file. However, an external 
library in my application needs to read from the file through an 
`std::istream`. Is there any way I can get an `std::istream` from an 
`arrow::io::InputStream`?






[GitHub] [arrow-rs] alamb commented on issue #640: Release candidate verifying script seems work on macOS

2021-07-30 Thread GitBox


alamb commented on issue #640:
URL: https://github.com/apache/arrow-rs/issues/640#issuecomment-890101254


   FWIW the question label already exists (I added it to this issue). It isn't 
clear to me if everyone can add labels to issues in arrow-rs. Does it work for 
you @waynexia ?
   
   Perhaps we could add a template for a "question" to make it easier for 
people to file such questions: 
https://github.com/apache/arrow-rs/tree/master/.github/ISSUE_TEMPLATE
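
   A minimal template for that directory could look like the following sketch 
(the file name, labels, and wording are assumptions, not the actual template):

   ```markdown
   ---
   name: Question
   about: Ask a question about arrow-rs usage or behavior
   labels: question
   ---

   **Your question**
   A clear and concise description of what you want to know.

   **Context**
   What are you trying to do? Include code or data samples if relevant.
   ```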
   
   






[GitHub] [arrow-rs] alamb opened a new issue #641: Incorrect min/max statistics for strings in parquet files

2021-07-30 Thread GitBox


alamb opened a new issue #641:
URL: https://github.com/apache/arrow-rs/issues/641


   **Describe the bug**
   The statistics written by the arrow / parquet writer for String columns seem 
to be incorrect. 
   
   **To Reproduce**
   Run this code:
   
   ```rust
   use std::fs::File;
   use std::sync::Arc;

   use arrow::array::StringArray;
   use arrow::record_batch::RecordBatch;
   use parquet::arrow::ArrowWriter;

   fn main() {
   let input = vec![
   Some("andover"),
   Some("reading"),
   Some("bedford"),
   Some("tewsbury"),
   Some("lexington"),
   Some("lawrence"),
   ];
   
   let input: StringArray = input.into_iter().collect();
   println!("Staring to test with array {:?}", input);
   
   let record_batch = RecordBatch::try_from_iter(vec![
   ("city", Arc::new(input) as _)
   ]).unwrap();
   
   println!("Opening output file /tmp/test.parquet");
   let out_file = File::create("/tmp/test.parquet").unwrap();
   
   println!("Creating writer...");
   let mut writer = ArrowWriter::try_new(out_file, record_batch.schema(), 
None)
   .expect("creating writer");
   
   println!("writing...");
   writer.write(&record_batch).expect("writing");
   
   println!("closing...");
   writer.close().expect("closing");
   
   println!("done...");
   }
   ```
   
   Then examine the resulting parquet file and note the min/max values for the 
"city" column are:
   ```
   min: "andover"
   max: "lexington"
   ```
   
   ```shell
   alamb@MacBook-Pro rust_parquet % parquet-tools dump  /tmp/test.parquet 
   parquet-tools dump  /tmp/test.parquet 
   row group 0 
   
--
   city:  BINARY UNCOMPRESSED DO:4 FPO:90 SZ:130/130/1.00 VC:6 
ENC:RLE_DICTIONARY,PLAIN,RLE ST:[min: andover, max: lexi [more]...
   
   city TV=6 RL=0 DL=0 DS: 6 DE:PLAIN
   
--
   page 0:  DLE:RLE RLE:RLE VLE:RLE_DICTIONARY ST:[min: 
andover, max: lexington, num_nulls not defined] [more]... VC:6
   
   BINARY city 
   
--
   *** row group 1 of 1, values 1 to 6 *** 
   value 1: R:0 D:0 V:andover
   value 2: R:0 D:0 V:reading
   value 3: R:0 D:0 V:bedford
   value 4: R:0 D:0 V:tewsbury
   value 5: R:0 D:0 V:lexington
   value 6: R:0 D:0 V:lawrence
   ```
   
   **Expected behavior**
   The parquet file produced has min/max statistics for the city column:
   ```
   min: "andover"
   max: "tewsbury"
   ```
   
   As 't' follows 'l'
   
   **Additional context**
   
   Since DataFusion now uses these statistics for pruning out row groups, this 
leads to incorrect results in DataFusion.  I found this when investigating 
https://github.com/influxdata/influxdb_iox/issues/2153
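   
   As a sanity check of the expected ordering, plain lexicographic comparison 
of the same values (which the string column statistics should agree with) can 
be verified standalone:
   
   ```rust
   fn main() {
       let cities = [
           "andover", "reading", "bedford", "tewsbury", "lexington", "lawrence",
       ];
       // &str comparison is lexicographic over bytes, which is the ordering
       // expected for UTF-8 string min/max statistics
       let min = cities.iter().min().unwrap();
       let max = cities.iter().max().unwrap();
       assert_eq!(*min, "andover");
       assert_eq!(*max, "tewsbury"); // not "lexington"
   }
   ```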






[GitHub] [arrow-rs] Dandandan merged pull request #637: Speed up filter_record_batch with one array

2021-07-30 Thread GitBox


Dandandan merged pull request #637:
URL: https://github.com/apache/arrow-rs/pull/637


   






[GitHub] [arrow-rs] Dandandan closed issue #636: Avoid materialization of indices in filter_record_batch for single arrays

2021-07-30 Thread GitBox


Dandandan closed issue #636:
URL: https://github.com/apache/arrow-rs/issues/636


   






[GitHub] [arrow-datafusion] alamb commented on a change in pull request #793: Produce correct answers for Group BY NULL (Option 1)

2021-07-30 Thread GitBox


alamb commented on a change in pull request #793:
URL: https://github.com/apache/arrow-datafusion/pull/793#discussion_r680188332



##
File path: datafusion/src/physical_plan/hash_aggregate.rs
##
@@ -644,14 +645,63 @@ fn create_key_for_col(col: &ArrayRef, row: usize, vec: 
&mut Vec<u8>) -> Result<()>
 }
 
 /// Create a key `Vec<u8>` that is used as key for the hashmap
+///
+/// This looks like
+/// [null_byte][col_value_bytes][null_byte][col_value_bytes]
+///
+/// Note that relatively uncommon patterns (e.g. not 0x00) are chosen
+/// for the null_byte to make debugging easier. The actual values are
+/// arbitrary.
+///
+/// For a NULL value in a column, the key looks like
+/// [0xFE]
+///
+/// For a Non-NULL value in a column, this looks like:
+/// [0xFF][byte representation of column value]
+///
+/// Example of a key with no NULL values:
+/// ```text
+///0xFF byte at the start of each column
+///   signifies the value is non-null
+///  │
+///
+///  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ─ ┐
+///
+///  │string len │  0x1234
+/// {▼   (as usize le)  "foo"▼(as u16 le)
+///   k1: "foo"╔ ═┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──╦ ═┌──┬──┐
+///   k2: 0x1234u16 FF║03│00│00│00│00│00│00│00│"f│"o│"o│FF║34│12│
+/// }  ╚ ═└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──╩ ═└──┴──┘
+/// 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14
+/// ```
+///
+///  Example of a key with NULL values:
+///
+///```text
+/// 0xFE byte at the start of k1 column
+/// ┌ ─ signifies the value is NULL
+///
+/// └ ┐
+///  0x1234
+/// { ▼(as u16 le)
+///   k1: NULL  ╔ ═╔ ═┌──┬──┐
+///   k2: 0x1234u16  FE║FF║12│34│
+/// }   ╚ ═╚ ═└──┴──┘
+///   0  1  2  3
+///```
 pub(crate) fn create_key(
 group_by_keys: &[ArrayRef],
 row: usize,
    vec: &mut Vec<u8>,
 ) -> Result<()> {
 vec.clear();
 for col in group_by_keys {
-create_key_for_col(col, row, vec)?
+if !col.is_valid(row) {

Review comment:
   Thank you for the suggestion.
   
   If you don't mind I would like to spend time on 
https://github.com/apache/arrow-datafusion/issues/790 which, if successful, I 
expect to significantly remove all this code.
   
   I will attempt to add that optimization at a later date.
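   
   The null-prefixed key layout described in the doc comment above can be 
sketched as a standalone function (a hypothetical simplification for a single 
`u16` column; the real implementation handles all column types):
   
   ```rust
   // Sketch of the key encoding described above:
   // 0xFF prefixes a non-null value's little-endian bytes, 0xFE encodes NULL.
   fn append_key_u16(value: Option<u16>, key: &mut Vec<u8>) {
       match value {
           Some(v) => {
               key.push(0xFF); // non-null marker
               key.extend_from_slice(&v.to_le_bytes());
           }
           None => key.push(0xFE), // NULL marker, no value bytes follow
       }
   }
   
   fn main() {
       let mut key = Vec::new();
       append_key_u16(None, &mut key);         // k1: NULL
       append_key_u16(Some(0x1234), &mut key); // k2: 0x1234u16
       assert_eq!(key, vec![0xFE, 0xFF, 0x34, 0x12]);
   }
   ```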
   








[GitHub] [arrow-datafusion] alamb commented on a change in pull request #793: Produce correct answers for Group BY NULL (Option 1)

2021-07-30 Thread GitBox


alamb commented on a change in pull request #793:
URL: https://github.com/apache/arrow-datafusion/pull/793#discussion_r680188812



##
File path: datafusion/src/scalar.rs
##
@@ -1347,6 +1545,29 @@ mod tests {
 "{}", result);
 }
 
+#[test]
+fn scalar_try_from_array_null() {
+let array = vec![Some(33), None].into_iter().collect::<Int64Array>();
+let array: ArrayRef = Arc::new(array);
+
+assert_eq!(
+ScalarValue::Int64(Some(33)),
+ScalarValue::try_from_array(&array, 0).unwrap()
+);
+assert_eq!(
+ScalarValue::Int64(None),
+ScalarValue::try_from_array(&array, 1).unwrap()
+);
+}
+
+#[test]
+fn scalar_try_from_dict_datatype() {
+let data_type =
+DataType::Dictionary(Box::new(DataType::Int8), 
Box::new(DataType::Utf8));
+let data_type = &data_type;

Review comment:
   Amusingly, supporting this behavior ended up causing a test to fail when 
I brought the code into IOx and I think I traced the problem to an issue in 
parquet file statistics: https://github.com/apache/arrow-rs/issues/641 🤣  this 
was not a side effect I had anticipated
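   The behavior the first test above exercises can be sketched in plain Python 
(a hypothetical stand-in for `ScalarValue::try_from_array`: a null slot yields 
a typed "none" scalar rather than an error):

```python
def scalar_from_array(values, data_type, index):
    """Return a (type, payload) pair for one array slot; a None
    payload models ScalarValue::Int64(None) for a null entry."""
    return (data_type, values[index])

# A valid slot and a null slot both produce a scalar of the array's type.
array = [33, None]
valid_scalar = scalar_from_array(array, "Int64", 0)
null_scalar = scalar_from_array(array, "Int64", 1)
```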








[GitHub] [arrow] nirandaperera commented on pull request #10833: ARROW-13493 [C++] Anonymous structs in an anonymous union are a GNU extension

2021-07-30 Thread GitBox


nirandaperera commented on pull request #10833:
URL: https://github.com/apache/arrow/pull/10833#issuecomment-890123629


   @nealrichardson I think this is fine now! :-) 






[GitHub] [arrow] nirandaperera opened a new pull request #10845: ARROW-13268 [C++][Compute] Add ExecNode for semi and anti-semi join

2021-07-30 Thread GitBox


nirandaperera opened a new pull request #10845:
URL: https://github.com/apache/arrow/pull/10845


   Adding prelim impl of semi joins






[GitHub] [arrow] github-actions[bot] commented on pull request #10845: ARROW-13268 [C++][Compute] Add ExecNode for semi and anti-semi join

2021-07-30 Thread GitBox


github-actions[bot] commented on pull request #10845:
URL: https://github.com/apache/arrow/pull/10845#issuecomment-890126576


   https://issues.apache.org/jira/browse/ARROW-13268






[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #628: Add a note about arrow crate security / safety

2021-07-30 Thread GitBox


codecov-commenter edited a comment on pull request #628:
URL: https://github.com/apache/arrow-rs/pull/628#issuecomment-888530892


   # 
[Codecov](https://codecov.io/gh/apache/arrow-rs/pull/628?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#628](https://codecov.io/gh/apache/arrow-rs/pull/628?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (c69fda5) into 
[master](https://codecov.io/gh/apache/arrow-rs/commit/be98b1748020e92dca3cbb1dc47e364296675a6a?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (be98b17) will **decrease** coverage by `0.00%`.
   > The diff coverage is `81.81%`.
   
   > :exclamation: Current head c69fda5 differs from pull request most recent 
head 7c4298f. Consider uploading reports for the commit 7c4298f to get more 
accurate results
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/arrow-rs/pull/628/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/628?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@Coverage Diff @@
   ##   master #628  +/-   ##
   ==
   - Coverage   82.48%   82.47%   -0.01% 
   ==
 Files 167  167  
 Lines   4645246454   +2 
   ==
   - Hits3831538314   -1 
   - Misses   8137 8140   +3 
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/arrow-rs/pull/628?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 | Coverage Δ | |
   |---|---|---|
   | 
[arrow/src/array/array\_binary.rs](https://codecov.io/gh/apache/arrow-rs/pull/628/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X2JpbmFyeS5ycw==)
 | `92.23% <ø> (ø)` | |
   | 
[arrow/src/array/array\_list.rs](https://codecov.io/gh/apache/arrow-rs/pull/628/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X2xpc3QucnM=)
 | `95.51% <ø> (ø)` | |
   | 
[arrow/src/array/array\_string.rs](https://codecov.io/gh/apache/arrow-rs/pull/628/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X3N0cmluZy5ycw==)
 | `97.85% <ø> (ø)` | |
   | 
[arrow/src/compute/kernels/filter.rs](https://codecov.io/gh/apache/arrow-rs/pull/628/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9maWx0ZXIucnM=)
 | `90.82% <0.00%> (-1.17%)` | :arrow_down: |
   | 
[...ng/src/flight\_client\_scenarios/integration\_test.rs](https://codecov.io/gh/apache/arrow-rs/pull/628/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aW50ZWdyYXRpb24tdGVzdGluZy9zcmMvZmxpZ2h0X2NsaWVudF9zY2VuYXJpb3MvaW50ZWdyYXRpb25fdGVzdC5ycw==)
 | `0.00% <0.00%> (ø)` | |
   | 
[...ng/src/flight\_server\_scenarios/integration\_test.rs](https://codecov.io/gh/apache/arrow-rs/pull/628/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aW50ZWdyYXRpb24tdGVzdGluZy9zcmMvZmxpZ2h0X3NlcnZlcl9zY2VuYXJpb3MvaW50ZWdyYXRpb25fdGVzdC5ycw==)
 | `0.00% <ø> (ø)` | |
   | 
[integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow-rs/pull/628/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aW50ZWdyYXRpb24tdGVzdGluZy9zcmMvbGliLnJz)
 | `0.00% <0.00%> (ø)` | |
   | 
[parquet/benches/arrow\_writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/628/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9iZW5jaGVzL2Fycm93X3dyaXRlci5ycw==)
 | `0.00% <0.00%> (ø)` | |
   | 
[parquet/src/record/triplet.rs](https://codecov.io/gh/apache/arrow-rs/pull/628/diff?src=pr&el=tree&utm_medium=referral&utm_source=githu

[GitHub] [arrow-datafusion] alamb commented on a change in pull request #792: Implement basic common subexpression eliminate optimization

2021-07-30 Thread GitBox


alamb commented on a change in pull request #792:
URL: https://github.com/apache/arrow-datafusion/pull/792#discussion_r680194021



##
File path: datafusion/src/execution/context.rs
##
@@ -691,6 +693,7 @@ impl Default for ExecutionConfig {
 Arc::new(SimplifyExpressions::new()),
 Arc::new(HashBuildProbeOrder::new()),
 Arc::new(LimitPushDown::new()),
+// Arc::new(CommonSubexprEliminate::new()),

Review comment:
   I suggest running this CSE immediately after `ConstantFolding` as it may 
simplify expressions in a way that helps other passes (e.g. make join 
conditions clearer, or simplify filter expressions so they can be pushed down)

##
File path: datafusion/src/optimizer/common_subexpr_eliminate.rs
##
@@ -0,0 +1,517 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Eliminate common sub-expression.
+
+use std::collections::{HashMap, HashSet};
+use std::sync::Arc;
+
+use arrow::datatypes::DataType;
+
+use crate::error::Result;
+use crate::execution::context::ExecutionProps;
+use crate::logical_plan::{
+col, DFField, DFSchema, Expr, ExprRewriter, ExpressionVisitor, LogicalPlan,
+Recursion, RewriteRecursion,
+};
+use crate::optimizer::optimizer::OptimizerRule;
+
+/// A map from expression's identifier to tuple including
+/// - the expression itself (cloned)
+/// - counter
+/// - DataType of this expression.
+type ExprSet = HashMap<Identifier, (Expr, usize, DataType)>;
+
+/// Identifier type. The current implementation uses the string description of
+/// an expression (type String) as the identifier.
+///
+/// An identifier should (ideally) be hashable, accumulable, comparable for
+/// equality, and have as few collisions as possible.
+///
+/// Since an identifier is likely to be copied many times, it is better for an
+/// identifier to be small or `Copy`; otherwise some kind of reference counting
+/// is needed. A String description is not such a good choice here.
+type Identifier = String;
+/// Perform Common Sub-expression Elimination optimization.
+///
+/// Currently only common sub-expressions within one logical plan will
+/// be eliminated.
+pub struct CommonSubexprEliminate {}
+
+impl OptimizerRule for CommonSubexprEliminate {
+fn optimize(
+&self,
+plan: &LogicalPlan,
+execution_props: &ExecutionProps,
+) -> Result<LogicalPlan> {
+optimize(plan, execution_props)
+}
+
+fn name(&self) -> &str {
+"common_sub_expression_eliminate"
+}
+}
+
+impl CommonSubexprEliminate {
+#[allow(missing_docs)]
+pub fn new() -> Self {
+Self {}
+}
+}
+
+fn optimize(plan: &LogicalPlan, execution_props: &ExecutionProps) -> 
Result<LogicalPlan> {
+let mut expr_set = ExprSet::new();
+let mut affected_id = HashSet::new();
+
+match plan {
+LogicalPlan::Projection {
+expr,
+input,
+schema: _,
+} => {
+let mut arrays = vec![];
+for e in expr {
+let data_type = e.get_type(input.schema())?;
+let mut id_array = vec![];
+expr_to_identifier(e, &mut expr_set, &mut id_array, 
data_type)?;
+arrays.push(id_array);
+}
+
+return optimize(input, execution_props);

Review comment:
   I may be missing something but doesn't this basically remove the 
`LogicalPlan::Projection` step (it just returns the optimized input)

##
File path: datafusion/src/logical_plan/expr.rs
##
@@ -981,15 +983,25 @@ pub trait ExpressionVisitor: Sized {
 }
 }
 
+/// Controls how the [ExprRewriter] recursion should proceed.

Review comment:
   👍  this is a nice change
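   The identifier scheme the doc comment above describes can be sketched in 
Python: sub-expressions are keyed by their string form, and any identifier seen 
more than once marks a common sub-expression (hypothetical tuple-based 
expression trees stand in for DataFusion's `Expr`):

```python
from collections import Counter

def common_subexprs(exprs):
    """Count every sub-expression by its string identifier and
    return the identifiers that occur more than once."""
    counts = Counter()

    def visit(e):
        counts[str(e)] += 1
        if isinstance(e, tuple):          # ("op", child, child) node
            for child in e[1:]:
                visit(child)

    for e in exprs:
        visit(e)
    return {ident for ident, n in counts.items() if n > 1}
```

   A product like `a * b` shared by two projection expressions would be 
identified once here, then computed once and reused by the rewritten plan.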








[GitHub] [arrow] bkietz closed pull request #10667: ARROW-11691: [Developer][CI] Provide a consolidated .env file for benchmark-relevant environment variables

2021-07-30 Thread GitBox


bkietz closed pull request #10667:
URL: https://github.com/apache/arrow/pull/10667


   






[GitHub] [arrow] westonpace commented on a change in pull request #10802: ARROW-1568: [C++] Implement Drop Null Kernel for Arrays

2021-07-30 Thread GitBox


westonpace commented on a change in pull request #10802:
URL: https://github.com/apache/arrow/pull/10802#discussion_r680168763



##
File path: cpp/src/arrow/compute/kernels/vector_selection.cc
##
@@ -1988,7 +1989,7 @@ class FilterMetaFunction : public MetaFunction {
 
 Result> TakeAA(const Array& values, const Array& 
indices,
   const TakeOptions& options, ExecContext* 
ctx) {
-  ARROW_ASSIGN_OR_RAISE(Datum result,
+ARROW_ASSIGN_OR_RAISE(Datum result,

Review comment:
   What's going on here?  I've found it easiest to configure the IDE to run 
format-on-save.

##
File path: python/pyarrow/tests/test_compute.py
##
@@ -974,6 +974,51 @@ def test_take_null_type():
 assert len(table.take(indices).column(0)) == 4
 
 
+@pytest.mark.parametrize(('ty', 'values'), all_array_types)
+def test_dropnull(ty, values):
+arr = pa.array(values, type=ty)
+result = arr.dropnull()
+result.validate()
+indices = [i for i in range(len(arr)) if arr[i].is_valid]
+expected = arr.take(pa.array(indices))
+assert result.equals(expected)
+
+
+def test_dropnull_chunked_array():
+arr = pa.chunked_array([["a", None], ["c", "d", None]])
+expected_drop = pa.chunked_array([["a"], ["c", "d"]])
+result = arr.dropnull()
+assert result.equals(expected_drop)
+
+
+def test_dropnull_record_batch():
+batch = pa.record_batch(
+[pa.array(["a", None, "c", "d", None])], names=["a'"])

Review comment:
   Can you make some tests for record batches with multiple columns? (I 
think this will fail if they have different numbers of null elements).
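   The concern above can be illustrated with a small Python sketch of row-wise 
drop-null semantics: a row is kept only when every column is valid, so all 
output columns share one length (independent per-column takes with different 
index sets would not):

```python
def drop_null_rows(columns):
    """Filter parallel columns to the rows where no column holds a
    null (modeled here as None). All outputs share one length."""
    keep = [i for i in range(len(columns[0]))
            if all(col[i] is not None for col in columns)]
    return [[col[i] for i in keep] for col in columns]
```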

##
File path: cpp/src/arrow/compute/kernels/vector_selection.cc
##
@@ -2146,6 +2147,144 @@ class TakeMetaFunction : public MetaFunction {
   }
 };
 
+// --
+// DropNull Implementation
+
+Result> GetNotNullIndices(
+const std::shared_ptr& column, MemoryPool* memory_pool) {
+  std::shared_ptr indices;
+  arrow::NumericBuilder builder(memory_pool);
+  builder.Reserve(column->length() - column->null_count());
+
+  std::vector values;
+  for (int64_t i = 0; i < column->length(); i++) {
+if (column->IsValid(i)) {
+  builder.UnsafeAppend(static_cast(i));
+}
+  }
+  RETURN_NOT_OK(builder.Finish(&indices));
+  return indices;
+}
+
+Result> GetNotNullIndices(
+const std::shared_ptr& chunks, MemoryPool* memory_pool) {
+  std::shared_ptr indices;
+  arrow::NumericBuilder builder(memory_pool);
+  builder.Reserve(chunks->length() - chunks->null_count());
+  int64_t relative_index = 0;
+  for (int64_t chunk_index = 0; chunk_index < chunks->num_chunks(); 
++chunk_index) {
+auto column_chunk = chunks->chunk(chunk_index);
+for (int64_t col_index = 0; col_index < column_chunk->length(); 
col_index++) {
+  if (column_chunk->IsValid(col_index)) {
+builder.UnsafeAppend(static_cast(relative_index + col_index));
+  }
+}
+relative_index += column_chunk->length();
+  }
+  RETURN_NOT_OK(builder.Finish(&indices));
+  return indices;
+}
+
+Result> DropNullRecordBatch(const RecordBatch& 
batch,
+ ExecContext* ctx) {
+  int64_t length_count = 0;
+  std::vector> columns(batch.num_columns());
+  for (int i = 0; i < batch.num_columns(); ++i) {
+ARROW_ASSIGN_OR_RAISE(auto indices,
+  GetNotNullIndices(batch.column(i), 
ctx->memory_pool()));
+ARROW_ASSIGN_OR_RAISE(Datum out, Take(batch.column(i)->data(), 
Datum(indices),
+  TakeOptions::NoBoundsCheck(), ctx));
+columns[i] = out.make_array();
+length_count += columns[i]->length();
+  }
+  return RecordBatch::Make(batch.schema(), length_count, std::move(columns));
+}
+
+Result> DropNullTable(const Table& table, ExecContext* 
ctx) {
+  if (table.num_rows() == 0) {
+return Table::Make(table.schema(), table.columns(), 0);
+  }
+  const int num_columns = table.num_columns();
+  std::vector inputs(num_columns);
+
+  // Fetch table columns
+  for (int i = 0; i < num_columns; ++i) {
+inputs[i] = table.column(i)->chunks();
+  }
+  std::set notnull_indices;
+  // Rechunk inputs to allow consistent iteration over their respective chunks
+  inputs = arrow::internal::RechunkArraysConsistently(inputs);
+
+  const int64_t num_chunks = static_cast(inputs.back().size());
+  for (int col = 0; col < num_columns; ++col) {
+int64_t relative_index = 0;
+for (int64_t chunk_index = 0; chunk_index < num_chunks; ++chunk_index) {
+  const auto& column_chunk = inputs[col][chunk_index];
+  for (int64_t i = 0; i < column_chunk->length(); ++i) {
+if (!column_chunk->IsValid(i)) {
+  notnull_indices.insert(relative_index + i);
+}
+  }
+  relative_index += column_chunk->length();
+}
+  }
+  std::shared_ptr indices;

Review comment:
   Can you add a check here.  If the l

[GitHub] [arrow] bkietz commented on a change in pull request #10843: ARROW-4700: [C++] Added support for decimal128 and decimal256 json converted

2021-07-30 Thread GitBox


bkietz commented on a change in pull request #10843:
URL: https://github.com/apache/arrow/pull/10843#discussion_r680213131



##
File path: cpp/src/arrow/json/parser.cc
##
@@ -102,6 +102,8 @@ Status Kind::ForType(const DataType& type, Kind::type* 
kind) {
 Status Visit(const TimeType&) { return SetKind(Kind::kNumber); }
 Status Visit(const DateType&) { return SetKind(Kind::kNumber); }
 Status Visit(const BinaryType&) { return SetKind(Kind::kString); }
+Status Visit(const LargeBinaryType&) { return SetKind(Kind::kString); }
+Status Visit(const TimestampType&) { return SetKind(Kind::kString); }

Review comment:
   Thanks for fixing this!
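   The `Visit` overloads above amount to a type-to-token-kind table; a minimal 
sketch of the same mapping (names assumed for illustration, not Arrow's actual 
API):

```python
# JSON token kind the parser expects for each Arrow logical type,
# mirroring the Visit() overloads in parser.cc: times and dates
# arrive as numbers, while binary and timestamp values arrive as
# strings.
KIND_FOR_TYPE = {
    "time": "number",
    "date": "number",
    "binary": "string",
    "large_binary": "string",
    "timestamp": "string",
}

def kind_for(type_name):
    return KIND_FOR_TYPE[type_name]
```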








[GitHub] [arrow] kou commented on pull request #10836: ARROW-13503: [GLib][Ruby][Flight] Add support for DoGet

2021-07-30 Thread GitBox


kou commented on pull request #10836:
URL: https://github.com/apache/arrow/pull/10836#issuecomment-890149852


   +1






[GitHub] [arrow] kou closed pull request #10836: ARROW-13503: [GLib][Ruby][Flight] Add support for DoGet

2021-07-30 Thread GitBox


kou closed pull request #10836:
URL: https://github.com/apache/arrow/pull/10836


   






[GitHub] [arrow] gcca opened a new pull request #10846: ARROW 13473: [C++] Add super-scalar impl for BitUtil::SetBitTo

2021-07-30 Thread GitBox


gcca opened a new pull request #10846:
URL: https://github.com/apache/arrow/pull/10846


   Add the super-scalar variant to set a bit.
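   A branchless "set bit to value" can be sketched in Python (an assumed 
illustration of the superscalar-friendly form, not Arrow's exact C++):

```python
def set_bit_to(byte, i, flag):
    """Set bit i of byte to flag (0/1) without branching on flag:
    clear the bit, then OR in the flag shifted into place."""
    mask = 1 << i
    return (byte & ~mask & 0xFF) | (mask * int(bool(flag)))
```

   Avoiding the data-dependent branch keeps the pipeline of a wide 
(superscalar) core full when bits are set in a tight loop.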






[GitHub] [arrow] github-actions[bot] commented on pull request #10846: ARROW 13473: [C++] Add super-scalar impl for BitUtil::SetBitTo

2021-07-30 Thread GitBox


github-actions[bot] commented on pull request #10846:
URL: https://github.com/apache/arrow/pull/10846#issuecomment-890159961


   
   
   Thanks for opening a pull request!
   
   If this is not a [minor 
PR](https://github.com/apache/arrow/blob/master/CONTRIBUTING.md#Minor-Fixes). 
Could you open an issue for this pull request on JIRA? 
https://issues.apache.org/jira/browse/ARROW
   
   Opening JIRAs ahead of time contributes to the 
[Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.)
 of the Apache Arrow project.
   
   Then could you also rename pull request title in the following format?
   
   ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
   MINOR: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
 * [Other pull requests](https://github.com/apache/arrow/pulls/)
 * [Contribution Guidelines - How to contribute 
patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   






[GitHub] [arrow-datafusion] Dandandan opened a new pull request #796: Unsupported conditions in join

2021-07-30 Thread GitBox


Dandandan opened a new pull request #796:
URL: https://github.com/apache/arrow-datafusion/pull/796


   # Which issue does this PR close?
   
   
   
   Closes #.
   
# Rationale for this change
   
   
   Move more unsupported conditions in LEFT join to filters.
   This is needed to get TPC-H query 13 to run.
   
   # What changes are included in this PR?
   
   
   # Are there any user-facing changes?
   
   
   
   





