[GitHub] [arrow-adbc] paleolimbot commented on a diff in pull request #65: [C] Basic libpq-based driver

2022-08-12 Thread GitBox


paleolimbot commented on code in PR #65:
URL: https://github.com/apache/arrow-adbc/pull/65#discussion_r944980230


##
c/drivers/postgres/statement.cc:
##
@@ -0,0 +1,283 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "statement.h"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+#include "connection.h"
+#include "util.h"
+
+namespace adbcpq {
+
+namespace {
+/// \brief An ArrowArrayStream that reads tuples from a PGresult.
+class TupleReader {
+ public:
+  explicit TupleReader(PGresult* result) : result_(result) {}
+
+  int GetSchema(struct ArrowSchema* out) {
+std::memset(out, 0, sizeof(*out));
+const int num_fields = PQnfields(result_);
+NA_RETURN_NOT_OK(ArrowSchemaInit(out, NANOARROW_TYPE_STRUCT));
+NA_RETURN_NOT_OK(ArrowSchemaAllocateChildren(out, num_fields));
+for (int i = 0; i < num_fields; i++) {
+  ArrowType field_type = NANOARROW_TYPE_NA;
+  const Oid pg_type = PQftype(result_, i);
+  switch (pg_type) {
+// TODO: at startup, query pg_type to build up this mapping instead of
+// hardcoding it
+case 16:  // BOOLOID
+  field_type = NANOARROW_TYPE_BOOL;
+  break;
+case 20:  // INT8OID
+  field_type = NANOARROW_TYPE_INT64;
+  break;
+case 21:  // INT2OID
+  field_type = NANOARROW_TYPE_INT16;
+  break;
+case 23:  // INT4OID
+  field_type = NANOARROW_TYPE_INT32;
+  break;
+default:
+  last_error_ = StringBuilder("[libpq] Column #", i + 1, " (\"",
+      PQfname(result_, i), "\") has unknown type code ", pg_type);
+  return ENOTSUP;
+  }
+  NA_RETURN_NOT_OK(ArrowSchemaInit(out->children[i], field_type));
+  NA_RETURN_NOT_OK(ArrowSchemaSetName(out->children[i], PQfname(result_, i)));
+}
+
+NA_RETURN_NOT_OK(ArrowSchemaDeepCopy(out, &schema_));
+return 0;
+  }
+
+  int GetNext(struct ArrowArray* out) {
+if (!result_) {
+  out->release = nullptr;
+  return 0;
+}
+
+const int num_rows = PQntuples(result_);
+
+NA_RETURN_NOT_OK(ArrowArrayInit(out, NANOARROW_TYPE_STRUCT));
+NA_RETURN_NOT_OK(ArrowArrayAllocateChildren(out, schema_.n_children));
+
+std::vector<struct ArrowSchemaView> fields(schema_.n_children);
+
+for (int col = 0; col < schema_.n_children; col++) {
+  NA_RETURN_NOT_OK(ArrowSchemaViewInit(&fields[col], schema_.children[col], nullptr));
+  NA_RETURN_NOT_OK(ArrowArrayInit(out->children[col], fields[col].data_type));
+  NA_RETURN_NOT_OK(
+      ArrowBitmapReserve(ArrowArrayValidityBitmap(out->children[col]), num_rows));
+  switch (fields[col].data_type) {
+case NANOARROW_TYPE_INT32:
+  NA_RETURN_NOT_OK(ArrowBufferReserve(ArrowArrayBuffer(out->children[col], 1),
+      num_rows * sizeof(int32_t)));
+  break;
+default:
+  last_error_ = StringBuilder("[libpq] Column #", col + 1, " (\"",
+      schema_.children[col]->name, "\") has unsupported type ", fields[col].data_type);
+  return ENOTSUP;
+  }
+}
+
+for (int row = 0; row < num_rows; row++) {
+  for (int col = 0; col < schema_.n_children; col++) {
+struct ArrowBitmap* bitmap = ArrowArrayValidityBitmap(out->children[col]);
+NA_RETURN_NOT_OK(ArrowBitmapAppend(bitmap, !PQgetisnull(result_, row, col), 1));
+
+switch (fields[col].data_type) {
+  case NANOARROW_TYPE_INT32: {
+struct ArrowBuffer* buffer = ArrowArrayBuffer(out->children[col], 1);
+// TODO: assert PQgetlength is 4
+NA_RETURN_NOT_OK(ArrowBufferAppendInt32(
+    buffer, ntohl(*reinterpret_cast<const uint32_t*>(PQgetvalue(result_, row, col)))));
+break;
+  }
+  default:
+last_error_ = StringBuilder(
+    "[libpq] Column #", col + 1, " (\"", schema_.children[col]->name,
+    "\") has unsupported type ", fields[col].data_type);
+return ENOTSUP;
+}
+  }
+}
+
+for (int 
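The int32 branch above converts each binary cell from network byte order before appending it to the Arrow buffer. As a small stdlib-only illustration (Python here, separate from the C++ diff) of what that `ntohl`-based decode does:

```python
import struct

def decode_pg_int32(value: bytes) -> int:
    # PostgreSQL binary-format integers arrive in network (big-endian)
    # byte order; "!i" reads a signed 32-bit big-endian integer, which
    # is what the ntohl call in the driver accomplishes in C.
    if len(value) != 4:
        raise ValueError("expected exactly 4 bytes for an int32")
    return struct.unpack("!i", value)[0]

print(decode_pg_int32(b"\x00\x00\x02\x01"))  # 513
print(decode_pg_int32(b"\xff\xff\xff\xff"))  # -1
```

PostgreSQL sends binary-format integers big-endian, so `struct`'s `!` prefix performs the same conversion the driver's `ntohl` does.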

[jira] [Created] (ARROW-17405) [Java] C Data Interface library (.so / .dylib) able to compile with mvn command

2022-08-12 Thread David Dali Susanibar Arce (Jira)
David Dali Susanibar Arce created ARROW-17405:
-

 Summary: [Java] C Data Interface library (.so / .dylib) able to 
compile with mvn command
 Key: ARROW-17405
 URL: https://issues.apache.org/jira/browse/ARROW-17405
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Documentation, Java
Reporter: David Dali Susanibar Arce
Assignee: David Dali Susanibar Arce






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17404) [Java] Consolidate JNI compilation #2

2022-08-12 Thread David Dali Susanibar Arce (Jira)
David Dali Susanibar Arce created ARROW-17404:
-

 Summary: [Java] Consolidate JNI compilation #2
 Key: ARROW-17404
 URL: https://issues.apache.org/jira/browse/ARROW-17404
 Project: Apache Arrow
  Issue Type: Bug
Reporter: David Dali Susanibar Arce
Assignee: David Dali Susanibar Arce


*Umbrella ticket for consolidating Java JNI compilation initiative #2*

The initial part of the JNI consolidation initiative was: [Consolidate ORC/Dataset code|https://issues.apache.org/jira/browse/ARROW-15174] and [Separate JNI CMakeLists.txt compilation|https://issues.apache.org/jira/browse/ARROW-17080].

This second part consists of:
 * Make the Java library able to compile with a single mvn command
 * Make the Java library able to compile from an installed libarrow
 * Migrate remaining C++ code specific to Java into the Java project: Gandiva
 * Add windows build script that produces DLLs
 * Incorporate Windows DLLs into the maven packages
 * Migrate JNI to use C-Data-Interface





[jira] [Created] (ARROW-17403) [Java] C Data Interface library (.so / .dylib) able to compile with mvn command

2022-08-12 Thread David Dali Susanibar Arce (Jira)
David Dali Susanibar Arce created ARROW-17403:
-

 Summary: [Java] C Data Interface library (.so / .dylib) able to 
compile with mvn command
 Key: ARROW-17403
 URL: https://issues.apache.org/jira/browse/ARROW-17403
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Documentation, Java
Reporter: David Dali Susanibar Arce
Assignee: David Dali Susanibar Arce








[jira] [Created] (ARROW-17402) [C++] Improve Dataset Write Option Defaults

2022-08-12 Thread Kae Suarez (Jira)
Kae Suarez created ARROW-17402:
--

 Summary: [C++] Improve Dataset Write Option Defaults
 Key: ARROW-17402
 URL: https://issues.apache.org/jira/browse/ARROW-17402
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kae Suarez


Currently, when writing a single table directly to disk, the defaults are suitable for CSV, IPC, Parquet, etc. Writing a dataset, however, requires configuring multiple options even where obvious defaults exist: e.g., the fragment names require user input, although they will often be something like "part\{i}.parquet." Ideally, the defaults should be adequate to write a Dataset without further user configuration.
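A defaulting rule of the kind requested could be sketched as follows; the helper name and signature here are hypothetical, not pyarrow's actual API:

```python
def default_basename_template(fmt, template=None):
    # Hypothetical defaulting rule: when the caller does not configure a
    # fragment name template, fall back to "part-{i}.<format extension>".
    return template if template is not None else f"part-{{i}}.{fmt}"

print(default_basename_template("parquet"))            # part-{i}.parquet
print(default_basename_template("feather", "x-{i}"))   # x-{i}
```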





[GitHub] [arrow-adbc] lidavidm opened a new pull request, #65: [C] Basic libpq-based driver

2022-08-12 Thread GitBox


lidavidm opened a new pull request, #65:
URL: https://github.com/apache/arrow-adbc/pull/65

   The driver supports basic queries (int32 only) and toggling autocommit. It 
does not yet support bulk ingestion or prepared statements.
   
   It hasn't been optimized for speed and the approach taken here will not be 
fast (it uses the per-row getters). In future PRs, we should set up some 
benchmarks and then see if DuckDB's approach makes more sense (use `COPY`). 
DuckDB also does multithreading (that might be hard for us). We may want to 
implement #61 first since then we will know whether it is safe to use `COPY` or 
not.
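For reference, the `COPY (...) TO STDOUT (FORMAT binary)` alternative mentioned here produces PostgreSQL's documented PGCOPY binary format: an 11-byte signature, 4-byte flags and header-extension length, then per-tuple field counts with length-prefixed values, ending with a -1 trailer. A minimal sketch of parsing one int32 row from hand-built bytes (no server involved; Python used purely for illustration):

```python
import struct

PGCOPY_SIGNATURE = b"PGCOPY\n\xff\r\n\x00"  # fixed 11-byte signature

def parse_pgcopy_single_int32(stream: bytes) -> int:
    """Parse a PGCOPY binary stream containing one row with one int32 field."""
    assert stream[:11] == PGCOPY_SIGNATURE, "bad signature"
    offset = 11
    flags, ext_len = struct.unpack_from("!II", stream, offset)
    offset += 8 + ext_len           # skip flags + header extension
    nfields, = struct.unpack_from("!h", stream, offset)
    offset += 2
    assert nfields == 1
    field_len, = struct.unpack_from("!i", stream, offset)
    offset += 4
    assert field_len == 4           # a length of -1 would mean NULL
    value, = struct.unpack_from("!i", stream, offset)
    return value

# Build a stream by hand: header, one tuple (1 field, 4 bytes, value 42),
# then the -1 marker that terminates the stream.
data = (PGCOPY_SIGNATURE
        + struct.pack("!II", 0, 0)        # flags, no header extension
        + struct.pack("!hii", 1, 4, 42)   # 1 field, length 4, value 42
        + struct.pack("!h", -1))          # end-of-data marker
print(parse_pgcopy_single_int32(data))    # 42
```

This shows why `COPY` is attractive for bulk reads: one sequential parse over length-prefixed binary values instead of a libpq getter call per cell.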


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-adbc] lidavidm merged pull request #63: [C] Use nanoarrow to improve validation suite

2022-08-12 Thread GitBox


lidavidm merged PR #63:
URL: https://github.com/apache/arrow-adbc/pull/63





[GitHub] [arrow-nanoarrow] paleolimbot merged pull request #19: ArrowArray consumer buffer helpers

2022-08-12 Thread GitBox


paleolimbot merged PR #19:
URL: https://github.com/apache/arrow-nanoarrow/pull/19





[jira] [Created] (ARROW-17401) [C++] Add ReadTable method to RecordBatchFileReader

2022-08-12 Thread Will Jones (Jira)
Will Jones created ARROW-17401:
--

 Summary: [C++] Add ReadTable method to RecordBatchFileReader
 Key: ARROW-17401
 URL: https://issues.apache.org/jira/browse/ARROW-17401
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 9.0.0
Reporter: Will Jones


For convenience, it would be helpful to add a method for reading the entire file as a table.





[jira] [Created] (ARROW-17400) [C++] Move Parquet APIs to use Result instead of Status

2022-08-12 Thread Will Jones (Jira)
Will Jones created ARROW-17400:
--

 Summary: [C++] Move Parquet APIs to use Result instead of Status
 Key: ARROW-17400
 URL: https://issues.apache.org/jira/browse/ARROW-17400
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 9.0.0
Reporter: Will Jones


Notably, IPC and CSV have "open file" methods that return a Result, while opening a Parquet file requires passing in an out variable.
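The two styles can be sketched in miniature (hypothetical function names, not Arrow's actual API): a Status-style call fills an out parameter and returns only an error indicator, while a Result-style call returns the value and the error state together:

```python
# Status-style: caller supplies a mutable "out" slot; the return value
# only signals success or failure.
def open_file_status(path, out):
    if not path:
        return "Invalid: empty path"   # the Status
    out.append(f"<file {path}>")       # out parameter filled on success
    return None                        # OK

# Result-style: a single return value carries either the file or the error.
def open_file_result(path):
    if not path:
        return None, "Invalid: empty path"
    return f"<file {path}>", None

out = []
status = open_file_status("a.parquet", out)
f, err = open_file_result("a.parquet")
print(status, out, f, err)
```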





[GitHub] [arrow-nanoarrow] paleolimbot commented on a diff in pull request #19: ArrowArray consumer buffer helpers

2022-08-12 Thread GitBox


paleolimbot commented on code in PR #19:
URL: https://github.com/apache/arrow-nanoarrow/pull/19#discussion_r944665393


##
src/nanoarrow/typedefs_inline.h:
##
@@ -166,6 +166,32 @@ enum ArrowType {
   NANOARROW_TYPE_INTERVAL_MONTH_DAY_NANO
 };
 
+/// \brief Functional types of buffers as described in the Arrow Columnar 
Specification
+enum ArrowBufferType {
+  NANOARROW_BUFFER_TYPE_NONE,
+  NANOARROW_BUFFER_TYPE_VALIDITY,
+  NANOARROW_BUFFER_TYPE_TYPE_ID,
+  NANOARROW_BUFFER_TYPE_UNION_OFFSET,
+  NANOARROW_BUFFER_TYPE_DATA_OFFSET,
+  NANOARROW_BUFFER_TYPE_DATA
+};
+
+/// \brief A description of an arrangement of buffers
+///
+/// Contains the minimum amount of information required to
+/// calculate the size of each buffer in an ArrowArray knowing only
+/// the length and offset of the array.
+struct ArrowLayout {
+  /// \brief The function of each buffer
+  enum ArrowBufferType buffer_type[3];
+
+  /// \brief The size of an element in each buffer, or 0 if the size is variable or unknown
+  int64_t element_size_bits[3];
+
+  /// \brief The fixed size of a child element
+  int64_t child_size_elements;

Review Comment:
   It's needed to calculate the length of a child of a fixed-size list (I 
should clarify + add a test for that, though, to make sure).
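Concretely, the fixed-size-list case works because the child's required length is a pure function of the parent's offset, length, and `child_size_elements` (a sketch under that assumption):

```python
def fixed_size_list_child_length(parent_offset, parent_length, child_size_elements):
    # A fixed_size_list<T, n> has no offsets buffer: every slot owns exactly
    # n child elements, so the child array must cover (offset + length) * n
    # elements, knowing only the parent's length and offset.
    return (parent_offset + parent_length) * child_size_elements

# fixed_size_list<int32, 3> with parent offset 2 and length 5:
print(fixed_size_list_child_length(2, 5, 3))  # 21
```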






[GitHub] [arrow-nanoarrow] lidavidm commented on a diff in pull request #19: ArrowArray consumer buffer helpers

2022-08-12 Thread GitBox


lidavidm commented on code in PR #19:
URL: https://github.com/apache/arrow-nanoarrow/pull/19#discussion_r944654967


##
src/nanoarrow/typedefs_inline.h:
##
@@ -166,6 +166,32 @@ enum ArrowType {
   NANOARROW_TYPE_INTERVAL_MONTH_DAY_NANO
 };
 
+/// \brief Functional types of buffers as described in the Arrow Columnar 
Specification
+enum ArrowBufferType {
+  NANOARROW_BUFFER_TYPE_NONE,
+  NANOARROW_BUFFER_TYPE_VALIDITY,
+  NANOARROW_BUFFER_TYPE_TYPE_ID,
+  NANOARROW_BUFFER_TYPE_UNION_OFFSET,
+  NANOARROW_BUFFER_TYPE_DATA_OFFSET,
+  NANOARROW_BUFFER_TYPE_DATA
+};
+
+/// \brief A description of an arrangement of buffers
+///
+/// Contains the minimum amount of information required to
+/// calculate the size of each buffer in an ArrowArray knowing only
+/// the length and offset of the array.
+struct ArrowLayout {
+  /// \brief The function of each buffer
+  enum ArrowBufferType buffer_type[3];
+
+  /// \brief The size of an element in each buffer, or 0 if the size is variable or unknown
+  int64_t element_size_bits[3];
+
+  /// \brief The fixed size of a child element
+  int64_t child_size_elements;

Review Comment:
   How is this different from `element_size_bits`?






[GitHub] [arrow-nanoarrow] paleolimbot commented on a diff in pull request #19: ArrowArray consumer buffer helpers

2022-08-12 Thread GitBox


paleolimbot commented on code in PR #19:
URL: https://github.com/apache/arrow-nanoarrow/pull/19#discussion_r944649830


##
src/nanoarrow/utils_inline.h:
##
@@ -26,6 +26,115 @@
 extern "C" {
 #endif
 
+static inline void ArrowLayoutInit(struct ArrowLayout* layout,

Review Comment:
   Done! I'm sure it *could* be header-only but that's a discussion/battle for 
another day that I'm not all that qualified to weigh in on.






[GitHub] [arrow-nanoarrow] paleolimbot commented on a diff in pull request #19: ArrowArray consumer buffer helpers

2022-08-12 Thread GitBox


paleolimbot commented on code in PR #19:
URL: https://github.com/apache/arrow-nanoarrow/pull/19#discussion_r944646506


##
src/nanoarrow/typedefs_inline.h:
##
@@ -179,6 +217,19 @@ struct ArrowStringView {
   int64_t n_bytes;
 };
 
+/// \brief A non-owning view of a buffer
+struct ArrowBufferView {
+  /// \brief A pointer to the start of the buffer
+  ///
+  /// If n_bytes is 0, this value may be NULL.
+  const union ArrowBufferDataPointer data;
+
+  /// \brief The size of the string in bytes,
+  ///
+  /// (Not including the null terminator.)
+  int64_t n_bytes;

Review Comment:
   Done!






[GitHub] [arrow-adbc] lidavidm merged pull request #62: [Format][C][Java] Add method to get parameter schema

2022-08-12 Thread GitBox


lidavidm merged PR #62:
URL: https://github.com/apache/arrow-adbc/pull/62





[GitHub] [arrow-adbc] lidavidm closed issue #60: [Format] Retrieve expected param binding information

2022-08-12 Thread GitBox


lidavidm closed issue #60: [Format] Retrieve expected param binding information
URL: https://github.com/apache/arrow-adbc/issues/60





[GitHub] [arrow-adbc] lidavidm commented on issue #61: [Format] Simplify Execute and Query interface

2022-08-12 Thread GitBox


lidavidm commented on issue #61:
URL: https://github.com/apache/arrow-adbc/issues/61#issuecomment-1213320396

   Another reason to differentiate between queries with/without result sets: in 
a Postgres driver, that means we know when we can attempt to use `COPY (...) TO 
STDOUT (FORMAT binary)` (akin to DuckDB's integration with Postgres) to get 
bulk binary data instead of parsing data one row at a time.





[GitHub] [arrow-nanoarrow] codecov-commenter commented on pull request #19: ArrowArray consumer buffer helpers

2022-08-12 Thread GitBox


codecov-commenter commented on PR #19:
URL: https://github.com/apache/arrow-nanoarrow/pull/19#issuecomment-1213263144

   # [Codecov](https://codecov.io/gh/apache/arrow-nanoarrow/pull/19) Report
   > Merging [#19](https://codecov.io/gh/apache/arrow-nanoarrow/pull/19) (e393e32) into [main](https://codecov.io/gh/apache/arrow-nanoarrow/commit/3b305075d2c4c8489ac0e6288edb58a52a64884d) (3b30507) will **increase** coverage by `0.44%`.
   > The diff coverage is `95.37%`.
   
   ```diff
   @@Coverage Diff @@
   ## main  #19  +/-   ##
   ==
   + Coverage   90.64%   91.09%   +0.44% 
   ==
 Files   9   10   +1 
 Lines1037 1145 +108 
 Branches   43   46   +3 
   ==
   + Hits  940 1043 +103 
   - Misses 63   66   +3 
   - Partials   34   36   +2 
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-nanoarrow/pull/19) | Coverage Δ | |
   |---|---|---|
   | src/nanoarrow/array\_view.c | `82.75% <82.75%> (ø)` | |
   | src/nanoarrow/schema\_view.c | `98.88% <100.00%> (+0.01%)` | :arrow_up: |
   | src/nanoarrow/utils\_inline.h | `100.00% <100.00%> (ø)` | |
   





[GitHub] [arrow-adbc] zeroshade commented on pull request #62: [Format][C][Java] Add method to get parameter schema

2022-08-12 Thread GitBox


zeroshade commented on PR #62:
URL: https://github.com/apache/arrow-adbc/pull/62#issuecomment-1213241281

   :shipit: 





[jira] [Created] (ARROW-17399) pyarrow may use a lot of memory to load a dataframe from parquet

2022-08-12 Thread Gianluca Ficarelli (Jira)
Gianluca Ficarelli created ARROW-17399:
--

 Summary: pyarrow may use a lot of memory to load a dataframe from 
parquet
 Key: ARROW-17399
 URL: https://issues.apache.org/jira/browse/ARROW-17399
 Project: Apache Arrow
  Issue Type: Bug
  Components: Parquet, Python
Affects Versions: 9.0.0
 Environment: linux
Reporter: Gianluca Ficarelli
 Attachments: memory-profiler.png

When a pandas dataframe is loaded from a parquet file using 
{{{}pyarrow.parquet.read_table{}}}, the memory usage may grow a lot more than 
what should be needed to load the dataframe, and it's not freed until the 
dataframe is deleted.

The problem is evident when the dataframe has a {*}column containing lists or numpy arrays{*}, while it seems absent (or not noticeable) if the column contains only integers or floats.

I'm attaching a simple script to reproduce the issue, and a graph created with 
memory-profiler showing the memory usage.

In this example, the dataframe created with pandas needs around 1.2 GB, but the 
memory usage after loading it from parquet is around 16 GB.

The items of the column are created as numpy arrays rather than lists, to be consistent with the types loaded from parquet (pyarrow produces numpy arrays, not lists).

 
{code:python}
import gc
import time
import numpy as np
import pandas as pd
import pyarrow
import pyarrow.parquet
import psutil

def pyarrow_dump(filename, df, compression="snappy"):
table = pyarrow.Table.from_pandas(df)
pyarrow.parquet.write_table(table, filename, compression=compression)

def pyarrow_load(filename):
table = pyarrow.parquet.read_table(filename)
return table.to_pandas()

def print_mem(msg, start_time=time.monotonic(), process=psutil.Process()):
# gc.collect()
current_time = time.monotonic() - start_time
rss = process.memory_info().rss / 2 ** 20
print(f"{msg:>3} time:{current_time:>10.1f} rss:{rss:>10.1f}")

if __name__ == "__main__":
print_mem(0)

rows = 500
df = pd.DataFrame({"a": [np.arange(10) for i in range(rows)]})
print_mem(1)

pyarrow_dump("example.parquet", df)
print_mem(2)

del df
print_mem(3)
time.sleep(3)
print_mem(4)

df = pyarrow_load("example.parquet")
print_mem(5)
time.sleep(3)
print_mem(6)

del df
print_mem(7)
time.sleep(3)
print_mem(8)
{code}
Run with memory-profiler:
{code:bash}
mprof run --multiprocess python test_pyarrow.py
{code}
Output:
{code:java}
mprof: Sampling memory every 0.1s
running new process
  0 time:   0.0 rss: 135.4
  1 time:   4.9 rss:1252.2
  2 time:   7.1 rss:1265.0
  3 time:   7.5 rss: 760.2
  4 time:  10.7 rss: 758.9
  5 time:  19.6 rss:   16745.4
  6 time:  22.6 rss:   16335.4
  7 time:  22.9 rss:   15833.0
  8 time:  25.9 rss: 955.0
{code}





[GitHub] [arrow-adbc] lidavidm commented on a diff in pull request #62: [Format][C][Java] Add method to get parameter schema

2022-08-12 Thread GitBox


lidavidm commented on code in PR #62:
URL: https://github.com/apache/arrow-adbc/pull/62#discussion_r944581552


##
adbc.h:
##
@@ -746,6 +746,22 @@ AdbcStatusCode AdbcStatementBindStream(struct 
AdbcStatement* statement,
struct ArrowArrayStream* values,
struct AdbcError* error);
 
+/// \brief Get the schema for bound parameters.
+///
+/// This should be called after AdbcStatementPrepare.  This retrieves
+/// an Arrow schema describing the number, names, and types of the
+/// parameters in a parameterized statement.  Not all drivers will
+/// support this.  If the name of a parameter cannot be determined,
+/// the name of the corresponding field in the schema will be an empty
+/// string.  Similarly, if the type cannot be statically determined,
+/// the type of the corresponding field will be NA (NullType).

Review Comment:
   Good idea - updated the docstrings






[jira] [Created] (ARROW-17398) [R] Add support for %Z to strptime

2022-08-12 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-17398:
--

 Summary: [R] Add support for %Z to strptime 
 Key: ARROW-17398
 URL: https://issues.apache.org/jira/browse/ARROW-17398
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Rok Mihevc


While lubridate does not support the %Z flag for strptime, Arrow could.

Changes to C++ kernels might be required for support on all platforms, but that shouldn't block implementation, as the kStrptimeSupportsZone flag can be used; [see proposal|https://github.com/apache/arrow/pull/13854#issuecomment-1212694663].





[GitHub] [arrow-adbc] lidavidm opened a new issue, #64: [Format] Formalize thread safety guarantees

2022-08-12 Thread GitBox


lidavidm opened a new issue, #64:
URL: https://github.com/apache/arrow-adbc/issues/64

   Things to consider
   - What do underlying APIs provide (libpq, sqlite, JDBC, ODBC, Flight SQL)
   - What do wrapper APIs expect (JDBC, ODBC, dbapi, Go's database library)
   
   Example: libpq disallows concurrent queries through a single PGconn, so 
multiple AdbcStatements can't be used if they share a connection (and the 
semantics of that get murky anyways) - but what should the behavior be?
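One possible policy, sketched generically (toy classes, not ADBC's API): serialize statements behind the connection's lock, so concurrent use blocks instead of corrupting the single PGconn:

```python
import threading

class Connection:
    """Toy connection allowing one active statement at a time."""
    def __init__(self):
        self._lock = threading.Lock()   # stands in for the single PGconn

    def execute(self, query):
        # Serializes concurrent statements; an alternative policy would be
        # to fail fast with a "connection busy" error instead of blocking.
        with self._lock:
            return f"result of {query}"

conn = Connection()
results = []
threads = [threading.Thread(target=lambda q=q: results.append(conn.execute(q)))
           for q in ("SELECT 1", "SELECT 2")]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(results))
```

Whether to block or error is exactly the open question here; the lock merely makes the "serialize" choice concrete.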





[GitHub] [arrow-adbc] zeroshade commented on a diff in pull request #62: [Format][C][Java] Add method to get parameter schema

2022-08-12 Thread GitBox


zeroshade commented on code in PR #62:
URL: https://github.com/apache/arrow-adbc/pull/62#discussion_r944512305


##
adbc.h:
##
@@ -746,6 +746,22 @@ AdbcStatusCode AdbcStatementBindStream(struct 
AdbcStatement* statement,
struct ArrowArrayStream* values,
struct AdbcError* error);
 
+/// \brief Get the schema for bound parameters.
+///
+/// This should be called after AdbcStatementPrepare.  This retrieves
+/// an Arrow schema describing the number, names, and types of the
+/// parameters in a parameterized statement.  Not all drivers will
+/// support this.  If the name of a parameter cannot be determined,
+/// the name of the corresponding field in the schema will be an empty
+/// string.  Similarly, if the type cannot be statically determined,
+/// the type of the corresponding field will be NA (NullType).

Review Comment:
   should we also explicitly state/define that the order of the columns in the 
schema should match the ordinal position of the parameters and if a named 
parameter is used multiple times in the query, it should only appear once in 
the schema?
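That rule can be sketched with a toy extractor; the `:name` parameter syntax and the regex are illustrative only, not part of any ADBC driver:

```python
import re

def parameter_names(query: str) -> list[str]:
    # Collect :name parameters in order of first appearance, deduplicated,
    # matching the proposed rule: ordinal position, one entry per name
    # even when a named parameter is used multiple times.
    seen = []
    for match in re.finditer(r":(\w+)", query):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen

q = "SELECT * FROM t WHERE a = :x AND b = :y AND c = :x"
print(parameter_names(q))  # ['x', 'y']
```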






[GitHub] [arrow-adbc] lidavidm commented on pull request #62: [Format][C][Java] Add method to get parameter schema

2022-08-12 Thread GitBox


lidavidm commented on PR #62:
URL: https://github.com/apache/arrow-adbc/pull/62#issuecomment-1213030715

   @zeroshade does this seem reasonable? 





[jira] [Created] (ARROW-17397) [R] Does R API for Apache Arrow has a tableFromIPC function ?

2022-08-12 Thread Roy Assis (Jira)
Roy Assis created ARROW-17397:
-

 Summary: [R] Does R API for Apache Arrow has a tableFromIPC 
function ? 
 Key: ARROW-17397
 URL: https://issues.apache.org/jira/browse/ARROW-17397
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Roy Assis


I'm building an API using Python and Flask. I want to return a dataframe from the API; I'm serializing the dataframe like so and sending it in the response:

{code:python}
batch = pa.record_batch(df)
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
writer.write_batch(batch)
pybytes = sink.getvalue().to_pybytes()
{code}

Is it possible to read it with R? If so, can you provide a code snippet?

Best,
Roy





[jira] [Created] (ARROW-17396) [C++][Dataset] Allow creating FileSystemDataset with FileInfoGenerator as a source

2022-08-12 Thread Pavel Solodovnikov (Jira)
Pavel Solodovnikov created ARROW-17396:
--

 Summary: [C++][Dataset] Allow creating FileSystemDataset with 
FileInfoGenerator as a source
 Key: ARROW-17396
 URL: https://issues.apache.org/jira/browse/ARROW-17396
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Pavel Solodovnikov





