[GitHub] [arrow-datafusion] andygrove opened a new issue, #4985: Changelog generator not working for patch releases

2023-01-19 Thread GitBox


andygrove opened a new issue, #4985:
URL: https://github.com/apache/arrow-datafusion/issues/4985

   **Describe the bug**
   I could not get the changelog script to generate an accurate changelog for 
the maint-16.x branch. It showed changes from the master branch that are not in 
maint-16.x.
   
   **To Reproduce**
   Try doing the above
   
   **Expected behavior**
   Generate correct changelog
   
   **Additional context**
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] AlenkaF commented on pull request #33761: GH-14932: Add python bindings for JSON streaming reader

2023-01-19 Thread GitBox


AlenkaF commented on PR #33761:
URL: https://github.com/apache/arrow/pull/33761#issuecomment-1397108286

   Great work so far @akshaysu12!
   Looking at the code from the binding for `CSVStreamingReader` and the work 
done on the C++ side for the JSON stream reader, I would say this PR is already 
in good shape.
   
   As you mentioned, you should add the documentation to 
https://arrow.apache.org/docs/dev/python/api/formats.html 
(docs/source/python/api/formats.rst) and 
https://arrow.apache.org/docs/dev/python/json.html 
(docs/source/python/json.rst).
   
   To build the documentation locally, see: 
https://arrow.apache.org/docs/dev/developers/documentation.html
   
   You should also correct the linter error:
   ```diff
   --- original//arrow/python/pyarrow/tests/test_json.py
   +++ fixed//arrow/python/pyarrow/tests/test_json.py
   @@ -106,6 +106,7 @@
        check_options_class_pickling(cls, explicit_schema=schema,
                                     newlines_in_values=False,
                                     unexpected_field_behavior="ignore")
   +
   
    class BaseTestJSON(abc.ABC):
        @abc.abstractmethod
   @@ -296,11 +297,12 @@
            # Better error output
            assert table.to_pydict() == expected.to_pydict()
   
   +
    class BaseTestJSONRead(BaseTestJSON):
   
        def read_bytes(self, b, **kwargs):
            return self.read_json(pa.py_buffer(b), **kwargs)
   -
   +
        def test_file_object(self):
            data = b'{"a": 1, "b": 2}\n'
            expected_data = {'a': [1], 'b': [2]}
   @@ -311,7 +313,7 @@
            sio = io.StringIO(data.decode())
            with pytest.raises(TypeError):
                self.read_json(sio)
   -
   +
        def test_reconcile_accross_blocks(self):
            # ARROW-12065: reconciling inferred types across blocks
            first_row = b'{   }\n'
   ```
   To run the linter locally see: 
https://arrow.apache.org/docs/dev/developers/python.html#coding-style


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #33716: WIP: DO NOT MERGE: Apache Arrow Flight SQL adapter for PostgreSQL plan

2023-01-19 Thread GitBox


alamb commented on PR #33716:
URL: https://github.com/apache/arrow/pull/33716#issuecomment-1397104567

   > Ok. And those clients use the wire protocol directly(ish), so they can't 
take advantage of the JDBC or ODBC Flight SQL drivers, presumably?
   
   That is correct. In the ideal future world, it will be FlightSQL (via ADBC 
perhaps or native support) all the way down. Until we get there it is important 
to work with the existing ecosystem
   
   > ADBC may not help there, unless they're using DBAPI/SQLAlchemy in Python, 
where the ADBC bindings implement DBAPI and work(ish) with the rest of the 
stack.
   
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion-python] jdye64 opened a new issue, #140: Bump pyo3 to 0.18.0

2023-01-19 Thread GitBox


jdye64 opened a new issue, #140:
URL: https://github.com/apache/arrow-datafusion-python/issues/140

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   PyO3 recently had a release, and there seem to be several features we could 
benefit from. We should bump our PyO3 version from `0.17.3` -> `0.18.0`.
   
   **Describe the solution you'd like**
   Bump PyO3 version from `0.17.3` -> `0.18.0` in `Cargo.toml`
   
   **Describe alternatives you've considered**
   None
   
   **Additional context**
   None
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] andygrove commented on pull request #4975: [maint-16.x] Prep for release

2023-01-19 Thread GitBox


andygrove commented on PR #4975:
URL: 
https://github.com/apache/arrow-datafusion/pull/4975#issuecomment-1397101686

   @jonmmease @alamb PTAL
   
   I could not get the changelog script to generate an accurate changelog for 
this patch release and do not have time to debug that right now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4972: Simplify GroupByHash implementation (to prepare for more work)

2023-01-19 Thread GitBox


alamb commented on code in PR #4972:
URL: https://github.com/apache/arrow-datafusion/pull/4972#discussion_r1081382689


##
datafusion/core/src/physical_plan/aggregates/row_hash.rs:
##
@@ -219,91 +221,76 @@ impl GroupedHashAggregateStream {
             batch_size,
             row_group_skip_position: 0,
             indices: [normal_agg_indices, row_agg_indices],
-        };
+        })
+    }
+}
 
-        let stream = futures::stream::unfold(inner, |mut this| async move {
-            let elapsed_compute = this.baseline_metrics.elapsed_compute();
+impl Stream for GroupedHashAggregateStream {
+    type Item = ArrowResult<RecordBatch>;
 
-            loop {
-                let result: ArrowResult<Option<RecordBatch>> =
-                    match this.input.next().await {
+    fn poll_next(
+        mut self: std::pin::Pin<&mut Self>,
+        cx: &mut Context<'_>,
+    ) -> Poll<Option<Self::Item>> {
+        let elapsed_compute = self.baseline_metrics.elapsed_compute().clone();
+
+        loop {
+            match self.exec_state {
+                ExecutionState::ReadingInput => {
+                    match ready!(self.input.poll_next_unpin(cx)) {
+                        // new batch to aggregate
                         Some(Ok(batch)) => {
                             let timer = elapsed_compute.timer();
-                            let result = group_aggregate_batch(
-                                &this.mode,
-                                &this.random_state,
-                                &this.group_by,
-                                &mut this.normal_aggr_expr,
-                                &mut this.row_accumulators,
-                                &mut this.row_converter,
-                                this.row_aggr_layout.clone(),
-                                batch,
-                                &mut this.row_aggr_state,
-                                &this.normal_aggregate_expressions,
-                                &this.row_aggregate_expressions,
-                            );
-
+                            let result = self.group_aggregate_batch(batch);
                             timer.done();
 
                             // allocate memory
                             // This happens AFTER we actually used the memory, but simplifies
                             // the whole accounting and we are OK with overshooting a bit.
                             // Also this means we either store the whole record batch or not.
                             match result.and_then(|allocated| {
-                                this.row_aggr_state.reservation.try_grow(allocated)
+                                self.row_aggr_state.reservation.try_grow(allocated)
                             }) {
-                                Ok(_) => continue,
-                                Err(e) => Err(ArrowError::ExternalError(Box::new(e))),
+                                Ok(_) => {}
+                                Err(e) => {
+                                    return Poll::Ready(Some(Err(
+                                        ArrowError::ExternalError(Box::new(e)),
+                                    )))
+                                }
                             }
                         }

Review Comment:
   I agree -- changed in 06847b550



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow-datafusion-python] andygrove opened a new issue, #139: Make it easier to create a Pandas dataframe from DataFusion query results

2023-01-19 Thread GitBox


andygrove opened a new issue, #139:
URL: https://github.com/apache/arrow-datafusion-python/issues/139

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   `DataFrame.collect` returns a list of PyArrow record batches. Each batch can 
be turned into a Pandas dataframe, but I do not know how to create a Pandas 
dataframe that contains the data from all of the batches in an efficient way.
   
   **Describe the solution you'd like**
   Either an example for this, or new features to help with this. Perhaps a 
`DataFrame.collect_single_batch` could work.
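   
   For reference, a minimal sketch of combining the batches with PyArrow as it 
exists today (assuming `df` is a DataFusion `DataFrame`; `Table.from_batches` 
and `to_pandas` are existing PyArrow APIs):
   
   ```python
   import pyarrow as pa
   
   batches = df.collect()  # list of pyarrow.RecordBatch
   
   # Combine the batches into a single Table (the Table references the
   # batches rather than copying them), then do one conversion to pandas.
   table = pa.Table.from_batches(batches)
   pandas_df = table.to_pandas()
   ```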
   
   **Describe alternatives you've considered**
   None
   
   **Additional context**
   None
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] ursabot commented on pull request #4923: Support join-filter pushdown for semi/anti join

2023-01-19 Thread GitBox


ursabot commented on PR #4923:
URL: 
https://github.com/apache/arrow-datafusion/pull/4923#issuecomment-1397098405

   Benchmark runs are scheduled for baseline = 
19f6f19c1d4783f9bcfde83744ee436f8e984154 and contender = 
dde23efed94704044822bcefe49c0af7f9260088. 
dde23efed94704044822bcefe49c0af7f9260088 is a master commit associated with 
this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported 
on ec2-t3-xlarge-us-east-2] 
[ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/4d88266c4cb04c45aa4d9c5065459682...4b5b4b9e439f4abd90ba366e0b099e18/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported 
on test-mac-arm] 
[test-mac-arm](https://conbench.ursa.dev/compare/runs/af3f6df5c9d94c03b971683713db8e22...931ef05c25c443d79347312b1f940f7f/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported 
on ursa-i9-9960x] 
[ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/ab2e5fffcaec43889fd74cec883fd6b6...2718b8b06b1440ed9d15931ffe93f5aa/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported 
on ursa-thinkcentre-m75q] 
[ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/ab6c2c64b84e4ee0b362a1d3922ba15e...4e163eff914541bdbd6b18ec024233fa/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only 
benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on issue #33769: [Python] support quantile for temporal dtypes

2023-01-19 Thread GitBox


jorisvandenbossche commented on issue #33769:
URL: https://github.com/apache/arrow/issues/33769#issuecomment-1397094509

   I think also for that reason (quantile by default always returns floats), we 
might not want to add support for quantile to temporal types (not sure if we 
actually ever considered this). 
   
   There are interpolation methods (`lower`, `higher`, `nearest`) that do 
preserve the input type, though. So maybe we could only allow those?
   
   But on your point of not matching pandas behaviour if you do the casting 
manually: I am not sure having this kernel in pyarrow would make any difference 
for that, since Arrow would basically do the same under the hood and calculate 
the interpolated value based on the underlying integers.
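   
   For illustration, a minimal sketch of "doing the casting manually" (the 
timestamp values are made up; `pc.quantile` and the casts are existing PyArrow 
APIs):
   
   ```python
   import pyarrow as pa
   import pyarrow.compute as pc
   
   # Hypothetical timestamp data (seconds since epoch).
   arr = pa.array([1, 5, 100], type=pa.timestamp("s"))
   
   # quantile has no temporal kernel, so work on the underlying integers.
   ints = arr.cast(pa.int64())
   median = pc.quantile(ints, q=0.5, interpolation="nearest")
   
   # Cast the result back to the original temporal type.
   result = median.cast(pa.int64()).cast(pa.timestamp("s"))
   ```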
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion-python] andygrove merged pull request #137: Improve README and add more examples

2023-01-19 Thread GitBox


andygrove merged PR #137:
URL: https://github.com/apache/arrow-datafusion-python/pull/137


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] mbrobbel opened a new pull request, #33788: MINOR: [Documentation] Add link to IPC section on the Columnar page

2023-01-19 Thread GitBox


mbrobbel opened a new pull request, #33788:
URL: https://github.com/apache/arrow/pull/33788

   ### Rationale for this change
   
   Linking directly to the IPC section makes it easier to find what people are 
looking for.
   
   ### What changes are included in this PR?
   
   Adds a link to the section about IPC (on the columnar page) from the IPC 
page.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion-python] jdye64 commented on pull request #137: Improve README and add more examples

2023-01-19 Thread GitBox


jdye64 commented on PR #137:
URL: 
https://github.com/apache/arrow-datafusion-python/pull/137#issuecomment-1397088728

   FYI - I'm working on adding support for compressed file reading and will 
update the examples to show that as well once completed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] thisisnic commented on a diff in pull request #33748: GH-33746: [R] Update NEWS.md for 11.0.0

2023-01-19 Thread GitBox


thisisnic commented on code in PR #33748:
URL: https://github.com/apache/arrow/pull/33748#discussion_r1081369004


##
r/NEWS.md:
##
@@ -19,6 +19,92 @@
 
 # arrow 10.0.1.9000
 
+## New features
+
+### Docs
+
+* A substantial reorganisation, rewrite of and addition to, many of the 
+  vignettes and README. (@djnavarro, 
+  [#14514](https://github.com/apache/arrow/issues/14514))  
+
+### Reading/writing data
+
+* New functions `open_csv_dataset()`, `open_tsv_dataset()`, and 
+  `open_delim_dataset()` all wrap `open_dataset()`- they don't provide new 
+  functionality, but allow for readr-style options to be supplied, making it 
+  simpler to switch between individual file-reading and dataset 
+  functionality. ([#33614](https://github.com/apache/arrow/issues/33614))
+* User-defined null values can be set when writing CSVs both as datasets 
+  and as individual files. (@wjones127, 
+  [#14679](https://github.com/apache/arrow/issues/14679))
+* The new `col_names` parameter allows specification of column names when 
+  opening a CSV dataset. (@wjones127, 
+  [#14705](https://github.com/apache/arrow/issues/14705))
+* The `parse_options`, `read_options`, and `convert_options` parameters for 
+  reading individual files (`read_*_arrow()` functions) and datasets 
+  (`open_dataset()` and the new `open_*_dataset()` functions) can be passed 
+  in as lists. ([#15270](https://github.com/apache/arrow/issues/15270))
+* File paths containing accents can be read by `read_csv_arrow()`. 
+  ([#14930](https://github.com/apache/arrow/issues/14930))
+
+### dplyr compatibility
+* New dplyr (1.1.0) function `join_by()` has been implemented for dplyr joins 
+  on Arrow objects (equality conditions only).  
+  ([#33664](https://github.com/apache/arrow/issues/33664))
+
+### Function bindings
+
+* The following functions can be used in queries on Arrow objects:
+  * `lubridate::with_tz()` and `lubridate::force_tz()` (@eitsupi, 
+  [#14093](https://github.com/apache/arrow/issues/14093))
+  * `stringr::str_remove()` and `stringr::str_remove_all()` 
+  ([#14644](https://github.com/apache/arrow/issues/14644))
+
+### Arrow object creation
+
+* Arrow Scalars can be created from `POSIXlt` objects. 
+  ([#15277](https://github.com/apache/arrow/issues/15277))
+* `Array$create()` can create Decimal arrays. 
+  ([#15211](https://github.com/apache/arrow/issues/15211))
+* `StructArray$create()` can be used to create StructArray objects. 
+  ([#14922](https://github.com/apache/arrow/issues/14922))
+
+### Installation
+
+* Improved offline installation using pre-downloaded binaries. 
+  (@pgramme, [#14086](https://github.com/apache/arrow/issues/14086))
+* The package can automatically link to system installations of the AWS SDK
+  for C++. (@kou, [#14235](https://github.com/apache/arrow/issues/14235))
+
+## Minor improvements and fixes
+
+* Calling `lubridate::as_datetime()` on Arrow objects can handle time in 
+  sub-seconds. (@eitsupi, 
+  [#13890](https://github.com/apache/arrow/issues/13890))
+* `head()` can be called after `as_record_batch_read()`. 
+  ([#14518](https://github.com/apache/arrow/issues/14518))
+* `dplyr::right_join()` correctly coalesces keys. 
+  ([#15077](https://github.com/apache/arrow/issues/15077))
+* Output is accurate when multiple `dplyr::group_by()`/`dplyr::summarise()` 
+  calls are used. ([#14905](https://github.com/apache/arrow/issues/14905))
+* `dplyr::summarize()` works with division when divisor is a variable. 
+  ([#14933](https://github.com/apache/arrow/issues/14933))
+* `as.Date()` can go from `timestamp[us]` to `timestamp[s]`. 
+  ([#14935](https://github.com/apache/arrow/issues/14935))
+* Creating an Array from an object bigger than 2^31 has correct length 
+  ([#14929](https://github.com/apache/arrow/issues/14929))
+* curl timeout policy can be configured for S3. 
+  ([#15166](https://github.com/apache/arrow/issues/15166))
+* Multiple changes to ensure compatibility with dplyr 1.1.0. 
+  (@lionel-, [#14948](https://github.com/apache/arrow/issues/14948))
+* rlang dependency must be at least version 1.0.0 because of 
+  `check_dots_empty()`. (@daattali, 
+  [#14744](https://github.com/apache/arrow/issues/14744))
+
+## Breaking changes
+
+* `map_batches()` is lazy by default. 
([#14521](https://github.com/apache/arrow/issues/14521))

Review Comment:
   Yeah, that would mirror the tidyverse guide, so will move it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] thisisnic commented on a diff in pull request #33748: GH-33746: [R] Update NEWS.md for 11.0.0

2023-01-19 Thread GitBox


thisisnic commented on code in PR #33748:
URL: https://github.com/apache/arrow/pull/33748#discussion_r1081366249


##
r/NEWS.md:
##
@@ -19,6 +19,92 @@
 
 # arrow 10.0.1.9000
 
+## New features
+
+### Docs
+
+* A substantial reorganisation, rewrite of and addition to, many of the 
+  vignettes and README. (@djnavarro, 
+  [#14514](https://github.com/apache/arrow/issues/14514))  

Review Comment:
   It's substantial enough to warrant inclusion, otherwise I'd put it elsewhere.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4923: Support join-filter pushdown for semi/anti join

2023-01-19 Thread GitBox


alamb commented on code in PR #4923:
URL: https://github.com/apache/arrow-datafusion/pull/4923#discussion_r1081365754


##
datafusion/core/tests/sqllogictests/test_files/join.slt:
##
@@ -42,3 +42,42 @@ SELECT s.*, g.grade FROM students s join grades g on s.mark 
between g.min and g.max
 Amina 89 5
 Salma 77 4
 Christen 50 3
+
+# two tables for join
+statement ok
+CREATE TABLE t1(t1_id INT, t1_name TEXT, t1_int INT) AS VALUES
+(11, 'a', 1),
+(22, 'b', 2),
+(33, 'c', 3),
+(44, 'd', 4);
+
+statement ok
+CREATE TABLE t2(t2_id INT, t2_name TEXT, t2_int INT) AS VALUES
+(11, 'z', 3),
+(22, 'y', 1),
+(44, 'x', 3),
+(55, 'w', 3);
+
+# left semi with wrong where clause

Review Comment:
    



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] thisisnic commented on a diff in pull request #33748: GH-33746: [R] Update NEWS.md for 11.0.0

2023-01-19 Thread GitBox


thisisnic commented on code in PR #33748:
URL: https://github.com/apache/arrow/pull/33748#discussion_r1081366249


##
r/NEWS.md:
##
@@ -19,6 +19,92 @@
 
 # arrow 10.0.1.9000
 
+## New features
+
+### Docs
+
+* A substantial reorganisation, rewrite of and addition to, many of the 
+  vignettes and README. (@djnavarro, 
+  [#14514](https://github.com/apache/arrow/issues/14514))  

Review Comment:
   It's substantial enough to warrant inclusion, yep.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] alamb merged pull request #4923: Support join-filter pushdown for semi/anti join

2023-01-19 Thread GitBox


alamb merged PR #4923:
URL: https://github.com/apache/arrow-datafusion/pull/4923


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] ursabot commented on pull request #4969: refactor: display input partitions for `RepartitionExec`

2023-01-19 Thread GitBox


ursabot commented on PR #4969:
URL: 
https://github.com/apache/arrow-datafusion/pull/4969#issuecomment-1397085822

   Benchmark runs are scheduled for baseline = 
96cf046be57bf09548d51f50d0bc964904bcec7d and contender = 
19f6f19c1d4783f9bcfde83744ee436f8e984154. 
19f6f19c1d4783f9bcfde83744ee436f8e984154 is a master commit associated with 
this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported 
on ec2-t3-xlarge-us-east-2] 
[ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/ddbe6458091b4b95994b991e87c68b8b...4d88266c4cb04c45aa4d9c5065459682/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported 
on test-mac-arm] 
[test-mac-arm](https://conbench.ursa.dev/compare/runs/34270373e32849ed8db9900b9262b751...af3f6df5c9d94c03b971683713db8e22/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported 
on ursa-i9-9960x] 
[ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/a71cf0d1192049489917195356b46b9f...ab2e5fffcaec43889fd74cec883fd6b6/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported 
on ursa-thinkcentre-m75q] 
[ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/d31807daee1b497680a6bb93e6e23577...ab6c2c64b84e4ee0b362a1d3922ba15e/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only 
benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #4908: added a method to read multiple locations at the same time.

2023-01-19 Thread GitBox


alamb commented on PR #4908:
URL: 
https://github.com/apache/arrow-datafusion/pull/4908#issuecomment-1397083616

   > What do you think about having a single method which only takes a list of 
paths? For a single path, the callee can create a slice/Vec. This would be a 
lot simpler to do.
   
   I was thinking about this PR and I have an alternate suggestion
   
   It seems to me that `read_parquet`, `read_avro`, etc. are wrappers to 
simplify the process of creating a `ListingTable`. Support for multiple paths 
starts complicating the API more -- what do you think about, instead of adding 
`read_parquet_from_paths`, making it easier to see how to read multiple files 
using the `ListingTable` API directly?
   
   For example, I bet if we added a doc example like the following
   
   ```rust
   /// Creates a [`DataFrame`] for reading a Parquet data source from a 
single file or directory. 
   ///
   /// Note: if you want to read from multiple files, or control other 
behaviors
   /// you can use the [`ListingTable`] API directly. For example to read 
multiple files
   /// 
   /// ```
   /// Example here (basically copy/paste the implementation of 
read_parquet and support multiple files)
   /// ```
   pub async fn read_parquet(
       &self,
       table_path: impl AsRef<str>,
       options: ParquetReadOptions<'_>,
   ) -> Result<DataFrame> {
   ...
   ```
   
   We could give similar treatment to the docstrings for `read_avro` and 
`read_csv` (perhaps by pointing to the docs for `read_parquet` for an example 
of creating `ListingTable`s)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] eitsupi commented on issue #18487: [R] Read CSV from character vector

2023-01-19 Thread GitBox


eitsupi commented on issue #18487:
URL: https://github.com/apache/arrow/issues/18487#issuecomment-1397083374

   Note that this has not worked since `readr` 2.0.0.
   
   ``` r
   readr::read_csv(c("a,b", "1,2", "3,4"))
   #> Error: 'a,b' does not exist in current working directory 
('/tmp/Rtmp5bv5aV/reprex-4864c762e4-bared-pika').
   ```
   
   Created on 2023-01-19 with [reprex 
v2.0.2](https://reprex.tidyverse.org)
   
   AsIs `I()` must be used.
   
   ``` r
   readr::read_csv(I(c("a,b", "1,2", "3,4")))
   #> Rows: 2 Columns: 2
   #> ── Column specification ────────────────────────────────────────────────
   #> Delimiter: ","
   #> dbl (2): a, b
   #>
   #> ℹ Use `spec()` to retrieve the full column specification for this data.
   #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
   #> # A tibble: 2 × 2
   #>       a     b
   #>   <dbl> <dbl>
   #> 1     1     2
   #> 2     3     4
   ```
   
   Created on 2023-01-19 with [reprex 
v2.0.2](https://reprex.tidyverse.org)
   
   Perhaps `I()` could be used here as well to make it do the same?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] jorisvandenbossche merged pull request #33764: GH-15109: [Python] Allow creation of non empty struct array with zero field

2023-01-19 Thread GitBox


jorisvandenbossche merged PR #33764:
URL: https://github.com/apache/arrow/pull/33764


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] cyborne100 commented on issue #33758: [R] SparkR Arrow "Hello World" Error: 'write_arrow' is not an exported object from 'namespace:arrow'

2023-01-19 Thread GitBox


cyborne100 commented on issue #33758:
URL: https://github.com/apache/arrow/issues/33758#issuecomment-1397075519

   Ahh, I understand now.  Thanks for helping a NOOB. :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] alamb merged pull request #4969: refactor: display input partitions for `RepartitionExec`

2023-01-19 Thread GitBox


alamb merged PR #4969:
URL: https://github.com/apache/arrow-datafusion/pull/4969


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] alamb closed issue #4964: `RepartitionExec`: print number of input partitions in text representation

2023-01-19 Thread GitBox


alamb closed issue #4964: `RepartitionExec`: print number of input partitions 
in text representation
URL: https://github.com/apache/arrow-datafusion/issues/4964


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] eitsupi commented on a diff in pull request #33748: GH-33746: [R] Update NEWS.md for 11.0.0

2023-01-19 Thread GitBox


eitsupi commented on code in PR #33748:
URL: https://github.com/apache/arrow/pull/33748#discussion_r1081343959


##
r/NEWS.md:
##
@@ -19,6 +19,92 @@
 
 # arrow 10.0.1.9000
 
+## New features
+
+### Docs
+
+* A substantial reorganisation, rewrite of and addition to, many of the 
+  vignettes and README. (@djnavarro, 
+  [#14514](https://github.com/apache/arrow/issues/14514))  

Review Comment:
   Simple question: does improved documentation count as a new feature?



##
r/NEWS.md:
##
@@ -19,6 +19,92 @@
 
 # arrow 10.0.1.9000
 
+## New features
+
+### Docs
+
+* A substantial reorganisation, rewrite of and addition to, many of the 
+  vignettes and README. (@djnavarro, 
+  [#14514](https://github.com/apache/arrow/issues/14514))  
+
+### Reading/writing data
+
+* New functions `open_csv_dataset()`, `open_tsv_dataset()`, and 
+  `open_delim_dataset()` all wrap `open_dataset()`- they don't provide new 
+  functionality, but allow for readr-style options to be supplied, making it 
+  simpler to switch between individual file-reading and dataset 
+  functionality. ([#33614](https://github.com/apache/arrow/issues/33614))
+* User-defined null values can be set when writing CSVs both as datasets 
+  and as individual files. (@wjones127, 
+  [#14679](https://github.com/apache/arrow/issues/14679))
+* The new `col_names` parameter allows specification of column names when 
+  opening a CSV dataset. (@wjones127, 
+  [#14705](https://github.com/apache/arrow/issues/14705))
+* The `parse_options`, `read_options`, and `convert_options` parameters for 
+  reading individual files (`read_*_arrow()` functions) and datasets 
+  (`open_dataset()` and the new `open_*_dataset()` functions) can be passed 
+  in as lists. ([#15270](https://github.com/apache/arrow/issues/15270))
+* File paths containing accents can be read by `read_csv_arrow()`. 
+  ([#14930](https://github.com/apache/arrow/issues/14930))
+
+### dplyr compatibility
+* New dplyr (1.1.0) function `join_by()` has been implemented for dplyr joins 
+  on Arrow objects (equality conditions only).  
+  ([#33664](https://github.com/apache/arrow/issues/33664))
+
+### Function bindings
+
+* The following functions can be used in queries on Arrow objects:
+  * `lubridate::with_tz()` and `lubridate::force_tz()` (@eitsupi, 
+  [#14093](https://github.com/apache/arrow/issues/14093))
+  * `stringr::str_remove()` and `stringr::str_remove_all()` 
+  ([#14644](https://github.com/apache/arrow/issues/14644))
+
+### Arrow object creation
+
+* Arrow Scalars can be created from `POSIXlt` objects. 
+  ([#15277](https://github.com/apache/arrow/issues/15277))
+* `Array$create()` can create Decimal arrays. 
+  ([#15211](https://github.com/apache/arrow/issues/15211))
+* `StructArray$create()` can be used to create StructArray objects. 
+  ([#14922](https://github.com/apache/arrow/issues/14922))
+
+### Installation
+
+* Improved offline installation using pre-downloaded binaries. 
+  (@pgramme, [#14086](https://github.com/apache/arrow/issues/14086))
+* The package can automatically link to system installations of the AWS SDK
+  for C++. (@kou, [#14235](https://github.com/apache/arrow/issues/14235))
+
+## Minor improvements and fixes
+
+* Calling `lubridate::as_datetime()` on Arrow objects can handle time in 
+  sub-seconds. (@eitsupi, 
+  [#13890](https://github.com/apache/arrow/issues/13890))
+* `head()` can be called after `as_record_batch_read()`. 
+  ([#14518](https://github.com/apache/arrow/issues/14518))
+* `dplyr::right_join()` correctly coalesces keys. 
+  ([#15077](https://github.com/apache/arrow/issues/15077))
+* Output is accurate when multiple `dplyr::group_by()`/`dplyr::summarise()` 
+  calls are used. ([#14905](https://github.com/apache/arrow/issues/14905))
+* `dplyr::summarize()` works with division when divisor is a variable. 
+  ([#14933](https://github.com/apache/arrow/issues/14933))
+* `as.Date()` can go from `timestamp[us]` to `timestamp[s]`. 
+  ([#14935](https://github.com/apache/arrow/issues/14935))
+* Creating an Array from an object bigger than 2^31 has correct length 
+  ([#14929](https://github.com/apache/arrow/issues/14929))
+* curl timeout policy can be configured for S3. 
+  ([#15166](https://github.com/apache/arrow/issues/15166))
+* Multiple changes to ensure compatibility with dplyr 1.1.0. 
+  (@lionel-, [#14948](https://github.com/apache/arrow/issues/14948))
+* rlang dependency must be at least version 1.0.0 because of 
+  `check_dots_empty()`. (@daattali, 
+  [#14744](https://github.com/apache/arrow/issues/14744))
+
+## Breaking changes
+
+* `map_batches()` is lazy by default. 
([#14521](https://github.com/apache/arrow/issues/14521))

Review Comment:
   Shouldn't this be at the top of the list? I think users are more 
interested in Breaking Changes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4903: Update main DataFusion README

2023-01-19 Thread GitBox


alamb commented on code in PR #4903:
URL: https://github.com/apache/arrow-datafusion/pull/4903#discussion_r1081347231


##
README.md:
##
@@ -57,9 +75,31 @@ a foundation for building new systems. Here are some example 
use cases:
 - _Easy to Embed_: Allowing extension at almost any point in its design, 
DataFusion can be tailored for your specific use case
 - _High Quality_: Extensively tested, both by itself and with the rest of the 
Arrow ecosystem, DataFusion can be used as the foundation for production 
systems.
 
+## Comparisons with other projects

Review Comment:
   Thank you @sundy-li  -- @waitingkuo  I wonder if you have time to update the 
benchmark results?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #33716: WIP: DO NOT MERGE: Apache Arrow Flight SQL adapter for PostgreSQL plan

2023-01-19 Thread GitBox


lidavidm commented on PR #33716:
URL: https://github.com/apache/arrow/pull/33716#issuecomment-1397066936

   Ok. And those clients use the wire protocol directly(ish), so they can't 
take advantage of the JDBC or ODBC Flight SQL drivers, presumably?
   
   ADBC may not help there, unless they're using DBAPI/SQLAlchemy in Python, 
where the ADBC bindings implement DBAPI and work(ish) with the rest of the 
stack. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4984: minor: Update data type support documentation

2023-01-19 Thread GitBox


alamb commented on code in PR #4984:
URL: https://github.com/apache/arrow-datafusion/pull/4984#discussion_r1081344325


##
.github/workflows/dev.yml:
##
@@ -42,8 +42,7 @@ jobs:
   node-version: "14"
   - name: Prettier check
 run: |
-  # if you encounter error, try rerun the command below with --write 
instead of --check
-  # and commit the changes
+  # if you encounter error, rerun the command below and commit the 
changes

Review Comment:
   drive-by cleanup (the command below actually uses `--write` rather than 
`--check` already)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4984: minor: Update data type support documentation

2023-01-19 Thread GitBox


alamb commented on code in PR #4984:
URL: https://github.com/apache/arrow-datafusion/pull/4984#discussion_r1081343683


##
docs/source/user-guide/sql/data_types.md:
##
@@ -25,13 +25,26 @@ execution. The SQL types from
 are mapped to [Arrow data 
types](https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html) 
according to the following table.
 This mapping occurs when defining the schema in a `CREATE EXTERNAL TABLE` 
command or when performing a SQL `CAST` operation.
 
+You can see the corresponding Arrow type for any SQL expression using
+the `arrow_typeof` function. For example:
+
+```sql
+❯ select arrow_typeof(interval '1 month');
++-+
+| arrowtypeof(IntervalYearMonth("1")) |
++-+
+| Interval(YearMonth) |
++-+
+```
+
 ## Character Types
 
 | SQL DataType | Arrow DataType |
| ------------ | -------------- |
 | `CHAR`   | `Utf8` |
 | `VARCHAR`| `Utf8` |
 | `TEXT`   | `Utf8` |
+| `STRING` | `Utf8` |

Review Comment:
   ```sql
   ❯ create table foo (x string) as select 'foo';
   0 rows in set. Query took 0.003 seconds.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] thisisnic merged pull request #33780: GH-33779: [R] Nightly builds (R 3.5 and 3.6) failing due to field refs test

2023-01-19 Thread GitBox


thisisnic merged PR #33780:
URL: https://github.com/apache/arrow/pull/33780


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4984: minor: Update data type support documentation

2023-01-19 Thread GitBox


alamb commented on code in PR #4984:
URL: https://github.com/apache/arrow-datafusion/pull/4984#discussion_r1081343250


##
docs/source/user-guide/sql/data_types.md:
##
@@ -52,11 +65,12 @@ This mapping occurs when defining the schema in a `CREATE 
EXTERNAL TABLE` comman
 
 ## Date/Time Types
 
-| SQL DataType | Arrow DataType                           |
-| ------------ | :--------------------------------------- |
-| `DATE`       | `Date32`                                  |
-| `TIME`       | `Time64(TimeUnit::Nanosecond)`            |
-| `TIMESTAMP`  | `Timestamp(TimeUnit::Nanosecond, None)`   |
+| SQL DataType | Arrow DataType                                                            |
+| ------------ | :------------------------------------------------------------------------ |
+| `DATE`       | `Date32`                                                                  |
+| `TIME`       | `Time64(TimeUnit::Nanosecond)`                                            |
+| `TIMESTAMP`  | `Timestamp(TimeUnit::Nanosecond, None)`                                   |
+| `INTERVAL`   | `Interval(IntervalUnit::YearMonth)` or `Interval(IntervalUnit::DayTime)`  |

Review Comment:
   Intervals work great:
   
   ```sql
   ❯ select now() + interval '1 minute';
   +----------------------------------+
   | now() + IntervalDayTime("6")     |
   +----------------------------------+
   | 2023-01-19T14:30:03.656542+00:00 |
   +----------------------------------+
   
   ❯ select arrow_typeof(interval '1 month');
   +-------------------------------------+
   | arrowtypeof(IntervalYearMonth("1")) |
   +-------------------------------------+
   | Interval(YearMonth)                 |
   +-------------------------------------+
   1 row in set. Query took 0.006 seconds.
   
   ❯ select arrow_typeof(interval '1 second');
   +--------------------------------------+
   | arrowtypeof(IntervalDayTime("1000")) |
   +--------------------------------------+
   | Interval(DayTime)                    |
   +--------------------------------------+
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] alamb opened a new pull request, #4984: minor: Update data type support documentation

2023-01-19 Thread GitBox


alamb opened a new pull request, #4984:
URL: https://github.com/apache/arrow-datafusion/pull/4984

   # Which issue does this PR close?
   
   N/A
   
   # Rationale for this change
   
   https://arrow.apache.org/datafusion/user-guide/sql/data_types.html is out of 
date, as we discovered while working on docs for IOx: 
https://github.com/influxdata/docs-v2/pull/4700
   
   # What changes are included in this PR?
   1. Update supported type
   2. Add note about `arrow_typeof` function
   
   # Are these changes tested?
   
   N/A (they are docs)
   
   # Are there any user-facing changes?
   
   Yes, docs


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] paleolimbot commented on a diff in pull request #33748: GH-33746: [R] Update NEWS.md for 11.0.0

2023-01-19 Thread GitBox


paleolimbot commented on code in PR #33748:
URL: https://github.com/apache/arrow/pull/33748#discussion_r1081335626


##
r/NEWS.md:
##
@@ -33,62 +33,57 @@
   functionality, but allow for readr-style options to be supplied, making it 
   simpler to switch between individual file-reading and dataset 
   functionality. (#33614)
-* User-defined null values can now be set when writing CSVs both as datasets 
+* User-defined null values can be set when writing CSVs both as datasets 
   and as individual files. (@wjones127, #14679)
 * The new `col_names` parameter allows specification of column names when 
   opening a CSV dataset. (@wjones127, #14705)
 * The `parse_options`, `read_options`, and `convert_options` parameters for 
-  reading individual files and datasets can now be passed in as lists. (#15270)
+  reading individual files (`read_*_arrow()` functions) and datasets 
+  (`open_dataset()` and the new `open_*_dataset()` functions) can be passed 
+  in as lists. (#15270)
 
 ### Function bindings
 
-The following functions can now be used in queries on Arrow objects:
+The following functions can be used in queries on Arrow objects:
 * `lubridate::with_tz()` and `lubridate::force_tz()` (@eitsupi, #14093)
 * `stringr::str_remove()` and `stringr::str_remove_all()` (#14644)
 
 ### Installation
 
 * Improved offline installation using pre-downloaded binaries. 
   (@pgramme, #14086)
-* The package can now automatically link to system installations of the AWS SDK
+* The package can automatically link to system installations of the AWS SDK
   for C++. (@kou, #14235)
 
 ### Other
 
 * New dplyr (1.1.0) function `join_by()` has been implemented for dplyr joins 
   on Arrow objects (equality conditions only).  (#33664)
-* StructArray objects can now be created directly via `StructArray$create()`. 
+* StructArray objects can be created directly via `StructArray$create()`. 
   (#14922)
-* curl timeout policy can now be configured for S3. (#15166)
+* curl timeout policy can be configured for S3. (#15166)
 
 ## Minor improvements and fixes
 
-* `map_batches()` now is lazy by default. (#14521)
-* Decimal arrays can now be created in `Array$create()` without
-  casting. (#15211)
-* Calling `lubridate::as_datetime()` on Arrow objects now can handle time in 
+* Decimal arrays can be created in `Array$create()` without casting. (#15211)
+* Calling `lubridate::as_datetime()` on Arrow objects can handle time in 
   sub-seconds. (@eitsupi, #13890)
-* `head()` can now be called after `as_record_batch_read()` without error. 
-  (#14518)
-* Fix for a bug in which `dplyr::right_join()` did not coalesce keys. (#15077)
-* Fix for a bug in output returned when multiple 
-  `dplyr::group_by()`/`dplyr::summarise()` calls are used. (#14905)
-* Fix for a bug in which `dplyr::summarize()` fails with division when divisor 
-  is a variable. (#14933)
-* Fix for a bug in which `as.Date()` fails going from `timestamp[us]` to
-  `timestamp[s]`. (#14935)
-* Fix for a bug in which creating an Array from an object bigger than 2^31 
-  results in an Array of length 0. (#14929)
-* Fix for a bug in which accents in file paths caused an error in 
-  `read_csv_arrow()`. (#14930)
-* Fix for a bug which prevented Arrow arrays of `POSIXlt` objects being 
created 
-  from Scalars. (#15277)
+* `head()` can be called after `as_record_batch_read()` without error. (#14518)
+* `dplyr::right_join()` correctly coalesces keys. (#15077)
+* output accurate when multiple `dplyr::group_by()`/`dplyr::summarise()` calls 
+  are used. (#14905)
+* `dplyr::summarize()` works with division when divisor is a variable. (#14933)
+* `as.Date()` can go from `timestamp[us]` to `timestamp[s]`. (#14935)
+* creating an Array from an object bigger than 2^31 has correct length (#14929)
+* file paths containing accents can be read by `read_csv_arrow()`. (#14930)
+* Arrow arrays of `POSIXlt` objects can be created from Scalars. (#15277)
 * Multiple changes to ensure compatibility with dplyr 1.1.0. (@lionel-, #14948)
+* rlang dependency must be at least version 1.0.0 because of 
+  `check_dots_empty()`. (@daattali, #14744)
 
 ## Breaking changes
 
-* rlang dependency must be at least version 1.0.0 because of 
-  `check_dots_empty()`. (@daattali, #14744)
+* `map_batches()` is lazy by default. (#14521)

Review Comment:
   ```suggestion
   * `map_batches()` is lazy by default: it now returns a `RecordBatchReader`
 instead of a list of `RecordBatch`es unless `lazy = FALSE`. (#14521)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] paleolimbot commented on issue #33701: [R] Link-time optimization reports violations of one-definition rule in the R package

2023-01-19 Thread GitBox


paleolimbot commented on issue #33701:
URL: https://github.com/apache/arrow/issues/33701#issuecomment-1397052533

   CRAN runs an LTO build as part of its check suite...if there are warnings 
during the linking, we get an email saying that we need to fix them in two 
weeks to "safely retain our package on CRAN".


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #33778: GH-33777: [R] Nightly builds failing due to dataset test not being skipped on builds without datasets module

2023-01-19 Thread GitBox


github-actions[bot] commented on PR #33778:
URL: https://github.com/apache/arrow/pull/33778#issuecomment-1397052029

   Revision: de1d764d08c6d051e9577ac4144ac4476f79a88c
   
   Submitted crossbow builds: [ursacomputing/crossbow @ 
actions-ae4cf10f4d](https://github.com/ursacomputing/crossbow/branches/all?query=actions-ae4cf10f4d)
   
   |Task|Status|
|----|------|
   
|test-r-offline-minimal|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-ae4cf10f4d-azure-test-r-offline-minimal)](https://github.com/ursacomputing/crossbow/runs/10754319230)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] paleolimbot commented on pull request #33778: GH-33777: [R] Nightly builds failing due to dataset test not being skipped on builds without datasets module

2023-01-19 Thread GitBox


paleolimbot commented on PR #33778:
URL: https://github.com/apache/arrow/pull/33778#issuecomment-1397048335

   @github-actions crossbow submit test-r-offline-minimal


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-rs] alamb opened a new issue, #3565: Should we write a "arrow-rs" update blog post?

2023-01-19 Thread GitBox


alamb opened a new issue, #3565:
URL: https://github.com/apache/arrow-rs/issues/3565

   **Which part is this question about**
   
   We often write DataFusion 
https://arrow.apache.org/blog/2023/01/19/datafusion-16.0.0/  updates as a way 
to help grow the community and awareness about arrow.
   
   The main arrow release does the same: 
https://github.com/apache/arrow-site/pull/300
   
   **Describe your question**
   I wonder if it would be useful to write an update for "the state of the 
arrow-rs project"
   
   There are all sorts of cool things to share: Bloom filters, smaller crates, 
parquet performance, arrow-flight, and many more that I am probably missing.
   
   Thoughts @viirya  / @tustvold  / @andygrove ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-rs] alamb commented on issue #53: [Parquet] Reading parquet file into an ndarray

2023-01-19 Thread GitBox


alamb commented on issue #53:
URL: https://github.com/apache/arrow-rs/issues/53#issuecomment-1397036453

   > Instead of creating multiple readers can it be done by single reader?
   
   That is an interesting question @Sach1nAgarwal  -- I think @tustvold  has 
some ideas on parallelized decode but I am not sure how concrete they are


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #33716: WIP: DO NOT MERGE: Apache Arrow Flight SQL adapter for PostgreSQL plan

2023-01-19 Thread GitBox


alamb commented on PR #33716:
URL: https://github.com/apache/arrow/pull/33716#issuecomment-1397033386

   > Yes, the ADBC driver wraps libpq and should let you work with other 
databases that use the PostgreSQL wire protocol, with the caveat that it has to 
convert the data (of course).
   
   I don't think I realized this -- I will check it out. Thank you @lidavidm 
   
   > I suppose you have a third goal: using the PostgreSQL 
protocol/drivers/clients as the lingua franca, placing it sort of in between 
the roles of Flight SQL and JDBC/ODBC? I don't think anyone's actively working 
towards that.
   
   I would phrase our goal as "we are building a server (InfluxDB IOx) and we 
want to allow connections from as many clients as possible but implement only 
FlightSQL on the server". Since **many** existing clients, for better or worse, 
use the postgres wire protocol, having some solution for them to speak to IOx 
without native FEBE support is compelling from our perspective
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #33660: GH-33659: [Developer Tools] Add definition of Breaking Change and Critical Fix

2023-01-19 Thread GitBox


jorisvandenbossche commented on code in PR #33660:
URL: https://github.com/apache/arrow/pull/33660#discussion_r1081303563


##
docs/source/developers/reviewing.rst:
##
@@ -255,3 +255,43 @@ Social aspects
 * Like any communication, code reviews are governed by the Apache
   `Code of Conduct `_.
   This applies to both reviewers and contributors.
+
+
+Labelling
+=
+
+While reviewing PRs, we should try to identify whether these changes need to be
+marked with one or both of the following labels:
+
+* **Critical Fix**: The change fixes either: (a) a security vulnerability;
+  (b) a bug that caused incorrect or invalid data to be produced;
+  or (c) a bug that causes a crash, though only if the API contract is upheld.
+  This is intended to mark fixes to issues that may affect users without their
+  knowledge. For this reason, fixing bugs that cause errors don't count, since 
+  those bugs are usually obvious. Bugs that cause crashes are considered 
critical
+  because they are a possible vector of Denial-of-Service attacks.
+* **Breaking Change**: The change breaks backwards compatibility in a public 
API.
+  For changes in C++, this does not include changes that simply break ABI
+  compatibility, except for the few places where we do guarantee ABI
+  compatibility (such as C Data Interface). Experimental APIs are *not*
+  exempt from this; they are just more likely to be associated with this tag.
+  
+Breaking changes and critical fixes are separate: breaking changes alter the
+API contract, while critical fixes make the implementation align with the
+existing API contract. For example, fixing a bug that caused a Parquet reader
+to skip rows containing the number 42 is a critical fix but not a breaking 
change,
+since the row skipping wasn't behavior a reasonable user would rely on.
+
+These labels are used in the release to highlight changes that users ought to 
be
+aware of when they consider upgrading library versions. Breaking changes help
+identify reasons when users may wish to wait to upgrade until they have time to
+adapt their code, while critical fixes highlight the risk in *not* upgrading.
+
+In addition, we use the following labels to indicate priority:
+
+* **Priority: Blocker**: Indicates the PR **must** be merged before the next

Review Comment:
   Personally, as a reviewer, I would find it useful to have those labels on 
the PR as well ..
   
   (but don't let this drag on this PR. It's a general issue we have right now 
with labeling and milestoning issues vs PRs)
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] lidavidm merged pull request #15223: GH-15203: [Java] Implement writing compressed files

2023-01-19 Thread GitBox


lidavidm merged PR #15223:
URL: https://github.com/apache/arrow/pull/15223


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] ursabot commented on pull request #33614: GH-33526: [R] Implement new function open_dataset_csv with signature more closely matching read_csv_arrow

2023-01-19 Thread GitBox


ursabot commented on PR #33614:
URL: https://github.com/apache/arrow/pull/33614#issuecomment-1396971858

   Benchmark runs are scheduled for baseline = 
04ffb1f740f1e868c08313fa9043070345d9b6f0 and contender = 
444dcb6779755fc33f3f81d647c188cf31abd23c. 
444dcb6779755fc33f3f81d647c188cf31abd23c is a master commit associated with 
this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] 
[ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/f6fd4fd5f7a94999a82da3fa4141e390...de10069ca83c4be8af5157ca778cf909/)
   [Failed] 
[test-mac-arm](https://conbench.ursa.dev/compare/runs/b902fb2824fa49c3a6ea24b4ab4219f4...69b4408990e94526a73e72face3867c4/)
   [Finished :arrow_down:0.26% :arrow_up:0.0%] 
[ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/259585af4f5049a7a31029ea64f09ed0...a4a8cbcfa31e4bcfaf59371815e3e3cc/)
   [Finished :arrow_down:0.19% :arrow_up:0.0%] 
[ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/19a391a76cbc4f778ed1f9c1afe42db9...7fe244edaf4c47f593f9f696c95b59ee/)
   Buildkite builds:
   [Finished] [`444dcb67` 
ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2216)
   [Finished] [`444dcb67` 
test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2230)
   [Finished] [`444dcb67` 
ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2199)
   [Finished] [`444dcb67` 
ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/)
   [Finished] [`04ffb1f7` 
ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2215)
   [Failed] [`04ffb1f7` 
test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2241)
   [Finished] [`04ffb1f7` 
ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2211)
   [Finished] [`04ffb1f7` 
ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2234)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only 
benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] rok commented on issue #33782: [Release] Vote email number of issues is querying JIRA and producing a wrong number

2023-01-19 Thread GitBox


rok commented on issue #33782:
URL: https://github.com/apache/arrow/issues/33782#issuecomment-1396971891

   GraphQL API is another option:
   ```
   curl -H 'Content-Type: application/json' -H "Authorization: bearer 
GITHUB_TOKEN" -X POST -d '{"query": "query {search(query: \"repo:apache/arrow 
is:issue is:closed milestone:11.0.0\", type: ISSUE) {issueCount}}"}' 
https://api.github.com/graphql | jq ".data.search.issueCount"
   ```
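   The same query can also be issued from a small Rust program. A minimal 
sketch, assuming the `reqwest` (with `blocking` and `json` features) and 
`serde_json` crates and a `GITHUB_TOKEN` environment variable:
   ```rust
   // Untested illustration of the GraphQL issue-count query above.
   use std::env;

   fn main() -> Result<(), Box<dyn std::error::Error>> {
       let token = env::var("GITHUB_TOKEN")?;
       let query = r#"query {
           search(query: "repo:apache/arrow is:issue is:closed milestone:11.0.0",
                  type: ISSUE) { issueCount }
       }"#;

       let resp: serde_json::Value = reqwest::blocking::Client::new()
           .post("https://api.github.com/graphql")
           .bearer_auth(token)
           // GitHub's API rejects requests without a User-Agent
           .header(reqwest::header::USER_AGENT, "arrow-release-script")
           .json(&serde_json::json!({ "query": query }))
           .send()?
           .json()?;

       println!("{}", resp["data"]["search"]["issueCount"]);
       Ok(())
   }
   ```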


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] ursabot commented on pull request #15213: GH-15212: [C++] fix sliced list array writing in ORC

2023-01-19 Thread GitBox


ursabot commented on PR #15213:
URL: https://github.com/apache/arrow/pull/15213#issuecomment-1396971385

   Benchmark runs are scheduled for baseline = 
2b50694c10e09e4a1343b62c6b5f44ad4403d0e1 and contender = 
04ffb1f740f1e868c08313fa9043070345d9b6f0. 
04ffb1f740f1e868c08313fa9043070345d9b6f0 is a master commit associated with 
this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] 
[ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/b69a7118eddb41a695cd537131c0bcf4...f6fd4fd5f7a94999a82da3fa4141e390/)
   [Failed] 
[test-mac-arm](https://conbench.ursa.dev/compare/runs/e62713d5ba6e4cbda2b0f634cc137de1...b902fb2824fa49c3a6ea24b4ab4219f4/)
   [Finished :arrow_down:0.51% :arrow_up:0.0%] 
[ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/8f9ba14983c34891ad8653cf430cb4b3...259585af4f5049a7a31029ea64f09ed0/)
   [Finished :arrow_down:0.22% :arrow_up:0.0%] 
[ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/895eb1a43a284e71b5d818b1ad9d2986...19a391a76cbc4f778ed1f9c1afe42db9/)
   Buildkite builds:
   [Finished] [`04ffb1f7` 
ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2215)
   [Failed] [`04ffb1f7` 
test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2241)
   [Finished] [`04ffb1f7` 
ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2211)
   [Finished] [`04ffb1f7` 
ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2234)
   [Finished] [`2b50694c` 
ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2214)
   [Failed] [`2b50694c` 
test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2240)
   [Finished] [`2b50694c` 
ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2210)
   [Finished] [`2b50694c` 
ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2233)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only 
benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] ozankabak commented on a diff in pull request #4972: Simplify GroupByHash implementation (to prepare for more work)

2023-01-19 Thread GitBox


ozankabak commented on code in PR #4972:
URL: https://github.com/apache/arrow-datafusion/pull/4972#discussion_r1081247158


##
datafusion/core/src/physical_plan/aggregates/row_hash.rs:
##
@@ -219,91 +221,76 @@ impl GroupedHashAggregateStream {
 batch_size,
 row_group_skip_position: 0,
 indices: [normal_agg_indices, row_agg_indices],
-};
+})
+}
+}
 
-let stream = futures::stream::unfold(inner, |mut this| async move {
-let elapsed_compute = this.baseline_metrics.elapsed_compute();
+impl Stream for GroupedHashAggregateStream {
+type Item = ArrowResult<RecordBatch>;
 
-loop {
-let result: ArrowResult<Option<RecordBatch>> =
-match this.input.next().await {
+fn poll_next(
+mut self: std::pin::Pin<&mut Self>,
+cx: &mut Context<'_>,
+) -> Poll<Option<Self::Item>> {
+let elapsed_compute = self.baseline_metrics.elapsed_compute().clone();
+
+loop {
+match self.exec_state {
+ExecutionState::ReadingInput => {
+match ready!(self.input.poll_next_unpin(cx)) {
+// new batch to aggregate
 Some(Ok(batch)) => {
 let timer = elapsed_compute.timer();
-let result = group_aggregate_batch(
-,
-_state,
-_by,
-_aggr_expr,
- this.row_accumulators,
- this.row_converter,
-this.row_aggr_layout.clone(),
-batch,
- this.row_aggr_state,
-_aggregate_expressions,
-_aggregate_expressions,
-);
-
+let result = self.group_aggregate_batch(batch);
 timer.done();
 
 // allocate memory
 // This happens AFTER we actually used the memory, 
but simplifies the whole accounting and we are OK with
 // overshooting a bit. Also this means we either 
store the whole record batch or not.
 match result.and_then(|allocated| {
-
this.row_aggr_state.reservation.try_grow(allocated)
+
self.row_aggr_state.reservation.try_grow(allocated)
 }) {
-Ok(_) => continue,
-Err(e) => 
Err(ArrowError::ExternalError(Box::new(e))),
+Ok(_) => {}
+Err(e) => {
+return Poll::Ready(Some(Err(
+ArrowError::ExternalError(Box::new(e)),
+)))
+}
 }
 }

Review Comment:
   Since the `Ok` case is a no-op, an `if let Err(e) = ...` seems to be more 
idiomatic here
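   A standalone sketch of the idiom (`try_grow` here is a made-up stand-in for 
the fallible reservation call in the diff above):
   ```rust
   // When the Ok arm is a no-op, `if let Err(e)` avoids the empty match arm.
   fn try_grow(bytes: usize) -> Result<(), String> {
       if bytes < 1024 {
           Ok(())
       } else {
           Err(format!("cannot reserve {bytes} bytes"))
       }
   }

   fn main() {
       // instead of: match try_grow(n) { Ok(_) => {}, Err(e) => { ... } }
       if let Err(e) = try_grow(4096) {
           eprintln!("would return Poll::Ready(Some(Err(..))): {e}");
       }
   }
   ```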



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #33781: GH-33723: [C++] re2::RE2::RE2() result must be checked

2023-01-19 Thread GitBox


github-actions[bot] commented on PR #33781:
URL: https://github.com/apache/arrow/pull/33781#issuecomment-1396950828

   * Closes: #33723


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] mustafasrepo closed pull request #4982: Use arrow concat_batches instead of custom non-owning merge_batches

2023-01-19 Thread GitBox


mustafasrepo closed pull request #4982: Use arrow concat_batches instead of 
custom non-owning merge_batches
URL: https://github.com/apache/arrow-datafusion/pull/4982


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-adbc] lidavidm commented on a diff in pull request #355: feat(python): add Flight SQL driver using Go library

2023-01-19 Thread GitBox


lidavidm commented on code in PR #355:
URL: https://github.com/apache/arrow-adbc/pull/355#discussion_r1081224490


##
docs/source/driver/go/flight_sql.rst:
##
@@ -25,10 +25,19 @@ The Flight SQL Driver provides access to any database 
implementing a
 Installation
 
 
-The Flight SQL driver is shipped as part of the Arrow C++ libraries
-and PyArrow.  See the `main Arrow project website
-`_ for instructions.  Version >= 11
-is required.
+The Flight SQL driver is shipped as a standalone library.

Review Comment:
   Filed https://github.com/apache/arrow-adbc/issues/359 to update this more 
thoroughly



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #4804: Blog post about datafusion 16 release

2023-01-19 Thread GitBox


alamb commented on issue #4804:
URL: 
https://github.com/apache/arrow-datafusion/issues/4804#issuecomment-1396922128

   Rendered site: https://arrow.apache.org/blog/2023/01/19/datafusion-16.0.0/


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] alamb closed issue #4804: Blog post about datafusion 16 release

2023-01-19 Thread GitBox


alamb closed issue #4804: Blog post about datafusion 16 release
URL: https://github.com/apache/arrow-datafusion/issues/4804


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] DDtKey opened a new pull request, #4983: fix(4981): incorrect error wrapping in `OnceFut`

2023-01-19 Thread GitBox


DDtKey opened a new pull request, #4983:
URL: https://github.com/apache/arrow-datafusion/pull/4983

   # Which issue does this PR close?
   
   
   Closes #4981
   
   # Rationale for this change
   
   
   
   # What changes are included in this PR?
   
   
   
   # Are these changes tested?
   
   I've added a test which failed before these changes. 
   
   # Are there any user-facing changes?
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] ygf11 commented on pull request #4944: Only add outer filter once when transforming exists/in subquery to join

2023-01-19 Thread GitBox


ygf11 commented on PR #4944:
URL: 
https://github.com/apache/arrow-datafusion/pull/4944#issuecomment-1396904416

   > Looks like the method's comment is out of date. Could you please also fix 
them in this PR ?
   
   Fixed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] OfekShilon commented on issue #33784: [R] writing/reading a data.frame with column class 'list' changes column class

2023-01-19 Thread GitBox


OfekShilon commented on issue #33784:
URL: https://github.com/apache/arrow/issues/33784#issuecomment-1396903597

   A repro with no use of `tibble`:
   ```r
    df <- data.frame(a=I(list(list(1), list(8))))
   class(df$a[[1]])
   # [1] "list"
   
   tmpf <- tempfile()
   arrow::write_feather(df, tmpf)
   df2 <- arrow::read_feather(tmpf)
   class(df2$a[[1]])
   #[1] "arrow_list""vctrs_list_of" "vctrs_vctr""list" 
   
   unlink(tmpf)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #33785: GH-33741: [Python] Address docstrings in Data Types Factory Functions

2023-01-19 Thread GitBox


github-actions[bot] commented on PR #33785:
URL: https://github.com/apache/arrow/pull/33785#issuecomment-1396903229

   * Closes: #33741


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] AlenkaF opened a new pull request, #33785: GH-33741: [Python] Address docstrings in Data Types Factory Functions

2023-01-19 Thread GitBox


AlenkaF opened a new pull request, #33785:
URL: https://github.com/apache/arrow/pull/33785

   ### Rationale for this change
   Ensure docstrings for [Data Types Factory 
Functions](https://arrow.apache.org/docs/python/api/datatypes.html#factory-functions)
 have an Examples section.
   
   ### What changes are included in this PR?
   Docstrings are added to listed Data Types Factory Functions.
   
   ### Are these changes tested?
   Yes, locally with `pytest --doctest-cython --disable-warnings` pyarrow and 
on the CI with `Python / AMD64 Conda Python 3.9 Sphinx & Numpydoc` build.
   
   ### Are there any user-facing changes?
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] mustafasrepo opened a new pull request, #4982: Use arrow concat_batches instead of custom non-owning merge_batches

2023-01-19 Thread GitBox


mustafasrepo opened a new pull request, #4982:
URL: https://github.com/apache/arrow-datafusion/pull/4982

   # Which issue does this PR close?
   
   
   
   Closes #.
   
   # Rationale for this change
   
   
   
   With the recent change to arrow's `concat_batches`, the `merge_batches` and 
`merge_multiple_batches` functions became obsolete. This PR removes these 
obsolete functions.
   # What changes are included in this PR?
   
   
   
   # Are these changes tested?
   
   
   
   # Are there any user-facing changes?
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] ursabot commented on pull request #4924: Unify Row hash and hash implementation

2023-01-19 Thread GitBox


ursabot commented on PR #4924:
URL: 
https://github.com/apache/arrow-datafusion/pull/4924#issuecomment-1396885272

   Benchmark runs are scheduled for baseline = 
e6a050058bd704f73b38106b7abf21dc4539eebc and contender = 
96cf046be57bf09548d51f50d0bc964904bcec7d. 
96cf046be57bf09548d51f50d0bc964904bcec7d is a master commit associated with 
this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported 
on ec2-t3-xlarge-us-east-2] 
[ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/3ff11d0e49df4d6fb45ef0ae53c2ead6...ddbe6458091b4b95994b991e87c68b8b/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported 
on test-mac-arm] 
[test-mac-arm](https://conbench.ursa.dev/compare/runs/ed7c429e3b0c4cba9f89dbfdcfccd277...34270373e32849ed8db9900b9262b751/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported 
on ursa-i9-9960x] 
[ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/daf0263f054a46b08aa886890e6e7d74...a71cf0d1192049489917195356b46b9f/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported 
on ursa-thinkcentre-m75q] 
[ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/ff2b3d7dd8eb4bdc85619121333c3ef3...d31807daee1b497680a6bb93e6e23577/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only 
benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4972: Simplify GroupByHash implementation (to prepare for more work)

2023-01-19 Thread GitBox


alamb commented on code in PR #4972:
URL: https://github.com/apache/arrow-datafusion/pull/4972#discussion_r1081176906


##
datafusion/core/src/physical_plan/aggregates/row_hash.rs:
##
@@ -115,6 +106,14 @@ struct GroupedHashAggregateStreamInner {
 indices: [Vec<Range<usize>>; 2],
 }
 
+#[derive(Debug)]
+/// tracks what phase the aggregation is in
+enum ExecutionState {

Review Comment:
   This used to be tracked using several multi-level `match` statement and a 
`fused` inner stream. Now it is represented explicitly in this stream
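   For readers unfamiliar with the pattern, a minimal self-contained sketch of 
a hand-rolled state-machine `Stream` (toy types only, assuming the `futures` 
and `tokio` crates; this is not DataFusion's actual code):
   ```rust
   use std::pin::Pin;
   use std::task::{Context, Poll};

   use futures::{Stream, StreamExt};

   // Made-up states mirroring the ReadingInput -> output -> Done idea.
   enum ExecState {
       Reading,
       Emitting,
       Done,
   }

   // A toy stream that sums an input iterator, then yields one total.
   struct SumStream {
       state: ExecState,
       total: u64,
       input: std::vec::IntoIter<u64>,
   }

   impl Stream for SumStream {
       type Item = u64;

       fn poll_next(
           mut self: Pin<&mut Self>,
           _cx: &mut Context<'_>,
       ) -> Poll<Option<Self::Item>> {
           loop {
               match self.state {
                   // drain the input, accumulating internal state
                   ExecState::Reading => match self.input.next() {
                       Some(v) => self.total += v,
                       None => self.state = ExecState::Emitting,
                   },
                   // input exhausted: emit the result exactly once
                   ExecState::Emitting => {
                       self.state = ExecState::Done;
                       return Poll::Ready(Some(self.total));
                   }
                   ExecState::Done => return Poll::Ready(None),
               }
           }
       }
   }

   #[tokio::main]
   async fn main() {
       let mut s = SumStream {
           state: ExecState::Reading,
           total: 0,
           input: vec![1, 2, 3].into_iter(),
       };
       while let Some(total) = s.next().await {
           println!("total = {total}"); // prints 6
       }
   }
   ```
   Keeping the state in an explicit enum makes each `poll_next` transition 
visible, which is the readability win over the nested `unfold` closure.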



##
datafusion/core/src/physical_plan/aggregates/row_hash.rs:
##
@@ -75,19 +75,10 @@ use hashbrown::raw::RawTable;
 /// [Compact]: datafusion_row::layout::RowType::Compact
 /// [WordAligned]: datafusion_row::layout::RowType::WordAligned
 pub(crate) struct GroupedHashAggregateStream {
-stream: BoxStream<'static, ArrowResult<RecordBatch>>,
-schema: SchemaRef,
-}
-
-/// Actual implementation of [`GroupedHashAggregateStream`].
-///
-/// This is wrapped into yet another struct because we need to interact with 
the async memory management subsystem

Review Comment:
   this comment about another struct for memory management is out of date and 
so I folded `GroupedHashAggregateStreamInner` directly into 
`GroupedHashAggregateStream`



##
datafusion/core/src/physical_plan/aggregates/row_hash.rs:
##
@@ -313,222 +300,227 @@ impl RecordBatchStream for GroupedHashAggregateStream {
 }
 }
 
-/// Perform group-by aggregation for the given [`RecordBatch`].
-///
-/// If successfull, this returns the additional number of bytes that were 
allocated during this process.
-///
-/// TODO: Make this a member function of [`GroupedHashAggregateStream`]

Review Comment:
   DONE!



##
datafusion/core/src/physical_plan/aggregates/row_hash.rs:
##
@@ -219,91 +221,76 @@ impl GroupedHashAggregateStream {
 batch_size,
 row_group_skip_position: 0,
 indices: [normal_agg_indices, row_agg_indices],
-};
+})
+}
+}
 
-let stream = futures::stream::unfold(inner, |mut this| async move {
-let elapsed_compute = this.baseline_metrics.elapsed_compute();
+impl Stream for GroupedHashAggregateStream {
+type Item = ArrowResult<RecordBatch>;
 
-loop {
-let result: ArrowResult<Option<RecordBatch>> =
-match this.input.next().await {
+fn poll_next(
+mut self: std::pin::Pin<&mut Self>,
+cx: &mut Context<'_>,
+) -> Poll<Option<Self::Item>> {
+let elapsed_compute = self.baseline_metrics.elapsed_compute().clone();
+
+loop {
+match self.exec_state {
+ExecutionState::ReadingInput => {
+match ready!(self.input.poll_next_unpin(cx)) {
+// new batch to aggregate
 Some(Ok(batch)) => {
 let timer = elapsed_compute.timer();
-let result = group_aggregate_batch(
-,
-_state,
-_by,
-_aggr_expr,
- this.row_accumulators,
- this.row_converter,
-this.row_aggr_layout.clone(),
-batch,
- this.row_aggr_state,
-_aggregate_expressions,
-_aggregate_expressions,
-);
-
+let result = self.group_aggregate_batch(batch);
 timer.done();
 
 // allocate memory
 // This happens AFTER we actually used the memory, 
but simplifies the whole accounting and we are OK with
 // overshooting a bit. Also this means we either 
store the whole record batch or not.
 match result.and_then(|allocated| {
-
this.row_aggr_state.reservation.try_grow(allocated)
+
self.row_aggr_state.reservation.try_grow(allocated)
 }) {
-Ok(_) => continue,
-Err(e) => 
Err(ArrowError::ExternalError(Box::new(e))),
+Ok(_) => {}
+Err(e) => {
+return Poll::Ready(Some(Err(
+ArrowError::ExternalError(Box::new(e)),
+)))
+}
 }
 }
-Some(Err(e)) => Err(e),
+// inner had error, return to caller
+Some(Err(e)) => return Poll::Ready(Some(Err(e))),
+// inner is done, 

[GitHub] [arrow-datafusion] alamb commented on pull request #4924: Unify Row hash and hash implementation

2023-01-19 Thread GitBox


alamb commented on PR #4924:
URL: 
https://github.com/apache/arrow-datafusion/pull/4924#issuecomment-1396866343

   Thanks again -- this is going to be great!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] alamb merged pull request #4924: Unify Row hash and hash implementation

2023-01-19 Thread GitBox


alamb merged PR #4924:
URL: https://github.com/apache/arrow-datafusion/pull/4924


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] alamb closed issue #2723: Consolidate GroupByHash implementations `row_hash.rs` and `hash.rs` (remove duplication)

2023-01-19 Thread GitBox


alamb closed issue #2723: Consolidate  GroupByHash implementations 
`row_hash.rs` and `hash.rs` (remove duplication)
URL: https://github.com/apache/arrow-datafusion/issues/2723


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3563: Implement Extend for ArrayBuilder (#1841)

2023-01-19 Thread GitBox


tustvold commented on code in PR #3563:
URL: https://github.com/apache/arrow-rs/pull/3563#discussion_r1081165909


##
arrow-array/src/builder/generic_bytes_dictionary_builder.rs:
##
@@ -255,12 +255,34 @@ where
 Ok(key)
 }
 
+/// Infallibly append a value to this builder
+///
+/// # Panics
+///
+/// Panics if the resulting length of the dictionary values array would 
exceed `T::Native::MAX`
+pub fn append_value(&mut self, value: impl AsRef<T::Native>) {

Review Comment:
   These are the new APIs for #3562 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-rs] alamb commented on a diff in pull request #3514: Use array_value_to_string in arrow-csv

2023-01-19 Thread GitBox


alamb commented on code in PR #3514:
URL: https://github.com/apache/arrow-rs/pull/3514#discussion_r1081145188


##
arrow-cast/src/display.rs:
##
@@ -309,9 +379,10 @@ fn append_map_field_string(
 ///
 /// Note this function is quite inefficient and is unlikely to be
 /// suitable for converting large arrays or record batches.
-pub fn array_value_to_string(
+fn array_value_to_string_internal(
 column: &ArrayRef,
 row: usize,
+datetime_format_opt: &Option<String>,

Review Comment:
   FYI I think the typical rust way to do this (and not require an owned 
`String`) is:
   
   ```suggestion
   datetime_format_opt: Option<&str>,
   ```
   
   And then at callsites with `val: Option<String>` instead of:
   
   ```rust
   array_value_to_string_internal(foo, bar, &val)
   ```
   
   You do :
   
   ```rust
   array_value_to_string_internal(foo, bar, val.as_ref())
   ```
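   For illustration, a self-contained toy showing how an owned `Option<String>` 
flows into a borrowed parameter (`render` and `val` are made-up names; with an 
`Option<&str>` parameter the call site uses `as_deref()`):
   ```rust
   // Borrowing the format string instead of requiring an owned String.
   fn render(value: i64, format: Option<&str>) -> String {
       match format {
           Some(f) => format!("{f}: {value}"),
           None => value.to_string(),
       }
   }

   fn main() {
       let val: Option<String> = Some("row".to_string());
       // as_deref() turns Option<String> into Option<&str> without cloning
       println!("{}", render(7, val.as_deref()));
       println!("{}", render(7, None));
   }
   ```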
   



##
arrow-csv/src/writer.rs:
##
@@ -430,33 +364,45 @@ impl WriterBuilder {
 self
 }
 
+/// Whether to use RFC3339 format for date/time/timestamps

Review Comment:
   ```suggestion
   ///  Use RFC3339 format for date/time/timestamps by clearing all 
   /// date/time specific formats.
   ```



##
arrow-csv/src/writer.rs:
##
@@ -601,75 +547,9 @@ sed do eiusmod 
tempor,-556132.25,1,,2019-04-18T02:45:55.55500,23:46:03,foo
 file.read_to_end( buffer).unwrap();
 
 assert_eq!(
-"Lorem ipsum dolor sit amet|123.564532|3|true|12:20:34 
AM\nconsectetur adipiscing elit|NULL|2|false|06:51:20 AM\nsed do eiusmod 
tempor|-556132.25|1|NULL|11:46:03 PM\n"
+"Lorem ipsum dolor sit 
amet|123.564532|3|true|00:20:34\nconsectetur adipiscing 
elit|NULL|2|false|06:51:20\nsed do eiusmod tempor|-556132.25|1|NULL|23:46:03\n"
 .to_string(),
 String::from_utf8(buffer).unwrap()
 );
 }
-
-#[test]
-fn test_conversion_consistency() {

Review Comment:
   Why was this test removed? It seems useful, doesn't it?



##
arrow-cast/src/display.rs:
##
@@ -510,6 +591,21 @@ pub fn array_value_to_string(
 }
 }
 
+pub fn array_value_to_string(
+column: &ArrayRef,
+row: usize,
+) -> Result<String, ArrowError> {
+array_value_to_string_internal(column, row, &None)
+}
+
+pub fn datetime_array_value_to_string(
+column: &ArrayRef,
+row: usize,
+format: &Option<String>,

Review Comment:
   ```suggestion
   format: Option<&str>,
   ```



##
arrow-cast/src/display.rs:
##
@@ -510,6 +591,21 @@ pub fn array_value_to_string(
 }
 }
 
+pub fn array_value_to_string(
+column: &ArrayRef,
+row: usize,
+) -> Result<String, ArrowError> {
+array_value_to_string_internal(column, row, &None)
+}
+
+pub fn datetime_array_value_to_string(

Review Comment:
   I would love to revamp this display code to be easier to use / more 
performant but I think that is outside the scope of this PR. I'll try and file 
a ticket explaining what I am thinking



##
arrow-cast/src/display.rs:
##
@@ -510,6 +591,21 @@ pub fn array_value_to_string(
 }
 }
 
+pub fn array_value_to_string(
+column: &ArrayRef,
+row: usize,
+) -> Result<String, ArrowError> {
+array_value_to_string_internal(column, row, &None)
+}
+
+pub fn datetime_array_value_to_string(

Review Comment:
   I am somewhat worried about adding a new public API like this, but I see it 
is needed to call in arrow-csv
   
   



##
arrow/tests/csv.rs:
##
@@ -57,8 +57,8 @@ fn test_export_csv_timestamps() {
 drop(writer);
 
 let left = "c1,c2
-2019-04-18T20:54:47.37800+10:00,2019-04-18T10:54:47.37800
-2021-10-30T17:59:07.0+11:00,2021-10-30T06:59:07.0\n";
+2019-04-18T20:54:47.378+10:00,2019-04-18T10:54:47.37800

Review Comment:
   this sure looks better  



##
arrow-csv/src/writer.rs:
##
@@ -430,33 +364,45 @@ impl WriterBuilder {
 self
 }
 
+/// Whether to use RFC3339 format for date/time/timestamps
+pub fn with_rfc3339(mut self, use_rfc3339: bool) -> Self {
+self.use_rfc3339 = use_rfc3339;

Review Comment:
   Rather than an extra flag, what would you think about having this function 
simply set
   ```rust
 self.date_format = None;
 self.datetime_format = None; 
   
   ```
   ?
   
   As a separate flag it might be confusing to understand what happens if 
`use_rfc3339` is set and then a user also set `self.date_format` or something
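   A toy sketch of that alternative (a made-up builder, not the actual 
arrow-csv `WriterBuilder`): expressing RFC3339 as "no custom formats" leaves 
no flag that can disagree with an explicitly set format:
   ```rust
   // Clearing the format fields instead of keeping a separate boolean flag.
   #[derive(Default, Debug)]
   struct ToyWriterBuilder {
       date_format: Option<String>,
       datetime_format: Option<String>,
   }

   impl ToyWriterBuilder {
       fn with_datetime_format(mut self, format: String) -> Self {
           self.datetime_format = Some(format);
           self
       }

       // RFC3339 output expressed as "no custom formats": the last call
       // wins, so there is no ambiguous flag/format combination.
       fn with_rfc3339(mut self) -> Self {
           self.date_format = None;
           self.datetime_format = None;
           self
       }
   }

   fn main() {
       let b = ToyWriterBuilder::default()
           .with_datetime_format("%Y-%m-%d %H:%M".to_string())
           .with_rfc3339();
       println!("{b:?}"); // both formats are None => RFC3339
   }
   ```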



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] MarcoGorelli commented on pull request #14662: ARROW-16544: [DevTools] Add linting to Cython files

2023-01-19 Thread GitBox


MarcoGorelli commented on PR #14662:
URL: https://github.com/apache/arrow/pull/14662#issuecomment-1396842946

   Hi - is there still interest in addressing 
https://issues.apache.org/jira/browse/ARROW-16544 ?
   
   If not, I'll close, no worries


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] thisisnic commented on pull request #33748: GH-33746: [R] Update NEWS.md for 11.0.0

2023-01-19 Thread GitBox


thisisnic commented on PR #33748:
URL: https://github.com/apache/arrow/pull/33748#issuecomment-1396828704

   I've added in the URLs (this won't work automatically yet as things are 
configured to work with JIRA), ditched the "other" subsection, and moved some 
items up in the hierarchy to "new feature" subsections.  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] ygf11 commented on pull request #4923: Support join-filter pushdown for semi/anti join

2023-01-19 Thread GitBox


ygf11 commented on PR #4923:
URL: 
https://github.com/apache/arrow-datafusion/pull/4923#issuecomment-1396823473

   > Maybe we can add some error cases showing that SELECT t1.id FROM t1 SEMI 
JOIN t2 ON (t1.id = t2.id) WHERE t2.value = 5 generates an error
   
   > It would also be good to cover such a predicate in the ON clause (which 
does make sense and could be pushed down I think):
   
   > SELECT t1.id FROM t1 SEMI JOIN t2 ON (t1.id = t2.id AND t2.value = 5)
   
   Added.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] joosthooz commented on pull request #33738: GH-33737: [C++] simplify exec plan tracing

2023-01-19 Thread GitBox


joosthooz commented on PR #33738:
URL: https://github.com/apache/arrow/pull/33738#issuecomment-1396819003

   Nice, I think the WriteAndCheckBackpressure span is important because that's 
where the backpressure is checked and also it performs some work combining 
staged batches (in `PopStagedBatch`). Maybe that even deserves its own span 
because it is not always called (only if enough rows have arrived in the `Push` 
function).
   Shouldn't there also be a span created in the delta that gets submitted to 
the IO executor in `WriteNext`? That's where the actual writing (and parquet 
encoding & compression) is being performed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-julia] ericphanson commented on pull request #381: Tag new version dev/release/release.sh

2023-01-19 Thread GitBox


ericphanson commented on PR #381:
URL: https://github.com/apache/arrow-julia/pull/381#issuecomment-1396815064

   TagBot is enabled/running here, but it is having issues across many repos 
these days (lots of 502 errors from GitHub), including this one, so it is not 
very reliable at present, unfortunately.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-rs] tustvold opened a new pull request, #3564: Improve GenericBytesBuilder offset overflow panic message (#139)

2023-01-19 Thread GitBox


tustvold opened a new pull request, #3564:
URL: https://github.com/apache/arrow-rs/pull/3564

   # Which issue does this PR close?
   
   
   
   Closes #139
   
   # Rationale for this change

   
   
   # What changes are included in this PR?
   
   
   
   Adds a panic message instead of `unwrap()`
   
   # Are there any user-facing changes?
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-rs] alamb commented on issue #3562: Panic on Key Overflow in Dictionary Builders

2023-01-19 Thread GitBox


alamb commented on issue #3562:
URL: https://github.com/apache/arrow-rs/issues/3562#issuecomment-1396798368

   My personal opinion is that the client code should be able to avoid panics 
if the user desires, but that doesn't mean the API needs to be fallible. 
   
   For example, as long as it is possible to pre-compute / estimate when a 
panic would occur (either explained via docs or some sort of calculation with 
https://docs.rs/arrow/31.0.0/arrow/array/struct.ArrayData.html#method.get_slice_memory_size
 or something) then  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] DDtKey opened a new issue, #4981: Incorrect nested error wrapped to `ArrowError:External` variant

2023-01-19 Thread GitBox


DDtKey opened a new issue, #4981:
URL: https://github.com/apache/arrow-datafusion/issues/4981

   **Describe the bug**
   [This line of 
code](https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/physical_plan/joins/utils.rs#L695)
 makes it impossible to get the real reason for an error.
   ```rs
   .map_err(|e| ArrowError::ExternalError(e.to_string().into())),
   ```
   
   It can produce an error like: `External error: Arrow error: External error: 
Arrow error: External error: Arrow error: External error: Arrow error: Csv 
error: ...`
   
   That's actually really bad, because there could be useful info under the 
`Csv error`. But instead I receive a text error with too many nested `External` 
variants.
   
   **To Reproduce**
   Occurrence depends on the execution plan, and the data that caused this is 
rather large.
   I believe it should be possible to cover this with a simple test of the error type:
   ```rs
   #[tokio::test]
   async fn check_error_nesting() {
   let once_fut = OnceFut::<()>::new(async {
   Err(DataFusionError::ArrowError(ArrowError::CsvError("some 
error".to_string(
   });
   
   struct TestFut(OnceFut<()>);
   impl Future for TestFut {
   type Output = ArrowResult<()>;
   
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
    match ready!((&mut self.0).get(cx)) {
   Ok(()) => Poll::Ready(Ok(())),
   Err(e) => Poll::Ready(Err(e)),
   }
   }
   }
   
   let res = TestFut(once_fut).await;
   let err = res.expect_err("future always return err");
   // ... some checks of error type
   
   }
   ```
   
   
   **Expected behavior**
   I expect to be able to take error type instead of message in string format.
   However, I see that it's reference and need to figure out right way to 
handle this case. But current one breaks error handling system for datafustion.
   
   **Additional context**
   I would like to help fix this behavior; however, we need to discuss the right 
approach
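   For reference, a minimal standalone sketch (toy error types, not 
DataFusion's) of why boxing the error value preserves more information than 
boxing its string: the concrete type survives and can still be downcast:
   ```rust
   use std::error::Error;
   use std::fmt;

   #[derive(Debug)]
   struct CsvError(String);

   impl fmt::Display for CsvError {
       fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
           write!(f, "Csv error: {}", self.0)
       }
   }
   impl Error for CsvError {}

   fn wrap_value(e: CsvError) -> Box<dyn Error + Send + Sync> {
       Box::new(e) // type information preserved
   }

   fn wrap_string(e: CsvError) -> Box<dyn Error + Send + Sync> {
       e.to_string().into() // collapses to a plain string
   }

   fn main() {
       let boxed = wrap_value(CsvError("bad row".into()));
       // downcasting recovers the original error...
       assert!(boxed.downcast_ref::<CsvError>().is_some());

       let stringly = wrap_string(CsvError("bad row".into()));
       // ...but not after stringification
       assert!(stringly.downcast_ref::<CsvError>().is_none());
   }
   ```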
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] joosthooz commented on a diff in pull request #33738: GH-33737: [C++] simplify exec plan tracing

2023-01-19 Thread GitBox


joosthooz commented on code in PR #33738:
URL: https://github.com/apache/arrow/pull/33738#discussion_r1081097788


##
cpp/src/arrow/compute/exec/sink_node.cc:
##
@@ -335,19 +327,13 @@ class ConsumingSinkNode : public ExecNode, public 
BackpressureControl {
   void Resume() override { inputs_[0]->ResumeProducing(this, 
++backpressure_counter_); }
 
   void StopProducing() override {
-EVENT(span_, "StopProducing");
 if (input_counter_.Cancel()) {
   Finish(Status::OK());
 }
   }
 
   void InputReceived(ExecNode* input, ExecBatch batch) override {
-EVENT(span_, "InputReceived", {{"batch.length", batch.length}});
-util::tracing::Span span;
-START_COMPUTE_SPAN_WITH_PARENT(
-span, span_, "InputReceived",
-{{"node.label", label()}, {"batch.length", batch.length}});
-
+auto scope = TraceInputReceived(batch);

Review Comment:
   Yes, I think that makes a lot of sense especially because the dataset writer 
submits tasks to the IO executor that can run in parallel. All that would 
otherwise be lost behind a single ConsumingSinkNode span



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-rs] tustvold opened a new pull request, #3563: Implement Extend for ArrayBuilder (#1841)

2023-01-19 Thread GitBox


tustvold opened a new pull request, #3563:
URL: https://github.com/apache/arrow-rs/pull/3563

   _Draft pending #3562_
   
   # Which issue does this PR close?
   
   
   
   Closes #1841
   
   # Rationale for this change

   
   
   # What changes are included in this PR?
   
   
   
   # Are there any user-facing changes?
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-rs] tustvold opened a new issue, #3562: Panic on Key Overflow in Dictionary Builders

2023-01-19 Thread GitBox


tustvold opened a new issue, #3562:
URL: https://github.com/apache/arrow-rs/issues/3562

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   
   I suspect this will be a controversial issue, but there are a couple of reasons 
for proposing this
   
   * Rust collections and accompanying traits like `FromIterator` and `Extend` 
assume infallibility
   * We currently panic on offset overflow for StringArray, ByteArray, 
ListArray, etc...
   * We already panic on offset overflow in FromIterator for DictionaryArray
   
   **Describe the solution you'd like**
   
   
   Panic on key overflow instead of returning an error
   
   **Describe alternatives you've considered**
   
   
   We could implement `Extend` and panic on key overflow there, much like we do 
for FromIterator, but I feel we should be consistent about this.
   
   We could also implement our own `TryExtend` trait or something similar.
   
   **Additional context**
   
   
   #2725 tracks a more holistic rework of how we handle errors
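   As a standalone illustration of the trade-off (a made-up toy builder, not 
the arrow-rs one), both styles can coexist, with the infallible variant 
panicking on key overflow:
   ```rust
   // Toy dictionary builder; it appends blindly (no value dedup) just to
   // show how key overflow surfaces in each API style.
   struct TinyDictBuilder {
       keys: Vec<i8>,
       values: Vec<String>,
   }

   impl TinyDictBuilder {
       // Fallible variant: surfaces key overflow as an error.
       fn try_append(&mut self, v: &str) -> Result<i8, String> {
           let key = i8::try_from(self.values.len())
               .map_err(|_| "dictionary key overflow".to_string())?;
           self.values.push(v.to_string());
           self.keys.push(key);
           Ok(key)
       }

       // Infallible variant, matching the FromIterator/Extend convention;
       // panics once the values array would exceed i8::MAX entries.
       fn append(&mut self, v: &str) -> i8 {
           self.try_append(v).expect("dictionary key overflow")
       }
   }

   fn main() {
       let mut b = TinyDictBuilder { keys: Vec::new(), values: Vec::new() };
       assert_eq!(b.try_append("a"), Ok(0));
       assert_eq!(b.append("b"), 1); // would panic at entry 128
   }
   ```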
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] pitrou commented on issue #15054: [CI][Python] wheel-manylinux2014-* sometimes crashed on pytest exit

2023-01-19 Thread GitBox


pitrou commented on issue #15054:
URL: https://github.com/apache/arrow/issues/15054#issuecomment-1396772668

   > …maybe we could just leak the S3 client as a workaround?
   
   We can probably do that indeed. We just have to make sure to disable S3 on 
ASAN and Valgrind CI jobs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on issue #1858: Huge amount of llvm code generated by comparison kernels, potentially slowing compile times

2023-01-19 Thread GitBox


tustvold commented on issue #1858:
URL: https://github.com/apache/arrow-rs/issues/1858#issuecomment-1396761193

   I'm closing as these comparison kernels are now gated with feature flags, 
and crates can also choose not to depend on arrow-ord at all. Feel free to 
re-open if there is additional work to be done in this area


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-rs] tustvold closed issue #1858: Huge amount of llvm code generated by comparison kernels, potentially slowing compile times

2023-01-19 Thread GitBox


tustvold closed issue #1858: Huge amount of llvm code generated by comparison 
kernels, potentially slowing compile times
URL: https://github.com/apache/arrow-rs/issues/1858


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] thisisnic commented on a diff in pull request #33748: GH-33746: [R] Update NEWS.md for 11.0.0

2023-01-19 Thread GitBox


thisisnic commented on code in PR #33748:
URL: https://github.com/apache/arrow/pull/33748#discussion_r1081073855


##
r/NEWS.md:
##
@@ -19,6 +19,77 @@
 
 # arrow 10.0.1.9000
 
+## New features
+
+### Docs
+
+* A substantial reorganisation, rewrite of and addition to, many of the 
+  vignettes and README. (@djnavarro, #14514)  
+
+### Reading/writing data
+
+* New functions `open_csv_dataset()`, `open_tsv_dataset()`, and 
+  `open_delim_dataset()` all wrap `open_dataset()` - they don't provide new 
+  functionality, but allow for readr-style options to be supplied, making it 
+  simpler to switch between individual file-reading and dataset 
+  functionality. (#33614)
+* User-defined null values can now be set when writing CSVs both as datasets 
+  and as individual files. (@wjones127, #14679)
+* The new `col_names` parameter allows specification of column names when 
+  opening a CSV dataset. (@wjones127, #14705)
+* The `parse_options`, `read_options`, and `convert_options` parameters for 
+  reading individual files and datasets can now be passed in as lists. (#15270)
+
+### Function bindings
+
+The following functions can now be used in queries on Arrow objects:
+* `lubridate::with_tz()` and `lubridate::force_tz()` (@eitsupi, #14093)
+* `stringr::str_remove()` and `stringr::str_remove_all()` (#14644)
+
+### Installation
+
+* The package can now be installed offline using pre-downloaded binaries. 
+  (@pgramme, #14086)
+* The package can now automatically link to system installations of the AWS SDK
+  for C++. (@kou, #14235)
+
+### Other
+
+* New dplyr (1.1.0) function `join_by()` has been implemented for dplyr joins 
+  on Arrow objects (equality conditions only).  (#33664)
+* StructArray objects can now be created directly via `StructArray$create()`. 
+  (#14922)
+* curl timeout policy can now be configured for S3. (#15166)
+
+## Minor improvements and fixes
+
+* `map_batches()` now is lazy by default. (#14521)
+* Arrays of Decimal type objects can now be created directly and without 
+  casting. (#15211)
+* Calling `lubridate::as_datetime()` on Arrow objects now can handle time in 
+  sub-seconds. (@eitsupi, #13890)
+* `head()` can now be called after `as_record_batch_read()` without error. 
+  (#14518)
+* Fix for a bug in which `dplyr::right_join()` did not coalesce keys. (#15077)
+* Fix for a bug in output returned when multiple 
+  `dplyr::group_by()`/`dplyr::summarise()` calls are used. (#14905)
+* Fix for a bug in which `dplyr::summarize()` fails with division when divisor 
+  is a variable. (#14933)
+* Fix for a bug in which `as.Date()` fails going from `timestamp[us]` to
+  `timestamp[s]`. (#14935)
+* Fix for a bug in which creating an Array from an object bigger than 2^31 
+  results in an Array of length 0. (#14929)
+* Fix for a bug in which accents in file paths caused an error in 
+  `read_csv_arrow()`. (#14930)
+* Fix for a bug which prevented Arrow arrays of `POSIXlt` objects being 
created 
+  from Scalars. (#15277)
+* Multiple changes to ensure compatibility with dplyr 1.1.0. (@lionel-, #14948)
+
+## Breaking changes
+
+* rlang dependency must be at least version 1.0.0 because of 

Review Comment:
   Swapped them round, thanks folks



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] ursabot commented on pull request #15210: GH-20512: [Python] Numpy conversion doesn't account for ListArray offset

2023-01-19 Thread GitBox


ursabot commented on PR #15210:
URL: https://github.com/apache/arrow/pull/15210#issuecomment-1396745891

   Benchmark runs are scheduled for baseline = 
705e04bb15f481e476c9e7a8e2ac92460890ad0c and contender = 
2b50694c10e09e4a1343b62c6b5f44ad4403d0e1. 
2b50694c10e09e4a1343b62c6b5f44ad4403d0e1 is a master commit associated with 
this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] 
[ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/55b6d454010d4834b9d7bfc5a85fe756...b69a7118eddb41a695cd537131c0bcf4/)
   [Failed :arrow_down:0.81% :arrow_up:0.0%] 
[test-mac-arm](https://conbench.ursa.dev/compare/runs/a21315152d8a4ce9ba4231fb614ed91a...e62713d5ba6e4cbda2b0f634cc137de1/)
   [Finished :arrow_down:0.51% :arrow_up:0.0%] 
[ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/11309d9e609543bcb69b6247ff3e0569...8f9ba14983c34891ad8653cf430cb4b3/)
   [Finished :arrow_down:0.53% :arrow_up:0.0%] 
[ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/28d069ebefcd4355b89bfa07629ad364...895eb1a43a284e71b5d818b1ad9d2986/)
   Buildkite builds:
   [Finished] [`2b50694c` 
ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2214)
   [Failed] [`2b50694c` 
test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2240)
   [Finished] [`2b50694c` 
ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2210)
   [Finished] [`2b50694c` 
ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2233)
   [Finished] [`705e04bb` 
ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2213)
   [Finished] [`705e04bb` 
test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2239)
   [Finished] [`705e04bb` 
ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2209)
   [Finished] [`705e04bb` 
ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2232)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only 
benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] MMCMA commented on issue #15153: [Python] OSError: Couldn't deserialize thrift: TProtocolException

2023-01-19 Thread GitBox


MMCMA commented on issue #15153:
URL: https://github.com/apache/arrow/issues/15153#issuecomment-1396725954

   I can close the issue - I just discovered by chance that in very rare 
circumstances two processes were writing to the same file at the same time. 
Sorry about this. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] raulcd merged pull request #33751: WIP: [Release] Verify release-11.0.0-rc0

2023-01-19 Thread GitBox


raulcd merged PR #33751:
URL: https://github.com/apache/arrow/pull/33751


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3365: Add csv-core based reader (#3338)

2023-01-19 Thread GitBox


tustvold commented on code in PR #3365:
URL: https://github.com/apache/arrow-rs/pull/3365#discussion_r1081032053


##
arrow-csv/src/reader/records.rs:
##
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow_schema::ArrowError;
+use csv_core::{ReadRecordResult, Reader};
+use std::io::BufRead;
+
+/// The estimated length of a field in bytes
+const AVERAGE_FIELD_SIZE: usize = 8;
+
+/// The minimum amount of data in a single read
+const MIN_CAPACITY: usize = 1024;
+
+pub struct RecordReader<R> {
+reader: R,
+delimiter: Reader,
+
+num_columns: usize,
+
+num_rows: usize,
+offsets: Vec<usize>,
+data: Vec<u8>,
+}
+
+impl<R: BufRead> RecordReader<R> {
+pub fn new(reader: R, delimiter: Reader, num_columns: usize) -> Self {
+Self {
+reader,
+delimiter,
+num_columns,
+num_rows: 0,
+offsets: vec![],
+data: vec![],
+}
+}
+
+fn fill_buf(&mut self, to_read: usize) -> Result<(), ArrowError> {
+// Reserve sufficient capacity in offsets
+self.offsets.resize(to_read * self.num_columns + 1, 0);
+self.num_rows = 0;
+
+if to_read == 0 {
+return Ok(());
+}
+
+// The current offset into `self.data`
+let mut output_offset = 0;
+// The current offset into `input`
+let mut input_offset = 0;
+// The current offset into `self.offsets`
+let mut field_offset = 1;
+// The number of fields read for the current row
+let mut field_count = 0;
+
+'outer: loop {
+let input = self.reader.fill_buf()?;
+
+'input: loop {
+// Reserve necessary space in output data based on best 
estimate
+let remaining_rows = to_read - self.num_rows;
+let capacity = remaining_rows * self.num_columns * 
AVERAGE_FIELD_SIZE;
+let estimated_data = capacity.max(MIN_CAPACITY);
+self.data.resize(output_offset + estimated_data, 0);
+
+loop {
+let (result, bytes_read, bytes_written, end_positions) =
+self.delimiter.read_record(
+&input[input_offset..],
+&mut self.data[output_offset..],
+&mut self.offsets[field_offset..],
+);
+
+field_count += end_positions;
+field_offset += end_positions;
+input_offset += bytes_read;
+output_offset += bytes_written;
+
+match result {
+ReadRecordResult::End => break 'outer, // Reached end 
of file
+ReadRecordResult::InputEmpty => break 'input, // Input 
exhausted, need to read more
+ReadRecordResult::OutputFull => break, // Need to 
allocate more capacity
+ReadRecordResult::OutputEndsFull => {
+return Err(ArrowError::CsvError(format!("incorrect 
number of fields, expected {} got more than {}", self.num_columns, 
field_count)))

Review Comment:
   Yeah, that sounds like 
https://github.com/apache/arrow-datafusion/issues/4918 and the error is 
occurring when it tries to skip the header
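
   For anyone skimming the quoted diff, a reduced, self-contained sketch of the 
incremental csv-core read pattern the new reader builds on may help. This is an 
illustration only, not this PR's API; `count_records` is a made-up helper:

```rust
use csv_core::{ReadRecordResult, Reader};
use std::io::{self, BufRead, Cursor};

/// Count CSV records by feeding buffered input to csv-core incrementally.
fn count_records<R: BufRead>(mut input: R) -> io::Result<usize> {
    let mut parser = Reader::new();
    let mut output = vec![0u8; 4096]; // field bytes are copied here
    let mut ends = vec![0usize; 256]; // field end offsets are recorded here
    let mut records = 0;
    loop {
        // An empty buffer signals EOF, which lets csv-core flush a final record.
        let buf = input.fill_buf()?;
        let (result, consumed, _written, _ends) =
            parser.read_record(buf, &mut output, &mut ends);
        input.consume(consumed);
        match result {
            ReadRecordResult::Record => records += 1,
            ReadRecordResult::InputEmpty => {} // loop around to refill the buffer
            ReadRecordResult::End => return Ok(records),
            ReadRecordResult::OutputFull | ReadRecordResult::OutputEndsFull => {
                // The PR grows its buffers in this case; the sketch just bails out.
                return Err(io::Error::new(io::ErrorKind::Other, "row too large"));
            }
        }
    }
}

fn main() -> io::Result<()> {
    assert_eq!(count_records(Cursor::new("a,b\nc,d\ne,f\n"))?, 3);
    Ok(())
}
```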



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3365: Add csv-core based reader (#3338)

2023-01-19 Thread GitBox


tustvold commented on code in PR #3365:
URL: https://github.com/apache/arrow-rs/pull/3365#discussion_r1081032053


##
arrow-csv/src/reader/records.rs:
##
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow_schema::ArrowError;
+use csv_core::{ReadRecordResult, Reader};
+use std::io::BufRead;
+
+/// The estimated length of a field in bytes
+const AVERAGE_FIELD_SIZE: usize = 8;
+
+/// The minimum amount of data in a single read
+const MIN_CAPACITY: usize = 1024;
+
+pub struct RecordReader<R> {
+reader: R,
+delimiter: Reader,
+
+num_columns: usize,
+
+num_rows: usize,
+offsets: Vec<usize>,
+data: Vec<u8>,
+}
+
+impl<R: BufRead> RecordReader<R> {
+pub fn new(reader: R, delimiter: Reader, num_columns: usize) -> Self {
+Self {
+reader,
+delimiter,
+num_columns,
+num_rows: 0,
+offsets: vec![],
+data: vec![],
+}
+}
+
+fn fill_buf(&mut self, to_read: usize) -> Result<(), ArrowError> {
+// Reserve sufficient capacity in offsets
+self.offsets.resize(to_read * self.num_columns + 1, 0);
+self.num_rows = 0;
+
+if to_read == 0 {
+return Ok(());
+}
+
+// The current offset into `self.data`
+let mut output_offset = 0;
+// The current offset into `input`
+let mut input_offset = 0;
+// The current offset into `self.offsets`
+let mut field_offset = 1;
+// The number of fields read for the current row
+let mut field_count = 0;
+
+'outer: loop {
+let input = self.reader.fill_buf()?;
+
+'input: loop {
+// Reserve necessary space in output data based on best 
estimate
+let remaining_rows = to_read - self.num_rows;
+let capacity = remaining_rows * self.num_columns * 
AVERAGE_FIELD_SIZE;
+let estimated_data = capacity.max(MIN_CAPACITY);
+self.data.resize(output_offset + estimated_data, 0);
+
+loop {
+let (result, bytes_read, bytes_written, end_positions) =
+self.delimiter.read_record(
+&input[input_offset..],
+&mut self.data[output_offset..],
+&mut self.offsets[field_offset..],
+);
+
+field_count += end_positions;
+field_offset += end_positions;
+input_offset += bytes_read;
+output_offset += bytes_written;
+
+match result {
+ReadRecordResult::End => break 'outer, // Reached end 
of file
+ReadRecordResult::InputEmpty => break 'input, // Input 
exhausted, need to read more
+ReadRecordResult::OutputFull => break, // Need to 
allocate more capacity
+ReadRecordResult::OutputEndsFull => {
+return Err(ArrowError::CsvError(format!("incorrect 
number of fields, expected {} got more than {}", self.num_columns, 
field_count)))

Review Comment:
   Yeah, that sounds like https://github.com/apache/arrow-datafusion/issues/4918



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-rs] DDtKey commented on a diff in pull request #3365: Add csv-core based reader (#3338)

2023-01-19 Thread GitBox


DDtKey commented on code in PR #3365:
URL: https://github.com/apache/arrow-rs/pull/3365#discussion_r1081030906


##
arrow-csv/src/reader/records.rs:
##
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow_schema::ArrowError;
+use csv_core::{ReadRecordResult, Reader};
+use std::io::BufRead;
+
+/// The estimated length of a field in bytes
+const AVERAGE_FIELD_SIZE: usize = 8;
+
+/// The minimum amount of data in a single read
+const MIN_CAPACITY: usize = 1024;
+
+pub struct RecordReader<R> {
+reader: R,
+delimiter: Reader,
+
+num_columns: usize,
+
+num_rows: usize,
+offsets: Vec<usize>,
+data: Vec<u8>,
+}
+
+impl<R: BufRead> RecordReader<R> {
+pub fn new(reader: R, delimiter: Reader, num_columns: usize) -> Self {
+Self {
+reader,
+delimiter,
+num_columns,
+num_rows: 0,
+offsets: vec![],
+data: vec![],
+}
+}
+
+fn fill_buf(&mut self, to_read: usize) -> Result<(), ArrowError> {
+// Reserve sufficient capacity in offsets
+self.offsets.resize(to_read * self.num_columns + 1, 0);
+self.num_rows = 0;
+
+if to_read == 0 {
+return Ok(());
+}
+
+// The current offset into `self.data`
+let mut output_offset = 0;
+// The current offset into `input`
+let mut input_offset = 0;
+// The current offset into `self.offsets`
+let mut field_offset = 1;
+// The number of fields read for the current row
+let mut field_count = 0;
+
+'outer: loop {
+let input = self.reader.fill_buf()?;
+
+'input: loop {
+// Reserve necessary space in output data based on best 
estimate
+let remaining_rows = to_read - self.num_rows;
+let capacity = remaining_rows * self.num_columns * 
AVERAGE_FIELD_SIZE;
+let estimated_data = capacity.max(MIN_CAPACITY);
+self.data.resize(output_offset + estimated_data, 0);
+
+loop {
+let (result, bytes_read, bytes_written, end_positions) =
+self.delimiter.read_record(
+&input[input_offset..],
+&mut self.data[output_offset..],
+&mut self.offsets[field_offset..],
+);
+
+field_count += end_positions;
+field_offset += end_positions;
+input_offset += bytes_read;
+output_offset += bytes_written;
+
+match result {
+ReadRecordResult::End => break 'outer, // Reached end 
of file
+ReadRecordResult::InputEmpty => break 'input, // Input 
exhausted, need to read more
+ReadRecordResult::OutputFull => break, // Need to 
allocate more capacity
+ReadRecordResult::OutputEndsFull => {
+return Err(ArrowError::CsvError(format!("incorrect 
number of fields, expected {} got more than {}", self.num_columns, 
field_count)))

Review Comment:
   @tustvold yes, I did & it prints. But the lines in the files are correct; it 
always refers to the first one in my case. Probably it's related to intermediate 
schemas during execution. As I said, I'll try to create an MRE, otherwise it's 
hard to explain



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] raulcd commented on issue #15054: [CI][Python] wheel-manylinux2014-* sometimes crashed on pytest exit

2023-01-19 Thread GitBox


raulcd commented on issue #15054:
URL: https://github.com/apache/arrow/issues/15054#issuecomment-1396706119

   As suggested by @lidavidm this is causing release verification 
(https://github.com/apache/arrow/pull/33751) to be a bit painful.
   I had to retry the `wheel-manylinux2014-cp39-amd64` 11 times for it to 
succeed (1 of the early failures was due to a failure uploading the artifact): 
https://github.com/ursacomputing/crossbow/actions/runs/3949372445/jobs/6760493478
   And currently `verify-rc-binaries-wheels-linux-conda-latest-amd64` is still 
failing with the crash on pytest exit after several retries: 
https://github.com/ursacomputing/crossbow/actions/runs/3954649758/jobs/6772203240
   Those have been the only jobs that have been having the issue as far as I am 
aware but it has been quite consistent on those. I wanted to raise that so it 
is taken into account on the release vote.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-rs] DDtKey commented on a diff in pull request #3365: Add csv-core based reader (#3338)

2023-01-19 Thread GitBox


DDtKey commented on code in PR #3365:
URL: https://github.com/apache/arrow-rs/pull/3365#discussion_r1081030906


##
arrow-csv/src/reader/records.rs:
##
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow_schema::ArrowError;
+use csv_core::{ReadRecordResult, Reader};
+use std::io::BufRead;
+
+/// The estimated length of a field in bytes
+const AVERAGE_FIELD_SIZE: usize = 8;
+
+/// The minimum amount of data in a single read
+const MIN_CAPACITY: usize = 1024;
+
+pub struct RecordReader<R> {
+reader: R,
+delimiter: Reader,
+
+num_columns: usize,
+
+num_rows: usize,
+offsets: Vec<usize>,
+data: Vec<u8>,
+}
+
+impl<R: BufRead> RecordReader<R> {
+pub fn new(reader: R, delimiter: Reader, num_columns: usize) -> Self {
+Self {
+reader,
+delimiter,
+num_columns,
+num_rows: 0,
+offsets: vec![],
+data: vec![],
+}
+}
+
+fn fill_buf(&mut self, to_read: usize) -> Result<(), ArrowError> {
+// Reserve sufficient capacity in offsets
+self.offsets.resize(to_read * self.num_columns + 1, 0);
+self.num_rows = 0;
+
+if to_read == 0 {
+return Ok(());
+}
+
+// The current offset into `self.data`
+let mut output_offset = 0;
+// The current offset into `input`
+let mut input_offset = 0;
+// The current offset into `self.offsets`
+let mut field_offset = 1;
+// The number of fields read for the current row
+let mut field_count = 0;
+
+'outer: loop {
+let input = self.reader.fill_buf()?;
+
+'input: loop {
+// Reserve necessary space in output data based on best 
estimate
+let remaining_rows = to_read - self.num_rows;
+let capacity = remaining_rows * self.num_columns * 
AVERAGE_FIELD_SIZE;
+let estimated_data = capacity.max(MIN_CAPACITY);
+self.data.resize(output_offset + estimated_data, 0);
+
+loop {
+let (result, bytes_read, bytes_written, end_positions) =
+self.delimiter.read_record(
+&input[input_offset..],
+&mut self.data[output_offset..],
+&mut self.offsets[field_offset..],
+);
+
+field_count += end_positions;
+field_offset += end_positions;
+input_offset += bytes_read;
+output_offset += bytes_written;
+
+match result {
+ReadRecordResult::End => break 'outer, // Reached end 
of file
+ReadRecordResult::InputEmpty => break 'input, // Input 
exhausted, need to read more
+ReadRecordResult::OutputFull => break, // Need to 
allocate more capacity
+ReadRecordResult::OutputEndsFull => {
+return Err(ArrowError::CsvError(format!("incorrect 
number of fields, expected {} got more than {}", self.num_columns, 
field_count)))

Review Comment:
   @tustvold yes, I did & it prints. But the lines in the files are correct; it 
always refers to the first one in my case. Probably it's related to intermediate 
schemas during execution. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3365: Add csv-core based reader (#3338)

2023-01-19 Thread GitBox


tustvold commented on code in PR #3365:
URL: https://github.com/apache/arrow-rs/pull/3365#discussion_r1081025726


##
arrow-csv/src/reader/records.rs:
##
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow_schema::ArrowError;
+use csv_core::{ReadRecordResult, Reader};
+use std::io::BufRead;
+
+/// The estimated length of a field in bytes
+const AVERAGE_FIELD_SIZE: usize = 8;
+
+/// The minimum amount of data in a single read
+const MIN_CAPACITY: usize = 1024;
+
+pub struct RecordReader<R> {
+reader: R,
+delimiter: Reader,
+
+num_columns: usize,
+
+num_rows: usize,
+offsets: Vec<usize>,
+data: Vec<u8>,
+}
+
+impl<R: BufRead> RecordReader<R> {
+pub fn new(reader: R, delimiter: Reader, num_columns: usize) -> Self {
+Self {
+reader,
+delimiter,
+num_columns,
+num_rows: 0,
+offsets: vec![],
+data: vec![],
+}
+}
+
+fn fill_buf(&mut self, to_read: usize) -> Result<(), ArrowError> {
+// Reserve sufficient capacity in offsets
+self.offsets.resize(to_read * self.num_columns + 1, 0);
+self.num_rows = 0;
+
+if to_read == 0 {
+return Ok(());
+}
+
+// The current offset into `self.data`
+let mut output_offset = 0;
+// The current offset into `input`
+let mut input_offset = 0;
+// The current offset into `self.offsets`
+let mut field_offset = 1;
+// The number of fields read for the current row
+let mut field_count = 0;
+
+'outer: loop {
+let input = self.reader.fill_buf()?;
+
+'input: loop {
+// Reserve necessary space in output data based on best 
estimate
+let remaining_rows = to_read - self.num_rows;
+let capacity = remaining_rows * self.num_columns * 
AVERAGE_FIELD_SIZE;
+let estimated_data = capacity.max(MIN_CAPACITY);
+self.data.resize(output_offset + estimated_data, 0);
+
+loop {
+let (result, bytes_read, bytes_written, end_positions) =
+self.delimiter.read_record(
+&input[input_offset..],
+&mut self.data[output_offset..],
+&mut self.offsets[field_offset..],
+);
+
+field_count += end_positions;
+field_offset += end_positions;
+input_offset += bytes_read;
+output_offset += bytes_written;
+
+match result {
+ReadRecordResult::End => break 'outer, // Reached end 
of file
+ReadRecordResult::InputEmpty => break 'input, // Input 
exhausted, need to read more
+ReadRecordResult::OutputFull => break, // Need to 
allocate more capacity
+ReadRecordResult::OutputEndsFull => {
+return Err(ArrowError::CsvError(format!("incorrect 
number of fields, expected {} got more than {}", self.num_columns, 
field_count)))

Review Comment:
   If you update to the latest DataFusion, which includes arrow 31, it will 
print a line number which may help identify what is going on



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-rs] DDtKey commented on a diff in pull request #3365: Add csv-core based reader (#3338)

2023-01-19 Thread GitBox


DDtKey commented on code in PR #3365:
URL: https://github.com/apache/arrow-rs/pull/3365#discussion_r1081013622


##
arrow-csv/src/reader/records.rs:
##
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow_schema::ArrowError;
+use csv_core::{ReadRecordResult, Reader};
+use std::io::BufRead;
+
+/// The estimated length of a field in bytes
+const AVERAGE_FIELD_SIZE: usize = 8;
+
+/// The minimum amount of data in a single read
+const MIN_CAPACITY: usize = 1024;
+
+pub struct RecordReader<R> {
+reader: R,
+delimiter: Reader,
+
+num_columns: usize,
+
+num_rows: usize,
+offsets: Vec<usize>,
+data: Vec<u8>,
+}
+
+impl<R: BufRead> RecordReader<R> {
+pub fn new(reader: R, delimiter: Reader, num_columns: usize) -> Self {
+Self {
+reader,
+delimiter,
+num_columns,
+num_rows: 0,
+offsets: vec![],
+data: vec![],
+}
+}
+
+fn fill_buf(&mut self, to_read: usize) -> Result<(), ArrowError> {
+// Reserve sufficient capacity in offsets
+self.offsets.resize(to_read * self.num_columns + 1, 0);
+self.num_rows = 0;
+
+if to_read == 0 {
+return Ok(());
+}
+
+// The current offset into `self.data`
+let mut output_offset = 0;
+// The current offset into `input`
+let mut input_offset = 0;
+// The current offset into `self.offsets`
+let mut field_offset = 1;
+// The number of fields read for the current row
+let mut field_count = 0;
+
+'outer: loop {
+let input = self.reader.fill_buf()?;
+
+'input: loop {
+// Reserve necessary space in output data based on best 
estimate
+let remaining_rows = to_read - self.num_rows;
+let capacity = remaining_rows * self.num_columns * 
AVERAGE_FIELD_SIZE;
+let estimated_data = capacity.max(MIN_CAPACITY);
+self.data.resize(output_offset + estimated_data, 0);
+
+loop {
+let (result, bytes_read, bytes_written, end_positions) =
+self.delimiter.read_record(
+&input[input_offset..],
+&mut self.data[output_offset..],
+&mut self.offsets[field_offset..],
+);
+
+field_count += end_positions;
+field_offset += end_positions;
+input_offset += bytes_read;
+output_offset += bytes_written;
+
+match result {
+ReadRecordResult::End => break 'outer, // Reached end 
of file
+ReadRecordResult::InputEmpty => break 'input, // Input 
exhausted, need to read more
+ReadRecordResult::OutputFull => break, // Need to 
allocate more capacity
+ReadRecordResult::OutputEndsFull => {
+return Err(ArrowError::CsvError(format!("incorrect 
number of fields, expected {} got more than {}", self.num_columns, 
field_count)))

Review Comment:
   When one of the rows contains more fields than specified in the schema it's 
usually:
   `incorrect number of fields, expected X got Y` (from the 
`ReadRecordResult::Record` arm of the match), but not this variant of error
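
   A minimal csv-core demonstration of that usual path (illustrative only, 
separate from the reader code in the diff): a well-formed row yields 
`ReadRecordResult::Record` together with the number of field ends written, 
which the reader can then compare against the schema's column count to produce 
the "expected X got Y" message:

```rust
use csv_core::{ReadRecordResult, Reader};

fn main() {
    let mut parser = Reader::new();
    let mut output = [0u8; 64];
    let mut ends = [0usize; 8]; // generous capacity for field end positions
    let (result, _consumed, _written, fields) =
        parser.read_record(b"a,b,c\n", &mut output, &mut ends);
    // The row parses as a complete 3-field record; against a 2-column schema
    // the reader would report "incorrect number of fields, expected 2 got 3".
    assert!(matches!(result, ReadRecordResult::Record));
    assert_eq!(fields, 3);
}
```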



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-rs] DDtKey commented on a diff in pull request #3365: Add csv-core based reader (#3338)

2023-01-19 Thread GitBox


DDtKey commented on code in PR #3365:
URL: https://github.com/apache/arrow-rs/pull/3365#discussion_r1081013622


##
arrow-csv/src/reader/records.rs:
##
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow_schema::ArrowError;
+use csv_core::{ReadRecordResult, Reader};
+use std::io::BufRead;
+
+/// The estimated length of a field in bytes
+const AVERAGE_FIELD_SIZE: usize = 8;
+
+/// The minimum amount of data in a single read
+const MIN_CAPACITY: usize = 1024;
+
+pub struct RecordReader<R> {
+reader: R,
+delimiter: Reader,
+
+num_columns: usize,
+
+num_rows: usize,
+offsets: Vec<usize>,
+data: Vec<u8>,
+}
+
+impl<R: BufRead> RecordReader<R> {
+pub fn new(reader: R, delimiter: Reader, num_columns: usize) -> Self {
+Self {
+reader,
+delimiter,
+num_columns,
+num_rows: 0,
+offsets: vec![],
+data: vec![],
+}
+}
+
+fn fill_buf(&mut self, to_read: usize) -> Result<(), ArrowError> {
+// Reserve sufficient capacity in offsets
+self.offsets.resize(to_read * self.num_columns + 1, 0);
+self.num_rows = 0;
+
+if to_read == 0 {
+return Ok(());
+}
+
+// The current offset into `self.data`
+let mut output_offset = 0;
+// The current offset into `input`
+let mut input_offset = 0;
+// The current offset into `self.offsets`
+let mut field_offset = 1;
+// The number of fields read for the current row
+let mut field_count = 0;
+
+'outer: loop {
+let input = self.reader.fill_buf()?;
+
+'input: loop {
+// Reserve necessary space in output data based on best 
estimate
+let remaining_rows = to_read - self.num_rows;
+let capacity = remaining_rows * self.num_columns * 
AVERAGE_FIELD_SIZE;
+let estimated_data = capacity.max(MIN_CAPACITY);
+self.data.resize(output_offset + estimated_data, 0);
+
+loop {
+let (result, bytes_read, bytes_written, end_positions) =
+self.delimiter.read_record(
+&input[input_offset..],
+&mut self.data[output_offset..],
+&mut self.offsets[field_offset..],
+);
+
+field_count += end_positions;
+field_offset += end_positions;
+input_offset += bytes_read;
+output_offset += bytes_written;
+
+match result {
+ReadRecordResult::End => break 'outer, // Reached end 
of file
+ReadRecordResult::InputEmpty => break 'input, // Input 
exhausted, need to read more
+ReadRecordResult::OutputFull => break, // Need to 
allocate more capacity
+ReadRecordResult::OutputEndsFull => {
+return Err(ArrowError::CsvError(format!("incorrect 
number of fields, expected {} got more than {}", self.num_columns, 
field_count)))

Review Comment:
   Yes, when one of the rows contains more fields than specified in the schema it's 
usually:
   `incorrect number of fields, expected X got Y` (from the 
`ReadRecordResult::Record` arm of the match), but not this variant of error



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-rs] DDtKey commented on a diff in pull request #3365: Add csv-core based reader (#3338)

2023-01-19 Thread GitBox


DDtKey commented on code in PR #3365:
URL: https://github.com/apache/arrow-rs/pull/3365#discussion_r1081013622


##
arrow-csv/src/reader/records.rs:
##
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow_schema::ArrowError;
+use csv_core::{ReadRecordResult, Reader};
+use std::io::BufRead;
+
+/// The estimated length of a field in bytes
+const AVERAGE_FIELD_SIZE: usize = 8;
+
+/// The minimum amount of data in a single read
+const MIN_CAPACITY: usize = 1024;
+
+pub struct RecordReader<R> {
+reader: R,
+delimiter: Reader,
+
+num_columns: usize,
+
+num_rows: usize,
+offsets: Vec<usize>,
+data: Vec<u8>,
+}
+
+impl<R: BufRead> RecordReader<R> {
+pub fn new(reader: R, delimiter: Reader, num_columns: usize) -> Self {
+Self {
+reader,
+delimiter,
+num_columns,
+num_rows: 0,
+offsets: vec![],
+data: vec![],
+}
+}
+
+fn fill_buf(&mut self, to_read: usize) -> Result<(), ArrowError> {
+// Reserve sufficient capacity in offsets
+self.offsets.resize(to_read * self.num_columns + 1, 0);
+self.num_rows = 0;
+
+if to_read == 0 {
+return Ok(());
+}
+
+// The current offset into `self.data`
+let mut output_offset = 0;
+// The current offset into `input`
+let mut input_offset = 0;
+// The current offset into `self.offsets`
+let mut field_offset = 1;
+// The number of fields read for the current row
+let mut field_count = 0;
+
+'outer: loop {
+let input = self.reader.fill_buf()?;
+
+'input: loop {
+// Reserve necessary space in output data based on best 
estimate
+let remaining_rows = to_read - self.num_rows;
+let capacity = remaining_rows * self.num_columns * 
AVERAGE_FIELD_SIZE;
+let estimated_data = capacity.max(MIN_CAPACITY);
+self.data.resize(output_offset + estimated_data, 0);
+
+loop {
+let (result, bytes_read, bytes_written, end_positions) =
+self.delimiter.read_record(
+&input[input_offset..],
+&mut self.data[output_offset..],
+&mut self.offsets[field_offset..],
+);
+
+field_count += end_positions;
+field_offset += end_positions;
+input_offset += bytes_read;
+output_offset += bytes_written;
+
+match result {
+ReadRecordResult::End => break 'outer, // Reached end 
of file
+ReadRecordResult::InputEmpty => break 'input, // Input 
exhausted, need to read more
+ReadRecordResult::OutputFull => break, // Need to 
allocate more capacity
+ReadRecordResult::OutputEndsFull => {
+return Err(ArrowError::CsvError(format!("incorrect 
number of fields, expected {} got more than {}", self.num_columns, 
field_count)))

Review Comment:
   Yes, when one of the rows contains more fields than specified in the schema it's 
always:
   `incorrect number of fields, expected X got Y` (from the 
`ReadRecordResult::Record` arm of the match), but not this variant of error



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-rs] DDtKey commented on a diff in pull request #3365: Add csv-core based reader (#3338)

2023-01-19 Thread GitBox


DDtKey commented on code in PR #3365:
URL: https://github.com/apache/arrow-rs/pull/3365#discussion_r1081004665


##
arrow-csv/src/reader/records.rs:
##
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow_schema::ArrowError;
+use csv_core::{ReadRecordResult, Reader};
+use std::io::BufRead;
+
+/// The estimated length of a field in bytes
+const AVERAGE_FIELD_SIZE: usize = 8;
+
+/// The minimum amount of data in a single read
+const MIN_CAPACITY: usize = 1024;
+
+pub struct RecordReader<R> {
+reader: R,
+delimiter: Reader,
+
+num_columns: usize,
+
+num_rows: usize,
+offsets: Vec<usize>,
+data: Vec<u8>,
+}
+
+impl<R: BufRead> RecordReader<R> {
+pub fn new(reader: R, delimiter: Reader, num_columns: usize) -> Self {
+Self {
+reader,
+delimiter,
+num_columns,
+num_rows: 0,
+offsets: vec![],
+data: vec![],
+}
+}
+
+fn fill_buf(&mut self, to_read: usize) -> Result<(), ArrowError> {
+// Reserve sufficient capacity in offsets
+self.offsets.resize(to_read * self.num_columns + 1, 0);
+self.num_rows = 0;
+
+if to_read == 0 {
+return Ok(());
+}
+
+// The current offset into `self.data`
+let mut output_offset = 0;
+// The current offset into `input`
+let mut input_offset = 0;
+// The current offset into `self.offsets`
+let mut field_offset = 1;
+// The number of fields read for the current row
+let mut field_count = 0;
+
+'outer: loop {
+let input = self.reader.fill_buf()?;
+
+'input: loop {
+// Reserve necessary space in output data based on best 
estimate
+let remaining_rows = to_read - self.num_rows;
+let capacity = remaining_rows * self.num_columns * 
AVERAGE_FIELD_SIZE;
+let estimated_data = capacity.max(MIN_CAPACITY);
+self.data.resize(output_offset + estimated_data, 0);
+
+loop {
+let (result, bytes_read, bytes_written, end_positions) =
+self.delimiter.read_record(
+&input[input_offset..],
+&mut self.data[output_offset..],
+&mut self.offsets[field_offset..],
+);
+
+field_count += end_positions;
+field_offset += end_positions;
+input_offset += bytes_read;
+output_offset += bytes_written;
+
+match result {
+ReadRecordResult::End => break 'outer, // Reached end 
of file
+ReadRecordResult::InputEmpty => break 'input, // Input 
exhausted, need to read more
+ReadRecordResult::OutputFull => break, // Need to 
allocate more capacity
+ReadRecordResult::OutputEndsFull => {
+return Err(ArrowError::CsvError(format!("incorrect 
number of fields, expected {} got more than {}", self.num_columns, 
field_count)))

Review Comment:
   @tustvold I'm totally sure that the number of fields in each row was **equal 
to or less** (I mean null values) than the number of fields in the schema 
   
   I'm double-checking that 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-rs] ursabot commented on pull request #3560: Update pyarrow method call with kwargs

2023-01-19 Thread GitBox


ursabot commented on PR #3560:
URL: https://github.com/apache/arrow-rs/pull/3560#issuecomment-1396668572

   Benchmark runs are scheduled for baseline = 
de62808a9d65e052ff3e89550bf780d952c8ceae and contender = 
d9802353f195979f7c6541143c7e849f5ac2d661. 
d9802353f195979f7c6541143c7e849f5ac2d661 is a master commit associated with 
this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on 
ec2-t3-xlarge-us-east-2] 
[ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/6aeb3f2d23424562bfbd880181f9e8d3...a28180b3979b42aab203bad3846fa235/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on 
test-mac-arm] 
[test-mac-arm](https://conbench.ursa.dev/compare/runs/8f9c4a21a180416f889f1a9803ceb0d6...afbf266a92274aaa9865590b02567448/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on 
ursa-i9-9960x] 
[ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/43a8babcf2c344eca8a3ac4ca1f9b034...65be1cced99c40dc8f1adedd329ddc10/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on 
ursa-thinkcentre-m75q] 
[ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/f9003b054a6b45fea4994976fb696166...5181329ffb554c7d9d896abec4abec3b/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only 
benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3561: Return reference from ListArray::values

2023-01-19 Thread GitBox


tustvold commented on code in PR #3561:
URL: https://github.com/apache/arrow-rs/pull/3561#discussion_r1080992940


##
arrow-cast/src/cast.rs:
##
@@ -4710,8 +4709,8 @@ mod tests {
 assert_eq!(1, arr.value_length(2));
 assert_eq!(1, arr.value_length(3));
 assert_eq!(1, arr.value_length(4));
-let values = arr.values();
-let c = values.as_any().downcast_ref::().unwrap();
+

Review Comment:
   This was not possible before, as the temporary `ArrayRef` would not live for 
long enough.
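
   A hypothetical snippet showing the difference (not taken from this PR; 
`Int32Array` is just an example element type, and the imports assume the 
`arrow_array` crate):

```rust
use arrow_array::types::Int32Type;
use arrow_array::{Array, Int32Array, ListArray};

fn sum_list_values(list: &ListArray) -> i32 {
    // With the old owned-return signature this chained form was rejected:
    // the temporary Arc was dropped at the end of the statement while `ints`
    // still borrowed from it (E0716). With values() returning a reference,
    // the chain borrows from `list` itself and compiles:
    let ints = list.values().as_any().downcast_ref::<Int32Array>().unwrap();
    ints.iter().flatten().sum()
}

fn main() {
    let list = ListArray::from_iter_primitive::<Int32Type, _, _>(vec![
        Some(vec![Some(1), Some(2)]),
        Some(vec![Some(3)]),
    ]);
    assert_eq!(sum_list_values(&list), 6);
}
```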



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-rs] tustvold opened a new pull request, #3561: Return reference from ListArray::values

2023-01-19 Thread GitBox


tustvold opened a new pull request, #3561:
URL: https://github.com/apache/arrow-rs/pull/3561

   # Which issue does this PR close?
   
   
   
   Closes #.
   
   # Rationale for this change

   
   
   Returning an owned value not only results in unnecessary clones, but also 
results in poor ergonomics, especially when combined with downcasts
   
   # What changes are included in this PR?
   
   
   
   # Are there any user-facing changes?
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] Ziy1-Tan commented on pull request #33781: re2::RE2::RE2() result must be checked

2023-01-19 Thread GitBox


Ziy1-Tan commented on PR #33781:
URL: https://github.com/apache/arrow/pull/33781#issuecomment-1396660590

   cc @kou 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-rs] tustvold merged pull request #3560: Update pyarrow method call with kwargs

2023-01-19 Thread GitBox


tustvold merged PR #3560:
URL: https://github.com/apache/arrow-rs/pull/3560


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-rs] DDtKey commented on a diff in pull request #3365: Add csv-core based reader (#3338)

2023-01-19 Thread GitBox


DDtKey commented on code in PR #3365:
URL: https://github.com/apache/arrow-rs/pull/3365#discussion_r1080981598


##
arrow-csv/src/reader/records.rs:
##
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow_schema::ArrowError;
+use csv_core::{ReadRecordResult, Reader};
+use std::io::BufRead;
+
+/// The estimated length of a field in bytes
+const AVERAGE_FIELD_SIZE: usize = 8;
+
+/// The minimum amount of data in a single read
+const MIN_CAPACITY: usize = 1024;
+
+pub struct RecordReader<R> {
+reader: R,
+delimiter: Reader,
+
+num_columns: usize,
+
+num_rows: usize,
+offsets: Vec<usize>,
+data: Vec<u8>,
+}
+
+impl<R: BufRead> RecordReader<R> {
+pub fn new(reader: R, delimiter: Reader, num_columns: usize) -> Self {
+Self {
+reader,
+delimiter,
+num_columns,
+num_rows: 0,
+offsets: vec![],
+data: vec![],
+}
+}
+
+fn fill_buf(&mut self, to_read: usize) -> Result<(), ArrowError> {
+// Reserve sufficient capacity in offsets
+self.offsets.resize(to_read * self.num_columns + 1, 0);
+self.num_rows = 0;
+
+if to_read == 0 {
+return Ok(());
+}
+
+// The current offset into `self.data`
+let mut output_offset = 0;
+// The current offset into `input`
+let mut input_offset = 0;
+// The current offset into `self.offsets`
+let mut field_offset = 1;
+// The number of fields read for the current row
+let mut field_count = 0;
+
+'outer: loop {
+let input = self.reader.fill_buf()?;
+
+'input: loop {
+// Reserve necessary space in output data based on best 
estimate
+let remaining_rows = to_read - self.num_rows;
+let capacity = remaining_rows * self.num_columns * 
AVERAGE_FIELD_SIZE;
+let estimated_data = capacity.max(MIN_CAPACITY);
+self.data.resize(output_offset + estimated_data, 0);
+
+loop {
+let (result, bytes_read, bytes_written, end_positions) =
+self.delimiter.read_record(
+&input[input_offset..],
+&mut self.data[output_offset..],
+&mut self.offsets[field_offset..],
+);
+
+field_count += end_positions;
+field_offset += end_positions;
+input_offset += bytes_read;
+output_offset += bytes_written;
+
+match result {
+ReadRecordResult::End => break 'outer, // Reached end 
of file
+ReadRecordResult::InputEmpty => break 'input, // Input 
exhausted, need to read more
+ReadRecordResult::OutputFull => break, // Need to 
allocate more capacity
+ReadRecordResult::OutputEndsFull => {
+return Err(ArrowError::CsvError(format!("incorrect 
number of fields, expected {} got more than {}", self.num_columns, 
field_count)))

Review Comment:
   @tustvold wow. But the CSV files which produce this error are correct (and 
the schema is passed directly, without inference) - each has exactly the number 
of fields expected (however, let me check for null values inside). 
   
   Shouldn't the error message be changed in that case? I can't clearly 
understand the case when it happens, but I'm trying 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3365: Add csv-core based reader (#3338)

2023-01-19 Thread GitBox


tustvold commented on code in PR #3365:
URL: https://github.com/apache/arrow-rs/pull/3365#discussion_r1080984521


##
arrow-csv/src/reader/records.rs:
##
@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow_schema::ArrowError;
+use csv_core::{ReadRecordResult, Reader};
+use std::io::BufRead;
+
+/// The estimated length of a field in bytes
+const AVERAGE_FIELD_SIZE: usize = 8;
+
+/// The minimum amount of data in a single read
+const MIN_CAPACITY: usize = 1024;
+
+pub struct RecordReader<R> {
+reader: R,
+delimiter: Reader,
+
+num_columns: usize,
+
+num_rows: usize,
+offsets: Vec<usize>,
+data: Vec<u8>,
+}
+
+impl<R: BufRead> RecordReader<R> {
+pub fn new(reader: R, delimiter: Reader, num_columns: usize) -> Self {
+Self {
+reader,
+delimiter,
+num_columns,
+num_rows: 0,
+offsets: vec![],
+data: vec![],
+}
+}
+
+fn fill_buf(&mut self, to_read: usize) -> Result<(), ArrowError> {
+// Reserve sufficient capacity in offsets
+self.offsets.resize(to_read * self.num_columns + 1, 0);
+self.num_rows = 0;
+
+if to_read == 0 {
+return Ok(());
+}
+
+// The current offset into `self.data`
+let mut output_offset = 0;
+// The current offset into `input`
+let mut input_offset = 0;
+// The current offset into `self.offsets`
+let mut field_offset = 1;
+// The number of fields read for the current row
+let mut field_count = 0;
+
+'outer: loop {
+let input = self.reader.fill_buf()?;
+
+'input: loop {
+// Reserve necessary space in output data based on best 
estimate
+let remaining_rows = to_read - self.num_rows;
+let capacity = remaining_rows * self.num_columns * 
AVERAGE_FIELD_SIZE;
+let estimated_data = capacity.max(MIN_CAPACITY);
+self.data.resize(output_offset + estimated_data, 0);
+
+loop {
+let (result, bytes_read, bytes_written, end_positions) =
+self.delimiter.read_record(
+&input[input_offset..],
+&mut self.data[output_offset..],
+&mut self.offsets[field_offset..],
+);
+
+field_count += end_positions;
+field_offset += end_positions;
+input_offset += bytes_read;
+output_offset += bytes_written;
+
+match result {
+ReadRecordResult::End => break 'outer, // Reached end 
of file
+ReadRecordResult::InputEmpty => break 'input, // Input 
exhausted, need to read more
+ReadRecordResult::OutputFull => break, // Need to 
allocate more capacity
+ReadRecordResult::OutputEndsFull => {
+return Err(ArrowError::CsvError(format!("incorrect 
number of fields, expected {} got more than {}", self.num_columns, 
field_count)))

Review Comment:
   It happens when at least one row contains more fields than specified in the 
schema, i.e. more than `schema.fields().len() - 1` delimiters
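
   A minimal csv-core reproduction of that condition (illustrative, separate 
from the reader code in the diff):

```rust
use csv_core::{ReadRecordResult, Reader};

fn main() {
    // Room for only two field end positions (a 2-column schema), but the row
    // carries three fields, i.e. more than 2 - 1 delimiters:
    let mut parser = Reader::new();
    let mut output = [0u8; 64];
    let mut ends = [0usize; 2];
    let (result, _consumed, _written, _ends_written) =
        parser.read_record(b"a,b,c\n", &mut output, &mut ends);
    assert!(matches!(result, ReadRecordResult::OutputEndsFull));
}
```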



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


