[jira] [Created] (ARROW-17201) Cannot override environment variables setting region
Carl Boettiger created ARROW-17201: -- Summary: Cannot override environment variables setting region Key: ARROW-17201 URL: https://issues.apache.org/jira/browse/ARROW-17201 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 8.0.1 Reporter: Carl Boettiger If a user has set AWS_DEFAULT_REGION, there is no way to override that setting (especially to a null value) in a one-off call. Consider the following example. This fails: {code:r} library(arrow) Sys.setenv(AWS_DEFAULT_REGION="data") noaa <- s3_bucket("neon4cast-drivers/", endpoint_override = "data.ecoforecast.org", anonymous = TRUE) {code} If the env var is not set, or is unset, this succeeds. However, attempting to override the region does not help: {code:r} noaa <- s3_bucket("neon4cast-drivers/", endpoint_override = "data.ecoforecast.org", region = "us-east-1", anonymous = TRUE) {code} (Nor does it help to set the region to "" or NULL. Note this is a MinIO host, so the region is not needed anyway.) Relatedly, one might expect that AWS_S3_ENDPOINT could be used instead of setting `endpoint_override`, but this does not seem to work either (let me know if you want that in a separate issue thread). On the plus side, the above code does not fail if AWS_S3_ENDPOINT is set to a value other than the one given in endpoint_override, so that is nice. -- This message was sent by Atlassian Jira (v8.20.10#820010)
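The report notes that the call succeeds whenever the env var is unset, which suggests a one-off workaround: temporarily remove the variable around the call and restore it afterwards. A minimal sketch of that pattern, in Python for illustration (the placeholder comment stands in for the actual `s3_bucket` call, which is not reproduced here):

```python
import os
from contextlib import contextmanager

@contextmanager
def unset_env(name):
    """Temporarily remove an environment variable, restoring it on exit."""
    saved = os.environ.pop(name, None)
    try:
        yield
    finally:
        if saved is not None:
            os.environ[name] = saved

# Simulate the situation from the report: the region env var is set globally.
os.environ["AWS_DEFAULT_REGION"] = "data"

with unset_env("AWS_DEFAULT_REGION"):
    # The variable is absent here, so a one-off call sees no region setting.
    assert "AWS_DEFAULT_REGION" not in os.environ
    # ... perform the one-off filesystem call here ...

# The original value is restored for the rest of the session.
assert os.environ["AWS_DEFAULT_REGION"] == "data"
```

In R the same effect can be had with `Sys.unsetenv()` before the call and `Sys.setenv()` after, though a `region`/`endpoint_override` argument that genuinely overrides the environment (as the issue requests) would be the cleaner fix.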
[jira] [Created] (ARROW-17200) [Python, Parquet] support partitioning by Pandas DataFrame index
Gregory Werbin created ARROW-17200: -- Summary: [Python, Parquet] support partitioning by Pandas DataFrame index Key: ARROW-17200 URL: https://issues.apache.org/jira/browse/ARROW-17200 Project: Apache Arrow Issue Type: New Feature Components: Parquet, Python Reporter: Gregory Werbin In a Pandas {{DataFrame}} with a multi-index, with a slowly-varying "outer" index level, one might want to partition by that index level when saving the data frame to Parquet format. This is currently not possible; you need to manually reset the index before writing, and re-add the index after reading. It would be very useful if you could supply the name of an index level to {{partition_cols}} instead of (or ideally in addition to) a data column name. I originally posted this on the Pandas issue tracker ([https://github.com/pandas-dev/pandas/issues/47797]). Matthew Roeschke looked at the code and figured out that the partitioning functionality was implemented entirely in PyArrow, and that the change would need to happen within PyArrow itself.
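The manual workaround described above (reset the index before writing, re-add it after reading) can be sketched as follows. The frame, the index level name `site`, and the data are hypothetical; the commented-out `to_parquet` line shows where the materialized column would feed `partition_cols`:

```python
import pandas as pd

# Hypothetical frame with a slowly-varying "outer" index level named "site".
df = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0, 4.0]},
    index=pd.MultiIndex.from_product([["A", "B"], [0, 1]], names=["site", "t"]),
)

# Current workaround: materialize the index level as a real column before
# writing, so it can be passed as a partition column.
flat = df.reset_index("site")
assert "site" in flat.columns
# flat.to_parquet("out/", partition_cols=["site"])

# After reading back, the index must be manually restored:
restored = flat.set_index("site", append=True).reorder_levels(["site", "t"])
assert restored.index.names == ["site", "t"]
```

The feature request is for `partition_cols` to accept the index level name directly, making this round-trip unnecessary.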
[jira] [Created] (ARROW-17199) [FlightRPC][Java] Fix example Flight SQL server
David Li created ARROW-17199: Summary: [FlightRPC][Java] Fix example Flight SQL server Key: ARROW-17199 URL: https://issues.apache.org/jira/browse/ARROW-17199 Project: Apache Arrow Issue Type: Bug Components: FlightRPC, Java Reporter: David Li Assignee: David Li There are a number of small bugs in the Java Flight SQL example (e.g. binding parameters to the wrong index, not handling null parameter values, not properly reporting errors) that should be fixed.
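The class of bug described (parameters bound at the wrong index, NULL parameter values dropped) is not specific to Flight SQL. A small illustration using Python's stdlib `sqlite3` rather than the actual Java example code, which is not reproduced here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")

# Positional parameters must line up with the placeholders in order;
# binding them at the wrong index silently swaps or corrupts values.
# A NULL parameter must also be passed through (as None), not skipped.
params = (42, None)
conn.execute("INSERT INTO t (a, b) VALUES (?, ?)", params)

row = conn.execute("SELECT a, b FROM t").fetchone()
assert row == (42, None)  # NULL round-trips as None
```

In JDBC-style APIs such as the Java example's, parameter indices are 1-based, which is a common source of exactly this off-by-one binding bug.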
[GitHub] [arrow-adbc] lidavidm commented on issue #46: Question around fixed hierarchy for schema
lidavidm commented on issue #46: URL: https://github.com/apache/arrow-adbc/issues/46#issuecomment-1194305728 Hmm, real-world edge cases are always fun, thanks for poking around. (Admittedly we should've looked more closely at these.) We were debating whether to keep the hierarchy or split it into multiple views (akin to Flight SQL), so this might be an argument for splitting them. Do you have a reference for the Snowflake/Trino behavior? I see examples like the following that document three levels of hierarchy, but not the fourth level: https://trino.io/docs/current/connector/postgresql.html#querying-postgresql https://docs.snowflake.com/en/user-guide/databases.html -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-adbc] GavinRay97 opened a new issue, #46: Question around fixed hierarchy for schema
GavinRay97 opened a new issue, #46: URL: https://github.com/apache/arrow-adbc/issues/46 I noticed that the ADBC metadata information assumes a fixed hierarchy: https://github.com/apache/arrow-adbc/blob/2485d7c3da217a7190f86128d769a7d0445755ab/java/core/src/main/java/org/apache/arrow/adbc/core/AdbcConnection.java#L58 What would the advice be for datasources that don't fit this, like Snowflake/Trino/Dremio where the hierarchy might be: - `datasource.database.schema.table` - `postgres1.mydb.public.emps` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
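One way to accommodate both the fixed three-level hierarchy and the four-level `datasource.database.schema.table` case from the issue is to model an object path as a variable-depth tuple rather than fixed fields. This is only an illustrative sketch, not the ADBC API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ObjectPath:
    """A catalog path of variable depth, outermost level first,
    e.g. ("postgres1", "mydb", "public", "emps")."""
    levels: tuple

    @classmethod
    def parse(cls, dotted: str) -> "ObjectPath":
        return cls(tuple(dotted.split(".")))

    @property
    def table(self) -> str:
        return self.levels[-1]

    @property
    def depth(self) -> int:
        return len(self.levels)

p = ObjectPath.parse("postgres1.mydb.public.emps")
assert p.levels == ("postgres1", "mydb", "public", "emps")
assert p.table == "emps" and p.depth == 4
```

Splitting metadata into multiple views (as the comment above suggests, akin to Flight SQL) sidesteps the depth question entirely, at the cost of more round trips to assemble the full picture.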
[jira] [Created] (ARROW-17198) [C++] Potential memory leak at shutdown if an exec plan with a scanner fails or is aborted immediately before shutdown
Weston Pace created ARROW-17198: --- Summary: [C++] Potential memory leak at shutdown if an exec plan with a scanner fails or is aborted immediately before shutdown Key: ARROW-17198 URL: https://issues.apache.org/jira/browse/ARROW-17198 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Weston Pace I'm primarily creating this so we remember to write a test for it. The problem itself should be solved as part of ARROW-16072. When the scanner fails, it simply discards its references to the various scanner AsyncGenerators. However, some I/O tasks may still hold references to these generators, so parts of the scanner survive after the plan itself is marked complete. If there is an immediate shutdown, those parts are never properly disposed of, and this shows up as a memory leak. Example: https://pipelines.actions.githubusercontent.com/serviceHosts/8bb0d999-3387-4c48-9fa6-c66c718a46e2/_apis/pipelines/1/runs/359690/signedlogcontent/4?urlExpires=2022-07-25T14%3A43%3A01.2797488Z=HMACV1=GS3lS09Q9sTRweN%2B8UEu2GwUGc%2FbO9eyH27FRKumbrg%3D
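The lifetime problem described (the plan drops its references, but in-flight tasks still keep the object alive) can be demonstrated in miniature with Python's reference semantics. This is a conceptual sketch, not Arrow's C++ machinery; the names are hypothetical:

```python
import weakref

class ScannerGenerator:
    """Stand-in for a scanner AsyncGenerator."""

gen = ScannerGenerator()
pending_io_tasks = [gen]   # an in-flight I/O task captured a reference
ref = weakref.ref(gen)

del gen                    # the failed plan "discards" its reference

# The generator survives because the pending task still holds it; at an
# immediate shutdown this lingering object is reported as a leak.
assert ref() is not None

pending_io_tasks.clear()   # only once tasks complete/are drained...
assert ref() is None       # ...can the generator actually be freed
```

The fix direction implied by ARROW-16072 is to make plan completion wait for (or cancel and drain) those outstanding tasks rather than merely dropping references.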
[jira] [Created] (ARROW-17197) [R] floor_date/ceiling_date lubridate comparison tests failing on macOS
Rok Mihevc created ARROW-17197: -- Summary: [R] floor_date/ceiling_date lubridate comparison tests failing on macOS Key: ARROW-17197 URL: https://issues.apache.org/jira/browse/ARROW-17197 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Rok Mihevc Fix For: 9.0.0 We observed failing tests on local machines and [in CI|https://github.com/ursacomputing/crossbow/runs/7460282895?check_suite_focus=true#step:10:228] where timezoned timestamps are rounded to subsecond, second, and minute units. The tests fail when comparing our results to lubridate's; [however, it seems the issue is on lubridate's side|https://github.com/apache/arrow/pull/12154/files#diff-d405691ec7dd30bdf039b63136e5aac3c34cea96d8ff532485d1faea7f2caaacR2815-R2823].
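For readers unfamiliar with the operation under test: flooring a timezone-aware timestamp to a unit means truncating all smaller components while leaving the timezone intact. A minimal sketch in Python (not the R/lubridate code under discussion, and the timestamp is made up):

```python
from datetime import datetime, timezone

def floor_to_minute(ts: datetime) -> datetime:
    """Floor a timezone-aware timestamp to the start of its minute."""
    return ts.replace(second=0, microsecond=0)

ts = datetime(2022, 7, 25, 14, 43, 1, 500000, tzinfo=timezone.utc)
floored = floor_to_minute(ts)
assert floored == datetime(2022, 7, 25, 14, 43, tzinfo=timezone.utc)
assert floored.tzinfo is timezone.utc  # timezone is preserved
```

Disagreements between two implementations typically come from edge cases around the unit boundary or timezone handling, which is why the tests compare against lubridate directly.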
[GitHub] [arrow-adbc] lidavidm commented on pull request #45: [C][Java][Python] Add GetInfo
lidavidm commented on PR #45: URL: https://github.com/apache/arrow-adbc/pull/45#issuecomment-1194001552 CC @hannes @krlmlr if either of you have comments - I noticed this was missing while working on the Ibis backend
[jira] [Created] (ARROW-17196) [C++] Optimize Fetch Node to avoid collecting all records for non-sort setting
Vibhatha Lakmal Abeykoon created ARROW-17196: Summary: [C++] Optimize Fetch Node to avoid collecting all records for non-sort setting Key: ARROW-17196 URL: https://issues.apache.org/jira/browse/ARROW-17196 Project: Apache Arrow Issue Type: Sub-task Reporter: Vibhatha Lakmal Abeykoon The current/initial implementation of the Fetch node collects all the records and then performs the fetch. Instead, it should watch the input stream and collect only the required number of records, starting at the given offset. It is also important to evaluate whether the ordering can be retained.
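The proposed behavior (stop consuming the stream once `offset + count` records have been seen, rather than materializing everything) can be sketched with a lazy slice. This is an illustrative Python sketch, not the C++ Fetch node:

```python
from itertools import islice

def fetch(stream, offset, count):
    """Yield at most `count` records starting at `offset`, consuming only
    offset + count items from the input stream instead of all of them."""
    return islice(stream, offset, offset + count)

# The million-element source is never fully consumed: iteration stops
# as soon as the requested window has been produced.
result = list(fetch(iter(range(1_000_000)), offset=10, count=3))
assert result == [10, 11, 12]
```

Note this only works when no sort is requested, as the issue title says: a sort requires seeing all input before any record's final position is known, and preserving the input ordering across parallel upstream nodes is the open question the issue flags.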
[jira] [Created] (ARROW-17195) [Docs] Create Documentation for Fetch Sink Node
Vibhatha Lakmal Abeykoon created ARROW-17195: Summary: [Docs] Create Documentation for Fetch Sink Node Key: ARROW-17195 URL: https://issues.apache.org/jira/browse/ARROW-17195 Project: Apache Arrow Issue Type: Sub-task Reporter: Vibhatha Lakmal Abeykoon Assignee: Vibhatha Lakmal Abeykoon Add a section to the Streaming engine documentation.
[jira] [Created] (ARROW-17194) [CI][Conan] Enable glog
Kouhei Sutou created ARROW-17194: Summary: [CI][Conan] Enable glog Key: ARROW-17194 URL: https://issues.apache.org/jira/browse/ARROW-17194 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou