[jira] [Created] (ARROW-17201) Cannot override environment variables setting region
Carl Boettiger created ARROW-17201: -- Summary: Cannot override environment variables setting region Key: ARROW-17201 URL: https://issues.apache.org/jira/browse/ARROW-17201 Project: Apache Arrow Issue Type: Improvement Components: R Affects Versions: 8.0.1 Reporter: Carl Boettiger If a user has set AWS_DEFAULT_REGION, there is no way to override that setting (especially to a null value) in a one-off call. Consider the following example. This fails: {code:r} library(arrow) Sys.setenv(AWS_DEFAULT_REGION="data") noaa <- s3_bucket("neon4cast-drivers/", endpoint_override = "data.ecoforecast.org", anonymous = TRUE) {code} If the env var is not set, or is unset, this succeeds. However, attempting to override the region does not help: {code:r} noaa <- s3_bucket("neon4cast-drivers/", endpoint_override = "data.ecoforecast.org", region = "us-east-1", anonymous = TRUE) {code} (Nor does it help to set the region to "" or NULL. Note this is a MinIO host, so the region is not needed anyway.) Relatedly, one might expect that AWS_S3_ENDPOINT could be used instead of setting `endpoint_override`, but this does not seem to work either (let me know if you want that in a separate issue thread). On the plus side, the above code does not fail if AWS_S3_ENDPOINT is set to a value other than the one given in endpoint_override, so that is nice. -- This message was sent by Atlassian Jira (v8.20.10#820010)
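The report notes that the call succeeds whenever the env var is unset, which suggests a one-off workaround: temporarily remove the variable around the call and restore it afterwards. A minimal sketch of that pattern, in Python for illustration (the placeholder comment stands in for the actual `s3_bucket` call, which is not reproduced here):

```python
import os
from contextlib import contextmanager

@contextmanager
def unset_env(name):
    """Temporarily remove an environment variable, restoring it on exit."""
    saved = os.environ.pop(name, None)
    try:
        yield
    finally:
        if saved is not None:
            os.environ[name] = saved

# Simulate the situation from the report: the region env var is set globally.
os.environ["AWS_DEFAULT_REGION"] = "data"

with unset_env("AWS_DEFAULT_REGION"):
    # The variable is absent here, so a one-off call sees no region setting.
    assert "AWS_DEFAULT_REGION" not in os.environ
    # ... perform the one-off filesystem call here ...

# The original value is restored for the rest of the session.
assert os.environ["AWS_DEFAULT_REGION"] == "data"
```

In R the same effect can be had with `Sys.unsetenv()` before the call and `Sys.setenv()` after, though a `region`/`endpoint_override` argument that genuinely overrides the environment (as the issue requests) would be the cleaner fix.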
[jira] [Created] (ARROW-17200) [Python, Parquet] support partitioning by Pandas DataFrame index
Gregory Werbin created ARROW-17200: -- Summary: [Python, Parquet] support partitioning by Pandas DataFrame index Key: ARROW-17200 URL: https://issues.apache.org/jira/browse/ARROW-17200 Project: Apache Arrow Issue Type: New Feature Components: Parquet, Python Reporter: Gregory Werbin In a Pandas {{DataFrame}} with a multi-index, with a slowly-varying "outer" index level, one might want to partition by that index level when saving the data frame to Parquet format. This is currently not possible; you need to manually reset the index before writing, and re-add the index after reading. It would be very useful if you could supply the name of an index level to {{partition_cols}} instead of (or ideally in addition to) a data column name. I originally posted this on the Pandas issue tracker ([https://github.com/pandas-dev/pandas/issues/47797]). Matthew Roeschke looked at the code and figured out that the partitioning functionality was implemented entirely in PyArrow, and that the change would need to happen within PyArrow itself.
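The manual workaround described above (reset the index before writing, re-add it after reading) can be sketched as follows. The frame, the index level name `site`, and the data are hypothetical; the commented-out `to_parquet` line shows where the materialized column would feed `partition_cols`:

```python
import pandas as pd

# Hypothetical frame with a slowly-varying "outer" index level named "site".
df = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0, 4.0]},
    index=pd.MultiIndex.from_product([["A", "B"], [0, 1]], names=["site", "t"]),
)

# Current workaround: materialize the index level as a real column before
# writing, so it can be passed as a partition column.
flat = df.reset_index("site")
assert "site" in flat.columns
# flat.to_parquet("out/", partition_cols=["site"])

# After reading back, the index must be manually restored:
restored = flat.set_index("site", append=True).reorder_levels(["site", "t"])
assert restored.index.names == ["site", "t"]
```

The feature request is for `partition_cols` to accept the index level name directly, making this round-trip unnecessary.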
[jira] [Created] (ARROW-17199) [FlightRPC][Java] Fix example Flight SQL server
David Li created ARROW-17199: Summary: [FlightRPC][Java] Fix example Flight SQL server Key: ARROW-17199 URL: https://issues.apache.org/jira/browse/ARROW-17199 Project: Apache Arrow Issue Type: Bug Components: FlightRPC, Java Reporter: David Li Assignee: David Li There are a number of small bugs in the Java Flight SQL example (e.g. binding parameters to the wrong index, not handling null parameter values, not properly reporting errors) that should be fixed.
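The class of bug described (parameters bound at the wrong index, NULL parameter values dropped) is not specific to Flight SQL. A small illustration using Python's stdlib `sqlite3` rather than the actual Java example code, which is not reproduced here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")

# Positional parameters must line up with the placeholders in order;
# binding them at the wrong index silently swaps or corrupts values.
# A NULL parameter must also be passed through (as None), not skipped.
params = (42, None)
conn.execute("INSERT INTO t (a, b) VALUES (?, ?)", params)

row = conn.execute("SELECT a, b FROM t").fetchone()
assert row == (42, None)  # NULL round-trips as None
```

In JDBC-style APIs such as the Java example's, parameter indices are 1-based, which is a common source of exactly this off-by-one binding bug.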
[GitHub] [arrow-adbc] lidavidm commented on issue #46: Question around fixed hierarchy for schema
lidavidm commented on issue #46: URL: https://github.com/apache/arrow-adbc/issues/46#issuecomment-1194305728 Hmm, real-world edge cases are always fun, thanks for poking around. (Admittedly we should've looked more closely at these.) We were debating whether to keep the hierarchy or split it into multiple views (akin to Flight SQL), so this might be an argument for splitting them. Do you have a reference for the Snowflake/Trino behavior? I see examples like the following that document three levels of hierarchy, but not the fourth level: https://trino.io/docs/current/connector/postgresql.html#querying-postgresql https://docs.snowflake.com/en/user-guide/databases.html -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-adbc] GavinRay97 opened a new issue, #46: Question around fixed hierarchy for schema
GavinRay97 opened a new issue, #46: URL: https://github.com/apache/arrow-adbc/issues/46 I noticed that the ADBC metadata information assumes a fixed hierarchy: https://github.com/apache/arrow-adbc/blob/2485d7c3da217a7190f86128d769a7d0445755ab/java/core/src/main/java/org/apache/arrow/adbc/core/AdbcConnection.java#L58 What would the advice be for datasources that don't fit this, like Snowflake/Trino/Dremio where the hierarchy might be: - `datasource.database.schema.table` - `postgres1.mydb.public.emps` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
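One way to accommodate both the fixed three-level hierarchy and the four-level `datasource.database.schema.table` case from the issue is to model an object path as a variable-depth tuple rather than fixed fields. This is only an illustrative sketch, not the ADBC API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ObjectPath:
    """A catalog path of variable depth, outermost level first,
    e.g. ("postgres1", "mydb", "public", "emps")."""
    levels: tuple

    @classmethod
    def parse(cls, dotted: str) -> "ObjectPath":
        return cls(tuple(dotted.split(".")))

    @property
    def table(self) -> str:
        return self.levels[-1]

    @property
    def depth(self) -> int:
        return len(self.levels)

p = ObjectPath.parse("postgres1.mydb.public.emps")
assert p.levels == ("postgres1", "mydb", "public", "emps")
assert p.table == "emps" and p.depth == 4
```

Splitting metadata into multiple views (as the comment above suggests, akin to Flight SQL) sidesteps the depth question entirely, at the cost of more round trips to assemble the full picture.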
[jira] [Created] (ARROW-17198) [C++] Potential memory leak at shutdown if an exec plan with a scanner fails or is aborted immediately before shutdown
Weston Pace created ARROW-17198: --- Summary: [C++] Potential memory leak at shutdown if an exec plan with a scanner fails or is aborted immediately before shutdown Key: ARROW-17198 URL: https://issues.apache.org/jira/browse/ARROW-17198 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Weston Pace I'm primarily creating this so we remember to write a test for it. The problem itself should be solved as part of ARROW-16072. When the scanner fails, it simply discards its references to the various scanner AsyncGenerators. However, some I/O tasks may still hold references to these generators, so parts of the scanner survive after the plan itself is marked complete. If there is an immediate shutdown, those parts are never properly disposed of, and this shows up as a memory leak. Example: https://pipelines.actions.githubusercontent.com/serviceHosts/8bb0d999-3387-4c48-9fa6-c66c718a46e2/_apis/pipelines/1/runs/359690/signedlogcontent/4?urlExpires=2022-07-25T14%3A43%3A01.2797488Z=HMACV1=GS3lS09Q9sTRweN%2B8UEu2GwUGc%2FbO9eyH27FRKumbrg%3D
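The lifetime problem described (the plan drops its references, but in-flight tasks still keep the object alive) can be demonstrated in miniature with Python's reference semantics. This is a conceptual sketch, not Arrow's C++ machinery; the names are hypothetical:

```python
import weakref

class ScannerGenerator:
    """Stand-in for a scanner AsyncGenerator."""

gen = ScannerGenerator()
pending_io_tasks = [gen]   # an in-flight I/O task captured a reference
ref = weakref.ref(gen)

del gen                    # the failed plan "discards" its reference

# The generator survives because the pending task still holds it; at an
# immediate shutdown this lingering object is reported as a leak.
assert ref() is not None

pending_io_tasks.clear()   # only once tasks complete/are drained...
assert ref() is None       # ...can the generator actually be freed
```

The fix direction implied by ARROW-16072 is to make plan completion wait for (or cancel and drain) those outstanding tasks rather than merely dropping references.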
[jira] [Created] (ARROW-17197) [R] floor_date/ceiling_date lubridate comparison tests failing on macOS
Rok Mihevc created ARROW-17197: -- Summary: [R] floor_date/ceiling_date lubridate comparison tests failing on macOS Key: ARROW-17197 URL: https://issues.apache.org/jira/browse/ARROW-17197 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Rok Mihevc Fix For: 9.0.0 We observed failing tests on local machines and [in CI|https://github.com/ursacomputing/crossbow/runs/7460282895?check_suite_focus=true#step:10:228] where timezoned timestamps are rounded to subsecond, second, and minute units. The tests fail when comparing our results to lubridate's; [however, it seems the issue is on lubridate's side|https://github.com/apache/arrow/pull/12154/files#diff-d405691ec7dd30bdf039b63136e5aac3c34cea96d8ff532485d1faea7f2caaacR2815-R2823].
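For readers unfamiliar with the operation under test: flooring a timezone-aware timestamp to a unit means truncating all smaller components while leaving the timezone intact. A minimal sketch in Python (not the R/lubridate code under discussion, and the timestamp is made up):

```python
from datetime import datetime, timezone

def floor_to_minute(ts: datetime) -> datetime:
    """Floor a timezone-aware timestamp to the start of its minute."""
    return ts.replace(second=0, microsecond=0)

ts = datetime(2022, 7, 25, 14, 43, 1, 500000, tzinfo=timezone.utc)
floored = floor_to_minute(ts)
assert floored == datetime(2022, 7, 25, 14, 43, tzinfo=timezone.utc)
assert floored.tzinfo is timezone.utc  # timezone is preserved
```

Disagreements between two implementations typically come from edge cases around the unit boundary or timezone handling, which is why the tests compare against lubridate directly.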
[GitHub] [arrow-adbc] lidavidm commented on pull request #45: [C][Java][Python] Add GetInfo
lidavidm commented on PR #45: URL: https://github.com/apache/arrow-adbc/pull/45#issuecomment-1194001552 CC @hannes @krlmlr if either of you have comments - I noticed this was missing while working on the Ibis backend
[jira] [Created] (ARROW-17196) [C++] Optimize Fetch Node to avoid collecting all records for non-sort setting
Vibhatha Lakmal Abeykoon created ARROW-17196: Summary: [C++] Optimize Fetch Node to avoid collecting all records for non-sort setting Key: ARROW-17196 URL: https://issues.apache.org/jira/browse/ARROW-17196 Project: Apache Arrow Issue Type: Sub-task Reporter: Vibhatha Lakmal Abeykoon The current/initial implementation of the Fetch node collects all the records and then performs the fetch. Instead, it should watch the input stream and collect only the required number of records, starting at the given offset. It is also important to evaluate whether the ordering can be retained.
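The proposed behavior (stop consuming the stream once `offset + count` records have been seen, rather than materializing everything) can be sketched with a lazy slice. This is an illustrative Python sketch, not the C++ Fetch node:

```python
from itertools import islice

def fetch(stream, offset, count):
    """Yield at most `count` records starting at `offset`, consuming only
    offset + count items from the input stream instead of all of them."""
    return islice(stream, offset, offset + count)

# The million-element source is never fully consumed: iteration stops
# as soon as the requested window has been produced.
result = list(fetch(iter(range(1_000_000)), offset=10, count=3))
assert result == [10, 11, 12]
```

Note this only works when no sort is requested, as the issue title says: a sort requires seeing all input before any record's final position is known, and preserving the input ordering across parallel upstream nodes is the open question the issue flags.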
[jira] [Created] (ARROW-17195) [Docs] Create Documentation for Fetch Sink Node
Vibhatha Lakmal Abeykoon created ARROW-17195: Summary: [Docs] Create Documentation for Fetch Sink Node Key: ARROW-17195 URL: https://issues.apache.org/jira/browse/ARROW-17195 Project: Apache Arrow Issue Type: Sub-task Reporter: Vibhatha Lakmal Abeykoon Assignee: Vibhatha Lakmal Abeykoon Add a section to the Streaming engine documentation.
[jira] [Created] (ARROW-17194) [CI][Conan] Enable glog
Kouhei Sutou created ARROW-17194: Summary: [CI][Conan] Enable glog Key: ARROW-17194 URL: https://issues.apache.org/jira/browse/ARROW-17194 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, Packaging Reporter: Kouhei Sutou Assignee: Kouhei Sutou