[jira] [Created] (ARROW-15895) [R] R docs version switcher disappears & reappears with back
Stephanie Hazlitt created ARROW-15895: - Summary: [R] R docs version switcher disappears & reappears with back Key: ARROW-15895 URL: https://issues.apache.org/jira/browse/ARROW-15895 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Stephanie Hazlitt Using Chrome (on a MacBook, 2020), the R docs version switcher disappears when a version <7.0.0 is selected (expected behaviour); however, it reappears on the older version's page when navigating with the back button. Might be related to ARROW-15819 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15894) [C++] Strptime issues umbrella
Rok Mihevc created ARROW-15894: -- Summary: [C++] Strptime issues umbrella Key: ARROW-15894 URL: https://issues.apache.org/jira/browse/ARROW-15894 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Rok Mihevc This is to make strptime efforts more visible
[jira] [Created] (ARROW-15893) [Python][CI] Exercise Python minimal build examples
Antoine Pitrou created ARROW-15893: -- Summary: [Python][CI] Exercise Python minimal build examples Key: ARROW-15893 URL: https://issues.apache.org/jira/browse/ARROW-15893 Project: Apache Arrow Issue Type: Task Components: Continuous Integration, Python Reporter: Antoine Pitrou The build examples in https://github.com/apache/arrow/tree/master/python/examples/minimal_build are currently not exercised, meaning they can silently start failing (which they actually did until https://github.com/apache/arrow/pull/12592 fixed them).
[jira] [Created] (ARROW-15892) [C++] Dataset APIs require s3:ListBucket Permissions
Jonny Fuller created ARROW-15892: Summary: [C++] Dataset APIs require s3:ListBucket Permissions Key: ARROW-15892 URL: https://issues.apache.org/jira/browse/ARROW-15892 Project: Apache Arrow Issue Type: Bug Reporter: Jonny Fuller Hi team, first time posting an issue, so I apologize if the format is lacking. My original comment is on the ARROW-13685 GitHub PR [here|https://github.com/apache/arrow/pull/11136#issuecomment-1062406820]. Long story short, our environment is very locked down, and while my application has permission to write data under an S3 prefix, it does not have the {{ListBucket}} permission, nor can I add it. This does not prevent me from using the "individual" file APIs like {{pq.write_table}}, but the bucket validation logic in the "dataset" APIs breaks when it tests for the bucket's existence. {code:java} pq.write_to_dataset(pa.Table.from_batches([data]), location, filesystem=s3fs){code} {code:java} OSError: When creating bucket '': AWS Error [code 15]: Access Denied{code} The same is true for the generic {{pyarrow.dataset}} APIs. My understanding is that the bucket validation logic is part of the C++ code, not the Python API. As a Pythonista who knows nothing of C++, I am not sure how to resolve this problem. Would it be possible to disable the bucket existence check with an optional keyword argument? Thank you for your time!
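If it helps frame the request: the reporter's per-file workaround can be generalized by computing Hive-style object keys by hand and writing each file with the "individual" API. The sketch below is stdlib-only and illustrative — the helper name, prefix, and layout are assumptions, not pyarrow API; in practice each group would then be passed to {{pq.write_table(table, key, filesystem=s3fs)}}, which does not require {{ListBucket}}.

```python
# Hypothetical helper: build "key=value" object keys so each partition file
# can be written individually with pq.write_table, sidestepping the dataset
# API's bucket-existence check (which needs s3:ListBucket).
from collections import defaultdict

def partition_keys(rows, partition_col, prefix="my-prefix"):
    """Group rows by partition value and return Hive-style object keys."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_col]].append(row)
    keys = {}
    for value, group in sorted(groups.items()):
        keys[f"{prefix}/{partition_col}={value}/part-0.parquet"] = group
    return keys

rows = [{"am": 0, "mpg": 21.0}, {"am": 1, "mpg": 22.8}, {"am": 0, "mpg": 18.7}]
for key in partition_keys(rows, "am"):
    print(key)
```

Each resulting key is a fully-specified object path, so no listing of the bucket or prefix is ever needed.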
[jira] [Created] (ARROW-15891) [Python][Packaging] macOS universal wheel test failure on x86
Antoine Pitrou created ARROW-15891: -- Summary: [Python][Packaging] macOS universal wheel test failure on x86 Key: ARROW-15891 URL: https://issues.apache.org/jira/browse/ARROW-15891 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration, Packaging, Python Reporter: Antoine Pitrou Loading the crossbow-built macOS universal wheels on x86 fails with a symbol lookup error: https://github.com/ursacomputing/crossbow/runs/5481178200?check_suite_focus=true#step:13:95

{code}
Traceback (most recent call last):
  File "", line 2, in
  File "/Users/github/actions-runner/_work/crossbow/crossbow/test-amd64-env/lib/python3.9/site-packages/pyarrow/__init__.py", line 65, in
    import pyarrow.lib as _lib
ImportError: dlopen(/Users/github/actions-runner/_work/crossbow/crossbow/test-amd64-env/lib/python3.9/site-packages/pyarrow/lib.cpython-39-darwin.so, 2): Symbol not found: _EVP_CIPHER_CTX_ctrl
  Referenced from: /Users/github/actions-runner/_work/crossbow/crossbow/test-amd64-env/lib/python3.9/site-packages/pyarrow/libparquet.800.dylib
  Expected in: flat namespace in /Users/github/actions-runner/_work/crossbow/crossbow/test-amd64-env/lib/python3.9/site-packages/pyarrow/libparquet.800.dylib
{code}

This symbol is provided by OpenSSL. For some reason, this fails on x86 but not on ARM64.
[GitHub] [arrow-julia] simsurace opened a new issue #303: Question on `Date` encoding
simsurace opened a new issue #303: URL: https://github.com/apache/arrow-julia/issues/303 I'm trying to use Arrow to send data between a Julia (Arrow.jl) app and a Rust (Polars) app. However, when I write a table containing a Date column, it is read by Polars as Extension("JuliaLang.Date", Date32, Some("")), and Polars complains with

```
Cannot create polars series from Extension("JuliaLang.Date", Date32, Some("")) type
```

I would have expected one of the following to happen: 1. When writing the Arrow file, the type is converted to a suitable Arrow type (e.g. Date32), disguising its origin. 2. When reading the Arrow file, Polars ignores the origin (JuliaLang.Date, which seems to just be a name) and recognizes it as Date32. Instead, what seems to happen is that the column is encoded as an extension type and Polars does not know what to do with it. Is this expected behavior? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (ARROW-15890) [CI][Python] Use venv, not virtualenv, in CI
Antoine Pitrou created ARROW-15890: -- Summary: [CI][Python] Use venv, not virtualenv, in CI Key: ARROW-15890 URL: https://issues.apache.org/jira/browse/ARROW-15890 Project: Apache Arrow Issue Type: Task Components: Continuous Integration, Python Reporter: Antoine Pitrou Assignee: Antoine Pitrou The standard {{venv}} module is enough to create virtual environments without installing the {{virtualenv}} module. It will also solve virtualenv installation issues on some setups such as macOS wheel builders.
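As a sketch of the proposal (paths are illustrative), the stdlib module can be driven exactly the way CI would invoke {{python -m venv}}, with no third-party install step:

```python
# Create a virtual environment using only the stdlib `venv` module,
# mirroring what a CI step running `python -m venv` would do.
import subprocess
import sys
import tempfile
from pathlib import Path

env_dir = Path(tempfile.mkdtemp()) / "test-env"  # illustrative path
# --without-pip keeps the sketch independent of ensurepip availability.
subprocess.run(
    [sys.executable, "-m", "venv", "--without-pip", str(env_dir)],
    check=True,
)

# A freshly created environment carries a pyvenv.cfg marker file.
print((env_dir / "pyvenv.cfg").exists())
```

Since `venv` ships with CPython 3.3+, this removes the `pip install virtualenv` step (and its occasional failures on macOS wheel builders) entirely.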
[jira] [Created] (ARROW-15889) [Java][FlightRPC] "Used undeclared dependencies found" due to netty-transport-native-kqueue
David Li created ARROW-15889: Summary: [Java][FlightRPC] "Used undeclared dependencies found" due to netty-transport-native-kqueue Key: ARROW-15889 URL: https://issues.apache.org/jira/browse/ARROW-15889 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: David Li Similar to ARROW-15831, we should remove kqueue from the dependencies (it's not strictly required). This is causing the java-jars build to fail.

{noformat}
2022-03-09T12:56:20.3740860Z [INFO] --- maven-dependency-plugin:3.0.1:analyze-only (analyze) @ flight-core ---
2022-03-09T12:56:20.5028740Z [WARNING] Used undeclared dependencies found:
2022-03-09T12:56:20.5030090Z [WARNING]    io.netty:netty-transport-classes-kqueue:jar:4.1.72.Final:compile
2022-03-09T12:56:20.5030840Z [WARNING] Unused declared dependencies found:
2022-03-09T12:56:20.5036430Z [WARNING]    io.netty:netty-transport-native-kqueue:jar:osx-x86_64:4.1.72.Final:compile
2022-03-09T12:56:20.5037370Z [INFO]
{noformat}
[jira] [Created] (ARROW-15888) [Doc][Python] Python development guide is outdated
Antoine Pitrou created ARROW-15888: -- Summary: [Doc][Python] Python development guide is outdated Key: ARROW-15888 URL: https://issues.apache.org/jira/browse/ARROW-15888 Project: Apache Arrow Issue Type: Bug Components: Documentation, Python Reporter: Antoine Pitrou Assignee: Antoine Pitrou Fix For: 8.0.0 Many instructions in https://arrow.apache.org/docs/developers/python.html#building-on-linux-and-macos are outdated; we should do a pass and fix them.
[jira] [Created] (ARROW-15887) [Python] Update timezones strategy to include fixed offsets
Alenka Frim created ARROW-15887: --- Summary: [Python] Update timezones strategy to include fixed offsets Key: ARROW-15887 URL: https://issues.apache.org/jira/browse/ARROW-15887 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Alenka Frim The PyArrow test strategy for {{timezones}} should also include fixed offsets. Note: fixed offsets are not supported out of the box by hypothesis.
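For context, fixed-offset timezones are plain stdlib objects; hypothesis's {{st.timezones()}} draws named IANA zones only, so a fixed-offset strategy would have to be composed by hand (the strategy in the comment below is a sketch, not PyArrow's actual test code):

```python
# Fixed-offset timezones come straight from the stdlib. hypothesis's
# st.timezones() yields only named IANA zones, so a fixed-offset strategy
# would need to be built manually, e.g. (sketch, not PyArrow's code):
#   st.integers(-1439, 1439).map(lambda m: timezone(timedelta(minutes=m)))
from datetime import datetime, timedelta, timezone

fixed = timezone(timedelta(hours=5, minutes=30))  # a "+05:30" offset
dt = datetime(2022, 3, 9, 12, 0, tzinfo=fixed)
print(dt.isoformat())   # offset only, no zone name attached
print(dt.utcoffset())
```

Unlike an IANA zone, a fixed offset has no DST transitions, which is exactly the extra case the test strategy would start covering.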
[jira] [Created] (ARROW-15886) [Ruby] Add support for #raw_records of Day Millisecond Interval Type
Keisuke Okada created ARROW-15886: - Summary: [Ruby] Add support for #raw_records of Day Millisecond Interval Type Key: ARROW-15886 URL: https://issues.apache.org/jira/browse/ARROW-15886 Project: Apache Arrow Issue Type: Sub-task Components: Ruby Reporter: Keisuke Okada Assignee: Keisuke Okada
[jira] [Created] (ARROW-15885) [Ruby] Add support for #values of Day Millisecond Interval Type
Keisuke Okada created ARROW-15885: - Summary: [Ruby] Add support for #values of Day Millisecond Interval Type Key: ARROW-15885 URL: https://issues.apache.org/jira/browse/ARROW-15885 Project: Apache Arrow Issue Type: Sub-task Components: Ruby Reporter: Keisuke Okada Assignee: Keisuke Okada
[jira] [Created] (ARROW-15884) [C++][Doc] Document that the strptime kernel ignores %Z
Joris Van den Bossche created ARROW-15884: - Summary: [C++][Doc] Document that the strptime kernel ignores %Z Key: ARROW-15884 URL: https://issues.apache.org/jira/browse/ARROW-15884 Project: Apache Arrow Issue Type: Improvement Components: C++, Documentation Reporter: Joris Van den Bossche After ARROW-12820, the {{strptime}} kernel still ignores the {{%Z}} specifier (for timezone names); when {{%Z}} is used, any content in that position of the string is silently accepted. For example:

{code:python}
# the %z specifier now works (after ARROW-12820)
>>> pc.strptime(["2022-03-05 09:00:00+01"], format="%Y-%m-%d %H:%M:%S%z", unit="us")
[ 2022-03-05 08:00:00.00 ]

# in theory this should give the same result, but %Z is still ignored
>>> pc.strptime(["2022-03-05 09:00:00 CET"], format="%Y-%m-%d %H:%M:%S %Z", unit="us")
[ 2022-03-05 09:00:00.00 ]

# as a result any garbage in the string is also ignored
>>> pc.strptime(["2022-03-05 09:00:00 blabla"], format="%Y-%m-%d %H:%M:%S %Z", unit="us")
[ 2022-03-05 09:00:00.00 ]
{code}

I don't think it is easy to actually fix this (at least as long as we use the system strptime; see also https://github.com/apache/arrow/pull/11358#issue-1020404727), but at least we should document this limitation / gotcha.
[jira] [Created] (ARROW-15883) [C++] Support for fractional seconds in strptime() for ISO format?
Joris Van den Bossche created ARROW-15883: - Summary: [C++] Support for fractional seconds in strptime() for ISO format? Key: ARROW-15883 URL: https://issues.apache.org/jira/browse/ARROW-15883 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Currently, we can't parse "our own" string representation of a timestamp array with the timestamp parser {{strptime}}:

{code:python}
import datetime
import pyarrow as pa
import pyarrow.compute as pc

>>> pa.array([datetime.datetime(2022, 3, 5, 9)])
[ 2022-03-05 09:00:00.00 ]

# trying to parse the above representation as string
>>> pc.strptime(["2022-03-05 09:00:00.00"], format="%Y-%m-%d %H:%M:%S", unit="us")
...
ArrowInvalid: Failed to parse string: '2022-03-05 09:00:00.00' as a scalar of type timestamp[us]
{code}

The reason for this is the fractional second part; without it, the following works:

{code:python}
>>> pc.strptime(["2022-03-05 09:00:00"], format="%Y-%m-%d %H:%M:%S", unit="us")
[ 2022-03-05 09:00:00.00 ]
{code}

Now, I think the reason this fails is that {{strptime}} only supports parsing seconds as an integer (https://man7.org/linux/man-pages/man3/strptime.3.html). But it creates a strange situation where the timestamp parser cannot parse the representation we use for timestamps. In addition, for CSV we have a custom ISO parser (used by default), so when parsing the strings while reading a CSV file, the same string with fractional seconds does work:

{code:python}
import io
from pyarrow import csv

s = b"""a
2022-03-05 09:00:00.00"""
>>> csv.read_csv(io.BytesIO(s))
pyarrow.Table
a: timestamp[ns]
a: [[2022-03-05 09:00:00.0]]
{code}

cc [~apitrou] [~rokm]
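For comparison, Python's own {{datetime.strptime}} does handle fractional seconds, but only through the explicit {{%f}} directive, which the C {{strptime(3)}} that the kernel relies on lacks entirely; a minimal illustration:

```python
# Python's strptime parses fractional seconds via the %f directive
# (1-6 digits); C strptime(3) has no equivalent, which is why the
# kernel cannot accept "09:00:00.00" with a plain %S.
from datetime import datetime

dt = datetime.strptime("2022-03-05 09:00:00.00", "%Y-%m-%d %H:%M:%S.%f")
print(dt.microsecond)   # 0

dt2 = datetime.strptime("2022-03-05 09:00:00.123456", "%Y-%m-%d %H:%M:%S.%f")
print(dt2.microsecond)  # 123456
```

So one possible shape for the fix would be a %f-like extension on the Arrow side rather than relying on the platform strptime, though that is exactly the kind of change the linked PR discussion flags as non-trivial.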
[jira] [Created] (ARROW-15882) [CI][Python] Nightly hypothesis build is not actually running the hypothesis tests
Joris Van den Bossche created ARROW-15882: - Summary: [CI][Python] Nightly hypothesis build is not actually running the hypothesis tests Key: ARROW-15882 URL: https://issues.apache.org/jira/browse/ARROW-15882 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, Python Reporter: Joris Van den Bossche Fix For: 8.0.0
[jira] [Created] (ARROW-15881) [C++] Calling parquet::arrow::WriteTable in a child thread gets a segmentation fault
zzh created ARROW-15881: --- Summary: [C++] Calling parquet::arrow::WriteTable in a child thread gets a segmentation fault Key: ARROW-15881 URL: https://issues.apache.org/jira/browse/ARROW-15881 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 7.0.0 Environment: CentOS 7, gcc 7.0+ Reporter: zzh Attachments: message.png When I try to use Arrow to write Parquet files, I encounter an error. Calling parquet::arrow::WriteTable outside a child thread succeeds, but calling it inside a child thread causes a segmentation fault. The code is like this:

{code:cpp}
// writing on the main thread: works
arrow::Int64Builder test_a;
for (int i = 0; i < 1e7; ++i) {
  PARQUET_THROW_NOT_OK(test_a.Append(i));
}
auto sc = arrow::schema({arrow::field("A", arrow::int64())});
auto table = arrow::Table::Make(sc, {test_a.Finish().ValueOrDie()});
std::string filename = "test.parq";
try {
  std::shared_ptr<arrow::io::FileOutputStream> outfile;
  PARQUET_ASSIGN_OR_THROW(outfile, arrow::io::FileOutputStream::Open(filename));
  PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(
      *table, arrow::default_memory_pool(), outfile, table->num_rows()));
} catch (const std::exception& ex) {
  std::cout << ex.what() << std::endl;
}

// the same code run from a child thread segfaults
std::thread child([]() {
  arrow::Int64Builder test_a;
  for (int i = 0; i < 1e7; ++i) {
    PARQUET_THROW_NOT_OK(test_a.Append(i));
  }
  auto sc = arrow::schema({arrow::field("A", arrow::int64())});
  auto table = arrow::Table::Make(sc, {test_a.Finish().ValueOrDie()});
  std::string filename = "test.parq";
  try {
    std::shared_ptr<arrow::io::FileOutputStream> outfile;
    PARQUET_ASSIGN_OR_THROW(outfile, arrow::io::FileOutputStream::Open(filename));
    PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(
        *table, arrow::default_memory_pool(), outfile, table->num_rows()));
  } catch (const std::exception& ex) {
    std::cout << ex.what() << std::endl;
  }
});
child.join();
{code}

The stack trace is in the picture in the attachment.
[jira] [Created] (ARROW-15880) [C++] Can't open partitioned dataset if the root directory has "=" in its name
Nicola Crane created ARROW-15880: Summary: [C++] Can't open partitioned dataset if the root directory has "=" in its name Key: ARROW-15880 URL: https://issues.apache.org/jira/browse/ARROW-15880 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Nicola Crane Not sure if this is a bug or "just how Hive style partitioning works", but if I try to open a dataset where the root directory has an "=" in its name, I have to specify that directory in my partitioning to be able to open it successfully. This has caused users to trip up when they've saved one directory from a partitioned dataset somewhere and then tried to open that directory as a dataset.

{code:r}
library(arrow)
td <- tempfile()
dir.create(td)

# directory with equals sign in name
subdir <- file.path(td, "foo=bar")
dir.create(subdir)

write_dataset(mtcars, subdir, partitioning = "am")
list.files(td, recursive = TRUE)
#> [1] "foo=bar/am=0/part-0.parquet" "foo=bar/am=1/part-0.parquet"

# doesn't work
open_dataset(subdir, partitioning = "am")
#> Error:
#> ! "partitioning" does not match the detected Hive-style partitions: c("foo", "am")
#> ℹ Omit "partitioning" to use the Hive partitions
#> ℹ Set `hive_style = FALSE` to override what was detected
#> ℹ Or, to rename partition columns, call `select()` or `rename()` after opening the dataset

# works
open_dataset(subdir, partitioning = c("foo", "am"))
#> FileSystemDataset with 2 Parquet files
#> mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> gear: double
#> carb: double
#> foo: string
#> am: int32
#>
#> See $metadata for additional Schema metadata
{code}

Compare this with the same example where the folder is just called "foobar" instead of "foo=bar".
{code:r}
td <- tempfile()
dir.create(td)
subdir <- file.path(td, "foobar")
dir.create(subdir)

write_dataset(mtcars, subdir, partitioning = "am")
list.files(td, recursive = TRUE)
#> [1] "foobar/am=0/part-0.parquet" "foobar/am=1/part-0.parquet"

# works
open_dataset(subdir, partitioning = "am")
#> FileSystemDataset with 2 Parquet files
#> mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> gear: double
#> carb: double
#> am: int32
#>
#> See $metadata for additional Schema metadata
{code}
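A toy model (not Arrow's actual discovery code) of why a "foo=bar" root directory trips Hive-style detection: every "key=value" path segment below the base directory is treated as a partition field, so the root segment itself gets picked up alongside the real partitions.

```python
# Toy sketch of Hive-style partition discovery (not Arrow's real code):
# each "key=value" segment of a file's relative path becomes a partition
# field, so a root directory named "foo=bar" is itself detected as one.
def detect_hive_fields(relative_path):
    fields = []
    for segment in relative_path.split("/")[:-1]:  # skip the file name
        if "=" in segment:
            key, value = segment.split("=", 1)
            fields.append((key, value))
    return fields

print(detect_hive_fields("foo=bar/am=0/part-0.parquet"))
# the "foo=bar" root shows up as a field next to "am"
print(detect_hive_fields("foobar/am=0/part-0.parquet"))
# only "am" is detected when the root segment has no "="
```

This matches the R error above: discovery reports partitions c("foo", "am"), so the user-supplied partitioning = "am" no longer lines up.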