[jira] [Created] (ARROW-15895) [R] R docs version switcher disappears & reappears with back

2022-03-09 Thread Stephanie Hazlitt (Jira)
Stephanie Hazlitt created ARROW-15895:
-

 Summary: [R] R docs version switcher disappears & reappears with 
back
 Key: ARROW-15895
 URL: https://issues.apache.org/jira/browse/ARROW-15895
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Stephanie Hazlitt


Using Chrome (on a 2020 MacBook), the R docs version switcher disappears when a version earlier than 7.0.0 is selected (expected behaviour), but it reappears on the older version's page after using the back button. Might be related to ARROW-15819.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15894) [C++] Strptime issues umbrella

2022-03-09 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-15894:
--

 Summary: [C++] Strptime issues umbrella
 Key: ARROW-15894
 URL: https://issues.apache.org/jira/browse/ARROW-15894
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Rok Mihevc


This is an umbrella issue to make ongoing strptime efforts more visible.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15893) [Python][CI] Exercise Python minimal build examples

2022-03-09 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-15893:
--

 Summary: [Python][CI] Exercise Python minimal build examples
 Key: ARROW-15893
 URL: https://issues.apache.org/jira/browse/ARROW-15893
 Project: Apache Arrow
  Issue Type: Task
  Components: Continuous Integration, Python
Reporter: Antoine Pitrou


The build examples in 
https://github.com/apache/arrow/tree/master/python/examples/minimal_build are 
currently not exercised, meaning they can silently start failing (which they 
actually did until https://github.com/apache/arrow/pull/12592 fixed them).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15892) [C++] Dataset APIs require s3:ListBucket Permissions

2022-03-09 Thread Jonny Fuller (Jira)
Jonny Fuller created ARROW-15892:


 Summary: [C++] Dataset APIs require s3:ListBucket Permissions
 Key: ARROW-15892
 URL: https://issues.apache.org/jira/browse/ARROW-15892
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Jonny Fuller


Hi team, this is my first time posting an issue, so I apologize if the format is lacking. My original comment is on the ARROW-13685 GitHub pull request [here|https://github.com/apache/arrow/pull/11136#issuecomment-1062406820].

Long story short, our environment is tightly locked down: my application has permission to write data under an S3 prefix, but it does not have the {{ListBucket}} permission, and I cannot add it. This does not prevent me from using the "individual" file APIs such as {{pq.write_table}}, but the bucket validation logic in the "dataset" APIs breaks when it tests for the bucket's existence.
{code:python}
pq.write_to_dataset(pa.Table.from_batches([data]), location, filesystem=s3fs)
{code}
{code}
OSError: When creating bucket '': AWS Error [code 15]: Access Denied
{code}
The same is true for the generic {{pyarrow.dataset}} APIs. My understanding is that the bucket validation logic is part of the C++ code, not the Python API. As a Pythonista who knows nothing of C++, I am not sure how to resolve this problem.

Would it be possible to disable the bucket existence check with an optional keyword argument? Thank you for your time!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15891) [Python][Packaging] macOS universal wheel test failure on x86

2022-03-09 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-15891:
--

 Summary: [Python][Packaging] macOS universal wheel test failure on 
x86
 Key: ARROW-15891
 URL: https://issues.apache.org/jira/browse/ARROW-15891
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration, Packaging, Python
Reporter: Antoine Pitrou


Loading the crossbow-built macOS universal wheels on x86 fails with a symbol 
lookup error:
https://github.com/ursacomputing/crossbow/runs/5481178200?check_suite_focus=true#step:13:95
{code}
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/github/actions-runner/_work/crossbow/crossbow/test-amd64-env/lib/python3.9/site-packages/pyarrow/__init__.py", line 65, in <module>
    import pyarrow.lib as _lib
ImportError: dlopen(/Users/github/actions-runner/_work/crossbow/crossbow/test-amd64-env/lib/python3.9/site-packages/pyarrow/lib.cpython-39-darwin.so, 2): Symbol not found: _EVP_CIPHER_CTX_ctrl
  Referenced from: /Users/github/actions-runner/_work/crossbow/crossbow/test-amd64-env/lib/python3.9/site-packages/pyarrow/libparquet.800.dylib
  Expected in: flat namespace
 in /Users/github/actions-runner/_work/crossbow/crossbow/test-amd64-env/lib/python3.9/site-packages/pyarrow/libparquet.800.dylib
{code}

This symbol is provided by OpenSSL. For some reason, loading fails on x86 but not on ARM64.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [arrow-julia] simsurace opened a new issue #303: Question on `Date` encoding

2022-03-09 Thread GitBox


simsurace opened a new issue #303:
URL: https://github.com/apache/arrow-julia/issues/303


   I'm trying to use Arrow to send data between a Julia (Arrow.jl) and a Rust (Polars) app.
   However, when I write a table containing a `Date` column, it is read by Polars as `Extension("JuliaLang.Date", Date32, Some(""))`, and Polars complains with
   ```
   Cannot create polars series from Extension("JuliaLang.Date", Date32, Some("")) type
   ```
   I would have expected one of the following to happen:
   
   1. When writing the Arrow file, the type is converted to a suitable Arrow type (e.g. Date32), disguising its origin.
   2. When reading the Arrow file, Polars ignores the origin ("JuliaLang.Date", which seems to just be a name) and recognizes it as Date32 or similar.
   
   Instead, what seems to happen is that the column is encoded as an extension type and Polars does not know what to do with it. Is this expected behavior?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (ARROW-15890) [CI][Python] Use venv, not virtualenv, in CI

2022-03-09 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-15890:
--

 Summary: [CI][Python] Use venv, not virtualenv, in CI
 Key: ARROW-15890
 URL: https://issues.apache.org/jira/browse/ARROW-15890
 Project: Apache Arrow
  Issue Type: Task
  Components: Continuous Integration, Python
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


The standard {{venv}} module is enough to create virtual environments, without installing the {{virtualenv}} package. Using it also avoids virtualenv installation issues on some setups, such as the macOS wheel builders.
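The switch can be sketched with the standard library alone; this is an illustrative snippet, not the actual CI change:

```python
import os
import tempfile
import venv

# Create a virtual environment using only the standard library --
# no prior `pip install virtualenv` step is needed.
env_dir = os.path.join(tempfile.mkdtemp(), "test-env")
venv.create(env_dir, with_pip=False)  # with_pip=False keeps the example offline

# The environment root is marked by a pyvenv.cfg file.
print(os.path.exists(os.path.join(env_dir, "pyvenv.cfg")))  # True
```

In CI scripts this corresponds to replacing `virtualenv <dir>` invocations with `python -m venv <dir>`.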



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15889) [Java][FlightRPC] "Used undeclared dependencies found" due to netty-transport-native-kqueue

2022-03-09 Thread David Li (Jira)
David Li created ARROW-15889:


 Summary: [Java][FlightRPC] "Used undeclared dependencies found" 
due to netty-transport-native-kqueue
 Key: ARROW-15889
 URL: https://issues.apache.org/jira/browse/ARROW-15889
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: David Li


Similar to ARROW-15831, we should remove kqueue from the dependencies (it's not strictly required). This is causing the java-jars build to fail.
{noformat}
2022-03-09T12:56:20.3740860Z [INFO] --- 
maven-dependency-plugin:3.0.1:analyze-only (analyze) @ flight-core ---
2022-03-09T12:56:20.5028740Z [WARNING] Used undeclared dependencies found:
2022-03-09T12:56:20.5030090Z [WARNING]    
io.netty:netty-transport-classes-kqueue:jar:4.1.72.Final:compile
2022-03-09T12:56:20.5030840Z [WARNING] Unused declared dependencies found:
2022-03-09T12:56:20.5036430Z [WARNING]    
io.netty:netty-transport-native-kqueue:jar:osx-x86_64:4.1.72.Final:compile
2022-03-09T12:56:20.5037370Z [INFO] 
 
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15888) [Doc][Python] Python development guide is outdated

2022-03-09 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-15888:
--

 Summary: [Doc][Python] Python development guide is outdated
 Key: ARROW-15888
 URL: https://issues.apache.org/jira/browse/ARROW-15888
 Project: Apache Arrow
  Issue Type: Bug
  Components: Documentation, Python
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou
 Fix For: 8.0.0


Many instructions in 
https://arrow.apache.org/docs/developers/python.html#building-on-linux-and-macos
 are outdated; we should do a pass and fix them.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15887) [Python] Update timezones strategy to include fixed offsets

2022-03-09 Thread Alenka Frim (Jira)
Alenka Frim created ARROW-15887:
---

 Summary: [Python] Update timezones strategy to include fixed 
offsets
 Key: ARROW-15887
 URL: https://issues.apache.org/jira/browse/ARROW-15887
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Alenka Frim


The PyArrow test strategy for {{timezones}} should also include fixed offsets.

Note: fixed offsets are not supported out of the box by hypothesis.
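For reference, "fixed offsets" here means plain `datetime.timezone` instances rather than named IANA zones; a minimal stdlib sketch (the hypothesis strategy wiring in the comment is only an assumption):

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset timezones are built directly from a timedelta; unlike IANA
# zones they carry no DST rules, which is what makes them a distinct test case.
fixed_offsets = [timezone(timedelta(minutes=m)) for m in (-720, -330, 0, 60, 845)]

# A hypothesis strategy could map minute offsets onto such objects, e.g.
# st.integers(-1439, 1439).map(lambda m: timezone(timedelta(minutes=m))).
ts = datetime(2022, 3, 9, 12, 0, tzinfo=fixed_offsets[3])
print(ts.utcoffset())  # 1:00:00
```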



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15886) [Ruby] Add support for #raw_records of Day Millisecond Interval Type

2022-03-09 Thread Keisuke Okada (Jira)
Keisuke Okada created ARROW-15886:
-

 Summary: [Ruby] Add support for #raw_records of Day Millisecond 
Interval Type
 Key: ARROW-15886
 URL: https://issues.apache.org/jira/browse/ARROW-15886
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Ruby
Reporter: Keisuke Okada
Assignee: Keisuke Okada






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15885) [Ruby] Add support for #values of Day Millisecond Interval Type

2022-03-09 Thread Keisuke Okada (Jira)
Keisuke Okada created ARROW-15885:
-

 Summary: [Ruby] Add support for #values of Day Millisecond 
Interval Type
 Key: ARROW-15885
 URL: https://issues.apache.org/jira/browse/ARROW-15885
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Ruby
Reporter: Keisuke Okada
Assignee: Keisuke Okada






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15884) [C++][Doc] Document that the strptime kernel ignores %Z

2022-03-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15884:
-

 Summary: [C++][Doc] Document that the strptime kernel ignores %Z
 Key: ARROW-15884
 URL: https://issues.apache.org/jira/browse/ARROW-15884
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Documentation
Reporter: Joris Van den Bossche


After ARROW-12820, the {{strptime}} kernel still ignores the {{%Z}} specifier (for timezone names); when {{%Z}} is used, any string in that position is silently accepted.

For example:

{code:python}
# the %z specifier now works (after ARROW-12820)
>>> pc.strptime(["2022-03-05 09:00:00+01"], format="%Y-%m-%d %H:%M:%S%z", unit="us")

[
  2022-03-05 08:00:00.00
]

# in theory this should give the same result, but %Z is still ignored
>>> pc.strptime(["2022-03-05 09:00:00 CET"], format="%Y-%m-%d %H:%M:%S %Z", unit="us")

[
  2022-03-05 09:00:00.00
]

# as a result, any garbage in the string is also ignored
>>> pc.strptime(["2022-03-05 09:00:00 blabla"], format="%Y-%m-%d %H:%M:%S %Z", unit="us")

[
  2022-03-05 09:00:00.00
]
{code}

I don't think it is easy to actually fix this (at least as long as we use the 
system strptime, see also 
https://github.com/apache/arrow/pull/11358#issue-1020404727). But at least we 
should document this limitation / gotcha.
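For comparison, the stdlib's pure-Python parser rejects trailing content it cannot match, which is the behaviour one might naively expect here (a stdlib illustration, not the Arrow kernel):

```python
from datetime import datetime

# datetime.strptime raises on unconverted trailing data rather than
# silently ignoring it, unlike the Arrow kernel's current %Z handling.
try:
    datetime.strptime("2022-03-05 09:00:00 blabla", "%Y-%m-%d %H:%M:%S")
    parsed = True
except ValueError as exc:
    parsed = False
    print(exc)  # reports the unconverted trailing data
```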



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15883) [C++] Support for fractional seconds in strptime() for ISO format?

2022-03-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15883:
-

 Summary: [C++] Support for fractional seconds in strptime() for 
ISO format?
 Key: ARROW-15883
 URL: https://issues.apache.org/jira/browse/ARROW-15883
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


Currently, we can't parse "our own" string representation of a timestamp array 
with the timestamp parser {{strptime}}:

{code:python}
import datetime
import pyarrow as pa
import pyarrow.compute as pc

>>> pa.array([datetime.datetime(2022, 3, 5, 9)])

[
  2022-03-05 09:00:00.00
]

# trying to parse the above representation as string
>>> pc.strptime(["2022-03-05 09:00:00.00"], format="%Y-%m-%d %H:%M:%S", unit="us")
...
ArrowInvalid: Failed to parse string: '2022-03-05 09:00:00.00' as a scalar of type timestamp[us]
{code}

The reason for this is the fractional second part, so the following works:

{code:python}
>>> pc.strptime(["2022-03-05 09:00:00"], format="%Y-%m-%d %H:%M:%S", unit="us")

[
  2022-03-05 09:00:00.00
]
{code}

Now, I think the reason this fails is that {{strptime}} only supports parsing seconds as an integer 
(https://man7.org/linux/man-pages/man3/strptime.3.html).

But it creates a strange situation: the timestamp parser cannot parse the very representation we use for timestamps.

In addition, CSV reading uses a custom ISO parser by default, so the same string with fractional seconds does work when parsing strings from a CSV file:

{code:python}
import io
from pyarrow import csv

s = b"""a
2022-03-05 09:00:00.00"""

>>> csv.read_csv(io.BytesIO(s))
pyarrow.Table
a: timestamp[ns]
----
a: [[2022-03-05 09:00:00.0]]
{code}

cc [~apitrou] [~rokm]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15882) [Ci][Python] Nightly hypothesis build is not actually running the hypothesis tests

2022-03-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-15882:
-

 Summary: [Ci][Python] Nightly hypothesis build is not actually 
running the hypothesis tests
 Key: ARROW-15882
 URL: https://issues.apache.org/jira/browse/ARROW-15882
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, Python
Reporter: Joris Van den Bossche
 Fix For: 8.0.0






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15881) [c++] call parquet::arrow::WriteTable in a child thread get segmentation fault

2022-03-09 Thread zzh (Jira)
zzh created ARROW-15881:
---

 Summary: [c++] call parquet::arrow::WriteTable in a child thread 
get segmentation fault
 Key: ARROW-15881
 URL: https://issues.apache.org/jira/browse/ARROW-15881
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 7.0.0
 Environment: CentOS7,gcc7.0+
Reporter: zzh
 Attachments: message.png

When I try to use Arrow to write Parquet files, I encounter an error. Calling parquet::arrow::WriteTable outside a child thread succeeds, but calling it inside a child thread causes a segmentation fault.

The code looks like this:
{code:cpp}
// Template arguments and some identifiers below are reconstructed;
// they were stripped from the original report.
arrow::Int64Builder test_a;
for (int i = 0; i < 1e7; ++i) {
  PARQUET_THROW_NOT_OK(test_a.Append(i));
}
auto sc = arrow::schema({arrow::field("A", arrow::int64())});
auto table = arrow::Table::Make(sc, {test_a.Finish().ValueOrDie()});
const string uuid = sole::uuid4().str();  // "uuid" is a placeholder name
string filename = "test.parq";
try {
  std::shared_ptr<arrow::io::FileOutputStream> outfile;
  PARQUET_ASSIGN_OR_THROW(
      outfile, arrow::io::FileOutputStream::Open(filename));
  PARQUET_THROW_NOT_OK(
      parquet::arrow::WriteTable(*table, arrow::default_memory_pool(),
                                 outfile, table->num_rows()));
} catch (const std::exception& ex) {
  cout << ex.what() << endl;
}

// The same code, run inside a child thread, crashes:
auto worker = std::make_shared<std::thread>([=]() {
  arrow::Int64Builder test_a;
  for (int i = 0; i < 1e7; ++i) {
    PARQUET_THROW_NOT_OK(test_a.Append(i));
  }
  auto sc = arrow::schema({arrow::field("A", arrow::int64())});
  auto table = arrow::Table::Make(sc, {test_a.Finish().ValueOrDie()});
  const string uuid = sole::uuid4().str();
  string filename = "test.parq";
  try {
    std::shared_ptr<arrow::io::FileOutputStream> outfile;
    PARQUET_ASSIGN_OR_THROW(
        outfile, arrow::io::FileOutputStream::Open(filename));
    PARQUET_THROW_NOT_OK(
        parquet::arrow::WriteTable(*table, arrow::default_memory_pool(),
                                   outfile, table->num_rows()));
  } catch (const std::exception& ex) {
    cout << ex.what() << endl;
  }
});{code}
The stack trace is shown in the attached picture (message.png).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15880) [C++] Can't open partitioned dataset if the root directory has "=" in its name

2022-03-09 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15880:


 Summary: [C++] Can't open partitioned dataset if the root 
directory has "=" in its name
 Key: ARROW-15880
 URL: https://issues.apache.org/jira/browse/ARROW-15880
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Nicola Crane


Not sure if this is a bug or "just how Hive-style partitioning works", but if I try to open a dataset whose root directory has an "=" in its name, I have to include that directory in my partitioning to open the dataset successfully.

This has tripped up users who saved a single directory from a partitioned dataset somewhere and then tried to open that directory as a dataset.

{code:r}
library(arrow)
td <- tempfile()
dir.create(td)
# directory with equals sign in name
subdir <- file.path(td, "foo=bar")
dir.create(subdir)
write_dataset(mtcars, subdir, partitioning = "am")
list.files(td, recursive = TRUE)
#> [1] "foo=bar/am=0/part-0.parquet" "foo=bar/am=1/part-0.parquet"
# doesn't work
open_dataset(subdir, partitioning = "am")
#> Error:
#> ! "partitioning" does not match the detected Hive-style partitions: c("foo", 
"am")
#> ℹ Omit "partitioning" to use the Hive partitions
#> ℹ Set `hive_style = FALSE` to override what was detected
#> ℹ Or, to rename partition columns, call `select()` or `rename()` after 
opening the dataset
# works
open_dataset(subdir, partitioning = c("foo", "am"))
#> FileSystemDataset with 2 Parquet files
#> mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> gear: double
#> carb: double
#> foo: string
#> am: int32
#> 
#> See $metadata for additional Schema metadata
{code}

Compare this with the same example where the folder is just called "foobar" instead of "foo=bar".

{code:r}
td <- tempfile()
dir.create(td)
subdir <- file.path(td, "foobar")
dir.create(subdir)
write_dataset(mtcars, subdir, partitioning = "am")
list.files(td, recursive = TRUE)
#> [1] "foobar/am=0/part-0.parquet" "foobar/am=1/part-0.parquet"
# works
open_dataset(subdir, partitioning = "am")
#> FileSystemDataset with 2 Parquet files
#> mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> gear: double
#> carb: double
#> am: int32
#> 
#> See $metadata for additional Schema metadata
{code}
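The discovery behaviour can be illustrated with a toy sketch (this is not Arrow's actual C++ implementation, just the rule it appears to follow): any path segment of the form key=value is treated as a partition field, including the root directory itself.

```python
def detect_hive_fields(relative_path):
    """Toy model of Hive-style partition discovery over one file path."""
    fields = []
    for segment in relative_path.split("/")[:-1]:  # last segment is the file name
        if "=" in segment:
            key, _, value = segment.partition("=")
            fields.append((key, value))
    return fields

# The root directory "foo=bar" is indistinguishable from a partition segment:
print(detect_hive_fields("foo=bar/am=0/part-0.parquet"))  # [('foo', 'bar'), ('am', '0')]
print(detect_hive_fields("foobar/am=0/part-0.parquet"))   # [('am', '0')]
```

This is why passing `partitioning = c("foo", "am")` works: the detected fields then match the declared ones.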




--
This message was sent by Atlassian Jira
(v8.20.1#820001)