[jira] [Created] (ARROW-16540) Support storing different timezone in an array
Gaurav Sheni created ARROW-16540: --- Summary: Support storing different timezone in an array Key: ARROW-16540 URL: https://issues.apache.org/jira/browse/ARROW-16540 Project: Apache Arrow Issue Type: New Feature Components: Format, Python Reporter: Gaurav Sheni As a user, I wish I could use pyarrow to store a column of datetimes with different timezones. In certain datasets, it is ideal to have a column with mixed timezones (ex - taxi pickups). Even if the data is limited to a single location (let's say a business in NYC) over the time span of a single year, the timezones will be EDT/EST with offsets of -4:00 and -5:00. Currently, it is not possible to keep a column with different timezones.
{code:java}
from datetime import datetime

import pyarrow as pa
import pytz

arr = pa.array([datetime(2010, 1, 1, tzinfo=pytz.timezone('US/Central')),
                datetime(2015, 1, 1, tzinfo=pytz.timezone('US/Eastern'))])
arr.type
arr[0]
arr[1]
{code}
{code:java}
TimestampType(timestamp[us, tz=US/Central])
{code}
> Notice how the resulting type is timestamp[us, tz=US/Central], so both rows now carry the Central timezone -- This message was sent by Atlassian Jira (v8.20.7#820007)
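A possible interim workaround (a sketch only, assuming it is acceptable to keep the zone name in a companion column rather than in the timestamp type itself): normalize each value to UTC and store the original zone alongside it, so no per-row timezone information is lost.
{code:python}
from datetime import datetime

import pyarrow as pa
import pytz

rows = [datetime(2010, 1, 1, tzinfo=pytz.timezone('US/Central')),
        datetime(2015, 1, 1, tzinfo=pytz.timezone('US/Eastern'))]

# Store the instants normalized to UTC and keep each row's original zone name
# in a sibling column.
table = pa.table({
    "ts_utc": pa.array([d.astimezone(pytz.utc) for d in rows],
                       type=pa.timestamp("us", tz="UTC")),
    "tz": pa.array([str(d.tzinfo) for d in rows]),
})
{code}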
[jira] [Created] (ARROW-16539) [C++] Bump thrift to 0.16.0
Neal Richardson created ARROW-16539: --- Summary: [C++] Bump thrift to 0.16.0 Key: ARROW-16539 URL: https://issues.apache.org/jira/browse/ARROW-16539 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Neal Richardson Assignee: Neal Richardson Looking at an unrelated issue, I noticed we're on 0.13, which is 2 years old. Figured it wouldn't hurt to try updating to the latest. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16538) [Java] Refactor FakeResultSet to support arbitrary tests
Todd Farmer created ARROW-16538: --- Summary: [Java] Refactor FakeResultSet to support arbitrary tests Key: ARROW-16538 URL: https://issues.apache.org/jira/browse/ARROW-16538 Project: Apache Arrow Issue Type: Improvement Components: Java Affects Versions: 8.0.0 Reporter: Todd Farmer Assignee: Todd Farmer The existing FakeResultSet used in tests of the JDBC adapter is difficult to use for building arbitrary ResultSets, such as would be useful in dealing with issues like ARROW-16427. Converting it into a more generic utility for building mock ResultSets would enable testing of JDBC vendor-specific behavior as it is discovered, without actually referencing those drivers within test code. Finally, it would be useful to move such a utility to a general-purpose class, leaving only the test code in the existing test class. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16537) [Java] Patch dataset module testing failure with JSE11+
David Dali Susanibar Arce created ARROW-16537: - Summary: [Java] Patch dataset module testing failure with JSE11+ Key: ARROW-16537 URL: https://issues.apache.org/jira/browse/ARROW-16537 Project: Apache Arrow Issue Type: Sub-task Components: Java Affects Versions: 9.0.0 Reporter: David Dali Susanibar Arce Assignee: David Dali Susanibar Arce The Dataset module is currently failing a test when run locally (a failure not seen in the CI process). Update the dataset module so its test classes can run without the following error: TestReservationListener.testDirectReservationListener:50 » Runtime error {code:java} $ cd arrow/java/dataset $ mvn -Drat.skip=true -Darrow.cpp.build.dir=/Users/arrow/java-dist/lib/ -Parrow-jni clean install [INFO] Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.758 s - in org.apache.arrow.dataset.file.TestFileSystemDataset [INFO] [INFO] Results: [INFO] [ERROR] Errors: [ERROR] TestReservationListener.testDirectReservationListener:50 » Runtime java.lang.N... [INFO] [ERROR] Tests run: 16, Failures: 0, Errors: 1, Skipped: 0 {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16536) [Doc][Cookbook][Flight] Find client address from ArrowFlightServer
Rok Mihevc created ARROW-16536: -- Summary: [Doc][Cookbook][Flight] Find client address from ArrowFlightServer Key: ARROW-16536 URL: https://issues.apache.org/jira/browse/ARROW-16536 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Rok Mihevc We want a cookbook entry for Python/C++/Java describing how to get the client's address from within an Arrow Flight server. See: [Java|https://stackoverflow.com/a/36140002/262727], [Python|https://arrow.apache.org/docs/python/generated/pyarrow.flight.ServerCallContext.html#pyarrow.flight.ServerCallContext.peer], [C++|https://arrow.apache.org/docs/cpp/api/flight.html#_CPPv4NK5arrow6flight17ServerCallContext4peerEv] -- This message was sent by Atlassian Jira (v8.20.7#820007)
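A minimal Python sketch of what the cookbook recipe could show, using the ServerCallContext.peer() accessor linked above; the server class, table contents, and port below are illustrative only.
{code:python}
import pyarrow as pa
import pyarrow.flight as flight


class PeerAwareServer(flight.FlightServerBase):
    """Illustrative Flight server that reports the client's address on each DoGet."""

    def do_get(self, context, ticket):
        # ServerCallContext.peer() returns the remote address of the caller,
        # e.g. "ipv4:127.0.0.1:51512".
        print("client address:", context.peer())
        return flight.RecordBatchStream(pa.table({"answer": [42]}))


if __name__ == "__main__":
    PeerAwareServer("grpc://0.0.0.0:8815").serve()
{code}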
[jira] [Created] (ARROW-16535) [C++] Temporal floor/ceil/round should have settable origin unit
Rok Mihevc created ARROW-16535: -- Summary: [C++] Temporal floor/ceil/round should have settable origin unit Key: ARROW-16535 URL: https://issues.apache.org/jira/browse/ARROW-16535 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Rok Mihevc Temporal rounding kernels (will) allow setting the rounding origin to a greater unit. This could be made more flexible by introducing a `greater_unit` parameter that would let the user select the unit serving as the origin. See [this discussion|https://github.com/apache/arrow/pull/12657#issuecomment-1119580484] for more context. -- This message was sent by Atlassian Jira (v8.20.7#820007)
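For illustration, the sketch below shows current usage of the floor kernel from Python and, commented out, one possible spelling of the proposed `greater_unit` parameter (hypothetical, not implemented):
{code:python}
from datetime import datetime

import pyarrow as pa
import pyarrow.compute as pc

ts = pa.array([datetime(2022, 5, 11, 13, 47, 23)], type=pa.timestamp("s"))

# Existing behaviour: floor to a multiple of 15 minutes.
pc.floor_temporal(ts, multiple=15, unit="minute")

# Proposed (hypothetical) spelling: use the start of the enclosing day as the
# origin of the 15-minute grid rather than the default origin.
# pc.floor_temporal(ts, multiple=15, unit="minute", greater_unit="day")
{code}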
[jira] [Created] (ARROW-16534) Update gandiva protobuf library version to support M1
Larry White created ARROW-16534: --- Summary: Update gandiva protobuf library version to support M1 Key: ARROW-16534 URL: https://issues.apache.org/jira/browse/ARROW-16534 Project: Apache Arrow Issue Type: Bug Components: Java Affects Versions: 8.0.0, 9.0.0 Environment: macOS, M1 Reporter: Larry White Gandiva needs to generate Protobuf Java sources from the definitions, and this relies on a JAR that has the native Protobuf compiler embedded in it - but the current package doesn't have an ARMv8 build available. protobuf-java version 3.20.1 does have M1 support. This means that building from source as documented (https://arrow.apache.org/docs/developers/java/building.html) cannot be done on M1 as the following exception occurs: [INFO] BUILD FAILURE [INFO] [INFO] Total time: 03:38 min [INFO] Finished at: 2022-05-10T16:19:24-04:00 [INFO] [ERROR] Failed to execute goal org.xolstice.maven.plugins:protobuf-maven-plugin:0.6.1:compile (default) on project arrow-gandiva: Unable to resolve artifact: Missing: [ERROR] -- [ERROR] 1) com.google.protobuf:protoc:exe:osx-aarch_64:2.5.0 [ERROR] [ERROR] Try downloading the file manually from the project website. [ERROR] [ERROR] Then, install it using the command: [ERROR] mvn install:install-file -DgroupId=com.google.protobuf -DartifactId=protoc -Dversion=2.5.0 -Dclassifier=osx-aarch_64 -Dpackaging=exe -Dfile=/path/to/file [ERROR] [ERROR] Alternatively, if you host your own repository you can deploy the file there: [ERROR] mvn deploy:deploy-file -DgroupId=com.google.protobuf -DartifactId=protoc -Dversion=2.5.0 -Dclassifier=osx-aarch_64 -Dpackaging=exe -Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id] [ERROR] [ERROR] Path to dependency: [ERROR] 1) org.apache.arrow.gandiva:arrow-gandiva:jar:9.0.0-SNAPSHOT [ERROR] 2) com.google.protobuf:protoc:exe:osx-aarch_64:2.5.0 [ERROR] [ERROR] -- [ERROR] 1 required artifact is missing. [ERROR] [ERROR] for artifact: [ERROR] org.apache.arrow.gandiva:arrow-gandiva:jar:9.0.0-SNAPSHOT [ERROR] [ERROR] from the specified remote repositories: [ERROR] apache.snapshots (https://repository.apache.org/snapshots, releases=false, snapshots=true), [ERROR] central (https://repo.maven.apache.org/maven2, releases=true, snapshots=false) [ERROR] [ERROR] -> [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn -rf :arrow-gandiva -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16533) Update gandiva protobuf compilation support to include M1
Larry White created ARROW-16533: --- Summary: Update gandiva protobuf compilation support to include M1 Key: ARROW-16533 URL: https://issues.apache.org/jira/browse/ARROW-16533 Project: Apache Arrow Issue Type: Bug Components: Java Affects Versions: 8.0.0, 9.0.0 Environment: macOS, M1 Reporter: Larry White Gandiva needs to generate Protobuf Java sources from the definitions, and this relies on a JAR that has the native Protobuf compiler embedded in it - but the current package doesn't have an ARMv8 build available. protobuf-java version 3.20.1 does have M1 support. This means that building from source as documented (https://arrow.apache.org/docs/developers/java/building.html) cannot be done on M1 as the following exception occurs: [INFO] BUILD FAILURE [INFO] [INFO] Total time: 03:38 min [INFO] Finished at: 2022-05-10T16:19:24-04:00 [INFO] [ERROR] Failed to execute goal org.xolstice.maven.plugins:protobuf-maven-plugin:0.6.1:compile (default) on project arrow-gandiva: Unable to resolve artifact: Missing: [ERROR] -- [ERROR] 1) com.google.protobuf:protoc:exe:osx-aarch_64:2.5.0 [ERROR] [ERROR] Try downloading the file manually from the project website. [ERROR] [ERROR] Then, install it using the command: [ERROR] mvn install:install-file -DgroupId=com.google.protobuf -DartifactId=protoc -Dversion=2.5.0 -Dclassifier=osx-aarch_64 -Dpackaging=exe -Dfile=/path/to/file [ERROR] [ERROR] Alternatively, if you host your own repository you can deploy the file there: [ERROR] mvn deploy:deploy-file -DgroupId=com.google.protobuf -DartifactId=protoc -Dversion=2.5.0 -Dclassifier=osx-aarch_64 -Dpackaging=exe -Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id] [ERROR] [ERROR] Path to dependency: [ERROR] 1) org.apache.arrow.gandiva:arrow-gandiva:jar:9.0.0-SNAPSHOT [ERROR] 2) com.google.protobuf:protoc:exe:osx-aarch_64:2.5.0 [ERROR] [ERROR] -- [ERROR] 1 required artifact is missing. [ERROR] [ERROR] for artifact: [ERROR] org.apache.arrow.gandiva:arrow-gandiva:jar:9.0.0-SNAPSHOT [ERROR] [ERROR] from the specified remote repositories: [ERROR] apache.snapshots (https://repository.apache.org/snapshots, releases=false, snapshots=true), [ERROR] central (https://repo.maven.apache.org/maven2, releases=true, snapshots=false) [ERROR] [ERROR] -> [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn -rf :arrow-gandiva -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16532) Update gandiva protobuf compilation support to include M1
Larry White created ARROW-16532: --- Summary: Update gandiva protobuf compilation support to include M1 Key: ARROW-16532 URL: https://issues.apache.org/jira/browse/ARROW-16532 Project: Apache Arrow Issue Type: Bug Components: Java Affects Versions: 8.0.0, 9.0.0 Environment: macOS, M1 Reporter: Larry White Gandiva needs to generate Protobuf Java sources from the definitions, and this relies on a JAR that has the native Protobuf compiler embedded in it - but the current package doesn't have an ARMv8 build available. protobuf-java version 3.20.1 does have M1 support. This means that building from source as documented (https://arrow.apache.org/docs/developers/java/building.html) cannot be done on M1 as the following exception occurs: [INFO] BUILD FAILURE [INFO] [INFO] Total time: 03:38 min [INFO] Finished at: 2022-05-10T16:19:24-04:00 [INFO] [ERROR] Failed to execute goal org.xolstice.maven.plugins:protobuf-maven-plugin:0.6.1:compile (default) on project arrow-gandiva: Unable to resolve artifact: Missing: [ERROR] -- [ERROR] 1) com.google.protobuf:protoc:exe:osx-aarch_64:2.5.0 [ERROR] [ERROR] Try downloading the file manually from the project website. [ERROR] [ERROR] Then, install it using the command: [ERROR] mvn install:install-file -DgroupId=com.google.protobuf -DartifactId=protoc -Dversion=2.5.0 -Dclassifier=osx-aarch_64 -Dpackaging=exe -Dfile=/path/to/file [ERROR] [ERROR] Alternatively, if you host your own repository you can deploy the file there: [ERROR] mvn deploy:deploy-file -DgroupId=com.google.protobuf -DartifactId=protoc -Dversion=2.5.0 -Dclassifier=osx-aarch_64 -Dpackaging=exe -Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id] [ERROR] [ERROR] Path to dependency: [ERROR] 1) org.apache.arrow.gandiva:arrow-gandiva:jar:9.0.0-SNAPSHOT [ERROR] 2) com.google.protobuf:protoc:exe:osx-aarch_64:2.5.0 [ERROR] [ERROR] -- [ERROR] 1 required artifact is missing. [ERROR] [ERROR] for artifact: [ERROR] org.apache.arrow.gandiva:arrow-gandiva:jar:9.0.0-SNAPSHOT [ERROR] [ERROR] from the specified remote repositories: [ERROR] apache.snapshots (https://repository.apache.org/snapshots, releases=false, snapshots=true), [ERROR] central (https://repo.maven.apache.org/maven2, releases=true, snapshots=false) [ERROR] [ERROR] -> [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn -rf :arrow-gandiva -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16531) [Python] Lint rules do not seem to be getting enforced
Weston Pace created ARROW-16531: --- Summary: [Python] Lint rules do not seem to be getting enforced Key: ARROW-16531 URL: https://issues.apache.org/jira/browse/ARROW-16531 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Weston Pace It seems there are some legitimate linting errors in master. {noformat} (arrow-dev) pace@pace-desktop:~/dev/arrow$ archery lint --python INFO:archery:Running Python formatter (autopep8) INFO:archery:Running Python linter (flake8) /home/pace/dev/arrow/python/pyarrow/_parquet.pyx:156:80: E501 line too long (80 > 79 characters) /home/pace/dev/arrow/python/pyarrow/_parquet.pyx:170:80: E501 line too long (80 > 79 characters) /home/pace/dev/arrow/python/pyarrow/_parquet.pyx:242:77: W291 trailing whitespace /home/pace/dev/arrow/python/pyarrow/_parquet.pyx:447:80: E501 line too long (80 > 79 characters) /home/pace/dev/arrow/python/pyarrow/_parquet.pyx:447:81: W291 trailing whitespace ... {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16530) Serial read operations on columns, even when parallel = true
Robert created ARROW-16530: -- Summary: Serial read operations on columns, even when parallel = true Key: ARROW-16530 URL: https://issues.apache.org/jira/browse/ARROW-16530 Project: Apache Arrow Issue Type: Improvement Components: Go Affects Versions: 8.0.0 Environment: Linux, golang 1.18, AMD64 Reporter: Robert Fix For: 9.0.0 I have submitted a pull request with the changes. In pqarrow, when getting column readers for columns and struct members, the default behavior is a for loop that processes each column serially. "Getting" a reader triggers a read request, so these reads are always issued serially. The logic for fetching the next batch of records works the same way, a for loop iterating through the columns, so those reads are issued serially as well. The performance impact is especially large on high-latency storage such as cloud object stores. I'm working with complex Parquet files with 500+ "root" columns where some fields are lists of structs, and some of those structs have hundreds of columns. In my tests, 800+ read operations are issued to GCS serially, which makes the current state of pqarrow too slow to be usable. The revision is to process the columns concurrently when retrieving child readers and column readers, and to issue batch requests concurrently. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16529) [Java] Remove dependency on optional JDBC ResultSet method
Todd Farmer created ARROW-16529: --- Summary: [Java] Remove dependency on optional JDBC ResultSet method Key: ARROW-16529 URL: https://issues.apache.org/jira/browse/ARROW-16529 Project: Apache Arrow Issue Type: Improvement Components: Java Affects Versions: 8.0.0 Reporter: Todd Farmer Assignee: Todd Farmer [~jswenson] points out that the fix for ARROW-16035 uses the ResultSet.isLast() method, which is listed as optional for vendor support in the (likely common) condition that the result set is forward-scrollable only. This new code replaced dependency on ResultSet.isAfterLast(), which is similarly annotated as optional in the same context (and has the additional challenge of being non-deterministic in the case of empty result sets). To eliminate these dependencies, we propose the following:
# The ArrowVectorIterator returned from processing ResultSets will _always_ have at least one element, meaning hasNext() will return true initially, even in the case of empty ResultSets.
# Calling ArrowVectorIterator.next() will establish whether there is actual data to be supplied, and will return an "empty" VectorSchemaRoot when an empty ResultSet was supplied originally.
# Subsequent calls to ArrowVectorIterator.hasNext() will return false in the case when an empty ResultSet was supplied.
This is a behavior change, in that the current ARROW-16035-patched code returns false today when an empty ResultSet was supplied, _and_ the JDBC driver optionally implements ResultSet.isLast(). -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16528) [JS] support for LargeUtf8 type
Dmytro Sambor created ARROW-16528: - Summary: [JS] support for LargeUtf8 type Key: ARROW-16528 URL: https://issues.apache.org/jira/browse/ARROW-16528 Project: Apache Arrow Issue Type: Improvement Reporter: Dmytro Sambor Does the JS library support [LargeUtf8|https://github.com/apache/arrow/blob/a8479e9c252482438b6fc2bc0383ac5cf6a09d59/format/Schema.fbs#L165]? I'm getting an error when trying to read an Arrow file produced by the Rust library: {{Unrecognized type: "LargeUtf8"}} -- This message was sent by Atlassian Jira (v8.20.7#820007)
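For reference, a file with a LargeUtf8 column can also be produced from Python to reproduce the read error on the JS side (the file name below is arbitrary):
{code:python}
import pyarrow as pa

# large_string() corresponds to the LargeUtf8 logical type in the Arrow format.
table = pa.table({"col": pa.array(["a", "b"], type=pa.large_string())})

with pa.ipc.new_file("large_utf8.arrow", table.schema) as writer:
    writer.write_table(table)
{code}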
[jira] [Created] (ARROW-16527) [Gandiva][C++] Add binary functions
Johnnathan Rodrigo Pego de Almeida created ARROW-16527: -- Summary: [Gandiva][C++] Add binary functions Key: ARROW-16527 URL: https://issues.apache.org/jira/browse/ARROW-16527 Project: Apache Arrow Issue Type: New Feature Components: C++ - Gandiva Reporter: Johnnathan Rodrigo Pego de Almeida Implement binary functions on the Gandiva side, based on the [Hive implementation|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFToBinary.java]. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16526) [Python] test_partitioned_dataset fails when building with PARQUET but without DATASET
Raúl Cumplido created ARROW-16526: - Summary: [Python] test_partitioned_dataset fails when building with PARQUET but without DATASET Key: ARROW-16526 URL: https://issues.apache.org/jira/browse/ARROW-16526 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 8.0.0 Reporter: Raúl Cumplido Fix For: 9.0.0 Our current [minimal_build examples|https://github.com/apache/arrow/tree/master/python/examples/minimal_build] for Python build with: {code:java} -DARROW_PARQUET=ON \{code} but without DATASET. This produces the following failure: {code:java} _ test_partitioned_dataset[True] _tempdir = PosixPath('/tmp/pytest-of-root/pytest-0/test_partitioned_dataset_True_0'), use_legacy_dataset = True @pytest.mark.pandas @parametrize_legacy_dataset def test_partitioned_dataset(tempdir, use_legacy_dataset): # ARROW-3208: Segmentation fault when reading a Parquet partitioned dataset # to a Parquet file path = tempdir / "ARROW-3208" df = pd.DataFrame({ 'one': [-1, 10, 2.5, 100, 1000, 1, 29.2], 'two': [-1, 10, 2, 100, 1000, 1, 11], 'three': [0, 0, 0, 0, 0, 0, 0] }) table = pa.Table.from_pandas(df) > pq.write_to_dataset(table, root_path=str(path), partition_cols=['one', 'two'])pyarrow/tests/parquet/test_dataset.py:1544: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ pyarrow/parquet/__init__.py:3110: in write_to_dataset import pyarrow.dataset as ds _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ """Dataset is currently unstable. APIs subject to change without notice.""" import pyarrow as pa from pyarrow.util import _is_iterable, _stringify_path, _is_path_like > from pyarrow._dataset import ( # noqa CsvFileFormat, CsvFragmentScanOptions, Dataset, DatasetFactory, DirectoryPartitioning, FilenamePartitioning, FileFormat, FileFragment, FileSystemDataset, FileSystemDatasetFactory, FileSystemFactoryOptions, FileWriteOptions, Fragment, FragmentScanOptions, HivePartitioning, IpcFileFormat, IpcFileWriteOptions, InMemoryDataset, Partitioning, PartitioningFactory, Scanner, TaggedRecordBatch, UnionDataset, UnionDatasetFactory, _get_partition_keys, _filesystemdataset_write, ) E ModuleNotFoundError: No module named 'pyarrow._dataset' {code} This can be reproduced by running the minimal_build examples: {code:java} $ cd arrow/python/examples/minimal_build $ docker build -t arrow_ubuntu_minimal -f Dockerfile.ubuntu . {code} or by building Arrow and pyarrow with PARQUET but without DATASET. -- This message was sent by Atlassian Jira (v8.20.7#820007)
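One possible direction, sketched here only as an assumption and not the agreed fix: guard such tests on the availability of pyarrow.dataset, since pq.write_to_dataset imports it internally.
{code:python}
import pytest

try:
    import pyarrow.dataset  # noqa: F401
    _dataset_available = True
except ImportError:
    _dataset_available = False

# A skipif marker of this shape could be applied to tests that reach
# pq.write_to_dataset when pyarrow was built without ARROW_DATASET.
requires_dataset = pytest.mark.skipif(
    not _dataset_available, reason="pyarrow built without ARROW_DATASET")
{code}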