[jira] [Created] (ARROW-16540) Support storing different timezone in an array

2022-05-11 Thread Gaurav Sheni (Jira)
Gaurav Sheni created ARROW-16540:


 Summary: Support storing different timezone in an array 
 Key: ARROW-16540
 URL: https://issues.apache.org/jira/browse/ARROW-16540
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Format, Python
Reporter: Gaurav Sheni


As a user, I wish I could use pyarrow to store a column of datetimes with 
different timezones. In certain datasets, it is ideal to have a column with mixed 
timezones (e.g. taxi pickups). Even if the data is limited to a single location 
(say, a business in NYC) over the span of a single year, the timezones will be 
EDT/EST with offsets of -4:00 and -5:00.

 

Currently, it is not possible to keep a column with different timezones.

 
{code:java}
from datetime import datetime

import pyarrow as pa
import pytz

arr = pa.array([datetime(2010, 1, 1, tzinfo=pytz.timezone('US/Central')),
                datetime(2015, 1, 1, tzinfo=pytz.timezone('US/Eastern'))])
arr.type
arr[0]
arr[1]
{code}
 

 
{code:java}
TimestampType(timestamp[us, tz=US/Central])
{code}

Notice how both rows now have the Central timezone.

 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16539) [C++] Bump thrift to 0.16.0

2022-05-11 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-16539:
---

 Summary: [C++] Bump thrift to 0.16.0
 Key: ARROW-16539
 URL: https://issues.apache.org/jira/browse/ARROW-16539
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Neal Richardson
Assignee: Neal Richardson


Looking at an unrelated issue, I noticed we're on 0.13, which is 2 years old. 
Figured it wouldn't hurt to try updating to the latest.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16538) [Java] Refactor FakeResultSet to support arbitrary tests

2022-05-11 Thread Todd Farmer (Jira)
Todd Farmer created ARROW-16538:
---

 Summary: [Java] Refactor FakeResultSet to support arbitrary tests
 Key: ARROW-16538
 URL: https://issues.apache.org/jira/browse/ARROW-16538
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Affects Versions: 8.0.0
Reporter: Todd Farmer
Assignee: Todd Farmer


The existing FakeResultSet used in tests of the JDBC adapter is difficult to 
use for building arbitrary ResultSets, such as would be useful in dealing with 
issues like ARROW-16427.  Converting this to a more generic utility for building 
mock ResultSets would enable testing of JDBC vendor-specific behavior as it is 
discovered, without actually referencing those drivers within test code.  
Finally, it would be useful to move such a utility to a general-purpose class, 
leaving just the test code in the existing test class.
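The idea can be sketched language-agnostically (Python here, purely illustrative; the real utility would be a Java class and its method names are not fixed by this issue): a data-driven fake result set that tests construct with arbitrary columns and rows.

```python
class FakeResultSet:
    """Hypothetical generic mock: tests supply the schema and rows."""

    def __init__(self, columns, rows):
        self.columns = columns
        self._rows = rows
        self._idx = -1  # before the first row, mirroring JDBC cursors

    def next(self):
        # Advance the cursor; return False once past the last row,
        # like JDBC ResultSet.next()
        self._idx += 1
        return self._idx < len(self._rows)

    def get(self, column):
        # Look up a value in the current row by column name
        return self._rows[self._idx][self.columns.index(column)]
```

Because the rows are plain data, vendor-specific quirks can be reproduced in tests without referencing the vendor's driver.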



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16537) [Java] Patch dataset module testing failure with JSE11+

2022-05-11 Thread David Dali Susanibar Arce (Jira)
David Dali Susanibar Arce created ARROW-16537:
-

 Summary: [Java] Patch dataset module testing failure with JSE11+
 Key: ARROW-16537
 URL: https://issues.apache.org/jira/browse/ARROW-16537
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Affects Versions: 9.0.0
Reporter: David Dali Susanibar Arce
Assignee: David Dali Susanibar Arce


The Dataset module currently fails tests locally, unlike in the CI process.

Make the dataset module's test classes runnable without this failure: 
TestReservationListener.testDirectReservationListener:50 » Runtime error
{code:java}
$ cd arrow/java/dataset
$ mvn -Drat.skip=true -Darrow.cpp.build.dir=/Users/arrow/java-dist/lib/ 
-Parrow-jni clean install
[INFO] Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.758 s 
- in org.apache.arrow.dataset.file.TestFileSystemDataset
[INFO] 
[INFO] Results:
[INFO] 
[ERROR] Errors: 
[ERROR]   TestReservationListener.testDirectReservationListener:50 » Runtime 
java.lang.N...
[INFO] 
[ERROR] Tests run: 16, Failures: 0, Errors: 1, Skipped: 0 {code}
 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16536) [Doc][Cookbook][Flight] Find client address from ArrowFlightServer

2022-05-11 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-16536:
--

 Summary: [Doc][Cookbook][Flight] Find client address from 
ArrowFlightServer
 Key: ARROW-16536
 URL: https://issues.apache.org/jira/browse/ARROW-16536
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Rok Mihevc


We want a cookbook entry for Python/C++/Java describing how to get an Arrow 
Flight server client's address.
See:
[Java|https://stackoverflow.com/a/36140002/262727]
[Python|https://arrow.apache.org/docs/python/generated/pyarrow.flight.ServerCallContext.html#pyarrow.flight.ServerCallContext.peer]
[C++|https://arrow.apache.org/docs/cpp/api/flight.html#_CPPv4NK5arrow6flight17ServerCallContext4peerEv]



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16535) [C++] Temporal floor/ceil/round should have settable origin unit

2022-05-11 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-16535:
--

 Summary: [C++] Temporal floor/ceil/round should have settable 
origin unit
 Key: ARROW-16535
 URL: https://issues.apache.org/jira/browse/ARROW-16535
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Rok Mihevc


Temporal rounding kernels (will) allow setting the rounding origin to a greater 
unit. This could be made more flexible by introducing a `greater_unit` 
parameter which would let the user select the unit serving as the origin. See [this 
discussion|https://github.com/apache/arrow/pull/12657#issuecomment-1119580484] 
for more context.
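The effect of the origin unit can be shown with a small stand-alone sketch (plain Python, not the Arrow kernel; `floor_to` is a made-up helper): flooring to a multiple of 35 minutes gives different answers depending on whether the origin is the start of the hour or the start of the day.

```python
from datetime import datetime, timedelta

def floor_to(ts, multiple, origin):
    # Floor ts to a whole number of `multiple` intervals past `origin`
    delta = ts - origin
    return origin + (delta // multiple) * multiple

ts = datetime(2022, 5, 11, 10, 53)
hour_origin = ts.replace(minute=0, second=0, microsecond=0)   # 10:00
day_origin = ts.replace(hour=0, minute=0, second=0, microsecond=0)  # 00:00

# Origin at the hour: 53 min past 10:00 floors to 10:35
# Origin at the day: 653 min past 00:00 floors to 10:30
```

Since the choice of origin changes the result whenever the interval does not evenly divide the greater unit, exposing it as a parameter makes the kernel's behavior explicit.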



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16534) Update gandiva protobuf library version to support M1

2022-05-11 Thread Larry White (Jira)
Larry White created ARROW-16534:
---

 Summary: Update gandiva protobuf library version to support M1
 Key: ARROW-16534
 URL: https://issues.apache.org/jira/browse/ARROW-16534
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Affects Versions: 8.0.0, 9.0.0
 Environment: macOS, M1
Reporter: Larry White


Gandiva needs to generate Protobuf Java sources from the definitions, and this 
relies on a JAR that has the native Protobuf compiler embedded in it - but the 
current package doesn't have an ARMv8 build available.  protobuf-java version 
3.20.1 does have M1 support.
 
This means that building from source as documented 
(https://arrow.apache.org/docs/developers/java/building.html) cannot be done on 
M1 as the following exception occurs:
 

[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time:  03:38 min
[INFO] Finished at: 2022-05-10T16:19:24-04:00
[INFO] 
[ERROR] Failed to execute goal 
org.xolstice.maven.plugins:protobuf-maven-plugin:0.6.1:compile (default) on 
project arrow-gandiva: Unable to resolve artifact: Missing:
[ERROR] --
[ERROR] 1) com.google.protobuf:protoc:exe:osx-aarch_64:2.5.0
[ERROR]
[ERROR]   Try downloading the file manually from the project website.
[ERROR]
[ERROR]   Then, install it using the command:
[ERROR]   mvn install:install-file -DgroupId=com.google.protobuf 
-DartifactId=protoc -Dversion=2.5.0 -Dclassifier=osx-aarch_64 -Dpackaging=exe 
-Dfile=/path/to/file
[ERROR]
[ERROR]   Alternatively, if you host your own repository you can deploy the 
file there:
[ERROR]   mvn deploy:deploy-file -DgroupId=com.google.protobuf 
-DartifactId=protoc -Dversion=2.5.0 -Dclassifier=osx-aarch_64 -Dpackaging=exe 
-Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id]
[ERROR]
[ERROR]   Path to dependency:
[ERROR] 1) org.apache.arrow.gandiva:arrow-gandiva:jar:9.0.0-SNAPSHOT
[ERROR] 2) com.google.protobuf:protoc:exe:osx-aarch_64:2.5.0
[ERROR]
[ERROR] --
[ERROR] 1 required artifact is missing.
[ERROR]
[ERROR] for artifact:
[ERROR]   org.apache.arrow.gandiva:arrow-gandiva:jar:9.0.0-SNAPSHOT
[ERROR]
[ERROR] from the specified remote repositories:
[ERROR]   apache.snapshots (https://repository.apache.org/snapshots, 
releases=false, snapshots=true),
[ERROR]   central (https://repo.maven.apache.org/maven2, releases=true, 
snapshots=false)
[ERROR]
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn  -rf :arrow-gandiva
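One way to address this (a sketch, not the merged fix) is to have the protobuf-maven-plugin resolve the protoc binary by the platform classifier detected by os-maven-plugin, pinned to a protoc release that publishes an osx-aarch_64 artifact, such as the 3.20.1 mentioned above:

```xml
<!-- Hypothetical pom.xml fragment: ${os.detected.classifier} comes from
     kr.motd.maven:os-maven-plugin; 3.20.1 ships an osx-aarch_64 protoc. -->
<plugin>
  <groupId>org.xolstice.maven.plugins</groupId>
  <artifactId>protobuf-maven-plugin</artifactId>
  <version>0.6.1</version>
  <configuration>
    <protocArtifact>com.google.protobuf:protoc:3.20.1:exe:${os.detected.classifier}</protocArtifact>
  </configuration>
</plugin>
```

On an M1 machine the classifier resolves to osx-aarch_64, so the missing-artifact error above should no longer occur.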
 
 
 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16533) Update gandiva protobuf compilation support to include M1

2022-05-11 Thread Larry White (Jira)
Larry White created ARROW-16533:
---

 Summary: Update gandiva protobuf compilation support to include M1
 Key: ARROW-16533
 URL: https://issues.apache.org/jira/browse/ARROW-16533
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Affects Versions: 8.0.0, 9.0.0
 Environment: macOS, M1
Reporter: Larry White


Gandiva needs to generate Protobuf Java sources from the definitions, and this 
relies on a JAR that has the native Protobuf compiler embedded in it - but the 
current package doesn't have an ARMv8 build available.  protobuf-java version 
3.20.1 does have M1 support.
 
This means that building from source as documented 
(https://arrow.apache.org/docs/developers/java/building.html) cannot be done on 
M1 as the following exception occurs:
 

[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time:  03:38 min
[INFO] Finished at: 2022-05-10T16:19:24-04:00
[INFO] 
[ERROR] Failed to execute goal 
org.xolstice.maven.plugins:protobuf-maven-plugin:0.6.1:compile (default) on 
project arrow-gandiva: Unable to resolve artifact: Missing:
[ERROR] --
[ERROR] 1) com.google.protobuf:protoc:exe:osx-aarch_64:2.5.0
[ERROR]
[ERROR]   Try downloading the file manually from the project website.
[ERROR]
[ERROR]   Then, install it using the command:
[ERROR]   mvn install:install-file -DgroupId=com.google.protobuf 
-DartifactId=protoc -Dversion=2.5.0 -Dclassifier=osx-aarch_64 -Dpackaging=exe 
-Dfile=/path/to/file
[ERROR]
[ERROR]   Alternatively, if you host your own repository you can deploy the 
file there:
[ERROR]   mvn deploy:deploy-file -DgroupId=com.google.protobuf 
-DartifactId=protoc -Dversion=2.5.0 -Dclassifier=osx-aarch_64 -Dpackaging=exe 
-Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id]
[ERROR]
[ERROR]   Path to dependency:
[ERROR] 1) org.apache.arrow.gandiva:arrow-gandiva:jar:9.0.0-SNAPSHOT
[ERROR] 2) com.google.protobuf:protoc:exe:osx-aarch_64:2.5.0
[ERROR]
[ERROR] --
[ERROR] 1 required artifact is missing.
[ERROR]
[ERROR] for artifact:
[ERROR]   org.apache.arrow.gandiva:arrow-gandiva:jar:9.0.0-SNAPSHOT
[ERROR]
[ERROR] from the specified remote repositories:
[ERROR]   apache.snapshots (https://repository.apache.org/snapshots, 
releases=false, snapshots=true),
[ERROR]   central (https://repo.maven.apache.org/maven2, releases=true, 
snapshots=false)
[ERROR]
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn  -rf :arrow-gandiva
 
 
 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16532) Update gandiva protobuf compilation support to include M1

2022-05-11 Thread Larry White (Jira)
Larry White created ARROW-16532:
---

 Summary: Update gandiva protobuf compilation support to include M1
 Key: ARROW-16532
 URL: https://issues.apache.org/jira/browse/ARROW-16532
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Affects Versions: 8.0.0, 9.0.0
 Environment: macOS, M1
Reporter: Larry White


Gandiva needs to generate Protobuf Java sources from the definitions, and this 
relies on a JAR that has the native Protobuf compiler embedded in it - but the 
current package doesn't have an ARMv8 build available.  protobuf-java version 
3.20.1 does have M1 support.
 
This means that building from source as documented 
(https://arrow.apache.org/docs/developers/java/building.html) cannot be done on 
M1 as the following exception occurs:
 

[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time:  03:38 min
[INFO] Finished at: 2022-05-10T16:19:24-04:00
[INFO] 
[ERROR] Failed to execute goal 
org.xolstice.maven.plugins:protobuf-maven-plugin:0.6.1:compile (default) on 
project arrow-gandiva: Unable to resolve artifact: Missing:
[ERROR] --
[ERROR] 1) com.google.protobuf:protoc:exe:osx-aarch_64:2.5.0
[ERROR]
[ERROR]   Try downloading the file manually from the project website.
[ERROR]
[ERROR]   Then, install it using the command:
[ERROR]   mvn install:install-file -DgroupId=com.google.protobuf 
-DartifactId=protoc -Dversion=2.5.0 -Dclassifier=osx-aarch_64 -Dpackaging=exe 
-Dfile=/path/to/file
[ERROR]
[ERROR]   Alternatively, if you host your own repository you can deploy the 
file there:
[ERROR]   mvn deploy:deploy-file -DgroupId=com.google.protobuf 
-DartifactId=protoc -Dversion=2.5.0 -Dclassifier=osx-aarch_64 -Dpackaging=exe 
-Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id]
[ERROR]
[ERROR]   Path to dependency:
[ERROR] 1) org.apache.arrow.gandiva:arrow-gandiva:jar:9.0.0-SNAPSHOT
[ERROR] 2) com.google.protobuf:protoc:exe:osx-aarch_64:2.5.0
[ERROR]
[ERROR] --
[ERROR] 1 required artifact is missing.
[ERROR]
[ERROR] for artifact:
[ERROR]   org.apache.arrow.gandiva:arrow-gandiva:jar:9.0.0-SNAPSHOT
[ERROR]
[ERROR] from the specified remote repositories:
[ERROR]   apache.snapshots (https://repository.apache.org/snapshots, 
releases=false, snapshots=true),
[ERROR]   central (https://repo.maven.apache.org/maven2, releases=true, 
snapshots=false)
[ERROR]
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn  -rf :arrow-gandiva
 
 
 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16531) [Python] Lint rules do not seem to be getting enforced

2022-05-11 Thread Weston Pace (Jira)
Weston Pace created ARROW-16531:
---

 Summary: [Python] Lint rules do not seem to be getting enforced
 Key: ARROW-16531
 URL: https://issues.apache.org/jira/browse/ARROW-16531
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Weston Pace


It seems there are some legitimate linting errors in master.

{noformat}
(arrow-dev) pace@pace-desktop:~/dev/arrow$ archery lint --python
INFO:archery:Running Python formatter (autopep8)
INFO:archery:Running Python linter (flake8)
/home/pace/dev/arrow/python/pyarrow/_parquet.pyx:156:80: E501 line too long (80 
> 79 characters)
/home/pace/dev/arrow/python/pyarrow/_parquet.pyx:170:80: E501 line too long (80 
> 79 characters)
/home/pace/dev/arrow/python/pyarrow/_parquet.pyx:242:77: W291 trailing 
whitespace
/home/pace/dev/arrow/python/pyarrow/_parquet.pyx:447:80: E501 line too long (80 
> 79 characters)
/home/pace/dev/arrow/python/pyarrow/_parquet.pyx:447:81: W291 trailing 
whitespace
...
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16530) Serial read operations on columns, even when parallel = true

2022-05-11 Thread Robert (Jira)
Robert created ARROW-16530:
--

 Summary: Serial read operations on columns, even when parallel = 
true
 Key: ARROW-16530
 URL: https://issues.apache.org/jira/browse/ARROW-16530
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Go
Affects Versions: 8.0.0
 Environment: Linux, golang 1.18, AMD64
Reporter: Robert
 Fix For: 9.0.0


I have submitted a pull request with the changes.

In pqarrow, when getting column readers for columns and struct members, the 
default behavior is a for loop that serially processes each column.  "Getting" a 
reader triggers a read request, so these reads are always issued serially.  The 
code that retrieves the next batch of records iterates over the columns in the 
same way, issuing its reads serially as well.  The performance impact is 
especially large on high-latency files such as cloud storage.

I'm working with complex parquet files with 500+ "root" columns, where some 
fields are lists of structs and some of those structs have hundreds of columns.  
In my tests, 800+ read operations are issued to GCS serially, which makes the 
current state of pqarrow too slow to be usable.

The revision is to concurrently process the columns when retrieving child 
readers and column readers and to concurrently issue batch requests.
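The proposed change lives in Go's pqarrow, but the pattern can be sketched language-agnostically (Python here; `fetch_column` is a stand-in for the per-column read request, not a real pqarrow call):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_column(col):
    # Stand-in for a reader-creation call that triggers a (possibly
    # high-latency) read against object storage
    return f"reader:{col}"

def get_readers(columns, max_workers=8):
    # Issue all per-column requests concurrently instead of in a
    # serial for-loop; results come back in column order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_column, columns))
```

With hundreds of columns and tens of milliseconds of latency per request, overlapping the requests turns a many-second serial wait into a handful of round trips.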



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16529) [Java] Remove dependency on optional JDBC ResultSet method

2022-05-11 Thread Todd Farmer (Jira)
Todd Farmer created ARROW-16529:
---

 Summary: [Java] Remove dependency on optional JDBC ResultSet method
 Key: ARROW-16529
 URL: https://issues.apache.org/jira/browse/ARROW-16529
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Affects Versions: 8.0.0
Reporter: Todd Farmer
Assignee: Todd Farmer


[~jswenson] points out that the fix for ARROW-16035 uses the ResultSet.isLast() 
method, which is listed as optional for vendor support in the (likely common) 
condition that the result set is forward-scrollable only.  This new code 
replaced a dependency on ResultSet.isAfterLast(), which is similarly annotated as 
optional in the same context (and has the additional challenge of being 
non-deterministic for empty result sets).  To eliminate these 
dependencies, we propose the following:
 # The ArrowVectorIterator returned from processing ResultSets will _always_ 
have at least one element, meaning hasNext() will return true initially, even 
in the case of empty ResultSets.
 # Calling ArrowVectorIterator.next() will establish whether there is actual 
data to be supplied, and will return an "empty" VectorSchemaRoot when an empty 
ResultSet was supplied originally.
 # Subsequent calls to ArrowVectorIterator.hasNext() will return false in the 
case when an empty ResultSet was supplied.

This is a behavior change, in that the current ARROW-16035-patched code returns 
false today when an empty ResultSet was supplied, _and_ the JDBC driver 
optionally implements ResultSet.isLast().
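The three-step contract above can be sketched as follows (in Python, purely illustrative; the real API is Java's ArrowVectorIterator over VectorSchemaRoots):

```python
class BatchIterator:
    """Always yields at least one (possibly empty) batch, so callers
    never need optional ResultSet methods like isLast()/isAfterLast()."""

    def __init__(self, rows, batch_size=2):
        self._rows = rows
        self._batch_size = batch_size
        self._pos = 0
        self._yielded_any = False

    def has_next(self):
        # True initially even for an empty source (step 1); after the
        # first next() on an empty source it returns False (step 3)
        return self._pos < len(self._rows) or not self._yielded_any

    def next(self):
        # Returns the next batch; an empty list when the source was
        # empty (step 2)
        batch = self._rows[self._pos:self._pos + self._batch_size]
        self._pos += self._batch_size
        self._yielded_any = True
        return batch
```

The "was anything yielded yet" flag is what removes the need to peek ahead in the underlying result set.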



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16528) [JS] support for LargeUtf8 type

2022-05-11 Thread Dmytro Sambor (Jira)
Dmytro Sambor created ARROW-16528:
-

 Summary: [JS] support for LargeUtf8 type
 Key: ARROW-16528
 URL: https://issues.apache.org/jira/browse/ARROW-16528
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Dmytro Sambor


Does the JS library support 
[LargeUtf8|https://github.com/apache/arrow/blob/a8479e9c252482438b6fc2bc0383ac5cf6a09d59/format/Schema.fbs#L165]?

I'm getting an error when trying to read an Arrow file produced by the Rust library:
{{Unrecognized type: "LargeUtf8"}}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16527) [Gandiva][C++] Add binary functions

2022-05-11 Thread Johnnathan Rodrigo Pego de Almeida (Jira)
Johnnathan Rodrigo Pego de Almeida created ARROW-16527:
--

 Summary: [Gandiva][C++] Add binary functions
 Key: ARROW-16527
 URL: https://issues.apache.org/jira/browse/ARROW-16527
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++ - Gandiva
Reporter: Johnnathan Rodrigo Pego de Almeida


Implement binary functions in Gandiva side based on [Hive 
implementation|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFToBinary.java].



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16526) [Python] test_partitioned_dataset fails when building with PARQUET but without DATASET

2022-05-11 Thread Jira
Raúl Cumplido created ARROW-16526:
-

 Summary: [Python] test_partitioned_dataset fails when building 
with PARQUET but without DATASET
 Key: ARROW-16526
 URL: https://issues.apache.org/jira/browse/ARROW-16526
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 8.0.0
Reporter: Raúl Cumplido
 Fix For: 9.0.0


Our current [minimal_build 
examples|https://github.com/apache/arrow/tree/master/python/examples/minimal_build]
 for Python build with:
{code:java}
 -DARROW_PARQUET=ON \{code}
but without DATASET.

This produces the following failure:
{code:java}
_______________________ test_partitioned_dataset[True] _______________________

tempdir = PosixPath('/tmp/pytest-of-root/pytest-0/test_partitioned_dataset_True_0')
use_legacy_dataset = True

    @pytest.mark.pandas
    @parametrize_legacy_dataset
    def test_partitioned_dataset(tempdir, use_legacy_dataset):
        # ARROW-3208: Segmentation fault when reading a Parquet partitioned 
dataset
        # to a Parquet file
        path = tempdir / "ARROW-3208"
        df = pd.DataFrame({
            'one': [-1, 10, 2.5, 100, 1000, 1, 29.2],
            'two': [-1, 10, 2, 100, 1000, 1, 11],
            'three': [0, 0, 0, 0, 0, 0, 0]
        })
        table = pa.Table.from_pandas(df)
>       pq.write_to_dataset(table, root_path=str(path),
                            partition_cols=['one', 'two'])

pyarrow/tests/parquet/test_dataset.py:1544: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyarrow/parquet/__init__.py:3110: in write_to_dataset
    import pyarrow.dataset as ds
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    """Dataset is currently unstable. APIs subject to change without notice."""
    
    import pyarrow as pa
    from pyarrow.util import _is_iterable, _stringify_path, _is_path_like
    
>   from pyarrow._dataset import (  # noqa
        CsvFileFormat,
        CsvFragmentScanOptions,
        Dataset,
        DatasetFactory,
        DirectoryPartitioning,
        FilenamePartitioning,
        FileFormat,
        FileFragment,
        FileSystemDataset,
        FileSystemDatasetFactory,
        FileSystemFactoryOptions,
        FileWriteOptions,
        Fragment,
        FragmentScanOptions,
        HivePartitioning,
        IpcFileFormat,
        IpcFileWriteOptions,
        InMemoryDataset,
        Partitioning,
        PartitioningFactory,
        Scanner,
        TaggedRecordBatch,
        UnionDataset,
        UnionDatasetFactory,
        _get_partition_keys,
        _filesystemdataset_write,
    )
E   ModuleNotFoundError: No module named 'pyarrow._dataset'
{code}
This can be reproduced via running the minimal_build examples:
{code:java}
$ cd arrow/python/examples/minimal_build
$ docker build -t arrow_ubuntu_minimal -f Dockerfile.ubuntu . {code}
or via building arrow and pyarrow with PARQUET but without DATASET.
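One likely direction for a fix (a sketch; the actual marker name used by pyarrow's test suite is an assumption here) is to detect the optional module and skip dataset-dependent parquet tests on builds compiled without DATASET:

```python
import pytest

# Detect whether this build of pyarrow includes the dataset module
try:
    import pyarrow.dataset  # noqa: F401
    dataset_available = True
except ImportError:
    dataset_available = False

# Hypothetical marker for tests that need pyarrow._dataset
needs_dataset = pytest.mark.skipif(
    not dataset_available, reason="pyarrow.dataset not available")

@needs_dataset
def test_partitioned_dataset_stub():
    # Placeholder standing in for test_partitioned_dataset
    pass
```

On a PARQUET-only build the test would then be reported as skipped instead of erroring with ModuleNotFoundError.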



--
This message was sent by Atlassian Jira
(v8.20.7#820007)