Re: [PR] GH-47123: [Python] Add Enums to PyArrow Types [arrow]

2025-07-23 Thread via GitHub
AlenkaF commented on code in PR #47139: URL: https://github.com/apache/arrow/pull/47139#discussion_r2227607162 ## python/pyarrow/types.py: ## @@ -46,6 +48,79 @@ lib.Type_STRUCT, lib.Type_MAP} | _UNION_TYPES +class TypesEnum(Enum): Review Comment: Agree!

Re: [I] [Python] Unable to import arrow table to pandas if it has categorical columns with index types of unsigned ints [arrow]

2025-07-23 Thread via GitHub
raulcd commented on issue #47022: URL: https://github.com/apache/arrow/issues/47022#issuecomment-3112241428 @AlenkaF @WillAyd thanks for the investigation and context! I suppose it requires someone with time to work on this :) -- This is an automated message from the Apache Git Service.

Re: [PR] GH-47123: [Python] Add Enums to PyArrow Types [arrow]

2025-07-23 Thread via GitHub
lidavidm commented on code in PR #47139: URL: https://github.com/apache/arrow/pull/47139#discussion_r2227558375 ## python/pyarrow/types.py: ## @@ -46,6 +48,79 @@ lib.Type_STRUCT, lib.Type_MAP} | _UNION_TYPES +class TypesEnum(Enum): Review Comment: The o

Re: [I] `ArrowTypeError` when reading a parquet dataset that's partitioned on a `large_string` column [arrow]

2025-07-23 Thread via GitHub
raulcd commented on issue #47177: URL: https://github.com/apache/arrow/issues/47177#issuecomment-3112203729 Hi @TomAugspurger thanks for opening the issue, is this something new for Arrow / PyArrow v 21.0.0 or is this something that was already happening on Arrow v20.0.0?

Re: [I] [DISCUSS] Decouple IO and CPU operations in the Parquet Reader (push decoder?) [arrow-rs]

2025-07-23 Thread via GitHub
tustvold commented on issue #7983: URL: https://github.com/apache/arrow-rs/issues/7983#issuecomment-3112166311 Broadly speaking I agree with this, in fact my original proposal was for such a reader https://github.com/apache/arrow-rs/issues/1605 however the realities of the current code and

Re: [I] [Release] Unify GitHub token related environment variables [arrow]

2025-07-23 Thread via GitHub
kou commented on issue #47075: URL: https://github.com/apache/arrow/issues/47075#issuecomment-3112149316 OK. Let's use `GH_TOKEN`: #47181

Re: [PR] GH-47075: [Release][Dev] Use GH_TOKEN as GitHub token environment variable [arrow]

2025-07-23 Thread via GitHub
github-actions[bot] commented on PR #47181: URL: https://github.com/apache/arrow/pull/47181#issuecomment-3112149866 :warning: GitHub issue #47075 **has been automatically assigned in GitHub** to PR creator.

[PR] GH-47075: [Release][Dev] Use GH_TOKEN as GitHub token environment variable [arrow]

2025-07-23 Thread via GitHub
kou opened a new pull request, #47181: URL: https://github.com/apache/arrow/pull/47181 ### Rationale for this change We have many environment variables for GitHub token: `GH_TOKEN`, `ARROW_GITHUB_API_TOKEN` and `CROSSBOW_GITHUB_TOKEN` It's difficult to maintain. For example, we
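A minimal sketch of the idea behind this change, not the PR's actual code: look up `GH_TOKEN` first and fall back to the older variable names listed above. The helper name `resolve_github_token` and the fallback order are assumptions made for illustration.

```python
import os

# Variable names are taken from the PR description; the lookup order is an assumption.
TOKEN_VARIABLES = ("GH_TOKEN", "ARROW_GITHUB_API_TOKEN", "CROSSBOW_GITHUB_TOKEN")


def resolve_github_token(environ=None):
    """Return the first non-empty token found, or None (hypothetical helper)."""
    env = os.environ if environ is None else environ
    for name in TOKEN_VARIABLES:
        value = env.get(name)
        if value:
            return value
    return None
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise without touching the real environment.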

Re: [PR] GH-47123: [Python] Add Enums to PyArrow Types [arrow]

2025-07-23 Thread via GitHub
AlenkaF commented on code in PR #47139: URL: https://github.com/apache/arrow/pull/47139#discussion_r2227497353 ## python/pyarrow/types.py: ## @@ -46,6 +48,79 @@ lib.Type_STRUCT, lib.Type_MAP} | _UNION_TYPES +class TypesEnum(Enum): Review Comment: Ah, go

Re: [I] [Python] Unable to import arrow table to pandas if it has categorical columns with index types of unsigned ints [arrow]

2025-07-23 Thread via GitHub
AlenkaF commented on issue #47022: URL: https://github.com/apache/arrow/issues/47022#issuecomment-3112052561 This is the PR that implemented this logic and it has a note about the unsigned indices: https://github.com/apache/arrow/pull/7659. I wasn't able to find a follow-up issue though.

Re: [I] [R][Release] CRAN packaging checklist for version 21.0.0 [arrow]

2025-07-23 Thread via GitHub
thisisnic commented on issue #46950: URL: https://github.com/apache/arrow/issues/46950#issuecomment-3112031058 Ah yes, sorry, meant to ping you!

Re: [PR] feat(csharp/src/Drivers/Databricks): Use ArrowSchema for Response Schema [arrow-adbc]

2025-07-23 Thread via GitHub
toddmeng-db commented on PR #3140: URL: https://github.com/apache/arrow-adbc/pull/3140#issuecomment-3112026759 @CurtHagenlocher I think we're considering going ahead and merging this, since we expect that this is still a correctness improvement and ArrowSchema should be the same for other

Re: [PR] GH-45382: [Python] Add support for pandas DataFrame.attrs [arrow]

2025-07-23 Thread via GitHub
AlenkaF commented on PR #47147: URL: https://github.com/apache/arrow/pull/47147#issuecomment-3112020390 The linter error is related, but before you make any changes, I’d like to revise my earlier suggestion regarding the test location — apologies for pointing you in the wrong direction init

Re: [PR] GH-47123: [Python] Add Enums to PyArrow Types [arrow]

2025-07-23 Thread via GitHub
lidavidm commented on code in PR #47139: URL: https://github.com/apache/arrow/pull/47139#discussion_r2227362031 ## python/pyarrow/types.py: ## @@ -46,6 +48,79 @@ lib.Type_STRUCT, lib.Type_MAP} | _UNION_TYPES +class TypesEnum(Enum): Review Comment: No, w

Re: [PR] GH-47123: [Python] Add Enums to PyArrow Types [arrow]

2025-07-23 Thread via GitHub
AlenkaF commented on code in PR #47139: URL: https://github.com/apache/arrow/pull/47139#discussion_r2227357938 ## python/pyarrow/types.py: ## @@ -46,6 +48,79 @@ lib.Type_STRUCT, lib.Type_MAP} | _UNION_TYPES +class TypesEnum(Enum): Review Comment: You me

Re: [PR] [Variant] Avoid extra buffer allocation in ListBuilder [arrow-rs]

2025-07-23 Thread via GitHub
klion26 commented on code in PR #7987: URL: https://github.com/apache/arrow-rs/pull/7987#discussion_r2227352892 ## parquet-variant/src/builder.rs: ## @@ -1216,24 +1245,46 @@ impl<'a> ListBuilder<'a> { /// Finalizes this list and appends it to its parent, which otherwise

Re: [PR] GH-46937 : [C++] Enable arrow::EqualOptions for arrow::Table [arrow]

2025-07-23 Thread via GitHub
kou commented on code in PR #47164: URL: https://github.com/apache/arrow/pull/47164#discussion_r2227324901 ## cpp/src/arrow/table.h: ## @@ -207,7 +207,8 @@ class ARROW_EXPORT Table { /// /// Two tables can be equal only if they have equal schemas. /// However, they may

Re: [I] [C++] Compilation Failure - SortAndMarkDuplicate lacks expected .Run() method [arrow]

2025-07-23 Thread via GitHub
kou commented on issue #47180: URL: https://github.com/apache/arrow/issues/47180#issuecomment-3111892979 Could you try 21.0.0?

Re: [PR] [Variant] Avoid extra buffer allocation in ListBuilder [arrow-rs]

2025-07-23 Thread via GitHub
klion26 commented on code in PR #7987: URL: https://github.com/apache/arrow-rs/pull/7987#discussion_r2227278027 ## parquet-variant/src/builder.rs: ## @@ -1242,7 +1294,18 @@ impl<'a> ListBuilder<'a> { /// This is to ensure that the list is always finalized before its parent bui

Re: [PR] [Variant] Avoid extra buffer allocation in ListBuilder [arrow-rs]

2025-07-23 Thread via GitHub
klion26 commented on PR #7987: URL: https://github.com/apache/arrow-rs/pull/7987#issuecomment-3111877503 @alamb @scovich @viirya, please help review this when you're free, thanks. I've created benchmarks for various implementations. The current implementation is the winner, and the al

[PR] [Variant] Avoid extra buffer allocation in ListBuilder [arrow-rs]

2025-07-23 Thread via GitHub
klion26 opened a new pull request, #7987: URL: https://github.com/apache/arrow-rs/pull/7987 This commit will reuse parent buffer for ListBuilder, so that it doesn't need to copy the buffer when finishing the builder. # Which issue does this PR close? We generally require a GitH

Re: [PR] Perf: Support partition_validity to use fast path for bit map scan [arrow-rs]

2025-07-23 Thread via GitHub
zhuqi-lucas commented on PR #7962: URL: https://github.com/apache/arrow-rs/pull/7962#issuecomment-3111853233 > sort i32 to indices 2^10 Good point @alamb, I tried increasing the length of the i32 input, and there is still no regression for this PR: ```rust sort i32 to indices 2^16

Re: [PR] feat(c/driver/postgresql): add test for composite type behavior [arrow-adbc]

2025-07-23 Thread via GitHub
lidavidm commented on PR #3196: URL: https://github.com/apache/arrow-adbc/pull/3196#issuecomment-3111847567 Hmm, maybe it isn't actually supported then...Or maybe you have to reconnect after creating the type to reload the type mapping

Re: [PR] Use `Vec` directly in builders [arrow-rs]

2025-07-23 Thread via GitHub
liamzwbao commented on code in PR #7984: URL: https://github.com/apache/arrow-rs/pull/7984#discussion_r2227242196 ## arrow-array/src/builder/primitive_builder.rs: ## @@ -296,7 +297,7 @@ impl PrimitiveBuilder { .expect("append_trusted_len_iter requires an upper bound

Re: [PR] Use `Vec` directly in builders [arrow-rs]

2025-07-23 Thread via GitHub
liamzwbao commented on PR #7984: URL: https://github.com/apache/arrow-rs/pull/7984#issuecomment-3111830811 Hi @alamb and @Dandandan, this PR is ready for review. I have changed the implementation to use `Vec` in a few builders. Not sure if some of them are not appropriate to migrate to this

Re: [I] [R][Release] CRAN packaging checklist for version 21.0.0 [arrow]

2025-07-23 Thread via GitHub
jonkeane commented on issue #46950: URL: https://github.com/apache/arrow/issues/46950#issuecomment-3111830210 Thanks for pushing this forward, y'all! Did one of you submit the package? I'm happy to confirm it if you did, but want to check before I do!

Re: [PR] Implement full-range `i256::to_f64` to eliminate ±∞ saturation for Decimal256 → Float64 casts [arrow-rs]

2025-07-23 Thread via GitHub
kosiew commented on PR #7986: URL: https://github.com/apache/arrow-rs/pull/7986#issuecomment-3111827789 @scovich Can you review this?

[PR] Implement full-range `i256::to_f64` to eliminate ±∞ saturation for Decimal256 → Float64 casts [arrow-rs]

2025-07-23 Thread via GitHub
kosiew opened a new pull request, #7986: URL: https://github.com/apache/arrow-rs/pull/7986 # Which issue does this PR close? Closes #7985 --- # Rationale for this change The existing Decimal256 → Float64 conversion was changed to **saturate** out-of-range

Re: [PR] docs: add arrow struct <> postgresql composite status [arrow-adbc]

2025-07-23 Thread via GitHub
amoeba commented on PR #3195: URL: https://github.com/apache/arrow-adbc/pull/3195#issuecomment-3111807929 Testing this in https://github.com/apache/arrow-adbc/pull/3196.

Re: [PR] feat(c/driver/postgresql): add test for composite type behavior [arrow-adbc]

2025-07-23 Thread via GitHub
amoeba commented on PR #3196: URL: https://github.com/apache/arrow-adbc/pull/3196#issuecomment-3111807590 Still working on this, currently this fails with, ``` /Users/bryce/src/apache/arrow-adbc/c/driver/postgresql/postgresql_test.cc:1143: Failure Expected equality of these val

[PR] feat(postgresql): add test for composite type behavior [arrow-adbc]

2025-07-23 Thread via GitHub
amoeba opened a new pull request, #3196: URL: https://github.com/apache/arrow-adbc/pull/3196 (no comment)

Re: [PR] docs: fix typo in python/adbc_driver_postgresql/README.md [arrow-adbc]

2025-07-23 Thread via GitHub
lidavidm merged PR #3194: URL: https://github.com/apache/arrow-adbc/pull/3194

[I] Implement full-range `i256::to_f64` to replace current ±∞ saturation for Decimal256 → Float64 [arrow-rs]

2025-07-23 Thread via GitHub
kosiew opened a new issue, #7985: URL: https://github.com/apache/arrow-rs/issues/7985 ## Background - **Current behavior:** the latest commit in #7887 removed the panic on overflow and now **saturates** any out-of-range `Decimal256` → `Float64` conversions to `f64::INFINITY` or `f64:
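For context, a quick Python check (Python integers stand in for `i256` here; this is not arrow-rs code) suggests why a full-range conversion is feasible: the largest signed 256-bit magnitude is about 5.8e76, far below the largest finite `f64`, so the raw integer itself never needs to saturate.

```python
import sys

I256_MAX = 2**255 - 1   # largest signed 256-bit integer
I256_MIN = -(2**255)    # smallest signed 256-bit integer

# Both extremes convert to finite doubles (rounded to the nearest representable value).
as_float_max = float(I256_MAX)
as_float_min = float(I256_MIN)

print(as_float_max)                       # roughly 5.79e76
print(as_float_max < sys.float_info.max)  # True
```

The conversion is lossy in the low bits, since a double carries only 53 bits of significand, but it stays finite across the whole range.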

Re: [PR] Fix panic on lossy decimal to float casting: round to saturation for overflows [arrow-rs]

2025-07-23 Thread via GitHub
kosiew commented on PR #7887: URL: https://github.com/apache/arrow-rs/pull/7887#issuecomment-3111763279 Thanks @alamb, @klion26, @scovich for your review comments

Re: [PR] docs: add arrow struct <> postgresql composite status [arrow-adbc]

2025-07-23 Thread via GitHub
amoeba commented on PR #3195: URL: https://github.com/apache/arrow-adbc/pull/3195#issuecomment-3111754930 Still need to test this.

Re: [I] [Statistics][C++] Implement Statistics specification attribute ARROW:distinct_count:approximate [arrow]

2025-07-23 Thread via GitHub
andishgar commented on issue #47101: URL: https://github.com/apache/arrow/issues/47101#issuecomment-3111753587 Thanks! I actually came up with an even simpler solution in my implementation.

[PR] docs: add arrow struct <> postgresql composite status [arrow-adbc]

2025-07-23 Thread via GitHub
amoeba opened a new pull request, #3195: URL: https://github.com/apache/arrow-adbc/pull/3195 (no comment)

Re: [PR] GH-45055: [C++][Flight] Update Flight Server RecordBatchStreamImpl to reuse ipc::RecordBatchWriter with custom IpcPayloadWriter instead of manually generating FlightPayload [arrow]

2025-07-23 Thread via GitHub
lidavidm commented on code in PR #47115: URL: https://github.com/apache/arrow/pull/47115#discussion_r2227169956 ## cpp/src/arrow/flight/sql/example/sqlite_tables_schema_batch_reader.cc: ## Review Comment: Thanks for tracking this down! ## cpp/src/arrow/flight/ser

Re: [PR] GH-47123: [Python] Add Enums to PyArrow Types [arrow]

2025-07-23 Thread via GitHub
lidavidm commented on code in PR #47139: URL: https://github.com/apache/arrow/pull/47139#discussion_r2227145360 ## python/pyarrow/types.py: ## @@ -46,6 +48,79 @@ lib.Type_STRUCT, lib.Type_MAP} | _UNION_TYPES +class TypesEnum(Enum): Review Comment: Hmm.

Re: [I] [Statistics][C++] Implement Statistics specification attribute ARROW:distinct_count:approximate [arrow]

2025-07-23 Thread via GitHub
kou commented on issue #47101: URL: https://github.com/apache/arrow/issues/47101#issuecomment-3111699602 I see. The following may work too: ```diff diff --git a/cpp/src/arrow/record_batch.cc b/cpp/src/arrow/record_batch.cc index 04d6890d39..42fd375b0f 100644 --- a/cpp/src/arro

[PR] docs: fix typo in python/adbc_driver_postgresql/README.md [arrow-adbc]

2025-07-23 Thread via GitHub
amoeba opened a new pull request, #3194: URL: https://github.com/apache/arrow-adbc/pull/3194 (no comment)

Re: [PR] GH-47123: [Python] Add Enums to PyArrow Types [arrow]

2025-07-23 Thread via GitHub
rmnskb commented on PR #47139: URL: https://github.com/apache/arrow/pull/47139#issuecomment-3111678300 > All the comments have been addressed. One more suggestion I have is that this enum could be added to the docs. For example to the API doc: > > https://github.com/apache/arrow/blob

Re: [PR] GH-47179: [Python] Revert FileSystem.from_uri to be a staticmethod again [arrow]

2025-07-23 Thread via GitHub
kou commented on PR #47178: URL: https://github.com/apache/arrow/pull/47178#issuecomment-3111623894 @kszucs Could you take a look at this?

[PR] Use Vec directly in generic_bytes_builder [arrow-rs]

2025-07-23 Thread via GitHub
liamzwbao opened a new pull request, #7984: URL: https://github.com/apache/arrow-rs/pull/7984 # Which issue does this PR close? We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an

[PR] Fix ShouldReturnEmptyPkFkResult_WorksAsExpected test case [arrow-adbc]

2025-07-23 Thread via GitHub
toddmeng-db opened a new pull request, #3193: URL: https://github.com/apache/arrow-adbc/pull/3193 This test does not pass, since the catalog from OpenSessionResponse is used as the catalog if the user does not explicitly set a catalog as a metadata request parameter

Re: [I] [C++] Define an official support policy [arrow]

2025-07-23 Thread via GitHub
amoeba commented on issue #46002: URL: https://github.com/apache/arrow/issues/46002#issuecomment-3111594886 This topic came up in https://github.com/apache/arrow/issues/47136 and I wanted to add a couple of notes from there: - Should all changes have a dedicated Issue and PR? Sometim

Re: [I] [Python] Pyarrow drop support for old Linux distros? [arrow]

2025-07-23 Thread via GitHub
amoeba commented on issue #47136: URL: https://github.com/apache/arrow/issues/47136#issuecomment-3111585261 Thanks @kou. I kinda derailed this issue a bit so I think we can discuss in https://github.com/apache/arrow/issues/46002. I'll add a comment there. @lscheilling it sounds like y

Re: [PR] feat(csharp/src/Drivers/Databricks): Use ArrowSchema for Response Schema [arrow-adbc]

2025-07-23 Thread via GitHub
toddmeng-db commented on code in PR #3140: URL: https://github.com/apache/arrow-adbc/pull/3140#discussion_r2227030266 ## csharp/src/Drivers/Apache/Hive2/HiveServer2Statement.cs: ## @@ -536,7 +546,7 @@ protected virtual async Task GetColumnsAsync(CancellationToken canc

Re: [PR] feat(go/adbc/driver/snowflake): Add support for table constraints when calling GetObjects [arrow-adbc]

2025-07-23 Thread via GitHub
vleslief-ms closed pull request #1455: feat(go/adbc/driver/snowflake): Add support for table constraints when calling GetObjects URL: https://github.com/apache/arrow-adbc/pull/1455

Re: [I] Composite Postgres types <-> Arrow structs [arrow-adbc]

2025-07-23 Thread via GitHub
lidavidm commented on issue #3190: URL: https://github.com/apache/arrow-adbc/issues/3190#issuecomment-3111565225 For ingest, there's no support currently. For reading, PostgreSQL RECORD types should become Arrow structs.

Re: [I] Python's `FileSystem.from_uri` changed to be a non-staticmethod [arrow]

2025-07-23 Thread via GitHub
ff-kamal commented on issue #47179: URL: https://github.com/apache/arrow/issues/47179#issuecomment-3111562011 Opened a PR at #47178 to fix this.

Re: [PR] GH-47179: [Python] Revert FileSystem.from_uri to be a staticmethod again [arrow]

2025-07-23 Thread via GitHub
github-actions[bot] commented on PR #47178: URL: https://github.com/apache/arrow/pull/47178#issuecomment-3111560978 :warning: GitHub issue #47179 **has been automatically assigned in GitHub** to PR creator.

Re: [I] 5000x performance degradation in RecordFromJSON vs NewJSONReader (yes really) [arrow-go]

2025-07-23 Thread via GitHub
caldempsey commented on issue #448: URL: https://github.com/apache/arrow-go/issues/448#issuecomment-3111557653 @zeroshade Great, that's helpful guidance thank you. Hopefully someone gets time to take a look at the Spark Connect side of things*, otherwise I will in a month or so. I updated m

Re: [I] 5000x performance degradation in RecordFromJSON vs NewJSONReader (yes really) [arrow-go]

2025-07-23 Thread via GitHub
caldempsey commented on issue #448: URL: https://github.com/apache/arrow-go/issues/448#issuecomment-3111553446 @zeroshade Great, that's helpful guidance thank you. Hopefully someone gets time to take a look, otherwise I will in a month or so.

Re: [I] 5000x performance degradation in RecordFromJSON vs NewJSONReader (yes really) [arrow-go]

2025-07-23 Thread via GitHub
caldempsey commented on issue #448: URL: https://github.com/apache/arrow-go/issues/448#issuecomment-3111552721 @zeroshade Great, that's helpful guidance thank you. Hopefully someone gets time to take a look, otherwise I will in a month or so.

Re: [I] 5000x performance degradation in RecordFromJSON vs NewJSONReader (yes really) [arrow-go]

2025-07-23 Thread via GitHub
zeroshade commented on issue #448: URL: https://github.com/apache/arrow-go/issues/448#issuecomment-3111531268 I'll hopefully be able to pinpoint the performance issue tomorrow. But only getting the number of rows passed to the WithChunk option feels like spark-connect-go isn't looping with

Re: [PR] Revert FileSystem.from_uri to be a staticmethod again [arrow]

2025-07-23 Thread via GitHub
github-actions[bot] commented on PR #47178: URL: https://github.com/apache/arrow/pull/47178#issuecomment-3111529703 Thanks for opening a pull request! If this is not a [minor PR](https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes). Could you open an issue f

[PR] Revert FileSystem.from_uri to be a staticmethod again [arrow]

2025-07-23 Thread via GitHub
ff-kamal opened a new pull request, #47178: URL: https://github.com/apache/arrow/pull/47178 ### Rationale for this change It seems as part of #45089 `FileSystem.from_uri` was changed from a static function to a regular method. I believe this to be an error unintentionally introduced
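To illustrate the distinction the PR describes, here is a plain-Python sketch (the class `DemoFileSystem` and its method names are hypothetical; this is not PyArrow code). A static method can be called on the class itself, whereas a regular method called that way binds its first argument to `self` and fails because the remaining argument is missing.

```python
class DemoFileSystem:
    """Hypothetical stand-in; not the real pyarrow.fs.FileSystem."""

    @staticmethod
    def from_uri_static(uri):
        # No instance needed: callable as DemoFileSystem.from_uri_static(uri).
        return uri.split("://", 1)

    def from_uri_regular(self, uri):
        # Requires an instance; calling it on the class binds `uri` to `self`.
        return uri.split("://", 1)


scheme, path = DemoFileSystem.from_uri_static("file:///tmp/data")

try:
    DemoFileSystem.from_uri_regular("file:///tmp/data")
    regular_call_failed = False
except TypeError:
    regular_call_failed = True
```

Code that calls such a method directly on the class, without an instance, would break if the decorator were dropped, which is the kind of regression the PR sets out to revert.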

Re: [I] 5000x performance degradation in RecordFromJSON vs NewJSONReader (yes really) [arrow-go]

2025-07-23 Thread via GitHub
caldempsey commented on issue #448: URL: https://github.com/apache/arrow-go/issues/448#issuecomment-3111522827 @zeroshade Hey I appreciate the fast responses. The `TableFromJSON` function actually returns one row out of 10K in my test when passed to a DataFrame, and I've updated **_that_**

Re: [I] 5000x performance degradation in RecordFromJSON vs NewJSONReader (yes really) [arrow-go]

2025-07-23 Thread via GitHub
caldempsey commented on issue #448: URL: https://github.com/apache/arrow-go/issues/448#issuecomment-3111510998 @zeroshade Hey I appreciate the fast responses. The `TableFromJSON` function actually returns one row out of 10K in my test. Yes the initial helper functions shouldn't have seen t

[PR] feat(csharp/test/Drivers/Databricks): Add mandatory token exchange [arrow-adbc]

2025-07-23 Thread via GitHub
alexguo-db opened a new pull request, #3192: URL: https://github.com/apache/arrow-adbc/pull/3192 ## Motivation Databricks will eventually require that all non-inhouse OAuth tokens be exchanged for Databricks OAuth tokens before accessing resources. This change implements mandatory to

Re: [I] 5000x performance degradation in RecordFromJSON vs NewJSONReader (yes really) [arrow-go]

2025-07-23 Thread via GitHub
caldempsey commented on issue #448: URL: https://github.com/apache/arrow-go/issues/448#issuecomment-3111275974 @zeroshade The TableFromJSON function actually returns one row out of 10K, so I don't think the problem is there

Re: [I] 5000x performance degradation in RecordFromJSON vs NewJSONReader (yes really) [arrow-go]

2025-07-23 Thread via GitHub
zeroshade commented on issue #448: URL: https://github.com/apache/arrow-go/issues/448#issuecomment-317385 I got waylaid by other things today but I plan on digging into this tomorrow.

Re: [I] 5000x performance degradation in RecordFromJSON vs NewJSONReader (yes really) [arrow-go]

2025-07-23 Thread via GitHub
zeroshade commented on issue #448: URL: https://github.com/apache/arrow-go/issues/448#issuecomment-3111082062 Looking at the issue you linked, the problem is the loop in the JSON path: ```go for rdr.Next() { rec := rdr.Record() if kept == nil { // intentionall

Re: [I] [Statistics][C++] Implement Statistics specification attribute ARROW:distinct_count:approximate [arrow]

2025-07-23 Thread via GitHub
andishgar commented on issue #47101: URL: https://github.com/apache/arrow/issues/47101#issuecomment-3110830173 >Is this part correct? ValueTypeSubset v1 = v; not ValueTypeSubset v1 = int64_t{1};? Regarding this, I mean first creating an object of type std::variant, and then assignin

Re: [PR] [Draft] implements Sum,sum_checked,min,max,is Distict,inverse for REE. [arrow-rs]

2025-07-23 Thread via GitHub
rich-t-kid-datadog commented on code in PR #7933: URL: https://github.com/apache/arrow-rs/pull/7933#discussion_r2226888468 ## arrow-ord/src/cmp.rs: ## @@ -224,6 +223,14 @@ fn compare_op(op: Op, lhs: &dyn Datum, rhs: &dyn Datum) -> Result

Re: [I] 5000x performance degradation in RecordFromJSON vs NewJSONReader (yes really) [arrow-go]

2025-07-23 Thread via GitHub
caldempsey commented on issue #448: URL: https://github.com/apache/arrow-go/issues/448#issuecomment-3110786681 @zeroshade Thanks! The weird thing is there's also a bug with the chunking in relation to Spark Connect. https://github.com/apache/spark-connect-go/issues/155. The r

Re: [I] [Statistics][C++] Implement Statistics specification attribute ARROW:distinct_count:approximate [arrow]

2025-07-23 Thread via GitHub
andishgar commented on issue #47101: URL: https://github.com/apache/arrow/issues/47101#issuecomment-3110786284 We need a syntax like [this](https://github.com/apache/arrow/blob/bb33493bd34dcd21d71b6b942203992e67f5ef3c/cpp/src/arrow/record_batch.cc#L571) if we want to treat `distinct_count `

Re: [I] 5000x performance degradation in RecordFromJSON vs NewJSONReader (yes really) [arrow-go]

2025-07-23 Thread via GitHub
caldempsey commented on issue #448: URL: https://github.com/apache/arrow-go/issues/448#issuecomment-3110739741 @zeroshade Thanks! The weird thing is there's also a bug with the chunking in relation to Spark Connect. https://github.com/apache/spark-connect-go/issues/155. The r

[I] [DISCUSS] Decouple IO and CPU operations in the Parquet Reader (push decoder?) [arrow-rs]

2025-07-23 Thread via GitHub
alamb opened a new issue, #7983: URL: https://github.com/apache/arrow-rs/issues/7983 (This is based on discussions with @crepererum and @XiangpengHao over the last few days) ## Is your feature request related to a problem or challenge? After working with the Parquet Reader, I

Re: [PR] GH-47167: [C++][Dev] Update clang-format dependency [arrow]

2025-07-23 Thread via GitHub
kou commented on PR #47168: URL: https://github.com/apache/arrow/pull/47168#issuecomment-3110282441 Supporting old pre-commit (system pre-commit) is for new contributors on Ubuntu 22.04. If we have enough documentation (or something) for new contributors, we can require more newer pr

Re: [PR] chore: Add rust-toolchain.toml to ensure consistent toolchain version [arrow-rs]

2025-07-23 Thread via GitHub
alamb commented on PR #7972: URL: https://github.com/apache/arrow-rs/pull/7972#issuecomment-3110235208 > > Hmm, looks like the CI jobs will need to be updated to account for the new toolchain somehow > > Yes, I see -- I can give it a try to modify the CI part. By the way, is there a

Re: [PR] Parquet filter pushdown v4 [arrow-rs]

2025-07-23 Thread via GitHub
alamb commented on PR #7850: URL: https://github.com/apache/arrow-rs/pull/7850#issuecomment-3110230266 Update here: I have hooked up a configuration option, and tests in a PR: - https://github.com/XiangpengHao/arrow-rs/pull/7 The tests are failing because the sync reader does

Re: [I] [Python] Pyarrow drop support for old Linux distros? [arrow]

2025-07-23 Thread via GitHub
kou commented on issue #47136: URL: https://github.com/apache/arrow/issues/47136#issuecomment-3110231222 In general, I agree with @amoeba's suggestion. But I want to drop support for EOL-ed platforms without announcing in N-X releases. Because we may forget to announce it... If we for

Re: [I] [Statistics][C++] Implement Statistics specification attribute ARROW:distinct_count:approximate [arrow]

2025-07-23 Thread via GitHub
kou commented on issue #47101: URL: https://github.com/apache/arrow/issues/47101#issuecomment-3110199079 We don't want to use Boost as much as possible... It's difficult to maintain... ```cpp ValueType v; ValueTypeSubset v1 = int64_t{1}; // Compile-time error

Re: [PR] GH-47052: [CI] Use Alpine Linux 3.20 instead of 3.18 [arrow]

2025-07-23 Thread via GitHub
kou commented on PR #47148: URL: https://github.com/apache/arrow/pull/47148#issuecomment-3110187913 https://github.com/apache/arrow/actions/runs/16442827719/job/46467431083#step:6:4162 ```text 70/104 Test #72: arrow-dataset-file-orc-test ..***Failed 11.12 sec

Re: [PR] GH-45382: [Python] Add support for pandas DataFrame.attrs [arrow]

2025-07-23 Thread via GitHub
rmnskb commented on PR #47147: URL: https://github.com/apache/arrow/pull/47147#issuecomment-3110158435 > One general comment on the testing: I think we can take use of `_check_pandas_roundtrip` utility and move the tests under the `TestConvertMetadata` class adding assert for pandas attribu

Re: [PR] Minor: Restore warning comment on Int96 statistics read [arrow-rs]

2025-07-23 Thread via GitHub
alamb merged PR #7975: URL: https://github.com/apache/arrow-rs/pull/7975

Re: [PR] Minor: Restore warning comment on Int96 statistics read [arrow-rs]

2025-07-23 Thread via GitHub
alamb commented on PR #7975: URL: https://github.com/apache/arrow-rs/pull/7975#issuecomment-3110127588 Thank you @emkornfield and @mbutrovich -- since I do not think this PR is controversial (it puts back a comment that was previously present) I will just merge it now without waiting for a

Re: [I] Loading of feather file created with recent pandas fails in JS arrow lib [arrow-js]

2025-07-23 Thread via GitHub
kou commented on issue #99: URL: https://github.com/apache/arrow-js/issues/99#issuecomment-3110112600 This is caused by the missing record batch compression implementation. #109 is an issue for the feature. So I'm closing this as a duplicate.

Re: [I] [C++] Nightly verification jobs fail on Ubuntu 24.04 to build byte_stream_split_internal.cc due to XSIMD failure [arrow]

2025-07-23 Thread via GitHub
kou commented on issue #47175: URL: https://github.com/apache/arrow/issues/47175#issuecomment-3110080131 Ah, we should have changed required xsimd version in #46963. ```diff diff --git a/cpp/cmake_modules/ThirdpartyToolchain.cmake b/cpp/cmake_modules/ThirdpartyToolchain.cmake in

Re: [PR] fix(csharp/src/Drivers/BigQuery): Adjust default dataset id [arrow-adbc]

2025-07-23 Thread via GitHub
CurtHagenlocher merged PR #3187: URL: https://github.com/apache/arrow-adbc/pull/3187

Re: [PR] GH-47016: [C++][FlightSQL] Fix negative timestamps to date types [arrow]

2025-07-23 Thread via GitHub
alinaliBQ commented on code in PR #47017: URL: https://github.com/apache/arrow/pull/47017#discussion_r2226232883 ## cpp/src/arrow/flight/sql/odbc/flight_sql/accessors/date_array_accessor_test.cc: ## @@ -60,19 +64,22 @@ TEST(DateArrayAccessor, Test_Date32Array_CDataType_DATE) {

Re: [D] Is there any advantage to preallocate memory with Builder::with_capacity? [arrow-rs]

2025-07-23 Thread via GitHub
GitHub user surister closed the discussion with a comment: Is there any advantage to preallocate memory with Builder::with_capacity? Actually, the number of values I was pushing to the array builders was always the `allocated` +1 due to another issue, I guess that triggers a new allocation. Wi
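
The effect described in this comment, one push beyond the preallocated capacity forcing a new allocation, can be sketched with the standard library's `Vec` as a stand-in. This is only an illustration under that assumption: it does not use arrow-rs builders, and `capacities_after_pushes` is a hypothetical helper name, not something from the discussion.

```rust
/// Hypothetical helper: preallocates a Vec, fills it exactly to its
/// capacity, then pushes one extra value.
/// Returns (initial capacity, capacity when full, capacity after the extra push).
fn capacities_after_pushes(requested: usize) -> (usize, usize, usize) {
    // `with_capacity` guarantees room for at least `requested` elements.
    let mut values: Vec<i64> = Vec::with_capacity(requested);
    let initial = values.capacity();

    // Filling up to the reserved capacity never reallocates.
    for i in 0..initial {
        values.push(i as i64);
    }
    let when_full = values.capacity();

    // One element past the reserved capacity forces the buffer to grow.
    values.push(-1);
    let after_extra = values.capacity();

    (initial, when_full, after_extra)
}

fn main() {
    let (initial, when_full, after_extra) = capacities_after_pushes(1024);
    println!(
        "initial={} when_full={} after_extra={}",
        initial, when_full, after_extra
    );
}
```

If the element count is known up front, reserving exactly that count (rather than one fewer) avoids the extra growth step; whether arrow-rs builders grow in the same way is not confirmed by this thread.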

Re: [D] Is there any advantage to preallocate memory with Builder::with_capacity? [arrow-rs]

2025-07-23 Thread via GitHub
GitHub user surister closed a discussion: Is there any advantage to preallocate memory with Builder::with_capacity? Hi, I'm consuming async iterators that fetch data in row format and then transpose it to arrays with `ArrayBuilder`s, I know the count beforehand so I thought that I could decre

Re: [PR] GH-47137: [PYTHON] Switch to `[dependency-groups]` [arrow]

2025-07-23 Thread via GitHub
github-actions[bot] commented on PR #47176: URL: https://github.com/apache/arrow/pull/47176#issuecomment-3109520437 :warning: GitHub issue #47137 **has been automatically assigned in GitHub** to PR creator.

[PR] GH-47137: [PYTHON] Switch to `[dependency-groups]` [arrow]

2025-07-23 Thread via GitHub
paddyroddy opened a new pull request, #47176: URL: https://github.com/apache/arrow/pull/47176 ### Rationale for this change Modern packaging requires non-user-facing dependencies to be placed in `dependency-groups`, as per [PEP 735](https://peps.python.org/pep-0735/). ### What chang

[PR] [WIP] Implement Telemetry Reporting for Databricks [arrow-adbc]

2025-07-23 Thread via GitHub
jeremytang-db opened a new pull request, #3191: URL: https://github.com/apache/arrow-adbc/pull/3191 Created Telemetry models to fit Telemetry data in required format. Uses ActivityListener to collect activities.

Re: [PR] Perf: optimize actual_buffer_size to use only data buffer capacity for coalesce [arrow-rs]

2025-07-23 Thread via GitHub
alamb commented on PR #7967: URL: https://github.com/apache/arrow-rs/pull/7967#issuecomment-310945 Thanks again @zhuqi-lucas

Re: [PR] Perf: optimize actual_buffer_size to use only data buffer capacity for coalesce [arrow-rs]

2025-07-23 Thread via GitHub
alamb merged PR #7967: URL: https://github.com/apache/arrow-rs/pull/7967

Re: [PR] Perf: Support partition_validity to use fast path for bit map scan [arrow-rs]

2025-07-23 Thread via GitHub
alamb commented on PR #7962: URL: https://github.com/apache/arrow-rs/pull/7962#issuecomment-3109426806 Maybe we need to crank up the test somehow -- trying to measure changes in `usec` may be too subject to noise 🤔

Re: [PR] chore: Add rust-toolchain.toml to ensure consistent toolchain version [arrow-rs]

2025-07-23 Thread via GitHub
EricccTaiwan commented on PR #7972: URL: https://github.com/apache/arrow-rs/pull/7972#issuecomment-3109308802 > Hmm, looks like the CI jobs will need to be updated to account for the new toolchain somehow Yes, I see -- I can give it a try to modify the CI part. By the way, is there

Re: [PR] chore: Add rust-toolchain.toml to ensure consistent toolchain version [arrow-rs]

2025-07-23 Thread via GitHub
alamb commented on PR #7972: URL: https://github.com/apache/arrow-rs/pull/7972#issuecomment-3109257616 Hmm, looks like the CI jobs will need to be updated to account for the new toolchain somehow

Re: [I] 5000x performance degradation in RecordFromJSON vs NewJSONReader (yes really) [arrow-go]

2025-07-23 Thread via GitHub
zeroshade commented on issue #448: URL: https://github.com/apache/arrow-go/issues/448#issuecomment-3109232100 Wow. Well, that's definitely surprising. I'm gonna take a look and see if I can isolate the cause of the difference in performance as there shouldn't really be a difference in perfo

[I] Field metadata for dictionary-encoded extension types is lost [arrow-rs]

2025-07-23 Thread via GitHub
paleolimbot opened a new issue, #7982: URL: https://github.com/apache/arrow-rs/issues/7982 **Describe the bug** The representation of Dictionary in the data types enum seems to exclude field metadata, so extension types are dropped when they go through arrow-rs structures: ht

Re: [PR] feat: Implement IPC RecordBatch body buffer compression [arrow-js]

2025-07-23 Thread via GitHub
brancz commented on PR #14: URL: https://github.com/apache/arrow-js/pull/14#issuecomment-3109185600 We would love to see lz4 support in arrow-js. @westonpace @trxcllnt any chance you could give this another review?

Re: [PR] feat: Implement IPC RecordBatch body buffer compression [arrow-js]

2025-07-23 Thread via GitHub
brancz commented on code in PR #14: URL: https://github.com/apache/arrow-js/pull/14#discussion_r2226003816 ## src/ipc/writer.ts: ## @@ -251,34 +274,99 @@ export class RecordBatchWriter extends ReadableInterop< } protected _writeRecordBatch(batch: RecordBatch) { -

Re: [I] [C++] Nightly verification jobs fail on Ubuntu 24.04 to build byte_stream_split_internal.cc due to XSIMD failure [arrow]

2025-07-23 Thread via GitHub
raulcd commented on issue #47175: URL: https://github.com/apache/arrow/issues/47175#issuecomment-3109163185 It seems we might have to use `xsimd_SOURCE=BUNDLED` as Ubuntu 24.04 uses [libxsimd-dev 12.1.1](https://packages.ubuntu.com/search?suite=noble&arch=any&searchon=names&keywords=xsimd-d

Re: [I] [C++] Nightly verification jobs fail on Ubuntu 24.04 to build byte_stream_split_internal.cc due to XSIMD failure [arrow]

2025-07-23 Thread via GitHub
raulcd commented on issue #47175: URL: https://github.com/apache/arrow/issues/47175#issuecomment-3109099190 cc @AntoinePrv

Re: [PR] TESTING: Use standard BitIndexIterator instead of specialized u32 iterator [arrow-rs]

2025-07-23 Thread via GitHub
zhuqi-lucas commented on PR #7979: URL: https://github.com/apache/arrow-rs/pull/7979#issuecomment-3109047271 The u32 is: ```rust sort f32 nulls to indices 2^12 1.00 39.7±0.10µs ? ?/sec1.37 54.5±0.17µs? ?/sec ``` This PR is:
