Re: [PR] feat: implement BinaryRow deserialization for partition bytes [paimon-rust]

2026-03-15 Thread via GitHub


JingsongLi merged PR #133:
URL: https://github.com/apache/paimon-rust/pull/133


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: implement BinaryRow deserialization for partition bytes [paimon-rust]

2026-03-15 Thread via GitHub


QuakeWang commented on code in PR #133:
URL: https://github.com/apache/paimon-rust/pull/133#discussion_r2938214535


##
crates/paimon/src/spec/data_file.rs:
##
@@ -24,30 +24,203 @@ use std::fmt::{Display, Formatter};
 
 pub const EMPTY_BINARY_ROW: BinaryRow = BinaryRow::new(0);
 
-/// An implementation of InternalRow.
+/// Highest bit mask for detecting inline vs variable-length encoding.
+///
+/// If the highest bit of the 8-byte fixed-part value is 1, the data is stored
+/// inline (≤7 bytes). If 0, the data is in the variable-length part.
+///
+/// Reference: `BinarySection.HIGHEST_FIRST_BIT` in Java Paimon.
+const HIGHEST_FIRST_BIT: u64 = 0x80 << 56;
+
+/// Mask to extract the 7-bit length from an inline-encoded value.
+///
+/// Reference: `BinarySection.HIGHEST_SECOND_TO_EIGHTH_BIT` in Java Paimon.
+const HIGHEST_SECOND_TO_EIGHTH_BIT: u64 = 0x7F << 56;
+
+/// An implementation of InternalRow backed by raw binary bytes.
+///
+/// Binary layout (little-endian):
+/// ```text
+/// | header (8 bytes) | null bit set (8-byte aligned) | fixed-length (8B per 
field) | variable-length |
+/// ```
+///
+/// - **Header**: byte 0 = RowKind, bytes 1-7 reserved.
+/// - **Null bit set**: starts at bit index 8 (skip header bits), 1 bit per 
field, padded to 8-byte boundary.
+/// - **Fixed-length part**: 8 bytes per field. Primitives stored directly; 
variable-length types store offset+length.
+/// - **Variable-length part**: String / Binary data referenced by 
offset+length from fixed part.
 ///
 /// Impl Reference: 

-#[derive(Debug, Eq, PartialEq, Serialize, Deserialize)]
+#[derive(Debug, Clone, Eq, PartialEq, Serialize, Deserialize)]
 #[serde(rename_all = "camelCase")]
 pub struct BinaryRow {
 arity: i32,
 null_bits_size_in_bytes: i32,
+
+/// Raw binary data backing this row. Empty when constructed via `new()`.
+/// Populated via `from_bytes()` for partition data from manifest entries.
+#[serde(skip)]

Review Comment:
   > Currently BinaryRow's Serialize/Deserialize is not used anywhere — 
DataSplit only derives Debug, and DataFileMeta._MIN_KEY/_MAX_KEY are Vec, not 
BinaryRow.
   > 
   > The #[serde(skip)] on data is problematic: if we later need DataSplit 
(which contains BinaryRow) to be serializable (e.g. for distributed task 
dispatch), the data field must participate in serde — otherwise the receiver 
gets an empty partition. Skipping it now risks a subtle bug later.
   > 
   > I'd suggest either:
   > 
   > Remove Serialize/Deserialize and #[serde(skip)] entirely since nothing 
uses them today, or Keep Serialize/Deserialize but remove #[serde(skip)] (use 
#[serde(with = "serde_bytes")] instead) so that when DataSplit eventually 
derives Serialize/Deserialize, BinaryRow is already correct and we don't forget 
to fix this.
   
   Good point, thanks for the review. I adopted option B by replacing 
`#[serde(skip)` with `#[serde(with = "serde_bytes")]`, so `BinaryRow.data` is 
preserved in Rust serde roundtrips.
   
   I also added test_serde_roundtrip_populated to verify that a populated 
BinaryRow roundtrips with its backing bytes intact, in addition to the 
empty-row case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: implement BinaryRow deserialization for partition bytes [paimon-rust]

2026-03-15 Thread via GitHub


QuakeWang commented on code in PR #133:
URL: https://github.com/apache/paimon-rust/pull/133#discussion_r2937976765


##
crates/paimon/src/spec/data_file.rs:
##
@@ -24,30 +24,203 @@ use std::fmt::{Display, Formatter};
 
 pub const EMPTY_BINARY_ROW: BinaryRow = BinaryRow::new(0);
 
-/// An implementation of InternalRow.
+/// Highest bit mask for detecting inline vs variable-length encoding.
+///
+/// If the highest bit of the 8-byte fixed-part value is 1, the data is stored
+/// inline (≤7 bytes). If 0, the data is in the variable-length part.
+///
+/// Reference: `BinarySection.HIGHEST_FIRST_BIT` in Java Paimon.
+const HIGHEST_FIRST_BIT: u64 = 0x80 << 56;
+
+/// Mask to extract the 7-bit length from an inline-encoded value.
+///
+/// Reference: `BinarySection.HIGHEST_SECOND_TO_EIGHTH_BIT` in Java Paimon.
+const HIGHEST_SECOND_TO_EIGHTH_BIT: u64 = 0x7F << 56;
+
+/// An implementation of InternalRow backed by raw binary bytes.
+///
+/// Binary layout (little-endian):
+/// ```text
+/// | header (8 bytes) | null bit set (8-byte aligned) | fixed-length (8B per 
field) | variable-length |
+/// ```
+///
+/// - **Header**: byte 0 = RowKind, bytes 1-7 reserved.
+/// - **Null bit set**: starts at bit index 8 (skip header bits), 1 bit per 
field, padded to 8-byte boundary.
+/// - **Fixed-length part**: 8 bytes per field. Primitives stored directly; 
variable-length types store offset+length.
+/// - **Variable-length part**: String / Binary data referenced by 
offset+length from fixed part.
 ///
 /// Impl Reference: 

-#[derive(Debug, Eq, PartialEq, Serialize, Deserialize)]
+#[derive(Debug, Clone, Eq, PartialEq, Serialize, Deserialize)]
 #[serde(rename_all = "camelCase")]
 pub struct BinaryRow {
 arity: i32,
 null_bits_size_in_bytes: i32,
+
+/// Raw binary data backing this row. Empty when constructed via `new()`.
+/// Populated via `from_bytes()` for partition data from manifest entries.
+#[serde(skip)]

Review Comment:
   > won't `SerializationUtils.serializeBinaryRow` serialize the data bytes?
   
   Yes, `SerializationUtils.serializeBinaryRow` does serialize the full row 
bytes (including the 4-byte arity prefix), but that is a separate binary codec 
used to fill a parent bytes field such as `_PARTITION`.
   
   On the Rust side, `_PARTITION` is first deserialized into 
`ManifestEntry.partition: Vec`, and only then converted to `BinaryRow` via 
`BinaryRow::from_bytes(...)` at runtime.
   
   `#[serde(skip)]` only affects `BinaryRow`'s own derived 
`Serialize`/`Deserialize`. I added it so introducing this runtime backing 
buffer does not change BinaryRow's current serde shape or the existing 
empty-row roundtrip behavior. If we ever need serde to roundtrip a populated 
`BinaryRow`, we should switch to a dedicated custom serializer instead of 
relying on the derived struct serde.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: implement BinaryRow deserialization for partition bytes [paimon-rust]

2026-03-15 Thread via GitHub


luoyuxia commented on code in PR #133:
URL: https://github.com/apache/paimon-rust/pull/133#discussion_r2938009613


##
crates/paimon/src/spec/data_file.rs:
##
@@ -24,30 +24,203 @@ use std::fmt::{Display, Formatter};
 
 pub const EMPTY_BINARY_ROW: BinaryRow = BinaryRow::new(0);
 
-/// An implementation of InternalRow.
+/// Highest bit mask for detecting inline vs variable-length encoding.
+///
+/// If the highest bit of the 8-byte fixed-part value is 1, the data is stored
+/// inline (≤7 bytes). If 0, the data is in the variable-length part.
+///
+/// Reference: `BinarySection.HIGHEST_FIRST_BIT` in Java Paimon.
+const HIGHEST_FIRST_BIT: u64 = 0x80 << 56;
+
+/// Mask to extract the 7-bit length from an inline-encoded value.
+///
+/// Reference: `BinarySection.HIGHEST_SECOND_TO_EIGHTH_BIT` in Java Paimon.
+const HIGHEST_SECOND_TO_EIGHTH_BIT: u64 = 0x7F << 56;
+
+/// An implementation of InternalRow backed by raw binary bytes.
+///
+/// Binary layout (little-endian):
+/// ```text
+/// | header (8 bytes) | null bit set (8-byte aligned) | fixed-length (8B per 
field) | variable-length |
+/// ```
+///
+/// - **Header**: byte 0 = RowKind, bytes 1-7 reserved.
+/// - **Null bit set**: starts at bit index 8 (skip header bits), 1 bit per 
field, padded to 8-byte boundary.
+/// - **Fixed-length part**: 8 bytes per field. Primitives stored directly; 
variable-length types store offset+length.
+/// - **Variable-length part**: String / Binary data referenced by 
offset+length from fixed part.
 ///
 /// Impl Reference: 

-#[derive(Debug, Eq, PartialEq, Serialize, Deserialize)]
+#[derive(Debug, Clone, Eq, PartialEq, Serialize, Deserialize)]
 #[serde(rename_all = "camelCase")]
 pub struct BinaryRow {
 arity: i32,
 null_bits_size_in_bytes: i32,
+
+/// Raw binary data backing this row. Empty when constructed via `new()`.
+/// Populated via `from_bytes()` for partition data from manifest entries.
+#[serde(skip)]

Review Comment:
   Currently BinaryRow's Serialize/Deserialize is not used anywhere — DataSplit 
only derives Debug, and DataFileMeta._MIN_KEY/_MAX_KEY are Vec, not 
BinaryRow.
   
   The #[serde(skip)] on data is problematic: if we later need DataSplit (which 
contains BinaryRow) to be serializable (e.g. for distributed task dispatch), 
the data field must participate in serde — otherwise the receiver gets an empty 
partition. Skipping it now risks a subtle bug later.
   
   I'd suggest either:
   
   Remove Serialize/Deserialize and #[serde(skip)] entirely since nothing uses 
them today, or
   Keep Serialize/Deserialize but remove #[serde(skip)] (use #[serde(with = 
"serde_bytes")] instead) so that when DataSplit eventually derives 
Serialize/Deserialize, BinaryRow is already correct and we don't forget to fix 
this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: implement BinaryRow deserialization for partition bytes [paimon-rust]

2026-03-15 Thread via GitHub


QuakeWang commented on code in PR #133:
URL: https://github.com/apache/paimon-rust/pull/133#discussion_r2937976765


##
crates/paimon/src/spec/data_file.rs:
##
@@ -24,30 +24,203 @@ use std::fmt::{Display, Formatter};
 
 pub const EMPTY_BINARY_ROW: BinaryRow = BinaryRow::new(0);
 
-/// An implementation of InternalRow.
+/// Highest bit mask for detecting inline vs variable-length encoding.
+///
+/// If the highest bit of the 8-byte fixed-part value is 1, the data is stored
+/// inline (≤7 bytes). If 0, the data is in the variable-length part.
+///
+/// Reference: `BinarySection.HIGHEST_FIRST_BIT` in Java Paimon.
+const HIGHEST_FIRST_BIT: u64 = 0x80 << 56;
+
+/// Mask to extract the 7-bit length from an inline-encoded value.
+///
+/// Reference: `BinarySection.HIGHEST_SECOND_TO_EIGHTH_BIT` in Java Paimon.
+const HIGHEST_SECOND_TO_EIGHTH_BIT: u64 = 0x7F << 56;
+
+/// An implementation of InternalRow backed by raw binary bytes.
+///
+/// Binary layout (little-endian):
+/// ```text
+/// | header (8 bytes) | null bit set (8-byte aligned) | fixed-length (8B per 
field) | variable-length |
+/// ```
+///
+/// - **Header**: byte 0 = RowKind, bytes 1-7 reserved.
+/// - **Null bit set**: starts at bit index 8 (skip header bits), 1 bit per 
field, padded to 8-byte boundary.
+/// - **Fixed-length part**: 8 bytes per field. Primitives stored directly; 
variable-length types store offset+length.
+/// - **Variable-length part**: String / Binary data referenced by 
offset+length from fixed part.
 ///
 /// Impl Reference: 

-#[derive(Debug, Eq, PartialEq, Serialize, Deserialize)]
+#[derive(Debug, Clone, Eq, PartialEq, Serialize, Deserialize)]
 #[serde(rename_all = "camelCase")]
 pub struct BinaryRow {
 arity: i32,
 null_bits_size_in_bytes: i32,
+
+/// Raw binary data backing this row. Empty when constructed via `new()`.
+/// Populated via `from_bytes()` for partition data from manifest entries.
+#[serde(skip)]

Review Comment:
   > won't `SerializationUtils.serializeBinaryRow` serialize the data bytes?
   
   Yes, serializeBinaryRow does serialize the full data bytes — but it produces 
a flat `byte[]` stored in the `_PARTITION` (or `_MIN_KEY`/`_MAX_KEY`) 
**bytes-typed** Avro field. It is not part of BinaryRow's own struct-level 
serde.
   
   On the Rust side, when we deserialize the Avro manifest, `_PARTITION` lands 
as `Vec` in `ManifestEntry.partition`. We then call 
`BinaryRow::from_bytes(arity, data)` to interpret those bytes at runtime. The 
`#[serde(skip)]` on data only affects BinaryRow's own Serialize/Deserialize 
derive (used for the EMPTY_BINARY_ROW stub), keeping it backward-compatible — 
only arity and null_bits_size_in_bytes participate in serde, matching the 
existing contract.
   
   In short: serializeBinaryRow is a standalone binary codec that feeds into a 
parent struct's bytes field; it's a separate concern from BinaryRow's 
struct-level `#[derive(Serialize, Deserialize)]`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: implement BinaryRow deserialization for partition bytes [paimon-rust]

2026-03-15 Thread via GitHub


luoyuxia commented on code in PR #133:
URL: https://github.com/apache/paimon-rust/pull/133#discussion_r2937936478


##
crates/paimon/src/spec/data_file.rs:
##
@@ -24,30 +24,203 @@ use std::fmt::{Display, Formatter};
 
 pub const EMPTY_BINARY_ROW: BinaryRow = BinaryRow::new(0);
 
-/// An implementation of InternalRow.
+/// Highest bit mask for detecting inline vs variable-length encoding.
+///
+/// If the highest bit of the 8-byte fixed-part value is 1, the data is stored
+/// inline (≤7 bytes). If 0, the data is in the variable-length part.
+///
+/// Reference: `BinarySection.HIGHEST_FIRST_BIT` in Java Paimon.
+const HIGHEST_FIRST_BIT: u64 = 0x80 << 56;
+
+/// Mask to extract the 7-bit length from an inline-encoded value.
+///
+/// Reference: `BinarySection.HIGHEST_SECOND_TO_EIGHTH_BIT` in Java Paimon.
+const HIGHEST_SECOND_TO_EIGHTH_BIT: u64 = 0x7F << 56;
+
+/// An implementation of InternalRow backed by raw binary bytes.
+///
+/// Binary layout (little-endian):
+/// ```text
+/// | header (8 bytes) | null bit set (8-byte aligned) | fixed-length (8B per 
field) | variable-length |
+/// ```
+///
+/// - **Header**: byte 0 = RowKind, bytes 1-7 reserved.
+/// - **Null bit set**: starts at bit index 8 (skip header bits), 1 bit per 
field, padded to 8-byte boundary.
+/// - **Fixed-length part**: 8 bytes per field. Primitives stored directly; 
variable-length types store offset+length.
+/// - **Variable-length part**: String / Binary data referenced by 
offset+length from fixed part.
 ///
 /// Impl Reference: 

-#[derive(Debug, Eq, PartialEq, Serialize, Deserialize)]
+#[derive(Debug, Clone, Eq, PartialEq, Serialize, Deserialize)]
 #[serde(rename_all = "camelCase")]
 pub struct BinaryRow {
 arity: i32,
 null_bits_size_in_bytes: i32,
+
+/// Raw binary data backing this row. Empty when constructed via `new()`.
+/// Populated via `from_bytes()` for partition data from manifest entries.
+#[serde(skip)]

Review Comment:
   won't `SerializationUtils.serializeBinaryRow` serialize the data bytes?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: implement BinaryRow deserialization for partition bytes [paimon-rust]

2026-03-15 Thread via GitHub


QuakeWang commented on code in PR #133:
URL: https://github.com/apache/paimon-rust/pull/133#discussion_r2936684166


##
crates/paimon/src/spec/data_file.rs:
##
@@ -24,30 +24,203 @@ use std::fmt::{Display, Formatter};
 
 pub const EMPTY_BINARY_ROW: BinaryRow = BinaryRow::new(0);
 
-/// An implementation of InternalRow.
+/// Highest bit mask for detecting inline vs variable-length encoding.
+///
+/// If the highest bit of the 8-byte fixed-part value is 1, the data is stored
+/// inline (≤7 bytes). If 0, the data is in the variable-length part.
+///
+/// Reference: `BinarySection.HIGHEST_FIRST_BIT` in Java Paimon.
+const HIGHEST_FIRST_BIT: u64 = 0x80 << 56;
+
+/// Mask to extract the 7-bit length from an inline-encoded value.
+///
+/// Reference: `BinarySection.HIGHEST_SECOND_TO_EIGHTH_BIT` in Java Paimon.
+const HIGHEST_SECOND_TO_EIGHTH_BIT: u64 = 0x7F << 56;
+
+/// An implementation of InternalRow backed by raw binary bytes.
+///
+/// Binary layout (little-endian):
+/// ```text
+/// | header (8 bytes) | null bit set (8-byte aligned) | fixed-length (8B per 
field) | variable-length |
+/// ```
+///
+/// - **Header**: byte 0 = RowKind, bytes 1-7 reserved.
+/// - **Null bit set**: starts at bit index 8 (skip header bits), 1 bit per 
field, padded to 8-byte boundary.
+/// - **Fixed-length part**: 8 bytes per field. Primitives stored directly; 
variable-length types store offset+length.
+/// - **Variable-length part**: String / Binary data referenced by 
offset+length from fixed part.
 ///
 /// Impl Reference: 

-#[derive(Debug, Eq, PartialEq, Serialize, Deserialize)]
+#[derive(Debug, Clone, Eq, PartialEq, Serialize, Deserialize)]
 #[serde(rename_all = "camelCase")]
 pub struct BinaryRow {
 arity: i32,
 null_bits_size_in_bytes: i32,
+
+/// Raw binary data backing this row. Empty when constructed via `new()`.
+/// Populated via `from_bytes()` for partition data from manifest entries.
+#[serde(skip)]

Review Comment:
   > why skip serde for this field?
   
   The data field is populated at runtime via `BinaryRow::from_bytes()`, not 
through serde. In the Avro manifest, `_PARTITION` is stored as raw bytes (with 
a 4-byte arity prefix, per Java's `SerializationUtils.serializeBinaryRow`). The 
`#[serde(skip)]` is here to avoid breaking existing JSON serde for 
`DataFileMeta._MIN_KEY`/`_MAX_KEY`, which only serializes arity and 
`null_bits_size_in_bytes`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: implement BinaryRow deserialization for partition bytes [paimon-rust]

2026-03-15 Thread via GitHub


luoyuxia commented on code in PR #133:
URL: https://github.com/apache/paimon-rust/pull/133#discussion_r2936500395


##
crates/paimon/src/spec/data_file.rs:
##
@@ -24,30 +24,203 @@ use std::fmt::{Display, Formatter};
 
 pub const EMPTY_BINARY_ROW: BinaryRow = BinaryRow::new(0);
 
-/// An implementation of InternalRow.
+/// Highest bit mask for detecting inline vs variable-length encoding.
+///
+/// If the highest bit of the 8-byte fixed-part value is 1, the data is stored
+/// inline (≤7 bytes). If 0, the data is in the variable-length part.
+///
+/// Reference: `BinarySection.HIGHEST_FIRST_BIT` in Java Paimon.
+const HIGHEST_FIRST_BIT: u64 = 0x80 << 56;
+
+/// Mask to extract the 7-bit length from an inline-encoded value.
+///
+/// Reference: `BinarySection.HIGHEST_SECOND_TO_EIGHTH_BIT` in Java Paimon.
+const HIGHEST_SECOND_TO_EIGHTH_BIT: u64 = 0x7F << 56;
+
+/// An implementation of InternalRow backed by raw binary bytes.
+///
+/// Binary layout (little-endian):
+/// ```text
+/// | header (8 bytes) | null bit set (8-byte aligned) | fixed-length (8B per 
field) | variable-length |
+/// ```
+///
+/// - **Header**: byte 0 = RowKind, bytes 1-7 reserved.
+/// - **Null bit set**: starts at bit index 8 (skip header bits), 1 bit per 
field, padded to 8-byte boundary.
+/// - **Fixed-length part**: 8 bytes per field. Primitives stored directly; 
variable-length types store offset+length.
+/// - **Variable-length part**: String / Binary data referenced by 
offset+length from fixed part.
 ///
 /// Impl Reference: 

-#[derive(Debug, Eq, PartialEq, Serialize, Deserialize)]
+#[derive(Debug, Clone, Eq, PartialEq, Serialize, Deserialize)]
 #[serde(rename_all = "camelCase")]
 pub struct BinaryRow {
 arity: i32,
 null_bits_size_in_bytes: i32,
+
+/// Raw binary data backing this row. Empty when constructed via `new()`.
+/// Populated via `from_bytes()` for partition data from manifest entries.
+#[serde(skip)]

Review Comment:
   why skip serde for this field?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [PR] feat: implement BinaryRow deserialization for partition bytes [paimon-rust]

2026-03-14 Thread via GitHub


QuakeWang commented on PR #133:
URL: https://github.com/apache/paimon-rust/pull/133#issuecomment-4060310833

   @luoyuxia @JingsongLi PTAL 👀


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]