Re: [PR] feat: support cpcsketch serde (datasketches-rust)
ZENOTME commented on code in PR #84:
URL: https://github.com/apache/datasketches-rust/pull/84#discussion_r2778811362
##
datasketches/src/hash/mod.rs:
##
@@ -37,6 +37,19 @@ pub(crate) use self::xxhash::XxHash64;
/// a history of stored sketches you are stuck with it.
pub(crate) const DEFAULT_UPDATE_SEED: u64 = 9001;
+/// Computes and checks the 16-bit seed hash from the given long seed.
+///
+/// The seed hash may not be zero in order to maintain compatibility with older serialized
+/// versions that did not have this concept.
Review Comment:
So should we return an error here when the result is 0, as the Java version does?
```
public static short computeSeedHash(final long seed) {
  final long[] seedArr = {seed};
  final short seedHash = (short)(hash(seedArr, 0L)[0] & 0xFFFFL);
  if (seedHash == 0) {
    throw new SketchesArgumentException(
        "The given seed: " + seed + " produced a seedHash of zero. "
        + "You must choose a different seed.");
  }
  return seedHash;
}
```
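A rough Rust counterpart of the Java check above might look like the following. This is only a sketch under assumptions: `ZeroSeedHash` and `checked_seed_hash` are hypothetical names, and the `hash64` closure stands in for whatever 64-bit hash of the seed the crate actually uses.

```rust
use std::fmt;

/// Hypothetical error type: the seed hashed to the reserved value zero.
#[derive(Debug, PartialEq, Eq)]
pub struct ZeroSeedHash {
    pub seed: u64,
}

impl fmt::Display for ZeroSeedHash {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "the given seed {} produced a seed hash of zero; choose a different seed",
            self.seed
        )
    }
}

impl std::error::Error for ZeroSeedHash {}

/// Hypothetical helper: `hash64` is an assumption standing in for the
/// crate's real 64-bit hash of the seed.
pub fn checked_seed_hash(
    seed: u64,
    hash64: impl Fn(u64) -> u64,
) -> Result<u16, ZeroSeedHash> {
    // Keep only the low 16 bits, mirroring the 16-bit mask in the Java code.
    let seed_hash = (hash64(seed) & 0xFFFF) as u16;
    if seed_hash == 0 {
        // Zero is reserved, so reject the seed instead of returning it.
        return Err(ZeroSeedHash { seed });
    }
    Ok(seed_hash)
}
```

Returning a `Result` rather than panicking would let callers that accept user-supplied seeds surface the problem as an ordinary error.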
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: support cpcsketch serde (datasketches-rust)
leerho commented on code in PR #84:
URL: https://github.com/apache/datasketches-rust/pull/84#discussion_r2778054604
##
datasketches/src/hash/mod.rs:
##
@@ -37,6 +37,19 @@ pub(crate) use self::xxhash::XxHash64;
/// a history of stored sketches you are stuck with it.
pub(crate) const DEFAULT_UPDATE_SEED: u64 = 9001;
+/// Computes and checks the 16-bit seed hash from the given long seed.
+///
+/// The seed hash may not be zero in order to maintain compatibility with older serialized
+/// versions that did not have this concept.
Review Comment:
It is exactly what the comment says. It is to remain compatible with older sketch versions (in other languages) that did not have the concept of the seedHash.
Once you have serialized a sketch, it no longer retains any information about what language generated the serialized image. That is the whole idea and quite powerful! Once you have properly created this sketch in Rust, you will be able to import sketch images created years ago from Java, C++, or whatever. The fact that "older versions of Rust" don't have this problem is irrelevant. :)
And yes, the method that generates the seed must check for 0, as it does in Java.
And, hmmm, it looks like C++ doesn't check for zero either. Which is a bug. The likely reason this has not been noticed before is because we always use the DEFAULT_UPDATE_SEED, which has a non-zero seed_hash.
Re: [PR] feat: support cpcsketch serde (datasketches-rust)
tisonkun merged PR #84:
URL: https://github.com/apache/datasketches-rust/pull/84
Re: [PR] feat: support cpcsketch serde (datasketches-rust)
tisonkun commented on PR #84:
URL: https://github.com/apache/datasketches-rust/pull/84#issuecomment-3857514456
I'm going to merge this patch now. Review after commit is welcome.
To reduce binary size, we'd follow #32 to exclude CpcSketch's code when users don't need it.
Re: [PR] feat: support cpcsketch serde (datasketches-rust)
tisonkun commented on PR #84:
URL: https://github.com/apache/datasketches-rust/pull/84#issuecomment-3851097618
I'm going to do the following tasks after this patch is merged:
1. `CpcWrapper` to read fields without fully deserializing the sketch. This is implemented in the Java impl as well.
2. Investigate whether we need `introspective_insertion_sort`. Rust's slice sort should already take advantage of items that are in order.
For this patch, one open question is whether to include the decoding tables as static values or to build them on first access (using `OnceLock` or similar). I tend to keep the static decoding tables. They should not increase the binary size too much.
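To make the two options concrete, here is a minimal sketch that builds the same small table both ways. The table contents (trailing-zero counts of 4-bit values) and the names `STATIC_TABLE`, `build_table`, and `lazy_table` are illustrative assumptions, not the CPC decoding tables from the patch.

```rust
use std::sync::OnceLock;

/// Option 1: a table baked into the binary at compile time.
/// Illustrative contents only: trailing zero bits of each 4-bit value.
static STATIC_TABLE: [u8; 16] = build_table();

const fn build_table() -> [u8; 16] {
    let mut table = [0u8; 16];
    let mut i = 0;
    while i < 16 {
        // `trailing_zeros` of a zero u8 is 8, so cap it at the 4-bit width.
        let tz = (i as u8).trailing_zeros() as u8;
        table[i] = if tz > 4 { 4 } else { tz };
        i += 1;
    }
    table
}

/// Option 2: the same table built on first access and cached afterwards.
fn lazy_table() -> &'static [u8; 16] {
    static TABLE: OnceLock<[u8; 16]> = OnceLock::new();
    TABLE.get_or_init(build_table)
}
```

The static form costs space in the binary for the table data but no work at runtime; the `OnceLock` form defers the build to first use and adds an initialization check on every access.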
Re: [PR] feat: support cpcsketch serde (datasketches-rust)
tisonkun commented on code in PR #84:
URL: https://github.com/apache/datasketches-rust/pull/84#discussion_r2766841903
##
datasketches/src/hash/mod.rs:
##
@@ -37,6 +37,19 @@ pub(crate) use self::xxhash::XxHash64;
/// a history of stored sketches you are stuck with it.
pub(crate) const DEFAULT_UPDATE_SEED: u64 = 9001;
+/// Computes and checks the 16-bit seed hash from the given long seed.
+///
+/// The seed hash may not be zero in order to maintain compatibility with older serialized
+/// versions that did not have this concept.
Review Comment:
BTW this comment is copied from `computeSeedHash`'s Java version.
Re: [PR] feat: support cpcsketch serde (datasketches-rust)
tisonkun commented on code in PR #84:
URL: https://github.com/apache/datasketches-rust/pull/84#discussion_r2764759992
##
datasketches/src/hash/mod.rs:
##
@@ -37,6 +37,19 @@ pub(crate) use self::xxhash::XxHash64;
/// a history of stored sketches you are stuck with it.
pub(crate) const DEFAULT_UPDATE_SEED: u64 = 9001;
+/// Computes and checks the 16-bit seed hash from the given long seed.
+///
+/// The seed hash may not be zero in order to maintain compatibility with older serialized
+/// versions that did not have this concept.
Review Comment:
I suppose so.
cc @leerho I can't see a similar requirement from the Rust code alone. Could you provide more context on why 0 is not allowed here?
Re: [PR] feat: support cpcsketch serde (datasketches-rust)
tisonkun commented on code in PR #84:
URL: https://github.com/apache/datasketches-rust/pull/84#discussion_r2764754983
##
datasketches/src/cpc/pair_table.rs:
##
@@ -64,17 +64,13 @@ impl PairTable {
// sorted pairs array. However, we are starting out with the correct final table size, so
// the problem might not occur.
-for slot in slots {
-table.must_insert(slot);
+for i in 0..num_items {
Review Comment:
I suppose the slice index access that follows already provides the necessary
bounds check, and its panic is more informative.
Re: [PR] feat: support cpcsketch serde (datasketches-rust)
ZENOTME commented on code in PR #84:
URL: https://github.com/apache/datasketches-rust/pull/84#discussion_r2764488941
##
datasketches/src/cpc/pair_table.rs:
##
@@ -64,17 +64,13 @@ impl PairTable {
// sorted pairs array. However, we are starting out with the correct final table size, so
// the problem might not occur.
-for slot in slots {
-table.must_insert(slot);
+for i in 0..num_items {
Review Comment:
Here we have the invariant `num_items <= slots.len()`. Should we express it
explicitly with an assert or a comment?
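One way to state that invariant in code is sketched below. `Table`, `must_insert`, and `fill` are simplified, hypothetical stand-ins for the code in `pair_table.rs`, used only to show an explicit assert next to the slice-range form of the loop.

```rust
/// Hypothetical stand-in for the pair table; it only records inserted items.
#[derive(Default)]
pub struct Table {
    pub items: Vec<u32>,
}

impl Table {
    fn must_insert(&mut self, item: u32) {
        self.items.push(item);
    }
}

pub fn fill(slots: &[u32], num_items: usize) -> Table {
    // State the invariant up front so a violation fails with a clear message.
    assert!(
        num_items <= slots.len(),
        "num_items ({}) exceeds slots.len() ({})",
        num_items,
        slots.len()
    );
    let mut table = Table::default();
    // Slicing performs one bounds check, then the loop needs no indexing.
    for &slot in &slots[..num_items] {
        table.must_insert(slot);
    }
    table
}
```

Even without the `assert!`, `&slots[..num_items]` panics when `num_items` is too large, so the assert mainly improves the message and documents the intent.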
Re: [PR] feat: support cpcsketch serde (datasketches-rust)
ZENOTME commented on code in PR #84:
URL: https://github.com/apache/datasketches-rust/pull/84#discussion_r2764002660
##
datasketches/src/hash/mod.rs:
##
@@ -37,6 +37,19 @@ pub(crate) use self::xxhash::XxHash64;
/// a history of stored sketches you are stuck with it.
pub(crate) const DEFAULT_UPDATE_SEED: u64 = 9001;
+/// Computes and checks the 16-bit seed hash from the given long seed.
+///
+/// The seed hash may not be zero in order to maintain compatibility with older serialized
+/// versions that did not have this concept.
Review Comment:
Does this mean that we should check the return value to prevent 0?
##
datasketches/src/cpc/pair_table.rs:
##
@@ -64,17 +64,13 @@ impl PairTable {
// sorted pairs array. However, we are starting out with the correct final table size, so
// the problem might not occur.
-for slot in slots {
-table.must_insert(slot);
+for i in 0..num_items {
Review Comment:
Here we have the invariant `num_items <= slots.len()`. Should we express it
explicitly with an assert or a comment?
Re: [PR] feat: support cpcsketch serde (datasketches-rust)
tisonkun commented on PR #84:
URL: https://github.com/apache/datasketches-rust/pull/84#issuecomment-3847361853
cc @PsiACE @Xuanwo @ZENOTME
