I think this is https://issues.apache.org/jira/browse/PARQUET-1655. Would you like to provide a patch?
On Tue, Feb 2, 2021 at 12:42 PM g. g. grey <[email protected]> wrote:

> Hi.
>
> I'm relatively new to arrow/parquet, so I'd appreciate help trying to
> figure out where I've gone awry.
>
> When writing integers as fixed_len_byte_arrays in parquet files, I've run
> into a scenario where the min/max statistics appear to be incorrect. I've
> debugged into the code and it looks like the CompareHelper<FLBAType,
> is_signed=true> is being used for comparison. This seems to do a
> lexicographic_compare of signed bytes; this is counter to what I expected,
> which is that the MSB would be used for sign comparison but that all other
> bytes would be unsigned comparison.
>
> I've hacked together a small example to show what I'm talking about,
> focusing on the comparator itself. I'm wondering if I'm going awry in
> creating my FLBA, if my understanding of the comparison itself is flawed,
> or if there is an issue with the comparison.
>
> Thanks for your help!
> ggg
>
> namespace parquet {
>>
>> using schema::GroupNode;
>> using schema::NodePtr;
>> using schema::PrimitiveNode;
>>
>> namespace test {
>>
>> // ----------------------------------------------------------------------
>> // Test comparators
>>
>> void printArray(const unsigned char * arr)
>> {
>>   for(int i = 0; i < 8; i++ )
>>   {
>>     printf("[%d] 0x%x ", i, arr[i]);
>>   }
>>   printf("\n");
>> }
>>
>> TEST(Comparison, SignedFLBA_error1) {
>>   int size = 8;
>>   auto comparator =
>>       MakeComparator<FLBAType>(Type::FIXED_LEN_BYTE_ARRAY,
>>                                SortOrder::SIGNED, size);
>>
>>   int64_t low = 1234;
>>   int64_t high = 0x8000;
>>
>>   // convert to big endian
>>   int64_t lowBE = ::arrow::BitUtil::ToBigEndian(low);
>>   printf("low 0x%llx lowBE 0x%llx\n", low, lowBE);
>>   printArray((unsigned char*)&lowBE);
>>
>>   int64_t highBE = ::arrow::BitUtil::ToBigEndian(high);
>>   printf("high 0x%llx highBE 0x%llx\n", high, highBE);
>>   printArray((unsigned char*)&highBE);
>>
>>   FLBA lowBEFlba((uint8_t*)&lowBE);
>>   FLBA highBEFlba((uint8_t*)&highBE);
>>
>>   // compare. Uses CompareHelper<FLBAType, is_signed=true>
>>   // This fails but should return true b/c 1234 < 0x8000.
>>   ASSERT_TRUE(comparator->Compare(lowBEFlba, highBEFlba));
>> }
>>
>
> The output from running the test
>
> [==========] Running 2 tests from 1 test suite.
>> [----------] Global test environment set-up.
>> [----------] 2 tests from Comparison
>> [ RUN ] Comparison.SignedFLBA_error1
>> low 0x4d2 lowBE 0xd204000000000000
>> [0] 0x0 [1] 0x0 [2] 0x0 [3] 0x0 [4] 0x0 [5] 0x0 [6] 0x4 [7] 0xd2
>> high 0x8000 highBE 0x80000000000000
>> [0] 0x0 [1] 0x0 [2] 0x0 [3] 0x0 [4] 0x0 [5] 0x0 [6] 0x80 [7] 0x0
>> /Users/kbhmr/Software/arrow/arrow-master/cpp/src/parquet/short_statistics_test.cc:89:
>> Failure
>> Value of: comparator->Compare(lowBEFlba, highBEFlba)
>> Actual: false
>> Expected: true
>> [ FAILED ] Comparison.SignedFLBA_error1 (0 ms)
>>
>
