Hi all,

I would like to ask whether there are any known or potential workarounds on
the Spark side for a reproducible failure in Hadoop’s native ZSTD
decompression. The issue appears to be triggered specifically when the
original (uncompressed) file size is smaller than 129 KiB.

Environment:
- Apache Spark 3.5.7 (Scala 2.12) with Hadoop 3.3.4
- libhadoop.so from Apache Hadoop 3.3.6
- libzstd 1.5.4

Summary of the problem:
When Spark reads a ZSTD-compressed file through Hadoop’s native
ZStandardDecompressor, the following errors can be reproduced reliably:

1. For files whose original size is <129 KiB:
   java.lang.InternalError: Src size is incorrect

2. Under a slightly different sequence of reads:
   java.lang.InternalError: Restored data doesn't match checksum

These errors occur even though the ZSTD files are valid and can be
decompressed normally with the `zstd` CLI tools.
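
For anyone who wants to repeat that check programmatically, here is a minimal
sketch, assuming the python-zstandard package and a local copy of the
compressed file (both are my additions here; the `zstd`/`zstdcat` CLI check in
step 3 below is equivalent):

```python
# Sketch: confirm the archive decodes fine with libzstd outside Hadoop.
# Assumes the python-zstandard package and a local copy of the file.
import zstandard

with open("file_128KiB.txt.zst", "rb") as f:
    compressed = f.read()

# The zstd CLI records the content size in the frame header, so a one-shot
# decompress() call works without specifying max_output_size.
data = zstandard.ZstdDecompressor().decompress(compressed)
print(len(data))   # 131072 bytes (65,536 two-byte lines = 128 KiB)
```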

Reproduction procedure:
1. `yes a | head -n 65536 > file_128KiB.txt`  (65,536 two-byte lines = 128 KiB)
2. `zstd file_128KiB.txt`
3. Validate with `zstd -lv` and `zstdcat`.
4. In PySpark:

   `spark.read.text("hdfs://dhome/camepr42/test_zstd/file_128KiB.txt.zst").show()`

5. The executor raises `InternalError: Src size is incorrect` (see also the
   codec-level sketch after this list).
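
To take Spark's text reader out of the picture, below is a minimal PySpark
sketch (assuming py4j access through Spark's internal `spark._jvm` gateway and
the same HDFS path as above) that drives Hadoop's ZStandardCodec directly; if
the bug is in the native decompressor, the same `InternalError` should surface
from this call as well. Treat it as an illustration rather than a polished
reproducer.

```python
# Sketch: drive Hadoop's ZStandardCodec directly via the Spark JVM gateway.
# Assumes a running SparkSession `spark` and the same HDFS path as above.
jvm = spark._jvm
hconf = spark._jsc.hadoopConfiguration()

path = jvm.org.apache.hadoop.fs.Path(
    "hdfs://dhome/camepr42/test_zstd/file_128KiB.txt.zst")
factory = jvm.org.apache.hadoop.io.compress.CompressionCodecFactory(hconf)
codec = factory.getCodec(path)   # resolved from the .zst extension
fs = path.getFileSystem(hconf)

out = jvm.java.io.ByteArrayOutputStream()
# copyBytes pulls the whole stream through the native ZStandardDecompressor;
# if the bug is in the native codec, the InternalError is raised from here.
jvm.org.apache.hadoop.io.IOUtils.copyBytes(
    codec.createInputStream(fs.open(path)), out, 4096, True)
print(out.size())   # 131072 bytes expected on success
```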

A second sequence of reads involving both a 129 KiB file and a 128 KiB file
reproduces `InternalError: Restored data doesn't match checksum`.

Details, including full stack traces and the exact command steps, are in my
comment on the Hadoop JIRA issue: https://issues.apache.org/jira/browse/HADOOP-18799

Thanks
-- 
camper42
Douban, Inc.

E-mail:  [email protected]
