I see.
In our case, we use SingleBufferInputStream, so the time is spent duplicating
the backing byte buffer.
Thanks
Chang
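A minimal sketch of the pattern Chang describes (this is an illustration, not Parquet's exact SingleBufferInputStream code): a slice() that duplicates the backing buffer's view on every call, so slicing per 8-value group in a tight loop allocates one short-lived ByteBuffer object per group.

```java
import java.nio.ByteBuffer;

// Sketch of a single-buffer input whose slice() duplicates the view each call.
class SingleBufferSketch {
  private final ByteBuffer buffer;

  SingleBufferSketch(ByteBuffer buffer) { this.buffer = buffer; }

  // Each call creates a new ByteBuffer view over the same bytes
  // (a view copy, not a data copy), then advances the source position.
  ByteBuffer slice(int length) {
    ByteBuffer copy = buffer.duplicate();
    copy.limit(copy.position() + length);
    buffer.position(buffer.position() + length);
    return copy;
  }
}
```

Even though no bytes are copied, calling this once per 8 values produces a fresh object on every iteration, which is where the time shows up in the profile.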
Ryan Blue wrote on Tuesday, September 15, 2020 at 2:04 AM:
Before, the input was a byte array so we could read from it directly. Now,
the input is a `ByteBufferInputStream` so that Parquet can choose how to
allocate buffers. For example, we use vectored reads from S3 that pull back
multiple buffers in parallel.
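A toy sketch of why callers can no longer assume one contiguous backing array (this is an illustration of the idea, not Parquet's actual ByteBufferInputStream API): an input abstraction over possibly many buffers, where reads have to cross buffer boundaries.

```java
import java.nio.ByteBuffer;
import java.util.List;

// Toy input over multiple backing buffers, e.g. filled by parallel
// vectored reads; reads walk from one buffer to the next.
class MultiBufferInput {
  private final List<ByteBuffer> buffers;
  private int current = 0;

  MultiBufferInput(List<ByteBuffer> buffers) { this.buffers = buffers; }

  // Read one byte, moving across buffer boundaries as needed;
  // returns -1 when all buffers are exhausted.
  int read() {
    while (current < buffers.size() && !buffers.get(current).hasRemaining()) {
      current++;
    }
    if (current == buffers.size()) return -1;
    return buffers.get(current).get() & 0xFF;
  }
}
```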
Now that the input is a stream based on poss
Ryan, do you happen to have any opinion there? That particular section
was introduced in the Parquet 1.10 update:
https://github.com/apache/spark/commit/cac9b1dea1bb44fa42abf77829c05bf93f70cf20
It looks like it didn't use to create a ByteBuffer each time, but read from `in` directly.
On Sun, Sep 13, 2020 at 10:
I think we can copy all of the encoded data into a ByteBuffer once, and unpack
the values in the loop:

while (valueIndex < this.currentCount) {
  // values are bit packed 8 at a time, so reading bitWidth will always work
  this.packer.unpack8Values(buffer, buffer.position() + valueIndex,
      this.currentBuffer, valueIndex);
  valueIndex += 8;
}
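A runnable sketch of the proposal above, with a simplified stand-in for Parquet's packer (assumptions: bitWidth = 1, so each 8-value group is one byte, and unpack8Values here is my own one-bit unpacker, not Parquet's): unpacking every group from a single buffer via offset arithmetic yields the same values as slicing a fresh buffer per group.

```java
import java.nio.ByteBuffer;

// Compares per-group slicing against a single buffer with computed offsets.
class UnpackSketch {
  // Simplified stand-in for Parquet's unpack8Values, for bitWidth = 1:
  // unpack 8 one-bit values from the byte at inPos (LSB first).
  static void unpack8Values(ByteBuffer buf, int inPos, int[] out, int outPos) {
    int b = buf.get(inPos) & 0xFF;
    for (int i = 0; i < 8; i++) out[outPos + i] = (b >>> i) & 1;
  }

  // Current shape: a fresh ByteBuffer view for every 8-value group.
  static int[] perGroup(byte[] data) {
    int bitWidth = 1;
    int numGroups = data.length;
    int[] out = new int[numGroups * 8];
    ByteBuffer in = ByteBuffer.wrap(data);
    for (int g = 0; g < numGroups; g++) {
      ByteBuffer group = in.duplicate();      // new object per group
      group.position(g * bitWidth);
      unpack8Values(group, group.position(), out, g * 8);
    }
    return out;
  }

  // Proposed shape: one buffer for all groups, offsets from valueIndex.
  static int[] once(byte[] data) {
    int bitWidth = 1;
    int[] out = new int[data.length * 8];
    ByteBuffer buffer = ByteBuffer.wrap(data); // single buffer, no per-group slice
    int valueIndex = 0;
    while (valueIndex < out.length) {
      unpack8Values(buffer, buffer.position() + (valueIndex / 8) * bitWidth,
          out, valueIndex);
      valueIndex += 8;
    }
    return out;
  }
}
```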
It certainly can't be called once - it's reading different data each time.
There might be a faster way to do it, I don't know. Do you have ideas?
On Sun, Sep 13, 2020 at 9:25 PM Chang Chen wrote:
Hi experts,

It looks like there is a hot spot in VectorizedRleValuesReader#readNextGroup():
case PACKED:
  int numGroups = header >>> 1;
  this.currentCount = numGroups * 8;
  if (this.currentBuffer.length < this.currentCount) {
    this.currentBuffer = new int[this.currentCount];
  }
  currentBufferIdx = 0;
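For reference, the header arithmetic in the snippet above follows Parquet's RLE/bit-packing hybrid encoding: the low bit of the run header selects RLE (0) versus bit-packed (1), and for bit-packed runs the remaining bits hold the number of 8-value groups. A small sketch of that decoding (my own helper names, not Spark's):

```java
// Decodes a run header from Parquet's RLE/bit-packing hybrid encoding.
class HeaderDemo {
  // Low bit 1 means a bit-packed run, 0 means an RLE run.
  static boolean isPacked(int header) { return (header & 1) == 1; }

  // Bit-packed runs store a group count; each group holds 8 values.
  static int packedValueCount(int header) { return (header >>> 1) * 8; }

  // RLE runs store the repeated-value count directly.
  static int rleRunLength(int header) { return header >>> 1; }
}
```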