[DISCUSS] Time to evaluate "continuous mode" in SS?

2020-09-14 Thread Jungtaek Lim
Hi devs,

It was Spark 2.3 in Feb 2018 which introduced continuous mode in Structured
Streaming as "experimental".

Now we are here, 2.5 years after that release - I feel it would be a good
time to evaluate the mode: whether it has been widely used or not, and
whether it has been making progress, given that it is still "experimental".
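
For context, continuous mode is selected per query by passing a continuous
trigger when the query is started. A minimal sketch of what that looks like
(the Kafka endpoints and checkpoint path are illustrative placeholders, not
from any real deployment):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.Trigger;

public class ContinuousModeSketch {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("continuous-mode-sketch")
        .getOrCreate();

    Dataset<Row> input = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092") // placeholder
        .option("subscribe", "input-topic")               // placeholder
        .load();

    // Trigger.Continuous selects the continuous execution engine; the
    // interval is the checkpoint interval, not a micro-batch interval.
    input.writeStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092") // placeholder
        .option("topic", "output-topic")                  // placeholder
        .option("checkpointLocation", "/tmp/checkpoint")  // placeholder
        .trigger(Trigger.Continuous("1 second"))
        .start()
        .awaitTermination();
  }
}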

At least on the surface, I don't see any active effort on continuous mode
in the community - the last major effort was stateful operation support,
which was incomplete, and I removed it. There were a couple of bug reports,
as well as fixes, from more than a year ago, and almost none of them have
been handled. (A trivial bugfix PR was merged recently, but that's all.)
The new features introduced to Structured Streaming (at least observable
metrics and the SS UI) don't apply to continuous mode, and no one made
"support continuous mode" a hard requirement for passing review in those PRs.

I have no idea how many companies are using the mode in production (please
chime in if you have statistics on this), but I don't see any recent bug
reports, and only a few questions on SO, which makes me question whether the
maintenance cost is justified.

I know there's a general preference to avoid discontinuing support where
possible, but it seems odd to keep something around unmaintained, especially
when it is still "experimental" and its main authors are no longer active
enough to commit to maintaining or improving the module. Thoughts?

Thanks,
Jungtaek Lim (HeartSaVioR)


Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-14 Thread kalyan
+1

This will positively improve the performance and reliability of Spark.
Looking forward to it.

Regards
Kalyan.

On Tue, Sep 15, 2020, 9:26 AM Joseph Torres wrote:

> +1
>
> On Mon, Sep 14, 2020 at 6:39 PM angers.zhu  wrote:
>
>> +1
>>
>> angers.zhu
>> angers@gmail.com
>>
>> On 09/15/2020 08:21, Xiao Li wrote:
>>
>> +1
>>
>> Xiao
>>
>> DB Tsai wrote on Mon, Sep 14, 2020 at 4:09 PM:
>>
>>> +1
>>>
>>> On Mon, Sep 14, 2020 at 12:30 PM Chandni Singh wrote:
>>>
>>>> +1
>>>>
>>>> Chandni
>>>>
>>>> On Mon, Sep 14, 2020 at 11:41 AM Tom Graves wrote:
>>>>
>>>>> +1
>>>>>
>>>>> Tom
>>>>>
>>>>> On Sunday, September 13, 2020, 10:00:05 PM CDT, Mridul Muralidharan <
>>>>> mri...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I'd like to call for a vote on SPARK-30602 - SPIP: Support push-based
>>>>> shuffle to improve shuffle efficiency.
>>>>> Please take a look at:
>>>>>
>>>>>    - SPIP jira: https://issues.apache.org/jira/browse/SPARK-30602
>>>>>    - SPIP doc:
>>>>>      https://docs.google.com/document/d/1mYzKVZllA5Flw8AtoX7JUcXBOnNIDADWRbJ7GI6Y71Q/edit
>>>>>    - POC against master and results summary:
>>>>>      https://docs.google.com/document/d/1Q5m7YAp0HyG_TNFL4p_bjQgzzw33ik5i49Vr86UNZgg/edit
>>>>>
>>>>> Active discussions on the jira and SPIP document have settled.
>>>>>
>>>>> I will leave the vote open until Friday (the 18th September 2020), 5pm CST.
>>>>>
>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>> [ ] +0
>>>>> [ ] -1: I don't think this is a good idea because ...
>>>>>
>>>>> Thanks,
>>>>> Mridul
>>>
>>> --
>>> Sincerely,
>>>
>>> DB Tsai
>>> --
>>> Web: https://www.dbtsai.com
>>> PGP Key ID: 42E5B25A8F7A82C1


Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-14 Thread Joseph Torres
+1



Re: Performance of VectorizedRleValuesReader

2020-09-14 Thread Chang Chen
I see.

In our case, we use SingleBufferInputStream, so the time is spent
duplicating the backing byte buffer.

Thanks
Chang




Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-14 Thread angers.zhu
+1

angers.zhu
angers@gmail.com



Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-14 Thread Xiao Li
+1

Xiao



Re: How to clear spark Shuffle files

2020-09-14 Thread lsn248
Our use case is as follows:
We repartition six months' worth of data for each client on clientId &
recordcreationdate, so that the job can write one file per partition. Our
output is partitioned on client and recordcreationdate.

The job fills up the disk after it processes, say, 30 tenants out of 50. I
am looking for a way to clear the shuffle files once the job finishes
writing to disk for a client, before it moves on to the next one.

We process a client or a group of clients (depending on data size) in one
go, and the SparkSession is shared. We noticed that creating a new
SparkSession clears the disk, but a new SparkSession is not an option for us.
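
For reference, a minimal sketch of the per-client repartition-and-write step
described above (the column names, input/output paths, and format are
illustrative guesses, not taken from our actual job):

import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PerClientWrite {
  // Repartition by the partition columns so each output partition is
  // written as a single file, then write with partitionBy.
  static void writeClient(SparkSession spark, String clientId) {
    Dataset<Row> df = spark.read().parquet("/data/input/" + clientId); // placeholder path
    df.repartition(col("clientId"), col("recordcreationdate")) // shuffle happens here
      .write()
      .partitionBy("clientId", "recordcreationdate")
      .mode("append")
      .parquet("/data/output"); // placeholder path
  }
}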






Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-14 Thread DB Tsai
+1


-- 
Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1


Re: How to clear spark Shuffle files

2020-09-14 Thread Holden Karau
There's a second new mechanism which uses TTL for cleanup of shuffle files.
Can you share more about your use case?


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: How to clear spark Shuffle files

2020-09-14 Thread Edward Mitchell
We've also had some similar disk-fill issues.

For Java/Scala RDDs, shuffle file cleanup is done as part of JVM garbage
collection. I've noticed that if RDD references are retained in the code, so
the RDDs cannot be garbage collected, their intermediate shuffle files hang
around.

The best way to handle this is to organize your code so that when an RDD is
finished, it falls out of scope and can therefore be garbage collected.

There's also an experimental API, created in Spark 3 (I think), that gives
you more granular control by letting you call a method to clean up the
shuffle files.
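
A minimal sketch of both ideas - per-iteration scoping plus the explicit
cleanup call. I'm assuming the method in question is
RDD.cleanShuffleDependencies, which I believe is the @Experimental API added
around Spark 3.1; please verify the name and availability against your Spark
version:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class ShuffleScopeSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("shuffle-scope-sketch")
        .getOrCreate();
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

    for (int client = 0; client < 50; client++) {
      // Keep each client's shuffled RDD confined to this loop body so the
      // reference dies at the end of the iteration and the ContextCleaner
      // can remove its shuffle files after a GC.
      JavaRDD<String> shuffled = jsc
          .textFile("/data/client-" + client) // placeholder path
          .repartition(8);                    // shuffle happens here
      shuffled.saveAsTextFile("/out/client-" + client); // placeholder path

      // Assumed Spark 3.1+ API: proactively remove this RDD's shuffle files
      // instead of waiting for GC; blocking=true waits for deletion.
      shuffled.rdd().cleanShuffleDependencies(true);
    }
    jsc.close();
  }
}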



Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-14 Thread Chandni Singh
+1

Chandni



Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-14 Thread Tom Graves
+1

Tom

Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-14 Thread Venkatakrishnan Sowrirajan
+1. Interesting indeed :)

Regards
Venkata krishnan




Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-14 Thread Xingbo Jiang
+1 This is an exciting new feature!



Re: Performance of VectorizedRleValuesReader

2020-09-14 Thread Ryan Blue
Before, the input was a byte array so we could read from it directly. Now,
the input is a `ByteBufferInputStream` so that Parquet can choose how to
allocate buffers. For example, we use vectored reads from S3 that pull back
multiple buffers in parallel.

Now that the input is a stream based on possibly multiple byte buffers, it
provides a method to get a buffer of a certain length. In most cases, that
will create a ByteBuffer with the same backing byte array, but it may need
to copy if the request spans multiple buffers in the stream. Most of the
time, the call to `slice` only requires duplicating the buffer and setting
its limit, but a read that spans multiple buffers is expensive. It would be
helpful to know whether the time spent is copying data, which would
indicate the backing buffers are too small, or whether it is spent
duplicating the backing byte buffer.
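
To make the two cost regimes concrete, here is a small standalone sketch
(plain java.nio, not Parquet's actual implementation) of what a slice
amounts to in each case:

import java.nio.ByteBuffer;

public class SliceCostSketch {
  // Cheap path: the requested bytes sit inside one backing buffer, so a
  // slice is just a duplicate() view plus position/limit bookkeeping -
  // no bytes are copied.
  static ByteBuffer sliceWithinOneBuffer(ByteBuffer backing, int len) {
    ByteBuffer view = backing.duplicate();      // shares the backing storage
    view.limit(view.position() + len);          // restrict the view to len bytes
    backing.position(backing.position() + len); // advance the "stream"
    return view;
  }

  // Expensive path: the requested bytes span a buffer boundary, so they
  // must be copied into a freshly allocated buffer.
  static ByteBuffer sliceAcrossBuffers(ByteBuffer first, ByteBuffer second, int len) {
    ByteBuffer copy = ByteBuffer.allocate(len);
    while (copy.hasRemaining() && first.hasRemaining()) copy.put(first.get());
    while (copy.hasRemaining()) copy.put(second.get());
    copy.flip();
    return copy;
  }

  public static void main(String[] args) {
    ByteBuffer one = ByteBuffer.wrap(new byte[] {1, 2, 3, 4, 5, 6, 7, 8});
    System.out.println(sliceWithinOneBuffer(one, 4).remaining()); // 4, no copy
    ByteBuffer a = ByteBuffer.wrap(new byte[] {1, 2, 3});
    ByteBuffer b = ByteBuffer.wrap(new byte[] {4, 5, 6});
    System.out.println(sliceAcrossBuffers(a, b, 5).remaining());  // 5, copied
  }
}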



-- 
Ryan Blue


How to clear spark Shuffle files

2020-09-14 Thread lsn248
Hi,

I have a long-running application, and Spark seems to fill up the disk with
shuffle files. Eventually the job fails after running out of disk space. Is
there a way for me to clean up the shuffle files?

Thanks








Re: Performance of VectorizedRleValuesReader

2020-09-14 Thread Sean Owen
Ryan, do you happen to have any opinion there? That particular section
was introduced in the Parquet 1.10 update:
https://github.com/apache/spark/commit/cac9b1dea1bb44fa42abf77829c05bf93f70cf20
It looks like it didn't previously create a ByteBuffer each time, but read
from `in` directly.

On Sun, Sep 13, 2020 at 10:48 PM Chang Chen wrote:
>
> I think we can copy all encoded data into a ByteBuffer once, and unpack
> values in the loop:
>
>   while (valueIndex < this.currentCount) {
>     // values are bit packed 8 at a time, so reading bitWidth will always work
>     this.packer.unpack8Values(buffer, buffer.position() + valueIndex,
>         this.currentBuffer, valueIndex);
>     valueIndex += 8;
>   }
>
> Sean Owen wrote on Mon, Sep 14, 2020 at 10:40 AM:
>>
>> It certainly can't be called once - it's reading different data each time.
>> There might be a faster way to do it, I don't know. Do you have ideas?
>>
>> On Sun, Sep 13, 2020 at 9:25 PM Chang Chen wrote:
>> >
>> > Hi experts,
>> >
>> > It looks like there is a hot spot in
>> > VectorizedRleValuesReader#readNextGroup():
>> >
>> >   case PACKED:
>> >     int numGroups = header >>> 1;
>> >     this.currentCount = numGroups * 8;
>> >
>> >     if (this.currentBuffer.length < this.currentCount) {
>> >       this.currentBuffer = new int[this.currentCount];
>> >     }
>> >     currentBufferIdx = 0;
>> >     int valueIndex = 0;
>> >     while (valueIndex < this.currentCount) {
>> >       // values are bit packed 8 at a time, so reading bitWidth will always work
>> >       ByteBuffer buffer = in.slice(bitWidth);
>> >       this.packer.unpack8Values(buffer, buffer.position(),
>> >           this.currentBuffer, valueIndex);
>> >       valueIndex += 8;
>> >     }
>> >
>> > Per my profiling, the code spends 30% of readNextGroup()'s time on
>> > slice. Why can't we call slice outside of the loop?
