[GitHub] spark issue #19978: [SPARK-22784][CORE][WIP] Configure reading buffer size i...

2018-01-23 Thread MikhailErofeev
Github user MikhailErofeev commented on the issue:

https://github.com/apache/spark/pull/19978
  
@srowen, yes, the processing is no longer IO-bound after backporting SPARK-20923.


---




[GitHub] spark pull request #19978: [SPARK-22784][CORE][WIP] Configure reading buffer...

2018-01-23 Thread MikhailErofeev
Github user MikhailErofeev closed the pull request at:

https://github.com/apache/spark/pull/19978


---




[GitHub] spark issue #19978: [SPARK-22784][CORE] Configure reading buffer size in Spa...

2017-12-16 Thread MikhailErofeev
Github user MikhailErofeev commented on the issue:

https://github.com/apache/spark/pull/19978
  
@squito
Your guess was right: these blocks can be removed by applying 
https://issues.apache.org/jira/browse/SPARK-20923. I will test the performance 
after that patch and either refine or close the ticket. 


---




[GitHub] spark issue #19978: [SPARK-22784][CORE] Configure reading buffer size in Spa...

2017-12-15 Thread MikhailErofeev
Github user MikhailErofeev commented on the issue:

https://github.com/apache/spark/pull/19978
  
Thanks for the constructive feedback. 
Here is my benchmark with a step of 1MB. During this run the speedup was 
23%; I think there was some interference on my workstation. 
```
buffer size (bytes)  time
2048 213.481
1048576 212.256
2097152 206.07
3145728 199.057
4194304 192.292
5242880 187.91
6291456 182.95
7340032 179.837
8388608 176.994
9437184 174.602
10485760 173.313 < 10MB, 19% saving
11534336 172.1
12582912 176.085
13631488 171.782
14680064 172.209
15728640 172.048
16777216 168.313
17825792 169.521
18874368 167.466
19922944 167.023
20971520 167.7 < 20MB, 21% saving
22020096 166.754
23068672 166.141
24117248 165.809
25165824 166.053
26214400 165.281
27262976 165.381
28311552 164.564
29360128 164.894
30408704 164.599 < 30MB, ~20x avg line size, 23% saving
31457280 164.019
32505856 164.289
33554432 164.517
34603008 163.96
35651584 163.936
36700160 163.381
37748736 164.156
38797312 164.061
39845888 163.636
40894464 163.73
41943040 162.462
42991616 163.006
44040192 162.586
45088768 162.363
```
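
For reference, a minimal sketch of the kind of loop that could produce the 
numbers above (the file path, the upper bound of 43MB, and the assumption that 
the second column is seconds per pass are mine; this is not the exact 
benchmark code):
```scala
import java.io.{BufferedReader, FileInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets

// Time one full pass over an event log with the given read-buffer size.
def timeRead(path: String, bufferSize: Int): Double = {
  val reader = new BufferedReader(
    new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8),
    bufferSize)
  val start = System.nanoTime()
  try {
    var line = reader.readLine()
    while (line != null) line = reader.readLine()
  } finally reader.close()
  (System.nanoTime() - start) / 1e9
}

// 2048 is the current default buffer size; then step in 1MB increments.
for (size <- 2048 +: (1 to 43).map(_ * 1024 * 1024)) {
  println(s"$size ${timeRead("eventlog.json", size)}")
}
```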

@squito 
Our main users run iterative algorithms over many partitions. They have 
a lot of data and prefer smaller partitions (that is another improvement 
axis). So the long lines come from SparkListenerTaskEnd events and their block info:
```
"Block ID": "rdd_5_30129",
"Status": {
  "Storage Level": {
    "Use Disk": true,
    "Use Memory": false,
    "Deserialized": false,
    "Replication": 1
  },
  "Memory Size": 0,
  "Disk Size": 57174029
}
```
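
(For scale: the block ID above suggests on the order of 30,000 partitions in 
that RDD alone, and a task-end event can carry one such `Status` entry per 
updated block, so individual JSON lines can reach many megabytes.)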



---




[GitHub] spark issue #19978: [SPARK-22784][CORE] Configure reading buffer size in Spa...

2017-12-14 Thread MikhailErofeev
Github user MikhailErofeev commented on the issue:

https://github.com/apache/spark/pull/19978
  
I don't mind just setting it to a higher value. Moreover, the current 
default value (2048) is small in any case. 
For my log files, a 30M buffer was the best value (a bigger one did not bring 
much additional speedup), although for other files the optimal value could be larger. 

What do you think? Is it OK to keep the value at 30M? With 50 cores it 
could eat 1.5G. 
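
(The arithmetic behind that estimate: assuming one buffer per concurrently 
replayed log, 50 x 30M = 1500M, i.e. roughly 1.5G of heap spent on read 
buffers alone.)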


---




[GitHub] spark pull request #19978: [SPARK-22784][CORE] Configure reading buffer size...

2017-12-14 Thread MikhailErofeev
GitHub user MikhailErofeev opened a pull request:

https://github.com/apache/spark/pull/19978

[SPARK-22784][CORE] Configure reading buffer size in Spark History Server

## What changes were proposed in this pull request?
Added debug logging of time spent and line size for each job.
Parametrized `ReplayListenerBus` with a new buffer size parameter 
`spark.history.fs.buffer.size`. 
Added documentation for the parameter. 
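
A minimal sketch of the idea (only `spark.history.fs.buffer.size` and 
`ReplayListenerBus` come from this PR; the function name and signature below 
are illustrative, not the actual patch):
```scala
import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets

// Sketch: expose the reader's buffer size instead of relying on
// scala.io.Source's fixed 2048-byte default.
def eventLogLines(in: InputStream, bufferSizeBytes: Int): Iterator[String] = {
  val reader = new BufferedReader(
    new InputStreamReader(in, StandardCharsets.UTF_8), bufferSizeBytes)
  Iterator.continually(reader.readLine()).takeWhile(_ != null)
}
```
On the configuration side this would presumably be set as 
`spark.history.fs.buffer.size=31457280` (30M) for the History Server; the 
exact accepted value syntax is not shown in this thread.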

## How was this patch tested?
Existing tests for correctness, manual tests (reading a file in a loop with 
different buffer sizes) for performance measurements. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MikhailErofeev/spark 
feature/shs-buffer-upstream

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19978.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19978


commit db1dc533a17564d531e84ad4f41ae9152d8619a5
Author: m.erofeev <m.erof...@criteo.com>
Date:   2017-12-14T10:46:59Z

[SPARK-22784][CORE] Configure reading buffer size in Spark History Server

Increasing the buffer size in ReplayListenerBus speeds up reading when
event strings are long.
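
(For intuition: the benchmark annotations above put the average event line
around 1.5MB, i.e. 30M is about 20x the average line. Read through the
default 2048-byte buffer, such a line needs on the order of 750 buffer
refills, while a 30M buffer handles most lines in a single fill.)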




---
