Hi Clemens,

Thanks for those really helpful pointers.

I have continued digging into this for the last week or so, and have put my 
current code up on GitHub for anyone interested:
https://github.com/tangobravo/webassembly-halfsample-benchmark

Results are that the bit-twiddling approaches do offer a pretty decent 
speed-up on mobile platforms (and 64-bit desktop), so this is a promising 
route :)

On Android it seems the version of Chrome distributed through Google Play 
is still the 32-bit one, even on 64-bit devices (tested on a Google Pixel 2 
- chrome://version states 32-bit).

Despite that, there is still a speedup using the packed implementations. 
The fastest on the Pixel 2 is the half_sample_uint32x2_blocks 
implementation, which gives something like a 2.4x speedup (2.04ms to 0.84ms 
for half-sampling a 720p image).

Safari in iOS 12.4 is 64-bit, and there the half_sample_uint64_blocks 
implementation is fastest and gives more than a 3x speedup on an iPod Touch 
7 (1.0ms to 0.3ms for 720p input).
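
As a rough illustration of the packed ("SWAR") idea, here is a minimal C 
sketch. It is not the code from the repository above: the names 
half_sample_scalar and half_sample_swar64 are made up for this example, and 
it assumes 8-bit greyscale pixels, a width that is a multiple of 8, an even 
height, truncating averages, and little-endian loads (which holds for 
WebAssembly).

```c
#include <stdint.h>
#include <string.h>

/* Scalar reference: each output pixel is the truncated mean of a 2x2 block. */
static void half_sample_scalar(const uint8_t *src, int w, int h, uint8_t *dst) {
    for (int y = 0; y < h / 2; y++) {
        const uint8_t *r0 = src + (2 * y) * w;
        const uint8_t *r1 = r0 + w;
        for (int x = 0; x < w / 2; x++) {
            int sum = r0[2 * x] + r0[2 * x + 1] + r1[2 * x] + r1[2 * x + 1];
            dst[y * (w / 2) + x] = (uint8_t)(sum >> 2);
        }
    }
}

/* SWAR sketch: handles 8 input pixels from each of two rows per step and
 * produces 4 output pixels. Assumes w % 8 == 0 and little-endian loads. */
static void half_sample_swar64(const uint8_t *src, int w, int h, uint8_t *dst) {
    const uint64_t LANE_MASK = 0x00FF00FF00FF00FFull; /* low byte of each 16-bit lane */
    for (int y = 0; y < h / 2; y++) {
        const uint8_t *r0 = src + (2 * y) * w;
        const uint8_t *r1 = r0 + w;
        uint8_t *out = dst + y * (w / 2);
        for (int x = 0; x < w; x += 8) {
            uint64_t a, b;
            memcpy(&a, r0 + x, 8); /* unaligned-safe 64-bit loads */
            memcpy(&b, r1 + x, 8);
            /* Widen even and odd pixels into 16-bit lanes and add all four
             * contributions; each lane sum is at most 4 * 255 = 1020, so no
             * carry can cross into the neighbouring lane. */
            uint64_t sum = (a & LANE_MASK) + ((a >> 8) & LANE_MASK)
                         + (b & LANE_MASK) + ((b >> 8) & LANE_MASK);
            /* Divide every lane by 4; the mask drops bits shifted in from
             * the lane above. */
            uint64_t avg = (sum >> 2) & LANE_MASK;
            /* Compact the four 16-bit lanes into four output bytes. */
            out[0] = (uint8_t)avg;
            out[1] = (uint8_t)(avg >> 16);
            out[2] = (uint8_t)(avg >> 32);
            out[3] = (uint8_t)(avg >> 48);
            out += 4;
        }
    }
}
```

The point of the sketch is only that one 64-bit add does the work of four 
16-bit adds, which is why the benefit depends on the engine emitting real 
64-bit instructions.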

Each benchmark run does 10 iterations with different but overlapping input 
data (so both input and output are likely to be at least in L2 cache, which 
is the case I expect in practice). The timings printed by the test code are 
the total for all 10 iterations, so the per-image numbers above are typical 
printed totals from multiple runs divided by 10. Safari only offers 1ms 
resolution on performance.now(), so accurate measurements are harder to get 
there, but the numbers above look pretty consistent over multiple runs.
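
One common way to work around a coarse timer is to keep increasing the 
iteration count until the total elapsed time is well above the timer 
resolution, and only then divide. The following C sketch shows that idea; it 
is not the timing code from the repository. The names bench_fn, 
example_workload and time_per_iteration_ms are hypothetical, and clock() is 
just a stand-in for whatever timer is actually available.

```c
#include <time.h>

typedef void (*bench_fn)(void *ctx);

/* Trivial stand-in workload: bumps a counter through a volatile pointer so
 * the calls cannot be optimised away. */
static void example_workload(void *ctx) {
    volatile long *calls = (volatile long *)ctx;
    (*calls)++;
}

/* Doubles the iteration count until one timed batch takes at least
 * min_total_ms, then returns the mean time per iteration in milliseconds.
 * The cap on iters is a safety net in case the timer never advances. */
static double time_per_iteration_ms(bench_fn fn, void *ctx, double min_total_ms) {
    long iters = 10;
    for (;;) {
        clock_t t0 = clock();
        for (long i = 0; i < iters; i++) {
            fn(ctx);
        }
        double elapsed = (double)(clock() - t0) * 1000.0 / CLOCKS_PER_SEC;
        if (elapsed >= min_total_ms || iters >= (1L << 30)) {
            return elapsed / (double)iters;
        }
        iters *= 2;
    }
}
```

With a 1ms timer, a min_total_ms of 100 or more would keep the quantisation 
error to around a percent; that figure is a rule of thumb, not a measured 
value.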

Out of interest I've also tried to write an implementation targeting 
WebAssembly SIMD. I was able to get it to compile with emcc from 
latest-upstream but it doesn't run in my self-built d8 7.7 with the 
--experimental-wasm-simd flag. More details in the README of the repo 
linked above. I'd appreciate any help to get that one running for a further 
comparison datapoint.

So the questions that arise:

1) Is there any way for a page to detect if the browser is 32-bit or 
64-bit? navigator.platform reports "Linux armv8l" on the Pixel 2, so that 
doesn't help unfortunately. I have 32-bit and 64-bit "busy loops" that 
report approximately equal counts on 64-bit platforms (and not on 32-bit), 
but it would be nice if there were a more direct way to determine this!
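
A minimal sketch of what such a pair of busy loops could look like in C 
(compiled to wasm) is below. It is an illustration only, not the loops used 
for the measurements above: the xorshift loop bodies, the function names and 
the idea of comparing a ratio are all assumptions, and any threshold on the 
ratio would need calibrating on real devices.

```c
#include <stdint.h>
#include <time.h>

#define STEPS_PER_CHECK 100000

static double ms_since(clock_t t0) {
    return (double)(clock() - t0) * 1000.0 / CLOCKS_PER_SEC;
}

/* Counts how many 32-bit xorshift steps complete within budget_ms. */
static uint64_t count_u32_steps(double budget_ms) {
    volatile uint32_t sink;
    uint32_t x = 2463534242u;
    uint64_t steps = 0;
    clock_t t0 = clock();
    do {
        for (int i = 0; i < STEPS_PER_CHECK; i++) {
            x ^= x << 13; x ^= x >> 17; x ^= x << 5;
        }
        steps += STEPS_PER_CHECK;
    } while (ms_since(t0) < budget_ms);
    sink = x; /* keep the result live so the loop is not removed */
    (void)sink;
    return steps;
}

/* Same loop shape on a 64-bit state; an engine limited to 32-bit registers
 * needs several instructions for each 64-bit shift and xor. */
static uint64_t count_u64_steps(double budget_ms) {
    volatile uint64_t sink;
    uint64_t x = 88172645463325252ull;
    uint64_t steps = 0;
    clock_t t0 = clock();
    do {
        for (int i = 0; i < STEPS_PER_CHECK; i++) {
            x ^= x << 13; x ^= x >> 7; x ^= x << 17;
        }
        steps += STEPS_PER_CHECK;
    } while (ms_since(t0) < budget_ms);
    sink = x;
    (void)sink;
    return steps;
}

/* Near 1.0 suggests native 64-bit ops; well below 1.0 suggests emulation. */
static double u64_to_u32_ratio(uint64_t n32, uint64_t n64) {
    return (double)n64 / (double)n32;
}
```

This remains a heuristic: thermal throttling or a tier-up happening in the 
middle of one loop could skew the ratio.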

2) Are there plans to transition Play Store Chrome releases to 64-bit for 
64-bit Android? I did some searching but couldn't find any official 
information about the reasons for sticking with 32-bit there (though I 
assume memory usage, apk size, or both).

3) Any hints on how I can get the SIMD version to run, or is this likely 
just a bug / spec instability between emcc and d8?

Cheers,

Simon

On Tuesday, 3 September 2019 10:09:28 UTC+1, Clemens Hammacher wrote:
>
> Hi Simon,
>
> that's an interesting project, please keep us updated :)
>
>> 1) Does v8 codegen emit 64-bit machine instructions for 64-bit wasm 
>> instructions on 64-bit architectures (specifically Android)? I imagine the 
>> speedup from SWAR techniques will be significantly reduced using just 
>> 32-bit registers, perhaps to the level that doesn't make this worthwhile.
>>
>
> Yes, of course we use 64-bit instructions for 64-bit operations (if the 
> binary was compiled for a 64-bit platform).
>
>> 2) Does v8's codegen currently do any vectorization? If not, are there 
>> plans to add it? In this case the plain C version might be best to stick 
>> with as it would be easier for auto-vectorization to detect and optimize.
>>
>
> No, we do not do any auto-vectorization, and I am not aware of any plan to 
> add this.
>
>> 3) Can anyone provide tips / links to help with investigating and 
>> optimizing this kind of thing? Any way of flagging wasm functions for 
>> maximum optimization for benchmarking purposes?
>>
>
> For benchmarking, you should disable the baseline compiler (Liftoff). 
> Chrome has a feature flag for that 
> (chrome://flags/#enable-webassembly-baseline); if you experiment with d8 
> directly, add "--no-liftoff --no-wasm-tier-up". If you want to inspect 
> the generated machine code, you can add the "--print-wasm-code" command 
> line flag. This requires a build with the "v8_enable_disassembler" gn flag 
> enabled, which is the default for debug builds. Since compilation happens 
> concurrently, you need to add "--predictable" (or "--single-threaded" or 
> "--wasm-num-compilation-tasks=0") to get readable output.
>
> Hope that helps,
> Clemens
>
> -- 
>
> Clemens Hammacher
>
> Software Engineer
>

-- 
-- 
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev
To view this discussion on the web visit 
https://groups.google.com/d/msgid/v8-dev/09853b85-6d5d-4726-ad90-c11d566396d1%40googlegroups.com.
