I think I'm starting to see what you're saying, so let me restate it my way to 
be sure:

Code that deals with various loaded blobs must be resilient against them 
getting larger than a 32-bit size can handle, and making it so is the subject 
of ongoing development effort.  Such code may or may not use a Vector.  It 
definitely must use size_t for all of its sizes.  Vector using unsigned in its 
API then causes the following problems:

If the code uses a Vector for the large data: in addition to not actually 
being able to hold all of the data, there will be a risk of truncation from 
size_t to unsigned on the client side of the Vector, which is much harder to 
prevent than a truncation inside Vector itself.  For such code the ideal would 
be a Vector that measures itself using size_t, both in its internal data 
structures and in its API.
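
To make this concrete, here is a minimal sketch (not WebKit code; it assumes a 
64-bit platform where size_t is 64 bits, and the numbers are purely 
illustrative) of the size getting narrowed on the client side before Vector 
ever sees it:

    #include <cstddef>
    #include <cstdio>

    int main()
    {
        size_t blobSize = 5ull * 1024 * 1024 * 1024;           // a 5GB loaded blob
        unsigned vectorSize = static_cast<unsigned>(blobSize); // client-side truncation
        std::printf("%zu -> %u\n", blobSize, vectorSize);      // 5368709120 -> 1073741824
        return 0;
    }

No check inside Vector can catch this, because the damage is done before the 
value ever reaches Vector.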

If the code does not use Vector for that large data but uses it nonetheless 
for other things: nothing will obviously go wrong, but you end up in sketchy 
territory.  You will have code that is juggling unsigneds for the Vector and 
size_ts for the large data, and any arithmetic between the two can easily 
become suspect.  Consider a case where you use a size or offset within the 
large data in some arithmetic that flows into Vector.  If Vector's indexing 
operations use unsigned, then you will have truncation.
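
For example (again just a sketch, assuming 64-bit size_t and made-up numbers), 
an offset computed within the large data wraps as soon as it crosses an 
unsigned indexing boundary:

    #include <cstddef>
    #include <cstdio>

    int main()
    {
        // A byte offset within the large data, correctly held in size_t.
        size_t byteOffset = 4ull * 1024 * 1024 * 1024 + 17;  // just past 4GB

        // If this flows into an indexing operation that takes unsigned, it
        // wraps, and we silently touch the wrong element instead of failing.
        unsigned truncatedIndex = static_cast<unsigned>(byteOffset);
        std::printf("%zu -> %u\n", byteOffset, truncatedIndex); // 4294967313 -> 17
        return 0;
    }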

These problems fall into two categories:

1) Actually not being able to store the large data but failing gracefully when 
it happens.

2) Truncation leading to some kind of undesirable behavior.

Someone mentioned something about warnings that can tell us about (2).  
Apparently we cannot turn the warning on right now.  The hope is that if we 
convert everything to unsigned, then we won't have those warnings in trunk, 
and then we can turn the warning on and catch such bugs.  This seems very 
aspirational.

Also, it means that everyone hacking on code that mixes "sizes of files" with 
"sizes of vectors" will have to scratch their heads about which type to use, 
and then if they do it wrong they will get some warnings.  Of course, the 
warnings can easily be silenced with a cast.  That's super dangerous.  In my 
experience, if the easiest way to make the compiler STFU is by doing the wrong 
thing (static_cast, in this case) then people will do the wrong thing quite 
frequently.
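
A sketch of what that looks like in practice (the function names here are made 
up, not real WebKit APIs):

    #include <cstddef>

    void setVectorSize(unsigned newSize); // stand-in for an unsigned-based Vector API

    void resizeForBlob(size_t blobSize)
    {
        // With the truncation warning enabled, passing blobSize directly
        // would be flagged.  The path of least resistance is a cast: the
        // warning goes away, but the truncation does not, and the bug is now
        // invisible to both the compiler and future readers.
        setVectorSize(static_cast<unsigned>(blobSize));
    }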

(A particularly funny example of this is checked exceptions in Java.  If you 
call a method that is declared to throw an exception, then the caller must 
either catch it or declare that it throws the exception, too.  The easiest way 
to silence the error is to catch the exception with an empty catch block that 
ignores it, since that requires the fewest keystrokes.  Someone once did a 
study of large Java codebases and found such empty catch blocks to be a 
chronic problem: people did it a lot, and it was responsible for bad bugs.  
The lesson: a type error is only a good thing if the path of least resistance 
to making the compiler STFU involves actually improving the code.)

An alternate approach is to agree to use size_t for sizes.  This is an easy 
rule to explain, and it should be easy for reviewers to catch violations.  It 
might also make that compiler warning more meaningful: in a world where we are 
mixing size_t and unsigned we will *have* to have some casts between the two, 
but in a world where we use size_t uniformly, any such cast is an immediate 
red flag.
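
In other words (another sketch with made-up names), the size_t convention 
makes the dangerous spots stand out:

    #include <cstddef>

    void setVectorSizeUnsigned(unsigned newSize); // unsigned-based API: forces casts
    void setVectorSizeSizeT(size_t newSize);      // size_t-based API: no casts needed

    void example(size_t blobSize)
    {
        setVectorSizeUnsigned(static_cast<unsigned>(blobSize)); // cast required; is it safe?
        setVectorSizeSizeT(blobSize);                           // no cast; any cast here is a red flag
    }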

So, I think I'm starting to agree with the size_t approach.

I wonder what the downsides are to this approach.  Footprint of Vector?

-Filip


> On Nov 20, 2014, at 9:26 AM, Alexey Proskuryakov <a...@webkit.org> wrote:
> 
> 
> On Nov 19, 2014, at 14:58, Alexey Proskuryakov <a...@webkit.org> wrote:
> 
>> These and related uses are all over the place - see also Vectors in 
>> FormDataBuilder, data returned from FrameLoader::loadResourceSynchronously, 
>> plug-in code that loads from network, SharedBuffer etc.
> 
> Another way to say this is: we need to deal with large data arrays throughout 
> loading code. We do not really need full-size vectors in most other code - 
> it's sort of OK for the HTML parser or the image decoder to fail 
> catastrophically when there is too much data fed to it.
> 
> This is somewhat questionable design, but if we are going to stick to it, 
> then magnitude checks should be centralized, not sprinkled throughout the 
> code. We should not make this check explicitly when feeding a network 
> resource to the parser, for example.
> 
> A 64-bit API for Vector solves this nearly flawlessly. We do not perform 
> the checks manually every time we use a Vector; Vector does them for us.
> 
> Other options are:
> 
> - uint64_t everywhere. This way, we'll solve practical problems with large 
> resources once and for all. Also, this may prove to be necessary to handle 
> even YouTube/Google Drive uploads; I do not know that yet.
> 
> - size_t everywhere. Same effect on 64-bit platforms, while 32-bit ones will 
> still be limited. I like this option, because it won't make us pay the memory 
> and performance cost on old crippled 32-bit machines, which are unlikely to 
> be used for manipulating huge volumes of data anyway.
> 
> We'll also need to develop some magic for 53-bit JS bindings, which I'm not 
> sure about.
> 
> - Alexey
> 
