Re: [v8-dev] Utility to check if a given stream can parse as Javascript (ORB)

'Daniel Vogelheim' via v8-dev Fri, 03 Jun 2022 08:37:20 -0700

On Fri, Jun 3, 2022 at 10:54 AM Leszek Swirski <[email protected]> wrote:


> Ok, if allowlisting vs blocklisting is the heart of the issue, I can
> accept that this is a design requirement.
>
> So, re: parse vs. scan -- I'm not sure this is a sufficient
> simplification. In particular, if memory serves, our parse cost is roughly
> 50% scanner and 50% token interpretation + AST building, so you'll get at
> best a ~2x speedup over a full parse (or over a pre-parse? I don't remember
> the exact breakdown). Particularly there's a cost to identifying keywords
> vs identifiers, but we could probably drop that and ignore keywords.
> Parsing strings and regexp has some cost, but you could maybe make them
> cheaper with stronger approximations (race to closing quotes, that sort of
> thing). Then, I wouldn't check if the token combination is a definitely
> valid one, just whether the tokenizer failed at all + some simple
> token-based heuristics (like brace matching, simple patterns). Tokenizer
> failure would most likely catch almost all binary formats; non-binary
> formats are likely too JS-compatible (some raw JSON, a lot of YAML, and
> I think all CSV, are valid JS) and would still need to rely on a more
> blocklist-like approach with said token heuristics.
>

Thank you. This is a very helpful response.

My main idea for simplification was to reduce the amount of data scanned,
say to just the first 3 tokens or so. We can reduce cost either by making
the operation cheaper (parsing > pre-parsing > scanning only) or by
reducing the input size (whole file > 1kB prefix > just a few bytes). Or
both.

Scanning + brace matching sounds very enticing. That's very doable, and
would indeed filter out pretty much any binary format, and nearly all
"parser breakers".

It'd also be much better than parsing in terms of code complexity. The V8
scanner is one medium-size file + headers, and only loosely coupled to the
rest of the engine. (Mainly the input stream and the AstValueFactory.) The
parser is a good bit larger and tied to much more infrastructure (the whole
AST).
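
The "scanning + brace matching" idea can be sketched roughly as follows. This is a Python illustration under stated assumptions, not V8's scanner: the function name `smells_like_js`, the 1024-byte default budget, and the Latin-1 decoding trick are all made up for the example. It is also not regexp-aware, so a regexp literal containing a quote or bracket could make it reject valid JS; a real version would have to handle that, since valid JS must always pass.

```python
# Hypothetical sketch: approximate scan of a bounded prefix plus brace matching.
OPENERS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(OPENERS.values())
JS_WHITESPACE = "\t\n\v\f\r "


class ScanError(Exception):
    """Raised when the approximate scanner hits something a tokenizer would reject."""


def _skip_string(text, i, truncated):
    """Race to the closing quote starting at text[i]; return the index after it."""
    quote = text[i]
    i += 1
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # Skip the escaped character; a backslash + CRLF continuation is 3 chars.
            i += 3 if text[i + 1:i + 3] == "\r\n" else 2
        elif ch == quote:
            return i + 1
        elif ch in "\r\n" and quote != "`":
            raise ScanError("newline inside a quoted string")
        else:
            i += 1
    if not truncated:
        raise ScanError("unterminated string")
    return len(text)  # the string may simply continue past the prefix


def smells_like_js(data: bytes, limit: int = 1024) -> bool:
    """Return False only if the prefix shows evidence of not being JS."""
    truncated = len(data) > limit
    # Latin-1 maps every byte to one character, so decoding cannot fail and
    # bytes >= 0x80 never look like quotes or brackets.
    text = data[:limit].decode("latin-1")
    expected = []  # stack of closing brackets we are still waiting for
    i, n = 0, len(text)
    try:
        while i < n:
            ch = text[i]
            if ch in "\"'`":
                i = _skip_string(text, i, truncated)
            elif text.startswith("//", i):
                newline = text.find("\n", i)
                i = n if newline < 0 else newline
            elif text.startswith("/*", i):
                end = text.find("*/", i + 2)
                if end < 0 and not truncated:
                    raise ScanError("unterminated block comment")
                i = n if end < 0 else end + 2
            elif (ch < " " and ch not in JS_WHITESPACE) or ch == "\x7f":
                raise ScanError("control character outside string/comment")
            elif ch in OPENERS:
                expected.append(OPENERS[ch])
                i += 1
            elif ch in CLOSERS:
                if not expected or expected.pop() != ch:
                    raise ScanError("mismatched closing bracket")
                i += 1
            else:
                i += 1  # identifiers, numbers, operators: not inspected here
    except ScanError:
        return False
    # Brackets left open are only acceptable if we stopped at the prefix limit.
    return truncated or not expected
```

For example, `smells_like_js(b"while (1);")` passes, a body with stray control bytes such as `b"%PDF-1.4\n\x00\x01"` fails, and a CSV body like `b"a,b,c\n1,2,3\n"` still passes, which matches the observation elsewhere in the thread that CSV needs separate handling.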

I wonder what simple heuristics one could have. I think most operators
can't be the first token. Or be followed by another operator. Or an
identifier can't be followed by another identifier. Would be good to
validate that, though. I think, a while ago, Nikos had a script to extract
a cover grammar from the TC39 spec. Maybe that can be hacked up to extract
simple, impossible sequences or sets of relevant token classes.
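
One rough way to validate candidate rules like these is to run them against snippets that are known to be valid JS and see which rules fire. The Python sketch below assumes a deliberately crude tokenizer (keywords are folded into identifiers, the operator list is incomplete, and regexp and template literals are ignored); the rule names and the tiny corpus are hypothetical and only illustrate the method.

```python
import re

# Incomplete operator list, longest first so "===" wins over "==" and "=".
OPERATORS = ["===", "!==", "==", "!=", "<=", ">=", "&&", "||", "++", "--",
             "=>", "+", "-", "*", "/", "%", "=", "<", ">", "!", "~", "?", ":"]
_OP_ALTERNATIVES = "|".join(
    re.escape(op) for op in sorted(OPERATORS, key=len, reverse=True))

TOKEN_RE = re.compile("|".join([
    r"(?P<ws>\s+)",
    r"(?P<ident>[A-Za-z_$][A-Za-z0-9_$]*)",   # keywords land here too
    r"(?P<number>\d[\w.]*)",
    r"""(?P<string>"(?:\\.|[^"\\\n])*"|'(?:\\.|[^'\\\n])*')""",
    "(?P<op>" + _OP_ALTERNATIVES + ")",
    r"(?P<other>.)",                          # brackets, commas, semicolons, ...
]), re.DOTALL)


def token_classes(src):
    """Return the sequence of token class names, whitespace dropped."""
    return [m.lastgroup for m in TOKEN_RE.finditer(src) if m.lastgroup != "ws"]


def _has_pair(classes, first, second):
    return (first, second) in zip(classes, classes[1:])


# Candidate "this cannot be JS" rules, taken literally from the text above.
RULES = {
    "operator_first": lambda c: bool(c) and c[0] == "op",
    "operator_operator": lambda c: _has_pair(c, "op", "op"),
    "identifier_identifier": lambda c: _has_pair(c, "ident", "ident"),
}


def counterexamples(valid_snippets):
    """Map each rule name to the valid snippets it would wrongly reject."""
    found = {name: [] for name in RULES}
    for src in valid_snippets:
        classes = token_classes(src)
        for name, rule in RULES.items():
            if rule(classes):
                found[name].append(src)
    return found


# Hand-picked snippets that are valid JS programs.
CORPUS = ["f(a, b);", "-x;", "a = -b;", "let x = 1;"]

if __name__ == "__main__":
    for name, hits in counterexamples(CORPUS).items():
        print(name, hits)
```

On this small corpus each rule has one counterexample (`-x;`, `a = -b;` and `let x = 1;` respectively), which suggests the rules would need refining, for instance by treating unary prefix operators and keyword-like identifiers separately, before they could be used to block anything.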


> Getting a TC39 approved version of this... well, any spec work is hard.
> +Shu-yu Guo <[email protected]>.
>

Very true, unfortunately. :)


> On Thu, Jun 2, 2022 at 5:36 PM 'Daniel Vogelheim' via v8-dev <
> [email protected]> wrote:
>
>> On Thursday, June 2, 2022 at 9:46:15 AM UTC+2 [email protected] wrote:
>>
>>> Can we not detect these via some magic number sniffing? I'm
>>> fundamentally concerned about an allowlist approach for JS over a blocklist
>>> approach for non-JS.
>>>
>>
>> This is pretty much the heart of the issue: The entire point of the CORB to
>> ORB transition is to go from "blocklist" to "allowlist", based on the
>> observation that blocklists ultimately never seem to work. In particular,
>> we don't want to pass things by default, where anything we don't know
>> automatically passes. That does lead us to an allowlist, in some form.
>> Elsewhere, I summarized (my understanding of) the ORB security requirements
>> as this: For "no-cors" requests, we want to have some positive evidence
>> that the data we're receiving is in a format suitable for the request type.
>>
>> Being able to drop unknown stuff by default is really the core benefit of
>> ORB.
>>
>> I do think we have quite a bit of leeway to decide what form of "positive
>> evidence" we'll accept. The current draft specifies a full JS parse, which
>> I think is way over the top. But I do think we need *something* that
>> tells us with some probability whether a given byte sequence looks like JS
>> or not. The only hard criterion is that actually valid JS should pass,
>> because otherwise we'll break websites left and right. (To that end, "while
>> (1);" was arguably a terrible example.) (Caveat: Those are my opinions.
>> Other browsers might have stronger opinions.)
>>
>>
>> IMHO, checking for "parser breakers", the way CORB does, is a convenient
>> temporary solution, because we already know it's web compatible.
>>
>> IMHO, a full parse (in the network process, or triggered by the network
>> process) is crazy, and I'd really like to have something more lightweight.
>>
>> Which leads me to the proposal to only use the scanner to look for a few
>> tokens. And ideally for TC39 to adopt some sort of SmellsLikeJavaScript
>> abstract operation that other standards could point to.
>>
>>
>>
>>>
>>> Note that CSV is sadly valid JS, so that won't be blocked at all.
>>>
>>> On Wed, Jun 1, 2022 at 6:45 PM 'Łukasz Anforowicz' via v8-dev <
>>> [email protected]> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Jun 1, 2022 at 8:34 AM Leszek Swirski <[email protected]>
>>>> wrote:
>>>>
>>>>> On Wed, Jun 1, 2022 at 5:17 PM 'Łukasz Anforowicz' via v8-dev <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Benefit of full JS parse over a list of known non-JS prefixes:
>>>>>> Stricter is-it-JS checking = more non-JS things get blocked = improved
>>>>>> security. Still, there is a balance here - some heuristics (like the ones
>>>>>> proposed by Daniel) are almost as secure as full JS parse (while being
>>>>>> easier to implement and having less of a performance impact).
>>>>>>
>>>>>
>>>>> Makes sense, I'm just asking to make sure that we strike the right
>>>>> balance between security improvements and complexity/performance issues;
>>>>> even a JS tokenizer without a full parser is quite a complexity investment
>>>>> (it needs e.g. a full regexp parser), plus the language grammar is
>>>>> sufficiently broad that I expect exhaustively enumerating all possible
>>>>> combinations of even just 3-5 tokens to be prohibitively large (setting
>>>>> aside maintainability in the face of ever-updating standards).
>>>>>
>>>>> Do we have a measure of how much non-JS coverage the current
>>>>> heuristics give, on real-world examples of JSON files? Or perhaps, a
>>>>> measure of how many different prefixes there are that we could blocklist?
>>>>> Do we know at what point the improved security has diminishing returns?
>>>>>
>>>>
>>>> Examples of response bodies that we would want to block, but that
>>>> wouldn't get blocked without full JS parsing/verification (assume that the
>>>> responses below are served as text/html or application/octet-stream):
>>>>
>>>>    - PDF
>>>>    - ProtoBuf
>>>>    - Microsoft Word
>>>>    - CSV files
>>>>
>>>>
>>>>> - Leszek
>>>>>
>>>>> --
>>>>> --
>>>>> v8-dev mailing list
>>>>> [email protected]
>>>>> http://groups.google.com/group/v8-dev
>>>>> ---
>>>>> You received this message because you are subscribed to a topic in the
>>>>> Google Groups "v8-dev" group.
>>>>> To unsubscribe from this topic, visit
>>>>> https://groups.google.com/d/topic/v8-dev/NGGCw9OjatI/unsubscribe.
>>>>> To unsubscribe from this group and all its topics, send an email to
>>>>> [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/v8-dev/CAGRskv9UUNJ9sjW0FvuHyCN90j%3DfbafSOgGVBG19qRe19_%2BO5w%40mail.gmail.com
>>>>> <https://groups.google.com/d/msgid/v8-dev/CAGRskv9UUNJ9sjW0FvuHyCN90j%3DfbafSOgGVBG19qRe19_%2BO5w%40mail.gmail.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks,
>>>>
>>>> Lukasz
>>>>

-- 
-- 
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev
--- 
You received this message because you are subscribed to the Google Groups 
"v8-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/v8-dev/CALG6KPN%3DDfaMxPtPXbK5ON7a-cXaO8LOJ0aVMDt5o-vpnRQtWw%40mail.gmail.com.
