Re: [v8-dev] Utility to check if a given stream can parse as Javascript (ORB)

'Daniel Vogelheim' via v8-dev Tue, 31 May 2022 05:45:19 -0700

Hi all,

Apologies for reviving this thread, but this problem is coming up again. I 
think the answer of parsing in a separate process would work, but I'd 
really like to find a simpler solution. As far as I can see, the underlying 
security requirements should be much less strict than the current ORB 
proposal implies. An approximation should do just fine. For example, for 
media formats we just look for a "magic number" (e.g. a 3-byte constant for 
JPEG files); so I don't think we need a full parse of the input.
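
For comparison, the media-format check is tiny. A minimal sketch, assuming 
the well-known JPEG prefix FF D8 FF; the function name is made up for 
illustration and this is not the sniffer Chromium actually ships:

```python
# JPEG streams begin with the SOI marker (FF D8) followed by another
# marker byte (FF), so a 3-byte prefix comparison is enough here.
JPEG_MAGIC = bytes([0xFF, 0xD8, 0xFF])


def looks_like_jpeg(prefix: bytes) -> bool:
    """Return True if the first bytes of a response match the JPEG magic number."""
    return prefix.startswith(JPEG_MAGIC)
```

Nothing past the third byte is inspected, which is the level of strictness 
being argued for here.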


Here is how I'd like to simplify this:
- Run only the JS scanner. (Including charset + comment processing.)
- Take the first N tokens. I suspect N=3 would be enough.
- Check the token list against a set of permissible token sequences.

Even for small N a complete list of permissible sequences might be rather 
large. It might be worth approximating it.
In either case, this method easily distinguishes valid JS from pretty much 
everything Lukasz' earlier mail requires us to block (except "while(1);", 
which needs N>=5). It does leave some ambiguity towards JSON, but IMHO 
that's tolerable.
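
The three steps above could be sketched roughly as follows. This is a toy, 
not V8's scanner: it only knows ASCII identifiers, simple decimal numbers, 
quoted strings and single-character punctuators (no regexp literals, 
templates or charset handling). The PERMISSIBLE set is a hand-picked, 
hypothetical sample and not a list derived from the JS grammar; all names 
are invented for illustration.

```python
import re
from typing import List, Optional

# Toy scanner: whitespace/comments are skipped, everything else becomes a
# coarse token kind. Keywords are deliberately lumped in with identifiers.
_TOKEN = re.compile(r"""
    (?P<skip>\s+|//[^\n]*|/\*.*?\*/)        # whitespace and comments
  | (?P<ident>[A-Za-z_$][A-Za-z0-9_$]*)     # identifiers and keywords
  | (?P<num>\d+(?:\.\d+)?)                  # simple decimal literals
  | (?P<str>"(?:[^"\\\n]|\\.)*"|'(?:[^'\\\n]|\\.)*')
  | (?P<punct>[{}()\[\];:,.=+\-*/%!~<>&|^?])
""", re.VERBOSE | re.DOTALL)


def first_tokens(source: str, n: int) -> Optional[List[str]]:
    """Return up to n token kinds, or None if the toy scanner gets stuck."""
    tokens: List[str] = []
    pos = 0
    while pos < len(source) and len(tokens) < n:
        m = _TOKEN.match(source, pos)
        if m is None:
            return None  # input the toy scanner cannot tokenize
        pos = m.end()
        if m.lastgroup == "skip":
            continue
        # Punctuators are kept verbatim; other tokens are reduced to their kind.
        tokens.append(m.group() if m.lastgroup == "punct" else m.lastgroup)
    return tokens


# Hypothetical sample of permissible 3-token prefixes (far from complete).
PERMISSIBLE = {
    ("ident", "ident", "="),   # var x = ...
    ("ident", "=", "num"),     # x = 1 ...
    ("ident", "(", "ident"),   # f(x ...
    ("ident", "(", ")"),       # f() ...
    ("ident", ".", "ident"),   # a.b ...
    ("(", "ident", "("),       # (function( ...
    ("{", "ident", ":"),       # block statement starting with a label
    ("str", ";", "ident"),     # "use strict"; var ...
}


def sniffs_as_js(source: str, n: int = 3) -> bool:
    """Approximate check: do the first n tokens form a permissible prefix?"""
    tokens = first_tokens(source, n)
    # Inputs with fewer than n tokens never match in this simplified version.
    return tokens is not None and tuple(tokens) in PERMISSIBLE
```

With this sample set, "var x = 1;" and "{foo: 1}" pass, while '{"foo": 1}', 
"%PDF-1.7" and "<!DOCTYPE html>" are rejected after three tokens.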

Would this make sense from a V8 perspective?

Is it possible to generate a list of possible token sequences from the JS 
grammar, or would one have to do that manually? (For, say, N=3)

The question of standardization has also come up. Could TC39 maybe be 
convinced to adopt such a JavaScript sniffer, since it's fundamentally an 
operation on JS syntax? (That would hopefully prevent the sniffer and the 
actual syntax from getting out of sync as JS evolves.)

Any thoughts?

Daniel

On Wednesday, September 1, 2021 at 5:46:25 PM UTC+2 [email protected] 
wrote:

> Wait, no, we do handle running out of stack in a robust way and the "does 
> this parse" should just return false then (even though the code might be 
> valid JS). Please ignore that part of my comment :)
>
> On Wed, 1 Sep 2021, 16:38 Marja Hölttä, <[email protected]> wrote:
>
>> A random side note: it's also possible to make V8's recursive descent 
>> parser run out of stack using valid JS, e.g., let a = [[[[[..[ 0 ]]]]]..] 
>> or other similar constructs (deep enough). Meaning you probably don't want 
>> to call into the parser in a process where you don't want this to happen.
>>
>> Re: encodings, when I worked on script streaming I noticed it's pretty 
>> common that scripts advertised as UTF-8 are not valid UTF-8 (e.g., have 
>> invalid chars inside comments), and Chrome is currently pretty lenient 
>> about those.
>>
>>
>> On Wed, Aug 18, 2021 at 3:18 PM Toon Verwaest <[email protected]> 
>> wrote:
>>
>>>
>>>
>>> On Wed, Aug 18, 2021 at 2:29 AM 'Łukasz Anforowicz' via v8-dev <
>>> [email protected]> wrote:
>>>
>>>>
>>>>
>>>> On Tue, Aug 17, 2021 at 6:59 AM Toon Verwaest <[email protected]> 
>>>> wrote:
>>>>
>>>>> Thinking out loud: One idea could be to have a separate sandboxed 
>>>>> compiler process in which we compile incoming JS code. That could reject 
>>>>> the source if it doesn't compile; or compile it to a script that just 
>>>>> throws with no additional info about the actual source.
>>>>>
>>>>> That process could implement streaming compilation; so we don't block 
>>>>> streaming until later, we don't double parse, we still have a sandbox 
>>>>> (not in the network process). There might even be benefits for caching 
>>>>> as a compromised renderer cannot look at the compilation artefacts until 
>>>>> it receives them.
>>>>>
>>>>> If we fully compile and create a code cache from the compilation 
>>>>> result we don't need a new API on the V8 side, but do additional 
>>>>> serialization/deserialization work. That should be faster than reparsing 
>>>>> though. The upper limit of the cost would essentially be the cost of 
>>>>> serializing / deserializing a code cache for each script.
>>>>>
>>>>
>>>> This seems like an interesting idea.  I wonder if compilation (no 
>>>> evaluation / running of scripts) would be considered safe enough to handle 
>>>> in a single (not origin/site-bound/locked) process.
>>>>
>>>
>>> The parser/compiler aren't tiny, so it's not unlikely there's a bug. 
>>> Such bugs are certainly much harder to control than full-blown JS OOB 
>>> access though. I could imagine a security bug replacing scripts in another 
>>> site (assuming it's sandboxed so well that it can't do much else), which 
>>> would be terrible; and it's unclear to me how easy that would be.
>>>  
>>>
>>>>
>>>> One thing that I don't fully understand (for both full-JS-parsing and 
>>>> partial/hackish-non-JS-detection approaches) is if the encoding (e.g. UTF8 
>>>> vs UTF16-LE vs Win-1250) has to be known and communicated upfront to the 
>>>> parser/sniffer?  Or maybe the input to the decoder needs to be already in 
>>>> UTF8?  Or maybe something in //net or //network layers can already handle 
>>>> this aspect of the problem (e.g. ensuring UTF8 in URLLoader::DidRead)?
>>>>
>>>
>>> There's some encoding guessing happening before we streaming compile (
>>> https://source.chromium.org/chromium/chromium/src/+/main:third_party/blink/renderer/bindings/core/v8/script_streamer.cc;l=584;drc=f0b502c3c977f47c58b49506629b2dd8353e4c59;bpv=1;bpt=1)
>>>  
>>> and some afterwards; and if we initially compiled with the wrong encoding 
>>> we discard and redo iirc. Compilation would likely have failed anyway if 
>>> the encoding was wrong; and presumably this doesn't happen too often.
>>>  
>>>
>>>>
>>>> Also - when trying to explore the partial/hackish-non-JS-detection 
>>>> idea, I wondered if the very first character in a script may only come 
>>>> from a relatively limited set of characters?  Let's assume that the 
>>>> sniffer can skip whitespace (space, tab, CR, LF, LS, PS) and html/xml 
>>>> comments (e.g. <!-- ... -->) - AFAICT the very next character has to be 
>>>> either:
>>>>
>>>>    - The start of a reserved keyword like "if", "let", etc. (all 
>>>>    lowercase ASCII)
>>>>    - The start of an identifier (any Unicode code point with the 
>>>>    Unicode property “ID_Start”)
>>>>    - The start of a unary expression: + - ~ !
>>>>    - The start of a string literal, string template, or a regexp 
>>>>    literal (or non-HTML comment): " ' ` /
>>>>    - The start of a numeric literal: 0-9
>>>>    - An opening paren, bracket or brace: ( [ {
>>>>    - Not quite sure if a dot or an equal sign can appear as the very 
>>>>    first character: . =
>>>>
>>>> This would reject PDFs (starts with %) and HTML/XML (starts with <), 
>>>> but still would accept ZIP files (first character is a 0x50 - capital P) 
>>>> and MSOffice files (first character is a 0xD0 which according to Unicode 
>>>> has ID_Start property set to true).  Rejecting ZIP and MSOffice files 
>>>> would require going beyond the first character - maybe rejecting control 
>>>> characters like 0x11 or 0x03 outside of comments (not sure if at this 
>>>> point the sniffer's heuristics are starting to get too complex).
>>>>
>>>
>>> That was my initial thought too for e.g., PDF. You'd be blacklisting 
>>> files you don't want to leak vs whitelisting JS though, which isn't 
>>> entirely ideal security-wise. It might be better than the alternatives 
>>> though, if we otherwise end up either slowing down the web (repeat 
>>> parsing, interfering with streaming) or potentially having new security 
>>> issues through a shared compiler process.
>>>  
>>>
>>>>
>>>>
>>>>> On Fri, Aug 13, 2021 at 12:26 AM 'Łukasz Anforowicz' via v8-dev <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> On Thu, Aug 12, 2021 at 3:18 PM Łukasz Anforowicz <[email protected]> 
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Aug 12, 2021 at 3:11 PM Jakob Kummerow <[email protected]> 
>>>>>>> wrote:
>>>>>>>
>>>>>>>>> ORB-with-html/json/xml-sniffing shows that some security benefits 
>>>>>>>>> of ORB may be realized without full-fidelity JS sniffing/parsing.  
>>>>>>>>>
>>>>>>>>>
>>>>>>>> You may call it a security benefit to block "obvious" parser 
>>>>>>>> breakers like )]}', but in general, any "when in doubt, don't 
>>>>>>>> block it" strategy won't be much of an obstacle to intentional 
>>>>>>>> attacks. For instance, once Mr. Bad Guy has learned that the sniffer 
>>>>>>>> only looks at the first 1024 characters, they can send a response 
>>>>>>>> whose first 1024 characters lead to a "well, it *might* be valid JS" 
>>>>>>>> judgement (such as a JS comment, or long string, or whatever). OTOH 
>>>>>>>> any "when in doubt, block it" strategy runs the risk of breaking 
>>>>>>>> existing websites in those doubtful cases.
>>>>>>>>
>>>>>>>
>>>>>>> In the CORB threat model the attacker does *not* control the responses - 
>>>>>>> CORB tries to prevent https://attacker.com (with either Spectre or 
>>>>>>> a compromised renderer) from being able to read no-cors responses from 
>>>>>>> https://victim.com.
>>>>>>>
>>>>>>>>  
>>>>>>>>
>>>>>>>>>  (Although the JSON object syntax is exactly Javascript's 
>>>>>>>>> object-initializer syntax, a Javascript object-initializer expression 
>>>>>>>>> is not valid as a standalone Javascript statement.)
>>>>>>>>
>>>>>>>>
>>>>>>>> There is (at least) one subtlety here: JS is more permissive than 
>>>>>>>> the official JSON spec. The latter requires quotes around property 
>>>>>>>> names, the former doesn't. I.e. {"foo": is indeed never valid JS, but 
>>>>>>>> {foo: is (the brace opens a code block, and foo is a label). Also, 
>>>>>>>> the colon is essential for rejecting the former snippet, because 
>>>>>>>> {"foo"; is valid JS (code block plus ignored string à la "use 
>>>>>>>> strict";), so this is a concrete example where the 1024-char 
>>>>>>>> prefix issue is relevant.
>>>>>>>>  
>>>>>>>>
>>>>>>>>> When the sniffer sees:
>>>>>>>>>      [ 123, 456, “long string taking X bytes”,
>>>>>>>>> then it should block the response when the Content-Type is a JSON 
>>>>>>>>> MIME type
>>>>>>>>
>>>>>>>>
>>>>>>>> I don't follow. When the Content-Type is JSON, and the actual 
>>>>>>>> contents are valid JSON, why should that be blocked?
>>>>>>>>
>>>>>>>
>>>>>>> Correct.  There is no way to read cross-origin JSON via a "no-cors" 
>>>>>>> fetch.  The only way to read cross-origin JSON is via CORS-mediated 
>>>>>>> fetch (where the victim has to opt-in by responding with 
>>>>>>> "Access-Control-Allow-Origin: ...").
>>>>>>>
>>>>>>
>>>>>> Maybe another way to look at it is:
>>>>>>
>>>>>>    - Only Javascript (and images/audio/video/stylesheets) can be 
>>>>>>    sent in no-cors mode (i.e. without CORS).  Non-Javascript (and 
>>>>>>    non-image/video/etc), no-cors, cross-origin responses can be blocked.
>>>>>>    - If the response sniffs as JSON (Content-Type=JSON and 
>>>>>>    First1024bytes=JSON) then it is *not* Javascript.  Therefore we can 
>>>>>>    block the response (and prevent disclosing 
>>>>>>    https://victim.com/secret.json to a no-cors fetch from 
>>>>>>    https://attacker.com).
>>>>>>
>>>>>>  
>>>>>>
>>>>>>>
>>>>>>>> -- 
>>>>>>>> -- 
>>>>>>>> v8-dev mailing list
>>>>>>>> [email protected]
>>>>>>>> http://groups.google.com/group/v8-dev
>>>>>>>> --- 
>>>>>>>> You received this message because you are subscribed to a topic in 
>>>>>>>> the Google Groups "v8-dev" group.
>>>>>>>> To unsubscribe from this topic, visit 
>>>>>>>> https://groups.google.com/d/topic/v8-dev/NGGCw9OjatI/unsubscribe.
>>>>>>>> To unsubscribe from this group and all its topics, send an email to 
>>>>>>>> [email protected].
>>>>>>>> To view this discussion on the web visit 
>>>>>>>> https://groups.google.com/d/msgid/v8-dev/CAKSzg3TNvd1jd3yH8xyD767ZhbCqhEZJMFmm7nQ%2BtcQcXfjt_g%40mail.gmail.com
>>>>>>>>  
>>>>>>>> <https://groups.google.com/d/msgid/v8-dev/CAKSzg3TNvd1jd3yH8xyD767ZhbCqhEZJMFmm7nQ%2BtcQcXfjt_g%40mail.gmail.com?utm_medium=email&utm_source=footer>
>>>>>>>> .
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -- 
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Lukasz
>>>>>>>
>>>>>>
>>>>>>
>>>>>> -- 
>>>>>> Thanks,
>>>>>>
>>>>>> Lukasz
>>>>>>
>>>>>
>>>>
>>>>
>>>> -- 
>>>> Thanks,
>>>>
>>>> Lukasz
>>>>
>>>
>>
>

-- 
-- 
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev
--- 
You received this message because you are subscribed to the Google Groups 
"v8-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/v8-dev/ceb7ce0a-dac1-4634-810b-b35b5b97e1f0n%40googlegroups.com.
