on Tue Jan 24 2017, Chris Lattner <sabre-AT-nondot.org> wrote: > On Jan 24, 2017, at 12:05 AM, Chris Eidhof via swift-evolution > <[email protected]> wrote: >> >> I agree that being able to implement parsers in a nice way can be a >> huge step forward in being really good at string processing. > > +1 from me as well, I agree with Joe that Swift can learn a lot from > Perl 6 grammar’s and we should take the time to do it right. Below I > say “regex” a lot, but I really mean a more general grammar system > (and even Perl 5 regex’s aren’t regular :-) > >> There are a couple of possibilities that come to mind directly: >> >> 1. Build parsers right into the language (like Perl 6 grammars) >> 2. Provide a parser combinator language >> (e.g. https://github.com/davedufresne/SwiftParsec >> <https://github.com/davedufresne/SwiftParsec>). >> 3. Rely on external tools like bison/yacc/etc. >> 4. Make it easy for people to write hand-written parsers (e.g. by providing >> an NSScanner > alternative). > > My opinion is that #1 is the right path to start with, but it wouldn’t > preclude doing #2. Here’s my rationale / half-baked thought process: > > There are two important use cases for regex's: the literal case > (e.g. /aa+b*/) and the dynamically computed case. The former is > really what we’re talking about here, the latter should obviously be > handled with some sort of Regex type which can be formed from string > values or whatever.
Ideally these patterns interoperate so that you can combine them. > Regex literals in an expression context should default to producing > the Regex type of course. > > This means that when you pass a regex literal into an API call > (e.g. split on a string), it is really just creating something of > Regex type, and passing it down. If you wanted to introduce a parser > combinator DSL, you could totally plug it into the system, by having > the combinators produce something of the Regex type. > > So why bless regex literals with language support at all? I see > several reasons: > > 1. Diagnostics: These will be heavily used by people, and you want to > have good compiler error and warning messages for them. You want to > be able to validate the regex at compile time, not wait until runtime > to detect syntactic mistakes like unbalanced parens. > > 2. Syntax Familiarity: To take advantage of people’s familiarity with > other languages, we should strive to make the basic regex syntax > familiar and obvious. I’d argue that /aa+b*/ should “just work” and > do the thing you think it does. Relying on a combinator library to do > that would be crazy. > > 3. Performance: Many regex’s are actually regular, so they can be > trivially compiled into DFAs. There is a well understood body of work > that can be simply dropped into the compiler to do this. Regex’s that > are not regular can be compiled into hybrid DFA/NFA+backtracking > schemes, and allowing a divide and conquer style of compiler > optimization to do this is the path that makes the most sense (to me > at least). Further, if you switch on a string and have a bunch of > cases that are regex’s, you’d obviously want the compiler to generate > a single state machine (like a lexer), not check each pattern in > series. > > 4. Pattern matching greatness: One of the most obnoxious/error prone > aspects of regex’s in many languages is that when you match a pattern, > the various matches are dumped into numbered result values (often by > the order of the parens in the pattern). This is totally barbaric: it > begs for off by one errors, often breaks as the program is being > evolved/maintained, etc. It is just as bad as printf/scanf! > > You should instead be able to directly bind subexpressions into local > variables. For example if you were trying to match something like > “42: Chris”, you should be able to use straw man syntax like this: > > case /(let id: \d+): (let name: \w+)/: print(id); print(name) This is a good start, but inadequate for handling the kind of recursive grammars to which you want to generalize regexes, because you have to bind the same variable multiple times—often re-entrantly—during the same match. Actually the Kleene star (*) already has this basic problem, without the re-entrancy, but if you want to build real parsers, you need to do more than simply capture the last substring matched by each group. > Unless we were willing to dramatically expand how patterns work, this > requires baking support into the language. I don't understand the "Unless" part of that sentence. It seems obvious that no expansion of how patterns work could make the above work without language changes. > 5. Scanner/“Formatter" integration: Taking the above one step farther, > we could have default patterns for known types (and make it extensible > to user defined types of course). For example, \d+ is the obvious > pattern for integers, so you should be able to write the above like > this (in principle): > > case /(let id: Int): (let name: \w+)/: print(id); print(name) > > In addition to avoiding having to specify \d+ all the time, this > eliminates the need for a “string to int” conversion after the pattern > is matched, because id would be bound as type Int already. Yup. > Anyway, to summarize, I think that getting regex’s into the language > is really important and expect them to be widely used. As such, I > think it is worth burning compiler/language complexity to make them be > truly great in Swift. Thanks for this post, Chris; it clarifies beautifully the reasons that we're not trying to tackle regexes right away. -- -Dave _______________________________________________ swift-evolution mailing list [email protected] https://lists.swift.org/mailman/listinfo/swift-evolution
