Re: [rust-dev] RFC: Syntax for raw string literals
> I think string literals should contain exactly what they contain in their source form, without any additional processing. If you want to express characters that are inconvenient to type, you can use control sequences and a (standard) formatting library to produce them.

I'm actually very intrigued by the idea of eliminating escape characters altogether in the default string literals. Would follow nicely from how we allow newlines in string literals. We'd have to give up the optional whitespace-chomping behavior around newlines, though, which would make me pretty sad. And are you really willing to force everyone who wants to include a quotation mark in a string to go through a syntax extension to do it?

<facetious>
People, please! Using delimiters on string literals is tantamount to checking for null to determine when you've reached the end of a string in memory. We've graduated beyond those barbarous days by explicitly noting the length of each string in the header, so let's just reuse that idea! Behold, Rust's new string literals:

    fn main() {
        print(#7hello);
        print(#2, );
        print(#5world);
    }
</facetious>

On Sun, Sep 22, 2013 at 5:32 PM, Sebastian Sylvan sebastian.syl...@gmail.com wrote:

> On Thu, Sep 19, 2013 at 1:36 PM, Kevin Ballard ke...@sb.org wrote:
>
> > One feature common to many programming languages that Rust lacks is raw string literals.
>
> This is one of those things where I feel almost all languages get wrong, and probably mostly for historical reasons. IMO there should *only* be raw string literals on the syntax level. It seems extremely weird to me that languages have this second-level language that gets interpreted within a literal. That kind of higher level processing should be part of a formatting library (e.g. a macro like fmt), rather than an embedded language inside the literal syntax.
>
> So, I think string literals should contain exactly what they contain in their source form, without any additional processing. If you want to express characters that are inconvenient to type, you can use control sequences and a (standard) formatting library to produce them.
>
> -- Sebastian Sylvan

___ Rust-dev mailing list Rust-dev@mozilla.org https://mail.mozilla.org/listinfo/rust-dev
Re: [rust-dev] RFC: Syntax for raw string literals
On Tue, Sep 24, 2013 at 12:44 PM, Benjamin Striegel ben.strie...@gmail.com wrote:

> > I think string literals should contain exactly what they contain in their source form, without any additional processing. If you want to express characters that are inconvenient to type, you can use control sequences and a (standard) formatting library to produce them.
>
> I'm actually very intrigued by the idea of eliminating escape characters altogether in the default string literals. Would follow nicely from how we allow newlines in string literals. We'd have to give up the optional whitespace-chomping behavior around newlines, though, which would make me pretty sad. And are you really willing to force everyone who wants to include a quotation mark in a string to go through a syntax extension to do it?

Yes! It seems to me that many/most string literals are used in conjunction with various formatting functions anyway, so I wouldn't think it would be a big deal in practice. Throwing in a call to fmt isn't a big burden, imo.

-- Sebastian Sylvan
Re: [rust-dev] RFC: Syntax for raw string literals
I also forgot to mention the possibility of putting a filename as the eos string. I think it's kind of neat.

    r##index.html## <html> ... </html> ##index.html##
Re: [rust-dev] RFC: Syntax for raw string literals
Hi everyone,

Have we considered syntax similar to Ruby style heredocs? I particularly like the light looking syntax.

- The indentation of the block is determined by the indentation of the eos marker. Keeping code flow natural.

    eos
    Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud
    eos

- Brackets in the eos marker are flipped to allow [[[raw]]]
- eoseos causes a literal eos to be inserted. For example a raw string

My main concern is that might be a common operator. Perhaps would be ok?

Thoughts?

On 21/09/2013 4:28 AM, Alex Crichton a...@crichton.co wrote:

> Of the 3, Lua's is probably the best, although it's a bit esoteric (with using [[ and nary a quote in sight). I think an important thing to keep in mind is that the main reason behind creating a new form of literal is for things like:
>
> * Escapes in format! strings
> * Possible regular expression syntax (this also may be a syntax extension)
> * Type literal windows paths (escaping \ is hard)
> * Otherwise long literals which may contain quotes (like html text)
>
> With those in mind, although Lua's syntax is sufficient, is it nice to use? If the first thing I saw as an introduction to Rust was:
>
>     fn main() {
>         println!([[Hello, {}!]], world);
>     }
>
> I would be a little confused. Now the [[/]] aren't really necessary in this case, but I'm personally unsure of how usable [[/]] would be throughout the language.
>
> Raw literals in languages like C++ and Lua I think aren't intended to be used that often. Instead they should be used only when necessary, and you frequently don't see them in code. For rust, the use cases which are the cause of this discussion are actually fairly common, and I'm not sure that we'd want to see [[/]] all over the place, although of course that's just my opinion :)
>
> Skimming back, I haven't seen a suggestion of the backtick character as a delimiter. Go takes this approach, and I don't believe that in Go you can have a backtick anywhere in a backtick literal, and otherwise what you see is what you get. It's at least something to consider, though.
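The indentation rule in the heredoc proposal above (the closing eos marker's indentation decides how much leading whitespace is stripped from the block) could be sketched roughly as follows. This is only an illustration: `dedent_heredoc` and `marker_indent` are hypothetical names, not anything proposed in the thread, and the sketch assumes that lines indented less than the marker are simply left unchanged.

```rust
/// Strips the closing marker's indentation from every line of `body`.
/// `marker_indent` is the whitespace found in front of the closing marker.
/// Lines that do not start with that exact whitespace are kept as they are
/// (an assumption; the proposal does not say what should happen to them).
fn dedent_heredoc(body: &str, marker_indent: &str) -> String {
    body.lines()
        .map(|line| line.strip_prefix(marker_indent).unwrap_or(line))
        .collect::<Vec<_>>()
        .join("\n")
}
```

For a block whose closing marker is indented by four spaces, `dedent_heredoc(block, "    ")` removes four spaces from each line and keeps any deeper indentation relative to the marker.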
[rust-dev] RFC: Syntax for raw string literals
Oh right, that's fair enough. I think the indentation/escaping issues can be fixed; however, the newline issues you mentioned will still exist for strings split over multiple lines using this syntax.

Good luck!

Steven

On Monday, September 23, 2013, Kevin Ballard wrote:

> Heredocs are primarily intended for multiline strings. Raw strings are intended for strings that have no escapes. Raw strings typically allow newlines, but that is not their primary purpose (and in Rust, regular strings allow newlines anyway). Trying to use a heredoc syntax for raw strings is just a headache (because of indentation, and dealing with the first and/or trailing newline in the heredoc).
>
> -Kevin
>
> On Sep 22, 2013, at 11:52 AM, Artem Egorkine art...@gmail.com wrote:
>
> > I must be missing something about ruby heredocs, but the indentation had always been a painful question about them (http://stackoverflow.com/questions/3772864/how-do-i-remove-leading-whitespace-chars-from-ruby-heredoc). Another thing, of course, is that they are by no means raw (which of course doesn't stop Rust from adopting their syntax for raw strings). I would just say that it would be nice to pick a syntax for raw strings that allows both single-line and multi-line raw strings to be represented easily.
> >
> > On Sep 22, 2013 1:00 PM, Steven Ashley ste...@ashley.net.nz wrote:
> >
> > > Hi everyone,
> > >
> > > Have we considered syntax similar to Ruby style heredocs? I particularly like the light looking syntax.
> > >
> > > - The indentation of the block is determined by the indentation of the eos marker. Keeping code flow natural.
> > >
> > >     eos
> > >     Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud
> > >     eos
> > >
> > > - Brackets in the eos marker are flipped to allow [[[raw]]]
> > > - eoseos causes a literal eos to be inserted. For example a raw string
> > >
> > > My main concern is that might be a common operator. Perhaps would be ok?
> > >
> > > Thoughts?
[rust-dev] RFC: Syntax for raw string literals
I'm in favour of C++11 syntax.

On Monday, September 23, 2013, Steven Ashley wrote:

> Oh right, that's fair enough. I think the indentation/escaping issues can be fixed however the new line issues you mentioned will still exist for strings split over multiple lines using this syntax.
>
> Good luck!
>
> Steven
Re: [rust-dev] RFC: Syntax for raw string literals
On Thu, Sep 19, 2013 at 1:36 PM, Kevin Ballard ke...@sb.org wrote:

> One feature common to many programming languages that Rust lacks is raw string literals.

This is one of those things that I feel almost all languages get wrong, and probably mostly for historical reasons. IMO there should *only* be raw string literals on the syntax level. It seems extremely weird to me that languages have this second-level language that gets interpreted within a literal. That kind of higher level processing should be part of a formatting library (e.g. a macro like fmt), rather than an embedded language inside the literal syntax.

So, I think string literals should contain exactly what they contain in their source form, without any additional processing. If you want to express characters that are inconvenient to type, you can use control sequences and a (standard) formatting library to produce them.

-- Sebastian Sylvan
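To make the suggestion above concrete, here is a minimal sketch of what "escape processing in a library instead of in the literal syntax" might look like. It is not taken from the thread: the function name `unescape` and the set of supported sequences (`\n`, `\t`, `\\`, `\"`) are assumptions chosen for illustration, and unknown sequences are passed through unchanged.

```rust
/// Expands a small, assumed set of backslash sequences in `raw`.
/// The literal itself stays uninterpreted; this function does the work
/// that the lexer would otherwise do. Unknown sequences are kept verbatim.
fn unescape(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        // `c` is a backslash: look at the character that follows it.
        match chars.next() {
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            Some('\\') => out.push('\\'),
            Some('"') => out.push('"'),
            Some(other) => {
                // Not a sequence we know: keep both characters.
                out.push('\\');
                out.push(other);
            }
            // A trailing backslash is kept as-is.
            None => out.push('\\'),
        }
    }
    out
}
```

With a helper like this, a program would write the text uninterpreted and call `unescape` only where control characters are actually wanted, which is the trade-off questioned elsewhere in the thread for strings that merely need a quotation mark.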
Re: [rust-dev] RFC: Syntax for raw string literals
On 09/22/2013 05:40 PM, Kevin Ballard wrote:

> I've filed a summary of this conversation as an RFC issue on the GitHub issue tracker.
>
> https://github.com/mozilla/rust/issues/9411

I've used a variation of the option 10 for my own configuration format's raw strings:

    delim"raw text"delim

Where delim was an equivalent of an identifier. If ` is a problem, then maybe using ' works too?

    'delim"raw text"delim'
    '"raw text"'

-SL
Re: [rust-dev] RFC: Syntax for raw string literals
' doesn't work because 'delim is parsed as a lifetime.

-Kevin

On Sep 22, 2013, at 3:41 PM, SiegeLord slab...@aim.com wrote:

> On 09/22/2013 05:40 PM, Kevin Ballard wrote:
>
> > I've filed a summary of this conversation as an RFC issue on the GitHub issue tracker.
> >
> > https://github.com/mozilla/rust/issues/9411
>
> I've used a variation of the option 10 for my own configuration format's raw strings:
>
>     delim"raw text"delim
>
> Where delim was an equivalent of an identifier. If ` is a problem, then maybe using ' works too?
>
>     'delim"raw text"delim'
>     '"raw text"'
>
> -SL
Re: [rust-dev] RFC: Syntax for raw string literals
On 09/22/2013 07:10 PM, Kevin Ballard wrote:

> ' doesn't work because 'delim is parsed as a lifetime.

The parser will have to be modified to support raw strings in any of their manifestations. Is it a fact that there is no possible parser that can differentiate between 'delim and 'delim" ? I guess it'll give trouble to this current syntax 'foo"blah", but it wouldn't be the first place in the grammar where a space was necessary to disambiguate between constructs ( comes to mind).

-SL
Re: [rust-dev] RFC: Syntax for raw string literals
It would require changing the rules for lifetimes, with no benefit (and no clear new rule to use anyway). 'foo"delim" is perfectly legal today, and I see no reason to change that.

-Kevin

On Sep 22, 2013, at 4:26 PM, SiegeLord slab...@aim.com wrote:

> On 09/22/2013 07:10 PM, Kevin Ballard wrote:
>
> > ' doesn't work because 'delim is parsed as a lifetime.
>
> The parser will have to be modified to support raw strings in any of their manifestations. Is it a fact that there is no possible parser than can differentiate between 'delim and 'delim" ? I guess it'll give trouble to this current syntax 'foo"blah", but it wouldn't be the first place in the grammar where a space was necessary to disambiguate between constructs ( comes to mind).
>
> -SL
Re: [rust-dev] RFC: Syntax for raw string literals
On 09/22/2013 07:45 PM, Kevin Ballard wrote:

> It would require changing the rules for lifetimes, with no benefit (and no clear new rule to use anyway). 'foo"delim" is perfectly legal today, and I see no reason to change that.

It's not as big a change as you make it out to be, but fair enough. Looking at the parser right now, it seems to me that implementing the leading 'R' in C++'s syntax will be just as difficult/easy as doing my delim"stuff"delim proposal so I'm sticking to that idea as my 'vote'.

If C++ way is chosen, I'd suggest the following permutation of the delimiters, as I think it looks lighter (by virtue of using smaller characters):

    r'delim"raw string"delim'
    r'"raw string"'

-SL
Re: [rust-dev] RFC: Syntax for raw string literals
On Sep 22, 2013, at 5:27 PM, SiegeLord slab...@aim.com wrote:

> On 09/22/2013 07:45 PM, Kevin Ballard wrote:
>
> > It would require changing the rules for lifetimes, with no benefit (and no clear new rule to use anyway). 'foo"delim" is perfectly legal today, and I see no reason to change that.
>
> It's not as big a change as you make it out to be, but fair enough. Looking at the parser right now, it seems to me that implementing the leading 'R' in C++'s syntax will be just as difficult/easy as doing my delim"stuff"delim proposal so I'm sticking to that idea as my 'vote'.

With C++11 syntax, `R"foo` is very obviously the start of a raw string. With your syntax, what about `add"foo`? Is that obviously the start of a raw string, or did the user just forget to type the ( in their function call? They may look the same to a lexer, but I think that being very clear about what starts the raw string is beneficial for reading.

> If C++ way is chosen, I'd suggest the following permutation of the delimiters, as I think it looks lighter (by virtue of using smaller characters):
>
>     r'delim"raw string"delim'
>     r'"raw string"'

I'd really rather not overload the meaning of the ' character, if at all possible. Right now it's used for lifetimes, and character literals. Expanding it to also be used in string literals just feels like unnecessary overloading. We already have a perfectly good " that means string literal.

I suppose you could flip that to r"delim'raw string'delim" or r"'raw string'". I just don't see why that's any better than R"delim(raw string)delim" or R"(raw string)". Especially in the r"'raw string'" case, having lots of little tick marks in a row takes more effort to visually distinguish.

I suppose r"(raw string)" is an option, but if we're that close to C++11 we may as well just go whole hog and be consistent with their syntax.

-Kevin
Re: [rust-dev] RFC: Syntax for raw string literals
On 20.09.2013 22:35, Benjamin Striegel wrote:

> As usual, I'm highly resistant to use of the backtick because Markdown uses it pervasively. Not only would this make it very annoying to embed Markdown in strings, it can make it impossible to embed inline Rust code in Markdown editors. Let's leave the backtick as a metasyntactic symbol.

I am not so sure the markdown argument stands, because it is only an issue in `inline code blocks` really. Blocks fenced with ``` or 4-space indents can contain backticks just fine, and can typically do in bash scripts.

In inline blocks, you can always escape them with \` which sure isn't as nice, but I find it rare to have much more than alpha-numeric identifiers in inline blocks.

Cheers

-- Jordi Boggiano @seldaek - http://nelm.io/jordi
Re: [rust-dev] RFC: Syntax for raw string literals
Kevin (cc'ing rust-dev)-

Of the choices listed here, I prefer the C++11 syntax.

Whatever syntax we choose, I would prefer one that has user-selected delimiting character sequences (as illustrated by the cases of D and C++11). From my point-of-view, that is the only way to get a raw string that really means raw string; otherwise, you end up having to select some exceptional case (e.g. the backslashes, doubled-up quotes, etc of the other options Kevin described).

Cheers,
-Felix

- Original Message -
From: Kevin Ballard ke...@sb.org
To: rust-dev@mozilla.org
Sent: Thursday, September 19, 2013 10:36:39 PM
Subject: [rust-dev] RFC: Syntax for raw string literals

One feature common to many programming languages that Rust lacks is raw string literals. Specifically, these are string literals that don't interpret backslash-escapes. There are three obvious applications at the moment: regular expressions, windows file paths, and format!() strings that want to embed { and } chars. I'm sure there are more as well, such as large string literals that contain things like HTML text.

I took a look at 3 programming languages to see what solutions they had: D, C++11, and Python. I've reproduced their syntax below, plus one more custom syntax, along with pros and cons. I'm hoping we can come up with a syntax that makes sense for Rust.

## Python syntax:

Python supports an r or R prefix on any string literal (both short strings, delimited with a single quote, or long strings, delimited with 3 quotes). The r or R prefix denotes a raw string, and has the effect of disabling backslash-escapes within the string. For the most part. It actually gets a bit weird: if a sequence of backslashes of an odd length occurs prior to a quote (of the appropriate quote type for the string), then the quote is considered to be escaped, but the backslashes are left in the string. This means r"foo\"" evaluates to the string `foo\"`, and similarly r"foo\\\"" is `foo\\\"`, but r"foo\\" is merely the string `foo\\`.

Pros:
* Simple syntax
* Allows for embedding the closing quote character in the raw string

Cons:
* Handling of backslashes is very bizarre, and the closing quote character can only be embedded if you want to have a backslash before it.

## C++11 syntax:

C++11 allows for raw strings using a sequence of the form R"seq(raw text)seq". In this construct, `seq` is any sequence of (zero or more) characters except for: space, (, ), \, \t, \v, \n, \r. The simplest form looks like R"(raw text)", which allows for anything in the raw text except for the sequence `)"`. The addition of the delimiter sequence allows for constructing a raw string containing any sequence at all (as the delimiter sequence can be adjusted based on the represented text).

Pros:
* Allows for embedding any character at all (representable in the source file encoding), including the closing quote.
* Reasonably straightforward

Cons:
* Syntax is slightly complicated

## D syntax:

D supports three different forms of raw strings. The first two are similar, being r"raw text" and `raw text`. Besides the choice of delimiters, they behave identically, in that the raw text may contain anything except for the appropriate quote character.

The third syntax is a slightly more complicated form of C++11's syntax, and is called a delimited string. It takes two forms.

The first looks like q"(raw text)" where the ( may be any non-identifier non-whitespace character. If the character is one of [({ then it is a nesting delimiter, and the close delimiter must be the matching ])} character, otherwise the close delimiter is the same as the open. Furthermore, nesting delimiters do exactly what their name says: they nest. If the nesting delimiter is (), then any ( in the raw text must be balanced with a ) in the raw text. In other words, q"(foo(bar))" evaluates to "foo(bar)", but q"(foo(bar)" and q"(foobar))" are both illegal.

The second uses any identifier as the delimiter. In this case, the identifier must immediately be followed by a newline, and in order to close the string, the close delimiter must be preceded by a newline. This looks like

    q"delim
    this is some raw text
    delim"

It's essentially a heredoc. Note that the first newline is not part of the string, but the final newline is, so this evaluates to "this is some raw text\n".

Pros:
* Flexible
* Allows for constructing a raw string that contains any desired sequence of characters (representable in the source file's encoding)

Cons:
* Overly complicated

## Custom syntax

There's another approach that none of these three languages take, which is to merely allow for doubling up the quote character in order to embed a quote. This would look like R"raw string literal ""with embedded quotes"".", which becomes `raw string literal "with embedded quotes"`.

Pros:
* Very simple
* Allows for embedding the close quote character, and therefore, any character (representable in the source file encoding)

Cons:
* Slightly odd to read
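The doubled-quote rule in the custom syntax above is simple enough to sketch as a scanner. The following is a rough illustration, not an implementation from the thread or from any compiler: the name `scan_doubled_quote` and its return shape are hypothetical, and it assumes the input begins just after the opening quote of the literal.

```rust
/// Scans the body of a doubled-quote raw literal.
/// `src` starts just after the opening `"`. Returns the decoded text and
/// the number of bytes consumed (including the closing quote), or `None`
/// if no closing quote is found.
fn scan_doubled_quote(src: &str) -> Option<(String, usize)> {
    let mut out = String::new();
    let mut chars = src.char_indices().peekable();
    while let Some((i, c)) = chars.next() {
        if c != '"' {
            // Every other character, backslashes included, is taken literally.
            out.push(c);
            continue;
        }
        match chars.peek() {
            // `""` stands for one literal quote character.
            Some(&(_, '"')) => {
                chars.next();
                out.push('"');
            }
            // A lone quote closes the literal; `"` is one byte long.
            _ => return Some((out, i + 1)),
        }
    }
    None
}
```

The only special case the scanner needs is the quote itself, which is what makes this option "very simple"; the cost is the `""` noise noted under the cons.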
Re: [rust-dev] RFC: Syntax for raw string literals
You always have to have some exceptional case, though, don't you? What if you have a string literal that contains every single character? Or what if you have literals in procedurally generated code that might contain any unknown character? There's always a possibility that a given delimiter (sequence of) character(s) might be duplicated inside the literal. Isn't there?

Carl Eastlund

On Sat, Sep 21, 2013 at 7:24 AM, Felix Klock pnkfe...@mozilla.com wrote:

> Kevin (cc'ing rust-dev)-
>
> Of the choices listed here, I prefer the C++11 syntax.
>
> Whatever syntax we choose, I would prefer one that has user-selected delimiting character sequences (as illustrated by the cases of D and C++11). From my point-of-view, that is the only way to get a raw string that really means raw string; otherwise, you end up having to select some exceptional case (e.g. the backslashes, doubled-up quotes, etc of the other options Kevin described).
>
> Cheers,
> -Felix
Re: [rust-dev] RFC: Syntax for raw string literals
On Sat, Sep 21, 2013 at 4:52 PM, Carl Eastlund c...@ccs.neu.edu wrote:

> You always have to have some exceptional case, though, don't you? What if you have a string literal that contains every single character? Or what if you have literals in procedurally generated code that might contain any unknown character? There's always a possibility that a given delimiter (sequence of) character(s) might be duplicated inside the literal. Isn't there?
>
> Carl Eastlund

A shell script's here document or a C++11 raw string literal gives you the ability to choose the sequence ending the literal. You can always pick an appropriate end delimiter for a given string.
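Picking a delimiter that cannot collide with the text is mechanical, which is the point being made for generated code. As a hedged sketch, the code below uses the hash-delimited form that Rust later adopted (`r"..."`, `r#"..."#`, `r##"..."##`, and so on) rather than the C++11 form discussed in this thread. The function names `hashes_needed` and `raw_literal` are made up for the example, and it ignores other lexical restrictions a real generator would have to check.

```rust
/// Returns the smallest number of `#` marks such that the closing
/// delimiter (a `"` followed by that many `#`) never occurs in `text`.
fn hashes_needed(text: &str) -> usize {
    let mut needed = 0;
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '"' {
            // Count the run of '#' that directly follows this quote.
            let mut run = 0;
            while chars.peek() == Some(&'#') {
                chars.next();
                run += 1;
            }
            // A quote followed by `run` hashes would end any literal that
            // uses `run` or fewer hashes, so at least `run + 1` are needed.
            needed = needed.max(run + 1);
        }
    }
    needed
}

/// Wraps `text` in a raw string literal with just enough `#` marks.
fn raw_literal(text: &str) -> String {
    let hashes = "#".repeat(hashes_needed(text));
    format!("r{hashes}\"{text}\"{hashes}")
}
```

Text without any quote needs no hashes at all, and each `"#...#` run inside the text only raises the required count by one, so the search always terminates with a usable delimiter.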
Re: [rust-dev] RFC: Syntax for raw string literals
The delimiter can be whatever you want in C++11 syntax (well, with restrictions on the charset, but among that charset it's freeform). You can _always_ pick a delimiter that isn't found in the text. If you're procedurally generating the text, surely you can also write an algorithm to pick a delimiter. It's not very hard to do so. -Kevin On Sep 21, 2013, at 1:52 PM, Carl Eastlund c...@ccs.neu.edu wrote: You always have to have some exceptional case, though, don't you? What if you have a string literal that contains every single character? Or what if you have literals in procedurally generated code that might contain any unknown character? There's always a possibility that a given delimiter (sequence of) character(s) might be duplicated inside the literal. Isn't there? Carl Eastlund On Sat, Sep 21, 2013 at 7:24 AM, Felix Klock pnkfe...@mozilla.com wrote: Kevin (cc'ing rust-dev)- Of the choices listed here, I prefer the C++11 syntax. Whatever syntax we choose, I would prefer one that has user-selected delimiting character sequences (as illustrated by the cases of D and C++11). From my point-of-view, that is the only way to get a raw string that really means raw string; otherwise, you end up having to select some exceptional case (e.g. the backslashes, doubled-up quotes, etc of the other options Kevin described). Cheers, -Felix - Original Message - From: Kevin Ballard ke...@sb.org To: rust-dev@mozilla.org Sent: Thursday, September 19, 2013 10:36:39 PM Subject: [rust-dev] RFC: Syntax for raw string literals One feature common to many programming languages that Rust lacks is raw string literals. Specifically, these are string literals that don't interpret backslash-escapes. There are three obvious applications at the moment: regular expressions, windows file paths, and format!() strings that want to embed { and } chars. I'm sure there are more as well, such as large string literals that contain things like HTML text. 
Re: [rust-dev] RFC: Syntax for raw string literals
The way that Lua does raw strings is also fairly nifty. Check out http://www.lua.org/manual/5.2/manual.html, section 3.1, or, in short: - Strings can be delimited by [===[, with any number of equals signs. The corresponding closing delimiter must match the original number of equals signs. - No escaping is done. - Any kind of end-of-line sequence (i.e. \r and \n in any order) is converted to just a newline. - It can run for multiple lines. --Andrew D On Thu, Sep 19, 2013 at 10:28 PM, Kevin Cantu m...@kevincantu.org wrote: I think designing good traits to support all these text implementations is far more important than whatever hungarian notation is preferred for literals. Kevin On Thu, Sep 19, 2013 at 2:50 PM, Martin DeMello martindeme...@gmail.comwrote: Ah, good point. You could fix it by having a very small whitelist of acceptable delimiters, but that probably takes it into overcomplex territory. martin On Thu, Sep 19, 2013 at 2:46 PM, Kevin Ballard ke...@sb.org wrote: As I just responded to Masklinn, this is ambiguous. How do you lex `do R{foo()}`? -Kevin On Sep 19, 2013, at 2:41 PM, Martin DeMello martindeme...@gmail.com wrote: Yes, I figured R followed by a non-alphabetical character could serve the same purpose as ruby's %char. martin On Thu, Sep 19, 2013 at 2:37 PM, Kevin Ballard ke...@sb.org wrote: I didn't look at Ruby's syntax, but what you just described sounds a little too free-form to me. I believe Ruby at least requires a % as part of the syntax, e.g. %q{test}. But I don't think %R{test} is a good idea for rust, as it would conflict with the % operator. I don't think other punctuation would work well either. -Kevin On Sep 19, 2013, at 2:10 PM, Martin DeMello martindeme...@gmail.com wrote: How complicated would it be to use R but with arbitrary paired delimiters (the way, for instance, ruby does it)? It's very handy to pick a delimiter you know does not appear in the string, e.g. 
if you had a string containing ')' you could use R{this is a string with a ) in it} or R|this is a string with a ) in it|. martin
Re: [rust-dev] RFC: Syntax for raw string literals
On 2013-09-19, at 23:45, Kevin Ballard wrote:

> Yes I know, but in my (rather limited) experience with Python, triple-quoted strings are typically used for docstrings. It was just an example anyway.

They're also commonly used for multiline strings as single-quoted strings don't require it.

>> * The quote-escaping oddness is less of an issue in Python as you can also use single-quotes for delimiting, or use triple-quoted strings (if you need to embed both single and double quotes in rawstrings).
>
> If I need to embed both ''' and """ in a string, I'm out of luck.

The chance of that is as remote as can be. I've never seen or heard of it happen. And mind, the issue must happen *in a rawstring* which is even more unlikely.

>>> Also, windows file paths
>>
>> windows paths can also use forward slashes so that's not a very interesting justification.
>
> Not always. UNC paths must start with \\ (in my testing, //foo/bar/baz is not interpreted as a UNC path by the Windows File Explorer, but \\foo/bar/baz is).

True. Do you expect writing literal UNC paths in Rust to be a common occurrence?

> There's also paths that start with the verbatim prefix \\?\, which disables interpretation of forward-slashes (among other things).

That's not really relevant to a rawstrings proposal, why would a developer embed such a path literally?

> As I am actively engaged in writing a replacement for the path module, and am currently expanding the test suite for Windows paths, raw strings would be extremely useful to me.

I'd have thought it a better idea to use path builders (maybe macros) and avoid embedding literal path separators in order to avoid portability issues.
Re: [rust-dev] RFC: Syntax for raw string literals
On Sep 20, 2013, at 1:13 AM, Masklinn maskl...@masklinn.net wrote:

>>>> Also, windows file paths
>>>
>>> windows paths can also use forward slashes so that's not a very interesting justification.
>>
>> Not always. UNC paths must start with \\ (in my testing, //foo/bar/baz is not interpreted as a UNC path by the Windows File Explorer, but \\foo/bar/baz is).
>
> True. Do you expect writing literal UNC paths in Rust to be a common occurrence?

Maybe not for most people, but I've been writing them a _lot_ lately (I'm rewriting the path module). Regular expressions is really the most common application here.

>> There's also paths that start with the verbatim prefix \\?\, which disables interpretation of forward-slashes (among other things).
>
> That's not really relevant to a rawstrings proposal, why would a developer embed such a path literally?

Perhaps they want to hard-code a path that refers to something that requires the \\?\ prefix (such as a path that contains / as part of a path component, or is longer than 255 characters). But just in general, \ is the canonical Windows path separator. I don't think "use /" is particularly great advice. What if this string is intended for displaying?

>> As I am actively engaged in writing a replacement for the path module, and am currently expanding the test suite for Windows paths, raw strings would be extremely useful to me.
>
> I'd have thought it a better idea to use path builders (maybe macros) and avoid embedding literal path separators in order to avoid portability issues.

People still use literal path separators in strings all the time in languages that support path-building methods.

-Kevin
Re: [rust-dev] RFC: Syntax for raw string literals
>> If I need to embed both ''' and """ in a string, I'm out of luck.
>
> The chance of that is as remote as can be. I've never seen or heard of it happen. And mind, the issue must happen *in a rawstring* which is even more unlikely.

You should note that, as soon as you include something in the language itself, that creates meaningful strings (programs in the language) that include the token, which are not unlikely, at some point, to need to be written as a multiline string in the language itself.

(As a related example, as someone writing JavaScript-analyzing code in JavaScript, I've had several bugs caused by the fact that the nonsense, no-one-is-ever-going-to-use-this word __proto__ has a very hard to suppress special meaning, and you *are* going to use it when analyzing the elements in another JavaScript program.)
Re: [rust-dev] RFC: Syntax for raw string literals
On 2013-09-20, at 10:26, Marijn Haverbeke wrote:

>>> If I need to embed both ''' and """ in a string, I'm out of luck.
>>
>> The chance of that is as remote as can be. I've never seen or heard of it happen. And mind, the issue must happen *in a rawstring* which is even more unlikely.
>
> You should note that, as soon as you include something in the language itself, that creates meaningful strings (programs in the language) that include the token, which are not unlikely, at some point, to need to be written as a multiline string in the language itself.

It's already noted, my objections are very much that this is highly unlikely to be an issue as it only comes to a head when needing *triple-quoted rawstrings* to include *their own* delimiters (meaning a triple-quoted rawstring which needs to include both triple-quoted delimiters at the same time). Even unlikelier given python will concatenate string literals during parsing.

On 2013-09-20, at 10:25, Kevin Ballard wrote:

> Regular expressions is really the most common application here.

Right, which was just about all I was saying in the original message.

> People still use literal path separators in strings all the time in languages that support path-building methods.

Something I don't believe should be encouraged.
Re: [rust-dev] RFC: Syntax for raw string literals
Out of all the mentioned syntaxes, Python's seems simple and easy (and the corner cases appear to be fairly unlikely for the actual use cases for raw strings), Ruby's seems very powerful and, if a couple of restrictions are added, could probably fit well, and Lua's seems very well designed by allowing delimiters of arbitrary length.

As a user of higher-level languages, all of these seem appealing to me. I don't really feel that rawstrings should be complicated to use, and I don't really think the limitations are bad so long as they are explicitly documented (which is how it should be).

-- Andrés Osinski http://www.andresosinski.com.ar/
Re: [rust-dev] RFC: Syntax for raw string literals
Python's has really stupid handling of backslashes, and I really don't like how it cannot represent all valid strings. I'd really prefer not to make that same mistake. Ruby's syntax cannot be used because % lexes as an operator. Of the 3, Lua's is probably the best, although it's a bit esoteric (with using [[ and nary a quote in sight). It seems roughly equivalent to C++11's syntax though, both in ease of use and flexibility. -Kevin
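To make the complaint about Python's backslash handling concrete, here is a small Rust sketch of a scanner that follows the rule as the RFC describes it: a backslash stays in the string but still stops the next character from ending the literal. The function name `scan_python_raw` is hypothetical, and this models only the double-quoted short-string form, not Python's actual tokenizer.

```rust
/// Scan the source text that follows an opening `r"`.
/// Returns the string contents, or None if no closing quote is found.
/// Hypothetical model of the rule discussed in this thread.
fn scan_python_raw(after_open: &str) -> Option<String> {
    let mut out = String::new();
    let mut chars = after_open.chars();
    while let Some(c) = chars.next() {
        match c {
            // An unprotected quote closes the literal.
            '"' => return Some(out),
            '\\' => {
                // The backslash itself is kept in the string...
                out.push('\\');
                // ...and whatever follows it is kept too, even a quote.
                match chars.next() {
                    Some(next) => out.push(next),
                    None => return None,
                }
            }
            _ => out.push(c),
        }
    }
    None
}
```

Under this model the source `r"foo\""` yields `foo\"` and `r"foo\\"` yields `foo\\`, but `r"foo\"` never terminates, which is why a raw string ending in an odd number of backslashes cannot be written this way.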
Re: [rust-dev] RFC: Syntax for raw string literals
Does it HAVE to be a single typed char seen on the English 101 keyboard?

History lesson: in the very early days of printing, storing, and processing characters, both English and non-English, the industry came up with a solution built around control characters.

ASCII char 1 is known as Start of Heading, abbreviated SOH.
ASCII char 2 is known as Start of Text, abbreviated STX.
ASCII char 3 is known as End of Text, abbreviated ETX.

It got me thinking of how various industries to this day still use Start of Text and End of Text, which is what we are discussing: enclosing a string verbatim. Many data operations that I perform with conversion of string fields are actually done by first wrapping with control chars [1] to enclose the string LITERALLY. Apple's Enterprise Partner Feed is an example that uses such basic control chars to separate fields and, interestingly, uses multibyte EOL control chars to retain even Unicode contents (foreign-language strings that at times use quotes of a different nature [2], that sometimes appear in its fields, and that need to be retained inside a database field as well).

I am wondering if doing something similar to what the industry does, using control chars to represent STX and ETX, would not be an even wiser way to supplant the string literal? I.e. do not reinvent the fast-spinning wheel that also has built-in never-go-flat technology. :)

[1] http://www.theasciicode.com.ar/ascii-control-characters/start-of-text-ascii-code-2.html
[2] http://en.wikipedia.org/wiki/Non-English_usage_of_quotation_marks

Thoughts?

-- 
-Thad
Thad on Freebase.com http://www.freebase.com/view/en/thad_guidry
Thad on LinkedIn http://www.linkedin.com/in/thadguidry/
Re: [rust-dev] RFC: Syntax for raw string literals
As usual, I'm highly resistant to use of the backtick because Markdown uses it pervasively. Not only would this make it very annoying to embed Markdown in strings, it can make it impossible to embed inline Rust code in Markdown editors. Let's leave the backtick as a metasyntactic symbol.

On Fri, Sep 20, 2013 at 3:45 PM, Kevin Ballard ke...@sb.org wrote:

> I considered backtick as well. If that approach is used, I would suggest that a doubled-up backtick represent a single backtick in the string, i.e. `error: path ``{}' failed`. This is pretty much equivalent to just using r"" as the syntax, although backtick may be a slightly nicer syntax for it.
>
> -Kevin
>
> On Sep 20, 2013, at 9:27 AM, Alex Crichton a...@crichton.co wrote:
>
>>> Of the 3, Lua's is probably the best, although it's a bit esoteric (with using [[ and nary a quote in sight).
>>
>> I think an important thing to keep in mind is that the main reason behind creating a new form of literal is for things like:
>>
>> * Escapes in format! strings
>> * Possible regular expression syntax (this also may be a syntax extension)
>> * Type literal windows paths (escaping \ is hard)
>> * Otherwise long literals which may contain quotes (like html text)
>>
>> With those in mind, although Lua's syntax is sufficient, is it nice to use? If the first thing I saw as an introduction to Rust was:
>>
>>     fn main() { println!([[Hello, {}!]], "world"); }
>>
>> I would be a little confused. Now the [[/]] aren't really necessary in this case, but I'm personally unsure of how usable [[/]] would be throughout the language.
>>
>> Raw literals in languages like C++ and Lua I think aren't intended to be used that often. Instead they should be used only when necessary, and you frequently don't see them in code. For rust, the use cases which are the cause of this discussion are actually fairly common, and I'm not sure that we'd want to see [[/]] all over the place, although of course that's just my opinion :)
>>
>> Skimming back, I haven't seen a suggestion of the backtick character as a delimiter. Go takes this approach, and I don't believe that in Go you can have a backtick anywhere in a backtick literal, and otherwise what you see is what you get. It's at least something to consider, though.
[rust-dev] RFC: Syntax for raw string literals
One feature common to many programming languages that Rust lacks is raw string literals. Specifically, these are string literals that don't interpret backslash-escapes. There are three obvious applications at the moment: regular expressions, windows file paths, and format!() strings that want to embed { and } chars. I'm sure there are more as well, such as large string literals that contain things like HTML text.

I took a look at 3 programming languages to see what solutions they had: D, C++11, and Python. I've reproduced their syntax below, plus one more custom syntax, along with pros and cons. I'm hoping we can come up with a syntax that makes sense for Rust.

## Python syntax:

Python supports an r or R prefix on any string literal (both short strings, delimited with a single quote, or long strings, delimited with 3 quotes). The r or R prefix denotes a raw string, and has the effect of disabling backslash-escapes within the string. For the most part. It actually gets a bit weird: if a sequence of backslashes of an odd length occurs prior to a quote (of the appropriate quote type for the string), then the quote is considered to be escaped, but the backslashes are left in the string. This means r"foo\"" evaluates to the string `foo\"`, and similarly r"foo\\\"" is `foo\\\"`, but r"foo\\" is merely the string `foo\\`.

Pros:
* Simple syntax
* Allows for embedding the closing quote character in the raw string

Cons:
* Handling of backslashes is very bizarre, and the closing quote character can only be embedded if you want to have a backslash before it.

## C++11 syntax:

C++11 allows for raw strings using a sequence of the form R"seq(raw text)seq". In this construct, `seq` is any sequence of (zero or more) characters except for: space, (, ), \, \t, \v, \n, \r. The simplest form looks like R"(raw text)", which allows for anything in the raw text except for the sequence `)"`. The addition of the delimiter sequence allows for constructing a raw string containing any sequence at all (as the delimiter sequence can be adjusted based on the represented text).

Pros:
* Allows for embedding any character at all (representable in the source file encoding), including the closing quote.
* Reasonably straightforward

Cons:
* Syntax is slightly complicated

## D syntax:

D supports three different forms of raw strings. The first two are similar, being r"raw text" and `raw text`. Besides the choice of delimiters, they behave identically, in that the raw text may contain anything except for the appropriate quote character.

The third syntax is a slightly more complicated form of C++11's syntax, and is called a delimited string. It takes two forms.

The first looks like q"(raw text)" where the ( may be any non-identifier non-whitespace character. If the character is one of [({ then it is a nesting delimiter, and the close delimiter must be the matching ])} character, otherwise the close delimiter is the same as the open. Furthermore, nesting delimiters do exactly what their name says: they nest. If the nesting delimiter is (), then any ( in the raw text must be balanced with a ) in the raw text. In other words, q"(foo(bar))" evaluates to "foo(bar)", but q"(foo(bar)" and q"(foobar))" are both illegal.

The second uses any identifier as the delimiter. In this case, the identifier must immediately be followed by a newline, and in order to close the string, the close delimiter must be preceded by a newline. This looks like

    q"delim
    this is some raw text
    delim"

It's essentially a heredoc. Note that the first newline is not part of the string, but the final newline is, so this evaluates to "this is some raw text\n".

Pros:
* Flexible
* Allows for constructing a raw string that contains any desired sequence of characters (representable in the source file's encoding)

Cons:
* Overly complicated

## Custom syntax

There's another approach that none of these three languages take, which is to merely allow for doubling up the quote character in order to embed a quote. This would look like R"raw string literal ""with embedded quotes"".", which becomes `raw string literal "with embedded quotes".`

Pros:
* Very simple
* Allows for embedding the close quote character, and therefore, any character (representable in the source file encoding)

Cons:
* Slightly odd to read

## Conclusion

Of the three existing syntaxes examined here, I think C++11's is the best. It ties with D's syntax for being the most powerful, but is simpler than D's. The custom syntax is just as powerful though. The benefit of the C++11 syntax over the custom syntax is it's slightly easier to read the C++11 syntax, as the raw text has a one-to-one mapping with the resulting string. The custom syntax is a bit more confusing to read, especially if you want to add multiple quotes. As a pathological case, let's try representing a Python triple-quoted docstring using both syntaxes:

    C++11: R"("""this is a python docstring""")"
    Custom: R"""""""this is a python docstring"""""""

Based on this
Re: [rust-dev] RFC: Syntax for raw string literals
On 2013-09-19, at 22:36, Kevin Ballard wrote:

> I welcome any comments, criticisms, or suggestions.

* C# also has rawstrings, which were not looked at. C#'s rawstrings disable escaping entirely but add a new one: doubling quotes will insert a single quote in the resulting string (similar to quote-escaping in SQL or Smalltalk).
* The docstring comment is incorrect, a docstring is a string in the first position of a module, a class statement or a function statement. A single-quoted string at these positions will yield a docstring. The triple-quoting is a string syntax embedding newlines (single-quoted strings can not contain literal newlines in Python, only escaped ones). Obviously, triple-quoted python strings can be raw.
* The quote-escaping oddness is less of an issue in Python as you can also use single-quotes for delimiting, or use triple-quoted strings (if you need to embed both single and double quotes in rawstrings).
* Perl's quotes and quote-like operators would certainly deserve mention.

Also,

> windows file paths

windows paths can also use forward slashes so that's not a very interesting justification.
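The quote-doubling rule mentioned for C# (and proposed as the custom syntax in the RFC) is simple enough to sketch in Rust. The names `encode_doubled` and `decode_doubled` are hypothetical, and the sketch handles only the body of the literal; the prefix and opening quote are assumed to be consumed by the caller.

```rust
/// Encode `text` as the body of a doubled-quote raw literal: every `"`
/// becomes `""`. The caller adds the surrounding quotes.
fn encode_doubled(text: &str) -> String {
    text.replace('"', "\"\"")
}

/// Decode the source text that follows the opening quote. A doubled `""`
/// yields one quote; a lone `"` closes the literal.
/// Returns None if the literal is never closed.
fn decode_doubled(after_open: &str) -> Option<String> {
    let mut out = String::new();
    let mut chars = after_open.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '"' {
            if chars.peek() == Some(&'"') {
                // Doubled quote: consume the second one, emit a single quote.
                chars.next();
                out.push('"');
            } else {
                // Lone quote: end of the literal.
                return Some(out);
            }
        } else {
            out.push(c);
        }
    }
    None
}
```

Every string round-trips through this pair, which is the sense in which the doubled-quote form can represent any text. The cost is readability: a value that starts or ends with several quotes turns into a long run of quote characters in the source.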
Re: [rust-dev] RFC: Syntax for raw string literals
Just to make sure - how does the C++ syntax behave in the presence of line breaks? Specifically, what does it do with leading (and trailing) white space of each line? My guess is that they would be included in the string, is that correct?

At any rate, having some sort of here documents would be very nice. The C++ syntax is reasonable, though I really don't have a strong preference here. It might be more Rust-ish to use a macro notation instead: str!(delimiter.delimiter), or something like that.

BTW, I found myself creating (in several languages) an unindent string function that would (1) if the string starts with a line break, remove it; (2) remove the leading white space of the 1st line from all the lines. Applying this to here documents allows indenting them together with the code that includes them. In Rust, the downside of this approach is that the result isn't 'static any more... Not that this warrants making such complex functionality a built-in of the syntax, of course.

Oren.
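The two-step unindent function described in this message could look roughly like the following in Rust. This is a sketch of the idea only: the name `unindent` comes from the message, but the signature and the choice to treat only spaces and tabs as indentation are assumptions made for the example.

```rust
/// Sketch of the `unindent` idea: drop one leading line break, then remove
/// the first line's leading white space from every line.
fn unindent(text: &str) -> String {
    // Step 1: drop a single leading line break, if present.
    let text = text
        .strip_prefix("\r\n")
        .or_else(|| text.strip_prefix('\n'))
        .unwrap_or(text);

    // Step 2: the first line's leading spaces and tabs form the prefix.
    let indent_len = text
        .find(|c: char| c != ' ' && c != '\t')
        .unwrap_or(text.len());
    let indent = &text[..indent_len];

    // Strip that prefix from each line; lines without it are left alone.
    text.split('\n')
        .map(|line| line.strip_prefix(indent).unwrap_or(line))
        .collect::<Vec<_>>()
        .join("\n")
}
```

The function returns an owned `String` rather than a `&'static str`, which matches the downside noted in the message: the processing happens at run time, so the result is no longer a static literal.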
Re: [rust-dev] RFC: Syntax for raw string literals
I didn't look at Ruby's syntax, but what you just described sounds a little too free-form to me. I believe Ruby at least requires a % as part of the syntax, e.g. %q{test}. But I don't think %R{test} is a good idea for Rust, as it would conflict with the % operator. I don't think other punctuation would work well either.

-Kevin

On Sep 19, 2013, at 2:10 PM, Martin DeMello martindeme...@gmail.com wrote:

How complicated would it be to use R but with arbitrary paired delimiters (the way, for instance, ruby does it)? It's very handy to pick a delimiter you know does not appear in the string, e.g. if you had a string containing ')' you could use R{this is a string with a ) in it} or R|this is a string with a ) in it|.

martin

On Thu, Sep 19, 2013 at 1:36 PM, Kevin Ballard ke...@sb.org wrote:

One feature common to many programming languages that Rust lacks is raw string literals. Specifically, these are string literals that don't interpret backslash-escapes. There are three obvious applications at the moment: regular expressions, windows file paths, and format!() strings that want to embed { and } chars. I'm sure there are more as well, such as large string literals that contain things like HTML text.

I took a look at 3 programming languages to see what solutions they had: D, C++11, and Python. I've reproduced their syntax below, plus one more custom syntax, along with pros & cons. I'm hoping we can come up with a syntax that makes sense for Rust.

## Python syntax:

Python supports an r or R prefix on any string literal (both short strings, delimited with a single quote, or long strings, delimited with 3 quotes). The r or R prefix denotes a raw string, and has the effect of disabling backslash-escapes within the string. For the most part. It actually gets a bit weird: if a sequence of backslashes of an odd length occurs prior to a quote (of the appropriate quote type for the string), then the quote is considered to be escaped, but the backslashes are left in the string.
This means r"foo\"" evaluates to the string `foo\"`, and similarly r"foo\\\"" is `foo\\\"`, but r"foo\\" is merely the string `foo\\`.

Pros:
* Simple syntax
* Allows for embedding the closing quote character in the raw string

Cons:
* Handling of backslashes is very bizarre, and the closing quote character can only be embedded if you want to have a backslash before it.

## C++11 syntax:

C++11 allows for raw strings using a sequence of the form R"seq(raw text)seq". In this construct, `seq` is any sequence of (zero or more) characters except for: space, (, ), \, \t, \v, \n, \r. The simplest form looks like R"(raw text)", which allows for anything in the raw text except for the sequence `)"`. The addition of the delimiter sequence allows for constructing a raw string containing any sequence at all (as the delimiter sequence can be adjusted based on the represented text).

Pros:
* Allows for embedding any character at all (representable in the source file encoding), including the closing quote.
* Reasonably straightforward

Cons:
* Syntax is slightly complicated

## D syntax:

D supports three different forms of raw strings. The first two are similar, being r"raw text" and `raw text`. Besides the choice of delimiters, they behave identically, in that the raw text may contain anything except for the appropriate quote character.

The third syntax is a slightly more complicated form of C++11's syntax, and is called a delimited string. It takes two forms.

The first looks like q"(raw text)" where the ( may be any non-identifier non-whitespace character. If the character is one of [(<{ then it is a "nesting delimiter", and the close delimiter must be the matching ])>} character, otherwise the close delimiter is the same as the open. Furthermore, nesting delimiters do exactly what their name says: they nest. If the nesting delimiter is (), then any ( in the raw text must be balanced with a ) in the raw text. In other words, q"(foo(bar))" evaluates to "foo(bar)", but q"(foo(bar)" and q"(foobar))" are both illegal.
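The nesting rule just described amounts to a balance check on the raw text. The sketch below, in present-day Rust with the hypothetical name `is_balanced`, models only that rule; it is not how the D lexer is actually written, and it ignores the surrounding q" and " characters.

```rust
/// Returns true if `body` could sit between a nesting open delimiter and
/// its matching close: every `open` is closed later, and no `close`
/// appears before a matching `open`.
fn is_balanced(body: &str, open: char, close: char) -> bool {
    let mut depth: usize = 0;
    for c in body.chars() {
        if c == open {
            depth += 1;
        } else if c == close {
            if depth == 0 {
                // A close delimiter with nothing open would end the literal early.
                return false;
            }
            depth -= 1;
        }
    }
    // Anything still open means the literal would never be terminated.
    depth == 0
}
```

With `(` and `)` as the pair, `foo(bar)` passes while `foo(bar` and `foobar)` fail, matching the legal and illegal examples given above.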
The second uses any identifier as the delimiter. In this case, the identifier must immediately be followed by a newline, and in order to close the string, the close delimiter must be preceded by a newline. This looks like

    q"delim
    this is some raw text
    delim"

It's essentially a heredoc. Note that the first newline is not part of the string, but the final newline is, so this evaluates to "this is some raw text\n".

Pros:
* Flexible
* Allows for constructing a raw string that contains any desired sequence of characters (representable in the source file's encoding)

Cons:
* Overly complicated

## Custom syntax

There's another approach that none of these three languages take, which is to merely allow for doubling up the quote character in order to embed a quote. This would look like R"raw string literal ""with embedded quotes"".",
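Of the syntaxes surveyed in this message, the C++11 form is the one whose closing sequence depends on a delimiter chosen by the writer. A rough sketch of scanning it is shown below, in present-day Rust; `scan_cpp_raw` is a hypothetical name, and the sketch does not validate which characters are allowed in `seq` or enforce any length limit on it.

```rust
/// Given source text starting at `R"`, returns the raw text between
/// `seq(` and `)seq"`, or `None` if the literal is malformed or unterminated.
fn scan_cpp_raw(src: &str) -> Option<&str> {
    let rest = src.strip_prefix("R\"")?;

    // Everything up to the first `(` is the delimiter sequence (may be empty).
    let open = rest.find('(')?;
    let seq = &rest[..open];
    let body_start = open + 1;

    // The literal ends at the first occurrence of `)` + seq + `"`.
    let closer = format!("){}\"", seq);
    let end = rest[body_start..].find(closer.as_str())?;
    Some(&rest[body_start..body_start + end])
}
```

Choosing a `seq` that does not occur in the text is what lets the body contain `)"` itself, as in `R"x(a)"b)x"`.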
Re: [rust-dev] RFC: Syntax for raw string literals
On Sep 19, 2013, at 1:56 PM, Oren Ben-Kiki o...@ben-kiki.org wrote:

> Just to make sure - how does the C++ syntax behave in the presence of line breaks? Specifically, what does it do with leading (and trailing) white space of each line? My guess is that they would be included in the string, is that correct?

It includes every single character that occurs in the source between the delimiters. So

    cout << R"(this is
        a string)";

will print "this is", newline, horizontal tab, "a string".

> At any rate, having some sort of here documents would be very nice. The C++ syntax is reasonable, though I really don't have a strong preference here. It might be more Rust-ish to use a macro notation instead: str!(delimiter.delimiter), or something like that.

Not possible. This syntax needs to be part of the lexer, and macros/syntax extensions operate on token trees, not on raw source characters.

-Kevin

> BTW, I found myself creating (in several languages) an unindent string function that would (1) if the string starts with a line break, remove it; (2) remove the leading white space of the 1st line from all the lines. Applying this to here documents allows indenting them together with the code that includes them. In Rust, the downside of this approach is that the result isn't 'static any more... Not that this warrants making such complex functionality a built-in of the syntax, of course.
>
> Oren.
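For comparison, the same "every character between the delimiters" behaviour can be seen in the `r"..."` literals of present-day Rust, a syntax that did not exist when this thread was written. The sketch below is only an illustration; `two_line_literal` is a made-up name, and spaces stand in for the tab in the C++ example.

```rust
/// A raw literal spanning two source lines. The line break and the
/// indentation of the second line are both part of the resulting string.
fn two_line_literal() -> &'static str {
    r"this is
    a string"
}
```

Any indentation added to line up the literal with the surrounding code ends up in the value, which is the motivation for the unindent helper discussed in this exchange.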
Re: [rust-dev] RFC: Syntax for raw string literals
On Sep 19, 2013, at 2:13 PM, Masklinn maskl...@masklinn.net wrote:

> On 2013-09-19, at 22:36, Kevin Ballard wrote:
>> I welcome any comments, criticisms, or suggestions.
>
> * C# also has rawstrings, which were not looked at. C#'s rawstrings disable escaping entirely but add a new one: doubling quotes will insert a single quote in the resulting string (similar to quote-escaping in SQL or Smalltalk).

I've never touched C#. Your description sounds like the custom syntax I described. I figured there were existing languages that did this, but none came to mind (I should have known SQL did it though).

> * The docstring comment is incorrect, a docstring is a string in the first position of a module, a class statement or a function statement. A single-quoted string at these positions will yield a docstring. The triple-quoting is a string syntax embedding newlines (single-quoted strings can not contain literal newlines in Python, only escaped ones). Obviously, triple-quoted python string can be raw.

Yes I know, but in my (rather limited) experience with Python, triple-quoted strings are typically used for docstrings. It was just an example anyway.

> * The quote-escaping oddness is less of an issue in Python as you can also use single-quotes for delimiting, or use triple-quoted strings (if you need to embed both single and double quotes in rawstrings).

If I need to embed both ''' and """ in a string, I'm out of luck. For example, I cannot represent the following:

    Triple-quoted strings in Python use the delimiters ''' and """.

> * Perl's quotes and quote-like operators would certainly deserve mention.

I'm not a Perl programmer, but IIRC they look like `q{string}`, right? I don't think this is suitable for Rust because how would you lex `do q{foo()}`? Is this the invalid construct `do some-string` or is it calling a function named q with a closure?

> Also,
>> windows file paths
>
> windows paths can also use forward slashes so that's not a very interesting justification.

Not always. UNC paths must start with \\ (in my testing, //foo/bar/baz is not interpreted as a UNC path by the Windows File Explorer, but \\foo/bar/baz is). There are also paths that start with the verbatim prefix \\?\, which disables interpretation of forward-slashes (among other things). As I am actively engaged in writing a replacement for the path module, and am currently expanding the test suite for Windows paths, raw strings would be extremely useful to me.

-Kevin
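To show what is at stake for path test suites, the sketch below writes the same UNC path with ordinary escapes and as a raw literal. It uses the `r"..."` syntax of present-day Rust, which postdates this thread; the server, share and file names are invented for the example.

```rust
/// A UNC path written with ordinary escapes: every backslash is doubled.
fn unc_escaped() -> &'static str {
    "\\\\server\\share\\file.txt"
}

/// The same path as a raw literal: what you see is what you get.
fn unc_raw() -> &'static str {
    r"\\server\share\file.txt"
}

/// A path using the verbatim prefix mentioned above, as a raw literal.
fn verbatim_raw() -> &'static str {
    r"\\?\C:\dir\file.txt"
}
```

The escaped and raw forms denote the same string; the raw one is simply much easier to compare by eye against the path it is meant to represent.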
Re: [rust-dev] RFC: Syntax for raw string literals
As I just responded to Masklinn, this is ambiguous. How do you lex `do R{foo()}`?

-Kevin

On Sep 19, 2013, at 2:41 PM, Martin DeMello martindeme...@gmail.com wrote:

Yes, I figured R followed by a non-alphabetical character could serve the same purpose as ruby's %char.

martin
Re: [rust-dev] RFC: Syntax for raw string literals
Ah, good point. You could fix it by having a very small whitelist of acceptable delimiters, but that probably takes it into overcomplex territory.

martin